Skip to content

Conversation

@marmbrus
Copy link
Contributor

Adds the ability for InMemoryColumnarTableScan to skip partitions that cannot possibly contain any matching rows, based on statistics collected when caching. For example, if we know that in a given partition max(a) = 10 and there is a predicate a = 15, the partition will be skipped.

TODO:

  • More tests
  • Code cleanup

@SparkQA
Copy link

SparkQA commented Aug 11, 2014

QA tests have started for PR 1883. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18296/consoleFull

@SparkQA
Copy link

SparkQA commented Aug 11, 2014

QA results for PR 1883:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class AttributeMap[A](baseMap: Map[ExprId, %28Attribute, A%29])
class ColumnStatisticsSchema(a: Attribute) extends Serializable {
class PartitionStatistics(tableSchema: Seq[Attribute]) extends Serializable {
case class OutputFaker(output: Seq[Attribute], child: SparkPlan) extends SparkPlan {

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18296/consoleFull

@SparkQA
Copy link

SparkQA commented Aug 11, 2014

QA tests have started for PR 1883. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18297/consoleFull

@SparkQA
Copy link

SparkQA commented Aug 11, 2014

QA results for PR 1883:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class AttributeMap[A](baseMap: Map[ExprId, %28Attribute, A%29])
class ColumnStatisticsSchema(a: Attribute) extends Serializable {
class PartitionStatistics(tableSchema: Seq[Attribute]) extends Serializable {
case class OutputFaker(output: Seq[Attribute], child: SparkPlan) extends SparkPlan {

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18297/consoleFull

@SparkQA
Copy link

SparkQA commented Aug 12, 2014

QA tests have started for PR 1883. This patch DID NOT merge cleanly!
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18384/consoleFull

@SparkQA
Copy link

SparkQA commented Aug 12, 2014

QA results for PR 1883:
- This patch PASSES unit tests.

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18384/consoleFull

@marmbrus
Copy link
Contributor Author

Closed for #2188

@marmbrus marmbrus closed this Aug 29, 2014
asfgit pushed a commit that referenced this pull request Sep 4, 2014
…tions

This PR is based on #1883 authored by marmbrus. Key differences:

1. Batch pruning instead of partition pruning

   When #1883 was authored, batched column buffer building (#1880) hadn't been introduced. This PR combines these two and provide partition batch level pruning, which leads to smaller memory footprints and can generally skip more elements. The cost is that the pruning predicates are evaluated more frequently (partition number multiplies batch number per partition).

1. More filters are supported

   Filter predicates consist of `=`, `<`, `<=`, `>`, `>=` and their conjunctions and disjunctions are supported.

Author: Cheng Lian <[email protected]>

Closes #2188 from liancheng/in-mem-batch-pruning and squashes the following commits:

68cf019 [Cheng Lian] Marked sqlContext as @transient
4254f6c [Cheng Lian] Enables in-memory partition pruning in PartitionBatchPruningSuite
3784105 [Cheng Lian] Overrides InMemoryColumnarTableScan.sqlContext
d2a1d66 [Cheng Lian] Disables in-memory partition pruning by default
062c315 [Cheng Lian] HiveCompatibilitySuite code cleanup
16b77bf [Cheng Lian] Fixed pruning predication conjunctions and disjunctions
16195c5 [Cheng Lian] Enabled both disjunction and conjunction
89950d0 [Cheng Lian] Worked around Scala style check
9c167f6 [Cheng Lian] Minor code cleanup
3c4d5c7 [Cheng Lian] Minor code cleanup
ea59ee5 [Cheng Lian] Renamed PartitionSkippingSuite to PartitionBatchPruningSuite
fc517d0 [Cheng Lian] More test cases
1868c18 [Cheng Lian] Code cleanup, bugfix, and adding tests
cb76da4 [Cheng Lian] Added more predicate filters, fixed table scan stats for testing purposes
385474a [Cheng Lian] Merge branch 'inMemStats' into in-mem-batch-pruning
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
…tions

This PR is based on apache#1883 authored by marmbrus. Key differences:

1. Batch pruning instead of partition pruning

   When apache#1883 was authored, batched column buffer building (apache#1880) hadn't been introduced. This PR combines these two and provide partition batch level pruning, which leads to smaller memory footprints and can generally skip more elements. The cost is that the pruning predicates are evaluated more frequently (partition number multiplies batch number per partition).

1. More filters are supported

   Filter predicates consist of `=`, `<`, `<=`, `>`, `>=` and their conjunctions and disjunctions are supported.

Author: Cheng Lian <[email protected]>

Closes apache#2188 from liancheng/in-mem-batch-pruning and squashes the following commits:

68cf019 [Cheng Lian] Marked sqlContext as @transient
4254f6c [Cheng Lian] Enables in-memory partition pruning in PartitionBatchPruningSuite
3784105 [Cheng Lian] Overrides InMemoryColumnarTableScan.sqlContext
d2a1d66 [Cheng Lian] Disables in-memory partition pruning by default
062c315 [Cheng Lian] HiveCompatibilitySuite code cleanup
16b77bf [Cheng Lian] Fixed pruning predication conjunctions and disjunctions
16195c5 [Cheng Lian] Enabled both disjunction and conjunction
89950d0 [Cheng Lian] Worked around Scala style check
9c167f6 [Cheng Lian] Minor code cleanup
3c4d5c7 [Cheng Lian] Minor code cleanup
ea59ee5 [Cheng Lian] Renamed PartitionSkippingSuite to PartitionBatchPruningSuite
fc517d0 [Cheng Lian] More test cases
1868c18 [Cheng Lian] Code cleanup, bugfix, and adding tests
cb76da4 [Cheng Lian] Added more predicate filters, fixed table scan stats for testing purposes
385474a [Cheng Lian] Merge branch 'inMemStats' into in-mem-batch-pruning
@marmbrus marmbrus deleted the inMemStats branch September 22, 2014 19:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants