Conversation

@lianhuiwang
Contributor

What changes were proposed in this pull request?

When a query uses only metadata (for example, partition keys), it can return results based on metadata alone, without scanning any data files. Hive did this in HIVE-1003.
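For illustration, a minimal sketch of the kind of query this targets (hypothetical table and column names, run against a spark-shell session named `spark`):

```scala
// Hypothetical partitioned table: only the partition column `p` is read,
// so the answers below can be derived from the catalog's partition list
// instead of scanning the data files.
spark.sql("CREATE TABLE events (id INT, p INT) USING PARQUET PARTITIONED BY (p)")
spark.sql("INSERT INTO TABLE events PARTITION (p = 1) SELECT id FROM range(0, 3)")
spark.sql("INSERT INTO TABLE events PARTITION (p = 2) SELECT id FROM range(0, 3)")
spark.sql("SELECT DISTINCT p FROM events").show()  // 1, 2
spark.sql("SELECT MAX(p) FROM events").show()      // 2
```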

How was this patch tested?

Added unit tests.

@SparkQA

SparkQA commented Jun 3, 2016

Test build #59925 has finished for PR 13494 at commit 2ca2c38.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 3, 2016

Test build #59929 has finished for PR 13494 at commit edea710.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 3, 2016

Test build #59930 has finished for PR 13494 at commit 8426522.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 3, 2016

Test build #59940 has finished for PR 13494 at commit 153293e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Jun 4, 2016

Can you try to write a design doc on this? Would be great to discuss the reasons why we might want this, the kind of queries that can be answered, corner cases, and how it should be implemented. Thanks.

@lianhuiwang
Contributor Author

lianhuiwang commented Jun 4, 2016

@rxin I have written a design doc: https://docs.google.com/document/d/1Bmi4-PkTaBQ0HVaGjIqa3eA12toKX52QaiUyhb6WQiM/edit?usp=sharing.
I'd be glad to get your comments. Thanks.

```scala
val partitionSchema = files.partitionSchema.toAttributes
lazy val converter = GenerateUnsafeProjection.generate(partitionSchema, partitionSchema)
val partitionValues = selectedPartitions.map(_.values)
files.sqlContext.sparkContext.parallelize(partitionValues, 1).map(converter(_))
```
Contributor

What if this partition has more than one data file?

Contributor Author

In this PR the default of spark.sql.optimizer.metadataOnly is false, so a user who needs this feature should set spark.sql.optimizer.metadataOnly=true.

Contributor

I think the optimizer should never affect the correctness of the query result. If this optimization is too hard to implement with the current code base, we should improve the code base first instead of rushing in a partial implementation.

Contributor Author

Yes, after thinking about it more, I will add a metadata-only rule to the optimizer list. Thanks.

@cloud-fan
Contributor

Hi @lianhuiwang, thanks for working on it!

The overall idea LGTM: we should eliminate the unnecessary file scan when only partition columns are read. However, the current implementation looks incorrect; we also need to consider the number of rows. I also took a look at the Hive patch, and it only optimizes partition columns used as aggregation keys, where the number of duplicated rows doesn't matter.

I think we should either narrow down the scope of this PR and focus on aggregation queries, or spend some more time on a more general design.

cc @yhuai @liancheng
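(For illustration, a sketch of the row-count concern with a hypothetical table, run in spark-shell:)

```scala
// One partition (p = 1) holding three rows. A plain projection of the
// partition column must still return three rows, so it cannot be answered
// from the partition list alone; DISTINCT or MAX over it can be.
spark.sql("CREATE TABLE rows_demo (id INT, p INT) USING PARQUET PARTITIONED BY (p)")
spark.sql("INSERT INTO TABLE rows_demo PARTITION (p = 1) SELECT id FROM range(0, 3)")
spark.sql("SELECT p FROM rows_demo").count()          // 3 -- depends on the rows in the files
spark.sql("SELECT DISTINCT p FROM rows_demo").count() // 1 -- duplicate count no longer matters
spark.sql("SELECT p, COUNT(*) FROM rows_demo GROUP BY p").show() // still needs real row counts
```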

@lianhuiwang
Contributor Author

lianhuiwang commented Jun 23, 2016

@cloud-fan Yes, I think what you said is right. As in Hive/Presto, if a query applies certain aggregate functions (for example, MIN/MAX) or distinct aggregates to partition columns, and the config 'spark.sql.optimizer.metadataOnly' is true, then we can use the metadata-only optimization.
I will add a metadata-only rule to the optimizer list. Thanks.

@SparkQA

SparkQA commented Jun 24, 2016

Test build #61161 has finished for PR 13494 at commit 7d7ece0.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@lianhuiwang
Contributor Author

@cloud-fan I have now added an extendedHiveOptimizerRules sequence that includes the metadata-only optimization for the Hive optimizer.
First, the metadata-only optimization has to live in the Hive module because MetastoreRelation can only be used in Hive right now.
Second, the metadata-only optimization should run between the Analyzer and RewriteDistinctAggregates.
In the future, we can add ParquetConversions/OrcConversions and other optimizations to extendedHiveOptimizerRules.
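For readers unfamiliar with Catalyst rules, a minimal sketch of what a logical-plan rule looks like, registered here through the public `spark.experimental.extraOptimizations` extension point; this is not the extendedHiveOptimizerRules mechanism described above, only an illustration of the `Rule[LogicalPlan]` interface:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// A do-nothing rule; a real metadata-only rule would rewrite eligible
// aggregates over partition columns into a scan of the partition metadata.
object NoOpRule extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan
}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.experimental.extraOptimizations = Seq(NoOpRule)
```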

@rxin
Contributor

rxin commented Jun 24, 2016

Why is this rule Hive-specific?

@lianhuiwang
Contributor Author

lianhuiwang commented Jun 24, 2016

@rxin Good point. Right now MetastoreRelation is only defined in Hive, so if we want it to use the metadata-only optimization, this PR keeps the optimization in the Hive component.
Otherwise, we would need to split the metadata-only optimization into two parts: one for common SQL and one for HiveQL.
I will think more about it and try my best to resolve it. Thanks.

```scala
val OPTIMIZER_METADATA_ONLY = SQLConfigBuilder("spark.sql.optimizer.metadataOnly")
  .doc("When true, enable the metadata-only query optimization.")
  .booleanConf
  .createWithDefault(false)
```
Contributor

Can we turn it on by default?
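(Aside: with the default of false defined above, a user opts in per session; a sketch, assuming a spark-shell session named `spark`:)

```scala
// Either of these enables the metadata-only optimization for the session.
spark.conf.set("spark.sql.optimizer.metadataOnly", "true")
spark.sql("SET spark.sql.optimizer.metadataOnly=true")
```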

@SparkQA

SparkQA commented Jun 24, 2016

Test build #61162 has finished for PR 13494 at commit 2e55a9d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```scala
if files.partitionSchema.nonEmpty =>
  (Some(relation), Seq.empty[Expression])

case relation: MetastoreRelation if relation.partitionKeys.nonEmpty =>
```
Contributor

MetastoreRelation extends CatalogRelation, so I think we can put this rule in sql core instead of the hive module.

@SparkQA

SparkQA commented Jun 24, 2016

Test build #61163 has finished for PR 13494 at commit b2b6eba.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 24, 2016

Test build #61164 has finished for PR 13494 at commit c5a291e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public class JavaPackage
    • case class StreamingRelationExec(sourceName: String, output: Seq[Attribute]) extends LeafExecNode

@lianhuiwang
Contributor Author

@hvanhovell I have addressed some of your comments. Thanks. Could you take another look?

```scala
/**
 * Returns the partition attributes of the table relation plan.
 */
def getPartitionAttrs(partitionColumnNames: Seq[String], relation: LogicalPlan)
```
Contributor

Nit: style.

```scala
def getPartitionAttrs(
    partitionColumnNames: Seq[String],
    relation: LogicalPlan): Seq[Attribute] = { ...
```

While you are at it, change the return type to AttributeSet.

Contributor Author

Got it. Thanks.
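(A hypothetical shape of the helper after both suggestions, multi-line parameter style and an AttributeSet return; a sketch only, not necessarily the merged code:)

```scala
import org.apache.spark.sql.catalyst.expressions.AttributeSet
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

object PartitionAttrsSketch {
  // Resolve partition column names against the relation's output attributes.
  def getPartitionAttrs(
      partitionColumnNames: Seq[String],
      relation: LogicalPlan): AttributeSet = {
    val partCols = partitionColumnNames.map(_.toLowerCase).toSet
    AttributeSet(relation.output.filter(a => partCols.contains(a.name.toLowerCase)))
  }
}
```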

@SparkQA

SparkQA commented Jul 11, 2016

Test build #62110 has finished for PR 13494 at commit d888c85.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```scala
case plan if plan eq relation =>
  relation match {
    case l @ LogicalRelation(fsRelation: HadoopFsRelation, _, _) =>
      val partAttrs = PartitionedRelation.getPartitionAttrs(
```
Contributor

Does getPartitionAttrs need to be a method in PartitionedRelation? I think it can just be a private method in the parent class.

Contributor Author

lianhuiwang commented Jul 12, 2016

Thanks. Because object PartitionedRelation also uses getPartitionAttrs, for now I just define it in PartitionedRelation. If it were defined as a private method in class OptimizeMetadataOnlyQuery, there would be two identical getPartitionAttrs() functions, one in PartitionedRelation and one in OptimizeMetadataOnlyQuery.
How about defining two identical getPartitionAttrs() functions? Or is there another way?

Contributor Author

@cloud-fan I will define two getPartitionAttrs() functions for now. In the future, I think we can move getPartitionAttrs() into the relation plan itself. If there is any problem with this, please tell me. Thanks.

@lianhuiwang
Contributor Author

@cloud-fan @hvanhovell About getPartitionAttrs(): one possible improvement is to define it on the relation node, but the relation node does not have this function yet. How about adding it in a follow-up PR? Thanks.

@SparkQA

SparkQA commented Jul 12, 2016

Test build #62137 has finished for PR 13494 at commit ff16509.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```scala
/**
 * Returns the partition attributes of the table relation plan.
 */
private def getPartitionAttrs(
```
Contributor

IIRC, an inner class can access private members of the outer class, so we don't need to duplicate the method in the inner class.

Contributor Author

Yes, thanks.
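(A minimal Scala sketch of that language rule, with names that are illustrative and unrelated to the Spark classes: a nested object can call the enclosing object's private method, so the helper does not need to be duplicated.)

```scala
object Outer {
  private def helper(x: Int): Int = x + 1

  object Inner {
    // The nested object can access the enclosing object's private member.
    def useHelper(x: Int): Int = helper(x)
  }
}

// Outer.Inner.useHelper(41) returns 42
```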

@lianhuiwang
Contributor Author

@cloud-fan I have addressed your latest comments. Thanks.

@SparkQA

SparkQA commented Jul 12, 2016

Test build #62156 has finished for PR 13494 at commit 030776a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell
Contributor

LGTM - Merging to master. Thanks!

asfgit closed this in 5ad68ba Jul 12, 2016
@lianhuiwang
Contributor Author

Thank you for the review and for merging, @rxin @hvanhovell @cloud-fan.

asfgit pushed a commit that referenced this pull request Jan 25, 2019
[SPARK-26709][SQL] OptimizeMetadataOnlyQuery does not handle empty records correctly

## What changes were proposed in this pull request?

When reading from empty tables, the optimization `OptimizeMetadataOnlyQuery` may return wrong results:
```
sql("CREATE TABLE t (col1 INT, p1 INT) USING PARQUET PARTITIONED BY (p1)")
sql("INSERT INTO TABLE t PARTITION (p1 = 5) SELECT ID FROM range(1, 1)")
sql("SELECT MAX(p1) FROM t")
```
The result is supposed to be `null`. However, with the optimization the result is `5`.

The rule was originally ported from https://issues.apache.org/jira/browse/HIVE-1003 in #13494. In Hive, the rule was disabled by default in a later release (https://issues.apache.org/jira/browse/HIVE-15397) due to the same problem.

It is hard to completely avoid the correctness issue, because data sources like Parquet can be metadata-only and Spark can't tell whether a table is empty or not without actually reading it. This PR disables the optimization by default.

## How was this patch tested?

Unit test

Closes #23635 from gengliangwang/optimizeMetadata.

Lead-authored-by: Gengliang Wang <[email protected]>
Co-authored-by: Xiao Li <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
(cherry picked from commit f5b9370)
Signed-off-by: gatorsmile <[email protected]>
asfgit pushed a commit that referenced this pull request Jan 25, 2019
gengliangwang added a commit to gengliangwang/spark that referenced this pull request Jan 25, 2019
asfgit pushed a commit that referenced this pull request Jan 26, 2019
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Jul 23, 2019
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Aug 1, 2019
arjunshroff pushed a commit to arjunshroff/spark that referenced this pull request Nov 24, 2020