[SPARK-37196][SQL] HiveDecimal enforcePrecisionScale failed return null #34519
Conversation
ping @dongjoon-hyun @cloud-fan

Kubernetes integration test starting

Kubernetes integration test status failure
```
    |CREATE EXTERNAL TABLE test_precision(name STRING, value DECIMAL(18,6))
    |STORED AS PARQUET LOCATION '${dir.getAbsolutePath}'
    |""".stripMargin)
checkAnswer(sql("SELECT * FROM test_precision"), Row("dummy", null))
```
what's the behavior of builtin file source tables? do we also return null?
and what's the behavior if we do it purely in Hive?
> what's the behavior of builtin file source tables? do we also return null?

Hmm, the non-vectorized Parquet reader returns null as well; the vectorized reader throws the exception below:
```
[info] Cause: org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot be converted in file file:///Users/yi.zhu/Documents/project/Angerszhuuuu/spark/sql/hive/target/tmp/hive_execution_test_group/spark-628e3c21-15ee-4473-b207-60a530ced804/part-00000-3f384838-d11f-4c3b-83a1-396efad6df79-c000.snappy.parquet. Column: [value], Expected: decimal(18,6), Found: FIXED_LEN_BYTE_ARRAY
[info] at org.apache.spark.sql.errors.QueryExecutionErrors$.unsupportedSchemaColumnConvertError(QueryExecutionErrors.scala:635)
[info] at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:195)
[info] at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:104)
[info] at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:531)
[info] at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(generated.java:29)
[info] at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(generated.java:42)
[info] at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
[info] at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
[info] at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
[info] at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
[info] at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1895)
[info] at org.apache.spark.rdd.RDD.$anonfun$count$1(RDD.scala:1274)
[info] at org.apache.spark.rdd.RDD.$anonfun$count$1$adapted(RDD.scala:1274)
[info] at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2267)
[info] at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
[info] at org.apache.spark.scheduler.Task.run(Task.scala:136)
[info] at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:507)
[info] at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1468)
[info] at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:510)
[info] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[info] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[info] at java.lang.Thread.run(Thread.java:748)
[info] Cause: org.apache.spark.sql.execution.datasources.SchemaColumnConvertNotSupportedException:
[info] at org.apache.spark.sql.execution.datasources.parquet.ParquetVectorUpdaterFactory.constructConvertNotSupportedException(ParquetVectorUpdaterFactory.java:1079)
[info] at org.apache.spark.sql.execution.datasources.parquet.ParquetVectorUpdaterFactory.getUpdater(ParquetVectorUpdaterFactory.java:174)
[info] at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:154)
[info] at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:296)
[info] at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:194)
[info] at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
[info] at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:104)
[info] at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:191)
[info] at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:104)
[info] at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:531)
[info] at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(generated.java:29)
[info] at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(generated.java:42)
[info] at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
[info] at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
[info] at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
[info] at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
[info] at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1895)
[info] at org.apache.spark.rdd.RDD.$anonfun$count$1(RDD.scala:1274)
[info] at org.apache.spark.rdd.RDD.$anonfun$count$1$adapted(RDD.scala:1274)
[info] at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2267)
[info] at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
[info] at org.apache.spark.scheduler.Task.run(Task.scala:136)
[info] at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:507)
[info] at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1468)
[info] at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:510)
[info] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[info] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[info] at java.lang.Thread.run(Thread.java:748)
```
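For reference, a minimal sketch of how one would exercise the builtin Parquet source here (an assumed reproduction reusing the `dir` from the test above, not code from this PR; `spark.sql.parquet.enableVectorizedReader` is the standard config key):

```
// Read the decimal(38,16) Parquet file back with a narrower read schema
// through the builtin file source instead of the Hive SerDe.
val narrow = spark.read.schema("name STRING, value DECIMAL(18,6)")
  .parquet(dir.getAbsolutePath)

// With spark.sql.parquet.enableVectorizedReader=false the non-vectorized
// reader returns null for the overflowing value; with it set to true the
// vectorized reader fails with SchemaColumnConvertNotSupportedException,
// as in the stack trace above.
narrow.collect()
```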
Test build #144998 has finished for PR 34519 at commit
```
withTempDir { dir =>
  withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> "false") {
    withTable("test_precision") {
      val df = sql("SELECT 'dummy' AS name, 1000000000000000000010.7000000000000010 AS value")
```
can we use CAST(1.2 AS DECIMAL(38, 16)) instead of writing a long decimal literal?
> CAST(1.2 AS DECIMAL(38, 16))

We can't: the test needs the enforced precision/scale conversion to fail. `CAST(1.2 AS DECIMAL(38, 16))` still fits within `DECIMAL(18,6)` after conversion, so Hive's `enforcePrecisionScale` would not return null; the literal's integral digits must overflow the target precision.
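A quick way to see the difference (a hypothetical illustration using Hive's own API, not code from this PR; `HiveDecimal.create` and the static `enforcePrecisionScale` live in `org.apache.hadoop.hive.common.type`):

```
import org.apache.hadoop.hive.common.`type`.HiveDecimal

val fits = HiveDecimal.create("1.2")
val overflows = HiveDecimal.create("1000000000000000000010.7000000000000010")

// 1.2 fits within DECIMAL(18,6), so the value survives enforcement.
println(HiveDecimal.enforcePrecisionScale(fits, 18, 6))      // 1.2
// 22 integral digits cannot fit in precision 18 with scale 6, so Hive
// returns null rather than a truncated value.
println(HiveDecimal.enforcePrecisionScale(overflows, 18, 6)) // null
```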
Kubernetes integration test starting

Kubernetes integration test status failure

Test build #145005 has finished for PR 34519 at commit
dongjoon-hyun left a comment:
+1, LGTM.
### What changes were proposed in this pull request?
For the following case:
```
withTempDir { dir =>
  withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> "false") {
    withTable("test_precision") {
      val df = sql("SELECT 'dummy' AS name, 1000000000000000000010.7000000000000010 AS value")
      df.write.mode("Overwrite").parquet(dir.getAbsolutePath)
      sql(
        s"""
           |CREATE EXTERNAL TABLE test_precision(name STRING, value DECIMAL(18,6))
           |STORED AS PARQUET LOCATION '${dir.getAbsolutePath}'
           |""".stripMargin)
      checkAnswer(sql("SELECT * FROM test_precision"), Row("dummy", null))
    }
  }
}
```
The failure happens because we write data from a DataFrame with this schema:
```
root
|-- name: string (nullable = false)
|-- value: decimal(38,16) (nullable = false)
```
but create the table with this schema:
```
root
|-- name: string (nullable = false)
|-- value: decimal(18,6) (nullable = false)
```
This mismatch causes Hive's `enforcePrecisionScale` to return `null`:
```
public HiveDecimal getPrimitiveJavaObject(Object o) {
return o == null ? null : this.enforcePrecisionScale(((HiveDecimalWritable)o).getHiveDecimal());
}
```
An NPE is then thrown when `toCatalystDecimal` is called. We should check whether the returned value is `null` to avoid the NPE.
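A minimal sketch of that null check (the assumed shape of the change in `HiveInspectors`, not necessarily the exact patch):

```
private def toCatalystDecimal(hdoi: HiveDecimalObjectInspector, data: Any): Decimal = {
  // getPrimitiveJavaObject / getPrimitiveWritableObject may return null
  // when enforcePrecisionScale fails, so guard before converting.
  val hiveDecimal =
    if (hdoi.preferWritable()) {
      val writable = hdoi.getPrimitiveWritableObject(data)
      if (writable == null) null else writable.getHiveDecimal()
    } else {
      hdoi.getPrimitiveJavaObject(data)
    }
  if (hiveDecimal == null) {
    null // surface SQL NULL instead of throwing an NPE
  } else {
    Decimal(hiveDecimal.bigDecimalValue(), hdoi.precision(), hdoi.scale())
  }
}
```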
### Why are the changes needed?
Fix a bug: reading such a table threw an NPE instead of returning `null`.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added UT
Closes #34519 from AngersZhuuuu/SPARK-37196.
Authored-by: Angerszhuuuu <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit a4f8ffb)
Signed-off-by: Dongjoon Hyun <[email protected]>
Merged to master/3.2/3.1/3.0. Thank you, @AngersZhuuuu and @cloud-fan.