[SPARK-25028][SQL] Avoid NPE when analyzing partition with NULL values #22036
Changes from 1 commit
StatisticsCollectionSuite.scala

```diff
@@ -204,6 +204,24 @@ class StatisticsCollectionSuite extends StatisticsCollectionTestBase with Shared
     }
   }
 
+  test("SPARK-25028: column stats collection for null partitioning columns") {
+    val table = "analyze_partition_with_null"
+    withTempDir { dir =>
+      withTable(table) {
+        sql(s"""
+             |CREATE TABLE $table (name string, value string)
+             |USING PARQUET
+             |PARTITIONED BY (name)
+             |LOCATION '${dir.toURI}'""".stripMargin)
+        val df = Seq(("a", null), ("b", null)).toDF("value", "name")
```

The inline review thread below is anchored on the `val df = ...` line above.
Member:
super nit: better to add a non-null partition value, e.g.,
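The example that followed "e.g.," is not preserved in this capture. A hypothetical illustration of the suggestion (the extra row and its values are assumptions, not the reviewer's actual snippet; it assumes the test's `spark.implicits._` are in scope):

```scala
// Hypothetical only: also cover a non-null partition value ("c") alongside
// the null ones, so the test exercises both cases.
val df = Seq(("a", null), ("b", null), ("c", "c")).toDF("value", "name")
```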
Contributor (Author):
I don't think it is needed to add another partition value, as the problem here is with the null values. The reverse column order is the way Spark works when inserting data into a partitioned table: the partitioning columns are specified at the end, after the non-partitioning ones.
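For readers unfamiliar with that behaviour, a minimal sketch of the ordering being described (background only, not part of the patch; it assumes an active `spark` session and the table created in the test above):

```scala
import spark.implicits._  // assumes an active SparkSession named `spark`

// The table is declared as (name string, value string) and PARTITIONED BY (name).
// insertInto resolves DataFrame columns by position against the table's write
// schema, which lists non-partitioning columns first and partitioning columns
// last, i.e. (value, name) here -- hence toDF("value", "name") in the test.
Seq(("a", null: String), ("b", null: String))
  .toDF("value", "name")
  .write
  .mode("overwrite")
  .insertInto("analyze_partition_with_null")
```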
Contributor:
When creating the table, we can put the partition column at the end, to avoid this confusion.
Contributor (Author):
Ok, will do, thanks.
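A sketch of what the agreed change might look like (an assumption about the follow-up, reusing `table` and `dir` from the test above): declaring the partition column last keeps the DDL order consistent with the order `insertInto` expects.

```scala
// Sketch: partition column declared last, matching the DataFrame/write order.
sql(s"""
     |CREATE TABLE $table (value string, name string)
     |USING PARQUET
     |PARTITIONED BY (name)
     |LOCATION '${dir.toURI}'""".stripMargin)
```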
```diff
+        df.write.mode("overwrite").insertInto(table)
+        sql(s"ANALYZE TABLE $table PARTITION (name) COMPUTE STATISTICS")
+        val partitions = spark.sessionState.catalog.listPartitions(TableIdentifier(table))
+        assert(partitions.head.stats.get.rowCount.get == 2)
+      }
+    }
+  }
+
   test("number format in statistics") {
     val numbers = Seq(
       BigInt(0) -> (("0.0 B", "0")),
```
Reviewer:
Do we need to change the read path, i.e. where we use these statistics?
Author:
I don't think so, as the same situation would happen if Hive's statistics were used instead of the ones computed by Spark.
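For context on the "read path" mentioned here: the collected statistics surface when the optimizer costs query plans. A rough way to inspect what the read path sees (a sketch, assuming an active `spark` session and the table from the test above; `rowCount` is only propagated when the cost-based optimizer is enabled):

```scala
// Statistics the optimizer uses when reading the analyzed table.
val plan = spark.table("analyze_partition_with_null").queryExecution.optimizedPlan
println(plan.stats)  // sizeInBytes, plus rowCount when spark.sql.cbo.enabled=true
```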