[SPARK-23007][SQL][TEST] Add read schema suite for file-based data sources #20208
Conversation
Hi, @gatorsmile, @cloud-fan, @HyukjinKwon, @viirya.
Test build #85865 has finished for PR 20208 at commit
Also, ping @sameeragarwal, too.
Also, ping @rxin, too.
Retest this please.
Test build #85983 has finished for PR 20208 at commit
Retest this please.
We are working on the Spark 2.3 release. Could you ping us after the release?
Thank you for the review, @gatorsmile.
I do not have enough review bandwidth for test-only PRs before the Spark 2.3 release.
Test build #86099 has finished for PR 20208 at commit
Retest this please.
Test build #86144 has finished for PR 20208 at commit
Test build #86259 has finished for PR 20208 at commit
Retest this please.
Test build #86281 has finished for PR 20208 at commit
nit: `byte` -> byte, or the opposite, for consistency with the other instances.
It seems many tests here have duplicated code. Can we maybe do something like the below?

Seq(byteDF, ...).zip(Seq("byte", ...)).foreach { case (df, t) =>
  test(s"boolean to $t") {
    val df4 = spark.read
      .schema("col1 long")
      .format(format)
      .options(options)
      .load(path)
    checkAnswer(df4, longDF)
  }
}

I am fine with any idea to deal with this duplication.
Ur, for this, when we put the variables (byteDF, ...) outside of test functions, it seems to cause SQLContext errors. Never mind. I handled that as lazy variables.
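For reference, a minimal sketch of the lazy-variable workaround described above, assuming a SharedSQLContext-based suite (the suite name and data values are illustrative):

```scala
import org.apache.spark.sql.QueryTest
import org.apache.spark.sql.test.SharedSQLContext

// Illustrative sketch: declaring the DataFrames as lazy vals defers their creation
// until a test actually runs, i.e. after the shared SparkSession has been started,
// so nothing touches the SQLContext at suite-construction time.
class LazyFrameExampleSuite extends QueryTest with SharedSQLContext {
  import testImplicits._

  private lazy val byteDF = Seq(1.toByte, 2.toByte).toDF("col1")
  private lazy val longDF = Seq(1L, 2L).toDF("col1")

  test("lazy DataFrames can be shared across tests") {
    checkAnswer(byteDF.selectExpr("CAST(col1 AS LONG)"), longDF)
  }
}
```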
I think we can do withTempPath.
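For context, `withTempPath` (from SQLTestUtils) hands the body a temporary location and removes it afterwards; a rough usage sketch with made-up data could look like this:

```scala
// Inside a suite that mixes in SQLTestUtils / SharedSQLContext:
withTempPath { dir =>
  val path = dir.getCanonicalPath
  // Write some data to the temporary directory ...
  Seq(1, 2, 3).toDF("col1").write.format("orc").save(path)
  // ... read it back, and let withTempPath delete the directory on exit.
  val df = spark.read.format("orc").load(path)
  assert(df.count() === 3)
}
```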
Shall we leave the number given above in this comment, like (case 1.)?
cc @sameeragarwal for reviewing too. I vaguely remember we had a talk about this before.
Thank you for the review, @HyukjinKwon. I'll update it like that.
Test build #86413 has finished for PR 20208 at commit
Test build #86414 has finished for PR 20208 at commit
Hi, @gatorsmile, @cloud-fan, @sameeragarwal, @HyukjinKwon.
@dongjoon-hyun, how do we guarantee a schema change in Parquet and ORC?
I thought we (roughly) randomly pick up a file, read its footer, and then use it. So, I was thinking we don't properly support this. It makes sense for Parquet with mergeSchema, though.
I think it's not even guaranteed in CSV either, because we rely on the header from one file.
Correct, and this is not about schema merging.
The final correct schema is given by users (or Hive).
In this PR, all schemas are given by users, but for Hive tables, we use the Hive Metastore schema.
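To make that concrete, here is a minimal sketch of the idea, e.g. in spark-shell: the data is written with one schema and read back with a narrower, user-given schema (the path and data are illustrative):

```scala
val path = "/tmp/read_schema_example"  // illustrative location

// The data is written with two columns ...
Seq((1, "a"), (2, "b")).toDF("col1", "col2").write.mode("overwrite").parquet(path)

// ... and read back with an explicit user-given schema that keeps only the first
// column; the read schema comes from the user, not from the file footers.
val df = spark.read.schema("col1 int").parquet(path)
df.show()  // only col1 is returned
```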
Ohaaa, the schema is explicitly set here. Sorry, I missed it.
Will do it after the 2.3 release.
Retest this please.
Test build #86599 has finished for PR 20208 at commit
Retest this please.
Test build #86603 has finished for PR 20208 at commit
Retest this please.
Hi, @rxin, @cloud-fan, @sameeragarwal, @HyukjinKwon. Could you give me some opinions on this PR? I know that Xiao Li is busy during this period, so I didn't ping him this time. For me, this PR is important. Sorry for annoying you guys.
I'll update it like the following.
For
Test build #90919 has finished for PR 20208 at commit
Test build #90920 has finished for PR 20208 at commit
Sorry for the delay. I updated the PR according to the comments, @gatorsmile.
Test build #91654 has finished for PR 20208 at commit
Rebased to master.
Test build #92731 has finished for PR 20208 at commit
Retest this please.
Test build #92744 has finished for PR 20208 at commit
docs/sql-programming-guide.md (outdated)
when `path/to/table/gender=male` is the path of the data and
users set `basePath` to `path/to/table/`, `gender` will be a partitioning column.

### Schema Evolution
I still want to avoid using the term schema evolution in the doc or tests. Schema Projection might be better. More importantly, you have to clarify that this only covers the read path.
What is the behavior in the write path when the physical and data schemas are different?
Thank you for the review, @gatorsmile. I'll update it like that.
For the write operation, we cannot specify a schema like on the read path with .schema. Spark either writes new files into the directory additionally or overwrites the directory.
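A small sketch of the contrast described above, assuming a spark-shell session and illustrative paths:

```scala
val srcPath = "/tmp/read_schema_src"  // illustrative locations
val dstPath = "/tmp/read_schema_dst"

// Read path: the user can impose a schema explicitly via .schema(...).
val df = spark.read.schema("col1 long").format("orc").load(srcPath)

// Write path: there is no .schema(...) hook; the DataFrame's own schema is used,
// and the only choice is how the target directory is treated.
df.write.mode("append").format("orc").save(dstPath)     // add new files to the directory
df.write.mode("overwrite").format("orc").save(dstPath)  // replace the directory contents
```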
/**
 * All file-based data sources support column addition and removal at the end.
 */
abstract class ReadSchemaSuite
@gatorsmile. Now, it becomes ReadSchemaSuite.
import org.apache.spark.sql.test.{SharedSQLContext, SQLTestUtils}

/**
 * The reader schema is said to be evolved (or projected) when it changed after the data is
Here, I clearly mentioned read schema and used evolved and projected as general verbs.
@gatorsmile, @HyukjinKwon.
 * | CSV     | 1, 2, 4    |                                                        |
 * | JSON    | 1, 2, 3, 4 |                                                        |
 * | ORC     | 1, 2, 3, 4 | Native vectorized ORC reader has the widest coverage. |
 * | PARQUET | 1, 2, 3    |                                                        |
Thanks for helping improve the test coverage! All the included test cases are positive. How about negative test cases? What kind of errors did you hit?
Yes, right. Since the main purpose of this PR is preventing regressions, it consists of positive cases only. The errors are case-by-case for each data source.
For the BooleanTypeTest example, Parquet raises higher-level exceptions caused by a ClassCastException (at the bottom). JSON raises `Results do not match` test case failures without exceptions.
- Parquet
org.apache.spark.sql.execution.QueryExecutionException: Encounter error while reading parquet files. One possible cause: Parquet column cannot be converted in the corresponding files. Details:
...
Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 1 in block 0 in file file:/private/var/folders/dc/1pz9m69x14q_gw8t7m143t1c0000gn/T/spark-4b3d788b-1d7e-4ca2-9c01-88f639daf02f/part-00000-975391e5-1f1d-49f5-8e12-3213281618ed-c000.snappy.parquet
...
Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableByte cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableBoolean
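For illustration, a negative case like the one above could be reproduced in a test roughly as follows (a hedged sketch; the helpers follow the test utilities discussed earlier, and the exact wrapping exception may differ between readers):

```scala
// Sketch: write a byte column, then read it back with a boolean read schema.
withTempPath { dir =>
  val path = dir.getCanonicalPath
  Seq(1.toByte, 0.toByte).toDF("col1").write.parquet(path)

  // For Parquet this is expected to fail at read time, surfacing the
  // ClassCastException shown in the stack trace above.
  val e = intercept[Exception] {
    spark.read.schema("col1 boolean").parquet(path).collect()
  }
  assert(e.getMessage != null)
}
```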
Test build #92828 has finished for PR 20208 at commit
Test build #92827 has finished for PR 20208 at commit
The test suite is designed like the following according to their features.

class CSVReadSchemaSuite
  extends ReadSchemaSuite
  with IntegralTypeTest
  with ToDoubleTypeTest
  with ToDecimalTypeTest
  with ToStringTypeTest {

  override val format: String = "csv"
}

To add a negative test case, we need to do something like
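For illustration only, one of the positive mix-in traits in the composition above could be structured roughly like this; the trait body is an assumption, not the PR's actual code, with `format` and the test helpers assumed to come from the base suite:

```scala
// Illustrative sketch only: the trait name matches the composition above, but the
// body is an assumption about how such a mix-in could be written.
trait ToStringTypeTest { self: ReadSchemaSuite =>
  import testImplicits._

  test("change column type from int to string") {
    withTempPath { dir =>
      val path = dir.getCanonicalPath
      // `format` is assumed to be declared in the base suite (e.g. "csv").
      Seq(1, 2, 3).toDF("col1").write.format(format).save(path)

      val df = spark.read.schema("col1 string").format(format).load(path)
      checkAnswer(df, Seq("1", "2", "3").toDF("col1"))
    }
  }
}
```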
Retest this please.
@gatorsmile. Please let me know if I need to do more.
Test build #92939 has finished for PR 20208 at commit
@dongjoon-hyun This PR is to improve the test coverage. LGTM. When the schema does not match the schema of the underlying data source, the current error messages might be weird. This is a common issue, I think. Could you submit a separate PR to improve the error handling in these cases?
Thanks! Merged to master.
Thank you so much, @gatorsmile. Sure. I'll make a PR to improve error handling for that.
Also, thank you, @HyukjinKwon.
What changes were proposed in this pull request?
The reader schema is said to be evolved (or projected) when it is changed after the data is written. The following are already supported in file-based data sources. Note that partition columns are not maintained in files; in this PR, `column` means a non-partition column.

This issue aims to guarantee users backward-compatible read-schema test coverage on file-based data sources and to prevent future regressions by adding read schema tests explicitly. Here, we consider safe changes without data loss. For example, a data type change should be from smaller types to larger types like `int`-to-`long`, not vice versa.

As of today, in the master branch, file-based data sources have the following coverage.
How was this patch tested?
Pass Jenkins with the newly added test suites.