[SPARK-25126][SQL] Avoid creating Reader for all orc files #22157
Changes from all commits
```diff
@@ -562,20 +562,57 @@ abstract class OrcQueryTest extends OrcTest {
       }
     }
 
+    def testAllCorruptFiles(): Unit = {
+      withTempDir { dir =>
+        val basePath = dir.getCanonicalPath
+        spark.range(1).toDF("a").write.json(new Path(basePath, "first").toString)
+        spark.range(1, 2).toDF("a").write.json(new Path(basePath, "second").toString)
+        val df = spark.read.orc(
+          new Path(basePath, "first").toString,
+          new Path(basePath, "second").toString)
+        assert(df.count() == 0)
+      }
+    }
+
+    def testAllCorruptFilesWithoutSchemaInfer(): Unit = {
+      withTempDir { dir =>
+        val basePath = dir.getCanonicalPath
+        spark.range(1).toDF("a").write.json(new Path(basePath, "first").toString)
+        spark.range(1, 2).toDF("a").write.json(new Path(basePath, "second").toString)
+        val df = spark.read.schema("a long").orc(
+          new Path(basePath, "first").toString,
+          new Path(basePath, "second").toString)
+        assert(df.count() == 0)
+      }
+    }
+
     withSQLConf(SQLConf.IGNORE_CORRUPT_FILES.key -> "true") {
       testIgnoreCorruptFiles()
       testIgnoreCorruptFilesWithoutSchemaInfer()
+      val m1 = intercept[AnalysisException] {
+        testAllCorruptFiles()
+      }.getMessage
+      assert(m1.contains("Unable to infer schema for ORC"))
+      testAllCorruptFilesWithoutSchemaInfer()
     }
 
     withSQLConf(SQLConf.IGNORE_CORRUPT_FILES.key -> "false") {
       val m1 = intercept[SparkException] {
         testIgnoreCorruptFiles()
       }.getMessage
-      assert(m1.contains("Could not read footer for file"))
+      assert(m1.contains("Malformed ORC file"))
```
Contributor: Why did the error message change?
Member: I think this is because of the behavior change noted in #22157 (comment). Previously the ORC source read the third file, which is corrupt, and threw the `Could not read footer for file` exception. Now the ORC source reads the first file for a valid schema and skips the other two files. When the ORC source then uses that schema to read the second ORC file, the schema is not consistent, so the `Malformed ORC file` exception is thrown instead.
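For readers following along, the gist of the change being discussed is that schema inference now stops at the first ORC file whose footer can be read, instead of opening a Reader for every file. A toy sketch of that idea (not the actual patch; `readFooterSchema` and the file names are made up for illustration):

```scala
import scala.util.Try

// Hypothetical stand-in for reading one file's footer; "corrupt" files throw.
def readFooterSchema(file: String): String =
  if (file.endsWith(".json")) throw new RuntimeException(s"Malformed ORC file: $file")
  else "struct<a:bigint>"

val files = Seq("part-0.orc", "part-1.orc", "corrupt.json")

// Old approach (sketch): eagerly inspect every file, so a single corrupt file
// fails schema inference up front.
// val schemas = files.map(readFooterSchema)

// New approach (sketch): iterate lazily and stop at the first readable footer.
val inferred: Option[String] =
  files.iterator
    .map(f => Try(readFooterSchema(f)).toOption)
    .collectFirst { case Some(schema) => schema }

println(inferred) // Some(struct<a:bigint>)
```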
Member: Let's make sure we don't backport it ... then I think it's fine. It sounds rather like a bug to read and validate all schemas (which is inconsistent with Parquet) when we only need to pick up a single file. I don't think we make any guarantee about the picking order. The possible behaviour change is when we only read the schema: the previous code would throw an exception, but after this PR it wouldn't. The previous behaviour is something we should expect once a mergeSchema option is implemented on the ORC side, as you discussed below.
Member: I agree with this take.
Member: Ok. It's reasonable.
```diff
       val m2 = intercept[SparkException] {
         testIgnoreCorruptFilesWithoutSchemaInfer()
       }.getMessage
       assert(m2.contains("Malformed ORC file"))
+      val m3 = intercept[SparkException] {
+        testAllCorruptFiles()
+      }.getMessage
+      assert(m3.contains("Could not read footer for file"))
+      val m4 = intercept[SparkException] {
+        testAllCorruptFilesWithoutSchemaInfer()
+      }.getMessage
+      assert(m4.contains("Malformed ORC file"))
     }
   }
 }
```
This might be a behavior change. Previously, if there are corrupt files and `SQLConf.IGNORE_CORRUPT_FILES` is false, the ORC source throws an exception when reading those files. Now, if the ORC source reads a valid schema from the first file, it doesn't read the other ORC files any further, so the corrupt files are ignored even when `SQLConf.IGNORE_CORRUPT_FILES` is false.
BTW, I think we have to create a reader for each file when implementing schema merging like Parquet, right?
Yeah, I think so. But in Parquet, schema merging is done in parallel, so it won't create all the readers in one place.
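A rough sketch of what "done in parallel" means here, not Parquet's actual implementation: the per-file footer reads run inside a Spark job, so no single place ever holds a reader for every file, and only the resulting schemas come back to the driver. `readFooter` and the final merge step are simplified stand-ins:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType

// Simplified sketch of distributed schema merging. `readFooter` is a hypothetical
// per-file schema read standing in for the real footer parsing.
def mergeSchemasInParallelSketch(
    spark: SparkSession,
    files: Seq[String],
    readFooter: String => StructType): Option[StructType] = {
  val numTasks = math.max(1, math.min(files.size, spark.sparkContext.defaultParallelism))
  val perFileSchemas = spark.sparkContext
    .parallelize(files, numTasks)   // distribute the footer reads across executors
    .map(readFooter)
    .collect()
  // A real implementation unifies compatible-but-different schemas; this sketch
  // just keeps the schema with the most fields as a stand-in for that merge step.
  perFileSchemas.sortBy(-_.fields.length).headOption
}
```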
@viirya, the corrupt files are not ignored. Spark will throw a `SparkException` while reading the content.
Yeah, it is only ignored while reading the schema. The change is in the timing of when the corrupt files are detected: it is now postponed to actually reading the file contents. That might not be a big deal, though for the user experience it is better to throw such an exception early.
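To illustrate the timing difference described above, here is a sketch mirroring the tests, assuming an active `SparkSession` named `spark`; the path is illustrative, and a JSON file stands in for a corrupt ORC file just as in the test code:

```scala
// One valid ORC file plus a JSON file that is "corrupt" from the ORC reader's viewpoint.
spark.range(1).toDF("a").write.orc("/tmp/orc_corrupt_demo/first")
spark.range(2, 3).toDF("a").write.json("/tmp/orc_corrupt_demo/third")

spark.conf.set("spark.sql.files.ignoreCorruptFiles", "false")

// Before this change: this line already failed during schema inference
// ("Could not read footer for file"), because every footer was opened.
// After this change: the schema is taken from the first readable footer, so this succeeds.
val df = spark.read.orc("/tmp/orc_corrupt_demo/first", "/tmp/orc_corrupt_demo/third")

// The corruption now surfaces only when the file contents are actually scanned:
df.count()  // throws org.apache.spark.SparkException: ... Malformed ORC file ...
```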