[SPARK-3011][SQL] _temporary directory should be filtered out by sqlContext.parquetFile #1959
Changes from 4 commits
```diff
@@ -373,9 +373,11 @@ private[parquet] object ParquetTypesConverter extends Logging {
     }
     ParquetRelation.enableLogForwarding()

+    // NOTE: Explicitly list "_temporary" because hadoop 0.23 removed the variable TEMP_DIR_NAME
+    // from FileOutputCommitter. Check MAPREDUCE-5229 for the detail.
     val children = fs.listStatus(path).filterNot { status =>
       val name = status.getPath.getName
-      name(0) == '.' || name == FileOutputCommitter.SUCCEEDED_FILE_NAME
+      name(0) == '.' || name == FileOutputCommitter.SUCCEEDED_FILE_NAME || name == "_temporary"
     }

     // NOTE (lian): Parquet "_metadata" file can be very slow if the file consists of lots of row
```
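For reference, a minimal standalone sketch of the predicate this patch installs, with `FileOutputCommitter.SUCCEEDED_FILE_NAME` inlined as its value `"_SUCCESS"` and run over hypothetical directory entry names:

```scala
// Sketch of the patched filter applied to some hypothetical entry names.
val names = Seq("part-r-00001.parquet", "_metadata", "_SUCCESS", "_temporary", ".hidden")
val visible = names.filterNot { name =>
  name(0) == '.' || name == "_SUCCESS" || name == "_temporary"
}
// Before this patch, "_temporary" survived the filter; now only the data
// files and "_metadata" remain.
println(visible)  // List(part-r-00001.parquet, _metadata)
```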
|
Contributor

hmm, a better solution for all of this could be: … then: … then something like: … and moreover,

@liancheng, after carefully reading the following comments, I finally understand what you mean by "a complete Parquet file on HDFS should be a directory" (#2044 (comment)). You mean the whole directory is "a single Parquet file", and the files in it are "data"? But such a definition is really very confusing... Are you sure about this definition? I just googled, but found nothing, only statements like "Parquet files are self-describing so the schema is preserved". So, since they are self-describing, in my mind each "data file" in a Parquet file (a Parquet folder, actually...) is also a valid Parquet-format file, and it should also be accepted as an input source for a Parquet reader like our Spark SQLContext...
Contributor

And yeah, I also find the definition of "Parquet file" somewhat confusing; even the official Parquet documentation doesn't provide a precise definition. IMO a …
Contributor

Thanks for the info. Yep, this method is a confusing point; maybe we can reference some other Parquet reader implementations.
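For context on the discussion above, a Spark-written Parquet "file" on HDFS is a directory; a hypothetical listing (names illustrative) might look like:

```
users.parquet/               <- the "Parquet file" under discussion
├── _metadata                <- summary metadata for the whole directory
├── _SUCCESS                 <- Hadoop job-completion marker
├── _temporary/              <- scratch dir, present while a write is in flight
├── part-r-00001.parquet     <- data file; itself a valid Parquet file
└── part-r-00002.parquet
```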
How about ignoring any file starting with `_`? Hadoop (also) uses this convention, for things like the `_SUCCESS` file.
Unfortunately, that would ignore the metadata file "_metadata" as well.
Maybe we should rethink why we use filterNot here? A simple filter works fine here:
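Something like the following minimal sketch (reusing `fs` and `path` from the patch above, and assuming `_metadata` is the only underscore-prefixed entry we want to keep):

```scala
// Hypothetical positive filter: keep "_metadata", drop all other
// hidden ("."-prefixed) and temporary ("_"-prefixed) entries.
val children = fs.listStatus(path).filter { status =>
  val name = status.getPath.getName
  name == "_metadata" || !(name.startsWith(".") || name.startsWith("_"))
}
```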
so we can ignore all of the hidden/temporary files while still keeping `_metadata`.
I agree with this. Just remove `.*` and `_*` except `_metadata`.