[WIP][SQL][SPARK-6632]: Read schema from each input split in the ReadSupport hook, reconciling with the metastore schema at that time #5298
Conversation
ok to test
Ah, I'm also considering similar optimizations for Spark 1.4 :) The tricky part here is that, when scanning the Parquet table, Spark needs to read the footers of all part files on the driver side. Fortunately, Parquet is also trying to avoid reading footers entirely at the driver side (see https://github.com/apache/incubator-parquet-mr/pull/91 and https://github.com/apache/incubator-parquet-mr/pull/45). After upgrading to Parquet 1.6, which is expected to be released next week, we can do this properly for better performance. So ideally, we don't read footers on the driver side, and when we have a central arbitrative schema at hand, either from the metastore or from data source DDL, we don't do schema merging on the driver side either. I haven't had time to walk through all the related Parquet code paths and PRs yet, so the above statements may be inaccurate. Please correct me if you find any mistakes.
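The short-circuit described above can be sketched as follows. This is a hypothetical illustration, not code from this PR: `scanSchema` and `readAndMergeFooterSchemas` are made-up names, and the schema is modeled as a plain `String` for brevity. The point is simply that when a central arbitrative schema exists, the expensive footer-reading path is never taken.

```scala
// Hypothetical sketch: pick the schema for a scan. If a central
// "arbitrative" schema is available (from the metastore or the data
// source DDL), use it directly and skip reading footers on the driver;
// otherwise fall back to the expensive read-and-merge path.
def scanSchema(
    arbitrativeSchema: Option[String],
    readAndMergeFooterSchemas: () => String): String =
  arbitrativeSchema.getOrElse(readAndMergeFooterSchemas())
```

Because `readAndMergeFooterSchemas` is passed by name as a function, it is only invoked when no arbitrative schema is present.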
Hmm, I see. I will definitely go through these PRs. Anyway, I have fixed the whitespace problem here. Please retest.
add to whitelist
Test build #29958 has finished for PR 5298 at commit
…, reconciling with the metastore schema at that time
… columns, fullschema is derived later within ParquetRelation2
This is done so that partitionKeysIncludedInParquetSchema is computed correctly later on
Test build #30120 has finished for PR 5298 at commit
Test build #30118 has finished for PR 5298 at commit
Hey @liancheng, this change now reconciles the schema within the tasks. Please take a look and suggest changes. After that I will remove the schema-merging functions that are no longer needed.
Are we essentially duplicating ParquetRelation2.mergeMetastoreParquetSchema here?
Yes, the difference being that this happens within each task, whereas ParquetRelation2.mergeMetastoreParquetSchema happens on the driver. This eliminates the need for the mergeMetastoreParquetSchema method.
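For readers unfamiliar with what this reconciliation does, here is a minimal, self-contained sketch of the general idea, using stand-in `Field`/`StructType` case classes rather than Spark's actual types, and simplified semantics (I am assuming the metastore schema is case-insensitive and authoritative for field order, while the Parquet file schema is authoritative for case-sensitive names); it is not the actual mergeMetastoreParquetSchema implementation.

```scala
// Stand-ins for Spark SQL's schema types, for illustration only.
case class Field(name: String, nullable: Boolean)
case class StructType(fields: Seq[Field])

// Reconcile the (lower-cased) metastore schema with the case-sensitive
// schema read from one Parquet file, on the task side.
def reconcile(metastore: StructType, parquet: StructType): StructType = {
  // Index Parquet fields by lower-cased name, since the metastore schema
  // is case-insensitive while Parquet preserves the original casing.
  val byLowerName = parquet.fields.map(f => f.name.toLowerCase -> f).toMap
  StructType(metastore.fields.map { m =>
    byLowerName.get(m.name.toLowerCase) match {
      // Present in this file: take Parquet's casing, metastore nullability.
      case Some(p) => Field(p.name, m.nullable)
      // Absent from this file (e.g. a column added later): keep the field
      // but force it nullable so rows from this file can return null.
      case None    => Field(m.name, nullable = true)
    }
  })
}
```

Doing this per task means each task only ever needs the schema of its own input file, never the merged schema of all part files.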
ParquetRelation2.mergeMetastoreParquetSchema is just a static method; can we reuse it here? In particular, the comments for this method and for ParquetRelation2.mergeMissingNullableFields are pretty useful, and I would like to keep them.
And please don't put multiple }/) on a single line.
@saucam Right now I feel kind of hesitant to have this. As explained in my previous comment, the major bottleneck for Parquet metadata handling is reading footers. Without eliminating that, moving schema merging to the task side doesn't bring performance benefits (although I haven't done any benchmarks for this PR yet). Plus, there is a risk of introducing regressions. However, this PR is still very valuable, as it proves this approach is doable. Eventually, we would like to have this after upgrading Parquet to 1.6.0 and adding the ability to avoid reading footers on the driver side whenever a global arbitrative schema is available. I've opened SPARK-6795 to track this issue, and will probably start working on it later this month. Would you mind revisiting this at that time?
ok @liancheng |
Yeah, sure. |
Hi @saucam, thanks for working on this. I think that a lot of this has been implemented in Spark 1.5. Can we close this issue? |
Hey @liancheng,
How about this approach for schema reconciliation: we use the metastore schema and reconcile it within the ReadSupport init function. This way each input file is handled in its own map task, and there is no need to read the schemas of all part files and merge them before initiating the tasks.
I have not removed the schema-merging code/test cases for now. Let me know your thoughts on this one.
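The wiring behind this approach can be sketched roughly as follows. This is a loose model, not the PR's code: the `ReconcilingReadSupport` class, the configuration key, and the toy `Col`/`Schema` types and wire format are all made up for illustration, and the signature only mirrors the role of parquet-mr's `ReadSupport.init`, which really receives an `InitContext`. The point is that the driver ships only the metastore schema in the job configuration, and each task reconciles it against the schema of the single file it reads.

```scala
// Stand-in schema types for the sketch.
case class Col(name: String, nullable: Boolean)
case class Schema(cols: Seq[Col])

class ReconcilingReadSupport {
  // Called once per input split, on the task side (stand-in for the role
  // of parquet-mr's ReadSupport.init(InitContext)).
  def init(fileSchema: Schema, conf: Map[String, String]): Schema = {
    // The metastore schema travels in the job configuration, so no task
    // needs the footer of any file other than its own.
    val metastore = parseSchema(conf("spark.sql.parquet.metastoreSchema"))
    val byLower = fileSchema.cols.map(c => c.name.toLowerCase -> c).toMap
    Schema(metastore.cols.map { m =>
      byLower.get(m.name.toLowerCase)
        .map(p => Col(p.name, m.nullable))        // take Parquet's casing
        .getOrElse(Col(m.name, nullable = true))  // absent in this file
    })
  }

  // Toy wire format for this sketch: "name:nullable,name:nullable".
  private def parseSchema(s: String): Schema =
    Schema(s.split(',').toSeq.map { f =>
      val Array(n, nb) = f.split(':')
      Col(n, nb.toBoolean)
    })
}
```

Serializing the schema through the configuration is what lets the reconciliation happen lazily inside each map task instead of eagerly on the driver.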