
Conversation

@saucam commented Mar 31, 2015

Hey @liancheng,

How about this approach to schema reconciliation: use the metastore schema, and reconcile it within the ReadSupport init function. This way, each input file is handled inside its own map task, and there is no need to read the schemas of all part-files and merge them before launching the tasks.
I have not removed the schema merging code and test cases for now. Let me know your thoughts on this one.
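
A minimal sketch of the approach described above, written against the pre-1.7 parquet-mr API (`parquet.*` packages). The configuration key and the reconcile step are illustrative only, not the actual patch; a real implementation would also have to merge nullability information and handle case sensitivity:

```scala
import java.util.{Map => JMap}

import org.apache.hadoop.conf.Configuration
import parquet.hadoop.api.{InitContext, ReadSupport}
import parquet.hadoop.api.ReadSupport.ReadContext
import parquet.io.api.RecordMaterializer
import parquet.schema.{MessageType, MessageTypeParser}

import scala.collection.JavaConverters._

class TaskSideReconcilingReadSupport extends ReadSupport[AnyRef] {
  // Hypothetical key; the real patch would reuse Spark's own configuration keys.
  private val MetastoreSchemaKey = "spark.sql.hive.metastore.parquet.schema"

  override def init(context: InitContext): ReadContext = {
    val conf: Configuration = context.getConfiguration
    val fileSchema: MessageType = context.getFileSchema

    // Reconcile the driver-provided metastore schema against the footer schema
    // of the single file this task reads, instead of merging on the driver.
    val requestedSchema = Option(conf.get(MetastoreSchemaKey)) match {
      case Some(serialized) =>
        reconcile(MessageTypeParser.parseMessageType(serialized), fileSchema)
      case None =>
        fileSchema // no metastore schema shipped; fall back to the footer schema
    }
    new ReadContext(requestedSchema)
  }

  // Toy reconciliation: keep the metastore field order, restricted to fields
  // actually present in this part-file's footer.
  private def reconcile(metastore: MessageType, file: MessageType): MessageType = {
    val present = file.getFields.asScala.map(_.getName).toSet
    val fields = metastore.getFields.asScala.filter(f => present(f.getName))
    new MessageType(metastore.getName, fields.asJava)
  }

  // Record materialization is orthogonal to schema reconciliation; elided here.
  override def prepareForRead(
      conf: Configuration,
      metadata: JMap[String, String],
      fileSchema: MessageType,
      readContext: ReadContext): RecordMaterializer[AnyRef] = ???
}
```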

@liancheng (Contributor)

ok to test

@liancheng (Contributor)

Ah, I'm also considering similar optimizations for Spark 1.4 :)

The tricky part here is that, when scanning a Parquet table, Spark needs to call ParquetInputFormat.getSplits to compute the (Spark) partition information. This getSplits call can be super expensive, as it needs to read the footers of all Parquet part-files to compute the Parquet splits. That's why ParquetRelation2 caches those footers at the very beginning and injects them into an extended Parquet input format. With all these footers cached, ParquetRelation2.readSchema() is actually quite lightweight. So the real bottleneck is reading all those footers.
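
To make that cost concrete, here is a rough sketch of the footer scan described above, using parquet-mr's pre-1.7 API; the table path is a placeholder:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import parquet.hadoop.ParquetFileReader

import scala.collection.JavaConverters._

object FooterScanSketch extends App {
  // Roughly what makes driver-side planning expensive: one remote footer read
  // per part-file. ParquetRelation2 pays this cost once, up front, and caches
  // the result for both getSplits and schema merging.
  val conf = new Configuration()
  val tableDir = new Path("hdfs:///warehouse/some_table") // placeholder path

  val partFiles = tableDir
    .getFileSystem(conf)
    .listStatus(tableDir)
    .filter(_.getPath.getName.endsWith(".parquet"))
    .toSeq

  // One footer read per file; this is the bottleneck on large tables.
  val footers = ParquetFileReader.readAllFootersInParallel(conf, partFiles.asJava)

  // With the footers cached, extracting (and merging) schemas is cheap:
  footers.asScala.foreach { footer =>
    println(footer.getParquetMetadata.getFileMetaData.getSchema)
  }
}
```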

Fortunately, Parquet is also trying to avoid reading footers on the driver side entirely (see https://github.com/apache/incubator-parquet-mr/pull/91 and https://github.com/apache/incubator-parquet-mr/pull/45). After upgrading to Parquet 1.6, which is expected to be released next week, we can do this properly for better performance.

So ideally, we don't read footers on the driver side, and when we have a central arbitrative schema at hand, either from the metastore or from data source DDL, we don't do schema merging on the driver side either. I haven't had time to walk through all the related Parquet code paths and PRs yet, so the statements above may be inaccurate. Please correct me if you find any mistakes.
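
A minimal sketch of that end state: with an arbitrative schema already in hand, the driver never needs to touch the footers for schema purposes and simply ships the schema to the tasks. The key name is made up and matches the hypothetical ReadSupport sketch earlier in this thread:

```scala
import org.apache.hadoop.conf.Configuration

object DriverSideSchemaShipping {
  // Driver side: serialize the metastore / DDL schema into the job
  // configuration so each task can reconcile against its own file footer.
  def shipArbitrativeSchema(conf: Configuration, serializedSchema: String): Unit =
    conf.set("spark.sql.hive.metastore.parquet.schema", serializedSchema)
}
```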

@saucam (Author) commented Apr 4, 2015

Hmm, I see. I will definitely go through these PRs. Anyway, I've fixed the whitespace problem here. Please retest.

@liancheng (Contributor)

add to whitelist

@SparkQA commented Apr 9, 2015

Test build #29958 has finished for PR 5298 at commit 89efac5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

Yash Datta added 4 commits April 12, 2015 17:26
…, reconciling with the metastore schema at that time
… columns, fullschema is derived later within ParquetRelation2

            This is done so that partitionKeysIncludedInParquetSchema is computed correctly later on
@SparkQA commented Apr 12, 2015

Test build #30120 has finished for PR 5298 at commit 90d7782.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@SparkQA commented Apr 12, 2015

Test build #30118 has finished for PR 5298 at commit 866aa93.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@saucam (Author) commented Apr 12, 2015

Hey @liancheng, this change now reconciles the schema within the tasks; please take a look and suggest changes. After that I will remove the schema merging functions that are no longer needed.

@liancheng (Contributor)

Are we essentially duplicating ParquetRelation2.mergeMetastoreParquetSchema here?

@saucam (Author)

Yes, the difference being that this happens within each task, whereas ParquetRelation2.mergeMetastoreParquetSchema runs on the driver. This eliminates the need for the mergeMetastoreParquetSchema method.

@liancheng (Contributor)

ParquetRelation2.mergeMetastoreParquetSchema is just a static method; can we simply reuse it here? In particular, the comments on this method and on ParquetRelation2.mergeMissingNullableFields are pretty useful, and I would like to keep them.

And please don't put multiple }/) on a single line.
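
A sketch of what the suggested reuse might look like from the task side, assuming the Spark 1.3 signature of ParquetRelation2.mergeMetastoreParquetSchema. Since that method is private[parquet], the caller would need to live in the same package (or the method's visibility would need widening):

```scala
package org.apache.spark.sql.parquet

import org.apache.spark.sql.types.StructType

// Task-side reconciliation delegating to the existing helper instead of
// duplicating it. Assumed signature, as in Spark 1.3's newParquet.scala:
//   mergeMetastoreParquetSchema(metastoreSchema: StructType,
//                               parquetSchema: StructType): StructType
object TaskSideSchemaReconciliation {
  def reconcile(metastoreSchema: StructType, fileSchema: StructType): StructType =
    ParquetRelation2.mergeMetastoreParquetSchema(metastoreSchema, fileSchema)
}
```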

@liancheng (Contributor)

@saucam Right now I feel kind of hesitant about this. As explained in my previous comment, the major bottleneck in Parquet metadata handling is reading the footers. Without eliminating that, moving schema merging to the task side doesn't bring performance benefits (although I haven't run any benchmarks for this PR yet). Plus, there is a risk of introducing regressions.

However, this PR is still very valuable, as it proves this approach is doable. Eventually, we would like to have this after upgrading Parquet to 1.6.0 and adding the ability to avoid reading footers on the driver side whenever a global arbitrative schema is available. I've opened SPARK-6795 to track this issue. I will probably start working on SPARK-6795 later this month. Would you mind me revisiting this at that time?

@saucam (Author) commented Apr 13, 2015

OK @liancheng,
Thanks for the comments. In the meantime, let me try to address your suggestions. Can we keep this open in a WIP state for now?
Please let me know if I can be of help with SPARK-6795.

@liancheng (Contributor)

Yeah, sure.

@marmbrus (Contributor) commented Sep 3, 2015

Hi @saucam, thanks for working on this. I think that a lot of this has been implemented in Spark 1.5. Can we close this issue?

@asfgit closed this in 804a012 on Sep 4, 2015.