
Conversation

@saucam commented Mar 31, 2015

Hey @liancheng,

How about this approach to schema reconciliation: use the metastore schema, and reconcile it within the ReadSupport init function. This way, each input file is handled inside its own map task, and there is no need to read the schemas of all part-files and merge them before launching the tasks.
I have not removed the schema merging code and test cases for now. Let me know your thoughts on this one.
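
A minimal sketch of the approach described above, written against the pre-1.7 parquet-mr API (`parquet.*` packages). The configuration key and the reconcile step are illustrative only, not the actual patch; a real implementation would also have to merge nullability information and handle case sensitivity:

```scala
import java.util.{Map => JMap}

import org.apache.hadoop.conf.Configuration
import parquet.hadoop.api.{InitContext, ReadSupport}
import parquet.hadoop.api.ReadSupport.ReadContext
import parquet.io.api.RecordMaterializer
import parquet.schema.{MessageType, MessageTypeParser}

import scala.collection.JavaConverters._

class TaskSideReconcilingReadSupport extends ReadSupport[AnyRef] {
  // Hypothetical key; the real patch would reuse Spark's own configuration keys.
  private val MetastoreSchemaKey = "spark.sql.hive.metastore.parquet.schema"

  override def init(context: InitContext): ReadContext = {
    val conf: Configuration = context.getConfiguration
    val fileSchema: MessageType = context.getFileSchema

    // Reconcile the driver-provided metastore schema against the footer schema
    // of the single file this task reads, instead of merging on the driver.
    val requestedSchema = Option(conf.get(MetastoreSchemaKey)) match {
      case Some(serialized) =>
        reconcile(MessageTypeParser.parseMessageType(serialized), fileSchema)
      case None =>
        fileSchema // no metastore schema shipped; fall back to the footer schema
    }
    new ReadContext(requestedSchema)
  }

  // Toy reconciliation: keep the metastore field order, restricted to fields
  // actually present in this part-file's footer.
  private def reconcile(metastore: MessageType, file: MessageType): MessageType = {
    val present = file.getFields.asScala.map(_.getName).toSet
    val fields = metastore.getFields.asScala.filter(f => present(f.getName))
    new MessageType(metastore.getName, fields.asJava)
  }

  // Record materialization is orthogonal to schema reconciliation; elided here.
  override def prepareForRead(
      conf: Configuration,
      metadata: JMap[String, String],
      fileSchema: MessageType,
      readContext: ReadContext): RecordMaterializer[AnyRef] = ???
}
```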

@liancheng (Contributor)

ok to test

@liancheng (Contributor)

Ah, I'm also considering similar optimizations for Spark 1.4 :)

The tricky part here is that, when scanning a Parquet table, Spark needs to call ParquetInputFormat.getSplits to compute the (Spark) partition information. This getSplits call can be super expensive, as it needs to read the footers of all Parquet part-files to compute the Parquet splits. That's why ParquetRelation2 caches those footers at the very beginning and injects them into an extended Parquet input format. With all these footers cached, ParquetRelation2.readSchema() is actually quite lightweight. So the real bottleneck is reading all those footers.
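
To make that cost concrete, here is a rough sketch of the footer scan described above, using parquet-mr's pre-1.7 API; the table path is a placeholder:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import parquet.hadoop.ParquetFileReader

import scala.collection.JavaConverters._

object FooterScanSketch extends App {
  // Roughly what makes driver-side planning expensive: one remote footer read
  // per part-file. ParquetRelation2 pays this cost once, up front, and caches
  // the result for both getSplits and schema merging.
  val conf = new Configuration()
  val tableDir = new Path("hdfs:///warehouse/some_table") // placeholder path

  val partFiles = tableDir
    .getFileSystem(conf)
    .listStatus(tableDir)
    .filter(_.getPath.getName.endsWith(".parquet"))
    .toSeq

  // One footer read per file; this is the bottleneck on large tables.
  val footers = ParquetFileReader.readAllFootersInParallel(conf, partFiles.asJava)

  // With the footers cached, extracting (and merging) schemas is cheap:
  footers.asScala.foreach { footer =>
    println(footer.getParquetMetadata.getFileMetaData.getSchema)
  }
}
```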

Fortunately, Parquet is also trying to avoid reading footers on the driver side entirely (see https://github.com/apache/incubator-parquet-mr/pull/91 and https://github.com/apache/incubator-parquet-mr/pull/45). After upgrading to Parquet 1.6, which is expected to be released next week, we can do this properly for better performance.

So ideally, we don't read footers on the driver side, and when we have a central arbitrative schema at hand, either from the metastore or from data source DDL, we don't do schema merging on the driver side either. I haven't had time to walk through all the related Parquet code paths and PRs yet, so the statements above may be inaccurate. Please correct me if you find any mistakes.
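
A minimal sketch of that end state: with an arbitrative schema already in hand, the driver never needs to touch the footers for schema purposes and simply ships the schema to the tasks. The key name is made up and matches the hypothetical ReadSupport sketch earlier in this thread:

```scala
import org.apache.hadoop.conf.Configuration

object DriverSideSchemaShipping {
  // Driver side: serialize the metastore / DDL schema into the job
  // configuration so each task can reconcile against its own file footer.
  def shipArbitrativeSchema(conf: Configuration, serializedSchema: String): Unit =
    conf.set("spark.sql.hive.metastore.parquet.schema", serializedSchema)
}
```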

@saucam (Author) commented Apr 4, 2015

Hmm, I see. I will definitely go through these PRs. Anyway, I've fixed the whitespace problem here. Please retest.

@liancheng (Contributor)

add to whitelist

@SparkQA commented Apr 9, 2015

Test build #29958 has finished for PR 5298 at commit 89efac5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

Yash Datta added 4 commits April 12, 2015 17:26
…, reconciling with the metastore schema at that time
… columns, fullschema is derived later within ParquetRelation2

            This is done so that partitionKeysIncludedInParquetSchema is computed correctly later on
@SparkQA commented Apr 12, 2015

Test build #30120 has finished for PR 5298 at commit 90d7782.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@SparkQA commented Apr 12, 2015

Test build #30118 has finished for PR 5298 at commit 866aa93.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@saucam (Author) commented Apr 12, 2015

Hey @liancheng, this change now reconciles the schema within the tasks; please take a look and suggest changes. After that I will remove the schema merging functions that are no longer needed.

@liancheng (Contributor)

Are we essentially duplicating ParquetRelation2.mergeMetastoreParquetSchema here?

@saucam (Author)

Yes, the difference being that this happens within each task, whereas ParquetRelation2.mergeMetastoreParquetSchema runs on the driver. This eliminates the need for the mergeMetastoreParquetSchema method.

@liancheng (Contributor)

ParquetRelation2.mergeMetastoreParquetSchema is just a static method; can we simply reuse it here? In particular, the comments on this method and on ParquetRelation2.mergeMissingNullableFields are pretty useful, and I would like to keep them.

And please don't put multiple }/) on a single line.
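
A sketch of what the suggested reuse might look like from the task side, assuming the Spark 1.3 signature of ParquetRelation2.mergeMetastoreParquetSchema. Since that method is private[parquet], the caller would need to live in the same package (or the method's visibility would need widening):

```scala
package org.apache.spark.sql.parquet

import org.apache.spark.sql.types.StructType

// Task-side reconciliation delegating to the existing helper instead of
// duplicating it. Assumed signature, as in Spark 1.3's newParquet.scala:
//   mergeMetastoreParquetSchema(metastoreSchema: StructType,
//                               parquetSchema: StructType): StructType
object TaskSideSchemaReconciliation {
  def reconcile(metastoreSchema: StructType, fileSchema: StructType): StructType =
    ParquetRelation2.mergeMetastoreParquetSchema(metastoreSchema, fileSchema)
}
```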

@liancheng (Contributor)

@saucam Right now I feel kind of hesitant about this. As explained in my previous comment, the major bottleneck in Parquet metadata handling is reading the footers. Without eliminating that, moving schema merging to the task side doesn't bring performance benefits (although I haven't run any benchmarks for this PR yet). Plus, there is a risk of introducing regressions.

However, this PR is still very valuable, as it proves this approach is doable. Eventually, we would like to have this after upgrading Parquet to 1.6.0 and adding the ability to avoid reading footers on the driver side whenever a global arbitrative schema is available. I've opened SPARK-6795 to track this issue. I will probably start working on SPARK-6795 later this month. Would you mind me revisiting this at that time?

@saucam (Author) commented Apr 13, 2015

OK @liancheng,
Thanks for the comments. In the meantime, let me try to address your suggestions. Can we keep this open in a WIP state for now?
Please let me know if I can be of help with SPARK-6795.

@liancheng (Contributor)

Yeah, sure.

@marmbrus (Contributor) commented Sep 3, 2015

Hi @saucam, thanks for working on this. I think that a lot of this has been implemented in Spark 1.5. Can we close this issue?

@asfgit closed this in 804a012 on Sep 4, 2015.