Conversation

@saucam commented Mar 23, 2015

Currently, in the ParquetRelation2 implementation, an error is thrown if the merged Parquet schema is not exactly the same as the metastore schema.
But to support cases like dropping a column via the REPLACE COLUMNS command, we can relax this restriction so that the query still works as long as the metastore schema is a subset of the merged Parquet schema.
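
For illustration, a minimal sketch of the relaxed check (not the actual patch; the helper and field names are made up), using Spark SQL's StructType:

```scala
import org.apache.spark.sql.types._

// Hypothetical helper: accept the metastore schema as long as every one of its
// fields also appears in the merged Parquet schema (matching by name only, for
// brevity; a real check should also compare data types).
def isSubsetOf(metastoreSchema: StructType, mergedParquetSchema: StructType): Boolean = {
  val parquetFieldNames = mergedParquetSchema.fieldNames.toSet
  metastoreSchema.fieldNames.forall(parquetFieldNames.contains)
}

val mergedParquetSchema = StructType(Seq(
  StructField("a", IntegerType, nullable = true),
  StructField("b", StringType, nullable = true),
  StructField("c", LongType, nullable = true)))    // 'c' still exists in old part files

val metastoreSchema = StructType(Seq(              // 'c' was dropped via REPLACE COLUMNS
  StructField("a", IntegerType, nullable = true),
  StructField("b", StringType, nullable = true)))

assert(isSubsetOf(metastoreSchema, mergedParquetSchema))  // an exact-equality check would throw here
```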

@AmplabJenkins

Can one of the admins verify this patch?

@saucam saucam changed the title [SPARK-6471][SQL]: Metastore schema should only be a subset of parquet schema to support dropping of columns using replace columns [SQL][SPARK-6471]: Metastore schema should only be a subset of parquet schema to support dropping of columns using replace columns Mar 23, 2015
@liancheng
Contributor

ok to test

@SparkQA commented Mar 24, 2015

Test build #29118 has started for PR 5141 at commit 5f2f467.

  • This patch merges cleanly.

@SparkQA commented Mar 24, 2015

Test build #29118 has finished for PR 5141 at commit 5f2f467.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29118/
Test FAILed.

@saucam (Author) commented Mar 25, 2015

Fixed the test case and added a new test case as well. Please retest.

@SparkQA commented Mar 25, 2015

Test build #29150 has started for PR 5141 at commit e858d5b.

  • This patch merges cleanly.

@SparkQA commented Mar 25, 2015

Test build #29150 has finished for PR 5141 at commit e858d5b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29150/
Test PASSed.

@liancheng
Contributor

This LGTM now. Thanks for working on this!

@marmbrus This can be helpful when users remove a deprecated column from the metastore explicitly. Since this is an improvement rather than a bug, I guess we don't want to backport this to branch-1.3?

@marmbrus
Contributor

It depends on how safe you think it is. Also, did this work before we started pushing everything through the native Parquet path? If so, it sounds more like a bug.

@saucam (Author) commented Mar 26, 2015

Hi @liancheng, thanks for reviewing.

One small query on a separate note:
Currently, in the implementation of mergeMetastoreParquetSchema, I see that part files from all the partitions are used to work out the merged Parquet schema. Does this scale? If we have millions of partitions, doesn't this slow down every read query even when only a small number of partitions are referenced?
I was wondering if we could change this to build the unified schema only from the referenced partitions? (Though in that case I think we would need a summary file containing all the columns in the base path of the table.)

@liancheng
Contributor

@marmbrus @saucam Confirmed that 1.2 actually works in this case, so this is a regression. Merging to master and 1.3. Thanks for working on this and for the comments!

And @saucam, yes, reading from all Parquet data files isn't scalable. The reason this is necessary is that, while creating a ParquetRelation2, we need to figure out the full schema of the relation before any queries are executed, so there is no information about "referenced partitions" at that point. But I totally agree that this needs further optimization to make it more scalable. (BTW, do you really have Parquet tables consisting of millions of partitions?...)
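
Roughly, this is the kind of union the relation needs at construction time — an illustrative sketch only; the real mergeMetastoreParquetSchema also reconciles against the metastore schema and checks for incompatible types:

```scala
import org.apache.spark.sql.types._

// Illustrative union of per-file schemas by field name; a real implementation
// would also verify that fields appearing in several files have compatible types.
def unionSchemas(left: StructType, right: StructType): StructType = {
  val known = left.fieldNames.toSet
  StructType(left.fields ++ right.fields.filterNot(f => known(f.name)))
}

val perFileSchemas = Seq(
  StructType(Seq(StructField("a", IntegerType, nullable = true))),
  StructType(Seq(StructField("a", IntegerType, nullable = true),
                 StructField("b", StringType, nullable = true))))

// The relation-level schema must cover every column in any part file,
// which is why all footers are read up front.
val fullSchema = perFileSchemas.reduce(unionSchemas)
```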

@asfgit asfgit closed this in 1c05027 Mar 26, 2015
asfgit pushed a commit that referenced this pull request Mar 26, 2015
…t schema to support dropping of columns using replace columns

Currently in the parquet relation 2 implementation, error is thrown in case merged schema is not exactly the same as metastore schema.
But to support cases like deletion of column using replace column command, we can relax the restriction so that even if metastore schema is a subset of merged parquet schema, the query will work.

Author: Yash Datta <[email protected]>

Closes #5141 from saucam/replace_col and squashes the following commits:

e858d5b [Yash Datta] SPARK-6471: Fix test cases, add a new test case for metastore schema to be subset of parquet schema
5f2f467 [Yash Datta] SPARK-6471: Metastore schema should only be a subset of parquet schema to support dropping of columns using replace columns

(cherry picked from commit 1c05027)
Signed-off-by: Cheng Lian <[email protected]>
@saucam (Author) commented Mar 27, 2015

Hi @liancheng,

We do have use cases where 100K partitions will be registered in a table (partitioned on timestamps, with data added as a new partition every 5-minute interval), and it could be more in other cases.

Just one more query, please:

I see that in the Spark 1.2 old Parquet path we don't have support for ADD/REPLACE COLUMNS, so if I add a third column 'c' to a metastore schema with columns 'a', 'b' via ALTER TABLE, I get an unresolved attribute exception on 'c' in the SELECT query.
This will happen when

  1. I do not insert any data with the new column, or
  2. The part file Spark picks to read from (since it picks a random part file) does not contain the new column, even if it exists in some other part file.

To support such a scenario, is it enough to simply continue processing without throwing, pass all 3 columns down to Parquet, and let Parquet internally return nulls for the column that does not exist ('c')? Or is some special handling required, where I forward just the existing columns to Parquet and fill in the additional column with nulls in Spark?

@liancheng
Contributor

@saucam I believe #5214 covers the scenario you mentioned. You may refer to this comment of mine in #5188 (which was later superseded by #5214).

@saucam (Author) commented Mar 27, 2015

Hi @liancheng, thanks for the references. I have already gone through these, but I was talking about ParquetRelation (the old Parquet path, the default one in Spark 1.2), not ParquetRelation2.

@liancheng
Contributor

Oh sorry, I thought you were just using ParquetRelation as an example. Actually we're trying to replace ParquetRelation entirely with ParquetRelation2, probably in 1.4, so in general we don't intend to add schema evolution support to ParquetRelation. If you do need to backport this to ParquetRelation, you may follow the approach used in #5214. The key point is to make sure those additional fields are nullable, as nullability is significant in Parquet.
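
For instance, if a backported version pads a file's schema with table columns that are missing from that particular part file, the padded fields would need to be declared nullable — a rough illustration (the helper name is made up):

```scala
import org.apache.spark.sql.types._

// Hypothetical helper: append the table columns missing from this file's schema,
// forcing them to be nullable so that null values can legitimately be produced for them.
def padWithMissingNullableFields(fileSchema: StructType, tableSchema: StructType): StructType = {
  val present = fileSchema.fieldNames.toSet
  val missing = tableSchema.fields
    .filterNot(f => present(f.name))
    .map(_.copy(nullable = true))   // the key point: the extra fields must be nullable
  StructType(fileSchema.fields ++ missing)
}
```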

@saucam (Author) commented Mar 27, 2015

Sorry for so many queries.

How about if I simply skip reading the schema from the Parquet part files and rely only on the metastore schema (I will pass it from the Hive strategy to ParquetRelation)? Do you think that would have issues?

@liancheng
Contributor

Unfortunately, Hive is case-insensitive and assumes all fields are nullable (including nested fields in complex types), while for Parquet both case information and nullability are significant. That's one of the reasons why we need to reconcile the Hive metastore schema and the Parquet schema in ParquetRelation2.
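
A simplified sketch of one way such reconciliation can work (not the exact Spark code): match columns case-insensitively, reuse the Parquet side's exact-cased name and nullability, and keep the metastore's column order:

```scala
import org.apache.spark.sql.types._

// Simplified reconciliation: for each metastore column, look up the Parquet column
// with the same lower-cased name and take its exact-cased name and nullability.
def reconcile(metastoreSchema: StructType, parquetSchema: StructType): StructType = {
  val parquetByLowerName = parquetSchema.fields.map(f => f.name.toLowerCase -> f).toMap
  StructType(metastoreSchema.fields.map { m =>
    parquetByLowerName.get(m.name.toLowerCase) match {
      case Some(p) => m.copy(name = p.name, nullable = p.nullable)
      case None    => m   // column only known to the metastore; keep it as declared there
    }
  })
}
```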

@saucam (Author) commented Mar 27, 2015

Thanks for confirming this; I hope there is no other reason for reconciling the schema?
(In our use cases we can safely make sure our schema is lowercase and all columns are nullable, so it should be easy for me to use the metastore schema itself in ParquetRelation.)

@liancheng
Contributor

Yeah, then I think this should be OK.

@saucam (Author) commented Mar 28, 2015

Thanks a lot @liancheng ! :)
