Conversation

@saucam commented Mar 23, 2015

Currently, in the ParquetRelation2 implementation, an error is thrown if the merged Parquet schema is not exactly the same as the metastore schema.
But to support cases like dropping a column via the REPLACE COLUMNS command, we can relax this restriction so that the query still works as long as the metastore schema is a subset of the merged Parquet schema.
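
For illustration, a minimal sketch of the relaxed check (not the actual patch; the helper and field names are made up), using Spark SQL's StructType:

```scala
import org.apache.spark.sql.types._

// Hypothetical helper: accept the metastore schema as long as every one of its
// fields also appears in the merged Parquet schema (matching by name only, for
// brevity; a real check should also compare data types).
def isSubsetOf(metastoreSchema: StructType, mergedParquetSchema: StructType): Boolean = {
  val parquetFieldNames = mergedParquetSchema.fieldNames.toSet
  metastoreSchema.fieldNames.forall(parquetFieldNames.contains)
}

val mergedParquetSchema = StructType(Seq(
  StructField("a", IntegerType, nullable = true),
  StructField("b", StringType, nullable = true),
  StructField("c", LongType, nullable = true)))    // 'c' still exists in old part files

val metastoreSchema = StructType(Seq(              // 'c' was dropped via REPLACE COLUMNS
  StructField("a", IntegerType, nullable = true),
  StructField("b", StringType, nullable = true)))

assert(isSubsetOf(metastoreSchema, mergedParquetSchema))  // an exact-equality check would throw here
```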

@AmplabJenkins

Can one of the admins verify this patch?

@saucam saucam changed the title [SPARK-6471][SQL]: Metastore schema should only be a subset of parquet schema to support dropping of columns using replace columns [SQL][SPARK-6471]: Metastore schema should only be a subset of parquet schema to support dropping of columns using replace columns Mar 23, 2015
@liancheng
Contributor

ok to test

@SparkQA commented Mar 24, 2015

Test build #29118 has started for PR 5141 at commit 5f2f467.

  • This patch merges cleanly.

@SparkQA commented Mar 24, 2015

Test build #29118 has finished for PR 5141 at commit 5f2f467.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29118/
Test FAILed.

@saucam (Author) commented Mar 25, 2015

Fixed the test case and added a new test case as well. Please retest.

@SparkQA commented Mar 25, 2015

Test build #29150 has started for PR 5141 at commit e858d5b.

  • This patch merges cleanly.

@SparkQA commented Mar 25, 2015

Test build #29150 has finished for PR 5141 at commit e858d5b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29150/
Test PASSed.

@liancheng
Contributor

This LGTM now. Thanks for working on this!

@marmbrus This can be helpful when users remove a deprecated column from the metastore explicitly. Since this is an improvement rather than a bug, I guess we don't want to backport this to branch-1.3?

@marmbrus
Contributor

It depends on how safe you think it is. Also, did this work before we started pushing everything through the native Parquet path? If so, it sounds more like a bug.

@saucam (Author) commented Mar 26, 2015

Hi @liancheng, thanks for reviewing.

One small query on a separate note:
Currently, in the implementation of mergeMetastoreParquetSchema, I see that part files from all the partitions are used to work out the merged Parquet schema. Does this scale? If we have millions of partitions, doesn't this slow down every read query even when only a small number of partitions are referenced?
I was wondering if we could change this to build the unified schema only from the referenced partitions? (Though in that case I think we would need a summary file containing all the columns in the base path of the table.)

@liancheng
Contributor

@marmbrus @saucam Confirmed that 1.2 actually works in this case, so this is a regression. Merging to master and 1.3. Thanks for working on this and for the comments!

And @saucam, yes, reading from all Parquet data files isn't scalable. The reason this is necessary is that, while creating a ParquetRelation2, we need to figure out the full schema of the relation before any queries are executed, so there is no information about "referenced partitions" at that point. But I totally agree that this needs further optimization to make it more scalable. (BTW, do you really have Parquet tables consisting of millions of partitions?...)
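
Roughly, this is the kind of union the relation needs at construction time — an illustrative sketch only; the real mergeMetastoreParquetSchema also reconciles against the metastore schema and checks for incompatible types:

```scala
import org.apache.spark.sql.types._

// Illustrative union of per-file schemas by field name; a real implementation
// would also verify that fields appearing in several files have compatible types.
def unionSchemas(left: StructType, right: StructType): StructType = {
  val known = left.fieldNames.toSet
  StructType(left.fields ++ right.fields.filterNot(f => known(f.name)))
}

val perFileSchemas = Seq(
  StructType(Seq(StructField("a", IntegerType, nullable = true))),
  StructType(Seq(StructField("a", IntegerType, nullable = true),
                 StructField("b", StringType, nullable = true))))

// The relation-level schema must cover every column in any part file,
// which is why all footers are read up front.
val fullSchema = perFileSchemas.reduce(unionSchemas)
```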

@asfgit asfgit closed this in 1c05027 Mar 26, 2015
asfgit pushed a commit that referenced this pull request Mar 26, 2015
…t schema to support dropping of columns using replace columns

Currently in the parquet relation 2 implementation, error is thrown in case merged schema is not exactly the same as metastore schema.
But to support cases like deletion of column using replace column command, we can relax the restriction so that even if metastore schema is a subset of merged parquet schema, the query will work.

Author: Yash Datta <[email protected]>

Closes #5141 from saucam/replace_col and squashes the following commits:

e858d5b [Yash Datta] SPARK-6471: Fix test cases, add a new test case for metastore schema to be subset of parquet schema
5f2f467 [Yash Datta] SPARK-6471: Metastore schema should only be a subset of parquet schema to support dropping of columns using replace columns

(cherry picked from commit 1c05027)
Signed-off-by: Cheng Lian <[email protected]>
@saucam (Author) commented Mar 27, 2015

Hi @liancheng,

We do have use cases where 100K partitions will be registered in a table (partitioned on timestamps, with data added as a new partition every 5-minute interval), and it could be more in other cases.

Just one more query, please:

I see that in the Spark 1.2 old Parquet path we don't have support for ADD/REPLACE COLUMNS, so if I add a third column 'c' to a metastore schema with columns 'a', 'b' via ALTER TABLE, I get an unresolved attribute exception on 'c' in the SELECT query.
This will happen when

  1. I do not insert any data with the new column, or
  2. The part file Spark picks to read from (since it picks a random part file) does not contain the new column, even if it exists in some other part file.

To support such a scenario, is it enough to simply continue processing without throwing, pass all 3 columns down to Parquet, and let Parquet internally return nulls for the column that does not exist ('c')? Or is some special handling required, where I forward just the existing columns to Parquet and fill in the additional column with nulls in Spark?

@liancheng
Contributor

@saucam I believe #5214 covers the scenario you mentioned. You may refer to this comment of mine in #5188 (which was later superseded by #5214).

@saucam (Author) commented Mar 27, 2015

Hi @liancheng, thanks for the references. I have already gone through these, but I was talking about ParquetRelation (the old Parquet path, the default one in Spark 1.2), not ParquetRelation2.

@liancheng
Contributor

Oh sorry, I thought you were just using ParquetRelation as an example. Actually we're trying to replace ParquetRelation entirely with ParquetRelation2, probably in 1.4, so in general we don't intend to add schema evolution support to ParquetRelation. If you do need to backport this to ParquetRelation, you may follow the approach used in #5214. The key point is to make sure those additional fields are nullable, as nullability is significant in Parquet.
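
For instance, if a backported version pads a file's schema with table columns that are missing from that particular part file, the padded fields would need to be declared nullable — a rough illustration (the helper name is made up):

```scala
import org.apache.spark.sql.types._

// Hypothetical helper: append the table columns missing from this file's schema,
// forcing them to be nullable so that null values can legitimately be produced for them.
def padWithMissingNullableFields(fileSchema: StructType, tableSchema: StructType): StructType = {
  val present = fileSchema.fieldNames.toSet
  val missing = tableSchema.fields
    .filterNot(f => present(f.name))
    .map(_.copy(nullable = true))   // the key point: the extra fields must be nullable
  StructType(fileSchema.fields ++ missing)
}
```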

@saucam (Author) commented Mar 27, 2015

Sorry for so many queries.

How about if I simply skip reading the schema from the Parquet part files and rely only on the metastore schema (I will pass it from the Hive strategy to ParquetRelation)? Do you think that would have issues?

@liancheng
Contributor

Unfortunately, Hive is case-insensitive and assumes all fields are nullable (including nested fields in complex types), while for Parquet both case information and nullability are significant. That's one of the reasons why we need to reconcile the Hive metastore schema and the Parquet schema in ParquetRelation2.
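
A simplified sketch of one way such reconciliation can work (not the exact Spark code): match columns case-insensitively, reuse the Parquet side's exact-cased name and nullability, and keep the metastore's column order:

```scala
import org.apache.spark.sql.types._

// Simplified reconciliation: for each metastore column, look up the Parquet column
// with the same lower-cased name and take its exact-cased name and nullability.
def reconcile(metastoreSchema: StructType, parquetSchema: StructType): StructType = {
  val parquetByLowerName = parquetSchema.fields.map(f => f.name.toLowerCase -> f).toMap
  StructType(metastoreSchema.fields.map { m =>
    parquetByLowerName.get(m.name.toLowerCase) match {
      case Some(p) => m.copy(name = p.name, nullable = p.nullable)
      case None    => m   // column only known to the metastore; keep it as declared there
    }
  })
}
```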

@saucam (Author) commented Mar 27, 2015

Thanks for confirming this; I hope there is no other reason for reconciling the schema?
(In our use cases we can safely make sure our schema is lowercase and all columns are nullable, so it should be easy for me to use the metastore schema itself in ParquetRelation.)

@liancheng
Contributor

Yeah, then I think this should be OK.

@saucam (Author) commented Mar 28, 2015

Thanks a lot @liancheng ! :)
