[SQL][SPARK-6471]: Metastore schema should only be a subset of parquet schema to support dropping of columns using replace columns #5141
Conversation
SPARK-6471: Metastore schema should only be a subset of parquet schema to support dropping of columns using replace columns
Can one of the admins verify this patch?
ok to test
Test build #29118 has started for PR 5141 at commit
Test build #29118 has finished for PR 5141 at commit
Test FAILed.
SPARK-6471: Fix test cases, add a new test case for metastore schema to be subset of parquet schema
Fixed the test case. Added a new test case as well. Please retest
Test build #29150 has started for PR 5141 at commit
Test build #29150 has finished for PR 5141 at commit
Test PASSed.
This LGTM now. Thanks for working on this! @marmbrus This can be helpful when users remove a deprecated column from the metastore explicitly. Since this is an improvement rather than a bug, I guess we don't want to backport this to branch-1.3?
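For context, a minimal sketch of the use case, assuming a Spark 1.3 HiveContext, a Parquet-backed Hive table named logs (a hypothetical name), and a Hive version whose SerDe checks accept REPLACE COLUMNS for Parquet tables:

```scala
// Sketch only: table and column names are hypothetical.
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc) // sc: an existing SparkContext

// Drop the deprecated column from the metastore schema; the Parquet data
// files on disk still contain it.
hiveContext.sql("ALTER TABLE logs REPLACE COLUMNS (a INT, b STRING)")

// The metastore schema (a, b) is now a strict subset of the merged Parquet
// schema (a, b, deprecated_col). With this patch the query succeeds instead
// of failing the schema-equality check.
hiveContext.sql("SELECT a, b FROM logs").show()
```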
It depends on how safe you think it is. Also, did this work before we started pushing everything through the native parquet path? If so it sounds more like a bug.
Hi @liancheng, thanks for reviewing. One small query on a separate note,
@marmbrus @saucam Confirmed that 1.2 actually works in this case. So this is a regression. Merging to master and 1.3. Thanks for working on this and the comments! And @saucam, yes, reading from all Parquet data files isn't a scalable way. The reason why this is necessary is that, while creating a
[SQL][SPARK-6471]: Metastore schema should only be a subset of parquet schema to support dropping of columns using replace columns

Currently in the parquet relation 2 implementation, an error is thrown in case the merged schema is not exactly the same as the metastore schema. But to support cases like deletion of a column using the replace columns command, we can relax the restriction so that the query will work even if the metastore schema is a subset of the merged parquet schema.

Author: Yash Datta <[email protected]>

Closes #5141 from saucam/replace_col and squashes the following commits:

e858d5b [Yash Datta] SPARK-6471: Fix test cases, add a new test case for metastore schema to be subset of parquet schema
5f2f467 [Yash Datta] SPARK-6471: Metastore schema should only be a subset of parquet schema to support dropping of columns using replace columns

(cherry picked from commit 1c05027)
Signed-off-by: Cheng Lian <[email protected]>
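As a rough illustration of the relaxed check described above (not the actual Spark source; the helper name is hypothetical, and field names are compared case-insensitively to mirror the metastore's case insensitivity):

```scala
import org.apache.spark.sql.types.StructType

// Hypothetical helper: succeed when every metastore column is present in the
// merged Parquet schema, instead of requiring the two schemas to be identical.
def checkSchemas(metastoreSchema: StructType, mergedParquetSchema: StructType): Unit = {
  val parquetFieldNames = mergedParquetSchema.fieldNames.map(_.toLowerCase).toSet
  val missing = metastoreSchema.fieldNames.map(_.toLowerCase).filterNot(parquetFieldNames)

  // Columns dropped via REPLACE COLUMNS are still present in old data files,
  // so they appear in mergedParquetSchema but not in metastoreSchema; that is
  // now allowed. Only the reverse situation is treated as an error here.
  require(missing.isEmpty,
    s"Metastore columns missing from the merged Parquet schema: ${missing.mkString(", ")}")
}
```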
Hi @liancheng, we do have use cases where 100K partitions will be registered in tables (partitioned on timestamps, with data added as a new partition every 5-minute interval), but it could be more in other cases. Just one more query please: I see that in the Spark 1.2 old parquet path we don't have support for add / replace columns, so if I add a third column 'c' to a metastore schema with columns 'a', 'b' via alter table, I get an unresolved attribute exception on 'c' in the select query.
To support such a scenario, is it enough to simply go on processing without throwing and pass all 3 columns to parquet, so that parquet internally returns nulls for the column that does not exist ('c')? Or is some special handling required, where I forward just the existing columns to parquet and fill out the additional column with nulls in Spark?
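For illustration, a hypothetical sketch of the second option (forward only the columns that exist in the Parquet files, then pad the newly added metastore column with nulls on the Spark side); the function, column names, and the string type are assumptions:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

// existing: a DataFrame read from the Parquet files, containing only columns
// a and b; newColumn ("c") exists only in the metastore schema, so it is
// filled with NULLs.
def padMissingColumn(existing: DataFrame, newColumn: String): DataFrame =
  existing.withColumn(newColumn, lit(null).cast("string"))
```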
@saucam I believe #5214 covers the scenario you mentioned. You may refer to this comment of mine in #5188 (which was later superseded by #5214).
Hi @liancheng, thanks for the references. I have already gone through these, but I was talking about ParquetRelation (the old parquet path, the default one in Spark 1.2) and not ParquetRelation2.
Oh sorry, I thought you were just using
Sorry for so many queries. How about if I simply ignore reading schema from the parquet part files, relying only on the metastore schema (I will pass it from hivestrategy to ParquetRelation)? Do you think it would have issues?
Unfortunately Hive is case insensitive and assumes all fields are nullable (including nested fields in complex types), while for Parquet both case information and nullability are significant. That's one of the reasons why we need to reconcile the Hive metastore schema and the Parquet schema in
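To make the reconciliation concrete, an illustrative sketch (not the actual Spark code) that matches metastore fields to Parquet fields case-insensitively, keeps Parquet's case-sensitive names, and keeps the metastore's all-nullable view of nullability; the helper name is hypothetical:

```scala
import org.apache.spark.sql.types.{StructField, StructType}

def reconcile(metastoreSchema: StructType, parquetSchema: StructType): StructType = {
  val parquetFieldsByLowerCaseName =
    parquetSchema.fields.map(f => f.name.toLowerCase -> f).toMap

  StructType(metastoreSchema.fields.map { metastoreField =>
    parquetFieldsByLowerCaseName.get(metastoreField.name.toLowerCase) match {
      // Field exists in the Parquet files: take Parquet's exact (case-sensitive)
      // name so that column pruning pushed down to Parquet matches the files.
      case Some(parquetField) => metastoreField.copy(name = parquetField.name)
      // Field exists only in the metastore (e.g. freshly added): keep it as-is.
      case None => metastoreField
    }
  })
}
```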
Thanks for confirming this. I hope there is no other reason for reconciling the schema?
Yeah, then I think this should be OK.
Thanks a lot @liancheng! :)