Skip to content

Conversation

@nandorKollar
Copy link
Contributor

No description provided.

Copy link
Contributor

@gszadovszky gszadovszky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@gszadovszky gszadovszky merged commit 33ee549 into apache:master Jun 25, 2018
String message =
"message StringMessage {\n" +
" required binary string (UTF8);\n" +
" required binary string (STRING);\n" +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this test change needed?

@rdblue
Copy link
Contributor

rdblue commented Jul 3, 2018

@gszadovszky, I think we might need to remove this commit. It looks like this changes the Parquet schema format. Is that correct?

@gszadovszky
Copy link
Contributor

@rdblue, this is not a breaking change but introducing the new logical types. Both UTF8 and STRING should work in the schema definition. See PR#463 for the whole change.
Though, you are right that this actual test should not have been modified to keep UTF8 tested. I don't think we need to revert this change, we can modify the tests in an additional commit.
@nandorKollar, could you please modify the tests so both the original type UTF8 and the new logical type STRING are tested? Please also make sure that all the other original types and logical types are tested in the schema definition language.

@nandorKollar
Copy link
Contributor Author

@rdblue like @gszadovszky told, the change shouldn't be a breaking change, since the "old" types are still honored. Though the change I did on the test is misleading, I should have added a new test case for STRING and leave UTF8 untouched. Created a new PR #503.

@rdblue
Copy link
Contributor

rdblue commented Jul 4, 2018

Thanks, it's good to hear the old types still work. Since the new logical type code changes schema serialization, is this a forward-incompatible change? Will old readers still be able to read files written after this change?

@nandorKollar
Copy link
Contributor Author

The new API writes both logicalType and converted_type fields for each SchemaElement. Therefore old readers, which only know about converted_type will be able to read files written by new writers.

What old parquet versions won't be able to interpret is the changes in the schema language, the text representation parseable by MessageParser. New logical types, like timestamps have new type parameters, which the old parser can't parse. Fortunately - as far as I know - the text schema representation is not written into the file, thus the files written by new writer should be readable by old readers. @rdblue does this answer your concern?

ghost pushed a commit to RMS/parquet-mr that referenced this pull request Aug 18, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants