-
Notifications
You must be signed in to change notification settings - Fork 45
Update matrix for parquet-cpp and parquet-java #100
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| | STRING | ✅ | ✅ | | | ✅ | | ||
| | ENUM | ❌ | ✅ | | | ❌ | | ||
| | UUID | ❌ | ✅ | | | ❌ | | ||
| | 8, 16, 32, 64 bit signed and unsigned INT | ✅ | ✅ | | | ✅ | | ||
| | DECIMAL (INT32) | ✅ | ✅ | | | ✅ | | ||
| | DECIMAL (INT64) | ✅ | ✅ | | | ✅ | | ||
| | DECIMAL (BYTE_ARRAY) | ✅ | ✅ | | | ✅ | | ||
| | DECIMAL (FIXED_LEN_BYTE_ARRAY) | ✅ | ✅ | | | ✅ | | ||
| | DATE | ✅ | ✅ | | | ✅ | | ||
| | TIME (INT32) | ✅ | ✅ | | | ✅ | | ||
| | TIME (INT64) | ✅ | ✅ | | | ✅ | | ||
| | TIMESTAMP (INT64) | ✅ | ✅ | | | ✅ | | ||
| | INTERVAL | ✅ | ✅ | | | ❌ | | ||
| | JSON | ✅ | ✅ | | | ❌ | | ||
| | BSON | ❌ | ✅ | | | ❌ | | ||
| | LIST | ✅ | ✅ | | | ✅ | | ||
| | MAP | ✅ | ✅ | | | ✅ | | ||
| | UNKNOWN (always null) | ✅ | ✅ | | | ✅ | | ||
| | FLOAT16 | ✅ | ✅ | | | ✅ | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do we mean by logical types are supported? Is it like we won't throw an exception and return something at least, or we expect to actually represent the related value correctly?
Since parquet-java does not have its own representation type system (except the one in the example package that mainly handles the primitive types only), the question is whether at least one binding supports the related type. I am not aware that any binding supports INTERVAL or BSON. JSON is mostly handle as string without any syntax check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think won't throw should be considered as supported. Let me remove INTERVAL and BSON.
Just to confirm that ENUM and UUID are supported by the Avro binding, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, UUID is mapped to the Avro UUID type, ENUM is mapped to string.
But, JSON and FLOAT16 do not seem to be supported either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think FLOAT16 should be considered as supported: apache/parquet-java#1142
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, that's why I asked what supporting a logical type mean in terms of parquet-java where we do not have our own representation type system. For FLOAT16, we return the containing 2 bytes as a Binary in our example package. We do not map this type in any of our bindings.
If we consider FLOAT16 as supported, then all the others are supported as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we consider FLOAT16 as supported, then all the others are supported as well.
Yes, that's why I marked all logical types as supported in the initial commit. Developers are able to implement their own bindings to write and read such types by extending the parquet-java library.
Is there value in adding another option here (P perhaps) that means the implementation at least supports the logical type enum value but will return the physical type?
I'm not sure. Maybe it is a minimum requirement for implementation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wgtmac, If we do not want to distinguish the support level of e.g. FLOAT16 and UUID, I'm fine having all marked as supported.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's why I marked all logical types as supported in the initial commit. Developers are able to implement their own bindings to write and read such types by extending the
parquet-javalibrary.
You should then add a note or asterisk to point out that "supported" means the physical type is returned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. Added asterisk to explain that.
mapleFU
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
C++ part LGTM
Fokko
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some small comments, but this looks great @wgtmac Thanks for updating this 👍
| | DATE | ✅ | ✅ | | | ✅ | | ||
| | TIME (INT32) | ✅ | ✅ | | | ✅ | | ||
| | TIME (INT64) | ✅ | ✅ | | | ✅ | | ||
| | TIMESTAMP (INT64) | ✅ | ✅ | | | ✅ | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we split this out on the unit as well?
| | TIMESTAMP (INT64) | ✅ | ✅ | | | ✅ | | |
| | TIMESTAMP (INT64, MICROS) | ✅ | ✅ | | | ✅ | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is that? Any unsupported unit in parquet-java?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, but I'm not sure if all the other implementations support nanos since that was added later on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was added almost 7 years ago: apache/parquet-format@b879065. We can be explicit if we really see an unsupported implementation then.
|
Do you have any comment with the latest change? @gszadovszky @pitrou @Fokko |
Fokko
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, thanks for doing the work @wgtmac
etseidl
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @wgtmac! Looks good. I've got rust ready to go once this merges.
|
I just merged it. Thanks everyone! |
And here is @etseidl 's promised version for parquet-rs: 🥳 🦜 |
No description provided.