Arrow-Avro: Resolve named field discrepancies #8546
@mbrobbel @alamb If there's any way this PR and #8550 can get into the 57.0.0 release, that would be huge. After this there's just one last PR to add the remaining round-trip tests. These two PRs will complete the remaining functionality, though. CC: @nathaniel-d-ef
mbrobbel
left a comment
Thanks @nathaniel-d-ef and @jecsand838
🚀
# Which issue does this PR close?

Relates to: #8348 #4886

# Rationale for this change

This PR completes the efforts of @jecsand838, adding dense union support to the encoder side of the crate, along with four other minor extensions of existing time-related encoding.

Note: currently this PR is stacked behind #8546. Once that's merged this will be updated and will not include those changes.

# What changes are included in this PR?

- Dense union support for the writer
- Tests

# Are these changes tested?

- A full round-trip test, reading in an existing union avro file and asserting that the output matches expectations
- Unit tests covering new encoders.

# Are there any user-facing changes?

Crate not yet public

---------

Co-authored-by: Connor Sanders <[email protected]>
Co-authored-by: Connor Sanders <[email protected]>
Co-authored-by: Matthijs Brobbel <[email protected]>
Which issue does this PR close?
Related to: #4886 (“Add Avro Support”)
Rationale for this change
Prior to this PR, the crate lacked sufficient support for named types.
This PR introduces fixes to the Avro reader and writer to ensure correct and robust round-trip serialization of complex union types. The core issue was that the previous implementation could not distinguish between logically distinct types within a union when they shared the same physical representation. As a result, valid union schemas and data were flagged as invalid, and the specific names of named type branches were lost (e.g., "Fx8" becoming "fixed").
This PR registers name and namespace data in metadata and retrieves it from there, so that complex Avro unions can now be reliably read, converted to Arrow, and written back to Avro without validation errors or loss of type information.
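
To illustrate the idea, here is a minimal, stdlib-only sketch of carrying a named type's name and namespace through field metadata. It is not the crate's actual code: `SketchField`, `register_name`, `avro_full_name`, the metadata key strings, the `com.example` namespace, and the second branch name `Fx8Alt` are all hypothetical and chosen only for this example.

```rust
use std::collections::HashMap;

// Hypothetical metadata keys, for illustration only.
const NAME_KEY: &str = "avro.name";
const NAMESPACE_KEY: &str = "avro.namespace";

/// Simplified stand-in for an Arrow field: a physical type plus string metadata.
#[derive(Debug, Clone, PartialEq)]
struct SketchField {
    physical: String,
    metadata: HashMap<String, String>,
}

/// Reader side: record the Avro name (and namespace, if any) in the metadata.
fn register_name(physical: &str, name: &str, namespace: Option<&str>) -> SketchField {
    let mut metadata = HashMap::new();
    metadata.insert(NAME_KEY.to_string(), name.to_string());
    if let Some(ns) = namespace {
        metadata.insert(NAMESPACE_KEY.to_string(), ns.to_string());
    }
    SketchField { physical: physical.to_string(), metadata }
}

/// Writer side: recover the full name, falling back to a generic name
/// when no name was registered.
fn avro_full_name(field: &SketchField, fallback: &str) -> String {
    match field.metadata.get(NAME_KEY) {
        Some(name) => match field.metadata.get(NAMESPACE_KEY) {
            Some(ns) if !ns.is_empty() => format!("{}.{}", ns, name),
            _ => name.clone(),
        },
        None => fallback.to_string(),
    }
}

fn main() {
    // Two union branches with the same physical representation.
    let a = register_name("FixedSizeBinary(8)", "Fx8", Some("com.example"));
    let b = register_name("FixedSizeBinary(8)", "Fx8Alt", Some("com.example"));

    // The registered names keep the branches distinguishable.
    assert_ne!(a, b);
    println!("{}", avro_full_name(&a, "fixed"));
    println!("{}", avro_full_name(&b, "fixed"));
}
```

Without the registered name, both branches would reduce to the same physical type and the writer could only emit the generic fallback, which is the kind of information loss described above.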
This behavior will be further validated in a follow-up PR to add support for writing dense unions.
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?
Crate not yet public