-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Make DefaultLogicalExtensionCodec support serialisation of build in file formats #19071
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…ExtensionCodec * Implemented `try_decode_file_format` and `try_encode_file_format` methods to handle various file formats (CSV, JSON, Parquet, Arrow). * Introduced internal structures for file format handling, including `FileFormatKind`, `FileFormatProtoWrapper`, and `FileFormatWrapper`. * Enhanced error handling for unsupported file formats.
…ExtensionCodec-support-serialisation-of-build-in-file-formats
7935eb2 to
bb09871
Compare
|
cc @milenkovicm Please take a look. Thanks! |
milenkovicm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @peterxcli it looks good on a quick glance, will have a better look later.
I have two comments to start with
datafusion_contributor_guide.md
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this file added? perhaps its part of different PR ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh sorry i accidently committed these file, actually this file is just for myself to read.
thanks for catching, will revert
milenkovicm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your contribution @peterxcli
I have some follow up questions, please have a look when you get chance.
Also, I'd like to ask you to which extend you used coding assistant(s) for this PR?
| fn try_decode_roundtrip( | ||
| ctx: &TaskContext, | ||
| buf: &[u8], | ||
| ) -> Option<Arc<dyn FileFormatFactory>> { | ||
| #[cfg(feature = "parquet")] | ||
| let candidates: &[&dyn LogicalExtensionCodec] = &[ | ||
| &ParquetLogicalExtensionCodec {}, | ||
| &CsvLogicalExtensionCodec {}, | ||
| &JsonLogicalExtensionCodec {}, | ||
| ]; | ||
| #[cfg(not(feature = "parquet"))] | ||
| let candidates: &[&dyn LogicalExtensionCodec] = | ||
| &[&CsvLogicalExtensionCodec {}, &JsonLogicalExtensionCodec {}]; | ||
|
|
||
| for codec in candidates { | ||
| if let Ok(ff) = codec.try_decode_file_format(buf, ctx) { | ||
| let mut re = Vec::new(); | ||
| if codec | ||
| .try_encode_file_format( | ||
| &mut re, | ||
| Arc::<dyn FileFormatFactory>::clone(&ff), | ||
| ) | ||
| .is_ok() | ||
| && re == buf | ||
| { | ||
| return Some(ff); | ||
| } | ||
| } | ||
| } | ||
| None | ||
| } | ||
|
|
||
| if let Some(ff) = try_decode_roundtrip(ctx, buf) { | ||
| return Ok(ff); | ||
| } | ||
|
|
||
| // If nothing matched, return a clear error rather than guessing | ||
| exec_err!( | ||
| "Unsupported FileFormatFactory bytes for DefaultLogicalExtensionCodec ({} bytes)", | ||
| buf.len() | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
whats the purpose of this code?
| #[derive(Clone, PartialEq, ::prost::Message)] | ||
| struct FileFormatProtoWrapper { | ||
| #[prost(int32, tag = "1")] | ||
| kind: i32, | ||
| #[prost(bytes, tag = "2")] | ||
| blob: Vec<u8>, | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
whould it make sense to have this as part of proto file ?
| struct FileFormatWrapper { | ||
| kind: FileFormatKind, | ||
| blob: Vec<u8>, | ||
| } | ||
|
|
||
| impl FileFormatWrapper { | ||
| fn new(kind: FileFormatKind, blob: Vec<u8>) -> Self { | ||
| Self { kind, blob } | ||
| } | ||
|
|
||
| fn encode_into(self, buf: &mut Vec<u8>) -> Result<()> { | ||
| let proto = FileFormatProtoWrapper { | ||
| kind: self.kind as i32, | ||
| blob: self.blob, | ||
| }; | ||
| proto | ||
| .encode(buf) | ||
| .map_err(|e| DataFusionError::Internal(e.to_string())) | ||
| } | ||
|
|
||
| fn try_decode(buf: &[u8]) -> Result<Self> { | ||
| let proto = FileFormatProtoWrapper::decode(buf) | ||
| .map_err(|e| DataFusionError::Internal(e.to_string()))?; | ||
| let kind = match proto.kind { | ||
| 0 => FileFormatKind::Csv, | ||
| 1 => FileFormatKind::Json, | ||
| 2 => FileFormatKind::Parquet, | ||
| 3 => FileFormatKind::Arrow, | ||
| _ => { | ||
| return Err(DataFusionError::Internal( | ||
| "Unknown file format kind".to_string(), | ||
| )) | ||
| } | ||
| }; | ||
| Ok(Self { | ||
| kind, | ||
| blob: proto.blob, | ||
| }) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need structure additional to actual decode function
| } | ||
|
|
||
| #[test] | ||
| fn test_default_codec_legacy_raw_bytes_roundtrip() -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this functionality was not implemented how can we have a legacy implementation?
| } | ||
|
|
||
| #[test] | ||
| fn test_default_codec_legacy_empty_buffer() -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
his functionality was not implemented how can we have a legacy implementation?
| csv_format.delimiter = b'|'; | ||
| csv_format.has_header = Some(true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
whats the purpose of setting this values if not checked later?
|
|
||
| // Test JSON with custom options | ||
| let mut json_format = table_options.json.clone(); | ||
| json_format.compression = CompressionTypeVariant::GZIP; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
whats the purpose of setting this value if not checked later?
|
|
||
| // Test Parquet with custom options | ||
| let mut parquet_format = table_options.parquet.clone(); | ||
| parquet_format.global.bloom_filter_on_read = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
whats the purpose of setting this value if not checked later?
Which issue does this PR close?
Rationale for this change
What changes are included in this PR?
try_decode_file_format(&self, buf: &[u8], ctx: &SessionContext) -> Result<Arc<dyn FileFormatFactory>>inDefaultLogicalExtensionCodectry_encode_file_format(&self, buf: &mut Vec<u8>, node: Arc<dyn FileFormatFactory>) -> Result<()>inDefaultLogicalExtensionCodecAre these changes tested?
datafusion/proto/tests/cases/roundtrip_logical_plan.rsroundtrip_arrow_scan— Useslogical_plan_to_bytes()andlogical_plan_from_bytes(), which useDefaultLogicalExtensionCodec, so it exercises Arrow file format encoding/decoding.roundtrip_custom_listing_tables— Uses the default codec and may involve file formats.Are there any user-facing changes?
Yes, makes the default codec able to handle the common formats out-of-the-box.