PyArrow: Don't enforce the schema #902
```diff
@@ -1884,8 +1884,9 @@ def to_arrow_batch_reader(self) -> pa.RecordBatchReader:
 
         from pyiceberg.io.pyarrow import project_batches, schema_to_pyarrow
 
+        target_schema = schema_to_pyarrow(self.projection())
         return pa.RecordBatchReader.from_batches(
-            schema_to_pyarrow(self.projection()),
+            target_schema,
             project_batches(
                 self.plan_files(),
                 self.table_metadata,
@@ -1895,7 +1896,7 @@ def to_arrow_batch_reader(self) -> pa.RecordBatchReader:
                 case_sensitive=self.case_sensitive,
                 limit=self.limit,
             ),
-        )
+        ).cast(target_schema=target_schema)
 
     def to_pandas(self, **kwargs: Any) -> pd.DataFrame:
         return self.to_arrow().to_pandas(**kwargs)
```
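For context, here is a minimal standalone sketch (not PyIceberg code) of the pattern the diff introduces: build a `pa.RecordBatchReader` from batches and cast the whole reader to a target schema, so consumers always see the projected schema even when the underlying batches use `large_*` variants. `RecordBatchReader.cast` requires PyArrow 16.0 or newer; the column name and data below are made up.

```python
import pyarrow as pa

# A batch whose string column happens to use the large (64-bit offset) variant.
batch = pa.RecordBatch.from_arrays(
    [pa.array(["a", "bb"], type=pa.large_string())], names=["name"]
)

# The schema we want consumers of the stream to see (small string type).
target_schema = pa.schema([pa.field("name", pa.string())])

reader = pa.RecordBatchReader.from_batches(batch.schema, [batch])
casted = reader.cast(target_schema=target_schema)

print(casted.schema)      # name: string
print(casted.read_all())  # batches are cast as they are consumed
```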
Here we are making an opinionated decision on whether to use large or small types for the PyArrow schema when reading the Iceberg table as a RecordBatchReader. Is there a reason we don't want to do the same for the Table API? I've noticed that we changed the return type of the Table API to `Optional[pa.Table]` in order to avoid having to use `schema_to_pyarrow`. Similarly, other libraries such as Polars take the approach of choosing one type over the other (large types, in the case of Polars).
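For readers less familiar with the distinction: the "small" and "large" Arrow types differ only in the width of their offsets (32-bit vs 64-bit), which caps how much variable-length data a single small-typed array can hold. A quick illustration (values are made up):

```python
import pyarrow as pa

print(pa.string())        # string       -> 32-bit offsets, ~2 GiB of data per array
print(pa.large_string())  # large_string -> 64-bit offsets

# The same values can live in either representation; only the offset width differs.
small = pa.array(["x", "yz"], type=pa.string())
large = small.cast(pa.large_string())
print(small.type, large.type)
```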
My preference would be to let Arrow decide. For Polars it is different, because they are also the query engine. Casting the types recomputes the buffers, consuming additional memory and CPU, which I would rather avoid.
For the table, we first materialize all the batches in memory, so if one of them uses large types, the result is automatically upcast; otherwise it keeps the small types.
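A small sketch of the cost referred to above: casting between the small and large string representations has to rebuild the offsets buffers (32-bit vs 64-bit), so it is not a free metadata-only change (the table contents here are made up):

```python
import pyarrow as pa

small = pa.table({"name": pa.array(["a", "bb", "ccc"], type=pa.string())})
large = small.cast(pa.schema([pa.field("name", pa.large_string())]))

# The values buffer holds the same bytes, but the offsets buffer doubles in size,
# so the cast allocates and fills new memory rather than reusing the old buffers.
print(small.nbytes, large.nbytes)
```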
My knowledge of the Parquet-data-to-Arrow-buffer conversion is less deep, so please do check me if I am not making much sense 🙂
But are we actually casting the types on read?
We decide whether to read with large or small types when instantiating the fragment scanner, which loads the Parquet data into the Arrow buffers. The `schema_to_pyarrow()` calls to `pa.Table` or `pa.RecordBatchReader`, or in `to_requested_schema` following that, all represent the table schema in the consistent (large or small) format, which shouldn't result in any additional casting and reassigning of buffers. I think the only time we are casting the types is on write, where we may want to downcast for forward compatibility. It looks like we have to choose a schema to use on write anyway, because using a schema for the ParquetWriter that isn't consistent with the schema of the dataframe results in an exception.
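A minimal sketch of the write-side constraint mentioned above: `pyarrow.parquet.ParquetWriter` refuses a table whose schema differs from the schema the writer was created with, so the table has to be cast first (the file path and column are made up):

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"name": pa.array(["a", "b"], type=pa.string())})
writer_schema = pa.schema([pa.field("name", pa.large_string())])

with pq.ParquetWriter("example.parquet", writer_schema) as writer:
    # writer.write_table(table)  # would raise: table schema does not match writer schema
    writer.write_table(table.cast(writer_schema))
```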
+1 Currently, we use `large_*` types during write. I think it would be better if we could write the file based on the input PyArrow dataframe schema: if the dataframe is `string`, we also write with `string`.
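Roughly, the two options being discussed for the write path look like this (a sketch only; the column and data are made up):

```python
import pyarrow as pa

df = pa.table({"name": pa.array(["a", "b"], type=pa.string())})

# Current behaviour described above: always promote to large_* types before writing.
forced_large = df.cast(pa.schema([pa.field("name", pa.large_string())]))

# Suggested behaviour: keep whatever types the input dataframe already uses.
as_is = df  # schema stays "name: string"

print(forced_large.schema)
print(as_is.schema)
```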