Added More Description
  • Loading branch information
vinjai committed Jul 5, 2024
commit b6345ba41ffeee43ccfbc85d13ad106bff44e742
21 changes: 20 additions & 1 deletion pyiceberg/utils/arrow_sorting.py
@@ -56,7 +56,24 @@ def convert_sort_field_to_pyarrow_sort_options(sort_field: SortField) -> PyArrow

def get_sort_indices_arrow_table(arrow_table: pa.Table, sort_seq: List[Tuple[str, PyArrowSortOptions]]) -> List[int]:
@vinjai (Contributor, Author) commented Jul 6, 2024:

Just wanted to clarify why there is a separate implementation of sort_indices rather than using the one provided by pyarrow directly.
This is because pyarrow's sort_indices (and its SortOptions) supports only a single null placement shared across all sort keys.
More details here:

Meanwhile, the Iceberg spec allows a different null ordering for each sort key: https://iceberg.apache.org/spec/#sort-orders

This function implements that behavior by sorting key by key and relying on the stable nature of pyarrow's sort_indices algorithm.


We can raise another issue to improve the performance of this function.


In the future, if pyarrow's sort_indices supports different null orderings across keys, we can mark this function as obsolete and keep the implementation clean in the Iceberg table append and overwrite methods.

"""
Return the indices that would sort the input arrow table according to a given sort sequence.

This function computes an array of indices that define a stable sort of the input arrow_table.

Currently, pyarrow's sort_indices function does not accept a different null ordering per sort key.
To support distinct null orderings across multiple keys, this function:
1. uses a stable sort algorithm (pyarrow's sort_indices), and
2. sorts on the last key first, iterating in reverse back to the first key.

For instance:
If the sort order is defined as age ascending and then name descending, it can be decomposed into
single-key stable sorts in the following way:
- first sort by name descending
- then sort by age ascending

Because each pass is stable, the output is guaranteed to match a single multi-key sort.

pyarrow's sort_indices function is stable, as noted in its documentation: https://arrow.apache.org/docs/python/generated/pyarrow.compute.sort_indices.html

Args:
arrow_table (pa.Table): Input table to be sorted
@@ -83,6 +100,8 @@ def get_sort_indices_arrow_table(arrow_table: pa.Table, sort_seq: List[Tuple[str

for col_name, sort_options in sort_seq[::-1]:
sorted_table = sorted_table.take(
# This works because pyarrow's sort_indices function is stable,
# as noted in the docs: https://arrow.apache.org/docs/python/generated/pyarrow.compute.sort_indices.html
pa.compute.sort_indices(
sorted_table, sort_keys=[(col_name, sort_options.sort_direction)], null_placement=sort_options.null_order
)
5 changes: 5 additions & 0 deletions tests/integration/test_writes/test_sorted_writes.py
@@ -35,6 +35,11 @@
)
from utils import TABLE_SCHEMA, _create_table

########################################################################################################################
# The Spark engine works here because reads are stable within a single file.
# For these tests to pass, writes for sorted queries must generate only one file per partition.
########################################################################################################################


@pytest.mark.integration
@pytest.mark.parametrize(