-
Notifications
You must be signed in to change notification settings - Fork 388
feat: add delete file support to SnapshotProducer #1987
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat: add delete file support to SnapshotProducer #1987
Conversation
|
Hi @ethan-tyler , before you digging too deep, here is an exisiting PR(#1606) that covers some changes here already. I'm planning to open an epic issue to track this effort in a more organized fashion |
This enables SnapshotProducer to accept and process delete files (position deletes and equality deletes) alongside data files. Changes: - Add added_delete_files field to SnapshotProducer - Add validate_added_delete_files() for delete file validation: - Rejects delete files in V1 format - Validates content types (PositionDeletes, EqualityDeletes) - Requires equality_ids for equality delete files - Validates partition spec compatibility - Add write_delete_manifest() to write delete manifests with ManifestContentType::Deletes - Update manifest_file() to include delete manifests - Update summary() to populate delete file metrics - Enhance validate_duplicate_files() for both data and delete files - Add comprehensive unit tests This lays the groundwork for operations like RowDelta that need to atomically commit both data files and delete files.
…pend - Add cross-type duplicate check: reject same path in data and delete files - Reject equality_ids on position delete files (spec compliance) - Remove unreachable V1 code path in write_delete_manifest - Add TODO for partition spec validation strictness (partition evolution) - Wire validate_added_delete_files into FastAppendAction - Add tests for cross-type duplicates and position delete validation - Apply rustfmt formatting
39debc5 to
26003b0
Compare
Hey @CTTY - thanks for the heads up, I looked at #1606 and there's definitely overlap in the delete file handling. This PR is intentionally focused on just the SnapshotProducer piece with validation. It could serve as a clean base that #1606 rebases onto, and it's intended to unlock the DML actions. Happy to sync on how this fits with your epic and help out where needed. I'm planning to continue with RowDelta, Delete, and Overwrite after this feature lands. Let me know if I should hold off or if landing this first helps. |
| // TODO: This validation is too strict for partition evolution scenarios where delete | ||
| // files may reference older partition specs. Once manifest-per-spec is implemented, | ||
| // relax this to check that the spec_id exists rather than matching the default. | ||
| if self.table.metadata().default_partition_spec_id() != delete_file.partition_spec_id { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should a ticket be filed for this TODO? Not sure if one already exists.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will file one once this PR is ready to merge. Waiting to hear from @CTTY on how this fits with the DML epic and next steps.
Summary
Adds delete file support to SnapshotProducer, enabling atomic commits of position and equality delete files alongside data files.
Changes
Tests
Motivation
Foundation for RowDelta, Delete, and Overwrite operations that atomically commit data and delete files in a single snapshot.
Limitations
Delete files must use the default partition spec (manifest-per-spec grouping deferred)
Related:
RewriteFilesAction+ Validation Support + Process Delete Manifests #1606