-
-
Notifications
You must be signed in to change notification settings - Fork 596
feat(import): validate primary keys early #9349
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(import): validate primary keys early #9349
Conversation
Add early validation to check if specified primary keys exist in the import file's schema before processing rows. This prevents users from waiting for large files to be processed only to discover that their primary key column names are invalid. Changes: - Add validatePrimaryKeysAgainstSchema function to check primary key existence against file schema - Integrate validation into newImportDataReader for create operations - Provide helpful error messages listing available columns when primary keys are not found - Add unit tests covering various validation scenarios - Add BATS integration tests for CSV, PSV, and large file scenarios The validation only runs for create operations when primary keys are explicitly specified and no schema file is provided. This ensures fast failure while maintaining backward compatibility. Before: Users waited minutes for large files to process before seeing "provided primary key not found" errors After: Users get immediate feedback with helpful column suggestions Refs: dolthub#1083
|
A quick once over of the This change has broken four existing bats tests. not ok 954 import-create-tables: try to table import with nonexistent --pk arg
# (in test file ./import-create-tables.bats, line 257)
# `[[ "$output" =~ "column 'batmansparents' not found" ]] || false' failed
# Successfully initialized dolt data repository.
not ok 955 import-create-tables: try to table import with one valid and one nonexistent --pk arg
# (in test file ./import-create-tables.bats, line 264)
# `[[ "$output" =~ "column 'batmansparents' not found" ]] || false' failed
# Successfully initialized dolt data repository.
ok 956 import-create-tables: create a table with two primary keys from csv import
ok 957 import-create-tables: import data from psv and create the table
ok 958 import-create-tables: import table using --delim
not ok 959 import-create-tables: create a table with a name map
# (in test file ./import-create-tables.bats, line 313)
# `[ "$status" -eq 0 ]' failed
# Successfully initialized dolt data repository.
not ok 960 import-create-tables: use a name map with missing and extra entries
# (in test file ./import-create-tables.bats, line 331)
# `[ "$status" -eq 0 ]' failedAnd your two new tests fail: not ok 1040 import-tables: validate primary keys exist in CSV file (issue #1083)
# (in test file ./import-tables.bats, line 69)
# `[[ "$output" =~ "primary keys \[invalid1 invalid2\] not found" ]] || false' failed
# Successfully initialized dolt data repository.
not ok 1041 import-tables: primary key validation with schema file should skip validation
# (in test file ./import-tables.bats, line 152)
# `[ "$status" -eq 0 ]' failedThese need to be fixed before merge. |
timsehn
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A comment on. test location. Also see PR comments. This does not pass test.
|
Thanks @timsehn, I'll fix those. Should I add more validations in this PR or just stick with primary key validation for now and break up other checks (like the related issue referenced in the issue regarding number of args) into another PR? |
|
I think this is fine as an isolated change. I would make new PRs for additional checks. |
Update validation function to support name mapping during primary key validation, allowing users to specify primary keys using either original CSV column names or mapped target column names when using the --map option. Ensure validation is properly skipped when schema files are provided since --pk and --schema parameters are mutually exclusive. Update test expectations in import-create-tables.bats to match the existing detailed error message format that includes available columns, rather than changing the error messages to match the old test expectations. Move new validation tests from import-tables.bats to import-create-tables.bats as requested in code review. The two failing tests "try to table import with nonexistent --pk arg" and "try to table import with one valid and one nonexistent --pk arg" now pass because our early validation catches these errors before InferSchema, producing more helpful error messages that include available columns. Updated the test expectations to match our validation's error format instead of the old generic "column not found" message. Refs: dolthub#1083
|
Hey just bumping this up just in case y'all missed my replies/comments |
macneale4
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great! Thanks for the contribution 🙏
Requested a couple trivial changes to the bats tests. I'm running tests which need special permissions, so when make updates I can usher the merge through.
Remove unnecessary cleanup commands and update test comment format as requested in code review. The test harness automatically handles file cleanup, making manual rm commands redundant. Update test comment to reference GitHub issue dolthub#1083 using the preferred format. Refs: dolthub#1083
|
Thank you! I just pushed the commit resolving your suggestions. Let me know if I missed anything. |
macneale4
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! I'll merge when tests complete. Thanks again for your contribution!
__|__ |___| |\
|o__| |___| | \
|___| |___| |o \
_|___| |___| |__o\
/...\_____|___|____\_/
\ o * o * * o o /
~~~~~~~~~~~~~~~~~~~~~~~~~~
|
Thank you! |
NOTE: currently this is just checking primary keys, but the related issue also mentioned in #1083 references issues with number of args. I'm wondering if I should split that into two separate PRs, or if it's fine to do both - currently I only have the PK validation in place? Personally I would argue to separate them since they are slightly different, albeit similar. But happy to take either path.
Add early validation to check if specified primary keys exist in the import file's schema before processing rows. This prevents users from waiting for large files to be processed only to discover that their primary key column names are invalid.
Changes:
The validation only runs for create operations when primary keys are explicitly specified and no schema file is provided. This ensures fast failure while maintaining backward compatibility.
Before: Users waited minutes for large files to process before seeing "provided primary key not found" errors
After: Users get immediate feedback with helpful column suggestions
Closes: #1083