Skip to content

Conversation

@codeaucafe
Copy link
Contributor

@codeaucafe codeaucafe commented Jun 13, 2025

NOTE: currently this is just checking primary keys, but the related issue also mentioned in #1083 references issues with number of args. I'm wondering if I should split that into two separate PRs, or if it's fine to do both - currently I only have the PK validation in place? Personally I would argue to separate them since they are slightly different, albeit similar. But happy to take either path.

Add early validation to check if specified primary keys exist in the import file's schema before processing rows. This prevents users from waiting for large files to be processed only to discover that their primary key column names are invalid.

Changes:

  • Add validatePrimaryKeysAgainstSchema function to check primary key existence against file schema
  • Integrate validation into newImportDataReader for create operations
  • Provide helpful error messages listing available columns when primary keys are not found
  • Add unit tests covering various validation scenarios
  • Add BATS integration tests for CSV, PSV, and large file scenarios

The validation only runs for create operations when primary keys are explicitly specified and no schema file is provided. This ensures fast failure while maintaining backward compatibility.

Before: Users waited minutes for large files to process before seeing "provided primary key not found" errors

After: Users get immediate feedback with helpful column suggestions

Closes: #1083

Add early validation to check if specified primary keys exist in the
import file's schema before processing rows. This prevents users from
waiting for large files to be processed only to discover that their
primary key column names are invalid.

Changes:
- Add validatePrimaryKeysAgainstSchema function to check primary key
  existence against file schema
- Integrate validation into newImportDataReader for create operations
- Provide helpful error messages listing available columns when
  primary keys are not found
- Add unit tests covering various validation scenarios
- Add BATS integration tests for CSV, PSV, and large file scenarios

The validation only runs for create operations when primary keys are
explicitly specified and no schema file is provided. This ensures
fast failure while maintaining backward compatibility.

Before: Users waited minutes for large files to process before seeing
"provided primary key not found" errors

After: Users get immediate feedback with helpful column suggestions

Refs: dolthub#1083
@timsehn
Copy link
Contributor

timsehn commented Jun 13, 2025

A quick once over of the

This change has broken four existing bats tests.

not ok 954 import-create-tables: try to table import with nonexistent --pk arg
# (in test file ./import-create-tables.bats, line 257)
#   `[[ "$output" =~ "column 'batmansparents' not found" ]] || false' failed
# Successfully initialized dolt data repository.
not ok 955 import-create-tables: try to table import with one valid and one nonexistent --pk arg
# (in test file ./import-create-tables.bats, line 264)
#   `[[ "$output" =~ "column 'batmansparents' not found" ]] || false' failed
# Successfully initialized dolt data repository.
ok 956 import-create-tables: create a table with two primary keys from csv import
ok 957 import-create-tables: import data from psv and create the table
ok 958 import-create-tables: import table using --delim
not ok 959 import-create-tables: create a table with a name map
# (in test file ./import-create-tables.bats, line 313)
#   `[ "$status" -eq 0 ]' failed
# Successfully initialized dolt data repository.
not ok 960 import-create-tables: use a name map with missing and extra entries
# (in test file ./import-create-tables.bats, line 331)
#   `[ "$status" -eq 0 ]' failed

And your two new tests fail:

not ok 1040 import-tables: validate primary keys exist in CSV file (issue #1083)
# (in test file ./import-tables.bats, line 69)
#   `[[ "$output" =~ "primary keys \[invalid1 invalid2\] not found" ]] || false' failed
# Successfully initialized dolt data repository.
not ok 1041 import-tables: primary key validation with schema file should skip validation
# (in test file ./import-tables.bats, line 152)
#   `[ "$status" -eq 0 ]' failed

These need to be fixed before merge.

Copy link
Contributor

@timsehn timsehn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A comment on. test location. Also see PR comments. This does not pass test.

@codeaucafe
Copy link
Contributor Author

Thanks @timsehn, I'll fix those. Should I add more validations in this PR or just stick with primary key validation for now and break up other checks (like the related issue referenced in the issue regarding number of args) into another PR?

@timsehn
Copy link
Contributor

timsehn commented Jun 13, 2025

I think this is fine as an isolated change. I would make new PRs for additional checks.

Update validation function to support name mapping during primary key
validation, allowing users to specify primary keys using either
original CSV column names or mapped target column names when using
the --map option. Ensure validation is properly skipped when schema
files are provided since --pk and --schema parameters are mutually
exclusive.

Update test expectations in import-create-tables.bats to match the
existing detailed error message format that includes available columns,
rather than changing the error messages to match the old test
expectations. Move new validation tests from import-tables.bats to
import-create-tables.bats as requested in code review.

The two failing tests "try to table import with nonexistent --pk arg"
and "try to table import with one valid and one nonexistent --pk arg"
now pass because our early validation catches these errors before
InferSchema, producing more helpful error messages that include
available columns. Updated the test expectations to match our
validation's error format instead of the old generic "column not found"
message.

Refs: dolthub#1083
@codeaucafe
Copy link
Contributor Author

Hey just bumping this up just in case y'all missed my replies/comments

@macneale4 macneale4 mentioned this pull request Jun 23, 2025
Copy link
Contributor

@macneale4 macneale4 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! Thanks for the contribution 🙏

Requested a couple trivial changes to the bats tests. I'm running tests which need special permissions, so when make updates I can usher the merge through.

Remove unnecessary cleanup commands and update test comment format as
requested in code review. The test harness automatically handles file
cleanup, making manual rm commands redundant. Update test comment to
reference GitHub issue dolthub#1083 using the preferred format.

Refs: dolthub#1083
@codeaucafe
Copy link
Contributor Author

Thank you! I just pushed the commit resolving your suggestions. Let me know if I missed anything.

@macneale4 macneale4 self-requested a review June 24, 2025 16:39
Copy link
Contributor

@macneale4 macneale4 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! I'll merge when tests complete. Thanks again for your contribution!

    __|__ |___| |\
    |o__| |___| | \
    |___| |___| |o \
   _|___| |___| |__o\
  /...\_____|___|____\_/
  \   o * o * * o o  /
~~~~~~~~~~~~~~~~~~~~~~~~~~

@macneale4 macneale4 merged commit 44d79e9 into dolthub:main Jun 24, 2025
38 of 42 checks passed
@codeaucafe
Copy link
Contributor Author

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

validate all import args before reading import rows

4 participants