Fixing dockey/doc_id mismatch when no metadata is found #1288

Merged
jamesbraza merged 2 commits into main from fix-dockey-doc-id-mismatch-no-metadata on Feb 14, 2026

Conversation

@jamesbraza
Collaborator

@jamesbraza jamesbraza commented Feb 13, 2026

Summary

  • Fixes the flaky test_get_directory_index[check-md-query] CI failure caused by a dockey/doc_id mismatch when both Crossref and Semantic Scholar fail to return metadata for "Gravity Hill"
  • When upgrade_doc_to_doc_details falls back (no metadata found) and the dockey was auto-generated from content_hash, "doc_id" is now included in fields_to_overwrite_from_metadata so the Pydantic validator can sync dockey with the newly computed doc_id
  • User-provided dockey values (e.g. dockey="test" in test_docs_lifecycle) are still preserved
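
The fallback decision described above can be sketched as follows. This is a minimal illustration, not the actual paper-qa code: the function names `fields_to_overwrite_on_fallback` and `resolve_dockey` are hypothetical, and detecting an auto-generated dockey by comparing it to content_hash is an assumption based on the summary. The content hash below is a placeholder value.

```python
from typing import Optional, Set


def fields_to_overwrite_on_fallback(dockey: str, content_hash: Optional[str]) -> Set[str]:
    """Hypothetical helper: choose which fields the validator may overwrite
    when no metadata provider returned a result."""
    # Assumption: an auto-generated dockey is one that equals the content hash.
    if content_hash is not None and dockey == content_hash:
        return {"doc_id"}
    # A user-provided dockey (e.g. "test") is left untouched.
    return set()


def resolve_dockey(dockey: str, doc_id: str, fields_to_overwrite: Set[str]) -> str:
    """Hypothetical stand-in for the validator's sync step."""
    return doc_id if "doc_id" in fields_to_overwrite else dockey


placeholder_hash = "a" * 32  # placeholder 32-char content hash, not a real digest
new_doc_id = "93085aa5ff54865c"  # 16-char doc_id quoted in the test plan

auto_fields = fields_to_overwrite_on_fallback(placeholder_hash, placeholder_hash)
user_fields = fields_to_overwrite_on_fallback("test", placeholder_hash)

synced_dockey = resolve_dockey(placeholder_hash, new_doc_id, auto_fields)
preserved_dockey = resolve_dockey("test", new_doc_id, user_fields)
```

In this sketch, the auto-generated dockey ends up equal to the doc_id, while the user-provided dockey "test" is preserved.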

Root cause

When both metadata providers fail, the fallback path in DocMetadataClient.upgrade_doc_to_doc_details set fields_to_overwrite_from_metadata = set(). This prevented overwrite_docname_dockey_for_compatibility_w_doc from syncing dockey with doc_id. The dockey stayed as the raw 32-char content_hash while the Pydantic validator recomputed doc_id as a 16-char hash via compute_unique_doc_id(), so neither expected ID in the test matched.

The latent bug was introduced by #1029, which changed the test expectations from raw md5sum() (32-char) to compute_unique_doc_id(None, md5sum(...)) (16-char) without updating the fallback path to keep dockey in sync. CI passed at the time because the bug only surfaces when metadata providers fail — before #1029, the fallback dockey (raw content hash) still matched the expected IDs, so provider failures were harmless. The test started failing now because Crossref is returning errors in CI (likely an expired CROSSREF_API_KEY secret), pushing the test into the broken fallback path.
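
To make the length mismatch concrete, here is a small sketch using only hashlib. The helpers `raw_content_hash` and `short_doc_id` are hypothetical stand-ins for md5sum() and compute_unique_doc_id(None, ...); the real functions may derive the 16-char ID differently, so treat the hashing details as assumptions. Only the 32-char versus 16-char distinction comes from the description above.

```python
import hashlib


def raw_content_hash(content: bytes) -> str:
    """Stand-in for md5sum(): the full 32-char hex digest of the content."""
    return hashlib.md5(content, usedforsecurity=False).hexdigest()


def short_doc_id(content_hash: str, size: int = 16) -> str:
    """Hypothetical stand-in for compute_unique_doc_id(None, content_hash):
    hash the content hash again and keep the first `size` hex characters."""
    digest = hashlib.md5(content_hash.lower().encode(), usedforsecurity=False)
    return digest.hexdigest()[:size]


content_hash = raw_content_hash(b"example document text")

# Before the fix, the fallback path left dockey as the raw content hash
# while doc_id was recomputed as a shorter ID.
dockey_before_fix = content_hash
doc_id = short_doc_id(content_hash)

# After the fix, an auto-generated dockey is synced to doc_id.
dockey_after_fix = doc_id
```

With these stand-ins, `dockey_before_fix` has 32 characters and `doc_id` has 16, so a test expecting either form for both fields cannot pass until the two are synced.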

Test plan

  • Verified fix locally: when both providers fail, dockey now equals doc_id (93085aa5ff54865c)
  • Verified no regression: test_docs_lifecycle still passes (user-provided dockey="test" is preserved)
  • All 47 tests in test_clients.py and test_paperqa.py::test_docs_lifecycle pass
  • All pre-commit hooks pass

Copilot AI review requested due to automatic review settings February 13, 2026 00:55
@dosubot dosubot bot added the size:S This PR changes 10-29 lines, ignoring generated files. label Feb 13, 2026
@dosubot

dosubot bot commented Feb 13, 2026

Related Documentation

Checked 1 published document(s) in 1 knowledge base(s). No updates required.

@dosubot dosubot bot added the bug Something isn't working label Feb 13, 2026
Contributor

Copilot AI left a comment

Pull request overview

This PR fixes a bug where dockey and doc_id could become mismatched when metadata providers fail to return metadata for a document. The issue caused flaky test failures in test_get_directory_index[check-md-query].

Changes:

  • Modified the fallback path in upgrade_doc_to_doc_details to include "doc_id" in fields_to_overwrite_from_metadata when dockey was auto-generated from content_hash
  • This allows the Pydantic validator to sync dockey with the newly computed doc_id, ensuring both fields use the same 16-character hash

@jamesbraza jamesbraza force-pushed the fix-dockey-doc-id-mismatch-no-metadata branch from 73b5fe7 to a0f1439 Compare February 13, 2026 00:58
@dosubot dosubot bot added size:XS This PR changes 0-9 lines, ignoring generated files. and removed size:S This PR changes 10-29 lines, ignoring generated files. labels Feb 13, 2026
@jamesbraza jamesbraza force-pushed the fix-dockey-doc-id-mismatch-no-metadata branch from a0f1439 to aa5011d Compare February 13, 2026 01:01
@dosubot dosubot bot added size:S This PR changes 10-29 lines, ignoring generated files. and removed size:XS This PR changes 0-9 lines, ignoring generated files. labels Feb 13, 2026
@jamesbraza jamesbraza force-pushed the fix-dockey-doc-id-mismatch-no-metadata branch from aa5011d to 5caf393 Compare February 13, 2026 01:04
When both metadata providers fail, `upgrade_doc_to_doc_details`'s
fallback path set `fields_to_overwrite_from_metadata` to an empty set.
This prevented the Pydantic validator from syncing `dockey` with the
newly computed `doc_id` (which incorporates `content_hash`), leaving
`dockey` as the raw 32-char content hash while `doc_id` became a 16-char
truncated hash via `compute_unique_doc_id()`.

Now, when `dockey` was auto-generated from `content_hash`, `"doc_id"` is
included in `fields_to_overwrite_from_metadata` so the validator can
sync them. User-provided dockey values are still preserved.

Fixes the flaky `test_get_directory_index[check-md-query]` CI failure.

Co-authored-by: Cursor <cursoragent@cursor.com>
@jamesbraza jamesbraza force-pushed the fix-dockey-doc-id-mismatch-no-metadata branch from 5caf393 to 35e8ce5 Compare February 13, 2026 01:39
Same inconsistency from #1029: `check-md-query` was updated to expect
`compute_unique_doc_id`-based IDs but `check-txt-query` still expected
the raw 32-char `md5sum`. With the dockey/doc_id sync fix, dockey is
now always aligned with `compute_unique_doc_id`, so update the assertion.

Co-authored-by: Cursor <cursoragent@cursor.com>
@jamesbraza jamesbraza force-pushed the fix-dockey-doc-id-mismatch-no-metadata branch from 02a4a0b to 0bbff19 Compare February 13, 2026 04:18
@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Feb 14, 2026
@jamesbraza jamesbraza merged commit f8e9b12 into main Feb 14, 2026
7 checks passed
@jamesbraza jamesbraza deleted the fix-dockey-doc-id-mismatch-no-metadata branch February 14, 2026 08:14

Labels

bug Something isn't working lgtm This PR has been approved by a maintainer size:S This PR changes 10-29 lines, ignoring generated files.

3 participants