-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Fix: Critical Source Attribution Bug in Batch Document Processing #235
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…cessing - Replace shared extraction list with per-document tracking - Fix document boundary detection to attribute extractions correctly - Add safety handling for orphaned extractions - Maintain API compatibility and batch processing efficiency - Add comprehensive test for batch source attribution Fixes critical data integrity issue where all extractions were incorrectly attributed to the first document in batch processing. Tested: Multi-document batch processing now returns unique AnnotatedDocument objects with correct source attribution for each document.
|
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request. |
|
No linked issues found. Please link an issue in your pull request description or title. Per our Contributing Guidelines, all PRs must:
You can also use cross-repo references like |
|
Your branch is 1 commits behind git fetch origin main
git merge origin/main
git pushNote: Enable "Allow edits by maintainers" to allow automatic updates. |
|
Your branch is 3 commits behind git fetch origin main
git merge origin/main
git pushNote: Enable "Allow edits by maintainers" to allow automatic updates. |
|
Your branch is 4 commits behind git fetch origin main
git merge origin/main
git pushNote: Enable "Allow edits by maintainers" to allow automatic updates. |
|
Your branch is 5 commits behind git fetch origin main
git merge origin/main
git pushNote: Enable "Allow edits by maintainers" to allow automatic updates. |
3 similar comments
|
Your branch is 5 commits behind git fetch origin main
git merge origin/main
git pushNote: Enable "Allow edits by maintainers" to allow automatic updates. |
|
Your branch is 5 commits behind git fetch origin main
git merge origin/main
git pushNote: Enable "Allow edits by maintainers" to allow automatic updates. |
|
Your branch is 5 commits behind git fetch origin main
git merge origin/main
git pushNote: Enable "Allow edits by maintainers" to allow automatic updates. |
|
Thank you for reporting this. It should be resolved now in #276. For future PRs, I would recommend also creating an issue for easier tracking and discussions. |
Fix: Critical Source Attribution Bug in Batch Document Processing
This PR description follows the template requirements specified in
.github/workflows/validate_pr_template.yamlDescription
Fixes #[issue number] - Critical bug in batch document processing causing incorrect source attribution
This PR resolves a critical data integrity bug in the
feat/fix-source-attrib-batchbranch where all extractions were incorrectly attributed to the first document during batch processing usinglangextract.extract(). The bug caused cross-contamination of extracted data, making it impossible to determine which document actually contained which information.Problem Summary
I've identified and fixed a critical bug in the
feat/fix-source-attrib-batchbranch that was causing incorrect source attribution during batch document processing. This bug resulted in all extractions being attributed to the first document in the batch, regardless of their actual source document.Problem Statement
Issue Description
When processing multiple documents in a single batch using
langextract.extract(), all extractions were incorrectly attributed to the first document in the batch. This caused:Reproduction Case
Input: 3 documents processed in batch
document_1.md(primary content)document_2.md(supplementary content)document_3.md(additional content)Expected Output: Each document should have its own unique extractions based on its content
Actual Output (BEFORE FIX):
When processing 3 documents, langextract returns 3
AnnotatedDocumentobjects, but with incorrect attribution:Result: All documents have identical extraction lists, making it impossible to determine which document actually contained which information.
Root Cause Analysis
Technical Investigation
I traced the bug to the
_annotate_documents_single_passmethod inlangextract/annotation.py. The issue was in the document boundary detection logic:Problematic Code (Lines 344-381):
The Core Problem
annotated_extractionswas a single list accumulating extractions from ALL documentsWhy This Happened
The original code was designed for single-document processing where the assumption of "current document gets all extractions" was valid. When batch processing was added, this assumption broke down because:
Solution Implemented
I replaced the shared extraction list with per-document tracking:
Key Changes
1. Per-Document Extraction Tracking
2. Correct Attribution Logic
3. Document Boundary Handling
4. Safety Net for Orphaned Extractions
How Has This Been Tested?
Development Environment Setup
src/langextract/.venv/uv sync --all-extras --dev.pre-commit-config.yamlCode Quality Validation
Unit Test Coverage
test_batch_source_attribution_different_content()intests/annotation_test.pyGeminiLanguageModelwith diverse responses per documentMultiPassHelperFunctionsTestclass following project patternsTest Environment
gemini-2.5-promodelunittest.mockandabsl.testing.parameterized.github/workflows/validate_pr_template.yamlrequirementsTest Results
Unit Test Validation
Before Fix Results (Production Testing)
After Fix Results (Production Testing)
Validation Commands
Impact Assessment
Business Impact
Technical Impact
Performance Impact
Files Changed
Modified Files
langextract/annotation.py: Core fix implementation_annotate_documents_single_passmethod (lines 271-426)No Breaking Changes
AnnotatedDocumentstructureChecklist
Code Quality
Testing
Documentation and Process
.github/workflows/validate_pr_template.yamlDevelopment Environment
src/langextract/.venv/)uv sync --all-extras --dev)Additional Notes
Why This Bug Was Critical
Edge Cases Handled
Future Considerations
Commit Message
Ready for Review: This fix resolves a critical data integrity bug while maintaining full API compatibility and processing efficiency. The solution is thoroughly tested with production-like data and includes comprehensive error handling.