-
Notifications
You must be signed in to change notification settings - Fork 8.2k
fix: Don't fail if doc column is missing #10746
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the WalkthroughThe extract_docling_documents function now implements a two-tier column matching strategy for DataFrame inputs: exact column name match first, then fallback scan for any column containing DoclingDocument objects, with enhanced error handling and warnings. Comprehensive unit tests covering normal operations and edge cases have been added. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks and finishing touches✅ Passed checks (7 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Nitpick comments (2)
src/lfx/src/lfx/base/data/docling_utils.py (1)
44-53: Avoid callingdropna()twice for the same column.The current code calls
dropna()twice on each column during the fallback scan, which is inefficient for large DataFrames.for col in data_inputs.columns: try: # Check if this column contains DoclingDocument objects - sample = data_inputs[col].dropna().iloc[0] if len(data_inputs[col].dropna()) > 0 else None + non_null = data_inputs[col].dropna() + sample = non_null.iloc[0] if len(non_null) > 0 else None if sample is not None and isinstance(sample, DoclingDocument): found_column = col break except (IndexError, AttributeError): continuesrc/lfx/tests/unit/base/data/test_docling_utils.py (1)
70-84: Consider verifying the warning log is emitted during fallback.The implementation logs a warning when using a fallback column, but this test doesn't verify the warning was emitted. Consider capturing logs to ensure the warning behavior is tested.
+ def test_extract_from_dataframe_with_fallback_column_logs_warning(self, caplog): + """Test that fallback column extraction emits a warning.""" + import logging + + doc1 = DoclingDocument(name="test_doc1") + df = DataFrame([{"document": doc1, "file_path": "test1.pdf"}]) + + with caplog.at_level(logging.WARNING): + extract_docling_documents(df, "doc") + + assert "Column 'doc' not found" in caplog.text + assert "Using 'document' instead" in caplog.text
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
src/lfx/src/lfx/base/data/docling_utils.py(1 hunks)src/lfx/tests/unit/base/data/test_docling_utils.py(1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/{test_*.py,*.test.ts,*.test.tsx}
📄 CodeRabbit inference engine (coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt)
Check that test files follow the project's naming conventions (test_*.py for backend, *.test.ts for frontend)
Files:
src/lfx/tests/unit/base/data/test_docling_utils.py
**/test_*.py
📄 CodeRabbit inference engine (coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt)
**/test_*.py: Backend tests should follow pytest structure with proper test_*.py naming
For async functions, ensure proper async testing patterns are used with pytest for backend
Files:
src/lfx/tests/unit/base/data/test_docling_utils.py
🧬 Code graph analysis (1)
src/lfx/tests/unit/base/data/test_docling_utils.py (2)
src/lfx/src/lfx/base/data/docling_utils.py (1)
extract_docling_documents(28-106)src/lfx/src/lfx/schema/data.py (1)
Data(26-288)
🪛 GitHub Actions: Ruff Style Check
src/lfx/tests/unit/base/data/test_docling_utils.py
[error] 3-3: ruff: Import block is unsorted or un-formatted (I001). Please sort/format imports.
🪛 GitHub Check: Ruff Style Check (3.13)
src/lfx/tests/unit/base/data/test_docling_utils.py
[failure] 71-71: Ruff (E501)
src/lfx/tests/unit/base/data/test_docling_utils.py:71:121: E501 Line too long (121 > 120)
[failure] 3-8: Ruff (I001)
src/lfx/tests/unit/base/data/test_docling_utils.py:3:1: I001 Import block is un-sorted or un-formatted
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Update Component Index
🔇 Additional comments (5)
src/lfx/src/lfx/base/data/docling_utils.py (2)
35-41: LGTM on the primary column lookup!The exact column name match as the primary path is clear and correctly handles extraction with proper exception chaining.
55-76: LGTM on the fallback behavior and error messaging.The warning log when using a fallback column provides clear guidance, and the detailed error message with available columns and suggested solutions is user-friendly.
src/lfx/tests/unit/base/data/test_docling_utils.py (3)
14-53: LGTM on Data extraction tests.Good coverage of single Data, wrong key error, and list of Data scenarios with clear assertions.
55-68: LGTM on DataFrame with correct column test.Properly validates the primary extraction path.
86-117: LGTM on error handling tests.Comprehensive coverage of edge cases: missing DoclingDocument column, empty DataFrame, empty list, and None input. The error message assertions validate the user-friendly messaging.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #10746 +/- ##
==========================================
- Coverage 32.50% 32.49% -0.01%
==========================================
Files 1370 1370
Lines 63494 63513 +19
Branches 9391 9394 +3
==========================================
+ Hits 20639 20641 +2
- Misses 41815 41832 +17
Partials 1040 1040
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
| logger.warning( | ||
| f"Column '{doc_key}' not found, but found DoclingDocument objects in column '{found_column}'. " | ||
| f"Using '{found_column}' instead. Consider updating the 'Doc Key' parameter." | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to surface this to the UI
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ogabrielluiz how does it look now?
daniellicnerski1
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quick Test Report - PR #10746
Test Summary
Status: ✅ APPROVED
Tester: @daniellicnerski1
Results
| # | Pipeline | Format | Result | Note |
|---|---|---|---|---|
| 1 | VLM | Markdown | ✅ PASS | Issue #00105659 fixed |
| 2 | Standard | HTML | ✅ PASS | No regression |
| 3 | Standard | Plaintext | ✅ PASS | All formats work |
| 4 | Empty Input | Markdown | ✅ PASS | Error improved |
Key Findings
✅ Critical Fix Validated
VLM Pipeline + Export DoclingDocument now works!
- Before: ❌
TypeError: Column 'doc' not found - After: ✅ Works with automatic fallback
- Impact: Unblocks Verizon POC (5000-page documents)
✅ Enhanced Error Messages
When DataFrame is empty or column missing:
Error: Column 'doc' not found in DataFrame.
Available columns: []
Possible solutions:
1. Use 'Data' output instead of 'DataFrame'
2. Update 'Doc Key' parameter
✅ No Regression
- Standard pipeline: ✅ Working
- All export formats: ✅ Working
- Performance: ✅ No degradation
Recommendation
✅ APPROVED FOR MERGE - All tests passed, critical issue resolved.
Full detailed report available upon request
* fix: Don't fail if doc column is missing * [autofix.ci] apply automated fixes * [autofix.ci] apply automated fixes (attempt 2/3) * Surface warning message to the UI * [autofix.ci] apply automated fixes * [autofix.ci] apply automated fixes (attempt 2/3) * Update test_docling_utils.py * [autofix.ci] apply automated fixes * Update test_docling_utils.py --------- Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
* fix: Don't fail if doc column is missing * [autofix.ci] apply automated fixes * [autofix.ci] apply automated fixes (attempt 2/3) * Surface warning message to the UI * [autofix.ci] apply automated fixes * [autofix.ci] apply automated fixes (attempt 2/3) * Update test_docling_utils.py * [autofix.ci] apply automated fixes * Update test_docling_utils.py --------- Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
This pull request improves the robustness and usability of the
extract_docling_documentsfunction indocling_utils.py, especially when working with DataFrames that may not have the expected column names. It also adds comprehensive unit tests to ensure correct behavior and helpful error messaging in various scenarios.Enhancements to DataFrame extraction logic:
extract_docling_documentsto search for columns containingDoclingDocumentobjects when the exact column name is not found, and logs a warning if a fallback is used.Testing improvements:
extract_docling_documents, covering extraction fromData, lists ofData, DataFrames with correct and incorrect columns, fallback behavior, and error handling for empty inputs and missing columns.Summary by CodeRabbit
New Features
Bug Fixes
Tests
✏️ Tip: You can customize this high-level summary in your review settings.