fix: Don't fail if doc column is missing #10746

erichare · 2025-11-26T21:34:37Z

This pull request improves the robustness and usability of the extract_docling_documents function in docling_utils.py, especially when working with DataFrames that may not have the expected column names. It also adds comprehensive unit tests to ensure correct behavior and helpful error messaging in various scenarios.

Enhancements to DataFrame extraction logic:

Added a fallback mechanism in extract_docling_documents that searches for columns containing DoclingDocument objects when the exact column name is not found. A warning is logged whenever the fallback is used.
Improved error messages to provide users with actionable suggestions when the expected column is missing, including listing available columns and possible solutions.
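
As a rough illustration of the kind of actionable error message described here, the sketch below composes one from the requested key and the available columns. The helper name build_missing_column_error is hypothetical and not part of the PR; the wording follows the error text quoted in the test report later in this thread.

```python
from __future__ import annotations


def build_missing_column_error(doc_key: str, available_columns: list[str]) -> str:
    """Compose an actionable message for a missing document column.

    Hypothetical helper for illustration only; not taken from docling_utils.py.
    """
    return (
        f"Column '{doc_key}' not found in DataFrame.\n"
        f"Available columns: {available_columns}\n"
        "Possible solutions:\n"
        "1. Use 'Data' output instead of 'DataFrame'\n"
        "2. Update 'Doc Key' parameter"
    )


# Example: the caller asked for 'doc', but the frame only has other columns.
message = build_missing_column_error("doc", ["text", "file_path"])
print(message)
```

Listing the available columns in the message lets the user see at a glance which value to put in the 'Doc Key' parameter.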

Testing improvements:

Added a full suite of unit tests for extract_docling_documents, covering extraction from Data, lists of Data, DataFrames with correct and incorrect columns, fallback behavior, and error handling for empty inputs and missing columns.

Summary by CodeRabbit

New Features
- Added intelligent fallback mechanism for document extraction: system now searches available columns when the specified column is not found, with informative guidance.
Bug Fixes
- Improved error messages to provide detailed information about available columns and recommended actions for extraction issues.
Tests
- Comprehensive unit test coverage added covering normal operations, edge cases, and error scenarios for document extraction.

coderabbitai · 2025-11-26T21:34:55Z

Walkthrough

The extract_docling_documents function now implements a two-tier column matching strategy for DataFrame inputs: exact column name match first, then fallback scan for any column containing DoclingDocument objects, with enhanced error handling and warnings. Comprehensive unit tests covering normal operations and edge cases have been added.
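
The two-tier lookup described above can be sketched roughly as follows. This is a minimal, self-contained approximation and not the code from docling_utils.py: a plain dict of column name to list of values stands in for the DataFrame, DoclingDocument is a local stand-in class, resolve_doc_column is a hypothetical helper name, and raising TypeError is an assumption based on the "Before" error quoted in the test report.

```python
from __future__ import annotations

import logging

logger = logging.getLogger(__name__)


class DoclingDocument:
    """Local stand-in for docling's DoclingDocument so the sketch runs on its own."""

    def __init__(self, name: str) -> None:
        self.name = name


def resolve_doc_column(columns: dict[str, list], doc_key: str) -> str:
    """Return the name of the column to read documents from (hypothetical helper).

    `columns` maps column name -> list of cell values and stands in for a DataFrame.
    """
    # Tier 1: exact column name match, no warning.
    if doc_key in columns:
        return doc_key

    # Tier 2: pick the first column whose first non-null value is a DoclingDocument.
    for name, values in columns.items():
        sample = next((v for v in values if v is not None), None)
        if isinstance(sample, DoclingDocument):
            logger.warning(
                "Column '%s' not found, but found DoclingDocument objects in column '%s'. "
                "Using '%s' instead.",
                doc_key,
                name,
                name,
            )
            return name

    # Neither tier matched: fail with the available columns listed.
    msg = f"Column '{doc_key}' not found in DataFrame. Available columns: {list(columns)}"
    raise TypeError(msg)


table = {
    "document": [None, DoclingDocument("test_doc1")],
    "file_path": ["test0.pdf", "test1.pdf"],
}
exact = resolve_doc_column(table, "document")  # tier 1
resolved = resolve_doc_column(table, "doc")    # tier 2, logs a warning
```

Checking the exact name first keeps the common case free of warnings; the scan only runs when the configured key does not exist.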

Changes

Cohort / File(s)	Summary
Implementation Enhancement `src/lfx/src/lfx/base/data/docling_utils.py`	Updated extract_docling_documents to use two-tier column matching for DataFrames: (1) exact column name match, (2) fallback scan for DoclingDocument columns with warning. Enhanced error messages now include target column details and differentiate between exact-match and fallback failures. Non-DataFrame handling logic unchanged.
Test Suite `src/lfx/tests/unit/base/data/test_docling_utils.py`	New unit test module covering extract_docling_documents functionality: correct extraction from Data and DataFrames, fallback column resolution, error handling for missing/wrong keys, empty inputs, None values, and missing DoclingDocument columns with detailed error messages.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

- Verify the two-tier column lookup logic handles all cases correctly and in the intended order
- Confirm warning is triggered only for fallback scenarios and not for exact matches
- Validate error messages are comprehensive, user-friendly, and include available columns and suggested solutions
- Review edge case handling: empty DataFrames, empty data lists, None inputs, and fallback behavior
- Ensure test coverage adequately validates both success and failure paths

Pre-merge checks and finishing touches

✅ Passed checks (7 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'fix: Don't fail if doc column is missing' directly and accurately summarizes the main change: adding robustness to handle missing column names in extract_docling_documents by implementing a fallback mechanism.
Docstring Coverage	✅ Passed	Docstring coverage is 90.00% which is sufficient. The required threshold is 80.00%.
Test Coverage For New Implementations	✅ Passed	The PR includes 9 comprehensive test methods covering all major functionality changes: exact column matching, fallback column search, error handling, Data/DataFrame extraction, and edge cases with valid assertions and pytest.raises statements.
Test Quality And Coverage	✅ Passed	Test suite provides comprehensive coverage of extract_docling_documents functionality including main paths (Data, list, DataFrame), fallback column discovery, and 9+ error scenarios with detailed error message validation.
Test File Naming And Structure	✅ Passed	Test file follows all required patterns: proper test_*.py naming, pytest structure with descriptive class and method names, comprehensive coverage of 9 test cases including positive scenarios, negative scenarios, and edge cases, logically organized by data type, and complete mapping to implementation code paths.
Excessive Mock Usage Warning	✅ Passed	The test file uses real objects (DoclingDocument, Data, DataFrame) instead of mocks, with no mock libraries imported, verifying actual function behavior and appropriate error handling.

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (2)

src/lfx/src/lfx/base/data/docling_utils.py (1)

44-53: Avoid calling dropna() twice for the same column.

The current code calls dropna() twice on each column during the fallback scan, which is inefficient for large DataFrames.

             for col in data_inputs.columns:
                 try:
                     # Check if this column contains DoclingDocument objects
-                    sample = data_inputs[col].dropna().iloc[0] if len(data_inputs[col].dropna()) > 0 else None
+                    non_null = data_inputs[col].dropna()
+                    sample = non_null.iloc[0] if len(non_null) > 0 else None
                     if sample is not None and isinstance(sample, DoclingDocument):
                         found_column = col
                         break
                 except (IndexError, AttributeError):
                     continue

src/lfx/tests/unit/base/data/test_docling_utils.py (1)

70-84: Consider verifying the warning log is emitted during fallback.

The implementation logs a warning when using a fallback column, but this test doesn't verify the warning was emitted. Consider capturing logs to ensure the warning behavior is tested.

+    def test_extract_from_dataframe_with_fallback_column_logs_warning(self, caplog):
+        """Test that fallback column extraction emits a warning."""
+        import logging
+
+        doc1 = DoclingDocument(name="test_doc1")
+        df = DataFrame([{"document": doc1, "file_path": "test1.pdf"}])
+
+        with caplog.at_level(logging.WARNING):
+            extract_docling_documents(df, "doc")
+
+        assert "Column 'doc' not found" in caplog.text
+        assert "Using 'document' instead" in caplog.text

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0ddfed3 and e25ef32.

📒 Files selected for processing (2)

src/lfx/src/lfx/base/data/docling_utils.py (1 hunks)
src/lfx/tests/unit/base/data/test_docling_utils.py (1 hunks)

🧰 Additional context used

📓 Path-based instructions (2)

**/{test_*.py,*.test.ts,*.test.tsx}

📄 CodeRabbit inference engine (coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt)

Check that test files follow the project's naming conventions (test_*.py for backend, *.test.ts for frontend)

Files:

src/lfx/tests/unit/base/data/test_docling_utils.py

**/test_*.py

📄 CodeRabbit inference engine (coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt)

**/test_*.py: Backend tests should follow pytest structure with proper test_*.py naming
For async functions, ensure proper async testing patterns are used with pytest for backend

Files:

src/lfx/tests/unit/base/data/test_docling_utils.py

🧬 Code graph analysis (1)

src/lfx/tests/unit/base/data/test_docling_utils.py (2)

src/lfx/src/lfx/base/data/docling_utils.py (1)

extract_docling_documents (28-106)

src/lfx/src/lfx/schema/data.py (1)

Data (26-288)

🪛 GitHub Actions: Ruff Style Check

src/lfx/tests/unit/base/data/test_docling_utils.py

[error] 3-3: ruff: Import block is unsorted or un-formatted (I001). Please sort/format imports.

🪛 GitHub Check: Ruff Style Check (3.13)

src/lfx/tests/unit/base/data/test_docling_utils.py

[failure] 71-71: Ruff (E501)
src/lfx/tests/unit/base/data/test_docling_utils.py:71:121: E501 Line too long (121 > 120)

[failure] 3-8: Ruff (I001)
src/lfx/tests/unit/base/data/test_docling_utils.py:3:1: I001 Import block is un-sorted or un-formatted

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Update Component Index

🔇 Additional comments (5)

src/lfx/src/lfx/base/data/docling_utils.py (2)

35-41: LGTM on the primary column lookup!

The exact column name match as the primary path is clear and correctly handles extraction with proper exception chaining.

55-76: LGTM on the fallback behavior and error messaging.

The warning log when using a fallback column provides clear guidance, and the detailed error message with available columns and suggested solutions is user-friendly.

src/lfx/tests/unit/base/data/test_docling_utils.py (3)

14-53: LGTM on Data extraction tests.

Good coverage of single Data, wrong key error, and list of Data scenarios with clear assertions.

55-68: LGTM on DataFrame with correct column test.

Properly validates the primary extraction path.

86-117: LGTM on error handling tests.

Comprehensive coverage of edge cases: missing DoclingDocument column, empty DataFrame, empty list, and None input. The error message assertions validate the user-friendly messaging.

github-actions · 2025-11-26T21:41:38Z

Frontend Unit Test Coverage Report

Coverage Summary

Lines	Statements	Branches	Functions
	15.44% (4246/27500)	8.61% (1811/21013)	9.69% (587/6057)

Unit Test Results

Tests	Skipped	Failures	Errors	Time
1671	0 💤	0 ❌	0 🔥	21.011s ⏱️

codecov · 2025-11-26T21:42:42Z

Codecov Report

❌ Patch coverage is 0% with 29 lines in your changes missing coverage. Please review.
✅ Project coverage is 32.49%. Comparing base (692d659) to head (ef9b881).

Files with missing lines	Patch %	Lines
src/lfx/src/lfx/base/data/docling_utils.py	0.00%	29 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #10746      +/-   ##
==========================================
- Coverage   32.50%   32.49%   -0.01%     
==========================================
  Files        1370     1370              
  Lines       63494    63513      +19     
  Branches     9391     9394       +3     
==========================================
+ Hits        20639    20641       +2     
- Misses      41815    41832      +17     
  Partials     1040     1040

Flag	Coverage Δ
backend	`51.42% <ø> (+0.01%)`	⬆️
frontend	`14.29% <ø> (ø)`
lfx	`39.95% <0.00%> (-0.04%)`	⬇️

Files with missing lines	Coverage Δ
src/lfx/src/lfx/base/data/docling_utils.py	`0.00% <0.00%> (ø)`

... and 3 files with indirect coverage changes

ogabrielluiz · 2025-11-27T14:31:50Z

src/lfx/src/lfx/base/data/docling_utils.py

+                logger.warning(
+                    f"Column '{doc_key}' not found, but found DoclingDocument objects in column '{found_column}'. "
+                    f"Using '{found_column}' instead. Consider updating the 'Doc Key' parameter."
+                )


We need to surface this to the UI

@ogabrielluiz how does it look now?
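
For readers following the thread: one possible way to make a fallback warning available to the UI, instead of only writing it to the log, is to return it alongside the extracted documents so the calling component can decide how to display it. The sketch below is an assumption for illustration, not the change made in this PR; ExtractionResult and extract_with_warning are hypothetical names, a dict of lists stands in for the DataFrame, and the column lookup is assumed to have already settled on found_column.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class ExtractionResult:
    """Documents plus an optional user-facing warning (hypothetical type)."""

    documents: list
    warning: Optional[str] = None


def extract_with_warning(columns: dict[str, list], doc_key: str, found_column: str) -> ExtractionResult:
    """Collect non-null values from `found_column`, attaching a warning if it differs from `doc_key`.

    `columns` maps column name -> list of values and stands in for a DataFrame.
    """
    documents = [value for value in columns[found_column] if value is not None]
    warning = None
    if found_column != doc_key:
        # Same wording as the logger.warning call quoted in this review comment.
        warning = (
            f"Column '{doc_key}' not found, but found DoclingDocument objects in column '{found_column}'. "
            f"Using '{found_column}' instead. Consider updating the 'Doc Key' parameter."
        )
    return ExtractionResult(documents=documents, warning=warning)


# Strings stand in for DoclingDocument objects here.
result = extract_with_warning({"document": ["doc-a", None]}, "doc", "document")
if result.warning:
    print(result.warning)  # a UI component could show this as a status message instead
```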

daniellicnerski1

Quick Test Report - PR #10746

Test Summary

Status: ✅ APPROVED
Tester: @daniellicnerski1

Results

#	Pipeline	Format	Result	Note
1	VLM	Markdown	✅ PASS	Issue #00105659 fixed
2	Standard	HTML	✅ PASS	No regression
3	Standard	Plaintext	✅ PASS	All formats work
4	Empty Input	Markdown	✅ PASS	Error improved

Key Findings

✅ Critical Fix Validated

VLM Pipeline + Export DoclingDocument now works!

Before: ❌ TypeError: Column 'doc' not found
After: ✅ Works with automatic fallback
Impact: Unblocks Verizon POC (5000-page documents)

✅ Enhanced Error Messages

When DataFrame is empty or column missing:

Error: Column 'doc' not found in DataFrame.
Available columns: []
Possible solutions:
1. Use 'Data' output instead of 'DataFrame'
2. Update 'Doc Key' parameter

✅ No Regression

Standard pipeline: ✅ Working
All export formats: ✅ Working
Performance: ✅ No degradation

Recommendation

✅ APPROVED FOR MERGE - All tests passed, critical issue resolved.

Full detailed report available upon request

* fix: Don't fail if doc column is missing
* [autofix.ci] apply automated fixes
* [autofix.ci] apply automated fixes (attempt 2/3)
* Surface warning message to the UI
* [autofix.ci] apply automated fixes
* [autofix.ci] apply automated fixes (attempt 2/3)
* Update test_docling_utils.py
* [autofix.ci] apply automated fixes
* Update test_docling_utils.py

---------

Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>

fix: Don't fail if doc column is missing

e25ef32

erichare requested a review from pushkala-datastax November 26, 2025 21:34

github-actions bot added the bug Something isn't working label Nov 26, 2025

[autofix.ci] apply automated fixes

cdcd96b

github-actions bot added bug Something isn't working and removed bug Something isn't working labels Nov 26, 2025

coderabbitai bot reviewed Nov 26, 2025

[autofix.ci] apply automated fixes (attempt 2/3)

476468d

github-actions bot added bug Something isn't working and removed bug Something isn't working labels Nov 26, 2025

ogabrielluiz reviewed Nov 27, 2025

Surface warning message to the UI

2e79019

erichare requested a review from ogabrielluiz December 2, 2025 21:02