@erichare erichare commented Nov 26, 2025

This pull request improves the robustness and usability of the extract_docling_documents function in docling_utils.py, especially when working with DataFrames that may not have the expected column names. It also adds comprehensive unit tests to ensure correct behavior and helpful error messaging in various scenarios.

Enhancements to DataFrame extraction logic:

  • Added a fallback mechanism to extract_docling_documents: when the exact column name is not found, it searches for a column containing DoclingDocument objects and logs a warning if a fallback column is used.
  • Improved error messages to provide users with actionable suggestions when the expected column is missing, including listing available columns and possible solutions.
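
The two bullets above can be illustrated with a minimal sketch. This is not the code from docling_utils.py: to stay dependency-free it uses a plain dict mapping column names to lists of values as a stand-in for a DataFrame, and a hypothetical FakeDoc class in place of DoclingDocument. The function name find_doc_column is also an assumption, and the error wording only approximates the message quoted later in this conversation.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)


@dataclass
class FakeDoc:
    """Hypothetical stand-in for DoclingDocument, for illustration only."""

    name: str


def find_doc_column(columns: dict[str, list], doc_key: str) -> str:
    """Pick the column to read documents from (illustrative helper).

    Tier 1: exact match on doc_key.
    Tier 2: first column whose first non-null value is a FakeDoc.
    Raises TypeError listing the available columns if both tiers fail.
    """
    # Tier 1: the requested column exists, so use it without any warning.
    if doc_key in columns:
        return doc_key

    # Tier 2: scan the remaining columns, filtering nulls once per column.
    for name, values in columns.items():
        non_null = [v for v in values if v is not None]
        if non_null and isinstance(non_null[0], FakeDoc):
            logger.warning(
                "Column '%s' not found, but found document objects in column '%s'. "
                "Using '%s' instead.",
                doc_key,
                name,
                name,
            )
            return name

    # Neither tier matched: fail with an actionable message.
    msg = (
        f"Column '{doc_key}' not found in DataFrame.\n"
        f"Available columns: {list(columns)}\n"
        "Possible solutions:\n"
        "1. Use 'Data' output instead of 'DataFrame'\n"
        "2. Update 'Doc Key' parameter"
    )
    raise TypeError(msg)
```

With columns {"document": [...], "file_path": [...]} and doc_key "doc", the sketch returns "document" and logs a warning; with no document column at all, it raises the error that lists the available columns.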

Testing improvements:

  • Added a full suite of unit tests for extract_docling_documents, covering extraction from Data, lists of Data, DataFrames with correct and incorrect columns, fallback behavior, and error handling for empty inputs and missing columns.

Summary by CodeRabbit

  • New Features

    • Added intelligent fallback mechanism for document extraction: system now searches available columns when the specified column is not found, with informative guidance.
  • Bug Fixes

    • Improved error messages to provide detailed information about available columns and recommended actions for extraction issues.
  • Tests

    • Comprehensive unit test coverage added covering normal operations, edge cases, and error scenarios for document extraction.

coderabbitai bot commented Nov 26, 2025

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

The extract_docling_documents function now implements a two-tier column matching strategy for DataFrame inputs: exact column name match first, then fallback scan for any column containing DoclingDocument objects, with enhanced error handling and warnings. Comprehensive unit tests covering normal operations and edge cases have been added.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Implementation Enhancement: src/lfx/src/lfx/base/data/docling_utils.py | Updated extract_docling_documents to use two-tier column matching for DataFrames: (1) exact column name match, (2) fallback scan for DoclingDocument columns with warning. Enhanced error messages now include target column details and differentiate between exact-match and fallback failures. Non-DataFrame handling logic unchanged. |
| Test Suite: src/lfx/tests/unit/base/data/test_docling_utils.py | New unit test module covering extract_docling_documents functionality: correct extraction from Data and DataFrames, fallback column resolution, error handling for missing/wrong keys, empty inputs, None values, and missing DoclingDocument columns with detailed error messages. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Verify the two-tier column lookup logic handles all cases correctly and in the intended order
  • Confirm warning is triggered only for fallback scenarios and not for exact matches
  • Validate error messages are comprehensive, user-friendly, and include available columns and suggested solutions
  • Review edge case handling: empty DataFrames, empty data lists, None inputs, and fallback behavior
  • Ensure test coverage adequately validates both success and failure paths

Pre-merge checks and finishing touches

✅ Passed checks (7 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'fix: Don't fail if doc column is missing' directly and accurately summarizes the main change: adding robustness to handle missing column names in extract_docling_documents by implementing a fallback mechanism. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 90.00% which is sufficient. The required threshold is 80.00%. |
| Test Coverage For New Implementations | ✅ Passed | The PR includes 9 comprehensive test methods covering all major functionality changes: exact column matching, fallback column search, error handling, Data/DataFrame extraction, and edge cases with valid assertions and pytest.raises statements. |
| Test Quality And Coverage | ✅ Passed | Test suite provides comprehensive coverage of extract_docling_documents functionality including main paths (Data, list, DataFrame), fallback column discovery, and 9+ error scenarios with detailed error message validation. |
| Test File Naming And Structure | ✅ Passed | Test file follows all required patterns: proper test_*.py naming, pytest structure with descriptive class and method names, comprehensive coverage of 9 test cases including positive scenarios, negative scenarios, and edge cases, logically organized by data type, and complete mapping to implementation code paths. |
| Excessive Mock Usage Warning | ✅ Passed | The test file uses real objects (DoclingDocument, Data, DataFrame) instead of mocks, with no mock libraries imported, verifying actual function behavior and appropriate error handling. |


@github-actions github-actions bot added the bug Something isn't working label Nov 26, 2025

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (2)
src/lfx/src/lfx/base/data/docling_utils.py (1)

44-53: Avoid calling dropna() twice for the same column.

The current code calls dropna() twice on each column during the fallback scan, which is inefficient for large DataFrames.

             for col in data_inputs.columns:
                 try:
                     # Check if this column contains DoclingDocument objects
-                    sample = data_inputs[col].dropna().iloc[0] if len(data_inputs[col].dropna()) > 0 else None
+                    non_null = data_inputs[col].dropna()
+                    sample = non_null.iloc[0] if len(non_null) > 0 else None
                     if sample is not None and isinstance(sample, DoclingDocument):
                         found_column = col
                         break
                 except (IndexError, AttributeError):
                     continue
src/lfx/tests/unit/base/data/test_docling_utils.py (1)

70-84: Consider verifying the warning log is emitted during fallback.

The implementation logs a warning when using a fallback column, but this test doesn't verify the warning was emitted. Consider capturing logs to ensure the warning behavior is tested.

+    def test_extract_from_dataframe_with_fallback_column_logs_warning(self, caplog):
+        """Test that fallback column extraction emits a warning."""
+        import logging
+
+        doc1 = DoclingDocument(name="test_doc1")
+        df = DataFrame([{"document": doc1, "file_path": "test1.pdf"}])
+
+        with caplog.at_level(logging.WARNING):
+            extract_docling_documents(df, "doc")
+
+        assert "Column 'doc' not found" in caplog.text
+        assert "Using 'document' instead" in caplog.text
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0ddfed3 and e25ef32.

📒 Files selected for processing (2)
  • src/lfx/src/lfx/base/data/docling_utils.py (1 hunks)
  • src/lfx/tests/unit/base/data/test_docling_utils.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/{test_*.py,*.test.ts,*.test.tsx}

📄 CodeRabbit inference engine (coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt)

Check that test files follow the project's naming conventions (test_*.py for backend, *.test.ts for frontend)

Files:

  • src/lfx/tests/unit/base/data/test_docling_utils.py
**/test_*.py

📄 CodeRabbit inference engine (coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt)

**/test_*.py: Backend tests should follow pytest structure with proper test_*.py naming
For async functions, ensure proper async testing patterns are used with pytest for backend

Files:

  • src/lfx/tests/unit/base/data/test_docling_utils.py
🧬 Code graph analysis (1)
src/lfx/tests/unit/base/data/test_docling_utils.py (2)
src/lfx/src/lfx/base/data/docling_utils.py (1)
  • extract_docling_documents (28-106)
src/lfx/src/lfx/schema/data.py (1)
  • Data (26-288)
🪛 GitHub Actions: Ruff Style Check
src/lfx/tests/unit/base/data/test_docling_utils.py

[error] 3-3: ruff: Import block is unsorted or un-formatted (I001). Please sort/format imports.

🪛 GitHub Check: Ruff Style Check (3.13)
src/lfx/tests/unit/base/data/test_docling_utils.py

[failure] 71-71: Ruff (E501)
src/lfx/tests/unit/base/data/test_docling_utils.py:71:121: E501 Line too long (121 > 120)


[failure] 3-8: Ruff (I001)
src/lfx/tests/unit/base/data/test_docling_utils.py:3:1: I001 Import block is un-sorted or un-formatted

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Update Component Index
🔇 Additional comments (5)
src/lfx/src/lfx/base/data/docling_utils.py (2)

35-41: LGTM on the primary column lookup!

The exact column name match as the primary path is clear and correctly handles extraction with proper exception chaining.


55-76: LGTM on the fallback behavior and error messaging.

The warning log when using a fallback column provides clear guidance, and the detailed error message with available columns and suggested solutions is user-friendly.

src/lfx/tests/unit/base/data/test_docling_utils.py (3)

14-53: LGTM on Data extraction tests.

Good coverage of single Data, wrong key error, and list of Data scenarios with clear assertions.


55-68: LGTM on DataFrame with correct column test.

Properly validates the primary extraction path.


86-117: LGTM on error handling tests.

Comprehensive coverage of edge cases: missing DoclingDocument column, empty DataFrame, empty list, and None input. The error message assertions validate the user-friendly messaging.


github-actions bot commented Nov 26, 2025

Frontend Unit Test Coverage Report

Coverage Summary

| Lines | Statements | Branches | Functions |
| --- | --- | --- | --- |
| Coverage: 15% | 15.44% (4246/27500) | 8.61% (1811/21013) | 9.69% (587/6057) |

Unit Test Results

| Tests | Skipped | Failures | Errors | Time |
| --- | --- | --- | --- | --- |
| 1671 | 0 💤 | 0 ❌ | 0 🔥 | 21.011s ⏱️ |

codecov bot commented Nov 26, 2025

Codecov Report

❌ Patch coverage is 0% with 29 lines in your changes missing coverage. Please review.
✅ Project coverage is 32.49%. Comparing base (692d659) to head (ef9b881).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/lfx/src/lfx/base/data/docling_utils.py | 0.00% | 29 Missing ⚠️ |
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main   #10746      +/-   ##
==========================================
- Coverage   32.50%   32.49%   -0.01%     
==========================================
  Files        1370     1370              
  Lines       63494    63513      +19     
  Branches     9391     9394       +3     
==========================================
+ Hits        20639    20641       +2     
- Misses      41815    41832      +17     
  Partials     1040     1040              
| Flag | Coverage Δ |
| --- | --- |
| backend | 51.42% <ø> (+0.01%) ⬆️ |
| frontend | 14.29% <ø> (ø) |
| lfx | 39.95% <0.00%> (-0.04%) ⬇️ |

Flags with carried forward coverage won't be shown. Click here to find out more.

| Files with missing lines | Coverage Δ |
| --- | --- |
| src/lfx/src/lfx/base/data/docling_utils.py | 0.00% <0.00%> (ø) |

... and 3 files with indirect coverage changes


Comment on lines 56 to 59
logger.warning(
    f"Column '{doc_key}' not found, but found DoclingDocument objects in column '{found_column}'. "
    f"Using '{found_column}' instead. Consider updating the 'Doc Key' parameter."
)

We need to surface this to the UI


@ogabrielluiz how does it look now?

@erichare erichare requested a review from ogabrielluiz December 2, 2025 21:02

@daniellicnerski1 daniellicnerski1 left a comment

Quick Test Report - PR #10746

Test Summary

Status: ✅ APPROVED
Tester: @daniellicnerski1

Results

| # | Pipeline | Format | Result | Note |
| --- | --- | --- | --- | --- |
| 1 | VLM | Markdown | ✅ PASS | Issue #00105659 fixed |
| 2 | Standard | HTML | ✅ PASS | No regression |
| 3 | Standard | Plaintext | ✅ PASS | All formats work |
| 4 | Empty Input | Markdown | ✅ PASS | Error improved |

Key Findings

✅ Critical Fix Validated

VLM Pipeline + Export DoclingDocument now works!

  • Before: ❌ TypeError: Column 'doc' not found
  • After: ✅ Works with automatic fallback
  • Impact: Unblocks Verizon POC (5000-page documents)

✅ Enhanced Error Messages

When DataFrame is empty or column missing:

Error: Column 'doc' not found in DataFrame.
Available columns: []
Possible solutions:
1. Use 'Data' output instead of 'DataFrame'
2. Update 'Doc Key' parameter

✅ No Regression

  • Standard pipeline: ✅ Working
  • All export formats: ✅ Working
  • Performance: ✅ No degradation

Recommendation

APPROVED FOR MERGE - All tests passed, critical issue resolved.


Full detailed report available upon request

@erichare erichare added this pull request to the merge queue Dec 3, 2025
Merged via the queue into main with commit 1bb048a Dec 3, 2025
22 of 24 checks passed
@erichare erichare deleted the fix-docling-doc-error branch December 3, 2025 22:02
erichare added a commit that referenced this pull request Dec 3, 2025
* fix: Don't fail if doc column is missing

* [autofix.ci] apply automated fixes

* [autofix.ci] apply automated fixes (attempt 2/3)

* Surface warning message to the UI

* [autofix.ci] apply automated fixes

* [autofix.ci] apply automated fixes (attempt 2/3)

* Update test_docling_utils.py

* [autofix.ci] apply automated fixes

* Update test_docling_utils.py

---------

Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>