feat: add LangExtract processing component with Gemini-only models #10693

raphaelchristi · 2025-11-23T15:14:35Z

This PR introduces a new LangExtract processing component that wraps the langextract library to deliver structured
extraction with character-level grounding and optional HTML visualization. The node supports multiple ingestion paths
(plain text input, files/documents, and a table-based example schema) and emits both structured JSON and an HTML highlight
view.

Key details:

- Inputs: Input Text, Files (text/PDF/Office converted), Documents handle, Prompt/Description, and Examples (Table) with columns text, extraction_class, extraction_text (rows sharing the same text are grouped into one ExampleData). Legacy JSON examples are still accepted as a fallback.
- Models: dropdown constrained to recent Gemini SKUs (2.5 Pro/Flash variants and flash-lite/preview IDs), with combobox enabled for manual overrides.
- Outputs: structured_output (Data) and html_output (HTML string) with return_html toggle.
- API key: visible MessageTextInput to simplify provider configuration.
- Dependency: adds langextract to src/lfx/pyproject.toml; component index regenerated to register the node.

Implementation notes:

- Example parsing prioritizes the table path; if absent, falls back to JSON (with json_repair for minor issues).
- Extraction flow caches result/HTML per invocation to avoid duplicate calls.
- Maintains compatibility with existing LangFlow infrastructure and dynamic import registry.
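
The table-to-example grouping mentioned under "Key details" can be sketched roughly as follows. This is a minimal illustration, not the component's actual code: the two dataclasses are simplified stand-ins for lx.data.ExampleData and lx.data.Extraction, and `group_rows` is a hypothetical helper name.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Extraction:
    """Simplified stand-in for lx.data.Extraction (assumption)."""
    extraction_class: str
    extraction_text: str


@dataclass
class ExampleData:
    """Simplified stand-in for lx.data.ExampleData (assumption)."""
    text: str
    extractions: List[Extraction] = field(default_factory=list)


def group_rows(rows: List[Dict[str, str]]) -> List[ExampleData]:
    """Group table rows that share the same `text` into one ExampleData."""
    grouped: Dict[str, ExampleData] = {}
    for row in rows:
        text = (row.get("text") or "").strip()
        if not text:
            continue  # skip rows that carry no example text
        example = grouped.setdefault(text, ExampleData(text=text))
        example.extractions.append(
            Extraction(
                extraction_class=row.get("extraction_class", ""),
                extraction_text=row.get("extraction_text", ""),
            )
        )
    # dicts preserve insertion order, so examples keep first-seen order
    return list(grouped.values())


rows = [
    {"text": "Ana takes 5 mg daily.", "extraction_class": "person", "extraction_text": "Ana"},
    {"text": "Ana takes 5 mg daily.", "extraction_class": "dosage", "extraction_text": "5 mg"},
    {"text": "Bob lives in Lyon.", "extraction_class": "city", "extraction_text": "Lyon"},
]
examples = group_rows(rows)
```

Here the first two rows collapse into a single example with two extractions, and the third row becomes its own example.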

Summary by CodeRabbit

New Features
- Added LangExtract component for structured information extraction from text, files, and documents with optional HTML highlighting of results.


coderabbitai · 2025-11-23T15:14:52Z

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


Walkthrough

Adds the LangExtract library as a project dependency and introduces a new LangExtractComponent for performing structured text extraction with grounding and optional HTML visualization. The component integrates with the processing module pipeline and provides configurable inputs for text, examples, model selection, and API credentials.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Dependency Addition `src/lfx/pyproject.toml` | Added `langextract>=1.1.0,<2.0.0` dependency to project requirements. |
| Component Registration `src/lfx/src/lfx/components/processing/__init__.py` | Registered LangExtractComponent for lazy loading via TYPE_CHECKING import, dynamic import mapping, and `__all__` export list. |
| Component Implementation `src/lfx/src/lfx/components/processing/langextract.py` | Implemented new LangExtractComponent class with text aggregation, example parsing (table-based and JSON-based), structured extraction, and optional HTML visualization. Includes error handling for missing inputs, invalid examples, and missing dependencies. |

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant Component as LangExtractComponent
    participant LangExtract as LangExtract Library
    participant FileSystem as File System

    User->>Component: Provide text, examples, parameters
    
    rect rgb(200, 220, 255)
        Note over Component: Text Gathering Phase
        Component->>Component: _gather_text()
        Component->>Component: Collect from table, input_text, files, documents
    end
    
    rect rgb(200, 220, 255)
        Note over Component: Example Parsing Phase
        Component->>Component: _parse_examples()
        Component->>Component: Parse table-based or JSON examples
    end
    
    rect rgb(220, 200, 255)
        Note over Component: Extraction Phase
        Component->>Component: _run_extract()
        Component->>LangExtract: Call with text, prompt, examples, model_id, api_key
        LangExtract-->>Component: Return extraction result
        Component->>Component: Cache result
    end
    
    par Output Generation
        rect rgb(200, 255, 220)
            Note over Component: Structured Output
            Component->>Component: build_structured_output()
            Component->>Component: _result_to_dict()
            Component-->>User: Return Data with structured_output
        end
        
        rect rgb(200, 255, 220)
            Note over Component: HTML Output (Optional)
            Component->>Component: build_html_output()
            Component->>FileSystem: Save annotated documents to temp file
            Component->>LangExtract: Generate HTML visualization
            Component->>FileSystem: Read HTML content
            Component-->>User: Return Data with html_output
        end
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

New component file complexity: The langextract.py file introduces substantial logic across multiple interconnected methods handling text aggregation, example parsing (with two distinct modes and repair fallback), external library integration, caching, and optional HTML generation.
Error handling patterns: Multiple validation points (missing text, missing examples, missing dependency) with custom error messages require careful review.
External library integration: Integration with LangExtract and json_repair libraries, including lazy imports and graceful error handling for missing dependencies.
File I/O and temporary file operations: HTML output generation involves temporary file creation and cleanup patterns that should be verified for safety.
Data structure conversions: Multiple result serialization approaches (_result_to_dict) need validation across different object types.
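
For the temporary-file point above, one pattern that keeps cleanup automatic is a context-managed temporary directory. The sketch below is only an illustration under assumptions: `render_html_via_tempfile` and `render_fn` are hypothetical names, and `render_fn` stands in for whatever call turns the saved annotated documents into HTML.

```python
import tempfile
from pathlib import Path
from typing import Callable


def render_html_via_tempfile(render_fn: Callable[[Path], str], payload: str) -> str:
    """Write `payload` to a temp file, render HTML from it, then clean up."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        data_path = Path(tmp_dir) / "annotated.jsonl"
        data_path.write_text(payload, encoding="utf-8")
        # render_fn must finish reading before the directory is removed
        html = render_fn(data_path)
    # the directory and its contents are deleted when the `with` block exits
    return html
```

Because the HTML string is produced inside the `with` block, nothing is left on disk even if `render_fn` raises.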

Suggested labels

lgtm

Suggested reviewers

jordanrfrazier
ogabrielluiz
carlosrcoelho

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings, 2 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 14.29% which is insufficient. The required threshold is 80.00%. | You can run `@coderabbitai generate docstrings` to improve docstring coverage. |
| Test Quality And Coverage | ⚠️ Warning | No test files found for LangExtractComponent despite substantial new functionality including text gathering, example parsing, extraction execution, HTML generation, and caching. | Add comprehensive pytest tests covering input/output validation, error scenarios (ValueError, ImportError), example parsing logic (table and JSON modes), HTML generation, result serialization, and component pipeline integration. |
| Test Coverage For New Implementations | ❓ Inconclusive | Repository could not be accessed to verify test file presence, testing conventions, and structure for the new LangExtractComponent. | Ensure repository is accessible and verify test files exist (test_langextract.py), review project testing conventions, and confirm unit/integration tests for key methods are included. |
| Test File Naming And Structure | ❓ Inconclusive | Cannot definitively verify test file inclusion without direct repository access; PR summary lists three files with no test files mentioned, suggesting possible absence. | Provide repository access or confirm: (1) test files included in PR, (2) established testing patterns for codebase components, (3) CI/CD test coverage requirements. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Excessive Mock Usage Warning | ✅ Passed | No test files found in PR; implementation contains no mock usage patterns. |
| Title check | ✅ Passed | The title accurately summarizes the main change: adding a new LangExtract processing component with Gemini-only models, which directly matches the pull request's primary objective and all file changes. |


coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (2)

src/lfx/src/lfx/components/processing/langextract.py (2)

43-138: Inputs look coherent; consider language consistency and minor option cleanup

The inputs set covers the expected surfaces (raw text, files, documents, examples, model, API key, HTML toggle) and matches the behavior in the helpers. Two small nits:

UI texts and error messages are mostly in Portuguese while other components in this package tend to use English; if the rest of the UI is English‑first, it may be worth standardizing here (or adding proper i18n later).

In the model_id options, "gemini-2.5-flash-lite" appears twice; trimming the duplicate keeps the dropdown a bit cleaner.
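
If the option list is built in code, an order-preserving de-duplication guards against repeats like this one. The list below is illustrative only; the actual IDs in the component may differ.

```python
# Illustrative option list (assumption), with one accidental duplicate.
MODEL_OPTIONS = [
    "gemini-2.5-pro",
    "gemini-2.5-flash",
    "gemini-2.5-flash-lite",
    "gemini-2.5-flash-lite",
]

# dict.fromkeys keeps the first occurrence of each key, in original order.
unique_options = list(dict.fromkeys(MODEL_OPTIONS))
```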

168-210: Clarify table-based text input handling and error messaging in _gather_text

Two points here:

The precedence block uses getattr(self, "input_table", None), but this component does not declare an input_table input. Unless input_table is injected programmatically by the framework, this branch will never be hit. If the intent was to support table‑driven text input, consider either:

Adding a corresponding TableInput (or HandleInput) named input_table, or

Removing this branch to avoid dead/unused logic.

The error message "Forneça 'Input Text' ou conecte 'Documents'." omits that text can also come from files (and potentially the table path). Updating the wording to mention all supported sources would reduce confusion when no text is resolved.
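
A possible shape for the simplified helper, with the dead table branch removed and an English error message that names every source. This is a sketch under assumptions, not the component's code: `gather_text` is written as a free function, and the file and document contents are assumed to arrive as plain strings.

```python
from typing import List, Optional


def gather_text(
    input_text: Optional[str],
    file_texts: List[str],
    document_texts: List[str],
) -> str:
    """Concatenate text from all declared sources, skipping blank entries."""
    parts: List[str] = []
    if input_text and input_text.strip():
        parts.append(input_text.strip())
    parts.extend(t.strip() for t in file_texts if t and t.strip())
    parts.extend(t.strip() for t in document_texts if t and t.strip())
    if not parts:
        # mention every supported source so the user knows all the options
        raise ValueError(
            "No text to process. Provide 'Input Text', upload 'Files', "
            "or connect 'Documents'."
        )
    return "\n\n".join(parts)
```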

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6f6c9a7 and d7705a1.

⛔ Files ignored due to path filters (2)

src/frontend/package-lock.json is excluded by !**/package-lock.json
uv.lock is excluded by !**/*.lock

📒 Files selected for processing (3)

src/lfx/pyproject.toml (1 hunks)
src/lfx/src/lfx/components/processing/__init__.py (3 hunks)
src/lfx/src/lfx/components/processing/langextract.py (1 hunks)

🧰 Additional context used

🧬 Code graph analysis (2)

src/lfx/src/lfx/components/processing/langextract.py (3)

src/backend/base/langflow/helpers/data.py (1)

safe_convert (166-192)

src/lfx/src/lfx/schema/data.py (1)

Data (26-288)

src/lfx/src/lfx/schema/table.py (1)

EditMode (27-30)

src/lfx/src/lfx/components/processing/__init__.py (1)

src/lfx/src/lfx/components/processing/langextract.py (1)

LangExtractComponent (26-352)

🔇 Additional comments (4)

src/lfx/src/lfx/components/processing/__init__.py (1)

11-11: LangExtractComponent is correctly wired into the lazy import surface

The TYPE_CHECKING import, _dynamic_imports mapping ("LangExtractComponent": "langextract"), and __all__ entry are consistent with the new module lfx.components.processing.langextract and the LangExtractComponent symbol. The lazy import machinery should pick this up cleanly.

Also applies to: 32-32, 48-48
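
For readers unfamiliar with the pattern, lazy attribute resolution driven by a name-to-module mapping looks roughly like the sketch below. It is a generic illustration and not the actual lfx helper: `lazy_getattr` is a hypothetical function, and the mapping points at the standard-library json package so the example runs on its own.

```python
import importlib
from typing import Any

# Maps exported attribute name -> submodule that defines it (illustrative).
_dynamic_imports = {"JSONDecoder": "decoder"}


def lazy_getattr(package: str, attr_name: str) -> Any:
    """Resolve `attr_name` by importing its submodule only when requested."""
    if attr_name not in _dynamic_imports:
        raise AttributeError(f"module {package!r} has no attribute {attr_name!r}")
    module = importlib.import_module(f"{package}.{_dynamic_imports[attr_name]}")
    return getattr(module, attr_name)


# In a real package this logic would sit in a module-level __getattr__ (PEP 562).
decoder_cls = lazy_getattr("json", "JSONDecoder")
```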

src/lfx/pyproject.toml (1)

20-20: Remove: Version range verification is unnecessary

All APIs used—lx.extract, lx.data.ExampleData, lx.data.Extraction, lx.io.save_annotated_documents, and lx.visualize—are confirmed to exist in the current langextract release. The dependency specification langextract>=1.1.0,<2.0.0 is appropriate. No verification action needed.

src/lfx/src/lfx/components/processing/langextract.py (2)

211-291: Examples parsing has inconsistent semantics; add examples_json UI input or clarify optional/required behavior

The code design has two confirmed issues:

Missing UI input for examples_json: The _parse_examples() method accesses examples_json via getattr(self, "examples_json", None) (line 233), but this attribute has no corresponding input declared in lines 43–120+. The declared inputs include example_text and examples_table, but not examples_json. This means the JSON fallback path is only reachable if set programmatically outside the UI.

Mixed optional/required semantics: In _parse_examples(), when examples_json is None, "", or "[]", the method returns None without raising (lines 234–235). Later, if parsed JSON contains no valid examples, it raises ValueError (lines 275–279). Meanwhile, _run_extract() only passes examples to lx.extract() if they're truthy (if examples: kwargs["examples"] = examples). This creates inconsistent behavior: sometimes examples are optional, sometimes required.

Recommended actions:

If examples_json should be user-configurable, add a MultilineInput for examples_json to the inputs list (mirror the pattern from example_text or similar components).

Clarify whether examples are optional (update logic and error messages) or required (raise early if no examples provided).

Align the semantics: either always require examples, or consistently allow them to be optional with fallback behavior.
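
One way to make the semantics explicit is a single `required` switch that governs both the empty-input case and the no-valid-entries case. The sketch below is an assumption-laden illustration: `parse_examples_json` is a hypothetical helper, it uses the standard json module rather than json_repair, and it treats an entry as valid when it is a dict with a non-empty "text" field.

```python
import json
from typing import Any, Dict, List, Optional


def parse_examples_json(raw: Optional[str], *, required: bool = True) -> List[Dict[str, Any]]:
    """Parse JSON examples with one consistent optional/required policy."""

    def empty() -> List[Dict[str, Any]]:
        # the same policy applies wherever no usable example is found
        if required:
            raise ValueError("At least one example is required.")
        return []

    if raw is None or raw.strip() in ("", "[]"):
        return empty()
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad input
    if not isinstance(data, list):
        raise ValueError("Examples must be a JSON list of objects.")
    valid = [item for item in data if isinstance(item, dict) and item.get("text")]
    return valid if valid else empty()
```

With `required=True` every path without usable examples raises; with `required=False` every such path returns an empty list, so the caller never sees a mix of the two.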

153-155: Verify component lifecycle and cache invalidation strategy

The _cached_result and _cached_html attributes are never cleared, and the caches are not keyed to input signatures. If a LangExtractComponent instance is reused across multiple executions with different inputs (e.g., different input_text, files, prompt_description, or model_id), stale cached outputs will be returned.

Verified findings:

Lines 153–155 define _cached_result and _cached_html at the class level, initialized to None. The _run_extract() method short-circuits with if self._cached_result is not None: return self._cached_result (line 305), and build_html_output() does the same for _cached_html (line 335). Neither method resets these caches or validates that the input parameters (text, examples, model, etc.) match the cached result.

The actual risk depends on your framework's component lifecycle: If LFX creates a fresh component instance per execution, caches are harmless (though unnecessary). If instances are reused across runs, this is a bug.

Recommendations:

Reset caches at execution start – Clear _cached_result and _cached_html in build_structured_output() and build_html_output() before checking the cache, or use an __init__ hook.

Key caches on inputs – Compute a hash of effective inputs (text, prompt_description, examples, model_id, api_key) and only return cached values if the hash matches.
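
The input-keyed option could look something like the following. This is a sketch under assumptions rather than a drop-in patch: `cache_key` and `CachedExtractor` are hypothetical names, `extract_fn` stands in for the call to lx.extract, and the inputs are assumed to be JSON-serializable (anything else falls back to `str`).

```python
import hashlib
import json
from typing import Any, Callable, Optional


def cache_key(text: str, prompt_description: str, examples: Any, model_id: str, api_key: str) -> str:
    """Hash the effective inputs so a change in any of them changes the key."""
    payload = json.dumps(
        {
            "text": text,
            "prompt_description": prompt_description,
            "examples": examples,
            "model_id": model_id,
            "api_key": api_key,
        },
        sort_keys=True,
        default=str,  # fall back to str() for non-JSON-serializable values
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class CachedExtractor:
    """Re-runs the extraction only when the input hash differs from the last run."""

    def __init__(self, extract_fn: Callable[..., Any]) -> None:
        self._extract_fn = extract_fn
        self._key: Optional[str] = None
        self._result: Any = None

    def run(self, **inputs: Any) -> Any:
        key = cache_key(**inputs)
        if key != self._key:
            self._result = self._extract_fn(**inputs)
            self._key = key
        return self._result
```

Repeated calls with identical inputs reuse the stored result, while changing any single input (for example the text) triggers a fresh extraction.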

github-actions bot added the community Pull Request from an external contributor label Nov 23, 2025

raphaelchristi changed the title ~~Add LangExtract processing node with Gemini-only model list and table-based examples~~ feat: add LangExtract processing component with Gemini-only models Nov 23, 2025

coderabbitai bot reviewed Nov 23, 2025


github-actions bot added the enhancement New feature or request label Nov 23, 2025

raphaelchristi force-pushed the feature/langextract-integration branch from d7705a1 to 800ad46 Compare November 23, 2025 15:26