feat: add LangExtract processing component with Gemini-only models #10693
Walkthrough
Adds the LangExtract library as a project dependency and introduces a new LangExtractComponent for performing structured text extraction with grounding and optional HTML visualization. The component integrates with the processing module pipeline and provides configurable inputs for text, examples, model selection, and API credentials.
Sequence Diagram(s)
sequenceDiagram
actor User
participant Component as LangExtractComponent
participant LangExtract as LangExtract Library
participant FileSystem as File System
User->>Component: Provide text, examples, parameters
rect rgb(200, 220, 255)
Note over Component: Text Gathering Phase
Component->>Component: _gather_text()
Component->>Component: Collect from table, input_text, files, documents
end
rect rgb(200, 220, 255)
Note over Component: Example Parsing Phase
Component->>Component: _parse_examples()
Component->>Component: Parse table-based or JSON examples
end
rect rgb(220, 200, 255)
Note over Component: Extraction Phase
Component->>Component: _run_extract()
Component->>LangExtract: Call with text, prompt, examples, model_id, api_key
LangExtract-->>Component: Return extraction result
Component->>Component: Cache result
end
par Output Generation
rect rgb(200, 255, 220)
Note over Component: Structured Output
Component->>Component: build_structured_output()
Component->>Component: _result_to_dict()
Component-->>User: Return Data with structured_output
end
rect rgb(200, 255, 220)
Note over Component: HTML Output (Optional)
Component->>Component: build_html_output()
Component->>FileSystem: Save annotated documents to temp file
Component->>LangExtract: Generate HTML visualization
Component->>FileSystem: Read HTML content
Component-->>User: Return Data with html_output
end
end
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
Pre-merge checks and finishing touches
❌ Failed checks (2 warnings, 2 inconclusive)
✅ Passed checks (3 passed)
Actionable comments posted: 0
🧹 Nitpick comments (2)
src/lfx/src/lfx/components/processing/langextract.py (2)
43-138: Inputs look coherent; consider language consistency and minor option cleanup
The set of inputs covers the expected surfaces (raw text, files, documents, examples, model, API key, HTML toggle) and matches the behavior in the helpers. Two small nits:
- UI texts and error messages are mostly in Portuguese while other components in this package tend to use English; if the rest of the UI is English-first, it may be worth standardizing here (or adding proper i18n later).
- In the model_id options, "gemini-2.5-flash-lite" appears twice; trimming the duplicate keeps the dropdown a bit cleaner.
168-210: Clarify table-based text input handling and error messaging in _gather_text
Two points here:
- The precedence block uses getattr(self, "input_table", None), but this component does not declare an input_table input. Unless input_table is injected programmatically by the framework, this branch will never be hit. If the intent was to support table-driven text input, consider either:
  - Adding a corresponding TableInput (or HandleInput) named input_table, or
  - Removing this branch to avoid dead/unused logic.
- The error message "Forneça 'Input Text' ou conecte 'Documents'." (in English: "Provide 'Input Text' or connect 'Documents'.") omits that text can also come from files (and potentially the table path). Updating the wording to mention all supported sources would reduce confusion when no text is resolved.
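A minimal sketch of the second point, assuming the three sources the review names (direct input, files, documents). The function name gather_text, its parameters, and the ordering are hypothetical stand-ins and not the component's actual _gather_text; the point is only that the error message lists every supported source.

```python
from typing import Iterable, Optional


def gather_text(
    input_text: Optional[str] = None,
    file_texts: Iterable[str] = (),
    document_texts: Iterable[str] = (),
) -> str:
    """Collect non-empty text from every source (hypothetical helper)."""
    parts = []
    if input_text and input_text.strip():
        parts.append(input_text.strip())
    # Assumed order: file contents first, then connected documents.
    for chunk in list(file_texts) + list(document_texts):
        if chunk and chunk.strip():
            parts.append(chunk.strip())
    if not parts:
        # Name every source so the user sees all the ways to supply text.
        raise ValueError(
            "No text to process. Provide 'Input Text', upload 'Files', "
            "or connect 'Documents'."
        )
    return "\n\n".join(parts)
```

If a table input is added later, it would need to be mentioned in the same message.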
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (2)
src/frontend/package-lock.json is excluded by !**/package-lock.json
uv.lock is excluded by !**/*.lock
📒 Files selected for processing (3)
src/lfx/pyproject.toml (1 hunks)
src/lfx/src/lfx/components/processing/__init__.py (3 hunks)
src/lfx/src/lfx/components/processing/langextract.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
src/lfx/src/lfx/components/processing/langextract.py (3)
src/backend/base/langflow/helpers/data.py (1)
safe_convert (166-192)
src/lfx/src/lfx/schema/data.py (1)
Data (26-288)
src/lfx/src/lfx/schema/table.py (1)
EditMode (27-30)
src/lfx/src/lfx/components/processing/__init__.py (1)
src/lfx/src/lfx/components/processing/langextract.py (1)
LangExtractComponent (26-352)
🔇 Additional comments (4)
src/lfx/src/lfx/components/processing/__init__.py (1)
11-11: LangExtractComponent is correctly wired into the lazy import surface
The TYPE_CHECKING import, _dynamic_imports mapping ("LangExtractComponent": "langextract"), and __all__ entry are consistent with the new module lfx.components.processing.langextract and the LangExtractComponent symbol. The lazy import machinery should pick this up cleanly.
Also applies to: 32-32, 48-48
src/lfx/pyproject.toml (1)
20-20: Remove: Version range verification is unnecessary
All APIs used (lx.extract, lx.data.ExampleData, lx.data.Extraction, lx.io.save_annotated_documents, and lx.visualize) are confirmed to exist in the current langextract release. The dependency specification langextract>=1.1.0,<2.0.0 is appropriate. No verification action needed.
src/lfx/src/lfx/components/processing/langextract.py (2)
211-291: Examples parsing has inconsistent semantics; add examples_json UI input or clarify optional/required behavior
The code design has two confirmed issues:
- Missing UI input for examples_json: The _parse_examples() method accesses examples_json via getattr(self, "examples_json", None) (line 233), but this attribute has no corresponding input declared in lines 43–120+. The declared inputs include example_text and examples_table, but not examples_json. This means the JSON fallback path is only reachable if set programmatically outside the UI.
- Mixed optional/required semantics: In _parse_examples(), when examples_json is None, "", or "[]", the method returns None without raising (lines 234–235). Later, if parsed JSON contains no valid examples, it raises ValueError (lines 275–279). Meanwhile, _run_extract() only passes examples to lx.extract() if they're truthy (if examples: kwargs["examples"] = examples). This creates inconsistent behavior: sometimes examples are optional, sometimes required.
Recommended actions:
- If examples_json should be user-configurable, add a MultilineInput for examples_json to the inputs list (mirror the pattern from example_text or similar components).
- Clarify whether examples are optional (update logic and error messages) or required (raise early if no examples provided).
- Align the semantics: either always require examples, or consistently allow them to be optional with fallback behavior.
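One way to align the semantics is to route every "no examples" case through a single check. The sketch below is an assumption about how that could look, not the component's code: parse_examples_json, the required flag, and the "text"/"extractions" keys are hypothetical.

```python
import json
from typing import Any, Optional


def parse_examples_json(
    raw: Optional[str], *, required: bool = False
) -> list[dict[str, Any]]:
    """Parse a JSON list of examples with one rule for the empty case."""
    if raw is None or not raw.strip():
        items: list[Any] = []
    else:
        items = json.loads(raw)
        if not isinstance(items, list):
            raise ValueError("examples_json must be a JSON list")
    # Keep only entries that carry both assumed keys.
    valid = [
        e for e in items
        if isinstance(e, dict) and e.get("text") and e.get("extractions")
    ]
    # None, "", "[]" and "nothing valid" all end up here, so the behavior
    # depends only on the flag and not on which empty form was passed.
    if required and not valid:
        raise ValueError("At least one example with 'text' and 'extractions' is required")
    return valid
```

With required=False the caller always gets a list (possibly empty); with required=True every empty variant raises the same error.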
153-155: Verify component lifecycle and cache invalidation strategy
The review comment identifies a real issue: the _cached_result and _cached_html attributes are never cleared, and caches are not keyed to input signatures. If a LangExtractComponent instance is reused across multiple executions with different inputs (e.g., different input_text, files, prompt_description, or model_id), stale cached outputs will be returned.
Verified findings:
Lines 153–155 define _cached_result and _cached_html at the class level, initialized to None. The _run_extract() method short-circuits with if self._cached_result is not None: return self._cached_result (line 305), and build_html_output() does the same for _cached_html (line 335). Neither method resets these caches or validates that the input parameters (text, examples, model, etc.) match the cached result.
The actual risk depends on your framework's component lifecycle: If LFX creates a fresh component instance per execution, caches are harmless (though unnecessary). If instances are reused across runs, this is a bug.
Recommendations:
- Reset caches at execution start – Clear _cached_result and _cached_html in build_structured_output() and build_html_output() before checking the cache, or use an __init__ hook.
- Key caches on inputs – Compute a hash of effective inputs (text, prompt_description, examples, model_id, api_key) and only return cached values if the hash matches.
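The second recommendation can be sketched with the standard library alone. KeyedCache and its method names are hypothetical, and the example assumes the effective inputs are JSON-serializable (anything else falls back to str); it stores only a digest of the inputs, so a value such as api_key is never kept in the cache key in plain form.

```python
import hashlib
import json
from typing import Any, Callable, Optional


class KeyedCache:
    """Single-slot cache that is only valid for the inputs that produced it."""

    def __init__(self) -> None:
        self._key: Optional[str] = None
        self._value: Any = None

    @staticmethod
    def make_key(**inputs: Any) -> str:
        # sort_keys gives a stable serialization regardless of argument order.
        payload = json.dumps(inputs, sort_keys=True, default=str)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_compute(self, compute: Callable[[], Any], **inputs: Any) -> Any:
        key = self.make_key(**inputs)
        if key != self._key:
            # Inputs changed (or first call): recompute and replace the slot.
            self._value = compute()
            self._key = key
        return self._value
```

Because the cache lives on an instance created in __init__ rather than at class level, it also avoids sharing state between component instances.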
d7705a1 to 800ad46
800ad46 to b3d9ea1
b3d9ea1 to 0955fd2
0955fd2 to 8a5d310
8a5d310 to 51fb904
This PR introduces a new LangExtract processing component that wraps the langextract library to deliver structured
extraction with character-level grounding and optional HTML visualization. The node supports multiple ingestion paths
(plain text input, files/documents, and a table-based example schema) and emits both structured JSON and an HTML highlight
view.
Key details:
- Examples are supplied through a table with columns text, extraction_class, extraction_text (rows sharing the same text are grouped into one ExampleData). Legacy JSON examples are still accepted as a fallback.
enabled for manual overrides.
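The row-grouping rule described above (rows sharing the same text become one example) can be illustrated without the langextract dependency. The dataclasses below are simplified stand-ins for lx.data.ExampleData and lx.data.Extraction, and group_rows is a hypothetical helper, so treat this as a sketch of the idea and not the PR's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Extraction:  # stand-in for lx.data.Extraction
    extraction_class: str
    extraction_text: str


@dataclass
class Example:  # stand-in for lx.data.ExampleData
    text: str
    extractions: list = field(default_factory=list)


def group_rows(rows):
    """Group table rows by their 'text' column, keeping first-seen order."""
    grouped = {}
    for row in rows:
        text = (row.get("text") or "").strip()
        if not text:
            continue  # assumed behavior: skip rows without example text
        example = grouped.setdefault(text, Example(text=text))
        example.extractions.append(
            Extraction(row["extraction_class"], row["extraction_text"])
        )
    return list(grouped.values())
```

For instance, two rows with the same text and different extraction_class values yield a single example holding two extractions.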
Implementation notes: