feat: Add OpenSearch multimodal multi-embedding component #10714
Conversation
Introduces OpenSearchVectorStoreComponentMultimodalMultiEmbedding, supporting multi-model hybrid semantic and keyword search with dynamic vector fields, parallel embedding generation, advanced filtering, and flexible authentication. Enables ingestion and search across multiple embedding models in OpenSearch, with robust index management and UI configuration handling.
Important: Review skipped. Auto incremental reviews are disabled on this repository; this can be changed in the CodeRabbit settings, and the status message itself can also be disabled there.

Walkthrough
This pull request introduces multi-model embedding support by creating a new EmbeddingsWithModels wrapper and a multi-embedding OpenSearch vector store component.

Changes
Sequence Diagram(s)

sequenceDiagram
participant Client as Client
participant EmbMod as EmbeddingModelComponent
participant EmbWM as EmbeddingsWithModels
participant Primary as Primary<br/>Embeddings
participant PerModel as Per-Model<br/>Embeddings
Client->>EmbMod: build_embeddings()
activate EmbMod
EmbMod->>EmbMod: Detect provider (OpenAI/Ollama/IBM)
EmbMod->>Primary: Create primary embeddings instance
EmbMod->>PerModel: Construct per-model instances<br/>(model_1, model_2, ...)
EmbMod->>EmbWM: Create EmbeddingsWithModels<br/>(primary, {model_1, model_2, ...})
deactivate EmbMod
EmbMod-->>Client: Return EmbeddingsWithModels
Note over Client,PerModel: Later usage:
Client->>EmbWM: embed_documents(texts)
activate EmbWM
EmbWM->>Primary: Delegate to primary instance
Primary-->>EmbWM: Return embeddings
deactivate EmbWM
EmbWM-->>Client: Return embeddings list
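For readers who want the shape of the wrapper this diagram describes, here is a minimal sketch of the delegation pattern. The real class lives in src/lfx/src/lfx/base/embeddings/embeddings_class.py and also provides async variants, a __call__ proxy, and __repr__; the body below is an illustrative reconstruction, not the PR's exact code:

from langchain_core.embeddings import Embeddings  # assumed base class; lfx re-exports an Embeddings type


class EmbeddingsWithModels(Embeddings):
    """Wrap a primary embeddings instance plus dedicated per-model instances."""

    def __init__(self, primary: Embeddings, available_models: dict[str, Embeddings] | None = None):
        self.primary = primary
        # Copy so instances never share a mutable default dict
        self.available_models = dict(available_models or {})

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        # Default behavior delegates to the primary instance
        return self.primary.embed_documents(texts)

    def embed_query(self, text: str) -> list[float]:
        return self.primary.embed_query(text)

    def __getattr__(self, name: str):
        # Forward unknown attributes (model, dimensions, ...) to the wrapped instance
        return getattr(self.primary, name)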
sequenceDiagram
participant Client as Client
participant OS as OpenSearchComponent
participant EmbWM as EmbeddingsWithModels
participant EmbN as Embedding N<br/>(Per-Model)
participant OSClient as OpenSearch<br/>Client
Client->>OS: search_documents(query_text, filters)
activate OS
OS->>OS: Detect available embedding models in index
OS->>EmbWM: Generate embeddings for each model
activate EmbWM
loop For each model
EmbWM->>EmbN: embed_query(query_text)
EmbN-->>EmbWM: embedding_vector
end
deactivate EmbWM
OS->>OS: Build per-model KNN queries
OS->>OS: Build keyword query (multi_match)
OS->>OSClient: Execute dis_max combination<br/>(KNN queries + keyword)
OSClient-->>OS: Return ranked results
OS->>OS: Convert results to Data objects
deactivate OS
OS-->>Client: Return search_documents results
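As a rough illustration of the query shape this diagram describes, the sketch below combines per-model KNN clauses with a keyword multi_match clause under dis_max. The chunk_embedding_<model> and text field names are assumptions for illustration; the component's actual query builder also handles filters, boosts, and field detection:

def build_hybrid_query(query_text: str, query_vectors: dict[str, list[float]], k: int = 10) -> dict:
    """Sketch: per-model KNN clauses plus a keyword clause, combined with dis_max."""
    knn_clauses = [
        # One KNN clause per detected model; field naming mirrors get_embedding_field_name()
        {"knn": {f"chunk_embedding_{model}": {"vector": vector, "k": k}}}
        for model, vector in query_vectors.items()
    ]
    keyword_clause = {"multi_match": {"query": query_text, "fields": ["text"]}}
    return {
        "size": k,
        # dis_max scores each hit by its best-matching sub-query instead of summing them
        "query": {"dis_max": {"queries": [*knn_clauses, keyword_clause]}},
    }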
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Possibly related PRs
Suggested labels
Suggested reviewers
Pre-merge checks and finishing touches
❌ Failed checks (2 warnings, 1 inconclusive)
✅ Passed checks (3 passed)
Introduces EmbeddingsWithModels class for wrapping embeddings and available models. Updates EmbeddingModelComponent to provide available model lists for OpenAI, Ollama, and IBM watsonx.ai providers, including synchronous Ollama model fetching using httpx. Updates starter project and component index metadata to reflect new dependencies and code changes.
…low-ai/langflow into opensearch-multi-embedding
Updated the EmbeddingModelComponent to fetch Ollama models asynchronously using await get_ollama_models instead of a synchronous httpx call. Removed httpx from dependencies in Nvidia Remix starter project and updated related metadata. This change improves consistency and reliability when fetching available models for the Ollama provider.
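For context, a minimal sketch of what an async model fetch against Ollama's /api/tags endpoint looks like. The actual helper is lfx's get_ollama_models, which additionally filters by capability and takes configurable JSON keys; the function name and simplified behavior below are illustrative only:

import httpx


async def list_ollama_model_names(base_url: str, timeout: float = 10.0) -> list[str]:
    """Fetch model names from an Ollama server's /api/tags endpoint."""
    url = f"{base_url.rstrip('/')}/api/tags"
    async with httpx.AsyncClient(timeout=timeout) as client:
        response = await client.get(url)
        response.raise_for_status()
        payload = response.json()
    # Ollama returns {"models": [{"name": "...", ...}, ...]}
    return [entry["name"] for entry in payload.get("models", [])]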
Added several Notion-related components to the component index, including AddContentToPage, NotionDatabaseProperties, NotionListPages, NotionPageContent, NotionPageCreator, NotionPageUpdate, and NotionSearch. These components enable interaction with Notion databases and pages, such as querying, updating, creating, and retrieving content.
Actionable comments posted: 8
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (5)
src/lfx/src/lfx/components/models_and_agents/embedding_model.py (1)
372-373: Redundant API calls to fetch IBM models.
fetch_ibm_models is called twice with the same URL. Cache the result to avoid duplicate HTTP requests:

  elif field_value == "IBM watsonx.ai":
-     build_config["model"]["options"] = self.fetch_ibm_models(base_url=self.base_url_ibm_watsonx)
-     build_config["model"]["value"] = self.fetch_ibm_models(base_url=self.base_url_ibm_watsonx)[0]
+     ibm_models = self.fetch_ibm_models(base_url=self.base_url_ibm_watsonx)
+     build_config["model"]["options"] = ibm_models
+     build_config["model"]["value"] = ibm_models[0] if ibm_models else ""

The same issue exists at lines 384-385.
src/backend/base/langflow/initial_setup/starter_projects/Nvidia Remix.json (4)
2170-2210: Bug: ollama_base_url update ignores the new field_value.
In update_build_config, the branch for field_name == "ollama_base_url" assigns ollama_url = self.ollama_base_url, ignoring the freshly provided field_value. This can leave the model list stale until a second refresh.
Apply this diff:
- elif field_name == "ollama_base_url":
-     # # Refresh Ollama models when base URL changes
-     # if hasattr(self, "provider") and self.provider == "Ollama":
-     # Use field_value if provided, otherwise fall back to instance attribute
-     ollama_url = self.ollama_base_url
+ elif field_name == "ollama_base_url":
+     # Use field_value if provided, otherwise fall back to instance attribute
+     ollama_url = field_value or getattr(self, "ollama_base_url", None)
      if await is_valid_ollama_url(url=ollama_url):
          try:
              models = await get_ollama_models(
                  base_url_value=ollama_url,
2339-2355: Default unsafe pickle loading should be False.
Allow Dangerous Deserialization is set to true by default, enabling pickle loading. This is a security risk and should be opt-in.
Apply this diff in the FAISS component template:
- "display_name": "Allow Dangerous Deserialization", + "display_name": "Allow Dangerous Deserialization", "dynamic": false, "info": "Set to True to allow loading pickle files from untrusted sources. Only enable this if you trust the source of the data.", @@ - "type": "bool", - "value": true + "type": "bool", + "value": false
2095-2145: Guard OpenAI dimensions parameter per model to prevent 400 errors on text-embedding-ada-002.
The loop at src/lfx/src/lfx/components/models_and_agents/embedding_model.py:189-205 creates per-model OpenAIEmbeddings instances, passing dimensions to all models. However, text-embedding-ada-002 (included in OPENAI_EMBEDDING_MODEL_NAMES) does not accept the dimensions parameter and will error. Only text-embedding-3-small and text-embedding-3-large support it.

Apply this conditional guard:

+ DIMENSIONAL_MODELS = {"text-embedding-3-small", "text-embedding-3-large"}
  for model_name in OPENAI_EMBEDDING_MODEL_NAMES:
      available_models_dict[model_name] = OpenAIEmbeddings(
          model=model_name,
-         dimensions=dimensions or None,  # Use same dimensions config for all
+         dimensions=(dimensions if model_name in DIMENSIONAL_MODELS else None),
          base_url=api_base or None,
          api_key=api_key,
          chunk_size=chunk_size,
2018-2252: Based on my comprehensive verification of the codebase, I can now provide the corrected review comment. Here are my key findings:

Key Verification Results:
- ✅ EmbeddingsWithModels IS compatible with FAISS - The class properly implements embed_documents(), embed_query(), async variants, __getattr__() delegation, and __call__() proxy. FAISS will work without modification.
- ✅ OpenAI dimensions issue IS valid - Only text-embedding-3-small and text-embedding-3-large support the dimensions parameter; text-embedding-ada-002 does not. Current code applies dimensions uniformly to all models.
- ✅ allow_dangerous_deserialization defaults to True - Confirmed security issue in FAISS component.
- ✅ Ollama field_value not used - Code line 390 uses self.ollama_base_url instead of the field_value parameter passed to the method.
- ❌ No httpx issue - Code uses requests.get with timeout=10, not httpx.
Guard OpenAI embedding models against unsupported dimension parameter.
The code applies the dimensions parameter uniformly to all OpenAI models, but text-embedding-ada-002 does not support this parameter and will raise an error. Only text-embedding-3-small and text-embedding-3-large support dimensions.

In the build_embeddings method's OpenAI provider block, guard dimensions per model:

  for model_name in OPENAI_EMBEDDING_MODEL_NAMES:
      available_models_dict[model_name] = OpenAIEmbeddings(
          model=model_name,
-         dimensions=dimensions or None,  # Use same dimensions config for all
+         dimensions=dimensions or None if model_name != "text-embedding-ada-002" else None,
          base_url=api_base or None,
          api_key=api_key,
          chunk_size=chunk_size,
          max_retries=max_retries,
          timeout=request_timeout or None,
          show_progress_bar=show_progress_bar,
          model_kwargs=model_kwargs,
      )

Set FAISS allow_dangerous_deserialization default to False.
The FAISS component currently defaults allow_dangerous_deserialization to True, which enables loading untrusted pickle files and poses a security risk. Change the default value in the component definition to False.

Fix Ollama URL refresh to use the field_value parameter.
In update_build_config, the ollama_base_url field handler ignores the field_value parameter and uses self.ollama_base_url instead. The comment indicates intent to use field_value. Update line 390 to use the passed parameter for consistency with base_url_ibm_watsonx handling:

  elif field_name == "ollama_base_url":
-     ollama_url = self.ollama_base_url
+     ollama_url = field_value or self.ollama_base_url
🧹 Nitpick comments (7)
src/lfx/src/lfx/components/models_and_agents/embedding_model.py (1)
239-255: URL inconsistency and missing error handling for model fetch.
URL inconsistency: get_ollama_models is called with self.ollama_base_url (raw input) while embedding instances use final_base_url (transformed). Although get_ollama_models transforms internally, this could cause subtle issues if the transformation logic diverges.

No fallback on failure: If get_ollama_models fails, the entire build_embeddings method fails. Consider falling back to an empty available_models dict or using the user-selected model as the only entry:

  # Fetch available Ollama models
- available_model_names = await get_ollama_models(
-     base_url_value=self.ollama_base_url,
+ try:
+     available_model_names = await get_ollama_models(
+         base_url_value=final_base_url,
-     desired_capability=DESIRED_CAPABILITY,
-     json_models_key=JSON_MODELS_KEY,
-     json_name_key=JSON_NAME_KEY,
-     json_capabilities_key=JSON_CAPABILITIES_KEY,
- )
+         desired_capability=DESIRED_CAPABILITY,
+         json_models_key=JSON_MODELS_KEY,
+         json_name_key=JSON_NAME_KEY,
+         json_capabilities_key=JSON_CAPABILITIES_KEY,
+     )
+ except ValueError:
+     logger.warning("Failed to fetch Ollama models, using selected model only")
+     available_model_names = [model] if model else []

src/backend/base/langflow/initial_setup/starter_projects/Nvidia Remix.json (2)
2065-2145: Avoid eager instantiation of N embedding clients; create on demand.
Creating an instance for every model on each build is wasteful and can slow UI updates. Prefer a lazy factory (dict of callables) or instantiate only when requested by the consumer.
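A minimal sketch of the suggested lazy-factory pattern, assuming OpenAI as the provider; the component's real parameter handling (dimensions, base_url, retries, and so on) is omitted here:

from collections.abc import Callable

from langchain_openai import OpenAIEmbeddings


def build_embedding_factories(model_names: list[str], api_key: str) -> dict[str, Callable[[], OpenAIEmbeddings]]:
    """Register one factory per model; a client is created only when a consumer calls its factory."""
    # Bind each model name as a default argument so every lambda captures its own value
    return {name: (lambda n=name: OpenAIEmbeddings(model=n, api_key=api_key)) for name in model_names}


# Consumers instantiate (and can cache) only the models they actually use:
factories = build_embedding_factories(["text-embedding-3-small", "text-embedding-3-large"], api_key="sk-...")
small_embeddings = factories["text-embedding-3-small"]()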
1759-1810: Add timeouts and error handling to documentation fetcher.
RemixDocumentation._fetch_all_documentation uses httpx.get without a timeout and minimal error handling. Add a short timeout and catch network errors to avoid hanging the flow.
Apply this diff inside the component code block:
- response = httpx.get(search_index_url, follow_redirects=True)
+ try:
+     response = httpx.get(search_index_url, follow_redirects=True, timeout=10.0)
+ except httpx.HTTPError as e:
+     raise ValueError(f"Failed to fetch search index: {e!s}") from e

src/lfx/src/lfx/components/elastic/opensearch_multimodal.py (4)
52-53: Remove or downgrade noisy logging in helper function.
logger.info is called every time get_embedding_field_name is invoked, which happens frequently during search operations with multiple models. This will clutter logs in production.

  def get_embedding_field_name(model_name: str) -> str:
-     logger.info(f"chunk_embedding_{normalize_model_name(model_name)}")
+     # logger.debug(f"chunk_embedding_{normalize_model_name(model_name)}")
      return f"chunk_embedding_{normalize_model_name(model_name)}"
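The normalization helper itself is not shown in this hunk. A plausible sketch, assuming normalize_model_name lowercases the model name and maps non-alphanumeric characters to underscores so per-model field names remain valid OpenSearch field names:

import re


def normalize_model_name(model_name: str) -> str:
    # Assumed behavior for illustration: "text-embedding-3-small" -> "text_embedding_3_small"
    return re.sub(r"[^a-z0-9]+", "_", model_name.lower()).strip("_")


def get_embedding_field_name(model_name: str) -> str:
    return f"chunk_embedding_{normalize_model_name(model_name)}"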
593-594: Consider handling bulk ingestion errors.
The helpers.bulk call doesn't have explicit error handling. If some documents fail to index, the method will still return all IDs as if successful. Consider using raise_on_error=True (default) and handling partial failures.

- helpers.bulk(client, requests, max_chunk_bytes=max_chunk_bytes)
+ success, failed = helpers.bulk(
+     client, requests, max_chunk_bytes=max_chunk_bytes, stats_only=False
+ )
+ if failed:
+     logger.warning(f"Failed to index {len(failed)} documents: {failed[:3]}")
  return return_ids
646-646: Downgrade embedding debug log.
logger.warning is used for a debug log that shows the embedding object. This should be logger.debug or removed.

- logger.warning(f"Embedding: {self.embedding}")
+ logger.debug(f"Embedding: {self.embedding}")
1034-1034: Fix mutable default argument type hint.
The parameter filter_clauses: list[dict] = None should use a | None type hint for clarity.

- def _detect_available_models(self, client: OpenSearch, filter_clauses: list[dict] = None) -> list[str]:
+ def _detect_available_models(self, client: OpenSearch, filter_clauses: list[dict] | None = None) -> list[str]:
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- src/backend/base/langflow/initial_setup/starter_projects/Nvidia Remix.json (2 hunks)
- src/lfx/src/lfx/base/embeddings/embeddings_class.py (1 hunks)
- src/lfx/src/lfx/components/elastic/opensearch_multimodal.py (1 hunks)
- src/lfx/src/lfx/components/models_and_agents/embedding_model.py (7 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
src/lfx/src/lfx/components/models_and_agents/embedding_model.py (2)
src/lfx/src/lfx/base/embeddings/embeddings_class.py (1)
EmbeddingsWithModels (6-116)
src/lfx/src/lfx/base/models/model_utils.py (1)
get_ollama_models (39-108)
src/lfx/src/lfx/base/embeddings/embeddings_class.py (2)
src/lfx/src/lfx/field_typing/constants.py (1)
Embeddings (49-50)
src/lfx/src/lfx/base/tools/flow_tool.py (1)
args (32-34)
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py (3)
src/lfx/src/lfx/inputs/inputs.py (4)
BoolInput (419-432)
HandleInput (75-86)
IntInput (347-380)
StrInput (127-183)
src/lfx/src/lfx/schema/data.py (1)
Data (26-288)
src/lfx/src/lfx/base/embeddings/embeddings_class.py (2)
embed_documents (36-45)
embed_query (47-56)
🔇 Additional comments (10)
src/lfx/src/lfx/components/models_and_agents/embedding_model.py (1)
7-7: LGTM! Import correctly added for the new EmbeddingsWithModels wrapper class.

src/backend/base/langflow/initial_setup/starter_projects/Nvidia Remix.json (2)
1856-1856: Code hash change acknowledged.
No action needed here; just confirming this corresponds to the EmbeddingModelComponent refactor.
2025-2050: fetch_ibm_models function does not exist in the langflow codebase; IBM model fetching is implemented via the watsonx.ai bundle, not a standalone function.
The review comment references a function that cannot be found in the repository. IBM integration in langflow uses a bundle-based architecture (watsonx.ai bundle) that handles dynamic model fetching, rather than a fetch_ibm_models function called by update_build_config. The suggestion about caching and request failure handling may be conceptually valid, but it is directed at the wrong implementation target.

The JSON configuration file snippet shown (lines 2025-2050) contains parameter definitions unrelated to model fetching logic, further confirming a mismatch between the review location and the actual concern.
Likely an incorrect or invalid review comment.
src/lfx/src/lfx/base/embeddings/embeddings_class.py (3)
6-34: LGTM! The wrapper class is well-designed with proper inheritance from Embeddings, clear docstrings, and correct handling of the mutable default argument for available_models.

36-78: LGTM! The embedding methods correctly delegate to the underlying embeddings instance with proper type annotations.

80-116: LGTM! The __call__ method properly checks callability before delegation, __getattr__ correctly forwards unknown attributes to the wrapped instance, and __repr__ provides useful debug information.

src/lfx/src/lfx/components/elastic/opensearch_multimodal.py (4)
116-328: LGTM! The input definitions are comprehensive and well-documented. The is_list=True on the embedding input correctly enables multi-model support.
330-392: LGTM! The model name resolution logic correctly handles multiple embedding providers with a clear priority order. The fallback chain through deployment → model → model_id → model_name ensures compatibility across different providers.
855-882: LGTM! The retry logic with exponential backoff (1s → 2s → 4s, capped at 8s) for embedding generation is well-implemented. The ThreadPoolExecutor usage with bounded workers (max 8) prevents resource exhaustion.
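As a sketch of this pattern (the component's exact signatures and error types differ), per-model query embedding with bounded parallelism and capped exponential backoff might look like:

import time
from concurrent.futures import ThreadPoolExecutor


def embed_with_retry(embeddings, text: str, max_attempts: int = 4) -> list[float]:
    """Retry embed_query with exponential backoff: 1s, 2s, 4s, capped at 8s."""
    delay = 1.0
    for attempt in range(max_attempts):
        try:
            return embeddings.embed_query(text)
        except Exception:  # noqa: BLE001 - a sketch; the real code catches narrower errors
            if attempt == max_attempts - 1:
                raise
            time.sleep(delay)
            delay = min(delay * 2, 8.0)
    raise RuntimeError("unreachable")


def embed_query_per_model(models: dict, query_text: str) -> dict[str, list[float]]:
    """Generate the query embedding for each model in parallel with bounded workers."""
    with ThreadPoolExecutor(max_workers=min(8, max(1, len(models)))) as pool:
        futures = {name: pool.submit(embed_with_retry, emb, query_text) for name, emb in models.items()}
        return {name: future.result() for name, future in futures.items()}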
1547-1590: LGTM! The dynamic UI configuration correctly toggles visibility and required status for authentication fields based on the selected auth mode.
| "dimension": dim, | ||
| "method": { | ||
| "name": "disk_ann", | ||
| "space_type": space_type, | ||
| "engine": engine, | ||
| "parameters": {"ef_construction": ef_construction, "m": m}, | ||
| }, | ||
| }, |
🧩 Analysis chain
Script executed:
#!/bin/bash
# First, let's examine the file structure and understand the context around lines 428-435
cat -n src/lfx/src/lfx/components/elastic/opensearch_multimodal.py | sed -n '400,450p'
<shell_result>
Repository: langflow-ai/langflow
Exit code: 0
stdout:
400 ef_construction: int = 100,
401 m: int = 16,
402 vector_field: str = "vector_field",
403 ) -> dict[str, Any]:
404 """Create the default OpenSearch index mapping for vector search.
405
406 This method generates the index configuration with k-NN settings optimized
407 for approximate nearest neighbor search using the specified vector engine.
408 Includes the embedding_model keyword field for tracking which model was used.
409
410 Args:
411 dim: Dimensionality of the vector embeddings
412 engine: Vector search engine (jvector, nmslib, faiss, lucene)
413 space_type: Distance metric for similarity calculation
414 ef_search: Size of dynamic list used during search
415 ef_construction: Size of dynamic list used during index construction
416 m: Number of bidirectional links for each vector
417 vector_field: Name of the field storing vector embeddings
418
419 Returns:
420 Dictionary containing OpenSearch index mapping configuration
421 """
422 return {
423 "settings": {"index": {"knn": True, "knn.algo_param.ef_search": ef_search}},
424 "mappings": {
425 "properties": {
426 vector_field: {
427 "type": "knn_vector",
428 "dimension": dim,
429 "method": {
430 "name": "disk_ann",
431 "space_type": space_type,
432 "engine": engine,
433 "parameters": {"ef_construction": ef_construction, "m": m},
434 },
435 },
436 "embedding_model": {"type": "keyword"}, # Track which model was used
437 "embedding_dimensions": {"type": "integer"},
438 }
439 },
440 }
441
442 def _ensure_embedding_field_mapping(
443 self,
444 client: OpenSearch,
445 index_name: str,
446 field_name: str,
447 dim: int,
448 engine: str,
449 space_type: str,
450 ef_construction: int,
</shell_result>
Script executed:
#!/bin/bash
# Search for "disk_ann" and "hnsw" occurrences in the file
echo "=== Searching for disk_ann occurrences ==="
rg -n "disk_ann" src/lfx/src/lfx/components/elastic/opensearch_multimodal.py
echo -e "\n=== Searching for hnsw occurrences ==="
rg -n "hnsw" src/lfx/src/lfx/components/elastic/opensearch_multimodal.py
echo -e "\n=== Searching for engine parameter usage ==="
rg -n "engine" src/lfx/src/lfx/components/elastic/opensearch_multimodal.py | head -30Repository: langflow-ai/langflow
<shell_result>
Repository: langflow-ai/langflow
Exit code: 0
stdout:
=== Searching for disk_ann occurrences ===
430: "name": "disk_ann",
475: "name": "disk_ann",
=== Searching for hnsw occurrences ===
=== Searching for engine parameter usage ===
74: - Vector storage with configurable engines (jvector, nmslib, faiss, lucene)
108: "engine",
160: name="engine",
165: "Vector search engine for similarity calculations. 'jvector' is recommended for most use cases. "
397: engine: str = "jvector",
407: for approximate nearest neighbor search using the specified vector engine.
412: engine: Vector search engine (jvector, nmslib, faiss, lucene)
432: "engine": engine,
448: engine: str,
463: engine: Vector search engine
477: "engine": engine,
498: def _validate_aoss_with_engines(self, *, is_aoss: bool, engine: str) -> None:
499: """Validate engine compatibility with Amazon OpenSearch Serverless (AOSS).
501: Amazon OpenSearch Serverless has restrictions on which vector engines
502: can be used. This method ensures the selected engine is compatible.
506: engine: The selected vector search engine
509: ValueError: If AOSS is used with an incompatible engine
511: if is_aoss and engine not in {"nmslib", "faiss"}:
512: msg = "Amazon OpenSearch Service Serverless only supports `nmslib` or `faiss` engines"
901: # Validate engine with AOSS
902: engine = getattr(self, "engine", "jvector")
903: self._validate_aoss_with_engines(is_aoss=is_aoss, engine=engine)
912: engine=engine,
934: engine=engine,
</shell_result>
OpenSearch vector search methods disk_ann hnsw nmslib jvector faiss lucene engine compatibility
<web_search_result>
Summary of OpenSearch k-NN methods & engine compatibility (concise):
- Engines supported in OpenSearch: Lucene (native), Faiss, NMSLIB (deprecated). [1][2].
- HNSW: implemented across engines (Lucene, Faiss, NMSLIB). Use HNSW for most workloads; Lucene HNSW supports l2/cosinesimil/innerproduct and ignores ef_search (uses k). [1][2].
- Faiss: supports HNSW and IVF (IVF requires training). Faiss is used for on-disk/large-scale and supports product quantization/compression options. [1][4].
- NMSLIB: legacy HNSW engine (supported spaces: l2, innerproduct, cosinesimil, l1, linf) but deprecated—prefer Faiss or Lucene. [1].
- Lucene engine: native HNSW, best for small-to-mid indexes and efficient filtering (can apply filters during search). [1][5].
- on_disk / disk-based ANN modes: OpenSearch “on_disk” (low-cost) mode uses Faiss by default; there are also disk-based ANNs in the ecosystem (DiskANN/DISK_ANN in other vector DBs) and emerging disk_ann usage in OpenSearch examples (e.g., disk_ann method with jvector engine in recent posts/features). Check your OpenSearch version for exact support. [2][3][6].
- jvector / disk_ann: OpenSearch examples and recent posts show a "jvector" engine and a "disk_ann" method used for concurrent / disk-optimized graph builds — this is an emergent/experimental option; verify availability in your OpenSearch release. [3].
- AWS OpenSearch Service (managed) notes: serverless/vector-search collections currently restrict engines/features (e.g., some modes/engines like Lucene may not be supported in vector collections; AWS may restrict Faiss features such as IVF). Always consult the AWS docs for service-specific limitations. [2].
If you want, tell me your OpenSearch version (or AWS OpenSearch service type) and I’ll list exactly which methods/engine combinations and parameters are available and show example mappings.
Sources: OpenSearch docs & blog posts and AWS OpenSearch docs.
[1] OpenSearch docs — Methods and engines.
[2] OpenSearch docs — k-NN vector / modes (in_memory, on_disk).
[3] OpenSearch blog / changelog examples showing disk_ann + jvector.
[4] OpenSearch blog “Lucene-on-Faiss” (Faiss+Lucene integration / on-disk).
[5] OpenSearch vector search page (comparison table).
</web_search_result>
Script executed:
#!/bin/bash
# Search for test files or examples that show engine-method combinations
echo "=== Looking for test files ==="
find src/lfx -name "*test*" -o -name "*opensearch*" | head -20
echo -e "\n=== Searching for requirements/version info ==="
find . -name "requirements*.txt" -o -name "pyproject.toml" -o -name "setup.py" | xargs grep -l "opensearch" 2>/dev/null
echo -e "\n=== Looking for any method selection logic ==="
rg -n "method.*name" src/lfx/src/lfx/components/elastic/opensearch_multimodal.py -A 2 -B 2Repository: langflow-ai/langflow
<shell_result>
Repository: langflow-ai/langflow
Exit code: 0
stdout:
=== Looking for test files ===
src/lfx/src/lfx/components/elastic/opensearch.py
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py
src/lfx/tests
src/lfx/tests/conftest.py
src/lfx/tests/data/env_variable_test.json
src/lfx/tests/unit/base/data/test_base_file.py
src/lfx/tests/unit/base/data/test_storage_utils.py
src/lfx/tests/unit/cli/test_common.py
src/lfx/tests/unit/cli/test_run_command.py
src/lfx/tests/unit/cli/test_run_real_flows.py
src/lfx/tests/unit/cli/test_run_starter_projects.py
src/lfx/tests/unit/cli/test_run_starter_projects_backward_compatibility.py
src/lfx/tests/unit/cli/test_script_loader.py
src/lfx/tests/unit/cli/test_serve.py
src/lfx/tests/unit/cli/test_serve_app.py
src/lfx/tests/unit/cli/test_serve_app_streaming.py
src/lfx/tests/unit/cli/test_serve_components.py
src/lfx/tests/unit/cli/test_serve_simple.py
src/lfx/tests/unit/cli/test_validation.py
src/lfx/tests/unit/components/langchain_utilities/test_csv_agent.py
=== Searching for requirements/version info ===
./pyproject.toml
=== Looking for any method selection logic ===
265- options=["basic", "jwt"],
266- info=(
267: "Authentication method: 'basic' for username/password authentication, "
268- "or 'jwt' for JSON Web Token (Bearer) authentication."
269- ),
</shell_result>
Script executed:
#!/bin/bash
# Search for method configuration patterns and how engines are actually used
echo "=== Full method definitions to see parameter patterns ==="
rg -n "\"method\":" src/lfx/src/lfx/components/elastic/opensearch_multimodal.py -A 5 -B 2
echo -e "\n=== Check if there are any comments about method names ==="
rg -n "method|disk_ann|hnsw" src/lfx/src/lfx/components/elastic/opensearch_multimodal.py | grep -E "(#|method|disk_ann|hnsw)"Repository: langflow-ai/langflow
<shell_result>
Repository: langflow-ai/langflow
Exit code: 0
stdout:
=== Full method definitions to see parameter patterns ===
427- "type": "knn_vector",
428- "dimension": dim,
429: "method": {
430- "name": "disk_ann",
431- "space_type": space_type,
432- "engine": engine,
433- "parameters": {"ef_construction": ef_construction, "m": m},
434- },
--
472- "type": "knn_vector",
473- "dimension": dim,
474: "method": {
475- "name": "disk_ann",
476- "space_type": space_type,
477- "engine": engine,
478- "parameters": {"ef_construction": ef_construction, "m": m},
479- },
=== Check if there are any comments about method names ===
267: "Authentication method: 'basic' for username/password authentication, "
406: This method generates the index configuration with k-NN settings optimized
429: "method": {
430: "name": "disk_ann",
474: "method": {
475: "name": "disk_ann",
502: can be used. This method ensures the selected engine is compatible.
544: This method uses bulk operations to insert documents with their vector
654: This method handles the complete document ingestion pipeline:
663: # Convert DataFrame to Data if needed using parent's method
969: This method accepts two filter formats and converts them to standardized
1137: This method executes a sophisticated search that:
1518: This is the main interface method that performs the multi-model search using the
1539: This method handles real-time UI updates, particularly for authentication
</shell_result>
OpenSearch disk_ann method supported engines which engines work
<web_search_result>
Short answer: the disk_ann method in OpenSearch is implemented to use the jvector engine. See the OpenSearch blog example showing method.name = "disk_ann" with engine = "jvector". [1]
Context: OpenSearch's k-NN plugin generally supports engines faiss, lucene and (deprecated) nmslib for other methods (hnsw/ivf), but disk_ann specifically is shown using jvector in the OpenSearch blog announcing concurrent graph construction. [2][1]
Sources:
- OpenSearch blog: "Concurrent vector graph construction" (disk_ann example using engine: "jvector"). [1]
- OpenSearch k‑NN methods & engines documentation (faiss, lucene, nmslib overview). [2]
</web_search_result>
Hardcoded disk_ann method is incompatible with multiple supported engines.
The method name "disk_ann" is hardcoded at lines 430 and 475, but OpenSearch's disk_ann method only works with the jvector engine. Your component supports four engines (jvector, nmslib, faiss, lucene), but when users select nmslib, faiss, or lucene, the index mapping will fail because those engines don't support the disk_ann method—they require hnsw or other engine-specific methods instead.
The method name must be conditional based on the selected engine parameter.
🤖 Prompt for AI Agents
In src/lfx/src/lfx/components/elastic/opensearch_multimodal.py around lines
428-435 (and similarly at ~475) the vector index mapping hardcodes the method
name "disk_ann", which only works with the jvector engine; this will fail for
nmslib, faiss and lucene. Change the code to pick the method name conditionally
based on the selected engine (e.g., use "disk_ann" only when engine ==
"jvector", otherwise use the engine-appropriate method such as "hnsw" for
nmslib/faiss/lucene or another supported method), and ensure any engine-specific
parameters (ef_construction, m, etc.) are set/renamed to match the chosen
method; update both locations so the mapping construction uses the conditional
method string and appropriate parameter set for that engine.
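A minimal sketch of the conditional selection this asks for, assuming "hnsw" is the appropriate method for the non-jvector engines and that the same ef_construction/m parameters apply; engine-specific parameter differences would still need verification against the target OpenSearch version:

def knn_method_for_engine(engine: str, space_type: str, ef_construction: int, m: int) -> dict:
    """Choose the k-NN method block by engine: disk_ann only for jvector, hnsw otherwise."""
    method_name = "disk_ann" if engine == "jvector" else "hnsw"
    return {
        "name": method_name,
        "space_type": space_type,
        "engine": engine,
        "parameters": {"ef_construction": ef_construction, "m": m},
    }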
available_models = self._detect_available_models(client, filter_clauses)

if not available_models:
    logger.warning("No embedding models found in index, using current model")
    available_models = [self._get_embedding_model_name()]
Critical: Variable shadowing causes logic error.
On line 1209, available_models = getattr(emb_obj, "available_models", None) shadows the available_models list from line 1182 that contains the detected models in the index. After the first loop iteration, the outer available_models is overwritten, causing the for model_name in available_models: loop on line 1282 to iterate over the wrong data.
Rename the inner variable to avoid shadowing:
model_name = getattr(emb_obj, "model_name", None)
- available_models = getattr(emb_obj, "available_models", None)
+ emb_available_models = getattr(emb_obj, "available_models", None)
logger.info(
f"Embedding object {idx}: deployment={deployment}, model={model}, "
f"model_id={model_id}, model_name={model_name}, dimensions={dimensions}, "
- f"available_models={available_models}"
+ f"available_models={emb_available_models}"
)
# If this embedding has available_models dict, map all models to their dedicated instances
- if available_models and isinstance(available_models, dict):
+ if emb_available_models and isinstance(emb_available_models, dict):
logger.info(
- f"Embedding object {idx} provides {len(available_models)} models via available_models dict"
+ f"Embedding object {idx} provides {len(emb_available_models)} models via available_models dict"
)
- for model_name_key, dedicated_embedding in available_models.items():
- for model_name_key, dedicated_embedding in available_models.items():
+ for model_name_key, dedicated_embedding in emb_available_models.items():

Apply similar changes to all subsequent usages within the loop (lines 1218-1238).
Also applies to: 1209-1209
🤖 Prompt for AI Agents
In src/lfx/src/lfx/components/elastic/opensearch_multimodal.py around lines
1182-1282, the local variable available_models set at line 1182 is being
shadowed by a second assignment at line 1209 (available_models =
getattr(emb_obj, "available_models", None)), which breaks the outer loop later
(line 1282) — rename the inner variable (for example emb_available_models) and
update all its subsequent uses within that loop (lines ~1218-1238 and any other
occurrences in the same block) so the outer available_models list remains
untouched.
# Check if this is a dedicated instance from available_models dict
if emb_available_models and isinstance(emb_available_models, dict):
    logger.info(
        f"Model '{model_name}' using dedicated instance from available_models dict "
        f"(pre-configured with correct model and dimensions)"
    )
Continue fix for variable shadowing.
These lines also reference emb_available_models (after renaming) and need the same fix applied.
# Check if this is a dedicated instance from available_models dict
- if emb_available_models and isinstance(emb_available_models, dict):
+ emb_avail_models = getattr(emb_obj, "available_models", None)
+ if emb_avail_models and isinstance(emb_avail_models, dict):
logger.info(
f"Model '{model_name}' using dedicated instance from available_models dict "
f"(pre-configured with correct model and dimensions)"
)Note: This is part of the same variable shadowing issue flagged earlier.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
# Check if this is a dedicated instance from available_models dict
if emb_available_models and isinstance(emb_available_models, dict):
    logger.info(
        f"Model '{model_name}' using dedicated instance from available_models dict "
        f"(pre-configured with correct model and dimensions)"
    )

# Check if this is a dedicated instance from available_models dict
emb_avail_models = getattr(emb_obj, "available_models", None)
if emb_avail_models and isinstance(emb_avail_models, dict):
    logger.info(
        f"Model '{model_name}' using dedicated instance from available_models dict "
        f"(pre-configured with correct model and dimensions)"
    )
🤖 Prompt for AI Agents
In src/lfx/src/lfx/components/elastic/opensearch_multimodal.py around lines 1300
to 1305, the code still references the old name `emb_available_models`; replace
this reference with the new variable name you introduced earlier (the one used
elsewhere in the file to avoid shadowing), and keep the same isinstance(dict)
check and logging text; ensure the variable used matches the prior rename so
there are no shadowed/undefined names at runtime.
try:
    raw = self.search(self.search_query or "")
    return [Data(text=hit["page_content"], **hit["metadata"]) for hit in raw]
    self.log(self.ingest_data)
except Exception as e:
    self.log(f"search_documents error: {e}")
    raise
Unreachable code after return statement.
Line 1542 (self.log(self.ingest_data)) is placed after the return statement on line 1541, making it unreachable. This is likely a debugging statement that should be removed or moved before the return.
def search_documents(self) -> list[Data]:
try:
raw = self.search(self.search_query or "")
- return [Data(text=hit["page_content"], **hit["metadata"]) for hit in raw]
- self.log(self.ingest_data)
+ results = [Data(text=hit["page_content"], **hit["metadata"]) for hit in raw]
+ return results
except Exception as e:
self.log(f"search_documents error: {e}")
raise

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
try:
    raw = self.search(self.search_query or "")
    return [Data(text=hit["page_content"], **hit["metadata"]) for hit in raw]
    self.log(self.ingest_data)
except Exception as e:
    self.log(f"search_documents error: {e}")
    raise

try:
    raw = self.search(self.search_query or "")
    results = [Data(text=hit["page_content"], **hit["metadata"]) for hit in raw]
    return results
except Exception as e:
    self.log(f"search_documents error: {e}")
    raise
🤖 Prompt for AI Agents
In src/lfx/src/lfx/components/elastic/opensearch_multimodal.py around lines
1539-1545 there is an unreachable self.log(self.ingest_data) placed after a
return; remove the unreachable line or move the logging before the return so it
executes. Better: call and assign raw = self.search(...), perform any needed
self.log(...) before returning, then return the list comprehension; ensure
self.search is only called once and preserve exception handling.
    return WATSONX_EMBEDDING_MODEL_NAMES

-    def build_embeddings(self) -> Embeddings:
+    async def build_embeddings(self) -> Embeddings:
🧩 Analysis chain
Script executed:
#!/bin/bash
# Search for all usages of build_embeddings method
rg -n --type=py 'build_embeddings' -B2 -A2
<shell_result>
Repository: langflow-ai/langflow
Script execution failed with unrecoverable error: [CAPABILITY:RUN_SHELL_SCRIPT] Failed to clone repository langflow-ai/langflow: Clone operation failed
</shell_result>
Script executed:
#!/bin/bash
# Search for all usages of build_embeddings method across the codebase
rg -n --type=py 'build_embeddings' -B2 -A2
<shell_result>
Repository: langflow-ai/langflow
Exit code: 0
stdout:
src/backend/tests/unit/initial_setup/starter_projects/test_vector_store_rag.py-41- vector_store.set_on_output(name="dataframe", value=DataFrame(data=[Data(text="This is a test file.")]), cache=True)
src/backend/tests/unit/initial_setup/starter_projects/test_vector_store_rag.py-42- vector_store.set(
src/backend/tests/unit/initial_setup/starter_projects/test_vector_store_rag.py:43: embedding_model=openai_embeddings.build_embeddings,
src/backend/tests/unit/initial_setup/starter_projects/test_vector_store_rag.py-44- ingest_data=text_splitter.split_text,
src/backend/tests/unit/initial_setup/starter_projects/test_vector_store_rag.py-45- api_endpoint="https://astra.example.com",
--
src/backend/tests/unit/initial_setup/starter_projects/test_vector_store_rag.py-60- api_endpoint="https://astra.example.com",
src/backend/tests/unit/initial_setup/starter_projects/test_vector_store_rag.py-61- token="token", # noqa: S106
src/backend/tests/unit/initial_setup/starter_projects/test_vector_store_rag.py:62: embedding_model=openai_embeddings.build_embeddings,
src/backend/tests/unit/initial_setup/starter_projects/test_vector_store_rag.py-63- )
src/backend/tests/unit/initial_setup/starter_projects/test_vector_store_rag.py-64- # Mock search_documents
--
src/backend/tests/unit/components/vectorstores/test_local_db_component.py-32-
src/backend/tests/unit/components/vectorstores/test_local_db_component.py-33- return {
src/backend/tests/unit/components/vectorstores/test_local_db_component.py:34: "embedding": OpenAIEmbeddingsComponent(openai_api_key=api_key).build_embeddings(),
src/backend/tests/unit/components/vectorstores/test_local_db_component.py-35- "collection_name": "test_collection",
src/backend/tests/unit/components/vectorstores/test_local_db_component.py-36- "persist": True,
--
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-120-
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-121- @patch("lfx.components.models_and_agents.embedding_model.OpenAIEmbeddings")
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py:122: async def test_build_embeddings_openai(self, mock_openai_embeddings, component_class, default_kwargs):
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-123- # Setup mock
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-124- mock_instance = MagicMock()
--
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-135-
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-136- # Build the embeddings
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py:137: embeddings = component.build_embeddings()
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-138-
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-139- # Verify the OpenAIEmbeddings was called with the correct parameters
--
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-152-
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-153- @patch("langchain_ollama.OllamaEmbeddings")
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py:154: async def test_build_embeddings_ollama(self, mock_ollama_embeddings, component_class, default_kwargs):
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-155- # Setup mock
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-156- mock_instance = MagicMock()
--
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-166-
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-167- # Build the embeddings
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py:168: embeddings = component.build_embeddings()
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-169-
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-170- # Verify the OllamaEmbeddings was called with the correct parameters
--
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-178- @patch("ibm_watsonx_ai.Credentials")
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-179- @patch("langchain_ibm.WatsonxEmbeddings")
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py:180: async def test_build_embeddings_watsonx(
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-181- self, mock_watsonx_embeddings, mock_credentials, mock_api_client, component_class, default_kwargs
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-182- ):
--
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-199-
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-200- # Build the embeddings
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py:201: embeddings = component.build_embeddings()
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-202-
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-203- # Verify Credentials was created correctly
--
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-225- assert embeddings == mock_instance
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-226-
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py:227: async def test_build_embeddings_watsonx_missing_project_id(self, component_class, default_kwargs):
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-228- kwargs = default_kwargs.copy()
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-229- kwargs["provider"] = "IBM watsonx.ai"
--
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-232-
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-233- with pytest.raises(ValueError, match=r"Project ID is required for IBM watsonx.ai"):
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py:234: component.build_embeddings()
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-235-
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py:236: async def test_build_embeddings_openai_missing_api_key(self, component_class, default_kwargs):
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-237- component = component_class(**default_kwargs)
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-238- component.provider = "OpenAI"
--
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-240-
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-241- with pytest.raises(ValueError, match="OpenAI API key is required when using OpenAI provider"):
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py:242: component.build_embeddings()
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-243-
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py:244: async def test_build_embeddings_watsonx_missing_api_key(self, component_class, default_kwargs):
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-245- kwargs = default_kwargs.copy()
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-246- kwargs["provider"] = "IBM watsonx.ai"
--
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-251-
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-252- with pytest.raises(ValueError, match=r"IBM watsonx.ai API key is required when using IBM watsonx.ai provider"):
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py:253: component.build_embeddings()
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-254-
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py:255: async def test_build_embeddings_unknown_provider(self, component_class, default_kwargs):
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-256- component = component_class(**default_kwargs)
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-257- component.provider = "Unknown"
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-258-
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-259- with pytest.raises(ValueError, match="Unknown provider: Unknown"):
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py:260: component.build_embeddings()
--
src/backend/tests/unit/components/vectorstores/test_chroma_vector_store_component.py-29-
src/backend/tests/unit/components/vectorstores/test_chroma_vector_store_component.py-30- return {
src/backend/tests/unit/components/vectorstores/test_chroma_vector_store_component.py:31: "embedding": OpenAIEmbeddingsComponent(openai_api_key=api_key).build_embeddings(),
src/backend/tests/unit/components/vectorstores/test_chroma_vector_store_component.py-32- "collection_name": "test_collection",
src/backend/tests/unit/components/vectorstores/test_chroma_vector_store_component.py-33- "persist_directory": tmp_path,
--
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-114-
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-115- @patch("langchain_huggingface.HuggingFaceEmbeddings")
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py:116: def test_build_embeddings_huggingface(self, mock_hf_embeddings, component_class, default_kwargs):
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-117- """Test building HuggingFace embeddings."""
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-118- component = component_class(**default_kwargs)
--
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-121- mock_hf_embeddings.return_value = mock_embeddings
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-122-
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py:123: result = component._build_embeddings("sentence-transformers/all-MiniLM-L6-v2", None)
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-124-
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-125- mock_hf_embeddings.assert_called_once_with(model="sentence-transformers/all-MiniLM-L6-v2")
--
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-127-
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-128- @patch("langchain_openai.OpenAIEmbeddings")
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py:129: def test_build_embeddings_openai(self, mock_openai_embeddings, component_class, default_kwargs):
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-130- """Test building OpenAI embeddings."""
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-131- component = component_class(**default_kwargs)
--
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-134- mock_openai_embeddings.return_value = mock_embeddings
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-135-
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py:136: result = component._build_embeddings("text-embedding-ada-002", "test-api-key")
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-137-
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-138- mock_openai_embeddings.assert_called_once_with(
--
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-143- assert result == mock_embeddings
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-144-
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py:145: def test_build_embeddings_openai_no_key(self, component_class, default_kwargs):
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-146- """Test building OpenAI embeddings without API key raises error."""
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-147- component = component_class(**default_kwargs)
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-148-
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-149- with pytest.raises(ValueError, match="OpenAI API key is required"):
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py:150: component._build_embeddings("text-embedding-ada-002", None)
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-151-
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-152- @patch("langchain_cohere.CohereEmbeddings")
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py:153: def test_build_embeddings_cohere(self, mock_cohere_embeddings, component_class, default_kwargs):
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-154- """Test building Cohere embeddings."""
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-155- component = component_class(**default_kwargs)
--
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-158- mock_cohere_embeddings.return_value = mock_embeddings
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-159-
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py:160: result = component._build_embeddings("embed-english-v3.0", "test-api-key")
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-161-
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-162- mock_cohere_embeddings.assert_called_once_with(
--
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-166- assert result == mock_embeddings
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-167-
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py:168: def test_build_embeddings_cohere_no_key(self, component_class, default_kwargs):
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-169- """Test building Cohere embeddings without API key raises error."""
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-170- component = component_class(**default_kwargs)
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-171-
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-172- with pytest.raises(ValueError, match="Cohere API key is required"):
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py:173: component._build_embeddings("embed-english-v3.0", None)
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-174-
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py:175: def test_build_embeddings_custom_not_supported(self, component_class, default_kwargs):
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-176- """Test building custom embeddings raises NotImplementedError."""
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-177- component = component_class(**default_kwargs)
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-178-
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-179- with pytest.raises(NotImplementedError, match="Custom embedding models not yet supported"):
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py:180: component._build_embeddings("custom-model", "test-key")
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-181-
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-182- @patch("langflow.components.knowledge_bases.ingestion.get_settings_service")
--
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-331- # Mock embedding validation
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-332- with (
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py:333: patch.object(component, "_build_embeddings") as mock_build_emb,
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-334- patch.object(component, "_save_embedding_metadata"),
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-335- ):
--
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-159-
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-160- @patch("langchain_huggingface.HuggingFaceEmbeddings")
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py:161: def test_build_embeddings_huggingface(self, mock_hf_embeddings, component_class, default_kwargs):
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-162- """Test building HuggingFace embeddings."""
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-163- component = component_class(**default_kwargs)
--
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-172- mock_hf_embeddings.return_value = mock_embeddings
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-173-
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py:174: result = component._build_embeddings(metadata)
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-175-
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-176- mock_hf_embeddings.assert_called_once_with(model="sentence-transformers/all-MiniLM-L6-v2")
--
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-178-
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-179- @patch("langchain_openai.OpenAIEmbeddings")
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py:180: def test_build_embeddings_openai(self, mock_openai_embeddings, component_class, default_kwargs):
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-181- """Test building OpenAI embeddings."""
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-182- component = component_class(**default_kwargs)
--
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-192- mock_openai_embeddings.return_value = mock_embeddings
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-193-
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py:194: result = component._build_embeddings(metadata)
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-195-
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-196- mock_openai_embeddings.assert_called_once_with(
--
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-201- assert result == mock_embeddings
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-202-
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py:203: def test_build_embeddings_openai_no_key(self, component_class, default_kwargs):
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-204- """Test building OpenAI embeddings without API key raises error."""
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-205- component = component_class(**default_kwargs)
--
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-213-
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-214- with pytest.raises(ValueError, match="OpenAI API key is required"):
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py:215: component._build_embeddings(metadata)
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-216-
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-217- @patch("langchain_cohere.CohereEmbeddings")
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py:218: def test_build_embeddings_cohere(self, mock_cohere_embeddings, component_class, default_kwargs):
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-219- """Test building Cohere embeddings."""
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-220- component = component_class(**default_kwargs)
--
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-230- mock_cohere_embeddings.return_value = mock_embeddings
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-231-
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py:232: result = component._build_embeddings(metadata)
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-233-
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-234- mock_cohere_embeddings.assert_called_once_with(
--
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-238- assert result == mock_embeddings
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-239-
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py:240: def test_build_embeddings_cohere_no_key(self, component_class, default_kwargs):
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-241- """Test building Cohere embeddings without API key raises error."""
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-242- component = component_class(**default_kwargs)
--
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-250-
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-251- with pytest.raises(ValueError, match="Cohere API key is required"):
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py:252: component._build_embeddings(metadata)
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-253-
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py:254: def test_build_embeddings_custom_not_supported(self, component_class, default_kwargs):
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-255- """Test building custom embeddings raises NotImplementedError."""
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-256- component = component_class(**default_kwargs)
--
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-263-
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-264- with pytest.raises(NotImplementedError, match="Custom embedding models not yet supported"):
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py:265: component._build_embeddings(metadata)
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-266-
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py:267: def test_build_embeddings_unsupported_provider(self, component_class, default_kwargs):
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-268- """Test building embeddings with unsupported provider raises NotImplementedError."""
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-269- component = component_class(**default_kwargs)
--
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-276-
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-277- with pytest.raises(NotImplementedError, match="Embedding provider 'UnsupportedProvider' is not supported"):
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py:278: component._build_embeddings(metadata)
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-279-
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py:280: def test_build_embeddings_with_user_api_key(self, component_class, default_kwargs):
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-281- """Test that user-provided API key overrides stored one."""
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-282- # Use a real SecretStr object instead of a mock
--
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-297- mock_openai.return_value = mock_embeddings
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-298-
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py:299: component._build_embeddings(metadata)
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-300-
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-301- # The user-provided key should override the stored key in metadata
--
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-348- with (
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-349- patch.object(component, "_get_kb_metadata") as mock_get_metadata,
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py:350: patch.object(component, "_build_embeddings") as mock_build_embeddings,
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-351- patch("langchain_chroma.Chroma"),
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-352- ):
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-353- mock_get_metadata.return_value = {"embedding_provider": "HuggingFace", "embedding_model": "test-model"}
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py:354: mock_build_embeddings.return_value = MagicMock()
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-355-
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-356- # This is a unit test focused on the component's internal logic
--
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-360- # Verify internal methods were called
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-361- mock_get_metadata.assert_called_once()
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py:362: mock_build_embeddings.assert_called_once()
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-363-
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-364- def test_include_embeddings_parameter(self, component_class, default_kwargs):
--
src/backend/base/langflow/initial_setup/starter_projects/vector_store_rag.py-20- vector_store = AstraDBVectorStoreComponent()
src/backend/base/langflow/initial_setup/starter_projects/vector_store_rag.py-21- vector_store.set(
src/backend/base/langflow/initial_setup/starter_projects/vector_store_rag.py:22: embedding_model=openai_embeddings.build_embeddings,
src/backend/base/langflow/initial_setup/starter_projects/vector_store_rag.py-23- ingest_data=text_splitter.split_text,
src/backend/base/langflow/initial_setup/starter_projects/vector_store_rag.py-24- )
--
src/backend/base/langflow/initial_setup/starter_projects/vector_store_rag.py-34- rag_vector_store.set(
src/backend/base/langflow/initial_setup/starter_projects/vector_store_rag.py-35- search_query=chat_input.message_response,
src/backend/base/langflow/initial_setup/starter_projects/vector_store_rag.py:36: embedding_model=openai_embeddings.build_embeddings,
src/backend/base/langflow/initial_setup/starter_projects/vector_store_rag.py-37- )
src/backend/base/langflow/initial_setup/starter_projects/vector_store_rag.py-38-
--
src/lfx/src/lfx/components/vertexai/vertexai_embeddings.py-34-
src/lfx/src/lfx/components/vertexai/vertexai_embeddings.py-35- outputs = [
src/lfx/src/lfx/components/vertexai/vertexai_embeddings.py:36: Output(display_name="Embeddings", name="embeddings", method="build_embeddings"),
src/lfx/src/lfx/components/vertexai/vertexai_embeddings.py-37- ]
src/lfx/src/lfx/components/vertexai/vertexai_embeddings.py-38-
src/lfx/src/lfx/components/vertexai/vertexai_embeddings.py:39: def build_embeddings(self) -> Embeddings:
src/lfx/src/lfx/components/vertexai/vertexai_embeddings.py-40- try:
src/lfx/src/lfx/components/vertexai/vertexai_embeddings.py-41- from langchain_google_vertexai import VertexAIEmbeddings
--
src/lfx/src/lfx/components/ollama/ollama_embeddings.py-41-
src/lfx/src/lfx/components/ollama/ollama_embeddings.py-42- outputs = [
src/lfx/src/lfx/components/ollama/ollama_embeddings.py:43: Output(display_name="Embeddings", name="embeddings", method="build_embeddings"),
src/lfx/src/lfx/components/ollama/ollama_embeddings.py-44- ]
src/lfx/src/lfx/components/ollama/ollama_embeddings.py-45-
src/lfx/src/lfx/components/ollama/ollama_embeddings.py:46: def build_embeddings(self) -> Embeddings:
src/lfx/src/lfx/components/ollama/ollama_embeddings.py-47- transformed_base_url = transform_localhost_url(self.base_url)
src/lfx/src/lfx/components/ollama/ollama_embeddings.py-48- try:
--
src/lfx/src/lfx/components/twelvelabs/text_embeddings.py-54- ]
src/lfx/src/lfx/components/twelvelabs/text_embeddings.py-55-
src/lfx/src/lfx/components/twelvelabs/text_embeddings.py:56: def build_embeddings(self) -> Embeddings:
src/lfx/src/lfx/components/twelvelabs/text_embeddings.py-57- return TwelveLabsTextEmbeddings(api_key=self.api_key, model=self.model)
--
src/lfx/src/lfx/components/twelvelabs/video_embeddings.py-97- ]
src/lfx/src/lfx/components/twelvelabs/video_embeddings.py-98-
src/lfx/src/lfx/components/twelvelabs/video_embeddings.py:99: def build_embeddings(self) -> Embeddings:
src/lfx/src/lfx/components/twelvelabs/video_embeddings.py-100- return TwelveLabsVideoEmbeddings(api_key=self.api_key, model_name=self.model_name)
--
src/lfx/src/lfx/components/openai/openai.py-73- ]
src/lfx/src/lfx/components/openai/openai.py-74-
src/lfx/src/lfx/components/openai/openai.py:75: def build_embeddings(self) -> Embeddings:
src/lfx/src/lfx/components/openai/openai.py-76- return OpenAIEmbeddings(
src/lfx/src/lfx/components/openai/openai.py-77- client=self.client or None,
--
src/lfx/src/lfx/components/nvidia/nvidia_embedding.py-50- if field_name == "base_url" and field_value:
src/lfx/src/lfx/components/nvidia/nvidia_embedding.py-51- try:
src/lfx/src/lfx/components/nvidia/nvidia_embedding.py:52: build_model = self.build_embeddings()
src/lfx/src/lfx/components/nvidia/nvidia_embedding.py-53- ids = [model.id for model in build_model.available_models]
src/lfx/src/lfx/components/nvidia/nvidia_embedding.py-54- build_config["model"]["options"] = ids
--
src/lfx/src/lfx/components/nvidia/nvidia_embedding.py-59- return build_config
src/lfx/src/lfx/components/nvidia/nvidia_embedding.py-60-
src/lfx/src/lfx/components/nvidia/nvidia_embedding.py:61: def build_embeddings(self) -> Embeddings:
src/lfx/src/lfx/components/nvidia/nvidia_embedding.py-62- try:
src/lfx/src/lfx/components/nvidia/nvidia_embedding.py-63- from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings
--
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-153- return WATSONX_EMBEDDING_MODEL_NAMES
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-154-
src/lfx/src/lfx/components/models_and_agents/embedding_model.py:155: async def build_embeddings(self) -> Embeddings:
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-156- provider = self.provider
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-157- model = self.model
--
src/lfx/src/lfx/components/lmstudio/lmstudioembeddings.py-71- ]
src/lfx/src/lfx/components/lmstudio/lmstudioembeddings.py-72-
src/lfx/src/lfx/components/lmstudio/lmstudioembeddings.py:73: def build_embeddings(self) -> Embeddings:
src/lfx/src/lfx/components/lmstudio/lmstudioembeddings.py-74- try:
src/lfx/src/lfx/components/lmstudio/lmstudioembeddings.py-75- from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings
--
src/lfx/src/lfx/components/mistral/mistral_embeddings.py-39-
src/lfx/src/lfx/components/mistral/mistral_embeddings.py-40- outputs = [
src/lfx/src/lfx/components/mistral/mistral_embeddings.py:41: Output(display_name="Embeddings", name="embeddings", method="build_embeddings"),
src/lfx/src/lfx/components/mistral/mistral_embeddings.py-42- ]
src/lfx/src/lfx/components/mistral/mistral_embeddings.py-43-
src/lfx/src/lfx/components/mistral/mistral_embeddings.py:44: def build_embeddings(self) -> Embeddings:
src/lfx/src/lfx/components/mistral/mistral_embeddings.py-45- if not self.mistral_api_key:
src/lfx/src/lfx/components/mistral/mistral_embeddings.py-46- msg = "Mistral API Key is required"
--
src/lfx/src/lfx/components/langchain_utilities/fake_embeddings.py-21- ]
src/lfx/src/lfx/components/langchain_utilities/fake_embeddings.py-22-
src/lfx/src/lfx/components/langchain_utilities/fake_embeddings.py:23: def build_embeddings(self) -> Embeddings:
src/lfx/src/lfx/components/langchain_utilities/fake_embeddings.py-24- return FakeEmbeddings(
src/lfx/src/lfx/components/langchain_utilities/fake_embeddings.py-25- size=self.dimensions or 5,
--
src/lfx/src/lfx/components/ibm/watsonx_embeddings.py-115- logger.exception("Error updating model options.")
src/lfx/src/lfx/components/ibm/watsonx_embeddings.py-116-
src/lfx/src/lfx/components/ibm/watsonx_embeddings.py:117: def build_embeddings(self) -> Embeddings:
src/lfx/src/lfx/components/ibm/watsonx_embeddings.py-118- credentials = Credentials(
src/lfx/src/lfx/components/ibm/watsonx_embeddings.py-119- api_key=SecretStr(self.api_key).get_secret_value(),
--
src/lfx/src/lfx/components/huggingface/huggingface_inference_api.py-44-
src/lfx/src/lfx/components/huggingface/huggingface_inference_api.py-45- outputs = [
src/lfx/src/lfx/components/huggingface/huggingface_inference_api.py:46: Output(display_name="Embeddings", name="embeddings", method="build_embeddings"),
src/lfx/src/lfx/components/huggingface/huggingface_inference_api.py-47- ]
src/lfx/src/lfx/components/huggingface/huggingface_inference_api.py-48-
--
src/lfx/src/lfx/components/huggingface/huggingface_inference_api.py-83- return HuggingFaceInferenceAPIEmbeddings(api_key=api_key, api_url=api_url, model_name=model_name)
src/lfx/src/lfx/components/huggingface/huggingface_inference_api.py-84-
src/lfx/src/lfx/components/huggingface/huggingface_inference_api.py:85: def build_embeddings(self) -> Embeddings:
src/lfx/src/lfx/components/huggingface/huggingface_inference_api.py-86- api_url = self.get_api_url()
src/lfx/src/lfx/components/huggingface/huggingface_inference_api.py-87-
--
src/lfx/src/lfx/components/google/google_generative_ai_embeddings.py-34-
src/lfx/src/lfx/components/google/google_generative_ai_embeddings.py-35- outputs = [
src/lfx/src/lfx/components/google/google_generative_ai_embeddings.py:36: Output(display_name="Embeddings", name="embeddings", method="build_embeddings"),
src/lfx/src/lfx/components/google/google_generative_ai_embeddings.py-37- ]
src/lfx/src/lfx/components/google/google_generative_ai_embeddings.py-38-
src/lfx/src/lfx/components/google/google_generative_ai_embeddings.py:39: def build_embeddings(self) -> Embeddings:
src/lfx/src/lfx/components/google/google_generative_ai_embeddings.py-40- if not self.api_key:
src/lfx/src/lfx/components/google/google_generative_ai_embeddings.py-41- msg = "API Key is required"
--
src/lfx/src/lfx/components/files_and_knowledge/retrieval.py-135- return metadata
src/lfx/src/lfx/components/files_and_knowledge/retrieval.py-136-
src/lfx/src/lfx/components/files_and_knowledge/retrieval.py:137: def _build_embeddings(self, metadata: dict):
src/lfx/src/lfx/components/files_and_knowledge/retrieval.py-138- """Build embedding model from metadata."""
src/lfx/src/lfx/components/files_and_knowledge/retrieval.py-139- runtime_api_key = self.api_key.get_secret_value() if isinstance(self.api_key, SecretStr) else self.api_key
--
src/lfx/src/lfx/components/files_and_knowledge/retrieval.py-203-
src/lfx/src/lfx/components/files_and_knowledge/retrieval.py-204- # Build the embedder for the knowledge base
src/lfx/src/lfx/components/files_and_knowledge/retrieval.py:205: embedding_function = self._build_embeddings(metadata)
src/lfx/src/lfx/components/files_and_knowledge/retrieval.py-206-
src/lfx/src/lfx/components/files_and_knowledge/retrieval.py-207- # Load vector store
--
src/lfx/src/lfx/components/files_and_knowledge/ingestion.py-243- return "Custom"
src/lfx/src/lfx/components/files_and_knowledge/ingestion.py-244-
src/lfx/src/lfx/components/files_and_knowledge/ingestion.py:245: def _build_embeddings(self, embedding_model: str, api_key: str):
src/lfx/src/lfx/components/files_and_knowledge/ingestion.py-246- """Build embedding model using provider patterns."""
src/lfx/src/lfx/components/files_and_knowledge/ingestion.py-247- # Get provider by matching model name to lists
--
src/lfx/src/lfx/components/files_and_knowledge/ingestion.py-385-
src/lfx/src/lfx/components/files_and_knowledge/ingestion.py-386- # Create embeddings model
src/lfx/src/lfx/components/files_and_knowledge/ingestion.py:387: embedding_function = self._build_embeddings(embedding_model, api_key)
src/lfx/src/lfx/components/files_and_knowledge/ingestion.py-388-
src/lfx/src/lfx/components/files_and_knowledge/ingestion.py-389- # Convert DataFrame to Data objects (following Local DB pattern)
--
src/lfx/src/lfx/components/files_and_knowledge/ingestion.py-655-
src/lfx/src/lfx/components/files_and_knowledge/ingestion.py-656- # We need to test the API Key one time against the embedding model
src/lfx/src/lfx/components/files_and_knowledge/ingestion.py:657: embed_model = self._build_embeddings(embedding_model=field_value["02_embedding_model"], api_key=api_key)
src/lfx/src/lfx/components/files_and_knowledge/ingestion.py-658-
src/lfx/src/lfx/components/files_and_knowledge/ingestion.py-659- # Try to generate a dummy embedding to validate the API key without blocking the event loop
--
src/lfx/src/lfx/components/azure/azure_openai_embeddings.py-64-
src/lfx/src/lfx/components/azure/azure_openai_embeddings.py-65- outputs = [
src/lfx/src/lfx/components/azure/azure_openai_embeddings.py:66: Output(display_name="Embeddings", name="embeddings", method="build_embeddings"),
src/lfx/src/lfx/components/azure/azure_openai_embeddings.py-67- ]
src/lfx/src/lfx/components/azure/azure_openai_embeddings.py-68-
src/lfx/src/lfx/components/azure/azure_openai_embeddings.py:69: def build_embeddings(self) -> Embeddings:
src/lfx/src/lfx/components/azure/azure_openai_embeddings.py-70- try:
src/lfx/src/lfx/components/azure/azure_openai_embeddings.py-71- embeddings = AzureOpenAIEmbeddings(
--
src/lfx/src/lfx/components/cloudflare/cloudflare.py-61-
src/lfx/src/lfx/components/cloudflare/cloudflare.py-62- outputs = [
src/lfx/src/lfx/components/cloudflare/cloudflare.py:63: Output(display_name="Embeddings", name="embeddings", method="build_embeddings"),
src/lfx/src/lfx/components/cloudflare/cloudflare.py-64- ]
src/lfx/src/lfx/components/cloudflare/cloudflare.py-65-
src/lfx/src/lfx/components/cloudflare/cloudflare.py:66: def build_embeddings(self) -> Embeddings:
src/lfx/src/lfx/components/cloudflare/cloudflare.py-67- try:
src/lfx/src/lfx/components/cloudflare/cloudflare.py-68- embeddings = CloudflareWorkersAIEmbeddings(
--
src/lfx/src/lfx/components/cohere/cohere_embeddings.py-40-
src/lfx/src/lfx/components/cohere/cohere_embeddings.py-41- outputs = [
src/lfx/src/lfx/components/cohere/cohere_embeddings.py:42: Output(display_name="Embeddings", name="embeddings", method="build_embeddings"),
src/lfx/src/lfx/components/cohere/cohere_embeddings.py-43- ]
src/lfx/src/lfx/components/cohere/cohere_embeddings.py-44-
src/lfx/src/lfx/components/cohere/cohere_embeddings.py:45: def build_embeddings(self) -> Embeddings:
src/lfx/src/lfx/components/cohere/cohere_embeddings.py-46- data = None
src/lfx/src/lfx/components/cohere/cohere_embeddings.py-47- try:
--
src/lfx/src/lfx/components/amazon/amazon_bedrock_embedding.py-69-
src/lfx/src/lfx/components/amazon/amazon_bedrock_embedding.py-70- outputs = [
src/lfx/src/lfx/components/amazon/amazon_bedrock_embedding.py:71: Output(display_name="Embeddings", name="embeddings", method="build_embeddings"),
src/lfx/src/lfx/components/amazon/amazon_bedrock_embedding.py-72- ]
src/lfx/src/lfx/components/amazon/amazon_bedrock_embedding.py-73-
src/lfx/src/lfx/components/amazon/amazon_bedrock_embedding.py:74: def build_embeddings(self) -> Embeddings:
src/lfx/src/lfx/components/amazon/amazon_bedrock_embedding.py-75- try:
src/lfx/src/lfx/components/amazon/amazon_bedrock_embedding.py-76- from langchain_aws import BedrockEmbeddings
--
src/lfx/src/lfx/components/aiml/aiml_embeddings.py-31- ]
src/lfx/src/lfx/components/aiml/aiml_embeddings.py-32-
src/lfx/src/lfx/components/aiml/aiml_embeddings.py:33: def build_embeddings(self) -> Embeddings:
src/lfx/src/lfx/components/aiml/aiml_embeddings.py-34- return AIMLEmbeddingsImpl(
src/lfx/src/lfx/components/aiml/aiml_embeddings.py-35- api_key=self.aiml_api_key,
--
src/lfx/src/lfx/base/embeddings/model.py-8-
src/lfx/src/lfx/base/embeddings/model.py-9- outputs = [
src/lfx/src/lfx/base/embeddings/model.py:10: Output(display_name="Embedding Model", name="embeddings", method="build_embeddings"),
src/lfx/src/lfx/base/embeddings/model.py-11- ]
src/lfx/src/lfx/base/embeddings/model.py-12-
src/lfx/src/lfx/base/embeddings/model.py-13- def _validate_outputs(self) -> None:
src/lfx/src/lfx/base/embeddings/model.py:14: required_output_methods = ["build_embeddings"]
src/lfx/src/lfx/base/embeddings/model.py-15- output_names = [output.name for output in self.outputs]
src/lfx/src/lfx/base/embeddings/model.py-16- for method_name in required_output_methods:
--
src/lfx/src/lfx/base/embeddings/model.py-22- raise ValueError(msg)
src/lfx/src/lfx/base/embeddings/model.py-23-
src/lfx/src/lfx/base/embeddings/model.py:24: def build_embeddings(self) -> Embeddings:
src/lfx/src/lfx/base/embeddings/model.py:25: msg = "You must implement the build_embeddings method in your class."
src/lfx/src/lfx/base/embeddings/model.py-26- raise NotImplementedError(msg)
</shell_result>
Script executed:
#!/bin/bash
# Check test_local_db_component.py around line 34
sed -n '30,40p' src/backend/tests/unit/components/vectorstores/test_local_db_component.py
# Check test_embedding_model_component.py around lines 137, 168
sed -n '130,145p' src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py
# Check if the test functions are async
sed -n '120,125p' src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py
Repository: langflow-ai/langflow
<shell_result>
Repository: langflow-ai/langflow
Exit code: 0
stdout:
except ValueError:
pytest.skip("OPENAI_API_KEY is not set")
return {
"embedding": OpenAIEmbeddingsComponent(openai_api_key=api_key).build_embeddings(),
"collection_name": "test_collection",
"persist": True,
"persist_directory": str(tmp_path), # Convert Path to string
"mode": "Ingest",
}
component.model = "text-embedding-3-small"
component.api_key = "test-key" # pragma:allowlist secret
component.chunk_size = 1000
component.max_retries = 3
component.show_progress_bar = False
# Build the embeddings
embeddings = component.build_embeddings()
# Verify the OpenAIEmbeddings was called with the correct parameters
mock_openai_embeddings.assert_called_once_with(
model="text-embedding-3-small",
dimensions=None,
base_url=None,
api_key="test-key", # pragma:allowlist secret
chunk_size=1000,
@patch("lfx.components.models_and_agents.embedding_model.OpenAIEmbeddings")
async def test_build_embeddings_openai(self, mock_openai_embeddings, component_class, default_kwargs):
# Setup mock
mock_instance = MagicMock()
mock_openai_embeddings.return_value = mock_instance
</shell_result>
Script executed:
#!/bin/bash
# Search for any uses of "await build_embeddings" to see if callers are properly awaiting
rg -n --type=py 'await.*build_embeddings'
# Check the actual implementation of the async method
sed -n '155,200p' src/lfx/src/lfx/components/models_and_agents/embedding_model.py
Repository: langflow-ai/langflow
<shell_result>
Repository: langflow-ai/langflow
Exit code: 0
stdout:
async def build_embeddings(self) -> Embeddings:
provider = self.provider
model = self.model
api_key = self.api_key
api_base = self.api_base
base_url_ibm_watsonx = self.base_url_ibm_watsonx
ollama_base_url = self.ollama_base_url
dimensions = self.dimensions
chunk_size = self.chunk_size
request_timeout = self.request_timeout
max_retries = self.max_retries
show_progress_bar = self.show_progress_bar
model_kwargs = self.model_kwargs or {}
if provider == "OpenAI":
if not api_key:
msg = "OpenAI API key is required when using OpenAI provider"
raise ValueError(msg)
# Create the primary embedding instance
embeddings_instance = OpenAIEmbeddings(
model=model,
dimensions=dimensions or None,
base_url=api_base or None,
api_key=api_key,
chunk_size=chunk_size,
max_retries=max_retries,
timeout=request_timeout or None,
show_progress_bar=show_progress_bar,
model_kwargs=model_kwargs,
)
# Create dedicated instances for each available model
available_models_dict = {}
for model_name in OPENAI_EMBEDDING_MODEL_NAMES:
available_models_dict[model_name] = OpenAIEmbeddings(
model=model_name,
dimensions=dimensions or None, # Use same dimensions config for all
base_url=api_base or None,
api_key=api_key,
chunk_size=chunk_size,
max_retries=max_retries,
timeout=request_timeout or None,
show_progress_bar=show_progress_bar,
model_kwargs=model_kwargs,
)
</shell_result>
Breaking change not properly propagated: callers are not awaiting the async method.
The build_embeddings method was converted to async but call sites throughout the codebase are not awaiting it:
- src/backend/tests/unit/components/vectorstores/test_local_db_component.py:34
- src/backend/tests/unit/components/vectorstores/test_chroma_vector_store_component.py:31
- src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py:137, 168, 201, 234, 242, 253, 260
Additionally, the method body contains no await operations despite being declared async, and all other embedding components in the codebase remain synchronous. Either update all call sites to use await, or reconsider whether this method should be async.
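If the async signature is kept, every call site has to switch to an awaited call. A minimal sketch of the two call-site shapes, using a stand-in class rather than the project's EmbeddingModelComponent, is:

import asyncio


class FakeEmbeddingComponent:
    """Stand-in for the component, used only to show the call shapes."""

    async def build_embeddings(self) -> list[str]:
        # The real method builds provider embeddings; a marker value is enough here.
        return ["embeddings-instance"]


def sync_caller() -> None:
    component = FakeEmbeddingComponent()
    # Synchronous call sites can no longer call the coroutine directly;
    # they need an event loop (or the method should be reverted to sync).
    print(asyncio.run(component.build_embeddings()))


async def async_caller() -> None:
    component = FakeEmbeddingComponent()
    # Async call sites (e.g. pytest-asyncio tests) simply await the builder.
    print(await component.build_embeddings())


if __name__ == "__main__":
    sync_caller()
    asyncio.run(async_caller())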
🤖 Prompt for AI Agents
In src/lfx/src/lfx/components/models_and_agents/embedding_model.py around line
155, the method build_embeddings was changed to async but its body contains no
awaits and many callers are not awaiting it, causing breaking behavior; either
revert build_embeddings to a synchronous def and restore the previous
(non-async) signature so all existing call sites continue to work, or keep it
async and (1) introduce real awaitable operations inside (or wrap the work in an
executor) and (2) update every caller/test listed to await build_embeddings;
pick one approach and apply it consistently across the embedding component
interface and all referenced call sites/tests.
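If the second option is taken (keep the method async and give it real awaitable work), the blocking construction could be offloaded to a worker thread. This is only a sketch; _build_embeddings_sync is a hypothetical helper, not a method in the PR:

import asyncio

from langchain_core.embeddings import Embeddings, FakeEmbeddings


class SketchEmbeddingComponent:
    """Illustrative only: an async builder delegating to synchronous construction."""

    def _build_embeddings_sync(self) -> Embeddings:
        # The existing provider-specific construction logic would live here.
        return FakeEmbeddings(size=8)

    async def build_embeddings(self) -> Embeddings:
        # Run the blocking constructor in a worker thread so the coroutine
        # genuinely awaits something instead of being async in name only.
        return await asyncio.to_thread(self._build_embeddings_sync)


if __name__ == "__main__":
    embeddings = asyncio.run(SketchEmbeddingComponent().build_embeddings())
    print(type(embeddings).__name__)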
# Create dedicated instances for each available model
available_models_dict = {}
for model_name in OPENAI_EMBEDDING_MODEL_NAMES:
    available_models_dict[model_name] = OpenAIEmbeddings(
        model=model_name,
        dimensions=dimensions or None,  # Use same dimensions config for all
        base_url=api_base or None,
        api_key=api_key,
        chunk_size=chunk_size,
        max_retries=max_retries,
        timeout=request_timeout or None,
        show_progress_bar=show_progress_bar,
        model_kwargs=model_kwargs,
    )
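The loop above repeats the constructor arguments already used for the primary instance. If that duplication becomes a maintenance concern, one option is a small factory over the shared kwargs; a rough sketch under assumed placeholder values (the real values come from the component's inputs, and the model list is trimmed for illustration):

from langchain_openai import OpenAIEmbeddings

OPENAI_EMBEDDING_MODEL_NAMES = ["text-embedding-3-small", "text-embedding-3-large"]  # trimmed list, for illustration

# Shared settings the component would normally read from its inputs.
common_kwargs = {
    "dimensions": None,
    "base_url": None,
    "api_key": "sk-placeholder",  # pragma: allowlist secret
    "chunk_size": 1000,
    "max_retries": 3,
    "timeout": None,
    "show_progress_bar": False,
    "model_kwargs": {},
}


def make_openai_embeddings(model_name: str) -> OpenAIEmbeddings:
    """Build one OpenAIEmbeddings instance with the shared settings."""
    return OpenAIEmbeddings(model=model_name, **common_kwargs)


primary = make_openai_embeddings("text-embedding-3-small")
available_models_dict = {name: make_openai_embeddings(name) for name in OPENAI_EMBEDDING_MODEL_NAMES}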
🧩 Analysis chain
Script executed:
# Find the OPENAI_EMBEDDING_MODEL_NAMES constant definition
rg "OPENAI_EMBEDDING_MODEL_NAMES" -n -B2 -A5Repository: langflow-ai/langflow
<shell_result>
Repository: langflow-ai/langflow
Exit code: 0
stdout:
src/lfx/tests/data/starter_projects_1_6_0/Nvidia Remix.json-1783- "title_case": false,
src/lfx/tests/data/starter_projects_1_6_0/Nvidia Remix.json-1784- "type": "code",
src/lfx/tests/data/starter_projects_1_6_0/Nvidia Remix.json:1785: "value": "from typing import Any\n\nfrom langchain_openai import OpenAIEmbeddings\n\nfrom langflow.base.embeddings.model import LCEmbeddingsModel\nfrom langflow.base.models.openai_constants import OPENAI_EMBEDDING_MODEL_NAMES\nfrom langflow.field_typing import Embeddings\nfrom langflow.io import (\n BoolInput,\n DictInput,\n DropdownInput,\n FloatInput,\n IntInput,\n MessageTextInput,\n SecretStrInput,\n)\nfrom langflow.schema.dotdict import dotdict\n\n\nclass EmbeddingModelComponent(LCEmbeddingsModel):\n display_name = \"Embedding Model\"\n description = \"Generate embeddings using a specified provider.\"\n documentation: str = \"https://docs.langflow.org/components-embedding-models\"\n icon = \"binary\"\n name = \"EmbeddingModel\"\n category = \"models\"\n\n inputs = [\n DropdownInput(\n name=\"provider\",\n display_name=\"Model Provider\",\n options=[\"OpenAI\"],\n value=\"OpenAI\",\n info=\"Select the embedding model provider\",\n real_time_refresh=True,\n options_metadata=[{\"icon\": \"OpenAI\"}],\n ),\n DropdownInput(\n name=\"model\",\n display_name=\"Model Name\",\n options=OPENAI_EMBEDDING_MODEL_NAMES,\n value=OPENAI_EMBEDDING_MODEL_NAMES[0],\n info=\"Select the embedding model to use\",\n ),\n SecretStrInput(\n name=\"api_key\",\n display_name=\"OpenAI API Key\",\n info=\"Model Provider API key\",\n required=True,\n show=True,\n real_time_refresh=True,\n ),\n MessageTextInput(\n name=\"api_base\",\n display_name=\"API Base URL\",\n info=\"Base URL for the API. Leave empty for default.\",\n advanced=True,\n ),\n IntInput(\n name=\"dimensions\",\n display_name=\"Dimensions\",\n info=\"The number of dimensions the resulting output embeddings should have. \"\n \"Only supported by certain models.\",\n advanced=True,\n ),\n IntInput(name=\"chunk_size\", display_name=\"Chunk Size\", advanced=True, value=1000),\n FloatInput(name=\"request_timeout\", display_name=\"Request Timeout\", advanced=True),\n IntInput(name=\"max_retries\", display_name=\"Max Retries\", advanced=True, value=3),\n BoolInput(name=\"show_progress_bar\", display_name=\"Show Progress Bar\", advanced=True),\n DictInput(\n name=\"model_kwargs\",\n display_name=\"Model Kwargs\",\n advanced=True,\n info=\"Additional keyword arguments to pass to the model.\",\n ),\n ]\n\n def build_embeddings(self) -> Embeddings:\n provider = self.provider\n model = self.model\n api_key = self.api_key\n api_base = self.api_base\n dimensions = self.dimensions\n chunk_size = self.chunk_size\n request_timeout = self.request_timeout\n max_retries = self.max_retries\n show_progress_bar = self.show_progress_bar\n model_kwargs = self.model_kwargs or {}\n\n if provider == \"OpenAI\":\n if not api_key:\n msg = \"OpenAI API key is required when using OpenAI provider\"\n raise ValueError(msg)\n return OpenAIEmbeddings(\n model=model,\n dimensions=dimensions or None,\n base_url=api_base or None,\n api_key=api_key,\n chunk_size=chunk_size,\n max_retries=max_retries,\n timeout=request_timeout or None,\n show_progress_bar=show_progress_bar,\n model_kwargs=model_kwargs,\n )\n msg = f\"Unknown provider: {provider}\"\n raise ValueError(msg)\n\n def update_build_config(self, build_config: dotdict, field_value: Any, field_name: str | None = None) -> dotdict:\n if field_name == \"provider\" and field_value == \"OpenAI\":\n build_config[\"model\"][\"options\"] = OPENAI_EMBEDDING_MODEL_NAMES\n build_config[\"model\"][\"value\"] = OPENAI_EMBEDDING_MODEL_NAMES[0]\n 
build_config[\"api_key\"][\"display_name\"] = \"OpenAI API Key\"\n build_config[\"api_base\"][\"display_name\"] = \"OpenAI API Base URL\"\n return build_config\n"
src/lfx/tests/data/starter_projects_1_6_0/Nvidia Remix.json-1786- },
src/lfx/tests/data/starter_projects_1_6_0/Nvidia Remix.json-1787- "dimensions": {
src/lfx/tests/data/starter_projects_1_6_0/Nvidia Remix.json-1788- "_input_type": "IntInput",
src/lfx/tests/data/starter_projects_1_6_0/Nvidia Remix.json-1789- "advanced": true,
src/lfx/tests/data/starter_projects_1_6_0/Nvidia Remix.json-1790- "display_name": "Dimensions",
--
src/lfx/tests/data/starter_projects_1_6_0/Vector Store RAG.json-1317- "title_case": false,
src/lfx/tests/data/starter_projects_1_6_0/Vector Store RAG.json-1318- "type": "code",
src/lfx/tests/data/starter_projects_1_6_0/Vector Store RAG.json:1319: "value": "from langchain_openai import OpenAIEmbeddings\n\nfrom langflow.base.embeddings.model import LCEmbeddingsModel\nfrom langflow.base.models.openai_constants import OPENAI_EMBEDDING_MODEL_NAMES\nfrom langflow.field_typing import Embeddings\nfrom langflow.io import BoolInput, DictInput, DropdownInput, FloatInput, IntInput, MessageTextInput, SecretStrInput\n\n\nclass OpenAIEmbeddingsComponent(LCEmbeddingsModel):\n display_name = \"OpenAI Embeddings\"\n description = \"Generate embeddings using OpenAI models.\"\n icon = \"OpenAI\"\n name = \"OpenAIEmbeddings\"\n\n inputs = [\n DictInput(\n name=\"default_headers\",\n display_name=\"Default Headers\",\n advanced=True,\n info=\"Default headers to use for the API request.\",\n ),\n DictInput(\n name=\"default_query\",\n display_name=\"Default Query\",\n advanced=True,\n info=\"Default query parameters to use for the API request.\",\n ),\n IntInput(name=\"chunk_size\", display_name=\"Chunk Size\", advanced=True, value=1000),\n MessageTextInput(name=\"client\", display_name=\"Client\", advanced=True),\n MessageTextInput(name=\"deployment\", display_name=\"Deployment\", advanced=True),\n IntInput(name=\"embedding_ctx_length\", display_name=\"Embedding Context Length\", advanced=True, value=1536),\n IntInput(name=\"max_retries\", display_name=\"Max Retries\", value=3, advanced=True),\n DropdownInput(\n name=\"model\",\n display_name=\"Model\",\n advanced=False,\n options=OPENAI_EMBEDDING_MODEL_NAMES,\n value=\"text-embedding-3-small\",\n ),\n DictInput(name=\"model_kwargs\", display_name=\"Model Kwargs\", advanced=True),\n SecretStrInput(name=\"openai_api_key\", display_name=\"OpenAI API Key\", value=\"OPENAI_API_KEY\", required=True),\n MessageTextInput(name=\"openai_api_base\", display_name=\"OpenAI API Base\", advanced=True),\n MessageTextInput(name=\"openai_api_type\", display_name=\"OpenAI API Type\", advanced=True),\n MessageTextInput(name=\"openai_api_version\", display_name=\"OpenAI API Version\", advanced=True),\n MessageTextInput(\n name=\"openai_organization\",\n display_name=\"OpenAI Organization\",\n advanced=True,\n ),\n MessageTextInput(name=\"openai_proxy\", display_name=\"OpenAI Proxy\", advanced=True),\n FloatInput(name=\"request_timeout\", display_name=\"Request Timeout\", advanced=True),\n BoolInput(name=\"show_progress_bar\", display_name=\"Show Progress Bar\", advanced=True),\n BoolInput(name=\"skip_empty\", display_name=\"Skip Empty\", advanced=True),\n MessageTextInput(\n name=\"tiktoken_model_name\",\n display_name=\"TikToken Model Name\",\n advanced=True,\n ),\n BoolInput(\n name=\"tiktoken_enable\",\n display_name=\"TikToken Enable\",\n advanced=True,\n value=True,\n info=\"If False, you must have transformers installed.\",\n ),\n IntInput(\n name=\"dimensions\",\n display_name=\"Dimensions\",\n info=\"The number of dimensions the resulting output embeddings should have. 
\"\n \"Only supported by certain models.\",\n advanced=True,\n ),\n ]\n\n def build_embeddings(self) -> Embeddings:\n return OpenAIEmbeddings(\n client=self.client or None,\n model=self.model,\n dimensions=self.dimensions or None,\n deployment=self.deployment or None,\n api_version=self.openai_api_version or None,\n base_url=self.openai_api_base or None,\n openai_api_type=self.openai_api_type or None,\n openai_proxy=self.openai_proxy or None,\n embedding_ctx_length=self.embedding_ctx_length,\n api_key=self.openai_api_key or None,\n organization=self.openai_organization or None,\n allowed_special=\"all\",\n disallowed_special=\"all\",\n chunk_size=self.chunk_size,\n max_retries=self.max_retries,\n timeout=self.request_timeout or None,\n tiktoken_enabled=self.tiktoken_enable,\n tiktoken_model_name=self.tiktoken_model_name or None,\n show_progress_bar=self.show_progress_bar,\n model_kwargs=self.model_kwargs,\n skip_empty=self.skip_empty,\n default_headers=self.default_headers or None,\n default_query=self.default_query or None,\n )\n"
src/lfx/tests/data/starter_projects_1_6_0/Vector Store RAG.json-1320- },
src/lfx/tests/data/starter_projects_1_6_0/Vector Store RAG.json-1321- "default_headers": {
src/lfx/tests/data/starter_projects_1_6_0/Vector Store RAG.json-1322- "_input_type": "DictInput",
src/lfx/tests/data/starter_projects_1_6_0/Vector Store RAG.json-1323- "advanced": true,
src/lfx/tests/data/starter_projects_1_6_0/Vector Store RAG.json-1324- "display_name": "Default Headers",
--
src/lfx/tests/data/starter_projects_1_6_0/Vector Store RAG.json-1850- "title_case": false,
src/lfx/tests/data/starter_projects_1_6_0/Vector Store RAG.json-1851- "type": "code",
src/lfx/tests/data/starter_projects_1_6_0/Vector Store RAG.json:1852: "value": "from langchain_openai import OpenAIEmbeddings\n\nfrom langflow.base.embeddings.model import LCEmbeddingsModel\nfrom langflow.base.models.openai_constants import OPENAI_EMBEDDING_MODEL_NAMES\nfrom langflow.field_typing import Embeddings\nfrom langflow.io import BoolInput, DictInput, DropdownInput, FloatInput, IntInput, MessageTextInput, SecretStrInput\n\n\nclass OpenAIEmbeddingsComponent(LCEmbeddingsModel):\n display_name = \"OpenAI Embeddings\"\n description = \"Generate embeddings using OpenAI models.\"\n icon = \"OpenAI\"\n name = \"OpenAIEmbeddings\"\n\n inputs = [\n DictInput(\n name=\"default_headers\",\n display_name=\"Default Headers\",\n advanced=True,\n info=\"Default headers to use for the API request.\",\n ),\n DictInput(\n name=\"default_query\",\n display_name=\"Default Query\",\n advanced=True,\n info=\"Default query parameters to use for the API request.\",\n ),\n IntInput(name=\"chunk_size\", display_name=\"Chunk Size\", advanced=True, value=1000),\n MessageTextInput(name=\"client\", display_name=\"Client\", advanced=True),\n MessageTextInput(name=\"deployment\", display_name=\"Deployment\", advanced=True),\n IntInput(name=\"embedding_ctx_length\", display_name=\"Embedding Context Length\", advanced=True, value=1536),\n IntInput(name=\"max_retries\", display_name=\"Max Retries\", value=3, advanced=True),\n DropdownInput(\n name=\"model\",\n display_name=\"Model\",\n advanced=False,\n options=OPENAI_EMBEDDING_MODEL_NAMES,\n value=\"text-embedding-3-small\",\n ),\n DictInput(name=\"model_kwargs\", display_name=\"Model Kwargs\", advanced=True),\n SecretStrInput(name=\"openai_api_key\", display_name=\"OpenAI API Key\", value=\"OPENAI_API_KEY\", required=True),\n MessageTextInput(name=\"openai_api_base\", display_name=\"OpenAI API Base\", advanced=True),\n MessageTextInput(name=\"openai_api_type\", display_name=\"OpenAI API Type\", advanced=True),\n MessageTextInput(name=\"openai_api_version\", display_name=\"OpenAI API Version\", advanced=True),\n MessageTextInput(\n name=\"openai_organization\",\n display_name=\"OpenAI Organization\",\n advanced=True,\n ),\n MessageTextInput(name=\"openai_proxy\", display_name=\"OpenAI Proxy\", advanced=True),\n FloatInput(name=\"request_timeout\", display_name=\"Request Timeout\", advanced=True),\n BoolInput(name=\"show_progress_bar\", display_name=\"Show Progress Bar\", advanced=True),\n BoolInput(name=\"skip_empty\", display_name=\"Skip Empty\", advanced=True),\n MessageTextInput(\n name=\"tiktoken_model_name\",\n display_name=\"TikToken Model Name\",\n advanced=True,\n ),\n BoolInput(\n name=\"tiktoken_enable\",\n display_name=\"TikToken Enable\",\n advanced=True,\n value=True,\n info=\"If False, you must have transformers installed.\",\n ),\n IntInput(\n name=\"dimensions\",\n display_name=\"Dimensions\",\n info=\"The number of dimensions the resulting output embeddings should have. 
\"\n \"Only supported by certain models.\",\n advanced=True,\n ),\n ]\n\n def build_embeddings(self) -> Embeddings:\n return OpenAIEmbeddings(\n client=self.client or None,\n model=self.model,\n dimensions=self.dimensions or None,\n deployment=self.deployment or None,\n api_version=self.openai_api_version or None,\n base_url=self.openai_api_base or None,\n openai_api_type=self.openai_api_type or None,\n openai_proxy=self.openai_proxy or None,\n embedding_ctx_length=self.embedding_ctx_length,\n api_key=self.openai_api_key or None,\n organization=self.openai_organization or None,\n allowed_special=\"all\",\n disallowed_special=\"all\",\n chunk_size=self.chunk_size,\n max_retries=self.max_retries,\n timeout=self.request_timeout or None,\n tiktoken_enabled=self.tiktoken_enable,\n tiktoken_model_name=self.tiktoken_model_name or None,\n show_progress_bar=self.show_progress_bar,\n model_kwargs=self.model_kwargs,\n skip_empty=self.skip_empty,\n default_headers=self.default_headers or None,\n default_query=self.default_query or None,\n )\n"
src/lfx/tests/data/starter_projects_1_6_0/Vector Store RAG.json-1853- },
src/lfx/tests/data/starter_projects_1_6_0/Vector Store RAG.json-1854- "default_headers": {
src/lfx/tests/data/starter_projects_1_6_0/Vector Store RAG.json-1855- "_input_type": "DictInput",
src/lfx/tests/data/starter_projects_1_6_0/Vector Store RAG.json-1856- "advanced": true,
src/lfx/tests/data/starter_projects_1_6_0/Vector Store RAG.json-1857- "display_name": "Default Headers",
--
src/lfx/tests/data/starter_projects_1_6_0/Knowledge Ingestion.json-786- "title_case": false,
src/lfx/tests/data/starter_projects_1_6_0/Knowledge Ingestion.json-787- "type": "code",
src/lfx/tests/data/starter_projects_1_6_0/Knowledge Ingestion.json:788: "value": "from __future__ import annotations\n\nimport asyncio\nimport contextlib\nimport hashlib\nimport json\nimport re\nimport uuid\nfrom dataclasses import asdict, dataclass, field\nfrom datetime import datetime, timezone\nfrom pathlib import Path\nfrom typing import TYPE_CHECKING, Any\n\nimport pandas as pd\nfrom cryptography.fernet import InvalidToken\nfrom langchain_chroma import Chroma\nfrom loguru import logger\n\nfrom langflow.base.knowledge_bases.knowledge_base_utils import get_knowledge_bases\nfrom langflow.base.models.openai_constants import OPENAI_EMBEDDING_MODEL_NAMES\nfrom langflow.components.processing.converter import convert_to_dataframe\nfrom langflow.custom import Component\nfrom langflow.io import (\n BoolInput,\n DropdownInput,\n HandleInput,\n IntInput,\n Output,\n SecretStrInput,\n StrInput,\n TableInput,\n)\nfrom langflow.schema.data import Data\nfrom langflow.schema.dotdict import dotdict # noqa: TC001\nfrom langflow.schema.table import EditMode\nfrom langflow.services.auth.utils import decrypt_api_key, encrypt_api_key\nfrom langflow.services.database.models.user.crud import get_user_by_id\nfrom langflow.services.deps import (\n get_settings_service,\n get_variable_service,\n session_scope,\n)\n\nif TYPE_CHECKING:\n from langflow.schema.dataframe import DataFrame\n\nHUGGINGFACE_MODEL_NAMES = [\n \"sentence-transformers/all-MiniLM-L6-v2\",\n \"sentence-transformers/all-mpnet-base-v2\",\n]\nCOHERE_MODEL_NAMES = [\"embed-english-v3.0\", \"embed-multilingual-v3.0\"]\n\nsettings = get_settings_service().settings\nknowledge_directory = settings.knowledge_bases_dir\nif not knowledge_directory:\n msg = \"Knowledge bases directory is not set in the settings.\"\n raise ValueError(msg)\nKNOWLEDGE_BASES_ROOT_PATH = Path(knowledge_directory).expanduser()\n\n\nclass KnowledgeIngestionComponent(Component):\n \"\"\"Create or append to Langflow Knowledge from a DataFrame.\"\"\"\n\n # ------ UI metadata ---------------------------------------------------\n display_name = \"Knowledge Ingestion\"\n description = \"Create or update knowledge in Langflow.\"\n icon = \"upload\"\n name = \"KnowledgeIngestion\"\n\n def __init__(self, *args, **kwargs) -> None:\n super().__init__(*args, **kwargs)\n self._cached_kb_path: Path | None = None\n\n @dataclass\n class NewKnowledgeBaseInput:\n functionality: str = \"create\"\n fields: dict[str, dict] = field(\n default_factory=lambda: {\n \"data\": {\n \"node\": {\n \"name\": \"create_knowledge_base\",\n \"description\": \"Create new knowledge in Langflow.\",\n \"display_name\": \"Create new knowledge\",\n \"field_order\": [\n \"01_new_kb_name\",\n \"02_embedding_model\",\n \"03_api_key\",\n ],\n \"template\": {\n \"01_new_kb_name\": StrInput(\n name=\"new_kb_name\",\n display_name=\"Knowledge Name\",\n info=\"Name of the new knowledge to create.\",\n required=True,\n ),\n \"02_embedding_model\": DropdownInput(\n name=\"embedding_model\",\n display_name=\"Choose Embedding\",\n info=\"Select the embedding model to use for this knowledge base.\",\n required=True,\n options=OPENAI_EMBEDDING_MODEL_NAMES + HUGGINGFACE_MODEL_NAMES + COHERE_MODEL_NAMES,\n options_metadata=[{\"icon\": \"OpenAI\"} for _ in OPENAI_EMBEDDING_MODEL_NAMES]\n + [{\"icon\": \"HuggingFace\"} for _ in HUGGINGFACE_MODEL_NAMES]\n + [{\"icon\": \"Cohere\"} for _ in COHERE_MODEL_NAMES],\n ),\n \"03_api_key\": SecretStrInput(\n name=\"api_key\",\n display_name=\"API Key\",\n info=\"Provider API key for embedding 
model\",\n required=True,\n load_from_db=False,\n ),\n },\n },\n }\n }\n )\n\n # ------ Inputs --------------------------------------------------------\n inputs = [\n DropdownInput(\n name=\"knowledge_base\",\n display_name=\"Knowledge\",\n info=\"Select the knowledge to load data from.\",\n required=True,\n options=[],\n refresh_button=True,\n real_time_refresh=True,\n dialog_inputs=asdict(NewKnowledgeBaseInput()),\n ),\n HandleInput(\n name=\"input_df\",\n display_name=\"Input\",\n info=(\n \"Table with all original columns (already chunked / processed). \"\n \"Accepts Data or DataFrame. If Data is provided, it is converted to a DataFrame automatically.\"\n ),\n input_types=[\"Data\", \"DataFrame\"],\n required=True,\n ),\n TableInput(\n name=\"column_config\",\n display_name=\"Column Configuration\",\n info=\"Configure column behavior for the knowledge base.\",\n required=True,\n table_schema=[\n {\n \"name\": \"column_name\",\n \"display_name\": \"Column Name\",\n \"type\": \"str\",\n \"description\": \"Name of the column in the source DataFrame\",\n \"edit_mode\": EditMode.INLINE,\n },\n {\n \"name\": \"vectorize\",\n \"display_name\": \"Vectorize\",\n \"type\": \"boolean\",\n \"description\": \"Create embeddings for this column\",\n \"default\": False,\n \"edit_mode\": EditMode.INLINE,\n },\n {\n \"name\": \"identifier\",\n \"display_name\": \"Identifier\",\n \"type\": \"boolean\",\n \"description\": \"Use this column as unique identifier\",\n \"default\": False,\n \"edit_mode\": EditMode.INLINE,\n },\n ],\n value=[\n {\n \"column_name\": \"text\",\n \"vectorize\": True,\n \"identifier\": True,\n },\n ],\n ),\n IntInput(\n name=\"chunk_size\",\n display_name=\"Chunk Size\",\n info=\"Batch size for processing embeddings\",\n advanced=True,\n value=1000,\n ),\n SecretStrInput(\n name=\"api_key\",\n display_name=\"Embedding Provider API Key\",\n info=\"API key for the embedding provider to generate embeddings.\",\n advanced=True,\n required=False,\n ),\n BoolInput(\n name=\"allow_duplicates\",\n display_name=\"Allow Duplicates\",\n info=\"Allow duplicate rows in the knowledge base\",\n advanced=True,\n value=False,\n ),\n ]\n\n # ------ Outputs -------------------------------------------------------\n outputs = [Output(display_name=\"Results\", name=\"dataframe_output\", method=\"build_kb_info\")]\n\n # ------ Internal helpers ---------------------------------------------\n def _get_kb_root(self) -> Path:\n \"\"\"Return the root directory for knowledge bases.\"\"\"\n return KNOWLEDGE_BASES_ROOT_PATH\n\n def _validate_column_config(self, df_source: pd.DataFrame) -> list[dict[str, Any]]:\n \"\"\"Validate column configuration using Structured Output patterns.\"\"\"\n if not self.column_config:\n msg = \"Column configuration cannot be empty\"\n raise ValueError(msg)\n\n # Convert table input to list of dicts (similar to Structured Output)\n config_list = self.column_config if isinstance(self.column_config, list) else []\n\n # Validate column names exist in DataFrame\n df_columns = set(df_source.columns)\n for config in config_list:\n col_name = config.get(\"column_name\")\n if col_name not in df_columns:\n msg = f\"Column '{col_name}' not found in DataFrame. 
Available columns: {sorted(df_columns)}\"\n raise ValueError(msg)\n\n return config_list\n\n def _get_embedding_provider(self, embedding_model: str) -> str:\n \"\"\"Get embedding provider by matching model name to lists.\"\"\"\n if embedding_model in OPENAI_EMBEDDING_MODEL_NAMES:\n return \"OpenAI\"\n if embedding_model in HUGGINGFACE_MODEL_NAMES:\n return \"HuggingFace\"\n if embedding_model in COHERE_MODEL_NAMES:\n return \"Cohere\"\n return \"Custom\"\n\n def _build_embeddings(self, embedding_model: str, api_key: str):\n \"\"\"Build embedding model using provider patterns.\"\"\"\n # Get provider by matching model name to lists\n provider = self._get_embedding_provider(embedding_model)\n\n # Validate provider and model\n if provider == \"OpenAI\":\n from langchain_openai import OpenAIEmbeddings\n\n if not api_key:\n msg = \"OpenAI API key is required when using OpenAI provider\"\n raise ValueError(msg)\n return OpenAIEmbeddings(\n model=embedding_model,\n api_key=api_key,\n chunk_size=self.chunk_size,\n )\n if provider == \"HuggingFace\":\n from langchain_huggingface import HuggingFaceEmbeddings\n\n return HuggingFaceEmbeddings(\n model=embedding_model,\n )\n if provider == \"Cohere\":\n from langchain_cohere import CohereEmbeddings\n\n if not api_key:\n msg = \"Cohere API key is required when using Cohere provider\"\n raise ValueError(msg)\n return CohereEmbeddings(\n model=embedding_model,\n cohere_api_key=api_key,\n )\n if provider == \"Custom\":\n # For custom embedding models, we would need additional configuration\n msg = \"Custom embedding models not yet supported\"\n raise NotImplementedError(msg)\n msg = f\"Unknown provider: {provider}\"\n raise ValueError(msg)\n\n def _build_embedding_metadata(self, embedding_model, api_key) -> dict[str, Any]:\n \"\"\"Build embedding model metadata.\"\"\"\n # Get provider by matching model name to lists\n embedding_provider = self._get_embedding_provider(embedding_model)\n\n api_key_to_save = None\n if api_key and hasattr(api_key, \"get_secret_value\"):\n api_key_to_save = api_key.get_secret_value()\n elif isinstance(api_key, str):\n api_key_to_save = api_key\n\n encrypted_api_key = None\n if api_key_to_save:\n settings_service = get_settings_service()\n try:\n encrypted_api_key = encrypt_api_key(api_key_to_save, settings_service=settings_service)\n except (TypeError, ValueError) as e:\n self.log(f\"Could not encrypt API key: {e}\")\n logger.error(f\"Could not encrypt API key: {e}\")\n\n return {\n \"embedding_provider\": embedding_provider,\n \"embedding_model\": embedding_model,\n \"api_key\": encrypted_api_key,\n \"api_key_used\": bool(api_key),\n \"chunk_size\": self.chunk_size,\n \"created_at\": datetime.now(timezone.utc).isoformat(),\n }\n\n def _save_embedding_metadata(self, kb_path: Path, embedding_model: str, api_key: str) -> None:\n \"\"\"Save embedding model metadata.\"\"\"\n embedding_metadata = self._build_embedding_metadata(embedding_model, api_key)\n metadata_path = kb_path / \"embedding_metadata.json\"\n metadata_path.write_text(json.dumps(embedding_metadata, indent=2))\n\n def _save_kb_files(\n self,\n kb_path: Path,\n config_list: list[dict[str, Any]],\n ) -> None:\n \"\"\"Save KB files using File Component storage patterns.\"\"\"\n try:\n # Create directory (following File Component patterns)\n kb_path.mkdir(parents=True, exist_ok=True)\n\n # Save column configuration\n # Only do this if the file doesn't exist already\n cfg_path = kb_path / \"schema.json\"\n if not cfg_path.exists():\n 
cfg_path.write_text(json.dumps(config_list, indent=2))\n\n except (OSError, TypeError, ValueError) as e:\n self.log(f\"Error saving KB files: {e}\")\n\n def _build_column_metadata(self, config_list: list[dict[str, Any]], df_source: pd.DataFrame) -> dict[str, Any]:\n \"\"\"Build detailed column metadata.\"\"\"\n metadata: dict[str, Any] = {\n \"total_columns\": len(df_source.columns),\n \"mapped_columns\": len(config_list),\n \"unmapped_columns\": len(df_source.columns) - len(config_list),\n \"columns\": [],\n \"summary\": {\"vectorized_columns\": [], \"identifier_columns\": []},\n }\n\n for config in config_list:\n col_name = config.get(\"column_name\")\n vectorize = config.get(\"vectorize\") == \"True\" or config.get(\"vectorize\") is True\n identifier = config.get(\"identifier\") == \"True\" or config.get(\"identifier\") is True\n\n # Add to columns list\n metadata[\"columns\"].append(\n {\n \"name\": col_name,\n \"vectorize\": vectorize,\n \"identifier\": identifier,\n }\n )\n\n # Update summary\n if vectorize:\n metadata[\"summary\"][\"vectorized_columns\"].append(col_name)\n if identifier:\n metadata[\"summary\"][\"identifier_columns\"].append(col_name)\n\n return metadata\n\n async def _create_vector_store(\n self,\n df_source: pd.DataFrame,\n config_list: list[dict[str, Any]],\n embedding_model: str,\n api_key: str,\n ) -> None:\n \"\"\"Create vector store following Local DB component pattern.\"\"\"\n try:\n # Set up vector store directory\n vector_store_dir = await self._kb_path()\n if not vector_store_dir:\n msg = \"Knowledge base path is not set. Please create a new knowledge base first.\"\n raise ValueError(msg)\n vector_store_dir.mkdir(parents=True, exist_ok=True)\n\n # Create embeddings model\n embedding_function = self._build_embeddings(embedding_model, api_key)\n\n # Convert DataFrame to Data objects (following Local DB pattern)\n data_objects = await self._convert_df_to_data_objects(df_source, config_list)\n\n # Create vector store\n chroma = Chroma(\n persist_directory=str(vector_store_dir),\n embedding_function=embedding_function,\n collection_name=self.knowledge_base,\n )\n\n # Convert Data objects to LangChain Documents\n documents = []\n for data_obj in data_objects:\n doc = data_obj.to_lc_document()\n documents.append(doc)\n\n # Add documents to vector store\n if documents:\n chroma.add_documents(documents)\n self.log(f\"Added {len(documents)} documents to vector store '{self.knowledge_base}'\")\n\n except (OSError, ValueError, RuntimeError) as e:\n self.log(f\"Error creating vector store: {e}\")\n\n async def _convert_df_to_data_objects(\n self, df_source: pd.DataFrame, config_list: list[dict[str, Any]]\n ) -> list[Data]:\n \"\"\"Convert DataFrame to Data objects for vector store.\"\"\"\n data_objects: list[Data] = []\n\n # Set up vector store directory\n kb_path = await self._kb_path()\n\n # If we don't allow duplicates, we need to get the existing hashes\n chroma = Chroma(\n persist_directory=str(kb_path),\n collection_name=self.knowledge_base,\n )\n\n # Get all documents and their metadata\n all_docs = chroma.get()\n\n # Extract all _id values from metadata\n id_list = [metadata.get(\"_id\") for metadata in all_docs[\"metadatas\"] if metadata.get(\"_id\")]\n\n # Get column roles\n content_cols = []\n identifier_cols = []\n\n for config in config_list:\n col_name = config.get(\"column_name\")\n vectorize = config.get(\"vectorize\") == \"True\" or config.get(\"vectorize\") is True\n identifier = config.get(\"identifier\") == \"True\" or config.get(\"identifier\") is 
True\n\n if vectorize:\n content_cols.append(col_name)\n elif identifier:\n identifier_cols.append(col_name)\n\n # Convert each row to a Data object\n for _, row in df_source.iterrows():\n # Build content text from identifier columns using list comprehension\n identifier_parts = [str(row[col]) for col in content_cols if col in row and pd.notna(row[col])]\n\n # Join all parts into a single string\n page_content = \" \".join(identifier_parts)\n\n # Build metadata from NON-vectorized columns only (simple key-value pairs)\n data_dict = {\n \"text\": page_content, # Main content for vectorization\n }\n\n # Add identifier columns if they exist\n if identifier_cols:\n identifier_parts = [str(row[col]) for col in identifier_cols if col in row and pd.notna(row[col])]\n page_content = \" \".join(identifier_parts)\n\n # Add metadata columns as simple key-value pairs\n for col in df_source.columns:\n if col not in content_cols and col in row and pd.notna(row[col]):\n # Convert to simple types for Chroma metadata\n value = row[col]\n data_dict[col] = str(value) # Convert complex types to string\n\n # Hash the page_content for unique ID\n page_content_hash = hashlib.sha256(page_content.encode()).hexdigest()\n data_dict[\"_id\"] = page_content_hash\n\n # If duplicates are disallowed, and hash exists, prevent adding this row\n if not self.allow_duplicates and page_content_hash in id_list:\n self.log(f\"Skipping duplicate row with hash {page_content_hash}\")\n continue\n\n # Create Data object - everything except \"text\" becomes metadata\n data_obj = Data(data=data_dict)\n data_objects.append(data_obj)\n\n return data_objects\n\n def is_valid_collection_name(self, name, min_length: int = 3, max_length: int = 63) -> bool:\n \"\"\"Validates collection name against conditions 1-3.\n\n 1. Contains 3-63 characters\n 2. Starts and ends with alphanumeric character\n 3. 
Contains only alphanumeric characters, underscores, or hyphens.\n\n Args:\n name (str): Collection name to validate\n min_length (int): Minimum length of the name\n max_length (int): Maximum length of the name\n\n Returns:\n bool: True if valid, False otherwise\n \"\"\"\n # Check length (condition 1)\n if not (min_length <= len(name) <= max_length):\n return False\n\n # Check start/end with alphanumeric (condition 2)\n if not (name[0].isalnum() and name[-1].isalnum()):\n return False\n\n # Check allowed characters (condition 3)\n return re.match(r\"^[a-zA-Z0-9_-]+$\", name) is not None\n\n async def _kb_path(self) -> Path | None:\n # Check if we already have the path cached\n cached_path = getattr(self, \"_cached_kb_path\", None)\n if cached_path is not None:\n return cached_path\n\n # If not cached, compute it\n async with session_scope() as db:\n if not self.user_id:\n msg = \"User ID is required for fetching knowledge base path.\"\n raise ValueError(msg)\n current_user = await get_user_by_id(db, self.user_id)\n if not current_user:\n msg = f\"User with ID {self.user_id} not found.\"\n raise ValueError(msg)\n kb_user = current_user.username\n\n kb_root = self._get_kb_root()\n\n # Cache the result\n self._cached_kb_path = kb_root / kb_user / self.knowledge_base\n\n return self._cached_kb_path\n\n # ---------------------------------------------------------------------\n # OUTPUT METHODS\n # ---------------------------------------------------------------------\n async def build_kb_info(self) -> Data:\n \"\"\"Main ingestion routine → returns a dict with KB metadata.\"\"\"\n try:\n input_value = self.input_df[0] if isinstance(self.input_df, list) else self.input_df\n df_source: DataFrame = convert_to_dataframe(input_value)\n\n # Validate column configuration (using Structured Output patterns)\n config_list = self._validate_column_config(df_source)\n column_metadata = self._build_column_metadata(config_list, df_source)\n\n # Read the embedding info from the knowledge base folder\n kb_path = await self._kb_path()\n if not kb_path:\n msg = \"Knowledge base path is not set. Please create a new knowledge base first.\"\n raise ValueError(msg)\n metadata_path = kb_path / \"embedding_metadata.json\"\n\n # If the API key is not provided, try to read it from the metadata file\n if metadata_path.exists():\n settings_service = get_settings_service()\n metadata = json.loads(metadata_path.read_text())\n embedding_model = metadata.get(\"embedding_model\")\n try:\n api_key = decrypt_api_key(metadata[\"api_key\"], settings_service)\n except (InvalidToken, TypeError, ValueError) as e:\n logger.error(f\"Could not decrypt API key. Please provide it manually. 
Error: {e}\")\n\n # Check if a custom API key was provided, update metadata if so\n if self.api_key:\n api_key = self.api_key\n self._save_embedding_metadata(\n kb_path=kb_path,\n embedding_model=embedding_model,\n api_key=api_key,\n )\n\n # Create vector store following Local DB component pattern\n await self._create_vector_store(df_source, config_list, embedding_model=embedding_model, api_key=api_key)\n\n # Save KB files (using File Component storage patterns)\n self._save_kb_files(kb_path, config_list)\n\n # Build metadata response\n meta: dict[str, Any] = {\n \"kb_id\": str(uuid.uuid4()),\n \"kb_name\": self.knowledge_base,\n \"rows\": len(df_source),\n \"column_metadata\": column_metadata,\n \"path\": str(kb_path),\n \"config_columns\": len(config_list),\n \"timestamp\": datetime.now(tz=timezone.utc).isoformat(),\n }\n\n # Set status message\n self.status = f\"✅ KB **{self.knowledge_base}** saved · {len(df_source)} chunks.\"\n\n return Data(data=meta)\n\n except (OSError, ValueError, RuntimeError, KeyError) as e:\n msg = f\"Error during KB ingestion: {e}\"\n raise RuntimeError(msg) from e\n\n async def _get_api_key_variable(self, field_value: dict[str, Any]):\n async with session_scope() as db:\n if not self.user_id:\n msg = \"User ID is required for fetching global variables.\"\n raise ValueError(msg)\n current_user = await get_user_by_id(db, self.user_id)\n if not current_user:\n msg = f\"User with ID {self.user_id} not found.\"\n raise ValueError(msg)\n variable_service = get_variable_service()\n\n # Process the api_key field variable\n return await variable_service.get_variable(\n user_id=current_user.id,\n name=field_value[\"03_api_key\"],\n field=\"\",\n session=db,\n )\n\n async def update_build_config(\n self,\n build_config: dotdict,\n field_value: Any,\n field_name: str | None = None,\n ) -> dotdict:\n \"\"\"Update build configuration based on provider selection.\"\"\"\n # Create a new knowledge base\n if field_name == \"knowledge_base\":\n async with session_scope() as db:\n if not self.user_id:\n msg = \"User ID is required for fetching knowledge base list.\"\n raise ValueError(msg)\n current_user = await get_user_by_id(db, self.user_id)\n if not current_user:\n msg = f\"User with ID {self.user_id} not found.\"\n raise ValueError(msg)\n kb_user = current_user.username\n if isinstance(field_value, dict) and \"01_new_kb_name\" in field_value:\n # Validate the knowledge base name - Make sure it follows these rules:\n if not self.is_valid_collection_name(field_value[\"01_new_kb_name\"]):\n msg = f\"Invalid knowledge base name: {field_value['01_new_kb_name']}\"\n raise ValueError(msg)\n\n api_key = field_value.get(\"03_api_key\", None)\n with contextlib.suppress(Exception):\n # If the API key is a variable, resolve it\n api_key = await self._get_api_key_variable(field_value)\n\n # Make sure api_key is a string\n if not isinstance(api_key, str):\n msg = \"API key must be a string.\"\n raise ValueError(msg)\n\n # We need to test the API Key one time against the embedding model\n embed_model = self._build_embeddings(embedding_model=field_value[\"02_embedding_model\"], api_key=api_key)\n\n # Try to generate a dummy embedding to validate the API key without blocking the event loop\n try:\n await asyncio.wait_for(\n asyncio.to_thread(embed_model.embed_query, \"test\"),\n timeout=10,\n )\n except TimeoutError as e:\n msg = \"Embedding validation timed out. 
Please verify network connectivity and key.\"\n raise ValueError(msg) from e\n except Exception as e:\n msg = f\"Embedding validation failed: {e!s}\"\n raise ValueError(msg) from e\n\n # Create the new knowledge base directory\n kb_path = KNOWLEDGE_BASES_ROOT_PATH / kb_user / field_value[\"01_new_kb_name\"]\n kb_path.mkdir(parents=True, exist_ok=True)\n\n # Save the embedding metadata\n build_config[\"knowledge_base\"][\"value\"] = field_value[\"01_new_kb_name\"]\n self._save_embedding_metadata(\n kb_path=kb_path,\n embedding_model=field_value[\"02_embedding_model\"],\n api_key=api_key,\n )\n\n # Update the knowledge base options dynamically\n build_config[\"knowledge_base\"][\"options\"] = await get_knowledge_bases(\n KNOWLEDGE_BASES_ROOT_PATH,\n user_id=self.user_id,\n )\n\n # If the selected knowledge base is not available, reset it\n if build_config[\"knowledge_base\"][\"value\"] not in build_config[\"knowledge_base\"][\"options\"]:\n build_config[\"knowledge_base\"][\"value\"] = None\n\n return build_config\n"
src/lfx/tests/data/starter_projects_1_6_0/Knowledge Ingestion.json-789- },
src/lfx/tests/data/starter_projects_1_6_0/Knowledge Ingestion.json-790- "column_config": {
src/lfx/tests/data/starter_projects_1_6_0/Knowledge Ingestion.json-791- "_input_type": "TableInput",
src/lfx/tests/data/starter_projects_1_6_0/Knowledge Ingestion.json-792- "advanced": false,
src/lfx/tests/data/starter_projects_1_6_0/Knowledge Ingestion.json-793- "display_name": "Column Configuration",
--
src/lfx/src/lfx/components/openai/openai.py-2-
src/lfx/src/lfx/components/openai/openai.py-3-from lfx.base.embeddings.model import LCEmbeddingsModel
src/lfx/src/lfx/components/openai/openai.py:4:from lfx.base.models.openai_constants import OPENAI_EMBEDDING_MODEL_NAMES
src/lfx/src/lfx/components/openai/openai.py-5-from lfx.field_typing import Embeddings
src/lfx/src/lfx/components/openai/openai.py-6-from lfx.io import BoolInput, DictInput, DropdownInput, FloatInput, IntInput, MessageTextInput, SecretStrInput
src/lfx/src/lfx/components/openai/openai.py-7-
src/lfx/src/lfx/components/openai/openai.py-8-
src/lfx/src/lfx/components/openai/openai.py-9-class OpenAIEmbeddingsComponent(LCEmbeddingsModel):
--
src/lfx/src/lfx/components/openai/openai.py-35- display_name="Model",
src/lfx/src/lfx/components/openai/openai.py-36- advanced=False,
src/lfx/src/lfx/components/openai/openai.py:37: options=OPENAI_EMBEDDING_MODEL_NAMES,
src/lfx/src/lfx/components/openai/openai.py-38- value="text-embedding-3-small",
src/lfx/src/lfx/components/openai/openai.py-39- ),
src/lfx/src/lfx/components/openai/openai.py-40- DictInput(name="model_kwargs", display_name="Model Kwargs", advanced=True),
src/lfx/src/lfx/components/openai/openai.py-41- SecretStrInput(name="openai_api_key", display_name="OpenAI API Key", value="OPENAI_API_KEY", required=True),
src/lfx/src/lfx/components/openai/openai.py-42- MessageTextInput(name="openai_api_base", display_name="OpenAI API Base", advanced=True),
--
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-8-from lfx.base.embeddings.model import LCEmbeddingsModel
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-9-from lfx.base.models.model_utils import get_ollama_models, is_valid_ollama_url
src/lfx/src/lfx/components/models_and_agents/embedding_model.py:10:from lfx.base.models.openai_constants import OPENAI_EMBEDDING_MODEL_NAMES
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-11-from lfx.base.models.watsonx_constants import (
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-12- IBM_WATSONX_URLS,
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-13- WATSONX_EMBEDDING_MODEL_NAMES,
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-14-)
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-15-from lfx.field_typing import Embeddings
--
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-81- name="model",
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-82- display_name="Model Name",
src/lfx/src/lfx/components/models_and_agents/embedding_model.py:83: options=OPENAI_EMBEDDING_MODEL_NAMES,
src/lfx/src/lfx/components/models_and_agents/embedding_model.py:84: value=OPENAI_EMBEDDING_MODEL_NAMES[0],
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-85- info="Select the embedding model to use",
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-86- real_time_refresh=True,
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-87- refresh_button=True,
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-88- ),
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-89-
... [stdout truncated: output exceeded 50000 characters (4027478 total). Please modify your command to reduce output size, for example by:
- Using grep/rg to filter results
- Using head/tail to limit lines
- Using --max-count flag with grep/rg
- Processing output with awk/sed to extract only needed information]
</shell_result>
OpenAI embedding models dimensions parameter support text-embedding-3 ada-002
<web_search_result>
Short answer: The dimensions parameter is supported for the text-embedding-3 family (e.g., text-embedding-3-small / -large) but not for text-embedding-ada-002 (ada-002 uses a fixed 1536-d vector). [1][2][3]
Sources:
- OpenAI Embeddings guide: default sizes and ability to shorten text-embedding-3 via dimensions. [1]
- OpenAI announcement: text-embedding-3-large up to 3072 dims and native support for shortening with a dimensions parameter. [2]
- API docs / reference noting the dimensions parameter is only supported on text-embedding-3 and newer (ada-002 is fixed-size 1536). [3]
</web_search_result>
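To make that concrete, here is a minimal check, assuming langchain_openai is installed and OPENAI_API_KEY is set (illustrative only, not part of this PR):

```python
from langchain_openai import OpenAIEmbeddings

# text-embedding-3-* accepts a dimensions override; the API shortens the vector server-side.
small = OpenAIEmbeddings(model="text-embedding-3-small", dimensions=256)
print(len(small.embed_query("hello")))  # 256

# text-embedding-ada-002 is fixed at 1536 dimensions; passing dimensions is rejected
# by the API, so the parameter must simply be omitted for this model.
ada = OpenAIEmbeddings(model="text-embedding-ada-002")
print(len(ada.embed_query("hello")))  # 1536
```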
Script executed:
# Find and read the openai_constants.py file
fd "openai_constants.py" -type fRepository: langflow-ai/langflow
<shell_result>
Repository: langflow-ai/langflow
Exit code: 2
stderr:
error: 'ype' isn't a valid value for '--type <filetype>'
[possible values: file, directory, symlink, executable, empty, socket, pipe]
For more information try '--help'
</shell_result>
Script executed:
# Read the embedding_model.py file around lines 187-200 for full context
cat -n src/lfx/src/lfx/components/models_and_agents/embedding_model.py | sed -n '160,220p'
Repository: langflow-ai/langflow
<shell_result>
Repository: langflow-ai/langflow
Exit code: 0
stdout:
160 base_url_ibm_watsonx = self.base_url_ibm_watsonx
161 ollama_base_url = self.ollama_base_url
162 dimensions = self.dimensions
163 chunk_size = self.chunk_size
164 request_timeout = self.request_timeout
165 max_retries = self.max_retries
166 show_progress_bar = self.show_progress_bar
167 model_kwargs = self.model_kwargs or {}
168
169 if provider == "OpenAI":
170 if not api_key:
171 msg = "OpenAI API key is required when using OpenAI provider"
172 raise ValueError(msg)
173
174 # Create the primary embedding instance
175 embeddings_instance = OpenAIEmbeddings(
176 model=model,
177 dimensions=dimensions or None,
178 base_url=api_base or None,
179 api_key=api_key,
180 chunk_size=chunk_size,
181 max_retries=max_retries,
182 timeout=request_timeout or None,
183 show_progress_bar=show_progress_bar,
184 model_kwargs=model_kwargs,
185 )
186
187 # Create dedicated instances for each available model
188 available_models_dict = {}
189 for model_name in OPENAI_EMBEDDING_MODEL_NAMES:
190 available_models_dict[model_name] = OpenAIEmbeddings(
191 model=model_name,
192 dimensions=dimensions or None, # Use same dimensions config for all
193 base_url=api_base or None,
194 api_key=api_key,
195 chunk_size=chunk_size,
196 max_retries=max_retries,
197 timeout=request_timeout or None,
198 show_progress_bar=show_progress_bar,
199 model_kwargs=model_kwargs,
200 )
201
202 return EmbeddingsWithModels(
203 embeddings=embeddings_instance,
204 available_models=available_models_dict,
205 )
206
207 if provider == "Ollama":
208 try:
209 from langchain_ollama import OllamaEmbeddings
210 except ImportError:
211 try:
212 from langchain_community.embeddings import OllamaEmbeddings
213 except ImportError:
214 msg = "Please install langchain-ollama: pip install langchain-ollama"
215 raise ImportError(msg) from None
216
217 transformed_base_url = transform_localhost_url(ollama_base_url)
218
219 # Check if URL contains /v1 suffix (OpenAI-compatible mode)
220 if transformed_base_url and transformed_base_url.rstrip("/").endswith("/v1"):
</shell_result>
Script executed:
# Find openai_constants.py file
fd openai_constants.py
Repository: langflow-ai/langflow
<shell_result>
Repository: langflow-ai/langflow
Exit code: 0
stdout:
src/backend/base/langflow/base/models/openai_constants.py
src/lfx/src/lfx/base/models/openai_constants.py
</shell_result>
Script executed:
# Read the lfx openai_constants.py file
cat src/lfx/src/lfx/base/models/openai_constants.py
Repository: langflow-ai/langflow
<shell_result>
Repository: langflow-ai/langflow
Exit code: 0
stdout:
from .model_metadata import create_model_metadata
# Unified model metadata - single source of truth
OPENAI_MODELS_DETAILED = [
# GPT-5 Series
create_model_metadata(
provider="OpenAI",
name="gpt-5.1",
icon="OpenAI",
tool_calling=True,
reasoning=True,
),
create_model_metadata(
provider="OpenAI",
name="gpt-5",
icon="OpenAI",
tool_calling=True,
reasoning=True,
),
create_model_metadata(
provider="OpenAI",
name="gpt-5-mini",
icon="OpenAI",
tool_calling=True,
reasoning=True,
),
create_model_metadata(
provider="OpenAI",
name="gpt-5-nano",
icon="OpenAI",
tool_calling=True,
reasoning=True,
),
create_model_metadata(
provider="OpenAI",
name="gpt-5-chat-latest",
icon="OpenAI",
tool_calling=False,
reasoning=True,
),
# Regular OpenAI Models
create_model_metadata(provider="OpenAI", name="gpt-4o-mini", icon="OpenAI", tool_calling=True),
create_model_metadata(provider="OpenAI", name="gpt-4o", icon="OpenAI", tool_calling=True),
create_model_metadata(provider="OpenAI", name="gpt-4.1", icon="OpenAI", tool_calling=True),
create_model_metadata(provider="OpenAI", name="gpt-4.1-mini", icon="OpenAI", tool_calling=True),
create_model_metadata(provider="OpenAI", name="gpt-4.1-nano", icon="OpenAI", tool_calling=True),
create_model_metadata(
provider="OpenAI", name="gpt-4.5-preview", icon="OpenAI", tool_calling=True, preview=True, not_supported=True
),
create_model_metadata(provider="OpenAI", name="gpt-4-turbo", icon="OpenAI", tool_calling=True),
create_model_metadata(
provider="OpenAI", name="gpt-4-turbo-preview", icon="OpenAI", tool_calling=True, preview=True
),
create_model_metadata(provider="OpenAI", name="gpt-4", icon="OpenAI", tool_calling=True),
create_model_metadata(provider="OpenAI", name="gpt-3.5-turbo", icon="OpenAI", tool_calling=True),
# Reasoning Models
create_model_metadata(provider="OpenAI", name="o1", icon="OpenAI", reasoning=True),
create_model_metadata(provider="OpenAI", name="o1-mini", icon="OpenAI", reasoning=True, not_supported=True),
create_model_metadata(provider="OpenAI", name="o1-pro", icon="OpenAI", reasoning=True, not_supported=True),
create_model_metadata(provider="OpenAI", name="o3-mini", icon="OpenAI", reasoning=True),
create_model_metadata(provider="OpenAI", name="o3", icon="OpenAI", reasoning=True),
create_model_metadata(provider="OpenAI", name="o3-pro", icon="OpenAI", reasoning=True),
create_model_metadata(provider="OpenAI", name="o4-mini", icon="OpenAI", reasoning=True),
create_model_metadata(provider="OpenAI", name="o4-mini-high", icon="OpenAI", reasoning=True),
# Search Models
create_model_metadata(
provider="OpenAI",
name="gpt-4o-mini-search-preview",
icon="OpenAI",
tool_calling=True,
search=True,
preview=True,
),
create_model_metadata(
provider="OpenAI",
name="gpt-4o-search-preview",
icon="OpenAI",
tool_calling=True,
search=True,
preview=True,
),
# Not Supported Models
create_model_metadata(
provider="OpenAI", name="computer-use-preview", icon="OpenAI", not_supported=True, preview=True
),
create_model_metadata(
provider="OpenAI", name="gpt-4o-audio-preview", icon="OpenAI", not_supported=True, preview=True
),
create_model_metadata(
provider="OpenAI", name="gpt-4o-realtime-preview", icon="OpenAI", not_supported=True, preview=True
),
create_model_metadata(
provider="OpenAI", name="gpt-4o-mini-audio-preview", icon="OpenAI", not_supported=True, preview=True
),
create_model_metadata(
provider="OpenAI", name="gpt-4o-mini-realtime-preview", icon="OpenAI", not_supported=True, preview=True
),
]
OPENAI_CHAT_MODEL_NAMES = [
metadata["name"]
for metadata in OPENAI_MODELS_DETAILED
if not metadata.get("not_supported", False)
and not metadata.get("reasoning", False)
and not metadata.get("search", False)
]
OPENAI_REASONING_MODEL_NAMES = [
metadata["name"]
for metadata in OPENAI_MODELS_DETAILED
if metadata.get("reasoning", False) and not metadata.get("not_supported", False)
]
OPENAI_SEARCH_MODEL_NAMES = [
metadata["name"]
for metadata in OPENAI_MODELS_DETAILED
if metadata.get("search", False) and not metadata.get("not_supported", False)
]
NOT_SUPPORTED_MODELS = [metadata["name"] for metadata in OPENAI_MODELS_DETAILED if metadata.get("not_supported", False)]
OPENAI_EMBEDDING_MODEL_NAMES = [
"text-embedding-3-small",
"text-embedding-3-large",
"text-embedding-ada-002",
]
# Backwards compatibility
MODEL_NAMES = OPENAI_CHAT_MODEL_NAMES
OPENAI_MODEL_NAMES = OPENAI_CHAT_MODEL_NAMES
</shell_result>
Both performance and dimension compatibility concerns are confirmed; the suggested fix is accurate.
The code creates embedding instances for all three OpenAI models (text-embedding-3-small, text-embedding-3-large, text-embedding-ada-002) upfront on lines 189–200, regardless of which model is selected. Additionally, the dimensions parameter is applied identically to all models on line 192.
Issues confirmed:
- Eager instantiation is wasteful: Creating instances for all models when typically only one is used.
- Dimension compatibility bug: Only text-embedding-3-* models support the dimensions parameter; text-embedding-ada-002 uses a fixed 1536-dimensional embedding and does not accept this parameter. Passing a non-None value would cause an API error.
The suggested fix—filtering dimensions based on model name—is correct and matches OpenAI's API behavior:
# Create dedicated instances for each available model
available_models_dict = {}
for model_name in OPENAI_EMBEDDING_MODEL_NAMES:
+ # Only apply dimensions to models that support it
+ model_dimensions = dimensions if model_name.startswith("text-embedding-3") else None
available_models_dict[model_name] = OpenAIEmbeddings(
model=model_name,
- dimensions=dimensions or None, # Use same dimensions config for all
+ dimensions=model_dimensions,
base_url=api_base or None,
api_key=api_key,
chunk_size=chunk_size,
max_retries=max_retries,
timeout=request_timeout or None,
show_progress_bar=show_progress_bar,
model_kwargs=model_kwargs,
    )
🤖 Prompt for AI Agents
In src/lfx/src/lfx/components/models_and_agents/embedding_model.py around lines
187–200, fix two issues: stop passing a global dimensions value to models that
don't accept it (text-embedding-ada-002) and avoid eagerly creating all model
instances. Update the loop to set model_dimensions = dimensions only for models
whose name starts with "text-embedding-3" (otherwise use None) and either
instantiate models lazily (store a factory/callable and create only when
requested) or only create the single selected model instead of pre-instantiating
all of them; ensure other parameters remain unchanged.
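For reference, the loop with that filter applied would look roughly like the excerpt below; the variable names are the ones already in scope in build_embeddings, and instantiation stays eager, so the lazy-factory variant mentioned above remains a separate follow-up:

```python
# Create dedicated instances for each available model, but only pass a
# dimensions override to the text-embedding-3 family; ada-002 is fixed at 1536.
available_models_dict = {}
for model_name in OPENAI_EMBEDDING_MODEL_NAMES:
    model_dimensions = dimensions if model_name.startswith("text-embedding-3") else None
    available_models_dict[model_name] = OpenAIEmbeddings(
        model=model_name,
        dimensions=model_dimensions,
        base_url=api_base or None,
        api_key=api_key,
        chunk_size=chunk_size,
        max_retries=max_retries,
        timeout=request_timeout or None,
        show_progress_bar=show_progress_bar,
        model_kwargs=model_kwargs,
    )
```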
# Fetch available IBM watsonx.ai models
available_model_names = self.fetch_ibm_models(final_url)
Blocking I/O in async context.
fetch_ibm_models uses synchronous requests.get() (lines 138-153), which blocks the event loop when called from this async method. This can cause performance degradation and unresponsiveness in async applications.
Consider either:
- Converting fetch_ibm_models to an async method using httpx.AsyncClient
- Running it in a thread executor:
+ import asyncio
+
# Fetch available IBM watsonx.ai models
- available_model_names = self.fetch_ibm_models(final_url)
+ available_model_names = await asyncio.get_event_loop().run_in_executor(
+ None, self.fetch_ibm_models, final_url
+ )
Alternatively, convert fetch_ibm_models to async similar to get_ollama_models.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
# Fetch available IBM watsonx.ai models
available_model_names = self.fetch_ibm_models(final_url)
import asyncio

# Fetch available IBM watsonx.ai models
available_model_names = await asyncio.get_event_loop().run_in_executor(
    None, self.fetch_ibm_models, final_url
)
🤖 Prompt for AI Agents
In src/lfx/src/lfx/components/models_and_agents/embedding_model.py around lines
303-304, the call to fetch_ibm_models(final_url) is invoked from an async method
but fetch_ibm_models (lines ~138-153) uses blocking requests.get(), which will
block the event loop; fix by either converting fetch_ibm_models to an async
function using an async HTTP client (e.g., httpx.AsyncClient) and await it where
called, or keep it sync but call it via an executor (e.g., await
asyncio.to_thread(self.fetch_ibm_models, final_url)); update the function
signature and imports accordingly and ensure error handling and timeouts are
preserved in the chosen async approach.
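A minimal sketch of the executor route, assuming fetch_ibm_models keeps its current synchronous signature; asyncio.to_thread is just a simpler spelling of the run_in_executor call shown above:

```python
import asyncio

# Inside the async build_embeddings() path for the IBM watsonx.ai provider:
# run the blocking requests.get()-based helper in a worker thread so the event
# loop stays responsive while the model list is fetched.
available_model_names = await asyncio.to_thread(self.fetch_ibm_models, final_url)
```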
# Create dedicated instances for each available model
available_models_dict = {}
for model_name in available_model_names:
    available_models_dict[model_name] = WatsonxEmbeddings(
        model_id=model_name,
        params=params,
        watsonx_client=api_client,
        project_id=project_id,
    )
🧩 Analysis chain
IBM watsonx.ai Python SDK APIClient thread safety documentation
<web_search_result>
Short answer: IBM's watsonx.ai Python SDK does not publish an explicit, general "APIClient is thread‑safe" guarantee. Practical guidance from source/docs implies you should treat APIClient/persistent HTTP connections cautiously in multithreaded code — best practice is to create one APIClient (or one HTTP session) per thread or otherwise avoid sharing a single APIClient across threads. [1][2][3]
Sources:
- SDK source (party_wrapper) — shows internal use of requests/connection wrappers (no public thread‑safety claim). [1]
- LlamaIndex integration note — notes the SDK enabled persistent_connection by default (since 1.1.2) and that persistent connections can cause problems in environments without long‑running connections (i.e., indicates connection behavior you must consider in concurrent scenarios). [2]
- LangChain / integration docs — show patterns where callers create and pass an APIClient instance into higher‑level wrappers (implying explicit management of the client object). [3]
If you want, I can fetch specific SDK README or source lines (or check a specific SDK version) and show exact code/comments about connection/session handling.
</web_search_result>
Script executed:
#!/bin/bash
# First, let's look at the file in question
echo "=== File context ==="
wc -l src/lfx/src/lfx/components/models_and_agents/embedding_model.py
# Get the full context around lines 306-314
echo -e "\n=== Code context around lines 306-314 ==="
sed -n '290,330p' src/lfx/src/lfx/components/models_and_agents/embedding_model.py
# Look for imports related to WatsonxEmbeddings and APIClient
echo -e "\n=== Imports in the file ==="
head -30 src/lfx/src/lfx/components/models_and_agents/embedding_model.py
Repository: langflow-ai/langflow
<shell_result>
Repository: langflow-ai/langflow
Exit code: 0
stdout:
=== File context ===
423 src/lfx/src/lfx/components/models_and_agents/embedding_model.py
=== Code context around lines 306-314 ===
params = {
EmbedTextParamsMetaNames.TRUNCATE_INPUT_TOKENS: self.truncate_input_tokens,
EmbedTextParamsMetaNames.RETURN_OPTIONS: {"input_text": self.input_text},
}
# Create the primary embedding instance
embeddings_instance = WatsonxEmbeddings(
model_id=model,
params=params,
watsonx_client=api_client,
project_id=project_id,
)
# Fetch available IBM watsonx.ai models
available_model_names = self.fetch_ibm_models(final_url)
# Create dedicated instances for each available model
available_models_dict = {}
for model_name in available_model_names:
available_models_dict[model_name] = WatsonxEmbeddings(
model_id=model_name,
params=params,
watsonx_client=api_client,
project_id=project_id,
)
return EmbeddingsWithModels(
embeddings=embeddings_instance,
available_models=available_models_dict,
)
msg = f"Unknown provider: {provider}"
raise ValueError(msg)
async def update_build_config(
self, build_config: dotdict, field_value: Any, field_name: str | None = None
) -> dotdict:
if field_name == "provider":
if field_value == "OpenAI":
build_config["model"]["options"] = OPENAI_EMBEDDING_MODEL_NAMES
build_config["model"]["value"] = OPENAI_EMBEDDING_MODEL_NAMES[0]
=== Imports in the file ===
from typing import Any
import requests
from ibm_watsonx_ai.metanames import EmbedTextParamsMetaNames
from langchain_openai import OpenAIEmbeddings
from lfx.base.embeddings.embeddings_class import EmbeddingsWithModels
from lfx.base.embeddings.model import LCEmbeddingsModel
from lfx.base.models.model_utils import get_ollama_models, is_valid_ollama_url
from lfx.base.models.openai_constants import OPENAI_EMBEDDING_MODEL_NAMES
from lfx.base.models.watsonx_constants import (
IBM_WATSONX_URLS,
WATSONX_EMBEDDING_MODEL_NAMES,
)
from lfx.field_typing import Embeddings
from lfx.io import (
BoolInput,
DictInput,
DropdownInput,
FloatInput,
IntInput,
MessageTextInput,
SecretStrInput,
)
from lfx.log.logger import logger
from lfx.schema.dotdict import dotdict
from lfx.utils.util import transform_localhost_url
# Ollama API constants
HTTP_STATUS_OK = 200
</shell_result>
Script executed:
#!/bin/bash
# Find all references to EmbeddingsWithModels and available_models
echo "=== Searching for EmbeddingsWithModels usage ==="
rg -n "EmbeddingsWithModels" --type py
echo -e "\n=== Searching for available_models usage ==="
rg -n "available_models" --type py -A 2 -B 2
echo -e "\n=== Check for async/threading patterns in embedding model ==="
rg -n "async def|Thread|asyncio|concurrent" src/lfx/src/lfx/components/models_and_agents/embedding_model.pyRepository: langflow-ai/langflow
<shell_result>
Repository: langflow-ai/langflow
Exit code: 0
stdout:
=== Searching for EmbeddingsWithModels usage ===
src/lfx/src/lfx/components/models_and_agents/embedding_model.py:7:from lfx.base.embeddings.embeddings_class import EmbeddingsWithModels
src/lfx/src/lfx/components/models_and_agents/embedding_model.py:202: return EmbeddingsWithModels(
src/lfx/src/lfx/components/models_and_agents/embedding_model.py:257: return EmbeddingsWithModels(
src/lfx/src/lfx/components/models_and_agents/embedding_model.py:316: return EmbeddingsWithModels(
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:695: # Also check available_models list from EmbeddingsWithModels
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:1193: # Also leverage available_models list from EmbeddingsWithModels
src/lfx/src/lfx/base/embeddings/embeddings_class.py:6:class EmbeddingsWithModels(Embeddings):
src/lfx/src/lfx/base/embeddings/embeddings_class.py:24: """Initialize the EmbeddingsWithModels wrapper.
src/lfx/src/lfx/base/embeddings/embeddings_class.py:114: f"EmbeddingsWithModels(embeddings={self.embeddings!r}, "
=== Searching for available_models usage ===
src/lfx/src/lfx/base/models/groq_model_discovery.py-71-
src/lfx/src/lfx/base/models/groq_model_discovery.py-72- # Step 1: Get list of available models
src/lfx/src/lfx/base/models/groq_model_discovery.py:73: available_models = self._fetch_available_models()
src/lfx/src/lfx/base/models/groq_model_discovery.py:74: logger.info(f"Found {len(available_models)} models from Groq API")
src/lfx/src/lfx/base/models/groq_model_discovery.py-75-
src/lfx/src/lfx/base/models/groq_model_discovery.py-76- # Step 2: Categorize models
--
src/lfx/src/lfx/base/models/groq_model_discovery.py-78- non_llm_models = []
src/lfx/src/lfx/base/models/groq_model_discovery.py-79-
src/lfx/src/lfx/base/models/groq_model_discovery.py:80: for model_id in available_models:
src/lfx/src/lfx/base/models/groq_model_discovery.py-81- if any(pattern in model_id.lower() for pattern in self.SKIP_PATTERNS):
src/lfx/src/lfx/base/models/groq_model_discovery.py-82- non_llm_models.append(model_id)
--
src/lfx/src/lfx/base/models/groq_model_discovery.py-115- return models_metadata
src/lfx/src/lfx/base/models/groq_model_discovery.py-116-
src/lfx/src/lfx/base/models/groq_model_discovery.py:117: def _fetch_available_models(self) -> list[str]:
src/lfx/src/lfx/base/models/groq_model_discovery.py-118- """Fetch list of available models from Groq API."""
src/lfx/src/lfx/base/models/groq_model_discovery.py-119- url = f"{self.base_url}/openai/v1/models"
--
src/lfx/src/lfx/base/embeddings/embeddings_class.py-13- Attributes:
src/lfx/src/lfx/base/embeddings/embeddings_class.py-14- embeddings: The primary LangChain Embeddings instance (used as fallback).
src/lfx/src/lfx/base/embeddings/embeddings_class.py:15: available_models: Dict mapping model names to their dedicated Embeddings instances.
src/lfx/src/lfx/base/embeddings/embeddings_class.py-16- Each model has its own pre-configured instance with specific parameters.
src/lfx/src/lfx/base/embeddings/embeddings_class.py-17- """
--
src/lfx/src/lfx/base/embeddings/embeddings_class.py-20- self,
src/lfx/src/lfx/base/embeddings/embeddings_class.py-21- embeddings: Embeddings,
src/lfx/src/lfx/base/embeddings/embeddings_class.py:22: available_models: dict[str, Embeddings] | None = None,
src/lfx/src/lfx/base/embeddings/embeddings_class.py-23- ):
src/lfx/src/lfx/base/embeddings/embeddings_class.py-24- """Initialize the EmbeddingsWithModels wrapper.
--
src/lfx/src/lfx/base/embeddings/embeddings_class.py-26- Args:
src/lfx/src/lfx/base/embeddings/embeddings_class.py-27- embeddings: The primary LangChain Embeddings instance (used as default/fallback).
src/lfx/src/lfx/base/embeddings/embeddings_class.py:28: available_models: Dict mapping model names to dedicated Embeddings instances.
src/lfx/src/lfx/base/embeddings/embeddings_class.py-29- Each value should be a fully configured Embeddings object ready to use.
src/lfx/src/lfx/base/embeddings/embeddings_class.py-30- Defaults to empty dict if not provided.
--
src/lfx/src/lfx/base/embeddings/embeddings_class.py-32- super().__init__()
src/lfx/src/lfx/base/embeddings/embeddings_class.py-33- self.embeddings = embeddings
src/lfx/src/lfx/base/embeddings/embeddings_class.py:34: self.available_models = available_models if available_models is not None else {}
src/lfx/src/lfx/base/embeddings/embeddings_class.py-35-
src/lfx/src/lfx/base/embeddings/embeddings_class.py-36- def embed_documents(self, texts: list[str]) -> list[list[float]]:
--
src/lfx/src/lfx/base/embeddings/embeddings_class.py-113- return (
src/lfx/src/lfx/base/embeddings/embeddings_class.py-114- f"EmbeddingsWithModels(embeddings={self.embeddings!r}, "
src/lfx/src/lfx/base/embeddings/embeddings_class.py:115: f"available_models={self.available_models!r})"
src/lfx/src/lfx/base/embeddings/embeddings_class.py-116- )
src/lfx/src/lfx/base/embeddings/embeddings_class.py-117-
--
src/lfx/src/lfx/components/nvidia/nvidia_rerank.py-47- try:
src/lfx/src/lfx/components/nvidia/nvidia_rerank.py-48- build_model = self.build_compressor()
src/lfx/src/lfx/components/nvidia/nvidia_rerank.py:49: ids = [model.id for model in build_model.available_models]
src/lfx/src/lfx/components/nvidia/nvidia_rerank.py-50- build_config["model"]["options"] = ids
src/lfx/src/lfx/components/nvidia/nvidia_rerank.py-51- build_config["model"]["value"] = ids[0]
--
src/lfx/src/lfx/components/nvidia/nvidia_embedding.py-51- try:
src/lfx/src/lfx/components/nvidia/nvidia_embedding.py-52- build_model = self.build_embeddings()
src/lfx/src/lfx/components/nvidia/nvidia_embedding.py:53: ids = [model.id for model in build_model.available_models]
src/lfx/src/lfx/components/nvidia/nvidia_embedding.py-54- build_config["model"]["options"] = ids
src/lfx/src/lfx/components/nvidia/nvidia_embedding.py-55- build_config["model"]["value"] = ids[0]
--
src/lfx/src/lfx/components/nvidia/nvidia.py-21- from langchain_nvidia_ai_endpoints import ChatNVIDIA
src/lfx/src/lfx/components/nvidia/nvidia.py-22-
src/lfx/src/lfx/components/nvidia/nvidia.py:23: all_models = ChatNVIDIA().get_available_models()
src/lfx/src/lfx/components/nvidia/nvidia.py-24- except ImportError as e:
src/lfx/src/lfx/components/nvidia/nvidia.py-25- msg = "Please install langchain-nvidia-ai-endpoints to use the NVIDIA model."
--
src/lfx/src/lfx/components/nvidia/nvidia.py-102- model = ChatNVIDIA(base_url=self.base_url, api_key=self.api_key)
src/lfx/src/lfx/components/nvidia/nvidia.py-103- if tool_model_enabled:
src/lfx/src/lfx/components/nvidia/nvidia.py:104: tool_models = [m for m in model.get_available_models() if m.supports_tools]
src/lfx/src/lfx/components/nvidia/nvidia.py-105- return sorted(m.id for m in tool_models)
src/lfx/src/lfx/components/nvidia/nvidia.py:106: return sorted(m.id for m in model.available_models)
src/lfx/src/lfx/components/nvidia/nvidia.py-107-
src/lfx/src/lfx/components/nvidia/nvidia.py-108- def update_build_config(self, build_config: dotdict, _field_value: Any, field_name: str | None = None):
--
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-186-
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-187- # Create dedicated instances for each available model
src/lfx/src/lfx/components/models_and_agents/embedding_model.py:188: available_models_dict = {}
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-189- for model_name in OPENAI_EMBEDDING_MODEL_NAMES:
src/lfx/src/lfx/components/models_and_agents/embedding_model.py:190: available_models_dict[model_name] = OpenAIEmbeddings(
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-191- model=model_name,
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-192- dimensions=dimensions or None, # Use same dimensions config for all
--
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-202- return EmbeddingsWithModels(
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-203- embeddings=embeddings_instance,
src/lfx/src/lfx/components/models_and_agents/embedding_model.py:204: available_models=available_models_dict,
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-205- )
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-206-
--
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-247-
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-248- # Create dedicated instances for each available model
src/lfx/src/lfx/components/models_and_agents/embedding_model.py:249: available_models_dict = {}
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-250- for model_name in available_model_names:
src/lfx/src/lfx/components/models_and_agents/embedding_model.py:251: available_models_dict[model_name] = OllamaEmbeddings(
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-252- model=model_name,
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-253- base_url=final_base_url,
--
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-257- return EmbeddingsWithModels(
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-258- embeddings=embeddings_instance,
src/lfx/src/lfx/components/models_and_agents/embedding_model.py:259: available_models=available_models_dict,
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-260- )
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-261-
--
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-305-
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-306- # Create dedicated instances for each available model
src/lfx/src/lfx/components/models_and_agents/embedding_model.py:307: available_models_dict = {}
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-308- for model_name in available_model_names:
src/lfx/src/lfx/components/models_and_agents/embedding_model.py:309: available_models_dict[model_name] = WatsonxEmbeddings(
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-310- model_id=model_name,
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-311- params=params,
--
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-316- return EmbeddingsWithModels(
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-317- embeddings=embeddings_instance,
src/lfx/src/lfx/components/models_and_agents/embedding_model.py:318: available_models=available_models_dict,
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-319- )
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-320-
--
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-693- for emb_obj in embeddings_list:
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-694- # Check all possible model identifiers (deployment, model, model_id, model_name)
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:695: # Also check available_models list from EmbeddingsWithModels
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-696- possible_names = []
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-697- deployment = getattr(emb_obj, "deployment", None)
--
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-699- model_id = getattr(emb_obj, "model_id", None)
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-700- model_name = getattr(emb_obj, "model_name", None)
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:701: available_models_attr = getattr(emb_obj, "available_models", None)
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-702-
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-703- if deployment:
--
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-714- possible_names.append(f"{deployment}:{model}")
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-715-
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:716: # Add all models from available_models dict
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:717: if available_models_attr and isinstance(available_models_attr, dict):
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-718- possible_names.extend(
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-719- str(model_key).strip()
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:720: for model_key in available_models_attr
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-721- if model_key and str(model_key).strip()
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-722- )
--
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-724- # Match if target matches any of the possible names
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-725- if target_model_name in possible_names:
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:726: # Check if target is in available_models dict - use dedicated instance
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-727- if (
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:728: available_models_attr
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:729: and isinstance(available_models_attr, dict)
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:730: and target_model_name in available_models_attr
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-731- ):
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-732- # Use the dedicated embedding instance from the dict
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:733: selected_embedding = available_models_attr[target_model_name]
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-734- embedding_model = target_model_name
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:735: self.log(f"Found dedicated embedding instance for '{embedding_model}' in available_models dict")
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-736- else:
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-737- # Traditional identifier match
--
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-751- model_id = getattr(emb, "model_id", None)
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-752- model_name = getattr(emb, "model_name", None)
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:753: available_models_attr = getattr(emb, "available_models", None)
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-754-
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-755- if deployment:
--
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-766- identifiers.append(f"combined='{deployment}:{model}'")
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-767-
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:768: # Add available_models dict if present
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:769: if available_models_attr and isinstance(available_models_attr, dict):
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:770: identifiers.append(f"available_models={list(available_models_attr.keys())}")
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-771-
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-772- available_info.append(
--
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-803- if hasattr(selected_embedding, "dimensions"):
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-804- logger.info(f"Embedding dimensions: {selected_embedding.dimensions}")
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:805: if hasattr(selected_embedding, "available_models"):
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:806: logger.info(f"Embedding available_models: {selected_embedding.available_models}")
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-807-
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:808: # No model switching needed - each model in available_models has its own dedicated instance
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-809- # The selected_embedding is already configured correctly for the target model
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-810- logger.info(f"Using embedding instance for '{embedding_model}' - pre-configured and ready to use")
--
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1031- return context_clauses
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1032-
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:1033: def _detect_available_models(self, client: OpenSearch, filter_clauses: list[dict] | None = None) -> list[str]:
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1034- """Detect which embedding models have documents in the index.
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1035-
--
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1177-
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1178- # Detect available embedding models in the index (scoped by filters)
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:1179: available_models = self._detect_available_models(client, filter_clauses)
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1180-
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:1181: if not available_models:
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1182- logger.warning("No embedding models found in index, using current model")
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:1183: available_models = [self._get_embedding_model_name()]
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1184-
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1185- # Generate embeddings for ALL detected models
--
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1191- # Create a comprehensive map of model names to embedding objects
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1192- # Check all possible identifiers (deployment, model, model_id, model_name)
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:1193: # Also leverage available_models list from EmbeddingsWithModels
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1194- # Handle duplicate identifiers by creating combined keys
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1195- embedding_by_model = {}
--
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1204- model_name = getattr(emb_obj, "model_name", None)
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1205- dimensions = getattr(emb_obj, "dimensions", None)
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:1206: available_models = getattr(emb_obj, "available_models", None)
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1207-
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1208- logger.info(
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1209- f"Embedding object {idx}: deployment={deployment}, model={model}, "
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1210- f"model_id={model_id}, model_name={model_name}, dimensions={dimensions}, "
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:1211: f"available_models={available_models}"
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1212- )
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1213-
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:1214: # If this embedding has available_models dict, map all models to their dedicated instances
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:1215: if available_models and isinstance(available_models, dict):
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:1216: logger.info(f"Embedding object {idx} provides {len(available_models)} models via available_models dict")
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:1217: for model_name_key, dedicated_embedding in available_models.items():
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1218- if model_name_key and str(model_name_key).strip():
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1219- model_str = str(model_name_key).strip()
--
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1268- logger.warning(f" Conflict on '{conflict_id}': {len(emb_list)} embeddings use this identifier")
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1269-
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:1270: logger.info(f"Generating embeddings for {len(available_models)} models in index")
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1271- logger.info(f"Available embedding identifiers: {list(embedding_by_model.keys())}")
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1272-
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:1273: for model_name in available_models:
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1274- try:
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1275- # Check if we have an embedding object for this model
--
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1281- emb_model_id = getattr(emb_obj, "model_id", None)
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1282- emb_dimensions = getattr(emb_obj, "dimensions", None)
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:1283: emb_available_models = getattr(emb_obj, "available_models", None)
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1284-
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1285- logger.info(
--
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1289- )
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1290-
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:1291: # Check if this is a dedicated instance from available_models dict
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:1292: if emb_available_models and isinstance(emb_available_models, dict):
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1293- logger.info(
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:1294: f"Model '{model_name}' using dedicated instance from available_models dict "
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1295- f"(pre-configured with correct model and dimensions)"
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1296- )
--
src/backend/tests/unit/groq/test_groq_model_discovery.py-36- @patch("lfx.base.models.groq_model_discovery.requests.get")
src/backend/tests/unit/groq/test_groq_model_discovery.py-37- @patch("groq.Groq")
src/backend/tests/unit/groq/test_groq_model_discovery.py:38: def test_fetch_available_models_success(
src/backend/tests/unit/groq/test_groq_model_discovery.py-39- self, mock_groq, mock_get, mock_api_key, mock_groq_models_response, mock_groq_client_tool_calling_success
src/backend/tests/unit/groq/test_groq_model_discovery.py-40- ):
--
src/backend/tests/unit/groq/test_groq_model_discovery.py-51-
src/backend/tests/unit/groq/test_groq_model_discovery.py-52- discovery = GroqModelDiscovery(api_key=mock_api_key)
src/backend/tests/unit/groq/test_groq_model_discovery.py:53: models = discovery._fetch_available_models()
src/backend/tests/unit/groq/test_groq_model_discovery.py-54-
src/backend/tests/unit/groq/test_groq_model_discovery.py-55- assert isinstance(models, list)
=== Check for async/threading patterns in embedding model ===
155: async def build_embeddings(self) -> Embeddings:
324: async def update_build_config(
</shell_result>
Script executed:
#!/bin/bash
# Search for APIClient creation and usage
echo "=== Searching for APIClient ==="
rg -n "APIClient|api_client" src/lfx/src/lfx/components/models_and_agents/embedding_model.py -B 2 -A 2
# Get the full method that contains lines 306-314
echo -e "\n=== Full method context ==="
sed -n '250,330p' src/lfx/src/lfx/components/models_and_agents/embedding_model.py
Repository: langflow-ai/langflow
<shell_result>
Repository: langflow-ai/langflow
Exit code: 0
stdout:
=== Searching for APIClient ===
277- raise ValueError(msg)
278-
279: from ibm_watsonx_ai import APIClient, Credentials
280-
281- final_url = base_url_ibm_watsonx or "https://us-south.ml.cloud.ibm.com"
--
286- )
287-
288: api_client = APIClient(credentials)
289-
290- params = {
--
297- model_id=model,
298- params=params,
299: watsonx_client=api_client,
300- project_id=project_id,
301- )
--
310- model_id=model_name,
311- params=params,
312: watsonx_client=api_client,
313- project_id=project_id,
314- )
=== Full method context ===
            for model_name in available_model_names:
                available_models_dict[model_name] = OllamaEmbeddings(
                    model=model_name,
                    base_url=final_base_url,
                    **model_kwargs,
                )
            return EmbeddingsWithModels(
                embeddings=embeddings_instance,
                available_models=available_models_dict,
            )
        if provider == "IBM watsonx.ai":
            try:
                from langchain_ibm import WatsonxEmbeddings
            except ImportError:
                msg = "Please install langchain-ibm: pip install langchain-ibm"
                raise ImportError(msg) from None
            if not api_key:
                msg = "IBM watsonx.ai API key is required when using IBM watsonx.ai provider"
                raise ValueError(msg)
            project_id = self.project_id
            if not project_id:
                msg = "Project ID is required for IBM watsonx.ai provider"
                raise ValueError(msg)
            from ibm_watsonx_ai import APIClient, Credentials
            final_url = base_url_ibm_watsonx or "https://us-south.ml.cloud.ibm.com"
            credentials = Credentials(
                api_key=self.api_key,
                url=final_url,
            )
            api_client = APIClient(credentials)
            params = {
                EmbedTextParamsMetaNames.TRUNCATE_INPUT_TOKENS: self.truncate_input_tokens,
                EmbedTextParamsMetaNames.RETURN_OPTIONS: {"input_text": self.input_text},
            }
            # Create the primary embedding instance
            embeddings_instance = WatsonxEmbeddings(
                model_id=model,
                params=params,
                watsonx_client=api_client,
                project_id=project_id,
            )
            # Fetch available IBM watsonx.ai models
            available_model_names = self.fetch_ibm_models(final_url)
            # Create dedicated instances for each available model
            available_models_dict = {}
            for model_name in available_model_names:
                available_models_dict[model_name] = WatsonxEmbeddings(
                    model_id=model_name,
                    params=params,
                    watsonx_client=api_client,
                    project_id=project_id,
                )
            return EmbeddingsWithModels(
                embeddings=embeddings_instance,
                available_models=available_models_dict,
            )
        msg = f"Unknown provider: {provider}"
        raise ValueError(msg)

    async def update_build_config(
        self, build_config: dotdict, field_value: Any, field_name: str | None = None
    ) -> dotdict:
        if field_name == "provider":
            if field_value == "OpenAI":
                build_config["model"]["options"] = OPENAI_EMBEDDING_MODEL_NAMES
                build_config["model"]["value"] = OPENAI_EMBEDDING_MODEL_NAMES[0]
</shell_result>
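For orientation, the wrapper returned in both branches above can be sketched as a thin delegating class: a primary embeddings instance plus a dict of dedicated per-model instances. This is a minimal sketch inferred from the shell output, not the PR's actual EmbeddingsWithModels definition, and the class name below is hypothetical.

```python
# Minimal sketch of the wrapper shape returned by build_embeddings();
# the PR's real EmbeddingsWithModels class may carry more fields and methods.
from langchain_core.embeddings import Embeddings


class EmbeddingsWithModelsSketch(Embeddings):
    """Wraps a primary embeddings instance plus dedicated per-model instances."""

    def __init__(self, embeddings: Embeddings, available_models: dict[str, Embeddings]):
        self.embeddings = embeddings              # primary instance (the selected model)
        self.available_models = available_models  # model name -> dedicated instance

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        # Delegate to the primary instance, matching the sequence diagram in the summary.
        return self.embeddings.embed_documents(texts)

    def embed_query(self, text: str) -> list[float]:
        return self.embeddings.embed_query(text)
```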
ibm_watsonx_ai WatsonxEmbeddings watsonx_client shared instance concurrent usage
<web_search_result>
Short answer — yes, you can share a single WatsonxEmbeddings / watsonx_client instance across callers, but you should control concurrent requests (use the integration's built‑in concurrency option or your own queue/pool).
Key points and practical guidance
- LangChain's WatsonxEmbeddings supports a concurrency control parameter (often called concurrency_limit / maxConcurrency) so multiple callers can safely share one embeddings instance while limiting parallel requests to the provider. [1][3]
- The LangChain IBM embedding wrapper constructs/uses an underlying watsonx client (api_client) — you can pass a preconfigured client to the wrapper to reuse connections. If you share that client, ensure your app enforces a concurrency limit (or rely on the wrapper's concurrency option) to avoid rate limits or resource contention. [2][1]
- If you need high throughput: use a shared client + set concurrency_limit / maxConcurrency to a value that matches your plan/rate limits, or implement a worker pool/queue to throttle requests and retry on 429/errors.
Sources
- LangChain WatsonxEmbeddings reference (instantiation, parameters). [1]
- LangChain IBM embeddings module source (shows api_client / internal client use). [2]
- LangChain docs on rate‑limits / maxConcurrency for embeddings. [3]
If you want, tell me which SDK/library and language (Python, JS, LangChain version) you use and your target throughput, and I’ll suggest exact code/config values.
</web_search_result>
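As a rough illustration of the "shared client + throttle" option described in the search result, one way to cap concurrent embedding calls is a semaphore wrapper. This helper is hypothetical and not part of the PR; it only assumes the wrapped object exposes the standard embed_query / embed_documents methods.

```python
# Hypothetical throttle around a shared embeddings instance; not part of the PR.
import threading

from langchain_core.embeddings import Embeddings


class ThrottledEmbeddings:
    """Limits how many embedding calls run concurrently against one shared client."""

    def __init__(self, inner: Embeddings, max_concurrency: int = 5):
        self._inner = inner
        self._gate = threading.Semaphore(max_concurrency)

    def embed_query(self, text: str) -> list[float]:
        with self._gate:  # at most `max_concurrency` threads reach the client at once
            return self._inner.embed_query(text)

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        with self._gate:
            return self._inner.embed_documents(texts)
```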
Script executed:
#!/bin/bash
# Check the full WatsonxEmbeddings initialization in embedding_model.py
echo "=== WatsonxEmbeddings initialization parameters ==="
sed -n '295,315p' src/lfx/src/lfx/components/models_and_agents/embedding_model.py
# Search for concurrency_limit or similar parameters in the file
echo -e "\n=== Search for concurrency parameters ==="
rg -n "concurrency|maxConcurrency|max_concurrency" src/lfx/src/lfx/components/models_and_agents/embedding_model.py
# Check how embed_documents is called in opensearch_multimodal
echo -e "\n=== How embeddings are called in opensearch_multimodal ==="
rg -n "embed_documents\|embed_query" src/lfx/src/lfx/components/elastic/opensearch_multimodal.py -B 2 -A 3 | head -50Repository: langflow-ai/langflow
<shell_result>
Repository: langflow-ai/langflow
Exit code: 0
stdout:
=== WatsonxEmbeddings initialization parameters ===
            # Create the primary embedding instance
            embeddings_instance = WatsonxEmbeddings(
                model_id=model,
                params=params,
                watsonx_client=api_client,
                project_id=project_id,
            )
            # Fetch available IBM watsonx.ai models
            available_model_names = self.fetch_ibm_models(final_url)
            # Create dedicated instances for each available model
            available_models_dict = {}
            for model_name in available_model_names:
                available_models_dict[model_name] = WatsonxEmbeddings(
                    model_id=model_name,
                    params=params,
                    watsonx_client=api_client,
                    project_id=project_id,
                )
=== Search for concurrency parameters ===
=== How embeddings are called in opensearch_multimodal ===
</shell_result>
langchain_ibm WatsonxEmbeddings parameters concurrency_limit maxConcurrency source code
<web_search_result>
Short answer — both names exist but in different places:
- In the langchain-ibm (LangChain) integrations: the JS/TS integration exposes maxConcurrency / maxRetries on Watsonx Embeddings/LLM input types (e.g., WatsonxInputEmbeddings / WatsonxCallOptionsLLM). See the API docs for WatsonxEmbeddings / WatsonxInputEmbeddings. [1][2]
- In the upstream IBM watsonx.ai Python SDK (the underlying client), the Embeddings class uses concurrency_limit (and batch_size, max_retries, delay_time, retry_status_codes, etc.). See the SDK Embeddings constructor docs. [3]
Sources:
- [1] LangChain (community) JS/TS WatsonxEmbeddings / WatsonxInputEmbeddings docs (shows maxConcurrency).
- [2] LangChain JS Watsonx call options / params (maxConcurrency field).
- [3] IBM watsonx.ai Python SDK Embeddings constructor (concurrency_limit parameter).
</web_search_result>
Script executed:
#!/bin/bash
# Search for langchain-ibm version in dependencies
echo "=== Searching for langchain-ibm version ==="
find . -name "requirements*.txt" -o -name "pyproject.toml" -o -name "setup.py" -o -name "setup.cfg" | head -20
# Check pyproject.toml or similar files
if [ -f "pyproject.toml" ]; then
echo -e "\n=== pyproject.toml content ==="
grep -A 5 -B 5 "langchain" pyproject.toml | head -40
fi
# Also check in src or root
for f in requirements.txt requirements-dev.txt setup.py; do
if [ -f "$f" ]; then
echo -e "\n=== $f ==="
grep "langchain" "$f" | head -10
fi
done
# Also check backend directory
if [ -d "backend" ]; then
echo -e "\n=== Checking backend for requirements ==="
find backend -maxdepth 2 -name "*.txt" -o -name "pyproject.toml" | xargs grep -l "langchain" 2>/dev/null | head -5
fi
<shell_result>
Repository: langflow-ai/langflow
Exit code: 0
stdout:
=== Searching for langchain-ibm version ===
./pyproject.toml
./src/backend/base/langflow/initial_setup/setup.py
./src/backend/base/langflow/logging/setup.py
./src/backend/base/pyproject.toml
./src/lfx/pyproject.toml
=== pyproject.toml content ===
name = "langflow"
version = "1.7.0"
description = "A Python package with a built-in web application"
requires-python = ">=3.10,<3.14"
license = "MIT"
keywords = ["nlp", "langchain", "openai", "gpt", "gui"]
readme = "README.md"
maintainers = [
{ name = "Carlos Coelho", email = "[email protected]" },
{ name = "Cristhian Zanforlin", email = "[email protected]" },
{ name = "Gabriel Almeida", email = "[email protected]" },
--
"MarkupSafe==3.0.2",
"boto3>=1.34.162,<2.0.0",
"numexpr==2.10.2",
"qianfan==0.3.5",
"pgvector==0.3.6",
"langchain==0.3.23",
"elasticsearch==8.16.0",
"pytube==15.0.0",
"dspy-ai==2.5.41",
"datasets>2.14.7,<4.0.0",
"assemblyai==0.35.1",
--
"langsmith>=0.3.42,<1.0.0",
"yfinance==0.2.50",
"wolframalpha==5.1.3",
"astra-assistants[tools]>=2.2.13,<3.0.0",
"composio==0.9.2",
"composio-langchain==0.9.2",
"spider-client==0.1.24",
"nltk==3.9.1",
"lark==1.2.2",
"jq==1.8.0",
"pydantic-settings>=2.2.0,<3.0.0",
"duckduckgo_search==7.2.1",
"opensearch-py==2.8.0",
"langchain-google-genai==2.0.6",
"langchain-cohere>=0.3.3,<1.0.0",
"langchain-huggingface==0.3.1",
</shell_result>
Script executed:
#!/bin/bash
# Look at how embed_documents is called - check context around line 1274
echo "=== Context around embed_documents call in opensearch_multimodal ==="
sed -n '1270,1310p' src/lfx/src/lfx/components/elastic/opensearch_multimodal.py
# Check if there's any threading or asyncio in this file
echo -e "\n=== Threading/async patterns in opensearch_multimodal ==="
rg -n "async def|Thread|concurrent|asyncio" src/lfx/src/lfx/components/elastic/opensearch_multimodal.py | head -20Repository: langflow-ai/langflow
<shell_result>
Repository: langflow-ai/langflow
Exit code: 0
stdout:
=== Context around embed_documents call in opensearch_multimodal ===
logger.info(f"Generating embeddings for {len(available_models)} models in index")
logger.info(f"Available embedding identifiers: {list(embedding_by_model.keys())}")
for model_name in available_models:
try:
# Check if we have an embedding object for this model
if model_name in embedding_by_model:
# Use the matching embedding object directly
emb_obj = embedding_by_model[model_name]
emb_deployment = getattr(emb_obj, "deployment", None)
emb_model = getattr(emb_obj, "model", None)
emb_model_id = getattr(emb_obj, "model_id", None)
emb_dimensions = getattr(emb_obj, "dimensions", None)
emb_available_models = getattr(emb_obj, "available_models", None)
logger.info(
f"Using embedding object for model '{model_name}': "
f"deployment={emb_deployment}, model={emb_model}, model_id={emb_model_id}, "
f"dimensions={emb_dimensions}"
)
# Check if this is a dedicated instance from available_models dict
if emb_available_models and isinstance(emb_available_models, dict):
logger.info(
f"Model '{model_name}' using dedicated instance from available_models dict "
f"(pre-configured with correct model and dimensions)"
)
# Use the embedding instance directly - no model switching needed!
vec = emb_obj.embed_query(q)
query_embeddings[model_name] = vec
logger.info(f"Generated embedding for model: {model_name} (actual dimensions: {len(vec)})")
else:
# No matching embedding found for this model
logger.warning(
f"No matching embedding found for model '{model_name}'. "
f"This model will be skipped. Available models: {list(embedding_by_model.keys())}"
)
except (RuntimeError, ValueError, ConnectionError, TimeoutError, AttributeError, KeyError) as e:
logger.warning(f"Failed to generate embedding for {model_name}: {e}")
=== Threading/async patterns in opensearch_multimodal ===
7:from concurrent.futures import ThreadPoolExecutor, as_completed
858: with ThreadPoolExecutor(max_workers=max_workers) as executor:
874: "Threaded embedding generation failed for model %s (attempt %s/%s), retrying in %.1fs",
1536: async def update_build_config(self, build_config: dict, field_value: str, field_name: str | None = None) -> dict:
</shell_result>
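The ThreadPoolExecutor and retry hits above point at a fan-out pattern along these lines. This is a simplified sketch of that pattern, not the component's actual code; the function name, retry counts, and backoff are illustrative, while embedding_by_model mirrors the dict seen in the grep output.

```python
# Illustrative sketch of parallel per-model query embedding with simple retries;
# the real logic in opensearch_multimodal.py differs in its details.
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def embed_query_for_models(query, embedding_by_model, max_workers=4, retries=3, delay=1.0):
    """Generate one query embedding per model, in parallel, with basic retry."""

    def embed_one(model_name, emb_obj):
        for attempt in range(1, retries + 1):
            try:
                return model_name, emb_obj.embed_query(query)
            except (RuntimeError, ConnectionError, TimeoutError):
                if attempt == retries:
                    return model_name, None  # give up; caller skips this model
                time.sleep(delay * attempt)  # simple linear backoff between attempts
        return model_name, None

    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(embed_one, name, obj) for name, obj in embedding_by_model.items()]
        for future in as_completed(futures):
            model_name, vector = future.result()
            if vector is not None:
                results[model_name] = vector
    return results
```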
Based on my verification, the review comment raises a valid concern that warrants action. Here's what I found:
- Shared api_client confirmed: All WatsonxEmbeddings instances created in the loop (lines 309-314) share the same api_client object created at line 288.
- Concurrent usage confirmed: opensearch_multimodal.py uses ThreadPoolExecutor (line 858) to execute embedding operations in parallel, and it directly uses the instances from the available_models dictionary (as shown in the loop at lines 1273+).
- No concurrency control: The WatsonxEmbeddings instances are instantiated without any concurrency_limit parameter, which LangChain supports for controlling parallel requests.
- IBM SDK thread-safety: Web search confirms IBM's watsonx.ai Python SDK does not provide explicit thread-safety guarantees for APIClient.
The review comment is accurate and identifies a legitimate issue that needs to be addressed. The solution would be to add a concurrency_limit parameter when instantiating WatsonxEmbeddings instances.
Verify that WatsonxEmbeddings instances are configured with a concurrency_limit parameter to control concurrent API calls.
All WatsonxEmbeddings instances (lines 309–314 and 297–301) share the same api_client object. The code is used with ThreadPoolExecutor in opensearch_multimodal.py for parallel embedding generation. Without an explicit concurrency_limit on each instance, concurrent calls through the shared APIClient risk exceeding API rate limits or exposing thread-unsafe behavior. Add concurrency_limit=N (e.g., 5–10) to each WatsonxEmbeddings() constructor to safely control concurrent requests.
🤖 Prompt for AI Agents
In src/lfx/src/lfx/components/models_and_agents/embedding_model.py around lines
306-314 (and also update the earlier instantiation at lines 297-301), the
WatsonxEmbeddings instances are created without a concurrency_limit while
sharing the same api_client used concurrently by ThreadPoolExecutor; fix this by
passing a concurrency_limit parameter to each WatsonxEmbeddings(...) constructor
(e.g., concurrency_limit=5 or 10) so that parallel embedding requests through
the shared api_client are throttled and avoid thread-safety or rate-limit
issues.
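A minimal sketch of the suggested fix, assuming the installed langchain-ibm version accepts concurrency_limit and forwards it to the watsonx.ai SDK (worth verifying before adopting); credentials, project ID, and model IDs below are placeholders:

```python
# Sketch of the reviewer's suggestion: throttle each instance that shares the client.
# Assumes WatsonxEmbeddings accepts concurrency_limit in this langchain-ibm version.
from ibm_watsonx_ai import APIClient, Credentials
from langchain_ibm import WatsonxEmbeddings

credentials = Credentials(api_key="YOUR_API_KEY", url="https://us-south.ml.cloud.ibm.com")
api_client = APIClient(credentials)

shared_kwargs = {
    "watsonx_client": api_client,  # one shared client, as in the component
    "project_id": "YOUR_PROJECT_ID",
    "concurrency_limit": 5,        # cap parallel requests per instance
}

primary = WatsonxEmbeddings(model_id="ibm/slate-125m-english-rtrvr", **shared_kwargs)
per_model = {
    name: WatsonxEmbeddings(model_id=name, **shared_kwargs)
    for name in ["ibm/slate-30m-english-rtrvr", "ibm/slate-125m-english-rtrvr"]
}
```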
Codecov Report
❌ Patch coverage is 0.00%.
❌ Your patch status has failed because the patch coverage (0.00%) is below the target coverage (40.00%). You can increase the patch coverage or adjust the target coverage.
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #10714      +/-   ##
==========================================
- Coverage    32.48%    32.44%    -0.05%
==========================================
  Files         1366      1367        +1
  Lines        63294     63315       +21
  Branches      9356      9357        +1
==========================================
- Hits         20564     20542       -22
- Misses       41698     41740       +42
- Partials      1032      1033        +1
Flags with carried forward coverage won't be shown.
Updated ChatInput and ChatOutput components in starter project JSONs to use the session_id from the graph if not provided, ensuring consistent session management. This change improves message storage and retrieval logic for chat flows.
…low-ai/langflow into opensearch-multi-embedding
Introduces OpenSearchVectorStoreComponentMultimodalMultiEmbedding, supporting multi-model hybrid semantic and keyword search with dynamic vector fields, parallel embedding generation, advanced filtering, and flexible authentication. Enables ingestion and search across multiple embedding models in OpenSearch, with robust index management and UI configuration handling.
Key Features Added
- The embedding input accepts multiple embedding objects via is_list=True
- Users can connect multiple embedding models from different providers (OpenAI, Watsonx, Cohere, etc.)
- Backward compatible: single embeddings still work seamlessly
- Ingestion uses ONE selected embedding model specified by the user
  - Selection via the embedding_model_name field
  - Falls back to the first embedding if no model name is specified
  - Documents are stored in a dynamic field: chunk_embedding_{model_name}
- Search queries across ALL embedding models found in the index (see the sketch after this list)
  - Automatically detects available models via aggregation
  - Generates query embeddings for each detected model
  - Combines results using hybrid search (dis_max + keyword matching)
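To make the hybrid search step concrete, a query body along these lines could combine one kNN clause per detected model with a keyword clause under dis_max. This is a sketch of the described approach, not the component's exact query; the text field name, k, and tie_breaker values are assumptions, while the vector field names follow the chunk_embedding_{model_name} convention above.

```python
# Sketch of a dis_max hybrid query over per-model vector fields plus keywords.
# Field names follow the chunk_embedding_{model_name} convention from this PR;
# the "text" field and the k/tie_breaker values are illustrative assumptions.
def build_hybrid_query(query_text: str, query_embeddings: dict[str, list[float]], k: int = 10) -> dict:
    knn_clauses = [
        {"knn": {f"chunk_embedding_{model_name}": {"vector": vector, "k": k}}}
        for model_name, vector in query_embeddings.items()
    ]
    keyword_clause = {"multi_match": {"query": query_text, "fields": ["text"]}}
    return {
        "size": k,
        "query": {
            "dis_max": {
                "queries": [*knn_clauses, keyword_clause],
                "tie_breaker": 0.5,  # blend in scores from the non-best clauses
            }
        },
    }
```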
Summary by CodeRabbit