feat: Add OpenSearch multimodal multi-embedding component #10714
Conversation
Introduces OpenSearchVectorStoreComponentMultimodalMultiEmbedding, supporting multi-model hybrid semantic and keyword search with dynamic vector fields, parallel embedding generation, advanced filtering, and flexible authentication. Enables ingestion and search across multiple embedding models in OpenSearch, with robust index management and UI configuration handling.
Important: Review skipped. Auto incremental reviews are disabled on this repository; this can be changed in the CodeRabbit settings, and the status message itself can also be disabled there.

Walkthrough
This pull request introduces multi-model embedding support by creating a new EmbeddingsWithModels wrapper and a multi-embedding OpenSearch vector store component.

Changes
Sequence Diagram(s)

sequenceDiagram
participant Client as Client
participant EmbMod as EmbeddingModelComponent
participant EmbWM as EmbeddingsWithModels
participant Primary as Primary<br/>Embeddings
participant PerModel as Per-Model<br/>Embeddings
Client->>EmbMod: build_embeddings()
activate EmbMod
EmbMod->>EmbMod: Detect provider (OpenAI/Ollama/IBM)
EmbMod->>Primary: Create primary embeddings instance
EmbMod->>PerModel: Construct per-model instances<br/>(model_1, model_2, ...)
EmbMod->>EmbWM: Create EmbeddingsWithModels<br/>(primary, {model_1, model_2, ...})
deactivate EmbMod
EmbMod-->>Client: Return EmbeddingsWithModels
Note over Client,PerModel: Later usage:
Client->>EmbWM: embed_documents(texts)
activate EmbWM
EmbWM->>Primary: Delegate to primary instance
Primary-->>EmbWM: Return embeddings
deactivate EmbWM
EmbWM-->>Client: Return embeddings list
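For readers who want the shape of the wrapper this diagram describes, here is a minimal sketch of the delegation pattern. The real class lives in src/lfx/src/lfx/base/embeddings/embeddings_class.py and also provides async variants, a __call__ proxy, and __repr__; the body below is an illustrative reconstruction, not the PR's exact code:

from langchain_core.embeddings import Embeddings  # assumed base class; lfx re-exports an Embeddings type


class EmbeddingsWithModels(Embeddings):
    """Wrap a primary embeddings instance plus dedicated per-model instances."""

    def __init__(self, primary: Embeddings, available_models: dict[str, Embeddings] | None = None):
        self.primary = primary
        # Copy so instances never share a mutable default dict
        self.available_models = dict(available_models or {})

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        # Default behavior delegates to the primary instance
        return self.primary.embed_documents(texts)

    def embed_query(self, text: str) -> list[float]:
        return self.primary.embed_query(text)

    def __getattr__(self, name: str):
        # Forward unknown attributes (model, dimensions, ...) to the wrapped instance
        return getattr(self.primary, name)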
sequenceDiagram
participant Client as Client
participant OS as OpenSearchComponent
participant EmbWM as EmbeddingsWithModels
participant EmbN as Embedding N<br/>(Per-Model)
participant OSClient as OpenSearch<br/>Client
Client->>OS: search_documents(query_text, filters)
activate OS
OS->>OS: Detect available embedding models in index
OS->>EmbWM: Generate embeddings for each model
activate EmbWM
loop For each model
EmbWM->>EmbN: embed_query(query_text)
EmbN-->>EmbWM: embedding_vector
end
deactivate EmbWM
OS->>OS: Build per-model KNN queries
OS->>OS: Build keyword query (multi_match)
OS->>OSClient: Execute dis_max combination<br/>(KNN queries + keyword)
OSClient-->>OS: Return ranked results
OS->>OS: Convert results to Data objects
deactivate OS
OS-->>Client: Return search_documents results
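As a rough illustration of the query shape this diagram describes, the sketch below combines per-model KNN clauses with a keyword multi_match clause under dis_max. The chunk_embedding_<model> and text field names are assumptions for illustration; the component's actual query builder also handles filters, boosts, and field detection:

def build_hybrid_query(query_text: str, query_vectors: dict[str, list[float]], k: int = 10) -> dict:
    """Sketch: per-model KNN clauses plus a keyword clause, combined with dis_max."""
    knn_clauses = [
        # One KNN clause per detected model; field naming mirrors get_embedding_field_name()
        {"knn": {f"chunk_embedding_{model}": {"vector": vector, "k": k}}}
        for model, vector in query_vectors.items()
    ]
    keyword_clause = {"multi_match": {"query": query_text, "fields": ["text"]}}
    return {
        "size": k,
        # dis_max scores each hit by its best-matching sub-query instead of summing them
        "query": {"dis_max": {"queries": [*knn_clauses, keyword_clause]}},
    }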
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Possibly related PRs
Suggested labels
Suggested reviewers
Pre-merge checks and finishing touches
❌ Failed checks (2 warnings, 1 inconclusive)
✅ Passed checks (3 passed)
Introduces EmbeddingsWithModels class for wrapping embeddings and available models. Updates EmbeddingModelComponent to provide available model lists for OpenAI, Ollama, and IBM watsonx.ai providers, including synchronous Ollama model fetching using httpx. Updates starter project and component index metadata to reflect new dependencies and code changes.
…low-ai/langflow into opensearch-multi-embedding
Updated the EmbeddingModelComponent to fetch Ollama models asynchronously using await get_ollama_models instead of a synchronous httpx call. Removed httpx from dependencies in Nvidia Remix starter project and updated related metadata. This change improves consistency and reliability when fetching available models for the Ollama provider.
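For context, a minimal sketch of what an async model fetch against Ollama's /api/tags endpoint looks like. The actual helper is lfx's get_ollama_models, which additionally filters by capability and takes configurable JSON keys; the function name and simplified behavior below are illustrative only:

import httpx


async def list_ollama_model_names(base_url: str, timeout: float = 10.0) -> list[str]:
    """Fetch model names from an Ollama server's /api/tags endpoint."""
    url = f"{base_url.rstrip('/')}/api/tags"
    async with httpx.AsyncClient(timeout=timeout) as client:
        response = await client.get(url)
        response.raise_for_status()
        payload = response.json()
    # Ollama returns {"models": [{"name": "...", ...}, ...]}
    return [entry["name"] for entry in payload.get("models", [])]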
Added several Notion-related components to the component index, including AddContentToPage, NotionDatabaseProperties, NotionListPages, NotionPageContent, NotionPageCreator, NotionPageUpdate, and NotionSearch. These components enable interaction with Notion databases and pages, such as querying, updating, creating, and retrieving content.
Actionable comments posted: 8
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (5)
src/lfx/src/lfx/components/models_and_agents/embedding_model.py (1)
372-373: Redundant API calls to fetch IBM models.
fetch_ibm_models is called twice with the same URL. Cache the result to avoid duplicate HTTP requests:

  elif field_value == "IBM watsonx.ai":
-     build_config["model"]["options"] = self.fetch_ibm_models(base_url=self.base_url_ibm_watsonx)
-     build_config["model"]["value"] = self.fetch_ibm_models(base_url=self.base_url_ibm_watsonx)[0]
+     ibm_models = self.fetch_ibm_models(base_url=self.base_url_ibm_watsonx)
+     build_config["model"]["options"] = ibm_models
+     build_config["model"]["value"] = ibm_models[0] if ibm_models else ""

The same issue exists at lines 384-385.
src/backend/base/langflow/initial_setup/starter_projects/Nvidia Remix.json (4)
2170-2210: Bug: ollama_base_url update ignores the new field_value.
In update_build_config, the branch for field_name == "ollama_base_url" assigns ollama_url = self.ollama_base_url, ignoring the freshly provided field_value. This can leave the model list stale until a second refresh.
Apply this diff:
- elif field_name == "ollama_base_url":
-     # # Refresh Ollama models when base URL changes
-     # if hasattr(self, "provider") and self.provider == "Ollama":
-     # Use field_value if provided, otherwise fall back to instance attribute
-     ollama_url = self.ollama_base_url
+ elif field_name == "ollama_base_url":
+     # Use field_value if provided, otherwise fall back to instance attribute
+     ollama_url = field_value or getattr(self, "ollama_base_url", None)
      if await is_valid_ollama_url(url=ollama_url):
          try:
              models = await get_ollama_models(
                  base_url_value=ollama_url,
2339-2355: Default unsafe pickle loading should be False.
Allow Dangerous Deserialization is set to true by default, enabling pickle loading. This is a security risk and should be opt-in.
Apply this diff in the FAISS component template:
- "display_name": "Allow Dangerous Deserialization", + "display_name": "Allow Dangerous Deserialization", "dynamic": false, "info": "Set to True to allow loading pickle files from untrusted sources. Only enable this if you trust the source of the data.", @@ - "type": "bool", - "value": true + "type": "bool", + "value": false
2095-2145: Guard OpenAI dimensions parameter per model to prevent 400 errors on text-embedding-ada-002.
The loop at src/lfx/src/lfx/components/models_and_agents/embedding_model.py:189-205 creates per-model OpenAIEmbeddings instances, passing dimensions to all models. However, text-embedding-ada-002 (included in OPENAI_EMBEDDING_MODEL_NAMES) does not accept the dimensions parameter and will error. Only text-embedding-3-small and text-embedding-3-large support it.

Apply this conditional guard:

+ DIMENSIONAL_MODELS = {"text-embedding-3-small", "text-embedding-3-large"}
  for model_name in OPENAI_EMBEDDING_MODEL_NAMES:
      available_models_dict[model_name] = OpenAIEmbeddings(
          model=model_name,
-         dimensions=dimensions or None,  # Use same dimensions config for all
+         dimensions=(dimensions if model_name in DIMENSIONAL_MODELS else None),
          base_url=api_base or None,
          api_key=api_key,
          chunk_size=chunk_size,
2018-2252: Based on my comprehensive verification of the codebase, I can now provide the corrected review comment. Here are my key findings:

Key Verification Results:
- ✅ EmbeddingsWithModels IS compatible with FAISS - The class properly implements embed_documents(), embed_query(), async variants, __getattr__() delegation, and __call__() proxy. FAISS will work without modification.
- ✅ OpenAI dimensions issue IS valid - Only text-embedding-3-small and text-embedding-3-large support the dimensions parameter; text-embedding-ada-002 does not. Current code applies dimensions uniformly to all models.
- ✅ allow_dangerous_deserialization defaults to True - Confirmed security issue in FAISS component.
- ✅ Ollama field_value not used - Code line 390 uses self.ollama_base_url instead of the field_value parameter passed to the method.
- ❌ No httpx issue - Code uses requests.get with timeout=10, not httpx.
Guard OpenAI embedding models against unsupported dimension parameter.
The code applies the dimensions parameter uniformly to all OpenAI models, but text-embedding-ada-002 does not support this parameter and will raise an error. Only text-embedding-3-small and text-embedding-3-large support dimensions.

In the build_embeddings method's OpenAI provider block, guard dimensions per model:

  for model_name in OPENAI_EMBEDDING_MODEL_NAMES:
      available_models_dict[model_name] = OpenAIEmbeddings(
          model=model_name,
-         dimensions=dimensions or None,  # Use same dimensions config for all
+         dimensions=dimensions or None if model_name != "text-embedding-ada-002" else None,
          base_url=api_base or None,
          api_key=api_key,
          chunk_size=chunk_size,
          max_retries=max_retries,
          timeout=request_timeout or None,
          show_progress_bar=show_progress_bar,
          model_kwargs=model_kwargs,
      )

Set FAISS allow_dangerous_deserialization default to False.
The FAISS component currently defaults allow_dangerous_deserialization to True, which enables loading untrusted pickle files and poses a security risk. Change the default value in the component definition to False.

Fix Ollama URL refresh to use the field_value parameter.
In update_build_config, the ollama_base_url field handler ignores the field_value parameter and uses self.ollama_base_url instead. The comment indicates intent to use field_value. Update line 390 to use the passed parameter for consistency with base_url_ibm_watsonx handling:

  elif field_name == "ollama_base_url":
-     ollama_url = self.ollama_base_url
+     ollama_url = field_value or self.ollama_base_url
🧹 Nitpick comments (7)
src/lfx/src/lfx/components/models_and_agents/embedding_model.py (1)
239-255: URL inconsistency and missing error handling for model fetch.
URL inconsistency: get_ollama_models is called with self.ollama_base_url (raw input) while embedding instances use final_base_url (transformed). Although get_ollama_models transforms internally, this could cause subtle issues if the transformation logic diverges.

No fallback on failure: If get_ollama_models fails, the entire build_embeddings method fails. Consider falling back to an empty available_models dict or using the user-selected model as the only entry:

  # Fetch available Ollama models
- available_model_names = await get_ollama_models(
-     base_url_value=self.ollama_base_url,
+ try:
+     available_model_names = await get_ollama_models(
+         base_url_value=final_base_url,
-     desired_capability=DESIRED_CAPABILITY,
-     json_models_key=JSON_MODELS_KEY,
-     json_name_key=JSON_NAME_KEY,
-     json_capabilities_key=JSON_CAPABILITIES_KEY,
- )
+         desired_capability=DESIRED_CAPABILITY,
+         json_models_key=JSON_MODELS_KEY,
+         json_name_key=JSON_NAME_KEY,
+         json_capabilities_key=JSON_CAPABILITIES_KEY,
+     )
+ except ValueError:
+     logger.warning("Failed to fetch Ollama models, using selected model only")
+     available_model_names = [model] if model else []

src/backend/base/langflow/initial_setup/starter_projects/Nvidia Remix.json (2)
2065-2145: Avoid eager instantiation of N embedding clients; create on demand.
Creating an instance for every model on each build is wasteful and can slow UI updates. Prefer a lazy factory (dict of callables) or instantiate only when requested by the consumer.
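A minimal sketch of the suggested lazy-factory pattern, assuming OpenAI as the provider; the component's real parameter handling (dimensions, base_url, retries, and so on) is omitted here:

from collections.abc import Callable

from langchain_openai import OpenAIEmbeddings


def build_embedding_factories(model_names: list[str], api_key: str) -> dict[str, Callable[[], OpenAIEmbeddings]]:
    """Register one factory per model; a client is created only when a consumer calls its factory."""
    # Bind each model name as a default argument so every lambda captures its own value
    return {name: (lambda n=name: OpenAIEmbeddings(model=n, api_key=api_key)) for name in model_names}


# Consumers instantiate (and can cache) only the models they actually use:
factories = build_embedding_factories(["text-embedding-3-small", "text-embedding-3-large"], api_key="sk-...")
small_embeddings = factories["text-embedding-3-small"]()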
1759-1810: Add timeouts and error handling to documentation fetcher.
RemixDocumentation._fetch_all_documentation uses httpx.get without a timeout and minimal error handling. Add a short timeout and catch network errors to avoid hanging the flow.
Apply this diff inside the component code block:
- response = httpx.get(search_index_url, follow_redirects=True)
+ try:
+     response = httpx.get(search_index_url, follow_redirects=True, timeout=10.0)
+ except httpx.HTTPError as e:
+     raise ValueError(f"Failed to fetch search index: {e!s}") from e

src/lfx/src/lfx/components/elastic/opensearch_multimodal.py (4)
52-53: Remove or downgrade noisy logging in helper function.
logger.info is called every time get_embedding_field_name is invoked, which happens frequently during search operations with multiple models. This will clutter logs in production.

  def get_embedding_field_name(model_name: str) -> str:
-     logger.info(f"chunk_embedding_{normalize_model_name(model_name)}")
+     # logger.debug(f"chunk_embedding_{normalize_model_name(model_name)}")
      return f"chunk_embedding_{normalize_model_name(model_name)}"
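The normalization helper itself is not shown in this hunk. A plausible sketch, assuming normalize_model_name lowercases the model name and maps non-alphanumeric characters to underscores so per-model field names remain valid OpenSearch field names:

import re


def normalize_model_name(model_name: str) -> str:
    # Assumed behavior for illustration: "text-embedding-3-small" -> "text_embedding_3_small"
    return re.sub(r"[^a-z0-9]+", "_", model_name.lower()).strip("_")


def get_embedding_field_name(model_name: str) -> str:
    return f"chunk_embedding_{normalize_model_name(model_name)}"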
593-594: Consider handling bulk ingestion errors.
The helpers.bulk call doesn't have explicit error handling. If some documents fail to index, the method will still return all IDs as if successful. Consider using raise_on_error=True (default) and handling partial failures.

- helpers.bulk(client, requests, max_chunk_bytes=max_chunk_bytes)
+ success, failed = helpers.bulk(
+     client, requests, max_chunk_bytes=max_chunk_bytes, stats_only=False
+ )
+ if failed:
+     logger.warning(f"Failed to index {len(failed)} documents: {failed[:3]}")
  return return_ids
646-646: Downgrade embedding debug log.
logger.warning is used for a debug log that shows the embedding object. This should be logger.debug or removed.

- logger.warning(f"Embedding: {self.embedding}")
+ logger.debug(f"Embedding: {self.embedding}")
1034-1034: Fix mutable default argument type hint.
The parameter filter_clauses: list[dict] = None should use a | None type hint for clarity.

- def _detect_available_models(self, client: OpenSearch, filter_clauses: list[dict] = None) -> list[str]:
+ def _detect_available_models(self, client: OpenSearch, filter_clauses: list[dict] | None = None) -> list[str]:
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- src/backend/base/langflow/initial_setup/starter_projects/Nvidia Remix.json (2 hunks)
- src/lfx/src/lfx/base/embeddings/embeddings_class.py (1 hunks)
- src/lfx/src/lfx/components/elastic/opensearch_multimodal.py (1 hunks)
- src/lfx/src/lfx/components/models_and_agents/embedding_model.py (7 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
src/lfx/src/lfx/components/models_and_agents/embedding_model.py (2)
src/lfx/src/lfx/base/embeddings/embeddings_class.py (1)
EmbeddingsWithModels (6-116)
src/lfx/src/lfx/base/models/model_utils.py (1)
get_ollama_models (39-108)
src/lfx/src/lfx/base/embeddings/embeddings_class.py (2)
src/lfx/src/lfx/field_typing/constants.py (1)
Embeddings (49-50)
src/lfx/src/lfx/base/tools/flow_tool.py (1)
args (32-34)
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py (3)
src/lfx/src/lfx/inputs/inputs.py (4)
BoolInput (419-432)
HandleInput (75-86)
IntInput (347-380)
StrInput (127-183)
src/lfx/src/lfx/schema/data.py (1)
Data (26-288)
src/lfx/src/lfx/base/embeddings/embeddings_class.py (2)
embed_documents (36-45)
embed_query (47-56)
🔇 Additional comments (10)
src/lfx/src/lfx/components/models_and_agents/embedding_model.py (1)
7-7: LGTM! Import correctly added for the new EmbeddingsWithModels wrapper class.

src/backend/base/langflow/initial_setup/starter_projects/Nvidia Remix.json (2)
1856-1856: Code hash change acknowledged.
No action needed here; just confirming this corresponds to the EmbeddingModelComponent refactor.
2025-2050: fetch_ibm_models function does not exist in the langflow codebase; IBM model fetching is implemented via the watsonx.ai bundle, not a standalone function.
The review comment references a function that cannot be found in the repository. IBM integration in langflow uses a bundle-based architecture (watsonx.ai bundle) that handles dynamic model fetching, rather than a fetch_ibm_models function called by update_build_config. The suggestion about caching and request failure handling may be conceptually valid, but it is directed at the wrong implementation target.

The JSON configuration file snippet shown (lines 2025-2050) contains parameter definitions unrelated to model fetching logic, further confirming a mismatch between the review location and the actual concern.
Likely an incorrect or invalid review comment.
src/lfx/src/lfx/base/embeddings/embeddings_class.py (3)
6-34: LGTM! The wrapper class is well-designed with proper inheritance from Embeddings, clear docstrings, and correct handling of the mutable default argument for available_models.

36-78: LGTM! The embedding methods correctly delegate to the underlying embeddings instance with proper type annotations.

80-116: LGTM! The __call__ method properly checks callability before delegation, __getattr__ correctly forwards unknown attributes to the wrapped instance, and __repr__ provides useful debug information.

src/lfx/src/lfx/components/elastic/opensearch_multimodal.py (4)
116-328: LGTM! The input definitions are comprehensive and well-documented. The is_list=True on the embedding input correctly enables multi-model support.
330-392: LGTM! The model name resolution logic correctly handles multiple embedding providers with a clear priority order. The fallback chain through deployment → model → model_id → model_name ensures compatibility across different providers.
855-882: LGTM! The retry logic with exponential backoff (1s → 2s → 4s, capped at 8s) for embedding generation is well-implemented. The ThreadPoolExecutor usage with bounded workers (max 8) prevents resource exhaustion.
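As a sketch of this pattern (the component's exact signatures and error types differ), per-model query embedding with bounded parallelism and capped exponential backoff might look like:

import time
from concurrent.futures import ThreadPoolExecutor


def embed_with_retry(embeddings, text: str, max_attempts: int = 4) -> list[float]:
    """Retry embed_query with exponential backoff: 1s, 2s, 4s, capped at 8s."""
    delay = 1.0
    for attempt in range(max_attempts):
        try:
            return embeddings.embed_query(text)
        except Exception:  # noqa: BLE001 - a sketch; the real code catches narrower errors
            if attempt == max_attempts - 1:
                raise
            time.sleep(delay)
            delay = min(delay * 2, 8.0)
    raise RuntimeError("unreachable")


def embed_query_per_model(models: dict, query_text: str) -> dict[str, list[float]]:
    """Generate the query embedding for each model in parallel with bounded workers."""
    with ThreadPoolExecutor(max_workers=min(8, max(1, len(models)))) as pool:
        futures = {name: pool.submit(embed_with_retry, emb, query_text) for name, emb in models.items()}
        return {name: future.result() for name, future in futures.items()}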
1547-1590: LGTM! The dynamic UI configuration correctly toggles visibility and required status for authentication fields based on the selected auth mode.
| "dimension": dim, | ||
| "method": { | ||
| "name": "disk_ann", | ||
| "space_type": space_type, | ||
| "engine": engine, | ||
| "parameters": {"ef_construction": ef_construction, "m": m}, | ||
| }, | ||
| }, |
🧩 Analysis chain
Script executed:
#!/bin/bash
# First, let's examine the file structure and understand the context around lines 428-435
cat -n src/lfx/src/lfx/components/elastic/opensearch_multimodal.py | sed -n '400,450p'
<shell_result>
Repository: langflow-ai/langflow
Exit code: 0
stdout:
400 ef_construction: int = 100,
401 m: int = 16,
402 vector_field: str = "vector_field",
403 ) -> dict[str, Any]:
404 """Create the default OpenSearch index mapping for vector search.
405
406 This method generates the index configuration with k-NN settings optimized
407 for approximate nearest neighbor search using the specified vector engine.
408 Includes the embedding_model keyword field for tracking which model was used.
409
410 Args:
411 dim: Dimensionality of the vector embeddings
412 engine: Vector search engine (jvector, nmslib, faiss, lucene)
413 space_type: Distance metric for similarity calculation
414 ef_search: Size of dynamic list used during search
415 ef_construction: Size of dynamic list used during index construction
416 m: Number of bidirectional links for each vector
417 vector_field: Name of the field storing vector embeddings
418
419 Returns:
420 Dictionary containing OpenSearch index mapping configuration
421 """
422 return {
423 "settings": {"index": {"knn": True, "knn.algo_param.ef_search": ef_search}},
424 "mappings": {
425 "properties": {
426 vector_field: {
427 "type": "knn_vector",
428 "dimension": dim,
429 "method": {
430 "name": "disk_ann",
431 "space_type": space_type,
432 "engine": engine,
433 "parameters": {"ef_construction": ef_construction, "m": m},
434 },
435 },
436 "embedding_model": {"type": "keyword"}, # Track which model was used
437 "embedding_dimensions": {"type": "integer"},
438 }
439 },
440 }
441
442 def _ensure_embedding_field_mapping(
443 self,
444 client: OpenSearch,
445 index_name: str,
446 field_name: str,
447 dim: int,
448 engine: str,
449 space_type: str,
450 ef_construction: int,
</shell_result>
Script executed:
#!/bin/bash
# Search for "disk_ann" and "hnsw" occurrences in the file
echo "=== Searching for disk_ann occurrences ==="
rg -n "disk_ann" src/lfx/src/lfx/components/elastic/opensearch_multimodal.py
echo -e "\n=== Searching for hnsw occurrences ==="
rg -n "hnsw" src/lfx/src/lfx/components/elastic/opensearch_multimodal.py
echo -e "\n=== Searching for engine parameter usage ==="
rg -n "engine" src/lfx/src/lfx/components/elastic/opensearch_multimodal.py | head -30Repository: langflow-ai/langflow
<shell_result>
Repository: langflow-ai/langflow
Exit code: 0
stdout:
=== Searching for disk_ann occurrences ===
430: "name": "disk_ann",
475: "name": "disk_ann",
=== Searching for hnsw occurrences ===
=== Searching for engine parameter usage ===
74: - Vector storage with configurable engines (jvector, nmslib, faiss, lucene)
108: "engine",
160: name="engine",
165: "Vector search engine for similarity calculations. 'jvector' is recommended for most use cases. "
397: engine: str = "jvector",
407: for approximate nearest neighbor search using the specified vector engine.
412: engine: Vector search engine (jvector, nmslib, faiss, lucene)
432: "engine": engine,
448: engine: str,
463: engine: Vector search engine
477: "engine": engine,
498: def _validate_aoss_with_engines(self, *, is_aoss: bool, engine: str) -> None:
499: """Validate engine compatibility with Amazon OpenSearch Serverless (AOSS).
501: Amazon OpenSearch Serverless has restrictions on which vector engines
502: can be used. This method ensures the selected engine is compatible.
506: engine: The selected vector search engine
509: ValueError: If AOSS is used with an incompatible engine
511: if is_aoss and engine not in {"nmslib", "faiss"}:
512: msg = "Amazon OpenSearch Service Serverless only supports `nmslib` or `faiss` engines"
901: # Validate engine with AOSS
902: engine = getattr(self, "engine", "jvector")
903: self._validate_aoss_with_engines(is_aoss=is_aoss, engine=engine)
912: engine=engine,
934: engine=engine,
</shell_result>
OpenSearch vector search methods disk_ann hnsw nmslib jvector faiss lucene engine compatibility
<web_search_result>
Summary of OpenSearch k-NN methods & engine compatibility (concise):
- Engines supported in OpenSearch: Lucene (native), Faiss, NMSLIB (deprecated). [1][2].
- HNSW: implemented across engines (Lucene, Faiss, NMSLIB). Use HNSW for most workloads; Lucene HNSW supports l2/cosinesimil/innerproduct and ignores ef_search (uses k). [1][2].
- Faiss: supports HNSW and IVF (IVF requires training). Faiss is used for on-disk/large-scale and supports product quantization/compression options. [1][4].
- NMSLIB: legacy HNSW engine (supported spaces: l2, innerproduct, cosinesimil, l1, linf) but deprecated—prefer Faiss or Lucene. [1].
- Lucene engine: native HNSW, best for small-to-mid indexes and efficient filtering (can apply filters during search). [1][5].
- on_disk / disk-based ANN modes: OpenSearch “on_disk” (low-cost) mode uses Faiss by default; there are also disk-based ANNs in the ecosystem (DiskANN/DISK_ANN in other vector DBs) and emerging disk_ann usage in OpenSearch examples (e.g., disk_ann method with jvector engine in recent posts/features). Check your OpenSearch version for exact support. [2][3][6].
- jvector / disk_ann: OpenSearch examples and recent posts show a "jvector" engine and a "disk_ann" method used for concurrent / disk-optimized graph builds — this is an emergent/experimental option; verify availability in your OpenSearch release. [3].
- AWS OpenSearch Service (managed) notes: serverless/vector-search collections currently restrict engines/features (e.g., some modes/engines like Lucene may not be supported in vector collections; AWS may restrict Faiss features such as IVF). Always consult the AWS docs for service-specific limitations. [2].
If you want, tell me your OpenSearch version (or AWS OpenSearch service type) and I’ll list exactly which methods/engine combinations and parameters are available and show example mappings.
Sources: OpenSearch docs & blog posts and AWS OpenSearch docs.
[1] OpenSearch docs — Methods and engines.
[2] OpenSearch docs — k-NN vector / modes (in_memory, on_disk).
[3] OpenSearch blog / changelog examples showing disk_ann + jvector.
[4] OpenSearch blog “Lucene-on-Faiss” (Faiss+Lucene integration / on-disk).
[5] OpenSearch vector search page (comparison table).
</web_search_result>
Script executed:
#!/bin/bash
# Search for test files or examples that show engine-method combinations
echo "=== Looking for test files ==="
find src/lfx -name "*test*" -o -name "*opensearch*" | head -20
echo -e "\n=== Searching for requirements/version info ==="
find . -name "requirements*.txt" -o -name "pyproject.toml" -o -name "setup.py" | xargs grep -l "opensearch" 2>/dev/null
echo -e "\n=== Looking for any method selection logic ==="
rg -n "method.*name" src/lfx/src/lfx/components/elastic/opensearch_multimodal.py -A 2 -B 2Repository: langflow-ai/langflow
<shell_result>
Repository: langflow-ai/langflow
Exit code: 0
stdout:
=== Looking for test files ===
src/lfx/src/lfx/components/elastic/opensearch.py
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py
src/lfx/tests
src/lfx/tests/conftest.py
src/lfx/tests/data/env_variable_test.json
src/lfx/tests/unit/base/data/test_base_file.py
src/lfx/tests/unit/base/data/test_storage_utils.py
src/lfx/tests/unit/cli/test_common.py
src/lfx/tests/unit/cli/test_run_command.py
src/lfx/tests/unit/cli/test_run_real_flows.py
src/lfx/tests/unit/cli/test_run_starter_projects.py
src/lfx/tests/unit/cli/test_run_starter_projects_backward_compatibility.py
src/lfx/tests/unit/cli/test_script_loader.py
src/lfx/tests/unit/cli/test_serve.py
src/lfx/tests/unit/cli/test_serve_app.py
src/lfx/tests/unit/cli/test_serve_app_streaming.py
src/lfx/tests/unit/cli/test_serve_components.py
src/lfx/tests/unit/cli/test_serve_simple.py
src/lfx/tests/unit/cli/test_validation.py
src/lfx/tests/unit/components/langchain_utilities/test_csv_agent.py
=== Searching for requirements/version info ===
./pyproject.toml
=== Looking for any method selection logic ===
265- options=["basic", "jwt"],
266- info=(
267: "Authentication method: 'basic' for username/password authentication, "
268- "or 'jwt' for JSON Web Token (Bearer) authentication."
269- ),
</shell_result>
Script executed:
#!/bin/bash
# Search for method configuration patterns and how engines are actually used
echo "=== Full method definitions to see parameter patterns ==="
rg -n "\"method\":" src/lfx/src/lfx/components/elastic/opensearch_multimodal.py -A 5 -B 2
echo -e "\n=== Check if there are any comments about method names ==="
rg -n "method|disk_ann|hnsw" src/lfx/src/lfx/components/elastic/opensearch_multimodal.py | grep -E "(#|method|disk_ann|hnsw)"Repository: langflow-ai/langflow
<shell_result>
Repository: langflow-ai/langflow
Exit code: 0
stdout:
=== Full method definitions to see parameter patterns ===
427- "type": "knn_vector",
428- "dimension": dim,
429: "method": {
430- "name": "disk_ann",
431- "space_type": space_type,
432- "engine": engine,
433- "parameters": {"ef_construction": ef_construction, "m": m},
434- },
--
472- "type": "knn_vector",
473- "dimension": dim,
474: "method": {
475- "name": "disk_ann",
476- "space_type": space_type,
477- "engine": engine,
478- "parameters": {"ef_construction": ef_construction, "m": m},
479- },
=== Check if there are any comments about method names ===
267: "Authentication method: 'basic' for username/password authentication, "
406: This method generates the index configuration with k-NN settings optimized
429: "method": {
430: "name": "disk_ann",
474: "method": {
475: "name": "disk_ann",
502: can be used. This method ensures the selected engine is compatible.
544: This method uses bulk operations to insert documents with their vector
654: This method handles the complete document ingestion pipeline:
663: # Convert DataFrame to Data if needed using parent's method
969: This method accepts two filter formats and converts them to standardized
1137: This method executes a sophisticated search that:
1518: This is the main interface method that performs the multi-model search using the
1539: This method handles real-time UI updates, particularly for authentication
</shell_result>
OpenSearch disk_ann method supported engines which engines work
<web_search_result>
Short answer: the disk_ann method in OpenSearch is implemented to use the jvector engine. See the OpenSearch blog example showing method.name = "disk_ann" with engine = "jvector". [1]
Context: OpenSearch's k-NN plugin generally supports engines faiss, lucene and (deprecated) nmslib for other methods (hnsw/ivf), but disk_ann specifically is shown using jvector in the OpenSearch blog announcing concurrent graph construction. [2][1]
Sources:
- OpenSearch blog: "Concurrent vector graph construction" (disk_ann example using engine: "jvector"). [1]
- OpenSearch k‑NN methods & engines documentation (faiss, lucene, nmslib overview). [2]
</web_search_result>
Hardcoded disk_ann method is incompatible with multiple supported engines.
The method name "disk_ann" is hardcoded at lines 430 and 475, but OpenSearch's disk_ann method only works with the jvector engine. Your component supports four engines (jvector, nmslib, faiss, lucene), but when users select nmslib, faiss, or lucene, the index mapping will fail because those engines don't support the disk_ann method—they require hnsw or other engine-specific methods instead.
The method name must be conditional based on the selected engine parameter.
🤖 Prompt for AI Agents
In src/lfx/src/lfx/components/elastic/opensearch_multimodal.py around lines
428-435 (and similarly at ~475) the vector index mapping hardcodes the method
name "disk_ann", which only works with the jvector engine; this will fail for
nmslib, faiss and lucene. Change the code to pick the method name conditionally
based on the selected engine (e.g., use "disk_ann" only when engine ==
"jvector", otherwise use the engine-appropriate method such as "hnsw" for
nmslib/faiss/lucene or another supported method), and ensure any engine-specific
parameters (ef_construction, m, etc.) are set/renamed to match the chosen
method; update both locations so the mapping construction uses the conditional
method string and appropriate parameter set for that engine.
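A minimal sketch of the conditional selection this asks for, assuming "hnsw" is the appropriate method for the non-jvector engines and that the same ef_construction/m parameters apply; engine-specific parameter differences would still need verification against the target OpenSearch version:

def knn_method_for_engine(engine: str, space_type: str, ef_construction: int, m: int) -> dict:
    """Choose the k-NN method block by engine: disk_ann only for jvector, hnsw otherwise."""
    method_name = "disk_ann" if engine == "jvector" else "hnsw"
    return {
        "name": method_name,
        "space_type": space_type,
        "engine": engine,
        "parameters": {"ef_construction": ef_construction, "m": m},
    }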
available_models = self._detect_available_models(client, filter_clauses)

if not available_models:
    logger.warning("No embedding models found in index, using current model")
    available_models = [self._get_embedding_model_name()]
Critical: Variable shadowing causes logic error.
On line 1209, available_models = getattr(emb_obj, "available_models", None) shadows the available_models list from line 1182 that contains the detected models in the index. After the first loop iteration, the outer available_models is overwritten, causing the for model_name in available_models: loop on line 1282 to iterate over the wrong data.
Rename the inner variable to avoid shadowing:
model_name = getattr(emb_obj, "model_name", None)
- available_models = getattr(emb_obj, "available_models", None)
+ emb_available_models = getattr(emb_obj, "available_models", None)
logger.info(
f"Embedding object {idx}: deployment={deployment}, model={model}, "
f"model_id={model_id}, model_name={model_name}, dimensions={dimensions}, "
- f"available_models={available_models}"
+ f"available_models={emb_available_models}"
)
# If this embedding has available_models dict, map all models to their dedicated instances
- if available_models and isinstance(available_models, dict):
+ if emb_available_models and isinstance(emb_available_models, dict):
logger.info(
- f"Embedding object {idx} provides {len(available_models)} models via available_models dict"
+ f"Embedding object {idx} provides {len(emb_available_models)} models via available_models dict"
)
- for model_name_key, dedicated_embedding in available_models.items():
- for model_name_key, dedicated_embedding in available_models.items():
+ for model_name_key, dedicated_embedding in emb_available_models.items():

Apply similar changes to all subsequent usages within the loop (lines 1218-1238).
Also applies to: 1209-1209
🤖 Prompt for AI Agents
In src/lfx/src/lfx/components/elastic/opensearch_multimodal.py around lines
1182-1282, the local variable available_models set at line 1182 is being
shadowed by a second assignment at line 1209 (available_models =
getattr(emb_obj, "available_models", None)), which breaks the outer loop later
(line 1282) — rename the inner variable (for example emb_available_models) and
update all its subsequent uses within that loop (lines ~1218-1238 and any other
occurrences in the same block) so the outer available_models list remains
untouched.
# Check if this is a dedicated instance from available_models dict
if emb_available_models and isinstance(emb_available_models, dict):
    logger.info(
        f"Model '{model_name}' using dedicated instance from available_models dict "
        f"(pre-configured with correct model and dimensions)"
    )
Continue fix for variable shadowing.
These lines also reference emb_available_models (after renaming) and need the same fix applied.
# Check if this is a dedicated instance from available_models dict
- if emb_available_models and isinstance(emb_available_models, dict):
+ emb_avail_models = getattr(emb_obj, "available_models", None)
+ if emb_avail_models and isinstance(emb_avail_models, dict):
logger.info(
f"Model '{model_name}' using dedicated instance from available_models dict "
f"(pre-configured with correct model and dimensions)"
)Note: This is part of the same variable shadowing issue flagged earlier.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
# Check if this is a dedicated instance from available_models dict
if emb_available_models and isinstance(emb_available_models, dict):
    logger.info(
        f"Model '{model_name}' using dedicated instance from available_models dict "
        f"(pre-configured with correct model and dimensions)"
    )

# Check if this is a dedicated instance from available_models dict
emb_avail_models = getattr(emb_obj, "available_models", None)
if emb_avail_models and isinstance(emb_avail_models, dict):
    logger.info(
        f"Model '{model_name}' using dedicated instance from available_models dict "
        f"(pre-configured with correct model and dimensions)"
    )
🤖 Prompt for AI Agents
In src/lfx/src/lfx/components/elastic/opensearch_multimodal.py around lines 1300
to 1305, the code still references the old name `emb_available_models`; replace
this reference with the new variable name you introduced earlier (the one used
elsewhere in the file to avoid shadowing), and keep the same isinstance(dict)
check and logging text; ensure the variable used matches the prior rename so
there are no shadowed/undefined names at runtime.
try:
    raw = self.search(self.search_query or "")
    return [Data(text=hit["page_content"], **hit["metadata"]) for hit in raw]
    self.log(self.ingest_data)
except Exception as e:
    self.log(f"search_documents error: {e}")
    raise
Unreachable code after return statement.
Line 1542 (self.log(self.ingest_data)) is placed after the return statement on line 1541, making it unreachable. This is likely a debugging statement that should be removed or moved before the return.
def search_documents(self) -> list[Data]:
try:
raw = self.search(self.search_query or "")
- return [Data(text=hit["page_content"], **hit["metadata"]) for hit in raw]
- self.log(self.ingest_data)
+ results = [Data(text=hit["page_content"], **hit["metadata"]) for hit in raw]
+ return results
except Exception as e:
self.log(f"search_documents error: {e}")
raise

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
try:
    raw = self.search(self.search_query or "")
    return [Data(text=hit["page_content"], **hit["metadata"]) for hit in raw]
    self.log(self.ingest_data)
except Exception as e:
    self.log(f"search_documents error: {e}")
    raise

try:
    raw = self.search(self.search_query or "")
    results = [Data(text=hit["page_content"], **hit["metadata"]) for hit in raw]
    return results
except Exception as e:
    self.log(f"search_documents error: {e}")
    raise
🤖 Prompt for AI Agents
In src/lfx/src/lfx/components/elastic/opensearch_multimodal.py around lines
1539-1545 there is an unreachable self.log(self.ingest_data) placed after a
return; remove the unreachable line or move the logging before the return so it
executes. Better: call and assign raw = self.search(...), perform any needed
self.log(...) before returning, then return the list comprehension; ensure
self.search is only called once and preserve exception handling.
    return WATSONX_EMBEDDING_MODEL_NAMES

-    def build_embeddings(self) -> Embeddings:
+    async def build_embeddings(self) -> Embeddings:
🧩 Analysis chain
Script executed:
#!/bin/bash
# Search for all usages of build_embeddings method
rg -n --type=py 'build_embeddings' -B2 -A2
<shell_result>
Repository: langflow-ai/langflow
Script execution failed with unrecoverable error: [CAPABILITY:RUN_SHELL_SCRIPT] Failed to clone repository langflow-ai/langflow: Clone operation failed
</shell_result>
Script executed:
#!/bin/bash
# Search for all usages of build_embeddings method across the codebase
rg -n --type=py 'build_embeddings' -B2 -A2
<shell_result>
Repository: langflow-ai/langflow
Exit code: 0
stdout:
src/backend/tests/unit/initial_setup/starter_projects/test_vector_store_rag.py-41- vector_store.set_on_output(name="dataframe", value=DataFrame(data=[Data(text="This is a test file.")]), cache=True)
src/backend/tests/unit/initial_setup/starter_projects/test_vector_store_rag.py-42- vector_store.set(
src/backend/tests/unit/initial_setup/starter_projects/test_vector_store_rag.py:43: embedding_model=openai_embeddings.build_embeddings,
src/backend/tests/unit/initial_setup/starter_projects/test_vector_store_rag.py-44- ingest_data=text_splitter.split_text,
src/backend/tests/unit/initial_setup/starter_projects/test_vector_store_rag.py-45- api_endpoint="https://astra.example.com",
--
src/backend/tests/unit/initial_setup/starter_projects/test_vector_store_rag.py-60- api_endpoint="https://astra.example.com",
src/backend/tests/unit/initial_setup/starter_projects/test_vector_store_rag.py-61- token="token", # noqa: S106
src/backend/tests/unit/initial_setup/starter_projects/test_vector_store_rag.py:62: embedding_model=openai_embeddings.build_embeddings,
src/backend/tests/unit/initial_setup/starter_projects/test_vector_store_rag.py-63- )
src/backend/tests/unit/initial_setup/starter_projects/test_vector_store_rag.py-64- # Mock search_documents
--
src/backend/tests/unit/components/vectorstores/test_local_db_component.py-32-
src/backend/tests/unit/components/vectorstores/test_local_db_component.py-33- return {
src/backend/tests/unit/components/vectorstores/test_local_db_component.py:34: "embedding": OpenAIEmbeddingsComponent(openai_api_key=api_key).build_embeddings(),
src/backend/tests/unit/components/vectorstores/test_local_db_component.py-35- "collection_name": "test_collection",
src/backend/tests/unit/components/vectorstores/test_local_db_component.py-36- "persist": True,
--
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-120-
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-121- @patch("lfx.components.models_and_agents.embedding_model.OpenAIEmbeddings")
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py:122: async def test_build_embeddings_openai(self, mock_openai_embeddings, component_class, default_kwargs):
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-123- # Setup mock
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-124- mock_instance = MagicMock()
--
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-135-
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-136- # Build the embeddings
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py:137: embeddings = component.build_embeddings()
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-138-
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-139- # Verify the OpenAIEmbeddings was called with the correct parameters
--
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-152-
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-153- @patch("langchain_ollama.OllamaEmbeddings")
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py:154: async def test_build_embeddings_ollama(self, mock_ollama_embeddings, component_class, default_kwargs):
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-155- # Setup mock
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-156- mock_instance = MagicMock()
--
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-166-
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-167- # Build the embeddings
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py:168: embeddings = component.build_embeddings()
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-169-
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-170- # Verify the OllamaEmbeddings was called with the correct parameters
--
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-178- @patch("ibm_watsonx_ai.Credentials")
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-179- @patch("langchain_ibm.WatsonxEmbeddings")
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py:180: async def test_build_embeddings_watsonx(
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-181- self, mock_watsonx_embeddings, mock_credentials, mock_api_client, component_class, default_kwargs
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-182- ):
--
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-199-
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-200- # Build the embeddings
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py:201: embeddings = component.build_embeddings()
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-202-
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-203- # Verify Credentials was created correctly
--
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-225- assert embeddings == mock_instance
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-226-
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py:227: async def test_build_embeddings_watsonx_missing_project_id(self, component_class, default_kwargs):
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-228- kwargs = default_kwargs.copy()
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-229- kwargs["provider"] = "IBM watsonx.ai"
--
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-232-
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-233- with pytest.raises(ValueError, match=r"Project ID is required for IBM watsonx.ai"):
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py:234: component.build_embeddings()
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-235-
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py:236: async def test_build_embeddings_openai_missing_api_key(self, component_class, default_kwargs):
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-237- component = component_class(**default_kwargs)
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-238- component.provider = "OpenAI"
--
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-240-
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-241- with pytest.raises(ValueError, match="OpenAI API key is required when using OpenAI provider"):
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py:242: component.build_embeddings()
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-243-
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py:244: async def test_build_embeddings_watsonx_missing_api_key(self, component_class, default_kwargs):
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-245- kwargs = default_kwargs.copy()
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-246- kwargs["provider"] = "IBM watsonx.ai"
--
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-251-
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-252- with pytest.raises(ValueError, match=r"IBM watsonx.ai API key is required when using IBM watsonx.ai provider"):
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py:253: component.build_embeddings()
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-254-
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py:255: async def test_build_embeddings_unknown_provider(self, component_class, default_kwargs):
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-256- component = component_class(**default_kwargs)
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-257- component.provider = "Unknown"
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-258-
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py-259- with pytest.raises(ValueError, match="Unknown provider: Unknown"):
src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py:260: component.build_embeddings()
--
src/backend/tests/unit/components/vectorstores/test_chroma_vector_store_component.py-29-
src/backend/tests/unit/components/vectorstores/test_chroma_vector_store_component.py-30- return {
src/backend/tests/unit/components/vectorstores/test_chroma_vector_store_component.py:31: "embedding": OpenAIEmbeddingsComponent(openai_api_key=api_key).build_embeddings(),
src/backend/tests/unit/components/vectorstores/test_chroma_vector_store_component.py-32- "collection_name": "test_collection",
src/backend/tests/unit/components/vectorstores/test_chroma_vector_store_component.py-33- "persist_directory": tmp_path,
--
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-114-
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-115- @patch("langchain_huggingface.HuggingFaceEmbeddings")
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py:116: def test_build_embeddings_huggingface(self, mock_hf_embeddings, component_class, default_kwargs):
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-117- """Test building HuggingFace embeddings."""
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-118- component = component_class(**default_kwargs)
--
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-121- mock_hf_embeddings.return_value = mock_embeddings
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-122-
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py:123: result = component._build_embeddings("sentence-transformers/all-MiniLM-L6-v2", None)
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-124-
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-125- mock_hf_embeddings.assert_called_once_with(model="sentence-transformers/all-MiniLM-L6-v2")
--
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-127-
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-128- @patch("langchain_openai.OpenAIEmbeddings")
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py:129: def test_build_embeddings_openai(self, mock_openai_embeddings, component_class, default_kwargs):
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-130- """Test building OpenAI embeddings."""
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-131- component = component_class(**default_kwargs)
--
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-134- mock_openai_embeddings.return_value = mock_embeddings
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-135-
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py:136: result = component._build_embeddings("text-embedding-ada-002", "test-api-key")
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-137-
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-138- mock_openai_embeddings.assert_called_once_with(
--
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-143- assert result == mock_embeddings
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-144-
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py:145: def test_build_embeddings_openai_no_key(self, component_class, default_kwargs):
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-146- """Test building OpenAI embeddings without API key raises error."""
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-147- component = component_class(**default_kwargs)
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-148-
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-149- with pytest.raises(ValueError, match="OpenAI API key is required"):
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py:150: component._build_embeddings("text-embedding-ada-002", None)
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-151-
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-152- @patch("langchain_cohere.CohereEmbeddings")
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py:153: def test_build_embeddings_cohere(self, mock_cohere_embeddings, component_class, default_kwargs):
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-154- """Test building Cohere embeddings."""
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-155- component = component_class(**default_kwargs)
--
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-158- mock_cohere_embeddings.return_value = mock_embeddings
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-159-
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py:160: result = component._build_embeddings("embed-english-v3.0", "test-api-key")
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-161-
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-162- mock_cohere_embeddings.assert_called_once_with(
--
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-166- assert result == mock_embeddings
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-167-
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py:168: def test_build_embeddings_cohere_no_key(self, component_class, default_kwargs):
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-169- """Test building Cohere embeddings without API key raises error."""
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-170- component = component_class(**default_kwargs)
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-171-
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-172- with pytest.raises(ValueError, match="Cohere API key is required"):
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py:173: component._build_embeddings("embed-english-v3.0", None)
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-174-
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py:175: def test_build_embeddings_custom_not_supported(self, component_class, default_kwargs):
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-176- """Test building custom embeddings raises NotImplementedError."""
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-177- component = component_class(**default_kwargs)
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-178-
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-179- with pytest.raises(NotImplementedError, match="Custom embedding models not yet supported"):
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py:180: component._build_embeddings("custom-model", "test-key")
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-181-
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-182- @patch("langflow.components.knowledge_bases.ingestion.get_settings_service")
--
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-331- # Mock embedding validation
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-332- with (
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py:333: patch.object(component, "_build_embeddings") as mock_build_emb,
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-334- patch.object(component, "_save_embedding_metadata"),
src/backend/tests/unit/components/files_and_knowledge/test_ingestion.py-335- ):
--
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-159-
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-160- @patch("langchain_huggingface.HuggingFaceEmbeddings")
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py:161: def test_build_embeddings_huggingface(self, mock_hf_embeddings, component_class, default_kwargs):
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-162- """Test building HuggingFace embeddings."""
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-163- component = component_class(**default_kwargs)
--
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-172- mock_hf_embeddings.return_value = mock_embeddings
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-173-
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py:174: result = component._build_embeddings(metadata)
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-175-
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-176- mock_hf_embeddings.assert_called_once_with(model="sentence-transformers/all-MiniLM-L6-v2")
--
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-178-
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-179- @patch("langchain_openai.OpenAIEmbeddings")
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py:180: def test_build_embeddings_openai(self, mock_openai_embeddings, component_class, default_kwargs):
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-181- """Test building OpenAI embeddings."""
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-182- component = component_class(**default_kwargs)
--
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-192- mock_openai_embeddings.return_value = mock_embeddings
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-193-
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py:194: result = component._build_embeddings(metadata)
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-195-
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-196- mock_openai_embeddings.assert_called_once_with(
--
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-201- assert result == mock_embeddings
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-202-
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py:203: def test_build_embeddings_openai_no_key(self, component_class, default_kwargs):
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-204- """Test building OpenAI embeddings without API key raises error."""
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-205- component = component_class(**default_kwargs)
--
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-213-
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-214- with pytest.raises(ValueError, match="OpenAI API key is required"):
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py:215: component._build_embeddings(metadata)
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-216-
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-217- @patch("langchain_cohere.CohereEmbeddings")
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py:218: def test_build_embeddings_cohere(self, mock_cohere_embeddings, component_class, default_kwargs):
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-219- """Test building Cohere embeddings."""
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-220- component = component_class(**default_kwargs)
--
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-230- mock_cohere_embeddings.return_value = mock_embeddings
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-231-
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py:232: result = component._build_embeddings(metadata)
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-233-
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-234- mock_cohere_embeddings.assert_called_once_with(
--
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-238- assert result == mock_embeddings
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-239-
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py:240: def test_build_embeddings_cohere_no_key(self, component_class, default_kwargs):
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-241- """Test building Cohere embeddings without API key raises error."""
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-242- component = component_class(**default_kwargs)
--
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-250-
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-251- with pytest.raises(ValueError, match="Cohere API key is required"):
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py:252: component._build_embeddings(metadata)
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-253-
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py:254: def test_build_embeddings_custom_not_supported(self, component_class, default_kwargs):
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-255- """Test building custom embeddings raises NotImplementedError."""
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-256- component = component_class(**default_kwargs)
--
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-263-
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-264- with pytest.raises(NotImplementedError, match="Custom embedding models not yet supported"):
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py:265: component._build_embeddings(metadata)
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-266-
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py:267: def test_build_embeddings_unsupported_provider(self, component_class, default_kwargs):
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-268- """Test building embeddings with unsupported provider raises NotImplementedError."""
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-269- component = component_class(**default_kwargs)
--
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-276-
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-277- with pytest.raises(NotImplementedError, match="Embedding provider 'UnsupportedProvider' is not supported"):
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py:278: component._build_embeddings(metadata)
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-279-
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py:280: def test_build_embeddings_with_user_api_key(self, component_class, default_kwargs):
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-281- """Test that user-provided API key overrides stored one."""
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-282- # Use a real SecretStr object instead of a mock
--
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-297- mock_openai.return_value = mock_embeddings
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-298-
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py:299: component._build_embeddings(metadata)
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-300-
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-301- # The user-provided key should override the stored key in metadata
--
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-348- with (
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-349- patch.object(component, "_get_kb_metadata") as mock_get_metadata,
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py:350: patch.object(component, "_build_embeddings") as mock_build_embeddings,
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-351- patch("langchain_chroma.Chroma"),
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-352- ):
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-353- mock_get_metadata.return_value = {"embedding_provider": "HuggingFace", "embedding_model": "test-model"}
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py:354: mock_build_embeddings.return_value = MagicMock()
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-355-
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-356- # This is a unit test focused on the component's internal logic
--
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-360- # Verify internal methods were called
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-361- mock_get_metadata.assert_called_once()
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py:362: mock_build_embeddings.assert_called_once()
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-363-
src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py-364- def test_include_embeddings_parameter(self, component_class, default_kwargs):
--
src/backend/base/langflow/initial_setup/starter_projects/vector_store_rag.py-20- vector_store = AstraDBVectorStoreComponent()
src/backend/base/langflow/initial_setup/starter_projects/vector_store_rag.py-21- vector_store.set(
src/backend/base/langflow/initial_setup/starter_projects/vector_store_rag.py:22: embedding_model=openai_embeddings.build_embeddings,
src/backend/base/langflow/initial_setup/starter_projects/vector_store_rag.py-23- ingest_data=text_splitter.split_text,
src/backend/base/langflow/initial_setup/starter_projects/vector_store_rag.py-24- )
--
src/backend/base/langflow/initial_setup/starter_projects/vector_store_rag.py-34- rag_vector_store.set(
src/backend/base/langflow/initial_setup/starter_projects/vector_store_rag.py-35- search_query=chat_input.message_response,
src/backend/base/langflow/initial_setup/starter_projects/vector_store_rag.py:36: embedding_model=openai_embeddings.build_embeddings,
src/backend/base/langflow/initial_setup/starter_projects/vector_store_rag.py-37- )
src/backend/base/langflow/initial_setup/starter_projects/vector_store_rag.py-38-
--
src/lfx/src/lfx/components/vertexai/vertexai_embeddings.py-34-
src/lfx/src/lfx/components/vertexai/vertexai_embeddings.py-35- outputs = [
src/lfx/src/lfx/components/vertexai/vertexai_embeddings.py:36: Output(display_name="Embeddings", name="embeddings", method="build_embeddings"),
src/lfx/src/lfx/components/vertexai/vertexai_embeddings.py-37- ]
src/lfx/src/lfx/components/vertexai/vertexai_embeddings.py-38-
src/lfx/src/lfx/components/vertexai/vertexai_embeddings.py:39: def build_embeddings(self) -> Embeddings:
src/lfx/src/lfx/components/vertexai/vertexai_embeddings.py-40- try:
src/lfx/src/lfx/components/vertexai/vertexai_embeddings.py-41- from langchain_google_vertexai import VertexAIEmbeddings
--
src/lfx/src/lfx/components/ollama/ollama_embeddings.py-41-
src/lfx/src/lfx/components/ollama/ollama_embeddings.py-42- outputs = [
src/lfx/src/lfx/components/ollama/ollama_embeddings.py:43: Output(display_name="Embeddings", name="embeddings", method="build_embeddings"),
src/lfx/src/lfx/components/ollama/ollama_embeddings.py-44- ]
src/lfx/src/lfx/components/ollama/ollama_embeddings.py-45-
src/lfx/src/lfx/components/ollama/ollama_embeddings.py:46: def build_embeddings(self) -> Embeddings:
src/lfx/src/lfx/components/ollama/ollama_embeddings.py-47- transformed_base_url = transform_localhost_url(self.base_url)
src/lfx/src/lfx/components/ollama/ollama_embeddings.py-48- try:
--
src/lfx/src/lfx/components/twelvelabs/text_embeddings.py-54- ]
src/lfx/src/lfx/components/twelvelabs/text_embeddings.py-55-
src/lfx/src/lfx/components/twelvelabs/text_embeddings.py:56: def build_embeddings(self) -> Embeddings:
src/lfx/src/lfx/components/twelvelabs/text_embeddings.py-57- return TwelveLabsTextEmbeddings(api_key=self.api_key, model=self.model)
--
src/lfx/src/lfx/components/twelvelabs/video_embeddings.py-97- ]
src/lfx/src/lfx/components/twelvelabs/video_embeddings.py-98-
src/lfx/src/lfx/components/twelvelabs/video_embeddings.py:99: def build_embeddings(self) -> Embeddings:
src/lfx/src/lfx/components/twelvelabs/video_embeddings.py-100- return TwelveLabsVideoEmbeddings(api_key=self.api_key, model_name=self.model_name)
--
src/lfx/src/lfx/components/openai/openai.py-73- ]
src/lfx/src/lfx/components/openai/openai.py-74-
src/lfx/src/lfx/components/openai/openai.py:75: def build_embeddings(self) -> Embeddings:
src/lfx/src/lfx/components/openai/openai.py-76- return OpenAIEmbeddings(
src/lfx/src/lfx/components/openai/openai.py-77- client=self.client or None,
--
src/lfx/src/lfx/components/nvidia/nvidia_embedding.py-50- if field_name == "base_url" and field_value:
src/lfx/src/lfx/components/nvidia/nvidia_embedding.py-51- try:
src/lfx/src/lfx/components/nvidia/nvidia_embedding.py:52: build_model = self.build_embeddings()
src/lfx/src/lfx/components/nvidia/nvidia_embedding.py-53- ids = [model.id for model in build_model.available_models]
src/lfx/src/lfx/components/nvidia/nvidia_embedding.py-54- build_config["model"]["options"] = ids
--
src/lfx/src/lfx/components/nvidia/nvidia_embedding.py-59- return build_config
src/lfx/src/lfx/components/nvidia/nvidia_embedding.py-60-
src/lfx/src/lfx/components/nvidia/nvidia_embedding.py:61: def build_embeddings(self) -> Embeddings:
src/lfx/src/lfx/components/nvidia/nvidia_embedding.py-62- try:
src/lfx/src/lfx/components/nvidia/nvidia_embedding.py-63- from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings
--
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-153- return WATSONX_EMBEDDING_MODEL_NAMES
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-154-
src/lfx/src/lfx/components/models_and_agents/embedding_model.py:155: async def build_embeddings(self) -> Embeddings:
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-156- provider = self.provider
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-157- model = self.model
--
src/lfx/src/lfx/components/lmstudio/lmstudioembeddings.py-71- ]
src/lfx/src/lfx/components/lmstudio/lmstudioembeddings.py-72-
src/lfx/src/lfx/components/lmstudio/lmstudioembeddings.py:73: def build_embeddings(self) -> Embeddings:
src/lfx/src/lfx/components/lmstudio/lmstudioembeddings.py-74- try:
src/lfx/src/lfx/components/lmstudio/lmstudioembeddings.py-75- from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings
--
src/lfx/src/lfx/components/mistral/mistral_embeddings.py-39-
src/lfx/src/lfx/components/mistral/mistral_embeddings.py-40- outputs = [
src/lfx/src/lfx/components/mistral/mistral_embeddings.py:41: Output(display_name="Embeddings", name="embeddings", method="build_embeddings"),
src/lfx/src/lfx/components/mistral/mistral_embeddings.py-42- ]
src/lfx/src/lfx/components/mistral/mistral_embeddings.py-43-
src/lfx/src/lfx/components/mistral/mistral_embeddings.py:44: def build_embeddings(self) -> Embeddings:
src/lfx/src/lfx/components/mistral/mistral_embeddings.py-45- if not self.mistral_api_key:
src/lfx/src/lfx/components/mistral/mistral_embeddings.py-46- msg = "Mistral API Key is required"
--
src/lfx/src/lfx/components/langchain_utilities/fake_embeddings.py-21- ]
src/lfx/src/lfx/components/langchain_utilities/fake_embeddings.py-22-
src/lfx/src/lfx/components/langchain_utilities/fake_embeddings.py:23: def build_embeddings(self) -> Embeddings:
src/lfx/src/lfx/components/langchain_utilities/fake_embeddings.py-24- return FakeEmbeddings(
src/lfx/src/lfx/components/langchain_utilities/fake_embeddings.py-25- size=self.dimensions or 5,
--
src/lfx/src/lfx/components/ibm/watsonx_embeddings.py-115- logger.exception("Error updating model options.")
src/lfx/src/lfx/components/ibm/watsonx_embeddings.py-116-
src/lfx/src/lfx/components/ibm/watsonx_embeddings.py:117: def build_embeddings(self) -> Embeddings:
src/lfx/src/lfx/components/ibm/watsonx_embeddings.py-118- credentials = Credentials(
src/lfx/src/lfx/components/ibm/watsonx_embeddings.py-119- api_key=SecretStr(self.api_key).get_secret_value(),
--
src/lfx/src/lfx/components/huggingface/huggingface_inference_api.py-44-
src/lfx/src/lfx/components/huggingface/huggingface_inference_api.py-45- outputs = [
src/lfx/src/lfx/components/huggingface/huggingface_inference_api.py:46: Output(display_name="Embeddings", name="embeddings", method="build_embeddings"),
src/lfx/src/lfx/components/huggingface/huggingface_inference_api.py-47- ]
src/lfx/src/lfx/components/huggingface/huggingface_inference_api.py-48-
--
src/lfx/src/lfx/components/huggingface/huggingface_inference_api.py-83- return HuggingFaceInferenceAPIEmbeddings(api_key=api_key, api_url=api_url, model_name=model_name)
src/lfx/src/lfx/components/huggingface/huggingface_inference_api.py-84-
src/lfx/src/lfx/components/huggingface/huggingface_inference_api.py:85: def build_embeddings(self) -> Embeddings:
src/lfx/src/lfx/components/huggingface/huggingface_inference_api.py-86- api_url = self.get_api_url()
src/lfx/src/lfx/components/huggingface/huggingface_inference_api.py-87-
--
src/lfx/src/lfx/components/google/google_generative_ai_embeddings.py-34-
src/lfx/src/lfx/components/google/google_generative_ai_embeddings.py-35- outputs = [
src/lfx/src/lfx/components/google/google_generative_ai_embeddings.py:36: Output(display_name="Embeddings", name="embeddings", method="build_embeddings"),
src/lfx/src/lfx/components/google/google_generative_ai_embeddings.py-37- ]
src/lfx/src/lfx/components/google/google_generative_ai_embeddings.py-38-
src/lfx/src/lfx/components/google/google_generative_ai_embeddings.py:39: def build_embeddings(self) -> Embeddings:
src/lfx/src/lfx/components/google/google_generative_ai_embeddings.py-40- if not self.api_key:
src/lfx/src/lfx/components/google/google_generative_ai_embeddings.py-41- msg = "API Key is required"
--
src/lfx/src/lfx/components/files_and_knowledge/retrieval.py-135- return metadata
src/lfx/src/lfx/components/files_and_knowledge/retrieval.py-136-
src/lfx/src/lfx/components/files_and_knowledge/retrieval.py:137: def _build_embeddings(self, metadata: dict):
src/lfx/src/lfx/components/files_and_knowledge/retrieval.py-138- """Build embedding model from metadata."""
src/lfx/src/lfx/components/files_and_knowledge/retrieval.py-139- runtime_api_key = self.api_key.get_secret_value() if isinstance(self.api_key, SecretStr) else self.api_key
--
src/lfx/src/lfx/components/files_and_knowledge/retrieval.py-203-
src/lfx/src/lfx/components/files_and_knowledge/retrieval.py-204- # Build the embedder for the knowledge base
src/lfx/src/lfx/components/files_and_knowledge/retrieval.py:205: embedding_function = self._build_embeddings(metadata)
src/lfx/src/lfx/components/files_and_knowledge/retrieval.py-206-
src/lfx/src/lfx/components/files_and_knowledge/retrieval.py-207- # Load vector store
--
src/lfx/src/lfx/components/files_and_knowledge/ingestion.py-243- return "Custom"
src/lfx/src/lfx/components/files_and_knowledge/ingestion.py-244-
src/lfx/src/lfx/components/files_and_knowledge/ingestion.py:245: def _build_embeddings(self, embedding_model: str, api_key: str):
src/lfx/src/lfx/components/files_and_knowledge/ingestion.py-246- """Build embedding model using provider patterns."""
src/lfx/src/lfx/components/files_and_knowledge/ingestion.py-247- # Get provider by matching model name to lists
--
src/lfx/src/lfx/components/files_and_knowledge/ingestion.py-385-
src/lfx/src/lfx/components/files_and_knowledge/ingestion.py-386- # Create embeddings model
src/lfx/src/lfx/components/files_and_knowledge/ingestion.py:387: embedding_function = self._build_embeddings(embedding_model, api_key)
src/lfx/src/lfx/components/files_and_knowledge/ingestion.py-388-
src/lfx/src/lfx/components/files_and_knowledge/ingestion.py-389- # Convert DataFrame to Data objects (following Local DB pattern)
--
src/lfx/src/lfx/components/files_and_knowledge/ingestion.py-655-
src/lfx/src/lfx/components/files_and_knowledge/ingestion.py-656- # We need to test the API Key one time against the embedding model
src/lfx/src/lfx/components/files_and_knowledge/ingestion.py:657: embed_model = self._build_embeddings(embedding_model=field_value["02_embedding_model"], api_key=api_key)
src/lfx/src/lfx/components/files_and_knowledge/ingestion.py-658-
src/lfx/src/lfx/components/files_and_knowledge/ingestion.py-659- # Try to generate a dummy embedding to validate the API key without blocking the event loop
--
src/lfx/src/lfx/components/azure/azure_openai_embeddings.py-64-
src/lfx/src/lfx/components/azure/azure_openai_embeddings.py-65- outputs = [
src/lfx/src/lfx/components/azure/azure_openai_embeddings.py:66: Output(display_name="Embeddings", name="embeddings", method="build_embeddings"),
src/lfx/src/lfx/components/azure/azure_openai_embeddings.py-67- ]
src/lfx/src/lfx/components/azure/azure_openai_embeddings.py-68-
src/lfx/src/lfx/components/azure/azure_openai_embeddings.py:69: def build_embeddings(self) -> Embeddings:
src/lfx/src/lfx/components/azure/azure_openai_embeddings.py-70- try:
src/lfx/src/lfx/components/azure/azure_openai_embeddings.py-71- embeddings = AzureOpenAIEmbeddings(
--
src/lfx/src/lfx/components/cloudflare/cloudflare.py-61-
src/lfx/src/lfx/components/cloudflare/cloudflare.py-62- outputs = [
src/lfx/src/lfx/components/cloudflare/cloudflare.py:63: Output(display_name="Embeddings", name="embeddings", method="build_embeddings"),
src/lfx/src/lfx/components/cloudflare/cloudflare.py-64- ]
src/lfx/src/lfx/components/cloudflare/cloudflare.py-65-
src/lfx/src/lfx/components/cloudflare/cloudflare.py:66: def build_embeddings(self) -> Embeddings:
src/lfx/src/lfx/components/cloudflare/cloudflare.py-67- try:
src/lfx/src/lfx/components/cloudflare/cloudflare.py-68- embeddings = CloudflareWorkersAIEmbeddings(
--
src/lfx/src/lfx/components/cohere/cohere_embeddings.py-40-
src/lfx/src/lfx/components/cohere/cohere_embeddings.py-41- outputs = [
src/lfx/src/lfx/components/cohere/cohere_embeddings.py:42: Output(display_name="Embeddings", name="embeddings", method="build_embeddings"),
src/lfx/src/lfx/components/cohere/cohere_embeddings.py-43- ]
src/lfx/src/lfx/components/cohere/cohere_embeddings.py-44-
src/lfx/src/lfx/components/cohere/cohere_embeddings.py:45: def build_embeddings(self) -> Embeddings:
src/lfx/src/lfx/components/cohere/cohere_embeddings.py-46- data = None
src/lfx/src/lfx/components/cohere/cohere_embeddings.py-47- try:
--
src/lfx/src/lfx/components/amazon/amazon_bedrock_embedding.py-69-
src/lfx/src/lfx/components/amazon/amazon_bedrock_embedding.py-70- outputs = [
src/lfx/src/lfx/components/amazon/amazon_bedrock_embedding.py:71: Output(display_name="Embeddings", name="embeddings", method="build_embeddings"),
src/lfx/src/lfx/components/amazon/amazon_bedrock_embedding.py-72- ]
src/lfx/src/lfx/components/amazon/amazon_bedrock_embedding.py-73-
src/lfx/src/lfx/components/amazon/amazon_bedrock_embedding.py:74: def build_embeddings(self) -> Embeddings:
src/lfx/src/lfx/components/amazon/amazon_bedrock_embedding.py-75- try:
src/lfx/src/lfx/components/amazon/amazon_bedrock_embedding.py-76- from langchain_aws import BedrockEmbeddings
--
src/lfx/src/lfx/components/aiml/aiml_embeddings.py-31- ]
src/lfx/src/lfx/components/aiml/aiml_embeddings.py-32-
src/lfx/src/lfx/components/aiml/aiml_embeddings.py:33: def build_embeddings(self) -> Embeddings:
src/lfx/src/lfx/components/aiml/aiml_embeddings.py-34- return AIMLEmbeddingsImpl(
src/lfx/src/lfx/components/aiml/aiml_embeddings.py-35- api_key=self.aiml_api_key,
--
src/lfx/src/lfx/base/embeddings/model.py-8-
src/lfx/src/lfx/base/embeddings/model.py-9- outputs = [
src/lfx/src/lfx/base/embeddings/model.py:10: Output(display_name="Embedding Model", name="embeddings", method="build_embeddings"),
src/lfx/src/lfx/base/embeddings/model.py-11- ]
src/lfx/src/lfx/base/embeddings/model.py-12-
src/lfx/src/lfx/base/embeddings/model.py-13- def _validate_outputs(self) -> None:
src/lfx/src/lfx/base/embeddings/model.py:14: required_output_methods = ["build_embeddings"]
src/lfx/src/lfx/base/embeddings/model.py-15- output_names = [output.name for output in self.outputs]
src/lfx/src/lfx/base/embeddings/model.py-16- for method_name in required_output_methods:
--
src/lfx/src/lfx/base/embeddings/model.py-22- raise ValueError(msg)
src/lfx/src/lfx/base/embeddings/model.py-23-
src/lfx/src/lfx/base/embeddings/model.py:24: def build_embeddings(self) -> Embeddings:
src/lfx/src/lfx/base/embeddings/model.py:25: msg = "You must implement the build_embeddings method in your class."
src/lfx/src/lfx/base/embeddings/model.py-26- raise NotImplementedError(msg)
</shell_result>
Script executed:
#!/bin/bash
# Check test_local_db_component.py around line 34
sed -n '30,40p' src/backend/tests/unit/components/vectorstores/test_local_db_component.py
# Check test_embedding_model_component.py around lines 137, 168
sed -n '130,145p' src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py
# Check if the test functions are async
sed -n '120,125p' src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py
Repository: langflow-ai/langflow
<shell_result>
Repository: langflow-ai/langflow
Exit code: 0
stdout:
except ValueError:
pytest.skip("OPENAI_API_KEY is not set")
return {
"embedding": OpenAIEmbeddingsComponent(openai_api_key=api_key).build_embeddings(),
"collection_name": "test_collection",
"persist": True,
"persist_directory": str(tmp_path), # Convert Path to string
"mode": "Ingest",
}
component.model = "text-embedding-3-small"
component.api_key = "test-key" # pragma:allowlist secret
component.chunk_size = 1000
component.max_retries = 3
component.show_progress_bar = False
# Build the embeddings
embeddings = component.build_embeddings()
# Verify the OpenAIEmbeddings was called with the correct parameters
mock_openai_embeddings.assert_called_once_with(
model="text-embedding-3-small",
dimensions=None,
base_url=None,
api_key="test-key", # pragma:allowlist secret
chunk_size=1000,
@patch("lfx.components.models_and_agents.embedding_model.OpenAIEmbeddings")
async def test_build_embeddings_openai(self, mock_openai_embeddings, component_class, default_kwargs):
# Setup mock
mock_instance = MagicMock()
mock_openai_embeddings.return_value = mock_instance
</shell_result>
Script executed:
#!/bin/bash
# Search for any uses of "await build_embeddings" to see if callers are properly awaiting
rg -n --type=py 'await.*build_embeddings'
# Check the actual implementation of the async method
sed -n '155,200p' src/lfx/src/lfx/components/models_and_agents/embedding_model.py
Repository: langflow-ai/langflow
<shell_result>
Repository: langflow-ai/langflow
Exit code: 0
stdout:
async def build_embeddings(self) -> Embeddings:
provider = self.provider
model = self.model
api_key = self.api_key
api_base = self.api_base
base_url_ibm_watsonx = self.base_url_ibm_watsonx
ollama_base_url = self.ollama_base_url
dimensions = self.dimensions
chunk_size = self.chunk_size
request_timeout = self.request_timeout
max_retries = self.max_retries
show_progress_bar = self.show_progress_bar
model_kwargs = self.model_kwargs or {}
if provider == "OpenAI":
if not api_key:
msg = "OpenAI API key is required when using OpenAI provider"
raise ValueError(msg)
# Create the primary embedding instance
embeddings_instance = OpenAIEmbeddings(
model=model,
dimensions=dimensions or None,
base_url=api_base or None,
api_key=api_key,
chunk_size=chunk_size,
max_retries=max_retries,
timeout=request_timeout or None,
show_progress_bar=show_progress_bar,
model_kwargs=model_kwargs,
)
# Create dedicated instances for each available model
available_models_dict = {}
for model_name in OPENAI_EMBEDDING_MODEL_NAMES:
available_models_dict[model_name] = OpenAIEmbeddings(
model=model_name,
dimensions=dimensions or None, # Use same dimensions config for all
base_url=api_base or None,
api_key=api_key,
chunk_size=chunk_size,
max_retries=max_retries,
timeout=request_timeout or None,
show_progress_bar=show_progress_bar,
model_kwargs=model_kwargs,
)
</shell_result>
Breaking change not properly propagated: callers are not awaiting the async method.
The build_embeddings method was converted to async but call sites throughout the codebase are not awaiting it:
- src/backend/tests/unit/components/vectorstores/test_local_db_component.py:34
- src/backend/tests/unit/components/vectorstores/test_chroma_vector_store_component.py:31
- src/backend/tests/unit/components/models_and_agents/test_embedding_model_component.py:137, 168, 201, 234, 242, 253, 260
Additionally, the method body contains no await operations despite being declared async, and all other embedding components in the codebase remain synchronous. Either update all call sites to use await, or reconsider whether this method should be async.
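If the async signature is kept, every call site has to switch to an awaited call. A minimal sketch of the two call-site shapes, using a stand-in class rather than the project's EmbeddingModelComponent, is:

import asyncio


class FakeEmbeddingComponent:
    """Stand-in for the component, used only to show the call shapes."""

    async def build_embeddings(self) -> list[str]:
        # The real method builds provider embeddings; a marker value is enough here.
        return ["embeddings-instance"]


def sync_caller() -> None:
    component = FakeEmbeddingComponent()
    # Synchronous call sites can no longer call the coroutine directly;
    # they need an event loop (or the method should be reverted to sync).
    print(asyncio.run(component.build_embeddings()))


async def async_caller() -> None:
    component = FakeEmbeddingComponent()
    # Async call sites (e.g. pytest-asyncio tests) simply await the builder.
    print(await component.build_embeddings())


if __name__ == "__main__":
    sync_caller()
    asyncio.run(async_caller())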
🤖 Prompt for AI Agents
In src/lfx/src/lfx/components/models_and_agents/embedding_model.py around line
155, the method build_embeddings was changed to async but its body contains no
awaits and many callers are not awaiting it, causing breaking behavior; either
revert build_embeddings to a synchronous def and restore the previous
(non-async) signature so all existing call sites continue to work, or keep it
async and (1) introduce real awaitable operations inside (or wrap the work in an
executor) and (2) update every caller/test listed to await build_embeddings;
pick one approach and apply it consistently across the embedding component
interface and all referenced call sites/tests.
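If the second option is taken (keep the method async and give it real awaitable work), the blocking construction could be offloaded to a worker thread. This is only a sketch; _build_embeddings_sync is a hypothetical helper, not a method in the PR:

import asyncio

from langchain_core.embeddings import Embeddings, FakeEmbeddings


class SketchEmbeddingComponent:
    """Illustrative only: an async builder delegating to synchronous construction."""

    def _build_embeddings_sync(self) -> Embeddings:
        # The existing provider-specific construction logic would live here.
        return FakeEmbeddings(size=8)

    async def build_embeddings(self) -> Embeddings:
        # Run the blocking constructor in a worker thread so the coroutine
        # genuinely awaits something instead of being async in name only.
        return await asyncio.to_thread(self._build_embeddings_sync)


if __name__ == "__main__":
    embeddings = asyncio.run(SketchEmbeddingComponent().build_embeddings())
    print(type(embeddings).__name__)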
# Create dedicated instances for each available model
available_models_dict = {}
for model_name in OPENAI_EMBEDDING_MODEL_NAMES:
    available_models_dict[model_name] = OpenAIEmbeddings(
        model=model_name,
        dimensions=dimensions or None,  # Use same dimensions config for all
        base_url=api_base or None,
        api_key=api_key,
        chunk_size=chunk_size,
        max_retries=max_retries,
        timeout=request_timeout or None,
        show_progress_bar=show_progress_bar,
        model_kwargs=model_kwargs,
    )
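The loop above repeats the constructor arguments already used for the primary instance. If that duplication becomes a maintenance concern, one option is a small factory over the shared kwargs; a rough sketch under assumed placeholder values (the real values come from the component's inputs, and the model list is trimmed for illustration):

from langchain_openai import OpenAIEmbeddings

OPENAI_EMBEDDING_MODEL_NAMES = ["text-embedding-3-small", "text-embedding-3-large"]  # trimmed list, for illustration

# Shared settings the component would normally read from its inputs.
common_kwargs = {
    "dimensions": None,
    "base_url": None,
    "api_key": "sk-placeholder",  # pragma: allowlist secret
    "chunk_size": 1000,
    "max_retries": 3,
    "timeout": None,
    "show_progress_bar": False,
    "model_kwargs": {},
}


def make_openai_embeddings(model_name: str) -> OpenAIEmbeddings:
    """Build one OpenAIEmbeddings instance with the shared settings."""
    return OpenAIEmbeddings(model=model_name, **common_kwargs)


primary = make_openai_embeddings("text-embedding-3-small")
available_models_dict = {name: make_openai_embeddings(name) for name in OPENAI_EMBEDDING_MODEL_NAMES}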
🧩 Analysis chain
Script executed:
# Find the OPENAI_EMBEDDING_MODEL_NAMES constant definition
rg "OPENAI_EMBEDDING_MODEL_NAMES" -n -B2 -A5Repository: langflow-ai/langflow
<shell_result>
Repository: langflow-ai/langflow
Exit code: 0
stdout:
src/lfx/tests/data/starter_projects_1_6_0/Nvidia Remix.json-1783- "title_case": false,
src/lfx/tests/data/starter_projects_1_6_0/Nvidia Remix.json-1784- "type": "code",
src/lfx/tests/data/starter_projects_1_6_0/Nvidia Remix.json:1785: "value": "from typing import Any\n\nfrom langchain_openai import OpenAIEmbeddings\n\nfrom langflow.base.embeddings.model import LCEmbeddingsModel\nfrom langflow.base.models.openai_constants import OPENAI_EMBEDDING_MODEL_NAMES\nfrom langflow.field_typing import Embeddings\nfrom langflow.io import (\n BoolInput,\n DictInput,\n DropdownInput,\n FloatInput,\n IntInput,\n MessageTextInput,\n SecretStrInput,\n)\nfrom langflow.schema.dotdict import dotdict\n\n\nclass EmbeddingModelComponent(LCEmbeddingsModel):\n display_name = \"Embedding Model\"\n description = \"Generate embeddings using a specified provider.\"\n documentation: str = \"https://docs.langflow.org/components-embedding-models\"\n icon = \"binary\"\n name = \"EmbeddingModel\"\n category = \"models\"\n\n inputs = [\n DropdownInput(\n name=\"provider\",\n display_name=\"Model Provider\",\n options=[\"OpenAI\"],\n value=\"OpenAI\",\n info=\"Select the embedding model provider\",\n real_time_refresh=True,\n options_metadata=[{\"icon\": \"OpenAI\"}],\n ),\n DropdownInput(\n name=\"model\",\n display_name=\"Model Name\",\n options=OPENAI_EMBEDDING_MODEL_NAMES,\n value=OPENAI_EMBEDDING_MODEL_NAMES[0],\n info=\"Select the embedding model to use\",\n ),\n SecretStrInput(\n name=\"api_key\",\n display_name=\"OpenAI API Key\",\n info=\"Model Provider API key\",\n required=True,\n show=True,\n real_time_refresh=True,\n ),\n MessageTextInput(\n name=\"api_base\",\n display_name=\"API Base URL\",\n info=\"Base URL for the API. Leave empty for default.\",\n advanced=True,\n ),\n IntInput(\n name=\"dimensions\",\n display_name=\"Dimensions\",\n info=\"The number of dimensions the resulting output embeddings should have. \"\n \"Only supported by certain models.\",\n advanced=True,\n ),\n IntInput(name=\"chunk_size\", display_name=\"Chunk Size\", advanced=True, value=1000),\n FloatInput(name=\"request_timeout\", display_name=\"Request Timeout\", advanced=True),\n IntInput(name=\"max_retries\", display_name=\"Max Retries\", advanced=True, value=3),\n BoolInput(name=\"show_progress_bar\", display_name=\"Show Progress Bar\", advanced=True),\n DictInput(\n name=\"model_kwargs\",\n display_name=\"Model Kwargs\",\n advanced=True,\n info=\"Additional keyword arguments to pass to the model.\",\n ),\n ]\n\n def build_embeddings(self) -> Embeddings:\n provider = self.provider\n model = self.model\n api_key = self.api_key\n api_base = self.api_base\n dimensions = self.dimensions\n chunk_size = self.chunk_size\n request_timeout = self.request_timeout\n max_retries = self.max_retries\n show_progress_bar = self.show_progress_bar\n model_kwargs = self.model_kwargs or {}\n\n if provider == \"OpenAI\":\n if not api_key:\n msg = \"OpenAI API key is required when using OpenAI provider\"\n raise ValueError(msg)\n return OpenAIEmbeddings(\n model=model,\n dimensions=dimensions or None,\n base_url=api_base or None,\n api_key=api_key,\n chunk_size=chunk_size,\n max_retries=max_retries,\n timeout=request_timeout or None,\n show_progress_bar=show_progress_bar,\n model_kwargs=model_kwargs,\n )\n msg = f\"Unknown provider: {provider}\"\n raise ValueError(msg)\n\n def update_build_config(self, build_config: dotdict, field_value: Any, field_name: str | None = None) -> dotdict:\n if field_name == \"provider\" and field_value == \"OpenAI\":\n build_config[\"model\"][\"options\"] = OPENAI_EMBEDDING_MODEL_NAMES\n build_config[\"model\"][\"value\"] = OPENAI_EMBEDDING_MODEL_NAMES[0]\n 
build_config[\"api_key\"][\"display_name\"] = \"OpenAI API Key\"\n build_config[\"api_base\"][\"display_name\"] = \"OpenAI API Base URL\"\n return build_config\n"
src/lfx/tests/data/starter_projects_1_6_0/Nvidia Remix.json-1786- },
src/lfx/tests/data/starter_projects_1_6_0/Nvidia Remix.json-1787- "dimensions": {
src/lfx/tests/data/starter_projects_1_6_0/Nvidia Remix.json-1788- "_input_type": "IntInput",
src/lfx/tests/data/starter_projects_1_6_0/Nvidia Remix.json-1789- "advanced": true,
src/lfx/tests/data/starter_projects_1_6_0/Nvidia Remix.json-1790- "display_name": "Dimensions",
--
src/lfx/tests/data/starter_projects_1_6_0/Vector Store RAG.json-1317- "title_case": false,
src/lfx/tests/data/starter_projects_1_6_0/Vector Store RAG.json-1318- "type": "code",
src/lfx/tests/data/starter_projects_1_6_0/Vector Store RAG.json:1319: "value": "from langchain_openai import OpenAIEmbeddings\n\nfrom langflow.base.embeddings.model import LCEmbeddingsModel\nfrom langflow.base.models.openai_constants import OPENAI_EMBEDDING_MODEL_NAMES\nfrom langflow.field_typing import Embeddings\nfrom langflow.io import BoolInput, DictInput, DropdownInput, FloatInput, IntInput, MessageTextInput, SecretStrInput\n\n\nclass OpenAIEmbeddingsComponent(LCEmbeddingsModel):\n display_name = \"OpenAI Embeddings\"\n description = \"Generate embeddings using OpenAI models.\"\n icon = \"OpenAI\"\n name = \"OpenAIEmbeddings\"\n\n inputs = [\n DictInput(\n name=\"default_headers\",\n display_name=\"Default Headers\",\n advanced=True,\n info=\"Default headers to use for the API request.\",\n ),\n DictInput(\n name=\"default_query\",\n display_name=\"Default Query\",\n advanced=True,\n info=\"Default query parameters to use for the API request.\",\n ),\n IntInput(name=\"chunk_size\", display_name=\"Chunk Size\", advanced=True, value=1000),\n MessageTextInput(name=\"client\", display_name=\"Client\", advanced=True),\n MessageTextInput(name=\"deployment\", display_name=\"Deployment\", advanced=True),\n IntInput(name=\"embedding_ctx_length\", display_name=\"Embedding Context Length\", advanced=True, value=1536),\n IntInput(name=\"max_retries\", display_name=\"Max Retries\", value=3, advanced=True),\n DropdownInput(\n name=\"model\",\n display_name=\"Model\",\n advanced=False,\n options=OPENAI_EMBEDDING_MODEL_NAMES,\n value=\"text-embedding-3-small\",\n ),\n DictInput(name=\"model_kwargs\", display_name=\"Model Kwargs\", advanced=True),\n SecretStrInput(name=\"openai_api_key\", display_name=\"OpenAI API Key\", value=\"OPENAI_API_KEY\", required=True),\n MessageTextInput(name=\"openai_api_base\", display_name=\"OpenAI API Base\", advanced=True),\n MessageTextInput(name=\"openai_api_type\", display_name=\"OpenAI API Type\", advanced=True),\n MessageTextInput(name=\"openai_api_version\", display_name=\"OpenAI API Version\", advanced=True),\n MessageTextInput(\n name=\"openai_organization\",\n display_name=\"OpenAI Organization\",\n advanced=True,\n ),\n MessageTextInput(name=\"openai_proxy\", display_name=\"OpenAI Proxy\", advanced=True),\n FloatInput(name=\"request_timeout\", display_name=\"Request Timeout\", advanced=True),\n BoolInput(name=\"show_progress_bar\", display_name=\"Show Progress Bar\", advanced=True),\n BoolInput(name=\"skip_empty\", display_name=\"Skip Empty\", advanced=True),\n MessageTextInput(\n name=\"tiktoken_model_name\",\n display_name=\"TikToken Model Name\",\n advanced=True,\n ),\n BoolInput(\n name=\"tiktoken_enable\",\n display_name=\"TikToken Enable\",\n advanced=True,\n value=True,\n info=\"If False, you must have transformers installed.\",\n ),\n IntInput(\n name=\"dimensions\",\n display_name=\"Dimensions\",\n info=\"The number of dimensions the resulting output embeddings should have. 
\"\n \"Only supported by certain models.\",\n advanced=True,\n ),\n ]\n\n def build_embeddings(self) -> Embeddings:\n return OpenAIEmbeddings(\n client=self.client or None,\n model=self.model,\n dimensions=self.dimensions or None,\n deployment=self.deployment or None,\n api_version=self.openai_api_version or None,\n base_url=self.openai_api_base or None,\n openai_api_type=self.openai_api_type or None,\n openai_proxy=self.openai_proxy or None,\n embedding_ctx_length=self.embedding_ctx_length,\n api_key=self.openai_api_key or None,\n organization=self.openai_organization or None,\n allowed_special=\"all\",\n disallowed_special=\"all\",\n chunk_size=self.chunk_size,\n max_retries=self.max_retries,\n timeout=self.request_timeout or None,\n tiktoken_enabled=self.tiktoken_enable,\n tiktoken_model_name=self.tiktoken_model_name or None,\n show_progress_bar=self.show_progress_bar,\n model_kwargs=self.model_kwargs,\n skip_empty=self.skip_empty,\n default_headers=self.default_headers or None,\n default_query=self.default_query or None,\n )\n"
src/lfx/tests/data/starter_projects_1_6_0/Vector Store RAG.json-1320- },
src/lfx/tests/data/starter_projects_1_6_0/Vector Store RAG.json-1321- "default_headers": {
src/lfx/tests/data/starter_projects_1_6_0/Vector Store RAG.json-1322- "_input_type": "DictInput",
src/lfx/tests/data/starter_projects_1_6_0/Vector Store RAG.json-1323- "advanced": true,
src/lfx/tests/data/starter_projects_1_6_0/Vector Store RAG.json-1324- "display_name": "Default Headers",
--
src/lfx/tests/data/starter_projects_1_6_0/Vector Store RAG.json-1850- "title_case": false,
src/lfx/tests/data/starter_projects_1_6_0/Vector Store RAG.json-1851- "type": "code",
src/lfx/tests/data/starter_projects_1_6_0/Vector Store RAG.json:1852: "value": "from langchain_openai import OpenAIEmbeddings\n\nfrom langflow.base.embeddings.model import LCEmbeddingsModel\nfrom langflow.base.models.openai_constants import OPENAI_EMBEDDING_MODEL_NAMES\nfrom langflow.field_typing import Embeddings\nfrom langflow.io import BoolInput, DictInput, DropdownInput, FloatInput, IntInput, MessageTextInput, SecretStrInput\n\n\nclass OpenAIEmbeddingsComponent(LCEmbeddingsModel):\n display_name = \"OpenAI Embeddings\"\n description = \"Generate embeddings using OpenAI models.\"\n icon = \"OpenAI\"\n name = \"OpenAIEmbeddings\"\n\n inputs = [\n DictInput(\n name=\"default_headers\",\n display_name=\"Default Headers\",\n advanced=True,\n info=\"Default headers to use for the API request.\",\n ),\n DictInput(\n name=\"default_query\",\n display_name=\"Default Query\",\n advanced=True,\n info=\"Default query parameters to use for the API request.\",\n ),\n IntInput(name=\"chunk_size\", display_name=\"Chunk Size\", advanced=True, value=1000),\n MessageTextInput(name=\"client\", display_name=\"Client\", advanced=True),\n MessageTextInput(name=\"deployment\", display_name=\"Deployment\", advanced=True),\n IntInput(name=\"embedding_ctx_length\", display_name=\"Embedding Context Length\", advanced=True, value=1536),\n IntInput(name=\"max_retries\", display_name=\"Max Retries\", value=3, advanced=True),\n DropdownInput(\n name=\"model\",\n display_name=\"Model\",\n advanced=False,\n options=OPENAI_EMBEDDING_MODEL_NAMES,\n value=\"text-embedding-3-small\",\n ),\n DictInput(name=\"model_kwargs\", display_name=\"Model Kwargs\", advanced=True),\n SecretStrInput(name=\"openai_api_key\", display_name=\"OpenAI API Key\", value=\"OPENAI_API_KEY\", required=True),\n MessageTextInput(name=\"openai_api_base\", display_name=\"OpenAI API Base\", advanced=True),\n MessageTextInput(name=\"openai_api_type\", display_name=\"OpenAI API Type\", advanced=True),\n MessageTextInput(name=\"openai_api_version\", display_name=\"OpenAI API Version\", advanced=True),\n MessageTextInput(\n name=\"openai_organization\",\n display_name=\"OpenAI Organization\",\n advanced=True,\n ),\n MessageTextInput(name=\"openai_proxy\", display_name=\"OpenAI Proxy\", advanced=True),\n FloatInput(name=\"request_timeout\", display_name=\"Request Timeout\", advanced=True),\n BoolInput(name=\"show_progress_bar\", display_name=\"Show Progress Bar\", advanced=True),\n BoolInput(name=\"skip_empty\", display_name=\"Skip Empty\", advanced=True),\n MessageTextInput(\n name=\"tiktoken_model_name\",\n display_name=\"TikToken Model Name\",\n advanced=True,\n ),\n BoolInput(\n name=\"tiktoken_enable\",\n display_name=\"TikToken Enable\",\n advanced=True,\n value=True,\n info=\"If False, you must have transformers installed.\",\n ),\n IntInput(\n name=\"dimensions\",\n display_name=\"Dimensions\",\n info=\"The number of dimensions the resulting output embeddings should have. 
\"\n \"Only supported by certain models.\",\n advanced=True,\n ),\n ]\n\n def build_embeddings(self) -> Embeddings:\n return OpenAIEmbeddings(\n client=self.client or None,\n model=self.model,\n dimensions=self.dimensions or None,\n deployment=self.deployment or None,\n api_version=self.openai_api_version or None,\n base_url=self.openai_api_base or None,\n openai_api_type=self.openai_api_type or None,\n openai_proxy=self.openai_proxy or None,\n embedding_ctx_length=self.embedding_ctx_length,\n api_key=self.openai_api_key or None,\n organization=self.openai_organization or None,\n allowed_special=\"all\",\n disallowed_special=\"all\",\n chunk_size=self.chunk_size,\n max_retries=self.max_retries,\n timeout=self.request_timeout or None,\n tiktoken_enabled=self.tiktoken_enable,\n tiktoken_model_name=self.tiktoken_model_name or None,\n show_progress_bar=self.show_progress_bar,\n model_kwargs=self.model_kwargs,\n skip_empty=self.skip_empty,\n default_headers=self.default_headers or None,\n default_query=self.default_query or None,\n )\n"
src/lfx/tests/data/starter_projects_1_6_0/Vector Store RAG.json-1853- },
src/lfx/tests/data/starter_projects_1_6_0/Vector Store RAG.json-1854- "default_headers": {
src/lfx/tests/data/starter_projects_1_6_0/Vector Store RAG.json-1855- "_input_type": "DictInput",
src/lfx/tests/data/starter_projects_1_6_0/Vector Store RAG.json-1856- "advanced": true,
src/lfx/tests/data/starter_projects_1_6_0/Vector Store RAG.json-1857- "display_name": "Default Headers",
--
src/lfx/tests/data/starter_projects_1_6_0/Knowledge Ingestion.json-786- "title_case": false,
src/lfx/tests/data/starter_projects_1_6_0/Knowledge Ingestion.json-787- "type": "code",
src/lfx/tests/data/starter_projects_1_6_0/Knowledge Ingestion.json:788: "value": "from __future__ import annotations\n\nimport asyncio\nimport contextlib\nimport hashlib\nimport json\nimport re\nimport uuid\nfrom dataclasses import asdict, dataclass, field\nfrom datetime import datetime, timezone\nfrom pathlib import Path\nfrom typing import TYPE_CHECKING, Any\n\nimport pandas as pd\nfrom cryptography.fernet import InvalidToken\nfrom langchain_chroma import Chroma\nfrom loguru import logger\n\nfrom langflow.base.knowledge_bases.knowledge_base_utils import get_knowledge_bases\nfrom langflow.base.models.openai_constants import OPENAI_EMBEDDING_MODEL_NAMES\nfrom langflow.components.processing.converter import convert_to_dataframe\nfrom langflow.custom import Component\nfrom langflow.io import (\n BoolInput,\n DropdownInput,\n HandleInput,\n IntInput,\n Output,\n SecretStrInput,\n StrInput,\n TableInput,\n)\nfrom langflow.schema.data import Data\nfrom langflow.schema.dotdict import dotdict # noqa: TC001\nfrom langflow.schema.table import EditMode\nfrom langflow.services.auth.utils import decrypt_api_key, encrypt_api_key\nfrom langflow.services.database.models.user.crud import get_user_by_id\nfrom langflow.services.deps import (\n get_settings_service,\n get_variable_service,\n session_scope,\n)\n\nif TYPE_CHECKING:\n from langflow.schema.dataframe import DataFrame\n\nHUGGINGFACE_MODEL_NAMES = [\n \"sentence-transformers/all-MiniLM-L6-v2\",\n \"sentence-transformers/all-mpnet-base-v2\",\n]\nCOHERE_MODEL_NAMES = [\"embed-english-v3.0\", \"embed-multilingual-v3.0\"]\n\nsettings = get_settings_service().settings\nknowledge_directory = settings.knowledge_bases_dir\nif not knowledge_directory:\n msg = \"Knowledge bases directory is not set in the settings.\"\n raise ValueError(msg)\nKNOWLEDGE_BASES_ROOT_PATH = Path(knowledge_directory).expanduser()\n\n\nclass KnowledgeIngestionComponent(Component):\n \"\"\"Create or append to Langflow Knowledge from a DataFrame.\"\"\"\n\n # ------ UI metadata ---------------------------------------------------\n display_name = \"Knowledge Ingestion\"\n description = \"Create or update knowledge in Langflow.\"\n icon = \"upload\"\n name = \"KnowledgeIngestion\"\n\n def __init__(self, *args, **kwargs) -> None:\n super().__init__(*args, **kwargs)\n self._cached_kb_path: Path | None = None\n\n @dataclass\n class NewKnowledgeBaseInput:\n functionality: str = \"create\"\n fields: dict[str, dict] = field(\n default_factory=lambda: {\n \"data\": {\n \"node\": {\n \"name\": \"create_knowledge_base\",\n \"description\": \"Create new knowledge in Langflow.\",\n \"display_name\": \"Create new knowledge\",\n \"field_order\": [\n \"01_new_kb_name\",\n \"02_embedding_model\",\n \"03_api_key\",\n ],\n \"template\": {\n \"01_new_kb_name\": StrInput(\n name=\"new_kb_name\",\n display_name=\"Knowledge Name\",\n info=\"Name of the new knowledge to create.\",\n required=True,\n ),\n \"02_embedding_model\": DropdownInput(\n name=\"embedding_model\",\n display_name=\"Choose Embedding\",\n info=\"Select the embedding model to use for this knowledge base.\",\n required=True,\n options=OPENAI_EMBEDDING_MODEL_NAMES + HUGGINGFACE_MODEL_NAMES + COHERE_MODEL_NAMES,\n options_metadata=[{\"icon\": \"OpenAI\"} for _ in OPENAI_EMBEDDING_MODEL_NAMES]\n + [{\"icon\": \"HuggingFace\"} for _ in HUGGINGFACE_MODEL_NAMES]\n + [{\"icon\": \"Cohere\"} for _ in COHERE_MODEL_NAMES],\n ),\n \"03_api_key\": SecretStrInput(\n name=\"api_key\",\n display_name=\"API Key\",\n info=\"Provider API key for embedding 
model\",\n required=True,\n load_from_db=False,\n ),\n },\n },\n }\n }\n )\n\n # ------ Inputs --------------------------------------------------------\n inputs = [\n DropdownInput(\n name=\"knowledge_base\",\n display_name=\"Knowledge\",\n info=\"Select the knowledge to load data from.\",\n required=True,\n options=[],\n refresh_button=True,\n real_time_refresh=True,\n dialog_inputs=asdict(NewKnowledgeBaseInput()),\n ),\n HandleInput(\n name=\"input_df\",\n display_name=\"Input\",\n info=(\n \"Table with all original columns (already chunked / processed). \"\n \"Accepts Data or DataFrame. If Data is provided, it is converted to a DataFrame automatically.\"\n ),\n input_types=[\"Data\", \"DataFrame\"],\n required=True,\n ),\n TableInput(\n name=\"column_config\",\n display_name=\"Column Configuration\",\n info=\"Configure column behavior for the knowledge base.\",\n required=True,\n table_schema=[\n {\n \"name\": \"column_name\",\n \"display_name\": \"Column Name\",\n \"type\": \"str\",\n \"description\": \"Name of the column in the source DataFrame\",\n \"edit_mode\": EditMode.INLINE,\n },\n {\n \"name\": \"vectorize\",\n \"display_name\": \"Vectorize\",\n \"type\": \"boolean\",\n \"description\": \"Create embeddings for this column\",\n \"default\": False,\n \"edit_mode\": EditMode.INLINE,\n },\n {\n \"name\": \"identifier\",\n \"display_name\": \"Identifier\",\n \"type\": \"boolean\",\n \"description\": \"Use this column as unique identifier\",\n \"default\": False,\n \"edit_mode\": EditMode.INLINE,\n },\n ],\n value=[\n {\n \"column_name\": \"text\",\n \"vectorize\": True,\n \"identifier\": True,\n },\n ],\n ),\n IntInput(\n name=\"chunk_size\",\n display_name=\"Chunk Size\",\n info=\"Batch size for processing embeddings\",\n advanced=True,\n value=1000,\n ),\n SecretStrInput(\n name=\"api_key\",\n display_name=\"Embedding Provider API Key\",\n info=\"API key for the embedding provider to generate embeddings.\",\n advanced=True,\n required=False,\n ),\n BoolInput(\n name=\"allow_duplicates\",\n display_name=\"Allow Duplicates\",\n info=\"Allow duplicate rows in the knowledge base\",\n advanced=True,\n value=False,\n ),\n ]\n\n # ------ Outputs -------------------------------------------------------\n outputs = [Output(display_name=\"Results\", name=\"dataframe_output\", method=\"build_kb_info\")]\n\n # ------ Internal helpers ---------------------------------------------\n def _get_kb_root(self) -> Path:\n \"\"\"Return the root directory for knowledge bases.\"\"\"\n return KNOWLEDGE_BASES_ROOT_PATH\n\n def _validate_column_config(self, df_source: pd.DataFrame) -> list[dict[str, Any]]:\n \"\"\"Validate column configuration using Structured Output patterns.\"\"\"\n if not self.column_config:\n msg = \"Column configuration cannot be empty\"\n raise ValueError(msg)\n\n # Convert table input to list of dicts (similar to Structured Output)\n config_list = self.column_config if isinstance(self.column_config, list) else []\n\n # Validate column names exist in DataFrame\n df_columns = set(df_source.columns)\n for config in config_list:\n col_name = config.get(\"column_name\")\n if col_name not in df_columns:\n msg = f\"Column '{col_name}' not found in DataFrame. 
Available columns: {sorted(df_columns)}\"\n raise ValueError(msg)\n\n return config_list\n\n def _get_embedding_provider(self, embedding_model: str) -> str:\n \"\"\"Get embedding provider by matching model name to lists.\"\"\"\n if embedding_model in OPENAI_EMBEDDING_MODEL_NAMES:\n return \"OpenAI\"\n if embedding_model in HUGGINGFACE_MODEL_NAMES:\n return \"HuggingFace\"\n if embedding_model in COHERE_MODEL_NAMES:\n return \"Cohere\"\n return \"Custom\"\n\n def _build_embeddings(self, embedding_model: str, api_key: str):\n \"\"\"Build embedding model using provider patterns.\"\"\"\n # Get provider by matching model name to lists\n provider = self._get_embedding_provider(embedding_model)\n\n # Validate provider and model\n if provider == \"OpenAI\":\n from langchain_openai import OpenAIEmbeddings\n\n if not api_key:\n msg = \"OpenAI API key is required when using OpenAI provider\"\n raise ValueError(msg)\n return OpenAIEmbeddings(\n model=embedding_model,\n api_key=api_key,\n chunk_size=self.chunk_size,\n )\n if provider == \"HuggingFace\":\n from langchain_huggingface import HuggingFaceEmbeddings\n\n return HuggingFaceEmbeddings(\n model=embedding_model,\n )\n if provider == \"Cohere\":\n from langchain_cohere import CohereEmbeddings\n\n if not api_key:\n msg = \"Cohere API key is required when using Cohere provider\"\n raise ValueError(msg)\n return CohereEmbeddings(\n model=embedding_model,\n cohere_api_key=api_key,\n )\n if provider == \"Custom\":\n # For custom embedding models, we would need additional configuration\n msg = \"Custom embedding models not yet supported\"\n raise NotImplementedError(msg)\n msg = f\"Unknown provider: {provider}\"\n raise ValueError(msg)\n\n def _build_embedding_metadata(self, embedding_model, api_key) -> dict[str, Any]:\n \"\"\"Build embedding model metadata.\"\"\"\n # Get provider by matching model name to lists\n embedding_provider = self._get_embedding_provider(embedding_model)\n\n api_key_to_save = None\n if api_key and hasattr(api_key, \"get_secret_value\"):\n api_key_to_save = api_key.get_secret_value()\n elif isinstance(api_key, str):\n api_key_to_save = api_key\n\n encrypted_api_key = None\n if api_key_to_save:\n settings_service = get_settings_service()\n try:\n encrypted_api_key = encrypt_api_key(api_key_to_save, settings_service=settings_service)\n except (TypeError, ValueError) as e:\n self.log(f\"Could not encrypt API key: {e}\")\n logger.error(f\"Could not encrypt API key: {e}\")\n\n return {\n \"embedding_provider\": embedding_provider,\n \"embedding_model\": embedding_model,\n \"api_key\": encrypted_api_key,\n \"api_key_used\": bool(api_key),\n \"chunk_size\": self.chunk_size,\n \"created_at\": datetime.now(timezone.utc).isoformat(),\n }\n\n def _save_embedding_metadata(self, kb_path: Path, embedding_model: str, api_key: str) -> None:\n \"\"\"Save embedding model metadata.\"\"\"\n embedding_metadata = self._build_embedding_metadata(embedding_model, api_key)\n metadata_path = kb_path / \"embedding_metadata.json\"\n metadata_path.write_text(json.dumps(embedding_metadata, indent=2))\n\n def _save_kb_files(\n self,\n kb_path: Path,\n config_list: list[dict[str, Any]],\n ) -> None:\n \"\"\"Save KB files using File Component storage patterns.\"\"\"\n try:\n # Create directory (following File Component patterns)\n kb_path.mkdir(parents=True, exist_ok=True)\n\n # Save column configuration\n # Only do this if the file doesn't exist already\n cfg_path = kb_path / \"schema.json\"\n if not cfg_path.exists():\n 
cfg_path.write_text(json.dumps(config_list, indent=2))\n\n except (OSError, TypeError, ValueError) as e:\n self.log(f\"Error saving KB files: {e}\")\n\n def _build_column_metadata(self, config_list: list[dict[str, Any]], df_source: pd.DataFrame) -> dict[str, Any]:\n \"\"\"Build detailed column metadata.\"\"\"\n metadata: dict[str, Any] = {\n \"total_columns\": len(df_source.columns),\n \"mapped_columns\": len(config_list),\n \"unmapped_columns\": len(df_source.columns) - len(config_list),\n \"columns\": [],\n \"summary\": {\"vectorized_columns\": [], \"identifier_columns\": []},\n }\n\n for config in config_list:\n col_name = config.get(\"column_name\")\n vectorize = config.get(\"vectorize\") == \"True\" or config.get(\"vectorize\") is True\n identifier = config.get(\"identifier\") == \"True\" or config.get(\"identifier\") is True\n\n # Add to columns list\n metadata[\"columns\"].append(\n {\n \"name\": col_name,\n \"vectorize\": vectorize,\n \"identifier\": identifier,\n }\n )\n\n # Update summary\n if vectorize:\n metadata[\"summary\"][\"vectorized_columns\"].append(col_name)\n if identifier:\n metadata[\"summary\"][\"identifier_columns\"].append(col_name)\n\n return metadata\n\n async def _create_vector_store(\n self,\n df_source: pd.DataFrame,\n config_list: list[dict[str, Any]],\n embedding_model: str,\n api_key: str,\n ) -> None:\n \"\"\"Create vector store following Local DB component pattern.\"\"\"\n try:\n # Set up vector store directory\n vector_store_dir = await self._kb_path()\n if not vector_store_dir:\n msg = \"Knowledge base path is not set. Please create a new knowledge base first.\"\n raise ValueError(msg)\n vector_store_dir.mkdir(parents=True, exist_ok=True)\n\n # Create embeddings model\n embedding_function = self._build_embeddings(embedding_model, api_key)\n\n # Convert DataFrame to Data objects (following Local DB pattern)\n data_objects = await self._convert_df_to_data_objects(df_source, config_list)\n\n # Create vector store\n chroma = Chroma(\n persist_directory=str(vector_store_dir),\n embedding_function=embedding_function,\n collection_name=self.knowledge_base,\n )\n\n # Convert Data objects to LangChain Documents\n documents = []\n for data_obj in data_objects:\n doc = data_obj.to_lc_document()\n documents.append(doc)\n\n # Add documents to vector store\n if documents:\n chroma.add_documents(documents)\n self.log(f\"Added {len(documents)} documents to vector store '{self.knowledge_base}'\")\n\n except (OSError, ValueError, RuntimeError) as e:\n self.log(f\"Error creating vector store: {e}\")\n\n async def _convert_df_to_data_objects(\n self, df_source: pd.DataFrame, config_list: list[dict[str, Any]]\n ) -> list[Data]:\n \"\"\"Convert DataFrame to Data objects for vector store.\"\"\"\n data_objects: list[Data] = []\n\n # Set up vector store directory\n kb_path = await self._kb_path()\n\n # If we don't allow duplicates, we need to get the existing hashes\n chroma = Chroma(\n persist_directory=str(kb_path),\n collection_name=self.knowledge_base,\n )\n\n # Get all documents and their metadata\n all_docs = chroma.get()\n\n # Extract all _id values from metadata\n id_list = [metadata.get(\"_id\") for metadata in all_docs[\"metadatas\"] if metadata.get(\"_id\")]\n\n # Get column roles\n content_cols = []\n identifier_cols = []\n\n for config in config_list:\n col_name = config.get(\"column_name\")\n vectorize = config.get(\"vectorize\") == \"True\" or config.get(\"vectorize\") is True\n identifier = config.get(\"identifier\") == \"True\" or config.get(\"identifier\") is 
True\n\n if vectorize:\n content_cols.append(col_name)\n elif identifier:\n identifier_cols.append(col_name)\n\n # Convert each row to a Data object\n for _, row in df_source.iterrows():\n # Build content text from identifier columns using list comprehension\n identifier_parts = [str(row[col]) for col in content_cols if col in row and pd.notna(row[col])]\n\n # Join all parts into a single string\n page_content = \" \".join(identifier_parts)\n\n # Build metadata from NON-vectorized columns only (simple key-value pairs)\n data_dict = {\n \"text\": page_content, # Main content for vectorization\n }\n\n # Add identifier columns if they exist\n if identifier_cols:\n identifier_parts = [str(row[col]) for col in identifier_cols if col in row and pd.notna(row[col])]\n page_content = \" \".join(identifier_parts)\n\n # Add metadata columns as simple key-value pairs\n for col in df_source.columns:\n if col not in content_cols and col in row and pd.notna(row[col]):\n # Convert to simple types for Chroma metadata\n value = row[col]\n data_dict[col] = str(value) # Convert complex types to string\n\n # Hash the page_content for unique ID\n page_content_hash = hashlib.sha256(page_content.encode()).hexdigest()\n data_dict[\"_id\"] = page_content_hash\n\n # If duplicates are disallowed, and hash exists, prevent adding this row\n if not self.allow_duplicates and page_content_hash in id_list:\n self.log(f\"Skipping duplicate row with hash {page_content_hash}\")\n continue\n\n # Create Data object - everything except \"text\" becomes metadata\n data_obj = Data(data=data_dict)\n data_objects.append(data_obj)\n\n return data_objects\n\n def is_valid_collection_name(self, name, min_length: int = 3, max_length: int = 63) -> bool:\n \"\"\"Validates collection name against conditions 1-3.\n\n 1. Contains 3-63 characters\n 2. Starts and ends with alphanumeric character\n 3. 
Contains only alphanumeric characters, underscores, or hyphens.\n\n Args:\n name (str): Collection name to validate\n min_length (int): Minimum length of the name\n max_length (int): Maximum length of the name\n\n Returns:\n bool: True if valid, False otherwise\n \"\"\"\n # Check length (condition 1)\n if not (min_length <= len(name) <= max_length):\n return False\n\n # Check start/end with alphanumeric (condition 2)\n if not (name[0].isalnum() and name[-1].isalnum()):\n return False\n\n # Check allowed characters (condition 3)\n return re.match(r\"^[a-zA-Z0-9_-]+$\", name) is not None\n\n async def _kb_path(self) -> Path | None:\n # Check if we already have the path cached\n cached_path = getattr(self, \"_cached_kb_path\", None)\n if cached_path is not None:\n return cached_path\n\n # If not cached, compute it\n async with session_scope() as db:\n if not self.user_id:\n msg = \"User ID is required for fetching knowledge base path.\"\n raise ValueError(msg)\n current_user = await get_user_by_id(db, self.user_id)\n if not current_user:\n msg = f\"User with ID {self.user_id} not found.\"\n raise ValueError(msg)\n kb_user = current_user.username\n\n kb_root = self._get_kb_root()\n\n # Cache the result\n self._cached_kb_path = kb_root / kb_user / self.knowledge_base\n\n return self._cached_kb_path\n\n # ---------------------------------------------------------------------\n # OUTPUT METHODS\n # ---------------------------------------------------------------------\n async def build_kb_info(self) -> Data:\n \"\"\"Main ingestion routine → returns a dict with KB metadata.\"\"\"\n try:\n input_value = self.input_df[0] if isinstance(self.input_df, list) else self.input_df\n df_source: DataFrame = convert_to_dataframe(input_value)\n\n # Validate column configuration (using Structured Output patterns)\n config_list = self._validate_column_config(df_source)\n column_metadata = self._build_column_metadata(config_list, df_source)\n\n # Read the embedding info from the knowledge base folder\n kb_path = await self._kb_path()\n if not kb_path:\n msg = \"Knowledge base path is not set. Please create a new knowledge base first.\"\n raise ValueError(msg)\n metadata_path = kb_path / \"embedding_metadata.json\"\n\n # If the API key is not provided, try to read it from the metadata file\n if metadata_path.exists():\n settings_service = get_settings_service()\n metadata = json.loads(metadata_path.read_text())\n embedding_model = metadata.get(\"embedding_model\")\n try:\n api_key = decrypt_api_key(metadata[\"api_key\"], settings_service)\n except (InvalidToken, TypeError, ValueError) as e:\n logger.error(f\"Could not decrypt API key. Please provide it manually. 
Error: {e}\")\n\n # Check if a custom API key was provided, update metadata if so\n if self.api_key:\n api_key = self.api_key\n self._save_embedding_metadata(\n kb_path=kb_path,\n embedding_model=embedding_model,\n api_key=api_key,\n )\n\n # Create vector store following Local DB component pattern\n await self._create_vector_store(df_source, config_list, embedding_model=embedding_model, api_key=api_key)\n\n # Save KB files (using File Component storage patterns)\n self._save_kb_files(kb_path, config_list)\n\n # Build metadata response\n meta: dict[str, Any] = {\n \"kb_id\": str(uuid.uuid4()),\n \"kb_name\": self.knowledge_base,\n \"rows\": len(df_source),\n \"column_metadata\": column_metadata,\n \"path\": str(kb_path),\n \"config_columns\": len(config_list),\n \"timestamp\": datetime.now(tz=timezone.utc).isoformat(),\n }\n\n # Set status message\n self.status = f\"✅ KB **{self.knowledge_base}** saved · {len(df_source)} chunks.\"\n\n return Data(data=meta)\n\n except (OSError, ValueError, RuntimeError, KeyError) as e:\n msg = f\"Error during KB ingestion: {e}\"\n raise RuntimeError(msg) from e\n\n async def _get_api_key_variable(self, field_value: dict[str, Any]):\n async with session_scope() as db:\n if not self.user_id:\n msg = \"User ID is required for fetching global variables.\"\n raise ValueError(msg)\n current_user = await get_user_by_id(db, self.user_id)\n if not current_user:\n msg = f\"User with ID {self.user_id} not found.\"\n raise ValueError(msg)\n variable_service = get_variable_service()\n\n # Process the api_key field variable\n return await variable_service.get_variable(\n user_id=current_user.id,\n name=field_value[\"03_api_key\"],\n field=\"\",\n session=db,\n )\n\n async def update_build_config(\n self,\n build_config: dotdict,\n field_value: Any,\n field_name: str | None = None,\n ) -> dotdict:\n \"\"\"Update build configuration based on provider selection.\"\"\"\n # Create a new knowledge base\n if field_name == \"knowledge_base\":\n async with session_scope() as db:\n if not self.user_id:\n msg = \"User ID is required for fetching knowledge base list.\"\n raise ValueError(msg)\n current_user = await get_user_by_id(db, self.user_id)\n if not current_user:\n msg = f\"User with ID {self.user_id} not found.\"\n raise ValueError(msg)\n kb_user = current_user.username\n if isinstance(field_value, dict) and \"01_new_kb_name\" in field_value:\n # Validate the knowledge base name - Make sure it follows these rules:\n if not self.is_valid_collection_name(field_value[\"01_new_kb_name\"]):\n msg = f\"Invalid knowledge base name: {field_value['01_new_kb_name']}\"\n raise ValueError(msg)\n\n api_key = field_value.get(\"03_api_key\", None)\n with contextlib.suppress(Exception):\n # If the API key is a variable, resolve it\n api_key = await self._get_api_key_variable(field_value)\n\n # Make sure api_key is a string\n if not isinstance(api_key, str):\n msg = \"API key must be a string.\"\n raise ValueError(msg)\n\n # We need to test the API Key one time against the embedding model\n embed_model = self._build_embeddings(embedding_model=field_value[\"02_embedding_model\"], api_key=api_key)\n\n # Try to generate a dummy embedding to validate the API key without blocking the event loop\n try:\n await asyncio.wait_for(\n asyncio.to_thread(embed_model.embed_query, \"test\"),\n timeout=10,\n )\n except TimeoutError as e:\n msg = \"Embedding validation timed out. 
Please verify network connectivity and key.\"\n raise ValueError(msg) from e\n except Exception as e:\n msg = f\"Embedding validation failed: {e!s}\"\n raise ValueError(msg) from e\n\n # Create the new knowledge base directory\n kb_path = KNOWLEDGE_BASES_ROOT_PATH / kb_user / field_value[\"01_new_kb_name\"]\n kb_path.mkdir(parents=True, exist_ok=True)\n\n # Save the embedding metadata\n build_config[\"knowledge_base\"][\"value\"] = field_value[\"01_new_kb_name\"]\n self._save_embedding_metadata(\n kb_path=kb_path,\n embedding_model=field_value[\"02_embedding_model\"],\n api_key=api_key,\n )\n\n # Update the knowledge base options dynamically\n build_config[\"knowledge_base\"][\"options\"] = await get_knowledge_bases(\n KNOWLEDGE_BASES_ROOT_PATH,\n user_id=self.user_id,\n )\n\n # If the selected knowledge base is not available, reset it\n if build_config[\"knowledge_base\"][\"value\"] not in build_config[\"knowledge_base\"][\"options\"]:\n build_config[\"knowledge_base\"][\"value\"] = None\n\n return build_config\n"
src/lfx/tests/data/starter_projects_1_6_0/Knowledge Ingestion.json-789- },
src/lfx/tests/data/starter_projects_1_6_0/Knowledge Ingestion.json-790- "column_config": {
src/lfx/tests/data/starter_projects_1_6_0/Knowledge Ingestion.json-791- "_input_type": "TableInput",
src/lfx/tests/data/starter_projects_1_6_0/Knowledge Ingestion.json-792- "advanced": false,
src/lfx/tests/data/starter_projects_1_6_0/Knowledge Ingestion.json-793- "display_name": "Column Configuration",
--
src/lfx/src/lfx/components/openai/openai.py-2-
src/lfx/src/lfx/components/openai/openai.py-3-from lfx.base.embeddings.model import LCEmbeddingsModel
src/lfx/src/lfx/components/openai/openai.py:4:from lfx.base.models.openai_constants import OPENAI_EMBEDDING_MODEL_NAMES
src/lfx/src/lfx/components/openai/openai.py-5-from lfx.field_typing import Embeddings
src/lfx/src/lfx/components/openai/openai.py-6-from lfx.io import BoolInput, DictInput, DropdownInput, FloatInput, IntInput, MessageTextInput, SecretStrInput
src/lfx/src/lfx/components/openai/openai.py-7-
src/lfx/src/lfx/components/openai/openai.py-8-
src/lfx/src/lfx/components/openai/openai.py-9-class OpenAIEmbeddingsComponent(LCEmbeddingsModel):
--
src/lfx/src/lfx/components/openai/openai.py-35- display_name="Model",
src/lfx/src/lfx/components/openai/openai.py-36- advanced=False,
src/lfx/src/lfx/components/openai/openai.py:37: options=OPENAI_EMBEDDING_MODEL_NAMES,
src/lfx/src/lfx/components/openai/openai.py-38- value="text-embedding-3-small",
src/lfx/src/lfx/components/openai/openai.py-39- ),
src/lfx/src/lfx/components/openai/openai.py-40- DictInput(name="model_kwargs", display_name="Model Kwargs", advanced=True),
src/lfx/src/lfx/components/openai/openai.py-41- SecretStrInput(name="openai_api_key", display_name="OpenAI API Key", value="OPENAI_API_KEY", required=True),
src/lfx/src/lfx/components/openai/openai.py-42- MessageTextInput(name="openai_api_base", display_name="OpenAI API Base", advanced=True),
--
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-8-from lfx.base.embeddings.model import LCEmbeddingsModel
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-9-from lfx.base.models.model_utils import get_ollama_models, is_valid_ollama_url
src/lfx/src/lfx/components/models_and_agents/embedding_model.py:10:from lfx.base.models.openai_constants import OPENAI_EMBEDDING_MODEL_NAMES
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-11-from lfx.base.models.watsonx_constants import (
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-12- IBM_WATSONX_URLS,
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-13- WATSONX_EMBEDDING_MODEL_NAMES,
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-14-)
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-15-from lfx.field_typing import Embeddings
--
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-81- name="model",
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-82- display_name="Model Name",
src/lfx/src/lfx/components/models_and_agents/embedding_model.py:83: options=OPENAI_EMBEDDING_MODEL_NAMES,
src/lfx/src/lfx/components/models_and_agents/embedding_model.py:84: value=OPENAI_EMBEDDING_MODEL_NAMES[0],
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-85- info="Select the embedding model to use",
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-86- real_time_refresh=True,
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-87- refresh_button=True,
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-88- ),
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-89-
... [stdout truncated: output exceeded 50000 characters (4027478 total). Please modify your command to reduce output size, for example by:
- Using grep/rg to filter results
- Using head/tail to limit lines
- Using --max-count flag with grep/rg
- Processing output with awk/sed to extract only needed information]
</shell_result>
OpenAI embedding models dimensions parameter support text-embedding-3 ada-002
<web_search_result>
Short answer: The dimensions parameter is supported for the text-embedding-3 family (e.g., text-embedding-3-small / -large) but not for text-embedding-ada-002 (ada-002 uses a fixed 1536-d vector). [1][2][3]
Sources:
- OpenAI Embeddings guide: default sizes and ability to shorten text-embedding-3 via dimensions. [1]
- OpenAI announcement: text-embedding-3-large up to 3072 dims and native support for shortening with a dimensions parameter. [2]
- API docs / reference noting the dimensions parameter is only supported on text-embedding-3 and newer (ada-002 is fixed-size 1536). [3]
</web_search_result>
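To make that concrete, here is a minimal check, assuming langchain_openai is installed and OPENAI_API_KEY is set (illustrative only, not part of this PR):

```python
from langchain_openai import OpenAIEmbeddings

# text-embedding-3-* accepts a dimensions override; the API shortens the vector server-side.
small = OpenAIEmbeddings(model="text-embedding-3-small", dimensions=256)
print(len(small.embed_query("hello")))  # 256

# text-embedding-ada-002 is fixed at 1536 dimensions; passing dimensions is rejected
# by the API, so the parameter must simply be omitted for this model.
ada = OpenAIEmbeddings(model="text-embedding-ada-002")
print(len(ada.embed_query("hello")))  # 1536
```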
Script executed:
# Find and read the openai_constants.py file
fd "openai_constants.py" -type fRepository: langflow-ai/langflow
<shell_result>
Repository: langflow-ai/langflow
Exit code: 2
stderr:
error: 'ype' isn't a valid value for '--type <filetype>'
[possible values: file, directory, symlink, executable, empty, socket, pipe]
For more information try '--help'
</shell_result>
Script executed:
# Read the embedding_model.py file around lines 187-200 for full context
cat -n src/lfx/src/lfx/components/models_and_agents/embedding_model.py | sed -n '160,220p'
Repository: langflow-ai/langflow
<shell_result>
Repository: langflow-ai/langflow
Exit code: 0
stdout:
160 base_url_ibm_watsonx = self.base_url_ibm_watsonx
161 ollama_base_url = self.ollama_base_url
162 dimensions = self.dimensions
163 chunk_size = self.chunk_size
164 request_timeout = self.request_timeout
165 max_retries = self.max_retries
166 show_progress_bar = self.show_progress_bar
167 model_kwargs = self.model_kwargs or {}
168
169 if provider == "OpenAI":
170 if not api_key:
171 msg = "OpenAI API key is required when using OpenAI provider"
172 raise ValueError(msg)
173
174 # Create the primary embedding instance
175 embeddings_instance = OpenAIEmbeddings(
176 model=model,
177 dimensions=dimensions or None,
178 base_url=api_base or None,
179 api_key=api_key,
180 chunk_size=chunk_size,
181 max_retries=max_retries,
182 timeout=request_timeout or None,
183 show_progress_bar=show_progress_bar,
184 model_kwargs=model_kwargs,
185 )
186
187 # Create dedicated instances for each available model
188 available_models_dict = {}
189 for model_name in OPENAI_EMBEDDING_MODEL_NAMES:
190 available_models_dict[model_name] = OpenAIEmbeddings(
191 model=model_name,
192 dimensions=dimensions or None, # Use same dimensions config for all
193 base_url=api_base or None,
194 api_key=api_key,
195 chunk_size=chunk_size,
196 max_retries=max_retries,
197 timeout=request_timeout or None,
198 show_progress_bar=show_progress_bar,
199 model_kwargs=model_kwargs,
200 )
201
202 return EmbeddingsWithModels(
203 embeddings=embeddings_instance,
204 available_models=available_models_dict,
205 )
206
207 if provider == "Ollama":
208 try:
209 from langchain_ollama import OllamaEmbeddings
210 except ImportError:
211 try:
212 from langchain_community.embeddings import OllamaEmbeddings
213 except ImportError:
214 msg = "Please install langchain-ollama: pip install langchain-ollama"
215 raise ImportError(msg) from None
216
217 transformed_base_url = transform_localhost_url(ollama_base_url)
218
219 # Check if URL contains /v1 suffix (OpenAI-compatible mode)
220 if transformed_base_url and transformed_base_url.rstrip("/").endswith("/v1"):
</shell_result>
Script executed:
# Find openai_constants.py file
fd openai_constants.py
Repository: langflow-ai/langflow
<shell_result>
Repository: langflow-ai/langflow
Exit code: 0
stdout:
src/backend/base/langflow/base/models/openai_constants.py
src/lfx/src/lfx/base/models/openai_constants.py
</shell_result>
Script executed:
# Read the lfx openai_constants.py file
cat src/lfx/src/lfx/base/models/openai_constants.py
Repository: langflow-ai/langflow
<shell_result>
Repository: langflow-ai/langflow
Exit code: 0
stdout:
from .model_metadata import create_model_metadata
# Unified model metadata - single source of truth
OPENAI_MODELS_DETAILED = [
# GPT-5 Series
create_model_metadata(
provider="OpenAI",
name="gpt-5.1",
icon="OpenAI",
tool_calling=True,
reasoning=True,
),
create_model_metadata(
provider="OpenAI",
name="gpt-5",
icon="OpenAI",
tool_calling=True,
reasoning=True,
),
create_model_metadata(
provider="OpenAI",
name="gpt-5-mini",
icon="OpenAI",
tool_calling=True,
reasoning=True,
),
create_model_metadata(
provider="OpenAI",
name="gpt-5-nano",
icon="OpenAI",
tool_calling=True,
reasoning=True,
),
create_model_metadata(
provider="OpenAI",
name="gpt-5-chat-latest",
icon="OpenAI",
tool_calling=False,
reasoning=True,
),
# Regular OpenAI Models
create_model_metadata(provider="OpenAI", name="gpt-4o-mini", icon="OpenAI", tool_calling=True),
create_model_metadata(provider="OpenAI", name="gpt-4o", icon="OpenAI", tool_calling=True),
create_model_metadata(provider="OpenAI", name="gpt-4.1", icon="OpenAI", tool_calling=True),
create_model_metadata(provider="OpenAI", name="gpt-4.1-mini", icon="OpenAI", tool_calling=True),
create_model_metadata(provider="OpenAI", name="gpt-4.1-nano", icon="OpenAI", tool_calling=True),
create_model_metadata(
provider="OpenAI", name="gpt-4.5-preview", icon="OpenAI", tool_calling=True, preview=True, not_supported=True
),
create_model_metadata(provider="OpenAI", name="gpt-4-turbo", icon="OpenAI", tool_calling=True),
create_model_metadata(
provider="OpenAI", name="gpt-4-turbo-preview", icon="OpenAI", tool_calling=True, preview=True
),
create_model_metadata(provider="OpenAI", name="gpt-4", icon="OpenAI", tool_calling=True),
create_model_metadata(provider="OpenAI", name="gpt-3.5-turbo", icon="OpenAI", tool_calling=True),
# Reasoning Models
create_model_metadata(provider="OpenAI", name="o1", icon="OpenAI", reasoning=True),
create_model_metadata(provider="OpenAI", name="o1-mini", icon="OpenAI", reasoning=True, not_supported=True),
create_model_metadata(provider="OpenAI", name="o1-pro", icon="OpenAI", reasoning=True, not_supported=True),
create_model_metadata(provider="OpenAI", name="o3-mini", icon="OpenAI", reasoning=True),
create_model_metadata(provider="OpenAI", name="o3", icon="OpenAI", reasoning=True),
create_model_metadata(provider="OpenAI", name="o3-pro", icon="OpenAI", reasoning=True),
create_model_metadata(provider="OpenAI", name="o4-mini", icon="OpenAI", reasoning=True),
create_model_metadata(provider="OpenAI", name="o4-mini-high", icon="OpenAI", reasoning=True),
# Search Models
create_model_metadata(
provider="OpenAI",
name="gpt-4o-mini-search-preview",
icon="OpenAI",
tool_calling=True,
search=True,
preview=True,
),
create_model_metadata(
provider="OpenAI",
name="gpt-4o-search-preview",
icon="OpenAI",
tool_calling=True,
search=True,
preview=True,
),
# Not Supported Models
create_model_metadata(
provider="OpenAI", name="computer-use-preview", icon="OpenAI", not_supported=True, preview=True
),
create_model_metadata(
provider="OpenAI", name="gpt-4o-audio-preview", icon="OpenAI", not_supported=True, preview=True
),
create_model_metadata(
provider="OpenAI", name="gpt-4o-realtime-preview", icon="OpenAI", not_supported=True, preview=True
),
create_model_metadata(
provider="OpenAI", name="gpt-4o-mini-audio-preview", icon="OpenAI", not_supported=True, preview=True
),
create_model_metadata(
provider="OpenAI", name="gpt-4o-mini-realtime-preview", icon="OpenAI", not_supported=True, preview=True
),
]
OPENAI_CHAT_MODEL_NAMES = [
metadata["name"]
for metadata in OPENAI_MODELS_DETAILED
if not metadata.get("not_supported", False)
and not metadata.get("reasoning", False)
and not metadata.get("search", False)
]
OPENAI_REASONING_MODEL_NAMES = [
metadata["name"]
for metadata in OPENAI_MODELS_DETAILED
if metadata.get("reasoning", False) and not metadata.get("not_supported", False)
]
OPENAI_SEARCH_MODEL_NAMES = [
metadata["name"]
for metadata in OPENAI_MODELS_DETAILED
if metadata.get("search", False) and not metadata.get("not_supported", False)
]
NOT_SUPPORTED_MODELS = [metadata["name"] for metadata in OPENAI_MODELS_DETAILED if metadata.get("not_supported", False)]
OPENAI_EMBEDDING_MODEL_NAMES = [
"text-embedding-3-small",
"text-embedding-3-large",
"text-embedding-ada-002",
]
# Backwards compatibility
MODEL_NAMES = OPENAI_CHAT_MODEL_NAMES
OPENAI_MODEL_NAMES = OPENAI_CHAT_MODEL_NAMES
</shell_result>
Both performance and dimension compatibility concerns are confirmed; the suggested fix is accurate.
The code creates embedding instances for all three OpenAI models (text-embedding-3-small, text-embedding-3-large, text-embedding-ada-002) upfront on lines 189–200, regardless of which model is selected. Additionally, the dimensions parameter is applied identically to all models on line 192.
Issues confirmed:
- Eager instantiation is wasteful: Creating instances for all models when typically only one is used.
- Dimension compatibility bug: Only text-embedding-3-* models support the dimensions parameter; text-embedding-ada-002 uses a fixed 1536-dimensional embedding and does not accept this parameter. Passing a non-None value would cause an API error.
The suggested fix—filtering dimensions based on model name—is correct and matches OpenAI's API behavior:
# Create dedicated instances for each available model
available_models_dict = {}
for model_name in OPENAI_EMBEDDING_MODEL_NAMES:
+ # Only apply dimensions to models that support it
+ model_dimensions = dimensions if model_name.startswith("text-embedding-3") else None
available_models_dict[model_name] = OpenAIEmbeddings(
model=model_name,
- dimensions=dimensions or None, # Use same dimensions config for all
+ dimensions=model_dimensions,
base_url=api_base or None,
api_key=api_key,
chunk_size=chunk_size,
max_retries=max_retries,
timeout=request_timeout or None,
show_progress_bar=show_progress_bar,
model_kwargs=model_kwargs,
    )
🤖 Prompt for AI Agents
In src/lfx/src/lfx/components/models_and_agents/embedding_model.py around lines
187–200, fix two issues: stop passing a global dimensions value to models that
don't accept it (text-embedding-ada-002) and avoid eagerly creating all model
instances. Update the loop to set model_dimensions = dimensions only for models
whose name starts with "text-embedding-3" (otherwise use None) and either
instantiate models lazily (store a factory/callable and create only when
requested) or only create the single selected model instead of pre-instantiating
all of them; ensure other parameters remain unchanged.
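For reference, the loop with that filter applied would look roughly like the excerpt below; the variable names are the ones already in scope in build_embeddings, and instantiation stays eager, so the lazy-factory variant mentioned above remains a separate follow-up:

```python
# Create dedicated instances for each available model, but only pass a
# dimensions override to the text-embedding-3 family; ada-002 is fixed at 1536.
available_models_dict = {}
for model_name in OPENAI_EMBEDDING_MODEL_NAMES:
    model_dimensions = dimensions if model_name.startswith("text-embedding-3") else None
    available_models_dict[model_name] = OpenAIEmbeddings(
        model=model_name,
        dimensions=model_dimensions,
        base_url=api_base or None,
        api_key=api_key,
        chunk_size=chunk_size,
        max_retries=max_retries,
        timeout=request_timeout or None,
        show_progress_bar=show_progress_bar,
        model_kwargs=model_kwargs,
    )
```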
# Fetch available IBM watsonx.ai models
available_model_names = self.fetch_ibm_models(final_url)
Blocking I/O in async context.
fetch_ibm_models uses synchronous requests.get() (lines 138-153), which blocks the event loop when called from this async method. This can cause performance degradation and unresponsiveness in async applications.
Consider either:
- Converting fetch_ibm_models to an async method using httpx.AsyncClient
- Running it in a thread executor:
+ import asyncio
+
# Fetch available IBM watsonx.ai models
- available_model_names = self.fetch_ibm_models(final_url)
+ available_model_names = await asyncio.get_event_loop().run_in_executor(
+ None, self.fetch_ibm_models, final_url
+ )
Alternatively, convert fetch_ibm_models to async similar to get_ollama_models.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
# Fetch available IBM watsonx.ai models
available_model_names = self.fetch_ibm_models(final_url)
import asyncio

# Fetch available IBM watsonx.ai models
available_model_names = await asyncio.get_event_loop().run_in_executor(
    None, self.fetch_ibm_models, final_url
)
🤖 Prompt for AI Agents
In src/lfx/src/lfx/components/models_and_agents/embedding_model.py around lines
303-304, the call to fetch_ibm_models(final_url) is invoked from an async method
but fetch_ibm_models (lines ~138-153) uses blocking requests.get(), which will
block the event loop; fix by either converting fetch_ibm_models to an async
function using an async HTTP client (e.g., httpx.AsyncClient) and await it where
called, or keep it sync but call it via an executor (e.g., await
asyncio.to_thread(self.fetch_ibm_models, final_url)); update the function
signature and imports accordingly and ensure error handling and timeouts are
preserved in the chosen async approach.
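A minimal sketch of the executor route, assuming fetch_ibm_models keeps its current synchronous signature; asyncio.to_thread is just a simpler spelling of the run_in_executor call shown above:

```python
import asyncio

# Inside the async build_embeddings() path for the IBM watsonx.ai provider:
# run the blocking requests.get()-based helper in a worker thread so the event
# loop stays responsive while the model list is fetched.
available_model_names = await asyncio.to_thread(self.fetch_ibm_models, final_url)
```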
# Create dedicated instances for each available model
available_models_dict = {}
for model_name in available_model_names:
    available_models_dict[model_name] = WatsonxEmbeddings(
        model_id=model_name,
        params=params,
        watsonx_client=api_client,
        project_id=project_id,
    )
🧩 Analysis chain
IBM watsonx.ai Python SDK APIClient thread safety documentation
<web_search_result>
Short answer: IBM's watsonx.ai Python SDK does not publish an explicit, general "APIClient is thread‑safe" guarantee. Practical guidance from source/docs implies you should treat APIClient/persistent HTTP connections cautiously in multithreaded code — best practice is to create one APIClient (or one HTTP session) per thread or otherwise avoid sharing a single APIClient across threads. [1][2][3]
Sources:
- SDK source (party_wrapper) — shows internal use of requests/connection wrappers (no public thread‑safety claim). [1]
- LlamaIndex integration note — notes the SDK enabled persistent_connection by default (since 1.1.2) and that persistent connections can cause problems in environments without long‑running connections (i.e., indicates connection behavior you must consider in concurrent scenarios). [2]
- LangChain / integration docs — show patterns where callers create and pass an APIClient instance into higher‑level wrappers (implying explicit management of the client object). [3]
If you want, I can fetch specific SDK README or source lines (or check a specific SDK version) and show exact code/comments about connection/session handling.
</web_search_result>
Script executed:
#!/bin/bash
# First, let's look at the file in question
echo "=== File context ==="
wc -l src/lfx/src/lfx/components/models_and_agents/embedding_model.py
# Get the full context around lines 306-314
echo -e "\n=== Code context around lines 306-314 ==="
sed -n '290,330p' src/lfx/src/lfx/components/models_and_agents/embedding_model.py
# Look for imports related to WatsonxEmbeddings and APIClient
echo -e "\n=== Imports in the file ==="
head -30 src/lfx/src/lfx/components/models_and_agents/embedding_model.py
Repository: langflow-ai/langflow
<shell_result>
Repository: langflow-ai/langflow
Exit code: 0
stdout:
=== File context ===
423 src/lfx/src/lfx/components/models_and_agents/embedding_model.py
=== Code context around lines 306-314 ===
params = {
EmbedTextParamsMetaNames.TRUNCATE_INPUT_TOKENS: self.truncate_input_tokens,
EmbedTextParamsMetaNames.RETURN_OPTIONS: {"input_text": self.input_text},
}
# Create the primary embedding instance
embeddings_instance = WatsonxEmbeddings(
model_id=model,
params=params,
watsonx_client=api_client,
project_id=project_id,
)
# Fetch available IBM watsonx.ai models
available_model_names = self.fetch_ibm_models(final_url)
# Create dedicated instances for each available model
available_models_dict = {}
for model_name in available_model_names:
available_models_dict[model_name] = WatsonxEmbeddings(
model_id=model_name,
params=params,
watsonx_client=api_client,
project_id=project_id,
)
return EmbeddingsWithModels(
embeddings=embeddings_instance,
available_models=available_models_dict,
)
msg = f"Unknown provider: {provider}"
raise ValueError(msg)
async def update_build_config(
self, build_config: dotdict, field_value: Any, field_name: str | None = None
) -> dotdict:
if field_name == "provider":
if field_value == "OpenAI":
build_config["model"]["options"] = OPENAI_EMBEDDING_MODEL_NAMES
build_config["model"]["value"] = OPENAI_EMBEDDING_MODEL_NAMES[0]
=== Imports in the file ===
from typing import Any
import requests
from ibm_watsonx_ai.metanames import EmbedTextParamsMetaNames
from langchain_openai import OpenAIEmbeddings
from lfx.base.embeddings.embeddings_class import EmbeddingsWithModels
from lfx.base.embeddings.model import LCEmbeddingsModel
from lfx.base.models.model_utils import get_ollama_models, is_valid_ollama_url
from lfx.base.models.openai_constants import OPENAI_EMBEDDING_MODEL_NAMES
from lfx.base.models.watsonx_constants import (
IBM_WATSONX_URLS,
WATSONX_EMBEDDING_MODEL_NAMES,
)
from lfx.field_typing import Embeddings
from lfx.io import (
BoolInput,
DictInput,
DropdownInput,
FloatInput,
IntInput,
MessageTextInput,
SecretStrInput,
)
from lfx.log.logger import logger
from lfx.schema.dotdict import dotdict
from lfx.utils.util import transform_localhost_url
# Ollama API constants
HTTP_STATUS_OK = 200
</shell_result>
Script executed:
#!/bin/bash
# Find all references to EmbeddingsWithModels and available_models
echo "=== Searching for EmbeddingsWithModels usage ==="
rg -n "EmbeddingsWithModels" --type py
echo -e "\n=== Searching for available_models usage ==="
rg -n "available_models" --type py -A 2 -B 2
echo -e "\n=== Check for async/threading patterns in embedding model ==="
rg -n "async def|Thread|asyncio|concurrent" src/lfx/src/lfx/components/models_and_agents/embedding_model.pyRepository: langflow-ai/langflow
<shell_result>
Repository: langflow-ai/langflow
Exit code: 0
stdout:
=== Searching for EmbeddingsWithModels usage ===
src/lfx/src/lfx/components/models_and_agents/embedding_model.py:7:from lfx.base.embeddings.embeddings_class import EmbeddingsWithModels
src/lfx/src/lfx/components/models_and_agents/embedding_model.py:202: return EmbeddingsWithModels(
src/lfx/src/lfx/components/models_and_agents/embedding_model.py:257: return EmbeddingsWithModels(
src/lfx/src/lfx/components/models_and_agents/embedding_model.py:316: return EmbeddingsWithModels(
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:695: # Also check available_models list from EmbeddingsWithModels
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:1193: # Also leverage available_models list from EmbeddingsWithModels
src/lfx/src/lfx/base/embeddings/embeddings_class.py:6:class EmbeddingsWithModels(Embeddings):
src/lfx/src/lfx/base/embeddings/embeddings_class.py:24: """Initialize the EmbeddingsWithModels wrapper.
src/lfx/src/lfx/base/embeddings/embeddings_class.py:114: f"EmbeddingsWithModels(embeddings={self.embeddings!r}, "
=== Searching for available_models usage ===
src/lfx/src/lfx/base/models/groq_model_discovery.py-71-
src/lfx/src/lfx/base/models/groq_model_discovery.py-72- # Step 1: Get list of available models
src/lfx/src/lfx/base/models/groq_model_discovery.py:73: available_models = self._fetch_available_models()
src/lfx/src/lfx/base/models/groq_model_discovery.py:74: logger.info(f"Found {len(available_models)} models from Groq API")
src/lfx/src/lfx/base/models/groq_model_discovery.py-75-
src/lfx/src/lfx/base/models/groq_model_discovery.py-76- # Step 2: Categorize models
--
src/lfx/src/lfx/base/models/groq_model_discovery.py-78- non_llm_models = []
src/lfx/src/lfx/base/models/groq_model_discovery.py-79-
src/lfx/src/lfx/base/models/groq_model_discovery.py:80: for model_id in available_models:
src/lfx/src/lfx/base/models/groq_model_discovery.py-81- if any(pattern in model_id.lower() for pattern in self.SKIP_PATTERNS):
src/lfx/src/lfx/base/models/groq_model_discovery.py-82- non_llm_models.append(model_id)
--
src/lfx/src/lfx/base/models/groq_model_discovery.py-115- return models_metadata
src/lfx/src/lfx/base/models/groq_model_discovery.py-116-
src/lfx/src/lfx/base/models/groq_model_discovery.py:117: def _fetch_available_models(self) -> list[str]:
src/lfx/src/lfx/base/models/groq_model_discovery.py-118- """Fetch list of available models from Groq API."""
src/lfx/src/lfx/base/models/groq_model_discovery.py-119- url = f"{self.base_url}/openai/v1/models"
--
src/lfx/src/lfx/base/embeddings/embeddings_class.py-13- Attributes:
src/lfx/src/lfx/base/embeddings/embeddings_class.py-14- embeddings: The primary LangChain Embeddings instance (used as fallback).
src/lfx/src/lfx/base/embeddings/embeddings_class.py:15: available_models: Dict mapping model names to their dedicated Embeddings instances.
src/lfx/src/lfx/base/embeddings/embeddings_class.py-16- Each model has its own pre-configured instance with specific parameters.
src/lfx/src/lfx/base/embeddings/embeddings_class.py-17- """
--
src/lfx/src/lfx/base/embeddings/embeddings_class.py-20- self,
src/lfx/src/lfx/base/embeddings/embeddings_class.py-21- embeddings: Embeddings,
src/lfx/src/lfx/base/embeddings/embeddings_class.py:22: available_models: dict[str, Embeddings] | None = None,
src/lfx/src/lfx/base/embeddings/embeddings_class.py-23- ):
src/lfx/src/lfx/base/embeddings/embeddings_class.py-24- """Initialize the EmbeddingsWithModels wrapper.
--
src/lfx/src/lfx/base/embeddings/embeddings_class.py-26- Args:
src/lfx/src/lfx/base/embeddings/embeddings_class.py-27- embeddings: The primary LangChain Embeddings instance (used as default/fallback).
src/lfx/src/lfx/base/embeddings/embeddings_class.py:28: available_models: Dict mapping model names to dedicated Embeddings instances.
src/lfx/src/lfx/base/embeddings/embeddings_class.py-29- Each value should be a fully configured Embeddings object ready to use.
src/lfx/src/lfx/base/embeddings/embeddings_class.py-30- Defaults to empty dict if not provided.
--
src/lfx/src/lfx/base/embeddings/embeddings_class.py-32- super().__init__()
src/lfx/src/lfx/base/embeddings/embeddings_class.py-33- self.embeddings = embeddings
src/lfx/src/lfx/base/embeddings/embeddings_class.py:34: self.available_models = available_models if available_models is not None else {}
src/lfx/src/lfx/base/embeddings/embeddings_class.py-35-
src/lfx/src/lfx/base/embeddings/embeddings_class.py-36- def embed_documents(self, texts: list[str]) -> list[list[float]]:
--
src/lfx/src/lfx/base/embeddings/embeddings_class.py-113- return (
src/lfx/src/lfx/base/embeddings/embeddings_class.py-114- f"EmbeddingsWithModels(embeddings={self.embeddings!r}, "
src/lfx/src/lfx/base/embeddings/embeddings_class.py:115: f"available_models={self.available_models!r})"
src/lfx/src/lfx/base/embeddings/embeddings_class.py-116- )
src/lfx/src/lfx/base/embeddings/embeddings_class.py-117-
--
src/lfx/src/lfx/components/nvidia/nvidia_rerank.py-47- try:
src/lfx/src/lfx/components/nvidia/nvidia_rerank.py-48- build_model = self.build_compressor()
src/lfx/src/lfx/components/nvidia/nvidia_rerank.py:49: ids = [model.id for model in build_model.available_models]
src/lfx/src/lfx/components/nvidia/nvidia_rerank.py-50- build_config["model"]["options"] = ids
src/lfx/src/lfx/components/nvidia/nvidia_rerank.py-51- build_config["model"]["value"] = ids[0]
--
src/lfx/src/lfx/components/nvidia/nvidia_embedding.py-51- try:
src/lfx/src/lfx/components/nvidia/nvidia_embedding.py-52- build_model = self.build_embeddings()
src/lfx/src/lfx/components/nvidia/nvidia_embedding.py:53: ids = [model.id for model in build_model.available_models]
src/lfx/src/lfx/components/nvidia/nvidia_embedding.py-54- build_config["model"]["options"] = ids
src/lfx/src/lfx/components/nvidia/nvidia_embedding.py-55- build_config["model"]["value"] = ids[0]
--
src/lfx/src/lfx/components/nvidia/nvidia.py-21- from langchain_nvidia_ai_endpoints import ChatNVIDIA
src/lfx/src/lfx/components/nvidia/nvidia.py-22-
src/lfx/src/lfx/components/nvidia/nvidia.py:23: all_models = ChatNVIDIA().get_available_models()
src/lfx/src/lfx/components/nvidia/nvidia.py-24- except ImportError as e:
src/lfx/src/lfx/components/nvidia/nvidia.py-25- msg = "Please install langchain-nvidia-ai-endpoints to use the NVIDIA model."
--
src/lfx/src/lfx/components/nvidia/nvidia.py-102- model = ChatNVIDIA(base_url=self.base_url, api_key=self.api_key)
src/lfx/src/lfx/components/nvidia/nvidia.py-103- if tool_model_enabled:
src/lfx/src/lfx/components/nvidia/nvidia.py:104: tool_models = [m for m in model.get_available_models() if m.supports_tools]
src/lfx/src/lfx/components/nvidia/nvidia.py-105- return sorted(m.id for m in tool_models)
src/lfx/src/lfx/components/nvidia/nvidia.py:106: return sorted(m.id for m in model.available_models)
src/lfx/src/lfx/components/nvidia/nvidia.py-107-
src/lfx/src/lfx/components/nvidia/nvidia.py-108- def update_build_config(self, build_config: dotdict, _field_value: Any, field_name: str | None = None):
--
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-186-
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-187- # Create dedicated instances for each available model
src/lfx/src/lfx/components/models_and_agents/embedding_model.py:188: available_models_dict = {}
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-189- for model_name in OPENAI_EMBEDDING_MODEL_NAMES:
src/lfx/src/lfx/components/models_and_agents/embedding_model.py:190: available_models_dict[model_name] = OpenAIEmbeddings(
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-191- model=model_name,
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-192- dimensions=dimensions or None, # Use same dimensions config for all
--
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-202- return EmbeddingsWithModels(
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-203- embeddings=embeddings_instance,
src/lfx/src/lfx/components/models_and_agents/embedding_model.py:204: available_models=available_models_dict,
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-205- )
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-206-
--
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-247-
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-248- # Create dedicated instances for each available model
src/lfx/src/lfx/components/models_and_agents/embedding_model.py:249: available_models_dict = {}
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-250- for model_name in available_model_names:
src/lfx/src/lfx/components/models_and_agents/embedding_model.py:251: available_models_dict[model_name] = OllamaEmbeddings(
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-252- model=model_name,
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-253- base_url=final_base_url,
--
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-257- return EmbeddingsWithModels(
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-258- embeddings=embeddings_instance,
src/lfx/src/lfx/components/models_and_agents/embedding_model.py:259: available_models=available_models_dict,
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-260- )
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-261-
--
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-305-
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-306- # Create dedicated instances for each available model
src/lfx/src/lfx/components/models_and_agents/embedding_model.py:307: available_models_dict = {}
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-308- for model_name in available_model_names:
src/lfx/src/lfx/components/models_and_agents/embedding_model.py:309: available_models_dict[model_name] = WatsonxEmbeddings(
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-310- model_id=model_name,
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-311- params=params,
--
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-316- return EmbeddingsWithModels(
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-317- embeddings=embeddings_instance,
src/lfx/src/lfx/components/models_and_agents/embedding_model.py:318: available_models=available_models_dict,
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-319- )
src/lfx/src/lfx/components/models_and_agents/embedding_model.py-320-
--
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-693- for emb_obj in embeddings_list:
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-694- # Check all possible model identifiers (deployment, model, model_id, model_name)
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:695: # Also check available_models list from EmbeddingsWithModels
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-696- possible_names = []
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-697- deployment = getattr(emb_obj, "deployment", None)
--
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-699- model_id = getattr(emb_obj, "model_id", None)
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-700- model_name = getattr(emb_obj, "model_name", None)
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:701: available_models_attr = getattr(emb_obj, "available_models", None)
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-702-
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-703- if deployment:
--
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-714- possible_names.append(f"{deployment}:{model}")
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-715-
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:716: # Add all models from available_models dict
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:717: if available_models_attr and isinstance(available_models_attr, dict):
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-718- possible_names.extend(
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-719- str(model_key).strip()
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:720: for model_key in available_models_attr
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-721- if model_key and str(model_key).strip()
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-722- )
--
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-724- # Match if target matches any of the possible names
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-725- if target_model_name in possible_names:
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:726: # Check if target is in available_models dict - use dedicated instance
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-727- if (
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:728: available_models_attr
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:729: and isinstance(available_models_attr, dict)
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:730: and target_model_name in available_models_attr
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-731- ):
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-732- # Use the dedicated embedding instance from the dict
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:733: selected_embedding = available_models_attr[target_model_name]
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-734- embedding_model = target_model_name
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:735: self.log(f"Found dedicated embedding instance for '{embedding_model}' in available_models dict")
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-736- else:
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-737- # Traditional identifier match
--
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-751- model_id = getattr(emb, "model_id", None)
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-752- model_name = getattr(emb, "model_name", None)
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:753: available_models_attr = getattr(emb, "available_models", None)
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-754-
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-755- if deployment:
--
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-766- identifiers.append(f"combined='{deployment}:{model}'")
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-767-
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:768: # Add available_models dict if present
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:769: if available_models_attr and isinstance(available_models_attr, dict):
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:770: identifiers.append(f"available_models={list(available_models_attr.keys())}")
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-771-
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-772- available_info.append(
--
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-803- if hasattr(selected_embedding, "dimensions"):
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-804- logger.info(f"Embedding dimensions: {selected_embedding.dimensions}")
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:805: if hasattr(selected_embedding, "available_models"):
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:806: logger.info(f"Embedding available_models: {selected_embedding.available_models}")
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-807-
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:808: # No model switching needed - each model in available_models has its own dedicated instance
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-809- # The selected_embedding is already configured correctly for the target model
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-810- logger.info(f"Using embedding instance for '{embedding_model}' - pre-configured and ready to use")
--
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1031- return context_clauses
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1032-
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:1033: def _detect_available_models(self, client: OpenSearch, filter_clauses: list[dict] | None = None) -> list[str]:
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1034- """Detect which embedding models have documents in the index.
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1035-
--
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1177-
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1178- # Detect available embedding models in the index (scoped by filters)
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:1179: available_models = self._detect_available_models(client, filter_clauses)
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1180-
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:1181: if not available_models:
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1182- logger.warning("No embedding models found in index, using current model")
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:1183: available_models = [self._get_embedding_model_name()]
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1184-
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1185- # Generate embeddings for ALL detected models
--
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1191- # Create a comprehensive map of model names to embedding objects
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1192- # Check all possible identifiers (deployment, model, model_id, model_name)
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:1193: # Also leverage available_models list from EmbeddingsWithModels
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1194- # Handle duplicate identifiers by creating combined keys
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1195- embedding_by_model = {}
--
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1204- model_name = getattr(emb_obj, "model_name", None)
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1205- dimensions = getattr(emb_obj, "dimensions", None)
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:1206: available_models = getattr(emb_obj, "available_models", None)
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1207-
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1208- logger.info(
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1209- f"Embedding object {idx}: deployment={deployment}, model={model}, "
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1210- f"model_id={model_id}, model_name={model_name}, dimensions={dimensions}, "
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:1211: f"available_models={available_models}"
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1212- )
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1213-
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:1214: # If this embedding has available_models dict, map all models to their dedicated instances
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:1215: if available_models and isinstance(available_models, dict):
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:1216: logger.info(f"Embedding object {idx} provides {len(available_models)} models via available_models dict")
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:1217: for model_name_key, dedicated_embedding in available_models.items():
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1218- if model_name_key and str(model_name_key).strip():
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1219- model_str = str(model_name_key).strip()
--
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1268- logger.warning(f" Conflict on '{conflict_id}': {len(emb_list)} embeddings use this identifier")
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1269-
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:1270: logger.info(f"Generating embeddings for {len(available_models)} models in index")
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1271- logger.info(f"Available embedding identifiers: {list(embedding_by_model.keys())}")
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1272-
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:1273: for model_name in available_models:
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1274- try:
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1275- # Check if we have an embedding object for this model
--
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1281- emb_model_id = getattr(emb_obj, "model_id", None)
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1282- emb_dimensions = getattr(emb_obj, "dimensions", None)
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:1283: emb_available_models = getattr(emb_obj, "available_models", None)
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1284-
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1285- logger.info(
--
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1289- )
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1290-
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:1291: # Check if this is a dedicated instance from available_models dict
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:1292: if emb_available_models and isinstance(emb_available_models, dict):
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1293- logger.info(
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py:1294: f"Model '{model_name}' using dedicated instance from available_models dict "
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1295- f"(pre-configured with correct model and dimensions)"
src/lfx/src/lfx/components/elastic/opensearch_multimodal.py-1296- )
--
src/backend/tests/unit/groq/test_groq_model_discovery.py-36- @patch("lfx.base.models.groq_model_discovery.requests.get")
src/backend/tests/unit/groq/test_groq_model_discovery.py-37- @patch("groq.Groq")
src/backend/tests/unit/groq/test_groq_model_discovery.py:38: def test_fetch_available_models_success(
src/backend/tests/unit/groq/test_groq_model_discovery.py-39- self, mock_groq, mock_get, mock_api_key, mock_groq_models_response, mock_groq_client_tool_calling_success
src/backend/tests/unit/groq/test_groq_model_discovery.py-40- ):
--
src/backend/tests/unit/groq/test_groq_model_discovery.py-51-
src/backend/tests/unit/groq/test_groq_model_discovery.py-52- discovery = GroqModelDiscovery(api_key=mock_api_key)
src/backend/tests/unit/groq/test_groq_model_discovery.py:53: models = discovery._fetch_available_models()
src/backend/tests/unit/groq/test_groq_model_discovery.py-54-
src/backend/tests/unit/groq/test_groq_model_discovery.py-55- assert isinstance(models, list)
=== Check for async/threading patterns in embedding model ===
155: async def build_embeddings(self) -> Embeddings:
324: async def update_build_config(
</shell_result>
Script executed:
#!/bin/bash
# Search for APIClient creation and usage
echo "=== Searching for APIClient ==="
rg -n "APIClient|api_client" src/lfx/src/lfx/components/models_and_agents/embedding_model.py -B 2 -A 2
# Get the full method that contains lines 306-314
echo -e "\n=== Full method context ==="
sed -n '250,330p' src/lfx/src/lfx/components/models_and_agents/embedding_model.py
Repository: langflow-ai/langflow
<shell_result>
Repository: langflow-ai/langflow
Exit code: 0
stdout:
=== Searching for APIClient ===
277- raise ValueError(msg)
278-
279: from ibm_watsonx_ai import APIClient, Credentials
280-
281- final_url = base_url_ibm_watsonx or "https://us-south.ml.cloud.ibm.com"
--
286- )
287-
288: api_client = APIClient(credentials)
289-
290- params = {
--
297- model_id=model,
298- params=params,
299: watsonx_client=api_client,
300- project_id=project_id,
301- )
--
310- model_id=model_name,
311- params=params,
312: watsonx_client=api_client,
313- project_id=project_id,
314- )
=== Full method context ===
            for model_name in available_model_names:
                available_models_dict[model_name] = OllamaEmbeddings(
                    model=model_name,
                    base_url=final_base_url,
                    **model_kwargs,
                )
            return EmbeddingsWithModels(
                embeddings=embeddings_instance,
                available_models=available_models_dict,
            )
        if provider == "IBM watsonx.ai":
            try:
                from langchain_ibm import WatsonxEmbeddings
            except ImportError:
                msg = "Please install langchain-ibm: pip install langchain-ibm"
                raise ImportError(msg) from None
            if not api_key:
                msg = "IBM watsonx.ai API key is required when using IBM watsonx.ai provider"
                raise ValueError(msg)
            project_id = self.project_id
            if not project_id:
                msg = "Project ID is required for IBM watsonx.ai provider"
                raise ValueError(msg)
            from ibm_watsonx_ai import APIClient, Credentials
            final_url = base_url_ibm_watsonx or "https://us-south.ml.cloud.ibm.com"
            credentials = Credentials(
                api_key=self.api_key,
                url=final_url,
            )
            api_client = APIClient(credentials)
            params = {
                EmbedTextParamsMetaNames.TRUNCATE_INPUT_TOKENS: self.truncate_input_tokens,
                EmbedTextParamsMetaNames.RETURN_OPTIONS: {"input_text": self.input_text},
            }
            # Create the primary embedding instance
            embeddings_instance = WatsonxEmbeddings(
                model_id=model,
                params=params,
                watsonx_client=api_client,
                project_id=project_id,
            )
            # Fetch available IBM watsonx.ai models
            available_model_names = self.fetch_ibm_models(final_url)
            # Create dedicated instances for each available model
            available_models_dict = {}
            for model_name in available_model_names:
                available_models_dict[model_name] = WatsonxEmbeddings(
                    model_id=model_name,
                    params=params,
                    watsonx_client=api_client,
                    project_id=project_id,
                )
            return EmbeddingsWithModels(
                embeddings=embeddings_instance,
                available_models=available_models_dict,
            )
        msg = f"Unknown provider: {provider}"
        raise ValueError(msg)

    async def update_build_config(
        self, build_config: dotdict, field_value: Any, field_name: str | None = None
    ) -> dotdict:
        if field_name == "provider":
            if field_value == "OpenAI":
                build_config["model"]["options"] = OPENAI_EMBEDDING_MODEL_NAMES
                build_config["model"]["value"] = OPENAI_EMBEDDING_MODEL_NAMES[0]
</shell_result>
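For orientation, the wrapper returned in both branches above can be sketched as a thin delegating class: a primary embeddings instance plus a dict of dedicated per-model instances. This is a minimal sketch inferred from the shell output, not the PR's actual EmbeddingsWithModels definition, and the class name below is hypothetical.

```python
# Minimal sketch of the wrapper shape returned by build_embeddings();
# the PR's real EmbeddingsWithModels class may carry more fields and methods.
from langchain_core.embeddings import Embeddings


class EmbeddingsWithModelsSketch(Embeddings):
    """Wraps a primary embeddings instance plus dedicated per-model instances."""

    def __init__(self, embeddings: Embeddings, available_models: dict[str, Embeddings]):
        self.embeddings = embeddings              # primary instance (the selected model)
        self.available_models = available_models  # model name -> dedicated instance

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        # Delegate to the primary instance, matching the sequence diagram in the summary.
        return self.embeddings.embed_documents(texts)

    def embed_query(self, text: str) -> list[float]:
        return self.embeddings.embed_query(text)
```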
ibm_watsonx_ai WatsonxEmbeddings watsonx_client shared instance concurrent usage
<web_search_result>
Short answer — yes, you can share a single WatsonxEmbeddings / watsonx_client instance across callers, but you should control concurrent requests (use the integration's built‑in concurrency option or your own queue/pool).
Key points and practical guidance
- LangChain's WatsonxEmbeddings supports a concurrency control parameter (often called concurrency_limit / maxConcurrency) so multiple callers can safely share one embeddings instance while limiting parallel requests to the provider. [1][3]
- The LangChain IBM embedding wrapper constructs/uses an underlying watsonx client (api_client) — you can pass a preconfigured client to the wrapper to reuse connections. If you share that client, ensure your app enforces a concurrency limit (or rely on the wrapper's concurrency option) to avoid rate limits or resource contention. [2][1]
- If you need high throughput: use a shared client + set concurrency_limit / maxConcurrency to a value that matches your plan/rate limits, or implement a worker pool/queue to throttle requests and retry on 429/errors.
Sources
- LangChain WatsonxEmbeddings reference (instantiation, parameters). [1]
- LangChain IBM embeddings module source (shows api_client / internal client use). [2]
- LangChain docs on rate‑limits / maxConcurrency for embeddings. [3]
If you want, tell me which SDK/library and language (Python, JS, LangChain version) you use and your target throughput, and I’ll suggest exact code/config values.
</web_search_result>
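As a rough illustration of the "shared client + throttle" option described in the search result, one way to cap concurrent embedding calls is a semaphore wrapper. This helper is hypothetical and not part of the PR; it only assumes the wrapped object exposes the standard embed_query / embed_documents methods.

```python
# Hypothetical throttle around a shared embeddings instance; not part of the PR.
import threading

from langchain_core.embeddings import Embeddings


class ThrottledEmbeddings:
    """Limits how many embedding calls run concurrently against one shared client."""

    def __init__(self, inner: Embeddings, max_concurrency: int = 5):
        self._inner = inner
        self._gate = threading.Semaphore(max_concurrency)

    def embed_query(self, text: str) -> list[float]:
        with self._gate:  # at most `max_concurrency` threads reach the client at once
            return self._inner.embed_query(text)

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        with self._gate:
            return self._inner.embed_documents(texts)
```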
Script executed:
#!/bin/bash
# Check the full WatsonxEmbeddings initialization in embedding_model.py
echo "=== WatsonxEmbeddings initialization parameters ==="
sed -n '295,315p' src/lfx/src/lfx/components/models_and_agents/embedding_model.py
# Search for concurrency_limit or similar parameters in the file
echo -e "\n=== Search for concurrency parameters ==="
rg -n "concurrency|maxConcurrency|max_concurrency" src/lfx/src/lfx/components/models_and_agents/embedding_model.py
# Check how embed_documents is called in opensearch_multimodal
echo -e "\n=== How embeddings are called in opensearch_multimodal ==="
rg -n "embed_documents\|embed_query" src/lfx/src/lfx/components/elastic/opensearch_multimodal.py -B 2 -A 3 | head -50Repository: langflow-ai/langflow
<shell_result>
Repository: langflow-ai/langflow
Exit code: 0
stdout:
=== WatsonxEmbeddings initialization parameters ===
            # Create the primary embedding instance
            embeddings_instance = WatsonxEmbeddings(
                model_id=model,
                params=params,
                watsonx_client=api_client,
                project_id=project_id,
            )
            # Fetch available IBM watsonx.ai models
            available_model_names = self.fetch_ibm_models(final_url)
            # Create dedicated instances for each available model
            available_models_dict = {}
            for model_name in available_model_names:
                available_models_dict[model_name] = WatsonxEmbeddings(
                    model_id=model_name,
                    params=params,
                    watsonx_client=api_client,
                    project_id=project_id,
                )
=== Search for concurrency parameters ===
=== How embeddings are called in opensearch_multimodal ===
</shell_result>
langchain_ibm WatsonxEmbeddings parameters concurrency_limit maxConcurrency source code
<web_search_result>
Short answer — both names exist but in different places:
- In the langchain-ibm (LangChain) integrations: the JS/TS integration exposes maxConcurrency / maxRetries on Watsonx Embeddings/LLM input types (e.g., WatsonxInputEmbeddings / WatsonxCallOptionsLLM). See the API docs for WatsonxEmbeddings / WatsonxInputEmbeddings. [1][2]
- In the upstream IBM watsonx.ai Python SDK (the underlying client), the Embeddings class uses concurrency_limit (and batch_size, max_retries, delay_time, retry_status_codes, etc.). See the SDK Embeddings constructor docs. [3]
Sources:
- [1] LangChain (community) JS/TS WatsonxEmbeddings / WatsonxInputEmbeddings docs (shows maxConcurrency).
- [2] LangChain JS Watsonx call options / params (maxConcurrency field).
- [3] IBM watsonx.ai Python SDK Embeddings constructor (concurrency_limit parameter).
</web_search_result>
Script executed:
#!/bin/bash
# Search for langchain-ibm version in dependencies
echo "=== Searching for langchain-ibm version ==="
find . -name "requirements*.txt" -o -name "pyproject.toml" -o -name "setup.py" -o -name "setup.cfg" | head -20
# Check pyproject.toml or similar files
if [ -f "pyproject.toml" ]; then
echo -e "\n=== pyproject.toml content ==="
grep -A 5 -B 5 "langchain" pyproject.toml | head -40
fi
# Also check in src or root
for f in requirements.txt requirements-dev.txt setup.py; do
if [ -f "$f" ]; then
echo -e "\n=== $f ==="
grep "langchain" "$f" | head -10
fi
done
# Also check backend directory
if [ -d "backend" ]; then
echo -e "\n=== Checking backend for requirements ==="
find backend -maxdepth 2 -name "*.txt" -o -name "pyproject.toml" | xargs grep -l "langchain" 2>/dev/null | head -5
fi
<shell_result>
Repository: langflow-ai/langflow
Exit code: 0
stdout:
=== Searching for langchain-ibm version ===
./pyproject.toml
./src/backend/base/langflow/initial_setup/setup.py
./src/backend/base/langflow/logging/setup.py
./src/backend/base/pyproject.toml
./src/lfx/pyproject.toml
=== pyproject.toml content ===
name = "langflow"
version = "1.7.0"
description = "A Python package with a built-in web application"
requires-python = ">=3.10,<3.14"
license = "MIT"
keywords = ["nlp", "langchain", "openai", "gpt", "gui"]
readme = "README.md"
maintainers = [
{ name = "Carlos Coelho", email = "[email protected]" },
{ name = "Cristhian Zanforlin", email = "[email protected]" },
{ name = "Gabriel Almeida", email = "[email protected]" },
--
"MarkupSafe==3.0.2",
"boto3>=1.34.162,<2.0.0",
"numexpr==2.10.2",
"qianfan==0.3.5",
"pgvector==0.3.6",
"langchain==0.3.23",
"elasticsearch==8.16.0",
"pytube==15.0.0",
"dspy-ai==2.5.41",
"datasets>2.14.7,<4.0.0",
"assemblyai==0.35.1",
--
"langsmith>=0.3.42,<1.0.0",
"yfinance==0.2.50",
"wolframalpha==5.1.3",
"astra-assistants[tools]>=2.2.13,<3.0.0",
"composio==0.9.2",
"composio-langchain==0.9.2",
"spider-client==0.1.24",
"nltk==3.9.1",
"lark==1.2.2",
"jq==1.8.0",
"pydantic-settings>=2.2.0,<3.0.0",
"duckduckgo_search==7.2.1",
"opensearch-py==2.8.0",
"langchain-google-genai==2.0.6",
"langchain-cohere>=0.3.3,<1.0.0",
"langchain-huggingface==0.3.1",
</shell_result>
Script executed:
#!/bin/bash
# Look at how embed_documents is called - check context around line 1274
echo "=== Context around embed_documents call in opensearch_multimodal ==="
sed -n '1270,1310p' src/lfx/src/lfx/components/elastic/opensearch_multimodal.py
# Check if there's any threading or asyncio in this file
echo -e "\n=== Threading/async patterns in opensearch_multimodal ==="
rg -n "async def|Thread|concurrent|asyncio" src/lfx/src/lfx/components/elastic/opensearch_multimodal.py | head -20Repository: langflow-ai/langflow
<shell_result>
Repository: langflow-ai/langflow
Exit code: 0
stdout:
=== Context around embed_documents call in opensearch_multimodal ===
logger.info(f"Generating embeddings for {len(available_models)} models in index")
logger.info(f"Available embedding identifiers: {list(embedding_by_model.keys())}")
for model_name in available_models:
try:
# Check if we have an embedding object for this model
if model_name in embedding_by_model:
# Use the matching embedding object directly
emb_obj = embedding_by_model[model_name]
emb_deployment = getattr(emb_obj, "deployment", None)
emb_model = getattr(emb_obj, "model", None)
emb_model_id = getattr(emb_obj, "model_id", None)
emb_dimensions = getattr(emb_obj, "dimensions", None)
emb_available_models = getattr(emb_obj, "available_models", None)
logger.info(
f"Using embedding object for model '{model_name}': "
f"deployment={emb_deployment}, model={emb_model}, model_id={emb_model_id}, "
f"dimensions={emb_dimensions}"
)
# Check if this is a dedicated instance from available_models dict
if emb_available_models and isinstance(emb_available_models, dict):
logger.info(
f"Model '{model_name}' using dedicated instance from available_models dict "
f"(pre-configured with correct model and dimensions)"
)
# Use the embedding instance directly - no model switching needed!
vec = emb_obj.embed_query(q)
query_embeddings[model_name] = vec
logger.info(f"Generated embedding for model: {model_name} (actual dimensions: {len(vec)})")
else:
# No matching embedding found for this model
logger.warning(
f"No matching embedding found for model '{model_name}'. "
f"This model will be skipped. Available models: {list(embedding_by_model.keys())}"
)
except (RuntimeError, ValueError, ConnectionError, TimeoutError, AttributeError, KeyError) as e:
logger.warning(f"Failed to generate embedding for {model_name}: {e}")
=== Threading/async patterns in opensearch_multimodal ===
7:from concurrent.futures import ThreadPoolExecutor, as_completed
858: with ThreadPoolExecutor(max_workers=max_workers) as executor:
874: "Threaded embedding generation failed for model %s (attempt %s/%s), retrying in %.1fs",
1536: async def update_build_config(self, build_config: dict, field_value: str, field_name: str | None = None) -> dict:
</shell_result>
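The ThreadPoolExecutor and retry hits above point at a fan-out pattern along these lines. This is a simplified sketch of that pattern, not the component's actual code; the function name, retry counts, and backoff are illustrative, while embedding_by_model mirrors the dict seen in the grep output.

```python
# Illustrative sketch of parallel per-model query embedding with simple retries;
# the real logic in opensearch_multimodal.py differs in its details.
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def embed_query_for_models(query, embedding_by_model, max_workers=4, retries=3, delay=1.0):
    """Generate one query embedding per model, in parallel, with basic retry."""

    def embed_one(model_name, emb_obj):
        for attempt in range(1, retries + 1):
            try:
                return model_name, emb_obj.embed_query(query)
            except (RuntimeError, ConnectionError, TimeoutError):
                if attempt == retries:
                    return model_name, None  # give up; caller skips this model
                time.sleep(delay * attempt)  # simple linear backoff between attempts
        return model_name, None

    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(embed_one, name, obj) for name, obj in embedding_by_model.items()]
        for future in as_completed(futures):
            model_name, vector = future.result()
            if vector is not None:
                results[model_name] = vector
    return results
```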
Based on my verification, the review comment raises a valid concern that warrants action. Here's what I found:
- Shared api_client confirmed: All WatsonxEmbeddings instances created in the loop (lines 309-314) share the same api_client object created at line 288.
- Concurrent usage confirmed: opensearch_multimodal.py uses ThreadPoolExecutor (line 858) to execute embedding operations in parallel, and it directly uses the instances from the available_models dictionary (as shown in the loop at lines 1273+).
- No concurrency control: The WatsonxEmbeddings instances are instantiated without any concurrency_limit parameter, which LangChain supports for controlling parallel requests.
- IBM SDK thread-safety: Web search confirms IBM's watsonx.ai Python SDK does not provide explicit thread-safety guarantees for APIClient.
The review comment is accurate and identifies a legitimate issue that needs to be addressed. The solution would be to add a concurrency_limit parameter when instantiating WatsonxEmbeddings instances.
Verify that WatsonxEmbeddings instances are configured with a concurrency_limit parameter to control concurrent API calls.
All WatsonxEmbeddings instances (lines 309–314 and 297–301) share the same api_client object. The code is used with ThreadPoolExecutor in opensearch_multimodal.py for parallel embedding generation. Without an explicit concurrency_limit on each instance, concurrent calls through the shared APIClient risk exceeding API rate limits or exposing thread-unsafe behavior. Add concurrency_limit=N (e.g., 5–10) to each WatsonxEmbeddings() constructor to safely control concurrent requests.
🤖 Prompt for AI Agents
In src/lfx/src/lfx/components/models_and_agents/embedding_model.py around lines
306-314 (and also update the earlier instantiation at lines 297-301), the
WatsonxEmbeddings instances are created without a concurrency_limit while
sharing the same api_client used concurrently by ThreadPoolExecutor; fix this by
passing a concurrency_limit parameter to each WatsonxEmbeddings(...) constructor
(e.g., concurrency_limit=5 or 10) so that parallel embedding requests through
the shared api_client are throttled and avoid thread-safety or rate-limit
issues.
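A minimal sketch of the suggested fix, assuming the installed langchain-ibm version accepts concurrency_limit and forwards it to the watsonx.ai SDK (worth verifying before adopting); credentials, project ID, and model IDs below are placeholders:

```python
# Sketch of the reviewer's suggestion: throttle each instance that shares the client.
# Assumes WatsonxEmbeddings accepts concurrency_limit in this langchain-ibm version.
from ibm_watsonx_ai import APIClient, Credentials
from langchain_ibm import WatsonxEmbeddings

credentials = Credentials(api_key="YOUR_API_KEY", url="https://us-south.ml.cloud.ibm.com")
api_client = APIClient(credentials)

shared_kwargs = {
    "watsonx_client": api_client,  # one shared client, as in the component
    "project_id": "YOUR_PROJECT_ID",
    "concurrency_limit": 5,        # cap parallel requests per instance
}

primary = WatsonxEmbeddings(model_id="ibm/slate-125m-english-rtrvr", **shared_kwargs)
per_model = {
    name: WatsonxEmbeddings(model_id=name, **shared_kwargs)
    for name in ["ibm/slate-30m-english-rtrvr", "ibm/slate-125m-english-rtrvr"]
}
```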
Codecov Report
❌ Patch coverage is 0.00%.
❌ Your patch status has failed because the patch coverage (0.00%) is below the target coverage (40.00%). You can increase the patch coverage or adjust the target coverage.
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #10714      +/-   ##
==========================================
- Coverage    32.48%    32.44%    -0.05%
==========================================
  Files         1366      1367        +1
  Lines        63294     63315       +21
  Branches      9356      9357        +1
==========================================
- Hits         20564     20542       -22
- Misses       41698     41740       +42
- Partials      1032      1033        +1
Flags with carried forward coverage won't be shown.
Updated ChatInput and ChatOutput components in starter project JSONs to use the session_id from the graph if not provided, ensuring consistent session management. This change improves message storage and retrieval logic for chat flows.
…low-ai/langflow into opensearch-multi-embedding
Introduces OpenSearchVectorStoreComponentMultimodalMultiEmbedding, supporting multi-model hybrid semantic and keyword search with dynamic vector fields, parallel embedding generation, advanced filtering, and flexible authentication. Enables ingestion and search across multiple embedding models in OpenSearch, with robust index management and UI configuration handling.
Key Features Added
- The embedding input accepts multiple embedding objects via is_list=True
- Users can connect multiple embedding models from different providers (OpenAI, Watsonx, Cohere, etc.)
- Backward compatible: single embeddings still work seamlessly
- Ingestion uses ONE selected embedding model specified by the user
  - Selection via the embedding_model_name field
  - Falls back to the first embedding if no model name is specified
  - Documents are stored in a dynamic field: chunk_embedding_{model_name}
- Search queries across ALL embedding models found in the index (see the sketch after this list)
  - Automatically detects available models via aggregation
  - Generates query embeddings for each detected model
  - Combines results using hybrid search (dis_max + keyword matching)
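To make the hybrid search step concrete, a query body along these lines could combine one kNN clause per detected model with a keyword clause under dis_max. This is a sketch of the described approach, not the component's exact query; the text field name, k, and tie_breaker values are assumptions, while the vector field names follow the chunk_embedding_{model_name} convention above.

```python
# Sketch of a dis_max hybrid query over per-model vector fields plus keywords.
# Field names follow the chunk_embedding_{model_name} convention from this PR;
# the "text" field and the k/tie_breaker values are illustrative assumptions.
def build_hybrid_query(query_text: str, query_embeddings: dict[str, list[float]], k: int = 10) -> dict:
    knn_clauses = [
        {"knn": {f"chunk_embedding_{model_name}": {"vector": vector, "k": k}}}
        for model_name, vector in query_embeddings.items()
    ]
    keyword_clause = {"multi_match": {"query": query_text, "fields": ["text"]}}
    return {
        "size": k,
        "query": {
            "dis_max": {
                "queries": [*knn_clauses, keyword_clause],
                "tie_breaker": 0.5,  # blend in scores from the non-best clauses
            }
        },
    }
```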
Summary by CodeRabbit