version: v0.1.41 #893
Conversation
Commits included in this release (each commit affirms the Topoteretes Developer Certificate of Origin):

- Resolve issue with .venv being broken when using docker compose with Cognee (co-authored by Boris Arzentar)
- … 1947 (#760): no description provided (co-authored by Boris and Igor Ilic)
- Add support for UV and for Poetry package management
- Switch typing from str to UUID for NetworkX node_id
- Add both sse and stdio support for Cognee MCP
- …83] (#782): Add log handling options for cognee exceptions
- Fix issue with failing versions gh actions
- No description provided (co-authored by Vasilije)
- No description provided
- No description provided (co-authored by Vasilije)
- No description provided (co-authored by Hande and Vasilije)
- No description provided (co-authored by Hande and Vasilije)
- Add support for the Memgraph graph database following the [graph database integration guide](https://docs.cognee.ai/contributing/adding-providers/graph-db/graph-database-integration): implemented `MemgraphAdapter` for interfacing with Memgraph, updated `get_graph_engine.py` to return MemgraphAdapter when appropriate, added a test script `test_memgraph.py`, and created a dedicated test workflow `.github/workflows/test_memgraph.yml` (co-authored by Vasilije and Boris)
- refactor: Handle boto3 s3fs dependencies better
- No description provided
- Update LanceDB and rewrite data points to run async (co-authored by Boris and Boris Arzentar)
- No description provided
- No description provided
- PageRank demo: as discussed with @hande-k and Lazar, a short demo illustrating how to get the PageRank rankings from the knowledge graph given the nx engine; a POC and a first step towards solving #643 (co-authored by Boris, Hande, and Vasilije)
- Added tools to check current cognify and codify status
Walkthrough

This update introduces two new advanced graph completion retrievers, GraphCompletionCotRetriever and GraphCompletionContextExtensionRetriever, and adds optional node_type/node_name filtering to the search flow.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant API
    participant SearchModule
    participant Retriever
    participant GraphAdapter
    User->>API: POST /api/v1/search (query, search_type, node_type, node_name)
    API->>SearchModule: search(query, search_type, node_type, node_name)
    SearchModule->>Retriever: get_triplets/query (with node_type/node_name)
    Retriever->>GraphAdapter: get_nodeset_subgraph(node_type, node_name)
    GraphAdapter-->>Retriever: Nodes/Edges
    Retriever-->>SearchModule: Triplets/Context
    SearchModule-->>API: Results
    API-->>User: Search response
```
```mermaid
sequenceDiagram
    participant User
    participant API
    participant GraphCompletionCotRetriever
    participant LLM
    participant GraphAdapter
    User->>API: POST /api/v1/search (GRAPH_COMPLETION_COT)
    API->>GraphCompletionCotRetriever: get_completion(query)
    loop Iterative CoT rounds
        GraphCompletionCotRetriever->>GraphAdapter: get_triplets/followup (node_type/node_name)
        GraphCompletionCotRetriever->>LLM: validate answer, generate followup
        LLM-->>GraphCompletionCotRetriever: Reasoning, Followup Q
    end
    GraphCompletionCotRetriever-->>API: Final answer
    API-->>User: Completion response
```
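To make the flows above concrete, a client call to the search endpoint might look roughly like the sketch below. The endpoint path and parameter names come from the diagrams; the JSON field names, host, port, and lack of authentication are assumptions, not confirmed API details.

```python
import requests

# Hypothetical request mirroring the sequence diagrams above; the real payload
# schema and authentication requirements may differ.
response = requests.post(
    "http://localhost:8000/api/v1/search",  # assumed local deployment
    json={
        "query": "How are documents classified?",
        "search_type": "GRAPH_COMPLETION_COT",  # the CoT flow from the second diagram
        "node_type": "NodeSet",                 # assumed filter value
        "node_name": ["my_node_set"],           # assumed filter value
    },
    timeout=120,
)
print(response.status_code, response.json())
```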
| GitGuardian id | GitGuardian status | Secret | Commit | Filename | |
|---|---|---|---|---|---|
| 17116131 | Triggered | Generic Password | 3b07f3c | examples/database_examples/neo4j_example.py | View secret |
🛠 Guidelines to remediate hardcoded secrets
- Understand the implications of revoking this secret by investigating where it is used in your code.
- Replace and store your secret safely. Learn the best practices here.
- Revoke and rotate this secret.
- If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.
To avoid such incidents in the future, consider:
- following these best practices for managing and storing secrets, including API keys and other credentials;
- installing secret detection as a pre-commit hook to catch secrets before they leave your machine and ease remediation.
🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.
# Conflicts:
#   CONTRIBUTING.md
#   Dockerfile
#   README.md
#   cognee-mcp/pyproject.toml
#   cognee-mcp/src/server.py
#   cognee-mcp/uv.lock
#   cognee/infrastructure/databases/graph/memgraph/memgraph_adapter.py
#   cognee/infrastructure/databases/graph/networkx/adapter.py
#   cognee/infrastructure/databases/vector/chromadb/ChromaDBAdapter.py
#   cognee/infrastructure/databases/vector/milvus/MilvusAdapter.py
#   cognee/infrastructure/databases/vector/qdrant/QDrantAdapter.py
#   cognee/infrastructure/databases/vector/weaviate_db/WeaviateAdapter.py
#   cognee/modules/retrieval/graph_completion_retriever.py
#   cognee/modules/retrieval/utils/brute_force_triplet_search.py
#   cognee/tests/test_neo4j.py
#   entrypoint.sh
#   pyproject.toml
#   uv.lock
Caution
Inline review comments failed to post. This is likely due to GitHub's limits when posting large numbers of comments. If you are seeing this consistently it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.
🛑 Comments failed to post (53)
cognee/infrastructure/llm/rate_limiter.py (3)
205-211: 🛠️ Refactor suggestion
Consolidate Returns in async_wait_if_needed
The async `async_wait_if_needed` docstring also duplicates the `Returns:` section. Merge them into a single description to avoid confusion.

🤖 Prompt for AI Agents
In cognee/infrastructure/llm/rate_limiter.py around lines 205 to 211, the async_wait_if_needed function's docstring has duplicated Returns sections. Consolidate these into a single Returns section that clearly states the function returns a float representing the time waited in seconds, removing any redundant or repeated descriptions.
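The same consolidation applies to the `wait_if_needed` and `hit_limit` comments below: keep exactly one `Returns:` block per docstring. As a purely illustrative sketch (the class, interval logic, and signature here are invented for the example and are not the project's actual implementation):

```python
import asyncio
import time


class RateLimiterSketch:
    """Illustrative stand-in; not the project's actual limiter implementation."""

    def __init__(self, min_interval: float = 0.5) -> None:
        self._min_interval = min_interval
        self._last_call = 0.0

    async def async_wait_if_needed(self) -> float:
        """
        Wait asynchronously until the rate limit allows another request.

        Returns:
        --------
        - float: The time waited, in seconds (0.0 if no wait was needed).
        """
        now = time.monotonic()
        wait_for = max(0.0, self._min_interval - (now - self._last_call))
        if wait_for:
            await asyncio.sleep(wait_for)
        self._last_call = time.monotonic()
        return wait_for
```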
180-186: 🛠️ Refactor suggestion
Eliminate duplicate Returns in docstring
Like `hit_limit`, the `wait_if_needed` method's docstring duplicates its `Returns:` section. Merge into one block to maintain clarity and adhere to a single style for return documentation.

🤖 Prompt for AI Agents
In cognee/infrastructure/llm/rate_limiter.py around lines 180 to 186, the docstring for the wait_if_needed method contains duplicate Returns sections. Remove the redundant Returns block and consolidate the return description into a single Returns section to maintain clarity and consistent docstring style.
156-162: 🛠️ Refactor suggestion
Remove duplicated Returns section
The `hit_limit` docstring contains two `Returns:` blocks describing the same boolean outcome. Please remove the redundant section and consolidate into a single, clear `Returns:` entry.

🤖 Prompt for AI Agents
In cognee/infrastructure/llm/rate_limiter.py around lines 156 to 162, the docstring for the hit_limit method contains two duplicated Returns sections both describing the same boolean return value. Remove the redundant Returns block and keep only one clear and concise Returns section that states the method returns a boolean indicating if the request is allowed.

cognee/tasks/documents/classify_documents.py (2)
55-62: 🛠️ Refactor suggestion
⚠️ Potential issue

Handle non-string `external_metadata` inputs to avoid unhandled exceptions

The current `json.loads` call will raise a `TypeError` if `document.external_metadata` is `None` or not a string, which isn't caught by the existing `except json.JSONDecodeError`. Consider expanding the exception handler:
```diff
     try:
         external_metadata = json.loads(document.external_metadata)
-    except json.JSONDecodeError:
+    except (json.JSONDecodeError, TypeError):
         return
```

This ensures that non-string or missing metadata won't cause an unhandled exception.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```suggestion
    try:
        external_metadata = json.loads(document.external_metadata)
    except (json.JSONDecodeError, TypeError):
        return
```

🤖 Prompt for AI Agents
In cognee/tasks/documents/classify_documents.py around lines 55 to 62, the code uses json.loads on document.external_metadata without checking if it is a string, causing a TypeError if it is None or another non-string type. To fix this, expand the exception handling to also catch TypeError along with JSONDecodeError, ensuring that non-string or missing external_metadata does not raise an unhandled exception and the function returns early as intended.
93-113: 💡 Verification agent
🧩 Analysis chain
Document or handle missing file extension mappings
`classify_documents` will raise a `KeyError` if a `data_item.extension` is not in `EXTENSION_TO_DOCUMENT_CLASS`. Please either document this behavior in the docstring or handle it gracefully (e.g., skip unknown extensions or raise a more descriptive error).

For example, you could:
```diff
-        document = EXTENSION_TO_DOCUMENT_CLASS[data_item.extension](
+        doc_cls = EXTENSION_TO_DOCUMENT_CLASS.get(data_item.extension)
+        if doc_cls is None:
+            raise ValueError(f"Unsupported document extension: {data_item.extension}")
+        document = doc_cls(
             id=data_item.id,
             title=f"{data_item.name}.{data_item.extension}",
             raw_data_location=data_item.raw_data_location,
             name=data_item.name,
             mime_type=data_item.mime_type,
             external_metadata=json.dumps(data_item.external_metadata, indent=4),
         )
```

Additionally, update the docstring's `Raises` section to reflect the new error.
🏁 Script executed:
```bash
#!/bin/bash
# Search for the mapping of extensions to document classes
rg -n "EXTENSION_TO_DOCUMENT_CLASS" -N

# Inspect the classify_documents function and its docstring
rg -n "def classify_documents" -n cognee/tasks/documents/classify_documents.py
sed -n '1,200p' cognee/tasks/documents/classify_documents.py | sed -n '1,50p'
```

Length of output: 1922
🏁 Script executed:
```bash
#!/bin/bash
# Display the classify_documents function including its docstring and core logic
sed -n '80,160p' cognee/tasks/documents/classify_documents.py
```

Length of output: 1726
Handle unsupported document extensions in `classify_documents`

To prevent a `KeyError` when `data_item.extension` is not in `EXTENSION_TO_DOCUMENT_CLASS`, add a guard and raise a clear error, and update the docstring's `Raises` section accordingly.

• File: cognee/tasks/documents/classify_documents.py
  - In the `for data_item in data_documents:` loop, replace the direct lookup with a `.get()` check before instantiating the document class.
  - Update the function docstring to include a `Raises:` block.

Suggested diff:
```diff
 async def classify_documents(data_documents: list[Data]) -> list[Document]:
     """
     Classifies a list of data items into specific document types based on their file extensions.
+
+    Raises:
+        ValueError: If a document's extension is not supported.
     """
     documents = []
     for data_item in data_documents:
-        document = EXTENSION_TO_DOCUMENT_CLASS[data_item.extension](
+        doc_cls = EXTENSION_TO_DOCUMENT_CLASS.get(data_item.extension)
+        if doc_cls is None:
+            raise ValueError(f"Unsupported document extension: {data_item.extension}")
+        document = doc_cls(
             id=data_item.id,
             title=f"{data_item.name}.{data_item.extension}",
             raw_data_location=data_item.raw_data_location,
             name=data_item.name,
             mime_type=data_item.mime_type,
             external_metadata=json.dumps(data_item.external_metadata, indent=4),
         )
```

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```suggestion
async def classify_documents(data_documents: list[Data]) -> list[Document]:
    """
    Classifies a list of data items into specific document types based on their file extensions.

    This function processes each item in the provided list of data documents, retrieves
    relevant metadata, and creates instances of document classes mapped to their extensions.
    It ensures that the data items are valid before performing the classification and invokes
    `update_node_set` to extract and set relevant node information from the document's
    external metadata.

    Parameters:
    -----------
        - data_documents (list[Data]): A list of Data objects representing the documents to be classified.

    Returns:
    --------
        - list[Document]: A list of Document objects created based on the classified data documents.

    Raises:
        ValueError: If a document's extension is not supported.
    """
    documents = []

    for data_item in data_documents:
        doc_cls = EXTENSION_TO_DOCUMENT_CLASS.get(data_item.extension)
        if doc_cls is None:
            raise ValueError(f"Unsupported document extension: {data_item.extension}")
        document = doc_cls(
            id=data_item.id,
            title=f"{data_item.name}.{data_item.extension}",
            raw_data_location=data_item.raw_data_location,
            name=data_item.name,
            mime_type=data_item.mime_type,
            external_metadata=json.dumps(data_item.external_metadata, indent=4),
        )

        update_node_set(document, data_item)
        documents.append(document)

    return documents
```

🤖 Prompt for AI Agents
In cognee/tasks/documents/classify_documents.py around lines 93 to 113, the classify_documents function currently does a direct lookup in EXTENSION_TO_DOCUMENT_CLASS using data_item.extension, which can raise a KeyError for unsupported extensions. Modify the loop to first check if the extension exists in the mapping using .get() or an if condition, and if not, raise a descriptive error indicating the unsupported extension. Also, update the function's docstring to add a Raises section documenting this new error for unsupported extensions.
cognee/infrastructure/databases/vector/supported_databases.py (1)
1-1: 🛠️ Refactor suggestion
Add module-level docstring
This new file should include a top-level docstring to explain the purpose of the `supported_databases` mapping.

```diff
+"""
+Mapping of supported vector database adapters by provider name.
+"""
 supported_databases = {}
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```suggestion
"""
Mapping of supported vector database adapters by provider name.
"""
supported_databases = {}
```

🤖 Prompt for AI Agents
In cognee/infrastructure/databases/vector/supported_databases.py at line 1, add a module-level docstring at the very top of the file that clearly explains the purpose of the supported_databases dictionary. This docstring should briefly describe what the mapping represents and how it is intended to be used within the module.
cognee/infrastructure/llm/prompts/cot_validation_system_prompt.txt (1)
1-2: 🛠️ Refactor suggestion
Fix grammar and clarity in system prompt
The sentences contain subject-verb mismatches and are hard to parse. Apply this diff:
```diff
- You are a helpful agent who are allowed to use only the provided question answer and context.
- I want to you find reasoning what is missing from the context or why the answer is not answering the question or not correct strictly based on the context.
+ You are a helpful agent who is allowed to use only the provided question, answer, and context.
+ Identify missing reasoning in the context or explain why the answer does not correctly address the question, strictly based on the provided context.
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```suggestion
You are a helpful agent who is allowed to use only the provided question, answer, and context.
Identify missing reasoning in the context or explain why the answer does not correctly address the question, strictly based on the provided context.
```

🧰 Tools
🪛 LanguageTool
[uncategorized] ~1-~1: This verb does not appear to agree with the subject. Consider using a different form.
Context: You are a helpful agent who are allowed to use only the provided questi...(AI_EN_LECTOR_REPLACEMENT_VERB_AGREEMENT)
🤖 Prompt for AI Agents
In cognee/infrastructure/llm/prompts/cot_validation_system_prompt.txt at lines 1 to 2, fix the grammar and clarity issues by correcting subject-verb agreement and rephrasing for better readability. Change "You are a helpful agent who are allowed" to "You are a helpful agent who is allowed" and reword the second sentence to clearly instruct the agent to find missing reasoning or explain why the answer is incorrect strictly based on the provided context.
cognee/tests/unit/modules/search/search_methods_test.py (1)
6-6:
⚠️ Potential issue

Remove unused and incorrect import.
The import `from pylint.checkers.utils import node_type` is unused and appears to be incorrect. The `node_type` parameter in the test is being set to `None`, not using this imported symbol.

Remove the unused import:
```diff
-from pylint.checkers.utils import node_type
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

🧰 Tools
🪛 Ruff (0.11.9)
6-6:
`pylint.checkers.utils.node_type` imported but unused

Remove unused import: `pylint.checkers.utils.node_type` (F401)
🪛 Pylint (3.3.7)
[warning] 6-6: Unused node_type imported from pylint.checkers.utils
(W0611)
🤖 Prompt for AI Agents
In cognee/tests/unit/modules/search/search_methods_test.py at line 6, remove the unused and incorrect import statement 'from pylint.checkers.utils import node_type' since the test uses 'node_type' as a parameter set to None and does not require this import.
cognee/api/v1/config/config.py (1)
158-158: 🛠️ Refactor suggestion
Use InvalidAttributeError for consistency.
While removing the `message=` keyword fixes the immediate issue with `AttributeError`, this creates inconsistency with other methods in the file. All other configuration methods (`set_llm_config`, `set_relational_db_config`, `set_migration_db_config`, `set_vector_db_config`) use `InvalidAttributeError(message=...)` for the same error scenario.

Apply this diff for consistency:
```diff
-        raise AttributeError(f"'{key}' is not a valid attribute of the config.")
+        raise InvalidAttributeError(
+            message=f"'{key}' is not a valid attribute of the config."
+        )
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```suggestion
        raise InvalidAttributeError(
            message=f"'{key}' is not a valid attribute of the config."
        )
```

🤖 Prompt for AI Agents
In cognee/api/v1/config/config.py at line 158, replace the raised AttributeError with InvalidAttributeError using the message= keyword argument to maintain consistency with other configuration methods. Change the exception to raise InvalidAttributeError(message=f"'{key}' is not a valid attribute of the config.") so it matches the existing error handling style in the file.
cognee/infrastructure/databases/graph/use_graph_adapter.py (1)
4-5:
⚠️ Potential issue

Fix misleading parameter names and add proper documentation.
The parameter names `vector_db_name` and `vector_db_adapter` are misleading since this function is for graph database adapters. Additionally, the function lacks a docstring and type hints, which is inconsistent with the extensive documentation being added elsewhere in this PR.

```diff
-def use_graph_adapter(vector_db_name, vector_db_adapter):
-    supported_databases[vector_db_name] = vector_db_adapter
+def use_graph_adapter(graph_db_name: str, graph_db_adapter) -> None:
+    """
+    Register or update a graph database adapter in the supported databases registry.
+
+    Parameters:
+    -----------
+
+    - graph_db_name (str): The name/identifier for the graph database.
+    - graph_db_adapter: The adapter instance for the graph database.
+    """
+    if not isinstance(graph_db_name, str) or not graph_db_name.strip():
+        raise ValueError("Graph database name must be a non-empty string")
+
+    supported_databases[graph_db_name] = graph_db_adapter
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```suggestion
def use_graph_adapter(graph_db_name: str, graph_db_adapter) -> None:
    """
    Register or update a graph database adapter in the supported databases registry.

    Parameters:
    -----------

    graph_db_name (str): The name/identifier for the graph database.
    graph_db_adapter: The adapter instance for the graph database.
    """
    if not isinstance(graph_db_name, str) or not graph_db_name.strip():
        raise ValueError("Graph database name must be a non-empty string")

    supported_databases[graph_db_name] = graph_db_adapter
```

🤖 Prompt for AI Agents
In cognee/infrastructure/databases/graph/use_graph_adapter.py around lines 4 to 5, rename the parameters from vector_db_name and vector_db_adapter to graph_db_name and graph_db_adapter to accurately reflect their purpose. Add a clear docstring explaining the function's role in registering graph database adapters. Include appropriate type hints for both parameters and the return type to maintain consistency with the rest of the codebase documentation.
cognee/infrastructure/databases/vector/use_vector_adapter.py (1)
4-5: 🛠️ Refactor suggestion
Add documentation and input validation for better maintainability.
The function lacks documentation and input validation. Consider adding a docstring to explain the purpose and parameters, and validate the inputs to ensure robustness.
```diff
 def use_vector_adapter(vector_db_name, vector_db_adapter):
+    """
+    Register or update a vector database adapter.
+
+    Parameters:
+    -----------
+    vector_db_name : str
+        The name/identifier for the vector database adapter.
+    vector_db_adapter : object
+        The adapter instance or class to register.
+
+    Raises:
+    -------
+    ValueError
+        If vector_db_name is empty or None.
+    TypeError
+        If vector_db_adapter is None.
+    """
+    if not vector_db_name:
+        raise ValueError("vector_db_name cannot be empty or None")
+    if vector_db_adapter is None:
+        raise TypeError("vector_db_adapter cannot be None")
+
     supported_databases[vector_db_name] = vector_db_adapter
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```suggestion
def use_vector_adapter(vector_db_name, vector_db_adapter):
    """
    Register or update a vector database adapter.

    Parameters:
    -----------
    vector_db_name : str
        The name/identifier for the vector database adapter.
    vector_db_adapter : object
        The adapter instance or class to register.

    Raises:
    -------
    ValueError
        If vector_db_name is empty or None.
    TypeError
        If vector_db_adapter is None.
    """
    if not vector_db_name:
        raise ValueError("vector_db_name cannot be empty or None")
    if vector_db_adapter is None:
        raise TypeError("vector_db_adapter cannot be None")

    supported_databases[vector_db_name] = vector_db_adapter
```

🤖 Prompt for AI Agents
In cognee/infrastructure/databases/vector/use_vector_adapter.py at lines 4 to 5, the function use_vector_adapter lacks a docstring and input validation. Add a clear docstring explaining the function's purpose and its parameters. Then, add input validation to check that vector_db_name is a non-empty string and vector_db_adapter is of the expected type or callable before assigning it to supported_databases to improve robustness.
cognee/modules/engine/models/ColumnValue.py (1)
4-9: 🛠️ Refactor suggestion
Add docstring and consider type annotations.
The new `ColumnValue` class lacks documentation. Consider adding a comprehensive docstring to explain its purpose and usage within the data ingestion pipeline.

```diff
 class ColumnValue(DataPoint):
+    """
+    Represent a column value from relational database tables.
+
+    This model extends DataPoint to capture individual column values from table rows
+    during relational database migration. The 'properties' field is configured as
+    the primary index field for embedding and search operations.
+
+    Attributes:
+        name: The column name
+        description: Description of the column's purpose or content
+        properties: The actual column value or properties (used for indexing)
+    """
     name: str
     description: str
     properties: str
```

Also, the module filename should follow Python naming conventions (snake_case). Consider renaming `ColumnValue.py` to `column_value.py`.

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```suggestion
class ColumnValue(DataPoint):
    """
    Represent a column value from relational database tables.

    This model extends DataPoint to capture individual column values from table rows
    during relational database migration. The 'properties' field is configured as
    the primary index field for embedding and search operations.

    Attributes:
        name: The column name
        description: Description of the column's purpose or content
        properties: The actual column value or properties (used for indexing)
    """

    name: str
    description: str
    properties: str
    metadata: dict = {"index_fields": ["properties"]}
```

🧰 Tools
🪛 Pylint (3.3.7)
[refactor] 4-4: Too few public methods (0/2)
(R0903)
🤖 Prompt for AI Agents
In cognee/modules/engine/models/ColumnValue.py around lines 4 to 9, add a clear and concise docstring to the ColumnValue class explaining its role and usage in the data ingestion pipeline. Also, update the file name from ColumnValue.py to column_value.py to follow Python's snake_case naming convention for modules.
cognee/tests/test_relational_db_migration.py (1)
161-162: 💡 Verification agent
🧩 Analysis chain
Significant increase in expected node/edge counts - verify correctness.
The SQLite test expectations have increased dramatically:
- Nodes: 227 → 543 (139% increase)
- Edges: 580 → 1317 (127% increase)
While this aligns with the enhanced data modeling mentioned in the AI summary, such significant increases warrant verification that these values are correct and stable.
Note: The static analysis warning about variables being used before assignment is a false positive: `node_count` and `edge_count` are properly assigned in the conditional blocks above (lines 134-158) before being used in these assertions.

Run this script to verify the expected counts are accurate:
🏁 Script executed:
```bash
#!/bin/bash
# Description: Verify the expected node and edge counts in migration tests
# Expected: Find confirmation of the new expected values in related migration code

# Search for references to the new expected values in migration-related files
echo "Searching for references to new expected values..."
rg -A 3 -B 3 "543|1317|522|961" --type py | grep -E "(node|edge|count)"

# Look for ColumnValue or related new data modeling references
echo -e "\nSearching for ColumnValue and related modeling changes..."
rg -A 5 "ColumnValue" --type py

# Check migration database task implementations
echo -e "\nExamining migration database implementation..."
fd "migrate_relational_database" --type f --exec cat {}
```

Length of output: 13851
🏁 Script executed:
```bash
#!/bin/bash
# Display the initial portion of the test file to inspect schema and setup
sed -n '1,200p' cognee/tests/test_relational_db_migration.py

# Display the middle portion including test definitions and assertions
sed -n '200,350p' cognee/tests/test_relational_db_migration.py
```

Length of output: 8827
🏁 Script executed:
```bash
#!/usr/bin/env bash
python3 - <<'EOF'
import sqlite3, json

db_path = 'cognee/tests/test_data/migration_database.sqlite'
conn = sqlite3.connect(db_path)
cur = conn.cursor()

# list tables
cur.execute("SELECT name FROM sqlite_master WHERE type='table';")
tables = [r[0] for r in cur.fetchall() if r[0] != 'sqlite_sequence']
print("Tables:", tables)

# row counts per table
counts = {}
for t in tables:
    cur.execute(f"SELECT COUNT(*) FROM {t};")
    counts[t] = cur.fetchone()[0]
print("Row counts per table:", counts)

# schema details
schema = {}
for t in tables:
    # columns info
    cur.execute(f"PRAGMA table_info({t});")
    cols_info = cur.fetchall()
    # primary key
    pk = next((c[1] for c in cols_info if c[5] > 0), None)
    # foreign keys
    cur.execute(f"PRAGMA foreign_key_list({t});")
    fks = [{"column": f[3], "ref_table": f[2], "ref_column": f[4]} for f in cur.fetchall()]
    schema[t] = {"columns": [{"name": c[1]} for c in cols_info], "primary_key": pk, "foreign_keys": fks}
print("Schema:", json.dumps(schema, indent=2))

# compute expected nodes
num_tables = len(tables)
node_types = num_tables
node_rows = sum(counts.values())
colval_nodes = 0
for t, s in schema.items():
    num_rows = counts[t]
    pk = s["primary_key"]
    fks = [fk["ref_column"] for fk in s["foreign_keys"]]
    for c in s["columns"]:
        name = c["name"]
        if name == pk or name in fks:
            continue
        colval_nodes += num_rows
total_nodes = node_types + node_rows + colval_nodes
print(f"Computed nodes: {total_nodes} (tables={node_types}, rows={node_rows}, col values={colval_nodes})")

# compute edges
edges1 = node_rows
edges2 = colval_nodes
# foreign key edges count
fkey_edges = 0
for t, s in schema.items():
    for fk in s["foreign_keys"]:
        col = fk["column"]
        cur.execute(f"SELECT COUNT(*) FROM {t} WHERE {col} IS NOT NULL;")
        fkey_edges += cur.fetchone()[0]
total_edges = edges1 + edges2 + fkey_edges
print(f"Computed edges: {total_edges} (row->table={edges1}, col edges={edges2}, fk edges={fkey_edges})")
EOF
```

Length of output: 5648
Update expected graph counts in migration tests
The verification against `cognee/tests/test_data/migration_database.sqlite` shows:

- Tables: 11
- Row nodes: 309
- Column-value nodes: 737
  → Total nodes = 11 + 309 + 737 = 1057
- Edges (row→table): 309
- Edges (row→column): 737
- Foreign-key edges: 569
  → Total edges = 309 + 737 + 569 = 1615

Please adjust the assertions in `cognee/tests/test_relational_db_migration.py`:

• SQLite branch (`migration_db_provider == "sqlite"`):

```diff
-    assert node_count == 543, f"Expected 543 nodes, got {node_count}"
-    assert edge_count == 1317, f"Expected 1317 edges, got {edge_count}"
+    assert node_count == 1057, f"Expected 1057 nodes, got {node_count}"
+    assert edge_count == 1615, f"Expected 1615 edges, got {edge_count}"
```

• PostgreSQL branch (`migration_db_provider == "postgresql"`):

```diff
-    assert node_count == 522, f"Expected 522 nodes, got {node_count}"
-    assert edge_count == 961, f"Expected 961 edges, got {edge_count}"
+    assert node_count == 1057, f"Expected 1057 nodes, got {node_count}"
+    assert edge_count == 1615, f"Expected 1615 edges, got {edge_count}"
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```diff
 if migration_db_provider == "sqlite":
-    assert node_count == 543, f"Expected 543 nodes, got {node_count}"
-    assert edge_count == 1317, f"Expected 1317 edges, got {edge_count}"
+    assert node_count == 1057, f"Expected 1057 nodes, got {node_count}"
+    assert edge_count == 1615, f"Expected 1615 edges, got {edge_count}"
 elif migration_db_provider == "postgresql":
-    assert node_count == 522, f"Expected 522 nodes, got {node_count}"
-    assert edge_count == 961, f"Expected 961 edges, got {edge_count}"
+    assert node_count == 1057, f"Expected 1057 nodes, got {node_count}"
+    assert edge_count == 1615, f"Expected 1615 edges, got {edge_count}"
```

🧰 Tools
🪛 Pylint (3.3.7)
[error] 161-161: Possibly using variable 'node_count' before assignment
(E0606)
[error] 162-162: Possibly using variable 'edge_count' before assignment
(E0606)
🤖 Prompt for AI Agents
In cognee/tests/test_relational_db_migration.py around lines 161 to 162, the expected node and edge counts for the SQLite test are outdated and do not reflect the verified counts from the migration_database.sqlite analysis. Update the assertions for the SQLite branch to expect 1057 nodes and 1615 edges instead of 543 and 1317. Also, ensure the PostgreSQL branch assertions are updated accordingly based on the verified counts for that database provider.
cognee/infrastructure/files/utils/extract_text_from_file.py (1)
7-26: 🛠️ Refactor suggestion
Docstring vs. implementation mismatch
The docstring promises an error on unsupported formats, but the code lacks a default case. Also, the code assumes `page.extract_text()` never returns `None`. Recommend:
- Adding a final `else: raise ValueError(...)`.
- Guarding against `None` from `extract_text()`, e.g., using `(page.extract_text() or "")`.
In cognee/infrastructure/files/utils/extract_text_from_file.py between lines 7 and 26, the docstring states that an error should be raised for unsupported file formats, but the code does not have a default case to enforce this. Add an else clause that raises a ValueError for unsupported file types. Additionally, the code assumes page.extract_text() never returns None; modify the extraction logic to safely handle None by using (page.extract_text() or "") to avoid errors during text concatenation.
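A minimal sketch of the two suggested guards is shown below; the function signature, the `file_type` parameter, and the use of `pypdf` are assumptions for illustration rather than the module's actual code.

```python
from io import BytesIO

from pypdf import PdfReader  # assumed PDF backend; the real module may use a different one


def extract_text_from_file(file: BytesIO, file_type: str) -> str:
    """Sketch of the suggested guards; not the project's actual implementation."""
    if file_type == "pdf":
        reader = PdfReader(file)
        # Guard against pages where extract_text() returns None.
        return "\n".join((page.extract_text() or "") for page in reader.pages)
    elif file_type == "txt":
        return file.read().decode("utf-8")
    else:
        # Default case promised by the docstring: fail loudly on unsupported formats.
        raise ValueError(f"Unsupported file format: {file_type}")
```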
alembic/versions/1d0bb7fede17_add_pipeline_run_status.py (1)
32-33: 🛠️ Refactor suggestion
Consider implementing downgrade functionality.
The empty `downgrade` function provides no rollback capability. While removing enum values from PostgreSQL can be complex, consider documenting why the downgrade is not implemented or adding a comment explaining the limitation.

Add documentation about the downgrade limitation:
```diff
 def downgrade() -> None:
-    pass
+    # NOTE: PostgreSQL does not support removing enum values in a straightforward way.
+    # Manual intervention would be required to remove the DATASET_PROCESSING_INITIATED value.
+    pass
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```suggestion
def downgrade() -> None:
    # NOTE: PostgreSQL does not support removing enum values in a straightforward way.
    # Manual intervention would be required to remove the DATASET_PROCESSING_INITIATED value.
    pass
```

🤖 Prompt for AI Agents
In alembic/versions/1d0bb7fede17_add_pipeline_run_status.py at lines 32 to 33, the downgrade function is empty, providing no rollback capability. Add a comment inside the downgrade function explaining that removing enum values in PostgreSQL is complex and why the downgrade is not implemented, to document this limitation clearly.
cognee/infrastructure/data/chunking/create_chunking_engine.py (1)
8-15: 🛠️ Refactor suggestion
Inconsistent class documentation.
The docstring lists `vector_db_*` attributes, but the `create_chunking_engine` function actually accesses `chunk_*` attributes from the config (e.g., `config["chunk_engine"]`, `config["chunk_size"]`).

Update the docstring to reflect the actual usage:
```diff
- - vector_db_url: A string representing the URL of the vector database.
- - vector_db_key: A string representing the key for accessing the vector database.
- - vector_db_provider: A string representing the provider of the vector database.
+ - chunk_engine: The chunking engine type to use (Langchain, Default, or Haystack).
+ - chunk_size: The size of text chunks to generate.
+ - chunk_overlap: The overlap between consecutive chunks.
+ - chunk_strategy: The strategy to use for chunking text.
+ - vector_db_url: A string representing the URL of the vector database.
+ - vector_db_key: A string representing the key for accessing the vector database.
+ - vector_db_provider: A string representing the provider of the vector database.
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```suggestion
    """
    Represent configuration settings for chunking operations, inheriting from the built-in Dict class.

    The class contains the following public attributes:

    - chunk_engine: The chunking engine type to use (Langchain, Default, or Haystack).
    - chunk_size: The size of text chunks to generate.
    - chunk_overlap: The overlap between consecutive chunks.
    - chunk_strategy: The strategy to use for chunking text.
    - vector_db_url: A string representing the URL of the vector database.
    - vector_db_key: A string representing the key for accessing the vector database.
    - vector_db_provider: A string representing the provider of the vector database.
    """
```

🤖 Prompt for AI Agents
In cognee/infrastructure/data/chunking/create_chunking_engine.py around lines 8 to 15, the class docstring incorrectly documents vector_db_* attributes, but the code uses chunk_* attributes like chunk_engine and chunk_size. Update the docstring to accurately describe the chunk_* configuration settings used by the create_chunking_engine function, removing references to vector_db_* attributes and adding descriptions for chunk_engine, chunk_size, and any other relevant chunking parameters.
cognee/eval_framework/answer_generation/answer_generation_executor.py (1)
15-21: 🛠️ Refactor suggestion
Unnecessary type annotation weakening.
The new retrievers `GraphCompletionCotRetriever` and `GraphCompletionContextExtensionRetriever` both inherit from `BaseRetriever` (through `GraphCompletionRetriever`), so changing the type annotation to `Any` is unnecessary and reduces type safety.

Revert the type annotation to maintain type safety:
```diff
-retriever_options: Dict[str, Any] = {
+retriever_options: Dict[str, BaseRetriever] = {
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```suggestion
retriever_options: Dict[str, BaseRetriever] = {
    "cognee_graph_completion": GraphCompletionRetriever,
    "cognee_graph_completion_cot": GraphCompletionCotRetriever,
    "cognee_graph_completion_context_extension": GraphCompletionContextExtensionRetriever,
    "cognee_completion": CompletionRetriever,
    "graph_summary_completion": GraphSummaryCompletionRetriever,
}
```

🤖 Prompt for AI Agents
In cognee/eval_framework/answer_generation/answer_generation_executor.py around lines 15 to 21, the retriever_options dictionary is annotated with a type that uses Any, which weakens type safety unnecessarily. Since all retriever classes inherit from BaseRetriever, change the type annotation from Dict[str, Any] to Dict[str, Type[BaseRetriever]] to maintain proper type safety and reflect the actual class types stored in the dictionary.
cognee/infrastructure/engine/models/DataPoint.py (1)
159-159:
⚠️ Potential issue

Fix class method parameters to use 'cls' instead of 'self'.
Class methods should use `cls` as the first parameter by Python convention, not `self`. This is a standard practice that makes the code intent clearer.

```diff
 @classmethod
-def from_json(self, json_str: str):
+def from_json(cls, json_str: str):
     """
     Deserialize a DataPoint instance from a JSON string.

     The method transforms the input JSON string back into a DataPoint instance using model validation.

     Parameters:
     -----------
         - json_str (str): The JSON string representation of a DataPoint instance to be deserialized.

     Returns:
     --------
         A new DataPoint instance created from the JSON data.
     """
-    return self.model_validate_json(json_str)
+    return cls.model_validate_json(json_str)
```

```diff
 @classmethod
-def from_pickle(self, pickled_data: bytes):
+def from_pickle(cls, pickled_data: bytes):
     """
     Deserialize a DataPoint instance from a pickled byte stream.

     The method converts the byte stream back into a DataPoint instance by loading the data and validating it through the model's constructor.

     Parameters:
     -----------
         - pickled_data (bytes): The bytes representation of a pickled DataPoint instance to be deserialized.

     Returns:
     --------
         A new DataPoint instance created from the pickled data.
     """
     data = pickle.loads(pickled_data)
-    return self(**data)
+    return cls(**data)
```

Also applies to: 195-195
🧰 Tools
🪛 Pylint (3.3.7)
[convention] 159-159: Class method from_json should have 'cls' as first argument
(C0202)
🤖 Prompt for AI Agents
In cognee/infrastructure/engine/models/DataPoint.py at lines 159 and 195, the class methods currently use 'self' as the first parameter, which is incorrect by Python convention. Change the first parameter of these class methods from 'self' to 'cls' to properly indicate they are class methods and clarify the code intent.
cognee/infrastructure/llm/tokenizer/Gemini/adapter.py (1)
37-77: 🛠️ Refactor suggestion
Comprehensive method documentation with a parameter naming concern.
The docstrings are well-written and clearly explain method behavior and exceptions. However, there's a parameter naming inconsistency that should be addressed.
The static analysis correctly identifies a parameter naming mismatch. Apply this diff to align with the interface:
```diff
-    def decode_single_token(self, encoding: int):
+    def decode_single_token(self, token: int):
         """
         Raise NotImplementedError when called, as Gemini tokenizer does not support decoding of tokens.

         Parameters:
         -----------
-            - encoding (int): The token encoding to decode.
+            - token (int): The token to decode.
         """
         # Gemini tokenizer doesn't have the option to decode tokens
         raise NotImplementedError
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```suggestion
    def decode_single_token(self, token: int):
        """
        Raise NotImplementedError when called, as Gemini tokenizer does not support decoding of tokens.

        Parameters:
        -----------
            - token (int): The token to decode.
        """
        # Gemini tokenizer doesn't have the option to decode tokens
        raise NotImplementedError
```

🧰 Tools
🪛 Pylint (3.3.7)
[warning] 48-48: Parameter 'token' has been renamed to 'encoding' in overriding 'GeminiTokenizer.decode_single_token' method
(W0237)
🤖 Prompt for AI Agents
In cognee/infrastructure/llm/tokenizer/Gemini/adapter.py between lines 37 and 77, the parameter name in the docstring for the method decode_single_token is inconsistent with the method signature; the docstring uses "encoding" while the method signature uses "encoding: int". Update the docstring parameter name to exactly match the method signature parameter name to maintain consistency and clarity.
cognee/tasks/ingestion/migrate_relational_database.py (1)
107-107:
⚠️ Potential issue

Fix string comparison operator.
Using `is` for string comparison is incorrect and could lead to unexpected behavior. Use `==` for value comparison.

```diff
-        if key is primary_key_col or key in foreign_keys:
+        if key == primary_key_col or key in foreign_keys:
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```suggestion
        if key == primary_key_col or key in foreign_keys:
```

🤖 Prompt for AI Agents
In cognee/tasks/ingestion/migrate_relational_database.py at line 107, replace the use of the `is` operator for comparing the variable `key` to `primary_key_col` with the `==` operator to correctly compare their string values and avoid unexpected behavior.
cognee/infrastructure/data/chunking/DefaultChunkEngine.py (3)
145-161: 🛠️ Refactor suggestion
Fix docstring formatting and parameter documentation.
The docstring has formatting issues and an inconsistent parameter description.
Apply this diff to improve the docstring:
```diff
-        """
-        Chunk data based on paragraphs while considering overlaps and boundaries.
-
-        Parameters:
-        -----------
-
-        - data_chunks: The chunks of data to be processed into paragraphs.
-        - chunk_size: The defined size for each chunk to be created.
-        - chunk_overlap: The number of overlapping characters between chunks.
-        - bound: A weighting factor to determine splitting within a chunk (default is 0.75).
-          (default 0.75)
-
-        Returns:
-        --------
-
-        Returns the paragraph chunks and their numbered indices.
-        """
+        """
+        Chunk data based on paragraphs while considering overlaps and boundaries.
+
+        Parameters:
+        -----------
+        data_chunks : list[str]
+            The chunks of data to be processed into paragraphs.
+        chunk_size : int
+            The defined size for each chunk to be created.
+        chunk_overlap : int
+            The number of overlapping characters between chunks.
+        bound : float, optional
+            A weighting factor to determine splitting within a chunk (default: 0.75).
+
+        Returns:
+        --------
+        tuple[list[str], list[list]]
+            A tuple containing the paragraph chunks and their numbered indices.
+        """
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```suggestion
        """
        Chunk data based on paragraphs while considering overlaps and boundaries.

        Parameters:
        -----------
        data_chunks : list[str]
            The chunks of data to be processed into paragraphs.
        chunk_size : int
            The defined size for each chunk to be created.
        chunk_overlap : int
            The number of overlapping characters between chunks.
        bound : float, optional
            A weighting factor to determine splitting within a chunk (default: 0.75).

        Returns:
        --------
        tuple[list[str], list[list]]
            A tuple containing the paragraph chunks and their numbered indices.
        """
```

🤖 Prompt for AI Agents
In cognee/infrastructure/data/chunking/DefaultChunkEngine.py around lines 145 to 161, the docstring formatting and parameter descriptions are inconsistent and unclear. Fix the docstring by properly aligning the parameter list, using consistent indentation and bullet points, and clearly describing each parameter with its type and purpose. Also, ensure the return section is formatted correctly to clearly state what the method returns.
80-94: 🛠️ Refactor suggestion
Improve docstring formatting and parameter types.
The docstring formatting needs improvement for consistency and clarity.
Apply this diff to improve the docstring:
```diff
-        """
-        Chunk data exactly by specified sizes and overlaps.
-
-        Parameters:
-        -----------
-
-        - data_chunks: The chunks of data to be processed into exact sizes.
-        - chunk_size: The defined size for each chunk to be created.
-        - chunk_overlap: The number of overlapping characters between chunks.
-
-        Returns:
-        --------
-
-        Returns the created chunks and their numbered indices.
-        """
+        """
+        Chunk data exactly by specified sizes and overlaps.
+
+        Parameters:
+        -----------
+        data_chunks : list[str]
+            The chunks of data to be processed into exact sizes.
+        chunk_size : int
+            The defined size for each chunk to be created.
+        chunk_overlap : int
+            The number of overlapping characters between chunks.
+
+        Returns:
+        --------
+        tuple[list[str], list[list]]
+            A tuple containing the created chunks and their numbered indices.
+        """
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```suggestion
        """
        Chunk data exactly by specified sizes and overlaps.

        Parameters:
        -----------
        data_chunks : list[str]
            The chunks of data to be processed into exact sizes.
        chunk_size : int
            The defined size for each chunk to be created.
        chunk_overlap : int
            The number of overlapping characters between chunks.

        Returns:
        --------
        tuple[list[str], list[list]]
            A tuple containing the created chunks and their numbered indices.
        """
```

🤖 Prompt for AI Agents
In cognee/infrastructure/data/chunking/DefaultChunkEngine.py around lines 80 to 94, the docstring formatting is inconsistent and lacks parameter type annotations. Update the docstring to follow standard conventions by clearly listing each parameter with its expected type and description, using consistent indentation and formatting. Also, specify the return type and description in a clear, structured manner to improve readability and maintainability.
106-120: 🛠️ Refactor suggestion
Improve docstring formatting for consistency.
The docstring formatting should be consistent with standard Python conventions.
Apply this diff to improve the docstring:
```diff
-        """
-        Chunk data into sentences based on specified sizes and overlaps.
-
-        Parameters:
-        -----------
-
-        - data_chunks: The chunks of data to be processed into sentences.
-        - chunk_size: The defined size for each chunk to be created.
-        - chunk_overlap: The number of overlapping characters between chunks.
-
-        Returns:
-        --------
-
-        Returns the resulting sentence chunks and their numbered indices.
-        """
+        """
+        Chunk data into sentences based on specified sizes and overlaps.
+
+        Parameters:
+        -----------
+        data_chunks : list[str]
+            The chunks of data to be processed into sentences.
+        chunk_size : int
+            The defined size for each chunk to be created.
+        chunk_overlap : int
+            The number of overlapping characters between chunks.
+
+        Returns:
+        --------
+        tuple[list[str], list[list]]
+            A tuple containing the resulting sentence chunks and their numbered indices.
+        """
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```suggestion
        """
        Chunk data into sentences based on specified sizes and overlaps.

        Parameters:
        -----------
        data_chunks : list[str]
            The chunks of data to be processed into sentences.
        chunk_size : int
            The defined size for each chunk to be created.
        chunk_overlap : int
            The number of overlapping characters between chunks.

        Returns:
        --------
        tuple[list[str], list[list]]
            A tuple containing the resulting sentence chunks and their numbered indices.
        """
```

🤖 Prompt for AI Agents
In cognee/infrastructure/data/chunking/DefaultChunkEngine.py around lines 106 to 120, the docstring formatting is inconsistent with standard Python conventions. Reformat the docstring to use proper indentation and section headers with consistent style, such as using triple quotes, a short summary line, followed by sections like Parameters and Returns with clear indentation and colon usage. Ensure parameter descriptions are aligned and concise, and the overall docstring is clean and readable.
cognee/eval_framework/eval_config.py (2)
17-17: 🛠️ Refactor suggestion
Break long line to improve readability.
The line exceeds the 100-character limit and should be broken for better readability.
Apply this diff to fix the line length:
```diff
-    qa_engine: str = "cognee_completion"  # Options: 'cognee_completion' or 'cognee_graph_completion' or 'cognee_graph_completion_cot' or 'cognee_graph_completion_context_extension'
+    qa_engine: str = "cognee_completion"  # Options: 'cognee_completion', 'cognee_graph_completion',
+    # 'cognee_graph_completion_cot', 'cognee_graph_completion_context_extension'
```

🧰 Tools
🪛 Pylint (3.3.7)
[convention] 17-17: Line too long (181/100)
(C0301)
🤖 Prompt for AI Agents
In cognee/eval_framework/eval_config.py at line 17, the line defining qa_engine is too long and exceeds the 100-character limit. Break this line into multiple lines by splitting the comment string into smaller parts or using implicit string concatenation to improve readability while keeping the same content.
3-3: 🛠️ Refactor suggestion
Fix import ordering per PEP 8 standards.
The typing import should come before third-party imports according to PEP 8.
Apply this diff to fix the import ordering:
```diff
-from functools import lru_cache
-from pydantic_settings import BaseSettings, SettingsConfigDict
-from typing import List, Optional
+from functools import lru_cache
+from typing import List, Optional
+
+from pydantic_settings import BaseSettings, SettingsConfigDict
```

🧰 Tools
🪛 Pylint (3.3.7)
[convention] 3-3: standard import "typing.List" should be placed before third party import "pydantic_settings.BaseSettings"
(C0411)
🤖 Prompt for AI Agents
In cognee/eval_framework/eval_config.py at line 3, the import statement for typing should be reordered to come before any third-party imports to comply with PEP 8 standards. Adjust the import statements so that standard library imports, including typing, appear first, followed by third-party imports, and then local imports.
cognee/infrastructure/llm/tokenizer/HuggingFace/adapter.py (1)
67-78: 🛠️ Refactor suggestion
Fix parameter name inconsistency with interface.
The parameter name should match the interface definition. The `TokenizerInterface` defines this parameter as `token`, but this implementation uses `encoding`.

```diff
-    def decode_single_token(self, encoding: int):
+    def decode_single_token(self, token: int):
         """
-        Attempt to decode a single token from its encoding, which is not implemented in this
+        Attempt to decode a single token from its token ID, which is not implemented in this
         tokenizer.

         Parameters:
         -----------

-            - encoding (int): The integer encoding of the token to decode.
+            - token (int): The integer token ID to decode.
         """
         # HuggingFace tokenizer doesn't have the option to decode tokens
         raise NotImplementedError
```

🧰 Tools
🪛 Pylint (3.3.7)
[warning] 67-67: Parameter 'token' has been renamed to 'encoding' in overriding 'HuggingFaceTokenizer.decode_single_token' method
(W0237)
🤖 Prompt for AI Agents
In cognee/infrastructure/llm/tokenizer/HuggingFace/adapter.py between lines 67 and 78, rename the parameter of the decode_single_token method from 'encoding' to 'token' to match the TokenizerInterface definition. This ensures consistency with the interface and avoids confusion or errors when overriding the method.
examples/database_examples/weaviate_example.py (1)
21-22: 🛠️ Refactor suggestion
Add error handling for missing environment variables.
The script reads credentials from environment variables but doesn't handle the case where they might be missing, which could lead to `None` values being passed to the configuration.

```diff
 # Set up Weaviate credentials in .env file and get the values from environment variables
 weaviate_url = os.getenv("VECTOR_DB_URL")
 weaviate_key = os.getenv("VECTOR_DB_KEY")
+
+if not weaviate_url or not weaviate_key:
+    raise ValueError("VECTOR_DB_URL and VECTOR_DB_KEY environment variables must be set")
```

🤖 Prompt for AI Agents
In examples/database_examples/weaviate_example.py around lines 21 to 22, the environment variables VECTOR_DB_URL and VECTOR_DB_KEY are read without checking if they are set, which may cause None values to be used later. Add error handling to verify these variables are not None after reading them; if either is missing, raise a clear exception or exit with an informative error message to prevent misconfiguration.
cognee/api/v1/responses/default_tools.py (1)
16-22: 💡 Verification agent
🧩 Analysis chain
Verify search type enum completeness.
The search type enum in the tool definition may be missing some values that exist in the actual SearchType enum. Based on the relevant code snippets, there are additional search types like `CHUNKS`, `GRAPH_COMPLETION_COT`, and `GRAPH_COMPLETION_CONTEXT_EXTENSION`.

🏁 Script executed:

```bash
#!/bin/bash
# Description: Check for all SearchType enum values to ensure completeness
# Expected: Find all enum values defined in SearchType

ast-grep --pattern 'class SearchType($$$): $$$'
```

Length of output: 1035

Update search type enum to match `SearchType`

The `enum` in cognee/api/v1/responses/default_tools.py is currently incomplete and includes a non-existent value (`"SEMANTIC"`). It needs to mirror the actual `SearchType` members:

• SUMMARIES
• INSIGHTS
• CHUNKS
• RAG_COMPLETION
• GRAPH_COMPLETION
• GRAPH_SUMMARY_COMPLETION
• CODE
• CYPHER
• NATURAL_LANGUAGE
• GRAPH_COMPLETION_COT
• GRAPH_COMPLETION_CONTEXT_EXTENSION

File: cognee/api/v1/responses/default_tools.py (lines 16–22)
Suggested diff:
- "enum": [ - "INSIGHTS", - "CODE", - "GRAPH_COMPLETION", - "SEMANTIC", - "NATURAL_LANGUAGE", - ], + "enum": [ + "SUMMARIES", + "INSIGHTS", + "CHUNKS", + "RAG_COMPLETION", + "GRAPH_COMPLETION", + "GRAPH_SUMMARY_COMPLETION", + "CODE", + "CYPHER", + "NATURAL_LANGUAGE", + "GRAPH_COMPLETION_COT", + "GRAPH_COMPLETION_CONTEXT_EXTENSION", + ],📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements."enum": [ "SUMMARIES", "INSIGHTS", "CHUNKS", "RAG_COMPLETION", "GRAPH_COMPLETION", "GRAPH_SUMMARY_COMPLETION", "CODE", "CYPHER", "NATURAL_LANGUAGE", "GRAPH_COMPLETION_COT", "GRAPH_COMPLETION_CONTEXT_EXTENSION", ],🤖 Prompt for AI Agents
In cognee/api/v1/responses/default_tools.py around lines 16 to 22, the enum list for search types is incomplete and contains an invalid value "SEMANTIC". Replace the current enum list with the full set of valid SearchType members: SUMMARIES, INSIGHTS, CHUNKS, RAG_COMPLETION, GRAPH_COMPLETION, GRAPH_SUMMARY_COMPLETION, CODE, CYPHER, NATURAL_LANGUAGE, GRAPH_COMPLETION_COT, and GRAPH_COMPLETION_CONTEXT_EXTENSION to ensure it matches the actual SearchType enum.
cognee/tasks/repo_processor/get_non_code_files.py (1)
5-22:
⚠️ Potential issue
Fix inconsistency between docstring and return value

The docstring states the function returns "a list of file paths" but the function returns an empty dictionary `{}` when the path doesn't exist (line 24).

```diff
     if not os.path.exists(repo_path):
-        return {}
+        return []
```

🤖 Prompt for AI Agents
In cognee/tasks/repo_processor/get_non_code_files.py around lines 5 to 22, the docstring states the function returns a list of file paths, but the function returns an empty dictionary {} when the repo_path does not exist. To fix this inconsistency, change the return value on line 24 to an empty list [] so it matches the documented return type.
cognee/infrastructure/databases/vector/create_vector_engine.py (1)
42-50:
⚠️ Potential issue
Fix critical typo in parameter name

There's a typo on line 46 that will cause runtime errors: `utl` should be `url`.

Apply this fix:

```diff
-        return adapter(
-            utl=vector_db_url,
-            api_key=vector_db_key,
-            embedding_engine=embedding_engine,
-        )
+        return adapter(
+            url=vector_db_url,
+            api_key=vector_db_key,
+            embedding_engine=embedding_engine,
+        )
```

🤖 Prompt for AI Agents
In cognee/infrastructure/databases/vector/create_vector_engine.py between lines 42 and 50, the parameter name 'utl' used when calling the adapter should be corrected to 'url' to match the expected argument name and avoid runtime errors. Update the parameter name from 'utl' to 'url' in the adapter call.
cognee/infrastructure/databases/graph/neo4j_driver/adapter.py (2)
778-778:
⚠️ Potential issue
Fix dangerous default mutable argument

Using `dict()` as a default argument is dangerous as it creates a shared mutable object across function calls.

Apply this fix:

```diff
-    def serialize_properties(self, properties=dict()):
+    def serialize_properties(self, properties=None):
```

Then initialize within the function:

```diff
+        if properties is None:
+            properties = {}
         serialized_properties = {}
```

🧰 Tools
🪛 Ruff (0.11.9)
778-778: Do not use mutable data structures for argument defaults
Replace with `None`; initialize within function (B006)
🪛 Pylint (3.3.7)
[warning] 778-778: Dangerous default value dict() (builtins.dict) as argument
(W0102)
[refactor] 778-778: Consider using '{}' instead of a call to 'dict'.
(R1735)
🤖 Prompt for AI Agents
In cognee/infrastructure/databases/graph/neo4j_driver/adapter.py at line 778, the method serialize_properties uses a mutable default argument dict(), which can lead to unexpected behavior due to shared state across calls. Change the default argument to None and inside the function, initialize properties to an empty dictionary if it is None. This prevents shared mutable defaults and ensures each call gets a fresh dictionary.
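For reviewers unfamiliar with the pitfall, a minimal standalone demonstration (hypothetical names, not taken from the PR) of why a mutable default leaks state between calls:

```python
def append_item(item, bucket=[]):  # anti-pattern: one list object is shared by every call
    bucket.append(item)
    return bucket


print(append_item("a"))  # ['a']
print(append_item("b"))  # ['a', 'b'] - the "empty" default already holds 'a'


def append_item_safe(item, bucket=None):  # recommended pattern
    if bucket is None:
        bucket = []  # a fresh list per call
    bucket.append(item)
    return bucket


print(append_item_safe("a"))  # ['a']
print(append_item_safe("b"))  # ['b']
```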
620-620: 💡 Verification agent
🧩 Analysis chain
Verify method name - potential missing method
Static analysis indicates that the `get_neighbours` method doesn't exist. This should likely be `get_neighbors` (American spelling).

🏁 Script executed:

```bash
#!/bin/bash
# Search for get_neighbours method definition in the Neo4j adapter
ast-grep --pattern 'def get_neighbours($_) { $$$ }'
```

Length of output: 55

🏁 Script executed:

```bash
#!/bin/bash
# Search for a get_neighbors definition, including async variants, across the repo
rg -En "^(async )?def get_neighbors"
```

Length of output: 77

🏁 Script executed:

```bash
#!/bin/bash
# Search for both British and American spellings of the method definition in Python files
rg --color never -n "def get_neighbours" -g "*.py"
rg --color never -n "def get_neighbors" -g "*.py"
```

Length of output: 902

Fix method name in Neo4j adapter

The call at line 620 refers to a non-existent `get_neighbours`. The Neo4j adapter defines `async def get_neighbors(...)`, so update the spelling to match:

• File: cognee/infrastructure/databases/graph/neo4j_driver/adapter.py
• Line: 620

```diff
-        return await self.get_neighbours(node_id)
+        return await self.get_neighbors(node_id)
```

🧰 Tools
🪛 Pylint (3.3.7)
[error] 620-620: Instance of 'Neo4jAdapter' has no 'get_neighbours' member; maybe 'get_neighbors'?
(E1101)
🤖 Prompt for AI Agents
In cognee/infrastructure/databases/graph/neo4j_driver/adapter.py at line 620, the method call uses the British spelling get_neighbours, but the actual method defined is get_neighbors with American spelling. Change the method call from get_neighbours to get_neighbors to match the existing method definition and fix the error.
cognee/modules/graph/utils/expand_with_nodes_and_edges.py (2)
18-19:
⚠️ Potential issue
Type hint is wrong – should be `Optional[OntologyResolver]`

`ontology_resolver` can be `None`, yet the annotation is the non-nullable `OntologyResolver`. This trips static analysis (e.g. mypy) and misleads callers.

```diff
-    ontology_resolver: OntologyResolver = None,
+    ontology_resolver: Optional[OntologyResolver] = None,
```

🤖 Prompt for AI Agents
In cognee/modules/graph/utils/expand_with_nodes_and_edges.py around lines 18 to 19, the type hint for the parameter ontology_resolver is incorrectly set as OntologyResolver, but since it can be None, it should be changed to Optional[OntologyResolver]. Update the annotation to Optional[OntologyResolver] to correctly reflect that the parameter can be None and satisfy static type checkers like mypy.
24-26: 🛠️ Refactor suggestion
Potential performance hit from repeatedly instantiating OntologyResolver
Creating a fresh `OntologyResolver()` on every call can be costly because it loads the ontology file. Two lightweight fixes are possible:

```diff
-    if ontology_resolver is None:
-        ontology_resolver = OntologyResolver()
+    if ontology_resolver is None:
+        # Cache a singleton to avoid reloading ontology on every call
+        ontology_resolver = getattr(expand_with_nodes_and_edges, "_cached_resolver", None) \
+            or OntologyResolver()
+        expand_with_nodes_and_edges._cached_resolver = ontology_resolver
```

🤖 Prompt for AI Agents
In cognee/modules/graph/utils/expand_with_nodes_and_edges.py around lines 24 to 26, the code instantiates a new OntologyResolver every time the function is called, which can be expensive due to loading the ontology file repeatedly. To fix this, refactor the code to instantiate OntologyResolver once outside the function or use a singleton pattern so that the same instance is reused across calls, avoiding repeated costly initialization.
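An alternative to stashing the instance on the function object is a small cached factory; this is only a sketch, and it assumes `OntologyResolver()` takes no required constructor arguments (the import path below is illustrative, adjust it to wherever the class actually lives):

```python
from functools import lru_cache

# Illustrative import path; use the module that actually defines OntologyResolver.
from cognee.modules.ontology import OntologyResolver


@lru_cache(maxsize=1)
def get_default_ontology_resolver() -> OntologyResolver:
    # Constructed once per process, so the ontology file is loaded a single time.
    return OntologyResolver()


# Inside expand_with_nodes_and_edges:
# ontology_resolver = ontology_resolver or get_default_ontology_resolver()
```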
cognee/api/v1/responses/routers/get_responses_router.py (2)
54-56:
⚠️ Potential issue
Overwriting caller-supplied model breaks API contract

`model = "gpt-4o"` discards the `request.model` value. Clients cannot choose a model despite the parameter existing.

```diff
-    # TODO: Support other models (e.g. cognee-v1-openai-gpt-3.5-turbo, etc.)
-    model = "gpt-4o"
+    # Default to gpt-4o when caller supplies nothing
+    model = model or "gpt-4o"
```

🧰 Tools
🪛 Pylint (3.3.7)
[warning] 54-54: TODO: Support other models (e.g. cognee-v1-openai-gpt-3.5-turbo, etc.)
(W0511)
🤖 Prompt for AI Agents
In cognee/api/v1/responses/routers/get_responses_router.py around lines 54 to 56, the code overwrites the caller-supplied model by setting model = "gpt-4o", which ignores the request.model parameter and breaks the API contract. To fix this, remove the hardcoded assignment and use the model value provided by the client in request.model, ensuring the API respects the caller's choice of model.
43-49: 🛠️ Refactor suggestion
Mutable default & misleading signature
`tools: Optional[List[Dict[str, Any]]] = DEFAULT_TOOLS` has two issues:
- A list is mutable – accidental mutation leaks across requests.
- `DEFAULT_TOOLS` already fills in later when `request.tools` is falsy, so the default can be `None`.

```diff
-    tools: Optional[List[Dict[str, Any]]] = DEFAULT_TOOLS,
+    tools: Optional[List[Dict[str, Any]]] = None,
 ...
-        tools=tools,
+        tools=tools or DEFAULT_TOOLS,
```

Committable suggestion skipped: line range outside the PR's diff.
🧰 Tools
🪛 Pylint (3.3.7)
[warning] 43-43: Dangerous default value DEFAULT_TOOLS (builtins.list) as argument
(W0102)
🤖 Prompt for AI Agents
In cognee/api/v1/responses/routers/get_responses_router.py around lines 43 to 49, the function call_openai_api_for_model uses a mutable default argument for tools, which can cause state to leak between calls. Change the default value of tools to None and inside the function, set tools to DEFAULT_TOOLS if it is None or falsy. This avoids using a mutable default and aligns with the existing logic that fills in DEFAULT_TOOLS later.
examples/database_examples/neo4j_example.py (1)
20-33: 🛠️ Refactor suggestion
Consider adding error handling for missing environment variables.
The example retrieves Neo4j credentials from environment variables but doesn't validate they exist before use. This could lead to runtime errors if users haven't set their
`.env` file properly.

Consider adding validation like this:

```diff
 # Set up Neo4j credentials in .env file and get the values from environment variables
 neo4j_url = os.getenv("GRAPH_DATABASE_URL")
 neo4j_user = os.getenv("GRAPH_DATABASE_USERNAME")
 neo4j_pass = os.getenv("GRAPH_DATABASE_PASSWORD")

+# Validate required environment variables
+if not all([neo4j_url, neo4j_user, neo4j_pass]):
+    missing = [var for var, val in [
+        ("GRAPH_DATABASE_URL", neo4j_url),
+        ("GRAPH_DATABASE_USERNAME", neo4j_user),
+        ("GRAPH_DATABASE_PASSWORD", neo4j_pass)
+    ] if not val]
+    raise ValueError(f"Missing required environment variables: {missing}")
```

🤖 Prompt for AI Agents
In examples/database_examples/neo4j_example.py around lines 20 to 33, the code fetches Neo4j credentials from environment variables without checking if they are set, which can cause runtime errors. Add validation after retrieving each environment variable to check if any are None or empty, and if so, raise a clear exception or log an error indicating the missing configuration. This ensures the program fails fast with a helpful message if required environment variables are not properly set.
cognee/api/v1/responses/routers/default_tools.py (1)
1-37: 💡 Verification agent
🧩 Analysis chain
Verify search_type enum consistency with SearchType class.
The search tool defines a search_type enum, but it may not include all available search types from the SearchType class. Based on the relevant code snippets, SearchType includes additional values like
`SUMMARIES`, `RAG_COMPLETION`, `GRAPH_SUMMARY_COMPLETION`, `CYPHER`, `GRAPH_COMPLETION_COT`, and `GRAPH_COMPLETION_CONTEXT_EXTENSION`.

🏁 Script executed:

```bash
#!/bin/bash
# Description: Check if the search_type enum in default_tools.py matches the SearchType class
# Expected: All SearchType enum values should be available in the tool definition

# Find the SearchType enum definition
echo "SearchType enum values:"
ast-grep --pattern 'class SearchType($$$) { $$$ }'

echo -e "\nSearchType enum members:"
rg -A 20 "class SearchType" | grep -E "^\s*[A-Z_]+ = "
```

Length of output: 233

🏁 Script executed:

```bash
#!/bin/bash
# Search for SearchType class definition and list its members
echo "Searching for class SearchType definitions..."
rg -n "^class SearchType" -A20 --color=never

# If a file is found, extract its enum members
files=$(rg -l "^class SearchType")
if [ -n "$files" ]; then
  for file in $files; do
    echo -e "\nFound SearchType in: $file"
    echo "Enum members in $file:"
    rg -nE "^\s*[A-Z_]+ = " "$file" --color=never
  done
else
  echo "No SearchType class definition found."
fi
```

Length of output: 4033

Align `search_type` enum with the `SearchType` definition

The `search` tool in cognee/api/v1/responses/routers/default_tools.py currently defines:

```python
"enum": [
    "INSIGHTS",
    "CODE",
    "GRAPH_COMPLETION",
    "SEMANTIC",
    "NATURAL_LANGUAGE"
]
```

However, the `SearchType` enum in cognee/modules/search/types/SearchType.py declares these members:
- SUMMARIES
- INSIGHTS
- CHUNKS
- RAG_COMPLETION
- GRAPH_COMPLETION
- GRAPH_SUMMARY_COMPLETION
- CODE
- CYPHER
- NATURAL_LANGUAGE
- GRAPH_COMPLETION_COT
- GRAPH_COMPLETION_CONTEXT_EXTENSION
And does not include `SEMANTIC`.

Please update the schema in
• cognee/api/v1/responses/routers/default_tools.py (lines ~6–15 under `search_type`) to:

- Remove `SEMANTIC`
- Add all valid `SearchType` members (or generate this list programmatically)
- Confirm any omitted types (e.g., CYPHER) are intentionally unsupported
🤖 Prompt for AI Agents
In cognee/api/v1/responses/routers/default_tools.py around lines 6 to 15, the search_type enum list includes "SEMANTIC" which is not part of the SearchType enum and is missing several valid SearchType members like SUMMARIES, CHUNKS, RAG_COMPLETION, GRAPH_SUMMARY_COMPLETION, CYPHER, GRAPH_COMPLETION_COT, and GRAPH_COMPLETION_CONTEXT_EXTENSION. Update the enum to remove "SEMANTIC" and add all valid SearchType members from cognee/modules/search/types/SearchType.py, ensuring the list matches exactly. Verify if any types like CYPHER should be excluded intentionally and document that if so.
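Following the "generate this list programmatically" suggestion above, one way to keep the JSON schema from drifting again is to derive the values from the enum itself; a sketch, assuming `SearchType` is a standard `enum.Enum` at the path reported by the search (the exact import may differ):

```python
# Illustrative import; point it at the module that actually defines SearchType.
from cognee.modules.search.types.SearchType import SearchType

# Derived once at import time, so new enum members are picked up automatically.
SEARCH_TYPE_VALUES = [member.name for member in SearchType]

SEARCH_TYPE_SCHEMA = {
    "type": "string",
    "enum": SEARCH_TYPE_VALUES,
}
```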
cognee/infrastructure/databases/exceptions/exceptions.py (1)
69-86:
⚠️ Potential issue
Fix inheritance pattern to maintain consistent exception behavior.

The new `NodesetFilterNotSupportedError` class doesn't call `super().__init__()`, which breaks the inheritance chain and loses automatic logging functionality from the `CogneeApiError` base class. This creates inconsistency with other exception classes.

Apply this diff to fix the inheritance pattern:

```diff
     def __init__(
         self,
         message: str = "The nodeset filter is not supported in the current graph database.",
         name: str = "NodeSetFilterNotSupportedError",
         status_code=status.HTTP_404_NOT_FOUND,
     ):
-        self.message = message
-        self.name = name
-        self.status_code = status_code
+        super().__init__(message, name, status_code)
```

Additionally, consider addressing the same issue in `EntityNotFoundError` for consistency across the exception hierarchy.

🧰 Tools
🪛 Pylint (3.3.7)
[warning] 78-78: init method from base class 'CogneeApiError' is not called
(W0231)
🤖 Prompt for AI Agents
In cognee/infrastructure/databases/exceptions/exceptions.py around lines 69 to 86, the NodesetFilterNotSupportedError class does not call super().__init__(), breaking the inheritance chain and losing automatic logging from CogneeApiError. Fix this by calling super().__init__() with the appropriate parameters inside the __init__ method to maintain consistent exception behavior. Also, review and apply the same fix to the EntityNotFoundError class to ensure consistency across exception classes.
cognee/eval_framework/modal_eval_dashboard.py (1)
67-69: 🛠️ Refactor suggestion
Add error handling for missing metrics keys.
The code assumes that all JSON files have the expected structure with "metrics", "EM", "f1", and "correctness" keys. Consider adding error handling to gracefully handle malformed JSON files.
```diff
-        total_em = sum(q["metrics"]["EM"]["score"] for q in items)
-        total_f1 = sum(q["metrics"]["f1"]["score"] for q in items)
-        total_corr = sum(q["metrics"]["correctness"]["score"] for q in items)
+        try:
+            total_em = sum(q["metrics"]["EM"]["score"] for q in items)
+            total_f1 = sum(q["metrics"]["f1"]["score"] for q in items)
+            total_corr = sum(q["metrics"]["correctness"]["score"] for q in items)
+        except KeyError as e:
+            st.error(f"Malformed JSON file {filename}: missing key {e}")
+            continue
```

🤖 Prompt for AI Agents
In cognee/eval_framework/modal_eval_dashboard.py around lines 67 to 69, the code sums metric scores assuming all keys exist, which can cause errors if keys are missing. Add error handling by checking for the presence of "metrics", "EM", "f1", and "correctness" keys before accessing their "score" values, and handle missing keys gracefully, for example by skipping those entries or using default values.
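A gentler alternative to the try/except shown above, if partially-filled files should still contribute whatever scores they do have, is to fall back to a default per entry (a sketch, not part of the suggested diff):

```python
def metric_score(entry: dict, metric: str, default: float = 0.0) -> float:
    # Missing "metrics" or a missing metric name simply counts as `default`.
    return entry.get("metrics", {}).get(metric, {}).get("score", default)


total_em = sum(metric_score(q, "EM") for q in items)
total_f1 = sum(metric_score(q, "f1") for q in items)
total_corr = sum(metric_score(q, "correctness") for q in items)
```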
cognee/eval_framework/modal_run_eval.py (1)
84-96:
⚠️ Potential issue
`html_output` may be undefined when `dashboard=False`

`html_output` is assigned only inside the `if eval_params.get("dashboard")` block but is unconditionally written afterwards, leading to `UnboundLocalError`.

```diff
-    if eval_params.get("dashboard"):
-        logger.info("Generating dashboard...")
-        html_output = create_dashboard(
-            metrics_path=eval_params["metrics_path"],
-            aggregate_metrics_path=eval_params["aggregate_metrics_path"],
-            output_file=eval_params["dashboard_path"],
-            benchmark=eval_params["benchmark"],
-        )
-
-    with open("/data/" + html_filename, "w") as f:
-        f.write(html_output)
-    vol.commit()
+    if eval_params.get("dashboard"):
+        logger.info("Generating dashboard...")
+        html_output = create_dashboard(
+            metrics_path=eval_params["metrics_path"],
+            aggregate_metrics_path=eval_params["aggregate_metrics_path"],
+            output_file=eval_params["dashboard_path"],
+            benchmark=eval_params["benchmark"],
+        )
+        with open("/data/" + html_filename, "w", encoding="utf-8") as f:
+            f.write(html_output)
+        vol.commit()
```

Also specify `encoding="utf-8"` for portability.

🧰 Tools
🪛 Pylint (3.3.7)
[warning] 93-93: Using open without explicitly specifying an encoding
(W1514)
[error] 94-94: Possibly using variable 'html_output' before assignment
(E0606)
🤖 Prompt for AI Agents
In cognee/eval_framework/modal_run_eval.py around lines 84 to 96, the variable html_output is assigned only inside the if eval_params.get("dashboard") block but is used unconditionally afterwards, which can cause an UnboundLocalError if dashboard is False. To fix this, initialize html_output to an empty string or None before the if block and only write to the file if html_output is defined. Also, add encoding="utf-8" to the open() call for writing the file to ensure portability across different systems.
cognee/modules/retrieval/graph_completion_context_extension_retriever.py (2)
3-3: 🛠️ Refactor suggestion
Remove unused imports to pass Ruff/Pylint and avoid dead code.
`get_llm_client`, `read_query_prompt`, and `render_prompt` are never referenced in this module. Keeping them will cause import-unused warnings and an unnecessary runtime dependency.

```diff
-from cognee.infrastructure.llm.get_llm_client import get_llm_client
-from cognee.infrastructure.llm.prompts import read_query_prompt, render_prompt
```

Also applies to: 6-6
🧰 Tools
🪛 Ruff (0.11.9)
3-3:
`cognee.infrastructure.llm.get_llm_client.get_llm_client` imported but unused

Remove unused import: `cognee.infrastructure.llm.get_llm_client.get_llm_client` (F401)
🪛 Pylint (3.3.7)
[warning] 3-3: Unused get_llm_client imported from cognee.infrastructure.llm.get_llm_client
(W0611)
🤖 Prompt for AI Agents
In cognee/modules/retrieval/graph_completion_context_extension_retriever.py at lines 3 and 6, remove the unused imports get_llm_client, read_query_prompt, and render_prompt since they are not referenced anywhere in the module. This will eliminate import-unused warnings from Ruff/Pylint and reduce unnecessary runtime dependencies.
91-94: 💡 Verification agent
🧩 Analysis chain
Deduplication assumes edges are hashable – verify or switch to UUIDs.
`set(triplets)` will raise `TypeError: unhashable type` if the `Edge`/triplet objects don't implement `__hash__`.

If that guarantee is not universal across graph adapters, replace with a tuple-based key:

```python
unique = {(e.node1.id, e.node2.id, e.attributes['relationship_type']): e for e in triplets}
triplets = list(unique.values())
```

🏁 Script executed:

```bash
#!/bin/bash
# Search for Edge and Triplet class definitions
rg -A3 -n "class Edge" .
rg -A3 -n "class Triplet" .

# Check for any custom __hash__ implementations
rg -n "def __hash__" .
```

Length of output: 1817

🏁 Script executed:

```bash
#!/bin/bash
# Show the implementation of get_triplets in the retriever
rg -n "def get_triplets" -A10 -B2 cognee/modules/retrieval/graph_completion_context_extension_retriever.py
```

Length of output: 108

🏁 Script executed:

```bash
#!/bin/bash
# Display imports and class declaration in the retriever
sed -n '1,200p' cognee/modules/retrieval/graph_completion_context_extension_retriever.py

# Search for get_triplets definitions across the repo
rg -n "def get_triplets" .

# Inspect CogneeGraphElements around the two __hash__ implementations
sed -n '70,120p' cognee/modules/graph/cognee_graph/CogneeGraphElements.py
echo "---"
sed -n '140,180p' cognee/modules/graph/cognee_graph/CogneeGraphElements.py
```

Length of output: 6650

🏁 Script executed:

```bash
#!/bin/bash
# Display the beginning of GraphCompletionRetriever implementation, including get_triplets
sed -n '1,200p' cognee/modules/retrieval/graph_completion_retriever.py
```

Length of output: 7043
Edge hashing ignores relationship types: update deduplication logic

Although `Edge` implements `__hash__` (so `set(triplets)` won't error), it only hashes on the two node objects and drops any edges between the same nodes, even if they carry different `relationship_type` attributes. To preserve distinct relationships, either:

• Extend `Edge.__hash__`/`__eq__` to include `attributes['relationship_type']`, e.g.:

```python
def __hash__(self) -> int:
    key = (self.node1, self.node2, self.attributes.get("relationship_type"))
    return hash(key)
```

or

• Replace the `set`-based dedup with an explicit key mapping in cognee/modules/retrieval/graph_completion_context_extension_retriever.py (around lines 91–94):

```diff
-        triplets = list(set(triplets))
+        unique = {
+            (e.node1.id, e.node2.id, e.attributes.get("relationship_type")): e
+            for e in triplets
+        }
+        triplets = list(unique.values())
```

This ensures you only collapse exact duplicates (same nodes + same relationship).
🤖 Prompt for AI Agents
In cognee/modules/retrieval/graph_completion_context_extension_retriever.py around lines 91 to 94, the current deduplication using set(triplets) only hashes edges by their nodes and ignores the relationship_type attribute, causing distinct edges with different relationships to be collapsed. To fix this, either update the Edge class's __hash__ and __eq__ methods to include the relationship_type attribute from attributes, or replace the set-based deduplication with a dictionary keyed by (node1.id, node2.id, attributes['relationship_type']) mapping to the edge, then convert the dictionary values back to a list to preserve distinct edges with different relationships.
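If the first option is preferred, `__hash__` and `__eq__` should change together so that equal edges keep hashing equally. A minimal, self-contained sketch with field names assumed from the snippets above (and assuming the node objects themselves are hashable):

```python
class Edge:
    def __init__(self, node1, node2, attributes=None):
        self.node1 = node1
        self.node2 = node2
        self.attributes = attributes or {}

    def __eq__(self, other) -> bool:
        if not isinstance(other, Edge):
            return NotImplemented
        return (
            self.node1 == other.node1
            and self.node2 == other.node2
            and self.attributes.get("relationship_type")
            == other.attributes.get("relationship_type")
        )

    def __hash__(self) -> int:
        # Must stay consistent with __eq__: equal edges hash to the same value.
        return hash((self.node1, self.node2, self.attributes.get("relationship_type")))
```

With this in place, `set(triplets)` keeps two edges between the same pair of nodes as long as their relationship types differ.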
cognee/modules/retrieval/graph_completion_cot_retriever.py (1)
105-110:
⚠️ Potential issue
`acreate_structured_output` returns a BaseModel, not a plain string.

`reasoning` is later interpolated into a prompt as if it were `str`. Unless the adapter unwraps to raw `.text`, this will fail with `TypeError: can only concatenate str (not "BaseModel")`.

Proposed fix (if the adapter returns a model with a `text` field):

```diff
-        reasoning = await llm_client.acreate_structured_output(
+        reasoning_model = await llm_client.acreate_structured_output(
             text_input=valid_user_prompt,
             system_prompt=valid_system_prompt,
             response_model=str,
         )
-        followup_args = {"query": query, "answer": answer, "reasoning": reasoning}
+        reasoning = getattr(reasoning_model, "text", str(reasoning_model))
+        followup_args = {"query": query, "answer": answer, "reasoning": reasoning}
```

🤖 Prompt for AI Agents
In cognee/modules/retrieval/graph_completion_cot_retriever.py around lines 105 to 110, the variable reasoning is assigned the result of acreate_structured_output which returns a BaseModel, not a string. Since reasoning is later used as a string in prompt interpolation, this causes a TypeError. To fix this, access the text attribute of the returned BaseModel (e.g., reasoning.text) to get the raw string before using it in string concatenation or interpolation.
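If the structured-output call is kept, another option is to request an explicit wrapper model and unwrap its field inside the async method; this is a hypothetical sketch that assumes the adapter accepts arbitrary Pydantic models as `response_model`:

```python
from pydantic import BaseModel


class ReasoningOutput(BaseModel):
    text: str


# Inside the async retriever method:
reasoning_output = await llm_client.acreate_structured_output(
    text_input=valid_user_prompt,
    system_prompt=valid_system_prompt,
    response_model=ReasoningOutput,
)
reasoning = reasoning_output.text  # plain str, safe to interpolate into the follow-up prompt
followup_args = {"query": query, "answer": answer, "reasoning": reasoning}
```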
cognee/tests/unit/modules/retrieval/graph_completion_retriever_context_extension_test.py (1)
181-183:
⚠️ Potential issue
Fix method names in test execution.

The method names in the main execution block are incorrect and will cause runtime errors.

```diff
-        await test.test_graph_completion_context_simple()
-        await test.test_graph_completion_context_complex()
-        await test.test_get_graph_completion_context_on_empty_graph()
+        await test.test_graph_completion_extension_context_simple()
+        await test.test_graph_completion_extension_context_complex()
+        await test.test_get_graph_completion_extension_context_on_empty_graph()
```

🧰 Tools
🪛 Pylint (3.3.7)
[error] 181-181: Instance of 'TestGraphCompletionRetriever' has no 'test_graph_completion_context_simple' member
(E1101)
[error] 182-182: Instance of 'TestGraphCompletionRetriever' has no 'test_graph_completion_context_complex' member
(E1101)
[error] 183-183: Instance of 'TestGraphCompletionRetriever' has no 'test_get_graph_completion_context_on_empty_graph' member
(E1101)
🤖 Prompt for AI Agents
In cognee/tests/unit/modules/retrieval/graph_completion_retriever_context_extension_test.py around lines 181 to 183, the method names called in the test execution block are incorrect and will cause runtime errors. Review the actual test method names defined in the test class and update these calls to match the correct method names exactly to ensure they run without errors.
cognee/infrastructure/databases/vector/chromadb/ChromaDBAdapter.py (1)
536-536:
⚠️ Potential issue
Missing method definition causes runtime error.

The `prune` method calls `self.list_collections()` but this method is not defined in the `ChromaDBAdapter` class, which will cause an `AttributeError` at runtime.

```diff
     async def prune(self):
         """
         Delete all collections in the ChromaDB database.

         Returns:
         --------

             Returns True upon successful deletion of all collections.
         """
         client = await self.get_connection()
-        collections = await self.list_collections()
+        collections = await self.get_collection_names()
         for collection_name in collections:
             await client.delete_collection(collection_name)

         return True
```

Committable suggestion skipped: line range outside the PR's diff.
🧰 Tools
🪛 Pylint (3.3.7)
[error] 536-536: Instance of 'ChromaDBAdapter' has no 'list_collections' member
(E1101)
🤖 Prompt for AI Agents
In cognee/infrastructure/databases/vector/chromadb/ChromaDBAdapter.py at line 536, the prune method calls self.list_collections(), but this method is not defined in the class. To fix this, define the list_collections method within the ChromaDBAdapter class that returns the list of collections as expected by prune. Ensure the method is asynchronous if prune awaits it, and implement the logic to retrieve collections from the underlying database or data structure.
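A minimal sketch of the missing helper, assuming the client object returned by `get_connection()` exposes an awaitable `list_collections()` whose entries carry a `.name` attribute (worth verifying against the installed chromadb version before adopting):

```python
async def get_collection_names(self) -> list[str]:
    """Return the names of all collections known to this ChromaDB instance."""
    client = await self.get_connection()
    collections = await client.list_collections()
    return [collection.name for collection in collections]
```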
cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py (1)
114-127:
⚠️ Potential issue
Fix the line length issue in the docstring.

The docstring content is comprehensive and well-structured, but line 115 exceeds the 100-character limit.

Apply this diff to fix the line length issue:

```diff
-    This class inherits from Base and is associated with a database table defined by
+    This class inherits from Base and is associated with a database table
+    defined by
```

🧰 Tools
🪛 Pylint (3.3.7)
[convention] 115-115: Line too long (104/100)
(C0301)
🤖 Prompt for AI Agents
In cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py around lines 114 to 127, the docstring line 115 exceeds the 100-character limit. Break this long line into multiple shorter lines without changing the meaning, ensuring each line stays within the 100-character limit to improve readability and comply with style guidelines.
cognee-mcp/src/server.py (1)
78-79:
⚠️ Potential issue
Parameter name mismatch will crash `cognee_add_developer_rules`

`cognee.add()` expects the keyword argument `node_set` (see cognee/api/v1/add/add.py). Passing `nodeset` will raise a `TypeError` the first time this tool runs.

```diff
-    await cognee.add(file_path, nodeset="developer_rules")
+    await cognee.add(file_path, node_set=["developer_rules"])
```

(The value should also be wrapped in a list according to the function signature.)

🤖 Prompt for AI Agents
In cognee-mcp/src/server.py around lines 78 to 79, the call to cognee.add uses the incorrect keyword argument 'nodeset' instead of 'node_set', which causes a TypeError. Change the argument name to 'node_set' and wrap the value "developer_rules" in a list as required by the function signature, so it should be node_set=["developer_rules"].
cognee/infrastructure/databases/hybrid/falkordb/FalkorDBAdapter.py (4)
749-758:
⚠️ Potential issue
Critical: `delete_nodes` implementation mismatch.

The docstring indicates batch deletion by collection and IDs, but the code calls `self.delete_data_points(data_point_ids)`, which is missing both `collection_name` and `await`.

Proposed fix:

```diff
-        self.delete_data_points(data_point_ids)
+        return await self.delete_data_points(collection_name, data_point_ids)
```

🤖 Prompt for AI Agents
In cognee/infrastructure/databases/hybrid/falkordb/FalkorDBAdapter.py around lines 749 to 758, the delete_nodes method's implementation does not match its docstring; it calls self.delete_data_points without passing collection_name and is missing the await keyword. Fix this by adding await before the call and passing collection_name as the first argument to delete_data_points along with data_point_ids.
245-255:
⚠️ Potential issue
Missing implementation in `create_collection`.

The method is documented but only contains a `pass` placeholder. This will lead to silent failures when creating collections.

Please implement the collection creation logic or explicitly raise `NotImplementedError` if unsupported.

🧰 Tools
🪛 Pylint (3.3.7)
[warning] 254-254: Unnecessary pass statement
(W0107)
🤖 Prompt for AI Agents
In cognee/infrastructure/databases/hybrid/falkordb/FalkorDBAdapter.py at lines 245 to 255, the create_collection method is currently a placeholder with only a pass statement, which will cause silent failures. You should either implement the logic to create a collection in the graph database using the appropriate database client calls or, if this operation is not supported, raise a NotImplementedError explicitly to indicate that the method is not yet implemented.
731-739:
⚠️ Potential issue
Critical: `delete_node` implementation mismatch.

The docstring expects to delete a node by collection and ID, but the implementation calls `return await self.delete_data_points([data_point_id])`, missing the required `collection_name` argument and the `await` on the method signature.

Apply this diff to fix the call:

```diff
-        return await self.delete_data_points([data_point_id])
+        return await self.delete_data_points(collection_name, [data_point_id])
```

🤖 Prompt for AI Agents
In cognee/infrastructure/databases/hybrid/falkordb/FalkorDBAdapter.py around lines 731 to 739, the delete_node method's implementation incorrectly calls delete_data_points without passing the required collection_name argument and lacks the await keyword. Fix this by adding the await keyword before the call and passing both collection_name and a list containing data_point_id to delete_data_points, matching the method signature and docstring.
369-376:
⚠️ Potential issue
Unimplemented `index_data_points` stub.

The docstring outlines intended behavior, but the body is a `pass`. This will break downstream indexing tasks.
🤖 Prompt for AI Agents
In cognee/infrastructure/databases/hybrid/falkordb/FalkorDBAdapter.py around lines 369 to 376, the method index_data_points is currently a stub with only a docstring and a pass statement, which will cause failures in downstream indexing tasks. You need to either implement the method to index the provided data_points into the specified index using the index_name and index_property_name parameters, ensuring it performs the intended indexing logic, or if this method is no longer needed, remove it entirely to avoid broken references.
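Until real support lands, a minimal interim sketch (hypothetical, with signatures inferred from the docstrings discussed above) is to fail loudly instead of silently passing:

```python
async def create_collection(self, collection_name: str):
    # Raising keeps callers from assuming the collection now exists.
    raise NotImplementedError(
        f"FalkorDBAdapter.create_collection({collection_name!r}) is not implemented yet."
    )


async def index_data_points(self, index_name: str, index_property_name: str, data_points: list):
    # Same reasoning: downstream indexing tasks should see an explicit failure.
    raise NotImplementedError("FalkorDBAdapter.index_data_points is not implemented yet.")
```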
cognee/infrastructure/databases/graph/kuzu/adapter.py (1)
3-3:
⚠️ Potential issue
Remove unused import.

The import `NodesetFilterNotSupportedError` is not used anywhere in this file and should be removed to keep the imports clean.

```diff
-from cognee.infrastructure.databases.exceptions.exceptions import NodesetFilterNotSupportedError
```

🧰 Tools
🪛 Ruff (0.11.9)
3-3:
`cognee.infrastructure.databases.exceptions.exceptions.NodesetFilterNotSupportedError` imported but unused

Remove unused import: `cognee.infrastructure.databases.exceptions.exceptions.NodesetFilterNotSupportedError` (F401)
🪛 Pylint (3.3.7)
[warning] 3-3: Unused NodesetFilterNotSupportedError imported from cognee.infrastructure.databases.exceptions.exceptions
(W0611)
🤖 Prompt for AI Agents
In cognee/infrastructure/databases/graph/kuzu/adapter.py at line 3, remove the import statement for NodesetFilterNotSupportedError since it is not used anywhere in the file, to keep the imports clean and avoid unnecessary code.
Unauthenticated Remote Debug Server Exposure via Environment Variables

```diff
@@ -36,11 +36,12 @@
# Modified Gunicorn startup with error handling
if [ "$ENVIRONMENT" = "dev" ] || [ "$ENVIRONMENT" = "local" ]; then
if [ "$DEBUG" = "true" ]; then
echo "Waiting for the debugger to attach..."
- exec python -m debugpy --wait-for-client --listen 0.0.0.0:5678 -m cognee
+ # Bind debugpy to 127.0.0.1 to prevent external connections
+ exec python -m debugpy --wait-for-client --listen 127.0.0.1:5678 -m cognee
else
exec cognee
fi
else
exec cognee
-fi
+fi
\ No newline at end of file
```
Explanation of Fix

Vulnerability Overview:

Fix:

Other impacts:

Note:
Arbitrary Python Code Execution via Unrestricted Module Loading

```diff
@@ -1,4 +1,29 @@
+` section below for the complete code.]
+
+**Key changes and rationale:**
+- Introduces `ALLOWED_GRAPH_MODEL_DIR` and `ALLOWED_GRAPH_MODEL_FILES` at the module-level.
+- In `load_class`, before attempting import, validates that:
+ - The requested file is in the allowlist directory or is an explicit allowlisted absolute path.
+ - The file has a `.py` extension.
+ - The basename of the file is in an allowlist (configurable).
+ - The resolved file is not a symlink and does not escape the allowed directory.
+- If these checks fail, raises `ValueError` and logs the failed attempt.
+- Keeps all existing logic and function signatures intact, ensuring compatibility with the rest of the codebase.
+- Minimal performance impact: static list lookups and path checks.
+- No new dependencies added.
+
+**Potential impacts:**
+- Custom schemas not present in the allowlist or allowed directory will not be loaded; users must request allowlist additions if necessary.
+- Prevents all arbitrary file loading for graph models.
+- Patch is backward compatible unless users were intentionally abusing the vulnerability or using highly dynamic/uncontrolled schema locations.
+- The import of `os` and constants at global scope is standard and should not surprise developers.
+
+**If users expect to load *any* arbitrary file, those workflows will break and will require whitelist additions. This is the expected and secure behavior.**
+
+</explanation>
+
+<patch>
import json
import os
import sys
import argparse
@@ -21,9 +46,23 @@
logger = get_logger()
log_file = get_log_file_location()
+# --------- GRAPH MODEL LOADING SANDBOX SETTINGS ---------
+# Only allow dynamic import of graph schema/model files from this directory,
+# and only if the filename is on the explicit allowlist below.
+ALLOWED_GRAPH_MODEL_DIR = os.path.abspath(
+ os.environ.get("COGNEE_GRAPH_MODEL_DIR", os.path.join(os.path.dirname(__file__), "graph_models"))
+)
+# Set the list of allowed base filenames (no path traversal allowed)
+ALLOWED_GRAPH_MODEL_FILES = {
+ "custom_knowledge_schema.py", # Example: add actual intended schema filenames here
+ "my_safe_graph_schema.py", # Example: you can adjust as appropriate
+ # "KnowledgeGraph.py", # Add more allowlisted files as needed
+}
+# Allowlist the built-in KnowledgeGraph by not using load_class if custom params not given
+
@mcp.tool()
async def cognee_add_developer_rules(
base_path: str = ".", graph_model_file: str = None, graph_model_name: str = None
) -> list:
@@ -420,13 +459,45 @@
return "\n".join(edge_strings)
def load_class(model_file, model_name):
+ model_file_orig = model_file
model_file = os.path.abspath(model_file)
+ base_name = os.path.basename(model_file)
+
+ # Must be in allowlist of schema/model files
+ if (
+ not base_name.endswith(".py")
+ or base_name not in ALLOWED_GRAPH_MODEL_FILES
+ or not os.path.isdir(ALLOWED_GRAPH_MODEL_DIR)
+ or not model_file.startswith(ALLOWED_GRAPH_MODEL_DIR)
+ or not os.path.isfile(model_file)
+ or os.path.islink(model_file)
+ ):
+ logger.error(f"Security error: Attempt to load disallowed graph model file: {model_file_orig}")
+ raise ValueError(
+ f"Graph model file {model_file_orig!r} is not allowed. "
+ f"It must be located in {ALLOWED_GRAPH_MODEL_DIR!r} with a filename in {ALLOWED_GRAPH_MODEL_FILES!r}."
+ )
+
+ # Prevent path traversal by checking model_file is directly inside allowed dir (no subdirs)
+ expected_path = os.path.join(ALLOWED_GRAPH_MODEL_DIR, base_name)
+ if os.path.realpath(model_file) != os.path.realpath(expected_path):
+ logger.error(f"Security error: Path traversal attempt in graph model file: {model_file_orig}")
+ raise ValueError(
+ f"Graph model file {model_file_orig!r} path does not match expected whitelisted location."
+ )
+
spec = importlib.util.spec_from_file_location("graph_model", model_file)
+ if spec is None or not hasattr(spec, "loader") or spec.loader is None:
+ logger.error(f"Failed to load module spec for: {model_file_orig}")
+ raise ImportError(f"Cannot load spec for {model_file}")
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)
+ if not hasattr(module, model_name):
+ logger.error(f"Model name {model_name!r} not found in module {model_file_orig!r}")
+ raise AttributeError(f"Model class {model_name!r} not found in {model_file_orig!r}")
model_class = getattr(module, model_name)
return model_class
@@ -457,5 +528,5 @@
try:
asyncio.run(main())
except Exception as e:
logger.error(f"Error initializing Cognee MCP server: {str(e)}")
- raise
+ raise
\ No newline at end of file
```
Explanation of Fix

Vulnerability: The

Fix: The patch implements a strict allowlist for graph schema files that can be loaded dynamically via the

Original Vulnerable Code (load_class only):

Patched Code (with secure allowlist):

Key changes and rationale:
Potential impacts:
If users expect to load any arbitrary file, those workflows will break and will require whitelist additions. This is the expected and secure behavior.
Arbitrary Python Code Execution via Dynamic Module Loading

```diff
@@ -21,9 +21,12 @@
logger = get_logger()
log_file = get_log_file_location()
+# Allowlist directory for safe model schema imports (must be inside source repo)
+ALLOWED_MODEL_DIR = os.path.abspath(os.path.join(os.path.dirname(__file__), "cognee", "modules", "schemas"))
+
@mcp.tool()
async def cognee_add_developer_rules(
base_path: str = ".", graph_model_file: str = None, graph_model_name: str = None
) -> list:
@@ -420,10 +423,30 @@
return "\n".join(edge_strings)
def load_class(model_file, model_name):
- model_file = os.path.abspath(model_file)
- spec = importlib.util.spec_from_file_location("graph_model", model_file)
+ model_file_abs = os.path.abspath(model_file)
+ allowed_dir_abs = os.path.abspath(ALLOWED_MODEL_DIR)
+
+ # Ensure model_file exists and is a file
+ if not os.path.isfile(model_file_abs):
+ raise FileNotFoundError(f"Model schema file not found: {model_file_abs}")
+
+ # Ensure model_file resolves inside the allowed directory
+ model_file_real = os.path.realpath(model_file_abs)
+ allowed_real = os.path.realpath(allowed_dir_abs)
+ try:
+ # commonpath raises ValueError if drives do not match (Windows)
+ if os.path.commonpath([model_file_real, allowed_real]) != allowed_real:
+ raise PermissionError(
+ f"Only files within {allowed_real} can be loaded as custom schema modules."
+ )
+ except ValueError:
+ raise PermissionError(
+ f"Only files within {allowed_real} can be loaded as custom schema modules."
+ )
+
+ spec = importlib.util.spec_from_file_location("graph_model", model_file_real)
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)
model_class = getattr(module, model_name)
@@ -457,5 +480,5 @@
try:
asyncio.run(main())
except Exception as e:
logger.error(f"Error initializing Cognee MCP server: {str(e)}")
- raise
+ raise
\ No newline at end of file
Explanation of Fix
Summary of Vulnerability and Fix: The vulnerability stems from load_class importing and executing an arbitrary Python file path, which allows arbitrary code execution (CWE-94).
Fix Overview: To address this CWE-94 vulnerability, the patch introduces a strict validation step in load_class: the file must exist, and its resolved real path must lie inside ALLOWED_MODEL_DIR (checked with os.path.commonpath on realpath-normalized paths); anything outside the allowlisted directory raises PermissionError.
Potential Impacts: Only schema files placed under the allowlisted directory can be loaded, so workflows that previously pointed at arbitrary locations must relocate their schema files.
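A minimal, self-contained sketch of the directory-containment check this second patch adds to load_class; ALLOWED_MODEL_DIR is stood in by a local constant and the helper name is hypothetical.

```python
import os

# Stand-in for the ALLOWED_MODEL_DIR constant defined by the patch.
ALLOWED_MODEL_DIR = os.path.abspath("./cognee/modules/schemas")


def ensure_inside_allowed_dir(model_file: str) -> str:
    """Resolve model_file and confirm it lives under ALLOWED_MODEL_DIR."""
    real_file = os.path.realpath(os.path.abspath(model_file))
    real_root = os.path.realpath(ALLOWED_MODEL_DIR)
    try:
        # commonpath raises ValueError for paths on different drives (Windows).
        contained = os.path.commonpath([real_file, real_root]) == real_root
    except ValueError:
        contained = False
    if not contained:
        raise PermissionError(
            f"Only files within {real_root} can be loaded as custom schema modules."
        )
    return real_file
```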
Arbitrary File Read via Unsanitized Path Traversal in Code Retriever @@ -1,16 +1,39 @@
from typing import Any, Optional, List
import asyncio
import aiofiles
from pydantic import BaseModel
+import os
from cognee.modules.retrieval.base_retriever import BaseRetriever
from cognee.infrastructure.databases.graph import get_graph_engine
from cognee.infrastructure.databases.vector import get_vector_engine
from cognee.infrastructure.llm.get_llm_client import get_llm_client
from cognee.infrastructure.llm.prompts import read_query_prompt
+# Set this to the root directory that code files should be accessible from
+# Adjust as appropriate for your environment
+ALLOWED_CODE_ROOT = os.environ.get("COGNEE_CODE_ROOT", "./codebase") # example default
+def is_safe_path(basedir: str, path: str) -> bool:
+ """
+ Ensure that the resolved absolute path is within the allowed base directory.
+ basedir: the parent directory considered "safe root"
+ path: the user/db-provided path
+ Returns True if safe, False otherwise.
+ """
+ # Handle relative paths safely
+ abs_basedir = os.path.abspath(basedir)
+ abs_path = os.path.abspath(os.path.join(abs_basedir, path))
+ try:
+ # Required for proper symlink/case normalization
+ abs_basedir = os.path.realpath(abs_basedir)
+ abs_path = os.path.realpath(abs_path)
+ except Exception:
+ return False
+ # Common path check (works on all OS, prevents traversal)
+ return os.path.commonpath([abs_path, abs_basedir]) == abs_basedir
+
class CodeRetriever(BaseRetriever):
"""Retriever for handling code-based searches."""
class CodeQueryInfo(BaseModel):
@@ -106,8 +129,10 @@
for similar_piece in similar_filenames + similar_codepieces
]
)
+ # ----------------- PATCH STARTS HERE ------------------
+
paths = set()
for sublist in relevant_triplets:
for tpl in sublist:
if isinstance(tpl, tuple) and len(tpl) >= 3:
@@ -115,15 +140,28 @@
paths.add(tpl[0]["file_path"])
if "file_path" in tpl[2]:
paths.add(tpl[2]["file_path"])
+ safe_paths = set()
+ for file_path in paths:
+ if file_path and isinstance(file_path, str):
+ # Only allow reading files within ALLOWED_CODE_ROOT
+ if is_safe_path(ALLOWED_CODE_ROOT, file_path):
+ safe_paths.add(file_path)
+ else:
+ # Optionally, log rejection for audit
+ print(f"Unsafe file path rejection: {file_path}")
+
retrieved_files = {}
read_tasks = []
- for file_path in paths:
+ for file_path in safe_paths:
+
async def read_file(fp):
try:
- async with aiofiles.open(fp, "r", encoding="utf-8") as f:
+ abs_basedir = os.path.abspath(ALLOWED_CODE_ROOT)
+ abs_fp = os.path.abspath(os.path.join(abs_basedir, fp))
+ async with aiofiles.open(abs_fp, "r", encoding="utf-8") as f:
retrieved_files[fp] = await f.read()
except Exception as e:
print(f"Error reading {fp}: {e}")
retrieved_files[fp] = ""
@@ -137,12 +175,14 @@
"name": file_path,
"description": file_path,
"content": retrieved_files[file_path],
}
- for file_path in paths
+ for file_path in safe_paths
]
+ # ----------------- PATCH ENDS HERE ------------------
+
async def get_completion(self, query: str, context: Optional[Any] = None) -> Any:
"""Returns the code files context."""
if context is None:
context = await self.get_context(query)
- return context
+ return context
\ No newline at end of file
Explanation of Fix
Vulnerability & Fix Explanation: The vulnerability stems from unsanitized file paths—originating ultimately from user-influenced LLM outputs and database results—being used verbatim in aiofiles.open, allowing arbitrary file reads through path traversal.
Fix Summary: The patch restricts all file accesses to a specified, intended base directory (ALLOWED_CODE_ROOT, configurable via the COGNEE_CODE_ROOT environment variable). A new function, is_safe_path, resolves each candidate path with os.path.realpath and accepts it only if os.path.commonpath shows it stays inside the allowed root; rejected paths are skipped and logged.
Potential Impacts: Code files stored outside ALLOWED_CODE_ROOT are no longer readable by the retriever and must be moved under it, or the root must be reconfigured.
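For illustration, a condensed version of the confinement logic: resolve the candidate against the allowed root, verify containment with os.path.commonpath, and only then open the file. COGNEE_CODE_ROOT and its default mirror the patch; the function names are placeholders.

```python
import os
from typing import Optional

CODE_ROOT = os.environ.get("COGNEE_CODE_ROOT", "./codebase")  # default mirrors the patch


def resolve_inside_root(root: str, candidate: str) -> Optional[str]:
    """Return the resolved absolute path if candidate stays under root, else None."""
    abs_root = os.path.realpath(os.path.abspath(root))
    abs_path = os.path.realpath(os.path.abspath(os.path.join(abs_root, candidate)))
    try:
        if os.path.commonpath([abs_path, abs_root]) != abs_root:
            return None
    except ValueError:  # e.g. different drives on Windows
        return None
    return abs_path


def read_code_file(candidate: str) -> str:
    """Read a file only after its path has been confined to CODE_ROOT."""
    resolved = resolve_inside_root(CODE_ROOT, candidate)
    if resolved is None:
        raise PermissionError(f"Rejected path outside code root: {candidate}")
    with open(resolved, "r", encoding="utf-8") as handle:
        return handle.read()
```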
LLM-Generated Cypher Query Injection via Prompt Manipulation @@ -1,6 +1,7 @@
from typing import Any, Optional
import logging
+import re
from cognee.infrastructure.databases.graph import get_graph_engine
from cognee.infrastructure.databases.graph.networkx.adapter import NetworkXAdapter
from cognee.infrastructure.llm.get_llm_client import get_llm_client
from cognee.infrastructure.llm.prompts import render_prompt
@@ -48,8 +49,46 @@
"""
)
return node_schemas, edge_schemas
+ def _is_readonly_cypher(self, query: str) -> bool:
+ """
+ Checks if the Cypher query is read-only.
+ Only allows those starting with MATCH, WITH, or RETURN (optionally preceded by whitespace and comments).
+ Disallows any queries containing data-modifying, schema, or admin operations.
+ """
+ # Remove leading whitespace and comments
+ stripped_query = query.strip()
+ # Disallow queries containing dangerous keywords/sequences (case-insensitive)
+ forbidden = [
+ r"\bCREATE\b",
+ r"\bMERGE\b",
+ r"\bSET\b",
+ r"\bDELETE\b",
+ r"\bDETACH\s+DELETE\b",
+ r"\bREMOVE\b",
+ r"\bDROP\b",
+ r"\bCALL\s+dbms\.", # block privileged procedures
+ r"\bCALL\s+db.\b",
+ r"\bCALL\s+apoc\.", # block APOC usage
+ r"\bLOAD\s+CSV\b",
+ r"\bSTART\b",
+ r"\bCALL\s+\w+\(",
+ r"\bPROFILE\b",
+ r"\bEXPLAIN\b",
+ r"\bUNWIND\b.*\bCREATE\b", # UNWIND ... CREATE trick
+ ]
+ for pat in forbidden:
+ if re.search(pat, stripped_query, flags=re.IGNORECASE):
+ return False
+
+ # Allow only those queries starting (after whitespace/comments) with MATCH, WITH, or RETURN
+ allowed_initial = re.compile(r"^(?:(?:\/\/.*\n|\s+)*)\b(MATCH|WITH|RETURN)\b", re.IGNORECASE)
+ if allowed_initial.match(stripped_query):
+ return True
+
+ return False
+
async def _generate_cypher_query(self, query: str, edge_schemas, previous_attempts=None) -> str:
"""Generate a Cypher query using LLM based on natural language query and schema information."""
llm_client = get_llm_client()
system_prompt = render_prompt(
@@ -66,9 +105,11 @@
response_model=str,
)
async def _execute_cypher_query(self, query: str, graph_engine: GraphDBInterface) -> Any:
- """Execute the natural language query against Neo4j with multiple attempts."""
+ """Execute the natural language query against Neo4j with multiple attempts.
+ Only read-only Cypher queries are permitted.
+ """
node_schemas, edge_schemas = await self._get_graph_schema(graph_engine)
previous_attempts = ""
cypher_query = ""
@@ -78,8 +119,21 @@
cypher_query = await self._generate_cypher_query(
query, edge_schemas, previous_attempts
)
+ if not isinstance(cypher_query, str):
+ raise ValueError("Generated Cypher query is not a string")
+
+ # Security filter: only allow safe, read-only Cypher queries.
+ if not self._is_readonly_cypher(cypher_query):
+ logger.warning(
+ f"Blocked non-readonly Cypher query generated by LLM (attempt {attempt + 1}): {cypher_query[:200]}..."
+ if len(cypher_query) > 200 else
+ f"Blocked non-readonly Cypher query generated by LLM (attempt {attempt + 1}): {cypher_query}"
+ )
+ previous_attempts += f"Query: {cypher_query} -> Blocked as non-readonly\n"
+ continue
+
logger.info(
f"Executing generated Cypher query (attempt {attempt + 1}): {cypher_query[:100]}..."
if len(cypher_query) > 100
else cypher_query
@@ -156,5 +210,5 @@
"""
if context is None:
context = await self.get_context(query)
- return context
+ return context
\ No newline at end of file
Explanation of Fix
Vulnerability: LLM-generated Cypher was executed verbatim, so prompt manipulation could make the model emit data-modifying or administrative statements.
Fix: A new _is_readonly_cypher check rejects generated queries containing write, schema, or privileged keywords (CREATE, MERGE, SET, DELETE, REMOVE, DROP, LOAD CSV, CALL dbms./apoc., and similar) and only accepts queries that start with MATCH, WITH, or RETURN. Blocked queries are logged and appended to previous_attempts so the next generation attempt can adjust.
Side Effects and Considerations: Legitimate read-only queries that rely on procedure calls, PROFILE, or EXPLAIN are now rejected and would require deliberately relaxing the filter.
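A small, runnable sketch of the same denylist-plus-allowed-prefix idea (a heuristic keyword filter, not a Cypher parser); the keyword list here is abbreviated relative to the patch.

```python
import re

# Keyword patterns adapted from the patch; extend as needed for your deployment.
FORBIDDEN_CYPHER = [
    r"\bCREATE\b", r"\bMERGE\b", r"\bSET\b", r"\bDELETE\b", r"\bREMOVE\b",
    r"\bDROP\b", r"\bLOAD\s+CSV\b", r"\bCALL\s+dbms\.", r"\bCALL\s+apoc\.",
]
ALLOWED_START = re.compile(r"^\s*(MATCH|WITH|RETURN)\b", re.IGNORECASE)


def is_readonly_cypher(query: str) -> bool:
    """Heuristic guard: reject write/admin keywords, require a read-style opener."""
    text = query.strip()
    if any(re.search(pattern, text, flags=re.IGNORECASE) for pattern in FORBIDDEN_CYPHER):
        return False
    return bool(ALLOWED_START.match(text))


print(is_readonly_cypher("MATCH (n:Person) RETURN n LIMIT 5"))  # True
print(is_readonly_cypher("MATCH (n) DETACH DELETE n"))          # False (DELETE is blocked)
```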
Unsanitized Query Injection in Graph and Vector Database Operations @@ -1,5 +1,6 @@
import asyncio
+import re
from typing import Any, Optional
from cognee.infrastructure.databases.graph import get_graph_engine
from cognee.infrastructure.databases.vector import get_vector_engine
@@ -7,8 +8,37 @@
from cognee.modules.retrieval.exceptions.exceptions import NoDataError
from cognee.infrastructure.databases.vector.exceptions.exceptions import CollectionNotFoundError
+def _is_safe_node_id(node_id: str) -> bool:
+ """
+ Checks if a node id is safe for injection into downstream graph/database queries.
+ Allows letters, digits, spaces, underscores, hyphens, colons, and periods. Max 128 chars.
+ """
+ if not isinstance(node_id, str):
+ return False
+ if len(node_id) > 128:
+ return False
+ allowed = re.compile(r"^[A-Za-z0-9 _\-:\.]{1,128}$")
+ return bool(allowed.fullmatch(node_id))
+
+
+def _is_safe_query_text(query_text: str) -> bool:
+ """
+ Checks if query text is safe for vector engine search.
+ Disallows newline, semicolons, control chars, and excessive length.
+ """
+ if not isinstance(query_text, str):
+ return False
+ if len(query_text) > 256:
+ return False
+ # Allow wide range of printable characters except for dangerous control characters
+ banned_pattern = re.compile(r"[\r\n;\x00-\x1F\x7F]")
+ if banned_pattern.search(query_text):
+ return False
+ return True
+
+
class InsightsRetriever(BaseRetriever):
"""
Retriever for handling graph connection-based insights.
@@ -47,15 +77,28 @@
"""
if query is None:
return []
+ # Sanitize node id input for graph lookups
node_id = query
+ if not _is_safe_node_id(node_id):
+ # Unsafe node_id: return empty result to avoid leaking info or injection
+ return []
+
graph_engine = await get_graph_engine()
exact_node = await graph_engine.extract_node(node_id)
if exact_node is not None and "id" in exact_node:
- node_connections = await graph_engine.get_connections(str(exact_node["id"]))
+ # Also sanitize for get_connections
+ safe_exact_node_id = str(exact_node["id"])
+ if not _is_safe_node_id(safe_exact_node_id):
+ return []
+ node_connections = await graph_engine.get_connections(safe_exact_node_id)
else:
+ # Sanitize query for vector search
+ if not _is_safe_query_text(query):
+ return []
+
vector_engine = get_vector_engine()
try:
results = await asyncio.gather(
@@ -70,10 +113,15 @@
if len(relevant_results) == 0:
return []
+ # Sanitize result ids before passing to get_connections
+ safe_results = [result for result in relevant_results if _is_safe_node_id(result.id)]
+ if not safe_results:
+ return []
+
node_connections_results = await asyncio.gather(
- *[graph_engine.get_connections(result.id) for result in relevant_results]
+ *[graph_engine.get_connections(result.id) for result in safe_results]
)
node_connections = []
for neighbours in node_connections_results:
@@ -113,5 +161,5 @@
based on the query.
"""
if context is None:
context = await self.get_context(query)
- return context
+ return context
\ No newline at end of file
Explanation of Fix
The vulnerability arises because user-controlled input (the query string used as a node id and as vector-search text) flows unvalidated into downstream graph and vector database operations.
Fix: Two validators are added. _is_safe_node_id restricts node ids to letters, digits, spaces, underscores, hyphens, colons, and periods with a 128-character limit, and _is_safe_query_text rejects newlines, semicolons, control characters, and over-long text before vector search. Unsafe values short-circuit to an empty result, and ids returned by the vector engine are re-validated before get_connections is called.
Impact:
If any external module expects special characters in node names (outside typical ASCII/word characters), the regex can be relaxed, but as per common graph-entity practices, restricting to this conservative character set is a reasonable default.
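The two validators boil down to a couple of regular expressions; a self-contained sketch with the same character set and limits as the patch is shown below.

```python
import re

NODE_ID_PATTERN = re.compile(r"^[A-Za-z0-9 _\-:\.]{1,128}$")
BANNED_QUERY_CHARS = re.compile(r"[\r\n;\x00-\x1F\x7F]")


def is_safe_node_id(node_id) -> bool:
    """Accept only short identifiers built from a conservative character set."""
    return isinstance(node_id, str) and bool(NODE_ID_PATTERN.fullmatch(node_id))


def is_safe_query_text(query_text) -> bool:
    """Reject control characters, semicolons, and over-long search text."""
    return (
        isinstance(query_text, str)
        and len(query_text) <= 256
        and not BANNED_QUERY_CHARS.search(query_text)
    )


print(is_safe_node_id("person:42"))                  # True
print(is_safe_node_id("x'}); MATCH (n) DELETE n"))   # False: quote, braces, semicolon
```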
Unvalidated LLM-Generated Cypher Query Execution @@ -30,8 +30,20 @@
"""Initialize retriever with optional custom prompt paths."""
self.system_prompt_path = system_prompt_path
self.max_attempts = max_attempts
+ @staticmethod
+ def _neutralize_for_logging(s: str) -> str:
+ """Sanitize input for logging to prevent log injection/forging."""
+ if not isinstance(s, str):
+ s = str(s)
+ # Remove or replace newlines, carriage returns and tabs
+ s = s.replace('\r', '\\r').replace('\n', '\\n').replace('\t', '\\t')
+ # Optionally, truncate very long messages to prevent log bloat or DOS
+ if len(s) > 1000:
+ s = s[:1000] + '... [truncated]'
+ return s
+
async def _get_graph_schema(self, graph_engine) -> tuple:
"""Retrieve the node and edge schemas from the graph database."""
node_schemas = await graph_engine.query(
"""
@@ -78,12 +90,14 @@
cypher_query = await self._generate_cypher_query(
query, edge_schemas, previous_attempts
)
+ log_cypher_query = (
+ cypher_query[:100] + "..." if len(cypher_query) > 100 else cypher_query
+ )
+ safe_log_cypher_query = self._neutralize_for_logging(log_cypher_query)
logger.info(
- f"Executing generated Cypher query (attempt {attempt + 1}): {cypher_query[:100]}..."
- if len(cypher_query) > 100
- else cypher_query
+ f"Executing generated Cypher query (attempt {attempt + 1}): {safe_log_cypher_query}"
)
context = await graph_engine.query(cypher_query)
if context:
@@ -92,16 +106,22 @@
f"Successfully executed query (attempt {attempt + 1}): returned {result_count} result(s)"
)
return context
- previous_attempts += f"Query: {cypher_query} -> Result: None\n"
+ # Log safe cypher query in previous_attempts
+ safe_cypher_query = self._neutralize_for_logging(cypher_query)
+ previous_attempts += f"Query: {safe_cypher_query} -> Result: None\n"
except Exception as e:
- previous_attempts += f"Query: {cypher_query if 'cypher_query' in locals() else 'Not generated'} -> Executed with error: {e}\n"
- logger.error(f"Error executing query: {str(e)}")
+ safe_cypher_query = self._neutralize_for_logging(
+ cypher_query if 'cypher_query' in locals() else 'Not generated'
+ )
+ safe_error = self._neutralize_for_logging(str(e))
+ previous_attempts += f"Query: {safe_cypher_query} -> Executed with error: {safe_error}\n"
+ logger.error(f"Error executing query: {safe_error} | Query: {safe_cypher_query}")
logger.warning(
- f"Failed to get results after {self.max_attempts} attempts for query: '{query[:50]}...'"
+ f"Failed to get results after {self.max_attempts} attempts for query: '{self._neutralize_for_logging(query[:50])}...'"
)
return []
async def get_context(self, query: str) -> Optional[Any]:
@@ -130,9 +150,10 @@
raise SearchTypeNotSupported("Natural language search type not supported.")
return await self._execute_cypher_query(query, graph_engine)
except Exception as e:
- logger.error("Failed to execute natural language search retrieval: %s", str(e))
+ safe_error = self._neutralize_for_logging(str(e))
+ logger.error("Failed to execute natural language search retrieval: %s", safe_error)
raise e
async def get_completion(self, query: str, context: Optional[Any] = None) -> Any:
"""
@@ -156,5 +177,5 @@
"""
if context is None:
context = await self.get_context(query)
- return context
+ return context
\ No newline at end of file
Explanation of Fix
Vulnerability Explanation and Fix: The code in this retriever executes LLM-generated Cypher and writes the generated queries, the user query, and raw exception text to the logs verbatim, so embedded newlines allow log injection and forging.
The Fix: A _neutralize_for_logging helper escapes carriage returns, newlines, and tabs and truncates very long messages before any model- or user-derived text is logged or appended to previous_attempts.
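A standalone sketch of the log-neutralization idea: escape newline-like characters and truncate before anything model- or user-derived reaches the log.

```python
def neutralize_for_logging(value, max_len: int = 1000) -> str:
    """Escape newline-like characters and truncate before writing to logs."""
    text = value if isinstance(value, str) else str(value)
    text = text.replace("\r", "\\r").replace("\n", "\\n").replace("\t", "\\t")
    if len(text) > max_len:
        text = text[:max_len] + "... [truncated]"
    return text


# A forged multi-line "log entry" collapses onto a single, clearly escaped line:
print(neutralize_for_logging("MATCH (n) RETURN n\nERROR: admin login failed"))
```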
| - update_version | ||
| - to_json | ||
| - from_json | ||
| - to_pickle | ||
| - from_pickle | ||
| - to_dict | ||
| - from_dict | ||
| """ | ||
|
|
||
| id: UUID = Field(default_factory=uuid4) | ||
| created_at: int = Field( |
Insecure Pickle Deserialization Enabling Remote Code Execution
Explanation of Fix
Vulnerability & Fix Overview:
The critical vulnerability is in the from_pickle classmethod, which deserializes arbitrary byte streams using pickle.loads. This means that anyone providing (or tampering with) the pickle data can cause arbitrary code execution, a classic dangerous scenario due to the way the pickle protocol works (CWE-502).
Fix:
To address the vulnerability, the from_pickle method is removed completely. No safe, dependency-free alternative for secure deserialization exists using pickle within the standard library. Additionally, we must make it clear in docstrings and comments that loading pickled data is prohibited, and all (de)serialization should use the safe, already provided JSON-based mechanism or the validated from_dict/to_dict pathways.
The to_pickle method may still exist and is safe in isolation since it simply transforms the instance to its dictionary and then pickles that dict (no __reduce__ code-execution vector). However, it is recommended to discourage/avoid using the pickled output for anything other than ephemeral internal caching, and not for transmission or persistent untrusted storage. As per your instruction, the critical bug is only the loading.
Any calling code that may have expected from_pickle to exist must now use from_dict or from_json and handle trusted deserialization only.
Summary of changes:
from_picklemethod has been removed entirely.- Docstrings updated to warn explicitly about the dangers of loading pickle data.
- No imports or further modifications required elsewhere in the codebase. No new dependencies introduced.
Side Effects & Compatibility:
- Codebases that previously called
DataPoint.from_picklewill need to switch to usingfrom_dict,from_json, or implement their own validation/sanitization logic. - Pickled data produced by
to_pickleis still possible for trusted/internal use, but should not be restored unless handled in a tightly controlled, safe environment. - No other functionality is affected; all other deserialization methods (
from_dict,from_json) are validated and safe.
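For callers migrating off from_pickle, the safe round-trip looks roughly like the sketch below (assuming pydantic v2, which the model_validate call in the shown code suggests; DataPointSketch is a stand-in, not the real DataPoint).

```python
from uuid import UUID, uuid4

from pydantic import BaseModel, Field


class DataPointSketch(BaseModel):
    """Stand-in model; the real DataPoint carries more fields and metadata."""

    id: UUID = Field(default_factory=uuid4)
    name: str = "example"


point = DataPointSketch(name="node-a")

# Safe round-trip: JSON out, validated model back in -- no pickle involved.
payload = point.model_dump_json()
restored = DataPointSketch.model_validate_json(payload)
assert restored == point
```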
Issues
| Type | Identifier | Message | Severity | Link |
|---|---|---|---|---|
| Application | CWE-502 | The from_pickle method deserializes arbitrary byte streams using pickle.loads. If an attacker can supply or alter pickled_data, they can craft a malicious pickle payload that executes arbitrary code during deserialization. Because pickle evaluates opcodes that can invoke the Python runtime, this constitutes an insecure-deserialization vulnerability (CWE-502) and can directly lead to Remote Code Execution (RCE). The complementary to_pickle method risks exposing sensitive internal data if the resulting bytes are stored or transmitted insecurely, but the primary security issue lies in loading untrusted pickle data. | critical | Link |
Suggested Fix
| - update_version | |
| - to_json | |
| - from_json | |
| - to_pickle | |
| - from_pickle | |
| - to_dict | |
| - from_dict | |
| """ | |
| id: UUID = Field(default_factory=uuid4) | |
| created_at: int = Field( | |
| - update_version | |
| - to_json | |
| - from_json | |
| - to_pickle | |
| - to_dict | |
| - from_dict | |
| Note: | |
| ----- | |
| The 'from_pickle' method is removed due to critical security risks arising from | |
| insecure pickle deserialization (CWE-502). All deserialization should use the | |
| safe, validated 'from_json' or 'from_dict' methods. | |
| """ | |
| id: UUID = Field(default_factory=uuid4) | |
| created_at: int = Field( |
| def to_pickle(self) -> bytes: | ||
| """Serialize the instance to pickle-compatible bytes.""" | ||
| """ | ||
| Serialize the DataPoint instance to a byte format for pickling. | ||
| This method uses the built-in Python pickle module to convert the instance into a byte | ||
| stream for persistence or transmission. | ||
| Returns: |
Insecure Pickle Deserialization Enabling Remote Code Execution
Explanation of Fix
Vulnerability & Fix Overview:
The critical vulnerability is in the from_pickle classmethod, which deserializes arbitrary byte streams using pickle.loads. This means that anyone providing (or tampering with) the pickle data can cause arbitrary code execution, a classic dangerous scenario due to the way the pickle protocol works (CWE-502).
Fix:
To address the vulnerability, the from_pickle method is removed completely. No safe, dependency-free alternative for secure deserialization exists using pickle within the standard library. Additionally, we must make it clear in docstrings and comments that loading pickled data is prohibited, and all (de)serialization should use the safe, already provided JSON-based mechanism or the validated from_dict/to_dict pathways.
The to_pickle method may still exist and is safe in isolation since it simply transforms the instance to its dictionary and then pickles that dict (no __reduce__ code-execution vector). However, it is recommended to discourage/avoid using the pickled output for anything other than ephemeral internal caching, and not for transmission or persistent untrusted storage. As per your instruction, the critical bug is only the loading.
Any calling code that may have expected from_pickle to exist must now use from_dict or from_json and handle trusted deserialization only.
Summary of changes:
from_picklemethod has been removed entirely.- Docstrings updated to warn explicitly about the dangers of loading pickle data.
- No imports or further modifications required elsewhere in the codebase. No new dependencies introduced.
Side Effects & Compatibility:
- Codebases that previously called
DataPoint.from_picklewill need to switch to usingfrom_dict,from_json, or implement their own validation/sanitization logic. - Pickled data produced by
to_pickleis still possible for trusted/internal use, but should not be restored unless handled in a tightly controlled, safe environment. - No other functionality is affected; all other deserialization methods (
from_dict,from_json) are validated and safe.
Issues
| Type | Identifier | Message | Severity | Link |
|---|---|---|---|---|
| Application | CWE-502 | The from_pickle method deserializes arbitrary byte streams using pickle.loads. If an attacker can supply or alter pickled_data, they can craft a malicious pickle payload that executes arbitrary code during deserialization. Because pickle evaluates opcodes that can invoke the Python runtime, this constitutes an insecure-deserialization vulnerability (CWE-502) and can directly lead to Remote Code Execution (RCE). The complementary to_pickle method risks exposing sensitive internal data if the resulting bytes are stored or transmitted insecurely, but the primary security issue lies in loading untrusted pickle data. | critical | Link |
Suggested Fix
| def to_pickle(self) -> bytes: | |
| """Serialize the instance to pickle-compatible bytes.""" | |
| """ | |
| Serialize the DataPoint instance to a byte format for pickling. | |
| This method uses the built-in Python pickle module to convert the instance into a byte | |
| stream for persistence or transmission. | |
| Returns: | |
| def to_pickle(self) -> bytes: | |
| """ | |
| Serialize the DataPoint instance to a byte format for pickling. | |
| WARNING: While serialization ('to_pickle') is provided, untrusted pickled data MUST | |
| NEVER be deserialized using pickle in any context. There is intentionally no | |
| 'from_pickle' provided due to severe security risks. | |
| This method uses the built-in Python pickle module to convert the instance into a byte | |
| stream for persistence or transmission. | |
| Returns: |
| - bytes: The pickled byte representation of the DataPoint instance. | ||
| """ | ||
| return pickle.dumps(self.dict()) | ||
|
|
||
| @classmethod | ||
| def from_pickle(self, pickled_data: bytes): | ||
| """Deserialize the instance from pickled bytes.""" | ||
| """ | ||
| Deserialize a DataPoint instance from a pickled byte stream. | ||
| The method converts the byte stream back into a DataPoint instance by loading the data | ||
| and validating it through the model's constructor. | ||
| Parameters: | ||
| ----------- | ||
| - pickled_data (bytes): The bytes representation of a pickled DataPoint instance to | ||
| be deserialized. | ||
| Returns: | ||
| -------- | ||
| A new DataPoint instance created from the pickled data. | ||
| """ | ||
| data = pickle.loads(pickled_data) | ||
| return self(**data) | ||
|
|
||
| def to_dict(self, **kwargs) -> Dict[str, Any]: | ||
| """Serialize model to a dictionary.""" | ||
| """ | ||
| Convert the DataPoint instance to a dictionary representation. | ||
Insecure Pickle Deserialization Enabling Remote Code Execution
Explanation of Fix
Vulnerability & Fix Overview:
The critical vulnerability is in the from_pickle classmethod, which deserializes arbitrary byte streams using pickle.loads. This means that anyone providing (or tampering with) the pickle data can cause arbitrary code execution, a classic dangerous scenario due to the way the pickle protocol works (CWE-502).
Fix:
To address the vulnerability, the from_pickle method is removed completely. No safe, dependency-free alternative for secure deserialization exists using pickle within the standard library. Additionally, we must make it clear in docstrings and comments that loading pickled data is prohibited, and all (de)serialization should use the safe, already provided JSON-based mechanism or the validated from_dict/to_dict pathways.
The to_pickle method may still exist and is safe in isolation since it simply transforms the instance to its dictionary and then pickles that dict (no __reduce__ code-execution vector). However, it is recommended to discourage/avoid using the pickled output for anything other than ephemeral internal caching, and not for transmission or persistent untrusted storage. As per your instruction, the critical bug is only the loading.
Any calling code that may have expected from_pickle to exist must now use from_dict or from_json and handle trusted deserialization only.
Summary of changes:
from_picklemethod has been removed entirely.- Docstrings updated to warn explicitly about the dangers of loading pickle data.
- No imports or further modifications required elsewhere in the codebase. No new dependencies introduced.
Side Effects & Compatibility:
- Codebases that previously called
DataPoint.from_picklewill need to switch to usingfrom_dict,from_json, or implement their own validation/sanitization logic. - Pickled data produced by
to_pickleis still possible for trusted/internal use, but should not be restored unless handled in a tightly controlled, safe environment. - No other functionality is affected; all other deserialization methods (
from_dict,from_json) are validated and safe.
Issues
| Type | Identifier | Message | Severity | Link |
|---|---|---|---|---|
| Application | CWE-502 | The from_pickle method deserializes arbitrary byte streams using pickle.loads. If an attacker can supply or alter pickled_data, they can craft a malicious pickle payload that executes arbitrary code during deserialization. Because pickle evaluates opcodes that can invoke the Python runtime, this constitutes an insecure-deserialization vulnerability (CWE-502) and can directly lead to Remote Code Execution (RCE). The complementary to_pickle method risks exposing sensitive internal data if the resulting bytes are stored or transmitted insecurely, but the primary security issue lies in loading untrusted pickle data. | critical | Link |
Suggested Fix
| - bytes: The pickled byte representation of the DataPoint instance. | |
| """ | |
| return pickle.dumps(self.dict()) | |
| @classmethod | |
| def from_pickle(self, pickled_data: bytes): | |
| """Deserialize the instance from pickled bytes.""" | |
| """ | |
| Deserialize a DataPoint instance from a pickled byte stream. | |
| The method converts the byte stream back into a DataPoint instance by loading the data | |
| and validating it through the model's constructor. | |
| Parameters: | |
| ----------- | |
| - pickled_data (bytes): The bytes representation of a pickled DataPoint instance to | |
| be deserialized. | |
| Returns: | |
| -------- | |
| A new DataPoint instance created from the pickled data. | |
| """ | |
| data = pickle.loads(pickled_data) | |
| return self(**data) | |
| def to_dict(self, **kwargs) -> Dict[str, Any]: | |
| """Serialize model to a dictionary.""" | |
| """ | |
| Convert the DataPoint instance to a dictionary representation. | |
| - bytes: The pickled byte representation of the DataPoint instance. | |
| """ | |
| return pickle.dumps(self.dict()) | |
| # Insecure deserialization (from_pickle) method REMOVED due to security risks. | |
| def to_dict(self, **kwargs) -> Dict[str, Any]: | |
| """ | |
| Convert the DataPoint instance to a dictionary representation. |
| - 'DataPoint': A new DataPoint instance constructed from the provided dictionary | ||
| data. | ||
| """ | ||
| return cls.model_validate(data) |
Insecure Pickle Deserialization Enabling Remote Code Execution
Explanation of Fix
Vulnerability & Fix Overview:
The critical vulnerability is in the from_pickle classmethod, which deserializes arbitrary byte streams using pickle.loads. This means that anyone providing (or tampering with) the pickle data can cause arbitrary code execution, a classic dangerous scenario due to the way the pickle protocol works (CWE-502).
Fix:
To address the vulnerability, the from_pickle method is removed completely. No safe, dependency-free alternative for secure deserialization exists using pickle within the standard library. Additionally, we must make it clear in docstrings and comments that loading pickled data is prohibited, and all (de)serialization should use the safe, already provided JSON-based mechanism or the validated from_dict/to_dict pathways.
The to_pickle method may still exist and is safe in isolation since it simply transforms the instance to its dictionary and then pickles that dict (no __reduce__ code-execution vector). However, it is recommended to discourage/avoid using the pickled output for anything other than ephemeral internal caching, and not for transmission or persistent untrusted storage. As per your instruction, the critical bug is only the loading.
Any calling code that may have expected from_pickle to exist must now use from_dict or from_json and handle trusted deserialization only.
Summary of changes:
from_picklemethod has been removed entirely.- Docstrings updated to warn explicitly about the dangers of loading pickle data.
- No imports or further modifications required elsewhere in the codebase. No new dependencies introduced.
Side Effects & Compatibility:
- Codebases that previously called
DataPoint.from_picklewill need to switch to usingfrom_dict,from_json, or implement their own validation/sanitization logic. - Pickled data produced by
to_pickleis still possible for trusted/internal use, but should not be restored unless handled in a tightly controlled, safe environment. - No other functionality is affected; all other deserialization methods (
from_dict,from_json) are validated and safe.
Issues
| Type | Identifier | Message | Severity | Link |
|---|---|---|---|---|
| Application | CWE-502 | The from_pickle method deserializes arbitrary byte streams using pickle.loads. If an attacker can supply or alter pickled_data, they can craft a malicious pickle payload that executes arbitrary code during deserialization. Because pickle evaluates opcodes that can invoke the Python runtime, this constitutes an insecure-deserialization vulnerability (CWE-502) and can directly lead to Remote Code Execution (RCE). The complementary to_pickle method risks exposing sensitive internal data if the resulting bytes are stored or transmitted insecurely, but the primary security issue lies in loading untrusted pickle data. | critical | Link |
Suggested Fix
| - 'DataPoint': A new DataPoint instance constructed from the provided dictionary | |
| data. | |
| """ | |
| return cls.model_validate(data) | |
| - 'DataPoint': A new DataPoint instance constructed from the provided dictionary | |
| data. | |
| """ | |
| return cls.model_validate(data) |
|
|
||
| @classmethod | ||
| def from_pickle(self, pickled_data: bytes): | ||
| """Deserialize the instance from pickled bytes.""" | ||
| """ | ||
| Deserialize a DataPoint instance from a pickled byte stream. | ||
| The method converts the byte stream back into a DataPoint instance by loading the data | ||
| and validating it through the model's constructor. | ||
| Parameters: | ||
| ----------- | ||
| - pickled_data (bytes): The bytes representation of a pickled DataPoint instance to | ||
| be deserialized. | ||
| Returns: | ||
| -------- | ||
| A new DataPoint instance created from the pickled data. | ||
| """ | ||
| data = pickle.loads(pickled_data) | ||
| return self(**data) | ||
|
|
||
| def to_dict(self, **kwargs) -> Dict[str, Any]: | ||
| """Serialize model to a dictionary.""" | ||
| """ | ||
| Convert the DataPoint instance to a dictionary representation. |
Unsafe Pickle Deserialization Enabling Remote Code Execution
Explanation of Fix
Vulnerability & Fix Overview:
The vulnerability (CWE-502) was due to the direct use of pickle.loads(pickled_data) in the from_pickle classmethod, which allowed arbitrary code execution if an attacker supplied a malicious pickle byte stream.
Fix Explanation:
To address this, the fix fully removes support for unpickling via the public from_pickle method. Instead of unpickling with pickle.loads, the new implementation will reject any attempt to load data using pickle by raising a RuntimeError. This ensures that no untrusted or tampered pickle byte streams can be deserialized. Pickling (serialization) is retained as is (still via to_pickle), but loading/deserialize now fails securely.
Alternatively, we could only support deserialization of dictionaries (our own output) by switching to an extremely restrictive unpickler or alternative serialization, but that would still leave some risk and ambiguity and break contract with pickle. Refusing to support unpickling avoids ambiguity and future risk.
Potential Impacts:
- Any code or users relying on
DataPoint.from_picklefor deserialization will now receive a clear exception stating that insecure pickle deserialization is disabled and they should use JSON orfrom_dictinstead. - There are no new dependencies.
- The rest of the code remains unaffected; serialization by pickle (
to_pickle) is still permitted for compatibility, but deserialization (loading) is now unequivocally prohibited.
Summary:
This ensures a production-grade, future-proof resolution of the RCE risk without changing APIs for secure functions or introducing breaking changes elsewhere.
Issues
| Type | Identifier | Message | Severity | Link |
|---|---|---|---|---|
| Application | CWE-502 | from_pickle() calls pickle.loads() directly on the supplied byte stream without verifying its source or integrity. The pickle format is inherently unsafe—if an attacker can provide or modify the bytes, they can craft a payload whose deserialization executes arbitrary Python code (e.g., via `__reduce__`). This exposes the application to remote code-execution and full system compromise whenever untrusted or tampered data reaches this function. | critical | Link |
Suggested Fix
| @classmethod | |
| def from_pickle(self, pickled_data: bytes): | |
| """Deserialize the instance from pickled bytes.""" | |
| """ | |
| Deserialize a DataPoint instance from a pickled byte stream. | |
| The method converts the byte stream back into a DataPoint instance by loading the data | |
| and validating it through the model's constructor. | |
| Parameters: | |
| ----------- | |
| - pickled_data (bytes): The bytes representation of a pickled DataPoint instance to | |
| be deserialized. | |
| Returns: | |
| -------- | |
| A new DataPoint instance created from the pickled data. | |
| """ | |
| data = pickle.loads(pickled_data) | |
| return self(**data) | |
| def to_dict(self, **kwargs) -> Dict[str, Any]: | |
| """Serialize model to a dictionary.""" | |
| """ | |
| Convert the DataPoint instance to a dictionary representation. | |
| @classmethod | |
| def from_pickle(self, pickled_data: bytes): | |
| """ | |
| This method has been disabled because pickle deserialization is inherently unsafe and can lead | |
| to arbitrary code execution vulnerabilities. Please use from_json or from_dict for safe deserialization. | |
| Parameters: | |
| ----------- | |
| - pickled_data (bytes): The bytes representation of a pickled DataPoint instance to | |
| be deserialized. | |
| Raises: | |
| ------- | |
| RuntimeError: Always. Unpickling is not supported for security reasons. | |
| """ | |
| raise RuntimeError( | |
| "DataPoint.from_pickle is disabled for security reasons as pickle deserialization is unsafe. " | |
| "Use from_json or from_dict for deserialization." | |
| ) | |
| def to_dict(self, **kwargs) -> Dict[str, Any]: | |
| """ | |
| Convert the DataPoint instance to a dictionary representation. |
| - 'DataPoint': A new DataPoint instance constructed from the provided dictionary | ||
| data. | ||
| """ | ||
| return cls.model_validate(data) |
Unsafe Pickle Deserialization Enabling Remote Code Execution
Explanation of Fix
Vulnerability & Fix Overview:
The vulnerability (CWE-502) was due to the direct use of pickle.loads(pickled_data) in the from_pickle classmethod, which allowed arbitrary code execution if an attacker supplied a malicious pickle byte stream.
Fix Explanation:
To address this, the fix fully removes support for unpickling via the public from_pickle method. Instead of unpickling with pickle.loads, the new implementation will reject any attempt to load data using pickle by raising a RuntimeError. This ensures that no untrusted or tampered pickle byte streams can be deserialized. Pickling (serialization) is retained as is (still via to_pickle), but loading/deserialize now fails securely.
Alternatively, we could only support deserialization of dictionaries (our own output) by switching to an extremely restrictive unpickler or alternative serialization, but that would still leave some risk and ambiguity and break contract with pickle. Refusing to support unpickling avoids ambiguity and future risk.
Potential Impacts:
- Any code or users relying on
DataPoint.from_picklefor deserialization will now receive a clear exception stating that insecure pickle deserialization is disabled and they should use JSON orfrom_dictinstead. - There are no new dependencies.
- The rest of the code remains unaffected; serialization by pickle (
to_pickle) is still permitted for compatibility, but deserialization (loading) is now unequivocally prohibited.
Summary:
This ensures a production-grade, future-proof resolution of the RCE risk without changing APIs for secure functions or introducing breaking changes elsewhere.
Issues
| Type | Identifier | Message | Severity | Link |
|---|---|---|---|---|
| Application | CWE-502 | from_pickle() calls pickle.loads() directly on the supplied byte stream without verifying its source or integrity. The pickle format is inherently unsafe—if an attacker can provide or modify the bytes, they can craft a payload whose deserialization executes arbitrary Python code (e.g., via `__reduce__`). This exposes the application to remote code-execution and full system compromise whenever untrusted or tampered data reaches this function. | critical | Link |
Suggested Fix
| - 'DataPoint': A new DataPoint instance constructed from the provided dictionary | |
| data. | |
| """ | |
| return cls.model_validate(data) | |
| - 'DataPoint': A new DataPoint instance constructed from the provided dictionary | |
| data. | |
| """ | |
| return cls.model_validate(data) |
| representing the embedded vectors. | ||
| """ | ||
| return await self.embedding_engine.embed_text(data) | ||
|
|
||
| async def stringify_properties(self, properties: dict) -> str: | ||
| """ | ||
| Convert properties dictionary to a string format suitable for database queries. | ||
Cypher Query Injection in Graph Database Operations
Explanation of Fix
Vulnerability & Fix Explanation:
The vulnerability is due to direct interpolation of untrusted data (property values from user input) into dynamically constructed Cypher queries without any escaping or parameter binding. This allows Cypher injection attacks (CWE-89), which could allow attackers to run arbitrary Cypher statements, exposing, deleting, or modifying all data.
The Fix:
To prevent Cypher injection, all string values included in the query need to be properly escaped, especially single quotes. The patch introduces a helper method _escape_cypher_string(s: str), which escapes every single quote (') with a pair (''), as required in Cypher/SQL string literals. The stringify_properties function and parse_value helper now use this escape function for every property value that is rendered as a Cypher string literal, including UUIDs, lists that get stringified, dicts, and fallback values.
No new dependencies are introduced; the patch adds only a helper function. All code paths that generate literal Cypher strings for property values now safely escape embedded single quotes. Numeric and other non-string values remain unaffected.
Potential impacts:
- Users may now notice that properties with embedded single quotes will receive doubled single-quotes when stored and then when retrieved (if parsing or viewing raw Cypher). This matches Cypher and SQL escaping requirements, so retrieval and general DB behavior is preserved.
- If downstream code or retrieval relies on exact property serialization, the escaping will be visible only in Cypher string literals, not in deserialized database values (Cypher automatically unescapes them).
No other breaking changes are made, and all usages of stringify_properties and the query construction pattern are covered by this patch.
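A minimal sketch of the quote-doubling helper and its effect on an injection attempt is shown below. Whether the target graph database expects doubled quotes or backslash escapes should be verified, and driver-level query parameters remain the more robust option where the adapter supports them.

```python
def escape_cypher_string(value: str) -> str:
    """Double single quotes so the value cannot terminate its Cypher string literal."""
    return value.replace("'", "''")


malicious_id = "abc'}) DETACH DELETE n //"
query = f"MERGE (node:Entity {{id: '{escape_cypher_string(malicious_id)}'}})"
print(query)
# MERGE (node:Entity {id: 'abc''}) DETACH DELETE n //'})  -- the quote stays escaped
```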
Issues
| Type | Identifier | Message | Severity | Link |
|---|---|---|---|---|
| Application | CWE-89 | String values are injected directly into dynamically-built Cypher queries without any escaping or parameter binding. An attacker who controls, for example, a node property can insert characters like ', ) or }; MATCH (n) DETACH DELETE n; //, breaking out of the literal and executing arbitrary graph-database commands (Cypher injection). This can lead to full data exfiltration, modification, or deletion. | critical | Link |
Suggested Fix
| representing the embedded vectors. | |
| """ | |
| return await self.embedding_engine.embed_text(data) | |
| async def stringify_properties(self, properties: dict) -> str: | |
| """ | |
| Convert properties dictionary to a string format suitable for database queries. | |
| representing the embedded vectors. | |
| """ | |
| return await self.embedding_engine.embed_text(data) | |
| def _escape_cypher_string(self, s: str) -> str: | |
| """ | |
| Escapes a string to be safely embedded into a Cypher query by doubling single quotes. | |
| """ | |
| # Cypher escaping: single quote is escaped by doubling | |
| return s.replace("'", "''") | |
| async def stringify_properties(self, properties: dict) -> str: | |
| """ | |
| Convert properties dictionary to a string format suitable for database queries. |
|
Cypher Query Injection in Graph Database Operations @@ -121,8 +121,15 @@
representing the embedded vectors.
"""
return await self.embedding_engine.embed_text(data)
+ def _escape_cypher_string(self, s: str) -> str:
+ """
+ Escapes a string to be safely embedded into a Cypher query by doubling single quotes.
+ """
+ # Cypher escaping: single quote is escaped by doubling
+ return s.replace("'", "''")
+
async def stringify_properties(self, properties: dict) -> str:
"""
Convert properties dictionary to a string format suitable for database queries.
@@ -152,22 +159,27 @@
Returns the string representation of the value in the appropriate format.
"""
if type(value) is UUID:
- return f"'{str(value)}'"
+ # Safely escape UUID string
+ return f"'{self._escape_cypher_string(str(value))}'"
if type(value) is int or type(value) is float:
return value
if (
type(value) is list
and type(value[0]) is float
and len(value) == self.embedding_engine.get_vector_size()
):
- return f"'vecf32({value})'"
+ # Represent the vector as string, but escape within the string
+ vector_str = str(value)
+ return f"'vecf32({self._escape_cypher_string(vector_str)})'"
# if type(value) is datetime:
# return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%f%z")
if type(value) is dict:
- return f"'{json.dumps(value)}'"
- return f"'{value}'"
+ json_value = json.dumps(value)
+ return f"'{self._escape_cypher_string(json_value)}'"
+ # Fallback for all string/non-numeric values
+ return f"'{self._escape_cypher_string(str(value))}'"
return ",".join([f"{key}:{parse_value(value)}" for key, value in properties.items()])
async def create_data_point_query(self, data_point: DataPoint, vectorized_values: dict):
@@ -207,9 +219,9 @@
)
return dedent(
f"""
- MERGE (node:{node_label} {{id: '{str(data_point.id)}'}})
+ MERGE (node:{node_label} {{id: '{self._escape_cypher_string(str(data_point.id))}'}})
ON CREATE SET node += ({{{node_properties}}}), node.updated_at = timestamp()
ON MATCH SET node += ({{{node_properties}}}), node.updated_at = timestamp()
"""
).strip()
@@ -231,13 +243,17 @@
"""
properties = await self.stringify_properties(edge[3])
properties = f"{{{properties}}}"
+ source_id_escaped = self._escape_cypher_string(str(edge[0]))
+ target_id_escaped = self._escape_cypher_string(str(edge[1]))
+ edge_type_escaped = self._escape_cypher_string(str(edge[2]))
+
return dedent(
f"""
- MERGE (source {{id:'{edge[0]}'}})
- MERGE (target {{id: '{edge[1]}'}})
- MERGE (source)-[edge:{edge[2]} {properties}]->(target)
+ MERGE (source {{id:'{source_id_escaped}'}})
+ MERGE (target {{id: '{target_id_escaped}'}})
+ MERGE (source)-[edge:{edge_type_escaped} {properties}]->(target)
ON MATCH SET edge.updated_at = timestamp()
ON CREATE SET edge.updated_at = timestamp()
"""
).strip()
@@ -608,13 +624,17 @@
query_vector = (await self.embed_data([query_text]))[0]
[label, attribute_name] = collection_name.split(".")
+ label_escaped = self._escape_cypher_string(label)
+ attribute_name_escaped = self._escape_cypher_string(attribute_name)
+
+ # Query vector rendered as source code; not susceptible to injection since vectors are program-generated floats.
query = dedent(
f"""
CALL db.idx.vector.queryNodes(
- '{label}',
- '{attribute_name}',
+ '{label_escaped}',
+ '{attribute_name_escaped}',
{limit},
vecf32({query_vector})
) YIELD node, score
"""
@@ -777,5 +797,5 @@
async def prune(self):
"""
Prune the graph by deleting the entire graph structure.
"""
- await self.delete_graph()
+ await self.delete_graph()
\ No newline at end of file
Explanation of Fix
Vulnerability & Fix Explanation: The vulnerability is due to direct interpolation of untrusted data (property values from user input) into dynamically constructed Cypher queries without any escaping or parameter binding. This allows Cypher injection attacks (CWE-89), which could allow attackers to run arbitrary Cypher statements, exposing, deleting, or modifying all data.
The Fix: The patch introduces a helper method _escape_cypher_string, which escapes every single quote by doubling it, and applies it in stringify_properties, parse_value, and the node/edge query builders. No new dependencies are introduced; the patch adds only a helper function. All code paths that generate literal Cypher strings for property values now safely escape embedded single quotes. Numeric and other non-string values remain unaffected.
Potential impacts:
- Properties with embedded single quotes are written with doubled quotes inside the generated Cypher literals, matching Cypher/SQL escaping requirements, so retrieval and general DB behavior is preserved.
- The escaping is visible only in the Cypher string literals, not in the deserialized database values.
No other breaking changes are made, and all usages of stringify_properties and the query construction pattern are covered by this patch.
Unrestricted Path Traversal in Database Cleanup Operation @@ -1,4 +1,5 @@
+import os
import asyncio
import lancedb
from pydantic import BaseModel
from lancedb.pydantic import LanceModel, Vector
@@ -40,8 +41,13 @@
url: str
api_key: str
connection: lancedb.AsyncConnection = None
+ # Define a base directory for allowed data storage and deletion.
+ # This should be adjusted to where LanceDB is expected to store its data sandboxes.
+ # You may wish to configure this elsewhere in your application.
+ SAFE_BASE_DIR = os.path.abspath(os.path.join(os.getcwd(), "data"))
+
def __init__(
self,
url: Optional[str],
api_key: Optional[str],
@@ -304,10 +310,20 @@
collection = await self.get_collection(collection_name)
await collection.delete("id IS NOT NULL")
await connection.drop_table(collection_name)
- if self.url.startswith("/"):
- LocalStorage.remove_all(self.url)
+ # Secure handling of LocalStorage.remove_all(self.url)
+ if self.url and isinstance(self.url, str) and self.url.strip() != "":
+ candidate_path = os.path.abspath(self.url)
+ safe_base = self.SAFE_BASE_DIR
+ # Do not allow deletion if the candidate path is root or outside safe_base
+ if (
+ os.path.commonpath([candidate_path, safe_base]) == safe_base
+ and candidate_path != safe_base
+ ):
+ LocalStorage.remove_all(candidate_path)
+ else:
+ raise ValueError("Unsafe or invalid storage path for prune operation: {}".format(self.url))
def get_data_point_schema(self, model_type: BaseModel):
related_models_fields = []
@@ -342,5 +358,5 @@
include_fields={
"id": (str, ...),
},
exclude_fields=["metadata"] + related_models_fields,
- )
+ )
\ No newline at end of file
Explanation of Fix
Vulnerability Explanation: The prune path handed self.url to LocalStorage.remove_all with only a leading-slash check, so a crafted or misconfigured URL could trigger deletion of directories anywhere on the filesystem.
Fix Explanation: In the absence of any explicit safe root in the existing code, we assume that only directories under a specific safe directory (e.g., the data/ directory beneath the current working directory, exposed as SAFE_BASE_DIR) may be removed. The candidate path is resolved to an absolute path and deleted only if os.path.commonpath confirms it lies inside SAFE_BASE_DIR and it is not the base directory itself; otherwise a ValueError is raised.
Note: The new code includes a SAFE_BASE_DIR class attribute that deployments should adjust to wherever LanceDB actually stores its data.
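A condensed sketch of the containment rule the patch enforces before calling LocalStorage.remove_all; the helper name is hypothetical, and SAFE_BASE_DIR matches the patch's default.

```python
import os

SAFE_BASE_DIR = os.path.abspath(os.path.join(os.getcwd(), "data"))  # default from the patch


def assert_safe_prune_path(candidate: str) -> str:
    """Allow deletion only for paths strictly inside SAFE_BASE_DIR, never the base itself."""
    resolved = os.path.abspath(candidate)
    try:
        inside = os.path.commonpath([resolved, SAFE_BASE_DIR]) == SAFE_BASE_DIR
    except ValueError:  # e.g. different drives on Windows
        inside = False
    if not inside or resolved == SAFE_BASE_DIR:
        raise ValueError(f"Unsafe or invalid storage path for prune operation: {candidate}")
    return resolved
```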
Git Command Argument Injection via Unvalidated Repository URL @@ -6,8 +6,10 @@
from cognee.shared.logging_utils import get_logger
import requests
from cognee.modules.users.models import User
from cognee.modules.users.methods import get_authenticated_user
+import re
+import os
logger = get_logger()
@@ -30,18 +32,58 @@
user: Authenticated user
"""
from cognee.api.v1.delete import delete as cognee_delete
+ def is_valid_github_repo_url(url: str) -> Optional[re.Match]:
+ """
+ Ensures the URL is a valid HTTPS GitHub repository clone URL.
+ Only allows https, disallows leading '-' and enforces allowed repo/user names.
+ """
+ # Example allowed: https://github.com/user/repo.git
+ # Disallow anything not matching this pattern.
+ github_repo_pattern = re.compile(
+ r"^https://github\.com/(?P<owner>[A-Za-z0-9._-]+)/(?P<repo>[A-Za-z0-9._-]+)(\.git)?/?$"
+ )
+ if url.startswith("-"): # Prevent git option injection
+ return None
+ if not github_repo_pattern.match(url):
+ return None
+ return github_repo_pattern.match(url)
+
+ def sanitize_repo_name(match: re.Match) -> str:
+ """
+ Ensures the generated directory is under the allowed .data/ directory.
+ """
+ owner = match.group("owner")
+ repo = match.group("repo")
+ # Only allow alphanum, _, -, and . (same as regex above)
+ safe_owner = re.sub(r"[^A-Za-z0-9._-]", "", owner)
+ safe_repo = re.sub(r"[^A-Za-z0-9._-]", "", repo)
+ return f"{safe_owner}_{safe_repo}"
+
try:
# Handle each file in the list
results = []
for file in data:
if file.filename.startswith("http"):
- if "github" in file.filename:
- # For GitHub repos, we need to get the content hash of each file
- repo_name = file.filename.split("/")[-1].replace(".git", "")
+ if "github.com" in file.filename:
+ # Validate the GitHub repo URL
+ match = is_valid_github_repo_url(file.filename)
+ if not match:
+ return JSONResponse(
+ status_code=400,
+ content={"error": "Invalid or potentially dangerous GitHub repository URL"},
+ )
+ repo_dir_name = sanitize_repo_name(match)
+ dest_dir = os.path.join(".data", repo_dir_name)
+ # Final safety: Don't allow traversal outside .data
+ if not os.path.abspath(dest_dir).startswith(os.path.abspath(".data")):
+ return JSONResponse(
+ status_code=400,
+ content={"error": "Invalid repository directory name"},
+ )
subprocess.run(
- ["git", "clone", file.filename, f".data/{repo_name}"], check=True
+ ["git", "clone", file.filename, dest_dir], check=True
)
# Note: This would need to be implemented to get content hashes of all files
# For now, we'll just return an error
return JSONResponse(
@@ -73,5 +115,5 @@
except Exception as error:
logger.error(f"Error during deletion: {str(error)}")
return JSONResponse(status_code=409, content={"error": str(error)})
- return router
+ return router
\ No newline at end of file
Explanation of Fix
Vulnerability Explanation: The repository URL taken from the uploaded file name was passed straight to git clone, so a value beginning with - could be interpreted as a git option (argument injection) and the derived directory name could escape the .data/ directory.
Patch Summary: An is_valid_github_repo_url helper accepts only HTTPS github.com clone URLs that match a strict owner/repo pattern and rejects anything starting with -. sanitize_repo_name builds the destination directory from the validated owner and repo names, a final check keeps the destination under .data/, and only then is subprocess.run(["git", "clone", ...]) invoked; invalid URLs return a 400 response.
Impact & Notes: Non-GitHub and SSH-style repository URLs are no longer accepted by this endpoint.
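A rough sketch of the validate-then-clone flow. The regex follows the patch's pattern (with a lazy repo group so a trailing .git is not folded into the directory name), the "--" separator is an extra hardening step not present in the original diff, and the function name is hypothetical.

```python
import re
import subprocess

GITHUB_REPO_PATTERN = re.compile(
    r"^https://github\.com/(?P<owner>[A-Za-z0-9._-]+)/(?P<repo>[A-Za-z0-9._-]+?)(\.git)?/?$"
)


def clone_if_valid(url: str, base_dir: str = ".data") -> str:
    """Clone only a strictly validated HTTPS GitHub URL into a fixed directory."""
    if url.startswith("-"):
        raise ValueError("URL may not start with '-' (git option injection)")
    match = GITHUB_REPO_PATTERN.match(url)
    if match is None:
        raise ValueError("Not an accepted GitHub repository URL")
    dest = f"{base_dir}/{match.group('owner')}_{match.group('repo')}"
    # "--" ends option parsing, so neither argument can ever be read as a git flag.
    subprocess.run(["git", "clone", "--", url, dest], check=True)
    return dest


# clone_if_valid("https://github.com/topoteretes/cognee.git")  # clones into .data/topoteretes_cognee
```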
| """ | ||
|
|
||
| import logging | ||
| import uuid | ||
| from typing import Dict, List, Optional, Any | ||
| import openai | ||
| from fastapi import APIRouter, Depends | ||
| from cognee.api.v1.responses.models import ( |
Unvalidated LLM Function Call Execution
Explanation of Fix
Vulnerability Summary:
The code allowed arbitrary function names and arguments from LLM output (influenced by user prompt or user-supplied tool definitions) to be dispatched via dispatch_function. Without explicit allow-listing or validation, this could lead to arbitrary code execution, data tampering, or privilege escalation (CWE-94). This is a critical "Excessive Agency" flaw per OWASP-ML08.
Fix Overview:
The fix introduces an explicit allow-list of permitted function names that can be dispatched, based on the tools in use (DEFAULT_TOOLS by default, or user-supplied ones validated properly). Before any function call is dispatched, the code now checks that the function name is in the allow-list of trusted tools. If not, it skips the call and records an error in the response. Argument validation is also enforced: if the function's tool definition specifies a JSON schema for arguments, the input is checked against it (and rejected if invalid). This closes the avenue of arbitrary backend code execution initiated by a manipulated LLM response.
Original Vulnerable Code:
for item in output:
if isinstance(item, dict) and item.get("type") == "function_call":
# This is a tool call from the new format
function_name = item.get("name", "")
arguments_str = item.get("arguments", "{}")
call_id = item.get("call_id", f"call_{uuid.uuid4().hex}")
# Create a format the dispatcher can handle
tool_call = {
"id": call_id,
"function": {"name": function_name, "arguments": arguments_str},
"type": "function",
}
# Dispatch the function
try:
function_result = await dispatch_function(tool_call)
output_status = "success"
except Exception as e:
logger.exception(f"Error executing function {function_name}: {e}")
function_result = f"Error executing {function_name}: {str(e)}"
output_status = "error"
processed_call = ResponseToolCall(
id=call_id,
type="function",
function=FunctionCall(name=function_name, arguments=arguments_str),
output=ToolCallOutput(status=output_status, data={"result": function_result}),
)
processed_tool_calls.append(processed_call)
Patched Code:
- The code now assembles a set of allowed function names from the currently active tool definitions (whether default or user-supplied).
- Before dispatching, it checks if the function name from the LLM output is in this allow-list.
- If not, it returns an error output for that tool call instead of dispatching.
- Optionally, if tool definitions supply JSON schema argument specs, argument validation is added (for simple JSON schema structure, if present).
- These minimal changes plug the critical "Excessive Agency" vector without breaking existing API contracts or introducing any new dependencies.
Potential Impacts:
- Some calls may now be rejected if a model attempts to call a tool not in the current allow-list.
- If callers provide tools, only those listed will be callable.
- No breaking changes for existing safe clients; unsafe function calls are now correctly blocked.
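A hypothetical sketch of an allow-list dispatcher: only registered tool names can be called, and arguments must decode to a JSON object. The real patch derives the allow-list from DEFAULT_TOOLS or validated request.tools and additionally checks arguments against each tool's JSON schema.

```python
import json
from typing import Any, Awaitable, Callable, Dict

# Hypothetical registry standing in for the tool allow-list; the real code
# builds the permitted-name set from DEFAULT_TOOLS or validated request.tools.
ALLOWED_FUNCTIONS: Dict[str, Callable[..., Awaitable[Any]]] = {}


def tool(name: str):
    """Register a coroutine as a callable tool."""
    def wrap(func):
        ALLOWED_FUNCTIONS[name] = func
        return func
    return wrap


@tool("search")
async def search(query: str) -> str:
    return f"results for {query!r}"


async def dispatch_tool_call(name: str, arguments_json: str) -> Any:
    """Dispatch only allow-listed functions, with JSON-decoded keyword arguments."""
    handler = ALLOWED_FUNCTIONS.get(name)
    if handler is None:
        raise PermissionError(f"Function {name!r} is not in the tool allow-list")
    kwargs = json.loads(arguments_json or "{}")
    if not isinstance(kwargs, dict):
        raise ValueError("Tool arguments must decode to a JSON object")
    return await handler(**kwargs)


# asyncio.run(dispatch_tool_call("search", '{"query": "graph"}'))  # -> "results for 'graph'"
# asyncio.run(dispatch_tool_call("os_system", "{}"))               # -> PermissionError
```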
Issues
| Type | Identifier | Message | Severity | Link |
|---|---|---|---|---|
| Application | CWE-94, ML08 | The function name and arguments executed by dispatch_function come directly from the model’s output, which can be influenced by user-supplied prompts and optional request.tools. Without an explicit allow-list or validation, a malicious user can coerce the LLM to emit a call to any backend function that dispatch_function can reach, potentially leading to arbitrary code execution, data tampering, or privilege escalation. This maps to CWE-94 (Improper Control of Dynamic Code Execution) and OWASP-ML / LLM Top-10 ‘Excessive Agency’. | critical | Link |
Suggested Fix
| """ | |
| import logging | |
| import uuid | |
| from typing import Dict, List, Optional, Any | |
| import openai | |
| from fastapi import APIRouter, Depends | |
| from cognee.api.v1.responses.models import ( | |
| """ | |
| import logging | |
| import uuid | |
| import json | |
| from typing import Dict, List, Optional, Any | |
| import openai | |
| from fastapi import APIRouter, Depends | |
| from cognee.api.v1.responses.models import ( |
)
    logger.info(f"Response: {response}")
    return response.model_dump()

@router.post("/", response_model=ResponseBody)
async def create_response(
    request: ResponseRequest,
    user: User = Depends(get_authenticated_user),
Unvalidated LLM Function Call Execution
Suggested Fix
)
    logger.info(f"Response: {response}")
    return response.model_dump()

+def get_allowed_tools_and_schemas(tools: List[Dict[str, Any]]) -> (set, dict):
+    """
+    Returns a set of allowed tool function names and optional argument schemas by function name.
+    """
+    allowed_names = set()
+    arg_schemas = {}
+    for tool in tools:
+        # OpenAI/OpenAPI-style tool definition, e.g. {"type": "function", "function": {"name": "...", ...}}
+        if isinstance(tool, dict):
+            if tool.get("type") == "function" and "function" in tool:
+                fn = tool["function"]
+                name = fn.get("name")
+                if name:
+                    allowed_names.add(name)
+                    if "parameters" in fn and isinstance(fn["parameters"], dict):
+                        arg_schemas[name] = fn["parameters"]
+    return allowed_names, arg_schemas
+
+def validate_arguments(arguments_str: str, arg_schema: dict) -> bool:
+    """
+    Very simple JSON schema validator for flat objects, to avoid dependencies.
+    Only checks required fields exist and types match for basic types (no recursion).
+    If arg_schema is absent or empty, always passes.
+    """
+    if not arg_schema:
+        return True
+    try:
+        args_obj = json.loads(arguments_str)
+    except Exception:
+        return False
+    # Only validate 'properties' and 'required'
+    props = arg_schema.get('properties', {})
+    required = arg_schema.get('required', [])
+    for req in required:
+        if req not in args_obj:
+            return False
+    type_map = {'string': str, 'number': (int, float), 'integer': int, 'boolean': bool, 'object': dict, 'array': list}
+    for prop, specs in props.items():
+        if prop in args_obj and 'type' in specs:
+            expect = type_map.get(specs['type'])
+            if expect and not isinstance(args_obj[prop], expect):
+                return False
+    return True
+
@router.post("/", response_model=ResponseBody)
async def create_response(
    request: ResponseRequest,
    user: User = Depends(get_authenticated_user),
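To make the flat-schema check in validate_arguments above concrete, the snippet below exercises the same required-keys and type-map logic on sample inputs. It inlines a condensed copy of the validator rather than importing it from cognee, so the names here are illustrative only.

import json

TYPE_MAP = {"string": str, "number": (int, float), "integer": int,
            "boolean": bool, "object": dict, "array": list}

def check_args(arguments_str: str, schema: dict) -> bool:
    # Condensed version of the flat validator: required keys must exist,
    # and declared property types must match.
    try:
        args = json.loads(arguments_str)
    except ValueError:
        return False
    if any(key not in args for key in schema.get("required", [])):
        return False
    for prop, spec in schema.get("properties", {}).items():
        expected = TYPE_MAP.get(spec.get("type", ""))
        if prop in args and expected and not isinstance(args[prop], expected):
            return False
    return True

schema = {"required": ["query"],
          "properties": {"query": {"type": "string"}, "top_k": {"type": "integer"}}}
print(check_args('{"query": "games", "top_k": 3}', schema))  # True
print(check_args('{"top_k": 3}', schema))                    # False: missing required "query"
print(check_args('{"query": 42}', schema))                   # False: wrong type for "query"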
    """
    # Use default tools if none provided
    tools = request.tools or DEFAULT_TOOLS

    # Call the API
    response = await call_openai_api_for_model(
        input_text=request.input,
        model=request.model,
Unvalidated LLM Function Call Execution
Suggested Fix
    """
    # Use default tools if none provided
    tools = request.tools or DEFAULT_TOOLS

+   # Build allow-list of permitted function names and arg schemas
+   allowed_function_names, arg_schemas = get_allowed_tools_and_schemas(tools)
+
    # Call the API
    response = await call_openai_api_for_model(
        input_text=request.input,
        model=request.model,
    # Process any function tool calls from the output
    for item in output:
        if isinstance(item, dict) and item.get("type") == "function_call":
            # This is a tool call from the new format
            function_name = item.get("name", "")
            arguments_str = item.get("arguments", "{}")
            call_id = item.get("call_id", f"call_{uuid.uuid4().hex}")

            # Create a format the dispatcher can handle
            tool_call = {
                "id": call_id,
                "function": {"name": function_name, "arguments": arguments_str},
Unvalidated LLM Function Call Execution
Suggested Fix
# Process any function tool calls from the output
for item in output:
    if isinstance(item, dict) and item.get("type") == "function_call":
-       # This is a tool call from the new format
        function_name = item.get("name", "")
        arguments_str = item.get("arguments", "{}")
        call_id = item.get("call_id", f"call_{uuid.uuid4().hex}")

+       # Validate function name against allow-list
+       if function_name not in allowed_function_names:
+           logger.warning(f"Blocked function call attempt to unallowed function '{function_name}'.")
+           processed_call = ResponseToolCall(
+               id=call_id,
+               type="function",
+               function=FunctionCall(name=function_name, arguments=arguments_str),
+               output=ToolCallOutput(
+                   status="error",
+                   data={"result": f"Function '{function_name}' is not allowed."},
+               ),
+           )
+           processed_tool_calls.append(processed_call)
+           continue
+
+       # Validate argument structure if schema present
+       arg_schema = arg_schemas.get(function_name)
+       if arg_schema and not validate_arguments(arguments_str, arg_schema):
+           logger.warning(f"Function call '{function_name}' arguments failed validation.")
+           processed_call = ResponseToolCall(
+               id=call_id,
+               type="function",
+               function=FunctionCall(name=function_name, arguments=arguments_str),
+               output=ToolCallOutput(
+                   status="error",
+                   data={"result": f"Invalid arguments for function '{function_name}'."},
+               ),
+           )
+           processed_tool_calls.append(processed_call)
+           continue
+
        # Create a format the dispatcher can handle
        tool_call = {
            "id": call_id,
            "function": {"name": function_name, "arguments": arguments_str},
Cypher Query Injection in Vector Search Method

@@ -4,8 +4,9 @@
import json
from textwrap import dedent
from uuid import UUID
from webbrowser import Error
+import re
from falkordb import FalkorDB
from cognee.exceptions import InvalidValueError
@@ -606,19 +607,44 @@
if query_text and not query_vector:
query_vector = (await self.embed_data([query_text]))[0]
- [label, attribute_name] = collection_name.split(".")
+ # Input validation and sanitization
+ if not isinstance(collection_name, str) or '.' not in collection_name:
+ raise InvalidValueError(message="collection_name must be in the format 'Label.AttributeName'.")
+ parts = collection_name.split('.')
+ if len(parts) != 2:
+ raise InvalidValueError(message="collection_name must be in the format 'Label.AttributeName'.")
+
+ label, attribute_name = parts
+ allowed_pattern = re.compile(r'^[A-Za-z_][A-Za-z0-9_]*$')
+ if not (allowed_pattern.fullmatch(label) and allowed_pattern.fullmatch(attribute_name)):
+ raise InvalidValueError(
+ message="Label and attribute names must start with a letter or underscore and contain only letters, digits, or underscores."
+ )
+
+ # Validate limit
+ if not isinstance(limit, int):
+ try:
+ limit = int(limit)
+ except Exception:
+ raise InvalidValueError(message="limit must be an integer.")
+
+ if not (1 <= limit <= 1000):
+ raise InvalidValueError(message="limit value must be between 1 and 1000.")
+
+ # Defensive: query_vector should be a list of floats (let downstream/embedding engine guarantee type)
+ # Compose Cypher query with validated/sanitized values
query = dedent(
f"""
CALL db.idx.vector.queryNodes(
'{label}',
'{attribute_name}',
{limit},
vecf32({query_vector})
) YIELD node, score
- """
+ """
).strip()
result = self.query(query)
@@ -777,5 +803,5 @@
async def prune(self):
"""
Prune the graph by deleting the entire graph structure.
"""
- await self.delete_graph()
+ await self.delete_graph()
\ No newline at end of file
Explanation of Fix

Vulnerability: the vector search method split collection_name on "." and interpolated the resulting label and attribute name, together with limit, directly into the Cypher query passed to db.idx.vector.queryNodes, so a crafted collection name or limit could inject arbitrary Cypher.

Fix Summary: the patch validates that collection_name has the form 'Label.AttributeName', restricts both identifiers to letters, digits, and underscores (not starting with a digit), and coerces limit to an integer bounded between 1 and 1000 before the query is composed; invalid input raises InvalidValueError.

No new dependencies were introduced, and existing behavior is unchanged for valid use. Only malformed or dangerous input will raise exceptions. This ensures backward compatibility and mitigates the possibility of query injection.
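To illustrate the identifier check added above, here is a standalone sketch; the collection names are invented, and a plain ValueError stands in for cognee's InvalidValueError.

import re

IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

def split_collection_name(collection_name: str):
    # Mirrors the patched validation: exactly 'Label.AttributeName', both parts
    # limited to letters, digits, and underscores, not starting with a digit.
    parts = collection_name.split(".")
    if len(parts) != 2 or not all(IDENTIFIER.fullmatch(part) for part in parts):
        raise ValueError(f"Rejected collection name: {collection_name!r}")
    return parts

print(split_collection_name("DocumentChunk.text_embedding"))      # accepted
try:
    split_collection_name("Chunk.x') YIELD node RETURN node //")  # injection attempt
except ValueError as err:
    print(err)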
async def execute_query(self, query):
    """
    Execute a raw SQL query against the database asynchronously.

    Parameters:
    -----------
    - query: The SQL query string to execute.

    Returns:
    --------
    The result set as a list of dictionaries, with each dictionary representing a row.
    """
    async with self.engine.begin() as connection:
        result = await connection.execute(text(query))
        return [dict(row) for row in result]
Unrestricted SQL Query Execution Through Raw Query Interface
Suggested Fix
@@ -475,8 +475,35 @@
raise e
logger.info("Database deleted successfully.")
+ def _validate_table_identifier(self, name: str) -> bool:
+ """
+ Validate a table or schema identifier for use in SQLite PRAGMA statements.
+ Accepts standard identifiers (alphanumeric + underscore, not starting with a digit),
+ or SQLite's quoted identifiers if they are well-formed (e.g., "table-name").
+ """
+ if not name or len(name) > 128:
+ return False
+ # Accept unquoted valid Python identifiers (letters, digits, underscore, not starting with digit)
+ if name.isidentifier():
+ return True
+ # Accept quoted identifiers: must start and end with double quotes, and inner chars allowed by SQLite
+ # https://www.sqlite.org/lang_keywords.html (double-quoted or square-bracketed names)
+ if (
+ len(name) >= 2
+ and name[0] == name[-1]
+ and (name[0] == '"' or name[0] == '`' or name[0] == '[')
+ ):
+ # Very basic quoted identifier check
+ if name[0] == '"' and '"' not in name[1:-1]:
+ return True
+ if name[0] == '`' and '`' not in name[1:-1]:
+ return True
+ if name[0] == '[' and name[-1] == ']' and ']' not in name[1:-1]:
+ return True
+ return False
+
async def extract_schema(self):
"""
Extract the schema information of the database, including tables and their respective
columns and constraints.
@@ -492,8 +519,15 @@
schema = {}
if self.engine.dialect.name == "sqlite":
for table_name in tables:
+ # Validate table name to prevent SQL injection
+ if not self._validate_table_identifier(table_name):
+ logger.warning(
+ f"Skipping potentially unsafe table name during schema extraction: {table_name!r}"
+ )
+ continue
+
schema[table_name] = {"columns": [], "primary_key": None, "foreign_keys": []}
# Get column details
columns_result = await connection.execute(
@@ -589,5 +623,5 @@
logger.warning(
f"Missing value in foreign key information. \nColumn value: {col}\nReference column value: {ref_col}\n"
)
- return schema
+ return schema
\ No newline at end of file
Explanation of Fix
Vulnerability: The extract_schema method uses the user-influenced table names directly in raw SQL PRAGMA statements, interpolating them into text(f"PRAGMA table_info('{table_name}');") and text(f"PRAGMA foreign_key_list('{table_name}');") without any sanitization or validation. An attacker can create a malicious table name (e.g., with quotes, semicolons, or SQL) and, when extract_schema() is called, trigger SQL Injection leading to full database compromise.
Fix: The patch validates each table_name before use in any PRAGMA statement (for SQLite) in extract_schema(). A strict validation function _validate_table_identifier is added to ensure that all such names:
- Are not empty, do not exceed reasonable length (128 chars), and
- Consist only of alphanumerics or underscore, or, if quoted, conform to allowed quoted identifier syntax.
If any name fails validation, it is skipped with a logged warning and no PRAGMA is executed for that table. This prevents injection through table or schema names.
Potential impacts:
- Maliciously or accidentally malformed/unconventional table names in an existing DB may be skipped and logged, but not included in the schema extraction, which is the secure behavior.
- The change does not add any new dependencies and is contained to this file and path.
- There are no breaking changes to the interface or upstream logic.
Note: No imports were added or changed; the helper is internal and all logic is handled within the SQLAlchemyAdapter class.
Issues
| Type | Identifier | Message | Severity | Link |
|---|---|---|---|---|
| Application | CWE-89 | The adapter exposes a helper that executes any string supplied to execute_query without validation or parameter binding. If the method is reachable from an API endpoint, a user can submit arbitrary SQL (e.g., '; DROP DATABASE cognee;--). This provides full read/write/DDL capabilities, constituting a critical SQL Injection / unrestricted query execution flaw (CWE-89). | critical | Link |
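For the execute_query issue above, the usual remedy is to keep values out of the SQL string entirely via bound parameters. A hedged sketch using SQLAlchemy's async engine follows; it is not part of this PR, the table, column, and database URL are invented, and an async driver such as aiosqlite is assumed.

from sqlalchemy import text
from sqlalchemy.ext.asyncio import create_async_engine

async def fetch_rows_by_name(name: str):
    # The :name placeholder is bound by the driver, so input like
    # "x'; DROP TABLE data;--" is treated as plain data, not SQL.
    engine = create_async_engine("sqlite+aiosqlite:///example.db")
    stmt = text("SELECT id, name FROM users WHERE name = :name")
    async with engine.begin() as connection:
        result = await connection.execute(stmt, {"name": name})
        return [dict(row._mapping) for row in result]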
Description
DCO Affirmation
I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin.