
Conversation

@borisarzentar
Member

Description

DCO Affirmation

I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin.

Vasilije1990 and others added 30 commits April 18, 2025 16:31
<!-- .github/pull_request_template.md -->

## Description
Resolve issue with .venv being broken when using docker compose with
Cognee

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.

---------

Co-authored-by: Boris Arzentar <[email protected]>
… 1947 (#760)

<!-- .github/pull_request_template.md -->

## Description
<!-- Provide a clear description of the changes in this PR -->

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.

---------

Co-authored-by: Boris <[email protected]>
Co-authored-by: Igor Ilic <[email protected]>
Co-authored-by: Igor Ilic <[email protected]>
<!-- .github/pull_request_template.md -->

## Description
Add support for UV and for Poetry package management

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.
<!-- .github/pull_request_template.md -->

## Description
Switch typing from str to UUID for NetworkX node_id

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.
<!-- .github/pull_request_template.md -->

## Description
Add both SSE and stdio support for Cognee MCP

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.
…83] (#782)

<!-- .github/pull_request_template.md -->

## Description
Add log handling options for cognee exceptions

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.
<!-- .github/pull_request_template.md -->

## Description
Fix issue with failing versions GitHub Actions workflows

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.
<!-- .github/pull_request_template.md -->

## Description
<!-- Provide a clear description of the changes in this PR -->

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.

---------

Co-authored-by: Vasilije <[email protected]>
<!-- .github/pull_request_template.md -->

## Description
<!-- Provide a clear description of the changes in this PR -->

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.
<!-- .github/pull_request_template.md -->

## Description
<!-- Provide a clear description of the changes in this PR -->

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.

Co-authored-by: Vasilije <[email protected]>
<!-- .github/pull_request_template.md -->

## Description
<!-- Provide a clear description of the changes in this PR -->

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.

---------

Co-authored-by: Hande <[email protected]>
Co-authored-by: Vasilije <[email protected]>
<!-- .github/pull_request_template.md -->

## Description
<!-- Provide a clear description of the changes in this PR -->

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.

---------

Co-authored-by: Hande <[email protected]>
Co-authored-by: Vasilije <[email protected]>
<!-- .github/pull_request_template.md -->

## Description
This PR adds support for the Memgraph graph database following the
[graph database integration
guide](https://docs.cognee.ai/contributing/adding-providers/graph-db/graph-database-integration):
- Implemented `MemgraphAdapter` for interfacing with Memgraph.
- Updated `get_graph_engine.py` to return MemgraphAdapter when
appropriate.
- Added a test script: `test_memgraph.py`.
- Created a dedicated test workflow:
`.github/workflows/test_memgraph.yml`.

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.

---------

Co-authored-by: Vasilije <[email protected]>
Co-authored-by: Boris <[email protected]>
<!-- .github/pull_request_template.md -->

## Description
refactor: Handle boto3 and s3fs dependencies better

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.
<!-- .github/pull_request_template.md -->

## Description
<!-- Provide a clear description of the changes in this PR -->

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.
<!-- .github/pull_request_template.md -->

## Description
Update LanceDB and rewrite data points to run async

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.

---------

Co-authored-by: Boris <[email protected]>
Co-authored-by: Boris Arzentar <[email protected]>
<!-- .github/pull_request_template.md -->

## Description
<!-- Provide a clear description of the changes in this PR -->

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.
<!-- .github/pull_request_template.md -->

## Description
<!-- Provide a clear description of the changes in this PR -->

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.
<!-- .github/pull_request_template.md -->

## Description
As discussed with @hande-k and Lazar, I've created a short demo to
illustrate how to get the pagerank rankings from the knowledge graph
given the nx engine. This is a POC and a first step towards solving
#643.

Please let me know what you think, and how to proceed from here. :)

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.

---------

Co-authored-by: Boris <[email protected]>
Co-authored-by: Hande <[email protected]>
Co-authored-by: Vasilije <[email protected]>
<!-- .github/pull_request_template.md -->

## Description
Added tools to check current cognify and codify status

## DCO Affirmation
I affirm that all code in every commit of this pull request conforms to
the terms of the Topoteretes Developer Certificate of Origin.
@coderabbitai
Contributor

coderabbitai bot commented May 30, 2025

Walkthrough

This update introduces two new advanced graph completion retrievers—GraphCompletionCotRetriever and GraphCompletionContextExtensionRetriever—enabling chain-of-thought and iterative context extension for graph-based question answering. It adds new search types, adapter methods, and node/edge filtering by type and name. Extensive docstring improvements, new example scripts, OpenAI-compatible API endpoints, and expanded test coverage are included.
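
To make the new search surface concrete, here is a minimal client-side sketch of a filtered graph-completion search. The endpoint and field names are taken from the walkthrough and the sequence diagram below; the exact request schema, host, and authentication are assumptions, not confirmed API details.

```python
import requests

# Hypothetical payload: field names mirror the sequence diagram
# (query, search_type, node_type, node_name); the real schema may differ.
payload = {
    "query": "How are orders linked to customers?",
    "search_type": "GRAPH_COMPLETION_COT",   # new chain-of-thought search type
    "node_type": "ColumnValue",              # optional: restrict to a node type
    "node_name": ["customer_id"],            # optional: restrict to named nodes
}

response = requests.post("http://localhost:8000/api/v1/search", json=payload, timeout=60)
response.raise_for_status()
print(response.json())
```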

Changes

| Files/Groups | Change Summary |
| --- | --- |
| `cognee/modules/retrieval/graph_completion_cot_retriever.py`, `graph_completion_context_extension_retriever.py` | Added new retriever classes for chain-of-thought and context extension graph completion. |
| `cognee/modules/search/types/SearchType.py` | Added GRAPH_COMPLETION_COT and GRAPH_COMPLETION_CONTEXT_EXTENSION search types. |
| `cognee/modules/search/methods/search.py`, `cognee/modules/retrieval/graph_completion_retriever.py`, `graph_summary_completion_retriever.py` | Integrated new retrievers, added node_type/node_name filtering, expanded docstrings. |
| `cognee/modules/graph/cognee_graph/CogneeGraph.py`, `expand_with_nodes_and_edges.py`, `utils/brute_force_triplet_search.py` | Added node_type/node_name filtering to graph projection and triplet search. |
| `cognee/infrastructure/databases/graph/*/adapter.py`, `graph_db_interface.py` | Added/implemented get_nodeset_subgraph in adapters; expanded docstrings. |
| `cognee/infrastructure/databases/graph/supported_databases.py`, `use_graph_adapter.py` | Added dynamic adapter registration for graph databases. |
| `cognee/infrastructure/databases/vector/supported_databases.py`, `use_vector_adapter.py` | Added dynamic adapter registration for vector databases. |
| `cognee/api/v1/search/search.py`, `cognee/modules/engine/models/ColumnValue.py` | Added node_type/node_name search support; new ColumnValue data point. |
| `cognee/tasks/ingestion/migrate_relational_database.py` | Migrates column-level data as nodes (ColumnValue) in graph. |
| `cognee/modules/engine/models/node_set.py` | Removed metadata index_fields from NodeSet. |
| `cognee/modules/engine/models/__init__.py` | Exposed ColumnValue in public API. |
| `cognee/modules/visualization/cognee_network_visualization.py` | Added color mapping for ColumnValue. |
| `cognee/tests/unit/modules/retrieval/graph_completion_retriever_cot_test.py`, `graph_completion_retriever_context_extension_test.py` | New unit tests for advanced graph completion retrievers. |
| `cognee/tests/test_kuzu.py`, `test_neo4j.py`, `test_relational_db_migration.py` | Added/updated tests for node set context and updated migration expectations. |
| `cognee/shared/data_models.py` | Removed properties field from Node/Edge models. |
| `cognee/modules/retrieval/*`, `cognee/modules/search/methods/search.py`, `cognee/modules/graph/utils/expand_with_nodes_and_edges.py` | Expanded docstrings, clarified method contracts, and improved documentation throughout retrievers and search modules. |
| `cognee/api/v1/responses/*`, `cognee/api/client.py` | Introduced OpenAI-compatible responses API and models. |
| `examples/database_examples/*` | Added new database integration examples for ChromaDB, FalkorDB, Kuzu, Milvus, Neo4j, PGVector, Qdrant, Weaviate. |
| `notebooks/cognee_openai_compatable_demo.ipynb` | Added OpenAI client demo notebook for Cognee API. |
| Many other files | Extensive docstring/documentation enhancements; no functional changes. |

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant API
    participant SearchModule
    participant Retriever
    participant GraphAdapter

    User->>API: POST /api/v1/search (query, search_type, node_type, node_name)
    API->>SearchModule: search(query, search_type, node_type, node_name)
    SearchModule->>Retriever: get_triplets/query (with node_type/node_name)
    Retriever->>GraphAdapter: get_nodeset_subgraph(node_type, node_name)
    GraphAdapter-->>Retriever: Nodes/Edges
    Retriever-->>SearchModule: Triplets/Context
    SearchModule-->>API: Results
    API-->>User: Search response
sequenceDiagram
    participant User
    participant API
    participant GraphCompletionCotRetriever
    participant LLM
    participant GraphAdapter

    User->>API: POST /api/v1/search (GRAPH_COMPLETION_COT)
    API->>GraphCompletionCotRetriever: get_completion(query)
    loop Iterative CoT rounds
        GraphCompletionCotRetriever->>GraphAdapter: get_triplets/followup (node_type/node_name)
        GraphCompletionCotRetriever->>LLM: validate answer, generate followup
        LLM-->>GraphCompletionCotRetriever: Reasoning, Followup Q
    end
    GraphCompletionCotRetriever-->>API: Final answer
    API-->>User: Completion response

Possibly related PRs

  • topoteretes/cognee#751: Adds MemgraphAdapter and updates graph engine logic, directly related to new get_nodeset_subgraph support and adapter changes in this PR.
  • topoteretes/cognee#383: Introduces new data points and chunking for code graphs, related via enhancements to graph entity modeling and retrieval.
  • topoteretes/cognee#131: Refactors document reading and chunking, related through changes in knowledge extraction and retrieval pipeline.

Suggested reviewers

  • borisarzentar

Poem

In the warren of code, new retrievers appear,
Chasing answers through graphs, far and near.
With chain-of-thought hops and context that grows,
They sniff out connections where knowledge flows.
Node types and names now guide their way,
So Cognee’s clever rabbits can reason and play!
🐇✨


@gitguardian

gitguardian bot commented May 30, 2025

⚠️ GitGuardian has uncovered 1 secret following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secret in your pull request
| GitGuardian id | GitGuardian status | Secret | Commit | Filename |
| --- | --- | --- | --- | --- |
| 17116131 | Triggered | Generic Password | 3b07f3c | examples/database_examples/neo4j_example.py |
🛠 Guidelines to remediate hardcoded secrets
  1. Understand the implications of revoking this secret by investigating where it is used in your code.
  2. Replace and store your secret safely (a minimal sketch follows this list). Learn the best practices here.
  3. Revoke and rotate this secret.
  4. If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.
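
As a concrete illustration of step 2, here is a minimal sketch of loading the Neo4j credentials from the environment instead of hardcoding them in examples/database_examples/neo4j_example.py (the variable names below are assumptions, not the example's actual configuration keys):

```python
import os

# Hypothetical environment variable names; adapt to the example's real config keys.
neo4j_url = os.environ.get("NEO4J_URL", "bolt://localhost:7687")
neo4j_user = os.environ.get("NEO4J_USER", "neo4j")
neo4j_password = os.environ.get("NEO4J_PASSWORD")

if neo4j_password is None:
    raise RuntimeError("Set NEO4J_PASSWORD in the environment instead of hardcoding it.")
```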

To avoid such incidents in the future consider


🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.

# Conflicts:
#	CONTRIBUTING.md
#	Dockerfile
#	README.md
#	cognee-mcp/pyproject.toml
#	cognee-mcp/src/server.py
#	cognee-mcp/uv.lock
#	cognee/infrastructure/databases/graph/memgraph/memgraph_adapter.py
#	cognee/infrastructure/databases/graph/networkx/adapter.py
#	cognee/infrastructure/databases/vector/chromadb/ChromaDBAdapter.py
#	cognee/infrastructure/databases/vector/milvus/MilvusAdapter.py
#	cognee/infrastructure/databases/vector/qdrant/QDrantAdapter.py
#	cognee/infrastructure/databases/vector/weaviate_db/WeaviateAdapter.py
#	cognee/modules/retrieval/graph_completion_retriever.py
#	cognee/modules/retrieval/utils/brute_force_triplet_search.py
#	cognee/tests/test_neo4j.py
#	entrypoint.sh
#	pyproject.toml
#	uv.lock
Contributor

@coderabbitai coderabbitai bot left a comment


Caution

Inline review comments failed to post. This is likely due to GitHub's limits when posting large numbers of comments. If you are seeing this consistently, it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.

🛑 Comments failed to post (53)
cognee/infrastructure/llm/rate_limiter.py (3)

205-211: 🛠️ Refactor suggestion

Consolidate Returns in async_wait_if_needed
The async_wait_if_needed docstring also duplicates the Returns: section. Merge the duplicate sections into a single description to avoid confusion.

🤖 Prompt for AI Agents
In cognee/infrastructure/llm/rate_limiter.py around lines 205 to 211, the
async_wait_if_needed function's docstring has duplicated Returns sections.
Consolidate these into a single Returns section that clearly states the function
returns a float representing the time waited in seconds, removing any redundant
or repeated descriptions.

180-186: 🛠️ Refactor suggestion

Eliminate duplicate Returns in docstring
Like hit_limit, the wait_if_needed method’s docstring duplicates its Returns: section. Merge into one block to maintain clarity and adhere to a single style for return documentation.

🤖 Prompt for AI Agents
In cognee/infrastructure/llm/rate_limiter.py around lines 180 to 186, the
docstring for the wait_if_needed method contains duplicate Returns sections.
Remove the redundant Returns block and consolidate the return description into a
single Returns section to maintain clarity and consistent docstring style.

156-162: 🛠️ Refactor suggestion

Remove duplicated Returns section
The hit_limit docstring contains two Returns: blocks describing the same boolean outcome. Please remove the redundant section and consolidate into a single, clear Returns: entry.

🤖 Prompt for AI Agents
In cognee/infrastructure/llm/rate_limiter.py around lines 156 to 162, the
docstring for the hit_limit method contains two duplicated Returns sections both
describing the same boolean return value. Remove the redundant Returns block and
keep only one clear and concise Returns section that states the method returns a
boolean indicating if the request is allowed.
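
To make the requested shape concrete, here is a minimal sketch of a docstring with a single consolidated Returns section (the class, signature, and wording are illustrative, not the actual rate_limiter implementation):

```python
class RateLimiterSketch:
    """Illustrative only; not the real cognee rate limiter."""

    def hit_limit(self) -> bool:
        """
        Check whether the current request is allowed under the configured rate limit.

        Returns:
            bool: True if the request is allowed, False otherwise.
        """
        return True  # placeholder: the real method consults the limiter's state
```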
cognee/tasks/documents/classify_documents.py (2)

55-62: 🛠️ Refactor suggestion

⚠️ Potential issue

Handle non-string external_metadata inputs to avoid unhandled exceptions
The current json.loads call will raise a TypeError if document.external_metadata is None or not a string, which isn’t caught by the existing except json.JSONDecodeError.

Consider expanding the exception handler:

     try:
         external_metadata = json.loads(document.external_metadata)
-    except json.JSONDecodeError:
+    except (json.JSONDecodeError, TypeError):
         return

This ensures that non-string or missing metadata won’t cause an unhandled exception.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    try:
        external_metadata = json.loads(document.external_metadata)
    except (json.JSONDecodeError, TypeError):
        return
🤖 Prompt for AI Agents
In cognee/tasks/documents/classify_documents.py around lines 55 to 62, the code
uses json.loads on document.external_metadata without checking if it is a
string, causing a TypeError if it is None or another non-string type. To fix
this, expand the exception handling to also catch TypeError along with
JSONDecodeError, ensuring that non-string or missing external_metadata does not
raise an unhandled exception and the function returns early as intended.

93-113: 💡 Verification agent

🧩 Analysis chain

Document or handle missing file extension mappings
classify_documents will raise a KeyError if a data_item.extension is not in EXTENSION_TO_DOCUMENT_CLASS. Please either document this behavior in the docstring or handle it gracefully (e.g., skip unknown extensions or raise a more descriptive error).

For example, you could:

-        document = EXTENSION_TO_DOCUMENT_CLASS[data_item.extension](
+        doc_cls = EXTENSION_TO_DOCUMENT_CLASS.get(data_item.extension)
+        if doc_cls is None:
+            raise ValueError(f"Unsupported document extension: {data_item.extension}")
+        document = doc_cls(
             id=data_item.id,
             title=f"{data_item.name}.{data_item.extension}",
             raw_data_location=data_item.raw_data_location,
             name=data_item.name,
             mime_type=data_item.mime_type,
             external_metadata=json.dumps(data_item.external_metadata, indent=4),
         )

Additionally, update the docstring’s Raises section to reflect the new error.


🏁 Script executed:

#!/bin/bash
# Search for the mapping of extensions to document classes
rg -n "EXTENSION_TO_DOCUMENT_CLASS" -N

# Inspect the classify_documents function and its docstring
rg -n "def classify_documents" -n cognee/tasks/documents/classify_documents.py
sed -n '1,200p' cognee/tasks/documents/classify_documents.py | sed -n '1,50p'

Length of output: 1922


🏁 Script executed:

#!/bin/bash
# Display the classify_documents function including its docstring and core logic
sed -n '80,160p' cognee/tasks/documents/classify_documents.py

Length of output: 1726


Handle unsupported document extensions in classify_documents
To prevent a KeyError when data_item.extension is not in EXTENSION_TO_DOCUMENT_CLASS, add a guard and raise a clear error, and update the docstring’s Raises section accordingly.

• File: cognee/tasks/documents/classify_documents.py

  • In the for data_item in data_documents: loop, replace the direct lookup with a .get() check before instantiating the document class.
  • Update the function docstring to include a Raises: block.

Suggested diff:

 async def classify_documents(data_documents: list[Data]) -> list[Document]:
     """
     Classifies a list of data items into specific document types based on their file
     extensions.
+
+    Raises:
+        ValueError: If a document’s extension is not supported.
     """
     documents = []
     for data_item in data_documents:
-        document = EXTENSION_TO_DOCUMENT_CLASS[data_item.extension](
+        doc_cls = EXTENSION_TO_DOCUMENT_CLASS.get(data_item.extension)
+        if doc_cls is None:
+            raise ValueError(f"Unsupported document extension: {data_item.extension}")
+        document = doc_cls(
             id=data_item.id,
             title=f"{data_item.name}.{data_item.extension}",
             raw_data_location=data_item.raw_data_location,
             name=data_item.name,
             mime_type=data_item.mime_type,
             external_metadata=json.dumps(data_item.external_metadata, indent=4),
         )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

async def classify_documents(data_documents: list[Data]) -> list[Document]:
    """
    Classifies a list of data items into specific document types based on their file
    extensions.

    This function processes each item in the provided list of data documents, retrieves
    relevant metadata, and creates instances of document classes mapped to their extensions.
    It ensures that the data items are valid before performing the classification and
    invokes `update_node_set` to extract and set relevant node information from the
    document's external metadata.

    Parameters:
    -----------

        - data_documents (list[Data]): A list of Data objects representing the documents to
          be classified.

    Returns:
    --------

        - list[Document]: A list of Document objects created based on the classified data
          documents.

    Raises:
        ValueError: If a document’s extension is not supported.
    """
    documents = []
    for data_item in data_documents:
        doc_cls = EXTENSION_TO_DOCUMENT_CLASS.get(data_item.extension)
        if doc_cls is None:
            raise ValueError(f"Unsupported document extension: {data_item.extension}")
        document = doc_cls(
            id=data_item.id,
            title=f"{data_item.name}.{data_item.extension}",
            raw_data_location=data_item.raw_data_location,
            name=data_item.name,
            mime_type=data_item.mime_type,
            external_metadata=json.dumps(data_item.external_metadata, indent=4),
        )
        update_node_set(document, data_item)
        documents.append(document)
    return documents
🤖 Prompt for AI Agents
In cognee/tasks/documents/classify_documents.py around lines 93 to 113, the
classify_documents function currently does a direct lookup in
EXTENSION_TO_DOCUMENT_CLASS using data_item.extension, which can raise a
KeyError for unsupported extensions. Modify the loop to first check if the
extension exists in the mapping using .get() or an if condition, and if not,
raise a descriptive error indicating the unsupported extension. Also, update the
function's docstring to add a Raises section documenting this new error for
unsupported extensions.
cognee/infrastructure/databases/vector/supported_databases.py (1)

1-1: 🛠️ Refactor suggestion

Add module-level docstring
This new file should include a top-level docstring to explain the purpose of the supported_databases mapping.

+"""
+Mapping of supported vector database adapters by provider name.
+"""
 supported_databases = {}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

"""
Mapping of supported vector database adapters by provider name.
"""
supported_databases = {}
🤖 Prompt for AI Agents
In cognee/infrastructure/databases/vector/supported_databases.py at line 1, add
a module-level docstring at the very top of the file that clearly explains the
purpose of the supported_databases dictionary. This docstring should briefly
describe what the mapping represents and how it is intended to be used within
the module.
cognee/infrastructure/llm/prompts/cot_validation_system_prompt.txt (1)

1-2: 🛠️ Refactor suggestion

Fix grammar and clarity in system prompt

The sentences contain subject-verb mismatches and are hard to parse. Apply this diff:

- You are a helpful agent who are allowed to use only the provided question answer and context.
- I want to you find reasoning what is missing from the context or why the answer is not answering the question or not correct strictly based on the context.
+ You are a helpful agent who is allowed to use only the provided question, answer, and context.
+ Identify missing reasoning in the context or explain why the answer does not correctly address the question, strictly based on the provided context.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

You are a helpful agent who is allowed to use only the provided question, answer, and context.
Identify missing reasoning in the context or explain why the answer does not correctly address the question, strictly based on the provided context.
🧰 Tools
🪛 LanguageTool

[uncategorized] ~1-~1: This verb does not appear to agree with the subject. Consider using a different form.
Context: You are a helpful agent who are allowed to use only the provided questi...

(AI_EN_LECTOR_REPLACEMENT_VERB_AGREEMENT)

🤖 Prompt for AI Agents
In cognee/infrastructure/llm/prompts/cot_validation_system_prompt.txt at lines 1
to 2, fix the grammar and clarity issues by correcting subject-verb agreement
and rephrasing for better readability. Change "You are a helpful agent who are
allowed" to "You are a helpful agent who is allowed" and reword the second
sentence to clearly instruct the agent to find missing reasoning or explain why
the answer is incorrect strictly based on the provided context.
cognee/tests/unit/modules/search/search_methods_test.py (1)

6-6: ⚠️ Potential issue

Remove unused and incorrect import.

The import from pylint.checkers.utils import node_type is unused and appears to be incorrect. The node_type parameter in the test is being set to None, not using this imported symbol.

Remove the unused import:

-from pylint.checkers.utils import node_type
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.


🧰 Tools
🪛 Ruff (0.11.9)

6-6: pylint.checkers.utils.node_type imported but unused

Remove unused import: pylint.checkers.utils.node_type

(F401)

🪛 Pylint (3.3.7)

[warning] 6-6: Unused node_type imported from pylint.checkers.utils

(W0611)

🤖 Prompt for AI Agents
In cognee/tests/unit/modules/search/search_methods_test.py at line 6, remove the
unused and incorrect import statement 'from pylint.checkers.utils import
node_type' since the test uses 'node_type' as a parameter set to None and does
not require this import.
cognee/api/v1/config/config.py (1)

158-158: 🛠️ Refactor suggestion

Use InvalidAttributeError for consistency.

While removing the message= keyword fixes the immediate issue with AttributeError, this creates inconsistency with other methods in the file. All other configuration methods (set_llm_config, set_relational_db_config, set_migration_db_config, set_vector_db_config) use InvalidAttributeError(message=...) for the same error scenario.

Apply this diff for consistency:

-                raise AttributeError(f"'{key}' is not a valid attribute of the config.")
+                raise InvalidAttributeError(
+                    message=f"'{key}' is not a valid attribute of the config."
+                )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

                raise InvalidAttributeError(
                    message=f"'{key}' is not a valid attribute of the config."
                )
🤖 Prompt for AI Agents
In cognee/api/v1/config/config.py at line 158, replace the raised AttributeError
with InvalidAttributeError using the message= keyword argument to maintain
consistency with other configuration methods. Change the exception to raise
InvalidAttributeError(message=f"'{key}' is not a valid attribute of the
config.") so it matches the existing error handling style in the file.
cognee/infrastructure/databases/graph/use_graph_adapter.py (1)

4-5: ⚠️ Potential issue

Fix misleading parameter names and add proper documentation.

The parameter names vector_db_name and vector_db_adapter are misleading since this function is for graph database adapters. Additionally, the function lacks a docstring and type hints, which is inconsistent with the extensive documentation being added elsewhere in this PR.

-def use_graph_adapter(vector_db_name, vector_db_adapter):
-    supported_databases[vector_db_name] = vector_db_adapter
+def use_graph_adapter(graph_db_name: str, graph_db_adapter) -> None:
+    """
+    Register or update a graph database adapter in the supported databases registry.
+    
+    Parameters:
+    -----------
+    
+        - graph_db_name (str): The name/identifier for the graph database.
+        - graph_db_adapter: The adapter instance for the graph database.
+    """
+    if not isinstance(graph_db_name, str) or not graph_db_name.strip():
+        raise ValueError("Graph database name must be a non-empty string")
+    
+    supported_databases[graph_db_name] = graph_db_adapter
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

def use_graph_adapter(graph_db_name: str, graph_db_adapter) -> None:
    """
    Register or update a graph database adapter in the supported databases registry.

    Parameters:
    -----------
        graph_db_name (str): The name/identifier for the graph database.
        graph_db_adapter: The adapter instance for the graph database.
    """
    if not isinstance(graph_db_name, str) or not graph_db_name.strip():
        raise ValueError("Graph database name must be a non-empty string")

    supported_databases[graph_db_name] = graph_db_adapter
🤖 Prompt for AI Agents
In cognee/infrastructure/databases/graph/use_graph_adapter.py around lines 4 to
5, rename the parameters from vector_db_name and vector_db_adapter to
graph_db_name and graph_db_adapter to accurately reflect their purpose. Add a
clear docstring explaining the function's role in registering graph database
adapters. Include appropriate type hints for both parameters and the return type
to maintain consistency with the rest of the codebase documentation.
cognee/infrastructure/databases/vector/use_vector_adapter.py (1)

4-5: 🛠️ Refactor suggestion

Add documentation and input validation for better maintainability.

The function lacks documentation and input validation. Consider adding a docstring to explain the purpose and parameters, and validate the inputs to ensure robustness.

 def use_vector_adapter(vector_db_name, vector_db_adapter):
+    """
+    Register or update a vector database adapter.
+    
+    Parameters:
+    -----------
+    vector_db_name : str
+        The name/identifier for the vector database adapter.
+    vector_db_adapter : object
+        The adapter instance or class to register.
+        
+    Raises:
+    -------
+    ValueError
+        If vector_db_name is empty or None.
+    TypeError
+        If vector_db_adapter is None.
+    """
+    if not vector_db_name:
+        raise ValueError("vector_db_name cannot be empty or None")
+    if vector_db_adapter is None:
+        raise TypeError("vector_db_adapter cannot be None")
+        
     supported_databases[vector_db_name] = vector_db_adapter
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

def use_vector_adapter(vector_db_name, vector_db_adapter):
    """
    Register or update a vector database adapter.
    
    Parameters:
    -----------
    vector_db_name : str
        The name/identifier for the vector database adapter.
    vector_db_adapter : object
        The adapter instance or class to register.
        
    Raises:
    -------
    ValueError
        If vector_db_name is empty or None.
    TypeError
        If vector_db_adapter is None.
    """
    if not vector_db_name:
        raise ValueError("vector_db_name cannot be empty or None")
    if vector_db_adapter is None:
        raise TypeError("vector_db_adapter cannot be None")
        
    supported_databases[vector_db_name] = vector_db_adapter
🤖 Prompt for AI Agents
In cognee/infrastructure/databases/vector/use_vector_adapter.py at lines 4 to 5,
the function use_vector_adapter lacks a docstring and input validation. Add a
clear docstring explaining the function's purpose and its parameters. Then, add
input validation to check that vector_db_name is a non-empty string and
vector_db_adapter is of the expected type or callable before assigning it to
supported_databases to improve robustness.
cognee/modules/engine/models/ColumnValue.py (1)

4-9: 🛠️ Refactor suggestion

Add docstring and consider type annotations.

The new ColumnValue class lacks documentation. Consider adding a comprehensive docstring to explain its purpose and usage within the data ingestion pipeline.

 class ColumnValue(DataPoint):
+    """
+    Represent a column value from relational database tables.
+    
+    This model extends DataPoint to capture individual column values from table rows
+    during relational database migration. The 'properties' field is configured as
+    the primary index field for embedding and search operations.
+    
+    Attributes:
+        name: The column name
+        description: Description of the column's purpose or content
+        properties: The actual column value or properties (used for indexing)
+    """
     name: str
     description: str
     properties: str

Also, the module filename should follow Python naming conventions (snake_case). Consider renaming ColumnValue.py to column_value.py.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

class ColumnValue(DataPoint):
    """
    Represent a column value from relational database tables.

    This model extends DataPoint to capture individual column values from table rows
    during relational database migration. The 'properties' field is configured as
    the primary index field for embedding and search operations.

    Attributes:
        name: The column name
        description: Description of the column's purpose or content
        properties: The actual column value or properties (used for indexing)
    """
    name: str
    description: str
    properties: str

    metadata: dict = {"index_fields": ["properties"]}
🧰 Tools
🪛 Pylint (3.3.7)

[refactor] 4-4: Too few public methods (0/2)

(R0903)

🤖 Prompt for AI Agents
In cognee/modules/engine/models/ColumnValue.py around lines 4 to 9, add a clear
and concise docstring to the ColumnValue class explaining its role and usage in
the data ingestion pipeline. Also, update the file name from ColumnValue.py to
column_value.py to follow Python's snake_case naming convention for modules.
cognee/tests/test_relational_db_migration.py (1)

161-162: 💡 Verification agent

🧩 Analysis chain

Significant increase in expected node/edge counts - verify correctness.

The SQLite test expectations have increased dramatically:

  • Nodes: 227 → 543 (139% increase)
  • Edges: 580 → 1317 (127% increase)

While this aligns with the enhanced data modeling mentioned in the AI summary, such significant increases warrant verification that these values are correct and stable.

Note: The static analysis warning about variables being used before assignment is a false positive - node_count and edge_count are properly assigned in the conditional blocks above (lines 134-158) before being used in these assertions.

Run this script to verify the expected counts are accurate:


🏁 Script executed:

#!/bin/bash
# Description: Verify the expected node and edge counts in migration tests
# Expected: Find confirmation of the new expected values in related migration code

# Search for references to the new expected values in migration-related files
echo "Searching for references to new expected values..."
rg -A 3 -B 3 "543|1317|522|961" --type py | grep -E "(node|edge|count)"

# Look for ColumnValue or related new data modeling references
echo -e "\nSearching for ColumnValue and related modeling changes..."
rg -A 5 "ColumnValue" --type py

# Check migration database task implementations
echo -e "\nExamining migration database implementation..."
fd "migrate_relational_database" --type f --exec cat {}

Length of output: 13851


🏁 Script executed:

#!/bin/bash
# Display the initial portion of the test file to inspect schema and setup
sed -n '1,200p' cognee/tests/test_relational_db_migration.py

# Display the middle portion including test definitions and assertions
sed -n '200,350p' cognee/tests/test_relational_db_migration.py

Length of output: 8827


🏁 Script executed:

#!/usr/bin/env bash
python3 - <<'EOF'
import sqlite3, json
db_path = 'cognee/tests/test_data/migration_database.sqlite'
conn = sqlite3.connect(db_path)
cur = conn.cursor()
# list tables
cur.execute("SELECT name FROM sqlite_master WHERE type='table';")
tables = [r[0] for r in cur.fetchall() if r[0] != 'sqlite_sequence']
print("Tables:", tables)
# row counts per table
counts = {}
for t in tables:
    cur.execute(f"SELECT COUNT(*) FROM {t};")
    counts[t] = cur.fetchone()[0]
print("Row counts per table:", counts)
# schema details
schema = {}
for t in tables:
    # columns info
    cur.execute(f"PRAGMA table_info({t});")
    cols_info = cur.fetchall()
    # primary key
    pk = next((c[1] for c in cols_info if c[5] > 0), None)
    # foreign keys
    cur.execute(f"PRAGMA foreign_key_list({t});")
    fks = [{"column": f[3], "ref_table":f[2], "ref_column":f[4]} for f in cur.fetchall()]
    schema[t] = {"columns":[{"name":c[1]} for c in cols_info], "primary_key": pk, "foreign_keys": fks}
print("Schema:", json.dumps(schema, indent=2))
# compute expected nodes
num_tables = len(tables)
node_types = num_tables
node_rows = sum(counts.values())
colval_nodes = 0
for t,s in schema.items():
    num_rows = counts[t]
    pk = s["primary_key"]
    fks = [fk["ref_column"] for fk in s["foreign_keys"]]
    for c in s["columns"]:
        name = c["name"]
        if name == pk or name in fks:
            continue
        colval_nodes += num_rows
total_nodes = node_types + node_rows + colval_nodes
print(f"Computed nodes: {total_nodes} (tables={node_types}, rows={node_rows}, col values={colval_nodes})")
# compute edges
edges1 = node_rows
edges2 = colval_nodes
# foreign key edges count
fkey_edges = 0
for t,s in schema.items():
    for fk in s["foreign_keys"]:
        col = fk["column"]
        cur.execute(f"SELECT COUNT(*) FROM {t} WHERE {col} IS NOT NULL;")
        fkey_edges += cur.fetchone()[0]
total_edges = edges1 + edges2 + fkey_edges
print(f"Computed edges: {total_edges} (row->table={edges1}, col edges={edges2}, fk edges={fkey_edges})")
EOF

Length of output: 5648


Update expected graph counts in migration tests

The verification against cognee/tests/test_data/migration_database.sqlite shows:

  • Tables: 11
  • Row nodes: 309
  • Column-value nodes: 737
    → Total nodes = 11 + 309 + 737 = 1057
  • Edges (row→table): 309
  • Edges (row→column): 737
  • Foreign-key edges: 569
    → Total edges = 309 + 737 + 569 = 1615

Please adjust the assertions in cognee/tests/test_relational_db_migration.py:

• SQLite branch (migration_db_provider == "sqlite"):

- assert node_count == 543, f"Expected 543 nodes, got {node_count}"
- assert edge_count == 1317, f"Expected 1317 edges, got {edge_count}"
+ assert node_count == 1057, f"Expected 1057 nodes, got {node_count}"
+ assert edge_count == 1615, f"Expected 1615 edges, got {edge_count}"

• PostgreSQL branch (migration_db_provider == "postgresql"):

- assert node_count == 522, f"Expected 522 nodes, got {node_count}"
- assert edge_count == 961, f"Expected 961 edges, got {edge_count}"
+ assert node_count == 1057, f"Expected 1057 nodes, got {node_count}"
+ assert edge_count == 1615, f"Expected 1615 edges, got {edge_count}"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

        if migration_db_provider == "sqlite":
-            assert node_count == 543, f"Expected 543 nodes, got {node_count}"
-            assert edge_count == 1317, f"Expected 1317 edges, got {edge_count}"
+            assert node_count == 1057, f"Expected 1057 nodes, got {node_count}"
+            assert edge_count == 1615, f"Expected 1615 edges, got {edge_count}"

        elif migration_db_provider == "postgresql":
-            assert node_count == 522, f"Expected 522 nodes, got {node_count}"
-            assert edge_count == 961, f"Expected 961 edges, got {edge_count}"
+            assert node_count == 1057, f"Expected 1057 nodes, got {node_count}"
+            assert edge_count == 1615, f"Expected 1615 edges, got {edge_count}"
🧰 Tools
🪛 Pylint (3.3.7)

[error] 161-161: Possibly using variable 'node_count' before assignment

(E0606)


[error] 162-162: Possibly using variable 'edge_count' before assignment

(E0606)

🤖 Prompt for AI Agents
In cognee/tests/test_relational_db_migration.py around lines 161 to 162, the
expected node and edge counts for the SQLite test are outdated and do not
reflect the verified counts from the migration_database.sqlite analysis. Update
the assertions for the SQLite branch to expect 1057 nodes and 1615 edges instead
of 543 and 1317. Also, ensure the PostgreSQL branch assertions are updated
accordingly based on the verified counts for that database provider.
cognee/infrastructure/files/utils/extract_text_from_file.py (1)

7-26: 🛠️ Refactor suggestion

Docstring vs. implementation mismatch
The docstring promises an error on unsupported formats, but the code lacks a default case. Also, the code assumes page.extract_text() never returns None. Recommend:

  1. Adding a final else: raise ValueError(...).
  2. Guarding against None from extract_text(), e.g., using (page.extract_text() or "").
🤖 Prompt for AI Agents
In cognee/infrastructure/files/utils/extract_text_from_file.py between lines 7
and 26, the docstring states that an error should be raised for unsupported file
formats, but the code does not have a default case to enforce this. Add an else
clause that raises a ValueError for unsupported file types. Additionally, the
code assumes page.extract_text() never returns None; modify the extraction logic
to safely handle None by using (page.extract_text() or "") to avoid errors
during text concatenation.
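
A minimal sketch combining both recommendations, assuming a pypdf-style reader and a plain-text fallback; the real function's signature and supported formats may differ:

```python
from pypdf import PdfReader


def extract_text_from_file(path: str, file_type: str) -> str:
    """Illustrative only: guard against None pages and unsupported formats."""
    if file_type == "pdf":
        reader = PdfReader(path)
        # extract_text() can return None for empty pages, so fall back to "".
        return "\n".join((page.extract_text() or "") for page in reader.pages)
    elif file_type == "txt":
        with open(path, "r", encoding="utf-8") as file:
            return file.read()
    else:
        raise ValueError(f"Unsupported file format: {file_type}")
```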
alembic/versions/1d0bb7fede17_add_pipeline_run_status.py (1)

32-33: 🛠️ Refactor suggestion

Consider implementing downgrade functionality.

The empty downgrade function provides no rollback capability. While removing enum values from PostgreSQL can be complex, consider documenting why the downgrade is not implemented or adding a comment explaining the limitation.

Add documentation about the downgrade limitation:

 def downgrade() -> None:
-    pass
+    # NOTE: PostgreSQL does not support removing enum values in a straightforward way.
+    # Manual intervention would be required to remove the DATASET_PROCESSING_INITIATED value.
+    pass
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

def downgrade() -> None:
    # NOTE: PostgreSQL does not support removing enum values in a straightforward way.
    # Manual intervention would be required to remove the DATASET_PROCESSING_INITIATED value.
    pass
🤖 Prompt for AI Agents
In alembic/versions/1d0bb7fede17_add_pipeline_run_status.py at lines 32 to 33,
the downgrade function is empty, providing no rollback capability. Add a comment
inside the downgrade function explaining that removing enum values in PostgreSQL
is complex and why the downgrade is not implemented, to document this limitation
clearly.
cognee/infrastructure/data/chunking/create_chunking_engine.py (1)

8-15: 🛠️ Refactor suggestion

Inconsistent class documentation.

The docstring lists vector_db_* attributes, but the create_chunking_engine function actually accesses chunk_* attributes from the config (e.g., config["chunk_engine"], config["chunk_size"]).

Update the docstring to reflect the actual usage:

-    - vector_db_url: A string representing the URL of the vector database.
-    - vector_db_key: A string representing the key for accessing the vector database.
-    - vector_db_provider: A string representing the provider of the vector database.
+    - chunk_engine: The chunking engine type to use (Langchain, Default, or Haystack).
+    - chunk_size: The size of text chunks to generate.
+    - chunk_overlap: The overlap between consecutive chunks.
+    - chunk_strategy: The strategy to use for chunking text.
+    - vector_db_url: A string representing the URL of the vector database.
+    - vector_db_key: A string representing the key for accessing the vector database.
+    - vector_db_provider: A string representing the provider of the vector database.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    """
    Represent configuration settings for chunking operations, inheriting from the built-in
    Dict class. The class contains the following public attributes:

    - chunk_engine: The chunking engine type to use (Langchain, Default, or Haystack).
    - chunk_size: The size of text chunks to generate.
    - chunk_overlap: The overlap between consecutive chunks.
    - chunk_strategy: The strategy to use for chunking text.
    - vector_db_url: A string representing the URL of the vector database.
    - vector_db_key: A string representing the key for accessing the vector database.
    - vector_db_provider: A string representing the provider of the vector database.
    """
🤖 Prompt for AI Agents
In cognee/infrastructure/data/chunking/create_chunking_engine.py around lines 8
to 15, the class docstring incorrectly documents vector_db_* attributes, but the
code uses chunk_* attributes like chunk_engine and chunk_size. Update the
docstring to accurately describe the chunk_* configuration settings used by the
create_chunking_engine function, removing references to vector_db_* attributes
and adding descriptions for chunk_engine, chunk_size, and any other relevant
chunking parameters.
cognee/eval_framework/answer_generation/answer_generation_executor.py (1)

15-21: 🛠️ Refactor suggestion

Unnecessary type annotation weakening.

The new retrievers GraphCompletionCotRetriever and GraphCompletionContextExtensionRetriever both inherit from BaseRetriever (through GraphCompletionRetriever), so changing the type annotation to Any is unnecessary and reduces type safety.

Restore a precise type annotation to maintain type safety (the dictionary values are retriever classes, so Type[BaseRetriever] is the accurate element type):

-retriever_options: Dict[str, Any] = {
+retriever_options: Dict[str, Type[BaseRetriever]] = {
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

retriever_options: Dict[str, Type[BaseRetriever]] = {
    "cognee_graph_completion": GraphCompletionRetriever,
    "cognee_graph_completion_cot": GraphCompletionCotRetriever,
    "cognee_graph_completion_context_extension": GraphCompletionContextExtensionRetriever,
    "cognee_completion": CompletionRetriever,
    "graph_summary_completion": GraphSummaryCompletionRetriever,
}
🤖 Prompt for AI Agents
In cognee/eval_framework/answer_generation/answer_generation_executor.py around
lines 15 to 21, the retriever_options dictionary is annotated with a type that
uses Any, which weakens type safety unnecessarily. Since all retriever classes
inherit from BaseRetriever, change the type annotation from Dict[str, Any] to
Dict[str, Type[BaseRetriever]] to maintain proper type safety and reflect the
actual class types stored in the dictionary.
cognee/infrastructure/engine/models/DataPoint.py (1)

159-159: ⚠️ Potential issue

Fix class method parameters to use 'cls' instead of 'self'.

Class methods should use cls as the first parameter by Python convention, not self. This is a standard practice that makes the code intent clearer.

 @classmethod
-def from_json(self, json_str: str):
+def from_json(cls, json_str: str):
     """
     Deserialize a DataPoint instance from a JSON string.

     The method transforms the input JSON string back into a DataPoint instance using model
     validation.

     Parameters:
     -----------

         - json_str (str): The JSON string representation of a DataPoint instance to be
           deserialized.

     Returns:
     --------

         A new DataPoint instance created from the JSON data.
     """
-    return self.model_validate_json(json_str)
+    return cls.model_validate_json(json_str)
 @classmethod
-def from_pickle(self, pickled_data: bytes):
+def from_pickle(cls, pickled_data: bytes):
     """
     Deserialize a DataPoint instance from a pickled byte stream.

     The method converts the byte stream back into a DataPoint instance by loading the data
     and validating it through the model's constructor.

     Parameters:
     -----------

         - pickled_data (bytes): The bytes representation of a pickled DataPoint instance to
           be deserialized.

     Returns:
     --------

         A new DataPoint instance created from the pickled data.
     """
     data = pickle.loads(pickled_data)
-    return self(**data)
+    return cls(**data)

Also applies to: 195-195

🧰 Tools
🪛 Pylint (3.3.7)

[convention] 159-159: Class method from_json should have 'cls' as first argument

(C0202)

🤖 Prompt for AI Agents
In cognee/infrastructure/engine/models/DataPoint.py at lines 159 and 195, the
class methods currently use 'self' as the first parameter, which is incorrect by
Python convention. Change the first parameter of these class methods from 'self'
to 'cls' to properly indicate they are class methods and clarify the code
intent.
cognee/infrastructure/llm/tokenizer/Gemini/adapter.py (1)

37-77: 🛠️ Refactor suggestion

Comprehensive method documentation with a parameter naming concern.

The docstrings are well-written and clearly explain method behavior and exceptions. However, there's a parameter naming inconsistency that should be addressed.

The static analysis correctly identifies a parameter naming mismatch. Apply this diff to align with the interface:

-    def decode_single_token(self, encoding: int):
+    def decode_single_token(self, token: int):
         """
         Raise NotImplementedError when called, as Gemini tokenizer does not support decoding of
         tokens.

         Parameters:
         -----------

-            - encoding (int): The token encoding to decode.
+            - token (int): The token to decode.
         """
         # Gemini tokenizer doesn't have the option to decode tokens
         raise NotImplementedError
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    def decode_single_token(self, token: int):
        """
        Raise NotImplementedError when called, as Gemini tokenizer does not support decoding of
        tokens.

        Parameters:
        -----------

            - token (int): The token to decode.
        """
        # Gemini tokenizer doesn't have the option to decode tokens
        raise NotImplementedError
🧰 Tools
🪛 Pylint (3.3.7)

[warning] 48-48: Parameter 'token' has been renamed to 'encoding' in overriding 'GeminiTokenizer.decode_single_token' method

(W0237)

🤖 Prompt for AI Agents
In cognee/infrastructure/llm/tokenizer/Gemini/adapter.py between lines 37 and
77, the decode_single_token method names its parameter "encoding", while the
TokenizerInterface it overrides defines that parameter as "token". Rename the
parameter (and its docstring entry) from "encoding" to "token" so the override
matches the interface signature and resolves the Pylint W0237 warning.
cognee/tasks/ingestion/migrate_relational_database.py (1)

107-107: ⚠️ Potential issue

Fix string comparison operator.

Using is for string comparison is incorrect and could lead to unexpected behavior. Use == for value comparison.

-                        if key is primary_key_col or key in foreign_keys:
+                        if key == primary_key_col or key in foreign_keys:
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

                        if key == primary_key_col or key in foreign_keys:
🤖 Prompt for AI Agents
In cognee/tasks/ingestion/migrate_relational_database.py at line 107, replace
the use of the `is` operator for comparing the variable `key` to
`primary_key_col` with the `==` operator to correctly compare their string
values and avoid unexpected behavior.
cognee/infrastructure/data/chunking/DefaultChunkEngine.py (3)

145-161: 🛠️ Refactor suggestion

Fix docstring formatting and parameter documentation.

The docstring has formatting issues and an inconsistent parameter description.

Apply this diff to improve the docstring:

-        """
-        Chunk data based on paragraphs while considering overlaps and boundaries.
-
-        Parameters:
-        -----------
-
-            - data_chunks: The chunks of data to be processed into paragraphs.
-            - chunk_size: The defined size for each chunk to be created.
-            - chunk_overlap: The number of overlapping characters between chunks.
-            - bound: A weighting factor to determine splitting within a chunk (default is 0.75).
-              (default 0.75)
-
-        Returns:
-        --------
-
-            Returns the paragraph chunks and their numbered indices.
-        """
+        """
+        Chunk data based on paragraphs while considering overlaps and boundaries.
+
+        Parameters:
+        -----------
+        data_chunks : list[str]
+            The chunks of data to be processed into paragraphs.
+        chunk_size : int
+            The defined size for each chunk to be created.
+        chunk_overlap : int
+            The number of overlapping characters between chunks.
+        bound : float, optional
+            A weighting factor to determine splitting within a chunk (default: 0.75).
+
+        Returns:
+        --------
+        tuple[list[str], list[list]]
+            A tuple containing the paragraph chunks and their numbered indices.
+        """
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

        """
        Chunk data based on paragraphs while considering overlaps and boundaries.

        Parameters:
        -----------
        data_chunks : list[str]
            The chunks of data to be processed into paragraphs.
        chunk_size : int
            The defined size for each chunk to be created.
        chunk_overlap : int
            The number of overlapping characters between chunks.
        bound : float, optional
            A weighting factor to determine splitting within a chunk (default: 0.75).

        Returns:
        --------
        tuple[list[str], list[list]]
            A tuple containing the paragraph chunks and their numbered indices.
        """
🤖 Prompt for AI Agents
In cognee/infrastructure/data/chunking/DefaultChunkEngine.py around lines 145 to
161, the docstring formatting and parameter descriptions are inconsistent and
unclear. Fix the docstring by properly aligning the parameter list, using
consistent indentation and bullet points, and clearly describing each parameter
with its type and purpose. Also, ensure the return section is formatted
correctly to clearly state what the method returns.

80-94: 🛠️ Refactor suggestion

Improve docstring formatting and parameter types.

The docstring formatting needs improvement for consistency and clarity.

Apply this diff to improve the docstring:

-        """
-        Chunk data exactly by specified sizes and overlaps.
-
-        Parameters:
-        -----------
-
-            - data_chunks: The chunks of data to be processed into exact sizes.
-            - chunk_size: The defined size for each chunk to be created.
-            - chunk_overlap: The number of overlapping characters between chunks.
-
-        Returns:
-        --------
-
-            Returns the created chunks and their numbered indices.
-        """
+        """
+        Chunk data exactly by specified sizes and overlaps.
+
+        Parameters:
+        -----------
+        data_chunks : list[str]
+            The chunks of data to be processed into exact sizes.
+        chunk_size : int
+            The defined size for each chunk to be created.
+        chunk_overlap : int
+            The number of overlapping characters between chunks.
+
+        Returns:
+        --------
+        tuple[list[str], list[list]]
+            A tuple containing the created chunks and their numbered indices.
+        """
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

        """
        Chunk data exactly by specified sizes and overlaps.

        Parameters:
        -----------
        data_chunks : list[str]
            The chunks of data to be processed into exact sizes.
        chunk_size : int
            The defined size for each chunk to be created.
        chunk_overlap : int
            The number of overlapping characters between chunks.

        Returns:
        --------
        tuple[list[str], list[list]]
            A tuple containing the created chunks and their numbered indices.
        """
🤖 Prompt for AI Agents
In cognee/infrastructure/data/chunking/DefaultChunkEngine.py around lines 80 to
94, the docstring formatting is inconsistent and lacks parameter type
annotations. Update the docstring to follow standard conventions by clearly
listing each parameter with its expected type and description, using consistent
indentation and formatting. Also, specify the return type and description in a
clear, structured manner to improve readability and maintainability.

106-120: 🛠️ Refactor suggestion

Improve docstring formatting for consistency.

The docstring formatting should be consistent with standard Python conventions.

Apply this diff to improve the docstring:

-        """
-        Chunk data into sentences based on specified sizes and overlaps.
-
-        Parameters:
-        -----------
-
-            - data_chunks: The chunks of data to be processed into sentences.
-            - chunk_size: The defined size for each chunk to be created.
-            - chunk_overlap: The number of overlapping characters between chunks.
-
-        Returns:
-        --------
-
-            Returns the resulting sentence chunks and their numbered indices.
-        """
+        """
+        Chunk data into sentences based on specified sizes and overlaps.
+
+        Parameters:
+        -----------
+        data_chunks : list[str]
+            The chunks of data to be processed into sentences.
+        chunk_size : int
+            The defined size for each chunk to be created.
+        chunk_overlap : int
+            The number of overlapping characters between chunks.
+
+        Returns:
+        --------
+        tuple[list[str], list[list]]
+            A tuple containing the resulting sentence chunks and their numbered indices.
+        """
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

        """
        Chunk data into sentences based on specified sizes and overlaps.

        Parameters:
        -----------
        data_chunks : list[str]
            The chunks of data to be processed into sentences.
        chunk_size : int
            The defined size for each chunk to be created.
        chunk_overlap : int
            The number of overlapping characters between chunks.

        Returns:
        --------
        tuple[list[str], list[list]]
            A tuple containing the resulting sentence chunks and their numbered indices.
        """
🤖 Prompt for AI Agents
In cognee/infrastructure/data/chunking/DefaultChunkEngine.py around lines 106 to
120, the docstring formatting is inconsistent with standard Python conventions.
Reformat the docstring to use proper indentation and section headers with
consistent style, such as using triple quotes, a short summary line, followed by
sections like Parameters and Returns with clear indentation and colon usage.
Ensure parameter descriptions are aligned and concise, and the overall docstring
is clean and readable.
cognee/eval_framework/eval_config.py (2)

17-17: 🛠️ Refactor suggestion

Break long line to improve readability.

The line exceeds the 100-character limit and should be broken for better readability.

Apply this diff to fix the line length:

-    qa_engine: str = "cognee_completion"  # Options: 'cognee_completion' or 'cognee_graph_completion' or 'cognee_graph_completion_cot' or 'cognee_graph_completion_context_extension'
+    qa_engine: str = "cognee_completion"  # Options: 'cognee_completion', 'cognee_graph_completion',
+    # 'cognee_graph_completion_cot', 'cognee_graph_completion_context_extension'
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    qa_engine: str = "cognee_completion"  # Options: 'cognee_completion', 'cognee_graph_completion',
    # 'cognee_graph_completion_cot', 'cognee_graph_completion_context_extension'
🧰 Tools
🪛 Pylint (3.3.7)

[convention] 17-17: Line too long (181/100)

(C0301)

🤖 Prompt for AI Agents
In cognee/eval_framework/eval_config.py at line 17, the line defining qa_engine
is too long and exceeds the 100-character limit. Break this line into multiple
lines by splitting the comment string into smaller parts or using implicit
string concatenation to improve readability while keeping the same content.

3-3: 🛠️ Refactor suggestion

Fix import ordering per PEP 8 standards.

The typing import should come before third-party imports according to PEP 8.

Apply this diff to fix the import ordering:

-from functools import lru_cache
-from pydantic_settings import BaseSettings, SettingsConfigDict
-from typing import List, Optional
+from functools import lru_cache
+from typing import List, Optional
+
+from pydantic_settings import BaseSettings, SettingsConfigDict
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

from functools import lru_cache
from typing import List, Optional

from pydantic_settings import BaseSettings, SettingsConfigDict
🧰 Tools
🪛 Pylint (3.3.7)

[convention] 3-3: standard import "typing.List" should be placed before third party import "pydantic_settings.BaseSettings"

(C0411)

🤖 Prompt for AI Agents
In cognee/eval_framework/eval_config.py at line 3, the import statement for
typing should be reordered to come before any third-party imports to comply with
PEP 8 standards. Adjust the import statements so that standard library imports,
including typing, appear first, followed by third-party imports, and then local
imports.
cognee/infrastructure/llm/tokenizer/HuggingFace/adapter.py (1)

67-78: 🛠️ Refactor suggestion

Fix parameter name inconsistency with interface.

The parameter name should match the interface definition. The TokenizerInterface defines this parameter as token, but this implementation uses encoding.

-    def decode_single_token(self, encoding: int):
+    def decode_single_token(self, token: int):
         """
-        Attempt to decode a single token from its encoding, which is not implemented in this
+        Attempt to decode a single token from its token ID, which is not implemented in this
         tokenizer.

         Parameters:
         -----------

-            - encoding (int): The integer encoding of the token to decode.
+            - token (int): The integer token ID to decode.
         """
         # HuggingFace tokenizer doesn't have the option to decode tokens
         raise NotImplementedError
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    def decode_single_token(self, token: int):
        """
        Attempt to decode a single token from its token ID, which is not implemented in this
        tokenizer.

        Parameters:
        -----------

            - token (int): The integer token ID to decode.
        """
        # HuggingFace tokenizer doesn't have the option to decode tokens
        raise NotImplementedError
🧰 Tools
🪛 Pylint (3.3.7)

[warning] 67-67: Parameter 'token' has been renamed to 'encoding' in overriding 'HuggingFaceTokenizer.decode_single_token' method

(W0237)

🤖 Prompt for AI Agents
In cognee/infrastructure/llm/tokenizer/HuggingFace/adapter.py between lines 67
and 78, rename the parameter of the decode_single_token method from 'encoding'
to 'token' to match the TokenizerInterface definition. This ensures consistency
with the interface and avoids confusion or errors when overriding the method.
examples/database_examples/weaviate_example.py (1)

21-22: 🛠️ Refactor suggestion

Add error handling for missing environment variables.

The script reads credentials from environment variables but doesn't handle the case where they might be missing, which could lead to None values being passed to the configuration.

    # Set up Weaviate credentials in .env file and get the values from environment variables
    weaviate_url = os.getenv("VECTOR_DB_URL")
    weaviate_key = os.getenv("VECTOR_DB_KEY")
+    
+    if not weaviate_url or not weaviate_key:
+        raise ValueError("VECTOR_DB_URL and VECTOR_DB_KEY environment variables must be set")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    # Set up Weaviate credentials in .env file and get the values from environment variables
    weaviate_url = os.getenv("VECTOR_DB_URL")
    weaviate_key = os.getenv("VECTOR_DB_KEY")

    if not weaviate_url or not weaviate_key:
        raise ValueError("VECTOR_DB_URL and VECTOR_DB_KEY environment variables must be set")
🤖 Prompt for AI Agents
In examples/database_examples/weaviate_example.py around lines 21 to 22, the
environment variables VECTOR_DB_URL and VECTOR_DB_KEY are read without checking
if they are set, which may cause None values to be used later. Add error
handling to verify these variables are not None after reading them; if either is
missing, raise a clear exception or exit with an informative error message to
prevent misconfiguration.
cognee/api/v1/responses/default_tools.py (1)

16-22: 💡 Verification agent

🧩 Analysis chain

Verify search type enum completeness.

The search type enum in the tool definition may be missing some values that exist in the actual SearchType enum. Based on the relevant code snippets, there are additional search types like CHUNKS, GRAPH_COMPLETION_COT, and GRAPH_COMPLETION_CONTEXT_EXTENSION.


🏁 Script executed:

#!/bin/bash
# Description: Check for all SearchType enum values to ensure completeness
# Expected: Find all enum values defined in SearchType

ast-grep --pattern 'class SearchType($$$):
    $$$'

Length of output: 1035


Update search type enum to match SearchType

The enum in cognee/api/v1/responses/default_tools.py is currently incomplete and includes a non-existent value ("SEMANTIC"). It needs to mirror the actual SearchType members:

• SUMMARIES
• INSIGHTS
• CHUNKS
• RAG_COMPLETION
• GRAPH_COMPLETION
• GRAPH_SUMMARY_COMPLETION
• CODE
• CYPHER
• NATURAL_LANGUAGE
• GRAPH_COMPLETION_COT
• GRAPH_COMPLETION_CONTEXT_EXTENSION

File: cognee/api/v1/responses/default_tools.py (lines 16–22)

Suggested diff:

-   "enum": [
-       "INSIGHTS",
-       "CODE",
-       "GRAPH_COMPLETION",
-       "SEMANTIC",
-       "NATURAL_LANGUAGE",
-   ],
+   "enum": [
+       "SUMMARIES",
+       "INSIGHTS",
+       "CHUNKS",
+       "RAG_COMPLETION",
+       "GRAPH_COMPLETION",
+       "GRAPH_SUMMARY_COMPLETION",
+       "CODE",
+       "CYPHER",
+       "NATURAL_LANGUAGE",
+       "GRAPH_COMPLETION_COT",
+       "GRAPH_COMPLETION_CONTEXT_EXTENSION",
+   ],
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

                    "enum": [
                        "SUMMARIES",
                        "INSIGHTS",
                        "CHUNKS",
                        "RAG_COMPLETION",
                        "GRAPH_COMPLETION",
                        "GRAPH_SUMMARY_COMPLETION",
                        "CODE",
                        "CYPHER",
                        "NATURAL_LANGUAGE",
                        "GRAPH_COMPLETION_COT",
                        "GRAPH_COMPLETION_CONTEXT_EXTENSION",
                    ],
🤖 Prompt for AI Agents
In cognee/api/v1/responses/default_tools.py around lines 16 to 22, the enum list
for search types is incomplete and contains an invalid value "SEMANTIC". Replace
the current enum list with the full set of valid SearchType members: SUMMARIES,
INSIGHTS, CHUNKS, RAG_COMPLETION, GRAPH_COMPLETION, GRAPH_SUMMARY_COMPLETION,
CODE, CYPHER, NATURAL_LANGUAGE, GRAPH_COMPLETION_COT, and
GRAPH_COMPLETION_CONTEXT_EXTENSION to ensure it matches the actual SearchType
enum.
cognee/tasks/repo_processor/get_non_code_files.py (1)

5-22: ⚠️ Potential issue

Fix inconsistency between docstring and return value

The docstring states the function returns "a list of file paths" but the function returns an empty dictionary {} when the path doesn't exist (line 24).

 if not os.path.exists(repo_path):
-    return {}
+    return []
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    if not os.path.exists(repo_path):
        return []
🤖 Prompt for AI Agents
In cognee/tasks/repo_processor/get_non_code_files.py around lines 5 to 22, the
docstring states the function returns a list of file paths, but the function
returns an empty dictionary {} when the repo_path does not exist. To fix this
inconsistency, change the return value on line 24 to an empty list [] so it
matches the documented return type.
cognee/infrastructure/databases/vector/create_vector_engine.py (1)

42-50: ⚠️ Potential issue

Fix critical typo in parameter name

There's a typo on line 46 that will cause runtime errors - utl should be url.

Apply this fix:

-        return adapter(
-            utl=vector_db_url,
-            api_key=vector_db_key,
-            embedding_engine=embedding_engine,
-        )
+        return adapter(
+            url=vector_db_url,
+            api_key=vector_db_key,
+            embedding_engine=embedding_engine,
+        )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    if vector_db_provider in supported_databases:
        adapter = supported_databases[vector_db_provider]

        return adapter(
            url=vector_db_url,
            api_key=vector_db_key,
            embedding_engine=embedding_engine,
        )
🤖 Prompt for AI Agents
In cognee/infrastructure/databases/vector/create_vector_engine.py between lines
42 and 50, the parameter name 'utl' used when calling the adapter should be
corrected to 'url' to match the expected argument name and avoid runtime errors.
Update the parameter name from 'utl' to 'url' in the adapter call.
cognee/infrastructure/databases/graph/neo4j_driver/adapter.py (2)

778-778: ⚠️ Potential issue

Fix dangerous default mutable argument

Using dict() as a default argument is dangerous as it creates a shared mutable object across function calls.

Apply this fix:

-    def serialize_properties(self, properties=dict()):
+    def serialize_properties(self, properties=None):

Then initialize within the function:

+        if properties is None:
+            properties = {}
         serialized_properties = {}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    def serialize_properties(self, properties=None):
        if properties is None:
            properties = {}
        serialized_properties = {}
        # …rest of implementation…
🧰 Tools
🪛 Ruff (0.11.9)

778-778: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)

🪛 Pylint (3.3.7)

[warning] 778-778: Dangerous default value dict() (builtins.dict) as argument

(W0102)


[refactor] 778-778: Consider using '{}' instead of a call to 'dict'.

(R1735)

🤖 Prompt for AI Agents
In cognee/infrastructure/databases/graph/neo4j_driver/adapter.py at line 778,
the method serialize_properties uses a mutable default argument dict(), which
can lead to unexpected behavior due to shared state across calls. Change the
default argument to None and inside the function, initialize properties to an
empty dictionary if it is None. This prevents shared mutable defaults and
ensures each call gets a fresh dictionary.

620-620: 💡 Verification agent

🧩 Analysis chain

Verify method name - potential missing method

Static analysis indicates that get_neighbours method doesn't exist. This should likely be get_neighbors (American spelling).


🏁 Script executed:

#!/bin/bash
# Search for get_neighbours method definition in the Neo4j adapter
ast-grep --pattern 'def get_neighbours($_) {
  $$$
}'

Length of output: 55


🏁 Script executed:

#!/bin/bash
# Search for a get_neighbors definition, including async variants, across the repo
rg -En "^(async )?def get_neighbors" 

Length of output: 77


🏁 Script executed:

#!/bin/bash
# Search for both British and American spellings of the method definition in Python files
rg --color never -n "def get_neighbours" -g "*.py"
rg --color never -n "def get_neighbors" -g "*.py"

Length of output: 902


Fix method name in Neo4j adapter
The call at line 620 refers to a non-existent get_neighbours. The Neo4j adapter defines async def get_neighbors(...), so update the spelling to match:

• File: cognee/infrastructure/databases/graph/neo4j_driver/adapter.py
• Line: 620

-        return await self.get_neighbours(node_id)
+        return await self.get_neighbors(node_id)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

        return await self.get_neighbors(node_id)
🧰 Tools
🪛 Pylint (3.3.7)

[error] 620-620: Instance of 'Neo4jAdapter' has no 'get_neighbours' member; maybe 'get_neighbors'?

(E1101)

🤖 Prompt for AI Agents
In cognee/infrastructure/databases/graph/neo4j_driver/adapter.py at line 620,
the method call uses the British spelling get_neighbours, but the actual method
defined is get_neighbors with American spelling. Change the method call from
get_neighbours to get_neighbors to match the existing method definition and fix
the error.
cognee/modules/graph/utils/expand_with_nodes_and_edges.py (2)

18-19: ⚠️ Potential issue

Type hint is wrong – should be Optional[OntologyResolver]
ontology_resolver can be None, yet the annotation is the non-nullable OntologyResolver. This trips static analysis (e.g. mypy) and misleads callers.

-    ontology_resolver: OntologyResolver = None,
+    ontology_resolver: Optional[OntologyResolver] = None,
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    ontology_resolver: Optional[OntologyResolver] = None,
    existing_edges_map: Optional[dict[str, bool]] = None,
🤖 Prompt for AI Agents
In cognee/modules/graph/utils/expand_with_nodes_and_edges.py around lines 18 to
19, the type hint for the parameter ontology_resolver is incorrectly set as
OntologyResolver, but since it can be None, it should be changed to
Optional[OntologyResolver]. Update the annotation to Optional[OntologyResolver]
to correctly reflect that the parameter can be None and satisfy static type
checkers like mypy.

24-26: 🛠️ Refactor suggestion

Potential performance hit from repeatedly instantiating OntologyResolver
Creating a fresh OntologyResolver() on every call can be costly because it loads the ontology file. Two lightweight fixes are possible:

-    if ontology_resolver is None:
-        ontology_resolver = OntologyResolver()
+    if ontology_resolver is None:
+        # Cache a singleton to avoid reloading ontology on every call
+        ontology_resolver = getattr(expand_with_nodes_and_edges, "_cached_resolver", None) \
+            or OntologyResolver()
+        expand_with_nodes_and_edges._cached_resolver = ontology_resolver
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    if ontology_resolver is None:
        # Cache a singleton to avoid reloading ontology on every call
        ontology_resolver = getattr(expand_with_nodes_and_edges, "_cached_resolver", None) \
            or OntologyResolver()
        expand_with_nodes_and_edges._cached_resolver = ontology_resolver
🤖 Prompt for AI Agents
In cognee/modules/graph/utils/expand_with_nodes_and_edges.py around lines 24 to
26, the code instantiates a new OntologyResolver every time the function is
called, which can be expensive due to loading the ontology file repeatedly. To
fix this, refactor the code to instantiate OntologyResolver once outside the
function or use a singleton pattern so that the same instance is reused across
calls, avoiding repeated costly initialization.
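A minimal sketch of the module-level alternative mentioned above, caching a single resolver with functools.lru_cache; the OntologyResolver import path is illustrative, not taken from the codebase:

from functools import lru_cache

# Illustrative import path; use the real location of OntologyResolver in the codebase.
from cognee.modules.ontology import OntologyResolver


@lru_cache(maxsize=1)
def get_default_ontology_resolver() -> OntologyResolver:
    """Build the default resolver once; later calls reuse the cached instance."""
    return OntologyResolver()


# Inside expand_with_nodes_and_edges:
#     if ontology_resolver is None:
#         ontology_resolver = get_default_ontology_resolver()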
cognee/api/v1/responses/routers/get_responses_router.py (2)

54-56: ⚠️ Potential issue

Overwriting caller-supplied model breaks API contract
model = "gpt-4o" discards the request.model value. Clients cannot choose a model despite the parameter existing.

-        # TODO: Support other models (e.g. cognee-v1-openai-gpt-3.5-turbo, etc.)
-        model = "gpt-4o"
+        # Default to gpt-4o when caller supplies nothing
+        model = model or "gpt-4o"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

        # Default to gpt-4o when caller supplies nothing
        model = model or "gpt-4o"
🧰 Tools
🪛 Pylint (3.3.7)

[warning] 54-54: TODO: Support other models (e.g. cognee-v1-openai-gpt-3.5-turbo, etc.)

(W0511)

🤖 Prompt for AI Agents
In cognee/api/v1/responses/routers/get_responses_router.py around lines 54 to
56, the code overwrites the caller-supplied model by setting model = "gpt-4o",
which ignores the request.model parameter and breaks the API contract. To fix
this, remove the hardcoded assignment and use the model value provided by the
client in request.model, ensuring the API respects the caller's choice of model.

43-49: 🛠️ Refactor suggestion

Mutable default & misleading signature
tools: Optional[List[Dict[str, Any]]] = DEFAULT_TOOLS has two issues:

  1. A list is mutable – accidental mutation leaks across requests.
  2. The handler already falls back to DEFAULT_TOOLS later when request.tools is falsy, so the default can safely be None.
-        tools: Optional[List[Dict[str, Any]]] = DEFAULT_TOOLS,
+        tools: Optional[List[Dict[str, Any]]] = None,
...
-        tools=tools,
+        tools=tools or DEFAULT_TOOLS,

Committable suggestion skipped: line range outside the PR's diff.
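Since the committable suggestion is skipped, here is a minimal sketch of the None-default pattern for call_openai_api_for_model; the surrounding parameters and the DEFAULT_TOOLS value are illustrative placeholders, not the real definitions:

from typing import Any, Dict, List, Optional

# Placeholder stand-in for the real DEFAULT_TOOLS constant.
DEFAULT_TOOLS: List[Dict[str, Any]] = [{"type": "function", "name": "search"}]


async def call_openai_api_for_model(
    messages: List[Dict[str, Any]],
    model: str,
    tools: Optional[List[Dict[str, Any]]] = None,  # None instead of a shared mutable list
) -> Dict[str, Any]:
    # Resolve the default at call time so no list object leaks between requests.
    effective_tools = tools or DEFAULT_TOOLS
    return {"model": model, "messages": messages, "tools": effective_tools}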

🧰 Tools
🪛 Pylint (3.3.7)

[warning] 43-43: Dangerous default value DEFAULT_TOOLS (builtins.list) as argument

(W0102)

🤖 Prompt for AI Agents
In cognee/api/v1/responses/routers/get_responses_router.py around lines 43 to
49, the function call_openai_api_for_model uses a mutable default argument for
tools, which can cause state to leak between calls. Change the default value of
tools to None and inside the function, set tools to DEFAULT_TOOLS if it is None
or falsy. This avoids using a mutable default and aligns with the existing logic
that fills in DEFAULT_TOOLS later.
examples/database_examples/neo4j_example.py (1)

20-33: 🛠️ Refactor suggestion

Consider adding error handling for missing environment variables.

The example retrieves Neo4j credentials from environment variables but doesn't validate they exist before use. This could lead to runtime errors if users haven't set their .env file properly.

Consider adding validation like this:

 # Set up Neo4j credentials in .env file and get the values from environment variables
 neo4j_url = os.getenv("GRAPH_DATABASE_URL")
 neo4j_user = os.getenv("GRAPH_DATABASE_USERNAME")
 neo4j_pass = os.getenv("GRAPH_DATABASE_PASSWORD")

+# Validate required environment variables
+if not all([neo4j_url, neo4j_user, neo4j_pass]):
+    missing = [var for var, val in [
+        ("GRAPH_DATABASE_URL", neo4j_url),
+        ("GRAPH_DATABASE_USERNAME", neo4j_user), 
+        ("GRAPH_DATABASE_PASSWORD", neo4j_pass)
+    ] if not val]
+    raise ValueError(f"Missing required environment variables: {missing}")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    # Set up Neo4j credentials in .env file and get the values from environment variables
    neo4j_url = os.getenv("GRAPH_DATABASE_URL")
    neo4j_user = os.getenv("GRAPH_DATABASE_USERNAME")
    neo4j_pass = os.getenv("GRAPH_DATABASE_PASSWORD")

    # Validate required environment variables
    if not all([neo4j_url, neo4j_user, neo4j_pass]):
        missing = [var for var, val in [
            ("GRAPH_DATABASE_URL", neo4j_url),
            ("GRAPH_DATABASE_USERNAME", neo4j_user),
            ("GRAPH_DATABASE_PASSWORD", neo4j_pass)
        ] if not val]
        raise ValueError(f"Missing required environment variables: {missing}")

    # Configure Neo4j as the graph database provider
    cognee.config.set_graph_db_config(
        {
            "graph_database_url": neo4j_url,           # Neo4j Bolt URL
            "graph_database_provider": "neo4j",        # Specify Neo4j as provider
            "graph_database_username": neo4j_user,     # Neo4j username
            "graph_database_password": neo4j_pass,     # Neo4j password
        }
    )
🤖 Prompt for AI Agents
In examples/database_examples/neo4j_example.py around lines 20 to 33, the code
fetches Neo4j credentials from environment variables without checking if they
are set, which can cause runtime errors. Add validation after retrieving each
environment variable to check if any are None or empty, and if so, raise a clear
exception or log an error indicating the missing configuration. This ensures the
program fails fast with a helpful message if required environment variables are
not properly set.
cognee/api/v1/responses/routers/default_tools.py (1)

1-37: 💡 Verification agent

🧩 Analysis chain

Verify search_type enum consistency with SearchType class.

The search tool defines a search_type enum, but it may not include all available search types from the SearchType class. Based on the relevant code snippets, SearchType includes additional values like SUMMARIES, RAG_COMPLETION, GRAPH_SUMMARY_COMPLETION, CYPHER, GRAPH_COMPLETION_COT, and GRAPH_COMPLETION_CONTEXT_EXTENSION.


🏁 Script executed:

#!/bin/bash
# Description: Check if the search_type enum in default_tools.py matches the SearchType class
# Expected: All SearchType enum values should be available in the tool definition

# Find the SearchType enum definition
echo "SearchType enum values:"
ast-grep --pattern 'class SearchType($$$) {
  $$$
}'

echo -e "\nSearchType enum members:"
rg -A 20 "class SearchType" | grep -E "^\s*[A-Z_]+ = "

Length of output: 233


🏁 Script executed:

#!/bin/bash
# Search for SearchType class definition and list its members

echo "Searching for class SearchType definitions..."
rg -n "^class SearchType" -A20 --color=never

# If a file is found, extract its enum members
files=$(rg -l "^class SearchType")
if [ -n "$files" ]; then
  for file in $files; do
    echo -e "\nFound SearchType in: $file"
    echo "Enum members in $file:"
    rg -nE "^\s*[A-Z_]+ = " "$file" --color=never
  done
else
  echo "No SearchType class definition found."
fi

Length of output: 4033


Align search_type enum with the SearchType definition

The search tool in cognee/api/v1/responses/routers/default_tools.py currently defines:

"enum": [
  "INSIGHTS",
  "CODE",
  "GRAPH_COMPLETION",
  "SEMANTIC",
  "NATURAL_LANGUAGE"
]

However, the SearchType enum in cognee/modules/search/types/SearchType.py declares these members:

  • SUMMARIES
  • INSIGHTS
  • CHUNKS
  • RAG_COMPLETION
  • GRAPH_COMPLETION
  • GRAPH_SUMMARY_COMPLETION
  • CODE
  • CYPHER
  • NATURAL_LANGUAGE
  • GRAPH_COMPLETION_COT
  • GRAPH_COMPLETION_CONTEXT_EXTENSION

Note that SEMANTIC is not among them.

Please update the schema in
cognee/api/v1/responses/routers/default_tools.py (lines ~6–15 under search_type) to:

  • Remove SEMANTIC
  • Add all valid SearchType members (or generate the list programmatically, as sketched below)
  • Confirm any omitted types (e.g., CYPHER) are intentionally unsupported
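A hedged sketch of the programmatic option, assuming SearchType is a standard Enum whose member names match the strings above (the schema variable names are illustrative):

from cognee.modules.search.types.SearchType import SearchType  # module path derived from the file noted above

# Derive the JSON-schema enum from the SearchType definition so the two cannot drift apart.
SEARCH_TYPE_ENUM = [member.name for member in SearchType]

SEARCH_TYPE_SCHEMA = {
    "type": "string",
    "enum": SEARCH_TYPE_ENUM,
}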
🤖 Prompt for AI Agents
In cognee/api/v1/responses/routers/default_tools.py around lines 6 to 15, the
search_type enum list includes "SEMANTIC" which is not part of the SearchType
enum and is missing several valid SearchType members like SUMMARIES, CHUNKS,
RAG_COMPLETION, GRAPH_SUMMARY_COMPLETION, CYPHER, GRAPH_COMPLETION_COT, and
GRAPH_COMPLETION_CONTEXT_EXTENSION. Update the enum to remove "SEMANTIC" and add
all valid SearchType members from cognee/modules/search/types/SearchType.py,
ensuring the list matches exactly. Verify if any types like CYPHER should be
excluded intentionally and document that if so.
cognee/infrastructure/databases/exceptions/exceptions.py (1)

69-86: ⚠️ Potential issue

Fix inheritance pattern to maintain consistent exception behavior.

The new NodesetFilterNotSupportedError class doesn't call super().__init__(), which breaks the inheritance chain and loses automatic logging functionality from the CogneeApiError base class. This creates inconsistency with other exception classes.

Apply this diff to fix the inheritance pattern:

-    def __init__(
-        self,
-        message: str = "The nodeset filter is not supported in the current graph database.",
-        name: str = "NodeSetFilterNotSupportedError",
-        status_code=status.HTTP_404_NOT_FOUND,
-    ):
-        self.message = message
-        self.name = name
-        self.status_code = status_code
+    def __init__(
+        self,
+        message: str = "The nodeset filter is not supported in the current graph database.",
+        name: str = "NodeSetFilterNotSupportedError", 
+        status_code=status.HTTP_404_NOT_FOUND,
+    ):
+        super().__init__(message, name, status_code)

Additionally, consider addressing the same issue in EntityNotFoundError for consistency across the exception hierarchy.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

class NodesetFilterNotSupportedError(CogneeApiError):
    """
    Raise an exception when a nodeset filter is not supported by the current database.

    This exception inherits from `CogneeApiError` and is designed to provide information
    about the specific issue of unsupported nodeset filters in the context of graph
    databases.
    """

    def __init__(
        self,
        message: str = "The nodeset filter is not supported in the current graph database.",
        name: str = "NodeSetFilterNotSupportedError",
        status_code=status.HTTP_404_NOT_FOUND,
    ):
        super().__init__(message, name, status_code)
🧰 Tools
🪛 Pylint (3.3.7)

[warning] 78-78: init method from base class 'CogneeApiError' is not called

(W0231)

🤖 Prompt for AI Agents
In cognee/infrastructure/databases/exceptions/exceptions.py around lines 69 to
86, the NodesetFilterNotSupportedError class does not call super().__init__(),
breaking the inheritance chain and losing automatic logging from CogneeApiError.
Fix this by calling super().__init__() with the appropriate parameters inside
the __init__ method to maintain consistent exception behavior. Also, review and
apply the same fix to the EntityNotFoundError class to ensure consistency across
exception classes.
cognee/eval_framework/modal_eval_dashboard.py (1)

67-69: 🛠️ Refactor suggestion

Add error handling for missing metrics keys.

The code assumes that all JSON files have the expected structure with "metrics", "EM", "f1", and "correctness" keys. Consider adding error handling to gracefully handle malformed JSON files.

-        total_em = sum(q["metrics"]["EM"]["score"] for q in items)
-        total_f1 = sum(q["metrics"]["f1"]["score"] for q in items)
-        total_corr = sum(q["metrics"]["correctness"]["score"] for q in items)
+        try:
+            total_em = sum(q["metrics"]["EM"]["score"] for q in items)
+            total_f1 = sum(q["metrics"]["f1"]["score"] for q in items)
+            total_corr = sum(q["metrics"]["correctness"]["score"] for q in items)
+        except KeyError as e:
+            st.error(f"Malformed JSON file {filename}: missing key {e}")
+            continue
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

        try:
            total_em = sum(q["metrics"]["EM"]["score"] for q in items)
            total_f1 = sum(q["metrics"]["f1"]["score"] for q in items)
            total_corr = sum(q["metrics"]["correctness"]["score"] for q in items)
        except KeyError as e:
            st.error(f"Malformed JSON file {filename}: missing key {e}")
            continue
🤖 Prompt for AI Agents
In cognee/eval_framework/modal_eval_dashboard.py around lines 67 to 69, the code
sums metric scores assuming all keys exist, which can cause errors if keys are
missing. Add error handling by checking for the presence of "metrics", "EM",
"f1", and "correctness" keys before accessing their "score" values, and handle
missing keys gracefully, for example by skipping those entries or using default
values.
cognee/eval_framework/modal_run_eval.py (1)

84-96: ⚠️ Potential issue

html_output may be undefined when dashboard=False

html_output is assigned only inside the if eval_params.get("dashboard") block but is unconditionally written afterwards, leading to UnboundLocalError.

-    if eval_params.get("dashboard"):
-        logger.info("Generating dashboard...")
-        html_output = create_dashboard(
-            metrics_path=eval_params["metrics_path"],
-            aggregate_metrics_path=eval_params["aggregate_metrics_path"],
-            output_file=eval_params["dashboard_path"],
-            benchmark=eval_params["benchmark"],
-        )
-
-    with open("/data/" + html_filename, "w") as f:
-        f.write(html_output)
-    vol.commit()
+    if eval_params.get("dashboard"):
+        logger.info("Generating dashboard...")
+        html_output = create_dashboard(
+            metrics_path=eval_params["metrics_path"],
+            aggregate_metrics_path=eval_params["aggregate_metrics_path"],
+            output_file=eval_params["dashboard_path"],
+            benchmark=eval_params["benchmark"],
+        )
+        with open("/data/" + html_filename, "w", encoding="utf-8") as f:
+            f.write(html_output)
+        vol.commit()

Also specify encoding="utf-8" for portability.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    if eval_params.get("dashboard"):
        logger.info("Generating dashboard...")
        html_output = create_dashboard(
            metrics_path=eval_params["metrics_path"],
            aggregate_metrics_path=eval_params["aggregate_metrics_path"],
            output_file=eval_params["dashboard_path"],
            benchmark=eval_params["benchmark"],
        )
        with open("/data/" + html_filename, "w", encoding="utf-8") as f:
            f.write(html_output)
        vol.commit()
🧰 Tools
🪛 Pylint (3.3.7)

[warning] 93-93: Using open without explicitly specifying an encoding

(W1514)


[error] 94-94: Possibly using variable 'html_output' before assignment

(E0606)

🤖 Prompt for AI Agents
In cognee/eval_framework/modal_run_eval.py around lines 84 to 96, the variable
html_output is assigned only inside the if eval_params.get("dashboard") block
but is used unconditionally afterwards, which can cause an UnboundLocalError if
dashboard is False. To fix this, initialize html_output to an empty string or
None before the if block and only write to the file if html_output is defined.
Also, add encoding="utf-8" to the open() call for writing the file to ensure
portability across different systems.
cognee/modules/retrieval/graph_completion_context_extension_retriever.py (2)

3-3: 🛠️ Refactor suggestion

Remove unused imports to pass Ruff/Pylint and avoid dead code.

get_llm_client, read_query_prompt, and render_prompt are never referenced in this module. Keeping them will cause import-unused warnings and an unnecessary runtime dependency.

-from cognee.infrastructure.llm.get_llm_client import get_llm_client
-from cognee.infrastructure.llm.prompts import read_query_prompt, render_prompt

Also applies to: 6-6

🧰 Tools
🪛 Ruff (0.11.9)

3-3: cognee.infrastructure.llm.get_llm_client.get_llm_client imported but unused

Remove unused import: cognee.infrastructure.llm.get_llm_client.get_llm_client

(F401)

🪛 Pylint (3.3.7)

[warning] 3-3: Unused get_llm_client imported from cognee.infrastructure.llm.get_llm_client

(W0611)

🤖 Prompt for AI Agents
In cognee/modules/retrieval/graph_completion_context_extension_retriever.py at
lines 3 and 6, remove the unused imports get_llm_client, read_query_prompt, and
render_prompt since they are not referenced anywhere in the module. This will
eliminate import-unused warnings from Ruff/Pylint and reduce unnecessary runtime
dependencies.

91-94: 💡 Verification agent

🧩 Analysis chain

Deduplication assumes edges are hashable – verify or switch to UUIDs.

set(triplets) will raise TypeError: unhashable type if the Edge/triplet objects don’t implement __hash__.
If that guarantee is not universal across graph adapters, replace with a tuple-based key:

unique = {(e.node1.id, e.node2.id, e.attributes['relationship_type']): e for e in triplets}
triplets = list(unique.values())

🏁 Script executed:

#!/bin/bash
# Search for Edge and Triplet class definitions
rg -A3 -n "class Edge" .
rg -A3 -n "class Triplet" .

# Check for any custom __hash__ implementations
rg -n "def __hash__" .

Length of output: 1817


🏁 Script executed:

#!/bin/bash
# Show the implementation of get_triplets in the retriever
rg -n "def get_triplets" -A10 -B2 cognee/modules/retrieval/graph_completion_context_extension_retriever.py

Length of output: 108


🏁 Script executed:

#!/bin/bash
# Display imports and class declaration in the retriever
sed -n '1,200p' cognee/modules/retrieval/graph_completion_context_extension_retriever.py

# Search for get_triplets definitions across the repo
rg -n "def get_triplets" .

# Inspect CogneeGraphElements around the two __hash__ implementations
sed -n '70,120p' cognee/modules/graph/cognee_graph/CogneeGraphElements.py
echo "---"
sed -n '140,180p' cognee/modules/graph/cognee_graph/CogneeGraphElements.py

Length of output: 6650


🏁 Script executed:

#!/bin/bash
# Display the beginning of GraphCompletionRetriever implementation, including get_triplets
sed -n '1,200p' cognee/modules/retrieval/graph_completion_retriever.py

Length of output: 7043


Edge hashing ignores relationship types—update deduplication logic
Although Edge implements __hash__ (so set(triplets) won’t error), it only hashes on the two node objects and drops any edges between the same nodes—even if they carry different relationship_type attributes. To preserve distinct relationships, either:

• Extend Edge.__hash__/__eq__ to include attributes['relationship_type'], e.g.:

def __hash__(self) -> int:
    key = (self.node1, self.node2, self.attributes.get("relationship_type"))
    return hash(key)

or
• Replace the set-based dedup with an explicit key mapping in
cognee/modules/retrieval/graph_completion_context_extension_retriever.py (around lines 91–94):

-   triplets = list(set(triplets))
+   unique = {
+       (e.node1.id, e.node2.id, e.attributes.get("relationship_type")): e
+       for e in triplets
+   }
+   triplets = list(unique.values())

This ensures you only collapse exact duplicates (same nodes + same relationship).
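A minimal sketch of the first option; this is not the project's Edge class, only a stand-in with the node1/node2/attributes fields described above, showing __eq__ and __hash__ kept consistent:

class Edge:
    def __init__(self, node1, node2, attributes=None):
        self.node1 = node1
        self.node2 = node2
        self.attributes = attributes or {}

    def __eq__(self, other):
        if not isinstance(other, Edge):
            return NotImplemented
        return (
            self.node1 == other.node1
            and self.node2 == other.node2
            and self.attributes.get("relationship_type") == other.attributes.get("relationship_type")
        )

    def __hash__(self) -> int:
        # Hash on the same key __eq__ compares: both endpoints plus the relationship type.
        return hash((self.node1, self.node2, self.attributes.get("relationship_type")))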

🤖 Prompt for AI Agents
In cognee/modules/retrieval/graph_completion_context_extension_retriever.py
around lines 91 to 94, the current deduplication using set(triplets) only hashes
edges by their nodes and ignores the relationship_type attribute, causing
distinct edges with different relationships to be collapsed. To fix this, either
update the Edge class's __hash__ and __eq__ methods to include the
relationship_type attribute from attributes, or replace the set-based
deduplication with a dictionary keyed by (node1.id, node2.id,
attributes['relationship_type']) mapping to the edge, then convert the
dictionary values back to a list to preserve distinct edges with different
relationships.
cognee/modules/retrieval/graph_completion_cot_retriever.py (1)

105-110: ⚠️ Potential issue

acreate_structured_output returns a BaseModel, not a plain string.

reasoning is later interpolated into a prompt as if it were str.
Unless the adapter unwraps to raw .text, this will fail with TypeError: can only concatenate str (not "BaseModel").

Proposed fix (if the adapter returns a model with text field):

-                reasoning = await llm_client.acreate_structured_output(
+                reasoning_model = await llm_client.acreate_structured_output(
                     text_input=valid_user_prompt,
                     system_prompt=valid_system_prompt,
                     response_model=str,
                 )
-                followup_args = {"query": query, "answer": answer, "reasoning": reasoning}
+                reasoning = getattr(reasoning_model, "text", str(reasoning_model))
+                followup_args = {"query": query, "answer": answer, "reasoning": reasoning}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

                reasoning_model = await llm_client.acreate_structured_output(
                    text_input=valid_user_prompt,
                    system_prompt=valid_system_prompt,
                    response_model=str,
                )
                reasoning = getattr(reasoning_model, "text", str(reasoning_model))
                followup_args = {"query": query, "answer": answer, "reasoning": reasoning}
🤖 Prompt for AI Agents
In cognee/modules/retrieval/graph_completion_cot_retriever.py around lines 105
to 110, the variable reasoning is assigned the result of
acreate_structured_output which returns a BaseModel, not a string. Since
reasoning is later used as a string in prompt interpolation, this causes a
TypeError. To fix this, access the text attribute of the returned BaseModel
(e.g., reasoning.text) to get the raw string before using it in string
concatenation or interpolation.
cognee/tests/unit/modules/retrieval/graph_completion_retriever_context_extension_test.py (1)

181-183: ⚠️ Potential issue

Fix method names in test execution.

The method names in the main execution block are incorrect and will cause runtime errors.

-        await test.test_graph_completion_context_simple()
-        await test.test_graph_completion_context_complex()
-        await test.test_get_graph_completion_context_on_empty_graph()
+        await test.test_graph_completion_extension_context_simple()
+        await test.test_graph_completion_extension_context_complex()
+        await test.test_get_graph_completion_extension_context_on_empty_graph()
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

        await test.test_graph_completion_extension_context_simple()
        await test.test_graph_completion_extension_context_complex()
        await test.test_get_graph_completion_extension_context_on_empty_graph()
🧰 Tools
🪛 Pylint (3.3.7)

[error] 181-181: Instance of 'TestGraphCompletionRetriever' has no 'test_graph_completion_context_simple' member

(E1101)


[error] 182-182: Instance of 'TestGraphCompletionRetriever' has no 'test_graph_completion_context_complex' member

(E1101)


[error] 183-183: Instance of 'TestGraphCompletionRetriever' has no 'test_get_graph_completion_context_on_empty_graph' member

(E1101)

🤖 Prompt for AI Agents
In
cognee/tests/unit/modules/retrieval/graph_completion_retriever_context_extension_test.py
around lines 181 to 183, the method names called in the test execution block are
incorrect and will cause runtime errors. Review the actual test method names
defined in the test class and update these calls to match the correct method
names exactly to ensure they run without errors.
cognee/infrastructure/databases/vector/chromadb/ChromaDBAdapter.py (1)

536-536: ⚠️ Potential issue

Missing method definition causes runtime error.

The prune method calls self.list_collections() but this method is not defined in the ChromaDBAdapter class, which will cause an AttributeError at runtime.

    async def prune(self):
        """
        Delete all collections in the ChromaDB database.

        Returns:
        --------

            Returns True upon successful deletion of all collections.
        """
        client = await self.get_connection()
-       collections = await self.list_collections()
+       collections = await self.get_collection_names()
        for collection_name in collections:
            await client.delete_collection(collection_name)
        return True

Committable suggestion skipped: line range outside the PR's diff.
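If defining the missing method is preferred over switching to get_collection_names, a hedged sketch of list_collections follows; it assumes the async ChromaDB client returned by get_connection exposes an awaitable list_collections() (newer chromadb versions return plain names, older ones return collection objects):

    async def list_collections(self):
        """
        Return the names of all collections in the ChromaDB database.
        """
        client = await self.get_connection()
        collections = await client.list_collections()
        return [
            collection if isinstance(collection, str) else collection.name
            for collection in collections
        ]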

🧰 Tools
🪛 Pylint (3.3.7)

[error] 536-536: Instance of 'ChromaDBAdapter' has no 'list_collections' member

(E1101)

🤖 Prompt for AI Agents
In cognee/infrastructure/databases/vector/chromadb/ChromaDBAdapter.py at line
536, the prune method calls self.list_collections(), but this method is not
defined in the class. To fix this, define the list_collections method within the
ChromaDBAdapter class that returns the list of collections as expected by prune.
Ensure the method is asynchronous if prune awaits it, and implement the logic to
retrieve collections from the underlying database or data structure.
cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py (1)

114-127: ⚠️ Potential issue

Fix the line length issue in the docstring.

The docstring content is comprehensive and well-structured, but line 115 exceeds the 100-character limit.

Apply this diff to fix the line length issue:

-                This class inherits from Base and is associated with a database table defined by
+                This class inherits from Base and is associated with a database table
+                defined by
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

                """
                Represent a point in a vector data space with associated data and vector representation.

                This class inherits from Base and is associated with a database table
                defined by
                __tablename__. It maintains the following public methods and instance variables:

                - __init__(self, id, payload, vector): Initializes a new PGVectorDataPoint instance.

                Instance variables:
                - primary_key: Unique identifier for the data point, generated by uuid4.
                - id: Identifier for the data point, defined by data_point_types.
                - payload: JSON data associated with the data point.
                - vector: Vector representation of the data point, with size defined by vector_size.
                """
🧰 Tools
🪛 Pylint (3.3.7)

[convention] 115-115: Line too long (104/100)

(C0301)

🤖 Prompt for AI Agents
In cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py around
lines 114 to 127, the docstring line 115 exceeds the 100-character limit. Break
this long line into multiple shorter lines without changing the meaning,
ensuring each line stays within the 100-character limit to improve readability
and comply with style guidelines.
cognee-mcp/src/server.py (1)

78-79: ⚠️ Potential issue

Parameter name mismatch will crash cognee_add_developer_rules

cognee.add() expects the keyword argument node_set (see cognee/api/v1/add/add.py).
Passing nodeset will raise a TypeError the first time this tool runs.

-                await cognee.add(file_path, nodeset="developer_rules")
+                await cognee.add(file_path, node_set=["developer_rules"])

(The value should also be wrapped in a list according to the function signature.)

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

-                await cognee.add(file_path, nodeset="developer_rules")
+                await cognee.add(file_path, node_set=["developer_rules"])
                 model = KnowledgeGraph
🤖 Prompt for AI Agents
In cognee-mcp/src/server.py around lines 78 to 79, the call to cognee.add uses
the incorrect keyword argument 'nodeset' instead of 'node_set', which causes a
TypeError. Change the argument name to 'node_set' and wrap the value
"developer_rules" in a list as required by the function signature, so it should
be node_set=["developer_rules"].
cognee/infrastructure/databases/hybrid/falkordb/FalkorDBAdapter.py (4)

749-758: ⚠️ Potential issue

Critical: delete_nodes implementation mismatch.

The docstring indicates batch deletion by collection and IDs, but the code:

self.delete_data_points(data_point_ids)

– is missing both collection_name and await.

Proposed fix:

- self.delete_data_points(data_point_ids)
+ return await self.delete_data_points(collection_name, data_point_ids)
🤖 Prompt for AI Agents
In cognee/infrastructure/databases/hybrid/falkordb/FalkorDBAdapter.py around
lines 749 to 758, the delete_nodes method's implementation does not match its
docstring; it calls self.delete_data_points without passing collection_name and
is missing the await keyword. Fix this by adding await before the call and
passing collection_name as the first argument to delete_data_points along with
data_point_ids.

245-255: ⚠️ Potential issue

Missing implementation in create_collection.

The method is documented but only contains a pass placeholder. This will lead to silent failures when creating collections.

Please implement the collection creation logic or explicitly raise NotImplementedError if unsupported.

🧰 Tools
🪛 Pylint (3.3.7)

[warning] 254-254: Unnecessary pass statement

(W0107)

🤖 Prompt for AI Agents
In cognee/infrastructure/databases/hybrid/falkordb/FalkorDBAdapter.py at lines
245 to 255, the create_collection method is currently a placeholder with only a
pass statement, which will cause silent failures. You should either implement
the logic to create a collection in the graph database using the appropriate
database client calls or, if this operation is not supported, raise a
NotImplementedError explicitly to indicate that the method is not yet
implemented.
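
For example, the explicit-failure option could look like the following sketch (the parameter list is an assumption, not the adapter's confirmed signature):

    async def create_collection(self, collection_name: str, payload_schema=None):
        """
        Placeholder per the review suggestion: fail loudly instead of passing silently.
        """
        # Assumption: explicit collection creation is not yet supported by this adapter.
        raise NotImplementedError(
            f"FalkorDBAdapter.create_collection is not implemented; cannot create '{collection_name}'."
        )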

731-739: ⚠️ Potential issue

Critical: delete_node implementation mismatch.

The docstring expects to delete a node by collection and ID, but the implementation calls:

return await self.delete_data_points([data_point_id])

– missing the required collection_name argument and missing await on method signature.

Apply this diff to fix the call:

- return await self.delete_data_points([data_point_id])
+ return await self.delete_data_points(collection_name, [data_point_id])
🤖 Prompt for AI Agents
In cognee/infrastructure/databases/hybrid/falkordb/FalkorDBAdapter.py around
lines 731 to 739, the delete_node method's implementation incorrectly calls
delete_data_points without passing the required collection_name argument and
lacks the await keyword. Fix this by adding the await keyword before the call
and passing both collection_name and a list containing data_point_id to
delete_data_points, matching the method signature and docstring.

369-376: ⚠️ Potential issue

Unimplemented index_data_points stub.

The docstring outlines intended behavior, but the body is a pass. This will break downstream indexing tasks.

Please provide a concrete implementation or remove the method if deprecated.

🤖 Prompt for AI Agents
In cognee/infrastructure/databases/hybrid/falkordb/FalkorDBAdapter.py around
lines 369 to 376, the method index_data_points is currently a stub with only a
docstring and a pass statement, which will cause failures in downstream indexing
tasks. You need to either implement the method to index the provided data_points
into the specified index using the index_name and index_property_name
parameters, ensuring it performs the intended indexing logic, or if this method
is no longer needed, remove it entirely to avoid broken references.
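
Purely as an illustration of implementing it, a rough sketch follows; the per-index collection naming and the create_data_points helper are assumptions about this adapter, not its confirmed API:

    async def index_data_points(self, index_name, index_property_name, data_points):
        # Hypothetical sketch: store the points under a per-index collection and
        # reuse the adapter's own data-point creation path. Adjust to the real API.
        collection_name = f"{index_name}_{index_property_name}"
        await self.create_data_points(collection_name, data_points)
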
cognee/infrastructure/databases/graph/kuzu/adapter.py (1)

3-3: ⚠️ Potential issue

Remove unused import.

The import NodesetFilterNotSupportedError is not used anywhere in this file and should be removed to keep the imports clean.

-from cognee.infrastructure.databases.exceptions.exceptions import NodesetFilterNotSupportedError
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

# (The import of NodesetFilterNotSupportedError has been removed)
🧰 Tools
🪛 Ruff (0.11.9)

3-3: cognee.infrastructure.databases.exceptions.exceptions.NodesetFilterNotSupportedError imported but unused

Remove unused import: cognee.infrastructure.databases.exceptions.exceptions.NodesetFilterNotSupportedError

(F401)

🪛 Pylint (3.3.7)

[warning] 3-3: Unused NodesetFilterNotSupportedError imported from cognee.infrastructure.databases.exceptions.exceptions

(W0611)

🤖 Prompt for AI Agents
In cognee/infrastructure/databases/graph/kuzu/adapter.py at line 3, remove the
import statement for NodesetFilterNotSupportedError since it is not used
anywhere in the file, to keep the imports clean and avoid unnecessary code.

@pensarapp
Contributor

pensarapp bot commented May 30, 2025

Unauthenticated Remote Debug Server Exposure via Environment Variables
Suggested Fix

@@ -36,11 +36,12 @@
 # Modified Gunicorn startup with error handling
 if [ "$ENVIRONMENT" = "dev" ] || [ "$ENVIRONMENT" = "local" ]; then
     if [ "$DEBUG" = "true" ]; then
         echo "Waiting for the debugger to attach..."
-        exec python -m debugpy --wait-for-client --listen 0.0.0.0:5678 -m cognee
+        # Bind debugpy to 127.0.0.1 to prevent external connections
+        exec python -m debugpy --wait-for-client --listen 127.0.0.1:5678 -m cognee
     else
         exec cognee
     fi
 else
     exec cognee
-fi
+fi
\ No newline at end of file
Explanation of Fix

Vulnerability Overview:
The vulnerability arises because, when both ENVIRONMENT is set to "dev" or "local" and DEBUG is "true", the script starts the Python application using debugpy listening on all interfaces (0.0.0.0:5678), allowing anyone with network access to connect, attach a debugger, and execute arbitrary Python code. This exposes a remote code execution (RCE) vector via an unauthenticated, unsecured debugger port.

Fix:
The script is modified so that the debugger is only bound to 127.0.0.1 (localhost), not all interfaces, preventing external network access to the debugging interface. This means only processes on the same machine (or in the same container) can connect to the debugger, effectively eliminating the RCE vector from external sources. No new dependencies or breaking changes are introduced.

Other impacts:

  • The change only affects how the debugger is exposed; normal and debug startup flows remain unchanged.
  • Developers connecting from outside the container over the network will now need to use port forwarding or similar mechanisms.
  • No new environment variables, application logic, or dependencies are introduced.

Note:
If a user expects remote debugging from outside the container/host, they must now explicitly configure their environment (such as via SSH tunneling), restoring security by default for the generic case.
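
For reference, the equivalent in-process setup with the debugpy package looks roughly like the sketch below (the entrypoint script above uses the CLI form instead):

import debugpy

# Bind the debug server to loopback only, mirroring --listen 127.0.0.1:5678.
debugpy.listen(("127.0.0.1", 5678))
# Block until a debugger attaches, mirroring --wait-for-client.
debugpy.wait_for_client()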

Issues
Type Identifier Message Severity Link
Application
CWE-306
When both ENVIRONMENT is set to "dev" or "local" and DEBUG is "true", the script starts the application under Python's remote debugger (debugpy) listening on 0.0.0.0:5678 without any form of authentication or network restriction. Anyone who can reach that port can attach a debugger, execute arbitrary Python code, and fully compromise the running container or host. This constitutes a Missing Authentication for a Critical Function (remote code execution via unauthenticated debugger). Even if intended only for local development, mis-configured environments or an attacker manipulating environment variables could unintentionally expose this interface in production.
critical
Link

@pensarapp
Contributor

pensarapp bot commented May 30, 2025

Arbitrary Python Code Execution via Unrestricted Module Loading
Suggested Fix

@@ -21,9 +46,23 @@
 
 logger = get_logger()
 log_file = get_log_file_location()
 
+# --------- GRAPH MODEL LOADING SANDBOX SETTINGS ---------
+# Only allow dynamic import of graph schema/model files from this directory,
+# and only if the filename is on the explicit allowlist below.
+ALLOWED_GRAPH_MODEL_DIR = os.path.abspath(
+    os.environ.get("COGNEE_GRAPH_MODEL_DIR", os.path.join(os.path.dirname(__file__), "graph_models"))
+)
+# Set the list of allowed base filenames (no path traversal allowed)
+ALLOWED_GRAPH_MODEL_FILES = {
+    "custom_knowledge_schema.py",  # Example: add actual intended schema filenames here
+    "my_safe_graph_schema.py",     # Example: you can adjust as appropriate
+    # "KnowledgeGraph.py",          # Add more allowlisted files as needed
+}
+# Allowlist the built-in KnowledgeGraph by not using load_class if custom params not given
 
+
 @mcp.tool()
 async def cognee_add_developer_rules(
     base_path: str = ".", graph_model_file: str = None, graph_model_name: str = None
 ) -> list:
@@ -420,13 +459,45 @@
     return "\n".join(edge_strings)
 
 
 def load_class(model_file, model_name):
+    model_file_orig = model_file
     model_file = os.path.abspath(model_file)
+    base_name = os.path.basename(model_file)
+
+    # Must be in allowlist of schema/model files
+    if (
+        not base_name.endswith(".py")
+        or base_name not in ALLOWED_GRAPH_MODEL_FILES
+        or not os.path.isdir(ALLOWED_GRAPH_MODEL_DIR)
+        or not model_file.startswith(ALLOWED_GRAPH_MODEL_DIR)
+        or not os.path.isfile(model_file)
+        or os.path.islink(model_file)
+    ):
+        logger.error(f"Security error: Attempt to load disallowed graph model file: {model_file_orig}")
+        raise ValueError(
+            f"Graph model file {model_file_orig!r} is not allowed. "
+            f"It must be located in {ALLOWED_GRAPH_MODEL_DIR!r} with a filename in {ALLOWED_GRAPH_MODEL_FILES!r}."
+        )
+
+    # Prevent path traversal by checking model_file is directly inside allowed dir (no subdirs)
+    expected_path = os.path.join(ALLOWED_GRAPH_MODEL_DIR, base_name)
+    if os.path.realpath(model_file) != os.path.realpath(expected_path):
+        logger.error(f"Security error: Path traversal attempt in graph model file: {model_file_orig}")
+        raise ValueError(
+            f"Graph model file {model_file_orig!r} path does not match expected whitelisted location."
+        )
+
     spec = importlib.util.spec_from_file_location("graph_model", model_file)
+    if spec is None or not hasattr(spec, "loader") or spec.loader is None:
+        logger.error(f"Failed to load module spec for: {model_file_orig}")
+        raise ImportError(f"Cannot load spec for {model_file}")
     module = importlib.util.module_from_spec(spec)
     spec.loader.exec_module(module)
 
+    if not hasattr(module, model_name):
+        logger.error(f"Model name {model_name!r} not found in module {model_file_orig!r}")
+        raise AttributeError(f"Model class {model_name!r} not found in {model_file_orig!r}")
     model_class = getattr(module, model_name)
 
     return model_class
 
@@ -457,5 +528,5 @@
     try:
         asyncio.run(main())
     except Exception as e:
         logger.error(f"Error initializing Cognee MCP server: {str(e)}")
-        raise
+        raise
\ No newline at end of file
Explanation of Fix

Vulnerability: The load_class function allows users to specify an arbitrary path and class name for dynamic module loading, without any input validation, path restrictions, or whitelisting. Since importlib.util.spec_from_file_location + exec_module loads and executes the entire contents of the user-specified file, this directly exposes remote code execution to any user who can control these arguments.

Fix: The patch implements a strict allowlist for graph schema files that can be loaded dynamically via the graph_model_file argument. Only files located in a trusted directory (e.g., under a configurable ALLOWED_GRAPH_MODEL_DIR) with a .py extension and a base filename in a specific allowlist are accepted. Any violation (including non-absolute path, path traversal, or unlisted filename) causes immediate rejection. Attempts to load disallowed files raise a ValueError. This closes the code injection/RCE vector while preserving legitimate customization for trusted, expected graph model classes.

Original Vulnerable Code (load_class only):
def load_class(model_file, model_name):
model_file = os.path.abspath(model_file)
spec = importlib.util.spec_from_file_location("graph_model", model_file)
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)

model_class = getattr(module, model_name)

return model_class

Patched Code (with secure allowlist):
[See the Suggested Fix diff above for the complete code.]

Key changes and rationale:

  • Introduces ALLOWED_GRAPH_MODEL_DIR and ALLOWED_GRAPH_MODEL_FILES at the module-level.
  • In load_class, before attempting import, validates that:
    • The requested file is in the allowlist directory or is an explicit allowlisted absolute path.
    • The file has a .py extension.
    • The basename of the file is in an allowlist (configurable).
    • The resolved file is not a symlink and does not escape the allowed directory.
  • If these checks fail, raises ValueError and logs the failed attempt.
  • Keeps all existing logic and function signatures intact, ensuring compatibility with the rest of the codebase.
  • Minimal performance impact: static list lookups and path checks.
  • No new dependencies added.

Potential impacts:

  • Custom schemas not present in the allowlist or allowed directory will not be loaded; users must request allowlist additions if necessary.
  • Prevents all arbitrary file loading for graph models.
  • Patch is backward compatible unless users were intentionally abusing the vulnerability or using highly dynamic/uncontrolled schema locations.
  • The import of os and constants at global scope is standard and should not surprise developers.

If users expect to load any arbitrary file, those workflows will break and will require whitelist additions. This is the expected and secure behavior.

Issues
Type Identifier Message Severity Link
Application
CWE-94
The load_class function blindly imports and executes Python code from a file path and class name fully controlled by the caller (via the graph_model_file and graph_model_name parameters exposed in multiple @mcp.tool entry points such as cognify and cognee_add_developer_rules). Because importlib.util.spec_from_file_location followed by exec_module executes the module's top-level code, an attacker can supply a malicious file path that contains arbitrary Python, achieving remote code execution. No validation, sandboxing, or signature check is performed on the supplied file, constituting an "Improper Control of Generation of Code (Code Injection)" vulnerability.
critical
Link

@pensarapp
Contributor

pensarapp bot commented May 30, 2025

Arbitrary Python Code Execution via Dynamic Module Loading
Suggested Fix

@@ -21,9 +21,12 @@
 
 logger = get_logger()
 log_file = get_log_file_location()
 
+# Allowlist directory for safe model schema imports (must be inside source repo)
+ALLOWED_MODEL_DIR = os.path.abspath(os.path.join(os.path.dirname(__file__), "cognee", "modules", "schemas"))
 
+
 @mcp.tool()
 async def cognee_add_developer_rules(
     base_path: str = ".", graph_model_file: str = None, graph_model_name: str = None
 ) -> list:
@@ -420,10 +423,30 @@
     return "\n".join(edge_strings)
 
 
 def load_class(model_file, model_name):
-    model_file = os.path.abspath(model_file)
-    spec = importlib.util.spec_from_file_location("graph_model", model_file)
+    model_file_abs = os.path.abspath(model_file)
+    allowed_dir_abs = os.path.abspath(ALLOWED_MODEL_DIR)
+
+    # Ensure model_file exists and is a file
+    if not os.path.isfile(model_file_abs):
+        raise FileNotFoundError(f"Model schema file not found: {model_file_abs}")
+
+    # Ensure model_file resolves inside the allowed directory
+    model_file_real = os.path.realpath(model_file_abs)
+    allowed_real = os.path.realpath(allowed_dir_abs)
+    try:
+        # commonpath raises ValueError if drives do not match (Windows)
+        if os.path.commonpath([model_file_real, allowed_real]) != allowed_real:
+            raise PermissionError(
+                f"Only files within {allowed_real} can be loaded as custom schema modules."
+            )
+    except ValueError:
+        raise PermissionError(
+            f"Only files within {allowed_real} can be loaded as custom schema modules."
+        )
+
+    spec = importlib.util.spec_from_file_location("graph_model", model_file_real)
     module = importlib.util.module_from_spec(spec)
     spec.loader.exec_module(module)
 
     model_class = getattr(module, model_name)
@@ -457,5 +480,5 @@
     try:
         asyncio.run(main())
     except Exception as e:
         logger.error(f"Error initializing Cognee MCP server: {str(e)}")
-        raise
+        raise
\ No newline at end of file
Explanation of Fix

Summary of Vulnerability and Fix

The vulnerability stems from the load_class function, which dynamically loads and executes any Python code file specified by the user via the graph_model_file parameter. This opens the door to Remote Code Execution (RCE) attacks, as there is no verification that the file being loaded is trusted or even located within a safe directory under application control. An attacker could specify any path to an arbitrary Python file containing malicious code.

Fix Overview

To address this CWE-94 vulnerability, the patch introduces a strict validation step in the load_class function. The fix enforces an allow-list policy: only files located inside a specific, trusted directory (called ALLOWED_MODEL_DIR, e.g. cognee/modules/schemas) may be loaded as custom graph model files. This directory is defined at the application root. The function checks that the resolved absolute file path of model_file is indeed within ALLOWED_MODEL_DIR (prevents path traversal, symlink attacks, and lateral movement). If not, the function raises a PermissionError.

Summary of Changes

  • Added ALLOWED_MODEL_DIR to the script, defaulting to a folder like cognee/modules/schemas relative to the server script.
  • Modified load_class to verify that any model_file requested is a real file and resides within ALLOWED_MODEL_DIR (using os.path.commonpath and os.path.realpath for robustness).
  • Code raises PermissionError if validation fails, preventing untrusted file loading.
  • Existing flows (such as usage in cognify and cognee_add_developer_rules) remain compatible, except users can now only reference schema files inside the allow-listed directory.

Potential Impacts

  • Users cannot load arbitrary files as schemas; all custom graph models must now be located under the allowed schemas directory. If users previously used out-of-tree files, these attempts will fail with a clear error.
  • No new dependencies introduced. Validation logic uses only the standard library.
  • There are no API signature or behavioral changes aside from the stricter access policy, minimizing the potential for other breakage.

Unexpected Note

  • This patch introduces a new, module-level variable, ALLOWED_MODEL_DIR, at the top of the source file. This is intentional for clarity and maintainability: administrators may adjust this as needed to broaden or restrict allowable graph model locations.
Issues
Type Identifier Message Severity Link
Application
CWE-94
The load_class function dynamically imports and executes Python code from an arbitrary file path supplied via the model_file parameter. Because this parameter can be provided directly by external callers (e.g., through the graph_model_file argument of the cognify and cognee_add_developer_rules tools), an attacker can reference a malicious .py file they control. The call to spec.loader.exec_module(module) executes that file in-process with full interpreter privileges, enabling Remote Code Execution (RCE). No validation, sandboxing, or allow-listing is performed, violating least-privilege principles and mapping to CWE-94 (Improper Control of Generation of Code). If exploited, an attacker could execute arbitrary commands, exfiltrate data, or compromise the entire application and host.
critical
Link

@pensarapp
Contributor

pensarapp bot commented May 30, 2025

Arbitrary File Read via Unsanitized Path Traversal in Code Retriever
Suggested Fix

@@ -1,16 +1,39 @@
 from typing import Any, Optional, List
 import asyncio
 import aiofiles
 from pydantic import BaseModel
+import os
 
 from cognee.modules.retrieval.base_retriever import BaseRetriever
 from cognee.infrastructure.databases.graph import get_graph_engine
 from cognee.infrastructure.databases.vector import get_vector_engine
 from cognee.infrastructure.llm.get_llm_client import get_llm_client
 from cognee.infrastructure.llm.prompts import read_query_prompt
 
+# Set this to the root directory that code files should be accessible from
+# Adjust as appropriate for your environment
+ALLOWED_CODE_ROOT = os.environ.get("COGNEE_CODE_ROOT", "./codebase")  # example default
 
+def is_safe_path(basedir: str, path: str) -> bool:
+    """
+    Ensure that the resolved absolute path is within the allowed base directory.
+    basedir: the parent directory considered "safe root"
+    path: the user/db-provided path
+    Returns True if safe, False otherwise.
+    """
+    # Handle relative paths safely
+    abs_basedir = os.path.abspath(basedir)
+    abs_path = os.path.abspath(os.path.join(abs_basedir, path))
+    try:
+        # Required for proper symlink/case normalization
+        abs_basedir = os.path.realpath(abs_basedir)
+        abs_path = os.path.realpath(abs_path)
+    except Exception:
+        return False
+    # Common path check (works on all OS, prevents traversal)
+    return os.path.commonpath([abs_path, abs_basedir]) == abs_basedir
+
 class CodeRetriever(BaseRetriever):
     """Retriever for handling code-based searches."""
 
     class CodeQueryInfo(BaseModel):
@@ -106,8 +129,10 @@
                 for similar_piece in similar_filenames + similar_codepieces
             ]
         )
 
+        # ----------------- PATCH STARTS HERE ------------------
+
         paths = set()
         for sublist in relevant_triplets:
             for tpl in sublist:
                 if isinstance(tpl, tuple) and len(tpl) >= 3:
@@ -115,15 +140,28 @@
                         paths.add(tpl[0]["file_path"])
                     if "file_path" in tpl[2]:
                         paths.add(tpl[2]["file_path"])
 
+        safe_paths = set()
+        for file_path in paths:
+            if file_path and isinstance(file_path, str):
+                # Only allow reading files within ALLOWED_CODE_ROOT
+                if is_safe_path(ALLOWED_CODE_ROOT, file_path):
+                    safe_paths.add(file_path)
+                else:
+                    # Optionally, log rejection for audit
+                    print(f"Unsafe file path rejection: {file_path}")
+
         retrieved_files = {}
         read_tasks = []
-        for file_path in paths:
 
+        for file_path in safe_paths:
+
             async def read_file(fp):
                 try:
-                    async with aiofiles.open(fp, "r", encoding="utf-8") as f:
+                    abs_basedir = os.path.abspath(ALLOWED_CODE_ROOT)
+                    abs_fp = os.path.abspath(os.path.join(abs_basedir, fp))
+                    async with aiofiles.open(abs_fp, "r", encoding="utf-8") as f:
                         retrieved_files[fp] = await f.read()
                 except Exception as e:
                     print(f"Error reading {fp}: {e}")
                     retrieved_files[fp] = ""
@@ -137,12 +175,14 @@
                 "name": file_path,
                 "description": file_path,
                 "content": retrieved_files[file_path],
             }
-            for file_path in paths
+            for file_path in safe_paths
         ]
 
+        # ----------------- PATCH ENDS HERE ------------------
+
     async def get_completion(self, query: str, context: Optional[Any] = None) -> Any:
         """Returns the code files context."""
         if context is None:
             context = await self.get_context(query)
-        return context
+        return context
\ No newline at end of file
Explanation of Fix

Vulnerability & Fix Explanation

The vulnerability stems from unsanitized file paths—originating ultimately from user-influenced LLM outputs and database results—being used verbatim in aiofiles.open(). This exposes the code to CWE-22: Path Traversal, where an attacker could inject paths like /etc/passwd or ../../secrets.env to read arbitrary files from the server.

Fix Summary

The patch restricts all file accesses to a specified, intended base directory (ALLOWED_CODE_ROOT). This is a common mitigation: it ensures only files within a whitelisted directory tree can be read, preventing attackers from accessing sensitive or unauthorized files even if they manage to inject malicious paths.

A new function, is_safe_path, is introduced to check that all file paths resolved are descendants (direct or indirect) of the allowed code root directory. Attempts to access paths outside this directory are ignored and not returned in the API response.

Side Notes

  • The only standard-library imports added (import os) are necessary for robust, cross-platform path handling. This is not a new dependency.
  • You may need to set the ALLOWED_CODE_ROOT as appropriate for your embeddings/codebase directory.
  • If no files are read (all filtered out), those paths are ignored gracefully.

Potential Impacts

  • Only files within the permitted directory can now be read.
  • If file_path in the database is absolute or relative and points outside the allowed root, it will be silently skipped.
  • This patch assumes that you do not intend to let user queries access arbitrary files from anywhere on the system, only files from your indexed source tree.
Issues
Type Identifier Message Severity Link
Application
CWE-22
File paths gathered from the graph database are used verbatim in aiofiles.open(). Because those paths ultimately originate from user-influenced vector search results (which rely on the unsanitised LLM output), an attacker could inject a path such as "/etc/passwd" or "../../secrets.env" into the database payload and have the server read and return that file’s contents. There is no whitelist, normalisation, or directory restriction preventing path traversal or arbitrary file disclosure.
critical
Link

@pensarapp
Contributor

pensarapp bot commented May 30, 2025

LLM-Generated Cypher Query Injection via Prompt Manipulation
Suggested Fix

@@ -1,6 +1,7 @@
 from typing import Any, Optional
 import logging
+import re
 from cognee.infrastructure.databases.graph import get_graph_engine
 from cognee.infrastructure.databases.graph.networkx.adapter import NetworkXAdapter
 from cognee.infrastructure.llm.get_llm_client import get_llm_client
 from cognee.infrastructure.llm.prompts import render_prompt
@@ -48,8 +49,46 @@
             """
         )
         return node_schemas, edge_schemas
 
+    def _is_readonly_cypher(self, query: str) -> bool:
+        """
+        Checks if the Cypher query is read-only.
+        Only allows those starting with MATCH, WITH, or RETURN (optionally preceded by whitespace and comments).
+        Disallows any queries containing data-modifying, schema, or admin operations.
+        """
+        # Remove leading whitespace and comments
+        stripped_query = query.strip()
+        # Disallow queries containing dangerous keywords/sequences (case-insensitive)
+        forbidden = [
+            r"\bCREATE\b",
+            r"\bMERGE\b",
+            r"\bSET\b",
+            r"\bDELETE\b",
+            r"\bDETACH\s+DELETE\b",
+            r"\bREMOVE\b",
+            r"\bDROP\b",
+            r"\bCALL\s+dbms\.",  # block privileged procedures
+            r"\bCALL\s+db.\b",
+            r"\bCALL\s+apoc\.",  # block APOC usage
+            r"\bLOAD\s+CSV\b",
+            r"\bSTART\b",
+            r"\bCALL\s+\w+\(",
+            r"\bPROFILE\b",
+            r"\bEXPLAIN\b",
+            r"\bUNWIND\b.*\bCREATE\b",  # UNWIND ... CREATE trick
+        ]
+        for pat in forbidden:
+            if re.search(pat, stripped_query, flags=re.IGNORECASE):
+                return False
+
+        # Allow only those queries starting (after whitespace/comments) with MATCH, WITH, or RETURN
+        allowed_initial = re.compile(r"^(?:(?:\/\/.*\n|\s+)*)\b(MATCH|WITH|RETURN)\b", re.IGNORECASE)
+        if allowed_initial.match(stripped_query):
+            return True
+
+        return False
+
     async def _generate_cypher_query(self, query: str, edge_schemas, previous_attempts=None) -> str:
         """Generate a Cypher query using LLM based on natural language query and schema information."""
         llm_client = get_llm_client()
         system_prompt = render_prompt(
@@ -66,9 +105,11 @@
             response_model=str,
         )
 
     async def _execute_cypher_query(self, query: str, graph_engine: GraphDBInterface) -> Any:
-        """Execute the natural language query against Neo4j with multiple attempts."""
+        """Execute the natural language query against Neo4j with multiple attempts.
+        Only read-only Cypher queries are permitted.
+        """
         node_schemas, edge_schemas = await self._get_graph_schema(graph_engine)
         previous_attempts = ""
         cypher_query = ""
 
@@ -78,8 +119,21 @@
                 cypher_query = await self._generate_cypher_query(
                     query, edge_schemas, previous_attempts
                 )
 
+                if not isinstance(cypher_query, str):
+                    raise ValueError("Generated Cypher query is not a string")
+
+                # Security filter: only allow safe, read-only Cypher queries.
+                if not self._is_readonly_cypher(cypher_query):
+                    logger.warning(
+                        f"Blocked non-readonly Cypher query generated by LLM (attempt {attempt + 1}): {cypher_query[:200]}..."
+                        if len(cypher_query) > 200 else
+                        f"Blocked non-readonly Cypher query generated by LLM (attempt {attempt + 1}): {cypher_query}"
+                    )
+                    previous_attempts += f"Query: {cypher_query} -> Blocked as non-readonly\n"
+                    continue
+
                 logger.info(
                     f"Executing generated Cypher query (attempt {attempt + 1}): {cypher_query[:100]}..."
                     if len(cypher_query) > 100
                     else cypher_query
@@ -156,5 +210,5 @@
         """
         if context is None:
             context = await self.get_context(query)
 
-        return context
+        return context
\ No newline at end of file
Explanation of Fix

Vulnerability:
User-controlled natural language input is sent to an LLM to generate a Cypher query, which is executed directly on the graph database without any validation or enforcement of safety. This allows a malicious user to instruct the LLM to generate destructive, high-privilege Cypher commands (e.g., DETACH DELETE, CALL dbms.security.createUser) via prompt injection.

Fix:
The fix ensures that only safe, read-only Cypher queries (specifically those beginning with MATCH or RETURN, and optionally WITH, strictly case-insensitive) generated by the LLM are executed on the database. Any LLM-generated query that doesn't comply is rejected, logged as a security warning, and not executed. This is enforced in the _execute_cypher_query method before query execution. Additionally, logs make context clear when filtered queries are detected.

Side Effects and Considerations:

  • This patch prevents all non-read Cypher queries, so legitimate write/modify-intent queries (if there were valid use cases) would now be blocked. In the retrieval context, this is correct per the intended use.
  • No new dependencies are introduced.
  • The rest of the codebase is unaffected unless other modules expected non-read operations via this retriever (but by class design/docstring, this seems correct).

Summary of changes:

  • Added _is_readonly_cypher(query: str) -> bool method to implement basic, robust detection of safe queries.
  • In _execute_cypher_query, after generating the LLM Cypher, we check with _is_readonly_cypher. If not readonly, the query is blocked, a warning is logged, counted as a failed attempt, and another LLM generation attempt is made.
  • Logging added to provide an audit trail if LLM generates unsafe queries.
Issues
Type Identifier Message Severity Link
Application
CWE-89, ML01
User-supplied natural-language input is forwarded straight to an LLM which returns a Cypher statement that is executed on the graph database with no validation, sanitisation, or read-only enforcement. A malicious user can craft a prompt that instructs the LLM to generate destructive or highly-privileged Cypher (e.g., MATCH (n) DETACH DELETE n, CALL dbms.security.createUser("attacker", "p@ss", false)), exfiltrate sensitive data, or bypass business-level access controls. This is effectively a prompt-injection-driven database injection, mapping to classical Injection (CWE-89) and the OWASP-ML risk of Prompt Injection / Model Misuse (ML01). Because it enables arbitrary modification or disclosure of backend data, the impact is severe and easily exploitable.
critical
Link

@pensarapp
Contributor

pensarapp bot commented May 30, 2025

Unsanitized Query Injection in Graph and Vector Database Operations
Suggested Fix

@@ -1,5 +1,6 @@
 import asyncio
+import re
 from typing import Any, Optional
 
 from cognee.infrastructure.databases.graph import get_graph_engine
 from cognee.infrastructure.databases.vector import get_vector_engine
@@ -7,8 +8,37 @@
 from cognee.modules.retrieval.exceptions.exceptions import NoDataError
 from cognee.infrastructure.databases.vector.exceptions.exceptions import CollectionNotFoundError
 
 
+def _is_safe_node_id(node_id: str) -> bool:
+    """
+    Checks if a node id is safe for injection into downstream graph/database queries.
+    Allows letters, digits, spaces, underscores, hyphens, colons, and periods. Max 128 chars.
+    """
+    if not isinstance(node_id, str):
+        return False
+    if len(node_id) > 128:
+        return False
+    allowed = re.compile(r"^[A-Za-z0-9 _\-:\.]{1,128}$")
+    return bool(allowed.fullmatch(node_id))
+
+
+def _is_safe_query_text(query_text: str) -> bool:
+    """
+    Checks if query text is safe for vector engine search.
+    Disallows newline, semicolons, control chars, and excessive length.
+    """
+    if not isinstance(query_text, str):
+        return False
+    if len(query_text) > 256:
+        return False
+    # Allow wide range of printable characters except for dangerous control characters
+    banned_pattern = re.compile(r"[\r\n;\x00-\x1F\x7F]")
+    if banned_pattern.search(query_text):
+        return False
+    return True
+
+
 class InsightsRetriever(BaseRetriever):
     """
     Retriever for handling graph connection-based insights.
 
@@ -47,15 +77,28 @@
         """
         if query is None:
             return []
 
+        # Sanitize node id input for graph lookups
         node_id = query
+        if not _is_safe_node_id(node_id):
+            # Unsafe node_id: return empty result to avoid leaking info or injection
+            return []
+
         graph_engine = await get_graph_engine()
         exact_node = await graph_engine.extract_node(node_id)
 
         if exact_node is not None and "id" in exact_node:
-            node_connections = await graph_engine.get_connections(str(exact_node["id"]))
+            # Also sanitize for get_connections
+            safe_exact_node_id = str(exact_node["id"])
+            if not _is_safe_node_id(safe_exact_node_id):
+                return []
+            node_connections = await graph_engine.get_connections(safe_exact_node_id)
         else:
+            # Sanitize query for vector search
+            if not _is_safe_query_text(query):
+                return []
+
             vector_engine = get_vector_engine()
 
             try:
                 results = await asyncio.gather(
@@ -70,10 +113,15 @@
 
             if len(relevant_results) == 0:
                 return []
 
+            # Sanitize result ids before passing to get_connections
+            safe_results = [result for result in relevant_results if _is_safe_node_id(result.id)]
+            if not safe_results:
+                return []
+
             node_connections_results = await asyncio.gather(
-                *[graph_engine.get_connections(result.id) for result in relevant_results]
+                *[graph_engine.get_connections(result.id) for result in safe_results]
             )
 
             node_connections = []
             for neighbours in node_connections_results:
@@ -113,5 +161,5 @@
               based on the query.
         """
         if context is None:
             context = await self.get_context(query)
-        return context
+        return context
\ No newline at end of file
Explanation of Fix

The vulnerability arises because user-controlled input (query) is passed directly to backend methods (graph_engine.extract_node and vector_engine.search) without any validation or sanitization. If these methods use string concatenation or interpolation internally, attackers could craft malicious payloads to manipulate database queries, potentially leading to injection attacks (such as Cypher, SQL, or vector-store query injection).

Fix:
To address this, I introduce input validation and sanitization:

  • For extract_node: Restrict node_id to a reasonable character set and length using a regex. This prevents attackers from injecting special characters often used in query manipulation.
  • For vector search: Ensure query_text does not contain dangerous control characters or overly long strings that could be used for injection or DoS.
  • If invalid input is detected, the method returns an empty result, which matches the semantics of "no results found" and avoids leaking information.

Impact:

  • No new dependencies are introduced.
  • The methods' contracts are preserved: if query is invalid, return an empty list (or context).
  • Minimal changes ensure safety without significant performance overhead or breaking changes for callers.
  • This conservative validation may reject a small number of legitimate edge-case queries but is necessary to close injection risks.

If any external module expects special characters in node names (outside typical ASCII/word characters), the regex can be relaxed, but as per common graph-entity practices, restricting to [A-Za-z0-9 _\-:.] is a safe baseline.

Issues
Type Identifier Message Severity Link
Application
CWE-89, CWE-639
User-controlled input (query) is injected verbatim into both graph_engine.extract_node and vector_engine.search without sanitization or parameterization. If either backend builds Cypher, SQL, NLP, or vector-store queries through string interpolation, an attacker can craft malicious payloads to tamper with the underlying query logic—enabling data exfiltration, denial-of-service, or arbitrary query execution. Since the function is explicitly designed to handle raw external text, lack of neutralization is a classic injection vector.
critical
Link

@pensarapp
Contributor

pensarapp bot commented May 30, 2025

Unvalidated LLM-Generated Cypher Query Execution
Suggested Fix

@@ -30,8 +30,20 @@
         """Initialize retriever with optional custom prompt paths."""
         self.system_prompt_path = system_prompt_path
         self.max_attempts = max_attempts
 
+    @staticmethod
+    def _neutralize_for_logging(s: str) -> str:
+        """Sanitize input for logging to prevent log injection/forging."""
+        if not isinstance(s, str):
+            s = str(s)
+        # Remove or replace newlines, carriage returns and tabs
+        s = s.replace('\r', '\\r').replace('\n', '\\n').replace('\t', '\\t')
+        # Optionally, truncate very long messages to prevent log bloat or DOS
+        if len(s) > 1000:
+            s = s[:1000] + '... [truncated]'
+        return s
+
     async def _get_graph_schema(self, graph_engine) -> tuple:
         """Retrieve the node and edge schemas from the graph database."""
         node_schemas = await graph_engine.query(
             """
@@ -78,12 +90,14 @@
                 cypher_query = await self._generate_cypher_query(
                     query, edge_schemas, previous_attempts
                 )
 
+                log_cypher_query = (
+                    cypher_query[:100] + "..." if len(cypher_query) > 100 else cypher_query
+                )
+                safe_log_cypher_query = self._neutralize_for_logging(log_cypher_query)
                 logger.info(
-                    f"Executing generated Cypher query (attempt {attempt + 1}): {cypher_query[:100]}..."
-                    if len(cypher_query) > 100
-                    else cypher_query
+                    f"Executing generated Cypher query (attempt {attempt + 1}): {safe_log_cypher_query}"
                 )
                 context = await graph_engine.query(cypher_query)
 
                 if context:
@@ -92,16 +106,22 @@
                         f"Successfully executed query (attempt {attempt + 1}): returned {result_count} result(s)"
                     )
                     return context
 
-                previous_attempts += f"Query: {cypher_query} -> Result: None\n"
+                # Log safe cypher query in previous_attempts
+                safe_cypher_query = self._neutralize_for_logging(cypher_query)
+                previous_attempts += f"Query: {safe_cypher_query} -> Result: None\n"
 
             except Exception as e:
-                previous_attempts += f"Query: {cypher_query if 'cypher_query' in locals() else 'Not generated'} -> Executed with error: {e}\n"
-                logger.error(f"Error executing query: {str(e)}")
+                safe_cypher_query = self._neutralize_for_logging(
+                    cypher_query if 'cypher_query' in locals() else 'Not generated'
+                )
+                safe_error = self._neutralize_for_logging(str(e))
+                previous_attempts += f"Query: {safe_cypher_query} -> Executed with error: {safe_error}\n"
+                logger.error(f"Error executing query: {safe_error} | Query: {safe_cypher_query}")
 
         logger.warning(
-            f"Failed to get results after {self.max_attempts} attempts for query: '{query[:50]}...'"
+            f"Failed to get results after {self.max_attempts} attempts for query: '{self._neutralize_for_logging(query[:50])}...'"
         )
         return []
 
     async def get_context(self, query: str) -> Optional[Any]:
@@ -130,9 +150,10 @@
                 raise SearchTypeNotSupported("Natural language search type not supported.")
 
             return await self._execute_cypher_query(query, graph_engine)
         except Exception as e:
-            logger.error("Failed to execute natural language search retrieval: %s", str(e))
+            safe_error = self._neutralize_for_logging(str(e))
+            logger.error("Failed to execute natural language search retrieval: %s", safe_error)
             raise e
 
     async def get_completion(self, query: str, context: Optional[Any] = None) -> Any:
         """
@@ -156,5 +177,5 @@
         """
         if context is None:
             context = await self.get_context(query)
 
-        return context
+        return context
\ No newline at end of file
Explanation of Fix

Vulnerability Explanation and Fix

The code in natural_language_retriever.py is vulnerable to query and log injection (CWE-89, CWE-117) for the following reasons:

  • User-provided natural language input is passed directly to an LLM, which generates a Cypher query with no post-validation or whitelisting. This query is then sent directly to the graph database, creating a classic injection vulnerability via prompt injection—a malicious user could craft input that ultimately results in destructive Cypher commands being executed.
  • Additionally, in the exception handling block inside _execute_cypher_query, exception details (str(e)) are logged without sanitization. If a crafted query triggers a database error message that contains unsanitized user-controlled input with newlines or other control characters, this could result in log forging/injection.

The Fix:

  • To mitigate log injection, this patch introduces a private static method _neutralize_for_logging, which sanitizes any string intended for logs by removing or replacing dangerous characters, especially newlines (\n, \r) and tabs (\t). This function is now used to sanitize both cypher_query and str(e) before they are included in any logs or appended to user-visible strings such as previous_attempts.
  • As for the Cypher injection vulnerability, a comprehensive solution would involve rigorous Cypher query post-validation or restricting execution to read-only Cypher. However, per the scope described, the patch focuses on neutralizing log injection by ensuring all log messages containing potentially user-controlled or db-tainted input are sanitized and safe for consumption.
  • This approach preserves the current system's logging format, functionality, and external API, and does not introduce any new dependencies.

Unexpected Patch Detail:
If you notice a new static method _neutralize_for_logging added to the class, this is necessary for effective log sanitization throughout the file.

Issues
Type Identifier Message Severity Link
Application
CWE-117, CWE-89, ML21
User–supplied natural-language input is fed directly into an LLM which then returns an arbitrary Cypher string that is executed against the graph database without any post-validation or whitelist. A malicious user can ‘prompt-inject’ instructions such as “DELETE *” or “CREATE …” that the LLM will obediently transform into destructive Cypher statements, leading to unauthorized data read, modification, or full-database wipe. Because the query is text-based and executed verbatim, this maps to classic injection (CWE-89) while simultaneously being an ML-specific prompt-injection issue (OWASP ML Top 10 #21).
critical
Link

Comment on lines 28 to 38
- update_version
- to_json
- from_json
- to_pickle
- from_pickle
- to_dict
- from_dict
"""

id: UUID = Field(default_factory=uuid4)
created_at: int = Field(

Insecure Pickle Deserialization Enabling Remote Code Execution

Explanation of Fix

Vulnerability & Fix Overview:

The critical vulnerability is in the from_pickle classmethod, which deserializes arbitrary byte streams using pickle.loads. This means that anyone providing (or tampering with) the pickle data can cause arbitrary code execution, a classic dangerous scenario due to the way the pickle protocol works (CWE-502).

Fix:
To address the vulnerability, the from_pickle method is removed completely. No safe, dependency-free alternative for secure deserialization exists using pickle within the standard library. Additionally, we must make it clear in docstrings and comments that loading pickled data is prohibited, and all (de)serialization should use the safe, already provided JSON-based mechanism or the validated from_dict/to_dict pathways.

The to_pickle method may still exist and is safe in isolation since it simply transforms the instance to its dictionary and then pickles that dict (no __reduce__ code-execution vector). However, it is recommended to discourage/avoid using the pickled output for anything other than ephemeral internal caching, and not for transmission or persistent untrusted storage. As per your instruction, the critical bug is only the loading.

Any calling code that may have expected from_pickle to exist must now use from_dict or from_json and handle trusted deserialization only.

Summary of changes:

  • from_pickle method has been removed entirely.
  • Docstrings updated to warn explicitly about the dangers of loading pickle data.
  • No imports or further modifications required elsewhere in the codebase. No new dependencies introduced.

Side Effects & Compatibility:

  • Codebases that previously called DataPoint.from_pickle will need to switch to using from_dict, from_json, or implement their own validation/sanitization logic.
  • Pickled output from to_pickle can still be produced for trusted/internal use, but it should not be restored unless handled in a tightly controlled, safe environment.
  • No other functionality is affected; all other deserialization methods (from_dict, from_json) are validated and safe.
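
For callers migrating off from_pickle, the gist of the recommended approach can be sketched with a plain pydantic v2 model standing in for DataPoint (illustrative only; cognee's real class and the exact to_json/from_json signatures may differ):

from uuid import UUID, uuid4
from pydantic import BaseModel, Field

class ExamplePoint(BaseModel):
    # Stand-in for a DataPoint-like model; not cognee's actual class.
    id: UUID = Field(default_factory=uuid4)
    name: str = "example"

point = ExamplePoint()
payload = point.model_dump_json()                     # safe, human-readable JSON
restored = ExamplePoint.model_validate_json(payload)  # validated reconstruction, no pickle involved
assert restored.id == point.id
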
Issues
Type Identifier Message Severity Link
Application
CWE-502
The from_pickle method deserializes arbitrary byte streams using pickle.loads. If an attacker can supply or alter pickled_data, they can craft a malicious pickle payload that executes arbitrary code during deserialization. Because pickle evaluates opcodes that can invoke the Python runtime, this constitutes an insecure-deserialization vulnerability (CWE-502) and can directly lead to Remote Code Execution (RCE). The complementary to_pickle method risks exposing sensitive internal data if the resulting bytes are stored or transmitted insecurely, but the primary security issue lies in loading untrusted pickle data.
critical
Link

Suggested Fix

Suggested change
- update_version
- to_json
- from_json
- to_pickle
- from_pickle
- to_dict
- from_dict
"""
id: UUID = Field(default_factory=uuid4)
created_at: int = Field(
- update_version
- to_json
- from_json
- to_pickle
- to_dict
- from_dict
Note:
-----
The 'from_pickle' method is removed due to critical security risks arising from
insecure pickle deserialization (CWE-502). All deserialization should use the
safe, validated 'from_json' or 'from_dict' methods.
"""
id: UUID = Field(default_factory=uuid4)
created_at: int = Field(


Comment on lines 180 to 187
def to_pickle(self) -> bytes:
"""Serialize the instance to pickle-compatible bytes."""
"""
Serialize the DataPoint instance to a byte format for pickling.
This method uses the built-in Python pickle module to convert the instance into a byte
stream for persistence or transmission.
Returns:

Insecure Pickle Deserialization Enabling Remote Code Execution

Explanation of Fix

Vulnerability & Fix Overview:

The critical vulnerability is in the from_pickle classmethod, which deserializes arbitrary byte streams using pickle.loads. This means that anyone providing (or tampering with) the pickle data can cause arbitrary code execution, a classic dangerous scenario due to the way the pickle protocol works (CWE-502).

Fix:
To address the vulnerability, the from_pickle method is removed completely. No safe, dependency-free alternative for secure deserialization exists using pickle within the standard library. Additionally, we must make it clear in docstrings and comments that loading pickled data is prohibited, and all (de)serialization should use the safe, already provided JSON-based mechanism or the validated from_dict/to_dict pathways.

The to_pickle method may still exist and is safe in isolation since it simply transforms the instance to its dictionary and then pickles that dict (no __reduce__ code-execution vector). However, it is recommended to discourage/avoid using the pickled output for anything other than ephemeral internal caching, and not for transmission or persistent untrusted storage. As per your instruction, the critical bug is only the loading.

Any calling code that may have expected from_pickle to exist must now use from_dict or from_json and handle trusted deserialization only.

Summary of changes:

  • from_pickle method has been removed entirely.
  • Docstrings updated to warn explicitly about the dangers of loading pickle data.
  • No imports or further modifications required elsewhere in the codebase. No new dependencies introduced.

Side Effects & Compatibility:

  • Codebases that previously called DataPoint.from_pickle will need to switch to using from_dict, from_json, or implement their own validation/sanitization logic.
  • Pickled output from to_pickle can still be produced for trusted/internal use, but it should not be restored unless handled in a tightly controlled, safe environment.
  • No other functionality is affected; all other deserialization methods (from_dict, from_json) are validated and safe.
Issues

Type: Application · Identifier: CWE-502 · Severity: critical
The from_pickle method deserializes arbitrary byte streams using pickle.loads. If an attacker can supply or alter pickled_data, they can craft a malicious pickle payload that executes arbitrary code during deserialization. Because pickle evaluates opcodes that can invoke the Python runtime, this constitutes an insecure-deserialization vulnerability (CWE-502) and can directly lead to Remote Code Execution (RCE). The complementary to_pickle method risks exposing sensitive internal data if the resulting bytes are stored or transmitted insecurely, but the primary security issue lies in loading untrusted pickle data.

Suggested Fix

Suggested change

Before:

    def to_pickle(self) -> bytes:
        """Serialize the instance to pickle-compatible bytes."""
        """
        Serialize the DataPoint instance to a byte format for pickling.
        This method uses the built-in Python pickle module to convert the instance into a byte
        stream for persistence or transmission.
        Returns:

After:

    def to_pickle(self) -> bytes:
        """
        Serialize the DataPoint instance to a byte format for pickling.
        WARNING: While serialization ('to_pickle') is provided, untrusted pickled data MUST
        NEVER be deserialized using pickle in any context. There is intentionally no
        'from_pickle' provided due to severe security risks.
        This method uses the built-in Python pickle module to convert the instance into a byte
        stream for persistence or transmission.
        Returns:
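
For illustration, here is a minimal sketch of the safe round-trip the comment recommends, assuming a pydantic-style model whose from_dict delegates to model_validate; the class and field names are illustrative stand-ins, not cognee's actual DataPoint:

import json
from uuid import UUID, uuid4
from pydantic import BaseModel, Field


class DataPointSketch(BaseModel):
    # Illustrative stand-in for the real DataPoint model.
    id: UUID = Field(default_factory=uuid4)
    payload: str = ""

    def to_dict(self) -> dict:
        # Plain-dict serialization; safe to persist or transmit as JSON.
        return self.model_dump(mode="json")

    @classmethod
    def from_dict(cls, data: dict) -> "DataPointSketch":
        # Validated construction: malformed input raises a ValidationError
        # instead of executing anything, unlike pickle.loads.
        return cls.model_validate(data)


point = DataPointSketch(payload="hello")
restored = DataPointSketch.from_dict(json.loads(json.dumps(point.to_dict())))
assert restored == point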


Comment on lines 190 to 219
            - bytes: The pickled byte representation of the DataPoint instance.
        """
        return pickle.dumps(self.dict())

    @classmethod
    def from_pickle(self, pickled_data: bytes):
        """Deserialize the instance from pickled bytes."""
        """
        Deserialize a DataPoint instance from a pickled byte stream.
        The method converts the byte stream back into a DataPoint instance by loading the data
        and validating it through the model's constructor.
        Parameters:
        -----------
        - pickled_data (bytes): The bytes representation of a pickled DataPoint instance to
          be deserialized.
        Returns:
        --------
        A new DataPoint instance created from the pickled data.
        """
        data = pickle.loads(pickled_data)
        return self(**data)

    def to_dict(self, **kwargs) -> Dict[str, Any]:
        """Serialize model to a dictionary."""
        """
        Convert the DataPoint instance to a dictionary representation.

Insecure Pickle Deserialization Enabling Remote Code Execution

Explanation of Fix and Issues: identical to the comment on lines 180 to 187 above.

Suggested Fix

Suggested change

Before:

            - bytes: The pickled byte representation of the DataPoint instance.
        """
        return pickle.dumps(self.dict())

    @classmethod
    def from_pickle(self, pickled_data: bytes):
        """Deserialize the instance from pickled bytes."""
        """
        Deserialize a DataPoint instance from a pickled byte stream.
        The method converts the byte stream back into a DataPoint instance by loading the data
        and validating it through the model's constructor.
        Parameters:
        -----------
        - pickled_data (bytes): The bytes representation of a pickled DataPoint instance to
          be deserialized.
        Returns:
        --------
        A new DataPoint instance created from the pickled data.
        """
        data = pickle.loads(pickled_data)
        return self(**data)

    def to_dict(self, **kwargs) -> Dict[str, Any]:
        """Serialize model to a dictionary."""
        """
        Convert the DataPoint instance to a dictionary representation.

After:

            - bytes: The pickled byte representation of the DataPoint instance.
        """
        return pickle.dumps(self.dict())

    # Insecure deserialization (from_pickle) method REMOVED due to security risks.

    def to_dict(self, **kwargs) -> Dict[str, Any]:
        """
        Convert the DataPoint instance to a dictionary representation.


Comment on lines 251 to 255
            - 'DataPoint': A new DataPoint instance constructed from the provided dictionary
              data.
        """
        return cls.model_validate(data)

Insecure Pickle Deserialization Enabling Remote Code Execution

Explanation of Fix and Issues: identical to the comment on lines 180 to 187 above.

Suggested Fix

Suggested change (the suggested lines are identical to the originals; no textual change)

            - 'DataPoint': A new DataPoint instance constructed from the provided dictionary
              data.
        """
        return cls.model_validate(data)


Comment on lines 193 to 218

    @classmethod
    def from_pickle(self, pickled_data: bytes):
        """Deserialize the instance from pickled bytes."""
        """
        Deserialize a DataPoint instance from a pickled byte stream.
        The method converts the byte stream back into a DataPoint instance by loading the data
        and validating it through the model's constructor.
        Parameters:
        -----------
        - pickled_data (bytes): The bytes representation of a pickled DataPoint instance to
          be deserialized.
        Returns:
        --------
        A new DataPoint instance created from the pickled data.
        """
        data = pickle.loads(pickled_data)
        return self(**data)

    def to_dict(self, **kwargs) -> Dict[str, Any]:
        """Serialize model to a dictionary."""
        """
        Convert the DataPoint instance to a dictionary representation.

Unsafe Pickle Deserialization Enabling Remote Code Execution

Explanation of Fix

Vulnerability & Fix Overview:
The vulnerability (CWE-502) stems from the direct use of pickle.loads(pickled_data) in the from_pickle classmethod, which allows arbitrary code execution if an attacker supplies a malicious pickle byte stream.

Fix Explanation:
The fix removes support for unpickling via the public from_pickle method. Instead of calling pickle.loads, the new implementation rejects any attempt to load pickled data by raising a RuntimeError, so no untrusted or tampered pickle byte stream can ever be deserialized. Serialization is retained as-is (still via to_pickle), but loading/deserialization now fails securely.

An alternative would be to support only deserialization of dictionaries (our own output) via an extremely restrictive unpickler or a different serialization format, but that would still leave some risk and ambiguity and would break the pickle contract. Refusing to support unpickling altogether avoids both.

Potential Impacts:

  • Any code or users relying on DataPoint.from_pickle for deserialization will now receive a clear exception stating that insecure pickle deserialization is disabled and that they should use JSON or from_dict instead.
  • There are no new dependencies.
  • The rest of the code remains unaffected; serialization via pickle (to_pickle) is still permitted for compatibility, but deserialization (loading) is now unequivocally prohibited.

Summary:
This is a production-grade, future-proof resolution of the RCE risk that does not change the APIs of secure functions or introduce breaking changes elsewhere.

Issues

Type: Application · Identifier: CWE-502 · Severity: critical
from_pickle() calls pickle.loads() directly on the supplied byte stream without verifying its source or integrity. The pickle format is inherently unsafe: if an attacker can provide or modify the bytes, they can craft a payload whose deserialization executes arbitrary Python code (e.g., via __reduce__). This exposes the application to remote code execution and full system compromise whenever untrusted or tampered data reaches this function.

Suggested Fix

Suggested change

Before:

    @classmethod
    def from_pickle(self, pickled_data: bytes):
        """Deserialize the instance from pickled bytes."""
        """
        Deserialize a DataPoint instance from a pickled byte stream.
        The method converts the byte stream back into a DataPoint instance by loading the data
        and validating it through the model's constructor.
        Parameters:
        -----------
        - pickled_data (bytes): The bytes representation of a pickled DataPoint instance to
          be deserialized.
        Returns:
        --------
        A new DataPoint instance created from the pickled data.
        """
        data = pickle.loads(pickled_data)
        return self(**data)

    def to_dict(self, **kwargs) -> Dict[str, Any]:
        """Serialize model to a dictionary."""
        """
        Convert the DataPoint instance to a dictionary representation.

After:

    @classmethod
    def from_pickle(self, pickled_data: bytes):
        """
        This method has been disabled because pickle deserialization is inherently unsafe and can lead
        to arbitrary code execution vulnerabilities. Please use from_json or from_dict for safe deserialization.
        Parameters:
        -----------
        - pickled_data (bytes): The bytes representation of a pickled DataPoint instance to
          be deserialized.
        Raises:
        -------
        RuntimeError: Always. Unpickling is not supported for security reasons.
        """
        raise RuntimeError(
            "DataPoint.from_pickle is disabled for security reasons as pickle deserialization is unsafe. "
            "Use from_json or from_dict for deserialization."
        )

    def to_dict(self, **kwargs) -> Dict[str, Any]:
        """
        Convert the DataPoint instance to a dictionary representation.
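
The "extremely restrictive unpickler" alternative mentioned (and rejected) above would look roughly like the pattern from the Python pickle documentation; this is a standalone sketch, not code from this PR:

import builtins
import io
import pickle

SAFE_BUILTINS = {"dict", "list", "set", "tuple", "str", "int", "float", "bool"}


class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Only resolve a handful of harmless builtins; refusing everything else
        # blocks the usual __reduce__-based code-execution payloads.
        if module == "builtins" and name in SAFE_BUILTINS:
            return getattr(builtins, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def restricted_loads(data: bytes):
    return RestrictedUnpickler(io.BytesIO(data)).load()


# A plain dict of primitives still round-trips, while payloads that reference
# other globals (the RCE vector) are rejected with UnpicklingError.
assert restricted_loads(pickle.dumps({"id": "abc", "value": 1})) == {"id": "abc", "value": 1}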


Comment on lines 251 to 255
            - 'DataPoint': A new DataPoint instance constructed from the provided dictionary
              data.
        """
        return cls.model_validate(data)

Unsafe Pickle Deserialization Enabling Remote Code Execution

Explanation of Fix and Issues: identical to the comment on lines 193 to 218 above.

Suggested Fix

Suggested change (the suggested lines are identical to the originals; no textual change)

            - 'DataPoint': A new DataPoint instance constructed from the provided dictionary
              data.
        """
        return cls.model_validate(data)


Comment on lines 121 to 128
               representing the embedded vectors.
        """
        return await self.embedding_engine.embed_text(data)

    async def stringify_properties(self, properties: dict) -> str:
        """
        Convert properties dictionary to a string format suitable for database queries.

Cypher Query Injection in Graph Database Operations

Explanation of Fix

Vulnerability & Fix Explanation:

The vulnerability comes from interpolating untrusted data (property values derived from user input) directly into dynamically constructed Cypher queries, without any escaping or parameter binding. This enables Cypher injection (reported as CWE-89), letting an attacker run arbitrary Cypher statements and expose, delete, or modify all data.

The Fix:
To prevent Cypher injection, every string value embedded in a query must be escaped, in particular single quotes. The patch introduces a helper method _escape_cypher_string(s: str) that doubles every single quote (' becomes ''), as in Cypher/SQL string literals. The stringify_properties function and its parse_value helper now apply this escape to every property value rendered as a Cypher string literal, including UUIDs, stringified lists, dicts, and fallback values.

No new dependencies are introduced; the patch adds only a helper function. All code paths that build literal Cypher strings for property values now escape embedded single quotes. Numeric and other non-string values are unaffected.

Potential impacts:

  • Properties containing single quotes will appear with doubled quotes inside the generated Cypher literals (visible only when inspecting raw Cypher). This follows Cypher/SQL escaping conventions, so retrieval and general database behavior are preserved.
  • If downstream code relies on exact property serialization, the escaping is visible only in the Cypher string literals, not in the deserialized database values (Cypher unescapes them automatically).

No other breaking changes are made, and all usages of stringify_properties and the query construction pattern are covered by this patch.

Issues

Type: Application · Identifier: CWE-89 · Severity: critical
String values are injected directly into dynamically built Cypher queries without any escaping or parameter binding. An attacker who controls, for example, a node property can insert characters like ', ) or }; MATCH (n) DETACH DELETE n; //, breaking out of the literal and executing arbitrary graph-database commands (Cypher injection). This can lead to full data exfiltration, modification, or deletion.

Suggested Fix

Suggested change

Before:

               representing the embedded vectors.
        """
        return await self.embedding_engine.embed_text(data)

    async def stringify_properties(self, properties: dict) -> str:
        """
        Convert properties dictionary to a string format suitable for database queries.

After:

               representing the embedded vectors.
        """
        return await self.embedding_engine.embed_text(data)

    def _escape_cypher_string(self, s: str) -> str:
        """
        Escapes a string to be safely embedded into a Cypher query by doubling single quotes.
        """
        # Cypher escaping: single quote is escaped by doubling
        return s.replace("'", "''")

    async def stringify_properties(self, properties: dict) -> str:
        """
        Convert properties dictionary to a string format suitable for database queries.
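
As a quick, standalone illustration of what the proposed helper does (not the project's actual code), a quote-breaking property value stays inside the string literal; whether doubled quotes or backslash escaping is the correct convention depends on the target graph database:

def escape_cypher_string(s: str) -> str:
    # Same transformation as the proposed _escape_cypher_string helper.
    return s.replace("'", "''")


hostile = "x'}) MATCH (n) DETACH DELETE n //"
name_literal = f"'{escape_cypher_string(hostile)}'"
# The doubled quote keeps the payload inside the literal instead of closing it early.
print(f"MERGE (node:Entity {{name: {name_literal}}})")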


pensarapp bot commented May 30, 2025

Cypher Query Injection in Graph Database Operations
Suggested Fix

@@ -121,8 +121,15 @@
               representing the embedded vectors.
         """
         return await self.embedding_engine.embed_text(data)
 
+    def _escape_cypher_string(self, s: str) -> str:
+        """
+        Escapes a string to be safely embedded into a Cypher query by doubling single quotes.
+        """
+        # Cypher escaping: single quote is escaped by doubling
+        return s.replace("'", "''")
+
     async def stringify_properties(self, properties: dict) -> str:
         """
         Convert properties dictionary to a string format suitable for database queries.
 
@@ -152,22 +159,27 @@
 
                 Returns the string representation of the value in the appropriate format.
             """
             if type(value) is UUID:
-                return f"'{str(value)}'"
+                # Safely escape UUID string
+                return f"'{self._escape_cypher_string(str(value))}'"
             if type(value) is int or type(value) is float:
                 return value
             if (
                 type(value) is list
                 and type(value[0]) is float
                 and len(value) == self.embedding_engine.get_vector_size()
             ):
-                return f"'vecf32({value})'"
+                # Represent the vector as string, but escape within the string
+                vector_str = str(value)
+                return f"'vecf32({self._escape_cypher_string(vector_str)})'"
             # if type(value) is datetime:
             #     return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%f%z")
             if type(value) is dict:
-                return f"'{json.dumps(value)}'"
-            return f"'{value}'"
+                json_value = json.dumps(value)
+                return f"'{self._escape_cypher_string(json_value)}'"
+            # Fallback for all string/non-numeric values
+            return f"'{self._escape_cypher_string(str(value))}'"
 
         return ",".join([f"{key}:{parse_value(value)}" for key, value in properties.items()])
 
     async def create_data_point_query(self, data_point: DataPoint, vectorized_values: dict):
@@ -207,9 +219,9 @@
         )
 
         return dedent(
             f"""
-            MERGE (node:{node_label} {{id: '{str(data_point.id)}'}})
+            MERGE (node:{node_label} {{id: '{self._escape_cypher_string(str(data_point.id))}'}})
             ON CREATE SET node += ({{{node_properties}}}), node.updated_at = timestamp()
             ON MATCH SET node += ({{{node_properties}}}), node.updated_at = timestamp()
         """
         ).strip()
@@ -231,13 +243,17 @@
         """
         properties = await self.stringify_properties(edge[3])
         properties = f"{{{properties}}}"
 
+        source_id_escaped = self._escape_cypher_string(str(edge[0]))
+        target_id_escaped = self._escape_cypher_string(str(edge[1]))
+        edge_type_escaped = self._escape_cypher_string(str(edge[2]))
+
         return dedent(
             f"""
-            MERGE (source {{id:'{edge[0]}'}})
-            MERGE (target {{id: '{edge[1]}'}})
-            MERGE (source)-[edge:{edge[2]} {properties}]->(target)
+            MERGE (source {{id:'{source_id_escaped}'}})
+            MERGE (target {{id: '{target_id_escaped}'}})
+            MERGE (source)-[edge:{edge_type_escaped} {properties}]->(target)
             ON MATCH SET edge.updated_at = timestamp()
             ON CREATE SET edge.updated_at = timestamp()
         """
         ).strip()
@@ -608,13 +624,17 @@
             query_vector = (await self.embed_data([query_text]))[0]
 
         [label, attribute_name] = collection_name.split(".")
 
+        label_escaped = self._escape_cypher_string(label)
+        attribute_name_escaped = self._escape_cypher_string(attribute_name)
+
+        # Query vector rendered as source code; not susceptible to injection since vectors are program-generated floats.
         query = dedent(
             f"""
             CALL db.idx.vector.queryNodes(
-                '{label}',
-                '{attribute_name}',
+                '{label_escaped}',
+                '{attribute_name_escaped}',
                 {limit},
                 vecf32({query_vector})
             ) YIELD node, score
         """
@@ -777,5 +797,5 @@
     async def prune(self):
         """
         Prune the graph by deleting the entire graph structure.
         """
-        await self.delete_graph()
+        await self.delete_graph()
\ No newline at end of file
Explanation of Fix and Issues: identical to the inline comment on lines 121 to 128 above.

pensarapp bot commented May 30, 2025

Unrestricted Path Traversal in Database Cleanup Operation
Suggested Fix

@@ -1,4 +1,5 @@
+import os
 import asyncio
 import lancedb
 from pydantic import BaseModel
 from lancedb.pydantic import LanceModel, Vector
@@ -40,8 +41,13 @@
     url: str
     api_key: str
     connection: lancedb.AsyncConnection = None
 
+    # Define a base directory for allowed data storage and deletion.
+    # This should be adjusted to where LanceDB is expected to store its data sandboxes.
+    # You may wish to configure this elsewhere in your application.
+    SAFE_BASE_DIR = os.path.abspath(os.path.join(os.getcwd(), "data"))
+
     def __init__(
         self,
         url: Optional[str],
         api_key: Optional[str],
@@ -304,10 +310,20 @@
             collection = await self.get_collection(collection_name)
             await collection.delete("id IS NOT NULL")
             await connection.drop_table(collection_name)
 
-        if self.url.startswith("/"):
-            LocalStorage.remove_all(self.url)
+        # Secure handling of LocalStorage.remove_all(self.url)
+        if self.url and isinstance(self.url, str) and self.url.strip() != "":
+            candidate_path = os.path.abspath(self.url)
+            safe_base = self.SAFE_BASE_DIR
+            # Do not allow deletion if the candidate path is root or outside safe_base
+            if (
+                os.path.commonpath([candidate_path, safe_base]) == safe_base
+                and candidate_path != safe_base
+            ):
+                LocalStorage.remove_all(candidate_path)
+            else:
+                raise ValueError("Unsafe or invalid storage path for prune operation: {}".format(self.url))
 
     def get_data_point_schema(self, model_type: BaseModel):
         related_models_fields = []
 
@@ -342,5 +358,5 @@
             include_fields={
                 "id": (str, ...),
             },
             exclude_fields=["metadata"] + related_models_fields,
-        )
+        )
\ No newline at end of file
Explanation of Fix

Vulnerability Explanation:
The root cause of the vulnerability is that self.url, which may be set from untrusted user or configuration input, is passed directly to LocalStorage.remove_all(self.url) within the prune method. This function presumably deletes all files and directories at the given path recursively. Without any checks, a malicious user could point self.url at a dangerous path such as / or /etc, leading to catastrophic data loss or filesystem compromise (CWE-22: Path Traversal).

Fix Explanation:
The fix implements strict path validation before allowing deletion. It ensures that self.url must be a subdirectory of a defined safe root base directory (commonly an application-specific data directory such as /tmp, /var/lib/myapp, or similar). This root can either be hardcoded, configured via a safe mechanism, or derived in a way that only allows deletion in expected safe areas.

In the absence of any explicit safe root in the existing code, we assume that only directories under a specific safe directory (e.g., /tmp/, or a directory like ./data/) are valid for deletion, and explicitly reject dangerous values for self.url (such as /, /etc, /home, empty string, parent references, or any 'upper-level' navigation).
This is checked in the prune method using the os.path library to validate the path. If the validation fails, the deletion is not performed and an exception is raised.

Potential Impacts:

  • Users who previously (incorrectly) used absolute paths outside the intended application data root for self.url will now get an exception when attempting to prune, preventing deletion outside the sandbox.
  • This fix does not introduce any new dependencies and only uses the standard os library.
  • If the codebase relies on self.url sometimes pointing outside the sandbox area for legitimate reasons, those usages will now fail (as they should, from a security perspective).
  • This approach is production ready, concise, and minimizes the risk of path traversal attacks.

Note: The new code includes an import os statement at the top (within the main module) to implement this logic using os.path.abspath and os.path.commonpath.

Issues

Type: Application · Identifier: CWE-22 · Severity: critical
self.url (potentially derived from user or configuration input) is passed directly to LocalStorage.remove_all, which presumably performs recursive deletion. Supplying a value such as "/" or "/etc" could lead to catastrophic, unauthorized file deletion on the host (path traversal / arbitrary file-system manipulation). There is no whitelist or sandboxing applied.
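
A standalone sketch of the validation logic from the suggested fix (the directory names here are only examples):

import os


def is_safe_prune_path(candidate: str, safe_base: str) -> bool:
    # Resolve both paths and require the candidate to live strictly inside safe_base.
    candidate_abs = os.path.abspath(candidate)
    base_abs = os.path.abspath(safe_base)
    return (
        os.path.commonpath([candidate_abs, base_abs]) == base_abs
        and candidate_abs != base_abs
    )


assert is_safe_prune_path("data/lancedb/collections", "data")
assert not is_safe_prune_path("/", "data")              # refuses the filesystem root
assert not is_safe_prune_path("data/../etc", "data")    # refuses parent-directory escapes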

pensarapp bot commented May 30, 2025

Git Command Argument Injection via Unvalidated Repository URL
Suggested Fix

@@ -6,8 +6,10 @@
 from cognee.shared.logging_utils import get_logger
 import requests
 from cognee.modules.users.models import User
 from cognee.modules.users.methods import get_authenticated_user
+import re
+import os
 
 logger = get_logger()
 
 
@@ -30,18 +32,58 @@
             user: Authenticated user
         """
         from cognee.api.v1.delete import delete as cognee_delete
 
+        def is_valid_github_repo_url(url: str) -> Optional[re.Match]:
+            """
+            Ensures the URL is a valid HTTPS GitHub repository clone URL.
+            Only allows https, disallows leading '-' and enforces allowed repo/user names.
+            """
+            # Example allowed: https://github.com/user/repo.git
+            # Disallow anything not matching this pattern.
+            github_repo_pattern = re.compile(
+                r"^https://github\.com/(?P<owner>[A-Za-z0-9._-]+)/(?P<repo>[A-Za-z0-9._-]+)(\.git)?/?$"
+            )
+            if url.startswith("-"):  # Prevent git option injection
+                return None
+            if not github_repo_pattern.match(url):
+                return None
+            return github_repo_pattern.match(url)
+
+        def sanitize_repo_name(match: re.Match) -> str:
+            """
+            Ensures the generated directory is under the allowed .data/ directory.
+            """
+            owner = match.group("owner")
+            repo = match.group("repo")
+            # Only allow alphanum, _, -, and . (same as regex above)
+            safe_owner = re.sub(r"[^A-Za-z0-9._-]", "", owner)
+            safe_repo = re.sub(r"[^A-Za-z0-9._-]", "", repo)
+            return f"{safe_owner}_{safe_repo}"
+
         try:
             # Handle each file in the list
             results = []
             for file in data:
                 if file.filename.startswith("http"):
-                    if "github" in file.filename:
-                        # For GitHub repos, we need to get the content hash of each file
-                        repo_name = file.filename.split("/")[-1].replace(".git", "")
+                    if "github.com" in file.filename:
+                        # Validate the GitHub repo URL
+                        match = is_valid_github_repo_url(file.filename)
+                        if not match:
+                            return JSONResponse(
+                                status_code=400,
+                                content={"error": "Invalid or potentially dangerous GitHub repository URL"},
+                            )
+                        repo_dir_name = sanitize_repo_name(match)
+                        dest_dir = os.path.join(".data", repo_dir_name)
+                        # Final safety: Don't allow traversal outside .data
+                        if not os.path.abspath(dest_dir).startswith(os.path.abspath(".data")):
+                            return JSONResponse(
+                                status_code=400,
+                                content={"error": "Invalid repository directory name"},
+                            )
                         subprocess.run(
-                            ["git", "clone", file.filename, f".data/{repo_name}"], check=True
+                            ["git", "clone", file.filename, dest_dir], check=True
                         )
                         # Note: This would need to be implemented to get content hashes of all files
                         # For now, we'll just return an error
                         return JSONResponse(
@@ -73,5 +115,5 @@
         except Exception as error:
             logger.error(f"Error during deletion: {str(error)}")
             return JSONResponse(status_code=409, content={"error": str(error)})
 
-    return router
+    return router
\ No newline at end of file
Explanation of Fix

Vulnerability Explanation:
The vulnerability lies in passing user-controlled file.filename directly to subprocess.run(["git", "clone", ...]). Although not executed via a shell, Git interprets arguments that start with dashes (-) as options, which can be exploited to inject commands or alter git clone behavior (CWE-78: OS Command Injection, CWE-918: Server-Side Request Forgery). Additionally, using an unsanitized filename as a directory path could allow path traversal attacks.

Patch Summary:
The patch implements the following safeguards:

  • Validates that file.filename is a well-formed HTTPS GitHub repository URL using a safe regex match.
  • Ensures the repo URL does NOT start with a dash or contain spaces or special characters.
  • Extracts the destination directory name only from a valid, sanitized repository name, ensuring it contains only allowed characters.
  • Explicitly rejects any repository URL that could be treated as an option by git, or that attempts directory traversal.
  • Returns an error response if validation fails, without executing git clone.

Impact & Notes:

  • This fix blocks command/argument injection via user input and restricts directory creation to under the .data/ directory, with only safe directory names allowed.
  • The function is otherwise unchanged and behavior is the same except for unsafe inputs now being rejected.
  • Added import re and import os to the file, which are standard libraries and should not cause issues.
  • No impact on other parts of the codebase.
Issues

Type: Application · Identifier: CWE-78, CWE-918 · Severity: critical
file.filename is fully user-controlled yet passed directly as an argument to git clone. Although the call is not executed through a shell, Git interprets leading dashes or specially crafted URLs as options (e.g. --upload-pack= or -c core.sshCommand=) which can change command behavior or execute additional binaries. Combined with an attacker-chosen destination path segment, this constitutes an argument/command injection risk (OS Command Injection).
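
A standalone sketch of the URL validation helper from the suggested fix above (the example URLs are illustrative):

import re

GITHUB_REPO_PATTERN = re.compile(
    r"^https://github\.com/(?P<owner>[A-Za-z0-9._-]+)/(?P<repo>[A-Za-z0-9._-]+)(\.git)?/?$"
)


def is_valid_github_repo_url(url: str) -> bool:
    # Reject anything git could interpret as an option (leading '-') and
    # anything that is not a plain HTTPS github.com clone URL.
    if url.startswith("-"):
        return False
    return GITHUB_REPO_PATTERN.match(url) is not None


assert is_valid_github_repo_url("https://github.com/topoteretes/cognee.git")
assert not is_valid_github_repo_url("--upload-pack=/bin/sh")
assert not is_valid_github_repo_url("https://github.com/user/repo/../../etc")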

Comment on lines 3 to 10
"""

import logging
import uuid
from typing import Dict, List, Optional, Any
import openai
from fastapi import APIRouter, Depends
from cognee.api.v1.responses.models import (

Unvalidated LLM Function Call Execution

Explanation of Fix

Vulnerability Summary:
The code allowed arbitrary function names and arguments from LLM output (influenced by user prompt or user-supplied tool definitions) to be dispatched via dispatch_function. Without explicit allow-listing or validation, this could lead to arbitrary code execution, data tampering, or privilege escalation (CWE-94). This is a critical "Excessive Agency" flaw per OWASP-ML08.

Fix Overview:
The fix introduces an explicit allow-list of permitted function names that can be dispatched, based on the tools in use (DEFAULT_TOOLS by default, or user-supplied ones validated properly). Before any function call is dispatched, the code now checks that the function name is in the allow-list of trusted tools. If not, it skips the call and records an error in the response. Argument validation is also enforced: if the function's tool definition specifies a JSON schema for arguments, the input is checked against it (and rejected if invalid). This closes the avenue of arbitrary backend code execution initiated by a manipulated LLM response.

Original Vulnerable Code:
    for item in output:
        if isinstance(item, dict) and item.get("type") == "function_call":
            # This is a tool call from the new format
            function_name = item.get("name", "")
            arguments_str = item.get("arguments", "{}")
            call_id = item.get("call_id", f"call_{uuid.uuid4().hex}")

            # Create a format the dispatcher can handle
            tool_call = {
                "id": call_id,
                "function": {"name": function_name, "arguments": arguments_str},
                "type": "function",
            }

            # Dispatch the function
            try:
                function_result = await dispatch_function(tool_call)
                output_status = "success"
            except Exception as e:
                logger.exception(f"Error executing function {function_name}: {e}")
                function_result = f"Error executing {function_name}: {str(e)}"
                output_status = "error"

            processed_call = ResponseToolCall(
                id=call_id,
                type="function",
                function=FunctionCall(name=function_name, arguments=arguments_str),
                output=ToolCallOutput(status=output_status, data={"result": function_result}),
            )

            processed_tool_calls.append(processed_call)

Patched Code:

  • The code now assembles a set of allowed function names from the currently active tool definitions (whether default or user-supplied).
  • Before dispatching, it checks if the function name from the LLM output is in this allow-list.
  • If not, it returns an error output for that tool call instead of dispatching.
  • Optionally, if tool definitions supply JSON schema argument specs, argument validation is added (for simple JSON schema structure, if present).
  • These minimal changes plug the critical "Excessive Agency" vector without breaking existing API contracts or introducing any new dependencies.

Potential Impacts:

  • Some calls may now be rejected if a model attempts to call a tool not in the current allow-list.
  • If callers provide tools, only those listed will be callable.
  • No breaking changes for existing safe clients; unsafe function calls are now correctly blocked.
Issues

Type: Application · Identifier: CWE-94, ML08 · Severity: critical
The function name and arguments executed by dispatch_function come directly from the model's output, which can be influenced by user-supplied prompts and optional request.tools. Without an explicit allow-list or validation, a malicious user can coerce the LLM to emit a call to any backend function that dispatch_function can reach, potentially leading to arbitrary code execution, data tampering, or privilege escalation. This maps to CWE-94 (Improper Control of Dynamic Code Execution) and the OWASP ML/LLM Top 10 'Excessive Agency' category.

Suggested Fix

Suggested change

Before:

"""
import logging
import uuid
from typing import Dict, List, Optional, Any
import openai
from fastapi import APIRouter, Depends
from cognee.api.v1.responses.models import (

After:

"""
import logging
import uuid
import json
from typing import Dict, List, Optional, Any
import openai
from fastapi import APIRouter, Depends
from cognee.api.v1.responses.models import (


Comment on lines 67 to 74
        )
        logger.info(f"Response: {response}")
        return response.model_dump()

    @router.post("/", response_model=ResponseBody)
    async def create_response(
        request: ResponseRequest,
        user: User = Depends(get_authenticated_user),

Unvalidated LLM Function Call Execution

Explanation of Fix and Issues: identical to the comment on lines 3 to 10 above.

Suggested Fix

Suggested change

Before:

        )
        logger.info(f"Response: {response}")
        return response.model_dump()

    @router.post("/", response_model=ResponseBody)
    async def create_response(
        request: ResponseRequest,
        user: User = Depends(get_authenticated_user),

After:

        )
        logger.info(f"Response: {response}")
        return response.model_dump()

    def get_allowed_tools_and_schemas(tools: List[Dict[str, Any]]) -> (set, dict):
        """
        Returns a set of allowed tool function names and optional argument schemas by function name.
        """
        allowed_names = set()
        arg_schemas = {}
        for tool in tools:
            # OpenAI/OpenAPI-style tool definition, e.g. {"type": "function", "function": {"name": "...", ...}}
            if isinstance(tool, dict):
                if tool.get("type") == "function" and "function" in tool:
                    fn = tool["function"]
                    name = fn.get("name")
                    if name:
                        allowed_names.add(name)
                        if "parameters" in fn and isinstance(fn["parameters"], dict):
                            arg_schemas[name] = fn["parameters"]
        return allowed_names, arg_schemas

    def validate_arguments(arguments_str: str, arg_schema: dict) -> bool:
        """
        Very simple JSON schema validator for flat objects, to avoid dependencies.
        Only checks required fields exist and types match for basic types (no recursion).
        If arg_schema is absent or empty, always passes.
        """
        if not arg_schema:
            return True
        try:
            args_obj = json.loads(arguments_str)
        except Exception:
            return False
        # Only validate 'properties' and 'required'
        props = arg_schema.get('properties', {})
        required = arg_schema.get('required', [])
        for req in required:
            if req not in args_obj:
                return False
        type_map = {'string': str, 'number': (int, float), 'integer': int, 'boolean': bool, 'object': dict, 'array': list}
        for prop, specs in props.items():
            if prop in args_obj and 'type' in specs:
                expect = type_map.get(specs['type'])
                if expect and not isinstance(args_obj[prop], expect):
                    return False
        return True

    @router.post("/", response_model=ResponseBody)
    async def create_response(
        request: ResponseRequest,
        user: User = Depends(get_authenticated_user),
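
For clarity, here is a self-contained restatement of the two helpers from the suggested fix, followed by a usage check; the tool definition shown is illustrative, not cognee's actual DEFAULT_TOOLS:

import json
from typing import Any, Dict, List, Tuple


def get_allowed_tools_and_schemas(tools: List[Dict[str, Any]]) -> Tuple[set, dict]:
    # Collect the function names (and optional parameter schemas) of the tools
    # the caller actually registered; only these names may be dispatched.
    allowed_names, arg_schemas = set(), {}
    for tool in tools:
        if isinstance(tool, dict) and tool.get("type") == "function" and "function" in tool:
            fn = tool["function"]
            name = fn.get("name")
            if name:
                allowed_names.add(name)
                if isinstance(fn.get("parameters"), dict):
                    arg_schemas[name] = fn["parameters"]
    return allowed_names, arg_schemas


def validate_arguments(arguments_str: str, arg_schema: dict) -> bool:
    # Minimal flat-object check: required keys present, basic types match.
    if not arg_schema:
        return True
    try:
        args_obj = json.loads(arguments_str)
    except Exception:
        return False
    for req in arg_schema.get("required", []):
        if req not in args_obj:
            return False
    type_map = {"string": str, "number": (int, float), "integer": int,
                "boolean": bool, "object": dict, "array": list}
    for prop, specs in arg_schema.get("properties", {}).items():
        expected = type_map.get(specs.get("type"))
        if prop in args_obj and expected and not isinstance(args_obj[prop], expected):
            return False
    return True


# Illustrative tool definition; the real DEFAULT_TOOLS in cognee may differ.
tools = [{
    "type": "function",
    "function": {
        "name": "search",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

allowed, schemas = get_allowed_tools_and_schemas(tools)
assert "search" in allowed and "run_shell" not in allowed
assert validate_arguments('{"query": "climate"}', schemas["search"])
assert not validate_arguments('{"q": 1}', schemas["search"])        # missing required key
assert not validate_arguments('{"query": 42}', schemas["search"])   # wrong value type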


Comment on lines 78 to 85
"""
# Use default tools if none provided
tools = request.tools or DEFAULT_TOOLS

# Call the API
response = await call_openai_api_for_model(
input_text=request.input,
model=request.model,

Unvalidated LLM Function Call Execution

Explanation of Fix and Issues: identical to the comment on lines 3 to 10 above.

Suggested Fix

Suggested change

Before:

        """
        # Use default tools if none provided
        tools = request.tools or DEFAULT_TOOLS

        # Call the API
        response = await call_openai_api_for_model(
            input_text=request.input,
            model=request.model,

After:

        """
        # Use default tools if none provided
        tools = request.tools or DEFAULT_TOOLS

        # Build allow-list of permitted function names and arg schemas
        allowed_function_names, arg_schemas = get_allowed_tools_and_schemas(tools)

        # Call the API
        response = await call_openai_api_for_model(
            input_text=request.input,
            model=request.model,


Comment on lines 98 to 110

        # Process any function tool calls from the output
        for item in output:
            if isinstance(item, dict) and item.get("type") == "function_call":
                # This is a tool call from the new format
                function_name = item.get("name", "")
                arguments_str = item.get("arguments", "{}")
                call_id = item.get("call_id", f"call_{uuid.uuid4().hex}")

                # Create a format the dispatcher can handle
                tool_call = {
                    "id": call_id,
                    "function": {"name": function_name, "arguments": arguments_str},

Unvalidated LLM Function Call Execution

Explanation of Fix

Vulnerability Summary:
The code allowed arbitrary function names and arguments from LLM output (influenced by user prompt or user-supplied tool definitions) to be dispatched via dispatch_function. Without explicit allow-listing or validation, this could lead to arbitrary code execution, data tampering, or privilege escalation (CWE-94). This is a critical "Excessive Agency" flaw per OWASP-ML08.

Fix Overview:
The fix introduces an explicit allow-list of permitted function names that can be dispatched, based on the tools in use (DEFAULT_TOOLS by default, or user-supplied ones validated properly). Before any function call is dispatched, the code now checks that the function name is in the allow-list of trusted tools. If not, it skips the call and records an error in the response. Argument validation is also enforced: if the function's tool definition specifies a JSON schema for arguments, the input is checked against it (and rejected if invalid). This closes the avenue of arbitrary backend code execution initiated by a manipulated LLM response.

Original Vulnerable Code:
for item in output:
if isinstance(item, dict) and item.get("type") == "function_call":
# This is a tool call from the new format
function_name = item.get("name", "")
arguments_str = item.get("arguments", "{}")
call_id = item.get("call_id", f"call_{uuid.uuid4().hex}")

            # Create a format the dispatcher can handle
            tool_call = {
                "id": call_id,
                "function": {"name": function_name, "arguments": arguments_str},
                "type": "function",
            }

            # Dispatch the function
            try:
                function_result = await dispatch_function(tool_call)
                output_status = "success"
            except Exception as e:
                logger.exception(f"Error executing function {function_name}: {e}")
                function_result = f"Error executing {function_name}: {str(e)}"
                output_status = "error"

            processed_call = ResponseToolCall(
                id=call_id,
                type="function",
                function=FunctionCall(name=function_name, arguments=arguments_str),
                output=ToolCallOutput(status=output_status, data={"result": function_result}),
            )

            processed_tool_calls.append(processed_call)

Patched Code:

  • The code now assembles a set of allowed function names from the currently active tool definitions (whether default or user-supplied).
  • Before dispatching, it checks if the function name from the LLM output is in this allow-list.
  • If not, it returns an error output for that tool call instead of dispatching.
  • Optionally, if tool definitions supply JSON schema argument specs, argument validation is added (for simple JSON schema structure, if present).
  • These minimal changes plug the critical "Excessive Agency" vector without breaking existing API contracts or introducing any new dependencies.

Potential Impacts:

  • Some calls may now be rejected if a model attempts to call a tool not in the current allow-list.
  • If callers provide tools, only those listed will be callable.
  • No breaking changes for existing safe clients; unsafe function calls are now correctly blocked.
Issues
Type: Application | Identifier: CWE-94, ML08 | Severity: critical
The function name and arguments executed by dispatch_function come directly from the model’s output, which can be influenced by user-supplied prompts and optional request.tools. Without an explicit allow-list or validation, a malicious user can coerce the LLM to emit a call to any backend function that dispatch_function can reach, potentially leading to arbitrary code execution, data tampering, or privilege escalation. This maps to CWE-94 (Improper Control of Dynamic Code Execution) and OWASP-ML / LLM Top-10 ‘Excessive Agency’.

Suggested Fix

Suggested change
     # Process any function tool calls from the output
     for item in output:
         if isinstance(item, dict) and item.get("type") == "function_call":
-            # This is a tool call from the new format
             function_name = item.get("name", "")
             arguments_str = item.get("arguments", "{}")
             call_id = item.get("call_id", f"call_{uuid.uuid4().hex}")
 
+            # Validate function name against allow-list
+            if function_name not in allowed_function_names:
+                logger.warning(f"Blocked function call attempt to unallowed function '{function_name}'.")
+                processed_call = ResponseToolCall(
+                    id=call_id,
+                    type="function",
+                    function=FunctionCall(name=function_name, arguments=arguments_str),
+                    output=ToolCallOutput(
+                        status="error",
+                        data={"result": f"Function '{function_name}' is not allowed."},
+                    ),
+                )
+                processed_tool_calls.append(processed_call)
+                continue
+            # Validate argument structure if schema present
+            arg_schema = arg_schemas.get(function_name)
+            if arg_schema and not validate_arguments(arguments_str, arg_schema):
+                logger.warning(f"Function call '{function_name}' arguments failed validation.")
+                processed_call = ResponseToolCall(
+                    id=call_id,
+                    type="function",
+                    function=FunctionCall(name=function_name, arguments=arguments_str),
+                    output=ToolCallOutput(
+                        status="error",
+                        data={"result": f"Invalid arguments for function '{function_name}'."},
+                    ),
+                )
+                processed_tool_calls.append(processed_call)
+                continue
             # Create a format the dispatcher can handle
             tool_call = {
                 "id": call_id,
                 "function": {"name": function_name, "arguments": arguments_str},


Comment on lines 145 to 149
)

return response_obj

return router

Unvalidated LLM Function Call Execution


@pensarapp
pensarapp bot commented May 30, 2025

Cypher Query Injection in Vector Search Method
Suggested Fix

@@ -4,8 +4,9 @@
 import json
 from textwrap import dedent
 from uuid import UUID
 from webbrowser import Error
+import re
 
 from falkordb import FalkorDB
 
 from cognee.exceptions import InvalidValueError
@@ -606,19 +607,44 @@
 
         if query_text and not query_vector:
             query_vector = (await self.embed_data([query_text]))[0]
 
-        [label, attribute_name] = collection_name.split(".")
+        # Input validation and sanitization
+        if not isinstance(collection_name, str) or '.' not in collection_name:
+            raise InvalidValueError(message="collection_name must be in the format 'Label.AttributeName'.")
 
+        parts = collection_name.split('.')
+        if len(parts) != 2:
+            raise InvalidValueError(message="collection_name must be in the format 'Label.AttributeName'.")
+
+        label, attribute_name = parts
+        allowed_pattern = re.compile(r'^[A-Za-z_][A-Za-z0-9_]*$')
+        if not (allowed_pattern.fullmatch(label) and allowed_pattern.fullmatch(attribute_name)):
+            raise InvalidValueError(
+                message="Label and attribute names must start with a letter or underscore and contain only letters, digits, or underscores."
+            )
+
+        # Validate limit
+        if not isinstance(limit, int):
+            try:
+                limit = int(limit)
+            except Exception:
+                raise InvalidValueError(message="limit must be an integer.")
+
+        if not (1 <= limit <= 1000):
+            raise InvalidValueError(message="limit value must be between 1 and 1000.")
+
+        # Defensive: query_vector should be a list of floats (let downstream/embedding engine guarantee type)
+        # Compose Cypher query with validated/sanitized values
         query = dedent(
             f"""
             CALL db.idx.vector.queryNodes(
                 '{label}',
                 '{attribute_name}',
                 {limit},
                 vecf32({query_vector})
             ) YIELD node, score
-        """
+            """
         ).strip()
 
         result = self.query(query)
 
@@ -777,5 +803,5 @@
     async def prune(self):
         """
         Prune the graph by deleting the entire graph structure.
         """
-        await self.delete_graph()
+        await self.delete_graph()
\ No newline at end of file
Explanation of Fix

The vulnerability in the search method arises from direct interpolation of user-provided inputs (collection_name, which splits into label and attribute_name, and limit) into a Cypher query string. This allows an attacker to craft malicious inputs that result in Cypher injection (CWE-89), which could manipulate, delete, or exfiltrate data.

Fix Summary:

  • Strictly validate and sanitize collection_name, label, attribute_name, and limit before using them in the Cypher query.
  • Enforce a pattern for allowed characters in labels and attributes (^[A-Za-z_][A-Za-z0-9_]*$): only letters, digits, and underscores, and the first character must not be a digit.
  • Ensure collection_name contains exactly one dot and both elements meet naming rules.
  • Enforce limit to be an integer within a safe range (1-1000).
  • Raise InvalidValueError for any invalid input.
  • These checks are localized to the search method, so no other parts of the codebase are affected.

No new dependencies were introduced, and existing behavior is unchanged for valid use. Only malformed or dangerous input will raise exceptions. This ensures backward compatibility and mitigates the possibility of query injection.
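As a quick illustration, the added validation amounts to the following standalone check. This is a sketch that mirrors the patched logic: the function name is illustrative and it raises a plain ValueError rather than Cognee's InvalidValueError.

import re

ALLOWED_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def split_collection_name(collection_name: str):
    """Reject anything that is not a plain 'Label.attribute_name' pair of identifiers."""
    parts = collection_name.split(".")
    if len(parts) != 2:
        raise ValueError("collection_name must look like 'Label.attribute_name'")
    label, attribute_name = parts
    if not (ALLOWED_IDENTIFIER.fullmatch(label) and ALLOWED_IDENTIFIER.fullmatch(attribute_name)):
        raise ValueError("identifiers may only contain letters, digits, and underscores")
    return label, attribute_name


if __name__ == "__main__":
    print(split_collection_name("Document.text_embedding"))  # ('Document', 'text_embedding')
    try:
        # The injection payload from the alert never reaches the Cypher string.
        split_collection_name("User') DETACH DELETE (n); //.name")
    except ValueError as error:
        print(f"rejected: {error}")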

Issues
Type: Application | Identifier: CWE-89 | Severity: critical
The search method accepts collection_name, limit, and possibly query_vector directly from higher-level user input. These values are interpolated into a Cypher query string with no sanitisation or parameter binding. A malicious caller could craft collection_name such as User') DETACH DELETE (n); // or set limit to an injected payload, resulting in Cypher/graph-query injection that can read, modify, or delete arbitrary data. Because the function later executes self.query(query) without parameters, the database is fully exposed to injection attacks.

Comment on lines 399 to 416
async def execute_query(self, query):
"""
Execute a raw SQL query against the database asynchronously.
Parameters:
-----------
- query: The SQL query string to execute.
Returns:
--------
The result set as a list of dictionaries, with each dictionary representing a row.
"""
async with self.engine.begin() as connection:
result = await connection.execute(text(query))
return [dict(row) for row in result]


Unrestricted SQL Query Execution Through Raw Query Interface
Suggested Fix

@@ -475,8 +475,35 @@
             raise e
 
         logger.info("Database deleted successfully.")
 
+    def _validate_table_identifier(self, name: str) -> bool:
+        """
+        Validate a table or schema identifier for use in SQLite PRAGMA statements.
+        Accepts standard identifiers (alphanumeric + underscore, not starting with a digit),
+        or SQLite's quoted identifiers if they are well-formed (e.g., "table-name").
+        """
+        if not name or len(name) > 128:
+            return False
+        # Accept unquoted valid Python identifiers (letters, digits, underscore, not starting with digit)
+        if name.isidentifier():
+            return True
+        # Accept quoted identifiers: must start and end with double quotes, and inner chars allowed by SQLite
+        # https://www.sqlite.org/lang_keywords.html (double-quoted or square-bracketed names)
+        if (
+            len(name) >= 2
+            and (
+                (name[0] == '"' and name[-1] == '"')
+                or (name[0] == '`' and name[-1] == '`')
+                or (name[0] == '[' and name[-1] == ']')
+            )
+        ):
+            # Very basic quoted identifier check
+            if name[0] == '"' and '"' not in name[1:-1]:
+                return True
+            if name[0] == '`' and '`' not in name[1:-1]:
+                return True
+            if name[0] == '[' and name[-1] == ']' and ']' not in name[1:-1]:
+                return True
+        return False
+
     async def extract_schema(self):
         """
         Extract the schema information of the database, including tables and their respective
         columns and constraints.
@@ -492,8 +519,15 @@
             schema = {}
 
             if self.engine.dialect.name == "sqlite":
                 for table_name in tables:
+                    # Validate table name to prevent SQL injection
+                    if not self._validate_table_identifier(table_name):
+                        logger.warning(
+                            f"Skipping potentially unsafe table name during schema extraction: {table_name!r}"
+                        )
+                        continue
+
                     schema[table_name] = {"columns": [], "primary_key": None, "foreign_keys": []}
 
                     # Get column details
                     columns_result = await connection.execute(
@@ -589,5 +623,5 @@
                                     logger.warning(
                                         f"Missing value in foreign key information. \nColumn value: {col}\nReference column value: {ref_col}\n"
                                     )
 
-            return schema
+            return schema
\ No newline at end of file
Explanation of Fix

Vulnerability: The extract_schema method uses the user-influenced table names directly in raw SQL PRAGMA statements, interpolating them into text(f"PRAGMA table_info('{table_name}');") and text(f"PRAGMA foreign_key_list('{table_name}');") without any sanitization or validation. An attacker can create a malicious table name (e.g., with quotes, semicolons, or SQL) and, when extract_schema() is called, trigger SQL Injection leading to full database compromise.

Fix: The patch validates each table_name before use in any PRAGMA statement (for SQLite) in extract_schema(). A strict validation function _validate_table_identifier is added to ensure that all such names:

  • Are not empty, do not exceed reasonable length (128 chars), and
  • Consist only of alphanumerics or underscore, or, if quoted, conform to allowed quoted identifier syntax.
    If any name fails validation, that table is skipped with a warning and no PRAGMA is executed for it. This prevents injection through table or schema names.

Potential impacts:

  • Table names in an existing database that are malformed or unconventional (whether maliciously or accidentally so) are logged and skipped, and therefore not included in the schema extraction; this is the secure behavior.
  • The change does not add any new dependencies and is contained to this file and path.
  • There are no breaking changes to the interface or upstream logic.

Note: No imports were added or changed; the helper is internal and all logic is handled within the SQLAlchemyAdapter class.
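To make the intended behaviour concrete, here is a pytest-style sketch of what the validator should accept and reject. Only the method name comes from the patch above; the adapter fixture is assumed to be an already constructed SQLAlchemyAdapter.

def test_validate_table_identifier(adapter):
    # Plain identifiers pass.
    assert adapter._validate_table_identifier("users")
    assert adapter._validate_table_identifier("_audit_log_2024")
    # Well-formed quoted identifiers pass.
    assert adapter._validate_table_identifier('"user-data"')
    # Injection attempts, malformed quoting, and oversized names are rejected.
    assert not adapter._validate_table_identifier("users; DROP TABLE users")
    assert not adapter._validate_table_identifier('"bad"name"')
    assert not adapter._validate_table_identifier("x" * 200)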

Issues
Type: Application | Identifier: CWE-89 | Severity: critical
The adapter exposes a helper that executes any string supplied to execute_query without validation or parameter binding. If the method is reachable from an API endpoint, a user can submit arbitrary SQL (e.g., '; DROP DATABASE cognee;--). This provides full read/write/DDL capabilities, constituting a critical SQL Injection / unrestricted query execution flaw (CWE-89).
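The suggested patch hardens extract_schema, but execute_query itself still runs whatever string it receives. If a raw-query helper must stay exposed, one common mitigation is to require bound parameters so caller-supplied values never become part of the SQL text. A minimal sketch under those assumptions (the function name is illustrative and this is not part of the merged change):

from typing import Optional

from sqlalchemy import text
from sqlalchemy.ext.asyncio import AsyncEngine


async def execute_parameterized_query(engine: AsyncEngine, query: str, parameters: Optional[dict] = None):
    """Execute a SQL statement with bound parameters instead of string interpolation."""
    async with engine.begin() as connection:
        result = await connection.execute(text(query), parameters or {})
        return [dict(row._mapping) for row in result]


# Usage: the value is bound to :user_id and never spliced into the statement itself.
# rows = await execute_parameterized_query(
#     engine, "SELECT * FROM users WHERE id = :user_id", {"user_id": 42}
# )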

@Vasilije1990 Vasilije1990 merged commit a036bd0 into main May 31, 2025
28 of 30 checks passed