Conversation

@Vasilije1990
Contributor

Description

DCO Affirmation

I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin.

@Vasilije1990 Vasilije1990 requested a review from dexters1 May 16, 2025 12:25
@pull-checklist

Please make sure all the checkboxes are checked:

  • I have tested these changes locally.
  • I have reviewed the code changes.
  • I have added end-to-end and unit tests (if applicable).
  • I have updated the documentation and README.md file (if necessary).
  • I have removed unnecessary code and debug statements.
  • PR title is clear and follows the convention.
  • I have tagged reviewers or team members for feedback.

@coderabbitai
Contributor

coderabbitai bot commented May 16, 2025

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

This update introduces a comprehensive set of changes across the Cognee codebase. Major highlights include the addition of an OpenAI-compatible responses API with function-calling support, a new Memgraph graph database adapter, and refactored MCP server tooling. The pipeline and dataset handling logic is enhanced with context passing, unique dataset ID generation, and improved pipeline status tracking. Numerous vector database adapters are updated for better error handling and type safety. The frontend receives updates to dataset and cognify workflows, and new example scripts and tests are added for various database backends. Several obsolete files, visualizations, and tests are removed.

Changes

| File(s) / Path(s) | Change Summary |
| --- | --- |
| cognee/api/v1/responses/ (all new modules: models.py, dispatch_function.py, default_tools.py, routers/) | Introduces an OpenAI-compatible responses API endpoint, models for function-calling, dispatch logic, and default tool definitions. |
| cognee/api/client.py, cognee/api/v1/responses/__init__.py | Integrates the new responses router into the FastAPI app and exposes the router in the package API. |
| cognee-mcp/src/server.py | Refactors the MCP server to use explicit async tools, background task execution, and status tools. |
| cognee/infrastructure/databases/graph/memgraph/memgraph_adapter.py (new) | Adds a full-featured async MemgraphAdapter implementing graph CRUD, traversal, and metrics. |
| cognee/infrastructure/databases/graph/get_graph_engine.py | Adds support for Memgraph in the graph engine factory. |
| cognee/infrastructure/databases/graph/networkx/adapter.py | Updates type hints for node IDs to use UUIDs instead of strings. |
| cognee/infrastructure/databases/vector/ (all adapters) | Refactors adapters for ChromaDB, LanceDB, Milvus, PGVector, QDrant, and Weaviate: improves error handling, unifies collection access, removes deprecated methods, and adjusts search defaults. |
| cognee/infrastructure/databases/vector/exceptions/exceptions.py | Updates the CollectionNotFoundError constructor for better logging and a default name. |
| cognee/modules/pipelines/operations/ (pipeline.py, get_pipeline_status.py, log_pipeline_run_initiated.py, run_tasks_base.py, run_tasks.py, __init__.py) | Refactors pipeline and task execution logic: adds context passing, pipeline name filtering, logging of initiation, and unique dataset ID handling. |
| cognee/modules/data/methods/ (create_dataset.py, get_unique_dataset_id.py, __init__.py) | Refactors dataset creation to use user objects and unique ID generation; exposes the new utility. |
| cognee/modules/engine/models/ColumnValue.py, cognee/modules/engine/models/__init__.py | Adds a new ColumnValue node type for column-level data; exposes it in the models package. |
| cognee/tasks/ingestion/migrate_relational_database.py | Adds an option to migrate and represent column-level data as nodes. |
| cognee/modules/visualization/cognee_network_visualization.py | Adds color mapping for the new "ColumnValue" node type. |
| cognee/modules/observability/observers.py, get_observe.py, base_config.py, infrastructure/llm/*/adapter.py | Refactors observability/monitoring tool configuration and decorator usage. |
| cognee/modules/retrieval/exceptions/exceptions.py, __init__.py | Removes CollectionDistancesNotFoundError and related imports. |
| cognee/modules/retrieval/utils/brute_force_triplet_search.py, graph_completion_retriever.py | Improves handling of missing entities/collections in brute-force search and context retrieval. |
| cognee/modules/pipelines/models/PipelineRun.py | Adds a new status enum member for dataset processing initiation. |
| cognee/modules/settings/get_settings.py | Makes the LLMConfig endpoint and api_version fields optional. |
| cognee/modules/graph/cognee_graph/CogneeGraph.py | Updates vector distance mapping to use the new search method signature. |
| cognee/tasks/ingestion/ingest_data.py | Passes the full user object to dataset creation. |
| cognee/api/v1/cognify/code_graph_pipeline.py, cognify.py, datasets/datasets.py, config/config.py | Refactors imports, adds pipeline name parameterization, and aligns exception-raising style. |
| cognee/exceptions/exceptions.py | Adds logging control and log level to CogneeApiError. |
| cognee/shared/data_models.py | Removes the obsolete MonitoringTool enum. |
| cognee/shared/logging_utils.py | Adjusts the SQLAlchemy warning filtering threshold. |
| cognee/infrastructure/databases/relational/sqlalchemy/SqlAlchemyAdapter.py | Restricts table dropping to fixed schemas. |
| cognee/tasks/temporal_awareness/index_graphiti_objects.py | Simplifies graph data retrieval and iteration logic. |
| cognee-frontend/src/app/, src/modules/, src/ui/ | Updates dataset and cognify callback logic, dataset prop types, and search input handling. |
| cognee-frontend/src/utils/fetch.ts | Changes the API base URL to use localhost. |
| CONTRIBUTING.md, README.md | Updates PR instructions, removes graph visualization, and introduces a cognee UI section. |
| Dockerfile, entrypoint.sh | Refactors the Docker build to multi-stage, switches to uv, and updates the entrypoint script for direct Alembic migration and server startup. |
| examples/database_examples/, examples/data/ | Adds new example scripts for ChromaDB, FalkorDB, Kuzu, Milvus, Neo4j, PGVector, QDrant, and Weaviate, plus new sample data. |
| notebooks/ | Adds an OpenAI-compatible demo notebook, updates and cleans up other notebooks, removes the HR demo, and deletes graph visualization HTML assets. |
| cognee/tests/ | Adds a Memgraph integration test, updates relational DB migration test expectations, modifies or removes obsolete and redundant tests, and updates async test execution style. |

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant FastAPI
    participant ResponsesRouter
    participant OpenAI Model
    participant DispatchFunction
    participant ToolHandler

    Client->>FastAPI: POST /api/v1/responses (input, tools, etc.)
    FastAPI->>ResponsesRouter: Handle request
    ResponsesRouter->>OpenAI Model: Call with input, tools
    OpenAI Model-->>ResponsesRouter: Response (may include tool calls)
    loop For each tool call in response
        ResponsesRouter->>DispatchFunction: dispatch_function(tool_call)
        DispatchFunction->>ToolHandler: handle_search/handle_cognify/handle_prune
        ToolHandler-->>DispatchFunction: Tool call result
        DispatchFunction-->>ResponsesRouter: Tool call output
    end
    ResponsesRouter-->>FastAPI: Structured response (id, model, tool_calls, usage)
    FastAPI-->>Client: API response
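For illustration, here is a hedged client-side sketch of the flow in the diagram above. The endpoint path, payload keys, and response fields are assumptions taken from the diagram (POST /api/v1/responses with input and tools, returning id, model, tool_calls, and usage), not a confirmed contract of the new API; the authoritative schema lives in cognee/api/v1/responses/models.py.

# Hypothetical client call; base URL, payload keys, and response shape are assumptions.
import requests

payload = {
    "input": "What entities are related to Memgraph?",
    "tools": [{"type": "function", "name": "search"}],  # illustrative tool definition
}

response = requests.post("http://localhost:8000/api/v1/responses", json=payload, timeout=60)
response.raise_for_status()
body = response.json()

# Per the diagram, tool calls dispatched on the server come back in the structured response.
for tool_call in body.get("tool_calls", []):
    print(tool_call)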
sequenceDiagram
    participant User
    participant Frontend
    participant API
    participant DatasetManager
    participant Pipeline

    User->>Frontend: Add data, select dataset, click "Cognify"
    Frontend->>API: Request to cognify dataset
    API->>DatasetManager: Find or create dataset (by name/user)
    DatasetManager->>Pipeline: Start cognify pipeline (with unique dataset ID)
    Pipeline-->>API: Pipeline status updated/logged
    API-->>Frontend: Cognify started/status
    Frontend-->>User: Show notification, update UI
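The diagram above refers to a unique dataset ID generated when the cognify pipeline starts (see cognee/modules/data/methods/get_unique_dataset_id.py in the changes table). The helper below is only a plausible sketch of such a function, using a deterministic uuid5 over owner and dataset name; the real implementation may derive the ID differently.

# Hypothetical sketch of deterministic, per-user dataset IDs; not the actual helper.
from uuid import NAMESPACE_OID, UUID, uuid5


def get_unique_dataset_id(dataset_name: str, user_id: UUID) -> UUID:
    # Hashing owner + name keeps the ID stable across runs while avoiding
    # collisions between users who choose the same dataset name.
    return uuid5(NAMESPACE_OID, f"{user_id}:{dataset_name}")


print(get_unique_dataset_id("my_dataset", UUID(int=1)))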

Possibly related issues

Possibly related PRs

Suggested labels

run-checks

Poem

A rabbit hopped through code so wide,
Adding new tools with pride inside.
Pipelines now with context run,
Memgraph joins the graphy fun!
OpenAI calls, responses clear,
Cognee's future hops right here.
🐇✨


🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai resolve to resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@gitguardian

gitguardian bot commented May 16, 2025

✅ There are no secrets present in this pull request anymore.

If these secrets were true positives and are still valid, we highly recommend that you revoke them. Although these secrets were previously flagged, we no longer have a reference to the specific commits where they were detected. Once a secret has been leaked into a git repository, you should consider it compromised, even if it was deleted immediately. Find more information about the risks here.


🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.

@Vasilije1990 Vasilije1990 changed the base branch from main to dev May 16, 2025 12:26
Contributor

@coderabbitai coderabbitai bot left a comment


Caution

Inline review comments failed to post. This is likely due to GitHub's limits when posting large numbers of comments. If you are seeing this consistently, it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.

🛑 Comments failed to post (49)
cognee/modules/retrieval/exceptions/__init__.py (1)

7-7: 💡 Verification agent

🧩 Analysis chain

Removed CollectionDistancesNotFoundError from imports.

The removal of this exception from imports suggests it's no longer needed in the codebase. Ensure all code that previously used this exception has been properly updated to handle these error cases differently.

The static analysis flags that imported exceptions are unused within this file. Consider adding them to __all__ for proper re-export:

-from .exceptions import SearchTypeNotSupported, CypherSearchError
+from .exceptions import SearchTypeNotSupported, CypherSearchError
+
+__all__ = ["SearchTypeNotSupported", "CypherSearchError"]

🏁 Script executed:

#!/bin/bash
# Check if CollectionDistancesNotFoundError is still referenced anywhere in the codebase
echo "Checking for any remaining references to CollectionDistancesNotFoundError..."
rg -l "CollectionDistancesNotFoundError" --type py

Length of output: 213


To ensure the exception was actually removed and isn’t still defined elsewhere (which could break exports), let’s locate its definition:


🏁 Script executed:

#!/bin/bash
# Locate the definition of CollectionDistancesNotFoundError
echo "Searching for the definition of CollectionDistancesNotFoundError..."
rg -n "class CollectionDistancesNotFoundError" --type py

Length of output: 203


Confirm safe removal of CollectionDistancesNotFoundError and add __all__ for exports

  • No references or definitions of CollectionDistancesNotFoundError were found in the codebase—its removal is safe.
  • To make your module’s public API explicit, add an __all__ declaration.

File: cognee/modules/retrieval/exceptions/__init__.py

 from .exceptions import SearchTypeNotSupported, CypherSearchError
+__all__ = ["SearchTypeNotSupported", "CypherSearchError"]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

from .exceptions import SearchTypeNotSupported, CypherSearchError

__all__ = ["SearchTypeNotSupported", "CypherSearchError"]
🧰 Tools
🪛 Ruff (0.8.2)

7-7: .exceptions.SearchTypeNotSupported imported but unused; consider removing, adding to __all__, or using a redundant alias

(F401)


7-7: .exceptions.CypherSearchError imported but unused; consider removing, adding to __all__, or using a redundant alias

(F401)

🤖 Prompt for AI Agents
In cognee/modules/retrieval/exceptions/__init__.py at line 7, confirm that
CollectionDistancesNotFoundError is fully removed since it is no longer
referenced or defined anywhere in the codebase. Then, add an __all__ list to
explicitly declare the public API by including the currently imported exceptions
SearchTypeNotSupported and CypherSearchError to ensure proper re-export and
clarity of the module's exports.
cognee/modules/settings/get_settings.py (1)

2-2: 💡 Verification agent

🧩 Analysis chain

Made LLM endpoint and API version optional.

Good improvement making the endpoint and api_version fields optional. This increases flexibility when integrating with different LLM backends that might not require these fields.

Let's verify that the code handling these fields properly handles the case when they're None:

Also applies to: 24-25


🏁 Script executed:

#!/bin/bash
# Check how these optional fields are used throughout the codebase
echo "Checking usage of llm_endpoint and llm_api_version..."
rg "llm_endpoint" --type py -A 2 -B 2
rg "llm_api_version" --type py -A 2 -B 2

Length of output: 5608


🏁 Script executed:

#!/bin/bash
# Display OpenAIAdapter and GenericAPIAdapter implementations
rg -n "class OpenAIAdapter" -A 20 -B 5 cognee/infrastructure/llm/get_llm_client.py
rg -n "class GenericAPIAdapter" -A 20 -B 5 cognee/infrastructure/llm/get_llm_client.py

Length of output: 173


The previous search didn’t locate any class definitions—let’s inspect the top of get_llm_client.py to see how adapters are defined or imported.


🏁 Script executed:

#!/bin/bash
# Show the first 100 lines of get_llm_client.py to locate adapter definitions
sed -n '1,100p' cognee/infrastructure/llm/get_llm_client.py

Length of output: 3260


Inspecting adapter constructors for optional endpoint and api_version handling:


🏁 Script executed:

#!/bin/bash
# Inspect adapter __init__ signatures and conditional logic for endpoint/api_version
for file in \
    cognee/infrastructure/llm/openai/adapter.py \
    cognee/infrastructure/llm/generic_llm_api/adapter.py \
    cognee/infrastructure/llm/gemini/adapter.py; do
  echo "==> $file"
  rg -n -A5 -B5 "class .+Adapter" "$file"
  rg -n -A5 -B5 "def __init__" "$file"
done

Length of output: 2815


Ensure adapter constructors properly accept and handle optional endpoint and api_version

While the config now allows llm_endpoint and llm_api_version to be omitted, the adapter constructors still require them as non-optional parameters. This will fail at runtime if either is None. Please update the signatures and guard their usage:

• cognee/infrastructure/llm/openai/adapter.py

  • Change
    def __init__(
        self,
        api_key: str,
        endpoint: str,
        api_version: str,
        model: str,
        …
    ):
    to
    def __init__(
        self,
        api_key: str,
        model: str,
        endpoint: Optional[str] = None,
        api_version: Optional[str] = None,
        …
    ):
  • Inside __init__, only pass endpoint/api_version into the OpenAI client if they’re not None.

• cognee/infrastructure/llm/generic_llm_api/adapter.py

  • Change
    def __init__(self, endpoint, api_key: str, model: str, name: str, max_tokens: int):
    to
    def __init__(self, api_key: str, model: str, name: str, max_tokens: int, endpoint: Optional[str] = None):
  • Guard any requests that rely on self.endpoint being set.

• (Optionally) cognee/infrastructure/llm/get_llm_client.py

  • Before instantiating adapters, default missing fields or raise a clear error if a provider truly requires them.

These changes will ensure omitting endpoint or api_version does not cause unexpected crashes.
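A minimal sketch of the guarding described above, assuming the adapter assembles keyword arguments for its underlying client; the kwarg names (base_url, api_version) are illustrative and not confirmed against the actual adapter.

# Hypothetical sketch: forward endpoint/api_version only when they are set.
from typing import Any, Dict, Optional


class OpenAIAdapterSketch:
    def __init__(
        self,
        api_key: str,
        model: str,
        endpoint: Optional[str] = None,
        api_version: Optional[str] = None,
    ):
        client_kwargs: Dict[str, Any] = {"api_key": api_key}
        if endpoint is not None:
            client_kwargs["base_url"] = endpoint  # illustrative kwarg name
        if api_version is not None:
            client_kwargs["api_version"] = api_version  # illustrative kwarg name
        self.model = model
        self.client_kwargs = client_kwargs  # handed to the real client elsewhere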

🤖 Prompt for AI Agents
In cognee/infrastructure/llm/openai/adapter.py around the adapter __init__
method, update the endpoint and api_version parameters to be Optional[str] with
default None, and modify the constructor logic to only pass these parameters to
the OpenAI client if they are not None. Similarly, in
cognee/infrastructure/llm/generic_llm_api/adapter.py, change the __init__
signature to make endpoint Optional[str] with default None and add guards around
any usage of self.endpoint to handle the None case safely. Optionally, in
cognee/infrastructure/llm/get_llm_client.py, add logic before adapter
instantiation to provide default values or raise clear errors if required fields
are missing. This will prevent runtime errors when these fields are omitted.
README.md (1)

137-137: ⚠️ Potential issue

Remove stray closing anchor tag.

The <img> element on line 137 ends with </a> without a matching opening <a> tag, causing malformed HTML in the README. Remove the extraneous </a> or wrap the image in a proper link if that was the intention.

🤖 Prompt for AI Agents
In README.md at line 137, there is a stray closing anchor tag </a> after the
<img> element without a matching opening <a> tag, causing malformed HTML. Remove
the extraneous </a> tag or, if the image should be a link, wrap the <img>
element inside a proper <a> tag with a valid href attribute.
cognee/modules/pipelines/models/PipelineRun.py (1)

8-12: ⚠️ Potential issue

Add a database migration for the new enum value.

Introducing DATASET_PROCESSING_INITIATED to a PostgreSQL/SQLAlchemy Enum column requires an Alembic migration to update the existing type, e.g.:

ALTER TYPE pipeline_runstatus ADD VALUE 'DATASET_PROCESSING_INITIATED';

Without this migration, deployments will break due to an unknown enum value. Please add the corresponding migration script.
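A sketch of what that migration could look like, assuming Alembic over PostgreSQL and the enum type name pipeline_runstatus from the SQL above; the revision identifiers are placeholders.

# Hypothetical Alembic migration; revision IDs and the enum type name are assumptions.
from alembic import op

revision = "add_dataset_processing_initiated"
down_revision = "previous_revision_id"
branch_labels = None
depends_on = None


def upgrade():
    # ALTER TYPE ... ADD VALUE cannot run inside a transaction block on older
    # PostgreSQL versions, so use Alembic's autocommit block.
    with op.get_context().autocommit_block():
        op.execute(
            "ALTER TYPE pipeline_runstatus ADD VALUE IF NOT EXISTS 'DATASET_PROCESSING_INITIATED'"
        )


def downgrade():
    # PostgreSQL cannot drop enum values in place; leave as a no-op.
    pass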

🤖 Prompt for AI Agents
In cognee/modules/pipelines/models/PipelineRun.py around lines 8 to 12, you
added a new enum value DATASET_PROCESSING_INITIATED to the PipelineRunStatus
enum. To prevent deployment failures, create an Alembic migration script that
alters the corresponding PostgreSQL enum type by adding this new value using the
SQL command ALTER TYPE pipeline_runstatus ADD VALUE
'DATASET_PROCESSING_INITIATED';. This migration must be applied before deploying
the updated code.
cognee/api/v1/config/config.py (1)

158-158: 🛠️ Refactor suggestion

Inconsistent exception class usage.

The set_graph_db_config method now raises a native AttributeError while similar methods in this file (set_llm_config, set_relational_db_config, etc.) use the custom InvalidAttributeError class with the message parameter.

For consistency, consider modifying line 158 to match the pattern used in other methods:

-                raise AttributeError(f"'{key}' is not a valid attribute of the config.")
+                raise InvalidAttributeError(
+                    message=f"'{key}' is not a valid attribute of the config."
+                )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

                raise InvalidAttributeError(
                    message=f"'{key}' is not a valid attribute of the config."
                )
🤖 Prompt for AI Agents
In cognee/api/v1/config/config.py at line 158, replace the native AttributeError
with the custom InvalidAttributeError class used in other similar methods.
Change the raise statement to use InvalidAttributeError and pass the error
message via the message parameter to maintain consistency across the file.
cognee-frontend/src/modules/exploration/getExplorationGraphUrl.ts (1)

3-3: 💡 Verification agent

🧩 Analysis chain

Parameter type change doesn't seem to be used in the function

The parameter type has been updated from { id: string } to { name: string }, but it's not clear how this parameter is used in the fetch request. There's no reference to dataset.name in the request URL or body.


🏁 Script executed:

#!/bin/bash
# Check if the visualization endpoint uses any dataset identifier in the backend
rg -A 10 -B 5 "/v1/visualize" --type py

Length of output: 981


🏁 Script executed:

#!/bin/bash
# Show the contents of getExplorationGraphUrl.ts
sed -n '1,200p' cognee-frontend/src/modules/exploration/getExplorationGraphUrl.ts

# Check if dataset.name is referenced anywhere in this file
rg "dataset\.name" cognee-frontend/src/modules/exploration/getExplorationGraphUrl.ts || echo "No occurrences of dataset.name"

# Find all call sites of getExplorationGraphUrl
rg -n "getExplorationGraphUrl" -t ts

Length of output: 1140


Include the dataset identifier in the fetch URL

The dataset parameter (now typed as { name: string }) is never used—fetch('/v1/visualize') always hits the same endpoint regardless of which dataset is passed in. Either:

  • Remove the unused dataset parameter, or
  • Update the call to include dataset.name (or dataset.id if you revert to using the ID) in the URL or request body.

Locations to update:

  • cognee-frontend/src/modules/exploration/getExplorationGraphUrl.ts
    • Signature and first fetch call
  • Any related tests or call sites assuming that the dataset is scoped

Example fix:

export default function getExplorationGraphUrl(dataset: { name: string }) {
  const url = `/v1/visualize?datasetName=${encodeURIComponent(dataset.name)}`;
  return fetch(url)
    // …
}
🤖 Prompt for AI Agents
In cognee-frontend/src/modules/exploration/getExplorationGraphUrl.ts at line 3,
the dataset parameter typed as { name: string } is not used in the fetch request
URL, causing the function to always call the same endpoint regardless of the
dataset. To fix this, update the fetch URL to include the dataset.name as a
query parameter, for example by constructing the URL as
`/v1/visualize?datasetName=${encodeURIComponent(dataset.name)}` before calling
fetch. Also, review and update any related tests or call sites to reflect this
change in how the dataset is scoped.
cognee-frontend/src/modules/datasets/cognifyDataset.ts (1)

3-3: 🛠️ Refactor suggestion

Add validation for missing dataset identifiers

The function now accepts optional id and name properties, using a fallback mechanism to support both identifier types. However, if neither property is present, this could lead to undefined being included in the datasets array.

-export default function cognifyDataset(dataset: { id?: string, name?: string }) {
+export default function cognifyDataset(dataset: { id?: string, name?: string }) {
+  if (!dataset.id && !dataset.name) {
+    return Promise.reject(new Error('Dataset must have either id or name property'));
+  }
+
   return fetch('/v1/cognify', {
     method: 'POST',
     headers: {
       'Content-Type': 'application/json',
     },
     body: JSON.stringify({
       datasets: [dataset.id || dataset.name],
     }),
   }).then((response) => response.json());
 }

Also applies to: 10-10

🤖 Prompt for AI Agents
In cognee-frontend/src/modules/datasets/cognifyDataset.ts at lines 3 and 10, the
function accepts optional id and name properties but does not handle the case
where both are missing, which can cause undefined values in the datasets array.
Add validation to check if either id or name is present before including the
dataset in the array, and handle the case where both are missing by either
throwing an error or skipping the dataset to prevent undefined entries.
cognee/base_config.py (1)

5-11: 💡 Verification agent

🧩 Analysis chain

Updated to use new Observer enum

The default monitoring tool configuration now uses the new centralized Observer enum instead of the removed MonitoringTool enum. This is consistent with the observability refactoring.


🏁 Script executed:

#!/bin/bash
# Verify that all references to MonitoringTool have been updated to Observer

echo "Checking for any remaining references to MonitoringTool..."
rg -p "MonitoringTool" --type py

echo "\nChecking for imports of get_observe function..."
rg -p "from cognee.modules.observability.get_observe import get_observe" --type py

Length of output: 690


Ensure all observability references are updated to use the new Observer enum

The change in cognee/base_config.py to default to Observer.LANGFUSE is correct, but there are still lingering imports of the old get_observe function that need to be replaced with the centralized Observer enum interface.

Files requiring updates:

  • cognee/api/v1/cognify/code_graph_pipeline.py (line 6)
  • cognee/infrastructure/llm/openai/adapter.py (line 18)
  • cognee/infrastructure/llm/gemini/adapter.py (line 7)

For each, replace the outdated import and usages, for example:

-from cognee.modules.observability.get_observe import get_observe
+from cognee.modules.observability.observers import Observer

Then update any calls like get_observe(...) to use the appropriate Observer enum member (e.g. Observer.LANGFUSE) or the new observer-instantiation API.

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In cognee/api/v1/cognify/code_graph_pipeline.py at line 6,
cognee/infrastructure/llm/openai/adapter.py at line 18, and
cognee/infrastructure/llm/gemini/adapter.py at line 7, replace the outdated
import of get_observe with an import of the new centralized Observer enum from
cognee.modules.observability.observers. Then update all usages of
get_observe(...) to use the appropriate Observer enum member directly, such as
Observer.LANGFUSE, or adapt to the new observer-instantiation API as required by
the refactoring.
cognee-frontend/src/app/page.tsx (1)

63-69: 🛠️ Refactor suggestion

Consider adding null check for dataset lookup

The function has a potential null reference exception. When accessing dataset!.id and dataset!.name, there's an assumption that dataset exists, but if selectedDataset doesn't match any dataset ID, this will cause a runtime error.

const onCognify = useCallback(() => {
  const dataset = datasets.find((dataset) => dataset.id === selectedDataset);
+  if (!dataset) {
+    showNotification("No dataset selected for cognification.", 5000);
+    return;
+  }
  return onDatasetCognify({
-    id: dataset!.id,
-    name: dataset!.name,
+    id: dataset.id,
+    name: dataset.name,
  });
-}, [datasets, onDatasetCognify, selectedDataset]);
+}, [datasets, onDatasetCognify, selectedDataset, showNotification]);
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

  const onCognify = useCallback(() => {
    const dataset = datasets.find((dataset) => dataset.id === selectedDataset);
    if (!dataset) {
      showNotification("No dataset selected for cognification.", 5000);
      return;
    }
    return onDatasetCognify({
      id: dataset.id,
      name: dataset.name,
    });
  }, [datasets, onDatasetCognify, selectedDataset, showNotification]);
🤖 Prompt for AI Agents
In cognee-frontend/src/app/page.tsx around lines 63 to 69, the onCognify
function assumes the dataset exists without checking, which can cause a runtime
error if no matching dataset is found. Add a null check after finding the
dataset to verify it is not undefined before accessing its properties. If the
dataset is null, handle the case gracefully, such as returning early or showing
an error, to prevent exceptions.
entrypoint.sh (1)

36-37: 💡 Verification agent

🧩 Analysis chain

Removal of exec from process launching commands

Removing exec from these commands means the shell script process continues running after starting the server. In Docker environments, this can affect signal handling as the shell process (not Gunicorn) becomes PID 1.

Verify this change won't affect how the container handles termination signals:

Also applies to: 38-39, 41-42


🏁 Script executed:

#!/bin/bash
# Check if this is intended to run in Docker/container environment
grep -r "ENTRYPOINT" --include="Dockerfile*" .

Length of output: 203


Ensure proper signal handling by restoring exec in the entrypoint
Without exec, the shell remains PID 1 in your Docker containers and won’t forward SIGTERM/SIGINT to Gunicorn/Uvicorn. Add exec back before your debugpy/Gunicorn launch commands:

Locations to update in entrypoint.sh:

  • Lines 36–37
  • Lines 38–39
  • Lines 41–42

Suggested diff for each block (example for lines 36–37):

-        debugpy --wait-for-client --listen 0.0.0.0:5678 -m gunicorn \
-          -w 3 -k uvicorn.workers.UvicornWorker -t 30000 \
-          --bind=0.0.0.0:8000 --log-level debug --reload cognee.api.client:app
+        exec debugpy --wait-for-client --listen 0.0.0.0:5678 -m gunicorn \
+          -w 3 -k uvicorn.workers.UvicornWorker -t 30000 \
+          --bind=0.0.0.0:8000 --log-level debug --reload cognee.api.client:app

Apply the same prefix in the other invocation blocks so that Gunicorn/Uvicorn replaces the shell process and receives container signals directly.

🤖 Prompt for AI Agents
In entrypoint.sh around lines 36 to 37, the launch command for debugpy and
Gunicorn is missing the exec prefix, causing the shell script to remain as PID 1
and not forward termination signals properly in Docker. To fix this, add exec
before the debugpy command so that Gunicorn/Uvicorn replaces the shell process
and receives signals directly. Apply the same exec prefix to the other similar
command blocks at lines 38-39 and 41-42.
cognee-frontend/src/modules/ingestion/DataView/DataView.tsx (1)

36-36: 💡 Verification agent

🧩 Analysis chain

Rename component to avoid shadowing global DataView interface

This component shadows the global DataView interface which could lead to confusion.


🏁 Script executed:

#!/bin/bash
# Check for DataView in browser globals
cat << EOF | node -
console.log("Global DataView exists:", typeof DataView !== 'undefined');
console.log("DataView description:", DataView ? DataView.toString() : "not available");
EOF

Length of output: 103


Rename DataView component to avoid shadowing the global DataView
The browser provides a native DataView constructor (seen via typeof DataView === 'function'), so keeping the same name for your React component can lead to confusion or unintentional references.

Please update the component and its file (and all imports/usages) accordingly. For example:

  • Rename in cognee-frontend/src/modules/ingestion/DataView/DataView.tsx
    - export default function DataView(props: DataViewProps) {
    + export default function DataViewComponent(props: DataViewProps) {
        // …
    }
  • Rename the file to DataViewComponent.tsx (or a preferred alternative)
  • Update all imports/exports/JSX tags to match the new name
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

-export default function DataView({ datasetId, data, onClose, onDataAdd, onCognify }: DataViewProps) {
+export default function DataViewComponent({ datasetId, data, onClose, onDataAdd, onCognify }: DataViewProps) {
🧰 Tools
🪛 Biome (1.9.4)

[error] 36-36: Do not shadow the global "DataView" property.

Consider renaming this variable. It's easy to confuse the origin of variables when they're named after a known global.

(lint/suspicious/noShadowRestrictedNames)

🤖 Prompt for AI Agents
In cognee-frontend/src/modules/ingestion/DataView/DataView.tsx at line 36,
rename the React component from DataView to a different name such as
DataViewComponent to avoid shadowing the global DataView interface. Also rename
the file to DataViewComponent.tsx and update all imports, exports, and JSX
usages throughout the codebase to use the new component name consistently.
examples/database_examples/neo4j_example.py (4)

62-64: 🛠️ Refactor suggestion

Add error handling for asynchronous operations

The script lacks error handling for asynchronous operations, which could lead to unhandled exceptions and unclear error messages when something goes wrong.

-    # Process the added document to extract knowledge
-    await cognee.cognify([dataset_name])
+    # Process the added document to extract knowledge
+    try:
+        await cognee.cognify([dataset_name])
+    except Exception as e:
+        print(f"Error processing dataset: {e}")
+        raise
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    # Process the added document to extract knowledge
    try:
        await cognee.cognify([dataset_name])
    except Exception as e:
        print(f"Error processing dataset: {e}")
        raise
🤖 Prompt for AI Agents
In examples/database_examples/neo4j_example.py around lines 62 to 64, the
asynchronous call to cognee.cognify lacks error handling, which may cause
unhandled exceptions. Wrap the await cognee.cognify([dataset_name]) call in a
try-except block to catch potential exceptions, and log or handle the errors
appropriately to ensure clear error messages and prevent the script from
crashing unexpectedly.

67-70: 🛠️ Refactor suggestion

Add error handling for search operations

Search operations may fail due to various reasons (like connectivity issues or malformed queries) but lack proper error handling.

-    insights_results = await cognee.search(query_type=SearchType.INSIGHTS, query_text="Neo4j")
-    print("\nInsights about Neo4j:")
-    for result in insights_results:
-        print(f"- {result}")
+    try:
+        insights_results = await cognee.search(query_type=SearchType.INSIGHTS, query_text="Neo4j")
+        print("\nInsights about Neo4j:")
+        for result in insights_results:
+            print(f"- {result}")
+    except Exception as e:
+        print(f"Error searching for insights: {e}")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    try:
        insights_results = await cognee.search(query_type=SearchType.INSIGHTS, query_text="Neo4j")
        print("\nInsights about Neo4j:")
        for result in insights_results:
            print(f"- {result}")
    except Exception as e:
        print(f"Error searching for insights: {e}")
🤖 Prompt for AI Agents
In examples/database_examples/neo4j_example.py around lines 67 to 70, the search
operation lacks error handling which can cause the program to crash on failures
like connectivity issues or malformed queries. Wrap the search call and
subsequent processing in a try-except block to catch exceptions, log or print an
appropriate error message, and prevent the program from crashing unexpectedly.

44-46: 🛠️ Refactor suggestion

Make data pruning truly optional

These operations are marked as "optional" in the comment but are always executed, which could lead to unexpected data loss if users run the example against a database containing important data.

-    # Clean any existing data (optional)
-    await cognee.prune.prune_data()
-    await cognee.prune.prune_system(metadata=True)
+    # Clean any existing data (optional)
+    # Set to True to clean existing data before running the example
+    should_clean_data = False
+    if should_clean_data:
+        await cognee.prune.prune_data()
+        await cognee.prune.prune_system(metadata=True)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    # Clean any existing data (optional)
    # Set to True to clean existing data before running the example
    should_clean_data = False
    if should_clean_data:
        await cognee.prune.prune_data()
        await cognee.prune.prune_system(metadata=True)
🤖 Prompt for AI Agents
In examples/database_examples/neo4j_example.py around lines 44 to 46, the data
pruning calls are always executed despite being labeled as optional, risking
unintended data loss. Modify the code to execute these pruning operations only
if a specific flag or parameter is set, such as adding a conditional check or a
command-line argument that controls whether pruning should occur, ensuring the
example does not delete data unless explicitly requested by the user.

20-23: 🛠️ Refactor suggestion

Consider validating environment variables

The script retrieves Neo4j credentials from environment variables without validating their existence. If these variables are missing, the database connection will fail later in execution.

-    neo4j_url = os.getenv("GRAPH_DATABASE_URL")
-    neo4j_user = os.getenv("GRAPH_DATABASE_USERNAME")
-    neo4j_pass = os.getenv("GRAPH_DATABASE_PASSWORD")
+    neo4j_url = os.getenv("GRAPH_DATABASE_URL")
+    neo4j_user = os.getenv("GRAPH_DATABASE_USERNAME")
+    neo4j_pass = os.getenv("GRAPH_DATABASE_PASSWORD")
+    
+    # Validate required environment variables
+    if not all([neo4j_url, neo4j_user, neo4j_pass]):
+        missing = [var for var, val in 
+                  zip(["GRAPH_DATABASE_URL", "GRAPH_DATABASE_USERNAME", "GRAPH_DATABASE_PASSWORD"],
+                     [neo4j_url, neo4j_user, neo4j_pass]) if not val]
+        raise ValueError(f"Missing required environment variables: {', '.join(missing)}")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    # Set up Neo4j credentials in .env file and get the values from environment variables
    neo4j_url = os.getenv("GRAPH_DATABASE_URL")
    neo4j_user = os.getenv("GRAPH_DATABASE_USERNAME")
    neo4j_pass = os.getenv("GRAPH_DATABASE_PASSWORD")
    
    # Validate required environment variables
    if not all([neo4j_url, neo4j_user, neo4j_pass]):
        missing = [var for var, val in 
                  zip(
                      ["GRAPH_DATABASE_URL", "GRAPH_DATABASE_USERNAME", "GRAPH_DATABASE_PASSWORD"],
                      [neo4j_url, neo4j_user, neo4j_pass]
                  ) if not val]
        raise ValueError(f"Missing required environment variables: {', '.join(missing)}")
🤖 Prompt for AI Agents
In examples/database_examples/neo4j_example.py around lines 20 to 23, the Neo4j
credentials are fetched from environment variables without checking if they are
set. Add validation to verify that neo4j_url, neo4j_user, and neo4j_pass are not
None or empty after retrieval. If any are missing, raise an informative error or
exit gracefully to prevent connection failures later in the script.
cognee/tasks/temporal_awareness/index_graphiti_objects.py (2)

34-34: 💡 Verification agent

🧩 Analysis chain

Updated method for retrieving graph data

The function now uses get_graph_data() instead of get_model_independent_graph_data(). This change simplifies the API, but ensure all callers are updated accordingly.


🏁 Script executed:

#!/bin/bash
# Check for any remaining calls to get_model_independent_graph_data()
rg "get_model_independent_graph_data\(\)" --type py

Length of output: 311


Action Required: Update remaining get_model_independent_graph_data() calls

The search shows two adapters still invoking the old method. Please update them to use get_graph_data() for consistency:

  • cognee/infrastructure/databases/graph/neo4j_driver/adapter.py
  • cognee/infrastructure/databases/graph/kuzu/adapter.py

Suggested change in each file:

-    nodes, edges = await self.get_model_independent_graph_data()
+    nodes, edges = await self.get_graph_data()

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In cognee/infrastructure/databases/graph/neo4j_driver/adapter.py and
cognee/infrastructure/databases/graph/kuzu/adapter.py, locate all calls to
get_model_independent_graph_data() and replace them with get_graph_data() to
maintain consistency with the updated API. Ensure that any parameters or usage
align with the new method's signature and behavior.

64-65: 💡 Verification agent

🧩 Analysis chain

Updated edge data structure access

The edge data structure access has been updated to use a different index for retrieving the edge relationship name.


🏁 Script executed:

#!/bin/bash
# Check the structure of edges_data to verify the correct index is being used
# Look at where get_graph_data is defined to understand the returned data structure
rg -A 5 "def get_graph_data" --type py

Length of output: 3583


🏁 Script executed:

#!/bin/bash
# Locate and display the code around the edge indexing in the target file
FILE_PATH=$(fd -I -t f -p "index_graphiti_objects.py")
echo "File path: $FILE_PATH"
sed -n '1,200p' "$FILE_PATH"

Length of output: 3898


Fix edge relationship_name extraction for NetworkX adapter

The current code unconditionally uses edge[2] to get the relationship name, which only works for DB adapters (Kuzu/Neo4j/Memgraph) where the tuple is (src, tgt, relationship_name, props). NetworkX’s edges(data=True, keys=True) yields (src, tgt, key, props), so edge[2] is the internal key, not the actual relationship_name stored in props.

Please update in cognee/tasks/temporal_awareness/index_graphiti_objects.py (around lines 64–65):

  • Replace:
    edge_types = Counter(
        edge[2]  # The edge key (relationship name) is at index 2
        for edge in edges_data
    )
  • With something like:
    edge_types = Counter(
        # DB adapters return a string at edge[2], otherwise fall back to props['relationship_name']
        edge[2] if isinstance(edge[2], str) else edge[3]["relationship_name"]
        for edge in edges_data
    )

This ensures both tuple shapes yield the correct relationship_name.

🤖 Prompt for AI Agents
In cognee/tasks/temporal_awareness/index_graphiti_objects.py around lines 64 to
65, the code currently extracts the edge relationship name using edge[2], which
works only for DB adapters but not for NetworkX where edge[2] is an internal key
and the actual relationship name is in edge[3]["relationship_name"]. Update the
code to check if edge[2] is a string and use it directly; otherwise, extract the
relationship name from edge[3]["relationship_name"]. This will handle both tuple
formats correctly.
examples/database_examples/weaviate_example.py (3)

42-44: 🛠️ Refactor suggestion

Make data pruning truly optional

These operations are marked as "optional" in the comment but are always executed, which could lead to unexpected data loss if users run the example against a database containing important data.

-    # Clean any existing data (optional)
-    await cognee.prune.prune_data()
-    await cognee.prune.prune_system(metadata=True)
+    # Clean any existing data (optional)
+    # Set to True to clean existing data before running the example
+    should_clean_data = False
+    if should_clean_data:
+        await cognee.prune.prune_data()
+        await cognee.prune.prune_system(metadata=True)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    # Clean any existing data (optional)
    # Set to True to clean existing data before running the example
    should_clean_data = False
    if should_clean_data:
        await cognee.prune.prune_data()
        await cognee.prune.prune_system(metadata=True)
🤖 Prompt for AI Agents
In examples/database_examples/weaviate_example.py around lines 42 to 44, the
data pruning calls are always executed despite being marked as optional, risking
unintended data loss. Modify the code to conditionally execute these pruning
operations based on a user-configurable flag or parameter, such as a boolean
variable named 'enable_pruning'. This way, pruning only occurs when explicitly
enabled, preserving existing data by default.

20-23: 🛠️ Refactor suggestion

Consider validating environment variables

The script retrieves Weaviate credentials from environment variables without validating their existence. If these variables are missing, the database connection will fail later.

-    weaviate_url = os.getenv("VECTOR_DB_URL")
-    weaviate_key = os.getenv("VECTOR_DB_KEY")
+    weaviate_url = os.getenv("VECTOR_DB_URL")
+    weaviate_key = os.getenv("VECTOR_DB_KEY")
+    
+    # Validate required environment variables
+    if not weaviate_url:
+        raise ValueError("Missing required environment variable: VECTOR_DB_URL")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    # Set up Weaviate credentials in .env file and get the values from environment variables
    weaviate_url = os.getenv("VECTOR_DB_URL")
    weaviate_key = os.getenv("VECTOR_DB_KEY")
    
    # Validate required environment variables
    if not weaviate_url:
        raise ValueError("Missing required environment variable: VECTOR_DB_URL")
🤖 Prompt for AI Agents
In examples/database_examples/weaviate_example.py around lines 20 to 23, the
environment variables VECTOR_DB_URL and VECTOR_DB_KEY are retrieved without
validation. Add checks to verify these variables are not None or empty after
retrieval, and raise a clear error or exit with a message if they are missing to
prevent connection failures later.

57-84: 🛠️ Refactor suggestion

Add error handling for core operations

The script lacks error handling for asynchronous operations (cognify and search), which could lead to unclear failure modes.

Add try-except blocks around key operations:

-    # Process the added document to extract knowledge
-    await cognee.cognify([dataset_name])
+    # Process the added document to extract knowledge
+    try:
+        await cognee.cognify([dataset_name])
+    except Exception as e:
+        print(f"Error processing dataset: {e}")
+        raise

Similar error handling should be added to each search operation.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    # Add the sample text to the dataset
    await cognee.add([sample_text], dataset_name)

    # Process the added document to extract knowledge
    try:
        await cognee.cognify([dataset_name])
    except Exception as e:
        print(f"Error processing dataset: {e}")
        raise

    # Now let's perform some searches
    # 1. Search for insights related to "Weaviate"
    insights_results = await cognee.search(query_type=SearchType.INSIGHTS, query_text="Weaviate")
    print("\nInsights about Weaviate:")
    for result in insights_results:
        print(f"- {result}")

    # 2. Search for text chunks related to "vector search"
    chunks_results = await cognee.search(
        query_type=SearchType.CHUNKS, query_text="vector search", datasets=[dataset_name]
    )
    print("\nChunks about vector search:")
    for result in chunks_results:
        print(f"- {result}")

    # 3. Get graph completion related to databases
    graph_completion_results = await cognee.search(
        query_type=SearchType.GRAPH_COMPLETION, query_text="database"
    )
    print("\nGraph completion for databases:")
    for result in graph_completion_results:
        print(f"- {result}")
🤖 Prompt for AI Agents
In examples/database_examples/weaviate_example.py between lines 57 and 84, the
asynchronous calls to cognee.cognify and cognee.search lack error handling,
which can cause unclear failures. Wrap each await call to cognee.cognify and
each cognee.search invocation in try-except blocks to catch exceptions, log or
print meaningful error messages, and prevent the script from crashing silently.
examples/database_examples/chromadb_example.py (2)

37-39: 🛠️ Refactor suggestion

Make data pruning truly optional

These operations are marked as "optional" in the comment but are always executed, which could lead to unexpected data loss if users run the example against a database containing important data.

-    # Clean any existing data (optional)
-    await cognee.prune.prune_data()
-    await cognee.prune.prune_system(metadata=True)
+    # Clean any existing data (optional)
+    # Set to True to clean existing data before running the example
+    should_clean_data = False
+    if should_clean_data:
+        await cognee.prune.prune_data()
+        await cognee.prune.prune_system(metadata=True)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    # Clean any existing data (optional)
    # Set to True to clean existing data before running the example
    should_clean_data = False
    if should_clean_data:
        await cognee.prune.prune_data()
        await cognee.prune.prune_system(metadata=True)
🤖 Prompt for AI Agents
In examples/database_examples/chromadb_example.py around lines 37 to 39, the
data pruning calls are always executed despite being marked as optional, risking
unintended data loss. Modify the code to conditionally execute these pruning
operations based on a user-configurable flag or parameter, so pruning only
occurs when explicitly requested. Update the comment to clarify this behavior.

52-79: 🛠️ Refactor suggestion

Add error handling for core operations

The script lacks error handling for asynchronous operations (add, cognify, and search), which could lead to unclear failure modes and difficult debugging.

-    # Add the sample text to the dataset
-    await cognee.add([sample_text], dataset_name)
+    # Add the sample text to the dataset
+    try:
+        await cognee.add([sample_text], dataset_name)
+    except Exception as e:
+        print(f"Error adding sample text: {e}")
+        raise

Similar error handling should be added to the cognify and search operations.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    # Add the sample text to the dataset
    try:
        await cognee.add([sample_text], dataset_name)
    except Exception as e:
        print(f"Error adding sample text: {e}")
        raise

    # Process the added document to extract knowledge
    await cognee.cognify([dataset_name])

    # Now let's perform some searches
    # 1. Search for insights related to "ChromaDB"
    insights_results = await cognee.search(query_type=SearchType.INSIGHTS, query_text="ChromaDB")
    print("\nInsights about ChromaDB:")
    for result in insights_results:
        print(f"- {result}")

    # 2. Search for text chunks related to "vector search"
    chunks_results = await cognee.search(
        query_type=SearchType.CHUNKS, query_text="vector search", datasets=[dataset_name]
    )
    print("\nChunks about vector search:")
    for result in chunks_results:
        print(f"- {result}")

    # 3. Get graph completion related to databases
    graph_completion_results = await cognee.search(
        query_type=SearchType.GRAPH_COMPLETION, query_text="database"
    )
    print("\nGraph completion for databases:")
    for result in graph_completion_results:
        print(f"- {result}")
🤖 Prompt for AI Agents
In examples/database_examples/chromadb_example.py around lines 52 to 79, the
asynchronous operations add, cognify, and search lack error handling, which can
cause unclear failures. Wrap each await call for add, cognify, and each search
operation in try-except blocks to catch exceptions. Log or print meaningful
error messages in the except blocks to aid debugging and ensure the script
handles failures gracefully.
examples/database_examples/milvus_example.py (1)

29-37: 💡 Verification agent

🧩 Analysis chain

Verify that a local file-path is really accepted by the Milvus adapter

vector_db_url is set to <…>/milvus.db, but Milvus typically expects an HTTP / gRPC endpoint (tcp://host:port). If the Cognee Milvus adapter falls back to "local-disk SQLite" for demo purposes that's fine; otherwise this will fail at connection time.

Run:

…to confirm that a path value is supported. Update the example or add an explanatory comment if the adapter really needs a network URL.


🏁 Script executed:

#!/bin/bash
# Search whether the Milvus adapter treats file paths specially
rg -n '"milvus"' --type py | rg 'vector_db_url'

Length of output: 47


🏁 Script executed:

#!/bin/bash
# Locate the setter for vector DB configuration
rg -n 'def set_vector_db_config' --type py

# Find all occurrences of vector_db_url in Python source
rg -n 'vector_db_url' --type py

# Find all occurrences of vector_db_provider in Python source
rg -n 'vector_db_provider' --type py

# Search for any Milvus adapter class definitions
rg -n 'class .*Milvus' --type py

# Case-insensitive search for "milvus" to catch any usage
rg -n -i 'milvus' --type py

Length of output: 13041


🏁 Script executed:

#!/bin/bash
# Inspect MilvusAdapter implementation around initialization and client creation
sed -n '1,200p' cognee/infrastructure/databases/vector/milvus/MilvusAdapter.py

# Search for any fallback logic for local file paths or SQLite usage
rg -n 'sqlite' -n 'file' cognee/infrastructure/databases/vector/milvus/MilvusAdapter.py

Length of output: 7218


Clarify Milvus URL Format for vector_db_url

The Milvus adapter passes vector_db_url directly to pymilvus.MilvusClient(uri=…) and does not fall back to a local SQLite file. Using a filesystem path will result in a connection error at runtime. Please update the example to use a proper Milvus endpoint (e.g. gRPC URL) or document that this must be replaced.

• File: examples/database_examples/milvus_example.py
Line 34:

-            "vector_db_url": local_milvus_db_path,  # Enter Milvus Endpoint if exist
+            "vector_db_url": "tcp://localhost:19530",  # Milvus gRPC endpoint (replace as needed)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    local_milvus_db_path = os.path.join(cognee_directory_path, "databases", "milvus.db")

    # Configure Milvus as the vector database provider
    cognee.config.set_vector_db_config(
        {
-           "vector_db_url": local_milvus_db_path,  # Enter Milvus Endpoint if exist
+           "vector_db_url": "tcp://localhost:19530",  # Milvus gRPC endpoint (replace as needed)
            "vector_db_key": "",  # Enter Token
            "vector_db_provider": "milvus",  # Specify Milvus as provider
        }
🤖 Prompt for AI Agents
In examples/database_examples/milvus_example.py around lines 29 to 37, the
vector_db_url is set to a local file path which is incorrect because the Milvus
adapter expects a network endpoint URL like a gRPC address. Update the example
to use a proper Milvus endpoint URL (e.g., tcp://host:port) instead of a file
path, or add a clear comment explaining that users must replace the placeholder
with their actual Milvus server URL to avoid connection errors.
cognee/tasks/ingestion/migrate_relational_database.py (2)

110-118: 🛠️ Refactor suggestion

Potential duplicate ColumnValue nodes across rows

column_node_id = f"{table_name}:{key}:{value}" is identical for every row that shares the same value, so multiple rows referencing "status:active" will clash.
Either:

  • include the row’s PK in the ID, or
  • de-duplicate intentionally and add many-to-one edges.

Clarify the intended semantics and update the ID strategy accordingly.
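If per-row uniqueness is the intended behaviour, a sketch of the ID change (variable names taken from the surrounding loop; primary_key_col is assumed to be a key of row_properties):

    # Include the row's primary key so identical values in different rows
    # produce distinct ColumnValue nodes
    primary_key_value = row_properties[primary_key_col]
    column_node_id = f"{table_name}:{primary_key_value}:{key}:{value}"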

🤖 Prompt for AI Agents
In cognee/tasks/ingestion/migrate_relational_database.py around lines 110 to
118, the current ID generation for ColumnValue nodes uses only table name,
column key, and value, causing collisions for identical values across different
rows. To fix this, modify the column_node_id to include the row's primary key to
ensure uniqueness per row, or if intentional deduplication is desired, implement
logic to create many-to-one edges from rows to shared ColumnValue nodes. Clarify
the intended behavior and update the ID generation and node mapping accordingly.

98-109: ⚠️ Potential issue

Logical bugs in column-migration filter

  1. if key is primary_key_col uses the is operator, which checks identity, not equality.
  2. The foreign-key exclusion list appends fk["ref_column"], but the actual FK column in this table is fk["column"].
    As written, FK columns are still turned into ColumnValue nodes, while unrelated columns in the referenced table are skipped.
-foreign_keys.append(fk["ref_column"])
+foreign_keys.append(fk["column"])

-if key is primary_key_col or key in foreign_keys:
+if key == primary_key_col or key in foreign_keys:

These fixes prevent spurious ColumnValue nodes and ensure correct filtering.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

                # Migrate data stored in columns of table rows
                if migrate_column_data:
                    # Get foreign key columns to filter them out from column migration
                    foreign_keys = []
                    for fk in details.get("foreign_keys", []):
-                        foreign_keys.append(fk["ref_column"])
+                        foreign_keys.append(fk["column"])

                    for key, value in row_properties.items():
                        # Skip mapping primary key information to itself and mapping of foreign key information (as it will be mapped bellow)
-                        if key is primary_key_col or key in foreign_keys:
+                        if key == primary_key_col or key in foreign_keys:
                            continue
🤖 Prompt for AI Agents
In cognee/tasks/ingestion/migrate_relational_database.py around lines 98 to 109,
replace the identity check 'if key is primary_key_col' with an equality check
'if key == primary_key_col' to correctly compare column names. Also, change the
foreign key columns list to append 'fk["column"]' instead of 'fk["ref_column"]'
so that the filter excludes the actual foreign key columns in the current table,
preventing incorrect inclusion of referenced table columns.
cognee/api/v1/responses/routers/default_tools.py (1)

16-23: ⚠️ Potential issue

search_type enum is out of sync with SearchType – will cause validation errors

SEMANTIC is not in cognee.modules.search.types.SearchType, while valid options such as CHUNKS, RAG_COMPLETION, GRAPH_SUMMARY_COMPLETION, and CYPHER are missing. Downstream validation (e.g., Pydantic/OpenAPI) will reject legitimate calls.

-"INSIGHTS", "CODE", "GRAPH_COMPLETION", "SEMANTIC", "NATURAL_LANGUAGE"
+"INSIGHTS", "CHUNKS", "CODE", "GRAPH_COMPLETION", "GRAPH_SUMMARY_COMPLETION",
+"RAG_COMPLETION", "CYPHER", "NATURAL_LANGUAGE"

Synchronise the list or, better, import the enum values programmatically to avoid future drift.

Committable suggestion skipped: line range outside the PR's diff.
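A minimal sketch of deriving the list programmatically (assuming SearchType is a standard Python Enum, as referenced above):

from cognee.modules.search.types import SearchType

# Build the tool schema's enum from the source of truth so the two never drift
SEARCH_TYPE_ENUM = [search_type.name for search_type in SearchType]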

🤖 Prompt for AI Agents
In cognee/api/v1/responses/routers/default_tools.py around lines 16 to 23, the
"enum" list for "search_type" is not synchronized with the SearchType enum in
cognee.modules.search.types. Replace the hardcoded enum list with a dynamic
import of the SearchType enum values to ensure they always match and prevent
validation errors. This can be done by importing SearchType and programmatically
extracting its members for the enum definition.
cognee/api/v1/responses/routers/get_responses_router.py (6)

140-144: ⚠️ Potential issue

Token usage mapping mismatch

The returned dict uses prompt_tokens / completion_tokens. Your conversion expects input_tokens / output_tokens. Adjust field names to avoid zeroed statistics.

-                prompt_tokens=usage.get("input_tokens", 0),
-                completion_tokens=usage.get("output_tokens", 0),
+                prompt_tokens=usage.get("prompt_tokens", 0),
+                completion_tokens=usage.get("completion_tokens", 0),
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

            usage=ChatUsage(
                prompt_tokens=usage.get("prompt_tokens", 0),
                completion_tokens=usage.get("completion_tokens", 0),
                total_tokens=usage.get("total_tokens", 0),
            ),
🤖 Prompt for AI Agents
In cognee/api/v1/responses/routers/get_responses_router.py around lines 140 to
144, the code incorrectly maps usage dictionary keys 'input_tokens' and
'output_tokens' to ChatUsage fields prompt_tokens and completion_tokens. Update
the keys to 'prompt_tokens' and 'completion_tokens' respectively to match the
returned dictionary and avoid zeroed token statistics.

46-49: ⚠️ Potential issue

Mutable default argument – potential cross-request bleed-through

tools: Optional[List[Dict[str, Any]]] = DEFAULT_TOOLS stores a single list instance that will be shared and possibly mutated across requests. Default to None and assign inside the function:

-        tools: Optional[List[Dict[str, Any]]] = DEFAULT_TOOLS,
+        tools: Optional[List[Dict[str, Any]]] = None,
 ...
-        tools = tools or DEFAULT_TOOLS
+        tools = tools or DEFAULT_TOOLS
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

async def get_responses_router(
    obj_id: str,
    query: str,
-   tools: Optional[List[Dict[str, Any]]] = DEFAULT_TOOLS,
+   tools: Optional[List[Dict[str, Any]]] = None,
    tool_choice: Any = "auto",
    temperature: float = 1.0,
) -> Dict[str, Any]:
    tools = tools or DEFAULT_TOOLS
    ...
🤖 Prompt for AI Agents
In cognee/api/v1/responses/routers/get_responses_router.py around lines 46 to
49, the function parameter 'tools' uses a mutable default argument
DEFAULT_TOOLS, which can cause shared state issues across requests. Change the
default value of 'tools' to None, then inside the function check if 'tools' is
None and if so, assign it to DEFAULT_TOOLS. This prevents unintended mutations
of the default list across different calls.

54-56: ⚠️ Potential issue

Model name is being silently overwritten

model = "gpt-4o" ignores the caller’s requested model. Either honour request.model or reject unsupported models with a 400.

-        model = "gpt-4o"
+        if model not in {"gpt-4o"}:  # extend as back-ends grow
+            raise HTTPException(status_code=400, detail=f"Unsupported model: {model}")

Silently switching models is surprising for API consumers.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

        # TODO: Support other models (e.g. cognee-v1-openai-gpt-3.5-turbo, etc.)
        if model not in {"gpt-4o"}:  # extend as back-ends grow
            raise HTTPException(status_code=400, detail=f"Unsupported model: {model}")
🤖 Prompt for AI Agents
In cognee/api/v1/responses/routers/get_responses_router.py around lines 54 to
56, the code overwrites the model variable with a hardcoded value "gpt-4o",
ignoring the model requested by the caller. To fix this, modify the code to
check if the requested model in request.model is supported; if it is, use it,
otherwise return a 400 error response rejecting unsupported models. This ensures
the API respects the caller's input and avoids silently switching models.

91-109: ⚠️ Potential issue

Parsing tool calls from the wrong field

The OpenAI Responses API places tool/function calls in tool_calls, not in the top-level output list. This loop will never run in production.

-        output = response.get("output", [])
+        output = response.get("tool_calls", [])

Verify against openai>=1.24.0 docs.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

        # Use the response ID from the API or generate a new one
        response_id = response.get("id", f"resp_{uuid.uuid4().hex}")

        # Check if there are function tool calls in the output
        output = response.get("tool_calls", [])

        processed_tool_calls = []

        # Process any function tool calls from the output
        for item in output:
            if isinstance(item, dict) and item.get("type") == "function_call":
                # This is a tool call from the new format
                function_name = item.get("name", "")
                arguments_str = item.get("arguments", "{}")
                call_id = item.get("call_id", f"call_{uuid.uuid4().hex}")

                # Create a format the dispatcher can handle
                tool_call = {
                    "id": call_id,
🤖 Prompt for AI Agents
In cognee/api/v1/responses/routers/get_responses_router.py around lines 91 to
109, the code incorrectly attempts to parse function tool calls from the
"output" field, but according to openai>=1.24.0 docs, tool calls are located in
the "tool_calls" field. Update the code to extract and iterate over the
"tool_calls" field from the response instead of "output" to correctly process
function tool calls.

36-42: 🛠️ Refactor suggestion

Cache the OpenAI client instead of rebuilding it per request

_get_model_client() instantiates openai.AsyncOpenAI for every incoming call.
Because the client object is stateless and thread-safe, create it once (e.g. at module import or via functools.cache) to avoid repeated HTTP-session/bootstrap overhead.

-from cognee.infrastructure.llm.config import get_llm_config
+from functools import cache
+from cognee.infrastructure.llm.config import get_llm_config
 ...
-    def _get_model_client():
+    @cache
+    def _get_model_client() -> openai.AsyncOpenAI:

This shaves latency from each request.
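A self-contained sketch of the cached client (the bare AsyncOpenAI() constructor reads OPENAI_API_KEY from the environment; wiring in get_llm_config values is omitted here and would follow the existing code):

from functools import cache

import openai


@cache
def _get_model_client() -> openai.AsyncOpenAI:
    # Constructed once per process and reused by every request
    return openai.AsyncOpenAI()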

🤖 Prompt for AI Agents
In cognee/api/v1/responses/routers/get_responses_router.py around lines 36 to
42, the _get_model_client function creates a new openai.AsyncOpenAI client on
every call, causing unnecessary overhead. To fix this, cache the client instance
by either creating it once at module import or by decorating the function with
functools.cache so that the same client is reused across calls, reducing latency
and resource usage.

100-113: 🛠️ Refactor suggestion

Arguments string is never parsed – JSON stays a string

arguments_str is forwarded untouched to both dispatch_function and the response body.
Convert it to a dict to guarantee type-safe handling and prevent double-serialization mistakes:

-import json
 ...
-                arguments_str = item.get("arguments", "{}")
+                import json
+                arguments_dict = json.loads(item.get("arguments", "{}"))
 ...
-                    "function": {"name": function_name, "arguments": arguments_str},
+                    "function": {"name": function_name, "arguments": arguments_dict},
 ...
-                    function=FunctionCall(name=function_name, arguments=arguments_str),
+                    function=FunctionCall(name=function_name, arguments=json.dumps(arguments_dict)),

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In cognee/api/v1/responses/routers/get_responses_router.py around lines 100 to
113, the arguments string is currently passed as a raw JSON string without
parsing, which risks type errors and double serialization. Fix this by parsing
the arguments_str from JSON string to a dictionary using a JSON parsing method
before assigning it to the tool_call's function arguments. This ensures
type-safe handling and prevents serialization issues downstream.
cognee/modules/pipelines/operations/pipeline.py (1)

63-92: 💡 Verification agent

🧩 Analysis chain

Consider persisting new dataset instances to the database.

The improved dataset handling logic now supports matching by name or ID and creating new instances for unmatched datasets. However, these new instances aren't saved to the database, only created in memory.

- dataset_instances.append(
-     Dataset(
-         id=await get_unique_dataset_id(dataset_name=dataset_name, user=user),
-         name=dataset_name,
-         owner_id=user.id,
-     )
- )
+ # Create and persist new dataset instance
+ new_dataset = Dataset(
+     id=await get_unique_dataset_id(dataset_name=dataset_name, user=user),
+     name=dataset_name,
+     owner_id=user.id,
+ )
+ # Save to database
+ db_engine = get_relational_engine()
+ async with db_engine.get_async_session() as session:
+     session.add(new_dataset)
+     await session.commit()
+ dataset_instances.append(new_dataset)

Let's verify if these datasets should be persisted:


🏁 Script executed:

#!/bin/bash
# Check if new datasets are persisted elsewhere in the codebase
rg -A 5 "get_unique_dataset_id" --type py

Length of output: 5150


🏁 Script executed:

#!/bin/bash
# Display the first 200 lines of pipeline.py for context
sed -n '1,200p' cognee/modules/pipelines/operations/pipeline.py

Length of output: 6323


🏁 Script executed:

#!/bin/bash
# Locate get_datasets function implementation
rg -R "async def get_datasets" -A 30 --type py

Length of output: 453


🏁 Script executed:

#!/bin/bash
# Locate get_datasets implementation
rg "async def get_datasets" -A 30 --type py

Length of output: 8883


🏁 Script executed:

#!/bin/bash
# Display implementation of create_dataset to understand persistence logic
sed -n '1,200p' cognee/modules/data/methods/create_dataset.py

Length of output: 1189


Persist new datasets in the database in cognee_pipeline
Newly created Dataset instances in pipeline.py exist only in memory; they need to be saved via the existing persistence logic (create_dataset).

Please update the block at cognee/modules/pipelines/operations/pipeline.py (lines ~80–92) to use create_dataset instead of raw Dataset(...), for example:

 from cognee.modules.data.methods.get_unique_dataset_id import get_unique_dataset_id
+from cognee.modules.data.methods.create_dataset import create_dataset
+from cognee.infrastructure.databases.relational import get_relational_engine

     for dataset_name in datasets:
         is_dataset_found = False

         for existing_dataset in existing_datasets:
             if (
                 existing_dataset.name == dataset_name
                 or str(existing_dataset.id) == dataset_name
             ):
                 dataset_instances.append(existing_dataset)
                 is_dataset_found = True
                 break

         if not is_dataset_found:
-            dataset_instances.append(
-                Dataset(
-                    id=await get_unique_dataset_id(dataset_name=dataset_name, user=user),
-                    name=dataset_name,
-                    owner_id=user.id,
-                )
-            )
+            # Create and persist new dataset if missing
+            db_engine = get_relational_engine()
+            async with db_engine.get_async_session() as session:
+                new_dataset = await create_dataset(dataset_name=dataset_name, user=user, session=session)
+            dataset_instances.append(new_dataset)
  • This leverages the existing create_dataset method which commits only when a new record is required.
  • Imports:
    • create_dataset from cognee.modules.data.methods.create_dataset
    • get_relational_engine from cognee.infrastructure.databases.relational
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

# add these imports at the top of cognee/modules/pipelines/operations/pipeline.py
from cognee.modules.data.methods.get_unique_dataset_id import get_unique_dataset_id
+from cognee.modules.data.methods.create_dataset import create_dataset
+from cognee.infrastructure.databases.relational import get_relational_engine

    if not datasets:
        # Get datasets from database if none sent.
        datasets = existing_datasets
    else:
        # If dataset is already in database, use it, otherwise create a new instance.
        dataset_instances = []

        for dataset_name in datasets:
            is_dataset_found = False

            for existing_dataset in existing_datasets:
                if (
                    existing_dataset.name == dataset_name
                    or str(existing_dataset.id) == dataset_name
                ):
                    dataset_instances.append(existing_dataset)
                    is_dataset_found = True
                    break

            if not is_dataset_found:
-                dataset_instances.append(
-                    Dataset(
-                        id=await get_unique_dataset_id(dataset_name=dataset_name, user=user),
-                        name=dataset_name,
-                        owner_id=user.id,
-                    )
-                )
+                # Create and persist new dataset if missing
+                db_engine = get_relational_engine()
+                async with db_engine.get_async_session() as session:
+                    new_dataset = await create_dataset(
+                        dataset_name=dataset_name,
+                        user=user,
+                        session=session,
+                    )
+                dataset_instances.append(new_dataset)

        datasets = dataset_instances
🤖 Prompt for AI Agents
In cognee/modules/pipelines/operations/pipeline.py around lines 63 to 92, the
new Dataset instances created in memory are not persisted to the database. To
fix this, replace the direct Dataset(...) instantiation with calls to the
existing create_dataset function, which handles persistence and commits only
when needed. Import create_dataset from
cognee.modules.data.methods.create_dataset and get_relational_engine from
cognee.infrastructure.databases.relational, then use create_dataset with the
appropriate parameters to save new datasets instead of just creating them in
memory.
cognee/api/v1/responses/dispatch_function.py (4)

87-101: 🛠️ Refactor suggestion

Add error handling for cognify operations.

The function processes both add and cognify operations but lacks error handling for potential failures.

- if text:
-     await add(data=text, user=user)
-
- await cognify(user=user, ontology_file_path=ontology_file_path if ontology_file_path else None)
-
- return (
-     "Text successfully converted into knowledge graph."
-     if text
-     else "Knowledge graph successfully updated with new information."
- )
+ try:
+     if text:
+         await add(data=text, user=user)
+
+     await cognify(user=user, ontology_file_path=ontology_file_path if ontology_file_path else None)
+
+     return (
+         "Text successfully converted into knowledge graph."
+         if text
+         else "Knowledge graph successfully updated with new information."
+     )
+ except Exception as e:
+     logger.error(f"Error during cognify operation: {e}")
+     return f"Error processing knowledge graph: {str(e)}"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

async def handle_cognify(arguments: Dict[str, Any], user) -> str:
    """Handle cognify function call"""
    text = arguments.get("text")
    ontology_file_path = arguments.get("ontology_file_path")

    try:
        if text:
            await add(data=text, user=user)

        await cognify(
            user=user,
            ontology_file_path=ontology_file_path if ontology_file_path else None,
        )

        return (
            "Text successfully converted into knowledge graph."
            if text
            else "Knowledge graph successfully updated with new information."
        )
    except Exception as e:
        logger.error(f"Error during cognify operation: {e}")
        return f"Error processing knowledge graph: {str(e)}"
🤖 Prompt for AI Agents
In cognee/api/v1/responses/dispatch_function.py around lines 87 to 101, the
handle_cognify function lacks error handling for the asynchronous add and
cognify calls. Wrap the calls to add and cognify in try-except blocks to catch
exceptions, log or handle errors appropriately, and ensure the function returns
meaningful error messages or statuses if these operations fail.

104-107: 🛠️ Refactor suggestion

Add error handling for prune operation.

The function calls prune() but doesn't handle potential errors that might occur during execution.

- async def handle_prune(arguments: Dict[str, Any], user) -> str:
-     """Handle prune function call"""
-     await prune()
-     return "Memory has been pruned successfully."
+ async def handle_prune(arguments: Dict[str, Any], user) -> str:
+     """Handle prune function call"""
+     try:
+         await prune()
+         return "Memory has been pruned successfully."
+     except Exception as e:
+         logger.error(f"Error during prune operation: {e}")
+         return f"Error pruning memory: {str(e)}"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

async def handle_prune(arguments: Dict[str, Any], user) -> str:
    """Handle prune function call"""
    try:
        await prune()
        return "Memory has been pruned successfully."
    except Exception as e:
        logger.error(f"Error during prune operation: {e}")
        return f"Error pruning memory: {str(e)}"
🤖 Prompt for AI Agents
In cognee/api/v1/responses/dispatch_function.py around lines 104 to 107, the
handle_prune function calls the async prune() without any error handling, which
can cause unhandled exceptions. Wrap the await prune() call in a try-except
block to catch potential exceptions, and return an appropriate error message or
re-raise the error after logging it. This will ensure the function handles
failures gracefully.

19-45: 🛠️ Refactor suggestion

Add error handling for JSON parsing failures.

The function parses JSON arguments but doesn't handle potential JSON parsing errors, which could occur with malformed input.

- arguments = json.loads(arguments_str)
+ try:
+     arguments = json.loads(arguments_str)
+ except json.JSONDecodeError as e:
+     logger.error(f"Failed to parse arguments: {e}")
+     return f"Error: Invalid function arguments format - {str(e)}"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

async def dispatch_function(tool_call: Union[ToolCall, Dict[str, Any]]) -> str:
    """
    Dispatches a function call to the appropriate Cognee function.
    """
    if isinstance(tool_call, dict):
        function_data = tool_call.get("function", {})
        function_name = function_data.get("name", "")
        arguments_str = function_data.get("arguments", "{}")
    else:
        function_name = tool_call.function.name
        arguments_str = tool_call.function.arguments

    try:
        arguments = json.loads(arguments_str)
    except json.JSONDecodeError as e:
        logger.error(f"Failed to parse arguments: {e}")
        return f"Error: Invalid function arguments format - {str(e)}"

    logger.info(f"Dispatching function: {function_name} with args: {arguments}")

    user = await get_default_user()

    if function_name == "search":
        return await handle_search(arguments, user)
    elif function_name == "cognify":
        return await handle_cognify(arguments, user)
    elif function_name == "prune":
        return await handle_prune(arguments, user)
    else:
        return f"Error: Unknown function {function_name}"
🤖 Prompt for AI Agents
In cognee/api/v1/responses/dispatch_function.py around lines 19 to 45, the code
parses JSON arguments without handling JSONDecodeError exceptions. To fix this,
wrap the json.loads call in a try-except block catching JSONDecodeError, and
return or raise a clear error message indicating malformed JSON input to prevent
unhandled exceptions.

47-84: 🛠️ Refactor suggestion

Handle potential errors in search execution.

While the parameter validation is thorough, there's no try-except block to handle potential errors in the search operation itself.

- results = await search(
-     query_text=query,
-     query_type=query_type,
-     datasets=datasets,
-     user=user,
-     system_prompt_path=system_prompt_path,
-     top_k=top_k if isinstance(top_k, int) else 10,
- )
-
- return results
+ try:
+     results = await search(
+         query_text=query,
+         query_type=query_type,
+         datasets=datasets,
+         user=user,
+         system_prompt_path=system_prompt_path,
+         top_k=top_k if isinstance(top_k, int) else 10,
+     )
+     return results
+ except Exception as e:
+     logger.error(f"Error during search operation: {e}")
+     return f"Error executing search: {str(e)}"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

async def handle_search(arguments: Dict[str, Any], user) -> list:
    """Handle search function call"""
    search_tool = next((tool for tool in DEFAULT_TOOLS if tool["name"] == "search"), None)
    required_params = (
        search_tool["parameters"].get("required", []) if search_tool else ["search_query"]
    )

    query = arguments.get("search_query")
    if not query and "search_query" in required_params:
        return "Error: Missing required 'search_query' parameter"

    search_type_str = arguments.get("search_type", "GRAPH_COMPLETION")
    valid_search_types = (
        search_tool["parameters"]["properties"]["search_type"]["enum"]
        if search_tool
        else ["INSIGHTS", "CODE", "GRAPH_COMPLETION", "SEMANTIC", "NATURAL_LANGUAGE"]
    )

    if search_type_str not in valid_search_types:
        logger.warning(f"Invalid search_type: {search_type_str}, defaulting to GRAPH_COMPLETION")
        search_type_str = "GRAPH_COMPLETION"

    query_type = SearchType[search_type_str]

    top_k = arguments.get("top_k")
    datasets = arguments.get("datasets")
    system_prompt_path = arguments.get("system_prompt_path", "answer_simple_question.txt")

    try:
        results = await search(
            query_text=query,
            query_type=query_type,
            datasets=datasets,
            user=user,
            system_prompt_path=system_prompt_path,
            top_k=top_k if isinstance(top_k, int) else 10,
        )
        return results
    except Exception as e:
        logger.error(f"Error during search operation: {e}")
        return f"Error executing search: {str(e)}"
🤖 Prompt for AI Agents
In cognee/api/v1/responses/dispatch_function.py around lines 47 to 84, the
handle_search function lacks error handling for the asynchronous search call.
Wrap the search invocation and its related logic inside a try-except block to
catch any exceptions that may occur during the search execution. In the except
block, log the error appropriately and return a meaningful error message or
response to ensure the function handles failures gracefully.
cognee/infrastructure/databases/graph/networkx/adapter.py (1)

45-47: 🛠️ Refactor suggestion

Type-hint transition to UUID is incomplete – API now mixes UUID and str.

has_node & many new signatures correctly use UUID, but methods such as has_edge, add_edge, add_edges, etc. still use str parameters.
Downstream callers now need to cast back and forth, which is error-prone and defeats the goal of stronger typing.

Action items:

  1. Update remaining public methods to accept UUID (or generic Hashable) consistently.
  2. Add mypy/pyright to CI to prevent future drift.
-async def has_edge(self, from_node: str, to_node: str, edge_label: str) -> bool:
+async def has_edge(self, from_node: UUID, to_node: UUID, edge_label: str) -> bool:

Repeat for add_edge, add_edges, etc.

Also applies to: 62-64
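A sketch of the direction for the remaining methods, using hypothetical parameter lists (the real signatures in adapter.py may carry additional arguments):

from typing import Optional
from uuid import UUID


class NetworkXAdapterTyped:  # illustrative stub, not the real adapter
    async def has_edge(self, from_node: UUID, to_node: UUID, edge_label: str) -> bool: ...

    async def add_edge(
        self,
        from_node: UUID,
        to_node: UUID,
        relationship_name: str,
        edge_properties: Optional[dict] = None,  # hypothetical parameter
    ) -> None: ...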

🤖 Prompt for AI Agents
In cognee/infrastructure/databases/graph/networkx/adapter.py around lines 45 to
47 and 62 to 64, the type hints for methods like has_edge, add_edge, and
add_edges still use str instead of UUID, causing inconsistent typing and forcing
callers to cast types. Update these method signatures to accept UUID or a
generic Hashable type consistently to align with has_node and improve type
safety. Additionally, integrate mypy or pyright into the CI pipeline to enforce
consistent type usage and prevent future type hint drift.
cognee/infrastructure/databases/vector/chromadb/ChromaDBAdapter.py (1)

121-127: 🛠️ Refactor suggestion

Avoid double round-trips when retrieving a collection

get_collection() first calls has_collection(), which itself performs list_collections().
Immediately afterwards we call client.get_collection(). For every lookup this results in two HTTP calls:

  1. list_collections (to check existence)
  2. get_collection

When the caller only needs the collection handler, we can skip the first call and rely on the Chroma client to raise an error – or catch NotFound once – eliminating ~50 % of the latency per call.

-async def get_collection(self, collection_name: str) -> AsyncHttpClient:
-    if not await self.has_collection(collection_name):
-        raise CollectionNotFoundError(f"Collection '{collection_name}' not found!")
-
-    client = await self.get_connection()
-    return await client.get_collection(collection_name)
+async def get_collection(self, collection_name: str):
+    client = await self.get_connection()
+    try:
+        return await client.get_collection(collection_name)
+    except Exception as exc:          # replace with the specific Chroma *NotFound once available
+        raise CollectionNotFoundError(
+            f"Collection '{collection_name}' not found!"
+        ) from exc
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    async def get_collection(self, collection_name: str):
        client = await self.get_connection()
        try:
            return await client.get_collection(collection_name)
        except Exception as exc:  # replace with the specific Chroma *NotFound once available
            raise CollectionNotFoundError(
                f"Collection '{collection_name}' not found!"
            ) from exc
🤖 Prompt for AI Agents
In cognee/infrastructure/databases/vector/chromadb/ChromaDBAdapter.py around
lines 121 to 127, the get_collection method currently calls has_collection which
triggers a list_collections HTTP request before calling client.get_collection,
causing two HTTP calls per lookup. To fix this, remove the call to
has_collection and directly call client.get_collection, then handle the
potential NotFound error from the client to raise CollectionNotFoundError. This
eliminates the redundant list_collections call and reduces latency by half.
cognee-mcp/src/server.py (1)

87-93: ⚠️ Potential issue

Background task fails immediately: wrong keyword argument

cognify_task() expects data, but we pass text.
At runtime this raises TypeError: cognify_task() got an unexpected keyword argument 'text', and the background job never starts.

-    asyncio.create_task(
-        cognify_task(
-            text=data,
-            graph_model_file=graph_model_file,
-            graph_model_name=graph_model_name,
-        )
-    )
+    asyncio.create_task(
+        cognify_task(
+            data=data,
+            graph_model_file=graph_model_file,
+            graph_model_name=graph_model_name,
+        )
+    )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    asyncio.create_task(
        cognify_task(
-            text=data,
+            data=data,
            graph_model_file=graph_model_file,
            graph_model_name=graph_model_name,
        )
    )
🤖 Prompt for AI Agents
In cognee-mcp/src/server.py around lines 87 to 93, the call to cognify_task uses
the keyword argument 'text', but the function expects 'data'. To fix this,
rename the keyword argument from 'text' to 'data' in the asyncio.create_task
call so it matches the function signature and prevents the TypeError.
cognee/infrastructure/databases/vector/weaviate_db/WeaviateAdapter.py (1)

101-103: ⚠️ Potential issue

Creating data points fails if the collection does not exist

create_data_points() goes directly to get_collection(), which raises CollectionNotFoundError when the collection is missing.
Other adapters (e.g., ChromaDBAdapter) first call create_collection() to guarantee existence.

-        collection = await self.get_collection(collection_name)
+        await self.create_collection(collection_name)
+        collection = await self.get_collection(collection_name)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

        await self.create_collection(collection_name)
        collection = await self.get_collection(collection_name)

        try:
🤖 Prompt for AI Agents
In cognee/infrastructure/databases/vector/weaviate_db/WeaviateAdapter.py around
lines 101 to 103, the create_data_points() method calls get_collection()
directly, which raises CollectionNotFoundError if the collection does not exist.
To fix this, modify create_data_points() to first check if the collection exists
and call create_collection() to create it if missing before calling
get_collection(). This ensures the collection is always present before adding
data points.
cognee/infrastructure/databases/vector/lancedb/LanceDBAdapter.py (1)

205-211: 🛠️ Refactor suggestion

Row-by-row deletes are very slow

delete_data_points issues one collection.delete call per ID, which yields O(n) separate commits and network calls.

If LanceDB supports a single DELETE WHERE id IN [...], you can slash latency and lock contention:

-        # Delete one at a time to avoid commit conflicts
-        for data_point_id in data_point_ids:
-            await collection.delete(f"id = '{data_point_id}'")
+        # Bulk delete in one shot
+        if data_point_ids:
+            ids_tuple = tuple(map(str, data_point_ids))
+            await collection.delete(f"id IN {ids_tuple}")

Test on a staging dataset – on SSD the difference is multiple orders of magnitude for 10k+ IDs.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    async def delete_data_points(self, collection_name: str, data_point_ids: list[str]):
        collection = await self.get_collection(collection_name)

        # Bulk delete in one shot
        if data_point_ids:
            ids_tuple = tuple(map(str, data_point_ids))
            await collection.delete(f"id IN {ids_tuple}")
🤖 Prompt for AI Agents
In cognee/infrastructure/databases/vector/lancedb/LanceDBAdapter.py around lines
205 to 211, the delete_data_points method deletes data points one by one,
causing slow performance due to multiple commits and network calls. To fix this,
modify the method to perform a single delete operation using a query with "id IN
[...]" that includes all data_point_ids at once, reducing latency and lock
contention significantly.
cognee/infrastructure/databases/graph/memgraph/memgraph_adapter.py (7)

120-124: ⚠️ Potential issue

Malformed DELETE pattern – extra colon & braces

MATCH (node: {{id: $node_id}}) DETACH DELETE node
  1. Double braces render invalid syntax.
  2. The colon implies a label before the map pattern.

Correct form:

-MATCH (node: {{id: $node_id}}) DETACH DELETE node
+MATCH (node {id: $node_id}) DETACH DELETE node
🤖 Prompt for AI Agents
In cognee/infrastructure/databases/graph/memgraph/memgraph_adapter.py around
lines 120 to 124, the Cypher query syntax is incorrect due to double braces and
misuse of the colon before the property map. Replace the query string with
correct Cypher syntax by removing the colon and double braces, using parentheses
with a property map like MATCH (node {id: $node_id}) DETACH DELETE node to
properly match the node by its id property.

82-88: ⚠️ Potential issue

Same parameterised-label issue in bulk node insertion

ON CREATE SET n:node.label repeats the forbidden pattern.

You can UNWIND and then SET n: via CASE or APOC, or run separate UNWIND per label group.
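A hedged sketch of the per-label-group route in Python (the node dicts with "label"/"id"/"properties" keys and the isidentifier() check are assumptions; the query() helper is the one used elsewhere in this adapter):

from collections import defaultdict


async def add_nodes_grouped_by_label(self, nodes: list[dict]) -> None:
    grouped = defaultdict(list)
    for node in nodes:
        grouped[node["label"]].append(node)

    for label, group in grouped.items():
        # Labels cannot be Cypher parameters, so validate before interpolating
        if not label.isidentifier():
            raise ValueError(f"Unsafe label: {label}")
        query = f"""
        UNWIND $nodes AS node
        MERGE (n:`{label}` {{id: node.id}})
        SET n += node.properties, n.updated_at = timestamp()
        """
        await self.query(query, {"nodes": group})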

🤖 Prompt for AI Agents
In cognee/infrastructure/databases/graph/memgraph/memgraph_adapter.py around
lines 82 to 88, the Cypher query uses `ON CREATE SET n:node.label` which
incorrectly tries to parameterize labels. To fix this, avoid setting labels
directly from parameters in the query. Instead, restructure the query to either
use a CASE statement or APOC procedures to conditionally set labels, or group
nodes by label and run separate UNWIND queries per label group to apply labels
correctly without parameterization.

415-418: ⚠️ Potential issue

{id} / {edge_label} will raise NameError in f-string

MATCH (node:`{id}`)<-[r:{edge_label}]-(successor)

id is undefined in Python scope, so the adapter crashes before hitting the DB.

Remove the f prefix and restructure as parameterised Cypher:

-        query = f"""
+        query = """
         UNWIND $node_ids AS nid
         MATCH (node)<-[r]-(successor)
         WHERE node.id = nid AND type(r) = $edge_label
         DELETE r
         """
🤖 Prompt for AI Agents
In cognee/infrastructure/databases/graph/memgraph/memgraph_adapter.py around
lines 415 to 418, the Cypher query uses f-string placeholders `{id}` and
`{edge_label}` which cause a NameError because these variables are not defined
in Python. To fix this, remove the f-string formatting and rewrite the query to
use parameterized Cypher syntax, passing `node_ids` and `edge_label` as
parameters to safely inject values without Python variable interpolation.

430-434: 🛠️ Refactor suggestion

Mutable default argument triggers Ruff B006

serialize_properties(self, properties=dict()) shares a single dict between calls.

-    def serialize_properties(self, properties=dict()):
+    def serialize_properties(self, properties: Optional[dict] = None):
         serialized_properties = {}
-        for property_key, property_value in properties.items():
+        for property_key, property_value in (properties or {}).items():

Avoids surprising state bleed.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    def serialize_properties(self, properties: Optional[dict] = None):
        serialized_properties = {}

        for property_key, property_value in (properties or {}).items():
            if isinstance(property_value, UUID):
                ...
🧰 Tools
🪛 Ruff (0.8.2)

430-430: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)

🤖 Prompt for AI Agents
In cognee/infrastructure/databases/graph/memgraph/memgraph_adapter.py around
lines 430 to 434, the method serialize_properties uses a mutable default
argument properties=dict(), which can cause unexpected shared state between
calls. Change the default value to None and inside the method, initialize
properties to an empty dictionary if it is None to avoid state bleed.

67-72: ⚠️ Potential issue

Cypher does not allow parameterised labels – query will throw

node:$node_label attempts to bind a label from a parameter, but Neo4j/Memgraph disallow this for security & plan-cache reasons.

Replace with string interpolation (with validated input) or use SET node: after the merge:

-        MERGE (node {id: $node_id})
-        ON CREATE SET node:$node_label, node += $properties, node.updated_at = timestamp()
-        ON MATCH SET node:$node_label, node += $properties, node.updated_at = timestamp()
+        MERGE (node {id: $node_id})
+        SET node += $properties,
+            node.updated_at = timestamp()
+        WITH node
+        CALL apoc.create.addLabels(id(node), [$node_label]) YIELD node  // or string-interpolate safely
+        RETURN ID(node) AS internal_id, node.id AS nodeId

Failing to correct this will raise SyntaxError: Parameters cannot be used for labels.

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In cognee/infrastructure/databases/graph/memgraph/memgraph_adapter.py around
lines 67 to 72, the Cypher query incorrectly uses a parameter for a label with
node:$node_label, which is not allowed and causes a syntax error. To fix this,
remove the parameterized label from the MERGE clause and instead apply the label
using a separate SET statement after the MERGE, ensuring the label is inserted
via validated string interpolation to avoid injection risks.

247-257: ⚠️ Potential issue

Relationship tuple indexing fragile

result["r"][1] assumes the second item of the relationship tuple is always the relationship type.
In Neo4j the returned Relationship is an object, not a tuple, so indexing yields TypeError.

Use type(r) in Cypher or result["r"].type:

-            (result["n"]["id"], result["m"]["id"], {"relationship_name": result["r"][1]})
+            (
+                result["n"]["id"],
+                result["m"]["id"],
+                {"relationship_name": result["r"].type}
+            )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    async def get_edges(self, node_id: str):
        query = """
        MATCH (n {id: $node_id})-[r]-(m)
        RETURN n, r, m
        """

        results = await self.query(query, dict(node_id=node_id))

        return [
            (
                result["n"]["id"],
                result["m"]["id"],
                {"relationship_name": result["r"].type}
            )
            for result in results
        ]
🤖 Prompt for AI Agents
In cognee/infrastructure/databases/graph/memgraph/memgraph_adapter.py around
lines 247 to 257, the code incorrectly accesses the relationship type using
result["r"][1], which causes a TypeError because the relationship is an object,
not a tuple. Fix this by replacing result["r"][1] with result["r"].type or by
modifying the Cypher query to return type(r) and accessing that instead,
ensuring the relationship type is retrieved correctly without indexing errors.

401-405: ⚠️ Potential issue

MATCH (node {id: nid}) inside UNWIND is invalid

Inside a UNWIND you must either use WHERE node.id = nid or APOC.
Also the entire string is wrapped in an f-string but no Python variables are interpolated – drop the f to avoid accidental formatting.

-        query = f"""
+        query = """
         UNWIND $node_ids AS nid
-        MATCH (node {id: nid})-[r]->(predecessor)
-        WHERE type(r) = $edge_label
+        MATCH (node)-[r]->(predecessor)
+        WHERE node.id = nid AND type(r) = $edge_label
         DELETE r
         """
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

        query = """
        UNWIND $node_ids AS nid
        MATCH (node)-[r]->(predecessor)
        WHERE node.id = nid AND type(r) = $edge_label
        DELETE r
        """
🤖 Prompt for AI Agents
In cognee/infrastructure/databases/graph/memgraph/memgraph_adapter.py around
lines 401 to 405, the Cypher query uses an invalid pattern MATCH (node {id:
nid}) inside UNWIND. Replace this with MATCH (node) WHERE node.id = nid to fix
the syntax. Also, remove the unnecessary f-string prefix since no Python
variables are interpolated in the query string to prevent accidental formatting.

@Vasilije1990 Vasilije1990 merged commit 5cf14eb into dev May 16, 2025
25 of 28 checks passed
@Vasilije1990 Vasilije1990 deleted the mcp_small_updates branch May 16, 2025 15:54