
Conversation

@dexters1 (Collaborator) commented Dec 12, 2025

Description

  • Add support for database deletion when a dataset is deleted
  • Simplify dataset handler usage in Cognee

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Code refactoring
  • Performance improvement
  • Other (please specify):

Screenshots/Videos (if applicable)

Pre-submission Checklist

  • I have tested my changes thoroughly before submitting this PR
  • This PR contains minimal changes necessary to address the issue/feature
  • My code follows the project's coding standards and style guidelines
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if applicable)
  • All new and existing tests pass
  • I have searched existing PRs to ensure this change hasn't been submitted already
  • I have linked any relevant issues in the description
  • My commits have clear and descriptive messages

DCO Affirmation

I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin.

Summary by CodeRabbit

  • Bug Fixes

    • Improved dataset deletion: stronger authorization checks and reliable removal of associated graph and vector storage.
  • Tests

    • Added end-to-end test to verify complete dataset deletion and cleanup of all related storage components.


@dexters1 dexters1 self-assigned this Dec 12, 2025
@pull-checklist

Please make sure all the checkboxes are checked:

  • I have tested these changes locally.
  • I have reviewed the code changes.
  • I have added end-to-end and unit tests (if applicable).
  • I have updated the documentation and README.md file (if necessary).
  • I have removed unnecessary code and debug statements.
  • PR title is clear and follows the convention.
  • I have tagged reviewers or team members for feedback.

coderabbitai bot commented Dec 12, 2025

Walkthrough

Adds per-dataset database handler utilities and integrates them into dataset deletion flows; introduces an end-to-end GitHub Actions job and test that verifies graph and vector database files are removed when datasets are deleted.

Changes

  • CI & E2E test (.github/workflows/e2e_tests.yml, cognee/tests/test_dataset_delete.py): Adds a new GitHub Actions job, test-dataset-database-deletion, and an integration test that seeds datasets, runs cognify, deletes the datasets, and asserts that graph (.pkl) and vector (.lance.db) DB files are removed; see the sketch after this list.
  • Handler utilities & exports (cognee/infrastructure/databases/utils/get_graph_dataset_database_handler.py, cognee/infrastructure/databases/utils/get_vector_dataset_database_handler.py, cognee/infrastructure/databases/utils/__init__.py): Introduces get_graph_dataset_database_handler and get_vector_dataset_database_handler and re-exports them from the utils package.
  • Deletion/resolution refactor (cognee/modules/data/methods/delete_dataset.py, cognee/modules/data/deletion/prune_system.py, cognee/infrastructure/databases/utils/resolve_dataset_database_connection_info.py): Replaces inlined per-dataset handler logic with calls to the new handler utilities; the delete flow now looks up DatasetDatabase records and invokes per-dataset graph/vector handlers before removing the Dataset.
  • API router change (cognee/api/v1/datasets/routers/get_datasets_router.py): Switches the dataset delete endpoint to an authorization-aware lookup (get_authorized_existing_datasets) and passes the resolved dataset to the deletion routine.
  • Minor formatting changes (cognee/api/v1/cognify/cognify.py, cognee/infrastructure/llm/LLMGateway.py): Cosmetic only: a trailing comma added to the cognify signature and multiline reformatting of the acreate_structured_output call; no behavioral changes.
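
For orientation, the described test flow is roughly the following. This is a minimal sketch, not the PR's actual test; get_datasets, the dataset name, and the on-disk .cognee_system layout are assumptions:

import asyncio
from pathlib import Path

import cognee
from cognee.modules.users.methods import get_default_user
from cognee.modules.data.methods import delete_dataset, get_datasets  # get_datasets assumed


async def main():
    await cognee.add("Sample text for the deletion test.", dataset_name="delete_me")
    await cognee.cognify(["delete_me"])

    databases_dir = Path(".cognee_system") / "databases"  # assumed storage layout
    assert any(databases_dir.rglob("*.pkl")), "graph DB file expected after cognify"
    assert any(databases_dir.rglob("*.lance.db")), "vector DB expected after cognify"

    user = await get_default_user()
    for dataset in await get_datasets(user.id):  # assumed signature
        await delete_dataset(dataset)

    # With the new per-dataset handlers, both stores should now be gone.
    assert not any(databases_dir.rglob("*.pkl"))
    assert not any(databases_dir.rglob("*.lance.db"))


if __name__ == "__main__":
    asyncio.run(main())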

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Verify mapping keys used by the new handler getters match DatasetDatabase fields (graph_dataset_database_handler, vector_dataset_database_handler).
  • Confirm delete_dataset now calls handlers in the correct order and handles missing DatasetDatabase records safely.
  • Check prune_system and resolve_dataset_database_connection_info correctly invoke handler resolvers and propagate results/exceptions.
  • Review the router change to ensure authorization semantics remain correct when using get_authorized_existing_datasets.

Suggested reviewers

  • pazone
  • siillee

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 35.71%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Title check: ✅ Passed. The title accurately describes the main change: adding database deletion functionality when a dataset is deleted, which is the primary objective across multiple modified files.
  • Description check: ✅ Passed. The description follows the required template with a clear human-written summary, type-of-change selection, and DCO affirmation, though most pre-submission checklist items remain unchecked.


@dexters1 dexters1 marked this pull request as ready for review December 15, 2025 13:11

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (2)
cognee/api/v1/datasets/routers/get_datasets_router.py (1)

225-271: Clarify future of delete_data endpoint or remove the TODO

The inline TODO notes that this endpoint may be redundant with other data deletion mechanisms, but the implementation is still fully active.

Either:

  • Decide to keep it and remove/adjust the TODO (possibly documenting how it differs from the /v1/delete endpoint), or
  • Plan for deprecation/removal (e.g., mark as deprecated in docs and route comments).

This will avoid long-lived ambiguity around whether clients should rely on this endpoint.

cognee/modules/data/deletion/prune_system.py (1)

38-42: Same critical type mismatch and missing docstring as prune_graph_databases.

This function has the same issues identified in prune_graph_databases:

  1. Critical: Type mismatch between dict-like Row mappings and DatasetDatabase model instances
  2. Major: Missing docstring
  3. Minor: No validation of handler dictionary structure

Apply the same fixes as suggested for prune_graph_databases:

 async def prune_vector_databases():
+    """
+    Delete all vector databases for all dataset_database records.
+    Used when backend_access_control is enabled to prune vector databases per-dataset.
+    """
     db_engine = get_relational_engine()
     try:
-        dataset_databases = await db_engine.get_all_data_from_table("dataset_database")
+        dataset_database_rows = await db_engine.get_all_data_from_table("dataset_database")
         # Go through each dataset database and delete the vector database
-        for dataset_database in dataset_databases:
+        for row in dataset_database_rows:
+            dataset_database = DatasetDatabase(**row)
             handler = get_vector_dataset_database_handler(dataset_database)
+            if "handler_instance" not in handler:
+                logger.warning("Handler missing 'handler_instance' key for dataset_database")
+                continue
             await handler["handler_instance"].delete_dataset(dataset_database)
🧹 Nitpick comments (6)
cognee/modules/data/methods/delete_dataset.py (1)

1-33: Dataset deletion flow via graph/vector handlers looks sound; consider docstring and optional defensive handler lookup

The sequence (fetch DatasetDatabase, invoke graph/vector handlers, then delete the dataset entity) looks correct and matches the new handler utilities.

To improve maintainability and debuggability:

  • Add a short docstring documenting that this also deletes associated graph/vector databases, not just the relational dataset row.
  • Optionally guard against misconfiguration of handler names (KeyError from the handler mapping) so failures surface with a clearer error message.

For example, adding a docstring:

 async def delete_dataset(dataset: Dataset):
-    db_engine = get_relational_engine()
+    """Delete a dataset and its associated graph and vector databases."""
+
+    db_engine = get_relational_engine()
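
Putting both suggestions together, the overall shape might look like the following sketch; get_dataset_database and the final relational delete call are hypothetical stand-ins, not the PR's actual code:

async def delete_dataset(dataset: Dataset):
    """Delete a dataset and its associated graph and vector databases."""
    db_engine = get_relational_engine()

    # Hypothetical lookup for the dataset's DatasetDatabase record.
    dataset_database = await get_dataset_database(dataset.id)

    if dataset_database is not None:
        for get_handler in (
            get_graph_dataset_database_handler,
            get_vector_dataset_database_handler,
        ):
            try:
                handler = get_handler(dataset_database)
            except KeyError as error:
                raise ValueError(
                    f"Misconfigured dataset database handler for dataset {dataset.id}"
                ) from error
            await handler["handler_instance"].delete_dataset(dataset_database)

    # Finally remove the relational Dataset row (exact call is a stand-in).
    await db_engine.delete_entity(dataset)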
cognee/api/v1/datasets/routers/get_datasets_router.py (1)

211-218: Avoid shadowing the router delete_dataset function with the imported method

Importing delete_dataset inside the delete_dataset endpoint shadows the endpoint’s own name, which is confusing when reading or debugging.

Consider importing under an alias (or importing at module level) to make intent clearer:

-        from cognee.modules.data.methods import delete_dataset
+        from cognee.modules.data.methods import delete_dataset as delete_dataset_method
@@
-        await delete_dataset(dataset[0])
+        await delete_dataset_method(dataset[0])

The authorization pattern via get_authorized_existing_datasets looks good.

cognee/infrastructure/databases/utils/get_graph_dataset_database_handler.py (1)

1-10: Graph handler lookup is straightforward; consider docstring and clearer error on unknown handler

The direct lookup into supported_dataset_database_handlers is fine assuming configuration is always valid.

Two small improvements:

  • Add a one-line docstring clarifying that this picks the graph handler based on dataset_database.graph_dataset_database_handler.
  • Optionally wrap the dict access to raise a more descriptive error if an unsupported handler name is stored in the DB (instead of a bare KeyError).

Example:

 def get_graph_dataset_database_handler(dataset_database: DatasetDatabase) -> dict:
-    from cognee.infrastructure.databases.dataset_database_handler.supported_dataset_database_handlers import (
-        supported_dataset_database_handlers,
-    )
-
-    handler = supported_dataset_database_handlers[dataset_database.graph_dataset_database_handler]
-    return handler
+    """Return the graph dataset database handler config for the given dataset."""
+    from cognee.infrastructure.databases.dataset_database_handler.supported_dataset_database_handlers import (
+        supported_dataset_database_handlers,
+    )
+
+    try:
+        return supported_dataset_database_handlers[
+            dataset_database.graph_dataset_database_handler
+        ]
+    except KeyError as exc:
+        raise ValueError(
+            f"Unsupported graph dataset database handler: "
+            f"{dataset_database.graph_dataset_database_handler!r}"
+        ) from exc
cognee/tests/test_dataset_delete.py (1)

1-76: E2E dataset deletion test is valuable; consider simplifying event-loop setup

The test does a good end-to-end check: build datasets, verify graph/vector DB files exist, delete, then assert both are gone. This directly exercises the new handler-based deletion.

If you don’t rely on custom loop semantics here, you could simplify the __main__ block with asyncio.run(main()) rather than manually creating and managing the loop, e.g.:

-if __name__ == "__main__":
-    logger = setup_logging(log_level=ERROR)
-    loop = asyncio.new_event_loop()
-    asyncio.set_event_loop(loop)
-    try:
-        loop.run_until_complete(main())
-    finally:
-        loop.run_until_complete(loop.shutdown_asyncgens())
+if __name__ == "__main__":
+    logger = setup_logging(log_level=ERROR)
+    asyncio.run(main())

The current approach still works, so this is mostly about readability and reducing boilerplate.

cognee/infrastructure/databases/utils/resolve_dataset_database_connection_info.py (1)

1-29: Refactor to delegate connection resolution to handlers looks correct

Using get_vector_dataset_database_handler and get_graph_dataset_database_handler and then calling each handler’s resolve_dataset_connection_info sequentially aligns with the function’s purpose and centralizes handler logic.

If you later need better diagnostics for misconfigured handler names, the same defensive pattern suggested for the handler getter utilities can be applied there, but the current flow is coherent with the new design.
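
In outline, the delegation described above amounts to something like this sketch; that resolve_dataset_connection_info hangs off handler["handler_instance"] mirrors the delete path and is an assumption:

from cognee.infrastructure.databases.utils import (
    get_graph_dataset_database_handler,
    get_vector_dataset_database_handler,
)


async def resolve_dataset_database_connection_info(dataset_database):
    """Resolve per-dataset connection info by delegating to both handlers in turn."""
    for get_handler in (
        get_vector_dataset_database_handler,
        get_graph_dataset_database_handler,
    ):
        handler = get_handler(dataset_database)
        # Each handler resolves its own portion of the connection info.
        await handler["handler_instance"].resolve_dataset_connection_info(dataset_database)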

cognee/infrastructure/databases/utils/get_vector_dataset_database_handler.py (1)

1-10: Vector handler lookup mirrors graph helper; same docstring/error-handling suggestions apply

The implementation correctly mirrors the graph handler utility, using vector_dataset_database_handler as the key.

For consistency and clarity, consider adding:

  • A brief docstring explaining that this returns the vector database handler config for a dataset.
  • Optional error handling to surface a clear message if an unsupported handler name is stored.

Pattern would be analogous to the graph helper refactor suggested earlier.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 127d986 and 031cc46.

📒 Files selected for processing (9)
  • .github/workflows/e2e_tests.yml (1 hunks)
  • cognee/api/v1/datasets/routers/get_datasets_router.py (1 hunks)
  • cognee/infrastructure/databases/utils/__init__.py (1 hunks)
  • cognee/infrastructure/databases/utils/get_graph_dataset_database_handler.py (1 hunks)
  • cognee/infrastructure/databases/utils/get_vector_dataset_database_handler.py (1 hunks)
  • cognee/infrastructure/databases/utils/resolve_dataset_database_connection_info.py (2 hunks)
  • cognee/modules/data/deletion/prune_system.py (3 hunks)
  • cognee/modules/data/methods/delete_dataset.py (1 hunks)
  • cognee/tests/test_dataset_delete.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (7)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Use 4-space indentation in Python code
Use snake_case for Python module and function names
Use PascalCase for Python class names
Use ruff format before committing Python code
Use ruff check for import hygiene and style enforcement with line-length 100 configured in pyproject.toml
Prefer explicit, structured error handling in Python code

Files:

  • cognee/modules/data/deletion/prune_system.py
  • cognee/infrastructure/databases/utils/get_graph_dataset_database_handler.py
  • cognee/infrastructure/databases/utils/resolve_dataset_database_connection_info.py
  • cognee/modules/data/methods/delete_dataset.py
  • cognee/infrastructure/databases/utils/__init__.py
  • cognee/infrastructure/databases/utils/get_vector_dataset_database_handler.py
  • cognee/tests/test_dataset_delete.py
  • cognee/api/v1/datasets/routers/get_datasets_router.py

⚙️ CodeRabbit configuration file

**/*.py: When reviewing Python code for this project:

  1. Prioritize portability over clarity, especially when dealing with cross-Python compatibility. However, with the priority in mind, do still consider improvements to clarity when relevant.
  2. As a general guideline, consider the code style advocated in the PEP 8 standard (excluding the use of spaces for indentation) and evaluate suggested changes for code style compliance.
  3. As a style convention, consider the code style advocated in CEP-8 and evaluate suggested changes for code style compliance.
  4. As a general guideline, try to provide any relevant, official, and supporting documentation links to any tool's suggestions in review comments. This guideline is important for posterity.
  5. As a general rule, undocumented function definitions and class definitions in the project's Python code are assumed incomplete. Please consider suggesting a short summary of the code for any of these incomplete definitions as docstrings when reviewing.

Files:

  • cognee/modules/data/deletion/prune_system.py
  • cognee/infrastructure/databases/utils/get_graph_dataset_database_handler.py
  • cognee/infrastructure/databases/utils/resolve_dataset_database_connection_info.py
  • cognee/modules/data/methods/delete_dataset.py
  • cognee/infrastructure/databases/utils/__init__.py
  • cognee/infrastructure/databases/utils/get_vector_dataset_database_handler.py
  • cognee/tests/test_dataset_delete.py
  • cognee/api/v1/datasets/routers/get_datasets_router.py
cognee/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Use shared logging utilities from cognee.shared.logging_utils in Python code

Files:

  • cognee/modules/data/deletion/prune_system.py
  • cognee/infrastructure/databases/utils/get_graph_dataset_database_handler.py
  • cognee/infrastructure/databases/utils/resolve_dataset_database_connection_info.py
  • cognee/modules/data/methods/delete_dataset.py
  • cognee/infrastructure/databases/utils/__init__.py
  • cognee/infrastructure/databases/utils/get_vector_dataset_database_handler.py
  • cognee/tests/test_dataset_delete.py
  • cognee/api/v1/datasets/routers/get_datasets_router.py
cognee/{modules,infrastructure,tasks}/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Co-locate feature-specific helpers under their respective package (modules/, infrastructure/, or tasks/)

Files:

  • cognee/modules/data/deletion/prune_system.py
  • cognee/infrastructure/databases/utils/get_graph_dataset_database_handler.py
  • cognee/infrastructure/databases/utils/resolve_dataset_database_connection_info.py
  • cognee/modules/data/methods/delete_dataset.py
  • cognee/infrastructure/databases/utils/__init__.py
  • cognee/infrastructure/databases/utils/get_vector_dataset_database_handler.py
.github/**

⚙️ CodeRabbit configuration file

.github/**: * When the project is hosted on GitHub: All GitHub-specific configurations, templates, and tools should be found in the '.github' directory tree.

  • 'actionlint' erroneously generates false positives when dealing with GitHub's ${{ ... }} syntax in conditionals.
  • 'actionlint' erroneously generates incorrect solutions when suggesting the removal of valid ${{ ... }} syntax.

Files:

  • .github/workflows/e2e_tests.yml
cognee/tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

cognee/tests/**/*.py: Place Python tests under cognee/tests/ organized by type (unit, integration, cli_tests)
Name Python test files test_*.py and use pytest.mark.asyncio for async tests

Files:

  • cognee/tests/test_dataset_delete.py
cognee/tests/*

⚙️ CodeRabbit configuration file

cognee/tests/*: When reviewing test code:

  1. Prioritize portability over clarity, especially when dealing with cross-Python compatibility. However, with the priority in mind, do still consider improvements to clarity when relevant.
  2. As a general guideline, consider the code style advocated in the PEP 8 standard (excluding the use of spaces for indentation) and evaluate suggested changes for code style compliance.
  3. As a style convention, consider the code style advocated in CEP-8 and evaluate suggested changes for code style compliance, pointing out any violations discovered.
  4. As a general guideline, try to provide any relevant, official, and supporting documentation links to any tool's suggestions in review comments. This guideline is important for posterity.
  5. As a project rule, Python source files with names prefixed by the string "test_" and located in the project's "tests" directory are the project's unit-testing code. It is safe, albeit a heuristic, to assume these are considered part of the project's minimal acceptance testing unless a justifying exception to this assumption is documented.
  6. As a project rule, any files without extensions and with names prefixed by either the string "check_" or the string "test_", and located in the project's "tests" directory, are the project's non-unit test code. "Non-unit test" in this context refers to any type of testing other than unit testing, such as (but not limited to) functional testing, style linting, regression testing, etc. It can also be assumed that non-unit testing code is usually written as Bash shell scripts.

Files:

  • cognee/tests/test_dataset_delete.py
cognee/api/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Public APIs should be type-annotated in Python where practical

Files:

  • cognee/api/v1/datasets/routers/get_datasets_router.py
🧠 Learnings (2)
📚 Learning: 2025-10-11T04:18:24.594Z
Learnt from: Vattikuti-Manideep-Sitaram
Repo: topoteretes/cognee PR: 1529
File: cognee/api/v1/cognify/ontology_graph_pipeline.py:69-74
Timestamp: 2025-10-11T04:18:24.594Z
Learning: The code_graph_pipeline.py and ontology_graph_pipeline.py both follow an established pattern of calling cognee.prune.prune_data() and cognee.prune.prune_system(metadata=True) at the start of pipeline execution. This appears to be intentional behavior for pipeline operations in the cognee codebase.

Applied to files:

  • cognee/modules/data/deletion/prune_system.py
📚 Learning: 2024-11-18T12:54:36.758Z
Learnt from: 0xideas
Repo: topoteretes/cognee PR: 233
File: .github/workflows/test_cognee_llama_index_notebook.yml:0-0
Timestamp: 2024-11-18T12:54:36.758Z
Learning: In the `.github/workflows/test_cognee_llama_index_notebook.yml` workflow, it's acceptable to remove the `--all-extras` flag from `poetry install` to reduce costs by not installing unnecessary dependencies.

Applied to files:

  • .github/workflows/e2e_tests.yml
🧬 Code graph analysis (7)
cognee/modules/data/deletion/prune_system.py (7)
cognee/infrastructure/databases/utils/get_graph_dataset_database_handler.py (1)
  • get_graph_dataset_database_handler (4-10)
cognee/infrastructure/databases/utils/get_vector_dataset_database_handler.py (1)
  • get_vector_dataset_database_handler (4-10)
cognee/infrastructure/databases/relational/sqlalchemy/SqlAlchemyAdapter.py (1)
  • get_all_data_from_table (417-450)
cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py (1)
  • delete_dataset (58-81)
cognee/infrastructure/databases/vector/lancedb/LanceDBDatasetDatabaseHandler.py (1)
  • delete_dataset (43-50)
cognee/infrastructure/databases/graph/neo4j_driver/Neo4jAuraDevDatasetDatabaseHandler.py (1)
  • delete_dataset (167-168)
cognee/infrastructure/databases/dataset_database_handler/dataset_database_handler_interface.py (1)
  • delete_dataset (71-80)
cognee/infrastructure/databases/utils/get_graph_dataset_database_handler.py (1)
cognee/modules/users/models/DatasetDatabase.py (1)
  • DatasetDatabase (7-40)
cognee/modules/data/methods/delete_dataset.py (4)
cognee/modules/users/models/DatasetDatabase.py (1)
  • DatasetDatabase (7-40)
cognee/infrastructure/databases/utils/get_vector_dataset_database_handler.py (1)
  • get_vector_dataset_database_handler (4-10)
cognee/infrastructure/databases/utils/get_graph_dataset_database_handler.py (1)
  • get_graph_dataset_database_handler (4-10)
cognee/api/v1/datasets/routers/get_datasets_router.py (1)
  • delete_dataset (184-218)
cognee/infrastructure/databases/utils/__init__.py (2)
cognee/infrastructure/databases/utils/get_graph_dataset_database_handler.py (1)
  • get_graph_dataset_database_handler (4-10)
cognee/infrastructure/databases/utils/get_vector_dataset_database_handler.py (1)
  • get_vector_dataset_database_handler (4-10)
cognee/infrastructure/databases/utils/get_vector_dataset_database_handler.py (1)
cognee/modules/users/models/DatasetDatabase.py (1)
  • DatasetDatabase (7-40)
cognee/tests/test_dataset_delete.py (3)
cognee/shared/logging_utils.py (1)
  • setup_logging (288-561)
cognee/modules/users/methods/get_default_user.py (1)
  • get_default_user (13-38)
cognee/api/v1/config/config.py (2)
  • data_root_directory (36-38)
  • system_root_directory (18-33)
cognee/api/v1/datasets/routers/get_datasets_router.py (2)
cognee/modules/data/methods/delete_dataset.py (1)
  • delete_dataset (14-34)
cognee/api/v1/delete/routers/get_delete_router.py (1)
  • delete (18-60)
🔇 Additional comments (3)
.github/workflows/e2e_tests.yml (1)

240-263: New dataset-deletion E2E job is consistent with existing workflow structure

The test-dataset-database-deletion job reuses the standard Cognee setup, secrets, and environment pattern and correctly points to cognee/tests/test_dataset_delete.py. This should integrate cleanly with the existing reusable integration tests.

cognee/infrastructure/databases/utils/__init__.py (1)

3-4: Re-exporting handler getters from the utils package looks appropriate

Re-exporting get_graph_dataset_database_handler and get_vector_dataset_database_handler here matches the existing pattern and simplifies callers’ imports.

cognee/modules/data/deletion/prune_system.py (1)

8-11: LGTM!

The new utility imports are correctly structured and align with the refactored deletion logic.

logger = get_logger()


async def prune_graph_databases():

⚠️ Potential issue | 🟠 Major

Add docstring to document function purpose.

Per coding guidelines, undocumented function definitions are considered incomplete. This function should have a docstring explaining that it deletes all graph databases for all dataset_database records when backend access control is enabled.

Apply this diff:

 async def prune_graph_databases():
+    """
+    Delete all graph databases for all dataset_database records.
+    Used when backend_access_control is enabled to prune graph databases per-dataset.
+    """
     db_engine = get_relational_engine()

As per coding guidelines, undocumented functions are assumed incomplete.

🤖 Prompt for AI Agents
In cognee/modules/data/deletion/prune_system.py around line 19, the function
prune_graph_databases lacks a docstring; add a concise docstring immediately
below the async def that states the function's purpose: that it deletes all
graph databases for all dataset_database records when backend access control is
enabled, describes input (none) and behavior (side-effect deletion) and notes
any exceptions or important preconditions (e.g., requires backend access control
enabled). Ensure the docstring follows project style (single- or triple-quoted
summary) and is brief and descriptive.

Comment on lines +22 to +26
        dataset_databases = await db_engine.get_all_data_from_table("dataset_database")
        # Go through each dataset database and delete the graph database
-        for data_item in data:
-            await _prune_graph_db(data_item)
+        for dataset_database in dataset_databases:
+            handler = get_graph_dataset_database_handler(dataset_database)
+            await handler["handler_instance"].delete_dataset(dataset_database)

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

cat -n cognee/modules/data/deletion/prune_system.py

Repository: topoteretes/cognee

Length of output: 3544


🏁 Script executed:

# Find the db_engine.get_all_data_from_table implementation
find . -type f -name "*.py" | xargs grep -l "def get_all_data_from_table" | head -5

Repository: topoteretes/cognee

Length of output: 138


🏁 Script executed:

# Find get_graph_dataset_database_handler function
rg -n "def get_graph_dataset_database_handler" --type=py

Repository: topoteretes/cognee

Length of output: 222


🏁 Script executed:

# Find DatasetDatabase class definition
rg -n "class DatasetDatabase" --type=py

Repository: topoteretes/cognee

Length of output: 280


🏁 Script executed:

cat -n cognee/infrastructure/databases/relational/sqlalchemy/SqlAlchemyAdapter.py | grep -A 20 "def get_all_data_from_table"

Repository: topoteretes/cognee

Length of output: 1057


🏁 Script executed:

cat -n cognee/infrastructure/databases/utils/get_graph_dataset_database_handler.py

Repository: topoteretes/cognee

Length of output: 569


🏁 Script executed:

cat -n cognee/modules/users/models/DatasetDatabase.py

Repository: topoteretes/cognee

Length of output: 2140


🏁 Script executed:

cat -n cognee/infrastructure/databases/relational/sqlalchemy/SqlAlchemyAdapter.py | sed -n '417,460p'

Repository: topoteretes/cognee

Length of output: 1885


🏁 Script executed:

# Search for complete implementation of get_all_data_from_table
rg -A 30 "async def get_all_data_from_table" cognee/infrastructure/databases/relational/sqlalchemy/SqlAlchemyAdapter.py

Repository: topoteretes/cognee

Length of output: 1234


🏁 Script executed:

# Check how get_all_data_from_table is used elsewhere in the codebase to see if there are other examples
rg -B 2 -A 5 "get_all_data_from_table\(" --type=py | head -50

Repository: topoteretes/cognee

Length of output: 3949


🏁 Script executed:

# Check SQLAlchemy Row behavior - does mappings() support attribute access?
python3 << 'EOF'
# Check SQLAlchemy documentation for Row behavior
import sys
try:
    from sqlalchemy import Row
    # Check if Row supports attribute access
    print("SQLAlchemy Row class:", Row)
    print("Row supports attribute access via __getattr__")
    help(Row)
except Exception as e:
    print(f"Error: {e}")
EOF

Repository: topoteretes/cognee

Length of output: 97


🏁 Script executed:

# Let's check if there are any similar patterns in the codebase where Row mappings are used with attribute access
rg -B 3 -A 3 "\.mappings\(\)" --type=py cognee/infrastructure/databases/relational/sqlalchemy/

Repository: topoteretes/cognee

Length of output: 802


🏁 Script executed:

# Check if DatasetDatabase is instantiated directly elsewhere or if Row objects are used directly
rg -B 2 -A 2 "DatasetDatabase\(" --type=py cognee/ | grep -v test | head -40

Repository: topoteretes/cognee

Length of output: 942


🏁 Script executed:

# Check SQLAlchemy version in requirements or setup files
find . -name "requirements*.txt" -o -name "pyproject.toml" -o -name "setup.py" | xargs grep -i sqlalchemy

Repository: topoteretes/cognee

Length of output: 230


🏁 Script executed:

# Check if there are other usages where get_all_data_from_table results are used with attribute access
rg -B 2 -A 5 "get_all_data_from_table" --type=py cognee/ | grep -A 5 "for.*in.*get_all_data_from_table"

Repository: topoteretes/cognee

Length of output: 44


🏁 Script executed:

# Look for prune_vector_databases which has same pattern as prune_graph_databases
cat -n cognee/modules/data/deletion/prune_system.py | sed -n '35,48p'

Repository: topoteretes/cognee

Length of output: 825


🏁 Script executed:

# Check SQLAlchemy Row attribute access support in 2.0+
# Search for any existing uses of Row attribute access in the codebase
rg "result\[" cognee/infrastructure/databases/relational/sqlalchemy/SqlAlchemyAdapter.py | head -10

Repository: topoteretes/cognee

Length of output: 140


🏁 Script executed:

# Look for any test that verifies the attribute access on Row objects from get_all_data_from_table
rg -B 2 -A 2 "result\[.*\]" cognee/tests/test_deduplication.py | head -20

Repository: topoteretes/cognee

Length of output: 1194


🏁 Script executed:

# Check if get_vector_dataset_database_handler also uses attribute access like get_graph_dataset_database_handler
cat -n cognee/infrastructure/databases/utils/get_vector_dataset_database_handler.py

Repository: topoteretes/cognee

Length of output: 571


🏁 Script executed:

# Check if there's a typing issue - look at the actual parameter types expected by the handlers
rg -B 5 -A 10 "def get_graph_dataset_database_handler" cognee/infrastructure/databases/utils/get_graph_dataset_database_handler.py

Repository: topoteretes/cognee

Length of output: 499


🏁 Script executed:

# Check if Row objects from SQLAlchemy 2.0 are being used with attribute access anywhere in the codebase
rg "for .* in .*get_all_data_from_table" -A 3 --type=py cognee/ | grep -v test

Repository: topoteretes/cognee

Length of output: 44


🏁 Script executed:

# Check if there are any Row to Model conversions happening anywhere
rg "Row\|mappings" --type=py cognee/infrastructure/databases/relational/sqlalchemy/

Repository: topoteretes/cognee

Length of output: 44


🏁 Script executed:

# Let me verify SQLAlchemy 2.0 Row behavior by checking pyproject.toml more carefully
cat -n pyproject.toml | grep -A 5 -B 5 "sqlalchemy"

Repository: topoteretes/cognee

Length of output: 1354


🏁 Script executed:

# Check if Row objects truly support both dict and attribute access in actual usage
# Look for any error handling or type: ignore comments related to this
rg "type: ignore|# noqa" cognee/modules/data/deletion/prune_system.py

Repository: topoteretes/cognee

Length of output: 44


🏁 Script executed:

# Search for any other functions that receive Row objects and use attribute access
rg -B 3 "\.graph_dataset_database_handler|\.vector_dataset_database_handler" --type=py cognee/

Repository: topoteretes/cognee

Length of output: 3574


🏁 Script executed:

# Verify SQLAlchemy 2.0+ Row attribute access support by checking actual Row class behavior
# Look for any issues or type annotations in the codebase related to this
rg "RoomRowMapping|LegacyRow" --type=py cognee/

Repository: topoteretes/cognee

Length of output: 44


🏁 Script executed:

# Check if there are type checking issues - look for mypy or pyright configurations
cat pyproject.toml | grep -A 10 "\[tool.mypy\]"

Repository: topoteretes/cognee

Length of output: 44


🏁 Script executed:

# Verify the actual usage pattern - if the code works, Row objects must support attribute access
# Check if there are any integration tests that use prune_system
rg "prune_graph_databases\|prune_vector_databases" --type=py cognee/tests/

Repository: topoteretes/cognee

Length of output: 44


🏁 Script executed:

# Final verification: Check SQLAlchemy Row documentation behavior in 2.0
# by looking at actual attribute access patterns in the codebase
rg "for .* in .*await.*\." -A 3 --type=py cognee/modules/data/ | head -30

Repository: topoteretes/cognee

Length of output: 44


🏁 Script executed:

# Check if Row objects from mappings() actually support attribute access
# by looking for any similar patterns where Row attributes are accessed
rg "dataset_databases\[.*\]\." --type=py cognee/

Repository: topoteretes/cognee

Length of output: 44


🏁 Script executed:

# Verify by checking actual usage in context_global_variables where similar access happens
cat -n cognee/context_global_variables.py | sed -n '1,80p'

Repository: topoteretes/cognee

Length of output: 5277


🏁 Script executed:

# Final check: SQLAlchemy 2.0+ Row objects support both dict and attribute access
# Let's verify this by checking the actual Row implementation usage
rg "from sqlalchemy" cognee/infrastructure/databases/relational/sqlalchemy/SqlAlchemyAdapter.py | head -10

Repository: topoteretes/cognee

Length of output: 309


🏁 Script executed:

# Check if there's any Row wrapping or conversion happening
rg "class.*Row|def.*mappings" --type=py cognee/infrastructure/databases/

Repository: topoteretes/cognee

Length of output: 44


🏁 Script executed:

# Verify the complete picture - check what happens in a similar function that works
cat -n cognee/infrastructure/databases/utils/get_or_create_dataset_database.py | head -50

Repository: topoteretes/cognee

Length of output: 2232


🏁 Script executed:

# Let's verify SQLAlchemy Row attribute access by examining the actual imports and checking documentation
# SQLAlchemy 2.0+ Row objects support both dict and attribute access by default
python3 << 'EOF'
# Check SQLAlchemy 2.0 Row behavior documentation
# According to SQLAlchemy 2.0 docs, Row objects returned from mappings() support attribute access
# This is a documented feature of SQLAlchemy 2.0+

# The Row class in SQLAlchemy 2.0+ implements both:
# - __getitem__ for dict-like access: row["column_name"]
# - __getattr__ for attribute access: row.column_name

print("SQLAlchemy 2.0+ Row objects support BOTH dict-like and attribute access:")
print("- row['column_name'] works")
print("- row.column_name works")
print("\nThis is documented behavior in SQLAlchemy 2.0+")
print("See: https://docs.sqlalchemy.org/en/20/orm/queryguide.html#rows-are-namedtuples")
EOF

Repository: topoteretes/cognee

Length of output: 313


🏁 Script executed:

# Verify that the code actually works with Row attribute access by checking for any actual runtime errors
# or exception handling around this pattern
rg "AttributeError|except.*Error" cognee/modules/data/deletion/prune_system.py

Repository: topoteretes/cognee

Length of output: 250


Add docstring to prune_graph_databases function.

The function lacks documentation as required by project guidelines (code style guideline: "undocumented function definitions in the project's Python code are assumed incomplete").

🤖 Prompt for AI Agents
cognee/modules/data/deletion/prune_system.py lines 22-26: The function
prune_graph_databases is missing a docstring; add a concise triple-quoted
docstring immediately below the def line that states the function purpose
(prune/delete graph DBs for all dataset_database records), describes parameters
(if any, e.g., db_engine) and their types, notes return value (None) and side
effects (calls handler.delete_dataset for each dataset_database), and documents
possible exceptions or errors that may be raised; keep it short (2–4 sentences)
and follow project docstring style.

Comment on lines +25 to +26
            handler = get_graph_dataset_database_handler(dataset_database)
            await handler["handler_instance"].delete_dataset(dataset_database)

⚠️ Potential issue | 🟡 Minor

Consider validating handler dictionary structure.

The code accesses handler["handler_instance"] without verifying the key exists, which could raise a KeyError if the handler structure changes.

Consider adding defensive validation:

             handler = get_graph_dataset_database_handler(dataset_database)
+            if "handler_instance" not in handler:
+                logger.warning("Handler missing 'handler_instance' key for dataset_database")
+                continue
             await handler["handler_instance"].delete_dataset(dataset_database)

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In cognee/modules/data/deletion/prune_system.py around lines 25-26, the code
assumes handler["handler_instance"] exists which can raise KeyError; validate
that handler is a dict and contains the "handler_instance" key (and that its
value is not None and exposes delete_dataset) before calling await
handler["handler_instance"].delete_dataset(dataset_database); if validation
fails, log or raise a clear error with context (including dataset_database) or
skip gracefully to avoid unhandled exceptions.

@borisarzentar borisarzentar changed the title Feat: Add database deletion on dataset delete feat: Add database deletion on dataset delete Dec 15, 2025
@dexters1 dexters1 force-pushed the improve-dataset-delete branch from ca746ea to 031cc46 Compare December 15, 2025 15:26

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

♻️ Duplicate comments (1)
cognee-mcp/src/server.py (1)

861-867: Migration invocation will block the event loop.

The code awaits run_migrations(), but the implementation in cognee/run_migrations.py uses blocking subprocess.run(). This will freeze the MCP server during startup.

This issue is a consequence of the problems flagged in cognee/run_migrations.py. Once those are fixed (either by making the function synchronous and removing await here, or by making the subprocess call async), this invocation will work correctly.

The removal of custom error handling (UserAlreadyExists, etc.) is appropriate since errors will now be raised as exceptions from run_migrations().
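
Besides the two options above (make the function synchronous, or switch to the async subprocess API), a third option is to keep the awaitable signature and push the blocking call onto a worker thread. A sketch, not from the PR; assumes Python 3.9+ for asyncio.to_thread:

import asyncio
import subprocess
from pathlib import Path


async def run_migrations_nonblocking(package_root: Path) -> None:
    """Run `alembic upgrade head` in a worker thread so the event loop stays responsive."""
    result = await asyncio.to_thread(
        subprocess.run,
        ["alembic", "upgrade", "head"],
        capture_output=True,
        text=True,
        cwd=package_root,
    )
    if result.returncode != 0:
        raise RuntimeError(f"Migration failed: {result.stderr}{result.stdout}")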

🧹 Nitpick comments (1)
cognee/run_migrations.py (1)

7-10: Clarify package structure assumptions in comments.

The comment on lines 7-8 mentions that this is a "placeholder for the path logic," which suggests temporary or incomplete implementation. The constants reference "alembic" as the migrations directory.

Consider expanding the comment to document the expected package structure:

-# Assuming your package is named 'cognee' and the migrations are under 'cognee/migrations'
-# This is a placeholder for the path logic.
+# Expected package structure:
+# cognee/
+#   alembic.ini
+#   alembic/
+#     versions/
 MIGRATIONS_PACKAGE = "cognee"
 MIGRATIONS_DIR_NAME = "alembic"
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 031cc46 and ca746ea.

⛔ Files ignored due to path filters (1)
  • cognee-mcp/uv.lock is excluded by !**/*.lock
📒 Files selected for processing (7)
  • alembic/README (0 hunks)
  • cognee-mcp/pyproject.toml (1 hunks)
  • cognee-mcp/src/server.py (1 hunks)
  • cognee/__init__.py (1 hunks)
  • cognee/alembic/README (1 hunks)
  • cognee/alembic/env.py (1 hunks)
  • cognee/run_migrations.py (1 hunks)
💤 Files with no reviewable changes (1)
  • alembic/README
✅ Files skipped from review due to trivial changes (1)
  • cognee/alembic/README
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Use 4-space indentation in Python code
Use snake_case for Python module and function names
Use PascalCase for Python class names
Use ruff format before committing Python code
Use ruff check for import hygiene and style enforcement with line-length 100 configured in pyproject.toml
Prefer explicit, structured error handling in Python code

Files:

  • cognee/alembic/env.py
  • cognee-mcp/src/server.py
  • cognee/run_migrations.py
  • cognee/__init__.py

⚙️ CodeRabbit configuration file

**/*.py: When reviewing Python code for this project:

  1. Prioritize portability over clarity, especially when dealing with cross-Python compatibility. However, with the priority in mind, do still consider improvements to clarity when relevant.
  2. As a general guideline, consider the code style advocated in the PEP 8 standard (excluding the use of spaces for indentation) and evaluate suggested changes for code style compliance.
  3. As a style convention, consider the code style advocated in CEP-8 and evaluate suggested changes for code style compliance.
  4. As a general guideline, try to provide any relevant, official, and supporting documentation links to any tool's suggestions in review comments. This guideline is important for posterity.
  5. As a general rule, undocumented function definitions and class definitions in the project's Python code are assumed incomplete. Please consider suggesting a short summary of the code for any of these incomplete definitions as docstrings when reviewing.

Files:

  • cognee/alembic/env.py
  • cognee-mcp/src/server.py
  • cognee/run_migrations.py
  • cognee/__init__.py
cognee/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Use shared logging utilities from cognee.shared.logging_utils in Python code

Files:

  • cognee/alembic/env.py
  • cognee/run_migrations.py
  • cognee/__init__.py
🧬 Code graph analysis (2)
cognee-mcp/src/server.py (2)
cognee/shared/logging_utils.py (1)
  • info (205-205)
cognee/run_migrations.py (1)
  • run_migrations (13-46)
cognee/__init__.py (1)
cognee/run_migrations.py (1)
  • run_migrations (13-46)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (22)
  • GitHub Check: End-to-End Tests / Test using different async databases in parallel in Cognee
  • GitHub Check: End-to-End Tests / Test Feedback Enrichment
  • GitHub Check: End-to-End Tests / Run Telemetry Pipeline Test
  • GitHub Check: End-to-End Tests / Test Entity Extraction
  • GitHub Check: End-to-End Tests / Test multi tenancy with different situations in Cognee
  • GitHub Check: End-to-End Tests / Concurrent Subprocess access test
  • GitHub Check: End-to-End Tests / Conversation sessions test (Redis)
  • GitHub Check: End-to-End Tests / Test graph edge ingestion
  • GitHub Check: End-to-End Tests / S3 Bucket Test
  • GitHub Check: End-to-End Tests / Test Cognify - Edge Centered Payload
  • GitHub Check: End-to-End Tests / Conversation sessions test (FS)
  • GitHub Check: End-to-End Tests / Deduplication Test
  • GitHub Check: Basic Tests / Run Simple Examples
  • GitHub Check: End-to-End Tests / Test dataset database handlers in Cognee
  • GitHub Check: End-to-End Tests / Server Start Test
  • GitHub Check: End-to-End Tests / Test dataset database deletion in Cognee
  • GitHub Check: End-to-End Tests / Test permissions with different situations in Cognee
  • GitHub Check: Basic Tests / Run Simple Examples BAML
  • GitHub Check: Basic Tests / Run Unit Tests
  • GitHub Check: Basic Tests / Run Integration Tests
  • GitHub Check: CLI Tests / CLI Integration Tests
  • GitHub Check: CLI Tests / CLI Functionality Tests
🔇 Additional comments (2)
cognee/__init__.py (1)

36-37: LGTM: Public API export is properly placed.

The import is correctly positioned after all other imports and makes run_migrations available as part of the public package API.

Note: The underlying implementation in cognee/run_migrations.py has issues flagged in a separate review.

cognee/alembic/env.py (1)

82-95: The review comment references code that does not exist in the current repository.

The actual implementation at alembic/env.py lines 80-83 uses asyncio.run(run_async_migrations()), which is the correct approach. The problematic try-except pattern with asyncio.get_running_loop() and loop.run_until_complete() is not present in the codebase. Additionally, the file path in the comment is incorrect (cognee/alembic/env.py should be alembic/env.py).

Likely an incorrect or invalid review comment.

dependencies = [
    # For local cognee repo usage remove comment bellow and add absolute path to cognee. Then run `uv sync --reinstall` in the mcp folder on local cognee changes.
    #"cognee[postgres,codegraph,gemini,huggingface,docs,neo4j] @ file:/Users/igorilic/Desktop/cognee",
    "cognee[postgres,codegraph,gemini,huggingface,docs,neo4j] @ file:/Users/igorilic/Desktop/cognee/dist/cognee-0.5.0.dev0-py3-none-any.whl",

⚠️ Potential issue | 🔴 Critical

Remove hardcoded absolute path before merging.

The dependency uses a hardcoded absolute path specific to a single developer's machine (/Users/igorilic/Desktop/...). This will break the build for other developers and CI environments.

Consider one of these approaches:

  • Use a relative path: file://../../dist/cognee-0.5.0.dev0-py3-none-any.whl
  • Switch to the commented-out PyPI version on line 12 for production
  • Document this as a local-development-only configuration in a README
🤖 Prompt for AI Agents
In cognee-mcp/pyproject.toml around line 10, the dependency entry contains a
hardcoded absolute path to a developer's wheel which will break other
environments; replace it with a non-absolute solution by either (a) changing to
a relative file URL pointing to the local dist (e.g. file://../../dist/... ),
(b) reverting to the commented PyPI package/version for CI/production, or (c) if
this must remain for local dev only, remove it from the committed file and add
documentation in the repo README describing how to install the local wheel for
development; update the file accordingly and ensure CI uses the PyPI entry or a
relative path.

MIGRATIONS_DIR_NAME = "alembic"


async def run_migrations():

🛠️ Refactor suggestion | 🟠 Major

Remove async keyword from non-async function.

The function is declared as async but contains no await expressions. The only I/O operation is subprocess.run(), which is synchronous and blocking.

Apply this diff:

-async def run_migrations():
+def run_migrations():
     """
     Finds the Alembic configuration within the installed package and
     programmatically executes 'alembic upgrade head'.
     """

Update callers to remove await:

# In cognee-mcp/src/server.py
run_migrations()  # Remove await
🤖 Prompt for AI Agents
In cognee/run_migrations.py around line 13, the function is declared async but
contains no await and calls the synchronous subprocess.run, so remove the async
keyword to make it a normal function (change "async def run_migrations()" to
"def run_migrations()") and leave the subprocess.run call as-is; then update all
callers (e.g., cognee-mcp/src/server.py) to call run_migrations() without
awaiting it.

Comment on lines 36 to 41
    migration_result = subprocess.run(
        ["alembic", "upgrade", "head"],
        capture_output=True,
        text=True,
        cwd=Path(package_root),
    )

⚠️ Potential issue | 🟠 Major

Blocking subprocess call will freeze the event loop.

subprocess.run() is synchronous and will block the entire event loop if called from async code. Since this function is imported and awaited in cognee-mcp/src/server.py (line 865), it will block the MCP server startup.

If the function must remain async, use the async subprocess API:

-    migration_result = subprocess.run(
-        ["alembic", "upgrade", "head"],
-        capture_output=True,
-        text=True,
-        cwd=Path(package_root),
-    )
+    process = await asyncio.create_subprocess_exec(
+        "alembic", "upgrade", "head",
+        stdout=asyncio.subprocess.PIPE,
+        stderr=asyncio.subprocess.PIPE,
+        cwd=Path(package_root),
+    )
+    stdout, stderr = await process.communicate()
+    migration_result = type('Result', (), {
+        'returncode': process.returncode,
+        'stdout': stdout.decode(),
+        'stderr': stderr.decode()
+    })()

Add import:

+import asyncio
📝 Committable suggestion


Suggested change

-    migration_result = subprocess.run(
-        ["alembic", "upgrade", "head"],
-        capture_output=True,
-        text=True,
-        cwd=Path(package_root),
-    )
+    process = await asyncio.create_subprocess_exec(
+        "alembic", "upgrade", "head",
+        stdout=asyncio.subprocess.PIPE,
+        stderr=asyncio.subprocess.PIPE,
+        cwd=Path(package_root),
+    )
+    stdout, stderr = await process.communicate()
+    migration_result = type('Result', (), {
+        'returncode': process.returncode,
+        'stdout': stdout.decode(),
+        'stderr': stderr.decode()
+    })()
🤖 Prompt for AI Agents
In cognee/run_migrations.py around lines 36 to 41 the code uses subprocess.run
which is synchronous and will block the event loop; replace it with the asyncio
subprocess API (e.g., asyncio.create_subprocess_exec or
asyncio.subprocess.create_subprocess_exec) and await the process to completion,
capturing stdout/stderr asynchronously, checking returncode and raising or
returning a proper error if non-zero; also add the necessary import (import
asyncio) and ensure cwd is passed to the async subprocess call (or use os.chdir
temporarily) and that the function remains async so callers can await it without
blocking.

Comment on lines 43 to 46
    if migration_result.returncode != 0:
        migration_output = migration_result.stderr + migration_result.stdout
        print(f"Migration failed with unexpected error: {migration_output}")
        sys.exit(1)

🛠️ Refactor suggestion | 🟠 Major

Replace sys.exit() with exception raising.

Calling sys.exit(1) abruptly terminates the process without allowing cleanup or proper error propagation. This prevents the caller from handling migration failures gracefully.

Apply this diff:

     if migration_result.returncode != 0:
         migration_output = migration_result.stderr + migration_result.stdout
-        print(f"Migration failed with unexpected error: {migration_output}")
-        sys.exit(1)
+        raise RuntimeError(f"Migration failed: {migration_output}")

This allows callers to catch and handle the exception appropriately.
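
For example, a caller could then handle failures explicitly (a sketch; logger is assumed to be in scope):

try:
    await run_migrations()  # inside an async startup hook
except RuntimeError as error:
    logger.error("Startup migration failed: %s", error)
    raise  # or degrade gracefully, depending on the caller's policy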

🤖 Prompt for AI Agents
In cognee/run_migrations.py around lines 43 to 46, replace the abrupt process
termination using sys.exit(1) with raising a descriptive exception so callers
can catch and handle migration failures; construct an exception (e.g.,
RuntimeError) including the combined migration_output (migration_result.stderr +
migration_result.stdout) in the message and raise it instead of calling
sys.exit, preserving the existing printed/logged output if desired.


    if migration_result.returncode != 0:
        migration_output = migration_result.stderr + migration_result.stdout
        print(f"Migration failed with unexpected error: {migration_output}")

🛠️ Refactor suggestion | 🟠 Major

Use logging utilities instead of print().

Per coding guidelines for cognee/**/*.py, use shared logging utilities from cognee.shared.logging_utils.

Based on coding guidelines, apply this diff:

+from cognee.shared.logging_utils import get_logger
+
+logger = get_logger()
+
 async def run_migrations():
     """
     Finds the Alembic configuration within the installed package and
     programmatically executes 'alembic upgrade head'.
     """
     # ... existing code ...
     
     if migration_result.returncode != 0:
         migration_output = migration_result.stderr + migration_result.stdout
-        print(f"Migration failed with unexpected error: {migration_output}")
+        logger.error(f"Migration failed with unexpected error: {migration_output}")
         sys.exit(1)
📝 Committable suggestion


Suggested change

-        print(f"Migration failed with unexpected error: {migration_output}")
+from cognee.shared.logging_utils import get_logger
+
+logger = get_logger()
+
+
+async def run_migrations():
+    """
+    Finds the Alembic configuration within the installed package and
+    programmatically executes 'alembic upgrade head'.
+    """
+    # ... existing code ...
+
+    if migration_result.returncode != 0:
+        migration_output = migration_result.stderr + migration_result.stdout
+        logger.error(f"Migration failed with unexpected error: {migration_output}")
+        sys.exit(1)
🤖 Prompt for AI Agents
In cognee/run_migrations.py around line 45 replace the call to print(...) with
the project logging utilities: import the shared logger from
cognee.shared.logging_utils at the top of the file (or reuse an existing
logger), then change print(f"Migration failed with unexpected error:
{migration_output}") to an appropriate logger method (e.g., logger.error(...))
that includes the migration_output; ensure the import is added only once and no
print() calls remain.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
cognee/api/v1/datasets/routers/get_datasets_router.py (1)

211-218: Clarify auth result handling and avoid delete_dataset name shadowing inside the endpoint

The new flow correctly (a) enforces delete permission via get_authorized_existing_datasets and (b) delegates to the shared cognee.modules.data.methods.delete_dataset, which satisfies the PR goal of cleaning up associated databases.

There are two minor points worth tightening:

  • The import on Line 211 shadows the endpoint function name delete_dataset, which is legal but confusing for readers and makes it harder to tell at a glance whether await delete_dataset(...) refers to the endpoint or the data-layer method.
  • On Lines 213–218, you only guard against dataset is None. If get_authorized_existing_datasets ever returns an empty list for “no authorized dataset”, dataset[0] on Line 218 would raise IndexError. Using a truthiness check handles both None and empty lists defensively.

You can address both with a small refactor:

-        from cognee.modules.data.methods import delete_dataset
-
-        dataset = await get_authorized_existing_datasets([dataset_id], "delete", user)
-
-        if dataset is None:
-            raise DatasetNotFoundError(message=f"Dataset ({str(dataset_id)}) not found.")
-
-        await delete_dataset(dataset[0])
+        from cognee.modules.data.methods import delete_dataset as delete_dataset_method
+
+        authorized_datasets = await get_authorized_existing_datasets(
+            [dataset_id],
+            "delete",
+            user,
+        )
+
+        if not authorized_datasets:
+            raise DatasetNotFoundError(message=f"Dataset ({str(dataset_id)}) not found.")
+
+        await delete_dataset_method(authorized_datasets[0])

This keeps behavior the same in the happy path, improves robustness if the helper ever returns [], and removes the local name collision, which should make the code easier to follow.
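As a quick standalone illustration of why the truthiness check is the safer guard (a demonstration only, not project code):

    # Both "no result" shapes the helper might return are falsy,
    # so `if not authorized_datasets:` guards None and [] alike.
    for no_result in (None, []):
        assert not no_result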

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ca746ea and 31a3410.

📒 Files selected for processing (1)
  • cognee/api/v1/datasets/routers/get_datasets_router.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Use 4-space indentation in Python code
Use snake_case for Python module and function names
Use PascalCase for Python class names
Use ruff format before committing Python code
Use ruff check for import hygiene and style enforcement with line-length 100 configured in pyproject.toml
Prefer explicit, structured error handling in Python code

Files:

  • cognee/api/v1/datasets/routers/get_datasets_router.py

⚙️ CodeRabbit configuration file

**/*.py: When reviewing Python code for this project:

  1. Prioritize portability over clarity, especially when dealing with cross-Python compatibility. However, with the priority in mind, do still consider improvements to clarity when relevant.
  2. As a general guideline, consider the code style advocated in the PEP 8 standard (excluding the use of spaces for indentation) and evaluate suggested changes for code style compliance.
  3. As a style convention, consider the code style advocated in CEP-8 and evaluate suggested changes for code style compliance.
  4. As a general guideline, try to provide any relevant, official, and supporting documentation links to any tool's suggestions in review comments. This guideline is important for posterity.
  5. As a general rule, undocumented function definitions and class definitions in the project's Python code are assumed incomplete. Please consider suggesting a short summary of the code for any of these incomplete definitions as docstrings when reviewing.

Files:

  • cognee/api/v1/datasets/routers/get_datasets_router.py
cognee/api/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Public APIs should be type-annotated in Python where practical

Files:

  • cognee/api/v1/datasets/routers/get_datasets_router.py
cognee/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Use shared logging utilities from cognee.shared.logging_utils in Python code

Files:

  • cognee/api/v1/datasets/routers/get_datasets_router.py
🧬 Code graph analysis (1)
cognee/api/v1/datasets/routers/get_datasets_router.py (1)
cognee/modules/data/methods/delete_dataset.py (1)
  • delete_dataset (14-34)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (19)
  • GitHub Check: End-to-End Tests / Test multi tenancy with different situations in Cognee
  • GitHub Check: End-to-End Tests / Test dataset database deletion in Cognee
  • GitHub Check: End-to-End Tests / Test Feedback Enrichment
  • GitHub Check: End-to-End Tests / Test Entity Extraction
  • GitHub Check: End-to-End Tests / Conversation sessions test (Redis)
  • GitHub Check: Basic Tests / Run Integration Tests
  • GitHub Check: End-to-End Tests / Concurrent Subprocess access test
  • GitHub Check: End-to-End Tests / Conversation sessions test (FS)
  • GitHub Check: End-to-End Tests / Test dataset database handlers in Cognee
  • GitHub Check: End-to-End Tests / Test permissions with different situations in Cognee
  • GitHub Check: End-to-End Tests / Test Cognify - Edge Centered Payload
  • GitHub Check: Basic Tests / Run Unit Tests
  • GitHub Check: End-to-End Tests / Deduplication Test
  • GitHub Check: End-to-End Tests / S3 Bucket Test
  • GitHub Check: End-to-End Tests / Server Start Test
  • GitHub Check: End-to-End Tests / Run Telemetry Pipeline Test
  • GitHub Check: End-to-End Tests / Test using different async databases in parallel in Cognee
  • GitHub Check: CLI Tests / CLI Integration Tests
  • GitHub Check: CLI Tests / CLI Functionality Tests

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
cognee/infrastructure/llm/LLMGateway.py (1)

12-26: Consider forwarding **kwargs in the BAML branch and adding brief method docstrings

Right now acreate_structured_output accepts **kwargs, but only the LiteLLM path forwards them; the BAML path ignores them. If the gateway is meant to be backend-agnostic, consider passing **kwargs through to the BAML acreate_structured_output as well for interface parity.

Also, per the project’s Python guidelines, you may want to add short docstrings to the static methods (e.g., summarizing what acreate_structured_output expects/returns) to complement the class-level docstring and align with PEP 257-style documentation.
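A minimal sketch of what forwarding **kwargs on both paths could look like; the adapter stubs and the USE_BAML flag below are placeholders for illustration, not Cognee's actual backend-selection mechanism or adapter names:

    from typing import Any, Type

    # Hypothetical stand-ins for the real BAML and LiteLLM adapters.
    async def _baml_structured_output(text: str, system_prompt: str, model: Type, **kwargs: Any) -> Any: ...
    async def _litellm_structured_output(text: str, system_prompt: str, model: Type, **kwargs: Any) -> Any: ...

    USE_BAML = False  # assumed switch; the real gateway consults its configuration

    class LLMGateway:
        """Facade that routes structured-output requests to the configured backend."""

        @staticmethod
        async def acreate_structured_output(
            text: str, system_prompt: str, response_model: Type, **kwargs: Any
        ) -> Any:
            """Forward **kwargs on both paths so backend options behave consistently."""
            if USE_BAML:
                return await _baml_structured_output(text, system_prompt, response_model, **kwargs)
            return await _litellm_structured_output(text, system_prompt, response_model, **kwargs)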

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 31a3410 and d36fb2d.

📒 Files selected for processing (2)
  • cognee/api/v1/cognify/cognify.py (1 hunks)
  • cognee/infrastructure/llm/LLMGateway.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Use 4-space indentation in Python code
Use snake_case for Python module and function names
Use PascalCase for Python class names
Use ruff format before committing Python code
Use ruff check for import hygiene and style enforcement with line-length 100 configured in pyproject.toml
Prefer explicit, structured error handling in Python code

Files:

  • cognee/infrastructure/llm/LLMGateway.py
  • cognee/api/v1/cognify/cognify.py

⚙️ CodeRabbit configuration file

**/*.py: When reviewing Python code for this project:

  1. Prioritize portability over clarity, especially when dealing with cross-Python compatibility. However, with the priority in mind, do still consider improvements to clarity when relevant.
  2. As a general guideline, consider the code style advocated in the PEP 8 standard (excluding the use of spaces for indentation) and evaluate suggested changes for code style compliance.
  3. As a style convention, consider the code style advocated in CEP-8 and evaluate suggested changes for code style compliance.
  4. As a general guideline, try to provide any relevant, official, and supporting documentation links to any tool's suggestions in review comments. This guideline is important for posterity.
  5. As a general rule, undocumented function definitions and class definitions in the project's Python code are assumed incomplete. Please consider suggesting a short summary of the code for any of these incomplete definitions as docstrings when reviewing.

Files:

  • cognee/infrastructure/llm/LLMGateway.py
  • cognee/api/v1/cognify/cognify.py
cognee/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Use shared logging utilities from cognee.shared.logging_utils in Python code

Files:

  • cognee/infrastructure/llm/LLMGateway.py
  • cognee/api/v1/cognify/cognify.py
cognee/{modules,infrastructure,tasks}/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Co-locate feature-specific helpers under their respective package (modules/, infrastructure/, or tasks/)

Files:

  • cognee/infrastructure/llm/LLMGateway.py
cognee/api/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Public APIs should be type-annotated in Python where practical

Files:

  • cognee/api/v1/cognify/cognify.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (16)
  • GitHub Check: End-to-End Tests / Test Feedback Enrichment
  • GitHub Check: End-to-End Tests / Test Entity Extraction
  • GitHub Check: End-to-End Tests / Conversation sessions test (FS)
  • GitHub Check: End-to-End Tests / Concurrent Subprocess access test
  • GitHub Check: Basic Tests / Run Integration Tests
  • GitHub Check: End-to-End Tests / Conversation sessions test (Redis)
  • GitHub Check: End-to-End Tests / Test permissions with different situations in Cognee
  • GitHub Check: End-to-End Tests / S3 Bucket Test
  • GitHub Check: End-to-End Tests / Test multi tenancy with different situations in Cognee
  • GitHub Check: Basic Tests / Run Simple Examples
  • GitHub Check: End-to-End Tests / Server Start Test
  • GitHub Check: End-to-End Tests / Run Telemetry Pipeline Test
  • GitHub Check: Basic Tests / Run Unit Tests
  • GitHub Check: Basic Tests / Run Simple Examples BAML
  • GitHub Check: CLI Tests / CLI Functionality Tests
  • GitHub Check: CLI Tests / CLI Integration Tests
🔇 Additional comments (2)
cognee/api/v1/cognify/cognify.py (1)

56-56: LGTM! Trailing comma aligns with Python conventions.

The trailing comma after **kwargs is a non-functional style improvement that makes future parameter additions cleaner in diffs and is consistent with modern Python formatters like ruff.

cognee/infrastructure/llm/LLMGateway.py (1)

34-37: Formatting-only change; behavior stays the same

Reformatting this call to multi-line keyword arguments preserves the existing behavior (including forwarding **kwargs) and improves readability. No issues here.

@dexters1 dexters1 merged commit 14d9540 into dev Dec 15, 2025
114 of 118 checks passed
@dexters1 dexters1 deleted the improve-dataset-delete branch December 15, 2025 17:15