
Conversation

@dexters1 (Collaborator) commented Dec 12, 2025

Description

  • Add support for database deletion when a dataset is deleted
  • Simplify dataset handler usage in Cognee

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Code refactoring
  • Performance improvement
  • Other (please specify):

Screenshots/Videos (if applicable)

Pre-submission Checklist

  • I have tested my changes thoroughly before submitting this PR
  • This PR contains minimal changes necessary to address the issue/feature
  • My code follows the project's coding standards and style guidelines
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if applicable)
  • All new and existing tests pass
  • I have searched existing PRs to ensure this change hasn't been submitted already
  • I have linked any relevant issues in the description
  • My commits have clear and descriptive messages

DCO Affirmation

I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin.

Summary by CodeRabbit

  • Bug Fixes

    • Improved dataset deletion: stronger authorization checks and reliable removal of associated graph and vector storage.
  • Tests

    • Added end-to-end test to verify complete dataset deletion and cleanup of all related storage components.


@dexters1 dexters1 self-assigned this Dec 12, 2025
@pull-checklist

Please make sure all the checkboxes are checked:

  • I have tested these changes locally.
  • I have reviewed the code changes.
  • I have added end-to-end and unit tests (if applicable).
  • I have updated the documentation and README.md file (if necessary).
  • I have removed unnecessary code and debug statements.
  • PR title is clear and follows the convention.
  • I have tagged reviewers or team members for feedback.

coderabbitai bot commented Dec 12, 2025

Walkthrough

Adds per-dataset database handler utilities and integrates them into dataset deletion flows; introduces an end-to-end GitHub Actions job and test that verifies graph and vector database files are removed when datasets are deleted.

Changes

  • CI & E2E test (.github/workflows/e2e_tests.yml, cognee/tests/test_dataset_delete.py): Adds a new GitHub Actions job, test-dataset-database-deletion, and an integration test that seeds datasets, runs cognify, deletes the datasets, and asserts that graph (.pkl) and vector (.lance.db) DB files are removed; see the sketch after this list.
  • Handler utilities & exports (cognee/infrastructure/databases/utils/get_graph_dataset_database_handler.py, cognee/infrastructure/databases/utils/get_vector_dataset_database_handler.py, cognee/infrastructure/databases/utils/__init__.py): Introduces get_graph_dataset_database_handler and get_vector_dataset_database_handler and re-exports them from the utils package.
  • Deletion/resolution refactor (cognee/modules/data/methods/delete_dataset.py, cognee/modules/data/deletion/prune_system.py, cognee/infrastructure/databases/utils/resolve_dataset_database_connection_info.py): Replaces inlined per-dataset handler logic with calls to the new handler utilities; the delete flow now looks up DatasetDatabase records and invokes per-dataset graph/vector handlers before removing the Dataset.
  • API router change (cognee/api/v1/datasets/routers/get_datasets_router.py): Switches the dataset delete endpoint to an authorization-aware lookup (get_authorized_existing_datasets) and passes the resolved dataset to the deletion routine.
  • Minor formatting changes (cognee/api/v1/cognify/cognify.py, cognee/infrastructure/llm/LLMGateway.py): Cosmetic only: a trailing comma added to the cognify signature and multiline reformatting of the acreate_structured_output call; no behavioral changes.
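
For orientation, the described test flow is roughly the following. This is a minimal sketch, not the PR's actual test; get_datasets, the dataset name, and the on-disk .cognee_system layout are assumptions:

import asyncio
from pathlib import Path

import cognee
from cognee.modules.users.methods import get_default_user
from cognee.modules.data.methods import delete_dataset, get_datasets  # get_datasets assumed


async def main():
    await cognee.add("Sample text for the deletion test.", dataset_name="delete_me")
    await cognee.cognify(["delete_me"])

    databases_dir = Path(".cognee_system") / "databases"  # assumed storage layout
    assert any(databases_dir.rglob("*.pkl")), "graph DB file expected after cognify"
    assert any(databases_dir.rglob("*.lance.db")), "vector DB expected after cognify"

    user = await get_default_user()
    for dataset in await get_datasets(user.id):  # assumed signature
        await delete_dataset(dataset)

    # With the new per-dataset handlers, both stores should now be gone.
    assert not any(databases_dir.rglob("*.pkl"))
    assert not any(databases_dir.rglob("*.lance.db"))


if __name__ == "__main__":
    asyncio.run(main())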

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Verify mapping keys used by the new handler getters match DatasetDatabase fields (graph_dataset_database_handler, vector_dataset_database_handler).
  • Confirm delete_dataset now calls handlers in the correct order and handles missing DatasetDatabase records safely.
  • Check prune_system and resolve_dataset_database_connection_info correctly invoke handler resolvers and propagate results/exceptions.
  • Review the router change to ensure authorization semantics remain correct when using get_authorized_existing_datasets.

Suggested reviewers

  • pazone
  • siillee

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 35.71%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Title check: ✅ Passed. The title accurately describes the main change: adding database deletion functionality when a dataset is deleted, which is the primary objective across multiple modified files.
  • Description check: ✅ Passed. The description follows the required template with a clear human-written summary, type-of-change selection, and DCO affirmation, though most pre-submission checklist items remain unchecked.


@dexters1 dexters1 marked this pull request as ready for review December 15, 2025 13:11

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (2)
cognee/api/v1/datasets/routers/get_datasets_router.py (1)

225-271: Clarify future of delete_data endpoint or remove the TODO

The inline TODO notes that this endpoint may be redundant with other data deletion mechanisms, but the implementation is still fully active.

Either:

  • Decide to keep it and remove/adjust the TODO (possibly documenting how it differs from the /v1/delete endpoint), or
  • Plan for deprecation/removal (e.g., mark as deprecated in docs and route comments).

This will avoid long-lived ambiguity around whether clients should rely on this endpoint.

cognee/modules/data/deletion/prune_system.py (1)

38-42: Same critical type mismatch and missing docstring as prune_graph_databases.

This function has the same issues identified in prune_graph_databases:

  1. Critical: Type mismatch between dict-like Row mappings and DatasetDatabase model instances
  2. Major: Missing docstring
  3. Minor: No validation of handler dictionary structure

Apply the same fixes as suggested for prune_graph_databases:

 async def prune_vector_databases():
+    """
+    Delete all vector databases for all dataset_database records.
+    Used when backend_access_control is enabled to prune vector databases per-dataset.
+    """
     db_engine = get_relational_engine()
     try:
-        dataset_databases = await db_engine.get_all_data_from_table("dataset_database")
+        dataset_database_rows = await db_engine.get_all_data_from_table("dataset_database")
         # Go through each dataset database and delete the vector database
-        for dataset_database in dataset_databases:
+        for row in dataset_database_rows:
+            dataset_database = DatasetDatabase(**row)
             handler = get_vector_dataset_database_handler(dataset_database)
+            if "handler_instance" not in handler:
+                logger.warning("Handler missing 'handler_instance' key for dataset_database")
+                continue
             await handler["handler_instance"].delete_dataset(dataset_database)
🧹 Nitpick comments (6)
cognee/modules/data/methods/delete_dataset.py (1)

1-33: Dataset deletion flow via graph/vector handlers looks sound; consider docstring and optional defensive handler lookup

The sequence (fetch DatasetDatabase, invoke graph/vector handlers, then delete the dataset entity) looks correct and matches the new handler utilities.

To improve maintainability and debuggability:

  • Add a short docstring documenting that this also deletes associated graph/vector databases, not just the relational dataset row.
  • Optionally guard against misconfiguration of handler names (KeyError from the handler mapping) so failures surface with a clearer error message.

For example, adding a docstring:

 async def delete_dataset(dataset: Dataset):
-    db_engine = get_relational_engine()
+    """Delete a dataset and its associated graph and vector databases."""
+
+    db_engine = get_relational_engine()
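
Putting both suggestions together, the overall shape might look like the following sketch; get_dataset_database and the final relational delete call are hypothetical stand-ins, not the PR's actual code:

async def delete_dataset(dataset: Dataset):
    """Delete a dataset and its associated graph and vector databases."""
    db_engine = get_relational_engine()

    # Hypothetical lookup for the dataset's DatasetDatabase record.
    dataset_database = await get_dataset_database(dataset.id)

    if dataset_database is not None:
        for get_handler in (
            get_graph_dataset_database_handler,
            get_vector_dataset_database_handler,
        ):
            try:
                handler = get_handler(dataset_database)
            except KeyError as error:
                raise ValueError(
                    f"Misconfigured dataset database handler for dataset {dataset.id}"
                ) from error
            await handler["handler_instance"].delete_dataset(dataset_database)

    # Finally remove the relational Dataset row (exact call is a stand-in).
    await db_engine.delete_entity(dataset)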
cognee/api/v1/datasets/routers/get_datasets_router.py (1)

211-218: Avoid shadowing the router delete_dataset function with the imported method

Importing delete_dataset inside the delete_dataset endpoint shadows the endpoint’s own name, which is confusing when reading or debugging.

Consider importing under an alias (or importing at module level) to make intent clearer:

-        from cognee.modules.data.methods import delete_dataset
+        from cognee.modules.data.methods import delete_dataset as delete_dataset_method
@@
-        await delete_dataset(dataset[0])
+        await delete_dataset_method(dataset[0])

The authorization pattern via get_authorized_existing_datasets looks good.

cognee/infrastructure/databases/utils/get_graph_dataset_database_handler.py (1)

1-10: Graph handler lookup is straightforward; consider docstring and clearer error on unknown handler

The direct lookup into supported_dataset_database_handlers is fine assuming configuration is always valid.

Two small improvements:

  • Add a one-line docstring clarifying that this picks the graph handler based on dataset_database.graph_dataset_database_handler.
  • Optionally wrap the dict access to raise a more descriptive error if an unsupported handler name is stored in the DB (instead of a bare KeyError).

Example:

 def get_graph_dataset_database_handler(dataset_database: DatasetDatabase) -> dict:
-    from cognee.infrastructure.databases.dataset_database_handler.supported_dataset_database_handlers import (
-        supported_dataset_database_handlers,
-    )
-
-    handler = supported_dataset_database_handlers[dataset_database.graph_dataset_database_handler]
-    return handler
+    """Return the graph dataset database handler config for the given dataset."""
+    from cognee.infrastructure.databases.dataset_database_handler.supported_dataset_database_handlers import (
+        supported_dataset_database_handlers,
+    )
+
+    try:
+        return supported_dataset_database_handlers[
+            dataset_database.graph_dataset_database_handler
+        ]
+    except KeyError as exc:
+        raise ValueError(
+            f"Unsupported graph dataset database handler: "
+            f"{dataset_database.graph_dataset_database_handler!r}"
+        ) from exc
cognee/tests/test_dataset_delete.py (1)

1-76: E2E dataset deletion test is valuable; consider simplifying event-loop setup

The test does a good end-to-end check: build datasets, verify graph/vector DB files exist, delete, then assert both are gone. This directly exercises the new handler-based deletion.

If you don’t rely on custom loop semantics here, you could simplify the __main__ block with asyncio.run(main()) rather than manually creating and managing the loop, e.g.:

-if __name__ == "__main__":
-    logger = setup_logging(log_level=ERROR)
-    loop = asyncio.new_event_loop()
-    asyncio.set_event_loop(loop)
-    try:
-        loop.run_until_complete(main())
-    finally:
-        loop.run_until_complete(loop.shutdown_asyncgens())
+if __name__ == "__main__":
+    logger = setup_logging(log_level=ERROR)
+    asyncio.run(main())

The current approach still works, so this is mostly about readability and reducing boilerplate.

cognee/infrastructure/databases/utils/resolve_dataset_database_connection_info.py (1)

1-29: Refactor to delegate connection resolution to handlers looks correct

Using get_vector_dataset_database_handler and get_graph_dataset_database_handler and then calling each handler’s resolve_dataset_connection_info sequentially aligns with the function’s purpose and centralizes handler logic.

If you later need better diagnostics for misconfigured handler names, the same defensive pattern suggested for the handler getter utilities can be applied there, but the current flow is coherent with the new design.
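
In outline, the delegation described above amounts to something like this sketch; that resolve_dataset_connection_info hangs off handler["handler_instance"] mirrors the delete path and is an assumption:

from cognee.infrastructure.databases.utils import (
    get_graph_dataset_database_handler,
    get_vector_dataset_database_handler,
)


async def resolve_dataset_database_connection_info(dataset_database):
    """Resolve per-dataset connection info by delegating to both handlers in turn."""
    for get_handler in (
        get_vector_dataset_database_handler,
        get_graph_dataset_database_handler,
    ):
        handler = get_handler(dataset_database)
        # Each handler resolves its own portion of the connection info.
        await handler["handler_instance"].resolve_dataset_connection_info(dataset_database)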

cognee/infrastructure/databases/utils/get_vector_dataset_database_handler.py (1)

1-10: Vector handler lookup mirrors graph helper; same docstring/error-handling suggestions apply

The implementation correctly mirrors the graph handler utility, using vector_dataset_database_handler as the key.

For consistency and clarity, consider adding:

  • A brief docstring explaining that this returns the vector database handler config for a dataset.
  • Optional error handling to surface a clear message if an unsupported handler name is stored.

Pattern would be analogous to the graph helper refactor suggested earlier.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 127d986 and 031cc46.

📒 Files selected for processing (9)
  • .github/workflows/e2e_tests.yml (1 hunks)
  • cognee/api/v1/datasets/routers/get_datasets_router.py (1 hunks)
  • cognee/infrastructure/databases/utils/__init__.py (1 hunks)
  • cognee/infrastructure/databases/utils/get_graph_dataset_database_handler.py (1 hunks)
  • cognee/infrastructure/databases/utils/get_vector_dataset_database_handler.py (1 hunks)
  • cognee/infrastructure/databases/utils/resolve_dataset_database_connection_info.py (2 hunks)
  • cognee/modules/data/deletion/prune_system.py (3 hunks)
  • cognee/modules/data/methods/delete_dataset.py (1 hunks)
  • cognee/tests/test_dataset_delete.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (7)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Use 4-space indentation in Python code
Use snake_case for Python module and function names
Use PascalCase for Python class names
Use ruff format before committing Python code
Use ruff check for import hygiene and style enforcement with line-length 100 configured in pyproject.toml
Prefer explicit, structured error handling in Python code

Files:

  • cognee/modules/data/deletion/prune_system.py
  • cognee/infrastructure/databases/utils/get_graph_dataset_database_handler.py
  • cognee/infrastructure/databases/utils/resolve_dataset_database_connection_info.py
  • cognee/modules/data/methods/delete_dataset.py
  • cognee/infrastructure/databases/utils/__init__.py
  • cognee/infrastructure/databases/utils/get_vector_dataset_database_handler.py
  • cognee/tests/test_dataset_delete.py
  • cognee/api/v1/datasets/routers/get_datasets_router.py

⚙️ CodeRabbit configuration file

**/*.py: When reviewing Python code for this project:

  1. Prioritize portability over clarity, especially when dealing with cross-Python compatibility. However, with the priority in mind, do still consider improvements to clarity when relevant.
  2. As a general guideline, consider the code style advocated in the PEP 8 standard (excluding the use of spaces for indentation) and evaluate suggested changes for code style compliance.
  3. As a style convention, consider the code style advocated in CEP-8 and evaluate suggested changes for code style compliance.
  4. As a general guideline, try to provide any relevant, official, and supporting documentation links to any tool's suggestions in review comments. This guideline is important for posterity.
  5. As a general rule, undocumented function definitions and class definitions in the project's Python code are assumed incomplete. Please consider suggesting a short summary of the code for any of these incomplete definitions as docstrings when reviewing.

Files:

  • cognee/modules/data/deletion/prune_system.py
  • cognee/infrastructure/databases/utils/get_graph_dataset_database_handler.py
  • cognee/infrastructure/databases/utils/resolve_dataset_database_connection_info.py
  • cognee/modules/data/methods/delete_dataset.py
  • cognee/infrastructure/databases/utils/__init__.py
  • cognee/infrastructure/databases/utils/get_vector_dataset_database_handler.py
  • cognee/tests/test_dataset_delete.py
  • cognee/api/v1/datasets/routers/get_datasets_router.py
cognee/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Use shared logging utilities from cognee.shared.logging_utils in Python code

Files:

  • cognee/modules/data/deletion/prune_system.py
  • cognee/infrastructure/databases/utils/get_graph_dataset_database_handler.py
  • cognee/infrastructure/databases/utils/resolve_dataset_database_connection_info.py
  • cognee/modules/data/methods/delete_dataset.py
  • cognee/infrastructure/databases/utils/__init__.py
  • cognee/infrastructure/databases/utils/get_vector_dataset_database_handler.py
  • cognee/tests/test_dataset_delete.py
  • cognee/api/v1/datasets/routers/get_datasets_router.py
cognee/{modules,infrastructure,tasks}/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Co-locate feature-specific helpers under their respective package (modules/, infrastructure/, or tasks/)

Files:

  • cognee/modules/data/deletion/prune_system.py
  • cognee/infrastructure/databases/utils/get_graph_dataset_database_handler.py
  • cognee/infrastructure/databases/utils/resolve_dataset_database_connection_info.py
  • cognee/modules/data/methods/delete_dataset.py
  • cognee/infrastructure/databases/utils/__init__.py
  • cognee/infrastructure/databases/utils/get_vector_dataset_database_handler.py
.github/**

⚙️ CodeRabbit configuration file

.github/**: * When the project is hosted on GitHub: All GitHub-specific configurations, templates, and tools should be found in the '.github' directory tree.

  • 'actionlint' erroneously generates false positives when dealing with GitHub's ${{ ... }} syntax in conditionals.
  • 'actionlint' erroneously generates incorrect solutions when suggesting the removal of valid ${{ ... }} syntax.

Files:

  • .github/workflows/e2e_tests.yml
cognee/tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

cognee/tests/**/*.py: Place Python tests under cognee/tests/ organized by type (unit, integration, cli_tests)
Name Python test files test_*.py and use pytest.mark.asyncio for async tests

Files:

  • cognee/tests/test_dataset_delete.py
cognee/tests/*

⚙️ CodeRabbit configuration file

cognee/tests/*: When reviewing test code:

  1. Prioritize portability over clarity, especially when dealing with cross-Python compatibility. However, with the priority in mind, do still consider improvements to clarity when relevant.
  2. As a general guideline, consider the code style advocated in the PEP 8 standard (excluding the use of spaces for indentation) and evaluate suggested changes for code style compliance.
  3. As a style convention, consider the code style advocated in CEP-8 and evaluate suggested changes for code style compliance, pointing out any violations discovered.
  4. As a general guideline, try to provide any relevant, official, and supporting documentation links to any tool's suggestions in review comments. This guideline is important for posterity.
  5. As a project rule, Python source files with names prefixed by the string "test_" and located in the project's "tests" directory are the project's unit-testing code. It is safe, albeit a heuristic, to assume these are considered part of the project's minimal acceptance testing unless a justifying exception to this assumption is documented.
  6. As a project rule, any files without extensions and with names prefixed by either the string "check_" or the string "test_", and located in the project's "tests" directory, are the project's non-unit test code. "Non-unit test" in this context refers to any type of testing other than unit testing, such as (but not limited to) functional testing, style linting, regression testing, etc. It can also be assumed that non-unit testing code is usually written as Bash shell scripts.

Files:

  • cognee/tests/test_dataset_delete.py
cognee/api/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Public APIs should be type-annotated in Python where practical

Files:

  • cognee/api/v1/datasets/routers/get_datasets_router.py
🧠 Learnings (2)
📚 Learning: 2025-10-11T04:18:24.594Z
Learnt from: Vattikuti-Manideep-Sitaram
Repo: topoteretes/cognee PR: 1529
File: cognee/api/v1/cognify/ontology_graph_pipeline.py:69-74
Timestamp: 2025-10-11T04:18:24.594Z
Learning: The code_graph_pipeline.py and ontology_graph_pipeline.py both follow an established pattern of calling cognee.prune.prune_data() and cognee.prune.prune_system(metadata=True) at the start of pipeline execution. This appears to be intentional behavior for pipeline operations in the cognee codebase.

Applied to files:

  • cognee/modules/data/deletion/prune_system.py
📚 Learning: 2024-11-18T12:54:36.758Z
Learnt from: 0xideas
Repo: topoteretes/cognee PR: 233
File: .github/workflows/test_cognee_llama_index_notebook.yml:0-0
Timestamp: 2024-11-18T12:54:36.758Z
Learning: In the `.github/workflows/test_cognee_llama_index_notebook.yml` workflow, it's acceptable to remove the `--all-extras` flag from `poetry install` to reduce costs by not installing unnecessary dependencies.

Applied to files:

  • .github/workflows/e2e_tests.yml
🧬 Code graph analysis (7)
cognee/modules/data/deletion/prune_system.py (7)
cognee/infrastructure/databases/utils/get_graph_dataset_database_handler.py (1)
  • get_graph_dataset_database_handler (4-10)
cognee/infrastructure/databases/utils/get_vector_dataset_database_handler.py (1)
  • get_vector_dataset_database_handler (4-10)
cognee/infrastructure/databases/relational/sqlalchemy/SqlAlchemyAdapter.py (1)
  • get_all_data_from_table (417-450)
cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py (1)
  • delete_dataset (58-81)
cognee/infrastructure/databases/vector/lancedb/LanceDBDatasetDatabaseHandler.py (1)
  • delete_dataset (43-50)
cognee/infrastructure/databases/graph/neo4j_driver/Neo4jAuraDevDatasetDatabaseHandler.py (1)
  • delete_dataset (167-168)
cognee/infrastructure/databases/dataset_database_handler/dataset_database_handler_interface.py (1)
  • delete_dataset (71-80)
cognee/infrastructure/databases/utils/get_graph_dataset_database_handler.py (1)
cognee/modules/users/models/DatasetDatabase.py (1)
  • DatasetDatabase (7-40)
cognee/modules/data/methods/delete_dataset.py (4)
cognee/modules/users/models/DatasetDatabase.py (1)
  • DatasetDatabase (7-40)
cognee/infrastructure/databases/utils/get_vector_dataset_database_handler.py (1)
  • get_vector_dataset_database_handler (4-10)
cognee/infrastructure/databases/utils/get_graph_dataset_database_handler.py (1)
  • get_graph_dataset_database_handler (4-10)
cognee/api/v1/datasets/routers/get_datasets_router.py (1)
  • delete_dataset (184-218)
cognee/infrastructure/databases/utils/__init__.py (2)
cognee/infrastructure/databases/utils/get_graph_dataset_database_handler.py (1)
  • get_graph_dataset_database_handler (4-10)
cognee/infrastructure/databases/utils/get_vector_dataset_database_handler.py (1)
  • get_vector_dataset_database_handler (4-10)
cognee/infrastructure/databases/utils/get_vector_dataset_database_handler.py (1)
cognee/modules/users/models/DatasetDatabase.py (1)
  • DatasetDatabase (7-40)
cognee/tests/test_dataset_delete.py (3)
cognee/shared/logging_utils.py (1)
  • setup_logging (288-561)
cognee/modules/users/methods/get_default_user.py (1)
  • get_default_user (13-38)
cognee/api/v1/config/config.py (2)
  • data_root_directory (36-38)
  • system_root_directory (18-33)
cognee/api/v1/datasets/routers/get_datasets_router.py (2)
cognee/modules/data/methods/delete_dataset.py (1)
  • delete_dataset (14-34)
cognee/api/v1/delete/routers/get_delete_router.py (1)
  • delete (18-60)
🔇 Additional comments (3)
.github/workflows/e2e_tests.yml (1)

240-263: New dataset-deletion E2E job is consistent with existing workflow structure

The test-dataset-database-deletion job reuses the standard Cognee setup, secrets, and environment pattern and correctly points to cognee/tests/test_dataset_delete.py. This should integrate cleanly with the existing reusable integration tests.

cognee/infrastructure/databases/utils/__init__.py (1)

3-4: Re-exporting handler getters from the utils package looks appropriate

Re-exporting get_graph_dataset_database_handler and get_vector_dataset_database_handler here matches the existing pattern and simplifies callers’ imports.

cognee/modules/data/deletion/prune_system.py (1)

8-11: LGTM!

The new utility imports are correctly structured and align with the refactored deletion logic.

logger = get_logger()


async def prune_graph_databases():

⚠️ Potential issue | 🟠 Major

Add docstring to document function purpose.

Per coding guidelines, undocumented function definitions are considered incomplete. This function should have a docstring explaining that it deletes all graph databases for all dataset_database records when backend access control is enabled.

Apply this diff:

 async def prune_graph_databases():
+    """
+    Delete all graph databases for all dataset_database records.
+    Used when backend_access_control is enabled to prune graph databases per-dataset.
+    """
     db_engine = get_relational_engine()

As per coding guidelines, undocumented functions are assumed incomplete.

🤖 Prompt for AI Agents
In cognee/modules/data/deletion/prune_system.py around line 19, the function
prune_graph_databases lacks a docstring; add a concise docstring immediately
below the async def that states the function's purpose: that it deletes all
graph databases for all dataset_database records when backend access control is
enabled, describes input (none) and behavior (side-effect deletion) and notes
any exceptions or important preconditions (e.g., requires backend access control
enabled). Ensure the docstring follows project style (single- or triple-quoted
summary) and is brief and descriptive.

Comment on lines +22 to +26
        dataset_databases = await db_engine.get_all_data_from_table("dataset_database")
        # Go through each dataset database and delete the graph database
-        for data_item in data:
-            await _prune_graph_db(data_item)
+        for dataset_database in dataset_databases:
+            handler = get_graph_dataset_database_handler(dataset_database)
+            await handler["handler_instance"].delete_dataset(dataset_database)

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

cat -n cognee/modules/data/deletion/prune_system.py

Repository: topoteretes/cognee

Length of output: 3544


🏁 Script executed:

# Find the db_engine.get_all_data_from_table implementation
find . -type f -name "*.py" | xargs grep -l "def get_all_data_from_table" | head -5

Repository: topoteretes/cognee

Length of output: 138


🏁 Script executed:

# Find get_graph_dataset_database_handler function
rg -n "def get_graph_dataset_database_handler" --type=py

Repository: topoteretes/cognee

Length of output: 222


🏁 Script executed:

# Find DatasetDatabase class definition
rg -n "class DatasetDatabase" --type=py

Repository: topoteretes/cognee

Length of output: 280


🏁 Script executed:

cat -n cognee/infrastructure/databases/relational/sqlalchemy/SqlAlchemyAdapter.py | grep -A 20 "def get_all_data_from_table"

Repository: topoteretes/cognee

Length of output: 1057


🏁 Script executed:

cat -n cognee/infrastructure/databases/utils/get_graph_dataset_database_handler.py

Repository: topoteretes/cognee

Length of output: 569


🏁 Script executed:

cat -n cognee/modules/users/models/DatasetDatabase.py

Repository: topoteretes/cognee

Length of output: 2140


🏁 Script executed:

cat -n cognee/infrastructure/databases/relational/sqlalchemy/SqlAlchemyAdapter.py | sed -n '417,460p'

Repository: topoteretes/cognee

Length of output: 1885


🏁 Script executed:

# Search for complete implementation of get_all_data_from_table
rg -A 30 "async def get_all_data_from_table" cognee/infrastructure/databases/relational/sqlalchemy/SqlAlchemyAdapter.py

Repository: topoteretes/cognee

Length of output: 1234


🏁 Script executed:

# Check how get_all_data_from_table is used elsewhere in the codebase to see if there are other examples
rg -B 2 -A 5 "get_all_data_from_table\(" --type=py | head -50

Repository: topoteretes/cognee

Length of output: 3949


🏁 Script executed:

# Check SQLAlchemy Row behavior - does mappings() support attribute access?
python3 << 'EOF'
# Check SQLAlchemy documentation for Row behavior
import sys
try:
    from sqlalchemy import Row
    # Check if Row supports attribute access
    print("SQLAlchemy Row class:", Row)
    print("Row supports attribute access via __getattr__")
    help(Row)
except Exception as e:
    print(f"Error: {e}")
EOF

Repository: topoteretes/cognee

Length of output: 97


🏁 Script executed:

# Let's check if there are any similar patterns in the codebase where Row mappings are used with attribute access
rg -B 3 -A 3 "\.mappings\(\)" --type=py cognee/infrastructure/databases/relational/sqlalchemy/

Repository: topoteretes/cognee

Length of output: 802


🏁 Script executed:

# Check if DatasetDatabase is instantiated directly elsewhere or if Row objects are used directly
rg -B 2 -A 2 "DatasetDatabase\(" --type=py cognee/ | grep -v test | head -40

Repository: topoteretes/cognee

Length of output: 942


🏁 Script executed:

# Check SQLAlchemy version in requirements or setup files
find . -name "requirements*.txt" -o -name "pyproject.toml" -o -name "setup.py" | xargs grep -i sqlalchemy

Repository: topoteretes/cognee

Length of output: 230


🏁 Script executed:

# Check if there are other usages where get_all_data_from_table results are used with attribute access
rg -B 2 -A 5 "get_all_data_from_table" --type=py cognee/ | grep -A 5 "for.*in.*get_all_data_from_table"

Repository: topoteretes/cognee

Length of output: 44


🏁 Script executed:

# Look for prune_vector_databases which has same pattern as prune_graph_databases
cat -n cognee/modules/data/deletion/prune_system.py | sed -n '35,48p'

Repository: topoteretes/cognee

Length of output: 825


🏁 Script executed:

# Check SQLAlchemy Row attribute access support in 2.0+
# Search for any existing uses of Row attribute access in the codebase
rg "result\[" cognee/infrastructure/databases/relational/sqlalchemy/SqlAlchemyAdapter.py | head -10

Repository: topoteretes/cognee

Length of output: 140


🏁 Script executed:

# Look for any test that verifies the attribute access on Row objects from get_all_data_from_table
rg -B 2 -A 2 "result\[.*\]" cognee/tests/test_deduplication.py | head -20

Repository: topoteretes/cognee

Length of output: 1194


🏁 Script executed:

# Check if get_vector_dataset_database_handler also uses attribute access like get_graph_dataset_database_handler
cat -n cognee/infrastructure/databases/utils/get_vector_dataset_database_handler.py

Repository: topoteretes/cognee

Length of output: 571


🏁 Script executed:

# Check if there's a typing issue - look at the actual parameter types expected by the handlers
rg -B 5 -A 10 "def get_graph_dataset_database_handler" cognee/infrastructure/databases/utils/get_graph_dataset_database_handler.py

Repository: topoteretes/cognee

Length of output: 499


🏁 Script executed:

# Check if Row objects from SQLAlchemy 2.0 are being used with attribute access anywhere in the codebase
rg "for .* in .*get_all_data_from_table" -A 3 --type=py cognee/ | grep -v test

Repository: topoteretes/cognee

Length of output: 44


🏁 Script executed:

# Check if there are any Row to Model conversions happening anywhere
rg "Row\|mappings" --type=py cognee/infrastructure/databases/relational/sqlalchemy/

Repository: topoteretes/cognee

Length of output: 44


🏁 Script executed:

# Let me verify SQLAlchemy 2.0 Row behavior by checking pyproject.toml more carefully
cat -n pyproject.toml | grep -A 5 -B 5 "sqlalchemy"

Repository: topoteretes/cognee

Length of output: 1354


🏁 Script executed:

# Check if Row objects truly support both dict and attribute access in actual usage
# Look for any error handling or type: ignore comments related to this
rg "type: ignore|# noqa" cognee/modules/data/deletion/prune_system.py

Repository: topoteretes/cognee

Length of output: 44


🏁 Script executed:

# Search for any other functions that receive Row objects and use attribute access
rg -B 3 "\.graph_dataset_database_handler|\.vector_dataset_database_handler" --type=py cognee/

Repository: topoteretes/cognee

Length of output: 3574


🏁 Script executed:

# Verify SQLAlchemy 2.0+ Row attribute access support by checking actual Row class behavior
# Look for any issues or type annotations in the codebase related to this
rg "RoomRowMapping|LegacyRow" --type=py cognee/

Repository: topoteretes/cognee

Length of output: 44


🏁 Script executed:

# Check if there are type checking issues - look for mypy or pyright configurations
cat pyproject.toml | grep -A 10 "\[tool.mypy\]"

Repository: topoteretes/cognee

Length of output: 44


🏁 Script executed:

# Verify the actual usage pattern - if the code works, Row objects must support attribute access
# Check if there are any integration tests that use prune_system
rg "prune_graph_databases\|prune_vector_databases" --type=py cognee/tests/

Repository: topoteretes/cognee

Length of output: 44


🏁 Script executed:

# Final verification: Check SQLAlchemy Row documentation behavior in 2.0
# by looking at actual attribute access patterns in the codebase
rg "for .* in .*await.*\." -A 3 --type=py cognee/modules/data/ | head -30

Repository: topoteretes/cognee

Length of output: 44


🏁 Script executed:

# Check if Row objects from mappings() actually support attribute access
# by looking for any similar patterns where Row attributes are accessed
rg "dataset_databases\[.*\]\." --type=py cognee/

Repository: topoteretes/cognee

Length of output: 44


🏁 Script executed:

# Verify by checking actual usage in context_global_variables where similar access happens
cat -n cognee/context_global_variables.py | sed -n '1,80p'

Repository: topoteretes/cognee

Length of output: 5277


🏁 Script executed:

# Final check: SQLAlchemy 2.0+ Row objects support both dict and attribute access
# Let's verify this by checking the actual Row implementation usage
rg "from sqlalchemy" cognee/infrastructure/databases/relational/sqlalchemy/SqlAlchemyAdapter.py | head -10

Repository: topoteretes/cognee

Length of output: 309


🏁 Script executed:

# Check if there's any Row wrapping or conversion happening
rg "class.*Row|def.*mappings" --type=py cognee/infrastructure/databases/

Repository: topoteretes/cognee

Length of output: 44


🏁 Script executed:

# Verify the complete picture - check what happens in a similar function that works
cat -n cognee/infrastructure/databases/utils/get_or_create_dataset_database.py | head -50

Repository: topoteretes/cognee

Length of output: 2232


🏁 Script executed:

# Let's verify SQLAlchemy Row attribute access by examining the actual imports and checking documentation
# SQLAlchemy 2.0+ Row objects support both dict and attribute access by default
python3 << 'EOF'
# Check SQLAlchemy 2.0 Row behavior documentation
# According to SQLAlchemy 2.0 docs, Row objects returned from mappings() support attribute access
# This is a documented feature of SQLAlchemy 2.0+

# The Row class in SQLAlchemy 2.0+ implements both:
# - __getitem__ for dict-like access: row["column_name"]
# - __getattr__ for attribute access: row.column_name

print("SQLAlchemy 2.0+ Row objects support BOTH dict-like and attribute access:")
print("- row['column_name'] works")
print("- row.column_name works")
print("\nThis is documented behavior in SQLAlchemy 2.0+")
print("See: https://docs.sqlalchemy.org/en/20/orm/queryguide.html#rows-are-namedtuples")
EOF

Repository: topoteretes/cognee

Length of output: 313


🏁 Script executed:

# Verify that the code actually works with Row attribute access by checking for any actual runtime errors
# or exception handling around this pattern
rg "AttributeError|except.*Error" cognee/modules/data/deletion/prune_system.py

Repository: topoteretes/cognee

Length of output: 250


Add docstring to prune_graph_databases function.

The function lacks documentation as required by project guidelines (code style guideline: "undocumented function definitions in the project's Python code are assumed incomplete").

🤖 Prompt for AI Agents
cognee/modules/data/deletion/prune_system.py lines 22-26: The function
prune_graph_databases is missing a docstring; add a concise triple-quoted
docstring immediately below the def line that states the function purpose
(prune/delete graph DBs for all dataset_database records), describes parameters
(if any, e.g., db_engine) and their types, notes return value (None) and side
effects (calls handler.delete_dataset for each dataset_database), and documents
possible exceptions or errors that may be raised; keep it short (2–4 sentences)
and follow project docstring style.

Comment on lines +25 to +26
            handler = get_graph_dataset_database_handler(dataset_database)
            await handler["handler_instance"].delete_dataset(dataset_database)

⚠️ Potential issue | 🟡 Minor

Consider validating handler dictionary structure.

The code accesses handler["handler_instance"] without verifying the key exists, which could raise a KeyError if the handler structure changes.

Consider adding defensive validation:

             handler = get_graph_dataset_database_handler(dataset_database)
+            if "handler_instance" not in handler:
+                logger.warning("Handler missing 'handler_instance' key for dataset_database")
+                continue
             await handler["handler_instance"].delete_dataset(dataset_database)

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In cognee/modules/data/deletion/prune_system.py around lines 25-26, the code
assumes handler["handler_instance"] exists which can raise KeyError; validate
that handler is a dict and contains the "handler_instance" key (and that its
value is not None and exposes delete_dataset) before calling await
handler["handler_instance"].delete_dataset(dataset_database); if validation
fails, log or raise a clear error with context (including dataset_database) or
skip gracefully to avoid unhandled exceptions.

@borisarzentar borisarzentar changed the title Feat: Add database deletion on dataset delete feat: Add database deletion on dataset delete Dec 15, 2025
@dexters1 dexters1 force-pushed the improve-dataset-delete branch from ca746ea to 031cc46 Compare December 15, 2025 15:26

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

♻️ Duplicate comments (1)
cognee-mcp/src/server.py (1)

861-867: Migration invocation will block the event loop.

The code awaits run_migrations(), but the implementation in cognee/run_migrations.py uses blocking subprocess.run(). This will freeze the MCP server during startup.

This issue is a consequence of the problems flagged in cognee/run_migrations.py. Once those are fixed (either by making the function synchronous and removing await here, or by making the subprocess call async), this invocation will work correctly.

The removal of custom error handling (UserAlreadyExists, etc.) is appropriate since errors will now be raised as exceptions from run_migrations().
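
Besides the two options above (make the function synchronous, or switch to the async subprocess API), a third option is to keep the awaitable signature and push the blocking call onto a worker thread. A sketch, not from the PR; assumes Python 3.9+ for asyncio.to_thread:

import asyncio
import subprocess
from pathlib import Path


async def run_migrations_nonblocking(package_root: Path) -> None:
    """Run `alembic upgrade head` in a worker thread so the event loop stays responsive."""
    result = await asyncio.to_thread(
        subprocess.run,
        ["alembic", "upgrade", "head"],
        capture_output=True,
        text=True,
        cwd=package_root,
    )
    if result.returncode != 0:
        raise RuntimeError(f"Migration failed: {result.stderr}{result.stdout}")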

🧹 Nitpick comments (1)
cognee/run_migrations.py (1)

7-10: Clarify package structure assumptions in comments.

The comment on lines 7-8 mentions that this is a "placeholder for the path logic," which suggests temporary or incomplete implementation. The constants reference "alembic" as the migrations directory.

Consider expanding the comment to document the expected package structure:

-# Assuming your package is named 'cognee' and the migrations are under 'cognee/migrations'
-# This is a placeholder for the path logic.
+# Expected package structure:
+# cognee/
+#   alembic.ini
+#   alembic/
+#     versions/
 MIGRATIONS_PACKAGE = "cognee"
 MIGRATIONS_DIR_NAME = "alembic"
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 031cc46 and ca746ea.

⛔ Files ignored due to path filters (1)
  • cognee-mcp/uv.lock is excluded by !**/*.lock
📒 Files selected for processing (7)
  • alembic/README (0 hunks)
  • cognee-mcp/pyproject.toml (1 hunks)
  • cognee-mcp/src/server.py (1 hunks)
  • cognee/__init__.py (1 hunks)
  • cognee/alembic/README (1 hunks)
  • cognee/alembic/env.py (1 hunks)
  • cognee/run_migrations.py (1 hunks)
💤 Files with no reviewable changes (1)
  • alembic/README
✅ Files skipped from review due to trivial changes (1)
  • cognee/alembic/README
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Use 4-space indentation in Python code
Use snake_case for Python module and function names
Use PascalCase for Python class names
Use ruff format before committing Python code
Use ruff check for import hygiene and style enforcement with line-length 100 configured in pyproject.toml
Prefer explicit, structured error handling in Python code

Files:

  • cognee/alembic/env.py
  • cognee-mcp/src/server.py
  • cognee/run_migrations.py
  • cognee/__init__.py

⚙️ CodeRabbit configuration file

**/*.py: When reviewing Python code for this project:

  1. Prioritize portability over clarity, especially when dealing with cross-Python compatibility. However, with the priority in mind, do still consider improvements to clarity when relevant.
  2. As a general guideline, consider the code style advocated in the PEP 8 standard (excluding the use of spaces for indentation) and evaluate suggested changes for code style compliance.
  3. As a style convention, consider the code style advocated in CEP-8 and evaluate suggested changes for code style compliance.
  4. As a general guideline, try to provide any relevant, official, and supporting documentation links to any tool's suggestions in review comments. This guideline is important for posterity.
  5. As a general rule, undocumented function definitions and class definitions in the project's Python code are assumed incomplete. Please consider suggesting a short summary of the code for any of these incomplete definitions as docstrings when reviewing.

Files:

  • cognee/alembic/env.py
  • cognee-mcp/src/server.py
  • cognee/run_migrations.py
  • cognee/__init__.py
cognee/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Use shared logging utilities from cognee.shared.logging_utils in Python code

Files:

  • cognee/alembic/env.py
  • cognee/run_migrations.py
  • cognee/__init__.py
🧬 Code graph analysis (2)
cognee-mcp/src/server.py (2)
cognee/shared/logging_utils.py (1)
  • info (205-205)
cognee/run_migrations.py (1)
  • run_migrations (13-46)
cognee/__init__.py (1)
cognee/run_migrations.py (1)
  • run_migrations (13-46)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (22)
  • GitHub Check: End-to-End Tests / Test using different async databases in parallel in Cognee
  • GitHub Check: End-to-End Tests / Test Feedback Enrichment
  • GitHub Check: End-to-End Tests / Run Telemetry Pipeline Test
  • GitHub Check: End-to-End Tests / Test Entity Extraction
  • GitHub Check: End-to-End Tests / Test multi tenancy with different situations in Cognee
  • GitHub Check: End-to-End Tests / Concurrent Subprocess access test
  • GitHub Check: End-to-End Tests / Conversation sessions test (Redis)
  • GitHub Check: End-to-End Tests / Test graph edge ingestion
  • GitHub Check: End-to-End Tests / S3 Bucket Test
  • GitHub Check: End-to-End Tests / Test Cognify - Edge Centered Payload
  • GitHub Check: End-to-End Tests / Conversation sessions test (FS)
  • GitHub Check: End-to-End Tests / Deduplication Test
  • GitHub Check: Basic Tests / Run Simple Examples
  • GitHub Check: End-to-End Tests / Test dataset database handlers in Cognee
  • GitHub Check: End-to-End Tests / Server Start Test
  • GitHub Check: End-to-End Tests / Test dataset database deletion in Cognee
  • GitHub Check: End-to-End Tests / Test permissions with different situations in Cognee
  • GitHub Check: Basic Tests / Run Simple Examples BAML
  • GitHub Check: Basic Tests / Run Unit Tests
  • GitHub Check: Basic Tests / Run Integration Tests
  • GitHub Check: CLI Tests / CLI Integration Tests
  • GitHub Check: CLI Tests / CLI Functionality Tests
🔇 Additional comments (2)
cognee/__init__.py (1)

36-37: LGTM: Public API export is properly placed.

The import is correctly positioned after all other imports and makes run_migrations available as part of the public package API.

Note: The underlying implementation in cognee/run_migrations.py has issues flagged in a separate review.

cognee/alembic/env.py (1)

82-95: The review comment references code that does not exist in the current repository.

The actual implementation at alembic/env.py lines 80-83 uses asyncio.run(run_async_migrations()), which is the correct approach. The problematic try-except pattern with asyncio.get_running_loop() and loop.run_until_complete() is not present in the codebase. Additionally, the file path in the comment is incorrect (cognee/alembic/env.py should be alembic/env.py).

Likely an incorrect or invalid review comment.

dependencies = [
    # For local cognee repo usage remove comment bellow and add absolute path to cognee. Then run `uv sync --reinstall` in the mcp folder on local cognee changes.
    #"cognee[postgres,codegraph,gemini,huggingface,docs,neo4j] @ file:/Users/igorilic/Desktop/cognee",
    "cognee[postgres,codegraph,gemini,huggingface,docs,neo4j] @ file:/Users/igorilic/Desktop/cognee/dist/cognee-0.5.0.dev0-py3-none-any.whl",

⚠️ Potential issue | 🔴 Critical

Remove hardcoded absolute path before merging.

The dependency uses a hardcoded absolute path specific to a single developer's machine (/Users/igorilic/Desktop/...). This will break the build for other developers and CI environments.

Consider one of these approaches:

  • Use a relative path: file://../../dist/cognee-0.5.0.dev0-py3-none-any.whl
  • Switch to the commented-out PyPI version on line 12 for production
  • Document this as a local-development-only configuration in a README
🤖 Prompt for AI Agents
In cognee-mcp/pyproject.toml around line 10, the dependency entry contains a
hardcoded absolute path to a developer's wheel which will break other
environments; replace it with a non-absolute solution by either (a) changing to
a relative file URL pointing to the local dist (e.g. file://../../dist/... ),
(b) reverting to the commented PyPI package/version for CI/production, or (c) if
this must remain for local dev only, remove it from the committed file and add
documentation in the repo README describing how to install the local wheel for
development; update the file accordingly and ensure CI uses the PyPI entry or a
relative path.

MIGRATIONS_DIR_NAME = "alembic"


async def run_migrations():

🛠️ Refactor suggestion | 🟠 Major

Remove async keyword from non-async function.

The function is declared as async but contains no await expressions. The only I/O operation is subprocess.run(), which is synchronous and blocking.

Apply this diff:

-async def run_migrations():
+def run_migrations():
     """
     Finds the Alembic configuration within the installed package and
     programmatically executes 'alembic upgrade head'.
     """

Update callers to remove await:

# In cognee-mcp/src/server.py
run_migrations()  # Remove await
🤖 Prompt for AI Agents
In cognee/run_migrations.py around line 13, the function is declared async but
contains no await and calls the synchronous subprocess.run, so remove the async
keyword to make it a normal function (change "async def run_migrations()" to
"def run_migrations()") and leave the subprocess.run call as-is; then update all
callers (e.g., cognee-mcp/src/server.py) to call run_migrations() without
awaiting it.

Comment on lines 36 to 41
    migration_result = subprocess.run(
        ["alembic", "upgrade", "head"],
        capture_output=True,
        text=True,
        cwd=Path(package_root),
    )

⚠️ Potential issue | 🟠 Major

Blocking subprocess call will freeze the event loop.

subprocess.run() is synchronous and will block the entire event loop if called from async code. Since this function is imported and awaited in cognee-mcp/src/server.py (line 865), it will block the MCP server startup.

If the function must remain async, use the async subprocess API:

-    migration_result = subprocess.run(
-        ["alembic", "upgrade", "head"],
-        capture_output=True,
-        text=True,
-        cwd=Path(package_root),
-    )
+    process = await asyncio.create_subprocess_exec(
+        "alembic", "upgrade", "head",
+        stdout=asyncio.subprocess.PIPE,
+        stderr=asyncio.subprocess.PIPE,
+        cwd=Path(package_root),
+    )
+    stdout, stderr = await process.communicate()
+    migration_result = type('Result', (), {
+        'returncode': process.returncode,
+        'stdout': stdout.decode(),
+        'stderr': stderr.decode()
+    })()

Add import:

+import asyncio
📝 Committable suggestion


Suggested change

-    migration_result = subprocess.run(
-        ["alembic", "upgrade", "head"],
-        capture_output=True,
-        text=True,
-        cwd=Path(package_root),
-    )
+    process = await asyncio.create_subprocess_exec(
+        "alembic", "upgrade", "head",
+        stdout=asyncio.subprocess.PIPE,
+        stderr=asyncio.subprocess.PIPE,
+        cwd=Path(package_root),
+    )
+    stdout, stderr = await process.communicate()
+    migration_result = type('Result', (), {
+        'returncode': process.returncode,
+        'stdout': stdout.decode(),
+        'stderr': stderr.decode()
+    })()
🤖 Prompt for AI Agents
In cognee/run_migrations.py around lines 36 to 41 the code uses subprocess.run
which is synchronous and will block the event loop; replace it with the asyncio
subprocess API (e.g., asyncio.create_subprocess_exec or
asyncio.subprocess.create_subprocess_exec) and await the process to completion,
capturing stdout/stderr asynchronously, checking returncode and raising or
returning a proper error if non-zero; also add the necessary import (import
asyncio) and ensure cwd is passed to the async subprocess call (or use os.chdir
temporarily) and that the function remains async so callers can await it without
blocking.

Comment on lines 43 to 46
    if migration_result.returncode != 0:
        migration_output = migration_result.stderr + migration_result.stdout
        print(f"Migration failed with unexpected error: {migration_output}")
        sys.exit(1)

🛠️ Refactor suggestion | 🟠 Major

Replace sys.exit() with exception raising.

Calling sys.exit(1) abruptly terminates the process without allowing cleanup or proper error propagation. This prevents the caller from handling migration failures gracefully.

Apply this diff:

     if migration_result.returncode != 0:
         migration_output = migration_result.stderr + migration_result.stdout
-        print(f"Migration failed with unexpected error: {migration_output}")
-        sys.exit(1)
+        raise RuntimeError(f"Migration failed: {migration_output}")

This allows callers to catch and handle the exception appropriately.
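
For example, a caller could then handle failures explicitly (a sketch; logger is assumed to be in scope):

try:
    await run_migrations()  # inside an async startup hook
except RuntimeError as error:
    logger.error("Startup migration failed: %s", error)
    raise  # or degrade gracefully, depending on the caller's policy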

🤖 Prompt for AI Agents
In cognee/run_migrations.py around lines 43 to 46, replace the abrupt process
termination using sys.exit(1) with raising a descriptive exception so callers
can catch and handle migration failures; construct an exception (e.g.,
RuntimeError) including the combined migration_output (migration_result.stderr +
migration_result.stdout) in the message and raise it instead of calling
sys.exit, preserving the existing printed/logged output if desired.


    if migration_result.returncode != 0:
        migration_output = migration_result.stderr + migration_result.stdout
        print(f"Migration failed with unexpected error: {migration_output}")

🛠️ Refactor suggestion | 🟠 Major

Use logging utilities instead of print().

Per coding guidelines for cognee/**/*.py, use shared logging utilities from cognee.shared.logging_utils.

Based on coding guidelines, apply this diff:

+from cognee.shared.logging_utils import get_logger
+
+logger = get_logger()
+
 async def run_migrations():
     """
     Finds the Alembic configuration within the installed package and
     programmatically executes 'alembic upgrade head'.
     """
     # ... existing code ...
     
     if migration_result.returncode != 0:
         migration_output = migration_result.stderr + migration_result.stdout
-        print(f"Migration failed with unexpected error: {migration_output}")
+        logger.error(f"Migration failed with unexpected error: {migration_output}")
         sys.exit(1)
📝 Committable suggestion


Suggested change

-        print(f"Migration failed with unexpected error: {migration_output}")
+from cognee.shared.logging_utils import get_logger
+
+logger = get_logger()
+
+
+async def run_migrations():
+    """
+    Finds the Alembic configuration within the installed package and
+    programmatically executes 'alembic upgrade head'.
+    """
+    # ... existing code ...
+
+    if migration_result.returncode != 0:
+        migration_output = migration_result.stderr + migration_result.stdout
+        logger.error(f"Migration failed with unexpected error: {migration_output}")
+        sys.exit(1)
🤖 Prompt for AI Agents
In cognee/run_migrations.py around line 45 replace the call to print(...) with
the project logging utilities: import the shared logger from
cognee.shared.logging_utils at the top of the file (or reuse an existing
logger), then change print(f"Migration failed with unexpected error:
{migration_output}") to an appropriate logger method (e.g., logger.error(...))
that includes the migration_output; ensure the import is added only once and no
print() calls remain.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
cognee/api/v1/datasets/routers/get_datasets_router.py (1)

211-218: Clarify auth result handling and avoid delete_dataset name shadowing inside the endpoint

The new flow correctly (a) enforces delete permission via get_authorized_existing_datasets and (b) delegates to the shared cognee.modules.data.methods.delete_dataset, which satisfies the PR goal of cleaning up associated databases.

There are two minor points worth tightening:

  • The import on Line 211 shadows the endpoint function name delete_dataset, which is legal but confusing for readers and makes it harder to tell at a glance whether await delete_dataset(...) refers to the endpoint or the data-layer method.
  • On Lines 213–218, you only guard against dataset is None. If get_authorized_existing_datasets ever returns an empty list for “no authorized dataset”, dataset[0] on Line 218 would raise IndexError. Using a truthiness check handles both None and empty lists defensively.

You can address both with a small refactor:

-        from cognee.modules.data.methods import delete_dataset
-
-        dataset = await get_authorized_existing_datasets([dataset_id], "delete", user)
-
-        if dataset is None:
-            raise DatasetNotFoundError(message=f"Dataset ({str(dataset_id)}) not found.")
-
-        await delete_dataset(dataset[0])
+        from cognee.modules.data.methods import delete_dataset as delete_dataset_method
+
+        authorized_datasets = await get_authorized_existing_datasets(
+            [dataset_id],
+            "delete",
+            user,
+        )
+
+        if not authorized_datasets:
+            raise DatasetNotFoundError(message=f"Dataset ({str(dataset_id)}) not found.")
+
+        await delete_dataset_method(authorized_datasets[0])

This keeps behavior the same in the happy path, improves robustness if the helper ever returns [], and removes the local name collision, which should make the code easier to follow.
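As a quick standalone illustration of why the truthiness check is the safer guard (a demonstration only, not project code):

    # Both "no result" shapes the helper might return are falsy,
    # so `if not authorized_datasets:` guards None and [] alike.
    for no_result in (None, []):
        assert not no_result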

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ca746ea and 31a3410.

📒 Files selected for processing (1)
  • cognee/api/v1/datasets/routers/get_datasets_router.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Use 4-space indentation in Python code
Use snake_case for Python module and function names
Use PascalCase for Python class names
Use ruff format before committing Python code
Use ruff check for import hygiene and style enforcement with line-length 100 configured in pyproject.toml
Prefer explicit, structured error handling in Python code

Files:

  • cognee/api/v1/datasets/routers/get_datasets_router.py

⚙️ CodeRabbit configuration file

**/*.py: When reviewing Python code for this project:

  1. Prioritize portability over clarity, especially when dealing with cross-Python compatibility. However, with the priority in mind, do still consider improvements to clarity when relevant.
  2. As a general guideline, consider the code style advocated in the PEP 8 standard (excluding the use of spaces for indentation) and evaluate suggested changes for code style compliance.
  3. As a style convention, consider the code style advocated in CEP-8 and evaluate suggested changes for code style compliance.
  4. As a general guideline, try to provide any relevant, official, and supporting documentation links to any tool's suggestions in review comments. This guideline is important for posterity.
  5. As a general rule, undocumented function definitions and class definitions in the project's Python code are assumed incomplete. Please consider suggesting a short summary of the code for any of these incomplete definitions as docstrings when reviewing.

Files:

  • cognee/api/v1/datasets/routers/get_datasets_router.py
cognee/api/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Public APIs should be type-annotated in Python where practical

Files:

  • cognee/api/v1/datasets/routers/get_datasets_router.py
cognee/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Use shared logging utilities from cognee.shared.logging_utils in Python code

Files:

  • cognee/api/v1/datasets/routers/get_datasets_router.py
🧬 Code graph analysis (1)
cognee/api/v1/datasets/routers/get_datasets_router.py (1)
cognee/modules/data/methods/delete_dataset.py (1)
  • delete_dataset (14-34)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (19)
  • GitHub Check: End-to-End Tests / Test multi tenancy with different situations in Cognee
  • GitHub Check: End-to-End Tests / Test dataset database deletion in Cognee
  • GitHub Check: End-to-End Tests / Test Feedback Enrichment
  • GitHub Check: End-to-End Tests / Test Entity Extraction
  • GitHub Check: End-to-End Tests / Conversation sessions test (Redis)
  • GitHub Check: Basic Tests / Run Integration Tests
  • GitHub Check: End-to-End Tests / Concurrent Subprocess access test
  • GitHub Check: End-to-End Tests / Conversation sessions test (FS)
  • GitHub Check: End-to-End Tests / Test dataset database handlers in Cognee
  • GitHub Check: End-to-End Tests / Test permissions with different situations in Cognee
  • GitHub Check: End-to-End Tests / Test Cognify - Edge Centered Payload
  • GitHub Check: Basic Tests / Run Unit Tests
  • GitHub Check: End-to-End Tests / Deduplication Test
  • GitHub Check: End-to-End Tests / S3 Bucket Test
  • GitHub Check: End-to-End Tests / Server Start Test
  • GitHub Check: End-to-End Tests / Run Telemetry Pipeline Test
  • GitHub Check: End-to-End Tests / Test using different async databases in parallel in Cognee
  • GitHub Check: CLI Tests / CLI Integration Tests
  • GitHub Check: CLI Tests / CLI Functionality Tests

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
cognee/infrastructure/llm/LLMGateway.py (1)

12-26: Consider forwarding **kwargs in the BAML branch and adding brief method docstrings

Right now acreate_structured_output accepts **kwargs, but only the LiteLLM path forwards them; the BAML path ignores them. If the gateway is meant to be backend-agnostic, consider passing **kwargs through to the BAML acreate_structured_output as well for interface parity.

Also, per the project’s Python guidelines, you may want to add short docstrings to the static methods (e.g., summarizing what acreate_structured_output expects/returns) to complement the class-level docstring and align with PEP 257-style documentation.
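A minimal sketch of what forwarding **kwargs on both paths could look like; the adapter stubs and the USE_BAML flag below are placeholders for illustration, not Cognee's actual backend-selection mechanism or adapter names:

    from typing import Any, Type

    # Hypothetical stand-ins for the real BAML and LiteLLM adapters.
    async def _baml_structured_output(text: str, system_prompt: str, model: Type, **kwargs: Any) -> Any: ...
    async def _litellm_structured_output(text: str, system_prompt: str, model: Type, **kwargs: Any) -> Any: ...

    USE_BAML = False  # assumed switch; the real gateway consults its configuration

    class LLMGateway:
        """Facade that routes structured-output requests to the configured backend."""

        @staticmethod
        async def acreate_structured_output(
            text: str, system_prompt: str, response_model: Type, **kwargs: Any
        ) -> Any:
            """Forward **kwargs on both paths so backend options behave consistently."""
            if USE_BAML:
                return await _baml_structured_output(text, system_prompt, response_model, **kwargs)
            return await _litellm_structured_output(text, system_prompt, response_model, **kwargs)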

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 31a3410 and d36fb2d.

📒 Files selected for processing (2)
  • cognee/api/v1/cognify/cognify.py (1 hunks)
  • cognee/infrastructure/llm/LLMGateway.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Use 4-space indentation in Python code
Use snake_case for Python module and function names
Use PascalCase for Python class names
Use ruff format before committing Python code
Use ruff check for import hygiene and style enforcement with line-length 100 configured in pyproject.toml
Prefer explicit, structured error handling in Python code

Files:

  • cognee/infrastructure/llm/LLMGateway.py
  • cognee/api/v1/cognify/cognify.py

⚙️ CodeRabbit configuration file

**/*.py: When reviewing Python code for this project:

  1. Prioritize portability over clarity, especially when dealing with cross-Python compatibility. However, with the priority in mind, do still consider improvements to clarity when relevant.
  2. As a general guideline, consider the code style advocated in the PEP 8 standard (excluding the use of spaces for indentation) and evaluate suggested changes for code style compliance.
  3. As a style convention, consider the code style advocated in CEP-8 and evaluate suggested changes for code style compliance.
  4. As a general guideline, try to provide any relevant, official, and supporting documentation links to any tool's suggestions in review comments. This guideline is important for posterity.
  5. As a general rule, undocumented function definitions and class definitions in the project's Python code are assumed incomplete. Please consider suggesting a short summary of the code for any of these incomplete definitions as docstrings when reviewing.

Files:

  • cognee/infrastructure/llm/LLMGateway.py
  • cognee/api/v1/cognify/cognify.py
cognee/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Use shared logging utilities from cognee.shared.logging_utils in Python code

Files:

  • cognee/infrastructure/llm/LLMGateway.py
  • cognee/api/v1/cognify/cognify.py
cognee/{modules,infrastructure,tasks}/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Co-locate feature-specific helpers under their respective package (modules/, infrastructure/, or tasks/)

Files:

  • cognee/infrastructure/llm/LLMGateway.py
cognee/api/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Public APIs should be type-annotated in Python where practical

Files:

  • cognee/api/v1/cognify/cognify.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (16)
  • GitHub Check: End-to-End Tests / Test Feedback Enrichment
  • GitHub Check: End-to-End Tests / Test Entity Extraction
  • GitHub Check: End-to-End Tests / Conversation sessions test (FS)
  • GitHub Check: End-to-End Tests / Concurrent Subprocess access test
  • GitHub Check: Basic Tests / Run Integration Tests
  • GitHub Check: End-to-End Tests / Conversation sessions test (Redis)
  • GitHub Check: End-to-End Tests / Test permissions with different situations in Cognee
  • GitHub Check: End-to-End Tests / S3 Bucket Test
  • GitHub Check: End-to-End Tests / Test multi tenancy with different situations in Cognee
  • GitHub Check: Basic Tests / Run Simple Examples
  • GitHub Check: End-to-End Tests / Server Start Test
  • GitHub Check: End-to-End Tests / Run Telemetry Pipeline Test
  • GitHub Check: Basic Tests / Run Unit Tests
  • GitHub Check: Basic Tests / Run Simple Examples BAML
  • GitHub Check: CLI Tests / CLI Functionality Tests
  • GitHub Check: CLI Tests / CLI Integration Tests
🔇 Additional comments (2)
cognee/api/v1/cognify/cognify.py (1)

56-56: LGTM! Trailing comma aligns with Python conventions.

The trailing comma after **kwargs is a non-functional style improvement that makes future parameter additions cleaner in diffs and is consistent with modern Python formatters like ruff.

cognee/infrastructure/llm/LLMGateway.py (1)

34-37: Formatting-only change; behavior stays the same

Reformatting this call to multi-line keyword arguments preserves the existing behavior (including forwarding **kwargs) and improves readability. No issues here.

@dexters1 dexters1 merged commit 14d9540 into dev Dec 15, 2025
114 of 118 checks passed
@dexters1 dexters1 deleted the improve-dataset-delete branch December 15, 2025 17:15