feat: Add dataset database handler info #1887
Conversation
Please make sure all the checkboxes are checked:
Walkthrough

Adds two non-nullable handler identifier columns to the DatasetDatabase schema and migration, populates them in handler implementations and tests, and changes runtime code to select handlers from per-record fields instead of global config.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Suggested reviewers
Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
…andler from instance and not config
Actionable comments posted: 0
🧹 Nitpick comments (4)
cognee/infrastructure/databases/graph/neo4j_driver/Neo4jAuraDevDatasetDatabaseHandler.py (1)
129-135: Persisting Neo4j Aura handler identifier

Including `"graph_dataset_database_handler": "neo4j_aura_dev"` in the payload is consistent with the new DatasetDatabase column and prune_system's per-record selection. To reduce future string drift between handlers, migrations, and tests, consider centralizing these handler IDs in a shared enum/constants module instead of repeating raw literals.
cognee/modules/data/deletion/prune_system.py (1)

16-25: Pruning now depends on per-record handler IDs – verify handler registry coverage

Using `dataset_database.graph_dataset_database_handler`/`vector_dataset_database_handler` to choose the handler makes pruning correctly per-dataset, but it also means:

- Every persisted handler ID must correspond to a key in `supported_dataset_database_handlers`.
- Existing rows populated via the migration defaults must match a real handler that can safely delete those datasets (especially if you already had non-default handlers in use).

Consider adding a defensive check (e.g., `dict.get` with a logged warning or a clearer exception) for unknown handler IDs; see the sketch after this comment. Please also verify on a real database that the new columns' values and the handler registry are consistent before relying on prune in production. Based on learnings, this matters because prune_system is called by core pipelines.

Also applies to: 41-49
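As an illustration of that defensive check, a hedged sketch assuming a `get_logger` helper in `cognee.shared.logging_utils` (the exact logging API is an assumption) and the `supported_dataset_database_handlers` registry from the PR:

```python
from cognee.shared.logging_utils import get_logger  # assumed helper name

logger = get_logger(__name__)


def select_graph_handler(dataset_database, supported_dataset_database_handlers):
    """Return the registry entry for this record's persisted graph handler ID."""
    handler_id = dataset_database.graph_dataset_database_handler
    handler = supported_dataset_database_handlers.get(handler_id)
    if handler is None:
        # Surface the bad or missing handler ID instead of a bare KeyError.
        logger.warning("Unknown graph dataset database handler: %r", handler_id)
        raise ValueError(f"Unsupported graph dataset database handler: {handler_id!r}")
    return handler
```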
alembic/versions/46a6ce2bd2b2_expand_dataset_database_with_json_.py (2)
52-65: New handler columns are wired through SQLite table recreation

The additions of `vector_dataset_database_handler` ("lancedb") and `graph_dataset_database_handler` ("kuzu") to both SQLite recreate helpers, plus their inclusion in the `INSERT ... SELECT` copy statements, keep existing data intact while enforcing non-null defaults during table rebuilds. One tiny nit: the comment "Add LanceDB as the default graph dataset database handler" is misleading for the vector handler case and could be reworded to avoid confusion.

Also applies to: 99-100, 139-152, 186-187
228-274: Migration logic for handler defaults looks sound – please run on both dialects

The upgrade path conditionally adds the two handler columns (with defaults) before any constraint manipulation, and the downgrade path removes them after restoring unique constraints, which is structurally correct for both PostgreSQL and SQLite. Because this migration now owns both the JSON connection fields and the handler fields, it would be good to:

- Test `alembic upgrade 46a6ce2bd2b2` and `downgrade 76625596c5c3` on representative PostgreSQL and SQLite databases (with existing `dataset_database` rows) to confirm the new columns are populated as expected and that unique constraints are correctly reapplied.
- Sanity-check that the chosen defaults `"lancedb"`/`"kuzu"` match the actual handlers used for any pre-existing datasets.

For reference, see the Alembic docs on `batch_alter_table` and SQLite limitations, and the SQLAlchemy docs on `Column.server_default`:

- Alembic batch operations: https://alembic.sqlalchemy.org/en/latest/batch.html
- SQLAlchemy Column.server_default: https://docs.sqlalchemy.org/en/latest/core/metadata.html#sqlalchemy.schema.Column.params.server_default

Also applies to: 330-333
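For reference, a minimal Alembic sketch of the batch-mode pattern this migration relies on; the table and column names match the PR, but the surrounding migration code is simplified, not the PR's actual file:

```python
# Simplified sketch: adding a non-nullable column with a server default via
# Alembic's batch mode, which also works on SQLite by recreating the table.
import sqlalchemy as sa
from alembic import op


def upgrade() -> None:
    with op.batch_alter_table("dataset_database") as batch_op:
        batch_op.add_column(
            sa.Column(
                "vector_dataset_database_handler",
                sa.String(),
                nullable=False,
                server_default="lancedb",  # backfills existing rows on rebuild
            )
        )


def downgrade() -> None:
    with op.batch_alter_table("dataset_database") as batch_op:
        batch_op.drop_column("vector_dataset_database_handler")
```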
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- alembic/versions/46a6ce2bd2b2_expand_dataset_database_with_json_.py (7 hunks)
- cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py (1 hunks)
- cognee/infrastructure/databases/graph/neo4j_driver/Neo4jAuraDevDatasetDatabaseHandler.py (1 hunks)
- cognee/infrastructure/databases/vector/lancedb/LanceDBDatasetDatabaseHandler.py (1 hunks)
- cognee/modules/data/deletion/prune_system.py (2 hunks)
- cognee/modules/users/models/DatasetDatabase.py (1 hunks)
- cognee/tests/test_dataset_database_handler.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (5)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
**/*.py: Use 4-space indentation in Python code
Use snake_case for Python module and function names
Use PascalCase for Python class names
Use ruff format before committing Python code
Use ruff check for import hygiene and style enforcement with line-length 100 configured in pyproject.toml
Prefer explicit, structured error handling in Python code
Files:
- cognee/tests/test_dataset_database_handler.py
- cognee/infrastructure/databases/graph/neo4j_driver/Neo4jAuraDevDatasetDatabaseHandler.py
- cognee/modules/data/deletion/prune_system.py
- alembic/versions/46a6ce2bd2b2_expand_dataset_database_with_json_.py
- cognee/modules/users/models/DatasetDatabase.py
- cognee/infrastructure/databases/vector/lancedb/LanceDBDatasetDatabaseHandler.py
- cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py
⚙️ CodeRabbit configuration file
**/*.py: When reviewing Python code for this project:
- Prioritize portability over clarity, especially when dealing with cross-Python compatibility. However, with the priority in mind, do still consider improvements to clarity when relevant.
- As a general guideline, consider the code style advocated in the PEP 8 standard (excluding the use of spaces for indentation) and evaluate suggested changes for code style compliance.
- As a style convention, consider the code style advocated in CEP-8 and evaluate suggested changes for code style compliance.
- As a general guideline, try to provide any relevant, official, and supporting documentation links to any tool's suggestions in review comments. This guideline is important for posterity.
- As a general rule, undocumented function definitions and class definitions in the project's Python code are assumed incomplete. Please consider suggesting a short summary of the code for any of these incomplete definitions as docstrings when reviewing.
Files:
- cognee/tests/test_dataset_database_handler.py
- cognee/infrastructure/databases/graph/neo4j_driver/Neo4jAuraDevDatasetDatabaseHandler.py
- cognee/modules/data/deletion/prune_system.py
- alembic/versions/46a6ce2bd2b2_expand_dataset_database_with_json_.py
- cognee/modules/users/models/DatasetDatabase.py
- cognee/infrastructure/databases/vector/lancedb/LanceDBDatasetDatabaseHandler.py
- cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py
cognee/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Use shared logging utilities from cognee.shared.logging_utils in Python code
Files:
- cognee/tests/test_dataset_database_handler.py
- cognee/infrastructure/databases/graph/neo4j_driver/Neo4jAuraDevDatasetDatabaseHandler.py
- cognee/modules/data/deletion/prune_system.py
- cognee/modules/users/models/DatasetDatabase.py
- cognee/infrastructure/databases/vector/lancedb/LanceDBDatasetDatabaseHandler.py
- cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py
cognee/tests/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
cognee/tests/**/*.py: Place Python tests under cognee/tests/ organized by type (unit, integration, cli_tests)
Name Python test files test_*.py and use pytest.mark.asyncio for async tests
Files:
cognee/tests/test_dataset_database_handler.py
cognee/tests/*
⚙️ CodeRabbit configuration file
cognee/tests/*: When reviewing test code:
- Prioritize portability over clarity, especially when dealing with cross-Python compatibility. However, with the priority in mind, do still consider improvements to clarity when relevant.
- As a general guideline, consider the code style advocated in the PEP 8 standard (excluding the use of spaces for indentation) and evaluate suggested changes for code style compliance.
- As a style convention, consider the code style advocated in CEP-8 and evaluate suggested changes for code style compliance, pointing out any violations discovered.
- As a general guideline, try to provide any relevant, official, and supporting documentation links to any tool's suggestions in review comments. This guideline is important for posterity.
- As a project rule, Python source files with names prefixed by the string "test_" and located in the project's "tests" directory are the project's unit-testing code. It is safe, albeit a heuristic, to assume these are considered part of the project's minimal acceptance testing unless a justifying exception to this assumption is documented.
- As a project rule, any files without extensions and with names prefixed by either the string "check_" or the string "test_", and located in the project's "tests" directory, are the project's non-unit test code. "Non-unit test" in this context refers to any type of testing other than unit testing, such as (but not limited to) functional testing, style linting, regression testing, etc. It can also be assumed that non-unit testing code is usually written as Bash shell scripts.
Files:
cognee/tests/test_dataset_database_handler.py
cognee/{modules,infrastructure,tasks}/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Co-locate feature-specific helpers under their respective package (modules/, infrastructure/, or tasks/)
Files:
- cognee/infrastructure/databases/graph/neo4j_driver/Neo4jAuraDevDatasetDatabaseHandler.py
- cognee/modules/data/deletion/prune_system.py
- cognee/modules/users/models/DatasetDatabase.py
- cognee/infrastructure/databases/vector/lancedb/LanceDBDatasetDatabaseHandler.py
- cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py
🧠 Learnings (1)
📚 Learning: 2025-10-11T04:18:24.594Z
Learnt from: Vattikuti-Manideep-Sitaram
Repo: topoteretes/cognee PR: 1529
File: cognee/api/v1/cognify/ontology_graph_pipeline.py:69-74
Timestamp: 2025-10-11T04:18:24.594Z
Learning: The code_graph_pipeline.py and ontology_graph_pipeline.py both follow an established pattern of calling cognee.prune.prune_data() and cognee.prune.prune_system(metadata=True) at the start of pipeline execution. This appears to be intentional behavior for pipeline operations in the cognee codebase.
Applied to files:
cognee/modules/data/deletion/prune_system.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (21)
- GitHub Check: End-to-End Tests / Test Entity Extraction
- GitHub Check: End-to-End Tests / Test multi tenancy with different situations in Cognee
- GitHub Check: End-to-End Tests / Test Cognify - Edge Centered Payload
- GitHub Check: End-to-End Tests / Test graph edge ingestion
- GitHub Check: End-to-End Tests / Test Feedback Enrichment
- GitHub Check: End-to-End Tests / Test permissions with different situations in Cognee
- GitHub Check: End-to-End Tests / Test dataset database handlers in Cognee
- GitHub Check: End-to-End Tests / Conversation sessions test (Redis)
- GitHub Check: End-to-End Tests / Test using different async databases in parallel in Cognee
- GitHub Check: Basic Tests / Run Simple Examples
- GitHub Check: End-to-End Tests / Concurrent Subprocess access test
- GitHub Check: End-to-End Tests / Conversation sessions test (FS)
- GitHub Check: End-to-End Tests / S3 Bucket Test
- GitHub Check: End-to-End Tests / Server Start Test
- GitHub Check: End-to-End Tests / Run Telemetry Pipeline Test
- GitHub Check: End-to-End Tests / Deduplication Test
- GitHub Check: Basic Tests / Run Integration Tests
- GitHub Check: Basic Tests / Run Unit Tests
- GitHub Check: Basic Tests / Run Simple Examples BAML
- GitHub Check: CLI Tests / CLI Functionality Tests
- GitHub Check: CLI Tests / CLI Integration Tests
🔇 Additional comments (4)
cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py (1)
45-51: Kuzu handler payload aligned with new schema

Returning `"graph_dataset_database_handler": "kuzu"` keeps the runtime payload in sync with the new non-nullable column and the migration default, so prune_system can reliably pick the correct handler for these datasets.
34-40: LanceDB handler now records its identifier

Adding `"vector_dataset_database_handler": "lancedb"` to the returned dict matches the new DatasetDatabase column and the migration default, so per-dataset pruning can reliably resolve the correct handler.
21-22: Ensure all DatasetDatabase creation paths populate new handler fields

Making `graph_dataset_database_handler` and `vector_dataset_database_handler` non-nullable is appropriate, but it means every insert of `DatasetDatabase` must now explicitly set both fields (or rely on DB defaults only by omitting the columns from INSERT). Please double-check any direct ORM creation code, fixtures, or tests that instantiate `DatasetDatabase` to ensure they provide these values, otherwise you may hit NOT NULL violations at runtime.
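One way to satisfy both concerns is to pair the non-nullable columns with server-side defaults, as in this sketch; the real model presumably inherits from the project's shared Base and has more columns, and the defaults shown simply mirror the migration's values:

```python
# Sketch: non-nullable at the schema level, with server defaults so inserts
# that omit the columns still succeed. Not the PR's actual model code.
from sqlalchemy import Column, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class DatasetDatabase(Base):
    __tablename__ = "dataset_database"

    id = Column(String, primary_key=True)
    vector_dataset_database_handler = Column(
        String, nullable=False, server_default="lancedb"
    )
    graph_dataset_database_handler = Column(
        String, nullable=False, server_default="kuzu"
    )
```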
32-37: Tests correctly exercise per-dataset handler identifiers

Having the test handlers return `vector_dataset_database_handler`/`graph_dataset_database_handler` values that match the environment configuration (`custom_lancedb_handler`, `custom_kuzu_handler`) keeps the tests aligned with the new schema and pruning behavior, and should catch regressions if these fields stop being persisted.

Also applies to: 46-52
Actionable comments posted: 0
🧹 Nitpick comments (2)
cognee/infrastructure/databases/utils/resolve_dataset_database_connection_info.py (2)
4-10: Defensive lookup for `vector_dataset_database_handler` would improve robustness

`dataset_database.vector_dataset_database_handler` is used as a direct key into `supported_dataset_database_handlers`. If the DB field is `NULL`, stale, or otherwise not in the mapping, this will raise a bare `KeyError` without context, which can be painful to debug.

Consider adding explicit, structured error handling (and optionally a one-line docstring for this helper) to surface a clearer error, e.g.:

```diff
- handler = supported_dataset_database_handlers[dataset_database.vector_dataset_database_handler]
- return await handler["handler_instance"].resolve_dataset_connection_info(dataset_database)
+ handler_key = dataset_database.vector_dataset_database_handler
+ try:
+     handler = supported_dataset_database_handlers[handler_key]
+ except KeyError as exc:
+     raise ValueError(
+         f"Unsupported vector dataset database handler: {handler_key!r}"
+     ) from exc
+
+ return await handler["handler_instance"].resolve_dataset_connection_info(dataset_database)
```

This aligns with the guideline to prefer explicit, structured error handling in Python code. As per coding guidelines.
13-19: Mirror defensive handling for `graph_dataset_database_handler` (and consider deduping helpers)

The same `KeyError` risk exists for `dataset_database.graph_dataset_database_handler`. It would be good to mirror the defensive pattern from the vector helper so both paths fail with a clear, high-level error if the handler id is unknown.

You could also optionally factor the common logic (handler lookup + error handling + `resolve_dataset_connection_info` call) into a single internal helper that takes the attribute name (`"vector_dataset_database_handler"`/`"graph_dataset_database_handler"`) to avoid duplication. As per coding guidelines.
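A sketch of what that deduplication could look like, assuming the `supported_dataset_database_handlers` registry is importable; all function names here are illustrative, not the PR's actual API:

```python
# Assumed import; the registry's actual module path is not shown in the PR.
# from cognee.infrastructure.databases import supported_dataset_database_handlers


async def _resolve_connection_info(dataset_database, handler_attr: str):
    """Resolve connection info via the handler ID stored in handler_attr."""
    handler_key = getattr(dataset_database, handler_attr)
    try:
        handler = supported_dataset_database_handlers[handler_key]
    except KeyError as exc:
        raise ValueError(
            f"Unsupported dataset database handler: {handler_key!r}"
        ) from exc
    return await handler["handler_instance"].resolve_dataset_connection_info(
        dataset_database
    )


async def resolve_vector_connection_info(dataset_database):
    return await _resolve_connection_info(dataset_database, "vector_dataset_database_handler")


async def resolve_graph_connection_info(dataset_database):
    return await _resolve_connection_info(dataset_database, "graph_dataset_database_handler")
```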
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- cognee/infrastructure/databases/utils/resolve_dataset_database_connection_info.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
**/*.py: Use 4-space indentation in Python code
Use snake_case for Python module and function names
Use PascalCase for Python class names
Use ruff format before committing Python code
Use ruff check for import hygiene and style enforcement with line-length 100 configured in pyproject.toml
Prefer explicit, structured error handling in Python code
Files:
cognee/infrastructure/databases/utils/resolve_dataset_database_connection_info.py
⚙️ CodeRabbit configuration file
**/*.py: When reviewing Python code for this project:
- Prioritize portability over clarity, especially when dealing with cross-Python compatibility. However, with the priority in mind, do still consider improvements to clarity when relevant.
- As a general guideline, consider the code style advocated in the PEP 8 standard (excluding the use of spaces for indentation) and evaluate suggested changes for code style compliance.
- As a style convention, consider the code style advocated in CEP-8 and evaluate suggested changes for code style compliance.
- As a general guideline, try to provide any relevant, official, and supporting documentation links to any tool's suggestions in review comments. This guideline is important for posterity.
- As a general rule, undocumented function definitions and class definitions in the project's Python code are assumed incomplete. Please consider suggesting a short summary of the code for any of these incomplete definitions as docstrings when reviewing.
Files:
cognee/infrastructure/databases/utils/resolve_dataset_database_connection_info.py
cognee/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Use shared logging utilities from cognee.shared.logging_utils in Python code
Files:
cognee/infrastructure/databases/utils/resolve_dataset_database_connection_info.py
cognee/{modules,infrastructure,tasks}/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Co-locate feature-specific helpers under their respective package (modules/, infrastructure/, or tasks/)
Files:
cognee/infrastructure/databases/utils/resolve_dataset_database_connection_info.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (17)
- GitHub Check: End-to-End Tests / Concurrent Subprocess access test
- GitHub Check: End-to-End Tests / Conversation sessions test (Redis)
- GitHub Check: End-to-End Tests / Conversation sessions test (FS)
- GitHub Check: End-to-End Tests / Test permissions with different situations in Cognee
- GitHub Check: End-to-End Tests / Server Start Test
- GitHub Check: End-to-End Tests / Test Entity Extraction
- GitHub Check: End-to-End Tests / Test multi tenancy with different situations in Cognee
- GitHub Check: End-to-End Tests / Deduplication Test
- GitHub Check: Basic Tests / Run Integration Tests
- GitHub Check: End-to-End Tests / Test Feedback Enrichment
- GitHub Check: Basic Tests / Run Unit Tests
- GitHub Check: Basic Tests / Run Formatting Check
- GitHub Check: End-to-End Tests / Test using different async databases in parallel in Cognee
- GitHub Check: End-to-End Tests / S3 Bucket Test
- GitHub Check: End-to-End Tests / Run Telemetry Pipeline Test
- GitHub Check: CLI Tests / CLI Functionality Tests
- GitHub Check: CLI Tests / CLI Integration Tests
Description
Add info on the dataset database handler used for each dataset database.
Type of Change
Screenshots/Videos (if applicable)
Pre-submission Checklist
DCO Affirmation
I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin.
Summary by CodeRabbit
New Features
Chores
Tests