feat: Add Support ChromaDB #622

Conversation
Walkthrough

This pull request adds support for the new vector database provider "chromadb." It updates configuration files and provider options in the settings, introduces a new adapter implementation for interacting with ChromaDB, and adjusts the engine creation logic to conditionally instantiate the new adapter. Additionally, tests for ChromaDB functionality are provided, along with a Docker service configuration and a new dependency declaration.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant C as Client
    participant E as create_vector_engine
    participant A as ChromaDBAdapter
    C->>E: Request vector engine (provider="chromadb")
    E->>E: Validate provider and import "chromadb" module
    E-->>C: Raise ImportError (if module missing) or proceed
    E->>A: Instantiate ChromaDBAdapter with config parameters
    A-->>E: Return adapter instance
    E-->>C: Return ChromaDBAdapter instance
```
✅ GitGuardian: There are no secrets present in this pull request anymore. If these secrets were true positives and are still valid, we highly recommend you revoke them.
Actionable comments posted: 0
🧹 Nitpick comments (9)
pyproject.toml (1)
60-60: Consider adding ChromaDB to extras section.

The dependency is correctly added, but unlike other vector database providers (qdrant, weaviate, etc.), chromadb hasn't been added to the `[tool.poetry.extras]` section. This would allow users to selectively install this dependency.

```diff
 [tool.poetry.extras]
 filesystem = ["s3fs", "botocore"]
 weaviate = ["weaviate-client"]
 qdrant = ["qdrant-client"]
 neo4j = ["neo4j"]
 postgres = ["psycopg2", "pgvector", "asyncpg"]
 notebook = ["notebook", "ipykernel", "overrides", "ipywidgets", "jupyterlab", "jupyterlab_widgets", "jupyterlab-server", "jupyterlab-git"]
 langchain = ["langsmith", "langchain_text_splitters"]
 llama-index = ["llama-index-core"]
 gemini = ["google-generativeai"]
 huggingface = ["transformers"]
 ollama = ["transformers"]
 mistral = ["mistral-common"]
 deepeval = ["deepeval"]
 posthog = ["posthog"]
 falkordb = ["falkordb"]
 groq = ["groq"]
 milvus = ["pymilvus"]
 docs = ["unstructured"]
 codegraph = ["fastembed", "transformers", "tree-sitter", "tree-sitter-python"]
 evals = ["plotly", "gdown"]
 gui = ["pyside6", "qasync"]
 graphiti = ["graphiti-core"]
+chromadb = ["chromadb"]
```

cognee/infrastructure/databases/vector/create_vector_engine.py (3)
84-99: ChromaDB implementation looks good but has extra whitespace.

The ChromaDB integration follows the established pattern of other vector database providers and includes proper error handling. However, there's an extra blank line on line 84 that should be removed for consistency with the rest of the file.

```diff
 elif vector_db_provider == "falkordb":
     if not (vector_db_url and vector_db_port):
         raise EnvironmentError("Missing requred FalkorDB credentials!")

     from ..hybrid.falkordb.FalkorDBAdapter import FalkorDBAdapter

     return FalkorDBAdapter(
         database_url=vector_db_url,
         database_port=vector_db_port,
         embedding_engine=embedding_engine,
     )
-
 elif vector_db_provider == "chromadb":
     try:
         import chromadb
     except ImportError:
         raise ImportError(
             "ChromaDB is not installed. Please install it with 'pip install chromadb'"
         )

     from .chromadb.ChromaDBAdapter import ChromaDBAdapter

     return ChromaDBAdapter(
         url=vector_db_url,
         api_key=vector_db_key,
         embedding_engine=embedding_engine,
     )
```
85-91: Consider adding credential validation.

Unlike other database providers like Weaviate and Qdrant, the ChromaDB implementation doesn't validate whether the required credentials are provided. Consider adding credential validation for consistency.

```diff
 elif vector_db_provider == "chromadb":
+    if not vector_db_url:
+        raise EnvironmentError("Missing required ChromaDB credentials!")
+
     try:
         import chromadb
     except ImportError:
         raise ImportError(
             "ChromaDB is not installed. Please install it with 'pip install chromadb'"
         )
```
89-91: Typo in error message.

Other error messages in this file contain a typo ("requred" instead of "required"). Your new error message is spelled correctly; just be aware of the inconsistency with the existing messages.

```python
try:
    import chromadb
except ImportError:
    raise ImportError(
        "ChromaDB is not installed. Please install it with 'pip install chromadb'"
    )
```

Note: If you prefer to fix all the typos in the file, it would be better to create a separate PR for that.
docker-compose.yml (3)
63-82: Commented-out `chromadb` service configuration looks fine.

However, note that if you plan to enable this service in the future, ensure the environment variables align with production settings and credentials (e.g., tokens, secrets).

🪛 YAMLlint (1.35.1) — docker-compose.yml: [error] 82-82: trailing spaces

82-82: Remove trailing spaces to comply with linting. Below is a proposed fix:

```diff
-  # - "3002:8000" 
+  # - "3002:8000"
```

103-103: Remove trailing spaces to comply with linting. Below is a proposed fix:

```diff
-# UNCOMMENT IF USING CHROMADB 
+# UNCOMMENT IF USING CHROMADB
```
cognee/tests/test_chromadb.py (1)
153-154: Use a constant for the expected history length.

Relying on a magic number `8` can make the test brittle if the search history length changes in the future. Consider referencing a descriptive constant or deriving the expected value.
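For instance, a minimal sketch (the constant name is hypothetical):

```python
# One entry is expected per cognee.search(...) call made earlier in the test.
EXPECTED_SEARCH_CALLS = 8

history = await cognee.get_search_history()
assert len(history) == EXPECTED_SEARCH_CALLS, "Search history is not correct."
```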
cognee/infrastructure/databases/vector/chromadb/ChromaDBAdapter.py (1)
236-238: Honor the `normalized` parameter or remove it.

Currently, normalization is always performed. If a user passes `normalized=False`, it is disregarded. You could wrap the normalization logic in a conditional or remove the parameter to avoid confusion.

```diff
     normalized: bool = True,
 ...
-    normalized_values = normalize_distances(vector_list)
+    if normalized:
+        normalized_values = normalize_distances(vector_list)
+    else:
+        normalized_values = [result["_distance"] for result in vector_list]
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`poetry.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (8)
- `.env.template` (1 hunks)
- `.gitignore` (1 hunks)
- `cognee/api/v1/settings/routers/get_settings_router.py` (1 hunks)
- `cognee/infrastructure/databases/vector/chromadb/ChromaDBAdapter.py` (1 hunks)
- `cognee/infrastructure/databases/vector/create_vector_engine.py` (1 hunks)
- `cognee/tests/test_chromadb.py` (1 hunks)
- `docker-compose.yml` (2 hunks)
- `pyproject.toml` (1 hunks)
🧰 Additional context used
🪛 YAMLlint (1.35.1) — docker-compose.yml: [error] 82-82: trailing spaces; [error] 103-103: trailing spaces
🔇 Additional comments (3)
.gitignore (1)
189-191: LGTM! Appropriate data directory exclusion.

The addition of `.chromadb_data/` to .gitignore is appropriate and follows the project's pattern of excluding database data directories from version control.

.env.template (1)
33-33: LGTM! Clear environment variable documentation.

The updated comment correctly lists all supported vector database providers, including the newly added "chromadb" option.
cognee/api/v1/settings/routers/get_settings_router.py (1)
30-30: Looks great!

Adding `"chromadb"` as a valid provider is consistent with the new adapter implementation.
Caution
Inline review comments failed to post. This is likely due to GitHub's limits when posting large numbers of comments. If you are seeing this consistently it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.
Actionable comments posted: 1
🧹 Nitpick comments (11)
README.md (2)
11-12: Demo Link Positioning & Extra Punctuation
The repositioned demo link now appears before the "Learn more" and "Join Discord" links, which improves its visibility. However, there is an isolated period on line 12 that feels out of place. Consider removing it or incorporating it more naturally with the surrounding content.
92-92: Clarify Default Environment Variables Description
This line clearly explains that if no specific database is specified, the defaults (SQLite, LanceDB, and NetworkX) will be used. For enhanced readability, you might consider splitting this long sentence into two or formatting the options as a bulleted list, particularly as new providers like ChromaDB are now supported elsewhere in the project.

CONTRIBUTING.md (5)
1-6: Invitation Tone and Clarity Improvements
The revised welcome message and introductory paragraph set an inviting tone. One minor suggestion: in line 4, consider adding an article (e.g., “the cognee community”) for grammatical clarity.
32-36: Community Engagement Guidelines
The "Community Channels" section effectively emphasizes joining the Discord community and participating in events. A minor stylistic note: you might consider adding punctuation at the end of each bullet for consistency.
37-41: Direct Contact Information
The "Direct Contact" section provides clear methods for getting in touch. For improved usability, consider turning the email address into a clickable mailto link (e.g., `<mailto:[email protected]>`).
42-42: Response Time Clarity
The guideline on response times is informative. However, a small punctuation adjustment is recommended: inserting a comma after "Discord channel" (i.e., "...using our Discord channel, where the whole community can help!") can enhance readability.

🪛 LanguageTool: [uncategorized] ~42-~42: Possible missing comma found — "...consider using our Discord channel where the whole community can help!" (AI_HYDRA_LEO_MISSING_COMMA)
128-128: Appreciative Sign-off Tone
The closing line effectively conveys gratitude and enthusiasm. Consider moderating the number of exclamation marks to maintain a professional tone if consistency is desired across the document.

🪛 LanguageTool: [style] ~128-~128: Using many exclamation marks might seem excessive (5 exclamation marks for a text of 3,216 characters). (EN_EXCESSIVE_EXCLAMATION)
cognee/eval_framework/metrics_dashboard.py (1)
80-92: Improved HTML generation with dynamic table structure

The HTML generation is now more flexible, as it dynamically creates table headers and rows based on the available data. This makes the code more maintainable and adaptable to changes in the metrics structure.
One potential improvement would be to add HTML escaping for user-provided content to prevent potential XSS vulnerabilities.
Consider adding HTML escaping for user-provided values:
```diff
- {"".join(f"<td>{value}</td>" for value in item.values())}
+ {"".join(f"<td>{html.escape(str(value))}</td>" for value in item.values())}
```

And add the import at the top:

```diff
+ import html
```

cognee/modules/retrieval/completion_retriever.py (1)
29-29: Potential limitation with multiple retrieval results

While the `top_k` parameter allows retrieving multiple document chunks, the `get_context` method still only returns the first chunk's text. Consider enhancing this method to return a concatenated or otherwise processed collection of all retrieved chunks when `top_k > 1`.

```diff
- return found_chunks[0].payload["text"]
+ if self.top_k == 1:
+     return found_chunks[0].payload["text"]
+ else:
+     # Join multiple chunks with separators for better context
+     return "\n\n---\n\n".join([chunk.payload["text"] for chunk in found_chunks])
```

cognee/eval_framework/answer_generation/run_question_answering_module.py (1)
35-35: Add docstring to elaborate usage of `top_k`.

Consider explaining the effect of passing `None` vs. a positive integer for `top_k`, so that maintainers can easily understand how to configure the function.

cognee/eval_framework/evaluation/deep_eval_adapter.py (1)
37-37: Consider using an empty list instead of None

To maintain consistency, you may want to store an empty list if `golden_context` is absent, so that the metric is always dealing with list inputs.
🛑 Comments failed to post (1)
cognee/eval_framework/evaluation/metrics/context_coverage.py (1)
8-33: 💡 Verification agent
🧩 Analysis chain
Handle potential empty context or retrieval_context

If `test_case.context` or `test_case.retrieval_context` is empty, indexing `[0]` will raise an error. Please ensure upstream code always populates these lists, or consider adding a defensive check.
🏁 Script executed:

```bash
#!/bin/bash
# Attempt to locate usage of ContextCoverageMetric and see if context is guaranteed to be non-empty
rg -A 5 "ContextCoverageMetric"
```

Length of output: 1634
Attention: Defensive Check Required for Empty Context Lists

The current implementation in `ContextCoverageMetric.measure()` assumes that both `test_case.context` and `test_case.retrieval_context` contain at least one element. However, if either list is empty, accessing `[0]` will raise an `IndexError`. Please either:

- Ensure upstream guarantees non-empty lists: Confirm that all callers of this metric always provide non-empty `context` and `retrieval_context`, or
- Add a defensive guard in the metric: Introduce a validation check at the beginning of the `measure` method to handle cases when these lists are empty, for example:

```python
if not test_case.context or not test_case.retrieval_context:
    raise ValueError("Test case context or retrieval_context cannot be empty.")
```

Review the surrounding code to decide which approach best fits your error handling strategy.
hajdul88 left a comment
First of all, thanks for the contribution, it's a solid implementation. I've opened a new PR to run our tests; you can follow it here:
I'm fixing the Ruff issues there and will probably add your test to GA. I also added some minor comments—nothing crucial, just small things.
docker-compose.yml
Outdated

```yaml
# chromadb:
#   image: chromadb/chroma:0.6.3
#   container_name: chromadb
#   profiles:
```
You don't have to comment this out. The `profiles` config makes sure that the chromadb service starts only if the `chromadb` profile is added during docker compose: `docker compose --profile chromadb up`
```toml
langdetect = "1.0.9"
posthog = {version = "^3.5.0", optional = true}
lancedb = "0.16.0"
chromadb = "^0.6.0"
```
Please add chromadb as an optional dependency that is installed via the chromadb extra.
cognee/tests/test_chromadb.py
Outdated

```python
history = await cognee.get_search_history()
assert len(history) == 8, "Search history is not correct."

results = await brute_force_triplet_search("What is a quantum computer?")
```
You probably don't need this one, it is used in other searches.
cognee/tests/test_chromadb.py
Outdated

```python
results = await brute_force_triplet_search("What is a quantum computer?")
assert len(results) > 0

await test_local_file_deletion(text, explanation_file_path)
```
This is probably also not needed; it is tested in other tests.
Actionable comments posted: 1
🧹 Nitpick comments (4)
cognee/infrastructure/databases/vector/chromadb/ChromaDBAdapter.py (4)
31-47: Consider validated serialization or sanitized storage.

Storing lists and dictionaries as JSON strings is a practical approach, but be mindful of security implications (e.g., malicious JSON content placed in the dictionary or list). It may be prudent to validate or sanitize data before serialization, to mitigate downstream parsing issues.
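One possible guard, sketched here as an assumption rather than the adapter's actual code: reject values that don't survive JSON serialization before they reach storage.

```python
import json

def safe_json_dumps(value):
    # Fail fast on values that cannot be serialized, instead of storing
    # something that will break on restore.
    try:
        return json.dumps(value)
    except (TypeError, ValueError) as err:
        raise ValueError(f"Metadata value is not JSON-serializable: {err}") from err
```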
200-200: Offer option for more comprehensive search.

The code caps the results at 100 (`n_results=100`). Depending on the dataset size and use case, this could omit valuable global matches. Consider making this value configurable or removing the cap entirely if your dataset and performance profile permit it.
316-317: Restore metadata for consistent data structure.

This code returns raw metadata instead of running it through `restore_data_from_chroma`. Consider aligning it with the approach used in other methods for uniform data structures across the interface.

```diff
- payload=metadata,
+ payload=restore_data_from_chroma(metadata),
```
344-347: Rename method for clarity.

Based on previous review comments, you may rename `get_table_names` to `get_collection_names` to align with ChromaDB's domain model and improve clarity.

```diff
- async def get_table_names(self):
+ async def get_collection_names(self):
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
`cognee/infrastructure/databases/vector/chromadb/ChromaDBAdapter.py` (1 hunks)
🔇 Additional comments (1)
cognee/infrastructure/databases/vector/chromadb/ChromaDBAdapter.py (1)
66-77: Confirmed: Removing 6 characters is correct.

The suffix `__dict` or `__list` is exactly 6 characters. The slicing operation `key[:-6]` is appropriate and aligns with the previous feedback. No further changes needed.
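For reference, a hedged sketch of the restore path this implies (not the adapter's verbatim code): keys carrying the 6-character `__dict`/`__list` suffix are JSON-decoded back to their original type.

```python
import json

def restore_data_from_chroma_sketch(metadata: dict) -> dict:
    restored = {}
    for key, value in metadata.items():
        if key.endswith("__dict") or key.endswith("__list"):
            restored[key[:-6]] = json.loads(value)  # strip the 6-char suffix
        else:
            restored[key] = value
    return restored
```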
Actionable comments posted: 1
🧹 Nitpick comments (2)
cognee/infrastructure/databases/vector/chromadb/ChromaDBAdapter.py (2)
183-231: Potential performance concern with large collections.

When computing distances for all elements, the code retrieves the entire collection (`n_results=collection_count`). For very large collections, this could cause memory or performance issues. Consider setting a reasonable limit or batching the retrieval if you're working with potentially large datasets.

You might implement a paginated approach or a maximum threshold to limit the total number of items processed at once.
346-349: Rename method to reflect collections rather than tables.

The function is named `get_table_names` but returns collection names. The rest of the code also refers to "collections," so this can cause confusion.

```diff
- async def get_table_names(self):
+ async def get_collection_names(self):
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
`cognee/infrastructure/databases/vector/chromadb/ChromaDBAdapter.py` (1 hunks)
🔇 Additional comments (2)
cognee/infrastructure/databases/vector/chromadb/ChromaDBAdapter.py (2)
159-166: Handle potential missing 'index_fields' key in metadata.

Here, the code uses `data_point.metadata["index_fields"]` directly. If the `"index_fields"` key is missing or null in the metadata, this will cause a `KeyError`. Consider adding an existence check or a default value before accessing `"index_fields"`.

Can you confirm that every `DataPoint` in your application includes `"index_fields"` in the `metadata` attribute? If not, please ensure to handle or validate this field to avoid exceptions.
314-314: Unify distance normalization approach.

Here, distance is converted to similarity using `similarity = 1.0 - min(distance, 2.0) / 2.0`, which differs from the `normalize_distances(...)` approach used elsewhere (e.g., in `search`). For consistency and clarity, consider using a single normalization method across all queries.
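A single shared helper would make the choice explicit; for example, a sketch built from the formula quoted above:

```python
def distance_to_score(distance: float) -> float:
    # Map a cosine distance in [0, 2] to a similarity score in [1, 0],
    # matching the inline formula used in batch_search.
    return 1.0 - min(distance, 2.0) / 2.0
```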
Actionable comments posted: 0
🧹 Nitpick comments (2)
docker-compose.yml (1)
81-81: Remove trailing whitespace.

There is a trailing whitespace on this line that should be removed to fix the YAMLlint warning.

🪛 YAMLlint (1.35.1): [error] 81-81: trailing spaces
cognee/tests/test_chromadb.py (1)
69-162: Consider restructuring the main function for better test organization.

The `main` function contains numerous assertions and test scenarios mixed together. Consider refactoring this into separate, focused test functions for each aspect being tested:
- Data addition
- Document retrieval
- Different search types (INSIGHTS, CHUNKS, etc.)
- Search history
- Data cleanup
This would improve test readability, maintainability, and make failures easier to diagnose.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- `cognee/tests/test_chromadb.py` (1 hunks)
- `docker-compose.yml` (2 hunks)
🧰 Additional context used
🪛 YAMLlint (1.35.1) — docker-compose.yml: [error] 81-81: trailing spaces
🔇 Additional comments (9)
docker-compose.yml (2)
64-80: Good implementation of the ChromaDB service.

The service configuration includes all necessary components:
- Specific versioning (0.6.3)
- Appropriate authentication setup with environment variables
- Proper network configuration
- Profile configuration for optional startup
- Volume mapping for data persistence
- Port exposure (3002:8000)
103-103: Good addition of the ChromaDB volume.

The volume declaration for `chromadb_data` has been properly added to support the new ChromaDB service.

cognee/tests/test_chromadb.py (7)
122-124: Remove print statements from test.

These print statements should be removed, as they're not necessary for automated tests and were noted in a previous review.

```diff
- print("\n\nExtracted sentences are:\n")
- for result in search_results:
-     print(f"{result}\n")
```
130-132: Remove print statements from test.

Print statements should be removed from automated tests.

```diff
- print("\n\nExtracted chunks are:\n")
- for result in search_results:
-     print(f"{result}\n")
```
140-141: Remove print statements from test.

Print statements should be removed from automated tests.

```diff
- print("Completion result is:")
- print(graph_completion)
```
147-149: Remove print statements from test.

Print statements should be removed from automated tests.

```diff
- print("\n\nExtracted summaries are:\n")
- for result in search_results:
-     print(f"{result}\n")
```
154-154: Consider removing redundant test call.

As noted in a previous review, this test call might be redundant, as it's likely covered by other tests.

```diff
- await test_local_file_deletion(text, explanation_file_path)
```
14-49: Good implementation of file deletion tests.

The test function thoroughly validates both cases:
- Files created by cognee are properly deleted
- Files not created by cognee are preserved
This ensures the ChromaDB adapter correctly handles file system operations during data deletion.
51-66: Good implementation of document retrieval tests.

The test function properly verifies document retrieval functionality:
- With specific dataset filtering
- Without dataset filtering (retrieving all documents)
This ensures the ChromaDB adapter correctly integrates with user permissions.
Actionable comments posted: 1
🧹 Nitpick comments (3)
cognee/infrastructure/databases/vector/chromadb/ChromaDBAdapter.py (3)
31-83: Ensure robust handling of nested or complex data structures.

The functions `process_data_for_chroma` and `restore_data_from_chroma` perform JSON serialization for dicts and lists. Deeply nested structures will still be serialized to strings without further recursion checks. If this is intended, make a note. Otherwise, consider recursively processing nested objects for more precise control.
183-231: Be mindful of performance with large collections.

`get_distance_from_collection_elements` sets `n_results` to the entire collection size (`collection_count`). This loads all items, which may be expensive for large collections. Consider pagination or specifying a practical limit to prevent potential performance bottlenecks.
359-362: Rename method to align with ChromaDB naming.

The method name `get_table_names` might be confusing in the context of ChromaDB, as collections are the relevant concept. Consider renaming it to `get_collection_names` for clarity, unless it must conform to an interface requirement.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
`cognee/infrastructure/databases/vector/chromadb/ChromaDBAdapter.py` (1 hunks)
🔇 Additional comments (2)
cognee/infrastructure/databases/vector/chromadb/ChromaDBAdapter.py (2)
21-29: Consider validating the `text` and `metadata` fields.

While `IndexSchema` explicitly expects a `text` field and sets a default `metadata` value with `index_fields`, there's no check to ensure the field in `index_fields` actually exists or is non-empty. If a data point lacks the required fields, it could trigger runtime errors. Consider adding validation or fallback behavior.
153-167: Guard against missing `index_fields` in `metadata`.

In `index_data_points`, you call `text=getattr(data_point, data_point.metadata["index_fields"][0])`. If `metadata["index_fields"]` is absent or has fewer elements, this code will raise `KeyError` or `IndexError`. Consider safely checking for the key or defaulting to a known attribute.
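One defensive shape, sketched under the assumption that falling back to a `text` attribute is acceptable:

```python
def get_index_text(data_point, default_field: str = "text") -> str:
    # Avoid KeyError/IndexError when 'index_fields' is missing or empty.
    index_fields = (data_point.metadata or {}).get("index_fields") or []
    field_name = index_fields[0] if index_fields else default_field
    return getattr(data_point, field_name)
```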
@borisarzentar @hajdul88 I have addressed the comments, please take another look.
OK from my side; I didn't check @borisarzentar's comments.
Actionable comments posted: 0
♻️ Duplicate comments (1)
cognee/tests/test_chromadb.py (1)
70-70: 🛠️ Refactor suggestion: Remove debug print statement

Remove the print statement that outputs the configuration.

```diff
- print(cognee.config)
```
🧹 Nitpick comments (7)
docker-compose.yml (2)
82-82: Remove trailing whitespace

There is trailing whitespace on this line that should be removed.

🪛 YAMLlint (1.35.1): [error] 82-82: trailing spaces
103-105: Remove trailing whitespace and uncomment volumes section

There is trailing whitespace on line 103. Also, consistent with the service configuration, you should uncomment the volumes section, since it will only be used when the ChromaDB profile is activated.

```diff
-# UNCOMMENT IF USING CHROMADB
-# volumes:
-#   chromadb_data:
+# UNCOMMENT IF USING CHROMADB
+volumes:
+  chromadb_data:
```

🪛 YAMLlint (1.35.1): [error] 103-103: trailing spaces
cognee/tests/test_chromadb.py (4)
128-130: Remove or disable debug print statements

These print statements are useful for debugging but should be removed or disabled in the test file. Consider using logging at the debug level instead.
Also applies to: 136-138, 146-147, 153-155
47-48: Fix assertion error message

The error message states "Data location doesn't exists" when it should be "Data location still exists".

```diff
-assert os.path.exists(data.raw_data_location), (
-    f"Data location doesn't exists: {data.raw_data_location}"
+assert os.path.exists(data.raw_data_location), (
+    f"Data location still exists: {data.raw_data_location}"
```
14-49: Test logic looks good, but consider breaking down into smaller test functions

The `test_local_file_deletion` function performs two distinct tests:

- Deleting data created by cognee (and verifying files are removed)
- Deleting data not created by cognee (and verifying files remain)

Consider separating these into two distinct test functions for better clarity and to follow the single responsibility principle; a sketch follows.
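A sketch of the split (names are hypothetical; the bodies would reuse the corresponding halves of the existing test):

```python
async def test_deletion_of_cognee_created_files(text: str):
    ...  # add text via cognee, delete it, assert its raw files are gone

async def test_preservation_of_external_files(file_path: str):
    ...  # add a pre-existing file, delete the data, assert the file remains
```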
69-171: Main test function is comprehensive but consider restructuring

The `main` function serves as a comprehensive integration test but mixes test setup, execution, assertions, and cleanup. Consider refactoring this to follow a more standard test structure:

- Setup (configure database, create test data)
- Test execution (various search operations)
- Assertions (verify results)
- Cleanup (prune data)

This would make the test more maintainable and easier to understand; see the skeleton sketched after this list.
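A rough skeleton of that structure, assuming pytest with pytest-asyncio (fixture and test names here are hypothetical):

```python
import pytest

@pytest.fixture
async def configured_cognee():
    # Setup: point cognee at ChromaDB and ingest the test documents.
    ...
    yield
    # Cleanup: prune test data and system metadata.
    ...

@pytest.mark.asyncio
async def test_chunk_search(configured_cognee):
    # Execution + assertion for a single search type.
    ...
```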
cognee/infrastructure/databases/vector/chromadb/ChromaDBAdapter.py (1)
225-233: Ensure the `normalized` parameter is honored.

Currently, the `search` method always normalizes distances without conditionally checking the `normalized` parameter. Consider introducing logic to skip normalization if `normalized` is False, ensuring users retain explicit control over how distances are handled.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`poetry.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (7)
- `.github/workflows/test_chromadb.yml` (1 hunks)
- `cognee/api/v1/settings/routers/get_settings_router.py` (1 hunks)
- `cognee/infrastructure/databases/vector/chromadb/ChromaDBAdapter.py` (1 hunks)
- `cognee/infrastructure/databases/vector/create_vector_engine.py` (1 hunks)
- `cognee/tests/test_chromadb.py` (1 hunks)
- `docker-compose.yml` (2 hunks)
- `pyproject.toml` (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- pyproject.toml
- cognee/api/v1/settings/routers/get_settings_router.py
🧰 Additional context used
🪛 YAMLlint (1.35.1) — docker-compose.yml: [error] 82-82: trailing spaces; [error] 103-103: trailing spaces
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: run_simple_example_test
🔇 Additional comments (9)
cognee/infrastructure/databases/vector/create_vector_engine.py (1)
85-100: Implementation for ChromaDB looks good!

The implementation follows the same pattern used for other vector database providers, which makes it consistent and maintainable. The error message if ChromaDB is not installed is helpful and provides clear instructions for installation.
.github/workflows/test_chromadb.yml (1)
1-64: Well-structured CI workflow for ChromaDB testing

The workflow is properly configured with:
- Appropriate triggers (manual and pull request events)
- Service container for ChromaDB with the correct image version
- Required environment variables for testing
- Proper dependency setup with Poetry
This will provide good test coverage for the new ChromaDB integration.
docker-compose.yml (1)
64-81: Uncomment the ChromaDB service instead of using a comment directive

As mentioned in previous review comments, you don't need to comment out this ChromaDB service configuration. The `profiles` configuration ensures that it will only start when explicitly requested with `docker compose --profile chromadb up`.

cognee/tests/test_chromadb.py (3)
160-161: Remove redundant test

This test for `brute_force_triplet_search` is likely redundant, as it's used in other search tests.

163-163: Consider relocating test function call

This test is likely redundant, as it's tested elsewhere. If it's necessary to include it here, consider adding a comment explaining why it's being tested again in this context.

51-67: Test logic for document retrieval looks good

The `test_getting_of_documents` function properly tests document retrieval both with and without a dataset filter.

cognee/infrastructure/databases/vector/chromadb/ChromaDBAdapter.py (3)
67-67: Likely off-by-one error when removing '__dict' / '__list' suffixes.

These lines remove 7 characters instead of 6, whereas the suffix (`__dict` or `__list`) is only 6 characters long. This may cause a truncated character and potentially break the restoration of original keys. Note that a similar comment has already been raised in past reviews.

Also applies to: 76-76
287-330: Unify distance normalization between `batch_search` and `search`.

`batch_search` transforms the raw distance into a similarity measure (`1.0 - min(distance, 2.0) / 2.0`), whereas `search` relies on `normalize_distances(...)`. This disparity in approaches may lead to inconsistent scoring across methods. A refactoring suggestion to unify or deduplicate distance handling logic has been raised previously.
347-347: Use a more descriptive method name.

As suggested in prior comments, consider renaming `get_table_names` to `get_collection_names` for clarity and consistency with ChromaDB terminology.
Actionable comments posted: 0
♻️ Duplicate comments (2)
cognee/infrastructure/databases/vector/chromadb/ChromaDBAdapter.py (2)
67-76: ⚠️ Potential issue: Suffix removal off by one.

For keys ending with `__dict` or `__list`, you should remove only the last 6 characters (the substring `__dict` or `__list`). Removing 7 characters will truncate the preceding character in the key, leading to incorrect key restoration.

Apply this fix:

```diff
- original_key = key[:-7]  # Remove '__dict' suffix
+ original_key = key[:-6]  # Remove '__dict' suffix

- original_key = key[:-7]  # Remove '__list' suffix
+ original_key = key[:-6]  # Remove '__list' suffix
```
319-319: 🛠️ Refactor suggestion: Restore metadata consistently.

Unlike in `search` and `get_distance_from_collection_elements`, this line directly assigns `metadata` to `payload` without calling `restore_data_from_chroma()`. This causes inconsistent payload formats across methods.

Apply this fix:

```diff
- payload=metadata,
+ payload=restore_data_from_chroma(metadata),
```
🧹 Nitpick comments (1)
cognee/infrastructure/databases/vector/chromadb/ChromaDBAdapter.py (1)
225-286: Maintain consistent distance normalization.

This method uses `normalize_distances`, but elsewhere (e.g., in `batch_search`) you calculate similarity differently with a fixed formula. Either unify the approach with a single normalization function or document the rationale for having two methods.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- `cognee/infrastructure/databases/vector/chromadb/ChromaDBAdapter.py` (1 hunks)
- `cognee/tests/test_chromadb.py` (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- cognee/tests/test_chromadb.py
🔇 Additional comments (3)
cognee/infrastructure/databases/vector/chromadb/ChromaDBAdapter.py (3)
1-19: Solid foundational setup and imports.
The overall structure, including logging configuration and import statements, provides a clear starting point for this adapter. No immediate issues observed in this snippet.
178-224: Confirm 100 as the default result limit.
Here, `n_results=100` may be a safe default for typical searches, but large collections or pipeline usage could need more flexibility. Consider making this value configurable or ensuring it meets all use cases.
315-315: Use a shared normalization function for clarity.
Here, `similarity = 1.0 - min(distance, 2.0)/2.0` diverges from the main normalization strategy in `search`. Adopting a consistent approach for all distance calculations simplifies maintenance and reduces potential confusion about how scores are derived.
Actionable comments posted: 2
🧹 Nitpick comments (4)
cognee/tests/test_chromadb.py (4)
14-49: Fix grammatical issue in assertion message for file existence.
In the assertion at line 47, consider updating the message for clarity. The correct grammar would be "Data location doesn't exist" instead of "doesn't exists."

```diff
- f"Data location doesn't exists: {data.raw_data_location}"
+ f"Data location doesn't exist: {data.raw_data_location}"
```
51-67: Consider additional negative test scenarios.
Currently, the test only checks for the presence of documents (1 or 2). It might be beneficial to add negative tests where no documents exist or invalid dataset names are provided. This would help ensure robust error handling and guard against unexpected data states.
69-164: Refine the large integration test for maintainability.
The `main` function is quite extensive, testing multiple functionalities end-to-end. Consider splitting it into smaller, more focused test functions or employing parameterized tests to make debugging and maintenance easier.
166-169: Rely on your test runner instead of a manual entry point.
Including a `main()` entry point for an async test may not be necessary if you are already using an automated test runner like Pytest or unittest. Removing it helps standardize how tests are executed and reported across the project.

🪛 GitHub Actions: test | chromadb — [error] 169-169: KeyError: 'Could not automatically map to a tokeniser. Please use `tiktoken.get_encoding` to explicitly get the tokeniser you expect.'
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- `cognee/infrastructure/databases/vector/chromadb/ChromaDBAdapter.py` (1 hunks)
- `cognee/tests/test_chromadb.py` (1 hunks)
- `docker-compose.yml` (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- docker-compose.yml
🧰 Additional context used
🪛 GitHub Actions: test | chromadb
cognee/tests/test_chromadb.py
[error] 169-169: KeyError: 'Could not automatically map to a tokeniser. Please use tiktoken.get_encoding to explicitly get the tokeniser you expect.'
[warning] 1-1: Ontology file 'None' not found. Using fallback ontology at http://example.org/empty_ontology
[warning] 1-1: File /home/runner/work/cognee/cognee/cognee/tests/.cognee_system/test_chromadb/databases/cognee.graph not found. Initializing an empty graph.
⏰ Context from checks skipped due to timeout of 90000ms (17)
- GitHub Check: test
- GitHub Check: run_notebook_test / test
- GitHub Check: Test on macos-13
- GitHub Check: Test on ubuntu-22.04
- GitHub Check: run_eval_framework_test / test
- GitHub Check: Test on macos-15
- GitHub Check: run_notebook_test / test
- GitHub Check: Test on macos-13
- GitHub Check: windows-latest
- GitHub Check: run_notebook_test / test
- GitHub Check: test
- GitHub Check: docker-compose-test
- GitHub Check: run_notebook_test / test
- GitHub Check: run_simple_example_test / test
- GitHub Check: Test on macos-15
- GitHub Check: Test on macos-13
- GitHub Check: run_simple_example_test
```python
collection_count = await collection.count()

results = await collection.query(
    query_embeddings=[query_vector],
    include=["metadatas", "distances"],
    n_results=collection_count,
)
```
🛠️ Refactor suggestion
Investigate potential performance impact of retrieving all elements.
Using `collection.count()` to set `n_results=collection_count` can lead to very large queries for large collections. Consider adding a limit or a user-configurable parameter to avoid performance bottlenecks when only partial data is needed; one possible shape is sketched below.
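A capped variant, sketched with a hypothetical bound around the same query call used in this method:

```python
MAX_DISTANCE_RESULTS = 10_000  # hypothetical, user-configurable bound

async def query_with_cap(collection, query_vector):
    # Never request more than the cap, even if the collection is larger.
    collection_count = await collection.count()
    return await collection.query(
        query_embeddings=[query_vector],
        include=["metadatas", "distances"],
        n_results=min(collection_count, MAX_DISTANCE_RESULTS),
    )
```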
```python
        normalized: bool = True,
    ):
        """Search for similar items in a collection using text or vector query."""
        if query_text is None and query_vector is None:
            raise InvalidValueError(message="One of query_text or query_vector must be provided!")

        if query_text and not query_vector:
            query_vector = (await self.embedding_engine.embed_text([query_text]))[0]

        try:
            client = await self.get_connection()
            collection = await client.get_collection(collection_name)

            results = await collection.query(
                query_embeddings=[query_vector],
                include=["metadatas", "distances", "embeddings"]
                if with_vector
                else ["metadatas", "distances"],
                n_results=limit,
            )

            vector_list = []
            for i, (id, metadata, distance) in enumerate(
                zip(results["ids"][0], results["metadatas"][0], results["distances"][0])
            ):
                item = {
                    "id": parse_id(id),
                    "payload": restore_data_from_chroma(metadata),
                    "_distance": distance,
                }

                if with_vector and "embeddings" in results:
                    item["vector"] = results["embeddings"][0][i]

                vector_list.append(item)

            # Normalize vector distance
            normalized_values = normalize_distances(vector_list)
            for i in range(len(normalized_values)):
                vector_list[i]["score"] = normalized_values[i]

            # Create and return ScoredResult objects
            return [
                ScoredResult(
                    id=row["id"],
                    payload=row["payload"],
                    score=row["score"],
                    vector=row.get("vector") if with_vector else None,
                )
                for row in vector_list
            ]
        except Exception as e:
            logger.error(f"Error in search: {str(e)}")
            return []
```
🛠️ Refactor suggestion
Honor the normalized parameter conditionally.
Although the function signature includes a normalized flag, the code always normalizes distances (lines 272–275) regardless of the flag’s value. If you intend to allow unnormalized distances, wrap the normalization step in a conditional check.
```python
if normalized:
    normalized_values = normalize_distances(vector_list)
    for i in range(len(normalized_values)):
        vector_list[i]["score"] = normalized_values[i]
else:
    for i in range(len(vector_list)):
        # Use raw distance or compute an alternative measure
        vector_list[i]["score"] = 1.0 - vector_list[i]["_distance"]  # Example
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
        normalized: bool = True,
    ):
        """Search for similar items in a collection using text or vector query."""
        if query_text is None and query_vector is None:
            raise InvalidValueError(message="One of query_text or query_vector must be provided!")

        if query_text and not query_vector:
            query_vector = (await self.embedding_engine.embed_text([query_text]))[0]

        try:
            client = await self.get_connection()
            collection = await client.get_collection(collection_name)

            results = await collection.query(
                query_embeddings=[query_vector],
                include=["metadatas", "distances", "embeddings"]
                if with_vector
                else ["metadatas", "distances"],
                n_results=limit,
            )

            vector_list = []
            for i, (id, metadata, distance) in enumerate(
                zip(results["ids"][0], results["metadatas"][0], results["distances"][0])
            ):
                item = {
                    "id": parse_id(id),
                    "payload": restore_data_from_chroma(metadata),
                    "_distance": distance,
                }

                if with_vector and "embeddings" in results:
                    item["vector"] = results["embeddings"][0][i]

                vector_list.append(item)

            # Normalize vector distance
            if normalized:
                normalized_values = normalize_distances(vector_list)
                for i in range(len(normalized_values)):
                    vector_list[i]["score"] = normalized_values[i]
            else:
                for i in range(len(vector_list)):
                    # Use raw distance or compute an alternative measure
                    vector_list[i]["score"] = 1.0 - vector_list[i]["_distance"]  # Example

            # Create and return ScoredResult objects
            return [
                ScoredResult(
                    id=row["id"],
                    payload=row["payload"],
                    score=row["score"],
                    vector=row.get("vector") if with_vector else None,
                )
                for row in vector_list
            ]
        except Exception as e:
            logger.error(f"Error in search: {str(e)}")
            return []
```
Description
Add Support for ChromaDB
Summary
This PR adds support for ChromaDB as a vector database option in the Cognee application. ChromaDB is a modern, open-source embedding database designed for AI applications.
Changes
Technical Details
- `ChromaDBAdapter.py` (347 lines) with full CRUD operations for vector data
- Test suite (`test_chromadb.py`) with 171 lines of test coverage

Docker Changes
This PR enhances Cognee's flexibility by providing an alternative vector database option, allowing users to choose the most appropriate database for their specific use case.
DCO Affirmation
I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin
Tested with UI + tests.
Summary by CodeRabbit
Excluded the `.chromadb_data/` directory from version control.