feature: adds triplet embedding via memify #1832
Conversation
Walkthrough

Adds end-to-end triplet support: DB batch retrieval in Kuzu/Neo4j, a Triplet model and export, an extraction pipeline and async generator, a TripletRetriever (vector + LLM), search wiring for TRIPLET_COMPLETION, memify integration, tests, and an example script.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks and finishing touches: ✅ 3 checks passed
This reverts commit b7cd326.
Co-authored-by: Pavel Zorin <[email protected]>
Actionable comments posted: 0
♻️ Duplicate comments (1)
cognee/tasks/memify/get_triplet_datapoints.py (1)
166-176: Empty `embeddable_text` check is unreachable but acceptable by prior design decision.

Because `embeddable_text` is always formatted as `"{start}-›{rel}-›{end}"`, the string will never be falsy (at minimum it contains the separators), so the `if not embeddable_text:` branch won't execute in practice. Given the prior discussion and team agreement to always include separators and keep this guard as a defensive check, I don't see a need to change it. Based on learnings, this behavior is intentional and acceptable.
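The unreachability claim can be checked in isolation. This is a minimal sketch assuming the `"{start}-›{rel}-›{end}"` format described above; the function name here is hypothetical, not the actual implementation.

```python
# Hypothetical helper mirroring the described format; not the actual
# code from get_triplet_datapoints.py.
def build_embeddable_text(start: str, relationship: str, end: str) -> str:
    return f"{start}-›{relationship}-›{end}"

# Even with every component empty, the separators remain, so the result
# is never falsy and an `if not embeddable_text:` guard cannot trigger.
text = build_embeddable_text("", "", "")
print(repr(text))  # '-›-›'
print(bool(text))  # True
```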
🧹 Nitpick comments (1)
cognee/tasks/memify/get_triplet_datapoints.py (1)
40-58: Align the `get_triplet_datapoints` docstring with actual behavior and parameters.

The function returns an `AsyncGenerator[Triplet, None]` and yields individual `Triplet` objects, but the docstring still mentions "batches" and `List[Dict[str, Any]]`, and it doesn't describe the `data` parameter. I'd update the summary, Parameters, and Yields sections to:

- Document `data`'s role as pipeline input / initial state.
- Clarify that the generator yields `Triplet` instances one by one, while `triplets_batch_size` only controls the internal fetch window from the graph engine.
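The yield-one-at-a-time contract described above can be illustrated with a self-contained async generator. The fetch function here is a stand-in for the graph engine's batched retrieval, not the real `get_triplets_batch` API.

```python
import asyncio
from typing import AsyncGenerator, List

# Stand-in for the graph engine's batched fetch; the real
# get_triplets_batch has a different signature.
async def fetch_batch(source: List[str], offset: int, size: int) -> List[str]:
    return source[offset:offset + size]

async def iter_triplets(
    source: List[str], triplets_batch_size: int = 2
) -> AsyncGenerator[str, None]:
    """Yield triplets one by one; triplets_batch_size only controls
    the internal fetch window, not the shape of what is yielded."""
    offset = 0
    while True:
        batch = await fetch_batch(source, offset, triplets_batch_size)
        if not batch:
            break
        for triplet in batch:
            yield triplet  # consumers see single items, never lists
        offset += len(batch)
        if len(batch) < triplets_batch_size:
            break

async def main() -> List[str]:
    return [t async for t in iter_triplets(["a-›r-›b", "c-›r-›d", "e-›r-›f"])]

print(asyncio.run(main()))  # ['a-›r-›b', 'c-›r-›d', 'e-›r-›f']
```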
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
cognee/tasks/memify/get_triplet_datapoints.py (1 hunk)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
**/*.py: Use 4-space indentation in Python code
Use snake_case for Python module and function names
Use PascalCase for Python class names
Use ruff format before committing Python code
Use ruff check for import hygiene and style enforcement with line-length 100 configured in pyproject.toml
Prefer explicit, structured error handling in Python code
Files:
cognee/tasks/memify/get_triplet_datapoints.py
⚙️ CodeRabbit configuration file
**/*.py: When reviewing Python code for this project:
- Prioritize portability over clarity, especially when dealing with cross-Python compatibility. However, with the priority in mind, do still consider improvements to clarity when relevant.
- As a general guideline, consider the code style advocated in the PEP 8 standard (excluding the use of spaces for indentation) and evaluate suggested changes for code style compliance.
- As a style convention, consider the code style advocated in CEP-8 and evaluate suggested changes for code style compliance.
- As a general guideline, try to provide any relevant, official, and supporting documentation links to any tool's suggestions in review comments. This guideline is important for posterity.
- As a general rule, undocumented function definitions and class definitions in the project's Python code are assumed incomplete. Please consider suggesting a short summary of the code for any of these incomplete definitions as docstrings when reviewing.
Files:
cognee/tasks/memify/get_triplet_datapoints.py
cognee/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Use shared logging utilities from cognee.shared.logging_utils in Python code
Files:
cognee/tasks/memify/get_triplet_datapoints.py
cognee/{modules,infrastructure,tasks}/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Co-locate feature-specific helpers under their respective package (modules/, infrastructure/, or tasks/)
Files:
cognee/tasks/memify/get_triplet_datapoints.py
🧠 Learnings (2)
📓 Common learnings
Learnt from: hajdul88
Repo: topoteretes/cognee PR: 1832
File: cognee/tasks/memify/get_triplet_datapoints.py:169-179
Timestamp: 2025-12-02T10:37:24.245Z
Learning: In cognee/tasks/memify/get_triplet_datapoints.py, the triplet embeddable_text format intentionally includes arrow separators (`-›`) even when text components might be empty, as this format was agreed upon by the team. The empty text check is considered acceptable even though it's technically unreachable.
🧬 Code graph analysis (1)
cognee/tasks/memify/get_triplet_datapoints.py (5)
cognee/infrastructure/databases/graph/get_graph_engine.py (1)
- get_graph_engine (10-24)
cognee/shared/logging_utils.py (1)
- get_logger (212-224)
cognee/infrastructure/engine/models/DataPoint.py (1)
- DataPoint (20-220)
cognee/modules/engine/models/Triplet.py (1)
- Triplet (4-9)
cognee/infrastructure/databases/graph/kuzu/adapter.py (1)
- get_triplets_batch (2009-2138)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (21)
- GitHub Check: End-to-End Tests / Test using different async databases in parallel in Cognee
- GitHub Check: End-to-End Tests / Test Feedback Enrichment
- GitHub Check: End-to-End Tests / Conversation sessions test (FS)
- GitHub Check: End-to-End Tests / Deduplication Test
- GitHub Check: End-to-End Tests / Test graph edge ingestion
- GitHub Check: End-to-End Tests / Run Telemetry Pipeline Test
- GitHub Check: End-to-End Tests / Conversation sessions test (Redis)
- GitHub Check: End-to-End Tests / S3 Bucket Test
- GitHub Check: End-to-End Tests / Test Entity Extraction
- GitHub Check: End-to-End Tests / Test multi tenancy with different situations in Cognee
- GitHub Check: End-to-End Tests / Concurrent Subprocess access test
- GitHub Check: CLI Tests / CLI Functionality Tests
- GitHub Check: CLI Tests / CLI Integration Tests
- GitHub Check: End-to-End Tests / Run Telemetry Test
- GitHub Check: End-to-End Tests / Test permissions with different situations in Cognee
- GitHub Check: End-to-End Tests / Server Start Test
- GitHub Check: Basic Tests / Run Basic Graph Tests
- GitHub Check: Basic Tests / Run Simple Examples BAML
- GitHub Check: Basic Tests / Run Simple Examples
- GitHub Check: Basic Tests / Run Integration Tests
- GitHub Check: Basic Tests / Run Unit Tests
🔇 Additional comments (1)
cognee/tasks/memify/get_triplet_datapoints.py (1)
12-37: Helper `_extract_embeddable_text` looks correct and robust.

The helper cleanly handles missing fields, trims whitespace, and skips empty values, which is exactly what you want before joining into embeddable text. No changes needed here.
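As a rough sketch of the behavior praised here, a helper of this kind might look as follows. Field names and the join rule are illustrative assumptions; the actual `_extract_embeddable_text` in `get_triplet_datapoints.py` may differ.

```python
from typing import Any, Dict, List

# Illustrative only: not the real _extract_embeddable_text.
def extract_embeddable_text(node: Dict[str, Any], fields: List[str]) -> str:
    parts = []
    for field in fields:
        value = node.get(field)  # tolerate missing fields
        if value is None:
            continue
        text = str(value).strip()  # trim whitespace
        if text:  # skip empty values
            parts.append(text)
    return " ".join(parts)

node = {"name": "  Alice ", "description": "", "age": None}
print(extract_embeddable_text(node, ["name", "description", "age"]))  # Alice
```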
Actionable comments posted: 2
🧹 Nitpick comments (1)
cognee/tasks/memify/get_triplet_datapoints.py (1)
115-115: Consider replacing the list with a counter for memory efficiency.

The `triplet_datapoints` list accumulates `Triplet` objects that have already been yielded (line 184). The list is only used to check emptiness (line 199) and count items (line 208). For large batches, this adds unnecessary memory overhead. Apply this diff to use a counter instead:

```diff
-    triplet_datapoints = []
+    valid_triplets_count = 0
     skipped_count = 0

     for idx, triplet_datapoint in enumerate(triplets_batch):
         try:
             # ... existing extraction logic ...
             triplet_obj = Triplet(
                 from_node_id=start_node_id, to_node_id=end_node_id, text=embeddable_text
             )
-            triplet_datapoints.append(triplet_obj)
+            valid_triplets_count += 1
             yield triplet_obj
         except Exception as e:
             # ... existing exception handling ...

     if skipped_count > 0:
         logger.warning(
             f"Skipped {skipped_count} out of {len(triplets_batch)} triplets in batch {batch_number}"
         )

-    if not triplet_datapoints:
+    if valid_triplets_count == 0:
         logger.warning(
             f"No valid triplet datapoints in batch {batch_number} after processing"
         )
         offset += len(triplets_batch)
         if len(triplets_batch) < triplets_batch_size:
             break
         continue

-    total_triplets_processed += len(triplet_datapoints)
+    total_triplets_processed += valid_triplets_count
     logger.info(
-        f"Batch {batch_number} complete: processed {len(triplet_datapoints)} triplets "
+        f"Batch {batch_number} complete: processed {valid_triplets_count} triplets "
         f"(total processed: {total_triplets_processed})"
     )
```

Also applies to: 182-182, 199-199, 208-208
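The counter-based pattern suggested here can be shown in a stripped-down generator. Names are simplified stand-ins for the real task code.

```python
from typing import Iterator, List

def yield_texts(batch: List[dict]) -> Iterator[str]:
    """Yield valid items one by one, tracking counts instead of
    accumulating already-yielded objects in a list."""
    valid_count = 0
    skipped_count = 0
    for item in batch:
        text = item.get("text")
        if not text:
            skipped_count += 1
            continue
        valid_count += 1
        yield text
    # Counters give the same information len(list) did, without the
    # memory cost of keeping every yielded object alive.
    print(f"valid={valid_count} skipped={skipped_count}")

results = list(yield_texts([{"text": "a"}, {}, {"text": "b"}]))
print(results)  # ['a', 'b']
```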
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
cognee/tasks/memify/get_triplet_datapoints.py (1 hunk)
🧰 Additional context used
🧬 Code graph analysis (1)
cognee/tasks/memify/get_triplet_datapoints.py (6)
cognee/infrastructure/databases/graph/get_graph_engine.py (1)
- get_graph_engine (10-24)
cognee/shared/logging_utils.py (1)
- get_logger (212-224)
cognee/infrastructure/engine/models/DataPoint.py (1)
- DataPoint (20-220)
cognee/modules/engine/models/Triplet.py (1)
- Triplet (4-9)
cognee/infrastructure/databases/graph/neo4j_driver/adapter.py (1)
- get_triplets_batch (1531-1551)
cognee/infrastructure/databases/graph/kuzu/adapter.py (1)
- get_triplets_batch (2009-2138)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (23)
- GitHub Check: CLI Tests / CLI Functionality Tests
- GitHub Check: End-to-End Tests / Test Entity Extraction
- GitHub Check: CLI Tests / CLI Unit Tests
- GitHub Check: CLI Tests / CLI Integration Tests
- GitHub Check: End-to-End Tests / Conversation sessions test (Redis)
- GitHub Check: End-to-End Tests / Test graph edge ingestion
- GitHub Check: End-to-End Tests / Concurrent Subprocess access test
- GitHub Check: End-to-End Tests / Test Feedback Enrichment
- GitHub Check: End-to-End Tests / Conversation sessions test (FS)
- GitHub Check: End-to-End Tests / Test multi tenancy with different situations in Cognee
- GitHub Check: End-to-End Tests / Run Telemetry Pipeline Test
- GitHub Check: End-to-End Tests / Test permissions with different situations in Cognee
- GitHub Check: End-to-End Tests / S3 Bucket Test
- GitHub Check: End-to-End Tests / Test using different async databases in parallel in Cognee
- GitHub Check: End-to-End Tests / Deduplication Test
- GitHub Check: End-to-End Tests / Run Telemetry Test
- GitHub Check: Basic Tests / Run Formatting Check
- GitHub Check: Basic Tests / Run Simple Examples BAML
- GitHub Check: End-to-End Tests / Server Start Test
- GitHub Check: Basic Tests / Run Simple Examples
- GitHub Check: Basic Tests / Run Basic Graph Tests
- GitHub Check: Basic Tests / Run Integration Tests
- GitHub Check: Basic Tests / Run Unit Tests
🔇 Additional comments (2)
cognee/tasks/memify/get_triplet_datapoints.py (2)
12-37: LGTM!

The helper function is well-structured with clear logic, proper error handling for edge cases, and a comprehensive docstring.
59-227: LGTM!

The core extraction logic is well-implemented with:
- Proper validation of graph engine capabilities
- Robust DataPoint subclass discovery and index field mapping
- Comprehensive error handling at both triplet and batch levels
- Thorough logging for debugging and monitoring
- Correct offset management and loop termination conditions

The embeddable text format with arrow separators (lines 166-167) and the empty text check (line 170) are intentional per team agreement. Based on learnings, this design is acceptable.
```python
triplet_datapoints = []
skipped_count = 0

for idx, triplet_datapoint in enumerate(triplets_batch):
```
I'd leave it up to you. There is a recommendation to break this logic into a chain of private functions with reduced scope, to ease code review and improve overall code comprehensibility.
In the end I created them. The error-message handling is a bit ugly this way, but the logic is now separated into private methods. I can revert if you think it's worse, but we have to handle the errors coming from the private functions somehow and adjust the loop based on them.
In general it looks good. Let's make sure that the issues are resolved and the tests are passing.
…ub.com:topoteretes/cognee into feature/cog-3326-2triplet-embedding-via-memify
Finished the refactoring and addressed all the refactor requests except the one we discussed (PEP 8).
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cognee/tests/test_search_db.py (1)
268-268: Critical: Undefined variable reference.

Line 268 references an undefined variable `text`, which will cause a `NameError` at runtime. Based on the context and earlier code (line 32 defines `text_1` and lines 35-39 define `explanation_file_path_quantum`), this should likely be one of those variables. Determine the correct variable and apply the appropriate fix.

Option 1: If re-adding the text content:

```diff
- await cognee.add([text], dataset_name)
+ await cognee.add(text_1, dataset_name)
```

Option 2: If re-adding the file:

```diff
- await cognee.add([text], dataset_name)
+ await cognee.add([explanation_file_path_quantum], dataset_name)
```

Option 3: If adding both:

```diff
- await cognee.add([text], dataset_name)
+ await cognee.add([text_1, explanation_file_path_quantum], dataset_name)
```

Please verify which data should be added at this point in the test.
🧹 Nitpick comments (1)
cognee/tests/test_search_db.py (1)
44-44: Consider moving the import to the top of the file.

The inline import of `create_triplet_embeddings` at line 44 deviates from PEP 8 guidelines, which recommend placing all imports at the module level unless there's a specific reason (e.g., avoiding circular dependencies or conditional imports). Apply this diff to move the import to the top:

```diff
 from cognee.modules.users.methods import get_default_user
 from collections import Counter
+from cognee.memify_pipelines.create_triplet_embeddings import create_triplet_embeddings
```

And remove it from line 44:

```diff
     user = await get_default_user()
-    from cognee.memify_pipelines.create_triplet_embeddings import create_triplet_embeddings
     await create_triplet_embeddings(user=user, dataset=dataset_name, triplets_batch_size=5)
```

As per coding guidelines, Python code should follow PEP 8 style conventions.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
cognee/tests/test_search_db.py (8 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
**/*.py: Use 4-space indentation in Python code
Use snake_case for Python module and function names
Use PascalCase for Python class names
Use ruff format before committing Python code
Use ruff check for import hygiene and style enforcement with line-length 100 configured in pyproject.toml
Prefer explicit, structured error handling in Python code
Files:
cognee/tests/test_search_db.py
⚙️ CodeRabbit configuration file
**/*.py: When reviewing Python code for this project:
- Prioritize portability over clarity, especially when dealing with cross-Python compatibility. However, with the priority in mind, do still consider improvements to clarity when relevant.
- As a general guideline, consider the code style advocated in the PEP 8 standard (excluding the use of spaces for indentation) and evaluate suggested changes for code style compliance.
- As a style convention, consider the code style advocated in CEP-8 and evaluate suggested changes for code style compliance.
- As a general guideline, try to provide any relevant, official, and supporting documentation links to any tool's suggestions in review comments. This guideline is important for posterity.
- As a general rule, undocumented function definitions and class definitions in the project's Python code are assumed incomplete. Please consider suggesting a short summary of the code for any of these incomplete definitions as docstrings when reviewing.
Files:
cognee/tests/test_search_db.py
cognee/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Use shared logging utilities from cognee.shared.logging_utils in Python code
Files:
cognee/tests/test_search_db.py
cognee/tests/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
cognee/tests/**/*.py: Place Python tests under cognee/tests/ organized by type (unit, integration, cli_tests)
Name Python test files test_*.py and use pytest.mark.asyncio for async tests
Files:
cognee/tests/test_search_db.py
cognee/tests/*
⚙️ CodeRabbit configuration file
cognee/tests/*: When reviewing test code:
- Prioritize portability over clarity, especially when dealing with cross-Python compatibility. However, with the priority in mind, do still consider improvements to clarity when relevant.
- As a general guideline, consider the code style advocated in the PEP 8 standard (excluding the use of spaces for indentation) and evaluate suggested changes for code style compliance.
- As a style convention, consider the code style advocated in CEP-8 and evaluate suggested changes for code style compliance, pointing out any violations discovered.
- As a general guideline, try to provide any relevant, official, and supporting documentation links to any tool's suggestions in review comments. This guideline is important for posterity.
- As a project rule, Python source files with names prefixed by the string "test_" and located in the project's "tests" directory are the project's unit-testing code. It is safe, albeit a heuristic, to assume these are considered part of the project's minimal acceptance testing unless a justifying exception to this assumption is documented.
- As a project rule, any files without extensions and with names prefixed by either the string "check_" or the string "test_", and located in the project's "tests" directory, are the project's non-unit test code. "Non-unit test" in this context refers to any type of testing other than unit testing, such as (but not limited to) functional testing, style linting, regression testing, etc. It can also be assumed that non-unit testing code is usually written as Bash shell scripts.
Files:
cognee/tests/test_search_db.py
🧠 Learnings (3)
📚 Learning: 2024-11-13T16:06:32.576Z
Learnt from: hajdul88
Repo: topoteretes/cognee PR: 196
File: cognee/modules/graph/cognee_graph/CogneeGraph.py:32-38
Timestamp: 2024-11-13T16:06:32.576Z
Learning: In `CogneeGraph.py`, within the `CogneeGraph` class, it's intentional to add skeleton edges in both the `add_edge` method and the `project_graph_from_db` method to ensure that edges are added to the graph and to the nodes.
Applied to files:
cognee/tests/test_search_db.py
📚 Learning: 2024-12-04T18:37:55.092Z
Learnt from: hajdul88
Repo: topoteretes/cognee PR: 251
File: cognee/tests/infrastructure/databases/test_index_graph_edges.py:0-0
Timestamp: 2024-12-04T18:37:55.092Z
Learning: In the `index_graph_edges` function, both graph engine and vector engine initialization failures are handled within the same try-except block, so a single test covers both cases.
Applied to files:
cognee/tests/test_search_db.py
🧬 Code graph analysis (1)
cognee/tests/test_search_db.py (5)
cognee/infrastructure/databases/vector/get_vector_engine.py (1)
- get_vector_engine (5-7)
cognee/modules/retrieval/triplet_retriever.py (1)
- TripletRetriever (20-182)
cognee/modules/search/types/SearchType.py (1)
- SearchType (4-20)
cognee/modules/users/methods/get_default_user.py (1)
- get_default_user (13-38)
cognee/memify_pipelines/create_triplet_embeddings.py (1)
- create_triplet_embeddings (18-53)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (16)
- GitHub Check: End-to-End Tests / Conversation sessions test (Redis)
- GitHub Check: CLI Tests / CLI Integration Tests
- GitHub Check: CLI Tests / CLI Functionality Tests
- GitHub Check: Basic Tests / Run Basic Graph Tests
- GitHub Check: End-to-End Tests / Conversation sessions test (FS)
- GitHub Check: Basic Tests / Run Integration Tests
- GitHub Check: End-to-End Tests / Test Entity Extraction
- GitHub Check: End-to-End Tests / Test permissions with different situations in Cognee
- GitHub Check: Basic Tests / Run Unit Tests
- GitHub Check: End-to-End Tests / Server Start Test
- GitHub Check: End-to-End Tests / Deduplication Test
- GitHub Check: End-to-End Tests / Run Telemetry Pipeline Test
- GitHub Check: End-to-End Tests / S3 Bucket Test
- GitHub Check: End-to-End Tests / Concurrent Subprocess access test
- GitHub Check: End-to-End Tests / Test Feedback Enrichment
- GitHub Check: End-to-End Tests / Test multi tenancy with different situations in Cognee
🔇 Additional comments (6)
cognee/tests/test_search_db.py (6)
5-5: LGTM!

The new imports for triplet embeddings functionality are correctly placed and necessary for the test.
Also applies to: 16-16, 19-19
72-96: LGTM!

The TripletRetriever context validation is well-structured and follows appropriate testing patterns. The assertions correctly validate the string type, non-emptiness, and semantic content.

162-166: LGTM!

The TRIPLET_COMPLETION search is properly integrated and follows the same pattern as other completion types in the test.

179-179: LGTM!

TRIPLET_COMPLETION is correctly added to the search results validation loop, ensuring consistent validation across all completion types.

207-207: LGTM!

The assertion message has been correctly updated to reference "CogneeUserInteraction" instead of "DCogneeUserInteraction".

52-58: The test correctly validates collection coverage.

The vector engine's `search()` method with `limit=None` explicitly retrieves the collection count and returns all items from the collection, regardless of `query_text`. When `limit=None`, the adapters (ChromaDB, PGVector, LanceDB) all execute `limit = await collection.count()` before querying, ensuring all items are returned and ordered by similarity to the query embedding. The assertion `len(edges) == len(collection)` correctly validates that all graph edges have corresponding embeddings in the Triplet_text collection. Likely an incorrect or invalid review comment.
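The `limit=None` fallback described above can be modeled with a toy collection. The real adapters differ; this only mirrors the count-then-query shape, and the class and method names are stand-ins.

```python
class FakeCollection:
    """Toy stand-in for a vector collection; not a real adapter."""
    def __init__(self, items):
        self.items = list(items)

    def count(self) -> int:
        return len(self.items)

    def query(self, limit: int):
        return self.items[:limit]

def search(collection: FakeCollection, limit=None):
    # Mirrors the described behavior: with no explicit limit, fall back
    # to the collection size so every item is returned.
    if limit is None:
        limit = collection.count()
    return collection.query(limit)

col = FakeCollection(["t1", "t2", "t3"])
print(len(search(col)) == col.count())  # True
```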
pazone
left a comment
LGTM. Let's fix the issues and I'd appreciate +1 approval
@lxobr Added the requested tests and answered the comments; we also discussed these async.
lxobr
left a comment
Looks good! Really glad you took this on
Description
This PR introduces triplet embeddings via a new create_triplet_embeddings memify pipeline.
The pipeline reads the graph in batches, extracts properties from graph elements based on their datapoint types, and generates combined triplet embeddings. These embeddings are stored in the vector database as a new collection.
Changes in This PR:
- Added a new create_triplet_embeddings memify pipeline.
- Added a new get_triplet_datapoints memify task.
- Introduced a new triplet_completion search type.
- Added full test coverage:
  - Unit tests: memify task, pipeline, and retriever
  - Integration tests: memify task, pipeline, and retriever
  - End-to-end tests: updated session history tests and multi-DB search tests; added tests for triplet_completion and memify pipeline execution
Acceptance Criteria and Testing
Scenario 1:
- Run the default add and cognify pipelines.
- Run the create triplet embeddings memify pipeline.
- Verify the vector DB contains a non-empty Triplet_text collection.
- Use the new triplet_completion search type and confirm it works correctly.

Scenario 2:
- Run the default add and cognify pipelines.
- Do not run the triplet embeddings memify pipeline.
- Attempt to use the triplet_completion search type.
- You should receive an error indicating that the triplet embeddings memify pipeline must be executed first.
DCO Affirmation
I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin.