feature: adds triplet embedding via memify #1832
Conversation
Walkthrough

Adds end-to-end triplet support: DB batch retrieval in Kuzu/Neo4j, a Triplet model and export, an extraction pipeline and async generator, a TripletRetriever (vector + LLM), search wiring for TRIPLET_COMPLETION, memify integration, tests, and an example script.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks and finishing touches: ✅ 3 checks passed
This reverts commit b7cd326.
Co-authored-by: Pavel Zorin <[email protected]>
Actionable comments posted: 0
♻️ Duplicate comments (1)
cognee/tasks/memify/get_triplet_datapoints.py (1)
166-176: Empty `embeddable_text` check is unreachable but acceptable by prior design decision.

Because `embeddable_text` is always formatted as `"{start}-›{rel}-›{end}"`, the string will never be falsy (at minimum it contains the separators), so the `if not embeddable_text:` branch won't execute in practice. Given the prior discussion and team agreement to always include separators and keep this guard as a defensive check, I don't see a need to change it. Based on learnings, this behavior is intentional and acceptable.
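The unreachability claim can be checked in isolation. This is a minimal sketch assuming the `"{start}-›{rel}-›{end}"` format described above; the function name here is hypothetical, not the actual implementation.

```python
# Hypothetical helper mirroring the described format; not the actual
# code from get_triplet_datapoints.py.
def build_embeddable_text(start: str, relationship: str, end: str) -> str:
    return f"{start}-›{relationship}-›{end}"

# Even with every component empty, the separators remain, so the result
# is never falsy and an `if not embeddable_text:` guard cannot trigger.
text = build_embeddable_text("", "", "")
print(repr(text))  # '-›-›'
print(bool(text))  # True
```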
🧹 Nitpick comments (1)
cognee/tasks/memify/get_triplet_datapoints.py (1)
40-58: Align the `get_triplet_datapoints` docstring with actual behavior and parameters.

The function returns an `AsyncGenerator[Triplet, None]` and yields individual `Triplet` objects, but the docstring still mentions "batches" and `List[Dict[str, Any]]`, and it doesn't describe the `data` parameter. I'd update the summary, Parameters, and Yields sections to:

- Document `data`'s role as pipeline input / initial state.
- Clarify that the generator yields `Triplet` instances one by one, while `triplets_batch_size` only controls the internal fetch window from the graph engine.
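The yield-one-at-a-time contract described above can be illustrated with a self-contained async generator. The fetch function here is a stand-in for the graph engine's batched retrieval, not the real `get_triplets_batch` API.

```python
import asyncio
from typing import AsyncGenerator, List

# Stand-in for the graph engine's batched fetch; the real
# get_triplets_batch has a different signature.
async def fetch_batch(source: List[str], offset: int, size: int) -> List[str]:
    return source[offset:offset + size]

async def iter_triplets(
    source: List[str], triplets_batch_size: int = 2
) -> AsyncGenerator[str, None]:
    """Yield triplets one by one; triplets_batch_size only controls
    the internal fetch window, not the shape of what is yielded."""
    offset = 0
    while True:
        batch = await fetch_batch(source, offset, triplets_batch_size)
        if not batch:
            break
        for triplet in batch:
            yield triplet  # consumers see single items, never lists
        offset += len(batch)
        if len(batch) < triplets_batch_size:
            break

async def main() -> List[str]:
    return [t async for t in iter_triplets(["a-›r-›b", "c-›r-›d", "e-›r-›f"])]

print(asyncio.run(main()))  # ['a-›r-›b', 'c-›r-›d', 'e-›r-›f']
```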
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
cognee/tasks/memify/get_triplet_datapoints.py (1 hunk)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
**/*.py: Use 4-space indentation in Python code
Use snake_case for Python module and function names
Use PascalCase for Python class names
Use ruff format before committing Python code
Use ruff check for import hygiene and style enforcement with line-length 100 configured in pyproject.toml
Prefer explicit, structured error handling in Python code
Files:
cognee/tasks/memify/get_triplet_datapoints.py
⚙️ CodeRabbit configuration file
**/*.py: When reviewing Python code for this project:
- Prioritize portability over clarity, especially when dealing with cross-Python compatibility. However, with the priority in mind, do still consider improvements to clarity when relevant.
- As a general guideline, consider the code style advocated in the PEP 8 standard (excluding the use of spaces for indentation) and evaluate suggested changes for code style compliance.
- As a style convention, consider the code style advocated in CEP-8 and evaluate suggested changes for code style compliance.
- As a general guideline, try to provide any relevant, official, and supporting documentation links to any tool's suggestions in review comments. This guideline is important for posterity.
- As a general rule, undocumented function definitions and class definitions in the project's Python code are assumed incomplete. Please consider suggesting a short summary of the code for any of these incomplete definitions as docstrings when reviewing.
Files:
cognee/tasks/memify/get_triplet_datapoints.py
cognee/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Use shared logging utilities from cognee.shared.logging_utils in Python code
Files:
cognee/tasks/memify/get_triplet_datapoints.py
cognee/{modules,infrastructure,tasks}/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Co-locate feature-specific helpers under their respective package (modules/, infrastructure/, or tasks/)
Files:
cognee/tasks/memify/get_triplet_datapoints.py
🧠 Learnings (2)
📓 Common learnings
Learnt from: hajdul88
Repo: topoteretes/cognee PR: 1832
File: cognee/tasks/memify/get_triplet_datapoints.py:169-179
Timestamp: 2025-12-02T10:37:24.245Z
Learning: In cognee/tasks/memify/get_triplet_datapoints.py, the triplet embeddable_text format intentionally includes arrow separators (`-›`) even when text components might be empty, as this format was agreed upon by the team. The empty text check is considered acceptable even though it's technically unreachable.
🧬 Code graph analysis (1)
cognee/tasks/memify/get_triplet_datapoints.py (5)
cognee/infrastructure/databases/graph/get_graph_engine.py (1)
- get_graph_engine (10-24)
cognee/shared/logging_utils.py (1)
- get_logger (212-224)
cognee/infrastructure/engine/models/DataPoint.py (1)
- DataPoint (20-220)
cognee/modules/engine/models/Triplet.py (1)
- Triplet (4-9)
cognee/infrastructure/databases/graph/kuzu/adapter.py (1)
- get_triplets_batch (2009-2138)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (21)
- GitHub Check: End-to-End Tests / Test using different async databases in parallel in Cognee
- GitHub Check: End-to-End Tests / Test Feedback Enrichment
- GitHub Check: End-to-End Tests / Conversation sessions test (FS)
- GitHub Check: End-to-End Tests / Deduplication Test
- GitHub Check: End-to-End Tests / Test graph edge ingestion
- GitHub Check: End-to-End Tests / Run Telemetry Pipeline Test
- GitHub Check: End-to-End Tests / Conversation sessions test (Redis)
- GitHub Check: End-to-End Tests / S3 Bucket Test
- GitHub Check: End-to-End Tests / Test Entity Extraction
- GitHub Check: End-to-End Tests / Test multi tenancy with different situations in Cognee
- GitHub Check: End-to-End Tests / Concurrent Subprocess access test
- GitHub Check: CLI Tests / CLI Functionality Tests
- GitHub Check: CLI Tests / CLI Integration Tests
- GitHub Check: End-to-End Tests / Run Telemetry Test
- GitHub Check: End-to-End Tests / Test permissions with different situations in Cognee
- GitHub Check: End-to-End Tests / Server Start Test
- GitHub Check: Basic Tests / Run Basic Graph Tests
- GitHub Check: Basic Tests / Run Simple Examples BAML
- GitHub Check: Basic Tests / Run Simple Examples
- GitHub Check: Basic Tests / Run Integration Tests
- GitHub Check: Basic Tests / Run Unit Tests
🔇 Additional comments (1)
cognee/tasks/memify/get_triplet_datapoints.py (1)
12-37: Helper `_extract_embeddable_text` looks correct and robust.

The helper cleanly handles missing fields, trims whitespace, and skips empty values, which is exactly what you want before joining into embeddable text. No changes needed here.
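As a rough sketch of the behavior praised here, a helper of this kind might look as follows. Field names and the join rule are illustrative assumptions; the actual `_extract_embeddable_text` in `get_triplet_datapoints.py` may differ.

```python
from typing import Any, Dict, List

# Illustrative only: not the real _extract_embeddable_text.
def extract_embeddable_text(node: Dict[str, Any], fields: List[str]) -> str:
    parts = []
    for field in fields:
        value = node.get(field)  # tolerate missing fields
        if value is None:
            continue
        text = str(value).strip()  # trim whitespace
        if text:  # skip empty values
            parts.append(text)
    return " ".join(parts)

node = {"name": "  Alice ", "description": "", "age": None}
print(extract_embeddable_text(node, ["name", "description", "age"]))  # Alice
```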
Actionable comments posted: 2
🧹 Nitpick comments (1)
cognee/tasks/memify/get_triplet_datapoints.py (1)
115-115: Consider replacing the list with a counter for memory efficiency.

The `triplet_datapoints` list accumulates `Triplet` objects that have already been yielded (line 184). The list is only used to check emptiness (line 199) and count items (line 208). For large batches, this adds unnecessary memory overhead. Apply this diff to use a counter instead:

```diff
-    triplet_datapoints = []
+    valid_triplets_count = 0
     skipped_count = 0

     for idx, triplet_datapoint in enumerate(triplets_batch):
         try:
             # ... existing extraction logic ...
             triplet_obj = Triplet(
                 from_node_id=start_node_id, to_node_id=end_node_id, text=embeddable_text
             )
-            triplet_datapoints.append(triplet_obj)
+            valid_triplets_count += 1
             yield triplet_obj
         except Exception as e:
             # ... existing exception handling ...

     if skipped_count > 0:
         logger.warning(
             f"Skipped {skipped_count} out of {len(triplets_batch)} triplets in batch {batch_number}"
         )

-    if not triplet_datapoints:
+    if valid_triplets_count == 0:
         logger.warning(
             f"No valid triplet datapoints in batch {batch_number} after processing"
         )
         offset += len(triplets_batch)
         if len(triplets_batch) < triplets_batch_size:
             break
         continue

-    total_triplets_processed += len(triplet_datapoints)
+    total_triplets_processed += valid_triplets_count
     logger.info(
-        f"Batch {batch_number} complete: processed {len(triplet_datapoints)} triplets "
+        f"Batch {batch_number} complete: processed {valid_triplets_count} triplets "
         f"(total processed: {total_triplets_processed})"
     )
```

Also applies to: 182-182, 199-199, 208-208
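The counter-based pattern suggested here can be shown in a stripped-down generator. Names are simplified stand-ins for the real task code.

```python
from typing import Iterator, List

def yield_texts(batch: List[dict]) -> Iterator[str]:
    """Yield valid items one by one, tracking counts instead of
    accumulating already-yielded objects in a list."""
    valid_count = 0
    skipped_count = 0
    for item in batch:
        text = item.get("text")
        if not text:
            skipped_count += 1
            continue
        valid_count += 1
        yield text
    # Counters give the same information len(list) did, without the
    # memory cost of keeping every yielded object alive.
    print(f"valid={valid_count} skipped={skipped_count}")

results = list(yield_texts([{"text": "a"}, {}, {"text": "b"}]))
print(results)  # ['a', 'b']
```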
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
cognee/tasks/memify/get_triplet_datapoints.py (1 hunk)
🧰 Additional context used
🧬 Code graph analysis (1)
cognee/tasks/memify/get_triplet_datapoints.py (6)
cognee/infrastructure/databases/graph/get_graph_engine.py (1)
- get_graph_engine (10-24)
cognee/shared/logging_utils.py (1)
- get_logger (212-224)
cognee/infrastructure/engine/models/DataPoint.py (1)
- DataPoint (20-220)
cognee/modules/engine/models/Triplet.py (1)
- Triplet (4-9)
cognee/infrastructure/databases/graph/neo4j_driver/adapter.py (1)
- get_triplets_batch (1531-1551)
cognee/infrastructure/databases/graph/kuzu/adapter.py (1)
- get_triplets_batch (2009-2138)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (23)
- GitHub Check: CLI Tests / CLI Functionality Tests
- GitHub Check: End-to-End Tests / Test Entity Extraction
- GitHub Check: CLI Tests / CLI Unit Tests
- GitHub Check: CLI Tests / CLI Integration Tests
- GitHub Check: End-to-End Tests / Conversation sessions test (Redis)
- GitHub Check: End-to-End Tests / Test graph edge ingestion
- GitHub Check: End-to-End Tests / Concurrent Subprocess access test
- GitHub Check: End-to-End Tests / Test Feedback Enrichment
- GitHub Check: End-to-End Tests / Conversation sessions test (FS)
- GitHub Check: End-to-End Tests / Test multi tenancy with different situations in Cognee
- GitHub Check: End-to-End Tests / Run Telemetry Pipeline Test
- GitHub Check: End-to-End Tests / Test permissions with different situations in Cognee
- GitHub Check: End-to-End Tests / S3 Bucket Test
- GitHub Check: End-to-End Tests / Test using different async databases in parallel in Cognee
- GitHub Check: End-to-End Tests / Deduplication Test
- GitHub Check: End-to-End Tests / Run Telemetry Test
- GitHub Check: Basic Tests / Run Formatting Check
- GitHub Check: Basic Tests / Run Simple Examples BAML
- GitHub Check: End-to-End Tests / Server Start Test
- GitHub Check: Basic Tests / Run Simple Examples
- GitHub Check: Basic Tests / Run Basic Graph Tests
- GitHub Check: Basic Tests / Run Integration Tests
- GitHub Check: Basic Tests / Run Unit Tests
🔇 Additional comments (2)
cognee/tasks/memify/get_triplet_datapoints.py (2)
12-37: LGTM!

The helper function is well-structured with clear logic, proper error handling for edge cases, and a comprehensive docstring.
59-227: LGTM!

The core extraction logic is well-implemented with:
- Proper validation of graph engine capabilities
- Robust DataPoint subclass discovery and index field mapping
- Comprehensive error handling at both triplet and batch levels
- Thorough logging for debugging and monitoring
- Correct offset management and loop termination conditions

The embeddable text format with arrow separators (lines 166-167) and the empty text check (line 170) are intentional per team agreement. Based on learnings, this design is acceptable.
```python
triplet_datapoints = []
skipped_count = 0

for idx, triplet_datapoint in enumerate(triplets_batch):
```
I'd leave it up to you. There is a recommendation to break this logic into a chain of private functions with reduced scope, to ease code review and improve overall code comprehensibility.
In the end I created them. The error-message handling is a bit ugly this way, but the logic is now separated into private methods. I can revert if you think it's worse, but we have to handle the errors coming from the private functions somehow and adjust the loop based on them.
In general it looks good. Let's make sure that the issues are resolved and the tests are passing.
…ub.com:topoteretes/cognee into feature/cog-3326-2triplet-embedding-via-memify
Finished the refactoring and addressed all the refactor requests except the one we discussed (PEP 8).
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cognee/tests/test_search_db.py (1)
268-268: Critical: Undefined variable reference.

Line 268 references an undefined variable `text`, which will cause a `NameError` at runtime. Based on the context and earlier code (line 32 defines `text_1` and lines 35-39 define `explanation_file_path_quantum`), this should likely be one of those variables. Determine the correct variable and apply the appropriate fix.

Option 1: If re-adding the text content:

```diff
- await cognee.add([text], dataset_name)
+ await cognee.add(text_1, dataset_name)
```

Option 2: If re-adding the file:

```diff
- await cognee.add([text], dataset_name)
+ await cognee.add([explanation_file_path_quantum], dataset_name)
```

Option 3: If adding both:

```diff
- await cognee.add([text], dataset_name)
+ await cognee.add([text_1, explanation_file_path_quantum], dataset_name)
```

Please verify which data should be added at this point in the test.
🧹 Nitpick comments (1)
cognee/tests/test_search_db.py (1)
44-44: Consider moving the import to the top of the file.

The inline import of `create_triplet_embeddings` at line 44 deviates from PEP 8 guidelines, which recommend placing all imports at the module level unless there's a specific reason (e.g., avoiding circular dependencies or conditional imports). Apply this diff to move the import to the top:

```diff
 from cognee.modules.users.methods import get_default_user
 from collections import Counter
+from cognee.memify_pipelines.create_triplet_embeddings import create_triplet_embeddings
```

And remove it from line 44:

```diff
     user = await get_default_user()
-    from cognee.memify_pipelines.create_triplet_embeddings import create_triplet_embeddings
     await create_triplet_embeddings(user=user, dataset=dataset_name, triplets_batch_size=5)
```

As per coding guidelines, Python code should follow PEP 8 style conventions.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
cognee/tests/test_search_db.py (8 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
**/*.py: Use 4-space indentation in Python code
Use snake_case for Python module and function names
Use PascalCase for Python class names
Use ruff format before committing Python code
Use ruff check for import hygiene and style enforcement with line-length 100 configured in pyproject.toml
Prefer explicit, structured error handling in Python code
Files:
cognee/tests/test_search_db.py
⚙️ CodeRabbit configuration file
**/*.py: When reviewing Python code for this project:
- Prioritize portability over clarity, especially when dealing with cross-Python compatibility. However, with the priority in mind, do still consider improvements to clarity when relevant.
- As a general guideline, consider the code style advocated in the PEP 8 standard (excluding the use of spaces for indentation) and evaluate suggested changes for code style compliance.
- As a style convention, consider the code style advocated in CEP-8 and evaluate suggested changes for code style compliance.
- As a general guideline, try to provide any relevant, official, and supporting documentation links to any tool's suggestions in review comments. This guideline is important for posterity.
- As a general rule, undocumented function definitions and class definitions in the project's Python code are assumed incomplete. Please consider suggesting a short summary of the code for any of these incomplete definitions as docstrings when reviewing.
Files:
cognee/tests/test_search_db.py
cognee/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Use shared logging utilities from cognee.shared.logging_utils in Python code
Files:
cognee/tests/test_search_db.py
cognee/tests/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
cognee/tests/**/*.py: Place Python tests under cognee/tests/ organized by type (unit, integration, cli_tests)
Name Python test files test_*.py and use pytest.mark.asyncio for async tests
Files:
cognee/tests/test_search_db.py
cognee/tests/*
⚙️ CodeRabbit configuration file
cognee/tests/*: When reviewing test code:
- Prioritize portability over clarity, especially when dealing with cross-Python compatibility. However, with the priority in mind, do still consider improvements to clarity when relevant.
- As a general guideline, consider the code style advocated in the PEP 8 standard (excluding the use of spaces for indentation) and evaluate suggested changes for code style compliance.
- As a style convention, consider the code style advocated in CEP-8 and evaluate suggested changes for code style compliance, pointing out any violations discovered.
- As a general guideline, try to provide any relevant, official, and supporting documentation links to any tool's suggestions in review comments. This guideline is important for posterity.
- As a project rule, Python source files with names prefixed by the string "test_" and located in the project's "tests" directory are the project's unit-testing code. It is safe, albeit a heuristic, to assume these are considered part of the project's minimal acceptance testing unless a justifying exception to this assumption is documented.
- As a project rule, any files without extensions and with names prefixed by either the string "check_" or the string "test_", and located in the project's "tests" directory, are the project's non-unit test code. "Non-unit test" in this context refers to any type of testing other than unit testing, such as (but not limited to) functional testing, style linting, regression testing, etc. It can also be assumed that non-unit testing code is usually written as Bash shell scripts.
Files:
cognee/tests/test_search_db.py
🧠 Learnings (3)
📚 Learning: 2024-11-13T16:06:32.576Z
Learnt from: hajdul88
Repo: topoteretes/cognee PR: 196
File: cognee/modules/graph/cognee_graph/CogneeGraph.py:32-38
Timestamp: 2024-11-13T16:06:32.576Z
Learning: In `CogneeGraph.py`, within the `CogneeGraph` class, it's intentional to add skeleton edges in both the `add_edge` method and the `project_graph_from_db` method to ensure that edges are added to the graph and to the nodes.
Applied to files:
cognee/tests/test_search_db.py
📚 Learning: 2024-12-04T18:37:55.092Z
Learnt from: hajdul88
Repo: topoteretes/cognee PR: 251
File: cognee/tests/infrastructure/databases/test_index_graph_edges.py:0-0
Timestamp: 2024-12-04T18:37:55.092Z
Learning: In the `index_graph_edges` function, both graph engine and vector engine initialization failures are handled within the same try-except block, so a single test covers both cases.
Applied to files:
cognee/tests/test_search_db.py
🧬 Code graph analysis (1)
cognee/tests/test_search_db.py (5)
cognee/infrastructure/databases/vector/get_vector_engine.py (1)
- get_vector_engine (5-7)
cognee/modules/retrieval/triplet_retriever.py (1)
- TripletRetriever (20-182)
cognee/modules/search/types/SearchType.py (1)
- SearchType (4-20)
cognee/modules/users/methods/get_default_user.py (1)
- get_default_user (13-38)
cognee/memify_pipelines/create_triplet_embeddings.py (1)
- create_triplet_embeddings (18-53)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (16)
- GitHub Check: End-to-End Tests / Conversation sessions test (Redis)
- GitHub Check: CLI Tests / CLI Integration Tests
- GitHub Check: CLI Tests / CLI Functionality Tests
- GitHub Check: Basic Tests / Run Basic Graph Tests
- GitHub Check: End-to-End Tests / Conversation sessions test (FS)
- GitHub Check: Basic Tests / Run Integration Tests
- GitHub Check: End-to-End Tests / Test Entity Extraction
- GitHub Check: End-to-End Tests / Test permissions with different situations in Cognee
- GitHub Check: Basic Tests / Run Unit Tests
- GitHub Check: End-to-End Tests / Server Start Test
- GitHub Check: End-to-End Tests / Deduplication Test
- GitHub Check: End-to-End Tests / Run Telemetry Pipeline Test
- GitHub Check: End-to-End Tests / S3 Bucket Test
- GitHub Check: End-to-End Tests / Concurrent Subprocess access test
- GitHub Check: End-to-End Tests / Test Feedback Enrichment
- GitHub Check: End-to-End Tests / Test multi tenancy with different situations in Cognee
🔇 Additional comments (6)
cognee/tests/test_search_db.py (6)
5-5: LGTM!

The new imports for triplet embeddings functionality are correctly placed and necessary for the test.
Also applies to: 16-16, 19-19
72-96: LGTM!

The TripletRetriever context validation is well-structured and follows appropriate testing patterns. The assertions correctly validate the string type, non-emptiness, and semantic content.

162-166: LGTM!

The TRIPLET_COMPLETION search is properly integrated and follows the same pattern as other completion types in the test.

179-179: LGTM!

TRIPLET_COMPLETION is correctly added to the search results validation loop, ensuring consistent validation across all completion types.

207-207: LGTM!

The assertion message has been correctly updated to reference "CogneeUserInteraction" instead of "DCogneeUserInteraction".

52-58: The test correctly validates collection coverage.

The vector engine's `search()` method with `limit=None` explicitly retrieves the collection count and returns all items from the collection, regardless of `query_text`. When `limit=None`, the adapters (ChromaDB, PGVector, LanceDB) all execute `limit = await collection.count()` before querying, ensuring all items are returned and ordered by similarity to the query embedding. The assertion `len(edges) == len(collection)` correctly validates that all graph edges have corresponding embeddings in the Triplet_text collection. Likely an incorrect or invalid review comment.
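The `limit=None` fallback described above can be modeled with a toy collection. The real adapters differ; this only mirrors the count-then-query shape, and the class and method names are stand-ins.

```python
class FakeCollection:
    """Toy stand-in for a vector collection; not a real adapter."""
    def __init__(self, items):
        self.items = list(items)

    def count(self) -> int:
        return len(self.items)

    def query(self, limit: int):
        return self.items[:limit]

def search(collection: FakeCollection, limit=None):
    # Mirrors the described behavior: with no explicit limit, fall back
    # to the collection size so every item is returned.
    if limit is None:
        limit = collection.count()
    return collection.query(limit)

col = FakeCollection(["t1", "t2", "t3"])
print(len(search(col)) == col.count())  # True
```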
pazone
left a comment
LGTM. Let's fix the issues and I'd appreciate +1 approval
@lxobr Added the requested tests and answered the comments; we also discussed these async.
lxobr
left a comment
Looks good! Really glad you took this on
Description
This PR introduces triplet embeddings via a new create_triplet_embeddings memify pipeline.
The pipeline reads the graph in batches, extracts properties from graph elements based on their datapoint types, and generates combined triplet embeddings. These embeddings are stored in the vector database as a new collection.
Changes in This PR:
- Added a new create_triplet_embeddings memify pipeline.
- Added a new get_triplet_datapoints memify task.
- Introduced a new triplet_completion search type.
- Added full test coverage:
  - Unit tests: memify task, pipeline, and retriever
  - Integration tests: memify task, pipeline, and retriever
  - End-to-end tests: updated session history tests and multi-DB search tests; added tests for triplet_completion and memify pipeline execution
Acceptance Criteria and Testing
Scenario 1:
- Run the default add and cognify pipelines.
- Run the create triplet embeddings memify pipeline.
- Verify the vector DB contains a non-empty Triplet_text collection.
- Use the new triplet_completion search type and confirm it works correctly.

Scenario 2:
- Run the default add and cognify pipelines.
- Do not run the triplet embeddings memify pipeline.
- Attempt to use the triplet_completion search type.
- You should receive an error indicating that the triplet embeddings memify pipeline must be executed first.
DCO Affirmation
I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin.