
Conversation

@hajdul88
Collaborator

@hajdul88 hajdul88 commented Nov 25, 2025

Description

This PR introduces triplet embeddings via a new create_triplet_embeddings memify pipeline.
The pipeline reads the graph in batches, extracts properties from graph elements based on their datapoint types, and generates combined triplet embeddings. These embeddings are stored in the vector database as a new collection.
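For reference, the flow mirrors the new examples/python/triplet_embeddings_example.py script. A minimal sketch (import paths are taken from the files touched in this PR, so the public re-exports may differ; the dataset name and query text are placeholders):

```python
import asyncio

import cognee
from cognee.memify_pipelines.create_triplet_embeddings import create_triplet_embeddings
from cognee.modules.search.types import SearchType
from cognee.modules.users.methods import get_default_user


async def main():
    dataset_name = "triplet_demo"  # placeholder dataset name

    # Default pipelines: ingest text and build the knowledge graph.
    await cognee.add("Quantum computers use qubits instead of classical bits.", dataset_name)
    await cognee.cognify()

    # New memify pipeline: reads the graph in batches and stores combined
    # triplet embeddings in the vector database (Triplet_text collection).
    user = await get_default_user()
    await create_triplet_embeddings(user=user, dataset=dataset_name, triplets_batch_size=5)

    # New search type, backed by TripletRetriever.
    results = await cognee.search(
        query_type=SearchType.TRIPLET_COMPLETION,
        query_text="What do quantum computers use instead of bits?",
    )
    print(results)


if __name__ == "__main__":
    asyncio.run(main())
```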

Changes in This PR:

- Added a new create_triplet_embeddings memify pipeline.
- Added a new get_triplet_datapoints memify task.
- Introduced a new triplet_completion search type.
- Added full test coverage:
  - Unit tests: memify task, pipeline, and retriever
  - Integration tests: memify task, pipeline, and retriever
  - End-to-end tests: updated session history tests and multi-DB search tests; added tests for triplet_completion and memify pipeline execution

Acceptance Criteria and Testing

Scenario 1:
- Run the default add and cognify pipelines.
- Run the create triplet embeddings memify pipeline.
- Verify the vector DB contains a non-empty Triplet_text collection.
- Use the new triplet_completion search type and confirm it works correctly.

Scenario 2:
- Run the default add and cognify pipelines.
- Do not run the triplet embeddings memify pipeline.
- Attempt to use the triplet_completion search type.
- You should receive an error indicating that the triplet embeddings memify pipeline must be executed first (see the sketch below).
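A minimal sketch of the Scenario 2 check (the concrete exception type isn't pinned down here, so this catches broadly and only inspects the message):

```python
import cognee
from cognee.modules.search.types import SearchType


async def expect_missing_triplet_embeddings_error():
    try:
        await cognee.search(
            query_type=SearchType.TRIPLET_COMPLETION,
            query_text="Anything",
        )
    except Exception as error:
        # Expected: a message stating that the triplet embeddings
        # memify pipeline must be executed first.
        print(f"Triplet search failed as expected: {error}")
```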

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Code refactoring
  • Performance improvement
  • Other (please specify):

Screenshots/Videos (if applicable)

Pre-submission Checklist

  • I have tested my changes thoroughly before submitting this PR
  • This PR contains minimal changes necessary to address the issue/feature
  • My code follows the project's coding standards and style guidelines
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if applicable)
  • All new and existing tests pass
  • I have searched existing PRs to ensure this change hasn't been submitted already
  • I have linked any relevant issues in the description
  • My commits have clear and descriptive messages

DCO Affirmation

I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin.

Summary by CodeRabbit

  • New Features

    • Triplet-based search with LLM-powered completions (TRIPLET_COMPLETION)
    • Batch triplet retrieval and a triplet embeddings pipeline for extraction, indexing, and optional background processing
    • Context retrieval from triplet embeddings with optional caching and conversation-history support
    • New Triplet data type exposed for indexing and search
  • Examples

    • End-to-end example demonstrating triplet embeddings extraction and TRIPLET_COMPLETION search
  • Tests

    • Unit and integration tests covering triplet extraction, retrieval, embedding pipeline, and completion flows


@hajdul88 hajdul88 self-assigned this Nov 25, 2025
@pull-checklist

Please make sure all the checkboxes are checked:

  • I have tested these changes locally.
  • I have reviewed the code changes.
  • I have added end-to-end and unit tests (if applicable).
  • I have updated the documentation and README.md file (if necessary).
  • I have removed unnecessary code and debug statements.
  • PR title is clear and follows the convention.
  • I have tagged reviewers or team members for feedback.

@coderabbitai
Contributor

coderabbitai bot commented Nov 25, 2025

Walkthrough

Adds end-to-end triplet support: DB batch retrieval in Kuzu/Neo4j, Triplet model and export, extraction pipeline and async generator, TripletRetriever (vector+LLM), search wiring for TRIPLET_COMPLETION, memify integration, tests, and an example script.

Changes

• Database adapters (cognee/infrastructure/databases/graph/kuzu/adapter.py, cognee/infrastructure/databases/graph/neo4j_driver/adapter.py): Added get_triplets_batch(offset, limit) to both adapters. The Kuzu implementation runs a MATCH query with SKIP/LIMIT, validates and parses rows (normalizing node properties and attempting JSON decoding of relationship properties), and logs and skips invalid rows; the Neo4j adapter adds a batch triplet query and returns the results.
• Data model export (cognee/modules/engine/models/Triplet.py, cognee/modules/engine/models/__init__.py): New Triplet model (inheriting DataPoint) with fields text, from_node_id, to_node_id, and metadata; exported from the models package. A sketch of the model appears after this list.
• Memify pipeline entry (cognee/memify_pipelines/create_triplet_embeddings.py): New async create_triplet_embeddings(...) that validates dataset write access, composes extraction/enrichment tasks, and invokes memify with batching.
• Datapoint extraction (cognee/tasks/memify/get_triplet_datapoints.py): New helpers and an async generator, get_triplet_datapoints, that discovers DataPoint index_fields, pages via get_triplets_batch, builds embeddable text, yields Triplet instances, and robustly handles and logs skips and errors.
• Retriever & completion (cognee/modules/retrieval/triplet_retriever.py): New TripletRetriever with get_context (vector search plus concatenation) and get_completion (LLM generation with optional caching and session-history integration).
• Search integration (cognee/modules/search/types/SearchType.py, cognee/modules/search/methods/get_search_type_tools.py): Added the TRIPLET_COMPLETION enum member and wired TripletRetriever.get_completion / TripletRetriever.get_context into the search dispatch.
• Tests — tasks & memify (cognee/tests/integration/tasks/test_get_triplet_datapoints.py, cognee/tests/unit/modules/memify_tasks/test_get_triplet_datapoints.py): Added integration and unit tests covering batch extraction, embeddable-text composition, skipping of invalid triplets, and error propagation.
• Tests — retriever (cognee/tests/integration/retrieval/test_triplet_retriever.py, cognee/tests/unit/modules/retrieval/triplet_retriever_test.py): Added integration and unit tests validating TripletRetriever context/completion behavior, collection-existence handling, empty-result behavior, and error mapping.
• Tests — search & conversation history (cognee/tests/test_conversation_history.py, cognee/tests/test_search_db.py): Updated tests to exercise TRIPLET_COMPLETION, create triplet embeddings in flows, and validate triplet-based contexts, results, and session history.
• Example (examples/python/triplet_embeddings_example.py): New example demonstrating ingest → cognify → create triplet embeddings → TRIPLET_COMPLETION search.
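For orientation, the Triplet model described above looks roughly like this (field types and the index_fields metadata convention are assumptions based on how other cognee DataPoint models are declared):

```python
from cognee.infrastructure.engine.models.DataPoint import DataPoint


class Triplet(DataPoint):
    """One graph edge flattened into embeddable text."""

    # Combined "{start}-›{rel}-›{end}" string; the Triplet_text vector
    # collection is built from this index field.
    text: str
    from_node_id: str  # id types assumed; could be UUIDs in practice
    to_node_id: str
    metadata: dict = {"index_fields": ["text"]}
```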

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Pay extra attention to:
    • cognee/infrastructure/databases/graph/kuzu/adapter.py: JSON decoding, node/relationship parsing, skip-logic and logging.
    • cognee/tasks/memify/get_triplet_datapoints.py: index_fields discovery, batching/termination, and per-triplet validation/error handling.
    • cognee/modules/retrieval/triplet_retriever.py: context assembly, prompt handling, and session-history/caching interactions.
    • Tests: verify mocks/fixtures align with graph/vector engine behaviors and new Triplet model.

Suggested labels

run-checks, core-team

Suggested reviewers

  • pazone
  • borisarzentar

Pre-merge checks

✅ Passed checks (3 passed)
• Title check: Passed. The title clearly and concisely describes the main feature: adding triplet embeddings via a memify pipeline.
• Docstring Coverage: Passed. Docstring coverage is 85.71%, above the required threshold of 80.00%.
• Description check: Passed. The PR description is well-structured, follows the template format, includes all required sections with substantial content, and clearly explains the changes and acceptance criteria.

@hajdul88 hajdul88 marked this pull request as draft November 25, 2025 16:22
@hajdul88 hajdul88 changed the title from "Feature/cog 3326 2triplet embedding via memify" to "feature: adds triplet embedding via memify (WIP)" Nov 25, 2025
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
cognee/tasks/memify/get_triplet_datapoints.py (1)

166-176: Empty embeddable_text check is unreachable but acceptable by prior design decision.

Because embeddable_text is always formatted as "{start}-›{rel}-›{end}", the string will never be falsy (at minimum it contains the separators), so the if not embeddable_text: branch won’t execute in practice. Given the prior discussion and team agreement to always include separators and keep this guard as a defensive check, I don’t see a need to change it.

Based on learnings, this behavior is intentional and acceptable.
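For illustration, a toy example of why the guard can't fire:

```python
start, rel, end = "", "", ""  # worst case: every component is empty
embeddable_text = f"{start}-›{rel}-›{end}"
assert embeddable_text == "-›-›"      # the separators always remain
assert bool(embeddable_text) is True  # so `if not embeddable_text:` never runs
```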

🧹 Nitpick comments (1)
cognee/tasks/memify/get_triplet_datapoints.py (1)

40-58: Align get_triplet_datapoints docstring with actual behavior and parameters.

The function returns an AsyncGenerator[Triplet, None] and yields individual Triplet objects, but the docstring still mentions “batches” and List[Dict[str, Any]], and it doesn’t describe the data parameter. I’d update the summary, Parameters, and Yields sections to:

  • Document data’s role as pipeline input / initial state.
  • Clarify that the generator yields Triplet instances one-by-one, while triplets_batch_size only controls the internal fetch window from the graph engine.
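A possible shape for the updated docstring (signature abbreviated; the wording is only a suggestion):

```python
async def get_triplet_datapoints(data, triplets_batch_size: int = 100):
    """Yield Triplet datapoints extracted from the graph, one at a time.

    Parameters:
        data: Pipeline input / initial state handed over by memify; it is
            passed through rather than used to drive extraction.
        triplets_batch_size: Size of the internal fetch window used when
            paging the graph engine via get_triplets_batch; it does not
            change the one-by-one yielding behavior.

    Yields:
        Triplet: One validated triplet datapoint per iteration.
    """
    ...
```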
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3cb8305 and 608bdc3.

📒 Files selected for processing (1)
  • cognee/tasks/memify/get_triplet_datapoints.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Use 4-space indentation in Python code
Use snake_case for Python module and function names
Use PascalCase for Python class names
Use ruff format before committing Python code
Use ruff check for import hygiene and style enforcement with line-length 100 configured in pyproject.toml
Prefer explicit, structured error handling in Python code

Files:

  • cognee/tasks/memify/get_triplet_datapoints.py

⚙️ CodeRabbit configuration file

**/*.py: When reviewing Python code for this project:

  1. Prioritize portability over clarity, especially when dealing with cross-Python compatibility. However, with the priority in mind, do still consider improvements to clarity when relevant.
  2. As a general guideline, consider the code style advocated in the PEP 8 standard (excluding the use of spaces for indentation) and evaluate suggested changes for code style compliance.
  3. As a style convention, consider the code style advocated in CEP-8 and evaluate suggested changes for code style compliance.
  4. As a general guideline, try to provide any relevant, official, and supporting documentation links to any tool's suggestions in review comments. This guideline is important for posterity.
  5. As a general rule, undocumented function definitions and class definitions in the project's Python code are assumed incomplete. Please consider suggesting a short summary of the code for any of these incomplete definitions as docstrings when reviewing.

Files:

  • cognee/tasks/memify/get_triplet_datapoints.py
cognee/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Use shared logging utilities from cognee.shared.logging_utils in Python code

Files:

  • cognee/tasks/memify/get_triplet_datapoints.py
cognee/{modules,infrastructure,tasks}/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Co-locate feature-specific helpers under their respective package (modules/, infrastructure/, or tasks/)

Files:

  • cognee/tasks/memify/get_triplet_datapoints.py
🧠 Learnings (1)
Learnt from: hajdul88
Repo: topoteretes/cognee PR: 1832
File: cognee/tasks/memify/get_triplet_datapoints.py:169-179
Timestamp: 2025-12-02T10:37:24.245Z
Learning: In cognee/tasks/memify/get_triplet_datapoints.py, the triplet embeddable_text format intentionally includes arrow separators (`-›`) even when text components might be empty, as this format was agreed upon by the team. The empty text check is considered acceptable even though it's technically unreachable.

Applied to files:

  • cognee/tasks/memify/get_triplet_datapoints.py
🧬 Code graph analysis (1)
cognee/tasks/memify/get_triplet_datapoints.py (5)
cognee/infrastructure/databases/graph/get_graph_engine.py (1)
  • get_graph_engine (10-24)
cognee/shared/logging_utils.py (1)
  • get_logger (212-224)
cognee/infrastructure/engine/models/DataPoint.py (1)
  • DataPoint (20-220)
cognee/modules/engine/models/Triplet.py (1)
  • Triplet (4-9)
cognee/infrastructure/databases/graph/kuzu/adapter.py (1)
  • get_triplets_batch (2009-2138)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (21)
  • GitHub Check: End-to-End Tests / Test using different async databases in parallel in Cognee
  • GitHub Check: End-to-End Tests / Test Feedback Enrichment
  • GitHub Check: End-to-End Tests / Conversation sessions test (FS)
  • GitHub Check: End-to-End Tests / Deduplication Test
  • GitHub Check: End-to-End Tests / Test graph edge ingestion
  • GitHub Check: End-to-End Tests / Run Telemetry Pipeline Test
  • GitHub Check: End-to-End Tests / Conversation sessions test (Redis)
  • GitHub Check: End-to-End Tests / S3 Bucket Test
  • GitHub Check: End-to-End Tests / Test Entity Extraction
  • GitHub Check: End-to-End Tests / Test multi tenancy with different situations in Cognee
  • GitHub Check: End-to-End Tests / Concurrent Subprocess access test
  • GitHub Check: CLI Tests / CLI Functionality Tests
  • GitHub Check: CLI Tests / CLI Integration Tests
  • GitHub Check: End-to-End Tests / Run Telemetry Test
  • GitHub Check: End-to-End Tests / Test permissions with different situations in Cognee
  • GitHub Check: End-to-End Tests / Server Start Test
  • GitHub Check: Basic Tests / Run Basic Graph Tests
  • GitHub Check: Basic Tests / Run Simple Examples BAML
  • GitHub Check: Basic Tests / Run Simple Examples
  • GitHub Check: Basic Tests / Run Integration Tests
  • GitHub Check: Basic Tests / Run Unit Tests
🔇 Additional comments (1)
cognee/tasks/memify/get_triplet_datapoints.py (1)

12-37: Helper _extract_embeddable_text looks correct and robust.

The helper cleanly handles missing fields, trims whitespace, and skips empty values, which is exactly what you want before joining into embeddable text. No changes needed here.
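For illustration, the described behavior amounts to a stand-in like this (the real helper's exact signature may differ):

```python
from typing import Any, Dict, List


def _extract_embeddable_text(properties: Dict[str, Any], index_fields: List[str]) -> str:
    """Join indexed property values, tolerating missing or blank fields."""
    parts = []
    for field in index_fields:
        value = properties.get(field)  # missing fields are simply skipped
        if value is None:
            continue
        text = str(value).strip()  # trim surrounding whitespace
        if text:  # drop values that were empty or whitespace-only
            parts.append(text)
    return " ".join(parts)
```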

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
cognee/tasks/memify/get_triplet_datapoints.py (1)

115-115: Consider replacing list with counter for memory efficiency.

The triplet_datapoints list accumulates Triplet objects that have already been yielded (line 184). The list is only used to check emptiness (line 199) and count items (line 208). For large batches, this adds unnecessary memory overhead.

Apply this diff to use a counter instead:

-            triplet_datapoints = []
+            valid_triplets_count = 0
             skipped_count = 0
 
             for idx, triplet_datapoint in enumerate(triplets_batch):
                 try:
                     # ... existing extraction logic ...
 
                     triplet_obj = Triplet(
                         from_node_id=start_node_id, to_node_id=end_node_id, text=embeddable_text
                     )
 
-                    triplet_datapoints.append(triplet_obj)
+                    valid_triplets_count += 1
 
                     yield triplet_obj
 
                 except Exception as e:
                     # ... existing exception handling ...
 
             if skipped_count > 0:
                 logger.warning(
                     f"Skipped {skipped_count} out of {len(triplets_batch)} triplets in batch {batch_number}"
                 )
 
-            if not triplet_datapoints:
+            if valid_triplets_count == 0:
                 logger.warning(
                     f"No valid triplet datapoints in batch {batch_number} after processing"
                 )
                 offset += len(triplets_batch)
                 if len(triplets_batch) < triplets_batch_size:
                     break
                 continue
 
-            total_triplets_processed += len(triplet_datapoints)
+            total_triplets_processed += valid_triplets_count
             logger.info(
-                f"Batch {batch_number} complete: processed {len(triplet_datapoints)} triplets "
+                f"Batch {batch_number} complete: processed {valid_triplets_count} triplets "
                 f"(total processed: {total_triplets_processed})"
             )

Also applies to: 182-182, 199-199, 208-208

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 608bdc3 and 41488e9.

📒 Files selected for processing (1)
  • cognee/tasks/memify/get_triplet_datapoints.py (1 hunks)
🧬 Code graph analysis (1)
cognee/tasks/memify/get_triplet_datapoints.py (6)
cognee/infrastructure/databases/graph/get_graph_engine.py (1)
  • get_graph_engine (10-24)
cognee/shared/logging_utils.py (1)
  • get_logger (212-224)
cognee/infrastructure/engine/models/DataPoint.py (1)
  • DataPoint (20-220)
cognee/modules/engine/models/Triplet.py (1)
  • Triplet (4-9)
cognee/infrastructure/databases/graph/neo4j_driver/adapter.py (1)
  • get_triplets_batch (1531-1551)
cognee/infrastructure/databases/graph/kuzu/adapter.py (1)
  • get_triplets_batch (2009-2138)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (23)
  • GitHub Check: CLI Tests / CLI Functionality Tests
  • GitHub Check: End-to-End Tests / Test Entity Extraction
  • GitHub Check: CLI Tests / CLI Unit Tests
  • GitHub Check: CLI Tests / CLI Integration Tests
  • GitHub Check: End-to-End Tests / Conversation sessions test (Redis)
  • GitHub Check: End-to-End Tests / Test graph edge ingestion
  • GitHub Check: End-to-End Tests / Concurrent Subprocess access test
  • GitHub Check: End-to-End Tests / Test Feedback Enrichment
  • GitHub Check: End-to-End Tests / Conversation sessions test (FS)
  • GitHub Check: End-to-End Tests / Test multi tenancy with different situations in Cognee
  • GitHub Check: End-to-End Tests / Run Telemetry Pipeline Test
  • GitHub Check: End-to-End Tests / Test permissions with different situations in Cognee
  • GitHub Check: End-to-End Tests / S3 Bucket Test
  • GitHub Check: End-to-End Tests / Test using different async databases in parallel in Cognee
  • GitHub Check: End-to-End Tests / Deduplication Test
  • GitHub Check: End-to-End Tests / Run Telemetry Test
  • GitHub Check: Basic Tests / Run Formatting Check
  • GitHub Check: Basic Tests / Run Simple Examples BAML
  • GitHub Check: End-to-End Tests / Server Start Test
  • GitHub Check: Basic Tests / Run Simple Examples
  • GitHub Check: Basic Tests / Run Basic Graph Tests
  • GitHub Check: Basic Tests / Run Integration Tests
  • GitHub Check: Basic Tests / Run Unit Tests
🔇 Additional comments (2)
cognee/tasks/memify/get_triplet_datapoints.py (2)

12-37: LGTM!

The helper function is well-structured with clear logic, proper error handling for edge cases, and a comprehensive docstring.


59-227: LGTM!

The core extraction logic is well-implemented with:

  • Proper validation of graph engine capabilities
  • Robust DataPoint subclass discovery and index field mapping
  • Comprehensive error handling at both triplet and batch levels
  • Thorough logging for debugging and monitoring
  • Correct offset management and loop termination conditions

The embeddable text format with arrow separators (lines 166-167) and empty text check (line 170) are intentional per team agreement. Based on learnings, this design is acceptable.
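For illustration, the offset management and termination conditions follow the standard paging pattern; a condensed sketch of the loop described above (per-triplet validation elided):

```python
async def _page_triplets(graph_engine, triplets_batch_size):
    """Condensed view of the offset management and termination logic."""
    offset = 0
    while True:
        triplets_batch = await graph_engine.get_triplets_batch(
            offset=offset, limit=triplets_batch_size
        )
        if not triplets_batch:
            break  # empty page: nothing left to read
        for row in triplets_batch:
            # per-triplet validation happens here in the real task;
            # invalid rows are counted as skipped, not fatal
            yield row
        offset += len(triplets_batch)
        if len(triplets_batch) < triplets_batch_size:
            break  # a short page means the final page was just consumed
```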

@hajdul88 hajdul88 requested a review from pazone December 2, 2025 15:18
triplet_datapoints = []
skipped_count = 0

for idx, triplet_datapoint in enumerate(triplets_batch):
Contributor


I'd leave it up to you. There is a recommendation to break this logic into a chain of private functions with reduced scope. The purpose is ease of code review and overall code comprehensibility.

Collaborator Author

@hajdul88 hajdul88 Dec 2, 2025


In the end I created them. The error-message handling is a bit ugly this way, but the logic is now separated into private methods. I can revert if you think it's worse, but we have to handle the errors coming from the private functions somehow and adjust the loop based on them.

@pazone
Contributor

pazone commented Dec 2, 2025

In general it looks good. Let's make sure that the issues are resolved and the tests are passing.

@hajdul88 hajdul88 requested a review from pazone December 2, 2025 15:36
@hajdul88
Collaborator Author

hajdul88 commented Dec 2, 2025

Finished the refactoring and addressed all the refactoring requests except the one we discussed (PEP 8).

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cognee/tests/test_search_db.py (1)

268-268: Critical: Undefined variable reference.

Line 268 references an undefined variable text, which will cause a NameError at runtime. Based on the context and earlier code (line 32 defines text_1 and lines 35-39 define explanation_file_path_quantum), this should likely be one of those variables.

Determine the correct variable and apply the appropriate fix:

Option 1: If re-adding the text content:

-    await cognee.add([text], dataset_name)
+    await cognee.add(text_1, dataset_name)

Option 2: If re-adding the file:

-    await cognee.add([text], dataset_name)
+    await cognee.add([explanation_file_path_quantum], dataset_name)

Option 3: If adding both:

-    await cognee.add([text], dataset_name)
+    await cognee.add([text_1, explanation_file_path_quantum], dataset_name)

Please verify which data should be added at this point in the test.

🧹 Nitpick comments (1)
cognee/tests/test_search_db.py (1)

44-44: Consider moving import to the top of the file.

The inline import of create_triplet_embeddings at line 44 deviates from PEP 8 guidelines, which recommend placing all imports at the module level unless there's a specific reason (e.g., avoiding circular dependencies or conditional imports).

Apply this diff to move the import to the top:

 from cognee.modules.users.methods import get_default_user
 from collections import Counter
+from cognee.memify_pipelines.create_triplet_embeddings import create_triplet_embeddings

And remove it from line 44:

     user = await get_default_user()
-    from cognee.memify_pipelines.create_triplet_embeddings import create_triplet_embeddings
-
     await create_triplet_embeddings(user=user, dataset=dataset_name, triplets_batch_size=5)

As per coding guidelines, Python code should follow PEP 8 style conventions.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 61f7a2e and 804289b.

📒 Files selected for processing (1)
  • cognee/tests/test_search_db.py (8 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
cognee/tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

cognee/tests/**/*.py: Place Python tests under cognee/tests/ organized by type (unit, integration, cli_tests)
Name Python test files test_*.py and use pytest.mark.asyncio for async tests

Files:

  • cognee/tests/test_search_db.py
cognee/tests/*

⚙️ CodeRabbit configuration file

cognee/tests/*: When reviewing test code:

  1. Prioritize portability over clarity, especially when dealing with cross-Python compatibility. However, with the priority in mind, do still consider improvements to clarity when relevant.
  2. As a general guideline, consider the code style advocated in the PEP 8 standard (excluding the use of spaces for indentation) and evaluate suggested changes for code style compliance.
  3. As a style convention, consider the code style advocated in CEP-8 and evaluate suggested changes for code style compliance, pointing out any violations discovered.
  4. As a general guideline, try to provide any relevant, official, and supporting documentation links to any tool's suggestions in review comments. This guideline is important for posterity.
  5. As a project rule, Python source files with names prefixed by the string "test_" and located in the project's "tests" directory are the project's unit-testing code. It is safe, albeit a heuristic, to assume these are considered part of the project's minimal acceptance testing unless a justifying exception to this assumption is documented.
  6. As a project rule, any files without extensions and with names prefixed by either the string "check_" or the string "test_", and located in the project's "tests" directory, are the project's non-unit test code. "Non-unit test" in this context refers to any type of testing other than unit testing, such as (but not limited to) functional testing, style linting, regression testing, etc. It can also be assumed that non-unit testing code is usually written as Bash shell scripts.

Files:

  • cognee/tests/test_search_db.py
🧠 Learnings (2)
📚 Learning: 2024-11-13T16:06:32.576Z
Learnt from: hajdul88
Repo: topoteretes/cognee PR: 196
File: cognee/modules/graph/cognee_graph/CogneeGraph.py:32-38
Timestamp: 2024-11-13T16:06:32.576Z
Learning: In `CogneeGraph.py`, within the `CogneeGraph` class, it's intentional to add skeleton edges in both the `add_edge` method and the `project_graph_from_db` method to ensure that edges are added to the graph and to the nodes.

Applied to files:

  • cognee/tests/test_search_db.py
📚 Learning: 2024-12-04T18:37:55.092Z
Learnt from: hajdul88
Repo: topoteretes/cognee PR: 251
File: cognee/tests/infrastructure/databases/test_index_graph_edges.py:0-0
Timestamp: 2024-12-04T18:37:55.092Z
Learning: In the `index_graph_edges` function, both graph engine and vector engine initialization failures are handled within the same try-except block, so a single test covers both cases.

Applied to files:

  • cognee/tests/test_search_db.py
🧬 Code graph analysis (1)
cognee/tests/test_search_db.py (5)
cognee/infrastructure/databases/vector/get_vector_engine.py (1)
  • get_vector_engine (5-7)
cognee/modules/retrieval/triplet_retriever.py (1)
  • TripletRetriever (20-182)
cognee/modules/search/types/SearchType.py (1)
  • SearchType (4-20)
cognee/modules/users/methods/get_default_user.py (1)
  • get_default_user (13-38)
cognee/memify_pipelines/create_triplet_embeddings.py (1)
  • create_triplet_embeddings (18-53)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (16)
  • GitHub Check: End-to-End Tests / Conversation sessions test (Redis)
  • GitHub Check: CLI Tests / CLI Integration Tests
  • GitHub Check: CLI Tests / CLI Functionality Tests
  • GitHub Check: Basic Tests / Run Basic Graph Tests
  • GitHub Check: End-to-End Tests / Conversation sessions test (FS)
  • GitHub Check: Basic Tests / Run Integration Tests
  • GitHub Check: End-to-End Tests / Test Entity Extraction
  • GitHub Check: End-to-End Tests / Test permissions with different situations in Cognee
  • GitHub Check: Basic Tests / Run Unit Tests
  • GitHub Check: End-to-End Tests / Server Start Test
  • GitHub Check: End-to-End Tests / Deduplication Test
  • GitHub Check: End-to-End Tests / Run Telemetry Pipeline Test
  • GitHub Check: End-to-End Tests / S3 Bucket Test
  • GitHub Check: End-to-End Tests / Concurrent Subprocess access test
  • GitHub Check: End-to-End Tests / Test Feedback Enrichment
  • GitHub Check: End-to-End Tests / Test multi tenancy with different situations in Cognee
🔇 Additional comments (6)
cognee/tests/test_search_db.py (6)

5-5: LGTM!

The new imports for triplet embeddings functionality are correctly placed and necessary for the test.

Also applies to: 16-16, 19-19


72-96: LGTM!

The TripletRetriever context validation is well-structured and follows appropriate testing patterns. The assertions correctly validate the string type, non-emptiness, and semantic content.


162-166: LGTM!

The TRIPLET_COMPLETION search is properly integrated and follows the same pattern as other completion types in the test.


179-179: LGTM!

TRIPLET_COMPLETION is correctly added to the search results validation loop, ensuring consistent validation across all completion types.


207-207: LGTM!

The assertion message has been correctly updated to reference "CogneeUserInteraction" instead of "DCogneeUserInteraction".


52-58: The test correctly validates collection coverage.

The vector engine's search() method with limit=None explicitly retrieves the collection count and returns all items from the collection, regardless of query_text. When limit=None, the adapters (ChromaDB, PGVector, LanceDB) all execute limit = await collection.count() before querying, ensuring all items are returned and ordered by similarity to the query embedding. The assertion len(edges) == len(collection) correctly validates that all graph edges have corresponding embeddings in the Triplet_text collection.

Likely an incorrect or invalid review comment.
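For illustration, the assertion being discussed amounts to roughly this (the vector engine's search signature is assumed from the test context above):

```python
from cognee.infrastructure.databases.vector import get_vector_engine


async def assert_full_triplet_coverage(edges):
    vector_engine = get_vector_engine()
    # With limit=None the adapters first resolve limit = collection.count(),
    # so the whole Triplet_text collection comes back.
    collection = await vector_engine.search(
        "Triplet_text", query_text="anything", limit=None
    )
    assert len(edges) == len(collection)
```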

Contributor

@pazone pazone left a comment


LGTM. Let's fix the issues, and I'd appreciate a +1 approval.

@hajdul88
Collaborator Author

hajdul88 commented Dec 2, 2025

@lxobr Added the requested tests + answered comments and we also discussed these async.

@lxobr lxobr self-requested a review December 2, 2025 17:14
Collaborator

@lxobr lxobr left a comment


Looks good! Really glad you took this on

@hajdul88 hajdul88 merged commit d4d190a into dev Dec 2, 2025
250 of 255 checks passed
@hajdul88 hajdul88 deleted the feature/cog-3326-2triplet-embedding-via-memify branch December 2, 2025 17:27