feat: Add importance_weight to data models and retrieval #1849
Conversation
Add an importance_weight attribute to data points, document chunks, and graphs, allowing ingestion and retrieval to factor in data importance for ranking. Adds test case for importance weight.
Please make sure all the checkboxes are checked:
Walkthrough: This PR introduces an `importance_weight` attribute across data models, ingestion, and retrieval.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Possibly related issues
Possibly related PRs
Suggested labels
Suggested reviewers
Pre-merge checks and finishing touches: ❌ Failed checks (1 warning); ✅ Passed checks (2 passed)
📜 Recent review details: Configuration used: Path: .coderabbit.yaml; Review profile: CHILL; Plan: Pro
📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (2)
Hi, this PR is ready for review. Please help approve and run the workflows whenever someone has time. Thanks!
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
cognee/modules/retrieval/utils/brute_force_triplet_search.py (2)
153-159: `wide_search_limit` is undefined — will cause `NameError` at runtime.

The variable `wide_search_limit` is referenced at line 156 but is not defined anywhere in this function. The parameter `wide_search_top_k` exists but is never assigned to `wide_search_limit`. This will cause a `NameError` when `search_in_collection` is called.

```diff
+ wide_search_limit = wide_search_top_k  # Add this before the search_in_collection definition
+
  async def search_in_collection(collection_name: str):
      try:
          return await vector_engine.search(
              collection_name=collection_name, query_vector=query_vector, limit=wide_search_limit
          )
```
181-193: Same undefined `wide_search_limit` issue in conditional.

This block also references `wide_search_limit`, which is undefined. Ensure the variable is assigned from `wide_search_top_k` before use:

```diff
+ wide_search_limit = wide_search_top_k
+
  # ... later in the code ...
  if wide_search_limit is not None:
```

cognee/modules/graph/cognee_graph/CogneeGraph.py (1)
203-225: Fix incorrect comment about fallback value.

Line 217's comment states "fallback to 1.0" but the code actually uses 0.5, which is consistent with `DEFAULT_WEIGHT` in `propagate_importance_weights.py`. Apply this diff:

```diff
- # if importance_weight is missing, fallback to 1.0
+ # if importance_weight is missing, fallback to 0.5
  importance_weight = node.attributes.get("importance_weight", 0.5)
```
🧹 Nitpick comments (15)
cognee/modules/retrieval/summaries_retriever.py (3)
25-29: Clarify and possibly validate `candidate` sizing and `default_importance_weight` range.

The introduction of `candidate = top_k * 10` and the use of that as the search limit looks reasonable for re-ranking, but two follow-ups might be worth considering:

- For very large `top_k`, `candidate` could cause unexpectedly heavy vector queries. You might want to either:
  - Add an optional `candidate_limit` parameter, or
  - Clamp `self.candidate` to a sensible maximum to avoid accidental overloads on the vector engine.
- `default_importance_weight` is implicitly expected to be in a bounded numeric range (likely `[0.0, 1.0]`). Validating this upfront (e.g., raising `ValueError` if it's out of range) would make misconfiguration failures explicit and easier to debug.

Also applies to: 56-57
63-79: Defensively handle non-numeric or `None` `importance_weight` values in payloads.

Right now, `payload.get("importance_weight", self.default_importance_weight)` will return `None` (or any other non-numeric value) if that's what is stored in the payload, and the later multiplication `vector_score * importance_weight` will raise a `TypeError`.

To make this path more robust against bad or partial data, you can normalize and fall back to the default when the stored value is missing or invalid:

```diff
- rescored = []
- for item in summaries_results:
-     payload = item.payload or {}
-     importance_weight = payload.get("importance_weight", self.default_importance_weight)
-
-     vector_score = item.score if hasattr(item, "score") else 1.0
-     final_score = vector_score * importance_weight
-
-     rescored.append((final_score, payload))
+ rescored = []
+ for item in summaries_results:
+     payload = item.payload or {}
+     raw_importance = payload.get("importance_weight")
+     if isinstance(raw_importance, (int, float)):
+         importance_weight = float(raw_importance)
+     else:
+         importance_weight = self.default_importance_weight
+
+     vector_score = item.score if hasattr(item, "score") else 1.0
+     final_score = vector_score * importance_weight
+
+     rescored.append((final_score, payload))
```

You could optionally also clamp `importance_weight` into your intended range `[0.0, 1.0]` here for extra safety.
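The defensive handling suggested above can be factored into a small standalone helper; `normalize_importance_weight` and `rescore` are illustrative names for this sketch, not part of the PR:

```python
def normalize_importance_weight(raw, default=0.5, low=0.0, high=1.0):
    """Coerce a payload-provided weight to a float in [low, high].

    Falls back to `default` when the value is missing or non-numeric,
    and clamps out-of-range numbers instead of raising.
    """
    # bool is a subclass of int, so exclude it explicitly
    if not isinstance(raw, (int, float)) or isinstance(raw, bool):
        return default
    return min(high, max(low, float(raw)))


def rescore(results, default=0.5):
    """Re-rank results by final score = vector score * normalized weight."""
    rescored = []
    for item in results:
        payload = item.get("payload") or {}
        weight = normalize_importance_weight(payload.get("importance_weight"), default)
        vector_score = item.get("score", 1.0)
        rescored.append((vector_score * weight, payload))
    rescored.sort(key=lambda pair: pair[0], reverse=True)
    return rescored
```

With this shape, a `None` or string weight silently degrades to the default instead of raising a `TypeError` at multiplication time.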
81-83: Signature change for `get_completion` looks consistent; consider documenting `**kwargs`.

Adding `**kwargs` to `get_completion` keeps the signature flexible and is likely useful for interface compatibility with other retrievers. Since `**kwargs` is currently unused, consider briefly mentioning it in the docstring (e.g., "additional backend-specific options") so its presence is intentional and clear to readers and static analyzers.

cognee/infrastructure/engine/models/DataPoint.py (1)
52-59: Well-structured field with validation, but consider adding a docstring.

The implementation correctly handles `None` coercion and enforces the 0.0–1.0 range via Pydantic's `Field` constraints. The `mode='before'` validator appropriately runs before standard validation.

Per coding guidelines, consider adding a brief docstring to the validator method:

```diff
  @field_validator('importance_weight', mode='before')
  @classmethod
  def set_default_weight_on_none(cls, v):
+     """Coerce None values to the default importance weight of 0.5."""
      if v is None:
          return 0.5
      return v
```

cognee/modules/retrieval/utils/brute_force_triplet_search.py (1)
219-229: Clarify that `similarity_score` is rank-based, not vector similarity.

The variable name `similarity_score` is misleading since it's computed from list position (`1.0 / (index + 1)`) rather than actual vector similarity. Consider renaming to `rank_score` or adding a comment explaining this heuristic.

```diff
  for index, edge in enumerate(initial_results):
-     similarity_score = 1.0 / (index + 1)
+     # Rank-based score: higher position in initial results = higher score
+     rank_score = 1.0 / (index + 1)
      # ...
-     final_score = similarity_score * importance_score
+     final_score = rank_score * importance_score
```

cognee/modules/data/models/Data.py (1)
3-3: Minor: Missing space after comma in import.

```diff
- from sqlalchemy import UUID, Column, DateTime, String, JSON, Integer,Float
+ from sqlalchemy import UUID, Column, DateTime, String, JSON, Integer, Float
```

cognee/tests/test_propagate_importance_weights.py (1)
6-6: Replace Chinese comment with English.

The comment should be in English for consistency with the project's coding standards. Apply this diff:

```diff
- # 导入 CogneeGraph 相关的元素
+ # Import CogneeGraph related elements
```

cognee/tests/unit/modules/retrieval/chunks_retriever_test.py (1)
376-377: Replace Chinese comments with English.

The inline comments should be in English for consistency. Apply this diff:

```diff
- type('ScoredPoint', (), {'payload': chunk1.model_dump(), 'score': 0.8}),  # 原始得分高 (0.8)
- type('ScoredPoint', (), {'payload': chunk2.model_dump(), 'score': 0.5})   # 原始得分低 (0.5)
+ type('ScoredPoint', (), {'payload': chunk1.model_dump(), 'score': 0.8}),  # High raw score (0.8)
+ type('ScoredPoint', (), {'payload': chunk2.model_dump(), 'score': 0.5})   # Low raw score (0.5)
```

cognee/tests/test_importance_weight.py (1)
38-38: Replace Chinese comment with English.

The comment should be in English for consistency. Apply this diff:

```diff
- mock.return_value.embedding_engine.embed_text.return_value = [[0.1] * 768]  # 模拟嵌入向量
+ mock.return_value.embedding_engine.embed_text.return_value = [[0.1] * 768]  # Mock embedding vector
```

cognee/tasks/ingestion/ingest_data.py (3)
23-23: Remove unused import.

The `Document` import from `shared.data_models` is not used anywhere in this file. Apply this diff:

```diff
- from ...shared.data_models import Document
```
176-176: Fix spacing around keyword-argument assignment.

Line 176 uses spaces around the `=` in a keyword argument, which is inconsistent with Python style conventions (PEP 8 recommends no spaces around `=` for keyword arguments). Apply this diff:

```diff
- importance_weight = importance_weight
+ importance_weight=importance_weight
```
203-203: Add space after comma in function call.

Missing space after comma before the `importance_weight` parameter violates PEP 8 style guidelines. Apply this diff:

```diff
- data, dataset_name, user, node_set, dataset_id, preferred_loaders,importance_weight
+ data, dataset_name, user, node_set, dataset_id, preferred_loaders, importance_weight
```

cognee/modules/graph/utils/expand_with_nodes_and_edges.py (3)
66-68: Fix spacing in function signature.

Missing spaces after comma and inconsistent spacing around type annotation violate PEP 8 style guidelines. Apply this diff:

```diff
  def _process_ontology_edges(
-     ontology_edges: list, existing_edges_map: dict, ontology_relationships: list,data_chunk: DocumentChunk,
+     ontology_edges: list, existing_edges_map: dict, ontology_relationships: list, data_chunk: DocumentChunk,
  ) -> None:
```
274-276: Fix spacing in function signature.

Missing space after comma before the `data_chunk` parameter violates PEP 8 style guidelines. Apply this diff:

```diff
  def _process_graph_edges(
-     graph: KnowledgeGraph, name_mapping: dict, existing_edges_map: dict, relationships: list,
-     data_chunk: DocumentChunk) -> None:
+     graph: KnowledgeGraph, name_mapping: dict, existing_edges_map: dict, relationships: list, data_chunk: DocumentChunk
+ ) -> None:
```
145-145: Add space after comma in function calls.

Missing spaces after commas before the `data_chunk` parameter in multiple function calls violate PEP 8 style guidelines. Apply this diff:

```diff
- _process_ontology_edges(ontology_edges, existing_edges_map, ontology_relationships,data_chunk)
+ _process_ontology_edges(ontology_edges, existing_edges_map, ontology_relationships, data_chunk)
```

```diff
- _process_graph_edges(graph, name_mapping, existing_edges_map, relationships,data_chunk)
+ _process_graph_edges(graph, name_mapping, existing_edges_map, relationships, data_chunk)
```

Also applies to: 205-205, 388-388
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (18)
- cognee/api/v1/add/add.py (4 hunks)
- cognee/infrastructure/engine/models/DataPoint.py (2 hunks)
- cognee/modules/chunking/LangchainChunker.py (1 hunks)
- cognee/modules/chunking/TextChunker.py (3 hunks)
- cognee/modules/data/models/Data.py (3 hunks)
- cognee/modules/graph/cognee_graph/CogneeGraph.py (2 hunks)
- cognee/modules/graph/utils/expand_with_nodes_and_edges.py (8 hunks)
- cognee/modules/memify/memify.py (2 hunks)
- cognee/modules/retrieval/chunks_retriever.py (2 hunks)
- cognee/modules/retrieval/lexical_retriever.py (4 hunks)
- cognee/modules/retrieval/summaries_retriever.py (2 hunks)
- cognee/modules/retrieval/utils/brute_force_triplet_search.py (3 hunks)
- cognee/tasks/ingestion/ingest_data.py (6 hunks)
- cognee/tasks/memify/propagate_importance_weights.py (1 hunks)
- cognee/tests/integration/documents/TextDocument_test.py (3 hunks)
- cognee/tests/test_importance_weight.py (1 hunks)
- cognee/tests/test_propagate_importance_weights.py (1 hunks)
- cognee/tests/unit/modules/retrieval/chunks_retriever_test.py (5 hunks)
🧰 Additional context used
📓 Path-based instructions (6)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
**/*.py: Use 4-space indentation in Python code
Use snake_case for Python module and function names
Use PascalCase for Python class names
Use ruff format before committing Python code
Use ruff check for import hygiene and style enforcement with line-length 100 configured in pyproject.toml
Prefer explicit, structured error handling in Python code
Files:
cognee/tests/test_importance_weight.py, cognee/api/v1/add/add.py, cognee/modules/retrieval/summaries_retriever.py, cognee/tasks/memify/propagate_importance_weights.py, cognee/modules/chunking/TextChunker.py, cognee/modules/chunking/LangchainChunker.py, cognee/modules/graph/cognee_graph/CogneeGraph.py, cognee/modules/memify/memify.py, cognee/tasks/ingestion/ingest_data.py, cognee/infrastructure/engine/models/DataPoint.py, cognee/tests/test_propagate_importance_weights.py, cognee/tests/unit/modules/retrieval/chunks_retriever_test.py, cognee/modules/data/models/Data.py, cognee/modules/retrieval/lexical_retriever.py, cognee/modules/retrieval/chunks_retriever.py, cognee/tests/integration/documents/TextDocument_test.py, cognee/modules/graph/utils/expand_with_nodes_and_edges.py, cognee/modules/retrieval/utils/brute_force_triplet_search.py
⚙️ CodeRabbit configuration file
**/*.py: When reviewing Python code for this project:
- Prioritize portability over clarity, especially when dealing with cross-Python compatibility. However, with the priority in mind, do still consider improvements to clarity when relevant.
- As a general guideline, consider the code style advocated in the PEP 8 standard (excluding the use of spaces for indentation) and evaluate suggested changes for code style compliance.
- As a style convention, consider the code style advocated in CEP-8 and evaluate suggested changes for code style compliance.
- As a general guideline, try to provide any relevant, official, and supporting documentation links to any tool's suggestions in review comments. This guideline is important for posterity.
- As a general rule, undocumented function definitions and class definitions in the project's Python code are assumed incomplete. Please consider suggesting a short summary of the code for any of these incomplete definitions as docstrings when reviewing.
Files:
cognee/tests/test_importance_weight.py, cognee/api/v1/add/add.py, cognee/modules/retrieval/summaries_retriever.py, cognee/tasks/memify/propagate_importance_weights.py, cognee/modules/chunking/TextChunker.py, cognee/modules/chunking/LangchainChunker.py, cognee/modules/graph/cognee_graph/CogneeGraph.py, cognee/modules/memify/memify.py, cognee/tasks/ingestion/ingest_data.py, cognee/infrastructure/engine/models/DataPoint.py, cognee/tests/test_propagate_importance_weights.py, cognee/tests/unit/modules/retrieval/chunks_retriever_test.py, cognee/modules/data/models/Data.py, cognee/modules/retrieval/lexical_retriever.py, cognee/modules/retrieval/chunks_retriever.py, cognee/tests/integration/documents/TextDocument_test.py, cognee/modules/graph/utils/expand_with_nodes_and_edges.py, cognee/modules/retrieval/utils/brute_force_triplet_search.py
cognee/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Use shared logging utilities from cognee.shared.logging_utils in Python code
Files:
cognee/tests/test_importance_weight.py, cognee/api/v1/add/add.py, cognee/modules/retrieval/summaries_retriever.py, cognee/tasks/memify/propagate_importance_weights.py, cognee/modules/chunking/TextChunker.py, cognee/modules/chunking/LangchainChunker.py, cognee/modules/graph/cognee_graph/CogneeGraph.py, cognee/modules/memify/memify.py, cognee/tasks/ingestion/ingest_data.py, cognee/infrastructure/engine/models/DataPoint.py, cognee/tests/test_propagate_importance_weights.py, cognee/tests/unit/modules/retrieval/chunks_retriever_test.py, cognee/modules/data/models/Data.py, cognee/modules/retrieval/lexical_retriever.py, cognee/modules/retrieval/chunks_retriever.py, cognee/tests/integration/documents/TextDocument_test.py, cognee/modules/graph/utils/expand_with_nodes_and_edges.py, cognee/modules/retrieval/utils/brute_force_triplet_search.py
cognee/tests/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
cognee/tests/**/*.py: Place Python tests under cognee/tests/ organized by type (unit, integration, cli_tests)
Name Python test files test_*.py and use pytest.mark.asyncio for async tests
Files:
cognee/tests/test_importance_weight.py, cognee/tests/test_propagate_importance_weights.py, cognee/tests/unit/modules/retrieval/chunks_retriever_test.py, cognee/tests/integration/documents/TextDocument_test.py
cognee/tests/*
⚙️ CodeRabbit configuration file
cognee/tests/*: When reviewing test code:
- Prioritize portability over clarity, especially when dealing with cross-Python compatibility. However, with the priority in mind, do still consider improvements to clarity when relevant.
- As a general guideline, consider the code style advocated in the PEP 8 standard (excluding the use of spaces for indentation) and evaluate suggested changes for code style compliance.
- As a style convention, consider the code style advocated in CEP-8 and evaluate suggested changes for code style compliance, pointing out any violations discovered.
- As a general guideline, try to provide any relevant, official, and supporting documentation links to any tool's suggestions in review comments. This guideline is important for posterity.
- As a project rule, Python source files with names prefixed by the string "test_" and located in the project's "tests" directory are the project's unit-testing code. It is safe, albeit a heuristic, to assume these are considered part of the project's minimal acceptance testing unless a justifying exception to this assumption is documented.
- As a project rule, any files without extensions and with names prefixed by either the string "check_" or the string "test_", and located in the project's "tests" directory, are the project's non-unit test code. "Non-unit test" in this context refers to any type of testing other than unit testing, such as (but not limited to) functional testing, style linting, regression testing, etc. It can also be assumed that non-unit testing code is usually written as Bash shell scripts.
Files:
cognee/tests/test_importance_weight.py, cognee/tests/test_propagate_importance_weights.py
cognee/api/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Public APIs should be type-annotated in Python where practical
Files:
cognee/api/v1/add/add.py
cognee/{modules,infrastructure,tasks}/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Co-locate feature-specific helpers under their respective package (modules/, infrastructure/, or tasks/)
Files:
cognee/modules/retrieval/summaries_retriever.py, cognee/tasks/memify/propagate_importance_weights.py, cognee/modules/chunking/TextChunker.py, cognee/modules/chunking/LangchainChunker.py, cognee/modules/graph/cognee_graph/CogneeGraph.py, cognee/modules/memify/memify.py, cognee/tasks/ingestion/ingest_data.py, cognee/infrastructure/engine/models/DataPoint.py, cognee/modules/data/models/Data.py, cognee/modules/retrieval/lexical_retriever.py, cognee/modules/retrieval/chunks_retriever.py, cognee/modules/graph/utils/expand_with_nodes_and_edges.py, cognee/modules/retrieval/utils/brute_force_triplet_search.py
🧠 Learnings (2)
📚 Learning: 2024-11-13T14:55:05.912Z
Learnt from: 0xideas
Repo: topoteretes/cognee PR: 205
File: cognee/tests/unit/processing/chunks/chunk_by_paragraph_test.py:7-7
Timestamp: 2024-11-13T14:55:05.912Z
Learning: When changes are made to the chunking implementation in `cognee/tasks/chunks`, the ground truth values in the corresponding tests in `cognee/tests/unit/processing/chunks` need to be updated accordingly.
Applied to files:
cognee/tests/unit/modules/retrieval/chunks_retriever_test.py, cognee/tests/integration/documents/TextDocument_test.py, cognee/modules/graph/utils/expand_with_nodes_and_edges.py
📚 Learning: 2024-10-16T07:06:28.669Z
Learnt from: borisarzentar
Repo: topoteretes/cognee PR: 144
File: cognee/tasks/chunking/query_chunks.py:1-17
Timestamp: 2024-10-16T07:06:28.669Z
Learning: The `query_chunks` function in `cognee/tasks/chunking/query_chunks.py` is used within the `search` function in `cognee/api/v1/search/search_v2.py`.
Applied to files:
cognee/modules/retrieval/chunks_retriever.py
🧬 Code graph analysis (11)
cognee/tests/test_importance_weight.py (4)
- cognee/modules/retrieval/utils/brute_force_triplet_search.py (1): brute_force_triplet_search (94-245)
- cognee/infrastructure/databases/vector/get_vector_engine.py (1): get_vector_engine (5-7)
- cognee/infrastructure/databases/graph/get_graph_engine.py (1): get_graph_engine (10-24)
- cognee/modules/graph/cognee_graph/CogneeGraph.py (3): calculate_top_triplet_importances (271-282), map_vector_distances_to_graph_nodes (203-225), map_vector_distances_to_graph_edges (227-269)

cognee/modules/retrieval/summaries_retriever.py (2)
- cognee/infrastructure/databases/vector/exceptions/exceptions.py (1): CollectionNotFoundError (5-22)
- cognee/modules/retrieval/exceptions/exceptions.py (1): NoDataError (25-32)

cognee/tasks/memify/propagate_importance_weights.py (3)
- cognee/modules/graph/cognee_graph/CogneeGraph.py (3): CogneeGraph (18-282), get_node (46-47), get_edges (56-57)
- cognee/shared/logging_utils.py (2): get_logger (212-224), info (205-205)
- cognee/modules/graph/cognee_graph/CogneeGraphElements.py (1): get_skeleton_neighbours (76-77)

cognee/modules/graph/cognee_graph/CogneeGraph.py (2)
- cognee/modules/graph/cognee_graph/CogneeGraphElements.py (3): add_attribute (67-68), add_attribute (128-129), Edge (89-156)
- cognee/infrastructure/engine/models/Edge.py (1): Edge (5-38)

cognee/modules/memify/memify.py (2)
- cognee/api/v1/memify/routers/get_memify_router.py (1): memify (38-99)
- cognee/tasks/memify/propagate_importance_weights.py (1): propagate_importance_weights (11-78)

cognee/tasks/ingestion/ingest_data.py (1)
- cognee/shared/data_models.py (1): Document (314-317)

cognee/modules/retrieval/lexical_retriever.py (2)
- cognee/api/v1/add/add.py (1): add (18-221)
- cognee/modules/graph/cognee_graph/CogneeGraph.py (1): final_score (276-280)

cognee/modules/retrieval/chunks_retriever.py (4)
- cognee/infrastructure/databases/vector/get_vector_engine.py (1): get_vector_engine (5-7)
- cognee/infrastructure/databases/vector/exceptions/exceptions.py (1): CollectionNotFoundError (5-22)
- cognee/modules/retrieval/exceptions/exceptions.py (1): NoDataError (25-32)
- cognee/modules/graph/cognee_graph/CogneeGraph.py (1): final_score (276-280)

cognee/tests/integration/documents/TextDocument_test.py (3)
- cognee/tests/integration/documents/async_gen_zip.py (1): async_gen_zip (1-12)
- cognee/modules/chunking/LangchainChunker.py (1): read (36-60)
- cognee/modules/chunking/TextChunker.py (2): read (12-81), TextChunker (11-81)

cognee/modules/graph/utils/expand_with_nodes_and_edges.py (2)
- cognee/modules/chunking/models/DocumentChunk.py (1): DocumentChunk (10-37)
- cognee/tests/unit/interfaces/graph/get_graph_from_model_unit_test.py (1): DocumentChunk (13-17)

cognee/modules/retrieval/utils/brute_force_triplet_search.py (3)
- cognee/modules/graph/cognee_graph/CogneeGraph.py (2): calculate_top_triplet_importances (271-282), final_score (276-280)
- cognee/tests/test_importance_weight.py (4): calculate_top_triplet_importances (54-55), calculate_top_triplet_importances (100-101), calculate_top_triplet_importances (129-130), calculate_top_triplet_importances (164-165)
- cognee/infrastructure/databases/vector/exceptions/exceptions.py (1): CollectionNotFoundError (5-22)
🔇 Additional comments (29)
cognee/modules/retrieval/chunks_retriever.py (2)
69-71: Verify score semantics from vector engine.

The formula `similarity_score = 1 / (1 + distance_score)` assumes `item.score` is a distance metric (lower = more similar). Some vector databases return similarity scores directly (higher = more similar), which would invert the intended ranking.

Confirm that the vector engine consistently returns distance scores, or add a comment documenting this assumption.
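The conversion in question can be illustrated with a toy helper; this is only a sketch of the formula, and whether `item.score` is actually a distance in the production vector engine is exactly what the comment asks to verify:

```python
def distance_to_similarity(distance_score: float) -> float:
    """Map a non-negative distance to (0, 1]: distance 0 gives similarity 1,
    and larger distances approach similarity 0."""
    return 1.0 / (1.0 + distance_score)


# If the engine returned similarities instead, applying the same formula would
# invert the ranking: a perfect match (score 1.0) maps to 0.5, landing below a
# weak match (score 0.1), which maps to roughly 0.909.
```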
30-32: Caching `vector_engine` at init changes retrieval behavior.

The previous implementation likely retrieved the vector engine per-call in `get_context()`. Caching it in `__init__` is more efficient but means configuration changes after instantiation won't be reflected. This is likely acceptable, but worth noting if the retriever instances are long-lived.

cognee/modules/data/models/Data.py (1)
37-37: LGTM!

The `importance_weight` column is correctly defined with `nullable=False` and `default=0.5`, consistent with the `DataPoint` model. The `to_json()` serialization follows the existing camelCase convention.

Also applies to: 60-60
cognee/modules/memify/memify.py (2)
73-81: LGTM!

The ordering is logical: `propagate_importance_weights` runs first to compute weights across the graph, then `add_rule_associations` can leverage those weights. The task construction follows the existing pattern.

24-24: Import placement follows existing conventions.

The new import is appropriately grouped with other task imports from `cognee.tasks`.

cognee/api/v1/add/add.py (1)
29-29: LGTM! Parameter addition and validation are well-implemented.

The `importance_weight` parameter is properly added with:
- A sensible default of 0.5 (mid-range)
- Clear docstring documentation explaining its purpose and range
- Proper validation ensuring values are between 0.0 and 1.0
- Correct propagation to the `ingest_data` task

Also applies to: 89-91, 171-173, 192-192
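The range check praised above can be sketched as a standalone function; the name and error messages here are illustrative, not the PR's exact code:

```python
def validate_importance_weight(importance_weight: float = 0.5) -> float:
    """Ensure importance_weight is a number in [0.0, 1.0]; raise ValueError otherwise."""
    # bool is a subclass of int, so reject it explicitly
    if isinstance(importance_weight, bool) or not isinstance(importance_weight, (int, float)):
        raise ValueError("importance_weight must be a number between 0.0 and 1.0")
    if not 0.0 <= importance_weight <= 1.0:
        raise ValueError("importance_weight must be between 0.0 and 1.0")
    return float(importance_weight)
```

Failing fast at the API boundary keeps bad weights out of ingestion, so downstream scoring never has to guess what an out-of-range value means.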
cognee/modules/chunking/TextChunker.py (1)
33-33: LGTM! Importance weight propagation is consistent.

The `importance_weight` is correctly propagated from `self.document.importance_weight` to all three `DocumentChunk` yield points:
- When handling oversized chunks with empty paragraph buffer
- When flushing accumulated paragraph chunks
- When flushing remaining chunks at the end

Also applies to: 53-53, 76-76
cognee/tests/test_propagate_importance_weights.py (1)
12-78: LGTM! Comprehensive test coverage.

The test effectively validates:
- Weight preservation for nodes with initial weights (N_A=1.0, N_B=0.2)
- Propagation to unweighted neighbors (N_X=1.0 from N_A)
- Weight fusion via averaging (N_Y=0.6 from neighbors N_A=1.0 and N_B=0.2)
- Nodes without weighted neighbors remain unchanged (N_Z)
- Edge weight calculation as node average
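The propagation strategy these assertions exercise (an unweighted node takes the average of its weighted neighbours) can be sketched graph-free; the helper and its data layout are hypothetical, not the PR's API, though the node names mirror the test:

```python
def propagate_weights(weights, neighbours):
    """One propagation pass over a graph given as plain dicts.

    `weights` maps node -> weight or None; `neighbours` maps node -> list of
    adjacent nodes. A node without a weight takes the average of its weighted
    neighbours; already-weighted nodes keep their value.
    """
    updated = dict(weights)
    for node, adjacent in neighbours.items():
        if updated.get(node) is not None:
            continue  # preserve existing weights
        seeds = [weights[n] for n in adjacent if weights.get(n) is not None]
        if seeds:
            updated[node] = sum(seeds) / len(seeds)
    return updated
```

Nodes with no weighted neighbours (like N_Z in the test) simply stay unweighted after the pass.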
cognee/tests/unit/modules/retrieval/chunks_retriever_test.py (5)
207-267: LGTM! Test validates default importance weight behavior.

The test correctly verifies:
- Chunks without explicit `importance_weight` use the default value
- Mock verifies that `score_threshold` is not passed to the vector search
- Results are properly ordered based on final scores (similarity × weight)
268-326: LGTM! Test validates weight-based ranking.

The test demonstrates that importance weighting correctly influences ranking:
- High importance (1.0) with lower similarity (0.6) → final score 0.6
- Low importance (0.1) with higher similarity (0.9) → final score 0.09
- Result: High importance ranks first despite lower similarity
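The ranking arithmetic in these bullets can be reproduced with a small sketch; the chunk dictionaries below are made-up inputs, not the retriever's real payloads:

```python
def rank_chunks(chunks):
    """Order chunks by final score = similarity * importance_weight, descending."""
    return sorted(
        chunks,
        key=lambda chunk: chunk["similarity"] * chunk["importance_weight"],
        reverse=True,
    )
```

As the test expects, a high-importance chunk (0.6 × 1.0 = 0.6) outranks a more similar but low-importance one (0.9 × 0.1 = 0.09).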
327-384: LGTM! Test validates boundary values.

The test correctly validates extreme importance weights:
- Weight 0.0 zeroes out the score regardless of similarity
- Weight 1.0 preserves full similarity score
- Demonstrates that full-weight chunk ranks higher than zero-weight chunk
254-266: Verify imports for `patch` and `AsyncMock`.

The test code uses `patch` and `AsyncMock` from `unittest.mock`, but the import statements are not visible in the provided snippet. Ensure these imports are present at the top of the file:

```python
from unittest.mock import patch, AsyncMock
```
385-442: Verify test expectations: mock scores don't align with stated behavior.

The test name references "equal_score" but the mock setup provides different scores (chunk2=1.0, chunk1=3.0). The test assertions expect chunk2 to rank first, yet with:
- chunk2: 1.0 × 0.5 = 0.5 (final score)
- chunk1: 3.0 × 1.0 = 3.0 (final score)

Standard ranking should place chunk1 first. Either the mock scores should be adjusted to match the test intent, or the assertions need correction. Review the actual ChunksRetriever.get_context() implementation to confirm the scoring formula and ensure test setup aligns with expected ranking behavior.
cognee/modules/chunking/LangchainChunker.py (1)
51-51: LGTM! Consistent importance weight propagation.

The `importance_weight` is correctly propagated from `self.document.importance_weight` to the `DocumentChunk`, maintaining consistency with the `TextChunker` implementation.

cognee/tests/integration/documents/TextDocument_test.py (1)
28-72: LGTM! Comprehensive test parameterization.

The test is well-structured with:
- Parameterization covering explicit weight (0.9) and implicit default (None → 0.5)
- Clear assertions verifying chunk `importance_weight` matches expected values
- Integration testing of weight propagation through the document chunking pipeline
cognee/tests/test_importance_weight.py (4)
66-87: LGTM! Test validates importance weight averaging.

The test correctly verifies that triplet scoring averages the importance weights from both nodes:
- Edge 1: (0.8 + 0.9) / 2 = 0.85
- Edge 2: (0.3 + 0.7) / 2 = 0.5
89-119: LGTM! Test validates behavior with missing weights.

The test correctly verifies that nodes without `importance_weight` attributes remain unchanged during triplet retrieval, confirming that the system doesn't force-add default weights to all nodes.
121-149: LGTM! Test validates boundary value averaging.

The test correctly verifies edge case handling:
- Node weights at extremes (0.0 and 1.0)
- Correct averaging: (0.0 + 1.0) / 2 = 0.5
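The averaging these boundary tests rely on reduces to one line; a minimal sketch under the assumption that an edge's importance is the mean of its endpoint weights, with a hypothetical `default` for missing values:

```python
def edge_importance(source_weight, target_weight, default=0.5):
    """Average the two endpoint weights; substitute `default` for a missing one."""
    sw = default if source_weight is None else source_weight
    tw = default if target_weight is None else target_weight
    return (sw + tw) / 2.0
```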
176-195: LGTM! Test validates ranking with explicit scores.

The test correctly verifies that edges with higher importance weights and scores rank higher, assuming the `MockEdge` initialization issue is fixed.

cognee/modules/retrieval/lexical_retriever.py (4)
14-27: LGTM! Constructor properly extended with importance weight support.

The changes correctly:
- Add `default_importance_weight` parameter with sensible default of 0.5
- Store the parameter for later use in scoring
- Maintain backward compatibility with existing code

35-36: LGTM! Public method for external payload injection.

The `add()` method provides a clean interface for caching chunk payloads, which is useful for testing and external integrations.

105-114: LGTM! Robust weight handling and scoring.

The implementation correctly:
- Retrieves `importance_weight` from payload with fallback to default
- Validates weight is numeric before use
- Computes final score as `score × weight`
- Handles scorer exceptions gracefully by setting final_score to 0.0

129-130: Minor formatting change - no functional impact.

The parameter formatting in `get_completion` was adjusted for readability. This is a cosmetic change with no functional impact.

cognee/tasks/ingestion/ingest_data.py (1)
26-34: LGTM!

The `importance_weight` parameter with default value 0.5 is correctly added to the function signature and properly type-annotated.
11-78: LGTM!

The weight propagation logic is well-structured:
- Properly filters source nodes with valid importance_weight values
- Implements average aggregation correctly for both nodes and edges
- Includes appropriate logging and error handling
- Docstring clearly documents the strategy and parameters
81-90: LGTM!

The Task wrapper follows the standard pattern and properly delegates to the async propagation function.
cognee/modules/graph/cognee_graph/CogneeGraph.py (2)
271-282: LGTM!

The updated ranking logic correctly uses `heapq.nlargest` to maximize `importance_score` (which combines vector similarity with importance weights) instead of minimizing raw distances. The docstring clarifies the merged scoring approach.
227-269: Verify fallback weight inconsistency between nodes and edges.

Edges use a fallback `importance_weight` of 1.0 (line 259), while nodes use 0.5 (line 217). Given that `propagate_importance_weights.py` computes edge weights as the average of endpoint node weights, this inconsistency may cause unexpected behavior if edges lack propagated weights.

Please verify whether:
- The 1.0 fallback for edges is intentional and serves a specific purpose
- Or if it should be aligned with the node fallback of 0.5 for consistency
cognee/modules/graph/utils/expand_with_nodes_and_edges.py (1)
50-50: Verify DocumentChunk has importance_weight attribute.

The code accesses `data_chunk.importance_weight` in multiple locations. Please verify that the `DocumentChunk` model (which extends `DataPoint`) has the `importance_weight` attribute defined, as this is critical for the feature to work correctly.

Also applies to: 62-62, 87-87, 138-138, 198-198, 299-299
Fix logs and MockEdge class
@3619117923 I'd like to apologise since we caused a bit of a problem with this issue. Our architecture was refactored considerably since this issue was created and our engineers did not account for the changes that this PR would require, and they made our core engine incompatible with the code you provided. |
#1768
Summary:
Changes:
- Add importance_weight to add.py, the DataPoint class, and graphs.
- Fix scoring logic and retrievers to use importance_weight.
- Create new tests and add test cases.
Type of Change
Pre-submission Checklist
DCO Affirmation
I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin.
Summary by CodeRabbit
Release Notes
New Features
Tests