feat: Add importance_weight to data models and retrieval #1849
Conversation
Add an importance_weight attribute to data points, document chunks, and graphs, allowing ingestion and retrieval to factor in data importance for ranking. Adds test case for importance weight.
Please make sure all the checkboxes are checked:
Walkthrough: This PR introduces an `importance_weight` attribute across data models, ingestion, and retrieval.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Possibly related issues
Possibly related PRs
Suggested labels
Suggested reviewers
Pre-merge checks and finishing touches: ❌ Failed checks (1 warning); ✅ Passed checks (2 passed)
📜 Recent review details: Configuration used: Path: .coderabbit.yaml; Review profile: CHILL; Plan: Pro
📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (2)
Hi, this PR is ready for review. Please help approve and run the workflows whenever someone has time. Thanks!
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
cognee/modules/retrieval/utils/brute_force_triplet_search.py (2)
153-159: `wide_search_limit` is undefined — will cause `NameError` at runtime.

The variable `wide_search_limit` is referenced at line 156 but is not defined anywhere in this function. The parameter `wide_search_top_k` exists but is never assigned to `wide_search_limit`. This will cause a `NameError` when `search_in_collection` is called.

```diff
+ wide_search_limit = wide_search_top_k  # Add this before the search_in_collection definition
+
  async def search_in_collection(collection_name: str):
      try:
          return await vector_engine.search(
              collection_name=collection_name, query_vector=query_vector, limit=wide_search_limit
          )
```
181-193: Same undefined `wide_search_limit` issue in conditional.

This block also references `wide_search_limit`, which is undefined. Ensure the variable is assigned from `wide_search_top_k` before use:

```diff
+ wide_search_limit = wide_search_top_k
+
  # ... later in the code ...
  if wide_search_limit is not None:
```

cognee/modules/graph/cognee_graph/CogneeGraph.py (1)
203-225: Fix incorrect comment about fallback value.

Line 217's comment states "fallback to 1.0" but the code actually uses 0.5, which is consistent with `DEFAULT_WEIGHT` in `propagate_importance_weights.py`. Apply this diff:

```diff
- # if importance_weight is missing, fallback to 1.0
+ # if importance_weight is missing, fallback to 0.5
  importance_weight = node.attributes.get("importance_weight", 0.5)
```
🧹 Nitpick comments (15)
cognee/modules/retrieval/summaries_retriever.py (3)
25-29: Clarify and possibly validate `candidate` sizing and `default_importance_weight` range.

The introduction of `candidate = top_k * 10` and the use of that as the search limit looks reasonable for re-ranking, but two follow-ups might be worth considering:

- For very large `top_k`, `candidate` could cause unexpectedly heavy vector queries. You might want to either:
  - Add an optional `candidate_limit` parameter, or
  - Clamp `self.candidate` to a sensible maximum to avoid accidental overloads on the vector engine.
- `default_importance_weight` is implicitly expected to be in a bounded numeric range (likely `[0.0, 1.0]`). Validating this upfront (e.g., raising `ValueError` if it's out of range) would make misconfiguration failures explicit and easier to debug.

Also applies to: 56-57
63-79: Defensively handle non-numeric or `None` `importance_weight` values in payloads.

Right now, `payload.get("importance_weight", self.default_importance_weight)` will return `None` (or any other non-numeric value) if that's what is stored in the payload, and the later multiplication `vector_score * importance_weight` will raise a `TypeError`.

To make this path more robust against bad or partial data, you can normalize and fall back to the default when the stored value is missing or invalid:

```diff
- rescored = []
- for item in summaries_results:
-     payload = item.payload or {}
-     importance_weight = payload.get("importance_weight", self.default_importance_weight)
-
-     vector_score = item.score if hasattr(item, "score") else 1.0
-     final_score = vector_score * importance_weight
-
-     rescored.append((final_score, payload))
+ rescored = []
+ for item in summaries_results:
+     payload = item.payload or {}
+     raw_importance = payload.get("importance_weight")
+     if isinstance(raw_importance, (int, float)):
+         importance_weight = float(raw_importance)
+     else:
+         importance_weight = self.default_importance_weight
+
+     vector_score = item.score if hasattr(item, "score") else 1.0
+     final_score = vector_score * importance_weight
+
+     rescored.append((final_score, payload))
```

You could optionally also clamp `importance_weight` into your intended range `[0.0, 1.0]` here for extra safety.
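The defensive handling suggested above can be factored into a small standalone helper; `normalize_importance_weight` and `rescore` are illustrative names for this sketch, not part of the PR:

```python
def normalize_importance_weight(raw, default=0.5, low=0.0, high=1.0):
    """Coerce a payload-provided weight to a float in [low, high].

    Falls back to `default` when the value is missing or non-numeric,
    and clamps out-of-range numbers instead of raising.
    """
    # bool is a subclass of int, so exclude it explicitly
    if not isinstance(raw, (int, float)) or isinstance(raw, bool):
        return default
    return min(high, max(low, float(raw)))


def rescore(results, default=0.5):
    """Re-rank results by final score = vector score * normalized weight."""
    rescored = []
    for item in results:
        payload = item.get("payload") or {}
        weight = normalize_importance_weight(payload.get("importance_weight"), default)
        vector_score = item.get("score", 1.0)
        rescored.append((vector_score * weight, payload))
    rescored.sort(key=lambda pair: pair[0], reverse=True)
    return rescored
```

With this shape, a `None` or string weight silently degrades to the default instead of raising a `TypeError` at multiplication time.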
81-83: Signature change for `get_completion` looks consistent; consider documenting `**kwargs`.

Adding `**kwargs` to `get_completion` keeps the signature flexible and is likely useful for interface compatibility with other retrievers. Since `**kwargs` is currently unused, consider briefly mentioning it in the docstring (e.g., "additional backend-specific options") so its presence is intentional and clear to readers and static analyzers.

cognee/infrastructure/engine/models/DataPoint.py (1)
52-59: Well-structured field with validation, but consider adding a docstring.

The implementation correctly handles `None` coercion and enforces the 0.0–1.0 range via Pydantic's `Field` constraints. The `mode='before'` validator appropriately runs before standard validation.

Per coding guidelines, consider adding a brief docstring to the validator method:

```diff
  @field_validator('importance_weight', mode='before')
  @classmethod
  def set_default_weight_on_none(cls, v):
+     """Coerce None values to the default importance weight of 0.5."""
      if v is None:
          return 0.5
      return v
```

cognee/modules/retrieval/utils/brute_force_triplet_search.py (1)
219-229: Clarify that `similarity_score` is rank-based, not vector similarity.

The variable name `similarity_score` is misleading since it's computed from list position (`1.0 / (index + 1)`) rather than actual vector similarity. Consider renaming to `rank_score` or adding a comment explaining this heuristic.

```diff
  for index, edge in enumerate(initial_results):
-     similarity_score = 1.0 / (index + 1)
+     # Rank-based score: higher position in initial results = higher score
+     rank_score = 1.0 / (index + 1)
      # ...
-     final_score = similarity_score * importance_score
+     final_score = rank_score * importance_score
```

cognee/modules/data/models/Data.py (1)
3-3: Minor: Missing space after comma in import.

```diff
- from sqlalchemy import UUID, Column, DateTime, String, JSON, Integer,Float
+ from sqlalchemy import UUID, Column, DateTime, String, JSON, Integer, Float
```

cognee/tests/test_propagate_importance_weights.py (1)
6-6: Replace Chinese comment with English.

The comment should be in English for consistency with the project's coding standards. Apply this diff:

```diff
- # 导入 CogneeGraph 相关的元素
+ # Import CogneeGraph related elements
```

cognee/tests/unit/modules/retrieval/chunks_retriever_test.py (1)
376-377: Replace Chinese comments with English.

The inline comments should be in English for consistency. Apply this diff:

```diff
- type('ScoredPoint', (), {'payload': chunk1.model_dump(), 'score': 0.8}),  # 原始得分高 (0.8)
- type('ScoredPoint', (), {'payload': chunk2.model_dump(), 'score': 0.5})   # 原始得分低 (0.5)
+ type('ScoredPoint', (), {'payload': chunk1.model_dump(), 'score': 0.8}),  # High raw score (0.8)
+ type('ScoredPoint', (), {'payload': chunk2.model_dump(), 'score': 0.5})   # Low raw score (0.5)
```

cognee/tests/test_importance_weight.py (1)
38-38: Replace Chinese comment with English.

The comment should be in English for consistency. Apply this diff:

```diff
- mock.return_value.embedding_engine.embed_text.return_value = [[0.1] * 768]  # 模拟嵌入向量
+ mock.return_value.embedding_engine.embed_text.return_value = [[0.1] * 768]  # Mock embedding vector
```

cognee/tasks/ingestion/ingest_data.py (3)
23-23: Remove unused import.

The `Document` import from `shared.data_models` is not used anywhere in this file. Apply this diff:

```diff
- from ...shared.data_models import Document
```
176-176: Fix spacing around keyword-argument assignment.

Line 176 uses spaces around the `=` in a keyword argument, which is inconsistent with Python style conventions (PEP 8 recommends no spaces around `=` for keyword arguments). Apply this diff:

```diff
- importance_weight = importance_weight
+ importance_weight=importance_weight
```
203-203: Add space after comma in function call.

Missing space after comma before the `importance_weight` parameter violates PEP 8 style guidelines. Apply this diff:

```diff
- data, dataset_name, user, node_set, dataset_id, preferred_loaders,importance_weight
+ data, dataset_name, user, node_set, dataset_id, preferred_loaders, importance_weight
```

cognee/modules/graph/utils/expand_with_nodes_and_edges.py (3)
66-68: Fix spacing in function signature.

Missing spaces after comma and inconsistent spacing around type annotation violate PEP 8 style guidelines. Apply this diff:

```diff
  def _process_ontology_edges(
-     ontology_edges: list, existing_edges_map: dict, ontology_relationships: list,data_chunk: DocumentChunk,
+     ontology_edges: list, existing_edges_map: dict, ontology_relationships: list, data_chunk: DocumentChunk,
  ) -> None:
```
274-276: Fix spacing in function signature.

Missing space after comma before the `data_chunk` parameter violates PEP 8 style guidelines. Apply this diff:

```diff
  def _process_graph_edges(
-     graph: KnowledgeGraph, name_mapping: dict, existing_edges_map: dict, relationships: list,
-     data_chunk: DocumentChunk) -> None:
+     graph: KnowledgeGraph, name_mapping: dict, existing_edges_map: dict, relationships: list, data_chunk: DocumentChunk
+ ) -> None:
```
145-145: Add space after comma in function calls.

Missing spaces after commas before the `data_chunk` parameter in multiple function calls violate PEP 8 style guidelines. Apply this diff:

```diff
- _process_ontology_edges(ontology_edges, existing_edges_map, ontology_relationships,data_chunk)
+ _process_ontology_edges(ontology_edges, existing_edges_map, ontology_relationships, data_chunk)
```

```diff
- _process_graph_edges(graph, name_mapping, existing_edges_map, relationships,data_chunk)
+ _process_graph_edges(graph, name_mapping, existing_edges_map, relationships, data_chunk)
```

Also applies to: 205-205, 388-388
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (18)
- cognee/api/v1/add/add.py (4 hunks)
- cognee/infrastructure/engine/models/DataPoint.py (2 hunks)
- cognee/modules/chunking/LangchainChunker.py (1 hunks)
- cognee/modules/chunking/TextChunker.py (3 hunks)
- cognee/modules/data/models/Data.py (3 hunks)
- cognee/modules/graph/cognee_graph/CogneeGraph.py (2 hunks)
- cognee/modules/graph/utils/expand_with_nodes_and_edges.py (8 hunks)
- cognee/modules/memify/memify.py (2 hunks)
- cognee/modules/retrieval/chunks_retriever.py (2 hunks)
- cognee/modules/retrieval/lexical_retriever.py (4 hunks)
- cognee/modules/retrieval/summaries_retriever.py (2 hunks)
- cognee/modules/retrieval/utils/brute_force_triplet_search.py (3 hunks)
- cognee/tasks/ingestion/ingest_data.py (6 hunks)
- cognee/tasks/memify/propagate_importance_weights.py (1 hunks)
- cognee/tests/integration/documents/TextDocument_test.py (3 hunks)
- cognee/tests/test_importance_weight.py (1 hunks)
- cognee/tests/test_propagate_importance_weights.py (1 hunks)
- cognee/tests/unit/modules/retrieval/chunks_retriever_test.py (5 hunks)
🧰 Additional context used
📓 Path-based instructions (6)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
**/*.py: Use 4-space indentation in Python code
Use snake_case for Python module and function names
Use PascalCase for Python class names
Use ruff format before committing Python code
Use ruff check for import hygiene and style enforcement with line-length 100 configured in pyproject.toml
Prefer explicit, structured error handling in Python code
Files:
cognee/tests/test_importance_weight.py, cognee/api/v1/add/add.py, cognee/modules/retrieval/summaries_retriever.py, cognee/tasks/memify/propagate_importance_weights.py, cognee/modules/chunking/TextChunker.py, cognee/modules/chunking/LangchainChunker.py, cognee/modules/graph/cognee_graph/CogneeGraph.py, cognee/modules/memify/memify.py, cognee/tasks/ingestion/ingest_data.py, cognee/infrastructure/engine/models/DataPoint.py, cognee/tests/test_propagate_importance_weights.py, cognee/tests/unit/modules/retrieval/chunks_retriever_test.py, cognee/modules/data/models/Data.py, cognee/modules/retrieval/lexical_retriever.py, cognee/modules/retrieval/chunks_retriever.py, cognee/tests/integration/documents/TextDocument_test.py, cognee/modules/graph/utils/expand_with_nodes_and_edges.py, cognee/modules/retrieval/utils/brute_force_triplet_search.py
⚙️ CodeRabbit configuration file
**/*.py: When reviewing Python code for this project:
- Prioritize portability over clarity, especially when dealing with cross-Python compatibility. However, with the priority in mind, do still consider improvements to clarity when relevant.
- As a general guideline, consider the code style advocated in the PEP 8 standard (excluding the use of spaces for indentation) and evaluate suggested changes for code style compliance.
- As a style convention, consider the code style advocated in CEP-8 and evaluate suggested changes for code style compliance.
- As a general guideline, try to provide any relevant, official, and supporting documentation links to any tool's suggestions in review comments. This guideline is important for posterity.
- As a general rule, undocumented function definitions and class definitions in the project's Python code are assumed incomplete. Please consider suggesting a short summary of the code for any of these incomplete definitions as docstrings when reviewing.
Files:
cognee/tests/test_importance_weight.py, cognee/api/v1/add/add.py, cognee/modules/retrieval/summaries_retriever.py, cognee/tasks/memify/propagate_importance_weights.py, cognee/modules/chunking/TextChunker.py, cognee/modules/chunking/LangchainChunker.py, cognee/modules/graph/cognee_graph/CogneeGraph.py, cognee/modules/memify/memify.py, cognee/tasks/ingestion/ingest_data.py, cognee/infrastructure/engine/models/DataPoint.py, cognee/tests/test_propagate_importance_weights.py, cognee/tests/unit/modules/retrieval/chunks_retriever_test.py, cognee/modules/data/models/Data.py, cognee/modules/retrieval/lexical_retriever.py, cognee/modules/retrieval/chunks_retriever.py, cognee/tests/integration/documents/TextDocument_test.py, cognee/modules/graph/utils/expand_with_nodes_and_edges.py, cognee/modules/retrieval/utils/brute_force_triplet_search.py
cognee/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Use shared logging utilities from cognee.shared.logging_utils in Python code
Files:
cognee/tests/test_importance_weight.py, cognee/api/v1/add/add.py, cognee/modules/retrieval/summaries_retriever.py, cognee/tasks/memify/propagate_importance_weights.py, cognee/modules/chunking/TextChunker.py, cognee/modules/chunking/LangchainChunker.py, cognee/modules/graph/cognee_graph/CogneeGraph.py, cognee/modules/memify/memify.py, cognee/tasks/ingestion/ingest_data.py, cognee/infrastructure/engine/models/DataPoint.py, cognee/tests/test_propagate_importance_weights.py, cognee/tests/unit/modules/retrieval/chunks_retriever_test.py, cognee/modules/data/models/Data.py, cognee/modules/retrieval/lexical_retriever.py, cognee/modules/retrieval/chunks_retriever.py, cognee/tests/integration/documents/TextDocument_test.py, cognee/modules/graph/utils/expand_with_nodes_and_edges.py, cognee/modules/retrieval/utils/brute_force_triplet_search.py
cognee/tests/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
cognee/tests/**/*.py: Place Python tests under cognee/tests/ organized by type (unit, integration, cli_tests)
Name Python test files test_*.py and use pytest.mark.asyncio for async tests
Files:
cognee/tests/test_importance_weight.py, cognee/tests/test_propagate_importance_weights.py, cognee/tests/unit/modules/retrieval/chunks_retriever_test.py, cognee/tests/integration/documents/TextDocument_test.py
cognee/tests/*
⚙️ CodeRabbit configuration file
cognee/tests/*: When reviewing test code:
- Prioritize portability over clarity, especially when dealing with cross-Python compatibility. However, with the priority in mind, do still consider improvements to clarity when relevant.
- As a general guideline, consider the code style advocated in the PEP 8 standard (excluding the use of spaces for indentation) and evaluate suggested changes for code style compliance.
- As a style convention, consider the code style advocated in CEP-8 and evaluate suggested changes for code style compliance, pointing out any violations discovered.
- As a general guideline, try to provide any relevant, official, and supporting documentation links to any tool's suggestions in review comments. This guideline is important for posterity.
- As a project rule, Python source files with names prefixed by the string "test_" and located in the project's "tests" directory are the project's unit-testing code. It is safe, albeit a heuristic, to assume these are considered part of the project's minimal acceptance testing unless a justifying exception to this assumption is documented.
- As a project rule, any files without extensions and with names prefixed by either the string "check_" or the string "test_", and located in the project's "tests" directory, are the project's non-unit test code. "Non-unit test" in this context refers to any type of testing other than unit testing, such as (but not limited to) functional testing, style linting, regression testing, etc. It can also be assumed that non-unit testing code is usually written as Bash shell scripts.
Files:
cognee/tests/test_importance_weight.py, cognee/tests/test_propagate_importance_weights.py
cognee/api/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Public APIs should be type-annotated in Python where practical
Files:
cognee/api/v1/add/add.py
cognee/{modules,infrastructure,tasks}/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Co-locate feature-specific helpers under their respective package (modules/, infrastructure/, or tasks/)
Files:
cognee/modules/retrieval/summaries_retriever.py, cognee/tasks/memify/propagate_importance_weights.py, cognee/modules/chunking/TextChunker.py, cognee/modules/chunking/LangchainChunker.py, cognee/modules/graph/cognee_graph/CogneeGraph.py, cognee/modules/memify/memify.py, cognee/tasks/ingestion/ingest_data.py, cognee/infrastructure/engine/models/DataPoint.py, cognee/modules/data/models/Data.py, cognee/modules/retrieval/lexical_retriever.py, cognee/modules/retrieval/chunks_retriever.py, cognee/modules/graph/utils/expand_with_nodes_and_edges.py, cognee/modules/retrieval/utils/brute_force_triplet_search.py
🧠 Learnings (2)
📚 Learning: 2024-11-13T14:55:05.912Z
Learnt from: 0xideas
Repo: topoteretes/cognee PR: 205
File: cognee/tests/unit/processing/chunks/chunk_by_paragraph_test.py:7-7
Timestamp: 2024-11-13T14:55:05.912Z
Learning: When changes are made to the chunking implementation in `cognee/tasks/chunks`, the ground truth values in the corresponding tests in `cognee/tests/unit/processing/chunks` need to be updated accordingly.
Applied to files:
cognee/tests/unit/modules/retrieval/chunks_retriever_test.py, cognee/tests/integration/documents/TextDocument_test.py, cognee/modules/graph/utils/expand_with_nodes_and_edges.py
📚 Learning: 2024-10-16T07:06:28.669Z
Learnt from: borisarzentar
Repo: topoteretes/cognee PR: 144
File: cognee/tasks/chunking/query_chunks.py:1-17
Timestamp: 2024-10-16T07:06:28.669Z
Learning: The `query_chunks` function in `cognee/tasks/chunking/query_chunks.py` is used within the `search` function in `cognee/api/v1/search/search_v2.py`.
Applied to files:
cognee/modules/retrieval/chunks_retriever.py
🧬 Code graph analysis (11)
cognee/tests/test_importance_weight.py (4)
- cognee/modules/retrieval/utils/brute_force_triplet_search.py (1): brute_force_triplet_search (94-245)
- cognee/infrastructure/databases/vector/get_vector_engine.py (1): get_vector_engine (5-7)
- cognee/infrastructure/databases/graph/get_graph_engine.py (1): get_graph_engine (10-24)
- cognee/modules/graph/cognee_graph/CogneeGraph.py (3): calculate_top_triplet_importances (271-282), map_vector_distances_to_graph_nodes (203-225), map_vector_distances_to_graph_edges (227-269)

cognee/modules/retrieval/summaries_retriever.py (2)
- cognee/infrastructure/databases/vector/exceptions/exceptions.py (1): CollectionNotFoundError (5-22)
- cognee/modules/retrieval/exceptions/exceptions.py (1): NoDataError (25-32)

cognee/tasks/memify/propagate_importance_weights.py (3)
- cognee/modules/graph/cognee_graph/CogneeGraph.py (3): CogneeGraph (18-282), get_node (46-47), get_edges (56-57)
- cognee/shared/logging_utils.py (2): get_logger (212-224), info (205-205)
- cognee/modules/graph/cognee_graph/CogneeGraphElements.py (1): get_skeleton_neighbours (76-77)

cognee/modules/graph/cognee_graph/CogneeGraph.py (2)
- cognee/modules/graph/cognee_graph/CogneeGraphElements.py (3): add_attribute (67-68), add_attribute (128-129), Edge (89-156)
- cognee/infrastructure/engine/models/Edge.py (1): Edge (5-38)

cognee/modules/memify/memify.py (2)
- cognee/api/v1/memify/routers/get_memify_router.py (1): memify (38-99)
- cognee/tasks/memify/propagate_importance_weights.py (1): propagate_importance_weights (11-78)

cognee/tasks/ingestion/ingest_data.py (1)
- cognee/shared/data_models.py (1): Document (314-317)

cognee/modules/retrieval/lexical_retriever.py (2)
- cognee/api/v1/add/add.py (1): add (18-221)
- cognee/modules/graph/cognee_graph/CogneeGraph.py (1): final_score (276-280)

cognee/modules/retrieval/chunks_retriever.py (4)
- cognee/infrastructure/databases/vector/get_vector_engine.py (1): get_vector_engine (5-7)
- cognee/infrastructure/databases/vector/exceptions/exceptions.py (1): CollectionNotFoundError (5-22)
- cognee/modules/retrieval/exceptions/exceptions.py (1): NoDataError (25-32)
- cognee/modules/graph/cognee_graph/CogneeGraph.py (1): final_score (276-280)

cognee/tests/integration/documents/TextDocument_test.py (3)
- cognee/tests/integration/documents/async_gen_zip.py (1): async_gen_zip (1-12)
- cognee/modules/chunking/LangchainChunker.py (1): read (36-60)
- cognee/modules/chunking/TextChunker.py (2): read (12-81), TextChunker (11-81)

cognee/modules/graph/utils/expand_with_nodes_and_edges.py (2)
- cognee/modules/chunking/models/DocumentChunk.py (1): DocumentChunk (10-37)
- cognee/tests/unit/interfaces/graph/get_graph_from_model_unit_test.py (1): DocumentChunk (13-17)

cognee/modules/retrieval/utils/brute_force_triplet_search.py (3)
- cognee/modules/graph/cognee_graph/CogneeGraph.py (2): calculate_top_triplet_importances (271-282), final_score (276-280)
- cognee/tests/test_importance_weight.py (4): calculate_top_triplet_importances (54-55), calculate_top_triplet_importances (100-101), calculate_top_triplet_importances (129-130), calculate_top_triplet_importances (164-165)
- cognee/infrastructure/databases/vector/exceptions/exceptions.py (1): CollectionNotFoundError (5-22)
🔇 Additional comments (29)
cognee/modules/retrieval/chunks_retriever.py (2)
69-71: Verify score semantics from vector engine.

The formula `similarity_score = 1 / (1 + distance_score)` assumes `item.score` is a distance metric (lower = more similar). Some vector databases return similarity scores directly (higher = more similar), which would invert the intended ranking.

Confirm that the vector engine consistently returns distance scores, or add a comment documenting this assumption.
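The conversion in question can be illustrated with a toy helper; this is only a sketch of the formula, and whether `item.score` is actually a distance in the production vector engine is exactly what the comment asks to verify:

```python
def distance_to_similarity(distance_score: float) -> float:
    """Map a non-negative distance to (0, 1]: distance 0 gives similarity 1,
    and larger distances approach similarity 0."""
    return 1.0 / (1.0 + distance_score)


# If the engine returned similarities instead, applying the same formula would
# invert the ranking: a perfect match (score 1.0) maps to 0.5, landing below a
# weak match (score 0.1), which maps to roughly 0.909.
```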
30-32: Caching `vector_engine` at init changes retrieval behavior.

The previous implementation likely retrieved the vector engine per-call in `get_context()`. Caching it in `__init__` is more efficient but means configuration changes after instantiation won't be reflected. This is likely acceptable, but worth noting if the retriever instances are long-lived.

cognee/modules/data/models/Data.py (1)
37-37: LGTM!

The `importance_weight` column is correctly defined with `nullable=False` and `default=0.5`, consistent with the `DataPoint` model. The `to_json()` serialization follows the existing camelCase convention.

Also applies to: 60-60
cognee/modules/memify/memify.py (2)
73-81: LGTM!

The ordering is logical: `propagate_importance_weights` runs first to compute weights across the graph, then `add_rule_associations` can leverage those weights. The task construction follows the existing pattern.

24-24: Import placement follows existing conventions.

The new import is appropriately grouped with other task imports from `cognee.tasks`.

cognee/api/v1/add/add.py (1)
29-29: LGTM! Parameter addition and validation are well-implemented.

The `importance_weight` parameter is properly added with:
- A sensible default of 0.5 (mid-range)
- Clear docstring documentation explaining its purpose and range
- Proper validation ensuring values are between 0.0 and 1.0
- Correct propagation to the `ingest_data` task

Also applies to: 89-91, 171-173, 192-192
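The range check praised above can be sketched as a standalone function; the name and error messages here are illustrative, not the PR's exact code:

```python
def validate_importance_weight(importance_weight: float = 0.5) -> float:
    """Ensure importance_weight is a number in [0.0, 1.0]; raise ValueError otherwise."""
    # bool is a subclass of int, so reject it explicitly
    if isinstance(importance_weight, bool) or not isinstance(importance_weight, (int, float)):
        raise ValueError("importance_weight must be a number between 0.0 and 1.0")
    if not 0.0 <= importance_weight <= 1.0:
        raise ValueError("importance_weight must be between 0.0 and 1.0")
    return float(importance_weight)
```

Failing fast at the API boundary keeps bad weights out of ingestion, so downstream scoring never has to guess what an out-of-range value means.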
cognee/modules/chunking/TextChunker.py (1)
33-33: LGTM! Importance weight propagation is consistent.

The `importance_weight` is correctly propagated from `self.document.importance_weight` to all three `DocumentChunk` yield points:
- When handling oversized chunks with empty paragraph buffer
- When flushing accumulated paragraph chunks
- When flushing remaining chunks at the end

Also applies to: 53-53, 76-76
cognee/tests/test_propagate_importance_weights.py (1)
12-78: LGTM! Comprehensive test coverage.

The test effectively validates:
- Weight preservation for nodes with initial weights (N_A=1.0, N_B=0.2)
- Propagation to unweighted neighbors (N_X=1.0 from N_A)
- Weight fusion via averaging (N_Y=0.6 from neighbors N_A=1.0 and N_B=0.2)
- Nodes without weighted neighbors remain unchanged (N_Z)
- Edge weight calculation as node average
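The propagation strategy these assertions exercise (an unweighted node takes the average of its weighted neighbours) can be sketched graph-free; the helper and its data layout are hypothetical, not the PR's API, though the node names mirror the test:

```python
def propagate_weights(weights, neighbours):
    """One propagation pass over a graph given as plain dicts.

    `weights` maps node -> weight or None; `neighbours` maps node -> list of
    adjacent nodes. A node without a weight takes the average of its weighted
    neighbours; already-weighted nodes keep their value.
    """
    updated = dict(weights)
    for node, adjacent in neighbours.items():
        if updated.get(node) is not None:
            continue  # preserve existing weights
        seeds = [weights[n] for n in adjacent if weights.get(n) is not None]
        if seeds:
            updated[node] = sum(seeds) / len(seeds)
    return updated
```

Nodes with no weighted neighbours (like N_Z in the test) simply stay unweighted after the pass.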
cognee/tests/unit/modules/retrieval/chunks_retriever_test.py (5)
207-267: LGTM! Test validates default importance weight behavior.

The test correctly verifies:
- Chunks without explicit `importance_weight` use the default value
- Mock verifies that `score_threshold` is not passed to the vector search
- Results are properly ordered based on final scores (similarity × weight)
268-326: LGTM! Test validates weight-based ranking.

The test demonstrates that importance weighting correctly influences ranking:
- High importance (1.0) with lower similarity (0.6) → final score 0.6
- Low importance (0.1) with higher similarity (0.9) → final score 0.09
- Result: High importance ranks first despite lower similarity
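The ranking arithmetic in these bullets can be reproduced with a small sketch; the chunk dictionaries below are made-up inputs, not the retriever's real payloads:

```python
def rank_chunks(chunks):
    """Order chunks by final score = similarity * importance_weight, descending."""
    return sorted(
        chunks,
        key=lambda chunk: chunk["similarity"] * chunk["importance_weight"],
        reverse=True,
    )
```

As the test expects, a high-importance chunk (0.6 × 1.0 = 0.6) outranks a more similar but low-importance one (0.9 × 0.1 = 0.09).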
327-384: LGTM! Test validates boundary values.

The test correctly validates extreme importance weights:
- Weight 0.0 zeroes out the score regardless of similarity
- Weight 1.0 preserves full similarity score
- Demonstrates that full-weight chunk ranks higher than zero-weight chunk
254-266: Verify imports for `patch` and `AsyncMock`.

The test code uses `patch` and `AsyncMock` from `unittest.mock`, but the import statements are not visible in the provided snippet. Ensure these imports are present at the top of the file:

```python
from unittest.mock import patch, AsyncMock
```
385-442: Verify test expectations: mock scores don't align with stated behavior.

The test name references "equal_score" but the mock setup provides different scores (chunk2=1.0, chunk1=3.0). The test assertions expect chunk2 to rank first, yet with:
- chunk2: 1.0 × 0.5 = 0.5 (final score)
- chunk1: 3.0 × 1.0 = 3.0 (final score)

Standard ranking should place chunk1 first. Either the mock scores should be adjusted to match the test intent, or the assertions need correction. Review the actual ChunksRetriever.get_context() implementation to confirm the scoring formula and ensure test setup aligns with expected ranking behavior.
cognee/modules/chunking/LangchainChunker.py (1)
51-51: LGTM! Consistent importance weight propagation.

The `importance_weight` is correctly propagated from `self.document.importance_weight` to the `DocumentChunk`, maintaining consistency with the `TextChunker` implementation.

cognee/tests/integration/documents/TextDocument_test.py (1)
28-72: LGTM! Comprehensive test parameterization.

The test is well-structured with:
- Parameterization covering explicit weight (0.9) and implicit default (None → 0.5)
- Clear assertions verifying chunk `importance_weight` matches expected values
- Integration testing of weight propagation through the document chunking pipeline
cognee/tests/test_importance_weight.py (4)
66-87: LGTM! Test validates importance weight averaging.

The test correctly verifies that triplet scoring averages the importance weights from both nodes:
- Edge 1: (0.8 + 0.9) / 2 = 0.85
- Edge 2: (0.3 + 0.7) / 2 = 0.5
89-119: LGTM! Test validates behavior with missing weights.

The test correctly verifies that nodes without `importance_weight` attributes remain unchanged during triplet retrieval, confirming that the system doesn't force-add default weights to all nodes.
121-149: LGTM! Test validates boundary value averaging.

The test correctly verifies edge case handling:
- Node weights at extremes (0.0 and 1.0)
- Correct averaging: (0.0 + 1.0) / 2 = 0.5
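The averaging these boundary tests rely on reduces to one line; a minimal sketch under the assumption that an edge's importance is the mean of its endpoint weights, with a hypothetical `default` for missing values:

```python
def edge_importance(source_weight, target_weight, default=0.5):
    """Average the two endpoint weights; substitute `default` for a missing one."""
    sw = default if source_weight is None else source_weight
    tw = default if target_weight is None else target_weight
    return (sw + tw) / 2.0
```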
176-195: LGTM! Test validates ranking with explicit scores.

The test correctly verifies that edges with higher importance weights and scores rank higher, assuming the `MockEdge` initialization issue is fixed.

cognee/modules/retrieval/lexical_retriever.py (4)
14-27: LGTM! Constructor properly extended with importance weight support.

The changes correctly:
- Add `default_importance_weight` parameter with sensible default of 0.5
- Store the parameter for later use in scoring
- Maintain backward compatibility with existing code

35-36: LGTM! Public method for external payload injection.

The `add()` method provides a clean interface for caching chunk payloads, which is useful for testing and external integrations.

105-114: LGTM! Robust weight handling and scoring.

The implementation correctly:
- Retrieves `importance_weight` from payload with fallback to default
- Validates weight is numeric before use
- Computes final score as `score × weight`
- Handles scorer exceptions gracefully by setting final_score to 0.0

129-130: Minor formatting change - no functional impact.

The parameter formatting in `get_completion` was adjusted for readability. This is a cosmetic change with no functional impact.

cognee/tasks/ingestion/ingest_data.py (1)
26-34: LGTM!

The `importance_weight` parameter with default value 0.5 is correctly added to the function signature and properly type-annotated.
11-78: LGTM!

The weight propagation logic is well-structured:
- Properly filters source nodes with valid importance_weight values
- Implements average aggregation correctly for both nodes and edges
- Includes appropriate logging and error handling
- Docstring clearly documents the strategy and parameters
81-90: LGTM!

The Task wrapper follows the standard pattern and properly delegates to the async propagation function.
cognee/modules/graph/cognee_graph/CogneeGraph.py (2)
271-282: LGTM!

The updated ranking logic correctly uses `heapq.nlargest` to maximize `importance_score` (which combines vector similarity with importance weights) instead of minimizing raw distances. The docstring clarifies the merged scoring approach.
227-269: Verify fallback weight inconsistency between nodes and edges.

Edges use a fallback `importance_weight` of 1.0 (line 259), while nodes use 0.5 (line 217). Given that `propagate_importance_weights.py` computes edge weights as the average of endpoint node weights, this inconsistency may cause unexpected behavior if edges lack propagated weights.

Please verify whether:
- The 1.0 fallback for edges is intentional and serves a specific purpose
- Or if it should be aligned with the node fallback of 0.5 for consistency
cognee/modules/graph/utils/expand_with_nodes_and_edges.py (1)
50-50: Verify DocumentChunk has importance_weight attribute.

The code accesses `data_chunk.importance_weight` in multiple locations. Please verify that the `DocumentChunk` model (which extends `DataPoint`) has the `importance_weight` attribute defined, as this is critical for the feature to work correctly.

Also applies to: 62-62, 87-87, 138-138, 198-198, 299-299
Fix logs and MockEdge class
@3619117923 I'd like to apologise since we caused a bit of a problem with this issue. Our architecture was refactored considerably since this issue was created and our engineers did not account for the changes that this PR would require, and they made our core engine incompatible with the code you provided. |
#1768
Summary:
Changes:
- Add importance_weight to add.py, the DataPoint class, and graphs.
- Fix scoring logic and retrievers to use importance_weight.
- Create new tests and add test cases.
Type of Change
Pre-submission Checklist
DCO Affirmation
I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin.
Summary by CodeRabbit
Release Notes
New Features
Tests