fix: refactor get_graph_from_model to return nodes and edges correctly #257
Conversation
Walkthrough
The pull request introduces several modifications across multiple files, primarily focusing on the reordering of tasks in the pipeline.
Actionable comments posted: 5
🧹 Outside diff range and nitpick comments (7)
cognee/modules/graph/utils/get_graph_from_model.py (2)
65-68: Use a more robust method for splitting field names and indices.
Splitting `field_name` using `"."` assumes that field names do not contain periods. This could lead to errors if field names include dots. Consider using a structured format or a delimiter that is guaranteed not to appear in field names, or use a data structure to keep track of field names and indices separately.
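As an illustration of the suggestion above, keying by a tuple avoids any delimiter entirely. This is a hedged sketch with hypothetical names (`mark_visited`, `visited_properties`), not the PR's actual code:

```python
# Sketch: track visited properties with tuple keys instead of "."-joined
# strings, so a field name containing "." cannot be mis-split or collide.
visited_properties = {}

def mark_visited(parent_id, field_name, index=None):
    """Return True if this property was already visited, else record it."""
    key = (parent_id, field_name, index)  # tuple key: no delimiter, no ambiguity
    if key in visited_properties:
        return True
    visited_properties[key] = True
    return False

assert mark_visited("node1", "a.b") is False      # first visit
assert mark_visited("node1", "a.b") is True       # already seen
assert mark_visited("node1", "a", "b") is False   # would collide as the string "node1.a.b"
```

Tuples of hashable values work directly as dict keys, so no parsing or re-splitting is ever needed.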
92-105: Optimize edge creation to avoid duplicates.
The `edge_key` is constructed to check for existing edges. Ensure that `edge_key` uniquely represents each edge, especially when multiple edges could exist between the same nodes but with different relationships. Consider incorporating the `relationship_name` or other distinguishing attributes into the `edge_key` to prevent unintended deduplication.

cognee/tests/unit/interfaces/graph/get_graph_from_model_test.py (2)
39-39: Remove unnecessary f-string
The f-string doesn't contain any placeholders.

```diff
- name = f"Entity",
+ name = "Entity",
```

🧰 Tools
🪛 Ruff (0.8.0)
39-39: f-string without any placeholders
Remove extraneous `f` prefix (F541)
68-69: Document test assertions
Add comments explaining why we expect exactly 4 nodes and 3 edges to help future maintainers understand the test case.

```diff
+ # Expect 4 nodes: 1 Document + 1 DocumentChunk + 1 Entity + 1 EntityType
  assert len(nodes) == 4
+ # Expect 3 edges: DocumentChunk->Document + Entity->EntityType + DocumentChunk->Entity
  assert len(edges) == 3
```

cognee/tests/unit/interfaces/graph/get_graph_from_huge_model_test.py (1)
76-82: Consider memory implications of shared dictionaries
The shared dictionaries `added_nodes` and `added_edges` could grow significantly with 1500 files. Consider:
- Adding memory usage monitoring
- Implementing cleanup after the test

```diff
  added_nodes = {}
  added_edges = {}

  start = time.perf_counter_ns()
+ initial_memory = psutil.Process().memory_info().rss

  results = await asyncio.gather(*[
      get_graph_from_model(code_file, added_nodes = added_nodes, added_edges = added_edges)
      for code_file in code_files
  ])
+
+ final_memory = psutil.Process().memory_info().rss
+ memory_used = (final_memory - initial_memory) / 1024 / 1024  # MB
+ print(f"Memory used: {memory_used:.2f} MB")
+
+ # Cleanup
+ added_nodes.clear()
+ added_edges.clear()
```
cognee/modules/graph/utils/expand_with_nodes_and_edges.py (1)
Line range hint 77-89: Fix indentation of return statement
The return statement is incorrectly indented inside the for loop, causing a premature return after processing only the first item.

```diff
          existing_edges_map[edge_key] = True

-         return (data_points, relationships)
+     return (data_points, relationships)
```
cognee/tasks/repo_processor/enrich_dependency_graph.py (1)
Line range hint 102-116: Remove commented-out code
The commented-out code sections should be removed as they:
- Add noise to the codebase
- Make it harder to maintain and understand the current implementation
- Are tracked in version control if needed for reference

```diff
- # for node_id, node in tqdm(graph.nodes(data = True), desc = "Enriching dependency graph", unit = "node"):
-     # if node_id not in node_rank_map:
-         # continue
-
-     # data_points.append(node_enrich_and_connect(graph, topological_order, node))

  data_points_map = {data_point.id: data_point for data_point in data_points}

- # data_points_futures = []

  for data_point in tqdm(data_points, desc = "Enriching dependency graph", unit = "data_point"):
      if data_point.id not in node_rank_map:
          continue

      if isinstance(data_point, CodeFile):
-         # data_points_futures.append(node_enrich_and_connect(graph, topological_order, data_point, data_points_map))
          await node_enrich_and_connect(graph, topological_order, data_point, data_points_map)
          yield data_point

- # await asyncio.gather(*data_points_futures)
-
- # return data_points
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (14)
- cognee/api/v1/cognify/cognify_v2.py (1 hunks)
- cognee/modules/chunking/models/DocumentChunk.py (2 hunks)
- cognee/modules/chunking/models/__init__.py (1 hunks)
- cognee/modules/engine/models/Entity.py (0 hunks)
- cognee/modules/engine/models/EntityType.py (0 hunks)
- cognee/modules/graph/utils/expand_with_nodes_and_edges.py (4 hunks)
- cognee/modules/graph/utils/get_graph_from_model.py (1 hunks)
- cognee/tasks/documents/classify_documents.py (1 hunks)
- cognee/tasks/repo_processor/enrich_dependency_graph.py (1 hunks)
- cognee/tasks/storage/add_data_points.py (1 hunks)
- cognee/tasks/summarization/models.py (1 hunks)
- cognee/tasks/summarization/summarize_text.py (1 hunks)
- cognee/tests/unit/interfaces/graph/get_graph_from_huge_model_test.py (1 hunks)
- cognee/tests/unit/interfaces/graph/get_graph_from_model_test.py (1 hunks)
💤 Files with no reviewable changes (2)
- cognee/modules/engine/models/Entity.py
- cognee/modules/engine/models/EntityType.py
✅ Files skipped from review due to trivial changes (2)
- cognee/modules/chunking/models/__init__.py
- cognee/tasks/documents/classify_documents.py
🧰 Additional context used
🪛 Ruff (0.8.0)
cognee/tests/unit/interfaces/graph/get_graph_from_model_test.py
39-39: f-string without any placeholders
Remove extraneous f prefix
(F541)
🔇 Additional comments (13)
cognee/modules/graph/utils/get_graph_from_model.py (4)
39-45: Correctly handle indexing in lists of DataPoint objects.
When iterating over lists of DataPoint instances, the construction of property_key includes the index. Verify that this approach uniquely identifies each item and that all items are processed without duplication or omission.
75-77: Efficiently skip already added nodes.
The check if str(field_value.id) in added_nodes prevents reprocessing nodes. This improves performance and avoids redundant work.
78-84: Verify recursive calls pass all necessary arguments.
The recursive call to get_graph_from_model now includes added_nodes, added_edges, and visited_properties. Ensure that these arguments are correctly passed to maintain the integrity of the graph construction.
27-32:
Prevent infinite recursion by properly updating visited_properties.
The check for property_key in visited_properties helps avoid processing properties multiple times. However, ensure that property_key is added to visited_properties after processing to prevent infinite loops due to cyclic references.
Confirm that property_key is correctly added to visited_properties after processing each property.
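The cyclic-reference risk raised here can be seen in a minimal standalone example (a hypothetical `Node` class and `collect` helper, not the PR's implementation): recording the key *before* recursing is what guarantees termination.

```python
# Minimal sketch: two objects referencing each other form a cycle; marking
# each node as visited before descending prevents infinite recursion.
class Node:
    def __init__(self, name):
        self.name = name
        self.neighbor = None

def collect(node, visited=None):
    if visited is None:
        visited = {}
    if node.name in visited:
        return []
    visited[node.name] = True  # mark before recursing, or cycles never end
    result = [node.name]
    if node.neighbor is not None:
        result.extend(collect(node.neighbor, visited))
    return result

a, b = Node("a"), Node("b")
a.neighbor, b.neighbor = b, a  # cyclic reference
print(collect(a))  # → ['a', 'b'] despite the cycle
```

Moving the `visited[...] = True` line to after the recursive call would make this loop forever on the cycle, which is exactly the failure mode the comment warns about.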
cognee/tasks/summarization/models.py (1)
2-2:
Verify import paths after restructuring modules.
The import statement for DocumentChunk has changed. Ensure that all references to DocumentChunk in this file and elsewhere in the codebase are updated to reflect the new import path.
Run the following script to find all imports of DocumentChunk:
Update them to:
`from cognee.modules.chunking.models import DocumentChunk`

cognee/modules/chunking/models/DocumentChunk.py (2)
1-1: Import List and Optional from typing module.
The addition of List suggests that Entity objects will be stored in a list within DocumentChunk. Ensure that List is correctly imported to avoid NameError.
4-4: Verify the import of Entity from the correct module.
Ensure that the Entity class is imported from the right location, and that any dependencies are correctly handled.
cognee/tasks/summarization/summarize_text.py (1)
25-25:
Update function return type and handle returned summaries appropriately.
The function summarize_text now returns summaries instead of the original data_chunks. Ensure that any code that calls this function is updated to handle summaries correctly.
Run the following script to find all usages of summarize_text and check how the returned value is used:
cognee/tasks/storage/add_data_points.py (2)
14-14: Ensure visited_properties is properly utilized.
Although visited_properties is initialized, it's not clear how it's being used within add_data_points. Verify that it's correctly passed to all necessary functions and utilized to prevent redundant processing.
21-22:
Confirm that get_graph_from_model calls are updated with new parameters.
The calls to get_graph_from_model now include added_nodes, added_edges, and visited_properties. Ensure that these changes are properly reflected in all places where add_data_points is used.
Run the following script to check all invocations:
✅ Verification successful
Let me check the implementation of get_graph_from_model and its usage to verify this concern.
All calls to get_graph_from_model are properly updated with the required parameters
Based on the search results, I can verify that all calls to get_graph_from_model across the codebase are correctly passing the required parameters (added_nodes, added_edges, and visited_properties). This is evidenced by:
- In add_data_points.py: uses all three parameters
- In get_graph_from_model_test.py: uses `added_nodes` and `added_edges`
- In get_graph_from_huge_model_test.py: uses `added_nodes` and `added_edges`
- In get_graph_from_model.py: the recursive call within the function itself uses all parameters
- Other usages in enrich_dependency_graph.py and profiling code are for different purposes and don't require these tracking parameters
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify all calls to `add_data_points` are updated accordingly.
rg "add_data_points\(" -A 5
Length of output: 1237
Script:
#!/bin/bash
# Search for get_graph_from_model implementation and usage
ast-grep --pattern 'get_graph_from_model'
# Also search with ripgrep for more context
rg "get_graph_from_model" -A 5
Length of output: 10360
cognee/modules/graph/utils/expand_with_nodes_and_edges.py (1)
53-53: Consider memory implications of appending entities
Appending entities to document_chunk.contains without bounds could lead to memory issues. Consider implementing a maximum limit or cleanup mechanism.
cognee/tasks/repo_processor/enrich_dependency_graph.py (1)
73-73: Approve type safety improvement
The explicit string conversion of desc_id is a good defensive programming practice that prevents potential type-related issues when interacting with the graph engine.
cognee/api/v1/cognify/cognify_v2.py (1)
89-89: Verify task dependency order change
The add_data_points task has been moved before extract_graph_from_data in the pipeline. This change could affect data dependencies between tasks.
Let's verify the task dependencies and potential impact:
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (1)
cognee/tasks/repo_processor/enrich_dependency_graph.py (1)
Line range hint 116-134: Consider adding explanatory comments for the removed code.
While removing commented-out code improves readability, consider adding a brief comment explaining why async gathering was removed and if there are any performance implications.

Add a comment like:

```diff
+ # Note: Async gathering was removed to simplify the flow and prevent potential memory issues
  for data_point in tqdm(data_points, desc = "Enriching dependency graph", unit = "data_point"):
      if data_point.id not in node_rank_map:
          continue
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
- cognee/tasks/repo_processor/enrich_dependency_graph.py (2 hunks)
- profiling/graph_pydantic_conversion/profile_graph_pydantic_conversion.py (1 hunks)
🔇 Additional comments (3)
profiling/graph_pydantic_conversion/profile_graph_pydantic_conversion.py (1)
31-39: LGTM: Parameter initialization and usage is correct.
The initialization of tracking dictionaries and their usage in get_graph_from_model is properly implemented.
cognee/tasks/repo_processor/enrich_dependency_graph.py (2)
73-73: LGTM: Explicit string conversion improves type safety.
Converting desc_id to string explicitly is a good practice to prevent potential type-related issues.
90-100: LGTM: Consistent parameter usage across files.
The initialization and usage of tracking dictionaries align with the changes in other files.
Actionable comments posted: 4
🧹 Outside diff range and nitpick comments (2)
cognee/tests/unit/interfaces/graph/get_graph_from_model_test.py (1)
39-39: Remove unnecessary f-string
The f-string doesn't contain any placeholders.

```diff
- name = f"Entity",
+ name = "Entity",
```

🧰 Tools
🪛 Ruff (0.8.0)
39-39: f-string without any placeholders
Remove extraneous `f` prefix (F541)
cognee/modules/graph/utils/expand_with_nodes_and_edges.py (1)
84-84: Add return type annotation
Document the return type for better code clarity and type safety.

Add return type annotation:

```diff
  def expand_with_nodes_and_edges(
      data_chunks: list[DocumentChunk],
      chunk_graphs: list[KnowledgeGraph],
      existing_edges_map: Optional[dict[str, bool]] = None,
- ):
+ ) -> tuple[list[DocumentChunk], list[tuple[str, str, str, dict]]]:
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (7)
- cognee/modules/chunking/TextChunker.py (3 hunks)
- cognee/modules/chunking/models/DocumentChunk.py (2 hunks)
- cognee/modules/graph/utils/expand_with_nodes_and_edges.py (4 hunks)
- cognee/modules/graph/utils/retrieve_existing_edges.py (1 hunks)
- cognee/tasks/graph/extract_graph_from_data.py (1 hunks)
- cognee/tests/unit/interfaces/graph/get_graph_from_model_test.py (1 hunks)
- evals/eval_swe_bench.py (0 hunks)
💤 Files with no reviewable changes (1)
- evals/eval_swe_bench.py
🚧 Files skipped from review as they are similar to previous changes (1)
- cognee/modules/chunking/models/DocumentChunk.py
🧰 Additional context used
🪛 Ruff (0.8.0)
cognee/tests/unit/interfaces/graph/get_graph_from_model_test.py
39-39: f-string without any placeholders
Remove extraneous f prefix
(F541)
🔇 Additional comments (8)
cognee/tasks/graph/extract_graph_from_data.py (1)
23-26: LGTM! Verify refactoring impact on dependent code
The refactoring improves code clarity by passing data_chunks and chunk_graphs separately instead of using a combined tuple structure. This change aligns well with the single responsibility principle.
Also applies to: 31-32
✅ Verification successful
Let me analyze the function definitions to verify the parameter changes are consistent.
Let me try a different approach using ripgrep to check the function definitions.
Refactoring changes are consistent across the codebase
The function signatures in both the implementation files and their usage in extract_graph_from_data.py are aligned. Both functions consistently accept separate data_chunks and chunk_graphs parameters, and there are no remaining instances of the old tuple structure pattern in the codebase.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check for any remaining usages of the old tuple structure pattern
rg -A 2 "chunk_and_chunk_graphs|zip\(.*data_chunks.*chunk_graphs\)"
# Check for any other calls to these functions to ensure they've been updated
rg -A 2 "retrieve_existing_edges|expand_with_nodes_and_edges"
Length of output: 2107
Script:
#!/bin/bash
# Check the function definitions and their parameters
ast-grep --pattern 'async def retrieve_existing_edges($_) {
$$$
}'
ast-grep --pattern 'def expand_with_nodes_and_edges($_) {
$$$
}'
Length of output: 139
Script:
#!/bin/bash
# Check the function definitions in their respective files
rg -A 5 "^(async )?def (retrieve_existing_edges|expand_with_nodes_and_edges)" cognee/modules/graph/utils/
Length of output: 1181
cognee/modules/graph/utils/retrieve_existing_edges.py (1)
26-26: LGTM! Edge construction is more explicit
The edge construction using data_chunk.id directly improves code clarity and maintainability.
Also applies to: 32-32, 35-35
cognee/tests/unit/interfaces/graph/get_graph_from_model_test.py (2)
27-69: 🛠️ Refactor suggestion
Add test cases for edge scenarios
The test only covers a basic case. As mentioned in the previous review, consider adding test cases for:
- Empty document chunks
- Multiple document chunks with shared entities
- Document chunks with no entities
- Edge cases around property visiting and cycles
Would you like me to help generate these additional test cases?
🧰 Tools
🪛 Ruff (0.8.0)
39-39: f-string without any placeholders
Remove extraneous f prefix
(F541)
15-15:
Initialize contains as empty list
Initializing contains as None but using it as a list can lead to runtime errors.
```diff
- contains: List["Entity"] = None
+ contains: List["Entity"] = []
```

Likely invalid or redundant comment.
cognee/modules/graph/utils/expand_with_nodes_and_edges.py (1)
3-3: Add type validation for input parameters
The function parameters have been updated for better type safety, but runtime type checking is still recommended.
Also applies to: 14-15
cognee/modules/chunking/TextChunker.py (3)
38-38: LGTM! Proper initialization of contains attribute
The empty list initialization is correct and consistent with the changes in expand_with_nodes_and_edges.py.
56-56: LGTM! Consistent initialization with error handling
The initialization is consistent with the first DocumentChunk creation and properly wrapped in a try-except block.
78-78: LGTM! Consistent initialization in final block
The initialization maintains consistency with previous DocumentChunk creations and includes proper error handling.
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (1)
cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py (1)
Line range hint
1-78: Consider adding monitoring and observability
Given that this is a critical embedding service, consider adding:
- Metrics for tracking rate limits and context window exceeded events
- Logging for retry attempts and embedding sizes
- Circuit breaker pattern for handling sustained rate limiting
This will help with:
- Monitoring service health and performance
- Capacity planning
- Early detection of issues
- Cost optimization
Would you like me to provide an example implementation using a metrics library?
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (3)
cognee/modules/graph/utils/get_graph_from_model.py (3)
30-48: Consider extracting property key generation logic.
The property key generation logic is repeated and could be encapsulated in a helper method for better maintainability and consistency.

```diff
+ def generate_property_key(parent_id: str, field_name: str, value_id: str) -> str:
+     return f"{parent_id}{field_name}{value_id}"
+
  async def get_graph_from_model(
      # ... existing parameters ...
  ):
      # ... existing code ...
-     property_key = str(data_point.id) + field_name + str(field_value.id)
+     property_key = generate_property_key(str(data_point.id), field_name, str(field_value.id))
```

54-62: Add documentation for metadata handling.
The model copying logic includes specific metadata fields, but their purpose and importance aren't documented. Consider adding a docstring explaining why these fields are essential.

82-87: Consider extracting edge metadata creation.
The edge metadata creation could be moved to a helper function for better maintainability and potential reuse.

```diff
+ def create_edge_metadata(source_id: str, target_id: str, relationship: str) -> dict:
+     return {
+         "source_node_id": source_id,
+         "target_node_id": target_id,
+         "relationship_name": relationship,
+         "updated_at": datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S"),
+     }
+
  # In the main function:
- edges.append((data_point.id, field_value.id, field_name, {
-     "source_node_id": data_point.id,
-     "target_node_id": field_value.id,
-     "relationship_name": field_name,
-     "updated_at": datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S"),
- }))
+ edges.append((
+     data_point.id,
+     field_value.id,
+     field_name,
+     create_edge_metadata(data_point.id, field_value.id, field_name)
+ ))
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
- cognee/modules/graph/utils/get_graph_from_model.py (1 hunks)
🔇 Additional comments (2)
cognee/modules/graph/utils/get_graph_from_model.py (2)
7-9: LGTM: Function signature change improves clarity.
The explicit parameter requirements make the contract clearer. However, as noted in previous reviews, some callers need updates.
12-14: LGTM: Efficient early return optimization.
The early return prevents redundant processing of already visited nodes.
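The payoff of the early return is easy to measure in a toy version (names mirror the review, but this is an illustrative sketch, not the PR's code):

```python
# Toy sketch: a shared `added_nodes` dict turns repeat processing of the
# same node into an O(1) no-op via the early return.
work_done = 0

def get_graph(node_id, added_nodes):
    global work_done
    if node_id in added_nodes:  # early return for already-visited nodes
        return [], []
    added_nodes[node_id] = True
    work_done += 1              # stand-in for the expensive traversal
    return [node_id], []

added = {}
get_graph("n1", added)
get_graph("n1", added)  # skipped: no extra work performed
get_graph("n2", added)
print(work_done)  # → 2
```

Without the early return, every repeated reference would redo the full traversal; with it, duplicates cost only a dictionary lookup.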
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (1)
cognee/shared/CodeGraphEntities.py (1)
Line range hint
1-42: Consider documenting the table naming convention
The change to lowercase table names is good practice. Consider adding a comment or updating documentation to explicitly state the naming convention for future contributors.
Would you like me to help create a documentation update for the project's SQL naming conventions?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
- cognee/shared/CodeGraphEntities.py (2 hunks)
🔇 Additional comments (1)
cognee/shared/CodeGraphEntities.py (1)
24-24: Verify database migration plan for table rename
Similar to the CodeFile change, renaming from "CodePart" to "codepart" requires migration planning.
Please ensure that:
- A database migration plan is in place
- Any existing SQL queries referencing the old table name are updated
… into fix/graph-extraction-from-model
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (8)
cognee/tests/unit/interfaces/graph/get_graph_from_model_test.py (2)
39-39: Remove unnecessary f-string
The string doesn't contain any placeholders, so the f-string prefix is unnecessary.

```diff
- name = f"Entity",
+ name = "Entity",
```

🧰 Tools
🪛 Ruff (0.8.0)
39-39: f-string without any placeholders
Remove extraneous `f` prefix (F541)

65-66: Enhance test assertions
The current assertions only verify the count of nodes and edges. Consider adding assertions to verify:
- The actual structure of the graph
- The relationships between nodes
- The properties of nodes and edges

```python
# Example of more detailed assertions:
assert any(node["type"] == "Document" for node in nodes)
assert any(node["type"] == "DocumentChunk" for node in nodes)
assert any(node["type"] == "Entity" for node in nodes)
assert any(node["type"] == "EntityType" for node in nodes)
assert any(edge["type"] == "PART_OF" for edge in edges)
assert any(edge["type"] == "CONTAINS" for edge in edges)
assert any(edge["type"] == "IS_TYPE" for edge in edges)
```

cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py (1)
74-74: Remove unnecessary f-string
The error message doesn't contain any placeholders, so the f-string prefix is unnecessary.

```diff
- raise Exception(f"Rate limit exceeded and no more retries left.")
+ raise Exception("Rate limit exceeded and no more retries left.")
```

🧰 Tools
🪛 Ruff (0.8.0)
74-74: f-string without any placeholders
Remove extraneous `f` prefix (F541)
cognee/modules/graph/utils/expand_with_nodes_and_edges.py (2)
56-59: Improve null check for contains attribute
The null check for `data_chunk.contains` is good, but consider adding type annotation and documentation.

Add type hints and docstring:

```diff
  def expand_with_nodes_and_edges(
      data_chunks: list[DocumentChunk],
      chunk_graphs: list[KnowledgeGraph],
      existing_edges_map: Optional[dict[str, bool]] = None,
- ):
+ ) -> tuple[list[DocumentChunk], list[tuple]]:
+     """Expands document chunks with nodes and edges from knowledge graphs.
+
+     Args:
+         data_chunks: List of document chunks to expand
+         chunk_graphs: List of knowledge graphs corresponding to each chunk
+         existing_edges_map: Optional map of existing edges to avoid duplicates
+
+     Returns:
+         Tuple of (expanded document chunks, relationships)
+     """
```

88-88: Consider adding type hints to return value
The return type should be explicitly annotated to improve code maintainability.

Add return type annotation:

```diff
- return (data_chunks, relationships)
+ return (data_chunks, relationships)  # type: tuple[list[DocumentChunk], list[tuple]]
```

cognee/modules/graph/utils/get_graph_from_model.py (3)
7-9: Improve type hints for dictionary parameters
The dictionary parameters should have more specific type hints to indicate their content structure.

Add more specific type hints:

```diff
  async def get_graph_from_model(
      data_point: DataPoint,
-     added_nodes: dict,
-     added_edges: dict,
-     visited_properties: dict = None,
+     added_nodes: dict[str, bool],
+     added_edges: dict[str, bool],
+     visited_properties: dict[str, bool] | None = None,
```

12-14: Consider adding logging for skipped nodes
The early return for already processed nodes could benefit from debug logging.

Add debug logging:

```diff
+ import logging
+ logger = logging.getLogger(__name__)
+
  if str(data_point.id) in added_nodes:
+     logger.debug(f"Skipping already processed node: {data_point.id}")
      return [], []
```

Line range hint 112-122: Document the helper function
The new helper function `get_own_property_nodes` lacks documentation explaining its purpose and parameters.

Add docstring:

```diff
  def get_own_property_nodes(property_nodes, property_edges):
+     """Filter property nodes to get only those that are not destinations in edges.
+
+     Args:
+         property_nodes: List of all property nodes
+         property_edges: List of edges connecting properties
+
+     Returns:
+         List of property nodes that are not destinations in any edge
+     """
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (6)
- cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py (2 hunks)
- cognee/modules/graph/utils/expand_with_nodes_and_edges.py (4 hunks)
- cognee/modules/graph/utils/get_graph_from_model.py (1 hunks)
- cognee/tests/unit/interfaces/graph/get_graph_from_model_test.py (1 hunks)
- notebooks/cognee_demo.ipynb (10 hunks)
- profiling/graph_pydantic_conversion/profile_graph_pydantic_conversion.py (1 hunks)
🔥 Files not summarized due to errors (1)
- notebooks/cognee_demo.ipynb: Error: Server error: no LLM provider could handle the message
🧰 Additional context used
🪛 Ruff (0.8.0)
cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py
74-74: f-string without any placeholders
Remove extraneous f prefix
(F541)
cognee/tests/unit/interfaces/graph/get_graph_from_model_test.py
39-39: f-string without any placeholders
Remove extraneous f prefix
(F541)
🔇 Additional comments (11)
notebooks/cognee_demo.ipynb (3)
268-268: Notebook execution metadata changes
These changes are automatically generated by Jupyter and don't affect the functionality.
Also applies to: 307-307, 352-352, 396-396, 440-440, 482-482, 532-532, 625-625
886-886: Python version metadata update
The notebook's Python version is updated to 3.11.8.
663-663: Verify the integration of the new data points task
The addition of add_data_points task with batch processing appears to be part of the document processing pipeline. Let's verify its integration with the upstream tasks.
✅ Verification successful
Task integration is properly configured and consistent
The add_data_points task is well-integrated into the codebase:
- It's a standard task used across multiple pipelines (code_graph_pipeline, cognify_pipeline)
- The task processes `DataPoint` entities which are properly defined in the infrastructure layer
- The batch_size configuration of 10 is consistent with other pipeline configurations in the codebase
- The task has appropriate dependencies and is used in the correct sequence within the pipeline
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check the implementation of add_data_points task and its dependencies
# Look for the add_data_points task implementation
ast-grep --pattern 'def add_data_points($$$):'
# Check for any configuration or interface definitions
rg -A 5 'class.*DataPoints|def.*add_data_points'
Length of output: 469
Script:
#!/bin/bash
# Let's check the DataPoint type definition and task configuration
rg -A 10 'class DataPoint'
# Check how add_data_points is used in task configurations
rg -A 5 'add_data_points.*Task|Task.*add_data_points'
# Check imports and dependencies in the add_data_points module
rg -l 'from.*import.*DataPoint|import.*DataPoint' --type py
# Look for batch processing implementation
ast-grep --pattern 'batch_size.*=.*10'
Length of output: 5266
cognee/tests/unit/interfaces/graph/get_graph_from_model_test.py (2)
12-15: Initialize contains list in constructor
The contains list is initialized as a class variable which could lead to shared state between instances. Initialize it in the constructor instead.
```diff
  class DocumentChunk(DataPoint):
      part_of: Document
      text: str
-     contains: List["Entity"] = None
+     contains: List["Entity"]
+
+     def __init__(self, **data):
+         super().__init__(**data)
+         self.contains = []
```

27-44: Consider testing edge cases
The test only covers a basic case with a single document chunk containing one entity.
Consider adding test cases for:
- Empty document chunks
- Multiple document chunks with shared entities
- Document chunks with no entities
🧰 Tools
🪛 Ruff (0.8.0)
39-39: f-string without any placeholders
Remove extraneous f prefix
(F541)
profiling/graph_pydantic_conversion/profile_graph_pydantic_conversion.py (2)
31-39: LGTM!
The implementation correctly initializes and passes the required tracking dictionaries to get_graph_from_model.
42-51: LGTM!
The synchronous wrapper correctly implements the new parameter requirements.
cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py (2)
32-38: LGTM!
The retry mechanism is well-implemented with:
- Reasonable maximum retry limit
- Exponential backoff with a sensible cap
- Clear wait time calculation
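A capped exponential backoff of the kind described above typically reduces to one formula; the constants below are assumptions for illustration, not values read from LiteLLMEmbeddingEngine:

```python
# Sketch: exponential backoff with a hard cap, so retry waits grow quickly
# but any single wait stays bounded.
BASE_DELAY = 1.0    # assumed base delay in seconds
MAX_DELAY = 60.0    # assumed cap on a single wait

def backoff_delay(attempt):
    """Seconds to wait before retry `attempt` (0-based): exponential, capped."""
    return min(BASE_DELAY * (2 ** attempt), MAX_DELAY)

print([backoff_delay(a) for a in range(7)])
# → [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0]
```

The cap is what keeps sustained rate limiting from producing unbounded waits; a jitter term is often added on top to avoid synchronized retries.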
55-67:
Fix critical issues in context window handling
The current implementation has potential issues that could lead to infinite recursion.
```diff
  if len(text) == 1:
-     parts = [text]
+     # If we can't split further, raise the original error
+     raise error
  else:
      parts = [text[0:math.ceil(len(text)/2)], text[math.ceil(len(text)/2):]]

  parts_futures = [self.embed_text(part) for part in parts]
  embeddings = await asyncio.gather(*parts_futures)

  all_embeddings = []
  for embeddings_part in embeddings:
      all_embeddings.extend(embeddings_part)

  return all_embeddings
```

cognee/modules/graph/utils/expand_with_nodes_and_edges.py (2)
14-15:
Add parameter validation for data_chunks and chunk_graphs
The function assumes data_chunks and chunk_graphs have matching lengths. Add validation to prevent IndexError.
This was previously flagged in a past review. Apply the suggested validation:
```diff
  def expand_with_nodes_and_edges(
      data_chunks: list[DocumentChunk],
      chunk_graphs: list[KnowledgeGraph],
      existing_edges_map: Optional[dict[str, bool]] = None,
  ):
+     if len(data_chunks) != len(chunk_graphs):
+         raise ValueError("data_chunks and chunk_graphs must have the same length")
```

24-26: Verify index access safety
The index-based access to chunk_graphs could raise IndexError if lists are empty.
Run this script to check for empty list handling:
✅ Verification successful
No safety concerns with index access
The code is safe from IndexError for the following reasons:
- The `data_chunks` and `chunk_graphs` lists are created in parallel using a list comprehension: `[extract_content_graph(chunk.text, graph_model) for chunk in data_chunks]`
- This ensures both lists have the same length, making the index access in the enumerate loop safe
- The function also handles the case where a graph could be None with an explicit check: `if graph is None`
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for empty list handling in expand_with_nodes_and_edges calls
# Search for function calls and their context
rg -B 5 -A 5 "expand_with_nodes_and_edges\("
# Search for test cases
rg -B 5 -A 5 "test.*expand_with_nodes_and_edges"
Length of output: 1742
Script:
#!/bin/bash
# Let's check the implementation to understand how the lists are handled
# First, let's see the full implementation of expand_with_nodes_and_edges
ast-grep --pattern 'def expand_with_nodes_and_edges($_) { $$$ }'
# Also check if there's any validation of input parameters
rg -B 2 -A 2 "data_chunks|chunk_graphs" cognee/modules/graph/utils/expand_with_nodes_and_edges.py
# Let's also check the caller to see how these lists are prepared
rg -B 10 -A 2 "expand_with_nodes_and_edges" cognee/tasks/graph/extract_graph_from_data.py
Length of output: 1432
Summary by CodeRabbit
Release Notes
New Features
- `DocumentChunk` updated to hold multiple related entities.
- `get_graph_from_model` function updated to track added nodes and edges.
- `summarize_text` function changed to return a list of summaries.
- Synchronous wrapper added for `get_graph_from_model` to facilitate synchronous execution.

Bug Fixes
- `desc_id` handled as a string in the `enrich_dependency_graph` function.

Refactor
- `classify_documents` function parameter formatting improved.
- `node_enrich_and_connect` function cleaned up by removing commented-out code.
- `extract_graph_from_data` function simplified.
- `TextChunker` updated to include the `contains` attribute in `DocumentChunk`.
- Error handling improved in the `embed_text` method of the `LiteLLMEmbeddingEngine`.
- `__tablename__` attributes lowercased for `CodeFile` and `CodePart` for consistency.

Tests
- Unit tests added for `get_graph_from_model` to validate node and edge counts.