feat: Calculate graph metrics for networkx graph [COG-1082] #484
Conversation
Walkthrough
The changes add an `include_optional` parameter to the graph-metrics flow and implement descriptive metric calculation for NetworkX graphs.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant C as Client Call
    participant S as store_descriptive_metrics
    participant E as Graph Engine
    participant A as Graph Adapter
    C->>S: Call store_descriptive_metrics(data_points, include_optional)
    S->>E: get_graph_metrics(include_optional)
    E->>A: get_graph_metrics(include_optional)
    A-->>E: Return computed metrics
    E-->>S: Return metrics data
    S-->>C: Process and store metrics
```
Actionable comments posted: 1
🔭 Outside diff range comments (1)
cognee/infrastructure/databases/graph/neo4j_driver/adapter.py (1)
534-546: Implement actual graph metrics calculation.
The method currently returns hardcoded placeholder values and doesn't utilize the `include_optional` parameter. Consider implementing actual metrics calculation using Neo4j's graph algorithms. Here's a suggested implementation:
```diff
 async def get_graph_metrics(self, include_optional=False):
+    """Get graph metrics from Neo4j.
+
+    Args:
+        include_optional (bool, optional): Whether to include computationally
+            expensive metrics. Defaults to False.
+
+    Returns:
+        dict: A dictionary containing graph metrics.
+    """
+    # Basic metrics query
+    basic_metrics_query = """
+        MATCH (n)
+        OPTIONAL MATCH (n)-[r]-()
+        WITH COUNT(DISTINCT n) as nodes, COUNT(r) as edges
+        RETURN nodes as num_nodes,
+               edges as num_edges,
+               CASE WHEN nodes > 0 THEN toFloat(edges)/nodes ELSE 0 END as mean_degree,
+               CASE WHEN nodes > 1 THEN toFloat(edges)/(nodes * (nodes-1)) ELSE 0 END as edge_density
+    """
+    basic_results = await self.query(basic_metrics_query)
+    metrics = basic_results[0] if basic_results else {}
+
+    if include_optional:
+        # Additional metrics for connected components
+        components_query = """
+            CALL gds.wcc.stream('graph')
+            YIELD componentId
+            WITH componentId, count(*) as size
+            RETURN count(distinct componentId) as num_components,
+                   collect(size) as component_sizes
+        """
+        components_results = await self.query(components_query)
+        if components_results:
+            metrics.update({
+                "num_connected_components": components_results[0]["num_components"],
+                "sizes_of_connected_components": components_results[0]["component_sizes"]
+            })
+
+        # Additional metrics for clustering and path lengths
+        advanced_metrics_query = """
+            CALL gds.graph.project('temp_graph', '*', '*')
+            CALL gds.alpha.clustering.average.stream('temp_graph')
+            YIELD averageClusteringCoefficient
+            CALL gds.shortestPath.dijkstra.stream('temp_graph')
+            YIELD pathLength
+            RETURN averageClusteringCoefficient as avg_clustering,
+                   max(pathLength) as diameter,
+                   avg(pathLength) as avg_shortest_path_length
+        """
+        advanced_results = await self.query(advanced_metrics_query)
+        if advanced_results:
+            metrics.update(advanced_results[0])
+
+    return metrics
```
🧹 Nitpick comments (6)
cognee/infrastructure/databases/graph/networkx/adapter.py (4)
390-392: Consider providing a docstring.
Adding a short docstring clarifying the purpose of this method (and the effect of `include_optional`) would improve readability and serve as clear documentation for future maintainers.
404-415: Clarify handling of disconnected graphs when computing diameter and shortest paths.
Returning `None` for diameter and average shortest path length in the case of non-strongly-connected graphs might be confusing for downstream consumers. Consider making this behavior more explicit in documentation, or providing partial coverage (e.g., largest strongly connected component) if that better fits the use case.
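One way to make that behavior explicit, sketched here against plain NetworkX (the helper name and fallback policy are illustrative, not the PR's actual code):

```python
import networkx as nx

def diameter_of_largest_scc(graph: nx.DiGraph):
    """Diameter of the whole graph if strongly connected,
    otherwise of its largest strongly connected component."""
    if graph.number_of_nodes() == 0:
        return None
    if nx.is_strongly_connected(graph):
        return nx.diameter(graph)
    largest = max(nx.strongly_connected_components(graph), key=len)
    if len(largest) < 2:
        return None  # no component with measurable paths
    return nx.diameter(graph.subgraph(largest))

g = nx.DiGraph([(1, 2), (2, 1), (2, 3)])  # node 3 breaks strong connectivity
print(diameter_of_largest_scc(g))  # diameter of the {1, 2} component
```

Whichever policy is chosen, documenting it in the docstring keeps downstream consumers from misreading a `None` as an error.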
423-432: Performance consideration for large graph metrics.
Metrics like connected components, diameter, or average shortest path might be expensive for very large graphs. If performance is a concern, consider providing an alternate asynchronous or batched approach, or a mechanism to skip certain metrics.
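If that becomes a concern, one option is to keep the expensive computations off the event loop. A minimal sketch, assuming a NetworkX graph (the function and metric names here are illustrative, not the adapter's API):

```python
import asyncio
import networkx as nx

async def compute_metrics(graph: nx.DiGraph, include_optional: bool = False) -> dict:
    # Cheap metrics stay on the event loop.
    metrics = {
        "num_nodes": graph.number_of_nodes(),
        "num_edges": graph.number_of_edges(),
    }
    if include_optional:
        # Expensive metrics run in a worker thread so the loop stays responsive.
        metrics["avg_clustering"] = await asyncio.to_thread(
            nx.average_clustering, graph.to_undirected()
        )
    return metrics

g = nx.DiGraph([(1, 2), (2, 3), (3, 1)])
print(asyncio.run(compute_metrics(g, include_optional=True)))
```

`asyncio.to_thread` only helps for computations that release the GIL or are acceptable to serialize; for very large graphs a process pool or sampling-based estimates may fit better.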
434-449: Use consistent placeholders for unsupported metrics.
Currently, optional metrics are either calculated or returned as -1, but if the graph isn't strongly connected, they become `None`. For consistency, consider returning `None` in all "not applicable" scenarios or using -1 consistently, depending on the needs of your application logic.

cognee/infrastructure/databases/graph/graph_db_interface.py (1)
59-59: Provide a default value for the new parameter.
To maintain backward compatibility and keep method calls simpler, consider defaulting `include_optional=False` in the interface declaration. This aligns with the adapters' updated signatures.

```diff
-    async def get_graph_metrics(self, include_optional):
+    async def get_graph_metrics(self, include_optional=False):
         raise NotImplementedError
```

cognee/modules/data/methods/store_descriptive_metrics.py (1)
26-26: Add type hint for the `include_optional` parameter.
The function signature should include a type hint for better code maintainability and IDE support.

```diff
-async def store_descriptive_metrics(data_points: list[DataPoint], include_optional: bool):
+async def store_descriptive_metrics(data_points: list[DataPoint], include_optional: bool = False) -> list[DataPoint]:
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- cognee/api/v1/cognify/cognify_v2.py (1 hunks)
- cognee/infrastructure/databases/graph/graph_db_interface.py (1 hunks)
- cognee/infrastructure/databases/graph/neo4j_driver/adapter.py (1 hunks)
- cognee/infrastructure/databases/graph/networkx/adapter.py (2 hunks)
- cognee/modules/data/methods/store_descriptive_metrics.py (1 hunks)
- cognee/modules/data/models/GraphMetrics.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (5)
- GitHub Check: run_notebook_test / test
- GitHub Check: test
- GitHub Check: windows-latest
- GitHub Check: test
- GitHub Check: docker-compose-test
🔇 Additional comments (4)
cognee/infrastructure/databases/graph/networkx/adapter.py (1)
17-17: Import of numpy is appropriate.
Importing `numpy` here makes sense given the new graph metric calculations relying on array-like operations.

cognee/modules/data/models/GraphMetrics.py (2)
2-2: Importing func is valid.
Using `from sqlalchemy.sql import func` is consistent with your switch to database-side defaults.
27-28: Retaining server-side timestamps is fine.
Switching from Python-based timestamps to `func.now()` ensures that your database handles date/time generation. This can help maintain consistency and accuracy in multi-instance deployments.
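For illustration, a server-side timestamp column looks like this in SQLAlchemy. This is a generic sketch; the model and table names are made up and do not reflect the PR's actual `GraphMetrics` definition:

```python
from sqlalchemy import Column, DateTime, Integer
from sqlalchemy.orm import declarative_base
from sqlalchemy.sql import func

Base = declarative_base()

class MetricsRow(Base):  # hypothetical model for illustration only
    __tablename__ = "metrics_example"

    id = Column(Integer, primary_key=True)
    # The database, not Python, fills these in: emitted as DEFAULT now() in DDL.
    created_at = Column(DateTime(timezone=True), server_default=func.now())
    updated_at = Column(
        DateTime(timezone=True), server_default=func.now(), onupdate=func.now()
    )
```

Because the default is part of the table's DDL, every application instance writing to the same database gets identical timestamp semantics, regardless of each host's clock.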
168-168: LGTM!
The Task instantiation correctly includes the new parameter with a sensible default value.
Actionable comments posted: 3
🔭 Outside diff range comments (1)
cognee/infrastructure/databases/graph/neo4j_driver/adapter.py (1)
534-546: Missing integration with Neo4j GDS & unused parameter
The current implementation of `get_graph_metrics` returns hardcoded values and does not utilize the `include_optional` parameter. Moreover, our search for neo4j-graph-data-science dependencies and any GDS procedure calls in the repository returned no results, indicating that the required Neo4j Graph Data Science library isn't configured. This means that if you proceed with the suggested GDS-based metric calculations, you'll first need to add and properly configure the Neo4j GDS dependency.
- Ensure the repository includes the required Neo4j GDS (neo4j-graph-data-science) dependency and configuration.
- Either remove or start using the `include_optional` parameter.
- Update `get_graph_metrics` to compute real metrics, using either the configured Neo4j GDS procedures or an alternative approach that aligns with your dependencies.

🔗 Analysis chain
Implement actual metric calculations using Neo4j's graph algorithms.
The method currently returns hardcoded placeholder values (-1) for all metrics. Consider implementing the actual calculations using Neo4j's Graph Data Science (GDS) library, which provides efficient algorithms for computing these metrics.
Here's a suggested implementation:
```diff
 async def get_graph_metrics(self, include_optional=False):
-    return {
-        "num_nodes": -1,
-        "num_edges": -1,
-        "mean_degree": -1,
-        "edge_density": -1,
-        "num_connected_components": -1,
-        "sizes_of_connected_components": -1,
-        "num_selfloops": -1,
-        "diameter": -1,
-        "avg_shortest_path_length": -1,
-        "avg_clustering": -1,
-    }
+    # Basic metrics
+    basic_metrics_query = """
+        MATCH (n)
+        OPTIONAL MATCH (n)-[r]-()
+        WITH COUNT(DISTINCT n) as nodes, COUNT(r) as edges
+        RETURN nodes as num_nodes,
+               edges as num_edges,
+               toFloat(edges)/nodes as mean_degree,
+               toFloat(edges)/(nodes * (nodes-1)) as edge_density
+    """
+    basic_metrics = (await self.query(basic_metrics_query))[0]
+
+    # Component metrics using GDS
+    component_metrics_query = """
+        CALL gds.wcc.stats.estimate('*')
+        YIELD componentCount, componentDistribution
+        RETURN componentCount as num_connected_components,
+               componentDistribution as sizes_of_connected_components
+    """
+    component_metrics = (await self.query(component_metrics_query))[0]
+
+    # Self-loops
+    selfloops_query = """
+        MATCH (n)-[r]->(n)
+        RETURN COUNT(r) as num_selfloops
+    """
+    selfloops = (await self.query(selfloops_query))[0]
+
+    metrics = {
+        **basic_metrics,
+        **component_metrics,
+        **selfloops
+    }
+
+    if include_optional:
+        # Optional metrics using GDS
+        optional_metrics_query = """
+            CALL gds.alpha.allShortestPaths.stream('*')
+            YIELD sourceNodeId, targetNodeId, distance
+            WITH COLLECT(distance) as paths
+            RETURN
+                MAX(paths) as diameter,
+                AVG(paths) as avg_shortest_path_length
+        """
+        optional_metrics = (await self.query(optional_metrics_query))[0]
+
+        clustering_query = """
+            CALL gds.localClusteringCoefficient.stream('*')
+            YIELD nodeId, localClusteringCoefficient
+            RETURN AVG(localClusteringCoefficient) as avg_clustering
+        """
+        clustering = (await self.query(clustering_query))[0]
+
+        metrics.update({
+            **optional_metrics,
+            **clustering
+        })
+
+    return metrics
```

Note: Ensure that the Neo4j Graph Data Science library is installed and properly configured in your Neo4j instance.
Let's verify the Neo4j GDS library installation and configuration:
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```shell
#!/bin/bash
# Description: Check if Neo4j GDS library is installed and configured
# Expected: Find references to Neo4j GDS library in configuration or dependency files

# Search for GDS library in project dependencies
rg -l "neo4j-graph-data-science|gds" --type yaml --type toml --type json

# Search for GDS procedure calls in Neo4j queries
rg "CALL\s+gds\." --type python
```

Length of output: 139
Script:
```shell
#!/bin/bash
# Re-run searches for Neo4j GDS library references using explicit glob patterns

# 1. Search for Neo4j GDS dependency references in common dependency configuration files
echo "Searching for 'neo4j-graph-data-science' in dependency files:"
rg "neo4j-graph-data-science" --glob "**/{requirements.txt,setup.py,pyproject.toml,environment.yml}" || echo "No dependency reference found."

# 2. Search for GDS procedure calls in Python files by explicitly targeting .py files
echo "Searching for 'CALL gds.' in Python files:"
rg "CALL\s+gds\." --glob "*.py" || echo "No GDS procedure calls found."
```

Length of output: 514
🧹 Nitpick comments (3)
cognee/infrastructure/databases/graph/graph_db_interface.py (1)
59-59: Add type hints and default value for `include_optional` parameter.
The parameter should be typed and have a default value for better usability.

```diff
-    async def get_graph_metrics(self, include_optional):
+    async def get_graph_metrics(self, include_optional: bool = False):
```

cognee/modules/data/methods/store_descriptive_metrics.py (1)
26-26: Add type hints for `include_optional` parameter.
Add type hints to improve code maintainability and IDE support.

```diff
-async def store_descriptive_metrics(data_points: list[DataPoint], include_optional: bool):
+async def store_descriptive_metrics(data_points: list[DataPoint], include_optional: bool = False) -> list[DataPoint]:
```

cognee/infrastructure/databases/graph/networkx/adapter.py (1)
393-396: Add docstring and error handling for mean degree calculation.
The helper function should have a docstring and handle empty graphs consistently.

```diff
 def _get_mean_degree(graph):
+    """Calculate the mean degree of the graph.
+
+    Args:
+        graph: A NetworkX graph object
+
+    Returns:
+        float: Mean degree of the graph, or 0 for empty graphs
+    """
     degrees = [d for _, d in graph.degree()]
     return np.mean(degrees) if degrees else 0
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- cognee/api/v1/cognify/cognify_v2.py (1 hunks)
- cognee/infrastructure/databases/graph/graph_db_interface.py (1 hunks)
- cognee/infrastructure/databases/graph/neo4j_driver/adapter.py (1 hunks)
- cognee/infrastructure/databases/graph/networkx/adapter.py (2 hunks)
- cognee/modules/data/methods/store_descriptive_metrics.py (1 hunks)
- cognee/modules/data/models/GraphMetrics.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- cognee/modules/data/models/GraphMetrics.py
⏰ Context from checks skipped due to timeout of 90000ms (18)
- GitHub Check: run_notebook_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: run_dynamic_steps_example_test / test
- GitHub Check: run_simple_example_test / test
- GitHub Check: test
- GitHub Check: run_notebook_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: windows-latest
- GitHub Check: profiler
- GitHub Check: docker-compose-test
🔇 Additional comments (3)
cognee/modules/data/methods/store_descriptive_metrics.py (1)
29-29: Add error handling for graph metrics retrieval.
The function should handle potential errors when retrieving graph metrics to ensure graceful failure.
cognee/api/v1/cognify/cognify_v2.py (1)
168-168: Verify performance impact of always including optional metrics.
Setting `include_optional=True` by default means optional metrics (diameter, clustering, etc.) will always be calculated, which could impact performance for large graphs. Run the following script to check the graph size and estimate the impact:
cognee/infrastructure/databases/graph/neo4j_driver/adapter.py (1)
534-534: Verify implementation consistency across adapters.
The `include_optional` parameter has been added to align with other adapters, but its implementation might differ. Please ensure consistent behavior across all graph database adapters. Let's verify the implementation in other adapters:
✅ Verification successful
I'll run additional shell scripts to search for both synchronous and asynchronous implementations of `get_graph_metrics` (and to see how the `include_optional` parameter is used) across the codebase.

Graph Metrics Adapter Consistency Verified
Both the networkx and neo4j adapters now define an asynchronous `get_graph_metrics` method that accepts the `include_optional` parameter (defaulting to `False`), which aligns with the abstract definition in `graph_db_interface.py`. While the two adapters differ in the format of the returned metrics (a tuple in networkx vs. a dictionary in neo4j), the handling of the `include_optional` parameter is consistent across implementations.

🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```shell
#!/bin/bash
# Description: Check implementation of get_graph_metrics across adapters
# Expected: Find all implementations of get_graph_metrics method

# Search for get_graph_metrics implementations
ast-grep --pattern 'def get_graph_metrics(self, $_)'
```

Length of output: 54
Script:
```shell
#!/bin/bash
echo "Searching for synchronous definitions of get_graph_metrics..."
rg -n "def get_graph_metrics" -C 2 .

echo "Searching for asynchronous definitions of get_graph_metrics..."
rg -n "async def get_graph_metrics" -C 2 .

echo "Searching for get_graph_metrics implementations that mention include_optional..."
rg -n "get_graph_metrics.*include_optional" -C 2 .
```

Length of output: 5053
Actionable comments posted: 1
🧹 Nitpick comments (2)
cognee/infrastructure/databases/graph/networkx/adapter.py (2)
393-396: Consider handling empty graphs in mean degree calculation.
The current implementation returns 0 for empty graphs, which might be misleading. Consider returning None instead to distinguish between graphs with actual zero mean degree and empty graphs.

```diff
 def _get_mean_degree(graph):
     degrees = [d for _, d in graph.degree()]
-    return np.mean(degrees) if degrees else 0
+    return np.mean(degrees) if degrees else None
```
410-415: Improve average shortest path length calculation.
Similar to the diameter calculation, consider calculating the average shortest path length for the largest component in disconnected graphs.

```diff
 def _get_avg_shortest_path_length(graph):
     if nx.is_strongly_connected(graph):
         return nx.average_shortest_path_length(graph)
+    elif graph.number_of_nodes() > 0:
+        largest = max(nx.weakly_connected_components(graph), key=len)
+        subgraph = graph.subgraph(largest)
+        return nx.average_shortest_path_length(subgraph)
     else:
         return None
```
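One subtlety with this suggestion: `nx.average_shortest_path_length` still raises for a digraph that is weakly but not strongly connected, so the fallback needs to measure the component's undirected view (or restrict further to its largest strongly connected subcomponent). A sketch of the undirected-view variant, with an illustrative helper name:

```python
import networkx as nx

def avg_shortest_path_fallback(graph: nx.DiGraph):
    if graph.number_of_nodes() == 0:
        return None
    if nx.is_strongly_connected(graph):
        return nx.average_shortest_path_length(graph)
    largest = max(nx.weakly_connected_components(graph), key=len)
    subgraph = graph.subgraph(largest)
    # A weakly-but-not-strongly-connected digraph still raises in
    # average_shortest_path_length, so measure its undirected view instead.
    return nx.average_shortest_path_length(subgraph.to_undirected())

g = nx.DiGraph([(1, 2), (2, 3)])  # weakly, not strongly, connected
g.add_node(4)                     # isolated node, ignored by the fallback
print(avg_shortest_path_fallback(g))  # mean distance on undirected path 1-2-3
```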
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- cognee/infrastructure/databases/graph/networkx/adapter.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (8)
- GitHub Check: run_notebook_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: windows-latest
- GitHub Check: run_notebook_test / test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: docker-compose-test
🔇 Additional comments (5)
cognee/infrastructure/databases/graph/networkx/adapter.py (5)
17-17: LGTM!
The numpy import is appropriate for statistical calculations used in the graph metrics.
404-409: Improve diameter calculation for disconnected graphs.
The current implementation returns None for disconnected graphs. Consider calculating the diameter of the largest component instead.
423-432: LGTM!
The mandatory metrics calculation is comprehensive and includes essential graph properties.
434-449: Add error handling for optional metrics calculation.
The current implementation might fail if any of the optional metric calculations raise an exception.
416-422: 🛠️ Refactor suggestion
Consider using undirected graph for clustering coefficient.
The current implementation converts to a directed graph (`DiGraph`), but clustering coefficient is typically calculated on undirected graphs.

```diff
 def _get_avg_clustering(graph):
     try:
-        return nx.average_clustering(nx.DiGraph(graph))
+        return nx.average_clustering(graph.to_undirected())
     except Exception as e:
         logger.warning("Failed to calculate clustering coefficient: %s", e)
         return None
```

Likely invalid or redundant comment.
Description
DCO Affirmation
I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin
Summary by CodeRabbit
New Features
Refactor
Chore