-
Notifications
You must be signed in to change notification settings - Fork 963
Neo4j base node label #903
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
<!-- .github/pull_request_template.md --> ## Description <!-- Provide a clear description of the changes in this PR --> ## DCO Affirmation I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin. --------- Signed-off-by: Diego B Theuerkauf <[email protected]> Co-authored-by: vasilije <[email protected]> Co-authored-by: Igor Ilic <[email protected]> Co-authored-by: Vasilije <[email protected]> Co-authored-by: Igor Ilic <[email protected]> Co-authored-by: Hande <[email protected]> Co-authored-by: Matea Pesic <[email protected]> Co-authored-by: hajdul88 <[email protected]> Co-authored-by: Daniel Molnar <[email protected]> Co-authored-by: Diego Baptista Theuerkauf <[email protected]> Co-authored-by: Dmitrii Galkin <[email protected]> Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Co-authored-by: lxobr <[email protected]> Co-authored-by: github-actions[bot] <[email protected]>
<!-- .github/pull_request_template.md --> ## Description Update migration example for upcoming blog on main branch ## DCO Affirmation I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin.
<!-- .github/pull_request_template.md --> ## Description <!-- Provide a clear description of the changes in this PR --> ## DCO Affirmation I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin.
Please make sure all the checkboxes are checked:
|
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the WalkthroughThe changes introduce a Changes
Suggested labels
Poem
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🔭 Outside diff range comments (1)
cognee/infrastructure/databases/graph/neo4j_driver/adapter.py (1)
731-740:⚠️ Potential issueCritical bug: Invalid node label syntax and incorrect edge directions.
Both
remove_connection_to_predecessors_ofandremove_connection_to_successors_ofhave bugs that will cause runtime failures:
- Using
{id}as a label instead of{BASE_LABEL}- The edge directions appear reversed relative to the method names
- The "Not understanding" comments indicate confusion
Apply these fixes:
async def remove_connection_to_predecessors_of( self, node_ids: list[str], edge_label: str ) -> None: - # Not understanding query = f""" UNWIND $node_ids AS id - MATCH (node:`{id}`)-[r:{edge_label}]->(predecessor) + MATCH (predecessor)-[r:{edge_label}]->(node:`{BASE_LABEL}`) + WHERE node.id = id DELETE r; """ async def remove_connection_to_successors_of( self, node_ids: list[str], edge_label: str ) -> None: - # Not understanding query = f""" UNWIND $node_ids AS id - MATCH (node:`{id}`)<-[r:{edge_label}]-(successor) + MATCH (node:`{BASE_LABEL}`)-[r:{edge_label}]->(successor) + WHERE node.id = id DELETE r; """Also applies to: 760-769
🧹 Nitpick comments (2)
cognee/infrastructure/databases/graph/neo4j_driver/adapter.py (2)
52-58: Constructor improvements look good, but fix the trailing whitespace.The changes to disable notifications and create a uniqueness constraint are beneficial. However, there's a formatting issue to address.
- self.query( - ("CREATE CONSTRAINT IF NOT EXISTS FOR " - f"(n:`{BASE_LABEL}`) REQUIRE n.id IS UNIQUE;") + self.query( + ("CREATE CONSTRAINT IF NOT EXISTS FOR " + f"(n:`{BASE_LABEL}`) REQUIRE n.id IS UNIQUE;")🧰 Tools
🪛 Pylint (3.3.7)
[convention] 56-56: Trailing whitespace
(C0303)
541-545: Minor style improvement: Use dictionary literal instead of dict().results = await self.query( query, - dict( - node_id=node_id, - ), + {"node_id": node_id}, )🧰 Tools
🪛 Pylint (3.3.7)
[refactor] 542-544: Consider using '{"node_id": node_id}' instead of a call to 'dict'.
(R1735)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
cognee/infrastructure/databases/graph/neo4j_driver/adapter.py(20 hunks)cognee/infrastructure/databases/graph/neo4j_driver/neo4j_metrics_utils.py(2 hunks)
🧰 Additional context used
🪛 Pylint (3.3.7)
cognee/infrastructure/databases/graph/neo4j_driver/adapter.py
[convention] 56-56: Trailing whitespace
(C0303)
[refactor] 542-544: Consider using '{"node_id": node_id}' instead of a call to 'dict'.
(R1735)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Pensar Scan
🔇 Additional comments (4)
cognee/infrastructure/databases/graph/neo4j_driver/neo4j_metrics_utils.py (2)
59-66: Good optimization: Using stats instead of streaming for connected components.The change from streaming to
gds.wcc.statsis more efficient as it returns the component count directly without needing to stream and count distinct IDs.
183-190: Good optimization: Using stats instead of streaming for clustering coefficient.The change to
gds.localClusteringCoefficient.statsis more efficient as it returns the average clustering coefficient directly without streaming and averaging values manually.cognee/infrastructure/databases/graph/neo4j_driver/adapter.py (2)
32-32: Good practice: Introducing BASE_LABEL constant for consistent node labeling.Using a constant for the base node label ensures consistency across all queries and makes future changes easier.
113-119: Excellent consistency in applying BASE_LABEL throughout all queries.The systematic update of all node-related queries to use the
BASE_LABELconstant ensures a uniform node labeling scheme across the entire adapter. This will make graph operations more predictable and maintainable.Also applies to: 139-144, 171-178, 225-228, 250-253, 375-377, 411-412, 461-462, 534-536, 549-551, 580-582, 596-597, 642-644, 664-666, 685-687, 690-692
| query = f""" | ||
| MATCH (from_node: `{BASE_LABEL}`)-[:`{edge_label}`]->(to_node: `{BASE_LABEL}`) | ||
| WHERE from_node.id = $from_node_id AND to_node.id = $to_node_id | ||
| RETURN COUNT(relationship) > 0 AS edge_exists | ||
| """ | ||
|
|
||
| params = { | ||
| "from_node_id": str(from_node), | ||
| "to_node_id": str(to_node), | ||
| "edge_label": edge_label, | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Security concern: Edge label should be parameterized to prevent Cypher injection.
The edge label is now interpolated directly into the query string instead of being passed as a parameter. This could lead to Cypher injection vulnerabilities if the edge label comes from untrusted sources.
Consider parameterizing the edge label or validating it against a whitelist of allowed labels before interpolation.
🤖 Prompt for AI Agents
In cognee/infrastructure/databases/graph/neo4j_driver/adapter.py around lines
294 to 303, the edge label is directly interpolated into the Cypher query
string, which risks Cypher injection. To fix this, avoid direct string
interpolation of the edge label. Instead, validate the edge label against a
predefined whitelist of allowed labels before including it in the query, or
refactor the query to use parameterized inputs if supported by the Neo4j driver.
This ensures only safe, expected edge labels are used in the query.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 4
🔭 Outside diff range comments (2)
cognee/infrastructure/databases/graph/neo4j_driver/adapter.py (2)
732-735:⚠️ Potential issueFix incorrect label interpolation bug.
The query uses
{id}instead of{BASE_LABEL}in the MATCH clause, which will cause a syntax error.Apply this diff to fix the bug:
query = f""" UNWIND $node_ids AS id - MATCH (node:`{id}`)-[r:{edge_label}]->(predecessor) + MATCH (node:`{BASE_LABEL}`{{id: id}})-[r:{edge_label}]->(predecessor) DELETE r; """Note: This also has the same Cypher injection vulnerability with
edge_labelinterpolation that should be addressed.
761-764:⚠️ Potential issueFix incorrect label interpolation bug.
The query uses
{id}instead of{BASE_LABEL}in the MATCH clause, which will cause a syntax error.Apply this diff to fix the bug:
query = f""" UNWIND $node_ids AS id - MATCH (node:`{id}`)<-[r:{edge_label}]-(successor) + MATCH (node:`{BASE_LABEL}`{{id: id}})<-[r:{edge_label}]-(successor) DELETE r; """Note: This also has the same Cypher injection vulnerability with
edge_labelinterpolation that should be addressed.
♻️ Duplicate comments (1)
cognee/infrastructure/databases/graph/neo4j_driver/adapter.py (1)
294-298:⚠️ Potential issueSecurity concern: Edge label should be parameterized to prevent Cypher injection.
The edge label is interpolated directly into the query string, which could lead to Cypher injection vulnerabilities if the edge label comes from untrusted sources. This is consistent with the previous review comment on this issue.
Consider parameterizing the edge label or validating it against a whitelist of allowed labels before interpolation.
🧹 Nitpick comments (2)
cognee/infrastructure/databases/graph/neo4j_driver/adapter.py (2)
588-590: Remove unused parameter from query params.The
edge_labelparameter is included in the params dictionary but is not used in the query since it's directly interpolated into the query string.Apply this diff to remove the unused parameter:
results = await self.query( query, dict( node_id=node_id, - edge_label=edge_label, ), )🧰 Tools
🪛 Pylint (3.3.7)
[refactor] 588-591: Consider using '{"node_id": node_id, "edge_label": edge_label}' instead of a call to 'dict'.
(R1735)
542-544: Use dictionary literal for better readability.Consider using a dictionary literal instead of the
dict()call for better readability.Apply this diff:
results = await self.query( query, - dict( + { node_id=node_id, - ), + }, )🧰 Tools
🪛 Pylint (3.3.7)
[refactor] 542-544: Consider using '{"node_id": node_id}' instead of a call to 'dict'.
(R1735)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
cognee/infrastructure/databases/graph/neo4j_driver/adapter.py(20 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
cognee/infrastructure/databases/graph/neo4j_driver/adapter.py (1)
cognee/infrastructure/databases/graph/graph_db_interface.py (1)
query(153-163)
🪛 Pylint (3.3.7)
cognee/infrastructure/databases/graph/neo4j_driver/adapter.py
[convention] 56-56: Trailing whitespace
(C0303)
[refactor] 542-544: Consider using '{"node_id": node_id}' instead of a call to 'dict'.
(R1735)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Pensar Scan
🔇 Additional comments (2)
cognee/infrastructure/databases/graph/neo4j_driver/adapter.py (2)
32-32: LGTM: Good design for consistent node labeling.The
BASE_LABELconstant provides a clean way to ensure all nodes have a consistent base label across the Neo4j adapter.
113-114: LGTM: Consistent BASE_LABEL usage across node queries.The updates to use
BASE_LABELconsistently across all node-related queries improve the consistency and organization of the graph schema. The query logic remains sound while adding the beneficial base labeling.Also applies to: 139-139, 171-173, 225-227, 250-250, 269-271, 460-461, 642-643, 663-665, 685-686, 690-691
| notifications_min_severity="OFF", | ||
| ) | ||
| # Create contraint/index | ||
| self.query( | ||
| ("CREATE CONSTRAINT IF NOT EXISTS FOR " | ||
| f"(n:`{BASE_LABEL}`) REQUIRE n.id IS UNIQUE;") | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix async/await and trailing whitespace issues.
Two issues to address:
- The constraint creation uses a synchronous call to
self.query()butquery()is an async method - Trailing whitespace on line 56
Apply this diff to fix both issues:
- notifications_min_severity="OFF",
- )
- # Create contraint/index
- self.query(
- ("CREATE CONSTRAINT IF NOT EXISTS FOR "
- f"(n:`{BASE_LABEL}`) REQUIRE n.id IS UNIQUE;")
+ notifications_min_severity="OFF",
)
+ # Create constraint/index asynchronously in first query call
+ self._constraint_created = FalseThen add this method to handle lazy constraint creation:
async def _ensure_constraint(self):
if not self._constraint_created:
await self.query(
f"CREATE CONSTRAINT IF NOT EXISTS FOR (n:`{BASE_LABEL}`) REQUIRE n.id IS UNIQUE;"
)
self._constraint_created = TrueCall await self._ensure_constraint() at the beginning of methods that modify nodes.
🧰 Tools
🪛 Pylint (3.3.7)
[convention] 56-56: Trailing whitespace
(C0303)
🤖 Prompt for AI Agents
In cognee/infrastructure/databases/graph/neo4j_driver/adapter.py around lines 52
to 58, the call to self.query() is incorrectly synchronous while query() is an
async method, and there is trailing whitespace on line 56. To fix this, convert
the constraint creation call to use await with an async method, remove the
trailing whitespace, and implement the provided async _ensure_constraint()
method to lazily create the constraint. Then, call await
self._ensure_constraint() at the start of any methods that modify nodes to
ensure the constraint is created before modifications.
| query = f""" | ||
| MATCH (node: `{BASE_LABEL}`)<-[r:`{edge_label}`]-(predecessor) | ||
| WHERE node.id = $node_id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Security concern: Edge label should be parameterized to prevent Cypher injection.
The edge label is interpolated directly into the query string, creating a potential Cypher injection vulnerability.
Consider using parameterized queries or validating the edge label against a whitelist before interpolation.
🤖 Prompt for AI Agents
In cognee/infrastructure/databases/graph/neo4j_driver/adapter.py around lines
534 to 536, the edge_label is directly interpolated into the Cypher query
string, which risks Cypher injection. To fix this, avoid direct string
interpolation of edge_label; instead, validate edge_label against a predefined
whitelist of allowed labels before including it in the query. Alternatively,
refactor the query to use parameterized queries if supported by the Neo4j
driver, ensuring edge_label is safely handled without direct string insertion.
| query = f""" | ||
| MATCH (node: `{BASE_LABEL}`)-[r:`{edge_label}`]->(successor) | ||
| WHERE node.id = $node_id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Security concern: Edge label should be parameterized to prevent Cypher injection.
The edge label is interpolated directly into the query string, creating a potential Cypher injection vulnerability.
Consider using parameterized queries or validating the edge label against a whitelist before interpolation.
🤖 Prompt for AI Agents
In cognee/infrastructure/databases/graph/neo4j_driver/adapter.py around lines
580 to 582, the edge label is directly interpolated into the Cypher query
string, which poses a Cypher injection risk. To fix this, avoid direct string
interpolation of the edge label; instead, validate the edge label against a
predefined whitelist of allowed labels before including it in the query.
Alternatively, restructure the query to use parameterized inputs for the edge
label if supported by the Neo4j driver, ensuring no untrusted input is directly
embedded in the query string.
| MATCH (from_node :`{BASE_LABEL}`{{id: $from_node}}), | ||
| (to_node :`{BASE_LABEL}`{{id: $to_node}}) | ||
| MERGE (from_node)-[r:`{relationship_name}`]->(to_node) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Security concern: Relationship name should be parameterized to prevent Cypher injection.
The relationship name is interpolated directly into the query string, creating a potential Cypher injection vulnerability.
Consider parameterizing the relationship name or validating it against a whitelist before interpolation:
- MERGE (from_node)-[r:`{relationship_name}`]->(to_node)
+ CALL apoc.merge.relationship(from_node, $relationship_name, {}, $properties, to_node) YIELD rel AS rAnd update the params to include the relationship name:
params = {
"from_node": str(from_node),
"to_node": str(to_node),
- "relationship_name": relationship_name,
+ "relationship_name": relationship_name,
"properties": serialized_properties,
}Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In cognee/infrastructure/databases/graph/neo4j_driver/adapter.py around lines
375 to 377, the relationship name is directly interpolated into the Cypher query
string, which poses a Cypher injection risk. To fix this, validate the
relationship name against a predefined whitelist of allowed names before
including it in the query. Alternatively, if possible, parameterize the
relationship name safely. Update the query construction to use the validated or
parameterized relationship name and ensure the parameters dictionary includes
this validated value instead of direct string interpolation.
|
@tomasonjo @hajdul88 will help out on this. We have tests, but needs approval for contributors to run |
|
Which linter are you using? |
|
Hi @tomasonjo! First of all thanks for the contribution, it's nice to see you here. We changed the target of your PR to our dev branch since before release we have additional checks that we do. Regarding the testing, for now we encourage our contributors to test manually or run the workflow with their own secrets first, then once they are done with that and finished all the changes we run our whole test pipeline ourselves. If you want to run tests manually, unit tests are under: cognee/cognee/tests/unit/ Github workflows (if you want to run them with your own secrets) in: .github/workflows here we do iterative test sets defined in test_suites.yml Since you only changed neo4j related part, the two tests that are relevant are: cognee/cognee/tests/test_neo4j.py Otherwise I checked your PR and tested it locally, it looks good to me. Let me know when you finished the testing and then I can run GActions and merge it. |
We use ruff 0.11.9 |
|
The tests seem to pass |
|
That's great! Thanks, I will check and run the tests possibly also merge tomorrow. |
| notifications_min_severity="OFF", | ||
| ) | ||
| # Create contraint/index | ||
| self.query( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GA tests are passing, but I just found out that the indexing was actually never awaited, it's a minor thing. I can fix it after we merge.
Do you have some tests I can run?
DCO Affirmation
I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin.