-
Notifications
You must be signed in to change notification settings - Fork 966
Feat: Natural Language Retriever (text2cypher) #663
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feat: Natural Language Retriever (text2cypher) #663
Conversation
WalkthroughThis pull request introduces a new class, Changes
Possibly related issues
Possibly related PRs
Suggested reviewers
Poem
✨ Finishing Touches
🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (6)
cognee/tests/test_neo4j.py (1)
71-71: Refine log message
While “Extracted results” is acceptable, consider providing a more context-specific log message to help differentiate between different types of returned data in your logs.cognee/modules/retrieval/natural_language_retriever.py (5)
7-7: Remove unused import
generate_completionis imported but never used. It is best practice to remove unused imports to keep the module lean.- from cognee.modules.retrieval.utils.completion import generate_completion🧰 Tools
🪛 Ruff (0.8.2)
7-7:
cognee.modules.retrieval.utils.completion.generate_completionimported but unusedRemove unused import:
cognee.modules.retrieval.utils.completion.generate_completion(F401)
28-45: Check for empty schema
This method retrieves node and edge schemas from the graph. Consider handling scenarios where no relevant nodes or edges exist in the database to avoid unexpected results.
46-131: Externalize large system prompt
The embedded system prompt is quite extensive. For clarity, consider storing the prompt in a separate file or using a dedicated prompt manager to improve readability and maintainability.
132-169: Enhance retry handling
The multi-attempt logic here is good. As an optional improvement, consider adding a small delay or exponential backoff between attempts, especially if rate limits or transient errors could occur.
183-189: Return structure clarity
Returning context directly inget_completionis straightforward. If you plan to do any additional processing on the context, consider clarifying or documenting the structure of the returned data for maintainability.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
cognee/modules/retrieval/natural_language_retriever.py(1 hunks)cognee/modules/search/methods/search.py(2 hunks)cognee/modules/search/types/SearchType.py(1 hunks)cognee/tests/test_neo4j.py(1 hunks)
🧰 Additional context used
🧬 Code Definitions (1)
cognee/tests/test_neo4j.py (1)
cognee/api/v1/search/routers/get_search_router.py (1)
get_search_history(27-33)
🪛 Ruff (0.8.2)
cognee/modules/retrieval/natural_language_retriever.py
7-7: cognee.modules.retrieval.utils.completion.generate_completion imported but unused
Remove unused import: cognee.modules.retrieval.utils.completion.generate_completion
(F401)
🔇 Additional comments (5)
cognee/modules/search/types/SearchType.py (1)
13-13: Addition of new enum value
This addition cleanly expands the existing search types to support natural language. Good job!cognee/tests/test_neo4j.py (1)
77-77: Validate updated history length
Verifying that the length of the history is now exactly 10 could introduce fragility if the test environment or search sequence changes. Please confirm that 10 is not an arbitrary magic number and that each new search action is truly expected to increment the history up to 10.cognee/modules/search/methods/search.py (2)
16-16: Import usage
TheNaturalLanguageRetrieverimport looks good, and it is indeed used below.
71-71: Appropriate new search task
Great job mappingSearchType.NATURAL_LANGUAGEtoNaturalLanguageRetriever().get_completion. This cleanly integrates the new retriever into existing search logic.cognee/modules/retrieval/natural_language_retriever.py (1)
170-182: Context retrieval logic
This function nicely handles the graph engine check and raises aSearchTypeNotSupportedif using NetworkX. The error handling is clear and properly logged.
|
@Vasilije1990 @borisarzentar @hajdul88 added new PR (instead #646) - can you check it please and give a feedback? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
cognee/tests/test_neo4j.py (1)
73-75: Consider removing or converting print statements.
Using print in test code can clutter logs. Consider using logging instead for better maintainability.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
cognee/tests/test_neo4j.py(1 hunks)
🧰 Additional context used
🧬 Code Definitions (1)
cognee/tests/test_neo4j.py (2)
cognee/api/v1/search/search_v2.py (1)
search(10-31)cognee/api/v1/search/routers/get_search_router.py (2)
search(36-47)get_search_history(27-33)
🪛 GitHub Actions: test | neo4j
cognee/tests/test_neo4j.py
[error] 106-106: OSError: Missing required Neo4j credentials.
[warning] 1-1: Ontology file 'None' not found. Using fallback ontology at http://example.org/empty_ontology.
[warning] 1-1: Ontology file 'None' not found. Using fallback ontology at http://example.org/empty_ontology.
⏰ Context from checks skipped due to timeout of 90000ms (29)
- GitHub Check: run_multimedia_example_test / test
- GitHub Check: run_simple_example_test / test
- GitHub Check: Test on macos-15
- GitHub Check: run_simple_example_test / test
- GitHub Check: run_networkx_metrics_test / test
- GitHub Check: run_eval_framework_test / test
- GitHub Check: Test on macos-15
- GitHub Check: Test on macos-13
- GitHub Check: Test on macos-13
- GitHub Check: run_dynamic_steps_example_test / test
- GitHub Check: Test on macos-15
- GitHub Check: run_notebook_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: test
- GitHub Check: Test on ubuntu-22.04
- GitHub Check: test
- GitHub Check: chromadb test
- GitHub Check: Test on macos-13
- GitHub Check: run_notebook_test / test
- GitHub Check: Test on ubuntu-22.04
- GitHub Check: run_notebook_test / test
- GitHub Check: windows-latest
- GitHub Check: Test on ubuntu-22.04
- GitHub Check: test
- GitHub Check: run_simple_example_test
- GitHub Check: lint (ubuntu-latest, 3.10.x)
- GitHub Check: Test on ubuntu-22.04
- GitHub Check: docker-compose-test
- GitHub Check: Build Cognee Backend Docker App Image
🔇 Additional comments (3)
cognee/tests/test_neo4j.py (3)
72-72: Text update looks good.
Renaming "Extracted summaries are:" to "Extracted results are:" is more aligned with the broader result set.
76-81: Good addition of NATURAL_LANGUAGE query test.
This addresses the request to verify the new search type and ensures the results are not empty.
87-87:Details
✅ Verification successful
Confirm updated search history length.
Ensure the test environment indeed logs 10 entries, given the new searches introduced.
🏁 Script executed:
#!/bin/bash # Searching for references to the logging of search history in the codebase. rg 'search_history' -A 5 -B 5Length of output: 9744
Confirm updated search history length
The assertion in
cognee/tests/test_neo4j.pynow correctly expects 10 search history entries. Please verify that the test environment is logging exactly 10 entries for Neo4j after incorporating the new searches.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Nitpick comments (8)
cognee/modules/retrieval/natural_language_retriever.py (4)
17-25: Update docstring to match the current parametersThe docstring mentions "custom prompt paths" (plural) but the constructor only accepts a single prompt path parameter (
system_prompt_path).- """Initialize retriever with optional custom prompt paths.""" + """Initialize retriever with optional custom system prompt path and maximum retry attempts."""
67-71: Simplify log string formattingThe conditional formatting of the log message is a bit verbose. Consider using a simpler approach.
- logger.info( - f"Executing generated Cypher query (attempt {attempt + 1}): {cypher_query[:100]}..." - if len(cypher_query) > 100 - else cypher_query - ) + log_query = f"{cypher_query[:100]}..." if len(cypher_query) > 100 else cypher_query + logger.info(f"Executing generated Cypher query (attempt {attempt + 1}): {log_query}")
14-16: Enhance class docstring with more detailsThe class docstring is quite brief. Consider expanding it to provide more context about what natural language search entails and how this retriever works.
- """Retriever for handling natural language search""" + """ + Retriever for handling natural language search. + + This class converts natural language queries into Cypher queries that can be + executed against a graph database. It uses LLM to generate the Cypher queries + based on the graph schema and implements retry logic to handle failed attempts. + """
105-110: Consider adding more functionality to get_completionThe
get_completionmethod currently just returns the context. Consider enhancing it to provide more valuable completions based on the context, perhaps by using an LLM to generate a response.async def get_completion(self, query: str, context: Optional[Any] = None) -> Any: - """Returns a completion based on the query and context.""" + """ + Returns a completion based on the query and context. + + Currently, this method simply returns the context. Consider enhancing it + to generate more informative responses by using an LLM to process the + context and generate a human-readable answer. + """ if context is None: context = await self.get_context(query) return contextcognee/infrastructure/llm/prompts/natural_language_retriever_system.txt (4)
60-61: Fix grammar in the prompt textThere's a grammatical error in the sentence structure.
-This queries doesn't work. Do NOT use them: +These queries don't work. Do NOT use them:🧰 Tools
🪛 LanguageTool
[grammar] ~60-~60: Consider using the singular form after the singular determiner “This”.
Context: ...p properties):{{edge_schemas}}This queries doesn't work. Do NOT use them: `{{previ...(AGREEMENT_SENT_START)
1-66: Consider adding more examples to the promptThe prompt currently only contains one example (Get all nodes connected to John). More examples of varying complexity would help the LLM better understand the expected output format and handle a wider range of user queries.
Consider adding 2-3 more examples with different query patterns, such as:
- Finding paths between entities
- Filtering based on properties
- Aggregating data
For example:
Example 2: Find all TextDocuments created after Jan 1, 2023 that mention "AI" MATCH (d:TextDocument) WHERE d.created_at > '2023-01-01' AND d.text CONTAINS "AI" RETURN d.id, d.name, d.created_at LIMIT 10 Example 3: Find the shortest path between entities "Machine Learning" and "Natural Language Processing" MATCH p=shortestPath((a:Entity {name: 'Machine Learning'})-[*]-(b:Entity {name: 'Natural Language Processing'})) RETURN p🧰 Tools
🪛 LanguageTool
[grammar] ~60-~60: Consider using the singular form after the singular determiner “This”.
Context: ...p properties):{{edge_schemas}}This queries doesn't work. Do NOT use them: `{{previ...(AGREEMENT_SENT_START)
57-58: Enhance edge schema template placeholderThe current placeholder
{{edge_schemas}}will be replaced with a list of edge properties, but it doesn't provide context about how these edge schemas will be structured in the template.Consider adding a sample format comment to indicate how the edge schemas will be presented:
Edge schema (relationship properties): -`{{edge_schemas}}` +`{{edge_schemas}}` +<!-- Example format: +- HAS_ENTITY: Connects TextDocument to Entity +- BELONGS_TO: Connects Entity to EntityType +- CONTAINS: Connects TextDocument to DocumentChunk +-->This will make the template more maintainable and easier to understand when reviewing the code.
13-20: Add guidance for handling complex resultsThe "QUERY REQUIREMENTS" section focuses on query generation but doesn't provide guidance on handling complex result structures. This could be crucial for ensuring the generated queries return results in a useful format.
QUERY REQUIREMENTS: 1. Return ONLY the exact Cypher query with NO explanations, comments, or markdown 2. Generate syntactically correct Neo4j Cypher code (Neo4j 4.4+ compatible) 3. Be precise - match the exact property names and relationship types from the schema 4. Handle complex queries by breaking them into logical pattern matching parts 5. Use parameters (e.g., $name) for literal values when appropriate 6. Use appropriate data types for parameters (strings, numbers, booleans) +7. Structure results appropriately - use aliases and object construction for complex results +8. Consider the intended use of the results - prioritize properties that answer the user's question
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
cognee/infrastructure/llm/prompts/natural_language_retriever_system.txt(1 hunks)cognee/modules/retrieval/natural_language_retriever.py(1 hunks)
🧰 Additional context used
🪛 LanguageTool
cognee/infrastructure/llm/prompts/natural_language_retriever_system.txt
[grammar] ~60-~60: Consider using the singular form after the singular determiner “This”.
Context: ...p properties): {{edge_schemas}} This queries doesn't work. Do NOT use them: `{{previ...
(AGREEMENT_SENT_START)
⏰ Context from checks skipped due to timeout of 90000ms (6)
- GitHub Check: Test on macos-13
- GitHub Check: Test on macos-13
- GitHub Check: Test on macos-13
- GitHub Check: windows-latest
- GitHub Check: run_simple_example_test
- GitHub Check: docker-compose-test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🔭 Outside diff range comments (1)
cognee/modules/retrieval/natural_language_retriever.py (1)
1-118: 🛠️ Refactor suggestionMissing documentation about system prompt format requirements
The class depends heavily on a system prompt template but doesn't provide any documentation about its required format or content, making it harder for other developers to work with this class.
Add class-level documentation explaining the expected format and variables for the system prompt:
class NaturalLanguageRetriever(BaseRetriever): - """Retriever for handling natural language search""" + """Retriever for handling natural language search + + This class converts natural language queries into Cypher queries for Neo4j graph databases. + It requires a system prompt template that instructs the LLM how to generate valid Cypher. + + The system prompt should include the following variables: + - `{{edge_schemas}}`: Will be replaced with schema information from the graph database + - `{{previous_attempts}}`: Will be replaced with information about previous query attempts + + The prompt should instruct the LLM to: + 1. Analyze the natural language query + 2. Understand the available schema + 3. Generate a valid Cypher query that answers the question + 4. Handle errors from previous attempts (if any) + 5. Return only the Cypher query as a string without any explanations + """
♻️ Duplicate comments (1)
cognee/modules/retrieval/natural_language_retriever.py (1)
13-25: 🛠️ Refactor suggestionClass initializer needs to handle the prompt path more robustly
The class initializes with a hard-coded prompt path that doesn't include the full directory structure. This might lead to issues finding the prompt file.
- def __init__( - self, - system_prompt_path: str = "natural_language_retriever_system.txt", - max_attempts: int = 3, - ): - """Initialize retriever with optional custom prompt paths.""" - self.system_prompt_path = system_prompt_path + def __init__( + self, + system_prompt_path: str = "llm/prompts/natural_language_retriever_system.txt", + max_attempts: int = 3, + ): + """Initialize retriever with optional custom prompt paths. + + Args: + system_prompt_path: Path to the system prompt template file relative to prompts directory + max_attempts: Maximum number of attempts to generate and execute a valid Cypher query + """ + self.system_prompt_path = system_prompt_path
🧹 Nitpick comments (5)
cognee/modules/retrieval/natural_language_retriever.py (5)
26-43: Graph schema query could be more comprehensiveThe node schema query returns labels and properties, but the edge schema query only returns property keys without relationship types, which may limit the ability to generate accurate Cypher queries.
async def _get_graph_schema(self, graph_engine) -> tuple: """Retrieve the node and edge schemas from the graph database.""" node_schemas = await graph_engine.query( """ MATCH (n) UNWIND keys(n) AS prop RETURN DISTINCT labels(n) AS NodeLabels, collect(DISTINCT prop) AS Properties; """ ) edge_schemas = await graph_engine.query( """ - MATCH ()-[r]->() - UNWIND keys(r) AS key - RETURN DISTINCT key; + MATCH ()-[r]->() + UNWIND keys(r) AS prop + RETURN DISTINCT type(r) AS RelationshipType, collect(DISTINCT prop) AS Properties; """ ) return node_schemas, edge_schemas
61-98: Improve error handling and result validation in query executionThe method catches all exceptions without differentiating between query generation errors and execution errors. Additionally, there's no validation of the context format or structure.
async def _execute_cypher_query(self, query: str, graph_engine: GraphDBInterface) -> Any: """Execute the natural language query against Neo4j with multiple attempts.""" node_schemas, edge_schemas = await self._get_graph_schema(graph_engine) previous_attempts = "" cypher_query = "" for attempt in range(self.max_attempts): logger.info(f"Starting attempt {attempt + 1}/{self.max_attempts} for query generation") try: - cypher_query = await self._generate_cypher_query( - query, edge_schemas, previous_attempts - ) + try: + cypher_query = await self._generate_cypher_query( + query, edge_schemas, previous_attempts + ) + except Exception as e: + logger.error(f"Error in query generation (attempt {attempt + 1}): {str(e)}") + previous_attempts += f"Query generation error: {str(e)}\n" + continue logger.info( f"Executing generated Cypher query (attempt {attempt + 1}): {cypher_query[:100]}..." if len(cypher_query) > 100 else cypher_query ) - context = await graph_engine.query(cypher_query) + try: + context = await graph_engine.query(cypher_query) + except Exception as e: + logger.error(f"Error executing Cypher query (attempt {attempt + 1}): {str(e)}") + previous_attempts += f"Query: {cypher_query} -> Execution error: {str(e)}\n" + continue if context: result_count = len(context) if isinstance(context, list) else 1 logger.info( f"Successfully executed query (attempt {attempt + 1}): returned {result_count} result(s)" ) return context + elif context == []: + logger.info(f"Query executed successfully but returned empty results (attempt {attempt + 1})") + previous_attempts += f"Query: {cypher_query} -> Result: Empty list\n" + else: + logger.warning(f"Query returned unexpected result type: {type(context)} (attempt {attempt + 1})") + previous_attempts += f"Query: {cypher_query} -> Result: Unexpected type {type(context)}\n" - previous_attempts += f"Query: {cypher_query} -> Result: None\n" + previous_attempts += f"Query: {cypher_query} -> Result: No data returned\n" except Exception as e: previous_attempts += f"Query: {cypher_query if 'cypher_query' in locals() else 'Not generated'} -> Executed with error: {e}\n" logger.error(f"Error executing query: {str(e)}") logger.warning( f"Failed to get results after {self.max_attempts} attempts for query: '{query[:50]}...'" ) return []
99-111: Add more specific exception handling in get_contextThe method catches all exceptions without distinguishing between different types of errors that could occur.
async def get_context(self, query: str) -> Optional[Any]: """Retrieves relevant context using a natural language query converted to Cypher.""" try: graph_engine = await get_graph_engine() if isinstance(graph_engine, (NetworkXAdapter)): raise SearchTypeNotSupported("Natural language search type not supported.") return await self._execute_cypher_query(query, graph_engine) + except SearchTypeNotSupported as e: + logger.error("Natural language search not supported with current graph engine: %s", str(e)) + raise + except ValueError as e: + logger.error("Invalid query or parameters: %s", str(e)) + raise except Exception as e: - logger.error("Failed to execute natural language search retrieval: %s", str(e)) - raise e + logger.error("Unexpected error during natural language search retrieval: %s", str(e)) + raise
112-118: Consider enhancing the get_completion methodThe current implementation simply returns the context without any processing. Since this is a natural language retriever, consider formatting the results or generating a textual response based on the retrieved context.
async def get_completion(self, query: str, context: Optional[Any] = None) -> Any: """Returns a completion based on the query and context.""" if context is None: context = await self.get_context(query) - return context + # If no context was found, return an informative empty result + if not context: + return {"result": [], "message": "No data found matching your query."} + + # Return structured response with the original query and results + return { + "query": query, + "result": context, + "message": f"Found {len(context) if isinstance(context, list) else 1} result(s) matching your query." + }
19-21: Consider adding parameter type annotationsThe constructor parameters lack type annotations, which would improve code readability and IDE support.
def __init__( self, - system_prompt_path: str = "natural_language_retriever_system.txt", - max_attempts: int = 3, + system_prompt_path: str = "natural_language_retriever_system.txt", + max_attempts: int = 3, ):
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
cognee/modules/retrieval/natural_language_retriever.py(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (27)
- GitHub Check: Test on macos-15
- GitHub Check: Test on macos-15
- GitHub Check: run_simple_example_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: run_multimedia_example_test / test
- GitHub Check: Test on macos-13
- GitHub Check: Test on macos-13
- GitHub Check: run_notebook_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: run_eval_framework_test / test
- GitHub Check: Test on macos-15
- GitHub Check: run_networkx_metrics_test / test
- GitHub Check: run_dynamic_steps_example_test / test
- GitHub Check: test
- GitHub Check: windows-latest
- GitHub Check: Test on ubuntu-22.04
- GitHub Check: Test on ubuntu-22.04
- GitHub Check: run_simple_example_test / test
- GitHub Check: Test on macos-13
- GitHub Check: chromadb test
- GitHub Check: test
- GitHub Check: Test on ubuntu-22.04
- GitHub Check: test
- GitHub Check: run_simple_example_test
- GitHub Check: Build Cognee Backend Docker App Image
- GitHub Check: docker-compose-test
🔇 Additional comments (1)
cognee/modules/retrieval/natural_language_retriever.py (1)
1-11: Imports are well-organized and appropriateThe imports are logically grouped and include all necessary components for the natural language retriever functionality.
|
@borisarzentar I have fixed by your comments (tests + moved prompt to llm/prompts) |
Great! |
Description
DCO Affirmation
I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin
I added one example "get all connected nodes to entity"