Feat/cog 1365 unify retrievers #572
Conversation
Walkthrough
This pull request reorganizes the retrieval module by updating several import paths and introducing new retriever classes that extend an abstract base class. New utility functions and comparison scripts have been added, and key functions in task completions now support saving context to a file. Import paths throughout the codebase have been updated to reflect the new module structure, ensuring consistency and separating utility functions into a dedicated submodule.
Actionable comments posted: 8
🔭 Outside diff range comments (1)
cognee/modules/retrieval/utils/completion.py (1)
7-24: 🛠️ Refactor suggestion: Add error handling and enhance documentation.
The function needs error handling for LLM client failures and more detailed documentation.
```diff
 async def generate_completion(
     query: str,
     context: str,
     user_prompt_path: str,
     system_prompt_path: str,
 ) -> str:
-    """Generates a completion using LLM with given context and prompts."""
+    """Generates a completion using LLM with given context and prompts.
+
+    Args:
+        query: The query string to generate completion for.
+        context: The context information to use for completion.
+        user_prompt_path: Path to the user prompt template.
+        system_prompt_path: Path to the system prompt template.
+
+    Returns:
+        Generated completion response.
+
+    Raises:
+        LLMClientError: If there's an error during LLM completion generation.
+        FileNotFoundError: If prompt template files are not found.
+    """
-    args = {"question": query, "context": context}
+    args = {"query": query, "context": context}
     user_prompt = render_prompt(user_prompt_path, args)
     system_prompt = read_query_prompt(system_prompt_path)
     llm_client = get_llm_client()
-    return await llm_client.acreate_structured_output(
-        text_input=user_prompt,
-        system_prompt=system_prompt,
-        response_model=str,
-    )
+    try:
+        return await llm_client.acreate_structured_output(
+            text_input=user_prompt,
+            system_prompt=system_prompt,
+            response_model=str,
+        )
+    except Exception as e:
+        raise LLMClientError(f"Failed to generate completion: {str(e)}") from e
```
🧹 Nitpick comments (15)
cognee/modules/retrieval/base_retriever.py (2)
5-7: Enhance class docstring with more details.
The docstring could be more descriptive about the class's purpose, expected behavior of implementations, and usage examples.
- """Base class for all retrieval operations.""" + """Base class for all retrieval operations. + + This abstract class defines the interface that all retriever implementations must follow. + It provides a unified way to retrieve context and generate completions across different + retrieval strategies. + + Example: + class MyRetriever(BaseRetriever): + async def get_context(self, query: str) -> dict: + # Implementation details + pass + + async def get_completion(self, query: str, context: Optional[dict] = None) -> str: + # Implementation details + pass + """
8-16: Consider using more specific return types.
The use of `Any` as return type is too permissive and might lead to type-safety issues. Consider using more specific types based on the expected return values from implementations.

```diff
-    async def get_context(self, query: str) -> Any:
+    async def get_context(self, query: str) -> Union[str, dict, list]:
         """Retrieves context based on the query."""
         pass

     @abstractmethod
-    async def get_completion(self, query: str, context: Optional[Any] = None) -> Any:
+    async def get_completion(self, query: str, context: Optional[Union[str, dict, list]] = None) -> str:
         """Generates a response using the query and optional context."""
         pass
```

Also, enhance the method docstrings with parameter descriptions and return value information.
- """Retrieves context based on the query.""" + """Retrieves context based on the query. + + Args: + query: The search query string. + + Returns: + Context information relevant to the query. + """ - """Generates a response using the query and optional context.""" + """Generates a response using the query and optional context. + + Args: + query: The query string to generate completion for. + context: Optional context to use for completion generation. + If None, context will be retrieved using get_context. + + Returns: + Generated completion response. + """cognee/tasks/completion/graph_query_summary_completion.py (1)
25-29: Enhance function docstring with parameter details.
The docstring should describe all parameters, return value, and any exceptions that might be raised.
- """Executes a query on the graph database and retrieves a summarized completion with optional context saving.""" + """Executes a query on the graph database and retrieves a summarized completion. + + Args: + query: The search query string. + save_context_path: Optional path to save the context as a JSON file. + If None, context will not be saved. + + Returns: + List containing the summarized completion response. + + Raises: + NoRelevantDataFound: If no relevant data is found for the query. + """cognee/tasks/completion/query_completion.py (1)
9-26: Enhance docstring with context saving details.
The docstring should better describe the context saving functionality and its behavior.
""" Executes a query against a vector database and computes a relevant response using an LLM. Parameters: - query (str): The query string to compute. - - save_context_path (str): The path to save the context. + - save_context_path (str, optional): Path where the context will be saved as a JSON file. + If provided, creates directories as needed. + Context saving failures are logged but don't affect execution. Returns: - list: Answer to the query. Notes: - Limits the search to the top 1 matching chunk for simplicity and relevance. - Ensure that the vector database and LLM client are properly configured and accessible. - The response model used for the LLM output is expected to be a string. + - Context is saved with UTF-8 encoding and human-readable formatting. + - Context saving is best-effort; failures won't interrupt the main flow. """cognee/modules/retrieval/graph_summary_completion_retriever.py (1)
11-25: Consider moving default paths to constants.
The default paths for prompts could be moved to class-level or module-level constants to improve maintainability and reusability.
```diff
+# At the top of the file
+DEFAULT_USER_PROMPT = "graph_context_for_question.txt"
+DEFAULT_SYSTEM_PROMPT = "answer_simple_question.txt"
+DEFAULT_SUMMARIZE_PROMPT = "summarize_search_results.txt"
+
 def __init__(
     self,
-    user_prompt_path: str = "graph_context_for_question.txt",
-    system_prompt_path: str = "answer_simple_question.txt",
-    summarize_prompt_path: str = "summarize_search_results.txt",
+    user_prompt_path: str = DEFAULT_USER_PROMPT,
+    system_prompt_path: str = DEFAULT_SYSTEM_PROMPT,
+    summarize_prompt_path: str = DEFAULT_SUMMARIZE_PROMPT,
     top_k: int = 5,
 ):
```
cognee/modules/retrieval/completion_retriever.py (1)
21-28: Make chunk limit configurable.
The hard-coded limit of 1 chunk should be configurable through the constructor.
```diff
+def __init__(
+    self,
+    user_prompt_path: str = "context_for_question.txt",
+    system_prompt_path: str = "answer_simple_question.txt",
+    chunk_limit: int = 1,
+):
+    """Initialize retriever with optional custom prompt paths and chunk limit."""
+    self.user_prompt_path = user_prompt_path
+    self.system_prompt_path = system_prompt_path
+    self.chunk_limit = chunk_limit

 async def get_context(self, query: str) -> Any:
     """Retrieves relevant document chunks as context."""
     vector_engine = get_vector_engine()
-    found_chunks = await vector_engine.search("DocumentChunk_text", query, limit=1)
+    found_chunks = await vector_engine.search("DocumentChunk_text", query, limit=self.chunk_limit)
```
cognee/modules/retrieval/code_retriever.py (1)
16-52: Consider optimizing the get_context implementation.
The current implementation has a few areas for improvement:
- The vector index collections are recomputed on every call
- The dictionary building loop could be simplified
Consider these optimizations:
```diff
 async def get_context(self, query: str) -> Any:
     """Find relevant code files based on the query."""
-    subclasses = get_all_subclasses(DataPoint)
-    vector_index_collections = []
-
-    for subclass in subclasses:
-        index_fields = subclass.model_fields["metadata"].default.get("index_fields", [])
-        for field_name in index_fields:
-            vector_index_collections.append(f"{subclass.__name__}_{field_name}")
+    # Cache this computation in __init__ since it's static
+    if not hasattr(self, '_vector_index_collections'):
+        self._vector_index_collections = [
+            f"{subclass.__name__}_{field_name}"
+            for subclass in get_all_subclasses(DataPoint)
+            for field_name in subclass.model_fields["metadata"].default.get("index_fields", [])
+        ]

     found_triplets = await brute_force_triplet_search(
         query,
         top_k=self.top_k,
-        collections=vector_index_collections or None,
+        collections=self._vector_index_collections or None,
         properties_to_project=["id", "file_path", "source_code"],
     )

-    retrieved_files = {}
-    for triplet in found_triplets:
-        if triplet.node1.attributes["source_code"]:
-            retrieved_files[triplet.node1.attributes["file_path"]] = triplet.node1.attributes[
-                "source_code"
-            ]
-        if triplet.node2.attributes["source_code"]:
-            retrieved_files[triplet.node2.attributes["file_path"]] = triplet.node2.attributes[
-                "source_code"
-            ]
+    # Use dictionary comprehension for cleaner code
+    retrieved_files = {
+        node.attributes["file_path"]: node.attributes["source_code"]
+        for triplet in found_triplets
+        for node in (triplet.node1, triplet.node2)
+        if node.attributes.get("source_code")
+    }
```
cognee/modules/retrieval/insights_retriever.py (1)
12-16: Consider making the score threshold configurable.
The score threshold of 0.5 is hardcoded in the get_context method. Consider making it configurable through the constructor.
Add score_threshold to the constructor:
```diff
-    def __init__(self, exploration_levels: int = 1, top_k: int = 5):
+    def __init__(self, exploration_levels: int = 1, top_k: int = 5, score_threshold: float = 0.5):
         """Initialize retriever with exploration levels and search parameters."""
         self.exploration_levels = exploration_levels
         self.top_k = top_k
+        self.score_threshold = score_threshold
```
cognee/modules/retrieval/utils/run_search_comparisons.py (3)
1-6: Consider standardizing on logging rather than print statements.
The file imports the `logging` module but also uses `print` statements for output. Pick one mechanism for consistency.
76-105: Enable type annotations for return values.
While this function is marked as asynchronous, it doesn't strictly need async capabilities unless you anticipate asynchronous operations. If it's required to be async, consider applying more specific type hints (e.g., `-> Dict[str, Any]`), and verify if there's truly a need for an async function here.
120-153: Unify context comparison with result comparison logic.
You have separate code paths for comparing completion results vs. context. There is some duplicated logic in printing differences, which could be simplified or modularized for consistency and maintenance.
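One way to unify the two paths is to extract the comparison itself into a pure function that returns a structured report, leaving printing (or logging) to a single caller. A sketch under that assumption, where `diff_report` is a hypothetical helper name, not code from the PR:

```python
import json
from typing import Any, Dict


def diff_report(old: Any, new: Any) -> Dict[str, Any]:
    """Compare two result payloads and return a structured report.

    Both the completion-result and context comparison paths could call this,
    so the "print differences" logic lives in exactly one place.
    """
    if isinstance(old, list) and isinstance(new, list):
        if len(old) != len(new):
            return {"match": False, "reason": f"length mismatch: {len(old)} vs {len(new)}"}
        mismatches = [i for i, (o, n) in enumerate(zip(old, new)) if o != n]
        return {"match": not mismatches, "mismatched_indices": mismatches}
    # Non-list payloads (e.g. context dicts): compare serialized forms so
    # key ordering does not produce false negatives.
    return {
        "match": json.dumps(old, sort_keys=True) == json.dumps(new, sort_keys=True),
        "old": json.dumps(old, sort_keys=True),
        "new": json.dumps(new, sort_keys=True),
    }
```

A caller would then do `report = diff_report(old_results, new_results)` and log or pretty-print the report once, regardless of which retriever produced it.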
cognee/modules/retrieval/graph_completion_retriever.py (4)
1-9: Ensure all imports are necessary and used.
You import `Any` and `Optional` from `typing`, but only `Any` appears used. Confirm that all imports remain relevant.
11-24: Document constructor parameters.
The `__init__` method sets `user_prompt_path`, `system_prompt_path`, and `top_k`. Consider adding docstrings that describe their usage and default values for improved clarity.
36-54: Ensure performance for large graphs.
`brute_force_triplet_search` may be expensive for large datasets. Consider batching or indexing strategies if performance becomes an issue at scale.
55-71: Provide more context for the returned data structure.
Both `get_context` and `get_completion` return generic `Any` types. Without a clearer structure, callers may need to rely on trial-and-error. Consider refining or documenting the returned format to streamline usage.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (24)
- cognee/modules/retrieval/__init__.py (1 hunks)
- cognee/modules/retrieval/base_retriever.py (1 hunks)
- cognee/modules/retrieval/chunks_retriever.py (1 hunks)
- cognee/modules/retrieval/code_retriever.py (1 hunks)
- cognee/modules/retrieval/completion_retriever.py (1 hunks)
- cognee/modules/retrieval/graph_completion_retriever.py (1 hunks)
- cognee/modules/retrieval/graph_summary_completion_retriever.py (1 hunks)
- cognee/modules/retrieval/insights_retriever.py (1 hunks)
- cognee/modules/retrieval/summaries_retriever.py (1 hunks)
- cognee/modules/retrieval/utils/completion.py (1 hunks)
- cognee/modules/retrieval/utils/run_search_comparisons.py (1 hunks)
- cognee/modules/search/methods/search.py (1 hunks)
- cognee/tasks/completion/graph_query_completion.py (3 hunks)
- cognee/tasks/completion/graph_query_summary_completion.py (1 hunks)
- cognee/tasks/completion/query_completion.py (2 hunks)
- cognee/tests/test_neo4j.py (1 hunks)
- cognee/tests/test_pgvector.py (1 hunks)
- cognee/tests/test_qdrant.py (1 hunks)
- cognee/tests/test_weaviate.py (1 hunks)
- cognee/tests/unit/modules/retriever/test_description_to_codepart_search.py (5 hunks)
- evals/eval_swe_bench.py (1 hunks)
- evals/qa_context_provider_utils.py (1 hunks)
- examples/python/graphiti_example.py (1 hunks)
- notebooks/cognee_graphiti_demo.ipynb (1 hunks)
✅ Files skipped from review due to trivial changes (11)
- cognee/modules/retrieval/__init__.py
- evals/eval_swe_bench.py
- notebooks/cognee_graphiti_demo.ipynb
- cognee/tests/test_neo4j.py
- cognee/tests/test_weaviate.py
- cognee/tests/test_qdrant.py
- examples/python/graphiti_example.py
- cognee/modules/search/methods/search.py
- evals/qa_context_provider_utils.py
- cognee/tests/test_pgvector.py
- cognee/tests/unit/modules/retriever/test_description_to_codepart_search.py
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: docker-compose-test
🔇 Additional comments (7)
cognee/modules/retrieval/summaries_retriever.py (1)
20-24: Consider enhancing get_completion or removing it.
Similar to ChunksRetriever, the get_completion method doesn't add value beyond get_context.
cognee/modules/retrieval/completion_retriever.py (1)
12-20: Consider moving default paths to constants.
Similar to GraphSummaryCompletionRetriever, the default paths could be moved to constants.
cognee/modules/retrieval/code_retriever.py (1)
53-57: Consider enhancing the get_completion implementation.
The current implementation simply returns the context without any processing. Consider adding validation or transformation of the context data.
Please verify if this simple pass-through behavior aligns with the requirements of the unified retriever framework.
cognee/modules/retrieval/insights_retriever.py (1)
62-67: Consider enhancing the get_completion implementation.
Similar to CodeRetriever, this implementation simply returns the context. Consider whether additional processing or validation is needed.
Please verify if this simple pass-through behavior aligns with the requirements of the unified retriever framework.
cognee/modules/retrieval/utils/run_search_comparisons.py (2)
107-118: Recommend error handling for retriever calls.
If either `old_implementation` or `new_retriever.get_completion` raises an exception, the function breaks without a fallback. Consider handling exceptions gracefully and returning an informative response to avoid halting the entire comparison process.
155-221: Validate configuration usage before invocation.
In `main`, you rely on keys in `comparisons` (e.g., `"completion_context"`, `"graph_completion_context"`) and similarly for `BASIC_RETRIEVERS`. If any key is missing or misnamed, the loop silently ignores it. Consider validating available keys or providing feedback when an expected comparison key is missing; this can help catch config issues earlier.
cognee/modules/retrieval/graph_completion_retriever.py (1)
25-35: Check for null or unexpected attributes when resolving edges.
When converting node attributes to strings, you rely on `"text"` or `"name"`. If these attributes are missing or malformed, you might end up with `None`. Consider providing a fallback string or implementing validation to avoid potential runtime issues.
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Actionable comments posted: 1
🧹 Nitpick comments (1)
cognee/tasks/completion/query_completion.py (1)
15-16: Enhance parameter description in docstring.
The description of `save_context_path` could be more informative by specifying the expected format and purpose.

```diff
     - query (str): The query string to compute.
-    - save_context_path (str): The path to save the context.
+    - save_context_path (str, optional): File path where the context will be saved in JSON format.
+      If provided, the context used for completion will be saved to this location.
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- cognee/tasks/completion/query_completion.py (2 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
cognee/tasks/completion/query_completion.py
42-42: Undefined name logger
(F821)
🪛 GitHub Actions: lint | ruff lint
cognee/tasks/completion/query_completion.py
[error] 42-42: Ruff: Undefined name logger.
⏰ Context from checks skipped due to timeout of 90000ms (27)
- GitHub Check: run_dynamic_steps_example_test / test
- GitHub Check: run_multimedia_example_test / test
- GitHub Check: run_eval_framework_test / test
- GitHub Check: run_networkx_metrics_test / test
- GitHub Check: test
- GitHub Check: run_notebook_test / test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: run_notebook_test / test
- GitHub Check: test
- GitHub Check: run_notebook_test / test
- GitHub Check: Test on macos-15
- GitHub Check: Test on macos-15
- GitHub Check: run_notebook_test / test
- GitHub Check: run_simple_example_test / test
- GitHub Check: Test on macos-13
- GitHub Check: Test on macos-13
- GitHub Check: Test on macos-15
- GitHub Check: Test on macos-13
- GitHub Check: Test on ubuntu-22.04
- GitHub Check: test
- GitHub Check: test
- GitHub Check: Test on ubuntu-22.04
- GitHub Check: test
- GitHub Check: windows-latest
- GitHub Check: docker-compose-test
- GitHub Check: Build Cognee Backend Docker App Image
🔇 Additional comments (2)
cognee/tasks/completion/query_completion.py (2)
34-43: Implementation follows previous review suggestions.
The context saving implementation now includes proper error handling and JSON formatting as suggested in the previous review.
🧰 Tools
🪛 Ruff (0.8.2)
42-42: Undefined name `logger` (F821)
🪛 GitHub Actions: lint | ruff lint
[error] 42-42: Ruff: Undefined name `logger`.
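The best-effort saving pattern described in these comments, with a module-level `logger` (which would also resolve the F821 reported above), might be sketched as follows; `save_context` is a hypothetical helper name, not the PR's actual function:

```python
import json
import logging
import os
from typing import Any

logger = logging.getLogger(__name__)


def save_context(context: Any, save_context_path: str) -> bool:
    """Write the retrieval context as pretty-printed UTF-8 JSON.

    Best-effort: creates parent directories as needed, logs failures,
    and never raises, so the main completion flow is not interrupted.
    Returns True on success, False otherwise.
    """
    try:
        directory = os.path.dirname(save_context_path)
        if directory:
            os.makedirs(directory, exist_ok=True)
        with open(save_context_path, "w", encoding="utf-8") as f:
            json.dump(context, f, indent=2, ensure_ascii=False)
        return True
    except (OSError, TypeError) as e:
        logger.warning("Failed to save context to %s: %s", save_context_path, e)
        return False
```

Catching `TypeError` covers non-JSON-serializable payloads, while `OSError` covers path and permission problems; both are logged rather than propagated.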
44-47: LGTM! Improved code readability.
The context is now stored in a variable before use, which improves readability and avoids duplicate access to the payload.
Actionable comments posted: 2
🧹 Nitpick comments (5)
cognee/modules/retrieval/utils/run_search_comparisons.py (5)
1-1: Expand the TODO comment with more details.
The TODO comment should include more information about why this file should be deleted after merging and what COG-1403 refers to. This would help future developers understand the purpose and lifecycle of this file.
```diff
-# TODO: delete after merging COG-1365, see COG-1403
+# TODO: delete after merging COG-1365 (Unify Retrievers PR), see COG-1403
+# This is a temporary utility for comparing old and new retriever implementations
+# to ensure they produce equivalent results before fully switching to the new system.
```
77-105: Convert print statements to logging for better control of output levels.
Using the logging module instead of print statements would provide better control over the verbosity and output destination. It would also make it easier to silence the output when needed.
```diff
 async def compare_completion(old_results: list, new_results: list) -> Dict:
     """Compare two lists of completion results and print differences."""
     lengths_match = len(old_results) == len(new_results)
     matches = []
     if lengths_match:
-        print("Results length match")
+        logging.info("Results length match")
         matches = [old == new for old, new in zip(old_results, new_results)]
         if all(matches):
-            print("All entries match")
+            logging.info("All entries match")
         else:
-            print(f"Differences found at indices: {[i for i, m in enumerate(matches) if not m]}")
-            print("\nDifferences:")
+            logging.warning(f"Differences found at indices: {[i for i, m in enumerate(matches) if not m]}")
+            logging.warning("Differences:")
             for i, (old, new) in enumerate(zip(old_results, new_results)):
                 if old != new:
-                    print(f"\nIndex {i}:")
-                    print("Old:", json.dumps(old, indent=2))
-                    print("New:", json.dumps(new, indent=2))
+                    logging.warning(f"\nIndex {i}:")
+                    logging.warning(f"Old: {json.dumps(old, indent=2)}")
+                    logging.warning(f"New: {json.dumps(new, indent=2)}")
     else:
-        print(f"Results length mismatch: {len(old_results)} vs {len(new_results)}")
-        print("\nOld results:", json.dumps(old_results, indent=2))
-        print("\nNew results:", json.dumps(new_results, indent=2))
+        logging.warning(f"Results length mismatch: {len(old_results)} vs {len(new_results)}")
+        logging.warning(f"\nOld results: {json.dumps(old_results, indent=2)}")
+        logging.warning(f"\nNew results: {json.dumps(new_results, indent=2)}")
```
156-160: Document why retriever is set to False in setup steps.
The code sets `setup_steps["retriever"] = False` without explaining why. Adding a comment would clarify the intention behind this override.

```diff
 async def main(query: str, comparisons: Dict[str, bool], setup_steps: Dict[str, bool]):
     """Run comparison tests for selected retrievers with the given setup configuration."""
     # Ensure retriever is always False in setup steps
+    # We disable the retriever setup step because we're testing our own retriever implementations
+    # and don't want the setup to initialize the default retrievers
     setup_steps["retriever"] = False
     await setup_main(setup_steps)
```
196-198: Consider more verbose logging for development and debugging.
The script sets the logging level to ERROR, which might suppress useful information during development. Consider making the logging level configurable.
```diff
 if __name__ == "__main__":
-    logging.basicConfig(level=logging.ERROR)
+    import argparse
+
+    parser = argparse.ArgumentParser(description="Compare old and new retriever implementations")
+    parser.add_argument("--log-level", default="ERROR",
+                        choices=["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"],
+                        help="Set the logging level")
+    args = parser.parse_args()
+
+    logging.basicConfig(level=getattr(logging, args.log_level))
+    logging.info(f"Starting comparison with log level: {args.log_level}")
```
156-194: Consider implementing parallel execution for faster comparisons.
When running multiple comparisons, they could be executed in parallel using `asyncio.gather()` to improve performance.

```diff
 async def main(query: str, comparisons: Dict[str, bool], setup_steps: Dict[str, bool]):
     """Run comparison tests for selected retrievers with the given setup configuration."""
     # Ensure retriever is always False in setup steps
     setup_steps["retriever"] = False
     await setup_main(setup_steps)

+    context_tasks = []
     # Compare contexts for completion-based retrievers
     for retriever in COMPLETION_RETRIEVERS:
         context_key = f"{retriever['type']}_context"
         if comparisons.get(context_key, False):
-            await compare_completion_context(
+            task = compare_completion_context(
                 query=query,
                 old_implementation=retriever["old_implementation"],
                 retriever_class=retriever["retriever_class"],
                 name=retriever["name"],
                 retriever_type=retriever["type"],
             )
+            context_tasks.append(task)
+
+    if context_tasks:
+        logging.info("Running context comparisons...")
+        await asyncio.gather(*context_tasks)

+    completion_tasks = []
     # Run completion comparisons
     for retriever in COMPLETION_RETRIEVERS:
         if comparisons.get(retriever["type"], False):
-            await compare_retriever(
+            task = compare_retriever(
                 query=query,
                 old_implementation=retriever["old_implementation"],
                 new_retriever=retriever["retriever_class"](),
                 name=retriever["name"],
             )
+            completion_tasks.append(task)

+    basic_tasks = []
     # Run basic retriever comparisons
     for retriever in BASIC_RETRIEVERS:
         retriever_type = retriever["name"].split()[0]
         if comparisons.get(retriever_type, False):
-            await compare_retriever(
+            task = compare_retriever(
                 query=query,
                 old_implementation=retriever["old_implementation"],
                 new_retriever=retriever["retriever_class"](),
                 name=retriever["name"],
             )
+            basic_tasks.append(task)
+
+    if completion_tasks or basic_tasks:
+        logging.info("Running retriever comparisons...")
+        all_tasks = completion_tasks + basic_tasks
+        await asyncio.gather(*all_tasks)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- cognee/modules/retrieval/utils/code_graph_retrieval.py (1 hunks)
- cognee/modules/retrieval/utils/run_search_comparisons.py (1 hunks)
- cognee/tasks/chunks/query_chunks.py (1 hunks)
- cognee/tasks/completion/graph_query_completion.py (3 hunks)
- cognee/tasks/completion/graph_query_summary_completion.py (2 hunks)
- cognee/tasks/completion/query_completion.py (2 hunks)
- cognee/tasks/graph/query_graph_connections.py (1 hunks)
- cognee/tasks/summarization/query_summaries.py (1 hunks)
✅ Files skipped from review due to trivial changes (4)
- cognee/modules/retrieval/utils/code_graph_retrieval.py
- cognee/tasks/chunks/query_chunks.py
- cognee/tasks/graph/query_graph_connections.py
- cognee/tasks/summarization/query_summaries.py
🚧 Files skipped from review as they are similar to previous changes (2)
- cognee/tasks/completion/graph_query_summary_completion.py
- cognee/tasks/completion/query_completion.py
⏰ Context from checks skipped due to timeout of 90000ms (29)
- GitHub Check: test
- GitHub Check: run_notebook_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: run_eval_framework_test / test
- GitHub Check: run_dynamic_steps_example_test / test
- GitHub Check: run_simple_example_test / test
- GitHub Check: Test on macos-15
- GitHub Check: Test on macos-15
- GitHub Check: Test on macos-13
- GitHub Check: run_networkx_metrics_test / test
- GitHub Check: Test on macos-13
- GitHub Check: run_notebook_test / test
- GitHub Check: run_multimedia_example_test / test
- GitHub Check: test
- GitHub Check: windows-latest
- GitHub Check: Test on macos-13
- GitHub Check: Test on ubuntu-22.04
- GitHub Check: Test on ubuntu-22.04
- GitHub Check: test
- GitHub Check: lint (ubuntu-latest, 3.11.x)
- GitHub Check: Test on ubuntu-22.04
- GitHub Check: test
- GitHub Check: lint (ubuntu-latest, 3.10.x)
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: Build Cognee Backend Docker App Image
- GitHub Check: docker-compose-test
🔇 Additional comments (5)
cognee/tasks/completion/graph_query_completion.py (4)
33-35: Function signature updated with optional context saving parameter.
The addition of the `save_context_path` parameter aligns with the PR objective to facilitate context comparisons between old and new retriever implementations. The docstring has been appropriately updated.
Also applies to: 42-42
71-80: Proper error handling implemented for context saving.
The implementation includes error handling for file operations and JSON serialization as recommended in previous review comments. Good use of a try-except block with appropriate error logging.
1-4: Details
❓ Verification inconclusive
The import changes and TODO comment need clarification.
The TODO comment mentions deleting this file after merging COG-1365, but this PR seems to be implementing that ticket. Consider clarifying the timeline for deletion or updating the comment to reflect the current state of the implementation.
Also applies to: 15-16
🏁 Script executed:
#!/bin/bash # Search for references to COG-1403 to better understand the deletion plan rg "COG-1403" --type pyLength of output: 831
Action Required: Clarify the Deletion Timeline for COG-1365
The current TODO comment in
`cognee/tasks/completion/graph_query_completion.py` (and similar comments in other files) states:

```
# TODO: delete after merging COG-1365, see COG-1403
```

As this PR appears to be implementing COG-1365, it's unclear whether the deletion is scheduled immediately post-merge or whether the file should remain until later steps are executed. Please update the comment to clearly reflect the intended timeline. For example:
- If the file is no longer needed and should be removed as part of COG-1365, update the comment to indicate that and consider removing the now-unused code.
- If the deletion is intended for a subsequent release or further merging steps, specify that timeline to avoid confusion.
This clarification is needed for the import section (lines 1-4) as well as the similar TODO at lines 15-16 in the same file. Additionally, note that similar comments appear in several other files (e.g.,
`cognee/tasks/chunks/query_chunks.py`, `cognee/tasks/completion/query_completion.py`, etc.), so a consistent update across the repository may be beneficial.
83-83: Details
✅ Verification successful
Verify use of previously computed context.
The function now uses the previously computed context variable instead of calling the context resolver again, which is more efficient. However, this change should ensure there's no difference in behavior compared to the previous implementation.
🏁 Script executed:
#!/bin/bash # Look for other usages of this function to ensure compatibility with the updated signature rg "graph_query_completion\(" --type pyLength of output: 224
Use of Previously Computed Context Verified
The updated implementation correctly reuses the previously computed `context` variable in `graph_query_completion` without altering behavior. The search confirms that both `graph_query_completion` and its caller in `graph_query_summary_completion` are consistently using the same signature, ensuring that the change does not impact compatibility or behavior.
cognee/modules/retrieval/utils/run_search_comparisons.py (1)
1-222: 🛠️ Refactor suggestion: Move this utility to the tests directory for automated testing.
As suggested in the past review comment, this comparison utility should be moved to the tests directory so it can be run as part of the CI/CD pipeline. This would help ensure that the new retrievers maintain compatibility with the old implementations.
As suggested in the past review comment, this comparison utility should be moved to the tests directory so it can be run as part of the CI/CD pipeline. This would help ensure that the new retrievers maintain compatibility with the old implementations.
Consider moving this file to a location like `cognee/tests/retrieval/` and adapting it to run as a proper test case. This way, it can be executed automatically during PR validation.
# Conflicts: # cognee/modules/retrieval/utils/code_graph_retrieval.py
Actionable comments posted: 4
🧹 Nitpick comments (5)
cognee/modules/retrieval/code_retriever.py (5)
28-32: Consider making collection names configurable.

The collection names are hardcoded in the initializer. It would be better to make these configurable parameters that could be set during initialization or through configuration.

```diff
-    def __init__(self, limit: int = 3):
+    def __init__(
+        self,
+        limit: int = 3,
+        file_name_collections: List[str] = None,
+        classes_and_functions_collections: List[str] = None
+    ):
         """Initialize retriever with search parameters."""
         self.limit = limit
-        self.file_name_collections = ["CodeFile_name"]
-        self.classes_and_functions_collections = [
-            "ClassDefinition_source_code",
-            "FunctionDefinition_source_code",
-        ]
+        self.file_name_collections = file_name_collections or ["CodeFile_name"]
+        self.classes_and_functions_collections = classes_and_functions_collections or [
+            "ClassDefinition_source_code",
+            "FunctionDefinition_source_code",
+        ]
```
52-56: Improve error message for database initialization.

The error message for database initialization could be more specific to help with debugging.

```diff
-            raise RuntimeError("Database initialization error in code_graph_retriever, ") from e
+            raise RuntimeError(f"Database initialization error in CodeRetriever: {str(e)}") from e
```
108-116: Extract file path extraction logic to a helper method.

The logic for extracting file paths from graph triplets could be moved to a separate helper method for better readability and maintainability.

```diff
+    async def _extract_paths_from_triplets(self, relevant_triplets):
+        """Extract file paths from graph triplets."""
+        paths = set()
+        for sublist in relevant_triplets:
+            for tpl in sublist:
+                if isinstance(tpl, tuple) and len(tpl) >= 3:
+                    if "file_path" in tpl[0]:
+                        paths.add(tpl[0]["file_path"])
+                    if "file_path" in tpl[2]:
+                        paths.add(tpl[2]["file_path"])
+        return paths
+
     async def get_context(self, query: str) -> Any:
         # ... existing code ...
         relevant_triplets = await asyncio.gather(
             *[graph_engine.get_connections(node_id) for node_id in code_ids + file_ids]
         )
-        paths = set()
-        for sublist in relevant_triplets:
-            for tpl in sublist:
-                if isinstance(tpl, tuple) and len(tpl) >= 3:
-                    if "file_path" in tpl[0]:
-                        paths.add(tpl[0]["file_path"])
-                    if "file_path" in tpl[2]:
-                        paths.add(tpl[2]["file_path"])
+        paths = await self._extract_paths_from_triplets(relevant_triplets)
```
133-140: Handle empty paths case explicitly.

If no paths are found, the function will return an empty list. It might be better to handle this case explicitly with a descriptive message.

```diff
+        if not paths:
+            logger.warning("No file paths found for the given query")
+            return []
+
         return [
             {
                 "name": file_path,
                 "description": file_path,
                 "content": retrieved_files[file_path],
             }
             for file_path in paths
         ]
```
142-146: Add more specific return type annotation.

The return type annotation for get_completion and get_context is simply "Any", which doesn't provide clear information about the expected structure.

```diff
-    async def get_context(self, query: str) -> Any:
+    async def get_context(self, query: str) -> List[Dict[str, str]]:
     # ...
-    async def get_completion(self, query: str, context: Optional[Any] = None) -> Any:
+    async def get_completion(self, query: str, context: Optional[List[Dict[str, str]]] = None) -> List[Dict[str, str]]:
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- cognee/modules/retrieval/code_retriever.py (1 hunks)
- cognee/tests/unit/modules/retriever/test_description_to_codepart_search.py (5 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- cognee/tests/unit/modules/retriever/test_description_to_codepart_search.py
⏰ Context from checks skipped due to timeout of 90000ms (15)
- GitHub Check: run_notebook_test / test
- GitHub Check: Test on macos-15
- GitHub Check: run_notebook_test / test
- GitHub Check: Test on macos-15
- GitHub Check: Test on macos-13
- GitHub Check: Test on macos-15
- GitHub Check: Test on macos-13
- GitHub Check: windows-latest
- GitHub Check: Test on ubuntu-22.04
- GitHub Check: test
- GitHub Check: run_notebook_test / test
- GitHub Check: Test on macos-13
- GitHub Check: Test on ubuntu-22.04
- GitHub Check: Test on ubuntu-22.04
- GitHub Check: docker-compose-test
🔇 Additional comments (3)
cognee/modules/retrieval/code_retriever.py (3)
1-14: Comprehensive imports, good job.

The imports cover all necessary components including typing, async operations, database connections, and LLM client integration.

16-33: Good class structure and initialization.

The CodeRetriever class properly extends BaseRetriever with a clear nested Pydantic model for structured output. The initialization with a configurable limit parameter is well-designed.

34-46: Well-structured LLM interaction with good error handling.

The _process_query method properly handles the LLM integration with appropriate error handling. The code raises meaningful exceptions when the LLM fails to provide structured output.
```python
        relevant_triplets = await asyncio.gather(
            *[graph_engine.get_connections(node_id) for node_id in code_ids + file_ids]
        )
```
💡 Verification agent
🧩 Analysis chain
Consider limiting concurrent graph engine calls.
Multiple graph engine calls are made in parallel without limiting their number, which could potentially overwhelm the database if there are many IDs.
🏁 Script executed:
```shell
#!/bin/bash
# Check if there's any rate limiting or connection pooling for graph engine
rg -A 2 -B 2 "get_graph_engine" --type py
```

Length of output: 25960
Action Required: Limit Concurrent Graph Engine Calls
In cognee/modules/retrieval/code_retriever.py (lines 104-106), the code currently dispatches all calls to graph_engine.get_connections(node_id) concurrently via asyncio.gather. This pattern may lead to overwhelming the database when many IDs are processed simultaneously. Please consider adding concurrency controls—such as using an async semaphore, batching the IDs, or another throttling mechanism—to limit the number of parallel requests. If the graph engine already handles rate limiting or connection pooling internally, please document that behavior to clarify the design intent.
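One hedged way to bound the concurrency, sketched below rather than taken from the project (the `graph_engine`/`get_connections` names mirror the call site, and the cap of 10 is an assumed value to tune against the backend's connection pool), is an `asyncio.Semaphore` wrapper:

```python
import asyncio

MAX_CONCURRENT = 10  # assumed cap; tune to the graph backend's connection pool size

async def gather_connections(graph_engine, node_ids, limit=MAX_CONCURRENT):
    """Fetch connections for many node IDs with a bounded number of parallel calls."""
    semaphore = asyncio.Semaphore(limit)

    async def fetch(node_id):
        async with semaphore:  # at most `limit` get_connections calls in flight
            return await graph_engine.get_connections(node_id)

    return await asyncio.gather(*(fetch(node_id) for node_id in node_ids))
```

Batching the IDs into fixed-size chunks would achieve a similar effect; the semaphore variant keeps all results in the original order that `asyncio.gather` already guarantees.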
```python
                async def read_file(fp):
                    try:
                        async with aiofiles.open(fp, "r", encoding="utf-8") as f:
                            retrieved_files[fp] = await f.read()
                    except Exception as e:
                        print(f"Error reading {fp}: {e}")
                        retrieved_files[fp] = ""

                read_tasks.append(read_file(file_path))
```
🛠️ Refactor suggestion
Define read_file outside the loop to avoid recreation.
The read_file function is redefined in every loop iteration. Define it once outside the loop to improve performance and readability.
```diff
 # Before the loop
+async def read_file(fp):
+    try:
+        async with aiofiles.open(fp, "r", encoding="utf-8") as f:
+            retrieved_files[fp] = await f.read()
+    except Exception as e:
+        logger.error(f"Error reading {fp}: {e}")
+        retrieved_files[fp] = ""
+
 retrieved_files = {}
 read_tasks = []
 for file_path in paths:
-    async def read_file(fp):
-        try:
-            async with aiofiles.open(fp, "r", encoding="utf-8") as f:
-                retrieved_files[fp] = await f.read()
-        except Exception as e:
-            print(f"Error reading {fp}: {e}")
-            retrieved_files[fp] = ""
-
     read_tasks.append(read_file(file_path))
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```suggestion
# Before the loop
async def read_file(fp):
    try:
        async with aiofiles.open(fp, "r", encoding="utf-8") as f:
            retrieved_files[fp] = await f.read()
    except Exception as e:
        logger.error(f"Error reading {fp}: {e}")
        retrieved_files[fp] = ""

retrieved_files = {}
read_tasks = []
for file_path in paths:
    read_tasks.append(read_file(file_path))
```
```python
        retrieved_files = {}
        read_tasks = []
        for file_path in paths:

            async def read_file(fp):
                try:
                    async with aiofiles.open(fp, "r", encoding="utf-8") as f:
                        retrieved_files[fp] = await f.read()
                except Exception as e:
                    print(f"Error reading {fp}: {e}")
                    retrieved_files[fp] = ""

            read_tasks.append(read_file(file_path))

        await asyncio.gather(*read_tasks)
```
🛠️ Refactor suggestion
Replace print with proper logging for file read errors.
Using print for error reporting is not ideal for production code. Consider using a proper logging mechanism.
```diff
+import logging
+
+# At the top of the file, after imports
+logger = logging.getLogger(__name__)
+
 # Then in the read_file function
 async def read_file(fp):
     try:
         async with aiofiles.open(fp, "r", encoding="utf-8") as f:
             retrieved_files[fp] = await f.read()
     except Exception as e:
-        print(f"Error reading {fp}: {e}")
+        logger.error(f"Error reading {fp}: {e}")
         retrieved_files[fp] = ""
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```suggestion
import logging

logger = logging.getLogger(__name__)
# ... other imports

# The code segment starting at line 117
retrieved_files = {}
read_tasks = []
for file_path in paths:

    async def read_file(fp):
        try:
            async with aiofiles.open(fp, "r", encoding="utf-8") as f:
                retrieved_files[fp] = await f.read()
        except Exception as e:
            logger.error(f"Error reading {fp}: {e}")
            retrieved_files[fp] = ""

    read_tasks.append(read_file(file_path))

await asyncio.gather(*read_tasks)
```
```python
            for collection in self.file_name_collections:
                search_results_file = await vector_engine.search(
                    collection, query, limit=self.limit
                )
                for res in search_results_file:
                    similar_filenames.append(
                        {"id": res.id, "score": res.score, "payload": res.payload}
                    )

            for collection in self.classes_and_functions_collections:
                search_results_code = await vector_engine.search(
                    collection, query, limit=self.limit
                )
                for res in search_results_code:
                    similar_codepieces.append(
                        {"id": res.id, "score": res.score, "payload": res.payload}
                    )
        else:
            for collection in self.file_name_collections:
                for file_from_query in files_and_codeparts.filenames:
                    search_results_file = await vector_engine.search(
                        collection, file_from_query, limit=self.limit
                    )
                    for res in search_results_file:
                        similar_filenames.append(
                            {"id": res.id, "score": res.score, "payload": res.payload}
                        )

            for collection in self.classes_and_functions_collections:
                search_results_code = await vector_engine.search(
                    collection, files_and_codeparts.sourcecode, limit=self.limit
                )
                for res in search_results_code:
                    similar_codepieces.append(
                        {"id": res.id, "score": res.score, "payload": res.payload}
                    )
```
🛠️ Refactor suggestion
Refactor to reduce code duplication.
There's significant code duplication between the two branches (when filenames/sourcecode are available vs. when they're not). Consider refactoring to avoid this duplication.
```diff
-        if not files_and_codeparts.filenames or not files_and_codeparts.sourcecode:
-            for collection in self.file_name_collections:
-                search_results_file = await vector_engine.search(
-                    collection, query, limit=self.limit
-                )
-                for res in search_results_file:
-                    similar_filenames.append(
-                        {"id": res.id, "score": res.score, "payload": res.payload}
-                    )
-
-            for collection in self.classes_and_functions_collections:
-                search_results_code = await vector_engine.search(
-                    collection, query, limit=self.limit
-                )
-                for res in search_results_code:
-                    similar_codepieces.append(
-                        {"id": res.id, "score": res.score, "payload": res.payload}
-                    )
-        else:
-            for collection in self.file_name_collections:
-                for file_from_query in files_and_codeparts.filenames:
-                    search_results_file = await vector_engine.search(
-                        collection, file_from_query, limit=self.limit
-                    )
-                    for res in search_results_file:
-                        similar_filenames.append(
-                            {"id": res.id, "score": res.score, "payload": res.payload}
-                        )
-
-            for collection in self.classes_and_functions_collections:
-                search_results_code = await vector_engine.search(
-                    collection, files_and_codeparts.sourcecode, limit=self.limit
-                )
-                for res in search_results_code:
-                    similar_codepieces.append(
-                        {"id": res.id, "score": res.score, "payload": res.payload}
-                    )
+        # Search for filenames
+        for collection in self.file_name_collections:
+            if files_and_codeparts.filenames:
+                for file_from_query in files_and_codeparts.filenames:
+                    search_results_file = await vector_engine.search(
+                        collection, file_from_query, limit=self.limit
+                    )
+                    for res in search_results_file:
+                        similar_filenames.append(
+                            {"id": res.id, "score": res.score, "payload": res.payload}
+                        )
+            else:
+                search_results_file = await vector_engine.search(
+                    collection, query, limit=self.limit
+                )
+                for res in search_results_file:
+                    similar_filenames.append(
+                        {"id": res.id, "score": res.score, "payload": res.payload}
+                    )
+
+        # Search for code pieces
+        for collection in self.classes_and_functions_collections:
+            search_query = files_and_codeparts.sourcecode if files_and_codeparts.sourcecode else query
+            search_results_code = await vector_engine.search(
+                collection, search_query, limit=self.limit
+            )
+            for res in search_results_code:
+                similar_codepieces.append(
+                    {"id": res.id, "score": res.score, "payload": res.payload}
+                )
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```suggestion
        # Search for filenames
        for collection in self.file_name_collections:
            if files_and_codeparts.filenames:
                for file_from_query in files_and_codeparts.filenames:
                    search_results_file = await vector_engine.search(
                        collection, file_from_query, limit=self.limit
                    )
                    for res in search_results_file:
                        similar_filenames.append(
                            {"id": res.id, "score": res.score, "payload": res.payload}
                        )
            else:
                search_results_file = await vector_engine.search(
                    collection, query, limit=self.limit
                )
                for res in search_results_file:
                    similar_filenames.append(
                        {"id": res.id, "score": res.score, "payload": res.payload}
                    )

        # Search for code pieces
        for collection in self.classes_and_functions_collections:
            search_query = files_and_codeparts.sourcecode if files_and_codeparts.sourcecode else query
            search_results_code = await vector_engine.search(
                collection, search_query, limit=self.limit
            )
            for res in search_results_code:
                similar_codepieces.append(
                    {"id": res.id, "score": res.score, "payload": res.payload}
                )
```
| @@ -0,0 +1,146 @@ | |||
| from typing import Any, Optional, List, Dict | |||
Code retriever shouldn't be changed as we developed a new one, so be sure when resolving conflicts that this doesn't change that.
```python
            for file_path in paths
        ]

    async def get_completion(self, query: str, context: Optional[Any] = None) -> Any:
```
I don't know if completion makes sense here. @borisarzentar ?
```python
    async def get_context(self, query: str) -> Any:
        """Retrieves relevant document chunks as context."""
        vector_engine = get_vector_engine()
        found_chunks = await vector_engine.search("DocumentChunk_text", query, limit=1)
```
I know the limit is hardcoded now, so it's just a theoretical question: shouldn't we expose these to the user? Maybe not just asking
```python
        else:
            vector_engine = get_vector_engine()
            results = await asyncio.gather(
                vector_engine.search("Entity_name", query_text=query, limit=self.top_k),
```
I know it's not your code, but it would be nice to make collection names dynamic for insights too. As it stands, they fail if the LLM doesn't extract anything.
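A sketch of what parameterizing those collections could look like. The class below shows only the constructor and search shape; `InsightsRetrieverSketch` and the flattening logic are illustrative assumptions, not the project's real `InsightsRetriever`:

```python
import asyncio
from typing import List, Optional

class InsightsRetrieverSketch:
    """Collection names become constructor arguments instead of hardcoded literals."""

    def __init__(self, top_k: int = 5, entity_collections: Optional[List[str]] = None):
        self.top_k = top_k
        # Fall back to the current default when the caller doesn't override it.
        self.entity_collections = entity_collections or ["Entity_name"]

    async def search_entities(self, vector_engine, query: str) -> list:
        # One search per configured collection, run concurrently.
        per_collection = await asyncio.gather(
            *[
                vector_engine.search(collection, query_text=query, limit=self.top_k)
                for collection in self.entity_collections
            ]
        )
        # Flatten per-collection hits into a single list.
        return [hit for hits in per_collection for hit in hits]
```

With this shape, a missing or empty collection becomes a configuration choice the caller can handle, rather than a hard failure when the LLM extracted nothing.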
```python
CONTEXT_DUMP_DIR = "context_dumps"

# Define retriever configurations
COMPLETION_RETRIEVERS = [
```
Nice
Description
BaseRetriever class to unify all the retrievers and searches.
I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin
Summary by CodeRabbit
New Features
BaseRetriever, ChunksRetriever, CodeRetriever, CompletionRetriever, GraphCompletionRetriever, GraphSummaryCompletionRetriever, InsightsRetriever, and SummariesRetriever.

Refactor