-
Notifications
You must be signed in to change notification settings - Fork 965
feat: entity brute force triplet search [COG-1325] #589
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WalkthroughThis pull request refactors and enhances several modules within the retrieval functionality. In the graph summarization component, the intermediate steps have been simplified by replacing prompt reading and LLM client calls with a direct call to a new summarization utility. New asynchronous functions and optional parameters have been introduced for memory fragment management in triplet search, while obsolete entity completion code is removed. Additionally, new context provider classes, an entity completion retriever, and a comparison example script have been added to streamline entity-based completions and context retrieval. Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant GSR as GraphSummaryCompletionRetriever
participant SumTxt as summarize_text
participant LLM as LLM Client
Client->>GSR: resolve_edges_to_text(retrieved_edges)
GSR->>SumTxt: summarize_text(direct_text, prompt_path)
SumTxt->>LLM: acreate_structured_output(text, system_prompt)
LLM-->>SumTxt: summary result
SumTxt-->>GSR: summarized text
GSR-->>Client: return summary
sequenceDiagram
participant Client
participant ECR as EntityCompletionRetriever
participant Extractor as EntityExtractor
participant Context as ContextProvider
participant LLM as LLM Client
Client->>ECR: get_completion(query, [context])
ECR->>Extractor: extract entities from query
Extractor-->>ECR: entities
ECR->>Context: get_context(query)
Context-->>ECR: context data
ECR->>LLM: generate completion with context
LLM-->>ECR: completion result
ECR-->>Client: return completion
Possibly related PRs
Suggested reviewers
Poem
✨ Finishing Touches
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (12)
examples/python/entity_completion_comparison.py (3)
1-62: Consider adding source attribution for example articlesThe example articles are well-structured for demonstrating the functionality. Consider adding comments indicating the source of these articles or clarifying if they are fictional examples created for demonstration purposes.
64-143: Comprehensive example with some enhancement opportunitiesThe example effectively demonstrates the different retrieval approaches. Consider these improvements:
- Add exception handling around API calls
- Add comments about expected differences between each retrieval method
- Consider extracting the retrieval comparison into a separate function for better readability
145-163: Consider making rebuild_kg and retrieve configurable via CLI argumentsThe example uses hardcoded boolean flags to control execution. Consider using a library like
argparseto make these configurable via command-line arguments, enhancing usability.cognee/modules/retrieval/entity_completion_retriever.py (2)
16-27: Parameter initialization is straightforward.
You might consider validating file paths (e.g., checking if files exist) or raising a warning if they are missing.
49-69: Clarify single-element list return.
The method always returns a list with one string. If multi-segment completions are expected in the future, clarify the design now or return a single string for simplicity.cognee/modules/retrieval/utils/brute_force_triplet_search.py (4)
52-69: Add optional error handling for graph projection.
While creating and projecting the memory fragment, consider wrappingproject_graph_from_dbin a try-except block to handle potential database or engine failures more gracefully.
77-77: Parameter addition is aligned with new use cases.
Includingmemory_fragmentsupports context reuse. Ensure the docstring references this option consistently.
134-134: Logging error detail is informative.
Consider adding a stack trace if deeper debugging is expected, but the short message is sufficient for routine logs.
149-151: Conditional memory fragment creation is efficient.
This is a helpful approach to share the fragment across multiple searches and avoid redundant creation.cognee/modules/retrieval/context_providers/triplet_search_context_provider.py (3)
65-69: Formatting triplets is straightforward.
The returned string is user-friendly, though consider truncation if results are very large.
70-83: Logical aggregation of triplet results.
The approach gracefully handles the case of empty results. Optionally handle partial successes if a single entity fails.
84-98: Async context retrieval flows well.
The concurrency strategy withasyncio.gatheris sound. You may want to handle or log partial entity failures instead of short-circuiting the entire process.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
cognee/modules/retrieval/context_providers/summarized_triplet_search_context_provider.py(1 hunks)cognee/modules/retrieval/context_providers/triplet_search_context_provider.py(1 hunks)cognee/modules/retrieval/entity_completion_retriever.py(1 hunks)cognee/modules/retrieval/graph_summary_completion_retriever.py(2 hunks)cognee/modules/retrieval/utils/brute_force_triplet_search.py(7 hunks)cognee/modules/retrieval/utils/completion.py(1 hunks)cognee/tasks/entity_completion/entity_completion.py(0 hunks)examples/python/entity_completion_comparison.py(1 hunks)
💤 Files with no reviewable changes (1)
- cognee/tasks/entity_completion/entity_completion.py
⏰ Context from checks skipped due to timeout of 90000ms (30)
- GitHub Check: run_notebook_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: run_dynamic_steps_example_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: run_simple_example_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: run_multimedia_example_test / test
- GitHub Check: run_networkx_metrics_test / test
- GitHub Check: run_eval_framework_test / test
- GitHub Check: Test on macos-15
- GitHub Check: Test on macos-15
- GitHub Check: Test on macos-13
- GitHub Check: test
- GitHub Check: Test on macos-15
- GitHub Check: Test on macos-13
- GitHub Check: Test on ubuntu-22.04
- GitHub Check: test
- GitHub Check: windows-latest
- GitHub Check: test
- GitHub Check: test
- GitHub Check: Test on ubuntu-22.04
- GitHub Check: test
- GitHub Check: test
- GitHub Check: Test on macos-13
- GitHub Check: Test on ubuntu-22.04
- GitHub Check: Test on ubuntu-22.04
- GitHub Check: test
- GitHub Check: lint (ubuntu-latest, 3.10.x)
- GitHub Check: docker-compose-test
- GitHub Check: Build Cognee Backend Docker App Image
🔇 Additional comments (21)
cognee/modules/retrieval/utils/completion.py (1)
26-38: Clean utility function implementationThe new
summarize_textfunction provides a valuable abstraction for text summarization across the codebase. The implementation is clean, well-documented, and follows the established pattern of the existinggenerate_completionfunction.cognee/modules/retrieval/context_providers/summarized_triplet_search_context_provider.py (2)
9-25: Well-designed class extensionGood job extending
TripletSearchContextProviderwhile allowing customization of the summarization prompt. The constructor properly calls the parent class's constructor and stores the additional parameter.
26-30: Effective implementation of the _format_triplets methodThe method effectively leverages the parent implementation before adding summarization functionality, following good inheritance practices. The formatted output is clear and includes both the entity name and summary.
cognee/modules/retrieval/graph_summary_completion_retriever.py (2)
4-4: Good import additionAppropriate addition of import for the newly utilized
summarize_textfunction.
28-28: Improved code reuseGood refactoring to leverage the new
summarize_textutility function instead of duplicating summarization logic. This improves maintainability and consistency in how summarization is performed across the codebase.cognee/modules/retrieval/entity_completion_retriever.py (4)
1-2: Imports look fine.
No issues with the added imports; they align well with the usage throughout the file.
4-7: Dependency consistency is good.
The classes and functions imported are used consistently, and their purpose is clear.
10-10: Informative logger name.
Using a module-specific logger name helps in debugging and filtering logs effectively.
13-14: Class definition is clear.
The docstring succinctly describes the retriever’s purpose.cognee/modules/retrieval/utils/brute_force_triplet_search.py (5)
3-3: Type hints approved.
ImportingOptionalandListis consistent with the newly introduced parameters.
85-91: Delegating to brute_force_search is coherent.
Forwarding all parameters consistently, including the optional memory fragment parameter, is a clear approach.
102-102: Memory fragment parameter is well-defined.
Allowing reuse or creation of a new fragment provides flexibility for diverse scenarios.
111-113: Docstring updates are accurate.
The additional parameter descriptions match the new method signature.
158-158: Telemetry message updated correctly.
Changing the telemetry to indicate completion aids in proper monitoring stages.cognee/modules/retrieval/context_providers/triplet_search_context_provider.py (7)
1-2: Imports are appropriate.
They are used within various methods, and the async approach is consistent across the file.
4-5: Base classes imported suitably.
No concerns, the context provider base is correctly extended.
7-13: Brute force utilities usage is coherent.
The imports for triplet search, formatting, and fragment retrieval are well organized.
18-20: Class definition is clear.
Straightforward extension ofBaseContextProviderwith an explicit docstring describing triplet search logic.
21-30: Reasonable defaults for search parameters.
top_k,collections,properties_to_projectprovide adequate flexibility. Checking for valid ranges or types could further bolster robustness.
31-42: Graceful fallback for entity text.
Merging name, description, and text ensures more complete entity data.
43-63: Efficient search task creation.
List comprehension with assignment expression is clean. Ensure that extremely large entity sets do not create undue concurrency overhead.
cognee/modules/retrieval/context_providers/summarized_triplet_search_context_provider.py
Outdated
Show resolved
Hide resolved
cognee/modules/retrieval/context_providers/summarized_triplet_search_context_provider.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (4)
cognee/modules/retrieval/context_providers/SummarizedTripletSearchContextProvider.py (1)
12-22: Consider adding error handling for failed or empty summaries.
Currently,_format_tripletswill return an empty or partial summary ifsummarize_textfails or yields an empty string. A fallback path (e.g., returningdirect_textdirectly) might enhance resilience.examples/python/entity_completion_comparison.py (1)
64-164: Consider breaking down the main function for maintainability.
The singlemainfunction covers multiple logical workflows (data pruning, system pruning, text addition, etc.). Splitting these into smaller helper functions could improve readability and testability.Do you want me to generate a shell script or separate scripts to run each step individually?
cognee/modules/retrieval/utils/brute_force_triplet_search.py (1)
52-69: Ensure consistency in node/edge property projections.
get_memory_fragmentoffers flexibility, but consider documenting the typical properties that should always be projected, especially if multiple providers or modules rely on them.cognee/modules/retrieval/context_providers/TripletSearchContextProvider.py (1)
43-64: Efficient handling of multiple entity searches.
Creating a single memory fragment outside the tasks is thoughtful. Consider enforcing input text length or sanitizing the final combined query if there's a risk of overly large input.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
cognee/modules/retrieval/context_providers/SummarizedTripletSearchContextProvider.py(1 hunks)cognee/modules/retrieval/context_providers/TripletSearchContextProvider.py(1 hunks)cognee/modules/retrieval/utils/brute_force_triplet_search.py(8 hunks)examples/python/entity_completion_comparison.py(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (30)
- GitHub Check: run_networkx_metrics_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: run_eval_framework_test / test
- GitHub Check: run_simple_example_test / test
- GitHub Check: Test on macos-15
- GitHub Check: run_notebook_test / test
- GitHub Check: Test on macos-15
- GitHub Check: Test on macos-15
- GitHub Check: test
- GitHub Check: Test on macos-13
- GitHub Check: windows-latest
- GitHub Check: Test on macos-13
- GitHub Check: test
- GitHub Check: test
- GitHub Check: Test on macos-13
- GitHub Check: run_dynamic_steps_example_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: test
- GitHub Check: run_multimedia_example_test / test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: Test on ubuntu-22.04
- GitHub Check: test
- GitHub Check: Test on ubuntu-22.04
- GitHub Check: Test on ubuntu-22.04
- GitHub Check: Test on ubuntu-22.04
- GitHub Check: docker-compose-test
- GitHub Check: Build Cognee Backend Docker App Image
- GitHub Check: run_simple_example_test
🔇 Additional comments (11)
cognee/modules/retrieval/context_providers/SummarizedTripletSearchContextProvider.py (1)
1-6: Imports look correct and succinct.
No issues identified in how the necessary modules and types are imported.examples/python/entity_completion_comparison.py (2)
1-3: Imports look good.
All required modules appear properly referenced.
16-62: Definition of large text samples.
Having inline multi-line text is convenient for testing and demonstrations. No issues spotted.cognee/modules/retrieval/utils/brute_force_triplet_search.py (3)
3-3: No issues with the newly introduced typing imports.
77-92: Reusing memory fragments is a great improvement.
The newmemory_fragmentparameter allows more efficient repeated queries. Ensure that the shared fragment is thread-safe if used concurrently.Also applies to: 102-125
137-137: No further issues spotted in logging and telemetry.
The refined logging clarifies whether the vector engine fails. Telemetry usage is consistent with other calls.Also applies to: 157-157
cognee/modules/retrieval/context_providers/TripletSearchContextProvider.py (5)
21-30: Constructor aligns well with typical usage.
Initializingtop_k,collections, andproperties_to_projecthere cleanly captures search context configuration.
31-42: Graceful fallback in_get_entity_text.
The stepwise checks ofname,description, andtexthelp avoid null references. No issues spotted.
65-69: Simple formatting of triplets.
The function reliably returns a minimal context heading plus raw triplets. Looks good as a baseline presentation.
70-83: Concise aggregation of triplets into context.
Looping through entities and matching triplets is straightforward. This approach is user-friendly if multiple entity contexts are needed simultaneously.
84-98: Overall workflow is clear.
Usingasyncio.gatherto concurrently search for each entity is a sensible design. The fallback messages (e.g., "No valid entities found") are helpful to the caller.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (4)
examples/python/entity_completion_comparison.py (2)
16-28: Consider moving large article text into external files or constants.Embedding large, static articles directly in the code can negatively impact readability and maintainability, especially if these articles evolve frequently. Storing them in dedicated files or configuration constants can ease updates and reduce clutter.
158-163: Use modernasyncio.runfor a simpler event loop management.Replacing the manual event loop management with
asyncio.run(main(...))can simplify your code and reduce potential pitfalls in complex scenarios:- loop = asyncio.new_event_loop() - asyncio.set_event_loop(loop) - try: - loop.run_until_complete(main(steps_to_enable)) - finally: - loop.run_until_complete(loop.shutdown_asyncgens()) + asyncio.run(main(steps_to_enable))cognee/modules/retrieval/EntityCompletionRetriever.py (2)
28-48: Return type improvements could enhance clarity.Currently,
get_contextreturnsNonewhen no entities or context are found. Consider returning an empty object or raising an informative exception to make downstream handling clearer.
49-69: Revisit returning a single completion as a list.
get_completionreturns a list with one string, which might cause confusion for consumers that expect multiple items. Confirm whether you need a single string or truly support multiple completions.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
cognee/modules/retrieval/EntityCompletionRetriever.py(1 hunks)examples/python/entity_completion_comparison.py(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (22)
- GitHub Check: Test on macos-15
- GitHub Check: Test on macos-13
- GitHub Check: Test on ubuntu-22.04
- GitHub Check: windows-latest
- GitHub Check: run_networkx_metrics_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: run_eval_framework_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: run_dynamic_steps_example_test / test
- GitHub Check: Test on macos-15
- GitHub Check: Test on macos-15
- GitHub Check: Test on macos-13
- GitHub Check: Test on ubuntu-22.04
- GitHub Check: test
- GitHub Check: test
- GitHub Check: Test on macos-13
- GitHub Check: Test on ubuntu-22.04
- GitHub Check: Test on ubuntu-22.04
- GitHub Check: docker-compose-test
- GitHub Check: run_simple_example_test
🔇 Additional comments (2)
examples/python/entity_completion_comparison.py (1)
88-93:Details
✅ Verification successful
Verify that all properties in
properties_to_projectare valid.Ensure "TextSummary_text" exists in the underlying data schema or knowledge graph, otherwise, the search might fail or return incomplete results.
🏁 Script executed:
#!/bin/bash # Searches for references to property definitions in the codebase. # This checks if "TextSummary_text" is declared or expected in any relevant files. fd -e py | xargs rg "TextSummary_text"Length of output: 391
Ensure that the collection names and properties are correctly aligned with the underlying schema.
The search results confirm that
"TextSummary_text"is used elsewhere in the codebase (incognee/modules/retrieval/utils/brute_force_triplet_search.pyandcognee/modules/retrieval/summaries_retriever.py), indicating that it is an expected part of the underlying data schema. Please double-check that both the collections (like"TextSummary_text") and the properties listed in"properties_to_project"remain consistent with the schema to avoid any runtime issues.cognee/modules/retrieval/EntityCompletionRetriever.py (1)
16-27: Constructor looks good.The initialization pattern is straightforward and aligns with typical dependency injection for both the extractor and context provider.
Description
brute_force_triplet_search, extracting memory projection.entity_completioninto EntityCompletionRetriever (extends BaseRetriever).DCO Affirmation
I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin
Summary by CodeRabbit
New Features
Refactor