Skip to content

Conversation

@lxobr
Copy link
Collaborator

@lxobr lxobr commented Feb 28, 2025

Description

  • Refactored brute_force_triplet_search, extracting memory projection.
  • Built TripletSearchContextProvider (extends BaseContextProvider) to create a single memory projection and perform a triplet search for each entity.
  • Refactored entity_completion into EntityCompletionRetriever (extends BaseRetriever).
  • Added SummarizedTripletSearchContextProvider (extends TripletSearchContextProvider) for an alternative summarized output format.
  • Developed and tested an example showcasing both context providers, comparing raw triplets, summaries, and standard search results.

DCO Affirmation

I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin

Summary by CodeRabbit

  • New Features

    • Enhanced text summarization now delivers clearer, more concise overviews of search results.
    • Improved search performance with optimized context retrieval and memory reuse for faster, more reliable results.
    • Introduced advanced entity-based completion for generating more relevant, context-aware responses.
  • Refactor

    • Streamlined internal workflows and error handling to ensure a smoother overall experience.

@lxobr lxobr self-assigned this Feb 28, 2025
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Feb 28, 2025

Walkthrough

This pull request refactors and enhances several modules within the retrieval functionality. In the graph summarization component, the intermediate steps have been simplified by replacing prompt reading and LLM client calls with a direct call to a new summarization utility. New asynchronous functions and optional parameters have been introduced for memory fragment management in triplet search, while obsolete entity completion code is removed. Additionally, new context provider classes, an entity completion retriever, and a comparison example script have been added to streamline entity-based completions and context retrieval.

Changes

File(s) Change Summary
cognee/modules/retrieval/graph_summary_completion_retriever.py Updated resolve_edges_to_text to call summarize_text directly, removing intermediate prompt reading and LLM client invocation.
cognee/modules/retrieval/utils/completion.py Added new asynchronous function summarize_text to handle LLM-based text summarization.
cognee/modules/retrieval/utils/brute_force_triplet_search.py Introduced get_memory_fragment and updated brute_force_triplet_search/brute_force_search to accept an optional memory_fragment; refined error handling and logging.
cognee/tasks/entity_completion/entity_completion.py Removed obsolete entity completion functions (get_llm_response and entity_completion).
examples/python/entity_completion_comparison.py Added a new script to test and compare entity completion functionalities using different context providers.
cognee/modules/retrieval/context_providers/TripletSearchContextProvider.py
cognee/modules/retrieval/context_providers/SummarizedTripletSearchContextProvider.py
Added new context provider classes: a basic triplet search provider and an extended version that formats triplets using summarization.
cognee/modules/retrieval/EntityCompletionRetriever.py Added a new EntityCompletionRetriever class with asynchronous methods for context retrieval and completion generation, including structured error handling.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant GSR as GraphSummaryCompletionRetriever
    participant SumTxt as summarize_text
    participant LLM as LLM Client

    Client->>GSR: resolve_edges_to_text(retrieved_edges)
    GSR->>SumTxt: summarize_text(direct_text, prompt_path)
    SumTxt->>LLM: acreate_structured_output(text, system_prompt)
    LLM-->>SumTxt: summary result
    SumTxt-->>GSR: summarized text
    GSR-->>Client: return summary
Loading
sequenceDiagram
    participant Client
    participant ECR as EntityCompletionRetriever
    participant Extractor as EntityExtractor
    participant Context as ContextProvider
    participant LLM as LLM Client

    Client->>ECR: get_completion(query, [context])
    ECR->>Extractor: extract entities from query
    Extractor-->>ECR: entities
    ECR->>Context: get_context(query)
    Context-->>ECR: context data
    ECR->>LLM: generate completion with context
    LLM-->>ECR: completion result
    ECR-->>Client: return completion
Loading

Possibly related PRs

Suggested reviewers

  • hajdul88
  • soobrosa

Poem

In the code garden under the moon's soft light,
I, a rabbit, hop through lines both clear and bright.
Summaries bloom where triplets elegantly align,
Each new function a playful, rhythmic design.
With whiskered joy, I celebrate this coding delight!

✨ Finishing Touches
  • 📝 Generate Docstrings

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (12)
examples/python/entity_completion_comparison.py (3)

1-62: Consider adding source attribution for example articles

The example articles are well-structured for demonstrating the functionality. Consider adding comments indicating the source of these articles or clarifying if they are fictional examples created for demonstration purposes.


64-143: Comprehensive example with some enhancement opportunities

The example effectively demonstrates the different retrieval approaches. Consider these improvements:

  1. Add exception handling around API calls
  2. Add comments about expected differences between each retrieval method
  3. Consider extracting the retrieval comparison into a separate function for better readability

145-163: Consider making rebuild_kg and retrieve configurable via CLI arguments

The example uses hardcoded boolean flags to control execution. Consider using a library like argparse to make these configurable via command-line arguments, enhancing usability.

cognee/modules/retrieval/entity_completion_retriever.py (2)

16-27: Parameter initialization is straightforward.
You might consider validating file paths (e.g., checking if files exist) or raising a warning if they are missing.


49-69: Clarify single-element list return.
The method always returns a list with one string. If multi-segment completions are expected in the future, clarify the design now or return a single string for simplicity.

cognee/modules/retrieval/utils/brute_force_triplet_search.py (4)

52-69: Add optional error handling for graph projection.
While creating and projecting the memory fragment, consider wrapping project_graph_from_db in a try-except block to handle potential database or engine failures more gracefully.


77-77: Parameter addition is aligned with new use cases.
Including memory_fragment supports context reuse. Ensure the docstring references this option consistently.


134-134: Logging error detail is informative.
Consider adding a stack trace if deeper debugging is expected, but the short message is sufficient for routine logs.


149-151: Conditional memory fragment creation is efficient.
This is a helpful approach to share the fragment across multiple searches and avoid redundant creation.

cognee/modules/retrieval/context_providers/triplet_search_context_provider.py (3)

65-69: Formatting triplets is straightforward.
The returned string is user-friendly, though consider truncation if results are very large.


70-83: Logical aggregation of triplet results.
The approach gracefully handles the case of empty results. Optionally handle partial successes if a single entity fails.


84-98: Async context retrieval flows well.
The concurrency strategy with asyncio.gather is sound. You may want to handle or log partial entity failures instead of short-circuiting the entire process.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3d43125 and 8233e6d.

📒 Files selected for processing (8)
  • cognee/modules/retrieval/context_providers/summarized_triplet_search_context_provider.py (1 hunks)
  • cognee/modules/retrieval/context_providers/triplet_search_context_provider.py (1 hunks)
  • cognee/modules/retrieval/entity_completion_retriever.py (1 hunks)
  • cognee/modules/retrieval/graph_summary_completion_retriever.py (2 hunks)
  • cognee/modules/retrieval/utils/brute_force_triplet_search.py (7 hunks)
  • cognee/modules/retrieval/utils/completion.py (1 hunks)
  • cognee/tasks/entity_completion/entity_completion.py (0 hunks)
  • examples/python/entity_completion_comparison.py (1 hunks)
💤 Files with no reviewable changes (1)
  • cognee/tasks/entity_completion/entity_completion.py
⏰ Context from checks skipped due to timeout of 90000ms (30)
  • GitHub Check: run_notebook_test / test
  • GitHub Check: run_notebook_test / test
  • GitHub Check: run_dynamic_steps_example_test / test
  • GitHub Check: run_notebook_test / test
  • GitHub Check: run_simple_example_test / test
  • GitHub Check: run_notebook_test / test
  • GitHub Check: run_multimedia_example_test / test
  • GitHub Check: run_networkx_metrics_test / test
  • GitHub Check: run_eval_framework_test / test
  • GitHub Check: Test on macos-15
  • GitHub Check: Test on macos-15
  • GitHub Check: Test on macos-13
  • GitHub Check: test
  • GitHub Check: Test on macos-15
  • GitHub Check: Test on macos-13
  • GitHub Check: Test on ubuntu-22.04
  • GitHub Check: test
  • GitHub Check: windows-latest
  • GitHub Check: test
  • GitHub Check: test
  • GitHub Check: Test on ubuntu-22.04
  • GitHub Check: test
  • GitHub Check: test
  • GitHub Check: Test on macos-13
  • GitHub Check: Test on ubuntu-22.04
  • GitHub Check: Test on ubuntu-22.04
  • GitHub Check: test
  • GitHub Check: lint (ubuntu-latest, 3.10.x)
  • GitHub Check: docker-compose-test
  • GitHub Check: Build Cognee Backend Docker App Image
🔇 Additional comments (21)
cognee/modules/retrieval/utils/completion.py (1)

26-38: Clean utility function implementation

The new summarize_text function provides a valuable abstraction for text summarization across the codebase. The implementation is clean, well-documented, and follows the established pattern of the existing generate_completion function.

cognee/modules/retrieval/context_providers/summarized_triplet_search_context_provider.py (2)

9-25: Well-designed class extension

Good job extending TripletSearchContextProvider while allowing customization of the summarization prompt. The constructor properly calls the parent class's constructor and stores the additional parameter.


26-30: Effective implementation of the _format_triplets method

The method effectively leverages the parent implementation before adding summarization functionality, following good inheritance practices. The formatted output is clear and includes both the entity name and summary.

cognee/modules/retrieval/graph_summary_completion_retriever.py (2)

4-4: Good import addition

Appropriate addition of import for the newly utilized summarize_text function.


28-28: Improved code reuse

Good refactoring to leverage the new summarize_text utility function instead of duplicating summarization logic. This improves maintainability and consistency in how summarization is performed across the codebase.

cognee/modules/retrieval/entity_completion_retriever.py (4)

1-2: Imports look fine.
No issues with the added imports; they align well with the usage throughout the file.


4-7: Dependency consistency is good.
The classes and functions imported are used consistently, and their purpose is clear.


10-10: Informative logger name.
Using a module-specific logger name helps in debugging and filtering logs effectively.


13-14: Class definition is clear.
The docstring succinctly describes the retriever’s purpose.

cognee/modules/retrieval/utils/brute_force_triplet_search.py (5)

3-3: Type hints approved.
Importing Optional and List is consistent with the newly introduced parameters.


85-91: Delegating to brute_force_search is coherent.
Forwarding all parameters consistently, including the optional memory fragment parameter, is a clear approach.


102-102: Memory fragment parameter is well-defined.
Allowing reuse or creation of a new fragment provides flexibility for diverse scenarios.


111-113: Docstring updates are accurate.
The additional parameter descriptions match the new method signature.


158-158: Telemetry message updated correctly.
Changing the telemetry to indicate completion aids in proper monitoring stages.

cognee/modules/retrieval/context_providers/triplet_search_context_provider.py (7)

1-2: Imports are appropriate.
They are used within various methods, and the async approach is consistent across the file.


4-5: Base classes imported suitably.
No concerns, the context provider base is correctly extended.


7-13: Brute force utilities usage is coherent.
The imports for triplet search, formatting, and fragment retrieval are well organized.


18-20: Class definition is clear.
Straightforward extension of BaseContextProvider with an explicit docstring describing triplet search logic.


21-30: Reasonable defaults for search parameters.
top_k, collections, properties_to_project provide adequate flexibility. Checking for valid ranges or types could further bolster robustness.


31-42: Graceful fallback for entity text.
Merging name, description, and text ensures more complete entity data.


43-63: Efficient search task creation.
List comprehension with assignment expression is clean. Ensure that extremely large entity sets do not create undue concurrency overhead.

@lxobr lxobr requested a review from borisarzentar March 4, 2025 13:44
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (4)
cognee/modules/retrieval/context_providers/SummarizedTripletSearchContextProvider.py (1)

12-22: Consider adding error handling for failed or empty summaries.
Currently, _format_triplets will return an empty or partial summary if summarize_text fails or yields an empty string. A fallback path (e.g., returning direct_text directly) might enhance resilience.

examples/python/entity_completion_comparison.py (1)

64-164: Consider breaking down the main function for maintainability.
The single main function covers multiple logical workflows (data pruning, system pruning, text addition, etc.). Splitting these into smaller helper functions could improve readability and testability.

Do you want me to generate a shell script or separate scripts to run each step individually?

cognee/modules/retrieval/utils/brute_force_triplet_search.py (1)

52-69: Ensure consistency in node/edge property projections.
get_memory_fragment offers flexibility, but consider documenting the typical properties that should always be projected, especially if multiple providers or modules rely on them.

cognee/modules/retrieval/context_providers/TripletSearchContextProvider.py (1)

43-64: Efficient handling of multiple entity searches.
Creating a single memory fragment outside the tasks is thoughtful. Consider enforcing input text length or sanitizing the final combined query if there's a risk of overly large input.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8233e6d and 02b30f9.

📒 Files selected for processing (4)
  • cognee/modules/retrieval/context_providers/SummarizedTripletSearchContextProvider.py (1 hunks)
  • cognee/modules/retrieval/context_providers/TripletSearchContextProvider.py (1 hunks)
  • cognee/modules/retrieval/utils/brute_force_triplet_search.py (8 hunks)
  • examples/python/entity_completion_comparison.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (30)
  • GitHub Check: run_networkx_metrics_test / test
  • GitHub Check: run_notebook_test / test
  • GitHub Check: run_notebook_test / test
  • GitHub Check: run_eval_framework_test / test
  • GitHub Check: run_simple_example_test / test
  • GitHub Check: Test on macos-15
  • GitHub Check: run_notebook_test / test
  • GitHub Check: Test on macos-15
  • GitHub Check: Test on macos-15
  • GitHub Check: test
  • GitHub Check: Test on macos-13
  • GitHub Check: windows-latest
  • GitHub Check: Test on macos-13
  • GitHub Check: test
  • GitHub Check: test
  • GitHub Check: Test on macos-13
  • GitHub Check: run_dynamic_steps_example_test / test
  • GitHub Check: run_notebook_test / test
  • GitHub Check: test
  • GitHub Check: run_multimedia_example_test / test
  • GitHub Check: test
  • GitHub Check: test
  • GitHub Check: Test on ubuntu-22.04
  • GitHub Check: test
  • GitHub Check: Test on ubuntu-22.04
  • GitHub Check: Test on ubuntu-22.04
  • GitHub Check: Test on ubuntu-22.04
  • GitHub Check: docker-compose-test
  • GitHub Check: Build Cognee Backend Docker App Image
  • GitHub Check: run_simple_example_test
🔇 Additional comments (11)
cognee/modules/retrieval/context_providers/SummarizedTripletSearchContextProvider.py (1)

1-6: Imports look correct and succinct.
No issues identified in how the necessary modules and types are imported.

examples/python/entity_completion_comparison.py (2)

1-3: Imports look good.
All required modules appear properly referenced.


16-62: Definition of large text samples.
Having inline multi-line text is convenient for testing and demonstrations. No issues spotted.

cognee/modules/retrieval/utils/brute_force_triplet_search.py (3)

3-3: No issues with the newly introduced typing imports.


77-92: Reusing memory fragments is a great improvement.
The new memory_fragment parameter allows more efficient repeated queries. Ensure that the shared fragment is thread-safe if used concurrently.

Also applies to: 102-125


137-137: No further issues spotted in logging and telemetry.
The refined logging clarifies whether the vector engine fails. Telemetry usage is consistent with other calls.

Also applies to: 157-157

cognee/modules/retrieval/context_providers/TripletSearchContextProvider.py (5)

21-30: Constructor aligns well with typical usage.
Initializing top_k, collections, and properties_to_project here cleanly captures search context configuration.


31-42: Graceful fallback in _get_entity_text.
The stepwise checks of name, description, and text help avoid null references. No issues spotted.


65-69: Simple formatting of triplets.
The function reliably returns a minimal context heading plus raw triplets. Looks good as a baseline presentation.


70-83: Concise aggregation of triplets into context.
Looping through entities and matching triplets is straightforward. This approach is user-friendly if multiple entity contexts are needed simultaneously.


84-98: Overall workflow is clear.
Using asyncio.gather to concurrently search for each entity is a sensible design. The fallback messages (e.g., "No valid entities found") are helpful to the caller.

@lxobr lxobr requested a review from borisarzentar March 4, 2025 14:55
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (4)
examples/python/entity_completion_comparison.py (2)

16-28: Consider moving large article text into external files or constants.

Embedding large, static articles directly in the code can negatively impact readability and maintainability, especially if these articles evolve frequently. Storing them in dedicated files or configuration constants can ease updates and reduce clutter.


158-163: Use modern asyncio.run for a simpler event loop management.

Replacing the manual event loop management with asyncio.run(main(...)) can simplify your code and reduce potential pitfalls in complex scenarios:

- loop = asyncio.new_event_loop()
- asyncio.set_event_loop(loop)
- try:
-     loop.run_until_complete(main(steps_to_enable))
- finally:
-     loop.run_until_complete(loop.shutdown_asyncgens())
+ asyncio.run(main(steps_to_enable))
cognee/modules/retrieval/EntityCompletionRetriever.py (2)

28-48: Return type improvements could enhance clarity.

Currently, get_context returns None when no entities or context are found. Consider returning an empty object or raising an informative exception to make downstream handling clearer.


49-69: Revisit returning a single completion as a list.

get_completion returns a list with one string, which might cause confusion for consumers that expect multiple items. Confirm whether you need a single string or truly support multiple completions.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 02b30f9 and acad4bc.

📒 Files selected for processing (2)
  • cognee/modules/retrieval/EntityCompletionRetriever.py (1 hunks)
  • examples/python/entity_completion_comparison.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (22)
  • GitHub Check: Test on macos-15
  • GitHub Check: Test on macos-13
  • GitHub Check: Test on ubuntu-22.04
  • GitHub Check: windows-latest
  • GitHub Check: run_networkx_metrics_test / test
  • GitHub Check: run_notebook_test / test
  • GitHub Check: run_notebook_test / test
  • GitHub Check: run_eval_framework_test / test
  • GitHub Check: run_notebook_test / test
  • GitHub Check: run_notebook_test / test
  • GitHub Check: run_dynamic_steps_example_test / test
  • GitHub Check: Test on macos-15
  • GitHub Check: Test on macos-15
  • GitHub Check: Test on macos-13
  • GitHub Check: Test on ubuntu-22.04
  • GitHub Check: test
  • GitHub Check: test
  • GitHub Check: Test on macos-13
  • GitHub Check: Test on ubuntu-22.04
  • GitHub Check: Test on ubuntu-22.04
  • GitHub Check: docker-compose-test
  • GitHub Check: run_simple_example_test
🔇 Additional comments (2)
examples/python/entity_completion_comparison.py (1)

88-93:

Details

✅ Verification successful

Verify that all properties in properties_to_project are valid.

Ensure "TextSummary_text" exists in the underlying data schema or knowledge graph, otherwise, the search might fail or return incomplete results.


🏁 Script executed:

#!/bin/bash
# Searches for references to property definitions in the codebase.
# This checks if "TextSummary_text" is declared or expected in any relevant files.
fd -e py | xargs rg "TextSummary_text"

Length of output: 391


Ensure that the collection names and properties are correctly aligned with the underlying schema.

The search results confirm that "TextSummary_text" is used elsewhere in the codebase (in cognee/modules/retrieval/utils/brute_force_triplet_search.py and cognee/modules/retrieval/summaries_retriever.py), indicating that it is an expected part of the underlying data schema. Please double-check that both the collections (like "TextSummary_text") and the properties listed in "properties_to_project" remain consistent with the schema to avoid any runtime issues.

cognee/modules/retrieval/EntityCompletionRetriever.py (1)

16-27: Constructor looks good.

The initialization pattern is straightforward and aligns with typical dependency injection for both the extractor and context provider.

@borisarzentar borisarzentar changed the title Feat/cog 1325 entity brute force triplet search feat: entity brute force triplet search [COG-1325] Mar 4, 2025
@lxobr lxobr merged commit f033f73 into dev Mar 5, 2025
36 of 37 checks passed
@lxobr lxobr deleted the feat/COG-1325-entity-brute-force-triplet-search branch March 5, 2025 10:18
@coderabbitai coderabbitai bot mentioned this pull request May 30, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants