
Conversation

@lxobr
Collaborator

@lxobr lxobr commented Aug 6, 2025

Description

  • Improved list handling: removed the .index logic from get_graph_from_model and transitioned to fully datapoint-oriented processing
  • Streamlined datapoint iteration by introducing _datapoints_generator with nested loops
  • Generalized field processing to handle mixed lists of the form [DataPoint, (Edge, DataPoint), (Edge, [DataPoint])], allowing dynamic generation of multiple edges (see the sketch after this list)
  • Small improvements and refactorings
  • Added tests in test_get_graph_from_model_flexible_edges() covering weighted edges and dynamic multiple edges
  • Created dynamic_multiple_edges_example.py demonstrating dynamic multiple edges
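
For illustration, a minimal self-contained sketch of the mixed-list field shape described above. The Car/Person models and field names are invented for this example, and the DataPoint/Edge classes below are simplified stand-ins rather than the actual cognee classes.

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DataPoint:
    # Simplified stand-in for cognee's DataPoint
    name: str


@dataclass
class Edge:
    # Simplified stand-in for cognee's Edge
    relationship_name: str
    weight: Optional[float] = None


@dataclass
class Person(DataPoint):
    pass


@dataclass
class Car(DataPoint):
    # A single field may mix three target shapes:
    #   a bare DataPoint            -> a default edge is generated
    #   an (Edge, DataPoint) tuple  -> one custom (e.g. weighted) edge
    #   an (Edge, [DataPoint]) pair -> one Edge spec fanned out to many targets
    connections: list = field(default_factory=list)


alice, bob, carol = Person("Alice"), Person("Bob"), Person("Carol")
car = Car(
    name="car-1",
    connections=[
        alice,                                                   # plain DataPoint
        (Edge(relationship_name="owned_by", weight=0.9), bob),   # weighted edge
        (Edge(relationship_name="driven_by"), [alice, carol]),   # dynamic multiple edges
    ],
)

Under this layout, the (Edge, [DataPoint]) entry is what produces multiple edges dynamically, one per listed target.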

DCO Affirmation

I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin.

@lxobr lxobr requested a review from hajdul88 August 6, 2025 15:30
@pull-checklist

pull-checklist bot commented Aug 6, 2025

Please make sure all the checkboxes are checked:

  • I have tested these changes locally.
  • I have reviewed the code changes.
  • I have added end-to-end and unit tests (if applicable).
  • I have updated the documentation and README.md file (if necessary).
  • I have removed unnecessary code and debug statements.
  • PR title is clear and follows the convention.
  • I have tagged reviewers or team members for feedback.

@coderabbitai
Contributor

coderabbitai bot commented Aug 6, 2025

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

This update introduces a major refactor and feature expansion for LLM structured output frameworks, including full integration of the BAML framework alongside existing Instructor/litellm support. A new LLMGateway abstraction centralizes all LLM interactions. The pipeline system gains fine-grained incremental processing, and search now supports a new "FEELING_LUCKY" type with dynamic selection. Numerous modules are updated for consistency, error handling, and new configuration options.
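
As a rough illustration of the gateway pattern described here (class shape, method name, and routing below are assumptions, not the actual cognee API), callers request structured output from one place instead of instantiating per-module clients:

from typing import Any, Type


class LLMGateway:
    """One entry point that hides whether BAML or Instructor/litellm does the work."""

    def __init__(self, framework: str = "instructor"):
        self.framework = framework

    async def acreate_structured_output(
        self, text_input: str, system_prompt: str, response_model: Type[Any]
    ) -> Any:
        if self.framework == "baml":
            return await self._call_baml(text_input, system_prompt, response_model)
        return await self._call_instructor(text_input, system_prompt, response_model)

    async def _call_baml(self, text_input, system_prompt, response_model):
        ...  # route to the generated BAML client

    async def _call_instructor(self, text_input, system_prompt, response_model):
        ...  # route to an Instructor-patched litellm call


# Callers then stop creating their own clients:
# result = await gateway.acreate_structured_output(text, prompt, KnowledgeGraph)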

Changes

Cohort / File(s) Change Summary
LLM Gateway Abstraction
cognee/infrastructure/llm/LLMGateway.py, cognee/infrastructure/llm/__init__.py, cognee/modules/retrieval/code_retriever.py, cognee/modules/retrieval/graph_completion_cot_retriever.py, cognee/modules/retrieval/natural_language_retriever.py, cognee/modules/retrieval/utils/completion.py, cognee/modules/retrieval/utils/description_to_codepart_search.py, cognee/tasks/graph/cascade_extract/utils/extract_content_nodes_and_relationship_names.py, cognee/tasks/graph/cascade_extract/utils/extract_edge_triplets.py, cognee/tasks/graph/cascade_extract/utils/extract_nodes.py, cognee/tasks/entity_completion/entity_extractors/llm_entity_extractor.py, cognee/tasks/chunk_naive_llm_classifier/chunk_naive_llm_classifier.py, cognee/modules/data/processing/document_types/AudioDocument.py, cognee/modules/data/processing/document_types/ImageDocument.py, cognee/modules/data/extraction/extract_categories.py (deleted), ...
Introduces the LLMGateway class as a unified interface for LLM operations, replacing scattered client instantiations and prompt utilities. All LLM-related calls are routed through this gateway, supporting both Instructor/litellm and BAML frameworks. Removes the now-obsolete extract_categories module.
BAML Structured Output Framework Integration
cognee/infrastructure/llm/structured_output_framework/baml/baml_client/*, cognee/infrastructure/llm/structured_output_framework/baml/baml_src/*, cognee/infrastructure/llm/structured_output_framework/baml/baml_src/extraction/*, ...
Adds a full BAML client SDK, including async/sync clients, type builders, streaming types, runtime, and prompt templates for content classification, knowledge graph extraction, and summarization. Generated files provide data models and client logic.
LLM Config and Environment
.env.template, cognee/infrastructure/llm/config.py
Adds new environment variables for structured output framework selection and BAML configuration. Updates LLMConfig with BAML-specific fields and a post-init registry.
Pipeline Incremental Processing
cognee/modules/pipelines/operations/run_tasks.py, cognee/modules/pipelines/operations/pipeline.py, cognee/api/v1/add/add.py, cognee/api/v1/cognify/cognify.py, cognee/api/v1/cognify/code_graph_pipeline.py, ...
Refactors pipeline task execution to support incremental, concurrent processing of data items, with robust status tracking and error aggregation. Adds incremental_loading parameters throughout the pipeline stack (illustrated in a sketch after this table).
Graph Database and Adapter Updates
cognee/infrastructure/databases/graph/config.py, cognee/infrastructure/databases/graph/get_graph_engine.py, cognee/infrastructure/databases/graph/neo4j_driver/adapter.py, cognee/infrastructure/databases/graph/kuzu/adapter.py, cognee/infrastructure/databases/graph/neptune_driver/adapter.py, cognee/infrastructure/databases/graph/networkx/adapter.py
Adds graph_database_name to configs and adapters, removes memgraph support, and standardizes document subgraph queries to use data_id instead of content_hash. Neo4j adapter gains edge property flattening.
Search System Enhancement
cognee/modules/search/types/SearchType.py, cognee/modules/search/operations/select_search_type.py, cognee/modules/search/methods/search.py, cognee/api/v1/search/search.py, cognee/infrastructure/llm/prompts/search_type_selector_prompt.txt
Adds a new FEELING_LUCKY search type, with logic to dynamically select the best search type using an LLM and a new prompt. Updates documentation and selection logic accordingly.
Error Handling and Status Models
cognee/modules/pipelines/exceptions/exceptions.py, cognee/modules/pipelines/exceptions/__init__.py, cognee/modules/pipelines/models/PipelineRunInfo.py, cognee/modules/pipelines/models/DataItemStatus.py, cognee/modules/pipelines/models/__init__.py, cognee/api/v1/add/routers/get_add_router.py, cognee/api/v1/cognify/routers/get_cognify_router.py
Adds new error and status classes for pipeline runs and data items, with improved error propagation and HTTP response handling in API routers.
Data Model and Deletion Refactor
cognee/modules/data/models/Data.py, cognee/api/v1/delete/delete.py
Adds a mutable JSON pipeline_status field to the Data model for better status tracking. Refactors document deletion to use data_id instead of content_hash.
Graph Extraction & Traversal Refactor
cognee/modules/graph/utils/get_graph_from_model.py, cognee/modules/engine/utils/generate_edge_id.py, cognee/modules/graph/cognee_graph/CogneeGraph.py
Refactors graph extraction and traversal to unify data extraction, simplify edge creation, and introduce a utility for generating normalized edge UUIDs (see the edge-id sketch after this table).
Prompt and Tokenizer Updates
cognee/infrastructure/llm/prompts/*, cognee/infrastructure/llm/tokenizer/*, cognee/infrastructure/databases/vector/embeddings/*
Adds or updates prompt templates, including a new search type selector prompt. Tokenizer adapters are improved for fallback and error handling.
Miscellaneous Refactoring and Imports
cognee/infrastructure/llm/utils.py, cognee/modules/retrieval/context_providers/TripletSearchContextProvider.py, cognee/modules/retrieval/graph_completion_context_extension_retriever.py, cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/*, ...
Cleans up and standardizes import statements, removes unused imports, and updates function signatures and docstrings for clarity and consistency.
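
To make the Pipeline Incremental Processing row concrete, here is a hedged sketch of per-item concurrency with status tracking and error aggregation; process_item and the status values are invented here, and the real run_tasks/DataItemStatus code may differ.

import asyncio
from enum import Enum


class DataItemStatus(str, Enum):
    COMPLETED = "completed"
    ERRORED = "errored"


async def process_item(item):
    ...  # run the pipeline tasks for a single data item


async def run_incremental(items):
    # items are assumed hashable here (e.g. data ids)
    statuses, errors = {}, {}

    async def run_one(item):
        try:
            await process_item(item)
            statuses[item] = DataItemStatus.COMPLETED
        except Exception as error:  # aggregate errors instead of failing the whole run
            statuses[item] = DataItemStatus.ERRORED
            errors[item] = error

    await asyncio.gather(*(run_one(item) for item in items))
    return statuses, errors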

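Similarly, for the Graph Extraction & Traversal Refactor row, one plausible shape for a normalized edge-UUID utility (an assumption, not the repository's generate_edge_id) is a deterministic uuid5 over normalized source, target, and relationship name:

from uuid import NAMESPACE_OID, UUID, uuid5


def generate_edge_id(source_id: UUID, target_id: UUID, relationship_name: str) -> UUID:
    # Same logical edge -> same UUID, regardless of casing/whitespace in the name.
    normalized = f"{source_id}:{target_id}:{relationship_name.strip().lower()}"
    return uuid5(NAMESPACE_OID, normalized)
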
Sequence Diagram(s)

sequenceDiagram
    participant User
    participant API
    participant Pipeline
    participant LLMGateway
    participant BAML/Instructor

    User->>API: Submit data for processing (add/cognify)
    API->>Pipeline: Start pipeline with incremental_loading
    loop For each data item (concurrent)
        Pipeline->>LLMGateway: Request structured output (e.g., extract graph/categories/summary)
        alt Framework = BAML
            LLMGateway->>BAML/Instructor: Route request to BAML extraction
        else Framework = Instructor
            LLMGateway->>BAML/Instructor: Route request to Instructor extraction
        end
        BAML/Instructor-->>LLMGateway: Structured output (graph, categories, etc.)
        LLMGateway-->>Pipeline: Return structured output
        Pipeline->>API: Update status, yield result/event
    end
    API-->>User: Return pipeline run info or error
sequenceDiagram
    participant User
    participant API
    participant SearchModule
    participant LLMGateway

    User->>API: Search with type FEELING_LUCKY
    API->>SearchModule: specific_search(query, FEELING_LUCKY)
    SearchModule->>LLMGateway: select_search_type(query)
    LLMGateway-->>SearchModule: Returns best SearchType
    SearchModule->>SearchModule: Perform search with selected type
    SearchModule-->>API: Return search results
    API-->>User: Results
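
A compact sketch of the dispatch in the diagram above; the SearchType members and the select_search_type/specific_search signatures are inferred from the summary, not verified against the code.

from enum import Enum


class SearchType(str, Enum):
    CHUNKS = "chunks"
    GRAPH_COMPLETION = "graph_completion"
    FEELING_LUCKY = "feeling_lucky"


async def select_search_type(query: str) -> SearchType:
    ...  # ask the LLM (via the gateway and the new selector prompt) for the best-fitting type


async def specific_search(query: str, search_type: SearchType):
    # FEELING_LUCKY is resolved to a concrete type first, then handled as usual.
    if search_type == SearchType.FEELING_LUCKY:
        search_type = await select_search_type(query)
    ...  # run the selected search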

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~90+ minutes

This PR introduces new frameworks, refactors core LLM and pipeline logic, adds new data models, and touches many files across the codebase, including generated and configuration files. Review will require careful attention to integration points, concurrency, error handling, and backward compatibility.

Possibly related PRs

Poem

A rabbit hopped through code so wide,
Bringing BAML and Gateway side by side.
Now LLMs speak with a single voice,
Incremental pipelines dance and rejoice.
"Feeling Lucky?"—search is new,
Graphs and prompts all shiny too!
🐇✨ The future’s structured, thanks to you!

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai generate unit tests to generate unit tests for this PR.
  • @coderabbitai resolve to resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@lxobr lxobr requested a review from Vasilije1990 August 6, 2025 15:30
@lxobr lxobr changed the base branch from main to dev August 6, 2025 15:33
@lxobr lxobr changed the title from "Feature/cog 2672 dynamic multiple edges in datapoints" to "feat: dynamic multiple edges in datapoints" Aug 6, 2025
@lxobr lxobr self-assigned this Aug 6, 2025
@lxobr lxobr requested a review from hajdul88 August 7, 2025 10:32
hajdul88 previously approved these changes Aug 7, 2025
Collaborator

@hajdul88 hajdul88 left a comment


Looks okay to me

@hajdul88 hajdul88 self-requested a review August 7, 2025 10:47
@hajdul88 hajdul88 dismissed their stale review August 7, 2025 10:48

Unit tests are failing

@lxobr lxobr merged commit 6dbd8e8 into dev Aug 7, 2025
58 of 62 checks passed
@lxobr lxobr deleted the feature/cog-2672-dynamic-multiple-edges-in-datapoints branch August 7, 2025 12:50
@coderabbitai coderabbitai bot mentioned this pull request Sep 19, 2025
16 tasks