
Conversation

@EricXiao95
Contributor

Description

This PR fixes graph visualization access for users with read permissions (#1182).

  • Add permission checks for graph visualization endpoints to ensure users can only access datasets they have permission to view
  • Create get_dataset_with_permissions method to validate user access before returning a dataset
  • Remove redundant dataset existence validation in datasets router and delegate permission checking to graph data retrieval
  • Add comprehensive test suite for graph visualization permissions covering owner access and permission granting scenarios
  • Update get_formatted_graph_data() to use the dataset owner's ID for context
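A minimal sketch of what such a permission-gated lookup might look like (only the `get_dataset_with_permissions` name comes from this PR; the signature and the injected helpers below are assumptions for illustration):

```python
import asyncio


class DatasetNotAuthorizedError(PermissionError):
    """Raised when a user lacks read permission on a dataset (illustrative)."""


async def get_dataset_with_permissions(dataset_id, user_id, load_dataset, has_permission):
    """Return the dataset only if `user_id` owns it or holds an explicit read grant.

    `load_dataset` and `has_permission` are injected async callables so the sketch
    stays self-contained; the real method would query cognee's own stores.
    """
    dataset = await load_dataset(dataset_id)
    if dataset is None:
        raise LookupError(f"Dataset {dataset_id} not found")
    if dataset["owner_id"] != user_id and not await has_permission(user_id, dataset_id, "read"):
        raise DatasetNotAuthorizedError(f"User {user_id} cannot read dataset {dataset_id}")
    return dataset
```

Owners pass the first check; other users need an explicit grant, mirroring the owner-access and permission-granting scenarios the test suite covers.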

Testing

Tests can be run with:

```bash
pytest -s cognee/tests/test_graph_visualization_permissions.py
```

DCO Affirmation

I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin.

borisarzentar and others added 26 commits July 28, 2025 23:19
## Description
…er optimization (topoteretes#1151)

## Description
feature: solve edge embedding duplicates in edge collection + retriever optimization

---------

Co-authored-by: Vasilije <[email protected]>
…opoteretes#1092)

## Description
Attempt at making incremental loading run async

## Description
Add async lock for dynamic table creation
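The idea can be sketched with a per-table `asyncio.Lock`, so concurrent tasks cannot both issue a CREATE TABLE for the same table (illustrative sketch; none of these names are cognee's actual ones):

```python
import asyncio

_table_locks: dict = {}          # one lock per table name
_registry_lock = asyncio.Lock()  # guards the lock registry itself


async def ensure_table(name: str, created: set, create_calls: list):
    """Create the table once, even when called concurrently."""
    async with _registry_lock:
        lock = _table_locks.setdefault(name, asyncio.Lock())
    async with lock:  # serialize creation per table
        if name not in created:
            await asyncio.sleep(0)      # yield, simulating DB latency
            create_calls.append(name)   # stand-in for the actual CREATE TABLE
            created.add(name)


async def demo():
    created, calls = set(), []
    await asyncio.gather(*(ensure_table("events", created, calls) for _ in range(5)))
    return calls
```

Five concurrent callers result in exactly one create call; without the lock, several tasks could observe the table as missing and all attempt creation.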

…poteretes#1177)


## Description
Add default tokenizer for custom models not available on HuggingFace
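A fallback of this kind might look like the following (purely illustrative; cognee's real default tokenizer is presumably model-aware rather than whitespace-based, and the registry here is a hypothetical stand-in for the tokenizer lookup):

```python
class FallbackTokenizer:
    """Crude whitespace tokenizer used when no model-specific tokenizer is known."""

    def encode(self, text: str):
        return text.split()

    def count_tokens(self, text: str) -> int:
        return len(self.encode(text))


def load_tokenizer(model_name: str, registry: dict):
    """Return a registered tokenizer for `model_name`, else the default fallback.

    `registry` is a hypothetical mapping of model names to tokenizer factories;
    custom models absent from HuggingFace simply miss the registry and get the default.
    """
    factory = registry.get(model_name)
    return factory() if factory else FallbackTokenizer()
```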

## Description
This PR implements the 'FEELING_LUCKY' search type, which intelligently
routes user queries to the most appropriate search retriever, addressing
[topoteretes#1162](topoteretes#1162).

- Implement new search type FEELING_LUCKY
- Add the select_search_type function to analyze queries and choose the
proper search type
- Integrate with an LLM for intelligent search type determination
- Add logging for the search type selection process
- Support fallback to RAG_COMPLETION when the LLM selection fails
- Add tests for the new search type

## How it works
When a user selects the 'FEELING_LUCKY' search type, the system first
sends their natural language query to an LLM-based classifier. This
classifier analyzes the query's intent (e.g., is it asking for a
relationship, a summary, or a factual answer?) and selects the optimal
SearchType, such as 'INSIGHTS' or 'GRAPH_COMPLETION'. The main search
function then proceeds using this dynamically selected type. If the
classification process fails, it gracefully falls back to the default
'RAG_COMPLETION' type.
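The routing described above reduces to something like the following sketch (the classifier is injected as a plain callable here; the real implementation prompts an LLM, and the enum values beyond those named in this PR are assumptions):

```python
import asyncio
from enum import Enum


class SearchType(Enum):
    RAG_COMPLETION = "RAG_COMPLETION"
    GRAPH_COMPLETION = "GRAPH_COMPLETION"
    INSIGHTS = "INSIGHTS"
    FEELING_LUCKY = "FEELING_LUCKY"


async def select_search_type(query: str, classify) -> SearchType:
    """Pick a concrete SearchType for a FEELING_LUCKY query, falling back on failure."""
    try:
        chosen = await classify(query)   # LLM-backed classification in the real code
        return SearchType(chosen)
    except Exception:
        return SearchType.RAG_COMPLETION  # the graceful fallback described above
```

Any classifier failure (LLM error, unknown label) lands on RAG_COMPLETION rather than surfacing to the user.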

## Testing
Tests can be run with:
```bash
python -m pytest cognee/tests/unit/modules/search/search_methods_test.py -k "feeling_lucky" -v
```


Signed-off-by: EricXiao <[email protected]>
## Description
Resolve issues with Cognee MCP docker use
## Description

---------

Signed-off-by: Andrew Carbonetto <[email protected]>
Signed-off-by: Andy Kwok <[email protected]>
Co-authored-by: Vasilije <[email protected]>
Co-authored-by: vasilije <[email protected]>
Co-authored-by: Andrew Carbonetto <[email protected]>
Co-authored-by: Andy Kwok <[email protected]>
## Description

---------

Signed-off-by: Raj2604 <[email protected]>
Co-authored-by: Daulet Amirkhanov <[email protected]>
Co-authored-by: Hande <[email protected]>
Co-authored-by: Igor Ilic <[email protected]>
Co-authored-by: Igor Ilic <[email protected]>
Co-authored-by: Boris <[email protected]>
Co-authored-by: Matea Pesic <[email protected]>
Co-authored-by: github-actions[bot] <[email protected]>
Co-authored-by: hajdul88 <[email protected]>
Co-authored-by: Boris Arzentar <[email protected]>
Co-authored-by: Raj Mandhare <[email protected]>
Co-authored-by: Pedro Thompson <[email protected]>
Co-authored-by: Pedro Henrique Thompson Furtado <[email protected]>
## Description
Add multi db support for Neo4j Enterprise users

---------

Signed-off-by: Raj2604 <[email protected]>
Co-authored-by: vasilije <[email protected]>
Co-authored-by: Vasilije <[email protected]>
Co-authored-by: Daulet Amirkhanov <[email protected]>
Co-authored-by: Hande <[email protected]>
Co-authored-by: Boris <[email protected]>
Co-authored-by: Matea Pesic <[email protected]>
Co-authored-by: github-actions[bot] <[email protected]>
Co-authored-by: hajdul88 <[email protected]>
Co-authored-by: Boris Arzentar <[email protected]>
Co-authored-by: Raj Mandhare <[email protected]>
Co-authored-by: Pedro Thompson <[email protected]>
Co-authored-by: Pedro Henrique Thompson Furtado <[email protected]>
This deals with the case where the user runs a custom embedding model and LLM and passes the hosted_vllm provider option described in the LiteLLM documentation.
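Hypothetically, wiring this up could look like the environment setup below. The `hosted_vllm/` prefix is LiteLLM's provider convention for self-hosted vLLM endpoints, but the variable names, model, and URL here are illustrative assumptions, not cognee's actual config keys:

```python
import os

# Point the embedding stack at a self-hosted vLLM server via LiteLLM's
# "hosted_vllm" provider. All names below are assumptions for illustration.
os.environ["EMBEDDING_PROVIDER"] = "hosted_vllm"
os.environ["EMBEDDING_MODEL"] = "hosted_vllm/BAAI/bge-m3"
os.environ["EMBEDDING_ENDPOINT"] = "http://localhost:8000/v1"
```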

## Description
This allows the user to use hosted_vllm with LiteLLM; it applies only to custom embedding models, specifically Hugging Face models.
…sh (topoteretes#1210)

## Description
Changing deletion logic to use document id instead of content hash

## Description
- Improved list handling, removed `.index` logic from
`get_graph_from_model`, transitioned to fully datapoint-oriented
processing
- Streamlined datapoint iteration by introducing `_datapoints_generator`
with nested loops
- Generalized field processing to handle mixed lists: `[DataPoint,
(Edge, DataPoint), (Edge, [DataPoint])]`, allowing dynamic multiple
edges generation
- Small improvements and refactorings
- Added tests to `test_get_graph_from_model_flexible_edges()` covering
weighted edges and dynamic multiple edges
- Created `dynamic_multiple_edges_example.py` demonstrating dynamic
multiple edges
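The mixed-list handling above can be sketched as a small normalizing generator (illustrative; `Edge` is a stand-in for cognee's edge metadata type, and the real `get_graph_from_model` does considerably more):

```python
from typing import Any, Iterator, Optional, Tuple


class Edge:
    """Stand-in for cognee's Edge metadata type (hypothetical shape)."""

    def __init__(self, relationship_name: str, weight: Optional[float] = None):
        self.relationship_name = relationship_name
        self.weight = weight


def iter_edges(field_value: Any) -> Iterator[Tuple[Optional[Edge], Any]]:
    """Yield (edge, target) pairs from a field holding DataPoint, (Edge, DataPoint),
    or (Edge, [DataPoint]) entries, or a list mixing all three."""
    items = field_value if isinstance(field_value, list) else [field_value]
    for item in items:
        if isinstance(item, tuple):
            edge, target = item
            targets = target if isinstance(target, list) else [target]
            for t in targets:  # (Edge, [DataPoint]) fans out into multiple edges
                yield edge, t
        else:
            yield None, item  # plain DataPoint: edge metadata defaulted by caller
```

Flattening everything into `(edge, target)` pairs is what lets one field declaration produce multiple weighted edges dynamically.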

@pull-checklist

pull-checklist bot commented Aug 7, 2025

Please make sure all the checkboxes are checked:

  • I have tested these changes locally.
  • I have reviewed the code changes.
  • I have added end-to-end and unit tests (if applicable).
  • I have updated the documentation and README.md file (if necessary).
  • I have removed unnecessary code and debug statements.
  • PR title is clear and follows the convention.
  • I have tagged reviewers or team members for feedback.

@coderabbitai
Contributor

coderabbitai bot commented Aug 7, 2025

Caution

Review failed

The pull request is closed.

Walkthrough

This update introduces extensive support for structured output frameworks, notably integrating the BAML client and instructor-based LLM gateway, and refactors LLM-related logic to use a new LLMGateway abstraction. It also implements incremental data loading in pipelines, adds permission-aware dataset access, introduces new data models, and updates search type selection with a "feeling lucky" mode.

Changes

  • **Structured Output Framework: BAML Integration** — cognee/infrastructure/llm/structured_output_framework/baml/baml_client/*, cognee/infrastructure/llm/structured_output_framework/baml/baml_src/*, cognee/infrastructure/llm/structured_output_framework/baml/baml_src/extraction/*
    Introduces BAML async/sync clients, runtime, parsers, type builder, type map, and Pydantic data models. Adds BAML schema and prompt templates for content graph extraction and classification, with async summary extraction and mock summary support.
  • **LLMGateway Abstraction and LLM Refactor** — cognee/infrastructure/llm/LLMGateway.py, cognee/infrastructure/llm/__init__.py, cognee/infrastructure/llm/config.py, cognee/infrastructure/llm/utils.py, cognee/infrastructure/llm/structured_output_framework/litellm_instructor/extraction/*, cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/*, cognee/modules/retrieval/*, cognee/modules/retrieval/utils/*, cognee/modules/data/processing/document_types/*, cognee/tasks/chunk_naive_llm_classifier/chunk_naive_llm_classifier.py, cognee/eval_framework/evaluation/direct_llm_eval_adapter.py, cognee/modules/engine/utils/generate_edge_id.py
    Adds the LLMGateway class as a unified interface for LLM operations. Refactors all LLM and prompt usage to static gateway methods, removing direct client instantiation and scattered utility functions. Updates imports and usage throughout retrieval, evaluation, and chunk classification modules.
  • **Incremental Loading and Pipeline Refactor** — cognee/modules/pipelines/operations/run_tasks.py, cognee/modules/pipelines/operations/pipeline.py, cognee/api/v1/add/add.py, cognee/api/v1/cognify/cognify.py, cognee/api/v1/cognify/code_graph_pipeline.py, cognee/modules/pipelines/models/PipelineRunInfo.py, cognee/modules/pipelines/models/DataItemStatus.py, cognee/modules/pipelines/models/__init__.py, cognee/api/v1/add/routers/get_add_router.py, cognee/api/v1/cognify/routers/get_cognify_router.py, cognee/modules/pipelines/exceptions/*
    Implements incremental loading for pipeline tasks, allowing per-data-item processing and skipping already-completed items. Adds new error and status models for pipeline runs, and updates API endpoints to handle the new statuses and errors.
  • **Graph Database and Data Model Updates** — cognee/infrastructure/databases/graph/*, cognee/base_config.py, cognee/modules/data/models/Data.py
    Adds support for a graph database name in configs/adapters, updates subgraph lookup to use data_id instead of content_hash, adds pipeline status to the Data model, and updates config dictionary outputs.
  • **Dataset Permission and Access Control** — cognee/modules/data/methods/*, cognee/modules/graph/methods/get_formatted_graph_data.py, cognee/api/v1/datasets/routers/get_datasets_router.py
    Introduces permission-aware dataset retrieval, updates graph data formatting to enforce permissions, and modifies dataset status endpoint defaults.
  • **Search Type Selector and "Feeling Lucky" Mode** — cognee/modules/search/operations/select_search_type.py, cognee/modules/search/operations/__init__.py, cognee/modules/search/methods/search.py, cognee/modules/search/types/SearchType.py, cognee/infrastructure/llm/prompts/search_type_selector_prompt.txt, cognee/api/v1/search/search.py
    Adds an async search type selector using an LLM, introduces the "FEELING_LUCKY" search type, and updates search logic and documentation to support dynamic query type selection.
  • **Document and Chunk Processing** — cognee/tasks/documents/extract_chunks_from_documents.py, cognee/modules/data/processing/document_types/PdfDocument.py
    Removes custom PDF error handling, allowing exceptions to propagate directly during PDF reading and chunk extraction.
  • **Edge and Graph Utilities** — cognee/modules/engine/utils/generate_edge_id.py, cognee/modules/graph/utils/get_graph_from_model.py, cognee/modules/graph/cognee_graph/CogneeGraph.py
    Adds an edge ID generation utility, refactors graph extraction to handle relationships and edge metadata more consistently, and simplifies edge mapping and triplet importance calculations.
  • **Miscellaneous and Formatting** — cognee/infrastructure/llm/tokenizer/*, cognee/infrastructure/databases/vector/embeddings/*, cognee/infrastructure/llm/prompts/*, .env.template, cognee/shared/data_models.py
    Updates tokenizer imports and fallback logic, adds missing newlines to prompt templates, expands environment variable templates, and makes minor formatting/import changes.
  • **Removals** — cognee/modules/data/extraction/extract_categories.py
    Removes the old extract_categories function in favor of LLMGateway-based implementations.

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant API
    participant LLMGateway
    participant BAMLClient
    participant InstructorClient

    API->>LLMGateway: extract_content_graph(content, response_model, mode)
    alt framework == "BAML"
        LLMGateway->>BAMLClient: ExtractContentGraphGeneric(content, mode)
        BAMLClient-->>LLMGateway: KnowledgeGraph
    else framework == "instructor"
        LLMGateway->>InstructorClient: acreate_structured_output(content, prompt, response_model)
        InstructorClient-->>LLMGateway: KnowledgeGraph
    end
    LLMGateway-->>API: KnowledgeGraph
```
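The dispatch in the first diagram amounts to something like this (a simplified sketch; only LLMGateway and extract_content_graph are named in the PR, and the backend calls are injected stubs rather than the real BAML/instructor clients):

```python
import asyncio


class LLMGateway:
    """Unified facade over structured-output backends.

    The real implementation reads the framework choice from configuration;
    here it is a plain class attribute for illustration.
    """

    framework = "instructor"  # or "BAML"

    @staticmethod
    async def extract_content_graph(content, response_model, baml_call, instructor_call):
        if LLMGateway.framework == "BAML":
            # Real code would call BAMLClient.ExtractContentGraphGeneric
            return await baml_call(content)
        # Real code would call acreate_structured_output(content, prompt, response_model)
        return await instructor_call(content, response_model)
```

Callers depend only on the gateway's static methods, so swapping structured-output frameworks is a configuration change rather than a code change.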
```mermaid
sequenceDiagram
    participant User
    participant API
    participant Pipeline
    participant DB

    User->>API: POST /add (incremental_loading=True)
    API->>Pipeline: cognee_pipeline(..., incremental_loading=True)
    Pipeline->>DB: Check data item status
    alt Already processed
        Pipeline-->>API: PipelineRunAlreadyCompleted
    else Not processed
        Pipeline->>DB: Process and update status
        Pipeline-->>API: PipelineRunCompleted
    end
```
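The skip logic in the second diagram reduces to a per-item status check (a sketch; the status strings echo the PipelineRunAlreadyCompleted / PipelineRunCompleted models named above, while the in-memory store and function shape are assumptions):

```python
import asyncio


async def run_item(item_id, status_store: dict, process):
    """Process one data item unless a previous run already completed it."""
    if status_store.get(item_id) == "completed":
        return "PipelineRunAlreadyCompleted"
    await process(item_id)                  # stand-in for the real task pipeline
    status_store[item_id] = "completed"     # persisted to the DB in the real code
    return "PipelineRunCompleted"
```

Re-running the same dataset then does no redundant work: already-completed items short-circuit before any processing starts.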

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~90+ minutes


Suggested labels

run-checks

Suggested reviewers

  • borisarzentar

Poem

A rabbit hopped through code so wide,
Adding BAML and Instructor side by side.
With LLMGateway’s magic, prompts now flow,
Incremental pipelines, permissions in tow.
“Feeling Lucky?”—let the search decide,
As data and graphs are neatly supplied.
🐇✨ The garden of features, now unified!


@EricXiao95 EricXiao95 closed this Aug 7, 2025
@EricXiao95 EricXiao95 reopened this Aug 7, 2025
@EricXiao95 EricXiao95 closed this Aug 7, 2025

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants