
Conversation

@dexters1
Collaborator

@dexters1 dexters1 commented Nov 19, 2025

Description

Add rate limit handling for BAML

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Code refactoring
  • Performance improvement
  • Other (please specify):

Screenshots/Videos (if applicable)

Pre-submission Checklist

  • I have tested my changes thoroughly before submitting this PR
  • This PR contains minimal changes necessary to address the issue/feature
  • My code follows the project's coding standards and style guidelines
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if applicable)
  • All new and existing tests pass
  • I have searched existing PRs to ensure this change hasn't been submitted already
  • I have linked any relevant issues in the description
  • My commits have clear and descriptive messages

DCO Affirmation

I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin.

Summary by CodeRabbit

  • Improvements

    • Centralized rate limiting for LLM and embedding requests to better manage API throughput.
    • Retry policies adjusted: longer initial backoff and reduced max retries to improve stability under rate limits.
  • Chores

    • Refactored rate-limiting implementation from decorator-based to context-manager usage across services.
  • Tests

    • Unit tests and mocks updated to reflect the new context-manager rate-limiting approach.


@dexters1 dexters1 self-assigned this Nov 19, 2025
@pull-checklist

Please make sure all the checkboxes are checked:

  • I have tested these changes locally.
  • I have reviewed the code changes.
  • I have added end-to-end and unit tests (if applicable).
  • I have updated the documentation and README.md file (if necessary).
  • I have removed unnecessary code and debug statements.
  • PR title is clear and follows the convention.
  • I have tagged reviewers or team members for feedback.

@coderabbitai
Contributor

coderabbitai bot commented Nov 19, 2025

Important

Review skipped

Review was skipped due to path filters

⛔ Files ignored due to path filters (1)
  • poetry.lock is excluded by !**/*.lock

CodeRabbit blocks several paths by default. You can override this behavior by explicitly including those paths in the path filters. For example, including **/dist/** will override the default block on the dist directory by removing the pattern from both lists.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

Adds a shared aiolimiter-based rate-limiting module and migrates LLM and embedding call sites from a decorator-based limiter to context-manager usage; increases initial retry backoff, reduces retry counts across adapters/engines, removes the legacy embedding_rate_limiter module, and adds aiolimiter to dependencies.

Changes

Cohort / File(s) Summary
New shared rate-limiter
cognee/shared/rate_limiting.py
New module that initializes llm_rate_limiter and embedding_rate_limiter (AsyncLimiter via aiolimiter) and provides llm_rate_limiter_context_manager() and embedding_rate_limiter_context_manager() which return an AsyncLimiter context or a no-op context based on config.
Removed legacy embedding limiter
cognee/infrastructure/databases/vector/embeddings/embedding_rate_limiter.py
Entire module deleted — removed singleton EmbeddingRateLimiter, decorator-based sync/async limiters, sleep-and-retry utilities, patterns/constants, and related exports.
LLM adapters — apply shared limiter & adjust retries
.../litellm_instructor/llm/openai/adapter.py, .../litellm_instructor/llm/anthropic/adapter.py, .../litellm_instructor/llm/gemini/adapter.py, .../litellm_instructor/llm/mistral/adapter.py, .../litellm_instructor/llm/ollama/adapter.py, .../litellm_instructor/llm/generic_llm_api/adapter.py, cognee/infrastructure/llm/structured_output_framework/baml/baml_src/extraction/acreate_structured_output.py
Imported llm_rate_limiter_context_manager, wrapped primary (and fallback where present) LLM calls in async with llm_rate_limiter_context_manager():, increased the initial tenacity wait from wait_exponential_jitter(2, 128) to wait_exponential_jitter(8, 128), and generally reduced max_retries (commonly 5→2); minor prompt/formatting and response-path adjustments in places. A sketch of this call-site pattern follows the table.
Embedding engines — use shared limiter & adjust backoff
cognee/infrastructure/databases/vector/embeddings/FastembedEmbeddingEngine.py, .../LiteLLMEmbeddingEngine.py, .../OllamaEmbeddingEngine.py, cognee/tests/unit/infrastructure/mock_embedding_engine.py
Replaced imports of old decorator-based limiter with embedding_rate_limiter_context_manager, wrapped live embedding calls in the context manager, adjusted retry backoff initial wait from 2→8, and tweaked response access patterns (e.g., response.data) while preserving mock/no-op behavior.
Tests updated
cognee/tests/unit/infrastructure/test_embedding_rate_limiting_realistic.py
Removed imports and calls related to the removed EmbeddingRateLimiter singleton (e.g., reset_instance()), relying on config cache clearing instead.
Deps
pyproject.toml
Added dependency aiolimiter>=1.2.1.
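
To make the wrapping pattern above concrete, here is a minimal sketch of the call-site shape the adapters now share, assuming only what the walkthrough states (the shared llm_rate_limiter_context_manager plus the new tenacity settings); call_llm is a hypothetical stand-in for each adapter's actual litellm/instructor call, not code from this PR.

import logging

from tenacity import retry, stop_after_delay, wait_exponential_jitter, before_sleep_log

from cognee.shared.logging_utils import get_logger
from cognee.shared.rate_limiting import llm_rate_limiter_context_manager

logger = get_logger()


async def call_llm(text_input: str, system_prompt: str):
    """Hypothetical placeholder for the adapter's real structured-output call."""
    ...


@retry(
    stop=stop_after_delay(128),
    wait=wait_exponential_jitter(8, 128),  # initial backoff raised from 2s to 8s, capped at 128s
    before_sleep=before_sleep_log(logger, logging.DEBUG),
    reraise=True,
)
async def acreate_structured_output_sketch(text_input: str, system_prompt: str):
    # The context manager yields the shared AsyncLimiter when rate limiting is
    # enabled, or a no-op nullcontext() when it is disabled.
    async with llm_rate_limiter_context_manager():
        return await call_llm(text_input, system_prompt)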

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Areas to focus on:
    • Ensure no remaining imports reference the removed embedding_rate_limiter.py.
    • Verify global limiter initialization and enabled/disabled behavior in cognee/shared/rate_limiting.py.
    • Confirm interaction between context-manager-based limiters and tenacity retries (exception propagation, retry triggers); see the sketch after this list.
    • Check tests/mock paths still bypass rate limiting as intended.
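
For the limiter/retry interaction item above, a tiny self-contained check along these lines (illustrative only, not a test from the PR; it substitutes nullcontext() for the real limiter) confirms that an exception raised inside the async with block propagates out of the context manager and still triggers tenacity's retries.

import asyncio
from contextlib import nullcontext

from tenacity import retry, stop_after_attempt, wait_none

attempts = 0


@retry(stop=stop_after_attempt(3), wait=wait_none(), reraise=True)
async def flaky_call():
    global attempts
    attempts += 1
    async with nullcontext():  # stand-in for llm_rate_limiter_context_manager()
        raise RuntimeError("simulated rate-limit error")


async def main():
    try:
        await flaky_call()
    except RuntimeError:
        pass
    assert attempts == 3  # the exception escaped the context manager on every attempt


asyncio.run(main())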

Possibly related PRs

Suggested reviewers

  • borisarzentar
  • Vasilije1990
  • hajdul88

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
Check name Status Explanation Resolution
Title check ⚠️ Warning The title is related to the changeset but is misleading. The PR implements new rate-limiting functionality across multiple LLM/embedding components and refactors the rate-limiting architecture, not just fixing BAML issues. Update the title to reflect the broader scope, such as: 'refactor: Centralize rate limiting using AsyncLimiter context managers' or 'feat: Implement unified rate limiting for LLM and embedding calls'.
Description check ⚠️ Warning The description is minimal and incomplete. It lacks details about the actual changes, lacks type-of-change checkbox selection, and doesn't address the architectural refactoring or centralization of rate-limiting logic. Provide a comprehensive description explaining the rate-limiting refactor, the migration from singleton-based decorators to context managers, affected components, and link to any related issues.
✅ Passed checks (1 passed)
Check name Status Explanation
Docstring Coverage ✅ Passed Docstring coverage is 86.67% which is sufficient. The required threshold is 80.00%.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 487635b and 7360729.

📒 Files selected for processing (1)
  • cognee/infrastructure/llm/structured_output_framework/baml/baml_src/extraction/acreate_structured_output.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
cognee/infrastructure/llm/structured_output_framework/baml/baml_src/extraction/acreate_structured_output.py (1)
cognee/shared/logging_utils.py (1)
  • get_logger (212-224)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
  • GitHub Check: End-to-End Tests / Deduplication Test
  • GitHub Check: CLI Tests / CLI Integration Tests
  • GitHub Check: CLI Tests / CLI Functionality Tests
  • GitHub Check: End-to-End Tests / Test Entity Extraction
  • GitHub Check: End-to-End Tests / Run Telemetry Pipeline Test
  • GitHub Check: End-to-End Tests / Server Start Test
  • GitHub Check: End-to-End Tests / Test Feedback Enrichment
  • GitHub Check: End-to-End Tests / Conversation sessions test
  • GitHub Check: End-to-End Tests / Test permissions with different situations in Cognee
  • GitHub Check: End-to-End Tests / Concurrent Subprocess access test
  • GitHub Check: End-to-End Tests / Test graph edge ingestion
  • GitHub Check: End-to-End Tests / Test using different async databases in parallel in Cognee
  • GitHub Check: Basic Tests / Run Basic Graph Tests
  • GitHub Check: End-to-End Tests / S3 Bucket Test
  • GitHub Check: Basic Tests / Run Simple Examples BAML
  • GitHub Check: Basic Tests / Run Simple Examples
  • GitHub Check: Basic Tests / Run Linting
  • GitHub Check: Basic Tests / Run Unit Tests
  • GitHub Check: Basic Tests / Run Formatting Check
  • GitHub Check: Basic Tests / Run Integration Tests
🔇 Additional comments (4)
cognee/infrastructure/llm/structured_output_framework/baml/baml_src/extraction/acreate_structured_output.py (4)

12-12: LGTM: Logging setup is correct.

The logger is properly initialized using the project's logging utility and will log retry attempts at DEBUG level.

Also applies to: 21-23


3-3: LGTM: BaseModel import is necessary.

This import is correctly added to support the type hint on line 33.


32-71: Function implementation looks correct.

The core logic remains unchanged. The retry decorator properly wraps the entire async function, which is the correct approach for handling rate limits at the API call level.


4-10: Remove unused import and add missing retry condition to decorator.

The review comment correctly identifies two real issues:

  1. retry_if_not_exception_type is imported on line 8 but never used
  2. The @retry decorator (lines 26-31) lacks a retry= parameter, causing it to retry on ALL exceptions rather than just transient/rate-limit errors

However, the exact exception class that BAML raises for rate limits cannot be determined from the codebase alone, as baml_py is an external library. The original review's suggestion to search the codebase also returned no results.

To complete this fix:

  • Remove the unused retry_if_not_exception_type import or use it in the decorator
  • Add retry= parameter specifying the appropriate BAML exception (requires checking baml-py documentation or testing to identify the correct exception class)
  • Verify the exception handling matches BAML's actual behavior

logger = get_logger()


@retry(
Contributor


I have a couple of concerns about this approach. It feels like we are starting to add LLM-specific workarounds into our codebase, and I'm not sure that's ideal.
Are we sure exponential retries make sense for all cases with structured output? For example, if we send 1,000 requests in one second, would all the failed ones get retried?
How do we know this retry strategy will work well with other LLMs?
It might be worth looking at other ways to handle rate limits too, like a reverse proxy or a message queue.

Collaborator Author

@dexters1 dexters1 Nov 20, 2025


Yes, all failed requests would get retried if 1,000 different backend requests were made, but we should use a proxy or a similar mechanism for a production scenario and not rely on this approach. This is just a last resort for a single user in a non-production scenario who doesn't need anything advanced, and it will work fine in that case.

By the way, the wait behavior can be changed: we can provide any function to handle the waiting, so we could make it work in a 1,000-request scenario with exponential backoff (for example, if our wait function coordinated through a global variable). It might make sense to update the wait function to handle that scenario as well.
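
To illustrate the "provide any function to handle the waiting" idea: tenacity accepts any callable that takes the retry state and returns a number of seconds, so a custom wait could coordinate backoff through shared global state. The sketch below is purely illustrative (none of these names exist in the PR), and the shared counter would have to be updated from the callers' error handling.

import random
import threading

FAILING_REQUESTS = 0
_LOCK = threading.Lock()


def record_failure():
    """Would be called from an adapter's error handling / before_sleep hook."""
    global FAILING_REQUESTS
    with _LOCK:
        FAILING_REQUESTS += 1


def globally_aware_wait(retry_state) -> float:
    """Grow the wait with the attempt number and with how many requests are currently retrying."""
    base = min(8 * (2 ** (retry_state.attempt_number - 1)), 128)
    crowd_factor = 1 + min(FAILING_REQUESTS, 100) / 10
    return min(base * crowd_factor, 300) + random.uniform(0, 1)


# Usage: @retry(wait=globally_aware_wait, stop=stop_after_delay(128), reraise=True)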

Contributor


@dexters1 please update it based on Pavel's feedback.
I am worried that if we don't spell out the options in our docs, people will use this in production. Maybe add a ticket to document the recommended ways to run it in prod in Mintlify and pass it to Lazar for review.

Contributor


@Vasilije1990 We proposed an alternative way. Also, @dexters1 recommends LiteLLM as a proxy. Let's see if it solves the issue and then decide the future of this PR. Perhaps we can avoid the rate limiter on our side.

Collaborator Author

@dexters1 dexters1 Dec 1, 2025


I would merge this PR to the dev branch as a fail-safe for beginner developers. It would be good to expand the wait function to work globally as well when we get the time.

Contributor


I'm a bit concerned. We introduce exponential retries and expect them to solve the rate-limiting problem. The initial problem is that local LLMs can't survive under a high RPM, and we are expecting them to recover by introducing retries (i.e., by sending additional requests).

@dexters1 dexters1 changed the base branch from main to dev December 1, 2025 13:54
@dexters1 dexters1 changed the base branch from dev to main December 1, 2025 13:54
@borisarzentar
Member

@pazone Should we close this one as the issue is also closed?

@dexters1
Collaborator Author

dexters1 commented Dec 4, 2025

@pazone Should we close this one as the issue is also closed?

We still need this mechanism for casual users, to handle the case where a rate limit is hit on the Cognee side. We should enhance it to work at a global request level, so we don't spam too many parallel requests that can't be handled, and we can add optional RPM support. This can be achieved by specifying our own wait function in the tenacity wait criteria.

The reason an issue was raised about this only for BAML is that we don't have any retry mechanism on the BAML side, while we already have one on the embedding and instructor side.

@dexters1 dexters1 changed the base branch from main to dev December 4, 2025 23:00
@gitguardian

gitguardian bot commented Dec 5, 2025

⚠️ GitGuardian has uncovered 1 secret following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secret in your pull request
GitGuardian id GitGuardian status Secret Commit Filename
9573981 Triggered Generic Password 5d0586d .github/workflows/e2e_tests.yml View secret
🛠 Guidelines to remediate hardcoded secrets
  1. Understand the implications of revoking this secret by investigating where it is used in your code.
  2. Replace and store your secret safely, following best practices.
  3. Revoke and rotate this secret.
  4. If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.


@pazone
Contributor

pazone commented Dec 5, 2025

Bringing more details to the PR

What problem is this attempting to solve:

If a user runs an LLM locally with limited resources, it might not handle the default request rate and will eventually start responding with 500s or "too many requests" errors, depending on the implementation.

What approaches have been discussed

  • Use LiteLLM proxy locally that has a rate limiter
  • Use nginx proxy to limit RPM
  • Implement the rate limiter on the Cognee side. Considered a controversial solution, but it might be helpful for local try-outs

What does this PR do

The PR introduces retries with exponential backoff: failed requests to the LLM are retried with increasing delays.
I'm probably missing something. I can't understand how exponential retries would help us when the local LLM dies under load.

Adding @borisarzentar and @hajdul88 to review.
Let's make sure that we solve the initial problem without introducing additional functionality that isn't guaranteed to help.
IMHO, retrying requests to any throttled service will lead us to much bigger and more obvious problems.

@dexters1
Collaborator Author

dexters1 commented Dec 5, 2025

Bringing more details to the PR

What problem is this attempting to solve:

If a user runs an LLM locally with limited resources, it might not handle the default request rate and will eventually start responding with 500s or "too many requests" errors, depending on the implementation.

What approaches have been discussed

  • Use LiteLLM proxy locally that has a rate limiter
  • Use nginx proxy to limit RPM
  • Implement the rate limiter on the Cognee side. Considered a controversial solution, but it might be helpful for local try-outs

What does this PR do

The PR introduces retries with exponential backoff: failed requests to the LLM are retried with increasing delays. I'm probably missing something. I can't understand how exponential retries would help us when the local LLM dies under load.

Adding @borisarzentar and @hajdul88 to review. Let's make sure that we solve the initial problem without introducing additional functionality that isn't guaranteed to help.

It's not necessarily for when local LLMs die.


Exponential retries are not additional functionality; they're something we already have and use for instructor, where they handle OpenAI, local model, and other cloud model calls. This PR just introduces the same mechanism on the BAML side, where it was missing. It's not here to help when a local LLM dies; it's mostly here to handle the case where a rate limit has been reached when using BAML with OpenAI and other cloud providers, since otherwise Cognee breaks once the rate limit is hit. It might also help with local models (no one complained about local LLMs dying when using instructor), but that is only a minor part of the mechanism.

For local model use, I will add optional simple RPM control on the Cognee side.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/mistral/adapter.py (1)

101-113: Incorrect response handling for instructor client.

The instructor.from_litellm client returns the parsed Pydantic model directly, not a raw API response with choices. The current code checks response.choices and attempts model_validate_json, which will likely fail because response is already the validated response_model instance.

             async with llm_rate_limiter_context_manager():
                 response = await self.aclient.chat.completions.create(
                     model=self.model,
                     max_tokens=self.max_completion_tokens,
                     max_retries=2,
                     messages=messages,
                     response_model=response_model,
                 )
-            if response.choices and response.choices[0].message.content:
-                content = response.choices[0].message.content
-                return response_model.model_validate_json(content)
-            else:
-                raise ValueError("Failed to get valid response after retries")
+            return response
♻️ Duplicate comments (1)
cognee/infrastructure/llm/structured_output_framework/baml/baml_src/extraction/acreate_structured_output.py (1)

27-32: Missing retry condition parameter and inconsistent backoff timing.

Two issues:

  1. The retry_if_not_exception_type is imported but not used. Per past review feedback, non-retryable errors (like 404 NotFound) should not be retried. Other adapters include retry=retry_if_not_exception_type(...).

  2. wait_exponential_jitter(2, 128) differs from other adapters that use (8, 128). This inconsistency may lead to different retry behavior for BAML vs other LLM calls.

+import litellm
+
 @retry(
     stop=stop_after_delay(128),
-    wait=wait_exponential_jitter(2, 128),
+    wait=wait_exponential_jitter(8, 128),
+    retry=retry_if_not_exception_type(litellm.exceptions.NotFoundError),
     before_sleep=before_sleep_log(logger, logging.DEBUG),
     reraise=True,
 )
🧹 Nitpick comments (6)
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/anthropic/adapter.py (1)

73-86: Rate limiter integration is correct, but prompt formatting has unintended whitespace.

The content string at lines 81-82 includes leading whitespace that will be sent to the API:

"content": f"""Use the given format to extract information
            from the following input: {text_input}. {system_prompt}""",

This will result in extra spaces in the prompt. Consider using a continuation without extra indentation or using string concatenation.

-                        "content": f"""Use the given format to extract information
-                    from the following input: {text_input}. {system_prompt}""",
+                        "content": f"Use the given format to extract information from the following input: {text_input}. {system_prompt}",
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/ollama/adapter.py (3)

145-150: Missing rate limiter wrapper for audio transcription API call.

For consistency with other API calls in this adapter, consider wrapping the transcriptions.create call with the rate limiter context manager.

         async with open_data_file(input_file, mode="rb") as audio_file:
-            transcription = self.aclient.audio.transcriptions.create(
-                model="whisper-1",  # Ensure the correct model for transcription
-                file=audio_file,
-                language="en",
-            )
+            async with llm_rate_limiter_context_manager():
+                transcription = self.aclient.audio.transcriptions.create(
+                    model="whisper-1",  # Ensure the correct model for transcription
+                    file=audio_file,
+                    language="en",
+                )

158-164: Inconsistent retry wait time for transcribe_image.

This method uses wait_exponential_jitter(2, 128) while other methods in this adapter use (8, 128). Consider aligning for consistency.

     @retry(
         stop=stop_after_delay(128),
-        wait=wait_exponential_jitter(2, 128),
+        wait=wait_exponential_jitter(8, 128),
         retry=retry_if_not_exception_type(litellm.exceptions.NotFoundError),
         before_sleep=before_sleep_log(logger, logging.DEBUG),
         reraise=True,
     )

188-203: Missing rate limiter wrapper for image transcription API call.

For consistency, consider wrapping this API call with the rate limiter context manager.

-        response = self.aclient.chat.completions.create(
-            model=self.model,
-            messages=[
-                {
-                    "role": "user",
-                    "content": [
-                        {"type": "text", "text": "What's in this image?"},
-                        {
-                            "type": "image_url",
-                            "image_url": {"url": f"data:image/jpeg;base64,{encoded_image}"},
-                        },
-                    ],
-                }
-            ],
-            max_completion_tokens=300,
-        )
+        async with llm_rate_limiter_context_manager():
+            response = self.aclient.chat.completions.create(
+                model=self.model,
+                messages=[
+                    {
+                        "role": "user",
+                        "content": [
+                            {"type": "text", "text": "What's in this image?"},
+                            {
+                                "type": "image_url",
+                                "image_url": {"url": f"data:image/jpeg;base64,{encoded_image}"},
+                            },
+                        ],
+                    }
+                ],
+                max_completion_tokens=300,
+            )
cognee/infrastructure/databases/vector/embeddings/FastembedEmbeddingEngine.py (1)

100-105: Async context manager wraps synchronous embedding call.

The fastembed.TextEmbedding.embed() method is synchronous, but it's wrapped in an async context manager. While this works for rate limiting purposes, the synchronous call will block the event loop. Consider wrapping in asyncio.to_thread() for non-blocking execution if this becomes a performance bottleneck.
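
A minimal sketch of that suggestion, assuming the engine keeps its fastembed.TextEmbedding instance on an embedding_model attribute (the attribute name is an assumption):

import asyncio

from cognee.shared.rate_limiting import embedding_rate_limiter_context_manager


async def embed_texts_nonblocking(embedding_model, texts: list[str]) -> list[list[float]]:
    """Illustrative body for the engine's embed_text method."""
    async with embedding_rate_limiter_context_manager():
        # fastembed's embed() is synchronous and yields numpy arrays; run it in a
        # worker thread and materialize the vectors there so the event loop stays free.
        return await asyncio.to_thread(
            lambda: [vector.tolist() for vector in embedding_model.embed(texts)]
        )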

cognee/shared/rate_limiting.py (1)

14-30: Add docstrings for public functions.

Per coding guidelines, public functions should have docstrings to document their purpose and behavior.

 def llm_rate_limiter_context_manager():
+    """
+    Return a rate limiter context manager for LLM API calls.
+
+    Returns the configured AsyncLimiter if rate limiting is enabled,
+    otherwise returns a no-op context manager (nullcontext).
+    """
     global llm_rate_limiter
     if llm_config.llm_rate_limit_enabled:
         return llm_rate_limiter
     else:
         #  Return a no-op context manager if rate limiting is disabled
         return nullcontext()


 def embedding_rate_limiter_context_manager():
+    """
+    Return a rate limiter context manager for embedding API calls.
+
+    Returns the configured AsyncLimiter if rate limiting is enabled,
+    otherwise returns a no-op context manager (nullcontext).
+    """
     global embedding_rate_limiter
     if llm_config.embedding_rate_limit_enabled:
         return embedding_rate_limiter
     else:
         #  Return a no-op context manager if rate limiting is disabled
         return nullcontext()
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7360729 and 7deaa6e.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (14)
  • cognee/infrastructure/databases/vector/embeddings/FastembedEmbeddingEngine.py (3 hunks)
  • cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py (2 hunks)
  • cognee/infrastructure/databases/vector/embeddings/OllamaEmbeddingEngine.py (3 hunks)
  • cognee/infrastructure/databases/vector/embeddings/embedding_rate_limiter.py (0 hunks)
  • cognee/infrastructure/llm/structured_output_framework/baml/baml_src/extraction/acreate_structured_output.py (3 hunks)
  • cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/anthropic/adapter.py (3 hunks)
  • cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/gemini/adapter.py (4 hunks)
  • cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/generic_llm_api/adapter.py (4 hunks)
  • cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/mistral/adapter.py (3 hunks)
  • cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/ollama/adapter.py (3 hunks)
  • cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/openai/adapter.py (4 hunks)
  • cognee/shared/rate_limiting.py (1 hunks)
  • cognee/tests/unit/infrastructure/mock_embedding_engine.py (2 hunks)
  • pyproject.toml (1 hunks)
💤 Files with no reviewable changes (1)
  • cognee/infrastructure/databases/vector/embeddings/embedding_rate_limiter.py
🧰 Additional context used
📓 Path-based instructions (5)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Use 4-space indentation in Python code
Use snake_case for Python module and function names
Use PascalCase for Python class names
Use ruff format before committing Python code
Use ruff check for import hygiene and style enforcement with line-length 100 configured in pyproject.toml
Prefer explicit, structured error handling in Python code

Files:

  • cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/ollama/adapter.py
  • cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/gemini/adapter.py
  • cognee/infrastructure/databases/vector/embeddings/OllamaEmbeddingEngine.py
  • cognee/tests/unit/infrastructure/mock_embedding_engine.py
  • cognee/infrastructure/llm/structured_output_framework/baml/baml_src/extraction/acreate_structured_output.py
  • cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/anthropic/adapter.py
  • cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py
  • cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/generic_llm_api/adapter.py
  • cognee/shared/rate_limiting.py
  • cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/mistral/adapter.py
  • cognee/infrastructure/databases/vector/embeddings/FastembedEmbeddingEngine.py
  • cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/openai/adapter.py

⚙️ CodeRabbit configuration file

**/*.py: When reviewing Python code for this project:

  1. Prioritize portability over clarity, especially when dealing with cross-Python compatibility. However, with the priority in mind, do still consider improvements to clarity when relevant.
  2. As a general guideline, consider the code style advocated in the PEP 8 standard (excluding the use of spaces for indentation) and evaluate suggested changes for code style compliance.
  3. As a style convention, consider the code style advocated in CEP-8 and evaluate suggested changes for code style compliance.
  4. As a general guideline, try to provide any relevant, official, and supporting documentation links to any tool's suggestions in review comments. This guideline is important for posterity.
  5. As a general rule, undocumented function definitions and class definitions in the project's Python code are assumed incomplete. Please consider suggesting a short summary of the code for any of these incomplete definitions as docstrings when reviewing.

Files:

  • cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/ollama/adapter.py
  • cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/gemini/adapter.py
  • cognee/infrastructure/databases/vector/embeddings/OllamaEmbeddingEngine.py
  • cognee/tests/unit/infrastructure/mock_embedding_engine.py
  • cognee/infrastructure/llm/structured_output_framework/baml/baml_src/extraction/acreate_structured_output.py
  • cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/anthropic/adapter.py
  • cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py
  • cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/generic_llm_api/adapter.py
  • cognee/shared/rate_limiting.py
  • cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/mistral/adapter.py
  • cognee/infrastructure/databases/vector/embeddings/FastembedEmbeddingEngine.py
  • cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/openai/adapter.py
cognee/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Use shared logging utilities from cognee.shared.logging_utils in Python code

Files:

  • cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/ollama/adapter.py
  • cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/gemini/adapter.py
  • cognee/infrastructure/databases/vector/embeddings/OllamaEmbeddingEngine.py
  • cognee/tests/unit/infrastructure/mock_embedding_engine.py
  • cognee/infrastructure/llm/structured_output_framework/baml/baml_src/extraction/acreate_structured_output.py
  • cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/anthropic/adapter.py
  • cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py
  • cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/generic_llm_api/adapter.py
  • cognee/shared/rate_limiting.py
  • cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/mistral/adapter.py
  • cognee/infrastructure/databases/vector/embeddings/FastembedEmbeddingEngine.py
  • cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/openai/adapter.py
cognee/{modules,infrastructure,tasks}/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Co-locate feature-specific helpers under their respective package (modules/, infrastructure/, or tasks/)

Files:

  • cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/ollama/adapter.py
  • cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/gemini/adapter.py
  • cognee/infrastructure/databases/vector/embeddings/OllamaEmbeddingEngine.py
  • cognee/infrastructure/llm/structured_output_framework/baml/baml_src/extraction/acreate_structured_output.py
  • cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/anthropic/adapter.py
  • cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py
  • cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/generic_llm_api/adapter.py
  • cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/mistral/adapter.py
  • cognee/infrastructure/databases/vector/embeddings/FastembedEmbeddingEngine.py
  • cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/openai/adapter.py
pyproject.toml

📄 CodeRabbit inference engine (AGENTS.md)

Python version requirement: >= 3.10 and < 3.14

Files:

  • pyproject.toml
cognee/tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

cognee/tests/**/*.py: Place Python tests under cognee/tests/ organized by type (unit, integration, cli_tests)
Name Python test files test_*.py and use pytest.mark.asyncio for async tests

Files:

  • cognee/tests/unit/infrastructure/mock_embedding_engine.py
🧠 Learnings (2)
📚 Learning: 2025-11-24T16:45:09.996Z
Learnt from: CR
Repo: topoteretes/cognee PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:45:09.996Z
Learning: Applies to pyproject.toml : Python version requirement: >= 3.10 and < 3.14

Applied to files:

  • pyproject.toml
📚 Learning: 2025-11-24T16:45:09.996Z
Learnt from: CR
Repo: topoteretes/cognee PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:45:09.996Z
Learning: Applies to cognee/**/*.py : Use shared logging utilities from cognee.shared.logging_utils in Python code

Applied to files:

  • cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/openai/adapter.py
🧬 Code graph analysis (11)
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/ollama/adapter.py (2)
cognee/shared/rate_limiting.py (1)
  • llm_rate_limiter_context_manager (15-21)
cognee/infrastructure/databases/graph/neo4j_driver/deadlock_retry.py (1)
  • wait (35-41)
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/gemini/adapter.py (1)
cognee/shared/rate_limiting.py (1)
  • llm_rate_limiter_context_manager (15-21)
cognee/infrastructure/databases/vector/embeddings/OllamaEmbeddingEngine.py (1)
cognee/shared/rate_limiting.py (1)
  • embedding_rate_limiter_context_manager (24-30)
cognee/tests/unit/infrastructure/mock_embedding_engine.py (1)
cognee/shared/rate_limiting.py (1)
  • embedding_rate_limiter_context_manager (24-30)
cognee/infrastructure/llm/structured_output_framework/baml/baml_src/extraction/acreate_structured_output.py (3)
cognee/shared/logging_utils.py (1)
  • get_logger (212-224)
cognee/shared/rate_limiting.py (1)
  • llm_rate_limiter_context_manager (15-21)
cognee/infrastructure/databases/graph/neo4j_driver/deadlock_retry.py (1)
  • wait (35-41)
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/anthropic/adapter.py (1)
cognee/shared/rate_limiting.py (1)
  • llm_rate_limiter_context_manager (15-21)
cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py (1)
cognee/shared/rate_limiting.py (1)
  • embedding_rate_limiter_context_manager (24-30)
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/generic_llm_api/adapter.py (1)
cognee/shared/rate_limiting.py (1)
  • llm_rate_limiter_context_manager (15-21)
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/mistral/adapter.py (1)
cognee/shared/rate_limiting.py (1)
  • llm_rate_limiter_context_manager (15-21)
cognee/infrastructure/databases/vector/embeddings/FastembedEmbeddingEngine.py (1)
cognee/shared/rate_limiting.py (1)
  • embedding_rate_limiter_context_manager (24-30)
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/openai/adapter.py (2)
cognee/shared/rate_limiting.py (1)
  • llm_rate_limiter_context_manager (15-21)
cognee/infrastructure/databases/graph/neo4j_driver/deadlock_retry.py (1)
  • wait (35-41)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (19)
  • GitHub Check: End-to-End Tests / Test Entity Extraction
  • GitHub Check: End-to-End Tests / Test multi tenancy with different situations in Cognee
  • GitHub Check: End-to-End Tests / Conversation sessions test (FS)
  • GitHub Check: End-to-End Tests / Test permissions with different situations in Cognee
  • GitHub Check: End-to-End Tests / S3 Bucket Test
  • GitHub Check: End-to-End Tests / Concurrent Subprocess access test
  • GitHub Check: End-to-End Tests / Conversation sessions test (Redis)
  • GitHub Check: End-to-End Tests / Test Feedback Enrichment
  • GitHub Check: Basic Tests / Run Simple Examples
  • GitHub Check: End-to-End Tests / Test graph edge ingestion
  • GitHub Check: Basic Tests / Run Integration Tests
  • GitHub Check: End-to-End Tests / Server Start Test
  • GitHub Check: Basic Tests / Run Linting
  • GitHub Check: End-to-End Tests / Run Telemetry Test
  • GitHub Check: End-to-End Tests / Deduplication Test
  • GitHub Check: End-to-End Tests / Run Telemetry Pipeline Test
  • GitHub Check: CLI Tests / CLI Integration Tests
  • GitHub Check: CLI Tests / CLI Unit Tests
  • GitHub Check: CLI Tests / CLI Functionality Tests
🔇 Additional comments (17)
cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py (3)

28-28: Import addition is correct.

The import follows the project's module structure and uses the shared rate limiting utilities as per the coding guidelines.


113-122: Rate limiter integration and response handling look correct.

The async with pattern correctly wraps the API call. The change from response["data"] to response.data is correct since litellm.aembedding returns an EmbeddingResponse object with a data attribute, not a dictionary.


82-88: Inconsistent retry wait time compared to LLM adapters.

The embedding engine uses wait_exponential_jitter(2, 128) while the LLM adapters in this PR were updated to use wait_exponential_jitter(8, 128). Verify if this difference is intentional (different rate limit characteristics for embeddings vs LLM calls) or if it should be aligned for consistency.

cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/anthropic/adapter.py (2)

18-18: Import addition is correct.

Follows project conventions for shared utilities.


47-53: Retry configuration changes are appropriate.

Increasing the initial wait time from 2 to 8 seconds provides better backoff behavior when dealing with rate-limited APIs.

cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/ollama/adapter.py (2)

14-15: Import is correct.

The import follows project conventions. The extra blank line is minor.


100-117: Rate limiter integration is correct.

The async with pattern properly wraps the API call, and the retry configuration aligns with other adapters.

cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/generic_llm_api/adapter.py (4)

16-16: Import addition is correct.

Follows project conventions for shared utilities.


75-81: Retry configuration is appropriate.

The increased initial wait time and consistent configuration with other adapters is good.


109-126: Primary API call rate limiting is correctly implemented.

The rate limiter context manager properly wraps the API call, and the reduced max_retries=2 works well with the external retry decorator.


144-161: Fallback path also includes rate limiting - good implementation.

Applying rate limiting to both the primary and fallback API calls ensures consistent behavior and prevents rate limit issues when falling back to an alternative model.

pyproject.toml (1)

62-62: Dependency addition is appropriate.

The aiolimiter>=1.2.1 constraint is compatible with the project's Python requirement (>=3.10,<3.14), as aiolimiter supports >=3.8. Version 1.2.1 is the latest available (December 2024), and the >= constraint allows for future patch and minor updates within the 1.x series. The library is a well-established asyncio rate limiter implementation suitable for the PR's purpose.
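
For reference, a standalone illustration of how aiolimiter behaves (independent of cognee): AsyncLimiter(max_rate, time_period) allows at most max_rate acquisitions per time_period seconds, and async with blocks until capacity frees up.

import asyncio

from aiolimiter import AsyncLimiter


async def main():
    limiter = AsyncLimiter(max_rate=5, time_period=1)  # at most 5 entries per second

    async def worker(i: int):
        async with limiter:  # waits here once the per-second budget is spent
            print(f"request {i} allowed")

    await asyncio.gather(*(worker(i) for i in range(20)))


asyncio.run(main())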

cognee/tests/unit/infrastructure/mock_embedding_engine.py (1)

50-51: LGTM!

The mock correctly integrates the rate limiter context manager, allowing tests to verify rate limiting behavior when needed.

cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/gemini/adapter.py (1)

109-127: LGTM!

The rate limiter context manager is correctly applied to both primary and fallback API calls, with consistent retry configuration (max_retries=2).

cognee/infrastructure/databases/vector/embeddings/OllamaEmbeddingEngine.py (1)

120-128: LGTM!

The rate limiter correctly wraps the HTTP request to the Ollama endpoint while preserving both response format handlers (embeddings and data).

cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/openai/adapter.py (2)

139-157: LGTM for async rate limiting implementation.

The rate limiter is correctly applied to the primary async API call path.


199-248: Synchronous method missing rate limiting.

The create_structured_output sync method doesn't use rate limiting, unlike its async counterpart acreate_structured_output. This creates inconsistent behavior between sync and async code paths.

Note: AsyncLimiter from aiolimiter is async-only, so a sync rate limiter would need a different approach (e.g., limits library or a sync wrapper).

Is rate limiting for the sync method intentionally omitted, or should a sync-compatible rate limiter be added for consistency?
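
If sync-path parity were ever wanted, one option (a sketch only, assuming no extra dependency is desired) is a small thread-safe sliding-window limiter exposed as a regular context manager:

import threading
import time
from contextlib import contextmanager


class SyncRateLimiter:
    """Allow at most max_rate entries per time_period seconds, blocking callers that would exceed it."""

    def __init__(self, max_rate: int, time_period: float = 60.0):
        self.max_rate = max_rate
        self.time_period = time_period
        self._timestamps: list[float] = []
        self._lock = threading.Lock()

    @contextmanager
    def limit(self):
        while True:
            with self._lock:
                now = time.monotonic()
                # Drop entries that have aged out of the window.
                self._timestamps = [t for t in self._timestamps if now - t < self.time_period]
                if len(self._timestamps) < self.max_rate:
                    self._timestamps.append(now)
                    break
                sleep_for = self.time_period - (now - self._timestamps[0])
            time.sleep(max(sleep_for, 0.01))  # sleep outside the lock, then re-check
        yield


# Usage: limiter = SyncRateLimiter(60); then `with limiter.limit(): client.create(...)`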

Comment on lines +180 to +181
# api_base=self.fallback_endpoint,
response_model=response_model,
Contributor


⚠️ Potential issue | 🟡 Minor

Commented-out api_base may cause fallback to use wrong endpoint.

The api_base=self.fallback_endpoint is commented out. If fallback_endpoint is configured, the fallback call will use the default endpoint instead of the intended fallback endpoint.

                         api_key=self.fallback_api_key,
-                        # api_base=self.fallback_endpoint,
+                        api_base=self.fallback_endpoint,
                         response_model=response_model,
🤖 Prompt for AI Agents
In
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/openai/adapter.py
around lines 180 to 181, the call that should use the configured fallback
endpoint has api_base commented out, so fallback requests will go to the default
endpoint; restore/pass api_base=self.fallback_endpoint (or conditionally include
it only when fallback_endpoint is set) in that call so the fallback request uses
the intended endpoint, keeping response_model and other args intact.

Comment on lines +1 to +12
from aiolimiter import AsyncLimiter
from contextlib import nullcontext
from cognee.infrastructure.llm.config import get_llm_config

llm_config = get_llm_config()

llm_rate_limiter = AsyncLimiter(
llm_config.llm_rate_limit_requests, llm_config.embedding_rate_limit_interval
)
embedding_rate_limiter = AsyncLimiter(
llm_config.embedding_rate_limit_requests, llm_config.embedding_rate_limit_interval
)
Contributor


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, check if llm_rate_limit_interval exists in the config
rg -n "llm_rate_limit_interval" --type py -C2

Repository: topoteretes/cognee

Length of output: 5695


🏁 Script executed:

# Also check for llm_rate_limit_requests to understand the config parameters
rg -n "llm_rate_limit_requests|embedding_rate_limit" --type py -C2

Repository: topoteretes/cognee

Length of output: 15422


🏁 Script executed:

# Let's look at the get_llm_config function to understand how config is initialized
rg -n "def get_llm_config" --type py -A 10

Repository: topoteretes/cognee

Length of output: 946


🏁 Script executed:

# Check how the rate limiters are being used throughout the codebase
rg -n "llm_rate_limiter|embedding_rate_limiter" --type py -B2 -A2

Repository: topoteretes/cognee

Length of output: 28657


🏁 Script executed:

# Read the complete cognee/shared/rate_limiting.py file
cat -n cognee/shared/rate_limiting.py

Repository: topoteretes/cognee

Length of output: 1210


🏁 Script executed:

# Check if there are actual runtime config changes in tests or production code
rg -n "get_llm_config\.cache_clear|_instance.*None" --type py -B3 -A3

Repository: topoteretes/cognee

Length of output: 5713


🏁 Script executed:

# Look at the get_llm_config implementation more carefully
cat -n cognee/infrastructure/llm/config.py | head -230 | tail -30

Repository: topoteretes/cognee

Length of output: 1464


🏁 Script executed:

# Check if the rate limiters from cognee/shared/rate_limiting.py are actually used or if they're replaced by the singleton
rg "from cognee.shared.rate_limiting import llm_rate_limiter[^_]" --type py

Repository: topoteretes/cognee

Length of output: 44


🏁 Script executed:

# Check what the correct parameters should be for AsyncLimiter
rg -n "AsyncLimiter" --type py -B2 -A2

Repository: topoteretes/cognee

Length of output: 836


🏁 Script executed:

# Verify the parameter names and their values in config
rg -n "llm_rate_limit_interval|embedding_rate_limit_interval" cognee/infrastructure/llm/config.py | head -20

Repository: topoteretes/cognee

Length of output: 475


Fix parameter bug and address runtime config initialization limitations.

Line 8 has a copy-paste error: llm_rate_limiter uses llm_config.embedding_rate_limit_interval instead of llm_config.llm_rate_limit_interval. While both default to 60 seconds, independent configuration is not possible since they're swapped.

Additionally, module-level initialization with cached get_llm_config() prevents runtime reconfiguration. The rate limiters are frozen at import time and cannot adapt if configuration changes. Consider lazy initialization or a factory pattern if runtime reconfiguration is needed, particularly given the codebase pattern of explicitly clearing caches in tests (get_llm_config.cache_clear()).

🤖 Prompt for AI Agents
In cognee/shared/rate_limiting.py around lines 1 to 12, there is a parameter bug
and an initialization design issue: llm_rate_limiter is incorrectly created with
llm_config.embedding_rate_limit_interval (swap/copy-paste) and both limiters are
instantiated at module import time, preventing runtime reconfiguration. Fix by
using llm_config.llm_rate_limit_interval for the llm_rate_limiter, and refactor
to lazily create or provide a factory function (e.g., get_llm_rate_limiter() and
get_embedding_rate_limiter()) that calls get_llm_config() at call time (or
checks/clears cache as needed) so tests and runtime can change config after
import; ensure any existing callers use the new accessors or the factory to
obtain fresh AsyncLimiter instances.
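
A possible shape for the lazy/factory variant described in the prompt above, shown only as a sketch (the config field names come from this thread; the merged code may differ):

from contextlib import nullcontext

from aiolimiter import AsyncLimiter

from cognee.infrastructure.llm.config import get_llm_config

_llm_limiter: AsyncLimiter | None = None


def llm_rate_limiter_context_manager():
    """Return the shared LLM AsyncLimiter, or a no-op context if rate limiting is disabled."""
    global _llm_limiter
    config = get_llm_config()  # read (possibly cache-cleared) config at call time
    if not config.llm_rate_limit_enabled:
        return nullcontext()
    if _llm_limiter is None:
        _llm_limiter = AsyncLimiter(
            config.llm_rate_limit_requests, config.llm_rate_limit_interval
        )
    return _llm_limiter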

Comment on lines +7 to +9
llm_rate_limiter = AsyncLimiter(
llm_config.llm_rate_limit_requests, llm_config.embedding_rate_limit_interval
)
Contributor


⚠️ Potential issue | 🔴 Critical

Bug: LLM rate limiter uses embedding interval instead of LLM interval.

The llm_rate_limiter is configured with embedding_rate_limit_interval instead of llm_rate_limit_interval. This means the LLM rate limiter will use the wrong time window.

 llm_rate_limiter = AsyncLimiter(
-    llm_config.llm_rate_limit_requests, llm_config.embedding_rate_limit_interval
+    llm_config.llm_rate_limit_requests, llm_config.llm_rate_limit_interval
 )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
llm_rate_limiter = AsyncLimiter(
llm_config.llm_rate_limit_requests, llm_config.embedding_rate_limit_interval
)
llm_rate_limiter = AsyncLimiter(
llm_config.llm_rate_limit_requests, llm_config.llm_rate_limit_interval
)
🤖 Prompt for AI Agents
In cognee/shared/rate_limiting.py around lines 7 to 9, the llm_rate_limiter is
being constructed with llm_config.embedding_rate_limit_interval instead of the
LLM-specific interval; change the second argument to
llm_config.llm_rate_limit_interval so the AsyncLimiter uses the correct time
window for LLM requests.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cognee/infrastructure/llm/structured_output_framework/baml/baml_src/extraction/acreate_structured_output.py (1)

36-55: Enhance docstring to document retry and rate-limiting behavior.

Per coding guidelines, the docstring should document retry/rate-limiting behavior so users understand potential delays and exceptions. Add a section describing:

  • Retry behavior (up to 128s with exponential backoff)
  • Rate limiting (when enabled via config)
  • Exceptions that may be raised after retries are exhausted

Apply this enhancement:

     """
     Generate a response from a user query.
 
     This method asynchronously creates structured output by sending a request through BAML
     using the provided parameters to generate a completion based on the user input and
     system prompt.
+    
+    Retry Behavior:
+    ---------------
+    This function automatically retries on transient failures with exponential backoff,
+    up to a maximum of 128 seconds. Rate limiting is applied when enabled via configuration.
 
     Parameters:
     -----------
-
         - text_input (str): The input text provided by the user for generating a response.
         - system_prompt (str): The system's prompt to guide the model's response.
         - response_model (Type[BaseModel]): The expected model type for the response.
 
     Returns:
     --------
-
         - BaseModel: A structured output generated by the model, returned as an instance of
           BaseModel.
+    
+    Raises:
+    -------
+        - Exception: Re-raises the last exception after all retry attempts are exhausted.
     """

As per coding guidelines (guideline 5), documenting behavior comprehensively is essential.

🧹 Nitpick comments (1)
cognee/infrastructure/llm/structured_output_framework/baml/baml_src/extraction/acreate_structured_output.py (1)

27-32: Document retry behavior and production alternatives.

The 128-second retry window with 8-128s exponential backoff is appropriate for cloud API rate limits but won't help overwhelmed local LLMs (as noted in PR discussion). Since this is a "fail-safe for beginner developers" rather than a production solution, ensure documentation clearly states:

  1. This is for single-user, non-production scenarios
  2. Production deployments should use LiteLLM proxy or similar rate-limiting infrastructure
  3. Each request retries independently—parallel requests don't coordinate

Based on PR discussion feedback, do you want me to open an issue to add Mintlify documentation explaining production best practices for rate limiting?

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7deaa6e and a66b2ce.

📒 Files selected for processing (1)
  • cognee/infrastructure/llm/structured_output_framework/baml/baml_src/extraction/acreate_structured_output.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Use 4-space indentation in Python code
Use snake_case for Python module and function names
Use PascalCase for Python class names
Use ruff format before committing Python code
Use ruff check for import hygiene and style enforcement with line-length 100 configured in pyproject.toml
Prefer explicit, structured error handling in Python code

Files:

  • cognee/infrastructure/llm/structured_output_framework/baml/baml_src/extraction/acreate_structured_output.py

⚙️ CodeRabbit configuration file

**/*.py: When reviewing Python code for this project:

  1. Prioritize portability over clarity, especially when dealing with cross-Python compatibility. However, with the priority in mind, do still consider improvements to clarity when relevant.
  2. As a general guideline, consider the code style advocated in the PEP 8 standard (excluding the use of spaces for indentation) and evaluate suggested changes for code style compliance.
  3. As a style convention, consider the code style advocated in CEP-8 and evaluate suggested changes for code style compliance.
  4. As a general guideline, try to provide any relevant, official, and supporting documentation links to any tool's suggestions in review comments. This guideline is important for posterity.
  5. As a general rule, undocumented function definitions and class definitions in the project's Python code are assumed incomplete. Please consider suggesting a short summary of the code for any of these incomplete definitions as docstrings when reviewing.

Files:

  • cognee/infrastructure/llm/structured_output_framework/baml/baml_src/extraction/acreate_structured_output.py
cognee/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Use shared logging utilities from cognee.shared.logging_utils in Python code

Files:

  • cognee/infrastructure/llm/structured_output_framework/baml/baml_src/extraction/acreate_structured_output.py
cognee/{modules,infrastructure,tasks}/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Co-locate feature-specific helpers under their respective package (modules/, infrastructure/, or tasks/)

Files:

  • cognee/infrastructure/llm/structured_output_framework/baml/baml_src/extraction/acreate_structured_output.py
🧬 Code graph analysis (1)
cognee/infrastructure/llm/structured_output_framework/baml/baml_src/extraction/acreate_structured_output.py (5)
cognee/shared/logging_utils.py (1)
  • get_logger (212-224)
cognee/shared/rate_limiting.py (1)
  • llm_rate_limiter_context_manager (15-21)
cognee/infrastructure/databases/graph/neo4j_driver/deadlock_retry.py (1)
  • wait (35-41)
cognee/infrastructure/llm/structured_output_framework/baml/baml_client/parser.py (2)
  • AcreateStructuredOutput (26-34)
  • AcreateStructuredOutput (43-51)
cognee/infrastructure/llm/structured_output_framework/baml/baml_client/sync_client.py (4)
  • AcreateStructuredOutput (95-118)
  • AcreateStructuredOutput (127-149)
  • AcreateStructuredOutput (158-172)
  • AcreateStructuredOutput (181-195)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (14)
  • GitHub Check: End-to-End Tests / Concurrent Subprocess access test
  • GitHub Check: End-to-End Tests / Test permissions with different situations in Cognee
  • GitHub Check: End-to-End Tests / S3 Bucket Test
  • GitHub Check: End-to-End Tests / Conversation sessions test (FS)
  • GitHub Check: End-to-End Tests / Test Entity Extraction
  • GitHub Check: End-to-End Tests / Run Telemetry Pipeline Test
  • GitHub Check: End-to-End Tests / Test multi tenancy with different situations in Cognee
  • GitHub Check: End-to-End Tests / Conversation sessions test (Redis)
  • GitHub Check: End-to-End Tests / Test Feedback Enrichment
  • GitHub Check: End-to-End Tests / Server Start Test
  • GitHub Check: Basic Tests / Run Formatting Check
  • GitHub Check: Basic Tests / Run Integration Tests
  • GitHub Check: CLI Tests / CLI Integration Tests
  • GitHub Check: CLI Tests / CLI Functionality Tests
🔇 Additional comments (1)
cognee/infrastructure/llm/structured_output_framework/baml/baml_src/extraction/acreate_structured_output.py (1)

62-67: LGTM: Two-level rate protection is appropriate.

The combination of rate limiting (context manager) and retry logic (decorator) provides defense in depth—the rate limiter prevents overwhelming the API, while the retry decorator handles transient failures. When rate limiting is disabled via config, the retry logic serves as a fallback.

retry,
stop_after_delay,
wait_exponential_jitter,
retry_if_not_exception_type,
Contributor


⚠️ Potential issue | 🔴 Critical

Critical: retry_if_not_exception_type imported but not used in decorator.

Line 8 imports retry_if_not_exception_type, but it's not applied in the @retry decorator (lines 27-32). A past review comment indicated this was addressed in commits 72c20c2 to 362aa8d, but the current code doesn't include it. Without this parameter, the function will retry all exceptions—including non-retryable errors like NotFoundError (404), validation errors, and authentication failures—causing users to wait up to 128 seconds for errors that should fail immediately.

Apply this diff to add the missing retry condition:

 @retry(
     stop=stop_after_delay(128),
     wait=wait_exponential_jitter(8, 128),
+    retry=retry_if_not_exception_type(litellm.exceptions.NotFoundError),
     before_sleep=before_sleep_log(logger, logging.DEBUG),
     reraise=True,
 )

Ensure litellm.exceptions is imported at the top of the file:

import litellm.exceptions

Also applies to: 27-32

🤖 Prompt for AI Agents
In
cognee/infrastructure/llm/structured_output_framework/baml/baml_src/extraction/acreate_structured_output.py
around lines 8 and 27-32, add an import for litellm.exceptions at the top and
update the @retry decorator to use retry_if_not_exception_type so non-retryable
exceptions (e.g. litellm.exceptions.NotFoundError,
litellm.exceptions.ValidationError, litellm.exceptions.AuthenticationError) are
not retried. Concretely: add "import litellm.exceptions" near the other imports,
and modify the existing @retry(...) call to include
retry=retry_if_not_exception_type((litellm.exceptions.NotFoundError,
litellm.exceptions.ValidationError, litellm.exceptions.AuthenticationError)))
while keeping the current stop, wait, and retry_error_callback settings.

@Vasilije1990 Vasilije1990 merged commit 52b0029 into dev Dec 8, 2025
106 of 108 checks passed
@Vasilije1990 Vasilije1990 deleted the baml-rate-limit-handling branch December 8, 2025 04:07
