fix: Resolve issue with BAML rate limit handling #1813
Conversation
Please make sure all the checkboxes are checked:
Important: Review skipped. Review was skipped due to path filters.

⛔ Files ignored due to path filters (1)

CodeRabbit blocks several paths by default. You can override this behavior by explicitly including those paths in the path filters. You can disable this status message by setting the corresponding option in your CodeRabbit configuration.

Walkthrough: Adds a shared aiolimiter-based rate-limiting module and migrates LLM and embedding call sites from a decorator-based limiter to context-manager usage; increases initial retry backoff, reduces retry counts across adapters/engines, removes the legacy embedding_rate_limiter module, and adds aiolimiter to dependencies.
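For orientation, here is a minimal sketch of the context-manager pattern described in the walkthrough, using aiolimiter directly. The `call_llm` function and the 60-requests-per-minute figure are illustrative placeholders, not values from this PR:

```python
import asyncio

from aiolimiter import AsyncLimiter

# Shared limiter: at most 60 acquisitions per 60-second window (illustrative numbers).
limiter = AsyncLimiter(60, 60)


async def call_llm(prompt: str) -> str:
    # Each call waits for a free slot before doing any work.
    async with limiter:
        await asyncio.sleep(0.1)  # stand-in for the real API request
        return f"response to: {prompt}"


async def main():
    print(await asyncio.gather(*(call_llm(f"q{i}") for i in range(5))))


asyncio.run(main())
```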
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Possibly related PRs
Suggested reviewers
Pre-merge checks and finishing touches: ❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
cognee/infrastructure/llm/structured_output_framework/baml/baml_src/extraction/acreate_structured_output.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
cognee/infrastructure/llm/structured_output_framework/baml/baml_src/extraction/acreate_structured_output.py (1)
cognee/shared/logging_utils.py (1)
get_logger(212-224)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
- GitHub Check: End-to-End Tests / Deduplication Test
- GitHub Check: CLI Tests / CLI Integration Tests
- GitHub Check: CLI Tests / CLI Functionality Tests
- GitHub Check: End-to-End Tests / Test Entity Extraction
- GitHub Check: End-to-End Tests / Run Telemetry Pipeline Test
- GitHub Check: End-to-End Tests / Server Start Test
- GitHub Check: End-to-End Tests / Test Feedback Enrichment
- GitHub Check: End-to-End Tests / Conversation sessions test
- GitHub Check: End-to-End Tests / Test permissions with different situations in Cognee
- GitHub Check: End-to-End Tests / Concurrent Subprocess access test
- GitHub Check: End-to-End Tests / Test graph edge ingestion
- GitHub Check: End-to-End Tests / Test using different async databases in parallel in Cognee
- GitHub Check: Basic Tests / Run Basic Graph Tests
- GitHub Check: End-to-End Tests / S3 Bucket Test
- GitHub Check: Basic Tests / Run Simple Examples BAML
- GitHub Check: Basic Tests / Run Simple Examples
- GitHub Check: Basic Tests / Run Linting
- GitHub Check: Basic Tests / Run Unit Tests
- GitHub Check: Basic Tests / Run Formatting Check
- GitHub Check: Basic Tests / Run Integration Tests
🔇 Additional comments (4)
cognee/infrastructure/llm/structured_output_framework/baml/baml_src/extraction/acreate_structured_output.py (4)
12-12: LGTM: Logging setup is correct. The logger is properly initialized using the project's logging utility and will log retry attempts at DEBUG level.
Also applies to: 21-23
3-3: LGTM: BaseModel import is necessary. This import is correctly added to support the type hint on line 33.
32-71: Function implementation looks correct. The core logic remains unchanged. The retry decorator properly wraps the entire async function, which is the correct approach for handling rate limits at the API call level.
4-10: Remove unused import and add missing retry condition to decorator. The review comment correctly identifies two real issues:

- `retry_if_not_exception_type` is imported on line 8 but never used
- The `@retry` decorator (lines 26-31) lacks a `retry=` parameter, causing it to retry on ALL exceptions rather than just transient/rate-limit errors

However, the exact exception class that BAML raises for rate limits cannot be determined from the codebase alone, as `baml_py` is an external library. The original review's suggestion to search the codebase also returned no results.

To complete this fix:
- Remove the unused `retry_if_not_exception_type` import or use it in the decorator
- Add a `retry=` parameter specifying the appropriate BAML exception (requires checking baml-py documentation or testing to identify the correct exception class)
- Verify the exception handling matches BAML's actual behavior
...ucture/llm/structured_output_framework/baml/baml_src/extraction/acreate_structured_output.py
```python
logger = get_logger()


@retry(
```
I have a couple of concerns about this approach. It feels like we are starting to add LLM-specific workarounds into our codebase and I’m not sure that’s ideal.
Are we sure exponential retries make sense for all cases with structured output? For example, if we send 1000 requests in one second - would all the failed ones get retried?
How do we know this retry strategy will work well with other LLMs?
It might be worth looking at other ways to handle rate limits too, like a reverse proxy or a message queue.
Yes, all failed requests would get retried if 1000 different backend requests were made, but for a production scenario we should use a proxy or a similar mechanism rather than rely on this approach. This is just a last resort for a single user in a non-production scenario who doesn't need anything advanced, and it will work fine in that case.
By the way, the wait behavior can be changed: we can provide any function to handle the waiting, so we could make it work in the 1000-different-requests scenario with exponential backoff (for example, if our wait function coordinated through a global variable). It might make sense to update the wait to handle that scenario as well.
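As a rough illustration of the idea above (supplying a custom wait callable to tenacity that consults shared state so concurrent failures back off together), here is a minimal sketch; `SharedBackoff` and `call_backend` are hypothetical names, not code from this PR:

```python
import random

from tenacity import retry, stop_after_delay


class SharedBackoff:
    """Wait strategy whose delay grows with the number of retries seen globally."""

    def __init__(self, base: float = 2.0, cap: float = 128.0):
        self.base = base
        self.cap = cap
        self.global_retry_count = 0  # shared by every call using this instance

    def __call__(self, retry_state) -> float:
        # tenacity passes the current RetryCallState and expects seconds to sleep.
        self.global_retry_count += 1
        delay = min(self.cap, self.base * (2 ** retry_state.attempt_number))
        # Back off harder when many retries are happening across all requests.
        delay = min(self.cap, delay * max(1, self.global_retry_count))
        # A real implementation would also decay the counter after successes.
        return delay + random.uniform(0, 1)


shared_wait = SharedBackoff()


@retry(stop=stop_after_delay(128), wait=shared_wait, reraise=True)
def call_backend():
    # Placeholder for the real LLM request; fails randomly to exercise the retry path.
    if random.random() < 0.5:
        raise RuntimeError("rate limited")
    return "ok"


print(call_backend())
```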
@dexters1 please update it based on Pavel's feedback.
I am worried that if we don't document the options, people will use this in production. Maybe add a ticket to cover and suggest the best ways to run it in prod in Mintlify, and pass it to Lazar for review.
@Vasilije1990 We proposed an alternative way. Also, @dexters1 recommends LiteLLM as a proxy. Let's see if it solves the issue and then decide the future of this PR. Perhaps we can avoid the rate limiter on our side.
I would merge this PR into the dev branch as a fail-safe for beginner developers. It would be good to expand the wait function to work globally as well when we get the time.
I'm a bit concerned. We introduce exponential retries and expect them to solve the problem of the rate limiter. The initial problem is that local LLMs can't survive under a high RPM, yet we expect them to recover by introducing retries (i.e., sending additional requests).
@pazone Should we close this one as the issue is also closed?
We need this mechanism as well for casual users, to handle the case where a rate limit is hit on the Cognee side. We should enhance it to work at a global request level so we don't spam too many parallel requests that can't be handled, and we can add optional RPM support. This can be achieved by specifying our own wait function in the tenacity wait criteria. The reason an issue was raised about this only for BAML is that we don't have any retry mechanism on the BAML side, while we already have one on the embedding and instructor side.
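A sketch of what the optional global RPM control mentioned here could look like, with a single process-wide aiolimiter shared by all BAML calls; `MAX_REQUESTS_PER_MINUTE` and `call_with_global_rpm_limit` are hypothetical names, not part of this PR:

```python
from aiolimiter import AsyncLimiter

# Hypothetical process-wide cap: at most 30 LLM requests per minute.
MAX_REQUESTS_PER_MINUTE = 30
global_llm_limiter = AsyncLimiter(MAX_REQUESTS_PER_MINUTE, 60)


async def call_with_global_rpm_limit(make_request, *args, **kwargs):
    """Run an async LLM call only after acquiring a slot from the shared limiter."""
    async with global_llm_limiter:
        return await make_request(*args, **kwargs)
```

Because the limiter is shared, parallel callers queue up instead of all hitting the provider at once.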
| GitGuardian id | GitGuardian status | Secret | Commit | Filename | |
|---|---|---|---|---|---|
| 9573981 | Triggered | Generic Password | 5d0586d | .github/workflows/e2e_tests.yml | View secret |
🛠 Guidelines to remediate hardcoded secrets
- Understand the implications of revoking this secret by investigating where it is used in your code.
- Replace and store your secret safely. Learn the best practices here.
- Revoke and rotate this secret.
- If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.
To avoid such incidents in the future, consider:
- following these best practices for managing and storing secrets, including API keys and other credentials
- installing secret detection on pre-commit to catch secrets before they leave your machine and ease remediation.
🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.
Bringing more details to the PR.

What problem is attempted to be solved: If a user runs an LLM locally with limited resources, it might not handle the default request rate and eventually start responding with 500s or "too many requests" errors, depending on the implementation.

What approaches have been discussed

What does this PR do: The PR introduces retries with exponential backoff, retrying failed requests to the LLM with increasing delays. Adding @borisarzentar and @hajdul88 to review.
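For readers unfamiliar with tenacity, here is a minimal sketch of the retry mechanism being added; the decorator mirrors the settings shown in the review diffs below, and `fetch_completion` is a placeholder, not the PR's actual function:

```python
import logging

from tenacity import (
    retry,
    stop_after_delay,
    wait_exponential_jitter,
    before_sleep_log,
)

from cognee.shared.logging_utils import get_logger

logger = get_logger()


@retry(
    stop=stop_after_delay(128),  # give up once 128 seconds have elapsed overall
    wait=wait_exponential_jitter(8, 128),  # start around 8s, cap at 128s, add jitter
    before_sleep=before_sleep_log(logger, logging.DEBUG),  # log each retry at DEBUG
    reraise=True,  # surface the final exception to the caller
)
async def fetch_completion(prompt: str):
    # Placeholder for the real BAML / LLM request.
    ...
```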
It's not necessarily for when local LLMs die. Exponential retries are not additional functionality; it's something we already have and use for instructor, where it handles OpenAI, local model, and other cloud model calls. This PR just introduces the same mechanism on the BAML side, where it was missing. It's mostly here to handle the case where a rate limit is hit when using BAML with OpenAI and other cloud providers, since otherwise reaching the rate limit would break Cognee. It might also help with local models, but no one has complained about local LLMs dying when using instructor, so that's just a minor part of the mechanism. For the local model use case I will add optional simple RPM control on the Cognee side.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/mistral/adapter.py (1)
101-113: Incorrect response handling for instructor client. The `instructor.from_litellm` client returns the parsed Pydantic model directly, not a raw API response with `choices`. The current code checks `response.choices` and attempts `model_validate_json`, which will likely fail because `response` is already the validated `response_model` instance.

```diff
 async with llm_rate_limiter_context_manager():
     response = await self.aclient.chat.completions.create(
         model=self.model,
         max_tokens=self.max_completion_tokens,
         max_retries=2,
         messages=messages,
         response_model=response_model,
     )
-if response.choices and response.choices[0].message.content:
-    content = response.choices[0].message.content
-    return response_model.model_validate_json(content)
-else:
-    raise ValueError("Failed to get valid response after retries")
+return response
```
♻️ Duplicate comments (1)
cognee/infrastructure/llm/structured_output_framework/baml/baml_src/extraction/acreate_structured_output.py (1)
27-32: Missing `retry` condition parameter and inconsistent backoff timing. Two issues:

- The `retry_if_not_exception_type` is imported but not used. Per past review feedback, non-retryable errors (like 404 NotFound) should not be retried. Other adapters include `retry=retry_if_not_exception_type(...)`.
- `wait_exponential_jitter(2, 128)` differs from other adapters that use `(8, 128)`. This inconsistency may lead to different retry behavior for BAML vs other LLM calls.

```diff
+import litellm
+
 @retry(
     stop=stop_after_delay(128),
-    wait=wait_exponential_jitter(2, 128),
+    wait=wait_exponential_jitter(8, 128),
+    retry=retry_if_not_exception_type(litellm.exceptions.NotFoundError),
     before_sleep=before_sleep_log(logger, logging.DEBUG),
     reraise=True,
 )
```
🧹 Nitpick comments (6)
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/anthropic/adapter.py (1)
73-86: Rate limiter integration is correct, but prompt formatting has unintended whitespace. The content string at lines 81-82 includes leading whitespace that will be sent to the API:
"content": f"""Use the given format to extract information from the following input: {text_input}. {system_prompt}""",This will result in extra spaces in the prompt. Consider using a continuation without extra indentation or using string concatenation.
- "content": f"""Use the given format to extract information - from the following input: {text_input}. {system_prompt}""", + "content": f"Use the given format to extract information from the following input: {text_input}. {system_prompt}",cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/ollama/adapter.py (3)
145-150: Missing rate limiter wrapper for audio transcription API call. For consistency with other API calls in this adapter, consider wrapping the `transcriptions.create` call with the rate limiter context manager.

```diff
 async with open_data_file(input_file, mode="rb") as audio_file:
-    transcription = self.aclient.audio.transcriptions.create(
-        model="whisper-1",  # Ensure the correct model for transcription
-        file=audio_file,
-        language="en",
-    )
+    async with llm_rate_limiter_context_manager():
+        transcription = self.aclient.audio.transcriptions.create(
+            model="whisper-1",  # Ensure the correct model for transcription
+            file=audio_file,
+            language="en",
+        )
```
158-164: Inconsistent retry wait time for `transcribe_image`. This method uses `wait_exponential_jitter(2, 128)` while other methods in this adapter use `(8, 128)`. Consider aligning for consistency.

```diff
 @retry(
     stop=stop_after_delay(128),
-    wait=wait_exponential_jitter(2, 128),
+    wait=wait_exponential_jitter(8, 128),
     retry=retry_if_not_exception_type(litellm.exceptions.NotFoundError),
     before_sleep=before_sleep_log(logger, logging.DEBUG),
     reraise=True,
 )
```
188-203: Missing rate limiter wrapper for image transcription API call. For consistency, consider wrapping this API call with the rate limiter context manager.

```diff
-response = self.aclient.chat.completions.create(
-    model=self.model,
-    messages=[
-        {
-            "role": "user",
-            "content": [
-                {"type": "text", "text": "What's in this image?"},
-                {
-                    "type": "image_url",
-                    "image_url": {"url": f"data:image/jpeg;base64,{encoded_image}"},
-                },
-            ],
-        }
-    ],
-    max_completion_tokens=300,
-)
+async with llm_rate_limiter_context_manager():
+    response = self.aclient.chat.completions.create(
+        model=self.model,
+        messages=[
+            {
+                "role": "user",
+                "content": [
+                    {"type": "text", "text": "What's in this image?"},
+                    {
+                        "type": "image_url",
+                        "image_url": {"url": f"data:image/jpeg;base64,{encoded_image}"},
+                    },
+                ],
+            }
+        ],
+        max_completion_tokens=300,
+    )
```

cognee/infrastructure/databases/vector/embeddings/FastembedEmbeddingEngine.py (1)
100-105: Async context manager wraps synchronous embedding call. The `fastembed.TextEmbedding.embed()` method is synchronous, but it's wrapped in an async context manager. While this works for rate limiting purposes, the synchronous call will block the event loop. Consider wrapping in `asyncio.to_thread()` for non-blocking execution if this becomes a performance bottleneck.
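A minimal sketch of the `asyncio.to_thread()` suggestion; the `embedding_model` parameter and `embed_text` helper are illustrative, not the engine's actual interface:

```python
import asyncio

from cognee.shared.rate_limiting import embedding_rate_limiter_context_manager


async def embed_text(embedding_model, texts: list[str]) -> list[list[float]]:
    """Apply the rate limit, then run the blocking embed() off the event loop."""
    async with embedding_rate_limiter_context_manager():
        # fastembed's embed() is synchronous; to_thread keeps the loop responsive.
        vectors = await asyncio.to_thread(lambda: list(embedding_model.embed(texts)))
    return [vector.tolist() for vector in vectors]
```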
cognee/shared/rate_limiting.py (1)

14-30: Add docstrings for public functions. Per coding guidelines, public functions should have docstrings to document their purpose and behavior.

```diff
 def llm_rate_limiter_context_manager():
+    """
+    Return a rate limiter context manager for LLM API calls.
+
+    Returns the configured AsyncLimiter if rate limiting is enabled,
+    otherwise returns a no-op context manager (nullcontext).
+    """
     global llm_rate_limiter
     if llm_config.llm_rate_limit_enabled:
         return llm_rate_limiter
     else:
         # Return a no-op context manager if rate limiting is disabled
         return nullcontext()


 def embedding_rate_limiter_context_manager():
+    """
+    Return a rate limiter context manager for embedding API calls.
+
+    Returns the configured AsyncLimiter if rate limiting is enabled,
+    otherwise returns a no-op context manager (nullcontext).
+    """
     global embedding_rate_limiter
     if llm_config.embedding_rate_limit_enabled:
         return embedding_rate_limiter
     else:
         # Return a no-op context manager if rate limiting is disabled
         return nullcontext()
```
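For reference, a sketch of how call sites consume these helpers; `client` and `call_with_rate_limit` are placeholders:

```python
from cognee.shared.rate_limiting import llm_rate_limiter_context_manager


async def call_with_rate_limit(client, messages, response_model):
    # When llm_rate_limit_enabled is set this acquires a slot from the AsyncLimiter;
    # otherwise it is a nullcontext and the call proceeds immediately.
    async with llm_rate_limiter_context_manager():
        return await client.chat.completions.create(
            messages=messages,
            response_model=response_model,
        )
```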
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
uv.lock is excluded by !**/*.lock
📒 Files selected for processing (14)
cognee/infrastructure/databases/vector/embeddings/FastembedEmbeddingEngine.py (3 hunks)
cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py (2 hunks)
cognee/infrastructure/databases/vector/embeddings/OllamaEmbeddingEngine.py (3 hunks)
cognee/infrastructure/databases/vector/embeddings/embedding_rate_limiter.py (0 hunks)
cognee/infrastructure/llm/structured_output_framework/baml/baml_src/extraction/acreate_structured_output.py (3 hunks)
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/anthropic/adapter.py (3 hunks)
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/gemini/adapter.py (4 hunks)
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/generic_llm_api/adapter.py (4 hunks)
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/mistral/adapter.py (3 hunks)
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/ollama/adapter.py (3 hunks)
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/openai/adapter.py (4 hunks)
cognee/shared/rate_limiting.py (1 hunks)
cognee/tests/unit/infrastructure/mock_embedding_engine.py (2 hunks)
pyproject.toml (1 hunks)
💤 Files with no reviewable changes (1)
- cognee/infrastructure/databases/vector/embeddings/embedding_rate_limiter.py
🧰 Additional context used
📓 Path-based instructions (5)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
**/*.py: Use 4-space indentation in Python code
Use snake_case for Python module and function names
Use PascalCase for Python class names
Use ruff format before committing Python code
Use ruff check for import hygiene and style enforcement with line-length 100 configured in pyproject.toml
Prefer explicit, structured error handling in Python code
Files:
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/ollama/adapter.py
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/gemini/adapter.py
cognee/infrastructure/databases/vector/embeddings/OllamaEmbeddingEngine.py
cognee/tests/unit/infrastructure/mock_embedding_engine.py
cognee/infrastructure/llm/structured_output_framework/baml/baml_src/extraction/acreate_structured_output.py
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/anthropic/adapter.py
cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/generic_llm_api/adapter.py
cognee/shared/rate_limiting.py
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/mistral/adapter.py
cognee/infrastructure/databases/vector/embeddings/FastembedEmbeddingEngine.py
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/openai/adapter.py
⚙️ CodeRabbit configuration file
**/*.py: When reviewing Python code for this project:
- Prioritize portability over clarity, especially when dealing with cross-Python compatibility. However, with the priority in mind, do still consider improvements to clarity when relevant.
- As a general guideline, consider the code style advocated in the PEP 8 standard (excluding the use of spaces for indentation) and evaluate suggested changes for code style compliance.
- As a style convention, consider the code style advocated in CEP-8 and evaluate suggested changes for code style compliance.
- As a general guideline, try to provide any relevant, official, and supporting documentation links to any tool's suggestions in review comments. This guideline is important for posterity.
- As a general rule, undocumented function definitions and class definitions in the project's Python code are assumed incomplete. Please consider suggesting a short summary of the code for any of these incomplete definitions as docstrings when reviewing.
Files:
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/ollama/adapter.py
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/gemini/adapter.py
cognee/infrastructure/databases/vector/embeddings/OllamaEmbeddingEngine.py
cognee/tests/unit/infrastructure/mock_embedding_engine.py
cognee/infrastructure/llm/structured_output_framework/baml/baml_src/extraction/acreate_structured_output.py
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/anthropic/adapter.py
cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/generic_llm_api/adapter.py
cognee/shared/rate_limiting.py
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/mistral/adapter.py
cognee/infrastructure/databases/vector/embeddings/FastembedEmbeddingEngine.py
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/openai/adapter.py
cognee/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Use shared logging utilities from cognee.shared.logging_utils in Python code
Files:
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/ollama/adapter.py
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/gemini/adapter.py
cognee/infrastructure/databases/vector/embeddings/OllamaEmbeddingEngine.py
cognee/tests/unit/infrastructure/mock_embedding_engine.py
cognee/infrastructure/llm/structured_output_framework/baml/baml_src/extraction/acreate_structured_output.py
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/anthropic/adapter.py
cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/generic_llm_api/adapter.py
cognee/shared/rate_limiting.py
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/mistral/adapter.py
cognee/infrastructure/databases/vector/embeddings/FastembedEmbeddingEngine.py
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/openai/adapter.py
cognee/{modules,infrastructure,tasks}/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Co-locate feature-specific helpers under their respective package (modules/, infrastructure/, or tasks/)
Files:
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/ollama/adapter.py
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/gemini/adapter.py
cognee/infrastructure/databases/vector/embeddings/OllamaEmbeddingEngine.py
cognee/infrastructure/llm/structured_output_framework/baml/baml_src/extraction/acreate_structured_output.py
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/anthropic/adapter.py
cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/generic_llm_api/adapter.py
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/mistral/adapter.py
cognee/infrastructure/databases/vector/embeddings/FastembedEmbeddingEngine.py
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/openai/adapter.py
pyproject.toml
📄 CodeRabbit inference engine (AGENTS.md)
Python version requirement: >= 3.10 and < 3.14
Files:
pyproject.toml
cognee/tests/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
cognee/tests/**/*.py: Place Python tests under cognee/tests/ organized by type (unit, integration, cli_tests)
Name Python test files test_*.py and use pytest.mark.asyncio for async tests
Files:
cognee/tests/unit/infrastructure/mock_embedding_engine.py
🧠 Learnings (2)
📚 Learning: 2025-11-24T16:45:09.996Z
Learnt from: CR
Repo: topoteretes/cognee PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:45:09.996Z
Learning: Applies to pyproject.toml : Python version requirement: >= 3.10 and < 3.14
Applied to files:
pyproject.toml
📚 Learning: 2025-11-24T16:45:09.996Z
Learnt from: CR
Repo: topoteretes/cognee PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:45:09.996Z
Learning: Applies to cognee/**/*.py : Use shared logging utilities from cognee.shared.logging_utils in Python code
Applied to files:
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/openai/adapter.py
🧬 Code graph analysis (11)
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/ollama/adapter.py (2)
cognee/shared/rate_limiting.py (1)
llm_rate_limiter_context_manager(15-21)
cognee/infrastructure/databases/graph/neo4j_driver/deadlock_retry.py (1)
wait(35-41)
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/gemini/adapter.py (1)
cognee/shared/rate_limiting.py (1)
llm_rate_limiter_context_manager(15-21)
cognee/infrastructure/databases/vector/embeddings/OllamaEmbeddingEngine.py (1)
cognee/shared/rate_limiting.py (1)
embedding_rate_limiter_context_manager(24-30)
cognee/tests/unit/infrastructure/mock_embedding_engine.py (1)
cognee/shared/rate_limiting.py (1)
embedding_rate_limiter_context_manager(24-30)
cognee/infrastructure/llm/structured_output_framework/baml/baml_src/extraction/acreate_structured_output.py (3)
cognee/shared/logging_utils.py (1)
get_logger(212-224)
cognee/shared/rate_limiting.py (1)
llm_rate_limiter_context_manager(15-21)
cognee/infrastructure/databases/graph/neo4j_driver/deadlock_retry.py (1)
wait(35-41)
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/anthropic/adapter.py (1)
cognee/shared/rate_limiting.py (1)
llm_rate_limiter_context_manager(15-21)
cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py (1)
cognee/shared/rate_limiting.py (1)
embedding_rate_limiter_context_manager(24-30)
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/generic_llm_api/adapter.py (1)
cognee/shared/rate_limiting.py (1)
llm_rate_limiter_context_manager(15-21)
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/mistral/adapter.py (1)
cognee/shared/rate_limiting.py (1)
llm_rate_limiter_context_manager(15-21)
cognee/infrastructure/databases/vector/embeddings/FastembedEmbeddingEngine.py (1)
cognee/shared/rate_limiting.py (1)
embedding_rate_limiter_context_manager(24-30)
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/openai/adapter.py (2)
cognee/shared/rate_limiting.py (1)
llm_rate_limiter_context_manager(15-21)
cognee/infrastructure/databases/graph/neo4j_driver/deadlock_retry.py (1)
wait(35-41)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (19)
- GitHub Check: End-to-End Tests / Test Entity Extraction
- GitHub Check: End-to-End Tests / Test multi tenancy with different situations in Cognee
- GitHub Check: End-to-End Tests / Conversation sessions test (FS)
- GitHub Check: End-to-End Tests / Test permissions with different situations in Cognee
- GitHub Check: End-to-End Tests / S3 Bucket Test
- GitHub Check: End-to-End Tests / Concurrent Subprocess access test
- GitHub Check: End-to-End Tests / Conversation sessions test (Redis)
- GitHub Check: End-to-End Tests / Test Feedback Enrichment
- GitHub Check: Basic Tests / Run Simple Examples
- GitHub Check: End-to-End Tests / Test graph edge ingestion
- GitHub Check: Basic Tests / Run Integration Tests
- GitHub Check: End-to-End Tests / Server Start Test
- GitHub Check: Basic Tests / Run Linting
- GitHub Check: End-to-End Tests / Run Telemetry Test
- GitHub Check: End-to-End Tests / Deduplication Test
- GitHub Check: End-to-End Tests / Run Telemetry Pipeline Test
- GitHub Check: CLI Tests / CLI Integration Tests
- GitHub Check: CLI Tests / CLI Unit Tests
- GitHub Check: CLI Tests / CLI Functionality Tests
🔇 Additional comments (17)
cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py (3)
28-28: Import addition is correct. The import follows the project's module structure and uses the shared rate limiting utilities as per the coding guidelines.
113-122: Rate limiter integration and response handling look correct. The `async with` pattern correctly wraps the API call. The change from `response["data"]` to `response.data` is correct since `litellm.aembedding` returns an `EmbeddingResponse` object with a `data` attribute, not a dictionary.

82-88: Inconsistent retry wait time compared to LLM adapters. The embedding engine uses `wait_exponential_jitter(2, 128)` while the LLM adapters in this PR were updated to use `wait_exponential_jitter(8, 128)`. Verify if this difference is intentional (different rate limit characteristics for embeddings vs LLM calls) or if it should be aligned for consistency.

cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/anthropic/adapter.py (2)
18-18: Import addition is correct. Follows project conventions for shared utilities.

47-53: Retry configuration changes are appropriate. Increasing the initial wait time from 2 to 8 seconds provides better backoff behavior when dealing with rate-limited APIs.
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/ollama/adapter.py (2)
14-15: Import is correct. The import follows project conventions. The extra blank line is minor.

100-117: Rate limiter integration is correct. The `async with` pattern properly wraps the API call, and the retry configuration aligns with other adapters.
16-16: Import addition is correct. Follows project conventions for shared utilities.

75-81: Retry configuration is appropriate. The increased initial wait time and consistent configuration with other adapters are good.

109-126: Primary API call rate limiting is correctly implemented. The rate limiter context manager properly wraps the API call, and the reduced `max_retries=2` works well with the external retry decorator.

144-161: Fallback path also includes rate limiting - good implementation. Applying rate limiting to both the primary and fallback API calls ensures consistent behavior and prevents rate limit issues when falling back to an alternative model.
pyproject.toml (1)
62-62: Dependency addition is appropriate. The `aiolimiter>=1.2.1` constraint is compatible with the project's Python requirement (>=3.10,<3.14), as aiolimiter supports >=3.8. Version 1.2.1 is the latest available (December 2024), and the `>=` constraint allows for future patch and minor updates within the 1.x series. The library is a well-established asyncio rate limiter implementation suitable for the PR's purpose.

cognee/tests/unit/infrastructure/mock_embedding_engine.py (1)

50-51: LGTM! The mock correctly integrates the rate limiter context manager, allowing tests to verify rate limiting behavior when needed.
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/gemini/adapter.py (1)
109-127: LGTM! The rate limiter context manager is correctly applied to both primary and fallback API calls, with consistent retry configuration (max_retries=2).
cognee/infrastructure/databases/vector/embeddings/OllamaEmbeddingEngine.py (1)
120-128: LGTM! The rate limiter correctly wraps the HTTP request to the Ollama endpoint while preserving both response format handlers (`embeddings` and `data`).

cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/openai/adapter.py (2)

139-157: LGTM for async rate limiting implementation. The rate limiter is correctly applied to the primary async API call path.
199-248: Synchronous method missing rate limiting. The `create_structured_output` sync method doesn't use rate limiting, unlike its async counterpart `acreate_structured_output`. This creates inconsistent behavior between sync and async code paths.

Note: `AsyncLimiter` from aiolimiter is async-only, so a sync rate limiter would need a different approach (e.g., the `limits` library or a sync wrapper).

Is rate limiting for the sync method intentionally omitted, or should a sync-compatible rate limiter be added for consistency?
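If sync-path parity were wanted, one option is a small thread-safe limiter instead of aiolimiter; the `SyncRateLimiter` below is only a sketch under that assumption, not code from this PR:

```python
import threading
import time
from contextlib import contextmanager


class SyncRateLimiter:
    """Naive blocking limiter: at most max_calls per period seconds."""

    def __init__(self, max_calls: int, period: float = 60.0):
        self.max_calls = max_calls
        self.period = period
        self._lock = threading.Lock()
        self._timestamps: list[float] = []

    @contextmanager
    def __call__(self):
        # Holding the lock while sleeping serializes waiting callers; good enough
        # for a simple fail-safe, not for high-throughput use.
        with self._lock:
            now = time.monotonic()
            self._timestamps = [t for t in self._timestamps if now - t < self.period]
            if len(self._timestamps) >= self.max_calls:
                time.sleep(max(self.period - (now - self._timestamps[0]), 0))
            self._timestamps.append(time.monotonic())
        yield


sync_llm_limiter = SyncRateLimiter(max_calls=30, period=60.0)

# Example usage in a synchronous code path:
# with sync_llm_limiter():
#     response = client.chat.completions.create(...)
```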
```python
# api_base=self.fallback_endpoint,
response_model=response_model,
```
Commented-out api_base may cause fallback to use wrong endpoint.
The api_base=self.fallback_endpoint is commented out. If fallback_endpoint is configured, the fallback call will use the default endpoint instead of the intended fallback endpoint.
```diff
     api_key=self.fallback_api_key,
-    # api_base=self.fallback_endpoint,
+    api_base=self.fallback_endpoint,
     response_model=response_model,
```

🤖 Prompt for AI Agents
In
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/openai/adapter.py
around lines 180 to 181, the call that should use the configured fallback
endpoint has api_base commented out, so fallback requests will go to the default
endpoint; restore/pass api_base=self.fallback_endpoint (or conditionally include
it only when fallback_endpoint is set) in that call so the fallback request uses
the intended endpoint, keeping response_model and other args intact.
```python
from aiolimiter import AsyncLimiter
from contextlib import nullcontext
from cognee.infrastructure.llm.config import get_llm_config


llm_config = get_llm_config()


llm_rate_limiter = AsyncLimiter(
    llm_config.llm_rate_limit_requests, llm_config.embedding_rate_limit_interval
)
embedding_rate_limiter = AsyncLimiter(
    llm_config.embedding_rate_limit_requests, llm_config.embedding_rate_limit_interval
)
```
🧩 Analysis chain
🏁 Script executed:
# First, check if llm_rate_limit_interval exists in the config
rg -n "llm_rate_limit_interval" --type py -C2
Repository: topoteretes/cognee
Length of output: 5695
🏁 Script executed:
# Also check for llm_rate_limit_requests to understand the config parameters
rg -n "llm_rate_limit_requests|embedding_rate_limit" --type py -C2
Repository: topoteretes/cognee
Length of output: 15422
🏁 Script executed:
# Let's look at the get_llm_config function to understand how config is initialized
rg -n "def get_llm_config" --type py -A 10
Repository: topoteretes/cognee
Length of output: 946
🏁 Script executed:
# Check how the rate limiters are being used throughout the codebase
rg -n "llm_rate_limiter|embedding_rate_limiter" --type py -B2 -A2
Repository: topoteretes/cognee
Length of output: 28657
🏁 Script executed:
# Read the complete cognee/shared/rate_limiting.py file
cat -n cognee/shared/rate_limiting.py
Repository: topoteretes/cognee
Length of output: 1210
🏁 Script executed:
# Check if there are actual runtime config changes in tests or production code
rg -n "get_llm_config\.cache_clear|_instance.*None" --type py -B3 -A3
Repository: topoteretes/cognee
Length of output: 5713
🏁 Script executed:
# Look at the get_llm_config implementation more carefully
cat -n cognee/infrastructure/llm/config.py | head -230 | tail -30
Repository: topoteretes/cognee
Length of output: 1464
🏁 Script executed:
# Check if the rate limiters from cognee/shared/rate_limiting.py are actually used or if they're replaced by the singleton
rg "from cognee.shared.rate_limiting import llm_rate_limiter[^_]" --type pyRepository: topoteretes/cognee
Length of output: 44
🏁 Script executed:
# Check what the correct parameters should be for AsyncLimiter
rg -n "AsyncLimiter" --type py -B2 -A2
Repository: topoteretes/cognee
Length of output: 836
🏁 Script executed:
# Verify the parameter names and their values in config
rg -n "llm_rate_limit_interval|embedding_rate_limit_interval" cognee/infrastructure/llm/config.py | head -20
Repository: topoteretes/cognee
Length of output: 475
Fix parameter bug and address runtime config initialization limitations.
Line 8 has a copy-paste error: llm_rate_limiter uses llm_config.embedding_rate_limit_interval instead of llm_config.llm_rate_limit_interval. While both default to 60 seconds, independent configuration is not possible because the LLM limiter reads the embedding setting.
Additionally, module-level initialization with cached get_llm_config() prevents runtime reconfiguration. The rate limiters are frozen at import time and cannot adapt if configuration changes. Consider lazy initialization or a factory pattern if runtime reconfiguration is needed, particularly given the codebase pattern of explicitly clearing caches in tests (get_llm_config.cache_clear()).
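A sketch of the lazy-factory idea; `get_llm_rate_limiter` is a suggested name, not existing code, and the config attributes are the ones referenced above:

```python
from contextlib import nullcontext

from aiolimiter import AsyncLimiter

from cognee.infrastructure.llm.config import get_llm_config

_llm_limiter = None


def get_llm_rate_limiter():
    """Build the limiter on first use so config set before that point is honored."""
    global _llm_limiter
    config = get_llm_config()
    if not config.llm_rate_limit_enabled:
        return nullcontext()
    if _llm_limiter is None:
        _llm_limiter = AsyncLimiter(
            config.llm_rate_limit_requests, config.llm_rate_limit_interval
        )
    return _llm_limiter
```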
🤖 Prompt for AI Agents
In cognee/shared/rate_limiting.py around lines 1 to 12, there is a parameter bug
and an initialization design issue: llm_rate_limiter is incorrectly created with
llm_config.embedding_rate_limit_interval (swap/copy-paste) and both limiters are
instantiated at module import time, preventing runtime reconfiguration. Fix by
using llm_config.llm_rate_limit_interval for the llm_rate_limiter, and refactor
to lazily create or provide a factory function (e.g., get_llm_rate_limiter() and
get_embedding_rate_limiter()) that calls get_llm_config() at call time (or
checks/clears cache as needed) so tests and runtime can change config after
import; ensure any existing callers use the new accessors or the factory to
obtain fresh AsyncLimiter instances.
```python
llm_rate_limiter = AsyncLimiter(
    llm_config.llm_rate_limit_requests, llm_config.embedding_rate_limit_interval
)
```
Bug: LLM rate limiter uses embedding interval instead of LLM interval.
The llm_rate_limiter is configured with embedding_rate_limit_interval instead of llm_rate_limit_interval. This means the LLM rate limiter will use the wrong time window.
```diff
 llm_rate_limiter = AsyncLimiter(
-    llm_config.llm_rate_limit_requests, llm_config.embedding_rate_limit_interval
+    llm_config.llm_rate_limit_requests, llm_config.llm_rate_limit_interval
 )
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
llm_rate_limiter = AsyncLimiter(
    llm_config.llm_rate_limit_requests, llm_config.llm_rate_limit_interval
)
```
🤖 Prompt for AI Agents
In cognee/shared/rate_limiting.py around lines 7 to 9, the llm_rate_limiter is
being constructed with llm_config.embedding_rate_limit_interval instead of the
LLM-specific interval; change the second argument to
llm_config.llm_rate_limit_interval so the AsyncLimiter uses the correct time
window for LLM requests.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cognee/infrastructure/llm/structured_output_framework/baml/baml_src/extraction/acreate_structured_output.py (1)
36-55: Enhance docstring to document retry and rate-limiting behavior. Per coding guidelines, the docstring should document retry/rate-limiting behavior so users understand potential delays and exceptions. Add a section describing:
- Retry behavior (up to 128s with exponential backoff)
- Rate limiting (when enabled via config)
- Exceptions that may be raised after retries are exhausted
Apply this enhancement:
""" Generate a response from a user query. This method asynchronously creates structured output by sending a request through BAML using the provided parameters to generate a completion based on the user input and system prompt. + + Retry Behavior: + --------------- + This function automatically retries on transient failures with exponential backoff, + up to a maximum of 128 seconds. Rate limiting is applied when enabled via configuration. Parameters: ----------- - - text_input (str): The input text provided by the user for generating a response. - system_prompt (str): The system's prompt to guide the model's response. - response_model (Type[BaseModel]): The expected model type for the response. Returns: -------- - - BaseModel: A structured output generated by the model, returned as an instance of BaseModel. + + Raises: + ------- + - Exception: Re-raises the last exception after all retry attempts are exhausted. """As per coding guidelines (guideline 5), documenting behavior comprehensively is essential.
🧹 Nitpick comments (1)
cognee/infrastructure/llm/structured_output_framework/baml/baml_src/extraction/acreate_structured_output.py (1)
27-32: Document retry behavior and production alternatives. The 128-second retry window with 8-128s exponential backoff is appropriate for cloud API rate limits but won't help overwhelmed local LLMs (as noted in PR discussion). Since this is a "fail-safe for beginner developers" rather than a production solution, ensure documentation clearly states:
- This is for single-user, non-production scenarios
- Production deployments should use LiteLLM proxy or similar rate-limiting infrastructure
- Each request retries independently—parallel requests don't coordinate
Based on PR discussion feedback, do you want me to open an issue to add Mintlify documentation explaining production best practices for rate limiting?
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
cognee/infrastructure/llm/structured_output_framework/baml/baml_src/extraction/acreate_structured_output.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
**/*.py: Use 4-space indentation in Python code
Use snake_case for Python module and function names
Use PascalCase for Python class names
Use ruff format before committing Python code
Use ruff check for import hygiene and style enforcement with line-length 100 configured in pyproject.toml
Prefer explicit, structured error handling in Python code
Files:
cognee/infrastructure/llm/structured_output_framework/baml/baml_src/extraction/acreate_structured_output.py
⚙️ CodeRabbit configuration file
**/*.py: When reviewing Python code for this project:
- Prioritize portability over clarity, especially when dealing with cross-Python compatibility. However, with the priority in mind, do still consider improvements to clarity when relevant.
- As a general guideline, consider the code style advocated in the PEP 8 standard (excluding the use of spaces for indentation) and evaluate suggested changes for code style compliance.
- As a style convention, consider the code style advocated in CEP-8 and evaluate suggested changes for code style compliance.
- As a general guideline, try to provide any relevant, official, and supporting documentation links to any tool's suggestions in review comments. This guideline is important for posterity.
- As a general rule, undocumented function definitions and class definitions in the project's Python code are assumed incomplete. Please consider suggesting a short summary of the code for any of these incomplete definitions as docstrings when reviewing.
Files:
cognee/infrastructure/llm/structured_output_framework/baml/baml_src/extraction/acreate_structured_output.py
cognee/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Use shared logging utilities from cognee.shared.logging_utils in Python code
Files:
cognee/infrastructure/llm/structured_output_framework/baml/baml_src/extraction/acreate_structured_output.py
cognee/{modules,infrastructure,tasks}/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Co-locate feature-specific helpers under their respective package (modules/, infrastructure/, or tasks/)
Files:
cognee/infrastructure/llm/structured_output_framework/baml/baml_src/extraction/acreate_structured_output.py
🧬 Code graph analysis (1)
cognee/infrastructure/llm/structured_output_framework/baml/baml_src/extraction/acreate_structured_output.py (5)
cognee/shared/logging_utils.py (1)
get_logger(212-224)
cognee/shared/rate_limiting.py (1)
llm_rate_limiter_context_manager(15-21)
cognee/infrastructure/databases/graph/neo4j_driver/deadlock_retry.py (1)
wait(35-41)
cognee/infrastructure/llm/structured_output_framework/baml/baml_client/parser.py (2)
AcreateStructuredOutput(26-34)
AcreateStructuredOutput(43-51)
cognee/infrastructure/llm/structured_output_framework/baml/baml_client/sync_client.py (4)
AcreateStructuredOutput(95-118)
AcreateStructuredOutput(127-149)
AcreateStructuredOutput(158-172)
AcreateStructuredOutput(181-195)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (14)
- GitHub Check: End-to-End Tests / Concurrent Subprocess access test
- GitHub Check: End-to-End Tests / Test permissions with different situations in Cognee
- GitHub Check: End-to-End Tests / S3 Bucket Test
- GitHub Check: End-to-End Tests / Conversation sessions test (FS)
- GitHub Check: End-to-End Tests / Test Entity Extraction
- GitHub Check: End-to-End Tests / Run Telemetry Pipeline Test
- GitHub Check: End-to-End Tests / Test multi tenancy with different situations in Cognee
- GitHub Check: End-to-End Tests / Conversation sessions test (Redis)
- GitHub Check: End-to-End Tests / Test Feedback Enrichment
- GitHub Check: End-to-End Tests / Server Start Test
- GitHub Check: Basic Tests / Run Formatting Check
- GitHub Check: Basic Tests / Run Integration Tests
- GitHub Check: CLI Tests / CLI Integration Tests
- GitHub Check: CLI Tests / CLI Functionality Tests
🔇 Additional comments (1)
cognee/infrastructure/llm/structured_output_framework/baml/baml_src/extraction/acreate_structured_output.py (1)
62-67: LGTM: Two-level rate protection is appropriate. The combination of rate limiting (context manager) and retry logic (decorator) provides defense in depth: the rate limiter prevents overwhelming the API, while the retry decorator handles transient failures. When rate limiting is disabled via config, the retry logic serves as a fallback.
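Sketched together, the two layers look roughly like this; `call_baml_structured_output` is a placeholder, and the decorator settings mirror the ones shown for this file:

```python
import logging

from tenacity import retry, stop_after_delay, wait_exponential_jitter, before_sleep_log

from cognee.shared.logging_utils import get_logger
from cognee.shared.rate_limiting import llm_rate_limiter_context_manager

logger = get_logger()


@retry(
    stop=stop_after_delay(128),
    wait=wait_exponential_jitter(8, 128),
    before_sleep=before_sleep_log(logger, logging.DEBUG),
    reraise=True,
)
async def call_baml_structured_output(text_input: str, system_prompt: str):
    # Layer 1: the limiter spaces requests out before they are sent.
    async with llm_rate_limiter_context_manager():
        # Layer 2: if the provider still rejects the call, tenacity retries the
        # whole function with exponential backoff.
        ...  # placeholder for the actual BAML client invocation
```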
```python
    retry,
    stop_after_delay,
    wait_exponential_jitter,
    retry_if_not_exception_type,
```
Critical: retry_if_not_exception_type imported but not used in decorator.
Line 8 imports retry_if_not_exception_type, but it's not applied in the @retry decorator (lines 27-32). A past review comment indicated this was addressed in commits 72c20c2 to 362aa8d, but the current code doesn't include it. Without this parameter, the function will retry all exceptions—including non-retryable errors like NotFoundError (404), validation errors, and authentication failures—causing users to wait up to 128 seconds for errors that should fail immediately.
Apply this diff to add the missing retry condition:
```diff
 @retry(
     stop=stop_after_delay(128),
     wait=wait_exponential_jitter(8, 128),
+    retry=retry_if_not_exception_type(litellm.exceptions.NotFoundError),
     before_sleep=before_sleep_log(logger, logging.DEBUG),
     reraise=True,
 )
```

Ensure `litellm.exceptions` is imported at the top of the file:

```python
import litellm.exceptions
```
import litellm.exceptionsAlso applies to: 27-32
🤖 Prompt for AI Agents
In
cognee/infrastructure/llm/structured_output_framework/baml/baml_src/extraction/acreate_structured_output.py
around lines 8 and 27-32, add an import for litellm.exceptions at the top and
update the @retry decorator to use retry_if_not_exception_type so non-retryable
exceptions (e.g. litellm.exceptions.NotFoundError,
litellm.exceptions.ValidationError, litellm.exceptions.AuthenticationError) are
not retried. Concretely: add "import litellm.exceptions" near the other imports,
and modify the existing @retry(...) call to include
retry=retry_if_not_exception_type((litellm.exceptions.NotFoundError,
litellm.exceptions.ValidationError, litellm.exceptions.AuthenticationError))
while keeping the current stop, wait, and retry_error_callback settings.
Description
Add rate limit handling for BAML
Type of Change
Screenshots/Videos (if applicable)
Pre-submission Checklist
DCO Affirmation
I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin.
Summary by CodeRabbit
Improvements
Chores
Tests