fix: Resolve issue with BAML rate limit handling #1813
Conversation
Please make sure all the checkboxes are checked:
Important: Review skipped. Review was skipped due to path filters.

⛔ Files ignored due to path filters (1)

CodeRabbit blocks several paths by default. You can override this behavior by explicitly including those paths in the path filters. You can disable this status message by setting the corresponding option in your CodeRabbit configuration.

Walkthrough: Adds a shared aiolimiter-based rate-limiting module and migrates LLM and embedding call sites from a decorator-based limiter to context-manager usage; increases initial retry backoff, reduces retry counts across adapters/engines, removes the legacy embedding_rate_limiter module, and adds aiolimiter to dependencies.
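For orientation, here is a minimal sketch of the context-manager pattern described in the walkthrough, using aiolimiter directly. The `call_llm` function and the 60-requests-per-minute figure are illustrative placeholders, not values from this PR:

```python
import asyncio

from aiolimiter import AsyncLimiter

# Shared limiter: at most 60 acquisitions per 60-second window (illustrative numbers).
limiter = AsyncLimiter(60, 60)


async def call_llm(prompt: str) -> str:
    # Each call waits for a free slot before doing any work.
    async with limiter:
        await asyncio.sleep(0.1)  # stand-in for the real API request
        return f"response to: {prompt}"


async def main():
    print(await asyncio.gather(*(call_llm(f"q{i}") for i in range(5))))


asyncio.run(main())
```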
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Possibly related PRs
Suggested reviewers
Pre-merge checks and finishing touches: ❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
cognee/infrastructure/llm/structured_output_framework/baml/baml_src/extraction/acreate_structured_output.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
cognee/infrastructure/llm/structured_output_framework/baml/baml_src/extraction/acreate_structured_output.py (1)
cognee/shared/logging_utils.py (1)
get_logger(212-224)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
- GitHub Check: End-to-End Tests / Deduplication Test
- GitHub Check: CLI Tests / CLI Integration Tests
- GitHub Check: CLI Tests / CLI Functionality Tests
- GitHub Check: End-to-End Tests / Test Entity Extraction
- GitHub Check: End-to-End Tests / Run Telemetry Pipeline Test
- GitHub Check: End-to-End Tests / Server Start Test
- GitHub Check: End-to-End Tests / Test Feedback Enrichment
- GitHub Check: End-to-End Tests / Conversation sessions test
- GitHub Check: End-to-End Tests / Test permissions with different situations in Cognee
- GitHub Check: End-to-End Tests / Concurrent Subprocess access test
- GitHub Check: End-to-End Tests / Test graph edge ingestion
- GitHub Check: End-to-End Tests / Test using different async databases in parallel in Cognee
- GitHub Check: Basic Tests / Run Basic Graph Tests
- GitHub Check: End-to-End Tests / S3 Bucket Test
- GitHub Check: Basic Tests / Run Simple Examples BAML
- GitHub Check: Basic Tests / Run Simple Examples
- GitHub Check: Basic Tests / Run Linting
- GitHub Check: Basic Tests / Run Unit Tests
- GitHub Check: Basic Tests / Run Formatting Check
- GitHub Check: Basic Tests / Run Integration Tests
🔇 Additional comments (4)
cognee/infrastructure/llm/structured_output_framework/baml/baml_src/extraction/acreate_structured_output.py (4)
12-12: LGTM: Logging setup is correct. The logger is properly initialized using the project's logging utility and will log retry attempts at DEBUG level.
Also applies to: 21-23
3-3: LGTM: BaseModel import is necessary. This import is correctly added to support the type hint on line 33.
32-71: Function implementation looks correct. The core logic remains unchanged. The retry decorator properly wraps the entire async function, which is the correct approach for handling rate limits at the API call level.
4-10: Remove unused import and add missing retry condition to decorator. The review comment correctly identifies two real issues:

- `retry_if_not_exception_type` is imported on line 8 but never used
- The `@retry` decorator (lines 26-31) lacks a `retry=` parameter, causing it to retry on ALL exceptions rather than just transient/rate-limit errors

However, the exact exception class that BAML raises for rate limits cannot be determined from the codebase alone, as `baml_py` is an external library. The original review's suggestion to search the codebase also returned no results.

To complete this fix:
- Remove the unused `retry_if_not_exception_type` import or use it in the decorator
- Add a `retry=` parameter specifying the appropriate BAML exception (requires checking baml-py documentation or testing to identify the correct exception class)
- Verify the exception handling matches BAML's actual behavior
...ucture/llm/structured_output_framework/baml/baml_src/extraction/acreate_structured_output.py
```python
logger = get_logger()


@retry(
```
I have a couple of concerns about this approach. It feels like we are starting to add LLM-specific workarounds into our codebase and I’m not sure that’s ideal.
Are we sure exponential retries make sense for all cases with structured output? For example, if we send 1000 requests in one second - would all the failed ones get retried?
How do we know this retry strategy will work well with other LLMs?
It might be worth looking at other ways to handle rate limits too, like a reverse proxy or a message queue.
Yes, all failed requests would get retried if 1000 different backend requests were made, but for a production scenario we should use a proxy or a similar mechanism rather than rely on this approach. This is just a last resort for a single user in a non-production scenario who doesn't need anything advanced, and it will work fine in that case.
By the way, the wait behavior can be changed: we can provide any function to handle the waiting, so we could make it work in the 1000-different-requests scenario with exponential backoff (for example, if our wait function coordinated through a global variable). It might make sense to update the wait to handle that scenario as well.
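As a rough illustration of the idea above (supplying a custom wait callable to tenacity that consults shared state so concurrent failures back off together), here is a minimal sketch; `SharedBackoff` and `call_backend` are hypothetical names, not code from this PR:

```python
import random

from tenacity import retry, stop_after_delay


class SharedBackoff:
    """Wait strategy whose delay grows with the number of retries seen globally."""

    def __init__(self, base: float = 2.0, cap: float = 128.0):
        self.base = base
        self.cap = cap
        self.global_retry_count = 0  # shared by every call using this instance

    def __call__(self, retry_state) -> float:
        # tenacity passes the current RetryCallState and expects seconds to sleep.
        self.global_retry_count += 1
        delay = min(self.cap, self.base * (2 ** retry_state.attempt_number))
        # Back off harder when many retries are happening across all requests.
        delay = min(self.cap, delay * max(1, self.global_retry_count))
        # A real implementation would also decay the counter after successes.
        return delay + random.uniform(0, 1)


shared_wait = SharedBackoff()


@retry(stop=stop_after_delay(128), wait=shared_wait, reraise=True)
def call_backend():
    # Placeholder for the real LLM request; fails randomly to exercise the retry path.
    if random.random() < 0.5:
        raise RuntimeError("rate limited")
    return "ok"


print(call_backend())
```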
@dexters1 please update it based on Pavel's feedback.
I am worried that if we don't document the options, people will use this in production. Maybe add a ticket to cover and suggest the best ways to run it in prod in Mintlify, and pass it to Lazar for review.
@Vasilije1990 We proposed an alternative way. Also, @dexters1 recommends LiteLLM as a proxy. Let's see if it solves the issue and then decide the future of this PR. Perhaps we can avoid the rate limiter on our side.
I would merge this PR into the dev branch as a fail-safe for beginner developers. It would be good to expand the wait function to work globally as well when we get the time.
I'm a bit concerned. We introduce exponential retries and expect them to solve the problem of the rate limiter. The initial problem is that local LLMs can't survive under a high RPM, yet we expect them to recover by introducing retries (i.e., sending additional requests).
@pazone Should we close this one as the issue is also closed?
We need this mechanism as well for casual users, to handle the case where a rate limit is hit on the Cognee side. We should enhance it to work at a global request level so we don't spam too many parallel requests that can't be handled, and we can add optional RPM support. This can be achieved by specifying our own wait function in the tenacity wait criteria. The reason an issue was raised about this only for BAML is that we don't have any retry mechanism on the BAML side, while we already have one on the embedding and instructor side.
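A sketch of what the optional global RPM control mentioned here could look like, with a single process-wide aiolimiter shared by all BAML calls; `MAX_REQUESTS_PER_MINUTE` and `call_with_global_rpm_limit` are hypothetical names, not part of this PR:

```python
from aiolimiter import AsyncLimiter

# Hypothetical process-wide cap: at most 30 LLM requests per minute.
MAX_REQUESTS_PER_MINUTE = 30
global_llm_limiter = AsyncLimiter(MAX_REQUESTS_PER_MINUTE, 60)


async def call_with_global_rpm_limit(make_request, *args, **kwargs):
    """Run an async LLM call only after acquiring a slot from the shared limiter."""
    async with global_llm_limiter:
        return await make_request(*args, **kwargs)
```

Because the limiter is shared, parallel callers queue up instead of all hitting the provider at once.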
| GitGuardian id | GitGuardian status | Secret | Commit | Filename | |
|---|---|---|---|---|---|
| 9573981 | Triggered | Generic Password | 5d0586d | .github/workflows/e2e_tests.yml | View secret |
🛠 Guidelines to remediate hardcoded secrets
- Understand the implications of revoking this secret by investigating where it is used in your code.
- Replace and store your secret safely. Learn the best practices here.
- Revoke and rotate this secret.
- If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.
To avoid such incidents in the future, consider:
- following these best practices for managing and storing secrets, including API keys and other credentials
- installing secret detection on pre-commit to catch secrets before they leave your machine and ease remediation.
🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.
Bringing more details to the PR.

What problem is attempted to be solved: If a user runs an LLM locally with limited resources, it might not handle the default request rate and eventually start responding with 500s or "too many requests" errors, depending on the implementation.

What approaches have been discussed

What does this PR do: The PR introduces retries with exponential backoff, retrying failed requests to the LLM with increasing delays. Adding @borisarzentar and @hajdul88 to review.
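For readers unfamiliar with tenacity, here is a minimal sketch of the retry mechanism being added; the decorator mirrors the settings shown in the review diffs below, and `fetch_completion` is a placeholder, not the PR's actual function:

```python
import logging

from tenacity import (
    retry,
    stop_after_delay,
    wait_exponential_jitter,
    before_sleep_log,
)

from cognee.shared.logging_utils import get_logger

logger = get_logger()


@retry(
    stop=stop_after_delay(128),  # give up once 128 seconds have elapsed overall
    wait=wait_exponential_jitter(8, 128),  # start around 8s, cap at 128s, add jitter
    before_sleep=before_sleep_log(logger, logging.DEBUG),  # log each retry at DEBUG
    reraise=True,  # surface the final exception to the caller
)
async def fetch_completion(prompt: str):
    # Placeholder for the real BAML / LLM request.
    ...
```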
It's not necessarily for when local LLMs die. Exponential retries are not additional functionality; it's something we already have and use for instructor, where it handles OpenAI, local model, and other cloud model calls. This PR just introduces the same mechanism on the BAML side, where it was missing. It's mostly here to handle the case where a rate limit is hit when using BAML with OpenAI and other cloud providers, since otherwise reaching the rate limit would break Cognee. It might also help with local models, but no one has complained about local LLMs dying when using instructor, so that's just a minor part of the mechanism. For the local model use case I will add optional simple RPM control on the Cognee side.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/mistral/adapter.py (1)
101-113: Incorrect response handling for instructor client. The `instructor.from_litellm` client returns the parsed Pydantic model directly, not a raw API response with `choices`. The current code checks `response.choices` and attempts `model_validate_json`, which will likely fail because `response` is already the validated `response_model` instance.

```diff
 async with llm_rate_limiter_context_manager():
     response = await self.aclient.chat.completions.create(
         model=self.model,
         max_tokens=self.max_completion_tokens,
         max_retries=2,
         messages=messages,
         response_model=response_model,
     )
-if response.choices and response.choices[0].message.content:
-    content = response.choices[0].message.content
-    return response_model.model_validate_json(content)
-else:
-    raise ValueError("Failed to get valid response after retries")
+return response
```
♻️ Duplicate comments (1)
cognee/infrastructure/llm/structured_output_framework/baml/baml_src/extraction/acreate_structured_output.py (1)
27-32: Missing `retry` condition parameter and inconsistent backoff timing. Two issues:

- The `retry_if_not_exception_type` is imported but not used. Per past review feedback, non-retryable errors (like 404 NotFound) should not be retried. Other adapters include `retry=retry_if_not_exception_type(...)`.
- `wait_exponential_jitter(2, 128)` differs from other adapters that use `(8, 128)`. This inconsistency may lead to different retry behavior for BAML vs other LLM calls.

```diff
+import litellm
+
 @retry(
     stop=stop_after_delay(128),
-    wait=wait_exponential_jitter(2, 128),
+    wait=wait_exponential_jitter(8, 128),
+    retry=retry_if_not_exception_type(litellm.exceptions.NotFoundError),
     before_sleep=before_sleep_log(logger, logging.DEBUG),
     reraise=True,
 )
```
🧹 Nitpick comments (6)
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/anthropic/adapter.py (1)
73-86: Rate limiter integration is correct, but prompt formatting has unintended whitespace. The content string at lines 81-82 includes leading whitespace that will be sent to the API:
"content": f"""Use the given format to extract information from the following input: {text_input}. {system_prompt}""",This will result in extra spaces in the prompt. Consider using a continuation without extra indentation or using string concatenation.
- "content": f"""Use the given format to extract information - from the following input: {text_input}. {system_prompt}""", + "content": f"Use the given format to extract information from the following input: {text_input}. {system_prompt}",cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/ollama/adapter.py (3)
145-150: Missing rate limiter wrapper for audio transcription API call. For consistency with other API calls in this adapter, consider wrapping the `transcriptions.create` call with the rate limiter context manager.

```diff
 async with open_data_file(input_file, mode="rb") as audio_file:
-    transcription = self.aclient.audio.transcriptions.create(
-        model="whisper-1",  # Ensure the correct model for transcription
-        file=audio_file,
-        language="en",
-    )
+    async with llm_rate_limiter_context_manager():
+        transcription = self.aclient.audio.transcriptions.create(
+            model="whisper-1",  # Ensure the correct model for transcription
+            file=audio_file,
+            language="en",
+        )
```
158-164: Inconsistent retry wait time for `transcribe_image`. This method uses `wait_exponential_jitter(2, 128)` while other methods in this adapter use `(8, 128)`. Consider aligning for consistency.

```diff
 @retry(
     stop=stop_after_delay(128),
-    wait=wait_exponential_jitter(2, 128),
+    wait=wait_exponential_jitter(8, 128),
     retry=retry_if_not_exception_type(litellm.exceptions.NotFoundError),
     before_sleep=before_sleep_log(logger, logging.DEBUG),
     reraise=True,
 )
```
188-203: Missing rate limiter wrapper for image transcription API call. For consistency, consider wrapping this API call with the rate limiter context manager.

```diff
-response = self.aclient.chat.completions.create(
-    model=self.model,
-    messages=[
-        {
-            "role": "user",
-            "content": [
-                {"type": "text", "text": "What's in this image?"},
-                {
-                    "type": "image_url",
-                    "image_url": {"url": f"data:image/jpeg;base64,{encoded_image}"},
-                },
-            ],
-        }
-    ],
-    max_completion_tokens=300,
-)
+async with llm_rate_limiter_context_manager():
+    response = self.aclient.chat.completions.create(
+        model=self.model,
+        messages=[
+            {
+                "role": "user",
+                "content": [
+                    {"type": "text", "text": "What's in this image?"},
+                    {
+                        "type": "image_url",
+                        "image_url": {"url": f"data:image/jpeg;base64,{encoded_image}"},
+                    },
+                ],
+            }
+        ],
+        max_completion_tokens=300,
+    )
```

cognee/infrastructure/databases/vector/embeddings/FastembedEmbeddingEngine.py (1)
100-105: Async context manager wraps synchronous embedding call. The `fastembed.TextEmbedding.embed()` method is synchronous, but it's wrapped in an async context manager. While this works for rate limiting purposes, the synchronous call will block the event loop. Consider wrapping in `asyncio.to_thread()` for non-blocking execution if this becomes a performance bottleneck.
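A minimal sketch of the `asyncio.to_thread()` suggestion; the `embedding_model` parameter and `embed_text` helper are illustrative, not the engine's actual interface:

```python
import asyncio

from cognee.shared.rate_limiting import embedding_rate_limiter_context_manager


async def embed_text(embedding_model, texts: list[str]) -> list[list[float]]:
    """Apply the rate limit, then run the blocking embed() off the event loop."""
    async with embedding_rate_limiter_context_manager():
        # fastembed's embed() is synchronous; to_thread keeps the loop responsive.
        vectors = await asyncio.to_thread(lambda: list(embedding_model.embed(texts)))
    return [vector.tolist() for vector in vectors]
```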
cognee/shared/rate_limiting.py (1)

14-30: Add docstrings for public functions. Per coding guidelines, public functions should have docstrings to document their purpose and behavior.

```diff
 def llm_rate_limiter_context_manager():
+    """
+    Return a rate limiter context manager for LLM API calls.
+
+    Returns the configured AsyncLimiter if rate limiting is enabled,
+    otherwise returns a no-op context manager (nullcontext).
+    """
     global llm_rate_limiter
     if llm_config.llm_rate_limit_enabled:
         return llm_rate_limiter
     else:
         # Return a no-op context manager if rate limiting is disabled
         return nullcontext()


 def embedding_rate_limiter_context_manager():
+    """
+    Return a rate limiter context manager for embedding API calls.
+
+    Returns the configured AsyncLimiter if rate limiting is enabled,
+    otherwise returns a no-op context manager (nullcontext).
+    """
     global embedding_rate_limiter
     if llm_config.embedding_rate_limit_enabled:
         return embedding_rate_limiter
     else:
         # Return a no-op context manager if rate limiting is disabled
         return nullcontext()
```
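For reference, a sketch of how call sites consume these helpers; `client` and `call_with_rate_limit` are placeholders:

```python
from cognee.shared.rate_limiting import llm_rate_limiter_context_manager


async def call_with_rate_limit(client, messages, response_model):
    # When llm_rate_limit_enabled is set this acquires a slot from the AsyncLimiter;
    # otherwise it is a nullcontext and the call proceeds immediately.
    async with llm_rate_limiter_context_manager():
        return await client.chat.completions.create(
            messages=messages,
            response_model=response_model,
        )
```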
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
uv.lock is excluded by !**/*.lock
📒 Files selected for processing (14)
cognee/infrastructure/databases/vector/embeddings/FastembedEmbeddingEngine.py (3 hunks)
cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py (2 hunks)
cognee/infrastructure/databases/vector/embeddings/OllamaEmbeddingEngine.py (3 hunks)
cognee/infrastructure/databases/vector/embeddings/embedding_rate_limiter.py (0 hunks)
cognee/infrastructure/llm/structured_output_framework/baml/baml_src/extraction/acreate_structured_output.py (3 hunks)
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/anthropic/adapter.py (3 hunks)
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/gemini/adapter.py (4 hunks)
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/generic_llm_api/adapter.py (4 hunks)
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/mistral/adapter.py (3 hunks)
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/ollama/adapter.py (3 hunks)
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/openai/adapter.py (4 hunks)
cognee/shared/rate_limiting.py (1 hunks)
cognee/tests/unit/infrastructure/mock_embedding_engine.py (2 hunks)
pyproject.toml (1 hunks)
💤 Files with no reviewable changes (1)
- cognee/infrastructure/databases/vector/embeddings/embedding_rate_limiter.py
🧰 Additional context used
📓 Path-based instructions (5)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
**/*.py: Use 4-space indentation in Python code
Use snake_case for Python module and function names
Use PascalCase for Python class names
Use ruff format before committing Python code
Use ruff check for import hygiene and style enforcement with line-length 100 configured in pyproject.toml
Prefer explicit, structured error handling in Python code
Files:
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/ollama/adapter.py
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/gemini/adapter.py
cognee/infrastructure/databases/vector/embeddings/OllamaEmbeddingEngine.py
cognee/tests/unit/infrastructure/mock_embedding_engine.py
cognee/infrastructure/llm/structured_output_framework/baml/baml_src/extraction/acreate_structured_output.py
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/anthropic/adapter.py
cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/generic_llm_api/adapter.py
cognee/shared/rate_limiting.py
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/mistral/adapter.py
cognee/infrastructure/databases/vector/embeddings/FastembedEmbeddingEngine.py
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/openai/adapter.py
⚙️ CodeRabbit configuration file
**/*.py: When reviewing Python code for this project:
- Prioritize portability over clarity, especially when dealing with cross-Python compatibility. However, with the priority in mind, do still consider improvements to clarity when relevant.
- As a general guideline, consider the code style advocated in the PEP 8 standard (excluding the use of spaces for indentation) and evaluate suggested changes for code style compliance.
- As a style convention, consider the code style advocated in CEP-8 and evaluate suggested changes for code style compliance.
- As a general guideline, try to provide any relevant, official, and supporting documentation links to any tool's suggestions in review comments. This guideline is important for posterity.
- As a general rule, undocumented function definitions and class definitions in the project's Python code are assumed incomplete. Please consider suggesting a short summary of the code for any of these incomplete definitions as docstrings when reviewing.
Files:
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/ollama/adapter.py
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/gemini/adapter.py
cognee/infrastructure/databases/vector/embeddings/OllamaEmbeddingEngine.py
cognee/tests/unit/infrastructure/mock_embedding_engine.py
cognee/infrastructure/llm/structured_output_framework/baml/baml_src/extraction/acreate_structured_output.py
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/anthropic/adapter.py
cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/generic_llm_api/adapter.py
cognee/shared/rate_limiting.py
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/mistral/adapter.py
cognee/infrastructure/databases/vector/embeddings/FastembedEmbeddingEngine.py
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/openai/adapter.py
cognee/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Use shared logging utilities from cognee.shared.logging_utils in Python code
Files:
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/ollama/adapter.py
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/gemini/adapter.py
cognee/infrastructure/databases/vector/embeddings/OllamaEmbeddingEngine.py
cognee/tests/unit/infrastructure/mock_embedding_engine.py
cognee/infrastructure/llm/structured_output_framework/baml/baml_src/extraction/acreate_structured_output.py
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/anthropic/adapter.py
cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/generic_llm_api/adapter.py
cognee/shared/rate_limiting.py
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/mistral/adapter.py
cognee/infrastructure/databases/vector/embeddings/FastembedEmbeddingEngine.py
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/openai/adapter.py
cognee/{modules,infrastructure,tasks}/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Co-locate feature-specific helpers under their respective package (modules/, infrastructure/, or tasks/)
Files:
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/ollama/adapter.py
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/gemini/adapter.py
cognee/infrastructure/databases/vector/embeddings/OllamaEmbeddingEngine.py
cognee/infrastructure/llm/structured_output_framework/baml/baml_src/extraction/acreate_structured_output.py
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/anthropic/adapter.py
cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/generic_llm_api/adapter.py
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/mistral/adapter.py
cognee/infrastructure/databases/vector/embeddings/FastembedEmbeddingEngine.py
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/openai/adapter.py
pyproject.toml
📄 CodeRabbit inference engine (AGENTS.md)
Python version requirement: >= 3.10 and < 3.14
Files:
pyproject.toml
cognee/tests/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
cognee/tests/**/*.py: Place Python tests under cognee/tests/ organized by type (unit, integration, cli_tests)
Name Python test files test_*.py and use pytest.mark.asyncio for async tests
Files:
cognee/tests/unit/infrastructure/mock_embedding_engine.py
🧠 Learnings (2)
📚 Learning: 2025-11-24T16:45:09.996Z
Learnt from: CR
Repo: topoteretes/cognee PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:45:09.996Z
Learning: Applies to pyproject.toml : Python version requirement: >= 3.10 and < 3.14
Applied to files:
pyproject.toml
📚 Learning: 2025-11-24T16:45:09.996Z
Learnt from: CR
Repo: topoteretes/cognee PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:45:09.996Z
Learning: Applies to cognee/**/*.py : Use shared logging utilities from cognee.shared.logging_utils in Python code
Applied to files:
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/openai/adapter.py
🧬 Code graph analysis (11)
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/ollama/adapter.py (2)
cognee/shared/rate_limiting.py (1)
llm_rate_limiter_context_manager(15-21)
cognee/infrastructure/databases/graph/neo4j_driver/deadlock_retry.py (1)
wait(35-41)
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/gemini/adapter.py (1)
cognee/shared/rate_limiting.py (1)
llm_rate_limiter_context_manager(15-21)
cognee/infrastructure/databases/vector/embeddings/OllamaEmbeddingEngine.py (1)
cognee/shared/rate_limiting.py (1)
embedding_rate_limiter_context_manager(24-30)
cognee/tests/unit/infrastructure/mock_embedding_engine.py (1)
cognee/shared/rate_limiting.py (1)
embedding_rate_limiter_context_manager(24-30)
cognee/infrastructure/llm/structured_output_framework/baml/baml_src/extraction/acreate_structured_output.py (3)
cognee/shared/logging_utils.py (1)
get_logger(212-224)
cognee/shared/rate_limiting.py (1)
llm_rate_limiter_context_manager(15-21)
cognee/infrastructure/databases/graph/neo4j_driver/deadlock_retry.py (1)
wait(35-41)
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/anthropic/adapter.py (1)
cognee/shared/rate_limiting.py (1)
llm_rate_limiter_context_manager(15-21)
cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py (1)
cognee/shared/rate_limiting.py (1)
embedding_rate_limiter_context_manager(24-30)
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/generic_llm_api/adapter.py (1)
cognee/shared/rate_limiting.py (1)
llm_rate_limiter_context_manager(15-21)
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/mistral/adapter.py (1)
cognee/shared/rate_limiting.py (1)
llm_rate_limiter_context_manager(15-21)
cognee/infrastructure/databases/vector/embeddings/FastembedEmbeddingEngine.py (1)
cognee/shared/rate_limiting.py (1)
embedding_rate_limiter_context_manager(24-30)
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/openai/adapter.py (2)
cognee/shared/rate_limiting.py (1)
llm_rate_limiter_context_manager(15-21)
cognee/infrastructure/databases/graph/neo4j_driver/deadlock_retry.py (1)
wait(35-41)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (19)
- GitHub Check: End-to-End Tests / Test Entity Extraction
- GitHub Check: End-to-End Tests / Test multi tenancy with different situations in Cognee
- GitHub Check: End-to-End Tests / Conversation sessions test (FS)
- GitHub Check: End-to-End Tests / Test permissions with different situations in Cognee
- GitHub Check: End-to-End Tests / S3 Bucket Test
- GitHub Check: End-to-End Tests / Concurrent Subprocess access test
- GitHub Check: End-to-End Tests / Conversation sessions test (Redis)
- GitHub Check: End-to-End Tests / Test Feedback Enrichment
- GitHub Check: Basic Tests / Run Simple Examples
- GitHub Check: End-to-End Tests / Test graph edge ingestion
- GitHub Check: Basic Tests / Run Integration Tests
- GitHub Check: End-to-End Tests / Server Start Test
- GitHub Check: Basic Tests / Run Linting
- GitHub Check: End-to-End Tests / Run Telemetry Test
- GitHub Check: End-to-End Tests / Deduplication Test
- GitHub Check: End-to-End Tests / Run Telemetry Pipeline Test
- GitHub Check: CLI Tests / CLI Integration Tests
- GitHub Check: CLI Tests / CLI Unit Tests
- GitHub Check: CLI Tests / CLI Functionality Tests
🔇 Additional comments (17)
cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py (3)
28-28: Import addition is correct. The import follows the project's module structure and uses the shared rate limiting utilities as per the coding guidelines.
113-122: Rate limiter integration and response handling look correct. The `async with` pattern correctly wraps the API call. The change from `response["data"]` to `response.data` is correct since `litellm.aembedding` returns an `EmbeddingResponse` object with a `data` attribute, not a dictionary.

82-88: Inconsistent retry wait time compared to LLM adapters. The embedding engine uses `wait_exponential_jitter(2, 128)` while the LLM adapters in this PR were updated to use `wait_exponential_jitter(8, 128)`. Verify if this difference is intentional (different rate limit characteristics for embeddings vs LLM calls) or if it should be aligned for consistency.

cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/anthropic/adapter.py (2)
18-18: Import addition is correct. Follows project conventions for shared utilities.

47-53: Retry configuration changes are appropriate. Increasing the initial wait time from 2 to 8 seconds provides better backoff behavior when dealing with rate-limited APIs.
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/ollama/adapter.py (2)
14-15: Import is correct. The import follows project conventions. The extra blank line is minor.

100-117: Rate limiter integration is correct. The `async with` pattern properly wraps the API call, and the retry configuration aligns with other adapters.
16-16: Import addition is correct. Follows project conventions for shared utilities.

75-81: Retry configuration is appropriate. The increased initial wait time and consistent configuration with other adapters are good.

109-126: Primary API call rate limiting is correctly implemented. The rate limiter context manager properly wraps the API call, and the reduced `max_retries=2` works well with the external retry decorator.

144-161: Fallback path also includes rate limiting - good implementation. Applying rate limiting to both the primary and fallback API calls ensures consistent behavior and prevents rate limit issues when falling back to an alternative model.
pyproject.toml (1)
62-62: Dependency addition is appropriate. The `aiolimiter>=1.2.1` constraint is compatible with the project's Python requirement (>=3.10,<3.14), as aiolimiter supports >=3.8. Version 1.2.1 is the latest available (December 2024), and the `>=` constraint allows for future patch and minor updates within the 1.x series. The library is a well-established asyncio rate limiter implementation suitable for the PR's purpose.

cognee/tests/unit/infrastructure/mock_embedding_engine.py (1)

50-51: LGTM! The mock correctly integrates the rate limiter context manager, allowing tests to verify rate limiting behavior when needed.
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/gemini/adapter.py (1)
109-127: LGTM! The rate limiter context manager is correctly applied to both primary and fallback API calls, with consistent retry configuration (max_retries=2).
cognee/infrastructure/databases/vector/embeddings/OllamaEmbeddingEngine.py (1)
120-128: LGTM! The rate limiter correctly wraps the HTTP request to the Ollama endpoint while preserving both response format handlers (`embeddings` and `data`).

cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/openai/adapter.py (2)

139-157: LGTM for async rate limiting implementation. The rate limiter is correctly applied to the primary async API call path.
199-248: Synchronous method missing rate limiting. The `create_structured_output` sync method doesn't use rate limiting, unlike its async counterpart `acreate_structured_output`. This creates inconsistent behavior between sync and async code paths.

Note: `AsyncLimiter` from aiolimiter is async-only, so a sync rate limiter would need a different approach (e.g., the `limits` library or a sync wrapper).

Is rate limiting for the sync method intentionally omitted, or should a sync-compatible rate limiter be added for consistency?
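If sync-path parity were wanted, one option is a small thread-safe limiter instead of aiolimiter; the `SyncRateLimiter` below is only a sketch under that assumption, not code from this PR:

```python
import threading
import time
from contextlib import contextmanager


class SyncRateLimiter:
    """Naive blocking limiter: at most max_calls per period seconds."""

    def __init__(self, max_calls: int, period: float = 60.0):
        self.max_calls = max_calls
        self.period = period
        self._lock = threading.Lock()
        self._timestamps: list[float] = []

    @contextmanager
    def __call__(self):
        # Holding the lock while sleeping serializes waiting callers; good enough
        # for a simple fail-safe, not for high-throughput use.
        with self._lock:
            now = time.monotonic()
            self._timestamps = [t for t in self._timestamps if now - t < self.period]
            if len(self._timestamps) >= self.max_calls:
                time.sleep(max(self.period - (now - self._timestamps[0]), 0))
            self._timestamps.append(time.monotonic())
        yield


sync_llm_limiter = SyncRateLimiter(max_calls=30, period=60.0)

# Example usage in a synchronous code path:
# with sync_llm_limiter():
#     response = client.chat.completions.create(...)
```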
```python
# api_base=self.fallback_endpoint,
response_model=response_model,
```
Commented-out api_base may cause fallback to use wrong endpoint.
The api_base=self.fallback_endpoint is commented out. If fallback_endpoint is configured, the fallback call will use the default endpoint instead of the intended fallback endpoint.
```diff
     api_key=self.fallback_api_key,
-    # api_base=self.fallback_endpoint,
+    api_base=self.fallback_endpoint,
     response_model=response_model,
```

🤖 Prompt for AI Agents
In
cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/openai/adapter.py
around lines 180 to 181, the call that should use the configured fallback
endpoint has api_base commented out, so fallback requests will go to the default
endpoint; restore/pass api_base=self.fallback_endpoint (or conditionally include
it only when fallback_endpoint is set) in that call so the fallback request uses
the intended endpoint, keeping response_model and other args intact.
```python
from aiolimiter import AsyncLimiter
from contextlib import nullcontext
from cognee.infrastructure.llm.config import get_llm_config


llm_config = get_llm_config()


llm_rate_limiter = AsyncLimiter(
    llm_config.llm_rate_limit_requests, llm_config.embedding_rate_limit_interval
)
embedding_rate_limiter = AsyncLimiter(
    llm_config.embedding_rate_limit_requests, llm_config.embedding_rate_limit_interval
)
```
🧩 Analysis chain
🏁 Script executed:
# First, check if llm_rate_limit_interval exists in the config
rg -n "llm_rate_limit_interval" --type py -C2
Repository: topoteretes/cognee
Length of output: 5695
🏁 Script executed:
# Also check for llm_rate_limit_requests to understand the config parameters
rg -n "llm_rate_limit_requests|embedding_rate_limit" --type py -C2
Repository: topoteretes/cognee
Length of output: 15422
🏁 Script executed:
# Let's look at the get_llm_config function to understand how config is initialized
rg -n "def get_llm_config" --type py -A 10
Repository: topoteretes/cognee
Length of output: 946
🏁 Script executed:
# Check how the rate limiters are being used throughout the codebase
rg -n "llm_rate_limiter|embedding_rate_limiter" --type py -B2 -A2
Repository: topoteretes/cognee
Length of output: 28657
🏁 Script executed:
# Read the complete cognee/shared/rate_limiting.py file
cat -n cognee/shared/rate_limiting.py
Repository: topoteretes/cognee
Length of output: 1210
🏁 Script executed:
# Check if there are actual runtime config changes in tests or production code
rg -n "get_llm_config\.cache_clear|_instance.*None" --type py -B3 -A3
Repository: topoteretes/cognee
Length of output: 5713
🏁 Script executed:
# Look at the get_llm_config implementation more carefully
cat -n cognee/infrastructure/llm/config.py | head -230 | tail -30
Repository: topoteretes/cognee
Length of output: 1464
🏁 Script executed:
# Check if the rate limiters from cognee/shared/rate_limiting.py are actually used or if they're replaced by the singleton
rg "from cognee.shared.rate_limiting import llm_rate_limiter[^_]" --type pyRepository: topoteretes/cognee
Length of output: 44
🏁 Script executed:
# Check what the correct parameters should be for AsyncLimiter
rg -n "AsyncLimiter" --type py -B2 -A2
Repository: topoteretes/cognee
Length of output: 836
🏁 Script executed:
# Verify the parameter names and their values in config
rg -n "llm_rate_limit_interval|embedding_rate_limit_interval" cognee/infrastructure/llm/config.py | head -20
Repository: topoteretes/cognee
Length of output: 475
Fix parameter bug and address runtime config initialization limitations.
Line 8 has a copy-paste error: llm_rate_limiter uses llm_config.embedding_rate_limit_interval instead of llm_config.llm_rate_limit_interval. While both default to 60 seconds, independent configuration is not possible because the LLM limiter reads the embedding setting.
Additionally, module-level initialization with cached get_llm_config() prevents runtime reconfiguration. The rate limiters are frozen at import time and cannot adapt if configuration changes. Consider lazy initialization or a factory pattern if runtime reconfiguration is needed, particularly given the codebase pattern of explicitly clearing caches in tests (get_llm_config.cache_clear()).
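A sketch of the lazy-factory idea; `get_llm_rate_limiter` is a suggested name, not existing code, and the config attributes are the ones referenced above:

```python
from contextlib import nullcontext

from aiolimiter import AsyncLimiter

from cognee.infrastructure.llm.config import get_llm_config

_llm_limiter = None


def get_llm_rate_limiter():
    """Build the limiter on first use so config set before that point is honored."""
    global _llm_limiter
    config = get_llm_config()
    if not config.llm_rate_limit_enabled:
        return nullcontext()
    if _llm_limiter is None:
        _llm_limiter = AsyncLimiter(
            config.llm_rate_limit_requests, config.llm_rate_limit_interval
        )
    return _llm_limiter
```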
🤖 Prompt for AI Agents
In cognee/shared/rate_limiting.py around lines 1 to 12, there is a parameter bug
and an initialization design issue: llm_rate_limiter is incorrectly created with
llm_config.embedding_rate_limit_interval (swap/copy-paste) and both limiters are
instantiated at module import time, preventing runtime reconfiguration. Fix by
using llm_config.llm_rate_limit_interval for the llm_rate_limiter, and refactor
to lazily create or provide a factory function (e.g., get_llm_rate_limiter() and
get_embedding_rate_limiter()) that calls get_llm_config() at call time (or
checks/clears cache as needed) so tests and runtime can change config after
import; ensure any existing callers use the new accessors or the factory to
obtain fresh AsyncLimiter instances.
```python
llm_rate_limiter = AsyncLimiter(
    llm_config.llm_rate_limit_requests, llm_config.embedding_rate_limit_interval
)
```
Bug: LLM rate limiter uses embedding interval instead of LLM interval.
The llm_rate_limiter is configured with embedding_rate_limit_interval instead of llm_rate_limit_interval. This means the LLM rate limiter will use the wrong time window.
```diff
 llm_rate_limiter = AsyncLimiter(
-    llm_config.llm_rate_limit_requests, llm_config.embedding_rate_limit_interval
+    llm_config.llm_rate_limit_requests, llm_config.llm_rate_limit_interval
 )
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
llm_rate_limiter = AsyncLimiter(
    llm_config.llm_rate_limit_requests, llm_config.llm_rate_limit_interval
)
```
🤖 Prompt for AI Agents
In cognee/shared/rate_limiting.py around lines 7 to 9, the llm_rate_limiter is
being constructed with llm_config.embedding_rate_limit_interval instead of the
LLM-specific interval; change the second argument to
llm_config.llm_rate_limit_interval so the AsyncLimiter uses the correct time
window for LLM requests.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cognee/infrastructure/llm/structured_output_framework/baml/baml_src/extraction/acreate_structured_output.py (1)
36-55: Enhance docstring to document retry and rate-limiting behavior. Per coding guidelines, the docstring should document retry/rate-limiting behavior so users understand potential delays and exceptions. Add a section describing:
- Retry behavior (up to 128s with exponential backoff)
- Rate limiting (when enabled via config)
- Exceptions that may be raised after retries are exhausted
Apply this enhancement:
""" Generate a response from a user query. This method asynchronously creates structured output by sending a request through BAML using the provided parameters to generate a completion based on the user input and system prompt. + + Retry Behavior: + --------------- + This function automatically retries on transient failures with exponential backoff, + up to a maximum of 128 seconds. Rate limiting is applied when enabled via configuration. Parameters: ----------- - - text_input (str): The input text provided by the user for generating a response. - system_prompt (str): The system's prompt to guide the model's response. - response_model (Type[BaseModel]): The expected model type for the response. Returns: -------- - - BaseModel: A structured output generated by the model, returned as an instance of BaseModel. + + Raises: + ------- + - Exception: Re-raises the last exception after all retry attempts are exhausted. """As per coding guidelines (guideline 5), documenting behavior comprehensively is essential.
🧹 Nitpick comments (1)
cognee/infrastructure/llm/structured_output_framework/baml/baml_src/extraction/acreate_structured_output.py (1)
27-32: Document retry behavior and production alternatives. The 128-second retry window with 8-128s exponential backoff is appropriate for cloud API rate limits but won't help overwhelmed local LLMs (as noted in PR discussion). Since this is a "fail-safe for beginner developers" rather than a production solution, ensure documentation clearly states:
- This is for single-user, non-production scenarios
- Production deployments should use LiteLLM proxy or similar rate-limiting infrastructure
- Each request retries independently—parallel requests don't coordinate
Based on PR discussion feedback, do you want me to open an issue to add Mintlify documentation explaining production best practices for rate limiting?
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
cognee/infrastructure/llm/structured_output_framework/baml/baml_src/extraction/acreate_structured_output.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
**/*.py: Use 4-space indentation in Python code
Use snake_case for Python module and function names
Use PascalCase for Python class names
Use ruff format before committing Python code
Use ruff check for import hygiene and style enforcement with line-length 100 configured in pyproject.toml
Prefer explicit, structured error handling in Python code
Files:
cognee/infrastructure/llm/structured_output_framework/baml/baml_src/extraction/acreate_structured_output.py
⚙️ CodeRabbit configuration file
**/*.py: When reviewing Python code for this project:
- Prioritize portability over clarity, especially when dealing with cross-Python compatibility. However, with the priority in mind, do still consider improvements to clarity when relevant.
- As a general guideline, consider the code style advocated in the PEP 8 standard (excluding the use of spaces for indentation) and evaluate suggested changes for code style compliance.
- As a style convention, consider the code style advocated in CEP-8 and evaluate suggested changes for code style compliance.
- As a general guideline, try to provide any relevant, official, and supporting documentation links to any tool's suggestions in review comments. This guideline is important for posterity.
- As a general rule, undocumented function definitions and class definitions in the project's Python code are assumed incomplete. Please consider suggesting a short summary of the code for any of these incomplete definitions as docstrings when reviewing.
Files:
cognee/infrastructure/llm/structured_output_framework/baml/baml_src/extraction/acreate_structured_output.py
cognee/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Use shared logging utilities from cognee.shared.logging_utils in Python code
Files:
cognee/infrastructure/llm/structured_output_framework/baml/baml_src/extraction/acreate_structured_output.py
cognee/{modules,infrastructure,tasks}/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Co-locate feature-specific helpers under their respective package (modules/, infrastructure/, or tasks/)
Files:
cognee/infrastructure/llm/structured_output_framework/baml/baml_src/extraction/acreate_structured_output.py
🧬 Code graph analysis (1)
cognee/infrastructure/llm/structured_output_framework/baml/baml_src/extraction/acreate_structured_output.py (5)
cognee/shared/logging_utils.py (1)
get_logger(212-224)
cognee/shared/rate_limiting.py (1)
llm_rate_limiter_context_manager(15-21)
cognee/infrastructure/databases/graph/neo4j_driver/deadlock_retry.py (1)
wait(35-41)
cognee/infrastructure/llm/structured_output_framework/baml/baml_client/parser.py (2)
AcreateStructuredOutput(26-34)
AcreateStructuredOutput(43-51)
cognee/infrastructure/llm/structured_output_framework/baml/baml_client/sync_client.py (4)
AcreateStructuredOutput(95-118)
AcreateStructuredOutput(127-149)
AcreateStructuredOutput(158-172)
AcreateStructuredOutput(181-195)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (14)
- GitHub Check: End-to-End Tests / Concurrent Subprocess access test
- GitHub Check: End-to-End Tests / Test permissions with different situations in Cognee
- GitHub Check: End-to-End Tests / S3 Bucket Test
- GitHub Check: End-to-End Tests / Conversation sessions test (FS)
- GitHub Check: End-to-End Tests / Test Entity Extraction
- GitHub Check: End-to-End Tests / Run Telemetry Pipeline Test
- GitHub Check: End-to-End Tests / Test multi tenancy with different situations in Cognee
- GitHub Check: End-to-End Tests / Conversation sessions test (Redis)
- GitHub Check: End-to-End Tests / Test Feedback Enrichment
- GitHub Check: End-to-End Tests / Server Start Test
- GitHub Check: Basic Tests / Run Formatting Check
- GitHub Check: Basic Tests / Run Integration Tests
- GitHub Check: CLI Tests / CLI Integration Tests
- GitHub Check: CLI Tests / CLI Functionality Tests
🔇 Additional comments (1)
cognee/infrastructure/llm/structured_output_framework/baml/baml_src/extraction/acreate_structured_output.py (1)
62-67: LGTM: Two-level rate protection is appropriate. The combination of rate limiting (context manager) and retry logic (decorator) provides defense in depth: the rate limiter prevents overwhelming the API, while the retry decorator handles transient failures. When rate limiting is disabled via config, the retry logic serves as a fallback.
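Sketched together, the two layers look roughly like this; `call_baml_structured_output` is a placeholder, and the decorator settings mirror the ones shown for this file:

```python
import logging

from tenacity import retry, stop_after_delay, wait_exponential_jitter, before_sleep_log

from cognee.shared.logging_utils import get_logger
from cognee.shared.rate_limiting import llm_rate_limiter_context_manager

logger = get_logger()


@retry(
    stop=stop_after_delay(128),
    wait=wait_exponential_jitter(8, 128),
    before_sleep=before_sleep_log(logger, logging.DEBUG),
    reraise=True,
)
async def call_baml_structured_output(text_input: str, system_prompt: str):
    # Layer 1: the limiter spaces requests out before they are sent.
    async with llm_rate_limiter_context_manager():
        # Layer 2: if the provider still rejects the call, tenacity retries the
        # whole function with exponential backoff.
        ...  # placeholder for the actual BAML client invocation
```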
```python
    retry,
    stop_after_delay,
    wait_exponential_jitter,
    retry_if_not_exception_type,
```
Critical: retry_if_not_exception_type imported but not used in decorator.
Line 8 imports retry_if_not_exception_type, but it's not applied in the @retry decorator (lines 27-32). A past review comment indicated this was addressed in commits 72c20c2 to 362aa8d, but the current code doesn't include it. Without this parameter, the function will retry all exceptions—including non-retryable errors like NotFoundError (404), validation errors, and authentication failures—causing users to wait up to 128 seconds for errors that should fail immediately.
Apply this diff to add the missing retry condition:
```diff
 @retry(
     stop=stop_after_delay(128),
     wait=wait_exponential_jitter(8, 128),
+    retry=retry_if_not_exception_type(litellm.exceptions.NotFoundError),
     before_sleep=before_sleep_log(logger, logging.DEBUG),
     reraise=True,
 )
```

Ensure `litellm.exceptions` is imported at the top of the file:

```python
import litellm.exceptions
```
import litellm.exceptionsAlso applies to: 27-32
🤖 Prompt for AI Agents
In
cognee/infrastructure/llm/structured_output_framework/baml/baml_src/extraction/acreate_structured_output.py
around lines 8 and 27-32, add an import for litellm.exceptions at the top and
update the @retry decorator to use retry_if_not_exception_type so non-retryable
exceptions (e.g. litellm.exceptions.NotFoundError,
litellm.exceptions.ValidationError, litellm.exceptions.AuthenticationError) are
not retried. Concretely: add "import litellm.exceptions" near the other imports,
and modify the existing @retry(...) call to include
retry=retry_if_not_exception_type((litellm.exceptions.NotFoundError,
litellm.exceptions.ValidationError, litellm.exceptions.AuthenticationError))
while keeping the current stop, wait, and retry_error_callback settings.
Description
Add rate limit handling for BAML
Type of Change
Screenshots/Videos (if applicable)
Pre-submission Checklist
DCO Affirmation
I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin.
Summary by CodeRabbit
Improvements
Chores
Tests