Fixing 60-sec wait for retrying nemotron-parse API calls' 408 timeouts #1277
jamesbraza merged 1 commit into main
Pull request overview
Adjusts retry behavior for nemotron-parse API calls so LiteLLM-surfaced 408 timeouts don’t incur the long (60s+) backoff intended for rate-limiter timeouts.
Changes:
- Splits the retry logic for `_call_nvidia_api` into separate Tenacity retries: one for `TimeoutError` (with exponential backoff) and one for LiteLLM 408 timeouts (without the long backoff).
- Introduces/uses `_is_litellm_timeout_with_408` as the predicate for identifying 408 timeouts.
  @retry(
-     retry=(
-         retry_if_exception_type(TimeoutError)  # Hitting rate limits
-         | retry_if_exception(_is_litellm_timeout_with_408)  # Inference timeout
-     ),
+     retry=retry_if_exception_type(TimeoutError),  # Hitting rate limits
      stop=stop_after_attempt(3),
      wait=wait_exponential(multiplier=2, min=GLOBAL_RATE_LIMITER_TIMEOUT),
      before_sleep=before_sleep_log(logger, logging.WARNING),
  )
+ @retry(
+     retry=retry_if_exception(_is_litellm_timeout_with_408),  # Inference timeout
+     stop=stop_after_attempt(3),
+     before_sleep=before_sleep_log(logger, logging.WARNING),
+ )
Stacking multiple tenacity @retry decorators makes the retry semantics hard to reason about and can lead to more attempts and/or waits than intended (e.g., an exception that satisfies both retry predicates can be retried by the inner decorator and then again by the outer one, potentially re-introducing the long wait_exponential(..., min=GLOBAL_RATE_LIMITER_TIMEOUT) delay you’re trying to avoid for 408s). Consider collapsing this back to a single @retry with a combined retry= predicate and a wait= callable that chooses the backoff based on the caught exception (or otherwise ensure the TimeoutError retry explicitly excludes the 408 case).
Not a bad idea, but YAGNI for now
#1276 incorrectly grouped the timeout errors, leading to an unnecessary 60-sec retry wait in the 408 timeout error case
Note
Adjusts `_call_nvidia_api` retry strategy to avoid unnecessary backoff on inference timeouts.
- Splits LiteLLM timeouts (`Timeout` with status 408) into their own `@retry` without exponential wait
- Keeps exponential backoff for `TimeoutError` and preserves retries for `NemotronBBoxError`
This prevents the previous ~60s wait when retrying 408 errors while maintaining proper backoff for rate limits.
Written by Cursor Bugbot for commit 8218ef9.