Retrying nemotron-parse API calls receiving 408 timeouts #1276

Merged
jamesbraza merged 1 commit into main from retrying-timeout
Jan 27, 2026
@jamesbraza jamesbraza commented Jan 27, 2026

When serving nemotron-parse with vllm==0.14.1 on Modal (Nvidia H200), behind a FastAPI proxy that makes vLLM match Nvidia NIM, we sometimes see 408 status codes come back.

litellm==1.81.1 casts these 408s to litellm.exceptions.Timeout: https://github.com/BerriAI/litellm/blob/v1.81.1-nightly/litellm/exceptions.py#L243

This PR moves us to also retry those 408 errors.


Note

Expands retry logic for _call_nvidia_api to handle 408 Request Timeout surfaced by LiteLLM.

  • Adds _is_litellm_timeout_with_408 and updates the Tenacity decorator to also retry on retry_if_exception(_is_litellm_timeout_with_408), alongside TimeoutError
  • Imports http and retry_if_exception to support new condition

Written by Cursor Bugbot for commit 954774a.

@jamesbraza jamesbraza self-assigned this Jan 27, 2026
@jamesbraza jamesbraza added the bug Something isn't working label Jan 27, 2026
Copilot AI review requested due to automatic review settings January 27, 2026 21:42
@dosubot dosubot bot added the size:S This PR changes 10-29 lines, ignoring generated files. label Jan 27, 2026
@dosubot dosubot bot added the enhancement New feature or request label Jan 27, 2026
Copilot AI left a comment

Pull request overview

This PR adds retry logic to handle 408 (Request Timeout) status codes from the nemotron-parse API when served via vLLM on Modal. According to the PR description, litellm version 1.81.1 casts these 408 responses to litellm.exceptions.Timeout exceptions, and the new retry logic ensures these timeout errors are retried along with the existing rate limit timeouts.

Changes:

  • Added a helper function to detect litellm timeout exceptions with 408 status codes
  • Extended the retry decorator on _call_nvidia_api to retry on inference timeouts (408) in addition to rate limit timeouts
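
The two changes above can be sketched roughly as follows. This is a dependency-free illustration, not the merged code: FakeLiteLLMTimeout is a hypothetical stand-in for litellm.exceptions.Timeout, is_timeout_with_408 only approximates what _is_litellm_timeout_with_408 might check (its real body is not shown in this thread), and call_with_retries is a hand-rolled loop standing in for the Tenacity decorator on _call_nvidia_api.

```python
import http


class FakeLiteLLMTimeout(Exception):
    """Hypothetical stand-in for litellm.exceptions.Timeout (assumption)."""

    def __init__(self, message, status_code=http.HTTPStatus.REQUEST_TIMEOUT):
        super().__init__(message)
        # Assumed: the real exception exposes the HTTP status as `status_code`.
        self.status_code = status_code


def is_timeout_with_408(exc: BaseException) -> bool:
    """Return True only for the stand-in timeout carrying a 408 status."""
    return (
        isinstance(exc, FakeLiteLLMTimeout)
        and getattr(exc, "status_code", None) == http.HTTPStatus.REQUEST_TIMEOUT
    )


def call_with_retries(fn, max_attempts=3):
    """Call fn, retrying on built-in TimeoutError or a 408 timeout.

    Any other exception, or a retryable one on the last attempt, propagates.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            retryable = isinstance(exc, TimeoutError) or is_timeout_with_408(exc)
            if not retryable or attempt == max_attempts:
                raise
```

With Tenacity, the same predicate would presumably be passed to retry_if_exception and combined with the existing TimeoutError condition, as the Bugbot note describes; the loop here only shows the intended behavior without backoff or waiting.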

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Jan 27, 2026
@jamesbraza jamesbraza merged commit 1a6ed3f into main Jan 27, 2026
12 of 14 checks passed
@jamesbraza jamesbraza deleted the retrying-timeout branch January 27, 2026 21:52