Conversation

Contributor

@rmccorm4 rmccorm4 commented Aug 20, 2025

Overview:

Clarify that both the prefill and decode workers should be successfully started before sending any inference requests, by using the health endpoint.
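
For reference, the readiness check added to the guide polls the frontend's health endpoint:

```bash
curl http://localhost:8000/health
```

The deployment is ready once the response reports a healthy status and lists both worker endpoints, as shown in the review below:

```json
{
  "endpoints": [
    "dyn://dynamo.tensorrt_llm.generate",
    "dyn://dynamo.tensorrt_llm_next.generate"
  ],
  "status": "healthy"
}
```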

Future Work

There will be a dedicated doc on the health endpoint and how to use it in the near future.

Summary by CodeRabbit

  • Documentation
    • Updated deployment verification steps to include a preflight readiness check that polls health until both prefill and decode workers are up, with guidance to monitor logs if one is still starting.
    • Revised testing instructions to use a real OpenAI-compatible API call to the responses endpoint with configurable parameters.
    • Applied these updates in both relevant sections of the guide for consistency.

Contributor

coderabbitai bot commented Aug 20, 2025

Walkthrough

Revises the TRT-LLM GPT OSS deployment guide to add a readiness polling step against /health (checking prefill and decode workers/endpoints) and replaces the test step with an OpenAI-compatible /v1/responses curl example. The same changes are applied in two places within the document. No code/public APIs changed.
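
For illustration, the new test step amounts to a request like the following sketch (the model name and input below are placeholders, not values taken from the doc):

```bash
# Hypothetical example; substitute the model your deployment actually serves.
curl http://localhost:8000/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-120b",
    "input": "Hello, world!"
  }'
```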

Changes

| Cohort / File(s) | Change summary |
| --- | --- |
| **Docs: TRT-LLM GPT OSS deployment verification**<br>`components/backends/trtllm/gpt-oss.md` | Replaced “Test the Deployment” with “Verify the Deployment is Ready,” adding /health polling for prefill/decode worker readiness and listing expected dyn endpoints. Updated the subsequent test to an OpenAI-compatible POST /v1/responses curl example. Applied in two duplicated sections within the doc. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
  autonumber
  actor User
  participant Server as Inference Server
  participant Prefill as Prefill Worker
  participant Decode as Decode Worker

  User->>Server: GET /health (poll)
  Server->>Prefill: Check prefill status
  Server->>Decode: Check decode status
  Prefill-->>Server: status=healthy|starting
  Decode-->>Server: status=healthy|starting
  Server-->>User: { endpoints: [...], statuses }

  alt Both healthy
    User->>Server: POST /v1/responses {model,input,...}
    Server-->>User: 200 OK, response payload
  else Any starting
    User-->>User: Wait and continue polling /health
    Note over User,Server: Monitor logs until both endpoints are healthy
  end
```

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

A whisk of ears, I watch the lights,
Prefill hums, decode ignites.
I ping /health—two greens in view,
Then /v1/responses sings anew.
Logs like clover guide my hop—
Ready, steady—carrots pop! 🥕✨

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (4)
components/backends/trtllm/gpt-oss.md (4)

217-219: Specify code fence language for curl example (markdownlint MD040).

Add the language to the fenced block for better syntax highlighting and to satisfy markdownlint.

-```
+```bash
 curl http://localhost:8000/health
-```
+```

221-221: Clarify success criteria (status must be healthy).

Explicitly state that overall status should be healthy before sending requests.

-Make sure that both of the endpoints are available before sending an inference request:
+Make sure both endpoints are listed and the overall status is "healthy" before sending any inference requests:

223-230: Specify code fence language for JSON example (markdownlint MD040).

Add the language to the fenced block.

-```
+```json
 {
   "endpoints": [
     "dyn://dynamo.tensorrt_llm.generate",
     "dyn://dynamo.tensorrt_llm_next.generate"
   ],
   "status": "healthy"
 }
-```
+```

217-221: Optional: include a one-liner to poll /health until ready.

This helps prevent premature requests during startup. Note: requires jq.

 curl http://localhost:8000/health

+Optional (requires jq): wait until the deployment is ready before proceeding:
+
+```bash
+until curl -fsS http://localhost:8000/health | jq -e '.status=="healthy" and (.endpoints|index("dyn://dynamo.tensorrt_llm.generate")) and (.endpoints|index("dyn://dynamo.tensorrt_llm_next.generate"))' >/dev/null; do
+  echo "Waiting for prefill and decode workers to become healthy..."
+  sleep 2
+done
+```


📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between f73d35d50c389926a934de131751796ce5050766 and d13febe87b5671196b3286419e435b7c6402b16b.

📒 Files selected for processing (1)

  • components/backends/trtllm/gpt-oss.md (1 hunks)

🧰 Additional context used

🪛 LanguageTool
components/backends/trtllm/gpt-oss.md

[grammar] ~232-~232: There might be a mistake here.
Context: ... still be starting up. You can watch the worker logs to see the progress of worke...

(QB_NEW_EN)

🪛 markdownlint-cli2 (0.17.2)
components/backends/trtllm/gpt-oss.md

217-217: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

222-222: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)

  • GitHub Check: Build and Test - dynamo
  • GitHub Check: pre-merge-rust (lib/runtime/examples)
  • GitHub Check: pre-merge-rust (lib/bindings/python)
  • GitHub Check: pre-merge-rust (.)

🔇 Additional comments (1)
components/backends/trtllm/gpt-oss.md (1)

214-216: Nice addition: readiness check before inference.

Adding an explicit health check step to ensure both prefill and decode are up before sending traffic improves reliability and user experience.


@rmccorm4 rmccorm4 merged commit 8380f1b into main Aug 20, 2025
12 of 13 checks passed
@rmccorm4 rmccorm4 deleted the rmccormick/nvbugs5468257 branch August 20, 2025 17:12
hhzhang16 pushed a commit that referenced this pull request Aug 27, 2025