feat: Trtllm canary health check #3082
Signed-off-by: [email protected] <[email protected]>
…annel to discover registered endpoints. Signed-off-by: [email protected] <[email protected]>
…INT_HEALTH_STATUS Signed-off-by: [email protected] <[email protected]>
…caches are consistent Signed-off-by: [email protected] <[email protected]>
Signed-off-by: Tzu-Ling Kan <[email protected]>
Walkthrough
Updates the TensorRT-LLM health check to a token-based payload structure and wires this payload into the main server startup flow. The main routine now constructs TrtllmHealthCheckPayload and passes its dict to endpoint.serve_endpoint in both execution branches.
Sequence Diagram(s)
sequenceDiagram
autonumber
participant Main as Main (trtllm/main.py)
participant HC as TrtllmHealthCheckPayload
participant EP as Endpoint
Main->>HC: Construct()
HC-->>Main: to_dict() -> health_check_payload
alt Publish disabled
Note over Main,EP: New: pass payload to serve_endpoint
Main->>EP: serve_endpoint(health_check_payload)
else Publish enabled
Note over Main,EP: New: pass payload to serve_endpoint
Main->>EP: serve_endpoint(health_check_payload)
end
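The flow in the diagram can be sketched in plain Python. This is a minimal, self-contained illustration under stated assumptions, not the PR's code: `CanaryPayload`, `FakeEndpoint`, and the environment variable name `EXAMPLE_HEALTH_CHECK_PAYLOAD` are hypothetical stand-ins for `TrtllmHealthCheckPayload`, the real endpoint object, and whatever override variable the base class reads. The default payload fields are assumed from the review comments further down.

```python
import asyncio
import json
import os

# Hypothetical env var name, used only for this sketch.
ENV_VAR = "EXAMPLE_HEALTH_CHECK_PAYLOAD"


class CanaryPayload:
    """Stand-in for TrtllmHealthCheckPayload: env override, else a default."""

    def __init__(self):
        # Assumed shape: one input token, one output token, greedy decode.
        self.default_payload = {
            "token_ids": [1],
            "stop_conditions": {"max_tokens": 1},
            "sampling_options": {"temperature": 0.0},
        }

    def to_dict(self):
        raw = os.environ.get(ENV_VAR)
        if raw:
            return json.loads(raw)
        return dict(self.default_payload)


class FakeEndpoint:
    """Stand-in for the runtime endpoint; it only records what it was given."""

    def __init__(self):
        self.received = None

    async def serve_endpoint(self, handler, health_check_payload=None):
        self.received = health_check_payload


async def handler(request):
    # Placeholder request handler; never invoked in this sketch.
    yield {"token_ids": [0]}


async def start(endpoint, publish_enabled):
    payload = CanaryPayload().to_dict()
    if publish_enabled:
        # In the real server, metrics publishing would be set up here first.
        await endpoint.serve_endpoint(handler, health_check_payload=payload)
    else:
        await endpoint.serve_endpoint(handler, health_check_payload=payload)
    return payload
```

Both branches pass the same dict, which is the point of the "New" notes in the diagram: the canary request does not depend on whether publishing is enabled.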
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Pre-merge checks
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 0
🧹 Nitpick comments (4)
components/backends/trtllm/src/dynamo/trtllm/health_check.py (3)
27-27: Avoid hard‑coding token_id=1; prefer model BOS at runtime.
ID 1 isn’t universal across tokenizers. Derive BOS (or PAD fallback) from the active tokenizer to reduce cross‑model fragility. I’ve proposed a main.py patch to do this post‑construction.
Would you like me to wire a sentinel (e.g., "MODEL_BOS") here and replace it in main.py?
29-35: Option: set ignore_eos=True for guaranteed one decode step.
If some models instantly emit EOS, setting ignore_eos=True ensures a decode step still occurs. Not mandatory; current settings are acceptable for fast liveness.
Please confirm whether a zero-token early stop is considered “healthy” in your SLOs.
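As a rough illustration of this option, the sketch below flips `ignore_eos` on a copy of a canary payload. The payload shape (`token_ids`, `stop_conditions`, `sampling_options`) is assumed from the field names mentioned in this review, and `with_forced_decode` is a hypothetical helper, not part of the PR.

```python
import copy

# Field names are assumed from the review comments, not copied from the PR.
BASE_PAYLOAD = {
    "token_ids": [1],
    "stop_conditions": {"max_tokens": 1, "ignore_eos": False},
    "sampling_options": {"temperature": 0.0},
}


def with_forced_decode(payload):
    """Return a deep copy of payload that asks the engine to ignore EOS."""
    out = copy.deepcopy(payload)
    out.setdefault("stop_conditions", {})["ignore_eos"] = True
    return out


forced = with_forced_decode(BASE_PAYLOAD)
```

Working on a deep copy keeps the shared default untouched, so the stricter variant can be opted into per deployment.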
36-45: Trim redundant sampling knobs for greedy decode.
With temperature=0.0, the settings top_k=1, top_p=1.0, and beam_width=1 are largely redundant. If handler defaults cover these, consider the simplification below; apply it only if the optional fields are truly optional in request parsing.
-    "sampling_options": {
-        "temperature": 0.0,
-        "top_p": 1.0,
-        "top_k": 1,
-        "beam_width": 1,
-        "repetition_penalty": 1.0,
-        "presence_penalty": 0.0,
-        "frequency_penalty": 0.0,
-        "seed": None,
-    },
+    "sampling_options": {
+        "temperature": 0.0
+    },
components/backends/trtllm/src/dynamo/trtllm/main.py (1)
320-322: Align canary token with the active tokenizer (BOS/PAD fallback)
Verified request handlers reference token_ids, stop_conditions, and sampling_options in components/backends/trtllm/src/dynamo/trtllm/request_handlers/handler_base.py. Apply the best-effort BOS→PAD alignment to the health-check payload in components/backends/trtllm/src/dynamo/trtllm/main.py (around lines 320–322):
-    # Get health check payload (checks env var and falls back to TensorRT-LLM default)
-    health_check_payload = TrtllmHealthCheckPayload().to_dict()
+    # Get health check payload (env override or default), then align token_id with model if possible
+    health_check_payload = TrtllmHealthCheckPayload().to_dict()
+    try:
+        if isinstance(health_check_payload, dict) and "token_ids" in health_check_payload:
+            ids = health_check_payload.get("token_ids") or []
+            if ids == [1] or not ids:
+                bos = getattr(tokenizer, "bos_token_id", None)
+                if isinstance(bos, int) and bos >= 0:
+                    health_check_payload["token_ids"] = [bos]
+                else:
+                    pad = getattr(tokenizer, "pad_token_id", None)
+                    if isinstance(pad, int) and pad >= 0:
+                        health_check_payload["token_ids"] = [pad]
+    except Exception:
+        # Best-effort only; keep payload as-is if anything goes wrong
+        pass
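The same BOS→PAD fallback could also be factored into a small pure function, which is easier to unit test than an inline try/except. The following is a sketch only: `align_canary_token` and `_valid_id` are hypothetical names, the tokenizer is assumed to expose `bos_token_id` and `pad_token_id` attributes (as in the suggestion above), and the explicit `bool` exclusion is an addition of this sketch.

```python
from types import SimpleNamespace


def _valid_id(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool) and value >= 0


def align_canary_token(payload, tokenizer, placeholder=(1,)):
    """Return a copy of payload whose token_ids use the tokenizer's BOS or PAD id.

    Only the placeholder ids (or an empty list) are replaced; explicit ids
    supplied by the caller, or payloads without token_ids, are left alone.
    """
    if "token_ids" not in payload:
        return dict(payload)
    ids = list(payload.get("token_ids") or [])
    if ids and ids != list(placeholder):
        return dict(payload)
    for attr in ("bos_token_id", "pad_token_id"):
        candidate = getattr(tokenizer, attr, None)
        if _valid_id(candidate):
            return {**payload, "token_ids": [candidate]}
    return dict(payload)


# Usage with a stand-in tokenizer object (hypothetical ids).
tok = SimpleNamespace(bos_token_id=128000, pad_token_id=0)
aligned = align_canary_token({"token_ids": [1]}, tok)
```

Returning a new dict rather than mutating in place means a failure to resolve an id simply yields the original payload, with no exception handling needed at the call site.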
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
components/backends/trtllm/src/dynamo/trtllm/health_check.py (1 hunks)
components/backends/trtllm/src/dynamo/trtllm/main.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
components/backends/trtllm/src/dynamo/trtllm/main.py (4)
components/backends/trtllm/src/dynamo/trtllm/health_check.py (1)
TrtllmHealthCheckPayload (13-47)
lib/bindings/python/src/dynamo/health_check.py (1)
to_dict (86-96)
lib/bindings/python/src/dynamo/_core.pyi (2)
endpoint (207-211)
serve_endpoint (220-232)
lib/bindings/python/rust/lib.rs (2)
endpoint (549-555)
serve_endpoint (569-622)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Build and Test - vllm
- GitHub Check: Mirror Repository to GitLab
- GitHub Check: Build and Test - sglang
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (3)
components/backends/trtllm/src/dynamo/trtllm/main.py (3)
30-30: Import looks good.
Clean, minimal integration point.
341-344: Passing health_check_payload in metrics branch: LGTM.
Matches serve_endpoint signature and keeps graceful_shutdown default.
347-349: Passing health_check_payload in non‑metrics branch: LGTM.
Consistent behavior across both paths.
rmccorm4
left a comment
LGTM
The GitLab failure for test_chat_only_aggregated_with_test_logits_processor appears unrelated to this change and is happening on other branches as well:
E AssertionError: Expected content not found in response. Missing: ['AI']
Signed-off-by: [email protected] <[email protected]> Signed-off-by: Tzu-Ling Kan <[email protected]> Signed-off-by: Kristen Kelleher <[email protected]>
Overview:
Trtllm canary health check
Details:
Add a canary health check for trtllm. The main logic is in PR #2903.
Where should the reviewer start?
components/backends/trtllm/src/dynamo/trtllm/health_check.py: default payload
components/backends/trtllm/src/dynamo/trtllm/main.py: add payload to endpoint
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
DIS-635
Summary by CodeRabbit
Impact: Improves reliability and compatibility of health checks for the TensorRT-LLM backend.