feat: timing metrics reporting through nvext for vLLM backend
#4661
base: main
Conversation
👋 Hi zhongxuanwang-nv! Thank you for contributing to ai-dynamo/dynamo. Just a reminder: The 🚀
Signed-off-by: Zhongxuan Wang <[email protected]>
9f7ee4e to 08d432b
@coderabbitai review
✅ Actions performed: Review triggered.
Walkthrough
End-to-end timing metrics support is added across Python and Rust layers. The vllm handlers collect timing data (request received, prefill/decode start/end), the HTTP service injects boundary timestamps, the preprocessor extracts timing fields, and delta response builders include timing metrics in nvext responses.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~28 minutes
Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
components/src/dynamo/vllm/handlers.py (1)
1-1: Fix Black formatting before merge. The pipeline failure indicates that Black formatting failed. Please run `black` on this file to resolve the formatting issue. Based on the pipeline failure log, run:
black components/src/dynamo/vllm/handlers.py
🧹 Nitpick comments (1)
lib/llm/src/http/service/openai.rs (1)
59-90: Consider returning a success indicator. The function silently handles errors (line 64, `.ok()`) and structural mismatches (lines 69-88, nested `as_object_mut()`). While timing metrics are non-critical and silent failures may be acceptable, consider returning a `bool` to indicate success for better observability. If you want to track silent failures, you could refactor to:
```diff
-fn inject_request_completed_seconds(nvext: &mut Option<serde_json::Value>) {
+fn inject_request_completed_seconds(nvext: &mut Option<serde_json::Value>) -> bool {
     let ts = SystemTime::now()
         .duration_since(UNIX_EPOCH)
         .map(|d| d.as_secs_f64())
         .ok();
-    if let Some(ts) = ts {
+    let Some(ts) = ts else {
+        tracing::trace!("Failed to get current timestamp for request_completed_seconds");
+        return false;
+    };
+
     let nvext = nvext.get_or_insert_with(|| serde_json::Value::Object(serde_json::Map::new()));
     if let Some(obj) = nvext.as_object_mut() {
         if let Some(timing) = obj.get_mut("timing_metrics") {
             if let Some(timing_obj) = timing.as_object_mut() {
                 timing_obj.insert(
                     "request_completed_seconds".to_string(),
                     serde_json::Value::from(ts),
                 );
+                return true;
             }
         } else {
             let mut timing_obj = serde_json::Map::new();
             timing_obj.insert(
                 "request_completed_seconds".to_string(),
                 serde_json::Value::from(ts),
             );
             obj.insert(
                 "timing_metrics".to_string(),
                 serde_json::Value::Object(timing_obj),
             );
+            return true;
         }
     }
-    }
+    false
 }
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- components/src/dynamo/vllm/handlers.py (8 hunks)
- lib/llm/src/http/service/openai.rs (6 hunks)
- lib/llm/src/preprocessor.rs (1 hunks)
- lib/llm/src/protocols/common/preprocessor.rs (1 hunks)
- lib/llm/src/protocols/openai/chat_completions/delta.rs (2 hunks)
- lib/llm/src/protocols/openai/completions/delta.rs (2 hunks)
- lib/llm/src/protocols/openai/nvext.rs (4 hunks)
🧰 Additional context used
🧠 Learnings (6)
📚 Learning: 2025-09-02T16:46:54.015Z
Learnt from: GuanLuo
Repo: ai-dynamo/dynamo PR: 2714
File: lib/llm/src/discovery/model_entry.rs:38-42
Timestamp: 2025-09-02T16:46:54.015Z
Learning: In lib/llm/src/discovery/model_entry.rs, GuanLuo prefers not to add serde defaults for model_type and model_input fields to keep the specification explicit and avoid user errors, relying on atomic deployment strategy to avoid backward compatibility issues.
Applied to files:
lib/llm/src/protocols/openai/nvext.rs
📚 Learning: 2025-09-08T21:18:43.478Z
Learnt from: nachiketb-nvidia
Repo: ai-dynamo/dynamo PR: 2936
File: lib/parsers/src/reasoning/granite_parser.rs:42-46
Timestamp: 2025-09-08T21:18:43.478Z
Learning: In GraniteReasoningParser in lib/parsers/src/reasoning/granite_parser.rs, the think_start_tokens and think_end_tokens are hardcoded in the constructor with fixed values, so unwrap() calls on these vectors are safe and won't panic.
Applied to files:
lib/llm/src/protocols/openai/nvext.rs
📚 Learning: 2025-06-24T20:59:35.725Z
Learnt from: ishandhanani
Repo: ai-dynamo/dynamo PR: 1626
File: lib/llm/src/preprocessor.rs:238-239
Timestamp: 2025-06-24T20:59:35.725Z
Learning: In lib/llm/src/preprocessor.rs, the `sampling_options` call in the `preprocess_request` method is placed in the common section after the match statement on `request.prompt_input_type()`, meaning it applies to both `PromptInput::Tokens` and `PromptInput::Text` request types.
Applied to files:
lib/llm/src/protocols/common/preprocessor.rs
📚 Learning: 2025-08-22T19:55:41.608Z
Learnt from: nachiketb-nvidia
Repo: ai-dynamo/dynamo PR: 2656
File: lib/llm/src/protocols/openai/chat_completions/delta.rs:320-327
Timestamp: 2025-08-22T19:55:41.608Z
Learning: There are two separate DeltaGenerator classes in the codebase: one for chat completions (lib/llm/src/protocols/openai/chat_completions/delta.rs with object "chat.completion.chunk") and one for text completions (lib/llm/src/protocols/openai/completions/delta.rs with object "text_completion"). They have different create_choice method signatures and serve different OpenAI API endpoints. The reasoning parsing functionality is only relevant to the chat completions DeltaGenerator.
Applied to files:
lib/llm/src/protocols/openai/chat_completions/delta.rs
lib/llm/src/protocols/openai/completions/delta.rs
📚 Learning: 2025-08-22T19:55:41.608Z
Learnt from: nachiketb-nvidia
Repo: ai-dynamo/dynamo PR: 2656
File: lib/llm/src/protocols/openai/chat_completions/delta.rs:320-327
Timestamp: 2025-08-22T19:55:41.608Z
Learning: The create_choice method exists on multiple different objects in the codebase. The DeltaGenerator::create_choice in lib/llm/src/protocols/openai/chat_completions/delta.rs has its own signature that was updated to include reasoning_content, but other objects in lib/llm/src/engines.rs have their own separate create_choice methods with different signatures that are not related to chat completions.
Applied to files:
lib/llm/src/protocols/openai/chat_completions/delta.rs
lib/llm/src/protocols/openai/completions/delta.rs
📚 Learning: 2025-08-25T22:04:45.205Z
Learnt from: nachiketb-nvidia
Repo: ai-dynamo/dynamo PR: 2700
File: lib/llm/src/protocols/openai/chat_completions/delta.rs:19-28
Timestamp: 2025-08-25T22:04:45.205Z
Learning: The response_generator() method exists on multiple request types in the codebase: NvCreateChatCompletionRequest (for chat completions) and NvCreateCompletionRequest (for text completions). When making signature changes, it's important to distinguish between these different object types as they have separate implementations and call sites.
Applied to files:
lib/llm/src/http/service/openai.rs
🧬 Code graph analysis (6)
lib/llm/src/preprocessor.rs (4)
  lib/llm/src/protocols/openai/nvext.rs (2): nvext (9-9), builder (99-101)
  lib/llm/src/protocols/openai/responses.rs (3): nvext (44-46), nvext (104-106), nvext (151-153)
  lib/llm/src/protocols/openai/chat_completions.rs (3): nvext (74-76), nvext (134-136), nvext (238-240)
  lib/llm/src/protocols/openai.rs (2): nvext (50-50), nvext (60-60)
lib/llm/src/protocols/openai/nvext.rs (1)
  lib/llm/src/preprocessor.rs (1): builder (202-248)
lib/llm/src/protocols/common/preprocessor.rs (1)
  lib/llm/src/preprocessor.rs (1): builder (202-248)
lib/llm/src/protocols/openai/chat_completions/delta.rs (2)
  lib/llm/src/protocols/openai/chat_completions/aggregator.rs (2): new (92-104), default (63-65)
  lib/llm/src/protocols/openai/nvext.rs (2): default (93-95), nvext (9-9)
lib/llm/src/protocols/openai/completions/delta.rs (2)
  lib/llm/src/protocols/openai/nvext.rs (2): default (93-95), nvext (9-9)
  lib/llm/src/protocols/openai/responses.rs (3): nvext (44-46), nvext (104-106), nvext (151-153)
components/src/dynamo/vllm/handlers.py (1)
  components/src/dynamo/sglang/request_handlers/handler_base.py (1): BaseWorkerHandler (20-222)
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/4661/merge) by zhongxuanwang-nv.
components/src/dynamo/vllm/handlers.py
[error] 1-1: Black formatting failed. 1 file reformatted by the hook; please re-run the commit or CI to pass: 'pre-commit run --show-diff-on-failure --color=always --all-files'.
🔇 Additional comments (19)
lib/llm/src/preprocessor.rs (1)
240-244: LGTM! The extraction of `request_received_seconds` from nvext follows the established pattern for other nvext fields and maintains consistency with the timing metrics feature.
lib/llm/src/protocols/openai/nvext.rs (3)
84-89: LGTM! The new `request_received_seconds` field is properly integrated with appropriate serde attributes and clear documentation.
79-79: LGTM! The documentation correctly lists the expanded set of supported extra_fields.
136-160: LGTM! Tests properly validate the new field and the expanded extra_fields list.
lib/llm/src/protocols/common/preprocessor.rs (1)
105-109: LGTM! The `request_received_seconds` field is well-documented and follows the established patterns for optional fields in the struct.
components/src/dynamo/vllm/handlers.py (5)
77-87: LGTM! The helper functions provide clean abstractions for timing metrics detection and timestamp retrieval.
313-393: LGTM! The timing metrics initialization and decode tracking logic is sound (a rough sketch follows the list below):
- Frontend timestamp is preserved when merging with prefill timing (lines 349-352)
- Aggregated mode (no prefill_result) correctly treats decode_start as prefill_start (line 380)
- First token time is used as prefill_end in aggregated mode (line 392), which aligns with the semantic that prefill completes when the first token is generated
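For orientation, here is a rough sketch of how a decode-side handler could assemble this map. The helper and field names follow the PR description; the dict shapes of `request_timing` and `prefill_result` are assumptions for illustration, not the actual handlers.py code:

```python
import time


def _get_current_time_seconds() -> float:
    """Wall-clock timestamp in seconds, matching the *_seconds field naming."""
    return time.time()


def build_timing_metrics(request_timing: dict, prefill_result: dict | None) -> dict:
    """Assemble the timing_metrics map on the decode side (illustration only)."""
    timing: dict = {}

    # The frontend ingress timestamp, if present, is preserved when merging
    # with prefill-side timing.
    if "request_received_seconds" in request_timing:
        timing["request_received_seconds"] = request_timing["request_received_seconds"]

    decode_start = _get_current_time_seconds()
    timing["decode_start_seconds"] = decode_start

    if prefill_result is not None:
        # Disaggregated mode: the prefill worker reported its own timing window.
        timing.update(prefill_result.get("timing_metrics", {}))
    else:
        # Aggregated mode: decode start doubles as prefill start; prefill_end
        # is filled in later with the first-token time.
        timing["prefill_start_seconds"] = decode_start

    return timing
```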
402-413: LGTM! The conditional injection of timing_metrics is correct (see the sketch after this list):
- Only when finish_reason is present (the final chunk)
- Only when timing is requested
- Safely initializes disaggregated_params if needed
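A minimal sketch of the guarded injection (the `tok` dict and the `disaggregated_params` key follow the PR description; everything else is illustrative):

```python
def attach_timing_metrics(tok: dict, timing_metrics: dict, timing_requested: bool) -> None:
    """Attach timing only on the final chunk (finish_reason set) and only when requested."""
    if not timing_requested or tok.get("finish_reason") is None:
        return
    # Safely initialize disaggregated_params before writing into it.
    tok.setdefault("disaggregated_params", {})["timing_metrics"] = timing_metrics
```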
267-271: LGTM! The refactored completion_usage construction is cleaner and preserves the existing structure while preparing for timing metrics augmentation.
447-524: LGTM! The prefill worker timing metrics logic correctly (sketched after this list):
- Captures prefill_start_seconds at the start of processing (line 460)
- Records prefill_end_seconds when the result is available (lines 521-523)
- Builds disaggregated_params with both kv_transfer_params and timing_metrics (lines 515-524)
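Sketched in Python under stated assumptions: the engine call and result shape are hypothetical, while the field names and the kv_transfer_params/timing_metrics layout come from this review.

```python
import time


async def prefill_generate(engine, request: dict, include_timing: bool) -> dict:
    """Illustrative prefill-side flow: stamp start/end around the engine call."""
    prefill_start_seconds = time.time()

    result = await engine.prefill(request)  # hypothetical engine API

    disaggregated_params = {"kv_transfer_params": result.get("kv_transfer_params")}
    if include_timing:
        disaggregated_params["timing_metrics"] = {
            "prefill_start_seconds": prefill_start_seconds,
            "prefill_end_seconds": time.time(),
        }
    return disaggregated_params
```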
lib/llm/src/protocols/openai/completions/delta.rs (2)
268-286: LGTM! The generic extraction of worker_id and timing_metrics from disaggregated_params into nvext is clean and extensible. The conditional check ensures nvext is only set when there's actual data to include.
304-350: LGTM! The test thoroughly validates that both timing_metrics and worker_id are correctly extracted from disaggregated_params and placed into nvext.
lib/llm/src/protocols/openai/chat_completions/delta.rs (2)
365-383: LGTM! The nvext construction from disaggregated_params follows the same clean pattern as the completions path, maintaining consistency across the codebase.
464-506: LGTM! The test provides thorough coverage of timing_metrics and worker_id extraction into nvext.
lib/llm/src/http/service/openai.rs (5)
317-329: LGTM! Capturing `request_received_seconds` immediately at handler entry is the right approach for accurate timing. The safe initialization with `get_or_insert_with(Default::default)` is appropriate.
453-468: LGTM! The injection on the final content chunk (with `finish_reason`) aligns with the backend's timing_metrics population strategy, avoiding injection on the separate usage-only chunk.
751-763: LGTM! Consistent timing capture across both completions and chat completions endpoints.
898-912: LGTM! The stream mapping for chat completions follows the same correct pattern as completions.
2149-2184: LGTM! The tests validate the two key scenarios:
- Preserving existing fields when injecting into existing timing_metrics
- Creating the structure from scratch when nvext is None
Both scenarios are well-covered.
Signed-off-by: Zhongxuan Wang <[email protected]>
/ok to test 8588689
Signed-off-by: Zhongxuan Wang <[email protected]>
/ok to test 0f1d019
```rust
let nvext_response = NvExtResponse {
    worker_id: Some(worker_id_info),
};
```
Is worker_id gone? I think AI Gateway uses it.
It's definitely not gone, I just removed the NvExtResponse and WorkerIdInfo structs and instead export the worker_id JSON directly, because I figured it would look less redundant :)
Do you think I should bring these two structs back, even though no Dynamo code is referencing them?
Thank you Graham!
Signed-off-by: Zhongxuan Wang <[email protected]>
keivenchang left a comment
Hey @zhongxuanwang-nv, thanks for pushing a first pass of timing metrics. The feature itself is useful, you've wired it up end to end (HTTP frontend, prefill, decode), and you've added tests to exercise the new paths (love it!). That said, I need to raise some architectural and readability concerns that are likely to cause issues in the future, so I'd treat this as a good starting point that may need some revisions.
Architecturally, the biggest concern for me is that you're using absolute timestamps taken on different machines (HTTP frontend vs prefill vs decode nodes), which quietly assumes that all clocks are in sync. That's never a 100% guarantee; it's just a matter of time until one of our clients sees a problem and files a bug. An alternative design that would guarantee consistent timing is to have the HTTP frontend own the absolute start and end times, while prefill and decode workers report their own relative or cumulative durations (for example, “prefill took X ms, decode took Y ms”) instead of wall-clock timestamps. The frontend can then assemble end-to-end timing using a single clock. Would love to hear @nnshah1, @rmccorm4, and @grahamking's thoughts on using absolute wall-clock times from various machines.
Now, if you really really want to keep absolute timestamps across nodes, I think you should add a big explicit disclaimer in your code and docs, calling out the NTP/clock-sync requirement and warning that cross-node deltas are best-effort and will be incorrect if clocks drift. It's just a matter of time before that happens, somewhere, in the future.
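To make the single-clock alternative concrete, here is a minimal sketch (purely illustrative; `prefill_ms`, `decode_ms`, and the function name are made up and not part of this PR): workers report relative durations, and only the frontend reads its own clock for absolute times.

```python
import time


def assemble_e2e_timing(frontend_start_seconds: float, worker_durations: dict) -> dict:
    """Frontend-owned clock: absolute start/end come from one machine, while
    workers contribute relative durations that clock drift cannot corrupt."""
    now = time.time()
    return {
        "request_received_seconds": frontend_start_seconds,
        "request_completed_seconds": now,
        "e2e_ms": (now - frontend_start_seconds) * 1000.0,
        "prefill_ms": worker_durations.get("prefill_ms"),
        "decode_ms": worker_durations.get("decode_ms"),
    }
```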
On readability, especially for Python dicts, it would help new readers of your code a lot to show examples of names and fields. It will also help AI tools understand the intended structure and intent, and quickly help you find issues should there be any in the future.
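For instance, the final-chunk nvext payload this PR produces could be documented with an example along these lines (field names come from the PR description; the values and the worker_id type are made up):

```python
# Hypothetical example of nvext on the final streamed chunk; numbers are fake.
example_nvext = {
    "timing_metrics": {
        "request_received_seconds": 1733000000.001,
        "prefill_start_seconds": 1733000000.010,
        "prefill_end_seconds": 1733000000.150,
        "decode_start_seconds": 1733000000.151,
        "decode_first_token_seconds": 1733000000.200,
        "decode_end_seconds": 1733000001.900,
        "request_completed_seconds": 1733000001.905,
    },
    "worker_id": "worker-0",
}
```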
Finally, and this is just a debatable opinion of mine -- the name extra_fields is very generic, not so much industry standard. I can imagine that in the future, it could become a placeholder for someone to start dumping “one more field” later, whether it's related to metrics or not. How about calling it telemetry_fields, observability_fields, or response_metadata_fields? Or, metadata might be somewhat acceptable... maybe. @grahamking may also have some good ideas as he is good with giving good and clear names.
Thank you so much @keivenchang for your comments! They are really insightful and helpful!
Yes! This is indeed a constraint and I also like your proposal! If I understood it correctly, we only have
Yes! Indeed, and I deeply apologize for not including this request info! I just did, and will make sure in the future to also include both a sample request / response when I deal with anything API related :)
That's true. The
Signed-off-by: Zhongxuan Wang <[email protected]>
d390ba5 to b023688
Signed-off-by: Zhongxuan Wang <[email protected]>
@zhongxuanwang-nv thanks for your contribution. In general, we have been trying to move away from engine-specific pipelines in dynamo, so I'm not sure if we should have the Python handlers of vllm handle the timing metrics at all. Unless mistaken, I believe it's possible this can be done entirely in … I have a PR up here that just handles …
PeaBrane left a comment
comments left
Overview:
This PR introduces end-to-end timing metrics through the entire Dynamo → vLLM → frontend stack.
Details:
Added:
- `generate`/`prefill_generate`, including:
  - `_should_include_timing_metrics` gate on `observability_fields` and `_get_current_time_seconds` helper in `components/src/dynamo/vllm/handlers.py`.
  - `timing_metrics` map with `prefill_start_seconds`, `prefill_end_seconds`, `decode_start_seconds`, `decode_first_token_seconds`, `decode_end_seconds`.
- `request_received_seconds` recorded at HTTP ingress and stored into `request.nvext.request_received_seconds` in `lib/llm/src/http/service/openai.rs` for both completions and chat completions.
- `inject_request_completed_seconds` helper that adds `request_completed_seconds` under `timing_metrics` in `lib/llm/src/http/service/openai.rs`, plus unit tests.
- `request_received_seconds: Option<f64>` on `PreprocessedRequest` in `lib/llm/src/protocols/common/preprocessor.rs`.
- `request_received_seconds` from `nvext` in `lib/llm/src/preprocessor.rs`.
Changed:
- `extra_fields` in request payload to `observability_fields`.
- `components/src/dynamo/vllm/handlers.py`:
  - `prefill_start_seconds` and `decode_start_seconds`, and use first-token time as `prefill_end_seconds`.
  - On the final chunk (`finish_reason`), attach the full `timing_metrics` map into `tok["disaggregated_params"]["timing_metrics"]` so downstream layers can read it.
  - `completion_usage` construction is wrapped cleanly before further augmentation.
- `lib/llm/src/http/service/openai.rs`: `completions_single` and `chat_completions` with a `map` that injects `request_completed_seconds` only on the chunk whose choices have a `finish_reason` (not the usage-only chunk), matching when backend timing metrics are finalized.
- `OpenAIPreprocessor` now forwards `request_received_seconds` from `request.nvext()` into the `PreprocessedRequest` builder so workers can see frontend ingress time.
Removed:
Why it matters:
Carrying `timing_metrics` into `disaggregated_params` on the token stream makes these numbers available to observability / analytics stacks and any downstream Dynamo components without changing the public OpenAI surface.
Where should the reviewer start?
- `components/src/dynamo/vllm/handlers.py`: `_should_include_timing_metrics`, the new `timing_metrics` lifecycle in `BaseWorkerHandler.generate`, and how prefill vs aggregated decode modes are handled.
- `lib/llm/src/http/service/openai.rs`: `handler_completions` / `handler_chat_completions` for `request_received_seconds`; `inject_request_completed_seconds` and its use in `completions_single` / `chat_completions` where the stream is mapped to stamp the final content chunk.
- `lib/llm/src/protocols/common/preprocessor.rs` (`PreprocessedRequest` new field).
- `lib/llm/src/preprocessor.rs` (`OpenAIPreprocessor` copying `request_received_seconds`).
- Tests in `lib/llm/src/http/service/openai.rs` to validate the egress stamping.
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
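For readers who want the wire format at a glance, a request opting into timing might look roughly like this (only the field names `nvext` and `observability_fields` appear in this PR; the list-of-fields shape and the other values are assumptions for illustration):

```python
# Hypothetical chat completions request opting into timing metrics via nvext.
example_request = {
    "model": "my-model",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": True,
    "nvext": {
        "observability_fields": ["timing_metrics"],
    },
}
```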
Summary by CodeRabbit