
Conversation

@zhongxuanwang-nv
Member

@zhongxuanwang-nv zhongxuanwang-nv commented Dec 1, 2025

Overview:

This PR introduces end-to-end timing metrics through the entire Dynamo → vLLM → frontend stack.

# When observability_fields contains "timing_metrics", e.g. 
# {
#   "nvext": {
#     "observability_fields": ["timing_metrics"]
#   }
# }
# the following fields are collected and returned in the response's nvext.timing_metrics object:
#
# timing_metrics = {
#     "request_received_seconds": 1733097600.123,   # openai.rs:311 (completions) / :748 (chat_completions)
#     "prefill_start_seconds": 1733097600.456,      # PrefillWorkerHandler:~486 or DecodeWorkerHandler:~404 (aggregated)
#     "prefill_end_seconds": 1733097600.789,        # PrefillWorkerHandler:~550 or DecodeWorkerHandler:~416 (aggregated)
#     "decode_start_seconds": 1733097601.012,       # DecodeWorkerHandler:~405
#     "decode_first_token_seconds": 1733097601.345, # DecodeWorkerHandler:~413
#     "decode_end_seconds": 1733097602.678,         # DecodeWorkerHandler:~428
#     "request_completed_seconds": 1733097602.700,  # openai.rs:76 (inject_request_completed_seconds)
# }
#
# Disaggregated mode (separate prefill and decode workers):
#   - request_received_seconds:    Rust frontend captures SystemTime when HTTP request arrives
#   - prefill_start_seconds:       PrefillWorkerHandler sets when it starts processing
#   - prefill_end_seconds:         PrefillWorkerHandler sets after prefill completes
#   - decode_start_seconds:        DecodeWorkerHandler sets when it starts processing
#   - decode_first_token_seconds:  DecodeWorkerHandler sets when first token is generated
#   - decode_end_seconds:          DecodeWorkerHandler sets when generation finishes
#   - request_completed_seconds:   Rust frontend injects when response leaves server
#
# Aggregated mode (no separate prefill worker, DecodeWorkerHandler does both):
#   - prefill_start_seconds == decode_start_seconds (same time.time() call)
#   - prefill_end_seconds == decode_first_token_seconds (first token marks prefill completion)
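
For illustration, here is a minimal Python sketch of how a client might derive per-stage latencies from the returned timing_metrics object (the field names match the list above; the sample values and the derive_latencies helper are hypothetical):

timing_metrics = {
    "request_received_seconds": 1733097600.123,
    "prefill_start_seconds": 1733097600.456,
    "prefill_end_seconds": 1733097600.789,
    "decode_start_seconds": 1733097601.012,
    "decode_first_token_seconds": 1733097601.345,
    "decode_end_seconds": 1733097602.678,
    "request_completed_seconds": 1733097602.700,
}

def derive_latencies(t: dict) -> dict:
    """Compute per-stage durations (seconds) from the absolute timestamps.

    Note: stages are measured on different machines, so cross-node deltas are
    only meaningful when clocks are reasonably in sync.
    """
    return {
        # Gap between HTTP ingress and prefill starting (routing/queueing).
        "routing_seconds": t["prefill_start_seconds"] - t["request_received_seconds"],
        "prefill_seconds": t["prefill_end_seconds"] - t["prefill_start_seconds"],
        # Time to first token, measured from HTTP ingress.
        "ttft_seconds": t["decode_first_token_seconds"] - t["request_received_seconds"],
        "decode_seconds": t["decode_end_seconds"] - t["decode_start_seconds"],
        "e2e_seconds": t["request_completed_seconds"] - t["request_received_seconds"],
    }

print(derive_latencies(timing_metrics))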

Details:

Added:

  • End-to-end timing collection in vLLM worker handler generate / prefill_generate, including:
    • _should_include_timing_metrics gate on observability_fields and _get_current_time_seconds helper in components/src/dynamo/vllm/handlers.py (see the sketch after this list).
    • timing_metrics map with prefill_start_seconds, prefill_end_seconds, decode_start_seconds, decode_first_token_seconds, decode_end_seconds.
  • Frontend timing capture in OpenAI HTTP service:
    • request_received_seconds recorded at HTTP ingress and stored into request.nvext.request_received_seconds in lib/llm/src/http/service/openai.rs for both completions and chat completions.
  • Request completion timestamp injection:
    • inject_request_completed_seconds helper that adds request_completed_seconds under timing_metrics in lib/llm/src/http/service/openai.rs, plus unit tests.
  • Preprocessor plumbing for timing:
    • New request_received_seconds: Option<f64> on PreprocessedRequest in lib/llm/src/protocols/common/preprocessor.rs.
    • Builder now sets request_received_seconds from nvext in lib/llm/src/preprocessor.rs.
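
A minimal sketch of the two helpers named above, assuming a plain-dict request shape (the function names come from this PR; the bodies are an illustrative guess, not the exact handlers.py code):

import time

def _should_include_timing_metrics(request: dict) -> bool:
    """True when the request opts in via nvext.observability_fields containing "timing_metrics"."""
    nvext = request.get("nvext") or {}
    return "timing_metrics" in (nvext.get("observability_fields") or [])

def _get_current_time_seconds() -> float:
    """Wall-clock UNIX time as a float, comparable to the frontend's SystemTime capture."""
    return time.time()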

Changed:

  • Renamed extra_fields in request payload to observability_fields.
  • vLLM worker generate paths in components/src/dynamo/vllm/handlers.py:
    • On aggregated requests (no prefill worker), treat decode start as both prefill_start_seconds and decode_start_seconds, and use first-token time as prefill_end_seconds.
    • When a result chunk finishes (has finish_reason), attach the full timing_metrics map into tok["disaggregated_params"]["timing_metrics"] so downstream layers can read it (see the sketch after this list).
    • Ensure completion_usage construction is wrapped cleanly before further augmentation.
  • OpenAI streaming responses in lib/llm/src/http/service/openai.rs:
    • Wrap streaming completions_single and chat_completions with a map that injects request_completed_seconds only on the chunk whose choices have a finish_reason (not the usage-only chunk), matching when backend timing metrics are finalized.
  • Preprocessing pipeline:
    • OpenAIPreprocessor now forwards request_received_seconds from request.nvext() into the PreprocessedRequest builder so workers can see frontend ingress time.
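
A rough Python sketch of the final-chunk attach described above (the disaggregated_params / timing_metrics keys mirror the PR; the attach_timing_metrics helper, the chunk shape, and the finish_reason access are assumptions for illustration):

def attach_timing_metrics(tok: dict, timing_metrics: dict, include_timing: bool) -> dict:
    """Attach timing_metrics to the chunk that carries a finish_reason (the final content chunk)."""
    if include_timing and tok.get("finish_reason") is not None:
        tok.setdefault("disaggregated_params", {})["timing_metrics"] = dict(timing_metrics)
    return tok

Downstream, the Rust delta builders lift disaggregated_params.timing_metrics into the response's nvext, and the frontend stamps request_completed_seconds onto that same final chunk.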

Removed:

  • None

Why it matters:

  • Attaching timing_metrics into disaggregated_params on the token stream makes these numbers available to observability / analytics stacks and any downstream Dynamo components without changing the public OpenAI surface.
  • Injecting completion timestamps only on the final content chunk avoids double-counting and keeps the usage-only chunk semantics clean.

Where should the reviewer start?

  • Start with the vLLM worker wiring:
    • components/src/dynamo/vllm/handlers.py
      • Focus on _should_include_timing_metrics, the new timing_metrics lifecycle in
        BaseWorkerHandler.generate, and how prefill vs aggregated decode modes are handled.
  • Then look at frontend ingress/egress timing:
    • lib/llm/src/http/service/openai.rs
      • handler_completions / handler_chat_completions for request_received_seconds.
      • inject_request_completed_seconds and its use in completions_single / chat_completions
        where the stream is mapped to stamp the final content chunk.
  • Finally, verify the plumbing and types:
    • lib/llm/src/protocols/common/preprocessor.rs (PreprocessedRequest new field).
    • lib/llm/src/preprocessor.rs (OpenAIPreprocessor copying request_received_seconds).
    • The tests at the bottom of lib/llm/src/http/service/openai.rs to validate the egress stamping.

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • Relates to Dynamo requirements from NAT team

Summary by CodeRabbit

  • New Features
    • Added comprehensive timing metrics collection throughout the request lifecycle, capturing key timestamps such as request arrival, decode/prefill processing stages, and response completion for improved visibility into system behavior.


@copy-pr-bot

copy-pr-bot bot commented Dec 1, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions

github-actions bot commented Dec 1, 2025

👋 Hi zhongxuanwang-nv! Thank you for contributing to ai-dynamo/dynamo.

Just a reminder: The NVIDIA Test Github Validation CI runs an essential subset of the testing framework to quickly catch errors. Your PR reviewers may elect to test the changes comprehensively before approving your changes.

🚀

@github-actions github-actions bot added the external-contribution Pull request is from an external contributor label Dec 1, 2025
Signed-off-by: Zhongxuan Wang <[email protected]>
@zhongxuanwang-nv zhongxuanwang-nv changed the title from "Initial push" to "Optional timing metrics reporting through nvext for vLLM backend" Dec 1, 2025
@zhongxuanwang-nv zhongxuanwang-nv added backend::vllm Relates to the vllm backend and removed external-contribution Pull request is from an external contributor labels Dec 1, 2025
@zhongxuanwang-nv zhongxuanwang-nv changed the title from "Optional timing metrics reporting through nvext for vLLM backend" to "feat: timing metrics reporting through nvext for vLLM backend" Dec 1, 2025
@github-actions github-actions bot added the feat label Dec 1, 2025
@zhongxuanwang-nv
Member Author

@coderabbitai review

@coderabbitai
Contributor

coderabbitai bot commented Dec 1, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai
Contributor

coderabbitai bot commented Dec 1, 2025

Walkthrough

End-to-end timing metrics support is added across Python and Rust layers. The vllm handlers collect timing data (request received, prefill/decode start/end), the HTTP service injects boundary timestamps, the preprocessor extracts timing fields, and delta response builders include timing metrics in nvext responses.

Changes

  • Timing collection in vllm handlers (components/src/dynamo/vllm/handlers.py): Added helper functions to detect and capture timing metrics; instrumented decode/prefill/generate paths to populate timing fields (request_received_seconds, decode/prefill start/end times) and merge timing data across worker boundaries; extended disaggregated_params construction to include timing_metrics dict.
  • HTTP service timing injection (lib/llm/src/http/service/openai.rs): Added private helper inject_request_completed_seconds() to populate nvext.timing_metrics; captures request_received_seconds at request arrival in handlers (completions, chat_completions, embeddings); injects request_completed_seconds on final streaming chunks via per-stream transformation; added unit tests.
  • Preprocessor and protocol schema extensions (lib/llm/src/preprocessor.rs, lib/llm/src/protocols/common/preprocessor.rs, lib/llm/src/protocols/openai/nvext.rs): Extended PreprocessedRequest with optional request_received_seconds field; updated OpenAIPreprocessor::builder to extract and set this field from nvext; expanded NvExt structure with optional request_received_seconds field and updated allowed extra_fields to include "timing_metrics"; updated tests.
  • Delta response builders (lib/llm/src/protocols/openai/chat_completions/delta.rs, lib/llm/src/protocols/openai/completions/delta.rs): Refactored disaggregated_params extraction to collect both worker_id and timing_metrics; inject both fields into nvext as a map only when present; added unit tests validating timing_metrics propagation through delta processing.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~28 minutes

  • Areas requiring extra attention:
    • Timing metric propagation logic across decode/prefill/generate paths in handlers.py—verify correct initialization and merging across worker boundaries
    • Streaming response timing injection in openai.rs—confirm final chunk detection and timestamp accuracy
    • Field mapping consistency between nvext, preprocessor, and protocol structures—ensure no field loss during transformations
    • Delta/nvext builders—validate conditional logic for when fields are present/absent

Poem

🐰 Hop, hop, the clocks now tick,
Each request leaves a timestamp's quick!
From start to end, the metrics flow,
Through handlers, streams, and responses' glow.
Timing's tale is finally sung—
Infrastructure's golden thread now strung! 🕐✨

Pre-merge checks

✅ Passed checks (3 passed)
  • Title check: ✅ Passed. The title clearly summarizes the main feature: adding timing metrics reporting via nvext for the vLLM backend, which aligns with the primary objective of the changeset.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.
  • Description check: ✅ Passed. The PR description provides comprehensive coverage of all required sections with clear details about changes, context, and review guidance.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
components/src/dynamo/vllm/handlers.py (1)

1-1: Fix Black formatting before merge.

The pipeline failure indicates that Black formatting failed. Please run black on this file to resolve the formatting issue.

Based on the pipeline failure log, run:

black components/src/dynamo/vllm/handlers.py
🧹 Nitpick comments (1)
lib/llm/src/http/service/openai.rs (1)

59-90: Consider returning a success indicator.

The function silently handles errors (line 64 .ok()) and structural mismatches (lines 69-88 nested as_object_mut()). While timing metrics are non-critical and silent failures may be acceptable, consider returning a bool to indicate success for better observability.

If you want to track silent failures, you could refactor to:

-fn inject_request_completed_seconds(nvext: &mut Option<serde_json::Value>) {
+fn inject_request_completed_seconds(nvext: &mut Option<serde_json::Value>) -> bool {
     let ts = SystemTime::now()
         .duration_since(UNIX_EPOCH)
         .map(|d| d.as_secs_f64())
         .ok();

-    if let Some(ts) = ts {
+    let Some(ts) = ts else {
+        tracing::trace!("Failed to get current timestamp for request_completed_seconds");
+        return false;
+    };
+    
         let nvext = nvext.get_or_insert_with(|| serde_json::Value::Object(serde_json::Map::new()));
         if let Some(obj) = nvext.as_object_mut() {
             if let Some(timing) = obj.get_mut("timing_metrics") {
                 if let Some(timing_obj) = timing.as_object_mut() {
                     timing_obj.insert(
                         "request_completed_seconds".to_string(),
                         serde_json::Value::from(ts),
                     );
+                    return true;
                 }
             } else {
                 let mut timing_obj = serde_json::Map::new();
                 timing_obj.insert(
                     "request_completed_seconds".to_string(),
                     serde_json::Value::from(ts),
                 );
                 obj.insert(
                     "timing_metrics".to_string(),
                     serde_json::Value::Object(timing_obj),
                 );
+                return true;
             }
         }
-    }
+    false
 }
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c9d7d95 and 08d432b.

📒 Files selected for processing (7)
  • components/src/dynamo/vllm/handlers.py (8 hunks)
  • lib/llm/src/http/service/openai.rs (6 hunks)
  • lib/llm/src/preprocessor.rs (1 hunks)
  • lib/llm/src/protocols/common/preprocessor.rs (1 hunks)
  • lib/llm/src/protocols/openai/chat_completions/delta.rs (2 hunks)
  • lib/llm/src/protocols/openai/completions/delta.rs (2 hunks)
  • lib/llm/src/protocols/openai/nvext.rs (4 hunks)
🧰 Additional context used
🧠 Learnings (6)
📚 Learning: 2025-09-02T16:46:54.015Z
Learnt from: GuanLuo
Repo: ai-dynamo/dynamo PR: 2714
File: lib/llm/src/discovery/model_entry.rs:38-42
Timestamp: 2025-09-02T16:46:54.015Z
Learning: In lib/llm/src/discovery/model_entry.rs, GuanLuo prefers not to add serde defaults for model_type and model_input fields to keep the specification explicit and avoid user errors, relying on atomic deployment strategy to avoid backward compatibility issues.

Applied to files:

  • lib/llm/src/protocols/openai/nvext.rs
📚 Learning: 2025-09-08T21:18:43.478Z
Learnt from: nachiketb-nvidia
Repo: ai-dynamo/dynamo PR: 2936
File: lib/parsers/src/reasoning/granite_parser.rs:42-46
Timestamp: 2025-09-08T21:18:43.478Z
Learning: In GraniteReasoningParser in lib/parsers/src/reasoning/granite_parser.rs, the think_start_tokens and think_end_tokens are hardcoded in the constructor with fixed values, so unwrap() calls on these vectors are safe and won't panic.

Applied to files:

  • lib/llm/src/protocols/openai/nvext.rs
📚 Learning: 2025-06-24T20:59:35.725Z
Learnt from: ishandhanani
Repo: ai-dynamo/dynamo PR: 1626
File: lib/llm/src/preprocessor.rs:238-239
Timestamp: 2025-06-24T20:59:35.725Z
Learning: In lib/llm/src/preprocessor.rs, the `sampling_options` call in the `preprocess_request` method is placed in the common section after the match statement on `request.prompt_input_type()`, meaning it applies to both `PromptInput::Tokens` and `PromptInput::Text` request types.

Applied to files:

  • lib/llm/src/protocols/common/preprocessor.rs
📚 Learning: 2025-08-22T19:55:41.608Z
Learnt from: nachiketb-nvidia
Repo: ai-dynamo/dynamo PR: 2656
File: lib/llm/src/protocols/openai/chat_completions/delta.rs:320-327
Timestamp: 2025-08-22T19:55:41.608Z
Learning: There are two separate DeltaGenerator classes in the codebase: one for chat completions (lib/llm/src/protocols/openai/chat_completions/delta.rs with object "chat.completion.chunk") and one for text completions (lib/llm/src/protocols/openai/completions/delta.rs with object "text_completion"). They have different create_choice method signatures and serve different OpenAI API endpoints. The reasoning parsing functionality is only relevant to the chat completions DeltaGenerator.

Applied to files:

  • lib/llm/src/protocols/openai/chat_completions/delta.rs
  • lib/llm/src/protocols/openai/completions/delta.rs
📚 Learning: 2025-08-22T19:55:41.608Z
Learnt from: nachiketb-nvidia
Repo: ai-dynamo/dynamo PR: 2656
File: lib/llm/src/protocols/openai/chat_completions/delta.rs:320-327
Timestamp: 2025-08-22T19:55:41.608Z
Learning: The create_choice method exists on multiple different objects in the codebase. The DeltaGenerator::create_choice in lib/llm/src/protocols/openai/chat_completions/delta.rs has its own signature that was updated to include reasoning_content, but other objects in lib/llm/src/engines.rs have their own separate create_choice methods with different signatures that are not related to chat completions.

Applied to files:

  • lib/llm/src/protocols/openai/chat_completions/delta.rs
  • lib/llm/src/protocols/openai/completions/delta.rs
📚 Learning: 2025-08-25T22:04:45.205Z
Learnt from: nachiketb-nvidia
Repo: ai-dynamo/dynamo PR: 2700
File: lib/llm/src/protocols/openai/chat_completions/delta.rs:19-28
Timestamp: 2025-08-25T22:04:45.205Z
Learning: The response_generator() method exists on multiple request types in the codebase: NvCreateChatCompletionRequest (for chat completions) and NvCreateCompletionRequest (for text completions). When making signature changes, it's important to distinguish between these different object types as they have separate implementations and call sites.

Applied to files:

  • lib/llm/src/http/service/openai.rs
🧬 Code graph analysis (6)
lib/llm/src/preprocessor.rs (4)
lib/llm/src/protocols/openai/nvext.rs (2)
  • nvext (9-9)
  • builder (99-101)
lib/llm/src/protocols/openai/responses.rs (3)
  • nvext (44-46)
  • nvext (104-106)
  • nvext (151-153)
lib/llm/src/protocols/openai/chat_completions.rs (3)
  • nvext (74-76)
  • nvext (134-136)
  • nvext (238-240)
lib/llm/src/protocols/openai.rs (2)
  • nvext (50-50)
  • nvext (60-60)
lib/llm/src/protocols/openai/nvext.rs (1)
lib/llm/src/preprocessor.rs (1)
  • builder (202-248)
lib/llm/src/protocols/common/preprocessor.rs (1)
lib/llm/src/preprocessor.rs (1)
  • builder (202-248)
lib/llm/src/protocols/openai/chat_completions/delta.rs (2)
lib/llm/src/protocols/openai/chat_completions/aggregator.rs (2)
  • new (92-104)
  • default (63-65)
lib/llm/src/protocols/openai/nvext.rs (2)
  • default (93-95)
  • nvext (9-9)
lib/llm/src/protocols/openai/completions/delta.rs (2)
lib/llm/src/protocols/openai/nvext.rs (2)
  • default (93-95)
  • nvext (9-9)
lib/llm/src/protocols/openai/responses.rs (3)
  • nvext (44-46)
  • nvext (104-106)
  • nvext (151-153)
components/src/dynamo/vllm/handlers.py (1)
components/src/dynamo/sglang/request_handlers/handler_base.py (1)
  • BaseWorkerHandler (20-222)
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/4661/merge) by zhongxuanwang-nv.
components/src/dynamo/vllm/handlers.py

[error] 1-1: Black formatting failed. 1 file reformatted by the hook; please re-run the commit or CI to pass: 'pre-commit run --show-diff-on-failure --color=always --all-files'.

🔇 Additional comments (19)
lib/llm/src/preprocessor.rs (1)

240-244: LGTM!

The extraction of request_received_seconds from nvext follows the established pattern for other nvext fields and maintains consistency with the timing metrics feature.

lib/llm/src/protocols/openai/nvext.rs (3)

84-89: LGTM!

The new request_received_seconds field is properly integrated with appropriate serde attributes and clear documentation.


79-79: LGTM!

The documentation correctly lists the expanded set of supported extra_fields.


136-160: LGTM!

Tests properly validate the new field and the expanded extra_fields list.

lib/llm/src/protocols/common/preprocessor.rs (1)

105-109: LGTM!

The request_received_seconds field is well-documented and follows the established patterns for optional fields in the struct.

components/src/dynamo/vllm/handlers.py (5)

77-87: LGTM!

The helper functions provide clean abstractions for timing metrics detection and timestamp retrieval.


313-393: LGTM!

The timing metrics initialization and decode tracking logic is sound:

  • Frontend timestamp is preserved when merging with prefill timing (lines 349-352)
  • Aggregated mode (no prefill_result) correctly treats decode_start as prefill_start (line 380)
  • First token time is used as prefill_end in aggregated mode (line 392), which aligns with the semantic that prefill completes when the first token is generated

402-413: LGTM!

The conditional injection of timing_metrics is correct:

  • Only when finish_reason is present (the final chunk)
  • Only when timing is requested
  • Safely initializes disaggregated_params if needed

267-271: LGTM!

The refactored completion_usage construction is cleaner and preserves the existing structure while preparing for timing metrics augmentation.


447-524: LGTM!

The prefill worker timing metrics logic correctly:

  • Captures prefill_start_seconds at the start of processing (line 460)
  • Records prefill_end_seconds when the result is available (lines 521-523)
  • Builds disaggregated_params with both kv_transfer_params and timing_metrics (lines 515-524)
lib/llm/src/protocols/openai/completions/delta.rs (2)

268-286: LGTM!

The generic extraction of worker_id and timing_metrics from disaggregated_params into nvext is clean and extensible. The conditional check ensures nvext is only set when there's actual data to include.


304-350: LGTM!

The test thoroughly validates that both timing_metrics and worker_id are correctly extracted from disaggregated_params and placed into nvext.

lib/llm/src/protocols/openai/chat_completions/delta.rs (2)

365-383: LGTM!

The nvext construction from disaggregated_params follows the same clean pattern as the completions path, maintaining consistency across the codebase.


464-506: LGTM!

The test provides thorough coverage of timing_metrics and worker_id extraction into nvext.

lib/llm/src/http/service/openai.rs (5)

317-329: LGTM!

Capturing request_received_seconds immediately at handler entry is the right approach for accurate timing. The safe initialization with get_or_insert_with(Default::default) is appropriate.


453-468: LGTM!

The injection on the final content chunk (with finish_reason) aligns with the backend's timing_metrics population strategy, avoiding injection on the separate usage-only chunk.


751-763: LGTM!

Consistent timing capture across both completions and chat completions endpoints.


898-912: LGTM!

The stream mapping for chat completions follows the same correct pattern as completions.


2149-2184: LGTM!

The tests validate the two key scenarios:

  1. Preserving existing fields when injecting into existing timing_metrics
  2. Creating the structure from scratch when nvext is None

Both scenarios are well-covered.

Signed-off-by: Zhongxuan Wang <[email protected]>
@pull-request-size pull-request-size bot added size/XL and removed size/L labels Dec 1, 2025
Signed-off-by: Zhongxuan Wang <[email protected]>
@zhongxuanwang-nv
Member Author

/ok to test 8588689

@zhongxuanwang-nv zhongxuanwang-nv marked this pull request as ready for review December 1, 2025 20:49
@zhongxuanwang-nv zhongxuanwang-nv requested review from a team as code owners December 1, 2025 20:49
Signed-off-by: Zhongxuan Wang <[email protected]>
Signed-off-by: Zhongxuan Wang <[email protected]>
@zhongxuanwang-nv
Member Author

/ok to test 0f1d019


let nvext_response = NvExtResponse {
    worker_id: Some(worker_id_info),
};
Contributor

@grahamking grahamking Dec 1, 2025


Is worker_id gone? I think AI Gateway uses it.

Member Author

@zhongxuanwang-nv zhongxuanwang-nv Dec 1, 2025


It's definitely not gone, I just removed the NvExtResponse and WorkerIdInfo structs and instead export the worker_id JSON directly, because I figured it would look less redundant :)

Do you think I should bring these two structs back, even though no Dynamo code is referencing them?

Thank you Graham!

Signed-off-by: Zhongxuan Wang <[email protected]>
Contributor

@keivenchang keivenchang left a comment


Hey @zhongxuanwang-nv, thanks for pushing a first pass of timing metrics. The feature itself is useful, you've wired it up across the stack (HTTP frontend, prefill, decode), and you added tests to exercise the new paths (love it!). That said, I want to raise some architectural and readability issues that are likely to cause problems down the road, so I'd treat this as a good starting point that may need some revisions.

Architecturally, the biggest concern for me is that you're using absolute timestamps taken on different machines (HTTP frontend vs prefill vs decode nodes), which quietly assumes that all clocks are in sync. That's never a 100% guarantee; it's just a matter of time until one of our clients sees a problem and files a bug. An alternative design that would guarantee consistent timing is to have the HTTP frontend own the absolute start and end times, while prefill and decode workers report their own relative or cumulative durations (for example, "prefill took X ms, decode took Y ms") instead of wall-clock timestamps. The frontend can then assemble end-to-end timing using a single clock. I'd love to hear @nnshah1, @rmccorm4, and @grahamking's thoughts on using absolute wall-clock timestamps from various machines.

Now, if you really want to keep absolute timestamps across nodes, I think you should add a big explicit disclaimer in your code and docs, calling out the NTP/clock-sync requirement and warning that cross-node deltas are best-effort and will be wrong if clocks drift. It's just a matter of time before that happens somewhere.

On readability, especially for Python dicts, it would help new readers of your code a lot to show examples of the names and fields. It would also help AI tools understand the intended structure and intent, and quickly help you find issues should there be any in the future.

Finally, and this is just a debatable opinion of mine -- the name extra_fields is very generic and not really an industry standard. I can imagine that in the future it could become a place for someone to start dumping "one more field" later, whether it's related to metrics or not. How about calling it telemetry_fields, observability_fields, or response_metadata_fields? Or metadata might be somewhat acceptable... maybe. @grahamking may also have some good ideas, as he is good at giving clear names.

@zhongxuanwang-nv
Member Author

Thank you so much @keivenchang for your comments! They are really insightful and helpful!

Architecturally, the biggest concern for me is that you’re using absolute timestamps taken on different machines ...

Yes! This is indeed a constraint and I also like your proposal! If I understood it correctly, we would only have request_received_seconds and request_completed_seconds as absolute timestamps. But I'm wondering: if we removed prefill_start_seconds and prefill_end_seconds and used something like prefill_duration_seconds instead, we would probably lose the routing-time info? Routing time here = prefill_start_seconds - request_received_seconds, and it seems that to include the routing time we can't avoid using these absolute timestamps? I've also just included the big disclaimer, but I'm open to any suggestions!
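
To make that concrete, here is a tiny hypothetical sketch of the duration-only alternative as I understand it (made-up numbers, not from this PR):

# Hypothetical duration-only report from the workers:
worker_report = {"prefill_duration_seconds": 0.333, "decode_duration_seconds": 1.666}

# Frontend-owned absolute timestamps (single clock):
request_received_seconds = 1733097600.123
request_completed_seconds = 1733097602.700

# With only durations, the routing/queueing gap between HTTP ingress and prefill start
# can't be isolated; it gets lumped together with transfer and egress overhead:
unaccounted_seconds = (
    (request_completed_seconds - request_received_seconds)
    - worker_report["prefill_duration_seconds"]
    - worker_report["decode_duration_seconds"]
)
print(unaccounted_seconds)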

On readability, especially on Python dicts ...

Yes! Indeed, and I deeply apologize for not including this request info! I just did, and in the future I will make sure to also include both a sample request and response whenever I deal with anything API-related :)

Finally, and this is just a debatable opinion of mine -- the name extra_fields is very generic ...

That's true. The extra_fields naming does sound very generic; I think telemetry_fields and observability_fields both sound great (if I were to pick, I would pick observability_fields). I think it constrains the scope of this field to observability only, so if people want to introduce other metadata later, they can use a separate field instead.

Signed-off-by: Zhongxuan Wang <[email protected]>
@PeaBrane
Contributor

PeaBrane commented Dec 11, 2025

@zhongxuanwang-nv thanks for your contribution. In general, we have been trying to move away from engine-specific pipelines in Dynamo, so I'm not sure the Python handlers for vLLM should handle the timing metrics at all. Unless I'm mistaken, I believe this can be done entirely in lib/llm Rust land, particularly in delta.rs and prefill_router.rs, and more ideally using the context registry (currently underutilized). This has the added benefit of resolving the headaches of desynced inter-node timings.

I have a PR up here that just handles ttft and total_request_time for now. But we can probably augment them with the metrics you are interested in by probing the prefill and decode routers (+ even more metrics like kv hit rate etc). Let me know if you feel like this is a good launch point:
#4880

Contributor

@PeaBrane PeaBrane left a comment


comments left
