
Conversation


@rmccorm4 rmccorm4 commented Oct 9, 2025

Overview:

Fixes the SGLang worker to include the last token(s) in the response when a finish reason is set. Previously, when finish_reason was set, the response was hard-coded to contain no tokens, but there are scenarios where finish_reason is set and there are still tokens to send back with it.

Note that this edge case isn't just about the "last" token, it's the last "batch" of tokens: with a setting like --stream-interval 50, where each iteration returns up to 50 tokens at a time, without this change you'd be excluding up to the last 50 tokens as well.
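The corrected control flow can be sketched roughly as follows. This is a simplified, hypothetical version of the loop in decode_handler.py (the real handler's response shape and finish_reason plumbing may differ); the point is that new tokens are always sliced from output_ids, and finish_reason is attached on top of them instead of replacing them:

```python
def process_token_stream(results):
    """Yield per-iteration output dicts, never dropping the final token batch.

    Sketch only: assumes each result is a dict with an "output_ids" list of
    all tokens generated so far, and optionally a "finish_reason" field.
    """
    num_output_tokens_so_far = 0
    for res in results:
        out = {}
        try:
            next_total_toks = len(res["output_ids"])
        except KeyError as err:
            raise ValueError(
                f"Missing 'output_ids' in response. Response keys: {list(res.keys())}"
            ) from err
        # Always emit the newly generated tokens, even when a
        # finish_reason arrives in the same iteration.
        out["token_ids"] = res["output_ids"][num_output_tokens_so_far:]
        if res.get("finish_reason") is not None:
            out["finish_reason"] = res["finish_reason"]
        num_output_tokens_so_far = next_total_toks
        yield out
```

With the old branching, the iteration carrying finish_reason returned no tokens at all; here the final slice and the finish reason travel together in one output.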

Details:

Before (content=null - no token returned):

$ curl localhost:8000/v1/chat/completions -H 'Content-Type: application/json' -d '
{
  "model": "Qwen/Qwen3-0.6B",
  "messages": [{"role": "user", "content": "Write me a DND campaign"}],
  "stream": true,
  "max_tokens": 1,
  "ignore_eos": false,
  "stream_options": {"include_usage": false}
}'
data: {"id":"chatcmpl-2dae09bb-1ed3-409f-9ad1-2f2e2c5e15d0","choices":[{"index":0,"delta":{"content":null,"function_call":null,"tool_calls":null,"role":"assistant","refusal":null,"reasoning_content":null},"finish_reason":"length"}],"created":1759510713,"model":"Qwen/Qwen3-0.6B","service_tier":null,"system_fingerprint":null,"object":"chat.completion.chunk","usage":null}

data: [DONE]

After (<think> token returned in content field):

$ curl localhost:8000/v1/chat/completions -H 'Content-Type: application/json' -d '
{
  "model": "Qwen/Qwen3-0.6B",
  "messages": [{"role": "user", "content": "Write me a DND campaign"}],
  "stream": true,
  "max_tokens": 1,
  "ignore_eos": false,
  "stream_options": {"include_usage": false}
}'
data: {"id":"chatcmpl-a5747b00-1cb8-4062-8cf2-7ddf899b9a15","choices":[{"index":0,"delta":{"content":"<think>","function_call":null,"tool_calls":null,"role":"assistant","refusal":null,"reasoning_content":null},"finish_reason":"length"}],"created":1760037414,"model":"Qwen/Qwen3-0.6B","service_tier":null,"system_fingerprint":null,"object":"chat.completion.chunk","usage":null}

data: [DONE]
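For checking the before/after difference programmatically, a minimal client-side sketch (hypothetical helper, assuming only the OpenAI-compatible SSE chunk format shown above) that accumulates delta content and captures the finish reason:

```python
import json

def collect_stream_content(sse_lines):
    """Concatenate delta.content across OpenAI-style streaming chunks."""
    text = []
    finish_reason = None
    for line in sse_lines:
        if not line.startswith("data: "):
            continue
        payload = line[len("data: "):]
        if payload.strip() == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                text.append(content)
            if choice.get("finish_reason"):
                finish_reason = choice["finish_reason"]
    return "".join(text), finish_reason
```

Against the "Before" stream this returns an empty string with finish_reason "length"; against the "After" stream it returns "<think>" with the same finish reason.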

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

Summary by CodeRabbit

  • Refactor
    • Streamlined token streaming logic for more consistent handling of generated tokens and completion signals.
    • Consolidated control flow to always read and slice output tokens, improving clarity and reducing edge-case inconsistencies.
    • Enhanced robustness with clearer error behavior when token outputs are missing.
    • No changes to user-facing APIs; behavior remains consistent while improving reliability and maintainability.

@rmccorm4 rmccorm4 requested review from a team as code owners October 9, 2025 19:25
@github-actions github-actions bot added the fix label Oct 9, 2025

@ayushag-nv ayushag-nv left a comment


lgtm !


coderabbitai bot commented Oct 9, 2025

Walkthrough

Refactors _process_token_stream to linearize handling of streamed results: always read and slice output_ids, update counters in one path, and conditionally attach finish_reason. Removes the prior else branch and initializes an empty out dict each iteration, raising the same error if output_ids are missing.

Changes

Cohort / File(s) Summary
LLM decode stream handling
components/src/dynamo/sglang/request_handlers/llm/decode_handler.py
Linearizes token stream processing: initialize empty out dict per loop, always slice output_ids for token_ids, update num_output_tokens_so_far via next_total_toks, and conditionally add finish_reason. Removes previous branching; retains ValueError on missing output_ids.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Stream as Token Stream Source
  participant Handler as decode_handler._process_token_stream
  participant Client as Caller

  loop For each streamed result
    Stream->>Handler: res
    Note over Handler: out = {}
    alt res.output_ids present
      Handler->>Handler: token_ids = res.output_ids[so_far:]
      Handler->>Handler: so_far = len(res.output_ids)
      opt res.finish_reason present
        Handler->>Handler: out.finish_reason = res.finish_reason
      end
      Handler-->>Client: out with token_ids (and finish_reason if any)
    else Missing output_ids
      Handler->>Handler: raise ValueError("Missing output_ids")
    end
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

I nibble through streams, tick-tock of bytes,
Hop-slice tokens, align the lights.
No branching thickets, a straighter trail,
Finish flags flutter on a gentle gale.
With whiskered focus and tidy flow—
Carrots of code, onward I go! 🥕🐇

Pre-merge checks

❌ Failed checks (1 warning)
  • Description Check ⚠️ Warning: The pull request description includes the Overview, Details, and Related Issues sections from the repository's template, but omits the required "Where should the reviewer start?" section, so reviewers lack explicit guidance on which files or code areas to focus on. Resolution: add the "Where should the reviewer start?" section, listing the specific files, modules, or functions that warrant focused review.
✅ Passed checks (2 passed)
  • Docstring Coverage ✅ Passed: Docstring coverage is 100.00%, above the required threshold of 80.00%.
  • Title Check ✅ Passed: The title concisely states the nature of the fix and its trigger condition (sending the final token batch when finish_reason is set), so teammates can understand the core change from the title alone.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
components/src/dynamo/sglang/request_handlers/llm/decode_handler.py (1)

182-184: Consider adding exception chaining for better debugging.

When raising a ValueError after catching KeyError, preserve the exception chain using raise ... from err to maintain the full traceback for debugging.

Apply this diff:

-            try:
-                next_total_toks = len(res["output_ids"])
-            except KeyError:
-                raise ValueError(
-                    f"Missing 'output_ids' in response. Response keys: {list(res.keys())}"
-                )
+            try:
+                next_total_toks = len(res["output_ids"])
+            except KeyError as err:
+                raise ValueError(
+                    f"Missing 'output_ids' in response. Response keys: {list(res.keys())}"
+                ) from err
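As a standalone illustration of this suggestion (hypothetical function name, not the handler code): `raise ... from err` preserves the original KeyError as `__cause__`, so the traceback shows both the underlying lookup failure and the higher-level error:

```python
def read_output_ids(res):
    """Return res["output_ids"], raising a chained ValueError if absent."""
    try:
        return res["output_ids"]
    except KeyError as err:
        # Chaining with "from err" keeps the original KeyError attached
        # as __cause__, so debugging sees the full failure chain.
        raise ValueError(
            f"Missing 'output_ids' in response. Response keys: {list(res.keys())}"
        ) from err
```

Without the `from err` clause, Python still records the KeyError as `__context__`, but explicit chaining signals that the ValueError is a deliberate translation of the original error rather than an accident during exception handling.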
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9414e3b and 6e71433.

📒 Files selected for processing (1)
  • components/src/dynamo/sglang/request_handlers/llm/decode_handler.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.13.3)
components/src/dynamo/sglang/request_handlers/llm/decode_handler.py

182-184: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


182-184: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: vllm (amd64)
  • GitHub Check: trtllm (arm64)
  • GitHub Check: sglang
  • GitHub Check: trtllm (amd64)
  • GitHub Check: vllm (arm64)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (1)
components/src/dynamo/sglang/request_handlers/llm/decode_handler.py (1)

174-186: LGTM! Fix correctly includes tokens with finish_reason.

The refactored logic correctly addresses the issue where final tokens were missing when finish_reason was set. By initializing an empty out dict and always extracting tokens from output_ids, the method now properly includes both the final token(s) and the finish_reason in the response, as demonstrated in the PR objectives.

@rmccorm4 rmccorm4 enabled auto-merge (squash) October 9, 2025 19:28
@rmccorm4 rmccorm4 disabled auto-merge October 9, 2025 19:34
@rmccorm4
Copy link
Contributor Author

rmccorm4 commented Oct 9, 2025

This is an example of a more extreme case with python -m dynamo.sglang ... --stream-interval 50 (return up to 50 tokens per iteration) where the last big batch of tokens would be excluded without this fix:

Without the fix, the last chunk has null content (screenshot omitted).

With the fix, large response chunks are returned even on the last chunk (screenshot omitted).

@rmccorm4 rmccorm4 changed the title fix: Send last token when finish_reason is set fix: Send last token batch when finish_reason is set Oct 9, 2025
@rmccorm4 rmccorm4 requested a review from Elnifio October 9, 2025 19:43
@rmccorm4 rmccorm4 merged commit 111c681 into main Oct 10, 2025
20 of 21 checks passed
@rmccorm4 rmccorm4 deleted the rmccormick/fix_sglang_last_token branch October 10, 2025 00:09
@lixuwei2333

In #2985 I had fixed this bug, only to find it has been reintroduced :(



Development

Successfully merging this pull request may close these issues.

[BUG]: Missing first token with max_tokens=1 in sglang backend

5 participants