
Conversation


@rmccorm4 rmccorm4 commented Oct 9, 2025

Overview:

Fixes the SGLang worker to include the last token(s) in the response when a finish reason is set. Previously, when finish_reason was set, the response was hard-coded to contain no tokens, but there are scenarios where finish_reason is set and there are still tokens to send back with it.

Note that this edge case isn't just about the "last" token, it's the last "batch" of tokens: with a setting like --stream-interval 50, where each iteration returns up to 50 tokens at a time, without this change you'd be excluding up to the last 50 tokens as well.
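The corrected control flow can be sketched roughly as follows. This is a simplified, hypothetical version of the loop in decode_handler.py (the real handler's response shape and finish_reason plumbing may differ); the point is that new tokens are always sliced from output_ids, and finish_reason is attached on top of them instead of replacing them:

```python
def process_token_stream(results):
    """Yield per-iteration output dicts, never dropping the final token batch.

    Sketch only: assumes each result is a dict with an "output_ids" list of
    all tokens generated so far, and optionally a "finish_reason" field.
    """
    num_output_tokens_so_far = 0
    for res in results:
        out = {}
        try:
            next_total_toks = len(res["output_ids"])
        except KeyError as err:
            raise ValueError(
                f"Missing 'output_ids' in response. Response keys: {list(res.keys())}"
            ) from err
        # Always emit the newly generated tokens, even when a
        # finish_reason arrives in the same iteration.
        out["token_ids"] = res["output_ids"][num_output_tokens_so_far:]
        if res.get("finish_reason") is not None:
            out["finish_reason"] = res["finish_reason"]
        num_output_tokens_so_far = next_total_toks
        yield out
```

With the old branching, the iteration carrying finish_reason returned no tokens at all; here the final slice and the finish reason travel together in one output.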

Details:

Before (content=null - no token returned):

$ curl localhost:8000/v1/chat/completions -H 'Content-Type: application/json' -d '
{
  "model": "Qwen/Qwen3-0.6B",
  "messages": [{"role": "user", "content": "Write me a DND campaign"}],
  "stream": true,
  "max_tokens": 1,
  "ignore_eos": false,
  "stream_options": {"include_usage": false}
}'
data: {"id":"chatcmpl-2dae09bb-1ed3-409f-9ad1-2f2e2c5e15d0","choices":[{"index":0,"delta":{"content":null,"function_call":null,"tool_calls":null,"role":"assistant","refusal":null,"reasoning_content":null},"finish_reason":"length"}],"created":1759510713,"model":"Qwen/Qwen3-0.6B","service_tier":null,"system_fingerprint":null,"object":"chat.completion.chunk","usage":null}

data: [DONE]

After (<think> token returned in content field):

$ curl localhost:8000/v1/chat/completions -H 'Content-Type: application/json' -d '
{
  "model": "Qwen/Qwen3-0.6B",
  "messages": [{"role": "user", "content": "Write me a DND campaign"}],
  "stream": true,
  "max_tokens": 1,
  "ignore_eos": false,
  "stream_options": {"include_usage": false}
}'
data: {"id":"chatcmpl-a5747b00-1cb8-4062-8cf2-7ddf899b9a15","choices":[{"index":0,"delta":{"content":"<think>","function_call":null,"tool_calls":null,"role":"assistant","refusal":null,"reasoning_content":null},"finish_reason":"length"}],"created":1760037414,"model":"Qwen/Qwen3-0.6B","service_tier":null,"system_fingerprint":null,"object":"chat.completion.chunk","usage":null}

data: [DONE]
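For checking the before/after difference programmatically, a minimal client-side sketch (hypothetical helper, assuming only the OpenAI-compatible SSE chunk format shown above) that accumulates delta content and captures the finish reason:

```python
import json

def collect_stream_content(sse_lines):
    """Concatenate delta.content across OpenAI-style streaming chunks."""
    text = []
    finish_reason = None
    for line in sse_lines:
        if not line.startswith("data: "):
            continue
        payload = line[len("data: "):]
        if payload.strip() == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                text.append(content)
            if choice.get("finish_reason"):
                finish_reason = choice["finish_reason"]
    return "".join(text), finish_reason
```

Against the "Before" stream this returns an empty string with finish_reason "length"; against the "After" stream it returns "<think>" with the same finish reason.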

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

Summary by CodeRabbit

  • Refactor
    • Streamlined token streaming logic for more consistent handling of generated tokens and completion signals.
    • Consolidated control flow to always read and slice output tokens, improving clarity and reducing edge-case inconsistencies.
    • Enhanced robustness with clearer error behavior when token outputs are missing.
    • No changes to user-facing APIs; behavior remains consistent while improving reliability and maintainability.

@rmccorm4 rmccorm4 requested review from a team as code owners October 9, 2025 19:25
@github-actions github-actions bot added the fix label Oct 9, 2025

@ayushag-nv ayushag-nv left a comment


lgtm !


coderabbitai bot commented Oct 9, 2025

Walkthrough

Refactors _process_token_stream to linearize handling of streamed results: always read and slice output_ids, update counters in one path, and conditionally attach finish_reason. Removes the prior else branch and initializes an empty out dict each iteration, raising the same error if output_ids are missing.

Changes

Cohort / File(s) Summary
LLM decode stream handling
components/src/dynamo/sglang/request_handlers/llm/decode_handler.py
Linearizes token stream processing: initialize empty out dict per loop, always slice output_ids for token_ids, update num_output_tokens_so_far via next_total_toks, and conditionally add finish_reason. Removes previous branching; retains ValueError on missing output_ids.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Stream as Token Stream Source
  participant Handler as decode_handler._process_token_stream
  participant Client as Caller

  loop For each streamed result
    Stream->>Handler: res
    Note over Handler: out = {}
    alt res.output_ids present
      Handler->>Handler: token_ids = res.output_ids[so_far:]
      Handler->>Handler: so_far = len(res.output_ids)
      opt res.finish_reason present
        Handler->>Handler: out.finish_reason = res.finish_reason
      end
      Handler-->>Client: out with token_ids (and finish_reason if any)
    else Missing output_ids
      Handler->>Handler: raise ValueError("Missing output_ids")
    end
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

I nibble through streams, tick-tock of bytes,
Hop-slice tokens, align the lights.
No branching thickets, a straighter trail,
Finish flags flutter on a gentle gale.
With whiskered focus and tidy flow—
Carrots of code, onward I go! 🥕🐇

Pre-merge checks

❌ Failed checks (1 warning)
  • Description Check ⚠️ Warning: The pull request description includes the Overview, Details, and Related Issues sections from the repository's template, but omits the required "Where should the reviewer start?" section, so reviewers lack explicit guidance on which files or code areas to focus on. Resolution: add the "Where should the reviewer start?" section, listing the specific files, modules, or functions that warrant focused review.
✅ Passed checks (2 passed)
  • Docstring Coverage ✅ Passed: Docstring coverage is 100.00%, above the required threshold of 80.00%.
  • Title Check ✅ Passed: The title concisely states the nature of the fix and its trigger condition (sending the final token batch when finish_reason is set), so teammates can understand the core change from the title alone.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
components/src/dynamo/sglang/request_handlers/llm/decode_handler.py (1)

182-184: Consider adding exception chaining for better debugging.

When raising a ValueError after catching KeyError, preserve the exception chain using raise ... from err to maintain the full traceback for debugging.

Apply this diff:

-            try:
-                next_total_toks = len(res["output_ids"])
-            except KeyError:
-                raise ValueError(
-                    f"Missing 'output_ids' in response. Response keys: {list(res.keys())}"
-                )
+            try:
+                next_total_toks = len(res["output_ids"])
+            except KeyError as err:
+                raise ValueError(
+                    f"Missing 'output_ids' in response. Response keys: {list(res.keys())}"
+                ) from err
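As a standalone illustration of this suggestion (hypothetical function name, not the handler code): `raise ... from err` preserves the original KeyError as `__cause__`, so the traceback shows both the underlying lookup failure and the higher-level error:

```python
def read_output_ids(res):
    """Return res["output_ids"], raising a chained ValueError if absent."""
    try:
        return res["output_ids"]
    except KeyError as err:
        # Chaining with "from err" keeps the original KeyError attached
        # as __cause__, so debugging sees the full failure chain.
        raise ValueError(
            f"Missing 'output_ids' in response. Response keys: {list(res.keys())}"
        ) from err
```

Without the `from err` clause, Python still records the KeyError as `__context__`, but explicit chaining signals that the ValueError is a deliberate translation of the original error rather than an accident during exception handling.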
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9414e3b and 6e71433.

📒 Files selected for processing (1)
  • components/src/dynamo/sglang/request_handlers/llm/decode_handler.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.13.3)
components/src/dynamo/sglang/request_handlers/llm/decode_handler.py

182-184: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


182-184: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: vllm (amd64)
  • GitHub Check: trtllm (arm64)
  • GitHub Check: sglang
  • GitHub Check: trtllm (amd64)
  • GitHub Check: vllm (arm64)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (1)
components/src/dynamo/sglang/request_handlers/llm/decode_handler.py (1)

174-186: LGTM! Fix correctly includes tokens with finish_reason.

The refactored logic correctly addresses the issue where final tokens were missing when finish_reason was set. By initializing an empty out dict and always extracting tokens from output_ids, the method now properly includes both the final token(s) and the finish_reason in the response, as demonstrated in the PR objectives.

@rmccorm4 rmccorm4 enabled auto-merge (squash) October 9, 2025 19:28
@rmccorm4 rmccorm4 disabled auto-merge October 9, 2025 19:34
@rmccorm4
Copy link
Contributor Author

rmccorm4 commented Oct 9, 2025

This is an example of a more extreme case with python -m dynamo.sglang ... --stream-interval 50 (return up to 50 tokens per iteration) where the last big batch of tokens would be excluded without this fix:

Without the fix, the last chunk has null content (screenshot omitted).

With the fix, large response chunks are returned even on the last chunk (screenshot omitted).

@rmccorm4 rmccorm4 changed the title fix: Send last token when finish_reason is set fix: Send last token batch when finish_reason is set Oct 9, 2025
@rmccorm4 rmccorm4 requested a review from Elnifio October 9, 2025 19:43
@rmccorm4 rmccorm4 merged commit 111c681 into main Oct 10, 2025
20 of 21 checks passed
@rmccorm4 rmccorm4 deleted the rmccormick/fix_sglang_last_token branch October 10, 2025 00:09
@lixuwei2333

In #2985 I had fixed this bug, only to find it has been reintroduced :(



Development

Successfully merging this pull request may close these issues.

[BUG]: Missing first token with max_tokens=1 in sglang backend

5 participants