fix: Send last token batch when finish_reason is set #3531
Conversation
ayushag-nv left a comment
lgtm !
Walkthrough
Refactors _process_token_stream to linearize handling of streamed results: always read and slice output_ids, update counters in one path, and conditionally attach finish_reason. Removes the prior else branch and initializes an empty out dict each iteration, raising the same error if output_ids are missing.
Sequence Diagram(s)
sequenceDiagram
autonumber
participant Stream as Token Stream Source
participant Handler as decode_handler._process_token_stream
participant Client as Caller
loop For each streamed result
Stream->>Handler: res
Note over Handler: out = {}
alt res.output_ids present
Handler->>Handler: token_ids = res.output_ids[so_far:]
Handler->>Handler: so_far = len(res.output_ids)
opt res.finish_reason present
Handler->>Handler: out.finish_reason = res.finish_reason
end
Handler-->>Client: out with token_ids (and finish_reason if any)
else Missing output_ids
Handler->>Handler: raise ValueError("Missing output_ids")
end
end
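For illustration, here is a minimal Python sketch of the linearized per-result handling that the walkthrough and diagram describe. The function name, the counter name, and the dict-shaped response are assumptions for this sketch, not the actual decode_handler code.

```python
# Minimal sketch of the linearized handling (assumed names and response shape).
def process_token_stream_sketch(stream):
    num_output_tokens_so_far = 0  # tokens already emitted in earlier iterations
    for res in stream:
        out = {}  # fresh output dict each iteration
        if "output_ids" not in res:
            # Same error path as before when output_ids are missing.
            raise ValueError(
                f"Missing 'output_ids' in response. Response keys: {list(res.keys())}"
            )
        # Always slice the newly produced tokens, including on the final result.
        out["token_ids"] = res["output_ids"][num_output_tokens_so_far:]
        num_output_tokens_so_far = len(res["output_ids"])
        # Attach finish_reason when present; the last token batch is still included.
        if res.get("finish_reason"):
            out["finish_reason"] = res["finish_reason"]
        yield out
```

The point of the single path is that there is no longer a separate branch that returns an empty token list when finish_reason is set.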
Estimated code review effort
🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 0
🧹 Nitpick comments (1)
components/src/dynamo/sglang/request_handlers/llm/decode_handler.py (1)
182-184: Consider adding exception chaining for better debugging.
When raising a ValueError after catching KeyError, preserve the exception chain using raise ... from err to maintain the full traceback for debugging. Apply this diff:
-    try:
-        next_total_toks = len(res["output_ids"])
-    except KeyError:
-        raise ValueError(
-            f"Missing 'output_ids' in response. Response keys: {list(res.keys())}"
-        )
+    try:
+        next_total_toks = len(res["output_ids"])
+    except KeyError as err:
+        raise ValueError(
+            f"Missing 'output_ids' in response. Response keys: {list(res.keys())}"
+        ) from err
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
components/src/dynamo/sglang/request_handlers/llm/decode_handler.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.13.3)
components/src/dynamo/sglang/request_handlers/llm/decode_handler.py
182-184: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
182-184: Avoid specifying long messages outside the exception class
(TRY003)
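As a hedged illustration of code that would satisfy both findings, assuming a hypothetical MissingOutputIdsError class and helper function that are not part of the actual codebase:

```python
# Hypothetical exception class: keeps the long message inside the class (TRY003).
class MissingOutputIdsError(ValueError):
    def __init__(self, response_keys):
        super().__init__(
            f"Missing 'output_ids' in response. Response keys: {response_keys}"
        )


# Hypothetical helper: 'from err' preserves the KeyError as __cause__ (B904).
def count_output_tokens(res: dict) -> int:
    try:
        return len(res["output_ids"])
    except KeyError as err:
        raise MissingOutputIdsError(list(res.keys())) from err
```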
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
- GitHub Check: vllm (amd64)
- GitHub Check: trtllm (arm64)
- GitHub Check: sglang
- GitHub Check: trtllm (amd64)
- GitHub Check: vllm (arm64)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (1)
components/src/dynamo/sglang/request_handlers/llm/decode_handler.py (1)
174-186: LGTM! Fix correctly includes tokens with finish_reason.
The refactored logic correctly addresses the issue where final tokens were missing when finish_reason was set. By initializing an empty out dict and always extracting tokens from output_ids, the method now properly includes both the final token(s) and the finish_reason in the response, as demonstrated in the PR objectives.
…io just in case. Match TRTLLM behavior
#2985 I had fixed this bug, only to find it has been reintroduced :(
Signed-off-by: Dan Gil <[email protected]>


Overview:
Fixes the SGLang worker to include the last token(s) in the response when a finish reason is set. Previously, when finish_reason was set, the response was hard-coded to contain no tokens, but there are scenarios where finish_reason is set and there are still tokens to send back with it.
Note this edge case isn't just for the "last" token, it's the last "batch" of tokens: if you set something like --stream-interval 50, where each iteration returns 50 tokens at a time, then without this change you'd be excluding up to the last 50 tokens as well.
Details:
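As a rough sketch of the behavior change described above (simplified, with assumed names and a dict-shaped result, not the actual worker code):

```python
# Old shape: a finished result was hard-coded to carry no tokens, dropping up to
# the last --stream-interval tokens of the generation.
def old_behavior(res, so_far):
    if res.get("finish_reason"):
        return {"token_ids": [], "finish_reason": res["finish_reason"]}, so_far
    return {"token_ids": res["output_ids"][so_far:]}, len(res["output_ids"])


# New shape: always emit the sliced token batch, then attach finish_reason if set.
def new_behavior(res, so_far):
    out = {"token_ids": res["output_ids"][so_far:]}
    if res.get("finish_reason"):
        out["finish_reason"] = res["finish_reason"]
    return out, len(res["output_ids"])
```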
Before (content=null, no token returned) vs. After (<think> token returned in the content field): see the screenshots linked above.
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)