
fix(llm): collect usage stats from final stream chunk#276

Merged
0xallam merged 1 commit into main from fix/input-token-counting
Jan 21, 2026

Conversation

Member

@0xallam 0xallam commented Jan 21, 2026

Problem

Input tokens were always showing as zero in usage stats. This regression was introduced in commit 56526cb, which added an early break when </function> was found in the streamed response.

When using streaming with stream_options: {"include_usage": True}, the LLM API sends token usage data in a separate final chunk after all content chunks:

chunk 1: {"content": "Let me analyze..."}
chunk 2: {"content": "</function>"}
chunk 3: {"content": null, "usage": {"prompt_tokens": 1500, "completion_tokens": 200}}  ← FINAL CHUNK

The early break caused us to exit the loop before receiving chunk 3, so stream_chunk_builder(chunks) built a response with no usage data.
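The failure mode can be reproduced with a minimal simulation (hypothetical chunk objects standing in for the API's stream; collect_with_early_break is an illustrative stand-in for the buggy loop, not the actual strix/llm/llm.py code):

```python
from types import SimpleNamespace

# Simulated stream: content chunks followed by a usage-only final chunk,
# mirroring what the API sends with stream_options={"include_usage": True}.
stream = [
    SimpleNamespace(content="Let me analyze...", usage=None),
    SimpleNamespace(content="</function>", usage=None),
    SimpleNamespace(content=None, usage={"prompt_tokens": 1500, "completion_tokens": 200}),
]

def collect_with_early_break(stream):
    """The buggy loop: break as soon as </function> appears."""
    chunks = []
    for chunk in stream:
        chunks.append(chunk)
        if chunk.content and "</function>" in chunk.content:
            break  # exits before the usage chunk arrives
    return chunks

chunks = collect_with_early_break(stream)
usage = next((c.usage for c in chunks if c.usage), None)
print(usage)  # None: the final usage chunk was never collected
```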

Solution

Instead of breaking immediately when </function> is found, we now:

  1. Set a flag and continue collecting chunks
  2. Break when we receive a chunk with usage data (ideal case)
  3. Fall back to breaking after 5 additional chunks (prevents infinite loops with misbehaving models)
The new loop logic (done_streaming doubles as the flag and the chunk counter):

if done_streaming:
    done_streaming += 1  # count chunks seen since </function>
    if getattr(chunk, "usage", None) or done_streaming > 5:
        break  # got usage, or hit the fallback limit
    continue

This ensures we capture the usage chunk while still protecting against models that don't properly end their streams.
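The three steps above can be sketched as a self-contained loop over simulated chunks (collect_chunks and fallback_limit are illustrative names, not the actual strix/llm/llm.py code):

```python
from types import SimpleNamespace

def collect_chunks(stream, fallback_limit=5):
    """Keep reading after </function> until a usage chunk arrives,
    or until fallback_limit extra chunks have been consumed."""
    chunks = []
    done_streaming = 0  # 0 = still streaming; >0 = chunks seen since </function>
    for chunk in stream:
        chunks.append(chunk)
        if done_streaming:
            done_streaming += 1
            if getattr(chunk, "usage", None) or done_streaming > fallback_limit:
                break  # got usage, or hit the safety limit
            continue
        if chunk.content and "</function>" in chunk.content:
            done_streaming = 1  # set the flag, but keep collecting

    return chunks

stream = [
    SimpleNamespace(content="Let me analyze...", usage=None),
    SimpleNamespace(content="</function>", usage=None),
    SimpleNamespace(content=None, usage={"prompt_tokens": 1500, "completion_tokens": 200}),
]
chunks = collect_chunks(stream)
print(chunks[-1].usage)  # {'prompt_tokens': 1500, 'completion_tokens': 200}
```

With a well-behaved stream the loop exits on the very next chunk after </function>; with a misbehaving model it exits after at most five additional chunks.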

Thanks to @bearsyankees for catching this issue!

The early break on </function> prevented receiving the final chunk
that contains token usage data (input_tokens, output_tokens).
@greptile-apps
Contributor

greptile-apps bot commented Jan 21, 2026

Greptile Summary

This PR fixes a regression where input tokens were always showing as zero in usage stats. The issue was caused by an early break statement that exited the streaming loop before receiving the final chunk containing usage data from the LLM API.

The fix replaces the immediate break with a flag-based approach that:

  • Continues collecting chunks after </function> is detected
  • Breaks when a chunk with usage data arrives (ideal case)
  • Falls back to breaking after 5 additional chunks (prevents infinite loops)

This ensures usage statistics are properly captured while maintaining protection against misbehaving models.

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The fix correctly addresses the root cause of the usage stats regression with a simple, well-bounded solution. The logic is sound: it continues streaming after </function> is found, breaks when usage data arrives, and includes a safety limit of 5 additional chunks to prevent infinite loops. The change is minimal, focused, and doesn't affect other functionality.
  • No files require special attention

Important Files Changed

strix/llm/llm.py — Fixed regression where usage stats were always zero by collecting the final stream chunk before breaking

@0xallam 0xallam merged commit b456a4e into main Jan 21, 2026
1 check passed
@0xallam 0xallam deleted the fix/input-token-counting branch January 21, 2026 04:36