
fix(streaming): thinking-only retry guard + 3-layer SSE/middleware/actor diagnostics + RollingFileLogger Debug fix #947

Merged

Aaronontheweb merged 3 commits into netclaw-dev:dev from Aaronontheweb:investigate/llamacpp-openai-chat-faults on May 9, 2026


Conversation

@Aaronontheweb (Collaborator) commented May 9, 2026

Summary

  • Thinking-only retry guard: when a streaming LLM call completes with reasoning content but no visible text/tool calls (Qwen3 + --jinja regime), retry via EvaluateEmptyResponse instead of letting Slack post a silent fallback reply (a sketch of the guard condition follows this list).
  • Stale-message absorption in LlmSessionActor.Ready for late LlmCallFailed / LlmResponseReceived after watchdog timeouts (kills noisy dead letters).
  • 3-layer streaming diagnostics (Debug level): OpenAiCompatibleChatClient (SSE wire), LoggingChatClient (middleware), LlmSessionActor (post-assembly). Counts text deltas/chars, thinking deltas/chars, tool-call deltas, and finish reason at each layer so operators can pinpoint where deltas land or get dropped.
  • ILoggerFactory injection in OpenAiCompatibleProviderPlugin so the SSE-layer log isn't swallowed by NullLogger.Instance.
  • RollingFileLogger Debug-floor fix: removes the hardcoded >= LogLevel.Information filter so Logging:LogLevel:Default=Debug actually persists Debug logs to ~/.netclaw/logs/daemon-*.log (bug: RollingFileLogger hardcodes Information floor, ignores framework log level #908). Without this, the new diagnostics — and any other Debug logs — never reach disk regardless of config.
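For orientation, a minimal sketch of the unified guard condition. The names below (AssembledResponse, EvaluateTurn, Self.Tell) are stand-ins rather than actual LlmSessionActor members, and EvaluateEmptyResponse is modeled as a self-message because the PR doesn't spell out whether it's a message or a method:

    // Sketch only; types and member names are assumptions. The unified condition is
    // the point: retry when there is no visible text and no tool calls, even if
    // thinking/reasoning content arrived during the stream.
    sealed record AssembledResponse(string? Text, IReadOnlyList<object> ToolCalls);
    sealed record EvaluateEmptyResponse;

    void EvaluateTurn(AssembledResponse response)
    {
        var hasVisibleText = !string.IsNullOrWhiteSpace(response.Text);
        var hasToolCalls   = response.ToolCalls.Count > 0;

        if (!hasVisibleText && !hasToolCalls)
        {
            // Covers both the fully empty case and the new thinking-only case
            // (reasoning deltas arrived but the model never transitioned to content).
            Self.Tell(new EvaluateEmptyResponse());
            return;
        }

        // Otherwise deliver the assistant message to the channel as usual.
    }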

Related

Refs netclaw-dev/netclaw-website#16 — delivers the 3-layer SSE / middleware / actor diagnostic counters described in that issue's "What Netclaw provides to help diagnose this" section. Does not close it: the docs/troubleshooting article portion of #16 is still open and should be addressed separately on the netclaw-website repo.

Context

Symptom: a recent self-hosted Slack session (D0AC6CKBK5K/1778333220.192409, 2026-05-09) produced three consecutive streaming turns with output: 2/4/28 final tokens despite many Thinking delta: events arriving. Each turn ended with Turn completed without visible Slack output; posting fallback reply. Same backend, the non-streaming title and distillation calls returned full output (1752 / 1358 / 466 tokens), so the failure was specific to the streaming path with reasoning content.

Pattern lines up with the May 8 testlab fix "fix(llama-server): add --jinja so Qwen3 tool-call template is honored". With --jinja + --reasoning-format deepseek, llama-server now correctly emits Qwen3's <think> content as delta.reasoning_content (separate channel) rather than jumbling it into delta.content. Our streaming consumer surfaces those reasoning deltas as Thinking, but the assistant response can be empty if the model never transitions to content — that's the case the new retry guard handles.
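For illustration, roughly how the two channels present to a streaming consumer. The DTO and helper names below are assumptions rather than Netclaw types, and the sample chunks are abridged:

    // Hypothetical consumer-side shape; the field names mirror what this PR describes
    // (delta.content vs delta.reasoning_content), not actual Netclaw DTOs.
    //
    // Abridged example chunks:
    //   data: {"choices":[{"delta":{"reasoning_content":"First, check the tool schema"}}]}
    //   data: {"choices":[{"delta":{"content":"Here is the answer."}}]}
    sealed record StreamDelta(string? Content, string? ReasoningContent,
                              string? ToolCallUpdate, string? FinishReason);

    static class DeltaKinds
    {
        public static string Classify(StreamDelta d) =>
            !string.IsNullOrEmpty(d.ReasoningContent) ? "thinking"   // surfaced as "Thinking delta:" events
            : !string.IsNullOrEmpty(d.Content)        ? "text"       // becomes the visible Slack reply
            : d.ToolCallUpdate is not null            ? "tool-call"
            : "other";
    }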

Composition

Three commits, smallest-change-first:

  1. a64819c3 — diagnostics + thinking-only retry guard (cherry-picked from prior investigation branch claude-wt-netclaw-insta-crash)
  2. 3a86cf54 — ILoggerFactory injection (cherry-picked; otherwise the SSE-layer log is silent)
  3. 1bf8da37 — RollingFileLogger Debug-floor fix (#908)

The temporary Warning/Info level bump from the original investigation branch was deliberately not included — the diagnostics stay at Debug and configuration decides what gets persisted.

Test plan

  • Full test suite passes (3,342 / 3,342, 0 failures)

  • dotnet build clean (0 warnings, 0 errors)

  • Live validation against testlab (https://llm.testlab.petabridge.net, Qwen3.6-27B-UD-Q4_K_XL.gguf) in an isolated Docker container with NETCLAW_Logging__LogLevel__Default=Debug. All three breakdown logs landed in the daemon log file at [DBG]:

    SSE        : textDeltas=2 textChars=3 thinkingDeltas=37 thinkingChars=150 toolCallDeltas=0 finishReason=stop
    Middleware : textDeltas=2 textChars=3 thinkingDeltas=37 thinkingChars=150 toolCallDeltas=0 finishReason=stop
    Actor      : text=3ch thinking=150ch toolCalls=0 finishReason=stop
    

    Counts agree across all three layers for a healthy call — confirming no delta loss in the happy path and that the instrumentation is wired correctly to compare against fault scenarios.

  • Reproduce the May 9 fault pattern with the retry guard active and confirm a real assistant message replaces the fallback reply.

Commit messages

Add debug-level logging across the LLM response pipeline to diagnose
sessions that produce tokens but no visible Slack output. Three layers
now report content type breakdowns (text/thinking/tool call counts and
char totals): OpenAiCompatibleChatClient (SSE), LoggingChatClient
(middleware), and LlmSessionActor (actor). Comparing the three logs
pinpoints whether content is misrouted upstream (llama.cpp) or dropped
downstream (Netclaw parsing/ToolCallTextFilter).
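A sketch of what one layer's counter pass can look like, reusing the hypothetical StreamDelta shape from the sketch in the Context section above (the real counters live in OpenAiCompatibleChatClient, LoggingChatClient, and LlmSessionActor, and their exact shapes may differ):

    // Sketch only: the logged fields mirror the breakdown lines shown in the test plan
    // (textDeltas/textChars/thinkingDeltas/thinkingChars/toolCallDeltas/finishReason).
    async IAsyncEnumerable<StreamDelta> CountAndForwardAsync(
        IAsyncEnumerable<StreamDelta> upstream, ILogger logger)
    {
        int textDeltas = 0, textChars = 0, thinkingDeltas = 0, thinkingChars = 0, toolCallDeltas = 0;
        string? finishReason = null;

        await foreach (var d in upstream)
        {
            if (!string.IsNullOrEmpty(d.Content)) { textDeltas++; textChars += d.Content.Length; }
            if (!string.IsNullOrEmpty(d.ReasoningContent)) { thinkingDeltas++; thinkingChars += d.ReasoningContent.Length; }
            if (d.ToolCallUpdate is not null) toolCallDeltas++;
            finishReason = d.FinishReason ?? finishReason;

            yield return d; // pass-through: counting must not perturb the stream
        }

        logger.LogDebug(
            "stream content breakdown: textDeltas={TextDeltas} textChars={TextChars} " +
            "thinkingDeltas={ThinkingDeltas} thinkingChars={ThinkingChars} " +
            "toolCallDeltas={ToolCallDeltas} finishReason={FinishReason}",
            textDeltas, textChars, thinkingDeltas, thinkingChars, toolCallDeltas, finishReason);
    }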

Unify the empty-response and thinking-only guards into a single check
that retries via EvaluateEmptyResponse when the LLM produces reasoning
content but no visible text or tool calls — prevents silent fallback
replies in Slack.

Absorb stale LlmCallFailed and LlmResponseReceived messages in the
Ready state to eliminate noisy dead letters after watchdog timeouts.

OpenAiCompatibleProviderPlugin was constructing OpenAiCompatibleChatClient
without a logger, so _logger fell back to NullLogger.Instance and the
SSE-layer "stream content breakdown" Debug log was silently swallowed.
Wire ILoggerFactory through DI and create a categorized logger.
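Roughly, the wiring looks like this; the constructor shapes below are assumptions (the real plugin and client signatures differ), and only the ILoggerFactory-to-CreateLogger<T>() handoff is the point:

    // Hypothetical shapes. What matters: the plugin resolves ILoggerFactory from DI and
    // hands the chat client a categorized logger instead of leaving it on NullLogger.Instance.
    public sealed class OpenAiCompatibleProviderPlugin
    {
        private readonly ILoggerFactory _loggerFactory;

        public OpenAiCompatibleProviderPlugin(ILoggerFactory loggerFactory)
            => _loggerFactory = loggerFactory;

        public OpenAiCompatibleChatClient CreateClient(Uri endpoint) =>
            new(endpoint, _loggerFactory.CreateLogger<OpenAiCompatibleChatClient>());
    }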

RollingFileLogger.IsEnabled was hardcoded to LogLevel.Information,
which silently overrode any Debug-level configuration coming through
Logging.LogLevel.Default or SetMinimumLevel. The framework was
correctly configured for Debug, but every Debug log was rejected
at the file sink — only the console sink saw them.

This made the new SSE / middleware / actor content-breakdown
diagnostics invisible in production daemon logs, and is consistent
with operator reports that "structured log output isn't appearing
in daemon logs."

Defer entirely to the framework's configured minimum level instead
of imposing our own floor.
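For context, a minimal sketch of a file sink that defers to the framework. It assumes nothing about the real RollingFileLogger beyond what's described above; the framework-side Logger applies the configured Logging:LogLevel filters before the provider's ILogger is ever called, so the sink only needs to reject LogLevel.None:

    // Hypothetical stand-in for RollingFileLogger showing the deferral pattern.
    public sealed class FileSinkLogger : ILogger
    {
        public IDisposable? BeginScope<TState>(TState state) where TState : notnull => null;

        // Defer level filtering to the framework's configured minimum; only None is off.
        public bool IsEnabled(LogLevel logLevel) => logLevel != LogLevel.None;

        public void Log<TState>(LogLevel logLevel, EventId eventId, TState state,
            Exception? exception, Func<TState, Exception?, string> formatter)
        {
            if (!IsEnabled(logLevel)) return;
            // ...append formatter(state, exception) to the rolling daemon-*.log file.
        }
    }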

Cherry-picked from the diagnostic-investigation branch (the original
6558a47 also bumped specific call sites to Warning/Info temporarily;
that bump is intentionally skipped here — the diagnostic logs stay
at Debug and we let configuration decide what gets persisted).
@Aaronontheweb (Collaborator, Author) left a comment

Comment thread: src/Netclaw.Actors/Sessions/LlmSessionActor.cs

    });

    Command<ProcessingWatchdogExpired>(_ => { });
    +Command<LlmCallFailed>(_ => { }); // stale failure arriving after watchdog timeout

Need to clarify one thing before we merge this
LGTM

Comment thread: RollingFileLogger

    public IDisposable? BeginScope<TState>(TState state) where TState : notnull => null;

    -public bool IsEnabled(LogLevel logLevel) => logLevel >= LogLevel.Information;
    +public bool IsEnabled(LogLevel logLevel) => logLevel != LogLevel.None;

ensures that debug logs can actually show up now

@Aaronontheweb merged commit fe5c89b into netclaw-dev:dev on May 9, 2026
7 of 8 checks passed
@Aaronontheweb deleted the investigate/llamacpp-openai-chat-faults branch on May 9, 2026 at 15:07

Labels

reliability: Retries, resilience, graceful degradation
sessions: LLM session actor, turn lifecycle, pipelines
