Approval prompts become permanently stuck after daemon restart #939

@Aaronontheweb

Description

Problem

When a tool approval prompt is rendered in Slack or Discord and the Netclaw daemon restarts (or the session times out and is passivated), clicking the approval button does nothing. The button is a platform-side UI element that still exists, but the server-side actor responsible for handling the response no longer exists — the response is silently dropped.

Reproduction

  1. Start a Netclaw session in Slack or Discord
  2. Trigger a tool call that requires approval (e.g., shell_execute with unapproved patterns)
  3. Observe the approval prompt rendered with Approve/Deny buttons
  4. Restart the Netclaw daemon (or wait for the 1-hour receive timeout to passivate the session)
  5. Click an approval button — nothing happens, the button appears dead

Root Cause

The approval lifecycle spans three actors, and all state is in-memory — nothing survives a restart:

1. LlmSessionActor blocks on a TaskCompletionSource

The session actor creates an ApprovalChannel containing a ConcurrentDictionary<ToolCallId, TaskCompletionSource<ApprovalDecision>> that blocks the tool execution thread awaiting user response. This TCS is transient — lost on restart.

The session also tracks _pendingToolInteractions (in-memory Dictionary<string, PendingToolInteraction>) to validate incoming responses, but this is also transient.
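
The in-memory pattern described above can be sketched as follows. This is a minimal illustration, not the Netclaw implementation: `InMemoryApprovalChannelSketch`, `TryResolve`, the string call id, and the `ApprovalDecision` members are assumptions made for the example.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Hypothetical stand-in; the real ApprovalDecision type may differ.
public enum ApprovalDecision { Approved, Denied }

// Sketch of an approval channel whose only state is an in-process dictionary.
public sealed class InMemoryApprovalChannelSketch
{
    private readonly ConcurrentDictionary<string, TaskCompletionSource<ApprovalDecision>> _pending = new();

    // Tool execution awaits this task until a decision arrives for callId.
    public Task<ApprovalDecision> WaitForApprovalAsync(string callId)
    {
        var tcs = _pending.GetOrAdd(
            callId,
            _ => new TaskCompletionSource<ApprovalDecision>(TaskCreationOptions.RunContinuationsAsynchronously));
        return tcs.Task;
    }

    // Invoked when a button click arrives. Returns false if no waiter is
    // registered, which is always the case for a channel created after a restart.
    public bool TryResolve(string callId, ApprovalDecision decision)
    {
        return _pending.TryRemove(callId, out var tcs) && tcs.TrySetResult(decision);
    }
}
```

A channel instance created after a restart starts with an empty dictionary, so a `TryResolve` call for an approval requested before the restart finds no waiter. That is the failure mode this issue describes.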

2. Channel binding actors hold the pending list

Both SlackThreadBindingActor._pendingApprovalRequests and DiscordSessionBindingActor._pendingApprovalRequests are in-memory lists that track which approvals are outstanding for re-posting buttons. They persist only while the actor is alive.

3. Conversation actors drop responses for missing children

When the approval button is clicked, the platform sends an interaction to the conversation actor. The conversation actor looks up the child binding actor by name. If it's gone (restart or passivation), the response is silently dropped:

Slack: SlackConversationActor.cs — drops if thread.IsNobody():

Sender.Tell(new Status.Failure(new InvalidOperationException(ingressClosedReason)));
return;
}
// Defense-in-depth: validate channel ACL even though the tool already checked.
// DM channels (D-prefixed) skip this — they were validated via user ACL + AllowDirectMessages.
var isDmChannel = message.ChannelId.Value.StartsWith("D", StringComparison.Ordinal);

Discord: DiscordConversationActor.cs — drops if sessionBinding.IsNobody():

if (sessionBinding.IsNobody())
{
    _log.Info(
        "Ignoring Discord interaction for missing session binding channel={0} threadOrMessage={1}",
        _channelId.Value,
        interaction.ThreadOrMessageId.Value);
    ChannelTelemetry.For(ChannelType.Discord).RecordExtra("interactionErrors", "missing_session_binding");
    return;
}

Flow Diagram

sequenceDiagram
    participant Slack as Slack/Discord UI
    participant Conv as ConversationActor
    participant Bind as ThreadBindingActor
    participant Pipeline as SessionPipeline
    participant Session as LlmSessionActor
    participant Approval as ApprovalChannel (TCS)

    Note over Session,Slack: Tool call requires approval

    Session->>Approval: WaitForApprovalAsync(callId)
    Note over Approval: Blocks on TCS (in-memory)

    Session-->>Pipeline: emit ToolInteractionRequest
    Pipeline-->>Bind: HandleApprovalRequestAsync
    Bind->>Slack: Post approval buttons
    Note over Bind: Added to _pendingApprovalRequests (in-memory)

    Note over Slack,Approval: === DAEMON RESTART ===

    rect rgb(255, 230, 230)
        Note over Conv,Approval: All actors stopped.<br/>TCS, _pendingApprovalRequests,<br/>_pendingToolInteractions lost.
        Note over Slack: Buttons still exist in UI
    end

    Slack->>Conv: User clicks approval button
    Conv->>Conv: Context.Child(bindingActorName)
    Conv--xConv: IsNobody() = true
    Conv--xSlack: Dropped with no user feedback (logged only)

    Note over Slack,Approval: Button appears dead.<br/>Session never receives approval.<br/>Tool call is permanently stuck.

Proposed Fix

Make approval requests reentrant across daemon restarts by persisting approval state and resurrecting actors on demand.

Phase 1: Persist pending approvals in session state

  • Add a ToolInteractionPending persisted event to LlmSessionActor that captures: CallId, ToolName, Patterns, Options, RequesterSenderId, RequesterPrincipal
  • On recovery (RecoveryCompleted), if any ToolInteractionPending events remain unresolved, transition to a new WaitingForApproval phase
  • When a ToolInteractionResponse arrives during recovery, match it against persisted pending approvals and resolve
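
A rough sketch of the Phase 1 idea is to replay persisted events to rebuild the set of pending approvals. It uses a plain in-memory event list as a stand-in for the Akka Persistence journal. All type names ending in `Sketch` are hypothetical, the "resolved" event is an assumption, and only a subset of the fields listed above is shown.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical event shapes; real field types are not specified in this issue.
public abstract record SessionEventSketch(string CallId);

public sealed record ToolInteractionPendingSketch(string CallId, string ToolName, string RequesterSenderId)
    : SessionEventSketch(CallId);

// Assumed counterpart event recording that a pending approval was answered.
public sealed record ToolInteractionResolvedSketch(string CallId, bool Approved)
    : SessionEventSketch(CallId);

public enum SessionPhaseSketch { Idle, WaitingForApproval }

public sealed class SessionStateSketch
{
    private readonly Dictionary<string, ToolInteractionPendingSketch> _pending = new();

    public IReadOnlyDictionary<string, ToolInteractionPendingSketch> Pending => _pending;

    // The phase is derived from state, so it is correct after any replay.
    public SessionPhaseSketch Phase =>
        _pending.Count > 0 ? SessionPhaseSketch.WaitingForApproval : SessionPhaseSketch.Idle;

    // Applied when an event is first persisted and again when it is replayed.
    public void Apply(SessionEventSketch evt)
    {
        switch (evt)
        {
            case ToolInteractionPendingSketch p:
                _pending[p.CallId] = p;
                break;
            case ToolInteractionResolvedSketch r:
                _pending.Remove(r.CallId);
                break;
        }
    }

    // Rebuilds state from the journal, as recovery would after a restart.
    public static SessionStateSketch Recover(IEnumerable<SessionEventSketch> journal)
    {
        var state = new SessionStateSketch();
        foreach (var evt in journal) state.Apply(evt);
        return state;
    }
}
```

Deriving the phase from the replayed pending set means a late ToolInteractionResponse can be matched by CallId against `Pending` no matter how many restarts happened in between.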

Phase 2: Resurrect binding actors on approval response

  • SlackConversationActor: When receiving SlackApprovalResponse for a missing thread, spawn a new SlackThreadBindingActor (it recovers from persistence), forward the response
  • DiscordConversationActor: Same pattern — spawn new DiscordSessionBindingActor on missing child
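
The resurrect-on-demand pattern from Phase 2 might look roughly like this in Akka.NET. The actor and message types are hypothetical placeholders, and the sketch assumes the Akka NuGet package is referenced. A real binding actor would be a persistent actor that recovers its state before handling the forwarded response.

```csharp
using System;
using Akka.Actor;

// Hypothetical message; the real SlackApprovalResponse shape is not shown in this issue.
public sealed record ApprovalResponseSketch(string ThreadId, string CallId, bool Approved);

// Placeholder for SlackThreadBindingActor / DiscordSessionBindingActor.
public sealed class BindingActorSketch : ReceiveActor
{
    public BindingActorSketch()
    {
        // Echoes a confirmation so the caller can see the response was handled.
        Receive<ApprovalResponseSketch>(msg =>
            Sender.Tell($"handled {msg.CallId} approved={msg.Approved}"));
    }
}

// Placeholder for the conversation actor that owns the binding children.
public sealed class ConversationActorSketch : ReceiveActor
{
    public ConversationActorSketch()
    {
        Receive<ApprovalResponseSketch>(msg =>
        {
            var name = "thread-" + msg.ThreadId;
            var child = Context.Child(name);
            if (child.IsNobody())
            {
                // Instead of dropping the response, recreate the child under the same name.
                child = Context.ActorOf(Props.Create(() => new BindingActorSketch()), name);
            }

            // Forward keeps the original sender so the child can reply directly.
            child.Forward(msg);
        });
    }
}
```

Reusing the child name is the important part of the design: if the binding actor's persistence id is derived from that name, the resurrected actor recovers the same journal the original wrote.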

Phase 3: Re-post approval UI on recovery

  • On RecoveryCompleted, if the session has pending approvals, the binding actor should re-post fresh Slack or Discord buttons (the old ones may have expired or been buried in the conversation)
  • The ToolInteractionRequest data from the persisted event provides everything needed to reconstruct the UI

Key files to modify

| File | Change |
| --- | --- |
| src/Netclaw.Actors/Protocol/SessionOutput.cs | Add ToolInteractionPending event for Akka Persistence |
| src/Netclaw.Actors/Sessions/LlmSessionActor.cs | Persist pending interactions; recover them; handle ToolInteractionResponse during recovery |
| src/Netclaw.Actors/Sessions/IApprovalChannel.cs | Support re-creation of TCS from persisted state on recovery |
| src/Netclaw.Channels.Slack/SlackConversationActor.cs | Resurrect thread actor on missing child for approval responses |
| src/Netclaw.Channels.Slack/SlackThreadBindingActor.cs | Re-post buttons on recovery if pending approvals exist |
| src/Netclaw.Channels.Discord/DiscordConversationActor.cs | Resurrect session binding on missing child for approval responses |
| src/Netclaw.Channels.Discord/DiscordSessionBindingActor.cs | Re-post buttons on recovery if pending approvals exist |

Impact

This is a correctness issue — approval prompts become permanently stuck after any restart. Since daemon restarts happen during updates and deployments, this affects all users in production. Fixing this also enables a proper "graceful hibernation" pattern where sessions can be passivated and reactivated without losing mid-turn state.
