fix: reconnect xai asr after finalize#2192

Merged
wangyoucao577 merged 4 commits into main from fix/xai-asr-finalize-reconnect on Jun 25, 2026

Conversation

@BenWeekes

Contributor

Summary

xAI STT is a single-utterance protocol: the server closes the websocket after each transcript.done. Previously, on a finalize-induced close, on_close() consumed the _close_expected flag and returned without reconnecting, leaving recognition = None. Every subsequent turn's audio then hit the base class, which only buffers frames when disconnected and never reconnects — so the second utterance in a session was silently dropped. This is the "first turn works, dead after that" symptom seen in the voice_assistant_xai_grok graph.

Changes

  • extension.py — on the planned-close path, schedule _reconnect_after_finalize() to open a fresh connection (bypassing the failure-backoff path, since this is a planned cycle, not an error), with a fallback to the reconnect-manager backoff if the clean reconnect throws. Buffered frames flush from on_open. A fresh XAIASRRecognition per utterance also avoids reusing a stale done_event/done_payload.
  • The reconnect task is tracked via self._finalize_reconnect_task and cancelled on stop_connection() / on_deinit(), so an in-flight reconnect is cleanly aborted on shutdown.
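
The planned-close path described above can be sketched roughly as follows. This is a simplified model, not the code in extension.py: `FakeRecognition`, `ExtensionSketch`, `_backoff_reconnect`, and the `fail_clean_reconnect` switch are hypothetical stand-ins, and the real reconnect manager and xAI client are not modelled.

```python
import asyncio
from typing import Optional


class FakeRecognition:
    """Hypothetical stand-in for XAIASRRecognition; one instance per utterance."""

    def __init__(self) -> None:
        self.started = False

    async def start(self) -> None:
        self.started = True


class ExtensionSketch:
    """Hypothetical, heavily reduced model of the extension."""

    def __init__(self, fail_clean_reconnect: bool = False) -> None:
        self.recognition: Optional[FakeRecognition] = None
        self._close_expected = False
        self._finalize_reconnect_task: Optional[asyncio.Task] = None
        self._fail_clean_reconnect = fail_clean_reconnect  # test switch (assumption)
        self.backoff_reconnects = 0  # counts fallbacks to the backoff path

    async def _connect_recognition(self) -> None:
        if self._fail_clean_reconnect:
            raise ConnectionError("simulated connect failure")
        recognition = FakeRecognition()  # fresh object, no stale done state
        await recognition.start()
        self.recognition = recognition

    async def _backoff_reconnect(self) -> None:
        # Placeholder for the reconnect-manager backoff path.
        self.backoff_reconnects += 1

    async def _reconnect_after_finalize(self) -> None:
        try:
            await self._connect_recognition()
        except Exception:
            # Clean reconnect failed: fall back to the failure-backoff path.
            # CancelledError is not an Exception subclass, so cancellation
            # still propagates to the caller.
            await self._backoff_reconnect()

    async def on_close(self) -> None:
        self.recognition = None
        if self._close_expected:
            # Planned close after finalize: reconnect immediately, no backoff.
            self._close_expected = False
            self._finalize_reconnect_task = asyncio.create_task(
                self._reconnect_after_finalize()
            )
            return
        await self._backoff_reconnect()

    async def stop_connection(self) -> None:
        task, self._finalize_reconnect_task = self._finalize_reconnect_task, None
        if task is not None and not task.done():
            task.cancel()
            await asyncio.gather(task, return_exceptions=True)
        self.recognition = None
```

Keeping a handle to the task is what makes the shutdown path possible: without `self._finalize_reconnect_task`, `stop_connection()` would have nothing to cancel and a late reconnect could reopen a connection after shutdown.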

Tests

  • tests/test_reconnect.py — unit tests (mock-based): planned reconnect on finalize-close, fallback to backoff on failure, and cancellation of a pending reconnect on stop_connection.
  • integration_tests/asr_guarder/test_same_session_finalize_reconnect.py — guarder test that runs two audio→finalize cycles in one session and requires a second non-empty final result.

@diyuyi-agora

Contributor

Race between stop_connection and an in-flight _connect_recognition, with no test coverage

test_stop_connection_cancels_pending_finalize_reconnect_task injects a fake task blocked on Event.wait(), so it does not cover the real path where reconnect has already reached _connect_recognition() / recognition.start().

If stop_connection() is called while _connect_recognition() is running:

  • the task is cancelled;
  • but _connect_recognition() may have already partially created self.recognition;
  • stop_connection() then calls close() and sets it to None.

That is usually acceptable, but there is a short window where it is unclear whether cleanup after cancel is complete, or whether extra on_close / error callbacks are triggered. There is no test coverage for that today.

Suggestion: Add a test that mocks _connect_recognition so it is already awaiting before stop_connection is called, then verify that recognition is None, there is no second reconnect, and no leaked exceptions.
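
One possible shape for such a test, as a sketch only: `RaceSketch`, `schedule_reconnect`, `connect_entered`, `connect_calls`, and `run_race` are hypothetical names, and the class is a minimal model rather than the real extension. The point is the ordering: an `asyncio.Event` guarantees that `_connect_recognition()` is already awaiting before `stop_connection()` runs.

```python
import asyncio
from typing import Optional


class RaceSketch:
    """Hypothetical minimal model of the extension; not the real class."""

    def __init__(self) -> None:
        self.recognition: Optional[object] = None
        self.connect_entered = asyncio.Event()
        self.connect_calls = 0
        self._finalize_reconnect_task: Optional[asyncio.Task] = None

    async def _connect_recognition(self) -> None:
        self.connect_calls += 1
        self.recognition = object()  # partially created before start() returns
        self.connect_entered.set()
        await asyncio.sleep(3600)  # stands in for a slow recognition.start()

    def schedule_reconnect(self) -> None:
        self._finalize_reconnect_task = asyncio.create_task(
            self._connect_recognition()
        )

    async def stop_connection(self) -> None:
        task, self._finalize_reconnect_task = self._finalize_reconnect_task, None
        if task is not None:
            task.cancel()
            # Wait for the cancellation to finish without re-raising it here.
            await asyncio.gather(task, return_exceptions=True)
        self.recognition = None


async def run_race() -> RaceSketch:
    ext = RaceSketch()
    ext.schedule_reconnect()
    await ext.connect_entered.wait()  # reconnect is now mid-flight
    await ext.stop_connection()
    await asyncio.sleep(0)  # give any stray callbacks a chance to run
    return ext


if __name__ == "__main__":
    ext = asyncio.run(run_race())
    assert ext.recognition is None          # cleanup completed
    assert ext.connect_calls == 1           # no second reconnect
    assert ext._finalize_reconnect_task is None
```

In the real suite the same ordering could be achieved by patching `_connect_recognition` with an async mock that sets an event and then blocks, and by additionally asserting that `on_close` and the error callbacks are not invoked after the cancel.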

@wangyoucao577 merged commit 071f24c into main on Jun 25, 2026
21 of 34 checks passed
@wangyoucao577 deleted the fix/xai-asr-finalize-reconnect branch on June 25, 2026, 13:05