Live API: phantom server_content.interrupted=True with no user audio wedges model into zero-audio state for remainder of session #2333

@doron-netizen

Description

Gemini Live API sends server_content.interrupted=True immediately after the greeting's turn_complete, when the client has sent zero user audio and input_transcription has never fired. After this phantom interrupt, the model produces zero audio chunks and zero output tokens for every subsequent user turn for the rest of the session, even though input_transcription fires correctly for user speech.

Model: gemini-2.5-flash-native-audio-preview-09-2025.
SDK: google-genai Python. Region: us-east-1.

Evidence from one reproducible session (timestamps relative, UTC):

- T+0.0s: turn_complete, turn=1 (greeting, gemini_audio_chunks=11).
- T+0.0s: server_content.interrupted=True with last_user_tx="" and last_user_tx_age_ms=-1.
- T+3.0s: turn_complete, turn=2; gemini_audio_chunks still 11, no model audio.
- T+13.0s: input_transcription fires for user speech; turn_complete follows with gemini_audio_chunks still 11.
- T+20.0s and T+36.0s: same pattern.
- T+50.0s: user closes the session. Final session tokens: input=6263, output=0.

The gemini_audio_chunks counter is the count of audio chunks produced by the model this session; it stays at 11 (the greeting) for the remaining 50 seconds despite 3 user utterances. output_tokens=0 confirms the model emitted nothing.
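For clarity, here is how we classify these events client-side. The field names (last_user_tx, last_user_tx_age_ms, gemini_audio_chunks, had_user_speech) are our own log/bookkeeping fields, not SDK attributes; this is a minimal sketch of the detection logic, not production code:

```python
def is_phantom_interrupt(event: dict) -> bool:
    """True when interrupted=True arrives although input_transcription has
    never fired (our logs record last_user_tx_age_ms=-1 for "never")."""
    return (
        event.get("interrupted") is True
        and event.get("last_user_tx", "") == ""
        and event.get("last_user_tx_age_ms", -1) == -1
    )


def is_wedged(turns: list) -> bool:
    """True when the model's audio-chunk counter stays flat across multiple
    turns in which the user demonstrably spoke (input_transcription fired)."""
    baseline = turns[0]["gemini_audio_chunks"]  # greeting turn
    user_turns = [t for t in turns if t["had_user_speech"]]
    return len(user_turns) >= 2 and all(
        t["gemini_audio_chunks"] == baseline for t in user_turns
    )
```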

Frequency: 4 phantom interrupts with the same signature (last_user_tx="", last_user_tx_age_ms=-1) observed in a 6-hour window across independent sessions. 3 of 4 recovered; 1 wedged permanently. Historical data from our logs shows 6 occurrences in a 1-hour window on an earlier date.

Ruled out on our side:

- Not a client-synthesized interrupt: we never emit interrupted=True from our side.
- Not client-side barge-in: no client interrupt control message was sent in the window.
- Not audio dropped by our relay: skip_audio=False throughout the post-interrupt window, so any model audio would have been forwarded.
- Not VAD tuning: already set to suppress false positives (start_of_speech_sensitivity=START_SENSITIVITY_HIGH, end_of_speech_sensitivity=END_SENSITIVITY_LOW, prefix_padding_ms=200, silence_duration_ms=1200).
- Not a token budget issue: the session used ~6k input tokens across 6 turns.
- Not a network disconnect: the session_resumption token stayed valid.
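For completeness, the VAD settings above are what we pass in the realtime input config at connect time. Shown here as a plain-dict mirror of that config block (field names follow the Live API setup message; treat the exact layout as illustrative, taken from our setup rather than from a spec):

```python
# Mirror of the realtime input config we send at session setup.
# These are the exact values referenced in the "not VAD tuning" bullet above.
REALTIME_INPUT_CONFIG = {
    "automatic_activity_detection": {
        "disabled": False,  # server-side VAD stays on
        "start_of_speech_sensitivity": "START_SENSITIVITY_HIGH",  # fewer false starts
        "end_of_speech_sensitivity": "END_SENSITIVITY_LOW",       # slower end-of-turn
        "prefix_padding_ms": 200,
        "silence_duration_ms": 1200,
    }
}
```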

Questions:

1. Why does server_content.interrupted=True fire when no client audio has been transmitted and no input_transcription has occurred?
2. Why does the model emit zero audio for subsequent user turns after this event, even though input_transcription continues firing?
3. Is there a documented recovery path for a session in this state, short of a full reconnect?
4. Can server-side VAD be disabled while keeping input_transcription enabled?

Full CloudWatch logs and SDK trace captures are available on request.
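Regarding question (4), here is the shape of what we would attempt if the answer is yes: disable automatic activity detection in the setup config while still requesting input transcription, then bracket each user utterance with manual activity signals. This is a hypothetical sketch using plain-dict mirrors of the messages; whether these two settings can coexist is exactly what we are asking:

```python
# Hypothetical setup for question (4): server VAD off, transcription kept.
SETUP = {
    "input_audio_transcription": {},  # keep input_transcription events flowing
    "realtime_input_config": {
        "automatic_activity_detection": {"disabled": True}
    },
}


def manual_turn(audio_chunks):
    """With server VAD disabled, the client must bracket speech itself:
    activity_start, then audio, then activity_end."""
    yield {"activity_start": {}}
    for chunk in audio_chunks:
        yield {"audio": chunk}
    yield {"activity_end": {}}
```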

Metadata

Labels

- priority: p2 (Moderately-important priority. Fix may not be included in next release.)
- status: awaiting user response
- type: bug (Error or flaw in code with unintended results or allowing sub-optimal usage patterns.)
