Live API: phantom server_content.interrupted=True with no user audio wedges model into zero-audio state for remainder of session #2333

@doron-netizen

Description

Gemini Live API sends server_content.interrupted=True immediately after the greeting's turn_complete, when the client has sent zero user audio and input_transcription has never fired. After this phantom interrupt, the model produces zero audio chunks and zero output tokens for every subsequent user turn for the rest of the session, even though input_transcription fires correctly for user speech.

Model: gemini-2.5-flash-native-audio-preview-09-2025.
SDK: google-genai Python. Region: us-east-1.

Evidence from one reproducible session (timestamps relative, UTC):

- T+0.0s: turn_complete, turn=1 (greeting, gemini_audio_chunks=11).
- T+0.0s: server_content.interrupted=True with last_user_tx="" and last_user_tx_age_ms=-1.
- T+3.0s: turn_complete, turn=2; gemini_audio_chunks still 11, no model audio.
- T+13.0s: input_transcription fires for user speech; turn_complete follows with gemini_audio_chunks still 11.
- T+20.0s and T+36.0s: same pattern.
- T+50.0s: user closes the session. Final session tokens: input=6263, output=0.

The gemini_audio_chunks counter is the count of audio chunks produced by the model this session; it stays at 11 (the greeting) for the remaining 50 seconds despite 3 user utterances. output_tokens=0 confirms the model emitted nothing.
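For clarity, here is how we classify these events client-side. The field names (last_user_tx, last_user_tx_age_ms, gemini_audio_chunks, had_user_speech) are our own log/bookkeeping fields, not SDK attributes; this is a minimal sketch of the detection logic, not production code:

```python
def is_phantom_interrupt(event: dict) -> bool:
    """True when interrupted=True arrives although input_transcription has
    never fired (our logs record last_user_tx_age_ms=-1 for "never")."""
    return (
        event.get("interrupted") is True
        and event.get("last_user_tx", "") == ""
        and event.get("last_user_tx_age_ms", -1) == -1
    )


def is_wedged(turns: list) -> bool:
    """True when the model's audio-chunk counter stays flat across multiple
    turns in which the user demonstrably spoke (input_transcription fired)."""
    baseline = turns[0]["gemini_audio_chunks"]  # greeting turn
    user_turns = [t for t in turns if t["had_user_speech"]]
    return len(user_turns) >= 2 and all(
        t["gemini_audio_chunks"] == baseline for t in user_turns
    )
```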

Frequency: 4 phantom interrupts with the same signature (last_user_tx="", last_user_tx_age_ms=-1) observed in a 6-hour window across independent sessions. 3 of 4 recovered; 1 wedged permanently. Historical data from our logs shows 6 occurrences in a 1-hour window on an earlier date.

Ruled out on our side:

- Not a client-synthesized interrupt: we never emit interrupted=True from our side.
- Not client-side barge-in: no client interrupt control message was sent in the window.
- Not audio dropped by our relay: skip_audio=False throughout the post-interrupt window, so any model audio would have been forwarded.
- Not VAD tuning: already set to suppress false positives (start_of_speech_sensitivity=START_SENSITIVITY_HIGH, end_of_speech_sensitivity=END_SENSITIVITY_LOW, prefix_padding_ms=200, silence_duration_ms=1200).
- Not a token budget issue: the session used ~6k input tokens across 6 turns.
- Not a network disconnect: the session_resumption token stayed valid.
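For completeness, the VAD settings above are what we pass in the realtime input config at connect time. Shown here as a plain-dict mirror of that config block (field names follow the Live API setup message; treat the exact layout as illustrative, taken from our setup rather than from a spec):

```python
# Mirror of the realtime input config we send at session setup.
# These are the exact values referenced in the "not VAD tuning" bullet above.
REALTIME_INPUT_CONFIG = {
    "automatic_activity_detection": {
        "disabled": False,  # server-side VAD stays on
        "start_of_speech_sensitivity": "START_SENSITIVITY_HIGH",  # fewer false starts
        "end_of_speech_sensitivity": "END_SENSITIVITY_LOW",       # slower end-of-turn
        "prefix_padding_ms": 200,
        "silence_duration_ms": 1200,
    }
}
```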

Questions:

1. Why does server_content.interrupted=True fire when no client audio has been transmitted and no input_transcription has occurred?
2. Why does the model emit zero audio for subsequent user turns after this event, even though input_transcription continues firing?
3. Is there a documented recovery path for a session in this state, short of a full reconnect?
4. Can server-side VAD be disabled while keeping input_transcription enabled?

Full CloudWatch logs and SDK trace captures are available on request.
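Regarding question (4), here is the shape of what we would attempt if the answer is yes: disable automatic activity detection in the setup config while still requesting input transcription, then bracket each user utterance with manual activity signals. This is a hypothetical sketch using plain-dict mirrors of the messages; whether these two settings can coexist is exactly what we are asking:

```python
# Hypothetical setup for question (4): server VAD off, transcription kept.
SETUP = {
    "input_audio_transcription": {},  # keep input_transcription events flowing
    "realtime_input_config": {
        "automatic_activity_detection": {"disabled": True}
    },
}


def manual_turn(audio_chunks):
    """With server VAD disabled, the client must bracket speech itself:
    activity_start, then audio, then activity_end."""
    yield {"activity_start": {}}
    for chunk in audio_chunks:
        yield {"audio": chunk}
    yield {"activity_end": {}}
```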

Metadata

Labels

- priority: p2 (Moderately-important priority. Fix may not be included in next release.)
- status: awaiting user response
- type: bug (Error or flaw in code with unintended results or allowing sub-optimal usage patterns.)
