fix(google): pause audio input during synchronous tool execution on t…#5556
vedevpatel wants to merge 1 commit into livekit:main
Pull request overview
Updates the Google Gemini Realtime session to prevent microphone audio from being streamed while Gemini 3.1 synchronous tool calls are in-flight, avoiding server-side “new turn” detection that cancels pending tool calls and corrupts state.
Changes:
- Added a _tool_call_pending flag to pause push_audio() during Gemini 3.1 tool execution.
- Set the flag when a tool call is received (Gemini 3.1 only), clear it after sending tool responses.
- Clear the flag on tool_call_cancellation to prevent stalling.
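The gating described in the list above can be sketched roughly as follows. This is a toy model and not the plugin's code: ToySession, sent_frames, and the three on_* handler names are hypothetical stand-ins; only _tool_call_pending and push_audio are names taken from the PR.

```python
class ToySession:
    """Hypothetical stand-in for the realtime session; not the plugin's class."""

    def __init__(self) -> None:
        self._tool_call_pending = False
        self.sent_frames = []  # stands in for the outgoing audio channel

    def push_audio(self, frame: bytes) -> None:
        # Drop microphone audio while a synchronous tool call is in flight.
        if self._tool_call_pending:
            return
        self.sent_frames.append(frame)

    def on_tool_call(self) -> None:
        # Assumed hook: called when the server requests a tool call.
        self._tool_call_pending = True

    def on_tool_response_sent(self) -> None:
        # Assumed hook: called after the tool response has been sent.
        self._tool_call_pending = False

    def on_tool_call_cancellation(self) -> None:
        # Clear the flag so a cancelled call cannot leave the session muted.
        self._tool_call_pending = False


session = ToySession()
session.push_audio(b"a")
session.on_tool_call()
session.push_audio(b"b")  # dropped: tool call pending
session.on_tool_response_sent()
session.push_audio(b"c")
```

In this sketch only the frames pushed outside the pending window reach sent_frames.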
| def push_audio(self, frame: rtc.AudioFrame) -> None: | ||
| if self._tool_call_pending: |
When audio is dropped due to _tool_call_pending, the AudioByteStream (and potentially the resampler) may still contain buffered partial samples from before the tool call. When _tool_call_pending flips back to False, the next push_audio() call can combine that stale buffered audio with new audio, creating a discontinuity/corrupted stream. Consider clearing _bstream (and resetting _input_resampler if needed) when entering the pending state (or before returning early here).
Suggested change:

    def push_audio(self, frame: rtc.AudioFrame) -> None:
        if self._tool_call_pending:

Replace with:

    def _clear_pending_audio_state(self) -> None:
        flush_bstream = getattr(self._bstream, "flush", None)
        if callable(flush_bstream):
            for _ in flush_bstream():
                pass
        input_resampler = getattr(self, "_input_resampler", None)
        if input_resampler is None:
            return
        reset_resampler = getattr(input_resampler, "reset", None)
        if callable(reset_resampler):
            reset_resampler()
            return
        flush_resampler = getattr(input_resampler, "flush", None)
        if callable(flush_resampler):
            for _ in flush_resampler():
                pass

    def push_audio(self, frame: rtc.AudioFrame) -> None:
        if self._tool_call_pending:
            self._clear_pending_audio_state()
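To illustrate why a leftover partial chunk matters, here is a small sketch with a toy fixed-size chunker. ToyByteStream is hypothetical and only mimics the general idea of a byte stream that emits fixed-size chunks; it is not livekit's AudioByteStream, and its push/clear methods are assumptions made for this example.

```python
class ToyByteStream:
    """Hypothetical fixed-size chunker; not livekit's AudioByteStream."""

    def __init__(self, chunk_size: int) -> None:
        self._chunk_size = chunk_size
        self._buf = bytearray()

    def push(self, data: bytes) -> list:
        # Append new bytes and emit every complete chunk; keep the remainder.
        self._buf.extend(data)
        chunks = []
        while len(self._buf) >= self._chunk_size:
            chunks.append(bytes(self._buf[: self._chunk_size]))
            del self._buf[: self._chunk_size]
        return chunks

    def clear(self) -> None:
        # Discard any buffered partial chunk.
        self._buf.clear()


# Without clearing: 2 stale bytes stay buffered across the pause.
stale = ToyByteStream(chunk_size=4)
stale.push(b"AAAAAA")
mixed = stale.push(b"BBBBBB")  # first chunk splices old and new audio

# With clearing when the pause begins: the next chunk is purely new audio.
clean = ToyByteStream(chunk_size=4)
clean.push(b"AAAAAA")
clean.clear()
fresh = clean.push(b"BBBBBB")
```

Here mixed starts with a chunk that joins pre-pause and post-pause bytes, which is the discontinuity the comment describes, while fresh contains only post-pause bytes.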
| # true while synchronous tool call is in flight for 3.1 only | ||
| # Audio frames dropped here to prevent server from thinking incoming audio is a | ||
| # new turn and cancelling the pending tool call | ||
| self._tool_call_pending = False |
_tool_call_pending is only cleared on tool response send and server tool-call cancellation. If the session is restarted/disconnected while a tool call is pending (e.g., send/recv task errors trigger _mark_restart_needed(on_error=True)), the flag can remain True across reconnects and permanently mute push_audio() for Gemini 3.1. Suggest resetting _tool_call_pending as part of session restart/close (e.g., in _close_active_session, _mark_restart_needed, or at the start of each connect loop).
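The reconnect hazard can be reproduced in a toy model. ReconnectingSession and its restart(reset_flag=...) method are hypothetical and exist only for this sketch; the actual plugin hooks named in the comment (_close_active_session, _mark_restart_needed) are not modelled here.

```python
class ReconnectingSession:
    """Hypothetical stand-in used to illustrate the reconnect hazard."""

    def __init__(self) -> None:
        self._tool_call_pending = False
        self.sent_frames = []

    def push_audio(self, frame: bytes) -> None:
        if self._tool_call_pending:
            return  # frame dropped while a tool call is pending
        self.sent_frames.append(frame)

    def on_tool_call(self) -> None:
        self._tool_call_pending = True

    def restart(self, *, reset_flag: bool) -> None:
        # A real restart would tear down and reopen the connection; this
        # sketch only models what happens to the flag.
        if reset_flag:
            self._tool_call_pending = False


# The tool response is never sent on the old connection, so nothing clears
# the flag and audio stays muted after the restart.
stuck = ReconnectingSession()
stuck.on_tool_call()
stuck.restart(reset_flag=False)
stuck.push_audio(b"hello")

# Resetting the flag as part of the restart restores audio flow.
recovered = ReconnectingSession()
recovered.on_tool_call()
recovered.restart(reset_flag=True)
recovered.push_audio(b"hello")
```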
| ) | ||
| ) | ||
| self._mark_current_generation_done() | ||
| if "3.1" in self._opts.model: |
The model gating if "3.1" in self._opts.model is imprecise and can accidentally match non-Live models or future model names (the file already enumerates known Live model names via KNOWN_GEMINI_API_MODELS / LiveAPIModels). Prefer an exact match (or a well-scoped prefix check like model.startswith("gemini-3.1-")) to keep the behavior tightly bound to Gemini 3.1 Live only.
| if "3.1" in self._opts.model: | |
| if self._opts.model.startswith("gemini-3.1-"): |
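The difference between the two checks can be seen on a few inputs. The model names below are made up purely to exercise the string logic and are not real model identifiers.

```python
def matches_substring(model: str) -> bool:
    # The check used in the PR.
    return "3.1" in model


def matches_prefix(model: str) -> bool:
    # The check proposed in the suggestion.
    return model.startswith("gemini-3.1-")


# Hypothetical names, chosen only to show where the checks disagree.
candidates = [
    "gemini-3.1-live-example",
    "gemini-2.5-example-v3.1",
    "models/gemini-3.1-live-example",
]

substring_hits = [m for m in candidates if matches_substring(m)]
prefix_hits = [m for m in candidates if matches_prefix(m)]
```

The substring check matches all three names, including the one where 3.1 is only a suffix. The prefix check matches only the first; it would also miss a name carrying a models/ path prefix, so if such names can reach this code, an exact match against the known model list may be the safer option.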
0dbb2e9 to 5ae5e69
5ae5e69 to 9325de1
| ) | ||
| ) | ||
| self._mark_current_generation_done() | ||
| if "3.1" in self._opts.model: |
First of all, I don't think dropping the audio while a tool call is in flight is the right solution. But why is it applied only to 3.1 here?
FYI, we support Gemini NON_BLOCKING tool calls via the tool_behavior option. Could you check that option instead of the model name, or even make this configurable when the tool behavior is blocking?
…he Gemini 3.1 live model

Gemini 3.1 forces synchronous tool calling, which means the model blocks until tool responses arrive. The plugin's _send_task kept forwarding microphone audio while tools executed, which caused the server to treat the incoming audio as a new turn and cancel the pending tool call after ~12s. This led to duplicate tool execution with already-resolved call_ids, as well as corrupted conversation state.

Adds a _tool_call_pending flag (for Gemini 3.1 only) that drops push_audio frames from the moment a toolCall is received until send_tool_response is flushed. Also clears the flag on tool_call_cancellation so the session never stalls. No behavior change for Gemini 2.5 models.
9325de1 to db77c65
| is_blocking = ( | ||
| not is_given(self._opts.tool_behavior) | ||
| or self._opts.tool_behavior == types.Behavior.BLOCKING | ||
| ) | ||
| if is_blocking: | ||
| self._tool_call_pending = True | ||
| self._bstream.clear() |
🔴 Audio silently dropped during blocking tool calls on all models, not just 3.1 as intended
The comment on line 498 says "for 3.1 only" and the commit message says "on the Gemini 3.1 live model", but the is_blocking check at lines 1306-1309 has no model guard — it evaluates to True for any model when tool_behavior is NOT_GIVEN (the default) or BLOCKING. This means push_audio silently drops all audio frames during blocking tool execution on the default 2.5 models (gemini-2.5-flash-native-audio-preview-12-2025) as well, where the underlying server issue (audio being interpreted as a new turn that cancels the pending tool call) may not exist. Users speaking during tool execution on 2.5 models will have their audio silently discarded.
Prompt for agents
The _handle_tool_calls method sets _tool_call_pending = True for all models with blocking tool behavior, but the comment and commit message state this should only apply to 3.1 models. The model name is available via self._opts.model. The fix should add a model check, e.g. checking if '3.1' is in self._opts.model (similar to how RealtimeModel.__init__ uses '3.1' in model to determine mutability at realtime_api.py:289). For example, the is_blocking check should also verify the model is a 3.1 model before setting _tool_call_pending = True and clearing the byte stream.
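One way to read this suggestion is as a single predicate that combines the blocking check with a model guard. The sketch below uses stand-ins so it runs on its own: Behavior mimics types.Behavior, NOT_GIVEN mimics the plugin's sentinel, and should_pause_audio and the 3.1 model name are hypothetical. Whether the guard should key on the model name at all is still an open question in this thread.

```python
from enum import Enum


class Behavior(Enum):
    """Stand-in for types.Behavior; defined here only for the sketch."""

    BLOCKING = "BLOCKING"
    NON_BLOCKING = "NON_BLOCKING"


NOT_GIVEN = object()  # stand-in for the plugin's NOT_GIVEN sentinel


def should_pause_audio(model: str, tool_behavior=NOT_GIVEN) -> bool:
    # Blocking is the default when tool_behavior is not given.
    is_blocking = tool_behavior is NOT_GIVEN or tool_behavior is Behavior.BLOCKING
    # Model guard as proposed in the review comment (substring check).
    return is_blocking and "3.1" in model


# "gemini-3.1-live-example" is a made-up name; the 2.5 name is quoted from
# the review comment.
pause_31_default = should_pause_audio("gemini-3.1-live-example")
pause_31_non_blocking = should_pause_audio(
    "gemini-3.1-live-example", Behavior.NON_BLOCKING
)
pause_25_default = should_pause_audio(
    "gemini-2.5-flash-native-audio-preview-12-2025"
)
```

With this predicate, audio is paused only for a 3.1 model with blocking (or default) tool behavior, and the default 2.5 model is left untouched.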