
bug(live): zombie WebSocket session after LiveRequestQueue.close() — periodic 'operation cancelled' errors in Cloud Audit Logs #5228

@TonyLee-AI

Describe the bug

run_live() contains a while True: reconnection loop intended for session resumption. However, this loop has no way to distinguish between:

  • An intentional client-side shutdown via LiveRequestQueue.close()
  • An unintentional network drop that should trigger reconnection

As a result, calling LiveRequestQueue.close() does not actually terminate the live session. After the application code believes the session has ended, run_live() silently re-establishes a new WebSocket connection (a "zombie" session) without any notification to the caller.

Steps to reproduce

  1. Start a live session with agent_runner.run_live() and consume events via async for event in live_events:
  2. Call live_request_queue.close() to signal session end
  3. Wait — the underlying WebSocket connection will be re-established automatically by the reconnect loop
  4. Observe in Cloud Audit Logs (or server-side logs): periodic "The operation was cancelled." errors at ~10-minute intervals indefinitely

A minimal reproduction is possible with the official bidi-demo sample:
https://github.com/google/adk-samples/tree/main/python/agents/bidi-demo/app

After sending a message and leaving idle, the zombie session and its periodic cancellations persist indefinitely.

Expected behavior

Calling LiveRequestQueue.close() should fully terminate the live session. run_live() should exit cleanly without reconnecting.

Observed behavior

Scenario A (session resumption handle present): run_live() catches APIError(1000) (normal WebSocket close), finds a session handle, and calls continue, reconnecting despite the intentional close.

Scenario B (no session resumption handle): run_live() catches APIError(1000), finds no handle, logs a spurious error ("ERROR: APIError in live flow: 1000 None."), and re-raises, treating a clean close as an error.

In both cases, a zombie connection is either kept alive or repeatedly re-established. The Gemini Live server cancels idle connections after ~10 minutes, which surfaces as:

ERROR: "The operation was cancelled." (gRPC code 1)

in Cloud Audit Logs — repeated indefinitely at ~10-minute intervals, even long after the application believes the session has ended.

The Google auth token refresh cycle visible in debug logs confirms the zombie connection remains active:

[DEBUG] google.auth.transport.requests: Making request...    # every ~10 min
[DEBUG] google.auth.transport.requests: Response received...

No application-level logs appear — the zombie reconnect is completely transparent to user code.

Environment

  • google-adk version: 1.22.1 (also reproduced on latest)
  • google-genai version: 1.59.0+
  • Python version: 3.12
  • OS: Linux
  • Model: gemini-live-2.5-flash (Vertex AI)
  • Method: google.cloud.aiplatform.v1beta1.LlmBidiService.BidiGenerateContent

Regression

This affects all versions of google-adk that include the while True: reconnection loop in run_live() (introduced with session resumption support). PR #5007 did not address this case; it fixed the opposite problem (the session resumption loop never iterating).

Logs

Cloud Audit Log (repeated every ~10 minutes after session is believed closed):

{
  "protoPayload": {
    "status": { "code": 1, "message": "The operation was cancelled." },
    "methodName": "google.cloud.aiplatform.v1beta1.LlmBidiService.BidiGenerateContent"
  },
  "severity": "ERROR"
}

Application debug logs (every ~10 minutes — auth refresh for zombie connection):

[DEBUG] google.auth.transport.requests: Making request...
[DEBUG] google.auth.transport.requests: Response received...

Root cause

In base_llm_flow.py, run_live()'s exception handlers cannot tell whether APIError(1000) / ConnectionClosed originated from:

  • LiveRequestQueue.close() calling llm_connection.close() (intentional)
  • A server-side or network-triggered close (unintentional)

except errors.APIError as e:
    if e.code in [1000, 1006]:
        if invocation_context.live_session_resumption_handle:
            continue  # reconnects even after intentional close!
    logger.error('APIError in live flow: %s', e)  # spurious error if no handle
    raise
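
The behavior of this handler can be modeled without the ADK. The sketch below is a simplified stand-in, not the actual base_llm_flow.py code: FakeAPIError, FakeConnection, run_live_model, and the max_connects bound are all hypothetical names introduced for illustration. It assumes every connection ends with a 1000 close, as it would after an intentional shutdown.

```python
import asyncio


class FakeAPIError(Exception):
    """Hypothetical stand-in for errors.APIError, carrying only a close code."""

    def __init__(self, code):
        super().__init__(f"APIError {code}")
        self.code = code


class FakeConnection:
    """Hypothetical connection whose receive loop always ends with a 1000 close."""

    async def receive(self):
        await asyncio.sleep(0)
        raise FakeAPIError(1000)


async def run_live_model(resumption_handle, max_connects=3):
    """Model of the reconnect loop; max_connects only bounds the demo."""
    connects = 0
    while True:
        if connects >= max_connects:
            return connects
        connects += 1
        try:
            await FakeConnection().receive()
        except FakeAPIError as e:
            if e.code in (1000, 1006) and resumption_handle:
                continue  # nothing here says the client asked for the close
            raise


# Scenario A: a handle is present, so every clean close triggers a reconnect.
print(asyncio.run(run_live_model("handle-123")))  # 3

# Scenario B: no handle, so the clean close surfaces as an error.
try:
    asyncio.run(run_live_model(None))
except FakeAPIError as e:
    print("raised", e.code)  # raised 1000
```

In the model, only the artificial max_connects bound stops the loop in Scenario A.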

Proposed fix

PR #5226 addresses this by adding an is_closed flag to LiveRequestQueue that is set synchronously in close(). run_live()'s exception handlers check this flag before attempting to reconnect:

if e.code == 1000 and invocation_context.live_request_queue.is_closed:
    logger.info('Live session for agent %s closed by client request.', ...)
    return  # clean exit, no reconnect
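
The effect of the flag can be illustrated with a small self-contained model. This is a sketch under stated assumptions, not the code from PR #5226: FakeLiveRequestQueue, FakeAPIError, run_live_model, close_on_connect, and max_connects are hypothetical names, and the model assumes a network drop surfaces as code 1006 while a client-requested close surfaces as code 1000.

```python
import asyncio


class FakeAPIError(Exception):
    """Hypothetical stand-in for errors.APIError, carrying only a close code."""

    def __init__(self, code):
        super().__init__(f"APIError {code}")
        self.code = code


class FakeLiveRequestQueue:
    """Hypothetical queue exposing the proposed is_closed flag."""

    def __init__(self):
        self.is_closed = False

    def close(self):
        # Set synchronously so the flag is visible before the close error arrives.
        self.is_closed = True


async def run_live_model(queue, resumption_handle, close_on_connect=2,
                         max_connects=5):
    """Model of the reconnect loop with the is_closed check added."""
    connects = 0
    while connects < max_connects:
        connects += 1
        try:
            if connects == close_on_connect:
                queue.close()  # the client ends the session on this connection
            await asyncio.sleep(0)
            # Assumed codes: 1000 for the client close, 1006 for a network drop.
            raise FakeAPIError(1000 if queue.is_closed else 1006)
        except FakeAPIError as e:
            if e.code == 1000 and queue.is_closed:
                return connects  # clean exit, no reconnect
            if e.code in (1000, 1006) and resumption_handle:
                continue  # unintentional drop: resume the session
            raise
    return connects


queue = FakeLiveRequestQueue()
print(asyncio.run(run_live_model(queue, "handle-123")))  # 2
```

The first connection drops with 1006 and is resumed; the second ends with a client-requested close, and the loop returns instead of reconnecting.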

Additional context

  • Google Cloud Support confirmed: "simply pushing a 'close' message or sentinel to the LiveRequestQueue is not sufficient to fully terminate the underlying bidirectional streaming connection"
  • Discussed in GitHub Discussion #4156

Labels: live ([Component] This issue is related to live, voice and video chat)
