bug(live): zombie WebSocket session after LiveRequestQueue.close() — periodic 'operation cancelled' errors in Cloud Audit Logs

## Describe the bug

`run_live()` contains a `while True:` reconnection loop intended for session resumption. However, this loop has no way to distinguish between:

- An **intentional** client-side shutdown via `LiveRequestQueue.close()`
- An **unintentional** network drop that should trigger reconnection

As a result, calling `LiveRequestQueue.close()` does not actually terminate the live session. After the application code believes the session has ended, `run_live()` silently re-establishes a new WebSocket connection (a "zombie" session) without any notification to the caller.

## Steps to reproduce

1. Start a live session with `agent_runner.run_live()` and consume events via `async for event in live_events:`
2. Call `live_request_queue.close()` to signal session end
3. Wait — the underlying WebSocket connection will be re-established automatically by the reconnect loop
4. Observe in Cloud Audit Logs (or server-side logs): periodic `"The operation was cancelled."` errors at ~10-minute intervals indefinitely

A minimal reproduction is possible with the official `bidi-demo` sample:
https://github.com/google/adk-samples/tree/main/python/agents/bidi-demo/app

After sending a single message and leaving the session idle, the zombie session and its periodic cancellations persist indefinitely.

## Expected behavior

Calling `LiveRequestQueue.close()` should fully terminate the live session. `run_live()` should exit cleanly without reconnecting.

## Observed behavior

**Scenario A — session resumption handle present:** `run_live()` catches `APIError(1000)` (normal WebSocket close), finds a session handle, and calls `continue` — reconnecting despite the intentional close.

**Scenario B — no session resumption handle:** `run_live()` catches `APIError(1000)`, finds no handle, logs a spurious `ERROR: APIError in live flow: 1000 None.`, and raises — treating a clean close as an error.

In both cases, a zombie connection is either kept alive or repeatedly re-established. The Gemini Live server cancels idle connections after ~10 minutes, which surfaces as:

```
ERROR: "The operation was cancelled." (gRPC code 1)
```

in Cloud Audit Logs — repeated indefinitely at ~10-minute intervals, even long after the application believes the session has ended.

The Google auth token refresh cycle visible in debug logs indicates that the zombie connection remains active:

```
[DEBUG] google.auth.transport.requests: Making request...    # every ~10 min
[DEBUG] google.auth.transport.requests: Response received...
```

**No application-level logs appear** — the zombie reconnect is completely transparent to user code.

## Environment

- **google-adk version:** 1.22.1 (also reproduced on latest)
- **google-genai version:** 1.59.0+
- **Python version:** 3.12
- **OS:** Linux
- **Model:** `gemini-live-2.5-flash` (Vertex AI)
- **Method:** `google.cloud.aiplatform.v1beta1.LlmBidiService.BidiGenerateContent`

## Regression

This affects all versions of `google-adk` that include the `while True:` reconnection loop in `run_live()` (introduced with session resumption support). PR #5007 did not address this case: it fixed the opposite problem, where the session resumption loop never iterated.

## Logs

**Cloud Audit Log (repeated every ~10 minutes after session is believed closed):**
```json
{
  "protoPayload": {
    "status": { "code": 1, "message": "The operation was cancelled." },
    "methodName": "google.cloud.aiplatform.v1beta1.LlmBidiService.BidiGenerateContent"
  },
  "severity": "ERROR"
}
```
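To check how often the error recurs, entries of this shape can be counted in an exported log. The following is a minimal sketch that assumes the entries were exported as newline-delimited JSON; the function names are hypothetical, and only the fields shown in the entry above are inspected.

```python
import json
from typing import Iterable

# Method name taken from the audit log entry above.
BIDI_METHOD = (
    "google.cloud.aiplatform.v1beta1.LlmBidiService.BidiGenerateContent"
)


def is_bidi_cancellation(entry: dict) -> bool:
    """Return True for an ERROR entry with status code 1 on the bidi method."""
    payload = entry.get("protoPayload", {})
    status = payload.get("status", {})
    return (
        entry.get("severity") == "ERROR"
        and status.get("code") == 1
        and payload.get("methodName") == BIDI_METHOD
    )


def count_cancellations(lines: Iterable[str]) -> int:
    """Count matching entries in newline-delimited JSON (assumed format)."""
    return sum(
        is_bidi_cancellation(json.loads(line))
        for line in lines
        if line.strip()  # skip blank lines between entries
    )
```

A count that keeps growing after the application has called `close()` is consistent with the behavior described in this issue.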

**Application debug logs (every ~10 minutes — auth refresh for zombie connection):**
```
[DEBUG] google.auth.transport.requests: Making request...
[DEBUG] google.auth.transport.requests: Response received...
```


## Root cause

In `base_llm_flow.py`, `run_live()`'s exception handlers cannot tell whether `APIError(1000)` / `ConnectionClosed` originated from:
- `LiveRequestQueue.close()` calling `llm_connection.close()` (intentional)
- A server-side or network-triggered close (unintentional)

```python
except errors.APIError as e:
    if e.code in [1000, 1006]:
        if invocation_context.live_session_resumption_handle:
            continue  # reconnects even after intentional close!
    logger.error('APIError in live flow: %s', e)  # spurious error if no handle
    raise
```
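The effect of this handler can be modelled without the ADK. The sketch below is a simplified stand-in, not the library code: `FakeAPIError`, `FakeContext`, `fake_connection`, and `run_live_model` are hypothetical names, and `max_attempts` exists only so the demo terminates (the real loop is `while True:`).

```python
import asyncio
from dataclasses import dataclass
from typing import Optional


class FakeAPIError(Exception):
    """Hypothetical stand-in for errors.APIError; carries only a close code."""

    def __init__(self, code: int) -> None:
        super().__init__(f"{code} None")
        self.code = code


@dataclass
class FakeContext:
    """Hypothetical stand-in for the invocation context."""

    live_session_resumption_handle: Optional[str] = None


async def fake_connection() -> None:
    """Model a connection that ends with a normal close (code 1000)."""
    raise FakeAPIError(1000)


async def run_live_model(ctx: FakeContext, max_attempts: int = 3) -> int:
    """Apply the handler logic quoted above; return connections opened.

    max_attempts only bounds the demo; the real loop is `while True:`.
    """
    attempts = 0
    while attempts < max_attempts:
        attempts += 1
        try:
            await fake_connection()
        except FakeAPIError as e:
            if e.code in [1000, 1006]:
                if ctx.live_session_resumption_handle:
                    continue  # Scenario A: reconnect regardless of who closed
            raise  # Scenario B: a clean close is re-raised as an error
    return attempts


if __name__ == "__main__":
    # Scenario A: the handle is set, so every close triggers a reconnect.
    print(asyncio.run(run_live_model(FakeContext("handle-123"))))
    # Scenario B: no handle, so the first close propagates as an exception.
    try:
        asyncio.run(run_live_model(FakeContext()))
    except FakeAPIError as e:
        print("raised", e.code)
```

With a handle present the model opens a new connection on every iteration until the demo bound is reached; without one, the first normal close surfaces as an exception. Neither branch consults whether the client asked for the close.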

## Proposed fix

PR #5226 addresses this by adding an `is_closed` flag to `LiveRequestQueue` that is set synchronously in `close()`. `run_live()`'s exception handlers check this flag before attempting to reconnect:

```python
if e.code == 1000 and invocation_context.live_request_queue.is_closed:
    logger.info('Live session for agent %s closed by client request.', ...)
    return  # clean exit, no reconnect
```
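The idea can be illustrated with a self-contained model. This is a sketch under stated assumptions, not the code in PR #5226: `ClosableQueue` is a hypothetical stand-in for `LiveRequestQueue` with message queueing omitted, `FakeAPIError` stands in for `errors.APIError`, and `max_attempts` only bounds the demo loop.

```python
import asyncio


class FakeAPIError(Exception):
    """Hypothetical stand-in for errors.APIError; carries only a close code."""

    def __init__(self, code: int) -> None:
        super().__init__(f"{code} None")
        self.code = code


class ClosableQueue:
    """Hypothetical stand-in for LiveRequestQueue; message queueing omitted."""

    def __init__(self) -> None:
        self.is_closed = False

    def close(self) -> None:
        # Set synchronously so the flag is already visible when the
        # resulting close error reaches the exception handler.
        self.is_closed = True


async def run_live_model(
    queue: ClosableQueue, has_handle: bool, max_attempts: int = 3
) -> int:
    """Return how many connections were opened before the loop ended."""
    attempts = 0
    while attempts < max_attempts:
        attempts += 1
        try:
            # Every modelled connection ends with a normal close (1000).
            raise FakeAPIError(1000)
        except FakeAPIError as e:
            if e.code == 1000 and queue.is_closed:
                return attempts  # client asked to close: exit, no reconnect
            if e.code in (1000, 1006) and has_handle:
                continue  # unintentional drop: resume the session
            raise
    return attempts


if __name__ == "__main__":
    closed = ClosableQueue()
    closed.close()
    # Closed by the client: one connection, then a clean return.
    print(asyncio.run(run_live_model(closed, has_handle=True)))
    # Still open: the resumption path keeps reconnecting.
    print(asyncio.run(run_live_model(ClosableQueue(), has_handle=True)))
```

Checking the flag before the resumption branch is what lets an intentional close take priority even when a session resumption handle is present.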

## Additional context

- Google Cloud Support confirmed: *"simply pushing a 'close' message or sentinel to the LiveRequestQueue is not sufficient to fully terminate the underlying bidirectional streaming connection"*
- Discussed in [GitHub Discussion #4156](https://github.com/google/adk-python/discussions/4156)

