RFC: Add on_agent_error_callback and on_run_error_callback to BasePlugin #5044

## RFC: Agent-Level and Invocation-Level Error Callbacks

**Relates to**: #4863
**Implementation**: #5045

## Summary

This RFC adds two new notification-only plugin callbacks —
`on_agent_error_callback` and `on_run_error_callback` — to complete the
ADK plugin lifecycle's error coverage. While `BigQueryAgentAnalyticsPlugin`
is the most visible motivating example, this RFC improves the plugin
lifecycle contract for **any** ADK plugin that needs observability, cleanup,
or failure notification beyond model/tool scope.

## Problem Statement

The ADK plugin lifecycle has a **gap** in error callback coverage:

| Level       | before | after | error |
|-------------|--------|-------|-------|
| **Run**     | ✅      | ✅     | ❌     |
| **Agent**   | ✅      | ✅     | ❌     |
| **Model**   | ✅      | ✅     | ✅     |
| **Tool**    | ✅      | ✅     | ✅     |

When an unhandled exception escapes `_run_async_impl()` or `execute_fn()`,
the corresponding `after_*` callback is **never invoked**, leaving plugins
with dangling `*_STARTING` events and no error signal.

Model/tool error callbacks are not sufficient because many failures occur in
agent orchestration or runner execution outside any single model or tool step.

### Motivating Example: `BigQueryAgentAnalyticsPlugin`

- `AGENT_STARTING` events have no matching terminal event → appear as
  phantom successes in dashboards
- `INVOCATION_STARTING` events have no matching terminal event → crashed
  invocations are invisible to analytics
- Latency calculations are artificially skewed (crashed calls excluded)
- No Python exception traceback is captured anywhere in BigQuery

## Concrete Scenarios

The need for both `on_agent_error_callback` and `on_run_error_callback`
comes from the fact that failures can occur at different lifecycle layers,
and plugins often need different behavior at each layer. This is not
"more callbacks for symmetry" — it is about capturing different failure
scopes with different cleanup and analytics responsibilities.

### Scenario 1: Agent crashes after `AGENT_STARTING`

A plugin like `BigQueryAgentAnalyticsPlugin` emits `AGENT_STARTING` in
`before_agent_callback` (`bigquery_agent_analytics_plugin.py:2935`).

Example:
- An agent starts execution
- `_run_async_impl()` raises `RuntimeError("planner crashed")`
- `after_agent_callback` never runs (it is success-only, so the exception skips it)
- Without `on_agent_error_callback`, the plugin has no place to emit a
  terminal `AGENT_ERROR` row or close the agent span

**Why model/tool callbacks are insufficient:**
This failure may happen outside any model call or tool call —
no `on_model_error_callback` or `on_tool_error_callback` will fire.

**Why agent-level callback is needed:**
It is the only lifecycle point that can reliably convert an agent-local
uncaught exception into a terminal agent error signal.

### Scenario 2: Agent error bubbles up and fails the whole invocation

The same uncaught agent exception often also terminates the entire runner
invocation.

Example:
- Root agent crashes before producing a final response
- The runner aborts the invocation
- `after_run_callback` never runs (it is success-only, so the exception skips it)
- Without `on_run_error_callback`, the plugin has no place to emit
  `INVOCATION_ERROR`, flush final logs, or clear invocation-scoped state

**Why agent-level callback alone is insufficient:**
`on_agent_error_callback` only knows an agent failed. It does not
represent the terminal state of the invocation as a whole.
Invocation-level cleanup and analytics are runner concerns.

**Why run-level callback is needed:**
It is the only lifecycle point that can reliably capture invocation
failure, invocation-scoped cleanup, and final analytics/flush behavior
when execution aborts before `after_run_callback`.

### Scenario 3: Failure outside model/tool execution entirely

Some crashes happen in orchestration logic, not in model/tool execution.

Examples:
- Agent loop/control-flow bug
- Event transformation bug
- Plugin interaction bug in agent execution path
- Runner orchestration failure while streaming events

These failures do not trigger `on_model_error_callback`
(`base_llm_flow.py:379-391`) or `on_tool_error_callback`
(`llm_flows/functions.py:487, 530`) — those are scoped to their
specific try/except blocks around model and tool calls respectively.

Without agent/run error callbacks, these failures become invisible
to plugins.

### Scenario 4: BigQuery analytics needs two terminal rows, not one

For analytics, agent and invocation are different entities.

Example:
- `before_run_callback` emits `INVOCATION_STARTING` (line 2874)
- `before_agent_callback` emits `AGENT_STARTING` (line 2935)
- Root agent crashes

Correct terminal state is:
- One `AGENT_ERROR` (emitted by `on_agent_error_callback`, line 3327)
- One `INVOCATION_ERROR` (emitted by `on_run_error_callback`, line 3376)

If only agent-level error existed:
- Invocation-level dashboards would still show dangling
  `INVOCATION_STARTING` with no terminal event

If only run-level error existed:
- Agent-level latency/failure analytics would lose the agent failure row

This is why the RFC requires exactly-once-per-layer semantics rather than
choosing only one of the two levels.

## Goals

- Notify plugins on uncaught agent execution failures
- Notify plugins on uncaught runner invocation failures
- Preserve the original application exception as the primary error
- Keep `after_*` callbacks as success-only (no semantic change)

## Non-Goals

- No recovery semantics — these are notification-only, unlike
  `on_model_error_callback` which can return a replacement response
- No change to existing `on_model_error_callback` or
  `on_tool_error_callback` behavior
- No change to `after_agent_callback` or `after_run_callback` meaning
- No framework-level traceback formatting or truncation policy —
  formatting is sink-specific

## Proposed API

```python
async def on_agent_error_callback(
    self,
    *,
    agent: BaseAgent,
    callback_context: CallbackContext,
    error: Exception,
) -> None:
    """Notification-only callback for uncaught agent execution errors."""

async def on_run_error_callback(
    self,
    *,
    invocation_context: InvocationContext,
    error: Exception,
) -> None:
    """Notification-only callback for uncaught runner execution errors."""
```

Note: `on_run_error_callback` does **not** take `callback_context` — it
receives `invocation_context` directly, which is sufficient. A
`CallbackContext` is constructed internally by plugin implementations
that need one (e.g., `BigQueryAgentAnalyticsPlugin`).
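
As a rough illustration of how a plugin might override the two hooks, the sketch below uses a stand-in `BasePlugin` rather than the real ADK class so that it runs on its own. `FailureRecorderPlugin`, its `rows` list, and the `None` placeholders passed for `agent`, `callback_context`, and `invocation_context` are hypothetical; only the two method names and their keyword-only parameters come from this RFC.

```python
import asyncio
from typing import Any


class BasePlugin:
    """Stand-in for ADK's BasePlugin (assumption): both hooks default to no-ops."""

    def __init__(self, name: str) -> None:
        self.name = name

    async def on_agent_error_callback(
        self, *, agent: Any, callback_context: Any, error: Exception
    ) -> None:
        return None

    async def on_run_error_callback(
        self, *, invocation_context: Any, error: Exception
    ) -> None:
        return None


class FailureRecorderPlugin(BasePlugin):
    """Hypothetical plugin that records one terminal row per failed layer."""

    def __init__(self) -> None:
        super().__init__(name="failure_recorder")
        self.rows: list[tuple[str, str, str]] = []

    async def on_agent_error_callback(
        self, *, agent: Any, callback_context: Any, error: Exception
    ) -> None:
        self.rows.append(("AGENT_ERROR", type(error).__name__, str(error)))

    async def on_run_error_callback(
        self, *, invocation_context: Any, error: Exception
    ) -> None:
        self.rows.append(("INVOCATION_ERROR", type(error).__name__, str(error)))


async def demo() -> list[tuple[str, str, str]]:
    plugin = FailureRecorderPlugin()
    err = RuntimeError("planner crashed")
    # The framework would supply real context objects; None is a placeholder.
    await plugin.on_agent_error_callback(agent=None, callback_context=None, error=err)
    await plugin.on_run_error_callback(invocation_context=None, error=err)
    return plugin.rows


rows = asyncio.run(demo())
```

Both callbacks return `None`, which matches the notification-only contract: the plugin observes the failure but has no way to replace or suppress it.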

## Callback Semantics

### Notification-only, not recovery

Unlike `on_model_error_callback` (which can return a replacement
`LlmResponse` to suppress the exception), these new callbacks are
**notification-only**:

- The exception is **always re-raised** after all plugins are notified
- Plugins **cannot** return a replacement value to suppress the error
- Rationale: there is no meaningful "replacement value" for a crashed
  agent execution or invocation. Recovery at these levels requires
  application-specific retry logic, not a plugin hook.

### Best-effort notification dispatch

These callbacks use a **dedicated dispatch path** in `PluginManager`
(`_run_notification_callbacks`), not the normal early-exit callback path
(`_run_callbacks`).

Requirements:
- **All registered plugins are always invoked** — one plugin's failure
  does not prevent later plugins from being notified
- **Return values are ignored** — a non-`None` return does not
  short-circuit iteration
- **Plugin callback failures are logged and do not propagate** — if a
  plugin's error callback raises, the failure is logged via
  `logger.error(..., exc_info=True)` and iteration continues
- **The original application exception remains the primary exception**
  seen by the caller

```python
async def _run_notification_callbacks(
    self, callback_name: PluginCallbackName, **kwargs: Any
) -> None:
    """Best-effort notification dispatch.

    Always iterates all plugins.
    Ignores return values.
    Logs plugin callback failures and continues.
    Does not replace the original application exception.
    """
    for plugin in self.plugins:
        callback_method = getattr(plugin, callback_name)
        try:
            await callback_method(**kwargs)
        except Exception as e:
            logger.error(
                "Error in plugin '%s' during '%s' callback: %s",
                plugin.name, callback_name, e,
                exc_info=True,
            )
```

Rationale: error callbacks are **observability hooks**, not recovery
hooks. A failing plugin must not prevent other plugins from seeing the
failure, and must not mask the underlying agent/invocation crash.
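
To see these requirements in action, the following sketch exercises the same loop against three stand-in plugins. `PluginManagerSketch`, `StubPlugin`, and the `notified` list are hypothetical and exist only for the demonstration; the loop body mirrors `_run_notification_callbacks` above.

```python
import asyncio
import logging
from typing import Any

logger = logging.getLogger("plugin_manager_sketch")
notified: list[str] = []


class StubPlugin:
    """Hypothetical plugin whose error callback can misbehave on purpose."""

    def __init__(self, name: str, behavior: str = "ok") -> None:
        self.name = name
        self.behavior = behavior

    async def on_run_error_callback(self, **kwargs: Any) -> Any:
        notified.append(self.name)
        if self.behavior == "raise":
            raise ValueError("plugin bug")
        if self.behavior == "return":
            return "non-None value"  # must not short-circuit dispatch
        return None


class PluginManagerSketch:
    """Stand-in for PluginManager holding only the notification path."""

    def __init__(self, plugins: list[StubPlugin]) -> None:
        self.plugins = plugins

    async def _run_notification_callbacks(
        self, callback_name: str, **kwargs: Any
    ) -> None:
        for plugin in self.plugins:
            callback_method = getattr(plugin, callback_name)
            try:
                await callback_method(**kwargs)  # return value is discarded
            except Exception as e:
                logger.error(
                    "Error in plugin '%s' during '%s' callback: %s",
                    plugin.name, callback_name, e,
                    exc_info=True,
                )


manager = PluginManagerSketch(
    [StubPlugin("first", "return"), StubPlugin("second", "raise"), StubPlugin("third")]
)
asyncio.run(
    manager._run_notification_callbacks(
        "on_run_error_callback",
        invocation_context=None,  # placeholder for a real InvocationContext
        error=RuntimeError("agent crashed"),
    )
)
```

After the run, `notified` contains all three plugin names in registration order, even though the first returned a value and the second raised.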

### Original application exception remains primary

If plugin error callbacks themselves fail, those failures are logged but
do **not** replace the original uncaught exception from the agent or
runner. The caller always sees the original application exception.

Example: agent crashes with `RuntimeError("agent crashed")` and a
plugin's `on_agent_error_callback` raises `ValueError("plugin bug")`.
The caller sees `RuntimeError("agent crashed")`, not the plugin error.
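
The example can be checked with a small self-contained sketch. The function names here (`buggy_plugin_callback`, `notify_plugins`, `run_agent`, `caller`) are hypothetical; the shape of `run_agent` follows the catch, notify, re-raise pattern this RFC proposes.

```python
import asyncio
import logging

logger = logging.getLogger("error_callback_sketch")


async def buggy_plugin_callback(*, error: Exception) -> None:
    """Hypothetical plugin hook that itself fails."""
    raise ValueError("plugin bug")


async def notify_plugins(error: Exception) -> None:
    """Best-effort notification: log plugin failures, never propagate them."""
    try:
        await buggy_plugin_callback(error=error)
    except Exception:
        logger.error("plugin error callback failed", exc_info=True)


async def run_agent() -> None:
    try:
        raise RuntimeError("agent crashed")
    except Exception as e:
        await notify_plugins(e)
        raise  # bare raise re-raises the original RuntimeError


async def caller() -> str:
    try:
        await run_agent()
    except Exception as e:
        return f"{type(e).__name__}: {e}"
    return "no error"


result = asyncio.run(caller())
```

Here `result` is `"RuntimeError: agent crashed"`; the `ValueError` appears only in the log output.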

### `after_*` remains success-only

On main, `after_agent_callback` and `after_run_callback` are **semantically
success callbacks**. All built-in plugins treat them that way:

- `logging_plugin.py:130` — `after_run_callback` logs `"✅ INVOCATION COMPLETED"`
- `logging_plugin.py:155` — `after_agent_callback` logs `"🤖 AGENT COMPLETED"`
- `debug_logging_plugin.py:323` — `after_run_callback` finalizes debug output
- `debug_logging_plugin.py:387` — `after_agent_callback` logs `"agent_end"`
- `bigquery_agent_analytics_plugin.py:2871` — `after_run_callback` emits
  `"INVOCATION_COMPLETED"` with latency then clears state
- `bigquery_agent_analytics_plugin.py:2933` — `after_agent_callback` emits
  `"AGENT_COMPLETED"` with latency

If these were moved to `finally`, a crash would produce **both** an
`AGENT_ERROR` **and** an `AGENT_COMPLETED` event. That is semantically
wrong, and fixing it would require rewriting every plugin's after-callback
to check for error state first.

**This RFC keeps `after_*` as success-only.** If guaranteed cleanup is
needed in the future, that should be a separate unconditional cleanup hook,
not a redefinition of existing `after_*` semantics.
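
The double-emission problem is easy to reproduce in isolation. In this sketch the `events` list and `run_agent_with_finally` are hypothetical stand-ins for a plugin's event sink and for an agent run whose after-hook has been moved into `finally`, which is the alternative this RFC rejects.

```python
events: list[str] = []


def run_agent_with_finally() -> None:
    events.append("AGENT_STARTING")  # stands in for before_agent_callback
    try:
        raise RuntimeError("planner crashed")  # agent body fails
    except Exception:
        events.append("AGENT_ERROR")  # stands in for on_agent_error_callback
        raise
    finally:
        # Stands in for after_agent_callback if it were made unconditional.
        events.append("AGENT_COMPLETED")


try:
    run_agent_with_finally()
except RuntimeError:
    pass
```

The resulting `events` list holds `AGENT_STARTING`, `AGENT_ERROR`, and `AGENT_COMPLETED` for a single failed run, which is the contradictory pair of terminal events described above.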

### Catch `Exception`, not `BaseException`

The try/except blocks use `except Exception`, not `except BaseException`:

- `except BaseException` would also catch `KeyboardInterrupt`, `SystemExit`,
  and `asyncio.CancelledError` (a `BaseException` subclass since Python 3.8).
  These are cancellation and shutdown paths that should not be treated
  as ordinary execution failures
- The target for this RFC is unhandled application/runtime exceptions
- `Exception` is the correct boundary
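
The boundary can be demonstrated with plain `asyncio`, assuming Python 3.8 or later. In the sketch below, `guarded_work` and the `error_callbacks` list are hypothetical; the `except Exception` block plays the role of the proposed error-callback site.

```python
import asyncio

error_callbacks: list[str] = []


async def guarded_work() -> None:
    try:
        await asyncio.sleep(10)
    except Exception as e:
        # Stands in for the error-callback site; cancellation never lands here.
        error_callbacks.append(type(e).__name__)
        raise


async def main() -> bool:
    task = asyncio.create_task(guarded_work())
    await asyncio.sleep(0)  # give the task a chance to start and suspend
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        return True
    return False


was_cancelled = asyncio.run(main())
```

The task is cancelled and `error_callbacks` stays empty, because `asyncio.CancelledError` is not a subclass of `Exception`.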

### Exactly once per layer

Each error callback fires **exactly once at its own layer**, even when
both fire for the same uncaught failure:

- **Agent-level crash** → `on_agent_error_callback` fires once at the
  agent boundary, then re-raises
- **Runner-level boundary** → `on_run_error_callback` fires once at the
  invocation boundary (catches the re-raised exception)
- Both firing is expected and correct — they are different lifecycle
  layers with different cleanup responsibilities
- Neither callback fires more than once per failure at its own layer
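
The layering can be sketched with two nested async generators. All names here (`agent_impl`, `run_agent`, `run_invocation`, `fired`) are hypothetical stand-ins for `_run_async_impl()`, `run_async()`, `_exec_with_plugin()`, and the plugin callbacks; the real framework code is more involved.

```python
import asyncio
from typing import AsyncGenerator

fired: list[str] = []


async def agent_impl() -> AsyncGenerator[str, None]:
    yield "event-1"
    raise RuntimeError("planner crashed")


async def run_agent() -> AsyncGenerator[str, None]:
    try:
        async for event in agent_impl():
            yield event
    except Exception as e:
        fired.append(f"agent_error:{e}")  # agent boundary
        raise
    fired.append("after_agent")  # success-only


async def run_invocation() -> AsyncGenerator[str, None]:
    try:
        async for event in run_agent():
            yield event
    except Exception as e:
        fired.append(f"run_error:{e}")  # invocation boundary
        raise
    fired.append("after_run")  # success-only


async def main() -> None:
    try:
        async for _ in run_invocation():
            pass
    except RuntimeError:
        pass  # the caller still sees the original exception


asyncio.run(main())
```

One failure yields exactly one entry per layer in `fired`, in agent-then-run order, and neither `after_agent` nor `after_run` is recorded.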

## Current Behavior on Main

> The line numbers below describe the state of `main` **before** #5045.
> After the fix, these sites are wrapped in try/except blocks.

### Agent level — `base_agent.py`

**`run_async()` (line 274)**:
```python
async def run_async(self, ctx):
    # ... before_agent_callback (line ~288)
    async for event in self._run_async_impl(ctx):  # line 296
        yield event
    # _handle_after_agent_callback (line 303) — SKIPPED on exception
```

There is **no try/except** around `_run_async_impl()`. If it raises,
execution skips `_handle_after_agent_callback()` entirely.

**`run_live()` (line 307)**: Same pattern — no try/except around
`_run_live_impl()` (line 329).
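
The agent-level gap can be reproduced without ADK. The sketch below copies only the control-flow shape of the pre-#5045 `run_async()`; `failing_impl`, `run_async_like_main`, and the `calls` list are hypothetical stand-ins.

```python
import asyncio
from typing import AsyncGenerator

calls: list[str] = []


async def failing_impl() -> AsyncGenerator[str, None]:
    """Stand-in for _run_async_impl(): yields one event, then crashes."""
    yield "event-1"
    raise RuntimeError("planner crashed")


async def run_async_like_main() -> AsyncGenerator[str, None]:
    calls.append("before_agent")
    async for event in failing_impl():
        yield event
    # Reached only when the loop finishes without an exception.
    calls.append("after_agent")


async def consume() -> list[str]:
    seen: list[str] = []
    try:
        async for event in run_async_like_main():
            seen.append(event)
    except RuntimeError:
        seen.append("caller saw RuntimeError")
    return seen


seen = asyncio.run(consume())
```

After the run, `calls` contains only `before_agent`: a plugin that emitted a start event in its before-hook gets no further signal.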

### Runner level — `runners.py`

**`_exec_with_plugin()` (line 794)**:
```python
async def _exec_with_plugin(self, ...):
    # ... before_run_callback
    async for event in execute_fn(invocation_context):  # line 852
        yield event
    # after_run_callback (line 923) — NOT in finally block
```

There is **no try/except** around `execute_fn()`. The `after_run_callback`
at line 923 is in the normal flow, **not** in a `finally` block. If
`execute_fn()` raises, `after_run_callback` is skipped.

### Plugin infrastructure — `plugin_manager.py`

`PluginCallbackName` (lines 42-55) lists exactly **12 entries**. Neither
`on_agent_error_callback` nor `on_run_error_callback` is among them.

### Plugin base class — `base_plugin.py`

`BasePlugin` defines error callbacks only for model (line 272) and tool
(line 348). No agent or run error callbacks exist.

## Framework Changes

**`base_agent.py` — `run_async()` / `run_live()`**:
```python
try:
    async with Aclosing(self._run_async_impl(ctx)) as agen:
        async for event in agen:
            yield event
except Exception as e:
    await self._handle_agent_error_callback(ctx, e)
    raise
# _handle_after_agent_callback — remains here (success-only)
```

**`runners.py` — `_exec_with_plugin()`**:
```python
try:
    async with Aclosing(execute_fn(invocation_context)) as agen:
        async for event in agen:
            # ... event processing ...
            yield event
except Exception as e:
    await plugin_manager.run_on_run_error_callback(
        invocation_context=invocation_context,
        error=e,
    )
    raise
# after_run_callback — remains here (success-only)
```

### `PluginCallbackName` additions

```python
PluginCallbackName = Literal[
    # ... existing 12 entries ...
    "on_agent_error_callback",
    "on_run_error_callback",
]
```

## Plugin Integration Example: `BigQueryAgentAnalyticsPlugin`

This section shows how the first consumer uses the new callbacks. It is
not the sole justification for the RFC — any plugin benefits.

Add `AGENT_ERROR` and `INVOCATION_ERROR` event types:

- Emit terminal error rows with `status="ERROR"`
- Include `error_message` from `str(error)`
- Include `error_traceback` captured via `traceback.format_exception()`,
  truncated at the plugin layer using `config.max_content_length`
- Perform failure-path cleanup inside error callbacks (TraceManager
  span popping, context var reset, flush) because the success-only
  `after_*` callbacks are skipped on failure
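
A minimal sketch of the traceback capture and truncation step follows. `format_error_fields` is a hypothetical helper, not the plugin's actual code, and keeping the head of the traceback rather than the tail is an assumption; the RFC only states that truncation uses `config.max_content_length`.

```python
import traceback


def format_error_fields(error: Exception, max_content_length: int) -> dict[str, str]:
    """Build the error columns for a terminal row (hypothetical helper)."""
    tb_text = "".join(
        traceback.format_exception(type(error), error, error.__traceback__)
    )
    if len(tb_text) > max_content_length:
        # Assumption: keep the first max_content_length characters.
        tb_text = tb_text[:max_content_length]
    return {
        "status": "ERROR",
        "error_message": str(error),
        "error_traceback": tb_text,
    }


try:
    raise RuntimeError("planner crashed")
except RuntimeError as exc:
    fields = format_error_fields(exc, max_content_length=200)
```

Doing this inside the plugin keeps the framework contract simple: the callback receives the raw `Exception`, and each sink decides how much text it can store.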

Analytics views expose:
- `v_agent_error`: `error_traceback`, `total_ms`
- `v_invocation_error`: `error_traceback`

## Backward Compatibility

**Fully backward compatible** — additive only:
- New methods on `BasePlugin` with default no-op implementations
- New entries in `PluginCallbackName`
- New dispatch method `_run_notification_callbacks` (internal)
- `after_*` callbacks remain success-only (no semantic change)
- Existing plugins that don't override new methods are unaffected

## Testing Strategy

1. Verify `on_agent_error_callback` fires on uncaught agent failure
2. Verify `on_run_error_callback` fires on uncaught invocation failure
3. Verify `after_agent_callback` / `after_run_callback` remain success-only
   (NOT called on failure)
4. Verify `Exception` boundary: `asyncio.CancelledError` does not trigger
   error callbacks
5. Verify exactly-once-per-layer: agent crash produces 1× agent error
   callback + 1× run error callback
6. Verify notification dispatch does not short-circuit on non-`None` returns
7. Verify plugin callback failure does not prevent later plugins from being
   notified
8. Verify plugin callback failure does not mask the original application
   exception (end-to-end, both agent-level and runner-level)
9. Verify `BigQueryAgentAnalyticsPlugin` emits correct `AGENT_ERROR` /
   `INVOCATION_ERROR` rows with `error_message` and `error_traceback`
10. Verify BigQuery failure-path cleanup runs even if `_log_event()` fails
11. Verify error analytics views expose `error_traceback` column

## Resolved Questions

1. **Should `on_run_error_callback` receive the original user message?**
   No — `invocation_context` is sufficient. It already provides access to
   invocation/session/agent context. Duplicating parameters creates
   ambiguity, and for resumed invocations there may not even be a new
   user message.

2. **Should the framework capture `traceback.format_exception()`?**
   No — the framework passes the raw `Exception` object. Formatting is
   sink-specific. The BigQuery plugin captures traceback text using
   its own truncation logic.

3. **Should there be a framework-level size limit on error text?**
   No — size limits are enforced at the plugin/sink layer. Different sinks
   have different constraints.

