RFC: Agent-Level and Invocation-Level Error Callbacks
Relates to: #4863
Implementation: #5045
Summary
This RFC adds two new notification-only plugin callbacks —
on_agent_error_callback and on_run_error_callback — to complete the
ADK plugin lifecycle's error coverage. While BigQueryAgentAnalyticsPlugin
is the most visible motivating example, this RFC improves the plugin
lifecycle contract for any ADK plugin that needs observability, cleanup,
or failure notification beyond model/tool scope.
Problem Statement
The ADK plugin lifecycle has a gap in error callback coverage:
| Level | before | after | error |
|-------|--------|-------|-------|
| Run   | ✅ | ✅ | ❌ |
| Agent | ✅ | ✅ | ❌ |
| Model | ✅ | ✅ | ✅ |
| Tool  | ✅ | ✅ | ✅ |
When an unhandled exception escapes _run_async_impl() or execute_fn(),
the corresponding after_* callback is never invoked, leaving plugins
with dangling *_STARTING events and no error signal.
Model/tool error callbacks are not sufficient because many failures occur in
agent orchestration or runner execution outside any single model or tool step.
Motivating Example: BigQueryAgentAnalyticsPlugin
- AGENT_STARTING events have no matching terminal event → appear as
phantom successes in dashboards
- INVOCATION_STARTING events have no matching terminal event → crashed
invocations are invisible to analytics
- Latency calculations are artificially skewed (crashed calls excluded)
- No Python exception traceback is captured anywhere in BigQuery
Concrete Scenarios
The need for both on_agent_error_callback and on_run_error_callback
comes from the fact that failures can occur at different lifecycle layers,
and plugins often need different behavior at each layer. This is not
"more callbacks for symmetry" — it is about capturing different failure
scopes with different cleanup and analytics responsibilities.
Scenario 1: Agent crashes after AGENT_STARTING
A plugin like BigQueryAgentAnalyticsPlugin emits AGENT_STARTING in
before_agent_callback (bigquery_agent_analytics_plugin.py:2935).
Example:
- An agent starts execution
- _run_async_impl() raises RuntimeError("planner crashed")
- after_agent_callback never runs (success-only, outside the try/except)
- Without on_agent_error_callback, the plugin has no place to emit a
terminal AGENT_ERROR row or close the agent span
Why model/tool callbacks are insufficient:
This failure may happen outside any model call or tool call —
no on_model_error_callback or on_tool_error_callback will fire.
Why agent-level callback is needed:
It is the only lifecycle point that can reliably convert an agent-local
uncaught exception into a terminal agent error signal.
Scenario 2: Agent error bubbles up and fails the whole invocation
The same uncaught agent exception often also terminates the entire runner
invocation.
Example:
- Root agent crashes before producing a final response
- The runner aborts the invocation
- after_run_callback never runs (success-only, outside the try/except)
- Without on_run_error_callback, the plugin has no place to emit
INVOCATION_ERROR, flush final logs, or clear invocation-scoped state
Why agent-level callback alone is insufficient:
on_agent_error_callback only knows an agent failed. It does not
represent the terminal state of the invocation as a whole.
Invocation-level cleanup and analytics are runner concerns.
Why run-level callback is needed:
It is the only lifecycle point that can reliably capture invocation
failure, invocation-scoped cleanup, and final analytics/flush behavior
when execution aborts before after_run_callback.
Scenario 3: Failure outside model/tool execution entirely
Some crashes happen in orchestration logic, not in model/tool execution.
Examples:
- Agent loop/control-flow bug
- Event transformation bug
- Plugin interaction bug in agent execution path
- Runner orchestration failure while streaming events
These failures do not trigger on_model_error_callback
(base_llm_flow.py:379-391) or on_tool_error_callback
(llm_flows/functions.py:487, 530) — those are scoped to their
specific try/except blocks around model and tool calls respectively.
Without agent/run error callbacks, these failures become invisible
to plugins.
Scenario 4: BigQuery analytics needs two terminal rows, not one
For analytics, agent and invocation are different entities.
Example:
- before_run_callback emits INVOCATION_STARTING (line 2874)
- before_agent_callback emits AGENT_STARTING (line 2935)
- Root agent crashes
Correct terminal state is:
- One AGENT_ERROR (emitted by on_agent_error_callback, line 3327)
- One INVOCATION_ERROR (emitted by on_run_error_callback, line 3376)
If only agent-level error existed:
- Invocation-level dashboards would still show dangling
INVOCATION_STARTING with no terminal event
If only run-level error existed:
- Agent-level latency/failure analytics would lose the agent failure row
This is why the RFC requires exactly-once-per-layer semantics rather than
choosing only one of the two levels.
Goals
- Notify plugins on uncaught agent execution failures
- Notify plugins on uncaught runner invocation failures
- Preserve the original application exception as the primary error
- Keep after_* callbacks as success-only (no semantic change)
Non-Goals
- No recovery semantics — these are notification-only, unlike
on_model_error_callback which can return a replacement response
- No change to existing on_model_error_callback or
on_tool_error_callback behavior
- No change to after_agent_callback or after_run_callback meaning
- No framework-level traceback formatting or truncation policy —
formatting is sink-specific
Proposed API
async def on_agent_error_callback(
    self,
    *,
    agent: BaseAgent,
    callback_context: CallbackContext,
    error: Exception,
) -> None:
    """Notification-only callback for uncaught agent execution errors."""

async def on_run_error_callback(
    self,
    *,
    invocation_context: InvocationContext,
    error: Exception,
) -> None:
    """Notification-only callback for uncaught runner execution errors."""
Note: on_run_error_callback does not take callback_context — it
receives invocation_context directly, which is sufficient. A
CallbackContext is constructed internally by plugin implementations
that need one (e.g., BigQueryAgentAnalyticsPlugin).
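As an illustration of how a plugin might implement the two hooks, here is a minimal sketch. It does not import ADK: FakeAgent, FakeCallbackContext, and FakeInvocationContext are stand-ins for the real types, and SpanTrackingPlugin with its open_agents and events fields is hypothetical. Only the keyword-only parameter names are taken from the signatures above.

```python
import asyncio
from dataclasses import dataclass, field


# Stand-ins for the ADK types named in the proposed signatures (assumption:
# only the attributes used below matter for this illustration).
@dataclass
class FakeAgent:
    name: str


@dataclass
class FakeCallbackContext:
    invocation_id: str


@dataclass
class FakeInvocationContext:
    invocation_id: str


@dataclass
class SpanTrackingPlugin:
    """Hypothetical plugin that records terminal error events."""

    open_agents: set = field(default_factory=set)
    events: list = field(default_factory=list)

    async def before_agent_callback(self, *, agent, callback_context):
        self.open_agents.add(agent.name)
        self.events.append(("AGENT_STARTING", agent.name))

    async def on_agent_error_callback(self, *, agent, callback_context, error):
        # Close the dangling span and record a terminal row; returns None.
        self.open_agents.discard(agent.name)
        self.events.append(("AGENT_ERROR", agent.name, repr(error)))

    async def on_run_error_callback(self, *, invocation_context, error):
        self.events.append(
            ("INVOCATION_ERROR", invocation_context.invocation_id, repr(error))
        )


async def demo() -> SpanTrackingPlugin:
    plugin = SpanTrackingPlugin()
    agent = FakeAgent("planner")
    ctx = FakeCallbackContext("inv-1")
    err = RuntimeError("planner crashed")
    await plugin.before_agent_callback(agent=agent, callback_context=ctx)
    await plugin.on_agent_error_callback(
        agent=agent, callback_context=ctx, error=err
    )
    await plugin.on_run_error_callback(
        invocation_context=FakeInvocationContext("inv-1"), error=err
    )
    return plugin
```

Running asyncio.run(demo()) leaves open_agents empty and records one AGENT_STARTING, one AGENT_ERROR, and one INVOCATION_ERROR entry, in that order.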
Callback Semantics
Notification-only, not recovery
Unlike on_model_error_callback (which can return a replacement
LlmResponse to suppress the exception), these new callbacks are
notification-only:
- The exception is always re-raised after all plugins are notified
- Plugins cannot return a replacement value to suppress the error
- Rationale: there is no meaningful "replacement value" for a crashed
agent execution or invocation. Recovery at these levels requires
application-specific retry logic, not a plugin hook.
Best-effort notification dispatch
These callbacks use a dedicated dispatch path in PluginManager
(_run_notification_callbacks), not the normal early-exit callback path
(_run_callbacks).
Requirements:
- All registered plugins are always invoked — one plugin's failure
does not prevent later plugins from being notified
- Return values are ignored: a non-None return does not
short-circuit iteration
- Plugin callback failures are logged and do not propagate — if a
plugin's error callback raises, the failure is logged via
logger.error(..., exc_info=True) and iteration continues
- The original application exception remains the primary exception
seen by the caller
async def _run_notification_callbacks(
    self, callback_name: PluginCallbackName, **kwargs: Any
) -> None:
    """Best-effort notification dispatch.

    Always iterates all plugins.
    Ignores return values.
    Logs plugin callback failures and continues.
    Does not replace the original application exception.
    """
    for plugin in self.plugins:
        callback_method = getattr(plugin, callback_name)
        try:
            await callback_method(**kwargs)
        except Exception as e:
            logger.error(
                "Error in plugin '%s' during '%s' callback: %s",
                plugin.name, callback_name, e,
                exc_info=True,
            )
Rationale: error callbacks are observability hooks, not recovery
hooks. A failing plugin must not prevent other plugins from seeing the
failure, and must not mask the underlying agent/invocation crash.
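The dispatch requirements above can be exercised in isolation. The sketch below copies the loop into a stand-in SketchPluginManager and drives it with two stub plugins; RecordingPlugin, SketchPluginManager, and the logger name are assumptions made for the example, not ADK classes.

```python
import asyncio
import logging

logger = logging.getLogger("plugin_manager_sketch")


class RecordingPlugin:
    """Stand-in plugin that records the errors it is notified about."""

    def __init__(self, name: str, fail: bool = False):
        self.name = name
        self.fail = fail
        self.seen = []

    async def on_agent_error_callback(self, **kwargs):
        self.seen.append(kwargs["error"])
        if self.fail:
            raise ValueError("plugin bug")
        return "ignored"  # a non-None return must not stop the iteration


class SketchPluginManager:
    """Stand-in for PluginManager containing only the notification path."""

    def __init__(self, plugins):
        self.plugins = plugins

    async def _run_notification_callbacks(self, callback_name: str, **kwargs) -> None:
        for plugin in self.plugins:
            callback_method = getattr(plugin, callback_name)
            try:
                await callback_method(**kwargs)
            except Exception as e:
                # Log and continue so later plugins are still notified.
                logger.error(
                    "Error in plugin '%s' during '%s' callback: %s",
                    plugin.name, callback_name, e,
                    exc_info=True,
                )


async def demo():
    first = RecordingPlugin("first", fail=True)
    second = RecordingPlugin("second")
    manager = SketchPluginManager([first, second])
    error = RuntimeError("agent crashed")
    await manager._run_notification_callbacks(
        "on_agent_error_callback", error=error
    )
    return first, second, error
```

Even though the first plugin raises, asyncio.run(demo()) returns with both plugins having seen the same RuntimeError, and the ValueError appears only in the log output.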
Original application exception remains primary
If plugin error callbacks themselves fail, those failures are logged but
do not replace the original uncaught exception from the agent or
runner. The caller always sees the original application exception.
Example: agent crashes with RuntimeError("agent crashed") and a
plugin's on_agent_error_callback raises ValueError("plugin bug").
The caller sees RuntimeError("agent crashed"), not the plugin error.
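The same example can be written as a small end-to-end sketch. The helper names (failing_plugin_callback, notify_best_effort, run_agent) are hypothetical, and plugin failures are collected in a list instead of being logged so that the outcome is easy to inspect.

```python
import asyncio


async def failing_plugin_callback(error: Exception) -> None:
    raise ValueError("plugin bug")


async def notify_best_effort(callbacks, error: Exception) -> list:
    """Runs every callback and collects, rather than raises, their failures."""
    failures = []
    for callback in callbacks:
        try:
            await callback(error)
        except Exception as plugin_error:
            failures.append(plugin_error)
    return failures


async def run_agent(callbacks, swallowed: list) -> None:
    try:
        raise RuntimeError("agent crashed")
    except Exception as e:
        swallowed.extend(await notify_best_effort(callbacks, e))
        raise  # bare raise re-raises the original RuntimeError


async def demo():
    swallowed = []
    try:
        await run_agent([failing_plugin_callback], swallowed)
    except Exception as seen_by_caller:
        return seen_by_caller, swallowed
```

Here the caller receives the RuntimeError, while the ValueError from the plugin ends up only in the swallowed list.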
after_* remains success-only
On main, after_agent_callback and after_run_callback are semantically
success callbacks. All built-in plugins treat them that way:
- logging_plugin.py:130: after_run_callback logs "✅ INVOCATION COMPLETED"
- logging_plugin.py:155: after_agent_callback logs "🤖 AGENT COMPLETED"
- debug_logging_plugin.py:323: after_run_callback finalizes debug output
- debug_logging_plugin.py:387: after_agent_callback logs "agent_end"
- bigquery_agent_analytics_plugin.py:2871: after_run_callback emits
"INVOCATION_COMPLETED" with latency, then clears state
- bigquery_agent_analytics_plugin.py:2933: after_agent_callback emits
"AGENT_COMPLETED" with latency
If these calls were moved into a finally block, a crash would produce both an
AGENT_ERROR and an AGENT_COMPLETED event. That is semantically wrong, and
avoiding it would require rewriting every plugin's after-callback to check
for error state first.
This RFC keeps after_* as success-only. If guaranteed cleanup is
needed in the future, that should be a separate unconditional cleanup hook,
not a redefinition of existing after_* semantics.
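The difference between the two shapes can be seen in a small sketch. The function names and the string events are illustrative only. run_with_finally is the rejected alternative, and run_success_only mirrors the structure this RFC keeps.

```python
import asyncio


async def crashing_impl():
    yield "event-1"
    raise RuntimeError("agent crashed")


async def run_with_finally(log: list):
    """Counter-example: the after-callback is moved into finally."""
    try:
        async for event in crashing_impl():
            yield event
    except Exception:
        log.append("AGENT_ERROR")
        raise
    finally:
        log.append("AGENT_COMPLETED")  # also runs on the failure path


async def run_success_only(log: list):
    """Shape kept by this RFC: the after-callback stays in the normal flow."""
    try:
        async for event in crashing_impl():
            yield event
    except Exception:
        log.append("AGENT_ERROR")
        raise
    log.append("AGENT_COMPLETED")  # skipped when the exception propagates


async def drain(runner) -> list:
    log = []
    try:
        async for _ in runner(log):
            pass
    except RuntimeError:
        pass
    return log
```

With the finally variant the log contains both AGENT_ERROR and AGENT_COMPLETED for a single crash. With the success-only variant it contains AGENT_ERROR alone.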
Catch Exception, not BaseException
The try/except blocks use except Exception, not except BaseException:
- except BaseException would also catch KeyboardInterrupt,
asyncio.CancelledError, and SystemExit, which are cancellation and shutdown
paths that should not be treated as ordinary execution failures
- The target for this RFC is unhandled application/runtime exceptions
- Exception is the correct boundary
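A short sketch of the cancellation case, assuming Python 3.8 or later, where asyncio.CancelledError derives from BaseException. The guarded_step and main names are made up for the example, and the notified list stands in for the error callback.

```python
import asyncio


async def guarded_step(notified: list) -> None:
    try:
        await asyncio.sleep(10)
    except Exception as e:
        # Stand-in for the error callback; not reached on cancellation.
        notified.append(e)
        raise


async def main():
    notified = []
    task = asyncio.create_task(guarded_step(notified))
    await asyncio.sleep(0)  # let the task start and suspend in sleep()
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        return "cancelled", notified
    return "finished", notified
```

Cancelling the task propagates CancelledError straight through the except Exception block, so notified stays empty and the caller still observes the cancellation.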
Exactly once per layer
Each error callback fires exactly once at its own layer, even when
both fire for the same uncaught failure:
- Agent-level crash → on_agent_error_callback fires once at the
agent boundary, then the framework re-raises
- Runner-level boundary → on_run_error_callback fires once at the
invocation boundary (it catches the re-raised exception)
- Both firing is expected and correct — they are different lifecycle
layers with different cleanup responsibilities
- Neither callback fires more than once per failure at its own layer
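The layering can be sketched with two nested async generators and a counter. The function names are placeholders for the agent and runner boundaries, and incrementing a counter key stands in for dispatching the corresponding callback.

```python
import asyncio
from collections import Counter


async def agent_impl():
    yield "event-1"
    raise RuntimeError("planner crashed")


async def run_agent(calls: Counter):
    """Agent boundary: notify once, then re-raise."""
    try:
        async for event in agent_impl():
            yield event
    except Exception:
        calls["on_agent_error_callback"] += 1
        raise
    calls["after_agent_callback"] += 1


async def run_invocation(calls: Counter):
    """Runner boundary: sees the re-raised exception and notifies once."""
    try:
        async for event in run_agent(calls):
            yield event
    except Exception:
        calls["on_run_error_callback"] += 1
        raise
    calls["after_run_callback"] += 1


async def demo() -> Counter:
    calls = Counter()
    try:
        async for _ in run_invocation(calls):
            pass
    except RuntimeError:
        pass
    return calls
```

One crash yields exactly one count for each error callback and none for the after_* callbacks.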
Current Behavior on Main
The line numbers below describe the state of main before #5045.
After the fix, these sites are wrapped in try/except blocks.
Agent level — base_agent.py
run_async() (line 274):
async def run_async(self, ctx):
    # ... before_agent_callback (line ~288)
    async for event in self._run_async_impl(ctx):  # line 296
        yield event
    # _handle_after_agent_callback (line 303): SKIPPED on exception
There is no try/except around _run_async_impl(). If it throws,
execution jumps past _handle_after_agent_callback() entirely.
run_live() (line 307): Same pattern — no try/except around
_run_live_impl() (line 329).
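The skipped callback is easy to reproduce outside ADK. The sketch below mirrors the pre-#5045 control flow with a stand-in SketchAgent class. The log entries are illustrative labels, not real callback invocations.

```python
import asyncio


class SketchAgent:
    """Stand-in mirroring the pre-#5045 shape of run_async()."""

    def __init__(self):
        self.log = []

    async def _run_async_impl(self):
        yield "partial event"
        raise RuntimeError("planner crashed")

    async def run_async(self):
        self.log.append("before_agent_callback")
        async for event in self._run_async_impl():
            yield event
        # Only reached when the loop finishes without raising.
        self.log.append("after_agent_callback")


async def demo() -> list:
    agent = SketchAgent()
    try:
        async for _ in agent.run_async():
            pass
    except RuntimeError:
        agent.log.append("caller saw RuntimeError")
    return agent.log
```

The resulting log has a before entry and the caller's exception but no after entry, which is the dangling *_STARTING situation described earlier.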
Runner level — runners.py
_exec_with_plugin() (line 794):
async def _exec_with_plugin(self, ...):
    # ... before_run_callback
    async for event in execute_fn(invocation_context):  # line 852
        yield event
    # after_run_callback (line 923): NOT in a finally block
There is no try/except around execute_fn(). The after_run_callback
at line 923 is in the normal flow, not in a finally block. If
execute_fn() throws, after_run_callback is skipped.
Plugin infrastructure — plugin_manager.py
PluginCallbackName (lines 42-55) lists exactly 12 entries. Neither
on_agent_error_callback nor on_run_error_callback is among them.
Plugin base class — base_plugin.py
BasePlugin defines error callbacks only for model (line 272) and tool
(line 348). No agent or run error callbacks exist.
Framework Changes
base_agent.py — run_async() / run_live():
try:
    async with Aclosing(self._run_async_impl(ctx)) as agen:
        async for event in agen:
            yield event
except Exception as e:
    await self._handle_agent_error_callback(ctx, e)
    raise
# _handle_after_agent_callback: remains here (success-only)
runners.py — _exec_with_plugin():
try:
    async with Aclosing(execute_fn(invocation_context)) as agen:
        async for event in agen:
            # ... event processing ...
            yield event
except Exception as e:
    await plugin_manager.run_on_run_error_callback(
        invocation_context=invocation_context,
        error=e,
    )
    raise
# after_run_callback: remains here (success-only)
PluginCallbackName additions
PluginCallbackName = Literal[
    # ... existing 12 entries ...
    "on_agent_error_callback",
    "on_run_error_callback",
]
Plugin Integration Example: BigQueryAgentAnalyticsPlugin
This section shows how the first consumer uses the new callbacks. It is
not the sole justification for the RFC — any plugin benefits.
Add AGENT_ERROR and INVOCATION_ERROR event types:
- Emit terminal error rows with status="ERROR"
- Include error_message from str(error)
- Include error_traceback captured via traceback.format_exception(),
truncated at the plugin layer using config.max_content_length
- Perform failure-path cleanup inside error callbacks (TraceManager
span popping, context var reset, flush) because the success-only
after_* callbacks are skipped on failure
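A plugin-side helper for the error fields might look like the sketch below. build_error_fields is a hypothetical name, the max_content_length argument stands in for config.max_content_length, and keeping the tail of the traceback when truncating is an assumption of this example, not a documented plugin behavior.

```python
import traceback


def build_error_fields(error: BaseException, max_content_length: int) -> dict:
    """Hypothetical helper: formats and truncates error details for one row."""
    text = "".join(
        traceback.format_exception(type(error), error, error.__traceback__)
    )
    if len(text) > max_content_length:
        # Assumption: keep the tail, since the exception type and message
        # are on the last line of a formatted traceback.
        text = text[-max_content_length:]
    return {
        "status": "ERROR",
        "error_message": str(error),
        "error_traceback": text,
    }


def capture() -> dict:
    try:
        raise RuntimeError("planner crashed")
    except RuntimeError as e:
        return build_error_fields(e, max_content_length=200)
```

Because the framework passes the raw exception object, each sink is free to choose a different limit or truncation direction.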
Analytics views expose:
- v_agent_error: error_traceback, total_ms
- v_invocation_error: error_traceback
Backward Compatibility
Fully backward compatible — additive only:
- New methods on BasePlugin with default no-op implementations
- New entries in PluginCallbackName
- New dispatch method _run_notification_callbacks (internal)
- after_* callbacks remain success-only (no semantic change)
- Existing plugins that don't override new methods are unaffected
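To make the compatibility claim concrete, the sketch below shows a plugin written without knowledge of the new hooks still working when the framework dispatches them by name. SketchBasePlugin and LegacyPlugin are stand-ins, and the real BasePlugin has more methods than the two shown.

```python
import asyncio


class SketchBasePlugin:
    """Stand-in for BasePlugin showing only the two new hooks."""

    def __init__(self, name: str):
        self.name = name

    async def on_agent_error_callback(
        self, *, agent, callback_context, error
    ) -> None:
        return None  # default: do nothing

    async def on_run_error_callback(self, *, invocation_context, error) -> None:
        return None  # default: do nothing


class LegacyPlugin(SketchBasePlugin):
    """Written before this RFC: overrides neither new hook."""

    def __init__(self):
        super().__init__("legacy")
        self.started = 0

    async def before_run_callback(self, *, invocation_context):
        self.started += 1


async def demo():
    plugin = LegacyPlugin()
    await plugin.before_run_callback(invocation_context=None)
    # Dispatch by name finds the inherited no-op, so no AttributeError.
    result = await getattr(plugin, "on_run_error_callback")(
        invocation_context=None, error=RuntimeError("boom")
    )
    return plugin.started, result
```

The legacy plugin's existing callback behaves as before, and the inherited hook returns None without side effects.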
Testing Strategy
- Verify on_agent_error_callback fires on uncaught agent failure
- Verify on_run_error_callback fires on uncaught invocation failure
- Verify after_agent_callback / after_run_callback remain success-only
(NOT called on failure)
- Verify Exception boundary: asyncio.CancelledError does not trigger
error callbacks
- Verify exactly-once-per-layer: agent crash produces 1× agent error
callback + 1× run error callback
- Verify notification dispatch does not short-circuit on non-None returns
- Verify plugin callback failure does not prevent later plugins from being
notified
- Verify plugin callback failure does not mask the original application
exception (end-to-end, both agent-level and runner-level)
- Verify BigQueryAgentAnalyticsPlugin emits correct AGENT_ERROR /
INVOCATION_ERROR rows with error_message and error_traceback
- Verify BigQuery failure-path cleanup runs even if _log_event() fails
- Verify error analytics views expose error_traceback column
Resolved Questions
- Should on_run_error_callback receive the original user message?
No. invocation_context is sufficient: it already provides access to
invocation/session/agent context. Duplicating parameters creates
ambiguity, and for resumed invocations there may not even be a new
user message.
- Should the framework capture traceback.format_exception()?
No. The framework passes the raw Exception object. Formatting is
sink-specific. The BigQuery plugin captures traceback text using
its own truncation logic.
- Should there be a framework-level size limit on error text?
No. Size limits are enforced at the plugin/sink layer. Different sinks
have different constraints.