fix(atenet): retry transient upstream resets when routing to resumed actors by mccormickt · Pull Request #278 · agent-substrate/substrate

Tommy McCormick (mccormickt) · 2026-06-18T22:01:10Z

Adds a retry policy to the actor route so transient failures are retried until the actor is ready. These failures commonly show up as 503 errors during the brief window between the request hitting Envoy and the actor actually resuming and being able to serve requests. This change also configures a retry budget on the cluster's circuit breaker: budget_percent stays at Envoy's default of 20% of load, and min_retry_concurrency is set to 20, an arbitrary floor above the default of 3. Open to suggestions on these values from experience, or we can simply observe and iterate from here.
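To make the two values concrete, here is a minimal sketch of how a percentage budget and a concurrency floor combine into a cap on concurrent retries. It is an illustration of the arithmetic, not Envoy's code: the function name `retry_limit` is hypothetical, "active requests" is taken to mean active plus pending requests on the cluster, and rounding down is an assumption.

```python
import math


def retry_limit(active_requests: int,
                budget_percent: float = 20.0,
                min_retry_concurrency: int = 3) -> int:
    """Cap on concurrent retries: the larger of the floor and the load-scaled budget.

    Hypothetical helper. Defaults mirror the Envoy defaults quoted in this PR
    (20% and 3); rounding down is an assumption made for this sketch.
    """
    scaled = math.floor(active_requests * budget_percent / 100.0)
    return max(min_retry_concurrency, scaled)


if __name__ == "__main__":
    # Compare the default floor of 3 with the floor of 20 proposed here.
    for active in (5, 10, 50, 100, 200):
        print(active, retry_limit(active), retry_limit(active, 20.0, 20))
```

At low traffic the floor dominates (10 active requests allow 20 concurrent retries with this change, versus 3 by default); once load passes 100 active requests the 20% term takes over in both configurations.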

Reproduced with Claude on a kind cluster with the multi-template demo to inform the changes.

Fixes #218.

Tests pass
Appropriate changes to documentation are included in the PR

…actors

When a request hits the router for a suspended actor, the ext_proc filter resumes the actor and rewrites :authority to the actor pod's IP:80, which the dynamic_forward_proxy cluster then connects to. In the brief window after resume returns but before the restored workload is accepting connections (or when a pooled connection to a just-suspended actor has gone stale), Envoy's upstream connection is reset before response headers. The actor route had no retry policy, so each such reset became an immediate 503 "upstream connect error or disconnect/reset before headers. reset reason: connection termination".

Add a retry policy to the actor route (retry_on "reset,connect-failure", 5 retries with 50ms-1s backoff) so these transient failures are retried once the listener is ready.

A retry policy alone is not enough: every actor is routed through the single dynamic_forward_proxy cluster, whose retry circuit breaker defaults to only 3 concurrent retries cluster-wide, so a burst of concurrent requests to a just-resumed actor overflows it and the excess fails with 503 (UO) instead of retrying. Rather than inflate the static max_retries (which exists to cap retry amplification during an outage), configure a retry budget on the cluster: budget_percent 20% (Envoy's default) scales the allowed retries with load, with min_retry_concurrency 20 as a low-traffic floor above the default of 3. Other circuit breakers keep their defaults.

Reproduced on a kind cluster with the multi-template demo: concurrent requests to an actor immediately after suspend produced intermittent 503s with envoy response flags UC (connection termination) and UO (overflow). With the fix deployed (retry policy and retry budget verified live in the envoy config dump) the same protocol produced 0 failures across 1600+ requests.

Fixes agent-substrate#218.
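As a rough guide to how long a request might wait under the retry policy above, the sketch below computes the back-off window for each of the 5 retries. It assumes "50ms-1s backoff" maps to a 50 ms base interval and a 1 s maximum interval, and it follows Envoy's documented fully jittered exponential back-off, where retry N waits a random time below (2^N - 1) times the base interval, capped at the maximum. The function names are hypothetical.

```python
import random


def backoff_upper_bound_ms(retry_number: int,
                           base_ms: int = 50,
                           max_ms: int = 1000) -> int:
    """Upper bound of the jittered back-off window for the Nth retry (1-based).

    Assumes base_ms and max_ms correspond to the 50ms-1s range in the
    commit message; both are illustrative defaults.
    """
    return min((2 ** retry_number - 1) * base_ms, max_ms)


def sample_backoff_ms(retry_number: int,
                      base_ms: int = 50,
                      max_ms: int = 1000) -> float:
    """Draw one back-off delay uniformly from [0, upper bound]."""
    return random.uniform(0, backoff_upper_bound_ms(retry_number, base_ms, max_ms))


if __name__ == "__main__":
    bounds = [backoff_upper_bound_ms(n) for n in range(1, 6)]
    print(bounds)       # per-retry upper bounds in milliseconds
    print(sum(bounds))  # worst-case total back-off across all 5 retries
```

Under these assumptions the per-retry bounds are 50, 150, 350, 750, and 1000 ms, so the worst-case added latency from back-off alone is about 2.3 s, and the typical case is much lower because each delay is drawn uniformly from its window.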

Bowei Du (bowei) self-assigned this Jun 18, 2026

Tommy McCormick (mccormickt) force-pushed the push-omsqxovpxsus branch from 245b264 to de17521 on June 19, 2026 17:45

Tommy McCormick (mccormickt) wants to merge 1 commit into agent-substrate:main from mccormickt:push-omsqxovpxsus
