Skip to content

fix(atenet): retry transient upstream resets when routing to resumed actors#278

Open
Tommy McCormick (mccormickt) wants to merge 1 commit into
agent-substrate:mainfrom
mccormickt:push-omsqxovpxsus
Open

fix(atenet): retry transient upstream resets when routing to resumed actors#278
Tommy McCormick (mccormickt) wants to merge 1 commit into
agent-substrate:mainfrom
mccormickt:push-omsqxovpxsus

Conversation

@mccormickt

Copy link
Copy Markdown
Contributor

Adds a retry policy to the actor route so transient failures are retried until an actor is ready. This manifests commonly as 503 errors during the brief period between the request hitting envoy and the actor actually resuming and being able to serve requests. This change sets a circuit breaker to envoy's default of 20% of load, and a concurrency check of 20 as an arbitrary floor above the default of 3. Open to suggestions on these values from experience, or we can simply observe and iterate from here.

Reproduced with Claude on a kind cluster with the multi-template demo to inform the changes.

Fixes #218.

  • Tests pass
  • Appropriate changes to documentation are included in the PR

@bowei Bowei Du (bowei) self-assigned this Jun 18, 2026
…actors

When a request hits the router for a suspended actor, the ext_proc filter resumes the actor and rewrites :authority to the actor pod's IP:80, which the dynamic_forward_proxy cluster then connects to. In the brief window after resume returns but before the restored workload is accepting connections (or when a pooled connection to a just-suspended actor has gone stale), Envoy's upstream connection is reset before response headers. The actor route had no retry policy, so each such reset became an immediate 503 "upstream connect error or disconnect/reset before headers. reset reason: connection termination".

Add a retry policy to the actor route (retry_on "reset,connect-failure", 5 retries with 50ms-1s backoff) so these transient failures are retried once the listener is ready. A retry policy alone is not enough: every actor is routed through the single dynamic_forward_proxy cluster, whose retry circuit breaker defaults to only 3 concurrent retries cluster-wide, so a burst of concurrent requests to a just-resumed actor overflows it and the excess fails with 503 (UO) instead of retrying. Rather than inflate the static max_retries (which exists to cap retry amplification during an outage), configure a retry budget on the cluster: budget_percent 20% (Envoy's default) scales the allowed retries with load, with min_retry_concurrency 20 as a low-traffic floor above the default of 3. Other circuit breakers keep their defaults.

Reproduced on a kind cluster with the multi-template demo: concurrent requests to an actor immediately after suspend produced intermittent 503s with envoy response flags UC (connection termination) and UO (overflow). With the fix deployed (retry policy and retry budget verified live in the envoy config dump) the same protocol produced 0 failures across 1600+ requests.

Fixes agent-substrate#218.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Intermittent 503 connection termination errors when resuming suspended actors

2 participants