fix(atenet): retry transient upstream resets when routing to resumed actors#278
Open
Tommy McCormick (mccormickt) wants to merge 1 commit into
Open
fix(atenet): retry transient upstream resets when routing to resumed actors#278Tommy McCormick (mccormickt) wants to merge 1 commit into
Tommy McCormick (mccormickt) wants to merge 1 commit into
Conversation
…actors When a request hits the router for a suspended actor, the ext_proc filter resumes the actor and rewrites :authority to the actor pod's IP:80, which the dynamic_forward_proxy cluster then connects to. In the brief window after resume returns but before the restored workload is accepting connections (or when a pooled connection to a just-suspended actor has gone stale), Envoy's upstream connection is reset before response headers. The actor route had no retry policy, so each such reset became an immediate 503 "upstream connect error or disconnect/reset before headers. reset reason: connection termination". Add a retry policy to the actor route (retry_on "reset,connect-failure", 5 retries with 50ms-1s backoff) so these transient failures are retried once the listener is ready. A retry policy alone is not enough: every actor is routed through the single dynamic_forward_proxy cluster, whose retry circuit breaker defaults to only 3 concurrent retries cluster-wide, so a burst of concurrent requests to a just-resumed actor overflows it and the excess fails with 503 (UO) instead of retrying. Rather than inflate the static max_retries (which exists to cap retry amplification during an outage), configure a retry budget on the cluster: budget_percent 20% (Envoy's default) scales the allowed retries with load, with min_retry_concurrency 20 as a low-traffic floor above the default of 3. Other circuit breakers keep their defaults. Reproduced on a kind cluster with the multi-template demo: concurrent requests to an actor immediately after suspend produced intermittent 503s with envoy response flags UC (connection termination) and UO (overflow). With the fix deployed (retry policy and retry budget verified live in the envoy config dump) the same protocol produced 0 failures across 1600+ requests. Fixes agent-substrate#218.
245b264 to
de17521
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Adds a retry policy to the actor route so transient failures are retried until an actor is ready. This manifests commonly as 503 errors during the brief period between the request hitting envoy and the actor actually resuming and being able to serve requests. This change sets a circuit breaker to envoy's default of 20% of load, and a concurrency check of 20 as an arbitrary floor above the default of 3. Open to suggestions on these values from experience, or we can simply observe and iterate from here.
Reproduced with Claude on a kind cluster with the multi-template demo to inform the changes.
Fixes #218.