Recover OAuth workloads from transient refresh failures#5350
Conversation
There was a problem hiding this comment.
Large PR Detected
This PR exceeds 1000 lines of changes and requires justification before it can be reviewed.
How to unblock this PR:
Add a section to your PR description with the following format:
## Large PR Justification
[Explain why this PR must be large, such as:]
- Generated code that cannot be split
- Large refactoring that must be atomic
- Multiple related changes that would break if separated
- Migration or data transformationAlternative:
Consider splitting this PR into smaller, focused changes (< 1000 lines each) for easier review and reduced risk.
See our Contributing Guidelines for more details.
This review will be automatically dismissed once you add the justification section.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #5350 +/- ##
==========================================
+ Coverage 68.87% 68.93% +0.06%
==========================================
Files 634 635 +1
Lines 64460 64632 +172
==========================================
+ Hits 44394 44555 +161
- Misses 16785 16794 +9
- Partials 3281 3283 +2 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
|
✅ Large PR justification has been provided. The size review has been dismissed and this PR can now proceed with normal review. |
Large PR justification has been provided. Thank you!
When an OAuth refresh fails as a transient error for longer than the in-loop short retry (~1-2 minutes at defaults), the monitor used to mark the workload unauthenticated and exit. The workload stayed dead even after the underlying condition cleared on its own (e.g. a brief VPN disconnect routing requests through a different network path). Now the monitor keeps retrying on a configurable longer cadence (default 10 minutes) until either success or a configurable ceiling (default 24 hours), at which point the workload is finally marked unauthenticated. Fixes stacklok#5349 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Greg Katz <gkatz@indeed.com>
178577a to
978fc6f
Compare
|
Gentle nudge — I'm still seeing this issue across multiple OAuth-backed MCP servers in our environment. Happy to address review feedback whenever a maintainer can take a look. Thanks! |
| - `unhealthy` - Workload is running but unhealthy | ||
| - `unauthenticated` - Remote workload cannot authenticate (expired tokens) | ||
| - `auth_retrying` - Remote workload's token refresh is failing transiently; monitor is still retrying until success (→ `running`) or the configured ceiling (→ `unauthenticated`) | ||
| - `unknown` - Workload status cannot be determined |
| mts.exitAuthRetrying() | ||
| if tok == nil || tok.Expiry.IsZero() { | ||
| return true, 0 | ||
| } |
| return true, 0 | ||
| } | ||
| return false, waitUntilExpiry(tok.Expiry) | ||
| } |
| // Transient network error — funnel all concurrent callers through a | ||
| // single retry loop so we don't hammer the token endpoint. | ||
| tok, err = mts.refresher.Refresh(mts.monitoringCtx, err) | ||
| if err != nil { |
| // Ordering matters: stopMonitoring is closed first so any concurrent | ||
| // enterAuthRetrying call sees the gate closed before it acquires the | ||
| // field mutex, eliminating the race where a hot caller could write | ||
| // AuthRetrying to the status file just after we've written |
| // will still fail with 503 until the token refresh recovers, but | ||
| // discovery callers see the backend's capabilities throughout. | ||
| return vmcp.BackendDegraded | ||
| case rt.WorkloadStatusPolicyStopped: |
| if err == nil { | ||
| mts.exitAuthRetrying() | ||
| if tok == nil || tok.Expiry.IsZero() { | ||
| return true, 0 |
There was a problem hiding this comment.
This return true, 0 exits monitorLoop via the defer close(mts.stopped) path, but stopMonitoring is never closed — that only happens through markAsUnauthenticated / stopOnce.Do.
After the monitor exits here, the gate in Token() stays open. If tokens later start failing, a hot caller falls past the stopMonitoring select, runs the full refresher.Refresh (up to 5min), then calls enterAuthRetrying — which correctly aborts because its own gate checks the channel under the lock. But the workload has now entered AuthRetrying with no monitor alive to drive the ceiling or recovery ticks.
Zero-expiry tokens do appear in practice (some sources omit expires_in). Worth closing stopMonitoring via stopOnce.Do on this exit path, or routing through a small helper that does so without emitting Unauthenticated.
jhrozek
left a comment
There was a problem hiding this comment.
Overall the state-machine design is solid and the test suite is thorough — especially the concurrent stress test and the real-HTTP integration tests. A few things worth discussing below.
| mts.exitAuthRetrying() | ||
| if tok == nil || tok.Expiry.IsZero() { | ||
| return true, 0 | ||
| } |
There was a problem hiding this comment.
This return true, 0 exits monitorLoop via defer close(mts.stopped), but stopMonitoring is never closed here — that only happens through markAsUnauthenticated / stopOnce.Do.
After the monitor exits this way, the gate in Token() stays open forever. If tokens later start failing, a hot caller falls past the stopMonitoring select, runs the full refresher.Refresh (up to 5min), then calls enterAuthRetrying — which correctly aborts because its own gate checks the channel under the lock. But the workload has now entered AuthRetrying with no monitor alive to drive the ceiling or recovery ticks.
Zero-expiry tokens do appear in practice (some sources omit expires_in). Worth closing stopMonitoring via stopOnce.Do on this exit path, or routing through a small helper that does so without emitting Unauthenticated.
| return true, 0 | ||
| } | ||
| return false, waitUntilExpiry(tok.Expiry) | ||
| } |
There was a problem hiding this comment.
The first success path (raw tokenSource.Token() succeeds) calls exitAuthRetrying() before returning — but this path (short retry succeeded) doesn't.
Narrow race: a hot caller enters AuthRetrying between the monitor's inAuthRetrying() check and Refresh returning. The monitor then succeeds here and returns to normal scheduling, but fastFailIfAuthRetrying keeps rejecting hot callers until the next tick reaches the inAuthRetrying branch.
exitAuthRetrying is a no-op when not in AuthRetrying, so adding it before this return is safe and defensive.
| // Transient network error — funnel all concurrent callers through a | ||
| // single retry loop so we don't hammer the token endpoint. | ||
| tok, err = mts.refresher.Refresh(mts.monitoringCtx, err) | ||
| if err != nil { |
There was a problem hiding this comment.
TOCTOU between the stopMonitoring select a few lines above and this Refresh call: a hot caller passes the gate, gets preempted, markAsUnauthenticated fires on another goroutine and closes the channel, then the hot caller resumes here and runs the full 5-min short-retry against a dead endpoint.
The downstream enterAuthRetrying call is safe — its gate checks stopMonitoring under the lock — but the singleflight slot gets held for up to 5min unnecessarily.
A second non-blocking select on stopMonitoring immediately before this line would close the window.
| // Ordering matters: stopMonitoring is closed first so any concurrent | ||
| // enterAuthRetrying call sees the gate closed before it acquires the | ||
| // field mutex, eliminating the race where a hot caller could write | ||
| // AuthRetrying to the status file just after we've written |
There was a problem hiding this comment.
The in-memory invariant is well-protected — the concurrent stress test validates this — but "eliminating the race" reads as a stronger guarantee than what's actually provided. The SetWorkloadStatus(AuthRetrying) emit happens after mu is released, so markAsUnauthenticated can still write Unauthenticated to disk and then enterAuthRetrying overwrites it with AuthRetrying. The comment a few lines down in the function body already acknowledges this: "a narrower disk-write inversion is still possible".
Suggested wording: "eliminates the in-memory field race" — keeps both comments consistent.
| // will still fail with 503 until the token refresh recovers, but | ||
| // discovery callers see the backend's capabilities throughout. | ||
| return vmcp.BackendDegraded | ||
| case rt.WorkloadStatusPolicyStopped: |
There was a problem hiding this comment.
Intentional design question: should auth_retrying workloads appear in thv list without --all?
The PR adds a retrying indicator and a styled pill for this status, but both are unreachable — pkg/workloads/statuses/file_status.go line 310 filters to WorkloadStatusRunning only before the display code is reached. WorkloadStatusUnauthenticated and WorkloadStatusUnhealthy are also filtered there, so this may be deliberate, but the new visual work suggests the intent was to surface this state prominently by default. pkg/workloads/statuses/status.go (via IsRunning()) and this listing path have similar guards.
| - `unhealthy` - Workload is running but unhealthy | ||
| - `unauthenticated` - Remote workload cannot authenticate (expired tokens) | ||
| - `auth_retrying` - Remote workload's token refresh is failing transiently; monitor is still retrying until success (→ `running`) or the configured ceiling (→ `unauthenticated`) | ||
| - `unknown` - Workload status cannot be determined |
There was a problem hiding this comment.
There is a third exit path missing from this description: if a monitor tick during the auth_retrying window observes a permanent OAuth error (e.g. invalid_grant), markAsUnauthenticated fires immediately without waiting for the ceiling. The 08-workloads-lifecycle.md state diagram already captures this correctly as "Ceiling Exceeded or Permanent Error".
| - `unknown` - Workload status cannot be determined | |
| - `auth_retrying` - Remote workload's token refresh is failing transiently; monitor is still retrying until success (-> `running`), a permanent error is observed (-> `unauthenticated`), or the configured ceiling is exceeded (-> `unauthenticated`) |
|
sorry about the stray test messages, I was driving the review from a CC session and it couldn't figure out how to post an inline comment :-/ |
Summary
than the in-loop short retry (~1–2 minutes at defaults), the
monitor today marks the workload
unauthenticatedand exits itsgoroutine. The workload stays permanently dead even after the
underlying condition clears on its own. The canonical trigger is a
brief client-side network-context change—a VPN disconnect, a
laptop sleeping on one network and resuming on another, etc.—that
causes token-refresh requests to reach the OAuth server from a
different network origin. See OAuth monitor gives up on transient failures, leaving workloads dead #5349 for the full failure mode.
WorkloadStatusAuthRetryingworkload status. After the shortretry exhausts on a still-transient error, the monitor stays alive
and re-attempts refresh on a longer cadence (default 10 min,
configurable via
TOOLHIVE_TOKEN_AUTH_RETRYING_TICK_INTERVAL) untileither success (→
Running) or a configurable ceiling (default 24h, configurable via
TOOLHIVE_TOKEN_AUTH_RETRYING_MAX_ELAPSED)(→
Unauthenticated).Token()calls, e.g. from the tokeninjection middleware) fast-fail with the cached error during
AuthRetrying, returning 503+Retry-After immediately rather than
blocking on another short-retry duration against the broken
endpoint.
pkg/oauthproto/oauthtest/server.goprovides a scriptableOAuth server fixture used by two end-to-end integration tests that
drive the state machine against the real
golang.org/x/oauth2library + real HTTP responses.
vmcpbackend health mapping treatsAuthRetryingasBackendDegradedso tool discovery still aggregates the workloadwhile invocation surfaces the underlying 503.
Fixes #5349
Type of change
Test plan
task test)—11 new tests across the PR:7 covering the AuthRetrying state machine (entry on short-retry
exhaustion, recovery to
Running, ceiling transition toUnauthenticated, hot-caller fast-fail, permanent-error mid-AuthRetrying, DCR Warn silence on ceiling, post-stop gate);
2 integration tests that drive the state machine against a
real
httptest.NewServerthrough the realgolang.org/x/oauth2library; 2 covering the new switch arms in
mapWorkloadStatusToVMCPHealthandworkloadStatusIndicator.task lint-fix)deployed
~/.local/bin/thv, restarted two real OAuth-backedremote workloads (Sourcegraph and Sourcegraph Deep Search) on
the new binary. Verified MCP
initializeround-trip worksend-to-end with token injection through the proxy. Verified
thv list --allrendering of the newauth_retryingstatusvia a manual status-file simulation.
Note on confidence. The bug's natural trigger (a real WAF block
during a real token refresh) recurs every 1–3 days in the affected
environment and needs sudo-level network controls (
pfctl/iptables) to drive on demand—neither fits CI or routineverification. The two
TestIntegration_*cases inpkg/auth/monitored_token_source_test.goare where regressions inthe long-tail behavior would surface: they drive the full state
machine through the real
golang.org/x/oauth2library against ascriptable OAuth server fixture (
pkg/oauthproto/oauthtest/)returning the actual
<!DOCTYPE html>4xx-without-RFC-6749-error-code response shape observed in production. Two scenarios:
Running → AuthRetrying → Running(recovery on next tick) andRunning → AuthRetrying → Unauthenticated(ceiling exceeded). Themock-based unit tests in the same file pin individual edge cases at
higher speed; together with the integration tests they cover both
the state machine logic and the real-library response-parsing path.
API Compatibility
v1beta1API.The
v1beta1operator API surface is not touched by this PR. Thenew
auth_retryingworkload status is added to the runtime enum andto the OpenAPI swagger for the workloads HTTP API, both of which are
additive—existing clients that don't know about the new value will
see it as an unrecognised string and route through their default
handlers.
Does this introduce a user-facing change?
Yes:
auth_retryingworkload status (visible inthv list --all,in the OpenAPI swagger, and in the architecture state diagram).
10m) andceiling (default
24h) viaTOOLHIVE_TOKEN_AUTH_RETRYING_TICK_INTERVALand
TOOLHIVE_TOKEN_AUTH_RETRYING_MAX_ELAPSED.operators no longer need to manually re-auth workloads after a
transient network-context change resolves on its own.
Large PR Justification
The 1342-line additions break down as:
*_test.gofiles. The bulk is inpkg/auth/monitored_token_source_test.go(718 new lines), coveringthe new state machine including 2 integration tests that drive the
full flow against a real
golang.org/x/oauth2library.pkg/oauthproto/oauthtest/server.go, a new scriptable OAuth serverfixture used by the integration tests.
auth_retryingenum value inpkg/api/v1/workload_types.go.state transitions, ceiling logic, hot-caller fast-fail, and the
MonitoredTokenSourceplumbing reference each other and would notfunction if split.
Generated with Claude Code.