Recover OAuth workloads from transient refresh failures by gkatz2 · Pull Request #5350 · stacklok/toolhive

gkatz2 · 2026-05-20T19:41:17Z

Summary

When an OAuth token refresh keeps failing with a transient error for
longer than the in-loop short retry (~1–2 minutes at defaults), the
monitor today marks the workload unauthenticated and exits its
goroutine. The workload stays permanently dead even after the
underlying condition clears on its own. The canonical trigger is a
brief client-side network-context change (a VPN disconnect, a
laptop sleeping on one network and resuming on another, etc.) that
causes token-refresh requests to reach the OAuth server from a
different network origin. See issue #5349, "OAuth monitor gives up
on transient failures, leaving workloads dead", for the full
failure mode.

- Add a WorkloadStatusAuthRetrying workload status. After the short
  retry is exhausted on a still-transient error, the monitor stays
  alive and re-attempts refresh on a longer cadence (default 10 min,
  configurable via TOOLHIVE_TOKEN_AUTH_RETRYING_TICK_INTERVAL) until
  either success (→ Running) or a configurable ceiling (default 24h,
  configurable via TOOLHIVE_TOKEN_AUTH_RETRYING_MAX_ELAPSED)
  (→ Unauthenticated).
- Hot callers (request-path Token() calls, e.g. from the token
  injection middleware) fast-fail with the cached error during
  AuthRetrying, returning 503+Retry-After immediately rather than
  blocking on another short-retry duration against the broken
  endpoint.
- New pkg/oauthproto/oauthtest/server.go provides a scriptable
  OAuth server fixture used by two end-to-end integration tests that
  drive the state machine against the real golang.org/x/oauth2
  library + real HTTP responses.
- The vmcp backend health mapping treats AuthRetrying as
  BackendDegraded so tool discovery still aggregates the workload
  while invocation surfaces the underlying 503.
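
The transitions described above can be sketched as a pure function. This is a minimal illustration, not the PR's implementation: the `Status`, `Outcome`, and `Next` names are hypothetical, and treating a permanent error as an immediate move to Unauthenticated from any live state is an assumption based on the review discussion.

```go
package main

import (
	"fmt"
	"time"
)

// Status mirrors the three workload statuses discussed in this PR. The Go
// names here are illustrative, not ToolHive's actual identifiers.
type Status string

const (
	Running         Status = "running"
	AuthRetrying    Status = "auth_retrying"
	Unauthenticated Status = "unauthenticated"
)

// Outcome classifies one refresh attempt after the short retry has run.
type Outcome int

const (
	Success   Outcome = iota // refresh returned a token
	Transient                // still failing, but possibly recoverable
	Permanent                // e.g. an invalid_grant response
)

// Defaults quoted in the PR description.
const (
	defaultTick    = 10 * time.Minute
	defaultCeiling = 24 * time.Hour
)

// Next returns the status after one refresh attempt. elapsed is how long the
// workload has already spent in AuthRetrying.
func Next(cur Status, out Outcome, elapsed, ceiling time.Duration) Status {
	if cur == Unauthenticated {
		return Unauthenticated // terminal: the monitor has exited
	}
	switch out {
	case Success:
		return Running
	case Permanent:
		return Unauthenticated
	default: // Transient
		if cur == AuthRetrying && elapsed >= ceiling {
			return Unauthenticated
		}
		return AuthRetrying
	}
}

func main() {
	st := Next(Running, Transient, 0, defaultCeiling)
	fmt.Println("after short-retry exhaustion:", st)
	fmt.Println("recovery on a later tick:", Next(st, Success, defaultTick, defaultCeiling))
	fmt.Println("ceiling exceeded:", Next(st, Transient, defaultCeiling, defaultCeiling))
}
```

Keeping the decision free of I/O makes the ceiling and recovery rules easy to test in isolation; the real monitor additionally has to handle timers, locking, and status-file writes.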

Type of change

Test plan

- Unit tests (task test): 11 new tests across the PR:
  - 7 covering the AuthRetrying state machine (entry on short-retry
    exhaustion, recovery to Running, ceiling transition to
    Unauthenticated, hot-caller fast-fail, permanent-error
    mid-AuthRetrying, DCR Warn silence on ceiling, post-stop gate);
  - 2 integration tests that drive the state machine against a
    real httptest.NewServer through the real golang.org/x/oauth2
    library;
  - 2 covering the new switch arms in
    mapWorkloadStatusToVMCPHealth and workloadStatusIndicator.
- Linting (task lint-fix)
- Manual testing: built a local override binary, replaced the
  deployed ~/.local/bin/thv, restarted two real OAuth-backed
  remote workloads (Sourcegraph and Sourcegraph Deep Search) on
  the new binary. Verified MCP initialize round-trip works
  end-to-end with token injection through the proxy. Verified
  thv list --all rendering of the new auth_retrying status
  via a manual status-file simulation.

Note on confidence. The bug's natural trigger (a real WAF block
during a real token refresh) recurs every 1–3 days in the affected
environment and needs sudo-level network controls (pfctl /
iptables) to drive on demand—neither fits CI or routine
verification. The two TestIntegration_* cases in
pkg/auth/monitored_token_source_test.go are where regressions in
the long-tail behavior would surface: they drive the full state
machine through the real golang.org/x/oauth2 library against a
scriptable OAuth server fixture (pkg/oauthproto/oauthtest/)
returning the actual <!DOCTYPE html> 4xx-without-RFC-6749-error-
code response shape observed in production. Two scenarios:
Running → AuthRetrying → Running (recovery on next tick) and
Running → AuthRetrying → Unauthenticated (ceiling exceeded). The
mock-based unit tests in the same file pin individual edge cases at
higher speed; together with the integration tests they cover both
the state machine logic and the real-library response-parsing path.
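
To illustrate what "scriptable" means here, the following is a rough sketch of a fake token endpoint that replays canned responses in order. It uses only the standard library and is not the API of pkg/oauthproto/oauthtest; `scriptedResponse`, `newScriptedServer`, `blockedThenOK`, and `postRefresh` are hypothetical names, and the 403 status is an assumed stand-in for the HTML 4xx response described above.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/url"
	"sync"
)

// scriptedResponse is one canned reply from the fake token endpoint.
type scriptedResponse struct {
	status      int
	contentType string
	body        string
}

// newScriptedServer serves the given responses in order and then keeps
// repeating the last one. script must be non-empty.
func newScriptedServer(script []scriptedResponse) *httptest.Server {
	var mu sync.Mutex
	next := 0
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		mu.Lock()
		r := script[next]
		if next < len(script)-1 {
			next++
		}
		mu.Unlock()
		w.Header().Set("Content-Type", r.contentType)
		w.WriteHeader(r.status)
		_, _ = io.WriteString(w, r.body)
	}))
}

// blockedThenOK scripts an HTML error page with no RFC 6749 error code,
// followed by a normal token response.
func blockedThenOK() []scriptedResponse {
	return []scriptedResponse{
		{http.StatusForbidden, "text/html", "<!DOCTYPE html><html><body>Blocked</body></html>"},
		{http.StatusOK, "application/json", `{"access_token":"new","token_type":"Bearer","expires_in":3600}`},
	}
}

// postRefresh sends a refresh_token grant and returns the status and body.
func postRefresh(endpoint string) (int, string, error) {
	resp, err := http.PostForm(endpoint, url.Values{
		"grant_type":    {"refresh_token"},
		"refresh_token": {"dummy"},
	})
	if err != nil {
		return 0, "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return resp.StatusCode, string(body), err
}

func main() {
	srv := newScriptedServer(blockedThenOK())
	defer srv.Close()
	for i := 0; i < 2; i++ {
		code, body, err := postRefresh(srv.URL)
		if err != nil {
			fmt.Println("request failed:", err)
			return
		}
		fmt.Printf("attempt %d: HTTP %d, %d bytes\n", i+1, code, len(body))
	}
}
```

In the PR's integration tests the requests come from the real golang.org/x/oauth2 library rather than a hand-written POST, which is what exercises the response-parsing path.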

API Compatibility

This PR does not break the v1beta1 API.

The v1beta1 operator API surface is not touched by this PR. The
new auth_retrying workload status is added to the runtime enum and
to the OpenAPI swagger for the workloads HTTP API, both of which are
additive—existing clients that don't know about the new value will
see it as an unrecognised string and route through their default
handlers.
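
As an illustration of that fallback, here is a sketch of a hypothetical client written before auth_retrying existed. The `describe` function and its labels are invented for this example; only the status strings come from this PR.

```go
package main

import "fmt"

// describe is a hypothetical client-side renderer that predates
// auth_retrying. Unknown status strings fall through to the default arm.
func describe(status string) string {
	switch status {
	case "running":
		return "healthy"
	case "unhealthy", "unauthenticated":
		return "needs attention"
	default:
		return "unrecognised status: " + status
	}
}

func main() {
	fmt.Println(describe("auth_retrying"))
}
```

A client with no default arm, or one that rejects unknown enum values during decoding, would not degrade this gracefully, so the additive claim holds only for tolerant clients.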

Does this introduce a user-facing change?

Yes:

- New auth_retrying workload status (visible in thv list --all,
  in the OpenAPI swagger, and in the architecture state diagram).
- Operators can tune the AuthRetrying cadence (default 10m) and
  ceiling (default 24h) via TOOLHIVE_TOKEN_AUTH_RETRYING_TICK_INTERVAL
  and TOOLHIVE_TOKEN_AUTH_RETRYING_MAX_ELAPSED.
- In environments where the bug surfaces (described in issue #5349,
  "OAuth monitor gives up on transient failures, leaving workloads
  dead"), operators no longer need to manually re-auth workloads
  after a transient network-context change resolves on its own.
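
For readers wondering how such tunables are typically consumed, here is a minimal sketch. It assumes the variables hold Go duration strings such as 10m or 24h, which the defaults suggest but this description does not state; `durationFrom` and its fall-back-to-default policy are hypothetical, not the PR's code.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

const (
	tickIntervalEnv = "TOOLHIVE_TOKEN_AUTH_RETRYING_TICK_INTERVAL"
	maxElapsedEnv   = "TOOLHIVE_TOKEN_AUTH_RETRYING_MAX_ELAPSED"

	defaultTickInterval = 10 * time.Minute
	defaultMaxElapsed   = 24 * time.Hour
)

// durationFrom reads name via lookup and parses it with time.ParseDuration.
// Empty, unparsable, or non-positive values fall back to def; that fallback
// policy is an assumption of this sketch.
func durationFrom(lookup func(string) string, name string, def time.Duration) time.Duration {
	raw := lookup(name)
	if raw == "" {
		return def
	}
	d, err := time.ParseDuration(raw)
	if err != nil || d <= 0 {
		return def
	}
	return d
}

func main() {
	tick := durationFrom(os.Getenv, tickIntervalEnv, defaultTickInterval)
	ceiling := durationFrom(os.Getenv, maxElapsedEnv, defaultMaxElapsed)
	fmt.Printf("tick=%s ceiling=%s\n", tick, ceiling)
}
```

Passing the lookup function in, rather than calling os.Getenv directly, keeps the helper testable without mutating the process environment.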

Large PR Justification

The 1342 added lines break down as:

- Tests: ~789 lines across 4 *_test.go files. The bulk is in
  pkg/auth/monitored_token_source_test.go (718 new lines), covering
  the new state machine including 2 integration tests that drive the
  full flow against a real golang.org/x/oauth2 library.
- Test infrastructure: 136 lines in
  pkg/oauthproto/oauthtest/server.go, a new scriptable OAuth server
  fixture used by the integration tests.
- Generated swagger: 12 lines, regenerated from the new
  auth_retrying enum value in pkg/api/v1/workload_types.go.
- Architecture docs: 15 lines documenting the new state.
- Production code: ~390 lines. The change is necessarily atomic:
  state transitions, ceiling logic, hot-caller fast-fail, and the
  MonitoredTokenSource plumbing reference each other and would not
  function if split.

Generated with Claude Code.

github-actions

Large PR Detected

This PR exceeds 1000 lines of changes and requires justification before it can be reviewed.

How to unblock this PR:

Add a section to your PR description with the following format:

## Large PR Justification

[Explain why this PR must be large, such as:]
- Generated code that cannot be split
- Large refactoring that must be atomic
- Multiple related changes that would break if separated
- Migration or data transformation

Alternative:

Consider splitting this PR into smaller, focused changes (< 1000 lines each) for easier review and reduced risk.

See our Contributing Guidelines for more details.

This review will be automatically dismissed once you add the justification section.

codecov · 2026-05-20T19:51:58Z

Codecov Report

❌ Patch coverage is 81.12245% with 37 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.93%. Comparing base (83e9eae) to head (978fc6f).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| pkg/auth/monitored_token_source.go | 79.72% | 25 Missing and 5 partials ⚠️ |
| pkg/oauthproto/oauthtest/server.go | 84.44% | 7 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #5350      +/-   ##
==========================================
+ Coverage   68.87%   68.93%   +0.06%     
==========================================
  Files         634      635       +1     
  Lines       64460    64632     +172     
==========================================
+ Hits        44394    44555     +161     
- Misses      16785    16794       +9     
- Partials     3281     3283       +2

github-actions · 2026-05-20T21:55:18Z

✅ Large PR justification has been provided. The size review has been dismissed and this PR can now proceed with normal review.

When an OAuth refresh fails as a transient error for longer than the in-loop short retry (~1-2 minutes at defaults), the monitor used to mark the workload unauthenticated and exit. The workload stayed dead even after the underlying condition cleared on its own (e.g. a brief VPN disconnect routing requests through a different network path). Now the monitor keeps retrying on a configurable longer cadence (default 10 minutes) until either success or a configurable ceiling (default 24 hours), at which point the workload is finally marked unauthenticated.

Fixes stacklok#5349

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Greg Katz <gkatz@indeed.com>

gkatz2 · 2026-06-05T15:16:34Z

Gentle nudge — I'm still seeing this issue across multiple OAuth-backed MCP servers in our environment. Happy to address review feedback whenever a maintainer can take a look. Thanks!

jhrozek

Test comment

jhrozek · 2026-06-11T09:00:16Z

 - `unhealthy` - Workload is running but unhealthy
 - `unauthenticated` - Remote workload cannot authenticate (expired tokens)
+- `auth_retrying` - Remote workload's token refresh is failing transiently; monitor is still retrying until success (→ `running`) or the configured ceiling (→ `unauthenticated`)
 - `unknown` - Workload status cannot be determined


jhrozek

test

jhrozek · 2026-06-11T09:00:30Z

+		mts.exitAuthRetrying()
+		if tok == nil || tok.Expiry.IsZero() {
+			return true, 0
+		}


jhrozek

test

jhrozek · 2026-06-11T09:00:31Z

+			return true, 0
+		}
+		return false, waitUntilExpiry(tok.Expiry)
+	}


jhrozek

test

jhrozek · 2026-06-11T09:00:33Z

 	// Transient network error — funnel all concurrent callers through a
 	// single retry loop so we don't hammer the token endpoint.
 	tok, err = mts.refresher.Refresh(mts.monitoringCtx, err)
-	if err != nil {


jhrozek

test

jhrozek · 2026-06-11T09:00:34Z

+// Ordering matters: stopMonitoring is closed first so any concurrent
+// enterAuthRetrying call sees the gate closed before it acquires the
+// field mutex, eliminating the race where a hot caller could write
+// AuthRetrying to the status file just after we've written


jhrozek

test

jhrozek · 2026-06-11T09:00:45Z

+		// will still fail with 503 until the token refresh recovers, but
+		// discovery callers see the backend's capabilities throughout.
+		return vmcp.BackendDegraded
 	case rt.WorkloadStatusPolicyStopped:


jhrozek

test

jhrozek

Overall the state-machine design is solid and the test suite is thorough — especially the concurrent stress test and the real-HTTP integration tests. A few things worth discussing below.

jhrozek · 2026-06-11T09:05:14Z

+		mts.exitAuthRetrying()
+		if tok == nil || tok.Expiry.IsZero() {
+			return true, 0
+		}


This return true, 0 exits monitorLoop via defer close(mts.stopped), but stopMonitoring is never closed here — that only happens through markAsUnauthenticated / stopOnce.Do.

After the monitor exits this way, the gate in Token() stays open forever. If tokens later start failing, a hot caller falls past the stopMonitoring select, runs the full refresher.Refresh (up to 5min), then calls enterAuthRetrying — which correctly aborts because its own gate checks the channel under the lock. But the workload has now entered AuthRetrying with no monitor alive to drive the ceiling or recovery ticks.

Zero-expiry tokens do appear in practice (some sources omit expires_in). Worth closing stopMonitoring via stopOnce.Do on this exit path, or routing through a small helper that does so without emitting Unauthenticated.
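
A minimal sketch of such a helper, under the assumption that stopMonitoring is a `chan struct{}` guarded by a `sync.Once`, as the field names suggest. The `monitor` type, `closeGate`, and `gateClosed` are hypothetical names for illustration and not the PR's code.

```go
package main

import (
	"fmt"
	"sync"
)

// monitor holds only the two fields relevant to the gate. The real
// MonitoredTokenSource has more state; this shape is an assumption.
type monitor struct {
	stopOnce       sync.Once
	stopMonitoring chan struct{}
}

func newMonitor() *monitor {
	return &monitor{stopMonitoring: make(chan struct{})}
}

// closeGate closes stopMonitoring exactly once and emits no status, so it
// is safe to call from any exit path, including the zero-expiry one.
func (m *monitor) closeGate() {
	m.stopOnce.Do(func() { close(m.stopMonitoring) })
}

// gateClosed is the non-blocking check a hot caller would perform.
func (m *monitor) gateClosed() bool {
	select {
	case <-m.stopMonitoring:
		return true
	default:
		return false
	}
}

func main() {
	m := newMonitor()
	fmt.Println("closed before exit:", m.gateClosed())
	m.closeGate()
	m.closeGate() // second call is a no-op rather than a double-close panic
	fmt.Println("closed after exit:", m.gateClosed())
}
```

Because the close goes through the same sync.Once as the Unauthenticated path, whichever exit runs first wins and the other becomes a no-op.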

jhrozek · 2026-06-11T09:05:14Z

+			return true, 0
+		}
+		return false, waitUntilExpiry(tok.Expiry)
+	}


The first success path (raw tokenSource.Token() succeeds) calls exitAuthRetrying() before returning — but this path (short retry succeeded) doesn't.

Narrow race: a hot caller enters AuthRetrying between the monitor's inAuthRetrying() check and Refresh returning. The monitor then succeeds here and returns to normal scheduling, but fastFailIfAuthRetrying keeps rejecting hot callers until the next tick reaches the inAuthRetrying branch.

exitAuthRetrying is a no-op when not in AuthRetrying, so adding it before this return is safe and defensive.
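
A small sketch of why the extra call is harmless, assuming the state is a mutex-guarded flag plus a cached error. The `retryState` type and its `enter`, `exit`, and `fastFail` methods are hypothetical stand-ins for enterAuthRetrying, exitAuthRetrying, and fastFailIfAuthRetrying.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errRefreshBlocked stands in for the cached transient refresh error.
var errRefreshBlocked = errors.New("token refresh blocked")

// retryState is an illustrative stand-in for the AuthRetrying bookkeeping.
type retryState struct {
	mu        sync.Mutex
	retrying  bool
	cachedErr error
}

// enter records the error that hot callers should fast-fail with.
func (s *retryState) enter(err error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.retrying = true
	s.cachedErr = err
}

// exit clears the state and reports whether anything changed. Calling it
// when not retrying is a no-op, so every success path can call it.
func (s *retryState) exit() bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if !s.retrying {
		return false
	}
	s.retrying = false
	s.cachedErr = nil
	return true
}

// fastFail returns the cached error while retrying and nil otherwise.
func (s *retryState) fastFail() error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.retrying {
		return s.cachedErr
	}
	return nil
}

func main() {
	s := &retryState{}
	fmt.Println("exit when idle changed state:", s.exit())
	s.enter(errRefreshBlocked)
	fmt.Println("hot caller sees:", s.fastFail())
	fmt.Println("exit after success changed state:", s.exit())
	fmt.Println("hot caller sees:", s.fastFail())
}
```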

jhrozek · 2026-06-11T09:05:14Z

 	// Transient network error — funnel all concurrent callers through a
 	// single retry loop so we don't hammer the token endpoint.
 	tok, err = mts.refresher.Refresh(mts.monitoringCtx, err)
-	if err != nil {


TOCTOU between the stopMonitoring select a few lines above and this Refresh call: a hot caller passes the gate, gets preempted, markAsUnauthenticated fires on another goroutine and closes the channel, then the hot caller resumes here and runs the full 5-min short-retry against a dead endpoint.

The downstream enterAuthRetrying call is safe — its gate checks stopMonitoring under the lock — but the singleflight slot gets held for up to 5min unnecessarily.

A second non-blocking select on stopMonitoring immediately before this line would close the window.
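
Roughly what that re-check could look like, with the refresh call abstracted as a function value. `refreshIfLive`, `stopped`, and `errMonitoringStopped` are hypothetical names; the real call site is mts.refresher.Refresh with its own error handling.

```go
package main

import (
	"errors"
	"fmt"
)

// errMonitoringStopped is an illustrative sentinel for a closed gate.
var errMonitoringStopped = errors.New("monitoring stopped")

// stopped performs a non-blocking check of the gate channel.
func stopped(stop <-chan struct{}) bool {
	select {
	case <-stop:
		return true
	default:
		return false
	}
}

// refreshIfLive re-checks the gate immediately before the expensive call,
// so a caller preempted after an earlier check does not start a long retry
// against an endpoint the monitor has already given up on.
func refreshIfLive(stop <-chan struct{}, refresh func() (string, error)) (string, error) {
	if stopped(stop) {
		return "", errMonitoringStopped
	}
	return refresh()
}

func main() {
	stop := make(chan struct{})
	calls := 0
	refresh := func() (string, error) { calls++; return "token", nil }

	tok, err := refreshIfLive(stop, refresh)
	fmt.Println(tok, err, calls)

	close(stop) // simulates markAsUnauthenticated closing the gate
	tok, err = refreshIfLive(stop, refresh)
	fmt.Println(tok == "", err, calls)
}
```

The check shrinks the window to the gap between the select and the call itself; a refresh already in flight would still depend on context cancellation to stop early.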

jhrozek · 2026-06-11T09:05:14Z

+// Ordering matters: stopMonitoring is closed first so any concurrent
+// enterAuthRetrying call sees the gate closed before it acquires the
+// field mutex, eliminating the race where a hot caller could write
+// AuthRetrying to the status file just after we've written


The in-memory invariant is well-protected — the concurrent stress test validates this — but "eliminating the race" reads as a stronger guarantee than what's actually provided. The SetWorkloadStatus(AuthRetrying) emit happens after mu is released, so markAsUnauthenticated can still write Unauthenticated to disk and then enterAuthRetrying overwrites it with AuthRetrying. The comment a few lines down in the function body already acknowledges this: "a narrower disk-write inversion is still possible".

Suggested wording: "eliminates the in-memory field race" — keeps both comments consistent.

jhrozek · 2026-06-11T09:05:14Z

+		// will still fail with 503 until the token refresh recovers, but
+		// discovery callers see the backend's capabilities throughout.
+		return vmcp.BackendDegraded
 	case rt.WorkloadStatusPolicyStopped:


Intentional design question: should auth_retrying workloads appear in thv list without --all?

The PR adds a retrying indicator and a styled pill for this status, but both are unreachable — pkg/workloads/statuses/file_status.go line 310 filters to WorkloadStatusRunning only before the display code is reached. WorkloadStatusUnauthenticated and WorkloadStatusUnhealthy are also filtered there, so this may be deliberate, but the new visual work suggests the intent was to surface this state prominently by default. pkg/workloads/statuses/status.go (via IsRunning()) and this listing path have similar guards.

jhrozek · 2026-06-11T09:05:14Z

 - `unhealthy` - Workload is running but unhealthy
 - `unauthenticated` - Remote workload cannot authenticate (expired tokens)
+- `auth_retrying` - Remote workload's token refresh is failing transiently; monitor is still retrying until success (→ `running`) or the configured ceiling (→ `unauthenticated`)
 - `unknown` - Workload status cannot be determined


There is a third exit path missing from this description: if a monitor tick during the auth_retrying window observes a permanent OAuth error (e.g. invalid_grant), markAsUnauthenticated fires immediately without waiting for the ceiling. The 08-workloads-lifecycle.md state diagram already captures this correctly as "Ceiling Exceeded or Permanent Error".

Suggested change

- `unknown` - Workload status cannot be determined

- `auth_retrying` - Remote workload's token refresh is failing transiently; monitor is still retrying until success (-> `running`), a permanent error is observed (-> `unauthenticated`), or the configured ceiling is exceeded (-> `unauthenticated`)

jhrozek · 2026-06-11T09:27:02Z

sorry about the stray test messages, I was driving the review from a CC session and it couldn't figure out how to post an inline comment :-/

gkatz2 requested review from ChrisJBurns, JAORMX, amirejaz, blkt, jerm-dro, jhrozek, lujunsan, rdimitrov and yrobla as code owners May 20, 2026 19:41

github-actions Bot added the size/XL Extra large PR: 1000+ lines changed label May 20, 2026

github-actions Bot previously requested changes May 20, 2026

github-actions Bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels May 20, 2026

gkatz2 force-pushed the fix/auth-retrying-cross-tick branch from 178577a to 978fc6f Compare June 5, 2026 15:07

gkatz2 requested review from aponcedeleonch, reyortiz3 and tgrunnagle as code owners June 5, 2026 15:07

github-actions Bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Jun 5, 2026

jhrozek reviewed Jun 11, 2026
