Recover OAuth workloads from transient refresh failures#5350
Conversation
When an OAuth refresh fails as a transient error for longer than the in-loop short retry (~1-2 minutes at defaults), the monitor used to mark the workload unauthenticated and exit. The workload stayed dead even after the underlying condition cleared on its own (e.g. a brief VPN disconnect routing requests through a different network path). Now the monitor keeps retrying on a configurable longer cadence (default 10 minutes) until either success or a configurable ceiling (default 24 hours), at which point the workload is finally marked unauthenticated. Fixes stacklok#5349 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Greg Katz <gkatz@indeed.com>
There was a problem hiding this comment.
Large PR Detected
This PR exceeds 1000 lines of changes and requires justification before it can be reviewed.
How to unblock this PR:
Add a section to your PR description with the following format:
## Large PR Justification
[Explain why this PR must be large, such as:]
- Generated code that cannot be split
- Large refactoring that must be atomic
- Multiple related changes that would break if separated
- Migration or data transformationAlternative:
Consider splitting this PR into smaller, focused changes (< 1000 lines each) for easier review and reduced risk.
See our Contributing Guidelines for more details.
This review will be automatically dismissed once you add the justification section.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #5350 +/- ##
==========================================
+ Coverage 68.38% 68.47% +0.08%
==========================================
Files 624 625 +1
Lines 63442 63614 +172
==========================================
+ Hits 43386 43560 +174
+ Misses 16818 16811 -7
- Partials 3238 3243 +5 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
✅ Large PR justification has been provided. The size review has been dismissed and this PR can now proceed with normal review. |
Large PR justification has been provided. Thank you!
Summary
than the in-loop short retry (~1–2 minutes at defaults), the
monitor today marks the workload
unauthenticatedand exits itsgoroutine. The workload stays permanently dead even after the
underlying condition clears on its own. The canonical trigger is a
brief client-side network-context change—a VPN disconnect, a
laptop sleeping on one network and resuming on another, etc.—that
causes token-refresh requests to reach the OAuth server from a
different network origin. See OAuth monitor gives up on transient failures, leaving workloads dead #5349 for the full failure mode.
WorkloadStatusAuthRetryingworkload status. After the shortretry exhausts on a still-transient error, the monitor stays alive
and re-attempts refresh on a longer cadence (default 10 min,
configurable via
TOOLHIVE_TOKEN_AUTH_RETRYING_TICK_INTERVAL) untileither success (→
Running) or a configurable ceiling (default 24h, configurable via
TOOLHIVE_TOKEN_AUTH_RETRYING_MAX_ELAPSED)(→
Unauthenticated).Token()calls, e.g. from the tokeninjection middleware) fast-fail with the cached error during
AuthRetrying, returning 503+Retry-After immediately rather than
blocking on another short-retry duration against the broken
endpoint.
pkg/oauthproto/oauthtest/server.goprovides a scriptableOAuth server fixture used by two end-to-end integration tests that
drive the state machine against the real
golang.org/x/oauth2library + real HTTP responses.
vmcpbackend health mapping treatsAuthRetryingasBackendDegradedso tool discovery still aggregates the workloadwhile invocation surfaces the underlying 503.
Fixes #5349
Type of change
Test plan
task test)—11 new tests across the PR:7 covering the AuthRetrying state machine (entry on short-retry
exhaustion, recovery to
Running, ceiling transition toUnauthenticated, hot-caller fast-fail, permanent-error mid-AuthRetrying, DCR Warn silence on ceiling, post-stop gate);
2 integration tests that drive the state machine against a
real
httptest.NewServerthrough the realgolang.org/x/oauth2library; 2 covering the new switch arms in
mapWorkloadStatusToVMCPHealthandworkloadStatusIndicator.task lint-fix)deployed
~/.local/bin/thv, restarted two real OAuth-backedremote workloads (Sourcegraph and Sourcegraph Deep Search) on
the new binary. Verified MCP
initializeround-trip worksend-to-end with token injection through the proxy. Verified
thv list --allrendering of the newauth_retryingstatusvia a manual status-file simulation.
Note on confidence. The bug's natural trigger (a real WAF block
during a real token refresh) recurs every 1–3 days in the affected
environment and needs sudo-level network controls (
pfctl/iptables) to drive on demand—neither fits CI or routineverification. The two
TestIntegration_*cases inpkg/auth/monitored_token_source_test.goare where regressions inthe long-tail behavior would surface: they drive the full state
machine through the real
golang.org/x/oauth2library against ascriptable OAuth server fixture (
pkg/oauthproto/oauthtest/)returning the actual
<!DOCTYPE html>4xx-without-RFC-6749-error-code response shape observed in production. Two scenarios:
Running → AuthRetrying → Running(recovery on next tick) andRunning → AuthRetrying → Unauthenticated(ceiling exceeded). Themock-based unit tests in the same file pin individual edge cases at
higher speed; together with the integration tests they cover both
the state machine logic and the real-library response-parsing path.
API Compatibility
v1beta1API.The
v1beta1operator API surface is not touched by this PR. Thenew
auth_retryingworkload status is added to the runtime enum andto the OpenAPI swagger for the workloads HTTP API, both of which are
additive—existing clients that don't know about the new value will
see it as an unrecognised string and route through their default
handlers.
Does this introduce a user-facing change?
Yes:
auth_retryingworkload status (visible inthv list --all,in the OpenAPI swagger, and in the architecture state diagram).
10m) andceiling (default
24h) viaTOOLHIVE_TOKEN_AUTH_RETRYING_TICK_INTERVALand
TOOLHIVE_TOKEN_AUTH_RETRYING_MAX_ELAPSED.operators no longer need to manually re-auth workloads after a
transient network-context change resolves on its own.
Large PR Justification
The 1342-line additions break down as:
*_test.gofiles. The bulk is inpkg/auth/monitored_token_source_test.go(718 new lines), coveringthe new state machine including 2 integration tests that drive the
full flow against a real
golang.org/x/oauth2library.pkg/oauthproto/oauthtest/server.go, a new scriptable OAuth serverfixture used by the integration tests.
auth_retryingenum value inpkg/api/v1/workload_types.go.state transitions, ceiling logic, hot-caller fast-fail, and the
MonitoredTokenSourceplumbing reference each other and would notfunction if split.
Generated with Claude Code.