Skip to content

Recover OAuth workloads from transient refresh failures#5350

Open
gkatz2 wants to merge 1 commit into
stacklok:mainfrom
gkatz2:fix/auth-retrying-cross-tick
Open

Recover OAuth workloads from transient refresh failures#5350
gkatz2 wants to merge 1 commit into
stacklok:mainfrom
gkatz2:fix/auth-retrying-cross-tick

Conversation

@gkatz2
Copy link
Copy Markdown
Contributor

@gkatz2 gkatz2 commented May 20, 2026

Summary

  • When an OAuth token refresh fails as a transient error for longer
    than the in-loop short retry (~1–2 minutes at defaults), the
    monitor today marks the workload unauthenticated and exits its
    goroutine. The workload stays permanently dead even after the
    underlying condition clears on its own. The canonical trigger is a
    brief client-side network-context change—a VPN disconnect, a
    laptop sleeping on one network and resuming on another, etc.—that
    causes token-refresh requests to reach the OAuth server from a
    different network origin. See OAuth monitor gives up on transient failures, leaving workloads dead #5349 for the full failure mode.
  • Add a WorkloadStatusAuthRetrying workload status. After the short
    retry exhausts on a still-transient error, the monitor stays alive
    and re-attempts refresh on a longer cadence (default 10 min,
    configurable via TOOLHIVE_TOKEN_AUTH_RETRYING_TICK_INTERVAL) until
    either success (→ Running) or a configurable ceiling (default 24
    h, configurable via TOOLHIVE_TOKEN_AUTH_RETRYING_MAX_ELAPSED)
    (→ Unauthenticated).
  • Hot callers (request-path Token() calls, e.g. from the token
    injection middleware) fast-fail with the cached error during
    AuthRetrying, returning 503+Retry-After immediately rather than
    blocking on another short-retry duration against the broken
    endpoint.
  • New pkg/oauthproto/oauthtest/server.go provides a scriptable
    OAuth server fixture used by two end-to-end integration tests that
    drive the state machine against the real golang.org/x/oauth2
    library + real HTTP responses.
  • The vmcp backend health mapping treats AuthRetrying as
    BackendDegraded so tool discovery still aggregates the workload
    while invocation surfaces the underlying 503.

Fixes #5349

Type of change

  • Bug fix
  • New feature
  • Refactoring (no behavior change)
  • Dependency update
  • Documentation
  • Other (describe):

Test plan

  • Unit tests (task test)—11 new tests across the PR:
    7 covering the AuthRetrying state machine (entry on short-retry
    exhaustion, recovery to Running, ceiling transition to
    Unauthenticated, hot-caller fast-fail, permanent-error mid-
    AuthRetrying, DCR Warn silence on ceiling, post-stop gate);
    2 integration tests that drive the state machine against a
    real httptest.NewServer through the real golang.org/x/oauth2
    library; 2 covering the new switch arms in
    mapWorkloadStatusToVMCPHealth and workloadStatusIndicator.
  • Linting (task lint-fix)
  • Manual testing—built a local override binary, replaced the
    deployed ~/.local/bin/thv, restarted two real OAuth-backed
    remote workloads (Sourcegraph and Sourcegraph Deep Search) on
    the new binary. Verified MCP initialize round-trip works
    end-to-end with token injection through the proxy. Verified
    thv list --all rendering of the new auth_retrying status
    via a manual status-file simulation.

Note on confidence. The bug's natural trigger (a real WAF block
during a real token refresh) recurs every 1–3 days in the affected
environment and needs sudo-level network controls (pfctl /
iptables) to drive on demand—neither fits CI or routine
verification. The two TestIntegration_* cases in
pkg/auth/monitored_token_source_test.go are where regressions in
the long-tail behavior would surface: they drive the full state
machine through the real golang.org/x/oauth2 library against a
scriptable OAuth server fixture (pkg/oauthproto/oauthtest/)
returning the actual <!DOCTYPE html> 4xx-without-RFC-6749-error-
code response shape observed in production. Two scenarios:
Running → AuthRetrying → Running (recovery on next tick) and
Running → AuthRetrying → Unauthenticated (ceiling exceeded). The
mock-based unit tests in the same file pin individual edge cases at
higher speed; together with the integration tests they cover both
the state machine logic and the real-library response-parsing path.

API Compatibility

  • This PR does not break the v1beta1 API.

The v1beta1 operator API surface is not touched by this PR. The
new auth_retrying workload status is added to the runtime enum and
to the OpenAPI swagger for the workloads HTTP API, both of which are
additive—existing clients that don't know about the new value will
see it as an unrecognised string and route through their default
handlers.

Does this introduce a user-facing change?

Yes:

  • New auth_retrying workload status (visible in thv list --all,
    in the OpenAPI swagger, and in the architecture state diagram).
  • Operators can tune the AuthRetrying cadence (default 10m) and
    ceiling (default 24h) via TOOLHIVE_TOKEN_AUTH_RETRYING_TICK_INTERVAL
    and TOOLHIVE_TOKEN_AUTH_RETRYING_MAX_ELAPSED.
  • In environments where the bug surfaces (described in OAuth monitor gives up on transient failures, leaving workloads dead #5349),
    operators no longer need to manually re-auth workloads after a
    transient network-context change resolves on its own.

Large PR Justification

The 1342-line additions break down as:

  • Tests: ~789 lines across 4 *_test.go files. The bulk is in
    pkg/auth/monitored_token_source_test.go (718 new lines), covering
    the new state machine including 2 integration tests that drive the
    full flow against a real golang.org/x/oauth2 library.
  • Test infrastructure: 136 lines in
    pkg/oauthproto/oauthtest/server.go, a new scriptable OAuth server
    fixture used by the integration tests.
  • Generated swagger: 12 lines, regenerated from the new
    auth_retrying enum value in pkg/api/v1/workload_types.go.
  • Architecture docs: 15 lines documenting the new state.
  • Production code: ~390 lines. The change is necessarily atomic—
    state transitions, ceiling logic, hot-caller fast-fail, and the
    MonitoredTokenSource plumbing reference each other and would not
    function if split.

Generated with Claude Code.

When an OAuth refresh fails as a transient error for longer than the
in-loop short retry (~1-2 minutes at defaults), the monitor used to
mark the workload unauthenticated and exit. The workload stayed dead
even after the underlying condition cleared on its own (e.g. a brief
VPN disconnect routing requests through a different network path).
Now the monitor keeps retrying on a configurable longer cadence
(default 10 minutes) until either success or a configurable ceiling
(default 24 hours), at which point the workload is finally marked
unauthenticated.

Fixes stacklok#5349

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Greg Katz <gkatz@indeed.com>
@github-actions github-actions Bot added the size/XL Extra large PR: 1000+ lines changed label May 20, 2026
Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Large PR Detected

This PR exceeds 1000 lines of changes and requires justification before it can be reviewed.

How to unblock this PR:

Add a section to your PR description with the following format:

## Large PR Justification

[Explain why this PR must be large, such as:]
- Generated code that cannot be split
- Large refactoring that must be atomic
- Multiple related changes that would break if separated
- Migration or data transformation

Alternative:

Consider splitting this PR into smaller, focused changes (< 1000 lines each) for easier review and reduced risk.

See our Contributing Guidelines for more details.


This review will be automatically dismissed once you add the justification section.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 20, 2026

Codecov Report

❌ Patch coverage is 81.12245% with 37 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.47%. Comparing base (2894982) to head (178577a).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
pkg/auth/monitored_token_source.go 79.72% 25 Missing and 5 partials ⚠️
pkg/oauthproto/oauthtest/server.go 84.44% 7 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5350      +/-   ##
==========================================
+ Coverage   68.38%   68.47%   +0.08%     
==========================================
  Files         624      625       +1     
  Lines       63442    63614     +172     
==========================================
+ Hits        43386    43560     +174     
+ Misses      16818    16811       -7     
- Partials     3238     3243       +5     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@github-actions
Copy link
Copy Markdown
Contributor

✅ Large PR justification has been provided. The size review has been dismissed and this PR can now proceed with normal review.

@github-actions github-actions Bot dismissed their stale review May 20, 2026 21:55

Large PR justification has been provided. Thank you!

@github-actions github-actions Bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels May 20, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/XL Extra large PR: 1000+ lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

OAuth monitor gives up on transient failures, leaving workloads dead

1 participant