Skip to content

Bind vMCP sessions to OIDC identity, not raw token bytes#5378

Open
jhrozek wants to merge 4 commits into
mainfrom
worktree-session-binding
Open

Bind vMCP sessions to OIDC identity, not raw token bytes#5378
jhrozek wants to merge 4 commits into
mainfrom
worktree-session-binding

Conversation

@jhrozek
Copy link
Copy Markdown
Contributor

@jhrozek jhrozek commented May 24, 2026

Summary

vMCP's session-hijack-prevention decorator hashed the bearer token at session
creation and rejected requests whose token hashed differently. Every legitimate
OAuth refresh produces new token bytes for the same (iss, sub) identity, so
the decorator misclassified each refresh as a hijack and tore the session down
with Unauthorized: caller identity does not match session owner. Users were
forced to re-authenticate on every access-token rotation — minutes in production
deployments using short ATs.

  • Replace HMAC token-hash binding with a stable (iss, sub) tuple extracted
    from OIDC Claims, stored under MetadataKeyIdentityBinding. Format-and-parse
    encapsulated in a new leaf package pkg/vmcp/session/binding/.
  • Validation now reads the caller's claims the same way at the JWT and
    introspection code paths, so token rotation produces identical bindings.
  • Session-upgrade defense (anonymous session, caller presents a token) preserved
    via the unauthenticated sentinel branch.
  • RestoreSession reconstructs identity from the stored binding instead of a
    separate identity-subject key. Sessions written under the legacy token-hash
    schema return transportsession.ErrSessionNotFound (forced re-init), avoiding
    the rebind-on-first-use window that would re-introduce the hijack vector.
  • HMAC plumbing (WithHMACSecret, salt generation, hex hashing) and its k8s
    secret requirement removed end-to-end.

Closes #5306

Type of change

  • Bug fix

Test plan

  • Unit tests (task test) — full vmcp tree green
  • Linting (task lint-fix) — 0 issues
  • Manual testing — kind cluster + reproducer at `deploy/session-debug/06-repro-session-bind.py`:
    • Pre-patch: HTTP 401 `caller identity does not match session owner` after
      OAuth refresh; vMCP logs `WARN reason=token_hash_mismatch`.
    • Post-patch: HTTP 200 with real tool result; vMCP logs show zero
      `identity_binding_mismatch` warnings on the refresh path. Bug trigger
      condition (`same sub=True, same tsid=True, same bytes=False`) met and
      request succeeded.

API Compatibility

  • This PR does not break the `v1beta1` API.

Does this introduce a user-facing change?

Yes. Two operator-relevant behaviors:

  1. All existing vMCP sessions invalidate on upgrade across this version.
    The session-binding storage key changes from `vmcp.token.hash` to
    `vmcp.identity.binding`. Sessions written by the previous version return
    `ErrSessionNotFound` on first read by the new version — clients re-initialise
    transparently (one forced re-auth pulse during the rollout window).
    Operators wanting a zero-disruption upgrade should use a blue/green deploy
    or a maintenance window. Mid-rollout (mixed old/new pods) is safe in both
    directions: sessions invalidate symmetrically, never corrupt.

  2. The `--vmcp-session-hmac-secret` flag / `VMCP_SESSION_HMAC_SECRET` env
    var are no longer used.
    Both are accepted silently (DEBUG log only) for
    one release to ease migration; remove from your deployment manifests at
    convenience.

Implementation plan

Approved implementation plan

Design RFC: THV-0073 — vMCP Session Binding by Identity Tuple

Commit-by-commit shape on this branch:

  1. `Add pkg/vmcp/session/binding leaf package` — format owner
    (`Format` / `Parse` / `IsUnauthenticated` + `UnauthenticatedSentinel`).
  2. `Add MetadataKeyIdentityBinding key and tighten ShouldAllowAnonymous` —
    the new metadata key + tightened anonymous-classification logic that
    reads `iss`/`sub` from `Claims` (fail-closed on non-string claims).
  3. `Bind sessions to (iss, sub), not raw token bytes` — the core decorator
    rewrite (`security.go`), factory wiring (`factory.go`), and Phase-2
    marker swap in `sessionmanager`. Bundled because the production code
    doesn't compile cleanly across them in isolation.
  4. `Drop HMAC plumbing from serve.go; document Redis trust boundary` —
    removes the now-unused k8s secret requirement and adds operator
    guidance on Redis access control.

Special notes for reviewers

  • Threat model: `pkg/vmcp/session/internal/security/security.go` package
    doc states the trust boundary — the decorator trusts that
    `identity.Claims["iss"]` / `Claims["sub"]` came from a validated token.
    Token validation happens upstream in the OIDC middleware; this layer does
    not re-validate.
  • `ConstantTimeCompare` length leakage at `security.go:165` is deliberate
    and documented inline — both `iss` (issuer URL, public via discovery) and
    `sub` (per-issuer opaque identifier) are not secret.
  • Inherited limitation (not introduced here, worth knowing): restored
    sessions reconstruct `*auth.Identity` with empty `Token` and nil
    `UpstreamTokens`. Outgoing-auth strategies that need a live token
    (`tokenexchange`, `aws_sts`, `upstream_inject`) fail backend init on
    restore and the affected backends are skipped — same behavior as the
    pre-patch code (HMAC binding also never persisted the token). Tracked
    separately; not in scope for this PR.
  • PR scope: ~25 files across production + tests is above the usual
    10-file guideline. Bundled because the binding-format change is one
    logical refactor that doesn't decompose into smaller PRs that build
    standalone (the security decorator and the factory wiring are tightly
    coupled around the on-the-wire format).

🤖 Generated with Claude Code

jhrozek and others added 4 commits May 24, 2026 21:18
The vMCP session-binding format needs to be parsed and produced in two
places: pkg/vmcp/session/types (from ShouldAllowAnonymous) and
pkg/vmcp/session/internal/security (from BindSession / validateCaller).
A private helper in each would drift over time, so this commit
introduces a single-owner leaf package.

Format encodes a bound identity as iss + "\x00" + sub, rejecting empty
halves and stray NULs. Parse mirrors Format strictly — including
rejecting trailing NULs in the sub half so values that did not pass
through Format (e.g. direct writes to Redis) fail loudly. A literal
"unauthenticated" sentinel covers sessions created without an
authenticated identity.

No callers wired yet — those land in the next commits.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The hijack-prevention decorator needs a stable per-identity key in
session metadata. Add MetadataKeyIdentityBinding alongside the existing
token-hash keys (the legacy keys stay temporarily so already-running
sessions can be invalidated cleanly during the migration; final removal
happens in the operator-side follow-up).

ShouldAllowAnonymous treated every empty-token identity as anonymous,
which lumped LocalUserMiddleware identities into the same equivalence
class even when they carried distinct (iss, sub) claims — letting one
local user reuse another's session ID. Tighten the rule so any identity
with a valid (iss, sub) pair from Claims goes through the bound path,
even when Token is empty. Pull iss and sub from Identity.Claims rather
than Identity.Subject so the introspection path and the JWT path
canonicalize against the same source.

Fail-closed on non-string iss/sub claims: a misbehaving validator that
stores a numeric or array value is treated as bound (not anonymous),
with a WARN logged so the misbehavior surfaces to operators.

The new key is unused until the security and factory layers are wired in
the next commit.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The hijack-prevention decorator hashed the incoming bearer token at
session creation and rejected any subsequent request whose token hashed
differently. Every legitimate OAuth refresh produces a new access token
with different bytes — same identity, same iss, same sub — so the
decorator misclassified the refresh as a hijack and terminated the
session with Unauthorized: caller identity does not match session owner.

Drop the HMAC plumbing entirely and bind to a stable (iss, sub) tuple
extracted from the OIDC identity's Claims. The binding lives in session
metadata under MetadataKeyIdentityBinding, written exclusively by the
new BindSession constructor (renamed from PreventSessionHijacking).
Validation reads the caller's claims through the same path so the JWT
and introspection code paths canonicalize against the same source. The
session-upgrade defense (anonymous session, caller presents a token)
moves to the unauthenticated-sentinel branch and works exactly as
before.

RestoreSession reconstructs identity from the stored binding rather
than from a separate identity-subject key. Sessions written under the
legacy token-hash schema return the bare transportsession.ErrSessionNotFound
sentinel so the client receives the standard "re-initialise" signal one
forced re-auth at deploy is preferable to a rebind-on-first-use window
that would re-introduce the hijack the check exists to block.

Downstream callers of the removed WithHMACSecret option (cli/serve.go)
and the Phase-2 marker switch (sessionmanager) follow in the next
commits; tests catch up after that.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Use MetadataKeyIdentityBinding as the Phase-2 marker

BindSession writes MetadataKeyIdentityBinding on every successful
session creation (sentinel for anonymous, real binding for
authenticated), so its presence is the new way to distinguish a
fully-initialised Phase-2 session from a Generate()-only placeholder.
Without this swap, the now-unwritten MetadataKeyTokenHash would always
read as absent and Terminate would take the placeholder path on every
session, breaking termination semantics.

The behaviour for legacy sessions still in Redis from before the
migration is intentional and documented in the plan: Terminate writes
the placeholder-terminated marker (Update with MetadataKeyTerminated)
which sits for the TTL; loadSession returns
transportsession.ErrSessionNotFound so the client transparently
re-initialises.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
createSessionFactory no longer takes an HMAC secret or a Kubernetes
detection flag. VMCP_SESSION_HMAC_SECRET stays readable from the
environment for one deploy cycle (DEBUG-logged and ignored) so the
operator-side env-var injection can be removed in a follow-up PR
without forcing a coordinated cut-over.

Add a startup WARN when incoming auth is configured as "anonymous":
AnonymousMiddleware populates the same (iss, sub) for every request,
so all callers collide on one identity binding and per-identity
hijack prevention degrades to dev-only behaviour. Surface that to
operators rather than letting them assume the binding still scopes
per user.

Document the trust boundary in docs/arch/13-vmcp-scalability.md: the
new scheme stores plaintext (iss, sub) at rest in Redis/Valkey,
which trades the HMAC's at-rest opacity for refresh correctness.
Operators must layer Redis ACLs or NetworkPolicies if a Redis dump
revealing identity is unacceptable.

Add a TODO on extractBindingID for the forward-looking RFC 7662 case
(probe IdP introspection responses for iss + sub) so it surfaces if
that becomes a top-level incoming-auth type.

The five HMAC-specific serve_test.go cases collapse to one
table-driven test of the new signature; the removed cases targeted
HMAC validation that no longer exists.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@github-actions github-actions Bot added the size/XL Extra large PR: 1000+ lines changed label May 24, 2026
Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Large PR Detected

This PR exceeds 1000 lines of changes and requires justification before it can be reviewed.

How to unblock this PR:

Add a section to your PR description with the following format:

## Large PR Justification

[Explain why this PR must be large, such as:]
- Generated code that cannot be split
- Large refactoring that must be atomic
- Multiple related changes that would break if separated
- Migration or data transformation

Alternative:

Consider splitting this PR into smaller, focused changes (< 1000 lines each) for easier review and reduced risk.

See our Contributing Guidelines for more details.


This review will be automatically dismissed once you add the justification section.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 24, 2026

Codecov Report

❌ Patch coverage is 85.21127% with 21 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.73%. Comparing base (08baf2d) to head (3fdfd9b).

Files with missing lines Patch % Lines
pkg/vmcp/cli/serve.go 21.42% 11 Missing ⚠️
pkg/vmcp/session/internal/security/security.go 89.06% 5 Missing and 2 partials ⚠️
pkg/vmcp/session/factory.go 88.00% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5378      +/-   ##
==========================================
- Coverage   68.74%   68.73%   -0.02%     
==========================================
  Files         626      627       +1     
  Lines       63496    63498       +2     
==========================================
- Hits        43650    43645       -5     
- Misses      16602    16608       +6     
- Partials     3244     3245       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/XL Extra large PR: 1000+ lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

vMCP terminates MCP session on every legitimate access-token refresh

1 participant