Skip to content

feat(auth): persist OAuth state to disk to survive restarts#5

Open
tricamtech wants to merge 1 commit into
thebackpackdevorg:mainfrom
tricamtech:feat/oauth-state-persistence
Open

feat(auth): persist OAuth state to disk to survive restarts#5
tricamtech wants to merge 1 commit into
thebackpackdevorg:mainfrom
tricamtech:feat/oauth-state-persistence

Conversation

@tricamtech
Copy link
Copy Markdown

Problem

SimpleOAuthProvider stores all of its state in in-memory dicts:

self._clients: dict[str, OAuthClientInformationFull] = {}
self._access_tokens: dict[str, AccessToken] = {}
self._refresh_tokens: dict[str, RefreshToken] = {}
self._auth_codes: dict[str, AuthorizationCode] = {}
self._pending: dict[str, dict] = {}

Restart wipes all of them. The current docstring is upfront about this:

Stores clients, authorization codes, and tokens in memory.
Tokens survive until server restart (acceptable for personal use).

That's fine for the simple "single Claude Desktop on my laptop" use case. But for the deployment shape the README's "Remote deployment" section explicitly recommends — vault-mcp running as a long-lived systemd service behind a reverse proxy, with claude.ai web / Claude desktop / multiple Claude Code clients all connected through dynamically-registered OAuth — a single restart silently invalidates every client's stored bearer token. The next request from each client returns 401, and the user has to walk through the PIN approval flow again, per client.

Even worse: if the restart happens with an approval page open in a browser tab, the new process has no _pending entry for that request_id, so the user gets:

Invalid or expired authorization request.

…with no clue that the entire state was wiped, not just their pending request. (I hit this several times today running the public instance behind a Cloudflare Tunnel — every PIN rotation, every config change, every accidental crash required re-authing in claude.ai's connector dialog.)

Fix

Add opt-in disk persistence for the long-lived state:

  • New constructor arg: state_path: Path | None = None
  • New env var: OAUTH_STATE_PATH
  • When set, the provider loads _clients, _access_tokens, and _refresh_tokens from a JSON file on init, and writes them back atomically after every mutation.

What is persisted

  • Registered DCR clients (_clients)
  • Issued access tokens (_access_tokens)
  • Issued refresh tokens (_refresh_tokens)

What is NOT persisted (intentional)

  • The per-process cf_bypass_token — regenerated each start so it can never be exfiltrated from a stale state file. The save filter if k != self._cf_token excludes it explicitly.
  • _auth_codes — short-lived (5 minute expiry by design), no value in persisting.
  • _pending — even shorter-lived (in-flight authorization requests), no value in persisting; surviving across restart could even be confusing (user clicks Approve on a request_id that no longer matches anything sane).

Implementation details

  • Atomic writes via tempfile.mkstemp + os.replace so a crash mid-write can't corrupt the state file
  • File mode 0600 (root-only readable, like /root/.env)
  • asyncio.Lock serializes concurrent saves
  • Actual file I/O dispatched to the default executor to avoid blocking the event loop
  • Save errors logged + swallowed — we never want a transient disk issue to block an auth flow
  • Load errors fall back to empty state with a logged warning, so a corrupted state file doesn't take down the server
  • Refresh-token rotation now correctly drops the OLD refresh token from the dict (it wasn't being cleaned up before, just orphaned; now it's removed and the save reflects that)
  • version: 1 field in the state file for future schema migrations

Verification

End-to-end on a live server:

$ OAUTH_STATE_PATH=/tmp/state.json OAUTH_PIN=test1234 \
    OAUTH_ISSUER_URL=http://127.0.0.1:8795 \
    VAULT_PATH=/tmp/vault SERVER_HOST=127.0.0.1 SERVER_PORT=8795 \
    python -m vault_mcp.server &
... INFO OAuth provider initialized — state path: /tmp/state.json,
         loaded 0 clients / 0 access tokens / 0 refresh tokens

$ curl -X POST http://127.0.0.1:8795/register \
    -H 'Content-Type: application/json' \
    -d '{"client_name":"test","redirect_uris":["http://localhost/cb"]}'
→ client_id: 567926af-12f6-4efc-823d-7eeea5390bee

$ ls -la /tmp/state.json
-rw------- 1 root root 719 ... /tmp/state.json
$ jq '.clients | keys' /tmp/state.json
["567926af-12f6-4efc-823d-7eeea5390bee"]

# Kill and restart
$ kill %1 && python -m vault_mcp.server &
... INFO OAuth provider initialized — state path: /tmp/state.json,
         loaded 1 clients / 0 access tokens / 0 refresh tokens
                ^^^^^^^^^

# Register a second client → both are now persisted
$ curl -X POST http://127.0.0.1:8795/register ...
$ jq '.clients | keys' /tmp/state.json
["567926af-12f6-4efc-823d-7eeea5390bee", "bff56b80-02bc-4f3a-8377-4b96c9c7013a"]

The same flow with real access/refresh tokens (after walking the full Authorization Code grant via the browser) preserves token validity across restart — connected clients don't need to re-auth after a service restart.

Backwards compatibility

state_path defaults to None. Without OAUTH_STATE_PATH set in the environment, behavior is byte-identical to before this commit — all state in memory, wiped on restart, no disk I/O, no new files. Existing deployments are unaffected.

Notes

Sister to #1 (vault_reindex guard), #2 (MCP_ALLOWED_HOSTS), #3 (PIN field maxlength), and #4 (watchfiles auto-reindex). Together, these five PRs make the "vault-mcp behind a tunnel for cross-device access" deployment shape (which the README documents) actually work without papercuts.

The SimpleOAuthProvider stores everything in in-memory dicts:

  self._clients: dict[str, OAuthClientInformationFull] = {}
  self._access_tokens: dict[str, AccessToken] = {}
  self._refresh_tokens: dict[str, RefreshToken] = {}
  self._auth_codes: dict[str, AuthorizationCode] = {}
  self._pending: dict[str, dict] = {}

Restart wipes all of them. For deployments where the only consumer is
a single Claude Desktop instance running on the same machine that's
restarted manually, this is fine — that's the use case the docstring
explicitly calls out:

  Stores clients, authorization codes, and tokens in memory.
  Tokens survive until server restart (acceptable for personal use).

But for the deployment shape the README's "Remote deployment" section
recommends — vault-mcp running as a long-lived systemd service behind
a reverse proxy, with claude.ai web / Claude desktop / multiple Claude
Code clients all connected through dynamically-registered OAuth — a
restart silently invalidates every client's stored bearer token. The
next request from each client returns 401, and the user has to walk
through the PIN approval flow again per-client. Even worse: if the
restart happens with an approval page open in a tab, the new process
has no _pending entry for the request_id, so the user gets "Invalid
or expired authorization request" with no clue that the entire
state was wiped, not just their pending request.

I hit this several times today running the public instance behind a
Cloudflare Tunnel for cross-device access. Every PIN rotation, every
config change requiring `systemctl restart`, every accidental crash
required re-authing in claude.ai's connector dialog.

## Fix

Add optional disk persistence for the long-lived state:

- New constructor arg: `state_path: Path | None = None`
- New env var:        `OAUTH_STATE_PATH`
- When set, the provider loads `_clients`, `_access_tokens`, and
  `_refresh_tokens` from a JSON file on init, and writes them back
  after every mutation.

What is persisted:
- Registered DCR clients (`_clients`)
- Issued access tokens (`_access_tokens`)
- Issued refresh tokens (`_refresh_tokens`)

What is NOT persisted (intentionally):
- The per-process `cf_bypass_token` — regenerated each start so it
  can never be exfiltrated from a stale state file. The save filter
  `if k != self._cf_token` excludes it explicitly.
- `_auth_codes` — short-lived (5 minute expiry by design), no value
  in persisting
- `_pending` — even shorter-lived (in-flight authorization requests),
  no value in persisting; surviving across restart could even be
  confusing (user clicks Approve on a half-dead request_id)

Implementation details:

- Atomic writes via `tempfile.mkstemp` + `os.replace` so a crash
  mid-write can't corrupt the state file
- File mode 0600 (root-only readable, like /root/.env)
- `asyncio.Lock` (`_save_lock`) serializes concurrent saves
- Actual file I/O dispatched to the default executor to avoid
  blocking the event loop
- Save errors are logged via `logger.exception` and swallowed —
  we never want a transient disk issue to block an auth flow
- Load errors fall back to empty state with a logged warning,
  so a corrupted state file doesn't take down the server
- Refresh-token rotation now correctly drops the OLD refresh token
  from the dict (it wasn't being cleaned up before, just orphaned;
  now it's removed and the save reflects that)
- `version: 1` field in the state file for future schema migrations

## Verification

End-to-end on a live server:

  $ OAUTH_STATE_PATH=/tmp/state.json OAUTH_PIN=test1234 ... \
      python -m vault_mcp.server &
  ... INFO OAuth provider initialized — state path: /tmp/state.json,
           loaded 0 clients / 0 access tokens / 0 refresh tokens

  $ curl -X POST http://127.0.0.1:8795/register \
      -d '{"client_name":"test","redirect_uris":["http://localhost/cb"]}'
  → client_id: 567926af-...

  $ ls -la /tmp/state.json
  -rw------- 1 root root 719 ... /tmp/state.json
  $ jq '.clients | keys' /tmp/state.json
  ["567926af-..."]

  # Kill and restart
  $ kill %1 && python -m vault_mcp.server &
  ... INFO OAuth provider initialized — state path: /tmp/state.json,
           loaded 1 clients / 0 access tokens / 0 refresh tokens
                  ^^^^^^^^^

  # Register a second client → both are now persisted
  $ curl -X POST http://127.0.0.1:8795/register ...
  $ jq '.clients | keys' /tmp/state.json
  ["567926af-...", "bff56b80-..."]

The same flow with real access/refresh tokens (after walking the
full Authorization Code grant via the browser) preserves token
validity across restart — the connecting client doesn't need to
re-auth after a service restart.

## Backwards compatibility

`state_path` defaults to `None`. Without `OAUTH_STATE_PATH` set in
the environment, behavior is byte-identical to before this commit:
all state in memory, wiped on restart, no disk I/O, no new files.
Existing deployments are unaffected.

Sister to thebackpackdevorg#1 (vault_reindex guard), thebackpackdevorg#2 (MCP_ALLOWED_HOSTS),
thebackpackdevorg#3 (PIN field maxlength), and thebackpackdevorg#4 (watchfiles auto-reindex).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant