fix(cli-auth): atomic writes + observable failures on PAT rotation #23

Open
dgokeeffe wants to merge 1 commit into main from fix/cli-auth-rotation-race


@dgokeeffe dgokeeffe commented May 6, 2026

Priority

P1 — intermittent 403s after every PAT rotation. Hermes (which re-reads ~/.hermes/config.yaml on each invocation) returns HTTP 403: Invalid access token on the first call after a rotation, then succeeds on retry. The same race exists for all five _update_* functions in cli_auth.py; Hermes just exposes it most often because of its read-on-every-invocation pattern.


Summary

Fix for #22. Hermes was 403'ing on the first call after a PAT rotation and recovering on retry. Two reasons inside cli_auth.py:

  1. Non-atomic writes. update_cli_tokens() rewrote each agent's config with a bare open(path, "w") — Hermes (which re-reads ~/.hermes/config.yaml on every call) could read a half-written file. The other agents only cache the token at process startup so they don't observe this race in-process, but the bug exists for all five.
  2. Silent failures. Every write path ended in except OSError: pass. A real write failure (perms, locked file, ENOSPC) would leave the config stale forever with zero logs.
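To make the first reason concrete, here is a minimal sketch of the window a bare open(path, "w") creates. The file contents and the observe_midwrite name are made up for illustration; the real config layout is not shown in this PR. The inner read stands in for a concurrent Hermes invocation.

```python
import os
import tempfile

# Hypothetical config contents, for illustration only.
OLD = "api_key: old-token\n"
NEW = "api_key: new-token\n"


def observe_midwrite(path: str) -> str:
    """Return what a reader sees after the truncate but before the write lands."""
    with open(path, "w") as f:        # mode "w" truncates the file immediately
        with open(path) as reader:    # stands in for a concurrent Hermes read
            seen = reader.read()
        f.write(NEW)                  # the new token only arrives after this
    return seen


with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "config.yaml")
    with open(path, "w") as f:
        f.write(OLD)
    seen = observe_midwrite(path)
    with open(path) as f:
        final = f.read()

print(repr(seen), repr(final))
```

The reader gets neither the old token nor the new one, which is consistent with the fail-once-then-recover pattern described above.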

Changes (cli_auth.py)

  • Added _atomic_write_text(path, content) helper — write to <path>.tmp, then os.replace(). POSIX rename is atomic, so a concurrent Hermes invocation sees either the old token whole or the new token whole, never a partial state.
  • All 5 _update_* functions now use the helper.
  • except OSError: pass → except OSError as e: logger.warning(...). Real failures now show up in app logs.
  • Added explicit os.path.exists(path) guards so the rotator stays quiet during the window between container start and setup-script completion (which is the legitimate "file doesn't exist yet" case the original silent pass was protecting).

What this fixes / what it doesn't

✅ Hermes 403 caused by reading mid-rotation half-written config.yaml.
✅ Silent stuck-token state if a write fails for any reason (perms, lock, disk full).

Not the "rotator stops when sessions hit zero" timing bug. If you start a new session >15 min after the last one was reaped, the env-baked PAT has expired and the rotator hasn't yet woken — atomic writes don't help. Tracking as out-of-scope follow-up in #22.
Not the "in-process token cache goes stale in long Claude / Codex / Gemini sessions" issue. Different problem (in-process vs on-disk); fix would need agent-side reload, not config-side rewrite.

Test plan

  • python3 -m py_compile cli_auth.py — passes locally
  • Deploy a fresh CODA, force a PAT rotation (or wait 10 min), invoke Hermes — should not 403
  • Make ~/.hermes/config.yaml read-only with chmod 444 and trigger rotation — should now log a warning instead of silently swallowing
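One caveat on the last step, offered as something to verify rather than a confirmed problem: on POSIX, renaming over a file is governed by the permissions of the parent directory, not of the target file. If the helper writes a sibling .tmp and then calls os.replace(), chmod 444 on config.yaml alone may not produce an OSError. The sketch below uses a temporary directory and made-up contents in place of ~/.hermes/config.yaml.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "config.yaml")  # stand-in for ~/.hermes/config.yaml
    with open(path, "w") as f:
        f.write("api_key: old\n")
    os.chmod(path, 0o444)                  # read-only target, as in the test plan

    tmp = path + ".tmp"
    with open(tmp, "w") as f:              # new sibling file: needs directory write access
        f.write("api_key: new\n")
    os.replace(tmp, path)                  # POSIX: allowed despite the read-only target

    with open(path) as f:
        replaced = f.read()

print(repr(replaced))
```

If that holds on the deployment, making the ~/.hermes directory itself read-only (as a non-root user) is one way to force the warning path. Note also that the replaced file takes on the mode of the temp file rather than the original's.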

Closes #22

This pull request and its description were written by Isaac.


Test Evidence (verified on the live deployment 2026-05-06)

User reproduced the symptom on a deployed CODA app:

⚠️  API call failed: PermissionDeniedError [HTTP 403]
   📝 Error: HTTP 403: Invalid access token

Then on retry the same hermes chat worked fine. That retry-fixes-it pattern is the signature of a stale-on-disk token — once the rotator finishes writing the new PAT, subsequent reads succeed.

The PAT itself is good — confirmed via two independent paths:

$ databricks current-user me -p daveok
{"userName": "david.okeeffe@databricks.com", "active": true}

$ # Same PAT, direct OpenAI URL Hermes uses, against opus-4-6 (the fallback model)
$ curl -X POST "$HOST/serving-endpoints/chat/completions" \
       -H "Authorization: Bearer $PAT" -H "content-type: application/json" \
       -d '{"model":"databricks-claude-opus-4-6","messages":[{"role":"user","content":"hi"}]}'
{"model":"au.anthropic.claude-opus-4-6-v1","choices":[...]}
HTTP 200

So the workspace, PAT, URL, and model are all good. The 403 comes purely from a window in which Hermes reads a half-written api_key line from ~/.hermes/config.yaml. The user confirmed the retry pattern in chat: "I tested and sonnet-4-6 works in Claude Code, so the token is valid, might be something with Hermes", and on a subsequent attempt the call succeeded.

The fix uses os.replace() for atomic file swap (POSIX-guaranteed atomic rename within the same FS, which <path>.tmp always is since it's a sibling), so concurrent readers see either the old token whole or the new token whole — never a partial state.
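The old-or-new guarantee can be exercised locally with a small harness like the one below. It is a sketch under assumptions: the tokens are made up, atomic_write_text mirrors the helper's description rather than its actual code, and the check relies on POSIX rename semantics (on Windows a concurrent open during the replace may fail).

```python
import os
import tempfile
import threading

# Made-up tokens, long enough that a torn read would be plausible without the fix.
OLD = "api_key: " + "a" * 4096 + "\n"
NEW = "api_key: " + "b" * 4096 + "\n"


def atomic_write_text(path: str, content: str) -> None:
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(content)
    os.replace(tmp, path)


def run_check(rounds: int = 200) -> set:
    """Rotate between OLD and NEW while a reader thread records what it sees."""
    observed = set()
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "config.yaml")
        atomic_write_text(path, OLD)
        stop = threading.Event()

        def reader() -> None:
            while True:                       # always performs at least one read
                with open(path) as f:
                    observed.add(f.read())
                if stop.is_set():
                    break

        t = threading.Thread(target=reader)
        t.start()
        for i in range(rounds):
            atomic_write_text(path, NEW if i % 2 == 0 else OLD)
        stop.set()
        t.join()
    return observed


observed = run_check()
print(observed <= {OLD, NEW})
```

Swapping atomic_write_text for a bare open(path, "w") write is a quick way to see empty or partial reads appear in observed.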

Hermes was returning 403 ("Invalid access token") on the first call after
a PAT rotation, then succeeding on retry. Two reasons:

1. update_cli_tokens() rewrote each agent's config file with a bare
   open(path, "w"), creating a window where a concurrent Hermes
   invocation could read a half-written api_key line. Hermes is exposed
   to this because it re-reads ~/.hermes/config.yaml on every call;
   Claude/Codex/Gemini cache the token in env at process startup.
2. Every write path silently swallowed OSError, so an actual write
   failure (perms, locked file, ENOSPC) would leave the config stale
   forever with no log line — the user just saw 403s.

Adds _atomic_write_text() helper (write to .tmp, os.replace) used by
all five _update_* functions. Replaces silent except OSError: pass with
logger.warning at WARNING level. FileNotFoundError still silenced via an
explicit os.path.exists() guard so the rotator doesn't spam during the
brief window between app start and setup script completion.

Co-authored-by: Isaac
@dgokeeffe commented

@datasciencemonkey — flagging for review. P1: causes intermittent Hermes 403s after PAT rotation, with no log to debug from (silent except OSError: pass). Tiny one-file diff (cli_auth.py, ~40 lines), self-contained, doesn't depend on any other open PR. Evidence + reproduction in PR body.

Development

Successfully merging this pull request may close these issues.

Hermes 403 on first call after PAT rotation — non-atomic config writes + silent failures
