fix(cli-auth): atomic writes + observable failures on PAT rotation#23
Open
fix(cli-auth): atomic writes + observable failures on PAT rotation#23
Conversation
Hermes was returning 403 ("Invalid access token") on the first call after
a PAT rotation, then succeeding on retry. Two reasons:
1. update_cli_tokens() rewrote each agent's config file with a bare
open(path, "w"), creating a window where a concurrent Hermes
invocation could read a half-written api_key line. Hermes is exposed
to this because it re-reads ~/.hermes/config.yaml on every call;
Claude/Codex/Gemini cache the token in env at process startup.
2. Every write path silently swallowed OSError, so an actual write
failure (perms, locked file, ENOSPC) would leave the config stale
forever with no log line — the user just saw 403s.
Adds _atomic_write_text() helper (write to .tmp, os.replace) used by
all five _update_* functions. Replaces silent except OSError: pass with
logger.warning at WARNING level. FileNotFoundError still silenced via an
explicit os.path.exists() guard so the rotator doesn't spam during the
brief window between app start and setup script completion.
Co-authored-by: Isaac
Collaborator
Author
|
@datasciencemonkey — flagging for review. P1: causes intermittent Hermes 403s after PAT rotation, with no log to debug from (silent |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Priority
P1 — intermittent 403s after every PAT rotation. Hermes (which re-reads
~/.hermes/config.yamlon each invocation) returnsHTTP 403: Invalid access tokenon the first call after a rotation, then succeeds on retry. The same race exists for all five_update_*functions incli_auth.py; Hermes just exposes it most often because of its read-on-every-invocation pattern.Summary
Fix for #22. Hermes was 403'ing on the first call after a PAT rotation and recovering on retry. Two reasons inside
cli_auth.py:update_cli_tokens()rewrote each agent's config with a bareopen(path, "w")— Hermes (which re-reads~/.hermes/config.yamlon every call) could read a half-written file. The other agents only cache the token at process startup so they don't observe this race in-process, but the bug exists for all five.except OSError: pass. A real write failure (perms, locked file, ENOSPC) would leave the config stale forever with zero logs.Changes (
cli_auth.py)_atomic_write_text(path, content)helper — write to<path>.tmp, thenos.replace(). POSIX rename is atomic, so a concurrent Hermes invocation sees either the old token whole or the new token whole, never a partial state._update_*functions now use the helper.except OSError: pass→except OSError as e: logger.warning(...). Real failures now show up in app logs.os.path.exists(path)guards so the rotator stays quiet during the window between container start and setup-script completion (which is the legitimate "file doesn't exist yet" case the original silentpasswas protecting).What this fixes / what it doesn't
✅ Hermes 403 caused by reading mid-rotation half-written
config.yaml.✅ Silent stuck-token state if a write fails for any reason (perms, lock, disk full).
❌ Not the "rotator stops when sessions hit zero" timing bug. If you start a new session >15 min after the last one was reaped, the env-baked PAT has expired and the rotator hasn't yet woken — atomic writes don't help. Tracking as out-of-scope follow-up in #22.
❌ Not the "in-process token cache goes stale in long Claude / Codex / Gemini sessions" issue. Different problem (in-process vs on-disk); fix would need agent-side reload, not config-side rewrite.
Test plan
python3 -m py_compile cli_auth.py— passes locally~/.hermes/config.yamlread-only withchmod 444and trigger rotation — should now log a warning instead of silently swallowingCloses #22
This pull request and its description were written by Isaac.
Test Evidence (verified on the live deployment 2026-05-06)
User reproduced the symptom on a deployed CODA app:
Then on retry the same
hermes chatworked fine. That retry-fixes-it pattern is the signature of a stale-on-disk token — once the rotator finishes writing the new PAT, subsequent reads succeed.The PAT itself is good — confirmed via two independent paths:
So workspace, PAT, URL, model — all good. The 403 is purely a window where Hermes reads a half-written
api_keyline from~/.hermes/config.yaml. The user confirmed the retry pattern in chat: "I tested and sonnet-4-6 works in Claude Code, so the token is valid, might be something with Hermes" — and on subsequent attempt the call succeeded.The fix uses
os.replace()for atomic file swap (POSIX-guaranteed atomic rename within the same FS, which<path>.tmpalways is since it's a sibling), so concurrent readers see either the old token whole or the new token whole — never a partial state.