Hermes 403 on first call after PAT rotation — non-atomic config writes + silent failures

## Symptom

Running `hermes chat` against a deployed CODA app returns **`HTTP 403: Invalid access token`** on the first call after a PAT rotation, then succeeds on retry. Captured trace:

```
⚠️  API call failed (attempt 1/3): PermissionDeniedError [HTTP 403]
   🌐 Endpoint: https://adb-<workspace>.azuredatabricks.net/serving-endpoints/
   📝 Error: HTTP 403: Invalid access token
```

## What I confirmed empirically

- Workspace + PAT are good — `databricks current-user me` succeeds; `databricks serving-endpoints list` returns 7 endpoints; opus-4-6 is `ready=READY`.
- A fresh PAT against `{host}/serving-endpoints/chat/completions` (the URL Hermes constructs) returns **200** with a valid model response. So the URL pattern is correct and the endpoint serves OpenAI-style requests.
- Claude Code with the same workspace's PAT works fine (sonnet-4-6 via `/anthropic`).
- This is **not** a Geo Designated Services block — same PAT can call the same endpoint via curl.

## Root cause

`cli_auth.py:update_cli_tokens()` is called by `pat_rotator._persist_token()` after every rotation. Each `_update_*` function does a **non-atomic** read-modify-write:

```python
with open(path) as f:
    content = f.read()
new_content = ... # regex-replace api_key
if new_content != content:
    with open(path, "w") as f:           # window starts here
        f.write(new_content)             # window closes here
```

Hermes specifically re-reads `~/.hermes/config.yaml` on every invocation. `open(path, "w")` truncates the file before any new bytes land, so if the rotator is mid-write when Hermes opens the file, Hermes sees an empty or partially written config instead of the new token, and the call fails with 403. Other CLIs (Claude / Codex / Gemini / OpenCode) read the token into their process env at startup and don't re-read, so they never observe the partial-write state. They also never benefit from rotation within a long-running process, which is a separate bug.
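The window can be reproduced in a single process. The sketch below is illustrative only: `read_during_truncating_write` is a hypothetical helper (not part of `cli_auth.py`) that opens a second handle on the file between the truncating `open(path, "w")` and the write, which is roughly what a Hermes invocation would do if it landed inside the window.

```python
import os
import tempfile


def read_during_truncating_write(path: str, new_content: str) -> str:
    """Return what a reader sees after open(path, "w") but before the write.

    Hypothetical helper for illustration; it simulates a concurrent reader
    by opening a second handle inside the writer's `with` block.
    """
    with open(path, "w") as writer:      # file is truncated to 0 bytes here
        with open(path) as reader:       # simulated concurrent reader
            seen = reader.read()
        writer.write(new_content)        # new bytes only arrive now
    return seen


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp_dir:
        cfg = os.path.join(tmp_dir, "config.yaml")
        with open(cfg, "w") as f:
            f.write("api_key: old-token\n")
        seen = read_during_truncating_write(cfg, "api_key: new-token\n")
        print(repr(seen))  # '' : neither the old nor the new token is visible
```

The reader gets an empty string rather than the old token, which matches the "empty" state described above.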

Compounding: every `_update_*` swallows `OSError` silently. If a write actually fails (perms, locked file, disk full), the file stays stale forever and the user just gets 403s with no log line to debug from.

## Fix (PR coming)

Adds `_atomic_write_text()`: write to `<path>.tmp`, then `os.replace()` it over the original. The temp file sits next to the target, so the rename stays on one filesystem. POSIX rename is atomic, so concurrent readers see either the old file whole or the new file whole, never a partial state.

Replaces silent `except OSError: pass` with `logger.warning(...)`. The "file doesn't exist yet" case (rotator firing during the window between app start and setup-script completion) is handled by an explicit `os.path.exists()` guard so it stays quiet.
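Roughly, one update function could then take the shape below. This is a sketch under assumptions: `update_api_key`, the `api_key:` regex, and the log message are hypothetical stand-ins for the real `_update_*` functions, and the temp-file write is inlined only to keep the example self-contained.

```python
import logging
import os
import re

logger = logging.getLogger(__name__)

# Assumed config shape: a YAML-style `api_key: <value>` line.
_API_KEY_RE = re.compile(r"^([ \t]*api_key:[ \t]*).*$", re.MULTILINE)


def update_api_key(path: str, token: str) -> bool:
    """Replace the api_key value in `path`. Return True if the file changed."""
    if not os.path.exists(path):
        # Expected while the setup script has not created the file yet,
        # so this case stays quiet.
        return False
    try:
        with open(path, encoding="utf-8") as f:
            content = f.read()
        # A function replacement avoids backslash-escape handling in `token`.
        new_content = _API_KEY_RE.sub(lambda m: m.group(1) + token, content)
        if new_content == content:
            return False
        tmp_path = f"{path}.tmp"
        with open(tmp_path, "w", encoding="utf-8") as f:
            f.write(new_content)
        os.replace(tmp_path, path)
        return True
    except OSError as exc:
        # Previously swallowed; now the failure leaves a trace to debug from.
        logger.warning("Failed to update token in %s: %s", path, exc)
        return False
```

The boolean return value is an addition for the example; it makes the three outcomes (missing file, no change, rewritten) easy to tell apart in a test.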

Applied to all 5 update functions, not just Hermes — the same race exists in `_update_claude` / `_update_opencode` and the dotenv helper. Hermes just exposed it first because of its read-on-every-call invocation pattern.

## Out of scope (worth a follow-up)

- **PAT rotator stops when active sessions hit zero**. If a session is reaped and a new one starts >15 min later, PAT_v1 has expired but the rotator hasn't yet woken to mint v2. First Hermes call still fails — atomic write doesn't help. Fix is to trigger an immediate `_persist_token()` on session creation. Filing as separate issue if this turns out to also be hit in practice.
- **Other agents don't refresh in-process**. Claude / Codex / Gemini / OpenCode read the token at process startup. If you stay in a Claude Code session for >15 min, it's using a stale token and would 401 on its next call. Different problem from this issue (token-in-process vs token-on-disk); fixing here would be scope creep.
