Skip to content

fix(delegate): per-room CAS keys to stop multi-tab room loss (#345)#347

Merged
sanity merged 15 commits into
mainfrom
fix-delegate-cas-rooms-clobber
Jun 5, 2026
Merged

fix(delegate): per-room CAS keys to stop multi-tab room loss (#345)#347
sanity merged 15 commits into
mainfrom
fix-delegate-cas-rooms-clobber

Conversation

@sanity
Copy link
Copy Markdown
Contributor

@sanity sanity commented Jun 4, 2026

Problem

Multi-tab data loss (#345): accepting an invite to a new room in one browser tab
left the room missing from other tabs, and it was lost entirely on refresh. Root
cause (confirmed byte-level on technic): the chat delegate stored the entire
room list in a single rooms_data blob, written with a blind last-writer-wins
overwrite. Each tab has its own in-memory ROOMS, so across tabs the last save
clobbered the others — a stale tab silently erased a room a fresh tab had just
added.

Approach

Two layers, both in this PR:

1. A compare-and-swap primitive in the chat-delegate KV store. New wire
variants GetVersionedRequest/CasStoreRequest (+ GetVersionedResponse/
CasStoreResponse{CasStoreResult}), appended to the externally-tagged enums so
old clients never receive them. Every value is wrapped in a versioned envelope
[tag=1][generation u64 LE][data]; plain Store/Get transparently wrap/unwrap,
CAS stores only when the expected generation matches. Pure, unit-tested logic in
delegates/chat-delegate/src/versioning.rs. Delegate WASM rebuilt → V24
migration entry in legacy_delegates.toml.

2. Per-room keys (realizes #65). Each room is its own
room:<base58(owner_vk)> delegate key holding a RoomSlot::{Present|Tombstone},
CAS-versioned independently; list-level view prefs (current room, notification
modes, room order) live in a single rooms_meta key. The single rooms_data
blob is no longer written (still read once for migration). A save for room X
never touches room Y's key, so the clobber is structurally impossible.

  • Save (do_save_rooms_to_delegate): content-diffs each room and CAS-writes
    only changed slots; reconcile_room_present/_tombstone/_meta are the
    read-merge-write reconcilers; present_action is the pure rejoin-vs-leave
    decision. Per-room error isolation (one room's CAS exhaustion doesn't abort the
    pass).
  • Load (response_handler.rs): fire_list_rooms_request (a ListRequest)
    load_rooms_per_room plain-GETs each slot + rooms_meta, reconstructs
    Rooms, and hydrates via the shared hydrate_loaded_rooms. Plain-GET
    correlation is distinct from the save's CAS correlation, so a concurrent
    save can't steal a load's response slot.
  • Rejoin vs leave (round-9): an explicit rejoin (mark_room_rejoined at the
    invitation-accept and identity-import sites) overwrites a remote Tombstone;
    a background content update to a room left elsewhere adopts the leave. The
    rejoin flag is consumed once the rejoin is persisted, and cleared on leave, so
    a stale flag can't resurrect a room left in another tab.

Transparent, non-destructive migration. On the V24 upgrade the new delegate
key is empty → load falls through to the legacy-delegate probe and recovers the
old single rooms_data blob, re-saving it as per-room keys (the old blob is left
in place as a rollback fallback). Because per-room keys are dynamic,
fire_legacy_migration_request also fires a ListRequest to each legacy
delegate so a future delegate-WASM bump can recover the per-room keys the
now-legacy delegate holds (migrate_legacy_per_room) — without this, the next
bump would strand every room.

Testing

All native unit tests green (cargo test -p river-ui --bins 336, chat-delegate
40, river-core chat_delegate 19); cargo fmt + clippy clean.

  • CAS loop (stores/abort/conflict-retry/exhaustion/failed + wrote-bool), envelope
    encode/decode, generation arithmetic, handler dispatch.
  • present_action truth table + reconcile_room_present at byte level (absent,
    tombstone-background-adopts-leave, tombstone-explicit-rejoin-overwrites,
    diverged-identity-keeps-local, unparseable-errs).
  • plan_load_from_keys (all branches), reconstruct_rooms, RoomSlot/
    RoomsMeta CBOR round-trips, single-blob→per-room explosion+reconstruct
    round-trip (the on-node upgrade format contract).
  • Rejoin flag set/clear + a source-grep pin on the mark_room_rejoined /
    clear_room_rejoined wiring; parse_room_storage_key round-trip/negatives;
    correlation-key divergence (load vs save).

Reviewed by a multi-model pass (4 perspectives + external codex); findings
addressed in the follow-up commit (legacy per-room migration, rejoin/cache
correctness, per-room error isolation, invitation_secrets union, doc updates).

Still TODO before merge/deploy: two-tab Playwright on technic + an on-node
upgrade check (load the new UI against a node holding old single-blob state →
rooms reappear), and re-confirming deploy authorization (scope grew from
single-blob to per-room).

Closes #345. Realizes #65.

[AI-assisted - Claude]

sanity added 4 commits June 3, 2026 21:14
…elegate KV store

The chat delegate's key-value store was a blind last-writer-wins overwrite,
so two browser tabs each persisting their full room-list snapshot would
clobber each other — a stale tab could silently destroy a room a newer tab
just accepted (#345).

Add a generic optimistic-concurrency primitive:
- New wire variants GetVersionedRequest / CasStoreRequest and
  GetVersionedResponse / CasStoreResponse{CasStoreResult} (appended; ciborium
  tags externally by name so old delegate WASM is unaffected).
- Every KV value is wrapped in an internal [tag][generation][data] envelope;
  plain Get/Store transparently unwrap/wrap it (back-compat, stays
  generation-consistent). CasStore only writes when expected_generation
  matches, else returns the current generation+value to merge and retry.
- All generation/CAS logic lives in pure functions (delegate native tests
  no-op the secret store) with full unit coverage, plus handler-dispatch and
  CBOR wire round-trip tests.

This is the delegate half of the fix; the client CAS-write of rooms_data
follows. Delegate WASM changes here require a migration entry before publish.

Refs #345

[AI-assisted - Claude]
Replace the blind full-blob StoreRequest in do_save_rooms_to_delegate with a
CasStoreRequest loop keyed on the rooms_data generation (#345).
On a conflict the loop folds the conflicting remote snapshot into the local
working copy via the tombstone-aware Rooms::merge (#247) and retries, so a
stale tab can no longer overwrite a room another tab just accepted — both
tabs' rooms survive.

- ROOMS_GENERATION tracks the last-known generation; learned lazily (first
  save of a session may take one self-correcting conflict+merge+retry).
  Capturing it at load time comes with the per-room key split (PR2).
- fold_conflicting_rooms is a free function with unit coverage: tombstones
  from both sides survive, this tab's current-room selection is preserved,
  and an unparseable remote snapshot errors rather than clobbering.
- response_handler routes the new CasStore/GetVersioned responses through the
  pending-request registry and handles them in the processing match.

Refs #345

[AI-assisted - Claude]
…345)

The CAS primitive changes the chat-delegate WASM and therefore its delegate
key. Record the previous (currently-deployed) delegate as legacy V24 so users'
existing rooms/signing keys migrate forward instead of being lost.

- V24 delegate_key 1ec6b3d1… == the deployed delegate 358xw7Ld… (verified by
  base58 decode), confirming the committed WASM is the production delegate.
- New delegate WASM: 904f76ff… -> 8bb5d06f…. Room-contract WASM intentionally
  untouched (room-contract does not link chat_delegate; rebuilt only the
  delegate to avoid a spurious contract-key change).
- Validated: cargo make check-migration, cargo test -p river-core --test
  migration_test.

Refs #345

[AI-assisted - Claude]
…CAS correlation keys; test the CAS loop

Multi-model review of #347 found one critical bug and several should-fixes:

- [P1, Codex — critical data loss] After a CAS conflict the loop folded the
  other tab's rooms into `working` and stored the union, but only updated the
  generation — the in-memory ROOMS signal never got the merged rooms, so the
  NEXT save snapshotted stale ROOMS and overwrote them, re-deleting the
  just-merged rooms. Fix: refactor the loop into a testable `run_rooms_cas_loop`
  returning `CasSaveOutcome { stored, generation, merged_remote }`; when
  merged_remote, defer a ROOMS.merge(stored) write-back (Dioxus-safe). Bonus:
  the other tab's room now appears without a reload.

- [M1, skeptical] A plain GetResponse for "rooms_data" (cold-start load /
  legacy probe) could steal an in-flight CAS save's pending-request slot
  (shared correlation key), failing that save. Give CAS/GetVersioned ops
  DISTINCT correlation keys (cas_store_correlation_key /
  get_versioned_correlation_key) used by both get_request_key and the
  response handler.

- [M3 / testing] The CAS loop's conflict/exhaustion/error paths were untested.
  Added run_rooms_cas_loop tests (stored-first, conflict→fold→retry carries
  both writers' state, 8-attempt exhaustion, Failed propagation) and a
  populated-map Rooms::merge test pinning the #345 headline (two distinct
  rooms both survive).

- [docs] P2 (delete resets generation — unreachable in PR1; documented for
  PR2), envelope-tag CBOR-collision invariant, M2 (un-mergeable remote aborts
  save intentionally — refuse to clobber), migrated_rooms/current_room_key not
  folded, and corrected the coalesce_save "will retry" comment.

Verified false-positive (dropped): big-picture's base58 `358wv7Kc…` — the
recorded key re-encodes to `358xw7Ld…` (matches the deployed delegate).

Tests: chat-delegate 39, river-core chat_delegate 19, river-ui --bins 304, all
green. Clippy clean (changed files). Delegate WASM unchanged (comment-only
delegate edits; committed 8bb5d06f stands).

Refs #345

[AI-assisted - Claude]
Copy link
Copy Markdown
Contributor Author

@sanity sanity left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comprehensive PR Review: #347

Summary

  • PR Title: fix(delegate): compare-and-swap rooms_data to stop multi-tab room loss (#345)
  • Type: fix (delegate WASM + wire format + concurrency + data migration)
  • CI Status: was green on the reviewed commit 1acac35c; re-running on the fix commit 31dd7ba1
  • Linked Issues: #345 (closes); relates to #65 (PR2), #307 (CLI analog)
  • Review tier: Full (high-risk surfaces: concurrency, wire format/serialization, delegate WASM, data migration)
  • Reviewers run: four Freenet perspectives (code-first, testing, skeptical, big-picture) + external Codex (non-Claude). Findings synthesized, every cited file:line verified.

This review found one critical data-loss bug and several should-fixes; all are addressed in 31dd7ba1. A re-review on the new HEAD follows.


Findings and resolutions

Must Fix (Blocking) — addressed

  1. [Codex P1 — critical] Merged rooms were not written back to the ROOMS signal after a conflict-resolved store. After a CAS conflict the loop folded the other tab's rooms into the local working copy and stored the union, but the success path only updated ROOMS_GENERATION. The in-memory ROOMS never received the merged rooms, so the next save snapshotted the stale ROOMS (now with the correct generation, so no conflict) and overwrote the delegate — silently re-deleting the just-merged rooms. This defeated the fix in the multi-save case. Verified against chat_delegate.rs:1633-1647.
    • Fix (31dd7ba1): refactored the loop into a testable run_rooms_cas_loop returning CasSaveOutcome { stored, generation, merged_remote }; when merged_remote, do_save_rooms_to_delegate defers a ROOMS.merge(stored) write-back (Dioxus-safe). Side benefit: the other tab's room now appears without a reload.

Should Fix — addressed

  1. [skeptical M1] Pending-request slot theft. A plain GetResponse for rooms_data (cold-start load / legacy probe) shared the "rooms_data" correlation key with an in-flight CAS save and could complete that save's oneshot with the wrong response, failing the save (and coalesce_save does not auto-retry a failed save). Pre-existing mechanism, but the CAS rewrite made it a hard mid-loop error.
    • Fix: CAS/GetVersioned ops now use distinct correlation keys (cas_store_correlation_key / get_versioned_correlation_key), applied in both get_request_key and the response-handler routing.
  2. [testing / skeptical M3 / code-first] CAS loop conflict/exhaustion/error paths untested.
    • Fix: added run_rooms_cas_loop tests (stored-on-first-attempt, conflict→fold→retry carrying both writers' state at the adopted generation, 8-attempt exhaustion error, Failed propagation) and a populated-map Rooms::merge test pinning the #345 headline (two distinct rooms both survive).

Consider / Documented (not code-blocking)

  1. [Codex P2] Delete resets a key's generation (ABA). Not reachable in PR1 (no CAS-tracked key is ever deleted). Documented in versioning.rs and the delete handler; the proper handling belongs in PR2 (where per-room-key leave-deletes appear, guarded by the removed_rooms tombstone).
  2. [skeptical M2] An un-mergeable remote aborts the whole save. This is intentional and safe — refusing to clobber data we can't parse/merge is preferable to overwriting it; in-memory rooms are untouched and the next mutation re-attempts. Documented at the fold site. Per-room isolation comes with PR2.
  3. [code-first] migrated_rooms/current_room_key are intentionally not folded from the remote (transient/per-tab; merge re-derives upgrade pointers). Noted in fold_conflicting_rooms. Also tightened the envelope-tag invariant comment (real CBOR payloads start 0xA0..0xBF, never the 0x01 tag) and corrected the misleading coalesce_save "will retry" comment.

Dropped — false positive

  • [big-picture] PR description base58 typo (358wv7Kc…). Verified incorrect: the recorded V24 hex re-encodes to 358xw7Ld…, which matches the deployed delegate. The PR text was already correct.

Verified safe (by skeptical + code-first, independently confirmed)

  • Envelope defensive-decode cannot misread real CBOR data; migration old→new delegate is correct (V24 delegate_key = BLAKE3(code_hash) and equals the deployed delegate by base58 decode); plain Get/Store envelope back-compat keeps the outbound-DM cache working; ROOMS_GENERATION is single-flight under coalesce_save; the CAS loop never mutates the ROOMS signal mid-loop (Dioxus-safe); riverctl is decoupled (own rooms.json). Room-contract WASM untouched.

Testing

Test Type Status Notes
Unit Pure CAS/envelope (versioning.rs), handler dispatch, CBOR wire round-trips, fold_conflicting_rooms, run_rooms_cas_loop, Rooms::merge distinct-rooms
Integration ⚠️ Native delegate set_secret/get_secret are no-ops, so end-to-end persistence isn't host-testable; invariants pinned at the pure layer
Simulation N/A
E2E Two-tab live-node Playwright verification to run before deploy (incl. outbound-DM envelope round-trip + legacy-migration re-save)

Verdict

State: Needs Changes — Re-review Required After Fix → fixes applied in 31dd7ba1; re-review running on the new HEAD.
HEAD SHA reviewed: 1acac35c (original); fixes on 31dd7ba1.

[AI-assisted - Claude]

sanity added 4 commits June 3, 2026 22:05
…tion keys

Re-review (Codex + skeptical, independently) found that the round-1 P1 fix was
incomplete: the merged-rooms write-back into ROOMS is deferred (setTimeout 0)
for Dioxus signal-safety, so a coalesce_save catch-up iteration that re-runs
do_save before the defer fires snapshots a still-stale ROOMS at the now-current
generation, stores without conflict, and re-deletes the just-merged rooms — the
same data-loss class in a tighter window.

Fix: a PENDING_MERGED_ROOMS holding cell (a plain thread_local RefCell, NOT a
Dioxus signal) set SYNCHRONOUSLY when a conflict merges remote rooms, and folded
into EVERY save's working snapshot. All rooms-saves are serialized by the
coalesce mutex, so a value set before one save returns is always visible to the
next — correctness no longer depends on the deferred write-back's timing (which
becomes pure UI convenience + cell cleanup, as the skeptical reviewer
recommended). The deferred drain reads the cell's current (latest cumulative)
value to avoid a multi-conflict gap.

Also addresses the testing reviewer's coverage ask:
- cas_correlation_keys_diverge_from_plain_storage_keys + _round_trip_request_to_
  response pin the M1 fix (distinct slots; request/response symmetry).
- rooms_cbor / outbound_dm_store envelope-tag-collision tests pin the documented
  decode invariant (real CBOR maps start 0xA0..0xBF, never ENVELOPE_TAG 0x01).

Tests: chat-delegate 40, river-ui --bins 307, all green. Clippy clean.

Refs #345

[AI-assisted - Claude]
Re-review round 3 (Codex) found a third data-loss regression: the CAS conflict
merge uses Rooms::merge, which unions tombstones, so it resurrects a stale
removed_rooms entry over an explicit REJOIN. When a user rejoins a previously
left room (clears its tombstone, re-adds it), the first save after a reload —
where lazy generation forces a conflict against the delegate's own pre-rejoin
blob — re-removes the just-rejoined room. The current blind-overwrite code does
NOT have this bug, so the CAS rewrite introduced it.

Fix: add Rooms::merge_reconciling (next to Rooms::merge), which drops from the
other side's removed_rooms any room THIS side currently holds in its map before
merging — local map membership wins over a stale remote tombstone. Local
tombstones still block re-adds (#247 leave-direction preserved) and shared room
state still CRDT-merges. Both fold sites (the CAS conflict fold and the
PENDING_MERGED_ROOMS fold) route through it. Net: conflict resolution is
local-authoritative ('keep my decisions, absorb the rooms I'm missing') —
strictly better than blind overwrite, which kept local decisions but had no
anti-clobber for additions.

Trade-off (documented): a room another tab left isn't removed from a tab still
holding it open; genuine per-room leave/rejoin versioning belongs to PR2.

Tests: merge_reconciling_preserves_local_rejoin_over_stale_tombstone (with a
precondition asserting plain merge WOULD drop it) and
merge_reconciling_keeps_local_leave_sticky (#247 not weakened). river-ui --bins
309 green, clippy clean.

Refs #345

[AI-assisted - Claude]
… for persistence

Re-review round 4 (Codex + skeptical) found the deferred ROOMS write-back was
the source of three issues: it inserted a delegate-deserialized room into the
live ROOMS signal that (a) lacked rehydrated secrets/actions_state — private
rooms couldn't decrypt (Codex P1), (b) was never mark_needs_sync'd so it never
subscribed and stayed stale (Codex P2), and (c) used plain merge so it could
still resurrect a stale tombstone over a concurrent rejoin (Codex P2 /
skeptical H1).

All three stem from surfacing the other tab's room LIVE — which is PR2's scope.
Remove the write-back entirely. Persistence correctness already rests on the
PENDING_MERGED_ROOMS holding cell (set synchronously on a merged conflict,
folded into every save via merge_reconciling under the coalesce mutex), so a
later/coalesce-catch-up save can't drop the merged rooms. The merged room then
reappears FULLY HYDRATED via the normal load path on next reload — the
documented 'fixed after refresh' PR1 behavior. The cell is per-session
(thread_local), so it resets on reload after hydration.

This is strictly safer than skeptical's narrower suggestion (switch the
write-back to merge_reconciling), which would fix H1 but leave Codex's
under-hydrated/unsynced-room P1/P2 in place.

Also corrected stale write-back references in comments/docs.

Tests: river-ui --bins 309 green; clippy clean.

Refs #345

[AI-assisted - Claude]
Round-5 review (both reviewers clean — no data-loss windows). Two non-blocking
doc follow-ups:

- Codex P2: document that merge_reconciling CRDT-merges only room_state for a
  room on both sides; other per-room scalars (last_read_message_id,
  self_nickname, invitation_secrets) keep the local value. This is NOT a
  regression — the prior blind overwrite reverted every room's scalars on a
  stale save, so CAS is no worse and strictly better elsewhere. Proper
  per-field reconciliation lands with the per-room key split (PR2). Tracked
  on #345.
- Skeptical L1: correct the CasSaveOutcome.stored doc, which still claimed the
  snapshot 'must be reflected back into ROOMS' — exactly the write-back round-4
  proved harmful and round-5 removed. Prevents a future maintainer from
  restoring it.

Doc-only; no behavior change.

Refs #345

[AI-assisted - Claude]
Copy link
Copy Markdown
Contributor Author

@sanity sanity left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comprehensive PR Review: #347 — Final (converged)

Review tier: Full (concurrency + wire format + delegate WASM + data migration).
Reviewers: four Freenet perspectives + external Codex, run iteratively across 6 commits as fixes landed.
HEAD reviewed: clean pass on fabdf21d; current HEAD 357d6b23 is doc-only on top of it.

Outcome

The iterative multi-model review found and fixed four distinct data-loss / data-corruption bugs that the first implementation missed — most surfaced by the external Codex pass, whose blind spots don't correlate with the (Claude-authored) code:

  1. Merged rooms never written back → next save re-deleted them. (Codex round 1)
  2. Deferred write-back didn't cover the coalesce catch-up iteration → same loss, tighter window. (Codex + skeptical round 2) → fixed with the synchronous PENDING_MERGED_ROOMS holding cell folded into every save.
  3. Stale tombstone resurrected over an explicit rejoin → a rejoined room re-removed on first save after reload. (Codex round 3) → fixed with Rooms::merge_reconciling (local map membership wins over a stale remote tombstone; #247 leave-stickiness preserved).
  4. The write-back surfaced under-hydrated, unsubscribed rooms and could still undo a rejoin. (Codex + skeptical round 4) → removed the write-back entirely; persistence now rests solely on the holding cell, and the merged room reappears fully hydrated via the load path on reload (the documented "fixed after refresh" PR1 behavior; live cross-tab surfacing is PR2's scope).

Plus M1 (a plain GetResponse could steal an in-flight CAS save's response slot → distinct correlation keys).

Final verdict

No data-loss windows remain in the reviewed paths (both round-5 reviewers concur). The CAS conflict resolution is local-authoritative ("keep my decisions, absorb the rooms I'm missing") — strictly better than the prior blind overwrite, which it replaces. All findings are fixed or have a justified dismissal.

Remaining items — non-blocking, deferred to PR2 (per-room key split), tracked on #345:

  • Per-room scalar metadata (last_read_message_id, self_nickname, invitation_secrets) keeps the local value on a same-room conflict — not a regression (the prior blind overwrite reverted every room's scalars on a stale save; CAS is no worse and better elsewhere). Documented on Rooms::merge_reconciling.
  • A CAS-only-merged room's contract-WASM-upgrade pointer is deferred to reload (pre-existing, rare path).

Testing

  • Unit: pure CAS/envelope (versioning.rs), handler dispatch, CBOR wire round-trips (incl. legacy-variant pin), correlation-key divergence + round-trip, run_rooms_cas_loop (conflict→fold→retry, exhaustion, failure), merge_reconciling (rejoin preserved with a plain-merge-drops-it precondition; leave stays sticky), distinct-rooms union, envelope-tag CBOR-collision. chat-delegate 40, river-core 210, river-ui --bins 309 — all green.
  • Migration: check-delegate-migration + check-room-contract-migration green on HEAD; V24 delegate_key verified == the deployed delegate by base58 decode.
  • E2E: two-tab live-node Playwright verification (incl. outbound-DM envelope round-trip + legacy-migration re-save) to run before deploy.

Verdict

State: Ready to Merge once CI is green on the exact current HEAD (357d6b23).
Deploy note: delegate-WASM change → publish via cargo make publish-all from main after merge, gated on green CI at the merge commit + the E2E pass. PR2 (per-room key split, #65) follows after PR1 merges.

[AI-assisted - Claude]

sanity and others added 6 commits June 4, 2026 12:47
…346)

main merged #346 (drag-and-drop room reordering), which added a room_order
field to Rooms. Merged origin/main into the branch (auto-merged, no conflicts)
and added room_order: Vec::new() to the 10 Rooms test literals this PR
introduced, which predated the field. Production code was unaffected (it never
constructs Rooms literals); merge_reconciling delegates to Rooms::merge, which
already handles room_order (unions, prunes to map), so no logic change.

Delegate WASM and legacy_delegates.toml untouched by the merge — V24 migration
intact. Tests: river-ui --bins 318, chat-delegate 40, river-core 186 green.

Refs #345
Codex post-merge review (P1): the PENDING_MERGED_ROOMS fold in
do_save_rooms_to_delegate logged-and-continued on a merge_reconciling error,
then CAS-stored rooms_clone WITHOUT the pending rooms — silently deleting the
concurrent-tab rooms the holding cell exists to preserve. Reachable when the
same room was rejoined locally with a different self_sk (merge returns
'self_sk is different').

Match the conflict-fold's behavior: propagate the error (?) and abort the save
instead of persisting a snapshot that drops pending rooms. The cell stays set,
in-memory rooms are untouched, and the next save (or reload, which resets the
per-session cell) retries — no silent data loss. Moved the fold out of the
value-block to function scope so ? applies.

Tests: river-ui --bins 318 green.

Refs #345

[AI-assisted - Claude]
…rite CAS (#345)

The cached-generation + PENDING_MERGED_ROOMS approach to the rooms CAS save
generated a long tail of data-loss edge cases in review (missing write-back,
coalesce-catch-up race, rejoin resurrection, under-hydrated/unsynced write-back
rooms, and most recently a pending-only room overriding a legitimate remote
leave). The root cause: trying to AVOID re-reading the delegate on the
no-conflict path forced a side-cache (PENDING) of rooms that live on the
delegate but not in the ROOMS signal, and reconciling that cache against
later remote changes kept producing corner cases.

Replace it with plain optimistic read-modify-write, the textbook pattern:
do_save now, per attempt, GetVersioned (read current value + generation) ->
merge_reconciling(local ROOMS, delegate state) -> CasStore at the read
generation; a Conflict re-reads and retries. This removes ROOMS_GENERATION,
PENDING_MERGED_ROOMS, the holding cell, the write-back, and CasSaveOutcome.

Why it's correct with no side-cache:
- A room another tab added lives on the delegate, so every save re-reads and
  re-includes it (never dropped) — fixes the round-1/2 loss without PENDING.
- A subsequent remote leave is honored: merge_reconciling keeps the delegate's
  tombstone for rooms this tab doesn't explicitly hold (fixes the round-8
  remote-leave resurrection — a pending-only room can no longer look 'locally
  authoritative').
- A local rejoin still wins (local map membership beats a stale delegate
  tombstone), with no lazy-generation false conflict (we read the fresh
  generation each save) — round-3 stays fixed.
- We still never write delegate-deserialized (under-hydrated) rooms into the
  ROOMS signal; merged rooms reappear fully hydrated via the load path on
  reload (round-4 stays fixed). Live cross-tab surfacing remains PR2's scope.

Cost: one extra ~337KB delegate read per save (local IPC, coalesced to <=2 per
burst — not a #246-class burst); PR2's per-room keys shrink it. GetVersioned
uses its distinct correlation key (M1), so it can't collide with the load
path's plain GetRequest.

run_rooms_cas_loop now takes injected get_versioned + cas_store closures;
rewrote its tests (empty-store, delegate-state-merged, retry-on-conflict,
exhaustion, failed, unparseable-delegate-aborts). Tests: river-ui --bins 320,
chat-delegate 40, river-core 186 green; clippy clean; delegate WASM + V24
migration unchanged.

Refs #345

[AI-assisted - Claude]
Completes the per-room storage split: rooms are loaded from independent
room:<base58(vk)> slot keys (+ rooms_meta) instead of the single
rooms_data blob.

- startup fires ListRequest (was GetRequest{rooms_data}); the ListResponse
  arm spawns load_rooms_per_room, which plain-GETs each slot (distinct
  correlation from the save CAS, so an in-flight save can't steal the
  load's response), reconstructs Rooms (Present->map, Tombstone->
  removed_rooms, apply_meta), and hydrates.
- extract the post-deserialize hydrate block into a shared free fn
  hydrate_loaded_rooms() used by both the single-blob GetResponse arm and
  the per-room load.
- NON-DESTRUCTIVE migration: a single rooms_data blob (legacy delegate, or
  a stray current-delegate blob) is read, hydrated, then re-saved as
  per-room keys; the blob is left in place as a rollback fallback. Per-room
  keys take priority over a stray blob so migration never re-triggers.
- wire mark_room_rejoined() at the two explicit-rejoin tombstone-clear
  sites (members import-identity, get_response invitation-accept) so an
  explicit rejoin overwrites a remote Tombstone slot rather than adopting
  the leave (round-9).
- remove now-dead Rooms::merge_reconciling (+ its tests); the per-room key
  split it was a stopgap for now exists.
- tests: plan_load_from_keys, reconstruct_rooms, RoomSlot/RoomsMeta
  round-trips.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
P1 (codex/big-picture) — legacy migration now per-room aware:
- fire_legacy_migration_request also fires a fire-and-forget ListRequest to
  each legacy delegate; a legacy ListResponse routes to migrate_legacy_per_room,
  which GETs the dynamic room:<vk> slots from the legacy delegate (new
  send_delegate_request_to) and re-saves them to the current delegate. Without
  this the NEXT delegate-WASM bump after per-room shipped would strand all rooms.

MAJOR (code-first/skeptical/big-picture) — rejoin/cache correctness:
- cas_write_key/cas_write_delegate_key return Result<bool> (wrote vs aborted).
- save loop records Tombstone (not a phantom Present(h)) when a reconcile aborts
  to adopt a remote leave, and clears REJOINED_THIS_SESSION once a rejoin is
  actually persisted as Present — so a later cross-tab leave + background update
  adopts the leave instead of resurrecting on a stale flag.
- mark_room_rejoined clears the room's ROOM_SLOT_STATE entry; new
  clear_room_rejoined wired at the leave site.

MINOR:
- do_save accumulates per-room errors and continues (one room's CAS exhaustion
  no longer aborts the whole pass).
- reconcile_room_present unions invitation_secrets (don't drop a private-room
  secret on a cross-tab merge).
- clarifying comment on concurrent-load single-waiter behavior; documented the
  load-side Rooms::merge abort-on-mismatch limitation.

Tests: reconcile_room_present (5 cases incl. M1 diverged-identity + unparseable),
rejoin flag set/clear + wiring pin, parse_room_storage_key round-trip/negatives,
single-blob->per-room explosion+reconstruct round-trip, cas_write_key wrote-bool.

Docs: river-publish.md + direct-messages.md updated for ListRequest load flow and
the fixed-vs-dynamic new-storage-key carve-out. Stale past-due TODOs corrected.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@sanity sanity changed the title fix(delegate): compare-and-swap rooms_data to stop multi-tab room loss (#345) fix(delegate): per-room CAS keys to stop multi-tab room loss (#345) Jun 5, 2026
BLOCKER (code-first/skeptical) — early mark-done stranded per-room data:
- a legacy delegate returning no rooms_data blob no longer marks legacy
  migration done. Its dynamic room:<vk> keys may still need migrating via the
  List-driven migrate_legacy_per_room; an early mark-done would permanently
  strand them on any transient migration failure (the V24->V25 data-loss).
  Done-marking is now owned solely by the successful-migration paths
  (hydrate_loaded_rooms legacy re-save + the current PerRoom load).

MAJOR/P1 (codex/skeptical) — legacy correlation collision:
- requests to a non-current delegate now register under a delegate-scoped
  correlation key (legacy_scoped_correlation); the response side rebuilds the
  same scoped key for is_legacy responses. Two legacy delegates (or a reconnect
  racing current vs legacy) holding the same room:<vk> can no longer collide on
  the single-waiter PENDING_REQUESTS slot and migrate stale data. The
  current-delegate path keeps the bare correlation (unchanged, incl. signing).

MINOR: documented that a load-side merge Err also skips secret repopulation for
the rooms that did merge.

Tests: invitation_secrets union (matching-identity path), legacy_scoped_correlation
per-delegate, and a source-pin that the empty-rooms_data legacy branch never marks
migration done.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@sanity sanity merged commit ee7a1cc into main Jun 5, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Multi-tab room loss: chat delegate rooms_data is a blind full-blob overwrite (last-write-wins across tabs)

1 participant