fix(delegate): per-room CAS keys to stop multi-tab room loss (#345)#347
Conversation
…elegate KV store The chat delegate's key-value store was a blind last-writer-wins overwrite, so two browser tabs each persisting their full room-list snapshot would clobber each other — a stale tab could silently destroy a room a newer tab just accepted (#345). Add a generic optimistic-concurrency primitive: - New wire variants GetVersionedRequest / CasStoreRequest and GetVersionedResponse / CasStoreResponse{CasStoreResult} (appended; ciborium tags externally by name so old delegate WASM is unaffected). - Every KV value is wrapped in an internal [tag][generation][data] envelope; plain Get/Store transparently unwrap/wrap it (back-compat, stays generation-consistent). CasStore only writes when expected_generation matches, else returns the current generation+value to merge and retry. - All generation/CAS logic lives in pure functions (delegate native tests no-op the secret store) with full unit coverage, plus handler-dispatch and CBOR wire round-trip tests. This is the delegate half of the fix; the client CAS-write of rooms_data follows. Delegate WASM changes here require a migration entry before publish. Refs #345 [AI-assisted - Claude]
Replace the blind full-blob StoreRequest in do_save_rooms_to_delegate with a CasStoreRequest loop keyed on the rooms_data generation (#345). On a conflict the loop folds the conflicting remote snapshot into the local working copy via the tombstone-aware Rooms::merge (#247) and retries, so a stale tab can no longer overwrite a room another tab just accepted — both tabs' rooms survive. - ROOMS_GENERATION tracks the last-known generation; learned lazily (first save of a session may take one self-correcting conflict+merge+retry). Capturing it at load time comes with the per-room key split (PR2). - fold_conflicting_rooms is a free function with unit coverage: tombstones from both sides survive, this tab's current-room selection is preserved, and an unparseable remote snapshot errors rather than clobbering. - response_handler routes the new CasStore/GetVersioned responses through the pending-request registry and handles them in the processing match. Refs #345 [AI-assisted - Claude]
…345) The CAS primitive changes the chat-delegate WASM and therefore its delegate key. Record the previous (currently-deployed) delegate as legacy V24 so users' existing rooms/signing keys migrate forward instead of being lost. - V24 delegate_key 1ec6b3d1… == the deployed delegate 358xw7Ld… (verified by base58 decode), confirming the committed WASM is the production delegate. - New delegate WASM: 904f76ff… -> 8bb5d06f…. Room-contract WASM intentionally untouched (room-contract does not link chat_delegate; rebuilt only the delegate to avoid a spurious contract-key change). - Validated: cargo make check-migration, cargo test -p river-core --test migration_test. Refs #345 [AI-assisted - Claude]
…CAS correlation keys; test the CAS loop Multi-model review of #347 found one critical bug and several should-fixes: - [P1, Codex — critical data loss] After a CAS conflict the loop folded the other tab's rooms into `working` and stored the union, but only updated the generation — the in-memory ROOMS signal never got the merged rooms, so the NEXT save snapshotted stale ROOMS and overwrote them, re-deleting the just-merged rooms. Fix: refactor the loop into a testable `run_rooms_cas_loop` returning `CasSaveOutcome { stored, generation, merged_remote }`; when merged_remote, defer a ROOMS.merge(stored) write-back (Dioxus-safe). Bonus: the other tab's room now appears without a reload. - [M1, skeptical] A plain GetResponse for "rooms_data" (cold-start load / legacy probe) could steal an in-flight CAS save's pending-request slot (shared correlation key), failing that save. Give CAS/GetVersioned ops DISTINCT correlation keys (cas_store_correlation_key / get_versioned_correlation_key) used by both get_request_key and the response handler. - [M3 / testing] The CAS loop's conflict/exhaustion/error paths were untested. Added run_rooms_cas_loop tests (stored-first, conflict→fold→retry carries both writers' state, 8-attempt exhaustion, Failed propagation) and a populated-map Rooms::merge test pinning the #345 headline (two distinct rooms both survive). - [docs] P2 (delete resets generation — unreachable in PR1; documented for PR2), envelope-tag CBOR-collision invariant, M2 (un-mergeable remote aborts save intentionally — refuse to clobber), migrated_rooms/current_room_key not folded, and corrected the coalesce_save "will retry" comment. Verified false-positive (dropped): big-picture's base58 `358wv7Kc…` — the recorded key re-encodes to `358xw7Ld…` (matches the deployed delegate). Tests: chat-delegate 39, river-core chat_delegate 19, river-ui --bins 304, all green. Clippy clean (changed files). Delegate WASM unchanged (comment-only delegate edits; committed 8bb5d06f stands). Refs #345 [AI-assisted - Claude]
sanity
left a comment
There was a problem hiding this comment.
Comprehensive PR Review: #347
Summary
- PR Title: fix(delegate): compare-and-swap rooms_data to stop multi-tab room loss (#345)
- Type: fix (delegate WASM + wire format + concurrency + data migration)
- CI Status: was green on the reviewed commit
1acac35c; re-running on the fix commit31dd7ba1 - Linked Issues: #345 (closes); relates to #65 (PR2), #307 (CLI analog)
- Review tier: Full (high-risk surfaces: concurrency, wire format/serialization, delegate WASM, data migration)
- Reviewers run: four Freenet perspectives (code-first, testing, skeptical, big-picture) + external Codex (non-Claude). Findings synthesized, every cited
file:lineverified.
This review found one critical data-loss bug and several should-fixes; all are addressed in 31dd7ba1. A re-review on the new HEAD follows.
Findings and resolutions
Must Fix (Blocking) — addressed
- [Codex P1 — critical] Merged rooms were not written back to the
ROOMSsignal after a conflict-resolved store. After a CAS conflict the loop folded the other tab's rooms into the localworkingcopy and stored the union, but the success path only updatedROOMS_GENERATION. The in-memoryROOMSnever received the merged rooms, so the next save snapshotted the staleROOMS(now with the correct generation, so no conflict) and overwrote the delegate — silently re-deleting the just-merged rooms. This defeated the fix in the multi-save case. Verified againstchat_delegate.rs:1633-1647.- Fix (
31dd7ba1): refactored the loop into a testablerun_rooms_cas_loopreturningCasSaveOutcome { stored, generation, merged_remote }; whenmerged_remote,do_save_rooms_to_delegatedefers aROOMS.merge(stored)write-back (Dioxus-safe). Side benefit: the other tab's room now appears without a reload.
- Fix (
Should Fix — addressed
- [skeptical M1] Pending-request slot theft. A plain
GetResponseforrooms_data(cold-start load / legacy probe) shared the"rooms_data"correlation key with an in-flight CAS save and could complete that save's oneshot with the wrong response, failing the save (andcoalesce_savedoes not auto-retry a failed save). Pre-existing mechanism, but the CAS rewrite made it a hard mid-loop error.- Fix: CAS/GetVersioned ops now use distinct correlation keys (
cas_store_correlation_key/get_versioned_correlation_key), applied in bothget_request_keyand the response-handler routing.
- Fix: CAS/GetVersioned ops now use distinct correlation keys (
- [testing / skeptical M3 / code-first] CAS loop conflict/exhaustion/error paths untested.
- Fix: added
run_rooms_cas_looptests (stored-on-first-attempt, conflict→fold→retry carrying both writers' state at the adopted generation, 8-attempt exhaustion error,Failedpropagation) and a populated-mapRooms::mergetest pinning the #345 headline (two distinct rooms both survive).
- Fix: added
Consider / Documented (not code-blocking)
- [Codex P2] Delete resets a key's generation (ABA). Not reachable in PR1 (no CAS-tracked key is ever deleted). Documented in
versioning.rsand the delete handler; the proper handling belongs in PR2 (where per-room-key leave-deletes appear, guarded by theremoved_roomstombstone). - [skeptical M2] An un-mergeable remote aborts the whole save. This is intentional and safe — refusing to clobber data we can't parse/merge is preferable to overwriting it; in-memory rooms are untouched and the next mutation re-attempts. Documented at the fold site. Per-room isolation comes with PR2.
- [code-first]
migrated_rooms/current_room_keyare intentionally not folded from the remote (transient/per-tab;mergere-derives upgrade pointers). Noted infold_conflicting_rooms. Also tightened the envelope-tag invariant comment (real CBOR payloads start0xA0..0xBF, never the0x01tag) and corrected the misleadingcoalesce_save"will retry" comment.
Dropped — false positive
- [big-picture] PR description base58 typo (
358wv7Kc…). Verified incorrect: the recorded V24 hex re-encodes to358xw7Ld…, which matches the deployed delegate. The PR text was already correct.
Verified safe (by skeptical + code-first, independently confirmed)
- Envelope defensive-decode cannot misread real CBOR data; migration old→new delegate is correct (V24
delegate_key=BLAKE3(code_hash)and equals the deployed delegate by base58 decode); plainGet/Storeenvelope back-compat keeps the outbound-DM cache working;ROOMS_GENERATIONis single-flight undercoalesce_save; the CAS loop never mutates theROOMSsignal mid-loop (Dioxus-safe); riverctl is decoupled (ownrooms.json). Room-contract WASM untouched.
Testing
| Test Type | Status | Notes |
|---|---|---|
| Unit | ✅ | Pure CAS/envelope (versioning.rs), handler dispatch, CBOR wire round-trips, fold_conflicting_rooms, run_rooms_cas_loop, Rooms::merge distinct-rooms |
| Integration | Native delegate set_secret/get_secret are no-ops, so end-to-end persistence isn't host-testable; invariants pinned at the pure layer |
|
| Simulation | N/A | |
| E2E | ⏳ | Two-tab live-node Playwright verification to run before deploy (incl. outbound-DM envelope round-trip + legacy-migration re-save) |
Verdict
State: Needs Changes — Re-review Required After Fix → fixes applied in 31dd7ba1; re-review running on the new HEAD.
HEAD SHA reviewed: 1acac35c (original); fixes on 31dd7ba1.
[AI-assisted - Claude]
…tion keys Re-review (Codex + skeptical, independently) found that the round-1 P1 fix was incomplete: the merged-rooms write-back into ROOMS is deferred (setTimeout 0) for Dioxus signal-safety, so a coalesce_save catch-up iteration that re-runs do_save before the defer fires snapshots a still-stale ROOMS at the now-current generation, stores without conflict, and re-deletes the just-merged rooms — the same data-loss class in a tighter window. Fix: a PENDING_MERGED_ROOMS holding cell (a plain thread_local RefCell, NOT a Dioxus signal) set SYNCHRONOUSLY when a conflict merges remote rooms, and folded into EVERY save's working snapshot. All rooms-saves are serialized by the coalesce mutex, so a value set before one save returns is always visible to the next — correctness no longer depends on the deferred write-back's timing (which becomes pure UI convenience + cell cleanup, as the skeptical reviewer recommended). The deferred drain reads the cell's current (latest cumulative) value to avoid a multi-conflict gap. Also addresses the testing reviewer's coverage ask: - cas_correlation_keys_diverge_from_plain_storage_keys + _round_trip_request_to_ response pin the M1 fix (distinct slots; request/response symmetry). - rooms_cbor / outbound_dm_store envelope-tag-collision tests pin the documented decode invariant (real CBOR maps start 0xA0..0xBF, never ENVELOPE_TAG 0x01). Tests: chat-delegate 40, river-ui --bins 307, all green. Clippy clean. Refs #345 [AI-assisted - Claude]
Re-review round 3 (Codex) found a third data-loss regression: the CAS conflict merge uses Rooms::merge, which unions tombstones, so it resurrects a stale removed_rooms entry over an explicit REJOIN. When a user rejoins a previously left room (clears its tombstone, re-adds it), the first save after a reload — where lazy generation forces a conflict against the delegate's own pre-rejoin blob — re-removes the just-rejoined room. The current blind-overwrite code does NOT have this bug, so the CAS rewrite introduced it. Fix: add Rooms::merge_reconciling (next to Rooms::merge), which drops from the other side's removed_rooms any room THIS side currently holds in its map before merging — local map membership wins over a stale remote tombstone. Local tombstones still block re-adds (#247 leave-direction preserved) and shared room state still CRDT-merges. Both fold sites (the CAS conflict fold and the PENDING_MERGED_ROOMS fold) route through it. Net: conflict resolution is local-authoritative ('keep my decisions, absorb the rooms I'm missing') — strictly better than blind overwrite, which kept local decisions but had no anti-clobber for additions. Trade-off (documented): a room another tab left isn't removed from a tab still holding it open; genuine per-room leave/rejoin versioning belongs to PR2. Tests: merge_reconciling_preserves_local_rejoin_over_stale_tombstone (with a precondition asserting plain merge WOULD drop it) and merge_reconciling_keeps_local_leave_sticky (#247 not weakened). river-ui --bins 309 green, clippy clean. Refs #345 [AI-assisted - Claude]
… for persistence Re-review round 4 (Codex + skeptical) found the deferred ROOMS write-back was the source of three issues: it inserted a delegate-deserialized room into the live ROOMS signal that (a) lacked rehydrated secrets/actions_state — private rooms couldn't decrypt (Codex P1), (b) was never mark_needs_sync'd so it never subscribed and stayed stale (Codex P2), and (c) used plain merge so it could still resurrect a stale tombstone over a concurrent rejoin (Codex P2 / skeptical H1). All three stem from surfacing the other tab's room LIVE — which is PR2's scope. Remove the write-back entirely. Persistence correctness already rests on the PENDING_MERGED_ROOMS holding cell (set synchronously on a merged conflict, folded into every save via merge_reconciling under the coalesce mutex), so a later/coalesce-catch-up save can't drop the merged rooms. The merged room then reappears FULLY HYDRATED via the normal load path on next reload — the documented 'fixed after refresh' PR1 behavior. The cell is per-session (thread_local), so it resets on reload after hydration. This is strictly safer than skeptical's narrower suggestion (switch the write-back to merge_reconciling), which would fix H1 but leave Codex's under-hydrated/unsynced-room P1/P2 in place. Also corrected stale write-back references in comments/docs. Tests: river-ui --bins 309 green; clippy clean. Refs #345 [AI-assisted - Claude]
Round-5 review (both reviewers clean — no data-loss windows). Two non-blocking doc follow-ups: - Codex P2: document that merge_reconciling CRDT-merges only room_state for a room on both sides; other per-room scalars (last_read_message_id, self_nickname, invitation_secrets) keep the local value. This is NOT a regression — the prior blind overwrite reverted every room's scalars on a stale save, so CAS is no worse and strictly better elsewhere. Proper per-field reconciliation lands with the per-room key split (PR2). Tracked on #345. - Skeptical L1: correct the CasSaveOutcome.stored doc, which still claimed the snapshot 'must be reflected back into ROOMS' — exactly the write-back round-4 proved harmful and round-5 removed. Prevents a future maintainer from restoring it. Doc-only; no behavior change. Refs #345 [AI-assisted - Claude]
sanity
left a comment
There was a problem hiding this comment.
Comprehensive PR Review: #347 — Final (converged)
Review tier: Full (concurrency + wire format + delegate WASM + data migration).
Reviewers: four Freenet perspectives + external Codex, run iteratively across 6 commits as fixes landed.
HEAD reviewed: clean pass on fabdf21d; current HEAD 357d6b23 is doc-only on top of it.
Outcome
The iterative multi-model review found and fixed four distinct data-loss / data-corruption bugs that the first implementation missed — most surfaced by the external Codex pass, whose blind spots don't correlate with the (Claude-authored) code:
- Merged rooms never written back → next save re-deleted them. (Codex round 1)
- Deferred write-back didn't cover the coalesce catch-up iteration → same loss, tighter window. (Codex + skeptical round 2) → fixed with the synchronous
PENDING_MERGED_ROOMSholding cell folded into every save. - Stale tombstone resurrected over an explicit rejoin → a rejoined room re-removed on first save after reload. (Codex round 3) → fixed with
Rooms::merge_reconciling(local map membership wins over a stale remote tombstone; #247 leave-stickiness preserved). - The write-back surfaced under-hydrated, unsubscribed rooms and could still undo a rejoin. (Codex + skeptical round 4) → removed the write-back entirely; persistence now rests solely on the holding cell, and the merged room reappears fully hydrated via the load path on reload (the documented "fixed after refresh" PR1 behavior; live cross-tab surfacing is PR2's scope).
Plus M1 (a plain GetResponse could steal an in-flight CAS save's response slot → distinct correlation keys).
Final verdict
No data-loss windows remain in the reviewed paths (both round-5 reviewers concur). The CAS conflict resolution is local-authoritative ("keep my decisions, absorb the rooms I'm missing") — strictly better than the prior blind overwrite, which it replaces. All findings are fixed or have a justified dismissal.
Remaining items — non-blocking, deferred to PR2 (per-room key split), tracked on #345:
- Per-room scalar metadata (
last_read_message_id,self_nickname,invitation_secrets) keeps the local value on a same-room conflict — not a regression (the prior blind overwrite reverted every room's scalars on a stale save; CAS is no worse and better elsewhere). Documented onRooms::merge_reconciling. - A CAS-only-merged room's contract-WASM-upgrade pointer is deferred to reload (pre-existing, rare path).
Testing
- Unit: pure CAS/envelope (
versioning.rs), handler dispatch, CBOR wire round-trips (incl. legacy-variant pin), correlation-key divergence + round-trip,run_rooms_cas_loop(conflict→fold→retry, exhaustion, failure),merge_reconciling(rejoin preserved with a plain-merge-drops-it precondition; leave stays sticky), distinct-rooms union, envelope-tag CBOR-collision. chat-delegate 40, river-core 210, river-ui --bins 309 — all green. - Migration:
check-delegate-migration+check-room-contract-migrationgreen on HEAD; V24delegate_keyverified == the deployed delegate by base58 decode. - E2E: two-tab live-node Playwright verification (incl. outbound-DM envelope round-trip + legacy-migration re-save) to run before deploy.
Verdict
State: Ready to Merge once CI is green on the exact current HEAD (357d6b23).
Deploy note: delegate-WASM change → publish via cargo make publish-all from main after merge, gated on green CI at the merge commit + the E2E pass. PR2 (per-room key split, #65) follows after PR1 merges.
[AI-assisted - Claude]
…346) main merged #346 (drag-and-drop room reordering), which added a room_order field to Rooms. Merged origin/main into the branch (auto-merged, no conflicts) and added room_order: Vec::new() to the 10 Rooms test literals this PR introduced, which predated the field. Production code was unaffected (it never constructs Rooms literals); merge_reconciling delegates to Rooms::merge, which already handles room_order (unions, prunes to map), so no logic change. Delegate WASM and legacy_delegates.toml untouched by the merge — V24 migration intact. Tests: river-ui --bins 318, chat-delegate 40, river-core 186 green. Refs #345
Codex post-merge review (P1): the PENDING_MERGED_ROOMS fold in do_save_rooms_to_delegate logged-and-continued on a merge_reconciling error, then CAS-stored rooms_clone WITHOUT the pending rooms — silently deleting the concurrent-tab rooms the holding cell exists to preserve. Reachable when the same room was rejoined locally with a different self_sk (merge returns 'self_sk is different'). Match the conflict-fold's behavior: propagate the error (?) and abort the save instead of persisting a snapshot that drops pending rooms. The cell stays set, in-memory rooms are untouched, and the next save (or reload, which resets the per-session cell) retries — no silent data loss. Moved the fold out of the value-block to function scope so ? applies. Tests: river-ui --bins 318 green. Refs #345 [AI-assisted - Claude]
…rite CAS (#345) The cached-generation + PENDING_MERGED_ROOMS approach to the rooms CAS save generated a long tail of data-loss edge cases in review (missing write-back, coalesce-catch-up race, rejoin resurrection, under-hydrated/unsynced write-back rooms, and most recently a pending-only room overriding a legitimate remote leave). The root cause: trying to AVOID re-reading the delegate on the no-conflict path forced a side-cache (PENDING) of rooms that live on the delegate but not in the ROOMS signal, and reconciling that cache against later remote changes kept producing corner cases. Replace it with plain optimistic read-modify-write, the textbook pattern: do_save now, per attempt, GetVersioned (read current value + generation) -> merge_reconciling(local ROOMS, delegate state) -> CasStore at the read generation; a Conflict re-reads and retries. This removes ROOMS_GENERATION, PENDING_MERGED_ROOMS, the holding cell, the write-back, and CasSaveOutcome. Why it's correct with no side-cache: - A room another tab added lives on the delegate, so every save re-reads and re-includes it (never dropped) — fixes the round-1/2 loss without PENDING. - A subsequent remote leave is honored: merge_reconciling keeps the delegate's tombstone for rooms this tab doesn't explicitly hold (fixes the round-8 remote-leave resurrection — a pending-only room can no longer look 'locally authoritative'). - A local rejoin still wins (local map membership beats a stale delegate tombstone), with no lazy-generation false conflict (we read the fresh generation each save) — round-3 stays fixed. - We still never write delegate-deserialized (under-hydrated) rooms into the ROOMS signal; merged rooms reappear fully hydrated via the load path on reload (round-4 stays fixed). Live cross-tab surfacing remains PR2's scope. Cost: one extra ~337KB delegate read per save (local IPC, coalesced to <=2 per burst — not a #246-class burst); PR2's per-room keys shrink it. GetVersioned uses its distinct correlation key (M1), so it can't collide with the load path's plain GetRequest. run_rooms_cas_loop now takes injected get_versioned + cas_store closures; rewrote its tests (empty-store, delegate-state-merged, retry-on-conflict, exhaustion, failed, unparseable-delegate-aborts). Tests: river-ui --bins 320, chat-delegate 40, river-core 186 green; clippy clean; delegate WASM + V24 migration unchanged. Refs #345 [AI-assisted - Claude]
Completes the per-room storage split: rooms are loaded from independent
room:<base58(vk)> slot keys (+ rooms_meta) instead of the single
rooms_data blob.
- startup fires ListRequest (was GetRequest{rooms_data}); the ListResponse
arm spawns load_rooms_per_room, which plain-GETs each slot (distinct
correlation from the save CAS, so an in-flight save can't steal the
load's response), reconstructs Rooms (Present->map, Tombstone->
removed_rooms, apply_meta), and hydrates.
- extract the post-deserialize hydrate block into a shared free fn
hydrate_loaded_rooms() used by both the single-blob GetResponse arm and
the per-room load.
- NON-DESTRUCTIVE migration: a single rooms_data blob (legacy delegate, or
a stray current-delegate blob) is read, hydrated, then re-saved as
per-room keys; the blob is left in place as a rollback fallback. Per-room
keys take priority over a stray blob so migration never re-triggers.
- wire mark_room_rejoined() at the two explicit-rejoin tombstone-clear
sites (members import-identity, get_response invitation-accept) so an
explicit rejoin overwrites a remote Tombstone slot rather than adopting
the leave (round-9).
- remove now-dead Rooms::merge_reconciling (+ its tests); the per-room key
split it was a stopgap for now exists.
- tests: plan_load_from_keys, reconstruct_rooms, RoomSlot/RoomsMeta
round-trips.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
P1 (codex/big-picture) — legacy migration now per-room aware: - fire_legacy_migration_request also fires a fire-and-forget ListRequest to each legacy delegate; a legacy ListResponse routes to migrate_legacy_per_room, which GETs the dynamic room:<vk> slots from the legacy delegate (new send_delegate_request_to) and re-saves them to the current delegate. Without this the NEXT delegate-WASM bump after per-room shipped would strand all rooms. MAJOR (code-first/skeptical/big-picture) — rejoin/cache correctness: - cas_write_key/cas_write_delegate_key return Result<bool> (wrote vs aborted). - save loop records Tombstone (not a phantom Present(h)) when a reconcile aborts to adopt a remote leave, and clears REJOINED_THIS_SESSION once a rejoin is actually persisted as Present — so a later cross-tab leave + background update adopts the leave instead of resurrecting on a stale flag. - mark_room_rejoined clears the room's ROOM_SLOT_STATE entry; new clear_room_rejoined wired at the leave site. MINOR: - do_save accumulates per-room errors and continues (one room's CAS exhaustion no longer aborts the whole pass). - reconcile_room_present unions invitation_secrets (don't drop a private-room secret on a cross-tab merge). - clarifying comment on concurrent-load single-waiter behavior; documented the load-side Rooms::merge abort-on-mismatch limitation. Tests: reconcile_room_present (5 cases incl. M1 diverged-identity + unparseable), rejoin flag set/clear + wiring pin, parse_room_storage_key round-trip/negatives, single-blob->per-room explosion+reconstruct round-trip, cas_write_key wrote-bool. Docs: river-publish.md + direct-messages.md updated for ListRequest load flow and the fixed-vs-dynamic new-storage-key carve-out. Stale past-due TODOs corrected. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
BLOCKER (code-first/skeptical) — early mark-done stranded per-room data: - a legacy delegate returning no rooms_data blob no longer marks legacy migration done. Its dynamic room:<vk> keys may still need migrating via the List-driven migrate_legacy_per_room; an early mark-done would permanently strand them on any transient migration failure (the V24->V25 data-loss). Done-marking is now owned solely by the successful-migration paths (hydrate_loaded_rooms legacy re-save + the current PerRoom load). MAJOR/P1 (codex/skeptical) — legacy correlation collision: - requests to a non-current delegate now register under a delegate-scoped correlation key (legacy_scoped_correlation); the response side rebuilds the same scoped key for is_legacy responses. Two legacy delegates (or a reconnect racing current vs legacy) holding the same room:<vk> can no longer collide on the single-waiter PENDING_REQUESTS slot and migrate stale data. The current-delegate path keeps the bare correlation (unchanged, incl. signing). MINOR: documented that a load-side merge Err also skips secret repopulation for the rooms that did merge. Tests: invitation_secrets union (matching-identity path), legacy_scoped_correlation per-delegate, and a source-pin that the empty-rooms_data legacy branch never marks migration done. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Problem
Multi-tab data loss (#345): accepting an invite to a new room in one browser tab
left the room missing from other tabs, and it was lost entirely on refresh. Root
cause (confirmed byte-level on technic): the chat delegate stored the entire
room list in a single
rooms_datablob, written with a blind last-writer-winsoverwrite. Each tab has its own in-memory
ROOMS, so across tabs the last saveclobbered the others — a stale tab silently erased a room a fresh tab had just
added.
Approach
Two layers, both in this PR:
1. A compare-and-swap primitive in the chat-delegate KV store. New wire
variants
GetVersionedRequest/CasStoreRequest(+GetVersionedResponse/CasStoreResponse{CasStoreResult}), appended to the externally-tagged enums soold clients never receive them. Every value is wrapped in a versioned envelope
[tag=1][generation u64 LE][data]; plain Store/Get transparently wrap/unwrap,CAS stores only when the expected generation matches. Pure, unit-tested logic in
delegates/chat-delegate/src/versioning.rs. Delegate WASM rebuilt →V24migration entry in
legacy_delegates.toml.2. Per-room keys (realizes #65). Each room is its own
room:<base58(owner_vk)>delegate key holding aRoomSlot::{Present|Tombstone},CAS-versioned independently; list-level view prefs (current room, notification
modes, room order) live in a single
rooms_metakey. The singlerooms_datablob is no longer written (still read once for migration). A save for room X
never touches room Y's key, so the clobber is structurally impossible.
do_save_rooms_to_delegate): content-diffs each room and CAS-writesonly changed slots;
reconcile_room_present/_tombstone/_metaare theread-merge-write reconcilers;
present_actionis the pure rejoin-vs-leavedecision. Per-room error isolation (one room's CAS exhaustion doesn't abort the
pass).
response_handler.rs):fire_list_rooms_request(aListRequest)→
load_rooms_per_roomplain-GETs each slot +rooms_meta, reconstructsRooms, and hydrates via the sharedhydrate_loaded_rooms. Plain-GETcorrelation is distinct from the save's CAS correlation, so a concurrent
save can't steal a load's response slot.
mark_room_rejoinedat theinvitation-accept and identity-import sites) overwrites a remote
Tombstone;a background content update to a room left elsewhere adopts the leave. The
rejoin flag is consumed once the rejoin is persisted, and cleared on leave, so
a stale flag can't resurrect a room left in another tab.
Transparent, non-destructive migration. On the V24 upgrade the new delegate
key is empty → load falls through to the legacy-delegate probe and recovers the
old single
rooms_datablob, re-saving it as per-room keys (the old blob is leftin place as a rollback fallback). Because per-room keys are dynamic,
fire_legacy_migration_requestalso fires aListRequestto each legacydelegate so a future delegate-WASM bump can recover the per-room keys the
now-legacy delegate holds (
migrate_legacy_per_room) — without this, the nextbump would strand every room.
Testing
All native unit tests green (
cargo test -p river-ui --bins336,chat-delegate40,
river-corechat_delegate 19);cargo fmt+clippyclean.encode/decode, generation arithmetic, handler dispatch.
present_actiontruth table +reconcile_room_presentat byte level (absent,tombstone-background-adopts-leave, tombstone-explicit-rejoin-overwrites,
diverged-identity-keeps-local, unparseable-errs).
plan_load_from_keys(all branches),reconstruct_rooms,RoomSlot/RoomsMetaCBOR round-trips, single-blob→per-room explosion+reconstructround-trip (the on-node upgrade format contract).
mark_room_rejoined/clear_room_rejoinedwiring;parse_room_storage_keyround-trip/negatives;correlation-key divergence (load vs save).
Reviewed by a multi-model pass (4 perspectives + external codex); findings
addressed in the follow-up commit (legacy per-room migration, rejoin/cache
correctness, per-room error isolation,
invitation_secretsunion, doc updates).Still TODO before merge/deploy: two-tab Playwright on technic + an on-node
upgrade check (load the new UI against a node holding old single-blob state →
rooms reappear), and re-confirming deploy authorization (scope grew from
single-blob to per-room).
Closes #345. Realizes #65.
[AI-assisted - Claude]