feat: elect a single leader instance and surface it in the desktop UI #1062
wpfleger96 wants to merge 8 commits
When multiple Buzz dev instances share an agent keypair, the relay fans each mention out to every connected buzz-acp (NIP-01) and all of them respond. Add a per-agent-key leader lock so only the leader instance promotes queued events to prompts; non-leaders keep their queue populated for UI rendering but stay silent.

Reads ~/.buzz/leader-locks/<pubkey>.lock: absent or self-owned means leader, foreign-owned means observer, malformed fails safe to leader so a corrupt lock never silences the only responder. Solo dev has no lock file, so behavior is unchanged.

The 👀 reaction fires at queue-push time before dispatch, so it is gated separately from dispatch_pending.

The election identity is sourced from BUZZ_INSTANCE_ELECTION_ID behind a single named constant. It is deliberately NOT the Tauri bundle identifier (BUZZ_MANAGED_AGENT): that collides across same-class windows (DMG + dev, or worktrees whose icon-gen fell back to the shared dev id), which would let two windows both lead. Phase 2 writes a process-unique value; Phase 1 reads whatever that env carries and defaults to leader when it is unset.

Phase 1 of 3: read side only. Lock acquire/steal/failover (Phase 2) and claim UI (Phase 3) follow.

Co-authored-by: Will Pfleger <pfleger.will@gmail.com>
Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
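The read-side rule in this commit message can be sketched as a small pure function. This is an illustration under stated assumptions, not the code in `leader.rs`: the `LockRead` and `Status` types and the `resolve` and `lock_path` helpers are hypothetical names, and the lock contents are modeled as already parsed rather than read from disk.

```rust
use std::path::{Path, PathBuf};

/// Outcome of reading the lock file (hypothetical type for this sketch).
#[derive(Debug, Clone, PartialEq, Eq)]
enum LockRead {
    /// No lock file exists (the solo-dev case).
    Absent,
    /// The file exists but could not be parsed.
    Malformed,
    /// The file names an owning instance id.
    OwnedBy(String),
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    Leader,
    Observer,
}

/// Builds `<home>/.buzz/leader-locks/<pubkey>.lock`, the layout named in the commit.
fn lock_path(home: &Path, pubkey_hex: &str) -> PathBuf {
    home.join(".buzz")
        .join("leader-locks")
        .join(format!("{pubkey_hex}.lock"))
}

/// Resolves leadership from a lock read and this process's election id.
/// `my_election_id` is `None` when the election-id env var is unset.
fn resolve(read: &LockRead, my_election_id: Option<&str>) -> Status {
    match (read, my_election_id) {
        // Unset election id defaults to leader.
        (_, None) => Status::Leader,
        // Absent or malformed locks fail safe to leader.
        (LockRead::Absent | LockRead::Malformed, _) => Status::Leader,
        // Self-owned lock: this instance leads.
        (LockRead::OwnedBy(owner), Some(me)) if owner.as_str() == me => Status::Leader,
        // Foreign-owned lock: observe only.
        (LockRead::OwnedBy(_), Some(_)) => Status::Observer,
    }
}
```

Only a well-formed lock naming a different instance yields `Observer`; every other branch resolves to `Leader`, which is what keeps a corrupt lock from silencing the only responder.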
Force-pushed from c6496b8 to 09422e6.
run_prompt_task has a second caller besides dispatch_pending: dispatch_heartbeat. The heartbeat prompt acts autonomously on the wire (sends messages, approves workflows), so an ungated heartbeat let every non-leader instance sharing one agent key self-prompt and act: the same duplicate-actor bug the dispatch gate prevents, arriving through the heartbeat door. Gating before pool.try_claim closes it; solo dev is unaffected since an absent lock means leader.

Co-authored-by: Will Pfleger <pfleger.will@gmail.com>
Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
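The ordering described here (check leadership first, claim a pool slot second) might look roughly like the sketch below. The `LeaderCheck` trait name comes from the PR, but its method signature, the `FixedLeader` stub, the `HeartbeatOutcome` enum, and the `gated_heartbeat` function are assumptions made for illustration; the closure stands in for `pool.try_claim`.

```rust
use std::sync::Arc;

/// Read-side leadership check. The method signature is an assumption.
trait LeaderCheck: Send + Sync {
    fn is_leader(&self, agent_pubkey: &str) -> bool;
}

/// Hypothetical fixed-answer check, useful for exercising the gate.
struct FixedLeader(bool);

impl LeaderCheck for FixedLeader {
    fn is_leader(&self, _agent_pubkey: &str) -> bool {
        self.0
    }
}

#[derive(Debug, PartialEq, Eq)]
enum HeartbeatOutcome {
    /// Non-leader: nothing was claimed and nothing was sent.
    Suppressed,
    /// Leader and a pool slot was claimed.
    Dispatched,
    /// Leader, but no pool slot was free.
    PoolBusy,
}

/// Gates the heartbeat before the pool claim, so an observer never takes a slot.
fn gated_heartbeat(
    leader: &Arc<dyn LeaderCheck>,
    agent_pubkey: &str,
    try_claim: impl FnOnce() -> bool,
) -> HeartbeatOutcome {
    if !leader.is_leader(agent_pubkey) {
        return HeartbeatOutcome::Suppressed;
    }
    if try_claim() {
        HeartbeatOutcome::Dispatched
    } else {
        HeartbeatOutcome::PoolBusy
    }
}
```

Because the leadership check runs before `try_claim`, a non-leader returns early without touching the pool at all.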
Folds the NIP-LE spec into PR #1062 alongside the implementation so the spec and the code it describes ship together.

Reconciles one wording gap against the shipped code: the draft's invariant named two suppression surfaces for non-leaders (dispatch path, 👀 reaction) but the implementation gates a third, the autonomous heartbeat prompt path (dispatch_heartbeat). Adds bullet (c) so all three suppression surfaces in the spec map to a gate in the code.

Co-authored-by: Will Pfleger <pfleger.will@gmail.com>
Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
…ument fail-safe scope

The primary suppression surface, dispatch_pending, had no test; only the heartbeat gate was covered, so a refactor of the shared leader gate could silently ship a non-leader that still dispatches. Add the missing leader/non-leader test pair.

Read-side mutex locks now read through poison rather than panic the event loop, removing a latent footgun for when Phase 2 extends the critical sections.

Module-doc and NIP-LE now state that any IO error (not just an absent lock) fails safe to leader, that this can produce a bounded ≤5s transient duplicate during a concurrent rewrite, and that pid/claimed_at are read-side-ignored.

Co-authored-by: Will Pfleger <wpfleger@block.xyz>
Signed-off-by: Will Pfleger <wpfleger@block.xyz>
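"Reading through poison" refers to a standard-library idiom: a `Mutex` whose holder panicked returns a `PoisonError`, and `PoisonError::into_inner` recovers the guard instead of unwrapping into a second panic. A minimal sketch follows, assuming a hypothetical per-agent status cache; the `cached_is_leader` name and the `HashMap<String, bool>` shape are illustrative, not the PR's types.

```rust
use std::collections::HashMap;
use std::sync::{Mutex, PoisonError};

/// Looks up cached leadership for an agent key without panicking on a
/// poisoned mutex. A missing entry defaults to leader, mirroring the
/// fail-safe direction described in the commit (an assumption for the cache).
fn cached_is_leader(cache: &Mutex<HashMap<String, bool>>, agent_pubkey: &str) -> bool {
    // `lock()` returns Err(PoisonError) if a previous holder panicked;
    // `into_inner` hands back the guard so the read can still proceed.
    let guard = cache.lock().unwrap_or_else(PoisonError::into_inner);
    guard.get(agent_pubkey).copied().unwrap_or(true)
}
```

The data behind a poisoned lock may be mid-update, so this idiom fits best where, as in a read-only status lookup, a slightly stale value is preferable to taking down the event loop.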
…ailover

The read side elected nobody: with no writer, every instance read an absent lock and defaulted to leader, so a shared agent key produced duplicate responders. Add the writer half so one instance actually wins.

The harness self-mints a process-unique election id (from_env_or_mint), acquires the lock on startup, re-attempts the claim on the existing 5s refresh tick (failover when a leader dies without releasing), and releases on graceful shutdown. acquire holds an exclusive flock across the read-decide-write window to close the TOCTOU race; a dead-pid lock is reclaimable, a live foreign lock is not. release empties the file rather than unlinking it to avoid racing a fresh claim onto a detached inode.

Failover tolerates pid recycling: a crashed leader's pid may be reused by an unrelated live process, which a bare pid-liveness probe would mistake for the original leader and never take over, wedging the agent leaderless. A claim is takeable when its pid is dead OR its claimed_at is older than a staleness bound (2x the 5s refresh); a live leader rewrites claimed_at every tick, so only an abandoned claim ages out, without evicting an active leader.

Writer is Unix-only (flock); non-Unix gets an always-leader no-op, mirroring the kill_process_group cfg fallback. NIP-LE amended to final state: claim is auto-on-launch plus explicit re-claim.

Co-authored-by: Will Pfleger <pfleger.will@gmail.com>
Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
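The takeover rule above reduces to a short predicate. The sketch below is a simplified model, not the shipped `lock_is_takeable`: the `Claim` struct, the precomputed `age` field (standing in for "now minus `claimed_at`"), and the injected `pid_is_alive` closure are assumptions, and the `flock` that must be held around the real read-decide-write is left out.

```rust
use std::time::Duration;

const REFRESH_SECS: u64 = 5;
/// Staleness bound: 2x the refresh interval, per the commit message.
const STALE_CLAIM: Duration = Duration::from_secs(2 * REFRESH_SECS);

/// A parsed lock claim (hypothetical shape for this sketch).
struct Claim {
    instance_id: String,
    pid: u32,
    /// Time elapsed since the claim was last stamped.
    age: Duration,
}

/// Returns true when `my_id` may write its own claim into the lock.
/// `current` is `None` for a free lock (absent, empty, or malformed).
fn lock_is_takeable(
    current: Option<&Claim>,
    my_id: &str,
    pid_is_alive: impl Fn(u32) -> bool,
) -> bool {
    match current {
        None => true,
        Some(claim) => {
            claim.instance_id == my_id          // already ours: re-stamp it
                || !pid_is_alive(claim.pid)     // owner crashed without releasing
                || claim.age > STALE_CLAIM      // abandoned claim, pid possibly recycled
        }
    }
}
```

The staleness arm is what covers the recycled-pid case: the liveness probe reports the pid as alive, but because no leader is re-stamping the claim, its age eventually exceeds the bound.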
Force-pushed from f00829b to 45a6de5.
…tive (#1076)

Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
Co-authored-by: npub1mn7jgtj4w2pd0g0zeuhxsa6jy6p0rewxz4kujt98my82ahfmp72sxjexk7 <dcfd242e557282d7a1e2cf2e6877522682f1e5c6156dc92ca7d90eaedd3b0f95@sprout-oss.stage.blox.sqprod.co>
Phase 3a emits a `leadership_status` observer frame per window-instance every 5s and handles a `claim_leadership` control frame. The desktop had no consumer. This adds the owner-side surface: a per-agent leader badge and a per-instance "Make leader" steal action.

The frames already land in `eventsByAgent` via the owner-wide observer subscription, so leadership is a cached derivation rather than a new store. `getAgentLeadership` stays a stable map lookup (required by `useSyncExternalStore`); the `leadershipByAgent` array is rebuilt only when a leadership frame appends.

Staleness stays out of the store: the row filters against a 5s clock so a crashed leader's badge drops within 15s without a new frame. The steal ack is non-authoritative: the UI converges off the stream, never optimistically flipping the badge.

Co-authored-by: Will Pfleger <pfleger.will@gmail.com>
Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
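The render-time derivation described here (drop stale instances, then pick the freshest self-reported leader) is sketched below in Rust purely for illustration; the desktop helpers themselves are TypeScript. The `Instance` struct, the millisecond timestamps, and the inclusive 15s boundary are assumptions of this sketch, not details confirmed by the PR.

```rust
/// Instances not seen for longer than this are hidden (3 missed 5s frames).
const STALE_AFTER_MS: u64 = 15_000;

/// Latest leadership frame per instance (hypothetical shape).
#[derive(Debug, Clone, PartialEq, Eq)]
struct Instance {
    instance_id: String,
    is_leader: bool,
    last_seen_ms: u64,
}

/// Keeps only instances whose last frame is within the staleness threshold.
fn filter_stale(instances: &[Instance], now_ms: u64) -> Vec<Instance> {
    instances
        .iter()
        .filter(|i| now_ms.saturating_sub(i.last_seen_ms) <= STALE_AFTER_MS)
        .cloned()
        .collect()
}

/// Among instances claiming leadership, returns the most recently seen one.
fn select_freshest_leader(instances: &[Instance]) -> Option<&Instance> {
    instances
        .iter()
        .filter(|i| i.is_leader)
        .max_by_key(|i| i.last_seen_ms)
}
```

Because the filter depends only on the clock passed in, a crashed leader ages out on the next clock tick with no new frame required, and a brief overlap of two self-reported leaders resolves to whichever reported most recently.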
Co-authored-by: Will Pfleger <pfleger.will@gmail.com>
Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
Phase 3b: Leadership UI E2E screenshots

Captured via …

- Single-instance Leader badge: one instance reporting …
- Multi-instance freshest-leader badge: three instances, one leader; the row badge reflects the freshest leader (…)
- Leadership submenu
- The "Make leader" cooperative-steal action: the submenu open with a non-leader instance's "Make leader" entry hovered, the cooperative-steal entry point.




Summary
When multiple Buzz dev instances share an agent keypair, the relay fans every matching event out to all of them (NIP-01) and each instance prompts its agent, producing duplicate replies for a single `@mention`. This adds a per-agent-key leader lock so only the leader instance acts autonomously on the wire. Non-leaders still receive and render events (the queue stays ungated for UI) but suppress every path that would emit under the shared identity.

This PR carries the NIP-LE specification (`docs/nips/NIP-LE.md`), the read side (every instance derives leadership from the lock file), the writer + auto-claim lifecycle (the harness self-elects on launch, fails over on leader death, releases on shutdown), and the desktop leadership UI (a per-agent leader badge plus a cooperative "Make leader" steal control).

The invariant
A single agent identity may have N subscribed instances but exactly one prompter (the leader). Non-leaders suppress all three surfaces that would otherwise act under the shared key:

- (a) Dispatch (`dispatch_pending`): only the leader promotes queued events to prompts.
- (b) 👀 reaction: fires at queue-acceptance time, before dispatch, so the dispatch gate alone would let every sharing instance emit a redundant 👀.
- (c) Heartbeat (`dispatch_heartbeat`): the heartbeat prompt can run `buzz messages send` / `buzz workflows approve` on its own; non-leaders must not fire it, or N instances each act for the same key.

Each surface maps 1:1 to a gate in the code. `docs/nips/NIP-LE.md` is the normative spec for this invariant and the lock contract.

How leadership is decided (read side)
Every instance reads `~/.buzz/leader-locks/<pubkey-hex>.lock` and resolves:

- lock absent, unreadable, or malformed → leader (fail safe)
- `instance_id` matches this process's election id → leader
- any other `instance_id` → observer

Status is cached per agent key and refreshed in place by `refresh()`. Leadership is read-derived, never acquire-derived: `is_leader` independently reads the file, so an acquire that fails open on an IO error can never force a process to lead next to a live foreign leader.

How a leader is elected (writer + auto-claim)
The harness self-mints a process-unique election id (`{pid}-{nanos}`) when `BUZZ_INSTANCE_ELECTION_ID` is unset, and honors the env verbatim if set (desktop per-spawn injection). The same id is both written into the lock and used by the read check, so a process's writes match its own reads.

- `acquire(agent_pubkey)`: take an exclusive `flock`, then read-decide-write under the held lock so the read→decide→write TOCTOU window is closed. The lock is takeable iff it is free (absent / empty / malformed), already ours, the owner's pid is dead (`kill(pid, None)` → `ESRCH`), or the owner's `claimed_at` is stale (older than `STALE_CLAIM` = 10s = 2× the refresh). On success it writes `{instance_id, pid, claimed_at}`; a live, fresh foreign owner → observe.
- The `leader_refresh` tick calls `acquire` before `refresh`, so a leader re-stamps its own `claimed_at` every tick (its claim is always fresher than the bound), and a survivor re-reads a dead-or-stale lock and takes over within 5s.
- `release` empties the lock file (never unlinks → no detached-inode race) only if we still own it, so a co-located sibling takes over immediately and a successor's claim is never stomped.

Dead-pid vs recycled-pid. The `ESRCH` fast path catches the common crash-without-release case instantly. The `claimed_at` staleness arm is the backstop for the rarer case where the OS recycles the crashed leader's pid onto an unrelated live process before any survivor ticks: the recycled pid reads as alive, but with no live leader re-stamping the claim it ages past the bound and becomes takeable, closing what would otherwise be a permanent leaderless wedge. The only owner the staleness arm can evict is a live-pid process that has missed two consecutive refresh ticks, which requires a full-runtime stall (turns run on a spawned pool, so a normal heavy turn never blocks the refresh). In that state the leader is non-functional and takeover is correct, self-healing to a transient duplicate at worst.

Desktop leadership UI
The desktop is the owner/controller, not a contending instance. `leadership_status` frames already land in `eventsByAgent` via the owner-wide observer subscription, so the UI is a cached derivation: no new store, no new subscription, no relay change.

- Leader badge in `ManagedAgentRow` beside the status: shows when an agent has a live leader instance. Hidden when no live instance reports.
- Leadership submenu in the `...` dropdown: lists each live instance (truncated `instanceId`, last-seen, leader marker) with a per-instance "Make leader" action. The current leader's item is disabled. Hidden when `<= 1` live instance, so the solo-dev UX is byte-unchanged.
- `leadershipByAgent` is a cached `Map` rebuilt only when a `leadership_status` frame appends, mirroring `transcriptByAgent`. `getAgentLeadership` stays a stable map lookup so it satisfies the `useSyncExternalStore` referential-stability contract (a fresh array per `getSnapshot` would render-storm).
- `parseLeadershipPayload` narrows the untrusted `unknown` payload at the boundary; malformed frames are dropped. `lastSeen = Date.parse(event.timestamp)` with `NaN` → drop.
- Staleness is filtered at render against a `useNow(5000)` clock at a 15s threshold (3 missed 5s ticks), so a crashed leader's badge ages out without a new frame arriving.
- The badge follows `max(lastSeen)` among instances reporting `isLeader`, dissolving the `<= 15s` transient two-leader window after a crash without a "contested" state.
- `claimManagedAgentLeadership` mirrors `cancelManagedAgentTurn` and returns `{ status: "sent" }`. The UI never optimistically flips; it converges off the `leadership_status` stream.

Portability
The writer is `#[cfg(unix)]` (`flock` is Unix); `#[cfg(not(unix))]` is an always-leader no-op acquire/release, mirroring the existing `kill_process_group` cfg pattern. Desktop targets macOS/Linux; Windows is not a leader-election target. The `flock` feature is added to the existing `cfg(unix)` `nix` dependency; `claimed_at` parsing reuses the already-present `chrono` workspace dep (Cargo.lock unchanged).

Election identity
The lock keys on a per-process election identity behind a single named constant (`ELECTION_ID_ENV` = `BUZZ_INSTANCE_ELECTION_ID`). It is deliberately not the Tauri bundle identifier (`BUZZ_MANAGED_AGENT`): the bundle id collides across same-class windows (DMG + dev share `xyz.block.buzz.app(.dev)`; a worktree falls back to the shared dev id if `swift generate-dev-icon` fails), which would let two windows both match the lock and both lead. Reaper identity partitions by app-class; election identity must be unique per process.

What changed
Harness (`buzz-acp`):

- `docs/nips/NIP-LE.md` (new): the NIP-LE spec, covering the exactly-one-prompter invariant (surfaces a/b/c), the local-filesystem lock contract, claim semantics (auto-on-launch acquire-if-unowned plus explicit re-claim), the `flock` TOCTOU guard, and dead-pid + stale-claim failover.
- `crates/buzz-acp/src/leader.rs`: the `LeaderCheck` trait and `FileLeaderCheck`. Holds the read side (above) plus the writer half: `acquire` / `release`, `lock_is_takeable` (ours || dead-pid || stale), `pid_is_alive` (`ESRCH`-only-dead; `EPERM` → alive so a live leader is never stolen from), `claim_is_stale`, and `from_env_or_mint` (honor env if set, else mint `{pid}-{nanos}`).
- `crates/buzz-acp/src/pool.rs`: `PromptContext` carries `leader: Arc<dyn LeaderCheck>`.
- `crates/buzz-acp/src/lib.rs`: constructs the leader check via `from_env_or_mint`; auto-`acquire` on startup; re-`acquire` before `refresh` in the 5s `leader_refresh` tick; `release` first in shutdown teardown; gates (a) `dispatch_pending`, (b) the queue-push 👀 `reaction_add`, (c) `dispatch_heartbeat`; emits the `leadership_status` observer frame consumed by the desktop UI.

Desktop:
- `leadershipHelpers.ts` (new): pure derivation (`parseLeadershipPayload`, `buildLeadership`, `filterStaleInstances`, `selectFreshestLeader`).
- `leadershipHelpers.test.mjs` (new): 17 behavior tests.
- `observerRelayStore.ts`: `leadershipByAgent` cached map, rebuild-on-append, `getAgentLeadership` selector.
- `agentControl.ts`: `claimManagedAgentLeadership` sender.
- `useObserverEvents.ts`: `useAgentLeadership` hook.
- `agentUi.ts`: `truncateInstanceId`.
- `ManagedAgentRow.tsx`: leader badge + leadership submenu.
- `e2eBridge.ts` + `observerRelayStore.ts`: test-only `__BUZZ_E2E_SEED_LEADERSHIP__` hook routing synthetic `leadership_status` frames through the real cached-map rebuild path (production ingest untouched).
- `leadership-screenshots.spec.ts` (new) + `playwright.config.ts`: E2E screenshot spec, 4 scenarios in the smoke project.

Tests
Harness: twenty-one `buzz-acp` unit tests, all behavior-focused.

- … `refresh` flips status when the lock changes. Ids are opaque per-window strings (`window-a` / `window-b`) so the suite can't enshrine `bundle-id == election-id`.
- … `flock` serialization, not just the happy path).

`cargo test -p buzz-acp` → 330 passed. `cargo fmt --check` / `cargo clippy --all-targets` clean.

Desktop: 17 behavior tests over the pure leadership derivation (guard drops, latest-per-instance, NaN-timestamp drop, zombie prune, stale boundary, NaN-stale, freshest-among-multiple-leaders, the two-leader transient). The store-level referential-stability contract is verified at review (the store imports Tauri bindings that fail under node). `pnpm typecheck` / `pnpm check` clean; `pnpm test` 760/760. The leadership E2E screenshot spec passes 4/4 in the smoke project.