feat: elect a single leader instance and surface it in the desktop UI by wpfleger96 · Pull Request #1062 · block/buzz

wpfleger96 · 2026-06-15T22:16:18Z

Summary

When multiple Buzz dev instances share an agent keypair, the relay fans every matching event out to all of them (NIP-01) and each instance prompts its agent, producing duplicate replies for a single @mention. This adds a per-agent-key leader lock so only the leader instance acts autonomously on the wire. Non-leaders still receive and render events (the queue stays ungated for UI) but suppress every path that would emit under the shared identity.

This PR carries the NIP-LE specification (docs/nips/NIP-LE.md), the read side (every instance derives leadership from the lock file), the writer + auto-claim lifecycle (the harness self-elects on launch, fails over on leader death, releases on shutdown), and the desktop leadership UI (a per-agent leader badge plus a cooperative "Make leader" steal control).

The invariant

A single agent identity may have N subscribed instances but exactly one prompter (the leader). Non-leaders suppress all three surfaces that would otherwise act under the shared key:

(a) the prompt/dispatch path: non-leaders promote nothing queued to a prompt.
(b) the pre-dispatch 👀 reaction: it fires at queue-acceptance time, before dispatch, so the dispatch gate alone would let every sharing instance emit a redundant 👀.
(c) the autonomous heartbeat prompt path: the periodic self-prompt that, when enabled, has the agent run buzz messages send / buzz workflows approve on its own. Non-leaders must not fire it, or N instances each act for the same key.

Each surface maps 1:1 to a gate in the code. docs/nips/NIP-LE.md is the normative spec for this invariant and the lock contract.
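
As an illustration of the three gates, here is a minimal sketch of one instance's event handling. The `LeaderCheck` trait below is a hypothetical one-method stand-in (the real trait's signature is not shown in this description), and `Instance`, `Fixed`, `on_event`, and the `react:`/`prompt:` strings are invented for the example. Only the gate placement mirrors the text above.

```rust
// Hypothetical stand-in for the PR's LeaderCheck trait; the real signature may differ.
trait LeaderCheck {
    fn is_leader(&self) -> bool;
}

// Test double with a fixed answer.
struct Fixed(bool);

impl LeaderCheck for Fixed {
    fn is_leader(&self) -> bool {
        self.0
    }
}

// Illustrative instance model, not the buzz-acp types.
struct Instance<L: LeaderCheck> {
    leader: L,
    queue: Vec<String>,   // always populated so the UI can render
    next: usize,          // first queue index not yet promoted to a prompt
    emitted: Vec<String>, // everything sent under the shared key
}

impl<L: LeaderCheck> Instance<L> {
    fn new(leader: L) -> Self {
        Self { leader, queue: Vec::new(), next: 0, emitted: Vec::new() }
    }

    // Queue acceptance: the push is ungated, the reaction is gated.
    fn on_event(&mut self, id: &str) {
        self.queue.push(id.to_string());
        if self.leader.is_leader() {
            self.emitted.push(format!("react:{id}")); // gate (b)
        }
    }

    // Promote queued events to prompts, leader only.
    fn dispatch_pending(&mut self) {
        if !self.leader.is_leader() {
            return; // gate (a)
        }
        for id in &self.queue[self.next..] {
            self.emitted.push(format!("prompt:{id}"));
        }
        self.next = self.queue.len();
    }

    // Periodic autonomous self-prompt, leader only.
    fn dispatch_heartbeat(&mut self) {
        if !self.leader.is_leader() {
            return; // gate (c)
        }
        self.emitted.push("heartbeat".to_string());
    }
}
```

With `Fixed(false)` the queue still fills but `emitted` stays empty, which is the observer behavior the invariant asks for.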

How leadership is decided (read side)

Every instance reads ~/.buzz/leader-locks/<pubkey-hex>.lock and resolves:

absent lock → leader (solo dev is unaffected — no lock, no suppression)
present + instance_id matches this process's election id → leader
present + foreign instance_id → observer
no election id → leader (solo CLI with the env unset)
unreadable or malformed → fail safe to leader: any IO error (permission, mid-write truncation) or parse failure resolves to leader, so a corrupt lock never silences the only responder

Status is cached per agent key and refreshed in place by refresh(). Leadership is read-derived, never acquire-derived: is_leader independently reads the file, so an acquire that fails open on an IO error can never force a process to lead next to a live foreign leader.
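
The resolution table above can be sketched as a pure function. This is an illustration only: `Role`, `LockRead`, and `resolve` are hypothetical names, and reading and parsing the lock file are assumed to have happened already, with every IO or parse failure collapsed into `Unreadable`.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Role {
    Leader,
    Observer,
}

// Outcome of one read of the lock file (hypothetical modelling).
enum LockRead<'a> {
    Absent,
    Unreadable, // any IO error or parse failure
    Owned { instance_id: &'a str },
}

// Read-derived leadership: depends only on what the file says right now.
fn resolve(election_id: Option<&str>, lock: LockRead<'_>) -> Role {
    let me = match election_id {
        Some(id) => id,
        None => return Role::Leader, // no election id: solo CLI, always leader
    };
    match lock {
        // Absent and unreadable both fail safe to leader.
        LockRead::Absent | LockRead::Unreadable => Role::Leader,
        LockRead::Owned { instance_id } if instance_id == me => Role::Leader,
        LockRead::Owned { .. } => Role::Observer,
    }
}
```

The only way to get `Observer` is a readable lock naming a different instance, so a corrupt file can produce a transient duplicate but never a silent agent.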

How a leader is elected (writer + auto-claim)

The harness self-mints a process-unique election id ({pid}-{nanos}) when BUZZ_INSTANCE_ELECTION_ID is unset, and honors the env verbatim if set (desktop per-spawn injection). The same id is both written into the lock and used by the read check, so a process's writes match its own reads.
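
A rough sketch of the honor-env-else-mint rule, assuming nothing about the real from_env_or_mint beyond what is stated above. `choose_election_id` and `from_env_or_mint_sketch` are hypothetical names, and how an empty env value is treated is not stated in this description, so the sketch simply passes it through.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

const ELECTION_ID_ENV: &str = "BUZZ_INSTANCE_ELECTION_ID";

// Pure core: honor the env value verbatim if set, else mint "{pid}-{nanos}".
fn choose_election_id(env_value: Option<&str>, pid: u32, nanos: u128) -> String {
    match env_value {
        Some(v) => v.to_string(),
        None => format!("{pid}-{nanos}"),
    }
}

// Wrapper that reads the real process environment (hypothetical name).
fn from_env_or_mint_sketch() -> String {
    let nanos = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_nanos())
        .unwrap_or(0); // clock before the epoch: fall back to 0 in this sketch
    let env_value = std::env::var(ELECTION_ID_ENV).ok();
    choose_election_id(env_value.as_deref(), std::process::id(), nanos)
}
```

Keeping the choice in a pure function makes the "same id for writes and reads" property easy to test without touching the environment.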

Startup: acquire(agent_pubkey) takes an exclusive flock, then reads, decides, and writes under the held lock so the read→decide→write TOCTOU window is closed. The lock is takeable iff it is free (absent / empty / malformed), already ours, the owner's pid is dead (kill(pid, None) → ESRCH), or the owner's claimed_at is stale (older than STALE_CLAIM = 10s = 2× the 5s refresh interval). On success it writes {instance_id, pid, claimed_at}; a live, fresh foreign owner → observe.
Failover: no new timer. The existing 5s leader_refresh tick calls acquire before refresh, so a leader re-stamps its own claimed_at every tick (its claim is always fresher than the STALE_CLAIM bound), and a survivor that re-reads a dead-pid lock takes over within one 5s tick. A stale claim on a live (recycled) pid becomes takeable only once it ages past STALE_CLAIM, so that takeover lands on the first tick after the 10s bound.
Shutdown: release empties the lock file (never unlinks → no detached-inode race) only if we still own it, so a co-located sibling takes over immediately and a successor's claim is never stomped.
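
To illustrate the shutdown rule (empty the file rather than unlink it, and only when the lock is still ours), here is a small std-only sketch. It makes two loud assumptions: the on-disk format is taken to be whitespace-separated with the instance id first (the real serialization is not shown in this description), and the exclusive flock that the real release would hold is omitted, so this version is not race-free. `release_if_owned` is a hypothetical name.

```rust
use std::fs;
use std::io;
use std::path::Path;

// Returns Ok(true) if the lock was ours and has been emptied.
// ASSUMPTION: the first whitespace-separated token is the owner's instance id.
// ASSUMPTION: no flock is taken here; the real writer holds one.
fn release_if_owned(lock_path: &Path, my_id: &str) -> io::Result<bool> {
    let contents = match fs::read_to_string(lock_path) {
        Ok(c) => c,
        Err(e) if e.kind() == io::ErrorKind::NotFound => return Ok(false),
        Err(e) => return Err(e),
    };
    if contents.split_whitespace().next() != Some(my_id) {
        return Ok(false); // empty, malformed, or a successor's claim: leave it alone
    }
    // fs::write truncates the existing file in place, so the inode is kept.
    fs::write(lock_path, b"")?;
    Ok(true)
}
```

Because the file is truncated instead of removed, a sibling that already opened it is still looking at the same inode when it next tries to claim.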

Dead-pid vs recycled-pid. The ESRCH fast path catches the common crash-without-release case instantly. The claimed_at staleness arm is the backstop for the rarer case where the OS recycles the crashed leader's pid onto an unrelated live process before any survivor ticks: the recycled pid reads as alive, but with no live leader re-stamping the claim it ages past the bound and becomes takeable — closing what would otherwise be a permanent leaderless wedge. The only owner the staleness arm can evict is a live-pid process that has missed two consecutive refresh ticks, which requires a full-runtime stall (turns run on a spawned pool, so a normal heavy turn never blocks the refresh) — in which state the leader is non-functional and takeover is correct, self-healing to a transient duplicate at worst.
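
The takeability decision can be sketched as a pure predicate. This is an illustration under assumptions: `Claim` is a hypothetical struct, pid liveness is passed in as a boolean rather than probed with kill(pid, None), and treating a claim of exactly STALE_CLAIM age as still fresh is a guess at the boundary.

```rust
use std::time::Duration;

const REFRESH_SECS: u64 = 5;
const STALE_CLAIM: Duration = Duration::from_secs(2 * REFRESH_SECS); // 10s

// A parsed foreign-or-own claim (hypothetical shape).
struct Claim<'a> {
    instance_id: &'a str,
    pid_alive: bool, // false only on ESRCH; EPERM counts as alive
    age: Duration,   // now minus claimed_at
}

// None models a free lock: absent, empty, or malformed.
fn lock_is_takeable(me: &str, claim: Option<&Claim<'_>>) -> bool {
    match claim {
        None => true,
        Some(c) => {
            c.instance_id == me      // already ours
                || !c.pid_alive      // crashed without release
                || c.age > STALE_CLAIM // recycled pid, nobody re-stamping
        }
    }
}
```

A live leader re-stamps every 5s, so its claim age stays well under the 10s bound and only the first two arms or an abandoned claim can make the lock takeable.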

Desktop leadership UI

The desktop is the owner/controller, not a contending instance. leadership_status frames already land in eventsByAgent via the owner-wide observer subscription, so the UI is a cached derivation — no new store, no new subscription, no relay change.

Leader badge in ManagedAgentRow beside the status — shows when an agent has a live leader instance. Hidden when no live instance reports.
Leadership submenu in the row's ... dropdown — lists each live instance (truncated instanceId, last-seen, leader marker) with a per-instance "Make leader" action. The current leader's item is disabled. Hidden when <= 1 live instance, so the solo-dev UX is byte-unchanged.
leadershipByAgent is a cached Map rebuilt only when a leadership_status frame appends, mirroring transcriptByAgent. getAgentLeadership stays a stable map lookup so it satisfies the useSyncExternalStore referential-stability contract (a fresh array per getSnapshot would render-storm).
parseLeadershipPayload narrows the untrusted unknown payload at the boundary; malformed frames are dropped. lastSeen = Date.parse(event.timestamp) with NaN → drop.
Staleness lives in the component, not the store — the store stays time-independent. The row filters against a useNow(5000) clock at a 15s threshold (3 missed 5s ticks), so a crashed leader's badge ages out without a new frame arriving.
Freshest-leader rule: the badge reflects max(lastSeen) among instances reporting isLeader, dissolving the <= 15s transient two-leader window after a crash without a "contested" state.
Non-authoritative ack: claimManagedAgentLeadership mirrors cancelManagedAgentTurn and returns { status: "sent" }. The UI never optimistically flips — it converges off the leadership_status stream.
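
The staleness filter and the freshest-leader rule are pure functions, so they can be restated compactly. The desktop helpers themselves are TypeScript; the sketch below is written in Rust, the harness language, purely as an illustration. `InstanceStatus` and the function names are hypothetical, and whether an instance seen exactly 15s ago counts as live is an assumption.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
struct InstanceStatus {
    instance_id: String,
    is_leader: bool,
    last_seen_ms: u64,
}

const STALE_AFTER_MS: u64 = 15_000; // 3 missed 5s ticks

// Keep only instances seen within the threshold of `now_ms`.
fn filter_stale(instances: &[InstanceStatus], now_ms: u64) -> Vec<InstanceStatus> {
    instances
        .iter()
        .filter(|i| now_ms.saturating_sub(i.last_seen_ms) <= STALE_AFTER_MS)
        .cloned()
        .collect()
}

// Among live instances claiming leadership, the most recently seen one wins.
fn select_freshest_leader(live: &[InstanceStatus]) -> Option<&InstanceStatus> {
    live.iter().filter(|i| i.is_leader).max_by_key(|i| i.last_seen_ms)
}

// The submenu is only worth showing when there is something to steal.
fn show_submenu(live: &[InstanceStatus]) -> bool {
    live.len() > 1
}
```

Passing `now_ms` in as an argument keeps the derivation time-independent, which matches the point above about staleness living in the component rather than the store.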

Portability

The writer is #[cfg(unix)] (flock is Unix); #[cfg(not(unix))] is an always-leader no-op acquire/release, mirroring the existing kill_process_group cfg pattern. Desktop targets macOS/Linux; Windows is not a leader-election target. The flock feature is added to the existing cfg(unix) nix dependency; claimed_at parsing reuses the already-present chrono workspace dep (Cargo.lock unchanged).

Election identity

The lock keys on a per-process election identity behind a single named constant (ELECTION_ID_ENV = BUZZ_INSTANCE_ELECTION_ID). It is deliberately not the Tauri bundle identifier (BUZZ_MANAGED_AGENT): the bundle id collides across same-class windows (DMG + dev share xyz.block.buzz.app(.dev); a worktree falls back to the shared dev id if swift generate-dev-icon fails), which would let two windows both match the lock and both lead. Reaper identity partitions by app-class; election identity must be unique per process.

What changed

Harness (buzz-acp):

docs/nips/NIP-LE.md (new) — the NIP-LE spec: the exactly-one-prompter invariant (surfaces a/b/c), the local-filesystem lock contract, claim semantics (auto-on-launch acquire-if-unowned plus explicit re-claim), the flock TOCTOU guard, and dead-pid + stale-claim failover.
crates/buzz-acp/src/leader.rs — the LeaderCheck trait and FileLeaderCheck: the read side (above), plus the writer half — acquire/release, lock_is_takeable (ours || dead-pid || stale), pid_is_alive (ESRCH-only-dead; EPERM → alive so a live leader is never stolen from), claim_is_stale, and from_env_or_mint (honor env if set, else mint {pid}-{nanos}).
crates/buzz-acp/src/pool.rs — PromptContext carries leader: Arc<dyn LeaderCheck>.
crates/buzz-acp/src/lib.rs — constructs from_env_or_mint; auto-acquire on startup; re-acquire before refresh in the 5s leader_refresh tick; release first in shutdown teardown; gates (a) dispatch_pending, (b) the queue-push 👀 reaction_add, (c) dispatch_heartbeat; emits the leadership_status observer frame consumed by the desktop UI.

Desktop:

leadershipHelpers.ts (new) — pure derivation: parseLeadershipPayload, buildLeadership, filterStaleInstances, selectFreshestLeader.
leadershipHelpers.test.mjs (new) — 17 behavior tests.
observerRelayStore.ts — leadershipByAgent cached map, rebuild-on-append, getAgentLeadership selector.
agentControl.ts — claimManagedAgentLeadership sender.
useObserverEvents.ts — useAgentLeadership hook.
agentUi.ts — truncateInstanceId.
ManagedAgentRow.tsx — leader badge + leadership submenu.
e2eBridge.ts + observerRelayStore.ts — test-only __BUZZ_E2E_SEED_LEADERSHIP__ hook routing synthetic leadership_status frames through the real cached-map rebuild path (production ingest untouched).
leadership-screenshots.spec.ts (new) + playwright.config.ts — E2E screenshot spec, 4 scenarios in the smoke project.

Tests

Harness — twenty-one buzz-acp unit tests, all behavior-focused.

Reader (6): absent → leader, self-owned → leader, foreign-owned → observer, malformed → fail-safe leader, no-election-id → leader even with a foreign lock, and refresh flips status when the lock changes. Ids are opaque per-window strings (window-a/window-b) so the suite can't enshrine bundle-id == election-id.
Writer (11): unowned-acquire writes self, creates the lock dir when absent, two-writer race yields exactly one winner, dead-pid lock is taken over, live+fresh foreign lock blocks acquire, recycled-pid-with-stale-claim is taken over (the failover-stall reproduction — fails without the staleness arm), reacquire-own is idempotent, release frees the lock for another writer, release does not stomp a successor, no-election-id acquire is a no-op (always leader), and an 8-thread barrier-synced concurrent acquire yields exactly one winner (proving flock serialization, not just the happy path).
Gates (4): a leader/non-leader pair over the heartbeat gate (c) and a leader/non-leader pair over the dispatch gate (a) — proving each negative test exercises its gate, not a dead path.

cargo test -p buzz-acp → 330 passed. cargo fmt --check / cargo clippy --all-targets clean.

Desktop: 17 behavior tests over the pure leadership derivation (guard drops, latest-per-instance, NaN-timestamp drop, zombie prune, stale boundary, NaN-stale, freshest-among-multiple-leaders, the two-leader transient). The store-level referential-stability contract is verified at review (the store imports Tauri bindings that fail under node). pnpm typecheck / pnpm check clean; pnpm test 760/760. The leadership E2E screenshot spec passes 4/4 in the smoke project.

When multiple Buzz dev instances share an agent keypair, the relay fans each mention out to every connected buzz-acp (NIP-01) and all of them respond. Add a per-agent-key leader lock so only the leader instance promotes queued events to prompts; non-leaders keep their queue populated for UI rendering but stay silent.

Reads ~/.buzz/leader-locks/<pubkey>.lock: absent or self-owned means leader, foreign-owned means observer, malformed fails safe to leader so a corrupt lock never silences the only responder. Solo dev has no lock file, so behavior is unchanged.

The 👀 reaction fires at queue-push time before dispatch, so it is gated separately from dispatch_pending.

The election identity is sourced from BUZZ_INSTANCE_ELECTION_ID behind a single named constant. It is deliberately NOT the Tauri bundle identifier (BUZZ_MANAGED_AGENT): that collides across same-class windows (DMG + dev, or worktrees whose icon-gen fell back to the shared dev id), which would let two windows both lead. Phase 2 writes a process-unique value; Phase 1 reads whatever that env carries and defaults to leader when it is unset.

Phase 1 of 3: read side only. Lock acquire/steal/failover (Phase 2) and claim UI (Phase 3) follow.

Co-authored-by: Will Pfleger <pfleger.will@gmail.com> Signed-off-by: Will Pfleger <pfleger.will@gmail.com>

run_prompt_task has a second caller besides dispatch_pending: dispatch_heartbeat. The heartbeat prompt acts autonomously on the wire (sends messages, approves workflows), so an ungated heartbeat let every non-leader instance sharing one agent key self-prompt and act — the same duplicate-actor bug the dispatch gate prevents, arriving through the heartbeat door. Gating before pool.try_claim closes it; solo dev is unaffected since an absent lock means leader. Co-authored-by: Will Pfleger <pfleger.will@gmail.com> Signed-off-by: Will Pfleger <pfleger.will@gmail.com>

Folds the NIP-LE spec into PR #1062 alongside the implementation so the spec and the code it describes ship together. Reconciles one wording gap against the shipped code: the draft's invariant named two suppression surfaces for non-leaders (dispatch path, 👀 reaction) but the implementation gates a third — the autonomous heartbeat prompt path (dispatch_heartbeat). Adds bullet (c) so all three suppression surfaces in the spec map to a gate in the code. Co-authored-by: Will Pfleger <pfleger.will@gmail.com> Signed-off-by: Will Pfleger <pfleger.will@gmail.com>

…ument fail-safe scope The primary suppression surface — dispatch_pending — had no test; only the heartbeat gate was covered, so a refactor of the shared leader gate could silently ship a non-leader that still dispatches. Add the missing leader/non-leader test pair. Read-side mutex locks now read through poison rather than panic the event loop, removing a latent footgun for when Phase 2 extends the critical sections. Module-doc and NIP-LE now state that any IO error (not just an absent lock) fails safe to leader, that this can produce a bounded ≤5s transient duplicate during a concurrent rewrite, and that pid/claimed_at are read-side-ignored. Co-authored-by: Will Pfleger <wpfleger@block.xyz> Signed-off-by: Will Pfleger <wpfleger@block.xyz>

…ailover

The read side elected nobody: with no writer, every instance read an absent lock and defaulted to leader, so a shared agent key produced duplicate responders. Add the writer half so one instance actually wins.

The harness self-mints a process-unique election id (from_env_or_mint), acquires the lock on startup, re-attempts the claim on the existing 5s refresh tick (failover when a leader dies without releasing), and releases on graceful shutdown. acquire holds an exclusive flock across the read-decide-write window to close the TOCTOU race; a dead-pid lock is reclaimable, a live foreign lock is not. release empties the file rather than unlinking it to avoid racing a fresh claim onto a detached inode.

Failover tolerates pid recycling: a crashed leader's pid may be reused by an unrelated live process, which a bare pid-liveness probe would mistake for the original leader and never take over, wedging the agent leaderless. A claim is takeable when its pid is dead OR its claimed_at is older than a staleness bound (2x the 5s refresh); a live leader rewrites claimed_at every tick, so only an abandoned claim ages out, without evicting an active leader.

Writer is Unix-only (flock); non-Unix gets an always-leader no-op, mirroring the kill_process_group cfg fallback. NIP-LE amended to final state: claim is auto-on-launch plus explicit re-claim.

Co-authored-by: Will Pfleger <pfleger.will@gmail.com> Signed-off-by: Will Pfleger <pfleger.will@gmail.com>

…tive (#1076) Signed-off-by: Will Pfleger <pfleger.will@gmail.com> Co-authored-by: npub1mn7jgtj4w2pd0g0zeuhxsa6jy6p0rewxz4kujt98my82ahfmp72sxjexk7 <dcfd242e557282d7a1e2cf2e6877522682f1e5c6156dc92ca7d90eaedd3b0f95@sprout-oss.stage.blox.sqprod.co>

Phase 3a emits a `leadership_status` observer frame per window-instance every 5s and handles a `claim_leadership` control frame. The desktop had no consumer. This adds the owner-side surface: a per-agent leader badge and a per-instance "Make leader" steal action. The frames already land in `eventsByAgent` via the owner-wide observer subscription, so leadership is a cached derivation rather than a new store. `getAgentLeadership` stays a stable map lookup (required by `useSyncExternalStore`); the `leadershipByAgent` array is rebuilt only when a leadership frame appends. Staleness stays out of the store — the row filters against a 5s clock so a crashed leader's badge drops within 15s without a new frame. The steal ack is non-authoritative: the UI converges off the stream, never optimistically flipping the badge. Co-authored-by: Will Pfleger <pfleger.will@gmail.com> Signed-off-by: Will Pfleger <pfleger.will@gmail.com>

Co-authored-by: Will Pfleger <pfleger.will@gmail.com> Signed-off-by: Will Pfleger <pfleger.will@gmail.com>

wpfleger96 · 2026-06-17T14:45:59Z

Phase 3b — Leadership UI E2E screenshots

Captured via leadership-screenshots.spec.ts (smoke project), driving the real cached-map consumer through the __BUZZ_E2E_SEED_LEADERSHIP__ seed hook — synthetic leadership_status frames routed through the production appendAgentEvent rebuild path, not a stub.

Single-instance Leader badge

One instance reporting isLeader: true — the row shows the crown "Leader" badge with no Leadership submenu (≤1 instance, nothing to steal).

Multi-instance freshest-leader badge

Three instances, one leader — the row badge reflects the freshest leader (max(lastSeen) among isLeader: true).

Leadership submenu

The ... dropdown's Leadership submenu listing each instance: truncated instanceId + last-seen + leader marker, with the current leader's item disabled.

"Make leader" cooperative-steal action

The submenu open with a non-leader instance's "Make leader" entry hovered — the cooperative-steal entry point.

wpfleger96 force-pushed the duncan/leader-election-core branch from c6496b8 to 09422e6 June 15, 2026 22:18

npub1mn7jgtj4w2pd0g0zeuhxsa6jy6p0rewxz4kujt98my82ahfmp72sxjexk7 and others added 2 commits June 15, 2026 18:29

wpfleger96 mentioned this pull request Jun 15, 2026

docs(nips): add NIP-LE leader election draft #1063

Closed

wpfleger96 marked this pull request as draft June 15, 2026 22:53

npub1mn7jgtj4w2pd0g0zeuhxsa6jy6p0rewxz4kujt98my82ahfmp72sxjexk7 and others added 2 commits June 16, 2026 15:31

wpfleger96 force-pushed the duncan/leader-election-core branch from f00829b to 45a6de5 June 16, 2026 21:16

wpfleger96 changed the title ~~feat(buzz-acp): gate prompting on client-side leader check~~ feat(buzz-acp): elect a single leader instance to gate autonomous prompting Jun 16, 2026

wpfleger96 mentioned this pull request Jun 16, 2026

feat(buzz-acp): add cooperative leadership steal via stand-down primitive #1076

Merged

wpfleger96 and others added 2 commits June 16, 2026 20:04

wpfleger96 mentioned this pull request Jun 17, 2026

feat(desktop): surface managed-agent leadership and cooperative steal #1078

Merged

test(desktop): add leadership E2E seed hook + screenshot spec

1a73552

Co-authored-by: Will Pfleger <pfleger.will@gmail.com> Signed-off-by: Will Pfleger <pfleger.will@gmail.com>

wpfleger96 changed the title ~~feat(buzz-acp): elect a single leader instance to gate autonomous prompting~~ feat: elect a single leader instance and surface it in the desktop UI Jun 17, 2026

wpfleger96 wants to merge 8 commits into main from duncan/leader-election-core

1 participant
