From 252df5601700821c7bee7b642c9f0d758103f85f Mon Sep 17 00:00:00 2001
From: "Yoshiaki Ueda (bootjp)"
Date: Fri, 12 Jun 2026 15:31:40 +0900
Subject: [PATCH 01/14] docs(design): propose multi-node multi-group bootstrap

---
 ...proposed_multinode_multigroup_bootstrap.md | 201 ++++++++++++++++++
 1 file changed, 201 insertions(+)
 create mode 100644 docs/design/2026_06_12_proposed_multinode_multigroup_bootstrap.md

diff --git a/docs/design/2026_06_12_proposed_multinode_multigroup_bootstrap.md b/docs/design/2026_06_12_proposed_multinode_multigroup_bootstrap.md
new file mode 100644
index 00000000..c9daf554
--- /dev/null
+++ b/docs/design/2026_06_12_proposed_multinode_multigroup_bootstrap.md
@@ -0,0 +1,201 @@
+# Multi-node multi-group bootstrap — standing up N nodes × M Raft groups at startup
+
+Status: Proposed
+Author: bootjp
+Date: 2026-06-12
+
+Sibling / prerequisite-for:
+- [2026_06_11_proposed_leader_balance_scheduler.md](2026_06_11_proposed_leader_balance_scheduler.md) §1.1a (PR0) + OQ-9 — this doc **is** that PR0. The leader-balance scheduler's transfer-issuing milestones (PR2/PR3) are blocked on a Raft group whose voter set spans more than one node; that topology cannot be declared at startup today, and OQ-9 resolved "option (a): extend the bootstrap/flag surface." This document is the design for option (a).
+- [2026_02_18_partial_hotspot_shard_split.md](2026_02_18_partial_hotspot_shard_split.md) Milestone 2 — cross-group range migration testing also needs ≥2 nodes hosting the same group to be meaningful (see §7).
+
+## 1. Background
+
+### 1.1 The gap: a real multi-node deployment can only run single-group today
+
+elastickv runs multiple Raft groups in one process (`--raftGroups id=addr,id=addr,…`, parsed by `parseRaftGroups`, `shard_config.go:61-99`; default group is the lowest ID, `defaultGroupID`, `shard_config.go:386-397`). Each group gets its own `raftGroupRuntime` with its own engine and its own gRPC listener at `rt.spec.address` (`startRaftServers`, `main.go:1610-1620`). This is a genuine multi-Raft-group runtime — **within one process**. What does **not** exist is the ability to spread any single group's voters across more than one node. Verified at file:line on `main`:
+
+- **`groupSpec` carries one address per group — this node's own listener, not a member list.** `type groupSpec struct { id uint64; address string }` (`shard_config.go:14-17`); `parseRaftGroups` parses each `id=addr` entry into exactly that one address (`shard_config.go:80-93`). There is no field on `groupSpec` for the *other* nodes that should vote in the group.
+- **`resolveBootstrapServers` rejects `--raftBootstrapMembers` whenever `len(groups) != 1`.** `if len(groups) != 1 { return nil, errors.WithStack(ErrBootstrapMembersRequireSingleGroup) }` (`main.go:744-748`, error defined `main.go:736`). So a multi-node initial membership can be declared only for a **single-group** deployment. `--raftBootstrapMembers` itself parses `id=host:port,…` into a flat `[]raftengine.Server` voter list (`parseRaftBootstrapMembers`, `shard_config.go:352-384`) — one list, applied to the one group.
+- **`buildRuntimeForGroup` passes the *same* `bootstrapServers` to *every* group.** It threads the single process-wide `bootstrapServers` slice into `factory.Create(...)` via `Peers:` with `LocalAddress: group.address` (`multiraft_runtime.go:234-254`). In a multi-group config `bootstrapServers` is `nil` (it can only be non-nil for `len(groups)==1`, per the guard above), so each group bootstraps with `Peers: nil`.
+- **A `nil`/single peer list bootstraps a single-member group — and never even builds a transport.** The etcd factory only constructs the inter-node gRPC transport `if len(peers) > 1` (`internal/raftengine/etcd/factory.go:49-52`). With `peers == nil` each group is a one-voter cluster with no transport, so `TransferLeadershipToServer` has no other voter to move leadership to, and no peer to replicate to.
+- **The integration tooling documents exactly this limitation.** `scripts/run-jepsen-m5-local.sh:5-22` records in prose that today's `validateShardRanges` / `buildShardGroups` "only support a 'single process hosts all groups' model — separate processes per group fail validation or race on Raft listeners," that it launches "ONE process hosting BOTH **single-member** groups," and that "True distributed multi-group is M6+ work."
+
+**Consequence:** the only deployable multi-group topology is one process hosting single-voter groups (the M5 Jepsen layout, and the `cmd/server/demo.go` single-group×3-node demo is the only multi-*node* topology — but it is single-*group*). There is no startup wiring that produces "N nodes, each a voter in all M groups."
+
+### 1.2 What already works (the primitives we compose)
+
+The single-group multi-node path is fully built and is the template this design generalizes:
+
+- **`--raftBootstrapMembers id=addr,…` → a voter `[]raftengine.Server`** (`parseRaftBootstrapMembers`, `shard_config.go:352-384`), validated against the local node (must include `--raftId`, local address must match the group address: `resolveBootstrapServers`, `main.go:752-768`).
+- **The factory builds a transport when `len(peers) > 1`** and wires it into `Open(...)` (`internal/raftengine/etcd/factory.go:41-90`). `Open` normalizes/validates peers, and on first open writes them to a persisted-peers file; on restart it reloads them and refuses to start if the configured list disagrees with the persisted cluster (`normalizePeers` / `validateOpenPeers` / `savePersistedPeers`, `internal/raftengine/etcd/engine.go:620-643`; `errClusterMismatch`, `:116`; `LoadPersistedPeers`, `internal/raftengine/etcd/peer_metadata.go:40`).
+- **The transport resolves peers by node ID → address from the bootstrap list** (`NewGRPCTransport(peers)` builds `map[nodeID]Peer`, `internal/raftengine/etcd/grpc_transport.go:67-86`; sends dial `peer.Address`, `:493-517`), and supports runtime membership churn via `UpsertPeer` / `RemovePeer` (`:145-170`) as conf-changes commit.
+- **Each group already gets its own listener and its own `RaftAdmin` service** (`startRaftServers` registers `RegisterOperationalServicesWithInterceptor(ctx, gs, rt.engine, …)` then `lc.Listen(ctx, "tcp", rt.spec.address)` per runtime, `main.go:1610-1615`). `AddVoter`/`AddLearner`/`PromoteLearner`/`RemoveServer` are reachable per group (`cmd/raftadmin/main.go:197-285`; engine `AddVoter`, `internal/raftengine/etcd/engine.go:1252-1257`).
+- **`cmd/server/demo.go` already stands up 3 nodes that bootstrap one shared group.** All three node configs set `raftBootstrap=true` and receive the **same** `raftPeers` list (all three `{Suffrage:"voter", ID, Address}`), `cmd/server/demo.go:180-219`. The comment at `:215-219` records the key etcd requirement: *"every member of a fresh cluster must bootstrap with the same peer list."* This is exactly the per-group bootstrap discipline §3 generalizes to M groups.
+- **Per-group data dir + `raft-engine` marker is already per-group.** `groupDataDir(baseDir, raftID, groupID, multi)` returns `…/raftID/group-N` in multi mode (`multiraft_runtime.go:110-115`); `ensureRaftEngineDataDir` writes/reads the `raft-engine` marker and refuses an engine mismatch *per dir* (`multiraft_runtime.go:117-151`). So idempotent-restart detection is already per group.
+
+The only missing piece is a **flag/parse/wiring path that gives each group its own multi-node voter set at bootstrap** instead of a single shared list rejected for multi-group.
+
+## 2. Goals and Non-Goals
+
+### 2.1 Goals
+
+1. Deploy **N nodes × M Raft groups** where **every group is a multi-voter Raft cluster** (each group's voter set spans ≥2 nodes), declarable entirely at process startup.
+2. A concrete, validated **flag surface** for per-group peer lists, with strict back-compat: every existing single-group flag (`--raftBootstrapMembers`, `--raftBootstrap`, `--raftGroups`) and the `cmd/server/demo.go` single-process demo behave **exactly as today**.
+3. **Deterministic, idempotent bootstrap**: a known node proposes the initial configuration for each group; restart re-detects existing state and does not re-bootstrap; partial-bootstrap failures are recoverable.
+4. **Reuse the existing per-group transport, listener, `RaftAdmin`, marker-dir, and persisted-peers machinery** (§1.2) — no new replication or wire surface for the data path.
+5. An **in-process integration harness** that stands up 3 nodes × 2 groups with every group multi-voter, so the leader-balance convergence test and hotspot-M2 cross-group tests have a topology to run against.
+
+### 2.2 Non-Goals
+
+1. **Dynamic group creation / deletion at runtime.** The set of groups (M) is fixed at startup. Creating a new group while the cluster runs is out of scope (it belongs to a future control-plane RPC).
+2. **Replica / leader rebalancing.** Moving where a group's voters live, or spreading leaderships, is **not** this doc — that is leader-balance (#953) and hotspot-split M2 (#945). This doc only stands up the static topology those features need.
+3. **Live topology expansion as the bootstrap mechanism.** Growing a group from one voter to many via `AddVoter`/`PromoteLearner` after bootstrap stays the supported **live-expansion** path (§5), but it is explicitly **not** the way this design declares the initial topology (per #953 OQ-9).
+4. **Heterogeneous group membership** (groups whose voter sets are different subsets of nodes). v1 targets **homogeneous** membership — every node is a voter in every group — matching the leader-balance scheduler's stated assumption (#953 §2.2 non-goal 5). Heterogeneous sets are a forward extension (§8 OQ-4); the flag syntax (§3.1) is chosen so it does not foreclose them.
+5. **Per-protocol address-map changes.** `--raftRedisMap` / `--raftDynamoMap` / `--raftS3Map` / `--raftSqsMap` map *Raft listener address → protocol listener address* and are orthogonal to voter-set membership; they are unchanged (§4.3).
+
+## 3. Design
+
+### 3.1 Flag surface — per-group peer lists
+
+**Decision: add a companion flag `--raftGroupPeers`, and lift the `len(groups)==1` guard in `resolveBootstrapServers`.** Keep `--raftGroups` (group→local-address) exactly as is; declare the *cross-node* voter set per group in a new flag.
+
+```
+--raftGroupPeers "1=n1@host1:5051,n2@host2:5051,n3@host3:5051;2=n1@host1:5054,n2@host2:5054,n3@host3:5054"
+```
+
+Grammar:
+- Group entries separated by `;` (matching the `--sqsFifoPartitionMap` precedent, which already uses `;` between queues and reserves `,` for the per-entry list, `parseSQSFifoPartitionMap`, `shard_config.go:174-196`).
+- Each entry is `groupID=member,member,…`.
+- Each `member` is `raftID@host:port` — the `@` separates the node's stable Raft ID (matching `--raftId` semantics) from its listener address for that group. (`raftID` is needed explicitly because etcd's bootstrap requires the same `id→address` mapping on every node, `cmd/server/demo.go:215-219`; the address alone is not the identity.)
+
+**Why a new flag rather than extending `--raftGroups` entry syntax.** `--raftGroups` entries are `id=addr` and that `addr` is *this node's own* listener (`groupSpec.address`, used as `LocalAddress`, `multiraft_runtime.go:248`). Overloading it to also carry the full member list would make every node's `--raftGroups` value identical across the cluster *except* that the local-address role would have to be inferred — error-prone. A separate `--raftGroupPeers` keeps "what do I listen on" (`--raftGroups`) cleanly separate from "who are the voters" (`--raftGroupPeers`), and mirrors how single-group already separates `--address`/`--raftGroups` from `--raftBootstrapMembers`.
+
+**Back-compat rules (strict):**
+- `--raftGroupPeers` empty ⇒ behavior is **byte-for-byte today's**: `resolveBootstrapServers` runs unchanged (single-group `--raftBootstrapMembers` still works; multi-group still bootstraps single-member groups). No existing deployment or test changes.
+- `--raftBootstrapMembers` and `--raftGroupPeers` are **mutually exclusive** — setting both is a validation error (`--raftBootstrapMembers` is the single-group spelling; `--raftGroupPeers` is the multi-group spelling). Single-group deployments may continue to use `--raftBootstrapMembers` and never need to learn the new flag.
+- `cmd/server/demo.go` is unchanged: it bootstraps one group with a shared peer list via `raftPeers` directly (`cmd/server/demo.go:180-219`), not via these flags.
+
+**Validation rules** (fail fast at startup, before any engine opens — same posture as `parseRaftGroups`/`validateShardRanges`):
+1. Every group ID in `--raftGroupPeers` must appear in `--raftGroups`, and (v1 homogeneous goal) **every** group in `--raftGroups` must appear in `--raftGroupPeers` when the flag is non-empty. A group with no peer list would silently fall back to single-member — a foot-gun we reject.
+2. Each group's member list must **include the local node**: a `member` whose `raftID == --raftId` must be present, and its `host:port` must equal that group's `--raftGroups` local address (`groupSpec.address`). This is the per-group generalization of the existing single-group check `ErrBootstrapMembersLocalAddrMismatch` (`main.go:760-765`).
+3. No duplicate `raftID` within a group (mirrors `parseRaftBootstrapMembers`'s `duplicate id` check, `shard_config.go:373-375`).
+4. v1 homogeneity check: the set of `raftID`s must be **identical across all groups** (every node votes in every group). Violations are rejected with a clear error pointing at the first divergent group. (Relaxing this is OQ-4.)
+5. Each member's address must be non-empty and well-formed `host:port` (reuse existing address parsing).
+
+### 3.2 Bootstrap semantics
+
+The wiring change is small and local: instead of one process-wide `bootstrapServers` threaded into every group, **resolve a per-group `[]raftengine.Server` and pass each group its own list**. Concretely, `buildShardGroups` / `buildRuntimeForGroup` change from a single `bootstrapServers []raftengine.Server` parameter (`multiraft_runtime.go:234`, `main.go:777`) to a `bootstrapServersFor func(groupID uint64) []raftengine.Server` lookup (or a `map[uint64][]raftengine.Server`), built once from the parsed `--raftGroupPeers`. Everything downstream — the factory's `len(peers) > 1` transport gate (`factory.go:50`), `Open`'s peer normalize/validate/persist (`engine.go:620-643`), the marker dir, the per-group listener — already operates per group and needs no change.
+
+**Which node proposes the initial conf (decision: every node bootstraps with the identical per-group peer list — the etcd model — NOT a single designated proposer).** etcd/raft's bootstrap model is that **every** founding member calls `Bootstrap` with the **same** `ConfState`/peer list; raft then elects a leader among them. This is exactly what `cmd/server/demo.go` does for the single group (`raftBootstrap=true` on all three nodes with the shared `raftPeers`, `:204-219`) and what `resolveBootstrapServers` sets up for single-group (`bootstrap = *raftBootstrap || len(bootstrapServers) > 0`, `main.go:534`). We generalize it: when `--raftGroupPeers` is set, **every group on every node bootstraps with that group's full peer list**, and `bootstrap` is implied true for those groups (the operator does not also need `--raftBootstrap`; see the interaction rule below).
+
+We do **not** invent a "lexicographically-smallest peer proposes, others wait-and-join" protocol. That single-proposer pattern is the *AddVoter-composition* path (§5), not the bootstrap path — and adopting it for bootstrap would mean the non-proposer nodes start with an empty conf and must be added one-by-one, which is fragile (ordering, the proposer must be up first and must be leader) and is exactly the "manual AddVoter dance in every test harness" #953 OQ-9 rejected. The all-nodes-same-list model has no designated-proposer ordering requirement: nodes can start in any order, and raft elects a leader once a quorum is up.
+
+**Idempotency on restart (decision: persisted-peers + marker dir already give this — per group).** On first open of a group dir, `Open` writes the normalized peer set to the persisted-peers file (`savePersistedPeers`, `engine.go:643`; format in `peer_metadata.go:205`). On restart, the factory **loads the persisted peers and uses them in preference to the flag-supplied list** (`factory.go:43-47`), and `Open` refuses to start if the configured cluster disagrees with what is persisted (`validateOpenPeers` → `errClusterMismatch`, `engine.go:632`, `:116`). So:
+- A restart with the same `--raftGroupPeers` re-loads the same persisted set per group → no re-bootstrap, no data risk.
+- A restart with a *different* `--raftGroupPeers` than what a group already persisted **fails fast** with `errClusterMismatch` rather than silently re-bootstrapping over committed data. (Membership changes after bootstrap go through `AddVoter`/`RemoveServer`, which rewrite the persisted set, §5.)
+- The `raft-engine` marker (`ensureRaftEngineDataDir`, `multiraft_runtime.go:117-151`) independently guards against opening a group dir under the wrong engine type — unchanged, already per group.
+
+**`bootstrap` flag interaction (decision: `--raftGroupPeers` implies bootstrap=true for all groups; `--raftBootstrap` stays for the single-group/demo path).** Mirror the existing single-group rule `bootstrap = *raftBootstrap || len(bootstrapServers) > 0` (`main.go:534`): when `--raftGroupPeers` is non-empty, the resolved bootstrap flag is true for every group (each has a non-empty peer list). `--raftBootstrap` continues to mean "bootstrap" for deployments that don't use `--raftGroupPeers`. Setting `--raftBootstrap=false` together with `--raftGroupPeers` is a no-op contradiction for a fresh dir — we treat a non-empty `--raftGroupPeers` as authoritative (bootstrap=true), and document it. (On a *restart* the persisted-peers path takes over regardless, so the bootstrap flag is moot once a dir has state — same as today.)
+
+**Partial-bootstrap failure modes and recovery:**
+- *One node never comes up.* With an N-voter group, raft tolerates up to ⌊(N−1)/2⌋ down at bootstrap and still elects a leader once a quorum starts. A 3-voter group forms with 2 up. The down node joins when it starts (its dir is fresh → bootstraps with the same list → catches up via snapshot/log). No operator action.
+- *A node bootstrapped with the wrong list.* Caught at `Open` by `errClusterMismatch` (mismatch vs. persisted) for a restart, or — for a fresh dir — caught by raft refusing to make progress because the configurations disagree. Recovery: stop the misconfigured node, wipe its (fresh) group dir, restart with the correct `--raftGroupPeers`. Because the failure is fail-fast and the bad node holds no committed data yet, this is safe.
+- *A node crashes mid-bootstrap after writing the persisted file but before committing entries.* Restart re-loads the persisted peers (`factory.go:43-47`) and rejoins; the persisted file is written atomically (`writePersistedPeersFile`, `peer_metadata.go:205`), so a torn write is not a partial state. No special handling beyond what single-group already has.
+
+### 3.3 Determinism and testability of the bootstrapper
+
+There is no "elect a bootstrapper" step to test, because the model is all-nodes-same-list (§3.2). What is unit-testable and must be deterministic is **flag parsing → per-group `[]raftengine.Server`**: given a `--raftGroupPeers` string + `--raftGroups` + `--raftId`, the resolver produces a fixed `map[uint64][]raftengine.Server` (sorted by member raftID for reproducibility) or a precise validation error. This is a pure function (like `parseRaftGroups` / `parseRaftBootstrapMembers`) and gets table-driven tests (§6). The leader that emerges is raft's business, not ours.
+
+### 3.4 Per-group transport / addressing
+
+**Today (verified):** the gRPC raft transport resolves a peer by deriving its 64-bit node ID and looking up `host:port` in the bootstrap-seeded `map[nodeID]Peer` (`NewGRPCTransport`, `grpc_transport.go:67-86`; `peerFor`/dial, `:493-517`). Membership changes update that map via `UpsertPeer`/`RemovePeer` (`:145-170`). Each group has its **own** transport instance, created by the factory only when `len(peers) > 1` (`factory.go:49-52`), and registered on that group's own listener in `startRaftServers` (`main.go:1610-1615`).
+
+**What changes:** nothing in the transport itself. Once each group receives its own multi-node peer list (§3.2), the factory's `len(peers) > 1` check trips per group, a transport is built per group, and it resolves that group's peers from that group's list. The change is entirely upstream (feeding per-group lists in); the transport is already per-group and address-map driven.
+
+**One listener per group vs. per-group ports (decision: keep one listener per group — the existing `rt.spec.address` model — i.e. one port per group per node).** elastickv already binds **one gRPC listener per group per node** at `rt.spec.address` (`main.go:1613`), multiplexing the data-plane gRPC services, the per-group `RaftAdmin`, *and* that group's raft transport onto it (`rt.registerGRPC(gs)` + `RegisterOperationalServices…` + the transport's `Register`, `main.go:1605-1613`). This matches the `5005{1,2,3}` (group 1) / `5005{4,5,6}` (group 2) port convention already used by the M5 script and the demo. We keep it:
+- It is the established convention and needs zero transport/listener changes.
+- A single shared listener multiplexing *all* groups' raft traffic was considered and rejected: it would require demultiplexing by group ID inside the transport (the `EtcdRaft` service is currently per-engine, registered once per listener, `grpc_transport.go:88-…`), a larger change with no operational benefit at the scales this targets.
+- Per-group ports keep each group's raft transport, `RaftAdmin`, and metrics cleanly attributable per group — useful for the leader-balance forward path (#953 §3.4 dials `rt.spec.address` of the source group's leader) and for partition nemeses that want to isolate one group.
+
+So the addressing model is: **N nodes × M groups ⇒ N×M (raftID, host:port) listener endpoints**, exactly the cross product `--raftGroupPeers` declares. Each node opens M listeners (one per group), each member of a group dials the other members' per-group endpoints.
+
+## 4. Unchanged surfaces (explicitly)
+
+### 4.1 Single-group path
+With `--raftGroupPeers` empty, `resolveBootstrapServers` runs unchanged (`main.go:742-768`). `--raftBootstrapMembers` still works for single-group, including its three local-node validation errors (`main.go:752-768`).
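
The back-compat rule of §4.1 and the per-group lookup of §3.2 can be sketched together. This is a minimal illustration only, not the project's code: `Server` stands in for `raftengine.Server`, and `bootstrapPlan` is a hypothetical helper name. It assumes the parsed `--raftGroupPeers` arrives as a map that is nil when the flag is empty, and that the legacy list is whatever `resolveBootstrapServers` returns today.

```go
package main

import "fmt"

// Server is a hypothetical stand-in for raftengine.Server.
type Server struct {
	ID      string
	Address string
}

// bootstrapPlan (hypothetical name) resolves what one group bootstraps with.
// perGroup is the parsed --raftGroupPeers (nil when the flag is empty);
// legacy is the single process-wide list used on today's path.
func bootstrapPlan(groupID uint64, perGroup map[uint64][]Server, legacy []Server, raftBootstrap bool) ([]Server, bool) {
	servers := legacy // empty flag: today's behavior, untouched
	if len(perGroup) > 0 {
		servers = perGroup[groupID] // new flag: each group gets its own list
	}
	// Mirrors the rule quoted in §3.2: bootstrap = *raftBootstrap || len(bootstrapServers) > 0.
	return servers, raftBootstrap || len(servers) > 0
}

func main() {
	perGroup := map[uint64][]Server{
		1: {{"n1", "host1:5051"}, {"n2", "host2:5051"}, {"n3", "host3:5051"}},
		2: {{"n1", "host1:5054"}, {"n2", "host2:5054"}, {"n3", "host3:5054"}},
	}
	for _, gid := range []uint64{1, 2} {
		servers, bootstrap := bootstrapPlan(gid, perGroup, nil, false)
		fmt.Printf("group %d: %d voters, bootstrap=%v\n", gid, len(servers), bootstrap)
	}
}
```

A map is used here for brevity; §8 OQ-3 leaves the choice between a map and a `func(groupID)` carrier open.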
+
+### 4.2 The in-process demo
+`cmd/server/demo.go` bootstraps one group across 3 nodes via `raftPeers` (`:204-219`); it never reads `--raftGroupPeers`. Unchanged.
+
+### 4.3 Per-protocol address maps
+`--raftRedisMap` / `--raftDynamoMap` / `--raftS3Map` / `--raftSqsMap` map *Raft listener address → protocol listener address* (`parseRaftAddressMap`, `shard_config.go:327-350`; consumed in `multiraft_runtime.go` group→protocol wiring). They are about *where a group exposes its protocol endpoint*, not *who votes in the group*, so they are orthogonal and unchanged. (In a true multi-node deployment an operator already supplies these per node; the new flag does not alter that.)
+
+### 4.4 Encryption startup ordering
+The encryption writer-registration startup path (`main_encryption_registration.go`) is **leader-relative, not single-node-per-group**, so it already tolerates multi-voter groups. `buildProcessStartRegistrationGate` proposes through the **default group** and, when this node is not the default-group leader, **forwards to the current leader** over `EncryptionAdmin` with bounded retry (`proposeWriterRegistration`, `:472-520`; `IsLeader()`/`RaftLeader()` gating, `:482-511`). It assumes only that a default-group leader exists and is reachable — which is *more* true with a multi-voter default group, not less. The `raft-engine` marker and per-group dirs are already per group (§1.2). No encryption guard assumes a single-node-per-group bootstrap order; nothing here changes. (The five-lens "data consistency" review per PR must still confirm the registration forward path behaves when the default group is mid-election at boot, but that is an existing property, not new.)
+
+## 5. Alternative considered — AddVoter-composition
+
+**Bootstrap each group single-member, then grow it to N voters at runtime via `AddVoter`/`PromoteLearner`.** The primitives exist and are exercised: engine `AddVoter` (`internal/raftengine/etcd/engine.go:1252-1257`), `AddLearner`/`PromoteLearner` (`:1640-1689`), the per-group `RaftAdmin` service (`cmd/raftadmin/main.go:258-285`), and conf-change apply that calls `UpsertPeer` (`applyConfigChange`, `engine.go:2456`).
+
+**Rejected as the *bootstrap* mechanism** (consistent with #953 OQ-9 "option (b) stays for live expansion, not bootstrap"):
+- It needs a designated first node that is up and leader before any `AddVoter` lands, plus an orchestration sequence (add each voter, wait for it to catch up, repeat) — fragile under etcd's randomized elections, exactly the failure `cmd/server/demo.go:204-219` calls out for the old `joinCluster` approach it deleted.
+- Every test harness and every operator runbook would have to replay that dance to get a multi-voter group, the per-test cost OQ-9 explicitly wanted to avoid.
+- It produces a *transient* single-voter window at startup where the group has no fault tolerance and no other transfer target — the opposite of what the leader-balance scheduler needs to test against.
+
+**Kept as the live-expansion path.** Growing an *already-running* group (add a 4th node to a 3-voter group, replace a dead node) is precisely what `AddVoter`/`RemoveServer` are for, and this design does not touch them. The persisted-peers file is rewritten by conf-change apply, so a node added via `AddVoter` and then restarted reloads the grown set (`factory.go:43-47`) — bootstrap and live-expansion compose cleanly.
+
+## 6. Rollout / testing
+
+### 6.1 Unit
+- **Flag parsing** (`shard_config.go`, table-driven, co-located `*_test.go`): `--raftGroupPeers` grammar — multiple groups (`;`-separated), `raftID@host:port` members, whitespace, empty ⇒ nil; every validation rule of §3.1 (unknown group, missing-group-when-non-empty, local-node-absent, local-addr-mismatch, duplicate raftID, homogeneity violation, mutual-exclusion with `--raftBootstrapMembers`). Pure-function determinism: same input ⇒ identical sorted `map[uint64][]Server`.
+- **Per-group bootstrap-server resolution**: the new `bootstrapServersFor(groupID)` returns each group's own list; the empty-flag path returns today's behavior unchanged (regression-locks back-compat).
+- **Restart idempotency** (engine-level, `internal/raftengine/etcd/`): re-open a group dir with the same list ⇒ no re-bootstrap; re-open with a divergent list ⇒ `errClusterMismatch` (this path already has coverage for single-group; add the multi-group-dir case).
+
+### 6.2 Integration — 3-node × 2-group in-process harness
+Stand up **3 nodes, 2 groups, every group a 3-voter Raft**, in one test process (extend `cmd/server/demo.go`'s pattern, or a new `internal/`-level harness so it is `go test`-runnable without the binary). Assertions:
+- Each group's `Configuration()` reports 3 **voter** members on 3 distinct node IDs (the smoke #953 PR0 calls for: "a group has voters on ≥2 distinct nodes").
+- `TransferLeadershipToServer` between two nodes of the same group **succeeds** (the capability the leader-balance scheduler is blocked on).
+- Restart one node: it reloads persisted peers and rejoins both groups without re-bootstrap.
+- Kill a minority (1 of 3) in a group: the group keeps a leader and serves; the killed node rejoins on restart.
+
+This harness is the concrete deliverable that unblocks #953's convergence test (which "requires a topology where each of N groups has voters on ≥2 of 3 nodes — which only exists after PR0", #953 §5).
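
The flag-parsing tests in §6.1 exercise a pure function from the `--raftGroupPeers` string to a sorted per-group voter list. The sketch below shows one way such a parser could look under the §3.1 grammar. It is an assumption-laden illustration, not the proposed implementation: `Member` stands in for `raftengine.Server`, `parseRaftGroupPeers` is a hypothetical name, and only the grammar, duplicate-ID, and `host:port` checks are covered. The local-node, homogeneity, and `--raftGroups` cross-checks of §3.1 need extra inputs and are left out.

```go
package main

import (
	"fmt"
	"net"
	"sort"
	"strconv"
	"strings"
)

// Member is a hypothetical stand-in for raftengine.Server.
type Member struct {
	ID      string
	Address string
}

// parseRaftGroupPeers (hypothetical name) parses
// "1=n1@h1:5051,n2@h2:5051;2=n1@h1:5054,n2@h2:5054" into
// groupID -> members sorted by raft ID. An empty flag yields a nil map.
func parseRaftGroupPeers(raw string) (map[uint64][]Member, error) {
	raw = strings.TrimSpace(raw)
	if raw == "" {
		return nil, nil
	}
	out := make(map[uint64][]Member)
	for _, entry := range strings.Split(raw, ";") {
		idPart, list, ok := strings.Cut(strings.TrimSpace(entry), "=")
		if !ok {
			return nil, fmt.Errorf("invalid group entry %q", entry)
		}
		gid, err := strconv.ParseUint(strings.TrimSpace(idPart), 10, 64)
		if err != nil {
			return nil, fmt.Errorf("invalid group id %q: %w", idPart, err)
		}
		if _, dup := out[gid]; dup {
			return nil, fmt.Errorf("duplicate group %d", gid)
		}
		seen := make(map[string]bool)
		var members []Member
		for _, m := range strings.Split(list, ",") {
			id, addr, ok := strings.Cut(strings.TrimSpace(m), "@")
			if !ok || id == "" {
				return nil, fmt.Errorf("group %d: invalid member %q", gid, m)
			}
			if _, _, err := net.SplitHostPort(addr); err != nil {
				return nil, fmt.Errorf("group %d: invalid address %q: %w", gid, addr, err)
			}
			if seen[id] {
				return nil, fmt.Errorf("group %d: duplicate id %q", gid, id)
			}
			seen[id] = true
			members = append(members, Member{ID: id, Address: addr})
		}
		// Sort by raft ID so the same input always yields the same output.
		sort.Slice(members, func(i, j int) bool { return members[i].ID < members[j].ID })
		out[gid] = members
	}
	return out, nil
}

func main() {
	peers, err := parseRaftGroupPeers("1=n2@host2:5051,n1@host1:5051;2=n1@host1:5054,n2@host2:5054")
	fmt.Println(peers, err)
}
```

Because the function has no side effects, each §3.1 rule it covers maps to one row of a table-driven test: an input string and either an expected map or an expected error.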
+
+### 6.3 Jepsen
+Extend the multi-node story to Jepsen as a **later milestone** (noted, not v1): generalize `scripts/run-jepsen-local.sh` / `run-jepsen-m5-local.sh` from "one process hosting single-member groups" (`run-jepsen-m5-local.sh:5-22`) to **separate processes per node, each hosting all M groups as multi-voter Raft**, so partition/kill nemeses can isolate one node from a group's quorum (impossible under the single-process layout, `run-jepsen-m5-local.sh:16-18`). Acceptance bar: existing Redis/DynamoDB workloads show no new anomalies on the true multi-node multi-group topology. This is the M-script work the existing comment defers to "M6+".
+
+### 6.4 Milestone / PR breakdown
+
+| PR | Scope | Tests | Shippable alone? |
+|---|---|---|---|
+| **PR-A** | Flag + parse + validation: `--raftGroupPeers`, the §3.1 grammar and all validation rules, mutual-exclusion with `--raftBootstrapMembers`. No wiring change yet (parsed result unused). | Unit (§6.1) flag-parse table tests. | Yes — pure parsing, zero behavior change (result unconsumed). |
+| **PR-B** | Wiring: lift the `len(groups)==1` guard path for the new flag; thread per-group `bootstrapServersFor` through `buildShardGroups`/`buildRuntimeForGroup`; resolve bootstrap=true per group. Each group now opens multi-voter. | Unit (§6.1) per-group resolution + restart idempotency; smoke that a 2-group config opens 2 transports. | After PR-A — the core capability. |
+| **PR-C** | In-process 3-node × 2-group integration harness (§6.2) + the leader-transfer-between-nodes smoke. | Integration (§6.2). | After PR-B — the deliverable #953 PR0 / hotspot-M2 need. |
+| **PR-D (later)** | Jepsen: true multi-node multi-group runner (§6.3). | Existing workloads, no-new-anomalies bar. | After PR-C; the "M6+" item. |
+
+Each PR carries the five-lens self-review (CLAUDE.md). Lens highlights for this change: **data loss** — restart must never re-bootstrap over committed data (persisted-peers + `errClusterMismatch`, §3.2); **concurrency/distributed** — any node-start order must form each group (all-same-list model, §3.2), partial-quorum bootstrap recovers; **data consistency** — a divergent `--raftGroupPeers` on restart fails fast, never silently forks a group's membership.
+
+### 6.5 Doc lifecycle
+`*_proposed_*` → `*_partial_*` after PR-B (the topology is deployable) → `*_implemented_*` after PR-C (integration harness lands). `git mv`, propose date fixed.
+
+## 7. Cross-doc impact (explicit)
+
+- **Unblocks leader-balance PR2/PR3 (#953).** #953 §1.1a names this as PR0 and §5/§4 state PR2/PR3 are blocked on "a topology where each of N groups has voters on ≥2 of 3 nodes." PR-C's harness is precisely the convergence-test topology #953 §5 requires; PR-B delivers the `TransferLeadershipToServer`-has-a-target precondition #953 §1.1a calls out. #953 PR1 (observe-only) is **not** blocked on this and can ship independently.
+- **Unblocks hotspot-split Milestone 2 cross-group migration testing (#945).** Cross-group range migration is only meaningfully testable when the source and destination groups each have voters on ≥2 nodes (so migration races real replication, not a single in-process voter). The §6.2 harness provides that; M2's migration tests can build on it.
+- **No change to leader-balance's own design.** This doc resolves #953 OQ-9 with option (a) as #953 recommended; it does not alter the scheduler's policy, transfer mechanism, or proto extension.
+
+## 8. Open Questions
+
+1. **OQ-1 — Flag spelling: `--raftGroupPeers` companion flag vs. extending `--raftGroups` entries.** §3.1 recommends the companion flag (clean separation of "my listener" vs. "the voter set"; mirrors single-group `--raftBootstrapMembers`). Confirm before PR-A, since it fixes the operator-facing surface.
+2. **OQ-2 — Should `--raftBootstrap` be *required* alongside `--raftGroupPeers`, or implied?** §3.2 recommends implied (non-empty `--raftGroupPeers` ⇒ bootstrap=true per group, matching `bootstrap = *raftBootstrap || len(bootstrapServers) > 0`, `main.go:534`). Alternative: require `--raftBootstrap` explicitly for symmetry with single-group. Confirm before PR-B.
+3. **OQ-3 — Per-group bootstrap-server carrier: `func(groupID) []Server` vs. `map[uint64][]Server`.** Implementation detail for threading through `buildShardGroups`/`buildRuntimeForGroup` (`main.go:777`, `multiraft_runtime.go:234`). Map is simpler; func defers construction. Either is fine; pick at PR-B.
+4. **OQ-4 — Heterogeneous group membership (groups on a subset of nodes).** v1 enforces homogeneity (§3.1 rule 4) to match #953 §2.2. The `raftID@host:port` member syntax already expresses arbitrary per-group sets, so relaxing rule 4 later needs no grammar change — but #953's observation/forward paths assume homogeneity, so we keep the guard until a consumer needs otherwise. Should the validator's homogeneity check be a hard error (v1) or a warning that allows heterogeneous sets for advanced operators? (Recommendation: hard error in v1.)
+5. **OQ-5 — Mixed bootstrap + learner start.** `--raftJoinAsLearner` (`buildRuntimeForGroup`'s `joinAsLearner`, `multiraft_runtime.go:238`) lets a node join an existing cluster as a learner. Should `--raftGroupPeers` interoperate with a per-group learner bootstrap (some members start as learners, promoted later), or is learner-join strictly a live-expansion concern (§5)? (Recommendation: learners are live-expansion only in v1; `--raftGroupPeers` declares voters.)
+6. **OQ-6 — Single shared raft listener multiplexing all groups.** §3.4 keeps one listener per group (the `5005{1..6}` convention). Is the per-group-port model acceptable at the target scale, or is a single multiplexed raft listener (demux by group ID) worth the transport change for very high M? (Recommendation: per-group ports for v1; revisit only if port count becomes an operational problem.)
+
+## 9. Lifecycle
+
+This document begins as `*_proposed_*`. Per CLAUDE.md / `docs/design/README.md`:
+- Rename to `*_partial_*` after PR-B (multi-voter groups deployable at startup), recording which PRs shipped.
+- Rename to `*_implemented_*` after PR-C (in-process integration harness landed), with the Jepsen runner (PR-D) tracked as a follow-on.
+
+Use `git mv` so history follows the rename. The propose date (2026-06-12) and slug stay fixed.

From 9dda274b78f6a841f2a22772ecd3b1b465ee0e31 Mon Sep 17 00:00:00 2001
From: "Yoshiaki Ueda (bootjp)"
Date: Sun, 14 Jun 2026 16:52:53 +0900
Subject: [PATCH 02/14] docs: address bootstrap design review feedback

---
 ...proposed_multinode_multigroup_bootstrap.md | 51 ++++++++++++++-----
 1 file changed, 37 insertions(+), 14 deletions(-)

diff --git a/docs/design/2026_06_12_proposed_multinode_multigroup_bootstrap.md b/docs/design/2026_06_12_proposed_multinode_multigroup_bootstrap.md
index c9daf554..a8dfc5d2 100644
--- a/docs/design/2026_06_12_proposed_multinode_multigroup_bootstrap.md
+++ b/docs/design/2026_06_12_proposed_multinode_multigroup_bootstrap.md
@@ -15,7 +15,7 @@ Sibling / prerequisite-for:
 elastickv runs multiple Raft groups in one process (`--raftGroups id=addr,id=addr,…`, parsed by `parseRaftGroups`, `shard_config.go:61-99`; default group is the lowest ID, `defaultGroupID`, `shard_config.go:386-397`). Each group gets its own `raftGroupRuntime` with its own engine and its own gRPC listener at `rt.spec.address` (`startRaftServers`, `main.go:1610-1620`). This is a genuine multi-Raft-group runtime — **within one process**. What does **not** exist is the ability to spread any single group's voters across more than one node. Verified at file:line on `main`:
 
 - **`groupSpec` carries one address per group — this node's own listener, not a member list.** `type groupSpec struct { id uint64; address string }` (`shard_config.go:14-17`); `parseRaftGroups` parses each `id=addr` entry into exactly that one address (`shard_config.go:80-93`). There is no field on `groupSpec` for the *other* nodes that should vote in the group.
-- **`resolveBootstrapServers` rejects `--raftBootstrapMembers` whenever `len(groups) != 1`.** `if len(groups) != 1 { return nil, errors.WithStack(ErrBootstrapMembersRequireSingleGroup) }` (`main.go:744-748`, error defined `main.go:736`). So a multi-node initial membership can be declared only for a **single-group** deployment. `--raftBootstrapMembers` itself parses `id=host:port,…` into a flat `[]raftengine.Server` voter list (`parseRaftBootstrapMembers`, `shard_config.go:352-384`) — one list, applied to the one group.
+- **`resolveBootstrapServers` rejects `--raftBootstrapMembers` whenever `len(groups) != 1`.** `if len(groups) != 1 { return nil, errors.WithStack(ErrBootstrapMembersRequireSingleGroup) }` (`main.go:746-748`, error defined `main.go:736`). So a multi-node initial membership can be declared only for a **single-group** deployment. `--raftBootstrapMembers` itself parses `id=host:port,…` into a flat `[]raftengine.Server` voter list (`parseRaftBootstrapMembers`, `shard_config.go:352-384`) — one list, applied to the one group.
 - **`buildRuntimeForGroup` passes the *same* `bootstrapServers` to *every* group.** It threads the single process-wide `bootstrapServers` slice into `factory.Create(...)` via `Peers:` with `LocalAddress: group.address` (`multiraft_runtime.go:234-254`). In a multi-group config `bootstrapServers` is `nil` (it can only be non-nil for `len(groups)==1`, per the guard above), so each group bootstraps with `Peers: nil`.
- **A `nil`/single peer list bootstraps a single-member group — and never even builds a transport.** The etcd factory only constructs the inter-node gRPC transport `if len(peers) > 1` (`internal/raftengine/etcd/factory.go:49-52`). With `peers == nil` each group is a one-voter cluster with no transport, so `TransferLeadershipToServer` has no other voter to move leadership to, and no peer to replicate to. - **The integration tooling documents exactly this limitation.** `scripts/run-jepsen-m5-local.sh:5-22` records in prose that today's `validateShardRanges` / `buildShardGroups` "only support a 'single process hosts all groups' model — separate processes per group fail validation or race on Raft listeners," that it launches "ONE process hosting BOTH **single-member** groups," and that "True distributed multi-group is M6+ work." @@ -27,7 +27,7 @@ elastickv runs multiple Raft groups in one process (`--raftGroups id=addr,id=add The single-group multi-node path is fully built and is the template this design generalizes: - **`--raftBootstrapMembers id=addr,…` → a voter `[]raftengine.Server`** (`parseRaftBootstrapMembers`, `shard_config.go:352-384`), validated against the local node (must include `--raftId`, local address must match the group address: `resolveBootstrapServers`, `main.go:752-768`). -- **The factory builds a transport when `len(peers) > 1`** and wires it into `Open(...)` (`internal/raftengine/etcd/factory.go:41-90`). `Open` normalizes/validates peers, and on first open writes them to a persisted-peers file; on restart it reloads them and refuses to start if the configured list disagrees with the persisted cluster (`normalizePeers` / `validateOpenPeers` / `savePersistedPeers`, `internal/raftengine/etcd/engine.go:620-643`; `errClusterMismatch`, `:116`; `LoadPersistedPeers`, `internal/raftengine/etcd/peer_metadata.go:40`). +- **The factory builds a transport when `len(peers) > 1`** and wires it into `Open(...)` (`internal/raftengine/etcd/factory.go:41-90`). 
`Open` normalizes/validates peers, and on first open writes them to a persisted-peers file (`normalizePeers` / `validateOpenPeers` / `savePersistedPeers`, `internal/raftengine/etcd/engine.go:620-643`; `LoadPersistedPeers`, `internal/raftengine/etcd/peer_metadata.go:40`). On restart, current code reloads the persisted list before opening (`factory.go:43-47`; `normalizeOpenConfig`, `engine.go:3298-3306`) and `validateOpenPeers` protects the persisted snapshot's ConfState against that loaded list (`engine.go:3313-3327`; `errClusterMismatch`, `:116`). It does **not** yet compare a newly supplied flag list against the persisted file after reload; PR-B must add that explicit configured-list-vs-persisted-list validation for `--raftGroupPeers`. - **The transport resolves peers by node ID → address from the bootstrap list** (`NewGRPCTransport(peers)` builds `map[nodeID]Peer`, `internal/raftengine/etcd/grpc_transport.go:67-86`; sends dial `peer.Address`, `:493-517`), and supports runtime membership churn via `UpsertPeer` / `RemovePeer` (`:145-170`) as conf-changes commit. - **Each group already gets its own listener and its own `RaftAdmin` service** (`startRaftServers` registers `RegisterOperationalServicesWithInterceptor(ctx, gs, rt.engine, …)` then `lc.Listen(ctx, "tcp", rt.spec.address)` per runtime, `main.go:1610-1615`). `AddVoter`/`AddLearner`/`PromoteLearner`/`RemoveServer` are reachable per group (`cmd/raftadmin/main.go:197-285`; engine `AddVoter`, `internal/raftengine/etcd/engine.go:1252-1257`). - **`cmd/server/demo.go` already stands up 3 nodes that bootstrap one shared group.** All three node configs set `raftBootstrap=true` and receive the **same** `raftPeers` list (all three `{Suffrage:"voter", ID, Address}`), `cmd/server/demo.go:180-219`. The comment at `:215-219` records the key etcd requirement: *"every member of a fresh cluster must bootstrap with the same peer list."* This is exactly the per-group bootstrap discipline §3 generalizes to M groups. 
@@ -41,7 +41,7 @@ The only missing piece is a **flag/parse/wiring path that gives each group its o 1. Deploy **N nodes × M Raft groups** where **every group is a multi-voter Raft cluster** (each group's voter set spans ≥2 nodes), declarable entirely at process startup. 2. A concrete, validated **flag surface** for per-group peer lists, with strict back-compat: every existing single-group flag (`--raftBootstrapMembers`, `--raftBootstrap`, `--raftGroups`) and the `cmd/server/demo.go` single-process demo behave **exactly as today**. -3. **Deterministic, idempotent bootstrap**: a known node proposes the initial configuration for each group; restart re-detects existing state and does not re-bootstrap; partial-bootstrap failures are recoverable. +3. **Deterministic, idempotent bootstrap**: every founding node starts each group with the same initial configuration; restart re-detects existing state and does not re-bootstrap; partial-bootstrap failures are recoverable. 4. **Reuse the existing per-group transport, listener, `RaftAdmin`, marker-dir, and persisted-peers machinery** (§1.2) — no new replication or wire surface for the data path. 5. An **in-process integration harness** that stands up 3 nodes × 2 groups with every group multi-voter, so the leader-balance convergence test and hotspot-M2 cross-group tests have a topology to run against. @@ -68,6 +68,28 @@ Grammar: - Each entry is `groupID=member,member,…`. - Each `member` is `raftID@host:port` — the `@` separates the node's stable Raft ID (matching `--raftId` semantics) from its listener address for that group. (`raftID` is needed explicitly because etcd's bootstrap requires the same `id→address` mapping on every node, `cmd/server/demo.go:215-219`; the address alone is not the identity.) 
+Concrete 3-node × 2-group local example (all nodes share the same `--raftGroupPeers`; `--raftGroups` and protocol maps name only that node's local listeners): + +``` +# node n1 +--raftId n1 \ +--raftGroups "1=127.0.0.1:5051,2=127.0.0.1:5054" \ +--raftGroupPeers "1=n1@127.0.0.1:5051,n2@127.0.0.1:5052,n3@127.0.0.1:5053;2=n1@127.0.0.1:5054,n2@127.0.0.1:5055,n3@127.0.0.1:5056" \ +--raftRedisMap "127.0.0.1:5051=127.0.0.1:6379,127.0.0.1:5054=127.0.0.1:6382" + +# node n2 +--raftId n2 \ +--raftGroups "1=127.0.0.1:5052,2=127.0.0.1:5055" \ +--raftGroupPeers "1=n1@127.0.0.1:5051,n2@127.0.0.1:5052,n3@127.0.0.1:5053;2=n1@127.0.0.1:5054,n2@127.0.0.1:5055,n3@127.0.0.1:5056" \ +--raftRedisMap "127.0.0.1:5052=127.0.0.1:6380,127.0.0.1:5055=127.0.0.1:6383" + +# node n3 +--raftId n3 \ +--raftGroups "1=127.0.0.1:5053,2=127.0.0.1:5056" \ +--raftGroupPeers "1=n1@127.0.0.1:5051,n2@127.0.0.1:5052,n3@127.0.0.1:5053;2=n1@127.0.0.1:5054,n2@127.0.0.1:5055,n3@127.0.0.1:5056" \ +--raftRedisMap "127.0.0.1:5053=127.0.0.1:6381,127.0.0.1:5056=127.0.0.1:6384" +``` + **Why a new flag rather than extending `--raftGroups` entry syntax.** `--raftGroups` entries are `id=addr` and that `addr` is *this node's own* listener (`groupSpec.address`, used as `LocalAddress`, `multiraft_runtime.go:248`). Overloading it to also carry the full member list would make every node's `--raftGroups` value identical across the cluster *except* that the local-address role would have to be inferred — error-prone. A separate `--raftGroupPeers` keeps "what do I listen on" (`--raftGroups`) cleanly separate from "who are the voters" (`--raftGroupPeers`), and mirrors how single-group already separates `--address`/`--raftGroups` from `--raftBootstrapMembers`. 
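As a rough illustration of the `--raftGroupPeers` grammar above, the following Go sketch parses a flag value into a per-group voter map. It is not the parser PR-A will ship: `parseGroupPeers` and the `peer` type (standing in for `raftengine.Server`) are hypothetical names, and rejecting a group ID that appears twice is an assumption. Only rules that need no local-node context are checked; the unknown-group, local-node, and homogeneity rules of §3.1 are left out, and addresses are not validated as `host:port`.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// peer is a hypothetical stand-in for raftengine.Server (Raft ID plus address).
type peer struct {
	ID      string
	Address string
}

// parseGroupPeers parses "gid=id@host:port,...;gid=..." into per-group voter
// lists. Hypothetical helper for illustration only.
func parseGroupPeers(raw string) (map[uint64][]peer, error) {
	raw = strings.TrimSpace(raw)
	if raw == "" {
		return nil, nil // empty flag: nothing declared, caller keeps today's behavior
	}
	out := make(map[uint64][]peer)
	for _, entry := range strings.Split(raw, ";") {
		entry = strings.TrimSpace(entry)
		gidStr, members, ok := strings.Cut(entry, "=")
		if !ok {
			return nil, fmt.Errorf("entry %q: want groupID=member,member", entry)
		}
		gid, err := strconv.ParseUint(strings.TrimSpace(gidStr), 10, 64)
		if err != nil {
			return nil, fmt.Errorf("entry %q: bad group ID: %w", entry, err)
		}
		if _, dup := out[gid]; dup {
			return nil, fmt.Errorf("group %d listed twice", gid) // assumption, see lead-in
		}
		seen := make(map[string]struct{})
		var peers []peer
		for _, m := range strings.Split(members, ",") {
			id, addr, ok := strings.Cut(strings.TrimSpace(m), "@")
			if !ok || id == "" || addr == "" {
				return nil, fmt.Errorf("group %d: member %q: want raftID@host:port", gid, m)
			}
			if _, dup := seen[id]; dup {
				return nil, fmt.Errorf("group %d: duplicate raft ID %q", gid, id)
			}
			seen[id] = struct{}{}
			peers = append(peers, peer{ID: id, Address: addr})
		}
		// Sort by Raft ID so the result does not depend on the order in the flag.
		sort.Slice(peers, func(i, j int) bool { return peers[i].ID < peers[j].ID })
		out[gid] = peers
	}
	return out, nil
}

func main() {
	got, err := parseGroupPeers("1=n1@127.0.0.1:5051,n2@127.0.0.1:5052;2=n1@127.0.0.1:5054,n2@127.0.0.1:5055")
	if err != nil {
		panic(err)
	}
	fmt.Println(len(got), got[1][0].ID, got[2][1].Address)
}
```

Sorting each group's members by Raft ID is one way to make the parsed result independent of how the operator ordered the members, so every node that receives the same flag value derives the same list.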
**Back-compat rules (strict):** @@ -84,22 +106,22 @@ Grammar: ### 3.2 Bootstrap semantics -The wiring change is small and local: instead of one process-wide `bootstrapServers` threaded into every group, **resolve a per-group `[]raftengine.Server` and pass each group its own list**. Concretely, `buildShardGroups` / `buildRuntimeForGroup` change from a single `bootstrapServers []raftengine.Server` parameter (`multiraft_runtime.go:234`, `main.go:777`) to a `bootstrapServersFor func(groupID uint64) []raftengine.Server` lookup (or a `map[uint64][]raftengine.Server`), built once from the parsed `--raftGroupPeers`. Everything downstream — the factory's `len(peers) > 1` transport gate (`factory.go:50`), `Open`'s peer normalize/validate/persist (`engine.go:620-643`), the marker dir, the per-group listener — already operates per group and needs no change. +The wiring change is small and local: instead of one process-wide `bootstrapServers` threaded into every group, **resolve a per-group `[]raftengine.Server` and pass each group its own list**. Concretely, `buildShardGroups` / `buildRuntimeForGroup` change from a single `bootstrapServers []raftengine.Server` parameter (`multiraft_runtime.go:234`, `main.go:777`) to a static `map[uint64][]raftengine.Server`, built once from the parsed `--raftGroupPeers`. Everything downstream — the factory's `len(peers) > 1` transport gate (`factory.go:50`), `Open`'s peer normalize/validate/persist (`engine.go:620-643`), the marker dir, the per-group listener — already operates per group and needs no change. -**Which node proposes the initial conf (decision: every node bootstraps with the identical per-group peer list — the etcd model — NOT a single designated proposer).** etcd/raft's bootstrap model is that **every** founding member calls `Bootstrap` with the **same** `ConfState`/peer list; raft then elects a leader among them. 
This is exactly what `cmd/server/demo.go` does for the single group (`raftBootstrap=true` on all three nodes with the shared `raftPeers`, `:204-219`) and what `resolveBootstrapServers` sets up for single-group (`bootstrap = *raftBootstrap || len(bootstrapServers) > 0`, `main.go:534`). We generalize it: when `--raftGroupPeers` is set, **every group on every node bootstraps with that group's full peer list**, and `bootstrap` is implied true for those groups (the operator does not also need `--raftBootstrap`; see the interaction rule below). +**Initial configuration model (decision: every node bootstraps with the identical per-group peer list — the etcd model — NOT a single designated proposer).** etcd/raft's bootstrap model is that **every** founding member calls `Bootstrap` with the **same** `ConfState`/peer list; raft then elects a leader among them. This is exactly what `cmd/server/demo.go` does for the single group (`raftBootstrap=true` on all three nodes with the shared `raftPeers`, `:204-219`) and what `resolveBootstrapServers` sets up for single-group (`bootstrap = *raftBootstrap || len(bootstrapServers) > 0`, `main.go:534`). We generalize it: when `--raftGroupPeers` is set, **every group on every node bootstraps with that group's full peer list**, and `bootstrap` is implied true for those groups (the operator does not also need `--raftBootstrap`; see the interaction rule below). We do **not** invent a "lexicographically-smallest peer proposes, others wait-and-join" protocol. That single-proposer pattern is the *AddVoter-composition* path (§5), not the bootstrap path — and adopting it for bootstrap would mean the non-proposer nodes start with an empty conf and must be added one-by-one, which is fragile (ordering, the proposer must be up first and must be leader) and is exactly the "manual AddVoter dance in every test harness" #953 OQ-9 rejected. 
The all-nodes-same-list model has no designated-proposer ordering requirement: nodes can start in any order, and raft elects a leader once a quorum is up. -**Idempotency on restart (decision: persisted-peers + marker dir already give this — per group).** On first open of a group dir, `Open` writes the normalized peer set to the persisted-peers file (`savePersistedPeers`, `engine.go:643`; format in `peer_metadata.go:205`). On restart, the factory **loads the persisted peers and uses them in preference to the flag-supplied list** (`factory.go:43-47`), and `Open` refuses to start if the configured cluster disagrees with what is persisted (`validateOpenPeers` → `errClusterMismatch`, `engine.go:632`, `:116`). So: +**Idempotency on restart (decision: persisted-peers + marker dir are the restart boundary; PR-B must validate configured peers before adopting persisted peers).** On first open of a group dir, `Open` writes the normalized peer set to the persisted-peers file (`savePersistedPeers`, `engine.go:643`; format in `peer_metadata.go:205`). On restart, the factory/Open path **loads the persisted peers and uses them in preference to the flag-supplied list** (`factory.go:43-47`; `normalizeOpenConfig`, `engine.go:3298-3306`), and `validateOpenPeers` verifies the persisted snapshot's ConfState against that loaded peer set (`engine.go:632`; `validateOpenPeers`, `engine.go:3313-3327`; `errClusterMismatch`, `:116`). That is enough for same-list idempotency, but not enough to reject an operator who changes `--raftGroupPeers`: current code discards the newly supplied list before validation. PR-B therefore adds an explicit comparison of the normalized per-group flag list against the persisted peers for that group before overwriting it; divergence returns `errClusterMismatch` (or a wrapped validation error with that sentinel). So: - A restart with the same `--raftGroupPeers` re-loads the same persisted set per group → no re-bootstrap, no data risk. 
-- A restart with a *different* `--raftGroupPeers` than what a group already persisted **fails fast** with `errClusterMismatch` rather than silently re-bootstrapping over committed data. (Membership changes after bootstrap go through `AddVoter`/`RemoveServer`, which rewrite the persisted set, §5.) +- A restart with a *different* `--raftGroupPeers` than what a group already persisted **fails fast** in PR-B's explicit configured-vs-persisted validation rather than silently ignoring the changed flag or re-bootstrapping over committed data. (Membership changes after bootstrap go through `AddVoter`/`RemoveServer`, which rewrite the persisted set, §5.) - The `raft-engine` marker (`ensureRaftEngineDataDir`, `multiraft_runtime.go:117-151`) independently guards against opening a group dir under the wrong engine type — unchanged, already per group. -**`bootstrap` flag interaction (decision: `--raftGroupPeers` implies bootstrap=true for all groups; `--raftBootstrap` stays for the single-group/demo path).** Mirror the existing single-group rule `bootstrap = *raftBootstrap || len(bootstrapServers) > 0` (`main.go:534`): when `--raftGroupPeers` is non-empty, the resolved bootstrap flag is true for every group (each has a non-empty peer list). `--raftBootstrap` continues to mean "bootstrap" for deployments that don't use `--raftGroupPeers`. Setting `--raftBootstrap=false` together with `--raftGroupPeers` is a no-op contradiction for a fresh dir — we treat a non-empty `--raftGroupPeers` as authoritative (bootstrap=true), and document it. (On a *restart* the persisted-peers path takes over regardless, so the bootstrap flag is moot once a dir has state — same as today.) 
+**`bootstrap` flag interaction (decision: `--raftGroupPeers` implies bootstrap=true for configured groups; `--raftBootstrap` stays for the single-group/demo path).** Mirror the existing single-group rule `bootstrap = *raftBootstrap || len(bootstrapServers) > 0` (`main.go:534`): when `--raftGroupPeers` is non-empty, the resolved bootstrap flag is true for every group that has a non-empty peer list. `--raftBootstrap` continues to mean "bootstrap" for deployments that don't use `--raftGroupPeers`. Setting `--raftBootstrap=false` together with `--raftGroupPeers` is a no-op contradiction for a fresh dir — we treat a non-empty `--raftGroupPeers` as authoritative for those groups (bootstrap=true), and document it. On a restart, the data-loss guard is the WAL path: `openDiskState` checks `wal.Exist(walDir)` and returns `loadWalState` before consulting `cfg.Bootstrap` (`wal_store.go:50-52`); only dirs without a WAL reach `bootstrapNewCluster` (`wal_store.go:62`, `:93-100`). PR-B's five-lens review must preserve that invariant. **Partial-bootstrap failure modes and recovery:** - *One node never comes up.* With an N-voter group, raft tolerates up to ⌊(N−1)/2⌋ down at bootstrap and still elects a leader once a quorum starts. A 3-voter group forms with 2 up. The down node joins when it starts (its dir is fresh → bootstraps with the same list → catches up via snapshot/log). No operator action. -- *A node bootstrapped with the wrong list.* Caught at `Open` by `errClusterMismatch` (mismatch vs. persisted) for a restart, or — for a fresh dir — caught by raft refusing to make progress because the configurations disagree. Recovery: stop the misconfigured node, wipe its (fresh) group dir, restart with the correct `--raftGroupPeers`. Because the failure is fail-fast and the bad node holds no committed data yet, this is safe. 
+- *A node bootstrapped with the wrong list.* On restart after a persisted peers file exists, PR-B's configured-vs-persisted validation catches the mismatch before opening the engine. On a first bootstrap with fresh dirs, there is no local persisted reference to compare against; nodes with mismatched lists may form incompatible Raft configurations that do not share a quorum and therefore fail to elect a usable leader or make progress. Recovery: stop the misconfigured node(s), wipe only the fresh group dirs that bootstrapped with the wrong list, and restart with the identical `--raftGroupPeers` value used by the rest of the founding members. If any group has already committed user data, recovery must follow the live membership/change path instead of wiping. - *A node crashes mid-bootstrap after writing the persisted file but before committing entries.* Restart re-loads the persisted peers (`factory.go:43-47`) and rejoins; the persisted file is written atomically (`writePersistedPeersFile`, `peer_metadata.go:205`), so a torn write is not a partial state. No special handling beyond what single-group already has. ### 3.3 Determinism and testability of the bootstrapper @@ -148,8 +170,8 @@ The encryption writer-registration startup path (`main_encryption_registration.g ### 6.1 Unit - **Flag parsing** (`shard_config.go`, table-driven, co-located `*_test.go`): `--raftGroupPeers` grammar — multiple groups (`;`-separated), `raftID@host:port` members, whitespace, empty ⇒ nil; every validation rule of §3.1 (unknown group, missing-group-when-non-empty, local-node-absent, local-addr-mismatch, duplicate raftID, homogeneity violation, mutual-exclusion with `--raftBootstrapMembers`). Pure-function determinism: same input ⇒ identical sorted `map[uint64][]Server`. -- **Per-group bootstrap-server resolution**: the new `bootstrapServersFor(groupID)` returns each group's own list; the empty-flag path returns today's behavior unchanged (regression-locks back-compat). 
-- **Restart idempotency** (engine-level, `internal/raftengine/etcd/`): re-open a group dir with the same list ⇒ no re-bootstrap; re-open with a divergent list ⇒ `errClusterMismatch` (this path already has coverage for single-group; add the multi-group-dir case). +- **Per-group bootstrap-server resolution**: the new `map[uint64][]raftengine.Server` carries each group's own list; the empty-flag path returns today's behavior unchanged (regression-locks back-compat). +- **Restart idempotency** (engine-level, `internal/raftengine/etcd/`): re-open a group dir with the same list ⇒ no re-bootstrap; re-open with a divergent flag-supplied list ⇒ PR-B's explicit configured-vs-persisted validation returns `errClusterMismatch` (add the multi-group-dir case). ### 6.2 Integration — 3-node × 2-group in-process harness Stand up **3 nodes, 2 groups, every group a 3-voter Raft**, in one test process (extend `cmd/server/demo.go`'s pattern, or a new `internal/`-level harness so it is `go test`-runnable without the binary). Assertions: @@ -157,6 +179,7 @@ Stand up **3 nodes, 2 groups, every group a 3-voter Raft**, in one test process - `TransferLeadershipToServer` between two nodes of the same group **succeeds** (the capability the leader-balance scheduler is blocked on). - Restart one node: it reloads persisted peers and rejoins both groups without re-bootstrap. - Kill a minority (1 of 3) in a group: the group keeps a leader and serves; the killed node rejoins on restart. +- Negative case: start one fresh node with a divergent peer list and assert the harness fails health/leader convergence rather than producing an apparently healthy split configuration; document recovery as wipe-fresh-dir + restart with the shared list. This harness is the concrete deliverable that unblocks #953's convergence test (which "requires a topology where each of N groups has voters on ≥2 of 3 nodes — which only exists after PR0", #953 §5). 
@@ -168,11 +191,11 @@ Extend the multi-node story to Jepsen as a **later milestone** (noted, not v1): | PR | Scope | Tests | Shippable alone? | |---|---|---|---| | **PR-A** | Flag + parse + validation: `--raftGroupPeers`, the §3.1 grammar and all validation rules, mutual-exclusion with `--raftBootstrapMembers`. No wiring change yet (parsed result unused). | Unit (§6.1) flag-parse table tests. | Yes — pure parsing, zero behavior change (result unconsumed). | -| **PR-B** | Wiring: lift the `len(groups)==1` guard path for the new flag; thread per-group `bootstrapServersFor` through `buildShardGroups`/`buildRuntimeForGroup`; resolve bootstrap=true per group. Each group now opens multi-voter. | Unit (§6.1) per-group resolution + restart idempotency; smoke that a 2-group config opens 2 transports. | After PR-A — the core capability. | +| **PR-B** | Wiring: lift the `len(groups)==1` guard path for the new flag; thread the static per-group peer map through `buildShardGroups`/`buildRuntimeForGroup`; replace the process-wide `bootstrap bool` decision with per-group bootstrap derived from whether that group has a resolved peer list; validate flag-supplied peers against persisted peers before adopting persisted state. Each group now opens multi-voter. | Unit (§6.1) per-group resolution + restart idempotency; smoke that a 2-group config opens 2 transports. | After PR-A — the core capability. | | **PR-C** | In-process 3-node × 2-group integration harness (§6.2) + the leader-transfer-between-nodes smoke. | Integration (§6.2). | After PR-B — the deliverable #953 PR0 / hotspot-M2 need. | | **PR-D (later)** | Jepsen: true multi-node multi-group runner (§6.3). | Existing workloads, no-new-anomalies bar. | After PR-C; the "M6+" item. | -Each PR carries the five-lens self-review (CLAUDE.md). 
Lens highlights for this change: **data loss** — restart must never re-bootstrap over committed data (persisted-peers + `errClusterMismatch`, §3.2); **concurrency/distributed** — any node-start order must form each group (all-same-list model, §3.2), partial-quorum bootstrap recovers; **data consistency** — a divergent `--raftGroupPeers` on restart fails fast, never silently forks a group's membership. +Each PR carries the five-lens self-review (CLAUDE.md). Lens highlights for this change: **data loss** — restart must never re-bootstrap over committed data (existing-WAL path bypasses bootstrap, and PR-B validates configured peers against persisted peers before adopting them, §3.2); **concurrency/distributed** — any node-start order must form each group (all-same-list model, §3.2), partial-quorum bootstrap recovers; **data consistency** — a divergent `--raftGroupPeers` on restart fails fast, never silently forks or ignores a group's membership. ### 6.5 Doc lifecycle `*_proposed_*` → `*_partial_*` after PR-B (the topology is deployable) → `*_implemented_*` after PR-C (integration harness lands). `git mv`, propose date fixed. @@ -187,7 +210,7 @@ Each PR carries the five-lens self-review (CLAUDE.md). Lens highlights for this 1. **OQ-1 — Flag spelling: `--raftGroupPeers` companion flag vs. extending `--raftGroups` entries.** §3.1 recommends the companion flag (clean separation of "my listener" vs. "the voter set"; mirrors single-group `--raftBootstrapMembers`). Confirm before PR-A, since it fixes the operator-facing surface. 2. **OQ-2 — Should `--raftBootstrap` be *required* alongside `--raftGroupPeers`, or implied?** §3.2 recommends implied (non-empty `--raftGroupPeers` ⇒ bootstrap=true per group, matching `bootstrap = *raftBootstrap || len(bootstrapServers) > 0`, `main.go:534`). Alternative: require `--raftBootstrap` explicitly for symmetry with single-group. Confirm before PR-B. -3. **OQ-3 — Per-group bootstrap-server carrier: `func(groupID) []Server` vs. 
`map[uint64][]Server`.** Implementation detail for threading through `buildShardGroups`/`buildRuntimeForGroup` (`main.go:777`, `multiraft_runtime.go:234`). Map is simpler; func defers construction. Either is fine; pick at PR-B. +3. **OQ-3 — Per-group bootstrap-server carrier: `map[uint64][]Server`.** Resolved here: use a static map built during startup validation and thread that through `buildShardGroups`/`buildRuntimeForGroup` (`main.go:777`, `multiraft_runtime.go:234`). The group set is fixed at startup, so a function provider adds no value until a runtime update path exists. 4. **OQ-4 — Heterogeneous group membership (groups on a subset of nodes).** v1 enforces homogeneity (§3.1 rule 4) to match #953 §2.2. The `raftID@host:port` member syntax already expresses arbitrary per-group sets, so relaxing rule 4 later needs no grammar change — but #953's observation/forward paths assume homogeneity, so we keep the guard until a consumer needs otherwise. Should the validator's homogeneity check be a hard error (v1) or a warning that allows heterogeneous sets for advanced operators? (Recommendation: hard error in v1.) 5. **OQ-5 — Mixed bootstrap + learner start.** `--raftJoinAsLearner` (`buildRuntimeForGroup`'s `joinAsLearner`, `multiraft_runtime.go:238`) lets a node join an existing cluster as a learner. Should `--raftGroupPeers` interoperate with a per-group learner bootstrap (some members start as learners, promoted later), or is learner-join strictly a live-expansion concern (§5)? (Recommendation: learners are live-expansion only in v1; `--raftGroupPeers` declares voters.) 6. **OQ-6 — Single shared raft listener multiplexing all groups.** §3.4 keeps one listener per group (the `5005{1..6}` convention). Is the per-group-port model acceptable at the target scale, or is a single multiplexed raft listener (demux by group ID) worth the transport change for very high M? (Recommendation: per-group ports for v1; revisit only if port count becomes an operational problem.) 
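To make the OQ-2 and OQ-3 recommendations concrete, here is a small sketch of a static per-group map lookup combined with the implied-bootstrap rule. `resolveGroupBootstrap` and `peer` are hypothetical names introduced only for illustration; the rule itself mirrors the existing single-group expression `*raftBootstrap || len(bootstrapServers) > 0`, applied per group.

```go
package main

import "fmt"

// peer is a hypothetical stand-in for raftengine.Server.
type peer struct {
	ID      string
	Address string
}

// resolveGroupBootstrap returns the peer list for one group and whether that
// group should bootstrap. A nil map (flag unset) falls back to raftBootstrap.
func resolveGroupBootstrap(peersByGroup map[uint64][]peer, groupID uint64, raftBootstrap bool) ([]peer, bool) {
	peers := peersByGroup[groupID] // reading a nil map is safe and yields nil
	return peers, raftBootstrap || len(peers) > 0
}

func main() {
	peersByGroup := map[uint64][]peer{
		1: {{ID: "n1", Address: "127.0.0.1:5051"}, {ID: "n2", Address: "127.0.0.1:5052"}},
	}

	// Group with a resolved peer list: bootstrap is implied.
	peers, bootstrap := resolveGroupBootstrap(peersByGroup, 1, false)
	fmt.Printf("group 1: peers=%d bootstrap=%v\n", len(peers), bootstrap)

	// Flag unset: the decision falls back to --raftBootstrap.
	peers, bootstrap = resolveGroupBootstrap(nil, 1, false)
	fmt.Printf("group 1 (no flag): peers=%d bootstrap=%v\n", len(peers), bootstrap)
}
```

Because the group set is fixed at startup, a plain map lookup like this is enough; nothing in the sketch needs a function-valued provider.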
From 4c7600ab13ac6c1db9bb19f9dc8bd494c49bf513 Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Sun, 14 Jun 2026 16:59:10 +0900 Subject: [PATCH 03/14] docs: fix leader balance design reference --- .../2026_06_12_proposed_multinode_multigroup_bootstrap.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/design/2026_06_12_proposed_multinode_multigroup_bootstrap.md b/docs/design/2026_06_12_proposed_multinode_multigroup_bootstrap.md index a8dfc5d2..396de585 100644 --- a/docs/design/2026_06_12_proposed_multinode_multigroup_bootstrap.md +++ b/docs/design/2026_06_12_proposed_multinode_multigroup_bootstrap.md @@ -5,7 +5,7 @@ Author: bootjp Date: 2026-06-12 Sibling / prerequisite-for: -- [2026_06_11_proposed_leader_balance_scheduler.md](2026_06_11_proposed_leader_balance_scheduler.md) §1.1a (PR0) + OQ-9 — this doc **is** that PR0. The leader-balance scheduler's transfer-issuing milestones (PR2/PR3) are blocked on a Raft group whose voter set spans more than one node; that topology cannot be declared at startup today, and OQ-9 resolved "option (a): extend the bootstrap/flag surface." This document is the design for option (a). +- [Leader-balance scheduler design PR #953](https://github.com/bootjp/elastickv/pull/953) §1.1a (PR0) + OQ-9 — this doc **is** that PR0. The leader-balance scheduler's transfer-issuing milestones (PR2/PR3) are blocked on a Raft group whose voter set spans more than one node; that topology cannot be declared at startup today, and OQ-9 resolved "option (a): extend the bootstrap/flag surface." This document is the design for option (a). - [2026_02_18_partial_hotspot_shard_split.md](2026_02_18_partial_hotspot_shard_split.md) Milestone 2 — cross-group range migration testing also needs ≥2 nodes hosting the same group to be meaningful (see §7). ## 1. 
Background From 3996d89821e84c8c8755ed528c6b7d9310f2813e Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Sun, 14 Jun 2026 17:04:07 +0900 Subject: [PATCH 04/14] docs: preserve bootstrap members guard --- .../2026_06_12_proposed_multinode_multigroup_bootstrap.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/design/2026_06_12_proposed_multinode_multigroup_bootstrap.md b/docs/design/2026_06_12_proposed_multinode_multigroup_bootstrap.md index 396de585..c580eb27 100644 --- a/docs/design/2026_06_12_proposed_multinode_multigroup_bootstrap.md +++ b/docs/design/2026_06_12_proposed_multinode_multigroup_bootstrap.md @@ -57,7 +57,7 @@ The only missing piece is a **flag/parse/wiring path that gives each group its o ### 3.1 Flag surface — per-group peer lists -**Decision: add a companion flag `--raftGroupPeers`, and lift the `len(groups)==1` guard in `resolveBootstrapServers`.** Keep `--raftGroups` (group→local-address) exactly as is; declare the *cross-node* voter set per group in a new flag. +**Decision: add a companion flag `--raftGroupPeers` with its own per-group resolver, while keeping `resolveBootstrapServers` and its `len(groups)==1` guard for `--raftBootstrapMembers`.** Keep `--raftGroups` (group→local-address) exactly as is; declare the *cross-node* voter set per group in a new flag. ``` --raftGroupPeers "1=n1@host1:5051,n2@host2:5051,n3@host3:5051;2=n1@host1:5054,n2@host2:5054,n3@host3:5054" @@ -191,7 +191,7 @@ Extend the multi-node story to Jepsen as a **later milestone** (noted, not v1): | PR | Scope | Tests | Shippable alone? | |---|---|---|---| | **PR-A** | Flag + parse + validation: `--raftGroupPeers`, the §3.1 grammar and all validation rules, mutual-exclusion with `--raftBootstrapMembers`. No wiring change yet (parsed result unused). | Unit (§6.1) flag-parse table tests. | Yes — pure parsing, zero behavior change (result unconsumed). 
| -| **PR-B** | Wiring: lift the `len(groups)==1` guard path for the new flag; thread the static per-group peer map through `buildShardGroups`/`buildRuntimeForGroup`; replace the process-wide `bootstrap bool` decision with per-group bootstrap derived from whether that group has a resolved peer list; validate flag-supplied peers against persisted peers before adopting persisted state. Each group now opens multi-voter. | Unit (§6.1) per-group resolution + restart idempotency; smoke that a 2-group config opens 2 transports. | After PR-A — the core capability. | +| **PR-B** | Wiring: add the `--raftGroupPeers` resolver alongside the existing `resolveBootstrapServers` path, preserving the `--raftBootstrapMembers` single-group guard; thread the static per-group peer map through `buildShardGroups`/`buildRuntimeForGroup`; replace the process-wide `bootstrap bool` decision with per-group bootstrap derived from whether that group has a resolved peer list; validate flag-supplied peers against persisted peers before adopting persisted state. Each group now opens multi-voter. | Unit (§6.1) per-group resolution + restart idempotency; smoke that a 2-group config opens 2 transports. | After PR-A — the core capability. | | **PR-C** | In-process 3-node × 2-group integration harness (§6.2) + the leader-transfer-between-nodes smoke. | Integration (§6.2). | After PR-B — the deliverable #953 PR0 / hotspot-M2 need. | | **PR-D (later)** | Jepsen: true multi-node multi-group runner (§6.3). | Existing workloads, no-new-anomalies bar. | After PR-C; the "M6+" item. 
| From 0423973dbd45f848793f1e27d5c09222cb73aa34 Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Sun, 14 Jun 2026 17:09:30 +0900 Subject: [PATCH 05/14] docs: include full protocol map example --- ...proposed_multinode_multigroup_bootstrap.md | 19 +++++++++++-------- 1 file changed, 11 insertions(+), 8 deletions(-) diff --git a/docs/design/2026_06_12_proposed_multinode_multigroup_bootstrap.md b/docs/design/2026_06_12_proposed_multinode_multigroup_bootstrap.md index c580eb27..1f9bfa31 100644 --- a/docs/design/2026_06_12_proposed_multinode_multigroup_bootstrap.md +++ b/docs/design/2026_06_12_proposed_multinode_multigroup_bootstrap.md @@ -68,26 +68,29 @@ Grammar: - Each entry is `groupID=member,member,…`. - Each `member` is `raftID@host:port` — the `@` separates the node's stable Raft ID (matching `--raftId` semantics) from its listener address for that group. (`raftID` is needed explicitly because etcd's bootstrap requires the same `id→address` mapping on every node, `cmd/server/demo.go:215-219`; the address alone is not the identity.) 
-Concrete 3-node × 2-group local example (all nodes share the same `--raftGroupPeers`; `--raftGroups` and protocol maps name only that node's local listeners): +Concrete 3-node × 2-group local example (all nodes share the same `--raftGroupPeers` and full N×M `--raftRedisMap`; `--raftGroups` names only that node's local raft listeners): ``` +RAFT_GROUP_PEERS="1=n1@127.0.0.1:5051,n2@127.0.0.1:5052,n3@127.0.0.1:5053;2=n1@127.0.0.1:5054,n2@127.0.0.1:5055,n3@127.0.0.1:5056" +RAFT_REDIS_MAP="127.0.0.1:5051=127.0.0.1:6379,127.0.0.1:5052=127.0.0.1:6380,127.0.0.1:5053=127.0.0.1:6381,127.0.0.1:5054=127.0.0.1:6382,127.0.0.1:5055=127.0.0.1:6383,127.0.0.1:5056=127.0.0.1:6384" + # node n1 --raftId n1 \ --raftGroups "1=127.0.0.1:5051,2=127.0.0.1:5054" \ ---raftGroupPeers "1=n1@127.0.0.1:5051,n2@127.0.0.1:5052,n3@127.0.0.1:5053;2=n1@127.0.0.1:5054,n2@127.0.0.1:5055,n3@127.0.0.1:5056" \ ---raftRedisMap "127.0.0.1:5051=127.0.0.1:6379,127.0.0.1:5054=127.0.0.1:6382" +--raftGroupPeers "$RAFT_GROUP_PEERS" \ +--raftRedisMap "$RAFT_REDIS_MAP" # node n2 --raftId n2 \ --raftGroups "1=127.0.0.1:5052,2=127.0.0.1:5055" \ ---raftGroupPeers "1=n1@127.0.0.1:5051,n2@127.0.0.1:5052,n3@127.0.0.1:5053;2=n1@127.0.0.1:5054,n2@127.0.0.1:5055,n3@127.0.0.1:5056" \ ---raftRedisMap "127.0.0.1:5052=127.0.0.1:6380,127.0.0.1:5055=127.0.0.1:6383" +--raftGroupPeers "$RAFT_GROUP_PEERS" \ +--raftRedisMap "$RAFT_REDIS_MAP" # node n3 --raftId n3 \ --raftGroups "1=127.0.0.1:5053,2=127.0.0.1:5056" \ ---raftGroupPeers "1=n1@127.0.0.1:5051,n2@127.0.0.1:5052,n3@127.0.0.1:5053;2=n1@127.0.0.1:5054,n2@127.0.0.1:5055,n3@127.0.0.1:5056" \ ---raftRedisMap "127.0.0.1:5053=127.0.0.1:6381,127.0.0.1:5056=127.0.0.1:6384" +--raftGroupPeers "$RAFT_GROUP_PEERS" \ +--raftRedisMap "$RAFT_REDIS_MAP" ``` **Why a new flag rather than extending `--raftGroups` entry syntax.** `--raftGroups` entries are `id=addr` and that `addr` is *this node's own* listener (`groupSpec.address`, used as `LocalAddress`, `multiraft_runtime.go:248`). 
Overloading it to also carry the full member list would make every node's `--raftGroups` value identical across the cluster *except* that the local-address role would have to be inferred — error-prone. A separate `--raftGroupPeers` keeps "what do I listen on" (`--raftGroups`) cleanly separate from "who are the voters" (`--raftGroupPeers`), and mirrors how single-group already separates `--address`/`--raftGroups` from `--raftBootstrapMembers`. @@ -150,7 +153,7 @@ With `--raftGroupPeers` empty, `resolveBootstrapServers` runs unchanged (`main.g `cmd/server/demo.go` bootstraps one group across 3 nodes via `raftPeers` (`:204-219`); it never reads `--raftGroupPeers`. Unchanged. ### 4.3 Per-protocol address maps -`--raftRedisMap` / `--raftDynamoMap` / `--raftS3Map` / `--raftSqsMap` map *Raft listener address → protocol listener address* (`parseRaftAddressMap`, `shard_config.go:327-350`; consumed in `multiraft_runtime.go` group→protocol wiring). They are about *where a group exposes its protocol endpoint*, not *who votes in the group*, so they are orthogonal and unchanged. (In a true multi-node deployment an operator already supplies these per node; the new flag does not alter that.) +`--raftRedisMap` / `--raftDynamoMap` / `--raftS3Map` / `--raftSqsMap` map *Raft listener address → protocol listener address* (`parseRaftAddressMap`, `shard_config.go:327-350`; consumed in `multiraft_runtime.go` group→protocol wiring). They are about *where a group exposes its protocol endpoint*, not *who votes in the group*, so they are orthogonal and unchanged. In a true multi-node deployment, every node that accepts follower ingress must include entries for every possible leader raft address, not just its local listeners: Redis `leaderClientForKey` indexes the map by `RaftLeaderForKey` (`adapter/redis.go:4282-4288`), and the HTTP leader proxy indexes by `RaftLeader()` (`adapter/leader_http_proxy.go:47-54`). 
A local-only map works only if clients always connect directly to the current leader. ### 4.4 Encryption startup ordering The encryption writer-registration startup path (`main_encryption_registration.go`) is **leader-relative, not single-node-per-group**, so it already tolerates multi-voter groups. `buildProcessStartRegistrationGate` proposes through the **default group** and, when this node is not the default-group leader, **forwards to the current leader** over `EncryptionAdmin` with bounded retry (`proposeWriterRegistration`, `:472-520`; `IsLeader()`/`RaftLeader()` gating, `:482-511`). It assumes only that a default-group leader exists and is reachable — which is *more* true with a multi-voter default group, not less. The `raft-engine` marker and per-group dirs are already per group (§1.2). No encryption guard assumes a single-node-per-group bootstrap order; nothing here changes. (The five-lens "data consistency" review per PR must still confirm the registration forward path behaves when the default group is mid-election at boot, but that is an existing property, not new.) 
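Reviewer sketch (not part of the patch): the `--raftGroupPeers` grammar described above, with `;` between groups, `groupID=member,member,…` per entry, and `raftID@host:port` per member, could be parsed roughly as follows. This is a minimal sketch under stated assumptions: `Member` and `parseGroupPeers` are hypothetical names, `Member` only stands in for `raftengine.Server`, and rejecting empty entries is a choice made here rather than something the design specifies. The cross-checks from §3.1 (unknown group, local node absent, local address mismatch, mutual exclusion with `--raftBootstrapMembers`) are left out.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// Member is a hypothetical stand-in for raftengine.Server: a stable Raft ID
// plus the listener address that node uses for one group.
type Member struct {
	ID      string
	Address string
}

// parseGroupPeers parses "1=n1@h:p,n2@h:p;2=n1@h:p" into per-group member
// lists sorted by Raft ID. An empty or whitespace-only flag yields nil.
func parseGroupPeers(raw string) (map[uint64][]Member, error) {
	raw = strings.TrimSpace(raw)
	if raw == "" {
		return nil, nil
	}
	out := make(map[uint64][]Member)
	for _, entry := range strings.Split(raw, ";") {
		idPart, list, ok := strings.Cut(strings.TrimSpace(entry), "=")
		if !ok {
			return nil, fmt.Errorf("invalid group entry %q", entry)
		}
		gid, err := strconv.ParseUint(strings.TrimSpace(idPart), 10, 64)
		if err != nil {
			return nil, fmt.Errorf("invalid group id %q: %w", idPart, err)
		}
		if _, dup := out[gid]; dup {
			return nil, fmt.Errorf("duplicate group %d", gid)
		}
		seen := make(map[string]bool)
		var members []Member
		for _, m := range strings.Split(list, ",") {
			id, addr, ok := strings.Cut(strings.TrimSpace(m), "@")
			if !ok || id == "" || addr == "" {
				return nil, fmt.Errorf("group %d: invalid member %q", gid, m)
			}
			if seen[id] {
				return nil, fmt.Errorf("group %d: duplicate raft id %q", gid, id)
			}
			seen[id] = true
			members = append(members, Member{ID: id, Address: addr})
		}
		// Sort so the same input always produces the same per-group list.
		sort.Slice(members, func(i, j int) bool { return members[i].ID < members[j].ID })
		out[gid] = members
	}
	return out, nil
}

func main() {
	peers, err := parseGroupPeers(
		"1=n1@127.0.0.1:5051,n2@127.0.0.1:5052,n3@127.0.0.1:5053;" +
			"2=n1@127.0.0.1:5054,n2@127.0.0.1:5055,n3@127.0.0.1:5056")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, gid := range []uint64{1, 2} {
		fmt.Println(gid, peers[gid])
	}
}
```

Sorting by Raft ID is one way to get the "same input, identical output" property that the §6.1 unit tests ask for. The real parser may normalize differently.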
From 6b988feaed8873f3a0ba7991b7f70e7b686cfac2 Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Sun, 14 Jun 2026 17:13:54 +0900 Subject: [PATCH 06/14] docs: map raft groups to node redis listeners --- .../2026_06_12_proposed_multinode_multigroup_bootstrap.md | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/docs/design/2026_06_12_proposed_multinode_multigroup_bootstrap.md b/docs/design/2026_06_12_proposed_multinode_multigroup_bootstrap.md index 1f9bfa31..073fc870 100644 --- a/docs/design/2026_06_12_proposed_multinode_multigroup_bootstrap.md +++ b/docs/design/2026_06_12_proposed_multinode_multigroup_bootstrap.md @@ -72,22 +72,25 @@ Concrete 3-node × 2-group local example (all nodes share the same `--raftGroupP ``` RAFT_GROUP_PEERS="1=n1@127.0.0.1:5051,n2@127.0.0.1:5052,n3@127.0.0.1:5053;2=n1@127.0.0.1:5054,n2@127.0.0.1:5055,n3@127.0.0.1:5056" -RAFT_REDIS_MAP="127.0.0.1:5051=127.0.0.1:6379,127.0.0.1:5052=127.0.0.1:6380,127.0.0.1:5053=127.0.0.1:6381,127.0.0.1:5054=127.0.0.1:6382,127.0.0.1:5055=127.0.0.1:6383,127.0.0.1:5056=127.0.0.1:6384" +RAFT_REDIS_MAP="127.0.0.1:5051=127.0.0.1:6379,127.0.0.1:5054=127.0.0.1:6379,127.0.0.1:5052=127.0.0.1:6380,127.0.0.1:5055=127.0.0.1:6380,127.0.0.1:5053=127.0.0.1:6381,127.0.0.1:5056=127.0.0.1:6381" # node n1 --raftId n1 \ +--redisAddress "127.0.0.1:6379" \ --raftGroups "1=127.0.0.1:5051,2=127.0.0.1:5054" \ --raftGroupPeers "$RAFT_GROUP_PEERS" \ --raftRedisMap "$RAFT_REDIS_MAP" # node n2 --raftId n2 \ +--redisAddress "127.0.0.1:6380" \ --raftGroups "1=127.0.0.1:5052,2=127.0.0.1:5055" \ --raftGroupPeers "$RAFT_GROUP_PEERS" \ --raftRedisMap "$RAFT_REDIS_MAP" # node n3 --raftId n3 \ +--redisAddress "127.0.0.1:6381" \ --raftGroups "1=127.0.0.1:5053,2=127.0.0.1:5056" \ --raftGroupPeers "$RAFT_GROUP_PEERS" \ --raftRedisMap "$RAFT_REDIS_MAP" @@ -153,7 +156,7 @@ With `--raftGroupPeers` empty, `resolveBootstrapServers` runs unchanged (`main.g `cmd/server/demo.go` bootstraps one group across 3 nodes via 
`raftPeers` (`:204-219`); it never reads `--raftGroupPeers`. Unchanged. ### 4.3 Per-protocol address maps -`--raftRedisMap` / `--raftDynamoMap` / `--raftS3Map` / `--raftSqsMap` map *Raft listener address → protocol listener address* (`parseRaftAddressMap`, `shard_config.go:327-350`; consumed in `multiraft_runtime.go` group→protocol wiring). They are about *where a group exposes its protocol endpoint*, not *who votes in the group*, so they are orthogonal and unchanged. In a true multi-node deployment, every node that accepts follower ingress must include entries for every possible leader raft address, not just its local listeners: Redis `leaderClientForKey` indexes the map by `RaftLeaderForKey` (`adapter/redis.go:4282-4288`), and the HTTP leader proxy indexes by `RaftLeader()` (`adapter/leader_http_proxy.go:47-54`). A local-only map works only if clients always connect directly to the current leader. +`--raftRedisMap` / `--raftDynamoMap` / `--raftS3Map` / `--raftSqsMap` map *Raft listener address → protocol listener address* (`parseRaftAddressMap`, `shard_config.go:327-350`; consumed in `multiraft_runtime.go` group→protocol wiring). They are about *where a group exposes its protocol endpoint*, not *who votes in the group*, so they are orthogonal and unchanged. In a true multi-node deployment, every node that accepts follower ingress must include entries for every possible leader raft address, not just its local listeners: Redis `leaderClientForKey` indexes the map by `RaftLeaderForKey` (`adapter/redis.go:4282-4288`), and the HTTP leader proxy indexes by `RaftLeader()` (`adapter/leader_http_proxy.go:47-54`). When one process hosts multiple groups, those groups' raft addresses can map to the same per-process protocol listener, such as the single Redis listener from `--redisAddress` (`main.go:85`, `main.go:1647-1655`). A local-only map works only if clients always connect directly to the current leader. 
### 4.4 Encryption startup ordering The encryption writer-registration startup path (`main_encryption_registration.go`) is **leader-relative, not single-node-per-group**, so it already tolerates multi-voter groups. `buildProcessStartRegistrationGate` proposes through the **default group** and, when this node is not the default-group leader, **forwards to the current leader** over `EncryptionAdmin` with bounded retry (`proposeWriterRegistration`, `:472-520`; `IsLeader()`/`RaftLeader()` gating, `:482-511`). It assumes only that a default-group leader exists and is reachable — which is *more* true with a multi-voter default group, not less. The `raft-engine` marker and per-group dirs are already per group (§1.2). No encryption guard assumes a single-node-per-group bootstrap order; nothing here changes. (The five-lens "data consistency" review per PR must still confirm the registration forward path behaves when the default group is mid-election at boot, but that is an existing property, not new.) From 9ac2b49aa12db9827aa399e0cce1b3f2b3a85f60 Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Sun, 14 Jun 2026 17:19:16 +0900 Subject: [PATCH 07/14] docs: scope bootstrap peer validation --- ..._12_proposed_multinode_multigroup_bootstrap.md | 15 ++++++++------- 1 file changed, 8 insertions(+), 7 deletions(-) diff --git a/docs/design/2026_06_12_proposed_multinode_multigroup_bootstrap.md b/docs/design/2026_06_12_proposed_multinode_multigroup_bootstrap.md index 073fc870..6dde56ed 100644 --- a/docs/design/2026_06_12_proposed_multinode_multigroup_bootstrap.md +++ b/docs/design/2026_06_12_proposed_multinode_multigroup_bootstrap.md @@ -118,16 +118,17 @@ The wiring change is small and local: instead of one process-wide `bootstrapServ We do **not** invent a "lexicographically-smallest peer proposes, others wait-and-join" protocol. 
That single-proposer pattern is the *AddVoter-composition* path (§5), not the bootstrap path — and adopting it for bootstrap would mean the non-proposer nodes start with an empty conf and must be added one-by-one, which is fragile (ordering, the proposer must be up first and must be leader) and is exactly the "manual AddVoter dance in every test harness" #953 OQ-9 rejected. The all-nodes-same-list model has no designated-proposer ordering requirement: nodes can start in any order, and raft elects a leader once a quorum is up. -**Idempotency on restart (decision: persisted-peers + marker dir are the restart boundary; PR-B must validate configured peers before adopting persisted peers).** On first open of a group dir, `Open` writes the normalized peer set to the persisted-peers file (`savePersistedPeers`, `engine.go:643`; format in `peer_metadata.go:205`). On restart, the factory/Open path **loads the persisted peers and uses them in preference to the flag-supplied list** (`factory.go:43-47`; `normalizeOpenConfig`, `engine.go:3298-3306`), and `validateOpenPeers` verifies the persisted snapshot's ConfState against that loaded peer set (`engine.go:632`; `validateOpenPeers`, `engine.go:3313-3327`; `errClusterMismatch`, `:116`). That is enough for same-list idempotency, but not enough to reject an operator who changes `--raftGroupPeers`: current code discards the newly supplied list before validation. PR-B therefore adds an explicit comparison of the normalized per-group flag list against the persisted peers for that group before overwriting it; divergence returns `errClusterMismatch` (or a wrapped validation error with that sentinel). 
So: +**Idempotency on restart (decision: persisted-peers + marker dir are the restart boundary; PR-B validates the bootstrap seed only while membership is still bootstrap-era).** On first open of a group dir, `Open` writes the normalized peer set to the persisted-peers file (`savePersistedPeers`, `engine.go:643`; format in `peer_metadata.go:205`). On restart, the factory/Open path **loads the persisted peers and uses them in preference to the flag-supplied list** (`factory.go:43-47`; `normalizeOpenConfig`, `engine.go:3298-3306`), and `validateOpenPeers` verifies the persisted snapshot's ConfState against that loaded peer set (`engine.go:632`; `validateOpenPeers`, `engine.go:3313-3327`; `errClusterMismatch`, `:116`). That is enough for same-list idempotency, but not enough to reject an operator who changes the initial `--raftGroupPeers` before any membership change: current code discards the newly supplied list before validation. PR-B therefore adds an explicit comparison of the normalized per-group flag list against the persisted peers **only when the persisted peer metadata still represents the initial bootstrap configuration**; divergence returns `errClusterMismatch` (or a wrapped validation error with that sentinel). Once a live membership change has committed, the persisted peer metadata is advanced by conf-change apply (`applyConfChangeCommitted`, `engine.go:2280-2282`; `writeCurrentPersistedPeers`, `engine.go:2688`) and becomes authoritative; `--raftGroupPeers` remains a bootstrap seed, not a desired-membership flag. So: - A restart with the same `--raftGroupPeers` re-loads the same persisted set per group → no re-bootstrap, no data risk. -- A restart with a *different* `--raftGroupPeers` than what a group already persisted **fails fast** in PR-B's explicit configured-vs-persisted validation rather than silently ignoring the changed flag or re-bootstrapping over committed data. 
(Membership changes after bootstrap go through `AddVoter`/`RemoveServer`, which rewrite the persisted set, §5.) +- A restart with a *different* `--raftGroupPeers` than the bootstrap-era persisted set **fails fast** in PR-B's explicit bootstrap-seed validation rather than silently ignoring the changed flag or re-bootstrapping over committed data. +- A restart after `AddVoter`/`RemoveServer` has changed membership uses the committed persisted peer set even if the original bootstrap flag is still present; operators change membership through RaftAdmin, not by editing `--raftGroupPeers`. - The `raft-engine` marker (`ensureRaftEngineDataDir`, `multiraft_runtime.go:117-151`) independently guards against opening a group dir under the wrong engine type — unchanged, already per group. **`bootstrap` flag interaction (decision: `--raftGroupPeers` implies bootstrap=true for configured groups; `--raftBootstrap` stays for the single-group/demo path).** Mirror the existing single-group rule `bootstrap = *raftBootstrap || len(bootstrapServers) > 0` (`main.go:534`): when `--raftGroupPeers` is non-empty, the resolved bootstrap flag is true for every group that has a non-empty peer list. `--raftBootstrap` continues to mean "bootstrap" for deployments that don't use `--raftGroupPeers`. Setting `--raftBootstrap=false` together with `--raftGroupPeers` is a no-op contradiction for a fresh dir — we treat a non-empty `--raftGroupPeers` as authoritative for those groups (bootstrap=true), and document it. On a restart, the data-loss guard is the WAL path: `openDiskState` checks `wal.Exist(walDir)` and returns `loadWalState` before consulting `cfg.Bootstrap` (`wal_store.go:50-52`); only dirs without a WAL reach `bootstrapNewCluster` (`wal_store.go:62`, `:93-100`). PR-B's five-lens review must preserve that invariant. 
**Partial-bootstrap failure modes and recovery:** - *One node never comes up.* With an N-voter group, raft tolerates up to ⌊(N−1)/2⌋ down at bootstrap and still elects a leader once a quorum starts. A 3-voter group forms with 2 up. The down node joins when it starts (its dir is fresh → bootstraps with the same list → catches up via snapshot/log). No operator action. -- *A node bootstrapped with the wrong list.* On restart after a persisted peers file exists, PR-B's configured-vs-persisted validation catches the mismatch before opening the engine. On a first bootstrap with fresh dirs, there is no local persisted reference to compare against; nodes with mismatched lists may form incompatible Raft configurations that do not share a quorum and therefore fail to elect a usable leader or make progress. Recovery: stop the misconfigured node(s), wipe only the fresh group dirs that bootstrapped with the wrong list, and restart with the identical `--raftGroupPeers` value used by the rest of the founding members. If any group has already committed user data, recovery must follow the live membership/change path instead of wiping. +- *A node bootstrapped with the wrong list.* On restart while membership is still bootstrap-era, PR-B's bootstrap-seed validation catches the mismatch before opening the engine. On a first bootstrap with fresh dirs, there is no local persisted reference to compare against; nodes with mismatched lists may form incompatible Raft configurations that do not share a quorum and therefore fail to elect a usable leader or make progress. Recovery: stop the misconfigured node(s), wipe only the fresh group dirs that bootstrapped with the wrong list, and restart with the identical `--raftGroupPeers` value used by the rest of the founding members. If any group has already committed user data, recovery must follow the live membership/change path instead of wiping. 
- *A node crashes mid-bootstrap after writing the persisted file but before committing entries.* Restart re-loads the persisted peers (`factory.go:43-47`) and rejoins; the persisted file is written atomically (`writePersistedPeersFile`, `peer_metadata.go:205`), so a torn write is not a partial state. No special handling beyond what single-group already has. ### 3.3 Determinism and testability of the bootstrapper @@ -170,14 +171,14 @@ The encryption writer-registration startup path (`main_encryption_registration.g - Every test harness and every operator runbook would have to replay that dance to get a multi-voter group, the per-test cost OQ-9 explicitly wanted to avoid. - It produces a *transient* single-voter window at startup where the group has no fault tolerance and no other transfer target — the opposite of what the leader-balance scheduler needs to test against. -**Kept as the live-expansion path.** Growing an *already-running* group (add a 4th node to a 3-voter group, replace a dead node) is precisely what `AddVoter`/`RemoveServer` are for, and this design does not touch them. The persisted-peers file is rewritten by conf-change apply, so a node added via `AddVoter` and then restarted reloads the grown set (`factory.go:43-47`) — bootstrap and live-expansion compose cleanly. +**Kept as the live-expansion path.** Growing an *already-running* group (add a 4th node to a 3-voter group, replace a dead node) is precisely what `AddVoter`/`RemoveServer` are for, and this design does not touch them. The persisted-peers file is rewritten by conf-change apply, so a node added via `AddVoter` and then restarted reloads the grown set (`factory.go:43-47`) — bootstrap and live-expansion compose cleanly. After the first committed membership change, `--raftGroupPeers` is no longer compared as desired membership; the persisted Raft configuration is authoritative until another RaftAdmin membership change commits. ## 6. 
Rollout / testing ### 6.1 Unit - **Flag parsing** (`shard_config.go`, table-driven, co-located `*_test.go`): `--raftGroupPeers` grammar — multiple groups (`;`-separated), `raftID@host:port` members, whitespace, empty ⇒ nil; every validation rule of §3.1 (unknown group, missing-group-when-non-empty, local-node-absent, local-addr-mismatch, duplicate raftID, homogeneity violation, mutual-exclusion with `--raftBootstrapMembers`). Pure-function determinism: same input ⇒ identical sorted `map[uint64][]Server`. - **Per-group bootstrap-server resolution**: the new `map[uint64][]raftengine.Server` carries each group's own list; the empty-flag path returns today's behavior unchanged (regression-locks back-compat). -- **Restart idempotency** (engine-level, `internal/raftengine/etcd/`): re-open a group dir with the same list ⇒ no re-bootstrap; re-open with a divergent flag-supplied list ⇒ PR-B's explicit configured-vs-persisted validation returns `errClusterMismatch` (add the multi-group-dir case). +- **Restart idempotency** (engine-level, `internal/raftengine/etcd/`): re-open a group dir with the same list ⇒ no re-bootstrap; re-open with a divergent flag-supplied list while membership is still bootstrap-era ⇒ PR-B's explicit bootstrap-seed validation returns `errClusterMismatch`; re-open after an `AddVoter`/`RemoveServer` conf-change with the original bootstrap flag still present ⇒ accepts the committed persisted peer set (add the multi-group-dir cases). ### 6.2 Integration — 3-node × 2-group in-process harness Stand up **3 nodes, 2 groups, every group a 3-voter Raft**, in one test process (extend `cmd/server/demo.go`'s pattern, or a new `internal/`-level harness so it is `go test`-runnable without the binary). Assertions: @@ -197,11 +198,11 @@ Extend the multi-node story to Jepsen as a **later milestone** (noted, not v1): | PR | Scope | Tests | Shippable alone? 
| |---|---|---|---| | **PR-A** | Flag + parse + validation: `--raftGroupPeers`, the §3.1 grammar and all validation rules, mutual-exclusion with `--raftBootstrapMembers`. No wiring change yet (parsed result unused). | Unit (§6.1) flag-parse table tests. | Yes — pure parsing, zero behavior change (result unconsumed). | -| **PR-B** | Wiring: add the `--raftGroupPeers` resolver alongside the existing `resolveBootstrapServers` path, preserving the `--raftBootstrapMembers` single-group guard; thread the static per-group peer map through `buildShardGroups`/`buildRuntimeForGroup`; replace the process-wide `bootstrap bool` decision with per-group bootstrap derived from whether that group has a resolved peer list; validate flag-supplied peers against persisted peers before adopting persisted state. Each group now opens multi-voter. | Unit (§6.1) per-group resolution + restart idempotency; smoke that a 2-group config opens 2 transports. | After PR-A — the core capability. | +| **PR-B** | Wiring: add the `--raftGroupPeers` resolver alongside the existing `resolveBootstrapServers` path, preserving the `--raftBootstrapMembers` single-group guard; thread the static per-group peer map through `buildShardGroups`/`buildRuntimeForGroup`; replace the process-wide `bootstrap bool` decision with per-group bootstrap derived from whether that group has a resolved peer list; validate the flag-supplied bootstrap seed against persisted bootstrap-era peers before adopting persisted state, but preserve live-expanded persisted peers after conf changes. Each group now opens multi-voter. | Unit (§6.1) per-group resolution + restart idempotency; smoke that a 2-group config opens 2 transports. | After PR-A — the core capability. | | **PR-C** | In-process 3-node × 2-group integration harness (§6.2) + the leader-transfer-between-nodes smoke. | Integration (§6.2). | After PR-B — the deliverable #953 PR0 / hotspot-M2 need. | | **PR-D (later)** | Jepsen: true multi-node multi-group runner (§6.3). 
| Existing workloads, no-new-anomalies bar. | After PR-C; the "M6+" item. | -Each PR carries the five-lens self-review (CLAUDE.md). Lens highlights for this change: **data loss** — restart must never re-bootstrap over committed data (existing-WAL path bypasses bootstrap, and PR-B validates configured peers against persisted peers before adopting them, §3.2); **concurrency/distributed** — any node-start order must form each group (all-same-list model, §3.2), partial-quorum bootstrap recovers; **data consistency** — a divergent `--raftGroupPeers` on restart fails fast, never silently forks or ignores a group's membership. +Each PR carries the five-lens self-review (CLAUDE.md). Lens highlights for this change: **data loss** — restart must never re-bootstrap over committed data (existing-WAL path bypasses bootstrap, and PR-B validates the bootstrap seed before adopting bootstrap-era persisted peers, §3.2); **concurrency/distributed** — any node-start order must form each group (all-same-list model, §3.2), partial-quorum bootstrap recovers; **data consistency** — a divergent `--raftGroupPeers` before live membership changes fails fast, while committed RaftAdmin membership changes remain authoritative on restart. ### 6.5 Doc lifecycle `*_proposed_*` → `*_partial_*` after PR-B (the topology is deployable) → `*_implemented_*` after PR-C (integration harness lands). `git mv`, propose date fixed. 
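Reviewer sketch (not part of the patch): the restart rule in the revised §3.2 says to compare the flag-supplied seed with the persisted peers only while membership is still bootstrap-era, and to treat the persisted set as authoritative afterwards. That decision might look like the function below. Everything in it is an assumption made for illustration: `Peer`, `persistedPeers`, the `BootstrapEra` field, and `resolveOpenPeers` are hypothetical names, and the sentinel is a local stand-in for the engine's `errClusterMismatch`. The design itself leaves the metadata format to PR-B.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// errClusterMismatch is a local stand-in for the engine's sentinel error.
var errClusterMismatch = errors.New("cluster mismatch")

// Peer is a hypothetical stand-in for a persisted or configured voter.
type Peer struct {
	ID      string
	Address string
}

// persistedPeers models the on-disk peer metadata. BootstrapEra is a
// hypothetical marker: true until the first committed membership change.
type persistedPeers struct {
	Peers        []Peer
	BootstrapEra bool
}

// normalize returns a copy of the peers sorted by ID.
func normalize(in []Peer) []Peer {
	out := append([]Peer(nil), in...)
	sort.Slice(out, func(i, j int) bool { return out[i].ID < out[j].ID })
	return out
}

// samePeers reports whether two normalized lists are identical.
func samePeers(a, b []Peer) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// resolveOpenPeers picks the peer set a group should open with.
//   - fresh dir (persisted == nil): bootstrap from the flag seed
//   - persisted and past bootstrap-era, or no flag given: persisted wins
//   - persisted and still bootstrap-era: the seed must match, else fail fast
func resolveOpenPeers(seed []Peer, persisted *persistedPeers) ([]Peer, error) {
	if persisted == nil {
		return normalize(seed), nil
	}
	stored := normalize(persisted.Peers)
	if !persisted.BootstrapEra || len(seed) == 0 {
		return stored, nil
	}
	if !samePeers(normalize(seed), stored) {
		return nil, fmt.Errorf("bootstrap seed differs from persisted peers: %w", errClusterMismatch)
	}
	return stored, nil
}

func main() {
	seed := []Peer{{ID: "n1", Address: "127.0.0.1:5051"}, {ID: "n2", Address: "127.0.0.1:5052"}}
	changed := []Peer{{ID: "n1", Address: "127.0.0.1:5051"}, {ID: "n9", Address: "127.0.0.1:5059"}}

	// Same dir, still bootstrap-era, operator edited the flag: rejected.
	_, err := resolveOpenPeers(changed, &persistedPeers{Peers: seed, BootstrapEra: true})
	fmt.Println("bootstrap-era mismatch rejected:", errors.Is(err, errClusterMismatch))

	// After a committed conf change the persisted set is authoritative.
	peers, err := resolveOpenPeers(seed, &persistedPeers{Peers: changed, BootstrapEra: false})
	fmt.Println("post-change peers:", peers, "err:", err)
}
```

Keeping the check in a pure function like this would let the three restart cases listed in §6.1 be table-tested without opening a real engine. Whether PR-B structures it that way is an open implementation choice.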
From 50201e1dedaccbef66e9c1da476c3ff094b42f31 Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Sun, 14 Jun 2026 17:21:37 +0900 Subject: [PATCH 08/14] docs: note proposal date source --- .../design/2026_06_12_proposed_multinode_multigroup_bootstrap.md | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/design/2026_06_12_proposed_multinode_multigroup_bootstrap.md b/docs/design/2026_06_12_proposed_multinode_multigroup_bootstrap.md index 6dde56ed..90521cfe 100644 --- a/docs/design/2026_06_12_proposed_multinode_multigroup_bootstrap.md +++ b/docs/design/2026_06_12_proposed_multinode_multigroup_bootstrap.md @@ -3,6 +3,7 @@ Status: Proposed Author: bootjp Date: 2026-06-12 + Sibling / prerequisite-for: - [Leader-balance scheduler design PR #953](https://github.com/bootjp/elastickv/pull/953) §1.1a (PR0) + OQ-9 — this doc **is** that PR0. The leader-balance scheduler's transfer-issuing milestones (PR2/PR3) are blocked on a Raft group whose voter set spans more than one node; that topology cannot be declared at startup today, and OQ-9 resolved "option (a): extend the bootstrap/flag surface." This document is the design for option (a). From 26ed3565652d6df1b2728f25924d834676c9f5a4 Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Sun, 14 Jun 2026 17:23:47 +0900 Subject: [PATCH 09/14] docs: show proposal date source --- .../2026_06_12_proposed_multinode_multigroup_bootstrap.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/design/2026_06_12_proposed_multinode_multigroup_bootstrap.md b/docs/design/2026_06_12_proposed_multinode_multigroup_bootstrap.md index 90521cfe..41d53d69 100644 --- a/docs/design/2026_06_12_proposed_multinode_multigroup_bootstrap.md +++ b/docs/design/2026_06_12_proposed_multinode_multigroup_bootstrap.md @@ -3,7 +3,7 @@ Status: Proposed Author: bootjp Date: 2026-06-12 - +Date source: first add commit `252df5601700821c7bee7b642c9f0d758103f85f` (`git log --follow --diff-filter=A`), authored on 2026-06-12. 
Sibling / prerequisite-for: - [Leader-balance scheduler design PR #953](https://github.com/bootjp/elastickv/pull/953) §1.1a (PR0) + OQ-9 — this doc **is** that PR0. The leader-balance scheduler's transfer-issuing milestones (PR2/PR3) are blocked on a Raft group whose voter set spans more than one node; that topology cannot be declared at startup today, and OQ-9 resolved "option (a): extend the bootstrap/flag surface." This document is the design for option (a). From 917f531928fc246027d580040f08dce36452bd9b Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Sun, 14 Jun 2026 17:26:26 +0900 Subject: [PATCH 10/14] docs: align bootstrap proposal date --- ...p.md => 2026_06_14_proposed_multinode_multigroup_bootstrap.md} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename docs/design/{2026_06_12_proposed_multinode_multigroup_bootstrap.md => 2026_06_14_proposed_multinode_multigroup_bootstrap.md} (100%) diff --git a/docs/design/2026_06_12_proposed_multinode_multigroup_bootstrap.md b/docs/design/2026_06_14_proposed_multinode_multigroup_bootstrap.md similarity index 100% rename from docs/design/2026_06_12_proposed_multinode_multigroup_bootstrap.md rename to docs/design/2026_06_14_proposed_multinode_multigroup_bootstrap.md From a78f689912b2f741e517c3e907bcaa302d45c304 Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Sun, 14 Jun 2026 17:27:45 +0900 Subject: [PATCH 11/14] docs: update bootstrap proposal date metadata --- .../2026_06_14_proposed_multinode_multigroup_bootstrap.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/design/2026_06_14_proposed_multinode_multigroup_bootstrap.md b/docs/design/2026_06_14_proposed_multinode_multigroup_bootstrap.md index 41d53d69..48eba50f 100644 --- a/docs/design/2026_06_14_proposed_multinode_multigroup_bootstrap.md +++ b/docs/design/2026_06_14_proposed_multinode_multigroup_bootstrap.md @@ -2,8 +2,8 @@ Status: Proposed Author: bootjp -Date: 2026-06-12 -Date source: first add commit 
`252df5601700821c7bee7b642c9f0d758103f85f` (`git log --follow --diff-filter=A`), authored on 2026-06-12. +Date: 2026-06-14 +Date source: first add commit introducing this proposal in PR #955's review cycle, aligned with the filename propose date. Sibling / prerequisite-for: - [Leader-balance scheduler design PR #953](https://github.com/bootjp/elastickv/pull/953) §1.1a (PR0) + OQ-9 — this doc **is** that PR0. The leader-balance scheduler's transfer-issuing milestones (PR2/PR3) are blocked on a Raft group whose voter set spans more than one node; that topology cannot be declared at startup today, and OQ-9 resolved "option (a): extend the bootstrap/flag surface." This document is the design for option (a). @@ -229,4 +229,4 @@ This document begins as `*_proposed_*`. Per CLAUDE.md / `docs/design/README.md`: - Rename to `*_partial_*` after PR-B (multi-voter groups deployable at startup), recording which PRs shipped. - Rename to `*_implemented_*` after PR-C (in-process integration harness landed), with the Jepsen runner (PR-D) tracked as a follow-on. -Use `git mv` so history follows the rename. The propose date (2026-06-12) and slug stay fixed. +Use `git mv` so history follows the rename. The propose date (2026-06-14) and slug stay fixed. 
From 221bfc767427135724d0a13c29f7d00147652921 Mon Sep 17 00:00:00 2001
From: "Yoshiaki Ueda (bootjp)"
Date: Sun, 14 Jun 2026 17:30:49 +0900
Subject: [PATCH 12/14] docs: specify bootstrap seed metadata

---
 ...6_14_proposed_multinode_multigroup_bootstrap.md | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/docs/design/2026_06_14_proposed_multinode_multigroup_bootstrap.md b/docs/design/2026_06_14_proposed_multinode_multigroup_bootstrap.md
index 48eba50f..cda4b977 100644
--- a/docs/design/2026_06_14_proposed_multinode_multigroup_bootstrap.md
+++ b/docs/design/2026_06_14_proposed_multinode_multigroup_bootstrap.md
@@ -28,7 +28,7 @@ elastickv runs multiple Raft groups in one process (`--raftGroups id=addr,id=add
 The single-group multi-node path is fully built and is the template this design generalizes:

 - **`--raftBootstrapMembers id=addr,…` → a voter `[]raftengine.Server`** (`parseRaftBootstrapMembers`, `shard_config.go:352-384`), validated against the local node (must include `--raftId`, local address must match the group address: `resolveBootstrapServers`, `main.go:752-768`).
-- **The factory builds a transport when `len(peers) > 1`** and wires it into `Open(...)` (`internal/raftengine/etcd/factory.go:41-90`). `Open` normalizes/validates peers, and on first open writes them to a persisted-peers file (`normalizePeers` / `validateOpenPeers` / `savePersistedPeers`, `internal/raftengine/etcd/engine.go:620-643`; `LoadPersistedPeers`, `internal/raftengine/etcd/peer_metadata.go:40`). On restart, current code reloads the persisted list before opening (`factory.go:43-47`; `normalizeOpenConfig`, `engine.go:3298-3306`) and `validateOpenPeers` protects the persisted snapshot's ConfState against that loaded list (`engine.go:3313-3327`; `errClusterMismatch`, `:116`). It does **not** yet compare a newly supplied flag list against the persisted file after reload; PR-B must add that explicit configured-list-vs-persisted-list validation for `--raftGroupPeers`.
+- **The factory builds a transport when `len(peers) > 1`** and wires it into `Open(...)` (`internal/raftengine/etcd/factory.go:41-90`). `Open` normalizes/validates peers, and on first open writes them to a persisted-peers file (`normalizePeers` / `validateOpenPeers` / `savePersistedPeers`, `internal/raftengine/etcd/engine.go:620-643`; `LoadPersistedPeers`, `internal/raftengine/etcd/peer_metadata.go:40`). On restart, current code reloads the persisted list before opening (`factory.go:43-47`; `normalizeOpenConfig`, `engine.go:3298-3306`) and `validateOpenPeers` protects the persisted snapshot's ConfState against that loaded list (`engine.go:3313-3327`; `errClusterMismatch`, `:116`). The current peers file stores only `Index` and `Peers` (`peer_metadata.go:31-34`), so it cannot tell an unchanged bootstrap seed from a live-expanded membership. PR-B must extend this metadata with the original `--raftGroupPeers` seed plus a bootstrap-era marker before it can safely validate configured-list-vs-persisted-list for `--raftGroupPeers`.
 - **The transport resolves peers by node ID → address from the bootstrap list** (`NewGRPCTransport(peers)` builds `map[nodeID]Peer`, `internal/raftengine/etcd/grpc_transport.go:67-86`; sends dial `peer.Address`, `:493-517`), and supports runtime membership churn via `UpsertPeer` / `RemovePeer` (`:145-170`) as conf-changes commit.
 - **Each group already gets its own listener and its own `RaftAdmin` service** (`startRaftServers` registers `RegisterOperationalServicesWithInterceptor(ctx, gs, rt.engine, …)` then `lc.Listen(ctx, "tcp", rt.spec.address)` per runtime, `main.go:1610-1615`). `AddVoter`/`AddLearner`/`PromoteLearner`/`RemoveServer` are reachable per group (`cmd/raftadmin/main.go:197-285`; engine `AddVoter`, `internal/raftengine/etcd/engine.go:1252-1257`).
 - **`cmd/server/demo.go` already stands up 3 nodes that bootstrap one shared group.** All three node configs set `raftBootstrap=true` and receive the **same** `raftPeers` list (all three `{Suffrage:"voter", ID, Address}`), `cmd/server/demo.go:180-219`. The comment at `:215-219` records the key etcd requirement: *"every member of a fresh cluster must bootstrap with the same peer list."* This is exactly the per-group bootstrap discipline §3 generalizes to M groups.
@@ -119,9 +119,11 @@ The wiring change is small and local: instead of one process-wide `bootstrapServ

 We do **not** invent a "lexicographically-smallest peer proposes, others wait-and-join" protocol. That single-proposer pattern is the *AddVoter-composition* path (§5), not the bootstrap path — and adopting it for bootstrap would mean the non-proposer nodes start with an empty conf and must be added one-by-one, which is fragile (ordering, the proposer must be up first and must be leader) and is exactly the "manual AddVoter dance in every test harness" #953 OQ-9 rejected. The all-nodes-same-list model has no designated-proposer ordering requirement: nodes can start in any order, and raft elects a leader once a quorum is up.

-**Idempotency on restart (decision: persisted-peers + marker dir are the restart boundary; PR-B validates the bootstrap seed only while membership is still bootstrap-era).** On first open of a group dir, `Open` writes the normalized peer set to the persisted-peers file (`savePersistedPeers`, `engine.go:643`; format in `peer_metadata.go:205`). On restart, the factory/Open path **loads the persisted peers and uses them in preference to the flag-supplied list** (`factory.go:43-47`; `normalizeOpenConfig`, `engine.go:3298-3306`), and `validateOpenPeers` verifies the persisted snapshot's ConfState against that loaded peer set (`engine.go:632`; `validateOpenPeers`, `engine.go:3313-3327`; `errClusterMismatch`, `:116`). That is enough for same-list idempotency, but not enough to reject an operator who changes the initial `--raftGroupPeers` before any membership change: current code discards the newly supplied list before validation. PR-B therefore adds an explicit comparison of the normalized per-group flag list against the persisted peers **only when the persisted peer metadata still represents the initial bootstrap configuration**; divergence returns `errClusterMismatch` (or a wrapped validation error with that sentinel). Once a live membership change has committed, the persisted peer metadata is advanced by conf-change apply (`applyConfChangeCommitted`, `engine.go:2280-2282`; `writeCurrentPersistedPeers`, `engine.go:2688`) and becomes authoritative; `--raftGroupPeers` remains a bootstrap seed, not a desired-membership flag. So:
+**Idempotency on restart (decision: persisted-peers + stored bootstrap seed + marker dir are the restart boundary; PR-B validates the bootstrap seed only while membership is still bootstrap-era).** On first open of a group dir, `Open` writes the normalized peer set to the persisted-peers file (`savePersistedPeers`, `engine.go:643`; format in `peer_metadata.go:205`). Today that file stores only `Index` and `Peers` (`peer_metadata.go:31-34`), and both the factory and `normalizeOpenConfig` replace the configured peers with the persisted peers before validation (`factory.go:43-47`; `engine.go:3298-3306`). PR-B must therefore add a peers-file v3 metadata envelope, not infer state from the current `Peers` value: when peers came from `--raftGroupPeers`, persist (a) the current committed peers, (b) the normalized original bootstrap seed, and (c) a `bootstrapSeedActive`/first-conf-change marker. On the first committed `AddVoter`/`RemoveServer`/conf-change path, `persistConfigState` (`applyConfChangeCommitted`, `engine.go:2280-2282`; `writeCurrentPersistedPeers`, `engine.go:2688`) rewrites current peers and flips that marker off while retaining the original seed for diagnostics. Existing v1/v2 files and single-group `--raftBootstrapMembers` state have no seed marker, so they skip this new seed comparison and keep today's persisted-peers behavior.
+
+With that metadata in place, restart opens are deterministic: first read the full persisted peer metadata, compare the normalized flag-supplied `--raftGroupPeers` seed to the stored bootstrap seed only when the marker is still active, then use the persisted current peers as the engine's peer set. Divergence while the marker is active returns `errClusterMismatch` (or a wrapped validation error with that sentinel). Once a live membership change has committed, the marker is inactive and the persisted current peers are authoritative; `--raftGroupPeers` remains a bootstrap seed, not a desired-membership flag. So:
 - A restart with the same `--raftGroupPeers` re-loads the same persisted set per group → no re-bootstrap, no data risk.
-- A restart with a *different* `--raftGroupPeers` than the bootstrap-era persisted set **fails fast** in PR-B's explicit bootstrap-seed validation rather than silently ignoring the changed flag or re-bootstrapping over committed data.
+- A restart with a *different* `--raftGroupPeers` than the stored bootstrap seed while `bootstrapSeedActive` is still true **fails fast** in PR-B's explicit bootstrap-seed validation rather than silently ignoring the changed flag or re-bootstrapping over committed data.
 - A restart after `AddVoter`/`RemoveServer` has changed membership uses the committed persisted peer set even if the original bootstrap flag is still present; operators change membership through RaftAdmin, not by editing `--raftGroupPeers`.
 - The `raft-engine` marker (`ensureRaftEngineDataDir`, `multiraft_runtime.go:117-151`) independently guards against opening a group dir under the wrong engine type — unchanged, already per group.

@@ -172,14 +174,14 @@ The encryption writer-registration startup path (`main_encryption_registration.g
 - Every test harness and every operator runbook would have to replay that dance to get a multi-voter group, the per-test cost OQ-9 explicitly wanted to avoid.
 - It produces a *transient* single-voter window at startup where the group has no fault tolerance and no other transfer target — the opposite of what the leader-balance scheduler needs to test against.

-**Kept as the live-expansion path.** Growing an *already-running* group (add a 4th node to a 3-voter group, replace a dead node) is precisely what `AddVoter`/`RemoveServer` are for, and this design does not touch them. The persisted-peers file is rewritten by conf-change apply, so a node added via `AddVoter` and then restarted reloads the grown set (`factory.go:43-47`) — bootstrap and live-expansion compose cleanly. After the first committed membership change, `--raftGroupPeers` is no longer compared as desired membership; the persisted Raft configuration is authoritative until another RaftAdmin membership change commits.
+**Kept as the live-expansion path.** Growing an *already-running* group (add a 4th node to a 3-voter group, replace a dead node) is precisely what `AddVoter`/`RemoveServer` are for, and this design does not touch them. The persisted-peers file is rewritten by conf-change apply, so a node added via `AddVoter` and then restarted reloads the grown set (`factory.go:43-47`) — bootstrap and live-expansion compose cleanly. After the first committed membership change, PR-B's peers-file marker makes `--raftGroupPeers` inactive for seed comparison; the persisted Raft configuration is authoritative until another RaftAdmin membership change commits.

 ## 6. Rollout / testing

 ### 6.1 Unit
 - **Flag parsing** (`shard_config.go`, table-driven, co-located `*_test.go`): `--raftGroupPeers` grammar — multiple groups (`;`-separated), `raftID@host:port` members, whitespace, empty ⇒ nil; every validation rule of §3.1 (unknown group, missing-group-when-non-empty, local-node-absent, local-addr-mismatch, duplicate raftID, homogeneity violation, mutual-exclusion with `--raftBootstrapMembers`). Pure-function determinism: same input ⇒ identical sorted `map[uint64][]Server`.
 - **Per-group bootstrap-server resolution**: the new `map[uint64][]raftengine.Server` carries each group's own list; the empty-flag path returns today's behavior unchanged (regression-locks back-compat).
-- **Restart idempotency** (engine-level, `internal/raftengine/etcd/`): re-open a group dir with the same list ⇒ no re-bootstrap; re-open with a divergent flag-supplied list while membership is still bootstrap-era ⇒ PR-B's explicit bootstrap-seed validation returns `errClusterMismatch`; re-open after an `AddVoter`/`RemoveServer` conf-change with the original bootstrap flag still present ⇒ accepts the committed persisted peer set (add the multi-group-dir cases).
+- **Restart idempotency** (engine-level, `internal/raftengine/etcd/`): peers-file v3 round-trips current peers, the stored bootstrap seed, and the bootstrap-era marker; re-open a group dir with the same list ⇒ no re-bootstrap; re-open with a divergent flag-supplied list while the marker is active ⇒ PR-B's explicit bootstrap-seed validation returns `errClusterMismatch`; re-open after an `AddVoter`/`RemoveServer` conf-change flips the marker inactive with the original bootstrap flag still present ⇒ accepts the committed persisted peer set (add the multi-group-dir cases).

 ### 6.2 Integration — 3-node × 2-group in-process harness
 Stand up **3 nodes, 2 groups, every group a 3-voter Raft**, in one test process (extend `cmd/server/demo.go`'s pattern, or a new `internal/`-level harness so it is `go test`-runnable without the binary). Assertions:
@@ -199,7 +201,7 @@ Extend the multi-node story to Jepsen as a **later milestone** (noted, not v1):
 | PR | Scope | Tests | Shippable alone? |
 |---|---|---|---|
 | **PR-A** | Flag + parse + validation: `--raftGroupPeers`, the §3.1 grammar and all validation rules, mutual-exclusion with `--raftBootstrapMembers`. No wiring change yet (parsed result unused). | Unit (§6.1) flag-parse table tests. | Yes — pure parsing, zero behavior change (result unconsumed). |
-| **PR-B** | Wiring: add the `--raftGroupPeers` resolver alongside the existing `resolveBootstrapServers` path, preserving the `--raftBootstrapMembers` single-group guard; thread the static per-group peer map through `buildShardGroups`/`buildRuntimeForGroup`; replace the process-wide `bootstrap bool` decision with per-group bootstrap derived from whether that group has a resolved peer list; validate the flag-supplied bootstrap seed against persisted bootstrap-era peers before adopting persisted state, but preserve live-expanded persisted peers after conf changes. Each group now opens multi-voter. | Unit (§6.1) per-group resolution + restart idempotency; smoke that a 2-group config opens 2 transports. | After PR-A — the core capability. |
+| **PR-B** | Wiring: add the `--raftGroupPeers` resolver alongside the existing `resolveBootstrapServers` path, preserving the `--raftBootstrapMembers` single-group guard; thread the static per-group peer map through `buildShardGroups`/`buildRuntimeForGroup`; replace the process-wide `bootstrap bool` decision with per-group bootstrap derived from whether that group has a resolved peer list; extend persisted-peers metadata with the stored bootstrap seed plus bootstrap-era marker; validate the flag-supplied bootstrap seed against that stored seed before adopting persisted state while the marker is active, but preserve live-expanded persisted peers after conf changes by flipping the marker inactive. Each group now opens multi-voter. | Unit (§6.1) per-group resolution + restart idempotency; smoke that a 2-group config opens 2 transports. | After PR-A — the core capability. |
 | **PR-C** | In-process 3-node × 2-group integration harness (§6.2) + the leader-transfer-between-nodes smoke. | Integration (§6.2). | After PR-B — the deliverable #953 PR0 / hotspot-M2 need. |
 | **PR-D (later)** | Jepsen: true multi-node multi-group runner (§6.3). | Existing workloads, no-new-anomalies bar. | After PR-C; the "M6+" item.
|

From b45366060f748d3f310e5fba6bb5bd3b7b8e010c Mon Sep 17 00:00:00 2001
From: "Yoshiaki Ueda (bootjp)"
Date: Sun, 14 Jun 2026 17:36:53 +0900
Subject: [PATCH 13/14] docs: seed admin discovery from group peers

---
 .../2026_06_14_proposed_multinode_multigroup_bootstrap.md | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/docs/design/2026_06_14_proposed_multinode_multigroup_bootstrap.md b/docs/design/2026_06_14_proposed_multinode_multigroup_bootstrap.md
index cda4b977..687547e1 100644
--- a/docs/design/2026_06_14_proposed_multinode_multigroup_bootstrap.md
+++ b/docs/design/2026_06_14_proposed_multinode_multigroup_bootstrap.md
@@ -31,6 +31,7 @@ The single-group multi-node path is fully built and is the template this design
 - **The factory builds a transport when `len(peers) > 1`** and wires it into `Open(...)` (`internal/raftengine/etcd/factory.go:41-90`). `Open` normalizes/validates peers, and on first open writes them to a persisted-peers file (`normalizePeers` / `validateOpenPeers` / `savePersistedPeers`, `internal/raftengine/etcd/engine.go:620-643`; `LoadPersistedPeers`, `internal/raftengine/etcd/peer_metadata.go:40`). On restart, current code reloads the persisted list before opening (`factory.go:43-47`; `normalizeOpenConfig`, `engine.go:3298-3306`) and `validateOpenPeers` protects the persisted snapshot's ConfState against that loaded list (`engine.go:3313-3327`; `errClusterMismatch`, `:116`). The current peers file stores only `Index` and `Peers` (`peer_metadata.go:31-34`), so it cannot tell an unchanged bootstrap seed from a live-expanded membership. PR-B must extend this metadata with the original `--raftGroupPeers` seed plus a bootstrap-era marker before it can safely validate configured-list-vs-persisted-list for `--raftGroupPeers`.
 - **The transport resolves peers by node ID → address from the bootstrap list** (`NewGRPCTransport(peers)` builds `map[nodeID]Peer`, `internal/raftengine/etcd/grpc_transport.go:67-86`; sends dial `peer.Address`, `:493-517`), and supports runtime membership churn via `UpsertPeer` / `RemovePeer` (`:145-170`) as conf-changes commit.
 - **Each group already gets its own listener and its own `RaftAdmin` service** (`startRaftServers` registers `RegisterOperationalServicesWithInterceptor(ctx, gs, rt.engine, …)` then `lc.Listen(ctx, "tcp", rt.spec.address)` per runtime, `main.go:1610-1615`). `AddVoter`/`AddLearner`/`PromoteLearner`/`RemoveServer` are reachable per group (`cmd/raftadmin/main.go:197-285`; engine `AddVoter`, `internal/raftengine/etcd/engine.go:1252-1257`).
+- **Admin gRPC discovery is process-wide and is currently seeded from the same bootstrap list.** `startServers` passes `in.bootstrapServers` to `setupAdminService` (`main.go:1133`), and `adminMembersFromBootstrap` turns that list into the `GetClusterOverview` members advertised to the external admin fan-out (`main.go:1247-1254`, `:1315-1329`). With `--raftGroupPeers`, that seed must be derived from a canonical per-group peer list; otherwise `GetClusterOverview` degrades to self-only even though the Raft groups form correctly.
 - **`cmd/server/demo.go` already stands up 3 nodes that bootstrap one shared group.** All three node configs set `raftBootstrap=true` and receive the **same** `raftPeers` list (all three `{Suffrage:"voter", ID, Address}`), `cmd/server/demo.go:180-219`. The comment at `:215-219` records the key etcd requirement: *"every member of a fresh cluster must bootstrap with the same peer list."* This is exactly the per-group bootstrap discipline §3 generalizes to M groups.
 - **Per-group data dir + `raft-engine` marker is already per-group.** `groupDataDir(baseDir, raftID, groupID, multi)` returns `…/raftID/group-N` in multi mode (`multiraft_runtime.go:110-115`); `ensureRaftEngineDataDir` writes/reads the `raft-engine` marker and refuses an engine mismatch *per dir* (`multiraft_runtime.go:117-151`). So idempotent-restart detection is already per group.

@@ -113,7 +114,7 @@ RAFT_REDIS_MAP="127.0.0.1:5051=127.0.0.1:6379,127.0.0.1:5054=127.0.0.1:6379,127.

 ### 3.2 Bootstrap semantics

-The wiring change is small and local: instead of one process-wide `bootstrapServers` threaded into every group, **resolve a per-group `[]raftengine.Server` and pass each group its own list**. Concretely, `buildShardGroups` / `buildRuntimeForGroup` change from a single `bootstrapServers []raftengine.Server` parameter (`multiraft_runtime.go:234`, `main.go:777`) to a static `map[uint64][]raftengine.Server`, built once from the parsed `--raftGroupPeers`. Everything downstream — the factory's `len(peers) > 1` transport gate (`factory.go:50`), `Open`'s peer normalize/validate/persist (`engine.go:620-643`), the marker dir, the per-group listener — already operates per group and needs no change.
+The wiring change is small and local: instead of one process-wide `bootstrapServers` threaded into every group, **resolve a per-group `[]raftengine.Server` and pass each group its own list**. Concretely, `buildShardGroups` / `buildRuntimeForGroup` change from a single `bootstrapServers []raftengine.Server` parameter (`multiraft_runtime.go:234`, `main.go:777`) to a static `map[uint64][]raftengine.Server`, built once from the parsed `--raftGroupPeers`. The raft-engine downstream path — the factory's `len(peers) > 1` transport gate (`factory.go:50`), `Open`'s peer normalize/validate/persist (`engine.go:620-643`), the marker dir, the per-group listener — already operates per group. The one process-wide consumer PR-B must also update is Admin discovery: derive the `setupAdminService` member seed from a canonical group in the per-group map (§3.4) instead of leaving `in.bootstrapServers` nil when `--raftBootstrapMembers` is mutually exclusive.

 **Initial configuration model (decision: every node bootstraps with the identical per-group peer list — the etcd model — NOT a single designated proposer).** etcd/raft's bootstrap model is that **every** founding member calls `Bootstrap` with the **same** `ConfState`/peer list; raft then elects a leader among them. This is exactly what `cmd/server/demo.go` does for the single group (`raftBootstrap=true` on all three nodes with the shared `raftPeers`, `:204-219`) and what `resolveBootstrapServers` sets up for single-group (`bootstrap = *raftBootstrap || len(bootstrapServers) > 0`, `main.go:534`). We generalize it: when `--raftGroupPeers` is set, **every group on every node bootstraps with that group's full peer list**, and `bootstrap` is implied true for those groups (the operator does not also need `--raftBootstrap`; see the interaction rule below).

@@ -151,6 +152,8 @@ There is no "elect a bootstrapper" step to test, because the model is all-nodes-

 So the addressing model is: **N nodes × M groups ⇒ N×M (raftID, host:port) listener endpoints**, exactly the cross product `--raftGroupPeers` declares. Each node opens M listeners (one per group), each member of a group dials the other members' per-group endpoints.

+**Admin discovery seed:** The node-side Admin gRPC service is registered on every group listener (`pb.RegisterAdminServer(gs, adminServer)`, `main.go:1586-1587`) and already advertises self using `canonicalSelfAddress`, which selects the lowest-group-ID listener (`main.go:1285-1309`). PR-B must mirror that rule for peers: derive `adminMembersFromBootstrap` input from the lowest-group-ID `--raftGroupPeers` list, exclude self, and pass those node identities into `setupAdminService`. The v1 homogeneity rule guarantees the node-ID set is the same across groups, and each address in the chosen group is reachable because AdminServer is registered on every group listener. Empty `--raftGroupPeers` keeps today's `bootstrapServers` path unchanged.
+
 ## 4. Unchanged surfaces (explicitly)

 ### 4.1 Single-group path
@@ -181,6 +184,7 @@ The encryption writer-registration startup path (`main_encryption_registration.g
 ### 6.1 Unit
 - **Flag parsing** (`shard_config.go`, table-driven, co-located `*_test.go`): `--raftGroupPeers` grammar — multiple groups (`;`-separated), `raftID@host:port` members, whitespace, empty ⇒ nil; every validation rule of §3.1 (unknown group, missing-group-when-non-empty, local-node-absent, local-addr-mismatch, duplicate raftID, homogeneity violation, mutual-exclusion with `--raftBootstrapMembers`). Pure-function determinism: same input ⇒ identical sorted `map[uint64][]Server`.
 - **Per-group bootstrap-server resolution**: the new `map[uint64][]raftengine.Server` carries each group's own list; the empty-flag path returns today's behavior unchanged (regression-locks back-compat).
+- **Admin discovery seed** (`main_admin_test.go`): with `--raftGroupPeers`, derive non-self `NodeIdentity` members from the canonical group so `GetClusterOverview` includes remote nodes; empty-flag/single-group bootstrap keeps the existing `adminMembersFromBootstrap` behavior.
 - **Restart idempotency** (engine-level, `internal/raftengine/etcd/`): peers-file v3 round-trips current peers, the stored bootstrap seed, and the bootstrap-era marker; re-open a group dir with the same list ⇒ no re-bootstrap; re-open with a divergent flag-supplied list while the marker is active ⇒ PR-B's explicit bootstrap-seed validation returns `errClusterMismatch`; re-open after an `AddVoter`/`RemoveServer` conf-change flips the marker inactive with the original bootstrap flag still present ⇒ accepts the committed persisted peer set (add the multi-group-dir cases).

 ### 6.2 Integration — 3-node × 2-group in-process harness
@@ -201,7 +205,7 @@ Extend the multi-node story to Jepsen as a **later milestone** (noted, not v1):
 | PR | Scope | Tests | Shippable alone? |
 |---|---|---|---|
 | **PR-A** | Flag + parse + validation: `--raftGroupPeers`, the §3.1 grammar and all validation rules, mutual-exclusion with `--raftBootstrapMembers`. No wiring change yet (parsed result unused). | Unit (§6.1) flag-parse table tests. | Yes — pure parsing, zero behavior change (result unconsumed). |
-| **PR-B** | Wiring: add the `--raftGroupPeers` resolver alongside the existing `resolveBootstrapServers` path, preserving the `--raftBootstrapMembers` single-group guard; thread the static per-group peer map through `buildShardGroups`/`buildRuntimeForGroup`; replace the process-wide `bootstrap bool` decision with per-group bootstrap derived from whether that group has a resolved peer list; extend persisted-peers metadata with the stored bootstrap seed plus bootstrap-era marker; validate the flag-supplied bootstrap seed against that stored seed before adopting persisted state while the marker is active, but preserve live-expanded persisted peers after conf changes by flipping the marker inactive. Each group now opens multi-voter. | Unit (§6.1) per-group resolution + restart idempotency; smoke that a 2-group config opens 2 transports. | After PR-A — the core capability. |
+| **PR-B** | Wiring: add the `--raftGroupPeers` resolver alongside the existing `resolveBootstrapServers` path, preserving the `--raftBootstrapMembers` single-group guard; thread the static per-group peer map through `buildShardGroups`/`buildRuntimeForGroup`; replace the process-wide `bootstrap bool` decision with per-group bootstrap derived from whether that group has a resolved peer list; seed Admin discovery from the canonical group in that peer map; extend persisted-peers metadata with the stored bootstrap seed plus bootstrap-era marker; validate the flag-supplied bootstrap seed against that stored seed before adopting persisted state while the marker is active, but preserve live-expanded persisted peers after conf changes by flipping the marker inactive. Each group now opens multi-voter. | Unit (§6.1) per-group resolution + Admin discovery seed + restart idempotency; smoke that a 2-group config opens 2 transports and `GetClusterOverview` lists remote members. | After PR-A — the core capability. |
 | **PR-C** | In-process 3-node × 2-group integration harness (§6.2) + the leader-transfer-between-nodes smoke. | Integration (§6.2). | After PR-B — the deliverable #953 PR0 / hotspot-M2 need. |
 | **PR-D (later)** | Jepsen: true multi-node multi-group runner (§6.3). | Existing workloads, no-new-anomalies bar. | After PR-C; the "M6+" item.
|

From 3d478109efdfe082f410ce3e1d136fc0e484c37b Mon Sep 17 00:00:00 2001
From: "Yoshiaki Ueda (bootjp)"
Date: Sun, 14 Jun 2026 17:41:22 +0900
Subject: [PATCH 14/14] docs: require multi-voter group peers

---
 .../2026_06_14_proposed_multinode_multigroup_bootstrap.md | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/docs/design/2026_06_14_proposed_multinode_multigroup_bootstrap.md b/docs/design/2026_06_14_proposed_multinode_multigroup_bootstrap.md
index 687547e1..f20cd6da 100644
--- a/docs/design/2026_06_14_proposed_multinode_multigroup_bootstrap.md
+++ b/docs/design/2026_06_14_proposed_multinode_multigroup_bootstrap.md
@@ -109,8 +109,9 @@ RAFT_REDIS_MAP="127.0.0.1:5051=127.0.0.1:6379,127.0.0.1:5054=127.0.0.1:6379,127.
 1. Every group ID in `--raftGroupPeers` must appear in `--raftGroups`, and (v1 homogeneous goal) **every** group in `--raftGroups` must appear in `--raftGroupPeers` when the flag is non-empty. A group with no peer list would silently fall back to single-member — a foot-gun we reject.
 2. Each group's member list must **include the local node**: a `member` whose `raftID == --raftId` must be present, and its `host:port` must equal that group's `--raftGroups` local address (`groupSpec.address`). This is the per-group generalization of the existing single-group check `ErrBootstrapMembersLocalAddrMismatch` (`main.go:760-765`).
 3. No duplicate `raftID` within a group (mirrors `parseRaftBootstrapMembers`'s `duplicate id` check, `shard_config.go:373-375`).
-4. v1 homogeneity check: the set of `raftID`s must be **identical across all groups** (every node votes in every group). Violations are rejected with a clear error pointing at the first divergent group. (Relaxing this is OQ-4.)
-5. Each member's address must be non-empty and well-formed `host:port` (reuse existing address parsing).
+4. Each group must contain at least **two distinct voters**. A one-member list such as `1=n1@...;2=n1@...` recreates today's single-voter/no-transfer topology and is rejected; operators who want single-member groups leave `--raftGroupPeers` empty and use the existing path.
+5. v1 homogeneity check: the set of `raftID`s must be **identical across all groups** (every node votes in every group). Violations are rejected with a clear error pointing at the first divergent group. (Relaxing this is OQ-4.)
+6. Each member's address must be non-empty and well-formed `host:port` (reuse existing address parsing).

 ### 3.2 Bootstrap semantics

@@ -182,7 +183,7 @@ The encryption writer-registration startup path (`main_encryption_registration.g
 ## 6. Rollout / testing

 ### 6.1 Unit
-- **Flag parsing** (`shard_config.go`, table-driven, co-located `*_test.go`): `--raftGroupPeers` grammar — multiple groups (`;`-separated), `raftID@host:port` members, whitespace, empty ⇒ nil; every validation rule of §3.1 (unknown group, missing-group-when-non-empty, local-node-absent, local-addr-mismatch, duplicate raftID, homogeneity violation, mutual-exclusion with `--raftBootstrapMembers`). Pure-function determinism: same input ⇒ identical sorted `map[uint64][]Server`.
+- **Flag parsing** (`shard_config.go`, table-driven, co-located `*_test.go`): `--raftGroupPeers` grammar — multiple groups (`;`-separated), `raftID@host:port` members, whitespace, empty ⇒ nil; every validation rule of §3.1 (unknown group, missing-group-when-non-empty, local-node-absent, local-addr-mismatch, duplicate raftID, one-voter group, homogeneity violation, mutual-exclusion with `--raftBootstrapMembers`). Pure-function determinism: same input ⇒ identical sorted `map[uint64][]Server`.
 - **Per-group bootstrap-server resolution**: the new `map[uint64][]raftengine.Server` carries each group's own list; the empty-flag path returns today's behavior unchanged (regression-locks back-compat).
 - **Admin discovery seed** (`main_admin_test.go`): with `--raftGroupPeers`, derive non-self `NodeIdentity` members from the canonical group so `GetClusterOverview` includes remote nodes; empty-flag/single-group bootstrap keeps the existing `adminMembersFromBootstrap` behavior.
 - **Restart idempotency** (engine-level, `internal/raftengine/etcd/`): peers-file v3 round-trips current peers, the stored bootstrap seed, and the bootstrap-era marker; re-open a group dir with the same list ⇒ no re-bootstrap; re-open with a divergent flag-supplied list while the marker is active ⇒ PR-B's explicit bootstrap-seed validation returns `errClusterMismatch`; re-open after an `AddVoter`/`RemoveServer` conf-change flips the marker inactive with the original bootstrap flag still present ⇒ accepts the committed persisted peer set (add the multi-group-dir cases).