docs(design): propose multi-node multi-group bootstrap#955
Conversation
|
Warning Review limit reached
More reviews will be available in 12 minutes and 6 seconds. Learn how PR review limits work. Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file). ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughA new design document is added at ChangesMulti-node multi-group bootstrap design document
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@claude review |
|
Claude finished @bootjp's task in 2m 37s —— View job Code Review —
|
| Claim | Verified |
|---|---|
groupSpec is id uint64; address string, shard_config.go:14-17 |
✓ |
resolveBootstrapServers guard at main.go:746-748, error at :736 |
✓ |
buildRuntimeForGroup passes single bootstrapServers to every group, multiraft_runtime.go:234 |
✓ |
Transport built only if len(peers) > 1, factory.go:49-52 |
✓ |
bootstrap = *raftBootstrap || len(bootstrapServers) > 0, main.go:534 |
✓ |
errClusterMismatch at engine.go:116 |
✓ |
persisted peers loaded at factory.go:43-47 |
✓ |
demo.go:200-219 etcd all-same-list bootstrap + comment |
✓ |
parseSQSFifoPartitionMap ;-separator precedent, shard_config.go:174-196 |
✓ |
parseRaftBootstrapMembers duplicate-id check, shard_config.go:373-375 |
✓ |
Open questions: recommendations
| OQ | Recommendation | Rationale |
|---|---|---|
| OQ-1 (companion flag) | Companion --raftGroupPeers ✓ |
Clean separation already well-argued in §3.1 |
OQ-2 (--raftBootstrap implied) |
Implied ✓ | Consistent with main.go:534 — non-empty peers already implies bootstrap there |
| OQ-3 (map vs. func) | map[uint64][]Server |
Simpler; eager validation at startup is better than deferred |
| OQ-4 (heterogeneous) | Hard error in v1 ✓ | Forward-compatible (grammar already expresses it; relax later with a targeted consumer) |
| OQ-5 (learner interop) | Voter-only in v1 ✓ | Learner bootstrap is a separate concern; mixing at bootstrap adds ordering complexity |
| OQ-6 (per-group vs. shared listener) | Per-group in v1 ✓ | Zero transport change; port count only becomes a problem at M >> typical deployments |
Verdict: Approve. The two substantive issues (data-loss-lens gap in §3.2, per-group bootstrap flag scope gap in PR-B) are addressable in implementation PRs and both are within the mandate of "PR carries the five-lens review." The cross-references are exceptionally well-grounded.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 252df56017
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| We do **not** invent a "lexicographically-smallest peer proposes, others wait-and-join" protocol. That single-proposer pattern is the *AddVoter-composition* path (§5), not the bootstrap path — and adopting it for bootstrap would mean the non-proposer nodes start with an empty conf and must be added one-by-one, which is fragile (ordering, the proposer must be up first and must be leader) and is exactly the "manual AddVoter dance in every test harness" #953 OQ-9 rejected. The all-nodes-same-list model has no designated-proposer ordering requirement: nodes can start in any order, and raft elects a leader once a quorum is up. | ||
|
|
||
| **Idempotency on restart (decision: persisted-peers + marker dir already give this — per group).** On first open of a group dir, `Open` writes the normalized peer set to the persisted-peers file (`savePersistedPeers`, `engine.go:643`; format in `peer_metadata.go:205`). On restart, the factory **loads the persisted peers and uses them in preference to the flag-supplied list** (`factory.go:43-47`), and `Open` refuses to start if the configured cluster disagrees with what is persisted (`validateOpenPeers` → `errClusterMismatch`, `engine.go:632`, `:116`). So: |
There was a problem hiding this comment.
Fix restart mismatch detection before relying on it
For the restart scenario where an operator changes --raftGroupPeers after a group has already written etcd-raft-peers.bin, this design relies on errClusterMismatch to catch the divergent flag list, but the current code discards that list before validation: Factory.Create loads persisted peers and overwrites cfg.Peers, and normalizeOpenConfig does the same before validateOpenPeers runs. As a result, validateOpenPeers compares the snapshot to the persisted list, not the new flag-supplied list, so an implementation following this doc would silently ignore divergent --raftGroupPeers rather than fail fast unless the factory/open path is changed to validate the configured list separately.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Code Review
This pull request proposes a design for a multi-node multi-group bootstrap mechanism in elastickv, introducing a new --raftGroupPeers flag to configure per-group peer lists at startup. The feedback suggests avoiding the use of a function provider for retrieving bootstrap servers, recommending a static map instead, as the group configuration is static at runtime.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
|
|
||
| ### 3.2 Bootstrap semantics | ||
|
|
||
| The wiring change is small and local: instead of one process-wide `bootstrapServers` threaded into every group, **resolve a per-group `[]raftengine.Server` and pass each group its own list**. Concretely, `buildShardGroups` / `buildRuntimeForGroup` change from a single `bootstrapServers []raftengine.Server` parameter (`multiraft_runtime.go:234`, `main.go:777`) to a `bootstrapServersFor func(groupID uint64) []raftengine.Server` lookup (or a `map[uint64][]raftengine.Server`), built once from the parsed `--raftGroupPeers`. Everything downstream — the factory's `len(peers) > 1` transport gate (`factory.go:50`), `Open`'s peer normalize/validate/persist (`engine.go:620-643`), the marker dir, the per-group listener — already operates per group and needs no change. |
There was a problem hiding this comment.
To prevent designing for hypothetical future requirements, we should avoid introducing function providers when the value is static at runtime. Since the set of groups is fixed at startup, a static map[uint64][]raftengine.Server should be used instead of a function provider lookup like func(groupID uint64) []raftengine.Server.
| The wiring change is small and local: instead of one process-wide `bootstrapServers` threaded into every group, **resolve a per-group `[]raftengine.Server` and pass each group its own list**. Concretely, `buildShardGroups` / `buildRuntimeForGroup` change from a single `bootstrapServers []raftengine.Server` parameter (`multiraft_runtime.go:234`, `main.go:777`) to a `bootstrapServersFor func(groupID uint64) []raftengine.Server` lookup (or a `map[uint64][]raftengine.Server`), built once from the parsed `--raftGroupPeers`. Everything downstream — the factory's `len(peers) > 1` transport gate (`factory.go:50`), `Open`'s peer normalize/validate/persist (`engine.go:620-643`), the marker dir, the per-group listener — already operates per group and needs no change. | |
| The wiring change is small and local: instead of one process-wide bootstrapServers threaded into every group, resolve a per-group []raftengine.Server and pass each group its own list. Concretely, buildShardGroups / buildRuntimeForGroup change from a single bootstrapServers []raftengine.Server parameter (multiraft_runtime.go:234, main.go:777) to a map[uint64][]raftengine.Server lookup, built once from the parsed --raftGroupPeers. Everything downstream - the factory's len(peers) > 1 transport gate (factory.go:50), Open's peer normalize/validate/persist (engine.go:620-643), the marker dir, the per-group listener - already operates per group and needs no change. |
References
- Do not design for hypothetical future requirements by introducing function providers or atomic variables when the value is static at runtime. Use a static value instead, and upgrade to dynamic/thread-safe types only when a runtime update path is actually introduced.
|
|
||
| 1. **OQ-1 — Flag spelling: `--raftGroupPeers` companion flag vs. extending `--raftGroups` entries.** §3.1 recommends the companion flag (clean separation of "my listener" vs. "the voter set"; mirrors single-group `--raftBootstrapMembers`). Confirm before PR-A, since it fixes the operator-facing surface. | ||
| 2. **OQ-2 — Should `--raftBootstrap` be *required* alongside `--raftGroupPeers`, or implied?** §3.2 recommends implied (non-empty `--raftGroupPeers` ⇒ bootstrap=true per group, matching `bootstrap = *raftBootstrap || len(bootstrapServers) > 0`, `main.go:534`). Alternative: require `--raftBootstrap` explicitly for symmetry with single-group. Confirm before PR-B. | ||
| 3. **OQ-3 — Per-group bootstrap-server carrier: `func(groupID) []Server` vs. `map[uint64][]Server`.** Implementation detail for threading through `buildShardGroups`/`buildRuntimeForGroup` (`main.go:777`, `multiraft_runtime.go:234`). Map is simpler; func defers construction. Either is fine; pick at PR-B. |
There was a problem hiding this comment.
Following our architectural guidelines, we should not design for hypothetical future requirements by introducing function providers when the value is static at runtime. Since the group membership is static at runtime, we should resolve OQ-3 by choosing the static map[uint64][]Server approach and removing the function provider option.
| 3. **OQ-3 — Per-group bootstrap-server carrier: `func(groupID) []Server` vs. `map[uint64][]Server`.** Implementation detail for threading through `buildShardGroups`/`buildRuntimeForGroup` (`main.go:777`, `multiraft_runtime.go:234`). Map is simpler; func defers construction. Either is fine; pick at PR-B. | |
| 3. OQ-3 - Per-group bootstrap-server carrier: map[uint64][]Server. Implementation detail for threading through buildShardGroups/buildRuntimeForGroup (main.go:777, multiraft_runtime.go:234). A static map is simpler and avoids introducing a function provider for values that are static at runtime. |
References
- Do not design for hypothetical future requirements by introducing function providers or atomic variables when the value is static at runtime. Use a static value instead, and upgrade to dynamic/thread-safe types only when a runtime update path is actually introduced.
Issue 1: correct stale line anchors in kv/sharded_coordinator.go - keyviz sampler observeMutation: 1795-1824 -> 1841-1846 - RunHLCLeaseRenewal: 1914-1953 -> 1960-1985; defaultGroup access :1915 -> :1961 Issue 2: soften companion-doc references to in-flight PR #955 form, matching the #945/#951/#953 branch-reference style (3 sites + map line). Issue 3: ground OQ-1 in the actual commitTS logic (nextStartTS/ resolveTxnCommitTS/nextTxnTSAfter all from one c.clock) and annotate §4 step 11 as deferred-pending-OQ-1 with an explicit trigger condition. Inline: fix learner LinearizableRead behaviour (engine returns ErrNotLeader, caller forwards; engine.go:1583); note shared-startTS invariant; add OQ-7 for the single-node->multi-node live cutover / rolling-upgrade strategy.
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9dda274b78
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| Date: 2026-06-12 | ||
|
|
||
| Sibling / prerequisite-for: | ||
| - [2026_06_11_proposed_leader_balance_scheduler.md](2026_06_11_proposed_leader_balance_scheduler.md) §1.1a (PR0) + OQ-9 — this doc **is** that PR0. The leader-balance scheduler's transfer-issuing milestones (PR2/PR3) are blocked on a Raft group whose voter set spans more than one node; that topology cannot be declared at startup today, and OQ-9 resolved "option (a): extend the bootstrap/flag surface." This document is the design for option (a). |
There was a problem hiding this comment.
Point the sibling link at a checked-in design doc
In the commit tree, docs/design/2026_06_11_proposed_leader_balance_scheduler.md is not present (I checked the docs/design tree for leader/balance/2026_06_11), so this sibling/prerequisite link is broken and the PR0/PR2 assumptions it cites cannot be inspected from the repo. Either add the referenced design doc in the same change or update this link to an existing document or external issue.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
🧹 Nitpick comments (3)
docs/design/2026_06_12_proposed_multinode_multigroup_bootstrap.md (3)
62-91: 💤 Low valueAdd language identifiers to fenced code blocks for clarity.
Lines 62–91 contain shell/bash command examples that should declare a language for syntax highlighting and linter clarity.
💬 Proposed additions
Line 62 (before the first code fence):
-``` +```bashLine 73 (before the local-node 3-node × 2-group example):
-``` +```bash🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/design/2026_06_12_proposed_multinode_multigroup_bootstrap.md` around lines 62 - 91, Add language identifiers to the fenced code blocks in the documentation file to enable syntax highlighting and linter clarity. Specifically, change the opening fence at line 62 from triple backticks (```) to ```bash for the raftGroupPeers grammar and example command block, and change the opening fence at line 73 from triple backticks (```) to ```bash for the 3-node × 2-group local example block showing node configurations.Source: Linters/SAST tools
210-217: 💤 Low valueClarify which open questions require resolution before PR-A vs. those that are design-forward notes.
OQ-1 and OQ-2 are marked "confirm before PR-A/B," signaling they block implementation. OQ-4, OQ-5, OQ-6 are framed as forward-extensions (homogeneity warnings, learner-join scope, listener multiplexing). Make the "blocker vs. note" distinction explicit at the start of each OQ, or group them: blockers (OQ-1, OQ-2) first, then forward notes (OQ-3–6). This clarifies which decisions unblock the four-PR milestone plan.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/design/2026_06_12_proposed_multinode_multigroup_bootstrap.md` around lines 210 - 217, The open questions section mixes blocking decisions (OQ-1 and OQ-2, which are marked "confirm before PR-A/B") with forward-extension notes (OQ-3 through OQ-6). Restructure the list to explicitly separate them: create a "Blocking Questions (Required before PR-A/B)" subsection containing OQ-1 and OQ-2 with their decision gates, then create a "Forward-Extension Notes (v1+ design)" subsection for OQ-3 through OQ-6 to clarify they do not block the four-PR milestone. At the top of each open question, add a clear label like "BLOCKER:" or "FORWARD NOTE:" to make the blocker-vs-note distinction immediately visible without requiring readers to search for PR references.
209-216: 💤 Low valueOQ-3 appears to be already resolved in §3.2 — consider removing or clarifying.
Section 3.2 explicitly decides on the bootstrap-server carrier design: "use a static
map[uint64][]raftengine.Server, built once from the parsed--raftGroupPeers." This directly answers OQ-3 ("Per-group bootstrap-server carrier:map[uint64][]Server"), which lists "Resolved here: use a static map built during startup validation." OQ-3 is not open; it is a decided design detail documented in the main body. Either remove it from §8 or reframe it as confirming the early design decision was correct once PR-A/PR-B validate the implementation.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/design/2026_06_12_proposed_multinode_multigroup_bootstrap.md` around lines 209 - 216, OQ-3 is listed as an open question but section 3.2 already explicitly resolves the bootstrap-server carrier design decision. Remove OQ-3 entirely from the Open Questions section (§8) since it is not actually open — the design of using a static map[uint64][]Server built from --raftGroupPeers has already been decided and documented in the main design body (§3.2). If you believe this decision should be revisited or confirmed during implementation, reframe it instead as a retrospective confirmation task for PR-A/PR-B rather than an open question.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@docs/design/2026_06_12_proposed_multinode_multigroup_bootstrap.md`:
- Around line 62-91: Add language identifiers to the fenced code blocks in the
documentation file to enable syntax highlighting and linter clarity.
Specifically, change the opening fence at line 62 from triple backticks (```) to
```bash for the raftGroupPeers grammar and example command block, and change the
opening fence at line 73 from triple backticks (```) to ```bash for the 3-node ×
2-group local example block showing node configurations.
- Around line 210-217: The open questions section mixes blocking decisions (OQ-1
and OQ-2, which are marked "confirm before PR-A/B") with forward-extension notes
(OQ-3 through OQ-6). Restructure the list to explicitly separate them: create a
"Blocking Questions (Required before PR-A/B)" subsection containing OQ-1 and
OQ-2 with their decision gates, then create a "Forward-Extension Notes (v1+
design)" subsection for OQ-3 through OQ-6 to clarify they do not block the
four-PR milestone. At the top of each open question, add a clear label like
"BLOCKER:" or "FORWARD NOTE:" to make the blocker-vs-note distinction
immediately visible without requiring readers to search for PR references.
- Around line 209-216: OQ-3 is listed as an open question but section 3.2
already explicitly resolves the bootstrap-server carrier design decision. Remove
OQ-3 entirely from the Open Questions section (§8) since it is not actually open
— the design of using a static map[uint64][]Server built from --raftGroupPeers
has already been decided and documented in the main design body (§3.2). If you
believe this decision should be revisited or confirmed during implementation,
reframe it instead as a retrospective confirmation task for PR-A/PR-B rather
than an open question.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 5c34d604-a8d8-4350-8635-248683eeb8d7
📒 Files selected for processing (1)
docs/design/2026_06_12_proposed_multinode_multigroup_bootstrap.md
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4c7600ab13
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| ### 3.1 Flag surface — per-group peer lists | ||
|
|
||
| **Decision: add a companion flag `--raftGroupPeers`, and lift the `len(groups)==1` guard in `resolveBootstrapServers`.** Keep `--raftGroups` (group→local-address) exactly as is; declare the *cross-node* voter set per group in a new flag. |
There was a problem hiding this comment.
Keep the single-group guard on bootstrapMembers
As written, this tells the follow-up PR to lift the guard in resolveBootstrapServers, but that function still parses only the flat --raftBootstrapMembers list. If implemented literally, a multi-group process can pass one peer list through buildRuntimeForGroup for every group; for group 2 the remote peers would still point at group 1 listeners while the local peer is merged to group 2's address, causing cross-group raft dials or startup failures. Since the later rules keep --raftBootstrapMembers mutually exclusive/single-group, this should instead describe a separate --raftGroupPeers resolver while preserving the existing guard.
Useful? React with 👍 / 👎.
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3996d89821
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| - Each entry is `groupID=member,member,…`. | ||
| - Each `member` is `raftID@host:port` — the `@` separates the node's stable Raft ID (matching `--raftId` semantics) from its listener address for that group. (`raftID` is needed explicitly because etcd's bootstrap requires the same `id→address` mapping on every node, `cmd/server/demo.go:215-219`; the address alone is not the identity.) | ||
|
|
||
| Concrete 3-node × 2-group local example (all nodes share the same `--raftGroupPeers`; `--raftGroups` and protocol maps name only that node's local listeners): |
There was a problem hiding this comment.
Include remote entries in protocol maps
This local-only example will make the proposed topology fail follower ingress: Redis leaderClientForKey and the HTTP leader proxy use the coordinator's current Raft leader address to index the protocol map (adapter/redis.go:4282-4289, adapter/leader_http_proxy.go:47-52), so if node n1 receives a request for a group led by n2 it looks for 127.0.0.1:5052/5055 but n1's map only contains its own 5051/5054 entries and returns an unknown-leader/503 instead of forwarding. The existing multi-node runbook uses a shared full raft→protocol map for this reason; these examples should require all N×M raft listener entries, or explicitly say local-only maps only work when clients always hit the leader.
Useful? React with 👍 / 👎.
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0423973dbd
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| ``` | ||
| RAFT_GROUP_PEERS="1=n1@127.0.0.1:5051,n2@127.0.0.1:5052,n3@127.0.0.1:5053;2=n1@127.0.0.1:5054,n2@127.0.0.1:5055,n3@127.0.0.1:5056" | ||
| RAFT_REDIS_MAP="127.0.0.1:5051=127.0.0.1:6379,127.0.0.1:5052=127.0.0.1:6380,127.0.0.1:5053=127.0.0.1:6381,127.0.0.1:5054=127.0.0.1:6382,127.0.0.1:5055=127.0.0.1:6383,127.0.0.1:5056=127.0.0.1:6384" |
There was a problem hiding this comment.
Use each node’s single Redis listener in the map
In this local 3-node×2-group example, RAFT_REDIS_MAP assigns a different Redis port to every raft-group endpoint (for example n1's group 1 maps to 6379 while n1's group 2 maps to 6382), but the server process starts only one Redis listener from --redisAddress (main.go's startRedisServer) and this doc says protocol surfaces stay unchanged. If a key's leader is n1/group 2, follower routing will dial 6382 even though that node can only be listening on one Redis address, so the example should repeat each node's Redis address for all of that node's group endpoints and set --redisAddress per node.
Useful? React with 👍 / 👎.
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6b988feaed
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| We do **not** invent a "lexicographically-smallest peer proposes, others wait-and-join" protocol. That single-proposer pattern is the *AddVoter-composition* path (§5), not the bootstrap path — and adopting it for bootstrap would mean the non-proposer nodes start with an empty conf and must be added one-by-one, which is fragile (ordering, the proposer must be up first and must be leader) and is exactly the "manual AddVoter dance in every test harness" #953 OQ-9 rejected. The all-nodes-same-list model has no designated-proposer ordering requirement: nodes can start in any order, and raft elects a leader once a quorum is up. | ||
|
|
||
| **Idempotency on restart (decision: persisted-peers + marker dir are the restart boundary; PR-B must validate configured peers before adopting persisted peers).** On first open of a group dir, `Open` writes the normalized peer set to the persisted-peers file (`savePersistedPeers`, `engine.go:643`; format in `peer_metadata.go:205`). On restart, the factory/Open path **loads the persisted peers and uses them in preference to the flag-supplied list** (`factory.go:43-47`; `normalizeOpenConfig`, `engine.go:3298-3306`), and `validateOpenPeers` verifies the persisted snapshot's ConfState against that loaded peer set (`engine.go:632`; `validateOpenPeers`, `engine.go:3313-3327`; `errClusterMismatch`, `:116`). That is enough for same-list idempotency, but not enough to reject an operator who changes `--raftGroupPeers`: current code discards the newly supplied list before validation. PR-B therefore adds an explicit comparison of the normalized per-group flag list against the persisted peers for that group before overwriting it; divergence returns `errClusterMismatch` (or a wrapped validation error with that sentinel). So: |
There was a problem hiding this comment.
Preserve live-expanded peers on restart
When a group is later expanded or replaced via AddVoter/RemoveServer, the existing engine persists the committed peer set from conf changes (internal/raftengine/etcd/engine.go:2281-2282, :2688), while the operator may still pass the original bootstrap-only --raftGroupPeers list. Making PR-B always compare the configured list against persisted peers before adopting persisted state would reject a healthy restart after live membership changes, contradicting §5's claim that bootstrap and live expansion compose cleanly; scope this validation to fresh bootstrap/unchanged membership or define --raftGroupPeers as desired membership that operators must update.
Useful? React with 👍 / 👎.
|
|
||
| Status: Proposed | ||
| Author: bootjp | ||
| Date: 2026-06-12 |
There was a problem hiding this comment.
Align the proposal date with the add commit
docs/design/README.md says the filename/header date must be the author date of the first commit that added the file (git log --follow --diff-filter=A ...); this file is introduced by commit 2b19470d8a28d15850b1f56b526701aee815692e with author date 2026-06-14, but the header (and filename) use 2026-06-12. That leaves the design metadata inconsistent for future lifecycle renames, so either rename/update this to 2026-06-14 or point to an earlier add commit if one exists.
Useful? React with 👍 / 👎.
|
@codex review |
|
@codex review |
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 26ed356565
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| We do **not** invent a "lexicographically-smallest peer proposes, others wait-and-join" protocol. That single-proposer pattern is the *AddVoter-composition* path (§5), not the bootstrap path — and adopting it for bootstrap would mean the non-proposer nodes start with an empty conf and must be added one-by-one, which is fragile (ordering, the proposer must be up first and must be leader) and is exactly the "manual AddVoter dance in every test harness" #953 OQ-9 rejected. The all-nodes-same-list model has no designated-proposer ordering requirement: nodes can start in any order, and raft elects a leader once a quorum is up. | ||
|
|
||
| **Idempotency on restart (decision: persisted-peers + marker dir are the restart boundary; PR-B validates the bootstrap seed only while membership is still bootstrap-era).** On first open of a group dir, `Open` writes the normalized peer set to the persisted-peers file (`savePersistedPeers`, `engine.go:643`; format in `peer_metadata.go:205`). On restart, the factory/Open path **loads the persisted peers and uses them in preference to the flag-supplied list** (`factory.go:43-47`; `normalizeOpenConfig`, `engine.go:3298-3306`), and `validateOpenPeers` verifies the persisted snapshot's ConfState against that loaded peer set (`engine.go:632`; `validateOpenPeers`, `engine.go:3313-3327`; `errClusterMismatch`, `:116`). That is enough for same-list idempotency, but not enough to reject an operator who changes the initial `--raftGroupPeers` before any membership change: current code discards the newly supplied list before validation. PR-B therefore adds an explicit comparison of the normalized per-group flag list against the persisted peers **only when the persisted peer metadata still represents the initial bootstrap configuration**; divergence returns `errClusterMismatch` (or a wrapped validation error with that sentinel). Once a live membership change has committed, the persisted peer metadata is advanced by conf-change apply (`applyConfChangeCommitted`, `engine.go:2280-2282`; `writeCurrentPersistedPeers`, `engine.go:2688`) and becomes authoritative; `--raftGroupPeers` remains a bootstrap seed, not a desired-membership flag. So: |
There was a problem hiding this comment.
Persist the original bootstrap seed for restart checks
This restart contract requires rejecting an edited --raftGroupPeers before any live membership change, while accepting a different persisted set after AddVoter/RemoveServer. In the current implementation the peers file only stores Index and Peers (internal/raftengine/etcd/peer_metadata.go:31-34), and startup replaces the configured peers with persisted peers before validation (internal/raftengine/etcd/factory.go:43-47, internal/raftengine/etcd/engine.go:3298-3306), so it does not record the original seed or whether the persisted set is still the bootstrap-era configuration. If PR-B follows this plan as written, a restart with an operator-edited seed and a restart after a legitimate membership change are indistinguishable unless the design adds new metadata or an explicit WAL/conf-change scan, so one of the promised safety properties will fail.
Useful? React with 👍 / 👎.
|
@codex review |
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 221bfc7674
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| ### 3.2 Bootstrap semantics | ||
|
|
||
| The wiring change is small and local: instead of one process-wide `bootstrapServers` threaded into every group, **resolve a per-group `[]raftengine.Server` and pass each group its own list**. Concretely, `buildShardGroups` / `buildRuntimeForGroup` change from a single `bootstrapServers []raftengine.Server` parameter (`multiraft_runtime.go:234`, `main.go:777`) to a static `map[uint64][]raftengine.Server`, built once from the parsed `--raftGroupPeers`. Everything downstream — the factory's `len(peers) > 1` transport gate (`factory.go:50`), `Open`'s peer normalize/validate/persist (`engine.go:620-643`), the marker dir, the per-group listener — already operates per group and needs no change. |
There was a problem hiding this comment.
Seed admin discovery from the group peers
For --raftGroupPeers deployments that enable the Admin gRPC service, limiting PR-B’s wiring to buildShardGroups/buildRuntimeForGroup leaves admin discovery with no peer list: --raftBootstrapMembers is mutually exclusive, so the existing bootstrapServers argument passed to setupAdminService stays nil, and adminMembersFromBootstrap returns nil instead of advertising the other nodes (main.go:1133, 1247-1254, 1315-1320). The design should include deriving Admin members from a canonical per-group peer list, e.g. the default/lowest group, otherwise GetClusterOverview/admin fan-out regresses even though the Raft groups form correctly.
Useful? React with 👍 / 👎.
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b45366060f
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| 1. Every group ID in `--raftGroupPeers` must appear in `--raftGroups`, and (v1 homogeneous goal) **every** group in `--raftGroups` must appear in `--raftGroupPeers` when the flag is non-empty. A group with no peer list would silently fall back to single-member — a foot-gun we reject. | ||
| 2. Each group's member list must **include the local node**: a `member` whose `raftID == --raftId` must be present, and its `host:port` must equal that group's `--raftGroups` local address (`groupSpec.address`). This is the per-group generalization of the existing single-group check `ErrBootstrapMembersLocalAddrMismatch` (`main.go:760-765`). | ||
| 3. No duplicate `raftID` within a group (mirrors `parseRaftBootstrapMembers`'s `duplicate id` check, `shard_config.go:373-375`). | ||
| 4. v1 homogeneity check: the set of `raftID`s must be **identical across all groups** (every node votes in every group). Violations are rejected with a clear error pointing at the first divergent group. (Relaxing this is OQ-4.) |
There was a problem hiding this comment.
Require at least two voters per group
Because the only cross-group cardinality rule here is that the raftID sets are identical, a --raftGroupPeers value with one raftID per group, such as 1=n1@...;2=n1@..., satisfies all listed validation rules. PR-B would then pass a one-peer list to the factory, and internal/raftengine/etcd/factory.go only creates the inter-node transport when len(peers) > 1, recreating the single-voter/no-transfer topology this proposal is meant to eliminate. Add an explicit per-group distinct-voter count check and unit case so the new flag cannot silently accept non-multi-node groups.
Useful? React with 👍 / 👎.
|
@codex review |
|
Codex Review: Didn't find any major issues. What shall we delve into next? Reviewed commit: ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
Summary
Doc-only PR. Proposes multi-node multi-group bootstrap — the prerequisite for true distributed multi-Raft scaling. Today a real multi-node deployment can only run single-group:
resolveBootstrapServersrejects--raftBootstrapMemberswhenlen(groups) != 1(main.go:744-748),groupSpeccarries one address per group (shard_config.go:14-17), andbuildRuntimeForGroupthreads the same (nil-in-multi-group)bootstrapServersinto every group (multiraft_runtime.go:234-254), so each group bootstraps single-member and the factory never even builds a transport (factory.go:49-52). This is exactly the PR0 / OQ-9 "option (a)" gap the leader-balance design (#953) identified.New file:
docs/design/2026_06_12_proposed_multinode_multigroup_bootstrap.md(Status: Proposed / Author: bootjp / Date: 2026-06-12). Every current-state claim is cited at file:line, verified againstmain.Key design points
--raftGroupPeers "1=n1@host1:5051,n2@host2:5051,...;2=..."(per-groupraftID@host:portvoter lists), lifting thelen(groups)==1guard.--raftGroups/--raftBootstrapMembers/--raftBootstrapand thecmd/server/demo.godemo are unchanged; mutually exclusive with--raftBootstrapMembers; full validation rules (local-node-present, addr-match, homogeneity).cmd/server/demo.go:204-219already uses for single-group), NOT a designated proposer. Idempotent restart and divergent-config fail-fast come free from the existing persisted-peers +errClusterMismatch+raft-enginemarker machinery (already per-group).5005{1..6}convention). No transport change — only feeding per-group lists in.Open questions
OQ-1 flag spelling (companion flag vs. extend
--raftGroups); OQ-2--raftBootstraprequired vs. implied; OQ-3 per-group carrier (map vs. func); OQ-4 heterogeneous membership (hard error vs. warning in v1); OQ-5 learner-start interop; OQ-6 single multiplexed raft listener for very high M.Test plan
Doc-only — no code, no tests run. The doc itself proposes the test strategy: unit (flag-parse table tests, per-group resolution, restart idempotency), integration (in-process 3-node × 2-group harness asserting multi-voter
Configuration()+ inter-nodeTransferLeadershipToServer), and a later Jepsen milestone (true multi-node multi-group runner).Summary by CodeRabbit