docs(design): propose hotspot split M2 — migration plane by bootjp · Pull Request #945 · bootjp/elastickv

bootjp · 2026-06-10T16:10:28Z

Summary

Propose Hotspot Shard Split Milestone 2 — migration plane as a new design doc under docs/design/.
Scope: cross-group split with BACKFILL / FENCE / DELTA_COPY / CUTOVER / CLEANUP phases, SplitJob persistence, MVCC range export/import, and ErrRouteWriteFenced defense in depth.
M3 (auto-detection) and M4 (Jepsen hardening) remain in the parent partial doc and are explicitly non-goals here.
Per CLAUDE.md design-doc-first workflow, this PR is doc-only; implementation lands in M2-PR1..PR7 after the proposal is accepted.

Key design points

SplitJob durability under !dist|job|<id> so a default-group leader flap resumes mid-migration without operator action.
Composed-1 alignment at CUTOVER: reuses the catalog-version bump that PR docs(design): rename Composed-1 docs *_proposed_* → *_partial_* (M5a shipped) #927 / fix(kv): Composed-1 verifyOwnerFromSnapshot must routeKey-normalize before OwnerOf (issue #930) #932 already trust. No new gate work.
Idempotent ImportRangeVersions via persisted import_cursor — a duplicate gRPC retry is a no-op.
Internal-key coverage via routeKey() equivalence check on every exported row, so !txn|, !lst|, !ddb|, !redis|, !sqs|, !s3| families all land in the right shard.
Same-group split path stays byte-for-byte M1: target_group_id == 0 or equal to source short-circuits through the existing fast path.

Open questions for reviewers

7 questions called out at §12 — notably: grace-period sizing, abandon-after-cutover policy, backfill-vs-prepare semantics, and MVCCVersion.key_family strict-enum vs numeric.

Test plan

Doc-only PR — no test runs required.
Reviewer sign-off on §12 open questions.
On acceptance, M2-PR1 starts the implementation chain; this doc renames *_proposed_* → *_partial_* once PR1 lands.

Summary by CodeRabbit

Documentation
- Added a proposed design doc for cross-group hotspot shard migration: phased migration lifecycle (planned → backfill → fence → delta copy → cutover → cleanup → done/failed), staged shadow keyspace promotion, resumable MVCC export/import with chunked cursors and idempotent resumes, cutover read/write fencing and monotone timestamp handling, background staged-to-live promotion, coordinator-driven job management with recovery modes, comprehensive test plan (unit/integration/Jepsen), and rolling-upgrade readiness gating.

Parent partial doc 2026_02_18_partial_hotspot_shard_split.md tracks M1 as shipped; M2-M4 are open. This proposal scopes M2 — the migration plane that moves a split child to a different Raft group. Scope: - SplitJob durable record under !dist|job|<id> with phase/cursor/ts fields so a leader flap resumes mid-flight. - BACKFILL / FENCE / DELTA_COPY / CUTOVER / CLEANUP state machine driven by a single migrator goroutine on the default-group leader, with per-phase recovery semantics enumerated in the matrix. - proto/distribution.proto extension: target_group_id on SplitRangeRequest; GetSplitJob / ListSplitJobs / AbandonSplitJob. - proto/internal.proto: ExportRangeVersions (server-streaming) + ImportRangeVersions (idempotent via persisted import_cursor). - store/MVCCStore.ExportVersions / ImportVersions with internal-key family dispatch and routeKey() equivalence assertion (covers !txn|, !lst|, !ddb|, !redis|, !sqs|, !s3|). - ErrRouteWriteFenced at both coordinator and FSM (defense in depth on the gRPC mesh + Raft apply path); reuses PR #927 / #932 Composed-1 alignment at CUTOVER. - 8-PR landing plan with five-lens self-review per PR. Non-goals (M3+): hotspot detection, auto-split scheduler, online merge, per-tenant policy. 7 open questions called out at the end for reviewer input — notably grace-period sizing, abandon-after-cutover policy, and backfill-vs-prepare-window semantics. Per the design-doc-first workflow in CLAUDE.md, this PR is the doc only; M2-PR1..PR7 follow once the proposal is accepted.

coderabbitai · 2026-06-10T16:10:49Z

Important

Review skipped

Review was skipped as selected files did not have any reviewable changes.

💤 Files selected but had no reviewable changes (1)

docs/design/2026_06_11_proposed_hotspot_split_milestone2_migration.md

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 273ee018-fc22-45ce-821f-3cd32cafa72b

📥 Commits

Reviewing files that changed from the base of the PR and between c029b3c and c0a678f.

📒 Files selected for processing (1)

docs/design/2026_06_11_proposed_hotspot_split_milestone2_migration.md

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

📝 Walkthrough

Walkthrough

Design doc adding a Milestone‑2 migration plane: durable SplitJob state machine and phases, cross-group route/staging semantics, streaming MVCC Export/Import contracts and RPCs, CUTOVER/promote semantics, coordinator/FSM fencing and read-fence, migrator runtime, recovery matrix, test plan, and rollout gating.

Changes

Milestone 2 Migration Design

Layer / File(s)	Summary
Design context and goals `docs/design/2026_06_11_proposed_hotspot_split_milestone2_migration.md` (lines 1–41)	Document header, background, M2 goals and explicit non-goals (resumability, atomic cross-group relocation, fenced-write retryable behavior).
SplitJob data model and state machine `docs/design/2026_06_11_proposed_hotspot_split_milestone2_migration.md` (lines 42–85)	Durable `SplitJob` persisted in default raft group with reserved catalog keys, phase enum, pinned HLC/cursor/progress fields, and job retention/GC semantics.
Route state transitions & staging `docs/design/2026_06_11_proposed_hotspot_split_milestone2_migration.md` (lines 86–202)	Same-group vs cross-group transitions, catalog non-overlap invariant, staged shadow keyspace, FENCE→DELTA_COPY safety (fence ACK/fence_ts), crash-safety, and CLEANUP GC behavior.
Wire/RPC extensions & job introspection `docs/design/2026_06_11_proposed_hotspot_split_milestone2_migration.md` (lines 203–313)	SplitRange RPC extensions (`target_group_id`, job-return semantics), job RPCs (Get/List/Abandon), and streaming `ExportRangeVersions` / `ImportRangeVersions` RPCs.
MVCC export/import mechanics & key coverage `docs/design/2026_06_11_proposed_hotspot_split_milestone2_migration.md` (lines 314–475)	`KeyFilter` contract, snapshot-consistent `ExportVersions`, cursor-based resumption, intent exclusion, idempotent `ImportVersions` keyed by (job_id,cursor), target HLC advancement rules, internal-key bracket planning, per-bracket cursors, and scan-budget pacing.
CUTOVER, staged visibility & promotion `docs/design/2026_06_11_proposed_hotspot_split_milestone2_migration.md` (lines 476–522)	CUTOVER as a single atomic catalog write enabling staged-to-live MVCC merge semantics, leader-local bounded `PromoteStaged` proposals with `promote_cursor`, CLEANUP/DONE gating, and edge-case handling for hot keys and concurrent writes.
Coordinator/FSM fencing and read-fence `docs/design/2026_06_11_proposed_hotspot_split_milestone2_migration.md` (lines 523–671)	FENCE write rejection (`ErrRouteWriteFenced`), apply-time `verifyRouteNotFenced` pre-gate, post-CUTOVER `read_route_version` read-fence, authoritative ownership checks (heartbeat watermark, equal-stale fix), and scan-interval ownership gate aligned to Composed-1.
Recovery matrix, tests, and self-review `docs/design/2026_06_11_proposed_hotspot_split_milestone2_migration.md` (lines 672–721)	Idempotency/resumability recovery matrix, unit/integration/Jepsen/property testing plan, and five-lens self-review for M2 risks.
Implementation phasing & rolling-upgrade `docs/design/2026_06_11_proposed_hotspot_split_milestone2_migration.md` (lines 722–751)	Phased PR touchpoints, `cap_migration_v2` capability gating, readiness checks at entry RPCs, and downgrade behavior (pause mid-phase).
Open questions & lifecycle guidance `docs/design/2026_06_11_proposed_hotspot_split_milestone2_migration.md` (lines 752–779)	Open questions, acceptance criteria, lifecycle naming and `git mv` guidance for document renames.

Sequence Diagram(s)

sequenceDiagram
  participant Coordinator
  participant SourceStore
  participant TargetStore
  participant Catalog

  Coordinator->>SourceStore: ExportRangeVersions(job_id, start_cursor, KeyFilter)
  SourceStore->>Coordinator: Stream(versions[], cursor)
  Coordinator->>TargetStore: ImportRangeVersions(job_id, cursor, KeyFilter)
  TargetStore->>TargetStore: persist staged MVCC pages (idempotent by job_id+cursor)
  TargetStore->>Catalog: stage shadow keyspace (not bound in catalog until CUTOVER)
  Coordinator->>Catalog: CUTOVER (atomic catalog rebind, commit route_version)
  Catalog->>Coordinator: ack CUTOVER (route_version increment)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

bootjp/elastickv#905: Jepsen workload and CLI for SplitRange testing referenced by this design.

Poem

A rabbit reads the migration plan with care,
Jobs and cursors hop through MVCC air,
Fences stand firm while chunks drift slow,
Cutover flips the catalog’s glow,
Promoters nudge staged keys to go—🐰

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'docs(design): propose hotspot split M2 — migration plane' clearly and specifically summarizes the main change: proposing a design document for Hotspot Shard Split Milestone 2 focusing on the migration plane.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch docs/hotspot-split-m2-proposal

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

gemini-code-assist

Code Review

This pull request proposes a design document for Milestone 2 of the Hotspot Shard Split feature, focusing on the migration plane. It outlines the data model additions (such as SplitJob), state machine transitions, RPC changes, MVCC range export/import, and coordinator/FSM integration. The review feedback highlights three key areas for improvement: defining a retention or garbage collection policy for completed SplitJob records to prevent unbounded storage growth, decoupling the low-level store package from high-level routing logic by using an interceptor or delegate pattern instead of directly referencing routeKey, and addressing rolling upgrade risks where older M1 nodes might not enforce the WriteFenced state by detailing mitigation strategies such as capability gates.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-10T16:14:34Z

+### 3.1 SplitJob
+
+A split job is durable state for the migration. Persist under reserved keys in the default group, alongside the route catalog.
+
+Key layout:
+
+- `!dist|meta|next_job_id` — uint64, allocator for `job_id`.
+- `!dist|job|<job_id>` — encoded `SplitJob` record.


The proposed design persists SplitJob records under !dist|job|<job_id> but does not specify a limit or a garbage collection/retention policy for completed (DONE, FAILED, or abandoned) jobs. Over time, as the cluster performs many hotspot splits, this collection can grow unboundedly. This poses a risk of unbounded storage growth and potential OOM issues when listing or scanning jobs (e.g., via ListSplitJobs).

To prevent unbounded memory and storage growth, please define a fixed bound or a retention/cleanup policy to automatically garbage collect or archive historical job records after a certain period or count.

References

To prevent unbounded memory growth and potential OOM issues, apply a fixed bound to collections that can grow from external requests, such as pending configuration changes. Reject new requests when the bound is reached.

gemini-code-assist · 2026-06-10T16:14:34Z

+### 6.3 Internal-key coverage
+
+§9 of the parent doc enumerates the key families. The exporter scans the raw storage range `[start, end)` filtered by **routeKey** equivalence to the moving range, so internal keys (`!txn|...`, `!lst|...`, redis hash/list/etc.) land in the same shard as their logical owner. The importer dispatches by `key_family` into the matching store helper to preserve any per-family invariant (e.g., list head pointer updates).
+
+A simple safety check: every exported key must map back to the moving range under `routeKey()`. The exporter asserts this on every row before adding to the chunk; an assertion failure aborts the job and surfaces in `last_error`. This protects against a future internal-key family being added that the exporter forgot to teach.


The proposed design has the low-level store package (store/mvcc_store.go) directly referencing and calling routeKey() to filter keys and map them back to the moving range. This introduces tight coupling between the low-level generic storage package and high-level routing/domain logic.

Per the general rules, to prevent tight coupling between low-level generic packages and high-level domain packages, you should use an interceptor or delegate pattern with interfaces (or function values) defined in the low-level package. For example, ExportVersions can accept an optional, nil-safe key-filter function or delegate interface (e.g., func(key []byte) bool) injected from the high-level package, keeping the store package generic and decoupled.

References

To prevent tight coupling between low-level generic packages and high-level domain packages, use an interceptor or delegate pattern with interfaces defined in the low-level package. Ensure the interceptor is optional (nil-safe) so that builds or environments with the high-level feature disabled can run without modification.

gemini-code-assist · 2026-06-10T16:14:34Z

+| Lens | M2-specific risk |
+|---|---|
+| Data loss | CUTOVER atomicity; ImportRangeVersions idempotency; the BACKFILL/DELTA_COPY commit-ts windows have no gap. |
+| Concurrency | Migrator goroutine vs leadership election; FENCE visibility timing on followers; AbandonJob race with phase advance. |
+| Performance | Chunk size vs gRPC frame limit; pacing default vs prod throughput; cursor lookup cost per import. |
+| Data consistency | Composed-1 alignment at CUTOVER; reader at `≤ fence_ts` on source vs reader at `≥ catalog_version+1` on target; internal-key route mapping. |
+| Test coverage | Property tests over chunk-then-restart sequences; FSM unit for FENCE; Jepsen split + nemesis. |


The design introduces a significant operational risk during rolling upgrades. If a migration job is initiated while some nodes in the cluster are still running the older M1 version, those M1 nodes will not enforce the WriteFenced state (since the FENCE rejection logic is introduced in M2-PR5). This could allow writes to be accepted on the source group during the FENCE/DELTA_COPY phases, leading to silent write loss or inconsistency.

Per the general rules, when a design document identifies a significant operational risk like rolling upgrade incompatibility, it must detail potential mitigation strategies. Please update the design (e.g., in the self-review or as a new section) to include layered mitigations, such as using admin RPC capability gates to prevent starting any SplitRange migration jobs until all nodes have upgraded and advertised M2 capability.

References

When a design document identifies a significant operational risk, such as the inability to perform rolling upgrades, it must also detail potential mitigation strategies, like implementing a temporary "bridge" or "proxy" mode.

Already addressed in §11.1 ("Rolling-upgrade compatibility") landed in commit 93de9b2: cluster-readiness gate at the SplitRange entry RPC keyed on the cap_migration_v2 capability bit in the existing distribution heartbeat, with M2-PR5 as the bit-advertise commit so the gate semantics are safe across rollout. Could you confirm this resolves the finding, or point at what's still uncovered?

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3aca90c833

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-11T11:21:49Z

+  ├─ BACKFILL:   right child created with state=MigratingTarget on target group;
+  │              source remains Active; writes still flow to source.
+  ├─ FENCE:      source.right_child → WriteFenced; writes on the moving range


Avoid publishing overlapping migration routes

When a cross-group job enters BACKFILL, this plan creates the target right child while leaving the full source route Active, so the catalog snapshot contains overlapping ranges for the moving interval. The current engine does not ignore non-Active states when applying snapshots: routesFromCatalog converts every descriptor and validateRouteOrder rejects any prev.End > curr.Start with ErrEngineSnapshotRouteOverlap (distribution/engine.go:468-523). As written, watchers would fail to apply the first migration catalog snapshot, so the migration cannot progress through the proposed visible route states unless the design keeps the target out of the route snapshot or changes the overlap invariant.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-11T11:21:49Z

+
+- `kv/fsm.go` already routes through `verifyComposed1` at apply. Extend `verifyOwnerFromSnapshot` to also reject when the resolved route's state is `WriteFenced` — even if a coordinator with a stale engine forwarded a write, the FSM closes the door. The error wraps `ErrRouteWriteFenced` so M4 retry logic differentiates from `ErrComposed1Violation`.
+
+Reads are **not** fenced: `ShardStore.GetAt` continues to serve from source until CUTOVER. Snapshot reads at any `commit_ts ≤ fence_ts` remain consistent because the source's MVCC history is unchanged through DELTA_COPY.


Add a read fence for post-cutover stale routes

If a coordinator or source-group leader has not applied the CUTOVER catalog yet, leaving reads unfenced lets it keep routing the moved key to the source while new writes are already accepted on the target. The current read path resolves the group from the local engine and then serves/proxies the read without carrying an observed catalog version (kv/shard_store.go:38-53, kv/shard_store.go:1471-1477), while the Composed-1 guard cited here only checks mutations in verifyOwnerFromSnapshot (kv/fsm.go:753-773). During the CLEANUP grace window this can return the pre-cutover source value after a successful post-cutover target write, so the design needs a read-side route-version fence/redirect or another mechanism that stops stale source reads for the moved range.

Useful? React with 👍 / 👎.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (2)

docs/design/2026_06_11_proposed_hotspot_split_milestone2_migration.md (2)
108-108: ⚖️ Poor tradeoff

Confirm operational acceptance of one-way CUTOVER.

After CUTOVER completes, the migration cannot be rolled back without implementing a full reverse migration (explicitly out of scope for M2). This means if post-CUTOVER issues arise (e.g., target group instability, data corruption discovered later), the only recovery path is forward through CLEANUP or manual intervention.

Is this operational trade-off acceptable for M2, or should §12 Open Questions include a specific item about post-CUTOVER failure scenarios and operator runbooks?
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/design/2026_06_11_proposed_hotspot_split_milestone2_migration.md` at
line 108, The document currently states that CUTOVER is one-way and rollback
requires a reverse migration; update the design doc to explicitly capture
operational acceptance by either (A) adding a clear sentence in the CUTOVER
paragraph that the team accepts the one-way behavior for M2 and lists recovery
paths (CLEANUP, manual intervention, reverse migration out-of-scope), or (B) add
a new item under §12 Open Questions titled "Post-CUTOVER failure scenarios and
operator runbooks" that requests acceptance criteria, required runbooks, and
escalation steps for issues like target-group instability or latent data
corruption; reference the terms CUTOVER, PLANNED → BACKFILL, AbandonJob, and
CLEANUP so reviewers can locate and verify the change.
258-261: ⚖️ Poor tradeoff

Verify sufficiency of routeKey() assertion failure handling.

Lines 260-261 state that if an exported key fails the routeKey() equivalence check, "an assertion failure aborts the job and surfaces in last_error." Per §4 line 106, this puts the job into FAILED state requiring manual operator intervention.

This safety check protects against a future internal-key family being added without updating the exporter. However, silent job abortion may not provide sufficient visibility for this class of bug.

Consider:

Is last_error surfaced prominently enough (e.g., via metrics/alerts) to catch this during development/testing?

Should dev/test environments panic on this assertion to fail-fast rather than require operator log inspection?

Should this be elevated in §12 Open Questions as a code-maintenance concern?

The defensive check is excellent; the question is whether the failure mode provides adequate feedback.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/design/2026_06_11_proposed_hotspot_split_milestone2_migration.md` around
lines 258 - 261, The routeKey() assertion currently aborts the exporter job and
writes to last_error leaving the job in FAILED state; enhance visibility and
fail-fast behavior by (1) emitting a dedicated metric/counter and structured
error log whenever routeKey() mismatches occur (include routeKey, key_family,
and chunk id) so monitoring/alerts can catch it, (2) adding a configurable
dev-only mode on the exporter (e.g., Exporter devFailFast flag or ENV) that
panics/throws immediately instead of only setting last_error to ensure fast
feedback in testing, and (3) document this operational choice in the design
notes (add to §12 Open Questions) so maintainers know to update alerting/ops
runbooks; refer to routeKey(), exporter, last_error and the FAILED job state
when making these changes.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/design/2026_06_11_proposed_hotspot_split_milestone2_migration.md`:
- Around line 76-86: Add a language specifier to the fenced code block that
contains the migration state diagram (the block starting with "source.Active"
and the listed states BACKFILL/FENCE/DELTA_COPY/CUTOVER/CLEANUP) to satisfy
markdownlint MD040; update the opening fence from ``` to something like ```text
(or ```nohighlight) so the block is explicitly marked as plain text.

---

Nitpick comments:
In `@docs/design/2026_06_11_proposed_hotspot_split_milestone2_migration.md`:
- Line 108: The document currently states that CUTOVER is one-way and rollback
requires a reverse migration; update the design doc to explicitly capture
operational acceptance by either (A) adding a clear sentence in the CUTOVER
paragraph that the team accepts the one-way behavior for M2 and lists recovery
paths (CLEANUP, manual intervention, reverse migration out-of-scope), or (B) add
a new item under §12 Open Questions titled "Post-CUTOVER failure scenarios and
operator runbooks" that requests acceptance criteria, required runbooks, and
escalation steps for issues like target-group instability or latent data
corruption; reference the terms CUTOVER, PLANNED → BACKFILL, AbandonJob, and
CLEANUP so reviewers can locate and verify the change.
- Around line 258-261: The routeKey() assertion currently aborts the exporter
job and writes to last_error leaving the job in FAILED state; enhance visibility
and fail-fast behavior by (1) emitting a dedicated metric/counter and structured
error log whenever routeKey() mismatches occur (include routeKey, key_family,
and chunk id) so monitoring/alerts can catch it, (2) adding a configurable
dev-only mode on the exporter (e.g., Exporter devFailFast flag or ENV) that
panics/throws immediately instead of only setting last_error to ensure fast
feedback in testing, and (3) document this operational choice in the design
notes (add to §12 Open Questions) so maintainers know to update alerting/ops
runbooks; refer to routeKey(), exporter, last_error and the FAILED job state
when making these changes.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 70d218ee-479f-4859-864a-0bdc51656929

📥 Commits

Reviewing files that changed from the base of the PR and between f598868 and 3aca90c.

📒 Files selected for processing (1)

docs/design/2026_06_11_proposed_hotspot_split_milestone2_migration.md

… minor) Codex P1 (line 80) — overlapping migration routes Rewrote §3.2 state machine: catalog snapshots are kept strictly non-overlapping throughout the job. BACKFILL/DELTA_COPY land in a shadow keyspace (!dist|migstage|<job_id>|<raw_key>) on the target group's MVCC store, outside the catalog. The catalog gets a NEW WriteFenced route at FENCE-time (split out from source) and the group-rebind happens atomically at CUTOVER. routesFromCatalog + validateRouteOrder (distribution/engine.go:496-527) stay green. RouteState.MigratingSource/MigratingTarget are reserved but unused in M2. Codex P1 (line 277) — post-CUTOVER read fence Added §7.2: read-path now carries read_route_version (parallel to ObservedRouteVersion on writes). New verifyReadOwner gate next to verifyOwnerFromSnapshot returns ErrRouteMoved{new_route_version, new_group_id} when a stale coordinator routes a read to source after CUTOVER. Coordinator does WatcherRefresh + single retry. CLEANUP grace window readFenceGrace=30s bounds the corner case. The write-side Composed-1 gate is unchanged so PR #932's Jepsen trust holds. Gemini medium (line 51) — SplitJob unbounded growth Added §3.1.1 + §4.2: live cap maxLiveJobs=64 enforced by SplitRange; on terminal transition jobs move to !dist|jobhist|<terminal_at_ms_be>|<id> with retention by maxJobHistory=1k OR historyTTL=7d (whichever tighter); once-per- minute sweep on default-group leader. ListSplitJobs supports since_terminal_at_ms / phase filters and cursors. Gemini medium (line 260) — store/routeKey tight coupling Added store.KeyFilter delegate; ExportVersions takes the filter as a nil-safe argument so the store package never imports kv or knows about routing keys. routeKey() closure built in kv/migrator_filter.go and passed in by the migrator. §6.3 rewritten with the explicit decoupling seam. Gemini medium (line 346) — rolling-upgrade incompatibility Added §11.1: node_capability_bitfield in the existing distribution heartbeat; cap_migration_v2 bit gates cross-group SplitRange at the entry RPC with ErrClusterNotReadyForMigration; in-flight job-pause path on observed downgrade; bit is advertised only in M2-PR5 (the commit that lands the FENCE rejection enforcement) so the gate is safe across the rollout. Same-group split (M1) remains unconditionally available. CodeRabbit minor (line 86) — MD040 fenced-code-block language All fenced blocks now carry a language tag (text/proto/go). Replaced the bare ``` with ```text on the state-machine diagram and consolidated to a single language across the diagram and evolution table. PR-plan note (§11): the rolling-upgrade gate dictates a hard ordering between M2-PR4 (state machine) and M2-PR5 (FENCE enforcement) — the capability bit only goes live in PR5.

bootjp · 2026-06-11T16:36:28Z

@claude review — addressed all 6 findings from the gemini / codex / coderabbit reviews in commit 93de9b2:

Codex P1 (line 80, overlapping routes) — §3.2 rewritten so catalog snapshots stay strictly non-overlapping. BACKFILL/DELTA_COPY land in a target-side shadow keyspace (!dist|migstage|<job_id>|<raw_key>); the catalog only gains a NEW WriteFenced route at FENCE-time (split out from source) and the group-rebind is a single atomic CUTOVER write. routesFromCatalog + validateRouteOrder stay green throughout.

Codex P1 (line 277, post-CUTOVER read fence) — added §7.2 with read_route_version on pb.ReadRequest, a parallel verifyReadOwner gate returning ErrRouteMoved{new_route_version, new_group_id}, coordinator-side WatcherRefresh+retry, and a 30s grace window before CLEANUP. Write-side Composed-1 gate untouched (PR #932 Jepsen trust preserved).

Gemini medium (line 51, SplitJob unbounded growth) — added §3.1.1 / §4.2: maxLiveJobs=64 live cap on SplitRange; terminal jobs move to !dist|jobhist|<terminal_at_ms_be>|<id> with maxJobHistory=1k OR historyTTL=7d; once-per-minute sweep.

Gemini medium (line 260, store/routeKey coupling) — store.KeyFilter delegate; ExportVersions takes a nil-safe filter so the store package never imports kv. routeKey() closure built in kv/migrator_filter.go and passed in by the migrator.

Gemini medium (line 346, rolling upgrade) — added §11.1: node_capability_bitfield heartbeat extension; cap_migration_v2 gate at the entry RPC (ErrClusterNotReadyForMigration); bit goes live only at M2-PR5 (the FENCE enforcement commit) so the gate is safe across rollout. Same-group split (M1) unconditionally available.

CodeRabbit minor (line 86, MD040) — all fenced blocks now carry a language tag.

claude · 2026-06-11T16:36:52Z

Claude finished @bootjp's task in 3m 26s —— View job

Review of commit `93de9b2` — addressing 6 prior findings

Todo list

Load tools
Read the updated design document
Verify Codex P1: overlapping routes fix (§3.2 shadow keyspace)
Verify Codex P1: post-CUTOVER read fence fix (§7.2)
Verify Gemini medium: SplitJob unbounded growth fix (§3.1.1 / §4.2)
Verify Gemini medium: store/routeKey coupling fix (KeyFilter delegate)
Verify Gemini medium: rolling upgrade fix (§11.1 capability gate)
Verify CodeRabbit minor: MD040 fenced code blocks
Post comprehensive review

Verified: all 6 prior findings are addressed

Finding	Where	Verdict
Codex P1 — overlapping routes	§3.2, shadow keyspace + state diagram	✅ Resolved. Catalog never holds overlapping routes. BACKFILL/DELTA_COPY write only to `!dist
Codex P1 — post-CUTOVER read fence	§7.2 (5 sub-sections)	✅ Resolved. `read_route_version` stamped on every read, `verifyReadOwner` gate on source FSM returns `ErrRouteMoved`, coordinator does `WatcherRefresh` + one retry, 30 s grace window before CLEANUP. Parallel path from `verifyOwnerFromSnapshot` preserves Jepsen trust.
Gemini medium — SplitJob unbounded growth	§3.1.1, §4.2	✅ Resolved. `maxLiveJobs=64` live cap; terminal jobs moved to `!dist
Gemini medium — store/routeKey coupling	§6.1, §6.3	✅ Resolved. `store.KeyFilter` nil-safe delegate; `ExportVersions` takes it as a parameter; `kv/migrator_filter.go` builds the closure where `routeKey()` lives; `store` never imports `kv`.
Gemini medium — rolling upgrade	§11.1	✅ Resolved. `node_capability_bitfield` heartbeat extension; `cap_migration_v2` bit; cluster-readiness gate at SplitRange entry; mid-migration downgrade pauses the job; bit goes live at M2-PR5 (when FENCE enforcement lands); test plan specified.
CodeRabbit minor — MD040	lines 89, 179, 293, 317, 335	✅ Resolved. All fenced blocks now carry `text`, `proto`, or `go` language tags.

New observations on the updated design

P1 — CUTOVER two-step ordering undefined (data-visibility gap)

§3.2 (line 139) says:

CUTOVER promotes the staged data to live keys via a single bulk-rename FSM apply

But the CUTOVER step as described consists of two independent Raft proposals on different groups:

Catalog CAS on default group — removes source.WriteFenced(right), inserts target.Active(right), bumps catalog_version.
Shadow→live rename on target group — rewrites !dist|migstage|<job_id>|<raw_key> → live keys.

The design does not specify the ordering of these two. If (1) commits before (2), reads routed to the target during the rename window return empty results: the catalog says the target owns the range, but the target's MVCC only has shadow keys. The read fence in §7.2 only redirects stale source reads; it doesn't compensate for the target not yet having live data.

The safe ordering is: (2) commits on the target first, then (1) executes on the default group — by the time routing switches, all data is live on the target. But:

This ordering must be explicit in the design.
The migrator must confirm target-group apply before proposing the catalog write.
If a leader flap occurs between (2) committed and (1) proposed, the new migrator resumes and must re-check whether the target rename already landed (idempotent) before re-proposing the catalog write.

Suggest: add a §7.5 (or a sub-bullet in §4's CUTOVER row) that specifies: "Target bulk-rename commits first (raft.Status().Applied ≥ rename_index observed by migrator), then default-group catalog CAS. Migrator confirms applied index on the target leader before proposing the catalog write; on resume it re-checks whether rename already landed via a sentinel key !dist|migstage|<job_id>|committed."

Medium — Grace window inconsistency (§7.2.4 vs §4 vs Open Question 3)

Three places cite the CLEANUP grace default and they disagree:

Location	Value
§4 table, CLEANUP row (line 158)	60 s
§7.2.4 (line 389)	30 s
Open Question 3 (line 504)	60 s

Pick one and make them consistent. Whichever wins should also answer Open Question 3 (or at least note the reasoning). For reference: hlcPhysicalWindowMs = 3 s, lease TTL is typically ~3 s too; 30 s (10×) is already very conservative. If 30 s wins, Open Question 3 can be closed.

Fix this →

Medium — "Wait until applied" in §7.2.2 under-specified

the FSM treats read_route_version > local as a "wait until applied" (bounded by lease TTL) before answering

How is this wait implemented? Options include an applied-index condition variable, a short retry loop with a timeout, or a channel signalled by the FSM apply goroutine. The choice affects:

Read latency during CUTOVER propagation (commit CPU, GC pause, etc.)
What happens if the applied index never catches up within the lease TTL (timeout → ErrRouteMoved? ErrTimeout?)

A one-sentence note (e.g., "implemented as a condition-variable wait on the FSM's appliedIndex, timedout by min(remaining_lease_TTL, 200ms)") would make this reviewable.

Minor — `!dist|` prefix on target-group import cursor (§6.2)

Line 321 stores the import cursor as !dist|job|<id>|import_cursor in the target group's MVCC, but the !dist| prefix is associated throughout the codebase with the default group's catalog. A reader scanning the target group's MVCC (or a future operator tool) could confuse these for catalog entries. Consider a distinct prefix scoped to the target — e.g., !migstage|cursor|<id> — to avoid confusion and to prevent any future catalog scanner (distribution/catalog.go) from trying to decode them.

Summary

All 6 findings from the prior reviews are properly addressed and the design is substantially improved. The main item to resolve before implementation starts is the CUTOVER two-step ordering (target rename must commit before the catalog write; the design should specify this ordering and how it survives a leader flap). The grace-window inconsistency is a quick cleanup. The other two observations are low-risk and can be addressed in the design doc without blocking M2-PR1.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 93de9b2d2e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-11T16:38:18Z

+// ImportVersions writes the given versions idempotently. A second call
+// with the same (job_id, cursor) MUST be a no-op past the recorded
+// cursor — the importer records last-applied cursor under
+// !dist|job|<id>|import_cursor so a network retry doesn't double-write.
+ImportVersions(ctx context.Context, jobID uint64, versions []MVCCVersion,
+    cursor []byte) ([]byte, error)


Advance target HLC when importing versions

When the source range contains commit timestamps higher than the target group's current LastCommitTS/HLC, this import contract only writes versions and an idempotency cursor, but does not require the target to observe the max imported commit_ts. Since the existing FSM observes commit timestamps only after normal pb.Request applies, a migration import path can leave the target clock below the staged data; after CUTOVER the first target write can receive a lower timestamp than an imported older value, causing MVCC time travel where later reads resurrect the pre-cutover value. The import apply needs to atomically advance metaLastCommitTS and the target HLC (or otherwise fence new writes) to at least the max imported timestamp before the route becomes active.

Useful? React with 👍 / 👎.

Fixed in commit 17b0719 — added §6.2.1 "Target HLC advancement": ImportVersions now atomically advances both metaLastCommitTS and the HLC ceiling (hlc.SetPhysicalCeiling) to max(batch.commit_ts) under the same FSM apply lock as the MVCC writes. SplitJob carries a monotone max_imported_ts witness, and CUTOVER is gated on a final SetPhysicalCeiling(max_imported_ts) proposal on the TARGET group before the catalog CAS — so post-CUTOVER Next() is provably greater than every imported ts. Reuses the existing Composed-1 monotone-ceiling primitive (PR #927) — no new fence.

chatgpt-codex-connector · 2026-06-11T16:38:18Z

+| `snapshot_ts` | uint64 | HLC commit-ts pinned at BACKFILL entry. |
+| `fence_ts` | uint64 | HLC commit-ts pinned at FENCE → DELTA_COPY entry. |
+| `cutover_version` | uint64 | Catalog version produced by the CUTOVER write; populated at CUTOVER, used by §7.2 read-fence and §4.2 GC. |
+| `cursor` | bytes | Next-key cursor for BACKFILL / DELTA_COPY resume. |


Make the export cursor address MVCC versions

The job cursor is defined as a next-key cursor, but an MVCC export can contain many committed versions for the same raw key. If a hot key's history exceeds chunkBytes, resuming by key alone either has to keep that entire key history in one unbounded chunk or risks skipping/duplicating the remaining versions after a restart; the API should make the cursor opaque over the encoded MVCC position (raw key plus commit timestamp) or explicitly require chunks never split a key's full version history.

Useful? React with 👍 / 👎.

Fixed in commit 17b0719 — added §6.1.1 "Cursor granularity": the cursor now encodes (raw_key, commit_ts) and the exporter iterates (raw_key ASC, commit_ts DESC) with exclusive-resume semantics. Chunks may contain only versions of one raw key without an all-or-nothing constraint on the history — the next chunk continues at the next-older commit_ts of the same key. The SplitJob row was updated accordingly. Opaque encoding so the store-side type stays []byte and the exporter alone owns the codec (preserves the §6.1 store/kv decoupling seam).

…x P1/P2) Codex P1 on PR #945 (line 323) — Advance target HLC when importing Source range can contain commit_ts > target group's current HLC / metaLastCommitTS. Import-only versions-and-cursor writes leave the target clock behind; after CUTOVER the first target write at Next() can collide with imported ts and resurrect pre-cutover values on a snapshot read. Fix: ImportVersions atomically advances metaLastCommitTS AND calls hlc.SetPhysicalCeiling(batchMax) under the same FSM apply lock as the MVCC writes; SplitJob carries a monotone max_imported_ts witness; CUTOVER is gated on a final SetPhysicalCeiling(max_imported_ts) proposal on the TARGET group before the catalog CAS. Reuses the existing Composed-1 / HLC monotone-ceiling primitive — no new fence. New unit + property tests pin (a) per-batch advance, (b) CUTOVER gap-closing proposal, (c) strict Next() > max_imported_ts post-CUTOVER. Codex P2 on PR #945 (line 66) — Cursor must address MVCC versions A "next-key" cursor cannot resume safely on a hot key whose version history exceeds chunkBytes. New §6.1.1 spec: opaque cursor encodes (raw_key, commit_ts), iterated in (raw_key ASC, commit_ts DESC) order with exclusive-resume semantics. Chunks may contain only versions of one raw key; the next chunk continues with the next-older version of the same key — no all-or-nothing constraint on a key's history. Idempotent ack at the importer dedups any one resume overlap. SplitJob row (§3.1) gains `max_imported_ts uint64` and the cursor field description is updated. §3.2 state-machine diagram now shows the SetPhysicalCeiling proposal as a precondition of CUTOVER. No caller-visible wire change beyond the existing M2 proto deltas.

bootjp · 2026-06-11T17:05:50Z

@claude review — round 2:

codex P1 line 323 (HLC target advancement) — fixed in commit 17b0719 / §6.2.1. ImportVersions atomically advances metaLastCommitTS + hlc.SetPhysicalCeiling under the FSM apply lock; SplitJob carries max_imported_ts; CUTOVER is gated on a final SetPhysicalCeiling proposal on the target group before the catalog CAS. Reuses the Composed-1 monotone-ceiling primitive (PR docs(design): rename Composed-1 docs *_proposed_* → *_partial_* (M5a shipped) #927).
codex P2 line 66 (cursor MVCC granularity) — fixed in commit 17b0719 / §6.1.1. Cursor encodes (raw_key, commit_ts) with exclusive-resume semantics, so a hot key's history can span chunks safely.
gemini medium line 467 (rolling upgrade) — already covered by §11.1 (commit 93de9b2). Asked the reviewer to confirm via inline reply.

claude · 2026-06-11T17:06:08Z

Claude finished @bootjp's task in 2m 51s —— View job

Review — Round 2

Todo list

Load tools
Read the updated design document (commit 17b0719)
Verify Codex P1 (§6.2.1): HLC target advancement during ImportVersions
Verify Codex P2 (§6.1.1): Cursor MVCC granularity (raw_key, commit_ts)
Confirm Gemini medium (§11.1): rolling upgrade capability gate
Check status of P1 (CUTOVER two-step ordering) from prior review
Check status of medium findings from prior review (grace window inconsistency, wait-until-applied, !dist| prefix)
Post comprehensive review

Round-2 items — all three verified

Finding	Section	Verdict
Codex P1 — HLC target advancement	§6.2.1 (lines 357–371)	✅ Fully resolved. Per-batch `SetPhysicalCeiling` under FSM apply lock, `max_imported_ts` witness in SplitJob, CUTOVER precondition proposal on target, and the post-CUTOVER `Next()` invariant are all explicit. Reuses the Composed-1 monotone-ceiling primitive cleanly. Test plan in `kv/migrator_hlc_advance_test.go` covers the three necessary assertions.
Codex P2 — cursor MVCC granularity	§6.1.1 (lines 325–337)	✅ Fully resolved. Cursor encodes `(raw_key, commit_ts)` as opaque bytes, iterates `(raw_key ASC, commit_ts DESC)`, exclusive-resume semantics. Hot-key safe, resume-safe, encoding is forward-extensible. `store` type stays `[]byte`; decoupling seam from §6.1 preserved.
Gemini medium — rolling upgrade	§11.1 (lines 531–543)	✅ Fully resolved. `node_capability_bitfield` + `cap_migration_v2` gate, cluster-readiness check at SplitRange entry, mid-migration downgrade pause, PR1–PR5 ordering constraint to prevent premature bit advertisement, test plan in `distribution/capability_test.go`. Confirmed.

Remaining items from prior review

P1 — CUTOVER bulk-rename ordering still implicit (unresolved)

The prior review flagged that the CUTOVER step on two different Raft groups must be sequenced: target bulk-rename first, then catalog CAS — otherwise there is a window where the catalog says the target owns the range but the target's MVCC only has shadow keys, making reads to the target return empty results.

The current doc has moved closer but the ordering is still not stated explicitly in the §3.2 CUTOVER description. §6.2.1 (line 367) says SetPhysicalCeiling is "the last target-side write before CUTOVER" — this implies the bulk-rename happened earlier, but the §3.2 diagram only lists:

DELTA_COPY → CUTOVER:
  1. SetPhysicalCeiling on TARGET (last target-side write)
  2. Catalog CAS on default group

It does not list (0) bulk-rename migstage → live keys on TARGET. The two-line note on line 147 ("CUTOVER promotes the staged data to live keys via a single bulk-rename FSM apply") is the only mention of this step, and it's detached from the CUTOVER phase breakdown.

The safe and intended ordering is:

Bulk-rename (!dist|migstage|<job_id>|* → live keys) on target group — must complete and be observed by migrator before step 2
SetPhysicalCeiling(max_imported_ts) on target group
Catalog CAS on default group

Steps 2 and 3 are specified, step 1 is not. Without step 1 appearing in the CUTOVER sequence, an implementer working from §3.2 alone could issue the catalog CAS before the rename is applied, creating the data-availability gap. The leader-flap resume case (new migrator checking whether rename already landed before re-proposing the catalog write) also needs to be specced.

Suggested fix: add a (0) bullet to the §3.2 CUTOVER diagram, and add one sentence in §4's CUTOVER crash-safety row covering how the migrator confirms the target rename was applied before issuing the catalog CAS on resume.

Fix this →

Medium — Grace window inconsistency (still unresolved)

Three locations cite the CLEANUP grace default and they still disagree after commit 17b0719:

Location	Line	Value
§4 table, CLEANUP row	166	60 s
§7.2.4	434	30 s
§12 Open Question 3	549	60 s

§7.2.4 was updated with a rationale ("comfortably exceed both lease TTL and watcher tick × 2") but the value there (30 s) still disagrees with the other two (60 s). Pick one and update all three; if 30 s wins, close Open Question 3 with the chosen rationale.

Fix this →

Minor — `!dist|` prefix on target-group import cursor (still unresolved)

Line 347 stores the import cursor as !dist|job|<id>|import_cursor in the target group's MVCC store, but !dist| is the convention for default-group catalog keys throughout the codebase. A future operator tool scanning the default-group catalog, or distribution/catalog.go's key parsers, could misparse these. Suggest a target-scoped prefix such as !migstage|cursor|<id> to avoid confusion. Low risk, quick rename.

Fix this →

Medium — "Wait until applied" still a one-liner

§7.2.2 (line 426) still reads only: "the FSM treats read_route_version > local as a 'wait until applied' (bounded by lease TTL) before answering". The implementation mechanism (condition variable on applied index? retry loop? channel?) and the timeout semantics (what error surfaces if the timeout fires — ErrRouteMoved or ErrTimeout?) remain unspecified. Given how much implementation detail the rest of the doc provides, a single sentence here under-specs the read-path hot case. A one-line elaboration (e.g., "condition-variable wait on the FSM's appliedIndex, timed out by min(remaining_lease_TTL, 200ms); on timeout, re-evaluate ownership and return ErrRouteMoved if still not owned") would make this reviewable.

Summary

All three round-2 items are properly verified. Two of the four items from the prior review remain open: the CUTOVER bulk-rename ordering (P1 — needs an explicit step in the §3.2 diagram + resume semantics in §4) and the grace window inconsistency (medium — three locations cite different defaults). The !dist| prefix collision and the wait-until-applied spec are low-risk and can be addressed in the same pass.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 17b0719055

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-11T17:06:56Z

+
+FSM-side defense in depth:
+
+- `kv/fsm.go` already routes through `verifyComposed1` at apply. Extend `verifyOwnerFromSnapshot` to also reject when the resolved route's state is `WriteFenced` — even if a coordinator with a stale engine forwarded a write, the FSM closes the door. The error wraps `ErrRouteWriteFenced` so M4 retry logic differentiates from `ErrComposed1Violation`.


Reject fenced writes even when the route version is unpinned

This FSM-side fence is described as an extension of verifyOwnerFromSnapshot, but today verifyComposed1 returns before that helper when ObservedRouteVersion == 0. There are still intentional zero-version write paths (for example read/write txns with ReadKeys, caller-supplied StartTS, or resolver-claimed keys in kv/sharded_coordinator.go:694-754), so a stale coordinator in those contexts can forward a write to the source during FENCE and the proposed FSM defense will not run. The fence check needs to be on an unconditional apply path, or the design needs to require all mutation flows to carry a nonzero observed route version before cross-group split is enabled.

Useful? React with 👍 / 👎.

Fixed in commit c16e1c4. §7.1 now introduces verifyRouteNotFenced as a SEPARATE, unconditional pre-gate ahead of verifyComposed1 — not an extension of verifyOwnerFromSnapshot. It scans mutations with routeKey() against the current catalog snapshot and rejects ErrRouteWriteFenced for any key whose route is WriteFenced, with the same Phase_ABORT bypass and NO dependency on ObservedRouteVersion. So the zero-version write flows in kv/sharded_coordinator.go:694-754 (ReadKeys, caller-supplied StartTS, resolver-claimed keys) are covered — the gate runs on every applied write request unconditionally. The two gates stay independent so a future change to one cannot regress the other.

chatgpt-codex-connector · 2026-06-11T17:06:56Z

+[ target.Active(right)          ]
+```
+
+Key invariant: **catalog snapshots never contain overlapping ranges**. The target shadow keyspace lives outside the catalog (BACKFILL / DELTA_COPY land into private `!dist|migstage|<job_id>|<raw_key>` keys on the target group's MVCC store; CUTOVER promotes the staged data to live keys via a single bulk-rename FSM apply). This keeps `routesFromCatalog` + `validateRouteOrder` (`distribution/engine.go:496-527`) green throughout the migration — addresses codex P1 on overlapping routes (PR #945 review). RouteState `MigratingSource` / `MigratingTarget` are reserved but unused in M2; they remain available for a future merge / multi-stage migration design.


Avoid an unbounded CUTOVER bulk rename

Staging every imported row under !dist|migstage|... and then promoting the whole range via a single FSM apply makes CUTOVER proportional to the entire migrated range. For any split larger than the Raft proposal/apply budget, this defeats the chunked BACKFILL/DELTA_COPY design and can block or fail exactly when the catalog must switch atomically. The design should avoid a per-key bulk move at cutover, or make the staged-to-live promotion incremental while preserving the visibility fence.

Useful? React with 👍 / 👎.

Fixed in commit c16e1c4. New §6.4 "Incremental staged-to-live promotion": CUTOVER is O(1) — just a catalog version bump + route swap + set staged_visibility_active on the new target route, NO per-key work. Reads use a staged-fallback through the per-route bit so imported data is visible the instant CUTOVER lands. A leader-local promoter on the target group walks !dist|migstage|<job_id>|* in cursor-resumable chunks (same chunkBytes / pacing as BACKFILL) and proposes one bounded PromoteStaged Raft entry per batch — never a single oversized apply. CLEANUP → DONE waits for the promoter to drain AND the grace window AND the staged_visibility_active flag clear. SplitJob picks up promote_cursor + promotion_completed_ts for resumability and audit.

…945) Codex P1 on PR #945 (line 407) — FENCE must run on unconditional path verifyComposed1 returns before verifyOwnerFromSnapshot when ObservedRouteVersion == 0, and several intentional zero-version write paths exist today (read/write txns with ReadKeys, caller- supplied StartTS, resolver-claimed keys in kv/sharded_coordinator.go:694-754). The naïve "extend verifyOwnerFromSnapshot to reject WriteFenced" placement would silently skip those paths — a stale coordinator could forward a write to the source during FENCE without the FSM defense ever running. Fix in §7.1: introduce verifyRouteNotFenced as a separate, unconditional pre-gate ahead of verifyComposed1. It scans mutations with routeKey() against the current catalog snapshot and rejects ErrRouteWriteFenced for any key whose route is WriteFenced, with the same Phase_ABORT bypass and no dependency on ObservedRouteVersion. Two independent gates, write side unchanged Composed-1 trust preserved. Codex P2 on PR #945 (line 147) — CUTOVER bulk rename is unbounded Staging every imported row under !dist|migstage|... and promoting via a single FSM apply makes CUTOVER O(migrated range) — defeats chunked BACKFILL/DELTA_COPY exactly when the catalog must atomically switch. New §6.4 ships incremental promotion: 1. CUTOVER apply is O(1): catalog version bump + route swap + set staged_visibility_active on the new target route. 2. Reads use a staged-fallback through the per-route bit so imported data is visible the instant CUTOVER lands. 3. A leader-local promoter on the target group walks the staged prefix in cursor-resumable chunks, copying each version to its live position with one bounded PromoteStaged apply per batch. 4. CLEANUP → DONE precondition extends to "promoter empty AND grace window elapsed AND staged_visibility_active cleared", so route is indistinguishable from any Active route once promotion finishes. SplitJob row picks up promote_cursor (opaque) and promotion_completed_ts (diagnostic). §3.2 state-machine diagram updated: CUTOVER no longer says "bulk-rename"; the staged-to- live work is explicitly background and incremental. Both fixes preserve the §6.2.1 HLC ceiling advance and §7.2 read fence guarantees — incremental promotion is a strict superset of the chunked BACKFILL/DELTA_COPY semantics applied at a different phase.

bootjp · 2026-06-11T17:29:59Z

@claude review — round 3 (2 codex findings on commit 17b0719):

codex P1 line 407 (FENCE on unconditional path) — fixed in commit c16e1c4. §7.1 introduces verifyRouteNotFenced as a SEPARATE pre-gate ahead of verifyComposed1, with NO dependency on ObservedRouteVersion — so the zero-version write paths in kv/sharded_coordinator.go:694-754 (ReadKeys / caller-supplied StartTS / resolver-claimed keys) are covered.
codex P2 line 147 (unbounded CUTOVER bulk rename) — fixed in commit c16e1c4. New §6.4 "Incremental staged-to-live promotion": CUTOVER is O(1) — catalog version bump + route swap + flag set, no per-key work. Reads use a staged-fallback through the per-route staged_visibility_active bit; a leader-local promoter on the target group drains the staged prefix in cursor-resumable chunks. CLEANUP waits for the promoter to drain. SplitJob picks up promote_cursor + promotion_completed_ts.

claude · 2026-06-11T17:30:20Z

Claude finished @bootjp's task in 3m 56s —— View job

Review — Round 3

Todo list

Load tools
Read updated design document (commit c16e1c4)
Verify Codex P1 (§7.1): verifyRouteNotFenced unconditional pre-gate
Verify Codex P2 (§6.4): Incremental staged-to-live promotion / O(1) CUTOVER
Check remaining items from round 2 (CUTOVER ordering, grace window, !dist| prefix, wait-until-applied)
Post comprehensive review

Round-3 items — both verified

Finding	Section	Verdict
Codex P1 — FENCE on unconditional path	§7.1 (lines 440–445)	✅ Resolved. `verifyRouteNotFenced` is explicitly a separate pre-gate that runs before `verifyComposed1` with no dependency on `ObservedRouteVersion`. The apply path is `verifyRouteNotFenced → verifyComposed1 → existing handlers`. The zero-version write flows (`ReadKeys`, caller-supplied `StartTS`, resolver-claimed keys in `kv/sharded_coordinator.go:694-754`) are covered. The gates are kept independent so a future change to one cannot regress the other. The M3 path to eventually collapse the `ObservedRouteVersion == 0` short-circuit is also called out.
Codex P2 — unbounded CUTOVER bulk rename	§6.4 (lines 396–427)	✅ Resolved. CUTOVER is explicitly `O(1)` — a single catalog write that sets `staged_visibility_active = true` and swaps routes. No iteration over migrated data. Background incremental promoter on the target group drains the staged prefix in cursor-resumable chunks using the same chunking/pacing/cursor codec as BACKFILL. CLEANUP → DONE requires promoter drain + `staged_visibility_active` cleared. The staged-fallback read path provides correct visibility during promotion.

Prior round-2 P1 — CUTOVER two-step ordering: resolved by §6.4

The round-2 P1 flagged that "target bulk-rename must commit before the catalog CAS" to avoid a data-availability gap. §6.4 eliminates the bulk-rename entirely: CUTOVER sets staged_visibility_active on the target route, and reads immediately use the staged-fallback path for any key not yet promoted. There is no window where the catalog says the target owns the range but data is unavailable — the staged prefix is there from the moment staged_visibility_active is set.

The CUTOVER sequence (§3.2, lines 120–134) now correctly specifies: SetPhysicalCeiling(max_imported_ts) on target → catalog CAS. This ordering is resilient to leader flap: max_imported_ts is persisted in SplitJob, so a new migrator can re-issue the SetPhysicalCeiling (idempotent — monotone) and then the catalog CAS. Round-2 P1 is closed. ✅

New P1 — §6.4 step 3 "no-op when a later live write has shadowed it" can lose MVCC history

§6.4 step 3 (line 411) says the PromoteStaged apply is a "no-op when a later live write has already shadowed it". This is incorrect for MVCC correctness.

Scenario:

Staged key K at commit_ts = 3 (imported from source).
Live write K at commit_ts = 10 (post-CUTOVER client write, guaranteed > max_imported_ts by §6.2.1).
PromoteStaged runs and sees the live version at ts=10 → skips copy → deletes staged version at ts=3.
staged_visibility_active is later cleared (CLEANUP → DONE).
Client reads K at read_ts = 5: staged-fallback is gone; live namespace only has ts=10 (newer than read_ts); MVCC snapshot read at ts=5 returns not found — the ts=3 value has been permanently dropped.

The "shadow" optimization is safe only for a last-write-wins (non-MVCC) store. In an MVCC store with snapshot read semantics, every version of a key must be preserved across its valid range. The correct guard is idempotency: "no-op if the staged version has already been copied to the live namespace (checked by the presence of a live version at the same commit_ts)."

The fix is two parts:

Replace the "shadowed" clause: "no-op if a live version at the same commit_ts already exists (idempotent resume after a leader flap)."
Clarify that PromoteStaged must always copy the staged version to live, regardless of whether a newer live version exists — the MVCC store naturally holds both versions without conflict.

The MVCC compaction GC (kv/compactor.go) will eventually prune versions older than its configured retention horizon; that is the correct place for history pruning, not the promoter.

Fix this →

Remaining from round 2 (three items still open)

Medium — Grace window inconsistency (unresolved, 3rd flag)

Three locations still disagree after commit c16e1c4:

Location	Line	Value
§4 table, CLEANUP row	168	60 s
§7.2.4	472	30 s (with rationale: "comfortably exceed both lease TTL and watcher tick × 2")
§12 Open Question 3	587	60 s

§7.2.4 has a clear rationale for 30 s. If 30 s is the chosen value: update §4 line 168 to 30 s and close Open Question 3 with the chosen value + rationale. If 60 s is the chosen value: update §7.2.4 to 60 s and reconcile the rationale. Either way, one pass to make all three locations consistent and close OQ3.

Fix this →

Minor — `!dist|` prefix on target-group import cursor (unresolved, 3rd flag)

Line 348 stores the import cursor as !dist|job|<id>|import_cursor in the target group's MVCC store, but !dist| is the default-group catalog namespace throughout the codebase. A future operator tool scanning the default-group catalog or distribution/catalog.go's key parsers would misparse this as a catalog entry. Suggest renaming to a target-scoped prefix such as !migstage|cursor|<id> — requires a one-line change in the spec.

Fix this →

Medium — "Wait until applied" still a one-liner (unresolved, 3rd flag)

§7.2.2 (line 464) still reads: "the FSM treats read_route_version > local as a 'wait until applied' (bounded by lease TTL) before answering." The implementation mechanism (condition variable on applied index? channel from FSM apply goroutine? short retry loop?) and the timeout error type (is it ErrRouteMoved or a distinct ErrTimeout?) remain unspecified. Given the level of implementation detail elsewhere in the doc, one sentence would close this: e.g., "condition-variable wait on the FSM's appliedIndex signalled by the apply goroutine, timedout by min(remaining_lease_TTL, 200ms); on timeout, re-evaluate ownership — if still not owned, return ErrRouteMoved."

Summary

Both round-3 items are verified and properly addressed. The §6.4 incremental promotion design is a clean solution to the codex P2 bulk-rename concern and the round-2 CUTOVER ordering concern — both are closed as side effects of the redesign.

One new P1 emerges from §6.4: the PromoteStaged "no-op when shadowed" clause loses MVCC history versions that older snapshot reads still need — requires a targeted wording fix (change "shadowed" to "already at same commit_ts in live"). Three items from round 2 remain open (grace window, !dist| prefix, wait-until-applied). All are quick editorial fixes that can go in one commit.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c16e1c4335

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-11T17:31:37Z

+// ExportVersions iterates committed versions in the half-open range
+// [start, end) whose commit_ts is in (minCommitTS, maxCommitTS],


Scan internal families outside raw route bounds

For splits whose logical range is a normal user-key interval such as [foo, bar), this raw [start, end) MVCC iterator never visits keys like !redis|str|foo, !lst|...foo, or !txn|lock|foo, because those prefixes sort outside the user-key range before the routeKey() filter can accept them. That means the proposed migration can cut over without copying internal state that routes to the moved key, causing data loss or broken adapter invariants; the export needs to scan the relevant internal-family prefixes or use a route-key-indexed traversal rather than relying on a raw-range scan plus filter.

Useful? React with 👍 / 👎.

Fixed in commit 231e9a5. New §6.3.1 "Internal-family brackets": migration now runs MULTIPLE parallel exports per phase, one per relevant raw-key family (user-key, !txn|lock|, !txn|cmt|, !txn|rb|, !txn|meta|, !txn|int|, !lst|meta|, !lst|itm|, !redis|, !ddb|item|, !ddb|gsi|, !sqs|, !s3|), each driven by the same KeyFilter so routeKey() remains the single source of truth. PlanExportBrackets() returns the bracket list; per-bracket opaque cursors + done flags on the SplitJob allow independent resumability. Bracket count is small (~12 today) and fan-out parameter caps parallelism. A future adapter MUST add its bracket at the same PR that teaches routeKey() its family.

chatgpt-codex-connector · 2026-06-11T17:31:37Z

+
+#### 7.2.2 Source FSM rejects reads after CUTOVER
+
+When a source-group FSM applies a lease read whose `read_route_version` is **strictly less than** the FSM's last-applied catalog version, and the requested key's `routeKey()` maps to a route the FSM no longer owns, the FSM returns `ErrRouteMoved{new_route_version, new_group_id}` instead of the value. The error is mapped to a retryable status (gRPC `codes.FailedPrecondition` + detail) and the coordinator retries after a `RouteEngine` refresh.


Reject reads when both caller and source are stale

This condition only rejects when the request version is older than the source FSM's local catalog version, so it misses the common post-CUTOVER case where both the coordinator and the source group have only applied the pre-CUTOVER/FENCE version. In that window another coordinator that has refreshed can already write the key on the target, while this stale coordinator routes a read to the stale source and the source serves the old value because read_route_version == local_version; the read fence needs an authoritative cutover/ownership check that also catches equal-stale source reads during the grace window.

Useful? React with 👍 / 👎.

Fixed in commit 231e9a5. §6.4 step 2 rewritten: the staged area now participates in MVCC MERGE — newest commit_ts wins. The read takes the newest live version with commit_ts ≤ read_ts AND the newest staged version with commit_ts ≤ read_ts and returns whichever has the greater commit_ts. Four read_ts cases enumerated to prove correctness (live_write.ts > max_imported_ts > staged.ts; merge correctly surfaces live for read_ts ≥ live.ts and staged for read_ts ∈ (max_imported_ts, live.ts)). Per-family adapter invariants preserved via the key_family dispatch.

chatgpt-codex-connector · 2026-06-11T17:31:37Z

+
+   No iteration over the migrated key range, no bulk rename, no per-key proposals. The Raft proposal carrying this apply is `O(1)` in the migrated data size — bounded by a handful of catalog descriptors and the SplitJob record.
+
+2. **Staged keyspace stays visible through a per-route flag.** After CUTOVER the target FSM's read path checks the route's `staged_visibility_active` bit. When set, a read for key `K` in the moved range first looks up `!dist|migstage|<job_id>|<K>` and falls back to the live key only if the staged entry is absent. (Writes always land in the live keyspace — the CUTOVER bump already routed writers to the target group via the catalog, and §6.2.1's HLC ceiling advance guarantees their `commit_ts` is strictly greater than every imported `commit_ts`, so a live write shadowing a staged version is the *correct* MVCC visibility.) This gives reads access to the imported data the instant CUTOVER lands, with **no promotion work blocking CUTOVER itself**.


Merge staged and live MVCC versions on reads

Looking in the staged keyspace first and falling back to live only when staged is absent makes post-CUTOVER reads return stale imported versions after any live write to the same key, until promotion deletes the staged entry. For a read timestamp after the live write's commit timestamp, both staged and live versions exist and the newest live version must win, so the read path needs to merge/compare the staged and live MVCC candidates by commit timestamp instead of treating staged presence as authoritative.

Useful? React with 👍 / 👎.

Fixed in commit 231e9a5. §7.2.2 replaces the strict-less-than version check with an authoritative ownership check. When source has applied CUTOVER → unconditional ErrRouteMoved. When source has NOT applied CUTOVER but the heartbeat-piggybacked max_cutover_version_seen indicates a possible stale read → run GetRouteOwnership(routeKey(K), max_cutover_version_seen) against the default-group leader (cached one second). New §7.2.2a walks through the exact 7-step equal-stale scenario (t=0..6) showing the watermark + ownership-query path catches the case the original rule missed.

Codex P1 round-4 on PR #945 — three fundamental design issues in the §6.3 / §6.4 / §7.2 designs landed in c16e1c4: P1 line 310 — Scan internal families outside raw route bounds A single raw [start, end) iterator + routeKey filter misses every internal-family raw key (!txn|lock|foo, !lst|meta|foo, !redis|str|foo, !ddb|item|<tbl>|foo, !sqs|..., !s3|...) because those prefixes sort outside the user-key band — the filter rejects nothing because the iterator never visits those bytes. Without the fix, CUTOVER would leave intent locks, list metadata, and adapter-private state behind on the source. Fix in new §6.3.1 "Internal-family brackets": migration runs multiple parallel exports per phase, one per relevant raw-key family, driven by the same KeyFilter (routeKey() remains the single source of truth). The migrator computes the bracket list from the moving range's logical scope via PlanExportBrackets; each bracket gets its own opaque cursor and done flag on the SplitJob. Bracket count is small (~12 today) and bounded; the fan-out parameter caps parallelism. A future adapter MUST add its bracket at the same PR that teaches routeKey() its family — both are required for end-to-end coverage. P1 line 408 — Merge staged and live MVCC versions on reads The §6.4 step 2 "staged-first, fall back to live when staged is absent" form returned stale imported versions after any live write to the same key. After the §6.2.1 HLC advance, live writes have commit_ts > max_imported_ts; for read_ts ≥ live_write.ts the live version is the most-recent committed, but the fallback rule would still return the older staged version because the staged entry exists until promotion. Fix in §6.4 step 2: the staged area participates in MVCC merge — newest commit_ts wins. The read takes the newest live version with commit_ts ≤ read_ts AND the newest staged version with commit_ts ≤ read_ts and returns whichever has the greater commit_ts. This is the standard MVCC visibility rule applied to a second column-family-like source, not stage-shadows-live. Four read-ts cases enumerated to prove correctness; per-family adapter invariants preserved via the key_family dispatch (§6.3.1). P1 line 459 — Reject reads when both caller and source are stale The strict-less-than 'read_route_version < source_local_version' rule misses the equal-stale window: both coordinator and source pinned at the pre-CUTOVER version, while a different refreshed coordinator has already written to the target. Source's local snapshot says it still owns the key, so the read fence passes and serves the stale value. Fix in §7.2.2: replace version comparison with an authoritative ownership check scoped on whichever snapshot exists. When source has applied CUTOVER → unconditional ErrRouteMoved. When source has NOT applied CUTOVER but the heartbeat-piggybacked max_cutover_version_seen indicates a possible stale read → run GetRouteOwnership(routeKey(K), max_cutover_version_seen) against the default-group leader, cache one second. New §7.2.2a walks through the exact 7-step equal-stale scenario to demonstrate the fix catches the case the original rule missed. All three are fundamentally correctness fixes — the original wording would have produced silent data loss (310), MVCC visibility regression (408), or stale reads (459) in production. The fixes preserve the catalog-stays-non-overlapping invariant (§3.2), the §6.2.1 HLC advance, and the §7.2 read-fence scaffolding; only the under-spec'd details are tightened.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 231e9a5729

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-11T17:48:31Z

+
+1. **Per-batch advance.** On every apply of an `ImportVersions` batch, the target FSM computes `batchMax = max(versions[i].commit_ts)` and **atomically** (under the same FSM apply lock that mutates the MVCC store):
+   - sets `metaLastCommitTS = max(metaLastCommitTS, batchMax)`, and
+   - calls `hlc.SetPhysicalCeiling(batchMax)` (the same Raft-agreed primitive used by the M1 HLC renewal proposer, `kv/coordinator.go:644-669`). `SetPhysicalCeiling` is already monotone — a lower argument is a no-op — so duplicate / out-of-order batches are safe.


Convert HLC timestamps before setting the physical ceiling

This passes batchMax/max_imported_ts as a full HLC commit timestamp into hlc.SetPhysicalCeiling, but the current HLC stores that field as Unix milliseconds and Next() later shifts it as the physical part (kv/hlc.go). For any imported timestamp with logical bits set or even a normal ms<<16 encoding, this advances the target ceiling to an artificial far-future millisecond, causing subsequent writes to jump to roughly commit_ts<<16 and weakening the lease-ceiling fence until wall time catches up. The import path should either Observe the full HLC timestamp or pass only its physical millisecond component to SetPhysicalCeiling.

Useful? React with 👍 / 👎.

Fixed in commit 44e6125. §6.2.1 step 1 now passes hlcPhysicalMs(batchMax) = int64(batchMax >> 16) to SetPhysicalCeiling (the same shift HLC.Next() uses to extract the physical half, kv/hlc.go:128-145) AND calls hlc.Observe(batchMax) for the local clock's last-issued value. Step 3 (CUTOVER precondition) uses the same conversion. Both functions are monotone so duplicates/out-of-order are safe.

chatgpt-codex-connector · 2026-06-11T17:48:31Z

+
+The migrator runs the brackets **in parallel up to `--migrationExportFanout` (default 4)** within a single phase (BACKFILL or DELTA_COPY): each bracket gets its own opaque cursor (the §6.1.1 codec is extended to key on `(family, raw_key, commit_ts)`), so a bracket can be paused/resumed independently and the §9 resumability matrix applies per bracket. The migrator records per-bracket `cursor` and `done` flags on the SplitJob; the phase advances only when **every bracket** reports `done = true`. Cost is at most `fanout`× the gRPC frame budget but is bounded by the total bracket count (~12 today), not by data volume.
+
+A bracket whose family doesn't actually intersect the moving range (e.g. an SQS migration on a Redis-only deployment) returns its first chunk empty and `done=true` immediately — no wasted scan work. The `keys per bracket` distribution is a §7.3 / §11.x metric so operators can spot a runaway family scan.


Bound sparse internal-family export scans

This claims an internal-family bracket with no matching keys returns immediately, but the proposed brackets are family-wide and the only narrowing happens after iteration through routeKey(). In a cluster with a large SQS/Redis/etc. prefix outside the moving range, an export for a range with no accepted keys must scan that whole raw prefix to prove done=true because rejected rows do not count toward chunkBytes, making a single chunk unbounded in CPU/I/O and defeating the migration pacing guarantees. The exporter needs a route-key-indexed traversal or a separate scan budget that advances the cursor over rejected rows.

Useful? React with 👍 / 👎.

Fixed in commit 44e6125. §6.3.1 ships two layers: (1) Scan-budget pacing — each ExportVersions call carries a separate maxScannedBytes (default 4× chunkBytes) that counts both accepted and rejected rows; iterator returns at the budget with cursor advanced over the rejected tail. (2) Per-family EncodeBracketStart hook tightens family-wide bounds to the moving-slice band where the family layout allows it. Layer (1) alone is sufficient for correctness; (2) is the optimization per adapter PR. New metrics: keys per bracket + rejected_rows per bracket.

bootjp · 2026-06-11T17:50:14Z

@claude review — round 4 (3 codex P1 on commit c16e1c4):

line 310 (internal families outside raw bounds) — fixed in 231e9a5. New §6.3.1: per-family export brackets driven by the same KeyFilter; PlanExportBrackets() returns the list; per-bracket cursors + done flags. Adapter follow-ons MUST add their bracket at the same PR that teaches routeKey() the new family.
line 408 (staged + live MVCC merge) — fixed in 231e9a5. §6.4 step 2 rewritten: staged participates in MVCC merge by commit_ts. Four read_ts cases enumerated to prove correctness.
line 459 (equal-stale read fence) — fixed in 231e9a5. §7.2.2 now uses authoritative ownership check + heartbeat-piggybacked max_cutover_version_seen watermark + GetRouteOwnership RPC (1s cache). §7.2.2a walks through the exact 7-step equal-stale scenario.

claude · 2026-06-11T17:50:33Z

Claude finished @bootjp's task in 5m 5s —— View job

Claude finished @bootjp's task in 3m — View job

Review — Round 4

Todo list

Load tools
Read updated design document (commit 231e9a5)
Verify line 310 fix (§6.3.1): internal-family export brackets
Verify line 408 fix (§6.4 step 2): staged + live MVCC merge by commit_ts
Verify line 459 fix (§7.2.2): authoritative ownership check + watermark
Check new Codex findings on commit 231e9a5 (HLC physical ceiling, sparse scan bound)
Check remaining items from round 3 (PromoteStaged shadow clause, grace window, !dist| prefix, wait-until-applied)
Post comprehensive review

Round-4 items — all three verified

Finding	Section	Verdict
Codex P1 — internal families outside raw bounds	§6.3.1 (lines 394–447)	✅ Resolved. `PlanExportBrackets()` returns one `ExportBracket` per key family (~12 total); each bracket is a family-wide raw-key band and the `KeyFilter` delegate narrows it to the moving range. Per-bracket opaque cursors + done flags on the SplitJob allow independent resumability. Phase advances only when every bracket is `done=true`. The "adapter MUST add bracket at the same PR that teaches `routeKey()` the new family" enforcement invariant is explicit. Safety assertion (every exported key maps back via `routeKey()`) still runs per-row.
Codex P1 — staged + live MVCC merge	§6.4 step 2 (lines 461–474)	✅ Resolved. Staged area participates as a second MVCC source; newest `commit_ts` wins. Four read_ts cases are enumerated and each is proven correct. The earlier "staged-first-then-live-fallback" is explicitly called out as wrong and replaced. Per-family adapter helpers follow the same merge rule.
Codex P1 — equal-stale read fence	§7.2.2 + §7.2.2a (lines 523–558)	✅ Resolved. Authoritative ownership check via heartbeat-piggybacked `max_cutover_version_seen` watermark + `GetRouteOwnership` RPC (1 s cache). §7.2.2a walks through the exact 7-step equal-stale scenario proving the watermark + ownership-query path catches what the strict-less-than form missed.

New P1 — `SetPhysicalCeiling` receives a full HLC timestamp, not Unix ms (confirmed by code)

I read kv/hlc.go to verify the codex P1 finding on commit 231e9a5 (line 367):

// SetPhysicalCeiling atomically advances the Raft-agreed physical ceiling.
// ...
func (h *HLC) SetPhysicalCeiling(ms int64) {
    for {
        prev := h.physicalCeiling.Load()
        if ms <= prev {
            return
        }
        ...
    }
}

SetPhysicalCeiling takes Unix milliseconds (int64), and physicalCeiling is stored and compared directly against time.Now().UnixMilli() inside nextLocked. The HLC commit timestamp layout is (unix_ms << 16) | logical_counter — the same shift constant the code calls hlcLogicalBits = 16.

§6.2.1 (line 367) says:

calls hlc.SetPhysicalCeiling(batchMax) … batchMax = max(versions[i].commit_ts)

batchMax is a full HLC timestamp with format ms<<16 | logical. Passing it directly to SetPhysicalCeiling sets the ceiling to (ms<<16 | logical) milliseconds, which is on the order of ms × 65536 — roughly 3.5 billion years in the future. Subsequently, nextLocked clamps nowMs = ceiling because wall time is always below this, producing:

next = (batchMax_as_ms_value << 16) | 0

which is an astronomically wrong timestamp — a total corruption of the HLC.

Fix: extract the physical component before calling SetPhysicalCeiling:

hlc.SetPhysicalCeiling(int64(batchMax >> hlcLogicalBits))

Alternatively, use the existing Observe(ts uint64) method (which already bumps last correctly for full HLC values) if the goal is just to bump the logical counter, then separately call SetPhysicalCeiling(int64(batchMax >> hlcLogicalBits)) to advance the Raft-agreed ceiling. This affects every call site in §6.2.1: the per-batch advance, the CUTOVER precondition proposal, and the max_imported_ts witness.

Fix this →

Remaining P1 from round 3 — PromoteStaged "no-op when shadowed" still drops MVCC history (unresolved)

§6.4 step 3 (line 477) still reads:

copies the staged key into its live position (no-op when a later live write has already shadowed it; the FSM applies metaLastCommitTS-respecting MVCC put)

This was flagged as wrong in round 3 and has not been corrected. The problem has two horns and both are wrong:

Horn 1 (skip copy but delete staged): PromoteStaged sees staged K at ts=3 and live K at ts=10, declares it "shadowed", skips the copy, but deletes staged K ts=3 (otherwise CLEANUP never reaches done). After deletion, the staged-fallback merge (step 2) finds nothing for K at read_ts < 10, and the live namespace also has nothing ≤ read_ts since live K ts=10 > read_ts — result: "not found" for a snapshot read at read_ts=5. Incorrect: the imported ts=3 value should be visible at ts=5.

Horn 2 (skip copy AND skip delete): PromoteStaged skips the entry entirely. The staged key K ts=3 remains forever. The CLEANUP precondition "no staged keys remain" is never satisfied — job stuck in CLEANUP forever.

Neither interpretation is MVCC-correct. The fix from round 3 stands unchanged:

Always copy the staged version to the live namespace, regardless of whether a newer live version exists. The MVCC store holds both K@ts=3 and K@ts=10 naturally at different commit timestamps.
No-op only on true idempotent resume: "no-op if a live version at the same commit_ts already exists" — the promoter already ran this batch before a leader flap.
History pruning is kv/compactor.go's job, not the promoter's.

Fix this →

New P2 — Sparse internal-family brackets scan unboundedly (from codex, commit `231e9a5`)

§6.3.1 line 445 claims:

A bracket whose family doesn't actually intersect the moving range (e.g. an SQS migration on a Redis-only deployment) returns its first chunk empty and done=true immediately — no wasted scan work.

This is over-stated. The exporter iterates the raw key range [start, end) and applies the KeyFilter to every row it encounters. Rejected rows do not count toward chunkBytes (§6.1 line 325: "rejected rows do not advance the chunk-bytes budget"). A deployment with a large SQS prefix (!sqs|...) but no SQS keys in the moving range must scan the entire !sqs|* family prefix before it can set done=true — not one row returned, but potentially millions of rows visited, unbounded in CPU and I/O, and all in a single "chunk" because no row ever increments the budget counter. This defeats the pacing guarantees §8 establishes.

The claim "returns immediately" is only true if the family prefix is empty for the cluster entirely. A simple fix: add a separate scan counter (e.g. maxScannedBytes int passed to ExportVersions alongside chunkBytes) that advances on every iterated row regardless of the filter result. When maxScannedBytes is hit, the exporter checkpoints the cursor (even though chunkBytes is still zero for that call) and returns, allowing the migrator to re-enter and continue scanning on the next tick. This gives bounded CPU/IO per ExportVersions call even over a completely non-intersecting bracket.

Fix this →

Remaining from round 2 — three items still open (4th flag)

Medium — Grace window inconsistency

Three locations still disagree:

Location	Line	Value
§4 table, CLEANUP row	168	60 s
§7.2.4	566	30 s (with rationale: "comfortably exceed both lease TTL and watcher tick × 2")
§12 Open Question 3	681	60 s

§7.2.4 has the rationale for 30 s. Pick one value, update all three locations, and close Open Question 3.

Fix this →

Minor — `!dist|` prefix on target-group import cursor

Line 349 stores the import cursor as !dist|job|<id>|import_cursor in the target group's MVCC store. !dist| is the default-group catalog namespace throughout the codebase. Future operator tools and distribution/catalog.go's key parsers would treat this as a catalog entry. Rename to a target-scoped prefix: !migstage|cursor|<id>.

Fix this →

Medium — "Wait until applied" still a one-liner

Line 544: "the FSM treats read_route_version > source_local_version as a 'wait until applied' (bounded by lease TTL) before answering" — still unspecified. The implementation mechanism (condition variable on appliedIndex signalled by the apply goroutine? short retry loop?) and the timeout error type (does the timeout surface as ErrRouteMoved or a distinct ErrTimeout?) remain unspecified. A one-sentence elaboration would close this: e.g., "condition-variable wait on the FSM's appliedIndex signalled by the apply goroutine, timedout by min(remaining_lease_TTL, 200ms); on timeout, re-evaluate ownership — if still not owned, return ErrRouteMoved."

Summary

All three round-4 items (codex P1s on commit c16e1c4) are properly verified as resolved. One new P1 emerges from the latest codex review on commit 231e9a5: SetPhysicalCeiling(batchMax) receives a full HLC timestamp instead of Unix ms, confirmed by reading kv/hlc.go — passing a full HLC value would set the ceiling ~3.5 billion years ahead and corrupt every subsequent Next() call. The correct call is SetPhysicalCeiling(int64(batchMax >> hlcLogicalBits)).

The round-3 P1 (PromoteStaged "no-op when shadowed" drops MVCC history) remains unresolved for the fourth consecutive round — the "shadowed" clause on line 477 still exists and is still wrong. Three editorial items from round 2 (grace window inconsistency, !dist| prefix, wait-until-applied spec) and one new P2 (sparse family bracket scan unbounded) round out the remaining work before implementation can start cleanly.

…dex round-5) Codex P1 round-5 on PR #945 (line 367) — HLC SetPhysicalCeiling units hlc.SetPhysicalCeiling takes int64 Unix milliseconds (kv/hlc.go:222), but the previous spec passed the full HLC encoded commit_ts (ms<<16 | logical) as the argument. With logical bits set or any normal ms<<16 encoding that would advance the ceiling by 65 536× the intended milliseconds — weakening the lease fence until wall time catches up. Fix §6.2.1 steps 1 and 3: pass hlcPhysicalMs(batchMax) = int64(batchMax >> 16) (the same shift HLC.Next() uses to extract the physical half) AND call hlc.Observe(batchMax) so the local clock's last-issued value tracks the full HLC ts. Both calls are monotone so duplicates and out-of-order batches are safe. Codex P2 round-5 on PR #945 (line 445) — sparse internal-family scans The earlier 'returns its first chunk empty and done=true immediately' claim only holds when the iterator can short-circuit; under the §6.1 contract accept rejections do NOT count against chunkBytes, so a family-wide raw range can scan an arbitrarily large prefix to prove done=true for a moving range with no intersection. Fix §6.3.1: two layers. (1) Scan-budget pacing — each ExportVersions call carries a separate maxScannedBytes (default 4× chunkBytes) that counts both accepted and rejected rows; when the budget hits the iterator returns whatever it has accepted + the next cursor (over the rejected tail position). (2) Route-key sub-prefix indexing where the family layout allows it — per-family EncodeBracketStart() hook tightens bracket bounds from family-wide to moving-slice-within-family; adapters without the hook keep the scan-budget pacing alone, which is sufficient for correctness. New metrics: keys per bracket + rejected_rows per bracket.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 44e6125daf

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-11T17:56:57Z

+        |  FENCE → DELTA_COPY              Copy (snapshot_ts, fence_ts] from
+        |                                  source to target shadow space.


Wait for source to enforce the fence before fixing fence_ts

When the catalog enters WriteFenced, the source FSM only learns that through its local route engine, which is refreshed asynchronously by the catalog watcher (distribution/watcher.go:11, distribution/watcher.go:87-98), and the proposed FSM fence checks the current local snapshot. If the migrator records fence_ts and starts copying (snapshot_ts, fence_ts] immediately, a stale source leader can still accept writes for the moved keys after fence_ts but before its watcher applies the fenced route; those versions are above the DELTA_COPY upper bound and will be left on the source at CUTOVER. The phase needs a source-side barrier/ack that the fenced route is applied, or fence_ts must be chosen after writes are actually rejected on the source.

Useful? React with 👍 / 👎.

Fixed in commit 9fa5931. New §3.2a 'Source-side fence-apply barrier': migrator advances FENCE → DELTA_COPY only after every voter of the source-group Raft membership has reported catalog_version_applied >= fence_catalog_version (via the cap_migration_v2 heartbeat or a per-node GetCatalogVersion RPC). Only after the unanimous voter-ACK does the migrator query source.LastCommitTS() and pin fence_ts. Unanimous (not quorum) is required for fence safety — any voter could become the leader later and re-serve writes for the moved range in the pre-FENCE state if its catalog hadn't applied. fence_ack_cursor is durable in the SplitJob for restart-safety. State machine diagram (§3.2) updated.

bootjp · 2026-06-11T18:00:42Z

@claude review — round 5 (1 P1 + 1 P2 on commit 231e9a5):

line 367 (HLC SetPhysicalCeiling units) — fixed in 44e6125. §6.2.1 now uses hlcPhysicalMs(ts) = int64(ts >> 16) for the physical-ms component, and additionally calls hlc.Observe(batchMax) for the full HLC ts. Closes the 65 536× far-future advance regression.
line 445 (sparse internal-family scans) — fixed in 44e6125. §6.3.1 adds scan-budget pacing (maxScannedBytes counts accepted+rejected) + per-family EncodeBracketStart hook for sub-prefix indexing where the family layout allows. Cursor advances over rejected rows.

claude · 2026-06-11T18:00:58Z

Claude finished @bootjp's task in 4m 52s —— View job

Review — Round 5

Todo list

Load tools
Read updated design document (commit 44e6125)
Verify line 367 fix (§6.2.1): HLC SetPhysicalCeiling units — hlcPhysicalMs(ts) = int64(ts >> 16) + hlc.Observe(batchMax)
Verify line 445 fix (§6.3.1): Sparse internal-family scan budget (maxScannedBytes + EncodeBracketStart hook)
Check remaining items from round 4 (PromoteStaged shadow clause P1, grace window, !dist| prefix, wait-until-applied)
Post comprehensive review

Round-5 items — both verified

Finding	Section	Verdict
Codex P1 — HLC SetPhysicalCeiling units	§6.2.1 (lines 365–376)	✅ Resolved. Confirmed against `kv/hlc.go`: `SetPhysicalCeiling(ms int64)` takes Unix milliseconds, `physicalCeiling` is compared to `time.Now().UnixMilli()` in `nextLocked`, and the physical part of a commit_ts is extracted with `>> hlcLogicalBits` (16). §6.2.1 now specifies `hlcPhysicalMs(ts) = int64(ts >> 16)` for every `SetPhysicalCeiling` call site (per-batch advance, CUTOVER precondition), and additionally calls `hlc.Observe(batchMax)` for the full HLC high-water mark. Both functions are monotone so out-of-order/duplicate batches are safe. The CUTOVER precondition and the `max_imported_ts` witness are updated consistently.
Codex P2 — Sparse internal-family scan budget	§6.3.1 (lines 448–453)	✅ Resolved. Two layers specified: (1) `maxScannedBytes` (default 4×`chunkBytes`) advances on every iterated row — accepted or rejected — and causes a partial-chunk return + cursor advance when hit, so the next call resumes past the already-scanned window. This alone bounds per-call wall-clock cost regardless of intersection. (2) Per-family `EncodeBracketStart(routeStart/End)` hook tightens bracket bounds to the moving slice within the family prefix where the family layout allows, reducing the scan budget from "full family" to "moving slice within family" as a per-adapter optimization. `keys_per_bracket` and `rejected_rows_per_bracket` metrics named so operators can spot a runaway family scan and prioritize encoder hooks.

New P1 — `fence_ts` set before source FSM enforces the fence (from codex review of `44e6125`)

The latest codex review (line 115) identifies a data-loss window that is not addressed in the current doc. Looking at the §3.2 state diagram and §7.1 together:

Migrator issues the FENCE catalog write on the default group. Catalog commits at version v+1.
Migrator reads the committed write, records fence_ts, and advances to DELTA_COPY (copying (snapshot_ts, fence_ts] from source to target).
The source-group leader's catalog watcher has not yet applied version v+1. The watcher is an async polling loop; its tick interval is on the order of seconds (referenced in §7.2.4 as "watcher tick × 2" in the 30 s grace rationale).
In the window [T_catalog_commit, T_catalog_commit + watcher_lag], both the coordinator (if also stale) and the source FSM (verifyRouteNotFenced consults "the current catalog snapshot" from the watcher — §7.1 line 517) still have no WriteFenced route in their local view.
A write for the fenced range in this window is accepted by the source at a commit_ts that may be above fence_ts (depending on HLC skew and timing). DELTA_COPY's upper bound is fence_ts; those writes are outside the window and are never copied to the target. At CUTOVER the target's MVCC lacks them — silent data loss.

§7.1's FSM defense in depth is the right idea but it reads the local catalog snapshot, so it is subject to the same watcher lag. The gap exists unless the design specifies a barrier between recording fence_ts and starting DELTA_COPY that ensures the source FSM has applied the WriteFenced route.

Suggested fix: add a FENCE → DELTA_COPY transition precondition — "before recording fence_ts and advancing to DELTA_COPY, the migrator polls the source-group leader (via the existing distribution heartbeat or a lightweight GetCatalogVersion RPC) until it reports catalog_version ≥ fence_catalog_version; fence_ts is then taken from HLC.Next() on the source-group leader or from max_imported_ts at that moment, ensuring the source's FSM snapshot already includes the WriteFenced route when DELTA_COPY begins." Alternatively, widen the DELTA_COPY upper bound to fence_ts + some_safe_watcher_lag and guarantee that any write accepted in the lag window is included — but the poll approach is cleaner.

[Fix this →](https://claude.ai/code?q=In%20docs%2Fdesign%2F2026_06_11_proposed_hotspot_split_milestone2_migration.md%20(branch%20docs%2Fhotspot-split-m2-proposal)%2C%20%C2%A73.2%20and%20%C2%A77.1%3A%20the%20migrator%20records%20fence_ts%20and%20starts%20DELTA_COPY%20as%20soon%20as%20the%20FENCE%20catalog%20write%20commits%20on%20the%20default%20group%2C%20but%20the%20source%20FSM's%20verifyRouteNotFenced%20gate%20reads%20the%20local%20catalog%20snapshot%20(refreshed%20by%20the%20watcher%2C%20which%20has%20a%20seconds-scale%20polling%20lag).%20Writes%20accepted%20by%20the%20source%20in%20the%20watcher-lag%20window%20with%20commit_ts%20%3E%20fence_ts%20are%20outside%20DELTA_COPY's%20(snapshot_ts%2C%20fence_ts%5D%20window%20and%20will%20be%20lost%20at%20CUTOVER.%20Add%20a%20FENCE%20%E2%86%92%20DELTA_COPY%20precondition%3A%20before%20recording%20fence_ts%20and%20advancing%2C%20the%20migrator%20polls%20the%20source-group%20leader%20until%20it%20reports%20catalog_version%20%3E%3D%20fence_catalog_version%2C%20then%20takes%20fence_ts%20from%20that%20point.%20Update%20%C2%A73.2%20FENCE%20row%2C%20%C2%A74%20FENCE%20crash-safety%20row%2C%20and%20%C2%A78%20migrator%20runtime%20to%20document%20this%20barrier.&repo=bootjp/elastickv)

Remaining P1 from round 3 — PromoteStaged "no-op when shadowed" still drops MVCC history (5th flag)

§6.4 step 3 (line 485) still reads:

copies the staged key into its live position (no-op when a later live write has already shadowed it; the FSM applies metaLastCommitTS-respecting MVCC put)

This has been flagged in rounds 3, 4, and now again as the 5th flag. The §6.4 step 2 MVCC merge (added in round 4) does not rescue this: the merge is only active while staged_visibility_active = true. Step 3 deletes the staged key; once all staged keys are deleted and CLEANUP completes, staged_visibility_active is cleared. At that point:

Staged K@ts=3 was not copied to live (declared "shadowed" by live K@ts=10) and was deleted.
Live namespace holds only K@ts=10.
A snapshot read at read_ts=5 now returns "not found" — the imported value at ts=3 is permanently gone.

The issue is the same two horns from round 3. There is no interpretation of "no-op when shadowed" that is both MVCC-correct and satisfies the CLEANUP drain precondition:

Skip copy + delete staged: ts=3 version disappears from MVCC — snapshot reads at read_ts ∈ (3, 10) silently return wrong results.
Skip copy + skip delete: CLEANUP drain ("no staged keys remain") never completes — job stuck forever.

The correct idempotency guard for the copy is: "no-op if a live version at the same commit_ts already exists" (meaning this exact staged version was already promoted in a prior batch, before a leader flap). The MVCC store naturally holds both K@ts=3 and K@ts=10 at different commit timestamps — there is no conflict. History pruning belongs to kv/compactor.go, not the promoter.

Note: §6.4 step 6 bullet "Concurrent writes during promotion" (line 496) says "the FSM's per-key metaLastCommitTS check applied by step 3's PromoteStaged keeps the staged copy from overwriting a fresher live value" — this confirms the copy should always happen (the MVCC put is idempotent on same-ts, and does not overwrite a newer-ts version by design). The "no-op when shadowed" clause on line 485 contradicts this intent.

Fix this →

Remaining from prior rounds

Medium — Grace window inconsistency (5th flag)

Three locations still disagree after commit 44e6125:

Location	Line	Value
§4 table, CLEANUP row	168	60 s
§7.2.4	574	30 s (with rationale: "comfortably exceed both lease TTL and watcher tick × 2")
§12 Open Question 3	689	60 s

§7.2.4 has the rationale for 30 s. Picking 30 s: update lines 168 and 689 to match, and close Open Question 3 with the §7.2.4 rationale. Five rounds of flagging — this is a one-line fix.

Fix this →

Minor — `!dist|` prefix on target-group import cursor (5th flag)

Line 350, inside the ImportVersions doc comment: !dist|job|<id>|import_cursor is stored in the target group's MVCC store. !dist| is the default-group catalog namespace throughout the codebase (distribution/catalog.go). An operator tool or the catalog's key parsers scanning the default-group catalog would misparse these as catalog entries. Rename to a target-scoped prefix such as !migstage|cursor|<id> — one-line change.

Fix this →

Medium — "Wait until applied" spec (4th flag, slightly improved)

Line 552 (§7.2.2 sub-case 2): "the FSM treats this as 'wait until applied' (bounded by lease TTL) before answering, then re-evaluates ownership; if still not owned, returns ErrRouteMoved."

The round-4 addition of "then re-evaluates ownership; if still not owned, returns ErrRouteMoved" is an improvement. What remains unspecified: the implementation mechanism for "wait until applied" (condition variable on appliedIndex signalled by the apply goroutine? a retry loop with time.Sleep? a channel?) and the timeout value within the lease TTL (200 ms? 500 ms?). Given the level of implementation detail elsewhere in the doc, one sentence would close this: e.g., "condition-variable wait on the FSM's appliedIndex signalled by the apply goroutine, timedout by min(remaining_lease_TTL, 200ms); on timeout, re-evaluate ownership — if still not owned, return ErrRouteMoved."

Summary

Both round-5 items are verified and properly resolved. hlcPhysicalMs(ts) = int64(ts >> 16) is correct per kv/hlc.go (confirmed hlcLogicalBits = 16, SetPhysicalCeiling takes Unix ms, Observe takes a full HLC value). The scan-budget pacing correctly bounds per-call cost for sparse family brackets.

One new P1 emerges from the codex review of commit 44e6125: fence_ts is recorded before the source-group FSM has applied the WriteFenced route — writes accepted in the watcher-lag window after fence_ts are outside DELTA_COPY's range and would be lost at CUTOVER. The fix is a source-group catalog-version poll before DELTA_COPY begins.

The round-3 P1 (PromoteStaged "no-op when shadowed") has now been flagged five consecutive rounds without resolution; it is a correctness bug that causes MVCC history versions to vanish. Three editorial items from round 2 remain (grace window, !dist| prefix, wait-until-applied).

…und-6 P1) Codex P1 round-6 on PR #945 (line 115) — fence_ts pinned before source FSM actually rejects writes The FENCE catalog write commits on the default group, but the source FSM only learns about WriteFenced through the watcher's async catalog snapshot apply (distribution/watcher.go:11, 87-98). If the migrator records fence_ts = source.LastCommitTS() and starts DELTA_COPY immediately, a stale source leader can still accept writes for the moved keys for one watcher-tick worth of time after the catalog commit. Those writes get commit_ts > fence_ts, fall outside DELTA_COPY's (snapshot_ts, fence_ts] window, and are silently left on the source at CUTOVER — write loss after CUTOVER hands the routing to the target. Fix: new §3.2a 'Source-side fence-apply barrier'. The migrator advances FENCE → DELTA_COPY only after every voter of the source-group Raft membership has reported catalog_version_applied >= fence_catalog_version (via the cap_migration_v2 heartbeat's existing catalog_version_applied field, or a per-node GetCatalogVersion RPC if the heartbeat is stale). Only then does the migrator query source.LastCommitTS() and pin that as fence_ts. Crucially: this requires UNANIMOUS voter-ACK, not quorum. Any voter could become the leader at any future moment; without local WriteFenced applied, it would re-serve writes for the moved range in the pre-FENCE state. Quorum is necessary for catalog-write durability; unanimous voter-ACK is necessary for fence safety. State machine diagram (§3.2) updated: FENCE → DELTA_COPY now explicitly waits for the source-side fence ACK before picking fence_ts. SplitJob gains a fence_ack_cursor for restart-safety during the wait. A non-responsive voter blocks the phase; the migrator logs the pending voter list and either operator intervenes or Raft removes the voter via the leader-loss path. CUTOVER is structurally guaranteed by a different mechanism (the §7.2 read fence's authoritative ownership check), so no parallel 'every voter applies CUTOVER' gate is needed — readers are the consumers there.

bootjp · 2026-06-11T18:04:29Z

@claude review — round 6 (1 P1 on commit 44e6125):

line 115 (fence_ts pinned before source rejects writes) — fixed in 9fa5931. New §3.2a barrier: FENCE → DELTA_COPY waits for unanimous voter-ACK that catalog_version_applied >= fence_catalog_version before pinning fence_ts. Closes the write-loss window where the source FSM hasn't applied WriteFenced yet but the migrator has already started counting (snapshot_ts, fence_ts].

bootjp · 2026-06-14T08:41:14Z

@codex review

Addressed the fresh PR #945 Codex findings in 414c1f8 (doc-only).

P1: made the equal-stale read fence fail closed when the default-group ownership lookup times out, is unreachable, or cannot verify the requested catalog version. Point reads and scan-interval checks now return retryable ErrRouteOwnershipUnknown and must not touch source MVCC until ownership is known.
P2: defined zero-version scan-progress chunks as first-class ImportVersions batches. They durably ack per-bracket batch_seq/cursor and update SplitJob bracket_progress without MVCC writes, batchMax, HLC/metaLastCommitTS, or max_imported_ts advancement.

Validation: git diff --check; GOCACHE=$(pwd)/.cache GOTMPDIR=$(pwd)/.cache/tmp go test ./... -run "^$".

Caller audit for the fail-closed design change: rg over GetRouteOwnership/GetIntersectingRoutes/ErrRouteOwnershipUnknown/ErrRouteCutoverPending/ImportVersions. GetRouteOwnership and ErrRouteOwnershipUnknown are design-new symbols in this doc; existing GetIntersectingRoutes callers are distribution.Engine tests and kv/shard_store scan routing.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 414c1f8a63

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-14T08:46:32Z

+        {Start: ddbItemPrefix(),  End: ddbItemPrefixEnd(),  Family: familyDdbItem},
+        {Start: ddbGSIPrefix(),   End: ddbGSIPrefixEnd(),   Family: familyDdbGSI},


Include DynamoDB metadata brackets in exports

When the moving route is a DynamoDB table, this plan only exports the item and GSI raw prefixes. The existing router also maps !ddb|meta|table| and !ddb|meta|gen| into the same table route (kv/shard_key.go:96-99), and the adapter reads those keys to load the table schema and generation (adapter/dynamodb.go:8163 and adapter/dynamodb.go:8179). Because those raw prefixes sort outside both the route-key bracket and the item/GSI brackets, they would never be copied, so after cutover the target can have the table's items but no schema/generation and DynamoDB operations will treat the table as missing or use generation 0. Add export brackets for the DynamoDB table metadata and generation prefixes alongside item/GSI.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-14T08:46:32Z

+        rKey := routeKey(rawKey)
+        // half-open [rangeStart, rangeEnd) in the routing-key namespace.
+        if bytes.Compare(rKey, rangeStart) < 0 { return false }
+        if rangeEnd != nil && bytes.Compare(rKey, rangeEnd) >= 0 { return false }


Treat empty range ends as unbounded

For an open-ended route, the wire contract above says route_end is empty for +infinity, but this predicate only special-cases nil. If an empty-but-non-nil slice reaches RouteKeyFilter, bytes.Compare(rKey, rangeEnd) >= 0 is true for every non-empty routing key, so BACKFILL/DELTA_COPY and the shared lock/prepare scans reject the entire rightmost range and can cut over without its data. Normalize empty ends to nil at the boundary or check len(rangeEnd) > 0 here.

Useful? React with 👍 / 👎.

bootjp · 2026-06-14T09:08:11Z

@codex review

Addressed the fresh PR #945 Codex findings in 8a1a301 (doc-only).

P1: added DynamoDB table metadata and table generation export brackets (!ddb|meta|table|, !ddb|meta|gen|) alongside item/GSI, and documented why schema/generation are required for post-CUTOVER DynamoDB reads.
P1: made RouteKeyFilter treat nil and empty route_end as unbounded (len(rangeEnd) > 0), matching the wire contract where empty means +infinity.

Validation: git diff --check; GOCACHE=$(pwd)/.cache GOTMPDIR=$(pwd)/.cache/tmp go test ./... -run "^$".

Caller/semantic audit: rg over RouteKeyFilter, rangeEnd, route_end, DynamoDB metadata/generation prefixes, loadTableSchemaAt, and loadTableGenerationAt. The filter/bracket planner is design-new in this doc; existing code confirms routeKey() maps !ddb|meta|table| and !ddb|meta|gen| to the table route, and the DynamoDB adapter reads those metadata keys directly.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8a1a301af5

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-14T09:13:36Z

+message MVCCVersion {
+  bytes key = 1;        // raw storage key (NOT routeKey-normalized)
+  uint64 commit_ts = 2;
+  bool tombstone = 3;
+  bytes value = 4;      // omitted when tombstone == true


Carry expireAt in MVCCVersion

Values written through PutWithTTLAt / ExpireAt have an expireAt timestamp in the MVCC value header, and the read path suppresses them once ExpireAt <= read_ts. This wire record only carries value/tombstone, so the importer has no way to preserve an existing TTL; a not-yet-expired key migrated before CUTOVER would be written on the target as non-expiring (or with an invented expiry), causing post-CUTOVER reads after the original expiry to resurrect data. Add expire_at to the exported version, or carry the raw stored-value header through import.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-14T09:13:37Z

+// Visibility rules MUST match the read path's snapshot at maxCommitTS;
+// in particular, intent locks (!txn|lock|...) are excluded — those
+// belong to in-flight txns the importer must not observe as commits.


Export delete tombstones during delta copy

This contract says export visibility must match the read path at maxCommitTS, but the read path hides tombstones. In a DELTA_COPY where BACKFILL copied K@10 and the source deletes K at commit_ts=20 <= fence_ts, a read-visible export at fence_ts emits nothing for K, so the target keeps K@10 and the delete is lost after CUTOVER. The exporter needs to iterate raw committed MVCC versions in (minCommitTS, maxCommitTS], including tombstones, rather than only rows visible to a snapshot read.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-14T09:13:37Z

+3. **Background incremental promoter on the target group.** Immediately after CUTOVER the target group's FSM starts a **leader-local promoter goroutine** that walks the `!dist|migstage|<job_id>|*` prefix in cursor-resumable chunks (the same `chunkBytes` / pacing knobs §6.1 and the migrator already use), and for each staged version proposes one `PromoteStaged` Raft entry that:
+   - copies the staged key into its live position with the original imported `commit_ts` — **always**, regardless of whether a newer live version at a higher `commit_ts` exists. The MVCC store naturally holds both `K@commit_ts=3` (the staged version being promoted) and `K@commit_ts=10` (a later live write) at different commit timestamps; a snapshot read at any `read_ts ∈ [3, 10)` correctly sees the imported `K@3`, and a read at `read_ts ≥ 10` correctly sees `K@10`. The idempotency guard for this copy is "**no-op if a live version at the same `commit_ts` already exists**" — that means this exact staged version was already promoted in a prior batch before a leader flap re-tried it. It is **not** "no-op when a newer live version exists," which would silently drop the imported version from MVCC history and make snapshot reads at older `read_ts` return wrong results (closes claude round-3 P1, 5-round flag). History pruning belongs to `kv/compactor.go`, not the promoter,
+   - deletes the staged key,
+   - advances the per-job `promote_cursor` on the SplitJob.


Keep promotion state under one Raft owner

PromoteStaged is described as a target-group Raft entry, but promote_cursor is stored in the default-group SplitJob and the later staged_visibility_active flag is default-group catalog state. In the CLEANUP path this cannot be atomic: a crash between separate target/default writes can resume from a cursor that skips unpromoted staged rows, or clear the merge flag before target promotion is durable. Keep promotion progress in target-local Raft state and clear catalog visibility only via an ordered default-group CAS after the target reports completion.

Useful? React with 👍 / 👎.

bootjp · 2026-06-14T09:42:14Z

@codex review

Addressed the fresh PR #945 Codex findings in 701845a (doc-only).

P1: MVCCVersion now carries expire_at; export/import preserve the MVCC value header needed for TTL visibility, and promotion copies the original commit_ts / tombstone / expire_at without minting timestamps.
P1: ExportVersions is specified as raw committed-version iteration over (minCommitTS, maxCommitTS], including delete tombstones and expired-at-maxCommitTS TTL versions, excluding only in-flight intent locks. Test plan now covers DELTA_COPY delete tombstones and TTL preservation.
P2: promotion progress moved to target-local Raft state under !migstage|promote|<job_id>; the default group records only target_promotion_done after target-local done=true, then clears staged_visibility_active via ordered CAS. Recovery/test/PR tables now cover target promotion leader loss and default-CAS retry.

Validation: git diff --check; GOCACHE=$(pwd)/.cache GOTMPDIR=$(pwd)/.cache/tmp go test ./... -run "^$".

Caller/semantic audit: rg over VersionedValue, ExpireAt, PutWithTTLAt, ExpireAt, PutAt, DeleteAt, Tombstone, MVCCVersion, ExportVersions, ImportVersions, PromoteStaged, PromotionState, promote_cursor, staged_visibility_active, and target_promotion_done. Existing store code confirms Tombstone/ExpireAt live in the MVCC value header and read visibility hides tombstones/expired values; promotion symbols are design-new in this doc, so ownership is documented at the target/default boundary.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 701845acec

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-14T09:45:55Z

+   - read the **newest live MVCC version of `K` with `commit_ts ≤ read_ts`** (the standard snapshot read against `!ddb|...K`, `!redis|...K`, the user-key band, etc. — whichever family `K` belongs to);
+   - read the **newest staged MVCC version of `K` with `commit_ts ≤ read_ts`** under `!dist|migstage|<job_id>|<K>` (the §6.1.1 staged area, iterated by the same `(raw_key ASC, commit_ts DESC)` order the snapshot iterator already uses);
+   - return whichever has the **greater `commit_ts`** (newest wins); if both are absent return "not found".


Merge hidden tombstones before older staged versions

When promotion is only partially complete, using the standard snapshot read here hides tombstones/expired versions as “absent”, which can resurrect older staged data. For example, if staged has K@10 and a delete tombstone K@20, the promoter can copy/delete the K@20 staged row into live first; the live snapshot read then returns absent while the remaining staged K@10 is visible, so reads at read_ts >= 20 return the deleted value until the older staged row is promoted. The merge needs to compare raw MVCC versions including tombstone/expire_at and treat a winning hidden version as not found.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-14T09:45:55Z

+3. **CUTOVER precondition.** Before the migrator issues the CUTOVER catalog write, it ensures the target group's HLC ceiling is at least `hlcPhysicalMs(max_imported_ts)` by issuing a final `SetPhysicalCeiling(hlcPhysicalMs(max_imported_ts))` + `Observe(max_imported_ts)` proposal on the **target group** (not the default group). The proposal is gated by `cap_migration_v2` (§11.1) and is the **last** target-side write before CUTOVER. If the target HLC is already above this value (e.g. due to per-batch advancement), the `SetPhysicalCeiling` call is the trivial no-op the ceiling already enforces; if a follower flap dropped one of the per-batch advances, this proposal closes the gap.
+4. **Post-CUTOVER write-side invariant.** A target-group `Next()` after CUTOVER is bounded below by `metaLastCommitTS` AND by the HLC ceiling (the existing `HLC.Next()` semantics in `kv/hlc.go`), so it is provably strictly greater than every `commit_ts` ever imported under this job. Reads at any post-CUTOVER `Next()` therefore see the imported version as historical (older than the read ts) and see the new write as the most-recent committed version — no resurrection, no time travel.


Persist the full imported HLC high-water mark

This invariant does not survive a target restart/snapshot because HLC.last is in-memory: the existing snapshot restore path reapplies only the physical ceiling, and HLC.Next() does not consult metaLastCommitTS. If max_imported_ts has the same physical millisecond as the restored ceiling but a non-zero logical half, a restarted target can issue ms<<16|0, which is below an imported version despite metaLastCommitTS being higher. The design needs to durably re-observe the full max_imported_ts on target startup/apply, or advance the physical ceiling past that timestamp before serving post-CUTOVER writes.

Useful? React with 👍 / 👎.

bootjp · 2026-06-14T10:16:44Z

@codex review

Addressed the fresh PR #945 Codex findings in 41b8aba (doc-only).

P1: persisted the imported full-HLC high-water mark in target-local Raft state under !migstage|hlc_floor|<job_id>. ImportVersions and the final pre-CUTOVER target proposal update that full value, while startup/snapshot restore/leadership acquisition replays SetPhysicalCeiling(hlcPhysicalMs(fullFloor)) plus Observe(fullFloor) before serving writes. The test plan now includes a restart/snapshot red control where physical-ceiling-only restore emits ms<<16|0 below an imported ms<<16|logical.
P1: made staged/live visibility merge raw MVCC candidates, including tombstone and expire_at, before applying the visibility filter. If the winning candidate is hidden, point reads and scans return not found / omit the key and never fall back to an older candidate from the other source. The test plan now includes partial-promotion tombstone and TTL red controls.

Validation: git diff --check; GOCACHE=$(pwd)/.cache GOTMPDIR=$(pwd)/.cache/tmp go test ./... -run '^$'.

Semantic/caller audit: rg over HLC.Next, NextFenced, Observe, SetPhysicalCeiling, PhysicalCeiling, metaLastCommitTS, LastCommitTS, Snapshot, Restore, readVisibleVersion, processFoundValue, ScanAt, Tombstone, ExpireAt, ExportVersions, ImportVersions, PromoteStaged, and staged_visibility_active. Existing code confirms HLC.last is in-memory and restore currently replays only the physical ceiling; readVisibleVersion / processFoundValue hide tombstones and expired TTL values; the staged migration APIs are design-new in this doc.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 41b8abaa6b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-14T10:20:04Z

+        {Start: txnIntPrefix(),   End: txnIntPrefixEnd(),   Family: familyTxnInternal},
+        {Start: lstMetaPrefix(),  End: lstMetaPrefixEnd(),  Family: familyListMeta},
+        {Start: lstItmPrefix(),   End: lstItmPrefixEnd(),   Family: familyListItem},
+        {Start: redisPrefix(),    End: redisPrefixEnd(),    Family: familyRedis},


Add Redis wide-column brackets to migration export

This redisPrefix() bracket only covers !redis|... keys, but current Redis collection state is also stored under non-!redis prefixes such as !hs|, !st|, !zs|, and !stream| (see adapter/redis_compat_types.go's knownInternalPrefixes and the store/*_helpers.go layouts). A cross-group split of a route containing a Redis hash/set/zset/stream key would never export those rows (and needs matching routeKey decoders), so after CUTOVER the target would serve empty or partial collections. Please include these existing Redis wide-column prefixes in the bracket plan and tests.

Useful? React with 👍 / 👎.

bootjp · 2026-06-14T10:42:28Z

@codex review

対応しました (b799e94): fresh Codex P1 の Redis wide-column migration bracket 漏れを doc 側で修正しました。

export bracket sketch に !redis|... に加えて !hs|, !st|, !zs|, !stream|meta|, !stream|entry| を追加
routeKey() に store-owned Redis wide-column decoder が必要なことを明記 (ExtractHash/Set/ZSet/Stream*UserKey helpers)
hash/set/zset は delta prefix が meta prefix 配下なので、重複 scan を避ける umbrella bracket と raw subprefix dispatch に整理
stream は !stream|foo user key を誤分類しないよう concrete meta/entry prefix のみを bracket 化
test plan に Redis collection migration と red controls (bracket 省略、decoder 省略) を追加

Validation:

git diff --check
GOCACHE=$(pwd)/.cache GOTMPDIR=$(pwd)/.cache/tmp go test ./... -run '^$'

Audit:

rg で knownInternalPrefixes, redisInternalPrefixes, Redis wide-column store prefixes, Extract*UserKey helpers, routeKey(), redisRouteKey() を確認しました。現状の kv/shard_key.go は !redis|... と list 等のみを normalize しているため、M2 実装要件として wide-column decoder 追加を明記しました。

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b799e94b29

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-14T10:47:07Z

+
+M2 fixes this with a **three-step gate** on the FENCE → DELTA_COPY transition (step 0 added per claude round-12 P1 on PR #945 — prepared-txn drain — see §7.1's no-`Phase_COMMIT`-bypass rationale):
+
+0. **Prepared-txn drain (resolves OQ-6).** Before transitioning `BACKFILL → FENCE` (yes, before — the FENCE catalog write itself does not run until this drain returns), the migrator waits until **no in-flight transaction intent locks remain in the moving range**. "No intents remaining" is observed via the **route-faithful lock drain** defined in §3.2a.0a below — a `!txn|lock|` family scan filtered by `routeKey(lockedKey) ∈ [routeStart, routeEnd)`, returning empty. The wait is bounded by the txn timeout — a long-lived prepared txn either commits or is aborted by the lock resolver (`kv/lock_resolver.go`) within its TTL, so the scan converges in at worst one TTL window. While the drain is pending the migrator logs `last_error = "fence drain pending: <N> intent locks in moving range"`. **Why this is required:** §7.1's `verifyRouteNotFenced` rejects every write request in the WriteFenced range and has **no `Phase_COMMIT` bypass** because the matching intent lock is excluded from BACKFILL (§6.1) and absent on the target after CUTOVER — a `Phase_COMMIT` reaching the FSM after FENCE has nothing to commit against. Letting a prepared txn into the FENCE phase would therefore strand its commit indefinitely. The drain guarantees every commit that reaches the FSM during/after FENCE is for a txn whose intent has already resolved (committed via the pre-FENCE accept window, or aborted by the lock resolver). After CUTOVER the target has no historical intent for a moved-range key, so a coordinator that retries a Phase_COMMIT against the target would also fail; the drain is the only safe ordering.


Gate new prepares before waiting for an empty drain

While this BACKFILL→FENCE drain is waiting, the route is still Active and nothing prevents new prepares from landing in the moving range. Under a steady transactional workload, each lock may resolve within its TTL but another lock can arrive before the scan is empty, so the job is not bounded by one TTL and can starve before it ever writes the WriteFenced route. The migration needs a draining gate or Phase_COMMIT-safe fence that stops new prepares before relying on an empty lock scan.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-14T10:47:07Z

+
+   Because the source is still `Active` during BACKFILL and FENCE-apply, new prepares with arbitrarily low caller-supplied / lagging-clock `commit_ts` can keep appearing on **either** the primary or a secondary key in the moving range; the running minimum must capture every one. To avoid an inter-tick gap — a prepare that both lands and resolves between two ticks would never appear in a *live-lock* sample — the running minimum is taken over the following sources, all route-faithful by `routeKey()` against the moving range (`routeKey(key) ∈ [routeStart, routeEnd)`):
+   - the live `!txn|lock|` family (in-flight prepares): each accepted lock is in the moving range by its **own** `routeKey(lockedKey)` regardless of whether the locked key is the txn's primary or a secondary, so a secondary lock in the moving range counts here. For each accepted lock the migrator records `(routeKey(lockedKey), PrimaryKey, StartTS)` from the lock value (`txnLock` carries `PrimaryKey` + `StartTS`, `kv/txn_codec.go:223-245`; the `commit_ts` itself is added by the M2-PR0 wire change). Timestamp priority is explicit: first use the non-zero `commit_ts` in the lock value; for legacy locks where that field is absent or zero, fall back to a **direct point lookup** of the commit record at `!txn|cmt|<PrimaryKey>|<startTS>`. If the direct lookup is absent because the legacy txn is still prepared, the migrator treats that lock as unresolved and keeps the relevant scan/drain pending until the lock resolves or the resolver aborts it. The migrator already holds the exact primary key, so it does not need the commit record to route into the moving range. The primary-keyed commit record is therefore consulted by **direct address from the secondary lock**, never by a route-faithful commit-record-family *scan* (which, per the paragraph above, cannot find a moved-secondary's primary-keyed record), AND
+   - the **committed MVCC versions in the moving range itself** — a route-faithful scan of committed versions over `[routeStart, routeEnd)` (the same §6.3.1 brackets + `routeKey()` filter the data export uses) with `commit_ts ≤ snapshot_ts`. This is the backstop for the same-tick land-and-resolve case where the secondary's intent lock was never sampled: the txn's committed secondary version `(secondaryKey, commit_ts)` is written into the moving-range MVCC space and routes into `[routeStart, routeEnd)` by construction, so this scan observes its `commit_ts` directly without any dependence on the primary-keyed commit record. (It is sufficient on its own for correctness; the live-lock + direct-commit-record path above is the cheap fast path that lets the migrator floor `delta_floor` below a still-prepared secondary before its version even commits, sparing DELTA_COPY a re-export.)


Avoid lowering delta_floor with all historical versions

This final scan includes every committed MVCC version in the range with commit_ts <= snapshot_ts, so any old retained version that was already copied by BACKFILL lowers snapshot_min_prepared_ts to the oldest history, not just to late prepares that BACKFILL missed. That makes delta_floor near zero and forces DELTA_COPY to re-export almost the entire range while writes are fenced, turning the intended short cutover window into another full backfill for large ranges. The design needs a way to distinguish newly resolved prepares from ordinary historical versions before using them to set the floor.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-14T10:47:07Z

+        |  CUTOVER → CLEANUP               Source group's leader deletes
+        v                                  the moved key range from its
+[ source.Active(left)           ]          MVCC store after the read-fence
+[ target.Active(right)          ]          grace window (§7.2).


Exclude default-catalog keys from source cleanup

If the source is the default group and the moving right child intersects !dist|, this cleanup deletes the authoritative catalog/SplitJob keys that §3.1 stores under !dist|meta|, !dist|route|, and !dist|job|. A split such as one starting at or before !dist| would migrate and then remove the control-plane records needed to route and resume the job. Cross-group migration should explicitly reject or carve out the default-group reserved catalog prefix before applying generic source-range deletion.

Useful? React with 👍 / 👎.

bootjp · 2026-06-14T11:16:55Z

@codex review

対応しました (a603f17): fresh Codex findings at b799e94.

P1: FENCE now publishes WriteFenced before waiting for an empty drain, so new prepares cannot starve the drain; existing prepared txns resolve only through a narrow Phase_COMMIT/ABORT lane while post_fence_drain_completed=false and matching intent exists.
P1: Replaced all-historical committed-version floor scan with a source-group prepare tracker armed before snapshot_ts. Same-tick land-and-resolve prepares are recorded at admission, while old retained MVCC history cannot lower delta_floor.
P2: Cross-group migration now rejects default-group moving intervals intersecting reserved !dist| control namespace before SplitJob creation, so CLEANUP cannot delete catalog/SplitJob keys.
Updated recovery/test plan with non-starving drain, tracker red controls, and !dist| reserved range validation.

Validation:

git diff --check
GOCACHE=$(pwd)/.cache GOTMPDIR=$(pwd)/.cache/tmp go test ./... -run '^$'

Semantic/caller audit:

rg over verifyRouteNotFenced, Phase_COMMIT, Phase_ABORT, ObservedRouteVersion == 0, routeKey(mut.Key), ErrRouteWriteFenced, txn lock/commit key helpers, resolveDispatchCommitTS, caller-supplied StartTS/CommitTS, SplitRange, and !dist| catalog prefixes.
Existing code confirms commit/abort resolution uses Phase_COMMIT/Phase_ABORT, txnCommitKey is primary-keyed, caller-supplied commit timestamps are accepted, and !dist|meta| / !dist|route| are default catalog prefixes.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a603f175c3

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-14T11:22:51Z

+        {Start: lstMetaPrefix(),  End: lstMetaPrefixEnd(),  Family: familyListMeta},
+        {Start: lstItmPrefix(),   End: lstItmPrefixEnd(),   Family: familyListItem},


Include list delta and claim rows

When a Redis list has un-compacted metadata deltas or LPOP/RPOP claim keys at migration time, this bracket plan only exports !lst|meta| and !lst|itm|. The existing store also persists live list state under store.ListMetaDeltaPrefix (!lst|meta|d|) and store.ListClaimPrefix (!lst|claim|), which Redis reads/compaction consume via the corresponding scan prefixes, while current routeKey() only calls store.ExtractListUserKey for meta/item keys. Without adding both brackets and matching route-key decoders, the target misses these rows after CUTOVER, so list length/head state can be wrong and pop claim protection can be lost.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-14T11:22:51Z

+- **Branch 1 — source has applied CUTOVER (`source_local_version >= cutover_version`).** The FSM calls `GetIntersectingRoutes(start, end)` against its **own** current catalog snapshot. If **every** intersecting route is owned by this group, the scan proceeds. If **any** intersecting route is owned by a different group, the FSM returns `ErrRouteMoved{new_route_version, new_group_id, sub_range_start, sub_range_end}` for the **first** foreign-owned sub-range encountered (lexicographically smallest).
+- **Branch 2 — source has NOT applied CUTOVER but `max_cutover_version_seen > source_local_version` (equal-stale, watcher lag).** The local snapshot still says `source.Active(full: [A, Z))` — a single route, no foreign owner visible — so Branch 1 alone would silently pass a cross-boundary scan. The gate therefore additionally calls `GetIntersectingRoutes([start, end), catalog_version = max_cutover_version_seen)` as a single deadline-bounded RPC to the **default-group leader** (mirroring exactly §7.2.2's point-key extension — same watermark trigger, same RPC target, same 1 s cache TTL with cache key `(start, end, max_cutover_version_seen)`). The default-group leader answers from its current catalog snapshot. If any returned route is owned by another group, the source FSM returns `ErrRouteMoved{max_cutover_version_seen, foreign_group_id, sub_range_start, sub_range_end}` for the first foreign-owned sub-range. If every returned route is still source-owned (the elevated watermark covers an *unrelated* migration that did not touch this scan range), the scan proceeds normally. If the RPC times out, the default-group leader is unreachable, or the route set cannot be verified at `max_cutover_version_seen`, the FSM returns `ErrRouteOwnershipUnknown{route_version=max_cutover_version_seen, sub_range_start=start, sub_range_end=end, retry_after_ms}` and **must not scan source MVCC**.


Normalize scan ownership checks by route key

When the stale read is an adapter/internal scan, both branches here compare the raw scan bounds against catalog routes. For families whose raw scan prefix sorts outside the logical route key acknowledged in §6.3.1 (for example Redis wide-column !hs|... / !st|... / !zs|... or list delta/claim prefixes), raw [start,end) can appear source-owned even though the rows' routeKey() values are in the moved right half. In that case the source FSM will pass the fence and scan source MVCC, returning stale collection data after target.Active; the scan fence needs per-family route-key interval mapping or fail-closed filtering instead of raw GetIntersectingRoutes(start,end).

Useful? React with 👍 / 👎.

bootjp · 2026-06-14T11:41:15Z

@codex review

対応しました (646fe7e): fresh Codex P1 findings at a603f17.

P1: added Redis list meta-delta and claim families to the migration export bracket plan (!lst|meta|d|, !lst|claim|) and documented the matching routeKey() decoder order (ExtractListUserKeyFromDelta / ExtractListUserKeyFromClaim before the broader base list decoder). Test plan now has list delta/claim red controls.
P1: changed scan ownership fencing to normalize raw scan bounds into route-key intervals before any GetIntersectingRoutes proof. Internal scans such as !lst|meta|d|, !lst|claim|, !hs|, !st|, and !zs| are checked against decoded logical owners or explicit logical route bounds; un-normalizable raw internal scans fail closed with ErrRouteOwnershipUnknown and do not read source MVCC.

Validation:

git diff --check
GOCACHE=$(pwd)/.cache GOTMPDIR=$(pwd)/.cache/tmp go test ./... -run '^$'

Semantic/caller audit:

rg over GetIntersectingRoutes, routesForScan, ExtractListUserKey, ExtractListUserKeyFromDelta, ExtractListUserKeyFromClaim, ListMetaDeltaPrefix, ListClaimPrefix, routeKey(), Redis route-key helpers, and the retryable read-fence errors (ErrRouteOwnershipUnknown, ErrRouteMoved, ErrRouteCutoverPending).
Existing code confirms current scan routing has raw GetIntersectingRoutes(start,end) fallback and list-only special casing, while list delta/claim helpers already exist; the M2 design now assigns the needed route-key-normalized scan gate and decoder/bracket coverage to M2-PR4/PR6 tests.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 646fe7ee86

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-14T11:45:06Z

+M2 therefore makes target visibility depend on a synchronous source-side read barrier:
+
+1. After DELTA_COPY completes and after the target full-HLC floor proposal in §6.2.1, but **before** the default-group CUTOVER catalog CAS, the migrator proposes `ArmCutoverReadFence(job_id, route_start, route_end, expected_cutover_version, target_group_id)` to the **source group**. `expected_cutover_version` is the catalog version the pending CAS will create (`current_catalog_version + 1` under the CAS input).
+2. The source FSM applies this as source-group Raft state: a pending-fence entry keyed by `(job_id, route_start, route_end, expected_cutover_version)`. After the source-group apply ACK, the migrator records `cutover_read_fence_armed = true` on the default-group SplitJob as restart progress. From that point until the source watcher applies CUTOVER and normal §7.2.2 Branch 1 takes over, any point read or scan whose **route-key-normalized interval** intersects the moving range and whose `read_route_version < expected_cutover_version` **must not read source MVCC**. If a raw internal scan cannot be normalized to route-key intervals, the source fails closed with `ErrRouteOwnershipUnknown` / retry rather than proving ownership from raw bytes. If the source has already learned that the default-group CAS committed (via watcher, heartbeat watermark, or the ownership RPC), it returns `ErrRouteMoved{expected_cutover_version, target_group_id}`. Otherwise it returns `ErrRouteCutoverPending{retry_after_ms}`; coordinators treat it like `ErrRouteMoved` with a short wait + `WatcherRefresh` retry. A transient retry is acceptable; serving stale source data is not.


Persist the pending read-fence before relying on it

If the default-group leader crashes after the source-group ACKs ArmCutoverReadFence but before this SplitJob flag is written, the source has a durable pending fence while recovery has no witness that it must complete the CAS or clear the fence. In that window, reads intersecting the moving range keep returning ErrRouteCutoverPending (including after an abandon/rollback path that sees cutover_read_fence_armed=false) because no migrator is required to clear the source-group state. Persist an arming intent before the source proposal, or make recovery/abandon probe and clear source fences even when this flag is false.

Useful? React with 👍 / 👎.

## Summary Doc-only PR. Proposes **multi-node multi-group bootstrap** — the prerequisite for true distributed multi-Raft scaling. Today a real multi-node deployment can only run **single-group**: `resolveBootstrapServers` rejects `--raftBootstrapMembers` when `len(groups) != 1` (`main.go:744-748`), `groupSpec` carries one address per group (`shard_config.go:14-17`), and `buildRuntimeForGroup` threads the same (nil-in-multi-group) `bootstrapServers` into every group (`multiraft_runtime.go:234-254`), so each group bootstraps single-member and the factory never even builds a transport (`factory.go:49-52`). This is exactly the PR0 / OQ-9 "option (a)" gap the leader-balance design (#953) identified. New file: `docs/design/2026_06_12_proposed_multinode_multigroup_bootstrap.md` (Status: Proposed / Author: bootjp / Date: 2026-06-12). Every current-state claim is cited at file:line, verified against `main`. ## Key design points - **Flag surface:** new companion `--raftGroupPeers "1=n1@host1:5051,n2@host2:5051,...;2=..."` (per-group `raftID@host:port` voter lists), lifting the `len(groups)==1` guard. `--raftGroups`/`--raftBootstrapMembers`/`--raftBootstrap` and the `cmd/server/demo.go` demo are unchanged; mutually exclusive with `--raftBootstrapMembers`; full validation rules (local-node-present, addr-match, homogeneity). - **Bootstrap semantics:** all-nodes-same-list (the etcd model `cmd/server/demo.go:204-219` already uses for single-group), NOT a designated proposer. Idempotent restart and divergent-config fail-fast come free from the existing persisted-peers + `errClusterMismatch` + `raft-engine` marker machinery (already per-group). - **Transport/addressing:** transport is already per-group and address-map driven; keep one listener per group (the `5005{1..6}` convention). No transport change — only feeding per-group lists in. - **Alternative considered:** AddVoter-composition stays the **live-expansion** path, not the bootstrap path (per #953 OQ-9). - **Cross-doc impact:** unblocks leader-balance PR2/PR3 (#953) and hotspot-split M2 cross-group migration testing (#945). ## Open questions OQ-1 flag spelling (companion flag vs. extend `--raftGroups`); OQ-2 `--raftBootstrap` required vs. implied; OQ-3 per-group carrier (map vs. func); OQ-4 heterogeneous membership (hard error vs. warning in v1); OQ-5 learner-start interop; OQ-6 single multiplexed raft listener for very high M. ## Test plan Doc-only — no code, no tests run. The doc itself proposes the test strategy: unit (flag-parse table tests, per-group resolution, restart idempotency), integration (in-process 3-node × 2-group harness asserting multi-voter `Configuration()` + inter-node `TransferLeadershipToServer`), and a later Jepsen milestone (true multi-node multi-group runner).  ## Summary by CodeRabbit * **Documentation** * Added comprehensive design proposal for multi-node multi-group bootstrap capabilities, including CLI flag specifications, backward compatibility guidelines, bootstrap semantics, and integration testing roadmap.

bootjp · 2026-06-14T12:09:05Z

@codex review

対応しました (ad2f7bb): fresh Codex P1 finding at 646fe7e.

P1: replaced the single cutover_read_fence_armed bool with cutover_read_fence_state = NONE / ARMING / ARMED / CLEARING.
The migrator now fixes/persists cutover_version and ARMING on the default-group SplitJob before proposing ArmCutoverReadFence to the source group, so recovery has a durable witness even if the leader dies after the source ACK but before the old armed flag would have been written.
Recovery at ARMING idempotently probes/re-proposes the source fence and records ARMED; recovery/abandon at ARMED transitions through CLEARING and keeps issuing ClearCutoverReadFence until the source ACKs before returning to DELTA_COPY/FAILED.
Updated the recovery matrix and tests with crash controls for ARMING, source ACK before ARMED, and stranded ErrRouteCutoverPending prevention.

Validation:

git diff --check
GOCACHE=$(pwd)/.cache GOTMPDIR=$(pwd)/.cache/tmp go test ./... -run '^$'

Semantic/caller audit:

rg over cutover_read_fence_armed, cutover_read_fence_state, ARMING, ARMED, CLEARING, ArmCutoverReadFence, ClearCutoverReadFence, ErrRouteCutoverPending, ErrRouteMoved, and ErrRouteOwnershipUnknown.
These read-fence state-machine symbols are design-new in this doc; existing code has no shipped callers yet, so the implementation responsibility is now anchored in M2-PR6 tests and the recovery matrix.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ad2f7bb534

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-14T12:13:42Z

+   The tracker is keyed by the **locked key**, not the primary, so a secondary lock in the moving range counts even when the txn's primary routes elsewhere. Verified against `kv/fsm.go:1136-1142` and `kv/txn_keys.go:50-58`: the durable `!txn|cmt|<primaryKey>|<startTS>` commit record is written **only when committing the primary** (`if committingPrimary`) and is keyed by `primaryKey`, so for such a txn the commit record's `routeKey()` falls **outside** `[routeStart, routeEnd)` — a commit-record family scan filtered by `routeKey()` never sees it. The migrator may still use the exact `PrimaryKey` + `StartTS` decoded from a live lock to do a **direct point lookup** for backward-compat locks where the M2-PR0 `commit_ts` field is absent or zero, but it never relies on a route-faithful commit-record-family scan for moved-secondaries.
+
+   The minimum is only **final** once §3.2a step 1's unanimous voter-ACK has made every source voter reject new prepares (`verifyRouteNotFenced`) **and** step 0b's post-fence drain has returned empty — at that point no future prepare can land on any key in the moving range, and every migration-window prepare was either recorded by the tracker or was a pre-tracker lock sampled by the drain path. Tracker rows are retained until `delta_floor` is finalised, then deleted by the same cleanup that clears the import ack state for the job.
+3. **Finalise the boundary after the post-fence drain.** When step 0b's drain returns empty (`post_fence_drain_completed = true`), the migrator reads the prepare tracker and live-lock scan **one last time**; the lock side returns empty by construction (no in-flight prepare survives the drain), and the tracker yields the final running minimum over every migration-window prepare — primary or secondary — that resolved into the moving range. The DELTA_COPY lower bound is then fixed as `delta_floor = min(snapshot_ts, snapshot_min_prepared_ts - 1)` and DELTA_COPY exports the window **`(delta_floor, fence_ts]`** (not `(snapshot_ts, fence_ts]`). Any post-scan prepare — on the primary or a moved secondary — that committed with `commit_ts ≤ snapshot_ts` was recorded by the step-2 tracker, so `delta_floor ≤ snapshot_min_prepared_ts - 1 < commit_ts`, i.e. `commit_ts > delta_floor` and the version falls **inside** DELTA_COPY's window — it is no longer dropped. `delta_floor` is persisted on the SplitJob alongside `snapshot_min_prepared_ts` so a DELTA_COPY resume uses the same lower bound.


Track low-ts one-phase writes in the delta floor

This boundary only accounts for prepared transactions, but the existing non-prepare paths can also apply preallocated timestamps: handleRawRequest writes at r.Ts, and handleOnePhaseTxnRequest writes at meta.CommitTS. While the source route remains Active during BACKFILL, a raw/one-phase request can allocate or carry commit_ts <= snapshot_ts, apply after the exporter has already passed its key, and create no prepare-tracker row or live lock; with no prepares lowering the floor, DELTA_COPY's (delta_floor, fence_ts] still starts at snapshot_ts and excludes that committed version, so CUTOVER can lose it. The fence/drain logic needs to cover all admitted low-timestamp writes, not only prepares, before finalizing this floor.

Useful? React with 👍 / 👎.

bootjp · 2026-06-14T12:44:31Z

@codex review

対応しました (cbdc0fc): fresh Codex P1 finding at ad2f7bb.

P1: widened the boundary tracker from prepare-only to a source-group migration write tracker. It now records raw writes (handleRawRequest at r.Ts), one-phase txns (handleOnePhaseTxnRequest at TxnMeta.CommitTS), and prepares (at admission time) for any mutation/locked key whose routeKey() is in the moving range.
Renamed the SplitJob boundary field to snapshot_min_admitted_ts and fixed delta_floor = min(snapshot_ts, snapshot_min_admitted_ts - 1), so a post-scan low-ts raw/one-phase/prepared write is exported by DELTA_COPY instead of being excluded by (snapshot_ts, fence_ts].
Kept the historical-version guard: old retained committed MVCC versions are not allowed to lower the floor; only migration-window tracker rows plus pre-tracker live locks do.
Updated recovery/test plan with raw, one-phase, and prepare red controls, including a prepare-only tracker red control.

Validation:

git diff --check
GOCACHE=$(pwd)/.cache GOTMPDIR=$(pwd)/.cache/tmp go test ./... -run '^$'

Semantic/caller audit:

rg over requestCommitTS, handleRawRequest, handleOnePhaseTxnRequest, r.Ts, TxnMeta.CommitTS, resolveDispatchCommitTS, resolveTxnCommitTS, onePhaseTxnRequestWithPrevCommit, verifyRouteNotFenced, Phase_COMMIT, and Phase_ABORT.
Existing code confirms raw requests use request Ts, one-phase requests carry TxnMeta.CommitTS, and prepared commits remain on the Phase_COMMIT path; the design now records all three admitted low-ts write paths before finalising the DELTA_COPY floor.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cbdc0fc576

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-14T12:49:35Z

+
+1. **Arm the migration write tracker, then choose the provisional snapshot for BACKFILL.** Before opening BACKFILL, the migrator proposes `ArmMigrationWriteTracker(job_id, route_start, route_end)` to the source group and waits for it to apply; only then may it set `write_tracker_armed=true` on the default-group SplitJob and choose `snapshot_ts`. The tracker is source-group Raft state. From that apply onward:
+
+   - `handleRawRequest` appends a tracker row for each mutation whose `routeKey(mut.Key) ∈ [routeStart, routeEnd)`, using the request's `commit_ts = r.Ts`, in the same apply that writes the MVCC version.


Track prefix deletes in the write tracker

This records raw writes only by the point key routeKey(mut.Key), but handleRawRequest special-cases Op_DEL_PREFIX into DeletePrefixAt, which writes tombstones for every live key under the prefix (and nil prefix means all keys). If such a prefix delete is admitted after BACKFILL has passed moved keys with r.Ts <= snapshot_ts, it can create low-ts tombstones without lowering snapshot_min_admitted_ts, so DELTA_COPY may exclude those tombstones and the target resurrects values copied by BACKFILL. Prefix deletes need to be recorded as range-intersecting migration-window writes, not as a single mutation key.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-14T12:49:35Z

+FSM-side defense in depth — **must run on an unconditional apply path** (closes codex round-3 P1 on PR #945):
+
+- The naïve placement — extend `verifyOwnerFromSnapshot` so it also rejects routes in `WriteFenced` — does NOT work. `verifyComposed1` short-circuits when `ObservedRouteVersion == 0` (`kv/fsm.go:608-611`), and there are several **intentional zero-version write paths today**: read/write txns with `ReadKeys`, caller-supplied `StartTS`, and resolver-claimed keys all dispatch with `ObservedRouteVersion = 0` (`kv/sharded_coordinator.go:694-754`, M3 sibling §3.5 + auto-pin policy). A stale coordinator in any of those flows would forward a write to the source during FENCE and the FSM defense would never run. The fence must be enforced **before** any version-conditional early return.
+- M2 therefore introduces a **separate, unconditional pre-gate** `verifyRouteNotFenced(mutations)` in `kv/fsm.go` that the apply path consults **before** `verifyComposed1`. It scans the mutations exactly as `verifyOwnerFromSnapshot` does — `routeKey(mut.Key)` against the **current** catalog snapshot — and rejects with `ErrRouteWriteFenced` when **any** key resolves to a route whose `RouteState == WriteFenced`. No dependency on `ObservedRouteVersion`, no early return on `0`; the gate runs on every applied write request unconditionally. The gate has the same `Phase_ABORT` bypass as the Composed-1 gate (for the same lock-release reason in §3.5 of the parent partial doc). `Phase_COMMIT` is **not** a blanket bypass: it is allowed only while the SplitJob is still in the post-fence drain window (`post_fence_drain_completed=false`) and only if every committed mutation resolves a matching pre-existing `!txn|lock|` row for the same `(primary_key, start_ts, commit_ts)` in the moving range. That narrow lane lets already-prepared txns resolve after the fence has stopped new prepares (closing codex round-18 P1's starvation issue) without letting a stale coordinator create new state. Once the post-fence drain returns empty, the lane closes and any later `Phase_COMMIT` for the moved range is rejected; after CUTOVER the target has no historical intent for that key, by construction.


Fence prefix deletes by range intersection

The proposed fence gate checks each mutation as a point lookup via routeKey(mut.Key), but Op_DEL_PREFIX is a range write: an empty prefix is broadcast to every group, and a non-empty prefix can overlap the moving right child even when the prefix key itself is outside it. During FENCE/DELTA_COPY such a request would bypass verifyRouteNotFenced, write source tombstones after the point where the source is supposed to be write-fenced, and those tombstones are outside the already-fixed copy window, so deleted data can reappear after CUTOVER. The fence path needs explicit DEL_PREFIX intersection handling.

Useful? React with 👍 / 👎.

bootjp · 2026-06-14T13:50:08Z

@codex review

対応しました (c0a678f): fresh Codex P1 review 4492933289.

P1: changed FENCE write rejection from point-only routeKey(mut.Key) checks to mutation write-footprints. Op_DEL_PREFIX is now treated as a range write: empty prefix intersects every non-txn key route, and non-empty prefix uses [prefix, prefixScanEnd(prefix)) normalized to route-key intervals before checking WriteFenced route intersection.
The source FSM remains authoritative: coordinator-side DEL_PREFIX broadcast may preflight, but every recipient re-runs the prefix-footprint gate before handleDelPrefix, so a stale coordinator cannot write source tombstones after FENCE.
Extended the migration write tracker / snapshot_min_admitted_ts design to include low-ts DEL_PREFIX range tombstones, not only raw point writes, one-phase txns, and prepares. Internal/family-wide prefixes that cannot prove disjoint are conservative: fence fails closed, and the tracker records the range tombstone for boundary lowering.
Added test plan coverage for empty-prefix DEL_PREFIX, prefix-start-outside/right-child-overlap cases, internal prefix normalization, and red controls for point-only routeKey(prefix) handling.

Validation:

git diff --check
GOCACHE=$(pwd)/.cache GOTMPDIR=$(pwd)/.cache/tmp go test ./... -run '^$'

Semantic/caller audit:

rg over DelPrefix, Op_DEL_PREFIX, dispatchDelPrefixBroadcast, extractDelPrefix, handleDelPrefix, DeletePrefixAtRaftAt, DeletePrefixAt, prefixScanEnd / PrefixScanEnd, GetIntersectingRoutes, verifyComposed1, verifyOwnerFromSnapshot, ObservedRouteVersion, and routeKey(.
Current code confirms DEL_PREFIX is non-transactional, broadcast to every shard group, handled before normal raw mutations in handleRawRequest, and ultimately writes tombstones via DeletePrefixAtRaftAt; the design now fences and tracks that range-shaped write before it reaches the prefix delete path.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c0a678f1bf

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-14T13:58:04Z

+
+| PR | Scope | Tests |
+|---|---|---|
+| M2-PR0 | Add `commit_ts` to the `txnLock` wire value on the Raft-applied lock-write path; decoder backward-compat for existing lock values without the field (zero → migrator fallback to direct commit-record lookup / pending drain) | `kv/txn_codec_test.go` round-trip + backward-compat case |


Carry commit_ts on PREPARE before tracking locks

PR0 only calls out adding commit_ts to the txnLock value, but in the current multi-shard path the PREPARE request has no actual commit timestamp (prewriteTxn builds txnMetaMutation(..., 0) and sends Request.Ts = startTS; the real commit_ts is sent later on COMMIT). If the implementation follows this touchpoint literally, handlePrepare still cannot populate the new lock field or append the admission-time tracker row with the transaction's commit_ts, so a low-ts prepare that lands and resolves between migrator polls can be absent from snapshot_min_admitted_ts and the DELTA_COPY floor can again exclude the committed version. Please include the coordinator/request/meta plumbing and a prepare-path test here, not just the lock codec round-trip.

Useful? React with 👍 / 👎.

gemini-code-assist Bot reviewed Jun 10, 2026

View reviewed changes

bootjp mentioned this pull request Jun 10, 2026

docs(design): propose hotspot split M3 — automation #951

Open

Merge branch 'main' into docs/hotspot-split-m2-proposal

3aca90c

bootjp marked this pull request as ready for review June 11, 2026 11:19

chatgpt-codex-connector Bot reviewed Jun 11, 2026

View reviewed changes

coderabbitai Bot reviewed Jun 11, 2026

View reviewed changes

Comment thread docs/design/2026_06_11_proposed_hotspot_split_milestone2_migration.md Outdated

chatgpt-codex-connector Bot reviewed Jun 11, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed Jun 14, 2026

View reviewed changes

docs: cover export bracket edge cases

8a1a301

chatgpt-codex-connector Bot reviewed Jun 14, 2026

View reviewed changes

docs: preserve MVCC headers in migration

701845a

chatgpt-codex-connector Bot reviewed Jun 14, 2026

View reviewed changes

docs: harden HLC floor and staged merge

41b8aba

chatgpt-codex-connector Bot reviewed Jun 14, 2026

View reviewed changes

docs: cover redis wide-column brackets

b799e94

chatgpt-codex-connector Bot reviewed Jun 14, 2026

View reviewed changes

docs: close migration fence drain gaps

a603f17

chatgpt-codex-connector Bot reviewed Jun 14, 2026

View reviewed changes

docs: cover internal scan route ownership

646fe7e

chatgpt-codex-connector Bot reviewed Jun 14, 2026

View reviewed changes

docs: persist cutover read fence intent

ad2f7bb

chatgpt-codex-connector Bot reviewed Jun 14, 2026

View reviewed changes

docs: track low-ts migration writes

cbdc0fc

chatgpt-codex-connector Bot reviewed Jun 14, 2026

View reviewed changes

docs: fence prefix deletes by range

c0a678f

chatgpt-codex-connector Bot reviewed Jun 14, 2026

View reviewed changes


		- `kv/fsm.go` already routes through `verifyComposed1` at apply. Extend `verifyOwnerFromSnapshot` to also reject when the resolved route's state is `WriteFenced` — even if a coordinator with a stale engine forwarded a write, the FSM closes the door. The error wraps `ErrRouteWriteFenced` so M4 retry logic differentiates from `ErrComposed1Violation`.

		Reads are not fenced: `ShardStore.GetAt` continues to serve from source until CUTOVER. Snapshot reads at any `commit_ts ≤ fence_ts` remain consistent because the source's MVCC history is unchanged through DELTA_COPY.


		FSM-side defense in depth:

		- `kv/fsm.go` already routes through `verifyComposed1` at apply. Extend `verifyOwnerFromSnapshot` to also reject when the resolved route's state is `WriteFenced` — even if a coordinator with a stale engine forwarded a write, the FSM closes the door. The error wraps `ErrRouteWriteFenced` so M4 retry logic differentiates from `ErrComposed1Violation`.

		// ExportVersions iterates committed versions in the half-open range
		// [start, end) whose commit_ts is in (minCommitTS, maxCommitTS],


		#### 7.2.2 Source FSM rejects reads after CUTOVER

		When a source-group FSM applies a lease read whose `read_route_version` is strictly less than the FSM's last-applied catalog version, and the requested key's `routeKey()` maps to a route the FSM no longer owns, the FSM returns `ErrRouteMoved{new_route_version, new_group_id}` instead of the value. The error is mapped to a retryable status (gRPC `codes.FailedPrecondition` + detail) and the coordinator retries after a `RouteEngine` refresh.


		No iteration over the migrated key range, no bulk rename, no per-key proposals. The Raft proposal carrying this apply is `O(1)` in the migrated data size — bounded by a handful of catalog descriptors and the SplitJob record.

		2. Staged keyspace stays visible through a per-route flag. After CUTOVER the target FSM's read path checks the route's `staged_visibility_active` bit. When set, a read for key `K` in the moved range first looks up `!dist\|migstage\|<job_id>\|<K>` and falls back to the live key only if the staged entry is absent. (Writes always land in the live keyspace — the CUTOVER bump already routed writers to the target group via the catalog, and §6.2.1's HLC ceiling advance guarantees their `commit_ts` is strictly greater than every imported `commit_ts`, so a live write shadowing a staged version is the correct MVCC visibility.) This gives reads access to the imported data the instant CUTOVER lands, with no promotion work blocking CUTOVER itself.


		The migrator runs the brackets in parallel up to `--migrationExportFanout` (default 4) within a single phase (BACKFILL or DELTA_COPY): each bracket gets its own opaque cursor (the §6.1.1 codec is extended to key on `(family, raw_key, commit_ts)`), so a bracket can be paused/resumed independently and the §9 resumability matrix applies per bracket. The migrator records per-bracket `cursor` and `done` flags on the SplitJob; the phase advances only when every bracket reports `done = true`. Cost is at most `fanout`× the gRPC frame budget but is bounded by the total bracket count (~12 today), not by data volume.

		A bracket whose family doesn't actually intersect the moving range (e.g. an SQS migration on a Redis-only deployment) returns its first chunk empty and `done=true` immediately — no wasted scan work. The `keys per bracket` distribution is a §7.3 / §11.x metric so operators can spot a runaway family scan.

		\| FENCE → DELTA_COPY Copy (snapshot_ts, fence_ts] from
		\| source to target shadow space.

		{Start: ddbItemPrefix(), End: ddbItemPrefixEnd(), Family: familyDdbItem},
		{Start: ddbGSIPrefix(), End: ddbGSIPrefixEnd(), Family: familyDdbGSI},

		3. CUTOVER precondition. Before the migrator issues the CUTOVER catalog write, it ensures the target group's HLC ceiling is at least `hlcPhysicalMs(max_imported_ts)` by issuing a final `SetPhysicalCeiling(hlcPhysicalMs(max_imported_ts))` + `Observe(max_imported_ts)` proposal on the target group (not the default group). The proposal is gated by `cap_migration_v2` (§11.1) and is the last target-side write before CUTOVER. If the target HLC is already above this value (e.g. due to per-batch advancement), the `SetPhysicalCeiling` call is the trivial no-op the ceiling already enforces; if a follower flap dropped one of the per-batch advances, this proposal closes the gap.
		4. Post-CUTOVER write-side invariant. A target-group `Next()` after CUTOVER is bounded below by `metaLastCommitTS` AND by the HLC ceiling (the existing `HLC.Next()` semantics in `kv/hlc.go`), so it is provably strictly greater than every `commit_ts` ever imported under this job. Reads at any post-CUTOVER `Next()` therefore see the imported version as historical (older than the read ts) and see the new write as the most-recent committed version — no resurrection, no time travel.


		M2 fixes this with a three-step gate on the FENCE → DELTA_COPY transition (step 0 added per claude round-12 P1 on PR #945 — prepared-txn drain — see §7.1's no-`Phase_COMMIT`-bypass rationale):

		0. Prepared-txn drain (resolves OQ-6). Before transitioning `BACKFILL → FENCE` (yes, before — the FENCE catalog write itself does not run until this drain returns), the migrator waits until no in-flight transaction intent locks remain in the moving range. "No intents remaining" is observed via the route-faithful lock drain defined in §3.2a.0a below — a `!txn\|lock\|` family scan filtered by `routeKey(lockedKey) ∈ [routeStart, routeEnd)`, returning empty. The wait is bounded by the txn timeout — a long-lived prepared txn either commits or is aborted by the lock resolver (`kv/lock_resolver.go`) within its TTL, so the scan converges in at worst one TTL window. While the drain is pending the migrator logs `last_error = "fence drain pending: <N> intent locks in moving range"`. Why this is required: §7.1's `verifyRouteNotFenced` rejects every write request in the WriteFenced range and has no `Phase_COMMIT` bypass because the matching intent lock is excluded from BACKFILL (§6.1) and absent on the target after CUTOVER — a `Phase_COMMIT` reaching the FSM after FENCE has nothing to commit against. Letting a prepared txn into the FENCE phase would therefore strand its commit indefinitely. The drain guarantees every commit that reaches the FSM during/after FENCE is for a txn whose intent has already resolved (committed via the pre-FENCE accept window, or aborted by the lock resolver). After CUTOVER the target has no historical intent for a moved-range key, so a coordinator that retries a Phase_COMMIT against the target would also fail; the drain is the only safe ordering.

		{Start: lstMetaPrefix(), End: lstMetaPrefixEnd(), Family: familyListMeta},
		{Start: lstItmPrefix(), End: lstItmPrefixEnd(), Family: familyListItem},

		- Branch 1 — source has applied CUTOVER (`source_local_version >= cutover_version`). The FSM calls `GetIntersectingRoutes(start, end)` against its own current catalog snapshot. If every intersecting route is owned by this group, the scan proceeds. If any intersecting route is owned by a different group, the FSM returns `ErrRouteMoved{new_route_version, new_group_id, sub_range_start, sub_range_end}` for the first foreign-owned sub-range encountered (lexicographically smallest).
		- Branch 2 — source has NOT applied CUTOVER but `max_cutover_version_seen > source_local_version` (equal-stale, watcher lag). The local snapshot still says `source.Active(full: [A, Z))` — a single route, no foreign owner visible — so Branch 1 alone would silently pass a cross-boundary scan. The gate therefore additionally calls `GetIntersectingRoutes([start, end), catalog_version = max_cutover_version_seen)` as a single deadline-bounded RPC to the default-group leader (mirroring exactly §7.2.2's point-key extension — same watermark trigger, same RPC target, same 1 s cache TTL with cache key `(start, end, max_cutover_version_seen)`). The default-group leader answers from its current catalog snapshot. If any returned route is owned by another group, the source FSM returns `ErrRouteMoved{max_cutover_version_seen, foreign_group_id, sub_range_start, sub_range_end}` for the first foreign-owned sub-range. If every returned route is still source-owned (the elevated watermark covers an unrelated migration that did not touch this scan range), the scan proceeds normally. If the RPC times out, the default-group leader is unreachable, or the route set cannot be verified at `max_cutover_version_seen`, the FSM returns `ErrRouteOwnershipUnknown{route_version=max_cutover_version_seen, sub_range_start=start, sub_range_end=end, retry_after_ms}` and must not scan source MVCC.


		1. Arm the migration write tracker, then choose the provisional snapshot for BACKFILL. Before opening BACKFILL, the migrator proposes `ArmMigrationWriteTracker(job_id, route_start, route_end)` to the source group and waits for it to apply; only then may it set `write_tracker_armed=true` on the default-group SplitJob and choose `snapshot_ts`. The tracker is source-group Raft state. From that apply onward:

		- `handleRawRequest` appends a tracker row for each mutation whose `routeKey(mut.Key) ∈ [routeStart, routeEnd)`, using the request's `commit_ts = r.Ts`, in the same apply that writes the MVCC version.

Conversation

bootjp commented Jun 10, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Key design points

Open questions for reviewers

Test plan

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Poem

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

bootjp commented Jun 11, 2026

Uh oh!

claude Bot commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review of commit 93de9b2 — addressing 6 prior findings

Verified: all 6 prior findings are addressed

New observations on the updated design

P1 — CUTOVER two-step ordering undefined (data-visibility gap)

Medium — Grace window inconsistency (§7.2.4 vs §4 vs Open Question 3)

Medium — "Wait until applied" in §7.2.2 under-specified

Minor — !dist| prefix on target-group import cursor (§6.2)

Summary

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

bootjp commented Jun 11, 2026

Uh oh!

claude Bot commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review — Round 2

bootjp commented Jun 10, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 10, 2026 •

edited

Loading

claude Bot commented Jun 11, 2026 •

edited

Loading

Review of commit `93de9b2` — addressing 6 prior findings

Minor — `!dist|` prefix on target-group import cursor (§6.2)

claude Bot commented Jun 11, 2026 •

edited

Loading

Minor — `!dist|` prefix on target-group import cursor (still unresolved)

claude Bot commented Jun 11, 2026 •

edited

Loading

Minor — `!dist|` prefix on target-group import cursor (unresolved, 3rd flag)