fix(persistence): multi-shard AOF gate + per-shard AOF foundation (Option B step 1) by pilotspacex-byte · Pull Request #129 · pilotspace/moon

pilotspacex-byte · 2026-05-26T15:23:06Z

Summary

Three commits on this branch:

e0bb658 — P0 gate (P0-FIX-01a/b). BGREWRITEAOF refusal + startup refusal for --shards >= 2 + --appendonly yes, with --unsafe-multishard-aof escape hatch. Empirical loss matrix below.
7b61898 — docs/changelog. v0.1.12 launch posture, 3-way Moon/Redis/Valkey comparison, alpha-leak qualifiers, missing v0.1.9/v0.1.10/v0.1.12 CHANGELOG entries.
3bb4790 — per-shard AOF manifest format (Option B step 1). First step of the 9-step RFC at tmp/rfc-per-shard-aof-v02.md that eventually lifts the gate in step 9.

Why the gate (commit 1)

Empirical re-verification on HEAD 6e49050 (2026-05-26) found the durability bug is in the multi-shard AOF path itself, not the rewrite path the older bug memory blamed:

Configuration	Recovered
`--shards 1 --appendonly yes --appendfsync always` (control)	5000 / 5000
`--shards 1 --disk-offload enable --appendonly yes` (control)	12714 / 12714
`--shards 2 --disk-offload enable --appendonly yes` (BGREWRITEAOF + SIGKILL)	7892 / 12662 (-38%)
`--shards 2 --disk-offload enable --appendonly yes` (plain SIGKILL)	7888 / 12655 (-38%)
`--shards 2 --disk-offload enable --appendonly yes --appendfsync always`	2474 / 5000 (-50%)
`--shards 2 --disk-offload disable --appendonly yes --appendfsync always`	2453 / 5000 (-50%)
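The loss figures in the right-hand column follow directly from the recovered and written key counts. A minimal sketch of that arithmetic, where `loss_percent` is an illustrative helper name and not something taken from the PR:

```rust
/// Percentage of written keys that were NOT recovered after the crash.
/// `recovered` and `written` are the two numbers in a "Recovered" cell.
fn loss_percent(recovered: u64, written: u64) -> f64 {
    assert!(written > 0 && recovered <= written);
    100.0 * (written - recovered) as f64 / written as f64
}
```

For the `7892 / 12662` row this gives roughly 37.7 %, and for the `2474 / 5000` row roughly 50.5 %, so the table's percentages are rounded.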

Root cause investigation in tmp/P0-INVEST-01-multishard-aof-rootcause.md identified two complementary bugs:

H2 (structural): src/main.rs:562-563 literally skips multi-part AOF replay for num_shards >= 2. Closed by step 4.
H1 (in-flight): try_send(AofMessage::Append) is fire-and-forget; +OK returns before the writer thread fsyncs; channel buffer is lost on SIGKILL. Closed by step 7.
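The H1 failure mode can be illustrated with a bounded channel from the standard library. This is a sketch under assumptions: only the names `AofMessage::Append` and `AppendSync { bytes, ack }` come from the PR; the field types, the helper functions, and the in-memory "disk" are hypothetical stand-ins for Moon's real writer.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender};

/// Hypothetical message shapes; the real enum in Moon may differ.
enum AofMessage {
    Append(Vec<u8>),
    AppendSync { bytes: Vec<u8>, ack: SyncSender<()> },
}

/// Fire-and-forget: returns as soon as the bytes are queued.
/// Anything still in the channel is lost if the process is killed.
fn append_fire_and_forget(tx: &SyncSender<AofMessage>, bytes: Vec<u8>) -> bool {
    tx.try_send(AofMessage::Append(bytes)).is_ok()
}

/// Rendezvous: blocks until the writer acknowledges the append.
fn append_sync(tx: &SyncSender<AofMessage>, bytes: Vec<u8>) -> bool {
    let (ack_tx, ack_rx) = sync_channel(1);
    if tx.send(AofMessage::AppendSync { bytes, ack: ack_tx }).is_err() {
        return false;
    }
    ack_rx.recv().is_ok()
}

/// Stand-in writer: "persists" into a Vec and acks each sync append.
fn writer_loop(rx: Receiver<AofMessage>) -> Vec<u8> {
    let mut disk = Vec::new();
    for msg in rx {
        match msg {
            AofMessage::Append(bytes) => disk.extend_from_slice(&bytes),
            AofMessage::AppendSync { bytes, ack } => {
                disk.extend_from_slice(&bytes);
                // A real writer would fsync here before acknowledging.
                let _ = ack.send(());
            }
        }
    }
    disk
}
```

In this sketch `append_fire_and_forget` reports success even when no writer thread has started yet, which is the window the PR describes; `append_sync` only returns once the writer has handled the message.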

Why step 1 lands here (commit 3)

The decision tree:

Build Option B (per-shard AOF, 16d + 1wk soak) vs make the gate permanent → user chose Option B.
First obstacle is that the current AofManifest has no place to describe per-shard segments. Step 1 introduces that structure additively.

Strictly additive at the file-system level:

v1 manifests continue to load as AofLayout::TopLevel single-shard (shard_id=0).
No in-place migration triggered. No behavior change for any existing deployment.
The --unsafe-multishard-aof gate from commit 1 remains the load-bearing safety net until step 9.

What step 1 adds

AofLayout { TopLevel, PerShard } discriminator.
ShardManifest { shard_id: u16, max_lsn: u64 } — max_lsn semantics deferred to step 3.
AofManifest.layout + AofManifest.shards: Vec<ShardManifest>.
initialize_multi(dir, num_shards) — v2 PerShard constructor.
shard_dir, shard_base_path[_seq], shard_incr_path[_seq] — per-shard path helpers.
verify_shard_count(expected) — returns the verbatim RFC § 3 error.
is_legacy_top_level_layout(dir) — pure detection (no side effects).
migrate_top_level_to_per_shard() — explicit in-place rename for RFC § 5 case 1; idempotent.
global_max_lsn() — computed accessor, not stored (avoids drift with the per-shard records).
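A rough sketch of how these pieces might fit together follows. Type and method names come from the list above; the `seq` field type, the `String` error type, and the `shard-{i}` directory naming (taken from the step 2a commit message) are assumptions, and only the quoted prefix of the RFC error string is reproduced because the rest is elided in the source.

```rust
use std::path::{Path, PathBuf};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum AofLayout {
    TopLevel,
    PerShard,
}

#[derive(Debug, Clone, PartialEq, Eq)]
struct ShardManifest {
    shard_id: u16,
    max_lsn: u64,
}

#[derive(Debug, Clone, PartialEq, Eq)]
struct AofManifest {
    seq: u64, // assumed type
    layout: AofLayout,
    shards: Vec<ShardManifest>,
}

impl AofManifest {
    /// Computed on demand so it cannot drift from the per-shard records.
    fn global_max_lsn(&self) -> u64 {
        self.shards.iter().map(|s| s.max_lsn).max().unwrap_or(0)
    }

    /// Returns the error prefix quoted in the PR; the tail is elided there
    /// and is deliberately not filled in here.
    fn verify_shard_count(&self, expected: usize) -> Result<(), String> {
        if self.shards.len() == expected {
            return Ok(());
        }
        Err(format!(
            "ERR shard count changed (manifest={}, config={})",
            self.shards.len(),
            expected
        ))
    }

    /// Assumed naming: one `shard-{i}` directory under the AOF directory.
    fn shard_dir(dir: &Path, shard_id: u16) -> PathBuf {
        dir.join(format!("shard-{shard_id}"))
    }
}
```

Keeping `global_max_lsn` as a fold over `shards` rather than a stored field mirrors the stated design choice: there is only one source of truth to update.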

Manifest text format

v1 (unchanged, full backcompat):

seq <N>
base moon.aof.<N>.base.rdb
incr moon.aof.<N>.incr.aof

v2 (new):

version 2
seq <N>
shards <K>
shard 0 max_lsn <lsn0>
shard 1 max_lsn <lsn1>
...

Paths are derived from shard_id + seq rather than stored explicitly — the layout is canonical, so a stored path could drift from the computed location.
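To make the v2 text format concrete, here is a minimal parser sketch. It is not the parser in `aof_manifest.rs`: `ParsedV2`, `parse_v2`, `field`, and the error wording are hypothetical, and the two rejection rules (count mismatch, non-contiguous shard ids) are inferred from the test names in the test plan.

```rust
#[derive(Debug, PartialEq, Eq)]
struct ParsedV2 {
    seq: u64,
    /// max_lsn per shard, indexed by shard id.
    max_lsns: Vec<u64>,
}

/// Parses a `<key> <number>` header line such as `seq 7` or `shards 2`.
fn field(line: Option<&str>, key: &str) -> Result<u64, String> {
    let line = line.ok_or_else(|| format!("missing `{key}` line"))?;
    let value = line
        .strip_prefix(key)
        .ok_or_else(|| format!("expected `{key} <N>`, got `{line}`"))?;
    value.trim().parse().map_err(|_| format!("bad number in `{line}`"))
}

fn parse_v2(text: &str) -> Result<ParsedV2, String> {
    let mut lines = text.lines().map(str::trim).filter(|l| !l.is_empty());
    if lines.next() != Some("version 2") {
        return Err("expected `version 2` header".to_string());
    }
    let seq = field(lines.next(), "seq")?;
    let count = field(lines.next(), "shards")? as usize;

    let mut max_lsns: Vec<u64> = Vec::new();
    for line in lines {
        let parts: Vec<&str> = line.split_whitespace().collect();
        match parts.as_slice() {
            ["shard", id, "max_lsn", lsn] => {
                let id: usize = id.parse().map_err(|_| format!("bad shard id in `{line}`"))?;
                // Shard ids must appear as 0, 1, 2, ... with no gaps.
                if id != max_lsns.len() {
                    return Err(format!("non-contiguous shard id {id}"));
                }
                let lsn: u64 = lsn.parse().map_err(|_| format!("bad max_lsn in `{line}`"))?;
                max_lsns.push(lsn);
            }
            _ => return Err(format!("unrecognized line `{line}`")),
        }
    }
    if max_lsns.len() != count {
        return Err(format!(
            "shard count mismatch: header={count}, entries={}",
            max_lsns.len()
        ));
    }
    Ok(ParsedV2 { seq, max_lsns })
}
```

Because no paths are stored, the parsed result carries only `seq` and the per-shard LSNs; file locations would be recomputed from `shard_id` and `seq` by the path helpers.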

Test plan

Unit test test_bgrewriteaof_sharded_refuses_under_unsafe_config covers gate-on + gate-off paths.
Live OrbStack boundary tests for the P0 gate:
- PASS --shards 1 + AOF starts cleanly
- PASS --shards 2 + AOF + --unsafe-multishard-aof starts
- PASS --shards 2 + --appendonly no starts
- REFUSED --shards 2 + AOF without escape hatch (exit code 2)
9 new step-1 unit tests (in aof_manifest.rs tests_v2 module):
- v1_manifest_loads_as_top_level_single_shard
- v2_manifest_round_trips
- verify_shard_count_emits_rfc_error_verbatim
- migrate_top_level_to_per_shard_moves_files_and_rewrites_manifest
- global_max_lsn_returns_max_across_shards
- is_legacy_top_level_layout_detects_v1_files
- is_legacy_top_level_layout_returns_false_for_v2
- parse_v2_rejects_shard_count_mismatch_in_file
- parse_v2_rejects_non_contiguous_shard_ids
21 existing persistence::aof tests still green; library cargo check --no-default-features --features runtime-tokio,jemalloc clean.
CI green (Rust 1.94, both feature sets, clippy, audit-unsafe, audit-unwrap).
Pending: CRASH-01-LITE 7-config matrix as #[ignore] tests (step 8 of the RFC).

Operator impact

Existing --shards >= 2 + --appendonly yes deployments fail to start after upgrade. Error message is actionable: pick --shards 1, --appendonly no, or --unsafe-multishard-aof. Runbook walks each option.
Single-shard deployments unaffected.
--appendonly no (any shard count) unaffected.
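The three outcomes above reduce to one predicate over the shard count, the `appendonly` setting, and the override flag. A sketch of that predicate as a pure function, assuming `appendonly` is compared as the string `"yes"` as in the startup check; `GateDecision` and `multishard_aof_gate` are illustrative names, not identifiers from the PR:

```rust
/// Outcome of the startup check; in the PR, a refusal exits with code 2.
#[derive(Debug, PartialEq, Eq)]
enum GateDecision {
    Start,
    Refuse,
}

/// Refuse only when all three hold: two or more shards, AOF enabled,
/// and the operator has not passed --unsafe-multishard-aof.
fn multishard_aof_gate(
    num_shards: usize,
    appendonly: &str,
    unsafe_multishard_aof: bool,
) -> GateDecision {
    if num_shards >= 2 && appendonly == "yes" && !unsafe_multishard_aof {
        GateDecision::Refuse
    } else {
        GateDecision::Start
    }
}
```

Factoring the condition out like this would let the real startup path and any preflight validation share a single definition, so the two cannot disagree.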

Next steps on this branch

Per the RFC implementation table (16 days + 1 wk soak):

Step	Effort	Blocked by
2 — Per-shard `AofWriter` task; `aof_tx: Vec<Sender>`	3d	1
3 — LSN tagging in `AofMessage::Append`	1d	2 + v0.2 S1.3
4 — Replace `Multi-part AOF skipped` skip branch (closes H2)	2d	1, 2, 3
5 — Cross-shard ordering merge (TXN + SCRIPT)	2d	4
6 — `moon migrate-aof` subcommand (RFC § 5 case 2)	2d	1
7 — `AppendSync { bytes, ack }` rendezvous (closes H1)	3d	2 (parallel with 3-6)
8 — CRASH-01-LITE matrix in `tests/crash_matrix.rs`	1d	4
9 — Lift `--unsafe-multishard-aof` gate	1h	8 green

Summary by CodeRabbit

New Features
- Server refuses unsafe multi‑shard + appendonly configurations unless explicitly acknowledged; BGREWRITEAOF is gated.
- Added per‑shard AOF mode with multi‑part manifest, per‑shard writers, replay/migration, and safer shutdown behavior.
Documentation
- Updated release notes, README (badges, production‑readiness matrix, benchmarks, TTL examples) and added a multi‑shard AOF runbook.
Tests
- Test harnesses set unsafe multi‑shard AOF off by default to reflect the new safety gate.

…OF (P0-FIX-01a/b)

Empirical re-verification on HEAD 6e49050 (2026-05-26) found that `--shards >= 2 + --appendonly yes` silently loses ~50 % of writes on SIGKILL, independent of `--appendfsync` and `--disk-offload`. The original 33-day-old bug memory had narrowed the loss to BGREWRITEAOF + disk-offload; the discriminator matrix below shows the bug is in the multi-shard AOF durability path itself.

| Configuration | Recovered |
|---|---|
| --shards 1 --appendonly yes --appendfsync always | 5000 / 5000 |
| --shards 1 --disk-offload enable --appendonly yes | 12714 / 12714 |
| --shards 2 --disk-offload enable --appendonly yes (BGREWRITEAOF + SIGKILL) | 7892 / 12662 |
| --shards 2 --disk-offload enable --appendonly yes (plain SIGKILL, no rewrite) | 7888 / 12655 |
| --shards 2 --disk-offload enable --appendonly yes --appendfsync always | 2474 / 5000 |
| --shards 2 --disk-offload disable --appendonly yes --appendfsync always | 2453 / 5000 |

Two complementary gates ship in this commit; both lift in v2.0 when multi-shard AOF replay walks every shard's segment manifest on recovery (see docs/runbooks/multi-shard-aof-rewrite.md):

P0-FIX-01a (defence-in-depth, command-level)
bgrewriteaof_start_sharded refuses with a clear ERR when the multi-shard + disk-offload + AOF combo is active. Gated by MULTI_SHARD_AOF_REWRITE_UNSAFE: AtomicBool, set once in main.rs. Unit test test_bgrewriteaof_sharded_refuses_under_unsafe_config covers gate-on + gate-off paths and asserts the gate does not flip AOF_REWRITE_IN_PROGRESS.

P0-FIX-01b (load-bearing, startup)
main.rs aborts with exit code 2 if `--shards >= 2 + --appendonly yes` without `--unsafe-multishard-aof`. The new flag is the explicit escape hatch for cache-only deployments where the loss window is acceptable.

Boundary tests verified live on OrbStack:

PASS --shards 1 + AOF starts cleanly (no false positives)
PASS --shards 2 + AOF + --unsafe-multishard-aof starts
PASS --shards 2 + --appendonly no starts (cache-only)
REFUSED --shards 2 + AOF without escape hatch

Files

src/command/persistence.rs + gate + unit test
src/main.rs + startup refusal + BGREWRITEAOF gate set
src/config.rs + --unsafe-multishard-aof flag
docs/runbooks/multi-shard-aof-rewrite.md + operator runbook

Reproducer scripts live in tmp/ (gitignored): p0-repro.sh, p0-no-rewrite.sh, p0-always.sh, p0-multishard-no-offload.sh, p0-shards1-exact.sh. Encoding them as #[ignore] crash-matrix tests is tracked as CRASH-01-LITE in the ship plan.

Multi-shard masters with AOF are now explicitly cache-only in v1.0. Root-cause investigation P0-INVEST-01 (1-2 wk) is the prerequisite to lifting the startup gate in v2.0.

author: Tin Dang

…lpha-leak qualifiers

README

* Bumps version badge v0.1.10 → v0.1.12 and replaces the "experimental" status with "single-node production-grade" plus a "cluster v0.2 alpha" badge, mirroring the new ship plan posture.
* Replaces the blanket experimental warning with a "production-grade architecture, pre-1.0 maturity" framing that points at the new Production readiness section for the honest GA matrix.
* Reconciles platform support: macOS is a supported development platform per the PRODUCTION-CONTRACT Tier table; production deployments target Linux.
* Adds a Valkey 9.1.0 column to the peak-throughput tables (honest "not yet benched" placeholders) and a new Moon vs Redis vs Valkey section: a three-way comparison table plus "when to choose" guidance, all traced to docs/comparison-valkey.md.
* Rewrites the trailing roadmap into a Production readiness section with what's GA today, what's not, operator gotchas, and a roadmap table.

Alpha-leak qualifiers added so v0.1.12 framing does not implicitly promise v0.2.0-alpha features:

* Quick-start HEXPIRE / HTTL lines annotated "(v0.2.0-alpha; build from main)".
* Hash-field TTL benchmark section retitled "v0.2.0-alpha preview" with a callout that the latest tag (v0.1.12) does not include it.
* "What's already in main" list split into v0.1.12 (latest tag, single-node production-grade) and v0.2.0-alpha additions (hash-field TTL, PITR, CDC, multi-node cluster soak).
* Comparison-table row for hash-field TTL qualified as "v0.2-alpha".

CHANGELOG

* Adds v0.1.12 entry covering Phase 189 (DashTable pre-sizing + --initial-keyspace-hint, PERF-07/09), Phase 190 (moon_memory_bytes Prometheus gauge with 7 subsystem kinds, MEMORY DOCTOR schema, resident_bytes trait), Phase 191 (jemalloc narenas:8 cap, --memory-arenas-cap, mimalloc-alt feature, OPERATOR-GUIDE Memory Accounting), Phase 177 dispatch observability, text-index default feature, SDK validate.{py,rs}, Python SDK graph parser fix, CI hygiene.
* Adds v0.1.10 entry (single-shard PSYNC2 wired end-to-end).
* Adds v0.1.9 Lunaris Retriever Gap Closure entry.
* Consolidates three orphan Unreleased blocks under v0.1.3.
* Sharpens v0.2.0-alpha entry with TL;DR headline capabilities (hash-field TTL stack, PITR, CDC, multi-node cluster soak).
* Fixes version ordering so v0.1.12 sits above v0.1.11.

No code changes; this is purely documentation framing aligned to the v1.0-rc1 single-node ship plan in tmp/SHIP-PLAN-v1.0-rc1-single-node.md.

author: Tin Dang


coderabbitai · 2026-05-26T15:23:23Z

Walkthrough

This PR enforces a startup/runtime safety gate for a known multi-shard AOF rewrite durability bug (CLI override added), adds v2 PerShard AOF manifest format and migration/replay, implements AofWriterPool and per-shard writers, migrates connections/handlers to the pool with per-entry LSNs, and updates tests, runbook, README, and CHANGELOG.

Changes

Multi-Shard AOF Safety Gates

Layer / File(s)	Summary
Release notes and operational runbook `CHANGELOG.md`, `docs/runbooks/multi-shard-aof-rewrite.md`	Reorganizes CHANGELOG into dated releases, adds `v0.2.0-alpha`/`v0.1.12` sections, and adds a runbook documenting startup and BGREWRITEAOF refusal gates, measured data-loss characteristics, recovery steps, telemetry signals, and planned removal.
Configuration flag and startup enforcement `src/config.rs`, `src/main.rs`	Adds `--unsafe-multishard-aof` to `ServerConfig` (default false) and implements an early startup refusal when `--shards >= 2` + `appendonly yes` without the override (exits code 2 and prints a data-loss warning).
Runtime BGREWRITEAOF command gating & tests `src/command/persistence.rs`	Introduces `MULTI_SHARD_AOF_REWRITE_UNSAFE` AtomicBool; maps AOF pool send errors to RESP frames; `bgrewriteaof_start_sharded` refuses when the gate is set; `bgrewriteaof_start`/sharded now dispatch via `AofWriterPool::try_send_rewrite`; adds a mutex-protected unit test validating the gate behavior.
AOF manifest multi-layout and migration `src/persistence/aof_manifest.rs`	Adds `AofLayout` (TopLevel/PerShard) and `ShardManifest`, extends `AofManifest` with layout/shards, implements layout-aware path helpers, v1/v2 parse & write, orphan cleanup, TopLevel→PerShard migration with rollback, `initialize_multi`, `replay_incr_framed` and exported `replay_per_shard`, plus extensive v2 tests.
AOF writer pool & per-shard writer task `src/persistence/aof.rs`	Introduces `AofWriterPool` (top-level/per-shard constructors, routing, `try_send_append`/`try_send_rewrite`, `broadcast_shutdown`) and `per_shard_aof_writer_task` (PerShard incremental AOF processing with fsync policies and cancellation); changes `AofMessage::Append` to include `lsn`; updates TopLevel writers to new shape and adds unit tests.
Startup wiring, manifest load, and shutdown `src/main.rs`, `src/server/listener.rs`, `src/server/embedded.rs`	Loads/validates `AofManifest` on appendonly=yes, constructs TopLevel or PerShard `AofWriterPool`, spawns per-shard writers, passes pool into `shard.run(...)`, and uses `aof_pool.broadcast_shutdown()` on shutdown; embedded/listener wrap sender into pool.
Connection context and spawn wiring `src/server/conn/core.rs`, `src/server/conn_state.rs`, `src/shard/conn_accept.rs`, `src/shard/event_loop.rs`	Adds optional `aof_pool: Option<Arc<AofWriterPool>>` to connection context/state, updates spawn helpers to accept `aof_pool` and pass pool clones into connections, and updates `Shard::run` signature to accept the pool.
Handler migration: monoio/sharded/single/blocking `src/server/conn/handler_monoio/`, `src/server/conn/handler_sharded/`, `src/server/conn/handler_single.rs`, `src/server/conn/blocking.rs`	Switches AOF write gating to `aof_pool.is_some()`, issues per-append LSNs via `AofWriterPool::issue_append_lsn`, routes appends to owning shard (including cross-shard pipeline responses), and rewires BGREWRITEAOF/SWAPDB/SUBSCRIBE/GRAPH WAL flows to use the pool API.
Tests and harness updates `src/server/conn/tests.rs`, `tests/*`	Updates inline-dispatch tests to use `Option<Arc<AofWriterPool>>`, constructs `AofWriterPool::top_level(...)` in AOF tests, and sets `unsafe_multishard_aof: false` across multiple test server bootstraps.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

pilotspace/moon#96: Overlaps AOF replay/manifest recovery code.
pilotspace/moon#71: Overlaps inline-dispatch AOF append logic.
pilotspace/moon#63: Prior sharded AOF rewrite/replay work this PR extends.

Suggested labels: enhancement

"🐰 I hop where shards and AOF collide,
A careful flag keeps unsafe rewrites aside.
Manifests split and writers find their lane,
Docs, tests, and pools to guide the terrain.
Hooray — safer starts, one careful stride."

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately summarizes the main change: introducing a safety gate for multi-shard AOF configurations and establishing the foundational manifest structures needed for per-shard AOF support. It is specific, clear, and reflects the primary objectives of commit 1 and 3.
Description check	✅ Passed	The description is comprehensive and detailed: it explains the three commits, provides empirical data justifying the gate, documents manifest format changes, includes a test plan with specific test cases, and outlines operator impact and next steps. It follows the general structure expected in a well-documented PR.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

README.md (1)

229-233: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Quick-start production flags now conflict with startup safety gate.

This command should fail under the new startup refusal (--shards >= 2 + --appendonly yes without override), so the README is currently instructing an invalid config.

Suggested README correction

 # Or with production flags
 ./target/release/moon \
   --port 6379 \
-  --shards 8 \
-  --appendonly yes --appendfsync everysec \
+  --shards 1 \
+  --appendonly yes --appendfsync everysec \
   --maxmemory 8g --maxmemory-policy allkeys-lfu
+
+# Multi-shard cache-only alternative
+# ./target/release/moon --shards 8 --appendonly no ...
+
+# Unsafe override (not recommended; known durability risk)
+# ./target/release/moon --shards 8 --appendonly yes --unsafe-multishard-aof ...

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@README.md` around lines 229 - 233, The README's quick-start example uses
conflicting flags (--shards 8 together with --appendonly yes) which will trigger
the new startup safety gate and refuse to start; update the example command
under the block that contains the flags (--port, --shards, --appendonly,
--appendfsync, --maxmemory, --maxmemory-policy) to a valid configuration (e.g.,
set --shards 1 or remove/disable --appendonly) or explicitly show the required
override flag and text that allows bypassing the safety gate (add a clear
placeholder like --<startup-override> if an override exists) so the documented
command actually starts successfully.


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c11a2da9-b702-43f0-91ac-59786ae9a841

📥 Commits

Reviewing files that changed from the base of the PR and between 6e49050 and 7b61898.

📒 Files selected for processing (6)

CHANGELOG.md
README.md
docs/runbooks/multi-shard-aof-rewrite.md
src/command/persistence.rs
src/config.rs
src/main.rs

coderabbitai · 2026-05-26T15:27:54Z

+```
+REFUSING TO START: --shards 2 + --appendonly yes has a known data-loss
+bug on SIGKILL (~50 % loss verified 2026-05-26). Fix: use --shards 1,
+or pass --appendonly no for cache-only deployments, or pass
+--unsafe-multishard-aof to acknowledge the risk and start anyway. See
+docs/runbooks/multi-shard-aof-rewrite.md.
+```


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add fenced code languages to satisfy markdownlint MD040.

These three fenced blocks are missing language identifiers and will keep markdownlint warnings active.

Suggested doc-only fix

-```
+```text
 REFUSING TO START: --shards 2 + --appendonly yes has a known data-loss
 ...
-```
+```
-```
+```redis
 > BGREWRITEAOF
 (error) ERR BGREWRITEAOF is unsafe with --shards >= 2 + --disk-offload enable
 ...
-```
+```
-```
+```text
 BGREWRITEAOF gated for this config (known data-loss path; see
 docs/runbooks/multi-shard-aof-rewrite.md). Use --shards 1 or
 --disk-offload disable to re-enable rewrite.
-```
+```

Also applies to: 20-26, 88-92

🧰 Tools

🪛 markdownlint-cli2 (0.22.1)

[warning] 10-10: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/runbooks/multi-shard-aof-rewrite.md` around lines 10 - 16, Three fenced code blocks in docs/runbooks/multi-shard-aof-rewrite.md are missing language identifiers (markdownlint MD040). Edit the three blocks shown (the startup refusal block starting "REFUSING TO START: --shards 2 + --appendonly yes...", the BGREWRITEAOF interaction block containing "BGREWRITEAOF" and "(error) ERR BGREWRITEAOF...", and the final explanatory block starting "BGREWRITEAOF gated for this config...") and add the language tags: use ```text for the two plain-text blocks and ```redis for the BGREWRITEAOF example so markdownlint MD040 is satisfied.

coderabbitai · 2026-05-26T15:27:54Z

+    // P0-FIX-01b: refuse to start under the known durability bug
+    // (`shards >= 2 + appendonly yes` loses ~50 % of writes on SIGKILL,
+    //  verified 2026-05-26 on HEAD `6e49050`; reproducer in
+    //  `tmp/p0-no-rewrite.sh` and `tmp/p0-always.sh`).  The bug is
+    // independent of `--appendfsync` and `--disk-offload` settings.  An
+    // operator can override via `--unsafe-multishard-aof` if the
+    // deployment is cache-only and the loss window is acceptable.
+    if num_shards >= 2 && config.appendonly == "yes" && !config.unsafe_multishard_aof {
+        eprintln!(
+            "REFUSING TO START: --shards {num_shards} + --appendonly yes has a known data-loss \
+             bug on SIGKILL (~50 % loss verified 2026-05-26). Fix: use --shards 1, or pass \
+             --appendonly no for cache-only deployments, or pass --unsafe-multishard-aof to \
+             acknowledge the risk and start anyway. See \
+             docs/runbooks/multi-shard-aof-rewrite.md."
+        );
+        std::process::exit(2);
+    }


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Mirror this refusal in --check-config validation.

Line 143 returns from --check-config before Line 280 runs, so preflight can pass a config that real startup immediately refuses. Please enforce the same gate in the check_config branch.

Suggested patch

@@ if config.check_config {
+    if config.shards >= 2 && config.appendonly == "yes" && !config.unsafe_multishard_aof {
+        return Err(anyhow::anyhow!(
+            "--shards {} + --appendonly yes is refused unless --unsafe-multishard-aof is set (or use --shards 1 / --appendonly no)",
+            config.shards
+        ));
+    }
     // Validate shard count is reasonable
     if config.shards == 0 {
         return Err(anyhow::anyhow!("--shards must be >= 1"));
     }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/main.rs` around lines 273 - 289, The --check-config path currently returns before the multishard-AOF safety gate runs, so add the same refusal logic used at startup into the check_config branch: detect the condition (num_shards >= 2 && config.appendonly == "yes" && !config.unsafe_multishard_aof) inside the check_config handling and print the identical error message and exit non‑zero (or return an error) so preflight fails the same way real startup would; use the same symbols/strings (num_shards, config.appendonly, config.unsafe_multishard_aof) and the same message text used near the startup gate to keep behavior consistent.

First implementation step of the per-shard AOF RFC (Option B in tmp/rfc-per-shard-aof-v02.md). Closes Hypothesis 2 of the P0-INVEST-01 root cause: multi-part AOF replay is currently skipped for num_shards >= 2 because there is no manifest structure that can describe per-shard segments. This commit lays the foundation by introducing a manifest v2 format that carries per-shard metadata; the writer, replay, and lift-the-gate work follows in steps 2-9.

The change is purely additive at the file-system level: v1 manifests continue to load as TopLevel single-shard with shard_id=0, no in-place migration is triggered, and no behavior is altered for any existing deployment. The escape-hatch gate (--unsafe-multishard-aof) from commit ce05fa9 remains the load-bearing safety net until step 9 lands.

New types

AofLayout { TopLevel, PerShard }
Discriminates v1 top-level layout from v2 per-shard layout. A directory holds one layout exclusively, never a mix.

ShardManifest { shard_id: u16, max_lsn: u64 }
Per-shard entry. The max_lsn semantics are deliberately deferred to step 3 (LSN tagging); until then it is always 0 and recovery does not consult it. This avoids locking in an LSN namespace contract before v0.2 S1.3 (REPLCONF ACK / WAIT) lands and clarifies what LSN MEANS in the multi-shard AOF context.

AofManifest extensions

+ layout: AofLayout
+ shards: Vec<ShardManifest> // length == num_shards
+ initialize_multi(dir, num_shards): v2 PerShard constructor
+ shard_dir / shard_base_path / shard_incr_path (+ _seq variants)
+ global_max_lsn(): computed accessor, not stored (per advisor's note: a stored mirror invites drift with the per-shard records)
+ verify_shard_count(expected): returns the exact RFC § 3 verbatim error string ("ERR shard count changed (manifest=N, config=M)…") so operator-facing wording is uniform across boot, BGREWRITEAOF, and the migration tool.
+ is_legacy_top_level_layout(dir): pure detection helper for callers that want to decide whether to migrate. NOT called from load(); side effects belong in explicit migrate_* methods.
+ migrate_top_level_to_per_shard(): in-place rename for RFC § 5 case 1 (single-shard v0.1.x → v2 single-shard). Idempotent. Case 2 (legacy multi-shard with the gate engaged) ships in step 6 as the `moon migrate-aof` subcommand.

Manifest text format

v1 (unchanged, preserves backcompat):

seq <N>
base moon.aof.<N>.base.rdb
incr moon.aof.<N>.incr.aof

v2 (new):

version 2
seq <N>
shards <K>
shard 0 max_lsn <lsn0>
shard 1 max_lsn <lsn1>
...

Paths are derived from shard_id + seq rather than stored explicitly. The layout is canonical, so a stored path could drift from the computed location and silently shadow real files on disk.

Tests (9 new, in src/persistence/aof_manifest.rs tests_v2 module)

PASS v1_manifest_loads_as_top_level_single_shard
PASS v2_manifest_round_trips
PASS verify_shard_count_emits_rfc_error_verbatim
PASS migrate_top_level_to_per_shard_moves_files_and_rewrites_manifest
PASS global_max_lsn_returns_max_across_shards
PASS is_legacy_top_level_layout_detects_v1_files
PASS is_legacy_top_level_layout_returns_false_for_v2
PASS parse_v2_rejects_shard_count_mismatch_in_file
PASS parse_v2_rejects_non_contiguous_shard_ids

All 21 existing persistence::aof tests remain green. cargo check (runtime-tokio,jemalloc) clean.

What this does NOT do (in scope for later steps)

Step 2: per-shard AofWriter task; aof_tx becomes Vec<Sender>
Step 3: LSN tagging in AofMessage::Append (after v0.2 S1.3)
Step 4: Replace `Multi-part AOF skipped` skip branch (closes H2)
Step 5: Cross-shard ordering merge (TXN + SCRIPT)
Step 6: `moon migrate-aof` subcommand for case 2 migration
Step 7: AppendSync rendezvous for appendfsync=always (closes H1)
Step 8: CRASH-01-LITE matrix in tests/crash_matrix.rs
Step 9: Lift --unsafe-multishard-aof gate

Refs

tmp/rfc-per-shard-aof-v02.md (RFC)
tmp/P0-INVEST-01-multishard-aof-rootcause.md (root cause)
PR #129 (P0 escape-hatch gate this work lifts)

author: Tin Dang

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/persistence/aof_manifest.rs`:
- Around line 669-775: migrate_top_level_to_per_shard and initialize_multi are
making AofLayout::PerShard visible before the rest of the I/O path
(replay_multi_part, manifest.base_path/incr_path, manifest.incr_path(),
manifest.advance()) understands per-shard locations; this causes subsequent
boots/writes to look in the wrong place. Fix by deferring setting self.layout =
AofLayout::PerShard (and any manifest persisted as PerShard via
write_manifest()) until after you have created/moved the per-shard files and
ensured callers will open the new paths: in migrate_top_level_to_per_shard move
the layout assignment to after the rename/create operations and only call
write_manifest() once layout is set; in initialize_multi avoid persisting a
PerShard manifest or exposing PerShard paths until all shard dirs/files are
created (set layout and call write_manifest() last). Alternatively, make
replay_multi_part, base_path(), incr_path(), manifest.incr_path(), and
manifest.advance() layout-aware so they resolve per-shard paths immediately;
pick one approach and apply it consistently.
- Around line 688-717: After renaming old_base→new_base and optionally
old_incr→new_incr, add a rollback guard so any subsequent error (including
write_manifest() failing) moves the files back and restores self.layout to
AofLayout::TopLevel; implement this by tracking that the base (and possibly
incr) have been moved and on any error attempt std::fs::rename(new_base,
old_base) and, if incr was moved, std::fs::rename(new_incr, old_incr) (or remove
created new_incr if it was created), then set self.layout = AofLayout::TopLevel
before returning the error. Ensure the guard runs for failures after the first
rename but not if everything succeeds (write_manifest() completes), and
reference the existing symbols new_base, new_incr, old_base, old_incr,
self.layout, and write_manifest() to locate where to add the rollback.


ℹ️ Review info


📥 Commits

Reviewing files that changed from the base of the PR and between 7b61898 and 3bb4790.

📒 Files selected for processing (1)

src/persistence/aof_manifest.rs

Second implementation step of the per-shard AOF RFC (Option B in tmp/rfc-per-shard-aof-v02.md). Step 2 is split into six sub-steps (2a-2f) to keep the blast radius reviewable; this commit ships 2a.

2a is purely additive: a new public type and tests, zero call-site changes. The pool's API mirrors the patterns the call sites already use (try_send append, broadcast Shutdown), so steps 2c-2f reduce to a mechanical type-plumbing pass.

New type

  AofWriterPool {
      senders: Vec<MpscSender<AofMessage>>,
      layout: AofLayout,
  }

Constructors:

  top_level(sender) -> Arc<Self>
      One sender; every shard multiplexes onto it. Used for legacy v1 deployments and `--shards 1` v2 deployments.

  per_shard(senders) -> Arc<Self>
      One sender per shard. senders[i] MUST feed the writer task that owns appendonlydir/shard-{i}/. debug_assert rejects a length-1 vector (use top_level instead).

Dispatch:

  sender(shard_id) -> &MpscSender<AofMessage>
      TopLevel: ignores shard_id, returns senders[0].
      PerShard: returns senders[shard_id]. debug_assert on out-of-range.

  try_send_append(shard_id, bytes)
      Convenience for the `let _ = tx.try_send(AofMessage::Append(bytes))` pattern at 12 call sites today. Fire-and-forget, matches current hot-path semantics (H1 fix is step 7's AppendSync rendezvous).

  try_send_rewrite(msg) -> Result<(), AofPoolSendError>
      Only legal for TopLevel pools; PerShard rejects with AofPoolSendError::RewriteUnsupportedInPerShard. BGREWRITEAOF in the per-shard layout becomes a per-shard operation in step 6; the legacy single-writer rewrite enum variant has no meaning once the writer is one-per-shard.

  broadcast_shutdown()
      Sends Shutdown to every writer. Used by orchestrated shutdown in main.rs / embedded.rs (wired in step 2f).

New error type

  AofPoolSendError {
      RewriteUnsupportedInPerShard,
      SendFailed,
  }

Tests (5 new, in src/persistence/aof.rs pool_tests module)

  PASS top_level_pool_routes_all_shards_to_writer_zero
  PASS per_shard_pool_routes_each_shard_to_its_own_writer
  PASS per_shard_pool_rejects_rewrite_with_explicit_error
  PASS top_level_pool_accepts_rewrite
  PASS broadcast_shutdown_reaches_every_writer

All 21 existing persistence::aof tests + 9 manifest tests from step 1 remain green (26 total in persistence::aof, counting the 5 new pool tests). cargo check + clippy (runtime-tokio,jemalloc) clean.

What this does NOT do (in scope for later sub-steps)

  Step 2b - per-shard writer task body (reads from manifest.shard_incr_path(shard_id) for PerShard, manifest.incr_path() for TopLevel)
  Step 2c - type plumbing: aof_tx: Option<MpscSender> → aof_pool: Option<Arc<AofWriterPool>> in conn_state.rs and conn/core.rs
  Step 2d - handler_monoio call sites use ctx.aof_pool.sender(ctx.shard_id)
  Step 2e - handler_sharded call sites (same pattern)
  Step 2f - spawn sites (main.rs, listener.rs, embedded.rs) build the pool via top_level() or per_shard() based on layout

Refs

  tmp/rfc-per-shard-aof-v02.md (RFC § 4, writer architecture)
  tmp/P0-INVEST-01-multishard-aof-rootcause.md (H1/H2 root cause)
  PR #129 (P0 escape-hatch gate this work lifts in step 9)
  Commit 3bb4790 (step 1, manifest v2 format)

author: Tin Dang
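The dispatch rules above can be sketched with the standard library alone. This is a hedged illustration, not the type in src/persistence/aof.rs: it swaps the project's `MpscSender` for `std::sync::mpsc::SyncSender`, reduces `AofMessage` to three payload-free or byte-only variants, and adds a hypothetical `channels` helper so the example is self-contained.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender};
use std::sync::Arc;

#[derive(Debug, Clone, Copy, PartialEq)]
enum AofLayout {
    TopLevel,
    PerShard,
}

/// Simplified stand-in for the real message enum (assumption).
#[derive(Debug, PartialEq)]
enum AofMessage {
    Append(Vec<u8>),
    Rewrite,
    Shutdown,
}

#[derive(Debug, PartialEq)]
enum AofPoolSendError {
    RewriteUnsupportedInPerShard,
    SendFailed,
}

struct AofWriterPool {
    senders: Vec<SyncSender<AofMessage>>,
    layout: AofLayout,
}

impl AofWriterPool {
    /// One writer; every shard multiplexes onto it.
    fn top_level(sender: SyncSender<AofMessage>) -> Arc<Self> {
        Arc::new(Self { senders: vec![sender], layout: AofLayout::TopLevel })
    }

    /// One writer per shard; a single-element vector should use `top_level`.
    fn per_shard(senders: Vec<SyncSender<AofMessage>>) -> Arc<Self> {
        debug_assert!(senders.len() > 1, "use top_level for a single writer");
        Arc::new(Self { senders, layout: AofLayout::PerShard })
    }

    fn sender(&self, shard_id: usize) -> &SyncSender<AofMessage> {
        match self.layout {
            AofLayout::TopLevel => &self.senders[0],
            AofLayout::PerShard => {
                debug_assert!(shard_id < self.senders.len(), "shard_id out of range");
                &self.senders[shard_id]
            }
        }
    }

    /// Fire-and-forget: a full or closed channel drops the record silently.
    fn try_send_append(&self, shard_id: usize, bytes: Vec<u8>) {
        let _ = self.sender(shard_id).try_send(AofMessage::Append(bytes));
    }

    /// Rewrite is only meaningful when a single writer owns the whole AOF.
    fn try_send_rewrite(&self, msg: AofMessage) -> Result<(), AofPoolSendError> {
        if self.layout == AofLayout::PerShard {
            return Err(AofPoolSendError::RewriteUnsupportedInPerShard);
        }
        self.senders[0].try_send(msg).map_err(|_| AofPoolSendError::SendFailed)
    }

    fn broadcast_shutdown(&self) {
        for tx in &self.senders {
            let _ = tx.try_send(AofMessage::Shutdown);
        }
    }
}

/// Hypothetical helper: builds `n` bounded channels for the example.
fn channels(n: usize, cap: usize) -> (Vec<SyncSender<AofMessage>>, Vec<Receiver<AofMessage>>) {
    (0..n).map(|_| sync_channel(cap)).unzip()
}
```

Keeping `shard_id` in the `try_send_append` signature even for TopLevel pools is what lets call sites migrate before any per-shard writer exists: the argument is ignored today and becomes load-bearing once the layout flips.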

Third implementation step of the per-shard AOF RFC (Option B in tmp/rfc-per-shard-aof-v02.md). Adds the per-shard writer task body as an additive function alongside the existing `aof_writer_task`. Zero call sites changed in this commit; wiring lands in step 2f.

New function

  per_shard_aof_writer_task(rx, base_dir, shard_id, fsync, cancel)

One instance is spawned per shard in PerShard layout. Each instance owns appendonlydir/shard-{shard_id}/moon.aof.{seq}.incr.aof exclusively, so there is no per-file locking. Mirrors the production monoio path of the existing aof_writer_task (60s bounded wait for manifest, hard fail on corrupt manifest, per-fsync-policy cadence).

Differences from aof_writer_task (TopLevel):

- Opens manifest.shard_incr_path(shard_id) instead of manifest.incr_path(). Defensive `create_dir_all` of the parent `shard-{N}/` directory in case a manual deletion or older binary left it missing.
- Rejects Rewrite/RewriteSharded variants with a `warn!` and drops the message. The legacy single-writer rewrite enum has no meaning when each shard owns its own files; per-shard BGREWRITEAOF will be a separate per-shard operation in a later step.
- Refuses to start if the loaded manifest's layout is TopLevel: the spawn site (step 2f) must only invoke this body for PerShard layouts. Layout mismatch is a programmer error and logs at error level before exiting.
- Refuses to start if shard_id is out of range for the manifest's `shards.len()` (defensive against config drift between manifest write and writer spawn).
- Every log line includes `shard {shard_id}` so operators can map log lines to filesystem state without ambiguity.

Both runtimes (runtime-tokio async I/O via tokio::fs + BufWriter + tokio::select!, runtime-monoio sync I/O via std::fs in a blocking recv loop) are covered with feature-gated blocks. The shape mirrors aof_writer_task closely so future fixes to fsync handling or shutdown flush can be applied uniformly to both functions.

What this does NOT do (in scope for later sub-steps)

  Step 2c - type plumbing: aof_tx: Option<MpscSender> → aof_pool: Option<Arc<AofWriterPool>> in conn_state.rs and conn/core.rs
  Step 2d - handler_monoio call sites use ctx.aof_pool.sender(ctx.shard_id)
  Step 2e - handler_sharded / handler_single / blocking call sites
  Step 2f - spawn sites (main.rs, listener.rs, embedded.rs) build the pool via top_level()/per_shard() and spawn N per_shard_aof_writer_task instances for PerShard layouts

Tests

No new tests in this commit. The function body mirrors the message loop in aof_writer_task line-for-line (with the per-shard differences above), which already has 21 unit tests covering Append, Rewrite, and Shutdown handling. An end-to-end integration test that spawns N writers, drives appends through them, kills the process, and verifies per-shard files reload cleanly lands as an #[ignore]-by-default test in tests/ alongside step 2f.

Verification

cargo check + cargo clippy clean on both feature combinations:
  --no-default-features --features runtime-tokio,jemalloc
  (defaults: runtime-monoio,jemalloc,graph,text-index)

All 21 existing persistence::aof tests + 5 pool tests from step 2a + 9 manifest tests from step 1 remain green (35 in persistence).

Refs

  tmp/rfc-per-shard-aof-v02.md (RFC § 4, writer architecture)
  tmp/P0-INVEST-01-multishard-aof-rootcause.md (H1/H2 root cause)
  Commit 3bb4790 (step 1, manifest v2 format)
  Commit 5a546ff (step 2a, AofWriterPool type)

author: Tin Dang

…ack)

Two reviewer-flagged bugs in the step 1 manifest work (commit 3bb4790):

  1. base_path/incr_path/base_path_seq/incr_path_seq were NOT layout-aware
  2. migrate_top_level_to_per_shard flipped self.layout = PerShard BEFORE any I/O succeeded and had no rollback for failures after the first rename

Both verified against current code before fixing. A third reviewer suggestion (initialize_multi) was reviewed and skipped; see "Note on initialize_multi" below.

Bug 1: Layout-aware path helpers (replay/advance routed to wrong dir)
----------------------------------------------------------------------

Before: base_path(), incr_path(), base_path_seq(), incr_path_seq() unconditionally computed TopLevel paths (`appendonlydir/moon.aof.*`). After migrate_top_level_to_per_shard flips layout to PerShard, replay_multi_part (aof_manifest.rs:871, 895, 916) and advance() (lines 796, 821, 836-837) still asked these helpers for paths and got TopLevel locations, while the actual files now lived under shard-0/.

Symptom: post-migration boot fails recovery with "AOF base RDB missing"; BGREWRITEAOF after migration writes new files to TopLevel locations the per-shard writer never reads.

Fix: route PerShard layout through the existing shard_*_path_seq helpers, with debug_assert that shards.len() == 1 (these single-file helpers are by definition meaningful only for single-shard layouts; multi-shard PerShard callers MUST use shard_*_path[_seq] explicitly). Release builds fall back to shard-0 paths rather than panicking so production stays recoverable on a stale call site. No callers need to change: same signatures, layout-correct results.

Bug 2: Migrate rollback on partial failure
-------------------------------------------

Before the fix, migrate_top_level_to_per_shard did:

  1. self.layout = PerShard          (line 689; in-memory flip)
  2. create_dir_all(new_dir)         (line 691; may fail)
  3. rename(old_base → new_base)     (line 708; may fail)
  4. rename or create incr           (lines 709-714; may fail)
  5. write_manifest()                (line 717; may fail)

Only step 2's `!old_base.exists()` branch (lines 698-707) reset the layout flag on error. Any failure at steps 4 or 5 left the base file moved with no rollback AND left self.layout out of sync with the on-disk manifest (which still claimed v1 if write_manifest had not yet run, or claimed v2 with the wrong file locations if it had).

Fix: defer the layout flip until everything on disk is in the new shape; explicit per-step rollback on every failure path:

  - rename(old_base) failure: nothing moved, plain ? return
  - rename(old_incr) or create(new_incr) failure: rename base back, return original error (rollback errors logged but do not mask the cause)
  - write_manifest() failure: revert layout flag, remove created incr or rename incr back, rename base back

After this fix the migration is atomic from the loader's perspective: either everything is in shard-0/ AND the v2 manifest is on disk, or everything is at the top level AND the v1 manifest is on disk. No intermediate state survives a crash mid-migration.

Note on initialize_multi
------------------------

The reviewer also flagged initialize_multi (lines 733-776) for the same "layout flipped before I/O" pattern. Verified: does NOT apply. initialize_multi constructs the struct with `layout: PerShard` in local scope only (no manifest on disk yet), creates all dirs/files via the shard_* helpers (which don't depend on self.layout), and calls write_manifest() LAST. Any failure aborts before any caller observes the half-built state. Orphan shard-{N}/ dirs left on disk on failure are harmless (next boot's load() returns Ok(None) and recovery treats as fresh init). Skipped; no change needed.

Tests (3 new)

  base_incr_paths_route_to_shard_zero_after_migration
      Pre-migration: base_path() and incr_path() return TopLevel paths. Post-migration: they route to shard-0/ AND the file exists there.

  migrate_rolls_back_filesystem_when_incr_rename_fails
      Pre-creates shard-0/moon.aof.1.incr.aof as a DIRECTORY (rename onto a non-empty dir fails on every supported OS), forcing the rename after-base-already-moved path. Verifies: layout reverts to TopLevel, base file restored, base contents intact, on-disk manifest still v1.

  migrate_does_not_mutate_on_missing_base
      Pre-flight check path: layout never flips, no rollback needed, NotFound error surfaced.

Verification

379 persistence tests pass on both feature combinations:
  --no-default-features --features runtime-tokio,jemalloc
  (defaults: runtime-monoio,jemalloc,graph,text-index)

cargo clippy clean on both. cargo check clean on both.

Refs

  Reviewer comments on aof_manifest.rs:669-775 and :688-717
  Commit 3bb4790 (step 1 introduced the bugs)
  tmp/rfc-per-shard-aof-v02.md (RFC § 5 case 1 migration)

author: Tin Dang
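The rollback ordering described in this commit can be sketched as a free function over `std::fs`. This is a simplified illustration under stated assumptions, not the actual `migrate_top_level_to_per_shard`: the function name `migrate_to_shard0` is hypothetical, the manifest write is abstracted into a caller-supplied closure, and the in-memory layout flag is left to the caller, who would flip it only after `Ok(())` is returned.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical sketch: move top-level AOF files into shard-0/ and undo
/// every completed step if a later one fails.
fn migrate_to_shard0(
    dir: &Path,
    seq: u64,
    write_manifest: impl FnOnce() -> io::Result<()>,
) -> io::Result<()> {
    let base_name = format!("moon.aof.{seq}.base.rdb");
    let incr_name = format!("moon.aof.{seq}.incr.aof");
    let (old_base, old_incr) = (dir.join(&base_name), dir.join(&incr_name));
    let new_dir = dir.join("shard-0");
    let (new_base, new_incr) = (new_dir.join(&base_name), new_dir.join(&incr_name));

    // Pre-flight: nothing has been touched yet, so no rollback is needed.
    if !old_base.exists() {
        return Err(io::Error::new(io::ErrorKind::NotFound, "AOF base missing"));
    }
    fs::create_dir_all(&new_dir)?;

    // Step 1: move the base. On failure nothing has moved; plain `?` is safe.
    fs::rename(&old_base, &new_base)?;

    // Step 2: move the incr file, or create an empty one if none existed.
    let incr_existed = old_incr.exists();
    let incr_result = if incr_existed {
        fs::rename(&old_incr, &new_incr)
    } else {
        fs::File::create(&new_incr).map(|_| ())
    };
    if let Err(e) = incr_result {
        // Roll the base back; a rollback failure must not mask the cause.
        let _ = fs::rename(&new_base, &old_base);
        return Err(e);
    }

    // Step 3: persist the new manifest last; undo both moves if it fails.
    if let Err(e) = write_manifest() {
        if incr_existed {
            let _ = fs::rename(&new_incr, &old_incr);
        } else {
            let _ = fs::remove_file(&new_incr);
        }
        let _ = fs::rename(&new_base, &old_base);
        return Err(e);
    }
    Ok(())
}
```

The sketch ignores rollback errors where the commit says they are logged, and it does not remove the `shard-0/` directory on failure; an empty directory is harmless for the same reason the orphan directories in `initialize_multi` are.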

Fourth implementation step of the per-shard AOF RFC (Option B in tmp/rfc-per-shard-aof-v02.md). Adds `aof_pool: Option<Arc<AofWriterPool>>` to ConnectionContext as a **compat alias** alongside the existing `aof_tx: Option<MpscSender<AofMessage>>`. Zero call-site behavior change.

Why compat alias (and not a single big-bang refactor)
-----------------------------------------------------

The aof_tx → aof_pool transition touches 16 call sites across 10 files (handler_monoio, handler_sharded, handler_single, blocking, command/persistence, shard/conn_accept, shard/event_loop, main, listener, embedded), AND one of those sites carries a load-bearing correctness fix for cross-shard routing (handler_sharded/mod.rs:1651: owner shard must be `target`, not `ctx.shard_id`, otherwise per-shard AOF writes land in the wrong file).

Splitting plumbing from call-site migration:

- 2c (this commit) adds the field; ConnectionContext::new takes both aof_tx and aof_pool; spawn sites build the pool via AofWriterPool::top_level(tx). All four ConnectionContext::new call sites in shard/conn_accept.rs updated. No behavior change: the pool just wraps the same single sender.
- 2d migrates handler_monoio + handler_monoio/dispatch + handler_single + blocking.rs call sites (owner = ctx.shard_id / shard_id / 0; all uncontroversial).
- 2e migrates handler_sharded + handler_sharded/dispatch + command/persistence call sites. **Includes the cross-shard routing fix at mod.rs:1651** (target, not ctx.shard_id) with the audit table pasted into its commit body for posterity, plus removal of the legacy aof_tx field.

Each commit compiles and tests green. Bisect remains useful because the type system always has a consistent shape (both fields present during 2c-2e, only pool present after 2e).

Pre-refactor audit (16 sites mapped to owner shard)
---------------------------------------------------

| Site | Owner shard |
|------|-------------|
| handler_sharded/mod.rs:1175 MOVE | ctx.shard_id |
| handler_sharded/mod.rs:1219 COPY | ctx.shard_id |
| handler_sharded/mod.rs:1430 local write | ctx.shard_id |
| handler_sharded/mod.rs:1651 x-shard reply | **target** |
| handler_sharded/dispatch.rs:356 BGREWRITEAOF | (Rewrite; pool rejects) |
| handler_monoio/mod.rs:486,1124,1189,1538,1937 | ctx.shard_id |
| handler_monoio/dispatch.rs:981 BGREWRITEAOF | (Rewrite; pool rejects) |
| handler_single.rs (5) | 0 |
| blocking.rs:1349 inline SET | shard_id (param) |
| command/persistence.rs:233,263 BGREWRITEAOF helpers | (Rewrite) |
| shard/conn_accept.rs + event_loop.rs | plumbing only |

Verified by reading the binding scope at each site:

- mod.rs:1175/1219 inside `if is_local` (line 1125) → home shard.
- mod.rs:1430 inside `if is_local` + write-path branch → home shard.
- mod.rs:1651 inside `for (meta, target) in reply_futures` where meta was built per-target by remote_groups.entry(target).or_default() (line 1610): every entry's aof_bytes belongs to that target's shard.
- handler_monoio is shared-nothing per-shard; ctx.shard_id is the handler's home shard which also owns the Database being mutated.
- blocking.rs::try_inline_dispatch takes shard_id as a parameter.

Changes in this commit
----------------------

src/server/conn/core.rs (ConnectionContext)
  + import AofWriterPool
  + aof_pool: Option<Arc<AofWriterPool>> (with #[allow(dead_code)] explaining 2d/2e are the readers)
  + ConnectionContext::new signature gains aof_pool parameter

src/server/conn_state.rs (ConnectionContext, definition-only twin)
  + import AofWriterPool, mirror field for type-system consistency. This struct is #[allow(dead_code)] at the struct level (Phase 44 placeholder, not constructed anywhere); no constructor changes.

src/shard/conn_accept.rs (4 ConnectionContext::new call sites)
  At each site: compute `aof_pool = aof.as_ref().map(|tx| AofWriterPool::top_level(tx.clone()))` and pass it into the new parameter. Wrapping the same sender means pool.try_send_append(N, b) is identical to tx.try_send(AofMessage::Append(b)) for any N; no routing change yet.

Verification

cargo check + cargo clippy clean on both feature combinations:
  --no-default-features --features runtime-tokio,jemalloc
  (defaults: runtime-monoio,jemalloc,graph,text-index)

All 379 persistence tests remain green.

What this does NOT do (in scope for 2d/2e/2f)

  Step 2d - migrate handler_monoio + handler_single + blocking sites from ctx.aof_tx to ctx.aof_pool.as_ref().map(|p| p.try_send_append(ctx.shard_id, bytes))
  Step 2e - migrate handler_sharded sites INCLUDING the line 1651 target-routing fix; remove the legacy aof_tx field; update command/persistence BGREWRITEAOF helpers to use try_send_rewrite (with PerShard rejection)
  Step 2f - spawn sites (main.rs, listener.rs, embedded.rs) detect manifest layout and spawn N per_shard_aof_writer_task instances wrapped in AofWriterPool::per_shard()

Refs

  tmp/rfc-per-shard-aof-v02.md (RFC § 4, writer architecture)
  tmp/P0-INVEST-01-multishard-aof-rootcause.md (H1/H2 root cause)
  Commit 3bb4790 (step 1, manifest v2 format)
  Commit 5a546ff (step 2a, AofWriterPool type)
  Commit 3afe21f (step 2b, per-shard writer task body)
  Commit cb254ce (review fix, layout-aware paths + migrate rollback)

author: Tin Dang

Fifth implementation step of the per-shard AOF RFC (Option B in tmp/rfc-per-shard-aof-v02.md). Migrates the 7 `ctx.aof_tx` usages in `server/conn/handler_monoio/mod.rs` to `ctx.aof_pool`. Includes a cross-shard routing correctness fix at line 1937 that the compat-alias plumbing in step 2c made discoverable before it could ship as a silent data-loss bug.

Routing fix at handler_monoio/mod.rs:1937
-----------------------------------------

The reanalysis triggered by step 2c surfaced that this site is structurally identical to handler_sharded/mod.rs:1651: both are the bottom of a cross-shard reply loop where AOF append must land in the **target** shard's writer, NOT `ctx.shard_id`.

Before: `let _ = tx.try_send(AofMessage::Append(bytes));` Here `tx` is the single top-level writer, so under TopLevel layout this was correct. Under PerShard layout (step 2f and beyond) it would have written every cross-shard write into the connection's home shard AOF, leaving the target shard's AOF without the record and breaking per-shard recovery.

After: `pool.try_send_append(target, bytes);` where `target` is captured per-batch when the remote_groups entry is drained.

Plumbing required to expose `target` in scope:

  1. `oneshot_futures` declaration at line 1840 gained a leading `usize` element (the target shard), the type-system anchor making the rest of the change mechanical.
  2. The push at line 1884 captures `target` from the drain loop.
  3. The polling loop at line 1892 destructures `(target, meta, reply_rx)`.
  4. The AOF send inside the response-zip at line 1937 uses `target`.

Verified by reading the surrounding scope: `target` is bound in `for (target, entries) in remote_groups.drain()` at line 1844, where remote_groups was populated by `remote_groups.entry(target).or_default()` during command classification, so every entry's aof_bytes belongs to that target shard's data.

Other migrated sites in this commit
-----------------------------------

| Site | Owner shard | Pattern |
|------|-------------|---------|
| mod.rs:1069 is_write | n/a | `aof_tx.is_some()` → `aof_pool.is_some()` |
| mod.rs:1124 MOVE | ctx.shard_id | `pool.try_send_append(ctx.shard_id, _)` |
| mod.rs:1189 COPY | ctx.shard_id | `pool.try_send_append(ctx.shard_id, _)` |
| mod.rs:1538 local write | ctx.shard_id | `pool.try_send_append(ctx.shard_id, _)` |
| mod.rs:1771 aof_bytes | n/a | `aof_tx.is_some()` → `aof_pool.is_some()` |
| mod.rs:1937 x-shard | **target** | `pool.try_send_append(target, _)` ← fix |

All four direct-append sites use `pool.try_send_append(owner, bytes)` which returns `()` (fire-and-forget: back-pressure is intentional in the AOF hot path; loss is bounded by the channel capacity already chosen for the single writer). The `let _ =` wrapper from the tx form is dropped along with the `AofMessage` import that is no longer referenced at any call site in this file.

What this does NOT do (deferred to 2e)
--------------------------------------

handler_monoio/dispatch.rs:981: BGREWRITEAOF still calls `bgrewriteaof_start_sharded(tx, ...)` because the helper itself takes `&MpscSender<AofMessage>`. Step 2e migrates the helper to `pool.try_send_rewrite(msg)` (with PerShard rejection) and updates this call site in the same commit.

handler_monoio/mod.rs:486: still passes `&ctx.aof_tx` into `try_inline_dispatch_loop` in blocking.rs. Step 2e flips the parameter type alongside the body migration in blocking.rs and handler_single.rs.

Compat-alias progress
---------------------

After this commit, ctx.aof_pool is the sole AOF interface in handler_monoio's main dispatch loop. ctx.aof_tx remains as a field because:

  - dispatch.rs:981 (BGREWRITEAOF) still reads it
  - mod.rs:486 (inline path) still reads it
  - handler_sharded and handler_single haven't migrated yet

Step 2e removes the field entirely after the remaining 11 sites move.

Verification
------------

cargo check + cargo clippy clean on both feature combinations:
  --no-default-features --features runtime-tokio,jemalloc
  (defaults: runtime-monoio,jemalloc,graph,text-index)

Lib persistence tests:
  tokio: 379 passed
  monoio: 378 passed
  (tokio/monoio diff is feature-gated; matches step 2c baseline.)

Integration tests (`tests/integration.rs`) fail to compile with "missing field unsafe_multishard_aof" on 7 ServerConfig literals. This is pre-existing (commit e0bb658 added the field but did not update the test file), unrelated to step 2c/2d, and verified on the branch tip without these changes via `git stash`.

Refs
----

  tmp/rfc-per-shard-aof-v02.md (RFC § 4, writer architecture)
  tmp/P0-INVEST-01-multishard-aof-rootcause.md (H1/H2 root cause)
  Commit 5a546ff (step 2a, AofWriterPool type)
  Commit 3afe21f (step 2b, per-shard writer task body)
  Commit cb254ce (review fix, layout-aware paths + migrate rollback)
  Commit 6a758f4 (step 2c, type plumbing aof_tx → aof_pool)

author: Tin Dang
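The difference between routing by `target` and routing by the home shard is easy to see in a toy model. The sketch below is an assumption-laden illustration, not handler code: per-shard AOF files are modeled as in-memory vectors (`ShardLogs`), and both function names are hypothetical. Only the loop shape, `for (target, entries) in remote_groups.drain()`, is taken from the commit message.

```rust
use std::collections::HashMap;

/// Stand-in for per-shard AOF files: logs[i] holds the records
/// appended to shard i's file (assumption for illustration).
type ShardLogs = Vec<Vec<Vec<u8>>>;

/// Fixed behavior: each record lands in the AOF of the shard that owns the data.
fn append_to_target(remote_groups: &mut HashMap<usize, Vec<Vec<u8>>>, logs: &mut ShardLogs) {
    for (target, entries) in remote_groups.drain() {
        for aof_bytes in entries {
            logs[target].push(aof_bytes);
        }
    }
}

/// Pre-fix behavior under a PerShard layout: every cross-shard record lands in
/// the connection's home shard, so the target shard's replay would miss it.
fn append_to_home_shard(
    home_shard: usize,
    remote_groups: &mut HashMap<usize, Vec<Vec<u8>>>,
    logs: &mut ShardLogs,
) {
    for (_target, entries) in remote_groups.drain() {
        for aof_bytes in entries {
            logs[home_shard].push(aof_bytes);
        }
    }
}
```

With a single top-level writer the two functions are indistinguishable, since every record ends up in the same file; that is why the wrong-owner write stayed latent until the per-shard layout made ownership observable.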

… 2e-α)

Sixth implementation step of the per-shard AOF RFC (Option B in tmp/rfc-per-shard-aof-v02.md). Migrates the 5 direct `ctx.aof_tx` AOF-append sites and 2 `is_some()` gates in `server/conn/handler_sharded/mod.rs` to `ctx.aof_pool`. Includes the **canonical** cross-shard routing fix at line 1651 that was the motivating P0 for this entire RFC.

Routing fix at handler_sharded/mod.rs:1651
------------------------------------------

This is the originally-discovered site (counterpart to the latent fix shipped in step 2d for handler_monoio:1937). The cross-shard reply loop already had `target` in scope at line 1646, as the loop variable from `for (meta, target) in reply_futures`, so the change is mechanical:

Before: `if let Some(ref tx) = ctx.aof_tx { let _ = tx.try_send(AofMessage::Append(bytes)); }`
After: `if let Some(ref pool) = ctx.aof_pool { pool.try_send_append(target, bytes); }`

Why this matters: under TopLevel layout, a single writer absorbs every append regardless of `target`, so the wrong-owner write was structurally masked. Under PerShard (step 2f and beyond) each shard owns its own AOF file, and a write that mutates target shard's data MUST land in target shard's file; otherwise replay of target's AOF won't contain the record and post-crash state diverges. This was the H1/H2 root cause in P0-INVEST-01-multishard-aof-rootcause.md.

Other migrated sites in this commit
-----------------------------------

| Site | Owner shard | Pattern |
|------|-------------|---------|
| mod.rs:1122 is_write | n/a | `aof_tx.is_some()` → `aof_pool.is_some()` |
| mod.rs:1123 aof_bytes | n/a | `aof_tx.is_some()` → `aof_pool.is_some()` |
| mod.rs:1175 MOVE | ctx.shard_id | `pool.try_send_append(ctx.shard_id, _)` |
| mod.rs:1219 COPY | ctx.shard_id | `pool.try_send_append(ctx.shard_id, _)` |
| mod.rs:1430 local write | ctx.shard_id | `pool.try_send_append(ctx.shard_id, _)` |
| mod.rs:1651 x-shard | **target** | `pool.try_send_append(target, _)` ← fix |

The `AofMessage` import is no longer referenced at any call site in this file and is removed.

Scope split (subdivision of step 2e)
------------------------------------

The original 2c plan listed 2e as one big commit. To keep each step green-on-both-runtimes and bisectable, 2e is split into 4 atomic commits:

  2e-α (this commit): handler_sharded/mod.rs only (mirrors 2d shape).
  2e-β: command/persistence.rs BGREWRITEAOF helpers swap to `&AofWriterPool` (with PerShard rejection translated to a user-facing RESP error); both handler_*/dispatch.rs BGREWRITEAOF call sites flip together.
  2e-γ: handler_single.rs (6 sites, parameter type swap), blocking.rs (2 fn signatures + 1 use), handler_monoio/mod.rs:486 (call site for the migrated blocking helper), and the 12 test call sites in server/conn/tests.rs.
  2e-δ: Remove `aof_tx` field from ConnectionContext and conn_state.rs; drop the parameter from `ConnectionContext::new`; simplify the 4 spawn sites in shard/conn_accept.rs.

Each commit compiles + clippy clean + lib persistence tests green on both `runtime-monoio` and `runtime-tokio,jemalloc`. The compat-alias field (`ctx.aof_tx` alongside `ctx.aof_pool`) introduced in step 2c lets each commit flip its slice of call sites without breaking the other consumers.

Verification
------------

cargo clippy clean on both feature combinations:
  --no-default-features --features runtime-tokio,jemalloc
  (defaults: runtime-monoio,jemalloc,graph,text-index)

Lib persistence tests:
  tokio: 379 passed
  monoio: 378 passed
  (Diff is feature-gated; matches step 2c/2d baseline.)

Pre-existing tests/integration.rs breakage on `unsafe_multishard_aof` ServerConfig field (commit e0bb658) remains unrelated to this commit; verified via `git stash` in step 2d.

Refs
----

  tmp/rfc-per-shard-aof-v02.md (RFC § 4, writer architecture)
  tmp/P0-INVEST-01-multishard-aof-rootcause.md (H1/H2, the bug 1651 fixes)
  Commit 5a546ff (step 2a, AofWriterPool type)
  Commit 3afe21f (step 2b, per-shard writer task body)
  Commit cb254ce (review fix, layout-aware paths + migrate rollback)
  Commit 6a758f4 (step 2c, type plumbing aof_tx → aof_pool)
  Commit a05f3d8 (step 2d, handler_monoio migration + latent routing fix)

author: Tin Dang

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/persistence/aof.rs`:
- Around line 858-930: The per-shard writer loop currently blocks on rx.recv()
and never checks cancel, so a cancellation-only shutdown can hang; modify the
loop around rx.recv() (the match handling AofMessage::{Append, Rewrite,
RewriteSharded, Shutdown}) to make cancellation reachable by either using a
non-blocking/timeout receive (e.g., try_recv/recv_timeout) or by selecting
between rx.recv() and checking the cancel flag (atomic or channel) before/after
the recv, breaking out when cancel is set; ensure you still perform the same
final flush/sync logic (the file.flush().and_then(|_| file.sync_data()) block
guarded by write_error) and preserve the metrics/fsync handling and warnings for
Rewrite/RewriteSharded.
- Around line 720-756: The per-shard AOF writer currently ignores errors from
writer.flush().await and writer.get_ref().sync_data().await in the EverySec,
Shutdown, and cancel branches (as well as the Always path already checked),
which can silently drop durability; update the branches handling interval.tick()
when fsync == FsyncPolicy::EverySec, the Ok(AofMessage::Shutdown) / Err(_)
shutdown branch, and the cancel.cancelled() branch to check the Result values
from flush() and sync_data(), log failures including shard_id and the error, and
surface a degraded state (e.g., return or set a tracer/metric) instead of
discarding errors—modify the calls around writer.flush().await and
writer.get_ref().sync_data().await and add error handling/logging similar to the
existing Append path handling.

In `@src/server/conn/handler_monoio/mod.rs`:
- Around line 1536-1541: The AOF is serializing the original client Frame
(`frame`) instead of the possibly workspace-rewritten command arguments
(`cmd_args`), causing persisted writes to use unprefixed keys; change the AOF
serialization for local writes to serialize the dispatched command (the same
representation used by `dispatch_frame`) by passing the rewritten command args
(or the dispatched command object) into `aof::serialize_command` before calling
`pool.try_send_append` so local writes persist the post-rewrite command; update
the block that checks `is_write`/`ctx.aof_pool` to use `cmd_args` (or the
dispatched command) rather than `frame`, keeping the
`pool.try_send_append(ctx.shard_id, ...)` call.
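For the two src/persistence/aof.rs findings above (a blocking recv that never observes cancellation, and flush errors that are silently discarded), one possible shape is a bounded receive that re-checks a cancel flag on every wakeup and returns the final flush error to the caller. This is a sketch under assumptions, not the project's writer: it uses `std::sync::mpsc` and an `AtomicBool` in place of the real channel and cancel token, a generic `Write` in place of the file (so `sync_data()` is not shown), a reduced `AofMessage`, and an arbitrary 50 ms poll interval.

```rust
use std::io::{self, Write};
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// Reduced stand-in for the real message enum (assumption).
enum AofMessage {
    Append(Vec<u8>),
    Shutdown,
}

/// Drains appends until Shutdown, disconnect, or cancellation, then flushes.
/// Returns the number of records written, or the first I/O error.
fn writer_loop<W: Write>(
    rx: &Receiver<AofMessage>,
    cancel: &AtomicBool,
    out: &mut W,
    shard_id: usize,
) -> io::Result<usize> {
    let mut appended = 0;
    loop {
        // Checked on every iteration, so a cancellation-only shutdown
        // is observed within one poll interval instead of hanging.
        if cancel.load(Ordering::Acquire) {
            break;
        }
        match rx.recv_timeout(Duration::from_millis(50)) {
            Ok(AofMessage::Append(bytes)) => {
                out.write_all(&bytes)?;
                appended += 1;
            }
            Ok(AofMessage::Shutdown) | Err(RecvTimeoutError::Disconnected) => break,
            Err(RecvTimeoutError::Timeout) => continue,
        }
    }
    // Surface the final flush failure instead of discarding it, and tag
    // the log line with the shard so it maps to a file on disk.
    out.flush().map_err(|e| {
        eprintln!("shard {shard_id}: final AOF flush failed: {e}");
        e
    })?;
    Ok(appended)
}
```

The trade-off of polling is a small wakeup cost while idle; an async runtime would typically select between the receive and the cancel future instead, as the tokio branch of the writer is described as doing.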


ℹ️ Review info


📥 Commits

Reviewing files that changed from the base of the PR and between 3bb4790 and a05f3d8.

📒 Files selected for processing (6)

src/persistence/aof.rs
src/persistence/aof_manifest.rs
src/server/conn/core.rs
src/server/conn/handler_monoio/mod.rs
src/server/conn_state.rs
src/shard/conn_accept.rs

🚧 Files skipped from review as they are similar to previous changes (1)

src/persistence/aof_manifest.rs

coderabbitai · 2026-05-27T06:42:28Z

                    // AOF logging for successful local writes
                    if !matches!(response, Frame::Error(_)) && is_write {
-                        if let Some(ref tx) = ctx.aof_tx {
+                        if let Some(ref pool) = ctx.aof_pool {
                            let serialized = aof::serialize_command(&frame);
-                            let _ = tx.try_send(AofMessage::Append(serialized));
+                            pool.try_send_append(ctx.shard_id, serialized);
                        }


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Serialize the dispatched command, not the client frame, for local AOF writes.

Line 1539 still uses frame, but by this point cmd_args may already be workspace-rewritten. Remote writes handle this correctly with dispatch_frame; local writes will persist the unprefixed command and replay to the wrong key after restart.

Suggested fix

-                        if let Some(ref pool) = ctx.aof_pool {
-                            let serialized = aof::serialize_command(&frame);
+                        if let Some(ref pool) = ctx.aof_pool {
+                            let aof_frame = if rewritten.is_some() {
+                                let mut parts = Vec::with_capacity(1 + cmd_args.len());
+                                parts.push(Frame::BulkString(Bytes::copy_from_slice(cmd)));
+                                parts.extend_from_slice(cmd_args);
+                                Frame::Array(parts.into())
+                            } else {
+                                frame.clone()
+                            };
+                            let serialized = aof::serialize_command(&aof_frame);
                             pool.try_send_append(ctx.shard_id, serialized);
                         }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/server/conn/handler_monoio/mod.rs` around lines 1536 - 1541, The AOF is serializing the original client Frame (`frame`) instead of the possibly workspace-rewritten command arguments (`cmd_args`), causing persisted writes to use unprefixed keys; change the AOF serialization for local writes to serialize the dispatched command (the same representation used by `dispatch_frame`) by passing the rewritten command args (or the dispatched command object) into `aof::serialize_command` before calling `pool.try_send_append` so local writes persist the post-rewrite command; update the block that checks `is_write`/`ctx.aof_pool` to use `cmd_args` (or the dispatched command) rather than `frame`, keeping the `pool.try_send_append(ctx.shard_id, ...)` call.
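
To make the failure mode in this comment concrete, here is a minimal, self-contained sketch. It does not use Moon's real `Frame` type or `aof::serialize_command`: `rewrite_for_workspace`, `serialize`, `replay`, and the exact `{ws_id}` prefix format are hypothetical stand-ins chosen for illustration. It only shows that logging the client-visible command makes replay land on a different key than the one the server actually wrote.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for workspace rewriting: prefix the key
/// (argument 1) with `{ws_id}`. Not Moon's real rewrite logic.
fn rewrite_for_workspace(ws_id: &str, args: &[String]) -> Vec<String> {
    let mut out = args.to_vec();
    if let Some(key) = out.get_mut(1) {
        let prefixed = format!("{{{}}}{}", ws_id, key);
        *key = prefixed;
    }
    out
}

/// Toy AOF record: command parts joined by spaces (not RESP).
fn serialize(parts: &[String]) -> String {
    parts.join(" ")
}

/// Replays a single `SET key value` record into a store.
fn replay(store: &mut HashMap<String, String>, record: &str) {
    let parts: Vec<&str> = record.split(' ').collect();
    if parts.len() == 3 && parts[0] == "SET" {
        store.insert(parts[1].to_string(), parts[2].to_string());
    }
}

/// Returns the key present in a fresh store after replaying `record`.
fn replayed_key(record: &str) -> Option<String> {
    let mut store = HashMap::new();
    replay(&mut store, record);
    store.keys().next().cloned()
}
```

Replaying `serialize(&client_args)` restores `user:1`, while the live server stored `{ws42}user:1`; replaying the serialized dispatched command restores the prefixed key.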

…tep 2e-β)

Seventh implementation step of the per-shard AOF RFC (Option B in tmp/rfc-per-shard-aof-v02.md). Swings `bgrewriteaof_start` and `bgrewriteaof_start_sharded` over to `&AofWriterPool` and routes through `pool.try_send_rewrite(...)`, which rejects under PerShard layout with a stable user-facing RESP error. All three callers flip together so the helpers stay strictly typed.

Why this matters
----------------
Step 2b shipped `per_shard_aof_writer_task` with PerShard rejection of Rewrite/RewriteSharded messages (logged at `warn!`). Before this commit, under PerShard layout BGREWRITEAOF would have:

1. Sent `AofMessage::RewriteSharded(...)` into shard-0's writer via the legacy `tx.try_send(...)` path,
2. Received `Ok(())` (channel accepted the message),
3. Returned `+Background append only file rewriting started\r\n` to the client,
4. The per-shard writer would warn and drop the message; no rewrite happens.

That is a silent failure: the client thinks a rewrite is in progress when nothing is actually happening, and the rewrite-in-progress flag is stuck set. After this commit, `pool.try_send_rewrite(...)` returns `RewriteUnsupportedInPerShard`, the helper clears the flag, and the client receives an explicit error:

    -ERR BGREWRITEAOF is not yet supported under per-shard AOF layout; per-shard rewrite ships in step 6 of the per-shard AOF migration

(Under TopLevel layout, i.e. today, `try_send_rewrite` is a thin pass-through, so behaviour is unchanged.)

Changes
-------
command/persistence.rs
- Both `bgrewriteaof_start` and `bgrewriteaof_start_sharded` now take `pool: &AofWriterPool` instead of `&channel::MpscSender<AofMessage>`.
- New `rewrite_pool_error_frame(err: AofPoolSendError)` translates pool failures into RESP errors (PerShard rejection → user-facing "not yet supported"; channel send fail → existing "failed to start").
- `AOF_REWRITE_IN_PROGRESS` is still cleared on any send failure, matching prior behaviour.
- Removed now-unused `crate::runtime::channel` import.
- Existing gate test `test_bgrewriteaof_sharded_refuses_under_unsafe_config` updated to wrap the local sender as a `TopLevel` pool before invoking the helper.

server/conn/handler_monoio/dispatch.rs:980
server/conn/handler_sharded/dispatch.rs:355
- BGREWRITEAOF dispatch path uses `ctx.aof_pool` (the field plumbed in step 2c) instead of `ctx.aof_tx`. Behaviour identical under TopLevel; gains PerShard rejection in step 2f.

server/conn/handler_single.rs:610
- Wraps the local `aof_tx` parameter as a transient `AofWriterPool::top_level(tx.clone())` before calling the helper. handler_single is single-shard mode by definition, so the writer is always TopLevel; the wrapper is purely a type adapter. BGREWRITEAOF is a manual admin command, not a hot path; the transient allocation is acceptable. Step 2e-γ swaps the function's `aof_tx` parameter to `aof_pool` and removes this wrapper.

server/conn/core.rs (ConnectionContext.aof_tx)
- Doc comment expanded to track the staged removal.
- `#[cfg_attr(not(feature = "runtime-monoio"), allow(dead_code))]` silences clippy under tokio (where the only remaining reader is `handler_monoio/mod.rs:486`, which is `#[cfg(feature = "runtime-monoio")]`). Future regressions on monoio still trip a real dead-code warning.

What this does NOT do (deferred to 2e-γ)
---------------------------------------
- handler_single's 5 remaining `aof_tx` sites (SWAPDB at 658, AOF drain at 881, WAL records at 1513, is_write at 1531, AOF drain at 2235). All keep using the local `aof_tx` parameter.
- handler_single function-parameter rename (`aof_tx` → `aof_pool`).
- blocking.rs `try_inline_dispatch` / `try_inline_dispatch_loop` signatures + the AOF send at line 1349.
- handler_monoio/mod.rs:486 call site for the migrated blocking helper.
- server/conn/tests.rs (12 call sites: straightforward None/Some swaps once blocking.rs's signature flips).

What this does NOT do (deferred to 2e-δ)
---------------------------------------
- Remove the `aof_tx` field from ConnectionContext and conn_state.rs.
- Drop the parameter from `ConnectionContext::new`.
- Simplify the 4 spawn sites in shard/conn_accept.rs.

Verification
------------
cargo clippy clean on both feature combinations:
    --no-default-features --features runtime-tokio,jemalloc
    (defaults: runtime-monoio,jemalloc,graph,text-index)

Lib persistence tests:
    tokio: 379 passed
    monoio: 378 passed

Including the gate-refusal test that now exercises the pool path.

Refs
----
tmp/rfc-per-shard-aof-v02.md (RFC § 4: writer architecture)
Commit 5a546ff (step 2a: AofWriterPool type + try_send_rewrite)
Commit 3afe21f (step 2b: per-shard writer task body that rejects Rewrite/RewriteSharded with warn!)
Commit 6a758f4 (step 2c: type plumbing aof_tx → aof_pool)
Commit a05f3d8 (step 2d: handler_monoio migration + latent routing fix)
Commit eb90419 (step 2e-α: handler_sharded migration + canonical routing fix at line 1651)

author: Tin Dang
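
The change from a silent drop to an explicit rejection in step 2e-β can be sketched as follows. This is an illustration under stated assumptions, not the code in `command/persistence.rs`: `Layout`, `PoolSendError`, `Pool`, and `error_frame` are simplified stand-ins, the flag is passed in rather than being a global, and the "already in progress" and "failed to start" strings are placeholders. Only the success reply and the PerShard error text are taken from the commit message.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc::SyncSender;

/// Toy stand-in for the pool layout; not Moon's real `AofLayout`.
enum Layout {
    TopLevel,
    PerShard,
}

/// Toy stand-in for the pool's send errors.
#[derive(Debug, PartialEq)]
enum PoolSendError {
    RewriteUnsupportedInPerShard,
    SendFailed,
}

struct Pool {
    layout: Layout,
    senders: Vec<SyncSender<&'static str>>,
}

impl Pool {
    /// Rejects up front under PerShard instead of enqueueing a message
    /// that the writer would later warn about and drop.
    fn try_send_rewrite(&self) -> Result<(), PoolSendError> {
        match self.layout {
            Layout::PerShard => Err(PoolSendError::RewriteUnsupportedInPerShard),
            Layout::TopLevel => self.senders[0]
                .try_send("rewrite")
                .map_err(|_| PoolSendError::SendFailed),
        }
    }
}

/// Maps a pool error to a RESP-style error line.
fn error_frame(err: PoolSendError) -> String {
    match err {
        PoolSendError::RewriteUnsupportedInPerShard => "-ERR BGREWRITEAOF is not yet supported \
             under per-shard AOF layout; per-shard rewrite ships in step 6 of the per-shard AOF migration"
            .to_string(),
        // Placeholder text; the real message is not quoted in the source.
        PoolSendError::SendFailed => "-ERR failed to start".to_string(),
    }
}

fn bgrewriteaof_start(pool: &Pool, in_progress: &AtomicBool) -> String {
    if in_progress.swap(true, Ordering::SeqCst) {
        // Placeholder text for the already-running case.
        return "-ERR rewrite already in progress".to_string();
    }
    match pool.try_send_rewrite() {
        Ok(()) => "+Background append only file rewriting started".to_string(),
        Err(err) => {
            // Clear the flag on any send failure so it cannot get stuck.
            in_progress.store(false, Ordering::SeqCst);
            error_frame(err)
        }
    }
}
```

The property worth checking is that a PerShard rejection leaves the in-progress flag cleared, so a later BGREWRITEAOF is not refused because of a rewrite that never started.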

…o aof_pool (Option B step 2e-γ)

Eighth implementation step of the per-shard AOF RFC (Option B in tmp/rfc-per-shard-aof-v02.md). Drains the remaining `ctx.aof_tx` and parameter-level `aof_tx` readers from the connection-handler layer:

- `blocking.rs::try_inline_dispatch` + `try_inline_dispatch_loop`: parameter type changes from `&Option<MpscSender<AofMessage>>` to `&Option<Arc<AofWriterPool>>`. The L1349 AOF append uses `pool.try_send_append(shard_id, frozen)`. Under PerShard layout this routes to the shard that owns the data, fixing the same latent bug class as 2d/2e-α (a TopLevel writer would absorb every shard's inline SET regardless of routing).
- `handler_monoio/mod.rs:486`: flipped to pass `&ctx.aof_pool` into the migrated blocking helper. After this commit no consumer reads `ctx.aof_tx` under any feature combo.
- `handler_single.rs`: top of `handle_connection` constructs `aof_pool: Option<Arc<AofWriterPool>>` from the inbound `aof_tx` parameter via `AofWriterPool::top_level(tx.clone())`. All six consumer sites (BGREWRITEAOF wrapper from 2e-β, SWAPDB WAL, per-batch AOF drain at 905, per-batch AOF drain at 2260, GRAPH WAL records at 1537, is_write/aof_bytes gate at 1556) now read `aof_pool` instead of `aof_tx`. The `aof_tx` function parameter survives as a placeholder for 2e-δ when listener.rs starts constructing the pool itself.
- `server/conn/tests.rs`: 12 inline-dispatch test fixtures swap `aof_tx: Option<MpscSender<AofMessage>>` for `aof_pool: Option<Arc<AofWriterPool>>` and pass `&aof_pool` into the migrated `try_inline_dispatch[_loop]`. The one Some-form fixture (`test_inline_set_with_aof_falls_through_when_writes_disabled`) wraps the local sender as a TopLevel pool.

Two send-style choices made deliberately
----------------------------------------
`AofWriterPool` exposes two send paths today: a fire-and-forget `try_send_append(shard_id, bytes)` (returns `()`) and the lower-level `sender(shard_id)` which returns the underlying `&MpscSender` for callers that need the `Result` or want `send_async`. Most migrated sites use `try_send_append`; the four exceptions are:

- SWAPDB at handler_single:677 keeps `sender(0).try_send(...).is_ok()` because the swap MUST abort cleanly if the WAL enqueue fails (it is the only durability hook before the in-memory swap). The fire-and-forget helper silently drops; here we need the Result.
- The three `send_async(AofMessage::Append(...)).await` sites at handler_single:909 / 1540 / 2266 keep `sender(0).send_async(...).await` because their pre-pool code awaited capacity on a full channel (back-pressure on the inbound write path). `try_send_append` would drop instead.

Preserving the semantics is more important than the uniform call shape here. The per-shard pool exposes the same sender under PerShard, so the semantics carry over in 2f.

ConnectionContext.aof_tx
------------------------
After this commit the field has no readers under either runtime. The doc comment is updated to reflect the staged removal, and the `cfg_attr(not(...))` gate from 2e-β collapses to a plain `#[allow(dead_code)]` (the field is write-only, populated by the constructor, until 2e-δ drops both the constructor parameter and the field itself).

What this does NOT do (deferred to 2e-δ)
---------------------------------------
- Remove the `aof_tx` field from ConnectionContext + conn_state.rs.
- Drop the constructor parameter `aof_tx: Option<MpscSender<...>>` from `ConnectionContext::new`.
- Simplify the 4 spawn sites in shard/conn_accept.rs (they currently clone `aof` only to pass it as the field; once the field is gone the field-assignment can go too).
- Replace the `aof_tx` parameter on handler_single's `handle_connection` with `aof_pool` (and update listener.rs to construct the pool itself).

Verification
------------
cargo clippy clean on both feature combinations:
    --no-default-features --features runtime-tokio,jemalloc
    (defaults: runtime-monoio,jemalloc,graph,text-index)

Lib persistence tests:
    tokio: 379 passed
    monoio: 378 passed

Inline dispatch tests (server::conn::tests): 11 passed (covers GET hit/miss, multi-shard skip, SET inline, SET with AOF fall-through, several malformed-input rejects).

Refs
----
tmp/rfc-per-shard-aof-v02.md (RFC § 4: writer architecture)
Commit 5a546ff (step 2a: AofWriterPool type)
Commit 3afe21f (step 2b: per-shard writer task body)
Commit 6a758f4 (step 2c: type plumbing aof_tx → aof_pool)
Commit a05f3d8 (step 2d: handler_monoio migration + latent routing fix)
Commit eb90419 (step 2e-α: handler_sharded migration + canonical routing fix)
Commit 5735031 (step 2e-β: BGREWRITEAOF helpers via AofWriterPool)

author: Tin Dang
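
The difference between the two send styles kept in step 2e-γ can be illustrated with the standard library's bounded channel. This is a sketch under assumptions: `Pool` here wraps `std::sync::mpsc::SyncSender` rather than Moon's `MpscSender`, a blocking `send` stands in for `send_async(...).await`, and `swap_if_logged` and `append_with_backpressure` are hypothetical helper names.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender};

/// Toy pool over bounded std channels; not Moon's `AofWriterPool`.
struct Pool {
    senders: Vec<SyncSender<Vec<u8>>>,
}

impl Pool {
    /// Fire-and-forget: a full or closed channel drops the record silently.
    fn try_send_append(&self, shard_id: usize, bytes: Vec<u8>) {
        let _ = self.senders[shard_id].try_send(bytes);
    }

    /// Lower-level access for callers that need the `Result`.
    fn sender(&self, shard_id: usize) -> &SyncSender<Vec<u8>> {
        &self.senders[shard_id]
    }
}

fn pool_with_capacity(cap: usize) -> (Pool, Receiver<Vec<u8>>) {
    let (tx, rx) = sync_channel(cap);
    (Pool { senders: vec![tx] }, rx)
}

/// SWAPDB-style caller: perform the swap only if the log record was
/// enqueued, so the in-memory state never gets ahead of the log.
fn swap_if_logged(pool: &Pool, swapped: &mut bool) -> bool {
    let logged = pool.sender(0).try_send(b"SWAPDB 0 1".to_vec()).is_ok();
    if logged {
        *swapped = true;
    }
    logged
}

/// Back-pressure caller: `send` blocks while the channel is full
/// instead of dropping the record.
fn append_with_backpressure(pool: &Pool, bytes: Vec<u8>) -> bool {
    pool.sender(0).send(bytes).is_ok()
}
```

With a capacity of one, a second `try_send_append` is lost without any signal, while `swap_if_logged` on the same full channel returns `false` and leaves the state untouched.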

…step 2e-δ)

Ninth implementation step of the per-shard AOF RFC (Option B in tmp/rfc-per-shard-aof-v02.md). With handler_monoio (2d), handler_sharded (2e-α), BGREWRITEAOF helpers (2e-β), and handler_single + blocking + inline tests (2e-γ) all migrated to `AofWriterPool`, the compat-alias `aof_tx` field on `ConnectionContext` has no remaining consumers. This commit removes it, drops the parameter from `ConnectionContext::new`, and simplifies the 4 spawn sites in `shard/conn_accept.rs` that no longer need to clone `aof_tx` as an intermediate.

Changes
-------
src/server/conn/core.rs
- Remove `aof_tx: Option<MpscSender<AofMessage>>` field (was `#[allow(dead_code)]` in step 2e-γ after the last reader left).
- Drop `aof_tx` parameter from `ConnectionContext::new`.
- Drop `aof_tx` from struct initializer.
- Doc-comment on `aof_pool` updated to reflect it as the sole AOF interface (the "compat alias" framing from step 2c is now history).
- Remove unused `AofMessage` import.

src/server/conn_state.rs (definition-only placeholder twin)
- Mirror the same field removal + doc-comment update.
- Remove unused `AofMessage` import.

src/shard/conn_accept.rs (4 ConnectionContext::new spawn sites)
- Drop the intermediate `let aof = aof_tx.clone();` (the only consumer was the constructor's removed parameter).
- Build the pool directly: `aof_pool = aof_tx.as_ref().map(...)`.
- Drop the `aof,` positional argument from each constructor call.
- Update the "2c compat alias" comment to point forward at the layout-aware constructor in step 2f.

What this does NOT do (deferred to 2f)
-------------------------------------
- handler_single's `aof_tx` parameter on `handle_connection`: needs listener.rs (the spawn site) to construct the pool itself first.
- Spawn-side AOF channel construction in main.rs, listener.rs, and embedded.rs: they still build a single `MpscSender<AofMessage>` and pass it through `aof_tx` chains. Step 2f introduces the layout-aware `AofWriterPool::from_manifest(...)` that emits `top_level(tx)` for TopLevel or `per_shard(senders)` for PerShard and replaces the per-shard channel fanout in `shard/event_loop.rs`.

Verification
------------
cargo clippy clean on both feature combinations:
    --no-default-features --features runtime-tokio,jemalloc
    (defaults: runtime-monoio,jemalloc,graph,text-index)

Lib persistence tests:
    tokio: 379 passed
    monoio: 378 passed

Inline-dispatch tests (server::conn::tests): 11 passed.

End-state of step 2 (handler-layer migration)
---------------------------------------------
After this commit `aof_pool` is the sole AOF interface across:

- ConnectionContext (struct + constructor)
- handler_sharded (mod.rs + dispatch.rs)
- handler_monoio (mod.rs + dispatch.rs)
- handler_single (all internal sites; parameter still receives `aof_tx` but is only used to bootstrap the pool)
- blocking.rs (try_inline_dispatch + try_inline_dispatch_loop)
- command/persistence.rs (BGREWRITEAOF helpers, with PerShard rejection)
- server/conn/tests.rs (12 inline-dispatch fixtures)

The remaining `aof_tx` references in the tree:

- src/main.rs, src/server/embedded.rs, src/server/listener.rs (spawn-side channel construction; 2f scope)
- src/shard/event_loop.rs (passes `aof_tx` through to conn_accept; 2f flips to per-shard pool construction)
- src/shard/conn_accept.rs (still receives `aof_tx: &Option<MpscSender>` as parameter; 2f changes to `aof_pool: &Option<Arc<AofWriterPool>>`)
- src/server/conn/handler_single.rs (function parameter only; bootstrap site for the local pool; 2f rename)
- src/persistence/aof.rs (channel type definitions; stable)

Refs
----
tmp/rfc-per-shard-aof-v02.md (RFC § 4: writer architecture)
Commit 5a546ff (step 2a: AofWriterPool type)
Commit 3afe21f (step 2b: per-shard writer task body)
Commit 6a758f4 (step 2c: type plumbing aof_tx → aof_pool compat alias)
Commit a05f3d8 (step 2d: handler_monoio migration + latent routing fix)
Commit eb90419 (step 2e-α: handler_sharded migration + canonical routing fix)
Commit 5735031 (step 2e-β: BGREWRITEAOF helpers via AofWriterPool)
Commit ceac655 (step 2e-γ: handler_single + blocking + inline tests)

author: Tin Dang

…ig literals

Commit e0bb658 added `unsafe_multishard_aof: bool` to `ServerConfig` (the P0 gate against multi-shard AOF data loss until per-shard replay lands) but did not update the 17 `ServerConfig { .. }` literals scattered across the integration-test suite. The tests have been failing to compile since then on both feature combinations.

This commit backfills `unsafe_multishard_aof: false,` in all affected literals, preserving the production default (refuse the unsafe config at startup unless explicitly overridden). No test semantics change: the tests that exercise multi-shard configs already use single-shard storage layouts or `appendonly = "no"`, so the gate doesn't fire for them.

Files touched (17 literals across 10 files)
-------------------------------------------
tests/ft_search_multi_shard_as_of.rs
tests/ft_search_temporal_parity.rs
tests/integration.rs (7 sites)
tests/kill_snapshot.rs
tests/mq_integration.rs
tests/replication_test.rs
tests/txn_ft_search_snapshot.rs
tests/txn_kv_wiring.rs
tests/vacuum_commands.rs
tests/workspace_integration.rs (2 sites)

Verification
------------
cargo check --tests
cargo check --tests --no-default-features --features runtime-tokio,jemalloc

Both clean. Unblocks integration-test runs for the per-shard AOF migration commits (2a..2e-δ on origin) and any future PRs landing on this branch.

Refs
----
Commit e0bb658 (origin of the unbackfilled field)
Commit 6e49050 (docs noting the multi-shard AOF safety gate)
tmp/rfc-per-shard-aof-v02.md (per-shard AOF migration scope)

author: Tin Dang

…tep 2f-α)

Tenth implementation step of the per-shard AOF RFC (Option B in tmp/rfc-per-shard-aof-v02.md). Closes out the handler-layer migration sequence by lifting `AofWriterPool` construction to the three spawn sites (`main.rs`, `server/listener.rs`, `server/embedded.rs`) and retyping the connection-accept fan-out (`shard/event_loop.rs`, `shard/conn_accept.rs`, `server/conn/handler_single.rs`) to thread `Option<Arc<AofWriterPool>>` end-to-end. The compat-alias inline construction that step 2c–2e-δ relied on (`let aof_pool = aof_tx .as_ref().map(|tx| AofWriterPool::top_level(tx.clone()))`) is deleted from every site.

After this commit, `aof_tx` no longer exists anywhere in `src/`. Grep confirms zero matches under any feature combo.

Scope split: 2f-α vs 2f-β
-------------------------
This commit is strictly **type plumbing**: every writer pool is still `AofLayout::TopLevel` wrapping a single sender. The layout-aware constructor that reads `AofManifest` and emits PerShard pools (with fan-out to N writer threads) lands as a follow-up commit (2f-β). The RFC's "Step 2f" originally bundled both; separating them keeps the diff bisectable and preserves the property that today's runtime behavior is byte-identical to step 2e-δ.

Changes
-------
src/main.rs
- Import `AofWriterPool` alongside `AofMessage` + `FsyncPolicy`.
- Replace `let aof_tx: Option<MpscSender<AofMessage>>` with `let aof_pool: Option<Arc<AofWriterPool>>`. Wrap the writer sender via `AofWriterPool::top_level(tx)`.
- Rename per-shard clone `shard_aof_tx` → `shard_aof_pool` and the matching positional argument in `Shard::run(...)`.
- Shutdown path: `tx.send(AofMessage::Shutdown)` → `pool.broadcast_shutdown()`. Under TopLevel this is one try_send; under PerShard (2f-β) it fans to every per-shard writer.

src/server/listener.rs
- Same pattern. `aof_tx` → `aof_pool: Option<Arc<AofWriterPool>>`, wrapped at the construction site.
- Accept-loop captures `aof_pool_conn = aof_pool.clone()` (Arc bump) and passes it as the `aof_pool` parameter of `connection::handle_connection` (handler_single).
- Cancel-path shutdown switches to `pool.broadcast_shutdown()` (note: `try_send`-based, not async; listener already drains on the same runtime).

src/server/embedded.rs
- Mirror change: outer tuple now `(Option<Arc<AofWriterPool>>, Option<JoinHandle>)`.
- Shutdown-ordering comment updated to reflect the pool-Drop semantics: dropping the last `Arc` drops the pool, which drops the underlying `Vec<MpscSender>`, which closes the channel. The writer's `recv_async()` returns `Err(_)` and the task drains + fsyncs + exits cleanly. This preserves Qodo bug #5's fix: shards drop their clones before the outer pool, so the writer never terminates while shards still have pending appends.

src/shard/event_loop.rs
- `Shard::run` signature: `aof_tx: Option<MpscSender<AofMessage>>` → `aof_pool: Option<Arc<AofWriterPool>>`.
- 9 internal pass-through sites (`&aof_tx` → `&aof_pool`) updated.

src/shard/conn_accept.rs
- 4 function signatures (`spawn_tokio_connection`, `spawn_monoio_connection`, `spawn_monoio_tls_connection`, `spawn_migrated_monoio_connection`): parameter `aof_tx: &Option<MpscSender<AofMessage>>` → `aof_pool: &Option<Arc<AofWriterPool>>`.
- 4 inline pool-construction blocks deleted (the compat-alias `let aof_pool = aof_tx.as_ref().map(|tx| top_level(tx.clone()))` pattern from step 2c). Replaced by a one-line Arc bump: `let pool_for_ctx = aof_pool.as_ref().map(Arc::clone);` passed positionally into `ConnectionContext::new(.., pool_for_ctx, ..)`.

src/server/conn/handler_single.rs
- Parameter `aof_tx: Option<MpscSender<AofMessage>>` → `aof_pool: Option<Arc<AofWriterPool>>`.
- **DELETED** the step-2e-γ bootstrap block that wrapped the inbound `aof_tx` as a TopLevel pool. The parameter IS the pool now; the bootstrap was always a placeholder for this commit.
- Doc comment on `handle_connection` updated to reflect the pool semantics (single-shard ⇒ always TopLevel).

What this does NOT do (deferred to 2f-β)
----------------------------------------
- Read `AofManifest` from disk in `main.rs`/`embedded.rs` to choose between `top_level(...)` and `per_shard(senders)`.
- Spawn N writer threads when the on-disk manifest is `AofLayout::PerShard`.
- Add a manifest mismatch warning (manifest says PerShard but constructed as TopLevel, or vice versa).
- Wire `per_shard_aof_writer_task` (already defined in step 2b) into the spawn flow.

Today's runtime behavior is byte-identical to step 2e-δ. The only observable change is: every site speaks the `AofWriterPool` API instead of `MpscSender<AofMessage>`, which is a precondition for 2f-β shipping the PerShard fan-out without touching call sites again.

Verification
------------
cargo check on both feature combinations:
    --no-default-features --features runtime-tokio,jemalloc    clean
    (defaults: runtime-monoio,jemalloc,graph,text-index)       clean

cargo clippy -- -D warnings on both feature combinations: clean.

Lib persistence tests (full set, including the 5 pool_tests added in step 2a):
    tokio: 379 passed (baseline match)
    monoio: 378 passed (baseline match)

cargo test --lib (full lib suite):
    tokio: 2751 passed
    monoio: pre-existing stack overflow in `graph::cypher::parser::tests::test_nesting_depth_exceeded` (verified on origin/HEAD without these changes; unrelated to AOF migration).

Integration-test compile: clean on both combos after the parallel test-fix commit `4fdd50f` (unsafe_multishard_aof backfill).

Net `aof_tx` references in src/
-------------------------------
Before this commit: 37 across 6 files.
After this commit: 0.

The full per-shard AOF refactor (steps 2a–2f-α) is now complete on the handler + spawn layer. Step 2f-β (layout-aware fan-out) and step 3+ (LSN tagging, per-shard replay, cross-shard ordering, AppendSync, crash matrix) are unblocked.

Refs
----
tmp/rfc-per-shard-aof-v02.md (RFC § 4: writer architecture)
Commit 5a546ff (step 2a: AofWriterPool type)
Commit 3afe21f (step 2b: per-shard writer task body)
Commit 6a758f4 (step 2c: type plumbing aof_tx → aof_pool compat alias)
Commit a05f3d8 (step 2d: handler_monoio migration + latent routing fix)
Commit eb90419 (step 2e-α: handler_sharded migration + canonical routing fix)
Commit 5735031 (step 2e-β: BGREWRITEAOF helpers via AofWriterPool)
Commit ceac655 (step 2e-γ: handler_single + blocking + inline tests)
Commit d9a3651 (step 2e-δ: drop ConnectionContext.aof_tx field)
Commit 4fdd50f (test backfill: unsafe_multishard_aof field)

author: Tin Dang
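
The pool-Drop shutdown ordering described for `embedded.rs` in step 2f-α (the last `Arc` drops the pool, the senders drop, the channel closes, the writer drains and exits) can be reproduced with standard-library primitives. This is a minimal sketch, assuming `std::sync::mpsc` and OS threads in place of Moon's channel and async runtime; `Pool`, `writer_task`, and `run_shutdown_demo` are hypothetical names.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender};
use std::sync::Arc;
use std::thread;

/// Toy pool holding the only senders for the writer's channel.
struct Pool {
    senders: Vec<SyncSender<String>>,
}

/// Writer loop: keeps receiving until every sender has been dropped,
/// then returns everything it "flushed".
fn writer_task(rx: Receiver<String>) -> Vec<String> {
    let mut flushed = Vec::new();
    // `recv` yields buffered records first and returns `Err` only once
    // the channel is empty and all senders are gone.
    while let Ok(record) = rx.recv() {
        flushed.push(record);
    }
    flushed
}

fn run_shutdown_demo(num_shards: usize) -> Vec<String> {
    let (tx, rx) = sync_channel(64);
    let pool = Arc::new(Pool { senders: vec![tx] });
    let writer = thread::spawn(move || writer_task(rx));

    // Each shard holds its own `Arc` clone and drops it when it finishes.
    let shards: Vec<_> = (0..num_shards)
        .map(|sid| {
            let shard_pool = Arc::clone(&pool);
            thread::spawn(move || {
                shard_pool.senders[0]
                    .send(format!("shard-{sid}"))
                    .expect("writer is still alive while shards hold the pool");
            })
        })
        .collect();
    for shard in shards {
        shard.join().unwrap();
    }

    // Dropping the outer (last) Arc drops the senders and closes the channel.
    drop(pool);
    writer.join().unwrap()
}
```

Because the shards release their clones before the outer `Arc` is dropped, the writer cannot observe a closed channel while a shard still has an append to send.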

…step 2f-β)

Eleventh implementation step of the per-shard AOF RFC (Option B in tmp/rfc-per-shard-aof-v02.md). Replaces the unconditional TopLevel construction at `main.rs:312` (left in place by step 2f-α) with a read-only manifest peek + layout-aware spawn. When an on-disk manifest declares `layout == PerShard` AND `--shards >= 2`, main.rs now spawns one `per_shard_aof_writer_task` per shard and returns `AofWriterPool::per_shard(senders)` instead of the single-writer TopLevel pool.

Scope: main.rs only
-------------------
`embedded.rs` and `server/listener.rs` are deliberately untouched. Both run the tokio single-file legacy AOF path (`aof_writer_task` opens `<dir>/<appendfilename>`) and never engage the manifest by design; see the comment block at `embedded.rs:222-235`. Adding a PerShard branch in either would risk Qodo bug #3 (incr-only replay on the next boot silently dropping data). `listener.rs` is the tokio single-shard path: per-shard fan-out has no meaning with one shard, so it inherits TopLevel from `AofWriterPool::top_level(tx)` at the construction site.

The new branching logic
-----------------------
src/main.rs (L308-419)

1. If `appendonly == "yes"`: AofManifest::load(&base_dir)
   - Ok(Some(m)) → continue with existing manifest
   - Ok(None) → no manifest yet (fresh install)
   - Err(_) → **fatal exit (2)** with the same "refusing to start to avoid data loss" message used by the replay block at L514-526. Mirroring this is load-bearing: silently falling back to TopLevel on a corrupt manifest would let the next write create a fresh manifest that overwrites the reference to the real base RDB, losing data.
2. If a manifest was loaded: `verify_shard_count(num_shards as u16)`. Mismatch is fatal (exit 2) with the verbatim RFC § 3 error ("ERR shard count changed (manifest=N, config=M); refusing to start to avoid data loss. See docs/runbooks/shard-count-change.md").
3. Spawn decision:
       use_per_shard = manifest.is_some() && manifest.layout == PerShard && num_shards >= 2
4. If `use_per_shard`:
   - for sid in 0..num_shards:
         (tx, rx) = channel::mpsc_bounded::<AofMessage>(10_000)
         thread `aof-writer-{sid}` running `per_shard_aof_writer_task(rx, base_dir, sid as u16, fsync, cancel)`
         push tx to senders
   - return `Some(AofWriterPool::per_shard(senders))`
   Else (existing TopLevel path):
   - single `aof-writer` thread running `aof_writer_task` against `<dir>/<appendfilename>`
   - return `Some(AofWriterPool::top_level(tx))`

What this does NOT do (deferred)
--------------------------------
- **Fresh-install PerShard creation.** `AofManifest::initialize()` still hardcodes TopLevel; nothing in main.rs constructs a PerShard manifest from scratch. The PerShard branch is therefore reachable only by:
  a) hand-crafting a v2 manifest (the smoke test below)
  b) future migration logic (RFC step 5/9 territory)
  Until then, runtime behavior under default configurations is byte-identical to step 2f-α.
- **Multi-part AOF replay for multi-shard.** The replay block at `main.rs:528` still gates on `num_shards == 1`. Step 4 of the RFC closes this. A PerShard manifest with `num_shards >= 2` will spawn the writers correctly (smoke verified) and the writers will tail the existing incr files, but boot-time replay still warns "Multi-part AOF skipped in multi-shard mode".
- **TopLevel→PerShard auto-migration.** `migrate_top_level_to_per_shard` exists in `aof_manifest.rs` (step 1) but is not wired into boot.
- **AppendSync rendezvous, LSN tagging, cross-shard merge, CRASH-01 matrix.** Steps 3, 5, 7, 8 of the RFC.
- **Lifting the `--unsafe-multishard-aof` gate.** Step 9. The L280 refusal still fires whenever `num_shards >= 2 && appendonly == "yes"` unless the operator explicitly opts in.

Manual smoke verification
-------------------------
Built `target/debug/moon` and ran four hand-crafted scenarios from `/tmp/moon-smoke-*` directories (cleaned up post-run):

1. **PerShard happy path.** Hand-wrote

       version 2
       seq 1
       shards 2
       shard 0 max_lsn 0
       shard 1 max_lsn 0

   at `appendonlydir/moon.aof.manifest`, created shard-0/ and shard-1/ dirs. Started with

       moon --port 16399 --shards 2 --unsafe-multishard-aof --appendonly yes --dir <smoke> --appendfsync everysec

   Log output:

       "AOF enabled (PerShard, 2 writers, fsync: EverySec)"
       "AOF writer shard 0: seq 1, incr=<smoke>/appendonlydir/shard-0/moon.aof.1.incr.aof"
       "AOF writer shard 1: seq 1, incr=<smoke>/appendonlydir/shard-1/moon.aof.1.incr.aof"

   Both per-shard writer tasks reached their per-shard incr files.
2. **Shard-count mismatch.** Same manifest, started with `--shards 4`. Process exited 2 with verbatim: "REFUSING TO START: ERR shard count changed (manifest=2, config=4); refusing to start to avoid data loss. See docs/runbooks/shard-count-change.md"
3. **Corrupt manifest.** Wrote garbage at the manifest path, started with `--shards 1`. Process exited 2 with: "REFUSING TO START: AOF manifest at <dir>/appendonlydir/ is corrupt: AOF manifest at .../moon.aof.manifest has no valid sequence number. Inspect manually before deleting; overwriting silently loses data."
4. **TopLevel regression.** Fresh empty `--dir`, `--shards 1 --appendonly yes`. Log: "AOF enabled (TopLevel, fsync: EverySec)". `initialize()` wrote v1 manifest + seq 1 base/incr. Behavior identical to step 2f-α.

Verification
------------
cargo check on both feature combinations:
    --no-default-features --features runtime-tokio,jemalloc    clean
    (defaults: runtime-monoio,jemalloc,graph,text-index)       clean

cargo clippy -- -D warnings on both combinations: clean.

Lib persistence tests:
    tokio: 379 passed (baseline match)
    monoio: 378 passed (baseline match)

Refs
----
tmp/rfc-per-shard-aof-v02.md (RFC § 3 + § 4)
Commit 5a546ff (step 2a: AofWriterPool type)
Commit 3afe21f (step 2b: per_shard_aof_writer_task body)
Commit 6a758f4 (step 2c: type plumbing aof_tx → aof_pool)
Commit a05f3d8 (step 2d: handler_monoio migration)
Commit eb90419 (step 2e-α: handler_sharded migration)
Commit 5735031 (step 2e-β: BGREWRITEAOF helpers)
Commit ceac655 (step 2e-γ: handler_single + blocking + inline)
Commit d9a3651 (step 2e-δ: drop ConnectionContext.aof_tx)
Commit 4fdd50f (test backfill: unsafe_multishard_aof)
Commit 8fd769c (step 2f-α: spawn-site type plumbing)

author: Tin Dang
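
The branching logic of step 2f-β (fatal on a corrupt manifest, fatal on a shard-count mismatch, PerShard only when the manifest says so and there are at least two shards) reduces to a small pure function. The sketch below is an approximation for illustration, not the code in `main.rs`: `Manifest`, `Spawn`, `Fatal`, and `decide` are hypothetical names, the load result is modelled as `Result<Option<Manifest>, String>`, and every manifest is assumed to record a shard count.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Layout {
    TopLevel,
    PerShard,
}

/// Toy manifest: only the fields the decision needs.
#[derive(Debug, Clone, PartialEq)]
struct Manifest {
    layout: Layout,
    shards: u16,
}

/// What the boot path should spawn.
#[derive(Debug, PartialEq)]
enum Spawn {
    TopLevel,
    PerShard { writers: u16 },
}

/// Conditions under which the process refuses to start.
#[derive(Debug, PartialEq)]
enum Fatal {
    CorruptManifest,
    ShardCountChanged { manifest: u16, config: u16 },
}

fn decide(loaded: Result<Option<Manifest>, String>, num_shards: u16) -> Result<Spawn, Fatal> {
    // A load error is fatal: falling back to TopLevel could overwrite
    // the reference to the real base file on the next write.
    let manifest = loaded.map_err(|_| Fatal::CorruptManifest)?;

    // An existing manifest must agree with the configured shard count.
    if let Some(m) = &manifest {
        if m.shards != num_shards {
            return Err(Fatal::ShardCountChanged {
                manifest: m.shards,
                config: num_shards,
            });
        }
    }

    let use_per_shard =
        matches!(&manifest, Some(m) if m.layout == Layout::PerShard) && num_shards >= 2;
    Ok(if use_per_shard {
        Spawn::PerShard { writers: num_shards }
    } else {
        Spawn::TopLevel
    })
}
```

The four smoke scenarios map onto this function: a PerShard manifest with matching shards yields `Spawn::PerShard`, a changed `--shards` value and a garbage manifest both yield a fatal result, and a fresh directory with no manifest yields `Spawn::TopLevel`.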

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

src/command/persistence.rs (1)
34-47: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Keep the unsafe-gate message aligned with the actual gate.

The new docs/error text still says this refusal only applies when --disk-offload enable is set, but the PR objective describes the startup/runtime gate as --shards >= 2 plus --appendonly yes. If MULTI_SHARD_AOF_REWRITE_UNSAFE is set under that broader condition, this response tells operators to try the wrong workaround (--disk-offload disable).

Also applies to: 284-287
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/command/persistence.rs` around lines 34 - 47, The refusal
message/documentation around MULTI_SHARD_AOF_REWRITE_UNSAFE is misleading:
update the error/runbook text and any related doc comments (including the
instance at the later lines ~284-287) to reflect the actual gate condition
(shards >= 2 AND --appendonly yes) rather than suggesting it only applies when
--disk-offload enable; search for and edit the strings emitted by
bgrewriteaof_start_sharded and the public comment for
MULTI_SHARD_AOF_REWRITE_UNSAFE to mention the correct combination (shards >= 2 +
appendonly yes) and adjust suggested operator workarounds/runbook pointers
accordingly.

src/server/embedded.rs (1)
123-139: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Refuse multi-shard AOF in run_embedded too.

Lines 123-139 still start the legacy top-level AOF writer for any appendonly == "yes", but this path never enforces the new shards >= 2 && !unsafe_multishard_aof gate. That leaves embedded deployments on the known write-loss path this PR is supposed to block. Please reuse the same startup validation here before spawning the writer.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/server/embedded.rs` around lines 123 - 139, The code unconditionally
starts the top-level AOF writer when config.appendonly == "yes" without checking
multi-shard safety; update the block that spawns the AOF writer in run_embedded
to reuse the same startup validation used elsewhere: check the shard count
(shards) and the unsafe_multishard_aof flag (or config.unsafe_multishard_aof)
and refuse/return an error if shards >= 2 and unsafe_multishard_aof is false,
before creating the channel, token, and calling AofWriterPool::top_level or
spawning aof::aof_writer_task; ensure you mirror the exact error message/flow
used in the other validation path so embedded deployments cannot start the
legacy top-level writer in unsafe multi-shard configurations.
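
A minimal sketch of a shared startup check that both the main startup path and `run_embedded` could call before spawning any AOF writer. The struct `GateConfig`, the function `validate_multishard_aof`, and the error wording are hypothetical; only the condition itself (`shards >= 2`, `appendonly == "yes"`, and no `unsafe_multishard_aof` override) comes from the review text.

```rust
/// Hypothetical subset of the server config. Field names follow the review
/// text; the struct itself is an assumption made for this sketch.
pub struct GateConfig {
    pub shards: usize,
    pub appendonly: String,
    pub unsafe_multishard_aof: bool,
}

/// Returns Err with an operator-facing message when the configuration would
/// take the known multi-shard AOF write-loss path.
pub fn validate_multishard_aof(cfg: &GateConfig) -> Result<(), String> {
    let aof_on = cfg.appendonly == "yes";
    if aof_on && cfg.shards >= 2 && !cfg.unsafe_multishard_aof {
        // Illustrative wording only; the real message lives in the PR.
        return Err(format!(
            "refusing to start: --shards {} with --appendonly yes can lose writes; \
             pass --unsafe-multishard-aof to override",
            cfg.shards
        ));
    }
    Ok(())
}
```

Calling one function from every entry point keeps the embedded and standalone paths from drifting apart, which is the failure mode this comment describes.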

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/server/conn/handler_sharded/mod.rs`:
- Around line 1122-1123: The current AOF serialization captures aof_bytes from
the original frame (variable frame) before workspace prefix injection, which
records client-visible keys and can replay to wrong shards; change the logic to
compute aof_bytes from the post-rewrite command used for execution—i.e.,
serialize the actual dispatched command (use dispatch_frame or the rewritten
cmd_args that include the {ws_id} injection) when is_write is true and
ctx.aof_pool.is_some(); keep the same is_write calculation
(metadata::is_write(cmd)) but ensure aof::serialize_command is called on the
final command used for dispatch/execution instead of the pre-injection frame so
AOF reflects the physical stored keys.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 717300dd-c55e-4336-83a6-22aca7f3b2c6

📥 Commits

Reviewing files that changed from the base of the PR and between a05f3d8 and 5004f4e.

📒 Files selected for processing (25)

src/command/persistence.rs
src/main.rs
src/server/conn/blocking.rs
src/server/conn/core.rs
src/server/conn/handler_monoio/dispatch.rs
src/server/conn/handler_monoio/mod.rs
src/server/conn/handler_sharded/dispatch.rs
src/server/conn/handler_sharded/mod.rs
src/server/conn/handler_single.rs
src/server/conn/tests.rs
src/server/conn_state.rs
src/server/embedded.rs
src/server/listener.rs
src/shard/conn_accept.rs
src/shard/event_loop.rs
tests/ft_search_multi_shard_as_of.rs
tests/ft_search_temporal_parity.rs
tests/integration.rs
tests/kill_snapshot.rs
tests/mq_integration.rs
tests/replication_test.rs
tests/txn_ft_search_snapshot.rs
tests/txn_kv_wiring.rs
tests/vacuum_commands.rs
tests/workspace_integration.rs

🚧 Files skipped from review as they are similar to previous changes (4)

src/server/conn_state.rs
src/main.rs
src/server/conn/handler_monoio/mod.rs
src/shard/conn_accept.rs

coderabbitai · 2026-05-27T08:13:44Z

+                    let is_write = if ctx.aof_pool.is_some() || conn.tracking_state.enabled { metadata::is_write(cmd) } else { false };
+                    let aof_bytes = if is_write && ctx.aof_pool.is_some() { Some(aof::serialize_command(&frame)) } else { None };


⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Serialize the rewritten command into AOF.

aof_bytes is captured from frame before workspace prefix injection, but both the local path (cmd_args) and the remote path (dispatch_frame) can execute the {ws_id}-prefixed form. In a workspace session this persists the client-visible key instead of the physical stored key, so AOF replay diverges from live state and can even route the write to the wrong shard.

Suggested fix

- let is_write = if ctx.aof_pool.is_some() || conn.tracking_state.enabled { metadata::is_write(cmd) } else { false };
- let aof_bytes = if is_write && ctx.aof_pool.is_some() { Some(aof::serialize_command(&frame)) } else { None };
+ let is_write = if ctx.aof_pool.is_some() || conn.tracking_state.enabled {
+     metadata::is_write(cmd)
+ } else {
+     false
+ };
+ let aof_bytes = if is_write && ctx.aof_pool.is_some() {
+     Some(match rewritten.as_deref() {
+         Some(rewritten_args) => {
+             let mut parts = Vec::with_capacity(1 + rewritten_args.len());
+             parts.push(Frame::BulkString(Bytes::copy_from_slice(cmd)));
+             parts.extend_from_slice(rewritten_args);
+             aof::serialize_command(&Frame::Array(parts.into()))
+         }
+         None => aof::serialize_command(&frame),
+     })
+ } else {
+     None
+ };


…tep 3)

Threads a real `lsn: u64` through every AOF append site and prefixes each PerShard on-disk entry with `[u64 lsn LE][u32 len LE]` ahead of the RESP-encoded command, matching RFC § 2 Rule 1 wire format. TopLevel writers continue to emit plain RESP: the framing change is gated on layout, so legacy single-file deployments and the embedded/listener tokio paths are unaffected.

LSN sourcing: a new `ReplicationState::issue_lsn(shard_id, delta)` helper atomically advances both `shard_offsets[shard_id]` and `master_repl_offset`, returning the master offset *before* the bump. Existing `increment_shard_offset` delegates through it so call sites that previously used the legacy helper are unchanged. AOF write sites go through a new associated function `AofWriterPool::issue_append_lsn(repl_state, shard_id, delta)` that issues an LSN when replication state is configured and returns 0 otherwise, keeping standalone (no-replication) and replica startup paths working without a behavioural change.

Wire-level changes:
- `AofMessage::Append(Bytes)` → `AofMessage::Append { lsn: u64, bytes: Bytes }`
- `AofWriterPool::try_send_append(shard_id, lsn, bytes)` (new lsn arg)
- TopLevel writer (tokio + monoio): destructures `{ bytes, lsn: _ }`, ignores LSN, writes plain RESP exactly as before.
- PerShard writer: writes the 12-byte header then bytes; verified on disk via `xxd`: shard 0 entries carry monotonically advancing LSNs (0 → 0x69), shard 1 carries its own per-shard sequence (0x46).

Call-site fan-out (every place that constructs or dispatches `AofMessage::Append`):
- `handler_monoio`, `handler_sharded`: 4 sites each, use `AofWriterPool::issue_append_lsn`.
- `handler_single`, `blocking::try_inline_dispatch{,_loop}`: now take `&Option<Arc<RwLock<ReplicationState>>>` so the inline AOF path can source an LSN; 11 test sites updated to pass `&None` (Rust infers the Option type from the slot).
- `drain_pending_appends` (rewrite path): keeps the lsn field, threads it through the per-message destructure but never reads it because rewrite output is the TopLevel base.rdb/incr.aof file.

Tests:
- 4 existing pool tests updated to the new signature.
- New `per_shard_pool_threads_lsn_field_to_each_writer` test verifies the LSN survives the channel hop unmodified for each shard.
- Persistence tests: 379 pass under tokio, 379 under monoio (+1 each).
- Replication tests: 31 pass.
- Full lib tests (tokio): 2752 pass.
- Smoke test on a 2-shard server: PerShard manifest spawns 2 writers, framed format verified on disk for both shards; TopLevel regression smoke confirms plain RESP at offset 0 with no header bytes.

Rule 3 (single LSN issuance point) limitation, called out explicitly:

Step 3 ships the per-entry framing and monotonic per-shard LSN tagging that step 4 (per-shard replay) requires. Strict Rule 3 alignment (making the AOF LSN equal the per-shard replication backlog byte position for the same write) is NOT achieved by this commit. SPSC-routed writes hit both `master_repl_offset.fetch_add` at `spsc_handler.rs:3017` (existing) and at the new AOF write site (`AofWriterPool::issue_append_lsn`), so master advances twice per such write. Fix is a single-LSN-issuance-point refactor in v0.2 replication state; out of step 3 scope. Step 4 only depends on per-shard monotonicity, which this commit provides and the smoke test confirms.

Refs: tmp/rfc-per-shard-aof-v02.md § 2, § 3
author: Tin Dang
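
The entry layout and the "return the offset before the bump" behaviour described in the step 3 commit above can be sketched as follows. This is an illustration under assumptions, not the PR's code: `LsnCounters` and `frame_entry` are hypothetical names, the memory ordering is a guess, and the two counters are bumped with two separate atomic operations rather than one joint step.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical stand-in for the two counters named in the commit message.
pub struct LsnCounters {
    pub master_repl_offset: AtomicU64,
    pub shard_offsets: Vec<AtomicU64>,
}

impl LsnCounters {
    pub fn new(shards: usize) -> Self {
        Self {
            master_repl_offset: AtomicU64::new(0),
            shard_offsets: (0..shards).map(|_| AtomicU64::new(0)).collect(),
        }
    }

    /// Advances the shard counter and the master counter by `delta` and
    /// returns the master offset as it was before the bump.
    pub fn issue_lsn(&self, shard_id: usize, delta: u64) -> u64 {
        self.shard_offsets[shard_id].fetch_add(delta, Ordering::Relaxed);
        // fetch_add returns the previous value, i.e. the pre-bump offset.
        self.master_repl_offset.fetch_add(delta, Ordering::Relaxed)
    }
}

/// Builds one PerShard entry: 8-byte LE LSN, 4-byte LE length, then payload.
/// Returns None when the payload length does not fit the u32 length field,
/// instead of narrowing it silently with `as u32`.
pub fn frame_entry(lsn: u64, resp: &[u8]) -> Option<Vec<u8>> {
    let len = u32::try_from(resp.len()).ok()?;
    let mut out = Vec::with_capacity(12 + resp.len());
    out.extend_from_slice(&lsn.to_le_bytes());
    out.extend_from_slice(&len.to_le_bytes());
    out.extend_from_slice(resp);
    Some(out)
}
```

Because the returned value is the pre-bump offset, the first write on a fresh state is tagged with LSN 0, which is consistent with the `0 → 0x69` range reported from `xxd`.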

…ap (Option B step 4)

Replaces the `warn!("Multi-part AOF skipped in multi-shard mode")` branch in `main.rs` with a real per-shard replay path. With this, a `--shards N`/`--appendonly yes` deployment that crashed and was restarted now recovers all on-disk state instead of dropping it on the floor, closing the P0 lying behind the `--unsafe-multishard-aof` gate.

Implementation:
- `aof_manifest::replay_incr_framed(databases, data, engine)` parses the step-3 wire format `[u64 lsn LE][u32 len LE][RESP]` and returns `(commands_replayed, max_lsn)`. Truncated headers and truncated payloads are treated as crash-time EOF (parity with `replay_incr_resp`); a header that fully declares a payload which then fails to parse is escalated as corruption.
- `aof_manifest::replay_per_shard(per_shard_databases, manifest, engine)` walks `manifest.shards` and for each shard loads `shard_base_path` into that shard's `&mut [Database]` slice, then replays `shard_incr_path` through `replay_incr_framed`. Per-shard work is sequential for step 4 (cold-path correctness over throughput); a parallel implementation is a future optimization once CRASH-01-LITE soaks the sequential path.
- `ReplicationState::seed_master_offset(lsn)` uses `fetch_max` to bring `master_repl_offset` up to the global AOF max-LSN before client traffic is accepted. This follows RFC § 2 Rule 3: otherwise the next write would reissue an LSN already present on disk and break the backlog merge. Per-shard offsets are intentionally NOT seeded (issue_lsn advances them on the first write; pre-seeding would double-count).

`main.rs` integration:
- The existing `if num_shards == 1` branch is unchanged (TopLevel and single-shard PerShard both keep routing through `replay_multi_part`).
- New `else if manifest.layout == PerShard` branch clears each shard's databases (same wipe-then-replay invariant as the single-shard arm), walks `shards.split_first_mut()` to build a `Vec<&mut [Database]>` without aliasing, calls `replay_per_shard`, seeds `repl_state` via `seed_master_offset`, then retires any stray legacy `appendonly.aof` so v2 recovery on next boot does not double-replay.
- A multi-shard config that finds a TopLevel manifest (operator did not run `migrate-aof`) gets a loud warn: no silent skip, no replay, unchanged from the previous skip behaviour but with an actionable hint.

Tests (all under `tests_v2`, single-threaded due to a pre-existing `temp_dir()` race in earlier tests in this module; the flake is unrelated):
- `replay_incr_framed_decodes_lsn_and_resp`: two framed PING/DBSIZE entries decode in order and return the correct max LSN.
- `replay_incr_framed_truncated_header_is_crash_eof`: partial header trailing one good entry returns Ok(1, lsn).
- `replay_incr_framed_truncated_payload_is_crash_eof`: declared payload longer than file returns Ok(0, 0).
- `replay_incr_framed_complete_but_corrupt_payload_errors`: full payload that fails RESP parse escalates as an error.
- `replay_per_shard_round_trips_two_shards`: initialize_multi(2), hand-write framed SETs per shard, replay through `DispatchReplayEngine`, verify keys landed in their own DBs and global_max_lsn == max(per-shard maxes).
- `replay_per_shard_rejects_shard_count_mismatch`: slice count ≠ manifest.shards.len() returns the verbatim error path.

Verification:
- `cargo check` (default monoio): clean.
- `cargo check --no-default-features --features runtime-tokio,jemalloc`: clean.
- `cargo clippy -- -D warnings` (both feature combos): zero warnings.
- `cargo test -p moon --lib persistence:: -- --test-threads=1`: 377 pass.
- New tests on tokio: 2/2 pass (`replay_per_shard_*`).

Out of scope (deferred to later steps per RFC § 8):
- Cross-shard ordering merge for TXN + SCRIPT (step 5).
- Two-phase rendezvous `AppendSync { bytes, ack }` for `appendfsync=always` (step 7).
- CRASH-01-LITE end-to-end soak (step 8).
- Lifting the `--unsafe-multishard-aof` gate itself (step 9, gated on step 8 green).

Refs: tmp/rfc-per-shard-aof-v02.md § 2, § 4
author: Tin Dang
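
The crash-time EOF handling and the `fetch_max` seeding described in the step 4 commit above can be sketched like this. It is a simplified illustration: `decode_framed` is a hypothetical name, it returns raw payload slices instead of parsing RESP and dispatching to an engine, and it omits the "complete but corrupt payload" error path.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Walks `[u64 lsn LE][u32 len LE][payload]` entries and returns the decoded
/// `(lsn, payload)` pairs plus the largest LSN seen. A truncated header or a
/// truncated payload at the tail is treated as crash-time EOF, not an error.
pub fn decode_framed(data: &[u8]) -> (Vec<(u64, &[u8])>, u64) {
    let mut entries = Vec::new();
    let mut max_lsn = 0u64;
    let mut offset = 0usize;
    // `get` returns None when fewer than 12 header bytes remain.
    while let Some(header) = data.get(offset..offset + 12) {
        let mut lsn_buf = [0u8; 8];
        lsn_buf.copy_from_slice(&header[..8]);
        let mut len_buf = [0u8; 4];
        len_buf.copy_from_slice(&header[8..]);
        let lsn = u64::from_le_bytes(lsn_buf);
        let len = u32::from_le_bytes(len_buf) as usize;

        let start = offset + 12;
        let Some(end) = start.checked_add(len) else { break };
        // Declared payload runs past the end of the file: stop cleanly.
        let Some(payload) = data.get(start..end) else { break };

        entries.push((lsn, payload));
        max_lsn = max_lsn.max(lsn);
        offset = end;
    }
    (entries, max_lsn)
}

/// Raises `master` to at least `lsn`; never lowers it.
pub fn seed_master_offset(master: &AtomicU64, lsn: u64) {
    master.fetch_max(lsn, Ordering::Relaxed);
}
```

Using `fetch_max` rather than a plain store means a seed taken from an older or shorter AOF cannot move the offset backwards.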

…B step 5)

Ships the framing + recovery infrastructure that lets a future cross-shard TXN or replicated SCRIPT command be replayed atomically across shards, per RFC § 2 Rule 2.

Wire-level encoding (zero impact on existing entries):
- `ORDERED_LSN_FLAG = 1 << 63` reserved as the per-entry OrderedAcrossShards marker. Practical LSN ceiling even at 10 M writes/s for a century is near 2^58, so reserving bit 63 has no observable effect on normal writes: every entry produced by `try_send_append` keeps it clear.
- `AofWriterPool::try_send_append_ordered(shard_id, lsn, bytes)` is the new producer entry point. It debug-asserts `lsn & FLAG == 0` and ORs the flag into the LSN before queueing. Today's call sites: none in production code; only `cfg(test)` exercises this path so the round-trip is verified end-to-end before a real consumer wires in.

Recovery:
- `persistence::aof_manifest::OrderedEntry { shard_id, lsn, bytes }` is the buffered representation a `replay_incr_framed` decode produces when it sees the flag.
- `replay_incr_framed` gains `(shard_id, ordered_buf)` parameters. The high bit is masked off before the LSN is stored in the buffer or compared against `max_lsn`, so the buffer carries true LSNs. Inline (non-ordered) entries continue to be dispatched immediately as before.
- `replay_per_shard` now returns `(total, global_max_lsn, Vec<OrderedEntry>)`. Ordered entries are deliberately NOT replayed inline (per-shard ordering alone does not preserve cross-shard atomicity).
- `replay_ordered_merge(per_shard_databases, entries, engine)` sorts entries by LSN globally then dispatches each one to its origin shard's databases. It also audits per-LSN cardinality and emits a `warn!` when an LSN is unevenly represented across shards, the forensic signal of a torn cross-shard commit. Detecting and rolling back torn commits is out of scope for step 5 (no production emitter yet to define those semantics).

main.rs integration:
- After per-shard replay finishes, the boot path calls `replay_ordered_merge` if `ordered_entries` is non-empty. The `DispatchReplayEngine` is reused so behaviour matches the inline path. Empty buffer is the common case today (no emitter), so the cost is one length check on the hot recovery path.

Tests (under `tests_v2`, single-threaded due to pre-existing temp_dir race in earlier tests):
- `replay_incr_framed_buffers_ordered_entries`: mixed inline+ordered stream: inline entries dispatch via the engine, ordered entries land in the buffer with the high bit stripped, `max_lsn` reflects both.
- `replay_ordered_merge_sorts_by_lsn_across_shards`: three entries spanning two shards, wire-order ≠ LSN-order: merge sorts then dispatches to the correct shard databases.
- `replay_ordered_merge_empty_returns_zero`: empty buffer is Ok(0).
- `ordered_entry_lsn_flag_set_via_try_send_append_ordered`: end-to-end round trip from `try_send_append_ordered` through the channel back to a consumer observes the flag set and the low bits preserved.

The four pre-existing step-4 tests were updated for the new `replay_incr_framed` (shard_id + ordered_buf) and `replay_per_shard` (3-tuple) signatures; their assertions are unchanged.

Verification:
- `cargo check` both feature combos: clean.
- `cargo clippy -- -D warnings` both feature combos: zero warnings.
- `cargo test persistence:: -- --test-threads=1`: 381 pass (was 377, +4 new tests).
- `cargo test persistence::aof_manifest::tests_v2 --no-default-features --features runtime-tokio,jemalloc -- --test-threads=1`: 22 pass.

Out of scope (deferred per RFC § 8):
- A real production emitter for ordered entries (gated on a future cross-shard TXN command landing).
- Torn-commit rollback semantics (need the emitter's contract first).
- Two-phase rendezvous `AppendSync { bytes, ack }` (step 7).
- CRASH-01-LITE end-to-end soak (step 8).
- Lifting `--unsafe-multishard-aof` (step 9, gated on step 8 green).

Refs: tmp/rfc-per-shard-aof-v02.md § 2 (Rule 2)
author: Tin Dang
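
The high-bit marker and the global LSN sort described in the step 5 commit above can be sketched as follows. The helper names `tag_ordered`, `split_lsn`, and `merge_order` are hypothetical, `bytes` is a `Vec<u8>` here purely to keep the example dependency-free, and the sketch returns `None` where the commit message says the real producer uses a debug assertion.

```rust
/// Bit 63 of the on-disk LSN marks an entry as ordered across shards.
pub const ORDERED_LSN_FLAG: u64 = 1 << 63;

/// Sets the ordered marker. Returns None if bit 63 is already in use,
/// which would make the flag ambiguous on replay.
pub fn tag_ordered(lsn: u64) -> Option<u64> {
    if lsn & ORDERED_LSN_FLAG != 0 {
        None
    } else {
        Some(lsn | ORDERED_LSN_FLAG)
    }
}

/// Splits a raw on-disk LSN into `(true_lsn, is_ordered)`.
pub fn split_lsn(raw: u64) -> (u64, bool) {
    (raw & !ORDERED_LSN_FLAG, raw & ORDERED_LSN_FLAG != 0)
}

/// Buffered ordered entry; `lsn` holds the true LSN with the flag stripped.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct OrderedEntry {
    pub shard_id: usize,
    pub lsn: u64,
    pub bytes: Vec<u8>,
}

/// Sorts buffered entries by LSN across all shards. The sort is stable, so
/// entries sharing an LSN keep their buffered order.
pub fn merge_order(mut entries: Vec<OrderedEntry>) -> Vec<OrderedEntry> {
    entries.sort_by_key(|e| e.lsn);
    entries
}
```

After sorting, each entry would be dispatched to the databases of its own `shard_id`; that dispatch step depends on the replay engine and is left out here.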

coderabbitai

Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/persistence/aof_manifest.rs`:
- Around line 1198-1239: The parser may return a single RESP frame while leaving
trailing bytes in buf, causing extra garbage to be silently dropped; after
calling parse::parse(&mut buf, &config) in the block that builds buf from
data[payload_start..payload_end], check that buf.is_empty() (or that the number
of bytes consumed equals payload_end - payload_start) and if not return a
crate::error::MoonError::from(crate::error::AofError::RewriteFailed { detail:
... }) indicating "framed payload contains extra bytes" (include offset and
lsn), before calling engine.replay_command(databases, cmd, cmd_args, &mut
selected_db); this ensures any trailing garbage or extra RESP commands in the
payload are treated as corrupt rather than silently dropped.
- Around line 1179-1180: The code in the replay path uses try_into().expect() to
build fixed-size arrays for lsn and len; replace the expect-based conversions in
this block (the two lines that produce lsn and len) with stack-allocated
fixed-size buffers and copy_from_slice to avoid panics: create a [u8;8] buffer,
copy data[offset..offset+8] into it and call u64::from_le_bytes on that buffer
to produce lsn, and likewise create a [u8;4] buffer, copy
data[offset+8..offset+12] into it and call u32::from_le_bytes to produce len,
thereby preserving bounds-checked behavior without using expect()/unwrap().

In `@src/persistence/aof.rs`:
- Around line 819-823: The Append branch that builds the 12-byte header
currently casts data.len() to u32 which will truncate payloads >4GiB; add an
explicit check in the AofMessage::Append handling (the block that prepares
header and calls writer.write_all) to validate that data.len() <= u32::MAX and
return or propagate an error if it exceeds that limit (or alternatively
implement chunking before framing), update the header construction to only run
after the check, and apply the same guard to the other occurrence noted (the
similar block around lines 972–975) so no payload is ever narrowed silently.
- Around line 819-829: The writer loop that handles AofMessage::Append must not
just continue on writer.write_all failures because partial writes corrupt framed
entries; in the block handling AofMessage::Append (referencing
AofMessage::Append, writer.write_all, and shard_id) replace the `continue`
behavior with logic that stops the writer loop and surfaces or logs the error
(e.g., break/return from the task or close the writer and propagate the error)
so the shard stops producing further writes after a partial header or payload
write, preventing replay_incr_framed from encountering a truncated/partial
entry.
- Around line 192-204: The function issue_append_lsn currently takes
Option<Arc<std::sync::RwLock<ReplicationState>>> and uses .read().ok() to handle
std::sync::RwLock poisoning; change the signature to
Option<Arc<parking_lot::RwLock<ReplicationState>>> and update all call sites to
pass the parking_lot RwLock (modify where ReplicationState is wrapped in
src/replication/master.rs and src/replication/replica.rs). Replace the
.read().ok().map(...) chain with a direct .read().map(...)
(parking_lot::RwLock::read does not return a Result), and remove the
unwrap_or(0) poison-handling branch so the function simply maps through the
guard to call ReplicationState::issue_lsn(shard_id, delta as u64) or returns 0
when the Option is None as before.

In `@src/replication/state.rs`:
- Around line 119-166: Recovery is seeding master_repl_offset with the observed
start LSN, but issue_lsn() returns the pre-increment value so you must seed with
the next free offset to avoid reissuing an LSN; change the recovery code to
compute next_free = observed_start_lsn + payload_len (or the entry's end LSN)
for each recovered AOF entry and call seed_master_offset(next_free)
(seed_master_offset should keep using master_repl_offset.fetch_max(next_free,
Ordering::Relaxed)); ensure any place that currently passes the start LSN is
updated to pass the computed next_free so the master offset is advanced past
on-disk entries.
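
The arithmetic behind the last finding can be sketched as follows. It assumes, as the finding does, that each write advances the offset by its payload length; the PR text does not confirm what `delta` is at every call site. `next_free_offset` is a hypothetical helper name.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Given `(start_lsn, payload_len)` pairs recovered from disk, returns the
/// first offset that no recovered entry occupies (0 for an empty log).
pub fn next_free_offset(entries: &[(u64, u32)]) -> u64 {
    entries
        .iter()
        .map(|&(lsn, len)| lsn.saturating_add(u64::from(len)))
        .max()
        .unwrap_or(0)
}

/// Seeds `master` past all recovered entries, then issues one LSN the way a
/// pre-bump `fetch_add` would, returning the issued value.
pub fn seed_then_issue(master: &AtomicU64, entries: &[(u64, u32)], delta: u64) -> u64 {
    master.fetch_max(next_free_offset(entries), Ordering::Relaxed);
    master.fetch_add(delta, Ordering::Relaxed)
}
```

With entries at LSN 0 (10 bytes) and LSN 10 (5 bytes), seeding with the largest start LSN (10) would let the next write be tagged 10 again, while seeding with the next free offset (15) tags it 15.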

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d996bceb-e894-4303-8aaa-5fb3f47f02e6

📥 Commits

Reviewing files that changed from the base of the PR and between 5004f4e and b59ae4d.

📒 Files selected for processing (9)

src/main.rs
src/persistence/aof.rs
src/persistence/aof_manifest.rs
src/replication/state.rs
src/server/conn/blocking.rs
src/server/conn/handler_monoio/mod.rs
src/server/conn/handler_sharded/mod.rs
src/server/conn/handler_single.rs
src/server/conn/tests.rs

🚧 Files skipped from review as they are similar to previous changes (4)

src/server/conn/handler_sharded/mod.rs
src/server/conn/handler_single.rs
src/server/conn/tests.rs
src/server/conn/handler_monoio/mod.rs

coderabbitai · 2026-06-01T08:23:02Z

+        let lsn = u64::from_le_bytes(data[offset..offset + 8].try_into().expect("8 bytes"));
+        let len = u32::from_le_bytes(data[offset + 8..offset + 12].try_into().expect("4 bytes"))


🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Avoid expect() in library replay code.

These two expect()s are on a hot library path and the bounds have already been checked, so this can stay panic-free with fixed-size buffers plus copy_from_slice().

As per coding guidelines, src/**/*.rs: No unwrap() or expect() in library code outside tests. Use pattern matching, if let, or let-else.
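
A minimal sketch of the panic-free header read this comment suggests, assuming the 12-byte `[u64 lsn LE][u32 len LE]` header from step 3. `read_header` is a hypothetical helper name, not a function from the PR.

```rust
/// Reads the 12-byte entry header at `offset`. Returns None when fewer than
/// 12 bytes remain, so the caller can treat that as crash-time EOF.
pub fn read_header(data: &[u8], offset: usize) -> Option<(u64, u32)> {
    let end = offset.checked_add(12)?;
    let header = data.get(offset..end)?;

    // `header` is exactly 12 bytes long here, so both copies below have
    // matching source and destination lengths.
    let mut lsn_buf = [0u8; 8];
    lsn_buf.copy_from_slice(&header[..8]);
    let mut len_buf = [0u8; 4];
    len_buf.copy_from_slice(&header[8..]);

    Some((u64::from_le_bytes(lsn_buf), u32::from_le_bytes(len_buf)))
}
```

Folding the bounds check into `get` keeps the check and the read in one place, so a later edit cannot separate them the way a distant `if` plus `expect()` can.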


coderabbitai · 2026-06-01T08:23:02Z

+        let mut buf = BytesMut::from(&data[payload_start..payload_end]);
+        match parse::parse(&mut buf, &config) {
+            Ok(Some(frame)) => {
+                let (cmd, cmd_args) = match &frame {
+                    Frame::Array(arr) if !arr.is_empty() => {
+                        let name = match &arr[0] {
+                            Frame::BulkString(s) => s.as_ref(),
+                            Frame::SimpleString(s) => s.as_ref(),
+                            other => {
+                                return Err(crate::error::MoonError::from(
+                                    crate::error::AofError::RewriteFailed {
+                                        detail: format!(
+                                            "AOF incr framed command at offset {} (lsn {}) has non-string name frame: {:?}",
+                                            offset,
+                                            lsn,
+                                            std::mem::discriminant(other)
+                                        ),
+                                    },
+                                ));
+                            }
+                        };
+                        (name as &[u8], &arr[1..])
+                    }
+                    other => {
+                        return Err(crate::error::MoonError::from(
+                            crate::error::AofError::RewriteFailed {
+                                detail: format!(
+                                    "AOF incr framed non-array frame at offset {} (lsn {}): {:?}",
+                                    offset,
+                                    lsn,
+                                    std::mem::discriminant(other)
+                                ),
+                            },
+                        ));
+                    }
+                };
+                engine.replay_command(databases, cmd, cmd_args, &mut selected_db);
+                count += 1;
+                if lsn > max_lsn {
+                    max_lsn = lsn;
+                }
+            }


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reject extra bytes inside a framed payload.

After parse() returns one frame, buf can still contain trailing garbage or a second RESP command. This code still advances to payload_end, so replay silently drops those bytes instead of treating the entry as corrupt.
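
The check being asked for can be sketched independently of the real RESP parser. In this illustration `parse_one` is a hypothetical callback that reports how many bytes the first frame consumed, and `toy_line_frame` is a stand-in parser that treats everything up to the first `\r\n` as one frame; neither is Moon's `parse::parse`.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum PayloadError {
    /// The payload does not contain one complete frame.
    Incomplete,
    /// One frame parsed, but `extra` bytes were left over.
    TrailingBytes { extra: usize },
}

/// Accepts a framed payload only if `parse_one` consumes all of it.
pub fn parse_exactly_one<F>(payload: &[u8], parse_one: F) -> Result<(), PayloadError>
where
    F: Fn(&[u8]) -> Option<usize>,
{
    let consumed = parse_one(payload).ok_or(PayloadError::Incomplete)?;
    let extra = payload.len().saturating_sub(consumed);
    if extra != 0 {
        return Err(PayloadError::TrailingBytes { extra });
    }
    Ok(())
}

/// Stand-in parser: one frame is everything up to and including the first
/// CRLF. Returns the number of bytes consumed.
pub fn toy_line_frame(buf: &[u8]) -> Option<usize> {
    buf.windows(2).position(|w| w == b"\r\n").map(|i| i + 2)
}
```

With a `BytesMut`-style parser that advances the buffer in place, the equivalent check is whether the buffer is empty after the first successful parse.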


coderabbitai · 2026-06-01T08:23:02Z

+    pub fn issue_append_lsn(
+        repl_state: &Option<Arc<std::sync::RwLock<crate::replication::state::ReplicationState>>>,
+        shard_id: usize,
+        delta: usize,
+    ) -> u64 {
+        repl_state
+            .as_ref()
+            .and_then(|rs| {
+                rs.read()
+                    .ok()
+                    .map(|g| g.issue_lsn(shard_id, delta as u64))
+            })
+            .unwrap_or(0)


🛠️ Refactor suggestion | 🟠 Major | 🏗️ Heavy lift

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Show the current implementation around issue_append_lsn
sed -n '160,240p' src/persistence/aof.rs | cat -n

# Locate ReplicationState definition and inspect its lock type usage
rg -n "struct ReplicationState|type ReplicationState|enum ReplicationState" -S src/replication || true
rg -n "ReplicationState" src/replication -S || true

# Find the issue_lsn method signature on ReplicationState
rg -n "fn issue_lsn" -S src/replication src || true

# Confirm whether std::sync::RwLock appears anywhere in these files
rg -n "std::sync::RwLock|std::sync::Mutex" -S src/persistence/aof.rs src/replication || true

Repository: pilotspace/moon

Length of output: 7319

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect imports / lock type usage in aof.rs
sed -n '1,120p' src/persistence/aof.rs | cat -n

# Show the per-shard writer implementation and error handling region
rg -n "PerShard|per[- ]shard writer|try_send_append|Append|write.*header|write.*body|framed|codec" src/persistence/aof.rs

# Pull a focused window around the per-shard writer loop (adjust based on matches)
# First, locate the most relevant loop by matching writer/task creation
rg -n "spawn|writer|task::|monoio::spawn|tokio::spawn|for .*in .*shards" src/persistence/aof.rs

# Also capture the exact area around issue_append_lsn callers (to see what lock type they pass)
rg -n "issue_append_lsn\\(" -S src/persistence/aof.rs

# Finally, check if any other std::sync locks exist in this file
rg -n "std::sync::(RwLock|Mutex)" src/persistence/aof.rs

Repository: pilotspace/moon

Length of output: 19592

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect the per-shard writer tokio error handling around header/body writes
sed -n '780,880p' src/persistence/aof.rs | cat -n

# Inspect the per-shard writer monoio error handling around header/body writes
sed -n '940,1045p' src/persistence/aof.rs | cat -n

# See how ReplicationState is guarded in replication/master.rs / imports (parking_lot vs std::sync)
rg -n "use .*RwLock" src/replication/master.rs src/replication/replica.rs src/replication/state.rs
sed -n '1,120p' src/replication/master.rs | cat -n
sed -n '1,120p' src/replication/replica.rs | cat -n
sed -n '1,220p' src/replication/state.rs | cat -n

# Check if std::sync::RwLock is used anywhere in replication module
rg -n "std::sync::RwLock|std::sync::Mutex" -S src/replication || true

# Check if parking_lot::RwLock is used in replication module
rg -n "parking_lot::RwLock|parking_lot::Mutex" -S src/replication || true

Repository: pilotspace/moon

Length of output: 32022

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Find replay function / parsing logic for per-shard framed format
rg -n "replay_aof|parse.*u64 lsn|\\[u64 lsn|OrderedAcrossShards|AofMessage::Append|PerShard.*replay|seed_master_offset|corrupt" src/persistence -S

# Inspect replay implementation around the per-shard framing parsing and error handling
# (use narrower windows by first locating a likely function name)
rg -n "fn replay_|pub async fn replay_|replay_aof" src/persistence -S

# Show the most relevant region once we know the function line numbers
# We'll print a few candidate locations with line numbers.
# Candidate: any file under src/persistence/replay*.rs or src/persistence/*.rs with per-shard parse
fd -t f "replay" src/persistence --max-depth 2 || true
fd -t f "aof" src/persistence --max-depth 2 || true

# Use rg on likely file(s)
rg -n "OrderedAcrossShards|merge.*shard|lsn" src/persistence -S

# Finally, inspect the aof replay module/file(s) directly if found
# We'll pick the top few files reported by fd/replay searches.

Repository: pilotspace/moon

Length of output: 50371

Replace std::sync::RwLock in issue_append_lsn with parking_lot::RwLock

issue_append_lsn hardcodes Option<Arc<std::sync::RwLock<ReplicationState>>> (src/persistence/aof.rs), extending std lock + poisoning semantics into the AOF append hot path.

Switch the signature and all call sites to Option<Arc<parking_lot::RwLock<ReplicationState>>> (where ReplicationState is currently wrapped with std::sync::RwLock in src/replication/master.rs and src/replication/replica.rs), and remove the .read().ok() poison handling (parking_lot doesn’t poison).

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/persistence/aof.rs` around lines 192 - 204, The function issue_append_lsn currently takes Option<Arc<std::sync::RwLock<ReplicationState>>> and uses .read().ok() to handle std::sync::RwLock poisoning; change the signature to Option<Arc<parking_lot::RwLock<ReplicationState>>> and update all call sites to pass the parking_lot RwLock (modify where ReplicationState is wrapped in src/replication/master.rs and src/replication/replica.rs). Replace the .read().ok().map(...) chain with a direct .read().map(...) (parking_lot::RwLock::read does not return a Result), and remove the unwrap_or(0) poison-handling branch so the function simply maps through the guard to call ReplicationState::issue_lsn(shard_id, delta as u64) or returns 0 when the Option is None as before.

coderabbitai · 2026-06-01T08:23:02Z

+                        Ok(AofMessage::Append { lsn, bytes: data }) => {
+                            let mut header = [0u8; 12];
+                            header[..8].copy_from_slice(&lsn.to_le_bytes());
+                            header[8..].copy_from_slice(&(data.len() as u32).to_le_bytes());
+                            if let Err(e) = writer.write_all(&header).await {


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate the framed payload length before narrowing to u32.

data.len() as u32 truncates above 4 GiB. If that ever happens, the stored header advertises a shorter payload than was actually written, and replay will misframe the rest of the AOF stream.

Also applies to: 972-975

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/persistence/aof.rs` around lines 819 - 823, The Append branch that builds the 12-byte header currently casts data.len() to u32 which will truncate payloads >4GiB; add an explicit check in the AofMessage::Append handling (the block that prepares header and calls writer.write_all) to validate that data.len() <= u32::MAX and return or propagate an error if it exceeds that limit (or alternatively implement chunking before framing), update the header construction to only run after the check, and apply the same guard to the other occurrence noted (the similar block around lines 972–975) so no payload is ever narrowed silently.
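A minimal sketch of the length guard this comment asks for, assuming the 12-byte `[u64 lsn LE][u32 len LE]` header shown in the diff. The helper name `encode_frame_header` and the choice of `io::ErrorKind::InvalidInput` are illustrative assumptions, not the PR's actual code.

```rust
use std::convert::TryFrom;
use std::io;

/// Build the 12-byte frame header, refusing payloads whose length does not
/// fit in the u32 length field instead of truncating with `as u32`.
/// (Hypothetical helper; the real writer builds the header inline.)
pub fn encode_frame_header(lsn: u64, payload_len: usize) -> io::Result<[u8; 12]> {
    let len = u32::try_from(payload_len).map_err(|_| {
        io::Error::new(
            io::ErrorKind::InvalidInput,
            "AOF payload length exceeds u32::MAX",
        )
    })?;
    let mut header = [0u8; 12];
    header[..8].copy_from_slice(&lsn.to_le_bytes());
    header[8..].copy_from_slice(&len.to_le_bytes());
    Ok(header)
}
```

With the check placed before the header is built, an oversized payload surfaces as an error at the append site rather than as a misframed stream at replay time.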

coderabbitai · 2026-06-01T08:23:03Z

+                        Ok(AofMessage::Append { lsn, bytes: data }) => {
+                            let mut header = [0u8; 12];
+                            header[..8].copy_from_slice(&lsn.to_le_bytes());
+                            header[8..].copy_from_slice(&(data.len() as u32).to_le_bytes());
+                            if let Err(e) = writer.write_all(&header).await {
+                                error!("AOF header write error shard {}: {}", shard_id, e);
+                                continue;
+                            }
+                            if let Err(e) = writer.write_all(&data).await {
+                                error!("AOF write error shard {}: {}", shard_id, e);
+                                continue;


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Stop this writer after a framed write failure.

write_all() can fail after writing part of the header or payload. If the loop just continues, the next append can start in the middle of that broken entry, and replay_incr_framed() will drop the rest of the file as a truncated tail.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/persistence/aof.rs` around lines 819 - 829, The writer loop that handles AofMessage::Append must not just continue on writer.write_all failures because partial writes corrupt framed entries; in the block handling AofMessage::Append (referencing AofMessage::Append, writer.write_all, and shard_id) replace the `continue` behavior with logic that stops the writer loop and surfaces or logs the error (e.g., break/return from the task or close the writer and propagate the error) so the shard stops producing further writes after a partial header or payload write, preventing replay_incr_framed from encountering a truncated/partial entry.
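The stop-on-failure behaviour can be sketched with a synchronous `std::io::Write` stand-in for the async writer. `write_frame` and `run_writer` are hypothetical names; the real task is `per_shard_aof_writer_task` and receives messages from a channel rather than a slice.

```rust
use std::io::{self, Write};

/// Write one `[u64 lsn LE][u32 len LE][payload]` frame.
pub fn write_frame<W: Write>(w: &mut W, lsn: u64, payload: &[u8]) -> io::Result<()> {
    if payload.len() > u32::MAX as usize {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "payload too large"));
    }
    let mut header = [0u8; 12];
    header[..8].copy_from_slice(&lsn.to_le_bytes());
    header[8..].copy_from_slice(&(payload.len() as u32).to_le_bytes());
    w.write_all(&header)?;
    w.write_all(payload)
}

/// Write frames in order and stop at the first failure.
/// Returns the number of complete frames on success, or that count plus the
/// error. Nothing is written after a failed frame, so at most one torn entry
/// can exist, and only at the tail of the file.
pub fn run_writer<W: Write>(
    w: &mut W,
    frames: &[(u64, Vec<u8>)],
) -> Result<usize, (usize, io::Error)> {
    let mut complete = 0;
    for (lsn, payload) in frames {
        if let Err(e) = write_frame(w, *lsn, payload) {
            // Do not `continue`: a later frame would start mid-entry.
            return Err((complete, e));
        }
        complete += 1;
    }
    Ok(complete)
}
```

Returning instead of continuing means any torn entry is the last thing in the file, which is the only place a truncated-tail replay rule can safely discard it.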

coderabbitai · 2026-06-01T08:23:03Z

+    /// Atomically issue an LSN for a write and advance per-shard +
+    /// master replication offsets by `delta`.
+    ///
+    /// Returns the LSN that uniquely identifies this write — equal to the
+    /// value of `master_repl_offset` BEFORE the increment, mirroring Redis's
+    /// `+ delta - delta` semantics. The same LSN MUST tag the corresponding
+    /// `AofMessage::Append` entry and the replication backlog entry for that
+    /// write so per-shard AOF replay can rebuild a globally consistent log
+    /// (per-shard AOF RFC § 2 Rule 3).
+    ///
+    /// Atomicity caveat: the per-shard offset advance and the master offset
+    /// advance are TWO separate `fetch_add`s, not one composite op. Concurrent
+    /// callers across shards observe a brief window where the master sum
+    /// disagrees with the sum of shard offsets. Acceptable today because the
+    /// only `total_offset()` consumer is INFO replication, which tolerates
+    /// transient skew. Do not promote to a hard invariant without redesign.
+    ///
+    /// Returns 0 if `shard_id` is out of range (defensive; production callers
+    /// must pass a valid id).
+    pub fn issue_lsn(&self, shard_id: usize, delta: u64) -> u64 {
+        if shard_id >= self.shard_offsets.len() {
+            return 0;
        }
+        self.shard_offsets[shard_id].fetch_add(delta, Ordering::Relaxed);
+        self.master_repl_offset.fetch_add(delta, Ordering::Relaxed)
    }

    /// Returns sum of all per-shard offsets.
    pub fn total_offset(&self) -> u64 {
        self.master_repl_offset.load(Ordering::Relaxed)
    }

+    /// Seed `master_repl_offset` to at least `lsn` after AOF recovery.
+    ///
+    /// Per-shard AOF RFC § 2 Rule 3: after recovery reads the per-shard AOFs,
+    /// `master_repl_offset` MUST be at least the max LSN observed across all
+    /// shards before the server accepts client traffic. Otherwise the next
+    /// write would issue an LSN already present on disk, breaking the
+    /// `lsn → entry` uniqueness invariant the backlog merge depends on.
+    ///
+    /// Uses `fetch_max` so a concurrent in-flight increment (extremely
+    /// unlikely at boot, but free to guard against) cannot regress the value.
+    /// Per-shard offsets are intentionally NOT touched here — at boot they
+    /// are still 0, and seeding shard offsets to the per-shard AOF max would
+    /// double-count once the first write advances them via `issue_lsn`.
+    pub fn seed_master_offset(&self, lsn: u64) {
+        self.master_repl_offset.fetch_max(lsn, Ordering::Relaxed);
+    }


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

LSN recovery is seeding the wrong cursor.

issue_lsn() returns the pre-increment master offset, so 0 is a valid first on-disk LSN. But the recovery path only carries forward the max observed start LSN, which means the first post-restart append can reissue an existing LSN and break the lsn -> entry uniqueness invariant this API documents. Recovery needs the next free offset (for example max(lsn + payload_len)), not the max starting offset or a sentinel 0.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/replication/state.rs` around lines 119 - 166, Recovery is seeding master_repl_offset with the observed start LSN, but issue_lsn() returns the pre-increment value so you must seed with the next free offset to avoid reissuing an LSN; change the recovery code to compute next_free = observed_start_lsn + payload_len (or the entry's end LSN) for each recovered AOF entry and call seed_master_offset(next_free) (seed_master_offset should keep using master_repl_offset.fetch_max(next_free, Ordering::Relaxed)); ensure any place that currently passes the start LSN is updated to pass the computed next_free so the master offset is advanced past on-disk entries.
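To illustrate the difference between seeding with the max start LSN and seeding with the next free offset, here is a reduced sketch. It assumes, as the comment suggests, that `delta` equals the payload length, so an entry occupies `[lsn, lsn + payload_len)`. `ReplState` and `next_free_offset` are hypothetical stand-ins and omit the per-shard offsets of the real `ReplicationState`.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Reduced stand-in for the replication state: master offset only.
pub struct ReplState {
    master_repl_offset: AtomicU64,
}

impl ReplState {
    pub fn new() -> Self {
        ReplState { master_repl_offset: AtomicU64::new(0) }
    }

    /// Returns the pre-increment offset, as `issue_lsn` does in the diff.
    pub fn issue_lsn(&self, delta: u64) -> u64 {
        self.master_repl_offset.fetch_add(delta, Ordering::Relaxed)
    }

    /// Raise the offset to at least `lsn`; never lowers it.
    pub fn seed_master_offset(&self, lsn: u64) {
        self.master_repl_offset.fetch_max(lsn, Ordering::Relaxed);
    }
}

/// Next free offset over recovered `(start_lsn, payload_len)` entries:
/// the largest end offset, or 0 when nothing was recovered.
pub fn next_free_offset(entries: &[(u64, u32)]) -> u64 {
    entries
        .iter()
        .map(|&(lsn, len)| lsn.saturating_add(u64::from(len)))
        .max()
        .unwrap_or(0)
}
```

For recovered entries `(0, 31)` and `(31, 31)`, seeding with the max start LSN (31) lets the first post-restart write reissue LSN 31, while seeding with `next_free_offset` (62) does not.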

…tep 7)

Ships the H1 fix from the investigation report: the mechanism for `appendfsync=always` to honour its durability contract end-to-end, so the client `+OK` does not race the disk-side fsync.

API:
- New `AofMessage::AppendSync { lsn, bytes, ack }` variant carries a `OneshotSender<AofAck>` alongside the same `(lsn, bytes)` payload as the existing `Append`. The writer ALWAYS fsyncs and acks via this variant, regardless of the configured `FsyncPolicy`; the caller has signed the durability contract by choosing AppendSync over Append.
- `AofAck { Synced, WriteFailed, FsyncFailed }` reports the outcome. `Synced` means `sync_data()` returned successfully and the entry is on durable storage. The two failure variants are emitted from the precise syscall that failed so callers can map back to a specific client error.
- `AofWriterPool::try_send_append_sync(shard_id, lsn, bytes) -> OneshotReceiver<AofAck>` is the caller entry point. The handler awaits the receiver before responding to the client; if the receiver resolves with `Err(RecvError)` (channel disconnect / writer dead), the caller treats that as a hard failure too.

Writer-task integration (4 sites + 1 helper):
- TopLevel monoio (`aof_writer_task`): write → flush → sync_data → ack. `write_error` sticky flag still gates subsequent writes; the ack reports `WriteFailed` for both first-failure and follow-on.
- TopLevel tokio (`aof_writer_task`): same shape, async syscalls.
- PerShard tokio (`per_shard_aof_writer_task`): framed `[u64 lsn LE][u32 len LE][RESP]` header + payload + fsync + ack.
- PerShard monoio (`per_shard_aof_writer_task`): same framed format, blocking syscalls.
- `drain_pending_appends` (BGREWRITEAOF rewrite drain): bytes are written and counted; the post-drain fsync at the rewrite boundary covers durability, so the ack is `Synced`. On write error the `?` bubbles up and the ack is dropped, so the caller observes `RecvError`.

Production call sites: NONE in step 7. The per-handler integration (when to use AppendSync vs Append based on `FsyncPolicy::Always`) is wired in step 9 prep before lifting the `--unsafe-multishard-aof` gate. Step 7 ships only the mechanism + adversarial tests so step 8 (CRASH-01-LITE) and step 9 can build on a stable foundation.

Tests (under `pool_tests`):
- `try_send_append_sync_queues_appendsync_with_ack`: caller-side `try_send_append_sync` queues an `AppendSync` with the correct lsn and bytes; mocked writer acks `Synced`; receiver resolves with `Synced`.
- `append_sync_writer_dropped_resolves_recv_error`: if the writer drops the ack sender (death / disconnect / channel close), the receiver resolves with `Err(RecvError)` rather than hanging.
- `append_sync_writer_reports_write_failed`: writer ack of `WriteFailed` is propagated to the caller verbatim.
- `append_sync_writer_reports_fsync_failed`: same for `FsyncFailed`.

Verification:
- `cargo check` both feature combos: clean.
- `cargo clippy -- -D warnings` both feature combos: zero warnings.
- `cargo test persistence:: -- --test-threads=1`: 385 pass (was 381, +4 new tests).
- `cargo test persistence::aof::pool_tests`: 10 pass.

Out of scope (per RFC § 8 dependency chain):
- Per-handler wiring of `try_send_append_sync` for `appendfsync=always` (step 9 prep).
- CRASH-01-LITE end-to-end test exercising the rendezvous under SIGKILL (step 8).
- Lifting `--unsafe-multishard-aof` (step 9, gated on step 8 green).

Refs: tmp/rfc-per-shard-aof-v02.md § 4 (Fsync semantics)
author: Tin Dang
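The rendezvous described in this commit message can be sketched with the standard library alone. This is an illustration under stated assumptions: `std::sync::mpsc` channels stand in for the runtime's oneshot and writer channels, `persist` stands in for the write → flush → `sync_data` sequence, and the function names `writer_loop`, `send_append_sync` and `durable_write` are hypothetical.

```rust
use std::sync::mpsc::{channel, Receiver, RecvError, Sender};
use std::thread;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AofAck {
    Synced,
    WriteFailed,
    FsyncFailed,
}

pub enum AofMessage {
    /// Fire-and-forget append.
    Append { lsn: u64, bytes: Vec<u8> },
    /// Append whose outcome is reported back on `ack`.
    AppendSync { lsn: u64, bytes: Vec<u8>, ack: Sender<AofAck> },
}

/// Toy writer loop: `persist` models write + flush + sync_data.
pub fn writer_loop(rx: Receiver<AofMessage>, mut persist: impl FnMut(u64, &[u8]) -> AofAck) {
    for msg in rx {
        match msg {
            AofMessage::Append { lsn, bytes } => {
                let _ = persist(lsn, &bytes);
            }
            AofMessage::AppendSync { lsn, bytes, ack } => {
                let outcome = persist(lsn, &bytes);
                // The caller may have gone away; ignore a failed send.
                let _ = ack.send(outcome);
            }
        }
    }
}

/// Queue an AppendSync and hand back the receiver for its ack.
/// If the writer is gone, the message (and its ack sender) is dropped,
/// so the receiver resolves with `Err(RecvError)` instead of hanging.
pub fn send_append_sync(tx: &Sender<AofMessage>, lsn: u64, bytes: Vec<u8>) -> Receiver<AofAck> {
    let (ack_tx, ack_rx) = channel();
    let _ = tx.send(AofMessage::AppendSync { lsn, bytes, ack: ack_tx });
    ack_rx
}

/// Block until the writer reports the outcome; anything but `Synced` is an error.
pub fn durable_write(tx: &Sender<AofMessage>, lsn: u64, bytes: Vec<u8>) -> Result<(), String> {
    match send_append_sync(tx, lsn, bytes).recv() {
        Ok(AofAck::Synced) => Ok(()),
        Ok(other) => Err(format!("{:?}", other)),
        Err(RecvError) => Err("writer gone".to_string()),
    }
}
```

A caller would spawn `writer_loop` on its own thread (for example with `thread::spawn`) and reply `+OK` only after `durable_write` returns `Ok(())`.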

…oot PerShard spawn fix (Option B step 8)

Ships the end-to-end crash-recovery validation per RFC § 7 and closes a P0 bug that step 8's red-green TDD uncovered: PerShard writers were NOT spawned on first boot, so a brand-new `--shards 2 --appendonly yes` deployment silently wrote plain RESP into shard-0's directory and lost all data on restart.

Two changes in one commit because the test is the only thing that catches the spawn bug.

## Bug fix (main.rs spawn-site gate)

Before: spawn decision keyed on `existing_manifest.layout == PerShard`. With no manifest on disk yet (first boot), `existing_manifest = None` so the TopLevel writer was chosen, even when `num_shards >= 2`. The TopLevel writer wrote plain RESP into whatever path `manifest.incr_path()` resolved to AFTER `initialize_multi` ran later in the boot sequence, which under PerShard always routes to shard-0. Result: all writes for all shards landed in `appendonlydir/shard-0/moon.aof.1.incr.aof` in plain RESP (no LSN header), shard-1's incr file was 0 bytes, and on restart the framed replay parser saw garbage LSN bytes and treated the whole file as truncated EOF → 0 keys recovered out of 200.

Fix: `use_per_shard = num_shards >= 2 && (existing PerShard manifest OR no manifest yet)`. The "no manifest yet" branch covers first-boot and lines up with the existing `initialize_multi(num_shards)` call in the recovery block (added in step 8 a few hunks below, also new in this commit).

Caught locally on commit b59ae4d before pushing CRASH-01-LITE.

## CRASH-01-LITE test (tests/crash_matrix_per_shard_aof.rs)

Subset of the RFC § 7 matrix; "LITE" defers cross-shard TXN and BGREWRITEAOF interleaving to step 9 + future work.

Scenario: `--shards 2 --appendonly yes --appendfsync everysec --unsafe-multishard-aof`, write 200 keys (alternating `{a}` / `{b}` hash tags so both shards populate), wait > 1s for the everysec fsync window, SIGKILL the process via `libc::kill(pid, SIGKILL)`, restart with same args, verify all 200 keys recovered with correct values.

The `#[ignore]` gate keeps the test out of `cargo test` default runs: it needs a built `./target/release/moon` and `redis-cli` on PATH. Mirror of `scan_fanout_multishard.rs` conventions. Run explicitly with:

    cargo build --release --features runtime-monoio,jemalloc
    cargo test --release --test crash_matrix_per_shard_aof -- --ignored

Stdout/stderr go to log files in the test dir (NEVER `Stdio::null()`) so a CI flake produces real diagnostics; see [[feedback_silenced_child_stdio_flake]].

## Verification

- `cargo check` both feature combos: clean.
- `cargo clippy -- -D warnings` (library, both feature combos): zero warnings. Pre-existing warnings in unrelated test files (clippy --tests) are not introduced by this commit.
- `cargo test persistence:: -- --test-threads=1`: 385 pass.
- `cargo test --release --test crash_matrix_per_shard_aof -- --ignored`: **1 pass** (CRASH-01-LITE: 200/200 keys recovered after SIGKILL).
- Manual disk inspection (`xxd appendonlydir/shard-N/moon.aof.1.incr.aof`): framed format `[u64 lsn LE][u32 len LE][RESP]` on both shards; shard-0 LSN=0x3E for k0, shard-1 LSN=0x1F for k1.

## Out of scope (per RFC § 8)

- Per-handler integration of `try_send_append_sync` for `appendfsync=always` (step 9 prep).
- Lifting `--unsafe-multishard-aof` (step 9, gated on step 8 green, which it now is).
- Adding `--appendfsync always` row to the matrix once step 9 wires the handler integration.
- BGREWRITEAOF interleaving row (RFC § 6, out of step 8 scope).

Refs: tmp/rfc-per-shard-aof-v02.md § 7
author: Tin Dang
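For readers unfamiliar with the framed layout, the following is a minimal sketch of encoding and scanning `[u64 lsn LE][u32 len LE][RESP]` entries. It assumes a replay pass simply stops at the first incomplete frame, which matches how the review comments describe truncated-tail handling; `encode_frame` and `parse_frames` are hypothetical names, not functions from the PR.

```rust
use std::convert::TryInto;

/// Encode one frame: 8-byte LE lsn, 4-byte LE payload length, payload bytes.
pub fn encode_frame(lsn: u64, payload: &[u8]) -> Vec<u8> {
    assert!(payload.len() <= u32::MAX as usize, "payload too large for u32 length field");
    let mut out = Vec::with_capacity(12 + payload.len());
    out.extend_from_slice(&lsn.to_le_bytes());
    out.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    out.extend_from_slice(payload);
    out
}

/// Scan `buf` for complete frames. Returns the `(lsn, payload)` pairs found
/// and the number of bytes consumed; a torn header or payload at the tail
/// ends the scan without being reported as an entry.
pub fn parse_frames(buf: &[u8]) -> (Vec<(u64, &[u8])>, usize) {
    let mut entries = Vec::new();
    let mut pos = 0usize;
    while buf.len() - pos >= 12 {
        let lsn = u64::from_le_bytes(buf[pos..pos + 8].try_into().unwrap());
        let len = u32::from_le_bytes(buf[pos + 8..pos + 12].try_into().unwrap()) as usize;
        let start = pos + 12;
        if buf.len() - start < len {
            break; // truncated payload: stop here
        }
        entries.push((lsn, &buf[start..start + len]));
        pos = start + len;
    }
    (entries, pos)
}
```

Feeding a plain-RESP file to such a scanner illustrates the first-boot bug described above: the first 12 bytes are interpreted as an arbitrary LSN and length, and the scan typically ends immediately as a truncated tail.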

… - closes P0)

The per-shard AOF pipeline (RFC steps 1-8, commits 5004f4e → b59ae4d) makes `--shards >= 2 + --appendonly yes` crash-safe. CRASH-01-LITE confirms 200/200 keys recover after SIGKILL on a 2-shard `everysec` deployment, with framed `[u64 lsn LE][u32 len LE][RESP]` entries on disk across both shards. The startup refusal that PR #129 introduced is no longer needed and is hereby lifted.

## Changes

- `main.rs`: the P0-FIX-01b refusal block now only emits a one-line info notice if `--unsafe-multishard-aof` is set explicitly. The exit-2 path is gone. Multi-shard + appendonly deployments are permitted by default.
- `--unsafe-multishard-aof` flag is preserved as a no-op so existing operator runbooks and CI command lines do not break. Removing it entirely is a future cleanup PR once dependents are audited.
- `tests/crash_matrix_per_shard_aof.rs`: the test launches without the flag, exercising the default crash-safe path end-to-end. Still green: 200/200 recover after SIGKILL.

## Risk register (carried forward from RFC § 8)

- **Rule 3 strict alignment** is NOT achieved (called out in step 3 commit body `e46dc4e`): SPSC-routed writes hit `master_repl_offset.fetch_add` twice: once at `spsc_handler.rs:3017` (existing replication path) and once at the new AOF write site (`AofWriterPool::issue_append_lsn`). Per-shard monotonicity holds and CRASH-01-LITE passes, but the master replication offset advances 2x per such write. Fix is a single-LSN-issuance-point refactor in v0.2 replication state. Lifting the gate does not regress this; the refusal block never enforced Rule 3.
- **`appendfsync=always` handler integration**: step 7 shipped the `AppendSync` mechanism but no production call site uses it yet. With `appendfsync=always`, durability still depends on the everysec-style tick at the writer task. End-to-end fsync-before-ack on the always policy requires per-handler wiring; tracked as a v0.1.13 follow-up. CRASH-01-LITE deliberately uses `everysec` so this isn't a regression versus the pre-Per-Shard state.
- **Cross-shard TXN / SCRIPT replay** is the empty-buffer case today (step 5 ships the scaffold; no production emitter). Lifting the gate does not introduce cross-shard atomicity; moon's TXN/SCRIPT remain single-shard local operations.
- **BGREWRITEAOF in PerShard layout** is still gated (separately) by `MULTI_SHARD_AOF_REWRITE_UNSAFE` in `main.rs:430`. That's RFC step 6 scope (deferred when the original 9-step plan dropped step 6) and is orthogonal to this lift. Disabling `--disk-offload` re-enables the legacy rewrite path.

## Verification

- `cargo check` (default monoio + tokio + jemalloc): clean.
- `cargo clippy -- -D warnings` (both feature combos): zero warnings.
- `cargo test persistence:: -- --test-threads=1`: 385 pass.
- `cargo build --release` + `cargo test --release --test crash_matrix_per_shard_aof -- --ignored`: 1 pass, all 200 keys recovered, no `--unsafe-multishard-aof` in launch command.
- Manual: `xxd appendonlydir/shard-N/moon.aof.1.incr.aof` confirms framed `[u64 lsn LE][u32 len LE][RESP]` on both shards after a default-config run.

## RFC closure

This closes the Option B plan from `tmp/rfc-per-shard-aof-v02.md`:

| Step | Commit | Status |
|------|--------|--------|
| 1: AofManifest PerShard layout | (pre-existing) | done |
| 2: Per-shard AofWriter task | (pre-existing) | done |
| 2b: Writer task body | (pre-existing) | done |
| 2c: aof_tx → aof_pool plumbing | d9a3651 | done |
| 2d: handler_monoio sites | (pre-existing) | done |
| 2e: handler_sharded/single/blk | ceac655 | done |
| 2f: layout-aware spawn | 5004f4e | done |
| 3: per-entry LSN framing | e46dc4e | done |
| 4: per-shard replay | b59ae4d | done |
| 5: OrderedAcrossShards scaffold | adf151d | done |
| 6: migrate-aof tool | - | deferred (not needed; first-boot path covered) |
| 7: AppendSync rendezvous | (this batch) | mechanism done; integration v0.1.13 |
| 8: CRASH-01-LITE | (this batch) | green |
| 9: lift gate | (this commit) | done |

Refs: tmp/rfc-per-shard-aof-v02.md § 8
author: Tin Dang

Close P0 H1 (in-flight loss under appendfsync=always).

The mechanism landed in step 7 (`AofMessage::AppendSync` + `AofWriterPool::try_send_append_sync`), but the production write paths still used the fire-and-forget `try_send_append`, so `+OK` returned before the per-shard writer fsynced. A SIGKILL between accept and the everysec tick lost in-flight entries, the exact symptom reported in tmp/P0-INVEST-01-multishard-aof-rootcause.md.

This patch threads the durable-send through every production write call site and validates the closure with a SIGKILL crash matrix.

Changes
-------

src/persistence/aof.rs
  - `AofWriterPool` gains a `fsync_policy: FsyncPolicy` field and the `fsync_policy()` accessor.
  - New `try_send_append_durable(shard_id, lsn, bytes)` async helper:
      * Always → routes via `try_send_append_sync` and awaits the writer's ack; returns `Err(AofAck)` on failure.
      * EverySec → fire-and-forget via `try_send_append`; returns `Ok`.
      * No → same as EverySec.
  - Construction now goes through `top_level_with_policy` / `per_shard_with_policy`; the old constructors are retained as thin wrappers that default to `EverySec` for crate-internal tests.

src/server/conn/handler_monoio/mod.rs
src/server/conn/handler_sharded/mod.rs
  - All 4 write call sites per handler (MOVE, COPY..DB, general write path, cross-shard dispatch) replace `pool.try_send_append(shard_id, lsn, bytes)` with `pool.try_send_append_durable(target, lsn, bytes).await`.
  - On `Err` the response is replaced with `-ERR AOF fsync failed; write not durable` so the client never sees `+OK` for a non-durable write.
  - The local-shard response binding becomes `mut response` to allow the override on AOF failure.

src/server/conn/blocking.rs
  - `try_inline_dispatch` is synchronous and cannot await the writer's ack. Under `appendfsync=always` it now bails out for `*3` (SET-shape) frames, forcing the write through the async dispatch path which IS H1-integrated. GETs continue to inline. Single-effect cost: ~20 ns of policy-load on every SET, paid only when Always is configured.

src/server/conn/handler_single.rs
  - The three `send_async(AofMessage::Append { ... })` sites (batch subscribe flush, GRAPH.* WAL records, main batch flush) now call `try_send_append_durable(0, lsn, bytes).await`.
  - NOTE: single-shard handler still flushes client responses BEFORE the AOF batch (pre-existing ordering bug). Always semantics in this path are partial: multi-shard handlers (handler_monoio / handler_sharded) DO enforce pre-response durability. Tracked as a follow-up; out of scope for the multi-shard PR.

src/server/embedded.rs
src/server/listener.rs
src/main.rs
  - Updated spawn sites to use `top_level_with_policy(tx, fsync)` and `per_shard_with_policy(senders, fsync)` so the pool's policy field reflects the configured `appendfsync`.

tests/crash_matrix_per_shard_aof.rs
  - Refactored `start_moon` to delegate to `start_moon_with_fsync(port, dir, fsync)`. The existing everysec test is unchanged.
  - New `crash_01_lite_always_per_shard_aof_recovers_after_sigkill`:
      * `--shards 2 --appendonly yes --appendfsync always`
      * 200 SET commands (hash-tagged across both shards)
      * SIGKILL with NO quiescing sleep
      * restart → assert 100% recovery (every +OK observed implies fsync)

Verification
------------

    cargo clippy --lib -- -D warnings    # clean
    cargo clippy --lib --no-default-features --features runtime-tokio,jemalloc -- -D warnings    # clean
    cargo test --lib persistence:: -- --test-threads=1    # 385/385
    cargo build --release --features runtime-monoio,jemalloc    # ok
    cargo test --release --features runtime-monoio,jemalloc --test crash_matrix_per_shard_aof -- --ignored --test-threads=1    # 2/2 pass
      └── crash_01_lite_per_shard_aof_recovers_after_sigkill (everysec)
      └── crash_01_lite_always_per_shard_aof_recovers_after_sigkill (always)

Closes the multi-shard AOF PR scope. H2 (skipped multi-part replay for num_shards >= 2) was closed structurally in step 4 + main.rs replay wiring; H1 (fire-and-forget ack) is now closed by this commit's handler integration plus the validating crash matrix row.

author: Tin Dang
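The policy routing and response override described in this commit can be sketched synchronously. The real `try_send_append_durable` is async and lives on `AofWriterPool`; here a hypothetical `AofSink` trait stands in for the pool, and `append_durable` / `reply_for` are illustrative names. Only the routing rule (Always waits for the ack, EverySec and No do not) and the error string are taken from the commit message.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FsyncPolicy {
    Always,
    EverySec,
    No,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AofAck {
    Synced,
    WriteFailed,
    FsyncFailed,
}

/// Hypothetical stand-in for the writer pool.
pub trait AofSink {
    /// Fire-and-forget append; durability is left to a later fsync tick.
    fn append(&mut self, lsn: u64, bytes: &[u8]);
    /// Append and wait for the writer's outcome.
    fn append_sync(&mut self, lsn: u64, bytes: &[u8]) -> AofAck;
}

/// Route by policy: only `Always` waits for the fsync outcome.
pub fn append_durable<S: AofSink>(
    sink: &mut S,
    policy: FsyncPolicy,
    lsn: u64,
    bytes: &[u8],
) -> Result<(), AofAck> {
    match policy {
        FsyncPolicy::Always => match sink.append_sync(lsn, bytes) {
            AofAck::Synced => Ok(()),
            failed => Err(failed),
        },
        FsyncPolicy::EverySec | FsyncPolicy::No => {
            sink.append(lsn, bytes);
            Ok(())
        }
    }
}

/// Pick the client reply: never `+OK` for a write that was not made durable.
pub fn reply_for(result: Result<(), AofAck>) -> &'static str {
    match result {
        Ok(()) => "+OK\r\n",
        Err(_) => "-ERR AOF fsync failed; write not durable\r\n",
    }
}
```

Keeping the policy decision in one helper means each handler call site only has to override its response on `Err`, which is the shape the commit describes for `handler_monoio` and `handler_sharded`.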

qodo-code-review · 2026-06-01T09:45:29Z

CI Feedback 🧐

A test triggered by this PR failed. Here is an AI-generated analysis of the failure:

Action: Check
Failed stage: Test [❌]
Failed test name: test_txn_commit_wal_crash_recovery
Failure summary:

The action failed during `cargo test --no-default-features --features runtime-tokio,jemalloc` because one Rust integration test failed:
- `test_txn_commit_wal_crash_recovery` in `tests/txn_kv_wiring.rs` panicked at `tests/txn_kv_wiring.rs:996:13`.
- The panic occurred because the test could not connect to the phase-1 Moon server within 60 seconds: `connect failed: Resource temporarily unavailable (os error 11)`.
- The server started and was listening on `127.0.0.1:46795`, but the client connection attempts still timed out, causing the test binary `txn_kv_wiring` to fail and `cargo test` to exit with code 101.
Relevant error logs: 1: ##[group]Runner Image Provisioner 2: Hosted Compute Agent ... 178: MOON_NO_URING: 1 179: targets: 180: components: clippy 181: ##[endgroup] 182: ##[group]Run : set $CARGO_HOME 183: �[36;1m: set $CARGO_HOME�[0m 184: �[36;1mecho CARGO_HOME=${CARGO_HOME:-"$HOME/.cargo"} >> $GITHUB_ENV�[0m 185: shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 186: env: 187: CARGO_TERM_COLOR: always 188: MOON_NO_URING: 1 189: ##[endgroup] 190: ##[group]Run : install rustup if needed 191: �[36;1m: install rustup if needed�[0m 192: �[36;1mif ! command -v rustup &>/dev/null; then�[0m 193: �[36;1m curl --proto '=https' --tlsv1.2 --retry 10 --retry-connrefused --location --silent --show-error --fail https://sh.rustup.rs \| sh -s -- --default-toolchain none -y�[0m 194: �[36;1m echo "$CARGO_HOME/bin" >> $GITHUB_PATH�[0m ... 265: �[36;1m if rustc +1.94.1 --version --verbose \| grep -q '^release: 1\.6[89]\.'; then�[0m 266: �[36;1m touch "/home/runner/work/_temp"/.implicit_cargo_registries_crates_io_protocol \|\| true�[0m 267: �[36;1m echo CARGO_REGISTRIES_CRATES_IO_PROTOCOL=sparse >> $GITHUB_ENV�[0m 268: �[36;1m elif rustc +1.94.1 --version --verbose \| grep -q '^release: 1\.6[67]\.'; then�[0m 269: �[36;1m touch "/home/runner/work/_temp"/.implicit_cargo_registries_crates_io_protocol \|\| true�[0m 270: �[36;1m echo CARGO_REGISTRIES_CRATES_IO_PROTOCOL=git >> $GITHUB_ENV�[0m 271: �[36;1m fi�[0m 272: �[36;1mfi�[0m 273: shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 274: env: 275: CARGO_TERM_COLOR: always 276: MOON_NO_URING: 1 277: CARGO_HOME: /home/runner/.cargo 278: CARGO_INCREMENTAL: 0 279: ##[endgroup] 280: ##[group]Run : work around spurious network errors in curl 8.0 281: �[36;1m: work around spurious network errors in curl 8.0�[0m 282: �[36;1m# https://rust-lang.zulipchat.com/#narrow/stream/246057-t-cargo/topic/timeout.20investigation�[0m ... 357: .. 
Lockfiles considered: 358: - /home/runner/work/moon/moon/Cargo.lock 359: - /home/runner/work/moon/moon/Cargo.toml 360: - /home/runner/work/moon/moon/rust-toolchain.toml 361: ##[endgroup] 362: ... Restoring cache ... 363: No cache found. 364: ##[group]Run cargo clippy -- -D warnings 365: �[36;1mcargo clippy -- -D warnings�[0m 366: shell: /usr/bin/bash -e {0} 367: env: 368: CARGO_TERM_COLOR: always 369: MOON_NO_URING: 1 370: CARGO_HOME: /home/runner/.cargo 371: CARGO_INCREMENTAL: 0 372: CACHE_ON_FAILURE: false 373: ##[endgroup] ... 425: �[1m�[92m Downloaded�[0m utf8-ranges v1.0.5 426: �[1m�[92m Downloaded�[0m crossbeam-utils v0.8.21 427: �[1m�[92m Downloaded�[0m wyhash v0.5.0 428: �[1m�[92m Downloaded�[0m want v0.3.1 429: �[1m�[92m Downloaded�[0m version_check v0.9.5 430: �[1m�[92m Downloaded�[0m twox-hash v2.1.2 431: �[1m�[92m Downloaded�[0m zeroize v1.8.2 432: �[1m�[92m Downloaded�[0m sketches-ddsketch v0.3.1 433: �[1m�[92m Downloaded�[0m pin-utils v0.1.0 434: �[1m�[92m Downloaded�[0m unicode-ident v1.0.24 435: �[1m�[92m Downloaded�[0m zmij v1.0.21 436: �[1m�[92m Downloaded�[0m indexmap v2.13.1 437: �[1m�[92m Downloaded�[0m untrusted v0.9.0 438: �[1m�[92m Downloaded�[0m unicode-segmentation v1.13.2 439: �[1m�[92m Downloaded�[0m unicode-normalization v0.1.25 440: �[1m�[92m Downloaded�[0m thiserror v1.0.69 441: �[1m�[92m Downloaded�[0m zerocopy-derive v0.8.48 442: �[1m�[92m Downloaded�[0m tracing-log v0.2.0 443: �[1m�[92m Downloaded�[0m parking_lot v0.12.5 444: �[1m�[92m Downloaded�[0m tikv-jemalloc-ctl v0.6.1 445: �[1m�[92m Downloaded�[0m thiserror v2.0.18 446: �[1m�[92m Downloaded�[0m parking_lot_core v0.9.12 ... 
491: �[1m�[92m Downloaded�[0m once_cell v1.21.4 492: �[1m�[92m Downloaded�[0m smallvec v1.15.1 493: �[1m�[92m Downloaded�[0m quanta v0.12.6 494: �[1m�[92m Downloaded�[0m lua-src v550.0.0 495: �[1m�[92m Downloaded�[0m tikv-jemalloc-sys v0.6.1+5.3.0-1-ge13ca993e8ccb9ba9847cc330696e02839f328f7 496: �[1m�[92m Downloaded�[0m pin-project-lite v0.2.17 497: �[1m�[92m Downloaded�[0m ordered-float v5.3.0 498: �[1m�[92m Downloaded�[0m nanorand v0.7.0 499: �[1m�[92m Downloaded�[0m mio v0.8.11 500: �[1m�[92m Downloaded�[0m metrics-util v0.19.1 501: �[1m�[92m Downloaded�[0m metrics-exporter-prometheus v0.16.2 502: �[1m�[92m Downloaded�[0m luajit-src v210.6.6+707c12b 503: �[1m�[92m Downloaded�[0m metrics v0.24.3 504: �[1m�[92m Downloaded�[0m memmap2 v0.9.10 505: �[1m�[92m Downloaded�[0m thread_local v1.1.9 506: �[1m�[92m Downloaded�[0m thiserror-impl v1.0.69 507: �[1m�[92m Downloaded�[0m signal-hook-registry v1.4.8 508: �[1m�[92m Downloaded�[0m ppv-lite86 v0.2.21 509: �[1m�[92m Downloaded�[0m logos v0.14.4 510: �[1m�[92m Downloaded�[0m fst v0.4.7 511: �[1m�[92m Downloaded�[0m pkg-config v0.3.32 512: �[1m�[92m Downloaded�[0m phf_macros v0.13.1 513: �[1m�[92m Downloaded�[0m paste v1.0.15 514: �[1m�[92m Downloaded�[0m thiserror-impl v2.0.18 515: �[1m�[92m Downloaded�[0m slab v0.4.12 ... 
684: �[1m�[92m Checking�[0m rustls-pki-types v1.14.0 685: �[1m�[92m Checking�[0m http-body v1.0.1 686: �[1m�[92m Compiling�[0m tikv-jemalloc-sys v0.6.1+5.3.0-1-ge13ca993e8ccb9ba9847cc330696e02839f328f7 687: �[1m�[92m Compiling�[0m lua-src v550.0.0 688: �[1m�[92m Compiling�[0m memoffset v0.9.1 689: �[1m�[92m Compiling�[0m rustversion v1.0.22 690: �[1m�[92m Checking�[0m atomic-waker v1.1.2 691: �[1m�[92m Checking�[0m fnv v1.0.7 692: �[1m�[92m Compiling�[0m pkg-config v0.3.32 693: �[1m�[92m Checking�[0m rand_core v0.10.0 694: �[1m�[92m Checking�[0m untrusted v0.7.1 695: �[1m�[92m Compiling�[0m semver v1.0.28 696: �[1m�[92m Checking�[0m try-lock v0.2.5 697: �[1m�[92m Checking�[0m utf8parse v0.2.2 698: �[1m�[92m Compiling�[0m rayon-core v1.13.0 699: �[1m�[92m Compiling�[0m thiserror v1.0.69 700: �[1m�[92m Compiling�[0m getrandom v0.4.2 701: �[1m�[92m Compiling�[0m cfg_aliases v0.2.1 702: �[1m�[92m Compiling�[0m siphasher v1.0.2 703: �[1m�[92m Checking�[0m pin-utils v0.1.0 704: �[1m�[92m Compiling�[0m phf_shared v0.13.1 705: �[1m�[92m Checking�[0m nix v0.26.4 706: �[1m�[92m Compiling�[0m nix v0.31.2 707: �[1m�[92m Checking�[0m anstyle-parse v1.0.0 708: �[1m�[92m Compiling�[0m mlua-sys v0.10.0 709: �[1m�[92m Checking�[0m want v0.3.1 710: �[1m�[92m Compiling�[0m rustc_version v0.4.1 711: �[1m�[92m Checking�[0m h2 v0.4.13 712: �[1m�[92m Checking�[0m rand_chacha v0.9.0 713: �[1m�[92m Checking�[0m flume v0.11.1 714: �[1m�[92m Checking�[0m fxhash v0.2.1 715: �[1m�[92m Checking�[0m threadpool v1.8.1 716: �[1m�[92m Compiling�[0m thiserror-impl v1.0.69 717: �[1m�[92m Compiling�[0m monoio-macros v0.1.0 ... 
745: �[1m�[92m Checking�[0m hashbrown v0.15.5 746: �[1m�[92m Checking�[0m regex-automata v0.4.14 747: �[1m�[92m Checking�[0m quanta v0.12.6 748: �[1m�[92m Checking�[0m monoio v0.2.4 749: �[1m�[92m Checking�[0m metrics v0.24.3 750: �[1m�[92m Checking�[0m rand v0.9.2 751: �[1m�[92m Compiling�[0m stop-words v0.10.0 752: �[1m�[92m Compiling�[0m crc32c v0.6.8 753: �[1m�[92m Checking�[0m crypto-common v0.2.1 754: �[1m�[92m Checking�[0m block-buffer v0.12.0 755: �[1m�[92m Checking�[0m rand_xoshiro v0.7.0 756: �[1m�[92m Compiling�[0m serde_derive v1.0.228 757: �[1m�[92m Compiling�[0m slotmap v1.1.1 758: �[1m�[92m Checking�[0m strsim v0.11.1 759: �[1m�[92m Checking�[0m utf8-ranges v1.0.5 760: �[1m�[92m Compiling�[0m thiserror v2.0.18 761: �[1m�[92m Checking�[0m tinyvec_macros v0.1.1 ... 775: �[1m�[92m Checking�[0m hyper-util v0.1.20 776: �[1m�[92m Checking�[0m metrics-util v0.19.1 777: �[1m�[92m Checking�[0m wyhash v0.5.0 778: �[1m�[92m Checking�[0m sharded-slab v0.1.7 779: �[1m�[92m Checking�[0m digest v0.11.2 780: �[1m�[92m Checking�[0m tinyvec v1.11.0 781: �[1m�[92m Checking�[0m rayon v1.11.0 782: �[1m�[92m Checking�[0m monoio-io-wrapper v0.1.1 783: �[1m�[92m Checking�[0m matchers v0.2.0 784: �[1m�[92m Compiling�[0m logos-derive v0.14.4 785: �[1m�[92m Compiling�[0m phf_macros v0.13.1 786: �[1m�[92m Checking�[0m chacha20 v0.10.0 787: �[1m�[92m Checking�[0m parking_lot v0.12.5 788: �[1m�[92m Checking�[0m futures-executor v0.3.32 789: �[1m�[92m Checking�[0m http-body-util v0.1.3 790: �[1m�[92m Compiling�[0m thiserror-impl v2.0.18 791: �[1m�[92m Checking�[0m tracing-log v0.2.0 ... 
832: Checking hex v0.4.3 833: Checking xxhash-rust v0.8.15 834: Checking tikv-jemallocator v0.6.1 835: Checking tikv-jemalloc-ctl v0.6.1 836: Checking rustls-webpki v0.103.10 837: Checking monoio-rustls v0.4.0 838: Finished `dev` profile [unoptimized + debuginfo] target(s) in 1m 27s 839: ##[group]Run cargo clippy --features graph -- -D warnings 840: cargo clippy --features graph -- -D warnings 841: shell: /usr/bin/bash -e {0} 842: env: 843: CARGO_TERM_COLOR: always 844: MOON_NO_URING: 1 845: CARGO_HOME: /home/runner/.cargo 846: CARGO_INCREMENTAL: 0 847: CACHE_ON_FAILURE: false 848: ##[endgroup] 849: Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.16s 850: ##[group]Run cargo clippy --no-default-features --features runtime-tokio,jemalloc -- -D warnings 851: cargo clippy --no-default-features --features runtime-tokio,jemalloc -- -D warnings 852: shell: /usr/bin/bash -e {0} 853: env: 854: CARGO_TERM_COLOR: always 855: MOON_NO_URING: 1 856: CARGO_HOME: /home/runner/.cargo 857: CARGO_INCREMENTAL: 0 858: CACHE_ON_FAILURE: false 859: ##[endgroup] ... 866: Checking tokio-util v0.7.18 867: Checking tokio-rustls v0.26.4 868: Checking h2 v0.4.13 869: Checking hyper v1.9.0 870: Checking hyper-util v0.1.20 871: Checking metrics-exporter-prometheus v0.16.2 872: Finished `dev` profile [unoptimized + debuginfo] target(s) in 16.77s 873: ##[group]Run cargo test --no-default-features --features runtime-tokio,jemalloc 874: cargo test --no-default-features --features runtime-tokio,jemalloc 875: shell: /usr/bin/bash -e {0} 876: env: 877: CARGO_TERM_COLOR: always 878: MOON_NO_URING: 1 879: CARGO_HOME: /home/runner/.cargo 880: CARGO_INCREMENTAL: 0 881: CACHE_ON_FAILURE: false 882: ##[endgroup] ...
1079: Compiling sharded-slab v0.1.7 1080: Compiling hyper v1.9.0 1081: Compiling quanta v0.12.6 1082: Compiling phf_generator v0.13.1 1083: Compiling hashbrown v0.15.5 1084: Compiling clap_derive v4.6.0 1085: Compiling serde_json v1.0.149 1086: Compiling metrics v0.24.3 1087: Compiling rand v0.9.2 1088: Compiling rayon v1.11.0 1089: Compiling memoffset v0.9.1 1090: Compiling tikv-jemalloc-sys v0.6.1+5.3.0-1-ge13ca993e8ccb9ba9847cc330696e02839f328f7 1091: Compiling crc32fast v1.5.0 1092: Compiling matchers v0.2.0 1093: Compiling rand_xoshiro v0.7.0 1094: Compiling thiserror-impl v1.0.69 1095: Compiling tracing-log v0.2.0 1096: Compiling thread_local v1.1.9 1097: Compiling nu-ansi-term v0.50.3 1098: Compiling fastrand v2.4.1 1099: Compiling sketches-ddsketch v0.3.1 1100: Compiling tower-service v0.3.3 1101: Compiling rand_core v0.6.4 1102: Compiling hyper-util v0.1.20 1103: Compiling wyhash v0.5.0 1104: Compiling metrics-util v0.19.1 1105: Compiling tracing-subscriber v0.3.23 1106: Compiling thiserror v1.0.69 1107: Compiling nix v0.31.2 1108: Compiling icu_properties v2.2.0 1109: Compiling icu_normalizer v2.2.0 1110: Compiling clap v4.6.0 1111: Compiling libmimalloc-sys v0.1.44 1112: Compiling phf_macros v0.13.1 1113: Compiling chacha20 v0.10.0 1114: Compiling parking_lot v0.12.5 1115: Compiling http-body-util v0.1.3 1116: Compiling futures-executor v0.3.32 1117: Compiling thiserror-impl v2.0.18 1118: Compiling spin
v0.9.8 1119: Compiling bstr v1.12.1 1120: Compiling num_cpus v1.17.0 1121: Compiling concurrent-queue v2.5.0 1122: Compiling ipnet v2.12.0 1123: Compiling rustc-hash v2.1.2 1124: Compiling hashbrown v0.14.5 1125: Compiling bytemuck v1.25.0 1126: Compiling sha1_smol v1.0.1 1127: Compiling moon v0.1.12 (/home/runner/work/moon/moon) 1128: Compiling parking v2.2.1 1129: Compiling percent-encoding v2.3.2 1130: Compiling xxhash-rust v0.8.15 1131: Compiling twox-hash v2.1.2 1132: Compiling byteorder v1.5.0 1133: Compiling lz4_flex v0.13.0 1134: Compiling thiserror v2.0.18 1135: Compiling dashmap v6.1.0 ... 1257: test acl::table::tests::test_check_command_permission_allallowed ... ok 1258: test acl::table::tests::test_load_or_default_nopass ... ok 1259: test acl::table::tests::test_list_users_sorted ... ok 1260: test acl::table::tests::version_handle_survives_replace_with ... ok 1261: test acl::table::tests::version_bumps_on_set_del_apply ... ok 1262: test admin::http_server::tests::test_healthz_response ... ok 1263: test admin::http_server::tests::test_readyz_not_ready ... ok 1264: test admin::metrics_setup::tests::cached_metrics_skips_rebuild_on_same_cmd ... ok 1265: test admin::metrics_setup::tests::cross_read_fastpath_batch_atomic_increments ... ok 1266: test admin::metrics_setup::tests::cross_read_fastpath_batch_zero_is_noop ... ok 1267: test admin::metrics_setup::tests::cross_spsc_atomic_increments ... ok 1268: test admin::metrics_setup::tests::cross_spsc_batch_atomic_increments ... ok 1269: test admin::metrics_setup::tests::cross_spsc_batch_zero_is_noop ... ok 1270: test admin::metrics_setup::tests::dispatch_path_counters_no_op_before_init ... ok 1271: test admin::metrics_setup::tests::record_command_cached_no_op_before_init ...
ok 1272: test admin::slowlog::tests::test_get_negative_count_error ... ok 1273: test admin::slowlog::tests::test_get_non_numeric_error ... ok 1274: test admin::slowlog::tests::test_handle_slowlog_help ... ok 1275: test admin::slowlog::tests::test_max_len_zero_disables ... ok 1276: test admin::slowlog::tests::test_slowlog_basic ... ok 1277: test admin::slowlog::tests::test_slowlog_max_len ... ok 1278: test admin::slowlog::tests::test_slowlog_reset ... ok 1279: test admin::slowlog::tests::test_threshold_zero_logs_everything ... ok 1280: test auth_ratelimit::tests::test_exponential_backoff ... ok 1281: test admin::metrics_setup::tests::cross_read_fastpath_atomic_increments ... ok 1282: test auth_ratelimit::tests::test_first_failure_returns_base_delay ... ok 1283: test auth_ratelimit::tests::test_success_clears_record ... ok ... 1300: test cdc::decode::tests::test_decode_temporal_upsert ... ok 1301: test cdc::decode::tests::test_decode_unknown_command_preserves_raw ... ok 1302: test cdc::fanout::tests::test_cdc_fanout_no_subscribers_is_noop ... ok 1303: test cdc::fanout::tests::test_cdc_fanout_drops_full_subscriber ... ok 1304: test cdc::fanout::tests::test_register_pending_creates_working_subscriber ... ok 1305: test cdc::fanout::tests::test_cdc_fanout_tick_drains_to_subscriber ... ok 1306: test cdc::fanout::tests::test_register_pending_drops_when_no_wal_dir ... ok 1307: test client_pause::tests::test_pause_and_check ... ok 1308: test cdc::fanout::tests::test_cdc_fanout_replays_from_lsn ... ok 1309: test client_pause::tests::test_write_mode_allows_reads ... ok 1310: test client_registry::tests::test_client_info ... ok 1311: test client_registry::tests::test_kill_by_user ... ok 1312: test client_registry::tests::test_kill_by_id ... ok 1313: test client_registry::tests::test_parse_kill_args ... ok 1314: test client_registry::tests::test_update ... ok 1315: test cluster::command::tests::cluster_count_failure_reports_counts_active_reports ... 
ok 1316: test cluster::command::tests::cluster_count_failure_reports_returns_zero_for_healthy_node ... ok 1317: test client_registry::tests::test_register_and_list ... ok 1318: test cluster::command::tests::cluster_count_failure_reports_excludes_stale_reports ... ok 1319: test cluster::command::tests::cluster_count_failure_reports_returns_zero_for_unknown_node ... ok 1320: test cluster::command::tests::cluster_replicas_rejects_unknown_node_id ... ok 1321: test cluster::command::tests::cluster_replicas_returns_empty_for_master_with_no_replicas ... ok 1322: test cluster::command::tests::test_addslots_updates_bitmap ... ok 1323: test cluster::command::tests::cluster_replicas_includes_myself_marker_when_self_is_replica ... ok 1324: test cluster::command::tests::test_cluster_info_contains_enabled ... ok 1325: test cluster::command::tests::test_cluster_meet_adds_node ... ok 1326: test cluster::command::tests::test_cluster_myid_length ... ok 1327: test cluster::command::tests::test_delslots ... ok 1328: test cluster::command::tests::test_cluster_nodes_format ... ok 1329: test cluster::command::tests::test_failover_invalid_subcommand ... ok 1330: test cluster::command::tests::test_failover_normal_sets_waiting_delay ... ok 1331: test cluster::command::tests::test_failover_rejects_on_master ... ok 1332: test cluster::command::tests::cluster_replicas_lists_replicas ... ok 1333: test cluster::command::tests::test_failover_force_promotes_replica ... ok 1334: test cluster::command::tests::test_keyslot_foo ... ok 1335: test cluster::command::tests::test_setslot_migrating_importing ... ok 1336: test cluster::command::tests::test_setslot_node_clears_migration ... ok 1337: test cluster::failover::tests::test_compute_failover_delay_includes_rank ... ok 1338: test cluster::command::tests::test_failover_takeover_promotes_replica ... ok 1339: test cluster::failover::tests::test_failover_initiates_when_master_fail ... 
ok 1340: test cluster::failover::tests::test_failover_vote_epoch_guard ... ok 1341: test cluster::failover::tests::test_no_failover_when_master_healthy ... ok 1342: test cluster::failover::tests::test_try_mark_fail_needs_majority ... ok 1343: test cluster::gossip::tests::test_gossip_section_roundtrip ... ok 1344: test cluster::gossip::tests::test_bad_magic_returns_err ... ok 1345: test cluster::gossip::tests::test_ping_roundtrip ... ok 1346: test cluster::gossip::tests::test_pong_with_sections_roundtrip ... ok 1347: test cluster::gossip::tests::test_truncated_returns_err ... ok 1348: test cluster::command::tests::cluster_slaves_is_alias_for_replicas ... ok 1349: test cluster::slots::tests::test_empty_hash_tag_uses_full_key ... ok 1350: test cluster::migration::tests::test_get_keys_in_slot_filters_correctly ... ok 1351: test cluster::slots::tests::test_error_format ... ok 1352: test cluster::migration::tests::test_migrating_slot_returns_ask_route ... ok 1353: test cluster::slots::tests::test_local_shard_for_slot ... ok 1354: test cluster::slots::tests::test_foo_slot ... ok 1355: test cluster::slots::tests::test_hash_tag_co_location ... ok 1356: test cluster::tests::test_asking_flag_with_importing_slot ... ok 1357: test cluster::tests::test_asking_without_importing_still_moved ... ok 1358: test cluster::tests::test_my_node_id ... ok 1359: test cluster::tests::test_owns_slot_bitmap ... ok 1360: test cluster::migration::tests::test_nodes_conf_roundtrip ... ok 1361: test cluster::tests::test_moved_error_frame_format ... ok 1362: test command::acl::tests::test_acl_cat_all_categories ... ok 1363: test cluster::tests::test_route_local_owned_slot ... ok 1364: test command::acl::tests::test_acl_cat_unknown_category ... ok 1365: test command::acl::tests::test_acl_deluser ... ok 1366: test cluster::tests::test_route_moved_for_peer_slot ... ok 1367: test command::acl::tests::test_acl_cat_string_category ... ok 1368: test command::acl::tests::test_acl_deluser_default_fails ... 
ok 1369: test command::acl::tests::test_acl_genpass_custom_bits ... ok 1370: test command::acl::tests::test_acl_genpass_invalid_bits ... ok 1371: test command::acl::tests::test_acl_getuser_nonexistent ... ok 1372: test command::acl::tests::test_acl_list ... ok 1373: test command::acl::tests::test_acl_genpass_default ... ok 1374: test command::acl::tests::test_acl_load_no_aclfile ... ok 1375: test command::acl::tests::test_acl_log_and_reset ... ok 1376: test command::acl::tests::test_acl_log_with_count ... ok 1377: test command::acl::tests::test_acl_save_no_aclfile ... ok 1378: test command::acl::tests::test_acl_save_and_load ... ok 1379: test command::acl::tests::test_acl_unknown_subcommand ... ok 1380: test command::acl::tests::test_acl_setuser_and_getuser ... ok 1381: test command::cdc::read::tests::test_cdc_read_argument_errors ... ok 1382: test command::acl::tests::test_acl_whoami ... ok 1383: test command::cdc::read::tests::test_cdc_read_no_new_records_returns_cursor_only ... ok 1384: test command::cdc::read::tests::test_cdc_read_drains_in_lsn_order ... ok 1385: test command::cdc::read::tests::test_cdc_read_limit_clamps_batch ... ok 1386: test command::client::tests::test_parse_tracking_off ... ok 1387: test command::client::tests::test_parse_tracking_on ... ok 1388: test command::client::tests::test_parse_tracking_on_bcast_noloop_multiple_prefixes ... ok 1389: test command::client::tests::test_parse_tracking_on_bcast_prefix ... ok 1390: test command::client::tests::test_parse_tracking_on_noloop ... ok 1391: test command::client::tests::test_parse_tracking_on_redirect ... ok 1392: test command::cdc::read::tests::test_cdc_read_respects_from_lsn ... ok 1393: test command::client::tests::test_parse_tracking_prefix_without_bcast_fails ... ok 1394: test command::client::tests::test_parse_tracking_redirect_invalid_int ... ok ... 1408: test command::connection::tests::test_auth_acl_1arg_wrong_password ... 
ok 1409: test command::connection::tests::test_auth_acl_2arg_wrong_password ... ok 1410: test command::connection::tests::test_auth_acl_wrong_arity ... ok 1411: test command::connection::tests::test_auth_acl_disabled_user ... ok 1412: test command::connection::tests::test_auth_correct_password ... ok 1413: test command::connection::tests::test_auth_no_password_configured ... ok 1414: test command::connection::tests::test_auth_wrong_password ... ok 1415: test command::connection::tests::test_client_id_returns_integer ... ok 1416: test command::connection::tests::test_command_bare ... ok 1417: test command::connection::tests::test_command_docs ... ok 1418: test command::connection::tests::test_auth_wrong_arity ... ok 1419: test command::connection::tests::test_command_docs_lowercase ... ok 1420: test command::connection::tests::test_auth_acl_2arg_named_user ... ok 1421: test command::connection::tests::test_echo ... ok 1422: test command::connection::tests::test_echo_wrong_arity ... ok 1423: test command::connection::tests::test_hello_acl_with_auth_failure ... ok 1424: test command::connection::tests::test_hello_acl_no_args ... ok 1425: test command::connection::tests::test_hello_acl_with_auth_success ... ok 1426: test command::connection::tests::test_hello_downgrade_to_resp2 ... ok 1427: test command::connection::tests::test_hello_no_args_returns_current_proto ... ok 1428: test command::connection::tests::test_hello_noproto ... ok 1429: test command::connection::tests::test_hello_upgrade_to_resp3 ... ok 1430: test command::connection::tests::test_hello_with_auth_failure ... ok 1431: test command::connection::tests::test_hello_with_auth_success ... ok 1432: test command::connection::tests::test_hello_with_setname ... ok 1433: test command::connection::tests::test_info_basic ... ok 1434: test command::connection::tests::test_ping_too_many_args ... ok 1435: test command::connection::tests::test_ping_no_args ... 
ok 1436: test command::connection::tests::test_info_with_keys ... ok 1437: test command::connection::tests::test_replconf_ack_offset ... ok 1438: test command::connection::tests::test_replconf_capa ... ok 1439: test command::connection::tests::test_replconf_empty_args ... ok 1440: test command::connection::tests::test_replconf_listening_port ... ok 1441: test command::connection::tests::test_ping_with_arg ... ok 1442: test command::connection::tests::test_replconf_missing_value_errors ... ok 1443: test command::connection::tests::test_replconf_multi_pair_capa_handshake ... ok ... 1457: test command::geo::tests::test_geohash_roundtrip ... ok 1458: test command::geo::tests::test_geohash_string ... ok 1459: test command::geo::tests::test_haversine_rome_paris ... ok 1460: test command::geo::tests::test_geosearch_byradius ... ok 1461: test command::hash::tests::test_active_tick_deletes_hash_when_all_fields_expired ... ok 1462: test command::hash::tests::test_active_tick_downgrades_hash_when_last_ttl_drained ... ok 1463: test command::hash::tests::test_hdel_downgrades_when_last_ttl_removed ... ok 1464: test command::hash::tests::test_active_tick_reaps_expired_fields ... ok 1465: test command::hash::tests::test_hdel_fields ... ok 1466: test command::hash::tests::test_hdel_removes_empty_hash ... ok 1467: test command::hash::tests::test_hdel_removes_ttl_sidecar_entry ... ok 1468: test command::hash::tests::test_hexists ... ok 1469: test command::hash::tests::test_hdel_works_on_ttl_encoded_hash ... ok 1470: test command::hash::tests::test_hexists_returns_zero_for_expired_field ... ok 1471: test command::hash::tests::test_hexpire_family_read_missing_key_returns_neg2_per_field ... ok 1472: test command::hash::tests::test_hexpire_family_read_numfields_zero_returns_error ... ok 1473: test command::hash::tests::test_hexpire_family_read_wrongtype_for_string_value ... ok 1474: test command::hash::tests::test_hexpire_missing_key_returns_zero_per_field ... 
ok 1475: test command::hash::tests::test_hexpire_numfields_mismatch_returns_error ... ok 1476: test command::hash::tests::test_hexpire_numfields_zero_returns_error ... ok 1477: test command::hash::tests::test_hexpire_nx_xx_conflict_returns_error ... ok 1478: test command::hash::tests::test_hexpire_returns_neg2_when_lt_and_new_gt_current ... ok ... 1485: test command::hash::tests::test_hexpiretime_returns_absolute_seconds ... ok 1486: test command::hash::tests::test_hexpireat_absolute_seconds ... ok 1487: test command::hash::tests::test_hexpireat_returns_two_when_expiry_in_past ... ok 1488: test command::hash::tests::test_hexpiretime_returns_neg2_for_missing_field ... ok 1489: test command::hash::tests::test_hexpiretime_returns_neg1_for_no_ttl ... ok 1490: test command::hash::tests::test_hget_missing_key ... ok 1491: test command::hash::tests::test_hget_existing ... ok 1492: test command::hash::tests::test_hget_missing_field ... ok 1493: test command::hash::tests::test_hget_skips_expired_field ... ok 1494: test command::hash::tests::test_hget_returns_value_when_now_below_min_expiry ... ok 1495: test command::hash::tests::test_hgetall ... ok 1496: test command::hash::tests::test_hgetall_omits_expired_fields ... ok 1497: test command::hash::tests::test_hgetall_missing ... ok 1498: test command::hash::tests::test_hgetdel_deletes_key_when_last_field ... ok 1499: test command::hash::tests::test_hgetdel_downgrades_when_last_ttl_removed ... ok 1500: test command::hash::tests::test_hgetdel_numfields_zero_returns_error ... ok 1501: test command::hash::tests::test_hgetdel_on_listpack_encoded_hash ... ok 1502: test command::hash::tests::test_hgetdel_on_hash_with_ttl_removes_ttl_sidecar ... ok 1503: test command::hash::tests::test_hgetdel_returns_values_and_deletes ... ok 1504: test command::hash::tests::test_hgetdel_wrongtype_error ... ok 1505: test command::hash::tests::test_hgetdel_returns_nil_for_missing_field ... 
ok 1506: test command::hash::tests::test_hgetex_mutually_exclusive_modes_return_error ... ok 1507: test command::hash::tests::test_hgetex_returns_nil_for_missing_field ... ok 1508: test command::hash::tests::test_hgetex_with_ex_updates_ttl ... ok 1509: test command::hash::tests::test_hgetex_with_exat_sets_absolute_ttl ... ok 1510: test command::hash::tests::test_hgetex_with_persist_removes_ttl ... ok 1511: test command::hash::tests::test_hgetex_with_px_updates_ttl_in_ms ... ok 1512: test command::hash::tests::test_hgetex_with_pxat_sets_absolute_ms ... ok 1513: test command::hash::tests::test_hgetex_without_mode_returns_values_unchanged_ttl ... ok 1514: test command::hash::tests::test_hgetex_wrongtype_error ... ok 1515: test command::hash::tests::test_hincrby ... ok ... 1548: test command::hash::tests::test_hset_update_existing ... ok 1549: test command::hash::tests::test_hset_clears_ttl_on_overwrite ... ok 1550: test command::hash::tests::test_hsetnx ... ok 1551: test command::hash::tests::test_hsetnx_does_not_clear_ttl_when_field_exists ... ok 1552: test command::hash::tests::test_hset_works_on_ttl_encoded_hash ... ok 1553: test command::hash::tests::test_httl_returns_remaining_seconds ... ok 1554: test command::hash::tests::test_httl_returns_zero_for_already_expired ... ok 1555: test command::hash::tests::test_hset_wrong_args ... ok 1556: test command::hash::tests::test_hvals_omits_expired_fields ... ok 1557: test command::hash::tests::test_hsetnx_works_on_ttl_encoded_hash ... ok 1558: test command::hash::tests::test_min_expiry_ms_recomputes_after_hpersist ... ok 1559: test command::hash::tests::test_hlen_fast_path_when_no_fields_expired ... ok 1560: test command::hash::tests::test_min_expiry_ms_recomputes_after_hset_overwrite ... ok 1561: test command::hash::tests::test_min_expiry_ms_recomputes_after_active_reap ... ok 1562: test command::hash::tests::test_min_expiry_ms_tracks_minimum_across_hexpire_calls ... 
ok 1563: test command::hash::tests::test_wrongtype_error ... ok 1564: test command::info_reclamation::tests::info_reclamation_plan_cache_ratio_zero_by_default ... ok ... 1610: test command::key::tests::test_time ... ok 1611: test command::key::tests::test_scan_with_type_filter ... ok 1612: test command::key::tests::test_touch ... ok 1613: test command::key::tests::test_ttl_missing_key ... ok 1614: test command::key::tests::test_ttl_no_expiry ... ok 1615: test command::key::tests::test_type_hash ... ok 1616: test command::key::tests::test_type_list ... ok 1617: test command::key::tests::test_type_none ... ok 1618: test command::key::tests::test_type_set ... ok 1619: test command::key::tests::test_rename_same_key ... ok 1620: test command::key::tests::test_renamenx_dest_exists ... ok 1621: test command::key::tests::test_type_zset ... ok 1622: test command::key::tests::test_type_string ... ok 1623: test command::key::tests::test_unlink_no_args ... ok 1624: test command::key_extra::tests::test_copy_basic ... ok 1625: test command::key_extra::tests::test_copy_db_missing_index_returns_error ... ok 1626: test command::key_extra::tests::test_copy_db_same_db_fallthrough ... ok ... 1769: test command::server_admin::tests::debug_help_lists_subcommands ... ok 1770: test command::server_admin::tests::debug_object_returns_encoding_refcount_serlen ... ok 1771: test command::server_admin::tests::debug_reclamation_returns_bulk_string_with_sections ... ok 1772: test command::mq::tests::test_validate_mq_trigger_with_debounce ... ok 1773: test command::server_admin::tests::debug_sleep_rejects_negative ... ok 1774: test command::server_admin::tests::debug_sleep_rejects_non_float ... ok 1775: test command::server_admin::tests::debug_unknown_subcommand ... ok 1776: test command::persistence::tests::test_bgrewriteaof_sharded_refuses_under_unsafe_config ... ok 1777: test command::server_admin::tests::flushall_accepts_async ... 
ok 1778: test command::server_admin::tests::flushall_empties_db ... ok 1779: test command::server_admin::tests::debug_sleep_zero_is_immediate ... ok 1780: test command::server_admin::tests::flushall_rejects_garbage ... ok 1781: test command::server_admin::tests::flushdb_clears_current ... ok 1782: test command::server_admin::tests::kill_snapshot_active_txn_returns_ok ... ok 1783: test command::server_admin::tests::flushall_accepts_sync ... ok 1784: test command::server_admin::tests::kill_snapshot_unknown_txn_id_returns_error ... ok 1785: test command::server_admin::tests::kill_snapshot_missing_txn_id_returns_error ... ok 1786: test command::server_admin::tests::kill_snapshot_wrong_subcommand_returns_error ... ok 1787: test command::server_admin::tests::memory_stats_returns_map ... ok 1788: test command::server_admin::tests::memory_help_lists_usage ... ok 1789: test command::server_admin::tests::memory_unknown_subcommand ... ok 1790: test command::server_admin::tests::memory_usage_existing_key ... ok 1791: test command::server_admin::tests::memory_usage_missing_key_returns_null ... ok 1792: test command::server_admin::tests::memory_usage_samples_flag_accepted ... ok 1793: test command::server_admin::tests::memory_usage_samples_rejects_non_integer ... ok 1794: test command::server_admin::tests::vacuum_files_no_manifest_returns_zero ... ok 1795: test command::server_admin::tests::vacuum_freeze_kills_active_snapshots ... ok 1796: test command::server_admin::tests::vacuum_freeze_returns_kv_array ... ok 1797: test command::server_admin::tests::vacuum_graph_returns_pending ... ok 1798: test command::server_admin::tests::vacuum_unknown_subcommand_returns_error ... ok 1799: test command::server_admin::tests::vacuum_no_persistence_returns_array ... ok ... 1996: test command::string::tests::test_set_xx_exists ... ok 1997: test command::string::tests::test_strlen_existing ... ok 1998: test command::string::tests::test_strlen_missing ... 
ok 1999: test command::temporal::tests::test_capture_wall_ms_positive ... ok 2000: test command::string::tests::test_substr_negative_indices ... ok 2001: test command::string::tests::test_substr_alias ... ok 2002: test command::temporal::tests::test_is_temporal_invalidate ... ok 2003: test command::temporal::tests::test_is_temporal_snapshot_at ... ok 2004: test command::temporal::tests::test_validate_invalidate_edge ... ok 2005: test command::temporal::tests::test_validate_invalidate_invalid_entity_id ... ok 2006: test command::temporal::tests::test_validate_invalidate_invalid_kind ... ok 2007: test command::temporal::tests::test_validate_invalidate_valid ... ok 2008: test command::temporal::tests::test_validate_snapshot_at_no_args ... ok 2009: test command::temporal::tests::test_validate_invalidate_wrong_arg_count ... ok 2010: test command::temporal::tests::test_validate_snapshot_at_rejects_args ... ok 2011: test command::tests::move_dispatch_fallback_returns_error ... ok 2012: test command::tests::test_dispatch_case_insensitive ... ok 2013: test command::tests::swapdb_dispatch_stub_returns_error ... ok 2014: test command::tests::test_dispatch_get_set ... ok ... 2026: test command::tests::test_object_encoding_list ... ok 2027: test command::tests::test_object_encoding_list_upgrade ... ok 2028: test command::tests::test_object_encoding_missing_key ... ok 2029: test command::tests::test_object_encoding_set_hashtable ... ok 2030: test command::tests::test_object_encoding_set_intset ... ok 2031: test command::tests::test_object_encoding_sorted_set ... ok 2032: test command::tests::test_object_encoding_string ... ok 2033: test command::tests::test_object_help ... ok 2034: test command::tests::test_object_unknown_subcommand ... ok 2035: test command::tests::test_sadd_intset_add_more_integers ... ok 2036: test command::tests::test_sadd_intset_upgrade_on_non_integer ... ok 2037: test command::transaction::tests::test_err_txn_cross_shard_is_defined ... 
ok 2038: test command::transaction::tests::test_is_txn_abort ... ok 2039: test command::transaction::tests::test_is_txn_commit ... ok 2040: test command::transaction::tests::test_parse_subcommand_empty_args ... ok 2041: test command::transaction::tests::test_txn_abort_validate_not_in_txn_fails ... ok 2042: test command::transaction::tests::test_txn_abort_validate_success ... ok 2043: test command::transaction::tests::test_txn_begin_validate_in_multi_fails ... ok 2044: test command::transaction::tests::test_txn_begin_validate_in_txn_fails ... ok 2045: test command::transaction::tests::test_txn_begin_validate_success ... ok 2046: test command::transaction::tests::test_txn_commit_validate_not_in_txn_fails ... ok 2047: test command::transaction::tests::test_txn_commit_validate_success ... ok ... 2064: test command::vector_search::hybrid::tests::test_bm25_to_search_results_xxh64_seed_zero ... ok 2065: test command::vector_search::hybrid::tests::test_execute_hybrid_local_unknown_index ... ok 2066: test command::vector_search::hybrid::tests::test_hybrid_backward_compat_no_hybrid_keyword ... ok 2067: test command::tests::test_object_encoding_hash_upgrade ... ok 2068: test command::vector_search::hybrid::tests::test_effective_k_per_stream_default ... ok 2069: test command::vector_search::hybrid::tests::test_parse_hybrid_accepts_zero_weight ... ok 2070: test command::vector_search::hybrid::tests::test_parse_hybrid_extracts_dollar_blob ... ok 2071: test command::vector_search::hybrid::tests::test_parse_hybrid_k_per_stream ... ok 2072: test command::vector_search::hybrid::tests::test_parse_hybrid_minimal ... ok 2073: test command::vector_search::hybrid::tests::test_parse_hybrid_missing_keyword_returns_none ... ok 2074: test command::vector_search::hybrid::tests::test_parse_hybrid_rejects_nan_weight ... ok 2075: test command::vector_search::hybrid::tests::test_parse_hybrid_rejects_negative_weight ... 
ok 2076: test command::vector_search::hybrid::tests::test_parse_hybrid_rejects_non_rrf_fusion ... ok 2077: test command::vector_search::hybrid::tests::test_parse_hybrid_rejects_wrong_weight_count_exact ... ok 2078: test command::vector_search::hybrid::tests::test_parse_hybrid_rejects_wrong_weight_count ... ok 2079: test command::vector_search::hybrid::tests::test_parse_hybrid_unknown_param_errors ... ok 2080: test command::vector_search::hybrid::tests::test_parse_hybrid_weights ... ok ... 2110: test command::vector_search::tests::test_ft_config_autocompact_accepts_variants ... ok 2111: test command::vector_search::tests::test_ft_config_autocompact_guards_try_compact ... ok 2112: test command::vector_search::tests::test_ft_config_autocompact_on_off ... ok 2113: test command::vector_search::tests::test_ft_config_unknown_index ... ok 2114: test command::vector_search::tests::test_ft_create_duplicate_field_rejected ... ok 2115: test command::vector_search::tests::test_ft_create_duplicate ... ok 2116: test command::vector_search::tests::test_ft_config_unknown_param ... ok 2117: test command::vector_search::tests::test_ft_create_missing_dim ... ok 2118: test command::vector_search::tests::test_ft_create_exceeds_max_fields ... ok 2119: test command::vector_search::tests::test_ft_create_multi_field ... ok 2120: test command::vector_search::tests::test_ft_create_parse_full_syntax ... ok 2121: test command::vector_search::tests::test_ft_dropindex ... ok 2122: test command::vector_search::tests::test_ft_dropindex_dd_deletes_docs ... ok 2123: test command::vector_search::tests::test_ft_dropindex_dd_case_insensitive ... ok 2124: test command::vector_search::tests::test_ft_dropindex_dd_unknown_index ... ok 2125: test command::vector_search::tests::test_ft_dropindex_extra_args_error ... ok 2126: test command::vector_search::tests::test_ft_info ... ok 2127: test command::vector_search::tests::test_ft_dropindex_preserves_docs ... 
ok 2128: test command::vector_search::tests::test_ft_info_multi_field ... ok 2129: test command::vector_search::tests::test_ft_info_returns_correct_data ... ok 2130: test command::vector_search::tests::test_ft_search_dimension_mismatch ... ok 2131: test command::vector_search::tests::test_ft_search_default_field_compat ... ok 2132: test command::vector_search::tests::test_ft_search_empty_index ... ok 2133: test command::vector_search::tests::test_ft_search_unknown_field_error ... ok 2134: test command::vector_search::tests::test_ft_search_field_targeting ... ok 2135: test command::vector_search::tests::test_ft_search_unknown_index ... ok 2136: test command::vector_search::tests::test_ft_search_with_filter_no_regression ... ok 2137: test command::vector_search::tests::test_hybrid_search_basic ... ok 2138: test command::vector_search::tests::test_hybrid_search_hit_counts ... ok 2139: test command::vector_search::tests::test_merge_search_results_combines_shards ... ok 2140: test command::vector_search::tests::test_hybrid_search_dense_only_backward_compat ... ok 2141: test command::vector_search::tests::test_merge_search_results_empty ... ok 2142: test command::vector_search::tests::test_merge_search_results_handles_errors ... ok 2143: test command::vector_search::tests::test_hybrid_search_sparse_only ... ok ... 2224: test config::tests::test_custom_port ... ok 2225: test config::tests::test_default_values ... ok 2226: test config::tests::test_disk_offload_defaults ... ok 2227: test config::tests::test_maxmemory_custom ... ok 2228: test config::tests::test_effective_disk_offload_dir ... ok 2229: test config::tests::test_parse_size ... ok 2230: test config::tests::test_maxmemory_defaults ... ok 2231: test config::tests::test_persistence_custom_values ... ok 2232: test config::tests::test_pagecache_size_bytes ... ok 2233: test config::tests::test_persistence_defaults ... ok 2234: test config::tests::test_runtime_config_default ... 
ok 2235: test config::tests::test_requirepass ... ok 2236: test config::tests::test_requirepass_default_none ... ok 2237: test config::tests::test_shards_custom ... ok 2238: test config::tests::test_to_runtime_config ... ok 2239: test error::tests::moon_error_from_aof_error ... ok 2240: test error::tests::moon_error_from_io_error ... ok 2241: test config::tests::test_shards_default ... ok 2242: test error::tests::moon_error_from_rdb_error ... ok 2243: test error::tests::moon_error_from_snapshot_error ... ok 2244: test error::tests::moon_result_alias_works ... ok 2245: test error::tests::moon_error_from_wal_error ... ok 2246: test io::buf_ring::tests::test_buf_ring_manager_new ... ok ... 2315: test mq::wal::tests::test_mq_ack_malformed_truncated_key ... ok 2316: test mq::wal::tests::test_mq_ack_roundtrip ... ok 2317: test mq::wal::tests::test_mq_ack_roundtrip_max_id ... ok 2318: test mq::wal::tests::test_mq_ack_roundtrip_empty_key ... ok 2319: test mq::wal::tests::test_mq_ack_roundtrip_zero_id ... ok 2320: test mq::wal::tests::test_mq_create_extra_bytes_ignored ... ok 2321: test mq::wal::tests::test_mq_create_malformed_too_short ... ok 2322: test mq::wal::tests::test_mq_create_malformed_empty ... ok 2323: test mq::wal::tests::test_mq_create_malformed_missing_max_delivery ... ok 2324: test mq::wal::tests::test_mq_create_malformed_truncated_key ... ok 2325: test mq::wal::tests::test_mq_create_roundtrip ... ok 2326: test mq::wal::tests::test_mq_create_roundtrip_empty_key ... ok 2327: test mq::wal::tests::test_mq_create_roundtrip_long_key ... ok 2328: test mq::wal::tests::test_mq_create_roundtrip_max_delivery ... ok 2329: test mq::wal::tests::test_mq_create_roundtrip_zero_delivery ... ok 2330: test persistence::aof::pool_tests::append_sync_writer_dropped_resolves_recv_error ... ok 2331: test persistence::aof::pool_tests::append_sync_writer_reports_fsync_failed ... ok 2332: test persistence::aof::pool_tests::append_sync_writer_reports_write_failed ... 
ok 2333: test persistence::aof::pool_tests::per_shard_pool_rejects_rewrite_with_explicit_error ... ok 2334: test persistence::aof::pool_tests::broadcast_shutdown_reaches_every_writer ... ok 2335: test persistence::aof::pool_tests::per_shard_pool_routes_each_shard_to_its_own_writer ... ok 2336: test persistence::aof::pool_tests::top_level_pool_accepts_rewrite ... ok 2337: test persistence::aof::pool_tests::per_shard_pool_threads_lsn_field_to_each_writer ... ok 2338: test persistence::aof::pool_tests::try_send_append_sync_queues_appendsync_with_ack ... ok 2339: test persistence::aof::pool_tests::top_level_pool_routes_all_shards_to_writer_zero ... ok 2340: test persistence::aof::tests::test_aof_replay_corrupt_truncated_logs_error_loads_what_it_can ... ok 2341: test persistence::aof::tests::test_aof_replay_collection_types ... ok ... 2346: test persistence::aof::tests::test_aof_replay_with_select_switches_databases ... ok 2347: test persistence::aof::tests::test_generate_aof_command_produces_valid_resp_that_round_trips ... ok 2348: test persistence::aof::tests::test_generate_rewrite_commands_with_ttl ... ok 2349: test persistence::aof::tests::test_generate_rewrite_commands_all_5_types ... ok 2350: test persistence::aof::tests::test_serialize_command_round_trip_hset ... ok 2351: test persistence::aof_manifest::tests_v2::global_max_lsn_returns_max_across_shards ... ok 2352: test persistence::aof::tests::test_generate_rewrite_round_trip_preserves_state ... ok 2353: test persistence::aof_manifest::tests_v2::is_legacy_top_level_layout_detects_v1_files ... ok 2354: test persistence::aof_manifest::tests_v2::base_incr_paths_route_to_shard_zero_after_migration ... ok 2355: test persistence::aof_manifest::tests_v2::is_legacy_top_level_layout_returns_false_for_v2 ... ok 2356: test persistence::aof_manifest::tests_v2::migrate_does_not_mutate_on_missing_base ... ok 2357: test persistence::aof_manifest::tests_v2::ordered_entry_lsn_flag_set_via_try_send_append_ordered ... 
ok 2358: test persistence::aof_manifest::tests_v2::parse_v2_rejects_non_contiguous_shard_ids ... ok 2359: test persistence::aof_manifest::tests_v2::parse_v2_rejects_shard_count_mismatch_in_file ... ok 2360: test persistence::aof_manifest::tests_v2::replay_incr_framed_buffers_ordered_entries ... ok 2361: test persistence::aof_manifest::tests_v2::replay_incr_framed_complete_but_corrupt_payload_errors ... ok 2362: test persistence::aof_manifest::tests_v2::migrate_rolls_back_filesystem_when_incr_rename_fails ... ok 2363: test persistence::aof_manifest::tests_v2::replay_incr_framed_decodes_lsn_and_resp ... ok 2364: test persistence::aof_manifest::tests_v2::migrate_top_level_to_per_shard_moves_files_and_rewrites_manifest ... ok 2365: test persistence::aof_manifest::tests_v2::replay_incr_framed_truncated_payload_is_crash_eof ... ok 2366: test persistence::aof_manifest::tests_v2::replay_incr_framed_truncated_header_is_crash_eof ... ok 2367: test persistence::aof_manifest::tests_v2::replay_ordered_merge_empty_returns_zero ... ok 2368: test persistence::aof_manifest::tests_v2::replay_ordered_merge_sorts_by_lsn_across_shards ... ok 2369: test persistence::aof_manifest::tests_v2::v1_manifest_loads_as_top_level_single_shard ... ok 2370: test persistence::aof_manifest::tests_v2::replay_per_shard_rejects_shard_count_mismatch ... ok 2371: test persistence::aof_manifest::tests_v2::verify_shard_count_emits_rfc_error_verbatim ... ok 2372: test persistence::auto_save::tests::test_parse_save_rules_empty_string ... ok ... 2423: test persistence::compression::tests::test_gorilla_single ... ok 2424: test persistence::compression::tests::test_gorilla_varying ... ok 2425: test persistence::compression::tests::test_gorilla_special_values ... ok 2426: test persistence::compression::tests::test_varint_roundtrip ... ok 2427: test persistence::compression::tests::test_zigzag_roundtrip ... ok 2428: test persistence::control::tests::test_control_path ... 
ok 2429: test persistence::control::tests::test_corrupted_crc_detected ... ok 2430: test persistence::control::tests::test_atomic_write_overwrites_existing_file ... ok 2431: test persistence::control::tests::test_read_nonexistent_file ... ok 2432: test persistence::control::tests::test_corrupted_control_file_recovers_via_manual_replace ... ok 2433: test persistence::control::tests::test_lsn_fields_survive_roundtrip ... ok 2434: test persistence::control::tests::test_shard_state_from_u8 ... ok 2435: test persistence::control::tests::test_roundtrip_all_fields ... ok 2436: test persistence::control::tests::test_write_produces_exactly_4096_bytes ... ok 2437: test persistence::fsync::tests::test_fsync_directory ... ok 2438: test persistence::fsync::tests::test_fsync_nonexistent_returns_error ... ok 2439: test persistence::fsync::tests::test_fsync_file ... ok ... 2458: test persistence::kv_page::tests::test_page_full ... ok 2459: test persistence::control::tests::test_shard_state_variants ... ok 2460: test persistence::kv_page::tests::test_value_type_roundtrip ... ok 2461: test persistence::kv_page::tests::test_small_values_not_compressed ... ok 2462: test persistence::kv_page::tests::test_value_type_from_u8 ... ok 2463: test persistence::manifest::tests::file_entry_exactly_56_bytes ... ok 2464: test persistence::manifest::tests::file_entry_page_size_variants ... ok 2465: test persistence::manifest::tests::file_entry_read_from_short_buffer ... ok 2466: test persistence::manifest::tests::file_entry_last_modified_lsn_independent_of_created ... ok 2467: test persistence::manifest::tests::file_entry_roundtrip_all_fields ... ok 2468: test persistence::manifest::tests::file_entry_v1_decodes_with_synthesized_last_modified ... ok 2469: test persistence::manifest::tests::file_status_all_variants ... ok 2470: test persistence::manifest::tests::file_storage_tier_all_variants ... ok 2471: test persistence::manifest::tests::test_manifest_add_remove_file ... 
ok 2472: test persistence::manifest::tests::test_manifest_alternating_commit ... ok 2473: test persistence::manifest::tests::test_manifest_both_corrupt_returns_error ... ok 2474: test persistence::manifest::tests::test_manifest_create_and_open ... ok ... 2496: test persistence::page_cache::frame::tests::test_fpi_pending_preserves_other_flags ... ok 2497: test persistence::page_cache::frame::tests::test_frame_descriptor_stores_identity ... ok 2498: test persistence::page_cache::frame::tests::test_fpi_pending_set_clear ... ok 2499: test persistence::page_cache::frame::tests::test_initial_state_is_zeroed ... ok 2500: test persistence::page_cache::frame::tests::test_io_in_progress_prevents_eviction ... ok 2501: test persistence::page_cache::frame::tests::test_is_evictable ... ok 2502: test persistence::page_cache::frame::tests::test_pack_unpack_roundtrip ... ok 2503: test persistence::page_cache::frame::tests::test_touch_caps_at_max_usage ... ok 2504: test persistence::page_cache::frame::tests::test_pin_increments_refcount ... ok 2505: test persistence::page_cache::frame::tests::test_unpin_decrements_refcount ... ok 2506: test persistence::page_cache::tests::test_arm_all_fpi_pending_sets_on_valid_frames ... ok 2507: test persistence::page_cache::tests::test_flush_dirty_pages_basic ... ok 2508: test persistence::page_cache::tests::test_flush_dirty_pages_respects_max ... ok 2509: test persistence::page_cache::tests::test_flush_dirty_pages_with_fpi_calls_fpi_fn ... ok 2510: test persistence::page_cache::tests::test_flush_dirty_pages_with_fpi_skips_non_fpi ... ok 2511: test persistence::page_cache::tests::test_page_cache_all_pinned_returns_error ... ok 2512: test persistence::page_cache::tests::test_page_cache_eviction_on_full ... ok 2513: test persistence::page_cache::tests::test_page_cache_cache_hit ... ok 2514: test persistence::page_cache::tests::test_page_cache_fetch_and_pin ... ok 2515: test persistence::page_cache::tests::test_page_cache_flush_wal_before_data ... 
ok 2516: test persistence::page_cache::tests::test_page_cache_mark_dirty ... ok 2517: test persistence::page_cache::tests::test_page_cache_mixed_sizes ... ok 2518: test persistence::rdb::tests::test_expired_keys_skipped_during_save ... ok 2519: test persistence::rdb::tests::test_empty_database_produces_valid_rdb ... ok 2520: test persistence::rdb::tests::test_crc32_catches_corruption ... ok 2521: test persistence::rdb::tests::test_missing_file_returns_error ... ok 2522: test persistence::rdb::tests::test_round_trip_hash ... ok ... 2649: test persistence::wal_v3::segment::tests::test_recycle_respects_min_wal_size ... ok 2650: test persistence::wal_v3::segment::tests::test_recycle_segments_before ... ok 2651: test persistence::wal_v3::segment::tests::test_writer_resumes_lsn_across_restart ... ok 2652: test persistence::wal_v3::segment::tests::test_writer_segment_rotation ... ok 2653: test persistence::wal_v3::tail::tests::test_tail_empty_dir_returns_none ... ok 2654: test persistence::wal_v3::tail::tests::test_tail_handles_partial_record_at_tail ... ok 2655: test persistence::wal_v3::segment::tests::test_writer_resumes_across_segments ... ok 2656: test persistence::wal_v3::tail::tests::test_tail_reads_appended_records_in_order ... ok 2657: test protocol::frame::tests::frame_size_measurement ... ok 2658: test protocol::frame::tests::test_frame_empty_array_is_valid ... ok 2659: test protocol::frame::tests::test_frame_null_not_equal_to_empty_bulk_string ... ok 2660: test protocol::frame::tests::test_frame_simple_string_debug_clone_partialeq ... ok 2661: test protocol::frame::tests::test_parse_config_default_max_array_depth ... ok 2662: test protocol::frame::tests::test_parse_config_default_max_array_length ... ok 2663: test persistence::wal_v3::tail::tests::test_tail_resumes_from_cursor ... ok 2664: test protocol::frame::tests::test_parse_error_incomplete_display ... ok 2665: test protocol::frame::tests::test_parse_config_default_max_bulk_string_size ... 
ok 2666: test persistence::wal_v3::tail::tests::test_tail_advances_across_segment_rotation ... ok 2667: test protocol::frame::tests::test_parse_error_invalid_display ... ok 2668: test protocol::inline::tests::test_parse_inline_buffer_consumed ... ok ... 2676: test protocol::inline::tests::test_parse_inline_set_key_value ... ok 2677: test protocol::inline::tests::test_parse_inline_whitespace_only ... ok 2678: test protocol::inline::tests::test_parse_inline_tab_separated ... ok 2679: test protocol::parse::tests::test_buffer_consumed_after_parse ... ok 2680: test protocol::parse::tests::test_crash_artifact_bare_lf_in_frame_count ... ok 2681: test protocol::parse::tests::test_fuzz_crash_resp3_set_negative_count ... ok 2682: test protocol::parse::tests::test_parse_array_depth_exceeding_max ... ok 2683: test protocol::parse::tests::test_parse_array_of_bulk_strings ... ok 2684: test protocol::parse::tests::test_parse_array_with_null_element ... ok 2685: test protocol::parse::tests::test_parse_binary_data_in_bulk_string ... ok 2686: test protocol::parse::tests::test_parse_bulk_string_exceeding_max_size ... ok 2687: test protocol::parse::tests::test_parse_bulk_string ... ok 2688: test protocol::parse::tests::test_parse_empty_array ... ok 2689: test protocol::parse::tests::test_parse_empty_buffer ... ok 2690: test protocol::parse::tests::test_parse_empty_bulk_string ... ok 2691: test protocol::parse::tests::test_parse_error ... ok 2692: test protocol::parse::tests::test_parse_incomplete_array ... ok ... 2721: test protocol::parse::tests::test_parse_simple_string ... ok 2722: test protocol::parse::tests::test_parse_simple_string_long ... ok 2723: test protocol::parse::tests::test_parse_two_frames_sequentially ... ok 2724: test protocol::parse::tests::test_resp3_negative_map_count ... ok 2725: test protocol::parse::tests::test_resp3_negative_set_count ... ok 2726: test protocol::parse::tests::test_resp3_null_map ... 
ok 2727: test protocol::parse::tests::test_resp3_null_push ... ok 2728: test protocol::parse::tests::test_resp3_null_set ... ok 2729: test protocol::resp3::tests::test_array_to_map ... ok 2730: test protocol::resp3::tests::test_array_to_map_empty_passthrough ... ok 2731: test protocol::resp3::tests::test_array_to_set ... ok 2732: test protocol::resp3::tests::test_bulk_to_double_null ... ok 2733: test protocol::resp3::tests::test_int_to_bool ... ok 2734: test protocol::resp3::tests::test_bulk_to_double ... ok 2735: test protocol::resp3::tests::test_maybe_convert_get_unchanged ... ok 2736: test protocol::resp3::tests::test_maybe_convert_error_passthrough ... ok 2737: test protocol::resp3::tests::test_maybe_convert_hgetall_resp2_unchanged ... ok 2738: test protocol::resp3::tests::test_maybe_convert_hgetall_resp3 ... ok 2739: test protocol::resp3::tests::test_maybe_convert_null_passthrough ... ok 2740: test protocol::resp3::tests::test_maybe_convert_sismember_resp3 ... ok 2741: test protocol::serialize::tests::test_resp2_downgrade_boolean_to_integer ... ok 2742: test protocol::resp3::tests::test_maybe_convert_smembers_resp3 ... ok 2743: test protocol::resp3::tests::test_maybe_convert_zscore_resp3 ... ok 2744: test protocol::serialize::tests::test_resp2_downgrade_double_to_bulk_string ... ok 2745: test protocol::serialize::tests::test_resp2_downgrade_map_to_flat_array ... ok 2746: test protocol::serialize::tests::test_resp2_null_still_dollar_minus_one ... ok 2747: test protocol::serialize::tests::test_resp2_downgrade_set_to_array ... ok 2748: test protocol::serialize::tests::test_round_trip_array ... ok 2749: test protocol::serialize::tests::test_round_trip_error ... ok 2750: test protocol::serialize::tests::test_round_trip_bulk_string ... ok ... 2753: test protocol::serialize::tests::test_round_trip_nested_array ... ok 2754: test protocol::serialize::tests::test_round_trip_null ... ok 2755: test protocol::serialize::tests::test_round_trip_resp3_big_number ... 
ok 2756: test protocol::serialize::tests::test_round_trip_resp3_boolean ... ok 2757: test protocol::serialize::tests::test_round_trip_resp3_double ... ok 2758: test protocol::serialize::tests::test_round_trip_resp3_null ... ok 2759: test protocol::serialize::tests::test_round_trip_resp3_map ... ok 2760: test protocol::serialize::tests::test_round_trip_resp3_verbatim_string ... ok 2761: test protocol::serialize::tests::test_round_trip_resp3_set ... ok 2762: test protocol::serialize::tests::test_round_trip_resp3_push ... ok 2763: test protocol::serialize::tests::test_round_trip_simple_string ... ok 2764: test protocol::serialize::tests::test_serialize_bulk_string ... ok 2765: test protocol::serialize::tests::test_serialize_empty_array ... ok 2766: test protocol::serialize::tests::test_serialize_array_of_bulk_strings ... ok 2767: test protocol::serialize::tests::test_serialize_empty_bulk_string ... ok 2768: test protocol::serialize::tests::test_serialize_error ... ok 2769: test protocol::serialize::tests::test_serialize_integer_negative ... ok ... 2865: test scripting::sandbox::tests::test_sandbox_allows_string_math_table ... ok 2866: test scripting::sandbox::tests::test_sandbox_blocks_os_other_fns ... ok 2867: test scripting::sandbox::tests::test_sandbox_removes_dangerous_globals ... ok 2868: test scripting::sandbox::tests::test_timeout_hook ... ok 2869: test scripting::tests::test_handle_eval_basic ... ok 2870: test scripting::tests::test_handle_evalsha_after_eval ... ok 2871: test scripting::tests::test_handle_evalsha_noscript ... ok 2872: test scripting::tests::test_handle_script_subcommand_exists ... ok 2873: test scripting::tests::test_handle_script_subcommand_flush ... ok 2874: test scripting::tests::test_handle_script_subcommand_load ... ok 2875: test scripting::tests::test_parse_eval_args_basic ... ok 2876: test scripting::tests::test_parse_eval_args_numkeys_exceeds_args ... ok 2877: test scripting::tests::test_parse_eval_args_too_few_args ... 
ok 2878: test scripting::tests::test_parse_eval_args_with_keys_and_argv ... ok 2879: test scripting::tests::test_run_script_keys_argv ... ok 2880: test scripting::tests::test_run_script_redis_pcall_catches_error ... ok 2881: test scripting::tests::test_run_script_simple ... ok 2882: test scripting::tests::test_run_script_type_conversions ... ok 2883: test scripting::tests::test_run_script_with_redis_call ... ok 2884: test scripting::tests::test_setup_lua_vm ... ok 2885: test scripting::types::tests::test_frame_array_to_lua ... ok 2886: test scripting::types::tests::test_frame_boolean_to_lua ... ok 2887: test scripting::types::tests::test_frame_bulk_string_to_lua ... ok 2888: test scripting::types::tests::test_frame_double_to_lua ... ok 2889: test scripting::types::tests::test_frame_error_to_lua ... ok 2890: test scripting::types::tests::test_frame_integer_to_lua ... ok ... 2947: test server::response_slot::tests::test_reset_clears_filled_data ... ok 2948: test server::response_slot::tests::test_slot_reuse_fill_take_cycle ... ok 2949: test shard::affinity::tests::key_hint_stored_even_without_pubsub ... ok 2950: test shard::affinity::tests::key_hint_wins_over_pubsub_hint ... ok 2951: test shard::affinity::tests::legacy_register_and_remove_still_work ... ok 2952: test shard::affinity::tests::missing_lookup_is_none ... ok 2953: test shard::affinity::tests::overwrite_pubsub_affinity ... ok 2954: test shard::affinity::tests::pubsub_register_and_lookup ... ok 2955: test shard::affinity::tests::remove_pubsub_drops_entry_when_no_key_hint ... ok 2956: test shard::affinity::tests::remove_pubsub_preserves_key_hint ... ok 2957: test shard::autovacuum::tests::test_aimd_hold_when_p95_in_band ... ok 2958: test shard::autovacuum::tests::test_aimd_initial_budget_midpoint ... ok 2959: test shard::autovacuum::tests::test_latency_window_p95_100_samples ... ok 2960: test shard::autovacuum::tests::test_latency_window_p95_empty ... 
ok 2961: test shard::autovacuum::tests::test_latency_window_p95_single ... ok 2962: test shard::coordinator::tests::test_aggregate_doc_freq_error_frame_skipped ... ok 2963: test shard::coordinator::tests::test_aggregate_doc_freq_missing_term_on_one_shard ... ok ... 2983: test shard::disk_monitor::tests::test_poll_real_path_smoke ... ok 2984: test shard::disk_monitor::tests::test_zero_total_does_not_panic ... ok 2985: test shard::dispatch::tests::print_shard_message_sizes ... ok 2986: test shard::dispatch::tests::test_extract_hash_tag_basic ... ok 2987: test shard::dispatch::tests::test_extract_hash_tag_empty ... ok 2988: test shard::dispatch::tests::test_extract_hash_tag_none ... ok 2989: test shard::dispatch::tests::test_hash_tag_co_location ... ok 2990: test shard::dispatch::tests::test_key_to_shard_deterministic ... ok 2991: test shard::dispatch::tests::test_key_to_shard_distribution ... ok 2992: test shard::dispatch::tests::test_key_to_shard_single_shard ... ok 2993: test shard::dispatch::tests::test_pubsub_slot_already_ready ... ok 2994: test shard::dispatch::tests::test_pubsub_slot_multiple_shards ... ok 2995: test shard::dispatch::tests::test_pubsub_slot_waker ... ok 2996: test shard::dispatch::tests::test_shard_message_size_bounded ... ok 2997: test shard::maintenance_schedule::tests::test_cron_exact_hour ... ok 2998: test shard::maintenance_sc...

TinDang97 added 2 commits May 26, 2026 22:19

coderabbitai Bot reviewed May 26, 2026

pilotspacex-byte changed the title ~~fix(persistence): refuse multi-shard AOF at startup + gate BGREWRITEAOF (P0-FIX-01a/b)~~ fix(persistence): multi-shard AOF gate + per-shard AOF foundation (Option B step 1) May 27, 2026

coderabbitai Bot reviewed May 27, 2026

Comment thread src/persistence/aof_manifest.rs

Comment thread src/persistence/aof_manifest.rs Outdated

TinDang97 added 6 commits May 27, 2026 10:54

coderabbitai Bot reviewed May 27, 2026

TinDang97 added 6 commits May 27, 2026 13:45

coderabbitai Bot reviewed May 27, 2026

TinDang97 added 3 commits May 27, 2026 16:01

coderabbitai Bot reviewed Jun 1, 2026

TinDang97 added 4 commits June 1, 2026 15:28

		let is_write = if ctx.aof_pool.is_some() || conn.tracking_state.enabled { metadata::is_write(cmd) } else { false };
		let aof_bytes = if is_write && ctx.aof_pool.is_some() { Some(aof::serialize_command(&frame)) } else { None };

		let lsn = u64::from_le_bytes(data[offset..offset + 8].try_into().expect("8 bytes"));
		let len = u32::from_le_bytes(data[offset + 8..offset + 12].try_into().expect("4 bytes"))
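
The two reads above imply a per-entry header: an 8-byte little-endian LSN at `offset`, then a 4-byte little-endian length at `offset + 8`. A minimal sketch of a decoder for that layout follows. It assumes the payload sits immediately after the 12-byte header and that a truncated header or payload at the tail is treated as a crash EOF rather than an error, as the `replay_incr_framed_truncated_*_is_crash_eof` test names suggest. `FramedEntry`, `HEADER_LEN`, `encode_frame`, and `decode_frames` are hypothetical names for illustration, not the PR's actual API.

```rust
/// One decoded entry: the log sequence number and its raw payload bytes.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, PartialEq)]
pub struct FramedEntry {
    pub lsn: u64,
    pub payload: Vec<u8>,
}

/// Assumed header size: 8-byte LSN + 4-byte payload length, both little-endian.
const HEADER_LEN: usize = 12;

/// Builds one frame in the assumed layout: [lsn: u64 LE][len: u32 LE][payload].
pub fn encode_frame(lsn: u64, payload: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(HEADER_LEN + payload.len());
    out.extend_from_slice(&lsn.to_le_bytes());
    out.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    out.extend_from_slice(payload);
    out
}

/// Decodes complete frames from `data`.
/// Returns the decoded entries and the offset of the first byte that was
/// not consumed. A partial header or partial payload at the tail stops
/// decoding without an error (assumed crash-EOF behavior).
pub fn decode_frames(data: &[u8]) -> (Vec<FramedEntry>, usize) {
    let mut entries = Vec::new();
    let mut offset = 0usize;

    // Continue only while a full 12-byte header is available.
    while data.len() - offset >= HEADER_LEN {
        let mut lsn_bytes = [0u8; 8];
        lsn_bytes.copy_from_slice(&data[offset..offset + 8]);
        let mut len_bytes = [0u8; 4];
        len_bytes.copy_from_slice(&data[offset + 8..offset + 12]);

        let lsn = u64::from_le_bytes(lsn_bytes);
        let len = u32::from_le_bytes(len_bytes) as usize;

        let start = offset + HEADER_LEN;
        // Truncated payload: the writer was interrupted mid-entry.
        if data.len() - start < len {
            break;
        }

        entries.push(FramedEntry {
            lsn,
            payload: data[start..start + len].to_vec(),
        });
        offset = start + len;
    }

    (entries, offset)
}
```

Returning the stop offset alongside the entries lets a caller tell a clean end of file (offset equals the buffer length) from a torn tail, and truncate the file at that offset if it chooses to.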

Conversation

pilotspacex-byte commented May 26, 2026 • edited by coderabbitai Bot

qodo-code-review Bot commented May 26, 2026

Qodo reviews are paused for this user.

coderabbitai Bot commented May 26, 2026 • edited

Reviews paused

Walkthrough

Changes

coderabbitai Bot left a comment

coderabbitai Bot May 26, 2026

coderabbitai Bot May 26, 2026

coderabbitai Bot left a comment

coderabbitai Bot left a comment

coderabbitai Bot May 27, 2026

coderabbitai Bot left a comment

coderabbitai Bot May 27, 2026

coderabbitai Bot left a comment

coderabbitai Bot Jun 1, 2026

coderabbitai Bot Jun 1, 2026

coderabbitai Bot Jun 1, 2026

coderabbitai Bot Jun 1, 2026

coderabbitai Bot Jun 1, 2026

coderabbitai Bot Jun 1, 2026

qodo-code-review Bot commented Jun 1, 2026

CI Feedback 🧐
