inline-checksum: optional per-cluster CRC validation (TD.100226.1) by schmidt-scaled · Pull Request #1016 · simplyblock/sbcli

schmidt-scaled · 2026-04-30T17:31:06Z

Summary

Control-plane support for the inline silent-data-error protection feature whose data-plane work landed on ultra branch checksum-validation. Optional, frozen at cluster-create time, no upgrade path for existing clusters.
Adds Cluster.inline_checksum, NVMeDevice.md_size / md_supported, bdev_alceml_create checksum-method kwargs, and per-cluster CLI flags on cluster create, cluster add, and sn configure.
Auto-detects per-device NVMe metadata support at add-node and on restart from SPDK's bdev_get_bdevs md_size field (spdk_bdev_get_md_size), so a cluster can mix md-on-device and fallback drives. Capacity overhead from the fallback layout (6 / 510 blocks per 2 MiB extent ≈ 1.17%) is charged as initial provisioned utilization rather than reducing reported raw capacity.

Why this branch and not main

A few inline-checksum lines previously leaked into main via the unrelated commit aa81fafe (iteration-77 hang fix) on 2026-04-30. Those four hunks were removed in 5ff3e214 so main is clean again. This branch carries the full feature in one self-contained commit on top of that clean main.

Design ref

TD.100226.1 (Inline Protection from Silent Data Errors).

Compatibility

Cluster flag defaults to False; existing clusters and code paths unchanged.
Data plane reads md_size directly from the bound bdev, so the control plane only flips checksum_validation_method 1/2/0 — matching the checksum-validation branch RPC schema (checksum_validation_method, cache_size, cache_eviction_threshold).

Test plan

27 new unit tests in tests/test_inline_checksum.py (model defaults, find_md_lbaf_id, alceml_checksum_params, fallback-overhead math, RPC param wire-up for each method, addNvmeDevices md detection from a mocked bdev payload). All pass.
Full unit-test suite (870 tests) stayed green throughout development on a separate worktree.
Lab dual-fault test on an FT=2 cluster with --enable-inline-checksum, mixing one drive formatted with an md-capable LBAF (cv_md_method) and one without (cv_fallback_method); verify capacity reporting reflects the ~1.17% pre-charge only on the fallback drive.
Negative test: simulate single-bit corruption on a passtest bdev under an inline-checksum cluster and confirm read returns IO error rather than silent corruption.
Confirm existing clusters (no inline_checksum field in DB) read back as False and behave exactly as today.

Open follow-ups (not blocking this PR)

The add_cluster / cluster add controller has a pre-existing positional-arg shift around fabric (unrelated to this work). New kwarg added keyword-only so it isn't affected; bug should be fixed separately.
Operator-facing capacity reporting (sb cluster get-capacity) is fed from Prometheus and reflects what alceml itself reports; the per-device pre-charge added here is in the lvol-create provisioning gate. If we want it visible in cluster get-capacity too, that's a small extension to the stats collector.

🤖 Generated with Claude Code

+        for attempt in range(1, 16):
+            try:
+                client = paramiko.SSHClient()
+                client.set_missing_host_key_policy(paramiko.AutoAddPolicy())


Add --verify=md5 --do_verify=1 --verify_fatal=1 to the fio job so data corruption is caught and fatal. Soften the "unrelated pair" category from a hard TestRunError to a warning plus fallback to primary_tertiary then primary_secondary, since a dense FT=2 ring with N <= 4 has no unrelated pair and should still run the scenario. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Pin BRANCH to inline-checksum-validation. Make ssh_exec accept a timeout (default 600s) and use 1800s for the parallel docker pull so large image pulls don't time out. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds the sbcli control-plane support for the inline silent-data-error protection feature whose data-plane work landed on ultra branch checksum-validation. Frozen at cluster-create time; no upgrade path for existing clusters. Per-device alceml mode (md-on-device vs fallback) is auto-detected from the bound SPDK bdev's md_size at add-node and on restart, so a cluster can have a heterogeneous mix of drives. Cluster + per-device flags - Cluster.inline_checksum (bool, default False): persisted only via create_cluster / add_cluster; no mutator. - NVMeDevice.md_size (int) and NVMeDevice.md_supported (bool): set in addNvmeDevices from bdev_get_bdevs' top-level md_size field (spdk_bdev_get_md_size; emitted by lib/bdev/bdev_rpc.c). Refreshed on every restart-device path so a between-restart `nvme format` is reflected. alceml RPC plumbing (matches checksum-validation branch) - bdev_alceml_create gains optional checksum_method (1=md-on-device, 2=fallback, default 0=off), cache_size, cache_eviction_threshold; zero-defaults are not sent so the data plane keeps its built-ins (cv_default_cache_size=2000, threshold 90%). - utils.alceml_checksum_params(cluster, dev) picks 0/1/2 from the cluster flag plus md_supported. Data plane refuses method=1 on md_size==0, so the per-device decision is mandatory. - Both alceml call sites (storage_node_ops._create_storage_device_stack and device_controller restart-device) thread the params through and warn when a device falls back. CLI - cluster create / cluster add / sn configure each gain --enable-inline-checksum. - sn configure runs `find_md_lbaf_id` against `nvme id-ns` JSON to pick the smallest qualifying LBAF (ds=12, ms>=8) and forces a reformat past the existing 4K-already-formatted early-out (SectorSize stays 4096 across an md/no-md transition). Capacity accounting - alceml_fallback_overhead_bytes(cluster, size): 6 lost data blocks per 2 MiB extent (510 -> 504, ~1.171875%) when the device runs in cv_fallback_method. - lvol_controller charges that as initial provisioned utilization rather than reducing reported raw cluster_size_total, so the overhead surfaces through the existing prov_cap_warn / prov_cap_crit thresholds and md-on-device drives contribute zero. Tests - 27 new unit tests in tests/test_inline_checksum.py: model defaults, find_md_lbaf_id corner cases, alceml_checksum_params combos, fallback-overhead math (including 6/512 ratio sanity), RPC param wire-up for each method, and addNvmeDevices md detection from a mocked bdev_get_bdevs payload. Behaviour notes - Default-off cluster flag means no behaviour change for existing clusters; full unit-test suite (870 passing) stayed green during development on a separate worktree. - Data plane reads md_size itself via spdk_bdev_get_md_size, so the control plane never has to pass it. Cv_md_method does not auto fall back: control plane must pick correctly per-device, hence the add-node-time detection plumbing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- cli-reference.yaml: add --enable-inline-checksum under storage-node configure, cluster create and cluster deploy. The flag was hand-edited into cli.py but missing from the YAML source-of-truth, so the generator kept dropping it. - storage_node_ops.py: annotate attached as set[str] for mypy. - test_subsystem_add_ns_idempotent.py: drop unused MagicMock import. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…gure The lab perf bring-up now opts the cluster into inline CRC checksum validation for silent-data-error protection. Two flags, paired: * `cluster create --enable-inline-checksum`: sets the cluster-level flag at create time (frozen — no mutator, no upgrade path for existing clusters). * `sn configure --enable-inline-checksum`: picks an LBAF with ds=12, ms>=8 (8B+ NVMe metadata per 4K block) and force-reformats through the existing 4K-already-set early-out, so alceml can run in md-on-device mode (checksum_method=1). Devices with no md-capable LBAF fall through to plain 4K + cv_fallback (checksum_method=2, ~1.17% capacity overhead) — auto-detected per-device at sn add-node and refreshed on every restart-device path. Also preserves the existing branch pin (BRANCH=inline-checksum- validation) and the corresponding pip install URL so the lab pulls the matching control-plane and data-plane builds. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Restore the checksum-validation docker/ultra images that a prior rebase onto origin/main clobbered back to the :main defaults. A local merge=keepdev driver now keeps this branch's env_var across future rebases. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Adds a frozen-at-create cluster flag for devices with a <4K logical block size that still guarantee 4K write atomicity (e.g. AWS NVMe, 512B logical but 4K-atomic). When set, bdev_alceml_create sends cv_ignore_block_size=True so the data plane skips its >=4K block-size gate, allowing fallback-mode inline checksum on such devices. - cluster.atomic_4k model field, persisted to FDB, set in create/add - rpc_client.bdev_alceml_create sends cv_ignore_block_size (True/False) within the checksum-enabled branch - storage node device stack passes cluster.atomic_4k at add-node time - CLI --4k_atomic on cluster create/add (yaml + regenerated cli.py) - setup_perf_test1.py enables --4k_atomic for the AWS perf cluster Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

… teardown Add the AWS dual-node failure-domain outage soak. Its per-volume churn teardown (stop fio -> umount -> nvme disconnect -> delete) could emit spurious "no available path / XFS log I/O error / filesystem shut down" noise in client dmesg even though the lvol was being deleted on purpose. Root cause: _unmount_one_volume fell back to lazy umount (umount -l) and swallowed the result with `|| true`. fio uses --direct=1 (no bulk dirty data), but the XFS journal and inode metadata are still buffered and flushed at umount; a lazy detach returns instantly with that flush pending, and the nvme disconnect ~1s later rips the paths away mid-flush. Fix: retry a *clean* umount (sync + umount, evicting straggler holders with fuser between tries) and verify with mountpoint; never lazy-umount on the happy path. If the mount still will not release after retries -- the likely sign of a stuck I/O on the target (client runs with nvme_core.io_timeout effectively infinite) -- log a loud WARNING with holder / D-state evidence before a last-resort lazy detach, so the soak surfaces the potential target-side bug instead of hiding it. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

        argument = subcommand.add_argument('--size-range', help='NVMe SSD device size range separated by -, can be X(m,g,t) or bytes as integer, example: --size-range 50G-1T or --size-range 1232345-67823987, --device-model and --size-range must be set together.', type=str, default='', dest='size_range', required=False)
        argument = subcommand.add_argument('--nvme-names', help='Comma separated list of nvme namespace names like nvme0n1,nvme1n1.', type=str, default='', dest='nvme_names', required=False)
        argument = subcommand.add_argument('--force', help='Force format detected or passed nvme pci address to 4K and clean partitions.', dest='force', action='store_true')
+        argument = subcommand.add_argument('--enable-inline-checksum', help='When formatting (with --force), prefer an LBAF that supports >=8 bytes of NVMe metadata per block, so alceml can run inline checksum validation in md-on-device mode. Drives with no md-capable LBAF still format to plain 4K and will use the fallback layout.', dest='inline_checksum', action='store_true')


        if self.developer_mode:
            argument = subcommand.add_argument('--disable-monitoring', help='Disable monitoring stack, false by default. Default: `false`.', dest='disable_monitoring', action='store_true')
        argument = subcommand.add_argument('--strict-node-anti-affinity', help='Enable strict node anti affinity for storage nodes. Never more than one chunk is placed on a node. This requires a minimum of _data-chunks-in-stripe + parity-chunks-in-stripe + 1_ nodes in the cluster.', dest='strict_node_anti_affinity', action='store_true')
+        argument = subcommand.add_argument('--enable-inline-checksum', help='Enable inline CRC checksum validation on every IO for silent-data-error protection. Cannot be enabled or disabled after cluster creation. Per-device alceml mode (md-on-device vs fallback) is auto-detected at add-node.', dest='inline_checksum', action='store_true')


            argument = subcommand.add_argument('--disable-monitoring', help='Disable monitoring stack, false by default. Default: `false`.', dest='disable_monitoring', action='store_true')
        argument = subcommand.add_argument('--strict-node-anti-affinity', help='Enable strict node anti affinity for storage nodes. Never more than one chunk is placed on a node. This requires a minimum of _data-chunks-in-stripe + parity-chunks-in-stripe + 1_ nodes in the cluster.', dest='strict_node_anti_affinity', action='store_true')
+        argument = subcommand.add_argument('--enable-inline-checksum', help='Enable inline CRC checksum validation on every IO for silent-data-error protection. Cannot be enabled or disabled after cluster creation. Per-device alceml mode (md-on-device vs fallback) is auto-detected at add-node.', dest='inline_checksum', action='store_true')
+        argument = subcommand.add_argument('--4k_atomic', help='Declare that devices guarantee 4K write atomicity even with a <4K logical block size (e.g. AWS NVMe is 512B but atomic at 4K). Allows fallback-mode inline checksum on such devices by skipping the data-plane 4K block-size requirement. Only meaningful with --enable-inline-checksum. Cannot be changed after cluster creation.', dest='atomic_4k', action='store_true')


        if self.developer_mode:
            argument = subcommand.add_argument('--inflight-io-threshold', help='The number of inflight IOs allowed before the IO queuing starts. Default: `4`.', type=int, default=4, dest='inflight_io_threshold')
        argument = subcommand.add_argument('--strict-node-anti-affinity', help='Enable strict node anti affinity for storage nodes. Never more than one chunk is placed on a node. This requires a minimum of _data-chunks-in-stripe + parity-chunks-in-stripe + 1_ nodes in the cluster."', dest='strict_node_anti_affinity', action='store_true')
+        argument = subcommand.add_argument('--enable-inline-checksum', help='Enable inline CRC checksum validation on every IO for silent-data-error protection. Cannot be enabled or disabled after cluster creation.', dest='inline_checksum', action='store_true')


            argument = subcommand.add_argument('--inflight-io-threshold', help='The number of inflight IOs allowed before the IO queuing starts. Default: `4`.', type=int, default=4, dest='inflight_io_threshold')
        argument = subcommand.add_argument('--strict-node-anti-affinity', help='Enable strict node anti affinity for storage nodes. Never more than one chunk is placed on a node. This requires a minimum of _data-chunks-in-stripe + parity-chunks-in-stripe + 1_ nodes in the cluster."', dest='strict_node_anti_affinity', action='store_true')
+        argument = subcommand.add_argument('--enable-inline-checksum', help='Enable inline CRC checksum validation on every IO for silent-data-error protection. Cannot be enabled or disabled after cluster creation.', dest='inline_checksum', action='store_true')
+        argument = subcommand.add_argument('--4k_atomic', help='Declare that devices guarantee 4K write atomicity even with a <4K logical block size (e.g. AWS NVMe is 512B but atomic at 4K). Allows fallback-mode inline checksum on such devices by skipping the data-plane 4K block-size requirement. Only meaningful with --enable-inline-checksum. Cannot be changed after cluster creation.', dest='atomic_4k', action='store_true')


…ardown The per-volume churn teardown (stop fio -> umount -> nvme disconnect -> delete) could emit spurious "no available path / XFS log I/O error / filesystem shut down" noise in client dmesg even though the lvol was being deleted on purpose. Root cause: _unmount_one_volume fell back to lazy umount (umount -l) and swallowed the result with `|| true`. fio uses --direct=1 (no bulk dirty data), but the XFS journal and inode metadata are still buffered and flushed at umount; a lazy detach returns instantly with that flush pending, and the nvme disconnect ~1s later rips the paths away mid-flush. Fix: retry a *clean* umount (sync + umount, evicting straggler holders with fuser between tries) and verify with mountpoint; never lazy-umount on the happy path. If the mount still will not release after retries -- the likely sign of a stuck I/O on the target (client runs with nvme_core.io_timeout effectively infinite) -- log a loud WARNING with holder / D-state evidence before a last-resort lazy detach, so the soak surfaces the potential target-side bug instead of hiding it. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Restarting one node can briefly knock a peer 'down' (NVMe reconnect / CP re-sync). That is not, on its own, an FTT-budget breach, but wait_for_all_online aborted the whole soak on the first poll that saw it. Treat 'down' like 'online' for both the unaffected-node failure check and the all-online return gate — if a down peer actually breaks I/O, fio surfaces the error anyway. Outage-detection paths (waiting for a node to LEAVE online) are untouched. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

QPAIR_COUNT feeds max_io_qpairs_per_ctrlr on nvmf_create_transport (the target transport). Raise it to allow more queue pairs per controller. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Remove the `sync` from the umount retry loop. A global sync blocks on every dirty mount on the client (other soak volumes mid-fio) and can itself hang behind the stuck target I/O we are trying to work around; umount already flushes this filesystem's own metadata. Key per-volume mount point (and fio log/stderr/rc paths) on the unique volume_name instead of vol<index>. Churn reuses the old volume's index, so an index-based mount point remounted the new volume onto the old directory, which may still be pinned by a lazily-detached, stuck-I/O umount. A fresh mount point per volume avoids that. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…block window recreate_lvstore / recreate_lvstore_on_non_leader build the raid in the PRE_BLOCK phase (step 1) but never examine it there. The step-5 examine logic then hits `raid_already and not lvstore_already` — which was written for the activation-retry convergence trap (raid already examined on a prior pass, SPDK rejects re-examine) — and drops + recreates the raid to force a clean examine. On a normal restart that condition is ALSO true (raid just built, examine not yet run), so the just-built raid is needlessly torn down and rebuilt INSIDE the minimized port-block window. Observed as a duplicate bdev_raid_create (e.g. raid0_5199, 2026-06-12) and 90 "dropping raid for clean re-examine" hits in a single restart log. Probe whether the raid existed BEFORE step 1 (raid_preexisted). superblock is False, so on a real restart the raid cannot persist and the probe is False -> just examine. Only the genuine activation-retry case (raid pre-existed and was already examined) takes the drop+recreate path. bdev_examine stays inside the block window (the restarting node reads blobstore metadata the demoted leader was still mutating); only the redundant raid teardown/rebuild is removed. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

github-code-quality Bot found potential problems Apr 30, 2026

View reviewed changes

Comment thread simplyblock_cli/cli.py Fixed

Comment thread simplyblock_cli/cli.py Fixed

Comment thread simplyblock_cli/cli.py Fixed

Comment thread simplyblock_core/utils/__init__.py Fixed

schmidt-scaled force-pushed the inline-checksum-validation branch from f7823d3 to 9273148 Compare May 5, 2026 09:16

github-code-quality Bot found potential problems May 5, 2026

View reviewed changes

Comment thread scripts/setup_lab_perf_test1.py Fixed

Comment thread simplyblock_core/utils/__init__.py Fixed

Comment thread scripts/setup_lab_perf_test1.py Fixed

Comment thread scripts/setup_lab_perf_test1.py Fixed

github-advanced-security AI found potential problems May 5, 2026

View reviewed changes

Comment thread scripts/lab_dual_node_outage_soak_mixed_churn.py

for attempt in range(1, 16):

try:

client = paramiko.SSHClient()

client.set_missing_host_key_policy(paramiko.AutoAddPolicy())

github-code-quality Bot found potential problems May 5, 2026

View reviewed changes

Comment thread scripts/setup_lab_perf_test1.py Fixed

Comment thread scripts/lab_dual_node_outage_soak_mixed_churn.py Fixed

Comment thread scripts/lab_dual_node_outage_soak_mixed_churn.py Fixed

schmidt-scaled force-pushed the inline-checksum-validation branch from 4e3d8a4 to ca9fa2f Compare May 5, 2026 12:06

github-code-quality Bot found potential problems May 5, 2026

View reviewed changes

Comment thread scripts/lab_dual_node_outage_soak_mixed_churn.py Fixed

Comment thread scripts/lab_dual_node_outage_soak_mixed_churn.py Fixed

github-code-quality Bot found potential problems May 6, 2026

View reviewed changes

Comment thread simplyblock_core/storage_node_ops.py Fixed

github-code-quality Bot found potential problems May 6, 2026

View reviewed changes

Comment thread simplyblock_core/storage_node_ops.py Fixed

github-code-quality Bot found potential problems May 6, 2026

View reviewed changes

Comment thread scripts/lab_dual_node_outage_soak_mixed_churn.py Fixed

Comment thread scripts/lab_dual_node_outage_soak_mixed_churn.py Fixed

schmidt-scaled force-pushed the inline-checksum-validation branch from 9ba3cbc to 9bdcc9f Compare May 10, 2026 10:19

github-code-quality Bot found potential problems May 10, 2026

View reviewed changes

Comment thread simplyblock_core/utils/__init__.py Fixed

schmidt-scaled force-pushed the inline-checksum-validation branch from 48245dc to 4748981 Compare May 17, 2026 16:46

github-code-quality Bot found potential problems May 17, 2026

View reviewed changes

Comment thread simplyblock_core/utils/__init__.py Fixed

Comment thread scripts/lab_dual_node_outage_soak_mixed_churn.py Fixed

github-code-quality Bot found potential problems May 21, 2026

View reviewed changes

Comment thread simplyblock_cli/cli.py Fixed

Comment thread simplyblock_core/utils/__init__.py Fixed

schmidt-scaled force-pushed the inline-checksum-validation branch from 2a899b4 to ef60683 Compare May 26, 2026 21:30

github-code-quality Bot found potential problems May 26, 2026

View reviewed changes

Comment thread simplyblock_cli/cli.py Fixed

Comment thread scripts/lab_dual_node_outage_soak_mixed_churn.py Fixed

schmidt-scaled force-pushed the inline-checksum-validation branch from ef60683 to e2b9c7e Compare May 27, 2026 17:39

github-code-quality Bot found potential problems May 28, 2026

View reviewed changes

Comment thread tests/test_soak_outage_gap.py Fixed

github-code-quality Bot found potential problems May 28, 2026

View reviewed changes

Comment thread simplyblock_cli/cli.py Fixed

Comment thread simplyblock_cli/cli.py Fixed

Comment thread simplyblock_cli/cli.py Fixed

Comment thread simplyblock_core/utils/__init__.py Fixed

github-code-quality Bot found potential problems May 28, 2026

View reviewed changes

Comment thread scripts/stop_lab_after_soak.py Fixed

github-code-quality Bot found potential problems Jun 1, 2026

View reviewed changes

Comment thread simplyblock_cli/cli.py Fixed

Comment thread simplyblock_cli/cli.py Fixed

Comment thread simplyblock_cli/cli.py Fixed

schmidt-scaled force-pushed the inline-checksum-validation branch from a889ee3 to 86cd6ab Compare June 8, 2026 08:18

github-code-quality Bot found potential problems Jun 8, 2026

View reviewed changes

Comment thread simplyblock_cli/cli.py Fixed

Comment thread simplyblock_cli/cli.py Fixed

schmidt-scaled and others added 6 commits June 14, 2026 17:17

schmidt-scaled force-pushed the inline-checksum-validation branch from 4fd2e35 to 59325ea Compare June 14, 2026 14:19

github-code-quality Bot found potential problems Jun 17, 2026

View reviewed changes

schmidt-scaled and others added 5 commits June 17, 2026 15:32

perf: increase target transport queue pairs from 32 to 64

c6a113e

QPAIR_COUNT feeds max_io_qpairs_per_ctrlr on nvmf_create_transport (the target transport). Raise it to allow more queue pairs per controller. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Conversation

schmidt-scaled commented Apr 30, 2026

Summary

Why this branch and not main

Design ref

Compatibility

Test plan

Open follow-ups (not blocking this PR)

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants