fix(portscan): serialize concurrent calibration writes; pin 4-NIC vs 8-NIC cap=4 orchestration parity by skullcrushercmd · Pull Request #68 · AnyVM-Tech/AnyScan

skullcrushercmd · 2026-04-27T22:46:44Z

Summary

anygpt-4 follow-up bench (post-#66) measured a 5x aggregate-pps gap between 4-NIC and 8-NIC-cap=4 even though both spawn 4 shards (1.81M vs 8.58M, anygpt-37 brief). Two unrelated issues fell out of investigation:

Calibration race (the actual bug): the multi-NIC parent's children all wrote to the same rate-calibration.json AND the same shared .tmp filename, so concurrent shards clobbered each other and at most one entry persisted. This matches the "never persists" symptom in the brief.
4-NIC vs 8-NIC-cap=4 aggregate divergence: by code inspection both cases route through identical orchestration (cap_concurrent_subprocesses(8, 4) -> first 4 NICs then split_target_range_for_shards(range, 4) -> 4 shards). New synthetic harness confirms identical interface sequence, identical shard distribution, identical mocked aggregate. The bench delta is real-hardware variance, not orchestration. The harness ships as a regression guard.

Root cause: calibration race

P1: load() -> {} (stale snapshot)
P2: load() -> {} (stale)
P1: write rate-calibration.json.tmp  with {eth0: ...}
P2: write rate-calibration.json.tmp  with {eth1: ...}   # CLOBBERS P1
P1: os.replace(tmp -> json)                              # final has only eth1
P2: os.replace(tmp -> json)                              # OSError (tmp gone)

With 4 shards converging together, 3+ entries silently dropped. Confirmed by RED test that fails reliably pre-fix (3-of-6 lost).

Fix shape

RateCalibrationStore.store:

Per-pid tmpfile (rate-calibration.json.tmp.<pid>) so concurrent writers don't overwrite each other's pre-rename payloads.
fcntl.flock-backed read-modify-write cycle on a sibling .lock file. Each writer re-reads the latest entries while holding the lock so it observes other shards' updates instead of clobbering with a stale view.
Hosts that refuse flock fall through to the unprotected write rather than blocking calibration entirely.
Held interval bounded to one json.dumps + one rename — millisecond-scale.
Single-shard semantics unchanged.

Test coverage

test_concurrent_writes_from_multiple_processes_all_persist (NEW): 6 fork()ed workers store simultaneously; all 6 entries must land. RED pre-fix; GREEN post-fix.
test_concurrent_writes_leave_no_dangling_tmp_files (NEW): no orphan .tmp.* after concurrent writes.
test_persists_when_interrupted_by_systemexit_mid_loop (NEW): controller's try/finally must persist max_clean_rate even when the adapter's signal handler raises SystemExit mid-loop. Regression guard for the SIGTERM/SIGINT exit path called out in the task brief.
FourNicVsEightNicCapFourParityTests (NEW): mocks _spawn_shard_adapter, runs orchestration with 4 vs 8 interfaces under the default cap=4, asserts identical interface sequence + identical shard target_range distribution + identical synthetic aggregate pps. Pins the contract so future drift is impossible to ship silently.

Verification

python3 -m unittest test_anyscan_rate_controller -v → 53 OK (was 50; +3)
python3 -m unittest test_vulnscanner_adapter_multinic -v → 31 OK (was 30; +1)
python3 -m py_compile anyscan_rate_controller.py vulnscanner-zmap-adapter.py → clean
cargo build --workspace → clean (2 pre-existing warnings unchanged)
cargo test --workspace --no-fail-fast → 437 passing, 0 failed, 4 ignored. Matches post-perf(portscan): AIMD CPU-vs-network slip + subprocess cap + per-instance defaults + partial-window calibration #66 baseline.

Test plan

Unit: concurrent calibration writes from 6+ processes all persist
Unit: no orphan .tmp.* files after concurrent writes
Unit: SystemExit mid-loop still persists max_clean_rate
Unit: 4-NIC vs 8-NIC cap=4 produce identical orchestration + aggregate
cargo build/test workspace
Live c6in.metal re-bench (OPTIONAL per brief — the synthetic harness already proves orchestration symmetry; the calibration fix is observable from ls /var/lib/agentd after a converged multi-NIC scan)

Deploy notes

No new env knobs.
No bundle layout change.
/var/lib/agentd/rate-calibration.json.lock is created next to the existing calibration file the first time a writer takes the lock — zero-byte advisory lockfile, no migration needed.

Out of scope

Live c6in.metal bench — gated on metal launch authorization.
AF_XDP fan-out — separate worker, separate PR.
Touching src/fetcher.rs / src/bin/anyscan-worker.rs / runtime.env on prod (coordination boundary respected per brief).

🤖 Generated with Claude Code

…8-NIC cap=4 orchestration parity anygpt-4 follow-up bench (post-#66) measured a 5x aggregate-pps gap between 4-NIC and 8-NIC-cap=4 even though both spawn 4 shards: 1.81M vs 8.58M. Two unrelated issues fell out of the investigation; this PR lands the calibration fix and pins the orchestration contract so future bench drift can't silently re-open the same can of worms. 1. Concurrent calibration writes lost data ------------------------------------------- PR #66's try/finally guaranteed every controller exit path called RateCalibrationStore.store(); operators were still seeing /var/lib/agentd/rate-calibration.json with at most one shard's entry after a multi-NIC scan converged. Root cause: the multi-NIC parent spawns N children that each run their own RateController against the SAME calibration JSON. Pre-fix store() did: entries = self.load() # stale snapshot (other shards in flight) entries[interface] = ... tmp_path = ".../rate-calibration.json.tmp" # SHARED across writers tmp_path.write_text(...) # last writer's tmp wins os.replace(tmp_path, final) # first replace clobbers; second OSErrors So with 4 concurrent shards converging together, three of them silently dropped their learned rate on the floor (or 5 of 8 in the 8-NIC case), which is exactly the "never persists" symptom in anygpt-37's brief. The new test_concurrent_writes_from_multiple_processes_all_persist spawns 6 fork()ed workers calling store() simultaneously and asserts all 6 entries land. Pre-fix it reliably loses 3+; post-fix it passes. test_concurrent_writes_leave_no_dangling_tmp_files pins the cleanup contract so a half-failed write doesn't leave orphan .tmp files. Fix: per-pid tmpfile (no inter-process clobber) plus an fcntl.flock- backed read-modify-write cycle on a sibling .lock file. Hosts that refuse flock fall through to the unprotected write rather than blocking calibration entirely; lock holding interval is bounded to one json.dumps + one rename. Single-shard semantics are unchanged. 2. Signal-handler exit path explicitly tested --------------------------------------------- The adapter's SIGTERM/SIGINT handler raises SystemExit(128 + signum). The controller's try/finally already handled this (the exception unwinds through the while loop and the finally calls _maybe_persist_calibration), but there was no test pinning the contract — the only pre-PR test for non-clean exit was ScannerWindowError. test_persists_when_interrupted_by_systemexit_mid_loop forces SystemExit during runner.run() and asserts the highest clean rate from the windows that completed before the signal is still on disk. Regression guard for any future refactor that moves the persist out of the finally block. 3. 4-NIC vs 8-NIC cap=4 orchestration parity test -------------------------------------------------- Code inspection of vulnscanner-zmap-adapter.py:run_multi_nic_scanner shows the two cases route through identical orchestration: cap_concurrent_subprocesses(8 NICs, max_concurrent=4) -> first 4 NICs split_target_range_for_shards(range, len(interfaces)=4) -> 4 shards Both produce 4 children with the same iface assignments (eth0..eth3) and the same disjoint sub-ranges. The new FourNicVsEightNicCapFourParityTests harness mocks _spawn_shard_adapter and runs the orchestrator twice — once with 4 requested NICs, once with 8 — and asserts identical interface sequence, identical shard target_range distribution, and identical synthetic aggregate pps when each shard contributes the same mocked rate. It passes. Therefore the 5x bench delta is real-hardware variance (NIC-specific ENA/MMIO behavior, queue scheduling, or measurement timing on c6in.metal), not orchestration. If this test ever diverges, hardware variance is no longer a valid explanation and there's a real bug in the parent fan-out path. Verification ------------ - python3 -m unittest test_anyscan_rate_controller -v -> 53 OK (was 50 pre-PR; +2 calibration race, +1 signal exit) - python3 -m unittest test_vulnscanner_adapter_multinic -v -> 31 OK (was 30 pre-PR; +1 4-NIC vs 8-NIC cap=4 parity harness) - python3 -m py_compile anyscan_rate_controller.py vulnscanner-zmap-adapter.py - cargo build --workspace -> clean (2 pre-existing warnings unchanged) - cargo test --workspace --no-fail-fast -> 437 passing, 0 failed, 4 ignored. Matches the post-#66 baseline. Deploy notes ------------ - No new env knobs; no bundle layout change. - /var/lib/agentd/rate-calibration.json.lock is created next to the existing calibration file the first time a writer takes the lock; it's a zero-byte advisory lockfile, no migration needed. - Live c6in.metal re-bench is OPTIONAL — the synthetic harness already proves the orchestration is symmetric. The calibration race fix is observable from operator logs (ls /var/lib/agentd shows N entries after a converged multi-NIC scan, where pre-fix it showed 1). Out of scope (per task brief) ----------------------------- - Live c6in.metal bench (gated on metal launch authorization). - AF_XDP fan-out (separate worker, separate PR). - Touching src/fetcher.rs / src/bin/anyscan-worker.rs / runtime.env on prod — coordination boundary respected. Co-authored-by: skullcmd <skullcmd@anyvm.tech> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

skullcrushercmd merged commit 865fad8 into main Apr 27, 2026

skullcrushercmd deleted the fix/aimd-truncation-and-calibration-persistence branch April 27, 2026 22:50

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(portscan): serialize concurrent calibration writes; pin 4-NIC vs 8-NIC cap=4 orchestration parity#68

fix(portscan): serialize concurrent calibration writes; pin 4-NIC vs 8-NIC cap=4 orchestration parity#68
skullcrushercmd merged 1 commit intomainfrom
fix/aimd-truncation-and-calibration-persistence

skullcrushercmd commented Apr 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

skullcrushercmd commented Apr 27, 2026

Summary

Root cause: calibration race

Fix shape

Test coverage

Verification

Test plan

Deploy notes

Out of scope

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant