fix(portscan): serialize concurrent calibration writes; pin 4-NIC vs 8-NIC cap=4 orchestration parity#68
Merged
skullcrushercmd merged 1 commit intomainfrom Apr 27, 2026
Conversation
…8-NIC cap=4 orchestration parity anygpt-4 follow-up bench (post-#66) measured a 5x aggregate-pps gap between 4-NIC and 8-NIC-cap=4 even though both spawn 4 shards: 1.81M vs 8.58M. Two unrelated issues fell out of the investigation; this PR lands the calibration fix and pins the orchestration contract so future bench drift can't silently re-open the same can of worms. 1. Concurrent calibration writes lost data ------------------------------------------- PR #66's try/finally guaranteed every controller exit path called RateCalibrationStore.store(); operators were still seeing /var/lib/agentd/rate-calibration.json with at most one shard's entry after a multi-NIC scan converged. Root cause: the multi-NIC parent spawns N children that each run their own RateController against the SAME calibration JSON. Pre-fix store() did: entries = self.load() # stale snapshot (other shards in flight) entries[interface] = ... tmp_path = ".../rate-calibration.json.tmp" # SHARED across writers tmp_path.write_text(...) # last writer's tmp wins os.replace(tmp_path, final) # first replace clobbers; second OSErrors So with 4 concurrent shards converging together, three of them silently dropped their learned rate on the floor (or 5 of 8 in the 8-NIC case), which is exactly the "never persists" symptom in anygpt-37's brief. The new test_concurrent_writes_from_multiple_processes_all_persist spawns 6 fork()ed workers calling store() simultaneously and asserts all 6 entries land. Pre-fix it reliably loses 3+; post-fix it passes. test_concurrent_writes_leave_no_dangling_tmp_files pins the cleanup contract so a half-failed write doesn't leave orphan .tmp files. Fix: per-pid tmpfile (no inter-process clobber) plus an fcntl.flock- backed read-modify-write cycle on a sibling .lock file. Hosts that refuse flock fall through to the unprotected write rather than blocking calibration entirely; lock holding interval is bounded to one json.dumps + one rename. Single-shard semantics are unchanged. 2. Signal-handler exit path explicitly tested --------------------------------------------- The adapter's SIGTERM/SIGINT handler raises SystemExit(128 + signum). The controller's try/finally already handled this (the exception unwinds through the while loop and the finally calls _maybe_persist_calibration), but there was no test pinning the contract — the only pre-PR test for non-clean exit was ScannerWindowError. test_persists_when_interrupted_by_systemexit_mid_loop forces SystemExit during runner.run() and asserts the highest clean rate from the windows that completed before the signal is still on disk. Regression guard for any future refactor that moves the persist out of the finally block. 3. 4-NIC vs 8-NIC cap=4 orchestration parity test -------------------------------------------------- Code inspection of vulnscanner-zmap-adapter.py:run_multi_nic_scanner shows the two cases route through identical orchestration: cap_concurrent_subprocesses(8 NICs, max_concurrent=4) -> first 4 NICs split_target_range_for_shards(range, len(interfaces)=4) -> 4 shards Both produce 4 children with the same iface assignments (eth0..eth3) and the same disjoint sub-ranges. The new FourNicVsEightNicCapFourParityTests harness mocks _spawn_shard_adapter and runs the orchestrator twice — once with 4 requested NICs, once with 8 — and asserts identical interface sequence, identical shard target_range distribution, and identical synthetic aggregate pps when each shard contributes the same mocked rate. It passes. Therefore the 5x bench delta is real-hardware variance (NIC-specific ENA/MMIO behavior, queue scheduling, or measurement timing on c6in.metal), not orchestration. If this test ever diverges, hardware variance is no longer a valid explanation and there's a real bug in the parent fan-out path. Verification ------------ - python3 -m unittest test_anyscan_rate_controller -v -> 53 OK (was 50 pre-PR; +2 calibration race, +1 signal exit) - python3 -m unittest test_vulnscanner_adapter_multinic -v -> 31 OK (was 30 pre-PR; +1 4-NIC vs 8-NIC cap=4 parity harness) - python3 -m py_compile anyscan_rate_controller.py vulnscanner-zmap-adapter.py - cargo build --workspace -> clean (2 pre-existing warnings unchanged) - cargo test --workspace --no-fail-fast -> 437 passing, 0 failed, 4 ignored. Matches the post-#66 baseline. Deploy notes ------------ - No new env knobs; no bundle layout change. - /var/lib/agentd/rate-calibration.json.lock is created next to the existing calibration file the first time a writer takes the lock; it's a zero-byte advisory lockfile, no migration needed. - Live c6in.metal re-bench is OPTIONAL — the synthetic harness already proves the orchestration is symmetric. The calibration race fix is observable from operator logs (ls /var/lib/agentd shows N entries after a converged multi-NIC scan, where pre-fix it showed 1). Out of scope (per task brief) ----------------------------- - Live c6in.metal bench (gated on metal launch authorization). - AF_XDP fan-out (separate worker, separate PR). - Touching src/fetcher.rs / src/bin/anyscan-worker.rs / runtime.env on prod — coordination boundary respected. Co-authored-by: skullcmd <skullcmd@anyvm.tech> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
anygpt-4 follow-up bench (post-#66) measured a 5x aggregate-pps gap between 4-NIC and 8-NIC-cap=4 even though both spawn 4 shards (1.81M vs 8.58M, anygpt-37 brief). Two unrelated issues fell out of investigation:
rate-calibration.jsonAND the same shared.tmpfilename, so concurrent shards clobbered each other and at most one entry persisted. This matches the "never persists" symptom in the brief.cap_concurrent_subprocesses(8, 4) -> first 4 NICsthensplit_target_range_for_shards(range, 4) -> 4 shards). New synthetic harness confirms identical interface sequence, identical shard distribution, identical mocked aggregate. The bench delta is real-hardware variance, not orchestration. The harness ships as a regression guard.Root cause: calibration race
With 4 shards converging together, 3+ entries silently dropped. Confirmed by RED test that fails reliably pre-fix (3-of-6 lost).
Fix shape
RateCalibrationStore.store:rate-calibration.json.tmp.<pid>) so concurrent writers don't overwrite each other's pre-rename payloads.fcntl.flock-backed read-modify-write cycle on a sibling.lockfile. Each writer re-reads the latest entries while holding the lock so it observes other shards' updates instead of clobbering with a stale view.json.dumps+ one rename — millisecond-scale.Test coverage
test_concurrent_writes_from_multiple_processes_all_persist(NEW): 6 fork()ed workers store simultaneously; all 6 entries must land. RED pre-fix; GREEN post-fix.test_concurrent_writes_leave_no_dangling_tmp_files(NEW): no orphan.tmp.*after concurrent writes.test_persists_when_interrupted_by_systemexit_mid_loop(NEW): controller's try/finally must persist max_clean_rate even when the adapter's signal handler raises SystemExit mid-loop. Regression guard for the SIGTERM/SIGINT exit path called out in the task brief.FourNicVsEightNicCapFourParityTests(NEW): mocks_spawn_shard_adapter, runs orchestration with 4 vs 8 interfaces under the default cap=4, asserts identical interface sequence + identical shard target_range distribution + identical synthetic aggregate pps. Pins the contract so future drift is impossible to ship silently.Verification
python3 -m unittest test_anyscan_rate_controller -v→ 53 OK (was 50; +3)python3 -m unittest test_vulnscanner_adapter_multinic -v→ 31 OK (was 30; +1)python3 -m py_compile anyscan_rate_controller.py vulnscanner-zmap-adapter.py→ cleancargo build --workspace→ clean (2 pre-existing warnings unchanged)cargo test --workspace --no-fail-fast→ 437 passing, 0 failed, 4 ignored. Matches post-perf(portscan): AIMD CPU-vs-network slip + subprocess cap + per-instance defaults + partial-window calibration #66 baseline.Test plan
.tmp.*files after concurrent writesls /var/lib/agentdafter a converged multi-NIC scan)Deploy notes
/var/lib/agentd/rate-calibration.json.lockis created next to the existing calibration file the first time a writer takes the lock — zero-byte advisory lockfile, no migration needed.Out of scope
src/fetcher.rs/src/bin/anyscan-worker.rs/runtime.envon prod (coordination boundary respected per brief).🤖 Generated with Claude Code