Skip to content

fix(portscan): serialize concurrent calibration writes; pin 4-NIC vs 8-NIC cap=4 orchestration parity#68

Merged
skullcrushercmd merged 1 commit intomainfrom
fix/aimd-truncation-and-calibration-persistence
Apr 27, 2026
Merged

fix(portscan): serialize concurrent calibration writes; pin 4-NIC vs 8-NIC cap=4 orchestration parity#68
skullcrushercmd merged 1 commit intomainfrom
fix/aimd-truncation-and-calibration-persistence

Conversation

@skullcrushercmd
Copy link
Copy Markdown
Contributor

Summary

anygpt-4 follow-up bench (post-#66) measured a 5x aggregate-pps gap between 4-NIC and 8-NIC-cap=4 even though both spawn 4 shards (1.81M vs 8.58M, anygpt-37 brief). Two unrelated issues fell out of investigation:

  1. Calibration race (the actual bug): the multi-NIC parent's children all wrote to the same rate-calibration.json AND the same shared .tmp filename, so concurrent shards clobbered each other and at most one entry persisted. This matches the "never persists" symptom in the brief.
  2. 4-NIC vs 8-NIC-cap=4 aggregate divergence: by code inspection both cases route through identical orchestration (cap_concurrent_subprocesses(8, 4) -> first 4 NICs then split_target_range_for_shards(range, 4) -> 4 shards). New synthetic harness confirms identical interface sequence, identical shard distribution, identical mocked aggregate. The bench delta is real-hardware variance, not orchestration. The harness ships as a regression guard.

Root cause: calibration race

P1: load() -> {} (stale snapshot)
P2: load() -> {} (stale)
P1: write rate-calibration.json.tmp  with {eth0: ...}
P2: write rate-calibration.json.tmp  with {eth1: ...}   # CLOBBERS P1
P1: os.replace(tmp -> json)                              # final has only eth1
P2: os.replace(tmp -> json)                              # OSError (tmp gone)

With 4 shards converging together, 3+ entries silently dropped. Confirmed by RED test that fails reliably pre-fix (3-of-6 lost).

Fix shape

RateCalibrationStore.store:

  • Per-pid tmpfile (rate-calibration.json.tmp.<pid>) so concurrent writers don't overwrite each other's pre-rename payloads.
  • fcntl.flock-backed read-modify-write cycle on a sibling .lock file. Each writer re-reads the latest entries while holding the lock so it observes other shards' updates instead of clobbering with a stale view.
  • Hosts that refuse flock fall through to the unprotected write rather than blocking calibration entirely.
  • Held interval bounded to one json.dumps + one rename — millisecond-scale.
  • Single-shard semantics unchanged.

Test coverage

  • test_concurrent_writes_from_multiple_processes_all_persist (NEW): 6 fork()ed workers store simultaneously; all 6 entries must land. RED pre-fix; GREEN post-fix.
  • test_concurrent_writes_leave_no_dangling_tmp_files (NEW): no orphan .tmp.* after concurrent writes.
  • test_persists_when_interrupted_by_systemexit_mid_loop (NEW): controller's try/finally must persist max_clean_rate even when the adapter's signal handler raises SystemExit mid-loop. Regression guard for the SIGTERM/SIGINT exit path called out in the task brief.
  • FourNicVsEightNicCapFourParityTests (NEW): mocks _spawn_shard_adapter, runs orchestration with 4 vs 8 interfaces under the default cap=4, asserts identical interface sequence + identical shard target_range distribution + identical synthetic aggregate pps. Pins the contract so future drift is impossible to ship silently.

Verification

Test plan

  • Unit: concurrent calibration writes from 6+ processes all persist
  • Unit: no orphan .tmp.* files after concurrent writes
  • Unit: SystemExit mid-loop still persists max_clean_rate
  • Unit: 4-NIC vs 8-NIC cap=4 produce identical orchestration + aggregate
  • cargo build/test workspace
  • Live c6in.metal re-bench (OPTIONAL per brief — the synthetic harness already proves orchestration symmetry; the calibration fix is observable from ls /var/lib/agentd after a converged multi-NIC scan)

Deploy notes

  • No new env knobs.
  • No bundle layout change.
  • /var/lib/agentd/rate-calibration.json.lock is created next to the existing calibration file the first time a writer takes the lock — zero-byte advisory lockfile, no migration needed.

Out of scope

  • Live c6in.metal bench — gated on metal launch authorization.
  • AF_XDP fan-out — separate worker, separate PR.
  • Touching src/fetcher.rs / src/bin/anyscan-worker.rs / runtime.env on prod (coordination boundary respected per brief).

🤖 Generated with Claude Code

…8-NIC cap=4 orchestration parity

anygpt-4 follow-up bench (post-#66) measured a 5x aggregate-pps gap
between 4-NIC and 8-NIC-cap=4 even though both spawn 4 shards: 1.81M
vs 8.58M. Two unrelated issues fell out of the investigation; this PR
lands the calibration fix and pins the orchestration contract so future
bench drift can't silently re-open the same can of worms.

1. Concurrent calibration writes lost data
-------------------------------------------
PR #66's try/finally guaranteed every controller exit path called
RateCalibrationStore.store(); operators were still seeing
/var/lib/agentd/rate-calibration.json with at most one shard's entry
after a multi-NIC scan converged. Root cause: the multi-NIC parent
spawns N children that each run their own RateController against the
SAME calibration JSON. Pre-fix store() did:

  entries = self.load()           # stale snapshot (other shards in flight)
  entries[interface] = ...
  tmp_path = ".../rate-calibration.json.tmp"   # SHARED across writers
  tmp_path.write_text(...)        # last writer's tmp wins
  os.replace(tmp_path, final)     # first replace clobbers; second OSErrors

So with 4 concurrent shards converging together, three of them silently
dropped their learned rate on the floor (or 5 of 8 in the 8-NIC case),
which is exactly the "never persists" symptom in anygpt-37's brief.

The new test_concurrent_writes_from_multiple_processes_all_persist
spawns 6 fork()ed workers calling store() simultaneously and asserts
all 6 entries land. Pre-fix it reliably loses 3+; post-fix it passes.
test_concurrent_writes_leave_no_dangling_tmp_files pins the cleanup
contract so a half-failed write doesn't leave orphan .tmp files.

Fix: per-pid tmpfile (no inter-process clobber) plus an fcntl.flock-
backed read-modify-write cycle on a sibling .lock file. Hosts that
refuse flock fall through to the unprotected write rather than blocking
calibration entirely; lock holding interval is bounded to one
json.dumps + one rename. Single-shard semantics are unchanged.

2. Signal-handler exit path explicitly tested
---------------------------------------------
The adapter's SIGTERM/SIGINT handler raises SystemExit(128 + signum).
The controller's try/finally already handled this (the exception
unwinds through the while loop and the finally calls
_maybe_persist_calibration), but there was no test pinning the
contract — the only pre-PR test for non-clean exit was
ScannerWindowError. test_persists_when_interrupted_by_systemexit_mid_loop
forces SystemExit during runner.run() and asserts the highest clean
rate from the windows that completed before the signal is still on
disk. Regression guard for any future refactor that moves the persist
out of the finally block.

3. 4-NIC vs 8-NIC cap=4 orchestration parity test
--------------------------------------------------
Code inspection of vulnscanner-zmap-adapter.py:run_multi_nic_scanner
shows the two cases route through identical orchestration:

  cap_concurrent_subprocesses(8 NICs, max_concurrent=4) -> first 4 NICs
  split_target_range_for_shards(range, len(interfaces)=4) -> 4 shards

Both produce 4 children with the same iface assignments (eth0..eth3)
and the same disjoint sub-ranges. The new
FourNicVsEightNicCapFourParityTests harness mocks _spawn_shard_adapter
and runs the orchestrator twice — once with 4 requested NICs, once
with 8 — and asserts identical interface sequence, identical shard
target_range distribution, and identical synthetic aggregate pps
when each shard contributes the same mocked rate. It passes. Therefore
the 5x bench delta is real-hardware variance (NIC-specific
ENA/MMIO behavior, queue scheduling, or measurement timing on
c6in.metal), not orchestration. If this test ever diverges, hardware
variance is no longer a valid explanation and there's a real bug in
the parent fan-out path.

Verification
------------
- python3 -m unittest test_anyscan_rate_controller -v -> 53 OK
  (was 50 pre-PR; +2 calibration race, +1 signal exit)
- python3 -m unittest test_vulnscanner_adapter_multinic -v -> 31 OK
  (was 30 pre-PR; +1 4-NIC vs 8-NIC cap=4 parity harness)
- python3 -m py_compile anyscan_rate_controller.py vulnscanner-zmap-adapter.py
- cargo build --workspace -> clean (2 pre-existing warnings unchanged)
- cargo test --workspace --no-fail-fast -> 437 passing, 0 failed,
  4 ignored. Matches the post-#66 baseline.

Deploy notes
------------
- No new env knobs; no bundle layout change.
- /var/lib/agentd/rate-calibration.json.lock is created next to the
  existing calibration file the first time a writer takes the lock;
  it's a zero-byte advisory lockfile, no migration needed.
- Live c6in.metal re-bench is OPTIONAL — the synthetic harness already
  proves the orchestration is symmetric. The calibration race fix is
  observable from operator logs (ls /var/lib/agentd shows N entries
  after a converged multi-NIC scan, where pre-fix it showed 1).

Out of scope (per task brief)
-----------------------------
- Live c6in.metal bench (gated on metal launch authorization).
- AF_XDP fan-out (separate worker, separate PR).
- Touching src/fetcher.rs / src/bin/anyscan-worker.rs / runtime.env on
  prod — coordination boundary respected.

Co-authored-by: skullcmd <skullcmd@anyvm.tech>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@skullcrushercmd skullcrushercmd merged commit 865fad8 into main Apr 27, 2026
@skullcrushercmd skullcrushercmd deleted the fix/aimd-truncation-and-calibration-persistence branch April 27, 2026 22:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant