docs(plans): AF_XDP integration plan for higher pps (Phase 1)#65
skullcrushercmd merged 2 commits into main
Conversation
Comprehensive design + dependency + LOC + test + risk + rollout plan for
adding an AF_XDP I/O path to the bundled C scanner. No scanner code is
changed; Phase 2 implementation is gated on user approval.
Motivation: anygpt-4 c6in.metal 4-NIC bench hit 12.8 Mpps aggregate,
gated by the AF_PACKET TX/syscall path (single socket caps ~3 Mpps even
with PACKET_QDISC_BYPASS). AF_XDP lets the ENA backplane (~100 Mpps
theoretical on c6in.metal) be the actual bottleneck.
Design highlights:
- New send-afxdp.c / recv-afxdp.c slot into the same shape as the
existing USE_PFRING_ZC build flag (template files in upstream).
- Plan also resolves a pre-existing dispatch wart: engine.c hardcodes
sender_thread, so the existing PF_RING ZC files compile but never
run. The plan introduces a small io_engine_vtable_t that wires both
paths cleanly (sketched after this list).
- Per-NIC: one XSK socket per (ENI, queue_id), TX-only (rx=NULL),
XDP_USE_NEED_WAKEUP, native zero-copy on ENA where channel index
permits, SKB-mode fallback otherwise.
- Build: USE_AF_XDP=1 mirrors USE_PFRING_ZC=1; runtime opt-in via new
--io-engine={af_packet,af_xdp,pfring_zc} flag (default af_packet).
- Estimated ~580 LOC C (450 net new, 130 in modified files), 6-8 days
of implementation effort over four small PRs.
Phase 2 is explicitly out of scope for this PR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e93252435f
```bash
apt-get install -y --no-install-recommends \
  libxdp1 libbpf1 libelf1 libz1
```
Use installable runtime package names
The runtime install command includes libelf1 and libz1, but on Ubuntu 24.04 (the baseline this plan recommends in §4.1) those package names are not available (libelf1t64/zlib1g are). If this snippet is followed in Phase 2 bootstrap scripts, apt-get install will fail and block AF_XDP rollout on the target AMI, so the package list should be corrected or made distro-conditional.
Good catch — this was a real bug that would have broken Phase 2 bootstrap on the recommended Ubuntu 24.04 baseline. Fixed in af90ff0.
Specifically:
- `libz1` → `zlib1g` (the codex callout is right; `libz1` does not exist on any Debian-family distro I can find).
- `libelf1` → `libelf1t64` on Noble, retained as `libelf1` for Jammy/bookworm (per the Ubuntu 64-bit time_t transition).
- `libxdp1` and `libbpf1` are unchanged across Jammy/Noble/bookworm — verified they're not on the t64 rename list.
§4.2 now provides two separate apt-get install lines (Noble baseline + older-LTS fallback) and a note on the t64 rename so Phase 2 doesn't re-discover this. Phase 2 will pick the right line based on lsb_release -rs at install time rather than copy-pasting blindly.
The runtime install snippet in §4.2 used `libelf1 libz1`, but on Ubuntu 24.04 (the baseline §4.1 recommends) `libelf1` was renamed to `libelf1t64` as part of the 64-bit time_t transition, and `libz1` was never a real Debian-family package name (the runtime zlib package is `zlib1g`). Phase 2 bootstrap scripts copying this snippet verbatim would have hit `apt-get install: Unable to locate package`.

Provides separate, correct package lists for the Ubuntu 24.04 baseline and the Ubuntu 22.04 / Debian bookworm fallback, plus a note on the t64 rename so Phase 2 doesn't re-discover this.

Addresses chatgpt-codex-connector review on PR #65.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: af90ff00e5
| Knob | Value | Rationale |
|---|---|---|
| `bind_flags` | `XDP_USE_NEED_WAKEUP` | Standard kernel-side config; `xsk_ring_prod__needs_wakeup` decides when to `sendto`. |
| `xdp_flags` | `XDP_FLAGS_DRV_MODE \| XDP_ZEROCOPY` | Native zero-copy on ENA (driver supports it; see §3.5). Falls back to `XDP_FLAGS_SKB_MODE` (generic) if ENA refuses ZC for this queue. |
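For context, the `XDP_USE_NEED_WAKEUP` row describes the standard TX-kick pattern; a minimal sketch (the helper name is ours, not the plan's):

```c
#include <sys/socket.h>
#include <xdp/xsk.h>   /* libxdp: xsk_ring_prod__needs_wakeup() */

/* With XDP_USE_NEED_WAKEUP set, the kernel flags the TX ring when it
 * needs a syscall kick; otherwise the producer keeps filling
 * descriptors with zero syscalls. */
static void tx_kick_if_needed(struct xsk_ring_prod *tx, int xsk_fd)
{
    if (xsk_ring_prod__needs_wakeup(tx))
        sendto(xsk_fd, NULL, 0, MSG_DONTWAIT, NULL, 0);
}
```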
Move zero-copy flag to AF_XDP bind flags
This plan puts XDP_ZEROCOPY under xdp_flags, but in AF_XDP, zero-copy/copy mode is selected via socket bind flags (sxdp_flags), not XDP program-attach flags. If Phase 2 implements this literally (also echoed later in the fallback probe section), af_xdp setup can fail on otherwise supported hosts because the wrong flag set is passed at socket/program setup time. Keep xdp_flags to driver/SKB mode selection and put zero-copy policy in bind_flags (XDP_ZEROCOPY or XDP_COPY).
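A hedged sketch of the corrected split (function and variable names are illustrative, not the plan's):

```c
#include <errno.h>
#include <linux/if_link.h>   /* XDP_FLAGS_DRV_MODE */
#include <linux/if_xdp.h>    /* XDP_ZEROCOPY, XDP_COPY, XDP_USE_NEED_WAKEUP */
#include <xdp/xsk.h>         /* xsk_socket__create(), struct xsk_socket_config */

/* Zero-copy vs copy is a bind policy (sxdp_flags), so it belongs in
 * bind_flags; xdp_flags only picks driver vs SKB attach mode. */
static int open_tx_xsk(struct xsk_socket **xsk, const char *ifname,
                       __u32 queue_id, struct xsk_umem *umem,
                       struct xsk_ring_prod *tx)
{
    struct xsk_socket_config cfg = {
        .rx_size    = 0,   /* TX-only socket: no RX ring */
        .tx_size    = XSK_RING_PROD__DEFAULT_NUM_DESCS,
        .xdp_flags  = XDP_FLAGS_DRV_MODE,
        .bind_flags = XDP_USE_NEED_WAKEUP | XDP_ZEROCOPY,
    };

    int ret = xsk_socket__create(xsk, ifname, queue_id, umem,
                                 NULL /* rx */, tx, &cfg);
    if (ret == -EOPNOTSUPP) {
        /* ENA refused zero-copy for this queue: same driver mode, copy bind. */
        cfg.bind_flags = XDP_USE_NEED_WAKEUP | XDP_COPY;
        ret = xsk_socket__create(xsk, ifname, queue_id, umem, NULL, tx, &cfg);
    }
    return ret;
}
```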
Deployed to prod ✅

Prod redeploy

The api binary sha changed because of PR #66's edits. Fresh bundle:

- Size: 17162815 bytes
- Fingerprint: 5dd517c87d76
Required content all confirmed in `tar -tzf`:

- ✓ extensions/anyscan_rate_controller.py
- ✓ extensions/portscan-adapter.py
- ✓ env/runtime.env.template
- ✓ bin/tune-scanner-host.sh
- ✓ bin/reserve-control-bandwidth.sh
PR #66 plumbing verified inside the bundle
```python
# anyscan_rate_controller.py:180-187
cpu_pressure = cpu_saturated and heartbeat_slip
if not cpu_pressure and not network_pressure:
    …
if cpu_pressure and not network_pressure:
    …  # local CPU starvation — don't rate-cut
if network_pressure and not cpu_pressure:
    …  # genuine network slip — rate-cut
```

Plus `# survives even partial windows` in the calibration writer (line 838) and `ANYSCAN_RATE_MAX_CONCURRENT_SUBPROCESSES` referenced in portscan-adapter.py (lines 47, 846) and runtime.env.template:76.
Bundle endpoint serves the freshly-built artifact

```
$ curl -fsSL "https://scan.anyvm.tech/api/agent/install.sh?rebuild=false&platform=linux-x86_64" | grep BUNDLE_NAME
BUNDLE_NAME='agent-bundle-linux-x86_64__20260427191214-3236925-5dd517c87d76.tar.gz'
```
Worker remote-update — one alive worker
The auto-recreated fleet worker (anyscan-ec2-worker, i-0b94844f5ace75d28 at 44.203.214.161) was alive and already running a post-#66 bundle from its fresh bootstrap. Remote-update fired against it cleanly:
| | Pre | Post |
|---|---|---|
| agentd sha | a786750834… | a786750834… (same — PR #66 didn't touch worker source) |
| AGENT_BUNDLE_NAME | …191248-…5dd517c87d76 | …191309-…5dd517c87d76 |
| Service | active | active |
ANYSCAN_RATE_MAX_CONCURRENT_SUBPROCESSES=4 confirmed in /etc/agentd/runtime.env — PR #66's install-time default fired correctly. So the next 8-NIC metal launch will only run 4 shards by default, exactly as the deploy note said.
Note for the next bench cycle
When the user authorizes another c6in.metal launch and an 8-shard CPU-pressure handling test, the operator can override:
```
echo 'ANYSCAN_RATE_MAX_CONCURRENT_SUBPROCESSES=8' >> /etc/agentd/runtime.env
systemctl restart agentd
```

…then re-run the same bench shape to confirm the CPU-vs-network slip distinction handles the regressed case from the prior bench (8-NIC at 1.34M aggregate). Expectation: AIMD's cpu_pressure branch should not rate-cut on heartbeat lag when CPU is the cause, so per-NIC pps shouldn't collapse to 167k.
Out of scope per spec
- AnyGPT submodule pointer bump.
- Scan kickoff.
- AF_XDP Phase 2 implementation (gated on user approval after the plan PR #65 review).
…rk (#67)

Phase 2 PR 1 of 4 of the AF_XDP integration plan (PR #65 §9.1) ships a refactor of the scanner C source (engine.c dispatch table + --io-engine CLI flag + PF_RING ZC dispatch fix) which lives in a fork of the third-party upstream scanner repository:

- Upstream: github.com/Lorikazzzz/VulnScanner-zmap-alternative-
- Fork: github.com/AnyVM-Tech/anyscan-engine-c
- Phase 2 PR 1 commit on the fork: AnyVM-Tech/anyscan-engine-c@998c66b on branch perf/portscan-afxdp-phase2-pr1

Why fork: the plan §9.1 calls out that the upstream scanner is third-party and proposes a fork under AnyVM-Tech as the resting place for the integration patches (AF_XDP send/receive paths in PRs 2 + 3, build integration in PR 4, and follow-on PF_RING ZC cluster init).

This commit only updates the AnyScan-side scripts to resolve from the new fork:

- install-external-deps.sh:11-12 — clone URL and local checkout dir now default to the AnyVM-Tech fork. Both can still be overridden via the existing ANYSCAN_VULNSCANNER_REPO_URL / ANYSCAN_VULNSCANNER_REPO_DIR environment variables (no behaviour change for callers that set them).
- package-worker-bundle.sh:519-525 — preferred lookup order is now `anyscan-engine-c/scanner` first, the legacy `VulnScanner-zmap-alternative-/scanner` directory second (kept for transitional dev checkouts), and `/opt/anyscan/bin/scanner` last.

What is NOT in this PR:

- The actual AF_XDP send/receive paths (PR 2 + 3 of Phase 2).
- The Makefile / install-external-deps.sh `USE_AF_XDP=1` build flag plumbing (PR 4 of Phase 2).
- Live c6in.metal benchmarks (PR 5 of Phase 2).
- AnyGPT submodule pointer bump.
- Any change to runtime.env or to the AIMD rate controller.

Test plan:

- `cargo build --workspace` (release) — clean.
- `cargo test --workspace --no-fail-fast` — 437 tests pass (matches post-#66 baseline: 371 + 31 + 2 + 33).
- `python3 -m py_compile vulnscanner-zmap-adapter.py` — clean.
- On the scanner fork:
  - `make` (default AF_PACKET) — builds.
  - `make test` — 11 dispatch smoke tests pass.
  - `gcc -fsyntax-only -DUSE_PFRING_ZC ...` — compiles, dispatch reaches the ZC thread bodies.
  - `./scanner --io-engine=af_xdp` exits 1 with a clear "USE_AF_XDP=1 not set; AF_XDP send/receive paths land in PRs 2 + 3" message.
  - `./scanner --io-engine=pfring_zc` (without USE_PFRING_ZC) exits 1 with the equivalent compile-flag error.
  - `./scanner --io-engine=bogus` exits 1 with "Unknown --io-engine".

Refs: AnyVM-Tech/AnyScan PR #65, plan §3.1 + §3.3 + §9.1.

Co-authored-by: AnyVM-Tech AO <agent@anyvm.tech>
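For reference, a hedged sketch of the compile-flag gating those exit-1 test cases imply (identifiers and messages are illustrative, not the fork's actual code):

```c
#include <stdio.h>
#include <string.h>

/* Sketch: each non-default engine is only dispatchable when its
 * compile-time flag was set; otherwise fail fast with a clear error.
 * Returns 0 if the engine is usable, 1 otherwise (caller exits with it). */
static int select_io_engine(const char *name)
{
    if (strcmp(name, "af_packet") == 0)
        return 0;                      /* always compiled in */
    if (strcmp(name, "af_xdp") == 0) {
#ifdef USE_AF_XDP
        return 0;
#else
        fprintf(stderr, "USE_AF_XDP=1 not set; AF_XDP send/receive paths land in PRs 2 + 3\n");
        return 1;
#endif
    }
    if (strcmp(name, "pfring_zc") == 0) {
#ifdef USE_PFRING_ZC
        return 0;
#else
        fprintf(stderr, "scanner built without USE_PFRING_ZC=1\n");
        return 1;
#endif
    }
    fprintf(stderr, "Unknown --io-engine '%s'\n", name);
    return 1;
}
```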
Phase 2 — c6in.metal live bench (anygpt-42)

Live bench on freshly-deployed AF_XDP build (PR #71 + engine-c PR #3) on …

Headline
Per-config wall + tx_dropped:
Live vs synthetic projections (PR #65 §10)
Setup notes (operational findings the plan should fold in)
Bench harness

Custom bash harness (not via the AnyScan API/adapter path) — direct …

Out of scope (per task instructions)
Cleanup
Phase 2 — c6in.metal 15-NIC live bench (anygpt-48): adding NICs and a 6.19 kernel fail to close the gap

Live follow-on to anygpt-42's 8-NIC bench (issuecomment-4336192354). Same …

TL;DR

The 30–50 M-pps synthetic projection from this PR's plan §10 still does not hold on AWS c6in.metal in 2026-04. PR #73 (kernel backport opt-in) and PR #74 (15-ENI launch path) wire the knobs cleanly, but the underlying premises — "kernel 6.16+ unlocks ena_xdp_zc" and "more ENIs unlock more PCIe trees" — both fail in production for the reasons documented inline below. The 8-NIC …

Headline
Bench harness wall + tx_dropped
Why it regressed: three live findings on top of the prior …
| PR | Wired correctly? | Closes 22.43 M → 30–50 M gap? |
|---|---|---|
| #71 (AF_XDP build wireup) | ✅ scanner ships with libxdp linkage; mode ladder fires; falls back to drv+copy | n/a — was never claimed to |
| #73 (kernel backport opt-in) | trixie-backports, ENA still has no zerocopy support | No — ena_xdp_zc not upstream as of 6.19.11 |
| #74 (15-ENI launch path) | No — adding ENIs at drv+copy is CPU-bound; 15-NIC regressed vs 8-NIC at every T=R config tested | |
| #75 (PF_RING ZC build wireup) | (intentionally not exercised — engine-init stub blocks runtime) | (not testable yet) |
| #76 (PF_RING gating) | (not exercised — use_pfring_zc=0 in bundle) | (n/a) |
Cleanup

- c6in.metal `i-043714fbb73cca641` — terminated 18:32:08Z (live for ~36 min total, ~$3.30 in on-demand spend).
- 15 ENIs (`eni-0c47a4a7f3ba69511`, `eni-049606866a71c9d87`, `eni-0c9c10e4dc28adde6`, `eni-0a6a17a3e724ff90b`, `eni-0acf00ef684fffb47`, `eni-02dc9e888c93d5a33`, `eni-0aad977041b58b791`, `eni-087e24d177d27ee7b`, `eni-03da977940c9bb8ef`, `eni-07cb100d30af9c73a`, `eni-06afba14a7dde0c93`, `eni-0d21500d7a1f5d641`, `eni-0c35ff242e90b7a59`, `eni-06c8680f516ae4ee6`, `eni-09746474722608a97`) — auto-deleted on termination (`DeleteOnTermination=true` was the default).
- EIP `eipalloc-023db7e0b15246cb7` (54.165.21.227) — released.
- `.external-runtime.env` restored to `c6in.xlarge` (no `ANYSCAN_MAX_ENIS`); `anyscan-ec2-worker-manager.service` restarted; replacement xlarge `i-0778d6f698047418f` already running.
- Bench logs preserved at `scan.anyvm.tech:/root/.worktrees/AnyGPT/anygpt-48/anygpt-48-bench-logs.tar.gz`.
Out of scope (per task brief)
- No DPDK exercise — the plan-only draft PR #72 (docs(plans): DPDK userspace-networking integration plan, Phase 1) is awaiting orchestrator approval.
- No PF_RING ZC bench — PR #75 (fix(build): wire ANYSCAN_USE_PFRING_ZC=1 through install-external-deps + package-worker-bundle + deploy + adapter) wires the build path but the engine cluster init is a stub.
- No api-driven bench — same as anygpt-42, this is a direct-scanner harness so the rate-controller's classifier histogram and `heartbeat_jitter` aren't captured. The path-mode classifier should be revisited only after a config that actually beats 22.43 M emerges.
cc PR #73 / PR #74 — flagging the fixture and suite-default issues as separate follow-ups rather than blocking the env-knob PRs that are otherwise wired correctly.
Phase 1 design document for adding a DPDK io_engine to the bundled C scanner (AnyVM-Tech/anyscan-engine-c). Mirrors PR #65's AF_XDP plan structure across §1-§10.

Why now: PR #65's AF_XDP work landed but the c6in.metal bench revealed ENA on kernel <=6.12.74 forces drv+copy (not drv+zerocopy), capping the 8-NIC ceiling at ~22 M pps — short of the 30-50 M pps projection. DPDK via vfio-pci bypasses the ENA kernel driver entirely, projecting 50-100 M pps realistic on c6in.metal. This supersedes PR #63's deferral recommendation (which was conditioned on AF_XDP clearing the throughput target — it did not).

Plan scope:

- engine repo: ~1,100 LOC (send-dpdk.c, recv-dpdk.c, dpdk-eal.c, dpdk-defs.h, vtable slot in engine.c, USE_DPDK Makefile block)
- AnyScan-side wire-up: ~765 LOC (mirrors PR #71's ANYSCAN_USE_AF_XDP pattern across install-external-deps.sh / package-worker-bundle.sh / deploy.sh / runtime.worker.env.template / adapter.py + new tools/setup-dpdk.sh for hugepages and vfio-pci bind/unbind)
- NIC-binding decision: dedicated-DPDK-NIC pattern. eth0 stays on kernel for agentd heartbeat; ENIs eth1..eth7 (c6in.metal) go to vfio-pci. Single-NIC instances are DPDK-ineligible by design.
- Effort: 12-15 days implementation + canary, ~3-4 weeks total.

Phase 2 implementation is gated on user/orchestrator approval after this plan PR merges. No engine C code, no runtime config, no submodule bumps in this PR.

Co-authored-by: skullcmd <skullcmd@anyvm.tech>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… 4x5/4/3/3=15) (#77)

PR #74 mocked NetworkCards as 4 cards distributed 5/4/3/3=15, but actual AWS DescribeInstanceTypes for c6in.metal returns 2 cards x 8 = 16 (anygpt-48 live bench, PR #65 issuecomment-4338158487). The launch path code is fine - distribute_enis_across_cards handles any card layout - but the synthetic test fixture and the docstring example encoded a shape that doesn't match production AWS.

Refresh the fixture, the docstring, and every test that hardcoded 15-derived numbers. Add a new RecordedDescribeInstanceTypesIntegrityTests class that anchors the fixture against tools/c6in_metal_describe_instance_types.json (a real `aws ec2 describe-instance-types` capture) so future drift gets caught at unit-test time instead of bench time.

Effect on capacity claim: c6in.metal has 2 PCIe trees, not 4, so the multi-NIC headroom caps at ~2x single-tree, not ~4x.

Co-authored-by: skullcmd <skullcmd@anyvm.tech>
…launches (#79)

When the launch payload uses an explicit NetworkInterfaces[] list (ANYSCAN_MAX_ENIS set), AWS does NOT honor the subnet's MapPublicIpOnLaunch — the operator has to opt in by setting AssociatePublicIpAddress=True on the primary ENI explicitly. In anygpt-48 (PR #65 issuecomment-4338158487) this caused the c6in.metal launch to come up unreachable from outside the VPC; the operator had to manually allocate-address + associate-address post-launch as a workaround.

Add an opt-in env knob ANYSCAN_EC2_ASSOCIATE_PUBLIC_IP (default off so existing fleets are unchanged). When set, plumb through ManagerConfig to build_network_interfaces, which sets AssociatePublicIpAddress=True on the entry with DeviceIndex=0 NetworkCardIndex=0 only — AWS rejects the field on secondaries.

Co-authored-by: skullcmd <skullcmd@anyvm.tech>
ANYSCAN_KERNEL_BACKPORT_SUITE defaulted to bookworm-backports regardless of host. On the current Debian 13 (Trixie) AMI this means bookworm-backports/linux-image-cloud-amd64 resolves to 6.12.74 — the same kernel the metal already runs — so the opt-in completes "0 upgraded, 0 newly installed" and the operator gets a false green light without ever upgrading.

Detect /etc/os-release VERSION_CODENAME and default the suite to <codename>-backports. Switch the default package to linux-image-amd64 (NOT linux-image-cloud-amd64) on non-bookworm suites, because trixie-backports cloud-amd64 is still 6.12 as of 2026-04 — only the non-cloud image jumps to 6.19. Operator-set ANYSCAN_KERNEL_BACKPORT_SUITE / _PACKAGE / _SOURCES_LIST still win — detection is just a smarter default. Source-list path is also derived from the resolved suite so the file matches.

ANYSCAN_OS_RELEASE_FILE env override added so the test suite can inject a synthetic os-release without touching /etc/os-release on the test host.

See PR #65 issuecomment-4338158487 (anygpt-48 c6in.metal bench) for the kernel-resolution trace.

Co-authored-by: skullcmd <skullcmd@anyvm.tech>
…undle + deploy + adapter (#81)

Phase 2 wire-up for the DPDK io_engine landing in AnyVM-Tech/anyscan-engine-c PR #4. Mirrors PR #71's AF_XDP wire-up shape across the install / bundle / deploy / adapter / install-time-probe chain so the engine repo's USE_DPDK=1 build flag actually reaches every producer of a worker bundle, and so the runtime --io-engine=dpdk knob plumbed through ANYSCAN_SCANNER_IO_ENGINE has DPDK code to dispatch to.

Why DPDK now: AWS ENA on kernel ≤6.12.74 forces AF_XDP into drv+copy mode, capping c6in.metal at ~22M pps aggregate (memory: anyscan_afxdp_ena_constraint, also PR #65 issuecomment-4338158487 — 6.19.11 STILL does not have ena_xdp_zc). DPDK bypasses the kernel ENA driver entirely via vfio-pci and removes the syscall-kick + lower-half-channels-only ZC constraint.

What lands here:

- install-external-deps.sh: ANYSCAN_USE_DPDK env knob; binary_has_dpdk_linkage probe (librte_eal.so via ldd → readelf -d); install_dpdk_build_deps (libdpdk-dev + dpdk apt-get, fail-open); cache short-circuit invalidation when cached binary lacks DPDK linkage; vulnscanner_make_args extension; post-build assertion.
- package-worker-bundle.sh: same env knob, linkage probe, rebuild_scanner_with_dpdk helper, bundle_engine_make_args, README.txt use_dpdk field. Composes with USE_AF_XDP=1 USE_PFRING_ZC=1 — the earliest matching rebuild block produces a binary linked against every requested engine in a single make invocation.
- deploy.sh: same env knob, linkage probe, make_args extension, pre-DPDK cached-binary drop, post-build assertion.
- install-worker-bundle.sh: binary_has_dpdk_linkage, probe_dpdk_runtime_available (5 gates: scanner USE_DPDK-built, librte_eal.so loadable, vfio_pci kernel module, hugepages reserved in /sys/kernel/mm/hugepages/*, /dev/vfio/vfio present), apply_dpdk_availability writing ANYSCAN_DPDK_AVAILABLE.
- vulnscanner-zmap-adapter.py: SUPPORTED_IO_ENGINES gains "dpdk"; _IO_ENGINE_AVAILABILITY_KEYS maps "dpdk" → ANYSCAN_DPDK_AVAILABLE so the same fall-back-with-warning path the AF_XDP / PF_RING ZC plumbing already exercises picks up dpdk for free.
- runtime.worker.env.template: full DPDK section documenting ANYSCAN_USE_DPDK (build-time), ANYSCAN_DPDK_AVAILABLE (install probe), ANYSCAN_DPDK_PCI_BDFS (BDF / iface CSV), and ANYSCAN_DPDK_HUGEPAGES_GB (default 4).
- tools/setup-dpdk.sh (NEW, ~370 LOC): bind / unbind / status subcommands. Reserves hugepages (1 GiB pages preferred, falls back to 2 MiB), modprobe vfio-pci, dpdk-devbind.py --bind=vfio-pci. Idempotent (re-runs are no-ops). Reversible (`unbind` returns the NICs to ena and frees hugepages). Refuses to bind eth0 (agentd control-plane interface) and refuses to bind the only NIC. THP gets switched to "never" on bind (DPDK + THP fragments the static hugepage pool).
- tools/test-install-external-deps-dpdk.sh (NEW, ~270 LOC): mirrors test-install-external-deps-afxdp.sh. Four cases × multiple assertions: default unset → no USE_DPDK=1 in make argv; opt-in + missing scanner → USE_DPDK=1; opt-in + cached non-DPDK binary → make clean + USE_DPDK=1; opt-in + cached DPDK-linked binary → no rebuild. Stubs make/git/ldd/readelf so it runs hermetically.
- test_vulnscanner_adapter_io_engine.py: 7 new DPDK assertions covering the dpdk-with-runtime-available, dpdk-without-runtime-fall-back-with-warning, missing-availability-var, uppercase normalization, and cross-engine availability isolation cases. Updated test_invalid_value_falls_back_to_af_packet_with_warning to use "fake_engine" instead of "dpdk" — dpdk is now valid.
Verification (on Debian bookworm with libdpdk-dev 24.11 installed):

- tools/test-install-external-deps-afxdp.sh: 11/11 (regression OK).
- tools/test-install-external-deps-pfring-zc.sh: 10/10 (regression OK).
- tools/test-install-external-deps-dpdk.sh: 10/10.
- python3 -m unittest discover: 116/116 (32 in test_vulnscanner_adapter_io_engine, of which 7 are DPDK-specific).
- All bash scripts parse cleanly via `bash -n`.
- tools/setup-dpdk.sh status runs cleanly (no NICs bound, expected).

Engine PR for io_engine_dpdk: AnyVM-Tech/anyscan-engine-c#4

Out of scope (separate workers per the plan):

- Phase 2 systemd unit edit adding CAP_SYS_RAWIO/CAP_IPC_LOCK/CAP_NET_ADMIN to anyscan-worker.service. Documented in the env template. Until that lands operators must add caps manually before flipping the runtime knob.
- Live c6in.metal bench (plan §5.3).
- AMI rebuild.
- mlx5 / non-AWS hardware support.

Refs: plans/2026-04-28-portscan-dpdk-impl-v1.md (§3.10 wire-up, §3.11 NIC-binding decision, §4.3 kernel feature checks, §5.7 unit test shape). anygpt-50

Co-authored-by: skullcmd <skullcmd@anyvm.tech>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 2 — c6in.metal AF_PACKET / AF_XDP / DPDK live bench (anygpt-52): regressions block AF_XDP + DPDK; AWS PPS allowance still untested

Live bench on c6in.metal (128 vCPU, 2 NetworkCards × 8 ENIs each = 16 max, 8 attached) driven by anygpt-52. Engine commit …

TL;DR
The 22 M AF_XDP ceiling from anygpt-42 is still neither confirmed nor refuted as AWS-imposed. AF_PACKET-only at ~8 M pps is well below the AWS ENA per-instance PPS allowance for c6in.metal — …

Bench harness
AF_PACKET 8-NIC results (T=R=8, …)
| Counter | PRE | POST | Δ |
|---|---|---|---|
| pps_allowance_exceeded | 0 | 0 | 0 |
| bw_in_allowance_exceeded | 0 | 0 | 0 |
| bw_out_allowance_exceeded | 0 | 0 | 0 |
| conntrack_allowance_exceeded | 0 | 0 | 0 |
| linklocal_allowance_exceeded | 0 | 0 | 0 |
| conntrack_allowance_available | 6567014 | 6567010 | −4 (SSH/DNS, irrelevant) |
At ~8 M aggregate pps, AWS does not throttle. That's the only hard data point we have on the AWS quota for c6in.metal from this run.
Why AF_XDP did not run

```
[*] afxdp: xsk_socket__create(enp154s0, q=0, mode=drv+zerocopy) failed: Operation not supported
libbpf: elf: skipping unrecognized data section(8) .xdp_run_config
…
Segmentation fault
```
The bind-mode ladder in anyscan-engine-c/src/send-afxdp.c:287-289 (AFXDP_BIND_ZEROCOPY → AFXDP_BIND_DRV_COPY → AFXDP_BIND_SKB) is correct in source — it's just supposed to retry on each failure — but in practice the second attempt (drv+copy) segfaults the process instead of returning a clean error. The first attempt's xsk_socket__create returns -EOPNOTSUPP (matches memory anyscan_afxdp_ena_constraint: ENA on Linux 6.12 only supports drv+copy, not drv+zerocopy), so this is the exact code path we know we need on AWS — and it's broken.
Bug location: the engine's afxdp_try_bind() doesn't fully tear down the partially-created XSK / UMEM / xdp_program after a failed xsk_socket__create() before the next attempt. Repro is a single line: `scanner -i ens1 --io-engine=af_xdp -T 4 -R 4 …` on c6in.metal — segfaults every time.
This is a regression-blocker for AF_XDP on AWS. anygpt-42 was on engine f1288d6 (pre-PR #4); current ccfd077 (PR #4 merge) introduced the regression — most likely the DPDK changes share state-init code with the AF_XDP setup path. Splitting this into its own engine PR would have caught it via the bench gate plan §5.3 demands.
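The fix shape is roughly the following — a sketch under the assumption the engine uses libxdp's xsk API directly; actual names in afxdp_try_bind() will differ:

```c
#include <bpf/libbpf.h>     /* bpf_xdp_detach() */
#include <xdp/xsk.h>        /* xsk_socket__delete(), xsk_umem__delete() */

/* Sketch: full teardown between bind-mode attempts so the next
 * xsk_socket__create() starts from a clean slate instead of inheriting
 * a half-initialized XSK/UMEM/program from the failed attempt. */
static void afxdp_teardown_attempt(struct xsk_socket **xsk,
                                   struct xsk_umem **umem, int ifindex)
{
    if (*xsk) {
        xsk_socket__delete(*xsk);  /* unwinds ring mmaps, closes the fd */
        *xsk = NULL;
    }
    if (*umem) {
        xsk_umem__delete(*umem);   /* only valid once no socket references it */
        *umem = NULL;
    }
    /* Drop whatever XDP program the failed attempt left attached. */
    bpf_xdp_detach(ifindex, 0 /* flags: any mode */, NULL);
}
```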
Why DPDK did not run
Two engine-side bugs found in PR #4's dpdk-eal.c:
1. Hardcoded `nb_tx_desc=1024` exceeds ENA's max=512:

   ```
   ETHDEV: Invalid value for nb_tx_desc(=1024), should be: <= 512, >= 128, and a product of 1
   [-] dpdk-eal: rte_eth_tx_queue_setup(port=0, q=0) failed: Invalid argument
   ```

   The TX descriptor count must be queried from `rte_eth_dev_info_get(...).tx_desc_lim.nb_max` per device, not hardcoded. (mlx5 supports 4 K, ENA caps at 512.) See the sketch after this list.

2. EAL argv splitter mangles `--socket-mem 1024`: the trailing `1024` token gets replaced by the scanner program path on the way into `rte_eal_init`. Repro: `scanner --io-engine=dpdk … -- --file-prefix=foo --socket-mem 1024` ends up as `scanner --file-prefix=foo --socket-mem scanner` in the EAL log. The space-separated form fails; the `=` form (`--socket-mem=1024,1024`) is untested.
Plus deploy-side blockers (worked around but worth flagging for follow-up):

- `librte-net-ena25` not in the bundle install path. Debian DPDK 24.11.4 ships the ENA PMD as a separate package; without it, `rte_eal_init` succeeds but no eth ports are probed. Manual `apt install librte-net-ena25` was required. `install-external-deps.sh::install_dpdk_build_deps` should pull this in when `ANYSCAN_USE_DPDK=1` is set.
- `setup-dpdk.sh bind` refused active interfaces. The dpdk-devbind safety check ("Warning: routing table indicates that interface is active. Not modifying") forces the operator to manually `ip link set <ifc> down` on each NIC before bind succeeds. PR #81's `tools/setup-dpdk.sh` should down the iface itself or at least call out the requirement in its README.
- 1 GiB hugepages reserved without a hugetlbfs mount. `setup-dpdk.sh` reserved 8 × 1 GiB pages but didn't mount a 1G-pagesize hugetlbfs. EAL fell back to "no hugepages reported on node 0/1" until I manually `mount -t hugetlbfs -o pagesize=1G nodev /mnt/huge1g`.
Other deploy-path bugs surfaced (worked around)

- PR #79's `ANYSCAN_EC2_ASSOCIATE_PUBLIC_IP=true` + multi-ENI is rejected by AWS: `InvalidParameterCombination — The associatePublicIPAddress parameter cannot be specified when launching with multiple network interfaces.` AWS only honors `AssociatePublicIpAddress` on `NetworkInterfaces[]` when there is exactly one NetworkInterface entry. PR #79's multi-ENI public-IP path is non-functional on every multi-ENI launch. Workaround: launch without it, then `aws ec2 allocate-address && aws ec2 associate-address --network-interface-id <primary-eni>` post-launch.
- `/api/agent/install.sh?rebuild=false` served a stub scanner. The bundle the worker fetched at bootstrap (`agent-bundle-…20260428202928…`) had a 37-byte shell script in place of the scanner binary (`#!/bin/sh; echo cached-scanner-pfring`). The API's bundle-cache path does not honor `ANYSCAN_USE_AF_XDP` / `ANYSCAN_USE_DPDK` flags from the original `package-worker-bundle.sh` invocation; bundles served from the API are AF_PACKET-only stubs unless those env vars are set in the API's systemd EnvironmentFile. Manually `scp`'d the real scanner from the operator-built bundle as a workaround.
- AF_XDP and DPDK runtime probes (in `install-worker-bundle.sh`) returned `available=false` on c6in.metal Debian 13 + kernel 6.12.74. Probe message: "kernel <5.10 or libxdp.so missing; ANYSCAN_AF_XDP_AVAILABLE=false" — but the kernel is 6.12 and libxdp1 is in the package archive. The probe's kernel-version check is wrong (likely `cut -d. -f1`-style parsing failing on 6.12.74).
What the AWS PPS allowance numbers actually tell us
Reproducing the table in memory anyscan_aws_pps_allowance:
| Source | Aggregate PPS hit | pps_allowance_exceeded non-zero? | Verdict |
|---|---|---|---|
| anygpt-4 / anygpt-42 AF_PACKET 8-NIC | 7.49 M | (not captured) | unknown |
| anygpt-42 AF_XDP 8-NIC cap=4 t=8 (best) | 22.43 M | (not captured) | unknown |
| anygpt-48 AF_XDP 15-NIC | 12.18 M (regressed) | (not captured) | unknown |
| anygpt-52 AF_PACKET 8-NIC T=8 R=8 | 7.98 M | 0 — quota not hit | AWS allowance ≥ 8 M pps for c6in.metal ✓ |
| anygpt-52 AF_XDP | — | — | engine bug, see §"Why AF_XDP" |
| anygpt-52 DPDK 7-NIC | — | — | engine bug, see §"Why DPDK" |
So all we've confirmed empirically is AWS allows ≥ 8 M pps. The 22 M historic ceiling could still be either AWS- or engine-imposed. Memory anyscan_aws_pps_allowance (and the underlying AWS public docs) suggest the c6in family quota is on the order of low-tens-of-M; the next bench that pushes >10 M is the one that decides this question.
Follow-up tickets (please file before next bench)

- AF_XDP fall-back segfault — `anyscan-engine-c/src/send-afxdp.c::afxdp_try_bind()` must clean up XSK/UMEM/xdp_program on `xsk_socket__create()` failure before the next bind-mode attempt. Repro single-line above.
- DPDK hardcoded `nb_tx_desc=1024` — query `rte_eth_dev_info.tx_desc_lim.nb_max` per device.
- DPDK EAL argv splitter — passes `--socket-mem 1024` as `--socket-mem <next-token-which-is-actually-program-path>`.
- PR #79 multi-ENI public IP — top-level launch flag silently misuses the AWS API. Either (a) only allow on single-NIC launches, (b) emit a post-launch EIP allocate+associate, or (c) document that the operator must do the EIP step themselves.
- API bundle cache — `/api/agent/install.sh` serves a stub scanner unless `ANYSCAN_USE_AF_XDP=1 ANYSCAN_USE_DPDK=1` are in the api's systemd EnvironmentFile. Either pin the env vars there or hash the bundle's build flags into the cache key.
- `install-external-deps.sh` — must `apt install librte-net-ena25` when `ANYSCAN_USE_DPDK=1` (currently only installs `libdpdk-dev`).
- `tools/setup-dpdk.sh bind` — should `ip link set <ifc> down` automatically (or at least error with that hint instead of "interface active. Not modifying") and should mount 1 GiB hugetlbfs when reserving 1 GiB hugepages.
- AF_XDP install probe — kernel version check fails on 6.12.74. Ship a probe that handles 3-component versions and is a no-op when libxdp.so is present in `ldconfig -p`.
Cleanup

- `setup-dpdk.sh unbind` — restored 7 ENIs to ENA driver, freed hugepages. ✓
- `aws ec2 terminate-instances --instance-ids i-0b8fe00496163d227` — instance shutting-down. ✓
- `aws ec2 release-address --allocation-id eipalloc-0bf5d678f6ff3bc20` — EIP released. ✓
- `systemctl start anyscan-ec2-worker-manager.service` — watchdog back online (it'll re-launch its c6in.xlarge fleet). ✓
Cross-reference: memory anyscan_aws_pps_allowance (PR-comment-pointer to here updated), memory anyscan_afxdp_ena_constraint. Deploy proof: PR #70 issuecomment-4338757890.
— driven by anygpt-52, instance lifetime ≈ 1 h 12 min, total spend ≈ $7.
…ches (#82)

PR #79's ANYSCAN_EC2_ASSOCIATE_PUBLIC_IP=true on a multi-ENI launch is hard-rejected by AWS: InvalidParameterCombination — The associatePublicIPAddress parameter cannot be specified when launching with multiple network interfaces. AWS only honors AssociatePublicIpAddress on NetworkInterfaces[] when exactly one entry is supplied, even if the field appears only on the primary entry of a multi-NIC payload. The entire RunInstances call fails. Reported in PR #65 issuecomment-4339242358 (anygpt-52).

Fix: when len(NetworkInterfaces) > 1, suppress the field inline and allocate-address + associate-address on the primary ENI post-launch. The recreate path now also releases the previously-recorded EIP before terminating the old instance so we don't leak Elastic IPs on every recreate.

- build_network_interfaces only emits AssociatePublicIpAddress when the resulting payload is single-NIC (target_count == 1).
- Ec2WorkerManager._associate_public_ip_post_launch allocates an EIP (Domain=vpc), associates it with the primary ENI (DeviceIndex=0), records AllocationId/AssociationId in self.state. Allocate or associate failures are surfaced in eni_attach.public_ip but do not abort the recreate — the worker is still usable on private IPs.
- Ec2WorkerManager._release_recorded_eip disassociates and releases any previously-recorded EIP at the start of recreate_instance.

Tests:

- New: launch payload free of AssociatePublicIpAddress on multi-ENI; allocate_address + associate_address called post-launch with the primary ENI's NetworkInterfaceId; allocation_id persisted.
- New: AllocateAddress failure does not abort recreate.
- New: AssociateAddress failure still records AllocationId so the next recreate can release it.
- New: previously-recorded EIP is disassociated + released before terminating the old instance on the next recreate.
- Updated: prior tests that asserted the broken inline-flag behavior on multi-NIC now assert the field is suppressed everywhere.

Co-authored-by: skullcmd <skullcmd@anyvm.tech>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When the API process runs with ANYSCAN_USE_AF_XDP=1 (or any of the sibling ANYSCAN_USE_DPDK / ANYSCAN_USE_PFRING_ZC / ANYSCAN_INSTALL_KERNEL_BACKPORT knobs), package-worker-bundle.sh rebuilds the scanner with matching feature linkage. The resulting bundle carries a feature-flagged scanner binary even though the *installed* /opt/anyscan/bin/scanner stays the same.

`current_hosted_agent_bundle_source_fingerprint` previously only hashed the embedded asset payload and the installed binaries — the build-flag env vars were absent from the cache key. So a default-flags rebuild produced a fingerprint identical to a feature-flagged one and silently overwrote the cached AF_XDP/DPDK bundle. Operators bootstrapping via /api/agent/install.sh?rebuild=false then received an AF_PACKET-only stub scanner. Reported in PR #65 issuecomment-4339242358 (anygpt-52): "/api/agent/install.sh?rebuild=false served a stub scanner".

Fix: fold each documented build-flag env var name + value into the fingerprint hash. Bundles built with different flags now land in different cache slots; rebuild=false serves the bundle that matches the API's current build-flag environment instead of a stale one.

- BUNDLE_BUILD_FLAG_ENV_VARS pins the exact set so future ANYSCAN_USE_* knobs surface as a static-array compile-time decision.
- hash_bundle_build_flag_env_vars takes an env-lookup closure so unit tests can hash hermetic inputs without poking std::env (which would race with parallel test execution).
- bundle_build_flag_env_fingerprint is a #[cfg(test)] helper that produces just the build-flag contribution as a SHA-256 hex digest.

Tests:

- Default vs ANYSCAN_USE_AF_XDP=1 produce different fingerprints.
- Each flag flipped on its own produces a unique fingerprint (no collisions between AF_XDP-only and DPDK-only builds).
- Same flag with values "1" / "0" / unset are all distinct.
- Repeated lookup with same input returns same fingerprint.
- Static check that the four documented flags are all in the const.

Co-authored-by: skullcmd <skullcmd@anyvm.tech>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…DPDK (#84)

Debian DPDK 24.11.x ships every Poll-Mode Driver as its own package (librte-net-<vendor><abi>) instead of bundling them into libdpdk-dev. Without the relevant PMD installed, rte_eal_init() succeeds but no eth ports are probed and the scanner refuses to start. anygpt-52 hit this on c6in.metal: ENA NICs were silently absent from rte_eth_dev_count_avail() until librte-net-ena25 was apt-installed manually. Reported in PR #65 issuecomment-4339242358.

Fix: install_dpdk_build_deps now pulls librte-net-ena25 (AWS ENA PMD — every c6in/c5n/m5n/m6in instance) AND librte-net-mlx5-25 (Mellanox ConnectX-5/6 PMD for non-AWS bare-metal hosts at Equinix/OVH/Hetzner) alongside libdpdk-dev. The 25 ABI suffix matches Debian trixie's DPDK 24.11.x. Stock Intel ixgbe/i40e drivers are still in libdpdk-dev's auto-pull set so we don't need to name them.

Falls back to libdpdk-dev alone if PMDs are unavailable in the archive — better to ship a partial DPDK build than fail the install. The runtime warning makes it explicit so operators know to check rte_eth_dev_count_avail() if it returns 0 later.

Test: new Case 5 in test-install-external-deps-dpdk.sh runs the install script with ANYSCAN_INSTALL_DPDK_DEPS=true (default) + stubs id and apt-get on PATH, then asserts apt-get install was called with libdpdk-dev, librte-net-ena25, and librte-net-mlx5-25.

Co-authored-by: skullcmd <skullcmd@anyvm.tech>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…1g (#85)

Two operator-side speedbumps surfaced on c6in.metal during anygpt-52 (PR #65 issuecomment-4339242358):

1. dpdk-devbind refuses to bind active interfaces: "Warning: routing table indicates that interface is active. Not modifying." Operators had to `ip link set <ifc> down` on every NIC by hand before the bind step succeeded.
2. Reserving 1 GiB hugepages was not enough for rte_eal_init: "EAL: No available 1048576 kB hugepages reported on node 0/1". Operators had to `mount -t hugetlbfs -o pagesize=1G nodev /mnt/huge1g` themselves before the scanner could find the pages.

Fix:

- bdf_to_iface() walks /sys/bus/pci/devices/<bdf>/net/ to map BDF → kernel iface name. cmd_bind invokes `ip link set <ifc> down` on each target before invoking dpdk-devbind, with best-effort failure semantics (missing ip command, missing iface, already-down iface all proceed to the bind).
- ensure_hugetlbfs_mount() mounts a hugetlbfs of the matching pagesize at the configured path after a successful nr_hugepages reservation. Default targets are /mnt/huge1g (1 GiB) and /mnt/huge2m (2 MiB); ANYSCAN_DPDK_HUGEPAGES_1G_MOUNT / _2M_MOUNT override or set them to "" to opt out (operators provisioning hugetlbfs via fstab). Idempotent: detects existing hugetlbfs of the right pagesize via findmnt / /proc/mounts and skips remount.
- ANYSCAN_DPDK_LOAD_ONLY=1 hook lets test-setup-dpdk.sh source the script for hermetic helper testing without triggering the argv dispatch.

Tests (new tools/test-setup-dpdk.sh):

- cmd_bind invokes `ip link set <ifc> down` BEFORE dpdk-devbind --bind=vfio-pci (order verified by line numbers in a single cmd log).
- ensure_hugetlbfs_mount calls `mount -t hugetlbfs -o pagesize=1G nodev <path>`.
- ensure_hugetlbfs_mount is a no-op when target is already hugetlbfs with the matching pagesize.
- ensure_hugetlbfs_mount is a no-op with an empty mount path.
- bdf_to_iface returns iface for a populated fake /sys tree.
- bdf_to_iface returns empty when net/ dir is missing.

Co-authored-by: skullcmd <skullcmd@anyvm.tech>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
probe_afxdp_runtime_available reported "kernel <5.10 or libxdp.so missing" on c6in.metal Debian 13 + kernel 6.12.74 even though both prerequisites were satisfied. The previous parameter-expansion parser silently mishandled some 3-component release shapes; the failure mode reported in PR #65 issuecomment-4339242358 (anygpt-52) was a generic "false" with no indication of which check fired, leaving the operator to guess.

Fix:

- New parse_kernel_major_minor() helper uses awk -F'[.-]' so 3-component releases like 6.12.74-cloud-amd64, 5.10.0-13-amd64, 6.12.74+deb13+1-amd64, and 5.4.282-rt all parse cleanly. Returns "MAJOR MINOR" on stdout, "0 0" on parse failure.
- probe_afxdp_runtime_available emits a one-line stderr explanation whenever it returns "false" so the operator can immediately see which check fired ("kernel 4.19 < 5.10", "libxdp.so not in ldconfig -p", "could not parse kernel version"). Quiet on success.
- apply_afxdp_availability captures the probe stderr and includes the reason in its summary log line — replaces the previous hardcoded "kernel <5.10 or libxdp.so missing" that was wrong half the time.
- ANYSCAN_INSTALL_LOAD_ONLY=1 hook lets unit tests source the script for hermetic helper testing without triggering main().

Test (new tools/test-install-worker-bundle-afxdp-probe.sh, 21 cases):

- parse_kernel_major_minor across 8 release shapes (clean 3-component, +deb13 suffix, -cloud-amd64 suffix, -rt suffix, 4.x, 2-component, 1-component, empty).
- probe_afxdp_runtime_available with stubbed uname + ldconfig: c6in.metal 6.12.74 + libxdp → true (the bug 5 repro). Kernel 4.19 too old → false + stderr names the version. Kernel 5.9 vs 5.10 vs 5.11 boundary correctness. libxdp.so missing → false + stderr names the missing library. Empty/non-numeric uname → false + stderr names parse failure.

Co-authored-by: skullcmd <skullcmd@anyvm.tech>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 1 — Design + plan only. No scanner C code changes.
Phase 2 implementation is gated on explicit user/orchestrator approval after this plan PR merges.
Why
anygpt-4's c6in.metal multi-NIC bench hit 12.8 Mpps aggregate at 4 ENIs, gated by the host kernel TX/syscall path: a single `AF_PACKET` `PACKET_TX_RING` socket caps near 3 Mpps even with `PACKET_QDISC_BYPASS`. The Python adapter already documents this (vulnscanner-zmap-adapter.py:669). `AF_XDP` lets the ENA backplane (~100 Mpps theoretical on c6in.metal) be the actual bottleneck instead.

What this PR adds
A single new file: `plans/2026-04-27-portscan-afxdp-plan-v1.md` (407 lines).

The plan is comprehensive and mergeable as a reference doc — the user has it in-tree without committing to implementation.
Sections:
- … engine.c:165 hardcodes `sender_thread` and never invokes `pfring_zc_sender_thread`), per-NIC AF_XDP setup (XSK per `(NIC, queue_id)`, UMEM/ring sizing, ENA zero-copy quirks), build-system integration (`USE_AF_XDP=1` mirroring `USE_PFRING_ZC=1`), CLI plumbing (`--io-engine=` flag).
- … `CAP_NET_RAW` already present, `CAP_BPF` needs to be added to systemd).
- … amzn-drivers#221), libxdp version skew, ZC lower-half-channel constraint, AIMD ceiling coordination with anygpt-33.
- … `AnyVM-Tech/anyscan-engine-c` fork), libnuma optionality, `SO_PREFER_BUSY_POLL`, AIMD coordination with anygpt-33.
src/send-afxdp.c(new)src/recv-afxdp.c(new)include/xdp-defs.h(new)src/engine.cdispatch refactor (modify)include/scanner_defs.h,scanner.hsrc/conf.cCLI plumbing (modify)Makefile(modify)In line with the brief's 450-500 ballpark; the extra ~80 covers the
engine.cdispatch refactor that's prerequisite for AF_XDP and incidentally fixes the never-invoked PF_RING ZC path.Coordination
anyscan_rate_controller.py(anygpt-33 owns it)./etc/anyscan/runtime.envor anything ops-owned.Verification
cargo build --workspace: clean (only pre-existing dead-code warnings onanyscan-api.rs).cargo test --workspace: 437 passed, 0 failed, 4 ignored — matches the brief's baseline expectation.This is a doc-only change, so the build/test verification is just confirming no accidental damage.
Out of scope (explicit)
runtime.env.Reviewer ask
Please verify the plan is comprehensive enough that a Phase 2 worker can execute task-by-task without needing additional context, and call out any architectural choice that should be re-litigated before Phase 2 begins (especially §9.1 — where the C changes physically live).
🤖 Generated with Claude Code