docs(plans): AF_XDP integration plan for higher pps (Phase 1) #65

Merged
skullcrushercmd merged 2 commits into main from perf/portscan-afxdp-plan
Apr 27, 2026
Conversation

@skullcrushercmd Contributor

Phase 1 — Design + plan only. No scanner C code changes.

Phase 2 implementation is gated on explicit user/orchestrator approval after this plan PR merges.

Why

anygpt-4's c6in.metal multi-NIC bench hit 12.8 Mpps aggregate at 4 ENIs, gated by the host kernel TX/syscall path: a single AF_PACKET PACKET_TX_RING socket caps near 3 Mpps even with PACKET_QDISC_BYPASS. The Python adapter already documents this (vulnscanner-zmap-adapter.py:669).
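For context, this is the path being displaced: a minimal sketch of an AF_PACKET TX-ring sender with the qdisc bypass enabled (illustrative only, not the scanner's actual send path; block and frame sizes are made up):

```c
/* Sketch of an AF_PACKET PACKET_TX_RING sender with PACKET_QDISC_BYPASS.
 * Illustrative only -- not the scanner's actual code; sizes are arbitrary. */
#include <linux/if_packet.h>
#include <sys/mman.h>
#include <sys/socket.h>

static int open_af_packet_tx(int ifindex)
{
    int fd = socket(AF_PACKET, SOCK_RAW, 0);   /* proto 0: TX-only, no RX delivery */
    if (fd < 0)
        return -1;

    int one = 1;
    /* Skip the qdisc on TX (Linux >= 3.14). Helps, but the per-socket
     * sendto()-kick path still tops out around a few Mpps. */
    setsockopt(fd, SOL_PACKET, PACKET_QDISC_BYPASS, &one, sizeof(one));

    struct tpacket_req req = {
        .tp_block_size = 1 << 22,                    /* 4 MiB blocks            */
        .tp_block_nr   = 64,
        .tp_frame_size = 2048,
        .tp_frame_nr   = ((1 << 22) / 2048) * 64,    /* frames must fill blocks */
    };
    setsockopt(fd, SOL_PACKET, PACKET_TX_RING, &req, sizeof(req));

    struct sockaddr_ll addr = { .sll_family = AF_PACKET, .sll_ifindex = ifindex };
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));

    /* Frames are written into the mmap'd ring; each batch is flushed with
     * sendto(fd, NULL, 0, 0, NULL, 0) -- the syscall kick this plan avoids. */
    void *ring = mmap(NULL, (size_t)req.tp_block_size * req.tp_block_nr,
                      PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    return ring == MAP_FAILED ? -1 : fd;
}
```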

AF_XDP lets the ENA backplane (~100 Mpps theoretical on c6in.metal) be the actual bottleneck instead.

What this PR adds

A single new file: plans/2026-04-27-portscan-afxdp-plan-v1.md (407 lines).

The plan is comprehensive and mergeable as a reference doc — the user has it in-tree without committing to implementation.

Sections:

  • §2: Walk-through of the existing scanner I/O paths (AF_PACKET sender, half-wired PF_RING ZC) with concrete file:line citations.
  • §3: Architecture diagram, file layout, dispatch refactor (resolves the pre-existing wart where engine.c:165 hardcodes sender_thread and never invokes pfring_zc_sender_thread; see the dispatch-table sketch after this list), per-NIC AF_XDP setup (XSK per (NIC, queue_id), UMEM/ring sizing, ENA zero-copy quirks), build-system integration (USE_AF_XDP=1 mirroring USE_PFRING_ZC=1), CLI plumbing (--io-engine= flag).
  • §4: Dependency surface (libxdp-dev, libbpf-dev), Ubuntu 22.04 vs 24.04 caveats, runtime probe sequence, capability requirements (CAP_NET_RAW already present, CAP_BPF needs to be added to systemd).
  • §5: Test plan — synthetic veth loopback, unit-style harness, live c6in.metal bench, AF_PACKET regression.
  • §6: Risk register including ENA driver-reset history (amzn-drivers#221), libxdp version skew, ZC lower-half-channel constraint, AIMD ceiling coordination with anygpt-33.
  • §7: Effort estimate — 6-8 days over four small Phase 2 PRs.
  • §8: Rollout plan — feature-flag default off, manual canary on c6in.metal first, gradual default-on per tier.
  • §9: Open questions (deliberately not blocking this PR): where the C changes live (upstream PR vs AnyVM-Tech/anyscan-engine-c fork), libnuma optionality, SO_PREFER_BUSY_POLL, AIMD coordination with anygpt-33.
  • §10: Reference index — kernel docs, libxdp API, xdp-tutorial, Suricata AF_XDP, Cloudflare postmortems, amzn-drivers issues.
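For reference, a minimal sketch of the dispatch-table shape §3 proposes. The io_engine_vtable_t name comes from the plan; the other identifiers below are illustrative, not merged engine code:

```c
/* Sketch of the §3 io_engine dispatch table -- illustrative, not engine.c.
 * Thread entry points follow pthread's void *(*)(void *) shape. */
#include <stddef.h>
#include <string.h>

typedef struct io_engine_vtable {
    const char *name;                    /* matches the --io-engine= CLI value    */
    void *(*sender_thread)(void *arg);   /* per-NIC TX loop                       */
    void *(*receiver_thread)(void *arg); /* RX/dedup loop, NULL for TX-only paths */
} io_engine_vtable_t;

/* Existing and planned entry points; the optional engines are only compiled in
 * when USE_AF_XDP=1 / USE_PFRING_ZC=1 are set at build time. */
void *sender_thread(void *);             /* legacy AF_PACKET path */
void *recv_thread(void *);
#ifdef USE_PFRING_ZC
void *pfring_zc_sender_thread(void *);
#endif
#ifdef USE_AF_XDP
void *afxdp_sender_thread(void *);
#endif

static const io_engine_vtable_t io_engines[] = {
    { "af_packet", sender_thread, recv_thread },
#ifdef USE_PFRING_ZC
    { "pfring_zc", pfring_zc_sender_thread, NULL },
#endif
#ifdef USE_AF_XDP
    { "af_xdp", afxdp_sender_thread, NULL },
#endif
};

/* engine.c would look the engine up once at startup instead of hardcoding
 * sender_thread, which is what currently strands the PF_RING ZC path. */
static const io_engine_vtable_t *io_engine_lookup(const char *name)
{
    for (size_t i = 0; i < sizeof(io_engines) / sizeof(io_engines[0]); i++)
        if (strcmp(io_engines[i].name, name) == 0)
            return &io_engines[i];
    return NULL;
}
```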

LOC estimate

| Component | Est. LOC (C) |
| --- | --- |
| src/send-afxdp.c (new) | ~280 |
| src/recv-afxdp.c (new) | ~120 |
| include/xdp-defs.h (new) | ~60 |
| src/engine.c dispatch refactor (modify) | ~50 |
| include/scanner_defs.h, scanner.h | ~30 |
| src/conf.c CLI plumbing (modify) | ~30 |
| Makefile (modify) | ~10 |
| Total | ~580 |

In line with the brief's 450-500 ballpark; the extra ~80 covers the engine.c dispatch refactor that is a prerequisite for AF_XDP and incidentally fixes the never-invoked PF_RING ZC path.

Coordination

  • ✅ Stayed out of anyscan_rate_controller.py (anygpt-33 owns it).
  • ✅ Did not touch /etc/anyscan/runtime.env or anything ops-owned.
  • ✅ Did not touch the AnyGPT submodule pointer.
  • §6 risk register flags one item that needs anygpt-33 coordination during Phase 2 (AIMD ceiling parameter).

Verification

  • cargo build --workspace: clean (only pre-existing dead-code warnings on anyscan-api.rs).
  • cargo test --workspace: 437 passed, 0 failed, 4 ignored — matches the brief's baseline expectation.

This is a doc-only change, so the build/test verification is just confirming no accidental damage.

Out of scope (explicit)

  • Writing the AF_XDP C code (Phase 2).
  • Implementing the upstream fork decision (open question §9.1).
  • Bumping the AnyGPT submodule pointer.
  • Editing prod systemd units or runtime.env.

Reviewer ask

Please verify the plan is comprehensive enough that a Phase 2 worker can execute task-by-task without needing additional context, and call out any architectural choice that should be re-litigated before Phase 2 begins (especially §9.1 — where the C changes physically live).

🤖 Generated with Claude Code

Comprehensive design + dependency + LOC + test + risk + rollout plan for
adding an AF_XDP I/O path to the bundled C scanner. No scanner code is
changed; Phase 2 implementation is gated on user approval.

Motivation: anygpt-4 c6in.metal 4-NIC bench hit 12.8 Mpps aggregate,
gated by the AF_PACKET TX/syscall path (single socket caps ~3 Mpps even
with PACKET_QDISC_BYPASS). AF_XDP lets the ENA backplane (~100 Mpps
theoretical on c6in.metal) be the actual bottleneck.

Design highlights:
- New send-afxdp.c / recv-afxdp.c slot into the same shape as the
  existing USE_PFRING_ZC build flag (template files in upstream).
- Plan also resolves a pre-existing dispatch wart: engine.c hardcodes
  sender_thread, so the existing PF_RING ZC files compile but never
  run. Plan introduces a small io_engine_vtable_t that wires both
  paths cleanly.
- Per-NIC: one XSK socket per (ENI, queue_id), TX-only (rx=NULL),
  XDP_USE_NEED_WAKEUP, native zero-copy on ENA where channel index
  permits, SKB-mode fallback otherwise.
- Build: USE_AF_XDP=1 mirrors USE_PFRING_ZC=1; runtime opt-in via new
  --io-engine={af_packet,af_xdp,pfring_zc} flag (default af_packet).
- Estimated ~580 LOC C (450 net new, 130 in modified files), 6-8 days
  of implementation effort over four small PRs.

Phase 2 is explicitly out of scope for this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector (Bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e93252435f



```bash
apt-get install -y --no-install-recommends \
  libxdp1 libbpf1 libelf1 libz1
```

P2: Use installable runtime package names

The runtime install command includes libelf1 and libz1, but on Ubuntu 24.04 (the baseline this plan recommends in §4.1) those package names are not available (libelf1t64/zlib1g are). If this snippet is followed in Phase 2 bootstrap scripts, apt-get install will fail and block AF_XDP rollout on the target AMI, so the package list should be corrected or made distro-conditional.


@skullcrushercmd Contributor Author

Good catch — this was a real bug that would have broken Phase 2 bootstrap on the recommended Ubuntu 24.04 baseline. Fixed in af90ff0.

Specifically:

  • libz1 → zlib1g (the codex callout is right; libz1 does not exist on any Debian-family distro I can find).
  • libelf1 → libelf1t64 on Noble, retained as libelf1 for Jammy/bookworm (per the Ubuntu 64-bit time_t transition).
  • libxdp1 and libbpf1 are unchanged across Jammy/Noble/bookworm — verified they're not on the t64 rename list.

§4.2 now provides two separate apt-get install lines (Noble baseline + older-LTS fallback) and a note on the t64 rename so Phase 2 doesn't re-discover this. Phase 2 will pick the right line based on lsb_release -rs at install time rather than copy-pasting blindly.

The runtime install snippet in §4.2 used `libelf1 libz1`, but on Ubuntu
24.04 (the baseline §4.1 recommends) `libelf1` was renamed to
`libelf1t64` as part of the 64-bit time_t transition, and `libz1` was
never a real Debian-family package name (the runtime zlib package is
`zlib1g`). Phase 2 bootstrap scripts copying this snippet verbatim
would have hit `apt-get install: Unable to locate package`.

Provides separate, correct package lists for the Ubuntu 24.04 baseline
and the Ubuntu 22.04 / Debian bookworm fallback, plus a note on the
t64 rename so Phase 2 doesn't re-discover this.

Addresses chatgpt-codex-connector review on PR #65.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@skullcrushercmd merged commit 551d1f4 into main Apr 27, 2026
@skullcrushercmd deleted the perf/portscan-afxdp-plan branch April 27, 2026 19:08

@chatgpt-codex-connector (Bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: af90ff00e5


Comment on lines +171 to +173
| `bind_flags` | `XDP_USE_NEED_WAKEUP` | Standard kernel-side config; `xsk_ring_prod__needs_wakeup` decides when to `sendto`. |
| `xdp_flags` | `XDP_FLAGS_DRV_MODE` `\| XDP_ZEROCOPY` | Native zero-copy on ENA (driver supports it; see §3.5). Falls back to `XDP_FLAGS_SKB_MODE` (generic) if ENA refuses ZC for this queue. |



P1: Move zero-copy flag to AF_XDP bind flags

This plan puts XDP_ZEROCOPY under xdp_flags, but in AF_XDP the zero-copy/copy mode is selected via socket bind flags (sxdp_flags), not XDP program-attach flags. If Phase 2 implements this literally (the same placement is echoed later in the fallback probe section), AF_XDP setup can fail on otherwise supported hosts because the wrong flag set is passed at socket/program setup time. Keep xdp_flags for driver/SKB attach-mode selection and put the zero-copy policy in bind_flags (XDP_ZEROCOPY or XDP_COPY).
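For illustration, a minimal xsk_socket__create() call with the flags placed the way this comment suggests. The if_xdp/if_link constants and the libxdp call are real; the surrounding variable names are placeholders, not the plan's code:

```c
/* Sketch of the suggested flag placement: zero-copy policy in bind_flags,
 * xdp_flags only selecting how the XDP program is attached. Placeholder names. */
#include <linux/if_link.h>   /* XDP_FLAGS_DRV_MODE, XDP_FLAGS_SKB_MODE */
#include <linux/if_xdp.h>    /* XDP_ZEROCOPY, XDP_COPY, XDP_USE_NEED_WAKEUP */
#include <xdp/xsk.h>         /* libxdp: xsk_socket__create, xsk_socket_config */

static int try_bind_zerocopy(struct xsk_umem *umem, const char *ifname,
                             unsigned int queue_id, struct xsk_ring_prod *tx,
                             struct xsk_socket **out)
{
    struct xsk_socket_config cfg = {
        .rx_size    = 0,                              /* TX-only socket           */
        .tx_size    = XSK_RING_PROD__DEFAULT_NUM_DESCS,
        .xdp_flags  = XDP_FLAGS_DRV_MODE,             /* program attach mode only */
        .bind_flags = XDP_USE_NEED_WAKEUP | XDP_ZEROCOPY, /* ZC policy lives here */
    };

    /* rx == NULL for a TX-only socket. If the driver refuses zero-copy this
     * returns an error and the caller retries with XDP_COPY, then SKB mode. */
    return xsk_socket__create(out, ifname, queue_id, umem, NULL, tx, &cfg);
}
```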


@skullcrushercmd Contributor Author

Deployed to prod ✅

Prod redeploy

Deployed `2026-04-27 19:11 UTC`
Source HEAD `origin/main` @ `551d1f48` (covers #65, #66)
Build `cargo build --release --locked --bin anyscan-api --bin anyscan-worker` → 1m 4s
anyscan-api `1f3af11f…` → `00b4b83b…` (PID 3236925)
anyscan-worker `3f77e19e…` (unchanged — PR #66 doesn't touch worker rust source)
Old binaries preserved at `/opt/anyscan/bin/anyscan-{api,worker}.pre-pr66-deploy.bak`
Public site `HTTP 200 | 10ms | 61107b` ✓
Wedge-sweep janitor startup line confirmed

The api binary sha changed because PR #66's edits to anyscan_rate_controller.py, vulnscanner-zmap-adapter.py, and runtime.worker.env.template flow into the api binary via include_bytes! in HOSTED_AGENT_BUNDLE_ASSETS. Asset audit clean (no install-line-vs-asset-list mismatch this round either).

Fresh bundle: agent-bundle-linux-x86_64__20260427191153-3236925-5dd517c87d76.tar.gz

Size 17162815 bytes
Fingerprint 5dd517c87d76

Required content all confirmed in tar -tzf:

  • extensions/anyscan_rate_controller.py
  • extensions/portscan-adapter.py
  • env/runtime.env.template
  • bin/tune-scanner-host.sh
  • bin/reserve-control-bandwidth.sh

PR #66 plumbing verified inside the bundle

# anyscan_rate_controller.py:180-187
cpu_pressure = cpu_saturated and heartbeat_slip
if not cpu_pressure and not network_pressure:
    …
if cpu_pressure and not network_pressure:
    …                       # local CPU starvation — don't rate-cut
if network_pressure and not cpu_pressure:
    …                       # genuine network slip — rate-cut

Plus # survives even partial windows in the calibration writer (line 838) and ANYSCAN_RATE_MAX_CONCURRENT_SUBPROCESSES referenced in portscan-adapter.py (lines 47, 846) and runtime.env.template:76.

Bundle endpoint serves the freshly-built artifact

$ curl -fsSL "https://scan.anyvm.tech/api/agent/install.sh?rebuild=false&platform=linux-x86_64" | grep BUNDLE_NAME
BUNDLE_NAME='agent-bundle-linux-x86_64__20260427191214-3236925-5dd517c87d76.tar.gz'

Worker remote-update — one alive worker

The auto-recreated fleet worker (anyscan-ec2-worker, i-0b94844f5ace75d28 at 44.203.214.161) was alive and already running a post-#66 bundle from its fresh bootstrap. Remote-update fired against it cleanly:

| | Pre | Post |
| --- | --- | --- |
| agentd sha | a786750834… | a786750834… (same — PR #66 didn't touch worker source) |
| AGENT_BUNDLE_NAME | …191248-…5dd517c87d76 | …191309-…5dd517c87d76 |
| Service | active | active |

ANYSCAN_RATE_MAX_CONCURRENT_SUBPROCESSES=4 confirmed in /etc/agentd/runtime.env — PR #66's install-time default fired correctly. So the next 8-NIC metal launch will only run 4 shards by default, exactly as the deploy note said.

Note for the next bench cycle

When the user authorizes another c6in.metal launch and an 8-shard CPU-pressure handling test, the operator can override:

echo 'ANYSCAN_RATE_MAX_CONCURRENT_SUBPROCESSES=8' >> /etc/agentd/runtime.env
systemctl restart agentd

…then re-run the same bench shape to confirm the CPU-vs-network slip distinction handles the regressed case from the prior bench (8-NIC at 1.34M aggregate). Expectation: AIMD's cpu_pressure branch should not rate-cut on heartbeat lag when CPU is the cause, so per-NIC pps shouldn't collapse to 167k.

Out of scope per spec

skullcrushercmd added a commit that referenced this pull request Apr 27, 2026
…rk (#67)

Phase 2 PR 1 of 4 of the AF_XDP integration plan (PR #65 §9.1) ships a
refactor of the scanner C source (engine.c dispatch table + --io-engine
CLI flag + PF_RING ZC dispatch fix) which lives in a fork of the third-party
upstream scanner repository:

  - Upstream:        github.com/Lorikazzzz/VulnScanner-zmap-alternative-
  - Fork:            github.com/AnyVM-Tech/anyscan-engine-c
  - Phase 2 PR 1 commit on the fork:
      AnyVM-Tech/anyscan-engine-c@998c66b on
      branch perf/portscan-afxdp-phase2-pr1

Why fork: the plan §9.1 calls out that the upstream scanner is third-party
and proposes a fork under AnyVM-Tech as the resting place for the
integration patches (AF_XDP send/receive paths in PRs 2 + 3, build
integration in PR 4, and follow-on PF_RING ZC cluster init).

This commit only updates the AnyScan-side scripts to resolve from the new
fork:

  - install-external-deps.sh:11-12 — clone URL and local checkout dir now
    default to the AnyVM-Tech fork. Both can still be overridden via the
    existing ANYSCAN_VULNSCANNER_REPO_URL / ANYSCAN_VULNSCANNER_REPO_DIR
    environment variables (no behaviour change for callers that set them).
  - package-worker-bundle.sh:519-525 — preferred lookup order is now
    `anyscan-engine-c/scanner` first, the legacy
    `VulnScanner-zmap-alternative-/scanner` directory second (kept for
    transitional dev checkouts), and `/opt/anyscan/bin/scanner` last.

What is NOT in this PR:
  - The actual AF_XDP send/receive paths (PR 2 + 3 of Phase 2).
  - The Makefile / install-external-deps.sh `USE_AF_XDP=1` build flag
    plumbing (PR 4 of Phase 2).
  - Live c6in.metal benchmarks (PR 5 of Phase 2).
  - AnyGPT submodule pointer bump.
  - Any change to runtime.env or to the AIMD rate controller.

Test plan:
  - `cargo build --workspace` (release) — clean.
  - `cargo test --workspace --no-fail-fast` — 437 tests pass (matches
    post-#66 baseline: 371 + 31 + 2 + 33).
  - `python3 -m py_compile vulnscanner-zmap-adapter.py` — clean.
  - On the scanner fork:
      - `make` (default AF_PACKET) — builds.
      - `make test` — 11 dispatch smoke tests pass.
      - `gcc -fsyntax-only -DUSE_PFRING_ZC ...` — compiles, dispatch reaches
        the ZC thread bodies.
      - `./scanner --io-engine=af_xdp` exits 1 with a clear "USE_AF_XDP=1
        not set; AF_XDP send/receive paths land in PRs 2 + 3" message.
      - `./scanner --io-engine=pfring_zc` (without USE_PFRING_ZC) exits 1
        with the equivalent compile-flag error.
      - `./scanner --io-engine=bogus` exits 1 with "Unknown --io-engine".

Refs: AnyVM-Tech/AnyScan PR #65, plan §3.1 + §3.3 + §9.1.

Co-authored-by: AnyVM-Tech AO <agent@anyvm.tech>
@skullcrushercmd Contributor Author

Phase 2 — c6in.metal live bench (anygpt-42)

Live bench on freshly-deployed AF_XDP build (PR #71 + engine-c PR #3) on c6in.metal (128 vCPU, 8 ENIs). Driven by anygpt-42; replaces the wedged anygpt-4. Engine commit f1288d6, AnyScan commit 989c44e, scanner: /opt/agentd/bin/scanner (60,992 B stripped, libxdp.so.1 / libbpf.so.1 / libelf.so.1 linked).

Headline

| Config | Aggregate peak | Aggregate avg | vs prior baseline |
| --- | --- | --- | --- |
| AF_PACKET 8-NIC, threads=8 | 7.49 M pps | | 0.87× baseline 8.58 M (regression check) |
| AF_XDP 1-NIC ens1, threads=4 | 6.40 M | 5.83 M | 2.02× AF_PACKET 1-NIC 3.16 M |
| AF_XDP 8-NIC, cap=4, threads=8 | 22.43 M | 19.20 M | 2.66× AF_PACKET 8-NIC baseline ✅ |

Per-config wall + tx_dropped:

| Config | Wall | Per-NIC peak (pps) | tx_dropped | Notes |
| --- | --- | --- | --- | --- |
| AF_PACKET 8-NIC, t=8, cooldown=2 | 17.92 s | ~936 K | 0 | Counters via /sys/class/net/.../tx_packets (kernel TX ring) |
| AF_XDP 1-NIC ens1, t=4, c=2 | 25.54 s | 6.40 M | 0 | drv+copy mode (zerocopy Operation not supported on ENA at this kernel) |
| AF_XDP 8-NIC cap=4, t=4 | 29.02 s | ~1.55 M | 0 | Cap=4 design: 4 simultaneous scanners, each 4 sender threads = 16 active threads |
| AF_XDP 8-NIC cap=4, t=8 | 20.26 s | ~2.80 M | 0 | Best — 32 active threads, all within 128-core capacity |
| AF_XDP 8-NIC cap=4, t=16 (combined=16) | 21.46 s | ~2.85 M | 0 | No further gain — bottleneck is per-NIC, not thread count |
| AF_XDP 8-NIC cap=8, t=4 | 26.28 s | ~805 K | 0 | Regression — 32 sockets fight for memory bandwidth, cap=4 is the sweet spot |

Live vs synthetic projections (PR #65 §10)

| Comparison | Synthetic projection | Live | Verdict |
| --- | --- | --- | --- |
| AF_XDP single-NIC speedup over AF_PACKET 1-NIC | ~10–12 M (3.5× baseline) | 6.40 M (2.02× baseline) | Lower than projected; ENA forces drv+copy (not driver-mode zerocopy), per-thread copy budget caps throughput |
| AF_XDP 8-NIC cap=4 aggregate | 30–50 M (toward 14 M / ENI ENA spec) | 22.43 M peak / 19.2 M avg | Below projection but ~2.66× AF_PACKET 8-NIC baseline; ENA xdp drv+copy per-NIC ceiling appears to be ~3 M, not 14 M |
| AF_PACKET 8-NIC regression check | 8.58 M baseline holds | 7.49 M (87 %) | Within ~13 % of baseline; jitter likely from cooldown-time=2 per-shard tail rather than algorithmic regression |

Setup notes (operational findings the plan should fold in)

  1. MTU constraint: ENA driver rejects XDP attach at MTU=9001 (jumbo frames default). Bench had to lower all 8 ENIs to MTU=3498 (ip link set dev <iface> mtu 3498). Plan §6 should add this as a worker-bench host-prep step.
  2. Queue space: ENA also rejects XDP attach when combined-queue count is at hardware max (32 on c6in.metal). Setting ethtool -L <iface> combined 8 (or 16) freed up XDP TX queue slots. Plan §6 should add this too.
  3. Mode ladder: scanner correctly walks drv+zerocopy → drv+copy → skb. ENA on kernel 6.12.74 supports only drv+copy; zerocopy was tested across all 8 NICs and rejected with Operation not supported. Driver-side zerocopy patches in newer kernels (6.16+ ena_xdp_zc, in-flight upstream) would close the projection gap.
  4. /sys/class/net/.../statistics/tx_packets does NOT count XDP TX on ENA — counters appeared as 0 (or ~negative due to clock skew) for AF_XDP runs. Bench harness switched to scanner self-reported pps for AF_XDP rows. PR #65 §11 (telemetry) should call out this counter caveat.
  5. PACKET_FANOUT errno 22 on AF_PACKET path — visible in scanner stderr; affects RX dedup only, doesn't affect TX bench numbers but should be investigated separately.
  6. Cap=8 regression is real and worth an instrumented note in the plan: AF_XDP umem (16 MiB × 32 sockets = 512 MiB resident) plus simultaneous send-thread descriptor churn evicts cache lines; the rate-controller should keep cap=4 as the multi-NIC default on ≤128-core hosts. (A sketch of the per-socket UMEM sizing follows this list.)
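For reference, a minimal sketch of how a 16 MiB, 2048-byte-frame UMEM like the one sized above would be set up with libxdp. The xsk_umem__create call and config fields are real; the helper shape and ring sizes are illustrative, taken from the bench log rather than the engine source:

```c
/* Sketch only: one 16 MiB UMEM (8192 frames x 2048 B) per XSK socket, matching
 * the "umem=16 MiB ... frames x 2048 B" figure above. Not the engine's code. */
#include <stdlib.h>
#include <xdp/xsk.h>

#define FRAME_SIZE 2048
#define NUM_FRAMES 8192            /* 8192 * 2048 B = 16 MiB resident per socket */

static struct xsk_umem *make_umem(struct xsk_ring_prod *fill,
                                  struct xsk_ring_cons *comp)
{
    void *buf = NULL;
    size_t len = (size_t)NUM_FRAMES * FRAME_SIZE;

    /* UMEM memory must be page-aligned. */
    if (posix_memalign(&buf, 4096, len) != 0)
        return NULL;

    struct xsk_umem_config cfg = {
        .fill_size      = 4096,    /* ring sizes as reported in the bench log */
        .comp_size      = 4096,
        .frame_size     = FRAME_SIZE,
        .frame_headroom = 0,
    };

    struct xsk_umem *umem = NULL;
    if (xsk_umem__create(&umem, buf, len, fill, comp, &cfg) != 0) {
        free(buf);
        return NULL;
    }
    return umem;
}
```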

Bench harness

Custom bash harness (not via the AnyScan API/adapter path) — direct /opt/agentd/bin/scanner invocations, one subprocess per ENI, with --shards i/N sharding the target range. Captures /sys/class/net/<iface>/statistics/{tx_packets,tx_dropped} pre/post for AF_PACKET, and parses the scanner's own Mp/s avg self-reports for AF_XDP. Target 198.18.0.0/15 (RFC2544 benchmark range, IGW-dropped sink) ports 1-1024 = 134 M probes total per run, sharded 1/8 each. Cooldown 2 s. Rate cap 99999999 (effectively uncapped). All runs returned tx_dropped=0 — no kernel TX-ring overflow. Logs preserved in /tmp/anygpt-42-bench/ on the metal until termination.

Out of scope (per task instructions)

  • No api/adapter-driven bench (rate-controller per-window classification, heartbeat_jitter not captured) — direct scanner invocation traded these for a smaller measurement window. Future bench should drive via port_scan API on metal-afxdp-bench-1 worker so the adapter's classifier emits its histogram.
  • No AnyGPT submodule bump.
  • No /0 or production-scope scan.

Cleanup

  • c6in.metal i-0958c76a9ba1a0483 — terminated.
  • 7 secondary ENIs (eni-06c639cf.., eni-0b6da590.., eni-0ad37ebf.., eni-098e0e96.., eni-0815d240.., eni-04459a62.., eni-065de86f..) — deleted post-termination.
  • anyscan-ec2-worker-manager.service — restarted (will respawn the standard xlarge fleet from the AF_XDP bundle published in the PR #70 deploy).

@skullcrushercmd Contributor Author

Phase 2 — c6in.metal 15-NIC live bench (anygpt-48): adding NICs and a 6.19 kernel fail to close the gap

Live follow-on to anygpt-42's 8-NIC bench (issuecomment-4336192354). Same 198.18.0.0/15 × ports 1-1024 = 134 M-probe target; same custom bash harness (one scanner subprocess per ENI, --shards i/N). c6in.metal launched via tools/ec2_worker_manager.py once with ANYSCAN_EC2_INSTANCE_TYPE=c6in.metal ANYSCAN_MAX_ENIS=15 after stopping anyscan-ec2-worker-manager.service and terminating the existing xlarge fleet.

TL;DR

The 30–50 M-pps synthetic projection from this PR's plan §10 still does not hold on AWS c6in.metal in 2026-04. PR #73 (kernel backport opt-in) and PR #74 (15-ENI launch path) wire the knobs cleanly, but the underlying premises — "kernel 6.16+ unlocks ena_xdp_zc" and "more ENIs unlock more PCIe trees" — both fail in production for the reasons documented inline below. The 8-NIC cap=4 t=8 22.43 M-peak number from anygpt-42 remains the c6in.metal AF_XDP ceiling.

Headline

| Config | Aggregate peak | Aggregate avg | vs anygpt-42 8-NIC baseline |
| --- | --- | --- | --- |
| AF_PACKET 8-NIC, T=8 (anygpt-42) | 7.49 M | — | (the baseline) |
| AF_XDP 8-NIC cap=4, T=R=8 (anygpt-42, best) | 22.43 M | 19.20 M | 2.66× the AF_PACKET baseline ✅ |
| AF_PACKET 15-NIC, T=8 (anygpt-48) | 12.96 M | 7.69 M | 1.03× AF_PACKET 8-NIC — flat regression-check |
| AF_XDP 15-NIC, T=R=4 (apples-to-apples thread budget) | 8.56 M | 5.18 M | 0.27× the 22.43 M peak — sharp regression ❌ |
| AF_XDP 15-NIC, T=R=8 (matching anygpt-42 per-NIC config) | 12.18 M | 8.65 M | 0.54× peak / 0.45× avg — still regressed ❌ |
| AF_XDP 15-NIC drv+zerocopy (Bench C) | UNRUNNABLE | | ena_xdp_zc still not upstream as of Linux 6.19.11 |

Bench harness wall + tx_dropped

| Config | Wall | Per-NIC peak | tx_dropped | Notes |
| --- | --- | --- | --- | --- |
| AF_PACKET 15-NIC, T=8, c=2 | 19.05 s | ~469 K | 0 | /sys/class/net/<if>/statistics/tx_packets deltas; perfectly balanced 8.95 M packets per NIC across all 15 (sharding good) |
| AF_XDP 15-NIC, T=R=4, c=2 | 29.08 s | 0.42–0.78 M | n/a | drv+copy fallback; clear card asymmetry — card-1 NICs 0.66–0.78 M peak, card-0 NICs 0.42–0.44 M |
| AF_XDP 15-NIC, T=R=8, c=2 | 18.46 s | 0.72–1.04 M | n/a | drv+copy fallback; 240 active threads on 128-core (oversubscription); per-NIC peak ~3.9× lower than anygpt-42's 2.80 M @ 8-NIC |

Why it regressed: three live findings on top of the prior anyscan_afxdp_ena_constraint memory

1. PR #74's c6in.metal NetworkCard fixture is incorrect

The PR's test fixture and docstrings claim:

NetworkCards = [
  {NetworkCardIndex: 0, MaximumNetworkInterfaces: 5},  (primary)
  {NetworkCardIndex: 1, MaximumNetworkInterfaces: 4},
  {NetworkCardIndex: 2, MaximumNetworkInterfaces: 3},
  {NetworkCardIndex: 3, MaximumNetworkInterfaces: 3},
]

Live aws ec2 describe-instance-types --instance-types c6in.metal --region us-east-1:

TopLevel MaximumNetworkInterfaces = 16
NetworkCards:
  card 0: max_nics=8, perf=Up to 170 Gigabit
  card 1: max_nics=8, perf=Up to 170 Gigabit
total via cards = 16

So the live launch placed 15 ENIs as 8/7 across 2 cards, not 5/4/3/3 across 4. PR #74's tools/test_ec2_worker_manager.py::test_max_enis_15_on_c6in_metal_emits_15_network_interfaces (and the docs at tools/ec2_worker_manager.py:121-125) were authored against a synthetic mock that doesn't match AWS's reality. The 40-test unit suite passes against the mock; the launch-path code itself is fine — it's the verification that's mocking the wrong shape. Suggest: refresh the fixture to 2 cards × 8 + add an integration test that asserts against aws ec2 describe-instance-types output for the actual instance type.

The "more PCIe trees" justification in PR #74's commit body (and §6.1 of the plan doc) therefore over-promises: c6in.metal has 2 trees, not 4. That alone caps the multi-NIC headroom at ~2× single-tree, not ~4×.

2. PR #73's bookworm-backports source is the wrong suite for the current AMI

The PR defaults to ANYSCAN_KERNEL_BACKPORT_SUITE=bookworm-backports and ANYSCAN_KERNEL_BACKPORT_PACKAGE=linux-image-cloud-amd64. The current ANYSCAN_EC2_AMI_ID=ami-06e3e2b7faca0265d is Debian 13 (Trixie), not Debian 12 (Bookworm) — so bookworm-backports/linux-image-cloud-amd64 is at version 6.12.74-2~bpo12+1, which is the same kernel version the metal already runs. The opt-in completes successfully ("0 upgraded, 0 newly installed"), so the operator gets a green light without ever moving off 6.12.74.

For this bench I worked around it by apt-get install -t trixie-backports linux-image-amd64, which pulls 6.19.11-1~bpo13+1. Suggest: the install path should detect /etc/os-release ID=debian VERSION_CODENAME=trixie and switch the suite to trixie-backports. The package selection should also note that linux-image-cloud-amd64 from trixie-backports is currently still 6.12.74 — only the non-cloud linux-image-amd64 jumps to 6.19.

3. The big one: ena_xdp_zc still hasn't landed upstream as of Linux 6.19.11

Post-reboot probe:

$ uname -r
6.19.11+deb13-cloud-amd64

$ nm /lib/modules/.../ena.ko.xz | grep -iE 'xsk|_zc|zerocopy'
                 U xdp_convert_zc_to_xdp_frame   ← only undefined import (generic XDP)

$ strings /lib/modules/.../ena.ko.xz | grep -iE 'xsk|af_xdp_zc|XDP_ZEROCOPY|xsk_pool|xsk_buff'
(no matches)

Compare with mlx5_core.ko which has dozens of xsk_* and _zc symbols. The ENA driver has standard XDP_TX/REDIRECT/PASS/DROP paths but no driver-side zerocopy/XSK pool support — exactly what the in-flight upstream patches were supposed to add for "6.16+".

Live confirmation from the scanner's mode ladder when attaching XDP on ens1:

[*] afxdp: xsk_socket__create(ens1, q=0, mode=drv+zerocopy) failed: Operation not supported
[*] afxdp: xsk_socket__create(ens1, q=1, mode=drv+zerocopy) failed: Operation not supported
[*] afxdp: xsk_socket__create(ens1, q=2, mode=drv+zerocopy) failed: Operation not supported
[*] afxdp: xsk_socket__create(ens1, q=3, mode=drv+zerocopy) failed: Operation not supported
[*] afxdp: thread 0 bound ens1 queue 0 in mode=drv+copy (umem=16 MiB, tx=4096 rx=4096 frames × 2048 B)

Identical Operation not supported to anygpt-42 on the 6.12.74 kernel. The kernel backport is not a workaround for the ENA zerocopy gap on AWS today. The plan's "wait for kernel 6.16+ AMI" path needs to be revised — the patches still aren't merged. The viable paths to unlock zerocopy continue to be (a) Mellanox/mlx5 NICs on non-AWS bare metal, or (b) PF_RING ZC with a paid ntop license on AWS (PR #75 is currently engine-init-stub, so that path is not yet usable either).

4. Bonus: PR #74's multi-ENI launch path skips public-IP allocation

A single-NIC launch on this subnet auto-assigns a public IP because MapPublicIpOnLaunch=true on the subnet. The multi-ENI path (launch_args["NetworkInterfaces"] = build_network_interfaces(...)) does not set AssociatePublicIpAddress=True on the primary interface, so the launched metal had no public IP and was unreachable from outside the VPC. I worked around it by aws ec2 allocate-address + associate-address on the primary ENI. Suggest: when eni_attach['attached'] == 1 (single-ENI fallback even with the new path) the launch should set AssociatePublicIpAddress: True on the primary entry; for multi-ENI, the operator may still want the same on DeviceIndex=0, NetworkCardIndex=0.

Per-NIC detail (T=R=8 run, the better of the two AF_XDP runs)

  enp13s0   peak=0.74M avg=0.60M
  enp154s0  peak=0.91M avg=0.56M
  enp155s0  peak=0.90M avg=0.56M
  enp156s0  peak=0.89M avg=0.56M
  enp157s0  peak=0.81M avg=0.56M
  enp158s0  peak=0.83M avg=0.56M
  enp159s0  peak=0.77M avg=0.56M
  enp15s0   peak=0.74M avg=0.60M
  enp160s0  peak=0.90M avg=0.56M
  ens1      peak=0.73M avg=0.60M
  ens2      peak=0.72M avg=0.56M
  ens3      peak=1.04M avg=0.60M
  ens4      peak=0.72M avg=0.60M
  ens5      peak=0.74M avg=0.60M
  ens7      peak=0.74M avg=0.60M
AGGREGATE   peak=12.18M  avg=8.65M

Per-NIC numbers are roughly uniform within each card domain (enp154-160 cluster around 0.83 M, ens*+enp1[35]s0 around 0.75 M). The drop from 2.80 M per-NIC (anygpt-42, 8 NICs, T=R=8) to ~0.81 M (anygpt-48, 15 NICs, T=R=8) is the CPU-thrashing signature of 240 active threads on 128 cores — confirms that per-host AF_XDP throughput on c6in.metal is CPU-bound under drv+copy, not NIC-bound. Adding NICs without unlocking zerocopy doesn't help because the bottleneck moves to descriptor-copy CPU work, which scales with thread count regardless of NIC count.

Setup notes (delta from anygpt-42)

  1. Bundle: agent-bundle-linux-x86_64__anygpt-48-afxdp-kbackport-20260428174058.tar.gz (sha256 775cf88e1a3434846a038f210c1aa841e10efa215fbcb689cef7d6db34d930c0) built via package-worker-bundle.sh ANYSCAN_USE_AF_XDP=1 ANYSCAN_INSTALL_KERNEL_BACKPORT=1 ANYSCAN_USE_PFRING_ZC=0. PF_RING off per task brief (PR #75 engine-init stub).
  2. Metal i-043714fbb73cca641, AZ us-east-1a, EIP 54.165.21.227. Primary ENI eni-0c47a4a7f3ba69511.
  3. SSH wedge after systemctl reboot was longer than expected (~6 min), but came back into 6.19 cleanly. EC2 console showed boot reaching cloud-final.service then continuing into a normal start. No grub reorder needed; linux-image-amd64 set itself as the default.
  4. AF_XDP host prep was the same as anygpt-42: ip link set <if> mtu 3498 + ethtool -L <if> combined 8 (matched RSS to T=R=8) on every ENI before bench B.
  5. /sys/class/net/<if>/statistics/tx_packets continues to NOT count XDP TX on ENA — bench-A used kernel-counter deltas; bench-B used scanner self-reports (send: …M p/s).
  6. AF_PACKET path was perfectly balanced (each of 15 ENIs sent exactly 8,947,8xx packets ± 60), confirming the harness's --shards i/15 distribution is sound.

Summary verdict on the env-knob PRs

| PR | Wired correctly? | Closes 22.43 M → 30–50 M gap? |
| --- | --- | --- |
| #71 (AF_XDP build wireup) | ✅ scanner ships with libxdp linkage; mode ladder fires; falls back to drv+copy | n/a — was never claimed to |
| #73 (kernel backport opt-in) | ⚠️ wired correctly but defaults to wrong suite for current AMI; even on 6.19.11 from trixie-backports, ENA still has no zerocopy support | No — ena_xdp_zc not upstream as of 6.19.11 |
| #74 (15-ENI launch path) | ⚠️ launches 15 ENIs successfully, but fixture claims 4 cards × (5/4/3/3) when reality is 2 cards × 8; also drops public-IP allocation on multi-ENI path | No — adding ENIs at drv+copy is CPU-bound; 15-NIC regressed vs 8-NIC at every T=R config tested |
| #75 (PF_RING ZC build wireup) | (intentionally not exercised — engine-init stub blocks runtime) | (not testable yet) |
| #76 (PF_RING gating) | (not exercised — use_pfring_zc=0 in bundle) | (n/a) |

Cleanup

  • c6in.metal i-043714fbb73cca641 — terminated 18:32:08Z (live for ~36 min total, ~$3.30 in on-demand spend).
  • 15 ENIs (eni-0c47a4a7f3ba69511, eni-049606866a71c9d87, eni-0c9c10e4dc28adde6, eni-0a6a17a3e724ff90b, eni-0acf00ef684fffb47, eni-02dc9e888c93d5a33, eni-0aad977041b58b791, eni-087e24d177d27ee7b, eni-03da977940c9bb8ef, eni-07cb100d30af9c73a, eni-06afba14a7dde0c93, eni-0d21500d7a1f5d641, eni-0c35ff242e90b7a59, eni-06c8680f516ae4ee6, eni-09746474722608a97) — auto-deleted on termination (DeleteOnTermination=true was the default).
  • EIP eipalloc-023db7e0b15246cb7 (54.165.21.227) — released.
  • .external-runtime.env restored to c6in.xlarge (no ANYSCAN_MAX_ENIS); anyscan-ec2-worker-manager.service restarted; replacement xlarge i-0778d6f698047418f already running.
  • Bench logs preserved at scan.anyvm.tech:/root/.worktrees/AnyGPT/anygpt-48/anygpt-48-bench-logs.tar.gz.

Out of scope (per task brief)

cc PR #73 / PR #74 — flagging the fixture and suite-default issues as separate follow-ups rather than blocking the env-knob PRs that are otherwise wired correctly.

skullcrushercmd added a commit that referenced this pull request Apr 28, 2026
Phase 1 design document for adding a DPDK io_engine to the bundled C
scanner (AnyVM-Tech/anyscan-engine-c). Mirrors PR #65's AF_XDP plan
structure across §1-§10.

Why now: PR #65's AF_XDP work landed but the c6in.metal bench revealed
ENA on kernel <=6.12.74 forces drv+copy (not drv+zerocopy), capping the
8-NIC ceiling at ~22 M pps — short of the 30-50 M pps projection. DPDK
via vfio-pci bypasses the ENA kernel driver entirely, projecting
50-100 M pps realistic on c6in.metal.

This supersedes PR #63's deferral recommendation (which was conditioned
on AF_XDP clearing the throughput target — it did not).

Plan scope:
- engine repo: ~1,100 LOC (send-dpdk.c, recv-dpdk.c, dpdk-eal.c,
  dpdk-defs.h, vtable slot in engine.c, USE_DPDK Makefile block)
- AnyScan-side wire-up: ~765 LOC (mirrors PR #71's ANYSCAN_USE_AF_XDP
  pattern across install-external-deps.sh / package-worker-bundle.sh /
  deploy.sh / runtime.worker.env.template / adapter.py + new
  tools/setup-dpdk.sh for hugepages and vfio-pci bind/unbind)
- NIC-binding decision: dedicated-DPDK-NIC pattern. eth0 stays on
  kernel for agentd heartbeat; ENIs eth1..eth7 (c6in.metal) go to
  vfio-pci. Single-NIC instances are DPDK-ineligible by design.
- Effort: 12-15 days implementation + canary, ~3-4 weeks total.

Phase 2 implementation is gated on user/orchestrator approval after
this plan PR merges. No engine C code, no runtime config, no submodule
bumps in this PR.

Co-authored-by: skullcmd <skullcmd@anyvm.tech>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
skullcrushercmd added a commit that referenced this pull request Apr 28, 2026
… 4x5/4/3/3=15) (#77)

PR #74 mocked NetworkCards as 4 cards distributed 5/4/3/3=15, but actual
AWS DescribeInstanceTypes for c6in.metal returns 2 cards x 8 = 16
(anygpt-48 live bench, PR #65 issuecomment-4338158487). The launch path
code is fine - distribute_enis_across_cards handles any card layout -
but the synthetic test fixture and the docstring example encoded a
shape that doesn't match production AWS.

Refresh the fixture, the docstring, and every test that hardcoded
15-derived numbers. Add a new RecordedDescribeInstanceTypesIntegrityTests
class that anchors the fixture against
tools/c6in_metal_describe_instance_types.json (a real
`aws ec2 describe-instance-types` capture) so future drift gets caught
at unit-test time instead of bench time.

Effect on capacity claim: c6in.metal has 2 PCIe trees, not 4, so the
multi-NIC headroom caps at ~2x single-tree, not ~4x.

Co-authored-by: skullcmd <skullcmd@anyvm.tech>
skullcrushercmd added a commit that referenced this pull request Apr 28, 2026
…launches (#79)

When the launch payload uses an explicit NetworkInterfaces[] list
(ANYSCAN_MAX_ENIS set), AWS does NOT honor the subnet's
MapPublicIpOnLaunch — the operator has to opt in by setting
AssociatePublicIpAddress=True on the primary ENI explicitly.

In anygpt-48 (PR #65 issuecomment-4338158487) this caused the
c6in.metal launch to come up unreachable from outside the VPC; the
operator had to manually allocate-address + associate-address
post-launch as a workaround.

Add an opt-in env knob ANYSCAN_EC2_ASSOCIATE_PUBLIC_IP (default off so
existing fleets are unchanged). When set, plumb through ManagerConfig
to build_network_interfaces, which sets AssociatePublicIpAddress=True
on the entry with DeviceIndex=0 NetworkCardIndex=0 only — AWS rejects
the field on secondaries.

Co-authored-by: skullcmd <skullcmd@anyvm.tech>
skullcrushercmd added a commit that referenced this pull request Apr 28, 2026
ANYSCAN_KERNEL_BACKPORT_SUITE defaulted to bookworm-backports
regardless of host. On the current Debian 13 (Trixie) AMI this means
bookworm-backports/linux-image-cloud-amd64 resolves to 6.12.74 — the
same kernel the metal already runs — so the opt-in completes "0
upgraded, 0 newly installed" and the operator gets a false green
light without ever upgrading.

Detect /etc/os-release VERSION_CODENAME and default the suite to
<codename>-backports. Switch the default package to linux-image-amd64
(NOT linux-image-cloud-amd64) on non-bookworm suites, because
trixie-backports cloud-amd64 is still 6.12 as of 2026-04 — only the
non-cloud image jumps to 6.19.

Operator-set ANYSCAN_KERNEL_BACKPORT_SUITE / _PACKAGE / _SOURCES_LIST
still win — detection is just a smarter default. Source-list path is
also derived from the resolved suite so the file matches.

ANYSCAN_OS_RELEASE_FILE env override added so the test suite can
inject a synthetic os-release without touching /etc/os-release on the
test host.

See PR #65 issuecomment-4338158487 (anygpt-48 c6in.metal bench) for
the kernel-resolution trace.

Co-authored-by: skullcmd <skullcmd@anyvm.tech>
skullcrushercmd added a commit that referenced this pull request Apr 28, 2026
…undle + deploy + adapter (#81)

Phase 2 wire-up for the DPDK io_engine landing in
AnyVM-Tech/anyscan-engine-c PR #4. Mirrors PR #71's AF_XDP wire-up shape
across the install / bundle / deploy / adapter / install-time-probe
chain so the engine repo's USE_DPDK=1 build flag actually reaches every
producer of a worker bundle, and so the runtime --io-engine=dpdk knob
plumbed through ANYSCAN_SCANNER_IO_ENGINE has DPDK code to dispatch to.

Why DPDK now: AWS ENA on kernel ≤6.12.74 forces AF_XDP into drv+copy
mode, capping c6in.metal at ~22M pps aggregate (memory:
anyscan_afxdp_ena_constraint, also PR #65 issuecomment-4338158487 —
6.19.11 STILL does not have ena_xdp_zc). DPDK bypasses the kernel ENA
driver entirely via vfio-pci and removes the syscall-kick + lower-half
-channels-only ZC constraint.

What lands here:
  - install-external-deps.sh: ANYSCAN_USE_DPDK env knob;
    binary_has_dpdk_linkage probe (librte_eal.so via ldd → readelf -d);
    install_dpdk_build_deps (libdpdk-dev + dpdk apt-get, fail-open);
    cache short-circuit invalidation when cached binary lacks DPDK
    linkage; vulnscanner_make_args extension; post-build assertion.
  - package-worker-bundle.sh: same env knob, linkage probe,
    rebuild_scanner_with_dpdk helper, bundle_engine_make_args, README.txt
    use_dpdk field. Composes with USE_AF_XDP=1 USE_PFRING_ZC=1 — the
    earliest matching rebuild block produces a binary linked against
    every requested engine in a single make invocation.
  - deploy.sh: same env knob, linkage probe, make_args extension,
    pre-DPDK cached-binary drop, post-build assertion.
  - install-worker-bundle.sh: binary_has_dpdk_linkage,
    probe_dpdk_runtime_available (5 gates: scanner USE_DPDK-built,
    librte_eal.so loadable, vfio_pci kernel module, hugepages reserved
    in /sys/kernel/mm/hugepages/*, /dev/vfio/vfio present),
    apply_dpdk_availability writing ANYSCAN_DPDK_AVAILABLE.
  - vulnscanner-zmap-adapter.py: SUPPORTED_IO_ENGINES gains "dpdk";
    _IO_ENGINE_AVAILABILITY_KEYS maps "dpdk" → ANYSCAN_DPDK_AVAILABLE
    so the same fall-back-with-warning path the AF_XDP / PF_RING ZC
    plumbing already exercises picks up dpdk for free.
  - runtime.worker.env.template: full DPDK section documenting
    ANYSCAN_USE_DPDK (build-time), ANYSCAN_DPDK_AVAILABLE (install
    probe), ANYSCAN_DPDK_PCI_BDFS (BDF / iface CSV), and
    ANYSCAN_DPDK_HUGEPAGES_GB (default 4).
  - tools/setup-dpdk.sh (NEW, ~370 LOC): bind / unbind / status
    subcommands. Reserves hugepages (1 GiB pages preferred, falls back
    to 2 MiB), modprobe vfio-pci, dpdk-devbind.py --bind=vfio-pci.
    Idempotent (re-runs are no-ops). Reversible (`unbind` returns the
    NICs to ena and frees hugepages). Refuses to bind eth0 (agentd
    control-plane interface) and refuses to bind the only NIC. THP
    gets switched to "never" on bind (DPDK + THP fragments the static
    hugepage pool).
  - tools/test-install-external-deps-dpdk.sh (NEW, ~270 LOC): mirrors
    test-install-external-deps-afxdp.sh. Four cases × multiple
    assertions: default unset → no USE_DPDK=1 in make argv; opt-in +
    missing scanner → USE_DPDK=1; opt-in + cached non-DPDK binary →
    make clean + USE_DPDK=1; opt-in + cached DPDK-linked binary → no
    rebuild. Stubs make/git/ldd/readelf so it runs hermetically.
  - test_vulnscanner_adapter_io_engine.py: 7 new DPDK assertions
    covering the dpdk-with-runtime-available, dpdk-without-runtime
    -fall-back-with-warning, missing-availability-var, uppercase
    normalization, and cross-engine availability isolation cases.
    Updated test_invalid_value_falls_back_to_af_packet_with_warning
    to use "fake_engine" instead of "dpdk" — dpdk is now valid.

Verification (on Debian bookworm with libdpdk-dev 24.11 installed):
  - tools/test-install-external-deps-afxdp.sh: 11/11 (regression OK).
  - tools/test-install-external-deps-pfring-zc.sh: 10/10 (regression OK).
  - tools/test-install-external-deps-dpdk.sh: 10/10.
  - python3 -m unittest discover: 116/116 (32 in
    test_vulnscanner_adapter_io_engine, of which 7 are DPDK-specific).
  - All bash scripts parse cleanly via `bash -n`.
  - tools/setup-dpdk.sh status runs cleanly (no NICs bound, expected).

Engine PR for io_engine_dpdk: AnyVM-Tech/anyscan-engine-c#4

Out of scope (separate workers per the plan):
  - Phase 2 systemd unit edit adding CAP_SYS_RAWIO/CAP_IPC_LOCK/
    CAP_NET_ADMIN to anyscan-worker.service. Documented in the env
    template. Until that lands operators must add caps manually before
    flipping the runtime knob.
  - Live c6in.metal bench (plan §5.3).
  - AMI rebuild.
  - mlx5 / non-AWS hardware support.

Refs: plans/2026-04-28-portscan-dpdk-impl-v1.md (§3.10 wire-up, §3.11
NIC-binding decision, §4.3 kernel feature checks, §5.7 unit test shape).
      anygpt-50

Co-authored-by: skullcmd <skullcmd@anyvm.tech>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@skullcrushercmd Contributor Author

Phase 2 — c6in.metal AF_PACKET / AF_XDP / DPDK live bench (anygpt-52): regressions block AF_XDP + DPDK; AWS PPS allowance still untested

Live bench on c6in.metal (128 vCPU, 2 NetworkCards × 8 ENIs each = 16 max, 8 attached) driven by anygpt-52. Engine commit ccfd077 (post-PR #4 DPDK io_engine), AnyScan commit 4faa236 (post-PR #81 DPDK build wireup), scanner SHA256 19a0435964c5… linked against libxdp.so.1 + librte_eal.so.25 + libbpf.so.1. Bundle build proven OK; deploy comment on PR #70 (issuecomment-4338757890).

TL;DR

| Config | Aggregate avg pps | vs anygpt-42 baseline | AWS pps_allowance_exceeded |
| --- | --- | --- | --- |
| AF_PACKET 8-NIC, T=R=8, -r 50M, ~36 s wall | 7.98 M | 1.07× (anygpt-42 was 7.49 M) ✅ regression-check passes | 0 / no quota hit |
| AF_XDP 8-NIC | n/a, segfaulted (engine bug) ❌ | | |
| DPDK 7-NIC (vfio-pci) | n/a, EAL refused (engine bug) ❌ | | |

The 22 M AF_XDP ceiling from anygpt-42 is still neither confirmed nor refuted as AWS-imposed. AF_PACKET-only at ~8 M pps is well below the AWS ENA per-instance PPS allowance for c6in.metal — pps_allowance_exceeded, bw_in_allowance_exceeded, bw_out_allowance_exceeded all stayed at 0 across all 8 ENIs for the full bench window. Without working AF_XDP or DPDK on this build, we cannot push hard enough to surface the AWS quota.

Bench harness

  • Target: 198.18.0.0/15 × ports 1-1024 (= 134 M probes, 16.78 M per shard at 8 shards). Same target as anygpt-42 / anygpt-48.
  • One scanner subprocess per ENI, --shards i/N, -T 8 -R 8, -r 50M, -c 1 (cooldown 1 s), -q.
  • PPS measurement: /sys/class/net/<iface>/statistics/tx_packets delta ÷ wall time (ethtool -S does not surface tx_packets on ENA).
  • AWS quota: ethtool -S <iface> | grep allowance_exceeded PRE and POST on every NIC.

AF_PACKET 8-NIC results (T=R=8, -r 50M, wall 35.84 s)

enp154s0 tx_delta=16777290  (468377 pps)  rx_delta=16
enp155s0 tx_delta=16777260  (468377 pps)  rx_delta=15
enp156s0 tx_delta=16777261  (468377 pps)  rx_delta=15
enp157s0 tx_delta=16777263  (468377 pps)  rx_delta=16
ens1     tx_delta=16777328  (468379 pps)  rx_delta=98     ← control plane (also scanned)
ens2     tx_delta=16779063  (468427 pps)  rx_delta=1689   ← carries SSH traffic + DNS
ens3     tx_delta=16777284  (468378 pps)  rx_delta=18
ens4     tx_delta=16777269  (468377 pps)  rx_delta=17

AGGREGATE_TX_PACKETS=285,901,464  (16.78 M × 8 shards × ~2 passes; ran longer than one scan-cycle)
AGGREGATE_PPS=7,981,615           (≈ 7.98 M aggregate)

Per-NIC PPS is remarkably uniform at ~468 K each — matches the PR #65 plan §2 documented AF_PACKET single-socket cap (~3 M / 8 threads ≈ 375-500 K per scanner — consistent with vulnscanner-zmap-adapter.py:669's comment about per-PACKET_TX_RING socket throughput). 8-NIC scaling is near-linear because each scanner has its own ring on its own ENI.

AWS *_allowance_exceeded deltas across all 8 ENIs

| Counter | PRE | POST | Δ |
| --- | --- | --- | --- |
| pps_allowance_exceeded | 0 | 0 | 0 |
| bw_in_allowance_exceeded | 0 | 0 | 0 |
| bw_out_allowance_exceeded | 0 | 0 | 0 |
| conntrack_allowance_exceeded | 0 | 0 | 0 |
| linklocal_allowance_exceeded | 0 | 0 | 0 |
| conntrack_allowance_available | 6567014 | 6567010 | −4 (SSH/DNS, irrelevant) |

At ~8 M aggregate pps, AWS does not throttle. That's the only hard data point we have on the AWS quota for c6in.metal from this run.

Why AF_XDP did not run

[*] afxdp: xsk_socket__create(enp154s0, q=0, mode=drv+zerocopy) failed: Operation not supported
libbpf: elf: skipping unrecognized data section(8) .xdp_run_config
…
Segmentation fault

The bind-mode ladder in anyscan-engine-c/src/send-afxdp.c:287-289 (AFXDP_BIND_ZEROCOPY → AFXDP_BIND_DRV_COPY → AFXDP_BIND_SKB) is correct in source — it's just supposed to retry on each failure — but in practice the second attempt (drv+copy) segfaults the process instead of returning a clean error. The first attempt's xsk_socket__create returns -EOPNOTSUPP (matches memory anyscan_afxdp_ena_constraint: ENA on Linux 6.12 only supports drv+copy, not drv+zerocopy), so this is the exact code path we know we need on AWS — and it's broken.

Bug location: the engine's afxdp_try_bind() doesn't fully tear down the partially-created XSK / UMEM / xdp_program after a failed xsk_socket__create() before the next attempt. Repro is single-scanner -i ens1 --io-engine=af_xdp -T 4 -R 4 … on c6in.metal — segfaults every time.

This is a regression-blocker for AF_XDP on AWS. anygpt-42 was on engine f1288d6 (pre-PR #4); current ccfd077 (PR #4 merge) introduced the regression — most likely the DPDK changes share state-init code with the AF_XDP setup path. Splitting this into its own engine PR would have caught it via the bench gate plan §5.3 demands.
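A minimal sketch of the teardown-between-attempts shape the fix needs. afxdp_try_bind() is the engine's function but its body is not reproduced here; the helper and variable names below are hypothetical:

```c
/* Sketch of the bind-mode ladder with full cleanup between attempts.
 * Illustrative only; afxdp_umem_create() is a hypothetical helper standing in
 * for the engine's real UMEM setup. */
#include <stddef.h>
#include <linux/if_xdp.h>
#include <xdp/xsk.h>

/* Hypothetical helper: allocates and registers a fresh UMEM for one attempt. */
int afxdp_umem_create(struct xsk_umem **umem, struct xsk_ring_prod *fill,
                      struct xsk_ring_cons *comp);

static const unsigned int bind_modes[] = {
    XDP_USE_NEED_WAKEUP | XDP_ZEROCOPY,   /* drv+zerocopy                      */
    XDP_USE_NEED_WAKEUP | XDP_COPY,       /* drv+copy                          */
    XDP_USE_NEED_WAKEUP,                  /* last rung: SKB mode via xdp_flags */
};

static int bind_with_fallback(const char *ifname, unsigned int queue_id,
                              struct xsk_socket **xsk_out)
{
    for (size_t i = 0; i < sizeof(bind_modes) / sizeof(bind_modes[0]); i++) {
        struct xsk_umem *umem = NULL;
        struct xsk_ring_prod fill, tx;
        struct xsk_ring_cons comp;

        if (afxdp_umem_create(&umem, &fill, &comp) != 0)
            return -1;

        struct xsk_socket_config cfg = {
            .tx_size    = XSK_RING_PROD__DEFAULT_NUM_DESCS,
            .bind_flags = bind_modes[i],
        };
        if (xsk_socket__create(xsk_out, ifname, queue_id, umem,
                               NULL /* TX-only */, &tx, &cfg) == 0)
            return 0;

        /* The step the current code skips: release everything that was partially
         * set up before retrying, so the next attempt starts from clean state
         * instead of crashing on a half-initialized UMEM/XSK. */
        xsk_umem__delete(umem);
    }
    return -1;
}
```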

Why DPDK did not run

Two engine-side bugs found in PR #4's dpdk-eal.c:

  1. Hardcoded nb_tx_desc=1024 exceeds ENA's max=512:

    ETHDEV: Invalid value for nb_tx_desc(=1024), should be: <= 512, >= 128, and a product of 1
    [-] dpdk-eal: rte_eth_tx_queue_setup(port=0, q=0) failed: Invalid argument
    

    The TX descriptor count must be queried from rte_eth_dev_info_get(...).tx_desc_lim.nb_max per device, not hardcoded (mlx5 supports 4 K, ENA caps at 512); see the sketch after this list.

  2. EAL argv splitter mangles --socket-mem 1024: the trailing 1024 token gets replaced by the scanner program path on the way into rte_eal_init. Repro: scanner --io-engine=dpdk … -- --file-prefix=foo --socket-mem 1024 ends up as scanner --file-prefix=foo --socket-mem scanner in the EAL log. The space-separated form fails; the = form (--socket-mem=1024,1024) is untested.
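A hedged sketch of both fixes; the DPDK calls (rte_eth_dev_info_get, rte_eth_tx_queue_setup, rte_eal_init) are real API, while the surrounding function shape and the --file-prefix value are illustrative:

```c
/* Sketch: query per-device TX descriptor limits instead of hardcoding 1024,
 * and hand rte_eal_init() a pre-built argv so --socket-mem survives intact.
 * Illustrative only -- not the engine's dpdk-eal.c. */
#include <rte_eal.h>
#include <rte_ethdev.h>

static int setup_tx_queue(uint16_t port_id, uint16_t queue_id)
{
    struct rte_eth_dev_info info;
    if (rte_eth_dev_info_get(port_id, &info) != 0)
        return -1;

    /* Clamp the desired ring size to what this PMD supports:
     * ENA reports tx_desc_lim.nb_max = 512, mlx5 reports 4096. */
    uint16_t nb_tx_desc = 1024;
    if (nb_tx_desc > info.tx_desc_lim.nb_max)
        nb_tx_desc = info.tx_desc_lim.nb_max;

    return rte_eth_tx_queue_setup(port_id, queue_id, nb_tx_desc,
                                  rte_eth_dev_socket_id(port_id), NULL);
}

static int init_eal(const char *prog)
{
    /* Single-token "--socket-mem=1024,1024" sidesteps the argv-splitting bug
     * that swallowed the value of the space-separated form. */
    char *eal_argv[] = {
        (char *)prog,
        "--file-prefix=anyscan",       /* illustrative prefix */
        "--socket-mem=1024,1024",
        NULL,
    };
    return rte_eal_init(3, eal_argv);
}
```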

Plus deploy-side blockers (worked around but worth flagging for follow-up):

  1. librte-net-ena25 not in the bundle install path. Debian DPDK 24.11.4 ships ENA PMD as a separate package; without it, rte_eal_init succeeds but no eth ports are probed. Manual apt install librte-net-ena25 was required. install-external-deps.sh::install_dpdk_build_deps should pull this in when ANYSCAN_USE_DPDK=1 is set.

  2. setup-dpdk.sh bind refused active interfaces. dpdk-devbind safety check ("Warning: routing table indicates that interface is active. Not modifying") forces the operator to manually ip link set <ifc> down on each NIC before bind succeeds. PR #81's tools/setup-dpdk.sh should down the iface itself or at least call out the requirement in its README.

  3. 1 GiB hugepages reserved without hugetlbfs mount. setup-dpdk.sh reserved 8 × 1 GiB pages but didn't mount a 1G-pagesize hugetlbfs. EAL fell back to "no hugepages reported on node 0/1" until I manually mount -t hugetlbfs -o pagesize=1G nodev /mnt/huge1g.

Other deploy-path bugs surfaced (worked around)

  1. PR #79 ANYSCAN_EC2_ASSOCIATE_PUBLIC_IP=true + multi-ENI is rejected by AWS: InvalidParameterCombination — The associatePublicIPAddress parameter cannot be specified when launching with multiple network interfaces. AWS only honors AssociatePublicIpAddress on NetworkInterfaces[] when there is exactly one NetworkInterface entry. PR #79's multi-ENI public-IP path is non-functional on every multi-ENI launch. Workaround: launch without it, then aws ec2 allocate-address && aws ec2 associate-address --network-interface-id <primary-eni> post-launch.

  2. /api/agent/install.sh?rebuild=false served a stub scanner. The bundle the worker fetched at bootstrap (agent-bundle-…20260428202928…) had a 37-byte shell script in place of the scanner binary (#!/bin/sh; echo cached-scanner-pfring). The API's bundle-cache path does not honor ANYSCAN_USE_AF_XDP/ANYSCAN_USE_DPDK flags from the original package-worker-bundle.sh invocation; bundles served from the API are AF_PACKET-only stubs unless those env vars are set in the API's systemd EnvironmentFile. Manually scp'd the real scanner from the operator-built bundle as a workaround.

  3. AF_XDP and DPDK runtime probes (in install-worker-bundle.sh) returned available=false on c6in.metal Debian 13 + kernel 6.12.74. Probe message: "kernel <5.10 or libxdp.so missing; ANYSCAN_AF_XDP_AVAILABLE=false" — but the kernel is 6.12 and libxdp1 is in the package archive. The probe's kernel-version check is wrong (likely cut -d. -f1-style parsing failing on 6.12.74).

What the AWS PPS allowance numbers actually tell us

Reproducing the table in memory anyscan_aws_pps_allowance:

| Source | Aggregate PPS hit | pps_allowance_exceeded non-zero? | Verdict |
| --- | --- | --- | --- |
| anygpt-4 / anygpt-42 AF_PACKET 8-NIC | 7.49 M | (not captured) | unknown |
| anygpt-42 AF_XDP 8-NIC cap=4 t=8 (best) | 22.43 M | (not captured) | unknown |
| anygpt-48 AF_XDP 15-NIC | 12.18 M (regressed) | (not captured) | unknown |
| anygpt-52 AF_PACKET 8-NIC T=8 R=8 | 7.98 M | 0 — quota not hit | AWS allowance ≥ 8 M pps for c6in.metal |
| anygpt-52 AF_XDP | engine bug, see §"Why AF_XDP" | | |
| anygpt-52 DPDK 7-NIC | engine bug, see §"Why DPDK" | | |

So all we've confirmed empirically is AWS allows ≥ 8 M pps. The 22 M historic ceiling could still be either AWS- or engine-imposed. Memory anyscan_aws_pps_allowance (and the underlying AWS public docs) suggest the c6in family quota is on the order of low-tens-of-M; the next bench that pushes >10 M is the one that decides this question.

Follow-up tickets (please file before next bench)

  1. AF_XDP fall-back segfault — anyscan-engine-c/src/send-afxdp.c::afxdp_try_bind() must clean up XSK/UMEM/xdp_program on xsk_socket__create() failure before the next bind-mode attempt. Repro single-line above.
  2. DPDK hardcoded nb_tx_desc=1024 — query rte_eth_dev_info.tx_desc_lim.nb_max per device.
  3. DPDK EAL argv splitter — passes --socket-mem 1024 as --socket-mem <next-token-which-is-actually-program-path>.
  4. PR #79 multi-ENI public IP — top-level launch flag silently misuses the AWS API. Either (a) only allow on single-NIC launches, (b) emit a post-launch EIP allocate+associate, or (c) document that the operator must do the EIP step themselves.
  5. API bundle cache — /api/agent/install.sh serves a stub scanner unless ANYSCAN_USE_AF_XDP=1 ANYSCAN_USE_DPDK=1 are in the api's systemd EnvironmentFile. Either pin the env vars there or hash the bundle's build flags into the cache key.
  6. install-external-deps.sh — must apt install librte-net-ena25 when ANYSCAN_USE_DPDK=1 (currently only installs libdpdk-dev).
  7. tools/setup-dpdk.sh bind — should ip link set <ifc> down automatically (or at least error with that hint instead of "interface active. Not modifying") and should mount 1 GiB hugetlbfs when reserving 1 GiB hugepages.
  8. AF_XDP install probe — kernel version check fails on 6.12.74. Ship a probe that handles 3-component versions and is a no-op when libxdp.so is present in ldconfig -p.

Cleanup

  • setup-dpdk.sh unbind — restored 7 ENIs to ENA driver, freed hugepages. ✓
  • aws ec2 terminate-instances --instance-ids i-0b8fe00496163d227 — instance shutting-down. ✓
  • aws ec2 release-address --allocation-id eipalloc-0bf5d678f6ff3bc20 — EIP released. ✓
  • systemctl start anyscan-ec2-worker-manager.service — watchdog back online (it'll re-launch its c6in.xlarge fleet). ✓

Cross-reference: memory anyscan_aws_pps_allowance (PR-comment-pointer to here updated), memory anyscan_afxdp_ena_constraint. Deploy proof: PR #70 issuecomment-4338757890.

— driven by anygpt-52, instance lifetime ≈ 1 h 12 min, total spend ≈ $7.

skullcrushercmd added a commit that referenced this pull request Apr 28, 2026
…ches (#82)

PR #79's ANYSCAN_EC2_ASSOCIATE_PUBLIC_IP=true on a multi-ENI launch is
hard-rejected by AWS:

  InvalidParameterCombination — The associatePublicIPAddress parameter
  cannot be specified when launching with multiple network interfaces.

AWS only honors AssociatePublicIpAddress on NetworkInterfaces[] when
exactly one entry is supplied, even if the field appears only on the
primary entry of a multi-NIC payload. The entire RunInstances call
fails. Reported in PR #65 issuecomment-4339242358 (anygpt-52).

Fix: when len(NetworkInterfaces) > 1, suppress the field inline and
allocate-address + associate-address on the primary ENI post-launch.
The recreate path now also releases the previously-recorded EIP
before terminating the old instance so we don't leak Elastic IPs on
every recreate.

- build_network_interfaces only emits AssociatePublicIpAddress when
  the resulting payload is single-NIC (target_count == 1).
- Ec2WorkerManager._associate_public_ip_post_launch allocates an EIP
  (Domain=vpc), associates it with the primary ENI (DeviceIndex=0),
  records AllocationId/AssociationId in self.state. Allocate or
  associate failures are surfaced in eni_attach.public_ip but do not
  abort the recreate — the worker is still usable on private IPs.
- Ec2WorkerManager._release_recorded_eip disassociates and releases
  any previously-recorded EIP at the start of recreate_instance.

Tests:
- New: launch payload free of AssociatePublicIpAddress on multi-ENI;
  allocate_address + associate_address called post-launch with the
  primary ENI's NetworkInterfaceId; allocation_id persisted.
- New: AllocateAddress failure does not abort recreate.
- New: AssociateAddress failure still records AllocationId so the
  next recreate can release it.
- New: previously-recorded EIP is disassociated + released before
  terminating the old instance on the next recreate.
- Updated: prior tests that asserted the broken inline-flag behavior
  on multi-NIC now assert the field is suppressed everywhere.

Co-authored-by: skullcmd <skullcmd@anyvm.tech>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
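
For reference, a minimal boto3-style Python sketch of the launch-payload and post-launch EIP flow described in the message above. Function names and signatures are modeled on the commit text rather than lifted from the repository, so treat them as hypothetical.

```python
# Hypothetical sketch of the #82 flow; the real Ec2WorkerManager helpers may
# differ in signature. `ec2` is a boto3 EC2 client (boto3.client("ec2")).

def build_network_interfaces(subnet_id, security_group_ids, target_count,
                             associate_public_ip):
    """Build the RunInstances NetworkInterfaces payload.

    AWS only accepts AssociatePublicIpAddress when exactly one interface is
    supplied, so the flag is suppressed on multi-ENI launches.
    """
    interfaces = []
    for device_index in range(target_count):
        entry = {
            "DeviceIndex": device_index,
            "SubnetId": subnet_id,
            "Groups": security_group_ids,
        }
        if associate_public_ip and target_count == 1:
            entry["AssociatePublicIpAddress"] = True
        interfaces.append(entry)
    return interfaces


def associate_public_ip_post_launch(ec2, primary_eni_id, state):
    """Allocate an EIP and attach it to the primary ENI after launch.

    Failures are recorded in `state` rather than raised: the worker is still
    usable on private IPs when the public-IP step does not complete.
    """
    try:
        alloc = ec2.allocate_address(Domain="vpc")
        state["allocation_id"] = alloc["AllocationId"]
        assoc = ec2.associate_address(
            AllocationId=alloc["AllocationId"],
            NetworkInterfaceId=primary_eni_id,
        )
        state["association_id"] = assoc["AssociationId"]
        state["public_ip"] = alloc["PublicIp"]
    except Exception as exc:
        state["public_ip_error"] = str(exc)
```
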
skullcrushercmd added a commit that referenced this pull request Apr 28, 2026
When the API process runs with ANYSCAN_USE_AF_XDP=1 (or any of the
sibling ANYSCAN_USE_DPDK / ANYSCAN_USE_PFRING_ZC /
ANYSCAN_INSTALL_KERNEL_BACKPORT knobs), package-worker-bundle.sh
rebuilds the scanner with matching feature linkage. The resulting
bundle carries a feature-flagged scanner binary even though the
*installed* /opt/anyscan/bin/scanner stays the same.

`current_hosted_agent_bundle_source_fingerprint` previously only
hashed the embedded asset payload and the installed binaries — the
build-flag env vars were absent from the cache key. So a default-flags
rebuild produced a fingerprint identical to a feature-flagged one and
silently overwrote the cached AF_XDP/DPDK bundle. Operators bootstrapping
via /api/agent/install.sh?rebuild=false then received an
AF_PACKET-only stub scanner. Reported in PR #65 issuecomment-4339242358
(anygpt-52): "/api/agent/install.sh?rebuild=false served a stub scanner".

Fix: fold each documented build-flag env var name + value into the
fingerprint hash. Bundles built with different flags now land in
different cache slots; rebuild=false serves the bundle that matches
the API's current build-flag environment instead of a stale one.

- BUNDLE_BUILD_FLAG_ENV_VARS pins the exact set as a static array, so adding
  a future ANYSCAN_USE_* knob is an explicit compile-time decision.
- hash_bundle_build_flag_env_vars takes an env-lookup closure so unit
  tests can hash hermetic inputs without poking std::env (which would
  race with parallel test execution).
- bundle_build_flag_env_fingerprint is a #[cfg(test)] helper that
  produces just the build-flag contribution as a SHA-256 hex digest.

Tests:
- Default vs ANYSCAN_USE_AF_XDP=1 produce different fingerprints.
- Each flag flipped on its own produces a unique fingerprint (no
  collisions between AF_XDP-only and DPDK-only builds).
- Same flag with values "1" / "0" / unset are all distinct.
- Repeated lookup with same input returns same fingerprint.
- Static check that the four documented flags are all in the const.

Co-authored-by: skullcmd <skullcmd@anyvm.tech>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
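
For illustration only, the same cache-key idea in Python (the repository's implementation is Rust): fold each documented flag name and value into the digest through an injected lookup, so unit tests never touch the process environment.

```python
import hashlib

# Hypothetical Python illustration; the repository's implementation is Rust.
BUNDLE_BUILD_FLAG_ENV_VARS = (
    "ANYSCAN_USE_AF_XDP",
    "ANYSCAN_USE_DPDK",
    "ANYSCAN_USE_PFRING_ZC",
    "ANYSCAN_INSTALL_KERNEL_BACKPORT",
)


def hash_bundle_build_flag_env_vars(lookup):
    """Fold each documented build-flag name and value into a SHA-256 digest.

    `lookup` is any callable from env var name to value-or-None (os.environ.get
    in production, a dict's .get in tests), so tests never mutate the process
    environment and cannot race with parallel test execution.
    """
    hasher = hashlib.sha256()
    for name in BUNDLE_BUILD_FLAG_ENV_VARS:
        value = lookup(name)
        hasher.update(name.encode())
        hasher.update(b"=")
        # Keep unset distinct from "" and "0" so every flag flip moves the key.
        hasher.update(b"<unset>" if value is None else value.encode())
        hasher.update(b"\x00")
    return hasher.hexdigest()


# Default flags and an AF_XDP build land in different cache slots:
assert (hash_bundle_build_flag_env_vars({}.get)
        != hash_bundle_build_flag_env_vars({"ANYSCAN_USE_AF_XDP": "1"}.get))
```
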
skullcrushercmd added a commit that referenced this pull request Apr 28, 2026
…DPDK (#84)

Debian DPDK 24.11.x ships every Poll-Mode Driver as its own package
(librte-net-<vendor><abi>) instead of bundling them into libdpdk-dev.
Without the relevant PMD installed, rte_eal_init() succeeds but no
eth ports are probed and the scanner refuses to start.

anygpt-52 hit this on c6in.metal: ENA NICs were silently absent from
rte_eth_dev_count_avail() until librte-net-ena25 was apt-installed
manually. Reported in PR #65 issuecomment-4339242358.

Fix: install_dpdk_build_deps now pulls librte-net-ena25 (AWS ENA PMD —
every c6in/c5n/m5n/m6in instance) AND librte-net-mlx5-25 (Mellanox
ConnectX-5/6 PMD for non-AWS bare-metal hosts at Equinix/OVH/Hetzner)
alongside libdpdk-dev. The 25 ABI suffix matches Debian trixie's DPDK
24.11.x. Stock Intel ixgbe/i40e drivers are still in libdpdk-dev's
auto-pull set so we don't need to name them.

Falls back to libdpdk-dev alone if the PMD packages are unavailable in the
archive: better to ship a partial DPDK build than fail the install. The
runtime warning makes the fallback explicit, so operators know why
rte_eth_dev_count_avail() may return 0 later.

Test: new Case 5 in test-install-external-deps-dpdk.sh runs the install
script with ANYSCAN_INSTALL_DPDK_DEPS=true (the default) and stub id and
apt-get commands on PATH, then asserts apt-get install was called with
libdpdk-dev, librte-net-ena25, and librte-net-mlx5-25.

Co-authored-by: skullcmd <skullcmd@anyvm.tech>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
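
A short Python sketch of that fallback behavior, assuming hypothetical helper names (the in-tree install script is shell):

```python
import subprocess

# Illustrative fallback only; install-external-deps.sh implements this in shell.
PMD_PACKAGES = ["librte-net-ena25", "librte-net-mlx5-25"]


def install_dpdk_build_deps():
    base = ["apt-get", "install", "-y", "libdpdk-dev"]
    try:
        # Preferred: core DPDK headers plus the ENA and mlx5 PMD packages.
        subprocess.run(base + PMD_PACKAGES, check=True)
    except subprocess.CalledProcessError:
        # PMDs absent from the archive: ship a partial DPDK build and warn,
        # so an operator seeing rte_eth_dev_count_avail() == 0 knows why.
        print("warning: PMD packages unavailable; installing libdpdk-dev only")
        subprocess.run(base, check=True)
```
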
skullcrushercmd added a commit that referenced this pull request Apr 28, 2026
…1g (#85)

Two operator-side speedbumps surfaced on c6in.metal during anygpt-52
(PR #65 issuecomment-4339242358):

1. dpdk-devbind refuses to bind active interfaces:
     Warning: routing table indicates that interface is active.
     Not modifying.
   Operators had to `ip link set <ifc> down` on every NIC by hand
   before the bind step succeeded.

2. Reserving 1 GiB hugepages was not enough for rte_eal_init:
     EAL: No available 1048576 kB hugepages reported on node 0/1
   Operators had to `mount -t hugetlbfs -o pagesize=1G nodev /mnt/huge1g`
   themselves before the scanner could find the pages.

Fix:

- bdf_to_iface() walks /sys/bus/pci/devices/<bdf>/net/ to map BDF →
  kernel iface name. cmd_bind invokes `ip link set <ifc> down` on
  each target before invoking dpdk-devbind, with best-effort failure
  semantics (missing ip command, missing iface, already-down iface
  all proceed to the bind).

- ensure_hugetlbfs_mount() mounts a hugetlbfs of the matching pagesize
  at the configured path after a successful nr_hugepages reservation.
  Default targets are /mnt/huge1g (1 GiB) and /mnt/huge2m (2 MiB);
  ANYSCAN_DPDK_HUGEPAGES_1G_MOUNT / _2M_MOUNT override or set them
  to "" to opt out (operators provisioning hugetlbfs via fstab).
  Idempotent: detects existing hugetlbfs of the right pagesize via
  findmnt / /proc/mounts and skips remount.

- ANYSCAN_DPDK_LOAD_ONLY=1 hook lets test-setup-dpdk.sh source the
  script for hermetic helper testing without triggering the argv
  dispatch.

Tests (new tools/test-setup-dpdk.sh):

- cmd_bind invokes `ip link set <ifc> down` BEFORE dpdk-devbind
  --bind=vfio-pci (order verified by line numbers in a single cmd
  log).
- ensure_hugetlbfs_mount calls `mount -t hugetlbfs -o pagesize=1G
  nodev <path>`.
- ensure_hugetlbfs_mount is a no-op when target is already hugetlbfs
  with the matching pagesize.
- ensure_hugetlbfs_mount is a no-op with an empty mount path.
- bdf_to_iface returns iface for a populated fake /sys tree.
- bdf_to_iface returns empty when net/ dir is missing.

Co-authored-by: skullcmd <skullcmd@anyvm.tech>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
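
The helpers described above, rendered as a hypothetical Python sketch (the in-tree tools/setup-dpdk.sh implements them in shell):

```python
import os
import subprocess

# Hypothetical Python rendering; tools/setup-dpdk.sh implements these in shell.

def bdf_to_iface(bdf):
    """Map a PCI BDF (e.g. '0000:00:06.0') to its kernel interface name,
    or return '' when the device exposes no net/ directory."""
    net_dir = f"/sys/bus/pci/devices/{bdf}/net"
    try:
        entries = sorted(os.listdir(net_dir))
    except OSError:
        return ""
    return entries[0] if entries else ""


def down_iface_before_bind(bdf):
    """Best-effort `ip link set <ifc> down` so dpdk-devbind does not refuse
    an interface the routing table still considers active."""
    iface = bdf_to_iface(bdf)
    if iface:
        subprocess.run(["ip", "link", "set", iface, "down"], check=False)


def ensure_hugetlbfs_mount(pagesize, path):
    """Mount a hugetlbfs of the requested pagesize unless `path` already is one.

    An empty path opts out (operator provisions hugetlbfs via fstab). The real
    script also verifies the existing mount's pagesize; this sketch only checks
    the filesystem type.
    """
    if not path:
        return
    with open("/proc/mounts") as mounts:
        for line in mounts:
            fields = line.split()
            if len(fields) >= 3 and fields[1] == path and fields[2] == "hugetlbfs":
                return
    os.makedirs(path, exist_ok=True)
    subprocess.run(["mount", "-t", "hugetlbfs", "-o", f"pagesize={pagesize}",
                    "nodev", path], check=True)
```
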
skullcrushercmd added a commit that referenced this pull request Apr 28, 2026
probe_afxdp_runtime_available reported "kernel <5.10 or libxdp.so
missing" on c6in.metal Debian 13 + kernel 6.12.74 even though both
prerequisites were satisfied. The previous parameter-expansion parser
silently mishandled some 3-component release shapes; the failure
mode reported in PR #65 issuecomment-4339242358 (anygpt-52) was a
generic "false" with no indication of which check fired, leaving
the operator to guess.

Fix:

- New parse_kernel_major_minor() helper uses awk -F'[.-]' so 3-
  component releases like 6.12.74-cloud-amd64, 5.10.0-13-amd64,
  6.12.74+deb13+1-amd64, and 5.4.282-rt all parse cleanly. Returns
  "MAJOR MINOR" on stdout, "0 0" on parse failure.

- probe_afxdp_runtime_available emits a one-line stderr explanation
  whenever it returns "false" so the operator can immediately see
  which check fired ("kernel 4.19 < 5.10", "libxdp.so not in
  ldconfig -p", "could not parse kernel version"). Quiet on success.

- apply_afxdp_availability captures the probe stderr and includes
  the reason in its summary log line — replaces the previous
  hardcoded "kernel <5.10 or libxdp.so missing" that was wrong half
  the time.

- ANYSCAN_INSTALL_LOAD_ONLY=1 hook lets unit tests source the script
  for hermetic helper testing without triggering main().

Test (new tools/test-install-worker-bundle-afxdp-probe.sh, 21 cases):

- parse_kernel_major_minor across 8 release shapes (clean 3-component,
  +deb13 suffix, -cloud-amd64 suffix, -rt suffix, 4.x, 2-component,
  1-component, empty).
- probe_afxdp_runtime_available with stubbed uname + ldconfig:
  c6in.metal 6.12.74 + libxdp → true (the bug 5 repro).
  Kernel 4.19 too old → false + stderr names the version.
  Kernel 5.9 vs 5.10 vs 5.11 boundary correctness.
  libxdp.so missing → false + stderr names the missing library.
  Empty/non-numeric uname → false + stderr names parse failure.

Co-authored-by: skullcmd <skullcmd@anyvm.tech>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
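
An illustrative Python rendering of the probe logic described above (the in-tree helper is shell and uses awk, but the parsing and reporting rules are the same):

```python
import re
import subprocess
import sys

# Illustrative only: hypothetical Python port of the shell probe described above.

def parse_kernel_major_minor(release):
    """Return (major, minor) from a kernel release string, or (0, 0) on failure.

    Handles 3-component releases such as 6.12.74-cloud-amd64, 5.10.0-13-amd64,
    6.12.74+deb13+1-amd64, and 5.4.282-rt.
    """
    match = re.match(r"^(\d+)\.(\d+)", release)
    if not match:
        return (0, 0)
    return (int(match.group(1)), int(match.group(2)))


def probe_afxdp_runtime_available():
    """True when the kernel is >= 5.10 and libxdp.so appears in ldconfig -p.

    On failure, emit a one-line reason to stderr so the operator can see which
    check fired; stay quiet on success.
    """
    release = subprocess.run(["uname", "-r"], capture_output=True,
                             text=True).stdout.strip()
    major, minor = parse_kernel_major_minor(release)
    if (major, minor) == (0, 0):
        print(f"af_xdp probe: could not parse kernel version {release!r}",
              file=sys.stderr)
        return False
    if (major, minor) < (5, 10):
        print(f"af_xdp probe: kernel {major}.{minor} < 5.10", file=sys.stderr)
        return False
    ldconfig = subprocess.run(["ldconfig", "-p"], capture_output=True,
                              text=True).stdout
    if "libxdp.so" not in ldconfig:
        print("af_xdp probe: libxdp.so not in ldconfig -p", file=sys.stderr)
        return False
    return True
```
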