feat(ec2-worker): scale launch path beyond 8 ENIs to c6in.metal max of 15 by skullcrushercmd · Pull Request #74 · AnyVM-Tech/AnyScan

skullcrushercmd · 2026-04-28T17:16:09Z

Summary

Adds opt-in multi-ENI attach to tools/ec2_worker_manager.py so c6in.metal-class workers can launch with up to 15 ENIs instead of the current single-NIC default. Uses DescribeInstanceTypes for both the MaximumNetworkInterfaces cap and the per-card layout, so a 15-ENI launch on c6in.metal lands 5/4/3/3 across the four physical cards rather than hard-failing on card 0's 5-slot limit.

This is the launch-path change. PR #64 already auto-discovers whatever ENIs are attached at boot and writes them to ANYSCAN_SCANNER_INTERFACES; PR #65's plan covers the AF_XDP I/O path that lets the extra NICs actually translate into pps. This PR closes the loop on the AWS side.

Default behavior is unchanged when ANYSCAN_MAX_ENIS is unset.

Behavior

`ANYSCAN_MAX_ENIS`	Launch payload	Notes
unset	Top-level `SubnetId`/`SecurityGroupIds`, single ENI	Pre-PR behavior, every existing fleet
`15` (recommended on c6in.metal)	`NetworkInterfaces=[15 entries with NetworkCardIndex 0..3]`	min(15, hw_cap), spread 5/4/3/3 across cards
`15` on c6in.xlarge	`NetworkInterfaces=[4 entries]`	Clamped to MaximumNetworkInterfaces=4
set, `DescribeInstanceTypes` denied	Single-NIC fallback + reason recorded	Operator can grant IAM perm and retry
set, no `ANYSCAN_EC2_SUBNET_ID`	Single-NIC fallback + reason recorded	Same fallback
`0`, `-1`, `"abc"`	`SystemExit` at config load	Loud-fail rather than silent single-NIC

New env vars (manager-side, sourced from the .external-runtime.env the systemd unit reads):

ANYSCAN_MAX_ENIS — opt-in cap on attached ENIs (defaults: unset = single-NIC). Recommended 15 on c6in.metal.
ANYSCAN_EC2_ENI_SUBNET_IDS — optional comma list of subnet IDs the secondary ENIs round-robin through. Single-subnet operators leave this unset.

Per-card placement

distribute_enis_across_cards() round-robins ENIs across cards in NetworkCardIndex order. For c6in.metal (NetworkCards = [{0:5}, {1:4}, {2:3}, {3:3}], total 15) a request for 15 ENIs places:

card 0 (cap 5): device-indexes [0, 1, 2, 3, 4]
card 1 (cap 4): device-indexes [0, 1, 2, 3]
card 2 (cap 3): device-indexes [0, 1, 2]
card 3 (cap 3): device-indexes [0, 1, 2]

The primary ENI (sequence 0) always lands on NetworkCardIndex 0, DeviceIndex 0 because AWS rejects a primary ENI on a non-zero card. Single-card instance types (any pre-c6in family — and the response payload shape used by mocked test helpers) skip NetworkCardIndex entirely so the legacy payload is unchanged.

Verification

New unit tests: 40 cases covering env parsing, hardware-cap detection, per-card distribution, and the run_instances launch payload. boto3 is stubbed at import time so tests run without AWS credentials.
```
$ python3 -m unittest tools.test_ec2_worker_manager
Ran 40 tests in 0.003s — OK
```

New bash sanity test: confirms tune-scanner-host.sh + reserve-control-bandwidth.sh comma-list iterators handle 15 ENIs and the install-worker-bundle.sh multi-NIC gate fires for a 15-entry candidate list (regression guard against any future hardcoded N=8 cap):

$ bash tools/test-install-worker-bundle-eni-discovery.sh
PASS: resolve_managed_interfaces handles 15 ENIs (got 15)
PASS: tune-scanner-host.sh resolve_ifaces handles 15 ENIs (got 15)
PASS: install-worker-bundle.sh multi-NIC gate triggers for 15-entry list

Existing tests untouched: test_vulnscanner_adapter_io_engine (16 ✓), test_vulnscanner_adapter_multinic (31 ✓), test_anyscan_rate_controller (53 ✓).

install-worker-bundle.sh N>8 audit

Per the brief, verified the existing per-NIC iteration paths handle N>8:

detect_host_scanner_eni_candidates (install-worker-bundle.sh:151-211) iterates ip -o link show up with no upper bound; the skip list (lo|docker*|br-*|veth*|tun*|tap*|wg*|zt*|cni*|cilium*|flannel*|kube-*) excludes none of the ENA naming conventions (ens*, eth*).
The "more than one entry" gate (install-worker-bundle.sh:312-317) uses the ${var%,*} self-comparison trick, which stays correct for any N≥2.
tune-scanner-host.sh:resolve_ifaces and reserve-control-bandwidth.sh:resolve_managed_interfaces both use for entry in $(printf '%s' "$provided" | tr ',;' ' ') — well under any shell word-list bound at N=15.
No policy-routing / route-table-numbering / per-NIC sysctl-rp_filter scripts exist in the repo (the brief's reference to those was directionally off; the per-NIC managed surface today is just ANYSCAN_RESERVE_INTERFACE + ANYSCAN_TUNE_INTERFACE, both already comma-list-aware from PR perf(portscan): multi-NIC sharding + ENI auto-discovery toward ENA spec ceiling #64).
AF_XDP queue/MTU prep is done at scanner runtime, not at install time, so no install-side change is needed for AF_XDP drv+copy mode.

Plan doc update

plans/2026-04-27-portscan-afxdp-plan-v1.md §6 (risk register) gains a new row + sub-section §6.1 documenting that ANYSCAN_RATE_MAX_CONCURRENT_SUBPROCESSES=4 stays the per-host CPU sweet spot regardless of NIC count. anygpt-4's c6in.metal bench data (4-NIC/cap=4 → 12.8 Mpps, 8-NIC/cap=8 → 1.3 Mpps CPU-thrash collapse, 8-NIC/cap=4 → ~12 Mpps parity) is the load-bearing evidence; the additional NICs unlock the kernel-upgrade zerocopy ceiling, not the userland subproc-cap one.

Out of scope (per brief)

No change to the cap=4 default (documented as the sweet spot, intentional).
No kernel upgrade (separate worker, blocked on ENA zerocopy enablement on kernel >6.12.74).
No DPDK or PF_RING (separate workers).
No change to runtime.worker.env.template — the new env vars are manager-side, sourced from .external-runtime.env.

Test plan

python3 -m unittest tools.test_ec2_worker_manager — 40 ✓
python3 -m unittest test_vulnscanner_adapter_multinic test_vulnscanner_adapter_io_engine test_anyscan_rate_controller — 100 ✓ (pre-existing baseline)
bash tools/test-install-worker-bundle-eni-discovery.sh — 3 ✓
Live verification on a freshly launched c6in.metal worker with ANYSCAN_MAX_ENIS=15 set in .external-runtime.env (pending operator-driven box launch + IAM grant for ec2:DescribeInstanceTypes).

References

PR perf(portscan): multi-NIC sharding + ENI auto-discovery toward ENA spec ceiling #64 — multi-NIC sharding + ENI auto-discovery (boot-side foundation)
PR docs(plans): AF_XDP integration plan for higher pps (Phase 1) #65 — AF_XDP integration plan (Phase 1 design doc)
Bench memory anyscan_afxdp_ena_constraint: ENA on kernel ≤6.12.74 forces drv+copy, which is why cap=4 doesn't move with NIC count today

🤖 Generated with Claude Code

…f 15 Adds opt-in multi-ENI attach to ec2_worker_manager. ANYSCAN_MAX_ENIS unset preserves the legacy single-NIC RunInstances payload shape; set to N (recommended 15 on c6in.metal) attaches min(N, hw_cap) ENIs at launch, spread across NetworkCards via DescribeInstanceTypes so a 15-ENI launch on c6in.metal lands 5/4/3/3 across the four physical cards instead of hard-failing on card 0's 5-slot limit. Also documents in plans/2026-04-27-portscan-afxdp-plan-v1.md §6.1 that ANYSCAN_RATE_MAX_CONCURRENT_SUBPROCESSES=4 stays the per-host CPU sweet spot regardless of NIC count — the additional NICs unlock a kernel- upgrade ceiling, not a userland cap one (anygpt-4 8/8 collapse data). - tools/ec2_worker_manager.py: new helpers (parse_max_enis, compute_target_eni_count, distribute_enis_across_cards, build_network_interfaces, eni_cap_from_describe_response, network_cards_from_describe_response); ManagerConfig.max_enis + eni_subnet_ids; recreate_instance() routes through NetworkInterfaces when opted in, falls back to single-NIC on Describe denial / missing subnet so existing fleets are unaffected. - tools/test_ec2_worker_manager.py: 40 unit tests covering env parsing, hardware-cap detection, per-card distribution, and the run_instances launch payload (boto3 stubbed at import). - tools/test-install-worker-bundle-eni-discovery.sh: bash sanity test confirms tune-scanner-host.sh + reserve-control-bandwidth.sh comma- list iterators handle 15 ENIs and the install-worker-bundle.sh multi-NIC gate fires for a 15-entry candidate list (regression guard against any future hardcoded N=8 cap). - plans/2026-04-27-portscan-afxdp-plan-v1.md: new §6.1 risk-register sub-section explaining why subproc cap stays at 4 when scaling NICs past 8. Default behavior unchanged when ANYSCAN_MAX_ENIS is unset. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 896768b0e3

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-04-28T17:20:18Z

+            if self.config.subnet_id:
+                launch_args["SubnetId"] = self.config.subnet_id


Preserve configured subnet pool on single-NIC fallback

When ANYSCAN_MAX_ENIS is set and DescribeInstanceTypes fails, target_count falls back to 1 and this branch only copies self.config.subnet_id into run_instances. If an operator relies on ANYSCAN_EC2_ENI_SUBNET_IDS (with ANYSCAN_EC2_SUBNET_ID unset), the launch request is sent without any subnet even though valid subnets were configured, so EC2 launch fails in accounts without a default VPC. Because recreate_instance() may terminate the current instance before launching the replacement, this can turn a transient describe-permission issue into worker downtime.

Useful? React with 👍 / 👎.

skullcrushercmd · 2026-04-28T18:50:31Z

Live verification on a freshly-launched c6in.metal — two findings to fold back into the merged code

Driven by anygpt-48; full bench numbers + cross-PR context on PR #65 issuecomment-4338158487. Flagging the two issues that affect this PR specifically.

1. The c6in.metal NetworkCards fixture in `tools/test_ec2_worker_manager.py` does not match AWS's live response

The PR's RecreateInstanceLaunchPathTests::test_max_enis_15_on_c6in_metal_emits_15_network_interfaces (and the docstring at tools/ec2_worker_manager.py:121-125) assume:

NetworkCards = [
  {NetworkCardIndex: 0, MaximumNetworkInterfaces: 5},  # primary
  {NetworkCardIndex: 1, MaximumNetworkInterfaces: 4},
  {NetworkCardIndex: 2, MaximumNetworkInterfaces: 3},
  {NetworkCardIndex: 3, MaximumNetworkInterfaces: 3},
]

Live aws ec2 describe-instance-types --instance-types c6in.metal --region us-east-1 (from this bench's launch-time describe call):

TopLevel MaximumNetworkInterfaces = 16
NetworkCards:
  card 0: max_nics=8, perf=Up to 170 Gigabit
  card 1: max_nics=8, perf=Up to 170 Gigabit
total via cards = 16

distribute_enis_across_cards(15, real_cards) against the real shape places 8 ENIs on card 0 and 7 on card 1, not 5/4/3/3 across 4 cards. The launch went through fine and 15 ENIs attached cleanly on this bench (and showed up as ens1..ens5, ens7, enp13s0, enp15s0, enp154s0..enp160s0 post-boot), so the launch-path code itself is correct — it's the verification that's mocking the wrong topology.

This matters because §6.1 of the plan doc and the PR commit body both lean on the "more cards = more PCIe trees = unlock kernel-upgrade zerocopy ceiling" framing. With only 2 PCIe trees (not 4), the multi-NIC headroom over the 8-NIC baseline is much smaller than projected. The bench data in PR #65 issuecomment-4338158487 confirms this empirically: 15-NIC AF_XDP at every T=R config tested regressed versus the 8-NIC cap=4 t=8 22.43 M peak from anygpt-42.

Suggested follow-up: rebuild the test fixture from a recorded describe-instance-types payload (or hit the live API in an integration test gated on AWS_INTEGRATION=1). Same applies to c6in.xlarge — the unit test for "clamps to 4" should be checked against the live shape.

2. The multi-ENI launch path skips public-IP allocation on the primary

build_network_interfaces() doesn't set AssociatePublicIpAddress=True on the primary NetworkInterfaces[0] entry. The legacy single-NIC path (top-level SubnetId) honored the subnet's MapPublicIpOnLaunch=true, so existing fleets had a public IP on the primary ENI; the new multi-ENI path doesn't, and AWS does not auto-assign when NetworkInterfaces[] is set explicitly.

Live impact on this bench: the c6in.metal came up with 15 private 172.31.x.y ENIs and no public IP — unreachable from scan.anyvm.tech (which connects via internet, not VPN). I worked around it with aws ec2 allocate-address + associate-address on the primary ENI to get the EIP 54.165.21.227, then SSH worked.

Suggested follow-up: set AssociatePublicIpAddress: True on the primary entry (NetworkCardIndex == 0 and DeviceIndex == 0) when the subnet's MapPublicIpOnLaunch is true (or unconditionally when there's no IPv6/EIP override). The single-NIC fallback path is fine because it doesn't use NetworkInterfaces[].

This isn't blocking — operators with VPN tunnels or AWS Console / SSM Session Manager access wouldn't have noticed — but it's the kind of thing that shows up the first time someone tries to drive a 15-NIC bench from outside the VPC.

What worked

40-test unit suite passes against the (incorrect) fixture, which means the placement logic itself is solid.

The launch-time eni_attach payload was correctly populated:

"eni_attach": {
  "requested": 15,
  "hardware_cap": 16,
  "network_cards": [{"NetworkCardIndex": 0, "MaximumNetworkInterfaces": 8, ...},
                    {"NetworkCardIndex": 1, "MaximumNetworkInterfaces": 8, ...}],
  "attached": 15,
  "subnet_pool": ["subnet-0a8e834fdf69c0839"]
}

All 15 ENIs attached, came up as ENA devices post-cloud-init, accepted MTU=3498 + ethtool -L combined 8 per the AF_XDP host-prep findings from anygpt-42, and AF_XDP attached cleanly in drv+copy mode. The compute_target_eni_count(16, 15) = 15 clamp logic is correct.
15-ENI auto-cleanup on instance termination worked: DeleteOnTermination=true was the default; all 15 ENIs were gone within the termination window without any manual delete-network-interface call.

So the env-knob plumbing is right; the test fixture (and the per-card narrative in the PR body) just need to match what AWS actually returns today.

… 4x5/4/3/3=15) (#77) PR #74 mocked NetworkCards as 4 cards distributed 5/4/3/3=15, but actual AWS DescribeInstanceTypes for c6in.metal returns 2 cards x 8 = 16 (anygpt-48 live bench, PR #65 issuecomment-4338158487). The launch path code is fine - distribute_enis_across_cards handles any card layout - but the synthetic test fixture and the docstring example encoded a shape that doesn't match production AWS. Refresh the fixture, the docstring, and every test that hardcoded 15-derived numbers. Add a new RecordedDescribeInstanceTypesIntegrityTests class that anchors the fixture against tools/c6in_metal_describe_instance_types.json (a real `aws ec2 describe-instance-types` capture) so future drift gets caught at unit-test time instead of bench time. Effect on capacity claim: c6in.metal has 2 PCIe trees, not 4, so the multi-NIC headroom caps at ~2x single-tree, not ~4x. Co-authored-by: skullcmd <skullcmd@anyvm.tech>

skullcrushercmd merged commit 98ee63a into main Apr 28, 2026

skullcrushercmd deleted the feat/scale-eni-cap-15 branch April 28, 2026 17:17

chatgpt-codex-connector Bot reviewed Apr 28, 2026

View reviewed changes

This was referenced Apr 28, 2026

feat(portscan): ANYSCAN_SCANNER_IO_ENGINE env knob + adapter --io-engine plumbing #70

Merged

docs(plans): AF_XDP integration plan for higher pps (Phase 1) #65

Merged

skullcrushercmd mentioned this pull request Apr 28, 2026

fix(ec2-worker): correct c6in.metal NetworkCards fixture (2×8=16, not 4×5/4/3/3=15) #77

Merged

2 tasks

skullcrushercmd mentioned this pull request Apr 28, 2026

feat(ec2-worker): ANYSCAN_EC2_ASSOCIATE_PUBLIC_IP knob for multi-ENI launches #79

Merged

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(ec2-worker): scale launch path beyond 8 ENIs to c6in.metal max of 15#74

feat(ec2-worker): scale launch path beyond 8 ENIs to c6in.metal max of 15#74
skullcrushercmd merged 1 commit intomainfrom
feat/scale-eni-cap-15

skullcrushercmd commented Apr 28, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Apr 28, 2026

Uh oh!

skullcrushercmd commented Apr 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		if self.config.subnet_id:
		launch_args["SubnetId"] = self.config.subnet_id

Conversation

skullcrushercmd commented Apr 28, 2026

Summary

Behavior

Per-card placement

Verification

install-worker-bundle.sh N>8 audit

Plan doc update

Out of scope (per brief)

Test plan

References

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Apr 28, 2026

Choose a reason for hiding this comment

Uh oh!

skullcrushercmd commented Apr 28, 2026

Live verification on a freshly-launched c6in.metal — two findings to fold back into the merged code

1. The c6in.metal NetworkCards fixture in tools/test_ec2_worker_manager.py does not match AWS's live response

2. The multi-ENI launch path skips public-IP allocation on the primary

What worked

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

1. The c6in.metal NetworkCards fixture in `tools/test_ec2_worker_manager.py` does not match AWS's live response