feat(ec2-worker): scale launch path beyond 8 ENIs to c6in.metal max of 15 #74
skullcrushercmd merged 1 commit into main
…f 15

Adds opt-in multi-ENI attach to ec2_worker_manager. ANYSCAN_MAX_ENIS unset preserves the legacy single-NIC RunInstances payload shape; set to N (recommended 15 on c6in.metal) attaches min(N, hw_cap) ENIs at launch, spread across NetworkCards via DescribeInstanceTypes so a 15-ENI launch on c6in.metal lands 5/4/3/3 across the four physical cards instead of hard-failing on card 0's 5-slot limit.

Also documents in plans/2026-04-27-portscan-afxdp-plan-v1.md §6.1 that ANYSCAN_RATE_MAX_CONCURRENT_SUBPROCESSES=4 stays the per-host CPU sweet spot regardless of NIC count: the additional NICs unlock a kernel-upgrade ceiling, not a userland cap one (anygpt-4 8/8 collapse data).

- tools/ec2_worker_manager.py: new helpers (parse_max_enis, compute_target_eni_count, distribute_enis_across_cards, build_network_interfaces, eni_cap_from_describe_response, network_cards_from_describe_response); ManagerConfig.max_enis + eni_subnet_ids; recreate_instance() routes through NetworkInterfaces when opted in, falls back to single-NIC on Describe denial / missing subnet so existing fleets are unaffected.
- tools/test_ec2_worker_manager.py: 40 unit tests covering env parsing, hardware-cap detection, per-card distribution, and the run_instances launch payload (boto3 stubbed at import).
- tools/test-install-worker-bundle-eni-discovery.sh: bash sanity test confirms tune-scanner-host.sh + reserve-control-bandwidth.sh comma-list iterators handle 15 ENIs and the install-worker-bundle.sh multi-NIC gate fires for a 15-entry candidate list (regression guard against any future hardcoded N=8 cap).
- plans/2026-04-27-portscan-afxdp-plan-v1.md: new §6.1 risk-register sub-section explaining why subproc cap stays at 4 when scaling NICs past 8.

Default behavior unchanged when ANYSCAN_MAX_ENIS is unset.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
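The opt-in rule in the commit message (unset keeps single-NIC, N attaches min(N, hw_cap)) can be sketched roughly as below. Only the helper names `parse_max_enis` and `compute_target_eni_count` come from the commit; the bodies, the signatures, and the choice to treat an empty string as unset are assumptions for illustration, not the merged code.

```python
from typing import Mapping, Optional


def parse_max_enis(env: Mapping[str, str]) -> Optional[int]:
    """Return None when ANYSCAN_MAX_ENIS is unset (legacy single-NIC path)."""
    raw = env.get("ANYSCAN_MAX_ENIS")
    # Assumption: an empty value is treated the same as unset.
    if raw is None or raw.strip() == "":
        return None
    try:
        value = int(raw.strip())
    except ValueError:
        # The PR description says values such as "abc" abort at config load.
        raise SystemExit(f"ANYSCAN_MAX_ENIS must be a positive integer, got {raw!r}") from None
    if value < 1:
        raise SystemExit(f"ANYSCAN_MAX_ENIS must be >= 1, got {value}")
    return value


def compute_target_eni_count(max_enis: Optional[int], hw_cap: Optional[int]) -> int:
    """Pick how many ENIs to attach at launch.

    hw_cap is the per-instance-type ENI limit, or None when it could not be
    determined (for example DescribeInstanceTypes was denied).
    """
    if max_enis is None or hw_cap is None:
        return 1  # not opted in, or cap unknown: stay on the single-NIC path
    return max(1, min(max_enis, hw_cap))


if __name__ == "__main__":
    print(compute_target_eni_count(parse_max_enis({"ANYSCAN_MAX_ENIS": "15"}), 4))
```

Under these assumptions a request for 15 on a type capped at 4 attaches 4 ENIs, and an unknown cap degrades to the legacy single-NIC launch.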
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 896768b0e3
    if self.config.subnet_id:
        launch_args["SubnetId"] = self.config.subnet_id
Preserve configured subnet pool on single-NIC fallback
When ANYSCAN_MAX_ENIS is set and DescribeInstanceTypes fails, target_count falls back to 1 and this branch only copies self.config.subnet_id into run_instances. If an operator relies on ANYSCAN_EC2_ENI_SUBNET_IDS (with ANYSCAN_EC2_SUBNET_ID unset), the launch request is sent without any subnet even though valid subnets were configured, so EC2 launch fails in accounts without a default VPC. Because recreate_instance() may terminate the current instance before launching the replacement, this can turn a transient describe-permission issue into worker downtime.
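One possible shape for the fix this comment asks for is sketched below: on the single-NIC fallback, prefer the explicitly configured subnet and otherwise borrow one from the ENI subnet pool. The function name `apply_single_nic_subnet` is hypothetical, and picking the first pool entry is an assumption; the merged code may resolve this differently.

```python
from typing import Any, Dict, Optional, Sequence


def apply_single_nic_subnet(
    launch_args: Dict[str, Any],
    subnet_id: Optional[str],
    eni_subnet_ids: Sequence[str],
) -> Dict[str, Any]:
    """Set SubnetId for a single-NIC launch without dropping a configured pool.

    subnet_id stands in for ANYSCAN_EC2_SUBNET_ID and eni_subnet_ids for the
    parsed ANYSCAN_EC2_ENI_SUBNET_IDS list.
    """
    # Assumption: when only the pool is configured, its first entry is an
    # acceptable home for the lone primary ENI.
    chosen = subnet_id or (eni_subnet_ids[0] if eni_subnet_ids else None)
    if chosen:
        launch_args["SubnetId"] = chosen
    return launch_args


if __name__ == "__main__":
    print(apply_single_nic_subnet({}, None, ["subnet-aaa", "subnet-bbb"]))
```

With neither variable set the launch arguments are left untouched, which matches the existing behavior for accounts that rely on a default VPC.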
Live verification on a freshly-launched c6in.metal: two findings to fold back into the merged code

Driven by anygpt-48; full bench numbers + cross-PR context on PR #65 issuecomment-4338158487. Flagging the two issues that affect this PR specifically.

1. The c6in.metal NetworkCards fixture in
… 4x5/4/3/3=15) (#77)

PR #74 mocked NetworkCards as 4 cards distributed 5/4/3/3=15, but actual AWS DescribeInstanceTypes for c6in.metal returns 2 cards x 8 = 16 (anygpt-48 live bench, PR #65 issuecomment-4338158487). The launch path code is fine, since distribute_enis_across_cards handles any card layout, but the synthetic test fixture and the docstring example encoded a shape that doesn't match production AWS.

Refresh the fixture, the docstring, and every test that hardcoded 15-derived numbers. Add a new RecordedDescribeInstanceTypesIntegrityTests class that anchors the fixture against tools/c6in_metal_describe_instance_types.json (a real `aws ec2 describe-instance-types` capture) so future drift gets caught at unit-test time instead of bench time.

Effect on capacity claim: c6in.metal has 2 PCIe trees, not 4, so the multi-NIC headroom caps at ~2x single-tree, not ~4x.

Co-authored-by: skullcmd <skullcmd@anyvm.tech>
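The idea of anchoring a synthetic fixture against a recorded capture can be sketched as follows. The embedded JSON is a hand-written stand-in that only mirrors the "2 cards x 8" statement above, not the real capture file; `fixture_drift` is a hypothetical name, and the body of `network_cards_from_describe_response` is an assumption about how such a helper might read the `NetworkInfo.NetworkCards` list.

```python
import json
from typing import Any, Dict, List, Sequence, Tuple

# Stand-in for the recorded capture (assumption: trimmed to the fields used here).
RECORDED_SAMPLE: Dict[str, Any] = json.loads("""
{"InstanceTypes": [{"InstanceType": "c6in.metal", "NetworkInfo": {"NetworkCards": [
  {"NetworkCardIndex": 0, "MaximumNetworkInterfaces": 8},
  {"NetworkCardIndex": 1, "MaximumNetworkInterfaces": 8}]}}]}
""")


def network_cards_from_describe_response(response: Dict[str, Any]) -> List[Tuple[int, int]]:
    """Return sorted (card_index, max_interfaces) pairs, or [] if absent."""
    types = response.get("InstanceTypes") or []
    if not types:
        return []
    cards = (types[0].get("NetworkInfo") or {}).get("NetworkCards") or []
    return sorted(
        (int(c["NetworkCardIndex"]), int(c["MaximumNetworkInterfaces"])) for c in cards
    )


def fixture_drift(fixture: Sequence[Tuple[int, int]], response: Dict[str, Any]) -> List[str]:
    """List human-readable differences between a test fixture and a capture."""
    recorded = network_cards_from_describe_response(response)
    problems: List[str] = []
    if len(fixture) != len(recorded):
        problems.append(f"card count: fixture={len(fixture)} recorded={len(recorded)}")
    fixture_total = sum(cap for _, cap in fixture)
    recorded_total = sum(cap for _, cap in recorded)
    if fixture_total != recorded_total:
        problems.append(f"total ENIs: fixture={fixture_total} recorded={recorded_total}")
    if not problems and sorted(fixture) != recorded:
        problems.append("per-card layout differs")
    return problems


if __name__ == "__main__":
    for line in fixture_drift([(0, 5), (1, 4), (2, 3), (3, 3)], RECORDED_SAMPLE):
        print(line)
```

A unit test built on this pattern would load the real capture from disk and fail whenever the list of problems is non-empty.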
Summary
Adds opt-in multi-ENI attach to `tools/ec2_worker_manager.py` so c6in.metal-class workers can launch with up to 15 ENIs instead of the current single-NIC default. Uses `DescribeInstanceTypes` for both the `MaximumNetworkInterfaces` cap and the per-card layout, so a 15-ENI launch on c6in.metal lands 5/4/3/3 across the four physical cards rather than hard-failing on card 0's 5-slot limit.

This is the launch-path change. PR #64 already auto-discovers whatever ENIs are attached at boot and writes them to `ANYSCAN_SCANNER_INTERFACES`; PR #65's plan covers the AF_XDP I/O path that lets the extra NICs actually translate into pps. This PR closes the loop on the AWS side.

Default behavior is unchanged when `ANYSCAN_MAX_ENIS` is unset.

Behavior
| `ANYSCAN_MAX_ENIS` | Launch behavior |
|---|---|
| unset | `SubnetId`/`SecurityGroupIds`, single ENI |
| `15` (recommended on c6in.metal) | `NetworkInterfaces=[15 entries with NetworkCardIndex 0..3]` |
| `15` on c6in.xlarge | `NetworkInterfaces=[4 entries]` |
| set, `DescribeInstanceTypes` denied | single-NIC fallback using `ANYSCAN_EC2_SUBNET_ID` |
| `0`, `-1`, `"abc"` | `SystemExit` at config load |

New env vars (manager-side, sourced from the `.external-runtime.env` the systemd unit reads):

- `ANYSCAN_MAX_ENIS`: opt-in cap on attached ENIs (default: unset = single-NIC). Recommended 15 on c6in.metal.
- `ANYSCAN_EC2_ENI_SUBNET_IDS`: optional comma list of subnet IDs the secondary ENIs round-robin through. Single-subnet operators leave this unset.

Per-card placement
`distribute_enis_across_cards()` round-robins ENIs across cards in `NetworkCardIndex` order. For c6in.metal (`NetworkCards = [{0:5}, {1:4}, {2:3}, {3:3}]`, total 15) a request for 15 ENIs places 5, 4, 3, and 3 ENIs on cards 0 through 3.

The primary ENI (sequence 0) always lands on `NetworkCardIndex 0, DeviceIndex 0` because AWS rejects a primary ENI on a non-zero card. Single-card instance types (any pre-c6in family, and the response payload shape used by mocked test helpers) skip `NetworkCardIndex` entirely so the legacy payload is unchanged.

Verification
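As a cross-check on the per-card placement described above, here is a minimal sketch of a capacity-aware round-robin. Only the helper name `distribute_enis_across_cards` comes from this PR; the signature and body are assumptions. The 5/4/3/3 layout is this PR's fixture, while the later live capture quoted in the conversation reported 2 cards x 8 for c6in.metal, so both are exercised.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def distribute_enis_across_cards(target: int, cards: Sequence[Tuple[int, int]]) -> List[int]:
    """Return the NetworkCardIndex chosen for each ENI, in attach order.

    cards is a sequence of (card_index, max_interfaces) pairs. Cards are
    visited in index order; a full card is skipped on later rounds.
    """
    ordered = sorted(cards)
    remaining = dict(ordered)
    placement: List[int] = []
    while len(placement) < target:
        placed_this_round = False
        for card_index, _capacity in ordered:
            if len(placement) == target:
                break
            if remaining[card_index] > 0:
                remaining[card_index] -= 1
                placement.append(card_index)
                placed_this_round = True
        if not placed_this_round:
            break  # every card is full; the request exceeds the hardware cap
    return placement


if __name__ == "__main__":
    pr_fixture = [(0, 5), (1, 4), (2, 3), (3, 3)]
    live_layout = [(0, 8), (1, 8)]
    for layout in (pr_fixture, live_layout):
        placement = distribute_enis_across_cards(15, layout)
        # Sequence 0 is the primary ENI and lands on card 0 as long as
        # card 0 has at least one free slot.
        print(placement[0], dict(Counter(placement)))
```

Because the loop only consults per-card capacities, the same function covers either layout, which is consistent with the follow-up note that the launch path handles any card layout.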
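- `tools/test_ec2_worker_manager.py`: 40 unit tests covering env parsing, hardware-cap detection, per-card distribution, and the `run_instances` launch payload. boto3 is stubbed at import time so tests run without AWS credentials.
- `tools/test-install-worker-bundle-eni-discovery.sh`: bash sanity test confirming that the `tune-scanner-host.sh` + `reserve-control-bandwidth.sh` comma-list iterators handle 15 ENIs and that the `install-worker-bundle.sh` multi-NIC gate fires for a 15-entry candidate list (regression guard against any future hardcoded N=8 cap).
- Pre-existing suites: `test_vulnscanner_adapter_io_engine` (16 ✓), `test_vulnscanner_adapter_multinic` (31 ✓), `test_anyscan_rate_controller` (53 ✓).

install-worker-bundle.sh N>8 audit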
Per the brief, verified the existing per-NIC iteration paths handle N>8:
- `detect_host_scanner_eni_candidates` (`install-worker-bundle.sh:151-211`) iterates `ip -o link show up` with no upper bound; the skip list (`lo|docker*|br-*|veth*|tun*|tap*|wg*|zt*|cni*|cilium*|flannel*|kube-*`) matches none of the ENA naming conventions (`ens*`, `eth*`), so no ENA interface is skipped.
- The multi-NIC gate (`install-worker-bundle.sh:312-317`) uses the `${var%,*}` self-comparison trick, which stays correct for any N≥2.
- `tune-scanner-host.sh:resolve_ifaces` and `reserve-control-bandwidth.sh:resolve_managed_interfaces` both use `for entry in $(printf '%s' "$provided" | tr ',;' ' ')`, well under any shell word-list bound at N=15.
- … (`ANYSCAN_RESERVE_INTERFACE` + `ANYSCAN_TUNE_INTERFACE`, both already comma-list-aware from PR #64, "perf(portscan): multi-NIC sharding + ENI auto-discovery toward ENA spec ceiling").
- … `drv+copy` mode.

Plan doc update
`plans/2026-04-27-portscan-afxdp-plan-v1.md` §6 (risk register) gains a new row + sub-section §6.1 documenting that `ANYSCAN_RATE_MAX_CONCURRENT_SUBPROCESSES=4` stays the per-host CPU sweet spot regardless of NIC count. anygpt-4's c6in.metal bench data (4-NIC/cap=4 → 12.8 Mpps, 8-NIC/cap=8 → 1.3 Mpps CPU-thrash collapse, 8-NIC/cap=4 → ~12 Mpps parity) is the load-bearing evidence; the additional NICs unlock the kernel-upgrade `zerocopy` ceiling, not the userland subproc-cap one.

Out of scope (per brief)
- … `cap=4` default (documented as the sweet spot, intentional).
- … (`zerocopy` enablement on kernel >6.12.74).
- `runtime.worker.env.template`: the new env vars are manager-side, sourced from `.external-runtime.env`.

Test plan
- `python3 -m unittest tools.test_ec2_worker_manager`: 40 ✓
- `python3 -m unittest test_vulnscanner_adapter_multinic test_vulnscanner_adapter_io_engine test_anyscan_rate_controller`: 100 ✓ (pre-existing baseline)
- `bash tools/test-install-worker-bundle-eni-discovery.sh`: 3 ✓
- … `ANYSCAN_MAX_ENIS=15` set in `.external-runtime.env` (pending operator-driven box launch + IAM grant for `ec2:DescribeInstanceTypes`).

References
- `anyscan_afxdp_ena_constraint`: ENA on kernel ≤6.12.74 forces `drv+copy`, which is why `cap=4` doesn't move with NIC count today

🤖 Generated with Claude Code