Skip to content

feat(ec2-worker): scale launch path beyond 8 ENIs to c6in.metal max of 15#74

Merged
skullcrushercmd merged 1 commit intomainfrom
feat/scale-eni-cap-15
Apr 28, 2026
Merged

feat(ec2-worker): scale launch path beyond 8 ENIs to c6in.metal max of 15#74
skullcrushercmd merged 1 commit intomainfrom
feat/scale-eni-cap-15

Conversation

@skullcrushercmd
Copy link
Copy Markdown
Contributor

Summary

Adds opt-in multi-ENI attach to tools/ec2_worker_manager.py so c6in.metal-class workers can launch with up to 15 ENIs instead of the current single-NIC default. Uses DescribeInstanceTypes for both the MaximumNetworkInterfaces cap and the per-card layout, so a 15-ENI launch on c6in.metal lands 5/4/3/3 across the four physical cards rather than hard-failing on card 0's 5-slot limit.

This is the launch-path change. PR #64 already auto-discovers whatever ENIs are attached at boot and writes them to ANYSCAN_SCANNER_INTERFACES; PR #65's plan covers the AF_XDP I/O path that lets the extra NICs actually translate into pps. This PR closes the loop on the AWS side.

Default behavior is unchanged when ANYSCAN_MAX_ENIS is unset.

Behavior

ANYSCAN_MAX_ENIS Launch payload Notes
unset Top-level SubnetId/SecurityGroupIds, single ENI Pre-PR behavior, every existing fleet
15 (recommended on c6in.metal) NetworkInterfaces=[15 entries with NetworkCardIndex 0..3] min(15, hw_cap), spread 5/4/3/3 across cards
15 on c6in.xlarge NetworkInterfaces=[4 entries] Clamped to MaximumNetworkInterfaces=4
set, DescribeInstanceTypes denied Single-NIC fallback + reason recorded Operator can grant IAM perm and retry
set, no ANYSCAN_EC2_SUBNET_ID Single-NIC fallback + reason recorded Same fallback
0, -1, "abc" SystemExit at config load Loud-fail rather than silent single-NIC

New env vars (manager-side, sourced from the .external-runtime.env the systemd unit reads):

  • ANYSCAN_MAX_ENIS — opt-in cap on attached ENIs (defaults: unset = single-NIC). Recommended 15 on c6in.metal.
  • ANYSCAN_EC2_ENI_SUBNET_IDS — optional comma list of subnet IDs the secondary ENIs round-robin through. Single-subnet operators leave this unset.

Per-card placement

distribute_enis_across_cards() round-robins ENIs across cards in NetworkCardIndex order. For c6in.metal (NetworkCards = [{0:5}, {1:4}, {2:3}, {3:3}], total 15) a request for 15 ENIs places:

card 0 (cap 5): device-indexes [0, 1, 2, 3, 4]
card 1 (cap 4): device-indexes [0, 1, 2, 3]
card 2 (cap 3): device-indexes [0, 1, 2]
card 3 (cap 3): device-indexes [0, 1, 2]

The primary ENI (sequence 0) always lands on NetworkCardIndex 0, DeviceIndex 0 because AWS rejects a primary ENI on a non-zero card. Single-card instance types (any pre-c6in family — and the response payload shape used by mocked test helpers) skip NetworkCardIndex entirely so the legacy payload is unchanged.

Verification

  • New unit tests: 40 cases covering env parsing, hardware-cap detection, per-card distribution, and the run_instances launch payload. boto3 is stubbed at import time so tests run without AWS credentials.
    $ python3 -m unittest tools.test_ec2_worker_manager
    Ran 40 tests in 0.003s — OK
    
  • New bash sanity test: confirms tune-scanner-host.sh + reserve-control-bandwidth.sh comma-list iterators handle 15 ENIs and the install-worker-bundle.sh multi-NIC gate fires for a 15-entry candidate list (regression guard against any future hardcoded N=8 cap):
    $ bash tools/test-install-worker-bundle-eni-discovery.sh
    PASS: resolve_managed_interfaces handles 15 ENIs (got 15)
    PASS: tune-scanner-host.sh resolve_ifaces handles 15 ENIs (got 15)
    PASS: install-worker-bundle.sh multi-NIC gate triggers for 15-entry list
    
  • Existing tests untouched: test_vulnscanner_adapter_io_engine (16 ✓), test_vulnscanner_adapter_multinic (31 ✓), test_anyscan_rate_controller (53 ✓).

install-worker-bundle.sh N>8 audit

Per the brief, verified the existing per-NIC iteration paths handle N>8:

  • detect_host_scanner_eni_candidates (install-worker-bundle.sh:151-211) iterates ip -o link show up with no upper bound; the skip list (lo|docker*|br-*|veth*|tun*|tap*|wg*|zt*|cni*|cilium*|flannel*|kube-*) excludes none of the ENA naming conventions (ens*, eth*).
  • The "more than one entry" gate (install-worker-bundle.sh:312-317) uses the ${var%,*} self-comparison trick, which stays correct for any N≥2.
  • tune-scanner-host.sh:resolve_ifaces and reserve-control-bandwidth.sh:resolve_managed_interfaces both use for entry in $(printf '%s' "$provided" | tr ',;' ' ') — well under any shell word-list bound at N=15.
  • No policy-routing / route-table-numbering / per-NIC sysctl-rp_filter scripts exist in the repo (the brief's reference to those was directionally off; the per-NIC managed surface today is just ANYSCAN_RESERVE_INTERFACE + ANYSCAN_TUNE_INTERFACE, both already comma-list-aware from PR perf(portscan): multi-NIC sharding + ENI auto-discovery toward ENA spec ceiling #64).
  • AF_XDP queue/MTU prep is done at scanner runtime, not at install time, so no install-side change is needed for AF_XDP drv+copy mode.

Plan doc update

plans/2026-04-27-portscan-afxdp-plan-v1.md §6 (risk register) gains a new row + sub-section §6.1 documenting that ANYSCAN_RATE_MAX_CONCURRENT_SUBPROCESSES=4 stays the per-host CPU sweet spot regardless of NIC count. anygpt-4's c6in.metal bench data (4-NIC/cap=4 → 12.8 Mpps, 8-NIC/cap=8 → 1.3 Mpps CPU-thrash collapse, 8-NIC/cap=4 → ~12 Mpps parity) is the load-bearing evidence; the additional NICs unlock the kernel-upgrade zerocopy ceiling, not the userland subproc-cap one.

Out of scope (per brief)

  • No change to the cap=4 default (documented as the sweet spot, intentional).
  • No kernel upgrade (separate worker, blocked on ENA zerocopy enablement on kernel >6.12.74).
  • No DPDK or PF_RING (separate workers).
  • No change to runtime.worker.env.template — the new env vars are manager-side, sourced from .external-runtime.env.

Test plan

  • python3 -m unittest tools.test_ec2_worker_manager — 40 ✓
  • python3 -m unittest test_vulnscanner_adapter_multinic test_vulnscanner_adapter_io_engine test_anyscan_rate_controller — 100 ✓ (pre-existing baseline)
  • bash tools/test-install-worker-bundle-eni-discovery.sh — 3 ✓
  • Live verification on a freshly launched c6in.metal worker with ANYSCAN_MAX_ENIS=15 set in .external-runtime.env (pending operator-driven box launch + IAM grant for ec2:DescribeInstanceTypes).

References

🤖 Generated with Claude Code

…f 15

Adds opt-in multi-ENI attach to ec2_worker_manager. ANYSCAN_MAX_ENIS
unset preserves the legacy single-NIC RunInstances payload shape; set to
N (recommended 15 on c6in.metal) attaches min(N, hw_cap) ENIs at launch,
spread across NetworkCards via DescribeInstanceTypes so a 15-ENI launch
on c6in.metal lands 5/4/3/3 across the four physical cards instead of
hard-failing on card 0's 5-slot limit.

Also documents in plans/2026-04-27-portscan-afxdp-plan-v1.md §6.1 that
ANYSCAN_RATE_MAX_CONCURRENT_SUBPROCESSES=4 stays the per-host CPU sweet
spot regardless of NIC count — the additional NICs unlock a kernel-
upgrade ceiling, not a userland cap one (anygpt-4 8/8 collapse data).

- tools/ec2_worker_manager.py: new helpers (parse_max_enis,
  compute_target_eni_count, distribute_enis_across_cards,
  build_network_interfaces, eni_cap_from_describe_response,
  network_cards_from_describe_response); ManagerConfig.max_enis +
  eni_subnet_ids; recreate_instance() routes through NetworkInterfaces
  when opted in, falls back to single-NIC on Describe denial / missing
  subnet so existing fleets are unaffected.
- tools/test_ec2_worker_manager.py: 40 unit tests covering env parsing,
  hardware-cap detection, per-card distribution, and the run_instances
  launch payload (boto3 stubbed at import).
- tools/test-install-worker-bundle-eni-discovery.sh: bash sanity test
  confirms tune-scanner-host.sh + reserve-control-bandwidth.sh comma-
  list iterators handle 15 ENIs and the install-worker-bundle.sh
  multi-NIC gate fires for a 15-entry candidate list (regression guard
  against any future hardcoded N=8 cap).
- plans/2026-04-27-portscan-afxdp-plan-v1.md: new §6.1 risk-register
  sub-section explaining why subproc cap stays at 4 when scaling NICs
  past 8.

Default behavior unchanged when ANYSCAN_MAX_ENIS is unset.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@skullcrushercmd skullcrushercmd merged commit 98ee63a into main Apr 28, 2026
@skullcrushercmd skullcrushercmd deleted the feat/scale-eni-cap-15 branch April 28, 2026 17:17
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 896768b0e3

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +959 to +960
if self.config.subnet_id:
launch_args["SubnetId"] = self.config.subnet_id
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve configured subnet pool on single-NIC fallback

When ANYSCAN_MAX_ENIS is set and DescribeInstanceTypes fails, target_count falls back to 1 and this branch only copies self.config.subnet_id into run_instances. If an operator relies on ANYSCAN_EC2_ENI_SUBNET_IDS (with ANYSCAN_EC2_SUBNET_ID unset), the launch request is sent without any subnet even though valid subnets were configured, so EC2 launch fails in accounts without a default VPC. Because recreate_instance() may terminate the current instance before launching the replacement, this can turn a transient describe-permission issue into worker downtime.

Useful? React with 👍 / 👎.

@skullcrushercmd
Copy link
Copy Markdown
Contributor Author

Live verification on a freshly-launched c6in.metal — two findings to fold back into the merged code

Driven by anygpt-48; full bench numbers + cross-PR context on PR #65 issuecomment-4338158487. Flagging the two issues that affect this PR specifically.

1. The c6in.metal NetworkCards fixture in tools/test_ec2_worker_manager.py does not match AWS's live response

The PR's RecreateInstanceLaunchPathTests::test_max_enis_15_on_c6in_metal_emits_15_network_interfaces (and the docstring at tools/ec2_worker_manager.py:121-125) assume:

NetworkCards = [
  {NetworkCardIndex: 0, MaximumNetworkInterfaces: 5},  # primary
  {NetworkCardIndex: 1, MaximumNetworkInterfaces: 4},
  {NetworkCardIndex: 2, MaximumNetworkInterfaces: 3},
  {NetworkCardIndex: 3, MaximumNetworkInterfaces: 3},
]

Live aws ec2 describe-instance-types --instance-types c6in.metal --region us-east-1 (from this bench's launch-time describe call):

TopLevel MaximumNetworkInterfaces = 16
NetworkCards:
  card 0: max_nics=8, perf=Up to 170 Gigabit
  card 1: max_nics=8, perf=Up to 170 Gigabit
total via cards = 16

distribute_enis_across_cards(15, real_cards) against the real shape places 8 ENIs on card 0 and 7 on card 1, not 5/4/3/3 across 4 cards. The launch went through fine and 15 ENIs attached cleanly on this bench (and showed up as ens1..ens5, ens7, enp13s0, enp15s0, enp154s0..enp160s0 post-boot), so the launch-path code itself is correct — it's the verification that's mocking the wrong topology.

This matters because §6.1 of the plan doc and the PR commit body both lean on the "more cards = more PCIe trees = unlock kernel-upgrade zerocopy ceiling" framing. With only 2 PCIe trees (not 4), the multi-NIC headroom over the 8-NIC baseline is much smaller than projected. The bench data in PR #65 issuecomment-4338158487 confirms this empirically: 15-NIC AF_XDP at every T=R config tested regressed versus the 8-NIC cap=4 t=8 22.43 M peak from anygpt-42.

Suggested follow-up: rebuild the test fixture from a recorded describe-instance-types payload (or hit the live API in an integration test gated on AWS_INTEGRATION=1). Same applies to c6in.xlarge — the unit test for "clamps to 4" should be checked against the live shape.

2. The multi-ENI launch path skips public-IP allocation on the primary

build_network_interfaces() doesn't set AssociatePublicIpAddress=True on the primary NetworkInterfaces[0] entry. The legacy single-NIC path (top-level SubnetId) honored the subnet's MapPublicIpOnLaunch=true, so existing fleets had a public IP on the primary ENI; the new multi-ENI path doesn't, and AWS does not auto-assign when NetworkInterfaces[] is set explicitly.

Live impact on this bench: the c6in.metal came up with 15 private 172.31.x.y ENIs and no public IP — unreachable from scan.anyvm.tech (which connects via internet, not VPN). I worked around it with aws ec2 allocate-address + associate-address on the primary ENI to get the EIP 54.165.21.227, then SSH worked.

Suggested follow-up: set AssociatePublicIpAddress: True on the primary entry (NetworkCardIndex == 0 and DeviceIndex == 0) when the subnet's MapPublicIpOnLaunch is true (or unconditionally when there's no IPv6/EIP override). The single-NIC fallback path is fine because it doesn't use NetworkInterfaces[].

This isn't blocking — operators with VPN tunnels or AWS Console / SSM Session Manager access wouldn't have noticed — but it's the kind of thing that shows up the first time someone tries to drive a 15-NIC bench from outside the VPC.

What worked

  • 40-test unit suite passes against the (incorrect) fixture, which means the placement logic itself is solid.
  • The launch-time eni_attach payload was correctly populated:
    "eni_attach": {
      "requested": 15,
      "hardware_cap": 16,
      "network_cards": [{"NetworkCardIndex": 0, "MaximumNetworkInterfaces": 8, ...},
                        {"NetworkCardIndex": 1, "MaximumNetworkInterfaces": 8, ...}],
      "attached": 15,
      "subnet_pool": ["subnet-0a8e834fdf69c0839"]
    }
    
  • All 15 ENIs attached, came up as ENA devices post-cloud-init, accepted MTU=3498 + ethtool -L combined 8 per the AF_XDP host-prep findings from anygpt-42, and AF_XDP attached cleanly in drv+copy mode. The compute_target_eni_count(16, 15) = 15 clamp logic is correct.
  • 15-ENI auto-cleanup on instance termination worked: DeleteOnTermination=true was the default; all 15 ENIs were gone within the termination window without any manual delete-network-interface call.

So the env-knob plumbing is right; the test fixture (and the per-card narrative in the PR body) just need to match what AWS actually returns today.

skullcrushercmd added a commit that referenced this pull request Apr 28, 2026
… 4x5/4/3/3=15) (#77)

PR #74 mocked NetworkCards as 4 cards distributed 5/4/3/3=15, but actual
AWS DescribeInstanceTypes for c6in.metal returns 2 cards x 8 = 16
(anygpt-48 live bench, PR #65 issuecomment-4338158487). The launch path
code is fine - distribute_enis_across_cards handles any card layout -
but the synthetic test fixture and the docstring example encoded a
shape that doesn't match production AWS.

Refresh the fixture, the docstring, and every test that hardcoded
15-derived numbers. Add a new RecordedDescribeInstanceTypesIntegrityTests
class that anchors the fixture against
tools/c6in_metal_describe_instance_types.json (a real
`aws ec2 describe-instance-types` capture) so future drift gets caught
at unit-test time instead of bench time.

Effect on capacity claim: c6in.metal has 2 PCIe trees, not 4, so the
multi-NIC headroom caps at ~2x single-tree, not ~4x.

Co-authored-by: skullcmd <skullcmd@anyvm.tech>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant