fix(ec2-worker): correct c6in.metal NetworkCards fixture (2×8=16, not 4×5/4/3/3=15) #77

Merged
skullcrushercmd merged 1 commit into main from fix/c6in-metal-network-card-fixture
Apr 28, 2026

@skullcrushercmd
Contributor

Summary

PR #74's c6in.metal NetworkCards fixture mocked AWS as returning 4 cards distributed 5/4/3/3=15. The anygpt-48 live bench (PR #65 issuecomment-4338158487) showed AWS actually returns 2 cards × 8 = 16:

TopLevel MaximumNetworkInterfaces = 16
NetworkCards:
  card 0: max_nics=8, perf=Up to 170 Gigabit
  card 1: max_nics=8, perf=Up to 170 Gigabit
total via cards = 16

The launch path code itself is fine — distribute_enis_across_cards handles any per-card layout — but the synthetic mock encoded a shape that doesn't match production. The 40-test unit suite passed against the wrong shape, so the bug only surfaced at bench time.
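
To illustrate why the launch path tolerates either layout, here is a minimal sketch of a cap-aware round-robin. It is not the code in tools/ec2_worker_manager.py: the function name `distribute_round_robin` and its signature are hypothetical, and the only behaviour taken from this PR is that a partial count on the 2-card layout cycles as [0, 1, 0, 1].

```python
from typing import List


def distribute_round_robin(card_caps: List[int], count: int) -> List[int]:
    """Return one card index per ENI, cycling across cards.

    Hypothetical helper for illustration only. card_caps[i] is the
    per-card MaximumNetworkInterfaces for card i.
    """
    if count > sum(card_caps):
        raise ValueError(
            f"requested {count} ENIs but the cards hold only {sum(card_caps)}"
        )
    used = [0] * len(card_caps)
    assignment: List[int] = []
    card = 0
    while len(assignment) < count:
        # Skip a card that has already reached its cap.
        if used[card] < card_caps[card]:
            assignment.append(card)
            used[card] += 1
        card = (card + 1) % len(card_caps)
    return assignment


# Real c6in.metal layout from the bench: 2 cards x 8.
partial = distribute_round_robin([8, 8], 4)
full = distribute_round_robin([8, 8], 16)

# The shape PR #74 mocked: 4 cards, 5/4/3/3.
mocked = distribute_round_robin([5, 4, 3, 3], 15)
```

Because the function only reads the list of caps, it works unchanged on both shapes. That is why a wrong fixture could pass every unit test: the tests exercised the algorithm correctly, just against a layout AWS does not return.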

What changed

  • tools/ec2_worker_manager.py — docstring + comment now describe the real 2×8 layout
  • tools/test_ec2_worker_manager.py::_c6in_metal_describe() — fixture refreshed to 2 cards × 8 = 16, with NetworkPerformance strings included for parity with the AWS shape
  • All tests that hardcoded 15 against c6in.metal updated to 16; test_c6in_metal_15_eni_layout_respects_per_card_caps renamed and rewritten for the 8/8 split; test_partial_count_round_robins_across_cards updated for the 2-card round-robin pattern [0, 1, 0, 1]
  • New: tools/c6in_metal_describe_instance_types.json — verbatim recorded aws ec2 describe-instance-types payload
  • New: RecordedDescribeInstanceTypesIntegrityTests — 5 tests that load the recorded JSON and assert the synthetic fixture agrees with it on every load-bearing field. Future AWS-side drift gets caught at unit-test time, not after a 36-min metal launch.
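
The integrity-test idea in the last bullet can be sketched roughly as follows. This is not the RecordedDescribeInstanceTypesIntegrityTests class itself: the class name, the helper names, and the choice of which fields count as load-bearing are assumptions. The inline JSON is a trimmed stand-in reconstructed from the bench output in the Summary, not the verbatim contents of tools/c6in_metal_describe_instance_types.json.

```python
import json
import unittest

# Trimmed stand-in for the recorded payload (assumption: the real file
# carries many more fields; only the ones compared here are shown).
RECORDED_JSON = """
{"InstanceTypes": [{"InstanceType": "c6in.metal", "NetworkInfo": {
  "MaximumNetworkInterfaces": 16,
  "NetworkCards": [
    {"NetworkCardIndex": 0, "NetworkPerformance": "Up to 170 Gigabit",
     "MaximumNetworkInterfaces": 8},
    {"NetworkCardIndex": 1, "NetworkPerformance": "Up to 170 Gigabit",
     "MaximumNetworkInterfaces": 8}
  ]}}]}
"""


def synthetic_c6in_metal_describe() -> dict:
    """Hypothetical synthetic fixture with the refreshed 2 x 8 layout."""
    cards = [
        {"NetworkCardIndex": i, "NetworkPerformance": "Up to 170 Gigabit",
         "MaximumNetworkInterfaces": 8}
        for i in range(2)
    ]
    return {"InstanceTypes": [{
        "InstanceType": "c6in.metal",
        "NetworkInfo": {"MaximumNetworkInterfaces": 16, "NetworkCards": cards},
    }]}


def load_bearing_fields(payload: dict) -> dict:
    """Extract only the fields an ENI-distribution path would depend on."""
    info = payload["InstanceTypes"][0]["NetworkInfo"]
    return {
        "top_level_max": info["MaximumNetworkInterfaces"],
        "cards": [
            (c["NetworkCardIndex"], c["MaximumNetworkInterfaces"],
             c["NetworkPerformance"])
            for c in info["NetworkCards"]
        ],
    }


class FixtureAgreesWithRecordingSketch(unittest.TestCase):
    def setUp(self):
        self.recorded = load_bearing_fields(json.loads(RECORDED_JSON))
        self.synthetic = load_bearing_fields(synthetic_c6in_metal_describe())

    def test_fixture_matches_recording(self):
        self.assertEqual(self.synthetic, self.recorded)

    def test_card_caps_sum_to_top_level_max(self):
        total = sum(cap for _, cap, _ in self.recorded["cards"])
        self.assertEqual(total, self.recorded["top_level_max"])
```

Comparing a projection of the two payloads, rather than the full dictionaries, keeps the synthetic fixture small while still failing as soon as a card count, a per-card cap, or a performance string drifts from the recording.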

Effect on capacity claims

The "more PCIe trees" justification in PR #74's commit body needs scaling down to match reality: c6in.metal has 2 trees, not 4, so multi-NIC headroom caps at ~2× single-tree, not ~4×. (The live bench in anygpt-48 confirmed this empirically: 15-NIC at drv+copy regressed vs 8-NIC because the bottleneck is per-NIC CPU descriptor copy, not NIC count.)

Test plan

  • python3 -m unittest tools.test_ec2_worker_manager -v → 45 tests pass (40 existing + 5 new integrity tests)
  • No production-code changes — the launch path is unchanged

🤖 Generated with Claude Code

… 4x5/4/3/3=15)

PR #74 mocked NetworkCards as 4 cards distributed 5/4/3/3=15, but actual
AWS DescribeInstanceTypes for c6in.metal returns 2 cards x 8 = 16
(anygpt-48 live bench, PR #65 issuecomment-4338158487). The launch path
code is fine - distribute_enis_across_cards handles any card layout -
but the synthetic test fixture and the docstring example encoded a
shape that doesn't match production AWS.

Refresh the fixture, the docstring, and every test that hardcoded
15-derived numbers. Add a new RecordedDescribeInstanceTypesIntegrityTests
class that anchors the fixture against
tools/c6in_metal_describe_instance_types.json (a real
`aws ec2 describe-instance-types` capture) so future drift gets caught
at unit-test time instead of bench time.

Effect on capacity claim: c6in.metal has 2 PCIe trees, not 4, so the
multi-NIC headroom caps at ~2x single-tree, not ~4x.