fix(ci): give AVM check-circuit more CPU/timeout so large txs don't time out#24493

Draft
AztecBot wants to merge 1 commit into next from cb/895969e9a98e

@AztecBot AztecBot commented Jul 3, 2026

What

The nightly AVM Circuit Inputs Collection and Check job (avm-check-circuit) failed on next: run 28637998116.

Root cause

Exactly one of the per-tx avm_check_circuit invocations timed out (exit code 124), which fails the whole parallelize run:

FAILED (ccebcf9b6b3aa6af): bb-avm avm_check_circuit -v --avm-inputs .../e2e_multiple_blobs/avm-circuit-inputs-tx-0x0c37df31...bin (31s) (code: 124)
run_test_cmd '...:ISOLATE=1:TIMEOUT=30s:NAME=avm_cc_e2e_multiple_blobs_0x0c37df31 ...'

The individual test log shows simulation and trace generation completed fine; it was the circuit check that ran out of time:

04:56:50 Checking circuit...
04:56:50 Running check (with skippable) circuit over 700560 rows.
04:57:13 timeout: sending signal TERM to command 'bash'
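
For reference, exit code 124 is what GNU coreutils `timeout` returns when it has to kill a command that outlived its limit. The sketch below reproduces that behaviour with a stand-in `sleep`; it assumes GNU `timeout` is on the PATH and has nothing to do with the real bb-avm invocation.

```shell
# Stand-in for a long-running check: sleep 5s under a 1s limit.
# GNU timeout sends TERM when the limit expires and then exits with 124.
rc=0
timeout 1s sleep 5 || rc=$?
echo "exit code: $rc"
```

A non-124 exit code from the same wrapper would instead point at the command itself failing, which is why 124 alone identifies this as a timeout rather than a failed check.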

This is a performance/timeout failure, not a correctness one. The e2e_multiple_blobs tx 0x0c37df31… produces a ~700k-row circuit whose check does not finish within the per-container TIMEOUT=30s when the container is capped at the default --cpus=2. Every other (smaller) tx checked in 1–3s. Container resources were CPUS=2 MEM=8g, peak mem only ~2.2 GiB, so memory was not the constraint — CPU/wall-clock was.

Yesterday's #24458 ("reduce parallelism for e2e check circuit — it was timing out") lowered concurrency to parallelize 16, but on a 64-core CI fleet instance that already leaves the host undersubscribed (16 × 2 = 32 of 64 cores), so contention between containers was not the bottleneck. Lowering concurrency does not raise a single container's --cpus quota, so one large circuit still cannot finish in 30s, hence the recurrence.

Fix

Bump the per-container resources for these check-circuit commands:

- local prefix="$hash:ISOLATE=1:TIMEOUT=30s"
+ local prefix="$hash:ISOLATE=1:TIMEOUT=90s:CPUS=4"
  • CPUS=4: check_circuit parallelises over rows, so doubling the CPU quota should roughly halve the check time for the large circuit, bringing it comfortably under the new timeout. MEM auto-derives to 16g (MEM=${MEM:-$((CPUS*4))g}), still far above the ~2.2 GiB peak.
  • TIMEOUT=90s: 3× the previous limit, as headroom to absorb variance and future input growth.
  • With parallelize 16 this is 16 × 4 = 64 CPUs, exactly matching the smallest instance in the CI spot fleet (r6a.16xlarge = 64 vCPU), so there is no oversubscription on any fleet instance.
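
The resource arithmetic in the bullets above can be checked with a short sketch. It reuses the MEM default expression quoted above; the variable names `parallel`, `host_cores` and `total` are illustrative and are not taken from the CI scripts.

```shell
# Values from this PR description: 16 concurrent containers,
# 64 vCPUs on the smallest fleet instance.
parallel=16
host_cores=64

# Compare the old quota (CPUS=2) with the new one (CPUS=4).
for CPUS in 2 4; do
  unset MEM                       # force the default to be derived
  MEM=${MEM:-$((CPUS*4))g}        # same default expression as quoted above
  total=$((parallel * CPUS))      # CPU quota summed over all containers
  echo "CPUS=$CPUS MEM=$MEM total=$total/$host_cores"
done
```

With CPUS=2 the containers request half of the host's cores, and with CPUS=4 they request all of them without exceeding the core count.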

Also refreshed the stale WARNING comment (which said "for now all txs seem small") to document the observed large-tx case and the reasoning behind the allocation.

Notes

  • Scoped to the CI harness (yarn-project/end-to-end/bootstrap.sh); no product/circuit code changes.
  • The dumped AVM inputs are downloaded from cache in CI and are not in the checkout, so the timing itself is exercised only on the CI fleet runners; verified the change is a syntactically valid, self-contained prefix bump and that CPUS/TIMEOUT are recognized run_test_cmd prefix params (ci3/source_test_params).
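
To illustrate how a colon-separated prefix of this shape carries its parameters, here is a hypothetical parser. It is not the actual ci3/source_test_params logic, which is not shown in this PR; the `abc123` hash and the `param_*` variable names are placeholders.

```shell
# Hypothetical sketch only: split "<hash>:KEY=VALUE:KEY=VALUE" into fields.
prefix="abc123:ISOLATE=1:TIMEOUT=90s:CPUS=4"   # "abc123" is a placeholder hash

hash="${prefix%%:*}"     # text before the first colon
rest="${prefix#*:}"      # everything after the first colon

old_ifs=$IFS
IFS=':'
for field in $rest; do   # unquoted on purpose: split on ':'
  case "$field" in
    ISOLATE=*) param_isolate="${field#ISOLATE=}" ;;
    TIMEOUT=*) param_timeout="${field#TIMEOUT=}" ;;
    CPUS=*)    param_cpus="${field#CPUS=}" ;;
  esac
done
IFS=$old_ifs

echo "hash=$hash isolate=$param_isolate timeout=$param_timeout cpus=$param_cpus"
```

Under this reading, the change in this PR amounts to editing two values in one string, which matches the single-line diff in the Fix section.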

Created by claudebox · group: slackbot

@AztecBot added the ci-draft, ci-no-fail-fast, and claudebox labels on Jul 3, 2026.