fix(ci): give AVM check-circuit more CPU/timeout so large txs don't time out #24493
Draft
AztecBot wants to merge 1 commit into
What
The nightly AVM Circuit Inputs Collection and Check job (avm-check-circuit) failed on next: run 28637998116.
Root cause
Exactly one of the per-tx avm_check_circuit invocations timed out (exit code 124), which fails the whole parallelize run.
The individual test log shows that simulation and trace generation completed fine; it was the circuit check that ran out of time.
This is a performance/timeout failure, not a correctness one. The e2e_multiple_blobs tx 0x0c37df31… produces a ~700k-row circuit whose check does not finish within the per-container TIMEOUT=30s when the container is capped at the default --cpus=2. Every other (smaller) tx checked in 1–3s. Container resources were CPUS=2 MEM=8g with a peak memory of only ~2.2 GiB, so memory was not the constraint; CPU/wall-clock time was.
Yesterday's #24458 ("reduce parallelism for e2e check circuit — it was timing out") lowered concurrency to parallelize 16, but on the 64-core CI fleet instance that already leaves no CPU contention (16 × 2 = 32 of 64 cores). Lowering concurrency does not raise a single container's --cpus quota, so a single large circuit still can't finish in 30s, hence the recurrence.
Fix
Bump the per-container resources for these check-circuit commands:
- CPUS=4: check_circuit parallelises over rows, so doubling the CPU quota roughly halves the check time for the large circuit, bringing it comfortably under the timeout. MEM auto-derives to 16g (MEM=${MEM:-$((CPUS*4))g}), still far above the ~2.2 GiB peak.
- TIMEOUT=90s: 3× the previous 30s limit, leaving headroom to absorb variance and future input growth.
With parallelize 16 this is 16 × 4 = 64 CPUs, exactly matching the smallest instance in the CI spot fleet (r6a.16xlarge = 64 vCPU), so there is no oversubscription on any fleet instance.
Also refreshed the stale WARNING comment (which said "for now all txs seem small") to document the observed large-tx case and the reasoning behind the allocation.
Notes
- yarn-project/end-to-end/bootstrap.sh); no product/circuit code changes.
- CPUS/TIMEOUT are recognized run_test_cmd prefix params (ci3/source_test_params).
Created by claudebox · group: slackbot
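For reviewers who want to sanity-check the numbers in the Fix section, the resource arithmetic can be reproduced in plain shell. This is only a sketch under stated assumptions, not the contents of bootstrap.sh: the MEM default expansion is copied from the description, while JOBS and FLEET_VCPUS are hypothetical names introduced here for the parallelize 16 concurrency and the 64-vCPU r6a.16xlarge instance.

```shell
#!/usr/bin/env bash
# Sketch of the resource arithmetic from the PR description; not the real bootstrap.sh.
set -eu

CPUS=4          # per-container CPU quota proposed in this PR
JOBS=16         # hypothetical name for the `parallelize 16` concurrency
FLEET_VCPUS=64  # hypothetical name; smallest fleet instance per the description

# Default expansion quoted in the description: when MEM is unset,
# it falls back to CPUS*4 with a "g" suffix.
unset MEM
MEM=${MEM:-$((CPUS*4))g}

# Total CPU quota requested when all containers run at once.
total_cpus=$((JOBS * CPUS))
echo "MEM=$MEM total_cpus=$total_cpus"

if [ "$total_cpus" -le "$FLEET_VCPUS" ]; then
  echo "no oversubscription"
else
  echo "oversubscribed"
fi
```

With the values above this prints MEM=16g total_cpus=64 followed by "no oversubscription"; setting CPUS=2 reproduces the earlier 8g / 32-CPU configuration.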