fix(ci): give avm-check-circuit more CPU and timeout for large traces #23747

Draft: AztecBot wants to merge 1 commit into next from cb/avm-check-circuit-resources

@AztecBot
Collaborator

Problem

The AVM Circuit Inputs Collection and Check workflow has been failing consistently on next (runs 1716–1723, since ~2026-05-29 14:00 UTC). The avm-check-circuit job exits with code 124 (timeout).

Root cause

avm_check_circuit_cmds in yarn-project/end-to-end/bootstrap.sh runs each dumped AVM input through bb-avm avm_check_circuit with the per-test prefix ISOLATE=1:TIMEOUT=30s and the default CPUS=2.
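
As a rough illustration of how such a prefix reads, the sketch below splits a colon-separated KEY=VALUE string into settings and applies the CPUS=2 default mentioned above. The function name `parse_prefix` is hypothetical, and the exact spelling of the updated prefix is an assumption; this is not how the CI scripts themselves parse it.

```python
from typing import Dict, Optional


def parse_prefix(prefix: str, defaults: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Split a colon-separated KEY=VALUE prefix into a dict, on top of defaults."""
    settings = dict(defaults or {})
    for field in prefix.split(":"):
        if not field:
            continue  # tolerate empty segments such as a trailing colon
        key, _, value = field.partition("=")
        settings[key] = value
    return settings


# Default stated in the PR description; any other defaults are not modelled here.
DEFAULTS = {"CPUS": "2"}

# Prefix quoted in the PR description.
before = parse_prefix("ISOLATE=1:TIMEOUT=30s", DEFAULTS)

# Assumed spelling of the prefix after the fix (only the values come from the PR).
after = parse_prefix("ISOLATE=1:TIMEOUT=300s:CPUS=8", DEFAULTS)

print("before:", before)
print("after: ", after)
```

Under these assumptions, the old prefix resolves to 2 CPUs and a 30s timeout, and the new one to 8 CPUs and a 300s timeout.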

Every input passes in 3–5s except the e2e_multiple_blobs tx (0x2f5d642b…), which produces a much larger circuit. From the failing input's log:

08:47:44 Generating trace... (mem: 824 MiB)
08:48:07 Checking circuit... (mem: 3914 MiB)      # ~23s just to generate the trace
08:48:07 Running check (with skippable) circuit over 700560 rows.
08:48:13 timeout: sending signal TERM to command 'bash'   # killed at 30s, mid-check

With only 2 CPUs, trace generation alone consumed ~23s, so the circuit check could not finish before the 30s timeout fired. parallelize runs with --halt now,fail=1, so this single timed-out input failed the entire job. The process was killed mid-check: the log shows no circuit correctness error, only an insufficient resource/time budget. This is exactly the case the file's existing WARNING comment anticipated.
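
The timing argument can be checked directly against the log excerpt. The sketch below copies the four log lines (without the inline annotations) and computes how long each phase ran before the kill; the variable names are illustrative only.

```python
from datetime import datetime

# The four log lines quoted in the PR description, annotations removed.
LOG = """\
08:47:44 Generating trace... (mem: 824 MiB)
08:48:07 Checking circuit... (mem: 3914 MiB)
08:48:07 Running check (with skippable) circuit over 700560 rows.
08:48:13 timeout: sending signal TERM to command 'bash'
"""


def stamp(line: str) -> datetime:
    """Parse the leading HH:MM:SS timestamp of a log line."""
    return datetime.strptime(line.split()[0], "%H:%M:%S")


lines = LOG.splitlines()
trace_start = stamp(lines[0])   # "Generating trace..."
check_start = stamp(lines[1])   # "Checking circuit..."
killed_at = stamp(lines[3])     # timeout sends TERM

trace_seconds = (check_start - trace_start).total_seconds()
check_seconds = (killed_at - check_start).total_seconds()

print(f"trace generation: {trace_seconds:.0f}s")
print(f"check before kill: {check_seconds:.0f}s")
print(f"logged total: {trace_seconds + check_seconds:.0f}s")
```

The logged phases add up to just under the 30s budget, which fits a timeout clock that started shortly before the first log line; the check itself had only a few seconds to run over the 700560 rows.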

Fix

Bump the per-input allocation from CPUS=2 / TIMEOUT=30s to CPUS=8 / TIMEOUT=300s:

  • CPUS=8 is an upper bound (a docker --cpus quota), not a reservation. It speeds up the large stragglers that run nearly alone at the tail of the parallel run, and it is neutral during the parallel burst of small txs, which finish before CPU contention matters and where CFS throttles all containers to the physical core count regardless.
  • TIMEOUT=300s gives generous headroom over the ~30s+ this tx needs, while still bounding a genuine hang.

Memory follows the default CPUS*4 = 32g; the heavy tx peaked at ~4 GiB, so memory was never the constraint.
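
The resource reasoning above can be sketched with a deliberately simplified model: a container's usable CPU is the smaller of its quota and an equal share of the host's cores, and memory defaults to CPUS*4 GiB. The host core count (64) and the number of concurrent containers (32) are hypothetical values chosen for illustration, and the equal-share rule is only an approximation of CFS behaviour.

```python
def effective_cpus(quota: float, host_cores: int, busy_containers: int) -> float:
    """Approximate usable CPUs: the quota is a cap, and contention limits the share."""
    fair_share = host_cores / busy_containers
    return min(quota, fair_share)


def default_memory_gib(cpus: int) -> int:
    """Default memory allocation described in the PR: CPUS * 4 (in GiB)."""
    return cpus * 4


HOST_CORES = 64        # hypothetical CI host size
BURST_CONTAINERS = 32  # hypothetical number of small txs running at once

# During the parallel burst, both quotas yield the same share: the bump is neutral.
burst_old = effective_cpus(2, HOST_CORES, BURST_CONTAINERS)
burst_new = effective_cpus(8, HOST_CORES, BURST_CONTAINERS)

# At the tail, a lone straggler is limited only by its quota.
tail_old = effective_cpus(2, HOST_CORES, 1)
tail_new = effective_cpus(8, HOST_CORES, 1)

# Memory: the default for CPUS=8 versus the peak seen in the log (3914 MiB).
memory_gib = default_memory_gib(8)
peak_gib = 3914 / 1024

print(f"burst: {burst_old} vs {burst_new} CPUs")
print(f"tail:  {tail_old} vs {tail_new} CPUs")
print(f"memory: {memory_gib} GiB allocated, {peak_gib:.2f} GiB peak")
```

In this model the higher quota changes nothing while many small inputs compete for cores and only pays off once a large input runs nearly alone, and the default memory allocation sits far above the observed peak.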

Testing

Not reproduced locally: the check runs against a cached, freshly-built bb-avm plus the S3 inputs tarball on a CI EC2 host, which isn't feasible to stand up in this session. The change is confined to the CI resource prefix and is validated by the failing input's own log (700k-row trace, killed mid-check at the 30s boundary with no assertion failure). The fix will be exercised by the next push run of this workflow on next.


Created by claudebox · group: slackbot

@AztecBot added the claudebox label (Owned by claudebox. it can push to this PR.) on May 30, 2026