Skip to content

fix(ci): raise avm_check_circuit per-check timeout from 30s to 300s#23743

Draft
AztecBot wants to merge 1 commit into
nextfrom
cb/avm-check-circuit-timeout-300s
Draft

fix(ci): raise avm_check_circuit per-check timeout from 30s to 300s#23743
AztecBot wants to merge 1 commit into
nextfrom
cb/avm-check-circuit-timeout-300s

Conversation

@AztecBot
Copy link
Copy Markdown
Collaborator

Root cause

The avm-check-circuit job (workflow AVM Circuit Inputs Collection and Check) has been failing on essentially every run since ~Feb 10 (last green run was #877; the run that triggered this PR is #1722, run 26675832739). It is not caused by the commit that triggered the failing run — the job is red on every commit.

The failures are at the CI harness layer, not in the AVM circuit. Every bb-avm avm_check_circuit invocation that actually runs reports PASSED. The dominant failure signature across recent runs (1720, 1721) is the remote command exiting with responseCode 124 at a consistent ~4m25s:

  • avm_check_circuit_cmds hard-codes TIMEOUT=30s per check (yarn-project/end-to-end/bootstrap.sh).
  • exec_test enforces it via timeout -v $TIMEOUT bash -c "$test_cmd", so a check that exceeds 30s is killed with exit 124.
  • parallelize runs with --halt now,fail=1 (fail-fast), so a single timed-out check aborts the whole job and propagates 124.

The consistent ~4m25s failure time (≈3m55s build + 30s) is the tell: with 64-way parallelism the first wave of checks all start at once, and a single check that needs >30s times out exactly 30s into the run and fail-fasts the job. The 30s value has been unchanged since the feature was introduced (#18747); the per-check cost has simply grown past it as the AVM gained columns/relations. Normal checks complete in 3–5s, so 30s has no headroom for the heavier transactions.

(One recent run, #1722, instead died with responseCode=-1 / Undeliverable — a separate, rarer EC2 instance-loss event that this change does not address. The dominant, deterministic cause is the 30s timeout.)

Note also that the failure Slack notification targets #team-bonobos and returns channel_not_found, which is why this job rotted silently for ~3.5 months — humans were never alerted. That is a separate issue worth fixing (correct channel name) but is out of scope here.

Fix

Raise the per-check timeout from 30s to 300s. This is the mitigation the in-code WARNING comment already anticipated. It gives slow-but-valid checks ample headroom while keeping a bound so a genuinely stuck input still fails rather than hanging.

Verification

Not reproducible in this session — avm-check-circuit requires a full barretenberg build plus the dumped AVM inputs tarball and runs on a dedicated EC2 instance, none of which are available locally. The change is a single CI harness parameter; correctness rests on the log evidence above (responseCode 124 = timeout firing, consistent ~30s-after-build failure, every executed check PASSED). The real validation is a green avm-check-circuit run on next after merge.


Created by claudebox · group: slackbot

@AztecBot AztecBot added the claudebox Owned by claudebox. it can push to this PR. label May 30, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

claudebox Owned by claudebox. it can push to this PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant