fix(ci): raise avm_check_circuit per-check timeout from 30s to 300s #23743
Draft
AztecBot wants to merge 1 commit into
Root cause
The `avm-check-circuit` job (workflow `AVM Circuit Inputs Collection and Check`) has been failing on essentially every run since ~Feb 10 (last green run was #877; the run that triggered this PR is #1722, run 26675832739). It is not caused by the commit that triggered the failing run; the job is red on every commit.
The failures are at the CI harness layer, not in the AVM circuit. Every `bb-avm avm_check_circuit` invocation that actually runs reports `PASSED`. The dominant failure signature across recent runs (1720, 1721) is the remote command exiting with responseCode 124 at a consistent ~4m25s:
- `avm_check_circuit_cmds` hard-codes `TIMEOUT=30s` per check (`yarn-project/end-to-end/bootstrap.sh`).
- `exec_test` enforces it via `timeout -v $TIMEOUT bash -c "$test_cmd"`, so a check that exceeds 30s is killed with exit 124.
- `parallelize` runs with `--halt now,fail=1` (fail-fast), so a single timed-out check aborts the whole job and propagates 124.
The consistent ~4m25s failure time (≈3m55s build + 30s) is the tell: with 64-way parallelism the first wave of checks all start at once, and a single check that needs >30s times out exactly 30s into the run and fail-fasts the job. The 30s value has been unchanged since the feature was introduced (#18747); the per-check cost has simply grown past it as the AVM gained columns/relations. Normal checks complete in 3–5s, so 30s has no headroom for the heavier transactions.
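The exit-124 behaviour described above can be reproduced in isolation. The following is a minimal sketch, not the harness code: `run_check` is a hypothetical stand-in for `exec_test`, `sleep 2` stands in for a slow-but-valid check, and the 1s/10s limits are scaled-down placeholders chosen only so the demo finishes quickly. It assumes GNU coreutils `timeout` is on the PATH.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the harness's exec_test (the real one also passes -v).
run_check() {
  local limit="$1" test_cmd="$2"
  # timeout kills the command once the limit elapses and then exits with 124.
  timeout "$limit" bash -c "$test_cmd"
}

# Placeholder for a check that is valid but slower than the tight limit.
slow_check='sleep 2; echo PASSED'

# Limit shorter than the check: the check is killed before it can report.
tight_status=0
run_check 1s "$slow_check" || tight_status=$?

# Limit with headroom: the same check completes and exits 0.
roomy_status=0
run_check 10s "$slow_check" || roomy_status=$?

echo "tight limit exit=$tight_status, roomy limit exit=$roomy_status"
```

Under these assumptions the first call should report 124 without ever printing `PASSED`, while the second prints `PASSED` and exits 0. That is the same pattern as the CI logs: the check itself is fine, and only the bound is too small.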
(One recent run, #1722, instead died with `responseCode=-1 / Undeliverable`, a separate, rarer EC2 instance-loss event that this change does not address. The dominant, deterministic cause is the 30s timeout.)
Note also that the failure Slack notification targets `#team-bonobos` and returns `channel_not_found`, which is why this job rotted silently for ~3.5 months: humans were never alerted. That is a separate issue worth fixing (correct channel name) but is out of scope here.
Fix
Raise the per-check timeout from `30s` to `300s`. This is the mitigation the in-code WARNING comment already anticipated. It gives slow-but-valid checks ample headroom while keeping a bound so a genuinely stuck input still fails rather than hanging.
Verification
Not reproducible in this session: `avm-check-circuit` requires a full barretenberg build plus the dumped AVM inputs tarball and runs on a dedicated EC2 instance, none of which are available locally. The change is a single CI harness parameter; correctness rests on the log evidence above (responseCode 124 = `timeout` firing, consistent ~30s-after-build failure, every executed check PASSED). The real validation is a green `avm-check-circuit` run on `next` after merge.
Created by claudebox · group: `slackbot`