Split Full Suite YAML batches so each heavy folder runs isolated #8093

Merged
MaxGhenis merged 1 commit into main from tighten-test-batches
Apr 19, 2026

@MaxGhenis
Contributor

Why

Full Suite jobs on ubuntu-latest have been intermittently failing mid-batch with "The runner has received a shutdown signal" (see #8069 / #8077 / #8078 across the last two days). The signal is a runner OOM kill: our grouped batches peak at ~8-9 GB per subprocess, which is borderline on 16 GB runners and tips over once the policyengine-core 3.24+ per-simulation overhead is added.

Fix

Give every heavy folder its own batch. Each subprocess now peaks at ~3-5 GB instead of ~8-9 GB, so the runner no longer runs out of memory regardless of the PE-core version.

Batch count changes

| Job | Before | After |
| --- | --- | --- |
| Full Suite - Structural (Other) (policy/contrib) | 7 | 15 |
| Full Suite - Baseline (excl States) (policy/baseline gov/) | 5 | 6 |

Small folders and root YAML files are split across two deterministic catch-all groups, so new additions to the repo have somewhere safe to land.
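
For readers who want the shape of the grouping rule, here is a minimal sketch. It is not the repo's split_into_batches: the function name, its signature, the folder names in the usage example, and the CRC32-based assignment are all assumptions made for illustration. It only shows one way to give each heavy folder its own batch and spread everything else over two catch-all groups in a way that does not depend on input order.

```python
import zlib


def split_into_batches_sketch(entries, heavy):
    """Hypothetical sketch: one batch per heavy folder, plus two catch-all batches.

    entries: names of folders and root YAML files (illustrative input).
    heavy:   set of folder names that must run isolated (illustrative input).
    """
    # Each heavy folder becomes a single-entry batch.
    heavy_batches = [[name] for name in sorted(entries) if name in heavy]

    # Everything else goes to one of two catch-all groups. CRC32 of the name
    # is stable across runs and machines (unlike Python's salted hash()),
    # so an unknown new folder always lands in the same group.
    catch_all = [[], []]
    for name in sorted(entries):
        if name not in heavy:
            group = zlib.crc32(name.encode("utf-8")) % 2
            catch_all[group].append(name)

    # Drop a catch-all group if nothing was assigned to it.
    return heavy_batches + [group for group in catch_all if group]


# Usage with made-up names; these are not the repo's real folders.
example = split_into_batches_sketch(
    ["heavy_a", "heavy_b", "small_x", "small_y", "root.yaml"],
    heavy={"heavy_a", "heavy_b"},
)
```

Hashing the name rather than alternating by position means adding a folder never moves an existing one between groups, though it does not balance the two groups by size or memory.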

Trade-off

Extra 3-5 minutes of wall time per job from the additional subprocess startups. In exchange, Full Suite stops getting killed mid-batch and we stop needing --admin merges to land dependency bumps.
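
The startup cost comes from launching a fresh process per batch. A minimal sketch of that pattern follows, assuming a hypothetical run_batches helper and a caller-supplied base command; the actual test command used by the workflow is not shown in this PR.

```python
import subprocess
import sys


def run_batches(batches, base_cmd):
    """Hypothetical helper: run each batch in its own subprocess.

    batches:  list of lists of paths (or other arguments) per batch.
    base_cmd: command prefix to which each batch's entries are appended.
    Returns the exit code of every batch, in order.
    """
    exit_codes = []
    for batch in batches:
        # A new process per batch: when it exits, the OS reclaims all of
        # its memory, so peak usage is bounded by the largest single batch.
        completed = subprocess.run([*base_cmd, *batch], check=False)
        exit_codes.append(completed.returncode)
    return exit_codes


# Usage with a stand-in command that exits with the code it is given.
stand_in = [sys.executable, "-c", "import sys; sys.exit(int(sys.argv[1]))"]
codes = run_batches([["0"], ["3"]], stand_in)
```

Collecting exit codes instead of stopping at the first failure lets a job report every failing batch in one run; whether the real runner does this is not stated here.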

Test plan

  • split_into_batches returns the expected lists locally
  • CI runs to completion without runner shutdowns

Generated with Claude Code

Previous grouping (3 folders per contrib batch, usda+hhs paired in
baseline-other) pushed peak memory to ~8-9 GB per subprocess on the
16 GB ubuntu-latest runner. Once policyengine-core 3.24+ overhead
landed this exceeded the cap and surfaced as 'The runner has received
a shutdown signal' mid-batch, intermittently failing Full Suite -
Baseline States / Baseline (excl States) / Structural (Other).

Every heavy folder now gets its own batch (~3-5 GB peak each). The
remaining small folders and root YAML files are split across two
deterministic catch-all groups so new unknown folders have somewhere
safe to land without pushing either group past ~5 GB.

Batch counts:
  Structural (Other) policy/contrib: 7 -> 15 batches
  Baseline (excl States):            5 -> 6 batches

Trade-off: ~3-5 min extra wall time from subprocess startup, in
exchange for CI stability. Each subprocess starts fresh so holder
memory is fully freed between batches regardless of PE-core version.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@MaxGhenis MaxGhenis merged commit 40157c4 into main Apr 19, 2026
10 of 13 checks passed
@MaxGhenis MaxGhenis deleted the tighten-test-batches branch April 19, 2026 17:29