Stabilize Cloud Run staging prewarm by anth-volk · Pull Request #1600 · PolicyEngine/policyengine-household-api

anth-volk · 2026-07-01T18:58:29Z

Summary

Replace the lightweight per-matrix Cloud Run warm step with channel-specific my_friend_ben prewarm gates.
Keep channel and exact Cloud Run route-mode tests parallel by splitting current and frontier into separate integration jobs.
Remove the obsolete lightweight warm script and mark the new prewarm script as Modal-release-sensitive.

Validation

uv run --frozen --extra dev pytest .github/scripts -q
uv run --frozen --extra dev pytest --confcutdir=tests/unit/modal_release tests/unit/modal_release/test_release_config.py .github/scripts/test_cloud_run_prewarm_full_calc_via_gateway.py .github/scripts/test_run_deployed_tests_for_modal_route.py
uv run --frozen ruff format --check policyengine_household_api/modal_release/release_config.py tests/unit/modal_release/test_release_config.py .github/scripts/cloud_run_prewarm_full_calc_via_gateway.py .github/scripts/test_cloud_run_prewarm_full_calc_via_gateway.py
uv run --frozen ruff check policyengine_household_api/modal_release/release_config.py tests/unit/modal_release/test_release_config.py .github/scripts/cloud_run_prewarm_full_calc_via_gateway.py .github/scripts/test_cloud_run_prewarm_full_calc_via_gateway.py

modal_release:
  new_app_target: none
  promote_existing_frontier: false
  cleanup_target: none

hua7450

Thanks for tackling this — a full-calculation prewarm is the right instinct, but making it a separate hard-gating job introduces some operational regressions I'd like resolved before this merges.

Design (blocking)

gh run rerun --failed no longer re-warms. The old warm step ran inside every test job, so rerunning failed jobs re-warmed first. Now the prewarm is a separate job that stays green, so gh run rerun <id> --failed — our documented remediation for this exact flake — reruns only the failed test job against a worker that scaled to zero long ago. It 503s again, and the only recovery is rerunning the entire workflow including the deploys.
A Modal outage now blocks all deploys, including failover fixes. The prewarm requires Modal to serve the heavy calculation under 90s within 5 attempts (_attempt_is_warm requires backend == "modal"). If Modal staging is degraded, the forced-fallback test jobs — which would pass — never run, and production deploys are blocked. We couldn't ship a Cloud Run failover fix during the exact incident the failover exists for.
The warm can expire before the tests start. Staging workers have scaledown_window=300 and no min_containers (worker_app.py:54). The downstream test job re-runs checkout + setup-python + full uv sync + Auth0 fetch after the prewarm succeeds; if that exceeds 5 minutes, the warmed container is gone and the cold-start flake recurs — now with a hard gate in front of it. The old in-job warm had zero gap.

All three stem from the same choice. Suggest either keeping the prewarm job best-effort (report, don't gate — the deployed tests remain the gate, as the old script documented), or moving the warm back inside each test job (e.g., a session-scoped pytest fixture) so it survives --failed reruns and has no warm-to-test gap.
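A minimal sketch of the in-job, best-effort option, to make the suggestion concrete. `warm_until_ready` and `WarmResult` are hypothetical names, not the PR's; the defaults mirror the 5-attempt, 90-second thresholds and the `backend == "modal"` check described above, and the `attempt` callable is assumed to perform one full-calculation request and report which backend served it.

```python
import time
from typing import Callable, NamedTuple


class WarmResult(NamedTuple):
    """Outcome of one warm request (hypothetical shape)."""

    backend: str            # backend that served the request, e.g. "modal"
    elapsed_seconds: float  # wall-clock duration of the request


def warm_until_ready(
    attempt: Callable[[], WarmResult],
    *,
    max_attempts: int = 5,
    max_elapsed_seconds: float = 90.0,
    required_backend: str = "modal",
    pause_seconds: float = 0.0,
) -> bool:
    """Best-effort warm: report the outcome, never raise.

    Returns True as soon as one attempt is served by the required backend
    within the time budget, and False if every attempt fails or is too slow.
    The caller decides whether False should gate anything.
    """
    for number in range(1, max_attempts + 1):
        try:
            result = attempt()
        except Exception as exc:  # any failure just counts as a cold attempt
            print(f"warm {number}/{max_attempts} failed: {exc!r}", flush=True)
        else:
            warm = (
                result.backend == required_backend
                and result.elapsed_seconds <= max_elapsed_seconds
            )
            print(
                f"warm {number}/{max_attempts}: backend={result.backend} "
                f"elapsed={result.elapsed_seconds:.1f}s warm={warm}",
                flush=True,
            )
            if warm:
                return True
        if number < max_attempts and pause_seconds > 0:
            time.sleep(pause_seconds)
    return False
```

Called from a session-scoped, autouse pytest fixture in the deployed-test suite, something like this would run on every `--failed` rerun and leave no gap between the warm and the first test. Because it only returns a boolean, the deployed tests stay the gate.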

Script robustness (blocking)

Several exception paths bypass the retry loop entirely, so a single transient hard-fails the pipeline — the exact thing the retries exist to absorb (the old curl warmer swallowed all of these with || true):

_resolve_package_version makes one unretried GET on /versions/us and raises SystemExit on any failure, so a blip on the seconds-old gateway revision fails the job before the first warm attempt (cloud_run_prewarm_full_calc_via_gateway.py:149-168).
The bare exc.read() inside the except error.HTTPError handler is unguarded — a connection reset or IncompleteRead while draining a 503 body escapes and crashes the script on attempt 1 (line 211).
http.client.HTTPException subclasses aren't caught anywhere: IncompleteRead on a truncated 200 body (line 201), BadStatusLine. Likewise JSONDecodeError on the versions response (line 153) and a valid-JSON non-dict 200 body (payload.get → AttributeError, line 240).
urlopen(timeout=...) is a per-socket-operation timeout, not a total deadline — a response that trickles bytes keeps one attempt alive until GitHub's 6-hour job timeout instead of failing within the intended ~5×130s (line 200).
Deterministic 4xx (bad token scope, payload drift against a new package version) is retried all 5 attempts identically to a 503, and the error body — the useful diagnostic, e.g. the 422 unsupported_version payload — is read and discarded. Fail fast on 4xx and print the body (lines 69, 211).
print() without flush=True is block-buffered under Actions, so the per-attempt progress lines only appear at process exit — you can't tell a progressing prewarm from a hung one (line 260). flush=True or PYTHONUNBUFFERED=1 in the step.
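To illustrate how these failure modes could be folded into the retry loop, here is a rough stdlib-only sketch of a single attempt. Every name in it (`post_json`, `_drain`, `RetryableError`, `FatalError`, the `opener` parameter) is hypothetical and not taken from the PR's script. It assumes 5xx and transport errors are retryable, 4xx is fatal, and a total deadline can be approximated by checking a monotonic clock between `read1()` calls.

```python
import http.client
import json
import time
from urllib import error, request


class RetryableError(Exception):
    """Transient failure; another attempt may succeed."""


class FatalError(Exception):
    """Deterministic failure (4xx); retrying cannot help."""


def _drain(stream, deadline: float, chunk_size: int = 65536) -> bytes:
    """Read a body via read1(), checking the total deadline between reads."""
    chunks = []
    while True:
        if time.monotonic() > deadline:
            raise RetryableError("total deadline exceeded while reading body")
        chunk = stream.read1(chunk_size)
        if not chunk:  # empty bytes means end of body
            return b"".join(chunks)
        chunks.append(chunk)


def post_json(
    url: str,
    payload: dict,
    *,
    socket_timeout: float = 30.0,
    total_seconds: float = 130.0,
    opener=request.urlopen,  # injectable so tests need no network
) -> dict:
    """POST JSON once and classify every failure as retryable or fatal."""
    deadline = time.monotonic() + total_seconds
    req = request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with opener(req, timeout=socket_timeout) as response:
            body = _drain(response, deadline)
    except error.HTTPError as exc:
        # Draining the error body can itself fail; guard it and keep going.
        try:
            detail = _drain(exc, deadline).decode("utf-8", "replace")
        except (OSError, http.client.HTTPException, RetryableError):
            detail = "<error body unavailable>"
        if 400 <= exc.code < 500:
            raise FatalError(f"HTTP {exc.code}: {detail}") from exc
        raise RetryableError(f"HTTP {exc.code}: {detail}") from exc
    except (OSError, http.client.HTTPException) as exc:
        # URLError, timeouts and resets are OSError; IncompleteRead and
        # BadStatusLine are HTTPException.
        raise RetryableError(f"transport error: {exc!r}") from exc
    try:
        parsed = json.loads(body)
    except ValueError as exc:  # includes json.JSONDecodeError
        raise RetryableError("response body is not valid JSON") from exc
    if not isinstance(parsed, dict):
        raise RetryableError(f"expected a JSON object, got {type(parsed).__name__}")
    return parsed
```

A retry loop would catch `RetryableError` and continue, let `FatalError` end the run immediately, and print either message with `flush=True`. Two caveats: connecting and reading headers are still bounded only by `socket_timeout`, and a single `read1()` can block for up to one socket timeout, so the deadline can overshoot by roughly that much. Treating 408 or 429 as retryable would be a reasonable variation.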

Non-blocking cleanups

The script imports HouseholdModelUS + the test fixture only to round-trip the payload through model_dump(), which is what forces uv sync --extra dev (full policyengine-us/uk/canada) in both prewarm jobs on the release critical path. cloud_run_gateway_load_test.py posts the raw fixture dict to the same endpoint — doing the same makes this script stdlib-only, drops the sync step, and shrinks the warm-to-test gap from design item 3.
The current/frontier prewarm + integration-test jobs are four near-identical blocks differing only by the channel literal (~120 duplicated lines); a matrix.channel: [current, frontier] prewarm job feeding the original test matrix does the same gating with half the YAML.
_post_calculation/WarmAttempt near-duplicate the HTTP client in cloud_run_gateway_load_test.py — worth extracting a shared helper in .github/scripts (precedent: modal_release_check_common.py).
--max-attempts 5 --max-elapsed-seconds 90 in both workflow jobs exactly restate the script's defaults — two sources of truth for the gating thresholds; drop one.
The test's inline ThreadingHTTPServer harness duplicates the one in test_cloud_run_gateway_load_test.py, and the relative script path with no cwd= makes the test repo-root-dependent.
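On the last point, one way to remove the repo-root dependency is to run the script by absolute path with an explicit working directory. The helper below is a hypothetical sketch (`run_script` is not an existing function in the repo); a test would build the path from its own location, for example `Path(__file__).parent / "<script name>"`, instead of a path relative to the caller's cwd.

```python
import subprocess
import sys
from pathlib import Path
from typing import Optional


def run_script(
    script: Path,
    *args: str,
    cwd: Optional[Path] = None,
    timeout: float = 60.0,
) -> subprocess.CompletedProcess:
    """Run a Python script with the current interpreter and a fixed cwd.

    The working directory defaults to the script's own directory, so the
    result does not depend on where pytest was invoked from.
    """
    script = Path(script).resolve()
    return subprocess.run(
        [sys.executable, str(script), *args],
        cwd=str(cwd) if cwd is not None else str(script.parent),
        capture_output=True,
        text=True,
        timeout=timeout,  # raises subprocess.TimeoutExpired if exceeded
        check=False,      # the caller asserts on returncode itself
    )
```

The same helper could live next to a shared `ThreadingHTTPServer` fixture so both test modules use one harness.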

anth-volk added 2 commits July 1, 2026 20:53

Stabilize Cloud Run staging prewarm

9d810b3

Rename Cloud Run prewarm script

ed162b8

anth-volk marked this pull request as ready for review July 1, 2026 19:13

Add Cloud Run prewarm changelog

4bfe37f

anth-volk requested a review from hua7450 July 1, 2026 19:17

hua7450 requested changes Jul 2, 2026

anth-volk wants to merge 3 commits into main from codex/cloud-run-mfb-prewarm

2 participants
