Stabilize Cloud Run staging prewarm #1600
hua7450
left a comment
Thanks for tackling this — a full-calculation prewarm is the right instinct, but making it a separate hard-gating job introduces some operational regressions I'd like resolved before this merges.
Design (blocking)
- `gh run rerun --failed` no longer re-warms. The old warm step ran inside every test job, so rerunning failed jobs re-warmed first. Now the prewarm is a separate job that stays green, so `gh run rerun <id> --failed` (our documented remediation for this exact flake) reruns only the failed test job against a worker that scaled to zero long ago. It 503s again, and the only recovery is rerunning the entire workflow, including the deploys.
- A Modal outage now blocks all deploys, including failover fixes. The prewarm requires Modal to serve the heavy calculation under 90s within 5 attempts (`_attempt_is_warm` requires `backend == "modal"`). If Modal staging is degraded, the forced-fallback test jobs, which would pass, never run, and production deploys are blocked. We couldn't ship a Cloud Run failover fix during the exact incident the failover exists for.
- The warm can expire before the tests start. Staging workers have `scaledown_window=300` and no `min_containers` (`worker_app.py:54`). The downstream test job re-runs checkout + setup-python + full `uv sync` + Auth0 fetch after the prewarm succeeds; if that exceeds 5 minutes, the warmed container is gone and the cold-start flake recurs, now with a hard gate in front of it. The old in-job warm had zero gap.
All three stem from the same choice. Suggest either keeping the prewarm job best-effort (report, don't gate — the deployed tests remain the gate, as the old script documented), or moving the warm back inside each test job (e.g., a session-scoped pytest fixture) so it survives --failed reruns and has no warm-to-test gap.
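To make the in-job option concrete, here is a minimal sketch of a best-effort warm helper. Everything in it is an assumption for illustration, not the script's actual API: `warm_until_ready`, the `post` callable (one full calculation returning `(status, backend)`), and the defaults, which simply mirror the 5-attempt / 90s thresholds discussed above.

```python
import time
from typing import Callable, Tuple

# Hypothetical contract: post() performs one full calculation and returns
# (http_status, backend_name); transport failures surface as OSError.
PostFn = Callable[[], Tuple[int, str]]


def warm_until_ready(
    post: PostFn,
    max_attempts: int = 5,
    max_elapsed_seconds: float = 90.0,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Return True once one attempt is served by Modal within the time budget."""
    for attempt in range(1, max_attempts + 1):
        started = clock()
        try:
            status, backend = post()
        except OSError as exc:
            # A cold worker often shows up as a reset or timeout; try again.
            print(f"warm attempt {attempt}: transport error {exc!r}", flush=True)
            continue
        elapsed = clock() - started
        print(
            f"warm attempt {attempt}: status={status} backend={backend} "
            f"elapsed={elapsed:.1f}s",
            flush=True,
        )
        if status == 200 and backend == "modal" and elapsed <= max_elapsed_seconds:
            return True
    return False


# In conftest.py this could be wired roughly as follows (pytest assumed to be
# available there; post_full_calculation is a hypothetical helper):
#
# @pytest.fixture(scope="session", autouse=True)
# def warmed_backend():
#     if not warm_until_ready(post_full_calculation):
#         print("prewarm did not confirm a warm worker; continuing", flush=True)
```

Because the helper returns a bool instead of exiting, the caller decides whether a failed warm is fatal. Logging and continuing keeps the deployed tests as the only gate, and running it inside the test session means a `--failed` rerun warms again with no gap before the first test.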
Script robustness (blocking)
Several exception paths bypass the retry loop entirely, so a single transient hard-fails the pipeline — the exact thing the retries exist to absorb (the old curl warmer swallowed all of these with || true):
- `_resolve_package_version` makes one unretried GET on `/versions/us` and raises `SystemExit` on any failure, so a blip on the seconds-old gateway revision fails the job before the first warm attempt (`cloud_run_prewarm_full_calc_via_gateway.py:149-168`).
- The bare `exc.read()` inside the `except error.HTTPError` handler is unguarded: a connection reset or `IncompleteRead` while draining a 503 body escapes and crashes the script on attempt 1 (line 211).
- `http.client.HTTPException` subclasses aren't caught anywhere: `IncompleteRead` on a truncated 200 body (line 201), `BadStatusLine`. Likewise `JSONDecodeError` on the versions response (line 153) and a valid-JSON non-dict 200 body (`payload.get` → `AttributeError`, line 240).
- `urlopen(timeout=...)` is a per-socket-operation timeout, not a total deadline: a response that trickles bytes keeps one attempt alive until GitHub's 6-hour job timeout instead of failing within the intended ~5×130s (line 200).
- Deterministic 4xx (bad token scope, payload drift against a new package version) is retried all 5 attempts identically to a 503, and the error body (the useful diagnostic, e.g. the 422 `unsupported_version` payload) is read and discarded. Fail fast on 4xx and print the body (lines 69, 211).
- `print()` without `flush=True` is block-buffered under Actions, so the per-attempt progress lines only appear at process exit, and you can't tell a progressing prewarm from a hung one (line 260). Use `flush=True` or set `PYTHONUNBUFFERED=1` in the step.
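One possible shape for a single attempt that covers the cases above, as a sketch, not a drop-in patch. The names `attempt_once`, `read_with_deadline`, and `FatalResponse` are hypothetical, and the `opener` parameter exists only so the function can be exercised without a network; in the script it would be `urllib.request.urlopen`.

```python
import http.client
import json
import time
import urllib.error
import urllib.request


class FatalResponse(Exception):
    """Deterministic 4xx: retrying would produce the same answer."""


def read_with_deadline(response, deadline, chunk_size=65536, clock=time.monotonic):
    """Read a body piecewise, enforcing a total deadline across all reads."""
    chunks = []
    while True:
        if clock() > deadline:
            raise TimeoutError("total deadline exceeded while reading body")
        # read1() does at most one underlying read, so a trickling body
        # returns control to this loop instead of blocking until EOF.
        chunk = response.read1(chunk_size)
        if not chunk:
            return b"".join(chunks)
        chunks.append(chunk)


def attempt_once(opener, request, total_seconds, socket_timeout=30.0):
    """Return ("ok", payload) or ("retry", reason); raise FatalResponse on 4xx."""
    deadline = time.monotonic() + total_seconds
    try:
        with opener(request, timeout=socket_timeout) as response:
            body = read_with_deadline(response, deadline)
        payload = json.loads(body)
    except urllib.error.HTTPError as exc:
        # Draining the error body can itself fail, so guard it separately.
        try:
            detail = exc.read().decode("utf-8", "replace")
        except (OSError, http.client.HTTPException) as read_exc:
            detail = f"<error body unreadable: {read_exc!r}>"
        if 400 <= exc.code < 500:
            raise FatalResponse(f"HTTP {exc.code}: {detail}") from exc
        return "retry", f"HTTP {exc.code}: {detail[:200]}"
    except (OSError, http.client.HTTPException, ValueError) as exc:
        # OSError covers URLError, resets, and timeouts; HTTPException covers
        # IncompleteRead and BadStatusLine; ValueError covers JSONDecodeError.
        return "retry", repr(exc)
    if not isinstance(payload, dict):
        return "retry", f"non-object JSON body: {type(payload).__name__}"
    return "ok", payload
```

The `HTTPError` handler has to come first because `HTTPError` is itself an `OSError` subclass. The deadline check bounds the body read only to within one `socket_timeout` of overshoot, and it does not cover connection setup or the wait for headers, so a hard wall-clock cap would still need a step-level timeout. The versions lookup could go through the same function so it shares the retry loop, and the caller should print each result with `flush=True`.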
Non-blocking cleanups
- The script imports `HouseholdModelUS` + the test fixture only to round-trip the payload through `model_dump()`, which is what forces `uv sync --extra dev` (full policyengine-us/uk/canada) in both prewarm jobs on the release critical path. `cloud_run_gateway_load_test.py` posts the raw fixture dict to the same endpoint; doing the same makes this script stdlib-only, drops the sync step, and shrinks the warm-to-test gap from design item 3.
- The current/frontier prewarm + integration-test jobs are four near-identical blocks differing only by the channel literal (~120 duplicated lines); a `matrix.channel: [current, frontier]` prewarm job feeding the original test matrix does the same gating with half the YAML.
- `_post_calculation`/`WarmAttempt` near-duplicate the HTTP client in `cloud_run_gateway_load_test.py`. Worth extracting a shared helper in `.github/scripts` (precedent: `modal_release_check_common.py`).
- `--max-attempts 5 --max-elapsed-seconds 90` in both workflow jobs exactly restate the script's defaults: two sources of truth for the gating thresholds; drop one.
- The test's inline `ThreadingHTTPServer` harness duplicates the one in `test_cloud_run_gateway_load_test.py`, and the relative script path with no `cwd=` makes the test repo-root-dependent.
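For the first cleanup, a stdlib-only request builder might look roughly like this. The function name, the `/us/calculate` path, and the bearer-token header are assumptions for illustration; the real endpoint and auth scheme should be taken from the existing load-test script.

```python
import json
import urllib.request


def build_calculate_request(
    gateway_url: str, token: str, household: dict
) -> urllib.request.Request:
    """Build the POST from a raw fixture dict, with no pydantic round-trip."""
    body = json.dumps(household).encode("utf-8")
    return urllib.request.Request(
        url=gateway_url.rstrip("/") + "/us/calculate",  # hypothetical path
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
    )
```

With the payload built this way the script needs nothing outside the standard library, so the prewarm job could skip `uv sync` entirely.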
Fixes #1599
Summary
Validation
- `uv run --frozen --extra dev pytest .github/scripts -q`
- `uv run --frozen --extra dev pytest --confcutdir=tests/unit/modal_release tests/unit/modal_release/test_release_config.py .github/scripts/test_cloud_run_prewarm_full_calc_via_gateway.py .github/scripts/test_run_deployed_tests_for_modal_route.py`
- `uv run --frozen ruff format --check policyengine_household_api/modal_release/release_config.py tests/unit/modal_release/test_release_config.py .github/scripts/cloud_run_prewarm_full_calc_via_gateway.py .github/scripts/test_cloud_run_prewarm_full_calc_via_gateway.py`
- `uv run --frozen ruff check policyengine_household_api/modal_release/release_config.py tests/unit/modal_release/test_release_config.py .github/scripts/cloud_run_prewarm_full_calc_via_gateway.py .github/scripts/test_cloud_run_prewarm_full_calc_via_gateway.py`