Stabilize Cloud Run staging prewarm #1600

Open
anth-volk wants to merge 3 commits into main from codex/cloud-run-mfb-prewarm

anth-volk commented Jul 1, 2026

Collaborator

Fixes #1599

Summary

  • Replace the lightweight per-matrix Cloud Run warm step with channel-specific my_friend_ben prewarm gates.
  • Keep channel and exact Cloud Run route-mode tests parallel by splitting current and frontier into separate integration jobs.
  • Remove the obsolete lightweight warm script and mark the new prewarm script as Modal-release-sensitive.

Validation

  • uv run --frozen --extra dev pytest .github/scripts -q
  • uv run --frozen --extra dev pytest --confcutdir=tests/unit/modal_release tests/unit/modal_release/test_release_config.py .github/scripts/test_cloud_run_prewarm_full_calc_via_gateway.py .github/scripts/test_run_deployed_tests_for_modal_route.py
  • uv run --frozen ruff format --check policyengine_household_api/modal_release/release_config.py tests/unit/modal_release/test_release_config.py .github/scripts/cloud_run_prewarm_full_calc_via_gateway.py .github/scripts/test_cloud_run_prewarm_full_calc_via_gateway.py
  • uv run --frozen ruff check policyengine_household_api/modal_release/release_config.py tests/unit/modal_release/test_release_config.py .github/scripts/cloud_run_prewarm_full_calc_via_gateway.py .github/scripts/test_cloud_run_prewarm_full_calc_via_gateway.py
modal_release:
  new_app_target: none
  promote_existing_frontier: false
  cleanup_target: none

anth-volk marked this pull request as ready for review July 1, 2026 19:13
anth-volk requested a review from hua7450 July 1, 2026 19:17

hua7450 left a comment

Collaborator

Thanks for tackling this — a full-calculation prewarm is the right instinct, but making it a separate hard-gating job introduces some operational regressions I'd like resolved before this merges.

Design (blocking)

  1. gh run rerun --failed no longer re-warms. The old warm step ran inside every test job, so rerunning failed jobs re-warmed first. Now the prewarm is a separate job that stays green, so gh run rerun <id> --failed — our documented remediation for this exact flake — reruns only the failed test job against a worker that scaled to zero long ago. It 503s again, and the only recovery is rerunning the entire workflow including the deploys.
  2. A Modal outage now blocks all deploys, including failover fixes. The prewarm requires Modal to serve the heavy calculation under 90s within 5 attempts (_attempt_is_warm requires backend == "modal"). If Modal staging is degraded, the forced-fallback test jobs — which would pass — never run, and production deploys are blocked. We couldn't ship a Cloud Run failover fix during the exact incident the failover exists for.
  3. The warm can expire before the tests start. Staging workers have scaledown_window=300 and no min_containers (worker_app.py:54). The downstream test job re-runs checkout + setup-python + full uv sync + Auth0 fetch after the prewarm succeeds; if that exceeds 5 minutes, the warmed container is gone and the cold-start flake recurs — now with a hard gate in front of it. The old in-job warm had zero gap.

All three stem from the same choice. Suggest either keeping the prewarm job best-effort (report, don't gate — the deployed tests remain the gate, as the old script documented), or moving the warm back inside each test job (e.g., a session-scoped pytest fixture) so it survives --failed reruns and has no warm-to-test gap.
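To make the best-effort option concrete, here is a minimal sketch. It assumes a hypothetical `post_once` callable (not in the PR) that performs one full calculation through the gateway and returns `(backend, elapsed_seconds)`. The warm condition mirrors the thresholds described above (Modal backend, under 90s, 5 attempts), but the function only reports the outcome and never raises, so the deployed tests stay the gate.

```python
from typing import Callable, Tuple


def best_effort_warm(
    post_once: Callable[[], Tuple[str, float]],  # hypothetical: one calculation call
    max_attempts: int = 5,
    max_elapsed_seconds: float = 90.0,
) -> bool:
    """Try to warm the worker; report the outcome but never raise."""
    for attempt in range(1, max_attempts + 1):
        try:
            backend, elapsed = post_once()
        except Exception as exc:  # best-effort: every failure is only reported
            print(f"warm attempt {attempt}: error {exc!r}", flush=True)
            continue
        print(
            f"warm attempt {attempt}: backend={backend} elapsed={elapsed:.1f}s",
            flush=True,
        )
        # Assumed warm condition: served by Modal in under the threshold.
        if backend == "modal" and elapsed < max_elapsed_seconds:
            return True
    return False
```

If the warm moves back inside each test job instead, the same function could be called from a session-scoped, autouse pytest fixture in the deployed tests' conftest, with the return value logged rather than asserted. That way `gh run rerun --failed` re-warms and there is no warm-to-test gap.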

Script robustness (blocking)

Several exception paths bypass the retry loop entirely, so a single transient hard-fails the pipeline — the exact thing the retries exist to absorb (the old curl warmer swallowed all of these with || true):

  • _resolve_package_version makes one unretried GET on /versions/us and raises SystemExit on any failure, so a blip on the seconds-old gateway revision fails the job before the first warm attempt (cloud_run_prewarm_full_calc_via_gateway.py:149-168).
  • The bare exc.read() inside the except error.HTTPError handler is unguarded — a connection reset or IncompleteRead while draining a 503 body escapes and crashes the script on attempt 1 (line 211).
  • http.client.HTTPException subclasses aren't caught anywhere, e.g. IncompleteRead on a truncated 200 body (line 201) or BadStatusLine. The same goes for JSONDecodeError on the versions response (line 153) and a valid-JSON non-dict 200 body (payload.get raises AttributeError, line 240).
  • urlopen(timeout=...) is a per-socket-operation timeout, not a total deadline — a response that trickles bytes keeps one attempt alive until GitHub's 6-hour job timeout instead of failing within the intended ~5×130s (line 200).
  • Deterministic 4xx (bad token scope, payload drift against a new package version) is retried all 5 attempts identically to a 503, and the error body — the useful diagnostic, e.g. the 422 unsupported_version payload — is read and discarded. Fail fast on 4xx and print the body (lines 69, 211).
  • print() without flush=True is block-buffered under Actions, so the per-attempt progress lines only appear at process exit and you can't tell a progressing prewarm from a hung one (line 260). Use flush=True, or set PYTHONUNBUFFERED=1 in the step.
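A rough sketch of how these paths could be routed through the retry loop, using only the stdlib. All names here (`RETRYABLE`, `FatalClientError`, `read_with_deadline`, `parse_payload`, `post_once`, `run_attempts`) are illustrative and not taken from the script. The total deadline is checked between reads, so it is approximate: one read can still block for up to the socket timeout.

```python
import http.client
import json
import time
from urllib import error, request

# Transient failures the retry loop should absorb instead of crashing on.
# JSONDecodeError and UnicodeDecodeError are both ValueError subclasses.
RETRYABLE = (OSError, http.client.HTTPException, ValueError)


class FatalClientError(Exception):
    """Deterministic 4xx response; retrying the same request cannot help."""


def read_with_deadline(resp, deadline, chunk_size=65536):
    """Drain a body, giving up once the monotonic deadline has passed."""
    chunks = []
    while True:
        if time.monotonic() >= deadline:
            raise TimeoutError("total deadline exceeded while reading body")
        chunk = resp.read1(chunk_size)  # at most one underlying read per call
        if not chunk:
            return b"".join(chunks)
        chunks.append(chunk)


def parse_payload(body):
    """Decode a JSON object; reject valid JSON that is not a dict."""
    payload = json.loads(body)
    if not isinstance(payload, dict):
        raise ValueError(f"expected a JSON object, got {type(payload).__name__}")
    return payload


def post_once(req, socket_timeout, total_timeout):
    """One attempt: a 4xx becomes FatalClientError, anything else is retryable."""
    deadline = time.monotonic() + total_timeout
    try:
        with request.urlopen(req, timeout=socket_timeout) as resp:
            return parse_payload(read_with_deadline(resp, deadline))
    except error.HTTPError as exc:
        try:
            detail = read_with_deadline(exc, deadline).decode("utf-8", "replace")
        except RETRYABLE as read_exc:  # guarded drain of the error body
            detail = f"<error body unreadable: {read_exc!r}>"
        if 400 <= exc.code < 500:
            raise FatalClientError(f"HTTP {exc.code}: {detail}") from exc
        raise OSError(f"HTTP {exc.code}: {detail}") from exc


def run_attempts(attempt, max_attempts):
    """Retry transient failures; FatalClientError propagates immediately."""
    for n in range(1, max_attempts + 1):
        try:
            return attempt()
        except RETRYABLE as exc:
            print(f"attempt {n}/{max_attempts} failed: {exc!r}", flush=True)
    raise SystemExit(f"no successful attempt after {max_attempts} tries")
```

The version lookup could go through the same `run_attempts` wrapper, so a blip on `/versions/us` costs one attempt rather than the job. Whether 408 or 429 should count as retryable despite being 4xx is left open here.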

Non-blocking cleanups

  • The script imports HouseholdModelUS + the test fixture only to round-trip the payload through model_dump(), which is what forces uv sync --extra dev (full policyengine-us/uk/canada) in both prewarm jobs on the release critical path. cloud_run_gateway_load_test.py posts the raw fixture dict to the same endpoint — doing the same makes this script stdlib-only, drops the sync step, and shrinks the warm-to-test gap from design item 3.
  • The current/frontier prewarm + integration-test jobs are four near-identical blocks differing only by the channel literal (~120 duplicated lines); a matrix.channel: [current, frontier] prewarm job feeding the original test matrix does the same gating with half the YAML.
  • _post_calculation/WarmAttempt near-duplicate the HTTP client in cloud_run_gateway_load_test.py — worth extracting a shared helper in .github/scripts (precedent: modal_release_check_common.py).
  • --max-attempts 5 --max-elapsed-seconds 90 in both workflow jobs exactly restate the script's defaults — two sources of truth for the gating thresholds; drop one.
  • The test's inline ThreadingHTTPServer harness duplicates the one in test_cloud_run_gateway_load_test.py, and the relative script path with no cwd= makes the test repo-root-dependent.
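For the first cleanup, a stdlib-only request builder might look like the sketch below. The function name, the bearer-token header, and the shape of the fixture dict are assumptions for illustration; the actual endpoint and payload should follow whatever cloud_run_gateway_load_test.py already posts.

```python
import json
from urllib import request


def build_calculate_request(url: str, token: str, household: dict) -> request.Request:
    """Build a POST carrying the raw fixture dict, with no pydantic round-trip."""
    body = json.dumps(household).encode("utf-8")
    return request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
    )
```

With the payload built this way, the prewarm step needs no `uv sync --extra dev`, which also shortens the window between warming and testing.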

Development

Successfully merging this pull request may close these issues.

Stabilize Cloud Run staging warmup with MFB prewarm

2 participants