Split the simulation service into gateway/executor projects with lockfile-derived Modal images #610
Merged
logfire >=4.7 imports importlib_metadata unconditionally on Python 3.13 but stopped receiving it transitively, so the freshly rebuilt gateway and simulation images crash on import logfire. The gateway ASGI factory died at startup, hanging every request past Modal's HTTP window (303 redirects), which failed all beta integration tests and blocked the production deploy of the observability PR (#594). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Modal images resolved their packages fresh at image-build time from loose pip_install ranges, so image dependencies drifted from the tested uv.lock environment; issue #602 (logfire resolving to a release with an undeclared importlib_metadata dependency) is the failure mode. Define the image package sets as uv dependency groups resolved inside uv.lock, export them to checked-in pinned requirements files, and install the images from those. make update re-exports after relocking and a unit test fails CI when the exports drift from the lock. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
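The freshness check described in this commit can be pictured as a comparison between the checked-in export and a fresh export from the lock. This is a minimal sketch, not the repository's actual test: the function names and the sample pins are hypothetical, and it assumes the exports are plain `name==version` lines without hash continuation lines.

```python
def pinned_lines(requirements_text: str) -> set[str]:
    """Return the normalized pin lines of a requirements export.

    Assumption: one `name==version` pin per line; comments and blank
    lines carry no meaning and are dropped.
    """
    pins = set()
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if line:
            pins.add(line.lower())
    return pins


def export_drift(checked_in: str, fresh: str) -> dict[str, list[str]]:
    """Compare the checked-in export with a fresh export from the lock.

    "stale" pins exist only in the checked-in file; "missing" pins exist
    only in the fresh export. Both lists are empty when nothing drifted.
    """
    old, new = pinned_lines(checked_in), pinned_lines(fresh)
    return {"stale": sorted(old - new), "missing": sorted(new - old)}


# Illustrative pins only; these versions are made up for the example.
CHECKED_IN = "# exported from uv.lock\nlogfire==4.6.0\nhttpx==0.27.0\n"
FRESH = "httpx==0.27.0\nlogfire==4.6.0\nimportlib-metadata==8.5.0\n"

drift = export_drift(CHECKED_IN, FRESH)
```

A CI test built on this idea would fail whenever either list is non-empty, which is what forces a re-export after every relock.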
Prepares the gateway split (#609): the existing project becomes the executor (versioned Modal worker apps + standalone FastAPI service), and the directory/package names now say so. Package policyengine_api_simulation becomes policyengine_simulation_executor. The generated client package name (policyengine_api_simulation_client) and the deployed Modal app names are deliberately unchanged — external consumers and deployed identity are unaffected. Also deletes the dead libs/policyengine-simulation-api stub and drops the stale src/policyengine_api entry from the wheel target. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Move the observability plumbing shared by the gateway and executor — the policyengine-observability configuration wrapper, legacy Logfire helpers, error redaction, and the telemetry envelope — into a lib with its own pyproject and uv.lock. Isolated from the upcoming contract lib because it changes for a different reason: the Logfire pieces are retained only while a replacement platform is evaluated, and this dependency cluster is the one that caused #602. Both Modal images now mount the lib alongside the project sources, with the mount tuple pinned by a new image-structure assertion. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Move the gateway↔executor contract into its own lib: the request/response models (gateway_models, moved whole because the budget-window models inherit GatewayRequestBase, so splitting the module would strand a base class), shared budget-window job state over modal.Dict, and dataset reference resolution (dataset_uri, hf_dataset). The budget_window_state → gateway models import cycle dissolves intra-package. Depends on the observability lib for TelemetryEnvelope. Both Modal images mount the lib; the contract lib carries its own minimal modal mock for state round-trip tests. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The gateway becomes its own uv project: its Modal image now installs with uv_sync(frozen=True) from the project's own uv.lock, so the image environment is exactly the one CI unit tests run against — packages can only change through a relock (closes the #602 class). Path dependencies (contract/observability libs, policyengine-fastapi) live only in the dev group and the image passes --no-default-groups, because Modal ships just pyproject+lock into the build context; local code mounts via add_local_python_source. The uv_project_dir is guarded by modal.is_local() — inside the container this module mounts at /root/app.py where the local path math cannot resolve (caught by the staging deploy gate). Gateway tests move with it and run in the parity env, plus new guards: image-structure (uv_sync + mounts + no ad-hoc pip layers), import coverage including the explicit 'import logfire' #602 path, an OpenAPI golden captured from main before the migration (byte-identical), and a route-table pin between the real router and the OpenAPI stub. The cross-project response-shape contract test stays in the executor, which gains a dev-only path dependency on the gateway for its scheduler seam tests. Client generation and publishing move to the gateway project; the generated client package name is unchanged. The policyengine update script now re-exports the executor image requirements after relocking so bot PRs pass the freshness test. Verified live: gateway deployed to Modal staging from this project and /health returns 200 from the uv_sync image. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
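The route-table pin between the real router and the OpenAPI stub amounts to asserting that both expose the same set of routes. The sketch below shows one way to express that. It is an assumption about the shape of the check, not the project's test: routes are modeled as hypothetical (method, path) pairs, and `/hypothetical-route` is invented for the example.

```python
from typing import Iterable

Route = tuple[str, str]  # (HTTP method, path), a hypothetical representation


def route_table_diff(
    real: Iterable[Route], stub: Iterable[Route]
) -> dict[str, list[Route]]:
    """Report routes that exist on only one side of the pin.

    Methods are upper-cased so "get" and "GET" compare equal.
    """
    real_set = {(method.upper(), path) for method, path in real}
    stub_set = {(method.upper(), path) for method, path in stub}
    return {
        "missing_from_stub": sorted(real_set - stub_set),
        "extra_in_stub": sorted(stub_set - real_set),
    }


real_routes = [("GET", "/health"), ("get", "/ping"), ("POST", "/hypothetical-route")]
stub_routes = [("GET", "/health"), ("GET", "/ping")]

diff = route_table_diff(real_routes, stub_routes)
```

A pin test would assert that both lists are empty, so adding a route to the real router without updating the stub fails in CI.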
Each smoke runs the true entrypoint surface inside the real Modal image before merge: every package module, the explicit 'import logfire' that crashed in #602, and (for the gateway) the real ASGI factory. Images are built through the same shared builders as the deployed apps — build_gateway_image() and the newly factored build_runtime_simulation_image() (the executor's layer prefix without the multi-hour dataset prebuild, which adds no Python packages) — so layers are content-addressed cache hits and a warm smoke takes seconds. Wired as .github/workflows/pr-image-smoke.yml: path-filtered to image inputs, staging Modal env, skipped on fork PRs (no secrets), with integration-marked pytest wrappers for local use. Verified against staging: gateway smoke imports 15 modules and returns the route table; executor smoke imports the worker entrypoints and both libs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
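The "import every package module" part of the smoke can be sketched with the standard library alone. This is a hedged illustration of the technique, not the smoke script in this PR: `smoke_import` is a hypothetical name, and the real smoke additionally runs inside the Modal image and exercises the ASGI factory.

```python
import importlib
import pkgutil


def smoke_import(package_name: str) -> tuple[list[str], dict[str, str]]:
    """Import a package and every submodule beneath it.

    Returns (imported module names, {failed module name: repr(error)}).
    """
    imported: list[str] = []
    failures: dict[str, str] = {}
    importlib.invalidate_caches()
    try:
        package = importlib.import_module(package_name)
    except Exception as exc:
        return imported, {package_name: repr(exc)}
    imported.append(package_name)

    search_path = getattr(package, "__path__", None)
    if search_path is None:
        return imported, failures  # a plain module has no submodules

    # onerror swallows errors raised while walk_packages descends into
    # subpackages; the explicit import below records them instead.
    walker = pkgutil.walk_packages(
        search_path, prefix=f"{package_name}.", onerror=lambda name: None
    )
    for info in walker:
        try:
            importlib.import_module(info.name)
            imported.append(info.name)
        except Exception as exc:
            failures[info.name] = repr(exc)
    return imported, failures
```

Run against `logfire` in an image missing `importlib_metadata`, a check of this kind would report the failure instead of letting it surface at container startup.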
The executor's module-level requirements-file path used Path(__file__).resolve().parents[2], which raises IndexError when Modal loads the deployed function's module at /root/app.py — the container died at import and the staging worker crash-looped on its first boot after the restructure (the gateway had the same bug, fixed in the gateway commit; this applies the same modal.is_local() guard to the executor). The image smoke missed it because it imports src.modal.app as a package, not as the entrypoint; both projects now carry a regression test that compiles the app module at /root/app.py with is_local()=False, replicating the exact container placement. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
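The failure and its guard can be reproduced without Modal. The sketch below is an illustration under stated assumptions: it uses `PurePosixPath` instead of `Path(...).resolve()` so no filesystem is touched, it injects a stand-in `is_local` callable where the real code calls `modal.is_local()`, and the two source strings are simplified stand-ins for the app module.

```python
from pathlib import PurePosixPath
from typing import Callable

# Where Modal places the deployed entrypoint module, per this PR.
CONTAINER_ENTRYPOINT = "/root/app.py"

# Simplified stand-ins for the module-level path math in the app module.
UNGUARDED_SOURCE = (
    "from pathlib import PurePosixPath\n"
    "PROJECT_DIR = PurePosixPath(__file__).parents[2]\n"
)
GUARDED_SOURCE = (
    "from pathlib import PurePosixPath\n"
    "PROJECT_DIR = PurePosixPath(__file__).parents[2] if is_local() else None\n"
)


def crashes_at_container_import(source: str, is_local: Callable[[], bool]) -> bool:
    """Execute module source as if it lived at /root/app.py.

    Returns True when the module-level code raises IndexError, which is
    the crash described above: /root/app.py has only two parents
    ("/root" and "/"), so parents[2] does not exist.
    """
    namespace = {
        "__file__": CONTAINER_ENTRYPOINT,
        "__name__": "app",
        "is_local": is_local,  # stand-in for modal.is_local
    }
    try:
        exec(compile(source, CONTAINER_ENTRYPOINT, "exec"), namespace)
    except IndexError:
        return True
    return False


in_container = lambda: False  # inside the container, is_local() is False

unguarded_crashes = crashes_at_container_import(UNGUARDED_SOURCE, in_container)
guarded_crashes = crashes_at_container_import(GUARDED_SOURCE, in_container)
```

Here `unguarded_crashes` is `True` and `guarded_crashes` is `False`. Executing the source with the container's `__file__` is what distinguishes this check from a package-style import, which resolves a deeper path and never reaches the failing index.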
The executor image now installs its bootstrap straight from this project's uv.lock — uv_sync(frozen=True, --only-group modal-simulation-image) — replacing the checked-in requirements export and deleting its whole apparatus: the export script, the freshness test, and the re-export hooks in make update and the policyengine bot script. Image packages are canonically lock-derived in both simulation projects now. The policyengine bundle installs the country models into uv_sync's venv via --venv /.uv/.venv so both installers share one environment (first attempt split them: the bundle's pip ran against system Python; second attempt found uv venvs ship without pip — the group now carries pip for the bundle installer, and policyengine.py#452 requests the datasets-only mode that will let uv own the model packages too; #611 tracks that flip). Verified in staging: full image rebuild (bundle into venv, dataset prebuild imports from it, baked files verified), image smoke, scripted deploy, and a Utah state-level calculation completing in 127s. Main-env image cache prewarmed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The executor's dev dependency group holds a path dependency on the gateway project (for the scheduler seam tests), and the container filesystem deliberately excludes it. The Dockerfile's build-time sync already passed --no-dev, but the runtime 'uv run uvicorn' did not — and uv run both installs the dev group by default and revalidates the lock, which builds metadata for every path source. Either is fatal in the container, so the service never bound its port and the PR's integration lane failed from the first push (masked by a faulty CI watch that could not see the gated job). --frozen --no-dev at both call sites: sync from the baked lock as-is, never touch dev. Verified locally: compose service healthy on :8082 and the CI integration selection passes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Fixes #602
Fixes #607
Fixes #609
Supersedes #603 and #608 — their commits re-land verbatim as commits 1–2.
Problem
Modal images resolved dependencies fresh at image-build time from loose pip_install ranges, so image environments drifted from the tested uv.lock environment. #602 is the incident this enabled: a fresh resolution pulled logfire 4.37.0 without its undeclared importlib_metadata dependency, the gateway crash-looped at container startup, all beta integration tests failed, and the production deploy was skipped, with nothing visible in unit tests or PR CI.

Structure
- projects/policyengine-simulation-gateway (new): the Modal gateway as its own uv project. Its image installs with uv_sync(frozen=True, --no-default-groups) from the project's own uv.lock, the exact environment CI unit tests run in (environment parity). Path deps live only in the dev group because Modal ships just pyproject+lock into the build context; local code mounts via add_local_python_source.
- projects/policyengine-simulation-executor (renamed from policyengine-api-simulation): versioned Modal worker apps + the standalone FastAPI service. Keeps the bundle-install architecture; its image bootstrap installs a pinned export of the modal-simulation-image dependency group (requirements/, freshness-tested in CI, regenerated by make update and the policyengine bot script).
- libs/policyengine-simulation-contract (new): the gateway↔executor contract, covering request/response models, budget-window state over modal.Dict, and dataset references.
- libs/policyengine-simulation-observability (new): observability wrapper, logfire_legacy, error redaction, telemetry. Isolated because Logfire is transitional and this dependency cluster caused #602 (Beta gateway crash-loops: logfire import fails with ModuleNotFoundError (importlib_metadata) in rebuilt Modal images).
- Deleted: the dead stub libs/policyengine-simulation-api.
- Unchanged: the generated client package name (policyengine_api_simulation_client).

New guards
- Image smoke (.github/workflows/pr-image-smoke.yml, path-filtered, staging): imports every module, the explicit import logfire #602 path, and the real ASGI factory inside the actual images; the executor smoke reuses the deployed image's cached layer prefix minus the multi-hour dataset prebuild.
- Entrypoint regression test: compiles the app module at /root/app.py with modal.is_local() false, which is Modal's exact deployed-entrypoint placement. That placement crashed twice during this work (parents[2] IndexError; caught in the gateway by the staging-deploy gate and in the executor by the first real worker invocation).

Verification (all against staging; production untouched)
- Scripted deploy (modal-deploy-app.sh staging): gateway + versioned worker + routing state.
- /health 200, /ping 200, gated routes return immediate 403 without a token (vs. this morning's indefinite hangs).
- Function call (fc-01KWMT0QYAERVN68Y9QBYV7A1B): full output structure, observability operation record emitted from inside the image.

Note for reviewers: the gateway image's pinned-requirements export is added in commit 2 (verbatim #608 re-land) and removed in commit 6 when the gateway becomes uv_sync-based. It is kept verbatim to preserve the already-reviewed commits.
🤖 Generated with Claude Code