
Split the simulation service into gateway/executor projects with lockfile-derived Modal images #610

Merged: anth-volk merged 13 commits into main from simulation-service-split on Jul 3, 2026

Conversation

@anth-volk

Contributor

Fixes #602
Fixes #607
Fixes #609

Supersedes #603 and #608 — their commits re-land verbatim as commits 1–2.

Problem

Modal images resolved dependencies fresh at image-build time from loose pip_install ranges, so image environments drifted from the tested uv.lock environment. #602 is the incident this enabled: a fresh resolution pulled logfire 4.37.0 without its undeclared importlib_metadata dependency, the gateway crash-looped at container startup, all beta integration tests failed, and the production deploy was skipped — with nothing visible in unit tests or PR CI.

Structure

  • projects/policyengine-simulation-gateway (new): the Modal gateway as its own uv project. Its image installs with uv_sync(frozen=True, --no-default-groups) from the project's own uv.lock — the exact environment CI unit tests run in (environment parity). Path deps live only in the dev group because Modal ships just pyproject+lock into the build context; local code mounts via add_local_python_source.
  • projects/policyengine-simulation-executor (renamed from policyengine-api-simulation): versioned Modal worker apps + the standalone FastAPI service. Keeps the bundle-install architecture; its image bootstrap installs a pinned export of the modal-simulation-image dependency group (requirements/, freshness-tested in CI, regenerated by make update and the policyengine bot script).
  • libs/policyengine-simulation-contract (new): gateway↔executor contract — request/response models, budget-window state over modal.Dict, dataset references.
  • libs/policyengine-simulation-observability (new): observability wrapper, logfire_legacy, error redaction, telemetry. Isolated because Logfire is transitional and this dependency cluster caused #602 (Beta gateway crash-loops: logfire import fails with ModuleNotFoundError (importlib_metadata) in rebuilt Modal images).
  • Deleted dead stub libs/policyengine-simulation-api.
  • Unchanged on purpose: deployed Modal app names and the generated client package name (policyengine_api_simulation_client).

New guards

  • Pre-merge image smoke (.github/workflows/pr-image-smoke.yml, path-filtered, staging): inside the actual images, imports every module, runs the explicit import logfire path that crashed in #602, and invokes the real ASGI factory; the executor smoke reuses the deployed image's cached layer prefix minus the multi-hour dataset prebuild.
  • Entrypoint-placement regression tests in both projects: compile the app module at /root/app.py with modal.is_local() false — Modal's exact deployed-entrypoint placement, which crashed twice during this work (parents[2] IndexError; caught in the gateway by the staging-deploy gate and in the executor by the first real worker invocation).
  • OpenAPI golden captured from main pre-migration (byte-identical after the move) + route-table pin between the real router and the OpenAPI stub.
  • CI test matrix now covers each project/lib in its own locked env (libs were previously untested in CI).
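
The route-table pin in the list above can be pictured as a set comparison between two route tables. This is a minimal sketch, assuming each router can be reduced to (method, path) pairs; the function name route_table_diff and the /jobs route are hypothetical and not taken from this PR.

```python
from typing import Iterable

Route = tuple[str, str]  # (HTTP method, path)


def route_table_diff(real: Iterable[Route], stub: Iterable[Route]) -> dict[str, list[Route]]:
    """Return routes present on one side only; two empty lists mean the tables are pinned."""
    real_set, stub_set = set(real), set(stub)
    return {
        "missing_from_stub": sorted(real_set - stub_set),
        "missing_from_real": sorted(stub_set - real_set),
    }


# Hypothetical route tables for illustration only.
real_routes = [("GET", "/health"), ("GET", "/ping"), ("POST", "/jobs")]
stub_routes = [("GET", "/health"), ("GET", "/ping")]

diff = route_table_diff(real_routes, stub_routes)
```

A test built on this would assert that both lists are empty, so adding a route to the real router without updating the stub fails CI.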

Verification (all against staging; production untouched)

  • All four suites green in their own envs: executor 167, gateway 101, contract 51, observability 18.
  • Full scripted deploy (modal-deploy-app.sh staging): gateway + versioned worker + routing state.
  • Gateway live: /health 200, /ping 200, gated routes return immediate 403 without a token (vs. this morning's indefinite hangs).
  • End-to-end calculation on the deployed worker: Utah state-level macro comparison (CTC fully-refundable), completed in 118s (fc-01KWMT0QYAERVN68Y9QBYV7A1B), full output structure, observability operation record emitted from inside the image.
  • Both image smokes pass; image layer caches prewarmed in staging and main so the post-merge deploys fast-forward.

Note for reviewers: the gateway image's pinned-requirements export is added in commit 2 (verbatim #608 re-land) and removed in commit 6 when the gateway becomes uv_sync-based — kept verbatim to preserve the already-reviewed commits.

🤖 Generated with Claude Code

anth-volk and others added 10 commits July 3, 2026 18:46
logfire >=4.7 imports importlib_metadata unconditionally on Python 3.13
but stopped receiving it transitively, so the freshly rebuilt gateway
and simulation images crash on import logfire. The gateway ASGI factory
died at startup, hanging every request past Modal's HTTP window (303
redirects), which failed all beta integration tests and blocked the
production deploy of the observability PR (#594).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Modal images resolved their packages fresh at image-build time from
loose pip_install ranges, so image dependencies drifted from the tested
uv.lock environment; issue #602 (logfire resolving to a release with an
undeclared importlib_metadata dependency) is the failure mode. Define
the image package sets as uv dependency groups resolved inside uv.lock,
export them to checked-in pinned requirements files, and install the
images from those. make update re-exports after relocking and a unit
test fails CI when the exports drift from the lock.
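
The drift check described in this commit can be sketched as rendering pins from the lock and comparing them with the checked-in file. This is a minimal sketch, assuming the lock's resolved versions are already available as a name-to-version mapping; render_pins, export_is_fresh, and the sample versions are hypothetical, and the actual test may re-run the export rather than compare a mapping.

```python
def render_pins(locked: dict[str, str]) -> str:
    """Render name==version lines in a stable, sorted order."""
    return "".join(f"{name}=={version}\n" for name, version in sorted(locked.items()))


def export_is_fresh(locked: dict[str, str], checked_in: str) -> bool:
    """True when the checked-in requirements text matches the lock exactly."""
    return render_pins(locked) == checked_in


# Hypothetical lock contents; versions are placeholders, not this PR's pins.
locked = {"logfire": "4.0.0", "importlib-metadata": "8.0.0"}
checked_in = "importlib-metadata==8.0.0\nlogfire==4.0.0\n"
stale = "logfire==4.0.0\n"
```

With these inputs, export_is_fresh(locked, checked_in) holds, while the stale file (missing a pin that the lock carries) does not match.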

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Prepares the gateway split (#609): the existing project becomes the
executor (versioned Modal worker apps + standalone FastAPI service),
and the directory/package names now say so. Package
policyengine_api_simulation becomes policyengine_simulation_executor.
The generated client package name (policyengine_api_simulation_client)
and the deployed Modal app names are deliberately unchanged — external
consumers and deployed identity are unaffected. Also deletes the dead
libs/policyengine-simulation-api stub and drops the stale
src/policyengine_api entry from the wheel target.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Move the observability plumbing shared by the gateway and executor —
the policyengine-observability configuration wrapper, legacy Logfire
helpers, error redaction, and the telemetry envelope — into a lib with
its own pyproject and uv.lock. Isolated from the upcoming contract lib
because it changes for a different reason: the Logfire pieces are
retained only while a replacement platform is evaluated, and this
dependency cluster is the one that caused #602. Both Modal images now
mount the lib alongside the project sources, with the mount tuple
pinned by a new image-structure assertion.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Move the gateway↔executor contract into its own lib: the request/
response models (gateway_models, moved whole — the budget-window
models inherit GatewayRequestBase, so splitting the module would
strand a base class), shared budget-window job state over modal.Dict,
and dataset reference resolution (dataset_uri, hf_dataset). The
budget_window_state → gateway models import cycle dissolves
intra-package. Depends on the observability lib for TelemetryEnvelope.
Both Modal images mount the lib; the contract lib carries its own
minimal modal mock for state round-trip tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The gateway becomes its own uv project: its Modal image now installs
with uv_sync(frozen=True) from the project's own uv.lock, so the image
environment is exactly the one CI unit tests run against — packages can
only change through a relock (closes the #602 class). Path dependencies
(contract/observability libs, policyengine-fastapi) live only in the
dev group and the image passes --no-default-groups, because Modal ships
just pyproject+lock into the build context; local code mounts via
add_local_python_source. The uv_project_dir is guarded by
modal.is_local() — inside the container this module mounts at
/root/app.py where the local path math cannot resolve (caught by the
staging deploy gate).

Gateway tests move with it and run in the parity env, plus new guards:
image-structure (uv_sync + mounts + no ad-hoc pip layers), import
coverage including the explicit 'import logfire' #602 path, an OpenAPI
golden captured from main before the migration (byte-identical), and a
route-table pin between the real router and the OpenAPI stub. The
cross-project response-shape contract test stays in the executor, which
gains a dev-only path dependency on the gateway for its scheduler
seam tests. Client generation and publishing move to the gateway
project; the generated client package name is unchanged. The
policyengine update script now re-exports the executor image
requirements after relocking so bot PRs pass the freshness test.
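
The byte-identical OpenAPI golden mentioned above can be illustrated as follows. This is a sketch only: the serialization settings (sorted keys, two-space indent, trailing newline) and the names canonical_bytes and matches_golden are assumptions, and the schema is a heavily trimmed placeholder.

```python
import json


def canonical_bytes(schema: dict) -> bytes:
    """Serialize with fixed key order and indentation so equal schemas give equal bytes."""
    return (json.dumps(schema, sort_keys=True, indent=2) + "\n").encode("utf-8")


def matches_golden(schema: dict, golden: bytes) -> bool:
    """True only when the serialized schema is byte-identical to the golden."""
    return canonical_bytes(schema) == golden


# Hypothetical, trimmed schema; the real golden is the full OpenAPI document.
golden = canonical_bytes({"openapi": "3.1.0", "paths": {"/health": {}, "/ping": {}}})
same = {"paths": {"/ping": {}, "/health": {}}, "openapi": "3.1.0"}
drifted = {"openapi": "3.1.0", "paths": {"/health": {}}}
```

Key order does not affect the result because the serializer sorts keys, whereas a dropped path changes the bytes and fails the comparison.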

Verified live: gateway deployed to Modal staging from this project and
/health returns 200 from the uv_sync image.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Each smoke runs the true entrypoint surface inside the real Modal image
before merge: every package module, the explicit 'import logfire' that
crashed in #602, and (for the gateway) the real ASGI factory. Images
are built through the same shared builders as the deployed apps —
build_gateway_image() and the newly factored
build_runtime_simulation_image() (the executor's layer prefix without
the multi-hour dataset prebuild, which adds no Python packages) — so
layers are content-addressed cache hits and a warm smoke takes seconds.

Wired as .github/workflows/pr-image-smoke.yml: path-filtered to image
inputs, staging Modal env, skipped on fork PRs (no secrets), with
integration-marked pytest wrappers for local use. Verified against
staging: gateway smoke imports 15 modules and returns the route table;
executor smoke imports the worker entrypoints and both libs.
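
The "import every package module" step can be sketched with the standard library alone. This is an illustration, not the smoke's actual code: import_all_modules is a hypothetical name, skipping __main__ modules is an assumption, and the demonstration targets the stdlib json package rather than the project packages.

```python
import importlib
import pkgutil


def import_all_modules(package_name: str) -> list[str]:
    """Import a package and every submodule under it; return the names imported.

    Any ImportError (for example a missing transitive dependency) propagates,
    which is what makes a smoke of this kind fail.
    """
    package = importlib.import_module(package_name)
    imported = [package.__name__]
    # Plain modules have no __path__, so there is nothing to walk.
    search_path = getattr(package, "__path__", [])
    for info in pkgutil.walk_packages(search_path, prefix=f"{package.__name__}."):
        if info.name.endswith(".__main__"):
            continue  # assumption: skip CLI entry modules, which may run code on import
        importlib.import_module(info.name)
        imported.append(info.name)
    return imported


# Demonstrated on a stdlib package; a real smoke would name the project packages.
names = import_all_modules("json")
```

Running this inside the built image, rather than in the local environment, is what exposes a dependency that resolved differently at image-build time.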

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The executor's module-level requirements-file path used
Path(__file__).resolve().parents[2], which raises IndexError when Modal
loads the deployed function's module at /root/app.py — the container
died at import and the staging worker crash-looped on its first boot
after the restructure (the gateway had the same bug, fixed in the
gateway commit; this applies the same modal.is_local() guard to the
executor). The image smoke missed it because it imports src.modal.app
as a package, not as the entrypoint; both projects now carry a
regression test that compiles the app module at /root/app.py with
is_local()=False, replicating the exact container placement.
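
The failure and the guard can be reproduced with pure path arithmetic. This is a minimal sketch: project_root is a hypothetical helper, the is_local flag stands in for modal.is_local(), and the local checkout path is invented for illustration.

```python
from pathlib import PurePosixPath
from typing import Optional


def project_root(module_file: str, is_local: bool) -> Optional[PurePosixPath]:
    """Resolve the project root only when running locally.

    Inside the container the module sits at /root/app.py, which has only
    two ancestors (/root and /), so parents[2] would raise IndexError.
    """
    if not is_local:
        return None
    return PurePosixPath(module_file).parents[2]


# Hypothetical local checkout path, for illustration only.
local_root = project_root("/repo/projects/executor/src/modal/app.py", is_local=True)
container_root = project_root("/root/app.py", is_local=False)

# The unguarded expression fails at the container placement.
try:
    PurePosixPath("/root/app.py").parents[2]
    unguarded_failed = False
except IndexError:
    unguarded_failed = True
```

Because the original expression ran at module level, the IndexError surfaced at import time, before any function body could execute.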

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
anth-volk and others added 3 commits July 3, 2026 23:35
The executor image now installs its bootstrap straight from this
project's uv.lock — uv_sync(frozen=True, --only-group
modal-simulation-image) — replacing the checked-in requirements export
and deleting its whole apparatus: the export script, the freshness
test, and the re-export hooks in make update and the policyengine bot
script. Image packages are canonically lock-derived in both simulation
projects now.

The policyengine bundle installs the country models into uv_sync's
venv via --venv /.uv/.venv so both installers share one environment.
The first attempt split them: the bundle's pip ran against system
Python. The second attempt found that uv venvs ship without pip, so
the group now carries pip for the bundle installer.
policyengine.py#452 requests the datasets-only mode that will let uv
own the model packages too; #611 tracks that flip.

Verified in staging: full image rebuild (bundle into venv, dataset
prebuild imports from it, baked files verified), image smoke, scripted
deploy, and a Utah state-level calculation completing in 127s. Main-env
image cache prewarmed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The executor's dev dependency group holds a path dependency on the
gateway project (for the scheduler seam tests), and the container
filesystem deliberately excludes it. The Dockerfile's build-time sync
already passed --no-dev, but the runtime 'uv run uvicorn' did not —
and uv run both installs the dev group by default and revalidates the
lock, which builds metadata for every path source. Either is fatal in
the container, so the service never bound its port and the PR's
integration lane failed from the first push (masked by a faulty CI
watch that could not see the gated job). --frozen --no-dev at both
call sites: sync from the baked lock as-is, never touch dev.

Verified locally: compose service healthy on :8082 and the CI
integration selection passes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@anth-volk anth-volk marked this pull request as ready for review July 3, 2026 22:50
@anth-volk anth-volk merged commit 0f4bee9 into main Jul 3, 2026
8 checks passed