feat(webapp): plan-aware compute migration #3957
Walkthrough: This PR introduces a plan-aware compute migration system that routes organizations onto compute backing at task trigger time.
Addressed the review feedback, plus a few issues a deeper review pass turned up:
Operational notes for rollout:
3cf484d to 697de03
Follow-up: replaced the
Rollout requirement: set Treat
a1a460e to b75e18a
…egion, drop env map
…ting
Drop the per-trigger readiness gate (waitUntilReady) and the /healthcheck dependency; a cold registry read now falls back to not-migrated, matching the datastore and llm-pricing registries. Add a computeMigrationRequireTemplate flag (when on, migrated orgs build the compute template in required mode at deploy; otherwise in shadow mode) and a reloading_registry_loaded gauge so a never-loaded registry is alertable. Drop the dead GLOBAL_FLAGS_READY_TIMEOUT_MS.
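As a rough illustration of the shadow/required split described in this commit, the sketch below shows one way a deploy path might branch on the flag. Only `computeMigrationRequireTemplate` comes from the PR; `buildTemplateAtDeploy`, `build`, `onShadowError`, and the `"skipped"` outcome for non-migrated orgs are hypothetical names and assumptions.

```typescript
// Hypothetical sketch; not the PR's actual implementation.
type TemplateOutcome = "skipped" | "shadow" | "required";

interface TemplateBuildOptions {
  migrated: boolean; // result of the migration decision for this org
  requireTemplate: boolean; // value of the computeMigrationRequireTemplate flag
  build: () => Promise<void>; // hypothetical template builder
  onShadowError?: (err: unknown) => void; // hypothetical hook, e.g. for logging
}

async function buildTemplateAtDeploy(
  opts: TemplateBuildOptions
): Promise<TemplateOutcome> {
  // Assumption: orgs that are not migrated get no compute template here.
  if (!opts.migrated) return "skipped";

  if (opts.requireTemplate) {
    // Required mode: a build failure rejects, so the error surfaces at deploy.
    await opts.build();
    return "required";
  }

  // Shadow mode: best-effort; errors are reported but never fail the deploy.
  try {
    await opts.build();
  } catch (err) {
    opts.onShadowError?.(err);
  }
  return "shadow";
}
```

In this sketch the only difference between the two modes is whether a build error propagates to the caller.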
188dac2 to 1930f26
Prevents the startup retry from racing the reload interval: an interval tick firing mid-startup-load would bump the load sequence and discard the startup result, leaving the registry cold until a later reload. Start polling only after the first successful load.
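The ordering described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's `createReloadingRegistry`: the name `createReloadingRegistrySketch`, the option names, and the retry-until-success startup loop are all hypothetical.

```typescript
// Hypothetical sketch; names and options are assumptions, not the PR's API.
interface RegistryOptions<T> {
  load: () => Promise<T>; // fetches a fresh snapshot (e.g. from the database)
  fallback: T; // value served while the registry is cold
  reloadIntervalMs: number;
  startupRetryDelayMs: number;
}

interface Registry<T> {
  get(): T;
  isLoaded(): boolean; // what a "loaded" gauge could report
  start(): Promise<void>;
  stop(): void;
}

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

function createReloadingRegistrySketch<T>(opts: RegistryOptions<T>): Registry<T> {
  let snapshot: T = opts.fallback;
  let loaded = false;
  let loadSeq = 0; // each load takes a ticket; only the newest may publish
  let stopped = false;
  let timer: ReturnType<typeof setInterval> | undefined;

  async function loadOnce(): Promise<boolean> {
    const seq = ++loadSeq;
    try {
      const value = await opts.load();
      if (seq !== loadSeq) return false; // a newer load started; drop this result
      snapshot = value;
      loaded = true;
      return true;
    } catch {
      return false; // keep serving the previous snapshot (or the fallback)
    }
  }

  return {
    get: () => snapshot,
    isLoaded: () => loaded,
    async start() {
      // Retry the startup load while no interval is running, so no tick can
      // bump loadSeq and discard the startup result.
      while (!stopped && !(await loadOnce())) {
        await sleep(opts.startupRetryDelayMs);
      }
      // Begin polling only after the first successful load.
      if (!stopped && timer === undefined) {
        timer = setInterval(() => void loadOnce(), opts.reloadIntervalMs);
      }
    },
    stop() {
      stopped = true;
      if (timer !== undefined) clearInterval(timer);
      timer = undefined;
    },
  };
}
```

Reads never wait on a load: `get()` returns the fallback while the registry is cold and the latest published snapshot afterwards.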
Adds an opt-in mechanism to route a configurable percentage of organizations onto the compute (MicroVM) backing of their region at trigger time, without changing their stored region settings.

Routing is gated by three global feature flags - `computeMigrationEnabled`, `computeMigrationFreePercentage`, `computeMigrationPaidPercentage` - plus a per-org `computeMigrationEnabled` override that wins in both directions. A region's compute backing is resolved from a new `WorkerInstanceGroup.region` column: a container group and its MicroVM group share one geo `region`, so the migration swaps the resolved worker queue to the backing group's queue. Orgs are bucketed deterministically by id, so ramping a percentage down keeps a strict subset rather than reshuffling, and a region with no compute backing is never touched. Everything is off by default - behaviour is unchanged unless the flags are set.

The flags and the worker-region groups are read on the trigger hot path from in-memory snapshots rather than the database: a small `createReloadingRegistry` helper loads each at startup and refreshes them on an interval, so no per-trigger query is added and a percentage or kill-switch change propagates within the reload interval. A cold replica whose snapshot hasn't loaded yet reads as not-migrated (the container path) and self-corrects on the next load - the same cold-start contract as the datastore / LLM-pricing registries, with a `reloading_registry_loaded` metric so a never-loaded registry is alertable.

The same migration decision is consulted at deploy-time template creation so a migrated org gets a compute template built ahead of its first run. This runs in shadow mode (best-effort, never fails the deploy) by default, or - when the `computeMigrationRequireTemplate` flag is on - in required mode, built synchronously at deploy so the first run never builds on-demand and template errors surface at deploy time.

So operators keep "which runs ran where" while customers only see geography: the run's actual worker queue is stored raw, and the geo region is stamped separately on `TaskRun.region` (and a new ClickHouse `region` column) at trigger time. Read surfaces - the dashboard, the API, and the Query/Logs page - show the geo region, falling back to the worker queue for runs written before the column existed.

Minor follow-ups left out of scope: the percentage flags render as text inputs on the admin flags page (the catalog UI has no numeric control type yet), and `createReloadingRegistry` could later gain pub/sub for sub-second cross-replica propagation if the reload interval proves too slow.
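To make the routing rules in this description concrete, here is a minimal sketch of a deterministic, percentage-based decision and the queue swap. Only the three flag names and the per-org `computeMigrationEnabled` override come from the PR. The FNV-1a hash, the 100-bucket scheme, the `kind` field, and the function names are assumptions for illustration, as is checking the per-org override before the global switch.

```typescript
// Hypothetical sketch; not the PR's actual implementation.
interface MigrationFlags {
  computeMigrationEnabled: boolean;
  computeMigrationFreePercentage: number; // 0..100
  computeMigrationPaidPercentage: number; // 0..100
}

interface Org {
  id: string;
  isPaid: boolean;
  // true forces migration, false forces the container path,
  // undefined defers to the percentage rollout.
  computeMigrationEnabled?: boolean;
}

interface WorkerGroup {
  queue: string;
  region: string; // geo region shared by a container group and its MicroVM group
  kind: "container" | "microvm"; // hypothetical discriminator
}

// FNV-1a (32-bit) over UTF-16 code units: a stable, dependency-free stand-in hash.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Each org id maps to one fixed bucket in [0, 100).
function bucketFor(orgId: string): number {
  return fnv1a(orgId) % 100;
}

function isMigrated(org: Org, flags: MigrationFlags): boolean {
  // Assumption: the per-org override is checked first, so it wins either way.
  if (org.computeMigrationEnabled !== undefined) return org.computeMigrationEnabled;
  if (!flags.computeMigrationEnabled) return false;
  const percentage = org.isPaid
    ? flags.computeMigrationPaidPercentage
    : flags.computeMigrationFreePercentage;
  // The bucket never changes, so lowering the percentage only removes orgs.
  return bucketFor(org.id) < percentage;
}

function resolveWorkerQueue(
  current: WorkerGroup,
  groups: readonly WorkerGroup[],
  migrated: boolean
): string {
  if (!migrated) return current.queue;
  const backing = groups.find(
    (g) => g.kind === "microvm" && g.region === current.region
  );
  // A region with no compute backing is left untouched.
  return backing ? backing.queue : current.queue;
}
```

Because the bucket depends only on the org id, every org migrated at 10% is also migrated at 50%, which is the strict-subset property the description relies on.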