
feat(webapp): plan-aware compute migration#3957

Open
nicktrn wants to merge 24 commits into main from feat/compute-migration

Conversation

@nicktrn

@nicktrn nicktrn commented Jun 15, 2026

Collaborator

Adds an opt-in mechanism to route a configurable percentage of organizations onto the compute (MicroVM) backing of their region at trigger time, without changing their stored region settings.

Routing is gated by three global feature flags - computeMigrationEnabled, computeMigrationFreePercentage, computeMigrationPaidPercentage - plus a per-org computeMigrationEnabled override that wins in both directions. A region's compute backing is resolved from a new WorkerInstanceGroup.region column: a container group and its MicroVM group share one geo region, so the migration swaps the resolved worker queue to the backing group's queue. Orgs are bucketed deterministically by id, so ramping a percentage down keeps a strict subset rather than reshuffling, and a region with no compute backing is never touched. Everything is off by default - behaviour is unchanged unless the flags are set.
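The enrollment rule described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the function bodies, the `PlanType` union, the bucket count of 100, and the choice to check the per-org override before the global switch are all assumptions made for the example. Because an org's bucket never changes, lowering a percentage only removes orgs from the enrolled set; it never adds or reshuffles them.

```typescript
// Sketch only: flag names come from the PR description, everything else is assumed.
type PlanType = "free" | "paid";

interface MigrationFlags {
  computeMigrationEnabled: boolean; // global switch
  computeMigrationFreePercentage: number; // assumed range 0-100
  computeMigrationPaidPercentage: number; // assumed range 0-100
}

// 32-bit FNV-1a over the id's UTF-16 code units (equivalent to bytes for
// ASCII ids), reduced to a bucket in [0, 100).
function hashBucket(id: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < id.length; i++) {
    hash ^= id.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV prime, 32-bit multiply
  }
  return (hash >>> 0) % 100;
}

function isOrgMigrated(
  orgId: string,
  planType: PlanType,
  flags: MigrationFlags,
  orgOverride?: boolean // per-org override, undefined when not set
): boolean {
  // Assumption: the override is consulted first so it can force either outcome.
  if (orgOverride !== undefined) return orgOverride;
  if (!flags.computeMigrationEnabled) return false;
  const percentage =
    planType === "free"
      ? flags.computeMigrationFreePercentage
      : flags.computeMigrationPaidPercentage;
  // An org is enrolled when its fixed bucket falls below the current percentage.
  return hashBucket(orgId) < percentage;
}
```

With this shape, every org enrolled at 10% is also enrolled at 50%, which is the strict-subset property the description relies on.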

The flags and the worker-region groups are read on the trigger hot path from in-memory snapshots rather than the database: a small createReloadingRegistry helper loads each at startup and refreshes them on an interval, so no per-trigger query is added and a percentage or kill-switch change propagates within the reload interval. A cold replica whose snapshot hasn't loaded yet reads as not-migrated (the container path) and self-corrects on the next load - the same cold-start contract as the datastore / LLM-pricing registries, with a reloading_registry_loaded metric so a never-loaded registry is alertable.
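As a rough illustration of the snapshot pattern, the sketch below loads once at startup, refreshes on an interval, and treats a cold read as not-migrated. Only the name `createReloadingRegistry` comes from the PR; the option names, the returned methods, and the `isMigrationEnabled` helper are hypothetical, and retries, metrics, and reload ordering are left out.

```typescript
// Sketch only: the real helper's signature is not shown in this PR page.
interface ReloadingRegistry<T> {
  get(): T | undefined; // undefined until the first successful load
  start(): Promise<void>;
  stop(): void;
}

function createReloadingRegistry<T>(opts: {
  load: () => Promise<T>; // e.g. a database query, run off the hot path
  intervalMs: number;
}): ReloadingRegistry<T> {
  let snapshot: T | undefined;
  let timer: ReturnType<typeof setInterval> | undefined;

  async function reload(): Promise<void> {
    try {
      snapshot = await opts.load();
    } catch {
      // Keep serving the previous snapshot (or stay cold) if a load fails.
    }
  }

  return {
    get: () => snapshot,
    async start() {
      await reload();
      timer = setInterval(() => void reload(), opts.intervalMs);
    },
    stop() {
      if (timer !== undefined) clearInterval(timer);
      timer = undefined;
    },
  };
}

// Hot-path read: purely in-memory, and a cold registry reads as "not migrated".
function isMigrationEnabled(
  registry: ReloadingRegistry<{ computeMigrationEnabled: boolean }>
): boolean {
  return registry.get()?.computeMigrationEnabled ?? false;
}
```

The trigger path only ever calls `get()`, so a flag change becomes visible after at most one reload interval without adding a query per trigger.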

The same migration decision is consulted at deploy-time template creation so a migrated org gets a compute template built ahead of its first run. This runs in shadow mode (best-effort, never fails the deploy) by default, or - when the computeMigrationRequireTemplate flag is on - in required mode, built synchronously at deploy so the first run never builds on-demand and template errors surface at deploy time.
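The difference between the two modes comes down to what happens when the template build throws. The sketch below is hypothetical: `templateModeFor`, `ensureComputeTemplate`, and the `onShadowError` callback are invented for illustration, and for simplicity it awaits the build in both modes, although the description suggests a shadow build need not block the deploy.

```typescript
// Sketch only: helper names and signatures are assumptions, not the PR's API.
type TemplateMode = "shadow" | "required";

function templateModeFor(flags: {
  computeMigrationRequireTemplate: boolean;
}): TemplateMode {
  return flags.computeMigrationRequireTemplate ? "required" : "shadow";
}

async function ensureComputeTemplate(
  mode: TemplateMode,
  buildTemplate: () => Promise<void>,
  onShadowError: (error: unknown) => void = () => {}
): Promise<void> {
  if (mode === "required") {
    // Required: a build failure propagates and fails the deploy.
    await buildTemplate();
    return;
  }
  try {
    // Shadow: best-effort, so errors are reported but never thrown.
    await buildTemplate();
  } catch (error) {
    onShadowError(error);
  }
}
```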

Operators keep visibility into which runs ran where, while customers see only geography: the run's actual worker queue is stored raw, and the geo region is stamped separately on TaskRun.region (and a new ClickHouse region column) at trigger time. Read surfaces (the dashboard, the API, and the Query/Logs page) show the geo region, falling back to the worker queue for runs written before the column existed.
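The read-side fallback can be pictured as a one-line helper. The `RunRegionFields` shape and the `displayRegion` name are assumptions for this sketch; only the `region` and `workerQueue` fields are taken from the description.

```typescript
// Sketch only: a hypothetical helper for the customer-facing region value.
interface RunRegionFields {
  workerQueue: string; // raw queue the run actually used
  region?: string | null; // geo region stamped at trigger time, absent on older runs
}

function displayRegion(run: RunRegionFields): string {
  // Prefer the stamped geo region; older runs fall back to their worker queue.
  return run.region ?? run.workerQueue;
}
```

For a migrated run stored with `workerQueue: "us-east-1-next"` and `region: "us-east-1"`, the helper reports `us-east-1`, so the backing queue name never reaches customer surfaces.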

Minor follow-ups left out of scope: the percentage flags render as text inputs on the admin flags page (the catalog UI has no numeric control type yet), and createReloadingRegistry could later gain pub/sub for sub-second cross-replica propagation if the reload interval proves too slow.

@nicktrn nicktrn self-assigned this Jun 15, 2026
@changeset-bot

changeset-bot Bot commented Jun 15, 2026


⚠️ No Changeset found

Latest commit: b0472c5

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types


@coderabbitai

coderabbitai Bot commented Jun 15, 2026

Contributor


Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

This PR introduces a plan-aware compute migration system that routes organizations onto compute backing at task trigger time. It adds a generic createReloadingRegistry utility with Prometheus metrics, p-retry startup loading, and periodic refresh. A new workerRegionRegistry loads WorkerGroupRegionRow data from the database and exposes regionForQueue and backingForQueue helpers; the WorkerInstanceGroup table gains a nullable region TEXT column via migration. Three feature flags (computeMigrationEnabled, computeMigrationFreePercentage, computeMigrationPaidPercentage) and two new environment variables (GLOBAL_FLAGS_RELOAD_INTERVAL_MS, GLOBAL_FLAGS_READY_TIMEOUT_MS) are added. A globalFlagsRegistry singleton caches global flags from the database. An FNV-1a hashBucket function and isOrgMigrated/resolveComputeMigration functions implement the enrollment decision and queue rewrite logic. TaskRun gains a region column persisted by RunEngine.trigger. The triggerTask and computeTemplateCreation services are updated to evaluate migration at routing time and rewrite worker queues to compute backing when enrolled. Region derivation across presenters, routes, and the ClickHouse replication service is updated to use explicit region field when present. ClickHouse task_runs_v2 table gains a region column for analytics.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 53.85% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The PR title 'feat(webapp): plan-aware compute migration' is specific, concise, and accurately describes the main feature being added: a plan-aware mechanism for compute migration controlled by feature flags and percentage bucketing. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description check | ✅ Passed | The PR description is comprehensive and detailed, covering the feature purpose, implementation approach, key decisions, and refinements made during review. |


@devin-ai-integration devin-ai-integration Bot left a comment

Contributor


✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no bugs or issues to report.


coderabbitai[bot]

This comment was marked as resolved.

@nicktrn

nicktrn commented Jun 15, 2026

Collaborator Author

Addressed the review feedback, plus a few issues a deeper review pass turned up:

  • Replay of a migrated run would have silently produced no run: the stored backing queue (us-east-1-next) was read back as an explicit region override and rejected by the compute-access gate. Replay now reverse-maps the stored backing to its geo region and re-resolves, so migration re-applies with current flags (and an org that's since been excluded replays onto the container path).
  • Backing hidden on customer surfaces: a regionForBacking inverse of COMPUTE_BACKING_MAP is applied at the run API, run list, run detail, replay, and the ClickHouse worker_queue write, so the API / dashboard / Query feature all report the geo region. The raw backing stays on TaskRun.workerQueue in Postgres for internal use - no schema change.
  • Registry: reloads are now sequence-guarded so a slow older reload can't overwrite a newer snapshot (the kill switch can't silently revert), and waitUntilReady clears its timeout instead of leaking one per cold-start trigger.
  • Kill switch uses strict z.boolean() (coercion turned the string "false" into true); the reload interval is now bounded.
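The sequence guard mentioned in the registry bullet can be sketched as a small holder that only accepts the result of the most recently started load. The class and method names here are hypothetical, and the real registry may track ordering differently.

```typescript
// Sketch only: illustrates the guard, not the PR's actual registry internals.
class SequencedSnapshot<T> {
  private current: T | undefined;
  private latestSeq = 0; // sequence number of the most recently started load

  // Call when a load starts; the returned number identifies that load.
  beginLoad(): number {
    return ++this.latestSeq;
  }

  // Call when a load finishes; results from superseded loads are discarded.
  commit(seq: number, value: T): boolean {
    if (seq !== this.latestSeq) return false;
    this.current = value;
    return true;
  }

  get(): T | undefined {
    return this.current;
  }
}
```

If a slow load that started first (and read `computeMigrationEnabled: true`) finishes after a faster load that started later (and read `false`), the late result is rejected and the kill switch stays off.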

Operational notes for rollout:

  • Billing should key off machine preset / actual execution, not hasComputeAccess - migrated orgs run on the backing without that flag.
  • The compute backing needs its own :scheduled consumer for scheduled runs.
  • The deprecated V3 batch path doesn't percentage-enroll (it passes skipChecks without a plan type); per-org overrides still apply there.

@nicktrn nicktrn force-pushed the feat/compute-migration branch from 3cf484d to 697de03 Compare June 15, 2026 20:05
@nicktrn

nicktrn commented Jun 15, 2026

Collaborator Author

Follow-up: replaced the COMPUTE_BACKING_MAP env var with a region column on WorkerInstanceGroup, so region<->backing resolution comes from data instead of editable config (removes the "edit a config blob and silently break reverse-mapping for historical runs" footgun).

  • New nullable WorkerInstanceGroup.region (migration ..._add_worker_instance_group_region). Container and compute groups for one geo share the value - e.g. both us-east-1 and us-east-1-next get region = "us-east-1".
  • A workerRegionRegistry (same createReloadingRegistry pattern, in-memory snapshot) serves both directions off the hot path: forward (region -> its MICROVM backing) at trigger, reverse (a stored queue -> its geo region) at the presenters / replay / ClickHouse write.
  • COMPUTE_BACKING_MAP and computeBackingMap.server.ts deleted.
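A minimal sketch of the two lookups, built from an in-memory list of worker group rows. The names `WorkerGroupRegionRow`, `regionForQueue`, and `backingForQueue` appear in the review walkthrough; the row fields, the `buildRegionLookup` function, and the `eu-central-1` group are assumptions for illustration.

```typescript
// Sketch only: row shape and builder are assumed, not taken from the PR diff.
type WorkloadType = "CONTAINER" | "MICROVM";

interface WorkerGroupRegionRow {
  queue: string; // the group's worker queue name
  region: string | null; // shared geo region, null when not yet backfilled
  workloadType: WorkloadType;
}

function buildRegionLookup(rows: WorkerGroupRegionRow[]) {
  const regionByQueue = new Map<string, string>();
  const backingByRegion = new Map<string, string>();
  for (const row of rows) {
    if (row.region === null) continue; // unset region: group never migrates
    regionByQueue.set(row.queue, row.region);
    if (row.workloadType === "MICROVM") backingByRegion.set(row.region, row.queue);
  }
  return {
    // Reverse: a stored queue -> its geo region (or itself when unmapped).
    regionForQueue(queue: string): string {
      return regionByQueue.get(queue) ?? queue;
    },
    // Forward: a queue -> the MICROVM backing queue of its region, if any.
    backingForQueue(queue: string): string | undefined {
      const region = regionByQueue.get(queue);
      return region === undefined ? undefined : backingByRegion.get(region);
    },
  };
}

const lookup = buildRegionLookup([
  { queue: "us-east-1", region: "us-east-1", workloadType: "CONTAINER" },
  { queue: "us-east-1-next", region: "us-east-1", workloadType: "MICROVM" },
  { queue: "eu-central-1", region: null, workloadType: "CONTAINER" }, // hypothetical group
]);
```

Here `lookup.backingForQueue("us-east-1")` resolves to `us-east-1-next`, `lookup.regionForQueue("us-east-1-next")` resolves back to `us-east-1`, and the group with no region has no backing and resolves to its own queue.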

Rollout requirement: set region on the live worker groups before enabling migration. It's nullable - unset means that group never migrates and resolves to its own queue (safe no-op). Backfill the container + compute groups of each geo to the same region value.

Treat region as set-once while a group has run history: changing it re-breaks region resolution for existing runs. The durability win is that this is now one immutable data field rather than an editable config map.


@nicktrn nicktrn force-pushed the feat/compute-migration branch from a1a460e to b75e18a Compare June 15, 2026 22:07
@pkg-pr-new

pkg-pr-new Bot commented Jun 15, 2026



@trigger.dev/build

npm i https://pkg.pr.new/@trigger.dev/build@1930f26

trigger.dev

npm i https://pkg.pr.new/trigger.dev@1930f26

@trigger.dev/core

npm i https://pkg.pr.new/@trigger.dev/core@1930f26

@trigger.dev/python

npm i https://pkg.pr.new/@trigger.dev/python@1930f26

@trigger.dev/react-hooks

npm i https://pkg.pr.new/@trigger.dev/react-hooks@1930f26

@trigger.dev/redis-worker

npm i https://pkg.pr.new/@trigger.dev/redis-worker@1930f26

@trigger.dev/rsc

npm i https://pkg.pr.new/@trigger.dev/rsc@1930f26

@trigger.dev/schema-to-json

npm i https://pkg.pr.new/@trigger.dev/schema-to-json@1930f26

@trigger.dev/sdk

npm i https://pkg.pr.new/@trigger.dev/sdk@1930f26

commit: 1930f26


nicktrn added 11 commits June 16, 2026 16:04
…ting

Drop the per-trigger readiness gate (waitUntilReady) and the /healthcheck
dependency; a cold registry read falls back to not-migrated, matching the
datastore/llm-pricing registries. Add computeMigrationRequireTemplate flag
(migrated orgs build the compute template in required mode at deploy when on,
else shadow) and a reloading_registry_loaded gauge so a never-loaded registry
is alertable. Drop dead GLOBAL_FLAGS_READY_TIMEOUT_MS.
@nicktrn nicktrn force-pushed the feat/compute-migration branch from 188dac2 to 1930f26 Compare June 16, 2026 15:05

Prevents the startup retry from racing the reload interval: an interval tick
firing mid-startup-load would bump the load sequence and discard the startup
result, leaving the registry cold until a later reload. Start polling only
after the first successful load.
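
The ordering this commit describes can be sketched as below. The function name, its parameters, and the fixed attempt count are hypothetical (the walkthrough mentions p-retry for startup loading, which this sketch replaces with a plain loop without backoff), and what the real helper does when every startup attempt fails is not stated here.

```typescript
// Sketch only: a plain retry loop standing in for the startup load.
async function startAfterFirstLoad(
  loadOnce: () => Promise<void>, // performs one load and stores the snapshot
  startPolling: () => void, // installs the reload interval
  maxAttempts = 5
): Promise<boolean> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await loadOnce();
      // Polling begins only here, so no interval tick can supersede
      // an in-flight startup load.
      startPolling();
      return true;
    } catch {
      // Startup load failed; try again without starting the interval.
    }
  }
  return false; // still cold; the caller decides what to do next
}
```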