
feat(schedule-engine): stop persisting per-tick schedule state#3476

Merged
ericallam merged 8 commits into main from
feature/tri-8891-stop-persisting-per-tick-schedule-engine-state-on
May 1, 2026

Conversation

@ericallam
Member

Summary

Each scheduled-task tick previously issued 3 Prisma UPDATEs against
TaskSchedule.lastRunTriggeredAt, TaskScheduleInstance.lastScheduledTimestamp,
and TaskScheduleInstance.nextScheduledTimestamp. All three were pure
denormalization — every value can be derived without persisting.

After this PR TaskSchedule and TaskScheduleInstance become near read-only:
writes happen only on schedule create / update / delete (rare admin actions),
so the per-tick autovacuum churn on these hot tables disappears.

Design

The previous fire time travels forward through the schedule worker payload,
not through the database. Concretely:

  • The schedule.triggerScheduledTask worker payload gains an optional
    lastScheduleTime: z.coerce.date().optional() field.
  • When the engine fires a schedule, it re-enqueues the next tick with
    lastScheduleTime = scheduleTimestamp (the just-fired time).
  • When the next tick dequeues, payload.lastTimestamp is sourced from
    params.lastScheduleTime directly. No DB round-trip, no cron-derivation
    drift across DST boundaries, no caveats around recently-edited cron
    expressions.
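The coercion that `z.coerce.date().optional()` performs on the new payload field can be sketched as follows. This is a hand-rolled, dependency-free equivalent for illustration only (the real code uses Zod); `parseLastScheduleTime` is a hypothetical name.

```typescript
// Accepts a Date, an ISO string, or a millisecond timestamp coming off the
// serialized Redis payload; undefined/null stays undefined (optional field).
function parseLastScheduleTime(raw: unknown): Date | undefined {
  if (raw === undefined || raw === null) return undefined;
  const d = raw instanceof Date ? raw : new Date(raw as string | number);
  if (Number.isNaN(d.getTime())) throw new Error("Invalid lastScheduleTime");
  return d;
}
```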

payload.lastTimestamp keeps its Date | undefined SDK shape. First-ever
fires still report undefined, so customers' if (!payload.lastTimestamp)
first-run checks keep working.

For Redis jobs that were enqueued before this change (which lack
lastScheduleTime in their payload), the engine falls back to
instance.lastScheduledTimestamp once. Once those drain, the column is
never read again. Revert is code-only; the columns stay in place and can
be dropped in a follow-up once the rollout is stable.
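The fallback chain described above can be sketched as a single expression. The type shapes are illustrative stubs and the function name is hypothetical; only the field names come from the PR:

```typescript
type TriggerScheduleParams = { lastScheduleTime?: Date };
type TaskScheduleInstance = { lastScheduledTimestamp: Date | null };

// Resolve the previous fire time: new worker-payload path first, then the
// legacy DB column (for Redis jobs enqueued before this change), else
// undefined (first-ever fire).
function resolveLastTimestamp(
  params: TriggerScheduleParams,
  instance: TaskScheduleInstance
): Date | undefined {
  return params.lastScheduleTime ?? instance.lastScheduledTimestamp ?? undefined;
}
```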

Files

  • internal-packages/schedule-engine/* — engine refactor, workerCatalog
    schema field, TriggerScheduleParams extension, tests updated to assert
    on the worker-payload flow rather than DB readbacks.
  • internal-packages/database/prisma/schema.prisma — /// @deprecated
    triple-slash docstrings on the three columns. No migration.
  • apps/webapp/app/presenters/v3/ScheduleListPresenter.server.ts — drops
    the lastRunTriggeredAt Prisma select; "Last run" cell is approximated
    from the cron expression's previous slot, gated on schedule.createdAt
    so brand-new schedules show "–". UI is best-effort; the runs page is the
    source of truth.
  • apps/webapp/app/v3/utils/calculateNextSchedule.server.ts — adds a
    previousScheduledTimestamp helper for the UI cell above. Public API
    responses (api.v1.schedules.*) already compute nextRun from cron and
    don't expose lastTimestamp — no public API change.
  • references/scheduled-tasks/ — new reference project with declarative
    schedules at multiple cadences and three throw-on-fail validators
    (first-fire-detector, interval-validator, upcoming-validator) for
    E2E-verifying the worker-payload flow.

Refs TRI-8891

Test plan

  • pnpm run typecheck --filter @internal/schedule-engine --filter webapp
  • pnpm run build --filter @trigger.dev/core
  • pnpm run test --filter @internal/schedule-engine — integration test
    asserts first-fire lastTimestamp === undefined, second fire carries
    the previous fire's timestamp exactly.
  • E2E against local webapp via references/scheduled-tasks:
    • Fresh schedules attached → all three deprecated columns stay NULL after
      multiple fires.
    • Redis payload at second fire contains
      "lastScheduleTime":"<previous fire timestamp>".
    • TaskRun.payload and the every-minute task's returned output both confirm
      lastTimestamp = null on first fire and lastTimestamp = <prev fire> on
      second fire, exactly 60s apart.
    • All three throw-on-FAIL validators completed successfully on every
      non-first fire.
  • Schedules REST API end-to-end (POST / GET / PUT / activate /
    deactivate / DELETE) — nextRun recomputed live from cron + tz on
    every response, no reads of deprecated columns.

@changeset-bot

changeset-bot Bot commented Apr 30, 2026

⚠️ No Changeset found

Latest commit: 3910781

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.


@coderabbitai
Contributor

coderabbitai Bot commented Apr 30, 2026

Walkthrough

The PR stops writing per-tick schedule timestamps to the DB (deprecating taskSchedule.lastRunTriggeredAt, taskScheduleInstance.lastScheduledTimestamp, taskScheduleInstance.nextScheduledTimestamp). The schedule engine now threads an optional lastScheduleTime through job payloads, adds a fromTimestamp anchor for next-time calculations, and uses a new previousScheduledTimestamp helper for recovery. Presenter logic computes lastRun at render time from the cron expression. Types and the worker Zod schema accept lastScheduleTime. Tests and server docs were updated; Prisma fields were annotated as deprecated.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 warning

❌ Failed checks (1 warning)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 16.67%, below the
    required threshold of 80.00%. Resolution: write docstrings for the
    functions missing them.

✅ Passed checks (4)
  • Title check: the title clearly and concisely summarizes the main change,
    stopping per-tick persistence of schedule state to the database.
  • Description check: the description is comprehensive and well-structured,
    covering the problem statement, design rationale, implementation details,
    file-by-file changes, and test coverage.
  • Linked Issues check: skipped; no linked issues were found for this pull
    request.
  • Out of Scope Changes check: skipped; no linked issues were found for this
    pull request.



@ericallam ericallam marked this pull request as ready for review April 30, 2026 11:47

ericallam added a commit that referenced this pull request Apr 30, 2026
Address CodeRabbit review on PR #3476:

1. Engine — split fromTimestamp from lastScheduleTime
   `RegisterScheduleInstanceParams.fromTimestamp` was doing double duty as
   both the next-cron-slot anchor and the "previous fire time" embedded in
   the next worker job's payload. On skipped ticks (inactive schedule, dev
   env disconnected, etc.) the engine still re-registered with
   `fromTimestamp = scheduleTimestamp`, so a long pause/disconnect would
   quietly overwrite the real last-fire time with a stream of skipped
   slots — defeating the workerCatalog accuracy benefit for the very case
   it was meant to handle.

   `RegisterScheduleInstanceParams` now has a separate optional
   `lastScheduleTime` field. `fromTimestamp` advances on every tick;
   `lastScheduleTime` only advances on real fires. After a skip the
   re-registration carries forward the existing `params.lastScheduleTime`
   (or the legacy `instance.lastScheduledTimestamp` fallback).

2. ScheduleListPresenter — guard cron back-calculation
   `previousScheduledTimestamp(...)` calls cron-parser, which throws on
   malformed expressions. A single bad row would have failed the whole
   schedules-list response. Wrapped per-row in try/catch so a degraded
   row falls back to `lastRun: undefined` instead of taking down the page.

3. scheduleEngine integration test — anchor on observed values
   The test compared against a precomputed "next minute boundary" Date,
   which is flaky if the test setup straddles a minute boundary. Switched
   to deriving expectations from the first observed `exactScheduleTime`
   and asserting relative invariants (60s gap, second lastTimestamp
   equals first timestamp).
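The relative invariants in point 3 can be sketched with toy data standing in for what the test's onTrigger callback captures (the `Fire` shape and the sample dates are illustrative; `exactScheduleTime` and `lastTimestamp` are the names used in the PR):

```typescript
type Fire = { exactScheduleTime: Date; lastTimestamp?: Date };

// Two consecutive fires of an every-minute schedule, as the callback
// would observe them.
const first: Fire = { exactScheduleTime: new Date("2026-05-01T10:00:00Z") };
const second: Fire = {
  exactScheduleTime: new Date("2026-05-01T10:01:00Z"),
  lastTimestamp: new Date("2026-05-01T10:00:00Z"),
};

// Invariants derived from the first observed fire, not a precomputed
// minute boundary, so the assertions cannot straddle test-setup timing.
const gapMs = second.exactScheduleTime.getTime() - first.exactScheduleTime.getTime();
const firstIsUndefined = first.lastTimestamp === undefined;
const secondMatchesFirst =
  second.lastTimestamp?.getTime() === first.exactScheduleTime.getTime();
```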

ericallam added a commit that referenced this pull request Apr 30, 2026
ScheduleListPresenter was deriving "Last run" via cron's previous slot
even for deactivated schedules, causing the dashboard to show "1 minute
ago" for a schedule that was actually deactivated months ago.

Gate the cron back-calc on `schedule.active`. Inactive schedules now
show "–" in the Last run cell, matching the semantic that they aren't
firing.

Per Devin review on PR #3476.

ericallam added a commit that referenced this pull request Apr 30, 2026
Recovery (Redis job missing for an instance) was re-registering with
no lastScheduleTime, so the post-recovery first fire fell through the
fallback chain to instance.lastScheduledTimestamp — which is frozen at
the value last written before this PR stopped writing the column. For
schedules that fired heavily after deploy then suffered a Redis loss,
that meant payload.lastTimestamp would report a stale pre-deploy
timestamp on the first fire post-recovery.

Compute lastScheduleTime as the cron expression's previous slot (pure
cron math, no DB read — recovery fan-outs must not add load to hot
tables). Guarded against the cron-prev predating the instance's
createdAt and against cron-parser throwing on malformed expressions.
For continuously-running schedules this equals the actual last fire
time; for long-paused or recently-edited schedules it's the same
approximation the dashboard "Last run" cell accepts.

Per Devin and CodeRabbit review on PR #3476.
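The recovery-path guards described above can be sketched as follows. The cron math itself (cron-parser's previous slot) is abstracted as an input so the sketch stays dependency-free; only the two guards from the commit message are shown, and the function name is illustrative:

```typescript
// cronPrev is the cron expression's previous slot, or undefined if
// cron-parser threw on a malformed expression.
function recoveryLastScheduleTime(
  cronPrev: Date | undefined,
  instance: { createdAt: Date }
): Date | undefined {
  if (!cronPrev) return undefined;                     // malformed cron: degrade quietly
  if (cronPrev < instance.createdAt) return undefined; // slot predates the instance
  return cronPrev;                                     // continuously-running: actual last fire
}
```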
ericallam added a commit that referenced this pull request Apr 30, 2026
ScheduleListPresenter was guarding the cron-derived lastRun against
schedule.createdAt, which only proves the row exists. If a schedule
was edited (cron changed, timezone changed) or deactivated and then
reactivated, the cron's previous slot might predate the most recent
config change but still pass the createdAt guard — surfacing a
"Last run" the schedule never actually fired at under the current
configuration.

Switch the guard to schedule.updatedAt. Prisma's @updatedAt bumps on
every row update (cron change, timezone change, activate toggle, etc.),
so the cron-derived slot is only shown once it's strictly after the
most recent config change — meaning the current config has been in
effect long enough to have actually fired at that slot.

Per CodeRabbit review on PR #3476.
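Combining this updatedAt gate with the earlier active-flag gate, the "Last run" cell's guard logic can be sketched as below. The cron back-calculation is abstracted as an input, and `displayLastRun` is a hypothetical name:

```typescript
// cronPrev: the cron expression's previous slot, or undefined if
// cron-parser threw on a malformed expression.
function displayLastRun(
  cronPrev: Date | undefined,
  schedule: { active: boolean; updatedAt: Date }
): Date | undefined {
  if (!schedule.active) return undefined;             // deactivated: show "–"
  if (!cronPrev) return undefined;                    // bad cron: degrade, don't fail the page
  if (cronPrev <= schedule.updatedAt) return undefined; // slot not strictly after last config change
  return cronPrev;
}
```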
Audit the schedule engine's logger.info calls and demote anything that
fires per-tick or per-instance to logger.debug. The previous mix would
emit ~3 info lines per fire ("Calculated next schedule timestamp",
"Triggering scheduled task", "Successfully triggered scheduled task")
which scales linearly with schedule volume.

Demoted to debug:
- "Calculated next schedule timestamp" — every tick (re-register after
  every fire)
- "Triggering scheduled task" — every fire
- "Successfully triggered scheduled task" — every fire
- "Recovering schedule" — per-instance in the recovery loop, fan-out
  potential during recovery storms
- "Job already exists for instance" — per-instance recovery
- "No job found for instance, registering next run" — per-instance
  recovery

Kept at info (lifecycle / per-event, fires once):
- Worker startup / disabled / shutdown
- "Recovering schedules in environment" (per recovery call, not per
  instance)
- "No instances found for environment" (empty recovery summary)
@ericallam ericallam force-pushed the feature/tri-8891-stop-persisting-per-tick-schedule-engine-state-on branch from 8ba6318 to cd5e629 on April 30, 2026 20:03
…mpat

Add an integration test for the path where an in-flight Redis job
enqueued by the old engine (no `lastScheduleTime` in its payload) is
dequeued by the new engine. The new engine must report the value
persisted at `instance.lastScheduledTimestamp` as `payload.lastTimestamp`
rather than reporting `undefined`, so customers don't see a one-fire
gap in their lastTimestamp during the rollout.

The existing integration test exercises the fresh-schedule path and
the worker payload flow on subsequent fires. This new test
specifically exercises the DB-column fallback in the
`params.lastScheduleTime ?? instance.lastScheduledTimestamp ??
undefined` chain — the bridge that handles the legacy queue at
deploy time.

External callers of registerNextTaskScheduleInstance — the deploy-time
declarative schedule sync, schedule upsert (cron change / activate),
and recovery — all called with just `{ instanceId }`, no
lastScheduleTime. That replaced the in-flight Redis job's payload with
one having lastScheduleTime: undefined, so the next fire fell back to
instance.lastScheduledTimestamp (a column this PR stops writing). On
every subsequent app deploy, customers would have seen one stale fire
per schedule, in perpetuity — the staleness compounding as the
unmaintained DB column drifted further from reality.

Move the cron-prev derivation inside registerNextTaskScheduleInstance
itself: when the caller doesn't pass lastScheduleTime, derive from the
cron expression's previous slot (guarded against the slot predating
the instance's createdAt — preserves first-fire `undefined` semantics
for brand-new schedules). Internal callers (after-fire, after-skip)
keep passing explicit values so their carry-forward semantics are
unchanged.

Also drops the duplicate cron-prev block from `#recoverTaskScheduleInstance`
— recovery now relies on the centralized fallback inside register.

Two new tests:
- "should derive lastScheduleTime from cron when external callers omit
  it" — exercises the deploy/upsert pattern on a long-running instance.
- "should leave lastScheduleTime undefined for brand-new schedules" —
  preserves the first-run sentinel (`if (!payload.lastTimestamp)`)
  customers rely on.

Per Devin review on PR #3476.
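The centralized decision inside registerNextTaskScheduleInstance can be sketched as below: explicit values from internal callers win; external callers get the cron-derived previous slot, guarded so brand-new schedules keep the first-fire `undefined` sentinel. The function name and parameter shapes are illustrative:

```typescript
function effectiveLastScheduleTime(
  explicit: Date | undefined,    // passed by internal after-fire / after-skip callers
  cronPrev: Date | undefined,    // derived previous cron slot; undefined if parsing failed
  instanceCreatedAt: Date
): Date | undefined {
  if (explicit) return explicit;                              // carry-forward semantics unchanged
  if (cronPrev && cronPrev >= instanceCreatedAt) return cronPrev; // deploy/upsert pattern
  return undefined;                                           // brand-new: first-run sentinel
}
```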
@ericallam ericallam merged commit ac7177d into main May 1, 2026
41 checks passed
@ericallam ericallam deleted the feature/tri-8891-stop-persisting-per-tick-schedule-engine-state-on branch May 1, 2026 07:22