
Add long-running tasks and machine lifecycle blueprint#2416

Merged
kcmartin merged 6 commits into main from kristin/long-running-tasks-blueprint on Jun 16, 2026

@kcmartin kcmartin commented Jun 15, 2026

Contributor

Summary

New blueprint covering the interaction between auto_stop_machines and long-running in-process work. Documents how the Fly proxy decides to stop a machine, why background tasks are invisible to that decision, and two patterns to keep work from getting killed:

  • Pattern A: disable autostop and manage shutdown in the app via SIGTERM/kill_timeout.
  • Pattern B: split web and worker into separate process groups so autostop only applies to the web tier.

Also covers kill_signal/kill_timeout semantics under autostop and the other stop pathways (deploys, fly machine stop, host migrations), plus a Common Problems section that addresses self-pings (they don't prevent autostop) and workers not stopping on deploy.
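For readers who want a concrete picture of the in-app drain that Pattern A relies on, here is a minimal Python sketch. It assumes the app receives SIGTERM as its kill_signal and that each unit of work finishes well inside kill_timeout. The names `run_worker`, `process_job`, and `shutdown_requested` are hypothetical and do not come from the guide itself.

```python
import signal
import threading
import time

# Set when the platform asks the process to stop.
# Assumption: kill_signal is SIGTERM.
shutdown_requested = threading.Event()


def handle_sigterm(signum, frame):
    """Record the stop request; the work loop checks it between jobs."""
    shutdown_requested.set()


def run_worker(jobs, process_job):
    """Process jobs in order until none remain or a stop is requested.

    Returns the jobs that were never started so the caller can requeue them.
    The job that is in flight when the signal arrives is allowed to finish.
    """
    pending = list(jobs)
    while pending and not shutdown_requested.is_set():
        job = pending.pop(0)
        process_job(job)  # assumed to complete well within kill_timeout
    return pending


if __name__ == "__main__":
    # Signal handlers must be installed from the main thread.
    signal.signal(signal.SIGTERM, handle_sigterm)
    leftover = run_worker(range(10), lambda job: time.sleep(1))
    print(f"exiting cleanly, {len(leftover)} job(s) left to requeue")
```

The handler only sets a flag, so the loop stops at a job boundary rather than mid-job. If a single job can outlast the drain window, the process would still receive SIGKILL once kill_timeout expires, so long jobs would need their own checkpointing.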

Placement

  • New file: blueprints/long-running-tasks.html.md
  • Index entry: added to Background Jobs & Automation in blueprints/index.html.md with a NEW!! tag, next to the work-queues, task-scheduling, and supercronic blueprints.
  • Sidebar: added to the matching Background Jobs & Automation group in partials/_guides_nav.html.erb.

Empirical backing

Every technical claim in the draft is backed by a live deployment test on Fly. Specifically:

  • Pattern A confirmed working: auto_stop_machines = "off" keeps machines up; SIGTERM/kill_timeout graceful drain fires on manual stop with the full drain window observed.
  • Pattern B confirmed working: with split process groups, the proxy stops the web tier while the worker tier stays untouched.
  • Self-ping (an earlier "Pattern C" candidate) tested and does not prevent autostop, even with successful HTTP requests every 60s to the machine's own <app>.fly.dev hostname. The proxy stops the machine within 5 to 10 minutes regardless. This is captured in the Common Problems section as "Why doesn't a self-ping keep my machine alive?"
  • kill_timeout confirmed to be honored under the autostop pathway (full drain window before SIGKILL).

kcmartin added 5 commits June 15, 2026 21:32
New guide covering the interaction between auto_stop_machines and
long-running in-process work: how the proxy decides to stop a machine,
why background tasks are invisible to that decision, and two patterns
(disable autostop with an in-app drain; split web and worker into
separate process groups) to keep work from getting killed. Also covers
kill_signal/kill_timeout semantics under autostop and other stop
pathways.

Adds the entry to the Background Jobs & Automation section of the
blueprints index (with a NEW!! tag) and to the corresponding sidebar
nav group.

Replace em dash separators with colons in the Picking a pattern
table, and replace the em dash placeholder in the kill_signal Max
column with n/a.

Replace prose references to 'blueprint(s)' with 'guide(s)' throughout
the doc. Link paths under /docs/blueprints/ are unchanged.
@kcmartin kcmartin requested review from Roadmaster and injoongy June 15, 2026 22:04

@injoongy injoongy left a comment

Contributor

LGTM 🚀

@kcmartin kcmartin merged commit 054dee7 into main Jun 16, 2026
2 checks passed
@kcmartin kcmartin deleted the kristin/long-running-tasks-blueprint branch June 16, 2026 17:37
2 participants