Durable Agent Engine

🔴 Live demo: durable-agent-engine-b034212e-e938-4496-b3f0-50045a25b6f9.fly.dev — submit a run, watch it execute over WebSocket in real time.

A fault-tolerant orchestration engine for multi-step LLM agent workflows — built to survive worker crashes, network failures, and duplicate execution without losing or double-running a single step.

No Redis. No BullMQ. No managed queue service. The durable queue is built directly on Postgres using SELECT ... FOR UPDATE SKIP LOCKED, the same primitive production systems at companies like GitHub use for job queues — because reaching for a queue library hides the part that's actually hard: atomicity, lock ownership, and exactly-once semantics under concurrent workers.
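
As a rough illustration of that primitive, a claim can be written as a single UPDATE whose subquery uses `FOR UPDATE SKIP LOCKED`. This is a sketch rather than the project's actual query: the `steps` column names (`status`, `run_at`, `locked_by`, `locked_until`, `attempts`) and the `claimNextStep` helper are assumptions made for illustration, and `client` is any object with a pg-style `query(text, params)` method.

```javascript
// Hypothetical claim query; table and column names are assumptions, not the project's schema.
const CLAIM_SQL = `
  UPDATE steps
     SET status = 'running',
         locked_by = $1,
         locked_until = now() + make_interval(secs => $2),
         attempts = attempts + 1
   WHERE id = (
     SELECT id
       FROM steps
      WHERE status = 'ready' AND run_at <= now()
      ORDER BY run_at
      LIMIT 1
      FOR UPDATE SKIP LOCKED   -- rows locked by another worker are skipped, not waited on
   )
  RETURNING *`;

// Claims at most one ready step for this worker.
// Resolves to the claimed row, or null when nothing is claimable right now.
// A single statement runs in its own implicit transaction, so lock and update happen together.
async function claimNextStep(client, workerId, lockTtlSeconds = 30) {
  const { rows } = await client.query(CLAIM_SQL, [workerId, lockTtlSeconds]);
  return rows[0] ?? null;
}
```

Two workers running this concurrently never block each other: the second one skips the row the first has locked and either claims a different step or gets `null`.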

Why this exists

Every LLM "agent" framework demo shows the happy path. None of them show what happens when:

- a worker process gets SIGKILLed mid-step
- two workers race to pick up the same step
- a tool call fails transiently and needs a retry, but not a re-run of the steps before it
- you need an audit trail of exactly what happened, and when, for compliance

This project answers those questions with working code, not slides.

Architecture

POST /runs (DAG of steps) ─▶ Postgres (runs, steps, audit_log)
                                   │
                    ┌──────────────┼──────────────┐
                    ▼              ▼              ▼
              worker-1       worker-2       worker-3
           (SKIP LOCKED claim — exactly one wins per step)
                    │
                    ▼
        step succeeds → dependents promoted to 'ready'
        step fails    → exponential backoff retry → dead_letter after max_attempts
        worker dies   → lock TTL expires → step reclaimed automatically
                    │
                    ▼
        WebSocket broadcast ──▶ live dashboard / CLI watcher
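
The "dependents promoted to 'ready'" branch of the diagram can be sketched in memory as follows. The step shape (`{ name, status, dependsOn }`) and the `pending` and `succeeded` status names are assumptions for illustration; the real engine presumably does the equivalent with SQL against the `steps` table.

```javascript
// In-memory sketch of dependent promotion; the step shape and status names are assumptions.
function promoteDependents(steps) {
  const succeeded = new Set(
    steps.filter((s) => s.status === 'succeeded').map((s) => s.name)
  );
  const promoted = [];
  for (const step of steps) {
    const deps = step.dependsOn ?? [];
    // Only steps still waiting are candidates, and every dependency must have succeeded.
    if (step.status === 'pending' && deps.every((d) => succeeded.has(d))) {
      step.status = 'ready';
      promoted.push(step.name);
    }
  }
  return promoted; // names of the steps that just became claimable
}
```

Calling this after each successful step keeps a step with two dependencies in `pending` until the second one finishes.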

Core guarantees

| Guarantee | Mechanism |
| --- | --- |
| Exactly-one claim per step | `FOR UPDATE SKIP LOCKED` inside a transaction |
| No lost work on worker crash | Lock TTL (30s) + reaper requeues orphaned `running` steps |
| No duplicate step creation | Idempotency key = `${runId}:${stepName}`, upserted via `ON CONFLICT` |
| Bounded retries | Exponential backoff (1s → 60s cap), dead-letter after `max_attempts` |
| Full auditability | Every state transition written to `audit_log` |
| DAG correctness | Dependent steps only promoted to `ready` once all `depends_on` succeed |
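
The bounded-retry row can be made concrete with a small sketch. The doubling factor and the function names are assumptions; the source only states a 1s starting delay, a 60s cap, and dead-lettering after `max_attempts`.

```javascript
// Sketch of the retry policy: 1s base delay, doubling per attempt, capped at 60s.
// The doubling factor is an assumption; the README only states "1s → 60s cap".
const BASE_DELAY_MS = 1_000;
const MAX_DELAY_MS = 60_000;

// attempt is 1-based: the first failure waits 1s, the second 2s, the third 4s, and so on.
function backoffDelayMs(attempt) {
  return Math.min(MAX_DELAY_MS, BASE_DELAY_MS * 2 ** (attempt - 1));
}

// Decides what happens to a step after a failed attempt.
function nextStateAfterFailure(attempts, maxAttempts, now = Date.now()) {
  if (attempts >= maxAttempts) {
    return { status: 'dead_letter', runAt: null }; // out of retries
  }
  return { status: 'ready', runAt: new Date(now + backoffDelayMs(attempts)) };
}
```

With these numbers the cap is reached on the seventh attempt, since 1s × 2^6 = 64s exceeds 60s.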

Quickstart

docker compose up -d        # local Postgres
cp .env.example .env
npm install
npm run migrate
npm start                   # API + WebSocket server on :4000
npm run worker              # run in 2-3 separate terminals to see SKIP LOCKED in action

Submit a run:

curl -X POST localhost:4000/runs -H 'Content-Type: application/json' -d '{
  "goal": "Research and summarize a topic",
  "steps": [
    { "name": "tool_call.fetch_data", "input": { "tool": "fetch_data" } },
    { "name": "tool_call.transform", "input": { "tool": "transform" }, "dependsOn": ["tool_call.fetch_data"] },
    { "name": "llm_call.summarize", "input": { "prompt": "Summarize the data" }, "dependsOn": ["tool_call.transform"] }
  ]
}'
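
On the server side, a body like the one above has to become step rows without creating duplicates if the insert is retried. The sketch below shows one way to do that with the `${runId}:${stepName}` idempotency key; the SQL column names and the `buildStepRows` helper are assumptions, not the project's actual code.

```javascript
// Hypothetical insert statement; column names are assumptions.
// ON CONFLICT needs a unique index on idempotency_key to take effect.
const INSERT_STEP_SQL = `
  INSERT INTO steps (run_id, name, input, depends_on, status, idempotency_key)
  VALUES ($1, $2, $3, $4, $5, $6)
  ON CONFLICT (idempotency_key) DO NOTHING`;

// Turns the "steps" array of a POST /runs body into parameter lists for INSERT_STEP_SQL.
function buildStepRows(runId, steps) {
  const names = new Set(steps.map((s) => s.name));
  return steps.map((step) => {
    const dependsOn = step.dependsOn ?? [];
    for (const dep of dependsOn) {
      if (!names.has(dep)) {
        throw new Error(`Unknown dependency "${dep}" in step "${step.name}"`);
      }
    }
    return [
      runId,
      step.name,
      JSON.stringify(step.input ?? {}),
      dependsOn,
      dependsOn.length === 0 ? 'ready' : 'pending', // root steps can run immediately
      `${runId}:${step.name}`,                       // idempotency key
    ];
  });
}
```

Each returned array would be passed as the parameters of one `client.query(INSERT_STEP_SQL, row)` call, ideally all inside one transaction so a run is never half-created.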

Watch it live: wscat -c ws://localhost:4000/ws/<runId>

Chaos test — prove it, don't claim it

npm run worker   # start 3 of these in separate terminals
npm run chaos    # submits a run with simulated 40% transient failures, then kill -9 a worker mid-run

Expected result: the run still reaches `completed`. Steps orphaned by the killed worker are reclaimed by the lock reaper, failed steps retry with backoff, and the audit log records every transition.
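
The reclaim behaviour relies on a reaper that requeues `running` steps whose lock has expired. A minimal sketch is shown here, assuming hypothetical `locked_by` and `locked_until` columns and a 5-second polling interval; neither is confirmed by the source, and `client` is any object with a pg-style `query` method.

```javascript
// Hypothetical reaper query; table and column names are assumptions.
const REAP_SQL = `
  UPDATE steps
     SET status = 'ready', locked_by = NULL, locked_until = NULL
   WHERE status = 'running' AND locked_until < now()
  RETURNING id`;

// Requeues steps whose lock expired, for example because their worker was killed.
async function reapExpiredLocks(client) {
  const { rows } = await client.query(REAP_SQL);
  return rows.map((r) => r.id);
}

// Runs the reaper on an interval and returns a function that stops it.
function startReaper(client, intervalMs = 5_000, onError = console.error) {
  const timer = setInterval(() => {
    reapExpiredLocks(client).catch(onError);
  }, intervalMs);
  return () => clearInterval(timer);
}
```

A step reclaimed this way runs again from the start, so step handlers should be safe to re-execute.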

Tech

Node.js (ESM, no framework magic), Express, pg, ws. Cross-process events (worker → API server) use Postgres LISTEN/NOTIFY rather than an in-memory bus, because workers and the API run as separate processes. The API server holds a single LISTEN connection and fans each notification out to all connected WebSocket clients.
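
The fan-out can be sketched as a single `notification` handler on the listening connection. The channel name `step_events`, the JSON payload shape with a `runId` field, and the `socketsByRun` map are assumptions for illustration; `pgClient` stands for a dedicated pg `Client`, and the sockets are `ws`-style objects with `readyState` and `send()`.

```javascript
const WS_OPEN = 1; // numeric value of WebSocket.OPEN

// socketsByRun: Map<runId, Set<socket>>, maintained by the WebSocket server (assumed shape).
// channel must be a trusted, hard-coded identifier because LISTEN cannot take bind parameters.
async function startFanOut(pgClient, socketsByRun, channel = 'step_events') {
  pgClient.on('notification', (msg) => {
    if (msg.channel !== channel || !msg.payload) return;
    let event;
    try {
      event = JSON.parse(msg.payload);
    } catch {
      return; // ignore malformed payloads
    }
    // Forward the raw payload to every open socket watching this run.
    for (const socket of socketsByRun.get(event.runId) ?? []) {
      if (socket.readyState === WS_OPEN) socket.send(msg.payload);
    }
  });
  await pgClient.query(`LISTEN ${channel}`);
}

// Worker side: publish an event through Postgres so the API process sees it.
function notifyStepEvent(client, event, channel = 'step_events') {
  return client.query('SELECT pg_notify($1, $2)', [channel, JSON.stringify(event)]);
}
```

NOTIFY payloads are limited to just under 8000 bytes by default, so events in this style should carry identifiers and status rather than full step output.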

Deployment

Deployed as a single self-contained container (Node app + Postgres + supervisord) to InsForge compute, which provisions and runs it on Fly.io infrastructure under the hood — no separate cloud account needed. See Dockerfile and deploy/ for the supervisord-managed process layout (Postgres, migration, API server, two workers, all in one image for demo purposes).

What this demonstrates

Built as a deep dive into the kind of infrastructure problem backend teams at LLM companies solve daily: reliable multi-step agent execution, distributed locking, idempotency, and observability — not just wiring an SDK to a UI.

License

MIT
