fix: Allow large image pulls to succeed by decoupling workflow deadline from lock TTL via heartbeat by jjamroga · Pull Request #234 · agent-substrate/substrate

Jonathan Jamroga (jjamroga) · 2026-06-12T17:53:24Z

Fixes #233.

Customer Impact: Prior to #230, large images (specifically our test image) would eventually finish pulling in ~20 min, at the cost of RSS increasing in atelet i.e. worker successfully has actor scheduled to it at the cost of overall system stability. This wasn't an acceptable tradeoff. Orphaned image pulls resulted in atelet consuming almost 10 GiB of memory before being garbage collected.

Now, long running image pulls have a process of reacquiring a lock, preventing it from expiring before the image pull has completed. Introduced a high level, user configurable timeout ensuring that there is a window in which the lock can be lost and be lost in the event of image pull hanging (instead of a similar bug where a process is left hanging despite the caller having abandoned it.)

Summary

Decouples the per-actor workflow deadline from the Redis lock TTL via a heartbeat, so large image pulls can finish without raising the lock TTL (which also bounds peer failover after a crashed ateapi).

Lock TTL (~30s) bounds failover latency — how long a peer ateapi waits to retry an actor if the original instance crashes mid-workflow. Workflow deadline (~5min) bounds legitimate work — how long a single Resume/Suspend/Pause is allowed to run end-to-end before we give up. The lock refresh (heartbeat goroutine, PEXPIREs every TTL/3) is the bridge: it keeps the lock alive across the long workflow window without raising the TTL itself, so we get both fast failover AND room for slow legitimate operations.

Lifecycle RPCs (Resume/Suspend/Pause) all go through acquireActorLock on entry. That call (a) grabs the Redis lock with the short TTL, (b) returns a workflow-bounded ctx via WithTimeout(workflowDeadline), and (c) spawns the heartbeat. While the RPC runs, the lock stays held; if the heartbeat ever fails or returns "you no longer own this lock," it cancels the workflow ctx so steps tear down promptly. On normal return, release stops the heartbeat and frees the lock. Mutual exclusion per actor holds throughout — exactly one workflow per actor at any instant, just like before, but no longer pinned to a 28-second ceiling.

Why a server-side deadline at all (not just heartbeat)

Considered keeping only the heartbeat and letting the workflow ctx inherit from the caller. Rejected on defense-in-depth grounds:

Substrate does not configure gRPC keepalive, so a network partition or a hung-but-not-killed caller would leave the workflow running indefinitely on ateapi.
A workflow step that forgets to propagate ctx would have no upstream timeout to fall back on.

The 5-minute server-side cap covers both cases at the cost of one config knob. Operators can raise it for slow registries; the lock TTL is independent so peer failover stays bounded at ≤30s regardless.

Test plan

Automated:

go test ./cmd/ateapi/internal/store/ateredis/... -run 'TestRefreshLock|TestAcquireLock' — 7 lock tests pass (3 new + 4 existing).
go test ./cmd/ateapi/internal/controlapi/ -run TestAcquireActorLock — 4 new heartbeat tests pass:
- Heartbeat keeps lock alive past TTL.
- Lost lock cancels workflow ctx with errLostActorLock cause.
- Release callback removes the lock.
- Second concurrent acquire returns codes.Aborted.
go test ./... -count=1 — full repo suite green (419 tests / 65 packages).
go vet ./... — clean.

Manual:

# Setup cluster
./hack/create-kind-cluster.sh && ./hack/install-ate-kind.sh --deploy-ate-system
export KO_DOCKER_REPO=localhost:5001
export KO_DEFAULTPLATFORMS=linux/$(go env GOARCH)
./hack/run-tool.sh ko build -B ./cmd/ateom-gvisor

# Install metrics server to keep tabs on memory
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/high-availability-1.21+.yaml
# Tweak values for cluster
kubectl patch -n kube-system deploy metrics-server --type=json -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--kubelet-insecure-tls"}]'
kubectl scale -n kube-system deploy/metrics-server --replicas=1
kubectl rollout status deploy/metrics-server -n kube-system

# Watch memory consumption
while true; do date; kubectl top pod -n ate-system -l app=atelet --containers >> bin/output-fix.log; sleep 5; done

# Check atelet, image pull starts and eventually completes in a reasonable amount of time
kubectl logs daemonset/atelet -n ate-system --follow
{"time":"2026-06-12T17:33:16.609666504Z","level":"INFO","msg":"Using S3 storage backend"}
{"time":"2026-06-12T17:33:16.61079017Z","level":"INFO","msg":"Starting Prometheus metrics server on :9090"}
{"time":"2026-06-12T17:33:16.626096171Z","level":"INFO","msg":"WorkersManagerService listening","address":{"IP":"::","Port":8085,"Zone":""}}
{"time":"2026-06-12T17:36:40.508440418Z","level":"INFO","msg":"Ref includes digest, checking for cache hit","ref":"ghcr.io/kagent-dev/nemoclaw/sandbox-base@sha256:d52bee415dc4c0dba7164f9eabe727574c056d4f211781f20af249707883a3b4","digest":"sha256:d52bee415dc4c0dba7164f9eabe727574c056d4f211781f20af249707883a3b4","ate.dev/trace-id":"f2bd930bd82771821f8d5fcabcb895a0"}
{"time":"2026-06-12T17:36:40.508567918Z","level":"INFO","msg":"Ref includes digest, checking for cache hit","ref":"registry.k8s.io/pause:3.10.2@sha256:f548e0e8e3dc1896ca956272154dde3314e8cc4fde0a57577ee9fa1c63f5baf4","digest":"sha256:f548e0e8e3dc1896ca956272154dde3314e8cc4fde0a57577ee9fa1c63f5baf4","ate.dev/trace-id":"f2bd930bd82771821f8d5fcabcb895a0"}
{"time":"2026-06-12T17:36:40.508644543Z","level":"INFO","msg":"Cache miss","ref":"registry.k8s.io/pause:3.10.2@sha256:f548e0e8e3dc1896ca956272154dde3314e8cc4fde0a57577ee9fa1c63f5baf4","ate.dev/trace-id":"f2bd930bd82771821f8d5fcabcb895a0"}
{"time":"2026-06-12T17:36:40.508595126Z","level":"INFO","msg":"Cache miss","ref":"ghcr.io/kagent-dev/nemoclaw/sandbox-base@sha256:d52bee415dc4c0dba7164f9eabe727574c056d4f211781f20af249707883a3b4","ate.dev/trace-id":"f2bd930bd82771821f8d5fcabcb895a0"}
{"time":"2026-06-12T17:38:35.082124179Z","level":"INFO","msg":"Handle RPC","method":"/atelet.AteomHerder/Run","req":{"target_ateom_uid":"739d5e60-905f-4681-bac4-171190ab7cba","actor_template_namespace":"ate-repro","actor_template_name":"repro-fat-image","actor_id":"69bcf578-73af-48da-8cb8-c73f4b09622b","runsc":{"amd64":{"sha256_hash":"a397be1abc2420d26bce6c70e6e2ff96c73aaaab929756c56f5e2089ea842b63","url":"gs://gvisor/releases/nightly/2026-05-19/x86_64/runsc"},"arm64":{"sha256_hash":"1ba2366ae2efceba166046f51a4104f9261c9cb72c6db8f5b3fe2dc57dea86b9","url":"gs://gvisor/releases/nightly/2026-05-19/aarch64/runsc"}},"spec":{"containers":[{"name":"fat","image":"ghcr.io/kagent-dev/nemoclaw/sandbox-base@sha256:d52bee415dc4c0dba7164f9eabe727574c056d4f211781f20af249707883a3b4","command":["/bin/sh","-c","sleep 3600"]}],"pause_image":"registry.k8s.io/pause:3.10.2@sha256:f548e0e8e3dc1896ca956272154dde3314e8cc4fde0a57577ee9fa1c63f5baf4"}},"resp":{},"err":null,"elapsed-time":"2m0.322655054s","ate.dev/trace-id":"f2bd930bd82771821f8d5fcabcb895a0"}
{"time":"2026-06-12T17:38:55.689723883Z","level":"INFO","msg":"Handle RPC","method":"/atelet.AteomHerder/Checkpoint","req":{"target_ateom_uid":"739d5e60-905f-4681-bac4-171190ab7cba","actor_template_namespace":"ate-repro","actor_template_name":"repro-fat-image","actor_id":"69bcf578-73af-48da-8cb8-c73f4b09622b","runsc":{"amd64":{"sha256_hash":"a397be1abc2420d26bce6c70e6e2ff96c73aaaab929756c56f5e2089ea842b63","url":"gs://gvisor/releases/nightly/2026-05-19/x86_64/runsc"},"arm64":{"sha256_hash":"1ba2366ae2efceba166046f51a4104f9261c9cb72c6db8f5b3fe2dc57dea86b9","url":"gs://gvisor/releases/nightly/2026-05-19/aarch64/runsc"}},"spec":{"containers":[{"name":"fat","image":"ghcr.io/kagent-dev/nemoclaw/sandbox-base@sha256:d52bee415dc4c0dba7164f9eabe727574c056d4f211781f20af249707883a3b4","command":["/bin/sh","-c","sleep 3600"]}],"pause_image":"registry.k8s.io/pause:3.10.2@sha256:f548e0e8e3dc1896ca956272154dde3314e8cc4fde0a57577ee9fa1c63f5baf4"},"snapshot_uri_prefix":"gs://ate-snapshots/repro/69bcf578-73af-48da-8cb8-c73f4b09622b/2026-06-12T17:38:55Z-XHXXXIU65H5YZCKRHLGC3XH6F4"},"resp":{},"err":null,"elapsed-time":"685.691709ms","ate.dev/trace-id":"41774e0b1b60502ad62d6351bafd4a1b"}

Out of scope / follow-ups

Moving image pulls off the workflow path entirely (background pull worker + status-polling RPC) is the long-term fix per internal/memorypullcache/memorypullcache.go:47-58. Should follow the on-disk / shared layer cache work via ategcs.ObjectStorage. The heartbeat shape introduced here does not preclude that redesign.
gRPC keepalive on ateapi server + client sides would tighten the partition-detection story. Left as a separate change.

ActorWorkflow.ResumeActor and SuspendActor used to derive their workflow ctx from the Redis lock TTL via acquireActorLock(ctx, id, 30s, 2s) — the workflow deadline and the lock TTL were a single 28s knob. That meant image pulls / restores that legitimately need more than 28s death-looped forever, while raising the knob also raised how long peers wait to retry an actor after a crashed ateapi replica. Split the two concerns: - Lock TTL stays short (30s constant, internal). Bounds peer failover. - Workflow deadline is a separate operator-configurable knob via the new --actor-workflow-deadline pflag (default 5m). Bounds a single Resume/Suspend. - A heartbeat goroutine refreshes the lock every lockTTL/3 (~10s) for the full workflow duration. On RefreshLock=false or any Redis error (peer stole the lock, Redis blip), the workflow ctx is cancelled with errLostActorLock as the cause so in-flight steps unwind cleanly and the mutual-exclusion invariant is preserved. - The release function stops the heartbeat (waits for goroutine exit) before best-effort ReleaseLock. Adds store.Interface.RefreshLock with a Redis CAS Lua script mirroring the existing ReleaseLock script.

Jonathan Jamroga (jjamroga) · 2026-06-15T17:54:19Z

Bowei Du (@bowei) Would you potentially be able to take a look at this? This is followup work to that memory leak issue I put a PR up for.

Conflicts came from main's WorkerPool/SandboxConfig plumbing and new PauseActor running into this branch's actorWorkflowDeadline plumbing and rewritten acquireActorLock signature. Resolutions: - NewService and NewActorWorkflow take both new sets of params: workerPoolLister/sandboxConfigLister AND actorWorkflowDeadline. - SuspendActor and PauseActor moved to the heartbeat-based acquireActorLock(ctx, id, lockTTL, heartbeatInterval) API. The hardcoded 30s lock / 2s padding in PauseActor's original commit are superseded; Pause now runs under w.workflowDeadline alongside Resume and Suspend. Follow-up: w.workflowDeadline is currently shared by all three workflows. Resume genuinely needs the long deadline (image pull) and Suspend needs it too (snapshot write). Pause is a cgroup-freeze and could justify a tighter signal-only bound; deferred until there's operational signal that 5min is too loose for Pause.

Jonathan Jamroga (jjamroga) marked this pull request as ready for review June 12, 2026 18:14

Jonathan Jamroga (jjamroga) changed the title ~~Decouple actor lock TTL from workflow deadline via heartbeat~~ fix: Pull large images by decoupling actor lock TTL from workflow deadline via heartbeat Jun 12, 2026

Jonathan Jamroga (jjamroga) changed the title ~~fix: Pull large images by decoupling actor lock TTL from workflow deadline via heartbeat~~ fix: Allow large image pulls to succeed by decoupling workflow deadline from lock TTL via heartbeat Jun 12, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: Allow large image pulls to succeed by decoupling workflow deadline from lock TTL via heartbeat#234

fix: Allow large image pulls to succeed by decoupling workflow deadline from lock TTL via heartbeat#234
Jonathan Jamroga (jjamroga) wants to merge 2 commits into
agent-substrate:mainfrom
jjamroga:jjamroga/decouple-workflow-deadline

Jonathan Jamroga (jjamroga) commented Jun 12, 2026 •

edited

Loading

Uh oh!

Jonathan Jamroga (jjamroga) commented Jun 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Jonathan Jamroga (jjamroga) commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why a server-side deadline at all (not just heartbeat)

Test plan

Out of scope / follow-ups

Uh oh!

Jonathan Jamroga (jjamroga) commented Jun 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Jonathan Jamroga (jjamroga) commented Jun 12, 2026 •

edited

Loading