fix: Allow large image pulls to succeed by decoupling workflow deadline from lock TTL via heartbeat#234
Open
Jonathan Jamroga (jjamroga) wants to merge 2 commits into
Conversation
ActorWorkflow.ResumeActor and SuspendActor used to derive their workflow ctx from the Redis lock TTL via acquireActorLock(ctx, id, 30s, 2s) — the workflow deadline and the lock TTL were a single 28s knob. That meant image pulls / restores that legitimately need more than 28s death-looped forever, while raising the knob also raised how long peers wait to retry an actor after a crashed ateapi replica. Split the two concerns: - Lock TTL stays short (30s constant, internal). Bounds peer failover. - Workflow deadline is a separate operator-configurable knob via the new --actor-workflow-deadline pflag (default 5m). Bounds a single Resume/Suspend. - A heartbeat goroutine refreshes the lock every lockTTL/3 (~10s) for the full workflow duration. On RefreshLock=false or any Redis error (peer stole the lock, Redis blip), the workflow ctx is cancelled with errLostActorLock as the cause so in-flight steps unwind cleanly and the mutual-exclusion invariant is preserved. - The release function stops the heartbeat (waits for goroutine exit) before best-effort ReleaseLock. Adds store.Interface.RefreshLock with a Redis CAS Lua script mirroring the existing ReleaseLock script.
Contributor
Author
|
Bowei Du (@bowei) Would you potentially be able to take a look at this? This is followup work to that memory leak issue I put a PR up for. |
Conflicts came from main's WorkerPool/SandboxConfig plumbing and new PauseActor running into this branch's actorWorkflowDeadline plumbing and rewritten acquireActorLock signature. Resolutions: - NewService and NewActorWorkflow take both new sets of params: workerPoolLister/sandboxConfigLister AND actorWorkflowDeadline. - SuspendActor and PauseActor moved to the heartbeat-based acquireActorLock(ctx, id, lockTTL, heartbeatInterval) API. The hardcoded 30s lock / 2s padding in PauseActor's original commit are superseded; Pause now runs under w.workflowDeadline alongside Resume and Suspend. Follow-up: w.workflowDeadline is currently shared by all three workflows. Resume genuinely needs the long deadline (image pull) and Suspend needs it too (snapshot write). Pause is a cgroup-freeze and could justify a tighter signal-only bound; deferred until there's operational signal that 5min is too loose for Pause.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #233.
Customer Impact: Prior to #230, large images (specifically our test image) would eventually finish pulling in ~20 min, at the cost of RSS increasing in atelet i.e. worker successfully has actor scheduled to it at the cost of overall system stability. This wasn't an acceptable tradeoff. Orphaned image pulls resulted in atelet consuming almost 10 GiB of memory before being garbage collected.
Now, long running image pulls have a process of reacquiring a lock, preventing it from expiring before the image pull has completed. Introduced a high level, user configurable timeout ensuring that there is a window in which the lock can be lost and be lost in the event of image pull hanging (instead of a similar bug where a process is left hanging despite the caller having abandoned it.)
Summary
Decouples the per-actor workflow deadline from the Redis lock TTL via a heartbeat, so large image pulls can finish without raising the lock TTL (which also bounds peer failover after a crashed
ateapi).Lock TTL (~30s) bounds failover latency — how long a peer
ateapiwaits to retry an actor if the original instance crashes mid-workflow. Workflow deadline (~5min) bounds legitimate work — how long a single Resume/Suspend/Pause is allowed to run end-to-end before we give up. The lock refresh (heartbeat goroutine, PEXPIREs every TTL/3) is the bridge: it keeps the lock alive across the long workflow window without raising the TTL itself, so we get both fast failover AND room for slow legitimate operations.Lifecycle RPCs (Resume/Suspend/Pause) all go through
acquireActorLockon entry. That call (a) grabs the Redis lock with the short TTL, (b) returns a workflow-bounded ctx viaWithTimeout(workflowDeadline), and (c) spawns the heartbeat. While the RPC runs, the lock stays held; if the heartbeat ever fails or returns "you no longer own this lock," it cancels the workflow ctx so steps tear down promptly. On normal return, release stops the heartbeat and frees the lock. Mutual exclusion per actor holds throughout — exactly one workflow per actor at any instant, just like before, but no longer pinned to a 28-second ceiling.Why a server-side deadline at all (not just heartbeat)
Considered keeping only the heartbeat and letting the workflow ctx inherit from the caller. Rejected on defense-in-depth grounds:
ateapi.The 5-minute server-side cap covers both cases at the cost of one config knob. Operators can raise it for slow registries; the lock TTL is independent so peer failover stays bounded at ≤30s regardless.
Test plan
Automated:
go test ./cmd/ateapi/internal/store/ateredis/... -run 'TestRefreshLock|TestAcquireLock'— 7 lock tests pass (3 new + 4 existing).go test ./cmd/ateapi/internal/controlapi/ -run TestAcquireActorLock— 4 new heartbeat tests pass:errLostActorLockcause.codes.Aborted.go test ./... -count=1— full repo suite green (419 tests / 65 packages).go vet ./...— clean.Manual:
Out of scope / follow-ups
internal/memorypullcache/memorypullcache.go:47-58. Should follow the on-disk / shared layer cache work viaategcs.ObjectStorage. The heartbeat shape introduced here does not preclude that redesign.ateapiserver + client sides would tighten the partition-detection story. Left as a separate change.