postclone: gate cloned-VM Ready on the DHCP lease (fixes ready-without-IP)#23
Open
tonicmuroq wants to merge 1 commit into
Open
postclone: gate cloned-VM Ready on the DHCP lease (fixes ready-without-IP)#23tonicmuroq wants to merge 1 commit into
tonicmuroq wants to merge 1 commit into
Conversation
…t-IP) The clone path publishes lifecycle-state=ready from runPostCloneSetup as soon as the post-clone agent exec succeeds. That exec runs over vsock and does not depend on the network, so on a slow-booting guest it can finish before the guest's DHCP lease lands — leaving lifecycle=ready with no pod IP: resolveVMIP returns "" and both pod.status.podIP and the vm.cocoonstack.io/ip annotation are empty. This is a contract bug. lifecycle-state=ready (paired with observed-generation) is the signal consumers read the pod IP on — vm-service reads status.podIP exactly once after observing ready, and on empty raises "CocoonSet <name> podIP empty after lifecycle-state=ready" and deletes the freshly-cloned VM. So a premature Ready turns a perfectly good VM into a spurious 500 + teardown. Reproduces reliably on the first create from a large/slow snapshot (cold base image pull + multi-GB guest resume push the DHCP lease past the ready signal). The wake path already honors the contract — finalizeDropNICWake does waitForFreshIP -> refreshStatus -> Ready, and marks Failed if the lease never lands. The clone path simply never adopted it. Fix: add markReadyAfterIP and route both Ready sites in runPostCloneSetup through it (the no-fixup early return and the post-exec success). It waits for the lease, flushes status so status.podIP is published, then marks Ready; on timeout it marks Failed rather than ever publishing ready-without-IP. Now clone and wake deliver the same guarantee: ready => pod IP is resolvable.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
The bug
The clone path publishes
lifecycle-state=readyfromrunPostCloneSetupas soon as the post-clone agent exec succeeds. That exec runs over vsock and does not depend on the network, so on a slow-booting guest it can finish before the guest's DHCP lease lands — leavinglifecycle-state=readywith no pod IP:resolveVMIPreturns""and bothpod.status.podIPand thevm.cocoonstack.io/ipannotation are empty.This breaks the lifecycle contract.
lifecycle-state=ready(paired withobserved-generation) is the signal consumers read the pod IP on — e.g. vm-service readsstatus.podIPexactly once after observing ready, and on empty raises:…and then deletes the freshly-cloned VM. So a premature Ready turns a perfectly good VM into a spurious 500 + teardown, forcing a retry.
Reproduces reliably on the first create from a large/slow snapshot: a cold base-image pull (~13 GB) + multi-GB guest resume push the DHCP lease past the ready signal. Observed on Windows Server 2025 (8 GB) staging creates — first attempt 500s on
podIP empty after lifecycle-state=ready, immediate retry (base now cached) succeeds.Why the contract should hold (not be worked around in consumers)
readyis meant to promise "VM is up and reachable", i.e. the pod IP is resolvable. The wake path already enforces this —finalizeDropNICWakedoeswaitForFreshIP -> refreshStatus -> Ready, and marksFailedif the lease never lands. The clone path simply never adopted the same gate. Consumers reading the IP once afterreadyare correct; the clone path is the one violating the promise.The fix
Add
markReadyAfterIPand route both Ready sites inrunPostCloneSetupthrough it (the no-fixup early return, and the post-exec success). It mirrorsfinalizeDropNICWake:waitForFreshIP— wait for the DHCP lease (bounded bywakeFreshIPBudget, default 15s)refreshStatus+notify— publishstatus.podIPbefore flipping the annotationmarkLifecycleState(Ready)markLifecycleState(Failed)with aPostCloneIPWaitTimeoutevent, never ready-without-IPAfter this, clone and wake deliver the same guarantee:
ready⇒ pod IP is resolvable.go build+go test ./provider/cocoon/pass.Note / possible follow-up
Reused
wakeFreshIPBudget(15s) for the clone-path wait rather than adding a separatepostCloneFreshIPBudget— happy to split it out if you'd prefer a distinct knob.