postclone: gate cloned-VM Ready on the DHCP lease (fixes ready-without-IP) by tonicmuroq · Pull Request #23 · cocoonstack/vk-cocoon

tonicmuroq · 2026-06-23T09:57:03Z

The bug

The clone path publishes lifecycle-state=ready from runPostCloneSetup as soon as the post-clone agent exec succeeds. That exec runs over vsock and does not depend on the network, so on a slow-booting guest it can finish before the guest's DHCP lease lands — leaving lifecycle-state=ready with no pod IP: resolveVMIP returns "" and both pod.status.podIP and the vm.cocoonstack.io/ip annotation are empty.

This breaks the lifecycle contract. lifecycle-state=ready (paired with observed-generation) is the signal that tells consumers to read the pod IP. For example, vm-service reads status.podIP exactly once after observing ready, and if it is empty raises:

CocoonSet <name> podIP empty after lifecycle-state=ready

…and then deletes the freshly-cloned VM. So a premature Ready turns a perfectly good VM into a spurious 500 + teardown, forcing a retry.
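
To make the consumer side of the contract concrete, here is a minimal sketch of a one-shot read like the one described above. It is not vm-service's actual code: the podView type, the podIPAfterReady function and errNotReady are hypothetical names for illustration, and only the error text is taken from this description.

```go
package main

import (
	"errors"
	"fmt"
)

// podView is a hypothetical, trimmed-down view of what a consumer observes.
type podView struct {
	Name           string
	LifecycleState string // value of the lifecycle-state annotation
	PodIP          string // value of status.podIP
}

// errNotReady tells the caller to keep watching (hypothetical sentinel).
var errNotReady = errors.New("lifecycle-state is not ready yet")

// podIPAfterReady reads the IP exactly once after ready is observed.
// It does not poll for the IP, because ready is supposed to imply it is set.
func podIPAfterReady(p podView) (string, error) {
	if p.LifecycleState != "ready" {
		return "", errNotReady
	}
	if p.PodIP == "" {
		// This is the failure a premature Ready triggers.
		return "", fmt.Errorf("CocoonSet %s podIP empty after lifecycle-state=ready", p.Name)
	}
	return p.PodIP, nil
}
```

With a reader shaped like this, a pod that reports ready with an empty status.podIP fails immediately, which is why the gate belongs in the producer rather than in a retry loop on the consumer side.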

Reproduces reliably on the first create from a large/slow snapshot: a cold base-image pull (~13 GB) + multi-GB guest resume push the DHCP lease past the ready signal. Observed on Windows Server 2025 (8 GB) staging creates — first attempt 500s on podIP empty after lifecycle-state=ready, immediate retry (base now cached) succeeds.

Why the contract should hold (not be worked around in consumers)

ready is meant to promise "VM is up and reachable", i.e. the pod IP is resolvable. The wake path already enforces this — finalizeDropNICWake does waitForFreshIP -> refreshStatus -> Ready, and marks Failed if the lease never lands. The clone path simply never adopted the same gate. Consumers reading the IP once after ready are correct; the clone path is the one violating the promise.

The fix

Add markReadyAfterIP and route both Ready sites in runPostCloneSetup through it (the no-fixup early return, and the post-exec success). It mirrors finalizeDropNICWake:

waitForFreshIP — wait for the DHCP lease (bounded by wakeFreshIPBudget, default 15s)
refreshStatus + notify — publish status.podIP before flipping the annotation
then markLifecycleState(Ready)
on lease-wait timeout → markLifecycleState(Failed) with a PostCloneIPWaitTimeout event, never ready-without-IP

After this, clone and wake deliver the same guarantee: ready ⇒ pod IP is resolvable.
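
The ordering above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: the vmOps interface, the method signatures, recordEvent, fakeVM and simulate are hypothetical, and the handling of a refreshStatus error is a guess, since the description does not cover it.

```go
package main

import (
	"context"
	"errors"
	"time"
)

type lifecycleState string

const (
	stateReady  lifecycleState = "ready"
	stateFailed lifecycleState = "failed"
)

// vmOps is a hypothetical seam over the provider methods named in the PR.
// The real signatures in vk-cocoon may differ.
type vmOps interface {
	waitForFreshIP(ctx context.Context) (string, error) // blocks until a lease lands or ctx ends
	refreshStatus(ctx context.Context) error            // publishes status.podIP
	markLifecycleState(s lifecycleState)
	recordEvent(reason string)
}

// markReadyAfterIP flips to Ready only after an IP was seen and status was flushed.
func markReadyAfterIP(ctx context.Context, vm vmOps, budget time.Duration) error {
	waitCtx, cancel := context.WithTimeout(ctx, budget)
	defer cancel()

	ip, err := vm.waitForFreshIP(waitCtx)
	if err == nil && ip == "" {
		err = errors.New("lease wait returned an empty IP")
	}
	if err != nil {
		vm.recordEvent("PostCloneIPWaitTimeout")
		vm.markLifecycleState(stateFailed) // never ready-without-IP
		return err
	}
	// Assumption: if the status flush fails, return without touching the state.
	if err := vm.refreshStatus(ctx); err != nil {
		return err
	}
	vm.markLifecycleState(stateReady)
	return nil
}

// fakeVM is an in-memory stand-in that records the call order.
type fakeVM struct {
	leaseDelay time.Duration
	calls      []string
}

func (f *fakeVM) waitForFreshIP(ctx context.Context) (string, error) {
	f.calls = append(f.calls, "waitForFreshIP")
	select {
	case <-time.After(f.leaseDelay):
		return "10.0.0.5", nil
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

func (f *fakeVM) refreshStatus(ctx context.Context) error {
	f.calls = append(f.calls, "refreshStatus")
	return nil
}

func (f *fakeVM) markLifecycleState(s lifecycleState) {
	f.calls = append(f.calls, "state="+string(s))
}

func (f *fakeVM) recordEvent(reason string) {
	f.calls = append(f.calls, "event="+reason)
}

// simulate runs the gate against a fake guest and returns the recorded call order.
func simulate(leaseDelayMs, budgetMs int) ([]string, error) {
	vm := &fakeVM{leaseDelay: time.Duration(leaseDelayMs) * time.Millisecond}
	err := markReadyAfterIP(context.Background(), vm, time.Duration(budgetMs)*time.Millisecond)
	return vm.calls, err
}
```

Calling simulate with a lease delay shorter than the budget records waitForFreshIP, refreshStatus, then state=ready. With a lease delay longer than the budget it records the PostCloneIPWaitTimeout event followed by state=failed, and refreshStatus is never reached.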

go build + go test ./provider/cocoon/ pass.

Note / possible follow-up

Reused wakeFreshIPBudget (15s) for the clone-path wait rather than adding a separate postCloneFreshIPBudget — happy to split it out if you'd prefer a distinct knob.
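
If a distinct knob is preferred, one possible shape is a clone-specific budget that falls back to the wake budget when unset. This is only a sketch: cloneIPBudget is a hypothetical helper, treating zero as "unset" is an assumption, and wakeFreshIPBudget is modeled here as a constant at its stated 15s default, which may not match how it is declared in the provider.

```go
package main

import "time"

// wakeFreshIPBudget uses the default quoted in this PR (15s).
// Assumption: declared as a constant here for simplicity.
const wakeFreshIPBudget = 15 * time.Second

// cloneIPBudget returns the clone-path lease budget.
// A non-positive postCloneFreshIPBudget means "not configured",
// in which case the wake-path budget is reused.
func cloneIPBudget(postCloneFreshIPBudget time.Duration) time.Duration {
	if postCloneFreshIPBudget > 0 {
		return postCloneFreshIPBudget
	}
	return wakeFreshIPBudget
}
```

This keeps current behavior when nothing is configured, while letting slow cold-pull clones get a longer wait without changing the wake path.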


tonicmuroq wants to merge 1 commit into main from fix/postclone-ready-gate-on-ip
