
postclone: gate cloned-VM Ready on the DHCP lease (fixes ready-without-IP)#23

Open
tonicmuroq wants to merge 1 commit into main from fix/postclone-ready-gate-on-ip


@tonicmuroq

Contributor

The bug

The clone path publishes lifecycle-state=ready from runPostCloneSetup as soon as the post-clone agent exec succeeds. That exec runs over vsock and does not depend on the network, so on a slow-booting guest it can finish before the guest's DHCP lease lands — leaving lifecycle-state=ready with no pod IP: resolveVMIP returns "" and both pod.status.podIP and the vm.cocoonstack.io/ip annotation are empty.

This breaks the lifecycle contract. lifecycle-state=ready (paired with observed-generation) is the signal consumers read the pod IP on — e.g. vm-service reads status.podIP exactly once after observing ready, and on empty raises:

CocoonSet <name> podIP empty after lifecycle-state=ready

…and then deletes the freshly-cloned VM. So a premature Ready turns a perfectly good VM into a spurious 500 + teardown, forcing a retry.

Reproduces reliably on the first create from a large/slow snapshot: a cold base-image pull (~13 GB) + multi-GB guest resume push the DHCP lease past the ready signal. Observed on Windows Server 2025 (8 GB) staging creates — first attempt 500s on podIP empty after lifecycle-state=ready, immediate retry (base now cached) succeeds.

Why the contract should hold (not be worked around in consumers)

ready is meant to promise "VM is up and reachable", i.e. the pod IP is resolvable. The wake path already enforces this: finalizeDropNICWake does waitForFreshIP -> refreshStatus -> Ready, and marks Failed if the lease never lands. The clone path simply never adopted the same gate. Consumers reading the IP once after ready are correct; the clone path is the one violating the promise.

The fix

Add markReadyAfterIP and route both Ready sites in runPostCloneSetup through it (the no-fixup early return, and the post-exec success). It mirrors finalizeDropNICWake:

  • waitForFreshIP — wait for the DHCP lease (bounded by wakeFreshIPBudget, default 15s)
  • refreshStatus + notify — publish status.podIP before flipping the annotation
  • then markLifecycleState(Ready)
  • on lease-wait timeout → markLifecycleState(Failed) with a PostCloneIPWaitTimeout event, never ready-without-IP

After this, clone and wake deliver the same guarantee: ready ⇒ pod IP is resolvable.

go build + go test ./provider/cocoon/ pass.

Note / possible follow-up

Reused wakeFreshIPBudget (15s) for the clone-path wait rather than adding a separate postCloneFreshIPBudget — happy to split it out if you'd prefer a distinct knob.

…t-IP)

The clone path publishes lifecycle-state=ready from runPostCloneSetup as soon as
the post-clone agent exec succeeds. That exec runs over vsock and does not depend
on the network, so on a slow-booting guest it can finish before the guest's DHCP
lease lands, leaving lifecycle-state=ready with no pod IP: resolveVMIP returns "" and
both pod.status.podIP and the vm.cocoonstack.io/ip annotation are empty.

This is a contract bug. lifecycle-state=ready (paired with observed-generation)
is the signal consumers read the pod IP on — vm-service reads status.podIP
exactly once after observing ready, and on empty raises "CocoonSet <name> podIP
empty after lifecycle-state=ready" and deletes the freshly-cloned VM. So a
premature Ready turns a perfectly good VM into a spurious 500 + teardown.
Reproduces reliably on the first create from a large/slow snapshot (cold base
image pull + multi-GB guest resume push the DHCP lease past the ready signal).

The wake path already honors the contract — finalizeDropNICWake does
waitForFreshIP -> refreshStatus -> Ready, and marks Failed if the lease never
lands. The clone path simply never adopted it. Fix: add markReadyAfterIP and
route both Ready sites in runPostCloneSetup through it (the no-fixup early
return and the post-exec success). It waits for the lease, flushes status so
status.podIP is published, then marks Ready; on timeout it marks Failed rather
than ever publishing ready-without-IP. Now clone and wake deliver the same
guarantee: ready => pod IP is resolvable.