Problem
CocoonSet clone fails with a backing-file I/O error in vm.restore when the base image's mutable tag at epoch has been re-pushed after the hot snapshot was baked. The node has a stale URL→digest mapping in its local cocoon image cache, but the in-flight EnsureImage call silently lets the clone proceed instead of forcing a re-pull, so the failure surfaces deep inside VerifyBaseFiles instead of at the image layer.
Evidence
Reproduced on testing 2026-05-09. CocoonSet from simular/win11-hot-testing:v1-20260509 (records base simular/win11 @ sha256:adafd938488daa114be898848eb24b9b0afffc21ac18f8b11f3f0057644b11e1); pod stuck in ProviderCreateFailed:
clone vm vk-default-vm-XXXX from simular/win11-hot-testing:v1-20260509: cocoon vm clone: exit status 1
INF base image not found locally, pulling https://epoch.simular.cloud/dl/simular/win11 ... func=core.EnsureImage
WRN pulled https://epoch.simular.cloud/dl/simular/win11 but expected digest sha256:adafd938... not found locally — clone may fail in VerifyBaseFiles func=core.EnsureImage
Error: vm.restore → 500: Backing file I/O error: /var/lib/cocoon/cloudimg/blobs/adafd938488daa114be898848eb24b9b0afffc21ac18f8b11f3f0057644b11e1.qcow2 — No such file or directory
epoch was serving the correct blob (verified — cocoon image pull --force later landed exactly sha256:adafd938...). The failing node's index still mapped the simular/win11 URL to an older digest's qcow2.
Two interacting bits in the cocoonstack/cocoon codebase produce this:
- cocoon/images/cloudimg/pull.go:62-79 — URL-level idempotency short-circuit, default force=false:
if !force {
    if _, entry, ok := idx.Lookup(url); ok {
        if utils.ValidFile(conf.BlobPath(entry.ContentSum.Hex())) {
            skip = true // skip HTTP re-download
        }
    }
}
Reasonable for "we don't care which version, any cached one is fine."
- cocoon/cmd/core/helpers.go:228-265 — EnsureImage calls b.Pull(ctx, pullRef, false, ...) with force=false hardcoded:
img, _ := b.Inspect(ctx, lookupRef) // lookupRef = vmCfg.ImageDigest
if img != nil { return }            // exact digest already local
pullRef := digestPullRef(...)
if pullErr := b.Pull(ctx, pullRef, false, progress.Nop); pullErr != nil { ... }
if vmCfg.ImageDigest != "" && pullRef == vmCfg.Image {
    img, _ := b.Inspect(ctx, vmCfg.ImageDigest)
    if img == nil {
        logger.Warnf(ctx, "pulled %s but expected digest %s not found locally — clone may fail in VerifyBaseFiles", ...)
        return
    }
}
EnsureImage already proved at the first Inspect that the local cache lacks the expected digest. But the subsequent Pull(force=false) short-circuits on the URL match, the post-pull Inspect-by-digest also fails, the warning fires, and EnsureImage returns — the clone then runs and dies later in VerifyBaseFiles. The warning is the only in-band signal.
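To make the interaction concrete, here is a minimal, self-contained model of the two code paths above. The types and functions are hypothetical reductions of the real cloudimg/EnsureImage code, not the actual implementation:

package main

import "fmt"

// node is a hypothetical reduction of a cocoon-vm node's image state:
// a URL-keyed index plus a set of locally present blob digests.
type node struct {
    index map[string]string // URL → digest recorded at last pull
    blobs map[string]bool   // digest → blob file present locally
}

// inspect stands in for b.Inspect: succeeds only if the exact digest is local.
func (n *node) inspect(digest string) bool { return n.blobs[digest] }

// pull stands in for the pull.go:62 short-circuit: with force=false, any
// valid cached entry for the URL wins, regardless of which digest the
// caller actually needs.
func (n *node) pull(url string, force bool) {
    if !force {
        if d, ok := n.index[url]; ok && n.blobs[d] {
            return // URL hit + blob valid → skip HTTP re-download
        }
    }
    // (real code: download, recompute sha256, overwrite the index entry)
}

func main() {
    url := "https://epoch.simular.cloud/dl/simular/win11"
    want := "sha256:adafd938..." // digest recorded in the hot snapshot

    n := &node{
        index: map[string]string{url: "sha256:OLD"}, // stale: tag was re-pushed upstream
        blobs: map[string]bool{"sha256:OLD": true},
    }

    // The EnsureImage sequence from the bug:
    if !n.inspect(want) { // 1. proved the wanted digest is absent
        n.pull(url, false) // 2. force=false → short-circuits on the URL
        if !n.inspect(want) { // 3. still absent
            fmt.Println("WRN pulled but expected digest not found — clone will fail in VerifyBaseFiles")
        }
    }
    // 4. clone proceeds anyway and dies in vm.restore with the I/O error.
}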
How we resolved it
sudo cocoon image pull --force https://epoch.simular.cloud/dl/simular/win11
on every cocoon-vm node. --force=true skips the URL short-circuit at pull.go:62, re-downloads, recomputes sha256, and overwrites the index entry. Verified end-to-end — the testing 12-step e2e went from red (clone I/O error) to green (full create + hibernate + wake + RC commands) after running this on both cocoonset-node-1 and cocoonset-node-2.
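To spot-check a node after the forced pull, a throwaway verifier along these lines works — assuming, as the BlobPath(entry.ContentSum.Hex()) call suggests, that blobs are content-addressed by their own sha256. This is a local one-off sketch, not a cocoon subcommand:

package main

import (
    "crypto/sha256"
    "encoding/hex"
    "fmt"
    "io"
    "os"
)

func main() {
    // Digest recorded by the hot snapshot (from the evidence above).
    const want = "adafd938488daa114be898848eb24b9b0afffc21ac18f8b11f3f0057644b11e1"
    path := "/var/lib/cocoon/cloudimg/blobs/" + want + ".qcow2"

    f, err := os.Open(path)
    if err != nil {
        fmt.Println("blob missing:", err) // the pre-fix failure mode
        os.Exit(1)
    }
    defer f.Close()

    // Recompute sha256 over the blob and compare against the filename.
    h := sha256.New()
    if _, err := io.Copy(h, f); err != nil {
        fmt.Println("read error:", err)
        os.Exit(1)
    }
    fmt.Println("digest matches:", hex.EncodeToString(h.Sum(nil)) == want)
}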
Expected behavior
When EnsureImage is asked for a specific digest (vmCfg.ImageDigest != "") and has already determined that digest is absent locally, the subsequent pull should bypass the URL-level idempotency cache. The whole point of the digest-based pre-check is "I want exactly this digest"; if it's missing, the right action is to re-fetch, not to log a warning and let the next layer fail.
This matches the mental model of the existing cocoon image pull --force flag, whose own description says: "bypass cache and always re-download (useful when a mutable tag was replaced upstream)".
Viable approach
Single-line change in cocoon/cmd/core/helpers.go:EnsureImage:
// We already proved at Inspect(lookupRef=ImageDigest) that this digest
// isn't local. The cheap "URL already cached" short-circuit in pull()
// is wrong here — that's exactly the case where the upstream tag has
// moved and we need to re-fetch.
needForce := vmCfg.ImageDigest != ""
if pullErr := b.Pull(ctx, pullRef, needForce, progress.Nop); pullErr != nil { ... }
The cheap path stays cheap (no digest pinned and blob present → the URL skip still applies). The slow path triggers exactly when a mutable tag has moved upstream — the same semantics users already get from cocoon image pull --force manually.
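A regression test for the new behavior could look roughly like this (ensurePullForce is an illustrative extraction of the one-line predicate, not the real helper):

package main

import "testing"

// ensurePullForce mirrors the proposed fix: force the pull exactly when
// the caller pinned a digest.
func ensurePullForce(imageDigest string) bool {
    return imageDigest != ""
}

func TestEnsureImageForcesPullWhenDigestPinned(t *testing.T) {
    cases := []struct {
        name        string
        imageDigest string
        wantForce   bool
    }{
        // Unpinned pull: keep the cheap URL-level short-circuit.
        {"no digest pinned", "", false},
        // Pinned pull whose digest was already proved absent: must re-fetch.
        {"digest pinned", "sha256:adafd938...", true},
    }
    for _, tc := range cases {
        t.Run(tc.name, func(t *testing.T) {
            if got := ensurePullForce(tc.imageDigest); got != tc.wantForce {
                t.Fatalf("force = %v, want %v", got, tc.wantForce)
            }
        })
    }
}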
Filed against vk-cocoon per request; the bug is entirely inside cocoonstack/cocoon (cmd/core/helpers.go + images/cloudimg/pull.go). vk-cocoon is the user-visible failure point because it shells out to cocoon vm clone to materialize CocoonSet pods. Happy to be transferred.