EnsureImage with force=false silently skips re-pull when local URL cache is stale, clone dies in VerifyBaseFiles #37

@tonicmuroq

Description

Problem

CocoonSet clone fails with a backing-file I/O error in vm.restore when the base image's mutable tag at epoch has been re-pushed after the hot snapshot was baked. The node has a stale URL→digest mapping in its local cocoon image cache, but the in-flight EnsureImage call silently lets the clone proceed instead of forcing a re-pull, so the failure surfaces deep inside VerifyBaseFiles instead of at the image layer.

Evidence

Reproduced on testing 2026-05-09. CocoonSet from simular/win11-hot-testing:v1-20260509 (records base simular/win11 @ sha256:adafd938488daa114be898848eb24b9b0afffc21ac18f8b11f3f0057644b11e1); pod stuck in ProviderCreateFailed:

clone vm vk-default-vm-XXXX from simular/win11-hot-testing:v1-20260509: cocoon vm clone: exit status 1
INF base image not found locally, pulling https://epoch.simular.cloud/dl/simular/win11 ...  func=core.EnsureImage
WRN pulled https://epoch.simular.cloud/dl/simular/win11 but expected digest sha256:adafd938... not found locally — clone may fail in VerifyBaseFiles  func=core.EnsureImage
Error: vm.restore → 500: Backing file I/O error: /var/lib/cocoon/cloudimg/blobs/adafd938488daa114be898848eb24b9b0afffc21ac18f8b11f3f0057644b11e1.qcow2 — No such file or directory

epoch was serving the correct blob (verified — cocoon image pull --force later landed exactly sha256:adafd938...). The failing node's index still pointed simular/win11 URL → an older digest's qcow2.

Two interacting bits in the cocoonstack/cocoon codebase produce this:

  1. cocoon/images/cloudimg/pull.go:62-79 — URL-level idempotency short-circuit, default force=false:

    if !force {
        if _, entry, ok := idx.Lookup(url); ok {
            if utils.ValidFile(conf.BlobPath(entry.ContentSum.Hex())) {
                skip = true   // skip HTTP re-download
            }
        }
    }

    Reasonable for "we don't care which version, any cached one is fine."

  2. cocoon/cmd/core/helpers.go:228-265 — EnsureImage calls b.Pull(ctx, pullRef, false, ...) with force=false hardcoded:

    img, _ := b.Inspect(ctx, lookupRef)   // lookupRef = vmCfg.ImageDigest
    if img != nil { return }              // exact digest already local
    pullRef := digestPullRef(...)
    if pullErr := b.Pull(ctx, pullRef, false, progress.Nop); pullErr != nil { ... }
    if vmCfg.ImageDigest != "" && pullRef == vmCfg.Image {
        img, _ := b.Inspect(ctx, vmCfg.ImageDigest)
        if img == nil {
            logger.Warnf(ctx, "pulled %s but expected digest %s not found locally — clone may fail in VerifyBaseFiles", ...)
            return
        }
    }

    EnsureImage has already proved, at the first Inspect, that the local cache lacks the expected digest. But the subsequent Pull(force=false) short-circuits on the URL match, the post-pull Inspect-by-digest also fails, the warning fires, and EnsureImage returns; the clone then runs and dies later in VerifyBaseFiles. The warning is the only in-band signal.

How we resolved it

sudo cocoon image pull --force https://epoch.simular.cloud/dl/simular/win11

on every cocoon-vm node. --force=true skips the URL short-circuit at pull.go:62, re-downloads, recomputes the sha256, and overwrites the index entry. Verified end-to-end: the 12-step e2e on testing went from red (clone I/O error) to green (full create + hibernate + wake + RC commands) after running this on both cocoonset-node-1 and cocoonset-node-2.

Expected behavior

When EnsureImage is asked for a specific digest (vmCfg.ImageDigest != "") and has already determined that digest is absent locally, the subsequent pull should bypass the URL-level idempotency cache. The whole point of the digest-based pre-check is "I want exactly this digest"; if it's missing, the right action is to re-fetch, not to log a warning and let the next layer fail.

This matches the mental model of the existing cocoon image pull --force flag, whose own description says: "bypass cache and always re-download (useful when a mutable tag was replaced upstream)".

Viable approach

Single-line change in cocoon/cmd/core/helpers.go:EnsureImage:

// We already proved at Inspect(lookupRef=ImageDigest) that this digest
// isn't local. The cheap "URL already cached" short-circuit in pull()
// is wrong here — that's exactly the case where the upstream tag has
// moved and we need to re-fetch.
needForce := vmCfg.ImageDigest != ""
if pullErr := b.Pull(ctx, pullRef, needForce, progress.Nop); pullErr != nil { ... }

Cheap path stays cheap (no digest pinning + blob present → URL skip still applies). Slow path triggers exactly when a mutable tag has moved upstream — same semantics users already get from cocoon image pull --force manually.


Filed against vk-cocoon per request; the bug is entirely inside cocoonstack/cocoon (cmd/core/helpers.go + images/cloudimg/pull.go). vk-cocoon is the user-visible failure point because it shells out to cocoon vm clone to materialize CocoonSet pods. Happy to be transferred.
