hibernation: wake fast-path returns Active without clearing hibernate annotation #2

@CMGS

Summary

In hibernation/wake.go, when reconcileWake observes that the VM has already come back (container Running + VMID set), it immediately drops the snapshot tag and marks the CR Active, without clearing the vm.cocoonstack.io/hibernate annotation on the pod.

// hibernation/wake.go
if vmClonedAndRunning(pod) {
    r.Epoch.DeleteManifest(ctx, vmName, meta.HibernateSnapshotTag)
    return ctrl.Result{}, r.setPhase(ctx, hib, cocoonv1.CocoonHibernationPhaseActive, vmName)
}

// this clear is skipped on the fast-path above
if meta.ReadHibernateState(pod) {
    commonk8s.PatchHibernateState(ctx, r.Client, pod, false)
}

Scenario

  1. A pod is already running with a valid VMID, but still carries hibernate=true (e.g. residue from a prior failed hibernate, or a CR created against an already-awake pod).
  2. The user creates the CR with Desire=Wake, or sets it on an existing CR. The first reconcile hits the fast-path and returns Active, leaving the hibernate=true annotation in place.
  3. The user flips to Desire=Hibernate. reconcileHibernate calls PatchHibernateState(pod, true), which is a no-op because the annotation already matches (see cocoon-common/k8s/utils.go:27).
  4. The reconciler immediately probes the registry for the snapshot tag. If a stale tag happens to be present, the CR gets marked Hibernated without vk-cocoon ever taking a new snapshot for this cycle.
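
The first three steps can be reproduced with a small in-memory model. This is only a sketch: the `pod` struct, `wakeFastPath`, and `patchHibernateState` are hypothetical stand-ins, and the "skip the write when the value already matches" behavior is assumed from the description of cocoon-common/k8s/utils.go above, not copied from it.

```go
package main

import "fmt"

// hibernateAnnotation is the annotation key quoted in this issue.
const hibernateAnnotation = "vm.cocoonstack.io/hibernate"

// pod is a hypothetical stand-in for the real Kubernetes pod object.
type pod struct {
	annotations map[string]string
	running     bool
	vmid        string
}

// patchHibernateState models the assumed behavior of PatchHibernateState:
// it writes only when the desired value differs from the current one and
// reports whether a write happened.
func patchHibernateState(p *pod, want bool) bool {
	have := p.annotations[hibernateAnnotation] == "true"
	if have == want {
		return false // already matches, so nothing changes on the pod
	}
	if want {
		p.annotations[hibernateAnnotation] = "true"
	} else {
		delete(p.annotations, hibernateAnnotation)
	}
	return true
}

// wakeFastPath models the current fast-path: a live VM is reported Active
// and the hibernate annotation is never touched.
func wakeFastPath(p *pod) string {
	if p.running && p.vmid != "" {
		return "Active"
	}
	return "Waking"
}

func main() {
	// Step 1: a live pod that still carries hibernate=true.
	p := &pod{
		annotations: map[string]string{hibernateAnnotation: "true"},
		running:     true,
		vmid:        "vm-1",
	}

	// Step 2: Desire=Wake hits the fast-path; the residue survives.
	fmt.Println("phase:", wakeFastPath(p))
	fmt.Println("residue:", p.annotations[hibernateAnnotation])

	// Step 3: Desire=Hibernate tries to set the annotation; no write occurs.
	fmt.Println("patched:", patchHibernateState(p, true))
}
```

In this model the final call reports that no write happened, which corresponds to the case in step 4 where no new snapshot is requested for the cycle.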

Impact

A subsequent wake would clone from a stale (or nonexistent) snapshot, resulting in data divergence or a stuck Waking phase.

Notes

  • This is pre-existing behavior (predates 82a9bc3). Not introduced by the recent VMID-gate hardening.
  • Raised during a /code review of HEAD~3..HEAD; deferred out of scope for that review.

Possible fixes

  • Always call PatchHibernateState(pod, false) before returning Active on the fast-path.
  • Or: move the ReadHibernateState/PatchHibernateState block above the fast-path, so the annotation is cleared unconditionally during any wake reconcile.
  • Either fix needs a small unit test covering the "hibernate annotation residue on an already-live pod" case.
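
A rough sketch of the second option, together with the residue case the unit test should cover, might look like the following. It runs against a simplified in-memory model, not the real reconciler: `fakePod`, `clearHibernate`, the `failPatch` switch, and the string phases are all hypothetical, and returning the patch error instead of ignoring it is an assumption about what the fix should do.

```go
package main

import (
	"errors"
	"fmt"
)

// hibernateAnnotation is the annotation key quoted in this issue.
const hibernateAnnotation = "vm.cocoonstack.io/hibernate"

// fakePod is a hypothetical stand-in for the real pod object.
type fakePod struct {
	annotations map[string]string
	running     bool
	vmid        string
}

// clearHibernate removes the annotation. failPatch simulates an API error
// so the error path can be exercised without a cluster.
func clearHibernate(p *fakePod, failPatch bool) error {
	if failPatch {
		return errors.New("patch failed")
	}
	delete(p.annotations, hibernateAnnotation)
	return nil
}

// reconcileWake sketches the reordered flow: clear the annotation first,
// then take the fast-path if the VM is already live.
func reconcileWake(p *fakePod, failPatch bool) (string, error) {
	if p.annotations[hibernateAnnotation] == "true" {
		if err := clearHibernate(p, failPatch); err != nil {
			// Assumption: retry instead of reporting Active with residue.
			return "", err
		}
	}
	if p.running && p.vmid != "" {
		return "Active", nil
	}
	return "Waking", nil
}

func main() {
	// Residue case: already-live pod that still carries hibernate=true.
	p := &fakePod{
		annotations: map[string]string{hibernateAnnotation: "true"},
		running:     true,
		vmid:        "vm-1",
	}
	phase, err := reconcileWake(p, false)
	_, residue := p.annotations[hibernateAnnotation]
	fmt.Println(phase, err, residue)
}
```

With the annotation cleared before Active is reported, a later Desire=Hibernate would have to write hibernate=true again, so the patch in step 3 of the scenario is no longer a no-op.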
