hibernation: VMID-clearing is an implicit cross-repo contract with vk-cocoon #3

@CMGS

Summary

hibernation/wake.go:vmClonedAndRunning gates the wake fast-path on IsContainerRunning(pod) && ParseVMRuntime(pod).VMID != "". The correctness of this gate depends on an external invariant: vk-cocoon must clear the vm.cocoonstack.io/id annotation during hibernate, and re-write it only after a successful snapshot clone on wake.

// hibernation/wake.go
// during hibernate, vk-cocoon clears the VMID annotation; on wake it writes
// a new VMID only after the snapshot clone succeeds.
func vmClonedAndRunning(pod *corev1.Pod) bool {
    return meta.IsContainerRunning(pod) && meta.ParseVMRuntime(pod).VMID != ""
}

Problem

This contract is not expressed or enforced in any repo we can verify:

  • cocoon-common/meta/vmruntime.go: VMRuntime.Apply uses setIfNotEmpty — it can only write, never clear.
  • cocoon-common/k8s/utils.go: PatchHibernateState only touches AnnotationHibernate.
  • cocoon-operator: no code path clears VMID.
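To illustrate the first bullet, here is a minimal sketch of write-only semantics. The function body below is an assumption about what setIfNotEmpty does, inferred from its name; the real signature in cocoon-common may differ.

```go
package main

import "fmt"

// setIfNotEmpty is a hypothetical stand-in for the helper named in
// cocoon-common/meta/vmruntime.go. It writes value under key only when
// value is non-empty, so it has no way to remove an existing entry.
func setIfNotEmpty(annotations map[string]string, key, value string) {
	if value != "" {
		annotations[key] = value
	}
}

func main() {
	ann := map[string]string{"vm.cocoonstack.io/id": "vm-123"}

	// Applying an empty VMID is a no-op: the old value stays in place.
	setIfNotEmpty(ann, "vm.cocoonstack.io/id", "")
	fmt.Println(ann["vm.cocoonstack.io/id"]) // prints "vm-123"
}
```

Under this reading, an Apply with an empty VMID field cannot serve as the "clear on hibernate" step.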

So the "clear on hibernate" step must live entirely in vk-cocoon (likely via a direct client-go patch, or via pod recreation). If vk-cocoon ever regresses (a crash mid-hibernate, a partial hibernate, a bug in the clear path), the operator's wake gate silently degrades back to the pre-82a9bc3 race: it effectively gates on IsContainerRunning alone, which can flap during the pod-recreate → wake window.
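For reference, one way a "direct client-go patch" could clear the annotation is a JSON merge patch (RFC 7386), where a null value deletes the key. This is only a sketch of the patch body; whether vk-cocoon actually does this is not confirmed, and clearAnnotationPatch is a hypothetical helper name.

```go
package main

import (
	"encoding/json"
	"fmt"
)

const annotationVMID = "vm.cocoonstack.io/id"

// clearAnnotationPatch builds a JSON merge patch body that removes a single
// annotation. In a merge patch, setting a key to null deletes it.
func clearAnnotationPatch(key string) ([]byte, error) {
	patch := map[string]any{
		"metadata": map[string]any{
			"annotations": map[string]any{key: nil},
		},
	}
	return json.Marshal(patch)
}

func main() {
	body, err := clearAnnotationPatch(annotationVMID)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
	// prints {"metadata":{"annotations":{"vm.cocoonstack.io/id":null}}}
}
```

With client-go, a body like this would be sent through Pods(ns).Patch with types.MergePatchType.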

Why this is worth tracking

  • The assumption is written in a comment but not enforced by a test, schema, or contract.
  • A regression in vk-cocoon would not produce any failing test in cocoon-operator — the gate just silently stops gating.
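The silent degradation can be shown with a simplified model of the gate. podState below is a hypothetical stand-in for the two facts the real gate reads from *corev1.Pod; it is not a type from any of the repos.

```go
package main

import "fmt"

// podState is a simplified, hypothetical stand-in for what the gate reads:
// the result of IsContainerRunning(pod) and ParseVMRuntime(pod).VMID.
type podState struct {
	ContainerRunning bool
	VMID             string
}

// vmClonedAndRunning mirrors the logic of the gate in hibernation/wake.go.
func vmClonedAndRunning(p podState) bool {
	return p.ContainerRunning && p.VMID != ""
}

func main() {
	// Contract honored: VMID was cleared on hibernate, so the gate holds
	// until the snapshot clone writes a new VMID.
	honored := podState{ContainerRunning: true, VMID: ""}

	// Contract broken: a stale VMID survived hibernate. The gate now
	// reduces to ContainerRunning alone, and nothing fails loudly.
	stale := podState{ContainerRunning: true, VMID: "vm-old"}

	fmt.Println(vmClonedAndRunning(honored)) // prints "false"
	fmt.Println(vmClonedAndRunning(stale))   // prints "true"
}
```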

Options

  1. Strong signal: introduce a vm.cocoonstack.io/hibernate-epoch annotation that vk-cocoon increments monotonically, once per hibernate/wake cycle. The operator gates on the epoch advancing, not on VMID presence, so the gate survives transient VMID residue.
  2. Contract test: add an integration / contract test that exercises vk-cocoon's hibernate path and asserts the VMID annotation is absent post-hibernate. Catches vk-cocoon regressions here, not months later in prod.
  3. Defensive clear in operator: have the operator clear VMID itself during reconcileHibernate. Rejected for now — it makes VMID a shared-writer annotation and introduces its own races.
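As a rough sketch of the assertion at the heart of option 2, the check below inspects a pod's annotations after hibernate. checkVMIDCleared is a hypothetical helper; a real contract test would first need to drive vk-cocoon's hibernate path and then fetch the pod, which is not shown here.

```go
package main

import "fmt"

const annotationVMID = "vm.cocoonstack.io/id"

// checkVMIDCleared returns an error when the annotations still carry a
// non-empty VMID after hibernate. It treats a missing key and an empty
// value the same way, matching the gate's VMID != "" comparison.
func checkVMIDCleared(annotations map[string]string) error {
	if id := annotations[annotationVMID]; id != "" {
		return fmt.Errorf("contract violation: %s=%q still set after hibernate", annotationVMID, id)
	}
	return nil
}

func main() {
	// Post-hibernate annotations as the contract expects them.
	fmt.Println(checkVMIDCleared(map[string]string{}))

	// Post-hibernate annotations after a regression in the clear path.
	fmt.Println(checkVMIDCleared(map[string]string{annotationVMID: "vm-old"}))
}
```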

Option 1 is the cleanest long-term fix but requires a coordinated change across vk-cocoon, cocoon-common (annotation constant), and cocoon-operator.
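A minimal sketch of what the operator side of option 1 might look like, assuming the epoch is stored as a base-10 unsigned integer and the operator records the last epoch it observed (lastSeen). Both the encoding and the function name epochAdvanced are assumptions for illustration.

```go
package main

import (
	"fmt"
	"strconv"
)

// Proposed annotation from option 1; it does not exist in cocoon-common yet.
const annotationHibernateEpoch = "vm.cocoonstack.io/hibernate-epoch"

// epochAdvanced reports whether the pod's epoch is strictly greater than the
// last epoch the operator observed. A missing annotation counts as "not
// advanced"; a malformed value is returned as an error instead of guessed at.
func epochAdvanced(annotations map[string]string, lastSeen uint64) (bool, error) {
	raw, ok := annotations[annotationHibernateEpoch]
	if !ok {
		return false, nil
	}
	epoch, err := strconv.ParseUint(raw, 10, 64)
	if err != nil {
		return false, fmt.Errorf("invalid %s %q: %w", annotationHibernateEpoch, raw, err)
	}
	return epoch > lastSeen, nil
}

func main() {
	ann := map[string]string{annotationHibernateEpoch: "4"}

	advanced, err := epochAdvanced(ann, 3)
	fmt.Println(advanced, err) // prints "true <nil>"

	advanced, err = epochAdvanced(ann, 4)
	fmt.Println(advanced, err) // prints "false <nil>"
}
```

A gate built this way does not depend on the VMID annotation at all, so a leftover VMID from a failed clear would not open it.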

Notes

  • Surfaced during a /code review of HEAD~3..HEAD; deferred out of scope because the fix is cross-repo.
