Skip to content

Chore/remove cdi#2394

Draft
Isteb4k wants to merge 65 commits into
mainfrom
chore/remove-cdi
Draft

Chore/remove cdi#2394
Isteb4k wants to merge 65 commits into
mainfrom
chore/remove-cdi

Conversation

@Isteb4k

@Isteb4k Isteb4k commented May 22, 2026

Copy link
Copy Markdown
Contributor

Description

Why do we need it, and what problem does it solve?

What is the expected result?

Checklist

  • The code is covered by unit tests.
  • e2e tests passed.
  • Documentation updated according to the changes.
  • Changes were tested in the Kubernetes cluster manually.

Changelog entries

section: core
type: fix
summary: TODO

@Isteb4k Isteb4k added this to the v1.9.0 milestone May 22, 2026
Isteb4k and others added 5 commits May 22, 2026 13:12
…f config

The `importPackageImages` template emits entries that belong to the `import:`
section (they reference other images, not git sources). They were being
appended directly after the `git:` block, producing config where each
`- image:` / `before:` entry was parsed as an extra field of the git mount,
causing werf to reject it with `unknown fields: image, before`.

Co-authored-by: Cursor <cursoragent@cursor.com>
…ixture

After the CDI-as-operator removal, the disk importer is provided by the
standalone `virtual-disk-importer` image. Its camelCase name
`virtualDiskImporter` is referenced from
`templates/virtualization-controller/_helpers.tpl` via
`helm_lib_module_image`, so the kubeconform fixture must define a digest
for it. Without it, `helm template` fails with
`Image virtualization.virtualDiskImporter has no digest`.

Co-authored-by: Cursor <cursoragent@cursor.com>
…DI cleanup

The CDI removal commit added a `replace` directive in
images/virtualization-artifact/go.mod that pointed at a vendored copy of
the upstream API package under
images/cdi-artifact/containerized-data-importer/staging/..., but the
follow-up cleanup removed that entire `staging/` tree. As a result
`go mod download` (and every downstream `go build`/`go test`) failed
with: open .../staging/.../containerized-data-importer-api/go.mod: no
such file or directory.

Drop the dangling local replace so we resolve
kubevirt.io/containerized-data-importer-api v1.63.1 from the Go module
proxy, and restore the `fmt` import in cmd/virtualization-controller
that was accidentally dropped alongside the temporary debug print
removed by the same commit (the file still uses fmt.Sprintf).

Co-authored-by: Cursor <cursoragent@cursor.com>
@Isteb4k Isteb4k self-assigned this May 22, 2026
Isteb4k and others added 20 commits May 22, 2026 14:59
The changes that removed CDI and refactored VirtualDisk data sources to
the step-based pattern left a number of issues that golangci-lint v2.11
catches. Fix them in one pass:

- Drop unused helpers in pkg/controller/vd/internal/source/sources.go
  (setPhaseCondition*, getNodePlacement, retryPeriod,
  setQuotaExceededPhaseCondition, isStorageClassWFFC, the unused
  SupplementsCleaner / CleanUpSupplements / Cleaner). The step pattern
  handles these cases internally now.
- Drop the unused DiskService.isImmediateBindingMode helper.
- Auto-fix the gci/gofumpt/QF1008 issues in service/errors.go,
  storageprofile/storageprofile_controller.go, and the new VirtualDisk
  source tests.
- Replace the deprecated reconcile.Result.Requeue with RequeueAfter in
  Create{Importer,Uploader}Step, WaitForPVCImportStep, and the matching
  *_test.go assertions.
- Simplify VirtualImage reconcilePVCImportFromDVCR (its bool-result was
  always true) and inline the call sites in vi http/registry/upload.
- Cleanup gocritic findings in vd internal watchers (unlambda Pod
  watcher map func, singleCaseSwitch -> if in PVC watcher).
- test/e2e: drop the dangling local replace of
  containerized-data-importer-api (the staging tree was removed) so
  typecheck succeeds, and lowercase a few error strings in observer
  predicates flagged by ST1005.

Co-authored-by: Cursor <cursoragent@cursor.com>
…ub to drop CVEs

Trivy reports five github.com/docker/docker CVEs against the
cdi-importer binary (CVE-2026-34040, CVE-2026-33997, CVE-2026-41567,
CVE-2026-42306, CVE-2026-41568). docker/docker is only pulled into CDI
transitively via github.com/containers/image/v5/manifest, which uses
exactly one symbol — github.com/docker/docker/api/types/versions — for
manifest version comparison.

Mirror the workaround already in place in deckhouse/3p-containerized-data-importer:
keep a minimal staging/src/github.com/docker/docker stub that only
provides api/types/versions and an empty registry package, and replace
the upstream module with it. go mod tidy correspondingly prunes the
otel/grpc/genproto/containerd transitive dependencies that were only
brought in by the full docker/docker module.

Co-authored-by: Cursor <cursoragent@cursor.com>
The mocks regenerated in the previous chat were produced with a
different moq version that aliased k8s.io/api/storage/v1 as
storagev1. moq v0.5.3 (pinned in Taskfile.init.yaml) leaves the import
unaliased and uses *v1.StorageClass throughout, which is what the
"check auto-generated files are up-to-date" CI step expects. Run
`task controller:dev:gogenerate` and commit the resulting diff.

Co-authored-by: Cursor <cursoragent@cursor.com>
Move the vendored CDI importer code from
images/cdi-artifact/containerized-data-importer/ one level up to
images/cdi-artifact/ to remove the redundant nested directory. The
former unpack-bundle.sh and .gitignore tailored for the wrapper layout
are no longer needed. Werf and mount-point paths are updated to point
at the flattened tree.

Co-authored-by: Cursor <cursoragent@cursor.com>
cdi-importer now emits kubevirt_cdi_import_progress_total covering
both internal phases: TransferScratch (download from DVCR to the
scratch PVC) maps to 0-49% and qemu-img Convert (scratch to target
PVC) maps to 50-100%.

The virtualization-controller picks up both registry_progress and
kubevirt_cdi_import_progress_total metrics and applies a per-source
scale when computing vd/vi progress:

  - ObjectRef CVI/VI imports (cdi-importer is the sole pipeline)
    surface raw 0-100%.
  - HTTP, Registry and Upload imports (DVCR-fed) surface cdi-importer
    progress as 50-100%, after the dvcr-importer/uploader phase has
    already filled 0-50%.

WaitForPVCImportStep now requeues every 2 seconds while an import is
in progress so disk/image status reflects up-to-date percentages.

Co-authored-by: Cursor <cursoragent@cursor.com>
After the importer pod name was shortened to "d8v-<prefix>-importer-<uid>"
both the dvcr-importer Pod (started by importer_service.go for HTTP,
Registry and Upload sources) and the cdi-importer Pod (started by
disk_import_service.go to copy from DVCR to the target PVC) ended up
sharing the same name via sup.ImporterPod().

The collision caused EnsureSupplementPVCImport to fetch the already
Completed dvcr-importer Pod, treat it as the cdi-importer Pod, observe
phase Succeeded and call cleanupPVCImport on it. cleanupPVCImport
deletes the pod by name, but the dvcr-importer Pod still carries the
vi-/vd-protection finalizer, so it was stuck in Terminating and the
controller looped on "Waiting for supplements to be terminated".

Introduce a separate name template ("d8v-<prefix>-pvc-importer-<uid>")
and a new Generator.PVCImporterPod() helper. Switch the disk-import
service to use it so the two phases never share a Pod name.

Co-authored-by: Cursor <cursoragent@cursor.com>
DiskService.CleanUpSupplements built a bare target PVC (no
annotations) and passed it to cleanupPVCImport. cleanupPVCImport
read the pod name from target.Annotations[AnnPVCImportPod] and fell
back to target.Name when the annotation was empty - i.e. it tried
to delete a pod sharing the target PVC's name. The actual
cdi-importer pod is named d8v-<prefix>-pvc-importer-<uid>, so the
delete was a silent no-op and the pod survived VD cleanup.

The orphan pod kept its volume mount on the target PVC, which
prevented kubernetes.io/pvc-protection from finalising the PVC and
in turn blocked the vd-cleanup finalizer, leaving deleted
VirtualDisks stuck in Terminating forever.

Thread the supplements.Generator through cleanupPVCImport and use
sup.PVCImporterPod() as the fallback pod name so cleanup targets
the correct pod whether or not the target PVC carries the
AnnPVCImportPod annotation.

Co-authored-by: Cursor <cursoragent@cursor.com>
…rvice.Protect

The OwnerReference and the {vd,vi}-protection finalizer that disk import PVCs
need to participate in the controller-driven cleanup must be present from the
moment the PVC is created. Previously the finalizer was added later via
DiskService.Protect, opening a window where a PVC could be deleted before the
controller had stamped its protection on it.

Move the finalizer next to the OwnerReference inside StartPVCImport,
StartSupplementPVCImport and makePVCCloneTarget so every disk import target PVC
carries them at creation. Drop DiskService.Protect entirely along with all of
its callers in the VD ready step and the VI source handlers (http, registry,
upload, object_ref*, sources). Regenerate the VD source mocks and update the
VD source unit tests so they stop wiring the now-removed ProtectFunc.

Co-authored-by: Cursor <cursoragent@cursor.com>
The dvcr-importer and uploader pods are already created with the
resource-quota-overrides.deckhouse.io/ignore=true label so they don't get
charged against namespace quotas. Apply the same label, at creation time, to
the scratch PVC and the cdi-importer pod that DiskService spawns to populate
target PVCs.

Co-authored-by: Cursor <cursoragent@cursor.com>
Introduce PersistentVolumeClaimService as the single entry point for filling
a target PVC. Its Import method picks the strategy (smart clone via
VolumeSnapshot, CSI clone via dataSource, or host-assisted via cdi-importer
pod) and provisions every helper resource (scratch PVC, cdi-importer pod,
DVCR auth/CA copies, VolumeSnapshot, etc.) with ownerRef, finalizer, and
the resource-quota-overrides.deckhouse.io/ignore label set at creation.

DiskService is slimmed down: StartPVCImport/EnsurePVCImport and clone
helpers are removed and replaced by disk.PersistentVolumeClaim().Import.
The VirtualDisk reconciler step builds the target PVC descriptor and hands
it off to the service; data sources (HTTP, Registry, Upload, ObjectRef
CVI/VI) and the VirtualImage source plumbing are updated accordingly.
Tests and mocks are regenerated to match the new contract.

Co-authored-by: Cursor <cursoragent@cursor.com>
Isteb4k and others added 30 commits June 17, 2026 18:40
After the CDI removal, the kubevirt-operator ServiceAccount lost access to the
cdi.internal.virtualization.deckhouse.io group, but the embedded KubeVirt install
strategy still generates the virt-api and virt-controller ClusterRoles with CDI
rules (cdi.kubevirt.io, rewritten to cdi.internal.virtualization.deckhouse.io).
The API server's privilege-escalation guard therefore blocked virt-operator from
creating those ClusterRoles, leaving the KubeVirt CR stuck in Deploying/Degraded
and virt-api/virt-controller/virt-handler never deployed.

Grant the operator ClusterRole '*'/'*' on cdi.internal.virtualization.deckhouse.io
(consistent with the other *.internal.virtualization.deckhouse.io grants) so the
KubeVirt control plane can be deployed again.

Co-authored-by: Cursor <cursoragent@cursor.com>
Controller:
- monitoring: raise the import-pod metrics scrape timeout so the
  controller reads the live progress metric instead of only learning
  the final value when the pod completes (progress jumping 0%->50%->100%);
- uploader step: drop the redundant IsUploadStarted scrape that could
  bail out before reporting progress.

E2E:
- add the HaveTimelyProgress observer invariant for VirtualDisk and
  VirtualImage: while provisioning, progress must update at least every
  10s, otherwise the test fails;
- boot VirtualMachines from CDROM/ISO VirtualImages (and fall back to a
  VirtualDisk for non-CDROM images) to avoid the non-CDROM first-device
  admission webhook rejection.

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
- percent: render 0% and 100% without a fraction and everything else with
  one decimal; emit "50.0%" at the DVCR->PVC hand-off instead of "50%" so
  the reported progress stops flip-flopping between "50%" and "50.0%".
- e2e: HaveTimelyProgress now requires progress to advance at least every
  10s, allowing only the 0%/50% stage boundaries to pause (up to a minute),
  and rejects stage jumps (0%->>=50%, 50%->100%) so progress must stream
  intermediate percentages instead of only 0%/50%/100%.

Co-authored-by: Cursor <cursoragent@cursor.com>
Controller:
- pin the import prime PVC (and importer pod) to the consuming VM's node so a
  WaitForFirstConsumer node-local volume is provisioned where the VM runs,
  fixing the hotplug pod's "PV node affinity doesn't match node" failure
- always expose a progress percentage while a VirtualDisk/VirtualImage is
  Provisioning (default 0%), and force 0% for WaitForUserUpload
- surface a target-PVC quota rejection on the Ready condition (Pending/
  QuotaExceeded) instead of returning a reconciler error

E2E:
- model block-device tests around a WaitForFirstConsumer storage class for the
  scenario's main resources and an Immediate one for dependencies
- split the combined precheck into wffc, immediate and same-CSI-driver prechecks
- replace BeReadyOrWaitForFirstConsumer with a per-disk BeWaitForFirstConsumer
  (WFFC) / BeReady (Immediate) wait
- split VirtualImageCreation DVCR/PVC specs and share dependencies via BeforeAll
- assert StorageProfile deletion and add a HaveProgressWhileProvisioning invariant

Co-authored-by: Cursor <cursoragent@cursor.com>
…eck rename

Controller:
- report import progress monotonically: an importer pod restart resets its
  kubevirt_cdi_import_progress_total metric to 0, which made progress jump
  backwards (e.g. 46.4% -> 0%); GetProgress now never returns a value lower
  than the one already published
- surface a "namespace is terminating" create rejection (Project/namespace
  cleanup) on the VirtualDisk/VirtualImage Ready condition instead of failing
  the reconcile; add common.ErrNamespaceTerminating and unit tests

E2E:
- wire HaveTimelyProgress into the Upload VirtualDisk spec so all creation
  scenarios verify intermediate progress values
- rename precheck labels to kebab-case abbreviations (wffc-sc-precheck,
  immediate-sc-precheck, same-csi-sc-precheck, different-csi-sc-precheck,
  default-sc-precheck, precreated-cvi-precheck, post-cleanup-precheck,
  vmclass-precheck)

Co-authored-by: Cursor <cursoragent@cursor.com>
…, bounder quota exclude

* normalizeProgress on VI/VD/CVI clears Status.Progress for ""/Pending so a
  source path that optimistically sets "0%" before deciding to advance the
  phase no longer surfaces as an intermediate "Pending 0%" status; VD's
  WaitForFirstConsumer keeps its in-flight progress (it is an
  intra-Provisioning pause, not a reset).
* monitoring.GetProgressReportFromURL caps scrape attempts at 1.5s and stops
  silently swallowing timeouts, so a busy importer /metrics endpoint cannot
  freeze Status.Progress for the full 5s scrape budget.
* bounder Pod gets the resource-quota-overrides.deckhouse.io/ignore label,
  matching the other importer/uploader pods, so the WFFC host-binder is no
  longer billed against the project quota.

Co-authored-by: Cursor <cursoragent@cursor.com>
…gress predicates

* Observer interface exposes InvariantViolated() and Stopped() channels;
  WaitFor now rechecks Err() after a satisfying event so an Always/Never
  violation that latched on the same event takes precedence over a
  successful predicate match.
* StartObserver wrappers (vd, vi, cvi, vdsnapshot, vm) launch a watcher
  goroutine that calls Ginkgo's Fail with defer GinkgoRecover() the moment
  the first invariant fires, instead of letting the breach surface 30s+
  later in DeferCleanup while WaitFor blocks on an unrelated condition.
* HaveNoProgressBeforeProvisioning (VD, VI) flags any non-empty
  Status.Progress while the resource is still in ""/Pending, catching the
  "Pending 0%" intermediate state.
* HaveTimelyProgress now also evaluates the stage-jump rule on entry to
  Ready, so a controller leap of 0% -> 100% across the Provisioning ->
  Ready boundary is no longer hidden by the predicate stopping its tracking
  one event too early.
* Wire the new no-progress invariant into createVirtualImageAndWait /
  uploadVirtualImageAndWait and into startVirtualDisk / the VD upload spec.

Co-authored-by: Cursor <cursoragent@cursor.com>
…al size

A VirtualImage imported to a PVC never streamed intermediate progress: the
controller hardcoded 0%/50% during provisioning and jumped straight to 100%,
violating the e2e progress invariants. Refresh Status.Progress from the
pvc-importer pod metric while the import is in flight (mirroring the VD
controller), scaling into the 50..100 slice for the DVCR-sourced path.

The VI-on-PVC -> DVCR path also under-reported the produced image size: it
copied the source VI's nominal size instead of the importer's measured virtual
size. A raw block-device source exposes the full, possibly over-allocated
device size, so the converted image is larger than the nominal size and any
downstream PVC sized from it failed with "virtual image size ... is larger
than the reported available storage". Report the actual size via
GetSize(pod) (as the VD path already does) and align image-derived required
PVC sizes up to the block boundary, matching CDI's GetRequiredSpace.

Co-authored-by: Cursor <cursoragent@cursor.com>
Validate blockdevice progress against source-specific coverage rules so regressions, invalid phases, and stale updates fail immediately.

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Keep progress empty in Pending states and stream importer progress through VD/VI handlers so e2e observer expectations match controller behavior.

Co-authored-by: Cursor <cursoragent@cursor.com>
Keep VI provisioning progress below 100% until Ready and update e2e progress expectations to match the source/storage contract.

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Create explicit PVC target creation paths for VD/VI flows and move population work into a dedicated PVC populator controller.

Co-authored-by: Cursor <cursoragent@cursor.com>
Keep the PVC population changes lint-clean by applying gofumpt formatting and removing an unused e2e helper option.

Co-authored-by: Cursor <cursoragent@cursor.com>
Use watch-based observers for populator PVC and Pod waits so failed states abort immediately, and let standalone PVC population own its helper resources directly.

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Use a separate source and target importer for host-assigned PVC clones so SDS replicated PVC-to-PVC copies avoid the broken snapshot restore path while keeping cleanup and progress reporting stable.

Co-authored-by: Cursor <cursoragent@cursor.com>
Move the SDS Replicated host-assisted override into StorageProfile status,
allow controller metrics scrape on pvc-importer pods, and start host-assigned
source/target importers in parallel via an NBD Service endpoint.

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Replace StorageClass annotations with WFFC_STORAGE_CLASS and
IMMEDIATE_STORAGE_CLASS, auto-detecting sibling classes from the default
StorageClass when the env vars are unset.

Co-authored-by: Cursor <cursoragent@cursor.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants