feat(vm): add GPU DRA resource claim support #2520
Draft
danilrwx wants to merge 10 commits into
Signed-off-by: Daniil Antoshin <daniil.antoshin@flant.com>
22e85a0 to 3925492
Drop the req- prefix and the -template suffix from generated DRA names and align the device request name with the resource claim name (gpu-<device>). The template name becomes <vm>-<device>.

This raises the user-facing gpuDevices[].name MaxLength from 55 to 59. The previous 55-character limit was dictated by the req-gpu- prefix (8 characters), which left no headroom against the 63-character DNS label limit; the gpu- prefix takes only 4 characters, leaving 59.

Signed-off-by: Daniil Antoshin <daniil.antoshin@flant.com>
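The length arithmetic in this commit message can be checked with a minimal sketch. The helper `maxUserNameLen` and the constant `dnsLabelMaxLen` are hypothetical names chosen for illustration and are not taken from the PR's code.

```go
package main

import "fmt"

// dnsLabelMaxLen is the DNS label length limit that generated names must respect.
const dnsLabelMaxLen = 63

// maxUserNameLen (hypothetical helper) returns how many characters remain for
// the user-supplied device name once a fixed prefix is prepended to it.
func maxUserNameLen(prefix string) int {
	return dnsLabelMaxLen - len(prefix)
}

func main() {
	fmt.Println(maxUserNameLen("req-gpu-")) // old prefix, 8 characters
	fmt.Println(maxUserNameLen("gpu-"))     // new prefix, 4 characters
}
```

With the old prefix the budget is 55 characters and with the new prefix it is 59, matching the MaxLength values quoted above.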
Description
Add support for attaching physical GPU devices to virtual machines via Kubernetes DRA (Dynamic Resource Allocation).
A new `spec.gpuDevices` field lets a user request a GPU by product model. The virtualization controller generates a DRA `ResourceClaimTemplate` per device with a CEL selector matching the requested `productName`, a `physical` device type, and the absence of a sharing strategy. The kvbuilder renders the corresponding DRA resource claims and GPU devices into the KubeVirt `VirtualMachine`.

Key pieces:
- `VirtualMachine.spec.gpuDevices[]` (`name`, `model`), `MaxItems: 16`, `+listType=map` keyed by `name`.
- `GPU` feature gate (alpha, locked off in CE).
- `GPUResourceClaimHandler` creates/updates/deletes owned `ResourceClaimTemplate`s and cleans up orphans; a `ResourceClaimTemplate` watcher enqueues the owning VM.
- `GPUDevicesValidator` rejects GPU devices unless the `GPU` feature gate is enabled and the `gpu.deckhouse.io` `DeviceClass` exists.
- `vmchange` comparator marks GPU changes as requiring restart (`AwaitingRestartToApplyConfiguration`).
- Naming: resource claim name = `gpu-<name>`, device request name = `gpu-<name>`, `ResourceClaimTemplate` = `<vm>-<name>`.

Depends on deckhouse/3p-kubevirt#130 for KubeVirt to recognize the Deckhouse GPU DRA attributes.
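The naming scheme and the selector described above could be sketched roughly as follows. This is an illustration under assumptions, not the PR's implementation: the function names are hypothetical, the attribute domain is assumed to equal the `DeviceClass` name, and the attribute keys `type` and `sharingStrategy` are guesses; only `productName` and the `physical` value are named in the description.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// attrDomain is the DRA attribute domain. Assumption: it matches the
// DeviceClass name mentioned in the description.
const attrDomain = "gpu.deckhouse.io"

// claimName returns the name shared by the resource claim and the device request.
func claimName(device string) string { return "gpu-" + device }

// templateName returns the ResourceClaimTemplate name for one device of a VM.
func templateName(vm, device string) string { return vm + "-" + device }

// gpuSelector builds a CEL expression that matches the requested product name,
// a physical device type, and the absence of a sharing strategy.
// The keys "type" and "sharingStrategy" are hypothetical.
func gpuSelector(model string) string {
	attrs := fmt.Sprintf("device.attributes[%q]", attrDomain)
	return strings.Join([]string{
		attrs + ".productName == " + strconv.Quote(model),
		attrs + `.type == "physical"`,
		`!("sharingStrategy" in ` + attrs + `)`,
	}, " && ")
}

func main() {
	fmt.Println(claimName("render"))             // gpu-render
	fmt.Println(templateName("my-vm", "render")) // my-vm-render
	fmt.Println(gpuSelector("NVIDIA H100"))
}
```

Quoting the model with `strconv.Quote` keeps a user-supplied string from breaking out of the CEL string literal for ordinary ASCII input; a real implementation would need to confirm that the escaping rules agree for every accepted value.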
Why do we need it, and what problem does it solve?
Users running ML/rendering workloads need to attach a physical GPU to a VM. Today there is no way to do this through the `VirtualMachine` API.

A model-based request keeps VM manifests portable: the user asks for a GPU class (e.g. `NVIDIA H100`) and lets DRA and the scheduler pick a concrete, exclusive, passthrough-capable device on a suitable node, instead of pinning the VM to a specific node, PCI address, or GPU UUID.

What is the expected result?
1. Enable the `GPU` feature gate in `ModuleConfig virtualization` and ensure a GPU DRA provider and the `gpu.deckhouse.io` `DeviceClass` are installed.
2. Create a VM with `spec.gpuDevices` populated.
3. The controller creates a `ResourceClaimTemplate` and the VM schedules on a node with a matching exclusive physical GPU.
4. Changing `spec.gpuDevices` on a running VM sets `AwaitingRestartToApplyConfiguration` and applies only after a restart.

Checklist
Changelog entries