[Draft]Dra#8671
Open
runzhen wants to merge 2 commits into
Pull request overview
This draft PR appears to introduce an experimental “Managed GPU Experience via DRA” path for Ubuntu 24.04 GPU nodes by adding a new cached component (dra-driver-nvidia-gpu), switching CSE startup logic between nvidia-device-plugin and dra-driver-nvidia-gpu, and adding a new bootstrapping config flag.
Changes:
- Add a new cached component (dra-driver-nvidia-gpu) to VHD build + components manifest.
- Add a DRA toggle path in Linux CSE to start dra-driver-nvidia-gpu instead of nvidia-device-plugin.
- Add a new e2e scenario stub intended to validate the DRA driver on Ubuntu 24.04.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| vhdbuilder/packer/install-dependencies.sh | Adds VHD build-time caching for dra-driver-nvidia-gpu tarball. |
| parts/common/components.json | Registers new dra-driver-nvidia-gpu component and its download URL for Ubuntu 24.04. |
| parts/linux/cloud-init/artifacts/cse_main.sh | Introduces ENABLE_MANAGED_GPU_EXPERIENCE_DRA derived from ENABLE_MANAGED_GPU_DRA. |
| parts/linux/cloud-init/artifacts/cse_config.sh | Switches managed GPU service startup to device-plugin vs DRA driver; adds DRA systemd override. |
| parts/linux/cloud-init/artifacts/ubuntu/cse_install_ubuntu.sh | Adds a DRA-specific “package list” and attempts to install it from cache. |
| aks-node-controller/parser/parser.go | Hard-codes ENABLE_MANAGED_GPU_DRA=true for scriptless-NBC env generation. |
| pkg/agent/datamodel/types.go | Adds EnableManagedGPUDRA to NBC. |
| pkg/agent/baker.go | Exposes IsEnableManagedGPUDRA to templates via func map. |
| e2e/scenario_gpu_managed_experience_test.go | Adds a new Ubuntu 24.04 DRA scenario (currently without assertions). |
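The table describes ENABLE_MANAGED_GPU_EXPERIENCE_DRA as derived from ENABLE_MANAGED_GPU_DRA, but the derivation itself is not quoted on this page. A minimal sketch of one defensive way to do it, assuming the input may be unset or differently cased; `normalize_bool` is a hypothetical helper, not a function from the PR:

```shell
# Hypothetical helper (not from the PR): map any input to exactly "true" or "false".
# Only a case-insensitive "true" enables the flag; unset or unexpected values disable it.
normalize_bool() {
    case "$(printf '%s' "${1:-}" | tr '[:upper:]' '[:lower:]')" in
        true) printf 'true\n' ;;
        *) printf 'false\n' ;;
    esac
}

# Derive the CSE-internal flag from the bootstrap-provided variable.
ENABLE_MANAGED_GPU_EXPERIENCE_DRA="$(normalize_bool "${ENABLE_MANAGED_GPU_DRA:-}")"
echo "ENABLE_MANAGED_GPU_EXPERIENCE_DRA=${ENABLE_MANAGED_GPU_EXPERIENCE_DRA}"
```

Normalizing once means later checks such as `[ "${ENABLE_MANAGED_GPU_EXPERIENCE_DRA}" = "true" ]` never see an empty value.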
Comment on lines +207 to 215

    packageList=$(managedGPUPackageList)

    if [ ${ENABLE_MANAGED_GPU_EXPERIENCE_DRA} = "true" ]; then
        packageList=$(managedGPUPackageListDRA)
        echo "DRA is enabled, using DRA-specific package list."
    fi

    for packageName in $(packageList); do
        downloadDir="/opt/${packageName}/downloads"
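In the quoted hunk, `packageList` is assigned as a variable but then iterated with `$(packageList)`, which is command substitution and would try to run a command named `packageList`. The unquoted `[ ${ENABLE_MANAGED_GPU_EXPERIENCE_DRA} = "true" ]` test also errors when the variable is empty or unset. A minimal sketch of one way to write the selection; `selectManagedGPUPackages` is a hypothetical name, and the non-DRA list contents are a placeholder because the real list is not shown here:

```shell
# Stand-in for the PR's non-DRA list; the real contents are not quoted on this page.
managedGPUPackageList() {
    printf '%s\n' placeholder-device-plugin-package
}

# Package names copied from the quoted managedGPUPackageListDRA hunk.
managedGPUPackageListDRA() {
    printf '%s\n' dra-driver-nvidia-gpu datacenter-gpu-manager-4-core \
        datacenter-gpu-manager-4-proprietary dcgm-exporter
}

# Hypothetical wrapper: pick the list based on the (quoted, defaulted) DRA flag.
selectManagedGPUPackages() {
    if [ "${ENABLE_MANAGED_GPU_EXPERIENCE_DRA:-false}" = "true" ]; then
        managedGPUPackageListDRA
    else
        managedGPUPackageList
    fi
}

# Expand the variable with ${...}; $(packageList) would run a command instead.
packageList="$(selectManagedGPUPackages)"
for packageName in ${packageList}; do
    downloadDir="/opt/${packageName}/downloads"
    echo "would install ${packageName} from ${downloadDir}"
done
```

The loop relies on word splitting of `${packageList}`, which is safe here only because package names contain no whitespace.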
Comment on lines +181 to +187

    managedGPUPackageListDRA() {
        packages=(
            dra-driver-nvidia-gpu
            datacenter-gpu-manager-4-core
            datacenter-gpu-manager-4-proprietary
            dcgm-exporter
        )
Comment on lines 1606 to 1609

    # installed on a previous CSE run. Stop them if they exist.
    logs_to_events "AKS.CSE.stop.nvidia-device-plugin" "systemctlDisableAndStop nvidia-device-plugin"
    logs_to_events "AKS.CSE.stop.nvidia-device-plugin" "systemctlDisableAndStop dra-driver-nvidia-gpu"
    logs_to_events "AKS.CSE.stop.nvidia-dcgm" "systemctlDisableAndStop nvidia-dcgm"
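The quoted hunk reuses the event name AKS.CSE.stop.nvidia-device-plugin for the dra-driver-nvidia-gpu stop call, so the two stops would be indistinguishable in the event log. A minimal sketch of deriving the event name from the service name instead; `logs_to_events` is stubbed here so the sketch runs outside a node, and `stopManagedGPUServices` is a hypothetical wrapper name:

```shell
# Stub for illustration only: the real helper is assumed to take an event name
# and a command string, as in the quoted hunk. This one just prints both.
logs_to_events() {
    echo "event=$1 cmd=$2"
}

# Hypothetical wrapper: one loop, one event name per service, no copy-paste drift.
stopManagedGPUServices() {
    for service in nvidia-device-plugin dra-driver-nvidia-gpu nvidia-dcgm nvidia-dcgm-exporter; do
        logs_to_events "AKS.CSE.stop.${service}" "systemctlDisableAndStop ${service}"
    done
}

stopManagedGPUServices
```

Because the event name is built from the same variable as the command, adding a service to the list cannot leave a stale event name behind.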
Comment on lines +1725 to +1727

    # Reload systemd to pick up the override
    systemctl daemon-reload
    logs_to_events "AKS.CSE.start.dra-driver-nvidia-gpu" "systemctlEnableAndStart dra-driver-nvidia-gpu 30" || exit $ERR_DRA_DRIVER_START_FAIL
Comment on lines 1596 to +1621

    @@ -1605,13 +1605,67 @@ configureManagedGPUExperience() {
            # EnableManagedGPUExperience is mutable, so services may have been
            # installed on a previous CSE run. Stop them if they exist.
            logs_to_events "AKS.CSE.stop.nvidia-device-plugin" "systemctlDisableAndStop nvidia-device-plugin"
            logs_to_events "AKS.CSE.stop.nvidia-device-plugin" "systemctlDisableAndStop dra-driver-nvidia-gpu"
            logs_to_events "AKS.CSE.stop.nvidia-dcgm" "systemctlDisableAndStop nvidia-dcgm"
            logs_to_events "AKS.CSE.stop.nvidia-dcgm-exporter" "systemctlDisableAndStop nvidia-dcgm-exporter"
            rm -f "${managed_gpu_marker}"
        fi
    }

    startNvidiaManagedExpServices() {
        # 1. Start device plugin or DRA driver
        if [ "${ENABLE_MANAGED_GPU_EXPERIENCE_DRA}" = "true" ]; then
            startDRADriverNvidiaGpu
        else
            startNvidiaDevicePlugin
        fi
Comment on lines 194 to 199

        "SERVICE_ACCOUNT_IMAGE_PULL_DEFAULT_TENANT_ID": config.GetServiceAccountImagePullProfile().GetDefaultTenantId(),
        "IDENTITY_BINDINGS_LOCAL_AUTHORITY_SNI": config.GetServiceAccountImagePullProfile().GetLocalAuthoritySni(),
        "CSE_TIMEOUT": getCSETimeout(config),
        "SKIP_WAAGENT_HOLD": "true",
        "ENABLE_MANAGED_GPU_DRA": "true",
    }
Comment on lines 1321 to +1326

    "IsEnableManagedGPU": func() bool {
        return config.EnableManagedGPU
    },
    "IsEnableManagedGPUDRA": func() bool {
        return config.EnableManagedGPUDRA
    },
Comment on lines +718 to +744

    func Test_Ubuntu2404_NvidiaDraDriverRunning(t *testing.T) {
        RunScenario(t, &Scenario{
            Description: "Tests that NVIDIA DRA driver is running & functional on Ubuntu 24.04 GPU nodes",
            Tags: Tags{
                GPU: true,
            },
            Config: Config{
                Cluster: ClusterKubenet,
                VHD: config.VHDUbuntu2404Gen2Containerd,
                BootstrapConfigMutator: func(_ *Cluster, nbc *datamodel.NodeBootstrappingConfiguration) {
                    nbc.AgentPoolProfile.VMSize = "Standard_NV6ads_A10_v5"
                    nbc.ConfigGPUDriverIfNeeded = true
                    nbc.EnableNvidia = true
                },
                VMConfigMutator: func(vmss *armcompute.VirtualMachineScaleSet) {
                    vmss.SKU.Name = to.Ptr("Standard_NV6ads_A10_v5")
                    if vmss.Tags == nil {
                        vmss.Tags = map[string]*string{}
                    }

                    // Enable the AKS VM extension for GPU nodes
                    extension, err := createVMExtensionLinuxAKSNode(t.Context(), vmss.Location)
                    require.NoError(t, err, "creating AKS VM extension")
                    vmss.Properties = addVMExtensionToVMSS(vmss.Properties, extension)
                },
            },
        })
Comment on lines +1717 to 1722

    # Configure with pass-device-specs for non-MIG nodes
    tee "${DRA_DRIVER_OVERRIDE_DIR}/10-dra-driver-nvidia-gpu.conf" > /dev/null <<'EOF'
    [Service]
    # Remove file-based logging - let systemd handle logs
    StandardOutput=journal
    StandardError=journal
    # Change default port from 9400 to 19400 so that it does not conflict with user installed dcgm-exporter
    ExecStart=
    ExecStart=/usr/bin/dcgm-exporter -f /etc/dcgm-exporter/default-counters.csv --address ":19400"
    ExecStart=/usr/bin/gpu-kubelet-plugin --kubeconfig /var/lib/kubelet/kubeconfig --container-driver-root / --image-name nvcr.io/nvidia/k8s-dra-driver-gpu:v25.8.1 --node-name=${NODE_NAME}
    EOF
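The quoted drop-in resets ExecStart= and then sets it twice. systemd accepts more than one ExecStart= command only for Type=oneshot services, and the dcgm-exporter line together with its port comment appears to be carried over from another unit. A minimal sketch of a drop-in with a single ExecStart=, reusing the gpu-kubelet-plugin command line exactly as quoted; `writeDRADriverOverride` is a hypothetical name, and the sketch writes to a temporary directory rather than a real systemd path:

```shell
# Hypothetical helper: write the drop-in into the directory given as $1.
writeDRADriverOverride() {
    override_dir="$1"
    mkdir -p "${override_dir}"
    # The quoted 'EOF' keeps ${NODE_NAME} literal in the file, so the shell does
    # not expand it; systemd would substitute it from the unit's environment.
    cat > "${override_dir}/10-dra-driver-nvidia-gpu.conf" <<'EOF'
[Service]
# Send logs to the journal instead of files
StandardOutput=journal
StandardError=journal
# Clear the packaged command, then set exactly one replacement
ExecStart=
ExecStart=/usr/bin/gpu-kubelet-plugin --kubeconfig /var/lib/kubelet/kubeconfig --container-driver-root / --image-name nvcr.io/nvidia/k8s-dra-driver-gpu:v25.8.1 --node-name=${NODE_NAME}
EOF
}

# Usage: write to a scratch directory and count the non-empty ExecStart= lines.
tmp_dir="$(mktemp -d)"
writeDRADriverOverride "${tmp_dir}"
grep -c '^ExecStart=.' "${tmp_dir}/10-dra-driver-nvidia-gpu.conf"
```

With this layout, NODE_NAME would still have to reach the unit through Environment= or EnvironmentFile=, which the quoted hunk does not show.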
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #