
[Draft] Dra #8671

Open: runzhen wants to merge 2 commits into Azure:main from runzhen:dra

runzhen commented Jun 9, 2026

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #

Copilot AI left a comment

Pull request overview

This draft PR appears to introduce an experimental “Managed GPU Experience via DRA” path for Ubuntu 24.04 GPU nodes by adding a new cached component (dra-driver-nvidia-gpu), switching CSE startup logic between nvidia-device-plugin and dra-driver-nvidia-gpu, and adding a new bootstrapping config flag.

Changes:

  • Add a new cached component (dra-driver-nvidia-gpu) to VHD build + components manifest.
  • Add a DRA toggle path in Linux CSE to start dra-driver-nvidia-gpu instead of nvidia-device-plugin.
  • Add a new e2e scenario stub intended to validate the DRA driver on Ubuntu 24.04.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| vhdbuilder/packer/install-dependencies.sh | Adds VHD build-time caching for dra-driver-nvidia-gpu tarball. |
| parts/common/components.json | Registers new dra-driver-nvidia-gpu component and its download URL for Ubuntu 24.04. |
| parts/linux/cloud-init/artifacts/cse_main.sh | Introduces ENABLE_MANAGED_GPU_EXPERIENCE_DRA derived from ENABLE_MANAGED_GPU_DRA. |
| parts/linux/cloud-init/artifacts/cse_config.sh | Switches managed GPU service startup to device-plugin vs DRA driver; adds DRA systemd override. |
| parts/linux/cloud-init/artifacts/ubuntu/cse_install_ubuntu.sh | Adds a DRA-specific “package list” and attempts to install it from cache. |
| aks-node-controller/parser/parser.go | Hard-codes ENABLE_MANAGED_GPU_DRA=true for scriptless-NBC env generation. |
| pkg/agent/datamodel/types.go | Adds EnableManagedGPUDRA to NBC. |
| pkg/agent/baker.go | Exposes IsEnableManagedGPUDRA to templates via func map. |
| e2e/scenario_gpu_managed_experience_test.go | Adds a new Ubuntu 24.04 DRA scenario (currently without assertions). |

Comment on lines +207 to 215
packageList=$(managedGPUPackageList)

if [ ${ENABLE_MANAGED_GPU_EXPERIENCE_DRA} = "true" ]; then
    packageList=$(managedGPUPackageListDRA)
    echo "DRA is enabled, using DRA-specific package list."
fi

for packageName in $(packageList); do
    downloadDir="/opt/${packageName}/downloads"
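
In the hunk above, `$(packageList)` is a command substitution, so the loop would try to run a command named `packageList` rather than read the variable assigned a few lines earlier, and the unquoted `${ENABLE_MANAGED_GPU_EXPERIENCE_DRA}` makes the `[` test fail with a syntax error when the flag is unset. A minimal sketch of one possible fix follows. The two list helpers are stand-ins with placeholder package names, and `listManagedGPUDownloadDirs` is a hypothetical wrapper added only so the snippet runs on its own; neither reflects the actual contents of cse_install_ubuntu.sh.

```shell
#!/usr/bin/env bash
# Stand-ins for the PR's helpers; the package names here are placeholders.
managedGPUPackageList() {
    echo "nvidia-device-plugin dcgm-exporter"
}
managedGPUPackageListDRA() {
    echo "dra-driver-nvidia-gpu dcgm-exporter"
}

# Hypothetical wrapper: print one download directory per selected package.
listManagedGPUDownloadDirs() {
    local packageList packageName
    packageList=$(managedGPUPackageList)
    # Quote the flag and give it a default so an unset value cannot break the test.
    if [ "${ENABLE_MANAGED_GPU_EXPERIENCE_DRA:-false}" = "true" ]; then
        packageList=$(managedGPUPackageListDRA)
    fi
    # Expand the variable; word splitting on spaces yields one package per iteration.
    for packageName in ${packageList}; do
        echo "/opt/${packageName}/downloads"
    done
}
```

Calling `ENABLE_MANAGED_GPU_EXPERIENCE_DRA=true listManagedGPUDownloadDirs` lists the DRA directories; leaving the flag unset falls back to the default list.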
Comment on lines +181 to +187
managedGPUPackageListDRA() {
    packages=(
        dra-driver-nvidia-gpu
        datacenter-gpu-manager-4-core
        datacenter-gpu-manager-4-proprietary
        dcgm-exporter
    )
Comment on lines 1606 to 1609
# installed on a previous CSE run. Stop them if they exist.
logs_to_events "AKS.CSE.stop.nvidia-device-plugin" "systemctlDisableAndStop nvidia-device-plugin"
logs_to_events "AKS.CSE.stop.nvidia-device-plugin" "systemctlDisableAndStop dra-driver-nvidia-gpu"
logs_to_events "AKS.CSE.stop.nvidia-dcgm" "systemctlDisableAndStop nvidia-dcgm"
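
The stop call for dra-driver-nvidia-gpu in this hunk is logged under the `AKS.CSE.stop.nvidia-device-plugin` event name, the same name used for the device plugin on the line before it. If that reuse is not intentional, deriving the event name from the service name avoids the copy-paste mismatch. The sketch below assumes that approach; `logs_to_events` and `systemctlDisableAndStop` are stubbed so the snippet runs standalone, and `stopManagedGPUServices` is a hypothetical name, not a function from the PR.

```shell
#!/usr/bin/env bash
# Stub for the real logs_to_events helper: print the event name, then run the command.
logs_to_events() {
    echo "event=$1"
    eval "$2"
}

# Stub for the real systemctlDisableAndStop helper: only report what would be stopped.
systemctlDisableAndStop() {
    echo "stop $1"
}

# Hypothetical helper: stop every managed GPU service, one event name per service.
stopManagedGPUServices() {
    local service
    for service in nvidia-device-plugin dra-driver-nvidia-gpu nvidia-dcgm nvidia-dcgm-exporter; do
        logs_to_events "AKS.CSE.stop.${service}" "systemctlDisableAndStop ${service}"
    done
}
```

With the event name built from `${service}`, each stop is reported under its own name, for example `AKS.CSE.stop.dra-driver-nvidia-gpu`.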
Comment on lines +1725 to +1727
# Reload systemd to pick up the override
systemctl daemon-reload
logs_to_events "AKS.CSE.start.dra-driver-nvidia-gpu" "systemctlEnableAndStart dra-driver-nvidia-gpu 30" || exit $ERR_DRA_DRIVER_START_FAIL
Comment on lines 1596 to +1621
@@ -1605,13 +1605,67 @@ configureManagedGPUExperience() {
        # EnableManagedGPUExperience is mutable, so services may have been
        # installed on a previous CSE run. Stop them if they exist.
        logs_to_events "AKS.CSE.stop.nvidia-device-plugin" "systemctlDisableAndStop nvidia-device-plugin"
        logs_to_events "AKS.CSE.stop.nvidia-device-plugin" "systemctlDisableAndStop dra-driver-nvidia-gpu"
        logs_to_events "AKS.CSE.stop.nvidia-dcgm" "systemctlDisableAndStop nvidia-dcgm"
        logs_to_events "AKS.CSE.stop.nvidia-dcgm-exporter" "systemctlDisableAndStop nvidia-dcgm-exporter"
        rm -f "${managed_gpu_marker}"
    fi
}

startNvidiaManagedExpServices() {
    # 1. Start device plugin or DRA driver
    if [ "${ENABLE_MANAGED_GPU_EXPERIENCE_DRA}" = "true" ]; then
        startDRADriverNvidiaGpu
    else
        startNvidiaDevicePlugin
    fi
Comment on lines 194 to 199
    "SERVICE_ACCOUNT_IMAGE_PULL_DEFAULT_TENANT_ID": config.GetServiceAccountImagePullProfile().GetDefaultTenantId(),
    "IDENTITY_BINDINGS_LOCAL_AUTHORITY_SNI": config.GetServiceAccountImagePullProfile().GetLocalAuthoritySni(),
    "CSE_TIMEOUT": getCSETimeout(config),
    "SKIP_WAAGENT_HOLD": "true",
    "ENABLE_MANAGED_GPU_DRA": "true",
}
Comment thread pkg/agent/baker.go
Comment on lines 1321 to +1326
    "IsEnableManagedGPU": func() bool {
        return config.EnableManagedGPU
    },
    "IsEnableManagedGPUDRA": func() bool {
        return config.EnableManagedGPUDRA
    },
Comment on lines +718 to +744
func Test_Ubuntu2404_NvidiaDraDriverRunning(t *testing.T) {
    RunScenario(t, &Scenario{
        Description: "Tests that NVIDIA DRA driver is running & functional on Ubuntu 24.04 GPU nodes",
        Tags: Tags{
            GPU: true,
        },
        Config: Config{
            Cluster: ClusterKubenet,
            VHD:     config.VHDUbuntu2404Gen2Containerd,
            BootstrapConfigMutator: func(_ *Cluster, nbc *datamodel.NodeBootstrappingConfiguration) {
                nbc.AgentPoolProfile.VMSize = "Standard_NV6ads_A10_v5"
                nbc.ConfigGPUDriverIfNeeded = true
                nbc.EnableNvidia = true
            },
            VMConfigMutator: func(vmss *armcompute.VirtualMachineScaleSet) {
                vmss.SKU.Name = to.Ptr("Standard_NV6ads_A10_v5")
                if vmss.Tags == nil {
                    vmss.Tags = map[string]*string{}
                }

                // Enable the AKS VM extension for GPU nodes
                extension, err := createVMExtensionLinuxAKSNode(t.Context(), vmss.Location)
                require.NoError(t, err, "creating AKS VM extension")
                vmss.Properties = addVMExtensionToVMSS(vmss.Properties, extension)
            },
        },
    })
Comment on lines +1717 to 1722
# Configure with pass-device-specs for non-MIG nodes
tee "${DRA_DRIVER_OVERRIDE_DIR}/10-dra-driver-nvidia-gpu.conf" > /dev/null <<'EOF'
[Service]
# Remove file-based logging - let systemd handle logs
StandardOutput=journal
StandardError=journal
# Change default port from 9400 to 19400 so that it does not conflict with user installed dcgm-exporter
ExecStart=
ExecStart=/usr/bin/dcgm-exporter -f /etc/dcgm-exporter/default-counters.csv --address ":19400"
ExecStart=/usr/bin/gpu-kubelet-plugin --kubeconfig /var/lib/kubelet/kubeconfig --container-driver-root / --image-name nvcr.io/nvidia/k8s-dra-driver-gpu:v25.8.1 --node-name=${NODE_NAME}
EOF
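
The override above carries two non-empty `ExecStart=` lines (one for dcgm-exporter, one for gpu-kubelet-plugin) plus a port comment that refers to dcgm-exporter. systemd accepts more than one `ExecStart=` only for `Type=oneshot` services, so unless the dra-driver-nvidia-gpu unit is oneshot, the drop-in would be rejected. A minimal sketch of a drop-in with a single replacement command follows, assuming the gpu-kubelet-plugin line is the intended one; its flags are copied verbatim from the hunk and not verified here, and `writeDRADriverOverride` is a hypothetical helper that takes the target directory as an argument so it can be tried against a temporary directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write the DRA driver drop-in into the directory given as $1.
writeDRADriverOverride() {
    local overrideDir="$1"
    mkdir -p "${overrideDir}"
    # The quoted 'EOF' keeps ${NODE_NAME} literal in the file, so systemd (not this
    # shell) has to resolve it from the unit's environment. That is an assumption
    # about how the unit is configured.
    tee "${overrideDir}/10-dra-driver-nvidia-gpu.conf" > /dev/null <<'EOF'
[Service]
# Remove file-based logging - let systemd handle logs
StandardOutput=journal
StandardError=journal
# Clear the packaged ExecStart, then set exactly one replacement
ExecStart=
ExecStart=/usr/bin/gpu-kubelet-plugin --kubeconfig /var/lib/kubelet/kubeconfig --container-driver-root / --image-name nvcr.io/nvidia/k8s-dra-driver-gpu:v25.8.1 --node-name=${NODE_NAME}
EOF
}
```

Running `writeDRADriverOverride "$(mktemp -d)"` writes the file somewhere harmless for inspection before pointing the helper at `${DRA_DRIVER_OVERRIDE_DIR}`.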