feat(gpu): add NVIDIA GRID v20 driver support for RTX PRO 6000 BSE v6 SKUs #8666
Select the new aks-gpu-grid-v20 image (NVIDIA GRID 595.x) for
NC_RTXPRO6000BSE_v6 SKUs. All existing GRID SKUs continue to use
aks-gpu-grid (570.x); CUDA path is untouched.
- components.json: add aks-gpu-grid-v20 GPUContainerImages entry.
- gpu_components.go: parse it into NvidiaGridV20DriverVersion /
AKSGPUGridV20VersionSuffix; refactor LoadConfig to match on the exact
repo name (fixes a latent substring collision between aks-gpu-grid and
aks-gpu-grid-v20); add RTXPro6000GPUDriverSizes.
- baker.go: add useGridV20Drivers(); branch GetGPUDriverVersion /
GetAKSGPUImageSHA / GetGPUDriverType on it (checked before grid),
driver type "grid-v20".
- renovate.json: add aks/aks-gpu-grid-v20 package rule.
- tests for the new selection paths.
Scope is Ubuntu-only: RTX PRO 6000 BSE v6 runs on Ubuntu GPU nodes, which
build the driver image repo as aks-gpu-${GPU_DRIVER_TYPE}; non-Ubuntu
(Mariner/ACL) install paths do not use the container image and are
deliberately untouched.
NOTE (do not merge yet): aks-gpu-grid-v20 is not yet published to MCR, so
the version tag suffix in components.json is a placeholder and must be
replaced with the real published tag before merge.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
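The selection order described in the commit message (grid-v20 checked before grid, with CUDA as the fallback) can be sketched as follows. This is a minimal illustration, not the code in baker.go: `driverTypeFor`, `gridV20Sizes`, and `gridSizes` are hypothetical names, the maps are abbreviated, and the lowercase lookup is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical, abbreviated allowlists. The real maps live in
// pkg/agent/datamodel and are not reproduced here.
var gridV20Sizes = map[string]bool{
	"standard_nc128ds_xl_rtxpro6000bse_v6": true,
}

var gridSizes = map[string]bool{
	"standard_nv36ads_a10_v5": true, // illustrative existing GRID size
}

// driverTypeFor returns the driver type string for a VM size.
// grid-v20 is tested first so that an RTX PRO 6000 size can never be
// classified as plain grid, and cuda remains the default.
func driverTypeFor(vmSize string) string {
	size := strings.ToLower(vmSize) // assumption: lookups are case-insensitive
	switch {
	case gridV20Sizes[size]:
		return "grid-v20"
	case gridSizes[size]:
		return "grid"
	default:
		return "cuda"
	}
}

func main() {
	for _, s := range []string{
		"Standard_NC128ds_xl_RTXPRO6000BSE_v6",
		"Standard_NV36ads_A10_v5",
		"Standard_NC6s_v3",
	} {
		fmt.Printf("%s -> %s\n", s, driverTypeFor(s))
	}
}
```

Because the driver type feeds the image repo name on Ubuntu, returning "grid-v20" is what routes these sizes to the aks-gpu-grid-v20 image.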
…529155703
Replace the placeholder version suffix 20260101000000 in the aks-gpu-grid-v20 GPUContainerImages entry with the real tag pushed to MCR by Azure/aks-gpu build 158. This is the tag AKS nodes pull for NC_RTXPRO6000BSE_v6 SKUs at provision time.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…9172331
The aks-gpu-grid-v20 driver image is now live in MCR. The only published tag is 595.58.03-20260609172331 (from aks-gpu build run 27223544445), so re-pin components.json from the earlier placeholder build tag 595.58.03-20260529155703 to the tag that actually resolves in MCR.
Verified:
- mcr.microsoft.com/v2/aks/aks-gpu-grid-v20/tags/list -> 200 with tag 595.58.03-20260609172331
- datamodel TestLoadConfig + full agent suite (244 specs) pass
- make generate produces no snapshot drift
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
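The tags/list check mentioned above can be reproduced offline by parsing a registry v2 tag-list response and looking for the pinned tag. The sketch below assumes the standard `{"name": ..., "tags": [...]}` response shape; `hasTag` is a hypothetical helper and the sample body is illustrative, not a captured MCR response.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// tagList mirrors the JSON body returned by a registry v2
// /v2/<name>/tags/list endpoint.
type tagList struct {
	Name string   `json:"name"`
	Tags []string `json:"tags"`
}

// hasTag reports whether the tag-list body contains the wanted tag.
func hasTag(body []byte, want string) (bool, error) {
	var tl tagList
	if err := json.Unmarshal(body, &tl); err != nil {
		return false, fmt.Errorf("decode tag list: %w", err)
	}
	for _, t := range tl.Tags {
		if t == want {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	// Illustrative body only; fetch the real one from the registry.
	body := []byte(`{"name":"aks/aks-gpu-grid-v20","tags":["595.58.03-20260609172331"]}`)
	found, err := hasTag(body, "595.58.03-20260609172331")
	fmt.Println(found, err)
}
```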
Pull request overview
Adds support for selecting NVIDIA GRID v20 (595.x) drivers for RTX PRO 6000 BSE v6 VM SKUs by introducing a new aks-gpu-grid-v20 GPU container image entry in components.json, parsing it in the datamodel, and branching driver/image selection logic accordingly (with accompanying unit tests and Renovate rule).
Changes:
- Add `aks/aks-gpu-grid-v20` entry in `parts/common/components.json` and a matching Renovate package rule.
- Extend `pkg/agent/datamodel` GPU component parsing to load GRID v20 version/suffix safely (avoiding substring collisions), and add an SKU allowlist for RTX PRO 6000 BSE v6.
- Update `pkg/agent/baker.go` GPU driver selection to emit `grid-v20` for those SKUs, plus unit test coverage for the new selection paths.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| pkg/agent/datamodel/gpu_components.go | Loads GRID v20 version/suffix from GPUContainerImages and adds RTX PRO 6000 BSE v6 SKU map. |
| pkg/agent/datamodel/gpu_components_test.go | Extends config-load assertions for GRID v20 version/suffix. |
| pkg/agent/baker.go | Branches GPU driver version/image SHA/type selection to grid-v20 for RTX PRO 6000 BSE v6 SKUs. |
| pkg/agent/baker_test.go | Adds Ginkgo specs validating the new grid-v20 selection behavior. |
| parts/common/components.json | Introduces the new published aks-gpu-grid-v20 image tag entry. |
| .github/renovate.json | Adds a Renovate rule for aks/aks-gpu-grid-v20 tag updates. |
AgentBaker Linux PR gate — Mixed: recurring 24.04 fwupd + Azure ARM 409 "AnotherOperationInProgress" cluster fixture contention (NOT this PR)
Bucket A — Ubuntu 24.04 fwupd.service (RECURRING main regression, NOT this PR)
Bucket B — Azure ARM 409 AnotherOperationInProgress on route-table updates (cluster fixture contention, NOT this PR)
Signature: Multiple scenarios race to update the same shared
Three-level analysis:
Build-vs-test: Bucket A = product/VHD (main). Bucket B = test-infra/Azure-ARM concurrency (fixture).
Recommended next action / owner: NodeSIG-dev for Bucket A (PR #8662 fix). E2E infra owner for Bucket B (serialize or scope route-table updates so concurrent scenarios on the latest-kubernetes-version-v2 cluster don't race on the same vnet-local route).
PR author: do NOT block merge on these.
Posted by Clawpilot AgentBaker gate detective.
Addresses the three Copilot review comments on #8666:
1. Rename the local `parts` slice to `versionParts` in datamodel.LoadConfig so it no longer shadows the imported `parts` package.
2. Add TestGPUImageRepo covering cuda / grid / grid-v20 repo parsing plus an explicit assertion that aks-gpu-grid-v20 is never mis-parsed as aks-gpu-grid, locking in the substring-collision fix.
3. Make the non-Ubuntu provisioning paths fail fast for grid-v20. RTX PRO 6000 BSE v6 is Ubuntu-only and there is no v20 RPM (Mariner/AzureLinux) or sysext (Azure Container Linux). Previously grid-v20 fell through silently to the cuda path, installing the wrong driver on a vGPU node. cse_install_mariner.sh (downloadGPUDrivers) and cse_install_acl.sh (installGPUDriverSysext) now exit with ERR_NVIDIA_DRIVER_INSTALL and a clear message, with matching ShellSpec coverage. The Ubuntu container-image path (the supported path) is unchanged.
The post-install nvidia-gridd licensing check in cse_config.sh needs no change: the install step now exits first for grid-v20 on Mariner/ACL, so that line is never reached for these SKUs.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
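The substring collision that TestGPUImageRepo guards against is easy to demonstrate: any substring test for aks-gpu-grid also matches aks-gpu-grid-v20, whereas comparing the extracted repo name does not. The following is a minimal sketch of the idea; `repoName` is a hypothetical helper and is not the parsing code in gpu_components.go.

```go
package main

import (
	"fmt"
	"strings"
)

// repoName returns the last path segment of an image reference with any
// tag removed, e.g. "mcr.microsoft.com/aks/aks-gpu-grid:1.0" -> "aks-gpu-grid".
func repoName(image string) string {
	ref := image
	if i := strings.LastIndex(ref, "/"); i >= 0 {
		ref = ref[i+1:] // drop registry host and namespace
	}
	if i := strings.Index(ref, ":"); i >= 0 {
		ref = ref[:i] // drop the tag
	}
	return ref
}

func main() {
	image := "mcr.microsoft.com/aks/aks-gpu-grid-v20:595.58.03-20260609172331"

	// Substring matching wrongly classifies the v20 image as plain grid.
	fmt.Println(strings.Contains(image, "aks-gpu-grid")) // true

	// Exact matching on the repo name keeps the two images apart.
	fmt.Println(repoName(image) == "aks-gpu-grid")     // false
	fmt.Println(repoName(image) == "aks-gpu-grid-v20") // true
}
```

Matching on the full name also makes the result independent of the order in which entries are checked, which a substring test cannot guarantee.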
Resolves an add/add conflict in spec/parts/linux/cloud-init/artifacts/cse_install_acl_spec.sh: main added an ACL spec covering installSecureTLSBootstrapClientSysext while this branch added one covering the grid-v20 guard in installGPUDriverSysext. Kept main's file and integrated the 'installGPUDriverSysext grid vs cuda selection' Describe block into it. The cse_install_acl.sh guard itself auto-merged cleanly.
AgentBaker Linux PR gate — 236-failure run: shared cluster fleet outage continues (test-infra, NOT this PR)
Same shared cluster fleet outage affecting every concurrent PR in this window: 123×
Cross-PR pattern this morning: PR #8652 build 167419663, PR #8679 build 167421198, PR #8294 build 167422687, and concurrent PRs all hit identical 236-fail / cluster-not-ready signature.
Build-vs-test: test-infra (shared cluster fleet outage), NOT product, NOT PR-caused.
Recommended next action / owner:
Posted by Clawpilot AgentBaker gate detective.
Addresses Copilot review feedback on the Mariner grid-v20 spec: the test already sets ERR_NVIDIA_DRIVER_INSTALL, so assert the status against the variable instead of a hard-coded 224 to avoid duplication and stay correct if the constant ever changes (matches the ACL spec). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds Test_Ubuntu2404_GPU_RTXPro6000_GridV20 exercising the new grid-v20 driver path end-to-end: provisions a Standard_NC128ds_xl_RTXPRO6000BSE_v6 node on the Ubuntu 2404 VHD and asserts the aks-gpu-grid-v20 (595.x) driver is installed via the new ValidateNvidiaGridV20DriverInstalled validator (nvidia-smi driver_version must be 595.*). Pinned to a region with SKU availability and quota. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
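The version assertion (driver_version must be 595.*) amounts to a prefix check on each line that nvidia-smi reports. The sketch below is not the ValidateNvidiaGridV20DriverInstalled validator itself; `allDriversMatch` is a hypothetical helper, and it assumes the input is the output of `nvidia-smi --query-gpu=driver_version --format=csv,noheader`, which prints one version per GPU.

```go
package main

import (
	"fmt"
	"strings"
)

// allDriversMatch reports whether every driver version in the nvidia-smi
// output starts with the given prefix. Empty output is treated as a
// failure, since it means no GPU reported a driver at all.
func allDriversMatch(output, prefix string) bool {
	versions := strings.Fields(output) // one version string per GPU
	if len(versions) == 0 {
		return false
	}
	for _, v := range versions {
		if !strings.HasPrefix(v, prefix) {
			return false
		}
	}
	return true
}

func main() {
	// Illustrative output for a two-GPU node.
	out := "595.58.03\n595.58.03\n"
	fmt.Println(allDriversMatch(out, "595."))
}
```

Including the trailing dot in the prefix avoids accepting an unrelated version that merely begins with the same digits.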
AgentBaker Linux PR gate —
Detective summary
CSE on the HTTPSProxy_PrivateDNS test VM exits 99 because
Same signature as build 167493131 (which hit
Classification: Test infrastructure / scenario fixture flakiness. Second occurrence of this signature.
Confidence: High. PR #8666 (NVIDIA GRID v20 driver support) only touches GPU-driver paths and does not modify CSE, apt-source config, or the proxy fixture. All VHD builds passed; only the proxy-dependent scenario failed.
Strongest alternative theory: A regression in CSE proxy-aware
Recommended next action / owner: No PR change required. AgentBaker E2E test-infra owner: confirm the HTTPSProxy_PrivateDNS fixture's proxy pod/daemon is running and reachable in the
Evidence used: failed task log (3
Verified end-to-end against a real Standard_NC128ds_xl_RTXPRO6000BSE_v6 node: both default and scriptless_nbc subtests pass. The node pulled mcr.microsoft.com/aks/aks-gpu-grid-v20:595.58.03-20260609172331 and nvidia-smi reported the 595.x driver (NVIDIA_GPU_DRIVER_TYPE=grid-v20).
- UseNVMe: true, because RTX PRO 6000 BSE v6 only supports ephemeral OS disk on NvmeDisk placement (SupportedEphemeralOSDiskPlacements=NvmeDisk), not the default ResourceDisk, which returned 409 NotSupported.
- Pin to southeastasia, which has SKU availability and quota for this SKU.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
MOCK_VM_SKU=""
get_compute_sku() { echo "$MOCK_VM_SKU"; }

# Capture which sysext was selected and avoid real installs.
installACLGPUSysext() { echo "installACLGPUSysext $1"; }
systemd-tmpfiles() { return 0; }

# Mock should_use_nvidia_open_drivers to avoid IMDS dependency.
MOCK_OPEN_RET=0
should_use_nvidia_open_drivers() { return "$MOCK_OPEN_RET"; }
Each RTX PRO 6000 BSE v6 size ships as a ds (higher-memory) and lds (lower-memory) pair that share the same GPU, e.g. Standard_NC128ds_xl_RTXPRO6000BSE_v6 (512 GB) vs Standard_NC128lds_xl_RTXPRO6000BSE_v6 (256 GB). Only the ds variants were in RTXPro6000GPUDriverSizes, so an lds node would not match useGridV20Drivers and would fall back to the cuda driver. Add the three lds variants (128/256/320) so they also get the aks-gpu-grid-v20 (595.x) driver, and extend the unit/ginkgo selection tests to cover an lds SKU. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
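An allowlist that covers both variants might look like the sketch below. It is illustrative rather than a copy of RTXPro6000GPUDriverSizes: the map name, the lowercase keys, and the helper are assumptions, and the 256 and 320 lds names are inferred from the naming pattern described above rather than quoted from the change.

```go
package main

import (
	"fmt"
	"strings"
)

// rtxPro6000Sizes lists the ds and lds variants that should receive the
// grid-v20 driver. Keys are lowercase (an assumption about lookup style).
var rtxPro6000Sizes = map[string]bool{
	"standard_nc128ds_xl_rtxpro6000bse_v6":  true,
	"standard_nc256ds_xl_rtxpro6000bse_v6":  true,
	"standard_nc320ds_xl_rtxpro6000bse_v6":  true,
	"standard_nc128lds_xl_rtxpro6000bse_v6": true,
	"standard_nc256lds_xl_rtxpro6000bse_v6": true, // inferred name
	"standard_nc320lds_xl_rtxpro6000bse_v6": true, // inferred name
}

// usesGridV20 reports whether a VM size is on the allowlist.
func usesGridV20(vmSize string) bool {
	return rtxPro6000Sizes[strings.ToLower(vmSize)]
}

func main() {
	fmt.Println(usesGridV20("Standard_NC128lds_xl_RTXPRO6000BSE_v6"))
	fmt.Println(usesGridV20("Standard_NC6s_v3"))
}
```

An explicit allowlist means a size that is missing from the map falls back to another driver without any error, which is the failure mode this commit fixes for the lds sizes.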
AgentBaker Linux PR gate —
Detective summary
Same pattern as builds 167493131, 167505019, and 167534982: vmssCSE exits 99 because
Classification: Test infrastructure / scenario fixture flakiness.
Confidence: High. PR #8666 is NVIDIA GRID v20 GPU-driver support only; it does not touch CSE, apt, or the proxy fixture. No GPU scenarios are in the failure list; only the proxy-dependent HTTPSProxy_PrivateDNS scenario fails.
Strongest alternative theory: A transient regression in the proxy-aware CSE apt config. Less likely because the failure is a TCP-level
Recommended next action / owner: No PR change required. AgentBaker E2E test-infra: this signature is now hitting on every build that exercises HTTPSProxy_PrivateDNS; please prioritize fixing the proxy fixture before it crosses the watcher's escalation threshold.
Evidence used: failed task log (3
What
Adds NVIDIA GRID v20 (595.x) driver support, selecting the new `aks-gpu-grid-v20` container image for RTX PRO 6000 Blackwell Server Edition v6 SKUs:
- `Standard_NC128ds_xl_RTXPRO6000BSE_v6`
- `Standard_NC256ds_xl_RTXPRO6000BSE_v6`
- `Standard_NC320ds_xl_RTXPRO6000BSE_v6`

All existing GRID SKUs keep using `aks-gpu-grid` (570.x); the CUDA path is untouched.

Changes
- `parts/common/components.json`: add `aks-gpu-grid-v20` `GPUContainerImages` entry, pinned to the published MCR tag `595.58.03-20260609172331`.
- `pkg/agent/datamodel/gpu_components.go`: parse it into `NvidiaGridV20DriverVersion` / `AKSGPUGridV20VersionSuffix`; refactor `LoadConfig` to match on the exact repo name (fixes a latent substring collision: `aks-gpu-grid-v20` contains `aks-gpu-grid`); add `RTXPro6000GPUDriverSizes`.
- `pkg/agent/baker.go`: add `useGridV20Drivers()`; branch `GetGPUDriverVersion` / `GetAKSGPUImageSHA` / `GetGPUDriverType` on it (checked before grid); driver type string `"grid-v20"`.
- `.github/renovate.json`: add `aks/aks-gpu-grid-v20` package rule.

Design notes
- On Ubuntu the driver image repo is built as `mcr.microsoft.com/aks/aks-gpu-${GPU_DRIVER_TYPE}` (`cse_helpers.sh`), so setting the driver type to `grid-v20` resolves the new repo automatically.
- Scope is Ubuntu-only by design. RTX PRO 6000 BSE v6 runs on Ubuntu GPU nodes. The non-Ubuntu install paths (Mariner RPM / ACL sysext) do not use the container image and have no v20 packages, so those CSE checks are deliberately left unchanged.
- The new image comes from aks-gpu PR #158 (merged).

Publish status: LIVE in MCR ✅
`aks-gpu-grid-v20` is now published and resolves in MCR.
- `components.json` is pinned to that exact published tag (`595.58.03-20260609172331`, from aks-gpu build run 27223544445, digest `sha256:fa35a31240aeea100a84e386ca9e5d97b79c1b6945f4a3527d9c1c8cf223c638`). The earlier placeholder suffix has been replaced.
- `make generate` produces no testdata/manifest diff (no existing scenario uses these SKUs).

Testing
- `go build ./pkg/agent/...`
- `go test ./pkg/agent ./pkg/agent/datamodel`: pass (full agent ginkgo suite: 244 specs; includes the grid-v20 / RTX PRO 6000 selection specs)
- `make generate`: no snapshot drift
- `make validate-components`: pass

Note
Supersedes #8619, which was opened from a fork and carried the pre-publish placeholder tag. This PR is from the upstream `Azure/AgentBaker` branch with the live tag.