
feat(gpu): add NVIDIA GRID v20 driver support for RTX PRO 6000 BSE v6 SKUs #8666

Merged
ganeshkumarashok merged 11 commits into main from gpu-grid-v20-driver-support
Jun 11, 2026


@ganeshkumarashok

What

Adds NVIDIA GRID v20 (595.x) driver support, selecting the new aks-gpu-grid-v20 container image for RTX PRO 6000 Blackwell Server Edition v6 SKUs:

  • Standard_NC128ds_xl_RTXPRO6000BSE_v6
  • Standard_NC256ds_xl_RTXPRO6000BSE_v6
  • Standard_NC320ds_xl_RTXPRO6000BSE_v6

All existing GRID SKUs keep using aks-gpu-grid (570.x); the CUDA path is untouched.

Changes

  • parts/common/components.json — add aks-gpu-grid-v20 GPUContainerImages entry, pinned to the published MCR tag 595.58.03-20260609172331.
  • pkg/agent/datamodel/gpu_components.go — parse it into NvidiaGridV20DriverVersion / AKSGPUGridV20VersionSuffix; refactor LoadConfig to match on the exact repo name (fixes a latent substring collision: aks-gpu-grid-v20 contains aks-gpu-grid); add RTXPro6000GPUDriverSizes.
  • pkg/agent/baker.go — add useGridV20Drivers(); branch GetGPUDriverVersion / GetAKSGPUImageSHA / GetGPUDriverType on it (checked before grid); driver type string "grid-v20".
  • .github/renovate.json — add aks/aks-gpu-grid-v20 package rule.
  • Unit tests for the new selection paths.
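
The substring collision mentioned for LoadConfig can be illustrated with a minimal sketch. The helper names (`repoName`, `matchBySubstring`, `matchExact`) and the `...:*` reference format are assumptions for illustration, not the actual functions or data in `gpu_components.go`.

```go
package main

import (
	"fmt"
	"strings"
)

// repoName returns the final path segment of an image reference without
// its tag, e.g. "aks-gpu-grid-v20" for
// "mcr.microsoft.com/aks/aks-gpu-grid-v20:*".
func repoName(ref string) string {
	// Drop ":tag" only when the colon comes after the last slash, so a
	// registry port such as "host:5000/repo" is left alone.
	if i := strings.LastIndex(ref, ":"); i > strings.LastIndex(ref, "/") {
		ref = ref[:i]
	}
	return ref[strings.LastIndex(ref, "/")+1:]
}

// matchBySubstring mimics the collision-prone approach.
func matchBySubstring(ref, repo string) bool {
	return strings.Contains(ref, repo)
}

// matchExact compares the parsed repo name for equality.
func matchExact(ref, repo string) bool {
	return repoName(ref) == repo
}

func main() {
	v20 := "mcr.microsoft.com/aks/aks-gpu-grid-v20:*"
	fmt.Println(matchBySubstring(v20, "aks-gpu-grid")) // true: the collision
	fmt.Println(matchExact(v20, "aks-gpu-grid"))       // false
	fmt.Println(matchExact(v20, "aks-gpu-grid-v20"))   // true
}
```

With substring matching, the v20 entry would also satisfy the check intended for `aks-gpu-grid`; comparing the parsed repo name avoids that.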

Design notes

On Ubuntu the driver image repo is built as mcr.microsoft.com/aks/aks-gpu-${GPU_DRIVER_TYPE} (cse_helpers.sh), so setting the driver type to grid-v20 resolves the new repo automatically.
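
As a rough illustration of why the driver type string alone selects the repo, the composition can be sketched in Go. The real logic is shell in `cse_helpers.sh`; `driverImageRef` is a hypothetical helper, and the `<version>-<suffix>` tag layout is inferred from the published tag quoted in this PR.

```go
package main

import "fmt"

// driverImageRef composes an image reference following the pattern
// mcr.microsoft.com/aks/aks-gpu-${GPU_DRIVER_TYPE}, then appends a tag
// of the form "<version>-<suffix>" (assumed layout).
func driverImageRef(driverType, version, suffix string) string {
	return fmt.Sprintf("mcr.microsoft.com/aks/aks-gpu-%s:%s-%s", driverType, version, suffix)
}

func main() {
	fmt.Println(driverImageRef("grid-v20", "595.58.03", "20260609172331"))
}
```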

Scope is Ubuntu-only by design. RTX PRO 6000 BSE v6 runs on Ubuntu GPU nodes. The non-Ubuntu install paths (Mariner RPM / ACL sysext) do not use the container image and have no v20 packages, so those CSE checks are deliberately left unchanged.

The new image comes from aks-gpu PR #158 (merged).

Publish status: LIVE in MCR ✅

aks-gpu-grid-v20 is now published and resolves in MCR:

GET https://mcr.microsoft.com/v2/aks/aks-gpu-grid-v20/tags/list
-> 200, tags: ["595.58.03-20260609172331"]
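
A check like the one above could be automated by decoding the registry's `tags/list` response. This is a sketch only: `hasTag` is a hypothetical helper, the sample body is hand-written, and its `name` value is assumed from the request path.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// tagList models the Registry HTTP API v2 response for /v2/<name>/tags/list.
type tagList struct {
	Name string   `json:"name"`
	Tags []string `json:"tags"`
}

// hasTag reports whether a tags/list response body contains want.
func hasTag(body []byte, want string) (bool, error) {
	var tl tagList
	if err := json.Unmarshal(body, &tl); err != nil {
		return false, fmt.Errorf("decode tags list: %w", err)
	}
	for _, t := range tl.Tags {
		if t == want {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	// Hand-written sample body; not captured from MCR.
	body := []byte(`{"name":"aks/aks-gpu-grid-v20","tags":["595.58.03-20260609172331"]}`)
	ok, err := hasTag(body, "595.58.03-20260609172331")
	fmt.Println(ok, err)
}
```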

components.json is pinned to that exact published tag (595.58.03-20260609172331, from aks-gpu build run 27223544445, digest sha256:fa35a31240aeea100a84e386ca9e5d97b79c1b6945f4a3527d9c1c8cf223c638). The earlier placeholder suffix has been replaced. make generate produces no testdata/manifest diff (no existing scenario uses these SKUs).
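
The pinned tag carries both values the datamodel exposes: the driver version and the build suffix. A minimal sketch of that split follows, assuming the tag always has the form `<version>-<suffix>`; `splitTag` is a hypothetical helper and may differ from how LoadConfig actually parses it.

```go
package main

import (
	"fmt"
	"strings"
)

// splitTag separates "595.58.03-20260609172331" into the driver version
// ("595.58.03") and the build suffix ("20260609172331").
func splitTag(tag string) (string, string, error) {
	version, suffix, ok := strings.Cut(tag, "-")
	if !ok || version == "" || suffix == "" {
		return "", "", fmt.Errorf("tag %q is not in <version>-<suffix> form", tag)
	}
	return version, suffix, nil
}

func main() {
	version, suffix, err := splitTag("595.58.03-20260609172331")
	fmt.Println(version, suffix, err)
}
```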

Testing

  • go build ./pkg/agent/...
  • go test ./pkg/agent ./pkg/agent/datamodel — pass (full agent ginkgo suite: 244 specs; includes the grid-v20 / RTX PRO 6000 selection specs)
  • make generate — no snapshot drift
  • make validate-components — pass

Note

Supersedes #8619, which was opened from a fork and carried the pre-publish placeholder tag. This PR is from the upstream Azure/AgentBaker branch with the live tag.

ganeshkumarashok and others added 4 commits June 1, 2026 17:23
… SKUs

Select the new aks-gpu-grid-v20 image (NVIDIA GRID 595.x) for
NC_RTXPRO6000BSE_v6 SKUs. All existing GRID SKUs continue to use
aks-gpu-grid (570.x); CUDA path is untouched.

- components.json: add aks-gpu-grid-v20 GPUContainerImages entry.
- gpu_components.go: parse it into NvidiaGridV20DriverVersion /
  AKSGPUGridV20VersionSuffix; refactor LoadConfig to match on the exact
  repo name (fixes a latent substring collision between aks-gpu-grid and
  aks-gpu-grid-v20); add RTXPro6000GPUDriverSizes.
- baker.go: add useGridV20Drivers(); branch GetGPUDriverVersion /
  GetAKSGPUImageSHA / GetGPUDriverType on it (checked before grid),
  driver type "grid-v20".
- renovate.json: add aks/aks-gpu-grid-v20 package rule.
- tests for the new selection paths.

Scope is Ubuntu-only: RTX PRO 6000 BSE v6 runs on Ubuntu GPU nodes, which
build the driver image repo as aks-gpu-${GPU_DRIVER_TYPE}; non-Ubuntu
(Mariner/ACL) install paths do not use the container image and are
deliberately untouched.

NOTE (do not merge yet): aks-gpu-grid-v20 is not yet published to MCR, so
the version tag suffix in components.json is a placeholder and must be
replaced with the real published tag before merge.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…529155703

Replace the placeholder version suffix 20260101000000 in the
aks-gpu-grid-v20 GPUContainerImages entry with the real tag pushed to
MCR by Azure/aks-gpu build 158. This is the tag AKS nodes pull for
NC_RTXPRO6000BSE_v6 SKUs at provision time.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…9172331

The aks-gpu-grid-v20 driver image is now live in MCR. The only published
tag is 595.58.03-20260609172331 (from aks-gpu build run 27223544445), so
re-pin components.json from the earlier placeholder build tag
595.58.03-20260529155703 to the tag that actually resolves in MCR.

Verified: mcr.microsoft.com/v2/aks/aks-gpu-grid-v20/tags/list -> 200 with
tag 595.58.03-20260609172331; datamodel TestLoadConfig + full agent suite
(244 specs) pass; make generate produces no snapshot drift.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment

Pull request overview

Adds support for selecting NVIDIA GRID v20 (595.x) drivers for RTX PRO 6000 BSE v6 VM SKUs by introducing a new aks-gpu-grid-v20 GPU container image entry in components.json, parsing it in the datamodel, and branching driver/image selection logic accordingly (with accompanying unit tests and Renovate rule).

Changes:

  • Add aks/aks-gpu-grid-v20 entry in parts/common/components.json and a matching Renovate package rule.
  • Extend pkg/agent/datamodel GPU component parsing to load GRID v20 version/suffix safely (avoiding substring collisions), and add an SKU allowlist for RTX PRO 6000 BSE v6.
  • Update pkg/agent/baker.go GPU driver selection to emit grid-v20 for those SKUs, plus unit test coverage for the new selection paths.
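
The selection order summarized above (grid-v20 checked before grid, cuda as the fallback) might look roughly like this. The maps are hypothetical stand-ins for the allowlists in `pkg/agent/datamodel`, `standard_example_grid_sku` is an invented placeholder, and case-insensitive matching is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical stand-ins for the SKU allowlists; the real ones hold
// more entries.
var rtxPro6000Sizes = map[string]bool{
	"standard_nc128ds_xl_rtxpro6000bse_v6": true,
}

var gridSizes = map[string]bool{
	"standard_example_grid_sku": true, // invented placeholder SKU
}

// gpuDriverType returns the driver type string for a VM size.
func gpuDriverType(vmSize string) string {
	size := strings.ToLower(vmSize)
	switch {
	case rtxPro6000Sizes[size]: // checked first, so grid-v20 wins over grid
		return "grid-v20"
	case gridSizes[size]:
		return "grid"
	default:
		return "cuda"
	}
}

func main() {
	fmt.Println(gpuDriverType("Standard_NC128ds_xl_RTXPRO6000BSE_v6"))
	fmt.Println(gpuDriverType("Standard_Example_Grid_SKU"))
	fmt.Println(gpuDriverType("Standard_Other_SKU"))
}
```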

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| pkg/agent/datamodel/gpu_components.go | Loads GRID v20 version/suffix from GPUContainerImages and adds RTX PRO 6000 BSE v6 SKU map. |
| pkg/agent/datamodel/gpu_components_test.go | Extends config-load assertions for GRID v20 version/suffix. |
| pkg/agent/baker.go | Branches GPU driver version/image SHA/type selection to grid-v20 for RTX PRO 6000 BSE v6 SKUs. |
| pkg/agent/baker_test.go | Adds Ginkgo specs validating the new grid-v20 selection behavior. |
| parts/common/components.json | Introduces the new published aks-gpu-grid-v20 image tag entry. |
| .github/renovate.json | Adds a Renovate rule for aks/aks-gpu-grid-v20 tag updates. |

Comment thread pkg/agent/datamodel/gpu_components.go Outdated
Comment thread pkg/agent/datamodel/gpu_components_test.go
Comment thread pkg/agent/baker.go
@aks-node-assistant

AgentBaker Linux PR gate — Mixed: recurring 24.04 fwupd + Azure ARM 409 "AnotherOperationInProgress" cluster fixture contention (NOT this PR)

  • Run: 167350983 (failed)
  • Failed task: Run AgentBaker E2E
  • Test summary: DONE 441 tests, 95 skipped, 33 failures in 1806.641s

Bucket A — Ubuntu 24.04 fwupd.service (RECURRING main regression, NOT this PR)

Test_Ubuntu2404_CSE_CachedPerformance, Test_Ubuntu2404_NPD_Basic, Test_Ubuntu2404_Scriptless, Test_Ubuntu2404_SecureTLSBootstrapping_BootstrapToken_Fallback, Test_Ubuntu2404Gen2, Test_Ubuntu2404Gen2_McrChinaCloud_Scriptless, Test_LocalDNSHostsPlugin (24.04 leg). 7 fwupd.service hits. Same 24.04 main regression flagged on 10+ recent gate runs; mitigation PR #8662 in flight.

Bucket B — Azure ARM 409 AnotherOperationInProgress on route-table updates (cluster fixture contention, NOT this PR)

Test_Random_VHD_With_Latest_Kubernetes_Version, Test_Ubuntu2204_CSE_CachedPerformance, Test_Ubuntu2204Gen2_ImagePullIdentityBinding_Disabled, Test_Ubuntu2204Gen2_ImagePullIdentityBinding_Disabled_Scriptless, Test_Ubuntu2204Gen2_ImagePullIdentityBinding_Enabled, Test_Ubuntu2204Gen2_ImagePullIdentityBinding_Enabled_Scriptless, Test_Ubuntu2204Gen2_ImagePullIdentityBinding_EnabledWithoutDefaultIDs.

Signature:

prepare cluster tasks: dag execution failed: failed to start adding route "vnet-local":
PUT .../routeTables/aks-agentpool-40918388-routetable/routes/vnet-local
RESPONSE 409: AnotherOperationInProgress
"Another operation on this or dependent resource is in progress."

Multiple scenarios race to update the same shared aks-agentpool-40918388-routetable "vnet-local" route on the abe2e-latest-kubernetes-version-v2-d6af0 cluster. This is an Azure RM-side concurrency contention in the test-cluster fixture, not a node/VHD/CSE issue. The PR (GPU NVIDIA GRID v20 driver) does not touch test cluster definitions or route-table provisioning.

Three-level analysis:

  1. L1: Bucket A — fwupd validator. Bucket B — ARM 409 on route table.
  2. L2: Bucket A reproduces across 10+ unrelated PRs on the same main HEAD (24.04 scoped). Bucket B reproduces on a single shared cluster across multiple parallel scenarios; the failure is at cluster-prep, not at node boot. PR #8666 changes GPU driver registration for v20 / RTX PRO 6000, which is distinct from both buckets.
  3. L3 challenge: "GPU driver change causes ARM 409s on 22.04 ImagePullIdentityBinding" — refuted, no causal path. Strongest alt for Bucket B: AKS routeTable update parallelism limit was tightened — possible but still upstream of this PR.

Build-vs-test: Bucket A = product/VHD (main). Bucket B = test-infra/Azure-ARM concurrency (fixture).
Confidence: HIGH that PR #8666 is not the cause of either bucket.

Recommended next action / owner: NodeSIG-dev for Bucket A (PR #8662 fix). E2E infra owner for Bucket B (serialize or scope route-table updates so concurrent scenarios on the latest-kubernetes-version-v2 cluster don't race on the same vnet-local route). PR author: do NOT block merge on these.
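
One way to picture the "serialize route-table updates" recommendation is a per-key mutex. This sketch assumes the racing scenarios share a single Go process; `keyedMutex` and `maxConcurrent` are hypothetical names and nothing here comes from the E2E framework.

```go
package main

import (
	"fmt"
	"sync"
)

// keyedMutex hands out one mutex per key, so updates to the same route
// table run one at a time while different tables stay parallel.
type keyedMutex struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

// lock acquires the mutex for key and returns its unlock function.
func (k *keyedMutex) lock(key string) func() {
	k.mu.Lock()
	if k.locks == nil {
		k.locks = make(map[string]*sync.Mutex)
	}
	m, ok := k.locks[key]
	if !ok {
		m = &sync.Mutex{}
		k.locks[key] = m
	}
	k.mu.Unlock()
	m.Lock()
	return m.Unlock
}

// maxConcurrent runs n simulated updates against one route table and
// returns the peak number of updates that were in flight together.
func maxConcurrent(n int) int {
	var km keyedMutex
	var wg sync.WaitGroup
	var statMu sync.Mutex
	inFlight, peak := 0, 0
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			unlock := km.lock("aks-agentpool-40918388-routetable")
			defer unlock()
			statMu.Lock()
			inFlight++
			if inFlight > peak {
				peak = inFlight
			}
			statMu.Unlock()
			// A real update (the PUT on the vnet-local route) would run here.
			statMu.Lock()
			inFlight--
			statMu.Unlock()
		}()
	}
	wg.Wait()
	return peak
}

func main() {
	fmt.Println(maxConcurrent(20)) // 1
}
```

An in-process lock would not help if scenarios run in separate processes or pipelines; in that case the serialization would have to live elsewhere.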

Posted by Clawpilot AgentBaker gate detective.

ganeshkumarashok and others added 2 commits June 9, 2026 19:47
Addresses the three Copilot review comments on #8666:

1. Rename the local `parts` slice to `versionParts` in datamodel.LoadConfig so it
   no longer shadows the imported `parts` package.

2. Add TestGPUImageRepo covering cuda / grid / grid-v20 repo parsing plus an
   explicit assertion that aks-gpu-grid-v20 is never mis-parsed as aks-gpu-grid,
   locking in the substring-collision fix.

3. Make the non-Ubuntu provisioning paths fail fast for grid-v20. RTX PRO 6000
   BSE v6 is Ubuntu-only and there is no v20 RPM (Mariner/AzureLinux) or sysext
   (Azure Container Linux). Previously grid-v20 fell through silently to the cuda
   path, installing the wrong driver on a vGPU node. cse_install_mariner.sh
   (downloadGPUDrivers) and cse_install_acl.sh (installGPUDriverSysext) now exit
   with ERR_NVIDIA_DRIVER_INSTALL and a clear message, with matching ShellSpec
   coverage. The Ubuntu container-image path (the supported path) is unchanged.

The post-install nvidia-gridd licensing check in cse_config.sh needs no change:
the install step now exits first for grid-v20 on Mariner/ACL, so that line is
never reached for these SKUs.
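
The fail-fast behaviour described in item 3 is implemented in shell; the same idea is sketched below in Go for illustration only. `checkDriverSupported`, `installError`, and the OS name strings are hypothetical, and the value 224 for ERR_NVIDIA_DRIVER_INSTALL is taken from the review discussion on this PR.

```go
package main

import (
	"errors"
	"fmt"
)

// errNvidiaDriverInstall stands in for the shell exit code
// ERR_NVIDIA_DRIVER_INSTALL (224 according to the review thread).
const errNvidiaDriverInstall = 224

// installError carries an exit code alongside a message.
type installError struct {
	code int
	msg  string
}

func (e *installError) Error() string {
	return fmt.Sprintf("exit %d: %s", e.code, e.msg)
}

// checkDriverSupported rejects grid-v20 on anything other than Ubuntu
// instead of letting it fall through to another driver path.
func checkDriverSupported(osName, driverType string) error {
	if driverType == "grid-v20" && osName != "ubuntu" {
		return &installError{
			code: errNvidiaDriverInstall,
			msg:  fmt.Sprintf("driver type %q is not supported on %s", driverType, osName),
		}
	}
	return nil
}

func main() {
	err := checkDriverSupported("mariner", "grid-v20")
	var ie *installError
	if errors.As(err, &ie) {
		fmt.Println(ie.code, ie.msg)
	}
	fmt.Println(checkDriverSupported("ubuntu", "grid-v20") == nil)
}
```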

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Resolves an add/add conflict in
spec/parts/linux/cloud-init/artifacts/cse_install_acl_spec.sh: main added an
ACL spec covering installSecureTLSBootstrapClientSysext while this branch added
one covering the grid-v20 guard in installGPUDriverSysext. Kept main's file and
integrated the 'installGPUDriverSysext grid vs cuda selection' Describe block
into it. The cse_install_acl.sh guard itself auto-merged cleanly.
Copilot AI review requested due to automatic review settings June 10, 2026 02:53

Copilot AI left a comment

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.

Comment thread spec/parts/linux/cloud-init/artifacts/cse_install_mariner_spec.sh Outdated
@aks-node-assistant

AgentBaker Linux PR gate — 236-failure run: shared cluster fleet outage continues (test-infra, NOT this PR)

  • Run: 167421825 (failed)
  • Failed task: Run AgentBaker E2E (full 60-minute timeout consumed)
  • Test summary: DONE 402 tests, 95 skipped, 236 failures in ~3616s (~59% failure rate; 0 fwupd hits)

Same shared cluster fleet outage affecting every concurrent PR in this window: 123× get or create cluster: failed to wait for cluster abe2e-kubenet-v5-150ee to be ready: context deadline exceeded. Earlier overnight runs hit ~11 min; current runs consume the full 60-min E2E timeout, indicating the fleet is worse, not recovering.

Cross-PR pattern this morning: PR #8652 build 167419663, PR #8679 build 167421198, PR #8294 build 167422687, and concurrent PRs all hit identical 236-fail / cluster-not-ready signature.

Build-vs-test: test-infra (shared cluster fleet outage), NOT product, NOT PR-caused.
This PR's exposure check: GPU NVIDIA GRID v20 driver support. No path to shared test cluster lifecycle.
Confidence: HIGH that PR #8666 is not the cause.

Recommended next action / owner: ⚠️ E2E infra / NodeSIG-dev — urgent shared cluster fleet restoration required (abe2e-kubenet-v5-*, abe2e-latest-kubernetes-version-v2-*, abe2e-azure-networkisolated-v2-*, abe2e-azure-v4-*, abe2e-azure-bootstrapprofile-cache-v2-*); clear ResourceGroupDeletionBlocked locks. PR gate is effectively offline until restored. PR author: rerun once fleet recovers.

Posted by Clawpilot AgentBaker gate detective.

ganeshkumarashok and others added 3 commits June 10, 2026 10:19
Addresses Copilot review feedback on the Mariner grid-v20 spec: the test
already sets ERR_NVIDIA_DRIVER_INSTALL, so assert the status against the
variable instead of a hard-coded 224 to avoid duplication and stay correct
if the constant ever changes (matches the ACL spec).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds Test_Ubuntu2404_GPU_RTXPro6000_GridV20 exercising the new grid-v20
driver path end-to-end: provisions a Standard_NC128ds_xl_RTXPRO6000BSE_v6
node on the Ubuntu 2404 VHD and asserts the aks-gpu-grid-v20 (595.x) driver
is installed via the new ValidateNvidiaGridV20DriverInstalled validator
(nvidia-smi driver_version must be 595.*). Pinned to a region with SKU
availability and quota.
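
The 595.* assertion can be sketched as a prefix check over the reported driver versions. This is not the actual `ValidateNvidiaGridV20DriverInstalled` code; `allDriversMatch` is a hypothetical helper, and the input is assumed to be one version per line, as printed by `nvidia-smi --query-gpu=driver_version --format=csv,noheader`.

```go
package main

import (
	"fmt"
	"strings"
)

// allDriversMatch reports whether every driver version in output starts
// with prefix. Empty output counts as a failure.
func allDriversMatch(output, prefix string) bool {
	versions := strings.Fields(output)
	if len(versions) == 0 {
		return false
	}
	for _, v := range versions {
		if !strings.HasPrefix(v, prefix) {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(allDriversMatch("595.58.03\n595.58.03\n", "595."))
}
```

Including the trailing dot in the prefix keeps a version such as `5950.1` from matching.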

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@aks-node-assistant

AgentBaker Linux PR gate — Test_Ubuntu2204_HTTPSProxy_PrivateDNS proxy fixture unreachable (NOT this PR)

  • Run: 167505019
  • Failed job: Run AgentBaker E2E (only HTTPSProxy_PrivateDNS subtests failed; all VHD builds passed)
  • Wiki signature: httpsproxy-fixture-proxy-unreachable (wiki)

Detective summary

CSE on the HTTPSProxy_PrivateDNS test VM exits 99 because apt-get update cannot reach the scenario's HTTP proxy. The retry loop logs (10 retries → exit):

vmssCSE exit status=99
W: Failed to fetch https://packages.microsoft.com/ubuntu/22.04/prod/dists/jammy/InRelease
   Could not connect to 10.14.0.162:8888 (10.14.0.162). - connect (113: No route to host)
   ...later attempts: connection timed out

Same signature as build 167493131 (which hit 10.14.0.193:8888 from a different fixture). Both addresses are in the scenario's proxy subnet (10.14.0.0/24); the proxy endpoint is part of the HTTPSProxy_PrivateDNS test fixture, not infrastructure this PR touches.

Classification: Test infrastructure / scenario fixture flakiness. Second occurrence of this signature.

Confidence: High. PR #8666 (NVIDIA GRID v20 driver support) only touches GPU-driver paths and does not modify CSE, apt-source config, or the proxy fixture. All VHD builds passed; only the proxy-dependent scenario failed.

Strongest alternative theory: A regression in CSE proxy-aware apt-get config introduced on main. Less likely because the failure is a network-level connect (113: No route to host) against a private RFC1918 endpoint inside the test VNet — i.e. the proxy itself is gone — not an apt config error.

Recommended next action / owner: No PR change required. AgentBaker E2E test-infra owner — confirm the HTTPSProxy_PrivateDNS fixture's proxy pod/daemon is running and reachable in the 10.14.0.0/24 test VNet. Recommend rerun of the failed leg only.

Evidence used: failed task log (3 === FAIL for HTTPSProxy_PrivateDNS subtests, CSE exit 99 with proxy connect refused / no route / timed out), all other E2E scenarios passed, all VHD builds passed, PR #8666 changes only touch GPU driver files.

Verified end-to-end against a real Standard_NC128ds_xl_RTXPRO6000BSE_v6 node:
both default and scriptless_nbc subtests pass. The node pulled
mcr.microsoft.com/aks/aks-gpu-grid-v20:595.58.03-20260609172331 and
nvidia-smi reported the 595.x driver (NVIDIA_GPU_DRIVER_TYPE=grid-v20).

- UseNVMe: true — RTX PRO 6000 BSE v6 only supports ephemeral OS disk on
  NvmeDisk placement (SupportedEphemeralOSDiskPlacements=NvmeDisk), not the
  default ResourceDisk, which returned 409 NotSupported.
- Pin to southeastasia, which has SKU availability and quota for this SKU.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 10, 2026 20:27

Copilot AI left a comment

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 1 comment.

Comment on lines +106 to +116
MOCK_VM_SKU=""
get_compute_sku() { echo "$MOCK_VM_SKU"; }

# Capture which sysext was selected and avoid real installs.
installACLGPUSysext() { echo "installACLGPUSysext $1"; }
systemd-tmpfiles() { return 0; }

# Mock should_use_nvidia_open_drivers to avoid IMDS dependency.
MOCK_OPEN_RET=0
should_use_nvidia_open_drivers() { return "$MOCK_OPEN_RET"; }

Each RTX PRO 6000 BSE v6 size ships as a ds (higher-memory) and lds
(lower-memory) pair that share the same GPU, e.g.
Standard_NC128ds_xl_RTXPRO6000BSE_v6 (512 GB) vs
Standard_NC128lds_xl_RTXPRO6000BSE_v6 (256 GB). Only the ds variants were
in RTXPro6000GPUDriverSizes, so an lds node would not match
useGridV20Drivers and would fall back to the cuda driver. Add the three
lds variants (128/256/320) so they also get the aks-gpu-grid-v20 (595.x)
driver, and extend the unit/ginkgo selection tests to cover an lds SKU.
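
The resulting six-entry allowlist can be pictured with a small generator. The commit names only the 128 ds/lds pair explicitly, so the 256 and 320 `lds` names below are inferred from that pattern. `rtxPro6000Sizes` is a hypothetical helper (the real `RTXPro6000GPUDriverSizes` may be a literal map), and lowercased keys are an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// rtxPro6000Sizes builds the ds and lds variants for each size.
func rtxPro6000Sizes() map[string]bool {
	sizes := make(map[string]bool)
	for _, cores := range []int{128, 256, 320} {
		for _, variant := range []string{"ds", "lds"} {
			name := fmt.Sprintf("Standard_NC%d%s_xl_RTXPRO6000BSE_v6", cores, variant)
			sizes[strings.ToLower(name)] = true
		}
	}
	return sizes
}

func main() {
	sizes := rtxPro6000Sizes()
	fmt.Println(len(sizes))
	fmt.Println(sizes[strings.ToLower("Standard_NC128lds_xl_RTXPRO6000BSE_v6")])
}
```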

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@aks-node-assistant

AgentBaker Linux PR gate — Test_Ubuntu2204_HTTPSProxy_PrivateDNS proxy fixture unreachable (NOT this PR)

  • Run: 167535509
  • Failed job: Run AgentBaker E2E (only HTTPSProxy_PrivateDNS subtests failed; all VHD builds passed)
  • Wiki signature: httpsproxy-fixture-proxy-unreachable (wiki)

Detective summary

Same pattern as builds 167493131, 167505019, and 167534982: vmssCSE exits 99 because apt-get update cannot reach the HTTPSProxy_PrivateDNS scenario's HTTP proxy at 10.14.0.193:8888. Fourth occurrence of this signature; approaching escalation threshold (>6 across 6 distinct build IDs).

Classification: Test infrastructure / scenario fixture flakiness.

Confidence: High. PR #8666 is NVIDIA GRID v20 GPU-driver support only — it does not touch CSE, apt, or the proxy fixture. No GPU scenarios are in the failure list; only the proxy-dependent HTTPSProxy_PrivateDNS scenario fails.

Strongest alternative theory: A transient regression in the proxy-aware CSE apt config. Less likely because the failure is a TCP-level connect refused/timed out against the private fixture proxy, not an apt config error.

Recommended next action / owner: No PR change required. AgentBaker E2E test-infra — this signature is now hitting on every build that exercises HTTPSProxy_PrivateDNS; please prioritize fixing the proxy fixture before it crosses the watcher's escalation threshold.

Evidence used: failed task log (3 === FAIL for HTTPSProxy_PrivateDNS, vmssCSE exit 99 with proxy at 10.14.0.193), all other E2E and all VHD builds passed.

@runzhen runzhen self-requested a review June 11, 2026 17:31
Comment thread e2e/scenario_test.go
@ganeshkumarashok ganeshkumarashok merged commit a45b90d into main Jun 11, 2026
39 of 44 checks passed
@ganeshkumarashok ganeshkumarashok deleted the gpu-grid-v20-driver-support branch June 11, 2026 20:13

Labels

components This pull request updates cached components on Linux or Windows VHDs


3 participants