feat(linux): emit per-step CSE timing events for GPU driver install by ganeshkumarashok · Pull Request #8679 · Azure/AgentBaker

ganeshkumarashok · 2026-06-10T02:20:17Z

What

configGPUDrivers() was wrapped as a single coarse CSE event (AKS.CSE.ensureGPUDrivers.configGPUDrivers), so the heavy GPU driver install steps had no per-step timing — the most expensive part of GPU node provisioning (image pull, driver compile/install, and the nvidia-smi readiness wait) was lumped into one duration.

This PR wraps those steps in logs_to_events so each is timed individually and shows up in CSE timing telemetry (and e2e/cse_timing.go):

New event (`AKS.CSE.configGPUDrivers.*`)	OS path	What it times
`pullGPUDriverImage`	Ubuntu	`ctr image pull` of the driver image (only on VHD cache miss)
`installGPUDriverImage`	Ubuntu	driver extract + compile + load via `/entrypoint.sh install`
`downloadGPUDrivers`	Mariner/AzureLinux	dnf install of the CUDA/GRID driver package
`installNvidiaContainerToolkit`	Mariner/AzureLinux	dnf install of the container toolkit
`installNvidiaContainerToolkitSysext`	ACL	oras pull of the toolkit sysext
`installGPUDriverSysext`	ACL	oras pull of the driver sysext
`waitForNvidiaModprobe`	all	`nvidia-modprobe` readiness retry loop
`waitForNvidiaSmi`	all	`nvidia-smi` readiness retry loop

Why

Per-step durations let us see where GPU provisioning time actually goes (cache-miss pull vs. compile vs. driver readiness) instead of one opaque number, which is what we need to drive down GPU node bring-up latency.

No behavior change (observability-only)

The CSE scripts do not use set -e; unchecked commands never aborted, so bare-wrapped calls keep the same semantics.
logs_to_events returns the wrapped command's exit code (0 on success) and writes the event before returning, so every existing control-flow path is preserved byte-for-byte:
- Ubuntu pull stays unchecked (a failed pull still falls through to the install, which then exits with the original message/code).
- The Ubuntu install keeps its ret=$? / echo "Failed to install GPU driver, exiting..." / exit $ERR_GPU_DRIVERS_START_FAIL block.
- Mariner/ACL install functions still exit internally on failure.
- Readiness waits keep || exit $ERR_GPU_DRIVERS_START_FAIL.
The Ubuntu install runs bash -c "...", which can't be passed through logs_to_events' word-split argument form, so it (and the pull, for symmetry) are extracted into small named functions with the commands kept identical.
Cheap steps (ldconfig, enableNvidiaPersistenceMode, mkdir, image rm, containerd/NPD restart) are intentionally left unwrapped to avoid event noise.

Testing / scope notes

go test ./pkg/agent/... and go test ./aks-node-controller/... pass; the new functions/commands are byte-identical to the originals.
Shellcheck: the change adds no new [[ ]]/==; it passes the generic verify_shell.sh rule set.
No e2e thresholds added: the perf scenarios in scenario_cse_perf_test.go run on non-GPU SKUs, so these GPU events never fire there; the DefaultTaskThreshold catch-all already tracks them in GPU e2e scenarios.
No new ShellSpec specs: these are thin observability wrappers with no logic change, and configGPUDrivers invokes real hardware/daemon commands (covered by e2e GPU scenarios, not unit specs).
generatedCSECommand reference snapshots are intentionally not regenerated — they are not asserted, are excluded from make generate, and recent cse_config.sh changes (incl. the prior GPU critical-path refactor) did not touch them.

Copilot

Pull request overview

This PR improves Linux GPU node provisioning observability by emitting fine-grained CSE timing events for the most expensive GPU driver installation steps, instead of reporting a single coarse duration for the entire configGPUDrivers() phase.

Changes:

Added Ubuntu helper functions so GPU driver image pull vs. install/compile can be timed separately via logs_to_events.
Wrapped Mariner/AzureLinux and Azure Container Linux GPU driver/toolkit install steps with per-step logs_to_events timing events.
Wrapped nvidia-modprobe and nvidia-smi readiness retry loops with per-step logs_to_events timing events.

ganeshkumarashok · 2026-06-10T02:26:47Z

Design note on the two extracted Ubuntu helpers (pullGPUDriverImage / installGPUDriverImage):

logs_to_events runs its wrapped command via unquoted ${@}, which word-splits the arguments. That's fine for simple commands, but the driver install is retrycmd_if_failure 5 10 600 bash -c "$CTR_GPU_INSTALL_CMD ... /entrypoint.sh install" — word-splitting would shred the bash -c "..." grouping. So the install (and the pull, for symmetry) are extracted into small named functions whose bodies are byte-identical to the original commands, and the functions are what we pass to logs_to_events.

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

aks-node-assistant · 2026-06-10T10:54:10Z

AgentBaker Linux PR gate — 236-failure run: shared cluster fleet outage continues (test-infra, NOT this PR)

Run: 167421198 (failed)
Failed task: Run AgentBaker E2E (full 60-minute timeout)
Test summary: DONE 402 tests, 95 skipped, 236 failures in 3617.299s (~59% failure rate; 0 fwupd hits)

Same shared cluster fleet outage affecting every concurrent PR in this window. Dominant signatures: get or create cluster: failed to wait for cluster abe2e-kubenet-v5-150ee to be ready: context deadline exceeded and ResourceGroupDeletionBlocked on shared MC RGs.

Cross-PR pattern in same window: PR #8652 build 167419663 and PR #8294 build 167422687 hit the same signature. Shared fleet degradation has continued and worsened from overnight (now consuming full 60-min timeout vs. 11 min earlier).

Build-vs-test: test-infra (shared cluster fleet outage), NOT product, NOT PR-caused.
This PR's exposure check: GPU CSE per-step timing events instrumentation — does not touch the shared test cluster fleet or AKS cluster lifecycle.
Confidence: HIGH that PR #8679 is not the cause.

Recommended next action / owner: ⚠️ E2E infra / NodeSIG-dev — same urgent shared-cluster-fleet restoration flagged on PR #8652 comment and prior runs. The entire PR gate is effectively offline until the shared abe2e-* clusters are restored. PR author: rerun once fleet is restored.

Posted by Clawpilot AgentBaker gate detective.

configGPUDrivers() was wrapped as a single coarse event (AKS.CSE.ensureGPUDrivers.configGPUDrivers), so the heavy GPU driver install steps had no per-step timing in CSE telemetry. Wrap the expensive steps in logs_to_events so each is timed individually: - Ubuntu: pullGPUDriverImage, installGPUDriverImage (driver image pull + compile/install). The install runs `bash -c "..."`, which cannot be passed through logs_to_events' word-split argument form, so both are extracted into small named functions. - Mariner/AzureLinux: downloadGPUDrivers, installNvidiaContainerToolkit. - ACL: installNvidiaContainerToolkitSysext, installGPUDriverSysext. - Common readiness waits: waitForNvidiaModprobe, waitForNvidiaSmi. This is observability-only with no behavior change: there is no `set -e` in the CSE scripts, logs_to_events returns the wrapped command's exit code (0 on success) and emits the event before returning, so every existing control-flow path (unchecked pull, `ret=$?` install check, `|| exit` readiness waits, and functions that exit internally) is preserved exactly. Cheap steps (ldconfig, enableNvidiaPersistenceMode, mkdir, image rm, containerd/NPD restart) are intentionally left unwrapped to avoid event noise. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Add a configGPUDrivers Describe block to cse_config_spec.sh that mocks logs_to_events to capture the emitted event names and stubs the side-effecting driver/daemon commands, then asserts the expected per-OS timing events: - Ubuntu: pullGPUDriverImage, installGPUDriverImage - Mariner/AzureLinux: downloadGPUDrivers, installNvidiaContainerToolkit - ACL: installNvidiaContainerToolkitSysext, installGPUDriverSysext - all paths: waitForNvidiaModprobe, waitForNvidiaSmi This locks in the telemetry contract so a wrapper removal or event rename is caught. Addresses the Copilot review suggestion. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

aks-node-assistant · 2026-06-10T23:01:56Z

AgentBaker Linux PR gate — Test, Scan, and Cleanup fails on every VHD (PR merge-ref race, NOT this PR)

Run: 167546471
Failed task: Test, Scan, and Cleanup (every VHD distro: 2204, 2204arm64, 2404, 2004fips, AzureLinuxV3, AzureLinuxV3ARM64fips, ACL variants — exit code 2 on each)
Wiki signature: pr-merge-ref-race (wiki)

Detective summary

./vhdbuilder/packer/test/run-test.sh fails on every VHD because the VM-side vhd-test Run-Command can't clone the PR merge ref:

fatal: couldn't find remote ref refs/pull/8679/merge
git-clone:Error: Failed to fetch remote branch 'refs/pull/8679/merge' into local branch 'refs-pull-8679-merge'

CIS scan and vhd-scanning.sh legs succeeded; only the test driver fails, and only on the git-fetch of the merge ref. Classic GitHub merge-ref race after auto-cancel/push regenerated the ref before the VM script reached it. Third occurrence of this signature.

Classification: Test infrastructure / PR merge-ref race. Not a PR-code regression.

Confidence: High. Identical fingerprint across all distros; PR #8679 changes only cse_config.sh + cse_config_spec.sh (per-step CSE timing events for GPU driver install) — nothing that could break git fetch server-side.

Strongest alternative theory: The new CSE config breaks the VM in a way that prevents Run-Command's git from succeeding. Less likely because Run-Command itself reports Enable succeeded, git is already the newest version, and the failure is fatal: couldn't find remote ref — a GitHub server-side ref-not-found.

Recommended next action / owner: No PR change required. Recommend rerun. Underlying fix tracked by AgentBaker E2E test-infra (race between GitHub merge-ref refresh and the VM-side vhd-test git fetch).

Evidence used: failed task log on every VHD distro (identical "couldn't find remote ref refs/pull/8679/merge" signature, run-test.sh exited with code 1), VHD scan + CIS legs succeeded.

Copilot AI review requested due to automatic review settings June 10, 2026 02:20

ganeshkumarashok requested review from AbelHu, Devinwong, SriHarsha001, awesomenix, calvin197, cameronmeissner, djsly, lilypan26, mxj220, pdamianov-dev, phealy, r2k1, sulixu, surajssd, timmy-wright and zachary-bailey as code owners June 10, 2026 02:20

Copilot started reviewing on behalf of ganeshkumarashok June 10, 2026 02:20 View session

Copilot AI reviewed Jun 10, 2026

View reviewed changes

Comment thread parts/linux/cloud-init/artifacts/cse_config.sh

ganeshkumarashok force-pushed the ganeshkumarashok/gpu-driver-cse-timing branch from b00dc7d to 5d91874 Compare June 10, 2026 02:26

Copilot AI review requested due to automatic review settings June 10, 2026 02:43

Copilot started reviewing on behalf of ganeshkumarashok June 10, 2026 02:44 View session

Copilot AI reviewed Jun 10, 2026

View reviewed changes

karenychen approved these changes Jun 10, 2026

View reviewed changes

runzhen approved these changes Jun 10, 2026

View reviewed changes

ganeshkumarashok and others added 2 commits June 10, 2026 14:48

ganeshkumarashok force-pushed the ganeshkumarashok/gpu-driver-cse-timing branch from 0f5f80f to f741c6a Compare June 10, 2026 21:48

ganeshkumarashok merged commit 33e1a8a into main Jun 10, 2026
20 of 21 checks passed

ganeshkumarashok deleted the ganeshkumarashok/gpu-driver-cse-timing branch June 10, 2026 21:49

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(linux): emit per-step CSE timing events for GPU driver install#8679

feat(linux): emit per-step CSE timing events for GPU driver install#8679
ganeshkumarashok merged 2 commits into
mainfrom
ganeshkumarashok/gpu-driver-cse-timing

ganeshkumarashok commented Jun 10, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

ganeshkumarashok commented Jun 10, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

aks-node-assistant Bot commented Jun 10, 2026

Uh oh!

Uh oh!

aks-node-assistant Bot commented Jun 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

ganeshkumarashok commented Jun 10, 2026

What

Why

No behavior change (observability-only)

Testing / scope notes

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

ganeshkumarashok commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

aks-node-assistant Bot commented Jun 10, 2026

Uh oh!

Uh oh!

aks-node-assistant Bot commented Jun 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

ganeshkumarashok commented Jun 10, 2026 •

edited

Loading