Skip to content

feat(linux): emit per-step CSE timing events for GPU driver install#8679

Merged
ganeshkumarashok merged 2 commits into
mainfrom
ganeshkumarashok/gpu-driver-cse-timing
Jun 10, 2026
Merged

feat(linux): emit per-step CSE timing events for GPU driver install#8679
ganeshkumarashok merged 2 commits into
mainfrom
ganeshkumarashok/gpu-driver-cse-timing

Conversation

@ganeshkumarashok

Copy link
Copy Markdown
Contributor

What

configGPUDrivers() was wrapped as a single coarse CSE event (AKS.CSE.ensureGPUDrivers.configGPUDrivers), so the heavy GPU driver install steps had no per-step timing — the most expensive part of GPU node provisioning (image pull, driver compile/install, and the nvidia-smi readiness wait) was lumped into one duration.

This PR wraps those steps in logs_to_events so each is timed individually and shows up in CSE timing telemetry (and e2e/cse_timing.go):

New event (AKS.CSE.configGPUDrivers.*) OS path What it times
pullGPUDriverImage Ubuntu ctr image pull of the driver image (only on VHD cache miss)
installGPUDriverImage Ubuntu driver extract + compile + load via /entrypoint.sh install
downloadGPUDrivers Mariner/AzureLinux dnf install of the CUDA/GRID driver package
installNvidiaContainerToolkit Mariner/AzureLinux dnf install of the container toolkit
installNvidiaContainerToolkitSysext ACL oras pull of the toolkit sysext
installGPUDriverSysext ACL oras pull of the driver sysext
waitForNvidiaModprobe all nvidia-modprobe readiness retry loop
waitForNvidiaSmi all nvidia-smi readiness retry loop

Why

Per-step durations let us see where GPU provisioning time actually goes (cache-miss pull vs. compile vs. driver readiness) instead of one opaque number, which is what we need to drive down GPU node bring-up latency.

No behavior change (observability-only)

  • The CSE scripts do not use set -e; unchecked commands never aborted, so bare-wrapped calls keep the same semantics.
  • logs_to_events returns the wrapped command's exit code (0 on success) and writes the event before returning, so every existing control-flow path is preserved byte-for-byte:
    • Ubuntu pull stays unchecked (a failed pull still falls through to the install, which then exits with the original message/code).
    • The Ubuntu install keeps its ret=$? / echo "Failed to install GPU driver, exiting..." / exit $ERR_GPU_DRIVERS_START_FAIL block.
    • Mariner/ACL install functions still exit internally on failure.
    • Readiness waits keep || exit $ERR_GPU_DRIVERS_START_FAIL.
  • The Ubuntu install runs bash -c "...", which can't be passed through logs_to_events' word-split argument form, so it (and the pull, for symmetry) are extracted into small named functions with the commands kept identical.
  • Cheap steps (ldconfig, enableNvidiaPersistenceMode, mkdir, image rm, containerd/NPD restart) are intentionally left unwrapped to avoid event noise.

Testing / scope notes

  • go test ./pkg/agent/... and go test ./aks-node-controller/... pass; the new functions/commands are byte-identical to the originals.
  • Shellcheck: the change adds no new [[ ]]/==; it passes the generic verify_shell.sh rule set.
  • No e2e thresholds added: the perf scenarios in scenario_cse_perf_test.go run on non-GPU SKUs, so these GPU events never fire there; the DefaultTaskThreshold catch-all already tracks them in GPU e2e scenarios.
  • No new ShellSpec specs: these are thin observability wrappers with no logic change, and configGPUDrivers invokes real hardware/daemon commands (covered by e2e GPU scenarios, not unit specs).
  • generatedCSECommand reference snapshots are intentionally not regenerated — they are not asserted, are excluded from make generate, and recent cse_config.sh changes (incl. the prior GPU critical-path refactor) did not touch them.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves Linux GPU node provisioning observability by emitting fine-grained CSE timing events for the most expensive GPU driver installation steps, instead of reporting a single coarse duration for the entire configGPUDrivers() phase.

Changes:

  • Added Ubuntu helper functions so GPU driver image pull vs. install/compile can be timed separately via logs_to_events.
  • Wrapped Mariner/AzureLinux and Azure Container Linux GPU driver/toolkit install steps with per-step logs_to_events timing events.
  • Wrapped nvidia-modprobe and nvidia-smi readiness retry loops with per-step logs_to_events timing events.

Comment thread parts/linux/cloud-init/artifacts/cse_config.sh
@ganeshkumarashok ganeshkumarashok force-pushed the ganeshkumarashok/gpu-driver-cse-timing branch from b00dc7d to 5d91874 Compare June 10, 2026 02:26
@ganeshkumarashok

ganeshkumarashok commented Jun 10, 2026

Copy link
Copy Markdown
Contributor Author

Design note on the two extracted Ubuntu helpers (pullGPUDriverImage / installGPUDriverImage):

logs_to_events runs its wrapped command via unquoted ${@}, which word-splits the arguments. That's fine for simple commands, but the driver install is retrycmd_if_failure 5 10 600 bash -c "$CTR_GPU_INSTALL_CMD ... /entrypoint.sh install" — word-splitting would shred the bash -c "..." grouping. So the install (and the pull, for symmetry) are extracted into small named functions whose bodies are byte-identical to the original commands, and the functions are what we pass to logs_to_events.

Copilot AI review requested due to automatic review settings June 10, 2026 02:43

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

@aks-node-assistant

Copy link
Copy Markdown
Contributor

AgentBaker Linux PR gate — 236-failure run: shared cluster fleet outage continues (test-infra, NOT this PR)

  • Run: 167421198 (failed)
  • Failed task: Run AgentBaker E2E (full 60-minute timeout)
  • Test summary: DONE 402 tests, 95 skipped, 236 failures in 3617.299s (~59% failure rate; 0 fwupd hits)

Same shared cluster fleet outage affecting every concurrent PR in this window. Dominant signatures: get or create cluster: failed to wait for cluster abe2e-kubenet-v5-150ee to be ready: context deadline exceeded and ResourceGroupDeletionBlocked on shared MC RGs.

Cross-PR pattern in same window: PR #8652 build 167419663 and PR #8294 build 167422687 hit the same signature. Shared fleet degradation has continued and worsened from overnight (now consuming full 60-min timeout vs. 11 min earlier).

Build-vs-test: test-infra (shared cluster fleet outage), NOT product, NOT PR-caused.
This PR's exposure check: GPU CSE per-step timing events instrumentation — does not touch the shared test cluster fleet or AKS cluster lifecycle.
Confidence: HIGH that PR #8679 is not the cause.

Recommended next action / owner: ⚠️ E2E infra / NodeSIG-dev — same urgent shared-cluster-fleet restoration flagged on PR #8652 comment and prior runs. The entire PR gate is effectively offline until the shared abe2e-* clusters are restored. PR author: rerun once fleet is restored.

Posted by Clawpilot AgentBaker gate detective.

ganeshkumarashok and others added 2 commits June 10, 2026 14:48
configGPUDrivers() was wrapped as a single coarse event
(AKS.CSE.ensureGPUDrivers.configGPUDrivers), so the heavy GPU driver
install steps had no per-step timing in CSE telemetry.

Wrap the expensive steps in logs_to_events so each is timed individually:
- Ubuntu: pullGPUDriverImage, installGPUDriverImage (driver image
  pull + compile/install). The install runs `bash -c "..."`, which
  cannot be passed through logs_to_events' word-split argument form,
  so both are extracted into small named functions.
- Mariner/AzureLinux: downloadGPUDrivers, installNvidiaContainerToolkit.
- ACL: installNvidiaContainerToolkitSysext, installGPUDriverSysext.
- Common readiness waits: waitForNvidiaModprobe, waitForNvidiaSmi.

This is observability-only with no behavior change: there is no `set -e`
in the CSE scripts, logs_to_events returns the wrapped command's exit
code (0 on success) and emits the event before returning, so every
existing control-flow path (unchecked pull, `ret=$?` install check,
`|| exit` readiness waits, and functions that exit internally) is
preserved exactly. Cheap steps (ldconfig, enableNvidiaPersistenceMode,
mkdir, image rm, containerd/NPD restart) are intentionally left
unwrapped to avoid event noise.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add a configGPUDrivers Describe block to cse_config_spec.sh that mocks
logs_to_events to capture the emitted event names and stubs the
side-effecting driver/daemon commands, then asserts the expected
per-OS timing events:
- Ubuntu: pullGPUDriverImage, installGPUDriverImage
- Mariner/AzureLinux: downloadGPUDrivers, installNvidiaContainerToolkit
- ACL: installNvidiaContainerToolkitSysext, installGPUDriverSysext
- all paths: waitForNvidiaModprobe, waitForNvidiaSmi

This locks in the telemetry contract so a wrapper removal or event
rename is caught. Addresses the Copilot review suggestion.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@ganeshkumarashok ganeshkumarashok force-pushed the ganeshkumarashok/gpu-driver-cse-timing branch from 0f5f80f to f741c6a Compare June 10, 2026 21:48
@ganeshkumarashok ganeshkumarashok merged commit 33e1a8a into main Jun 10, 2026
20 of 21 checks passed
@ganeshkumarashok ganeshkumarashok deleted the ganeshkumarashok/gpu-driver-cse-timing branch June 10, 2026 21:49
@aks-node-assistant

Copy link
Copy Markdown
Contributor

AgentBaker Linux PR gate — Test, Scan, and Cleanup fails on every VHD (PR merge-ref race, NOT this PR)

  • Run: 167546471
  • Failed task: Test, Scan, and Cleanup (every VHD distro: 2204, 2204arm64, 2404, 2004fips, AzureLinuxV3, AzureLinuxV3ARM64fips, ACL variants — exit code 2 on each)
  • Wiki signature: pr-merge-ref-race (wiki)

Detective summary

./vhdbuilder/packer/test/run-test.sh fails on every VHD because the VM-side vhd-test Run-Command can't clone the PR merge ref:

fatal: couldn't find remote ref refs/pull/8679/merge
git-clone:Error: Failed to fetch remote branch 'refs/pull/8679/merge' into local branch 'refs-pull-8679-merge'

CIS scan and vhd-scanning.sh legs succeeded; only the test driver fails, and only on the git-fetch of the merge ref. Classic GitHub merge-ref race after auto-cancel/push regenerated the ref before the VM script reached it. Third occurrence of this signature.

Classification: Test infrastructure / PR merge-ref race. Not a PR-code regression.

Confidence: High. Identical fingerprint across all distros; PR #8679 changes only cse_config.sh + cse_config_spec.sh (per-step CSE timing events for GPU driver install) — nothing that could break git fetch server-side.

Strongest alternative theory: The new CSE config breaks the VM in a way that prevents Run-Command's git from succeeding. Less likely because Run-Command itself reports Enable succeeded, git is already the newest version, and the failure is fatal: couldn't find remote ref — a GitHub server-side ref-not-found.

Recommended next action / owner: No PR change required. Recommend rerun. Underlying fix tracked by AgentBaker E2E test-infra (race between GitHub merge-ref refresh and the VM-side vhd-test git fetch).

Evidence used: failed task log on every VHD distro (identical "couldn't find remote ref refs/pull/8679/merge" signature, run-test.sh exited with code 1), VHD scan + CIS legs succeeded.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants