feat(linux): emit per-step CSE timing events for GPU driver install#8679
Conversation
There was a problem hiding this comment.
Pull request overview
This PR improves Linux GPU node provisioning observability by emitting fine-grained CSE timing events for the most expensive GPU driver installation steps, instead of reporting a single coarse duration for the entire configGPUDrivers() phase.
Changes:
- Added Ubuntu helper functions so GPU driver image pull vs. install/compile can be timed separately via
logs_to_events. - Wrapped Mariner/AzureLinux and Azure Container Linux GPU driver/toolkit install steps with per-step
logs_to_eventstiming events. - Wrapped
nvidia-modprobeandnvidia-smireadiness retry loops with per-steplogs_to_eventstiming events.
b00dc7d to
5d91874
Compare
|
Design note on the two extracted Ubuntu helpers (
|
|
AgentBaker Linux PR gate — 236-failure run: shared cluster fleet outage continues (test-infra, NOT this PR)
Same shared cluster fleet outage affecting every concurrent PR in this window. Dominant signatures: Cross-PR pattern in same window: PR #8652 build 167419663 and PR #8294 build 167422687 hit the same signature. Shared fleet degradation has continued and worsened from overnight (now consuming full 60-min timeout vs. 11 min earlier). Build-vs-test: test-infra (shared cluster fleet outage), NOT product, NOT PR-caused. Recommended next action / owner: Posted by Clawpilot AgentBaker gate detective. |
configGPUDrivers() was wrapped as a single coarse event (AKS.CSE.ensureGPUDrivers.configGPUDrivers), so the heavy GPU driver install steps had no per-step timing in CSE telemetry. Wrap the expensive steps in logs_to_events so each is timed individually: - Ubuntu: pullGPUDriverImage, installGPUDriverImage (driver image pull + compile/install). The install runs `bash -c "..."`, which cannot be passed through logs_to_events' word-split argument form, so both are extracted into small named functions. - Mariner/AzureLinux: downloadGPUDrivers, installNvidiaContainerToolkit. - ACL: installNvidiaContainerToolkitSysext, installGPUDriverSysext. - Common readiness waits: waitForNvidiaModprobe, waitForNvidiaSmi. This is observability-only with no behavior change: there is no `set -e` in the CSE scripts, logs_to_events returns the wrapped command's exit code (0 on success) and emits the event before returning, so every existing control-flow path (unchecked pull, `ret=$?` install check, `|| exit` readiness waits, and functions that exit internally) is preserved exactly. Cheap steps (ldconfig, enableNvidiaPersistenceMode, mkdir, image rm, containerd/NPD restart) are intentionally left unwrapped to avoid event noise. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add a configGPUDrivers Describe block to cse_config_spec.sh that mocks logs_to_events to capture the emitted event names and stubs the side-effecting driver/daemon commands, then asserts the expected per-OS timing events: - Ubuntu: pullGPUDriverImage, installGPUDriverImage - Mariner/AzureLinux: downloadGPUDrivers, installNvidiaContainerToolkit - ACL: installNvidiaContainerToolkitSysext, installGPUDriverSysext - all paths: waitForNvidiaModprobe, waitForNvidiaSmi This locks in the telemetry contract so a wrapper removal or event rename is caught. Addresses the Copilot review suggestion. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
0f5f80f to
f741c6a
Compare
|
AgentBaker Linux PR gate — Test, Scan, and Cleanup fails on every VHD (PR merge-ref race, NOT this PR)
Detective summary
CIS scan and Classification: Test infrastructure / PR merge-ref race. Not a PR-code regression. Confidence: High. Identical fingerprint across all distros; PR #8679 changes only Strongest alternative theory: The new CSE config breaks the VM in a way that prevents Run-Command's git from succeeding. Less likely because Run-Command itself reports Recommended next action / owner: No PR change required. Recommend rerun. Underlying fix tracked by AgentBaker E2E test-infra (race between GitHub merge-ref refresh and the VM-side Evidence used: failed task log on every VHD distro (identical "couldn't find remote ref refs/pull/8679/merge" signature, |
What
configGPUDrivers()was wrapped as a single coarse CSE event (AKS.CSE.ensureGPUDrivers.configGPUDrivers), so the heavy GPU driver install steps had no per-step timing — the most expensive part of GPU node provisioning (image pull, driver compile/install, and thenvidia-smireadiness wait) was lumped into one duration.This PR wraps those steps in
logs_to_eventsso each is timed individually and shows up in CSE timing telemetry (ande2e/cse_timing.go):AKS.CSE.configGPUDrivers.*)pullGPUDriverImagectr image pullof the driver image (only on VHD cache miss)installGPUDriverImage/entrypoint.sh installdownloadGPUDriversinstallNvidiaContainerToolkitinstallNvidiaContainerToolkitSysextinstallGPUDriverSysextwaitForNvidiaModprobenvidia-modprobereadiness retry loopwaitForNvidiaSminvidia-smireadiness retry loopWhy
Per-step durations let us see where GPU provisioning time actually goes (cache-miss pull vs. compile vs. driver readiness) instead of one opaque number, which is what we need to drive down GPU node bring-up latency.
No behavior change (observability-only)
set -e; unchecked commands never aborted, so bare-wrapped calls keep the same semantics.logs_to_eventsreturns the wrapped command's exit code (0on success) and writes the event before returning, so every existing control-flow path is preserved byte-for-byte:ret=$?/echo "Failed to install GPU driver, exiting..."/exit $ERR_GPU_DRIVERS_START_FAILblock.exitinternally on failure.|| exit $ERR_GPU_DRIVERS_START_FAIL.bash -c "...", which can't be passed throughlogs_to_events' word-split argument form, so it (and the pull, for symmetry) are extracted into small named functions with the commands kept identical.ldconfig,enableNvidiaPersistenceMode,mkdir, imagerm, containerd/NPD restart) are intentionally left unwrapped to avoid event noise.Testing / scope notes
go test ./pkg/agent/...andgo test ./aks-node-controller/...pass; the new functions/commands are byte-identical to the originals.[[ ]]/==; it passes the genericverify_shell.shrule set.scenario_cse_perf_test.gorun on non-GPU SKUs, so these GPU events never fire there; theDefaultTaskThresholdcatch-all already tracks them in GPU e2e scenarios.configGPUDriversinvokes real hardware/daemon commands (covered by e2e GPU scenarios, not unit specs).generatedCSECommandreference snapshots are intentionally not regenerated — they are not asserted, are excluded frommake generate, and recentcse_config.shchanges (incl. the prior GPU critical-path refactor) did not touch them.