feat: build-only / install-skip-build modes for VHD-prebuilt GPU kernel module#162
feat: build-only / install-skip-build modes for VHD-prebuilt GPU kernel module#162ganeshkumarashok wants to merge 3 commits into
Conversation
…uilt kernel module Split the host-side driver install into two phases so the NVIDIA kernel module can be DKMS-compiled into the VHD at image build time and the boot-time install can skip straight to device init: - install.sh: refactor into build_kernel_module() (compile + stage userspace libs, no device access) and device_init() (modprobe, nvidia-smi, fabric manager, containerd config, udev). Add AKSGPU_BUILD_ONLY and AKSGPU_SKIP_KERNEL_BUILD modes, an overlay cleanup trap, and a dkms-marker (/opt/azure/aks-gpu/dkms-marker) recording kernel, driver_version, driver_kind and arch so the consumer (AgentBaker CSE) can validate an exact match before taking the skip-build fast path. - entrypoint.sh: add build-only and install-skip-build actions and pass the mode through to the host via nsenter. The default install action is unchanged. This is the aks-gpu half of the AgentBaker change that prebuilds the GPU kernel module into the VHD to reduce node provisioning time. Secure Boot module signing and GPU e2e validation are still required. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The skip-kernel-build fast path now requires a matching dkms-marker (kernel + driver_version + driver_kind); on mismatch it falls back to a full build. Before any full (re)build, remove_stale_baked_driver unloads loaded nvidia modules and removes a stale registered DKMS tree and its relocated libs/loader config, so a CUDA-baked VHD booting on a GRID node (or a version skew) cannot collide with the boot-time nvidia-installer. No-op on today's VHDs (nothing baked / nothing registered). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…nt; fix EXIT trap - entrypoint.sh: collapse the per-action cp/echo duplication into a single staging block; the case now only maps action -> install-mode env var. - install.sh: the second `trap ... EXIT` (cleanup_overlay) was silently replacing the earlier `trap 'PS4="+ "' exit`; fold both into one EXIT trap. - install.sh: only run remove_stale_baked_driver when a baked marker is present, so the default `install` path (non-prebaked VHD, no marker) is byte-for-byte the legacy behaviour and never touches DKMS cleanup. The cleanup still runs for the reachable mismatch case (marker present but kind/version/kernel differs). - Correct comments that framed the mismatch path as defensive / "no-op on today's VHDs"; it is reachable because AgentBaker gates skip-build on marker presence alone. No functional change to build-only or the matching skip-build fast path. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
timmy-wright
left a comment
There was a problem hiding this comment.
Squad Review — aks-gpu PR #162
Reviewers: Simon (defects) · Book (resilience) · Wash (testing) · Zoe (security: no findings)
The design is sound — separating build_kernel_module / device_init is the right shape. Two high-severity correctness bugs can brick nodes at scale; several medium-severity resilience gaps in the cleanup/stale-driver paths.
🔴 High
- Trap misses partial overlay mount (
install.sh~line 90, inline) —OVERLAY_MOUNTEDnot set when overlay mount fails after tmpfs succeeds; EXIT trap skips cleanup. - Skip-build hard-fails on corrupt prebake (
install.sh~line 253, inline) —modinfopasses for corrupt modules; provisioning dies with no fallback rebuild. - Mode selection entirely untested (Wash) — no shell tests for
build-only/install-skip-build/unknown-action. Wrong env propagation or accidental full build at VHD bake time is undetectable. Suggest bats tests with mockednsenter.
🟡 Medium
cleanup_overlayclears flag on unmount failure (install.sh~line 38, inline)- DKMS status parser captures trailing suffix (
install.sh~line 209, inline) —dkms removebecomes no-op fornvidia/ver: addedlines. - Early return skips marker/lib cleanup when DKMS state absent (
install.sh~line 222, inline) - build-only not transactional around marker write (
install.sh~line 152, inline) - Unload failure ignored on active node (
install.sh~line 186, inline) - Marker format edge cases untested (Wash) — blank
kernel=, CRLF, duplicate keys cause silent fallback rebuilds with no test coverage. - Fallback path not CI-coverable without GPU (Wash) — fully stubbable on a vanilla runner;
build_kernel_module,dkms,modinfo,ldconfigcan all be mocked.
🟢 Low
- Observability thin for fast-path failures (Wash) — log which marker field mismatched, what DKMS versions were removed.
| # add an extra layer of indirection via tmpfs because it's not possible to have an overlayfs on an overlayfs (i.e., inside a container) | ||
| mkdir /tmp/overlay | ||
| mount -t tmpfs tmpfs /tmp/overlay | ||
| mkdir /tmp/overlay/{workdir,lib64} |
There was a problem hiding this comment.
Trap misses partial overlay mount (Simon)
If the tmpfs mount succeeds but the subsequent overlay mount fails, OVERLAY_MOUNTED is never set to 1 (that happens on the next line). The EXIT trap then skips cleanup_overlay, leaving /tmp/overlay mounted in the finished VHD.
Suggestion: set OVERLAY_MOUNTED=1 immediately after the tmpfs mount and make cleanup_overlay tolerate a missing overlay layer (check with mountpoint -q before unmounting each layer individually).
| umount /tmp/overlay || true | ||
| rm -r /tmp/overlay || true | ||
| OVERLAY_MOUNTED=0 | ||
| fi |
There was a problem hiding this comment.
cleanup_overlay clears flag even when unmount fails (Book)
OVERLAY_MOUNTED=0 is set unconditionally inside the if block. If either umount call fails (e.g. mount is busy), the flag is cleared anyway, so the EXIT trap won't retry and a dangling mount can survive into the VHD.
Suggestion: only reset the flag after confirming the mountpoints are gone (e.g. mountpoint -q check, or inspect /proc/self/mountinfo).
| cat > "${DKMS_MARKER_FILE}" <<EOF | ||
| kernel=${KERNEL_NAME} | ||
| driver_version=${DRIVER_VERSION} | ||
| driver_kind=${DRIVER_KIND} |
There was a problem hiding this comment.
build-only path not transactional around marker write (Book)
If build_kernel_module succeeds but write_dkms_marker fails (disk full, permission error), the VHD ships with a prebaked DKMS/lib state but no marker. The boot-time path sees no marker, skips remove_stale_baked_driver, and the fresh build_kernel_module call collides with the stale prebaked DKMS state.
Suggestion: write the marker to a temp file first, then mv it atomically. On any marker-write failure, roll back the prebaked DKMS/lib state before exiting.
| rmmod "${mod}" 2>/dev/null || modprobe -r "${mod}" 2>/dev/null || true | ||
| fi | ||
| done | ||
| } |
There was a problem hiding this comment.
Unload failure silently ignored on active node (Book)
All rmmod/modprobe -r failures are swallowed with || true. On an active GPU node these calls will likely fail due to open device handles — execution continues into dkms remove (also || true), which can leave a partially-unloaded, partially-deregistered driver stack.
Suggestion: after the unload loop, assert all target modules are actually gone. If any remain, abort with a clear error rather than attempting DKMS removal against a live module.
| local registered_versions | ||
| registered_versions="$(dkms status 2>/dev/null \ | ||
| | sed -n 's#^nvidia[,/][[:space:]]*\([^,]*\).*#\1#p' \ | ||
| | sort -u || true)" |
There was a problem hiding this comment.
dkms status parser captures trailing suffix (Simon)
dkms status can emit lines like nvidia/570.86.15: added (in addition to the installed form). The current sed regex captures everything up to the first comma, so nvidia/570.86.15: added yields 570.86.15: added (colon + suffix included). dkms remove "nvidia/570.86.15: added" is a silent no-op under || true, leaving the stale DKMS registration.
Suggestion: stop capture at : or space too — e.g. \([^,: ]*\).
|
|
||
| # Drop the baked install's relocated libs + loader config + marker so only the driver we | ||
| # are about to build ends up on the path. build_kernel_module recreates the libs/conf. | ||
| rm -f /etc/ld.so.conf.d/nvidia.conf || true |
There was a problem hiding this comment.
Early return skips marker/lib cleanup when DKMS state already absent (Simon + Book)
If the marker exists but DKMS state is already partially missing (prior failed cleanup), registered_versions is empty and the function returns early — leaving the stale marker, ld.so.conf.d/nvidia.conf, and ${GPU_DEST}/lib64 on disk. The boot-time build_kernel_module then runs without having cleared those artefacts and may collide.
Suggestion: always remove marker/config/libs; gate only the dkms remove loop on non-empty registered_versions.
| echo "aks-gpu: skip-kernel-build mode (using module prebuilt in VHD for kernel ${KERNEL_NAME})" | ||
| ldconfig | ||
| dkms status | ||
| modinfo -k "$KERNEL_NAME" nvidia |
There was a problem hiding this comment.
Skip-build hard-fails on corrupt prebake instead of falling back (Simon + Book)
modinfo -k "$KERNEL_NAME" nvidia only checks file metadata — not whether the module is actually loadable. If the baked module or lib state is corrupt/incomplete, modinfo (or the subsequent dkms status) exits nonzero and node provisioning dies under set -e with no recovery. Every node booted from a bad VHD bricks.
Suggestion: wrap the fast-path validation in an explicit check. On any failure, fall back: remove_stale_baked_driver && build_kernel_module. The fast path should be opportunistic, not a hard requirement.
What
Split
install.shso the expensive NVIDIA DKMS kernel-module compile can be hoisted to VHD build time, removing it from the node-provisioning critical path.build_kernel_module(device-independent: nouveau blacklist, overlayfs +nvidia-installer --dkms, lib staging, ld config) is separated fromdevice_init(device-dependent:nvidia-modprobe/nvidia-smi, fabric manager, containerd runtime config, dev-char symlinks).entrypoint.sh:build-only(AKSGPU_BUILD_ONLY=1): compile + DKMS-register + stage libs, no device access (safe on a GPU-less Packer builder), write/opt/azure/aks-gpu/dkms-marker.install-skip-build(AKSGPU_SKIP_KERNEL_BUILD=1): skip recompile, run only device-init at node boot.install: unchanged legacy full path.remove_stale_baked_driverunloads loaded nvidia modules and removes a stale registered DKMS tree + relocated libs before a full rebuild, so a CUDA-baked VHD booting on a GRID node can't collide with the boot-time installer. No-op on today's VHDs.Why
The
nvidia-installer --dkmscompile is ~100s on the GPU provisioning critical path. Pre-baking it into the VHD removes it from every node boot.Consumer
AgentBaker uses these modes (build-only at VHD bake, install-skip-build at CSE): Azure/AgentBaker PR for
gpu-prebake-cuda-driver.Safety
Default behaviour is unchanged when neither env var is set and no marker exists.