Skip to content

feat: build-only / install-skip-build modes for VHD-prebuilt GPU kernel module#162

Open
ganeshkumarashok wants to merge 3 commits into
mainfrom
gpu-prebuild-kernel-module
Open

feat: build-only / install-skip-build modes for VHD-prebuilt GPU kernel module#162
ganeshkumarashok wants to merge 3 commits into
mainfrom
gpu-prebuild-kernel-module

Conversation

@ganeshkumarashok

Copy link
Copy Markdown
Collaborator

What

Split install.sh so the expensive NVIDIA DKMS kernel-module compile can be hoisted to VHD build time, removing it from the node-provisioning critical path.

  • build_kernel_module (device-independent: nouveau blacklist, overlayfs + nvidia-installer --dkms, lib staging, ld config) is separated from device_init (device-dependent: nvidia-modprobe/nvidia-smi, fabric manager, containerd runtime config, dev-char symlinks).
  • New modes via entrypoint.sh:
    • build-only (AKSGPU_BUILD_ONLY=1): compile + DKMS-register + stage libs, no device access (safe on a GPU-less Packer builder), write /opt/azure/aks-gpu/dkms-marker.
    • install-skip-build (AKSGPU_SKIP_KERNEL_BUILD=1): skip recompile, run only device-init at node boot.
    • default install: unchanged legacy full path.
  • Marker re-validation (kernel + driver_version + driver_kind) gates the skip-build fast path; any mismatch (kernel upgrade, or a shared VHD whose baked driver kind/version differs) falls back to a full build.
  • remove_stale_baked_driver unloads loaded nvidia modules and removes a stale registered DKMS tree + relocated libs before a full rebuild, so a CUDA-baked VHD booting on a GRID node can't collide with the boot-time installer. No-op on today's VHDs.

Why

The nvidia-installer --dkms compile is ~100s on the GPU provisioning critical path. Pre-baking it into the VHD removes it from every node boot.

Consumer

AgentBaker uses these modes (build-only at VHD bake, install-skip-build at CSE): Azure/AgentBaker PR for gpu-prebake-cuda-driver.

Safety

Default behaviour is unchanged when neither env var is set and no marker exists.

ganeshkumarashok and others added 2 commits May 31, 2026 11:55
…uilt kernel module

Split the host-side driver install into two phases so the NVIDIA kernel module
can be DKMS-compiled into the VHD at image build time and the boot-time install
can skip straight to device init:

- install.sh: refactor into build_kernel_module() (compile + stage userspace
  libs, no device access) and device_init() (modprobe, nvidia-smi, fabric
  manager, containerd config, udev). Add AKSGPU_BUILD_ONLY and
  AKSGPU_SKIP_KERNEL_BUILD modes, an overlay cleanup trap, and a dkms-marker
  (/opt/azure/aks-gpu/dkms-marker) recording kernel, driver_version,
  driver_kind and arch so the consumer (AgentBaker CSE) can validate an exact
  match before taking the skip-build fast path.
- entrypoint.sh: add build-only and install-skip-build actions and pass the
  mode through to the host via nsenter. The default install action is
  unchanged.

This is the aks-gpu half of the AgentBaker change that prebuilds the GPU kernel
module into the VHD to reduce node provisioning time. Secure Boot module
signing and GPU e2e validation are still required.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The skip-kernel-build fast path now requires a matching dkms-marker
(kernel + driver_version + driver_kind); on mismatch it falls back to a
full build. Before any full (re)build, remove_stale_baked_driver unloads
loaded nvidia modules and removes a stale registered DKMS tree and its
relocated libs/loader config, so a CUDA-baked VHD booting on a GRID node
(or a version skew) cannot collide with the boot-time nvidia-installer.

No-op on today's VHDs (nothing baked / nothing registered).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…nt; fix EXIT trap

- entrypoint.sh: collapse the per-action cp/echo duplication into a single
  staging block; the case now only maps action -> install-mode env var.
- install.sh: the second `trap ... EXIT` (cleanup_overlay) was silently
  replacing the earlier `trap 'PS4="+ "' exit`; fold both into one EXIT trap.
- install.sh: only run remove_stale_baked_driver when a baked marker is present,
  so the default `install` path (non-prebaked VHD, no marker) is byte-for-byte
  the legacy behaviour and never touches DKMS cleanup. The cleanup still runs
  for the reachable mismatch case (marker present but kind/version/kernel differs).
- Correct comments that framed the mismatch path as defensive / "no-op on
  today's VHDs"; it is reachable because AgentBaker gates skip-build on marker
  presence alone.

No functional change to build-only or the matching skip-build fast path.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

@timmy-wright timmy-wright left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Squad Review — aks-gpu PR #162

Reviewers: Simon (defects) · Book (resilience) · Wash (testing) · Zoe (security: no findings)

The design is sound — separating build_kernel_module / device_init is the right shape. Two high-severity correctness bugs can brick nodes at scale; several medium-severity resilience gaps in the cleanup/stale-driver paths.

🔴 High

  • Trap misses partial overlay mount (install.sh ~line 90, inline) — OVERLAY_MOUNTED not set when overlay mount fails after tmpfs succeeds; EXIT trap skips cleanup.
  • Skip-build hard-fails on corrupt prebake (install.sh ~line 253, inline) — modinfo passes for corrupt modules; provisioning dies with no fallback rebuild.
  • Mode selection entirely untested (Wash) — no shell tests for build-only/install-skip-build/unknown-action. Wrong env propagation or accidental full build at VHD bake time is undetectable. Suggest bats tests with mocked nsenter.

🟡 Medium

  • cleanup_overlay clears flag on unmount failure (install.sh ~line 38, inline)
  • DKMS status parser captures trailing suffix (install.sh ~line 209, inline) — dkms remove becomes no-op for nvidia/ver: added lines.
  • Early return skips marker/lib cleanup when DKMS state absent (install.sh ~line 222, inline)
  • build-only not transactional around marker write (install.sh ~line 152, inline)
  • Unload failure ignored on active node (install.sh ~line 186, inline)
  • Marker format edge cases untested (Wash) — blank kernel=, CRLF, duplicate keys cause silent fallback rebuilds with no test coverage.
  • Fallback path not CI-coverable without GPU (Wash) — fully stubbable on a vanilla runner; build_kernel_module, dkms, modinfo, ldconfig can all be mocked.

🟢 Low

  • Observability thin for fast-path failures (Wash) — log which marker field mismatched, what DKMS versions were removed.

Comment thread install.sh
# add an extra layer of indirection via tmpfs because it's not possible to have an overlayfs on an overlayfs (i.e., inside a container)
mkdir /tmp/overlay
mount -t tmpfs tmpfs /tmp/overlay
mkdir /tmp/overlay/{workdir,lib64}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trap misses partial overlay mount (Simon)

If the tmpfs mount succeeds but the subsequent overlay mount fails, OVERLAY_MOUNTED is never set to 1 (that happens on the next line). The EXIT trap then skips cleanup_overlay, leaving /tmp/overlay mounted in the finished VHD.

Suggestion: set OVERLAY_MOUNTED=1 immediately after the tmpfs mount and make cleanup_overlay tolerate a missing overlay layer (check with mountpoint -q before unmounting each layer individually).

Comment thread install.sh
umount /tmp/overlay || true
rm -r /tmp/overlay || true
OVERLAY_MOUNTED=0
fi

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cleanup_overlay clears flag even when unmount fails (Book)

OVERLAY_MOUNTED=0 is set unconditionally inside the if block. If either umount call fails (e.g. mount is busy), the flag is cleared anyway, so the EXIT trap won't retry and a dangling mount can survive into the VHD.

Suggestion: only reset the flag after confirming the mountpoints are gone (e.g. mountpoint -q check, or inspect /proc/self/mountinfo).

Comment thread install.sh
cat > "${DKMS_MARKER_FILE}" <<EOF
kernel=${KERNEL_NAME}
driver_version=${DRIVER_VERSION}
driver_kind=${DRIVER_KIND}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

build-only path not transactional around marker write (Book)

If build_kernel_module succeeds but write_dkms_marker fails (disk full, permission error), the VHD ships with a prebaked DKMS/lib state but no marker. The boot-time path sees no marker, skips remove_stale_baked_driver, and the fresh build_kernel_module call collides with the stale prebaked DKMS state.

Suggestion: write the marker to a temp file first, then mv it atomically. On any marker-write failure, roll back the prebaked DKMS/lib state before exiting.

Comment thread install.sh
rmmod "${mod}" 2>/dev/null || modprobe -r "${mod}" 2>/dev/null || true
fi
done
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unload failure silently ignored on active node (Book)

All rmmod/modprobe -r failures are swallowed with || true. On an active GPU node these calls will likely fail due to open device handles — execution continues into dkms remove (also || true), which can leave a partially-unloaded, partially-deregistered driver stack.

Suggestion: after the unload loop, assert all target modules are actually gone. If any remain, abort with a clear error rather than attempting DKMS removal against a live module.

Comment thread install.sh
local registered_versions
registered_versions="$(dkms status 2>/dev/null \
| sed -n 's#^nvidia[,/][[:space:]]*\([^,]*\).*#\1#p' \
| sort -u || true)"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dkms status parser captures trailing suffix (Simon)

dkms status can emit lines like nvidia/570.86.15: added (in addition to the installed form). The current sed regex captures everything up to the first comma, so nvidia/570.86.15: added yields 570.86.15: added (colon + suffix included). dkms remove "nvidia/570.86.15: added" is a silent no-op under || true, leaving the stale DKMS registration.

Suggestion: stop capture at : or space too — e.g. \([^,: ]*\).

Comment thread install.sh

# Drop the baked install's relocated libs + loader config + marker so only the driver we
# are about to build ends up on the path. build_kernel_module recreates the libs/conf.
rm -f /etc/ld.so.conf.d/nvidia.conf || true

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Early return skips marker/lib cleanup when DKMS state already absent (Simon + Book)

If the marker exists but DKMS state is already partially missing (prior failed cleanup), registered_versions is empty and the function returns early — leaving the stale marker, ld.so.conf.d/nvidia.conf, and ${GPU_DEST}/lib64 on disk. The boot-time build_kernel_module then runs without having cleared those artefacts and may collide.

Suggestion: always remove marker/config/libs; gate only the dkms remove loop on non-empty registered_versions.

Comment thread install.sh
echo "aks-gpu: skip-kernel-build mode (using module prebuilt in VHD for kernel ${KERNEL_NAME})"
ldconfig
dkms status
modinfo -k "$KERNEL_NAME" nvidia

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Skip-build hard-fails on corrupt prebake instead of falling back (Simon + Book)

modinfo -k "$KERNEL_NAME" nvidia only checks file metadata — not whether the module is actually loadable. If the baked module or lib state is corrupt/incomplete, modinfo (or the subsequent dkms status) exits nonzero and node provisioning dies under set -e with no recovery. Every node booted from a bad VHD bricks.

Suggestion: wrap the fast-path validation in an explicit check. On any failure, fall back: remove_stale_baked_driver && build_kernel_module. The fast path should be opportunistic, not a hard requirement.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants