fix(cse): expand NODE_NAME in DRA driver systemd unit override #8676
Copilot wants to merge 5 commits into
Co-authored-by: runzhen <32292691+runzhen@users.noreply.github.com>
AgentBaker Linux PR gate — 236-failure mass run: shared cluster proxy-pod readiness exhaustion (test-infra, NOT this PR)
Dominant failure (same as 6 concurrent runs in this window): Every failing scenario times out at the

Cross-PR pattern (same window): identical 236-failure shape on builds 167387444 (PR #8652), 167387406 (PR #8294), 167387387 (PR #8600), 167393232 (PR #8669), and earlier on 167378787 (PR #8618). Six unrelated PRs in the same ~3h window all hit the same shared-cluster-fleet readiness exhaustion.

Build-vs-test: test-infra (shared cluster pool), NOT product, NOT PR-caused.

Recommended next action / owner: E2E infra / NodeSIG-dev. The shared cluster fleet (e2e-proxy DaemonSet readiness + kubenet-v5 RG lifecycle) is overloaded. This is a draft/WIP PR; rerun once the fleet stabilizes.

Posted by Clawpilot AgentBaker gate detective.
The startDRADriverNvidiaGpu function wrote the systemd unit override with a single-quoted heredoc (<<'EOF'), suppressing variable expansion. As a result, ${NODE_NAME} was written literally into the drop-in file instead of the actual hostname, causing gpu-kubelet-plugin to receive an empty/undefined node name and breaking DRA driver registration.

Change
parts/linux/cloud-init/artifacts/cse_config.sh: Changed <<'EOF' → <<EOF in startDRADriverNvidiaGpu so ${NODE_NAME} (set to $(hostname) at script init) is expanded at CSE execution time.
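
The difference between the two heredoc forms can be reproduced in isolation. The following is a minimal sketch, not the code from cse_config.sh: the node name "example-node" and the Environment= line are stand-ins chosen for illustration, and the real drop-in contents may differ.

```shell
#!/usr/bin/env bash
# Illustrative only: NODE_NAME's value and the Environment= line are
# assumptions for this demo, not the actual DRA driver drop-in contents.
NODE_NAME="example-node"

# Quoted delimiter ('EOF'): the body is taken literally, so ${NODE_NAME}
# reaches the output unexpanded.
quoted=$(cat <<'EOF'
Environment=NODE_NAME=${NODE_NAME}
EOF
)

# Unquoted delimiter (EOF): the shell expands ${NODE_NAME} while reading
# the heredoc body, so the output carries the actual value.
unquoted=$(cat <<EOF
Environment=NODE_NAME=${NODE_NAME}
EOF
)

echo "quoted:   ${quoted}"
echo "unquoted: ${unquoted}"
```

One caveat with the unquoted form: every $ in the heredoc body is now subject to shell expansion, so any dollar sign that should reach the unit file literally would need to be escaped as \$.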