fix(cse): expand NODE_NAME in DRA driver systemd unit override #8676

Draft
Copilot wants to merge 5 commits into main from copilot/fix-code-review-suggestion

Copilot AI commented Jun 9, 2026

The startDRADriverNvidiaGpu function wrote the systemd unit override using a heredoc with a quoted delimiter (<<'EOF'), which suppresses shell variable expansion. As a result, ${NODE_NAME} was written literally into the drop-in file instead of the actual hostname, so gpu-kubelet-plugin received an empty/undefined node name and DRA driver registration broke.

Change

  • parts/linux/cloud-init/artifacts/cse_config.sh: Changed <<'EOF' to <<EOF in startDRADriverNvidiaGpu so that ${NODE_NAME} (set to $(hostname) at script init) is expanded at CSE execution time.
# Before — NODE_NAME written literally
tee "${DRA_DRIVER_OVERRIDE_DIR}/10-dra-driver-nvidia-gpu.conf" > /dev/null <<'EOF'
[Service]
ExecStart=
ExecStart=/usr/bin/gpu-kubelet-plugin ... --node-name=${NODE_NAME}
EOF

# After — NODE_NAME expanded to actual hostname
tee "${DRA_DRIVER_OVERRIDE_DIR}/10-dra-driver-nvidia-gpu.conf" > /dev/null <<EOF
[Service]
ExecStart=
ExecStart=/usr/bin/gpu-kubelet-plugin ... --node-name=${NODE_NAME}
EOF
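
For reviewers who want to confirm the quoting behavior in isolation, here is a minimal sketch that does not touch any systemd paths. It captures the heredoc body into a variable instead of writing a drop-in file, and the NODE_NAME value is a hypothetical stand-in (the real script sets it from $(hostname)).

```shell
#!/usr/bin/env bash
# Hypothetical demo value; in cse_config.sh NODE_NAME comes from $(hostname).
NODE_NAME="demo-node-01"

# Quoted delimiter: the body is taken literally, so ${NODE_NAME} is NOT expanded.
quoted=$(cat <<'EOF'
--node-name=${NODE_NAME}
EOF
)

# Unquoted delimiter: the shell expands ${NODE_NAME} before the body is emitted.
unquoted=$(cat <<EOF
--node-name=${NODE_NAME}
EOF
)

printf 'quoted:   %s\n' "$quoted"
printf 'unquoted: %s\n' "$unquoted"
```

One thing to keep in mind with the unquoted form: the shell expands every `$` and backtick in the body, so any text that must reach the drop-in file literally has to be escaped (for example `\$`).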

Copilot AI requested review from Copilot and removed request for Copilot June 9, 2026 23:23
Co-authored-by: runzhen <32292691+runzhen@users.noreply.github.com>
Copilot AI requested review from Copilot and removed request for Copilot June 9, 2026 23:28
Co-authored-by: runzhen <32292691+runzhen@users.noreply.github.com>
Copilot AI requested review from Copilot and removed request for Copilot June 9, 2026 23:28
Copilot AI changed the title from "[WIP] Fix the code as per review comment" to "fix(cse): expand NODE_NAME in DRA driver systemd unit override" Jun 9, 2026
Copilot AI requested a review from runzhen June 9, 2026 23:29
@aks-node-assistant

AgentBaker Linux PR gate — 236-failure mass run: shared cluster proxy-pod readiness exhaustion (test-infra, NOT this PR)

  • Run: 167398747 (failed) — WIP/draft PR by copilot-swe-agent
  • Failed task: Run AgentBaker E2E
  • Test summary: DONE 405 tests, 97 skipped, 236 failures in 652.730s (~58% failure rate; 0 fwupd hits)

Dominant failure (same as 6 concurrent runs in this window):

prepare cluster tasks: dag execution failed:
waiting for proxy pod to be ready: listing proxy pods:
client rate limiter Wait returned an error: context deadline exceeded

Every failing scenario times out at the cluster.go:163: ✓ preparing cluster done (311.0s) boundary because the harness's e2e-proxy DaemonSet does not become ready within the prepare-cluster DAG.

Cross-PR pattern (same window): identical 236-failure shape on builds 167387444 (PR #8652), 167387406 (PR #8294), 167387387 (PR #8600), 167393232 (PR #8669), and earlier on 167378787 (PR #8618). Six unrelated PRs in the same ~3h window all hit the same shared-cluster-fleet readiness exhaustion.

Build-vs-test: test-infra (shared cluster pool), NOT product, NOT PR-caused.
Confidence: HIGH that PR #8676 is not the cause.

Recommended next action / owner: E2E infra / NodeSIG-dev — shared cluster fleet (e2e-proxy DaemonSet readiness + kubenet-v5 RG lifecycle) is overloaded. This is a draft/WIP PR; rerun once fleet stabilizes.

Posted by Clawpilot AgentBaker gate detective.
