fix(cse): expand NODE_NAME in DRA driver systemd unit override by Copilot · Pull Request #8676 · Azure/AgentBaker

Copilot · 2026-06-09T23:23:51Z

The startDRADriverNvidiaGpu function wrote the systemd unit override with a quoted heredoc delimiter (<<'EOF'), which suppresses shell variable expansion. As a result, ${NODE_NAME} was written literally into the drop-in file instead of the actual hostname. systemd then resolves ${NODE_NAME} against the unit's own environment, where it is presumably unset, so gpu-kubelet-plugin received an empty/undefined node name and DRA driver registration broke.

Change

parts/linux/cloud-init/artifacts/cse_config.sh: Changed <<'EOF' → <<EOF in startDRADriverNvidiaGpu so ${NODE_NAME} (set to $(hostname) at script init) is expanded at CSE execution time.

# Before — NODE_NAME written literally
tee "${DRA_DRIVER_OVERRIDE_DIR}/10-dra-driver-nvidia-gpu.conf" > /dev/null <<'EOF'
[Service]
ExecStart=
ExecStart=/usr/bin/gpu-kubelet-plugin ... --node-name=${NODE_NAME}
EOF

# After — NODE_NAME expanded to actual hostname
tee "${DRA_DRIVER_OVERRIDE_DIR}/10-dra-driver-nvidia-gpu.conf" > /dev/null <<EOF
[Service]
ExecStart=
ExecStart=/usr/bin/gpu-kubelet-plugin ... --node-name=${NODE_NAME}
EOF
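
The difference between the two delimiter forms can be reproduced outside of CSE. The following is a minimal sketch, not the code from cse_config.sh: the NODE_NAME value is a hypothetical stand-in for $(hostname), and cat replaces the tee call so nothing is written to disk.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the value cse_config.sh derives from $(hostname).
NODE_NAME="aks-gpu-node-0"

# Quoted delimiter: the body is taken literally, so ${NODE_NAME} is NOT expanded.
quoted_line=$(cat <<'EOF'
--node-name=${NODE_NAME}
EOF
)

# Unquoted delimiter: the shell expands ${NODE_NAME} before cat reads the body.
unquoted_line=$(cat <<EOF
--node-name=${NODE_NAME}
EOF
)

printf 'quoted:   %s\n' "$quoted_line"
printf 'unquoted: %s\n' "$unquoted_line"
```

One consequence of the unquoted form is that every `$` and backtick in the heredoc body becomes subject to shell expansion, so any text that must reach the drop-in file literally would need a backslash escape.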

Co-authored-by: runzhen <32292691+runzhen@users.noreply.github.com>

aks-node-assistant · 2026-06-10T01:02:24Z

AgentBaker Linux PR gate — 236-failure mass run: shared cluster proxy-pod readiness exhaustion (test-infra, NOT this PR)

Run: 167398747 (failed) — WIP/draft PR by copilot-swe-agent
Failed task: Run AgentBaker E2E
Test summary: DONE 405 tests, 97 skipped, 236 failures in 652.730s (~58% failure rate; 0 fwupd hits)

Dominant failure (same as 6 concurrent runs in this window):

prepare cluster tasks: dag execution failed:
waiting for proxy pod to be ready: listing proxy pods:
client rate limiter Wait returned an error: context deadline exceeded

Every failing scenario times out at the cluster.go:163: ✓ preparing cluster done (311.0s) boundary because the harness's e2e-proxy DaemonSet doesn't go ready inside the prepare-cluster DAG.

Cross-PR pattern (same window): identical 236-failure shape on builds 167387444 (PR #8652), 167387406 (PR #8294), 167387387 (PR #8600), 167393232 (PR #8669), and earlier on 167378787 (PR #8618). Counting this PR, six unrelated PRs in the same ~3h window all hit the same shared-cluster-fleet readiness exhaustion.

Build-vs-test: test-infra (shared cluster pool), NOT product, NOT PR-caused.
Confidence: HIGH that PR #8676 is not the cause.

Recommended next action / owner: E2E infra / NodeSIG-dev — shared cluster fleet (e2e-proxy DaemonSet readiness + kubenet-v5 RG lifecycle) is overloaded. This is a draft/WIP PR; rerun once fleet stabilizes.

Posted by Clawpilot AgentBaker gate detective.

runzhen and others added 3 commits June 9, 2026 05:40

add dra

32433c2

dra

1f24802

Initial plan

fea4d3f

Copilot AI assigned Copilot and runzhen Jun 9, 2026

Copilot AI requested review from Copilot and removed request for Copilot June 9, 2026 23:23

Copilot started work on behalf of runzhen June 9, 2026 23:23

Merge branch 'pr-8671' into copilot/fix-code-review-suggestion

373014e

Co-authored-by: runzhen <32292691+runzhen@users.noreply.github.com>

Copilot AI requested review from Copilot and removed request for Copilot June 9, 2026 23:28

fix: use unquoted heredoc in startDRADriverNvidiaGpu to expand NODE_NAME

7d1b00e

Co-authored-by: runzhen <32292691+runzhen@users.noreply.github.com>

Copilot AI requested review from Copilot and removed request for Copilot June 9, 2026 23:28

Copilot AI changed the title from "[WIP] Fix the code as per review comment" to "fix(cse): expand NODE_NAME in DRA driver systemd unit override" Jun 9, 2026

Copilot finished work on behalf of runzhen June 9, 2026 23:29

Copilot AI requested a review from runzhen June 9, 2026 23:29

Copilot wants to merge 5 commits into main from copilot/fix-code-review-suggestion
