feat: add secondary nic count in NBC and NetworkConfig by jumpinthefire · Pull Request #8697 · Azure/AgentBaker

jumpinthefire · 2026-06-12T00:33:08Z

What this PR does / why we need it:
See #8642 for more details.
This PR adds a configuration field that AKS RP will set when non-delegated secondary NICs are provisioned within a node pool. When this field is set, node provision will configure and bring up secondary NICs found in the OS. This is to leave existing agent pools intact and only turn on secondary NICs when explicitly desired by the customer, using the RP API.

Both scripted and scriptless variants are set.

github-actions · 2026-06-12T00:33:49Z

The latest Buf updates on your PR. Results from workflow Buf CI / buf (pull_request).

Build	Format	Lint	Breaking	Updated (UTC)
`✅ passed`	`✅ passed`	`✅ passed`	`✅ passed`	Jun 12, 2026, 9:32 PM

Copilot

Pull request overview

This PR extends both the AgentBaker NodeBootstrappingConfiguration (scripted flow) and the aks-node-controller NetworkConfig proto (scriptless flow) to carry an explicit count of Standard-type secondary NICs so bootstrapping logic can decide when to bring up additional OS-level interfaces.

Changes:

Added StandardSecondaryNICCount to datamodel.NodeBootstrappingConfiguration for the /getnodebootstrapdata (scripted) configuration path.
Added standard_secondary_nic_count to the NetworkConfig proto and regenerated the corresponding Go bindings.

Reviewed changes

Copilot reviewed 2 out of 3 changed files in this pull request and generated no comments.

File	Description
pkg/agent/datamodel/types.go	Adds `StandardSecondaryNICCount` to the scripted NBC input model.
aks-node-controller/proto/aksnodeconfig/v1/network_config.proto	Adds `standard_secondary_nic_count` to the scriptless NetworkConfig schema.
aks-node-controller/pkg/gen/aksnodeconfig/v1/network_config.pb.go	Regenerated Go bindings to include the new proto field and accessor.

Files not reviewed (1)

aks-node-controller/pkg/gen/aksnodeconfig/v1/network_config.pb.go: Generated file

aks-node-assistant · 2026-06-12T11:03:38Z

🕵️ AgentBaker Linux Gate Detective — Build 167738206 (Run AgentBaker E2E) FAILED

3-level RCA

1. Surface symptom

211 failed sub-tests / 106 passed / 97 skipped across Ubuntu2204, Ubuntu2404, AzureLinuxV3 and ACL (both default and scriptless_nbc variants), plus all LocalDNSHostsPlugin* and *_NetworkIsolated* variants.
Every failing scenario terminates at kube.go:195 with:
🔴 FAIL: "<vmss>" haven't appeared in k8s API server: context deadline exceeded after exactly 600s.

2. Corroboration (≥2 evidence sources)

e2e-test-log/test-results/test-log.json — same kube.go:160 → kube.go:195 600s timeout on Test_Ubuntu2204/default, Test_AzureLinuxV3/scriptless_nbc, Test_LocalDNSHostsPlugin/AzureLinuxV3/default, Test_Ubuntu2204_NetworkIsolatedCluster_NonAnonymousACR/default.
In every case VMSS reaches "running" at ~140–200s and SSH-over-bastion is established, so the failure is after VM boot, before kubelet registers.
Distinct distros, distinct shared clusters (abe2e-kubenet-v5-150ee + abe2e-azure-networkisolated-v3-d6cc9), distinct test paths — all converge on the same "node never registers" signature → not a per-cluster infra issue.
Build issues panel also flags this stage; the only PR change is commit c3d68bd6 "Add secondary nic count in NBC and NetworkConfig".

3. Root-cause challenge — strongest alternative

Alternative: shared-cluster fleet outage (signature shared-cluster-fleet-outage).
Rejected because: that signature shows up as proxy-pod readiness errors, RG-deletion failures, or VMSS-create context-deadlines (~1020s) — none of which appear here. Failures here are uniformly post-VM-boot, pre-kubelet-register, span two unrelated shared clusters, and 106 tests on the same clusters passed. The pattern is a PR-localized regression in NBC / NetworkConfig payload, not a fleet outage.

Classification: Deterministic PR-caused E2E / provisioning regression (NBC + NetworkConfig secondary-NIC payload).
Build vs Test class: Test stage (Run AgentBaker E2E). VHD builds succeeded; only build2404gen2containerd has a non-blocking CIS regression warning unrelated to this fix.
Confidence: High.
Flaky vs deterministic: Deterministic — uniform 600s timeout signature across all distros / both bootstrap modes.

Wiki signature comparison

Existing signature	Match?	Why
`secondarynic-pod-exec-context-deadline`	❌	That signature targets `Test__SecondaryNIC` failing at the wireserver-blocking validator after the node is up. Here no `SecondaryNIC` tests exist in the failed list, and failures are before kubelet registers.
`secondarynic-networkctl-reload`	❌	That signature is a CSE-level `networkctl reload` exit. Here CSE/kubelet never reach the K8s API — different stage and no `networkctl` reference.
`networkisolated-apiserver-fqdn-nxdomain`	❌ (no escalation)	That signature is CSE `VALIDATION_ERR=52` from NXDOMAIN on `169.254.10.10` and is NetworkIsolated-only. Here NetworkIsolated tests fail via the same global 600s pattern as Ubuntu2204/AzureLinuxV3 default scenarios — downstream symptom of the broader PR regression, not the test-infra fixture issue. Wiki Count remains 6; threshold not crossed.
`localdns-hostsplugin-canary-hotreload-timeout`	❌	That signature is `validation.go:106` Canary did not resolve after 60s — runs only after node registration. Here LocalDNSHostsPlugin tests also fail at `kube.go:195` pre-registration.

New signature spec

Signature ID: secondarynic-nbc-node-registration-timeout
One-line: All linux E2E scenarios across distros / bootstrap modes fail at kube.go:195 with "…" haven't appeared in k8s API server: context deadline exceeded (600s) after VMSS reaches running — kubelet never registers; localized to PR feat: add secondary nic count in NBC and NetworkConfig #8697 NBC + NetworkConfig secondary-NIC payload change.

Recommended next step (PR author)

Inspect the NBC / aks-node-controller-config.json produced by this change for an unconditional/missing SecondaryNIC field consumed by both legacy CSE (parts/linux/cloud-init/artifacts/) and aks-node-controller, and confirm node-controller still completes when the new field is absent / zero-valued. The CSE output / /var/log/azure/aks-node-controller.log from scenario-logs-1781228226 will pinpoint the path that prevents kubelet from starting.

ESCALATION_THRESHOLD_HIT: false

aks-node-assistant · 2026-06-12T18:34:06Z

🕵️ AgentBaker Linux Gate Detective — Build 167815411 (Run AgentBaker E2E) FAILED — recurrence on commit 2c1b151583cb ("Catch diff error").

TL;DR

Same signature as the prior build on this PR — secondarynic-nbc-node-registration-timeout. 2c1b15 is a small follow-up commit and does not change the secondary-NIC NBC / NetworkConfig payload, so the regression introduced by c3d68bd6 is still present. 209 failed sub-tests (vs 211 prior) across Ubuntu2204 / Ubuntu2404 / AzureLinuxV3 / ACL on shared clusters abe2e-kubenet-v5-150ee and abe2e-azure-networkisolated-v3-*.

3-level RCA (delta vs prior comment)

1. Surface symptom — Unchanged: uniform kube.go:195 "<vmss>" haven't appeared in k8s API server: context deadline exceeded after 600s. VMSS reaches running, kubelet never registers, no SSH bastion reachability problem. Cascading kube.go:166 client-rate-limiter starvation on shared rest config.

2. Corroboration — Same shared clusters as prior build 167738206; same distros; same panic.go:694 "preparing AKS node failed". Trivial code delta in 2c1b15 ("Catch diff error") corroborates that the regression is in the original c3d68bd6 secondary-NIC NBC payload, not in build/CI tooling.

3. Root-cause challenge — Strongest alternative: shared abe2e-kubenet-v5-150ee is currently on an open repair item (#38403603, signature kubenet-v5-node-not-ready-scriptless) where the same kube.go:195 symptom is reproducing infra-side on unrelated PRs. Why it is still less likely the dominant cause here: the failure rate on PR #8697 is far higher than the cross-PR infra baseline, and the same NBC change has now reproduced across two builds on two different shared clusters (kubenet-v5-150ee + azure-networkisolated-v3-*) with the same payload, while many adjacent runs from other PRs pass on those clusters. The infra issue may be amplifying volume but it does not explain why every secondary-NIC-touched scenario for this PR uniformly fails kubelet registration.

Likely cause / signature

Wiki signature: secondarynic-nbc-node-registration-timeout
Classification: Deterministic PR-caused E2E/provisioning regression
Confidence: High that the NBC secondary-NIC payload change is the root cause; Medium that no infra amplification is present.

Recommended next action / owner

Owner: PR author.
Inspect kubelet / aks-node-controller startup logs from the scenario-logs artifact for one failed scenario (e.g. Test_Ubuntu2204_DisableKubeletServingCertificateRotationWithTags_AlreadyDisabled/scriptless_nbc) and confirm whether kubelet is failing to read or apply the new SecondaryNIC field from aks-node-controller-config.json.
Cross-check the NBC schema / serialization for the new field (default value, omitempty behavior, downstream consumer in CSE).

Evidence

Failed run: https://msazure.visualstudio.com/CloudNativeCompute/_build/results?buildId=167815411
Failed job: Run AgentBaker E2E (e2e stage)
Prior detective comment on this PR (build 167738206, commit c3d68bd6): #issuecomment-4690569715
Test summary: 414 tests, 97 skipped, 209 failures in 3376.8s
Source commit: 2c1b151583cb2914c0de9db630834682064b6ea1

Posted by Clawpilot AgentBaker Linux Gate Detective Watcher. Replies welcome.

Copilot

Pull request overview

Copilot reviewed 2 out of 3 changed files in this pull request and generated 2 comments.

Files not reviewed (1)

aks-node-controller/pkg/gen/aksnodeconfig/v1/network_config.pb.go: Generated file

jumpinthefire · 2026-06-12T23:14:03Z

Confirmed the same e2e failures occur on the main branch.

aks-node-assistant · 2026-06-12T23:31:17Z

🕵️ AgentBaker Linux Gate Detective — Build 167856860 (Run AgentBaker E2E) FAILED — 3rd recurrence on PR #8697, now on rebased commit 5ac78bac (PR HEAD).

TL;DR

PR was force-pushed/rebased to a single new commit 5ac78bac ("Add secondary nic count in NBC and NetworkConfig") and the secondarynic-nbc-node-registration-timeout signature reproduces unchanged — 211 failures uniformly across distros on both abe2e-kubenet-v5-150ee and abe2e-azure-networkisolated-v3-d6cc9 shared clusters. The rebase rules out "stale branch / merge race" theories; the regression is in the secondary-NIC NBC + NetworkConfig payload at PR HEAD.

Delta vs prior runs

Build	Commit	Failures	Note
167738206	`c3d68bd6`	211	First occurrence
167815411	`2c1b1515` (trivial "Catch diff error")	209	NBC payload unchanged → reproduces
167856860	`5ac78bac` (PR rebased to single commit)	211	Clean rebase still fails identically — confirms PR HEAD is the source

3-level RCA

1. Surface symptom — Unchanged: every failed scenario terminates at kube.go:195 🔴 FAIL: "<vmss>" haven't appeared in k8s API server: context deadline exceeded after 600s + kube.go:166 client-rate-limiter starvation; pre-kubelet-registration.

2. Corroboration — Identical signature, same two shared clusters, same uniform-across-distros pattern as prior builds 167738206 / 167815411 on this PR. Failure rate (~211/414) is far above the cross-PR shared-cluster baseline.

3. Root-cause challenge — Strongest alternative is still the shared abe2e-kubenet-v5-150ee infra outage (#38403603). Why less likely: identical reproduction across THREE independent commits on this same PR (initial, trivial follow-up, and now a clean rebase to fresh HEAD), while the same shared clusters intermittently pass for other PRs' runs. Three-shot rebase reproduction is strong evidence the NBC secondary-NIC payload is the dominant cause.

Recommended next action / owner

Owner: PR author (@jumpinthefire).
Pull the kubelet & aks-node-controller startup logs from scenario-logs/Test_Ubuntu2204_*/{default,scriptless_nbc} for one failing VMSS and check whether kubelet is failing to parse / apply the new SecondaryNIC field from aks-node-controller-config.json (likely culprit: a required field missing / wrong type / nil-deref in the new payload). The fact that VMs reach "running" but never register strongly points to kubelet startup or aks-node-controller failing post-CSE.

Evidence

Failed run: https://msazure.visualstudio.com/CloudNativeCompute/_build/results?buildId=167856860
Failed job: Run AgentBaker E2E (e2e stage)
Test summary: 414 tests, 97 skipped, 211 failures in 3489.9s
Source commit: 5ac78bac3624b4c7e38af73efe9f174c7199d361 (rebased PR HEAD, 2026-06-12T21:30Z)
Prior detective comments on this PR: build 167738206 · build 167815411

Posted by Clawpilot AgentBaker Linux Gate Detective Watcher.

Copilot AI review requested due to automatic review settings June 12, 2026 00:33

Copilot started reviewing on behalf of jumpinthefire June 12, 2026 00:33 View session

jumpinthefire changed the title ~~Add secondary nic count in NBC and NetworkConfig~~ feat: add secondary nic count in NBC and NetworkConfig Jun 12, 2026

jumpinthefire marked this pull request as ready for review June 12, 2026 00:34

jumpinthefire requested review from AbelHu, Devinwong, SriHarsha001, awesomenix, calvin197, cameronmeissner, djsly, ganeshkumarashok, lilypan26, mxj220, pdamianov-dev, phealy, r2k1, runzhen, sulixu, surajssd, timmy-wright and zachary-bailey as code owners June 12, 2026 00:34

Copilot AI reviewed Jun 12, 2026

View reviewed changes

aks-node-assistant Bot mentioned this pull request Jun 12, 2026

chore(deps): update nvidia-dcgm (patch) #8659

Open

1 task

djsly approved these changes Jun 12, 2026

View reviewed changes

Copilot AI review requested due to automatic review settings June 12, 2026 19:35

jumpinthefire force-pushed the jumpinthefire/secondarynicnbc branch from 2c1b151 to 4804db7 Compare June 12, 2026 19:35

Copilot started reviewing on behalf of jumpinthefire June 12, 2026 19:36 View session

Copilot AI reviewed Jun 12, 2026

View reviewed changes

Comment thread pkg/agent/datamodel/types.go

Comment thread aks-node-controller/proto/aksnodeconfig/v1/network_config.proto

Add secondary nic count in NBC and NetworkConfig

5ac78ba

jumpinthefire force-pushed the jumpinthefire/secondarynicnbc branch from 4804db7 to 5ac78ba Compare June 12, 2026 21:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add secondary nic count in NBC and NetworkConfig#8697

feat: add secondary nic count in NBC and NetworkConfig#8697
jumpinthefire wants to merge 1 commit into
mainfrom
jumpinthefire/secondarynicnbc

jumpinthefire commented Jun 12, 2026

Uh oh!

github-actions Bot commented Jun 12, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

aks-node-assistant Bot commented Jun 12, 2026

Uh oh!

aks-node-assistant Bot commented Jun 12, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

jumpinthefire commented Jun 12, 2026

Uh oh!

aks-node-assistant Bot commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

jumpinthefire commented Jun 12, 2026

Uh oh!

github-actions Bot commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

aks-node-assistant Bot commented Jun 12, 2026

3-level RCA

Wiki signature comparison

New signature spec

Recommended next step (PR author)

Uh oh!

aks-node-assistant Bot commented Jun 12, 2026

TL;DR

3-level RCA (delta vs prior comment)

Likely cause / signature

Recommended next action / owner

Evidence

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

jumpinthefire commented Jun 12, 2026

Uh oh!

aks-node-assistant Bot commented Jun 12, 2026

TL;DR

Delta vs prior runs

3-level RCA

Recommended next action / owner

Evidence

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

github-actions Bot commented Jun 12, 2026 •

edited

Loading