Skip to content

feat: add secondary nic count in NBC and NetworkConfig#8697

Open
jumpinthefire wants to merge 1 commit into
mainfrom
jumpinthefire/secondarynicnbc
Open

feat: add secondary nic count in NBC and NetworkConfig#8697
jumpinthefire wants to merge 1 commit into
mainfrom
jumpinthefire/secondarynicnbc

Conversation

@jumpinthefire

Copy link
Copy Markdown

What this PR does / why we need it:
See #8642 for more details.
This PR adds a configuration field that AKS RP will set when non-delegated secondary NICs are provisioned within a node pool. When this field is set, node provision will configure and bring up secondary NICs found in the OS. This is to leave existing agent pools intact and only turn on secondary NICs when explicitly desired by the customer, using the RP API.

Both scripted and scriptless variants are set.

Copilot AI review requested due to automatic review settings June 12, 2026 00:33
@github-actions

github-actions Bot commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

The latest Buf updates on your PR. Results from workflow Buf CI / buf (pull_request).

BuildFormatLintBreakingUpdated (UTC)
✅ passed✅ passed✅ passed✅ passedJun 12, 2026, 9:32 PM

@jumpinthefire jumpinthefire changed the title Add secondary nic count in NBC and NetworkConfig feat: add secondary nic count in NBC and NetworkConfig Jun 12, 2026
@jumpinthefire jumpinthefire marked this pull request as ready for review June 12, 2026 00:34

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR extends both the AgentBaker NodeBootstrappingConfiguration (scripted flow) and the aks-node-controller NetworkConfig proto (scriptless flow) to carry an explicit count of Standard-type secondary NICs so bootstrapping logic can decide when to bring up additional OS-level interfaces.

Changes:

  • Added StandardSecondaryNICCount to datamodel.NodeBootstrappingConfiguration for the /getnodebootstrapdata (scripted) configuration path.
  • Added standard_secondary_nic_count to the NetworkConfig proto and regenerated the corresponding Go bindings.

Reviewed changes

Copilot reviewed 2 out of 3 changed files in this pull request and generated no comments.

File Description
pkg/agent/datamodel/types.go Adds StandardSecondaryNICCount to the scripted NBC input model.
aks-node-controller/proto/aksnodeconfig/v1/network_config.proto Adds standard_secondary_nic_count to the scriptless NetworkConfig schema.
aks-node-controller/pkg/gen/aksnodeconfig/v1/network_config.pb.go Regenerated Go bindings to include the new proto field and accessor.
Files not reviewed (1)
  • aks-node-controller/pkg/gen/aksnodeconfig/v1/network_config.pb.go: Generated file

@aks-node-assistant

Copy link
Copy Markdown
Contributor

🕵️ AgentBaker Linux Gate DetectiveBuild 167738206 (Run AgentBaker E2E) FAILED

3-level RCA

1. Surface symptom

  • 211 failed sub-tests / 106 passed / 97 skipped across Ubuntu2204, Ubuntu2404, AzureLinuxV3 and ACL (both default and scriptless_nbc variants), plus all LocalDNSHostsPlugin* and *_NetworkIsolated* variants.
  • Every failing scenario terminates at kube.go:195 with:
    🔴 FAIL: "<vmss>" haven't appeared in k8s API server: context deadline exceeded after exactly 600s.

2. Corroboration (≥2 evidence sources)

  • e2e-test-log/test-results/test-log.json — same kube.go:160 → kube.go:195 600s timeout on Test_Ubuntu2204/default, Test_AzureLinuxV3/scriptless_nbc, Test_LocalDNSHostsPlugin/AzureLinuxV3/default, Test_Ubuntu2204_NetworkIsolatedCluster_NonAnonymousACR/default.
  • In every case VMSS reaches "running" at ~140–200s and SSH-over-bastion is established, so the failure is after VM boot, before kubelet registers.
  • Distinct distros, distinct shared clusters (abe2e-kubenet-v5-150ee + abe2e-azure-networkisolated-v3-d6cc9), distinct test paths — all converge on the same "node never registers" signature → not a per-cluster infra issue.
  • Build issues panel also flags this stage; the only PR change is commit c3d68bd6 "Add secondary nic count in NBC and NetworkConfig".

3. Root-cause challenge — strongest alternative

  • Alternative: shared-cluster fleet outage (signature shared-cluster-fleet-outage).
  • Rejected because: that signature shows up as proxy-pod readiness errors, RG-deletion failures, or VMSS-create context-deadlines (~1020s) — none of which appear here. Failures here are uniformly post-VM-boot, pre-kubelet-register, span two unrelated shared clusters, and 106 tests on the same clusters passed. The pattern is a PR-localized regression in NBC / NetworkConfig payload, not a fleet outage.

Classification: Deterministic PR-caused E2E / provisioning regression (NBC + NetworkConfig secondary-NIC payload).
Build vs Test class: Test stage (Run AgentBaker E2E). VHD builds succeeded; only build2404gen2containerd has a non-blocking CIS regression warning unrelated to this fix.
Confidence: High.
Flaky vs deterministic: Deterministic — uniform 600s timeout signature across all distros / both bootstrap modes.

Wiki signature comparison

Existing signature Match? Why
secondarynic-pod-exec-context-deadline That signature targets Test_*_SecondaryNIC* failing at the wireserver-blocking validator after the node is up. Here no *SecondaryNIC* tests exist in the failed list, and failures are before kubelet registers.
secondarynic-networkctl-reload That signature is a CSE-level networkctl reload exit. Here CSE/kubelet never reach the K8s API — different stage and no networkctl reference.
networkisolated-apiserver-fqdn-nxdomain ❌ (no escalation) That signature is CSE VALIDATION_ERR=52 from NXDOMAIN on 169.254.10.10 and is NetworkIsolated-only. Here NetworkIsolated tests fail via the same global 600s pattern as Ubuntu2204/AzureLinuxV3 default scenarios — downstream symptom of the broader PR regression, not the test-infra fixture issue. Wiki Count remains 6; threshold not crossed.
localdns-hostsplugin-canary-hotreload-timeout That signature is validation.go:106 Canary did not resolve after 60s — runs only after node registration. Here LocalDNSHostsPlugin tests also fail at kube.go:195 pre-registration.

New signature spec

  • Signature ID: secondarynic-nbc-node-registration-timeout
  • One-line: All linux E2E scenarios across distros / bootstrap modes fail at kube.go:195 with "…" haven't appeared in k8s API server: context deadline exceeded (600s) after VMSS reaches running — kubelet never registers; localized to PR feat: add secondary nic count in NBC and NetworkConfig #8697 NBC + NetworkConfig secondary-NIC payload change.

Recommended next step (PR author)

Inspect the NBC / aks-node-controller-config.json produced by this change for an unconditional/missing SecondaryNIC field consumed by both legacy CSE (parts/linux/cloud-init/artifacts/) and aks-node-controller, and confirm node-controller still completes when the new field is absent / zero-valued. The CSE output / /var/log/azure/aks-node-controller.log from scenario-logs-1781228226 will pinpoint the path that prevents kubelet from starting.

ESCALATION_THRESHOLD_HIT: false

@aks-node-assistant

Copy link
Copy Markdown
Contributor

🕵️ AgentBaker Linux Gate DetectiveBuild 167815411 (Run AgentBaker E2E) FAILED — recurrence on commit 2c1b151583cb ("Catch diff error").

TL;DR

Same signature as the prior build on this PR — secondarynic-nbc-node-registration-timeout. 2c1b15 is a small follow-up commit and does not change the secondary-NIC NBC / NetworkConfig payload, so the regression introduced by c3d68bd6 is still present. 209 failed sub-tests (vs 211 prior) across Ubuntu2204 / Ubuntu2404 / AzureLinuxV3 / ACL on shared clusters abe2e-kubenet-v5-150ee and abe2e-azure-networkisolated-v3-*.

3-level RCA (delta vs prior comment)

1. Surface symptom — Unchanged: uniform kube.go:195 "<vmss>" haven't appeared in k8s API server: context deadline exceeded after 600s. VMSS reaches running, kubelet never registers, no SSH bastion reachability problem. Cascading kube.go:166 client-rate-limiter starvation on shared rest config.

2. Corroboration — Same shared clusters as prior build 167738206; same distros; same panic.go:694 "preparing AKS node failed". Trivial code delta in 2c1b15 ("Catch diff error") corroborates that the regression is in the original c3d68bd6 secondary-NIC NBC payload, not in build/CI tooling.

3. Root-cause challenge — Strongest alternative: shared abe2e-kubenet-v5-150ee is currently on an open repair item (#38403603, signature kubenet-v5-node-not-ready-scriptless) where the same kube.go:195 symptom is reproducing infra-side on unrelated PRs. Why it is still less likely the dominant cause here: the failure rate on PR #8697 is far higher than the cross-PR infra baseline, and the same NBC change has now reproduced across two builds on two different shared clusters (kubenet-v5-150ee + azure-networkisolated-v3-*) with the same payload, while many adjacent runs from other PRs pass on those clusters. The infra issue may be amplifying volume but it does not explain why every secondary-NIC-touched scenario for this PR uniformly fails kubelet registration.

Likely cause / signature

  • Wiki signature: secondarynic-nbc-node-registration-timeout
  • Classification: Deterministic PR-caused E2E/provisioning regression
  • Confidence: High that the NBC secondary-NIC payload change is the root cause; Medium that no infra amplification is present.

Recommended next action / owner

  • Owner: PR author.
  • Inspect kubelet / aks-node-controller startup logs from the scenario-logs artifact for one failed scenario (e.g. Test_Ubuntu2204_DisableKubeletServingCertificateRotationWithTags_AlreadyDisabled/scriptless_nbc) and confirm whether kubelet is failing to read or apply the new SecondaryNIC field from aks-node-controller-config.json.
  • Cross-check the NBC schema / serialization for the new field (default value, omitempty behavior, downstream consumer in CSE).

Evidence

Posted by Clawpilot AgentBaker Linux Gate Detective Watcher. Replies welcome.

Copilot AI review requested due to automatic review settings June 12, 2026 19:35
@jumpinthefire jumpinthefire force-pushed the jumpinthefire/secondarynicnbc branch from 2c1b151 to 4804db7 Compare June 12, 2026 19:35

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 3 changed files in this pull request and generated 2 comments.

Files not reviewed (1)
  • aks-node-controller/pkg/gen/aksnodeconfig/v1/network_config.pb.go: Generated file

Comment thread pkg/agent/datamodel/types.go
Comment thread aks-node-controller/proto/aksnodeconfig/v1/network_config.proto
@jumpinthefire jumpinthefire force-pushed the jumpinthefire/secondarynicnbc branch from 4804db7 to 5ac78ba Compare June 12, 2026 21:31
@jumpinthefire

Copy link
Copy Markdown
Author

Confirmed the same e2e failures occur on the main branch.

@aks-node-assistant

Copy link
Copy Markdown
Contributor

🕵️ AgentBaker Linux Gate DetectiveBuild 167856860 (Run AgentBaker E2E) FAILED — 3rd recurrence on PR #8697, now on rebased commit 5ac78bac (PR HEAD).

TL;DR

PR was force-pushed/rebased to a single new commit 5ac78bac ("Add secondary nic count in NBC and NetworkConfig") and the secondarynic-nbc-node-registration-timeout signature reproduces unchanged — 211 failures uniformly across distros on both abe2e-kubenet-v5-150ee and abe2e-azure-networkisolated-v3-d6cc9 shared clusters. The rebase rules out "stale branch / merge race" theories; the regression is in the secondary-NIC NBC + NetworkConfig payload at PR HEAD.

Delta vs prior runs

Build Commit Failures Note
167738206 c3d68bd6 211 First occurrence
167815411 2c1b1515 (trivial "Catch diff error") 209 NBC payload unchanged → reproduces
167856860 5ac78bac (PR rebased to single commit) 211 Clean rebase still fails identically — confirms PR HEAD is the source

3-level RCA

1. Surface symptom — Unchanged: every failed scenario terminates at kube.go:195 🔴 FAIL: "<vmss>" haven't appeared in k8s API server: context deadline exceeded after 600s + kube.go:166 client-rate-limiter starvation; pre-kubelet-registration.

2. Corroboration — Identical signature, same two shared clusters, same uniform-across-distros pattern as prior builds 167738206 / 167815411 on this PR. Failure rate (~211/414) is far above the cross-PR shared-cluster baseline.

3. Root-cause challenge — Strongest alternative is still the shared abe2e-kubenet-v5-150ee infra outage (#38403603). Why less likely: identical reproduction across THREE independent commits on this same PR (initial, trivial follow-up, and now a clean rebase to fresh HEAD), while the same shared clusters intermittently pass for other PRs' runs. Three-shot rebase reproduction is strong evidence the NBC secondary-NIC payload is the dominant cause.

Recommended next action / owner

  • Owner: PR author (@jumpinthefire).
  • Pull the kubelet & aks-node-controller startup logs from scenario-logs/Test_Ubuntu2204_*/{default,scriptless_nbc} for one failing VMSS and check whether kubelet is failing to parse / apply the new SecondaryNIC field from aks-node-controller-config.json (likely culprit: a required field missing / wrong type / nil-deref in the new payload). The fact that VMs reach "running" but never register strongly points to kubelet startup or aks-node-controller failing post-CSE.

Evidence

Posted by Clawpilot AgentBaker Linux Gate Detective Watcher.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants