test: (scriptless) Enable scriptless phase 3 in AB e2es #8453

Merged
lilypan26 merged 65 commits into main from lily/scriptless/phase-3-e2e
Jun 12, 2026
lilypan26 commented May 5, 2026

Contributor

What this PR does / why we need it:

  • Enables scriptless phase 3 in the AB e2es under the tag EnableScriptlessANC
    • With this tag, both the ANC and NBC CSE cmds are provided, and tests should validate that there are no diffs between the generated provisioning env vars
    • For now, the NBC CSE cmd is used for provisioning until there are no more diffs, after which we can switch to ANC
  • Adds CustomDataPhase3 to provide both the ANC and NBC CSE cmds to the AKS node controller
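
The "no diffs between the generated provisioning env vars" check described above can be illustrated with a minimal Go sketch. It assumes both paths have already been reduced to plain key/value maps. The function name `diffEnvs`, the output format, and the sample keys are hypothetical, and this is not the validator the e2e suite actually uses.

```go
package main

import (
	"fmt"
	"sort"
)

// diffEnvs returns a sorted list of human-readable differences between two
// sets of provisioning env vars. The name and the message format are
// hypothetical; this is not the e2e suite's own validator.
func diffEnvs(anc, nbc map[string]string) []string {
	var diffs []string
	for k, av := range anc {
		bv, ok := nbc[k]
		switch {
		case !ok:
			diffs = append(diffs, fmt.Sprintf("%s: only in ANC (%q)", k, av))
		case av != bv:
			diffs = append(diffs, fmt.Sprintf("%s: ANC=%q NBC=%q", k, av, bv))
		}
	}
	for k, bv := range nbc {
		if _, ok := anc[k]; !ok {
			diffs = append(diffs, fmt.Sprintf("%s: only in NBC (%q)", k, bv))
		}
	}
	sort.Strings(diffs) // map iteration order is random, so sort for stable output
	return diffs
}

func main() {
	// Illustrative keys only; real provisioning env vars are not reproduced here.
	anc := map[string]string{"A": "1", "B": "2"}
	nbc := map[string]string{"A": "1", "B": "3", "C": "4"}
	for _, d := range diffEnvs(anc, nbc) {
		fmt.Println(d)
	}
}
```

An empty result means the two paths agree. Reporting keys that exist on only one side separately from value mismatches tends to make a missing field mapping easier to spot.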

Which issue(s) this PR fixes:

Fixes #

Copilot AI review requested due to automatic review settings May 5, 2026 16:25
lilypan26 changed the title from "Lily/scriptless/phase 3 e2e" to "test(scriptless): Enable scriptless phase 3 in AB e2es" on May 5, 2026

Copilot AI left a comment

Contributor

Pull request overview

Enables “scriptless phase 3” coverage in the AgentBaker e2e suite by adding a new scriptless_anc subtest path that provisions nodes using AKSNodeConfig/aks-node-controller, plus wiring many existing scenarios to provide an AKSNodeConfigMutator.

Changes:

  • Added a new scriptless_anc subtest variant and runtime flag (EnableScriptlessANC) to drive scriptless phase-3 execution.
  • Refactored/expanded the e2e “aks-node-controller hack” customData generation to optionally include AKSNodeConfig and/or an nbc-cmd script.
  • Updated many scenarios to set equivalent AKSNodeConfigMutator fields alongside existing NBC mutators.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| e2e/vmss.go | Refactors customData hack generation and wires scriptless ANC + NBC cmd hack paths into VMSS creation. |
| e2e/types.go | Adds EnableScriptlessANC and adjusts kubelet-config-file detection logic for scriptless ANC scenarios. |
| e2e/test_helpers.go | Adds scriptless_anc subtest generation and new gating helper. |
| e2e/scenario_test.go | Adds AKSNodeConfigMutator coverage across many existing scenarios. |

Comment thread e2e/vmss.go Outdated
Comment thread e2e/vmss.go Outdated
Comment thread e2e/vmss.go Outdated
Comment thread e2e/vmss.go
Comment thread e2e/test_helpers.go
Comment thread e2e/vmss.go Outdated

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 22 out of 27 changed files in this pull request and generated 2 comments.

Comment thread aks-node-controller/pkg/nodeconfigutils/utils_test.go Outdated
Comment thread e2e/test_helpers.go Outdated

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 20 out of 25 changed files in this pull request and generated 4 comments.

Comment thread pkg/agent/baker.go Outdated
Comment thread e2e/test_helpers.go Outdated
Comment thread aks-node-controller/parser/parser.go
Comment thread aks-node-controller/app.go Outdated

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 21 out of 26 changed files in this pull request and generated 3 comments.

Comment thread parts/linux/cloud-init/artifacts/aks-node-controller-wrapper.sh
Comment thread aks-node-controller/parser/helper.go
Comment thread aks-node-controller/app.go

djsly commented Jun 8, 2026

Collaborator

AgentBaker Linux PR gate — E2E mass-failure (change-caused, HIGH confidence)

  • Run: 167125948 (failed)
  • Failed task: Run AgentBaker E2E → AzureCLI → exit 1
  • Net new failures: ~13 leaves, all under Test_LocalDNSHostsPlugin*.

Two failure shapes, same root cause:

  1. Loud / fast (1 leaf): Test_LocalDNSHostsPlugin_Scriptless/ACL/default (155.83s) → e2e/validation.go:52 🔴 "expected no env var differences between provision-config and nbc-cmd, but found differences:". The new compareEnvs validator this PR added fires with a non-empty diff list.
  2. Slow / derivative (7 leaves): Test_LocalDNSHostsPlugin{,_Scriptless}/{Ubuntu2204,Ubuntu2404,AzureLinuxV3,ACL}/{default,scriptless_nbc} → kube.go:189 after ~742s 🔴 "<pod>" haven't appeared in k8s API server: context deadline exceeded. The signature is identical across every distro × variant, which points to systemic node-bring-up degradation, not infra.

Likely root cause ((b) + (c) confirmed, (a) ruled out):

  • This PR flips nbc.EnableScriptlessNBCCSECmd = true for every non-Scriptless scenario in e2e/node_config.go, and e2e/scenario_test.go strips Tags: { Scriptless: true } from each _Scriptless test and adds a BootstrapConfigMutator — so every LocalDNS scenario now provisions via the new scriptless NBC-CSE-Cmd path, including the default (bash-CSE) variants.
  • parts/linux/cloud-init/artifacts/aks-node-controller-wrapper.sh was changed from if/elif to two independent ifs, so when both files exist --provision-config and --nbc-cmd are both passed to aks-node-controller provision, which then runs compareEnvs.
  • e2e/node_config.go::nbcToAKSNodeConfigV1 has no LocalDNSProfile mapping — every LocalDNS occurrence in the PR diff is the unrelated GenerateLocalDNSCoreFile template. The localdns systemd unit therefore never gets configured/started on nodes provisioned via the converter, DNS on the node breaks, debug pods can't reach the API server → the kube.go:189 timeout.
  • The 2 cosmetic localdns.toml.gtpl diff lines (blank-line removal + trailing newline) are not the cause.
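
The effect of the wrapper change from if/elif to two independent ifs can be modeled with a short Go sketch. It only mirrors the control flow described above; the real wrapper is a shell script. The function names are hypothetical, the order in which the original chain checked the two files is an assumption, and the flag values (file paths) are omitted.

```go
package main

import "fmt"

// buildArgsExclusive models an if/elif chain: the first input found wins,
// so at most one flag is passed. Which input the original chain checked
// first is an assumption made for this sketch.
func buildArgsExclusive(hasProvisionConfig, hasNBCCmd bool) []string {
	args := []string{"provision"}
	if hasProvisionConfig {
		args = append(args, "--provision-config")
	} else if hasNBCCmd {
		args = append(args, "--nbc-cmd")
	}
	return args
}

// buildArgsIndependent models two independent ifs: when both inputs exist,
// both flags are passed.
func buildArgsIndependent(hasProvisionConfig, hasNBCCmd bool) []string {
	args := []string{"provision"}
	if hasProvisionConfig {
		args = append(args, "--provision-config")
	}
	if hasNBCCmd {
		args = append(args, "--nbc-cmd")
	}
	return args
}

func main() {
	// Both input files present: only the second form passes both flags.
	fmt.Println(buildArgsExclusive(true, true))
	fmt.Println(buildArgsIndependent(true, true))
}
```

The two forms behave identically when only one file exists. They diverge only when both are present, which is the case that triggers the env comparison.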

Build-vs-test: test-code-caused (e2e converter + scenario tagging) with a tightly-coupled product-code enabler (baker.go scriptless plumbing + aks-node-controller-wrapper.sh). Not a main regression.

Confidence: HIGH — three independent indicators converge: PR's own validator firing loudly on ACL/default, the EnableScriptlessNBCCSECmd flip mechanism, and the perfectly-symmetric distro×variant timeout pattern.

Strongest alternative (less likely): sysctlTemplateString rewrite in baker.go degrading DNS/conntrack reliability — refuted because every branch still sets tcp_retries2=8, LocalDNS scenarios don't customize sysctls, and a sysctl change wouldn't produce a 100% deterministic per-distro pattern lined up with files this PR touched.

Side-channel (not the cause, FYI): build2204arm64gen2containerd and build2404gen2containerd flagged succeededWithIssues with CIS regressions detected (1) — same 6.1.4.1 pattern recurring across renovate PRs the last 48h (#8652, #8294); non-gating here.

Recommended next action (owner: @lilypan26):

  1. Look at the ACL/default validation.go:52 env-var diff list first — it names the exact env keys the converter mis-maps.
  2. Audit e2e/node_config.go::nbcToAKSNodeConfigV1 for missing NBC field coverage — start with LocalDNSProfile, then any field that drives a systemd unit / cloud-init artifact. Use pkg/agent/baker.go rendering and aks-node-controller/parser as the reference.
  3. Decide if enabling EnableScriptlessNBCCSECmd = true for default variants of every scenario is intended in Phase 3 — if yes, converter coverage must be exhaustive before merge; if not, scope the flip to opted-in scenarios.
  4. Re-run the LocalDNS slice (Ubuntu2204/2404 + AzureLinuxV3 + ACL × {default, scriptless_nbc}) before re-requesting review.
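
One way to approach the converter audit in step 2 is a reflection-based check that lists fields left at their zero value after conversion. The sketch below is a heuristic under stated assumptions: `nodeConfig` and `localDNSProfile` are hypothetical stand-ins, not the real AKSNodeConfig types, and a zero value can be legitimate, so the result is a list of candidates to inspect, not a list of bugs.

```go
package main

import (
	"fmt"
	"reflect"
)

// localDNSProfile and nodeConfig are hypothetical stand-ins for the real
// config types; only their shape matters for this sketch.
type localDNSProfile struct {
	Enabled bool
}

type nodeConfig struct {
	ClusterName     string
	KubeletFlags    map[string]string
	LocalDNSProfile *localDNSProfile
}

// unsetFields returns the names of exported top-level fields that still hold
// their zero value. It accepts a struct or a pointer to a struct and returns
// nil for anything else.
func unsetFields(v any) []string {
	rv := reflect.ValueOf(v)
	if rv.Kind() == reflect.Pointer {
		rv = rv.Elem()
	}
	if rv.Kind() != reflect.Struct {
		return nil
	}
	var names []string
	rt := rv.Type()
	for i := 0; i < rv.NumField(); i++ {
		if rt.Field(i).IsExported() && rv.Field(i).IsZero() {
			names = append(names, rt.Field(i).Name)
		}
	}
	return names
}

func main() {
	// Simulates converter output where one field was never mapped.
	converted := nodeConfig{
		ClusterName:  "example",
		KubeletFlags: map[string]string{"--example-flag": "value"},
	}
	fmt.Println(unsetFields(converted))
}
```

A check like this is most useful in a test that feeds the converter an input with every field populated, so that any zero-valued output field indicates a missing mapping.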

Posted by Clawpilot AgentBaker gate detective.

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 21 out of 26 changed files in this pull request and generated 2 comments.

Comment thread parts/linux/cloud-init/artifacts/aks-node-controller-wrapper.sh
Comment thread parts/linux/cloud-init/artifacts/aks-node-controller-wrapper.sh

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 21 out of 26 changed files in this pull request and generated 4 comments.

Comment thread parts/linux/cloud-init/artifacts/aks-node-controller-wrapper.sh
Comment thread parts/linux/cloud-init/artifacts/aks-node-controller-wrapper.sh
Comment thread e2e/vmss.go
Comment thread aks-node-controller/app.go

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 21 out of 26 changed files in this pull request and generated 2 comments.

Comment thread parts/linux/cloud-init/artifacts/aks-node-controller-wrapper.sh
Comment thread aks-node-controller/helpers/const.go
Comment thread go.mod Outdated
go 1.25.10

require (
github.com/Azure/agentbaker/aks-node-controller v0.0.0-20241215075802-f13a779d5362

Contributor

Oh! This version won't exist when we import this into RP, will it? We import the v20260527.0 types.
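
For context on the pinned version in the go.mod snippet: a Go module pseudo-version embeds the UTC commit timestamp as yyyymmddhhmmss, so the commit date can be read directly from the string. The sketch below decodes it. The helper name `pseudoVersionTime` is hypothetical, and it handles only the common pseudo-version shapes without a `+incompatible` suffix.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// pseudoVersionTime extracts the UTC commit timestamp from a Go module
// pseudo-version such as vX.0.0-yyyymmddhhmmss-abcdefabcdef. The helper name
// is hypothetical; it does not validate the rest of the version string.
func pseudoVersionTime(v string) (time.Time, error) {
	parts := strings.Split(v, "-")
	if len(parts) < 3 {
		return time.Time{}, fmt.Errorf("%q is not a pseudo-version", v)
	}
	ts := parts[len(parts)-2]
	// Forms like vX.Y.Z-0.yyyymmddhhmmss-hash carry a "0." (or "pre.0.") prefix.
	if i := strings.LastIndex(ts, "."); i >= 0 {
		ts = ts[i+1:]
	}
	return time.Parse("20060102150405", ts)
}

func main() {
	ts, err := pseudoVersionTime("v0.0.0-20241215075802-f13a779d5362")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(ts.Format(time.RFC3339))
}
```

Decoding the timestamp shows how old the pinned commit is, which may help when comparing it against a dated release such as the v20260527.0 types mentioned in the comment.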

@aks-node-assistant

Contributor

AgentBaker Linux PR gate — Mass "node not ready" / 600s timeout across 209 scenarios (likely PR-related — scriptless phase 3 enablement)

  • Run: 167244552 (failed)
  • Failed task: Run AgentBaker E2E (Stage e2e → Job/Phase Run AgentBaker E2E)
  • Test summary: DONE 401 tests, 94 skipped, 209 failures in 3320.798s (~52% failure rate)

Dominant signature (essentially every failing scenario):

kube.go:160: [739.x] error listing nodes: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline
kube.go:189: [739.x] 🔴 FAIL: "<vmss-name>" haven't appeared in k8s API server: context deadline exceeded
panic.go:694: ✗ waiting for node <vmss-name> to be ready failed (600.0s)
panic.go:694: ✗ preparing AKS node failed (739.x s)

VMs reach running state and SSH bastion is reachable, but the test never sees the node register in the AKS API server within 600s, and the test framework's kube-client itself is hitting client-side rate-limit timeouts.
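
The "context deadline exceeded" result in the log above is what a poll-until-ready loop returns when its deadline passes before the condition becomes true. The following is a minimal stdlib-only sketch of that pattern. The names `waitFor` and `pollWithTimeout` are hypothetical, and it does not reproduce the e2e framework's kube client or its rate limiter.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// waitFor polls check every interval until it returns true or ctx expires.
// On expiry it returns ctx.Err(), which is context.DeadlineExceeded for a
// context created with a timeout.
func waitFor(ctx context.Context, interval time.Duration, check func() bool) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		if check() {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}

// pollWithTimeout is a small demo helper that runs waitFor under a fresh timeout.
func pollWithTimeout(timeout, interval time.Duration, check func() bool) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return waitFor(ctx, interval, check)
}

func main() {
	// A node that never registers: the poll gives up when the deadline passes.
	err := pollWithTimeout(50*time.Millisecond, 10*time.Millisecond, func() bool { return false })
	fmt.Println("never ready:", err)

	// A node that is already registered: the first check succeeds.
	err = pollWithTimeout(50*time.Millisecond, 10*time.Millisecond, func() bool { return true })
	fmt.Println("ready:", err)
}
```

Because the loop reports only that the deadline passed, it cannot tell a node that never registered apart from a check that could not run in time. That is consistent with the log showing both a listing error and a timeout.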

Cross-scenario scope: Ubuntu 22.04, Ubuntu 24.04, AzureLinuxV3, ACL, ARM64 — all distros, scriptless and non-scriptless variants alike. NO fwupd.service hits in this run (so this is NOT the recurring 24.04 main regression).

Cluster fingerprint: all failing scenarios route through managed cluster abe2e-kubenet-v5-150ee (note the v5 suffix — other concurrent PR runs in the same window use v4 clusters). This strongly suggests the cluster pool this PR provisions/uses is the locus of the issue.

Build-vs-test: test/infrastructure (no per-VM CSE/VHD failure observed; nodes boot but never join API).
Confidence: MEDIUM-HIGH that this is specific to this PR's scriptless phase 3 enablement — either (a) the new test cluster/pool kubenet-v5 is misconfigured or undersized for the parallel load this PR drives, (b) scriptless phase 3 enablement is creating an N× explosion of concurrent node-registration attempts that exhausts client-side AKS API rate limits and/or the cluster's kubelet→API capacity, or (c) the new path is missing a step that lets the kubelet register (e.g. CA/bootstrap token wiring on the v5 cluster).
Strongest alternative (less likely): generic westus3 AKS API throttling — refuted: concurrent PRs against the v4 cluster in the same window (e.g. #8600 build 167255168, #8652 build 167255195) do not exhibit this signature.

Recommended next action / owner: PR author + NodeSIG-dev — please (1) verify the abe2e-kubenet-v5-* cluster is healthy (kube-apiserver, CCM/CNI, node bootstrap token), (2) check whether scriptless phase 3 enablement is creating significantly higher concurrency than the pool can handle, (3) consider gating the cluster-pool upgrade behind a separate prep PR before flipping scriptless phase 3 on across all scenarios. Do not blanket-rerun.
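
If concurrency turns out to be the problem, as item (2) suggests checking, one common way to cap it in Go is a buffered channel used as a semaphore. This is a generic sketch under that assumption: `runBounded` is a hypothetical helper, it is not how the e2e suite schedules scenarios, and it returns the peak concurrency it observed only so the bound can be verified.

```go
package main

import (
	"fmt"
	"sync"
)

// runBounded runs every job with at most limit jobs in flight and returns
// the peak concurrency it observed. A limit below 1 is treated as 1.
func runBounded(limit int, jobs []func()) int {
	if limit < 1 {
		limit = 1
	}
	sem := make(chan struct{}, limit)
	var wg sync.WaitGroup
	var mu sync.Mutex
	running, peak := 0, 0

	for _, job := range jobs {
		wg.Add(1)
		sem <- struct{}{} // blocks while limit jobs are already in flight
		go func(job func()) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot after the job finishes

			mu.Lock()
			running++
			if running > peak {
				peak = running
			}
			mu.Unlock()

			job()

			mu.Lock()
			running--
			mu.Unlock()
		}(job)
	}
	wg.Wait()
	return peak
}

func main() {
	jobs := make([]func(), 20)
	for i := range jobs {
		jobs[i] = func() {}
	}
	fmt.Println("peak concurrency:", runBounded(3, jobs))
}
```

Acquiring the slot before starting the goroutine keeps the number of goroutines bounded as well as the number of running jobs.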

Side note: this run shows no fwupd.service 24.04 failures — the main regression flagged on other recent PRs may have been mitigated, or this PR's scenarios timed out before fwupd validators ran.

Posted by Clawpilot AgentBaker gate detective.

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 20 out of 24 changed files in this pull request and generated 3 comments.

Comment thread parts/linux/cloud-init/artifacts/aks-node-controller-wrapper.sh
Comment thread parts/linux/cloud-init/artifacts/aks-node-controller-wrapper.sh
Comment thread aks-node-controller/app.go

@aks-node-assistant

Contributor

AgentBaker Linux PR gate — Same mass "node not ready" / kubenet-v5 cluster signature as prior run (still PR-related, not fwupd)

  • Run: 167348100 (failed) — commit e46bcf8
  • Failed task: Run AgentBaker E2E
  • Test summary: DONE 402 tests, 95 skipped, 209 failures in 3354.505s (same ~52% failure rate as prior build 167244552)

Identical signature to the prior commented run:

  • 0 fwupd.service hits (so this is NOT the 24.04 main regression)
  • 74 client rate limiter hits + 291 kubenet-v5 cluster references
  • Every failing scenario routes through managed cluster abe2e-kubenet-v5-* and times out at the 600s waiting for node ... to be ready step with error listing nodes: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline.

Status: No change in failure shape between the prior run and this one — the scriptless phase 3 enablement + new kubenet-v5 cluster pool combination is still overwhelming the cluster's kubelet→API capacity and/or hitting client-side AKS API rate limits across all distros (Ubuntu 22.04/24.04, AzureLinuxV3, ACL, ARM64).

Three-level analysis: unchanged from prior comment on build 167244552. PR-related (scriptless phase 3 + kubenet-v5 cluster), not fwupd, not infra-flake.

Confidence: HIGH that this is specific to the scriptless phase 3 enablement + kubenet-v5 pool combo; the same exact 209-failure pattern reproducing across two builds on this PR with no commits to main between them strongly confirms determinism rather than transient throttling.

Recommended next action / owner: PR author + NodeSIG-dev — please follow up on prior actions: (1) verify abe2e-kubenet-v5-* cluster pool health and capacity, (2) reduce parallel concurrency or gate the v5 pool upgrade behind a prep PR, (3) check bootstrap-token wiring on the v5 cluster. Do not blanket-rerun — same outcome is expected without an env/scenario change.

Posted by Clawpilot AgentBaker gate detective.

6 participants