fix(acl): bump marketplace image to 3.20260602.01 #8669

Open
aadhar-agarwal wants to merge 1 commit into main from aadagarwal/update-acl-marketplace-images-20260602
Conversation

aadhar-agarwal commented Jun 9, 2026

Bumps the ACL marketplace image from 3.20260517.01 to 3.20260602.01 for all ACL VHD build jobs in both pipelines (.vsts-vhd-builder.yaml and .vsts-vhd-builder-release.yaml), covering the azure-linux-3-acl (x86) and azure-linux-3-arm64-gen2-acl (arm64) SKUs, FIPS and non-FIPS.

[TEST All VHDs] AKS Linux VHD Build - Msft Tenant - Running

AKS Linux VHD Build - TME Tenant - Passed

E2Ev2 AKS RP Customized Image Validation - Nominal

  • AI analysis of AKS E2E tests: The Customized Image Validation run shows 9 failing scenarios, none specific to this image bump. All node-image-specific checks pass on 3.0.20260602 (cluster create, CSI mount, and provisioning succeed); every failure also reproduces on other OS images and/or pipelines.
| Failing scenario(s) | What we observe | Evidence it isn't this image |
| --- | --- | --- |
| AzureFile_CSI_VM / _VMSS | NFS unmount-volume sub-test hangs: pod stuck Terminating, WaitForPodDeleted fails | Fails on all OS (Ubuntu/AzureLinux/ACL) across 8 pipelines since ~6/8; the prior ACL image 3.0.20260510 passed 6/1–6/7, then the same image failed 6/9+. Under investigation in AKS On Call. |
| Defender_Profile_Enable_New / _Existing | Defender add-on validation fails | Fails across 16+ pipelines and multiple OS simultaneously; the add-on ships independently of the OS image |
| KSCR, CrossTenant_Auxiliary_Token_Provider, Cross_Subscription_VNet | Infra/flaky: capacity (ResolveVMSize), internal-LB, subscription selection, log-fetch | Intermittent fleet-wide; pass on retry / other runs |

Conclusion: these failures are not specific to the 3.20260517.01 → 3.20260602.01 bump — each reproduces on other OS images and pipelines, and all node-image-specific checks pass.

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #

Copilot AI review requested due to automatic review settings June 9, 2026 22:50

Copilot AI left a comment

Pull request overview

Updates the pinned Azure Container Linux (ACL) Azure Marketplace base image version used by the VHD builder pipelines so ACL VHD build jobs consume the newly published image.

Changes:

  • Bump IMG_VERSION for azure-linux-3-acl (x86_64) from 3.20260517.01 to 3.20260602.01 in the main VHD builder pipeline.
  • Bump IMG_VERSION for azure-linux-3-arm64-gen2-acl (arm64) from 3.20260517.01 to 3.20260602.01 in the main VHD builder pipeline.
  • Apply the same version bumps to the release VHD builder pipeline (covering FIPS and non-FIPS ACL jobs).
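
One way to sanity-check a bump like this is to scan both pipeline files for any leftover pin of the old version. The sketch below is illustrative only: it assumes each pin appears on a line of the form `IMG_VERSION: <value>`, which is a guess at the YAML layout rather than something shown in this PR. The helper name `stale_pins` and the job names in the sample are hypothetical.

```python
import re

# Assumed layout: each pin is a line like "IMG_VERSION: 3.20260602.01".
# The real pipeline files may format this differently.
PIN_RE = re.compile(r"^\s*(?:-\s*)?IMG_VERSION:\s*[\"']?([0-9.]+)[\"']?\s*$")


def stale_pins(yaml_text: str, old_version: str) -> list[int]:
    """Return 1-based line numbers that still pin old_version."""
    stale = []
    for lineno, line in enumerate(yaml_text.splitlines(), start=1):
        match = PIN_RE.match(line)
        if match and match.group(1) == old_version:
            stale.append(lineno)
    return stale


# Hypothetical sample text; job names are made up for illustration.
sample = """\
jobs:
  - job: acl_x86
    variables:
      IMG_VERSION: 3.20260602.01
  - job: acl_arm64
    variables:
      IMG_VERSION: 3.20260517.01
"""

print(stale_pins(sample, "3.20260517.01"))
```

An empty list for both `.vsts-vhd-builder.yaml` and `.vsts-vhd-builder-release.yaml` would suggest no job was missed, under the layout assumption above.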

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| .pipelines/.vsts-vhd-builder.yaml | Updates the pinned ACL marketplace IMG_VERSION used by ACL build jobs (x86 and arm64; FIPS/non-FIPS). |
| .pipelines/.vsts-vhd-builder-release.yaml | Mirrors the same ACL IMG_VERSION bump in the release pipeline jobs. |

@aks-node-assistant

AgentBaker Linux PR gate — 236-failure mass run: shared cluster proxy-pod readiness exhaustion (test-infra, NOT this PR)

  • Run: 167393232 (failed)
  • Failed task: Run AgentBaker E2E
  • Test summary: DONE 402 tests, 95 skipped, 236 failures in 653.724s (~59% failure rate; 0 fwupd hits)
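
The ~59% figure follows directly from the summary line: 236 failures out of 402 tests. A minimal sketch of that arithmetic, assuming the summary keeps the `DONE N tests, S skipped, F failures in Ts` shape quoted above (the function name `parse_summary` is hypothetical):

```python
import re

SUMMARY_RE = re.compile(
    r"DONE (?P<tests>\d+) tests, (?P<skipped>\d+) skipped, "
    r"(?P<failures>\d+) failures in (?P<seconds>[\d.]+)s"
)


def parse_summary(line: str) -> dict:
    """Extract counts from a test summary line and derive the failure rate."""
    match = SUMMARY_RE.search(line)
    if match is None:
        raise ValueError(f"unrecognized summary line: {line!r}")
    counts = {key: int(match[key]) for key in ("tests", "skipped", "failures")}
    counts["seconds"] = float(match["seconds"])
    # Rate is failures over all tests, including skipped ones.
    counts["failure_rate"] = counts["failures"] / counts["tests"]
    return counts


summary = parse_summary("DONE 402 tests, 95 skipped, 236 failures in 653.724s")
print(f"{summary['failure_rate']:.1%}")
```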

Dominant failure:

prepare cluster tasks: dag execution failed:
waiting for proxy pod to be ready: listing proxy pods:
client rate limiter Wait returned an error: context deadline exceeded

Every failing scenario fails at the cluster.go:163: ✓ preparing cluster done (311.0s) boundary — the harness's e2e-proxy DaemonSet never reports ready inside the prepare-cluster DAG's deadline.

Cross-PR pattern (same window): identical 236-failure / ~60% pattern on PR #8652 build 167387444, PR #8294 build 167387406, PR #8600 build 167387387, and earlier PR #8618 build 167378787. Same proxy-pod-readiness exhaustion + intermittent ResourceGroupBeingDeleted on shared kubenet-v5/networkisolated-v2 cluster pools.

Build-vs-test: test-infra (shared cluster fleet), NOT product, NOT PR-caused.
This PR's exposure check: ACL marketplace image bump (3.20260602.01). No path from a marketplace image-tag change to the e2e proxy DaemonSet's readiness behavior.
Confidence: HIGH that PR #8669 is not the cause.
Strongest alternative (less likely): ACL image bump breaking e2e-proxy DaemonSet — refuted by 4+ unrelated concurrent PRs hitting identical signature.
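
The cross-PR argument above amounts to grouping concurrent builds by failure signature: if unrelated PRs share the signature, the PR under review is an unlikely cause. A rough sketch of that check, using the PR and build numbers quoted in this comment. The signature label `proxy-pod-readiness` and the function name are placeholders invented for illustration, not names from the gate tooling.

```python
from collections import defaultdict

# Hypothetical label for the shared failure signature.
SIG = "proxy-pod-readiness"

# (PR number, build ID, signature), taken from the comment above.
observations = [
    (8669, 167393232, SIG),
    (8652, 167387444, SIG),
    (8294, 167387406, SIG),
    (8600, 167387387, SIG),
    (8618, 167378787, SIG),
]


def other_prs_with_signature(observations, pr, signature):
    """Return the PRs other than `pr` whose builds failed with `signature`."""
    by_signature = defaultdict(set)
    for pr_number, _build, sig in observations:
        by_signature[sig].add(pr_number)
    return sorted(by_signature[signature] - {pr})


others = other_prs_with_signature(observations, 8669, SIG)
print(len(others), others)
```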

Recommended next action / owner: E2E infra / NodeSIG-dev — shared cluster fleet stabilization (proxy DaemonSet readiness + RG lifecycle). PR author: do NOT block merge intent on this; this is a draft PR — rerun once the shared cluster fleet recovers.

Posted by Clawpilot AgentBaker gate detective.

@aks-node-assistant

AgentBaker Linux PR gate — single E2E failure (kubelet-exec proxy 502, NOT this PR)

  • Run: 167509311
  • Failed job: Run AgentBaker E2E (only Test_Ubuntu2404Gen2/{default,scriptless_nbc} failed; all VHD builds passed)
  • Wiki signature: kubelet-exec-proxy-502 (new)

Detective summary

The Ubuntu 24.04 gen2 scenario node provisioned fine; moby-containerd 2.2.4-ubuntu24.04u2 was installed and validated. The debug pod (debugnonhost-mariner-tolerated) was placed and Ready. The failure happens at the very next step — the test exec'd containerd config dump against the pod via apiserver /exec:

encountered unexpected error when executing command on pod:
Internal error occurred: error sending request:
Post "https://10.220.112.108:10250/exec/default/debugnonhost-mariner-tolerated-.../mariner?command=...":
proxy error from localhost:9443 while dialing 10.220.112.108:10250, code 502: 502 Bad Gateway

This is a 502 from the apiserver kubelet-exec proxy (localhost:9443 → kubelet 10.220.112.108:10250). The node and pod are healthy; this is a transient apiserver→kubelet streaming proxy hiccup, classic test-infra flake.
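
To make the distinction concrete, the sketch below separates a proxy-hop 502 like the one above from other exec errors by matching the error text. The regular expression is derived only from the single log line quoted in this comment, and the function name is hypothetical; real messages may vary.

```python
import re

# Pattern inferred from the one quoted log line; treat it as an assumption.
PROXY_502_RE = re.compile(
    r"proxy error from (?P<proxy>\S+) while dialing (?P<target>\S+), code 502"
)


def classify_exec_error(message: str) -> str:
    """Label an exec error as a proxy-hop 502 or as something else."""
    if PROXY_502_RE.search(message):
        return "proxy-502"
    return "other"


log = (
    "Internal error occurred: error sending request: "
    "proxy error from localhost:9443 while dialing 10.220.112.108:10250, "
    "code 502: 502 Bad Gateway"
)
print(classify_exec_error(log))
```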

Classification: Test infrastructure / shared-cluster transient (apiserver kubelet-exec proxy 502).

Confidence: High. PR #8669 changes only .pipelines/.vsts-vhd-builder.yaml and .vsts-vhd-builder-release.yaml (ACL marketplace image version bump 3.20260602.01) — pure pipeline metadata change, no runtime/CSE/kubelet code. The failed path is entirely on the shared cluster control plane, not on the node under test.

Strongest alternative theory: A kubelet/containerd regression introduced by the moby-containerd 2.2.4 install during node provisioning making /exec/.../containerd config dump fail. Less likely because the error is a transport-layer 502 from the apiserver proxy (kubelet was never reached at all), not a containerd command error or kubelet-side rejection.

Recommended next action / owner: No PR change required. Recommend rerun of the failed leg only. If this 502 pattern recurs across multiple PRs, AgentBaker E2E test-infra owner should look at the shared cluster's apiserver-kubelet network reliability.

Evidence used: failed task log (3 === FAIL for one scenario + subtests, single 502 at the /exec proxy hop), all other E2E scenarios passed, all VHD builds passed, PR changes only touch pipeline YAML.

Bumps the ACL marketplace image from 3.20260517.01 to 3.20260602.01 for
all ACL VHD build jobs in both pipelines (.vsts-vhd-builder.yaml and
.vsts-vhd-builder-release.yaml), covering the azure-linux-3-acl (x86) and
azure-linux-3-arm64-gen2-acl (arm64) SKUs, FIPS and non-FIPS.

3.20260602.01 went go-live (public) on 2026-06-08 for both ACL SKUs. The
VHD scripts already discover the active UKI dynamically, so this is a
plain version bump with no script changes required.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 10, 2026 20:50
@aadhar-agarwal aadhar-agarwal force-pushed the aadagarwal/update-acl-marketplace-images-20260602 branch from af75a11 to 1dbd109 Compare June 10, 2026 20:50

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

@aks-node-assistant

AgentBaker Linux PR gate — 3 distinct E2E failures, all test-infra (NOT this PR)

  • Run: 167536780
  • Failed job: Run AgentBaker E2E (3 scenarios / 7 subtests; all VHD builds passed)
  • Failed scenarios: Test_Ubuntu2204_DisableKubeletServingCertificateRotationWithTags_AlreadyDisabled/default, Test_Ubuntu2204_HTTPSProxy_PrivateDNS/{default,scriptless_nbc}, Test_Ubuntu2204_PMC_Install/default

Detective summary — two known signatures

(1) wireserver-blocking-validator-assertion, seen on Test_Ubuntu2204_DisableKubeletServingCertificateRotationWithTags_AlreadyDisabled/default and Test_Ubuntu2204_PMC_Install/default:

🔴 FAIL: wireserver check "wireserver port 80 goalstate":
        unexpected curl exit code "0" (want 28 timeout or 7 refused)

The iptables DROP rule for 168.63.129.16:80 is present in the FORWARD chain, but the test's curl still gets exit 0. Same conntrack/TIME_WAIT leakage seen previously. Third occurrence of this signature.
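
The assertion quoted above treats the wireserver as blocked only when curl exits with 28 (operation timed out) or 7 (failed to connect), so an exit code of 0 means the request got through. A minimal sketch of that rule follows; the function name is hypothetical and this is not the validator's actual code.

```python
# curl exit codes: 7 = failed to connect, 28 = operation timed out.
BLOCKED_EXIT_CODES = {7: "refused", 28: "timeout"}


def check_wireserver_blocked(exit_code: int) -> tuple[bool, str]:
    """Return (blocked, reason) for a curl exit code against the wireserver."""
    if exit_code in BLOCKED_EXIT_CODES:
        return True, f"blocked ({BLOCKED_EXIT_CODES[exit_code]})"
    return False, (
        f'unexpected curl exit code "{exit_code}" (want 28 timeout or 7 refused)'
    )


print(check_wireserver_blocked(0))
print(check_wireserver_blocked(28))
```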

(2) httpsproxy-fixture-proxy-unreachable: Test_Ubuntu2204_HTTPSProxy_PrivateDNS/{default,scriptless_nbc} hit the recurring HTTP proxy fixture unreachable pattern at 10.14.0.0/24:8888. Fifth occurrence of this signature; approaching escalation threshold (>6 distinct builds).

Classification: Test infrastructure / test-code flakiness.

Confidence: High. PR #8669 only touches .pipelines/.vsts-vhd-builder.yaml and .vsts-vhd-builder-release.yaml (ACL marketplace image version bump 3.20260602.01); no runtime/CSE/iptables/proxy code is in scope. None of the failures are on ACL/marketplace-image-specific scenarios.

Strongest alternative theory: ACL marketplace image bump silently changes some node config that breaks wireserver blocking and the HTTPSProxy fixture. Less likely because the affected scenarios are non-ACL Ubuntu 22.04 scenarios on shared clusters, and the iptables rule itself is still present (this is a conntrack/test-code issue) — and the proxy issue is on a dedicated test fixture network.

Recommended next action / owner: No PR change required. Recommend rerun. Wiki signatures already track owners for both issues.

Evidence used: failed task log (7 === FAIL markers across 3 scenarios, 2 wireserver-validator assertions + 2 HTTPSProxy CSE-exit-99 failures), all VHD builds passed, PR changed files limited to pipeline YAML.

@aks-node-assistant

🕵️ AgentBaker Linux Gate Detective — Build 167684070

Failed job: Run AgentBaker E2E (Stage e2e)

Summary: All *_NetworkIsolatedCluster_* scenarios failed during CSE bootstrap with exit 52 — the node could not resolve the API server FQDN abe2e-azure-networkisolated-v3-kq4wzvpl.hcp.westus3.azmk8s.io against the private DNS resolver 169.254.10.10 (repeated NXDOMAIN for 300s, then Total timeout 300 reached, nslookup ... failed). Failures span Ubuntu 22.04, Azure Linux V3, and ACL VHDs against the same shared network-isolated cluster FQDN — the private DNS zone for the test cluster's API server was missing/misconfigured for the duration of this run. A second cluster of ImagePullIdentityBinding_* / Random_VHD_With_Latest_K8s failures (~4s elapsed) is shared-cluster prepareCluster cascading from the same outage (context canceled on /api/v1/nodes, daemonset debug-mariner-tolerated modify race).
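
The behaviour described above (repeated lookups until a 300s total timeout) can be sketched as a retry loop with an overall deadline. This is an illustration only, not the CSE script: the resolver, clock, and sleep are injected so the loop can run without a network, and all names here are hypothetical.

```python
import socket
import time


def wait_for_dns(fqdn, total_timeout=300.0, interval=1.0,
                 resolve=socket.gethostbyname,
                 clock=time.monotonic, sleep=time.sleep):
    """Retry resolving fqdn until it succeeds or total_timeout seconds elapse.

    Returns (address, attempts) on success, raises TimeoutError otherwise.
    """
    deadline = clock() + total_timeout
    attempts = 0
    while True:
        attempts += 1
        try:
            return resolve(fqdn), attempts
        except OSError:  # socket.gaierror is a subclass of OSError
            pass
        if clock() >= deadline:
            raise TimeoutError(
                f"Total timeout {total_timeout:g} reached after {attempts} attempts"
            )
        sleep(interval)


# Demo with a stub resolver that fails twice, then succeeds (no network needed).
calls = {"n": 0}


def flaky(_name):
    calls["n"] += 1
    if calls["n"] < 3:
        raise socket.gaierror("NXDOMAIN")
    return "10.0.0.1"


print(wait_for_dns("example.internal", interval=0, resolve=flaky))
```

With a resolver that never succeeds, as in this run, the loop exhausts the deadline and raises instead of returning.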

Classification: 🟦 Test-infra flake (deterministic for the duration of this run, environmental — not PR-caused)
Build vs Test class: Test (E2E pre-flight / CSE DNS check)
Confidence: High

Wiki signature: networkisolated-apiserver-fqdn-nxdomain (existing — reuse)

Strongest alternative theory (challenged & rejected): PR #8669 bumps the ACL marketplace image to 3.20260602.01, so an ACL-only CSE regression was considered. Rejected because (a) the nslookup to the API server runs before any ACL/marketplace component is exercised; (b) non-ACL scenarios (Test_Ubuntu2204_NetworkIsolatedCluster_NonAnonymousACR, Test_AzureLinuxV3_NetworkIsolated_Package_Install, Test_AzureLinuxV3_NetworkIsolatedCluster_NonAnonymousACR) fail with the identical NXDOMAIN against the identical FQDN — the PR cannot affect non-ACL VHD paths; (c) prior builds on this PR (167393232, 167509311, 167536780) hit unrelated test-infra signatures, indicating the gate is currently noisy for environmental reasons unrelated to the ACL bump.

Recommended next action: Re-queue the E2E job. If the same networkisolated-apiserver-fqdn-nxdomain signature reappears on the next run, escalate to the AKS test-infra / shared cluster owner to validate the private DNS zone for abe2e-azure-networkisolated-v3-*.hcp.westus3.azmk8s.io in westus3.


Posted by clawpilot AgentBaker Linux Gate Detective Watcher. Build de-duped via hidden marker; do not edit.


3 participants