feat(linux): refactor aks-secure-tls-bootstrap-client installation to use PMC/MCR and bump to v1.1.4-1 by cameronmeissner · Pull Request #8618 · Azure/AgentBaker

cameronmeissner · 2026-06-01T20:14:35Z

What this PR does / why we need it:

Refactor the aks-secure-tls-bootstrap-client installation to use PMC/MCR now that the client is built and published by dalec. This enables secure TLS bootstrapping support on FIPS images through dalec's out-of-the-box FIPS support. As such, this PR also updates the AgentBaker E2Es to enable secure TLS bootstrapping on all VHDs according to the user-specified E2E configuration.

This PR also onboards the secure TLS bootstrap client to Renovate for Linux so we can start consuming updates automatically.

Which issue(s) this PR fixes:

Fixes #

Copilot

Pull request overview

This PR refactors how aks-secure-tls-bootstrap-client is sourced for Linux images, moving away from GitHub release tarballs toward packages.microsoft.com (PMC) for Ubuntu/Azure Linux and MCR (OCI/sysext) for Flatcar/ACL, and updates Renovate ownership for related updates.

Changes:

Update parts/common/components.json to define distro-specific sources/versions for aks-secure-tls-bootstrap-client (PMC for Ubuntu/Azure Linux, MCR sysext for Flatcar).
Update VHD build dependency caching logic to use package/sysext download helpers instead of a direct tarball download.
Rename the “download from URL” helper in cse_install.sh for clarity and adjust its callsite; tweak Renovate assignee/reviewer rules.

Package Update Analysis: aks-secure-tls-bootstrap-client

Version change: 1.1.2 → 1.1.3 (patch update)
OS variants affected: Ubuntu 20.04/22.04/24.04, Azure Linux 3.0, Flatcar (sysext), Windows
OS variants NOT updated: Mariner (no entry / no default fallback) — causes silent skip on Mariner builds.

Upstream changelog: Not evaluated here (not available in-repo). Manual validation recommended.
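
The "no entry / no default fallback" gap noted above for Mariner can be illustrated with a small lookup that fails loudly instead of skipping silently. This is a sketch under stated assumptions: `resolve_version` is a hypothetical helper, and the `distro=version` argument format is invented for illustration. It is not the components.json schema or the AgentBaker lookup code.

```shell
# Hypothetical lookup: resolve a package version for a distro, fall back to a
# "default" entry if one exists, and return an error when neither is present.
# Entries are passed as "distro=version" arguments purely for illustration.
resolve_version() {
  local distro="$1" entry default_version=""
  shift
  for entry in "$@"; do
    case "$entry" in
      "$distro="*)
        # Exact distro match wins immediately.
        echo "${entry#*=}"
        return 0
        ;;
      default=*)
        # Remember the fallback, but keep looking for an exact match.
        default_version="${entry#*=}"
        ;;
    esac
  done
  if [ -n "$default_version" ]; then
    echo "$default_version"
    return 0
  fi
  echo "error: no version entry for '$distro' and no default fallback" >&2
  return 1
}
```

Called as `resolve_version mariner ubuntu=1.1.3 azurelinux=1.1.3`, the helper returns a non-zero status and prints an error, which a build script could use to stop rather than continue without the package.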

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| `vhdbuilder/packer/install-dependencies.sh` | Switch `aks-secure-tls-bootstrap-client` handling to package/sysext download flow during VHD build. |
| `parts/linux/cloud-init/artifacts/cse_install.sh` | Rename the custom-URL download helper and update its caller. |
| `parts/common/components.json` | Move component metadata to distro-specific PMC/MCR sources and bump versions. |
| `.github/renovate.json` | Adjust Renovate assignees/reviewers and add a rule grouping for this component. |

Copilot

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Copilot

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.

aks-node-assistant · 2026-06-08T23:01:39Z

AgentBaker Linux PR gate — Ubuntu 24.04 fwupd.service mass E2E failure (RECURRING main regression, NOT this PR)

Run: 167219726 (failed)
Failed task: Run AgentBaker E2E (Stage e2e → Job/Phase Run AgentBaker E2E)
Signature: validators.go:995: 🔴 FAIL: the following systemd units have unexpectedly entered a failed state: [fwupd.service]
Scope: Ubuntu 24.04 scenarios (Test_Ubuntu2404_SecureTLSBootstrapping_BootstrapToken_Fallback, Test_Ubuntu2404_NPD_Basic, and others)

This matches an active main-branch regression flagged earlier today on PR #8294 build 167206065 and re-confirmed on PR #8294 build 167221197 within the same ~1.5h window. All three runs share the same [fwupd.service] failed-unit signature across unrelated PRs (node-exporter bump, this STLS client refactor, etc.).

Build-vs-test: product/VHD regression caught by E2E (NOT a flake, NOT test-code).
This PR's exposure check: changes refactor aks-secure-tls-bootstrap-client install to PMC/MCR; the failing validator is the systemd-unit health check, not STLS install. STLS tests in this run failed because the post-install systemd-units validator trips on fwupd.service before STLS-specific assertions could differentiate. No evidence the PR introduced or worsened the fwupd state.
Confidence: HIGH that PR #8618 is not the cause; HIGH that this is a 24.04 VHD main regression around fwupd.service.
Strongest alternative (less likely): STLS PMC/MCR refactor altering boot-time package install order and breaking fwupd.service first-start — refuted: the same signature reproduces on PRs that don't touch STLS or package install order.

Recommended next action / owner: NodeSIG-dev — bisect main since the last green 24.04 E2E for anything touching fwupd or systemd unit enablement in vhdbuilder/packer/install-dependencies.sh / tool_installs_distro.sh. Likely mitigation: mask fwupd.service in the 24.04 VHD or fix the first-start dependency. PR author: do NOT block merge on this; rerun once the main fix lands. If you want to be extra safe, rebase once the fix is in to confirm 24.04 E2E goes green for this PR's diff.
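
The masking mitigation suggested above could look roughly like the following. This is a sketch, not code from install-dependencies.sh: `mask_unit_cmds` is a hypothetical helper, and it only prints the `systemctl` commands so the change can be reviewed without touching a live system.

```shell
# Hypothetical helper: print the commands that would keep a unit from starting.
# "systemctl mask" links the unit to /dev/null so nothing can start it, and
# "systemctl disable --now" stops it and removes its enablement symlinks first.
mask_unit_cmds() {
  local unit="$1"
  case "$unit" in
    *.service|*.timer|*.socket) ;;
    *)
      echo "error: '$unit' does not look like a systemd unit name" >&2
      return 1
      ;;
  esac
  printf 'systemctl disable --now %s\n' "$unit"
  printf 'systemctl mask %s\n' "$unit"
}
```

Running `mask_unit_cmds fwupd.service` prints the two commands for the unit named in the failure signature. Whether masking is preferable to fixing the first-start dependency is left open in the comment above.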

Posted by Clawpilot AgentBaker gate detective.

aks-node-assistant · 2026-06-09T01:01:10Z

AgentBaker Linux PR gate — Ubuntu 24.04 fwupd.service mass E2E failure (STILL the recurring main regression, NOT this PR)

Run: 167238023 (failed) — new commit df88bc2 bumping STLS client to v1.1.4-1
Failed task: Run AgentBaker E2E (Stage e2e → Job/Phase Run AgentBaker E2E)
Test summary: DONE 438 tests, 95 skipped, 17 failures in 1666.129s
Primary signature: validators.go:995: 🔴 FAIL: the following systemd units have unexpectedly entered a failed state: [fwupd.service] (6 hits across this run)

Failing scenarios (all Ubuntu 24.04 except one):

Test_LocalDNSHostsPlugin/Ubuntu2404/{default,scriptless_nbc}
Test_Ubuntu2404_SecureTLSBootstrapping_BootstrapToken_Fallback/default
Test_Ubuntu2404_CSE_CachedPerformance/default
Test_Ubuntu2404_CSE_FullInstallPerformance/default
Test_Ubuntu2404Gen2/default
Test_Ubuntu2404Gen2_McrChinaCloud/scriptless_nbc
Test_Ubuntu2204Gen2_ImagePullIdentityBinding_NetworkIsolated/{default,scriptless_nbc} ← separate ongoing NetworkIsolated infra/fixture issue, not fwupd

This is the same fwupd.service 24.04 main regression previously flagged on builds 167206065, 167219726, and 167221197. New STLS commit landed but failure shape and scope are unchanged.

Build-vs-test: product/VHD regression caught by E2E (NOT a flake, NOT test-code, NOT STLS-related).
This PR's exposure check: STLS install moved to PMC/MCR + bumped to v1.1.4-1. The failing validator is the generic systemd-unit health check tripping on fwupd.service before STLS-specific assertions run; STLS BootstrapToken_Fallback failure is downstream of that pre-condition, not STLS install logic. No new failure modes introduced by the bump.
Confidence: HIGH that PR #8618 is not the cause; HIGH that this is a 24.04 VHD main regression around fwupd.service; the NetworkIsolated 22.04 failure is a separate known infra issue.
Strongest alternative (less likely): STLS PMC/MCR refactor altering boot-time package install order and breaking fwupd.service first-start — refuted: identical signature reproduces on unrelated PRs (renovate node-exporter #8294) on the same main HEAD; scope is strictly 24.04.

Recommended next action / owner: NodeSIG-dev — main-branch fix still pending. Likely mitigation: mask fwupd.service in 24.04 VHD or fix the first-start dependency in vhdbuilder/packer/install-dependencies.sh / tool_installs_distro.sh. PR author: do NOT block merge on this; rebase + rerun once the main fix lands to confirm a clean 24.04 leg.

Posted by Clawpilot AgentBaker gate detective.

Copilot

Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 2 comments.

aks-node-assistant · 2026-06-09T23:01:24Z

AgentBaker Linux PR gate — Mass 173-failure run: shared kubenet-v5 cluster's MC resource group was deleted mid-run (test-infra, NOT this PR)

Run: 167378787 (failed) — commit c1e52cb
Failed task: Run AgentBaker E2E
Test summary: DONE 402 tests, 95 skipped, 173 failures in 1276.143s (~43% failure rate; 0 fwupd hits, so NOT the 24.04 main regression)

Exact failure signature (identical across essentially every failing scenario):

PUT .../resourceGroups/MC_abe2e-westus3_abe2e-kubenet-v5-150ee_westus3/.../virtualMachineScaleSets/<name>
RESPONSE 409: 409 Conflict
ERROR CODE: ResourceGroupBeingDeleted
"The resource group 'MC_abe2e-westus3_abe2e-kubenet-v5-150ee_westus3' is in deprovisioning state and cannot perform this operation."

Every failing scenario across all distros (Ubuntu 22.04/24.04, AzureLinuxV3/V2, ACL, ARM64, OSGuard) fails in <2s at the VMSS create step because the shared managed-cluster resource group MC_abe2e-westus3_abe2e-kubenet-v5-150ee_westus3 is being torn down. Scenarios that don't target the v5 cluster pass cleanly.

Three-level analysis:

L1: 409 ResourceGroupBeingDeleted on MC_abe2e-westus3_abe2e-kubenet-v5-150ee_westus3 at VMSS PUT time, ~1.7s into each scenario.
L2 corroboration: ~105 "Received unexpected error" hits, all with the same RG-being-deleted signature; failures span every distro family in the test matrix; failures happen before any node provisioning is attempted; PR #8618 (STLS client PMC/MCR refactor) doesn't touch test cluster lifecycle/teardown. Cross-PR pattern in the same window: a similar mass cluster-not-available signature was seen on prior runs of PR #8453 (test: (scriptless) Enable scriptless phase 3 in AB e2es) against the same v5 cluster pool.
L3 challenge, alternatives: (a) PR-caused: STLS PMC/MCR refactor causing cluster RG deletion → no plausible path, the RG deletion is an Azure RM operation initiated outside the test process; (b) Azure RM transient on a per-resource basis → refuted, it's an explicit ResourceGroupBeingDeleted deprovisioning state on the cluster's RG, not random throttling; (c) prior gate run / cluster lifecycle automation deleted the shared abe2e-kubenet-v5-150ee cluster right as this run started: most likely. Strongest alt is (c) and it's the same root cause class as the recurring kubenet-v5 cluster instability flagged on PR #8453.

Build-vs-test: test-infra (shared cluster pool lifecycle), NOT product, NOT PR-caused.
Confidence: HIGH that PR #8618 is not the cause.

Recommended next action / owner: E2E infra / NodeSIG-dev. The abe2e-kubenet-v5-150ee shared cluster's MC RG was deleted while this run was using it. Either (a) the cluster pool cleanup automation needs an "in-use" check against active builds, (b) per-build clusters should be used for runs that hit the v5 path, or (c) the gate should retry/wait on ResourceGroupBeingDeleted and route to a fresh cluster. This is the same recurring kubenet-v5 pool stability issue flagged earlier today on PR #8453. PR author: do NOT block merge on this; rerun once the v5 cluster is restored.
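
Option (c) above can be sketched as a small error classifier plus a bounded retry loop. This is an illustration only: `is_rg_being_deleted` and `retry_on_rg_deletion` are hypothetical names, the real E2E harness may handle this very differently, and the only string matched is the error code quoted in the log above.

```shell
# Hypothetical: succeed when the error text carries the ResourceGroupBeingDeleted code.
is_rg_being_deleted() {
  printf '%s\n' "$1" | grep -q 'ResourceGroupBeingDeleted'
}

# Hypothetical: run a command up to max_attempts times. Only failures whose
# output shows the resource group is being deleted are retried; any other
# failure is returned immediately with the command's own exit status.
retry_on_rg_deletion() {
  local max_attempts="$1" delay_seconds="$2" attempt=1 output status
  shift 2
  while :; do
    status=0
    output="$("$@" 2>&1)" || status=$?
    if [ "$status" -eq 0 ]; then
      printf '%s\n' "$output"
      return 0
    fi
    if ! is_rg_being_deleted "$output" || [ "$attempt" -ge "$max_attempts" ]; then
      printf '%s\n' "$output" >&2
      return "$status"
    fi
    attempt=$((attempt + 1))
    sleep "$delay_seconds"
  done
}
```

Retrying the same request against a resource group that is being deleted will keep failing, so the retried command would also need to pick a fresh cluster on each attempt, as the recommendation notes.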

Posted by Clawpilot AgentBaker gate detective.

aks-node-assistant · 2026-06-10T01:02:25Z

AgentBaker Linux PR gate — Test, Scan, and Cleanup fails on every VHD: VM-side test script can't fetch refs/pull/8618/merge (PR-merge-ref race, NOT PR-code regression)

Run: 167401158 (failed) — commit ece2ca4
Failed tasks: Test, Scan, and Cleanup across 10 VHD jobs (every distro build's post-VHD scan)
Class: VHD test/scan (not E2E)

Exact first-failure signature (identical across every distro):

Cloning AgentBaker repo and checking out remote branch 'refs/pull/8618/merge' into local branch 'refs-pull-8618-merge'
[stderr]
fatal: couldn't find remote ref refs/pull/8618/merge
git-clone:Error: Failed to fetch remote branch 'refs/pull/8618/merge' into local branch 'refs-pull-8618-merge'
git-clone:Error: Used command 'git fetch --quiet origin refs/pull/8618/merge:refs-pull-8618-merge'
Tests failed.
run-test failed 2 times
./vhdbuilder/packer/test/run-test.sh exited with code 1

The VHD test/scan VM (run-command extension) is trying to git fetch origin refs/pull/8618/merge:refs-pull-8618-merge against github.com/Azure/AgentBaker, and GitHub returns "couldn't find remote ref". This happens on every VHD distro job because they all hit the same VM-side script and the same upstream ref.

Three-level analysis:

L1: git fetch refs/pull/8618/merge returns "couldn't find remote ref" → run-test exits 1 → Test, Scan, and Cleanup exits 2 on every distro.
L2 corroboration: Build VHD itself succeeded for every distro (no Packer/CSE/CIS failure here); the failure is strictly in the post-VHD on-VM test that re-clones the PR. PR #8618 has been force-pushed multiple times today (commits df88bc2 → 11f5b0f → eff2c18 → now ece2ca4). GitHub's refs/pull/8618/merge is recomputed on each push and can be temporarily missing during force-push cycles, especially if the merge result has conflicts at the moment of fetch. CSE/cluster-prep/E2E are unaffected: the Run AgentBaker E2E task is not in the failed-task list here.
L3 challenge, alternatives: (a) PR's STLS PMC/MCR refactor making refs/pull/8618/merge invalid: no, refs/pull/N/merge is a GitHub-server computed ref, not something the PR's code can touch; (b) GitHub-side outage: would have affected concurrent PRs' run-tests too, but this only fails on #8618; (c) the merge ref was unavailable at fetch time because the PR was being force-pushed/recomputed: most likely. The build was triggered by source SHA ece2ca4 but the VM fetch happens later and resolves the floating refs/pull/8618/merge ref at fetch time, which is racy.

Build-vs-test: test-infra / harness coupling to GitHub's transient merge-ref state. The vhdbuilder/packer/test/run-test.sh VM-side checkout should pin to the specific PR head SHA ($SYSTEM_PULLREQUEST_SOURCECOMMITID) instead of the floating refs/pull/N/merge ref.
Confidence: HIGH that this is the merge-ref-race class, NOT a STLS refactor regression.
Strongest alternative (less likely): STLS PMC/MCR refactor changed the test runner to require something only present in the merge ref — refuted: the failure is at git fetch time, before any test-runner code from the PR is executed.

Recommended next action / owner: PR author can simply rerun the build once GitHub has settled the merge ref. Permanent fix (NodeSIG-dev / AgentBaker test-infra): in vhdbuilder/packer/test/run-test.sh and the run-command payload, replace the git fetch origin refs/pull/$PR/merge:... with a fetch of the explicit source-commit SHA passed by the pipeline ($SYSTEM_PULLREQUEST_SOURCECOMMITID or build's Build.SourceVersion). This eliminates the race entirely and matches what other Azure DevOps pipelines do. Do NOT block merge on this; rerun likely passes.
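
A minimal sketch of the pinned checkout suggested above, assuming the pipeline passes the PR head SHA through to the VM-side script and that the remote permits fetching by full commit SHA. The function names `resolve_fetch_ref` and `print_fetch_cmd` are hypothetical and not taken from run-test.sh; the sketch only decides which ref to fetch and prints the git command instead of running it.

```shell
# Hypothetical helper: choose the ref the test VM should fetch.
# Prefers an explicit full commit SHA (40 hex characters); falls back to the
# floating merge ref only when no usable SHA was provided.
resolve_fetch_ref() {
  local sha="$1" pr="$2"
  if printf '%s' "$sha" | grep -Eq '^[0-9a-f]{40}$'; then
    echo "$sha"
  elif [ -n "$pr" ]; then
    echo "refs/pull/${pr}/merge"
  else
    echo "error: neither a commit SHA nor a PR number was provided" >&2
    return 1
  fi
}

# Print (rather than run) the fetch command for review.
print_fetch_cmd() {
  local ref
  ref="$(resolve_fetch_ref "$1" "$2")" || return 1
  echo "git fetch --quiet origin ${ref}"
}
```

One trade-off to keep in mind: fetching the head SHA tests the PR branch as pushed, whereas refs/pull/N/merge tests the result of merging it into the target branch, so the two checkouts are not equivalent.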

Posted by Clawpilot AgentBaker gate detective.

feat(linux): refactor aks-secure-tls-bootstrap-client installation to use PMC/MCR

c478da7

Copilot AI review requested due to automatic review settings June 1, 2026 20:14

cameronmeissner requested review from AbelHu, Devinwong, SriHarsha001, awesomenix, calvin197, djsly, ganeshkumarashok, lilypan26, mxj220, pdamianov-dev, phealy, r2k1, sulixu, surajssd, timmy-wright and zachary-bailey as code owners June 1, 2026 20:14

Copilot started reviewing on behalf of cameronmeissner June 1, 2026 20:14

github-actions Bot added the components label (This pull request updates cached components on Linux or Windows VHDs) Jun 1, 2026

Copilot AI reviewed Jun 1, 2026

Comment thread vhdbuilder/packer/install-dependencies.sh Outdated

Comment thread vhdbuilder/packer/install-dependencies.sh

Comment thread parts/linux/cloud-init/artifacts/cse_install.sh Outdated

Comment thread parts/common/components.json Outdated

cameronmeissner added 2 commits June 1, 2026 15:15

chore: sysext fixes

517e439

chore: comments

4a05660

Copilot AI review requested due to automatic review settings June 1, 2026 22:22

Copilot started reviewing on behalf of cameronmeissner June 1, 2026 22:23

chore: conflicts

18cad2c

Copilot AI reviewed Jun 1, 2026

Comment thread vhdbuilder/packer/install-dependencies.sh

Comment thread parts/common/components.json

cameronmeissner added 2 commits June 1, 2026 16:29

chore: fix sysexts

38f80af

chore: handle arch suffix

82152c8

Copilot AI review requested due to automatic review settings June 1, 2026 23:42

cameronmeissner added 2 commits June 8, 2026 14:13

Merge branch 'main' of https://github.com/Azure/AgentBaker into cameissner/stls-client-dalec-linux

b53fca7

chore: update to 1.1.4

69392a4

Copilot AI review requested due to automatic review settings June 8, 2026 21:16

Copilot started reviewing on behalf of cameronmeissner June 8, 2026 21:16

cameronmeissner changed the title ~~feat(linux): refactor aks-secure-tls-bootstrap-client installation to use PMC/MCR~~ feat(linux): refactor aks-secure-tls-bootstrap-client installation to use PMC/MCR and bump to 1.1.4-1 Jun 8, 2026

cameronmeissner changed the title ~~feat(linux): refactor aks-secure-tls-bootstrap-client installation to use PMC/MCR and bump to 1.1.4-1~~ feat(linux): refactor aks-secure-tls-bootstrap-client installation to use PMC/MCR and bump to v1.1.4-1 Jun 8, 2026

Copilot AI reviewed Jun 8, 2026

Comment thread parts/linux/cloud-init/artifacts/flatcar/cse_install_flatcar.sh

aks-node-assistant Bot mentioned this pull request Jun 8, 2026

chore(deps): update node-exporter-kubernetes (patch) #8294

Open

1 task

chore: harden disabled logic for flatcar/acl

df88bc2

chore: enable FIPS E2Es

eff2c18

Copilot AI review requested due to automatic review settings June 9, 2026 21:12

Copilot started reviewing on behalf of cameronmeissner June 9, 2026 21:12

Copilot AI reviewed Jun 9, 2026

Comment thread parts/common/components.json

Comment thread e2e/node_config.go

djsly reviewed Jun 9, 2026

Comment thread .github/renovate.json

djsly reviewed Jun 9, 2026

Comment thread parts/linux/cloud-init/artifacts/azlosguard/azurelinux-ms-oss.repo

djsly approved these changes Jun 9, 2026

chore: update renovate.json

ece2ca4

lilypan26 approved these changes Jun 9, 2026

cameronmeissner merged commit 633b13e into main Jun 9, 2026
23 of 41 checks passed

cameronmeissner deleted the cameissner/stls-client-dalec-linux branch June 9, 2026 23:56

aks-node-assistant Bot mentioned this pull request Jun 10, 2026

fix(acl): bump marketplace image to 3.20260602.01 #8669

Open

aks-node-assistant Bot mentioned this pull request Jun 10, 2026

fix(cse): expand NODE_NAME in DRA driver systemd unit override #8676

Draft

4 participants
