feat(linux): refactor aks-secure-tls-bootstrap-client installation to use PMC/MCR and bump to v1.1.4-1#8618

Merged
cameronmeissner merged 25 commits into main from cameissner/stls-client-dalec-linux
Jun 9, 2026

@cameronmeissner cameronmeissner commented Jun 1, 2026

Contributor

What this PR does / why we need it:

Refactors the aks-secure-tls-bootstrap-client installation to use PMC/MCR now that the client is built and published by dalec. This enables secure TLS bootstrapping support on FIPS images through dalec's out-of-the-box FIPS support. As such, this PR also updates the AgentBaker E2Es to enable secure TLS bootstrapping on ALL VHDs according to the user-specified E2E configuration.

This PR also onboards the secure TLS bootstrap client to Renovate for Linux so we can start consuming updates automatically.

Which issue(s) this PR fixes:

Fixes #

Copilot AI review requested due to automatic review settings June 1, 2026 20:14
@github-actions github-actions Bot added the components label Jun 1, 2026

Copilot AI left a comment

Contributor

Pull request overview

This PR refactors how aks-secure-tls-bootstrap-client is sourced for Linux images, moving away from GitHub release tarballs toward packages.microsoft.com (PMC) for Ubuntu/Azure Linux and MCR (OCI/sysext) for Flatcar/ACL, and updates Renovate ownership for related updates.

Changes:

  • Update parts/common/components.json to define distro-specific sources/versions for aks-secure-tls-bootstrap-client (PMC for Ubuntu/Azure Linux, MCR sysext for Flatcar).
  • Update VHD build dependency caching logic to use package/sysext download helpers instead of a direct tarball download.
  • Rename the “download from URL” helper in cse_install.sh for clarity and adjust its callsite; tweak Renovate assignee/reviewer rules.

Package Update Analysis: aks-secure-tls-bootstrap-client

Version change: 1.1.2 → 1.1.3 (patch update)
OS variants affected: Ubuntu 20.04/22.04/24.04, Azure Linux 3.0, Flatcar (sysext), Windows
OS variants NOT updated: Mariner (no entry / no default fallback) — causes silent skip on Mariner builds.
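
The "silent skip" risk noted above can be illustrated with a small sketch. This is not the repository's lookup logic: the function name `lookup_component_version`, the distro keys, and the version strings are all hypothetical placeholders. It only shows the idea of failing loudly when a distro has no entry and no default fallback exists.

```shell
# Hypothetical lookup: the distro keys and version strings below are
# placeholders, not the contents of parts/common/components.json.
lookup_component_version() {
  local distro="$1"
  case "${distro}" in
    ubuntu|azurelinux|flatcar)
      echo "1.1.3"
      ;;
    *)
      # No entry and no default: report the gap instead of skipping quietly.
      echo "no version entry for distro '${distro}'" >&2
      return 1
      ;;
  esac
}
```

A caller that checks the return code would then stop the build on an unmapped distro rather than producing an image without the component.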

Upstream changelog: Not evaluated here (not available in-repo). Manual validation recommended.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| vhdbuilder/packer/install-dependencies.sh | Switch aks-secure-tls-bootstrap-client handling to package/sysext download flow during VHD build. |
| parts/linux/cloud-init/artifacts/cse_install.sh | Rename the custom-URL download helper and update its caller. |
| parts/common/components.json | Move component metadata to distro-specific PMC/MCR sources and bump versions. |
| .github/renovate.json | Adjust Renovate assignees/reviewers and add a rule grouping for this component. |

Comment thread vhdbuilder/packer/install-dependencies.sh Outdated
Comment thread vhdbuilder/packer/install-dependencies.sh
Comment thread parts/linux/cloud-init/artifacts/cse_install.sh Outdated
Comment thread parts/common/components.json Outdated
Copilot AI review requested due to automatic review settings June 1, 2026 22:22

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Comment thread vhdbuilder/packer/install-dependencies.sh
Comment thread parts/common/components.json
Copilot AI review requested due to automatic review settings June 1, 2026 23:42
Copilot AI review requested due to automatic review settings June 8, 2026 21:16
@cameronmeissner cameronmeissner changed the title from "feat(linux): refactor aks-secure-tls-bootstrap-client installation to use PMC/MCR" to "feat(linux): refactor aks-secure-tls-bootstrap-client installation to use PMC/MCR and bump to 1.1.4-1" Jun 8, 2026
@cameronmeissner cameronmeissner changed the title from "feat(linux): refactor aks-secure-tls-bootstrap-client installation to use PMC/MCR and bump to 1.1.4-1" to "feat(linux): refactor aks-secure-tls-bootstrap-client installation to use PMC/MCR and bump to v1.1.4-1" Jun 8, 2026

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.

Comment thread parts/linux/cloud-init/artifacts/flatcar/cse_install_flatcar.sh
@aks-node-assistant

Contributor

AgentBaker Linux PR gate — Ubuntu 24.04 fwupd.service mass E2E failure (RECURRING main regression, NOT this PR)

  • Run: 167219726 (failed)
  • Failed task: Run AgentBaker E2E (Stage e2e → Job/Phase Run AgentBaker E2E)
  • Signature: validators.go:995: 🔴 FAIL: the following systemd units have unexpectedly entered a failed state: [fwupd.service]
  • Scope: Ubuntu 24.04 scenarios (Test_Ubuntu2404_SecureTLSBootstrapping_BootstrapToken_Fallback, Test_Ubuntu2404_NPD_Basic, and others)

This matches an active main-branch regression flagged earlier today on PR #8294 build 167206065 and re-confirmed on PR #8294 build 167221197 within the same ~1.5h window. All three runs share the same [fwupd.service] failed-unit signature across unrelated PRs (node-exporter bump, this STLS client refactor, etc.).

Build-vs-test: product/VHD regression caught by E2E (NOT a flake, NOT test-code).
This PR's exposure check: changes refactor aks-secure-tls-bootstrap-client install to PMC/MCR; the failing validator is the systemd-unit health check, not STLS install. STLS tests in this run failed because the post-install systemd-units validator trips on fwupd.service before STLS-specific assertions could differentiate. No evidence the PR introduced or worsened the fwupd state.
Confidence: HIGH that PR #8618 is not the cause; HIGH that this is a 24.04 VHD main regression around fwupd.service.
Strongest alternative (less likely): STLS PMC/MCR refactor altering boot-time package install order and breaking fwupd.service first-start — refuted: the same signature reproduces on PRs that don't touch STLS or package install order.

Recommended next action / owner: NodeSIG-dev — bisect main since the last green 24.04 E2E for anything touching fwupd or systemd unit enablement in vhdbuilder/packer/install-dependencies.sh / tool_installs_distro.sh. Likely mitigation: mask fwupd.service in the 24.04 VHD or fix the first-start dependency. PR author: do NOT block merge on this; rerun once the main fix lands. If you want to be extra safe, rebase once the fix is in to confirm 24.04 E2E goes green for this PR's diff.
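
For readers unfamiliar with this kind of validator, the failed-unit check can be approximated in shell as below. This is only a sketch: the real check lives in the Go E2E validators, and the function names `failed_units_from_listing` and `assert_no_failed_units` are hypothetical. It assumes the input is the output of `systemctl list-units --state=failed --plain --no-legend`, where the unit name is the first column.

```shell
# Extract unit names (first whitespace-separated column) from a
# `systemctl list-units --state=failed --plain --no-legend` listing on stdin.
failed_units_from_listing() {
  awk 'NF > 0 { print $1 }'
}

# Hypothetical check: succeed only if the listing on stdin names no failed units.
assert_no_failed_units() {
  local units
  units="$(failed_units_from_listing)"
  if [ -n "${units}" ]; then
    echo "failed units: $(echo "${units}" | tr '\n' ' ')" >&2
    return 1
  fi
}

# On a real node (not executed here):
#   systemctl list-units --state=failed --plain --no-legend | assert_no_failed_units
```

The mitigation named in the comment would correspond to running `systemctl mask fwupd.service` during the VHD build, which keeps the unit from starting and therefore from entering the failed state.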

Posted by Clawpilot AgentBaker gate detective.

@aks-node-assistant

Contributor

AgentBaker Linux PR gate — Ubuntu 24.04 fwupd.service mass E2E failure (STILL the recurring main regression, NOT this PR)

  • Run: 167238023 (failed) — new commit df88bc2 bumping STLS client to v1.1.4-1
  • Failed task: Run AgentBaker E2E (Stage e2e → Job/Phase Run AgentBaker E2E)
  • Test summary: DONE 438 tests, 95 skipped, 17 failures in 1666.129s
  • Primary signature: validators.go:995: 🔴 FAIL: the following systemd units have unexpectedly entered a failed state: [fwupd.service] (6 hits across this run)

Failing scenarios (all Ubuntu 24.04 except one):

  • Test_LocalDNSHostsPlugin/Ubuntu2404/{default,scriptless_nbc}
  • Test_Ubuntu2404_SecureTLSBootstrapping_BootstrapToken_Fallback/default
  • Test_Ubuntu2404_CSE_CachedPerformance/default
  • Test_Ubuntu2404_CSE_FullInstallPerformance/default
  • Test_Ubuntu2404Gen2/default
  • Test_Ubuntu2404Gen2_McrChinaCloud/scriptless_nbc
  • Test_Ubuntu2204Gen2_ImagePullIdentityBinding_NetworkIsolated/{default,scriptless_nbc} ← separate ongoing NetworkIsolated infra/fixture issue, not fwupd

This is the same fwupd.service 24.04 main regression previously flagged on builds 167206065, 167219726, and 167221197. New STLS commit landed but failure shape and scope are unchanged.

Build-vs-test: product/VHD regression caught by E2E (NOT a flake, NOT test-code, NOT STLS-related).
This PR's exposure check: STLS install moved to PMC/MCR + bumped to v1.1.4-1. The failing validator is the generic systemd-unit health check tripping on fwupd.service before STLS-specific assertions run; STLS BootstrapToken_Fallback failure is downstream of that pre-condition, not STLS install logic. No new failure modes introduced by the bump.
Confidence: HIGH that PR #8618 is not the cause; HIGH that this is a 24.04 VHD main regression around fwupd.service; the NetworkIsolated 22.04 failure is a separate known infra issue.
Strongest alternative (less likely): STLS PMC/MCR refactor altering boot-time package install order and breaking fwupd.service first-start — refuted: identical signature reproduces on unrelated PRs (renovate node-exporter #8294) on the same main HEAD; scope is strictly 24.04.

Recommended next action / owner: NodeSIG-dev — main-branch fix still pending. Likely mitigation: mask fwupd.service in 24.04 VHD or fix the first-start dependency in vhdbuilder/packer/install-dependencies.sh / tool_installs_distro.sh. PR author: do NOT block merge on this; rebase + rerun once the main fix lands to confirm a clean 24.04 leg.

Posted by Clawpilot AgentBaker gate detective.

Copilot AI review requested due to automatic review settings June 9, 2026 21:12

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 2 comments.

Comment thread parts/common/components.json
Comment thread e2e/node_config.go
@aks-node-assistant

Contributor

AgentBaker Linux PR gate — Mass 173-failure run: shared kubenet-v5 cluster's MC resource group was deleted mid-run (test-infra, NOT this PR)

  • Run: 167378787 (failed) — commit c1e52cb
  • Failed task: Run AgentBaker E2E
  • Test summary: DONE 402 tests, 95 skipped, 173 failures in 1276.143s (~43% failure rate; 0 fwupd hits, so NOT the 24.04 main regression)

Exact failure signature (identical across essentially every failing scenario):

PUT .../resourceGroups/MC_abe2e-westus3_abe2e-kubenet-v5-150ee_westus3/.../virtualMachineScaleSets/<name>
RESPONSE 409: 409 Conflict
ERROR CODE: ResourceGroupBeingDeleted
"The resource group 'MC_abe2e-westus3_abe2e-kubenet-v5-150ee_westus3' is in deprovisioning state and cannot perform this operation."

Every failing scenario across all distros (Ubuntu 22.04/24.04, AzureLinuxV3/V2, ACL, ARM64, OSGuard) fails in <2s at the VMSS create step because the shared managed-cluster resource group MC_abe2e-westus3_abe2e-kubenet-v5-150ee_westus3 is being torn down. Scenarios that don't target the v5 cluster pass cleanly.

Three-level analysis:

  1. L1: 409 ResourceGroupBeingDeleted on MC_abe2e-westus3_abe2e-kubenet-v5-150ee_westus3 at VMSS PUT time, ~1.7s into each scenario.
  2. L2 corroboration: ~105 "Received unexpected error" hits, all with the same RG-being-deleted signature; failures span every distro family in the test matrix; failures happen before any node provisioning is attempted; PR #8618 (STLS client PMC/MCR refactor) doesn't touch test cluster lifecycle/teardown. Cross-PR pattern in the same window: a similar mass cluster-not-available signature was seen on prior runs of PR #8453 (test: (scriptless) Enable scriptless phase 3 in AB e2es) against the same v5 cluster pool.
  3. L3 challenge (alternatives): (a) PR-caused: STLS PMC/MCR refactor causing cluster RG deletion → no plausible path; the RG deletion is an Azure RM operation initiated outside the test process. (b) Azure RM transient on a per-resource basis → refuted; it's an explicit ResourceGroupBeingDeleted deprovisioning state on the cluster's RG, not random throttling. (c) A prior gate run or cluster lifecycle automation deleted the shared abe2e-kubenet-v5-150ee cluster right as this run started (most likely). The strongest alternative is (c), and it's the same root cause class as the recurring kubenet-v5 cluster instability flagged on PR #8453.

Build-vs-test: test-infra (shared cluster pool lifecycle), NOT product, NOT PR-caused.
Confidence: HIGH that PR #8618 is not the cause.

Recommended next action / owner: E2E infra / NodeSIG-dev. The abe2e-kubenet-v5-150ee shared cluster's MC RG was deleted while this run was using it. Either (a) the cluster pool cleanup automation needs an "in-use" check against active builds, (b) per-build clusters should be used for runs that hit the v5 path, or (c) the gate should retry/wait on ResourceGroupBeingDeleted and route to a fresh cluster. This is the same recurring kubenet-v5 pool stability issue flagged earlier today on PR #8453. PR author: do NOT block merge on this; rerun once the v5 cluster is restored.
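
Option (c) above could be sketched roughly as follows. This is an illustration only: the E2E harness is written in Go, and the helper names `is_rg_being_deleted` and `create_with_reroute` are hypothetical. The only detail taken from the run is the `ResourceGroupBeingDeleted` error code; the wrapper simply classifies a failed command so a caller could route to a fresh cluster instead of failing outright.

```shell
# Returns 0 when the error text carries the ARM error code seen in the run.
is_rg_being_deleted() {
  case "$1" in
    *ResourceGroupBeingDeleted*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical wrapper: run the given command and print one of
# "ok", "reroute" (RG is deprovisioning, pick another cluster), or "fail".
create_with_reroute() {
  local output
  if output="$("$@" 2>&1)"; then
    echo "ok"
  elif is_rg_being_deleted "${output}"; then
    echo "reroute"
  else
    echo "fail"
    return 1
  fi
}
```

A caller would wrap its VMSS create command with `create_with_reroute` and, on `reroute`, select a different cluster before retrying. How a fresh cluster is chosen is left out because the source does not describe it.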

Posted by Clawpilot AgentBaker gate detective.

Comment thread .github/renovate.json
Comment thread parts/linux/cloud-init/artifacts/azlosguard/azurelinux-ms-oss.repo
@cameronmeissner cameronmeissner merged commit 633b13e into main Jun 9, 2026
23 of 41 checks passed
@cameronmeissner cameronmeissner deleted the cameissner/stls-client-dalec-linux branch June 9, 2026 23:56
@aks-node-assistant

Contributor

AgentBaker Linux PR gate — Test, Scan, and Cleanup fails on every VHD: VM-side test script can't fetch refs/pull/8618/merge (PR-merge-ref race, NOT PR-code regression)

  • Run: 167401158 (failed) — commit ece2ca4
  • Failed tasks: Test, Scan, and Cleanup across 10 VHD jobs (every distro build's post-VHD scan)
  • Class: VHD test/scan (not E2E)

Exact first-failure signature (identical across every distro):

Cloning AgentBaker repo and checking out remote branch 'refs/pull/8618/merge' into local branch 'refs-pull-8618-merge'
[stderr]
fatal: couldn't find remote ref refs/pull/8618/merge
git-clone:Error: Failed to fetch remote branch 'refs/pull/8618/merge' into local branch 'refs-pull-8618-merge'
git-clone:Error: Used command 'git fetch --quiet origin refs/pull/8618/merge:refs-pull-8618-merge'
Tests failed.
run-test failed 2 times
./vhdbuilder/packer/test/run-test.sh exited with code 1

The VHD test/scan VM (run-command extension) is trying to git fetch origin refs/pull/8618/merge:refs-pull-8618-merge against github.com/Azure/AgentBaker, and GitHub returns "couldn't find remote ref". This happens on every VHD distro job because they all hit the same VM-side script and the same upstream ref.

Three-level analysis:

  1. L1: git fetch refs/pull/8618/merge returns "couldn't find remote ref" → run-test exits 1 → Test, Scan, and Cleanup exits 2 on every distro.
  2. L2 corroboration: Build VHD itself succeeded for every distro (no Packer/CSE/CIS failure here); failure is strictly in the post-VHD on-VM test that re-clones the PR. PR #8618 has been force-pushed multiple times today (commits df88bc2, 11f5b0f, eff2c18 → now ece2ca4). GitHub's refs/pull/8618/merge is recomputed on each push and can be temporarily missing during force-push cycles, especially if the merge result has conflicts at the moment of fetch. CSE/cluster-prep/E2E are unaffected: the Run AgentBaker E2E task is not in the failed-task list here.
  3. L3 challenge (alternatives): (a) PR's STLS PMC/MCR refactor making refs/pull/8618/merge invalid: no; refs/pull/N/merge is a GitHub-server computed ref, not something the PR's code can touch. (b) GitHub-side outage: would have affected concurrent PRs' run-tests too; this only fails on PR #8618. (c) The merge ref was unavailable at fetch time because the PR was being force-pushed/recomputed (most likely). The build was triggered by source SHA ece2ca4, but the VM fetch happens later and resolves the floating refs/pull/8618/merge ref at fetch time, which is racy.

Build-vs-test: test-infra / harness coupling to GitHub's transient merge-ref state. The vhdbuilder/packer/test/run-test.sh VM-side checkout should pin to the specific PR head SHA ($SYSTEM_PULLREQUEST_SOURCECOMMITID) instead of the floating refs/pull/N/merge ref.
Confidence: HIGH that this is the merge-ref-race class, NOT a STLS refactor regression.
Strongest alternative (less likely): STLS PMC/MCR refactor changed the test runner to require something only present in the merge ref — refuted: the failure is at git fetch time, before any test-runner code from the PR is executed.

Recommended next action / owner: PR author can simply rerun the build once GitHub has settled the merge ref. Permanent fix (NodeSIG-dev / AgentBaker test-infra): in vhdbuilder/packer/test/run-test.sh and the run-command payload, replace the git fetch origin refs/pull/$PR/merge:... with a fetch of the explicit source-commit SHA passed by the pipeline ($SYSTEM_PULLREQUEST_SOURCECOMMITID or build's Build.SourceVersion). This eliminates the race entirely and matches what other Azure DevOps pipelines do. Do NOT block merge on this; rerun likely passes.
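
The pinning suggested above might look roughly like the following sketch. The helper name `resolve_checkout_ref` and the `PR_NUMBER` variable are hypothetical; only `$SYSTEM_PULLREQUEST_SOURCECOMMITID` and the `refs/pull/N/merge` ref come from the comment. It prefers an explicit commit SHA and falls back to the floating merge ref only when no SHA is supplied.

```shell
# Hypothetical helper: choose what the VM-side checkout should fetch.
# $1 = PR number, $2 = source commit SHA (may be empty).
resolve_checkout_ref() {
  local pr_number="$1"
  local source_sha="$2"
  if [ -n "${source_sha}" ]; then
    # A pinned SHA does not depend on GitHub recomputing the merge ref.
    echo "${source_sha}"
  else
    # Fallback: the floating merge ref, which can be briefly missing.
    echo "refs/pull/${pr_number}/merge"
  fi
}

# Example wiring (not executed here). Fetching by SHA assumes the full
# commit ID is passed and that the server permits fetching that commit.
#   ref="$(resolve_checkout_ref "${PR_NUMBER}" "${SYSTEM_PULLREQUEST_SOURCECOMMITID:-}")"
#   git fetch --quiet origin "${ref}" && git checkout --quiet FETCH_HEAD
```

One trade-off worth noting: pinning to the PR head tests the PR's own commit rather than the merge result with main, so the two approaches do not validate exactly the same tree.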

Posted by Clawpilot AgentBaker gate detective.

Labels

components This pull request updates cached components on Linux or Windows VHDs

4 participants