fix(vhd): mask fwupd on Ubuntu 24.04 to unblock E2E PR gate (AB#38355676) #8662

Closed
djsly wants to merge 2 commits into main from sylvainboily_microsoft/38355676

Conversation

@djsly djsly commented Jun 9, 2026

Collaborator

What this PR does

Masks fwupd.service, fwupd-refresh.service, and fwupd-refresh.timer during VHD build for Ubuntu 24.04, and adds a VHD-content test asserting it stays masked.

Why

Starting around 2026-06-08 ~21:15 UTC, every Ubuntu 24.04 E2E scenario in pipeline 119535 (AKS Linux VHD Build - PR check-in gate) started failing deterministically with:

validators.go:995: 🔴 FAIL: the following systemd units have unexpectedly entered a failed state: [fwupd.service]
- failed unit logs will be included in scenario log bundle within <service-name>.service.log

The failure is distro-scoped to Ubuntu 24.04 and hits every 2404 scenario in unrelated PRs (~20 leaf failures/build). AzureLinux and Ubuntu 22.04 SKUs are NOT affected.

Affected builds

| Build | PR |
| --- | --- |
| 167206065 | #8294 (node-exporter bump) |
| 167206071 | #8600 (kubelet/kubectl bump) |
| 167206039 | #8642 (secondary NICs) |
| 167206091 | #8652 (runc/containerd bump) |

Sample detective comment: #8642 (comment)

RCA (one paragraph)

fwupd (firmware-update daemon) is installed by the Ubuntu 24.04 cloud image and fwupd.service is enabled by default. On Azure VMs (Hyper-V Gen2) the daemon has no usable firmware-update surface — node firmware is managed out-of-band by the Azure host — and the recent fwupd version included in the rolled-up Ubuntu archive snapshot (likely via the 2026-05-24 security-patch refresh in #8582) exits non-zero at boot. The existing ValidateNoFailedSystemdUnits validator (e2e/validators.go:995) already allowlists the sibling fwupd-refresh.service (see line 936) but not fwupd.service itself, so every 24.04 scenario trips. Sample test-log entries from build 167206065:

{"Test":"Test_Ubuntu2404_Scriptless/default","Output":"validators.go:995: [231.522s] 🔴 FAIL: ... [fwupd.service] ..."}
{"Test":"Test_Ubuntu2404Gen2/default","Output":"validators.go:995: [318.300s] 🔴 FAIL: ... [fwupd.service] ..."}
{"Test":"Test_Ubuntu2404Gen2_McrChinaCloud/default","Output":"validators.go:995: [346.689s] 🔴 FAIL: ... [fwupd.service] ..."}

Fix and why (option chosen: mask at VHD-build time, not validator allowlist)

We chose to mask the units rather than allowlist fwupd.service in the E2E validator because:

  1. fwupd genuinely has no role on AKS Linux nodes — Azure host manages firmware.
  2. Allowlisting a failing daemon hides the symptom on every future Ubuntu daemon regression.
  3. Masking matches the established NodeSIG pattern for stop-phone-home / unused-on-AKS units (e.g. the apt-daily masking 10 lines above in the same file, and #8638, "fix: make Ubuntu Pro inert on 20.04/FIPS VHDs to stop phone-home (AB#38255910)").
  4. systemctl mask (vs. disable) survives systemctl preset-all and any reinstall of fwupd.
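
To make point 4 concrete: on disk, masking a unit amounts to a symlink from the unit name under /etc/systemd/system to /dev/null, and that link takes precedence over the unit file the package ships under /usr/lib/systemd/system. Reinstalling fwupd rewrites the packaged file but leaves the link in place. The sketch below only mimics that layout in a scratch directory; `is_masked_link` is a hypothetical helper for illustration, not something in this repo.

```shell
#!/usr/bin/env bash
# Illustration only: mimic what `systemctl mask` leaves on disk, inside a
# scratch directory rather than the real /etc/systemd/system.

# Hypothetical helper: succeeds if the given path is a symlink to /dev/null.
is_masked_link() {
  [ -L "$1" ] && [ "$(readlink "$1")" = "/dev/null" ]
}

scratch="$(mktemp -d)"
trap 'rm -rf "$scratch"' EXIT

ln -s /dev/null "${scratch}/fwupd.service"   # the on-disk shape of a masked unit
: > "${scratch}/fwupd-refresh.service"       # a regular unit file, not masked

for unit in fwupd.service fwupd-refresh.service; do
  if is_masked_link "${scratch}/${unit}"; then
    echo "${unit}: masked"
  else
    echo "${unit}: not masked"
  fi
done
```

On a real node the equivalent check would be `readlink /etc/systemd/system/fwupd.service`, which should print `/dev/null` once the unit is masked.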

Changes

| File | Summary |
| --- | --- |
| vhdbuilder/packer/install-dependencies.sh | In the existing Ubuntu-only block (next to the apt-daily mask), added an OS_VERSION = "24.04" guard that runs systemctl mask and systemctl disable --now on fwupd.service, fwupd-refresh.service, and fwupd-refresh.timer. Trailing `\|\| true` because the units only exist when fwupd is installed (the 24.04 cloud-image default, but not guaranteed on minimal/future SKUs). |
| vhdbuilder/packer/test/linux-vhd-content-test.sh | Added testFwupdMaskedOnUbuntu2404 (mirrors the existing testNfsServerService pattern) and wired it into the test dispatch right after testNfsServerService. The test treats masked as pass, treats absent/not-found as pass (variant doesn't ship fwupd), and fails on any other state, so a future regression is caught at VHD-build time rather than at E2E. |
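
The pass/fail rule described for testFwupdMaskedOnUbuntu2404 can be sketched as below. This is not the code from the PR: the function names are hypothetical, and it assumes an absent unit shows up either as `not-found` or as empty output from `systemctl is-enabled`, which may vary by systemd version.

```shell
#!/usr/bin/env bash
# Sketch of the verdict rule: masked -> pass, absent -> pass, anything else -> fail.
# Function names are hypothetical; the real test lives in linux-vhd-content-test.sh.

# Map a unit-file state string to a verdict.
# Assumption: an absent unit is reported as "not-found" or as empty output.
fwupd_state_verdict() {
  case "$1" in
    masked)        echo "pass" ;;   # the state the VHD build is meant to leave
    not-found|"")  echo "pass" ;;   # this variant does not ship fwupd
    *)             echo "fail" ;;   # enabled, disabled, static, ...
  esac
}

# Check one unit on a live system (requires systemd; defined but not called here).
check_fwupd_unit() {
  local unit="$1" state
  # is-enabled exits non-zero for masked units, so ignore its exit status.
  state="$(systemctl is-enabled "$unit" 2>/dev/null || true)"
  echo "${unit}: state='${state}' verdict=$(fwupd_state_verdict "$state")"
}
```

On a built VHD this could be driven with `for u in fwupd.service fwupd-refresh.service fwupd-refresh.timer; do check_fwupd_unit "$u"; done`. Keeping the classification separate from the `systemctl` call makes the rule checkable without a running systemd.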

Tests run locally

This change is in vhdbuilder/packer/ (not parts/ or pkg/), so make generate snapshot regen is not triggered and was not run. The Windows agent this PR was authored from has no go, shellcheck, or docker available, so unit-level lint/build was not executed locally — the ADO CI on this PR will exercise:

  • vhdbuilder/packer/ build via the linux-vhd-build pipeline
  • The new testFwupdMaskedOnUbuntu2404 assertion will execute as part of the in-VHD linux-vhd-content-test.sh suite during the build
  • E2E will exercise the validator — if the mask works, no 2404 scenario should report fwupd.service as a failed unit

Scope / blast radius

  • Scoped to Ubuntu 24.04 only ([ "$OS_VERSION" = "24.04" ] guard).
  • No change to AzureLinux, Ubuntu 22.04, Ubuntu 20.04, FIPS, or any Windows path.
  • No change to e2e/validators.go — the validator stays strict, which is what we want.
  • No change to .pipelines/.vsts-vhd-builder.yaml or any other pipeline definition.
  • The chosen || true falls back gracefully if a future minimal SKU does not ship fwupd.

Tracking

Note on commit signatures

Both commits in this PR were authored from a sandboxed Windows agent without a local GPG key, and the GitHub Contents API used for the push did not auto-sign with the web-flow key (this happens for some PAT-authenticated requests on org repos). Per CONTRIBUTING.md the recommended remediation is git commit --amend -S + git push --force-with-lease from a GPG-equipped machine; I (Sylvain) will do that before merge. The diffs themselves are intentionally minimal and easy to review without rebasing.

Do NOT auto-merge

This PR is deliberately not marked auto-complete — reviewer eyes on the masking decision are wanted before this lands.

cc @djsly @cameronmeissner @Devinwong @lilypan26 @r2k1

djsly and others added 2 commits June 9, 2026 14:02
fwupd ships in the Ubuntu 24.04 cloud image and tries to start on boot.
On AKS Linux nodes there is no firmware to manage -- firmware on Azure
VMs is handled out-of-band by the host -- and recent fwupd releases on
24.04 exit non-zero, which trips the ValidateNoFailedSystemdUnits E2E
validator (e2e/validators.go:995) on every Ubuntu 2404 scenario in the
PR check-in gate (pipeline 119535).

Mask fwupd.service, fwupd-refresh.service, and fwupd-refresh.timer
during VHD build (Ubuntu 24.04 only) in
vhdbuilder/packer/install-dependencies.sh, mirroring the apt-daily
masking pattern in the same Ubuntu block. Masking (vs. disabling)
prevents systemctl preset-all from re-enabling these units.

AB#38355676

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Add testFwupdMaskedOnUbuntu2404 to linux-vhd-content-test.sh so that
any future regression in the fwupd masking applied by
vhdbuilder/packer/install-dependencies.sh is caught at VHD-build time
rather than reaching the E2E gate.

AB#38355676

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment

Contributor

Pull request overview

Masks fwupd-related systemd units during the Ubuntu 24.04 VHD build to prevent fwupd.service from entering a failed state on boot (which trips the E2E ValidateNoFailedSystemdUnits validator), and adds a VHD-content test to ensure the units remain masked.

Changes:

  • Mask/disable fwupd.service, fwupd-refresh.service, and fwupd-refresh.timer during Ubuntu 24.04 VHD build.
  • Add linux-vhd-content-test.sh coverage asserting those units are masked (or absent) on Ubuntu 24.04.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| vhdbuilder/packer/install-dependencies.sh | Adds Ubuntu 24.04-specific masking/disabling of fwupd units during VHD build. |
| vhdbuilder/packer/test/linux-vhd-content-test.sh | Adds and wires a VHD-content test that asserts fwupd units are masked (or not present) on Ubuntu 24.04. |

Comment on lines +81 to +87
# `|| true` because the units only exist when fwupd is installed (24.04 cloud image
# default; not guaranteed on minimal or future SKUs) and `mask` against a non-existent
# unit can fail under newer systemd.
if [ "$OS_VERSION" = "24.04" ]; then
    systemctl mask fwupd.service fwupd-refresh.service fwupd-refresh.timer || true
    systemctl disable --now fwupd.service fwupd-refresh.service fwupd-refresh.timer 2>/dev/null || true
fi

@aks-node-assistant

Contributor

AgentBaker Linux PR gate — Build VHD fails on every distro: CRLF line endings in install-dependencies.sh (PR-caused, high confidence)

  • Run: 167320998 (failed)
  • Failed stage/tasks: build stage — Build VHD failed on every distro phase (build2204gen2containerd, build2204arm64gen2containerd, build2404gen2containerd, build2004fipsgen2containerd, buildAzureLinuxV3gen2, buildAzureLinuxV3ARM64gen2fips, buildAzureLinuxOSGuardV3gen2fipsTL, buildacltlgen2, buildaclarm64tlgen2, buildaclfipstlgen2, buildaclarm64fipstlgen2). Class: VHD/Packer build, not E2E.

Exact first-failure signature (Packer shell provisioner, immediately after pre_install_dependencies stage):

==> azure-arm: + set -euo $'pipefail\r'
==> azure-arm: : invalid option name
==> azure-arm: Provisioning step had errors: Running the cleanup provisioner, if present...
==> azure-arm: Build 'azure-arm' errored after N minutes: Script exited with non-zero exit status: 2.

The $'pipefail\r' is the smoking gun: the provisioner script (the next big one after pre_install_dependencies — i.e. install-dependencies.sh) has Windows CRLF (\r\n) line endings. set -euo pipefail<CR> is parsed by bash as the option literal pipefail\r, which is rejected with "invalid option name". This kills the script on the very first executable line, so no installs run and Packer aborts before producing an artifact.

Three-level analysis:

  1. L1 surface: Packer azure-arm builder exits 2 on the first provisioner after pre_install_dependencies; bash rejects $'pipefail\r'.
  2. L2 corroboration: identical failure on every distro phase (Ubuntu 22.04/24.04/20.04 FIPS, AzureLinuxV3, ACL FIPS/TL/ARM64, etc.), so it is distro-independent and not a packaging/CIS issue. The pre_install_dependencies stage succeeds, so the failure is specifically in the next script. PR #8662 (this PR) touches vhdbuilder/packer/install-dependencies.sh and vhdbuilder/packer/test/linux-vhd-content-test.sh: the file matches and the timing matches.
  3. L3 challenge: alternatives — (a) generic Packer regression: refuted, prior unmodified PR runs against the same main succeeded; (b) linux-vhd-content-test.sh CRLF: also possible but content-test runs in Test, Scan, and Cleanup later — the Build VHD failure here is install-dependencies.sh's first line; (c) the masking command itself being wrong: irrelevant because bash never gets to execute it (script dies on set -euo pipefail). Strongest alternative: only linux-vhd-content-test.sh is CRLF — less likely because Build VHD fails before that script ever runs.

Build-vs-test: build/VHD regression introduced by this PR.
Confidence: HIGH that PR #8662 (this PR) is the cause.

Recommended next action / owner: PR author (Sylvain) — re-save vhdbuilder/packer/install-dependencies.sh (and vhdbuilder/packer/test/linux-vhd-content-test.sh) with LF-only line endings, then force-push. Quick local check:

file vhdbuilder/packer/install-dependencies.sh
# should say: "ASCII text" or "Bourne-Again shell script, ASCII text"
# NOT "ASCII text, with CRLF line terminators"

And/or:

dos2unix vhdbuilder/packer/install-dependencies.sh vhdbuilder/packer/test/linux-vhd-content-test.sh
git add -u && git commit --amend --no-edit && git push --force-with-lease

Also worth confirming .gitattributes keeps *.sh text eol=lf so future Windows edits don't reintroduce CRLF.
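
If `file` or `dos2unix` is not available on the authoring machine, a carriage-return scan along these lines could serve as a pre-push check. It is a sketch only: `has_crlf` and `check_lf_only` are hypothetical names, not existing repo tooling, and it assumes bash and a `grep` that accepts a literal CR as its pattern.

```shell
#!/usr/bin/env bash
# Hypothetical guard: report shell scripts that contain carriage returns.

# Succeeds if the file contains at least one CR byte.
has_crlf() {
  grep -q $'\r' "$1"
}

# Print each offending file; return non-zero if any were found.
check_lf_only() {
  local rc=0 f
  for f in "$@"; do
    if has_crlf "$f"; then
      echo "CRLF line endings: $f"
      rc=1
    fi
  done
  return "$rc"
}
```

Usage would look like `check_lf_only vhdbuilder/packer/install-dependencies.sh vhdbuilder/packer/test/linux-vhd-content-test.sh`, which prints nothing and returns 0 when both files are LF-only.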

Note: with this fix in flight, this PR is the proposed mitigation for the recurring Ubuntu 24.04 fwupd.service E2E mass-failure flagged on 8+ recent gate runs; getting this through is the unblock for the whole gate.

Posted by Clawpilot AgentBaker gate detective.

@djsly

djsly commented Jun 10, 2026

Collaborator Author

not needed, disabling phasing instead

@djsly djsly closed this Jun 10, 2026