
feat(anc): support a base-to-version hotfix map in downloadHotfix #8694

Draft
Devinwong wants to merge 5 commits into main from devinwong/laughing-pancake

Devinwong commented Jun 11, 2026

Provisioning-Hotfix groundwork - Milestone 1 (POC draft)

POC / M1 DRAFT. This PR contains 2.1a only. The check-hotfix subcommand (2.1b) is implemented separately and stacked on top of this branch.

Background

A "hotfix" is a small change to the ANC binary or a few /opt/azure/containers/*.sh scripts, shipped as one PMC package and applied at node provisioning time on top of VHD-baked content. VMSS model customdata can go stale on scale-out, autoscaler, or reimage, so new VMs may boot with buggy code. The broader design adds a second pointer channel the node reads at boot; this PR lays the AgentBaker (Node SIG) groundwork.

What 2.1a does

Extends aks-node-controller/hotfix.go:

  • hotfixConfig gains Hotfixes map[string]string mapping an ANC version base (YYYYMM.DD) to a hotfix version (YYYYMM.DD.PATCH), so one config can pin hotfixes for multiple VHD bases at once, with default-deny for any base not listed.
  • resolveVersion() applies the map first, then falls back to the legacy single version field when the map is empty (full backward compatibility).
  • hotfixBaseFromVersion() splits on . (not semver) to preserve the leading-zero day so map keys match exactly (e.g. 202604.01).
  • readHotfixConfig() parses the shared config shape.
  • downloadHotfix() resolves via the map and still gates through the unchanged shouldUpgradeToHotfix() "same base, patch strictly higher" semantics.
  • All resolution paths fail open: an unreadable/invalid config, an unparseable current version, or a non-matching base results in no hotfix applied rather than a provisioning failure.
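The resolution order described in the list above can be sketched as follows. This is a minimal illustration, not the code in hotfix.go: the type and function names come from this description, but the bodies, the JSON tags, and the rule that a version must have exactly three non-empty segments are assumptions made for the sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// hotfixConfig mirrors the config shape shown in the examples.
// The field layout and JSON tags are assumptions for this sketch.
type hotfixConfig struct {
	Version  string            `json:"version"`
	Hotfixes map[string]string `json:"hotfixes"`
}

// hotfixBaseFromVersion returns the YYYYMM.DD base of a YYYYMM.DD.PATCH
// string, or "" when the input is malformed. Splitting on "." (instead of
// parsing as semver) keeps the leading zero in a day such as "01".
// Assumption: exactly three non-empty segments are required.
func hotfixBaseFromVersion(v string) string {
	parts := strings.Split(v, ".")
	if len(parts) != 3 {
		return ""
	}
	for _, p := range parts {
		if p == "" {
			return "" // e.g. "202604.01." has an empty patch segment
		}
	}
	return parts[0] + "." + parts[1]
}

// resolveVersion applies the map first and falls back to the legacy
// single version only when the map is empty. An unlisted base or an
// unparseable current version yields "" (no hotfix).
func (c hotfixConfig) resolveVersion(current string) string {
	if len(c.Hotfixes) == 0 {
		return c.Version
	}
	base := hotfixBaseFromVersion(current)
	if base == "" {
		return ""
	}
	return c.Hotfixes[base] // "" when the base is not listed (default-deny)
}

func main() {
	cfg := hotfixConfig{Hotfixes: map[string]string{
		"202604.01": "202604.01.3",
		"202605.30": "202605.30.1",
	}}
	fmt.Printf("%q\n", cfg.resolveVersion("202604.01.0")) // "202604.01.3"
	fmt.Printf("%q\n", cfg.resolveVersion("202603.15.0")) // ""
}
```

Note that the resolved value is only a candidate; it still has to pass the shouldUpgradeToHotfix() gate before anything is installed.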

Net effect (examples)

The node's baked-in ANC reports its own version (e.g. 202604.01.0). downloadHotfix resolves the config against that version and only upgrades when there is a higher patch for the same base.

Example 1 - map targets this node's base, higher patch -> upgrade

// config
{ "hotfixes": { "202604.01": "202604.01.3", "202605.30": "202605.30.1" } }
| Baked ANC | Matched entry | Action |
| --- | --- | --- |
| 202604.01.0 | 202604.01 -> 202604.01.3 | download + install 202604.01.3 |

Example 2 - base not listed -> no-op (default-deny)

{ "hotfixes": { "202605.30": "202605.30.1" } }
| Baked ANC | Matched entry | Action |
| --- | --- | --- |
| 202604.01.0 | none | skip (no hotfix) |

Example 3 - patch not higher -> no-op

{ "hotfixes": { "202604.01": "202604.01.0" } }
| Baked ANC | Matched entry | Action |
| --- | --- | --- |
| 202604.01.0 | 202604.01 -> 202604.01.0 | skip (not strictly higher) |

Example 4 - legacy single-version config (unchanged behavior)

{ "version": "202604.01.3" }

Resolves exactly as before: upgrade only if 202604.01.3 is a higher patch of the same base as the baked ANC.
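The "same base, patch strictly higher" gate that decides the Action column in these examples can be sketched like this. The function shouldUpgrade below is a hypothetical stand-in for shouldUpgradeToHotfix(), whose real body is not shown in this PR; the parsing rules are assumptions.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// splitVersion breaks YYYYMM.DD.PATCH into its base and numeric patch.
// Assumption: exactly three non-empty segments with a non-negative patch.
func splitVersion(v string) (base string, patch int, ok bool) {
	parts := strings.Split(v, ".")
	if len(parts) != 3 || parts[0] == "" || parts[1] == "" {
		return "", 0, false
	}
	n, err := strconv.Atoi(parts[2])
	if err != nil || n < 0 {
		return "", 0, false
	}
	return parts[0] + "." + parts[1], n, true
}

// shouldUpgrade (hypothetical) reports whether candidate has the same base
// as current and a strictly higher patch. Anything unparseable is a no-op.
func shouldUpgrade(current, candidate string) bool {
	curBase, curPatch, ok1 := splitVersion(current)
	newBase, newPatch, ok2 := splitVersion(candidate)
	return ok1 && ok2 && curBase == newBase && newPatch > curPatch
}

func main() {
	baked := "202604.01.0"
	fmt.Println(shouldUpgrade(baked, "202604.01.3")) // true: Example 1 and Example 4
	fmt.Println(shouldUpgrade(baked, "202604.01.0")) // false: Example 3, patch not higher
	fmt.Println(shouldUpgrade(baked, "202605.30.1")) // false: different base
}
```

Because the gate compares bases itself, a map entry whose value points at a different base than its key is rejected here even though the key matched.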

Files changed

  • aks-node-controller/hotfix.go
  • aks-node-controller/hotfix_test.go

Test results

go build ./... passes, and all pure-logic unit tests pass. Two tests (TestDownloadHotfix_MatchingBaseUpgrades, TestDownloadHotfix_MapMatchingBaseUpgrades) fail only on Windows, because they exercise package-manager detection that needs /etc/os-release, bash, and Unix file permissions. They pass in Linux CI, and the Windows failures are unrelated to this change.

…tfix M1 2.1a)

Adds a base (YYYYMM.DD) -> hotfix version (YYYYMM.DD.PATCH) map to the ANC
hotfix config so a single config can pin hotfixes for multiple VHD bases at
once, with default-deny for unlisted bases. The legacy single 'version' field
is still honored when the map is empty for full backward compatibility.

- hotfixConfig gains Hotfixes map; resolveVersion() applies map-first then
  legacy fallback; hotfixBaseFromVersion() splits on '.' to preserve the
  leading-zero day so map keys match exactly.
- readHotfixConfig() added; readHotfixVersion() retained via it.
- downloadHotfix() resolves via the map and still gates through the unchanged
  shouldUpgradeToHotfix() patch-only-strictly-higher semantics.
- Unparseable current version with a map present fails open (no hotfix).

Part of the Provisioning-Hotfix / live-patching-controller ConfigMap design (M1).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 11, 2026 23:50
@Devinwong Devinwong changed the title feat(anc): provisioning-hotfix M1 - base to version hotfix map (2.1a) feat(anc): provisioning-hotfix M1 - base to version hotfix map (§2.1a) Jun 11, 2026

Copilot AI left a comment


Pull request overview

This PR extends aks-node-controller hotfix resolution to support a base→hotfix mapping, allowing one config to target multiple ANC “base” versions (YYYYMM.DD) while keeping backward compatibility with the legacy single version field.

Changes:

  • Add hotfixConfig.Hotfixes map (base -> hotfix version) with default-deny behavior for unlisted bases.
  • Implement base extraction (hotfixBaseFromVersion) and config-aware resolution (hotfixConfig.resolveVersion) with legacy fallback.
  • Add unit tests covering config parsing, base extraction, resolution behavior, and download behavior with the map.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| aks-node-controller/hotfix.go | Adds map-based hotfix resolution and config parsing helpers; wires resolution into downloadHotfix(). |
| aks-node-controller/hotfix_test.go | Adds tests for config parsing, base extraction, resolution precedence, and map-driven download behavior. |

Comment thread aks-node-controller/hotfix.go
Comment thread aks-node-controller/hotfix.go
Devin Wong and others added 2 commits June 11, 2026 17:11
…o-ops

Adds TestDownloadHotfix_MapMisconfiguredValueBaseSkips: when a hotfixes map
entry's value base (YYYYMM.DD) does not match its key, resolveVersion selects
it by key but shouldUpgradeToHotfix rejects it because the bases differ, so no
wrong-base binary is installed. Locks in the default-safe behavior.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Addresses PR review:
- downloadHotfix now logs and skips (returns nil) when the hotfix config is
  unreadable or invalid JSON, instead of returning an error. This honors the
  fail-open guarantee so a malformed config can never block provisioning.
- hotfixBaseFromVersion now rejects a present-but-empty patch segment
  (e.g. '202604.01.') so an obviously malformed current version never selects
  a map entry, matching the documented YYYYMM.DD.PATCH contract.
- Tests: replace TestDownloadHotfix_UnreadableFile with fail-open assertions,
  add TestDownloadHotfix_InvalidJSONFailsOpen, and cover the empty-patch case.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
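The fail-open behavior described in the commit above can be sketched as a read-then-skip wrapper. This is an assumption-laden illustration: readHotfixConfig is named in the PR, but its body here is guessed, and loadOrSkip is a hypothetical helper that stands in for the corresponding branch of downloadHotfix.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"os"
)

// hotfixConfig mirrors the config shape from the PR examples (assumed layout).
type hotfixConfig struct {
	Version  string            `json:"version"`
	Hotfixes map[string]string `json:"hotfixes"`
}

// readHotfixConfig reads and parses the config, returning an error for an
// unreadable file or invalid JSON. The body is an assumption for this sketch.
func readHotfixConfig(path string) (hotfixConfig, error) {
	var cfg hotfixConfig
	data, err := os.ReadFile(path)
	if err != nil {
		return cfg, fmt.Errorf("read hotfix config: %w", err)
	}
	if err := json.Unmarshal(data, &cfg); err != nil {
		return cfg, fmt.Errorf("parse hotfix config: %w", err)
	}
	return cfg, nil
}

// loadOrSkip (hypothetical) converts any config error into "no hotfix":
// it logs the problem and reports ok=false instead of propagating an error
// that could fail provisioning.
func loadOrSkip(path string) (cfg hotfixConfig, ok bool) {
	cfg, err := readHotfixConfig(path)
	if err != nil {
		log.Printf("hotfix config unusable, skipping hotfix: %v", err)
		return hotfixConfig{}, false
	}
	return cfg, true
}

func main() {
	// Write a deliberately malformed config to a temp file.
	f, err := os.CreateTemp("", "hotfix-*.json")
	if err != nil {
		log.Fatal(err)
	}
	defer os.Remove(f.Name())
	if _, err := f.WriteString("{not json"); err != nil {
		log.Fatal(err)
	}
	f.Close()

	_, ok := loadOrSkip(f.Name())
	fmt.Println("apply hotfix:", ok) // apply hotfix: false
}
```

The design choice is that a broken config degrades to the VHD-baked binary, which is the state the node would be in without any hotfix channel at all.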
Copilot AI review requested due to automatic review settings June 12, 2026 00:14

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Comment thread aks-node-controller/hotfix_test.go
Comment thread aks-node-controller/hotfix_test.go
Devin Wong and others added 2 commits June 11, 2026 17:21
Addresses PR review on the two fail-open tests so they prove the skip is
specifically caused by the unreadable/invalid config, not an incidental
version parse skip:
- Set a parseable, hotfix-eligible Version (202604.01.0) and configure
  aptSourcesDir so a readable/valid config would proceed to install and flip
  installCalled. The only reason install does not fire is the config-read
  failure (fail-open).
- Make the unreadable case robust cross-platform: if chmod 0000 is ineffective,
  replace the path with a directory so the read genuinely fails.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
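The cross-platform fallback mentioned in the commit above (swap the file for a directory when chmod 0000 has no effect) could look roughly like this. The helper names makeUnreadable and readFailsAfterMakeUnreadable are hypothetical and do not come from hotfix_test.go.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// makeUnreadable (hypothetical) tries chmod 0000 first. If the file can
// still be read afterwards (for example when running as root, or on a
// platform where the mode bits are not enforced), it replaces the file
// with a directory of the same name so that os.ReadFile must fail.
func makeUnreadable(path string) error {
	_ = os.Chmod(path, 0o000) // best effort; the fallback below covers failure
	if _, err := os.ReadFile(path); err != nil {
		return nil // chmod was effective
	}
	if err := os.Remove(path); err != nil {
		return err
	}
	return os.Mkdir(path, 0o755)
}

// readFailsAfterMakeUnreadable sets up a temp config, makes it unreadable,
// and reports whether a subsequent read fails.
func readFailsAfterMakeUnreadable() (bool, error) {
	dir, err := os.MkdirTemp("", "hotfix-test")
	if err != nil {
		return false, err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "hotfix.json")
	if err := os.WriteFile(path, []byte(`{"version":"202604.01.3"}`), 0o600); err != nil {
		return false, err
	}
	if err := makeUnreadable(path); err != nil {
		return false, err
	}
	_, readErr := os.ReadFile(path)
	return readErr != nil, nil
}

func main() {
	failed, err := readFailsAfterMakeUnreadable()
	if err != nil {
		fmt.Println("setup error:", err)
		return
	}
	fmt.Println("read fails:", failed)
}
```

With the read guaranteed to fail, a test can assert that the install step was never reached for that reason alone, provided the rest of the fixture (parseable version, package sources) would otherwise have allowed an install.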
The legacy readHotfixVersion function had no production callers after
downloadHotfix switched to readHotfixConfig + resolveVersion. Remove it
and fold its forward-compat coverage into TestReadHotfixConfig.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 12, 2026 00:32

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Comment on lines 35 to 38
hotfixPath := a.hotfixVersionPath
if hotfixPath == "" {
	hotfixPath = defaultHotfixVersionPath
}
@Devinwong Devinwong changed the title feat(anc): provisioning-hotfix M1 - base to version hotfix map (§2.1a) feat(anc): support a base-to-version hotfix map in downloadHotfix Jun 12, 2026
@aks-node-assistant


🕵️ AgentBaker Linux gate — automated RCA for build 167737875

Signature: kubenet-v5-node-not-ready-scriptless (known wiki entry, occurrence #4, below escalation threshold of 6)
Classification: 🟡 Test infrastructure / shared-cluster fleet stress, not PR-caused
Build-vs-test class: test-infra (post-create node-registration)
Flaky vs deterministic: flaky (test-infra)
Confidence: High

Level 1 — Surface

115 scenarios across 75+ distinct top-level tests fail at kube.go:195:

🔴 FAIL: "<vmss>" haven't appeared in k8s API server: context deadline exceeded
✗ waiting for node ... to be ready failed (600.0s)

All failures are on shared cluster abe2e-kubenet-v5-150ee (westus3); 11 spill onto abe2e-azure-networkisolated-v3-d6cc9. The symptom is uniform across distros (Ubuntu 2204/2404, AzureLinuxV3, ACL, ARM64) and across bootstrap modes (default + scriptless_nbc).

Level 2 — Corroboration (≥2 independent evidence sources)

| Evidence | Source | Count |
| --- | --- | --- |
| kube.go:195 "haven't appeared in k8s API server" | test-log.json | 115 |
| bastionssh.go:304 Attempt 1/5 succeeded (no retry observed) | test-log.json | 136 → 0 retries |
| VALIDATION_ERR=52 / NXDOMAIN | test-log.json | 0 |
| VMSS create succeeded (vmss.go:557 reached) | test-log.json | observed pre-failure |

Cluster create + bastion + ARM/VMSS provisioning are all healthy. Failure is strictly at the post-VMSS-create kubelet → apiserver node-registration step.

Level 3 — Root cause + strongest alternative

Root cause (accepted): Shared cluster abe2e-kubenet-v5-150ee is not accepting node registrations within the 600 s budget — matches wiki signature kubenet-v5-node-not-ready-scriptless (apiserver/CSR controller lag or fleet-level stress on the shared kubenet-v5 fixture).

Strongest alternative considered: PR #8694 ANC hotfix M1 2.1a (feat(anc): extend hotfix config to base->version map) regresses aks-node-controller, preventing kubelet from starting. Refuted by:

  1. Failures hit default (script-based CSE) equally with scriptless_nbc — script-based path doesn't traverse ANC's new hotfix code.
  2. Identical symptom across Ubuntu 2204, 2404, AzureLinuxV3, ACL, and ARM64 VHDs — the PR did not touch any OS-specific or VHD code.
  3. The hotfix-config read is fail-open per commit 24a12e7 ("make hotfix config read fail-open and tighten base validation"), so a misconfig short-circuits to no-op rather than blocking kubelet.
  4. Failures concentrate on a single cluster suffix (150ee); if it were the PR, we'd see failures on whichever cluster a given scenario landed on — not bound to one fixture.

Action

  • No PR action required. This is a shared-fixture/fleet issue. Recommend re-queue.
  • Owner: AgentBaker E2E test-infra (shared cluster abe2e-kubenet-v5-150ee in westus3).
  • Wiki signature kubenet-v5-node-not-ready-scriptless will be updated to count 4 (last seen 167737875). Not at escalation threshold.

Does NOT match networkisolated-apiserver-fqdn-nxdomain (count=6 boundary): zero VALIDATION_ERR=52 / NXDOMAIN observed, NetworkIsolated tests fail with the same generic kube.go:195 path as everything else and run on a different cluster suffix (d6cc9kq4wzvpl). Escalation threshold NOT hit.

🤖 Posted by clawpilot-agentbaker-linux-gate-detective automation — marker 167737875:591bf839.
