feat(anc): add check-hotfix subcommand to read hotfix pointer from ConfigMap#8696

Draft
Devinwong wants to merge 4 commits into devinwong/laughing-pancake from devinwong/anc-check-hotfix-configmap

Devinwong (Collaborator) commented Jun 12, 2026

Add a check-hotfix subcommand that reads the hotfix pointer from a ConfigMap

This adds a new fail-open check-hotfix subcommand to aks-node-controller. It reads a cluster ConfigMap that maps an ANC version base to a hotfix version, and writes that pointer to the file download-hotfix already consumes. download-hotfix then re-resolves the pointer and keeps its existing patch-only, strictly-higher gating. check-hotfix only fetches and stages the pointer; it never installs anything and never blocks provisioning.

Stacking

This branch is stacked on the base-to-version hotfix map change (PR #8694). The PR base is set to that branch so the diff shows only this change (app.go wiring + checkhotfix.go + checkhotfix_test.go). This PR must merge after #8694; once #8694 has merged, retarget this PR to main.

What it does

  1. Reads the kube-system/anc-hotfix-version ConfigMap from the apiserver with a raw net/http HTTPS GET (no client-go dependency).
    • Primary endpoint/creds: the bootstrap token and apiserver FQDN from the node config that ANC already parses, with the cluster CA from /etc/kubernetes/certs/ca.crt.
    • Secondary fallback (client-cert mode): parse the on-node bootstrap-kubeconfig, then kubeconfig, for server, CA, and client-cert/key or token.
    • Short-timeout (~10s) HTTPS client trusting the cluster CA (and presenting a client cert when present).
  2. Parses the ConfigMap: .data holds the full {"hotfixes":{...}} JSON object under a single key (prefers hotfixes.json, else the only entry). The value unmarshals directly into the same config type download-hotfix uses, so both commands share one parser and data contract.
  3. Writes the pointer to /opt/azure/containers/aks-node-controller-hotfix.json in the same {"hotfixes":{...}} shape (atomic temp-file + rename), so download-hotfix re-resolves it and applies its unchanged gating.
  4. Fail-open: the command always exits 0 so provisioning is never blocked. Any 404 / 403 / timeout / parse failure is logged, emitted as telemetry, and swallowed.
  5. Cold-start fallback: if the ConfigMap read fails, it reads a lenient top-level hotfixes object embedded in the node config and uses that. (Marked with a TODO to switch to a typed config field once that contract exists.)
  6. Telemetry: guest-agent events under task name CheckHotfix with outcomes configMapRead, noHotfixForBase, customDataFallback, failed.
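
As a rough illustration of step 1, the sketch below builds a CA-pinned, short-timeout client and the ConfigMap GET using only the standard library. The function names (newClient, buildRequest) and the placeholder FQDN and token are hypothetical and not taken from checkhotfix.go; the URL path is the standard Kubernetes REST path for a namespaced ConfigMap, and sending the bootstrap token as a bearer token is assumed here.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
	"net/http"
	"net/url"
	"time"
)

// configMapPath is the standard Kubernetes REST path for kube-system/anc-hotfix-version.
const configMapPath = "/api/v1/namespaces/kube-system/configmaps/anc-hotfix-version"

// newClient (hypothetical name) returns an HTTPS client that trusts only the
// supplied cluster CA bundle and gives up after the given timeout.
func newClient(caPEM []byte, timeout time.Duration) (*http.Client, error) {
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		return nil, errors.New("no usable certificates in cluster CA bundle")
	}
	return &http.Client{
		Timeout: timeout,
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{RootCAs: pool, MinVersion: tls.VersionTLS12},
		},
	}, nil
}

// buildRequest (hypothetical name) prepares the GET and attaches the
// bootstrap token as a bearer token (an assumption of this sketch).
func buildRequest(apiserverFQDN, bootstrapToken string) (*http.Request, error) {
	u := url.URL{Scheme: "https", Host: apiserverFQDN, Path: configMapPath}
	req, err := http.NewRequest(http.MethodGet, u.String(), nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+bootstrapToken)
	req.Header.Set("Accept", "application/json")
	return req, nil
}

func main() {
	// Placeholder values; the real ones come from the node config ANC already parses.
	req, err := buildRequest("example-cluster.invalid", "abcdef.0123456789abcdef")
	if err != nil {
		fmt.Println("build request:", err)
		return
	}
	fmt.Println(req.Method, req.URL.String())
}
```

In the real command the CA bytes would be read from /etc/kubernetes/certs/ca.crt and the timeout would be roughly 10 seconds; the client-cert fallback would additionally set Certificates on the tls.Config.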

Net effect (examples)

ConfigMap published to the cluster:

{
  "data": {
    "hotfixes.json": "{\"hotfixes\":{\"202604.01\":\"202604.01.1\",\"202605.01\":\"202605.01.2\"}}"
  }
}

check-hotfix stages /opt/azure/containers/aks-node-controller-hotfix.json:

{"hotfixes":{"202604.01":"202604.01.1","202605.01":"202605.01.2"}}
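
The mapping from the ConfigMap body to the staged file can be sketched as follows. This is a minimal illustration, not the code in checkhotfix.go: the type name hotfixConfig appears in the commit message, but its field layout, the configMap type, and the parseConfigMap function are assumptions made for the example.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// hotfixConfig mirrors the {"hotfixes":{...}} shape; the field layout is assumed.
type hotfixConfig struct {
	Hotfixes map[string]string `json:"hotfixes"`
}

// configMap holds only the part of the ConfigMap object this sketch needs.
type configMap struct {
	Data map[string]string `json:"data"`
}

// parseConfigMap (hypothetical name) prefers the hotfixes.json key and
// otherwise accepts the only entry in .data.
func parseConfigMap(body []byte) (*hotfixConfig, error) {
	var cm configMap
	if err := json.Unmarshal(body, &cm); err != nil {
		return nil, fmt.Errorf("decode ConfigMap: %w", err)
	}
	raw, ok := cm.Data["hotfixes.json"]
	if !ok {
		if len(cm.Data) != 1 {
			return nil, errors.New("ConfigMap .data has no hotfixes.json and not exactly one entry")
		}
		for _, v := range cm.Data {
			raw = v // the single remaining entry
		}
	}
	var cfg hotfixConfig
	if err := json.Unmarshal([]byte(raw), &cfg); err != nil {
		return nil, fmt.Errorf("decode hotfix pointer: %w", err)
	}
	return &cfg, nil
}

func main() {
	body := []byte(`{"data":{"hotfixes.json":"{\"hotfixes\":{\"202604.01\":\"202604.01.1\",\"202605.01\":\"202605.01.2\"}}"}}`)
	cfg, err := parseConfigMap(body)
	if err != nil {
		fmt.Println("parse failed:", err)
		return
	}
	out, _ := json.Marshal(cfg)
	fmt.Println(string(out))
}
```

Run against the example ConfigMap above, this prints the same {"hotfixes":{...}} document that check-hotfix stages.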

| Node baked ANC version | ConfigMap read | check-hotfix outcome | download-hotfix then does |
| --- | --- | --- | --- |
| 202604.01.0 | OK | configMapRead | base 202604.01 -> 202604.01.1, patch 1 > 0, upgrades |
| 202605.01.2 | OK | configMapRead | base 202605.01 -> 202605.01.2, patch not higher, no-op |
| 202607.15.0 | OK (no matching base) | noHotfixForBase | no pointer for this base, no-op |
| 202604.01.0 | fails, node config has embedded hotfixes | customDataFallback | reads staged fallback pointer, resolves as above |
| 202604.01.0 | fails, no fallback present | failed (still exit 0) | nothing staged, no-op |
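
The download-hotfix column above follows from the patch-only, strictly-higher gating. The sketch below is one way to express that rule, assuming a version is "<base>.<patch>" with the patch as the last dot-separated component; splitVersion and resolve are hypothetical names and are not the download-hotfix implementation.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// splitVersion (hypothetical) splits "202604.01.0" into base "202604.01" and patch 0.
func splitVersion(v string) (string, int, error) {
	i := strings.LastIndex(v, ".")
	if i < 0 {
		return "", 0, fmt.Errorf("version %q has no patch component", v)
	}
	patch, err := strconv.Atoi(v[i+1:])
	if err != nil {
		return "", 0, fmt.Errorf("version %q has a non-numeric patch: %w", v, err)
	}
	return v[:i], patch, nil
}

// resolve (hypothetical) returns the hotfix version to install and true only
// when the pointer for the baked base has the same base and a strictly higher patch.
func resolve(baked string, hotfixes map[string]string) (string, bool) {
	base, patch, err := splitVersion(baked)
	if err != nil {
		return "", false
	}
	target, ok := hotfixes[base]
	if !ok {
		return "", false // no pointer for this base
	}
	targetBase, targetPatch, err := splitVersion(target)
	if err != nil || targetBase != base || targetPatch <= patch {
		return "", false // not patch-only, or not strictly higher
	}
	return target, true
}

func main() {
	hotfixes := map[string]string{"202604.01": "202604.01.1", "202605.01": "202605.01.2"}
	for _, baked := range []string{"202604.01.0", "202605.01.2", "202607.15.0"} {
		target, upgrade := resolve(baked, hotfixes)
		fmt.Printf("%s upgrade=%v target=%q\n", baked, upgrade, target)
	}
}
```

With the example map, only 202604.01.0 resolves to an upgrade (to 202604.01.1); the other two baked versions are no-ops, matching the first three table rows.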

Tests

New network-free unit tests (creds/ConfigMap source injected, no real networking): success read+write, 404/403/timeout/connection fail-open, invalid ConfigMap JSON fail-open, noHotfixForBase, cold-start fallback (and no-pointer failure), telemetry outcomes and always-exit-0 wiring, shared-parser equivalence with download-hotfix, and kubeconfig parsing (token + client-cert, inline-data and file forms).
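
To make the injection idea concrete, here is a minimal sketch of a fail-open decision function driven by an injected fetcher, so a test can simulate a 403 without any networking. The fetcher type, checkHotfix signature, and errForbidden are hypothetical; the four outcome strings are the ones listed under Telemetry. How the real command orders the outcomes when the fallback map lacks the node's base is not stated in this description, so the ordering here is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// fetcher stands in for the injected ConfigMap source; tests supply a fake.
type fetcher func() (map[string]string, error)

// errForbidden simulates a 403 from the apiserver.
var errForbidden = errors.New("configmap read: 403 Forbidden")

// checkHotfix (hypothetical signature) never returns an error: every failure
// is folded into an outcome string so the caller can always exit 0.
// It returns the telemetry outcome and the map that would be staged.
func checkHotfix(fetch fetcher, fallback map[string]string, base string) (string, map[string]string) {
	hotfixes, err := fetch()
	if err != nil {
		if len(fallback) == 0 {
			return "failed", nil // nothing staged
		}
		// Assumption: the fallback outcome is reported without a base check.
		return "customDataFallback", fallback
	}
	if _, ok := hotfixes[base]; !ok {
		return "noHotfixForBase", hotfixes
	}
	return "configMapRead", hotfixes
}

func main() {
	forbidden := func() (map[string]string, error) { return nil, errForbidden }
	outcome, staged := checkHotfix(forbidden, nil, "202604.01")
	fmt.Println(outcome, len(staged))
}
```

Because the fetcher is just a function value, each failure mode (404, 403, timeout, connection error) can be covered by returning a different error from the fake.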

All new tests pass. The full go test ./... run shows no new failures versus the base branch. The only failures are pre-existing and specific to running on Windows: those tests need /etc/os-release, bash, and Unix file permissions, and they pass in Linux CI.

Note: wiring this command into the provisioning wrapper script is intentionally out of scope for this PR and will land separately behind a feature flag.

Devin Wong and others added 4 commits June 11, 2026 17:28
…2.1b)

Add a fail-open 'check-hotfix' CLI subcommand that reads the
kube-system/anc-hotfix-version ConfigMap published by the
live-patching-controller and stages the resolved {hotfixes:{...}} pointer
to the path download-hotfix already reads. download-hotfix keeps its
unchanged patch-only, strictly-higher gating; check-hotfix only fetches and
writes the pointer.

- Raw net/http HTTPS GET (no client-go); creds from AKSNodeConfig bootstrap
  token + apiserver FQDN (primary) or on-node kubeconfigs (secondary).
- Shares the 2.1a hotfixConfig parser/data contract with download-hotfix.
- Always exits 0; emits CheckHotfix telemetry (configMapRead,
  noHotfixForBase, customDataFallback, failed).
- PoC cold-start fallback reads a lenient top-level hotfixes object from the
  node config when the ConfigMap read fails (TODO: typed absvc contract).
- Injectable App fields (checkHotfixConfigMapFetcher, nodeConfigPath) for
  network-free unit tests.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The legacy readHotfixVersion function had no production callers after
downloadHotfix switched to readHotfixConfig + resolveVersion. Remove it
and fold its forward-compat coverage into TestReadHotfixConfig.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@Devinwong Devinwong changed the title feat(anc): provisioning-hotfix M1 - check-hotfix ConfigMap reader (2.1b) feat(anc): add check-hotfix subcommand to read hotfix pointer from ConfigMap Jun 12, 2026