Skip to content

test: add ShellSpec coverage for init-aks-custom-cloud*.sh#8692

Open
Devinwong wants to merge 2 commits into
mainfrom
devinwong/test-init-aks-custom-cloud-scripts
Open

test: add ShellSpec coverage for init-aks-custom-cloud*.sh#8692
Devinwong wants to merge 2 commits into
mainfrom
devinwong/test-init-aks-custom-cloud-scripts

Conversation

@Devinwong

@Devinwong Devinwong commented Jun 10, 2026

Copy link
Copy Markdown
Collaborator

Context

The init-aks-custom-cloud*.sh scripts run at node boot in custom-cloud environments and rewrite the system package repository URLs to point at the internal RepoDepot mirror (since those nodes cannot reach the public internet). A misconfiguration here — e.g. a leftover public URL that does not get rewritten — can cause package operations to hang during provisioning. There was previously no automated test exercising these scripts, so a regression in the repo-URL rewrite could slip through.

This PR adds that missing test coverage.

What this PR does

No script logic is changed. This PR only adds ShellSpec unit-test coverage and makes the scripts/functions testable. The runtime behavior of all four scripts is preserved — main() invokes the same functions in the same order with the same arguments as before.

1. Makes all 4 init-aks-custom-cloud*.sh scripts testable (no behavior change)

  • init-aks-custom-cloud.sh (Ubuntu/Flatcar/ACL)
  • init-aks-custom-cloud-mariner.sh (Mariner/AzureLinux)
  • init-aks-custom-cloud-operation-requests.sh (Ubuntu/Flatcar/ACL)
  • init-aks-custom-cloud-operation-requests-mariner.sh (Mariner/AzureLinux)

Refactor pattern (matches existing repo convention, e.g. mariner-package-update.sh):

  • Lift hardcoded paths (apt sources, yum.repos.d, ssl certs, chrony, wireserver) into : "${VAR:=default}" env-overridable constants. Defaults are identical to the previous hardcoded values, so production behavior is unchanged.
  • Add ${__SOURCED__:+return} guard so ShellSpec's Include can source the script without running main at source time.
  • Wrap existing inline logic into named functions (detect_distro, fetch_and_install_ca_certs, init_*_repo_depot, write_chrony_config, etc.) — the same commands, just grouped into callable units.

2. Adds ShellSpec tests (20 examples across 4 spec files)

Drives each script with a mocked REPO_DEPOT_ENDPOINT and asserts:

  • All rewritten repo files point at the mocked depot.
  • No leftover packages.microsoft.com URLs remain.
  • No third-party (e.g. developer.download.nvidia.com) URLs leak in.
  • Existing /etc/apt/sources.list[.d] entries are backed up, not left in place.
  • init_repo_depot strips the trailing /ubuntu and dispatches the Mariner / AzureLinux / Ubuntu paths correctly.

Verification

  • All 20 new ShellSpec examples pass locally (pinned to the CI shellspec version 0.28.1).
  • make validate-shell (shellcheck gate) passes.
  • No snapshot/testdata changes — the refactor preserves embedded-script behavior.

Anchored on IcM 725845756 (Bleu, Dec 2025–Jan 2026): an unreachable repo
URL baked into the VHD caused tdnf to hang during AzSecPack install, which
exhausted CRP's extension budget and stuck cluster upgrades in "Creating".
The fix landed in VHD 202601.07, but no test exercises the
init-aks-custom-cloud*.sh scripts that rewrite repo URLs at boot — so a
similar misconfiguration would slip through public-cloud e2e again.

Refactor all 4 init-aks-custom-cloud*.sh scripts to be testable:
  - Lift hardcoded paths (apt sources, yum.repos.d, ssl certs, chrony,
    wireserver) into `: "\"` env-overridable constants.
  - Add `\` guard so ShellSpec can `Include` them
    without running `main` at source time.
  - Wrap the existing inline logic into named functions (detect_distro,
    fetch_and_install_ca_certs, init_*_repo_depot, write_chrony_config, etc.)
    so individual flows can be exercised in isolation.

Behavior is preserved — main() invokes the same functions in the same
order with the same arguments as the originals.

Add ShellSpec tests asserting, for both the `operation-requests` variants
(used in Bleu / public custom clouds) and the originals (USSecCloud /
USNatCloud):
  - All rewritten repo files point at a mocked REPO_DEPOT_ENDPOINT.
  - No leftover packages.microsoft.com URLs remain.
  - No third-party (e.g. developer.download.nvidia.com) URLs leak in.
  - Existing /etc/apt/sources.list[.d] entries are backed up, not left in
    place.
  - `init_repo_depot` correctly strips the trailing `/ubuntu` suffix and
    dispatches Mariner vs AzureLinux vs Ubuntu/Flatcar/ACL flows.

20 new ShellSpec examples across 4 spec files.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens custom-cloud bootstrap reliability by refactoring the init-aks-custom-cloud*.sh repo-depot rewrite scripts to be sourceable/testable and adding ShellSpec coverage to prevent regressions like the IcM 725845756 “stale repo URL → package manager hang” incident.

Changes:

  • Refactors the 4 init-aks-custom-cloud*.sh scripts into function-oriented, env-overridable implementations and adds a ${__SOURCED__:+return} guard to support ShellSpec Include.
  • Adds 4 new ShellSpec files (≈20 examples) asserting repo rewrite invariants (no upstream PMC/Ubuntu/NVIDIA URLs leaking; backup behavior; correct repo-depot dispatch).
  • Adds focused dispatch tests for Mariner vs AzureLinux handling (including stripping trailing /ubuntu from REPO_DEPOT_ENDPOINT).

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
spec/parts/linux/cloud-init/artifacts/init_aks_custom_cloud_spec.sh Adds Ubuntu/Flatcar/ACL coverage for init-aks-custom-cloud.sh repo rewrite + backups + URL checks.
spec/parts/linux/cloud-init/artifacts/init_aks_custom_cloud_operation_requests_spec.sh Adds Ubuntu/Flatcar/ACL coverage for operation-requests variant used by Bleu-like clouds.
spec/parts/linux/cloud-init/artifacts/init_aks_custom_cloud_mariner_spec.sh Adds Mariner/AzureLinux coverage for depot rewrite + dispatch behavior.
spec/parts/linux/cloud-init/artifacts/init_aks_custom_cloud_operation_requests_mariner_spec.sh Adds Mariner/AzureLinux coverage for operation-requests variant + dispatch behavior.
parts/linux/cloud-init/artifacts/init-aks-custom-cloud.sh Refactors into testable functions; introduces env-overridable paths; preserves main execution via main().
parts/linux/cloud-init/artifacts/init-aks-custom-cloud-operation-requests.sh Same refactor pattern for operation-requests Ubuntu/Flatcar/ACL variant.
parts/linux/cloud-init/artifacts/init-aks-custom-cloud-mariner.sh Same refactor pattern for Mariner/AzureLinux variant; keeps repo rewrite + chrony config behavior.
parts/linux/cloud-init/artifacts/init-aks-custom-cloud-operation-requests-mariner.sh Same refactor pattern for operation-requests Mariner/AzureLinux variant.

Comment thread parts/linux/cloud-init/artifacts/init-aks-custom-cloud.sh Outdated
Comment thread parts/linux/cloud-init/artifacts/init-aks-custom-cloud-operation-requests.sh Outdated
Comment thread parts/linux/cloud-init/artifacts/init-aks-custom-cloud-mariner.sh Outdated
…ustom-cloud scripts

- Guard top-level 'set -x' with ${__SOURCED__} so ShellSpec Include does not
  emit trace to stderr (was causing the shellspec CI job to abort with exit 102).
- Drop the misleading 'echo "distribution is $distribution"' line ($distribution
  was never set); 'Running on $NAME' already logs the detected distro. Addresses
  the 4 Copilot review comments.
- Capture the functions' progress output in the 17 affected examples via
  'The output should be present', clearing the unhandled-stdout warnings.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@aks-node-assistant

Copy link
Copy Markdown
Contributor

🕵️ AgentBaker Linux Gate Detective — PR gate failure analysis

  • Failed run: build 167683195 (pipeline def 119535, refs/pull/8692/merge)
  • Failed job: Stage e2eRun AgentBaker E2E (exit 1)
  • Failed tests (6 top-level):
    • Test_AzureLinuxV3_NetworkIsolated_Package_Install (default + scriptless_nbc)
    • Test_AzureLinuxV3_NetworkIsolatedCluster_NonAnonymousACR (default + scriptless_nbc)
    • Test_Ubuntu2204_NetworkIsolatedCluster_NonAnonymousACR/scriptless_nbc
    • Test_Ubuntu2204_ArtifactStreaming_NetworkIsolatedCluster/scriptless_nbc
    • Test_ACL_NetworkIsolatedCluster_NonAnonymousACR
    • Test_ACL_SecureTLSBootstrapping_BootstrapToken_Fallback/scriptless_nbc

RCA (3-level)

  1. Surface: CSE fails on every NetworkIsolated scenario with the cluster's private apiserver FQDN unresolvable.
  2. Corroboration (≥2 sources):
    • test-log.json: nslookup against 169.254.10.10 returns NXDOMAIN for abe2e-azure-networkisolated-v3-kq4wzvpl.hcp.westus3.azmk8s.io (CSE VALIDATION_ERR=52, repeated for ~5 min).
    • Pipeline Run AgentBaker E2E issue list: only generic Script failed with exit code: 1 + a debug-pod readiness warning — no PR-source code error.
    • All 5 NetworkIsolated failures hit the same shared cluster abe2e-azure-networkisolated-v3-kq4wzvpl already named in the wiki signature.
  3. Challenge / strongest alternative: Could PR test: add ShellSpec coverage for init-aks-custom-cloud*.sh #8692 (ShellSpec coverage for init-aks-custom-cloud*.sh) be responsible? No — those scripts are not on the CSE NetworkIsolated apiserver-nslookup path, and the wiki shows the identical NXDOMAIN against the same westus3 cluster on three unrelated PRs (chore(deps): update kubelet-kubectl (patch) #8600, chore(deps): update nvidia-dcgm (patch) #8659, fix(linux): allow secondary nics to be configured on boot #8642). Deterministic shared-fixture / private-DNS issue, not PR-caused.

Classification

  • Signature (reuse, existing): networkisolated-apiserver-fqdn-nxdomain — Test-infra / NetworkIsolated shared-cluster private-DNS reachability.
  • Build-vs-test class: test-infra flake (shared fixture); deterministic on this cluster, not on the PR.
  • Secondary signature (1/6 tests): localdns-exporter-systemd-failed-state on Test_ACL_SecureTLSBootstrapping_BootstrapToken_Fallback/scriptless_nbc (localdns-exporter@0-10.220.112.107:9353-10.220.112.4:57128.service unexpectedly entered failed state). Tracked separately in the wiki.
  • Confidence: High.

Recommendation / owner

  • Owner: AgentBaker E2E test-infra — verify private DNS zone / private-link for the hcp FQDN on cluster abe2e-azure-networkisolated-v3-kq4wzvpl (westus3) and that 169.254.10.10 resolver is wired correctly.
  • Action for PR author: No code change required from test: add ShellSpec coverage for init-aks-custom-cloud*.sh #8692. Once the shared-cluster DNS is healthy, re-run the gate. The wiki row for networkisolated-apiserver-fqdn-nxdomain will be updated centrally to include build 167683195.

Posted by Clawpilot AgentBaker Linux Gate Detective Watcher. No raw private logs are included.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants