
fix: harden CSE retry helpers against disk-hang and timeout escape (AB#36680094)#8685

Open
pdamianov-dev wants to merge 6 commits into main from pdamianov-dev/36680094


@pdamianov-dev


Summary

Fixes a class of CSE hangs reported via IcM 51000000888838 where node
provisioning would stall at the registry connectivity probe and only
recover after the global 15-minute watchdog tripped.

Two root causes:

  1. CURL_OUTPUT=/tmp/curl_verbose.out (and the analogous ORAS_OUTPUT)
    redirected verbose curl/oras output onto the ephemeral/temp disk.
    When that disk was unstable, the write blocked indefinitely, which
    in turn hung the retry helper.
  2. timeout without -k only sends SIGTERM. A curl/oras process stuck
    in uninterruptible D-state on a hung disk ignores SIGTERM, so the
    retry helper never returned and CSE made no further progress until
    the outer 15-minute watchdog tripped.
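
The second root cause can be illustrated with a small sketch. It assumes GNU coreutils `timeout` and bash. The child below only ignores SIGTERM through `trap`, which stands in for an unresponsive process and does not reproduce a real D-state hang on a failing disk. The one-second durations are shortened for illustration; the PR itself uses `-k 5s`.

```shell
#!/usr/bin/env bash
# Sketch: a child that ignores SIGTERM is only stopped once timeout
# escalates to SIGKILL after the -k grace period. Durations are illustrative.

start=$(date +%s)

# Send SIGTERM after 1s, then SIGKILL 1s later if the child is still alive.
rc=0
timeout -k 1s 1s bash -c 'trap "" TERM; sleep 30' || rc=$?

elapsed=$(( $(date +%s) - start ))
echo "exit status: $rc, elapsed: ${elapsed}s"
```

Without `-k`, the same command would be expected to run for the full 30 seconds, because the only signal sent is the one the child ignores.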

Changes

  • parts/linux/cloud-init/artifacts/cse_helpers.sh
    • CURL_OUTPUT / ORAS_OUTPUT moved from /tmp to /var/log/azure/
      (created by the CSE / waagent on Azure VMs, sibling of the existing
      EVENTS_LOGGING_DIR path).
    • Added -k 5s (SIGKILL grace) to the timeout-wrapped calls in:
      _retrycmd_internal, _retry_file_curl_internal,
      retrycmd_pull_from_registry_with_oras,
      retrycmd_cp_oci_layout_with_oras,
      retrycmd_get_aad_access_token,
      retrycmd_get_refresh_token_for_oras,
      retrycmd_can_oras_ls_acr_anonymously.
  • parts/linux/cloud-init/artifacts/cse_install.sh
    • Same CURL_OUTPUT path update for consistency.
  • spec/parts/linux/cloud-init/artifacts/cse_retry_helpers_spec.sh
    • Override CURL_OUTPUT / ORAS_OUTPUT to mktemp paths in the
      cse_retry_helpers_precheck BeforeEach so tests do not require
      /var/log/azure/ to exist in the shellspec container.
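
The spec change can be pictured with a plain shell sketch. The function name `override_output_paths_sketch` is hypothetical and stands in for the setup hook; the real spec uses ShellSpec's BeforeEach mechanism, which is not reproduced here.

```shell
#!/usr/bin/env bash
# Defaults as described in this PR; the directory may not exist in a test container.
CURL_OUTPUT=/var/log/azure/curl_verbose.out
ORAS_OUTPUT=/var/log/azure/oras_verbose.out

# Hypothetical setup hook: point both variables at throwaway temp files.
override_output_paths_sketch() {
    CURL_OUTPUT=$(mktemp)
    ORAS_OUTPUT=$(mktemp)
}

override_output_paths_sketch
trap 'rm -f "$CURL_OUTPUT" "$ORAS_OUTPUT"' EXIT

# Anything that redirects into these variables now writes to the temp files.
echo "simulated verbose curl output" > "$CURL_OUTPUT" 2>&1
echo "simulated verbose oras output" > "$ORAS_OUTPUT" 2>&1
cat "$CURL_OUTPUT" "$ORAS_OUTPUT"
```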

Testing

  • Local shellcheck (repo ignore list from .pipelines/scripts/verify_shell.sh)
    is clean on all three modified files.
  • Existing shellspec mocks of timeout() continue to work — the extra
    -k 5s arguments are positional and are absorbed by the mock.
  • CI shellcheck + shellspec will run on this PR.
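
As a rough illustration of why a mock is unaffected by the extra arguments, the sketch below defines a hypothetical `timeout` shell function that records whatever it is given. The actual mocks in the spec file are not shown in this thread and may differ.

```shell
#!/usr/bin/env bash
# Hypothetical mock: a shell function that shadows the real timeout binary.
# It records its arguments and reports success without running anything.
timeout() {
    printf 'mock timeout args: %s\n' "$*"
    return 0
}

# Call shape before this PR, then the call shape with the added -k 5s.
old_call=$(timeout 60 curl -fsSL https://example.invalid/file)
new_call=$(timeout -k 5s 60 curl -fsSL https://example.invalid/file)

echo "$old_call"
echo "$new_call"
```

A mock that inspects its first argument as the duration would behave differently, so this only holds for mocks that treat all arguments uniformly.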

Related

  • AB#36680094
  • IcM 51000000888838

- Move CURL_OUTPUT and ORAS_OUTPUT from /tmp to /var/log/azure to avoid
  CSE hangs when the ephemeral/temp disk is unstable and writes to /tmp
  block indefinitely (the verbose curl/oras output redirect is what was
  stalling at the registry connectivity probe).
- Add 'timeout -k 5s' across retry helpers (_retrycmd_internal,
  _retry_file_curl_internal, retrycmd_pull_from_registry_with_oras,
  retrycmd_cp_oci_layout_with_oras, retrycmd_get_aad_access_token,
  retrycmd_get_refresh_token_for_oras, retrycmd_can_oras_ls_acr_anonymously)
  so a curl/oras process stuck in uninterruptible D-state on a hung disk
  is forcibly killed via SIGKILL instead of stalling CSE until the global
  15-minute watchdog trips.
- Override CURL_OUTPUT/ORAS_OUTPUT in the shellspec BeforeEach so tests
  do not depend on /var/log/azure/ existing in the test container.

AB#36680094

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment


Pull request overview

This PR hardens Linux CSE retry helpers to avoid hangs when the ephemeral/temp disk is unstable by moving verbose tool output off /tmp and ensuring timeout escalates to SIGKILL after a short grace period. This targets reported provisioning stalls where retry loops never returned until the outer 15-minute watchdog fired.

Changes:

  • Move CURL_OUTPUT / ORAS_OUTPUT defaults from /tmp to /var/log/azure/ so verbose output writes do not block on an unhealthy ephemeral/temp disk.
  • Add timeout -k 5s ... to curl/oras retry helpers so commands that ignore SIGTERM don’t indefinitely stall the retry loops.
  • Update ShellSpec precheck to redirect CURL_OUTPUT / ORAS_OUTPUT to mktemp paths to keep tests container-friendly.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| parts/linux/cloud-init/artifacts/cse_helpers.sh | Moves verbose output paths to /var/log/azure/ and adds timeout -k 5s to multiple retry helpers to reduce hang risk. |
| parts/linux/cloud-init/artifacts/cse_install.sh | Aligns CURL_OUTPUT default to /var/log/azure/ for consistency with retry helper behavior. |
| spec/parts/linux/cloud-init/artifacts/cse_retry_helpers_spec.sh | Overrides output paths to temp files so timeout mocks don’t depend on /var/log/azure/ existing in the ShellSpec container. |

Comment thread parts/linux/cloud-init/artifacts/cse_helpers.sh Outdated
Comment thread parts/linux/cloud-init/artifacts/cse_helpers.sh Outdated
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 10, 2026 17:14
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Copilot AI left a comment


Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

Comment thread parts/linux/cloud-init/artifacts/cse_helpers.sh
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 10, 2026 18:28

Copilot AI left a comment


Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.

Comments suppressed due to low confidence (2)

parts/linux/cloud-init/artifacts/cse_helpers.sh:479

  • ORAS_OUTPUT is used unquoted in the redirection and in cat. This can cause word-splitting/globbing issues and will trip shellcheck (SC2086) if ORAS_OUTPUT is ever overridden to include spaces or other special chars. Quote it consistently like CURL_OUTPUT above.
        timeout -k 5s 60 oras pull "$url" -o "$target_folder" --registry-config "${ORAS_REGISTRY_CONFIG_FILE}" "$@" > $ORAS_OUTPUT 2>&1
        if [ "$?" -eq 0 ]; then
            return 0
        else
            cat $ORAS_OUTPUT

parts/linux/cloud-init/artifacts/cse_helpers.sh:511

  • ORAS_REGISTRY_CONFIG_FILE and ORAS_OUTPUT are unquoted here (--from-registry-config ${ORAS_REGISTRY_CONFIG_FILE} > $ORAS_OUTPUT, cat $ORAS_OUTPUT). This risks word-splitting/globbing and is inconsistent with other quoted paths/URLs in this PR. Quote both variables.
            timeout -k 5s 120 oras cp "$url" "$path:$tag" --to-oci-layout --from-registry-config ${ORAS_REGISTRY_CONFIG_FILE} > $ORAS_OUTPUT 2>&1
            if [ "$?" -ne 0 ]; then
                cat $ORAS_OUTPUT
            else
                return 0
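
The quoting concern in both suppressed comments can be reproduced with a short sketch, assuming bash or a POSIX sh. The path with a space is deliberately contrived; the PR's default paths contain no spaces.

```shell
#!/usr/bin/env bash
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT

# A path containing a space, chosen to expose word splitting.
ORAS_OUTPUT="$work/oras verbose.out"

# Quoted: the redirect and cat both see a single path.
echo "simulated oras output" > "$ORAS_OUTPUT" 2>&1
quoted_rc=0
cat "$ORAS_OUTPUT" || quoted_rc=$?

# Unquoted: the value is split into two words, so cat is asked for two
# files that do not exist.
unquoted_rc=0
# shellcheck disable=SC2086
cat $ORAS_OUTPUT 2>/dev/null || unquoted_rc=$?

echo "quoted: $quoted_rc, unquoted: $unquoted_rc"
```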

@aks-node-assistant


AgentBaker Linux PR gate — VHD build failure on buildAzureLinuxOSGuardV3gen2fipsTL (likely caused by THIS PR)

  • Run: 167504743
  • Failed job: buildAzureLinuxOSGuardV3gen2fipsTL
  • Failed task: Build VHD - Image Customizer (exit code 2)
  • Wiki signature: cse-helpers-curl-output-path-loop (new)

Detective summary

Image Customizer fails on AzureLinuxOSGuardV3 gen2 FIPS TL with:

image customization failed:
failed to customize raw image:
script (scripts/azlosguard-postinstall.sh) failed:
exit status 41

The actual failure mode is a 44-minute runaway retry loop inside the image-customization chroot. The log shows the exact same two lines every ~5 seconds, 480 iterations (17:28:15 → 18:12:10):

/home/packer/provision_source.sh: line 412: /var/log/azure/curl_verbose.out: No such file or directory
cat: /var/log/azure/curl_verbose.out: No such file or directory

The loop starts immediately after 120 file curl retries while installing cni-plugins v1.6.2 from packages.aks.azure.com. The retry helper:

  1. Tries timeout -k 5s ... curl ... > /var/log/azure/curl_verbose.out 2>&1
  2. The redirect target directory doesn't exist in this image's chroot, so the shell fails the redirect before curl runs and nothing is written.
  3. On retry failure the helper does cat /var/log/azure/curl_verbose.out — which prints the "No such file" pair.
  4. The retry loop never observes a success file → it keeps retrying. After ~44 minutes the wrapping script exits 41 (likely the new outer timeout).
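
The steps above hinge on how the shell handles a redirect into a missing directory. The sketch below, assuming bash, uses a stand-in command instead of curl and a temp directory instead of /var/log/azure; the `marker` file is a hypothetical device for detecting whether the command ever ran.

```shell
#!/usr/bin/env bash
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT

missing_log="$work/no-such-dir/curl_verbose.out"
marker="$work/command-ran"

# Stand-in for curl: creates the marker file if it is ever executed.
rc=0
{ sh -c ': > "$1"' _ "$marker" > "$missing_log" 2>&1; } 2>/dev/null || rc=$?

if [ -e "$marker" ]; then ran=yes; else ran=no; fi
echo "exit status: $rc, command ran: $ran"

# The follow-up cat then fails as well, which matches the second log line.
cat "$missing_log" 2>&1 || true
```

Because the failure happens in the shell rather than in the command, no number of retries can succeed until the output path itself is fixed.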

Why this is almost certainly THIS PR:

PR #8685 (cse_helpers.sh +21/-10) moves CURL_OUTPUT from /tmp/curl_verbose.out to /var/log/azure/curl_verbose.out and adds mkdir -p "$(dirname "$CURL_OUTPUT")" at the script-sourcing level. In the AzureLinuxOSGuardV3 image-customizer chroot, azlosguard-postinstall.sh runs at a build stage where /var/log/azure does not exist (and on OS Guard the /var/log tree may have restricted/mounted semantics). The top-level mkdir -p either runs before the right mount layout exists or its effect is lost by the time the curl-helper runs, so the redirect silently fails on every iteration.

No other VHD flavor builds OS Guard with this exact post-install sequence, which is why only buildAzureLinuxOSGuardV3gen2fipsTL fails while every other Linux VHD job in this run passed.

Classification: PR-caused VHD/Packer/CSE-helper regression. Likely-deterministic on this PR's HEAD.

Confidence: High.

Strongest alternative theory: Pre-existing breakage on main for azlosguard-postinstall.sh that this PR merely surfaces. This is less likely because the exact symptom (/var/log/azure/curl_verbose.out: No such file) maps 1:1 to the path change introduced by this PR. Before this PR, CURL_OUTPUT was /tmp/curl_verbose.out, and /tmp always exists in the chroot, so a pre-existing loop would have surfaced before now.

Recommended next action / owner: @pdamianov-dev — please:

  • Ensure mkdir -p "$(dirname "$CURL_OUTPUT")" is performed inside the curl-helper function on each call (not only at script source time), so it survives chroot/image-customizer contexts.
  • Or fall back to /tmp/curl_verbose.out when /var/log/azure is not writable/creatable.
  • Also handle the retry-helper case where the verbose-output file is missing on cat — fail fast or log a warning instead of looping silently.
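
The third recommendation could look roughly like the sketch below, assuming bash. `retry_with_log_sketch` is a hypothetical helper written for illustration and is not the code in cse_helpers.sh; the return code 2 for an unusable log path is likewise an arbitrary choice.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command up to $1 times, capturing its output
# in log file $2. Stops early if the log file cannot be read back, since
# that means the output path itself is broken and retrying cannot help.
retry_with_log_sketch() {
    local retries="$1" log="$2" attempt
    shift 2
    for (( attempt = 1; attempt <= retries; attempt++ )); do
        if "$@" > "$log" 2>&1; then
            return 0
        fi
        if [ ! -r "$log" ]; then
            echo "warning: $log is missing; giving up after attempt $attempt" >&2
            return 2
        fi
        cat "$log"
    done
    return 1
}

work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT

ok_rc=0
retry_with_log_sketch 3 "$work/ok.out" true || ok_rc=$?
fail_rc=0
retry_with_log_sketch 3 "$work/fail.out" false || fail_rc=$?
missing_rc=0
retry_with_log_sketch 3 "$work/no-such-dir/x.out" true 2>/dev/null || missing_rc=$?

echo "ok=$ok_rc fail=$fail_rc missing=$missing_rc"
```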

Evidence used: failed Image Customizer log (480 identical loop iterations spanning ~44 min, exit 41), PR changed files (only cse_helpers.sh, cse_install.sh, cse_retry_helpers_spec.sh), diff confirms the exact path migration to /var/log/azure/curl_verbose.out.

Add error handling to mkdir command for CURL_OUTPUT directory.
Copilot AI review requested due to automatic review settings June 10, 2026 19:28

Copilot AI left a comment


Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

Comment on lines +209 to +211
CURL_OUTPUT=/var/log/azure/curl_verbose.out
ORAS_OUTPUT=/var/log/azure/oras_verbose.out
mkdir -p "$(dirname "$CURL_OUTPUT")" 2>/dev/null || true
Move the writability check into a per-call helper (_ensure_writable_output_path)
that falls back to /tmp transparently when /var/log/azure cannot be created
or written. Fixes two regressions introduced by the original move from /tmp:

1. shellspec CI aborted at source time because the top-level mkdir emitted
   "Permission denied" on stderr when run as the non-root shellspec user,
   which shellspec treats as a fatal sourcing error (exit 102).

2. AzureLinuxOSGuardV3 image-customizer chroot looped 480 times over 44 min
   because the shell-level redirect "> /var/log/azure/curl_verbose.out" fails
   before curl runs when the directory does not exist in the chroot.

The new helper:
- runs inside each retry helper (cannot be defeated by chroot/mount changes
  that happen between sourcing and execution)
- uses a subshell write-test `( : > "$path" ) 2>/dev/null` so shell-level
  redirection errors are properly suppressed
- falls back to /tmp/{curl,oras}_verbose.out when the primary path is not usable

Verified: 46/46 cse_retry_helpers_spec passes, all cse_* specs pass,
chroot simulation confirms transparent fallback with zero stderr leakage.

AB#36680094

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
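
A minimal sketch of the per-call fallback described in the commit message follows, assuming bash. The body of the PR's `_ensure_writable_output_path` is not shown in this thread, so `ensure_writable_output_path_sketch`, its two-argument signature, and the temp paths are assumptions; only the subshell write-test idiom is taken from the commit message.

```shell
#!/usr/bin/env bash
# Hypothetical reimplementation: print the primary path if it can be
# created and written, otherwise print the fallback path.
ensure_writable_output_path_sketch() {
    local primary="$1" fallback="$2"
    mkdir -p "$(dirname "$primary")" 2>/dev/null || true
    # The subshell keeps shell-level redirection errors off stderr.
    if ( : > "$primary" ) 2>/dev/null; then
        printf '%s\n' "$primary"
    else
        printf '%s\n' "$fallback"
    fi
}

work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
: > "$work/blocker"   # a regular file, so nothing can be created beneath it

usable=$(ensure_writable_output_path_sketch \
    "$work/logs/curl_verbose.out" "$work/fallback.out")
unusable=$(ensure_writable_output_path_sketch \
    "$work/blocker/azure/curl_verbose.out" "$work/fallback.out")

echo "usable primary   -> $usable"
echo "unusable primary -> $unusable"
```

Note that the write test truncates the primary file, so in this sketch it would need to run before the command whose output is being captured.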
@aks-node-assistant


AgentBaker Linux PR gate — Test_Ubuntu2204_HTTPSProxy_PrivateDNS proxy fixture unreachable (NOT this PR)

  • Run: 167534982
  • Failed job: Run AgentBaker E2E (only HTTPSProxy_PrivateDNS subtests failed; all VHD builds passed)
  • Wiki signature: httpsproxy-fixture-proxy-unreachable (wiki)

Detective summary

Same pattern as builds 167493131 and 167505019: vmssCSE exits 99 because apt-get update cannot reach the HTTPSProxy_PrivateDNS scenario's HTTP proxy in the 10.14.0.0/24 test VNet (this run hit 10.14.0.63 / 10.14.0.94). Apt retries 10x then aborts. Third occurrence of this signature; approaching escalation threshold (>6 across 6 distinct build IDs).

Classification: Test infrastructure / scenario fixture flakiness.

Confidence: High. PR #8685 changes only cse_helpers.sh, cse_install.sh, and cse_retry_helpers_spec.sh (no proxy or apt changes); HTTPSProxy_PrivateDNS uses a dedicated test-fixture proxy that has nothing to do with this PR.

Strongest alternative theory: A CSE-time apt config regression on main. Less likely — the failure is a TCP-level connect (113: No route to host) against a private proxy endpoint, not an apt configuration error.

Recommended next action / owner: No PR change required. AgentBaker E2E test-infra — please check HTTPSProxy_PrivateDNS proxy pod/daemon health and 10.14.0.0/24 reachability; this signature is now recurring across distinct PRs and distinct proxy IPs.

Evidence used: failed task log (3 === FAIL for HTTPSProxy_PrivateDNS, vmssCSE exit 99 with proxy at 10.14.0.63 / 10.14.0.94), all other E2E and all VHD builds passed.
