fix: harden CSE retry helpers against disk-hang and timeout escape (AB#36680094)#8685
fix: harden CSE retry helpers against disk-hang and timeout escape (AB#36680094)#8685pdamianov-dev wants to merge 6 commits into
Conversation
- Move CURL_OUTPUT and ORAS_OUTPUT from /tmp to /var/log/azure to avoid CSE hangs when the ephemeral/temp disk is unstable and writes to /tmp block indefinitely (the verbose curl/oras output redirect is what was stalling at the registry connectivity probe). - Add 'timeout -k 5s' across retry helpers (_retrycmd_internal, _retry_file_curl_internal, retrycmd_pull_from_registry_with_oras, retrycmd_cp_oci_layout_with_oras, retrycmd_get_aad_access_token, retrycmd_get_refresh_token_for_oras, retrycmd_can_oras_ls_acr_anonymously) so a curl/oras process stuck in uninterruptible D-state on a hung disk is forcibly killed via SIGKILL instead of stalling CSE until the global 15-minute watchdog trips. - Override CURL_OUTPUT/ORAS_OUTPUT in the shellspec BeforeEach so tests do not depend on /var/log/azure/ existing in the test container. AB#36680094 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
This PR hardens Linux CSE retry helpers to avoid hangs when the ephemeral/temp disk is unstable by moving verbose tool output off /tmp and ensuring timeout escalates to SIGKILL after a short grace period. This targets reported provisioning stalls where retry loops never returned until the outer 15-minute watchdog fired.
Changes:
- Move
CURL_OUTPUT/ORAS_OUTPUTdefaults from/tmpto/var/log/azure/so verbose output writes do not block on an unhealthy ephemeral/temp disk. - Add
timeout -k 5s ...to curl/oras retry helpers so commands that ignore SIGTERM don’t indefinitely stall the retry loops. - Update ShellSpec precheck to redirect
CURL_OUTPUT/ORAS_OUTPUTtomktemppaths to keep tests container-friendly.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| parts/linux/cloud-init/artifacts/cse_helpers.sh | Moves verbose output paths to /var/log/azure/ and adds timeout -k 5s to multiple retry helpers to reduce hang risk. |
| parts/linux/cloud-init/artifacts/cse_install.sh | Aligns CURL_OUTPUT default to /var/log/azure/ for consistency with retry helper behavior. |
| spec/parts/linux/cloud-init/artifacts/cse_retry_helpers_spec.sh | Overrides output paths to temp files so timeout mocks don’t depend on /var/log/azure/ existing in the ShellSpec container. |
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.
Comments suppressed due to low confidence (2)
parts/linux/cloud-init/artifacts/cse_helpers.sh:479
ORAS_OUTPUTis used unquoted in the redirection and incat. This can cause word-splitting/globbing issues and will trip shellcheck (SC2086) ifORAS_OUTPUTis ever overridden to include spaces or other special chars. Quote it consistently likeCURL_OUTPUTabove.
timeout -k 5s 60 oras pull "$url" -o "$target_folder" --registry-config "${ORAS_REGISTRY_CONFIG_FILE}" "$@" > $ORAS_OUTPUT 2>&1
if [ "$?" -eq 0 ]; then
return 0
else
cat $ORAS_OUTPUT
parts/linux/cloud-init/artifacts/cse_helpers.sh:511
ORAS_REGISTRY_CONFIG_FILEandORAS_OUTPUTare unquoted here (--from-registry-config ${ORAS_REGISTRY_CONFIG_FILE} > $ORAS_OUTPUT,cat $ORAS_OUTPUT). This risks word-splitting/globbing and is inconsistent with other quoted paths/URLs in this PR. Quote both variables.
timeout -k 5s 120 oras cp "$url" "$path:$tag" --to-oci-layout --from-registry-config ${ORAS_REGISTRY_CONFIG_FILE} > $ORAS_OUTPUT 2>&1
if [ "$?" -ne 0 ]; then
cat $ORAS_OUTPUT
else
return 0
|
AgentBaker Linux PR gate — VHD build failure on
Detective summary Image Customizer fails on AzureLinuxOSGuardV3 gen2 FIPS TL with: The actual failure mode is a 44-minute runaway retry loop inside the image-customization chroot. The log shows the exact same two lines every ~5 seconds, 480 iterations ( The loop starts immediately after
Why this is almost certainly THIS PR: PR #8685 ( No other VHD flavor builds OS Guard with this exact post-install sequence, which is why only Classification: PR-caused VHD/Packer/CSE-helper regression. Likely-deterministic on this PR's HEAD. Confidence: High. Strongest alternative theory: Pre-existing breakage on Recommended next action / owner: @pdamianov-dev — please:
Evidence used: failed Image Customizer log (480 identical loop iterations spanning ~44 min, exit 41), PR changed files (only |
Add error handling to mkdir command for CURL_OUTPUT directory.
| CURL_OUTPUT=/var/log/azure/curl_verbose.out | ||
| ORAS_OUTPUT=/var/log/azure/oras_verbose.out | ||
| mkdir -p "$(dirname "$CURL_OUTPUT")" 2>/dev/null || true |
Move the writability check into a per-call helper (_ensure_writable_output_path)
that falls back to /tmp transparently when /var/log/azure cannot be created
or written. Fixes two regressions introduced by the original move from /tmp:
1. shellspec CI aborted at source time because the top-level mkdir emitted
"Permission denied" on stderr when run as the non-root shellspec user,
which shellspec treats as a fatal sourcing error (exit 102).
2. AzureLinuxOSGuardV3 image-customizer chroot looped 480 times over 44 min
because the shell-level redirect "> /var/log/azure/curl_verbose.out" fails
before curl runs when the directory does not exist in the chroot.
The new helper:
- runs inside each retry helper (cannot be defeated by chroot/mount changes
that happen between sourcing and execution)
- uses a subshell write-test `( : > "$path" ) 2>/dev/null` so shell-level
redirection errors are properly suppressed
- falls back to /tmp/{curl,oras}_verbose.out when the primary path is not usable
Verified: 46/46 cse_retry_helpers_spec passes, all cse_* specs pass,
chroot simulation confirms transparent fallback with zero stderr leakage.
AB#36680094
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
AgentBaker Linux PR gate —
Detective summary Same pattern as builds 167493131 and 167505019: vmssCSE exits 99 because Classification: Test infrastructure / scenario fixture flakiness. Confidence: High. PR #8685 is Strongest alternative theory: A CSE-time apt config regression on main. Less likely — the failure is a TCP-level Recommended next action / owner: No PR change required. AgentBaker E2E test-infra — please check HTTPSProxy_PrivateDNS proxy pod/daemon health and Evidence used: failed task log (3 |
Summary
Fixes a class of CSE hangs reported via IcM 51000000888838 where node
provisioning would stall at the registry connectivity probe and only
recover after the global 15-minute watchdog tripped.
Two root causes:
CURL_OUTPUT=/tmp/curl_verbose.out(and the analogousORAS_OUTPUT)redirected verbose curl/oras output onto the ephemeral/temp disk.
When that disk was unstable, the write blocked indefinitely, which
in turn hung the retry helper.
timeoutwithout-konly sends SIGTERM. A curl/oras process stuckin uninterruptible D-state on a hung disk ignores SIGTERM, so the
retry helper never returned and CSE made no further progress until
the outer 15-minute watchdog tripped.
Changes
parts/linux/cloud-init/artifacts/cse_helpers.shCURL_OUTPUT/ORAS_OUTPUTmoved from/tmpto/var/log/azure/(created by the CSE / waagent on Azure VMs, sibling of the existing
EVENTS_LOGGING_DIRpath).-k 5s(SIGKILL grace) to thetimeout-wrapped calls in:_retrycmd_internal,_retry_file_curl_internal,retrycmd_pull_from_registry_with_oras,retrycmd_cp_oci_layout_with_oras,retrycmd_get_aad_access_token,retrycmd_get_refresh_token_for_oras,retrycmd_can_oras_ls_acr_anonymously.parts/linux/cloud-init/artifacts/cse_install.shCURL_OUTPUTpath update for consistency.spec/parts/linux/cloud-init/artifacts/cse_retry_helpers_spec.shCURL_OUTPUT/ORAS_OUTPUTtomktemppaths in thecse_retry_helpers_precheckBeforeEachso tests do not require/var/log/azure/to exist in the shellspec container.Testing
shellcheck(repo ignore list from.pipelines/scripts/verify_shell.sh)is clean on all three modified files.
timeout()continue to work — the extra-k 5sarguments are positional and are absorbed by the mock.Related