fix(cse): prevent CSE timeout overrun with per-op budget and pre-command guard#8230
Merged
fix(cse): prevent CSE timeout overrun with per-op budget and pre-command guard#8230
Conversation
Contributor
There was a problem hiding this comment.
Pull request overview
This PR aims to prevent Linux CSE runs from exceeding the VM provisioner’s ~16-minute client window by adding (1) a pre-attempt global timeout guard and (2) a per-download wall-clock budget, then wiring those controls into provisioning-time download call sites and updating ShellSpec coverage accordingly.
Changes:
- Add a pre-command
check_cse_timeoutguard in_retrycmd_internaland a per-operationmaxBudgetin_retry_file_curl_internal(propagated viaretrycmd_curl_file/retrycmd_get_tarball). - Update provisioning-time download call sites to pass a 300s per-operation budget and adjust
retrycmd_get_tarballcall signature to include an explicit timeout. - Reduce
nvidia-smiper-try timeout from 300s to 30s and update/add ShellSpec expectations for the new retry helper behaviors.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| spec/parts/linux/cloud-init/artifacts/cse_retry_helpers_spec.sh | Updates parameter tables and adds tests for pre-attempt global timeout and per-operation budget exit behavior. |
| spec/parts/linux/cloud-init/artifacts/cse_install_spec.sh | Updates expected retrycmd_get_tarball invocation signature to include timeout + budget. |
| parts/linux/cloud-init/artifacts/ubuntu/cse_install_ubuntu.sh | Adds 300s per-operation budget to provisioning-time curl downloads (NVIDIA key, containerd, runc). |
| parts/linux/cloud-init/artifacts/mariner/cse_install_mariner.sh | Adds 300s per-operation budget to NVIDIA repo file download. |
| parts/linux/cloud-init/artifacts/cse_install.sh | Adds explicit timeout + 300s per-operation budgets to provisioning-time tarball/curl downloads (credential provider, oras, CNI, crictl, k8s). |
| parts/linux/cloud-init/artifacts/cse_helpers.sh | Implements pre-attempt global timeout guard and per-operation budget; changes retrycmd_get_tarball signature. |
| parts/linux/cloud-init/artifacts/cse_config.sh | Lowers nvidia-smi retry timeout from 300s to 30s at 3 call sites. |
Devinwong
reviewed
Apr 2, 2026
pdamianov-dev
approved these changes
Apr 6, 2026
Contributor
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (1)
parts/linux/cloud-init/artifacts/cse_helpers.sh:347
- The per-operation
maxBudgetcheck happens before startingtimeout $timeout curl ..., but the curl attempt itself can still run longer than the remaining budget (and then exceed the budget by up totimeout+ sleep). IfmaxBudgetis intended to be a hard upper bound (“total wall-clock seconds across all retries”), cap the per-attempt timeout tomin(timeout, maxBudget - elapsed)and/or re-check the budget immediately after each attempt as well.
# Check per-operation budget if set -- prevents a single download from consuming the entire CSE window
if [ "${maxBudget}" -gt 0 ]; then
local opElapsed
opElapsed=$(( $(date +%s) - opStartTime ))
if [ "$opElapsed" -ge "$maxBudget" ]; then
echo "Operation budget of ${maxBudget}s exceeded after ${opElapsed}s, exiting early." >&2
return 2
fi
fi
# check if global cse timeout is approaching
if ! check_cse_timeout; then
echo "CSE timeout approaching, exiting early." >&2
return 2
else
if [ "$i" -gt 1 ]; then
sleep $waitSleep
fi
timeout $timeout curl -fsSLv $url -o $filePath > $CURL_OUTPUT 2>&1
if [ "$?" -ne 0 ]; then
spec/parts/linux/cloud-init/artifacts/cse_retry_helpers_spec.sh
Outdated
Show resolved
Hide resolved
…and guard Remove hardcoded 60s timeout from retrycmd_get_tarball; caller now passes timeout as 3rd positional arg (matches retrycmd_curl_file style). Add optional max_budget_s (6th arg, default 0) to retrycmd_get_tarball and retrycmd_curl_file. When > 0, _retry_file_curl_internal tracks wall-clock time and exits early (return 2) once the budget is consumed, bounding any single download to 5 min at provisioning time regardless of retry count. Add pre-command check_cse_timeout guard at the TOP of _retrycmd_internal loop. Previously the guard only fired after a failed command, meaning a new 300-600s operation could START at minute 12:59 and run to ~18 min. The guard now prevents starting any new attempt when >780s have elapsed since CSE_STARTTIME_SECONDS. Pass 300s budget to all 11 provisioning-time download call sites: cse_install.sh (credential-provider, oras, secure-TLS, CNI x2, crictl, k8s) ubuntu/cse_install_ubuntu.sh (nvidia GPG key, containerd, runc) mariner/cse_install_mariner.sh (nvidia repo file) Reduce nvidia-smi per-try timeout 300s to 30s in cse_config.sh (3 sites). nvidia-smi is a CLI status check, not a long-running driver operation; the 300s value was a copy-paste from the GPU driver install command. Update spec tests: fix parameter tables, add budget-exceeded test and pre-command CSE timeout test in cse_retry_helpers_spec.sh; update expected call signature in cse_install_spec.sh.
…allers Agent-Logs-Url: https://github.com/Azure/AgentBaker/sessions/8f356829-1593-46b5-94dc-6fa4a85aa931 Co-authored-by: djsly <4981802+djsly@users.noreply.github.com>
The check_cse_timeout warning message was written to stdout, which pollutes pipes. In the 2004fips VHD build, the pipeline: retrycmd_if_failure 5 10 1200 yes | ua enable fips-updates sends stdout of the left side as stdin to ua enable fips-updates. The warning 'CSE_STARTTIME_SECONDS is not set' was read by ua as the answer to the 'Are you sure? (y/N)' prompt instead of 'y' from yes, causing 'Operation cancelled by user' and exit 184. Fix: redirect the warning to stderr (>&2), consistent with the error message that was already on stderr. Update the ShellSpec test to match. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…args - _retrycmd_internal: cap per-attempt timeout to remaining CSE budget (780s - elapsed) so a long timeout cannot overrun the provisioning window even when per-attempt timeouts are large - _retry_file_curl_internal: restructure loop so curl runs BEFORE the check+exhaustion logic, fixing a bug where retries=N only executed N-1 curl attempts (and retries=1 executed zero) - retrycmd_get_tarball spec: remove extra positional arg that shifted tarball path into URL position Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Replace [[ $3 =~ ^[0-9]+$ ]] with POSIX-compatible case statement in retrycmd_get_tarball to fix SC3010 shellcheck error - Fix _retry_file_curl_internal: keep pre-check at top of loop for efficiency (skip curl if file already valid), add final check after last curl attempt so retries=N always executes N curl attempts Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Replace broken date mock (export -f date doesn't work in POSIX sh) with real-time approach: timeout mock sleeps 2s, budget=1s - Set CSE_STARTTIME_SECONDS in BeforeEach to prevent check_cse_timeout warnings in all retry tests - Explicitly unset CSE_STARTTIME_SECONDS in the check_cse_timeout test that verifies the unset behavior Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…internal Same pattern as _retrycmd_internal: when maxBudget > 0, effectiveTimeout is capped to min(timeout, maxBudget - elapsed) so a single curl attempt cannot overrun the documented per-operation budget. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ARTTIME_SECONDS guard Agent-Logs-Url: https://github.com/Azure/AgentBaker/sessions/0bcd6f17-76f8-4ef1-ba7a-cf3a6f43d658 Co-authored-by: djsly <4981802+djsly@users.noreply.github.com>
The Test_Ubuntu2404_NPD_Basic test was flaky because the 6-minute PollUntilContextTimeout wasn't enough when Kubernetes API server rate limiting slowed down node condition polling. Both E2E stages hit 'context deadline exceeded' from the client rate limiter after ~7.5min. Increase to 10 minutes — still well within the 35-minute overall test budget — to give NPD more time to detect the simulated corruption under load. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
NPD polls for filesystem corruption every 5 minutes. Under load, the test can miss two full cycles (>10 min), causing flaky timeouts. Instead of bumping the timeout further, restart NPD immediately after simulating corruption to trigger an instant journal scan. This makes detection deterministic regardless of where we land in the polling cycle. Also reduce the safety-net timeout from 10m to 5m since detection should now happen within seconds of the restart. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
NPD's custom plugin monitor polls every 5 minutes. Restarting the service just resets the polling timer instead of triggering immediate detection. Use a 12-minute timeout to safely cover 2 full polling cycles plus margin for API server latency under load. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The one-shot 'systemd-run --unit=docker' approach writes a single journal entry that can age out of NPD's time-windowed lookback between check cycles, causing detection to permanently fail. Replace with a continuous loop (echo every 30s) so every NPD check cycle finds fresh entries. Also adds diagnostic logging: - Dumps the NPD plugin config for visibility into what it checks - Verifies journal entries exist after simulation starts Reduces timeout from 12min to 8min since continuous simulation means detection should happen within the first check cycle. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…debug The NPD plugin runs check_fs_corruption.sh which may filter by SYSLOG_IDENTIFIER rather than _SYSTEMD_UNIT. Set SyslogIdentifier=dockerd on the transient service to match Docker daemon journal metadata. Added diagnostics that dump the actual check_fs_corruption.sh script content and run it manually to verify detection works, giving us full visibility into why NPD fails to detect the simulated corruption. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Root cause found via diagnostics: check_fs_corruption.sh checks 'journalctl -u containerd --since 5 min ago', NOT docker. The test was writing to docker.service journal which NPD never checks. Fix: - Move injector process into containerd's cgroup via cgroup.procs so journal daemon attributes entries to _SYSTEMD_UNIT=containerd.service - Use systemd-cat to write to journal from within containerd's cgroup - Update expected message from 'Docker journal' to 'containerd journal' - Keep continuous injection (every 30s) for lookback window coverage Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The cgroup.procs approach to inject journal entries under containerd.service does not work on Ubuntu 24.04 with cgroup v2 — systemd protects service cgroups from external process migration. New approach: replace check_fs_corruption.sh with a script that always reports corruption. This reliably tests the full NPD pipeline (plugin → condition → node status) without depending on journal injection mechanics. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Gate check_cse_timeout in _retry_file_curl_internal on CSE_STARTTIME_SECONDS being set (avoids noisy warnings in VHD build) - Re-check/cap budget after sleep in _retry_file_curl_internal so the per-operation budget is a hard wall-clock cap - Cap sleep duration to remaining budget to avoid overshooting - Use sentinel file instead of shell function in ShellSpec test (timeout can't invoke shell functions) - Only apply 300s credential provider budget during CSE runs, use unlimited during VHD build to avoid flakiness - Log both stdout and stderr in NPD diagnostic output Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ShellSpec's Assert runs a shell command — 'not' is not a valid command. Use '[ ! -e ]' which is POSIX-compatible and available everywhere. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The budget re-check after sleep in _retry_file_curl_internal and the conditional credential provider budget introduced in the review comment fixes caused 14 E2E test failures. Reverting to the simpler approach that passed 108/108 E2E tests while keeping the safe CSE_STARTTIME_SECONDS gating on check_cse_timeout. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…efer, docs
Thread 1: Fix malformed Logf format string (missing stdout label)
Thread 2: Add defer cleanup to restore check_fs_corruption.sh after NPD test
Thread 3: Extract duplicated 780 to CSE_MAX_DURATION_SECONDS shared constant
Thread 4: Make _retry_file_curl_internal and retrycmd_curl_file vars local,
default maxBudget to 0 when omitted
Thread 5: Clarify timeout_seconds must be integer (no duration suffixes)
Also restores budget re-check after sleep and conditional cred_budget —
the E2E failures (14→46 after revert) confirmed infra flakiness, not code.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Review comments addressed: - Remove premature inline restore that ran BEFORE NPD poll; keep only the defer cleanup which runs at function exit - Use CSE_MAX_DURATION_SECONDS env var (default 780) in both check_cse_timeout and _retrycmd_internal to avoid drift - Mark all _retry_file_curl_internal and retrycmd_curl_file vars as local; default maxBudget to 0 when omitted - Document timeout_seconds must be integer (no suffixes) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… and retrycmd_curl_file Agent-Logs-Url: https://github.com/Azure/AgentBaker/sessions/ed7889f6-7cc8-4ef0-a1cc-e8e8774c3df2 Co-authored-by: djsly <4981802+djsly@users.noreply.github.com>
…etection WaitUntilNodeReady used a bare Kubernetes Watch that was prone to missing node registration events due to: - Watch starting from 'now' (missing already-registered nodes) - Server-side watch timeout closing the channel after ~5-10 min - No reconnection logic for dropped watch connections Replace with wait.PollUntilContextTimeout using periodic List calls, matching the pattern already used by WaitUntilPodRunningWithRetry in the same file. Polls every 10s for up to 10 minutes. This eliminates the false-negative 'haven't appeared in k8s API server' failures where nodes actually registered successfully but the watch missed the event. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove early return after first non-Ready matching node so the poll loop checks all nodes with the VMSS prefix. Prevents missing a Ready node when multiple nodes share the same prefix. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
pdamianov-dev
approved these changes
Apr 10, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Linux CSE runs can exceed the VM provisioner's ~16-minute client window when downloads retry indefinitely or a long-running operation starts near the deadline. This PR adds a global timeout guard and per-download wall-clock budget to bound execution time.
cse_helpers.sh_retrycmd_internal: Add pre-attemptcheck_cse_timeoutguard (gated onCSE_STARTTIME_SECONDSbeing set — no-op and no noise during VHD build). Cap per-attempteffectiveTimeouttomin(timeoutVal, remainingCseTime)so a single attempt cannot outlive the provisioning window._retry_file_curl_internal: Add optionalmaxBudget(4th arg, default 0). When > 0, capseffectiveTimeouttomin(timeout, maxBudget - elapsed)per attempt, bounding total download wall-clock to the budget regardless of retry count. Loop restructured so curl fires on every iteration (previously the last attempt was skipped).retrycmd_get_tarball: Backward-compatible signature detection — if 3rd arg is non-numeric (old callers:<retries> <wait_sleep> <tarball> <url>), defaults timeout to 60s. New callers pass explicit<timeout>as 3rd arg.retrycmd_curl_file: Exposes new optionalmax_budget_s6th arg (default 0).Call sites (
cse_install.sh,cse_install_ubuntu.sh,cse_install_mariner.sh)Pass 300s per-operation budget to all provisioning-time download sites (credential-provider, oras, TLS bootstrap, CNI ×2, crictl, k8s binaries, nvidia GPG key, containerd, runc, nvidia repo). Without budget: 120 retries × 60s = ~2h10m theoretical max per download; with budget: capped at 5 min total.
cse_config.shReduce
nvidia-smiper-try timeout from 300s → 30s at 3 call sites (nvidia-smiis a status check, not a long-running init).Backward compatibility
max_budget_sdefaults to 0 (no cap); VHD build callers unaffected.check_cse_timeoutguards are skipped whenCSE_STARTTIME_SECONDSis unset, preserving existing VHD build behavior exactly.