fix(cse): prevent CSE timeout overrun with per-op budget and pre-command guard by djsly · Pull Request #8230 · Azure/AgentBaker

djsly · 2026-04-02T20:16:22Z

Linux CSE runs can exceed the VM provisioner's ~16-minute client window when downloads retry indefinitely or a long-running operation starts near the deadline. This PR adds a global timeout guard and per-download wall-clock budget to bound execution time.

`cse_helpers.sh`

_retrycmd_internal: Add pre-attempt check_cse_timeout guard (gated on CSE_STARTTIME_SECONDS being set — no-op and no noise during VHD build). Cap per-attempt effectiveTimeout to min(timeoutVal, remainingCseTime) so a single attempt cannot outlive the provisioning window.
_retry_file_curl_internal: Add optional maxBudget (4th arg, default 0). When > 0, caps effectiveTimeout to min(timeout, maxBudget - elapsed) per attempt, bounding total download wall-clock to the budget regardless of retry count. Loop restructured so curl fires on every iteration (previously the last attempt was skipped).
retrycmd_get_tarball: Backward-compatible signature detection — if 3rd arg is non-numeric (old callers: <retries> <wait_sleep> <tarball> <url>), defaults timeout to 60s. New callers pass explicit <timeout> as 3rd arg.
retrycmd_curl_file: Exposes new optional max_budget_s 6th arg (default 0).

Call sites (`cse_install.sh`, `cse_install_ubuntu.sh`, `cse_install_mariner.sh`)

Pass 300s per-operation budget to all provisioning-time download sites (credential-provider, oras, TLS bootstrap, CNI ×2, crictl, k8s binaries, nvidia GPG key, containerd, runc, nvidia repo). Without budget: 120 retries × 60s = ~2h10m theoretical max per download; with budget: capped at 5 min total.

`cse_config.sh`

Reduce nvidia-smi per-try timeout from 300s → 30s at 3 call sites (nvidia-smi is a status check, not a long-running init).

Backward compatibility

max_budget_s defaults to 0 (no cap); VHD build callers unaffected.
check_cse_timeout guards are skipped when CSE_STARTTIME_SECONDS is unset, preserving existing VHD build behavior exactly.

Copilot

Pull request overview

This PR aims to prevent Linux CSE runs from exceeding the VM provisioner’s ~16-minute client window by adding (1) a pre-attempt global timeout guard and (2) a per-download wall-clock budget, then wiring those controls into provisioning-time download call sites and updating ShellSpec coverage accordingly.

Changes:

Add a pre-command check_cse_timeout guard in _retrycmd_internal and a per-operation maxBudget in _retry_file_curl_internal (propagated via retrycmd_curl_file / retrycmd_get_tarball).
Update provisioning-time download call sites to pass a 300s per-operation budget and adjust retrycmd_get_tarball call signature to include an explicit timeout.
Reduce nvidia-smi per-try timeout from 300s to 30s and update/add ShellSpec expectations for the new retry helper behaviors.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
spec/parts/linux/cloud-init/artifacts/cse_retry_helpers_spec.sh	Updates parameter tables and adds tests for pre-attempt global timeout and per-operation budget exit behavior.
spec/parts/linux/cloud-init/artifacts/cse_install_spec.sh	Updates expected `retrycmd_get_tarball` invocation signature to include timeout + budget.
parts/linux/cloud-init/artifacts/ubuntu/cse_install_ubuntu.sh	Adds 300s per-operation budget to provisioning-time curl downloads (NVIDIA key, containerd, runc).
parts/linux/cloud-init/artifacts/mariner/cse_install_mariner.sh	Adds 300s per-operation budget to NVIDIA repo file download.
parts/linux/cloud-init/artifacts/cse_install.sh	Adds explicit timeout + 300s per-operation budgets to provisioning-time tarball/curl downloads (credential provider, oras, CNI, crictl, k8s).
parts/linux/cloud-init/artifacts/cse_helpers.sh	Implements pre-attempt global timeout guard and per-operation budget; changes `retrycmd_get_tarball` signature.
parts/linux/cloud-init/artifacts/cse_config.sh	Lowers `nvidia-smi` retry timeout from 300s to 30s at 3 call sites.

parts/linux/cloud-init/artifacts/cse_helpers.sh

Copilot

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

parts/linux/cloud-init/artifacts/cse_helpers.sh:347

The per-operation maxBudget check happens before starting timeout $timeout curl ..., but the curl attempt itself can still run longer than the remaining budget (and then exceed the budget by up to timeout + sleep). If maxBudget is intended to be a hard upper bound (“total wall-clock seconds across all retries”), cap the per-attempt timeout to min(timeout, maxBudget - elapsed) and/or re-check the budget immediately after each attempt as well.

        # Check per-operation budget if set -- prevents a single download from consuming the entire CSE window
        if [ "${maxBudget}" -gt 0 ]; then
            local opElapsed
            opElapsed=$(( $(date +%s) - opStartTime ))
            if [ "$opElapsed" -ge "$maxBudget" ]; then
                echo "Operation budget of ${maxBudget}s exceeded after ${opElapsed}s, exiting early." >&2
                return 2
            fi
        fi
        # check if global cse timeout is approaching
        if ! check_cse_timeout; then
            echo "CSE timeout approaching, exiting early." >&2
            return 2
        else
            if [ "$i" -gt 1 ]; then
                sleep $waitSleep
            fi
            timeout $timeout curl -fsSLv $url -o $filePath > $CURL_OUTPUT 2>&1
            if [ "$?" -ne 0 ]; then

parts/linux/cloud-init/artifacts/cse_helpers.sh

spec/parts/linux/cloud-init/artifacts/cse_retry_helpers_spec.sh

Copilot

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

parts/linux/cloud-init/artifacts/cse_install.sh

parts/linux/cloud-init/artifacts/ubuntu/cse_install_ubuntu.sh

parts/linux/cloud-init/artifacts/mariner/cse_install_mariner.sh

…and guard Remove hardcoded 60s timeout from retrycmd_get_tarball; caller now passes timeout as 3rd positional arg (matches retrycmd_curl_file style). Add optional max_budget_s (6th arg, default 0) to retrycmd_get_tarball and retrycmd_curl_file. When > 0, _retry_file_curl_internal tracks wall-clock time and exits early (return 2) once the budget is consumed, bounding any single download to 5 min at provisioning time regardless of retry count. Add pre-command check_cse_timeout guard at the TOP of _retrycmd_internal loop. Previously the guard only fired after a failed command, meaning a new 300-600s operation could START at minute 12:59 and run to ~18 min. The guard now prevents starting any new attempt when >780s have elapsed since CSE_STARTTIME_SECONDS. Pass 300s budget to all 11 provisioning-time download call sites: cse_install.sh (credential-provider, oras, secure-TLS, CNI x2, crictl, k8s) ubuntu/cse_install_ubuntu.sh (nvidia GPG key, containerd, runc) mariner/cse_install_mariner.sh (nvidia repo file) Reduce nvidia-smi per-try timeout 300s to 30s in cse_config.sh (3 sites). nvidia-smi is a CLI status check, not a long-running driver operation; the 300s value was a copy-paste from the GPU driver install command. Update spec tests: fix parameter tables, add budget-exceeded test and pre-command CSE timeout test in cse_retry_helpers_spec.sh; update expected call signature in cse_install_spec.sh.

…allers Agent-Logs-Url: https://github.com/Azure/AgentBaker/sessions/8f356829-1593-46b5-94dc-6fa4a85aa931 Co-authored-by: djsly <4981802+djsly@users.noreply.github.com>

The check_cse_timeout warning message was written to stdout, which pollutes pipes. In the 2004fips VHD build, the pipeline: retrycmd_if_failure 5 10 1200 yes | ua enable fips-updates sends stdout of the left side as stdin to ua enable fips-updates. The warning 'CSE_STARTTIME_SECONDS is not set' was read by ua as the answer to the 'Are you sure? (y/N)' prompt instead of 'y' from yes, causing 'Operation cancelled by user' and exit 184. Fix: redirect the warning to stderr (>&2), consistent with the error message that was already on stderr. Update the ShellSpec test to match. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…args - _retrycmd_internal: cap per-attempt timeout to remaining CSE budget (780s - elapsed) so a long timeout cannot overrun the provisioning window even when per-attempt timeouts are large - _retry_file_curl_internal: restructure loop so curl runs BEFORE the check+exhaustion logic, fixing a bug where retries=N only executed N-1 curl attempts (and retries=1 executed zero) - retrycmd_get_tarball spec: remove extra positional arg that shifted tarball path into URL position Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Replace [[ $3 =~ ^[0-9]+$ ]] with POSIX-compatible case statement in retrycmd_get_tarball to fix SC3010 shellcheck error - Fix _retry_file_curl_internal: keep pre-check at top of loop for efficiency (skip curl if file already valid), add final check after last curl attempt so retries=N always executes N curl attempts Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Replace broken date mock (export -f date doesn't work in POSIX sh) with real-time approach: timeout mock sleeps 2s, budget=1s - Set CSE_STARTTIME_SECONDS in BeforeEach to prevent check_cse_timeout warnings in all retry tests - Explicitly unset CSE_STARTTIME_SECONDS in the check_cse_timeout test that verifies the unset behavior Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…internal Same pattern as _retrycmd_internal: when maxBudget > 0, effectiveTimeout is capped to min(timeout, maxBudget - elapsed) so a single curl attempt cannot overrun the documented per-operation budget. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…ARTTIME_SECONDS guard Agent-Logs-Url: https://github.com/Azure/AgentBaker/sessions/0bcd6f17-76f8-4ef1-ba7a-cf3a6f43d658 Co-authored-by: djsly <4981802+djsly@users.noreply.github.com>

The Test_Ubuntu2404_NPD_Basic test was flaky because the 6-minute PollUntilContextTimeout wasn't enough when Kubernetes API server rate limiting slowed down node condition polling. Both E2E stages hit 'context deadline exceeded' from the client rate limiter after ~7.5min. Increase to 10 minutes — still well within the 35-minute overall test budget — to give NPD more time to detect the simulated corruption under load. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

NPD polls for filesystem corruption every 5 minutes. Under load, the test can miss two full cycles (>10 min), causing flaky timeouts. Instead of bumping the timeout further, restart NPD immediately after simulating corruption to trigger an instant journal scan. This makes detection deterministic regardless of where we land in the polling cycle. Also reduce the safety-net timeout from 10m to 5m since detection should now happen within seconds of the restart. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

NPD's custom plugin monitor polls every 5 minutes. Restarting the service just resets the polling timer instead of triggering immediate detection. Use a 12-minute timeout to safely cover 2 full polling cycles plus margin for API server latency under load. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The one-shot 'systemd-run --unit=docker' approach writes a single journal entry that can age out of NPD's time-windowed lookback between check cycles, causing detection to permanently fail. Replace with a continuous loop (echo every 30s) so every NPD check cycle finds fresh entries. Also adds diagnostic logging: - Dumps the NPD plugin config for visibility into what it checks - Verifies journal entries exist after simulation starts Reduces timeout from 12min to 8min since continuous simulation means detection should happen within the first check cycle. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…debug The NPD plugin runs check_fs_corruption.sh which may filter by SYSLOG_IDENTIFIER rather than _SYSTEMD_UNIT. Set SyslogIdentifier=dockerd on the transient service to match Docker daemon journal metadata. Added diagnostics that dump the actual check_fs_corruption.sh script content and run it manually to verify detection works, giving us full visibility into why NPD fails to detect the simulated corruption. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Root cause found via diagnostics: check_fs_corruption.sh checks 'journalctl -u containerd --since 5 min ago', NOT docker. The test was writing to docker.service journal which NPD never checks. Fix: - Move injector process into containerd's cgroup via cgroup.procs so journal daemon attributes entries to _SYSTEMD_UNIT=containerd.service - Use systemd-cat to write to journal from within containerd's cgroup - Update expected message from 'Docker journal' to 'containerd journal' - Keep continuous injection (every 30s) for lookback window coverage Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The cgroup.procs approach to inject journal entries under containerd.service does not work on Ubuntu 24.04 with cgroup v2 — systemd protects service cgroups from external process migration. New approach: replace check_fs_corruption.sh with a script that always reports corruption. This reliably tests the full NPD pipeline (plugin → condition → node status) without depending on journal injection mechanics. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Gate check_cse_timeout in _retry_file_curl_internal on CSE_STARTTIME_SECONDS being set (avoids noisy warnings in VHD build) - Re-check/cap budget after sleep in _retry_file_curl_internal so the per-operation budget is a hard wall-clock cap - Cap sleep duration to remaining budget to avoid overshooting - Use sentinel file instead of shell function in ShellSpec test (timeout can't invoke shell functions) - Only apply 300s credential provider budget during CSE runs, use unlimited during VHD build to avoid flakiness - Log both stdout and stderr in NPD diagnostic output Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

ShellSpec's Assert runs a shell command — 'not' is not a valid command. Use '[ ! -e ]' which is POSIX-compatible and available everywhere. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The budget re-check after sleep in _retry_file_curl_internal and the conditional credential provider budget introduced in the review comment fixes caused 14 E2E test failures. Reverting to the simpler approach that passed 108/108 E2E tests while keeping the safe CSE_STARTTIME_SECONDS gating on check_cse_timeout. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…efer, docs Thread 1: Fix malformed Logf format string (missing stdout label) Thread 2: Add defer cleanup to restore check_fs_corruption.sh after NPD test Thread 3: Extract duplicated 780 to CSE_MAX_DURATION_SECONDS shared constant Thread 4: Make _retry_file_curl_internal and retrycmd_curl_file vars local, default maxBudget to 0 when omitted Thread 5: Clarify timeout_seconds must be integer (no duration suffixes) Also restores budget re-check after sleep and conditional cred_budget — the E2E failures (14→46 after revert) confirmed infra flakiness, not code. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Review comments addressed: - Remove premature inline restore that ran BEFORE NPD poll; keep only the defer cleanup which runs at function exit - Use CSE_MAX_DURATION_SECONDS env var (default 780) in both check_cse_timeout and _retrycmd_internal to avoid drift - Mark all _retry_file_curl_internal and retrycmd_curl_file vars as local; default maxBudget to 0 when omitted - Document timeout_seconds must be integer (no suffixes) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

… and retrycmd_curl_file Agent-Logs-Url: https://github.com/Azure/AgentBaker/sessions/ed7889f6-7cc8-4ef0-a1cc-e8e8774c3df2 Co-authored-by: djsly <4981802+djsly@users.noreply.github.com>

…etection WaitUntilNodeReady used a bare Kubernetes Watch that was prone to missing node registration events due to: - Watch starting from 'now' (missing already-registered nodes) - Server-side watch timeout closing the channel after ~5-10 min - No reconnection logic for dropped watch connections Replace with wait.PollUntilContextTimeout using periodic List calls, matching the pattern already used by WaitUntilPodRunningWithRetry in the same file. Polls every 10s for up to 10 minutes. This eliminates the false-negative 'haven't appeared in k8s API server' failures where nodes actually registered successfully but the watch missed the event. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.

e2e/kube.go

Remove early return after first non-Ready matching node so the poll loop checks all nodes with the VMSS prefix. Prevents missing a Ready node when multiple nodes share the same prefix. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI review requested due to automatic review settings April 2, 2026 20:16

djsly requested review from AbelHu, Devinwong, SriHarsha001, awesomenix, calvin197, cameronmeissner, ganeshkumarashok, junjiezhang1997, lilypan26, mxj220, pdamianov-dev, phealy, r2k1, sulixu, surajssd, timmy-wright and zachary-bailey as code owners April 2, 2026 20:16

djsly temporarily deployed to test April 2, 2026 20:16 — with GitHub Actions Inactive

Copilot started reviewing on behalf of djsly April 2, 2026 20:17 View session

Copilot AI reviewed Apr 2, 2026

View reviewed changes

parts/linux/cloud-init/artifacts/cse_helpers.sh Outdated Show resolved Hide resolved

parts/linux/cloud-init/artifacts/cse_helpers.sh Show resolved Hide resolved

Copilot started work on behalf of djsly April 2, 2026 20:26 View session

Copilot finished work on behalf of djsly April 2, 2026 20:33

Devinwong reviewed Apr 2, 2026

View reviewed changes

parts/linux/cloud-init/artifacts/cse_helpers.sh Show resolved Hide resolved

pdamianov-dev approved these changes Apr 6, 2026

View reviewed changes

Copilot AI review requested due to automatic review settings April 7, 2026 21:52

djsly temporarily deployed to test April 7, 2026 21:52 — with GitHub Actions Inactive

Copilot started reviewing on behalf of djsly April 7, 2026 21:53 View session

Copilot AI reviewed Apr 7, 2026

View reviewed changes

parts/linux/cloud-init/artifacts/cse_helpers.sh Outdated Show resolved Hide resolved

parts/linux/cloud-init/artifacts/cse_helpers.sh Outdated Show resolved Hide resolved

spec/parts/linux/cloud-init/artifacts/cse_retry_helpers_spec.sh Outdated Show resolved Hide resolved

djsly temporarily deployed to test April 8, 2026 01:38 — with GitHub Actions Inactive

Copilot AI review requested due to automatic review settings April 9, 2026 20:28

djsly force-pushed the djsly/37384340 branch from dd5c101 to 886395c Compare April 9, 2026 20:28

djsly temporarily deployed to test April 9, 2026 20:28 — with GitHub Actions Inactive

Copilot AI reviewed Apr 9, 2026

View reviewed changes

parts/linux/cloud-init/artifacts/cse_install.sh Show resolved Hide resolved

parts/linux/cloud-init/artifacts/ubuntu/cse_install_ubuntu.sh Show resolved Hide resolved

parts/linux/cloud-init/artifacts/mariner/cse_install_mariner.sh Show resolved Hide resolved

djsly mentioned this pull request Apr 9, 2026

fix(e2e): add DAG dependencies to prevent route table race condition #8268

Merged

djsly and others added 22 commits April 9, 2026 19:59

fix(cse): add backward compat to retrycmd_get_tarball for old 4-arg c…

84610f5

…allers Agent-Logs-Url: https://github.com/Azure/AgentBaker/sessions/8f356829-1593-46b5-94dc-6fa4a85aa931 Co-authored-by: djsly <4981802+djsly@users.noreply.github.com>

fix: gate check_cse_timeout calls in _retrycmd_internal behind CSE_ST…

475a17a

…ARTTIME_SECONDS guard Agent-Logs-Url: https://github.com/Azure/AgentBaker/sessions/0bcd6f17-76f8-4ef1-ba7a-cf3a6f43d658 Co-authored-by: djsly <4981802+djsly@users.noreply.github.com>

fix(spec): use POSIX-compatible assertion in shellspec test

383f960

ShellSpec's Assert runs a shell command — 'not' is not a valid command. Use '[ ! -e ]' which is POSIX-compatible and available everywhere. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

fix: gate max_budget on CSE_STARTTIME_SECONDS in retrycmd_get_tarball…

31ac5f7

… and retrycmd_curl_file Agent-Logs-Url: https://github.com/Azure/AgentBaker/sessions/ed7889f6-7cc8-4ef0-a1cc-e8e8774c3df2 Co-authored-by: djsly <4981802+djsly@users.noreply.github.com>

Copilot AI reviewed Apr 10, 2026

View reviewed changes

e2e/kube.go Outdated Show resolved Hide resolved

pdamianov-dev approved these changes Apr 10, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(cse): prevent CSE timeout overrun with per-op budget and pre-command guard#8230

fix(cse): prevent CSE timeout overrun with per-op budget and pre-command guard#8230
djsly merged 23 commits intomainfrom
djsly/37384340

djsly commented Apr 2, 2026 •

edited by Copilot AI

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

djsly commented Apr 2, 2026 • edited by Copilot AI Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

cse_helpers.sh

Call sites (cse_install.sh, cse_install_ubuntu.sh, cse_install_mariner.sh)

cse_config.sh

Backward compatibility

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

djsly commented Apr 2, 2026 •

edited by Copilot AI

Loading

`cse_helpers.sh`

Call sites (`cse_install.sh`, `cse_install_ubuntu.sh`, `cse_install_mariner.sh`)

`cse_config.sh`