Skip to content

fix(cse): prevent CSE timeout overrun with per-op budget and pre-command guard#8230

Merged
djsly merged 23 commits intomainfrom
djsly/37384340
Apr 10, 2026
Merged

fix(cse): prevent CSE timeout overrun with per-op budget and pre-command guard#8230
djsly merged 23 commits intomainfrom
djsly/37384340

Conversation

@djsly
Copy link
Copy Markdown
Collaborator

@djsly djsly commented Apr 2, 2026

Linux CSE runs can exceed the VM provisioner's ~16-minute client window when downloads retry indefinitely or a long-running operation starts near the deadline. This PR adds a global timeout guard and per-download wall-clock budget to bound execution time.

cse_helpers.sh

  • _retrycmd_internal: Add pre-attempt check_cse_timeout guard (gated on CSE_STARTTIME_SECONDS being set — no-op and no noise during VHD build). Cap per-attempt effectiveTimeout to min(timeoutVal, remainingCseTime) so a single attempt cannot outlive the provisioning window.
  • _retry_file_curl_internal: Add optional maxBudget (4th arg, default 0). When > 0, caps effectiveTimeout to min(timeout, maxBudget - elapsed) per attempt, bounding total download wall-clock to the budget regardless of retry count. Loop restructured so curl fires on every iteration (previously the last attempt was skipped).
  • retrycmd_get_tarball: Backward-compatible signature detection — if 3rd arg is non-numeric (old callers: <retries> <wait_sleep> <tarball> <url>), defaults timeout to 60s. New callers pass explicit <timeout> as 3rd arg.
  • retrycmd_curl_file: Exposes new optional max_budget_s 6th arg (default 0).

Call sites (cse_install.sh, cse_install_ubuntu.sh, cse_install_mariner.sh)

Pass 300s per-operation budget to all provisioning-time download sites (credential-provider, oras, TLS bootstrap, CNI ×2, crictl, k8s binaries, nvidia GPG key, containerd, runc, nvidia repo). Without budget: 120 retries × 60s = ~2h10m theoretical max per download; with budget: capped at 5 min total.

cse_config.sh

Reduce nvidia-smi per-try timeout from 300s → 30s at 3 call sites (nvidia-smi is a status check, not a long-running init).

Backward compatibility

  • max_budget_s defaults to 0 (no cap); VHD build callers unaffected.
  • check_cse_timeout guards are skipped when CSE_STARTTIME_SECONDS is unset, preserving existing VHD build behavior exactly.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR aims to prevent Linux CSE runs from exceeding the VM provisioner’s ~16-minute client window by adding (1) a pre-attempt global timeout guard and (2) a per-download wall-clock budget, then wiring those controls into provisioning-time download call sites and updating ShellSpec coverage accordingly.

Changes:

  • Add a pre-command check_cse_timeout guard in _retrycmd_internal and a per-operation maxBudget in _retry_file_curl_internal (propagated via retrycmd_curl_file / retrycmd_get_tarball).
  • Update provisioning-time download call sites to pass a 300s per-operation budget and adjust retrycmd_get_tarball call signature to include an explicit timeout.
  • Reduce nvidia-smi per-try timeout from 300s to 30s and update/add ShellSpec expectations for the new retry helper behaviors.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
spec/parts/linux/cloud-init/artifacts/cse_retry_helpers_spec.sh Updates parameter tables and adds tests for pre-attempt global timeout and per-operation budget exit behavior.
spec/parts/linux/cloud-init/artifacts/cse_install_spec.sh Updates expected retrycmd_get_tarball invocation signature to include timeout + budget.
parts/linux/cloud-init/artifacts/ubuntu/cse_install_ubuntu.sh Adds 300s per-operation budget to provisioning-time curl downloads (NVIDIA key, containerd, runc).
parts/linux/cloud-init/artifacts/mariner/cse_install_mariner.sh Adds 300s per-operation budget to NVIDIA repo file download.
parts/linux/cloud-init/artifacts/cse_install.sh Adds explicit timeout + 300s per-operation budgets to provisioning-time tarball/curl downloads (credential provider, oras, CNI, crictl, k8s).
parts/linux/cloud-init/artifacts/cse_helpers.sh Implements pre-attempt global timeout guard and per-operation budget; changes retrycmd_get_tarball signature.
parts/linux/cloud-init/artifacts/cse_config.sh Lowers nvidia-smi retry timeout from 300s to 30s at 3 call sites.

Copilot AI review requested due to automatic review settings April 7, 2026 21:52
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

parts/linux/cloud-init/artifacts/cse_helpers.sh:347

  • The per-operation maxBudget check happens before starting timeout $timeout curl ..., but the curl attempt itself can still run longer than the remaining budget (and then exceed the budget by up to timeout + sleep). If maxBudget is intended to be a hard upper bound (“total wall-clock seconds across all retries”), cap the per-attempt timeout to min(timeout, maxBudget - elapsed) and/or re-check the budget immediately after each attempt as well.
        # Check per-operation budget if set -- prevents a single download from consuming the entire CSE window
        if [ "${maxBudget}" -gt 0 ]; then
            local opElapsed
            opElapsed=$(( $(date +%s) - opStartTime ))
            if [ "$opElapsed" -ge "$maxBudget" ]; then
                echo "Operation budget of ${maxBudget}s exceeded after ${opElapsed}s, exiting early." >&2
                return 2
            fi
        fi
        # check if global cse timeout is approaching
        if ! check_cse_timeout; then
            echo "CSE timeout approaching, exiting early." >&2
            return 2
        else
            if [ "$i" -gt 1 ]; then
                sleep $waitSleep
            fi
            timeout $timeout curl -fsSLv $url -o $filePath > $CURL_OUTPUT 2>&1
            if [ "$?" -ne 0 ]; then

Copilot AI review requested due to automatic review settings April 9, 2026 20:28
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

djsly and others added 22 commits April 9, 2026 19:59
…and guard

Remove hardcoded 60s timeout from retrycmd_get_tarball; caller now
passes timeout as 3rd positional arg (matches retrycmd_curl_file style).

Add optional max_budget_s (6th arg, default 0) to retrycmd_get_tarball
and retrycmd_curl_file. When > 0, _retry_file_curl_internal tracks
wall-clock time and exits early (return 2) once the budget is consumed,
bounding any single download to 5 min at provisioning time regardless
of retry count.

Add pre-command check_cse_timeout guard at the TOP of _retrycmd_internal
loop. Previously the guard only fired after a failed command, meaning a
new 300-600s operation could START at minute 12:59 and run to ~18 min.
The guard now prevents starting any new attempt when >780s have elapsed
since CSE_STARTTIME_SECONDS.

Pass 300s budget to all 11 provisioning-time download call sites:
  cse_install.sh (credential-provider, oras, secure-TLS, CNI x2, crictl, k8s)
  ubuntu/cse_install_ubuntu.sh (nvidia GPG key, containerd, runc)
  mariner/cse_install_mariner.sh (nvidia repo file)

Reduce nvidia-smi per-try timeout 300s to 30s in cse_config.sh (3 sites).
nvidia-smi is a CLI status check, not a long-running driver operation;
the 300s value was a copy-paste from the GPU driver install command.

Update spec tests: fix parameter tables, add budget-exceeded test and
pre-command CSE timeout test in cse_retry_helpers_spec.sh; update
expected call signature in cse_install_spec.sh.
The check_cse_timeout warning message was written to stdout, which
pollutes pipes. In the 2004fips VHD build, the pipeline:

  retrycmd_if_failure 5 10 1200 yes | ua enable fips-updates

sends stdout of the left side as stdin to ua enable fips-updates.
The warning 'CSE_STARTTIME_SECONDS is not set' was read by ua as the
answer to the 'Are you sure? (y/N)' prompt instead of 'y' from yes,
causing 'Operation cancelled by user' and exit 184.

Fix: redirect the warning to stderr (>&2), consistent with the error
message that was already on stderr. Update the ShellSpec test to match.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…args

- _retrycmd_internal: cap per-attempt timeout to remaining CSE budget
  (780s - elapsed) so a long timeout cannot overrun the provisioning
  window even when per-attempt timeouts are large
- _retry_file_curl_internal: restructure loop so curl runs BEFORE the
  check+exhaustion logic, fixing a bug where retries=N only executed
  N-1 curl attempts (and retries=1 executed zero)
- retrycmd_get_tarball spec: remove extra positional arg that shifted
  tarball path into URL position

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Replace [[ $3 =~ ^[0-9]+$ ]] with POSIX-compatible case statement
  in retrycmd_get_tarball to fix SC3010 shellcheck error
- Fix _retry_file_curl_internal: keep pre-check at top of loop for
  efficiency (skip curl if file already valid), add final check after
  last curl attempt so retries=N always executes N curl attempts

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Replace broken date mock (export -f date doesn't work in POSIX sh)
  with real-time approach: timeout mock sleeps 2s, budget=1s
- Set CSE_STARTTIME_SECONDS in BeforeEach to prevent check_cse_timeout
  warnings in all retry tests
- Explicitly unset CSE_STARTTIME_SECONDS in the check_cse_timeout test
  that verifies the unset behavior

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…internal

Same pattern as _retrycmd_internal: when maxBudget > 0, effectiveTimeout
is capped to min(timeout, maxBudget - elapsed) so a single curl attempt
cannot overrun the documented per-operation budget.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ARTTIME_SECONDS guard

Agent-Logs-Url: https://github.com/Azure/AgentBaker/sessions/0bcd6f17-76f8-4ef1-ba7a-cf3a6f43d658

Co-authored-by: djsly <4981802+djsly@users.noreply.github.com>
The Test_Ubuntu2404_NPD_Basic test was flaky because the 6-minute
PollUntilContextTimeout wasn't enough when Kubernetes API server rate
limiting slowed down node condition polling. Both E2E stages hit
'context deadline exceeded' from the client rate limiter after ~7.5min.

Increase to 10 minutes — still well within the 35-minute overall test
budget — to give NPD more time to detect the simulated corruption under
load.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
NPD polls for filesystem corruption every 5 minutes. Under load, the
test can miss two full cycles (>10 min), causing flaky timeouts. Instead
of bumping the timeout further, restart NPD immediately after simulating
corruption to trigger an instant journal scan. This makes detection
deterministic regardless of where we land in the polling cycle.

Also reduce the safety-net timeout from 10m to 5m since detection should
now happen within seconds of the restart.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
NPD's custom plugin monitor polls every 5 minutes. Restarting the service
just resets the polling timer instead of triggering immediate detection.
Use a 12-minute timeout to safely cover 2 full polling cycles plus margin
for API server latency under load.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The one-shot 'systemd-run --unit=docker' approach writes a single journal
entry that can age out of NPD's time-windowed lookback between check
cycles, causing detection to permanently fail. Replace with a continuous
loop (echo every 30s) so every NPD check cycle finds fresh entries.

Also adds diagnostic logging:
- Dumps the NPD plugin config for visibility into what it checks
- Verifies journal entries exist after simulation starts

Reduces timeout from 12min to 8min since continuous simulation means
detection should happen within the first check cycle.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…debug

The NPD plugin runs check_fs_corruption.sh which may filter by
SYSLOG_IDENTIFIER rather than _SYSTEMD_UNIT. Set SyslogIdentifier=dockerd
on the transient service to match Docker daemon journal metadata.

Added diagnostics that dump the actual check_fs_corruption.sh script
content and run it manually to verify detection works, giving us full
visibility into why NPD fails to detect the simulated corruption.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Root cause found via diagnostics: check_fs_corruption.sh checks
'journalctl -u containerd --since 5 min ago', NOT docker. The test
was writing to docker.service journal which NPD never checks.

Fix:
- Move injector process into containerd's cgroup via cgroup.procs
  so journal daemon attributes entries to _SYSTEMD_UNIT=containerd.service
- Use systemd-cat to write to journal from within containerd's cgroup
- Update expected message from 'Docker journal' to 'containerd journal'
- Keep continuous injection (every 30s) for lookback window coverage

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The cgroup.procs approach to inject journal entries under
containerd.service does not work on Ubuntu 24.04 with cgroup v2 —
systemd protects service cgroups from external process migration.

New approach: replace check_fs_corruption.sh with a script that always
reports corruption. This reliably tests the full NPD pipeline
(plugin → condition → node status) without depending on journal
injection mechanics.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Gate check_cse_timeout in _retry_file_curl_internal on
  CSE_STARTTIME_SECONDS being set (avoids noisy warnings in VHD build)
- Re-check/cap budget after sleep in _retry_file_curl_internal so
  the per-operation budget is a hard wall-clock cap
- Cap sleep duration to remaining budget to avoid overshooting
- Use sentinel file instead of shell function in ShellSpec test
  (timeout can't invoke shell functions)
- Only apply 300s credential provider budget during CSE runs,
  use unlimited during VHD build to avoid flakiness
- Log both stdout and stderr in NPD diagnostic output

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ShellSpec's Assert runs a shell command — 'not' is not a valid command.
Use '[ ! -e ]' which is POSIX-compatible and available everywhere.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The budget re-check after sleep in _retry_file_curl_internal and the
conditional credential provider budget introduced in the review comment
fixes caused 14 E2E test failures. Reverting to the simpler approach
that passed 108/108 E2E tests while keeping the safe CSE_STARTTIME_SECONDS
gating on check_cse_timeout.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…efer, docs

Thread 1: Fix malformed Logf format string (missing stdout label)
Thread 2: Add defer cleanup to restore check_fs_corruption.sh after NPD test
Thread 3: Extract duplicated 780 to CSE_MAX_DURATION_SECONDS shared constant
Thread 4: Make _retry_file_curl_internal and retrycmd_curl_file vars local,
           default maxBudget to 0 when omitted
Thread 5: Clarify timeout_seconds must be integer (no duration suffixes)

Also restores budget re-check after sleep and conditional cred_budget —
the E2E failures (14→46 after revert) confirmed infra flakiness, not code.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Review comments addressed:
- Remove premature inline restore that ran BEFORE NPD poll;
  keep only the defer cleanup which runs at function exit
- Use CSE_MAX_DURATION_SECONDS env var (default 780) in both
  check_cse_timeout and _retrycmd_internal to avoid drift
- Mark all _retry_file_curl_internal and retrycmd_curl_file
  vars as local; default maxBudget to 0 when omitted
- Document timeout_seconds must be integer (no suffixes)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… and retrycmd_curl_file

Agent-Logs-Url: https://github.com/Azure/AgentBaker/sessions/ed7889f6-7cc8-4ef0-a1cc-e8e8774c3df2

Co-authored-by: djsly <4981802+djsly@users.noreply.github.com>
…etection

WaitUntilNodeReady used a bare Kubernetes Watch that was prone to
missing node registration events due to:
- Watch starting from 'now' (missing already-registered nodes)
- Server-side watch timeout closing the channel after ~5-10 min
- No reconnection logic for dropped watch connections

Replace with wait.PollUntilContextTimeout using periodic List calls,
matching the pattern already used by WaitUntilPodRunningWithRetry in
the same file. Polls every 10s for up to 10 minutes.

This eliminates the false-negative 'haven't appeared in k8s API
server' failures where nodes actually registered successfully but
the watch missed the event.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.

Remove early return after first non-Ready matching node so the poll
loop checks all nodes with the VMSS prefix. Prevents missing a Ready
node when multiple nodes share the same prefix.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants