
fix(linux): add test_variable to cse_cmd.sh#8698

Draft
jumpinthefire wants to merge 1 commit into main from jumpinthefire/ci-test

@jumpinthefire


What this PR does / why we need it:
This is a test.

Copilot AI review requested due to automatic review settings June 12, 2026 18:33

Copilot AI left a comment

Contributor


Pull request overview

This PR updates the Linux CSE command template (cse_cmd.sh) to introduce an additional environment variable in the generated provisioning command.

Changes:

  • Add TEST_VARIABLE to the set of variables emitted into the CSE command environment.

Comment on lines 201 to 203
PRE_PROVISION_ONLY="{{GetPreProvisionOnly}}"
CSE_TIMEOUT="{{GetCSETimeout}}"
SKIP_WAAGENT_HOLD="{{GetSkipWaAgentHold}}"
TEST_VARIABLE="{{GetSkipWaAgentHold}}"
/usr/bin/nohup /bin/bash -c "/bin/bash /opt/azure/containers/provision_start.sh"
@jumpinthefire changed the title from "fix: Add TEST_VARIABLE to cse_cmd.sh" to "fix(linux): add test_variable to cse_cmd.sh" on Jun 12, 2026
@aks-node-assistant

Contributor

🕵️ AgentBaker Linux Gate Detective

Build 167833671 (Run AgentBaker E2E) FAILED: known shared-cluster infra outage, not caused by this PR.

TL;DR

Matches existing wiki signature kubenet-v5-node-not-ready-scriptless on shared westus3 clusters abe2e-kubenet-v5-150ee + abe2e-azure-networkisolated-v3-d6cc9. Repair item #38403603 already tracking the kubenet-v5 fleet/apiserver issue.

This draft PR adds a single TEST_VARIABLE="{{GetSkipWaAgentHold}}" line to parts/linux/cloud-init/artifacts/cse_cmd.sh — a one-line CSE-time env-var assignment that cannot prevent kubelet registration.
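As a rough illustration of why an extra assignment like this is inert, the sketch below assumes the rendered template behaves like a run of NAME=value assignments placed in front of a final command. That is an assumption about how the lines are joined, not something shown in this PR. The `entrypoint` string is a hypothetical stand-in for the real provisioning script and reads only SKIP_WAAGENT_HOLD.

```shell
#!/bin/sh
# Hypothetical stand-in for the provisioning entrypoint: it reads only
# SKIP_WAAGENT_HOLD and ignores every other variable in its environment.
entrypoint='echo "skip_hold=${SKIP_WAAGENT_HOLD:-unset}"'

# Baseline: only the variable the entrypoint actually consumes.
baseline=$(SKIP_WAAGENT_HOLD="true" sh -c "$entrypoint")

# Same call with an extra, unread variable added to the environment prefix.
with_extra=$(SKIP_WAAGENT_HOLD="true" TEST_VARIABLE="true" sh -c "$entrypoint")

echo "$baseline"
echo "$with_extra"
```

Both invocations produce the same output here because nothing in the stand-in reads TEST_VARIABLE. Whether the real entrypoint is equally indifferent depends on its contents, which this sketch does not inspect.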

3-level RCA

1. Surface symptom — 209 failures in Run AgentBaker E2E (414 tests, 97 skipped), every failed scenario terminates at:
kube.go:195 🔴 FAIL: "<vmss>" haven't appeared in k8s API server: context deadline exceeded after 600s, preceded by kube.go:166 error listing nodes: client rate limiter Wait returned an error: context deadline exceeded.

2. Corroboration — Failures span shared clusters abe2e-kubenet-v5-150ee and abe2e-azure-networkisolated-v3-d6cc9 (rg abe2e-westus3). Uniform across distros and bootstrap modes (default + scriptless_nbc). Same pattern reproduced on 6 unrelated recent PRs (#8104, #8294, #8509, #8659, #8667, #8694) plus TME pipeline (def 451455) builds 167783506 / 167788753 / 167793966 on cluster abe2e-kubenet-v5-e7055 (westus) — confirms apiserver-side throttling/contention, not per-cluster or per-VHD issue.

3. Root-cause challenge — Strongest alternative: PR-caused CSE regression. Why less likely:

  • Diff is a single-line addition of TEST_VARIABLE="{{GetSkipWaAgentHold}}" to cse_cmd.sh, placed immediately before the final /usr/bin/nohup /bin/bash …/provision_start.sh line, so it's effectively dead code: bash sets the var, then immediately invokes the existing CSE entrypoint with no behavior change.
  • The {{GetSkipWaAgentHold}} template expression was expanded successfully at VHD build time (the build succeeded), and the same expression is already used on the line above for SKIP_WAAGENT_HOLD.
  • Failure point is pre-kubelet-registration on the apiserver, not at CSE exit; CSE-side regression would surface as CSE exit codes or systemd-failed validator hits, not kube.go:195 timeouts.
  • Branch jumpinthefire/ci-test and title "Add TEST_VARIABLE" indicate this is an intentional CI experiment, not a real fix.
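The surface symptom in step 1 is a wait-for-node loop hitting its deadline. A minimal sketch of that pattern follows. It is not the kube.go implementation: `wait_for_node` and `list_nodes` are hypothetical names, and the short timeouts are for illustration only (the log above reports 600s).

```shell
#!/bin/sh
# wait_for_node NAME TIMEOUT_SECONDS INTERVAL_SECONDS
# Polls list_nodes (hypothetical) until NAME is listed or the deadline passes.
wait_for_node() {
  name=$1
  timeout=$2
  interval=$3
  deadline=$(( $(date +%s) + $timeout ))
  while [ "$(date +%s)" -lt "$deadline" ]; do
    # A failed or empty listing is treated the same as "node not present yet".
    if list_nodes 2>/dev/null | grep -qxF -- "$name"; then
      return 0
    fi
    sleep "$interval"
  done
  echo "FAIL: \"$name\" did not appear before the ${timeout}s deadline" >&2
  return 1
}

# Hypothetical stand-in for a node listing; a real harness would query the API server.
list_nodes() { printf '%s\n' "node-a" "node-b"; }

found="no"
missing="no"
wait_for_node "node-b" 2 1 && found="yes"
wait_for_node "node-missing" 2 1 || missing="timed out"
```

In a loop of this shape, a listing call that keeps failing (for example under rate limiting) ends in the same timeout as a node that never registers, which is why the failure message alone does not point at the node's CSE contents.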

Classification

  • Test infrastructure / shared-cluster fleet stress (not PR-caused)
  • Wiki signature: kubenet-v5-node-not-ready-scriptless (Count → 12 distinct builds incl. this one)
  • Confidence: High that this is the known infra outage; High that the PR's trivial diff is unrelated.

Recommended next action

  • For this draft PR: the change is a no-op CI probe — safe to leave or rebase once shared clusters recover. No code change required for the gate to pass.
  • Owner of the underlying issue: AgentBaker E2E test-infra (repair item #38403603).


Posted by Clawpilot AgentBaker Linux Gate Detective Watcher.


2 participants