fix(e2e): skip GPU tests on additional allocation failure error codes by ganeshkumarashok · Pull Request #8269 · Azure/AgentBaker

ganeshkumarashok · 2026-04-09T23:56:40Z

Summary

skipTestIfSKUNotAvailableErr previously required HTTP 409 status and only handled SkuNotAvailable and quota OperationNotAllowed errors
AllocationFailed returns HTTP 200 (not 409), so after CreateVMSSWithRetry exhausts its 10 retries, the error leaked through as a test failure instead of being skipped — this is the likely cause of intermittent GPU E2E pipeline failures on main
Adds handling for AllocationFailed, OverconstrainedZonalAllocationRequest, OverconstrainedAllocationRequest, and ZonalAllocationFailed error codes
Removes the StatusCode == 409 gate so non-409 allocation errors (like AllocationFailed at 200) are properly caught

Context

The GPU E2E pipeline has been intermittently failing on main (alternating success/failure across recent commits). GPU VMs (NC6s_v3, NC24ads_A100_v4, NV6ads_A10_v5) have limited regional availability, and Azure returns various allocation error codes depending on the specific failure mode. The pipeline sets SKIP_TESTS_WITH_SKU_CAPACITY_ISSUE=true, but the skip function wasn't catching all allocation-related errors.

Test plan

go build ./... passes
GPU E2E pipeline stops failing on allocation-related errors (they become skips instead of failures)

The `skipTestIfSKUNotAvailableErr` function previously only handled `SkuNotAvailable` (409) and quota-exceeded `OperationNotAllowed` (409), and required HTTP 409 status. This missed several GPU capacity errors: - `AllocationFailed` returns HTTP 200, not 409, so it was never caught by the status code check. After `CreateVMSSWithRetry` exhausts its 10 retries, the error leaked through as a test failure instead of a skip. - `OverconstrainedZonalAllocationRequest`, `OverconstrainedAllocationRequest`, and `ZonalAllocationFailed` are 409 errors with different error codes that weren't handled. Remove the `StatusCode == 409` gate and add all allocation-related error codes so GPU tests are properly skipped when capacity is unavailable.

Copilot

Pull request overview

This PR hardens the GPU E2E harness by broadening the “capacity-related Azure error” detection so allocation failures are skipped (when SKIP_TESTS_WITH_SKU_CAPACITY_ISSUE=true) instead of surfacing as intermittent test failures.

Changes:

Removes the StatusCode == 409 gate so allocation-related azcore.ResponseErrors with non-409 status (notably AllocationFailed at HTTP 200) can be handled.
Adds skip handling for additional allocation/capacity error codes: AllocationFailed, OverconstrainedZonalAllocationRequest, OverconstrainedAllocationRequest, ZonalAllocationFailed.
Clarifies inline comments around the different skip conditions.

Copilot · 2026-04-09T23:59:37Z

e2e/vmss.go

+	// SKU not available in region (409)
 	if respErr.ErrorCode == "SkuNotAvailable" {
 		t.Skip("skipping scenario SKU not available", t.Name(), err)
 	}
-	// sometimes the SKU quota is exceeded and we can't do anything. Skip the test in this case.
+	// Quota exceeded (409)


These comments mention “(409)” but the function no longer checks respErr.StatusCode. Either reintroduce status-code checks for these specific cases or adjust the comments so they don’t assert a status code the code isn’t enforcing.

Copilot · 2026-04-09T23:59:38Z

e2e/vmss.go

+	// Zonal allocation constraints — no capacity in the requested zone (409)
+	if respErr.ErrorCode == "OverconstrainedZonalAllocationRequest" ||
+		respErr.ErrorCode == "OverconstrainedAllocationRequest" ||
+		respErr.ErrorCode == "ZonalAllocationFailed" {
+		t.Skip("skipping scenario zonal allocation constrained", t.Name(), err)


This comment states the zonal allocation constraint errors are “(409)”, but the logic now skips solely by ErrorCode without validating status. Update the comment (or add a status-code guard) to keep documentation consistent with behavior.

Copilot · 2026-04-09T23:59:38Z

e2e/vmss.go

 func skipTestIfSKUNotAvailableErr(t testing.TB, err error) {
 	if !config.Config.SkipTestsWithSKUCapacityIssue {
 		return
 	}
 	var respErr *azcore.ResponseError
-	if !errors.As(err, &respErr) || respErr.StatusCode != 409 {
+	if !errors.As(err, &respErr) {


skipTestIfSKUNotAvailableErr now also skips on allocation and zonal-capacity error codes (not just SKU-not-available). The name is misleading; consider renaming to something like skipTestIfCapacityIssueErr (and updating the call site) to reflect the broader behavior.

Copilot AI review requested due to automatic review settings April 9, 2026 23:56

ganeshkumarashok requested review from AbelHu, Devinwong, SriHarsha001, awesomenix, calvin197, cameronmeissner, djsly, junjiezhang1997, lilypan26, mxj220, pdamianov-dev, phealy, r2k1, sulixu, surajssd, timmy-wright and zachary-bailey as code owners April 9, 2026 23:56

ganeshkumarashok temporarily deployed to test April 9, 2026 23:56 — with GitHub Actions Inactive

Copilot started reviewing on behalf of ganeshkumarashok April 9, 2026 23:57 View session

Copilot AI reviewed Apr 9, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(e2e): skip GPU tests on additional allocation failure error codes#8269

fix(e2e): skip GPU tests on additional allocation failure error codes#8269
ganeshkumarashok wants to merge 1 commit intomainfrom
aganeshkumar/fix-gpu-e2e-failures

ganeshkumarashok commented Apr 9, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 9, 2026

Uh oh!

Copilot AI Apr 9, 2026

Uh oh!

Copilot AI Apr 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

ganeshkumarashok commented Apr 9, 2026

Summary

Context

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants