fix(e2e): skip GPU tests on additional allocation failure error codes#8269
fix(e2e): skip GPU tests on additional allocation failure error codes#8269ganeshkumarashok wants to merge 1 commit intomainfrom
Conversation
The `skipTestIfSKUNotAvailableErr` function previously only handled `SkuNotAvailable` (409) and quota-exceeded `OperationNotAllowed` (409), and required HTTP 409 status. This missed several GPU capacity errors: - `AllocationFailed` returns HTTP 200, not 409, so it was never caught by the status code check. After `CreateVMSSWithRetry` exhausts its 10 retries, the error leaked through as a test failure instead of a skip. - `OverconstrainedZonalAllocationRequest`, `OverconstrainedAllocationRequest`, and `ZonalAllocationFailed` are 409 errors with different error codes that weren't handled. Remove the `StatusCode == 409` gate and add all allocation-related error codes so GPU tests are properly skipped when capacity is unavailable.
There was a problem hiding this comment.
Pull request overview
This PR hardens the GPU E2E harness by broadening the “capacity-related Azure error” detection so allocation failures are skipped (when SKIP_TESTS_WITH_SKU_CAPACITY_ISSUE=true) instead of surfacing as intermittent test failures.
Changes:
- Removes the
StatusCode == 409gate so allocation-relatedazcore.ResponseErrors with non-409 status (notablyAllocationFailedat HTTP 200) can be handled. - Adds skip handling for additional allocation/capacity error codes:
AllocationFailed,OverconstrainedZonalAllocationRequest,OverconstrainedAllocationRequest,ZonalAllocationFailed. - Clarifies inline comments around the different skip conditions.
| // SKU not available in region (409) | ||
| if respErr.ErrorCode == "SkuNotAvailable" { | ||
| t.Skip("skipping scenario SKU not available", t.Name(), err) | ||
| } | ||
| // sometimes the SKU quota is exceeded and we can't do anything. Skip the test in this case. | ||
| // Quota exceeded (409) |
There was a problem hiding this comment.
These comments mention “(409)” but the function no longer checks respErr.StatusCode. Either reintroduce status-code checks for these specific cases or adjust the comments so they don’t assert a status code the code isn’t enforcing.
| // Zonal allocation constraints — no capacity in the requested zone (409) | ||
| if respErr.ErrorCode == "OverconstrainedZonalAllocationRequest" || | ||
| respErr.ErrorCode == "OverconstrainedAllocationRequest" || | ||
| respErr.ErrorCode == "ZonalAllocationFailed" { | ||
| t.Skip("skipping scenario zonal allocation constrained", t.Name(), err) |
There was a problem hiding this comment.
This comment states the zonal allocation constraint errors are “(409)”, but the logic now skips solely by ErrorCode without validating status. Update the comment (or add a status-code guard) to keep documentation consistent with behavior.
| func skipTestIfSKUNotAvailableErr(t testing.TB, err error) { | ||
| if !config.Config.SkipTestsWithSKUCapacityIssue { | ||
| return | ||
| } | ||
| var respErr *azcore.ResponseError | ||
| if !errors.As(err, &respErr) || respErr.StatusCode != 409 { | ||
| if !errors.As(err, &respErr) { |
There was a problem hiding this comment.
skipTestIfSKUNotAvailableErr now also skips on allocation and zonal-capacity error codes (not just SKU-not-available). The name is misleading; consider renaming to something like skipTestIfCapacityIssueErr (and updating the call site) to reflect the broader behavior.
Summary
skipTestIfSKUNotAvailableErrpreviously required HTTP 409 status and only handledSkuNotAvailableand quotaOperationNotAllowederrorsAllocationFailedreturns HTTP 200 (not 409), so afterCreateVMSSWithRetryexhausts its 10 retries, the error leaked through as a test failure instead of being skipped — this is the likely cause of intermittent GPU E2E pipeline failures on mainAllocationFailed,OverconstrainedZonalAllocationRequest,OverconstrainedAllocationRequest, andZonalAllocationFailederror codesStatusCode == 409gate so non-409 allocation errors (likeAllocationFailedat 200) are properly caughtContext
The GPU E2E pipeline has been intermittently failing on main (alternating success/failure across recent commits). GPU VMs (NC6s_v3, NC24ads_A100_v4, NV6ads_A10_v5) have limited regional availability, and Azure returns various allocation error codes depending on the specific failure mode. The pipeline sets
SKIP_TESTS_WITH_SKU_CAPACITY_ISSUE=true, but the skip function wasn't catching all allocation-related errors.Test plan
go build ./...passes