Skip to content

fix(e2e): skip GPU tests on additional allocation failure error codes#8269

Open
ganeshkumarashok wants to merge 1 commit intomainfrom
aganeshkumar/fix-gpu-e2e-failures
Open

fix(e2e): skip GPU tests on additional allocation failure error codes#8269
ganeshkumarashok wants to merge 1 commit intomainfrom
aganeshkumar/fix-gpu-e2e-failures

Conversation

@ganeshkumarashok
Copy link
Copy Markdown
Contributor

Summary

  • skipTestIfSKUNotAvailableErr previously required HTTP 409 status and only handled SkuNotAvailable and quota OperationNotAllowed errors
  • AllocationFailed returns HTTP 200 (not 409), so after CreateVMSSWithRetry exhausts its 10 retries, the error leaked through as a test failure instead of being skipped — this is the likely cause of intermittent GPU E2E pipeline failures on main
  • Adds handling for AllocationFailed, OverconstrainedZonalAllocationRequest, OverconstrainedAllocationRequest, and ZonalAllocationFailed error codes
  • Removes the StatusCode == 409 gate so non-409 allocation errors (like AllocationFailed at 200) are properly caught

Context

The GPU E2E pipeline has been intermittently failing on main (alternating success/failure across recent commits). GPU VMs (NC6s_v3, NC24ads_A100_v4, NV6ads_A10_v5) have limited regional availability, and Azure returns various allocation error codes depending on the specific failure mode. The pipeline sets SKIP_TESTS_WITH_SKU_CAPACITY_ISSUE=true, but the skip function wasn't catching all allocation-related errors.

Test plan

  • go build ./... passes
  • GPU E2E pipeline stops failing on allocation-related errors (they become skips instead of failures)

The `skipTestIfSKUNotAvailableErr` function previously only handled
`SkuNotAvailable` (409) and quota-exceeded `OperationNotAllowed` (409),
and required HTTP 409 status. This missed several GPU capacity errors:

- `AllocationFailed` returns HTTP 200, not 409, so it was never caught
  by the status code check. After `CreateVMSSWithRetry` exhausts its 10
  retries, the error leaked through as a test failure instead of a skip.
- `OverconstrainedZonalAllocationRequest`, `OverconstrainedAllocationRequest`,
  and `ZonalAllocationFailed` are 409 errors with different error codes
  that weren't handled.

Remove the `StatusCode == 409` gate and add all allocation-related error
codes so GPU tests are properly skipped when capacity is unavailable.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens the GPU E2E harness by broadening the “capacity-related Azure error” detection so allocation failures are skipped (when SKIP_TESTS_WITH_SKU_CAPACITY_ISSUE=true) instead of surfacing as intermittent test failures.

Changes:

  • Removes the StatusCode == 409 gate so allocation-related azcore.ResponseErrors with non-409 status (notably AllocationFailed at HTTP 200) can be handled.
  • Adds skip handling for additional allocation/capacity error codes: AllocationFailed, OverconstrainedZonalAllocationRequest, OverconstrainedAllocationRequest, ZonalAllocationFailed.
  • Clarifies inline comments around the different skip conditions.

Comment on lines +516 to +520
// SKU not available in region (409)
if respErr.ErrorCode == "SkuNotAvailable" {
t.Skip("skipping scenario SKU not available", t.Name(), err)
}
// sometimes the SKU quota is exceeded and we can't do anything. Skip the test in this case.
// Quota exceeded (409)
Copy link

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These comments mention “(409)” but the function no longer checks respErr.StatusCode. Either reintroduce status-code checks for these specific cases or adjust the comments so they don’t assert a status code the code isn’t enforcing.

Copilot uses AI. Check for mistakes.
Comment on lines +532 to +536
// Zonal allocation constraints — no capacity in the requested zone (409)
if respErr.ErrorCode == "OverconstrainedZonalAllocationRequest" ||
respErr.ErrorCode == "OverconstrainedAllocationRequest" ||
respErr.ErrorCode == "ZonalAllocationFailed" {
t.Skip("skipping scenario zonal allocation constrained", t.Name(), err)
Copy link

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment states the zonal allocation constraint errors are “(409)”, but the logic now skips solely by ErrorCode without validating status. Update the comment (or add a status-code guard) to keep documentation consistent with behavior.

Copilot uses AI. Check for mistakes.
Comment on lines 508 to +513
func skipTestIfSKUNotAvailableErr(t testing.TB, err error) {
if !config.Config.SkipTestsWithSKUCapacityIssue {
return
}
var respErr *azcore.ResponseError
if !errors.As(err, &respErr) || respErr.StatusCode != 409 {
if !errors.As(err, &respErr) {
Copy link

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

skipTestIfSKUNotAvailableErr now also skips on allocation and zonal-capacity error codes (not just SKU-not-available). The name is misleading; consider renaming to something like skipTestIfCapacityIssueErr (and updating the call site) to reflect the broader behavior.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants