
OCPBUGS-83281: Fix Gateway cleanup in parallel e2e test workers #31023

Merged
openshift-merge-bot[bot] merged 3 commits into openshift:main from rhamini3:ocpbugs-83281
Apr 24, 2026

@rhamini3
Contributor

@rhamini3 rhamini3 commented Apr 16, 2026

The Gateway API controller tests tracked Gateways in a shared in-memory gateways slice, deleting them during AfterEach cleanup. However, openshift-tests distributes tests across separate parallel worker processes. The annotation-based checkAllTestsDone coordination works correctly because annotations are stored on the cluster-scoped GatewayClass, but the gateways slice is not shared across processes. The process that runs the final AfterEach cleanup has an empty gateways slice, so it deletes the GatewayClass and istiod but never deletes the Gateways created by other processes. This leaves gateway deployments orphaned on the cluster.

As a secondary issue, even when gateways were deleted, the GatewayClass and istiod were removed without waiting for the gateway proxy deployments to be fully cleaned up by GC. Since the deployments have an owner reference to the Gateway (not a finalizer), the cascade deletion is asynchronous, creating a race where gateway pods lose their control plane and crash-loop.

Fix both issues by cleaning up gateways at the individual test level using defer deleteGateway, which deletes the Gateway and waits for its proxy deployment to be removed by GC. Add deleteGateway and waitForGatewayDeploymentDeletion helpers shared by both the controller tests and the upgrade test Teardown. Cleanup errors now hard fail to surface leftover resources immediately rather than causing confusing downstream test failures.

https://redhat.atlassian.net/browse/OCPBUGS-83281

@openshift-merge-bot
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: automatic mode

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 16, 2026
@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Apr 16, 2026
@openshift-ci-robot

@rhamini3: This pull request references Jira Issue OCPBUGS-83281, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Updating the test code to remove the gateways during the gateway tests and keep the cleanup common to all tests

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai

coderabbitai Bot commented Apr 16, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Teardown and cleanup in Gateway API tests were changed: upgrade teardown no longer polls for Gateway removal and now waits for gateway-associated Deployment deletion. The suite no longer enforces ordered/serial execution; shared bulk Gateway deletion was removed in favor of per-test deferred deleteGateway and new helpers.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Upgrade test teardown: test/extended/router/gatewayapi_upgrade.go | Removed the Gateway polling loop and the wait import in teardown. Teardown now calls waitForGatewayDeploymentDeletion(t.oc, t.gatewayName) and fails the test on deployment-cleanup errors instead of best-effort polling. |
| Controller tests & helpers: test/extended/router/gatewayapicontroller.go | Removed g.Ordered/g.Serial from g.Describe; eliminated shared AfterEach bulk Gateway deletion and moved to per-test defer deleteGateway(oc, gw). Added waitForGatewayDeploymentDeletion(oc *exutil.CLI, gatewayName string) error to poll for the Gateway-associated Deployment deletion, and deleteGateway(oc *exutil.CLI, gatewayName string) to delete Gateway and wait for deployment removal (failing the test on errors). |

Sequence Diagram(s)

sequenceDiagram
    participant Test as Test
    participant API as API Server
    participant Controller as Gateway Controller
    participant Kube as Kubernetes (Deployments)

    Test->>API: delete Gateway
    API-->>Test: acknowledge deletion
    Controller->>Kube: reconcile -> delete Deployment for Gateway
    Kube-->>Controller: Deployment deleted
    Test->>Controller: call waitForGatewayDeploymentDeletion(gatewayName)
    Controller->>API: Get Deployment status
    API-->>Controller: NotFound => return success
    Controller-->>Test: success / error

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

🚥 Pre-merge checks | ✅ 9 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (9 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Stable And Deterministic Test Names | ✅ Passed | All test names in modified files are static, deterministic strings with no dynamic information such as timestamps, UUIDs, pod names, or generated identifiers. |
| Test Structure And Quality | ✅ Passed | Test changes implement proper cleanup mechanisms with defer statements, new helper functions with timeout handling, and descriptive error messages following Ginkgo best practices. |
| Microshift Test Compatibility | ✅ Passed | Both the upgrade and controller Gateway API tests include the [apigroup:gateway.networking.k8s.io] tag, causing MicroShift CI to automatically skip them. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | The pull request modifies gateway API test cleanup behavior without introducing multi-node or HA cluster assumptions. Changes are purely about improving test cleanup mechanics and do not affect SNO compatibility. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | The PR modifies only test suite files in test/extended/router/ without introducing any deployment manifests, operator code, or scheduling constraints. |
| Ote Binary Stdout Contract | ✅ Passed | The modified files are Ginkgo test utility files that do not write to stdout. No main(), TestMain(), or process-level initialization functions found. Helper functions use e2e.Logf and e2e.Failf which are Ginkgo-intercepted logging utilities that do not write to stdout. No fmt.Print*, log.Print*, klog, or os.Stdout calls detected. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | PR does not add new Ginkgo e2e test cases, only modifies existing tests and adds utility helper functions without IPv4 assumptions. |
| Title check | ✅ Passed | The title accurately describes the main change: fixing gateway cleanup in parallel e2e test workers by implementing explicit per-test cleanup and centralized deletion logic. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |


@openshift-ci openshift-ci Bot requested review from frobware and jcmoraisjr April 16, 2026 21:04
@openshift-ci-robot

@rhamini3: This pull request references Jira Issue OCPBUGS-83281, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)
Details

In response to this:

Updating the test code to remove the gateways during the gateway tests and keep the cleanup common to all tests

Summary by CodeRabbit

  • Tests
  • Improved gateway API test cleanup and resource deletion verification procedures
  • Enhanced test isolation through explicit per-test cleanup logic
  • Strengthened deployment verification mechanisms for gateway resources


@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
test/extended/router/gatewayapicontroller.go (1)

353-353: Remove unused gateways slice appends.

The gateways slice is no longer used for cleanup since individual tests now delete their own gateways. These appends at lines 353, 392, and 518 are dead code and can be removed along with the slice declaration at line 114.

🧹 Remove dead code

Remove the slice declaration:

 	var (
 		oc                    = exutil.NewCLIWithPodSecurityLevel("gatewayapi-controller", admissionapi.LevelBaseline)
 		err                   error
-		gateways              []string
 		infPoolCRD            = "https://raw.githubusercontent.com/kubernetes-sigs/gateway-api-inference-extension/main/config/crd/bases/inference.networking.k8s.io_inferencepools.yaml"

And remove all appends (lines 353, 392, 518):

 		gw := names.SimpleNameGenerator.GenerateName("gateway-")
-		gateways = append(gateways, gw)
 		_, gwerr := createAndCheckGateway(oc, gw, gatewayClassName, defaultDomain, loadBalancerSupported)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/router/gatewayapicontroller.go` at line 353, Remove the
now-unused gateways slice and its appends: delete the gateways variable
declaration and remove each occurrence of "gateways = append(gateways, gw)" (the
three dead appends in this file). Also remove or adjust any leftover references
to the gateways slice (e.g., cleanup loops or uses) so the file still compiles
and tests delete gateways individually as intended.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@test/extended/router/gatewayapicontroller.go`:
- Line 353: Remove the now-unused gateways slice and its appends: delete the
gateways variable declaration and remove each occurrence of "gateways =
append(gateways, gw)" (the three dead appends in this file). Also remove or
adjust any leftover references to the gateways slice (e.g., cleanup loops or
uses) so the file still compiles and tests delete gateways individually as
intended.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: e9af97cd-ddb4-46e4-aff1-793dffb61f2e

📥 Commits

Reviewing files that changed from the base of the PR and between d7ad0db and d4a9d63.

📒 Files selected for processing (2)
  • test/extended/router/gatewayapi_upgrade.go
  • test/extended/router/gatewayapicontroller.go

Comment thread test/extended/router/gatewayapicontroller.go
Comment on lines +367 to +375
// deleting the gateway at the end of the test
e2e.Logf("Deleting Gateway %q", gw)
err = oc.AdminGatewayApiClient().GatewayV1().Gateways(ingressNamespace).Delete(context.Background(), gw, metav1.DeleteOptions{})
if err != nil && !apierrors.IsNotFound(err) {
	e2e.Logf("Failed to delete Gateway %q: %v", gw, err)
}
if err := waitForGatewayDeploymentDeletion(oc, gw); err != nil {
	e2e.Logf("Gateway deployment for %q was not cleaned up: %v", gw, err)
}
Contributor

Consider putting this in a helper function, maybe DeleteGateway or something and add it as a defer call underneath the creation. I think that's a bit cleaner and it cleans up Gateways on early failures or skips.

Contributor Author

thought about that, if this works in our favour I will go ahead and do that thanks :)

@openshift-merge-bot
Contributor

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

@gcs278
Contributor

gcs278 commented Apr 17, 2026

@rhamini3 looks good 👍 Loki query looks much better.

From Claude analyzing the proxy logs (Gateways only ran for 30-40 seconds, no longer 40 minutes 😄):

Started:

  1. gateway-4md4x-openshift-default-5d86c6646-j694z (IP 10.129.2.147) — started at 23:48:06, ready at 23:48:14
  2. gateway-mhk77-openshift-default-6cf99777d6-dgmp7 (IP 10.131.1.158) — started at 23:48:22
  3. gateway-mhk77-openshift-default-6cf99777d6-vxprg (IP 10.129.2.153) — started at 23:48:22

Gracefully terminated (got SIGTERM, drained, shut down cleanly):

  • gateway-mhk77 pods — both terminated gracefully at ~23:48:52-53 (agent drained, 5s grace period, terminated)
  • gateway-4md4x pod — terminated gracefully at ~23:48:52-53

@rhamini3 rhamini3 changed the title from "[WIP] OCPBUGS-83281: Wait for gateway deletion before terminating test suite" to "OCPBUGS-83281: Wait for gateway deletion before terminating test suite" Apr 17, 2026
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 17, 2026
@openshift-ci-robot

@rhamini3: This pull request references Jira Issue OCPBUGS-83281, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Updating the test code to remove the gateways during the gateway tests and keep the cleanup common to all tests

Summary by CodeRabbit

  • Tests
  • Improved teardown to verify gateway deployment deletion before proceeding
  • Switched from shared bulk cleanup to per-test deferred gateway cleanup for stronger isolation
  • Adjusted test execution behavior and sequencing
  • Added explicit polling and failure logging to strengthen deletion verification and reduce flakiness


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
test/extended/router/gatewayapicontroller.go (1)

114-114: Remove unused gateways slice - dead code.

The gateways slice (line 114) is appended to at lines 361, 391, and 508, but it's never read anywhere. The old AfterEach that used this slice for bulk cleanup was removed. Consider removing both the slice declaration and the append statements.

♻️ Proposed fix to remove dead code

Remove the declaration at line 114:

 	var (
 		oc                    = exutil.NewCLIWithPodSecurityLevel("gatewayapi-controller", admissionapi.LevelBaseline)
 		err                   error
-		gateways              []string
 		infPoolCRD            = "https://raw.githubusercontent.com/kubernetes-sigs/gateway-api-inference-extension/main/config/crd/bases/inference.networking.k8s.io_inferencepools.yaml"
 		managedDNS            bool
 		loadBalancerSupported bool

And remove the append statements:

 		g.By("Create the default Gateway")
 		gw := names.SimpleNameGenerator.GenerateName("gateway-")
-		gateways = append(gateways, gw)
 		defer deleteGateway(oc, gw)

Apply similar changes at lines 391 and 508.

Also applies to: 361-362, 391-392, 508-509

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/router/gatewayapicontroller.go` at line 114, Remove the dead
`gateways` slice and its unused mutations: delete the `gateways []string`
declaration and remove all `gateways = append(gateways, ...)` statements (the
appends around the test blocks where gateways are collected); since the old
AfterEach bulk-cleanup that consumed this slice was removed, ensure no other
code references `gateways` and run tests to verify nothing else depends on it.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/extended/router/gatewayapicontroller.go`:
- Line 109: Remove the unused gateways slice: delete the var gateways
declaration (the slice named "gateways" declared near the g.Describe call) and
remove any dead append calls that add entries to it (the statements like
gateways = append(gateways, ...) found inside the test bodies where
deleteGateway(...) is now used with defer). Keep all test logic and defer
deleteGateway(...) calls intact; simply eliminate the unused variable and its
append usages so there is no dead state tracking left over from the previous
cleanup pattern.

---

Nitpick comments:
In `@test/extended/router/gatewayapicontroller.go`:
- Line 114: Remove the dead `gateways` slice and its unused mutations: delete
the `gateways []string` declaration and remove all `gateways = append(gateways,
...)` statements (the appends around the test blocks where gateways are
collected); since the old AfterEach bulk-cleanup that consumed this slice was
removed, ensure no other code references `gateways` and run tests to verify
nothing else depends on it.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 239b149a-b248-426b-8da0-39ac5aa3dce0

📥 Commits

Reviewing files that changed from the base of the PR and between d4a9d63 and 66588d8.

📒 Files selected for processing (1)
  • test/extended/router/gatewayapicontroller.go

Comment thread test/extended/router/gatewayapicontroller.go
@openshift-merge-bot
Contributor

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

@openshift-trt

openshift-trt Bot commented Apr 17, 2026

Job Failure Risk Analysis for sha: fc61fef

Job Name Failure Risk
pull-ci-openshift-origin-main-e2e-vsphere-ovn-upi Low
[sig-instrumentation] Prometheus [apigroup:image.openshift.io] when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early][apigroup:config.openshift.io] [Suite:openshift/conformance/parallel]
This test has passed 0.00% of 13 runs on release 5.0 [Architecture:amd64 FeatureSet:default Installer:upi JobTier:standard Network:ovn NetworkStack:ipv4 OS:rhcos9 Owner:eng Platform:vsphere Procedure:none SecurityMode:default Topology:ha Upgrade:none] in the last week.

Contributor

@gcs278 gcs278 left a comment

Thanks for this fix!

Comment thread test/extended/router/gatewayapicontroller.go Outdated
Comment thread test/extended/router/gatewayapicontroller.go Outdated
Comment thread test/extended/router/gatewayapicontroller.go Outdated
Comment thread test/extended/router/gatewayapicontroller.go Outdated
Comment thread test/extended/router/gatewayapicontroller.go
Comment thread test/extended/router/gatewayapi_upgrade.go Outdated
Comment thread test/extended/router/gatewayapi_upgrade.go Outdated
Comment on lines +1324 to +1327
// waitForGatewayDeploymentDeletion waits for a Gateway's deployment to be
// deleted. The deployment is cascade-deleted by GC after the Gateway is
// removed, but this is asynchronous. Must complete before removing the
// GatewayClass or istiod to prevent gateway pods from crash-looping.
Contributor

@CodeRabbit can you confirm that this is true with K8S?


@gcs278, yes, the comment is accurate with respect to Kubernetes garbage collection mechanics. Here's a breakdown:

  1. Cascade deletion via owner references: Many objects in Kubernetes link to each other through owner references, which tell the control plane which objects are dependent on others. When a controller (such as the Istio/OSSM gateway controller) creates a Deployment on behalf of a Gateway resource, it typically sets an ownerReference on the Deployment pointing back to the Gateway.

  2. Asynchronous by default (background deletion): In background cascading deletion, Kubernetes deletes the owner object immediately and the garbage collector then deletes the dependents in the background. Kubernetes uses background cascading deletion by default. This is why the deletion is described as asynchronous — the Gateway disappears promptly, but the dependent Deployment is cleaned up later by the GC loop.

  3. Why the wait is necessary: In background deletion, the owner is deleted immediately, then the GC notices orphaned dependents, and the GC deletes dependents in the background. Since this GC loop runs asynchronously, there's a non-trivial window between Gateway deletion and Deployment deletion. If the GatewayClass or istiod were removed before GC cleaned up the gateway deployment, the gateway pods could crash-loop looking for their no-longer-present control plane — which is exactly what the comment describes.

One important caveat: the cascade deletion only works automatically if the Istio/OSSM gateway controller actually sets ownerReferences on the gateway Deployment pointing to the Gateway resource. The Kubernetes mechanics are correct, but whether this owner reference is established is Istio/OSSM controller implementation-specific behavior.

@openshift-merge-bot
Contributor

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

Contributor

Mind also changing the rest of these?

Suggested change
e2e.Failf("Failed to delete GatewayClass %q: %v", gatewayClassName, err)

})
}

func deleteGateway(oc *exutil.CLI, gatewayName string) {
Contributor

nit maybe it's worth being a bit more verbose on this function

Suggested change
func deleteGateway(oc *exutil.CLI, gatewayName string) {
// deleteGatewayAndWaitForCleanup deletes a Gateway and waits for its proxy deployment to be removed by GC.
func deleteGatewayAndWaitForCleanup(oc *exutil.CLI, gatewayName string) {

@openshift-merge-bot
Contributor

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

The Gateway API controller tests tracked Gateways in a shared
in-memory gateways slice, deleting them during AfterEach cleanup.
However, openshift-tests distributes tests across separate parallel
worker processes. The annotation-based checkAllTestsDone coordination
works correctly because annotations are stored on the cluster-scoped
GatewayClass, but the gateways slice is not shared across processes.
The process that runs the final AfterEach cleanup has an empty
gateways slice, so it deletes the GatewayClass and istiod but never
deletes the Gateways created by other processes. This leaves gateway
deployments orphaned on the cluster.

As a secondary issue, even when gateways were deleted, the GatewayClass
and istiod were removed without waiting for the gateway proxy
deployments to be fully cleaned up by GC. Since the deployments have
an owner reference to the Gateway (not a finalizer), the cascade
deletion is asynchronous, creating a race where gateway pods lose
their control plane and crash-loop.

Fix both issues by cleaning up gateways at the individual test level
using defer deleteGateway, which deletes the Gateway and waits for
its proxy deployment to be removed by GC. Add deleteGateway and
waitForGatewayDeploymentDeletion helpers shared by both the controller
tests and the upgrade test Teardown. Cleanup errors now hard fail to
surface leftover resources immediately rather than causing confusing
downstream test failures.

https://redhat.atlassian.net/browse/OCPBUGS-83281

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Grant Spence <gspence@redhat.com>
Co-Authored-By: Ishmam Amin <iamin@redhat.com>
@openshift-ci
Contributor

openshift-ci Bot commented Apr 21, 2026

@gcs278: trigger 4 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-nightly-5.0-e2e-gcp-ovn-upgrade
  • periodic-ci-openshift-release-main-ci-5.0-upgrade-from-stable-4.22-e2e-aws-ovn-upgrade
  • periodic-ci-openshift-release-main-ci-5.0-upgrade-from-stable-4.22-e2e-vsphere-ovn-upgrade
  • periodic-ci-openshift-release-main-nightly-5.0-upgrade-from-stable-4.22-e2e-metal-ipi-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/b57308f0-3dc6-11f1-87a2-33ee2bbb7ff4-0

@gcs278
Contributor

gcs278 commented Apr 21, 2026

Now getting:

error: unable to access the source layer sha256:5ab75c71069351042dcf2c7a6a33f5de09ff630a2fadeecc5d86a38bf4b46838: received unexpected HTTP status: 502 Bad Gateway

/retest

@openshift-trt

openshift-trt Bot commented Apr 22, 2026

Job Failure Risk Analysis for sha: ca41c36

Job Name Failure Risk
pull-ci-openshift-origin-main-e2e-aws-ovn-microshift IncompleteTests
Tests for this run (29) are below the historical average (1443): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-main-e2e-aws-ovn-microshift-serial IncompleteTests
Tests for this run (28) are below the historical average (736): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)

@rhamini3
Contributor Author

/retest-required

@openshift-trt

openshift-trt Bot commented Apr 22, 2026

Job Failure Risk Analysis for sha: ca41c36

Job Name Failure Risk
pull-ci-openshift-origin-main-e2e-aws-ovn-microshift IncompleteTests
Tests for this run (29) are below the historical average (1374): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-main-e2e-aws-ovn-microshift-serial IncompleteTests
Tests for this run (29) are below the historical average (697): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)

@gcs278
Contributor

gcs278 commented Apr 22, 2026

Payload upgrade jobs look good. I don't think we can test OLM to noOLM in this PR targeting 5.0 because the from-stable CI job is from 4.22 (still noOLM).

We will have to test OLM to noOLM when we create the backport.

@gcs278
Contributor

gcs278 commented Apr 22, 2026

unrelated microshift failures
/retest

@openshift-trt

openshift-trt Bot commented Apr 22, 2026

Job Failure Risk Analysis for sha: ca41c36

Job Name Failure Risk
pull-ci-openshift-origin-main-e2e-aws-ovn-microshift-serial IncompleteTests
Tests for this run (28) are below the historical average (669): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)

@openshift-trt

openshift-trt Bot commented Apr 22, 2026

Job Failure Risk Analysis for sha: ca41c36

Job Name Failure Risk
pull-ci-openshift-origin-main-e2e-aws-ovn-microshift IncompleteTests
Tests for this run (29) are below the historical average (1291): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-main-e2e-aws-ovn-microshift-serial IncompleteTests
Tests for this run (28) are below the historical average (662): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)

@gcs278
Contributor

gcs278 commented Apr 22, 2026

openshift/ovn-kubernetes#3150 recently merged
/retest

@melvinjoseph86
Contributor

/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-ovn-upgrade-fips periodic-ci-openshift-release-main-nightly-5.0-e2e-aws-ovn-upgrade-fips openshift-ovn-kubernetes-3126-nightly-4.22-e2e-aws-ovn-upgrade-fips openshift-cloud-credential-operator-1008-nightly-4.22-e2e-aws-ovn-upgrade-fips openshift-cluster-kube-controller-manager-operator-919-nightly-4.22-e2e-aws-ovn-upgrade-fips

@openshift-ci
Contributor

openshift-ci Bot commented Apr 23, 2026

@melvinjoseph86: trigger 2 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-ovn-upgrade-fips
  • periodic-ci-openshift-release-main-nightly-5.0-e2e-aws-ovn-upgrade-fips

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/7add5590-3f08-11f1-9ec0-24af14e19d06-0

@melvinjoseph86
Contributor

/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-azure-custom-dns-techpreview periodic-ci-openshift-release-main-nightly-4.22-e2e-gcp-custom-dns

@openshift-ci
Contributor

openshift-ci Bot commented Apr 23, 2026

@melvinjoseph86: trigger 2 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-nightly-4.22-e2e-azure-custom-dns-techpreview
  • periodic-ci-openshift-release-main-nightly-4.22-e2e-gcp-custom-dns

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/47ecba50-3f0c-11f1-9b39-58f10e5f58e8-0

@melvinjoseph86
Contributor

I am not able to run the e2e tests locally, but the error is not reproducible on the scheduled payload jobs, so I am marking this as verified.
/verified by @mjoseph

openshift-ci-robot added the verified label (Signifies that the PR passed pre-merge verification criteria) on Apr 23, 2026
@openshift-ci-robot

@melvinjoseph86: This PR has been marked as verified by @mjoseph.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-merge-bot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 0d2dab5 and 2 for PR HEAD ca41c36 in total

@gcs278
Contributor

gcs278 commented Apr 23, 2026

/test images

@rhamini3
Contributor Author

/retest

@melvinjoseph86
Contributor

/test images

@openshift-merge-bot
Contributor

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

openshift-merge-bot Bot merged commit eb9821a into openshift:main on Apr 24, 2026
9 of 21 checks passed
@openshift-ci-robot

@rhamini3: Jira Issue Verification Checks: Jira Issue OCPBUGS-83281
✔️ This pull request was pre-merge verified.
✔️ All associated pull requests have merged.
✔️ All associated, merged pull requests were pre-merge verified.

Jira Issue OCPBUGS-83281 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓

@gcs278
Contributor

gcs278 commented Apr 24, 2026

/test images

@gcs278
Contributor

gcs278 commented Apr 24, 2026

lol whoops I didn't refresh github to see that this was merged 😆

@rhamini3
Contributor Author

/cherry-pick release-4.22

@openshift-cherrypick-robot

@rhamini3: new pull request created: #31068

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-merge-robot
Contributor

Fix included in release 5.0.0-0.nightly-2026-04-24-121336

Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • lgtm: Indicates that a PR is ready to be merged.
  • verified: Signifies that the PR passed pre-merge verification criteria

6 participants