OCPBUGS-83281: Fix Gateway cleanup in parallel e2e test workers by rhamini3 · Pull Request #31023 · openshift/origin

rhamini3 · 2026-04-16T21:03:45Z

The Gateway API controller tests tracked Gateways in a shared in-memory gateways slice, deleting them during AfterEach cleanup. However, openshift-tests distributes tests across separate parallel worker processes. The annotation-based checkAllTestsDone coordination works correctly because annotations are stored on the cluster-scoped GatewayClass, but the gateways slice is not shared across processes. The process that runs the final AfterEach cleanup has an empty gateways slice, so it deletes the GatewayClass and istiod but never deletes the Gateways created by other processes. This leaves gateway deployments orphaned on the cluster.

As a secondary issue, even when gateways were deleted, the GatewayClass and istiod were removed without waiting for the gateway proxy deployments to be fully cleaned up by GC. Since the deployments have an owner reference to the Gateway (not a finalizer), the cascade deletion is asynchronous, creating a race where gateway pods lose their control plane and crash-loop.

Fix both issues by cleaning up gateways at the individual test level using defer deleteGateway, which deletes the Gateway and waits for its proxy deployment to be removed by GC. Add deleteGateway and waitForGatewayDeploymentDeletion helpers shared by both the controller tests and the upgrade test Teardown. Cleanup errors now hard fail to surface leftover resources immediately rather than causing confusing downstream test failures.

https://redhat.atlassian.net/browse/OCPBUGS-83281
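
To make the root cause described above concrete, here is a minimal, self-contained Go sketch. It is a simulation, not the test suite's code: `cluster`, `worker`, `createTracked`, `bulkCleanup`, and `runWithDefer` are hypothetical names, a map stands in for the API server, and each `worker` value stands in for one openshift-tests worker process with its own private slice.

```go
package main

import (
	"fmt"
	"sort"
)

// cluster stands in for shared API-server state that every worker can see.
type cluster struct {
	gateways map[string]bool
}

// worker stands in for one worker process. Its tracked slice is private,
// the way a package-level variable is private to a single process.
type worker struct {
	tracked []string
}

// createTracked creates a Gateway and records it only in this worker's slice.
func (w *worker) createTracked(c *cluster, name string) {
	c.gateways[name] = true
	w.tracked = append(w.tracked, name)
}

// bulkCleanup mimics the old shared cleanup: it deletes only the Gateways
// that this particular worker tracked.
func (w *worker) bulkCleanup(c *cluster) {
	for _, name := range w.tracked {
		delete(c.gateways, name)
	}
	w.tracked = nil
}

// runWithDefer mimics the per-test pattern: the test that creates a Gateway
// also deletes it, so no cross-process bookkeeping is needed.
func runWithDefer(c *cluster, name string) {
	c.gateways[name] = true
	defer func() { delete(c.gateways, name) }()
	// ... the test body would run here ...
}

// leftovers returns the Gateways still present, sorted for stable output.
func leftovers(c *cluster) []string {
	out := make([]string, 0, len(c.gateways))
	for name := range c.gateways {
		out = append(out, name)
	}
	sort.Strings(out)
	return out
}

func main() {
	c := &cluster{gateways: map[string]bool{}}
	a, b, last := &worker{}, &worker{}, &worker{}
	a.createTracked(c, "gateway-a")
	b.createTracked(c, "gateway-b")
	last.bulkCleanup(c) // the final worker tracked nothing itself
	fmt.Println("after bulk cleanup:", leftovers(c))

	c2 := &cluster{gateways: map[string]bool{}}
	runWithDefer(c2, "gateway-a")
	runWithDefer(c2, "gateway-b")
	fmt.Println("after per-test defer:", leftovers(c2))
}
```

In the first scenario the worker that performs the final cleanup has an empty slice, so both Gateways remain; in the second, each Gateway is removed by the test that created it.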

openshift-merge-bot · 2026-04-16T21:03:48Z

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: automatic mode

openshift-ci-robot · 2026-04-16T21:03:52Z

@rhamini3: This pull request references Jira Issue OCPBUGS-83281, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (5.0.0) matches configured target version for branch (5.0.0)
bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Updating the test code to remove the gateways during the gateway tests and keep the cleanup common to all tests

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

coderabbitai · 2026-04-16T21:04:14Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Walkthrough

Teardown and cleanup in Gateway API tests were changed: upgrade teardown no longer polls for Gateway removal and now waits for gateway-associated Deployment deletion. The suite no longer enforces ordered/serial execution; shared bulk Gateway deletion was removed in favor of per-test deferred deleteGateway and new helpers.

Changes

Cohort / File(s)	Summary
Upgrade test teardown `test/extended/router/gatewayapi_upgrade.go`	Removed the Gateway polling loop and the `wait` import in teardown. Teardown now calls `waitForGatewayDeploymentDeletion(t.oc, t.gatewayName)` and fails the test on deployment-cleanup errors instead of best-effort polling.
Controller tests & helpers `test/extended/router/gatewayapicontroller.go`	Removed `g.Ordered`/`g.Serial` from `g.Describe`; eliminated shared `AfterEach` bulk Gateway deletion and moved to per-test `defer deleteGateway(oc, gw)`. Added `waitForGatewayDeploymentDeletion(oc *exutil.CLI, gatewayName string) error` to poll for the Gateway-associated Deployment deletion, and `deleteGateway(oc *exutil.CLI, gatewayName string)` to delete the Gateway and wait for deployment removal (failing the test on errors).

Sequence Diagram(s)

sequenceDiagram
    participant Test as Test
    participant API as API Server
    participant Controller as Gateway Controller
    participant Kube as Kubernetes (Deployments)

    Test->>API: delete Gateway
    API-->>Test: acknowledge deletion
    Controller->>Kube: reconcile -> delete Deployment for Gateway
    Kube-->>Controller: Deployment deleted
    Test->>Controller: call waitForGatewayDeploymentDeletion(gatewayName)
    Controller->>API: Get Deployment status
    API-->>Controller: NotFound => return success
    Controller-->>Test: success / error

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

🚥 Pre-merge checks | ✅ 9 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (9 passed)

Check name	Status	Explanation
Stable And Deterministic Test Names	✅ Passed	All test names in modified files are static, deterministic strings with no dynamic information such as timestamps, UUIDs, pod names, or generated identifiers.
Test Structure And Quality	✅ Passed	Test changes implement proper cleanup mechanisms with defer statements, new helper functions with timeout handling, and descriptive error messages following Ginkgo best practices.
Microshift Test Compatibility	✅ Passed	Both the upgrade and controller Gateway API tests include the [apigroup:gateway.networking.k8s.io] tag, causing MicroShift CI to automatically skip them.
Single Node Openshift (Sno) Test Compatibility	✅ Passed	The pull request modifies gateway API test cleanup behavior without introducing multi-node or HA cluster assumptions. Changes are purely about improving test cleanup mechanics and do not affect SNO compatibility.
Topology-Aware Scheduling Compatibility	✅ Passed	The PR modifies only test suite files in test/extended/router/ without introducing any deployment manifests, operator code, or scheduling constraints.
Ote Binary Stdout Contract	✅ Passed	The modified files are Ginkgo test utility files that do not write to stdout. No main(), TestMain(), or process-level initialization functions found. Helper functions use e2e.Logf and e2e.Failf which are Ginkgo-intercepted logging utilities that do not write to stdout. No fmt.Print, log.Print, klog, or os.Stdout calls detected.
Ipv6 And Disconnected Network Test Compatibility	✅ Passed	PR does not add new Ginkgo e2e test cases, only modifies existing tests and adds utility helper functions without IPv4 assumptions.
Title check	✅ Passed	The title accurately describes the main change: fixing gateway cleanup in parallel e2e test workers by implementing explicit per-test cleanup and centralized deletion logic.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

openshift-ci-robot · 2026-04-16T21:05:39Z

@rhamini3: This pull request references Jira Issue OCPBUGS-83281, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (5.0.0) matches configured target version for branch (5.0.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Details

In response to this:

Updating the test code to remove the gateways during the gateway tests and keep the cleanup common to all tests

Summary by CodeRabbit

Tests

Improved gateway API test cleanup and resource deletion verification procedures

Enhanced test isolation through explicit per-test cleanup logic

Strengthened deployment verification mechanisms for gateway resources

coderabbitai

🧹 Nitpick comments (1)

test/extended/router/gatewayapicontroller.go (1)

353-353: Remove unused gateways slice appends.

The gateways slice is no longer used for cleanup since individual tests now delete their own gateways. These appends at lines 353, 392, and 518 are dead code and can be removed along with the slice declaration at line 114.

🧹 Remove dead code

Remove the slice declaration:

 	var (
 		oc                    = exutil.NewCLIWithPodSecurityLevel("gatewayapi-controller", admissionapi.LevelBaseline)
 		err                   error
-		gateways              []string
 		infPoolCRD            = "https://raw.githubusercontent.com/kubernetes-sigs/gateway-api-inference-extension/main/config/crd/bases/inference.networking.k8s.io_inferencepools.yaml"

And remove all appends (lines 353, 392, 518):

 		gw := names.SimpleNameGenerator.GenerateName("gateway-")
-		gateways = append(gateways, gw)
 		_, gwerr := createAndCheckGateway(oc, gw, gatewayClassName, defaultDomain, loadBalancerSupported)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@test/extended/router/gatewayapicontroller.go` at line 353, Remove the
now-unused gateways slice and its appends: delete the gateways variable
declaration and remove each occurrence of "gateways = append(gateways, gw)" (the
three dead appends in this file). Also remove or adjust any leftover references
to the gateways slice (e.g., cleanup loops or uses) so the file still compiles
and tests delete gateways individually as intended.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: e9af97cd-ddb4-46e4-aff1-793dffb61f2e

📥 Commits

Reviewing files that changed from the base of the PR and between d7ad0db and d4a9d63.

📒 Files selected for processing (2)

test/extended/router/gatewayapi_upgrade.go
test/extended/router/gatewayapicontroller.go

gcs278 · 2026-04-16T21:31:11Z

+		// deleting the gateway at the end of the test
+		e2e.Logf("Deleting Gateway %q", gw)
+		err = oc.AdminGatewayApiClient().GatewayV1().Gateways(ingressNamespace).Delete(context.Background(), gw, metav1.DeleteOptions{})
+		if err != nil && !apierrors.IsNotFound(err) {
+			e2e.Logf("Failed to delete Gateway %q: %v", gw, err)
+		}
+		if err := waitForGatewayDeploymentDeletion(oc, gw); err != nil {
+			e2e.Logf("Gateway deployment for %q was not cleaned up: %v", gw, err)
+		}


Consider putting this in a helper function, maybe DeleteGateway or something and add it as a defer call underneath the creation. I think that's a bit cleaner and it cleans up Gateways on early failures or skips.

thought about that, if this works in our favour I will go ahead and do that thanks :)
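
The point about early failures can be illustrated with a small stand-alone Go sketch. It assumes the failure unwinds the test function by panicking, which is how Ginkgo's `Fail` reports assertion failures; `runTest` and `runSafely` are hypothetical names, and the log slice stands in for real create and delete calls.

```go
package main

import "fmt"

// runTest is a hypothetical test body. The cleanup is registered directly
// under the creation step, so it runs on every exit path.
func runTest(failEarly bool, log *[]string) {
	*log = append(*log, "create gateway")
	defer func() { *log = append(*log, "delete gateway") }()

	if failEarly {
		// Simulates an assertion failure before the end of the test.
		panic("simulated assertion failure")
	}
	*log = append(*log, "run checks")
}

// runSafely plays the role of the test runner: it recovers the panic and
// reports whether the test failed.
func runSafely(failEarly bool, log *[]string) (failed bool) {
	defer func() {
		if r := recover(); r != nil {
			failed = true
		}
	}()
	runTest(failEarly, log)
	return false
}

func main() {
	var log []string
	failed := runSafely(true, &log)
	fmt.Println("failed:", failed, "steps:", log)
}
```

Cleanup placed as trailing statements at the end of the test body would be skipped on the early-failure path, whereas the deferred call still runs.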

openshift-merge-bot · 2026-04-16T21:32:29Z

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

gcs278 · 2026-04-17T00:57:33Z

@rhamini3 looks good 👍 Loki query looks much better.

From Claude analyzing the proxy logs (Gateways only ran for 30-40 seconds, no longer 40 minutes 😄):

Started:

gateway-4md4x-openshift-default-5d86c6646-j694z (IP 10.129.2.147) — started at 23:48:06, ready at 23:48:14

gateway-mhk77-openshift-default-6cf99777d6-dgmp7 (IP 10.131.1.158) — started at 23:48:22

gateway-mhk77-openshift-default-6cf99777d6-vxprg (IP 10.129.2.153) — started at 23:48:22

Gracefully terminated (got SIGTERM, drained, shut down cleanly):

gateway-mhk77 pods — both terminated gracefully at ~23:48:52-53 (agent drained, 5s grace period, terminated)

gateway-4md4x pod — terminated gracefully at ~23:48:52-53

openshift-ci-robot · 2026-04-17T04:48:10Z

@rhamini3: This pull request references Jira Issue OCPBUGS-83281, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (5.0.0) matches configured target version for branch (5.0.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Updating the test code to remove the gateways during the gateway tests and keep the cleanup common to all tests

Summary by CodeRabbit

Tests

Improved teardown to verify gateway deployment deletion before proceeding

Switched from shared bulk cleanup to per-test deferred gateway cleanup for stronger isolation

Adjusted test execution behavior and sequencing

Added explicit polling and failure logging to strengthen deletion verification and reduce flakiness

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

test/extended/router/gatewayapicontroller.go (1)
114-114: Remove unused gateways slice - dead code.

The gateways slice (line 114) is appended to at lines 361, 391, and 508, but it's never read anywhere. The old AfterEach that used this slice for bulk cleanup was removed. Consider removing both the slice declaration and the append statements.
♻️ Proposed fix to remove dead code

Remove the declaration at line 114:
 	var (
 		oc                    = exutil.NewCLIWithPodSecurityLevel("gatewayapi-controller", admissionapi.LevelBaseline)
 		err                   error
-		gateways              []string
 		infPoolCRD            = "https://raw.githubusercontent.com/kubernetes-sigs/gateway-api-inference-extension/main/config/crd/bases/inference.networking.k8s.io_inferencepools.yaml"
 		managedDNS            bool
 		loadBalancerSupported bool
And remove the append statements:
 		g.By("Create the default Gateway")
 		gw := names.SimpleNameGenerator.GenerateName("gateway-")
-		gateways = append(gateways, gw)
 		defer deleteGateway(oc, gw)
Apply similar changes at lines 391 and 508.
Also applies to: 361-362, 391-392, 508-509
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/router/gatewayapicontroller.go` at line 114, Remove the dead
`gateways` slice and its unused mutations: delete the `gateways []string`
declaration and remove all `gateways = append(gateways, ...)` statements (the
appends around the test blocks where gateways are collected); since the old
AfterEach bulk-cleanup that consumed this slice was removed, ensure no other
code references `gateways` and run tests to verify nothing else depends on it.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/extended/router/gatewayapicontroller.go`:
- Line 109: Remove the unused gateways slice: delete the var gateways
declaration (the slice named "gateways" declared near the g.Describe call) and
remove any dead append calls that add entries to it (the statements like
gateways = append(gateways, ...) found inside the test bodies where
deleteGateway(...) is now used with defer). Keep all test logic and defer
deleteGateway(...) calls intact; simply eliminate the unused variable and its
append usages so there is no dead state tracking left over from the previous
cleanup pattern.

---

Nitpick comments:
In `@test/extended/router/gatewayapicontroller.go`:
- Line 114: Remove the dead `gateways` slice and its unused mutations: delete
the `gateways []string` declaration and remove all `gateways = append(gateways,
...)` statements (the appends around the test blocks where gateways are
collected); since the old AfterEach bulk-cleanup that consumed this slice was
removed, ensure no other code references `gateways` and run tests to verify
nothing else depends on it.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 239b149a-b248-426b-8da0-39ac5aa3dce0

📥 Commits

Reviewing files that changed from the base of the PR and between d4a9d63 and 66588d8.

📒 Files selected for processing (1)

test/extended/router/gatewayapicontroller.go

openshift-trt · 2026-04-17T10:20:34Z

Job Failure Risk Analysis for sha: fc61fef

Job Name	Failure Risk
pull-ci-openshift-origin-main-e2e-vsphere-ovn-upi	Low [sig-instrumentation] Prometheus [apigroup:image.openshift.io] when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early][apigroup:config.openshift.io] [Suite:openshift/conformance/parallel] This test has passed 0.00% of 13 runs on release 5.0 [Architecture:amd64 FeatureSet:default Installer:upi JobTier:standard Network:ovn NetworkStack:ipv4 OS:rhcos9 Owner:eng Platform:vsphere Procedure:none SecurityMode:default Topology:ha Upgrade:none] in the last week.

gcs278

Thanks for this fix!

gcs278 · 2026-04-17T13:54:52Z

+// waitForGatewayDeploymentDeletion waits for a Gateway's deployment to be
+// deleted. The deployment is cascade-deleted by GC after the Gateway is
+// removed, but this is asynchronous. Must complete before removing the
+// GatewayClass or istiod to prevent gateway pods from crash-looping.


@CodeRabbit can you confirm that this is true with K8S?

@gcs278, yes, the comment is accurate with respect to Kubernetes garbage collection mechanics. Here's a breakdown:

Cascade deletion via owner references: Many objects in Kubernetes link to each other through owner references, which tell the control plane which objects are dependent on others. When a controller (such as the Istio/OSSM gateway controller) creates a Deployment on behalf of a Gateway resource, it typically sets an ownerReference on the Deployment pointing back to the Gateway.

Asynchronous by default (background deletion): In background cascading deletion, Kubernetes deletes the owner object immediately and the garbage collector then deletes the dependents in the background. Kubernetes uses background cascading deletion by default. This is why the deletion is described as asynchronous — the Gateway disappears promptly, but the dependent Deployment is cleaned up later by the GC loop.

Why the wait is necessary: In background deletion, the owner is deleted immediately, then the GC notices orphaned dependents, and the GC deletes dependents in the background. Since this GC loop runs asynchronously, there's a non-trivial window between Gateway deletion and Deployment deletion. If the GatewayClass or istiod were removed before GC cleaned up the gateway deployment, the gateway pods could crash-loop looking for their no-longer-present control plane — which is exactly what the comment describes.

One important caveat: the cascade deletion only works automatically if the Istio/OSSM gateway controller actually sets ownerReferences on the gateway Deployment pointing to the Gateway resource. The Kubernetes mechanics are correct, but whether this owner reference is established is Istio/OSSM controller implementation-specific behavior.

gcs278 · 2026-04-17T17:38:39Z

Mind also changing the rest of these?

Suggested change

e2e.Failf("Failed to delete GatewayClass %q: %v", gatewayClassName, err)

gcs278 · 2026-04-17T17:41:34Z

+	})
+}
+
+func deleteGateway(oc *exutil.CLI, gatewayName string) {


nit maybe it's worth being a bit more verbose on this function

Suggested change

func deleteGateway(oc *exutil.CLI, gatewayName string) {

// deleteGatewayAndWaitForCleanup deletes a Gateway and waits for its proxy deployment to be removed by GC.

func deleteGatewayAndWaitForCleanup(oc *exutil.CLI, gatewayName string) {

The Gateway API controller tests tracked Gateways in a shared in-memory gateways slice, deleting them during AfterEach cleanup. However, openshift-tests distributes tests across separate parallel worker processes. The annotation-based checkAllTestsDone coordination works correctly because annotations are stored on the cluster-scoped GatewayClass, but the gateways slice is not shared across processes. The process that runs the final AfterEach cleanup has an empty gateways slice, so it deletes the GatewayClass and istiod but never deletes the Gateways created by other processes. This leaves gateway deployments orphaned on the cluster.

As a secondary issue, even when gateways were deleted, the GatewayClass and istiod were removed without waiting for the gateway proxy deployments to be fully cleaned up by GC. Since the deployments have an owner reference to the Gateway (not a finalizer), the cascade deletion is asynchronous, creating a race where gateway pods lose their control plane and crash-loop.

Fix both issues by cleaning up gateways at the individual test level using defer deleteGateway, which deletes the Gateway and waits for its proxy deployment to be removed by GC. Add deleteGateway and waitForGatewayDeploymentDeletion helpers shared by both the controller tests and the upgrade test Teardown. Cleanup errors now hard fail to surface leftover resources immediately rather than causing confusing downstream test failures.

https://redhat.atlassian.net/browse/OCPBUGS-83281

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Grant Spence <gspence@redhat.com>
Co-Authored-By: Ishmam Amin <iamin@redhat.com>

openshift-ci · 2026-04-21T21:11:51Z

@gcs278: trigger 4 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

periodic-ci-openshift-release-main-nightly-5.0-e2e-gcp-ovn-upgrade
periodic-ci-openshift-release-main-ci-5.0-upgrade-from-stable-4.22-e2e-aws-ovn-upgrade
periodic-ci-openshift-release-main-ci-5.0-upgrade-from-stable-4.22-e2e-vsphere-ovn-upgrade
periodic-ci-openshift-release-main-nightly-5.0-upgrade-from-stable-4.22-e2e-metal-ipi-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/b57308f0-3dc6-11f1-87a2-33ee2bbb7ff4-0

gcs278 · 2026-04-21T23:51:39Z

Now getting:

error: unable to access the source layer sha256:5ab75c71069351042dcf2c7a6a33f5de09ff630a2fadeecc5d86a38bf4b46838: received unexpected HTTP status: 502 Bad Gateway

/retest

openshift-trt · 2026-04-22T05:30:54Z

Job Failure Risk Analysis for sha: ca41c36

Job Name	Failure Risk
pull-ci-openshift-origin-main-e2e-aws-ovn-microshift	IncompleteTests Tests for this run (29) are below the historical average (1443): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-main-e2e-aws-ovn-microshift-serial	IncompleteTests Tests for this run (28) are below the historical average (736): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)

rhamini3 · 2026-04-22T10:26:35Z

/retest-required

openshift-trt · 2026-04-22T12:18:34Z

Job Failure Risk Analysis for sha: ca41c36

Job Name	Failure Risk
pull-ci-openshift-origin-main-e2e-aws-ovn-microshift	IncompleteTests Tests for this run (29) are below the historical average (1374): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-main-e2e-aws-ovn-microshift-serial	IncompleteTests Tests for this run (29) are below the historical average (697): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)

gcs278 · 2026-04-22T13:51:54Z

Payload upgrade jobs look good. I don't think we can test OLM to noOLM in this PR targeting 5.0 because the from-stable CI job is from 4.22 (still noOLM).

We will have to test OLM to noOLM when we create the backport.

gcs278 · 2026-04-22T13:52:39Z

unrelated microshift failures
/retest

openshift-trt · 2026-04-22T16:14:18Z

Job Failure Risk Analysis for sha: ca41c36

Job Name	Failure Risk
pull-ci-openshift-origin-main-e2e-aws-ovn-microshift-serial	IncompleteTests Tests for this run (28) are below the historical average (669): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)

openshift-trt · 2026-04-22T16:19:10Z

Job Failure Risk Analysis for sha: ca41c36

Job Name	Failure Risk
pull-ci-openshift-origin-main-e2e-aws-ovn-microshift	IncompleteTests Tests for this run (29) are below the historical average (1291): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-main-e2e-aws-ovn-microshift-serial	IncompleteTests Tests for this run (28) are below the historical average (662): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)

gcs278 · 2026-04-22T18:38:46Z

openshift/ovn-kubernetes#3150 recently merged
/retest

melvinjoseph86 · 2026-04-23T11:35:07Z

/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-ovn-upgrade-fips periodic-ci-openshift-release-main-nightly-5.0-e2e-aws-ovn-upgrade-fips openshift-ovn-kubernetes-3126-nightly-4.22-e2e-aws-ovn-upgrade-fips openshift-cloud-credential-operator-1008-nightly-4.22-e2e-aws-ovn-upgrade-fips openshift-cluster-kube-controller-manager-operator-919-nightly-4.22-e2e-aws-ovn-upgrade-fips

openshift-ci · 2026-04-23T11:35:11Z

@melvinjoseph86: trigger 2 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-ovn-upgrade-fips
periodic-ci-openshift-release-main-nightly-5.0-e2e-aws-ovn-upgrade-fips

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/7add5590-3f08-11f1-9ec0-24af14e19d06-0

melvinjoseph86 · 2026-04-23T12:02:19Z

/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-azure-custom-dns-techpreview periodic-ci-openshift-release-main-nightly-4.22-e2e-gcp-custom-dns

openshift-ci · 2026-04-23T12:02:23Z

@melvinjoseph86: trigger 2 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

periodic-ci-openshift-release-main-nightly-4.22-e2e-azure-custom-dns-techpreview
periodic-ci-openshift-release-main-nightly-4.22-e2e-gcp-custom-dns

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/47ecba50-3f0c-11f1-9b39-58f10e5f58e8-0

melvinjoseph86 · 2026-04-23T16:59:20Z

I am not able to run the e2e locally, but error is not reproducible on the scheduled payloads job, hence marking as verified
/verified by @mjoseph

openshift-ci-robot · 2026-04-23T16:59:32Z

@melvinjoseph86: This PR has been marked as verified by @mjoseph.

Details

In response to this:

I am not able to run the e2e locally, but error is not reproducible on the scheduled payloads job, hence marking as verified
/verified by @mjoseph

openshift-merge-bot · 2026-04-23T17:40:29Z

/retest-required

Remaining retests: 0 against base HEAD 0d2dab5 and 2 for PR HEAD ca41c36 in total

gcs278 · 2026-04-23T20:44:43Z

/test images

rhamini3 · 2026-04-23T21:10:54Z

/retest

melvinjoseph86 · 2026-04-24T06:01:43Z

/test images

openshift-ci-robot · 2026-04-24T06:17:34Z

@rhamini3: Jira Issue Verification Checks: Jira Issue OCPBUGS-83281
✔️ This pull request was pre-merge verified.
✔️ All associated pull requests have merged.
✔️ All associated, merged pull requests were pre-merge verified.

Jira Issue OCPBUGS-83281 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓

Details

In response to this:

The Gateway API controller tests tracked Gateways in a shared in-memory gateways slice, deleting them during AfterEach cleanup. However, openshift-tests distributes tests across separate parallel worker processes. The annotation-based checkAllTestsDone coordination works correctly because annotations are stored on the cluster-scoped GatewayClass, but the gateways slice is not shared across processes. The process that runs the final AfterEach cleanup has an empty gateways slice, so it deletes the GatewayClass and istiod but never deletes the Gateways created by other processes. This leaves gateway deployments orphaned on the cluster.

As a secondary issue, even when gateways were deleted, the GatewayClass and istiod were removed without waiting for the gateway proxy deployments to be fully cleaned up by GC. Since the deployments have an owner reference to the Gateway (not a finalizer), the cascade deletion is asynchronous, creating a race where gateway pods lose their control plane and crash-loop.

Fix both issues by cleaning up gateways at the individual test level using defer deleteGateway, which deletes the Gateway and waits for its proxy deployment to be removed by GC. Add deleteGateway and waitForGatewayDeploymentDeletion helpers shared by both the controller tests and the upgrade test Teardown. Cleanup errors now hard fail to surface leftover resources immediately rather than causing confusing downstream test failures.

https://redhat.atlassian.net/browse/OCPBUGS-83281

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

gcs278 · 2026-04-24T15:06:52Z

/test images

gcs278 · 2026-04-24T15:07:18Z

lol whoops I didn't refresh github to see that this was merged 😆

rhamini3 · 2026-04-24T15:16:24Z

/cherry-pick release-4.22

openshift-cherrypick-robot · 2026-04-24T15:17:15Z

@rhamini3: new pull request created: #31068

Details

In response to this:

/cherry-pick release-4.22

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

openshift-merge-robot · 2026-04-24T19:41:34Z

Fix included in release 5.0.0-0.nightly-2026-04-24-121336

openshift-ci Bot added the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) Apr 16, 2026

openshift-ci-robot added the jira/valid-reference (indicates that this PR references a valid Jira ticket of any type) and jira/valid-bug (indicates that a referenced Jira bug is valid for the branch this PR is targeting) labels Apr 16, 2026

openshift-ci Bot requested review from frobware and jcmoraisjr April 16, 2026 21:04

coderabbitai Bot reviewed Apr 16, 2026


gcs278 reviewed Apr 16, 2026


Comment thread test/extended/router/gatewayapicontroller.go

gcs278 reviewed Apr 16, 2026


rhamini3 changed the title ~~[WIP] OCPBUGS-83281: Wait for gateway deletion before terminating test suite~~ OCPBUGS-83281: Wait for gateway deletion before terminating test suite Apr 17, 2026

openshift-ci Bot removed the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) Apr 17, 2026

coderabbitai Bot reviewed Apr 17, 2026


Comment thread test/extended/router/gatewayapicontroller.go

rhamini3 force-pushed the ocpbugs-83281 branch from 66588d8 to fc61fef Compare April 17, 2026 05:02

gcs278 reviewed Apr 17, 2026


rhamini3 force-pushed the ocpbugs-83281 branch from ac0f2f7 to a269443 Compare April 17, 2026 16:40

gcs278 reviewed Apr 17, 2026


rhamini3 force-pushed the ocpbugs-83281 branch from a269443 to 7777f6c Compare April 17, 2026 18:58

rhamini3 force-pushed the ocpbugs-83281 branch from d80779d to 3f8a12d Compare April 17, 2026 20:08

openshift-ci-robot added the verified label (signifies that the PR passed pre-merge verification criteria) Apr 23, 2026

openshift-merge-bot Bot merged commit eb9821a into openshift:main Apr 24, 2026
9 of 21 checks passed

openshift-cherrypick-robot mentioned this pull request Apr 24, 2026

[release-4.22] OCPBUGS-84322: Fix Gateway cleanup in parallel e2e test workers #31068

Open


	e2e.Failf("Failed to delete GatewayClass %q: %v", gatewayClassName, err)

	func deleteGateway(oc *exutil.CLI, gatewayName string) {
	// deleteGatewayAndWaitForCleanup deletes a Gateway and waits for its proxy deployment to be removed by GC.
	func deleteGatewayAndWaitForCleanup(oc *exutil.CLI, gatewayName string) {
