OCPBUGS-83281: Fix Gateway cleanup in parallel e2e test workers#31023
OCPBUGS-83281: Fix Gateway cleanup in parallel e2e test workers#31023openshift-merge-bot[bot] merged 3 commits intoopenshift:mainfrom
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: automatic mode |
|
@rhamini3: This pull request references Jira Issue OCPBUGS-83281, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughTeardown and cleanup in Gateway API tests were changed: upgrade teardown no longer polls for Gateway removal and now waits for gateway-associated Deployment deletion. The suite no longer enforces ordered/serial execution; shared bulk Gateway deletion was removed in favor of per-test deferred Changes
Sequence Diagram(s)sequenceDiagram
participant Test as Test
participant API as API Server
participant Controller as Gateway Controller
participant Kube as Kubernetes (Deployments)
Test->>API: delete Gateway
API-->>Test: acknowledge deletion
Controller->>Kube: reconcile -> delete Deployment for Gateway
Kube-->>Controller: Deployment deleted
Test->>Controller: call waitForGatewayDeploymentDeletion(gatewayName)
Controller->>API: Get Deployment status
API-->>Controller: NotFound => return success
Controller-->>Test: success / error
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~22 minutes 🚥 Pre-merge checks | ✅ 9 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (9 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
@rhamini3: This pull request references Jira Issue OCPBUGS-83281, which is valid. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
test/extended/router/gatewayapicontroller.go (1)
353-353: Remove unusedgatewaysslice appends.The
gatewaysslice is no longer used for cleanup since individual tests now delete their own gateways. These appends at lines 353, 392, and 518 are dead code and can be removed along with the slice declaration at line 114.🧹 Remove dead code
Remove the slice declaration:
var ( oc = exutil.NewCLIWithPodSecurityLevel("gatewayapi-controller", admissionapi.LevelBaseline) err error - gateways []string infPoolCRD = "https://raw.githubusercontent.com/kubernetes-sigs/gateway-api-inference-extension/main/config/crd/bases/inference.networking.k8s.io_inferencepools.yaml"And remove all appends (lines 353, 392, 518):
gw := names.SimpleNameGenerator.GenerateName("gateway-") - gateways = append(gateways, gw) _, gwerr := createAndCheckGateway(oc, gw, gatewayClassName, defaultDomain, loadBalancerSupported)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/extended/router/gatewayapicontroller.go` at line 353, Remove the now-unused gateways slice and its appends: delete the gateways variable declaration and remove each occurrence of "gateways = append(gateways, gw)" (the three dead appends in this file). Also remove or adjust any leftover references to the gateways slice (e.g., cleanup loops or uses) so the file still compiles and tests delete gateways individually as intended.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@test/extended/router/gatewayapicontroller.go`:
- Line 353: Remove the now-unused gateways slice and its appends: delete the
gateways variable declaration and remove each occurrence of "gateways =
append(gateways, gw)" (the three dead appends in this file). Also remove or
adjust any leftover references to the gateways slice (e.g., cleanup loops or
uses) so the file still compiles and tests delete gateways individually as
intended.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: e9af97cd-ddb4-46e4-aff1-793dffb61f2e
📒 Files selected for processing (2)
test/extended/router/gatewayapi_upgrade.gotest/extended/router/gatewayapicontroller.go
| // deleting the gateway at the end of the test | ||
| e2e.Logf("Deleting Gateway %q", gw) | ||
| err = oc.AdminGatewayApiClient().GatewayV1().Gateways(ingressNamespace).Delete(context.Background(), gw, metav1.DeleteOptions{}) | ||
| if err != nil && !apierrors.IsNotFound(err) { | ||
| e2e.Logf("Failed to delete Gateway %q: %v", gw, err) | ||
| } | ||
| if err := waitForGatewayDeploymentDeletion(oc, gw); err != nil { | ||
| e2e.Logf("Gateway deployment for %q was not cleaned up: %v", gw, err) | ||
| } |
There was a problem hiding this comment.
Consider putting this in a helper function, maybe DeleteGateway or something and add it as a defer call underneath the creation. I think that's a bit cleaner and it cleans up Gateways on early failures or skips.
There was a problem hiding this comment.
thought about that, if this works in our favour I will go ahead and do that thanks :)
|
Scheduling required tests: |
|
@rhamini3 looks good 👍 Loki query looks much better. From Claude analyzing the proxy logs (Gateways only ran for 30-40 second, no longer 40 minutes 😄):
|
|
@rhamini3: This pull request references Jira Issue OCPBUGS-83281, which is valid. 3 validation(s) were run on this bug
The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
test/extended/router/gatewayapicontroller.go (1)
114-114: Remove unusedgatewaysslice - dead code.The
gatewaysslice (line 114) is appended to at lines 361, 391, and 508, but it's never read anywhere. The oldAfterEachthat used this slice for bulk cleanup was removed. Consider removing both the slice declaration and the append statements.♻️ Proposed fix to remove dead code
Remove the declaration at line 114:
var ( oc = exutil.NewCLIWithPodSecurityLevel("gatewayapi-controller", admissionapi.LevelBaseline) err error - gateways []string infPoolCRD = "https://raw.githubusercontent.com/kubernetes-sigs/gateway-api-inference-extension/main/config/crd/bases/inference.networking.k8s.io_inferencepools.yaml" managedDNS bool loadBalancerSupported boolAnd remove the append statements:
g.By("Create the default Gateway") gw := names.SimpleNameGenerator.GenerateName("gateway-") - gateways = append(gateways, gw) defer deleteGateway(oc, gw)Apply similar changes at lines 391 and 508.
Also applies to: 361-362, 391-392, 508-509
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/extended/router/gatewayapicontroller.go` at line 114, Remove the dead `gateways` slice and its unused mutations: delete the `gateways []string` declaration and remove all `gateways = append(gateways, ...)` statements (the appends around the test blocks where gateways are collected); since the old AfterEach bulk-cleanup that consumed this slice was removed, ensure no other code references `gateways` and run tests to verify nothing else depends on it.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/extended/router/gatewayapicontroller.go`:
- Line 109: Remove the unused gateways slice: delete the var gateways
declaration (the slice named "gateways" declared near the g.Describe call) and
remove any dead append calls that add entries to it (the statements like
gateways = append(gateways, ...) found inside the test bodies where
deleteGateway(...) is now used with defer). Keep all test logic and defer
deleteGateway(...) calls intact; simply eliminate the unused variable and its
append usages so there is no dead state tracking left over from the previous
cleanup pattern.
---
Nitpick comments:
In `@test/extended/router/gatewayapicontroller.go`:
- Line 114: Remove the dead `gateways` slice and its unused mutations: delete
the `gateways []string` declaration and remove all `gateways = append(gateways,
...)` statements (the appends around the test blocks where gateways are
collected); since the old AfterEach bulk-cleanup that consumed this slice was
removed, ensure no other code references `gateways` and run tests to verify
nothing else depends on it.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 239b149a-b248-426b-8da0-39ac5aa3dce0
📒 Files selected for processing (1)
test/extended/router/gatewayapicontroller.go
|
Scheduling required tests: |
|
Job Failure Risk Analysis for sha: fc61fef
|
| // waitForGatewayDeploymentDeletion waits for a Gateway's deployment to be | ||
| // deleted. The deployment is cascade-deleted by GC after the Gateway is | ||
| // removed, but this is asynchronous. Must complete before removing the | ||
| // GatewayClass or istiod to prevent gateway pods from crash-looping. |
There was a problem hiding this comment.
@CodeRabbit can you confirm that this is true with K8S?
There was a problem hiding this comment.
@gcs278, yes, the comment is accurate with respect to Kubernetes garbage collection mechanics. Here's a breakdown:
-
Cascade deletion via owner references: Many objects in Kubernetes link to each other through owner references, which tell the control plane which objects are dependent on others. When a controller (such as the Istio/OSSM gateway controller) creates a Deployment on behalf of a Gateway resource, it typically sets an
ownerReferenceon the Deployment pointing back to the Gateway. -
Asynchronous by default (background deletion): In background cascading deletion, Kubernetes deletes the owner object immediately and the garbage collector then deletes the dependents in the background. Kubernetes uses background cascading deletion by default. This is why the deletion is described as asynchronous — the Gateway disappears promptly, but the dependent Deployment is cleaned up later by the GC loop.
-
Why the wait is necessary: In background deletion, the owner is deleted immediately, then the GC notices orphaned dependents, and the GC deletes dependents in the background. Since this GC loop runs asynchronously, there's a non-trivial window between Gateway deletion and Deployment deletion. If the GatewayClass or istiod were removed before GC cleaned up the gateway deployment, the gateway pods could crash-loop looking for their no-longer-present control plane — which is exactly what the comment describes.
One important caveat: the cascade deletion only works automatically if the Istio/OSSM gateway controller actually sets ownerReferences on the gateway Deployment pointing to the Gateway resource. The Kubernetes mechanics are correct, but whether this owner reference is established is Istio/OSSM controller implementation-specific behavior.
|
Scheduling required tests: |
There was a problem hiding this comment.
Mind also changing the rest of these?
| e2e.Failf("Failed to delete GatewayClass %q: %v", gatewayClassName, err) |
| }) | ||
| } | ||
|
|
||
| func deleteGateway(oc *exutil.CLI, gatewayName string) { |
There was a problem hiding this comment.
nit maybe it's worth being a bit more verbose on this function
| func deleteGateway(oc *exutil.CLI, gatewayName string) { | |
| / deleteGatewayAndWaitForCleanup deletes a Gateway and waits for its proxy deployment to be removed by GC. | |
| func deleteGatewayAndWaitForCleanup(oc *exutil.CLI, gatewayName string) { |
|
Scheduling required tests: |
The Gateway API controller tests tracked Gateways in a shared in-memory gateways slice, deleting them during AfterEach cleanup. However, openshift-tests distributes tests across separate parallel worker processes. The annotation-based checkAllTestsDone coordination works correctly because annotations are stored on the cluster-scoped GatewayClass, but the gateways slice is not shared across processes. The process that runs the final AfterEach cleanup has an empty gateways slice, so it deletes the GatewayClass and istiod but never deletes the Gateways created by other processes. This leaves gateway deployments orphaned on the cluster. As a secondary issue, even when gateways were deleted, the GatewayClass and istiod were removed without waiting for the gateway proxy deployments to be fully cleaned up by GC. Since the deployments have an owner reference to the Gateway (not a finalizer), the cascade deletion is asynchronous, creating a race where gateway pods lose their control plane and crash-loop. Fix both issues by cleaning up gateways at the individual test level using defer deleteGateway, which deletes the Gateway and waits for its proxy deployment to be removed by GC. Add deleteGateway and waitForGatewayDeploymentDeletion helpers shared by both the controller tests and the upgrade test Teardown. Cleanup errors now hard fail to surface leftover resources immediately rather than causing confusing downstream test failures. https://redhat.atlassian.net/browse/OCPBUGS-83281 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-Authored-By: Grant Spence <gspence@redhat.com> Co-Authored-By: Ishmam Amin <iamin@redhat.com>
|
@gcs278: trigger 4 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/b57308f0-3dc6-11f1-87a2-33ee2bbb7ff4-0 |
|
Now getting: /retest |
|
Job Failure Risk Analysis for sha: ca41c36
|
|
/retest-required |
|
Job Failure Risk Analysis for sha: ca41c36
|
|
Payload upgrade jobs look good. I don't think we can test OLM to noOLM in this PR targeting 5.0 because the We will have to test OLM to noOLM when we create the backport. |
|
unrelated microshift failures |
|
Job Failure Risk Analysis for sha: ca41c36
|
|
Job Failure Risk Analysis for sha: ca41c36
|
|
openshift/ovn-kubernetes#3150 recently merged |
|
/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-ovn-upgrade-fips periodic-ci-openshift-release-main-nightly-5.0-e2e-aws-ovn-upgrade-fips openshift-ovn-kubernetes-3126-nightly-4.22-e2e-aws-ovn-upgrade-fips openshift-cloud-credential-operator-1008-nightly-4.22-e2e-aws-ovn-upgrade-fips openshift-cluster-kube-controller-manager-operator-919-nightly-4.22-e2e-aws-ovn-upgrade-fips |
|
@melvinjoseph86: trigger 2 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/7add5590-3f08-11f1-9ec0-24af14e19d06-0 |
|
/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-azure-custom-dns-techpreview periodic-ci-openshift-release-main-nightly-4.22-e2e-gcp-custom-dns |
|
@melvinjoseph86: trigger 2 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/47ecba50-3f0c-11f1-9b39-58f10e5f58e8-0 |
|
I am not able to run the e2e locally, but error is not reproducible on the scheduled payloads job, hence marking as verified |
|
@melvinjoseph86: This PR has been marked as verified by DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/test images |
|
/retest |
|
/test images |
|
Scheduling required tests: |
eb9821a
into
openshift:main
|
@rhamini3: Jira Issue Verification Checks: Jira Issue OCPBUGS-83281 Jira Issue OCPBUGS-83281 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/test images |
|
lol whoops I didn't refresh github to see that this was merged 😆 |
|
/cherry-pick release-4.22 |
|
@rhamini3: new pull request created: #31068 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
Fix included in release 5.0.0-0.nightly-2026-04-24-121336 |
The Gateway API controller tests tracked Gateways in a sharedin-memory gateways slice, deleting them during AfterEach cleanup. However, openshift-tests distributes tests across separate parallel worker processes. The annotation-based checkAllTestsDone coordination works correctly because annotations are stored on the cluster-scoped GatewayClass, but the gateways slice is not shared across processes. The process that runs the final AfterEach cleanup has an empty gateways slice, so it deletes the GatewayClass and istiod but never deletes the Gateways created by other processes. This leaves gateway deployments orphaned on the cluster.
As a secondary issue, even when gateways were deleted, the GatewayClass and istiod were removed without waiting for the gateway proxy deployments to be fully cleaned up by GC. Since the deployments have an owner reference to the Gateway (not a finalizer), the cascade deletion is asynchronous, creating a race where gateway pods lose their control plane and crash-loop.
Fix both issues by cleaning up gateways at the individual test level using defer deleteGateway, which deletes the Gateway and waits for its proxy deployment to be removed by GC. Add deleteGateway and waitForGatewayDeploymentDeletion helpers shared by both the controller tests and the upgrade test Teardown. Cleanup errors now hard fail to surface leftover resources immediately rather than causing confusing downstream test failures.
https://redhat.atlassian.net/browse/OCPBUGS-83281