OCPBUGS-86719: Use zero-downtime rollout strategy for console pods #1168

Open
asadawar wants to merge 1 commit into openshift:main from asadawar:OCPBUGS-86719-sequential-rollout

@asadawar asadawar commented May 29, 2026

Summary

  • Change the console deployment rollout strategy from maxSurge=3, maxUnavailable=1 to maxSurge=1, maxUnavailable=0 on 3+ node topologies (HighlyAvailable, External+HA), ensuring no old pod is terminated until its replacement passes readiness checks
  • On 2-node topologies (DualReplica, HighlyAvailableArbiter), keep maxUnavailable=1 with maxSurge reduced from 3 to 1 to avoid rollout deadlock caused by required pod anti-affinity
  • Add test coverage for DualReplica and HighlyAvailableArbiter topology strategies

Why this approach

Three approaches were considered:

1. maxUnavailable=0 for all HA topologies (rejected)
On DualReplica (2 masters, 2 replicas) and HighlyAvailableArbiter (2 full masters + 1 arbiter) clusters, the console deployment uses RequiredDuringSchedulingIgnoredDuringExecution pod anti-affinity on kubernetes.io/hostname. When every eligible node already runs a console pod, the scheduler cannot place a surge pod. With maxUnavailable=0, no old pod can be terminated either, causing a rollout deadlock that stalls until ProgressDeadlineExceeded (10 minutes). This approach was rejected because it would break recently added DualReplica support (PR #1151, merged 2026-05-07).

2. Keep maxUnavailable=1 for all topologies, only reduce maxSurge (rejected)
Reducing maxSurge from 3 to 1 aligns with other operators (CMO monitoring-plugin uses maxUnavailable=1 with default maxSurge) but does not fix the reported bug. With maxUnavailable=1, Kubernetes is still allowed to terminate one old pod before its replacement is ready, causing the console flap. This approach was rejected because it does not address the root cause.

3. Topology-aware strategy (chosen)
Use maxUnavailable=0 on topologies where a free node is available for the surge pod (HighlyAvailable with 3+ masters, External+HA with multiple workers), and maxUnavailable=1 on constrained topologies (DualReplica, HighlyAvailableArbiter) where rollout deadlock is possible. This fixes the bug for the most common topology while preserving correct behavior on constrained clusters.

For the HighlyAvailableArbiter case, the conservative choice (maxUnavailable=1) was made because arbiter nodes may have taints or resource constraints that prevent scheduling console pods, effectively making it a 2-node topology for console scheduling. Maintainers familiar with arbiter node scheduling can adjust this if arbiter nodes are known to be eligible.
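
The topology-aware choice and the deadlock condition described above can be sketched as follows. This is a minimal, self-contained illustration and not the operator's code: the `rollingUpdate` struct and the `strategyFor` and `canDeadlock` helpers are hypothetical names, the topology strings only mirror the modes named in this description, and single-replica clusters are not modeled. The real `withStrategy` works with the Kubernetes and OpenShift API types.

```go
package main

import "fmt"

// rollingUpdate is a hypothetical stand-in for the rollingUpdate stanza of a
// Deployment strategy; only the two fields discussed in this PR are modeled.
type rollingUpdate struct {
	MaxSurge       int
	MaxUnavailable int
}

// strategyFor picks rollout parameters from a control-plane topology name.
// Constrained topologies keep maxUnavailable=1; the default branch follows
// the "surge first, then terminate" values proposed for the other topologies.
func strategyFor(topology string) rollingUpdate {
	switch topology {
	case "DualReplica", "HighlyAvailableArbiter":
		return rollingUpdate{MaxSurge: 1, MaxUnavailable: 1}
	default:
		return rollingUpdate{MaxSurge: 1, MaxUnavailable: 0}
	}
}

// canDeadlock reports whether a rollout can stall under required hostname
// anti-affinity: no free node exists for a surge pod, and no old pod may be
// terminated to make room.
func canDeadlock(eligibleNodes, replicas int, s rollingUpdate) bool {
	freeNodes := eligibleNodes - replicas
	return freeNodes <= 0 && s.MaxUnavailable == 0
}

func main() {
	for _, t := range []string{"HighlyAvailable", "DualReplica", "HighlyAvailableArbiter"} {
		fmt.Printf("%s: %+v\n", t, strategyFor(t))
	}
	// Rejected option 1: two eligible nodes, two replicas, maxUnavailable=0.
	fmt.Println("option 1 on two nodes deadlocks:",
		canDeadlock(2, 2, rollingUpdate{MaxSurge: 1, MaxUnavailable: 0}))
}
```

Passing the number of nodes that can actually host a console pod, rather than the raw node count, is what makes the arbiter case conservative: if the arbiter node is not schedulable, `eligibleNodes` is 2 and only maxUnavailable=1 avoids the stall.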

Root cause

The withStrategy function in pkg/console/subresource/deployment/deployment.go:184 set maxSurge=3, maxUnavailable=1 for all HA topologies. These values were introduced in PR #1107 (OCPBUGS-74872) as part of a refactor that moved deployment construction from bindata to Go code, without specific rationale for the strategy values.

With maxUnavailable=1 and 2 replicas, the Kubernetes deployment controller is allowed to terminate one old pod immediately when a rollout starts, even before any new pod is ready. This creates a window (approximately 10-15 seconds based on observed pod startup times) where only one pod serves traffic. During this window:

  • The terminating pod is removed from Service endpoints (even though the preStop hook keeps the process alive for 25 seconds)
  • New connections are routed only to the single remaining pod
  • WebSocket connections to the terminated pod are dropped, causing visible console "flapping"
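
The arithmetic behind that window can be shown with a small sketch. It assumes maxSurge and maxUnavailable are absolute numbers, as they are in this PR; `rolloutBounds` is a hypothetical helper name, not something from the operator or from Kubernetes.

```go
package main

import "fmt"

// rolloutBounds returns the availability floor and the pod-count ceiling that
// a rolling update must respect when maxSurge and maxUnavailable are given as
// absolute numbers. Percentage values are not handled in this sketch.
func rolloutBounds(replicas, maxSurge, maxUnavailable int) (minAvailable, maxTotal int) {
	return replicas - maxUnavailable, replicas + maxSurge
}

func main() {
	oldMin, oldMax := rolloutBounds(2, 3, 1) // values before this PR
	newMin, newMax := rolloutBounds(2, 1, 0) // values proposed for 3+ node topologies
	fmt.Printf("before: at least %d available, at most %d pods\n", oldMin, oldMax)
	fmt.Printf("after:  at least %d available, at most %d pods\n", newMin, newMax)
}
```

With 2 replicas, the old settings allow availability to fall to a single pod, while the new settings keep the floor at 2.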

Cluster verification

Verified on a live OCP 4.22.0-rc.4 vSphere IPI cluster:

Cluster topology:

$ oc get infrastructure cluster -o jsonpath='{.status.controlPlaneTopology}'
HighlyAvailable

$ oc get nodes -l node-role.kubernetes.io/master= -o name
node/master-0
node/master-1
node/master-2

Current strategy (before fix):

$ oc get deployment console -n openshift-console -o jsonpath='{.spec.strategy}'
{"rollingUpdate":{"maxSurge":3,"maxUnavailable":1},"type":"RollingUpdate"}

Pod distribution (2 pods on 2 of 3 masters, 3rd master free for surge):

$ oc get pods -n openshift-console -o wide
NAME                        READY   STATUS    NODE
console-7dfb9f987d-4rbcd    1/1     Running   master-1
console-7dfb9f987d-zc7n4    1/1     Running   master-0

With the fix applied (maxSurge=1, maxUnavailable=0), the rollout behavior would be:

  1. New pod created on master-2 (the free node). Total: 3 pods, 2 available.
  2. New pod passes readiness check. Total: 3 pods, 3 available.
  3. One old pod terminated. Total: 2 pods, 2 available.
  4. Second new pod created on the freed node. Total: 3 pods, 2 available.
  5. Second new pod passes readiness check. Old pod terminated. Total: 2 pods, 2 available.

At no point does availability drop below 2 (full capacity).
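
The sequence above can be checked with a simplified simulation. This is a sketch under stated assumptions, not the deployment controller's algorithm: `simulate` is a hypothetical function, every created pod is assumed to become ready, a terminating pod is counted as unavailable immediately, and scheduling constraints such as anti-affinity are ignored.

```go
package main

import "fmt"

// simulate walks a simplified rolling update and returns the lowest number of
// available pods observed, together with a log of the events.
func simulate(replicas, maxSurge, maxUnavailable int) (minAvailable int, steps []string) {
	oldPods, newReady, newPending := replicas, 0, 0
	minAvailable = replicas
	record := func(event string) {
		available := oldPods + newReady
		if available < minAvailable {
			minAvailable = available
		}
		total := oldPods + newReady + newPending
		steps = append(steps, fmt.Sprintf("%s: total=%d available=%d", event, total, available))
	}
	for oldPods > 0 || newPending > 0 || newReady < replicas {
		progressed := false
		// Terminate old pods while the availability floor is still respected.
		for oldPods > 0 && oldPods+newReady-1 >= replicas-maxUnavailable {
			oldPods--
			record("old pod terminated")
			progressed = true
		}
		// Create new pods while the surge ceiling is still respected.
		for newReady+newPending < replicas && oldPods+newReady+newPending < replicas+maxSurge {
			newPending++
			record("new pod created")
			progressed = true
		}
		// Assume every pending pod eventually passes its readiness check.
		if newPending > 0 {
			newReady += newPending
			newPending = 0
			record("new pods ready")
			progressed = true
		}
		if !progressed {
			break // neither scaling up nor scaling down is allowed
		}
	}
	return minAvailable, steps
}

func main() {
	floor, steps := simulate(2, 1, 0)
	for _, s := range steps {
		fmt.Println(s)
	}
	fmt.Println("lowest availability:", floor)
}
```

Running the same function with the previous values, `simulate(2, 3, 1)`, terminates an old pod before any new pod is ready, which is the dip this PR removes.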

Test plan

  • Unit tests pass (make test-unit): all deployment strategy tests updated and passing
  • Added new test cases for DualReplica and HighlyAvailableArbiter topologies
  • gofmt and go vet clean (make check)
  • Verified current cluster topology and strategy on live OCP 4.22 cluster

OWNERS

/cc @spadgett @jhadvig @TheRealJon

Bug: https://issues.redhat.com/browse/OCPBUGS-86719

Summary by CodeRabbit

  • Bug Fixes

    • Console deployment uses a topology-aware rolling update strategy to reduce disruption during rollouts.
  • Tests

    • Updated unit tests to cover rolling update behavior across additional topology modes.
  • Chores / CI

    • Extended end-to-end test timeout in the CI test runner to reduce spurious timeouts.

@openshift-ci openshift-ci Bot requested review from TheRealJon, jhadvig and spadgett May 29, 2026 10:54
@openshift-ci-robot openshift-ci-robot added the jira/severity-low (Referenced Jira bug's severity is low for the branch this PR is targeting.), jira/valid-reference (Indicates that this PR references a valid Jira ticket of any type.), and jira/invalid-bug (Indicates that a referenced Jira bug is invalid for the branch this PR is targeting.) labels May 29, 2026
@openshift-ci-robot
Contributor

@asadawar: This pull request references Jira Issue OCPBUGS-86719, which is invalid:

  • expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.


@coderabbitai

coderabbitai Bot commented May 29, 2026

Walkthrough

The PR makes the console Deployment's rolling update parameters topology-aware: DualReplica and HighlyAvailableArbiter use MaxSurge=1/MaxUnavailable=1; other topologies use MaxSurge=1/MaxUnavailable=0. Tests are updated and expanded for the topology cases, and the e2e test timeout is increased from 30m to 40m.

Changes

Topology-aware rolling update strategy

| Layer | File(s) | Summary |
|---|---|---|
| Topology-aware rolling update strategy implementation | pkg/console/subresource/deployment/deployment.go | withStrategy now sets Strategy.RollingUpdate using control-plane topology: MaxSurge=1/MaxUnavailable=1 for DualReplica and HighlyAvailableArbiter, and MaxSurge=1/MaxUnavailable=0 for other topologies. |
| Strategy test fixtures and topology-specific test cases | pkg/console/subresource/deployment/deployment_test.go | Tests add fixtures for DualReplicaTopologyMode and HighlyAvailableArbiterMode, introduce zeroDowntimeStrategy (1/0) and a constrained HA strategy (1/1), update highly available expectations to use zero-downtime where appropriate, and add cases asserting DualReplica/Arbiter behavior. |
| Default deployment and downloads test expectations | pkg/console/subresource/deployment/deployment_test.go | Default, trusted-CA, external topology, and downloads deployment test expectations updated to maxSurge=1 / maxUnavailable=0. |
| E2E runner timeout | test-e2e.sh | Increase go test timeout from 30m to 40m in the OpenShift CI path and align non-OpenShift branch invocation/echo formatting. |

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 13 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Description check | ⚠️ Warning | The description provides a comprehensive summary, detailed rationale with three approaches evaluated, root cause analysis, verification on a live cluster, and test plan; however, it does not follow the required template structure with sections like 'Analysis/Root cause', 'Solution description', 'Test setup', 'Test cases', and 'Browser conformance'. | Restructure the description to follow the required template: add 'Analysis/Root cause', 'Solution description', 'Test setup', 'Test cases', 'Browser conformance' sections, and include the 'Reviewers and assignees' section with appropriate team tags. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (13 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and specifically references the Jira issue (OCPBUGS-86719) and accurately summarizes the main change: implementing a zero-downtime rollout strategy for console pods. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | The project uses standard Go testing (not Ginkgo), and all test names in the PR changes are stable and deterministic with no dynamic values like pod names, timestamps, UUIDs, or node names. |
| Test Structure And Quality | ✅ Passed | PR adds standard Go unit tests with table-driven patterns consistent with codebase, maintaining single responsibility and proper assertions via deep.Equal structures. |
| Microshift Test Compatibility | ✅ Passed | No new Ginkgo e2e tests were added in this PR. Changes are limited to production code (deployment strategy) and unit tests (using Go testing package), plus a test runner script timeout update. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | PR adds TestDeploymentsReplicas, a standard Go test (not Ginkgo) that handles SNO via topology-aware replica logic and does not make multi-node assumptions. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | PR checks ControlPlaneTopology before applying constraints; correctly uses maxUnavailable=0 for 3+ nodes and maxUnavailable=1 for 2-node topologies. |
| Ote Binary Stdout Contract | ✅ Passed | PR modifies deployment code, unit tests, and test wrapper script only; none introduce process-level stdout writes violating OTE Binary Stdout Contract; no process-level code in modified files. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | No new Ginkgo e2e tests are added in this PR. Changes are limited to deployment strategy code, standard Go unit tests, and e2e runner script timeout adjustment. Check is not applicable. |
| No-Weak-Crypto | ✅ Passed | The PR changes are focused on Kubernetes deployment rolling update strategies and test timeouts. No cryptographic operations, weak crypto, or secret comparisons are present in the modified files. |
| Container-Privileges | ✅ Passed | PR introduces no privileged container configs, host access, or privilege escalation. Only modifies RollingUpdate strategy; existing deployments enforce non-root, no-escalation security contexts. |
| No-Sensitive-Data-In-Logs | ✅ Passed | No sensitive data exposed in logs. All logging statements use operational metrics; test-e2e.sh uses single quotes preventing KUBECONFIG credential expansion. |


@openshift-ci openshift-ci Bot added the needs-ok-to-test label (Indicates a PR that requires an org member to verify it is safe to test.) May 29, 2026
@openshift-ci
Contributor

openshift-ci Bot commented May 29, 2026

Hi @asadawar. Thanks for your PR.

I'm waiting for an openshift member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


@jhadvig
Member

jhadvig commented Jun 4, 2026

/ok-to-test

@openshift-ci openshift-ci Bot added the ok-to-test label (Indicates a non-member PR verified by an org member that is safe to test.) and removed the needs-ok-to-test label (Indicates a PR that requires an org member to verify it is safe to test.) Jun 4, 2026
Member

@jhadvig jhadvig left a comment

@asadawar thank you for the fix 👍
/lgtm
/approve
/cherry-pick release-4.22

@openshift-ci openshift-ci Bot added the lgtm label (Indicates that a PR is ready to be merged.) Jun 4, 2026
@openshift-ci
Contributor

openshift-ci Bot commented Jun 4, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: asadawar, jhadvig

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) Jun 4, 2026
@asadawar
Author

asadawar commented Jun 4, 2026

/test e2e-aws-operator

@openshift-ci openshift-ci Bot removed the lgtm label (Indicates that a PR is ready to be merged.) Jun 4, 2026
@openshift-ci
Contributor

openshift-ci Bot commented Jun 4, 2026

New changes are detected. LGTM label has been removed.

The rollout strategy change (maxUnavailable: 0) makes each
deployment rollout take a bit longer since the new pod must be
Ready before the old pod is terminated. Across the full test
suite this adds enough time to push past the 30m limit, so the
e2e test timeout is bumped from 30m to 40m.

Assisted-by: Claude Code
@asadawar asadawar force-pushed the OCPBUGS-86719-sequential-rollout branch from 5224b0b to 33be859 Compare June 4, 2026 15:56
@asadawar
Author

asadawar commented Jun 4, 2026

The e2e timeout was hitting 30m because the rollout strategy change (maxUnavailable: 0) makes each deployment rollout take a bit longer. The new pod has to be fully Ready before the old one gets terminated, which is the whole point of this PR, but it adds up across the full test suite. Bumped the test timeout from 30m to 40m to give enough headroom.


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test-e2e.sh`:
- Line 20: The echo line currently uses single quotes so ${KUBECONFIG} is not
expanded; update the echo in test-e2e.sh (the echo
'KUBERNETES_CONFIG=${KUBECONFIG} go test -timeout 40m -v ./test/e2e/') to use
double quotes so the shell expands ${KUBECONFIG} (i.e., change the quoting
around the echo argument to allow variable substitution).
ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 880576a7-c8ff-473a-b39a-de403c2d7608

📥 Commits

Reviewing files that changed from the base of the PR and between 77df00f and 33be859.

📒 Files selected for processing (3)
  • pkg/console/subresource/deployment/deployment.go
  • pkg/console/subresource/deployment/deployment_test.go
  • test-e2e.sh
🚧 Files skipped from review as they are similar to previous changes (2)
  • pkg/console/subresource/deployment/deployment.go
  • pkg/console/subresource/deployment/deployment_test.go
📜 Review details
🧰 Additional context used
📓 Path-based instructions (1)
{Makefile,makefile,**/Makefile,**/*.sh}

📄 CodeRabbit inference engine (AGENTS.md)

Use GOFLAGS="-mod=vendor" for builds and tests to ensure vendored dependencies are used

Files:

  • test-e2e.sh
🪛 Shellcheck (0.11.0)
test-e2e.sh

[info] 20-20: Expressions don't expand in single quotes, use double quotes for that.

(SC2016)

🔀 Multi-repo context openshift/console

[::openshift/console::] pkg/server/server.go:759 — server constructs the JSON served flags including ControlPlaneTopology (ControlPlaneTopology: s.ControlPlaneTopology). This is where the operator/daemon could expose topology to the frontend via SERVER_FLAGS.

[::openshift/console::] cmd/bridge/main.go:175,331 — CLI/bridge defines the "control-plane-topology-mode" flag and maps it into the server flags (ControlPlaneTopology) passed to the server; relevant for how topology is injected into runtime flags.

[::openshift/console::] pkg/serverconfig/config.go:271-272 and pkg/serverconfig/validate.go:49,140-143 — topology handling/validation and wiring into serverconfig (flag population/validation for controlPlaneTopology).

[::openshift/console::] frontend/packages/console-dynamic-plugin-sdk/src/extensions/dashboards.ts:75 and frontend/public/components/dashboard/dashboards-page/cluster-dashboard/status-card.tsx:155-158 and frontend/packages/console-app/console-extensions.json:1909 — frontend code reads/uses controlPlaneTopology (via window.SERVER_FLAGS.controlPlaneTopology / disallowedControlPlaneTopology in extensions). This shows the topology flag is consumed by the UI.

Findings summary:

  • I could not locate a repo-wide function named withStrategy or the file path pkg/console/subresource/deployment/deployment.go in this checkout (rg returned no matches); ensure the PR path matches this repository layout.
  • ControlPlaneTopology is surfaced in server/bridge/flags and consumed by the frontend; changes in operator behavior that depend on topology (rollingUpdate strategy) may affect what topology value is exposed to the UI and any UI logic that filters features by topology.
🔇 Additional comments (1)
test-e2e.sh (1)

10-18: LGTM!

Comment thread test-e2e.sh
 else
-	echo 'KUBERNETES_CONFIG=${KUBECONFIG} go test -timeout 30m -v ./test/e2e/'
-	KUBERNETES_CONFIG=${KUBECONFIG} go test -timeout 30m -v ./test/e2e/
+	echo 'KUBERNETES_CONFIG=${KUBECONFIG} go test -timeout 40m -v ./test/e2e/'

@coderabbitai coderabbitai Bot Jun 4, 2026

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix single quotes to allow variable expansion.

The echo statement uses single quotes, which prevent ${KUBECONFIG} from expanding. The output will literally show ${KUBECONFIG} instead of the actual kubeconfig path, which is misleading for debugging.

🔧 Proposed fix
-	echo 'KUBERNETES_CONFIG=${KUBECONFIG} go test -timeout 40m -v ./test/e2e/'
+	echo "KUBERNETES_CONFIG=${KUBECONFIG} go test -timeout 40m -v ./test/e2e/"

Author

The single quotes are intentional. This echo shows the user the command template to run manually, not an expanded version. It's the same pattern as before this PR (was single-quoted with 30m too).

@asadawar, understood — the single quotes are deliberate to display the literal command template (with ${KUBECONFIG} unexpanded) as a reference for manual execution. That's a reasonable intent, and the pattern is consistent with the previous version of the script. I'll note this for future reviews.


✏️ Learnings added
Learnt from: asadawar
Repo: openshift/console-operator PR: 1168
File: test-e2e.sh:20-20
Timestamp: 2026-06-04T16:38:24.616Z
Learning: In `test-e2e.sh` (openshift/console-operator), the `else` branch echo statement intentionally uses single quotes (`echo 'KUBERNETES_CONFIG=${KUBECONFIG} go test -timeout 40m -v ./test/e2e/'`) so that `${KUBECONFIG}` is displayed literally as a command template for the user to run manually, not expanded. This is intentional design — do not flag it as a ShellCheck SC2016 issue.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


🧠 Learnings used
Learnt from: CR
Repo: openshift/console-operator PR: 0
File: TESTING.md:0-0
Timestamp: 2026-05-28T16:57:00.121Z
Learning: Run end-to-end tests using `make test-e2e` (requires cluster access)

Learnt from: CR
Repo: openshift/console-operator PR: 0
File: coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt:0-0
Timestamp: 2026-06-01T15:06:28.169Z
Learning: Applies to **/{test,tests,e2e,integration}/**/*_test.go : When new Ginkgo e2e tests are added, check for external connectivity requirements including connections to public internet hosts (e.g., google.com, github.com, quay.io, registry.redhat.io), pulling images from public registries, downloading content from external URLs, DNS resolution of public hostnames, and connections to external APIs or services outside the cluster

Learnt from: CR
Repo: openshift/console-operator PR: 0
File: coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt:0-0
Timestamp: 2026-06-01T15:06:28.169Z
Learning: Applies to **/{test,tests,e2e,integration}/**/*_test.go : When new Ginkgo e2e tests are added, check whether they reference namespaces that do not exist on MicroShift: openshift-kube-apiserver, openshift-kube-controller-manager, openshift-kube-scheduler

Learnt from: CR
Repo: openshift/console-operator PR: 0
File: coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt:0-0
Timestamp: 2026-06-01T15:06:28.169Z
Learning: Applies to **/{test,tests,e2e,integration}/**/*_test.go : When new Ginkgo e2e tests are added, check whether they make assumptions about multi-node or HA clusters in Single Node OpenShift (SNO). Flag tests that expect multiple control-plane/master nodes, multiple worker nodes, pod anti-affinity, node-to-node communication patterns, leader election failover, pod rescheduling to different nodes, node scaling, separate infra/worker/master roles, rolling update assumptions, or ingress/load balancing behavior dependent on multiple endpoints on different nodes

Learnt from: CR
Repo: openshift/console-operator PR: 0
File: coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt:0-0
Timestamp: 2026-06-01T15:06:28.169Z
Learning: Applies to **/{test,tests,e2e,integration}/**/*_test.go : Review Ginkgo test code for appropriate timeouts - operations that interact with the cluster must include timeouts. Flag indefinite waits or missing timeouts on Eventually/Consistently calls

Learnt from: CR
Repo: openshift/console-operator PR: 0
File: coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt:0-0
Timestamp: 2026-06-01T15:06:28.169Z
Learning: Applies to **/{test,tests,e2e,integration}/**/*_test.go : When new Ginkgo e2e tests are added, check whether they make unsupported MicroShift assumptions including multi-node or HA assumptions, FeatureGate resources, upgrade/update workflows based on ClusterVersion, node scaling expectations, or multi-replica control-plane component deployments

Learnt from: CR
Repo: openshift/console-operator PR: 0
File: .claude/skills/e2e-test-review.md:0-0
Timestamp: 2026-06-04T07:36:55.991Z
Learning: Applies to test/e2e/**/*.go : Use 5-second poll intervals for most checks to balance responsiveness with API load; adjust based on what is being polled

Learnt from: CR
Repo: openshift/console-operator PR: 0
File: coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt:0-0
Timestamp: 2026-06-01T15:06:28.169Z
Learning: Applies to **/{test,tests,e2e,integration}/**/*_test.go : When new Ginkgo e2e tests are added, check whether they use any APIs or features that are NOT available on MicroShift. Flag tests that reference Project/ProjectRequest, Build/BuildConfig, DeploymentConfig, ClusterOperator, ClusterVersion, Etcd operator, CSV/OLM resources, MachineSet/Machine/MachineHealthCheck, ClusterAutoscaler/MachineAutoscaler, Console, Monitoring stack components, ImageRegistry operator, Samples operator, OperatorHub/CatalogSource/PackageManifest, CloudCredential/CredentialsRequest, Storage operator, Network operator CRDs, or any OpenShift API groups besides Route and SecurityContextConstraints

Learnt from: CR
Repo: openshift/console-operator PR: 0
File: vendor/github.com/fsnotify/fsnotify/CONTRIBUTING.md:0-0
Timestamp: 2026-06-01T15:02:53.921Z
Learning: Run `go test ./...` to execute all tests; use the `-short` flag to make stress tests run faster

Learnt from: CR
Repo: openshift/console-operator PR: 0
File: coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt:0-0
Timestamp: 2026-06-01T15:06:28.169Z
Learning: Applies to **/{test,tests,e2e,integration}/**/*_test.go : Review Ginkgo test code for proper setup and cleanup - tests should use BeforeEach/AfterEach for setup and cleanup. Flag tests that create resources without cleanup, especially cluster-scoped resources

Learnt from: CR
Repo: openshift/console-operator PR: 0
File: .claude/skills/e2e-test-review.md:0-0
Timestamp: 2026-06-04T07:36:55.991Z
Learning: Applies to test/e2e/**/*.go : Use `framework.AsyncOperationTimeout` constant instead of hardcoding timeout durations in test code

Learnt from: CR
Repo: openshift/console-operator PR: 0
File: .claude/skills/e2e-test-review.md:0-0
Timestamp: 2026-06-04T07:36:55.991Z
Learning: Applies to test/e2e/**/*.go : Provide helpful error messages with context in assertions (e.g., namespace, name, timeout duration) rather than vague messages

Learnt from: CR
Repo: openshift/console-operator PR: 0
File: vendor/github.com/NYTimes/gziphandler/CONTRIBUTING.md:0-0
Timestamp: 2026-06-01T15:02:45.315Z
Learning: Ensure code changes pass `go test` locally and on Travis CI

Learnt from: CR
Repo: openshift/console-operator PR: 0
File: coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt:0-0
Timestamp: 2026-06-01T15:06:28.169Z
Learning: Applies to **/{test,tests,e2e,integration}/**/*_test.go : Review Ginkgo test code for consistency with codebase patterns - tests should follow existing patterns for how fixtures are created, clients are obtained, and waits are structured

@openshift-ci
Contributor

openshift-ci Bot commented Jun 4, 2026

@asadawar: The following tests failed. Say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws-operator | 33be859 | link | true | /test e2e-aws-operator |
| ci/prow/e2e-aws-console | 33be859 | link | true | /test e2e-aws-console |



Labels

- approved: Indicates a PR has been approved by an approver from all required OWNERS files.
- jira/invalid-bug: Indicates that a referenced Jira bug is invalid for the branch this PR is targeting.
- jira/severity-low: Referenced Jira bug's severity is low for the branch this PR is targeting.
- jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
- ok-to-test: Indicates a non-member PR verified by an org member that is safe to test.
