OCPBUGS-86719: Use zero-downtime rollout strategy for console pods#1168
OCPBUGS-86719: Use zero-downtime rollout strategy for console pods#1168asadawar wants to merge 1 commit into
Conversation
|
@asadawar: This pull request references Jira Issue OCPBUGS-86719, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
WalkthroughThe PR makes console Deployment rolling update parameters topology-aware: DualReplica and HighlyAvailableArbiter use MaxSurge=1/MaxUnavailable=1; other topologies use MaxSurge=1/MaxUnavailable=0. Tests are updated and expanded for topology cases, and e2e test timeout increased to 40m. ChangesTopology-aware rolling update strategy
🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 13 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (13 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
Hi @asadawar. Thanks for your PR. I'm waiting for a openshift member to verify that this patch is reasonable to test. If it is, they should reply with Regular contributors should join the org to skip this step. Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
/ok-to-test |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: asadawar, jhadvig The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/test e2e-aws-operator |
|
New changes are detected. LGTM label has been removed. |
The rollout strategy change (maxUnavailable: 0) makes each deployment rollout take a bit longer since the new pod must be Ready before the old pod is terminated. Across the full test suite this adds enough time to push past the 30m limit, so the e2e test timeout is bumped from 30m to 40m. Assisted-by: Claude Code
5224b0b to
33be859
Compare
|
The e2e timeout was hitting 30m because the rollout strategy change (maxUnavailable: 0) makes each deployment rollout take a bit longer. The new pod has to be fully Ready before the old one gets terminated, which is the whole point of this PR, but it adds up across the full test suite. Bumped the test timeout from 30m to 40m to give enough headroom. |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test-e2e.sh`:
- Line 20: The echo line currently uses single quotes so ${KUBECONFIG} is not
expanded; update the echo in test-e2e.sh (the echo
'KUBERNETES_CONFIG=${KUBECONFIG} go test -timeout 40m -v ./test/e2e/') to use
double quotes so the shell expands ${KUBECONFIG} (i.e., change the quoting
around the echo argument to allow variable substitution).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 880576a7-c8ff-473a-b39a-de403c2d7608
📒 Files selected for processing (3)
pkg/console/subresource/deployment/deployment.gopkg/console/subresource/deployment/deployment_test.gotest-e2e.sh
🚧 Files skipped from review as they are similar to previous changes (2)
- pkg/console/subresource/deployment/deployment.go
- pkg/console/subresource/deployment/deployment_test.go
📜 Review details
🧰 Additional context used
📓 Path-based instructions (1)
{Makefile,makefile,**/Makefile,**/*.sh}
📄 CodeRabbit inference engine (AGENTS.md)
Use
GOFLAGS="-mod=vendor"for builds and tests to ensure vendored dependencies are used
Files:
test-e2e.sh
🪛 Shellcheck (0.11.0)
test-e2e.sh
[info] 20-20: Expressions don't expand in single quotes, use double quotes for that.
(SC2016)
🔀 Multi-repo context openshift/console
[::openshift/console::] pkg/server/server.go:759 — server constructs the JSON served flags including ControlPlaneTopology (ControlPlaneTopology: s.ControlPlaneTopology). This is where the operator/daemon could expose topology to the frontend via SERVER_FLAGS.
[::openshift/console::] cmd/bridge/main.go:175,331 — CLI/bridge defines the "control-plane-topology-mode" flag and maps it into the server flags (ControlPlaneTopology) passed to the server; relevant for how topology is injected into runtime flags.
[::openshift/console::] pkg/serverconfig/config.go:271-272 and pkg/serverconfig/validate.go:49,140-143 — topology handling/validation and wiring into serverconfig (flag population/validation for controlPlaneTopology).
[::openshift/console::] frontend/packages/console-dynamic-plugin-sdk/src/extensions/dashboards.ts:75 and frontend/public/components/dashboard/dashboards-page/cluster-dashboard/status-card.tsx:155-158 and frontend/packages/console-app/console-extensions.json:1909 — frontend code reads/uses controlPlaneTopology (via window.SERVER_FLAGS.controlPlaneTopology / disallowedControlPlaneTopology in extensions). This shows the topology flag is consumed by the UI.
Findings summary:
- I could not locate a repo-wide function named withStrategy or the file path pkg/console/subresource/deployment/deployment.go in this checkout (rg returned no matches); ensure the PR path matches this repository layout.
- ControlPlaneTopology is surfaced in server/bridge/flags and consumed by the frontend; changes in operator behavior that depend on topology (rollingUpdate strategy) may affect what topology value is exposed to the UI and any UI logic that filters features by topology.
🔇 Additional comments (1)
test-e2e.sh (1)
10-18: LGTM!
| else | ||
| echo 'KUBERNETES_CONFIG=${KUBECONFIG} go test -timeout 30m -v ./test/e2e/' | ||
| KUBERNETES_CONFIG=${KUBECONFIG} go test -timeout 30m -v ./test/e2e/ | ||
| echo 'KUBERNETES_CONFIG=${KUBECONFIG} go test -timeout 40m -v ./test/e2e/' |
There was a problem hiding this comment.
Fix single quotes to allow variable expansion.
The echo statement uses single quotes, which prevent ${KUBECONFIG} from expanding. The output will literally show ${KUBECONFIG} instead of the actual kubeconfig path, which is misleading for debugging.
🔧 Proposed fix
- echo 'KUBERNETES_CONFIG=${KUBECONFIG} go test -timeout 40m -v ./test/e2e/'
+ echo "KUBERNETES_CONFIG=${KUBECONFIG} go test -timeout 40m -v ./test/e2e/"📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| echo 'KUBERNETES_CONFIG=${KUBECONFIG} go test -timeout 40m -v ./test/e2e/' | |
| echo "KUBERNETES_CONFIG=${KUBECONFIG} go test -timeout 40m -v ./test/e2e/" |
🧰 Tools
🪛 Shellcheck (0.11.0)
[info] 20-20: Expressions don't expand in single quotes, use double quotes for that.
(SC2016)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@test-e2e.sh` at line 20, The echo line currently uses single quotes so
${KUBECONFIG} is not expanded; update the echo in test-e2e.sh (the echo
'KUBERNETES_CONFIG=${KUBECONFIG} go test -timeout 40m -v ./test/e2e/') to use
double quotes so the shell expands ${KUBECONFIG} (i.e., change the quoting
around the echo argument to allow variable substitution).
There was a problem hiding this comment.
The single quotes are intentional. This echo shows the user the command template to run manually, not an expanded version. It's the same pattern as before this PR (was single-quoted with 30m too).
There was a problem hiding this comment.
@asadawar, understood — the single quotes are deliberate to display the literal command template (with ${KUBECONFIG} unexpanded) as a reference for manual execution. That's a reasonable intent, and the pattern is consistent with the previous version of the script. I'll note this for future reviews.
✏️ Learnings added
Learnt from: asadawar
Repo: openshift/console-operator PR: 1168
File: test-e2e.sh:20-20
Timestamp: 2026-06-04T16:38:24.616Z
Learning: In `test-e2e.sh` (openshift/console-operator), the `else` branch echo statement intentionally uses single quotes (`echo 'KUBERNETES_CONFIG=${KUBECONFIG} go test -timeout 40m -v ./test/e2e/'`) so that `${KUBECONFIG}` is displayed literally as a command template for the user to run manually, not expanded. This is intentional design — do not flag it as a ShellCheck SC2016 issue.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
🧠 Learnings used
Learnt from: CR
Repo: openshift/console-operator PR: 0
File: TESTING.md:0-0
Timestamp: 2026-05-28T16:57:00.121Z
Learning: Run end-to-end tests using `make test-e2e` (requires cluster access)
Learnt from: CR
Repo: openshift/console-operator PR: 0
File: coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt:0-0
Timestamp: 2026-06-01T15:06:28.169Z
Learning: Applies to **/{test,tests,e2e,integration}/**/*_test.go : When new Ginkgo e2e tests are added, check for external connectivity requirements including connections to public internet hosts (e.g., google.com, github.com, quay.io, registry.redhat.io), pulling images from public registries, downloading content from external URLs, DNS resolution of public hostnames, and connections to external APIs or services outside the cluster
Learnt from: CR
Repo: openshift/console-operator PR: 0
File: coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt:0-0
Timestamp: 2026-06-01T15:06:28.169Z
Learning: Applies to **/{test,tests,e2e,integration}/**/*_test.go : When new Ginkgo e2e tests are added, check whether they reference namespaces that do not exist on MicroShift: openshift-kube-apiserver, openshift-kube-controller-manager, openshift-kube-scheduler
Learnt from: CR
Repo: openshift/console-operator PR: 0
File: coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt:0-0
Timestamp: 2026-06-01T15:06:28.169Z
Learning: Applies to **/{test,tests,e2e,integration}/**/*_test.go : When new Ginkgo e2e tests are added, check whether they make assumptions about multi-node or HA clusters in Single Node OpenShift (SNO). Flag tests that expect multiple control-plane/master nodes, multiple worker nodes, pod anti-affinity, node-to-node communication patterns, leader election failover, pod rescheduling to different nodes, node scaling, separate infra/worker/master roles, rolling update assumptions, or ingress/load balancing behavior dependent on multiple endpoints on different nodes
Learnt from: CR
Repo: openshift/console-operator PR: 0
File: coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt:0-0
Timestamp: 2026-06-01T15:06:28.169Z
Learning: Applies to **/{test,tests,e2e,integration}/**/*_test.go : Review Ginkgo test code for appropriate timeouts - operations that interact with the cluster must include timeouts. Flag indefinite waits or missing timeouts on Eventually/Consistently calls
Learnt from: CR
Repo: openshift/console-operator PR: 0
File: coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt:0-0
Timestamp: 2026-06-01T15:06:28.169Z
Learning: Applies to **/{test,tests,e2e,integration}/**/*_test.go : When new Ginkgo e2e tests are added, check whether they make unsupported MicroShift assumptions including multi-node or HA assumptions, FeatureGate resources, upgrade/update workflows based on ClusterVersion, node scaling expectations, or multi-replica control-plane component deployments
Learnt from: CR
Repo: openshift/console-operator PR: 0
File: .claude/skills/e2e-test-review.md:0-0
Timestamp: 2026-06-04T07:36:55.991Z
Learning: Applies to test/e2e/**/*.go : Use 5-second poll intervals for most checks to balance responsiveness with API load; adjust based on what is being polled
Learnt from: CR
Repo: openshift/console-operator PR: 0
File: coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt:0-0
Timestamp: 2026-06-01T15:06:28.169Z
Learning: Applies to **/{test,tests,e2e,integration}/**/*_test.go : When new Ginkgo e2e tests are added, check whether they use any APIs or features that are NOT available on MicroShift. Flag tests that reference Project/ProjectRequest, Build/BuildConfig, DeploymentConfig, ClusterOperator, ClusterVersion, Etcd operator, CSV/OLM resources, MachineSet/Machine/MachineHealthCheck, ClusterAutoscaler/MachineAutoscaler, Console, Monitoring stack components, ImageRegistry operator, Samples operator, OperatorHub/CatalogSource/PackageManifest, CloudCredential/CredentialsRequest, Storage operator, Network operator CRDs, or any OpenShift API groups besides Route and SecurityContextConstraints
Learnt from: CR
Repo: openshift/console-operator PR: 0
File: vendor/github.com/fsnotify/fsnotify/CONTRIBUTING.md:0-0
Timestamp: 2026-06-01T15:02:53.921Z
Learning: Run `go test ./...` to execute all tests; use the `-short` flag to make stress tests run faster
Learnt from: CR
Repo: openshift/console-operator PR: 0
File: coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt:0-0
Timestamp: 2026-06-01T15:06:28.169Z
Learning: Applies to **/{test,tests,e2e,integration}/**/*_test.go : Review Ginkgo test code for proper setup and cleanup - tests should use BeforeEach/AfterEach for setup and cleanup. Flag tests that create resources without cleanup, especially cluster-scoped resources
Learnt from: CR
Repo: openshift/console-operator PR: 0
File: .claude/skills/e2e-test-review.md:0-0
Timestamp: 2026-06-04T07:36:55.991Z
Learning: Applies to test/e2e/**/*.go : Use `framework.AsyncOperationTimeout` constant instead of hardcoding timeout durations in test code
Learnt from: CR
Repo: openshift/console-operator PR: 0
File: .claude/skills/e2e-test-review.md:0-0
Timestamp: 2026-06-04T07:36:55.991Z
Learning: Applies to test/e2e/**/*.go : Provide helpful error messages with context in assertions (e.g., namespace, name, timeout duration) rather than vague messages
Learnt from: CR
Repo: openshift/console-operator PR: 0
File: vendor/github.com/NYTimes/gziphandler/CONTRIBUTING.md:0-0
Timestamp: 2026-06-01T15:02:45.315Z
Learning: Ensure code changes pass `go test` locally and on Travis CI
Learnt from: CR
Repo: openshift/console-operator PR: 0
File: coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt:0-0
Timestamp: 2026-06-01T15:06:28.169Z
Learning: Applies to **/{test,tests,e2e,integration}/**/*_test.go : Review Ginkgo test code for consistency with codebase patterns - tests should follow existing patterns for how fixtures are created, clients are obtained, and waits are structured
|
@asadawar: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Summary
maxSurge=3, maxUnavailable=1tomaxSurge=1, maxUnavailable=0on 3+ node topologies (HighlyAvailable, External+HA), ensuring no old pod is terminated until its replacement passes readiness checksmaxUnavailable=1withmaxSurgereduced from 3 to 1 to avoid rollout deadlock caused by required pod anti-affinityWhy this approach
Three approaches were considered:
1.
maxUnavailable=0for all HA topologies (rejected)On DualReplica (2 masters, 2 replicas) and HighlyAvailableArbiter (2 full masters + 1 arbiter) clusters, the console deployment uses
RequiredDuringSchedulingIgnoredDuringExecutionpod anti-affinity onkubernetes.io/hostname. When every eligible node already runs a console pod, the scheduler cannot place a surge pod. WithmaxUnavailable=0, no old pod can be terminated either, causing a rollout deadlock that stalls untilProgressDeadlineExceeded(10 minutes). This approach was rejected because it would break recently added DualReplica support (PR #1151, merged 2026-05-07).2. Keep
maxUnavailable=1for all topologies, only reducemaxSurge(rejected)Reducing
maxSurgefrom 3 to 1 aligns with other operators (CMO monitoring-plugin usesmaxUnavailable=1with defaultmaxSurge) but does not fix the reported bug. WithmaxUnavailable=1, Kubernetes is still allowed to terminate one old pod before its replacement is ready, causing the console flap. This approach was rejected because it does not address the root cause.3. Topology-aware strategy (chosen)
Use
maxUnavailable=0on topologies where a free node is available for the surge pod (HighlyAvailable with 3+ masters, External+HA with multiple workers), andmaxUnavailable=1on constrained topologies (DualReplica, HighlyAvailableArbiter) where rollout deadlock is possible. This fixes the bug for the most common topology while preserving correct behavior on constrained clusters.For the HighlyAvailableArbiter case, the conservative choice (
maxUnavailable=1) was made because arbiter nodes may have taints or resource constraints that prevent scheduling console pods, effectively making it a 2-node topology for console scheduling. Maintainers familiar with arbiter node scheduling can adjust this if arbiter nodes are known to be eligible.Root cause
The
withStrategyfunction inpkg/console/subresource/deployment/deployment.go:184setmaxSurge=3, maxUnavailable=1for all HA topologies. These values were introduced in PR #1107 (OCPBUGS-74872) as part of a refactor that moved deployment construction from bindata to Go code, without specific rationale for the strategy values.With
maxUnavailable=1and 2 replicas, the Kubernetes deployment controller is allowed to terminate one old pod immediately when a rollout starts, even before any new pod is ready. This creates a window (approximately 10-15 seconds based on observed pod startup times) where only one pod serves traffic. During this window:Cluster verification
Verified on a live OCP 4.22.0-rc.4 vSphere IPI cluster:
Cluster topology:
Current strategy (before fix):
Pod distribution (2 pods on 2 of 3 masters, 3rd master free for surge):
With the fix applied (
maxSurge=1, maxUnavailable=0), the rollout behavior would be:At no point does availability drop below 2 (full capacity).
Test plan
make test-unit): all deployment strategy tests updated and passinggofmtandgovetclean (make check)OWNERS
/cc @spadgett @jhadvig @TheRealJon
Bug: https://issues.redhat.com/browse/OCPBUGS-86719
Summary by CodeRabbit
Bug Fixes
Tests
Chores / CI