OCPBUGS-69936: Try to stagger some test groups to help with high cpu disruption cases#31024
OCPBUGS-69936: Try to stagger some test groups to help with high cpu disruption cases#31024xueqzhan wants to merge 2 commits intoopenshift:mainfrom
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: automatic mode |
|
@xueqzhan: This pull request references Jira Issue OCPBUGS-69936, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: xueqzhan The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-ovn-upgrade-fips |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository: openshift/coderabbit/.coderabbit.yaml Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
WalkthroughAdded an optional worker-start stagger to the parallel-by-file test queue. A new Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 9 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (9 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
@xueqzhan: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/f0d3fc60-39e3-11f1-8358-bf5fcf6ed1a5-0 |
There was a problem hiding this comment.
🧹 Nitpick comments (2)
pkg/test/ginkgo/cmd_runsuite.go (1)
665-665: Extract the stagger interval into a named constant.The same
200*time.Millisecondvalue is now duplicated across both bucket call sites. Pulling it into a shared constant will make future tuning less error-prone.Also applies to: 719-719
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/test/ginkgo/cmd_runsuite.go` at line 665, Extract the duplicated literal 200*time.Millisecond into a named package-level constant (e.g., staggerInterval or executeStaggerInterval) and replace the two direct uses in the calls to q.ExecuteWithStagger with that constant; update both call sites that currently pass 200*time.Millisecond so they use the new constant for easier tuning and to avoid duplication.pkg/test/ginkgo/queue_test.go (1)
75-79: Add coverage for the new stagger path.This only updates the helper signature; it still doesn't verify that staggered workers actually start later or that a canceled context stops delayed workers promptly. A focused test around
ExecuteWithStagger/execute(..., workerStagger)would make the new scheduling behavior much safer to change.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/test/ginkgo/queue_test.go` around lines 75 - 79, Add a focused test for the staggered-worker path: create a new test (e.g., Test_execute_staggered) that uses the updated testingSuiteRunner helper (testingSuiteRunner) instrumented to record each worker's actual start time, then call execute/ExecuteWithStagger with a non-zero workerStagger (e.g., 50–100ms) and assert that recorded start times for successive workers respect the stagger delta (i.e., each start >= previous start + stagger within a small tolerance). Also add a cancellation subtest that runs execute with a long stagger and a canceling context to verify that canceled context prevents delayed workers from starting (no new start timestamps after cancellation time). Ensure you reference and use execute and/or ExecuteWithStagger and the testingSuiteRunner to locate where to hook timestamps.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@pkg/test/ginkgo/cmd_runsuite.go`:
- Line 665: Extract the duplicated literal 200*time.Millisecond into a named
package-level constant (e.g., staggerInterval or executeStaggerInterval) and
replace the two direct uses in the calls to q.ExecuteWithStagger with that
constant; update both call sites that currently pass 200*time.Millisecond so
they use the new constant for easier tuning and to avoid duplication.
In `@pkg/test/ginkgo/queue_test.go`:
- Around line 75-79: Add a focused test for the staggered-worker path: create a
new test (e.g., Test_execute_staggered) that uses the updated testingSuiteRunner
helper (testingSuiteRunner) instrumented to record each worker's actual start
time, then call execute/ExecuteWithStagger with a non-zero workerStagger (e.g.,
50–100ms) and assert that recorded start times for successive workers respect
the stagger delta (i.e., each start >= previous start + stagger within a small
tolerance). Also add a cancellation subtest that runs execute with a long
stagger and a canceling context to verify that canceled context prevents delayed
workers from starting (no new start timestamps after cancellation time). Ensure
you reference and use execute and/or ExecuteWithStagger and the
testingSuiteRunner to locate where to hook timestamps.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: a299422c-ccfc-4cf4-8d35-9f82659e6ca9
📒 Files selected for processing (3)
pkg/test/ginkgo/cmd_runsuite.gopkg/test/ginkgo/queue.gopkg/test/ginkgo/queue_test.go
|
/payload-aggregate periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-ovn-upgrade-fips 10 |
|
@xueqzhan: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/b994b260-39e5-11f1-9462-98fe04071209-0 |
|
Scheduling required tests: |
|
/payload-aggregate periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-ovn-upgrade-fips 10 |
|
@xueqzhan: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/78154f90-3a60-11f1-818c-0812ff93230d-0 |
|
Scheduling required tests: |
|
Job Failure Risk Analysis for sha: 40a31f6
|
Summary by CodeRabbit