OCPBUGS-55736: disable --auto-gomemlimit for Prometheus on SNO until we can ensure it won't result in excessive CPU usage #2549

machine424 · 2025-01-06T13:23:32Z

…il we can ensure it won't result in excessive CPU usage

I added CHANGELOG entry for this change.
No user facing changes, so no entry in CHANGELOG was needed.

machine424 · 2025-01-06T14:21:07Z

/hold

rexagod

Since no ticket is linked to this, I was wondering whether we have seen any instances of the 10% memory reduction bottlenecking the CPU on SNO?
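For context on the 10% figure: the change under review describes the automatic setting as GOMEMLIMIT at 0.9 of the detected maximum container or system memory. The sketch below shows that arithmetic using the Go runtime's soft memory limit. It is illustrative only: `memLimitFromRatio` is a hypothetical helper and the 16 GiB total is an assumed value, not something taken from Prometheus or this PR.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// memLimitFromRatio returns a soft memory limit in bytes for a detected
// memory ceiling and a ratio. Hypothetical helper, for illustration only.
func memLimitFromRatio(totalBytes int64, ratio float64) int64 {
	return int64(float64(totalBytes) * ratio)
}

func main() {
	// Assumption: 16 GiB of detected container or system memory.
	const totalBytes = int64(16) << 30

	limit := memLimitFromRatio(totalBytes, 0.9)

	// SetMemoryLimit is the programmatic equivalent of the GOMEMLIMIT
	// environment variable; it returns the previously configured limit.
	prev := debug.SetMemoryLimit(limit)

	fmt.Printf("soft limit: %d bytes (previous limit: %d)\n", limit, prev)
}
```

As the live heap approaches a soft limit, the Go garbage collector runs more often to stay under it, which is where the CPU concern on a memory-constrained single node comes from.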

rexagod · 2025-01-07T09:33:06Z

pkg/manifests/manifests.go

@@ -1491,7 +1491,7 @@ func (f *Factory) PrometheusK8s(grpcTLS *v1.Secret, telemetrySecret *v1.Secret)
 	return p, nil
 }

-func (f *Factory) setupGoGC(p *monv1.Prometheus) {
+func (f *Factory) adjustGoGCConfig(p *monv1.Prometheus) {


Maybe something like:

Suggested change

-func (f *Factory) adjustGoGCConfig(p *monv1.Prometheus) {
+func (f *Factory) adjustGoSettings(p *monv1.Prometheus) {

Since this affects the GOMEMLIMIT too now.

Renaming it to adjustGoGCRelatedConfig, because GOMEMLIMIT is also related to garbage collection.

rexagod · 2025-01-07T09:36:32Z

pkg/manifests/manifests_test.go

 				for _, env := range c.Env {
 					require.NotEqual(t, env.Name, "GOGC")
 				}
 				return
 			}

-			require.Contains(t, c.Env, v1.EnvVar{Name: "GOGC", Value: tc.exp})
+			require.Contains(t, c.Env, v1.EnvVar{Name: "GOGC", Value: tc.expectedGOGC})


rexagod · 2025-01-07T09:43:46Z

pkg/manifests/manifests_test.go

-			require.Contains(t, c.Env, v1.EnvVar{Name: "GOGC", Value: tc.exp})
+			require.Contains(t, c.Env, v1.EnvVar{Name: "GOGC", Value: tc.expectedGOGC})
+
+			require.Equal(t, tc.autoGOMEMLIMITDisabled, argumentPresent(*c, "--no-auto-gomemlimit"))


Could we drop the tc.autoGOMEMLIMITDisabled field? It can be safely derived from ir.HighlyAvailableInfrastructure(), since that is the only condition that currently decides whether the auto setting is disabled (otherwise it is enabled).

Suggested change

-require.Equal(t, tc.autoGOMEMLIMITDisabled, argumentPresent(*c, "--no-auto-gomemlimit"))
+require.Equal(t, tc.ir.HighlyAvailableInfrastructure(), argumentPresent(*c, "--no-auto-gomemlimit"))

Yep, but I'm trying to avoid repeating the same logic in the tests, in case something is broken or wrong somewhere.
Also having it explicit makes reading the test cases easier.
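The trade-off discussed here can be sketched as a small table-driven check in which every case carries its own explicit expectation instead of recomputing it from the input. Everything below is a stand-in written for illustration: `container`, `buildContainer`, and this `argumentPresent` are hypothetical simplifications, not the repository's actual types or helpers.

```go
package main

import "fmt"

// container is a minimal stand-in for the Kubernetes container type.
type container struct {
	Args []string
}

// argumentPresent reports whether arg appears in the container's args.
// Hypothetical re-implementation of the helper named in the test.
func argumentPresent(c container, arg string) bool {
	for _, a := range c.Args {
		if a == arg {
			return true
		}
	}
	return false
}

// buildContainer is a hypothetical stand-in for the code under test:
// it disables the auto setting only on non-highly-available (SNO) setups.
func buildContainer(highlyAvailable bool) container {
	c := container{Args: []string{"--config.file=/etc/prometheus/prometheus.yml"}}
	if !highlyAvailable {
		c.Args = append(c.Args, "--no-auto-gomemlimit")
	}
	return c
}

type testCase struct {
	name                   string
	highlyAvailable        bool
	autoGOMEMLIMITDisabled bool // explicit expectation, not derived from the input
}

var cases = []testCase{
	{name: "highly available", highlyAvailable: true, autoGOMEMLIMITDisabled: false},
	{name: "single node", highlyAvailable: false, autoGOMEMLIMITDisabled: true},
}

// failures returns the names of the cases whose expectation is not met.
func failures(cs []testCase) []string {
	var out []string
	for _, tc := range cs {
		got := argumentPresent(buildContainer(tc.highlyAvailable), "--no-auto-gomemlimit")
		if got != tc.autoGOMEMLIMITDisabled {
			out = append(out, tc.name)
		}
	}
	return out
}

func main() {
	fmt.Printf("failing cases: %d\n", len(failures(cases)))
}
```

Because the expectation is written out per case, a mistake in the production condition shows up as a failing case rather than being mirrored by the test.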

rexagod · 2025-01-07T09:50:28Z

I'm asking #2549 (review) as any observed insight should help me set a more meaningful buffer threshold in kubernetes-monitoring/kubernetes-mixin#1010 (comment).

machine424 · 2025-01-07T10:47:44Z

Thanks for the review, this is still WIP actually, requires openshift/prometheus#227. I've marked it as such. I'll get back to you later.

openshift-ci-robot · 2025-04-14T07:33:38Z

@machine424: This pull request references MON-4200 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.19.0" version, but no target version was set.

In response to this:

…il we can ensure it won't result in excessive CPU usage

requires openshift/prometheus#227

I added CHANGELOG entry for this change.

No user facing changes, so no entry in CHANGELOG was needed.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

jan--f · 2025-04-14T08:33:11Z

/lgtm

rexagod · 2025-04-14T08:38:03Z

/lgtm

rexagod · 2025-04-14T08:38:45Z

Needs a make versions?

machine424 · 2025-04-14T18:09:26Z

/retest-required

machine424 · 2025-04-14T18:09:49Z

Needs a make versions?

I'd let the bot take care of that, so as not to pollute the PR.

simonpasquier · 2025-04-15T07:15:33Z

pkg/manifests/manifests.go

+		// Until we're certain setting GOMEMLIMIT to 0.9 (default ratio) of detected maximum
+		// container or system memory won't result in excessive CPU usage, we're disabling the
+		// auto setting for SNO.
+		p.Spec.Containers[i].Args = append(p.Spec.Containers[i].Args, "--no-auto-gomemlimit")


Shouldn't this rather use p.Spec.AdditionalArgs?
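A rough sketch of what that alternative could look like, using local stand-in types so it compiles on its own. `Argument` and `prometheusSpec` are simplified, hypothetical mirrors of the prometheus-operator types, and passing the flag name without leading dashes is an assumption about how the operator renders additional arguments.

```go
package main

import "fmt"

// Argument is a local stand-in for a name/value argument pair.
type Argument struct {
	Name  string
	Value string
}

// prometheusSpec is a hypothetical, trimmed-down stand-in for the spec.
type prometheusSpec struct {
	AdditionalArgs []Argument
}

// disableAutoGOMEMLIMIT adds the flag through AdditionalArgs instead of
// writing container args directly. It is idempotent: calling it twice
// does not add the flag twice.
func disableAutoGOMEMLIMIT(spec *prometheusSpec) {
	// Assumption: the name is given without the leading "--".
	const name = "no-auto-gomemlimit"
	for _, a := range spec.AdditionalArgs {
		if a.Name == name {
			return
		}
	}
	spec.AdditionalArgs = append(spec.AdditionalArgs, Argument{Name: name})
}

func main() {
	spec := &prometheusSpec{}
	disableAutoGOMEMLIMIT(spec)
	fmt.Printf("additional args: %+v\n", spec.AdditionalArgs)
}
```

Going through a dedicated additional-arguments field leaves the final container args to be assembled in one place, rather than appended to a list that may later be overridden.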

Good catch, even though we're overriding all the args, no test seems to be failing...
e.g.

cluster-monitoring-operator/test/e2e/config_test.go

Line 231 in 5ee3bd4

func TestClusterMonitorPrometheusK8Config(t *testing.T) {

I'll take a look
/hold

Well, that's because we don't run the CMO e2e tests on SNO.
e2e-aws-ovn-single-node is failing for this reason, though, but it's not a required job.
Maybe we can consider making it required, now that the upgrade one is payload-blocking: https://issues.redhat.com//browse/OCPEDGE-799

yes IIUC we were waiting for SNO to be a blocking job before moving it to required here (it was too flaky initially).

I'll check with the SNO team; we'd probably pick the same upgrade job as the payload one...

machine424 · 2025-04-16T09:35:49Z

/retest-required

jan--f · 2025-04-17T08:02:11Z

/lgtm

jan--f · 2025-04-17T08:35:24Z

/retest-required

machine424 · 2025-04-17T14:05:36Z

/retest

machine424 · 2025-04-18T06:39:09Z

/retest-required

machine424 · 2025-04-18T14:37:42Z

/retest-required

machine424 · 2025-04-23T06:35:09Z

/retest

machine424 · 2025-04-24T12:02:35Z

/retest-required

machine424 · 2025-04-25T19:38:04Z

/retest-required

openshift-ci · 2025-04-27T23:43:41Z

@machine424: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/okd-scos-e2e-aws-ovn | `9ec8f5c` | link | false | `/test okd-scos-e2e-aws-ovn` |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

machine424 · 2025-04-29T07:25:57Z

/hold cancel
/label acknowledge-critical-fixes-only

machine424 · 2025-05-01T22:51:18Z

/retest-required

danielmellado · 2025-05-02T06:50:19Z

/lgtm

openshift-ci · 2025-05-02T06:51:01Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danielmellado, jan--f, machine424, rexagod

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [danielmellado,jan--f,machine424,rexagod]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-bot · 2025-05-02T11:51:11Z

[ART PR BUILD NOTIFIER]

Distgit: cluster-monitoring-operator
This PR has been included in build cluster-monitoring-operator-container-v4.20.0-202505021117.p0.g4dbb28d.assembly.stream.el9.
All builds following this will include this PR.

machine424 · 2025-05-06T07:37:48Z

/jira backport release-4.19

openshift-ci-robot · 2025-05-06T07:37:50Z

@machine424: The following backport issues have been created:

Queuing cherrypicks to the requested branches to be created after this PR merges:
/cherrypick release-4.19

openshift-cherrypick-robot · 2025-05-06T07:38:37Z

@openshift-ci-robot: new pull request created: #2597

machine424 · 2025-05-06T07:39:48Z

/retitle OCPBUGS-55736: disable --auto-gomemlimit for Prometheus on SNO until we can ensure it won't result in excessive CPU usage

openshift-ci-robot · 2025-05-06T07:39:55Z

@machine424: Jira Issue OCPBUGS-55736 is in an unrecognized state (Closed) and will not be moved to the MODIFIED state.

openshift-ci bot requested review from marioferh and rexagod January 6, 2025 13:24

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 6, 2025

openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 6, 2025

rexagod reviewed Jan 7, 2025

View reviewed changes

machine424 marked this pull request as draft January 7, 2025 10:45

openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 7, 2025

machine424 changed the title ~~chore(manifests): disable --auto-gomemlimit for Prometheus on SNO unt…~~ MON-4200: disable --auto-gomemlimit for Prometheus on SNO until we can ensure it won't result in excessive CPU usage Apr 14, 2025

openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Apr 14, 2025

machine424 force-pushed the sno-auto-xxx branch from 8e147f7 to 290c80d Compare April 14, 2025 07:37

machine424 marked this pull request as ready for review April 14, 2025 07:38

openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 14, 2025

openshift-ci bot requested review from jan--f and rexagod April 14, 2025 07:39

openshift-ci bot assigned jan--f Apr 14, 2025

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Apr 14, 2025

openshift-ci bot assigned rexagod Apr 14, 2025

simonpasquier reviewed Apr 15, 2025

View reviewed changes

machine424 force-pushed the sno-auto-xxx branch from 290c80d to 19dc083 Compare April 16, 2025 08:16

openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Apr 16, 2025

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Apr 17, 2025

chore(manifests): disable --auto-gomemlimit for Prometheus on SNO unt…

9ec8f5c

…il we can ensure it won't result in excessive CPU usage

machine424 force-pushed the sno-auto-xxx branch from 19dc083 to 9ec8f5c Compare April 27, 2025 20:55

openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Apr 27, 2025

openshift-ci bot added acknowledge-critical-fixes-only Indicates if the issuer of the label is OK with the policy. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Apr 29, 2025

openshift-ci bot assigned danielmellado May 2, 2025

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 2, 2025

openshift-merge-bot bot merged commit 4dbb28d into openshift:main May 2, 2025
20 of 21 checks passed

openshift-cherrypick-robot mentioned this pull request May 6, 2025

OCPBUGS-55737: [release-4.19] MON-4200: disable --auto-gomemlimit for Prometheus on SNO until we can ensure it won't result in excessive CPU usage #2597

Merged

openshift-ci bot changed the title ~~MON-4200: disable --auto-gomemlimit for Prometheus on SNO until we can ensure it won't result in excessive CPU usage~~ OCPBUGS-55736: disable --auto-gomemlimit for Prometheus on SNO until we can ensure it won't result in excessive CPU usage May 6, 2025
