OCPBUGS-55736: disable --auto-gomemlimit for Prometheus on SNO until we can ensure it won't result in excessive CPU usage #2549

Merged
merged 1 commit into from
May 2, 2025

Conversation

machine424
Contributor

@machine424 machine424 commented Jan 6, 2025

…il we can ensure it won't result in excessive CPU usage

requires openshift/prometheus#227

  • I added CHANGELOG entry for this change.
  • No user facing changes, so no entry in CHANGELOG was needed.

@openshift-ci openshift-ci bot requested review from marioferh and rexagod January 6, 2025 13:24
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 6, 2025
@machine424
Contributor Author

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 6, 2025
Member

@rexagod rexagod left a comment

Since no ticket is linked to this PR, I was wondering whether we have seen any instances of the 10% memory reduction bottlenecking the CPU on SNO?

@@ -1491,7 +1491,7 @@ func (f *Factory) PrometheusK8s(grpcTLS *v1.Secret, telemetrySecret *v1.Secret)
 	return p, nil
 }

-func (f *Factory) setupGoGC(p *monv1.Prometheus) {
+func (f *Factory) adjustGoGCConfig(p *monv1.Prometheus) {
Member

Maybe something like:

Suggested change
-func (f *Factory) adjustGoGCConfig(p *monv1.Prometheus) {
+func (f *Factory) adjustGoSettings(p *monv1.Prometheus) {

Since this affects the GOMEMLIMIT too now.

Contributor Author

Renaming it to adjustGoGCRelatedConfig, because GOMEMLIMIT is also related to garbage collection.

 for _, env := range c.Env {
 	require.NotEqual(t, env.Name, "GOGC")
 }
 return
 }

-require.Contains(t, c.Env, v1.EnvVar{Name: "GOGC", Value: tc.exp})
+require.Contains(t, c.Env, v1.EnvVar{Name: "GOGC", Value: tc.expectedGOGC})
Member

+100!

-require.Contains(t, c.Env, v1.EnvVar{Name: "GOGC", Value: tc.exp})
+require.Contains(t, c.Env, v1.EnvVar{Name: "GOGC", Value: tc.expectedGOGC})

+require.Equal(t, tc.autoGOMEMLIMITDisabled, argumentPresent(*c, "--no-auto-gomemlimit"))
Member

Could we drop the tc.autoGOMEMLIMITDisabled field? It can be safely derived from ir.HighlyAvailableInfrastructure(), since that is the only condition that currently decides whether this is disabled (it is enabled otherwise).

Suggested change
-require.Equal(t, tc.autoGOMEMLIMITDisabled, argumentPresent(*c, "--no-auto-gomemlimit"))
+require.Equal(t, tc.ir.HighlyAvailableInfrastructure(), argumentPresent(*c, "--no-auto-gomemlimit"))

Contributor Author

Yep, but I'm trying to avoid duplicating the same logic in the tests, in case something is broken or wrong somewhere.
Having it explicit also makes the test cases easier to read.

@rexagod
Member

rexagod commented Jan 7, 2025

I'm asking in #2549 (review) because any observed insight should help me set a more meaningful buffer threshold in kubernetes-monitoring/kubernetes-mixin#1010 (comment).

@machine424 machine424 marked this pull request as draft January 7, 2025 10:45
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 7, 2025
@machine424
Contributor Author

Thanks for the review. This is actually still WIP and requires openshift/prometheus#227; I've marked it as such and I'll get back to you later.

@machine424 machine424 changed the title chore(manifests): disable --auto-gomemlimit for Prometheus on SNO unt… MON-4200: disable --auto-gomemlimit for Prometheus on SNO until we can ensure it won't result in excessive CPU usage Apr 14, 2025
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Apr 14, 2025
@openshift-ci-robot
Contributor

openshift-ci-robot commented Apr 14, 2025

@machine424: This pull request references MON-4200 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.19.0" version, but no target version was set.

@machine424 machine424 marked this pull request as ready for review April 14, 2025 07:38
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 14, 2025
@openshift-ci openshift-ci bot requested review from jan--f and rexagod April 14, 2025 07:39
@jan--f
Contributor

jan--f commented Apr 14, 2025

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Apr 14, 2025
@rexagod
Member

rexagod commented Apr 14, 2025

/lgtm

@rexagod
Member

rexagod commented Apr 14, 2025

Needs a make versions?

@machine424
Contributor Author

/retest-required

@machine424
Contributor Author

Needs a make versions?

I'd let the bot take care of that, so as not to pollute the PR.

// Until we're certain setting GOMEMLIMIT to 0.9 (default ratio) of detected maximum
// container or system memory won't result in excessive CPU usage, we're disabling the
// auto setting for SNO.
p.Spec.Containers[i].Args = append(p.Spec.Containers[i].Args, "--no-auto-gomemlimit")
Contributor

Shouldn't this rather use p.Spec.AdditionalArgs?

Contributor Author

Good catch. Even though we're overriding all the args, no test seems to be failing,
e.g.

func TestClusterMonitorPrometheusK8Config(t *testing.T) {

I'll take a look
/hold

Contributor Author

Well, that's because we don't run the CMO e2e tests on SNO.
e2e-aws-ovn-single-node is failing for this reason, though it's not a required job.
Maybe we can consider making it required now that the upgrade one is payload-blocking: https://issues.redhat.com//browse/OCPEDGE-799

Contributor

Yes, IIUC we were waiting for SNO to be a blocking job before making it required here (it was too flaky initially).

Contributor Author

I'll check with the SNO team; we'd probably pick the same upgrade job as the payload one...

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Apr 16, 2025
@machine424
Contributor Author

/retest-required

@jan--f
Contributor

jan--f commented Apr 17, 2025

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Apr 17, 2025
@jan--f
Contributor

jan--f commented Apr 17, 2025

/retest-required

@machine424
Contributor Author

/retest

@machine424
Contributor Author

/retest-required

@machine424
Contributor Author

/retest

@machine424
Contributor Author

/retest-required

…il we can ensure it won't result in excessive CPU usage
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Apr 27, 2025
Contributor

openshift-ci bot commented Apr 27, 2025

@machine424: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
ci/prow/okd-scos-e2e-aws-ovn | 9ec8f5c | link | false | /test okd-scos-e2e-aws-ovn

@machine424
Contributor Author

/hold cancel
/label acknowledge-critical-fixes-only

@openshift-ci openshift-ci bot added acknowledge-critical-fixes-only Indicates if the issuer of the label is OK with the policy. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Apr 29, 2025
@machine424
Contributor Author

/retest-required

@danielmellado
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 2, 2025
Contributor

openshift-ci bot commented May 2, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danielmellado, jan--f, machine424, rexagod

Needs approval from an approver in each of these files:
  • OWNERS [danielmellado,jan--f,machine424,rexagod]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit 4dbb28d into openshift:main May 2, 2025
20 of 21 checks passed
@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

Distgit: cluster-monitoring-operator
This PR has been included in build cluster-monitoring-operator-container-v4.20.0-202505021117.p0.g4dbb28d.assembly.stream.el9.
All builds following this will include this PR.

@machine424
Contributor Author

/jira backport release-4.19

@openshift-ci-robot
Contributor

@machine424: The following backport issues have been created:

Queuing cherrypicks to the requested branches to be created after this PR merges:
/cherrypick release-4.19

@openshift-cherrypick-robot

@openshift-ci-robot: new pull request created: #2597

@machine424
Contributor Author

/retitle OCPBUGS-55736: disable --auto-gomemlimit for Prometheus on SNO until we can ensure it won't result in excessive CPU usage

@openshift-ci openshift-ci bot changed the title MON-4200: disable --auto-gomemlimit for Prometheus on SNO until we can ensure it won't result in excessive CPU usage OCPBUGS-55736: disable --auto-gomemlimit for Prometheus on SNO until we can ensure it won't result in excessive CPU usage May 6, 2025
@openshift-ci-robot
Contributor

@machine424: Jira Issue OCPBUGS-55736 is in an unrecognized state (Closed) and will not be moved to the MODIFIED state.

Labels
  • acknowledge-critical-fixes-only: Indicates if the issuer of the label is OK with the policy.
  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • lgtm: Indicates that a PR is ready to be merged.
8 participants