OCPBUGS-55736: disable --auto-gomemlimit for Prometheus on SNO until we can ensure it won't result in excessive CPU usage #2549
/hold
Since there is no ticket linked to this, I was wondering whether we have seen any instances of the 10% memory reduction bottlenecking the CPU on SNO?
pkg/manifests/manifests.go
Outdated
@@ -1491,7 +1491,7 @@ func (f *Factory) PrometheusK8s(grpcTLS *v1.Secret, telemetrySecret *v1.Secret)
 	return p, nil
 }

-func (f *Factory) setupGoGC(p *monv1.Prometheus) {
+func (f *Factory) adjustGoGCConfig(p *monv1.Prometheus) {
Maybe something like:
-func (f *Factory) adjustGoGCConfig(p *monv1.Prometheus) {
+func (f *Factory) adjustGoSettings(p *monv1.Prometheus) {
Since this affects GOMEMLIMIT too now.
Renaming it to adjustGoGCRelatedConfig, because GOMEMLIMIT is also related to garbage collection.
 		for _, env := range c.Env {
 			require.NotEqual(t, env.Name, "GOGC")
 		}
 		return
 	}

-	require.Contains(t, c.Env, v1.EnvVar{Name: "GOGC", Value: tc.exp})
+	require.Contains(t, c.Env, v1.EnvVar{Name: "GOGC", Value: tc.expectedGOGC})
+100!
pkg/manifests/manifests_test.go
Outdated
-	require.Contains(t, c.Env, v1.EnvVar{Name: "GOGC", Value: tc.exp})
+	require.Contains(t, c.Env, v1.EnvVar{Name: "GOGC", Value: tc.expectedGOGC})
+
+	require.Equal(t, tc.autoGOMEMLIMITDisabled, argumentPresent(*c, "--no-auto-gomemlimit"))
We could drop the tc.autoGOMEMLIMITDisabled field, as this could be safely derived from ir.HighlyAvailableInfrastructure(), since that's the only case where this is disabled for now (else enabled)?
-	require.Equal(t, tc.autoGOMEMLIMITDisabled, argumentPresent(*c, "--no-auto-gomemlimit"))
+	require.Equal(t, tc.ir.HighlyAvailableInfrastructure(), argumentPresent(*c, "--no-auto-gomemlimit"))
Yep, but I'm trying to avoid duplicating the same logic in the tests, in case something is broken or wrong somewhere.
Also, having it explicit makes the test cases easier to read.
I'm asking in #2549 (review) because any observed insight should help me set a more meaningful buffer threshold in kubernetes-monitoring/kubernetes-mixin#1010 (comment).
Thanks for the review. This is actually still WIP and requires openshift/prometheus#227; I've marked it as such. I'll get back to you later.
@machine424: This pull request references MON-4200 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.19.0" version, but no target version was set.
8e147f7 to 290c80d
/lgtm
/lgtm
Needs a
/retest-required
I'd make the bot take care of that to not pollute the PR.
pkg/manifests/manifests.go
Outdated
// Until we're certain setting GOMEMLIMIT to 0.9 (default ratio) of detected maximum
// container or system memory won't result in excessive CPU usage, we're disabling the
// auto setting for SNO.
p.Spec.Containers[i].Args = append(p.Spec.Containers[i].Args, "--no-auto-gomemlimit")
Shouldn't this rather use p.Spec.AdditionalArgs?
Good catch. Even though we're overriding all the args, no test seems to be failing, e.g.
func TestClusterMonitorPrometheusK8Config(t *testing.T) {
I'll take a look.
/hold
Well, it's because we don't run the CMO e2e tests on SNO.
e2e-aws-ovn-single-node is failing for this reason, but it's not required.
Maybe we can consider making it required, now that the upgrade one is payload-blocking: https://issues.redhat.com//browse/OCPEDGE-799
Yes, IIUC we were waiting for SNO to be a blocking job before moving it to required here (it was too flaky initially).
I'll check with the SNO team; we'd probably pick the same upgrade job as the payload one...
290c80d to 19dc083
/retest-required
/lgtm
/retest-required
/retest
/retest-required
/retest-required
/retest
/retest-required
/retest-required
…il we can ensure it won't result in excessive CPU usage
19dc083 to 9ec8f5c
@machine424: The following test failed, say
/hold cancel
/retest-required
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: danielmellado, jan--f, machine424, rexagod
4dbb28d into openshift:main
[ART PR BUILD NOTIFIER] Distgit: cluster-monitoring-operator
/jira backport release-4.19
@machine424: The following backport issues have been created: Queuing cherrypicks to the requested branches to be created after this PR merges: In response to this:
@openshift-ci-robot: new pull request created: #2597
/retitle OCPBUGS-55736: disable --auto-gomemlimit for Prometheus on SNO until we can ensure it won't result in excessive CPU usage
@machine424: Jira Issue OCPBUGS-55736 is in an unrecognized state (Closed) and will not be moved to the MODIFIED state.
…il we can ensure it won't result in excessive CPU usage
requires openshift/prometheus#227