feat: Add selector and bootstrap observability metrics#286
Signed-off-by: Rawad Hossain <rawad.hossain00@gmail.com>
Force-pushed from 406dde6 to 11900ae.
AvineshTripathi left a comment:
Thanks for the PR! Left a few comments.
    // Update selector_matched_nodes_total from the node list we already fetched.
    var matchedCount int
    for i := range nodeList.Items {
We already do the same thing inside processAllNodesForRule.
    )

    // BootstrapCompletionErrors tracks failures writing the bootstrap-completion annotation.
    BootstrapCompletionErrors = prometheus.NewCounterVec(
I couldn't find this in the doc. Where are we planning to use these metrics? I feel a counter without the reason for the failure won't be that useful. Please correct me if I am misunderstanding this.
    // BootstrapNRCDuration tracks time from NRC taint application to bootstrap completion.
    // Unlike bootstrap_duration_seconds, this excludes pre-NRC node boot time.
    BootstrapNRCDuration = prometheus.NewHistogramVec(
Is the implementation missing here, or is this the one pending a decision?
    if duration > 0 {
        metrics.BootstrapDuration.WithLabelValues(rule.Name).Observe(duration)
    }
    // TODO (pending decision): record BootstrapNRCDuration from readiness.k8s.io/taint-applied-<rule> to time.Now(); skip if missing.
This PR introduces a new format for the bootstrap annotation in which the value is JSON. Instead of creating a new annotation readiness.k8s.io/taint-applied-<rule>, we could just add a field to the JSON value.
Description
This PR adds three new metrics. Two metrics are fully implemented, while the third is left with TODOs pending discussion.
What each metric does
- node_readiness_selector_matched_nodes_total: Tracks how many nodes currently match a rule's spec. If a rule's NodeSelector matches no nodes, the controller performs no work and produces no other signal. This metric makes those misconfigurations immediately visible.
- node_readiness_bootstrap_completion_errors_total: Counts failures writing the bootstrap completion annotation. If this write fails, the node continues to be re-evaluated even though bootstrap completed. This metric makes those failures visible.
- node_readiness_bootstrap_nrc_duration_seconds: Measures only the time NRC itself held a node, from the first taint until bootstrap completion. Excludes pre-NRC boot time. It is registered, but the recording logic is deferred pending discussion on the timestamp anchor.
The two approaches I see are:
- Option A: Add a readiness.k8s.io/taint-applied-<rule> annotation to node.ObjectMeta when the taint is first applied, and record the duration from that timestamp to bootstrap completion.
- Option B: Use lastTransitionTime as the start timestamp. This needs a Status().Patch() call and nodes/status write permissions.

I left TODOs so the implementation can be completed once we agree on the approach.
Related to Issue #182
Type of Change
/kind feature
Testing
Checklist
- make test passes
- make lint passes