
feat: Add selector and bootstrap observability metrics #286

Open
rawadhossain wants to merge 1 commit into kubernetes-sigs:main from rawadhossain:feat/add-observability-metrics

Conversation

@rawadhossain

Description

This PR adds three new metrics. Two metrics are fully implemented, while the third is left with TODOs pending discussion.

What each metric does

node_readiness_selector_matched_nodes_total

Tracks how many nodes currently match a rule's NodeSelector. If the selector matches no nodes, the controller performs no work and produces no other signal. This metric makes that kind of misconfiguration immediately visible.
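
As a rough illustration of the counting step, the sketch below counts nodes whose labels satisfy a selector. It is not the controller's code: the `node` type, `matches`, and `countMatched` are hypothetical stand-ins, only equality-based label matching is modeled (no match expressions), and the gauge update itself is omitted.

```go
package main

import "fmt"

// node is a hypothetical stand-in for a Kubernetes Node; only labels matter here.
type node struct {
	name   string
	labels map[string]string
}

// matches reports whether every key/value pair in selector is present in labels.
// Assumption: equality-based matching only. An empty selector matches every node.
func matches(selector, labels map[string]string) bool {
	for k, want := range selector {
		if got, ok := labels[k]; !ok || got != want {
			return false
		}
	}
	return true
}

// countMatched returns how many nodes satisfy the selector. The result is the
// value a per-rule gauge would be set to after each evaluation.
func countMatched(selector map[string]string, nodes []node) int {
	n := 0
	for i := range nodes {
		if matches(selector, nodes[i].labels) {
			n++
		}
	}
	return n
}

func main() {
	nodes := []node{
		{name: "gpu-1", labels: map[string]string{"pool": "gpu"}},
		{name: "cpu-1", labels: map[string]string{"pool": "cpu"}},
	}
	// A selector with a typo in its value matches nothing, which is the
	// silent misconfiguration a zero-valued gauge would surface.
	fmt.Println(countMatched(map[string]string{"pool": "gpu"}, nodes))
	fmt.Println(countMatched(map[string]string{"pool": "gpuu"}, nodes))
}
```

A value of zero for a rule that is expected to match nodes is the case worth alerting on.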

node_readiness_bootstrap_completion_errors_total

Counts failed writes of the bootstrap completion annotation. If this write fails, the node continues to be re-evaluated even though bootstrap has completed. This metric makes those failures visible.

node_readiness_bootstrap_nrc_duration_seconds

Measures only the time NRC itself held a node, from the first taint until bootstrap completion. Excludes pre-NRC boot time.

The metric is registered, but the recording logic is deferred pending discussion of which timestamp to use as the start anchor.
I see two approaches:

Option A:

  • Write readiness.k8s.io/taint-applied-<rule> to node.ObjectMeta when the taint is first applied and record duration from that timestamp to bootstrap completion.

Option B:

  • Write a dedicated node status condition and use its API-server-generated lastTransitionTime as the start timestamp.
  • Requires a separate Status().Patch() call and nodes/status write permissions.

I left TODOs so the implementation can be completed once we agree on an approach.

Related to Issue #182

Type of Change

/kind feature

Testing

  • Added test coverage.

Checklist

  • make test passes
  • make lint passes

@kubernetes-prow kubernetes-prow Bot added the kind/feature Categorizes issue or PR as related to a new feature. label Jul 2, 2026
@netlify

netlify Bot commented Jul 2, 2026

Deploy Preview for node-readiness-controller canceled.

Name Link
🔨 Latest commit 11900ae
🔍 Latest deploy log https://app.netlify.com/projects/node-readiness-controller/deploys/6a47d03749624d0008917e21

@kubernetes-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: rawadhossain
Once this PR has been reviewed and has the lgtm label, please assign ajaysundark for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubernetes-prow

Hi @rawadhossain. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@kubernetes-prow kubernetes-prow Bot added the following labels Jul 2, 2026: needs-ok-to-test (Indicates a PR that requires an org member to verify it is safe to test.), size/L (Denotes a PR that changes 100-499 lines, ignoring generated files.), cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.), needs-rebase (Indicates a PR cannot be merged because it has merge conflicts with HEAD.)
Signed-off-by: Rawad Hossain <rawad.hossain00@gmail.com>
@rawadhossain rawadhossain force-pushed the feat/add-observability-metrics branch from 406dde6 to 11900ae Compare July 3, 2026 15:07
@kubernetes-prow kubernetes-prow Bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 3, 2026
@ajaysundark ajaysundark self-requested a review July 3, 2026 21:22

@AvineshTripathi AvineshTripathi left a comment

Thanks for the PR! Left a few comments.


// Update selector_matched_nodes_total from the node list we already fetched.
var matchedCount int
for i := range nodeList.Items {

We already do the same thing inside processAllNodesForRule here.

)

// BootstrapCompletionErrors tracks failures writing the bootstrap-completion annotation.
BootstrapCompletionErrors = prometheus.NewCounterVec(

I couldn't find this in the doc. Where are we planning to use these metrics? I feel a counter without the reason for the failure won't be that useful. Please correct me if I am misunderstanding this.


// BootstrapNRCDuration tracks time from NRC taint application to bootstrap completion.
// Unlike bootstrap_duration_seconds, this excludes pre-NRC node boot time.
BootstrapNRCDuration = prometheus.NewHistogramVec(

Is the implementation missing here, or is this the one pending a decision?

if duration > 0 {
	metrics.BootstrapDuration.WithLabelValues(rule.Name).Observe(duration)
}
// TODO: pending decision — record BootstrapNRCDuration from readiness.k8s.io/taint-applied-<rule> to time.Now(); skip if missing.

This PR introduces a new bootstrap annotation format whose value is JSON. Instead of creating a new readiness.k8s.io/taint-applied-<rule> annotation, we could just add a field to that JSON value.
