feat(cluster): add CNPGInstanceMetricsAbsent alert and runbook #931
Open
philippemnoel wants to merge 2 commits into
Adds a PrometheusRule and runbook that detect a running CloudNativePG instance whose metrics exporter is hung: it stops serving cnpg_* metrics while the pod stays up.

This matters because the lag, HA and replication alerts all read from the same exporter and are "expr > threshold" rules, so once it goes silent they have no samples to evaluate and never fire. A hung exporter can coincide with a frozen standby, leaving replication stuck and unmonitored.

The rule keys on up == 0, which the scraper synthesizes for every PodMonitor target, so a scrape timeout (hung exporter) shows as up == 0 with no extra dependency. The podSelector keeps it cnpg-scoped, the 10-minute 'for' rides out normal restarts/upgrades, and a removed pod's up series goes stale rather than 0, so scale-downs self-exclude.

Signed-off-by: Philippe Noël <philippemnoel@gmail.com>
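For readers without the diff open, the rule described above might look roughly like the following. This is a minimal sketch, not the manifest from this PR: the metadata name, group name, namespace, pod regex, severity and annotation text are all illustrative assumptions, and the chart's real podSelector templating is not reproduced here.

```yaml
# Illustrative sketch only; names and label values are assumptions.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cnpg-instance-metrics-absent      # hypothetical resource name
spec:
  groups:
    - name: cnpg-instance-metrics         # hypothetical group name
      rules:
        - alert: CNPGInstanceMetricsAbsent
          # `up` is written by Prometheus itself for every scrape target,
          # so it still has samples when the exporter hangs and the
          # cnpg_* series stop arriving.
          # The namespace and pod matchers stand in for the chart's
          # podSelector scoping (assumed values).
          expr: up{namespace="default", pod=~"cluster-example-[0-9]+"} == 0
          # Ride out normal restarts and upgrades before firing.
          for: 10m
          labels:
            severity: warning             # assumed severity
          annotations:
            summary: CNPG instance is running but not serving metrics
            description: >-
              Scrapes of pod {{ $labels.pod }} have failed for 10 minutes,
              so alerts that depend on cnpg_* metrics cannot evaluate.
```

Because a deleted pod's up series goes stale instead of reporting 0, an expression of this shape should not fire on scale-down. If the rule is rendered through a Helm template, the `{{ $labels.pod }}` placeholder would need escaping so that Helm passes it through to Prometheus unchanged.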
walter-woodall
approved these changes
Jun 26, 2026
Following on from #774. Here at @paradedb we hit a failure mode worth its own rule: a CNPG instance whose metrics exporter hangs while the pod stays Ready. It stops serving cnpg_* metrics, so the lag/HA/replication alerts that read from that same exporter have no samples and silently never fire.

Adds the CNPGInstanceMetricsAbsent alert (up == 0 for 10m, scoped by podSelector) and its runbook, in the same form as the existing rules. No extra dependencies.

Original commit in our fork: paradedb@7d539a4