feat(cluster): add CNPGInstanceMetricsAbsent alert and runbook#931

Open
philippemnoel wants to merge 2 commits into cloudnative-pg:main from paradedb:feat/cnpg-instance-metrics-absent

@philippemnoel (Contributor) commented Jun 25, 2026

Following on from #774. Here at @paradedb we hit a failure mode worth its own rule: a CNPG instance whose metrics exporter hangs while the pod stays Ready. The instance stops serving cnpg_* metrics, so the lag/HA/replication alerts that read from that same exporter have no samples and silently never fire.

Adds the CNPGInstanceMetricsAbsent alert (up == 0 for 10m, scoped by podSelector) and its runbook, in the same form as the existing rules. No extra dependencies.
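
For readers skimming without opening the diff, a rule of this shape might look roughly like the sketch below. This is an illustration under stated assumptions, not the manifest added by this PR: the metadata name, group name, namespace, pod regex, severity, and annotation text are all hypothetical placeholders, and the real rule takes its scope from the chart's podSelector.

```yaml
# Illustrative sketch only; names and label values are assumptions,
# not copied from the PR's diff.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-cnpg-instance-metrics-absent   # hypothetical name
spec:
  groups:
    - name: example-cnpg-rules                 # hypothetical group name
      rules:
        - alert: CNPGInstanceMetricsAbsent
          # `up` is written by Prometheus for every scrape target:
          # 1 when the scrape succeeded, 0 when it failed or timed out.
          # The namespace/pod matchers stand in for the chart's podSelector.
          expr: up{namespace="example-ns", pod=~"example-cluster-[0-9]+"} == 0
          # Must hold for 10 minutes before firing, so short restarts
          # and rolling upgrades do not page anyone.
          for: 10m
          labels:
            severity: warning                  # assumed severity
          annotations:
            summary: Metrics exporter of a CNPG instance is not answering scrapes
```

Unlike the threshold rules, this expression still returns a sample when the exporter is silent, because the `up` series is produced by the scraper rather than by the exporter itself.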

Original commit in our fork: paradedb@7d539a4

Adds a PrometheusRule and runbook that detect a running CloudNativePG
instance whose metrics exporter is hung: it stops serving cnpg_* metrics
while the pod stays up.

This matters because the lag, HA and replication alerts all read from the
same exporter and are "expr > threshold" rules, so once it goes silent
they have no samples to evaluate and never fire. A hung exporter can
coincide with a frozen standby, leaving replication stuck and unmonitored.

The rule keys on up == 0, which the scraper synthesizes for every
PodMonitor target, so a scrape timeout (hung exporter) shows as up == 0
with no extra dependency. The podSelector keeps it CNPG-scoped, the
10-minute 'for' rides out normal restarts and upgrades, and a removed
pod's up series goes stale rather than reading 0, so scale-downs
self-exclude.

Signed-off-by: Philippe Noël <philippemnoel@gmail.com>
@dosubot (bot) added the size:L label (this PR changes 100-499 lines, ignoring generated files) on Jun 25, 2026


3 participants