---
kind:
  - Troubleshooting
products:
  - Alauda Container Platform
ProductsVersion:
  - 4.1.0,4.2.x
---

## Issue

The platform monitoring stack starts emitting `KubeAPIDown` (and other rule-based) alerts even though `kubectl get --raw=/healthz` and the API server pods themselves are healthy. The Prometheus pods log repeated rule-evaluation failures of the form:

```text
ts=YYYY-MM-DDTHH:MM:SS.XXXZ caller=group.go:NNN level=warn
  component="rule manager" file=/etc/prometheus/rules/...rules.yaml
  group=kube-prometheus-node-recording.rules
  msg="Evaluating rule failed"
  rule="record: instance:node_cpu:rate:sum
        expr: sum by (instance) (rate(node_cpu_seconds_total{...}[3m]))"
  err="corruption in head chunk file /prometheus/chunks_head/00xxx4:
        checksum mismatch expected:0, actual:dxxxxx6"
```

Recording- and alerting-rule queries return errors instead of values, so any alert whose expression depends on those rules fires spuriously.

## Root Cause

Prometheus persists the most recent (still-mutable) two-hour window of samples as **head chunks** plus a **write-ahead log (WAL)** under `/prometheus/`. Head chunks are written with a per-chunk checksum so that a partial or torn write can be detected on the next read.

A checksum mismatch in `/prometheus/chunks_head/...` means the head chunk file on disk is no longer self-consistent. Common triggers:

- The Prometheus pod was killed ungracefully (OOM, node power loss, forced eviction) before the in-memory head was flushed.
- The PV / underlying storage acknowledged a write that was not actually durable (storage latency spike, controller failover, kernel page-cache loss on a host crash).
- A node-level failure interrupted an in-flight write to a head chunk.

Once a head chunk is corrupt, the rule manager that tries to read it surfaces the error on every evaluation tick, which is what produces the `KubeAPIDown`-style false positives — the alert is not really about the API server, it is about the recording rule that feeds it.

## Resolution

ACP exposes the platform Prometheus through `observability/monitor` (Prometheus Operator + Thanos). The recovery procedure is the same as for any open-source Prometheus deployment: stop the operator from reconciling the StatefulSet, delete the corrupted head and WAL files inside the affected pod, and let Prometheus rebuild from the persisted blocks.

> Important: deleting `chunks_head/` and `wal/` discards the most recent samples (typically up to two hours) for that replica. Long-term blocks already flushed to disk — and any data already shipped to Thanos / object storage — are preserved. If two Prometheus replicas are running and only one is corrupt, the surviving replica still serves the recent window.

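Before running the steps, it can be worth confirming that a healthy replica remains and sizing the data the cleanup will discard. The pre-check below is a minimal sketch, not part of the standard procedure; it assumes the stock `app.kubernetes.io/name=prometheus` label and `/prometheus` data path used elsewhere in this document, that the Prometheus container image ships a busybox-style `du`, and that you substitute `<prometheus-pod>` yourself:

```bash
NS=cpaas-system   # adjust for your install

# Which replicas exist, and whether a second, Ready replica can keep serving
# the recent window while the corrupt one is cleaned up.
kubectl -n "$NS" get pods -l app.kubernetes.io/name=prometheus -o wide

# Rough size of the mutable data (head chunks + WAL) that will be discarded
# on the corrupt replica; persisted blocks under /prometheus are untouched.
kubectl -n "$NS" exec <prometheus-pod> -c prometheus -- \
  du -sh /prometheus/chunks_head /prometheus/wal
```
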
1. **Pause the operator** so it does not fight the manual cleanup. Substitute the namespace where the platform monitoring stack lives (`cpaas-system` for the in-core ACP `observability/monitor` install).

   ```bash
   NS=cpaas-system   # adjust for your install
   kubectl -n "$NS" scale deploy prometheus-operator --replicas=0
   ```

2. **Identify the affected Prometheus pod** and confirm the corruption error on it (do not blindly clean every replica):

   ```bash
   kubectl -n "$NS" get pods -l app.kubernetes.io/name=prometheus
   kubectl -n "$NS" logs <prometheus-pod> -c prometheus --tail=200 \
     | grep -i "corruption in head chunk"
   ```

3. **Remove only the corrupted head chunks and WAL** inside the failing pod. The Prometheus container runs as a non-root user; the data directory is `/prometheus`.

   ```bash
   kubectl -n "$NS" exec -it <prometheus-pod> -c prometheus -- \
     sh -c 'rm -rf /prometheus/chunks_head/* /prometheus/wal/*'
   ```

4. **Restart the Prometheus pod** so the TSDB re-opens cleanly:

   ```bash
   kubectl -n "$NS" delete pod <prometheus-pod>
   ```

5. **Resume the operator** to restore normal reconciliation:

   ```bash
   kubectl -n "$NS" scale deploy prometheus-operator --replicas=1
   ```

6. **Verify**. The pod should reach `Ready`, the corruption log line should stop, and rule evaluation should succeed:

   ```bash
   kubectl -n "$NS" get pod <prometheus-pod> -w
   kubectl -n "$NS" logs <prometheus-pod> -c prometheus --tail=200 \
     | grep -E 'level=err|head chunk|"Evaluating rule failed"' || echo "clean"
   ```

If the original `KubeAPIDown` alert was a false positive caused by a broken recording rule, it should resolve on the next evaluation cycle once the rule starts producing values again.

### Prevention

- Provision the Prometheus PV on storage that honors `fsync` and survives node reboots without losing committed pages.
- Run **two** Prometheus replicas (the operator's default) so a single-replica TSDB corruption does not cause a monitoring outage.
- Keep Thanos / remote-write export to long-term object storage enabled — the recent-window loss during this recovery is then bounded to whatever has not yet been shipped.
- Avoid hard-killing Prometheus pods. When draining a node, let the kubelet's graceful termination flush the head.

## Diagnostic Steps

```bash
# 1. Confirm the alert really comes from a broken recording rule, not the API server.
kubectl get --raw=/healthz            # API server should answer "ok"
kubectl get --raw='/readyz?verbose'

# 2. Find which Prometheus replicas are logging the corruption.
NS=cpaas-system
for p in $(kubectl -n "$NS" get pod -l app.kubernetes.io/name=prometheus \
             -o name); do
  echo "==== $p ===="
  kubectl -n "$NS" logs "$p" -c prometheus --tail=300 \
    | grep -E 'corruption in head chunk|"Evaluating rule failed"' | tail -5
done

# 3. Inspect the on-disk layout (sanity check before deleting anything).
kubectl -n "$NS" exec <prometheus-pod> -c prometheus -- ls -lh /prometheus
kubectl -n "$NS" exec <prometheus-pod> -c prometheus -- ls -lh /prometheus/chunks_head | head
kubectl -n "$NS" exec <prometheus-pod> -c prometheus -- ls -lh /prometheus/wal | head

# 4. After recovery, confirm rule evaluation is healthy.
kubectl -n "$NS" exec <prometheus-pod> -c prometheus -- \
  wget -qO- http://localhost:9090/api/v1/rules | \
  jq '.data.groups[].rules[] | select(.health!="ok")'
```

If `chunks_head/` keeps growing back into corruption after recovery, the root cause is upstream of Prometheus — investigate the underlying storage class, node stability, and the pod's recent termination history (`kubectl -n "$NS" describe pod <prometheus-pod>` for `Last State: Terminated` reasons).
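
As a shortcut for that last check, the per-replica termination history can be pulled in one command. This is a sketch, assuming `jq` is available on the workstation (it is already used in the diagnostic steps above) and the stock `app.kubernetes.io/name=prometheus` label:

```bash
NS=cpaas-system
# Show the last termination state of the prometheus container in every
# replica; "OOMKilled" or "Error" shortly before the corruption appeared
# points at an ungraceful stop as the likely trigger.
kubectl -n "$NS" get pods -l app.kubernetes.io/name=prometheus -o json \
  | jq '.items[]
      | {pod: .metadata.name,
         lastTermination: ([.status.containerStatuses[]?
                            | select(.name == "prometheus")
                            | .lastState.terminated][0] // "none recorded")}'
```

A `"none recorded"` result means the container has not restarted recently, which shifts suspicion toward the storage layer rather than the pod lifecycle.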