obs(pebble): expose LSM internals as prometheus metrics #578
Conversation
Add an elastickv_pebble_* metrics family that periodically polls each
raft group's *pebble.DB.Metrics() snapshot and mirrors the most
operationally useful fields into Prometheus so operators can graph and
alert on write-stall precursors without restarting the process for a
heap dump.
Fields exposed (per-group label):
Gauges (overwritten each tick):
l0_sublevels - canonical write-stall precursor
l0_num_files - L0 fan-in
compact_estimated_debt_bytes - compactions falling behind ingest
compact_in_progress
memtable_count / memtable_size_bytes / memtable_zombie_count
block_cache_size_bytes
Counters (positive-delta only, reset-safe):
compact_count_total
block_cache_hits_total / block_cache_misses_total
Rationale: L0 sublevels and compaction debt are the incident signals
for write stalls; memtable size + zombies catch memtable-related
stalls and long-lived iterators; block cache size/hit/miss track
working-set fit.
Note: pebble v2.1.4 does not expose WriteStallCount / WriteStallDuration
on the Metrics struct (only as EventListener callbacks), so those are
omitted from this pass. L0 sublevels is the more actionable proxy.
Wiring mirrors the existing DispatchCollector shape:
- store.pebbleStore.Metrics() accessor (guarded by dbMu so a
concurrent Restore/Close cannot race the DB swap)
- monitoring.PebbleMetricsSource interface satisfied by
*pebbleStore
- monitoring.PebbleCollector started from main.go next to the
DispatchCollector, on the same 5 s cadence
Tests use a fakePebbleSource returning canned *pebble.Metrics and
assert both the gauge-overwrite semantics and the reset-safe counter
deltas.
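The reset-safe counter semantics described above can be sketched in isolation. This is a minimal stdlib-only illustration, not the PR's code: the `pebbleSnapshot` field names follow the PR, while `observeDelta` and the `add` closure stand in for the real `prometheus.CounterVec.Add` path.

```go
package main

import "fmt"

// pebbleSnapshot mirrors the cumulative counters captured per tick (names from the PR).
type pebbleSnapshot struct {
	compactCount     int64
	blockCacheHits   int64
	blockCacheMisses int64
}

// observeDelta emits only the positive delta between two observations.
// A smaller current value means the source was reset (store reopened);
// the delta is dropped so the downstream counter stays monotonic.
func observeDelta(prev, curr int64, add func(float64)) {
	if curr > prev {
		add(float64(curr - prev))
	}
}

func main() {
	var total float64
	add := func(v float64) { total += v }

	prev := pebbleSnapshot{}
	for _, curr := range []pebbleSnapshot{
		{compactCount: 100}, // first tick: emits the full value
		{compactCount: 130}, // +30
		{compactCount: 5},   // reset: negative delta dropped, rebase silently
		{compactCount: 25},  // +20 from the rebased baseline
	} {
		observeDelta(prev.compactCount, curr.compactCount, add)
		prev = curr
	}
	fmt.Println(total) // 150
}
```

The rebase step (storing `curr` as the new baseline even when the delta is dropped) is what keeps a single reset from suppressing all later deltas.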
Code Review
This pull request introduces a PebbleCollector to monitor Pebble LSM internals, such as L0 sublevels, compaction debt, and block cache usage, exposing them as Prometheus metrics. The implementation includes metrics definitions, polling logic, registry integration, and comprehensive unit tests. A typo was identified in the pebbleSnapshot struct field name blockCacheMisse which should be corrected for consistency.
```go
type pebbleSnapshot struct {
	compactCount    int64
	blockCacheHits  int64
	blockCacheMisse int64
```
Addressed the typo in

/gemini review
Code Review
This pull request introduces a PebbleCollector to monitor and expose Pebble LSM internal metrics, such as L0 pressure, compaction debt, and cache usage, via Prometheus. The implementation includes logic to handle monotonic counters and source resets. Feedback suggests improving naming consistency for block cache metrics and optimizing performance by pre-calculating group ID strings to reduce allocations during the polling loop.
```go
// Block cache.
blockCacheSizeBytes *prometheus.GaugeVec
blockCacheHitsTotal *prometheus.CounterVec
blockCacheMissTotal *prometheus.CounterVec
```
The field name blockCacheMissTotal is inconsistent with blockCacheHitsTotal (plural). For better consistency with the other fields and the Prometheus metric name itself, consider renaming this to blockCacheMissesTotal.
```diff
-blockCacheMissTotal *prometheus.CounterVec
+blockCacheMissesTotal *prometheus.CounterVec
```
```go
type PebbleSource struct {
	GroupID uint64
	Source  PebbleMetricsSource
}
```
To avoid repeated allocations during the polling loop, consider pre-calculating the string representation of the GroupID and storing it in the PebbleSource struct. strconv.FormatUint is called for every group on every 5-second tick.
```diff
 type PebbleSource struct {
 	GroupID    uint64
+	GroupIDStr string
 	Source     PebbleMetricsSource
 }
```
```go
if snap == nil {
	continue
}
group := strconv.FormatUint(src.GroupID, 10)
```
```go
out = append(out, monitoring.PebbleSource{
	GroupID: runtime.spec.id,
	Source:  src,
})
```
Addresses gemini-code-assist medium-priority review on PR #578: the polling loop formatted src.GroupID with strconv.FormatUint on every tick (every 5 s per group), causing avoidable allocations. Store the pre-computed decimal string on PebbleSource at construction time and reuse it as the "group" Prometheus label in observeOnce.
- monitoring/pebble.go: add PebbleSource.GroupIDStr; consume it in observeOnce; drop the now-unused strconv import.
- main.go: populate GroupIDStr in pebbleMonitorSources.
- monitoring/pebble_test.go: populate GroupIDStr in test fixtures.
Addressed gemini's latest medium-priority review (2026-04-21T20:53:10Z) in 2af6f78. Change: precompute
Verification: /gemini review
Pull request overview
Adds first-class observability for Pebble/LSM health by exporting selected *pebble.DB.Metrics() internals as Prometheus metrics, integrated into the existing monitoring registry and started alongside other periodic collectors.
Changes:
- Add `pebbleStore.Metrics()` accessor guarded by `dbMu` to safely snapshot Pebble metrics.
- Introduce `monitoring.PebbleMetrics` + `PebbleCollector` to translate Pebble snapshots into `elastickv_pebble_*` gauges/counters.
- Wire the new collector into `main.go` and add unit tests validating gauge overwrite + reset-safe counter deltas.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| store/lsm_store.go | Exposes a thread-safe Metrics() snapshot method on pebbleStore. |
| monitoring/registry.go | Registers Pebble metrics and exposes Registry.PebbleCollector(). |
| monitoring/pebble.go | Implements Prometheus vectors + polling collector for Pebble metrics. |
| monitoring/pebble_test.go | Adds unit tests for overwrite semantics, delta counters, and nil safety. |
| main.go | Starts the Pebble collector and builds per-group Pebble sources. |
```go
// Returns nil while the store is closed or between a Restore/swap (no
// current DB). Callers must handle nil.
//
// Safe for concurrent use: takes the dbMu read lock to protect against
// Restore/Close swapping the DB pointer.
```
The docstring says this returns nil while the store is closed or mid Restore/swap, but the implementation only returns nil when s.db == nil. Restore holds dbMu exclusively (callers will block rather than observe nil), and Close() doesn’t set s.db to nil, so a post-Close call would still hit s.db.Metrics(). Please either update the comment to match actual behavior, or set s.db = nil as part of Close()/swap so the nil contract is true (and avoids calling Metrics() on a closed DB).
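The second option the reviewer mentions (clearing the pointer under the write lock in Close so the nil contract holds) can be sketched as follows. This is a stdlib-only illustration, not the PR's code: `fakeDB` stands in for `*pebble.DB`, and the real `Close()` would also close the underlying DB before dropping the pointer.

```go
package main

import (
	"fmt"
	"sync"
)

// fakeDB stands in for *pebble.DB; Metrics returns a dummy snapshot.
type fakeDB struct{}

func (d *fakeDB) Metrics() *string { s := "snapshot"; return &s }

type pebbleStore struct {
	dbMu sync.RWMutex
	db   *fakeDB
}

// Metrics returns nil once the store is closed; callers must handle nil.
func (s *pebbleStore) Metrics() *string {
	s.dbMu.RLock()
	defer s.dbMu.RUnlock()
	if s.db == nil {
		return nil
	}
	return s.db.Metrics()
}

// Close clears the pointer under the write lock so the nil contract
// documented on Metrics is actually true after Close.
func (s *pebbleStore) Close() {
	s.dbMu.Lock()
	defer s.dbMu.Unlock()
	// the real store would call s.db.Close() here before dropping the pointer
	s.db = nil
}

func main() {
	s := &pebbleStore{db: &fakeDB{}}
	fmt.Println(s.Metrics() != nil) // true
	s.Close()
	fmt.Println(s.Metrics() == nil) // true
}
```

This also avoids calling `Metrics()` on a closed DB, which is the sharper of the two hazards the reviewer raises.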
```go
// dashboard.
//
// The point-in-time fields (Sublevels, NumFiles, EstimatedDebt,
// MemTable.*, NumInProgress, BlockCache.Size/Count) are exposed as
```
The header comment mentions exposing BlockCache.Size/Count as gauges, but this collector only exports block_cache_size_bytes (no block cache count metric). Please update the comment to match what’s actually exported, or add the missing metric if intended.
```diff
-// MemTable.*, NumInProgress, BlockCache.Size/Count) are exposed as
+// MemTable.*, NumInProgress, BlockCache.Size) are exposed as
```
```go
	blockCacheSizeBytes *prometheus.GaugeVec
	blockCacheHitsTotal *prometheus.CounterVec
	blockCacheMissTotal *prometheus.CounterVec
}
```
Field name blockCacheMissTotal is inconsistent with the exported metric name (…_misses_total) and the nearby snapshot field (blockCacheMisses). Consider renaming to blockCacheMissesTotal for clarity/consistency.
- Metrics() docstring now matches actual behavior (nil only pre-Open or after a failed Open; Restore callers block on dbMu).
- Drop "Count" from the collector header comment; only BlockCache.Size is exported.
- Rename blockCacheMissTotal -> blockCacheMissesTotal for consistency with blockCacheHitsTotal, the snapshot field blockCacheMisses, and the exported metric name (..._misses_total).
Addressed Copilot's 3 review items in be3c19b.

Verification: /gemini review
Code Review
This pull request introduces a PebbleCollector to monitor Pebble LSM internals, such as L0 sublevels, compaction debt, and memtable/block cache usage, via Prometheus. It includes the necessary metrics definitions, a collector that polls these metrics on a fixed interval, and integration into the main application lifecycle. Feedback was provided regarding a concurrency issue in the metrics collection loop: holding a mutex while calling potentially blocking Metrics() methods could stall observability for all raft groups on a node. It is recommended to use more granular locking to ensure that a single slow or blocked store does not impact the monitoring of others.
```go
c.mu.Lock()
defer c.mu.Unlock()
```
Holding the PebbleCollector mutex across the entire loop is problematic because src.Source.Metrics() can block. In pebbleStore, Metrics() acquires dbMu.RLock(). If a raft group is performing a long-running operation that holds dbMu exclusively (like a Restore or a large Compact), this collector will block on that group while holding c.mu. This prevents metrics from being updated for all other raft groups on the same node, leading to a total loss of Pebble observability during critical recovery operations. Per repository guidelines, ensure that the blocking semantic of Metrics() is explicitly stated in its interface documentation. Additionally, when performing updates to the previous state, the lock must be held continuously from the read to the assignment to prevent race conditions.
```go
func (c *PebbleCollector) observeOnce(sources []PebbleSource) {
	if c == nil || c.metrics == nil {
		return
	}
	for _, src := range sources {
		if src.Source == nil {
			continue
		}
		snap := src.Source.Metrics()
		if snap == nil {
			continue
		}
		group := src.GroupIDStr

		// L0 pressure: gauges, overwritten each tick.
		c.metrics.l0Sublevels.WithLabelValues(group).Set(float64(snap.Levels[0].Sublevels))
		c.metrics.l0NumFiles.WithLabelValues(group).Set(float64(snap.Levels[0].TablesCount))

		// Compaction.
		c.metrics.compactEstimatedDebt.WithLabelValues(group).Set(float64(snap.Compact.EstimatedDebt))
		c.metrics.compactInProgress.WithLabelValues(group).Set(float64(snap.Compact.NumInProgress))

		// Memtable.
		c.metrics.memtableCount.WithLabelValues(group).Set(float64(snap.MemTable.Count))
		c.metrics.memtableSizeBytes.WithLabelValues(group).Set(float64(snap.MemTable.Size))
		c.metrics.memtableZombieCount.WithLabelValues(group).Set(float64(snap.MemTable.ZombieCount))

		// Block cache gauge.
		c.metrics.blockCacheSizeBytes.WithLabelValues(group).Set(float64(snap.BlockCache.Size))

		// Monotonic counters: emit only the positive delta. A smaller
		// value means the source was reset (store reopened); rebase
		// silently without emitting negative.
		c.mu.Lock()
		prev := c.previous[src.GroupID]
		curr := pebbleSnapshot{
			compactCount:     snap.Compact.Count,
			blockCacheHits:   snap.BlockCache.Hits,
			blockCacheMisses: snap.BlockCache.Misses,
		}
		if curr.compactCount > prev.compactCount {
			c.metrics.compactCountTotal.WithLabelValues(group).Add(float64(curr.compactCount - prev.compactCount))
		}
		if curr.blockCacheHits > prev.blockCacheHits {
			c.metrics.blockCacheHitsTotal.WithLabelValues(group).Add(float64(curr.blockCacheHits - prev.blockCacheHits))
		}
		if curr.blockCacheMisses > prev.blockCacheMisses {
			c.metrics.blockCacheMissesTotal.WithLabelValues(group).Add(float64(curr.blockCacheMisses - prev.blockCacheMisses))
		}
		c.previous[src.GroupID] = curr
		c.mu.Unlock()
	}
}
```

References
- When an interface method is intended to block until a certain condition is met, explicitly state this blocking semantic in the documentation or interface description to avoid ambiguity for implementers.
- When performing atomic operations based on a version or state variable, do not release the lock between reading the variable and performing the conditional check to prevent race conditions and the use of stale data.
/gemini reivew |
It looks like you made a typo in the command |
Summary
- `elastickv_pebble_*` Prometheus metrics family that periodically polls each raft group's `*pebble.DB.Metrics()` snapshot so operators can graph and alert on write-stall precursors.
- Mirrors the `DispatchCollector` shape: a `PebbleMetricsSource` interface (satisfied by the `pebbleStore` via a new `Metrics()` accessor guarded by `dbMu`), a `PebbleCollector` that polls on the same 5 s cadence from `main.go`.
- Tests: a `fakePebbleSource` returning canned `*pebble.Metrics`; asserts gauge-overwrite semantics and reset-safe counter deltas.

Exposed metrics (per `group` label)

Gauges (overwritten each tick):
- `elastickv_pebble_l0_sublevels` - canonical write-stall precursor
- `elastickv_pebble_l0_num_files`
- `elastickv_pebble_compact_estimated_debt_bytes`
- `elastickv_pebble_compact_in_progress`
- `elastickv_pebble_memtable_count` / `_size_bytes` / `_zombie_count`
- `elastickv_pebble_block_cache_size_bytes`

Counters (positive-delta only, reset-safe across Restore/swap):
- `elastickv_pebble_compact_count_total`
- `elastickv_pebble_block_cache_hits_total` / `_misses_total`

Deviations from the task description
- `WriteStallCount` / `WriteStallDuration`: pebble v2.1.4 does not expose those on the `Metrics` struct (they are only delivered via `EventListener.WriteStallBegin/End` callbacks). Wiring an event listener is doable but it would require modifying `defaultPebbleOptions()` to install one - that is a larger surface than this observability pass. L0 sublevels and compaction debt are the more actionable write-stall proxies and are included. Can follow up with an `EventListener`-based stall counter in a separate PR if desired.
- (… `_total` = 100); subsequent observations emit only the positive delta. A source reset (`Restore` reopens the DB with fresh counters) causes a smaller observed value - that delta is silently dropped so the Prometheus counter stays monotonic, mirroring `DispatchCollector` behaviour.

Test plan
- `go build ./...`
- `go test -race -count=1 -short ./monitoring/... ./store/...`
- `make lint` (0 issues)