
obs(pebble): expose LSM internals as prometheus metrics#578

Merged
bootjp merged 4 commits into main from obs/pebble-metrics
Apr 22, 2026

Conversation

@bootjp
Owner

@bootjp bootjp commented Apr 21, 2026

Summary

  • Adds an elastickv_pebble_* Prometheus metrics family that periodically polls each raft group's *pebble.DB.Metrics() snapshot so operators can graph and alert on write-stall precursors.
  • Mirrors the existing DispatchCollector shape: a PebbleMetricsSource interface (satisfied by pebbleStore via a new Metrics() accessor guarded by dbMu) and a PebbleCollector started from main.go on the same 5 s cadence.
  • Covered by unit tests using a fakePebbleSource returning canned *pebble.Metrics; asserts gauge-overwrite semantics and reset-safe counter deltas.

Exposed metrics (per group label)

Gauges (overwritten each tick):

  • elastickv_pebble_l0_sublevels - canonical write-stall precursor
  • elastickv_pebble_l0_num_files
  • elastickv_pebble_compact_estimated_debt_bytes
  • elastickv_pebble_compact_in_progress
  • elastickv_pebble_memtable_count / _size_bytes / _zombie_count
  • elastickv_pebble_block_cache_size_bytes

Counters (positive-delta only, reset-safe across Restore/swap):

  • elastickv_pebble_compact_count_total
  • elastickv_pebble_block_cache_hits_total / _misses_total

Deviations from the task description

  • No WriteStallCount / WriteStallDuration: pebble v2.1.4 does not expose these on the Metrics struct; they are only delivered via the EventListener.WriteStallBegin / WriteStallEnd callbacks. Wiring an event listener is doable, but it would require modifying defaultPebbleOptions() to install one, which is a larger surface than this observability pass. L0 sublevels and compaction debt are the more actionable write-stall proxies and are included; an EventListener-based stall counter can follow in a separate PR if desired.
  • Counter semantics: the first observation emits the absolute cumulative value (e.g. 100 hits -> _total = 100); subsequent observations emit only the positive delta. A source reset (Restore reopens the DB with fresh counters) causes a smaller observed value - that delta is silently dropped so the Prometheus counter stays monotonic, mirroring DispatchCollector behaviour.

Test plan

  • go build ./...
  • go test -race -count=1 -short ./monitoring/... ./store/...
  • make lint (0 issues)

Add an elastickv_pebble_* metrics family that periodically polls each
raft group's *pebble.DB.Metrics() snapshot and mirrors the most
operationally useful fields into Prometheus so operators can graph and
alert on write-stall precursors without restarting the process for a
heap dump.

Fields exposed (per-group label):

  Gauges (overwritten each tick):
    l0_sublevels                 - canonical write-stall precursor
    l0_num_files                 - L0 fan-in
    compact_estimated_debt_bytes - compactions falling behind ingest
    compact_in_progress
    memtable_count / memtable_size_bytes / memtable_zombie_count
    block_cache_size_bytes

  Counters (positive-delta only, reset-safe):
    compact_count_total
    block_cache_hits_total / block_cache_misses_total

Rationale: L0 sublevels and compaction debt are the incident signals
for write stalls; memtable size + zombies catch memtable-related
stalls and long-lived iterators; block cache size/hit/miss track
working-set fit.

Note: pebble v2.1.4 does not expose WriteStallCount / WriteStallDuration
on the Metrics struct (only as EventListener callbacks), so those are
omitted from this pass. L0 sublevels is the more actionable proxy.

Wiring mirrors the existing DispatchCollector shape:
  - store.pebbleStore.Metrics() accessor (guarded by dbMu so a
    concurrent Restore/Close cannot race the DB swap)
  - monitoring.PebbleMetricsSource interface satisfied by
    *pebbleStore
  - monitoring.PebbleCollector started from main.go next to the
    DispatchCollector, on the same 5 s cadence

Tests use a fakePebbleSource returning canned *pebble.Metrics and
assert both the gauge-overwrite semantics and the reset-safe counter
deltas.
@coderabbitai

coderabbitai Bot commented Apr 21, 2026

Warning

Rate limit exceeded

@bootjp has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 59 minutes and 33 seconds before requesting another review.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 701783b0-23e2-43e1-bd1c-09bb8e6dfc3e

📥 Commits

Reviewing files that changed from the base of the PR and between bbd7850 and be3c19b.

📒 Files selected for processing (5)
  • main.go
  • monitoring/pebble.go
  • monitoring/pebble_test.go
  • monitoring/registry.go
  • store/lsm_store.go

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a PebbleCollector to monitor Pebble LSM internals, such as L0 sublevels, compaction debt, and block cache usage, exposing them as Prometheus metrics. The implementation includes metrics definitions, polling logic, registry integration, and comprehensive unit tests. A typo was identified in the pebbleSnapshot struct field name blockCacheMisse, which should be corrected for consistency.

Comment thread monitoring/pebble.go Outdated
type pebbleSnapshot struct {
compactCount int64
blockCacheHits int64
blockCacheMisse int64
Contributor


medium

There's a typo in this field name. It should be blockCacheMisses.

This typo is also present in its usages on lines 263, 271, and 272. Please correct it in all places for consistency and maintainability.

Suggested change
blockCacheMisse int64
blockCacheMisses int64

@bootjp
Owner Author

bootjp commented Apr 21, 2026

Addressed the typo in monitoring/pebble.go: renamed blockCacheMisse -> blockCacheMisses on lines 182, 263, 271, 272 (commit b8a8b67). Tests pass (go test -race -count=1 -short ./monitoring/... ./store/...) and golangci-lint run ./monitoring/... is clean.

/gemini review

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a PebbleCollector to monitor and expose Pebble LSM internal metrics, such as L0 pressure, compaction debt, and cache usage, via Prometheus. The implementation includes logic to handle monotonic counters and source resets. Feedback suggests improving naming consistency for block cache metrics and optimizing performance by pre-calculating group ID strings to reduce allocations during the polling loop.

Comment thread monitoring/pebble.go Outdated
// Block cache.
blockCacheSizeBytes *prometheus.GaugeVec
blockCacheHitsTotal *prometheus.CounterVec
blockCacheMissTotal *prometheus.CounterVec
Contributor


medium

The field name blockCacheMissTotal is inconsistent with blockCacheHitsTotal (plural). For better consistency with the other fields and the Prometheus metric name itself, consider renaming this to blockCacheMissesTotal.

Suggested change
blockCacheMissTotal *prometheus.CounterVec
blockCacheMissesTotal *prometheus.CounterVec

Comment thread monitoring/pebble.go
Comment on lines +163 to +166
type PebbleSource struct {
GroupID uint64
Source PebbleMetricsSource
}
Contributor


medium

To avoid repeated allocations during the polling loop, consider pre-calculating the string representation of the GroupID and storing it in the PebbleSource struct. strconv.FormatUint is called for every group on every 5-second tick.

Suggested change
type PebbleSource struct {
GroupID uint64
Source PebbleMetricsSource
}
type PebbleSource struct {
GroupID uint64
GroupIDStr string
Source PebbleMetricsSource
}

Comment thread monitoring/pebble.go Outdated
if snap == nil {
continue
}
group := strconv.FormatUint(src.GroupID, 10)
Contributor


medium

If you apply the suggestion to pre-calculate the group ID string in PebbleSource, you can use it here directly to avoid per-tick allocations.

Suggested change
group := strconv.FormatUint(src.GroupID, 10)
group := src.GroupIDStr

Comment thread main.go
Comment on lines +469 to +472
out = append(out, monitoring.PebbleSource{
GroupID: runtime.spec.id,
Source: src,
})
Contributor


medium

If the PebbleSource struct is updated to include a pre-calculated GroupIDStr, populate it here during the extraction loop.

out = append(out, monitoring.PebbleSource{
	GroupID:    runtime.spec.id,
	GroupIDStr: strconv.FormatUint(runtime.spec.id, 10),
	Source:     src,
})

Addresses gemini-code-assist medium-priority review on PR #578: the
polling loop formatted src.GroupID with strconv.FormatUint on every
tick (every 5s per group), causing avoidable allocations. Store the
pre-computed decimal string on PebbleSource at construction time and
reuse it as the "group" Prometheus label in observeOnce.

- monitoring/pebble.go: add PebbleSource.GroupIDStr; consume it in
  observeOnce; drop now-unused strconv import.
- main.go: populate GroupIDStr in pebbleMonitorSources.
- monitoring/pebble_test.go: populate GroupIDStr in test fixtures.
@bootjp
Owner Author

bootjp commented Apr 22, 2026

Addressed gemini's latest medium-priority review (2026-04-21T20:53:10Z) in 2af6f78.

Change: precompute PebbleSource.GroupIDStr once at construction (in pebbleMonitorSources) instead of calling strconv.FormatUint inside observeOnce on every 5s tick per group. Over the process lifetime this removes one small-string allocation per group per tick.

  • monitoring/pebble.go: add GroupIDStr string field to PebbleSource; use src.GroupIDStr as the group label in observeOnce; drop the now-unused strconv import.
  • main.go: populate GroupIDStr: strconv.FormatUint(runtime.spec.id, 10) in pebbleMonitorSources.
  • monitoring/pebble_test.go: populate GroupIDStr in the three PebbleSource literals used by the collector tests.

Verification: go build ./... clean; go test -race -count=1 -short ./monitoring/... ./store/... passes (monitoring 2.085s, store 7.011s); make lint 0 issues.

/gemini review

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

Contributor

Copilot AI left a comment


Pull request overview

Adds first-class observability for Pebble/LSM health by exporting selected *pebble.DB.Metrics() internals as Prometheus metrics, integrated into the existing monitoring registry and started alongside other periodic collectors.

Changes:

  • Add pebbleStore.Metrics() accessor guarded by dbMu to safely snapshot Pebble metrics.
  • Introduce monitoring.PebbleMetrics + PebbleCollector to translate Pebble snapshots into elastickv_pebble_* gauges/counters.
  • Wire the new collector into main.go and add unit tests validating gauge overwrite + reset-safe counter deltas.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
store/lsm_store.go Exposes a thread-safe Metrics() snapshot method on pebbleStore.
monitoring/registry.go Registers Pebble metrics and exposes Registry.PebbleCollector().
monitoring/pebble.go Implements Prometheus vectors + polling collector for Pebble metrics.
monitoring/pebble_test.go Adds unit tests for overwrite semantics, delta counters, and nil safety.
main.go Starts the Pebble collector and builds per-group Pebble sources.

Comment thread store/lsm_store.go Outdated
Comment on lines +1799 to +1803
// Returns nil while the store is closed or between a Restore/swap (no
// current DB). Callers must handle nil.
//
// Safe for concurrent use: takes the dbMu read lock to protect against
// Restore/Close swapping the DB pointer.

Copilot AI Apr 22, 2026


The docstring says this returns nil while the store is closed or mid Restore/swap, but the implementation only returns nil when s.db == nil. Restore holds dbMu exclusively (callers will block rather than observe nil), and Close() doesn’t set s.db to nil, so a post-Close call would still hit s.db.Metrics(). Please either update the comment to match actual behavior, or set s.db = nil as part of Close()/swap so the nil contract is true (and avoids calling Metrics() on a closed DB).

Comment thread monitoring/pebble.go Outdated
// dashboard.
//
// The point-in-time fields (Sublevels, NumFiles, EstimatedDebt,
// MemTable.*, NumInProgress, BlockCache.Size/Count) are exposed as

Copilot AI Apr 22, 2026


The header comment mentions exposing BlockCache.Size/Count as gauges, but this collector only exports block_cache_size_bytes (no block cache count metric). Please update the comment to match what’s actually exported, or add the missing metric if intended.

Suggested change
// MemTable.*, NumInProgress, BlockCache.Size/Count) are exposed as
// MemTable.*, NumInProgress, BlockCache.Size) are exposed as

Comment thread monitoring/pebble.go Outdated
Comment on lines +50 to +53
blockCacheSizeBytes *prometheus.GaugeVec
blockCacheHitsTotal *prometheus.CounterVec
blockCacheMissTotal *prometheus.CounterVec
}

Copilot AI Apr 22, 2026


Field name blockCacheMissTotal is inconsistent with the exported metric name (…_misses_total) and the nearby snapshot field (blockCacheMisses). Consider renaming to blockCacheMissesTotal for clarity/consistency.

Copilot generated this review using guidance from repository custom instructions.
- Metrics() docstring now matches actual behavior (nil only pre-Open
  or after a failed Open; Restore callers block on dbMu).
- Drop "Count" from the collector header comment; only
  BlockCache.Size is exported.
- Rename blockCacheMissTotal -> blockCacheMissesTotal for consistency
  with blockCacheHitsTotal, the snapshot field blockCacheMisses, and
  the exported metric name (..._misses_total).
@bootjp
Owner Author

bootjp commented Apr 22, 2026

/gemini review

Addressed Copilot's 3 review items in be3c19b:

  1. store/lsm_store.go:1803 — Updated Metrics() docstring to accurately describe nil semantics: nil only before the first Open or after a failed Open; Restore callers block on dbMu (held exclusively by Restore) rather than observing nil, and Close() does not clear s.db. Behavior unchanged (out of scope for this PR).
  2. monitoring/pebble.go:19 — Removed "Count" from the header comment; only BlockCache.Size is exported (Size is more actionable than Count).
  3. monitoring/pebble.go:53 — Renamed blockCacheMissTotal to blockCacheMissesTotal for consistency with blockCacheHitsTotal, the adjacent snapshot field blockCacheMisses, and the exported metric name ..._misses_total. All usages propagated.

Verification: go build ./... clean, go test -race -count=1 -short ./monitoring/... ./store/... passes, make lint 0 issues.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a PebbleCollector to monitor Pebble LSM internals, such as L0 sublevels, compaction debt, and memtable/block cache usage, via Prometheus. It includes the necessary metrics definitions, a collector that polls these metrics on a fixed interval, and integration into the main application lifecycle. Feedback was provided regarding a concurrency issue in the metrics collection loop: holding a mutex while calling potentially blocking Metrics() methods could stall observability for all raft groups on a node. It is recommended to use more granular locking to ensure that a single slow or blocked store does not impact the monitoring of others.

Comment thread monitoring/pebble.go
Comment on lines +231 to +232
c.mu.Lock()
defer c.mu.Unlock()
Contributor


high

Holding the PebbleCollector mutex across the entire loop is problematic because src.Source.Metrics() can block. In pebbleStore, Metrics() acquires dbMu.RLock(). If a raft group is performing a long-running operation that holds dbMu exclusively (like a Restore or a large Compact), this collector will block on that group while holding c.mu. This prevents metrics from being updated for all other raft groups on the same node, leading to a total loss of Pebble observability during critical recovery operations. Per repository guidelines, ensure that the blocking semantic of Metrics() is explicitly stated in its interface documentation. Additionally, when performing updates to the previous state, the lock must be held continuously from the read to the assignment to prevent race conditions.

func (c *PebbleCollector) observeOnce(sources []PebbleSource) {
	if c == nil || c.metrics == nil {
		return
	}
	for _, src := range sources {
		if src.Source == nil {
			continue
		}
		snap := src.Source.Metrics()
		if snap == nil {
			continue
		}
		group := src.GroupIDStr

		// L0 pressure: gauges, overwritten each tick.
		c.metrics.l0Sublevels.WithLabelValues(group).Set(float64(snap.Levels[0].Sublevels))
		c.metrics.l0NumFiles.WithLabelValues(group).Set(float64(snap.Levels[0].TablesCount))

		// Compaction.
		c.metrics.compactEstimatedDebt.WithLabelValues(group).Set(float64(snap.Compact.EstimatedDebt))
		c.metrics.compactInProgress.WithLabelValues(group).Set(float64(snap.Compact.NumInProgress))

		// Memtable.
		c.metrics.memtableCount.WithLabelValues(group).Set(float64(snap.MemTable.Count))
		c.metrics.memtableSizeBytes.WithLabelValues(group).Set(float64(snap.MemTable.Size))
		c.metrics.memtableZombieCount.WithLabelValues(group).Set(float64(snap.MemTable.ZombieCount))

		// Block cache gauge.
		c.metrics.blockCacheSizeBytes.WithLabelValues(group).Set(float64(snap.BlockCache.Size))

		// Monotonic counters: emit only the positive delta. A smaller
		// value means the source was reset (store reopened); rebase
		// silently without emitting negative.
		c.mu.Lock()
		prev := c.previous[src.GroupID]
		curr := pebbleSnapshot{
			compactCount:     snap.Compact.Count,
			blockCacheHits:   snap.BlockCache.Hits,
			blockCacheMisses: snap.BlockCache.Misses,
		}
		if curr.compactCount > prev.compactCount {
			c.metrics.compactCountTotal.WithLabelValues(group).Add(float64(curr.compactCount - prev.compactCount))
		}
		if curr.blockCacheHits > prev.blockCacheHits {
			c.metrics.blockCacheHitsTotal.WithLabelValues(group).Add(float64(curr.blockCacheHits - prev.blockCacheHits))
		}
		if curr.blockCacheMisses > prev.blockCacheMisses {
			c.metrics.blockCacheMissesTotal.WithLabelValues(group).Add(float64(curr.blockCacheMisses - prev.blockCacheMisses))
		}
		c.previous[src.GroupID] = curr
		c.mu.Unlock()
	}
}
References
  1. When an interface method is intended to block until a certain condition is met, explicitly state this blocking semantic in the documentation or interface description to avoid ambiguity for implementers.
  2. When performing atomic operations based on a version or state variable, do not release the lock between reading the variable and performing the conditional check to prevent race conditions and the use of stale data.

@bootjp bootjp requested a review from Copilot April 22, 2026 09:17
@bootjp
Owner Author

bootjp commented Apr 22, 2026

/gemini reivew

@gemini-code-assist
Contributor

It looks like you made a typo in the command /gemini reivew. To trigger a review, please use the correct command: /gemini review.

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.

@bootjp bootjp merged commit 0f97c45 into main Apr 22, 2026
12 checks passed
@bootjp bootjp deleted the obs/pebble-metrics branch April 22, 2026 11:15
