Skip to content

Implement Devnet-4 metrics from leanMetrics PR #29#289

Merged
MegaRedHand merged 2 commits intomainfrom
devnet4-metrics-impl
Apr 15, 2026
Merged

Implement Devnet-4 metrics from leanMetrics PR #29#289
MegaRedHand merged 2 commits intomainfrom
devnet4-metrics-impl

Conversation

@pablodeymo
Copy link
Copy Markdown
Collaborator

Motivation

Implements the metrics defined in leanEthereum/leanMetrics#29 for Devnet-4 monitoring: block production, gossip message sizes, sync status, and updated histogram buckets.

Description

Block Production Metrics (blockchain/metrics.rs → instrumented in lib.rs + store.rs)

Metric Type Buckets
lean_block_building_time_seconds Histogram 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 0.75, 1
lean_block_building_payload_aggregation_time_seconds Histogram 0.1, 0.25, 0.5, 0.75, 1, 2, 3, 4
lean_block_aggregated_payloads Histogram 1, 2, 4, 8, 16, 32, 64, 128
lean_block_building_success_total Counter
lean_block_building_failures_total Counter
  • propose_block() is wrapped with a timing guard for total block building time, and increments success/failure counters on each exit path.
  • produce_block_with_signatures() times the build_block() call specifically for payload aggregation, and observes the aggregated payload count.

Sync Status (blockchain/metrics.rs → tracked in lib.rs)

Metric Type Labels
lean_node_sync_status Gauge status=idle,syncing,synced
  • Set to idle at startup (before first tick).
  • Updated to syncing or synced after every block, by comparing head_slot against the wall clock slot.

Gossip Message Size Metrics (p2p/metrics.rs → instrumented in gossipsub/handler.rs)

Metric Type Buckets
lean_gossip_block_size_bytes Histogram 10K, 50K, 100K, 250K, 500K, 1M, 2M, 5M
lean_gossip_attestation_size_bytes Histogram 512, 1K, 2K, 4K, 8K, 16K
lean_gossip_aggregation_size_bytes Histogram 1K, 4K, 16K, 64K, 128K, 256K, 512K, 1M
  • Observed on the uncompressed message after snappy decompression, for each gossip message type.

Modified Existing Metric

  • lean_committee_signatures_aggregation_time_seconds: buckets updated from [0.005..1.0] to [0.05..4.0] to capture longer aggregation times in Devnet-4.

How to Test

  1. make fmt — passes
  2. make lint — passes (clippy clean)
  3. make test — spec test failures are pre-existing (fixture format mismatch, unrelated to this PR)
  4. Run a local devnet and verify new metrics appear on the /metrics endpoint (port 5054)

  Add block production metrics (building time, payload aggregation time,
  aggregated payload count, success/failure counters), gossip message size
  histograms (block, attestation, aggregation), node sync status gauge
  (idle/syncing/synced), and update committee signature aggregation
  histogram buckets to wider range.
@github-actions
Copy link
Copy Markdown

🤖 Codex Code Review

  1. Medium Sync status is computed from head_slot, which is not a valid syncedness signal on a consensus chain with empty slots. In crates/blockchain/src/lib.rs, head_slot >= current_slot marks the node as synced, but head_slot only advances when a block exists. Empty slots are normal, so a fully caught-up node will report syncing whenever the wall clock moves into a skipped slot. This will flap on slot boundaries and stay wrong across runs of empty slots. The comparison should be driven from the tick-derived slot / local time progression, not the latest imported head slot.

  2. Medium The sync-status gauge is only refreshed when a block is processed, so it can remain stale indefinitely. The update path is only in crates/blockchain/src/lib.rs, while the node’s notion of current time/slot is maintained in on_tick at crates/blockchain/src/lib.rs. That means a node can stay idle after startup until the first imported block, and can stay synced after import stalls because nothing reevaluates the gauge without another block. This logic belongs in on_tick, where the wall-clock slot is already computed every interval.

  3. Low The new “payload aggregation” timing does not match the code it wraps. In crates/blockchain/src/store.rs, time_block_building_payload_aggregation() surrounds the entire build_block(...) call, but build_block also runs repeated process_slots / process_block transitions and final state-root computation at crates/blockchain/src/store.rs. That makes the metric materially over-report aggregation time and weakens its usefulness for performance analysis. Either narrow the guard to the attestation-selection/compaction work or rename the metric.

No consensus-critical changes to fork choice, attestation validation, STF, XMSS verification, or SSZ handling stood out in this diff beyond the observability issues above.

I could not run cargo check in this environment because rustup attempted to write under a read-only path.


Automated review by OpenAI Codex · gpt-5.4 · custom prompt

@github-actions
Copy link
Copy Markdown

🤖 Kimi Code Review

Overall this is a solid metrics PR with good use of RAII timing guards and proper error path coverage. However, there are several correctness and maintainability issues to address:

Critical Issues

  1. Potential panic on system time (crates/blockchain/src/lib.rs:282)

    .duration_since(std::time::UNIX_EPOCH)
    .unwrap()

    While unlikely, unwrap() can panic if the system clock is before Unix epoch. Replace with .unwrap_or_default() or proper error handling.

  2. Integer truncation (crates/blockchain/src/lib.rs:282)

    .as_millis() as u64

    as_millis() returns u128. Though overflow is practically impossible (500M+ years), explicit saturating cast (try_into().unwrap_or(u64::MAX)) is safer.

Correctness & Maintainability

  1. Stringly-typed sync status (crates/blockchain/src/metrics.rs:571-578)
    The set_node_sync_status function relies on string matching against a hardcoded array. If an invalid string is passed, all labels are set to 0, hiding the node state.

    Recommendation: Define


Automated review by Kimi (Moonshot AI) · kimi-k2.5 · custom prompt

@github-actions
Copy link
Copy Markdown

🤖 Claude Code Review

Now I have all the context needed. Here is my review:


PR #289 Review — Devnet-4 Metrics

Overview

This PR adds Prometheus instrumentation for block production timing/outcomes, node sync status, and gossip message sizes. The implementation is clean and follows existing patterns. A few correctness and scope issues are worth addressing.


Correctness Issues

1. lean_block_building_time_seconds scope includes gossip publication

_timing in propose_block() (lib.rs:221) is dropped at the end of the function, which includes:

  • produce_block_with_signatures
  • sign_block_root
  • process_block (full local state transition)
  • p2p.publish_block (network I/O)

The metric description says "Time taken to build a block," but gossip publication is included. A slow network path will inflate this histogram and make it misleading. Either rename the metric to lean_block_proposal_time_seconds or move _timing to end before the publish call.

2. Naming mismatch on lean_block_building_payload_aggregation_time_seconds

The block in store.rs:1018-1028 is named payload_aggregation timing, but build_block does more than just aggregate payloads — it assembles the full block, runs justification/finality logic, computes the state root, etc. The name lean_block_building_time_seconds would be more accurate here; the broader scope timer in propose_block could be named lean_block_proposal_time_seconds. The current naming will confuse operators trying to correlate the two.

3. signatures.len() counts AggregatedSignatureProof objects, not validator count

store.rs:993 shows the return type is Vec<AggregatedSignatureProof>, so signatures.len() correctly measures the number of aggregated payloads packed into the block. This is the intended semantics. No bug here, just confirming it's correct.

4. Sync status only transitions when a block is received

set_node_sync_status is only called from process_block (lib.rs:273). If a node is behind and not receiving any blocks (common in initial sync), the status stays "idle" forever rather than transitioning to "syncing". Consider also updating sync status on the tick event, or at minimum on any head-slot query, so an offline/stuck node shows "syncing" when it's actually behind.


Minor Issues

5. Gossip size histograms are not eagerly registered

blockchain/metrics.rs has an init() function that force-initializes all metrics so they appear in /metrics from startup (main.rs:93). The three gossip histograms in p2p/metrics.rs have no equivalent — they will be absent from the /metrics output until the first gossip message is received. Either add a p2p::metrics::init() call, or add the gossip histograms to the blockchain init() (odd but functional), or accept the lazy behavior and document it.

6. .unwrap() on SystemTime (lib.rs:285)

std::time::SystemTime::now()
    .duration_since(std::time::UNIX_EPOCH)
    .unwrap()

This panics if the system clock is set before 1970, which is unrealistic in production but is still a panic path in consensus code. Use .unwrap_or_default() or .expect("system clock before Unix epoch") with a descriptive message, or guard with an else branch that leaves sync status unchanged.

7. as u64 truncation on as_millis() (lib.rs:286)

Duration::as_millis() returns u128. Casting to u64 silently truncates at u64::MAX (~584 million years from epoch), so this is safe in practice. However, this is worth a comment since the silent truncation looks alarming without context.


What Looks Good

  • The label-flip pattern for set_node_sync_status (sets target label to 1, all others to 0) is the correct Prometheus idiom for enumerated state representation. Pre-seeding all three labels via set_node_sync_status("idle") at startup ensures all series appear in /metrics immediately. Well done.
  • All three failure paths in propose_block correctly increment the failure counter (lines 228, 239, 257). Success is counted only after local process_block succeeds, which is the right semantic boundary.
  • Gossip size observations are placed after snappy decompression but before SSZ decoding (handler.rs), which correctly captures uncompressed message size as documented.
  • Histogram buckets are well-chosen: lean_gossip_attestation_size_bytes upper bound of 16K is appropriate for XMSS individual attestations (~3KB signature + data), and aggregation buckets extend to 1M for worst-case multi-validator aggregations.
  • lean_block_aggregated_payloads with bucket 0 (implicit [0, 1)) correctly handles the no-aggregator case described in CLAUDE.md.

Suggested Priority

  • Before merge: Item 1 (metric scope/naming) and Item 5 (gossip metrics not pre-registered) — both affect observability correctness.
  • Nice-to-have: Item 4 (sync status updates), Item 2 (naming alignment), Item 6 (panic path).

Automated review by Claude (Anthropic) · sonnet · custom prompt

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Apr 15, 2026

Greptile Summary

This PR adds Devnet-4 monitoring metrics: block production histograms/counters, a labelled sync-status gauge, and gossip message-size histograms, plus a bucket range update for the existing committee-aggregation histogram. Implementation correctly follows the codebase's LazyLock/TimingGuard patterns, handles all failure exit paths for success/failure counters, and uses saturating_sub for pre-genesis safety in the slot calculation.

Confidence Score: 5/5

Safe to merge; only one P2 style comment about timing-scope labelling.

All findings are P2 (style/clarity). No logic errors, data loss, or runtime failures were identified. The sync-status gauge correctly uses saturating_sub for pre-genesis safety, genesis_time is u64 so the *1000 multiplication is safe, and all metric exit paths are correctly instrumented. Gossip metrics are consistently lazily registered, matching the existing p2p pattern.

crates/blockchain/src/lib.rs — the timing scope of lean_block_building_time_seconds is wider than its description implies.

Important Files Changed

Filename Overview
crates/blockchain/src/lib.rs Adds timing guard for block building, success/failure counters on all exit paths, and sync-status update based on head-slot vs wall-clock slot; timing scope for lean_block_building_time_seconds is broader than the metric description suggests (includes process_block + publish_block)
crates/blockchain/src/metrics.rs Adds 5 block-production metrics (2 histograms, 1 aggregated-payload histogram, 2 counters), 1 sync-status IntGaugeVec, and updates existing aggregation bucket range; all registered via init() and follow existing LazyLock/TimingGuard patterns
crates/blockchain/src/store.rs Wraps build_block() call with payload-aggregation timing guard and observes aggregated-payload count after successful build; correctly only records count on success path
crates/net/p2p/src/gossipsub/handler.rs Observes decompressed message size for block, aggregation, and attestation gossip types immediately after successful decompression, consistent with measuring uncompressed payload size
crates/net/p2p/src/metrics.rs Adds three gossip-size histograms (block, attestation, aggregation) with appropriate bucket ranges; lazily initialized on first use, consistent with existing p2p metrics pattern (no explicit init())

Sequence Diagram

sequenceDiagram
    participant BC as BlockChainServer
    participant Metrics as blockchain/metrics
    participant Store as store.rs
    participant KM as KeyManager
    participant P2P as P2PServer

    BC->>Metrics: time_block_building() [_timing guard START]
    BC->>Store: produce_block_with_signatures()
    activate Store
    Store->>Metrics: time_block_building_payload_aggregation() [guard START]
    Store->>Store: build_block()
    Store->>Metrics: [payload aggregation guard DROP → records duration]
    Store->>Metrics: observe_block_aggregated_payloads(signatures.len())
    Store-->>BC: (Block, signatures, checkpoints) OR Err
    deactivate Store
    alt build failed
        BC->>Metrics: inc_block_building_failures()
    else success
        BC->>KM: sign_block_root()
        alt sign failed
            BC->>Metrics: inc_block_building_failures()
        else signed
            BC->>BC: process_block() [state transition + storage]
            BC->>Metrics: update_head_slot / set_node_sync_status
            alt process failed
                BC->>Metrics: inc_block_building_failures()
            else processed
                BC->>Metrics: inc_block_building_success()
                BC->>P2P: publish_block()
            end
        end
    end
    BC->>Metrics: [_timing guard DROP → records lean_block_building_time_seconds]

    Note over P2P,Metrics: On gossip receive
    P2P->>P2P: decompress_message()
    P2P->>Metrics: observe_gossip_block_size(uncompressed.len())
    P2P->>P2P: observe_gossip_attestation_size / observe_gossip_aggregation_size
Loading
Prompt To Fix All With AI
This is a comment left during a code review.
Path: crates/blockchain/src/lib.rs
Line: 221

Comment:
**Timing scope wider than metric name implies**

`_timing` is dropped at the end of `propose_block`, so `lean_block_building_time_seconds` captures not only block construction and signing but also the full state transition inside `process_block()` (which includes SSZ processing, fork-choice updates, and storage writes) and the gossip `publish_block` call. The nested `lean_block_building_payload_aggregation_time_seconds` already isolates the `build_block()` phase, so the outer metric's scope may surprise operators trying to attribute latency — the description in the metric string says "Time taken to build a block" but it measures the entire proposal pipeline.

Consider either tightening the timing scope to end before `process_block` / `publish_block`, or updating the metric description to `"Time taken to propose a block (build + sign + process + publish)"` to make the breadth explicit.

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "Implement Devnet-4 metrics from leanMetr..." | Re-trigger Greptile

fn propose_block(&mut self, slot: u64, validator_id: u64) {
info!(%slot, %validator_id, "We are the proposer for this slot");

let _timing = metrics::time_block_building();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Timing scope wider than metric name implies

_timing is dropped at the end of propose_block, so lean_block_building_time_seconds captures not only block construction and signing but also the full state transition inside process_block() (which includes SSZ processing, fork-choice updates, and storage writes) and the gossip publish_block call. The nested lean_block_building_payload_aggregation_time_seconds already isolates the build_block() phase, so the outer metric's scope may surprise operators trying to attribute latency — the description in the metric string says "Time taken to build a block" but it measures the entire proposal pipeline.

Consider either tightening the timing scope to end before process_block / publish_block, or updating the metric description to "Time taken to propose a block (build + sign + process + publish)" to make the breadth explicit.

Prompt To Fix With AI
This is a comment left during a code review.
Path: crates/blockchain/src/lib.rs
Line: 221

Comment:
**Timing scope wider than metric name implies**

`_timing` is dropped at the end of `propose_block`, so `lean_block_building_time_seconds` captures not only block construction and signing but also the full state transition inside `process_block()` (which includes SSZ processing, fork-choice updates, and storage writes) and the gossip `publish_block` call. The nested `lean_block_building_payload_aggregation_time_seconds` already isolates the `build_block()` phase, so the outer metric's scope may surprise operators trying to attribute latency — the description in the metric string says "Time taken to build a block" but it measures the entire proposal pipeline.

Consider either tightening the timing scope to end before `process_block` / `publish_block`, or updating the metric description to `"Time taken to propose a block (build + sign + process + publish)"` to make the breadth explicit.

How can I resolve this? If you propose a fix, please make it concise.

Comment thread crates/blockchain/src/metrics.rs Outdated
}

/// Set the node sync status. Sets the given status label to 1 and all others to 0.
pub fn set_node_sync_status(status: &str) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should use an enum for this

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, added a SyncStatus enum with Idle/Syncing/Synced variants.

Comment thread crates/blockchain/src/lib.rs Outdated
Comment on lines +282 to +292
let genesis_time_ms = self.store.config().genesis_time * 1000;
let now_ms = std::time::SystemTime::now()
.duration_since(std::time::UNIX_EPOCH)
.unwrap()
.as_millis() as u64;
let current_slot = now_ms.saturating_sub(genesis_time_ms) / MILLISECONDS_PER_SLOT;
let status = if head_slot >= current_slot {
"synced"
} else {
"syncing"
};
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be simplified to self.store.time() / INTERVALS_PER_SLOT < self.store.head_slot()

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call, replaced the manual wall clock calculation with self.store.time() / INTERVALS_PER_SLOT.

  Address PR #289 review comments: replace raw &str sync status with a
  SyncStatus enum, and replace manual SystemTime wall clock calculation
  with store.time() / INTERVALS_PER_SLOT.
Copy link
Copy Markdown
Collaborator

@MegaRedHand MegaRedHand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@MegaRedHand MegaRedHand merged commit cc98253 into main Apr 15, 2026
1 of 2 checks passed
@MegaRedHand MegaRedHand deleted the devnet4-metrics-impl branch April 15, 2026 21:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants