
Bound gossip_signatures with FIFO eviction to prevent unbounded growth#290

Open
pablodeymo wants to merge 3 commits into main from fix-gossip-signatures-unbounded-growth

Conversation

@pablodeymo
Collaborator

Motivation

Fixes #263. During a finalization stall, gossip_signatures grows without limit because it is only pruned when finalization advances. In the devnet4 test run (2026-04-07/08, ~18.5h), finalization stalled at slot 10733 for 6+ hours. During this period, every gossip attestation from every validator was inserted and never removed, with attestation_count in produced blocks growing from ~5 to 797. This contributed to the RocksDB file descriptor exhaustion crash on node 0.

Description

Replaces the unbounded HashMap<H256, GossipDataEntry> with a new GossipSignatureBuffer struct that enforces a hard cap on total individual signatures, following the same FIFO eviction pattern already used by PayloadBuffer for aggregated payloads.

New GossipSignatureBuffer struct

| Field | Type | Purpose |
| --- | --- | --- |
| `data` | `HashMap<H256, GossipDataEntry>` | Signatures grouped by attestation data root |
| `order` | `VecDeque<H256>` | FIFO insertion order of data_roots |
| `capacity` | `usize` | Hard cap on total_signatures |
| `total_signatures` | `usize` | Running count (O(1), was O(n) sum) |

Eviction algorithm

On insert(data_root, validator_id, signature):
  1. If (validator_id, data_root) already exists → skip (first-write-wins)
  2. Insert signature into BTreeMap (ascending order for XMSS)
  3. total_signatures += 1
  4. If new data_root → push to back of VecDeque
  5. While total_signatures > capacity:
       Pop front of VecDeque (oldest data_root)
       Remove entire entry from HashMap
       total_signatures -= removed.signatures.len()
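
The five steps above can be sketched in Rust. This is a simplified stand-in (u64 replaces H256, a bare BTreeMap replaces GossipDataEntry), not the PR's actual code:

```rust
use std::collections::{BTreeMap, HashMap, VecDeque};

type DataRoot = u64; // stand-in for H256
type ValidatorId = u64;

struct GossipSignatureBuffer {
    data: HashMap<DataRoot, BTreeMap<ValidatorId, Vec<u8>>>,
    order: VecDeque<DataRoot>, // FIFO insertion order of data_roots
    capacity: usize,           // hard cap on total individual signatures
    total_signatures: usize,   // O(1) running count
}

impl GossipSignatureBuffer {
    fn new(capacity: usize) -> Self {
        Self {
            data: HashMap::new(),
            order: VecDeque::new(),
            capacity,
            total_signatures: 0,
        }
    }

    fn insert(&mut self, root: DataRoot, vid: ValidatorId, sig: Vec<u8>) {
        // Step 4 (done first here for borrow convenience): new data_roots
        // are tracked in FIFO order.
        if !self.data.contains_key(&root) {
            self.order.push_back(root);
        }
        let sigs = self.data.entry(root).or_default();
        // Step 1: first-write-wins dedup on (validator_id, data_root).
        if sigs.contains_key(&vid) {
            return;
        }
        // Steps 2-3: BTreeMap keeps ascending validator_id order for XMSS.
        sigs.insert(vid, sig);
        self.total_signatures += 1;
        // Step 5: evict oldest whole data_roots until back under the cap.
        while self.total_signatures > self.capacity {
            let Some(evicted) = self.order.pop_front() else { break };
            if let Some(removed) = self.data.remove(&evicted) {
                self.total_signatures -= removed.len();
            }
        }
    }
}
```

Note that eviction operates on whole data_roots, so one insert can remove several signatures at once.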

Capacity

GOSSIP_SIGNATURE_CAP = 2048 individual signatures:

  • With 4 validators: covers ~512 slots (~34 minutes)
  • Worst-case memory: ~6 MB (2048 × 3KB XMSS signatures)
  • Matches the scale of AGGREGATED_PAYLOAD_CAP = 512
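
A quick sanity check of the arithmetic behind these figures (signature size is the ~3 KB assumed in the description, not a measured value):

```rust
const GOSSIP_SIGNATURE_CAP: usize = 2048;
const XMSS_SIG_BYTES: usize = 3 * 1024; // ~3 KB per signature, per the PR text

// Slots covered before the cap forces eviction, assuming every validator
// attests once per slot.
fn slots_covered(validators: usize) -> usize {
    GOSSIP_SIGNATURE_CAP / validators
}

// Worst-case resident signature bytes when the buffer sits at the cap.
fn worst_case_bytes() -> usize {
    GOSSIP_SIGNATURE_CAP * XMSS_SIG_BYTES
}
```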

What stays the same

  • Finalization-based pruning — still runs when finalization advances (complementary, not replaced)
  • Interval-2 deletion — delete_gossip_signatures() still removes consumed sigs after aggregation
  • BTreeMap ordering — ascending validator_id for XMSS aggregate correctness
  • First-write-wins dedup — duplicate (validator_id, data_root) silently ignored
  • Public API — all 5 Store methods keep their exact signatures and return types
  • Zero caller changes — blockchain/src/store.rs compiles unchanged

Performance note

The delete() method collects emptied data_roots during iteration and batch-cleans the VecDeque with a single retain() pass, avoiding per-root O(n) scans.
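
That batch clean can be sketched as follows, with simplified stand-in types (the real method keys on (validator_id, H256) pairs):

```rust
use std::collections::{BTreeMap, HashMap, HashSet, VecDeque};

struct Buffer {
    data: HashMap<u64, BTreeMap<u64, Vec<u8>>>, // data_root -> validator sigs
    order: VecDeque<u64>,                       // FIFO order of data_roots
    total_signatures: usize,
}

impl Buffer {
    fn delete(&mut self, keys: &[(u64, u64)]) {
        let mut emptied: HashSet<u64> = HashSet::new();
        for &(vid, root) in keys {
            if let Some(sigs) = self.data.get_mut(&root) {
                // is_some() guards the counter against duplicate keys
                // appearing in the input slice.
                if sigs.remove(&vid).is_some() {
                    self.total_signatures -= 1;
                }
                if sigs.is_empty() {
                    self.data.remove(&root);
                    emptied.insert(root);
                }
            }
        }
        // One retain() pass over the VecDeque instead of a per-root O(n) scan.
        self.order.retain(|r| !emptied.contains(r));
    }
}
```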

Tests added

6 new unit tests covering:

  1. FIFO eviction when over capacity
  2. First-write-wins dedup
  3. Multiple validators per data_root
  4. Delete cleans up empty data_roots from both HashMap and VecDeque
  5. Slot-based pruning with count verification
  6. Eviction removes entire data_root (all its signatures)

How to Test

make fmt    # passes
make lint   # passes (clippy clean)
cargo test -p ethlambda-storage --lib --release  # 22 tests pass (6 new + 16 existing)

Devnet validation: run a multi-client devnet, observe lean_gossip_signatures metric stays bounded even if finalization stalls.

Closes

Closes #263

…y growth

Replace the bare HashMap with a GossipSignatureBuffer that caps total
individual signatures at 2048 (~6MB). When over capacity, the oldest
data_root and all its signatures are evicted FIFO, matching the
PayloadBuffer pattern. Finalization-based pruning is preserved as a
complementary mechanism.

Fixes #263
@github-actions

🤖 Kimi Code Review

The PR introduces a well-designed GossipSignatureBuffer with proper FIFO eviction and accurate accounting. No critical issues found.

Minor Suggestions:

  1. Redundant early return in delete (store.rs:304-306): The check if keys.is_empty() { return; } is unnecessary since the following loop would simply not execute. This is purely stylistic—safe to leave as-is for explicitness.

  2. Test determinism (store.rs:1755-1769): The make_dummy_sig function uses a seeded RNG (seed 42), which is good for reproducibility. However, LazyLock caches the same signature bytes for all calls. While sufficient for buffer logic tests, consider documenting that signature validation is not tested here.

Verification of correctness:

  • Accounting safety: The total_signatures counter is accurately maintained in all paths (insert, delete, prune, eviction). The delete method correctly handles duplicate keys in the input slice by checking remove().is_some() before decrementing.
  • FIFO semantics: Eviction removes entire data_root entries (all signatures for that attestation) based on insertion order, not individual signatures. This matches the design comment and prevents partial attestation aggregation.
  • Memory bounds: The hard cap of 2048 signatures (~6MB worst case) plus VecDeque tracking provides O(1) insertion and bounded memory usage.

Consensus safety note:
The FIFO eviction strategy may drop valid attestations under high gossip load, but this is acceptable for consensus-layer gossip buffers (best-effort delivery). The prune method correctly removes finalized slots, preventing stale signature accumulation.

LGTM – No changes required.


Automated review by Kimi (Moonshot AI) · kimi-k2.5 · custom prompt

@github-actions

🤖 Codex Code Review

Findings

  1. The new fixed-size gossip buffer can silently drop valid attestations before they are ever aggregated, which changes consensus behavior under load. on_gossip_attestation stores every verified unaggregated attestation in this pool, and aggregate_committee_signatures later builds proofs only from whatever remains in the buffer (crates/blockchain/src/store.rs:533, crates/blockchain/src/store.rs:140). The new cap is a hard-coded 2048 signatures with FIFO eviction by data_root insertion order (crates/storage/src/store.rs:91, crates/storage/src/store.rs:303). Because validate_attestation_data only rejects future attestations and does not reject old-but-still-known ones (crates/blockchain/src/store.rs:401), a peer can fill this buffer with many distinct valid attestations and evict current-slot signatures before interval-2 aggregation runs. On custom devnets with more than 4 validators, or after backlog bursts, this can reduce attestation inclusion and delay fork-choice weight/finality. I would not ship a fixed global cap here unless it is derived from validator count / slot horizon, or eviction preferentially drops stale/finalized slots rather than earliest-inserted roots.

Aside from that, the buffer bookkeeping itself looks consistent: BTreeMap preserves XMSS aggregation order, and the insert/delete/prune counters line up with the exposed snapshot/count APIs.

I could not run the targeted tests because cargo attempted to write under a read-only rustup temp directory in this environment.


Automated review by OpenAI Codex · gpt-5.4 · custom prompt

@github-actions

🤖 Claude Code Review



Code Review: PR #290 — Bound gossip_signatures with FIFO eviction

Overview

This PR fixes a real production crash (issue #263) by replacing an unbounded HashMap<H256, GossipDataEntry> with a new GossipSignatureBuffer that enforces a hard cap of 2048 individual signatures via FIFO eviction by data_root insertion order. The approach mirrors the existing PayloadBuffer pattern and keeps the public Store API unchanged.

The root problem is legitimate: during finalization stalls, gossip signatures accumulate indefinitely. The fix is well-targeted and the encapsulation is clean.


Correctness

Self-eviction edge case (minor but worth documenting):

In insert, eviction pops from the front (oldest) of the VecDeque. If there is only one data_root in the buffer and it accumulates more signatures than capacity, the eviction loop will pop and remove that same data_root — including the signature just inserted. Example with capacity=4:

existing: data_root X with 4 sigs (total=4, at capacity)
insert(X, validator=4): total → 5, evict front = X (removes all 4 sigs), total → 1
result: only validator 4's sig survives

With GOSSIP_SIGNATURE_CAP=2048 and a 4-validator devnet this cannot happen in practice (a single data_root can have at most 4 sigs), but this is a latent correctness risk as validator set sizes grow. The lack of a test for this scenario means the behaviour is undocumented.

prune return value unit is data_roots, not signatures:

prune returns pruned_roots.len(), counting distinct attestation-data entries removed — matching what the old before - gossip.len() returned. The blockchain layer logs this as "pruned gossip signatures: N", which could be read as individual signatures. This is consistent with the prior behaviour and not a regression, but the ambiguity is worth noting.

Invariant: order and data lengths must always match. This invariant holds across all three operations as written, but there is no debug_assert!(self.order.len() == self.data.len()) to catch future regressions. The eviction loop also silently handles the impossible case where order.pop_front() returns a root that data.remove() doesn't find — resulting in total_signatures stalling above capacity and breaking out. A debug_assert! inside that branch would surface invariant breakage early:

if let Some(evicted) = self.order.pop_front() {
    if let Some(removed) = self.data.remove(&evicted) {
        self.total_signatures -= removed.signatures.len();
    } else {
        debug_assert!(false, "order/data out of sync: {evicted:?} not in data");
        break;
    }
}

prune uses split-borrow of self inside retain closure:

self.data.retain(|root, entry| {
    // ...
    self.total_signatures -= entry.signatures.len();  // borrows self.total_signatures
    // ...
});

This relies on Rust 2021 disjoint closure captures allowing separate borrows of self.data and self.total_signatures. This is valid and compiles correctly with edition 2024 — no issue.
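
For reference, the same split borrow can be expressed with explicit destructuring, which works in every edition (stand-in types, not the PR's code):

```rust
use std::collections::HashMap;

struct Buf {
    data: HashMap<u64, (u64, usize)>, // data_root -> (slot, signature count)
    total_signatures: usize,
}

impl Buf {
    fn prune(&mut self, finalized_slot: u64) {
        // Destructuring splits the &mut self borrow into independent field
        // borrows up front; the 2021-edition disjoint closure capture
        // achieves the same thing implicitly.
        let Buf { data, total_signatures } = self;
        data.retain(|_root, &mut (slot, count)| {
            if slot <= finalized_slot {
                *total_signatures -= count; // safe: disjoint from `data`
                false
            } else {
                true
            }
        });
    }
}
```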


Performance

VecDeque without capacity hint: GossipSignatureBuffer::new uses VecDeque::new(), which starts with zero capacity and reallocates as entries are added. Since capacity is known at construction time, VecDeque::with_capacity(capacity) would eliminate reallocations for typical workloads (one entry per data_root, bounded by the eviction cap).

Unconditional HashSet allocations:

Both delete and prune allocate a HashSet<H256> unconditionally, even in the common case where no entries become empty / get pruned. The delete path already has an early return for empty keys, but the HashSet is allocated before the loop in both methods. Lazy initialization or a local Vec (leveraging small-set O(n) scan) would reduce allocation pressure on the hot path.
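
One way to express that lazy initialization (a hypothetical refactor; `collect_emptied` is an illustrative name, not a function in the PR):

```rust
use std::collections::HashSet;

// Defer the HashSet allocation until the first emptied root actually appears.
// Returning None lets the caller skip the retain() pass entirely.
fn collect_emptied(emptied_roots: impl Iterator<Item = u64>) -> Option<HashSet<u64>> {
    let mut set: Option<HashSet<u64>> = None;
    for root in emptied_roots {
        set.get_or_insert_with(HashSet::new).insert(root);
    }
    set
}
```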

retain complexity in delete: self.order.retain(|r| !emptied_roots.contains(r)) is O(order.len()) with O(1) set lookup — fine. The existing comment correctly calls out that this is a single-pass batch clean.


Design

Fixed constant vs. validator-derived cap: GOSSIP_SIGNATURE_CAP = 2048 is justified well for the current 4-validator devnet. However, the PR description's own math shows the cap is tightly coupled to validator count: "With 4 validators: covers ~512 slots (~34 minutes)". At 100 validators, 2048 signatures covers only ~20 slots (~80 seconds). If the node is deployed in a larger-network scenario, the cap would need to scale with genesis_validators.len(). The constant is fine for now given the explicit scope of the fix, but a future follow-up should make this configurable or derived from genesis config.
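
A hedged sketch of what such a derived cap could look like (the names `derive_gossip_cap` and `TARGET_SLOT_HORIZON` are illustrative, not from the PR):

```rust
// ~512 slots matches the PR's "covers ~34 minutes" math at 4-second slots.
const TARGET_SLOT_HORIZON: usize = 512;
const MIN_GOSSIP_SIGNATURE_CAP: usize = 2048; // current fixed constant as a floor

// One signature per validator per slot over the target horizon, never
// dropping below today's constant.
fn derive_gossip_cap(validator_count: usize) -> usize {
    validator_count
        .saturating_mul(TARGET_SLOT_HORIZON)
        .max(MIN_GOSSIP_SIGNATURE_CAP)
}
```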

FIFO eviction unit is entire data_roots, not individual signatures: When a data_root with 512 sigs is evicted to make room for one new sig, total_signatures drops by 512 at once. This means the buffer can oscillate well below its stated capacity. For the current 4-validator case this is inconsequential, but it's worth documenting as a known property.


Tests

The six new tests cover the core paths well. A few gaps worth noting:

  • No test for the self-eviction case described above (capacity=N, one data_root accumulates N+1 sigs → entire root evicted).
  • gossip_buffer_prune_by_slot uses exactly 1 sig per slot. A variant with multiple sigs per slot would verify that total_signatures is correctly decremented for multi-sig entries.
  • No test for snapshot() content or ordering — just that the internal counters are right. Fine given snapshot is a read-only projection.
  • buf.data and buf.order are accessed directly in tests, relying on same-module visibility. This is idiomatic Rust; no issue.

Code Quality

The encapsulation is clean and the pattern is consistent with PayloadBuffer. Moving the is_empty early-return guard from Store::delete_gossip_signatures into GossipSignatureBuffer::delete is the right abstraction boundary. The O(1) total_signatures counter replacing the previous O(n) .values().map(|e| e.signatures.len()).sum() is a meaningful improvement.


Summary

Category Verdict
Correctness Good — one latent self-eviction edge case; invariant assertions would help
Performance Fine — minor: pre-allocate VecDeque, lazy HashSet in delete/prune
Design Good — consider validator-count-derived cap for future scalability
Tests Good — covers main cases; two missing edge-case variants noted
Security Acceptable — FIFO is gameable in theory but the bounded-memory fix outweighs that risk

The implementation correctly solves the stated production issue. The concerns above are all minor or future-facing — none block merging. The Codex review's suggestion to derive the cap from validator count is the most worth tracking as a follow-up.


Automated review by Claude (Anthropic) · sonnet · custom prompt

@greptile-apps
Contributor

greptile-apps bot commented Apr 15, 2026

Greptile Summary

This PR introduces GossipSignatureBuffer, a bounded FIFO structure that replaces the unbounded HashMap<H256, GossipDataEntry> for in-memory gossip signatures, fixing the unbounded growth during finalization stalls (issue #263). The implementation mirrors the existing PayloadBuffer pattern and enforces a 2048-signature hard cap (~6 MB worst-case) with slot-based pruning still running as a complementary path. The public API is unchanged and all 22 tests pass.

Confidence Score: 5/5

Safe to merge — bounded buffer is correctly implemented, invariants are maintained, all 22 tests pass, and the public API is unchanged.

No P0 or P1 issues found. The FIFO eviction algorithm is correct: data, order, and total_signatures are kept in sync across all three mutation paths (insert/evict, delete, prune). The Rust 2021 partial-field capture in the prune closure is valid and confirmed by the passing test suite. Dev dependencies are correctly scoped. The capacity constant (2048) is well-justified in the PR description.

No files require special attention.

Important Files Changed

Filename Overview
crates/storage/src/store.rs Core change: replaces HashMap<H256, GossipDataEntry> with a bounded GossipSignatureBuffer (FIFO eviction, VecDeque-tracked order, O(1) total_signatures counter). Invariants between data, order, and total_signatures are correctly maintained across insert, delete, and prune. Six new unit tests cover all stated cases. No API-breaking changes.
crates/storage/Cargo.toml Adds leansig and rand to [dev-dependencies] only, required to generate real XMSS signatures in the new unit tests. No production dependency changes.
Cargo.lock Lock file updated to reflect new dev-dependencies (leansig, rand 0.10.0) for ethlambda-storage. No new production dependencies.

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[insert hashed, validator_id, signature] --> B{data_root in data?}
    B -- Yes --> C{validator_id in signatures?}
    C -- Yes --> D[return / first-write-wins]
    C -- No --> E[insert signature into BTreeMap\ntotal_signatures += 1]
    B -- No --> F[create new GossipDataEntry\ninsert into data\npush data_root to order back\ntotal_signatures += 1]
    E --> G{total_signatures > capacity?}
    F --> G
    G -- No --> H[done]
    G -- Yes --> I[pop oldest data_root from order front]
    I --> J[remove entry from data\ntotal_signatures -= entry.len]
    J --> G

    subgraph delete [delete keys]
        K[for each vid,data_root] --> L{entry exists?}
        L -- No --> M[skip]
        L -- Yes --> N[remove vid from signatures\ntotal_signatures -= 1]
        N --> O{signatures empty?}
        O -- Yes --> P[remove from data\nadd to emptied_roots]
        O -- No --> K
        P --> K
        K --> Q[VecDeque.retain: remove emptied_roots]
    end

    subgraph prune [prune finalized_slot]
        R[data.retain: keep slot > finalized_slot\naccumulate pruned_roots\ntotal_signatures -= pruned.len] --> S[VecDeque.retain: remove pruned_roots]
    end
```


