[DNM] feat(da): support fiber (not via c-node) by julienrbrt · Pull Request #3244 · evstack/ev-node

julienrbrt · 2026-04-13T11:57:04Z

Overview

Support Fiber client (based on https://github.com/celestiaorg/celestia-app/blob/63fbf31cca216fc4e067a9e1b3a3431115c7009b/fibre), but not via celestia node or apex for this PoC
celestiaorg/celestia-node#4892

coderabbitai · 2026-04-13T11:57:12Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

github-actions · 2026-04-13T11:57:20Z

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

Build	Format	Lint	Breaking	Updated (UTC)
`✅ passed`	`⏩ skipped`	`✅ passed`	`✅ passed`	May 19, 2026, 7:11 PM

claude · 2026-04-13T11:57:22Z

Claude finished @julienrbrt's task in 3m 55s

PR Review: `[DNM] feat(da): support fiber (not via c-node)`

Gather context and understand PR scope
Read AGENTS.md and project guidelines
Review core fiber client implementation
Review DA submitter changes
Review DA retriever/syncer changes
Review config changes and tools
Review security findings
Compile and post final review

This is a large, architecturally significant PoC. The DIVERGENCE.md document is excellent — it clearly explains the intentional breaking changes. The overall direction (combined blobs, DA-only sync, simplified submitter) is sound. Most issues below are bugs in the new Fiber-specific code that could cause silent failures or correctness problems in production.

Critical Bugs

1. context.Background() used in upload — cancellation is broken

block/internal/da/fiber_client.go:102

result, err := c.fiber.Upload(context.Background(), nsID, flat)

The caller's ctx is passed into Submit but ignored for the actual upload. This means:

Shutdowns / context cancellations will not abort in-flight uploads
The downstream errors.Is(err, context.Canceled) check (line 114) can never be true, because context.Background() is never cancelled — so that code path is dead


2. Namespace slice panics if namespace is shorter than 10 bytes

block/internal/da/fiber_client.go:95, 202, 321

nsID := namespace[len(namespace)-10:]   // panics if len(namespace) < 10

There's no bounds check before slicing. If a caller passes a namespace shorter than 10 bytes (e.g. misconfiguration), this will panic the node. All three call sites (Submit, retrieve, Subscribe) have this issue.


3. splitBlobs count field not validated — potential OOM

block/internal/da/fiber_client.go:434

count := int(binary.BigEndian.Uint32(data))
blobs := make([][]byte, 0, count)  // count can be up to 4 billion

A corrupted or maliciously crafted blob can set count to math.MaxUint32, causing an immediate OOM allocation. The code should cap count at len(data)/4 (each blob entry takes at least 4 bytes) before allocating.


4. MockDA subscriber cleanup has an index-corruption bug

block/internal/da/fibremock/mock.go:224-236

The cleanup goroutine uses a captured idx to remove itself by swap-with-last, but idx becomes stale when earlier subscribers are removed first. If subscriber A (idx=0) is removed and swaps B (idx=1) to position 0, then when B's goroutine runs it still has idx=1 — it now removes whichever subscriber is at position 1 (not B itself), corrupting the subscriber list. This is reproducible with 3+ concurrent Listen calls. For a test mock the practical impact is limited, but it's worth fixing before the mock is used in integration tests.
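One possible fix is to key subscribers by a stable ID rather than a slice position, so removal order no longer matters. The `registry` type below is a hypothetical illustration, not the mock's actual structure:

```go
package main

import (
	"fmt"
	"sync"
)

// registry tracks subscribers by a monotonically increasing ID.
type registry struct {
	mu     sync.Mutex
	nextID uint64
	subs   map[uint64]chan string
}

func newRegistry() *registry {
	return &registry{subs: make(map[uint64]chan string)}
}

// add registers a subscriber and returns its channel plus a remove function
// bound to the subscriber's ID, which never shifts when others are removed.
func (r *registry) add() (<-chan string, func()) {
	r.mu.Lock()
	defer r.mu.Unlock()
	id := r.nextID
	r.nextID++
	ch := make(chan string, 1)
	r.subs[id] = ch
	return ch, func() {
		r.mu.Lock()
		defer r.mu.Unlock()
		delete(r.subs, id)
	}
}

// publish delivers ev to every subscriber without blocking on slow ones.
func (r *registry) publish(ev string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	for _, ch := range r.subs {
		select {
		case ch <- ev:
		default: // drop if the subscriber's buffer is full
		}
	}
}

func (r *registry) count() int {
	r.mu.Lock()
	defer r.mu.Unlock()
	return len(r.subs)
}

func main() {
	r := newRegistry()
	_, removeA := r.add()
	_, removeB := r.add()
	chC, _ := r.add()

	removeA() // with index-based removal, B's captured index would now be stale
	removeB()
	r.publish("blob")
	fmt.Println(r.count(), <-chC)
}
```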

Significant Issues

5. CodeQL: Integer overflow in MarshalBlockBlob

block/internal/common/blob.go:25

size := 4 + 4 + len(headerBz) + 4 + len(dataBz) + 4 + len(envelopeSig)

On 32-bit targets this overflows if the combined payload approaches 2 GiB. CodeQL correctly flags this. Cast to int64 (or uint64) before summing or validate individual lengths first.

6. GetProofs / Validate are silently unimplemented

block/internal/da/fiber_client.go:378-398

GetProofs always returns a non-nil error, and Validate always returns true. Callers that rely on either for security decisions (proof verification) will silently skip all validation. These stubs should at minimum return a clearly typed sentinel error (da.ErrNotSupported) so call sites can handle or skip them explicitly.

7. TODO DA height is always 0 in Submit result

block/internal/da/fiber_client.go:149

Height: 0, /* TODO */

The returned Height is always 0. The cache update in the submitter uses this value (cacheMgr.SetHeaderDAIncluded(hash, daHeight, ...)). If callers rely on DA height for sequencing or inclusion proof generation, this silently produces wrong state.

8. DefaultMaxBlobSize is wrong for non-Fiber nodes (admitted HACK)

block/internal/common/consts.go:13-19

The HACK comment acknowledges that this global 128 MiB default is incorrect for the legacy JSON-RPC blob client (bridge nodes reject >5 MiB). This means anyone running a non-Fiber node on this branch will fail DA submission silently. Given this is "DNM" / PoC, acceptable short-term, but the comment says "restructure when throughput-cleanup lands" — that cleanup should be a hard prerequisite before any production merge.

9. panic in NewFiberDAClient instead of returning error

block/public.go:103-104

if err != nil {
    panic(err)
}

Even though NewFiberClient currently only errors on nil client (already checked), panicking from startup initialization is an anti-pattern. Return the error instead.


10. Dual log lines on upload failure

block/internal/da/fiber_client.go:104-128

On upload failure both c.logger.Warn() (with timing/context) and then c.logger.Error() (with just the error) are called, producing two log events for the same failure. Consolidate into a single structured log at Error level with all fields.

Code Quality / Minor Issues

11. Typo: heigth variable name

block/internal/da/fiber_client.go:370

heigth, err := c.fiber.Head(headCtx)
return heigth, nil

Rename to height.

12. DIVERGENCE.md refers to wrong branch name

block/DIVERGENCE.md:1

The document says branch perf/block-optimization but this is julien/fiber.

13. fiberSubscribeChanSize = 42 is unexplained magic

block/internal/da/fiber_client.go:309

The buffer size of 42 has no documented rationale. Either add a comment on the constant explaining the sizing (e.g. "holds ~N seconds of blob events at max throughput") or derive it from config.

14. retrieve ignores its bool parameter entirely

block/internal/da/fiber_client.go:198

Both Retrieve and RetrieveBlobs call c.retrieve(ctx, height, namespace, true/false) but the bool parameter is named _ and ignored. If the parameter serves no purpose, remove it from the signature.

15. Missing test coverage for block/public.go (12 uncovered lines)

The codecov report shows block/public.go at 0% coverage for the new NewFiberDAClient and SetMaxBlobSize wrappers. These should have at least basic smoke tests.

Architecture Notes

The DIVERGENCE.md is a thoughtful write-up and the design rationale is solid. A few observations for when this moves toward production:

Forced inclusion namespace: HasForcedInclusionNamespace() always returns false for the Fiber PoC, which disables censorship-resistance verification entirely. The design document should call this out explicitly as a known gap.
tools/talis: ~2500 lines of new AWS/DigitalOcean infrastructure code with no tests. For a PoC benchmark tool this is acceptable, but the SSH key handling and cloud API calls in aws.go and deployment.go deserve at least a README section on credential management.
P2P removal: Removing P2P sync from the block package is a major architectural change (Divergence §3). DA-only sync has latency implications (DA finality time vs gossip time). The tradeoff is intentional for the Fiber throughput experiment, but clients relying on P2P fast-sync will need updating.
da.Subscriber: The Subscribe → da.Subscriber → da.NewSubscriber indirection is clean. The inline fast-path in DAFollower.HandleEvent (skip re-fetch if blobs already present in the subscription event) is a good optimization.

codecov · 2026-04-13T12:01:14Z

Codecov Report

❌ Patch coverage is 90.65657% with 37 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.16%. Comparing base (2865d6d) to head (4485d91).
⚠️ Report is 3 commits behind head on main.

Files with missing lines	Patch %	Lines
block/public.go	0.00%	12 Missing ⚠️
block/internal/da/fibremock/mock.go	90.90%	5 Missing and 5 partials ⚠️
block/internal/da/fiber_client.go	96.74%	5 Missing and 3 partials ⚠️
pkg/sequencers/solo/sequencer.go	61.53%	5 Missing ⚠️
pkg/config/config.go	75.00%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3244      +/-   ##
==========================================
+ Coverage   62.33%   63.16%   +0.82%     
==========================================
  Files         122      124       +2     
  Lines       12873    13258     +385     
==========================================
+ Hits         8024     8374     +350     
- Misses       3968     3995      +27     
- Partials      881      889       +8

Flag	Coverage Δ
combined	`63.16% <90.65%> (+0.82%)`	⬆️


Adds a fibremock package with:
- DA interface (Upload/Download/Listen) matching the fibre gRPC service
- In-memory MockDA implementation with LRU eviction and configurable retention
- Tests covering all paths

Migrated from celestiaorg/x402-risotto#16 as-is for integration.

Adds tools/celestia-node-fiber, a new Go sub-module that implements the ev-node fiber.DA interface by delegating Upload, Download and Listen to a celestia-node api/client.Client. Upload and Download run locally against a Celestia consensus node (gRPC) and Fibre Storage Providers (Fibre gRPC) — no bridge-node hop — using celestia-node's self-sufficient client (celestiaorg/celestia-node#4961). Listen subscribes to blob.Subscribe on a bridge node and forwards only share-version-2 blobs, which is how Fibre blobs settle on-chain via MsgPayForFibre. The package lives in its own go.mod, parallel to tools/local-fiber, so ev-node core does not inherit celestia-app / cosmos-sdk replace-directive soup. A FromModules constructor accepts the Fibre and Blob Module interfaces directly so callers can inject mocks or share an existing *api/client.Client. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…#3280)

* test(celestia-node-fiber): showcase end-to-end Upload/Listen/Download

Adds tools/celestia-node-fiber/testing/, a single-validator in-process showcase that boots a fibre-tagged Celestia chain + in-process Fibre server + celestia-node bridge, registers the validator's FSP via valaddr (with the dns:/// URI scheme the client's gRPC resolver expects), funds an escrow account, and drives the full adapter surface.

TestShowcase proves the round-trip: subscribe via Listen, Upload a blob, wait for the share-version-2 BlobEvent that lands after the async MsgPayForFibre commits, assert the BlobID from Listen matches Upload's return, Download and diff the payload bytes.

The harness is intentionally single-validator; a 2-validator Docker Compose showcase is planned as a follow-up for exercising real quorum collection.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(celestia-node-fiber): scale showcase to 10 blobs, document DataSize gap

Upload 10 distinct-payload blobs through adapter.Upload, collect BlobEvents via adapter.Listen until every BlobID is accounted for (order-insensitive, rejects duplicates), then round-trip each blob through adapter.Download to diff bytes. Catches routing bugs (wrong blob returned for a BlobID) and duplicate-event bugs that a single-blob test can't see.

Scaling the test also exposed a semantic issue: the v2 share carries only (fibre_blob_version + commitment), so b.DataLen(), which is what listen.go's fibreBlobToEvent reports today, is always 36, not the original payload length ev-node's fibremock conveys. The adapter can't derive the payload size from the subscription stream alone; surfacing it correctly needs an x/fibre PaymentPromise lookup (tracked as a TODO on fibreBlobToEvent). The test therefore asserts DataSize is non-zero rather than matching len(payload).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…3281)

listen.go previously set BlobEvent.DataSize to b.DataLen(), which for a share-version-2 Fibre blob is always the fixed share-data layout (fibre_blob_version + commitment = 36 bytes), not the original payload length. That diverges from ev-node's fibremock contract and misleads any consumer that uses DataSize to allocate buffers or report progress.

The v2 share genuinely doesn't carry the original size, and x/fibre v8 has no chain query to derive it from the commitment. The only accurate path is to Download the blob and measure. Listen now does exactly that before forwarding each event. The cost is one FSP round-trip per v2 blob; this can be made opt-out later if it hurts throughput-sensitive use cases.

Tests:
- Showcase restores the strict DataSize == len(payload) assertion across all 10 blobs.
- Unit test TestListen_FiltersFibreOnlyAndEmitsEvent now stubs fakeFibre.Download to return a deterministic payload and asserts DataSize matches its length.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ight subscriptions (#3283)

feat(celestia-node-fiber): Listen takes fromHeight for resume subscriptions

Threads a fromHeight parameter through the Fibre DA Listen path so a subscriber can rejoin the stream from a past block height without missing blobs. Consumes the matching celestia-node API change landed in celestiaorg/celestia-node#4962, which gave Blob.Subscribe a fromHeight argument backed by a WaitForHeight loop.

Changes:
- block/internal/da/fiber/types.go: DA.Listen signature now takes fromHeight uint64. fromHeight == 0 preserves "follow from tip" semantics, >0 replays from that block forward.
- block/internal/da/fibremock/mock.go: replay matching blobs with height >= fromHeight before attaching the live subscriber.
- block/internal/da/fiber_client.go: outer fiberDAClient.Subscribe does not yet expose a starting height (datypes.DA doesn't plumb one), so pass 0 and defer resume-from-height wiring to a future datypes.DA change.
- tools/celestia-node-fiber/listen.go: propagate fromHeight to client.Blob.Subscribe on the celestia-node API.
- tools/celestia-node-fiber/go.mod: bump celestia-node to the merged pseudo-version (v0.0.0-20260423143400-194cc74ce99c) carrying #4962.
- tools/celestia-node-fiber/adapter_test.go: fakeBlob.subscribeFn gets the new fromHeight arg; TestListen_FiltersFibreOnlyAndEmitsEvent asserts that fromHeight=0 is forwarded.
- tools/celestia-node-fiber/testing/showcase_test.go: existing TestShowcase passes fromHeight=0. New TestShowcaseResume uploads 3 blobs, discovers their settlement heights via a live Listen, then opens a fresh Listen with fromHeight at the first blob's height and verifies every historical blob is replayed with correct Height and DataSize.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…log (#3307)

* feat(fibre): log per-Submit upload duration

The Fibre Submit path was opaque: failures showed up as DeadlineExceeded with no signal of how long the upload actually took, and successes only logged at debug level inside the upstream library. During load-test debugging this turned into a guessing game: was the cluster slow, the deadline too tight, or something stuck mid-RPC?

Add a single info-level (warn-on-failure) log line in fiberDAClient.Submit covering the Upload call: duration, flat blob bytes, blob count. Cheap (one time.Since) and gives the operator concrete numbers (e.g. "17 blobs / 115 MiB / 1.5 s") to reason about whether RPCTimeout, pending cap, or batch sizing is the right knob to turn next.

* fix(fibre): split DA Submit batches at Fibre's 128 MiB upload cap

Under sustained txsim load (~50 MiB/s) the DA submitter batched 10 block_data items into one Upload(), producing a flat payload of 144 MiB. Fibre's per-upload cap is hard at ~128 MiB ("blob size exceeds maximum allowed size: data size 144366912 exceeds maximum 134217723") and rejected every batched upload. With MaxPendingHeadersAndData=10 that took down 170 consecutive submissions before the node halted itself with "Data exceeds DA blob size limit".

Wrap the Upload call in a chunker that groups input blobs into ≤120 MiB chunks (8 MiB headroom under Fibre's cap for the per-blob length-prefix overhead added by flattenBlobs) and uploads each chunk separately. Aggregates submitted counts and BlobIDs across chunks; on first chunk failure, returns the error with the partially-submitted count so the submitter's retry/backoff logic sees a coherent state instead of all-or-nothing.

Single oversized blobs (already validated against DefaultMaxBlobSize earlier in Submit) still land alone and fail server-side, but at least don't drag healthy peers into the same rejected batch.

* fix(evnode-fibre): cap per-block data at 100 MiB to fit a Fibre upload

Companion to the submitter chunking fix. The submitter can split a multi-blob batch into ≤120 MiB Fibre uploads, but a *single* block_data item that exceeds 128 MiB still ends up alone in its own chunk and fails server-side ("blob size exceeds maximum allowed size"). Lower the per-block cap to 100 MiB so under high-throughput txsim a single block can't grow past Fibre's hard limit, and update the comment to explain the relationship between this cap and Fibre's ~128 MiB upload reject threshold.
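The chunking rule described in the commit message above can be sketched as a greedy, order-preserving grouper. `chunkBlobs` is a hypothetical name and this is not the PR's code; per the commit message the real limit would be 120 MiB, while the demo uses tiny sizes to stay cheap.

```go
package main

import "fmt"

// chunkBlobs groups blobs, in order, into chunks whose summed size stays at
// or below maxChunk. A single blob larger than maxChunk ends up alone in its
// own chunk, matching the "oversized blobs land alone" behaviour.
func chunkBlobs(blobs [][]byte, maxChunk int) [][][]byte {
	var chunks [][][]byte
	var cur [][]byte
	curSize := 0
	for _, b := range blobs {
		// Close the current chunk if adding b would exceed the limit.
		if len(cur) > 0 && curSize+len(b) > maxChunk {
			chunks = append(chunks, cur)
			cur, curSize = nil, 0
		}
		cur = append(cur, b)
		curSize += len(b)
	}
	if len(cur) > 0 {
		chunks = append(chunks, cur)
	}
	return chunks
}

func main() {
	// Sizes 4, 4, 4, 10, 2 with a limit of 8 bytes.
	blobs := [][]byte{
		make([]byte, 4), make([]byte, 4), make([]byte, 4),
		make([]byte, 10), make([]byte, 2),
	}
	for i, chunk := range chunkBlobs(blobs, 8) {
		total := 0
		for _, b := range chunk {
			total += len(b)
		}
		fmt.Printf("chunk %d: %d blobs, %d bytes\n", i, len(chunk), total)
	}
}
```

Each chunk would then be uploaded separately, with the caller stopping at the first failed chunk and reporting how many blobs were already submitted.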

* fix(tools/talis): wait-for-chain + atomic keyring + one-command driver

Three race conditions surfaced repeatedly on a fresh AWS bring-up of the Fibre throughput experiment. Each one had the same shape: a talis subcommand "succeeded" at the CLI level (or returned the txhash with --yes) before the chain had actually applied the work, leaving downstream steps to fail in confusing ways. This commit makes each step verify *outcome*, not just *invocation*, so the experiment can go from a fresh `talis up` to a running loadgen without manual intervention.

• setup-fibre script (fibre_setup.go) now:
  - polls `celestia-appd status` for `latest_block_height>0` before submitting any tx. This fixes the silent-noop where set-host + 100× deposit-to-escrow all bounced with "celestia-app is not ready; please wait for first block";
  - retries `set-host` in a loop until the validator's host shows up in `query valaddr providers`. This fixes the case where --yes returns the txhash before block inclusion and the tx silently lands in the mempool but never confirms;
  - verifies fibre-0's escrow account is funded on-chain before the tmux session exits (same silent-failure mode as set-host, but on the deposit side).
  The talis-CLI step also now cross-checks all validators are registered from a single vantage point before returning, so a concurrent set-host race surfaces as an error instead of a half-empty provider list start-fibre would cache forever.

• fibre-bootstrap-evnode (fibre_bootstrap_evnode.go) now stages the keyring scp into a tmp directory and `mv`s it atomically into place. The previous direct `scp -r` to /root/keyring-fibre/keyring-test created the directory before transferring its contents: the evnode init script's `[ -d keyring-test ]` poll passed mid-transfer, the daemon launched with no fibre-0.info, and crashed with `keyring entry "fibre-0" not found`.

• evnode_init.sh (genesis.go) now waits for the specific keyring-test/fibre-0.info file rather than just the keyring-test directory. Belt-and-braces: the bootstrap mv is already atomic on the same filesystem, but the file-level guard means a hand-pushed keyring (not via talis) can't trip the same race.

• New `talis fibre-experiment` umbrella command runs up → genesis → deploy → setup-fibre → start-fibre → fibre-bootstrap-evnode in order. Each step uses the same binary as a subprocess; failures in any step abort the chain. Operator goes from a prepared root dir to a running loadgen with one command, instead of remembering the sequence.

Verified by 5-min sustained loadgen against julien/fiber HEAD with PR #3287 (concurrent submitter) merged: 47.65 MB/s @ 99.999 % ok, up from the prior 24.57 MB/s baseline (the gap is PR #3287's overlapping uploads; these talis fixes just stop the deploy from silently breaking before throughput matters).

* fix(tools/talis): finalize fibre setup race fixes

Three follow-up bugs surfaced from the PR #3303 follow-up verification run on a 3-validator AWS Fibre cluster:

- aws.go: CreateAWSInstances exited 0 even when individual instance launches failed, so `talis up` lied about success and downstream steps proceeded against a partial cluster. Returns a joined error now so failure cascades stop early.
- download.go: sshExec used cmd.CombinedOutput, mixing SSH warnings (the "Warning: Permanently added '...'..." chatter on stderr) into bytes the caller hands to fmt.Sscanf("%d"). The CLI-side providers cross-check parsed those warnings as 0 and looped until its 5-min deadline even though a direct SSH query showed all 3 providers registered. Switch to cmd.Output() (stdout only) and add `-q -o LogLevel=ERROR` to silence the chatter for any caller that does combine streams.
- fibre_setup.go: the per-validator escrow verification used `celestia-appd query fibre escrow` which doesn't exist; the actual subcommand is `escrow-account`. The query errored on every retry, the grep for "amount" never matched, and the script wedged on the 3-min deadline reporting `FATAL: fibre-0 escrow not present`. Switch to `escrow-account` and key on `"found":true` (the explicit existence flag in the response). Also wrap the fibre-0 deposit-to-escrow itself in a retry loop matching set-host, since the same `--yes`-returns-before-inclusion silent-failure mode bit it. fibre-1..N stay best-effort.

* feat(evnode-txsim): keep-alive conn pool + pprof endpoint

Two diagnostic improvements for the load generator:

1. http.Transport.MaxIdleConnsPerHost defaults to 2 in stdlib. With --concurrency=8 (or higher), 6+ goroutines per cycle had to open fresh TCP+TLS sockets per request because the pool couldn't hold their idle conns between requests. Bump MaxIdleConns / MaxIdleConnsPerHost / MaxConnsPerHost to 2*concurrency so every active sender has a reusable keep-alive socket, eliminating handshake churn from the hot path.

2. Always-on net/http/pprof on 127.0.0.1:6060. evnode-txsim is a load tester, not a production daemon, so cost of always serving profiling is acceptable; the payoff is being able to grab CPU profiles under live load without re-deploying the binary: `ssh -L 6060:127.0.0.1:6060 root@loadgen \ go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30`.

A profile captured this way under c=8 traced the per-request hot path: 25.5% in kernel write(2), 25% in net/http body marshaling. That diagnostic surfaced that the c6in.2xlarge loadgen was the binding constraint for the experiment at ~22 MB/s, not evnode or DA, a finding we'd have spent another debug round chasing without the in-process profiler.

* fix(solo,reaping): bound sequencer queue to prevent ingest-side OOM

Under sustained ingest above the block-production drain rate, SoloSequencer.queue grew monotonically. A 32-vCPU loadgen pushing >100 MB/s into a runner whose executor drains ~100 MB/s per block filled the queue at ~150 MB/s of net-positive growth. Heap profiles showed 24 GB of retained io.ReadAll bytes in the queue within ~30 s, then anon-rss:63GB OOM-kill at the box's 64 GiB ceiling. Reproducible twice with identical signature.

Two changes, one feature:

- SoloSequencer.SetMaxQueueBytes(n) caps the queue's total retained tx bytes. SubmitBatchTxs uses all-or-nothing admission against the cap: if the incoming batch would push us over, the whole batch is rejected with ErrQueueFull and the queue keeps its current contents untouched. Partial admission would force the caller to track which prefix succeeded and only re-feed the suffix on retry; the reaper currently doesn't do that, so the whole-batch rule lets the reaper just retry the same batch later when the queue has drained. queueBytes is decremented on drain (queue := nil) and re-counted for postponed txs that the executor's FilterTxs returns to the queue. Zero cap = the legacy unbounded path, preserved for tests and small deployments.

- The reaper bridging executor mempool → sequencer matches ErrQueueFull via errors.Is and treats it as transient backpressure: marks the rejected hashes as "seen" so the next reaper tick doesn't re-hash + re-submit the same already-rejected txs forever, logs a warn line with the dropped count, and continues running. Without this match every queue-full event would tear the daemon down via the existing fatal-on-submit-error path.

Loadgen sees the backpressure indirectly: with the sequencer queue full, the executor's txChan stops draining, /tx blocks on its bounded channel send, and txsim observes 5xx / timeouts, cleanly applied at the application layer instead of via the kernel OOM-killer.
* fix(evnode-fibre): enforce maxBytes in inMemExecutor.FilterTxs

The stub executor used by the runner returned FilterOK for every transaction unconditionally, ignoring the maxBytes budget plumbed through SoloSequencer.GetNextBatch. Under sustained txsim load (~50 MiB/s, 8 concurrent senders) the mempool would accumulate ~50K txs while a 100 MiB upload was in flight; on the next batch the sequencer drained ALL of them into one block (~369 MiB raw), the submitter saw a single item exceeding the per-blob cap, and halted the node with `single item exceeds DA blob size limit`.

Walk the input txs in arrival order, accumulate sizes against maxBytes, and return FilterPostpone past the budget so the sequencer puts the overflow back on its queue. Verified live: blocks now cap at ~10K txs / ~100 MiB and evnode sustains 58.77 MB/s DA upload throughput through a 5-min txsim run with zero crashes (was 0 → crash within 30 s before this fix).

* fix(evnode-fibre): wire sequencer queue cap + lift ingest queue caps

Two runner-side changes paired with the SoloSequencer bound:

- After constructing the SoloSequencer, call SetMaxQueueBytes with 10× the per-block tx budget (= 1 GiB at the current 100 MiB MaxBlobSize). 10× is the sweet spot: large enough that a short burst above steady-state ingest doesn't trigger backpressure (we want to absorb that), small enough that the worst-case retained bytes fit comfortably under the box's RAM budget alongside the pending cache + DA in-flight buffers.

- Lift the inMemExecutor's hardcoded ingest caps. txChan and maxBlockTxs were sized at 500 (5 MB / 5K txs per reaper poll) back when those were the only memory bound on the runner. With the SetMaxQueueBytes cap and the FilterTxs-enforced per-block budget now actually doing the bounding, the ingest queue can hold a full 100 MiB block-worth of txs (10K slots at 10 KB) without burdening memory, and a single reaper poll can drain that whole batch in one GetTxs call instead of needing 20× cycles. This was the binding constraint at ~5,000 tx/s = 50 MB/s in earlier runs.

* fix(config): tighten Fiber pending cap to 10 to bound submitter memory

ApplyFiberDefaults set MaxPendingHeadersAndData=50, but each pending data item under Fiber is up to MaxBlobSize (~100 MiB raw). With 3-FSP fan-out and per-attempt retry buffers in flight, 50 items × 3 × retries crossed 64 GiB on c6in.8xlarge under sustained txsim load and the kernel OOM-killed evnode 30 s into the run. 10 keeps the in-flight footprint bounded while still letting healthy uploads pipeline against the actual Fibre RPC latency. Verified by heap profiling: pending pause at ~ 10 × 100 MiB plus fan-out keeps RSS below ~10 GiB, evnode runs indefinitely.
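The budget walk from the FilterTxs fix can be sketched as below. `FilterStatus` and `filterByBudget` are hypothetical stand-ins (only the `FilterOK` and `FilterPostpone` names appear in the commit message), and postponing everything after the first overflow is an assumption made here to preserve arrival order.

```go
package main

import "fmt"

// FilterStatus is an illustrative per-transaction verdict.
type FilterStatus int

const (
	FilterOK FilterStatus = iota
	FilterPostpone
)

// filterByBudget walks txs in arrival order and accepts them until the
// cumulative size would exceed maxBytes. Once one tx overflows, it and all
// later txs are postponed so the sequencer can re-queue them in order.
// This sketch treats maxBytes literally: a budget of 0 postpones everything
// non-empty.
func filterByBudget(txs [][]byte, maxBytes uint64) []FilterStatus {
	statuses := make([]FilterStatus, len(txs))
	var used uint64
	full := false
	for i, tx := range txs {
		size := uint64(len(tx))
		if full || used+size > maxBytes {
			full = true
			statuses[i] = FilterPostpone
			continue
		}
		used += size
		statuses[i] = FilterOK
	}
	return statuses
}

func main() {
	txs := [][]byte{make([]byte, 4), make([]byte, 4), make([]byte, 4), make([]byte, 1)}
	for i, s := range filterByBudget(txs, 9) {
		fmt.Printf("tx %d (%d bytes): postponed=%v\n", i, len(txs[i]), s == FilterPostpone)
	}
}
```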


julienrbrt · 2026-05-18T12:22:37Z

ev-grpc for risotto isn't the solution chosen.

…st fibre (#3329)

Bumps the celestia-node-fiber tool to track:
- celestia-node feature/fibre-experimental @ 9adc59e0 (rebuilt from feature/fibre 'add state client #4987' + celestia-app dep bump)
- celestia-app feat/fibre-payments @ 5abc6308 (current main + PR #7190 'perf(fibre): flat-file shard store')

Also bumps the gnark-crypto replace directive target from v0.18.0 to v0.20.1 to keep parity with celestia-node's go.mod, which had to bump for the new celestia-app (gnark 0.14 -> 0.15).

julienrbrt added 2 commits April 13, 2026 13:54

feat(da): support fiber (not via c-node)

19ea929

Merge branch 'main' into julien/fiber

960146d

github-actions Bot assigned julienrbrt Apr 13, 2026

julienrbrt and others added 7 commits April 14, 2026 15:12

wip

ef44db2

Merge branch 'main' into julien/fiber

f96ab47

reduce alloc

f3356c6

Merge branch 'main' into julien/fiber

49c92d1

lint

7278685

updates

4485d91

julienrbrt changed the title ~~feat(da): support fiber (not via c-node)~~ [DNM] feat(da): support fiber (not via c-node) Apr 20, 2026

julienrbrt and others added 15 commits April 20, 2026 14:46

wire fiber in testapp (poc)

6472139

Merge branch 'main' into julien/fiber

03b4877

tidy tool

da26572

updates

04c70e7

properly disable fi

9e5b2ca

improve submission

c49fe6f

updates

e26879b

cleanup

7be668a

rm local fiber

24ff04e

fix flags

a4a46e7

cleanups

918acaf

julienrbrt and others added 23 commits April 30, 2026 00:20

feedback

d7ae05c

cleanup api

1c7a9d1

fixes

0016376

Merge branch 'julien/speedup-submitter' into julien/fiber

91089a5

feat(pkg/sequencers): add queue limit in solo sequencer

66dc0b4

use option

26ebd05

cl

7d0fec9

Merge branch 'solo-seq-impr' into julien/fiber

5db4ef0

move test files

094dc44

Merge branch 'solo-seq-impr' into julien/fiber

b681e8b

fix

ae257fd

Merge branch 'main' into julien/fiber

5fbf3e6

fix SSH key resolution

dbebfca

improve debug

1934aa4

fix build

db2a9b6

fix ssh path

884e8ef

fix rss flag

df51291

Improve experiment configuration

6889ed6

Merge branch 'main' into julien/fiber

34c08b0

fiber refactor block package for performance and simplified sync (#3324)

099d5ee

* refactor block package for performance and simplified sync * reduce diff * reduce diff * tidy

github-advanced-security AI found potential problems May 12, 2026

View reviewed changes

Comment thread block/internal/common/blob.go

return nil, fmt.Errorf("marshal data: %w", err)

}

size := 4 + 4 + len(headerBz) + 4 + len(dataBz) + 4 + len(envelopeSig)

julienrbrt closed this May 18, 2026

julienrbrt deleted the julien/fiber branch May 18, 2026 12:22

julienrbrt restored the julien/fiber branch May 19, 2026 18:24

julienrbrt reopened this May 19, 2026

julienrbrt wants to merge 84 commits into main from julien/fiber

5 participants
