perf: single-format rkyv serialization + zero-copy in-place verifier #769
Open
Oppen wants to merge 36 commits into
Add the measurement/profiling harness for the in-VM STARK verifier:
- `empty`-proof and `deserialize-only` bench guests + `sp1/verifier` cross-prover comparison, all exercising the no_std verifier.
- Expand the recursion smoke test with PC-histogram, sampled-flamegraph, page-count, cycle-count and per-step-breakdown diagnostics, plus the `make test-profile-recursion` targets and the histogram-aggregation CI script/workflow.
- Expose read-only `Executor::memory()`, `Memory::cells()` and `SymbolTable::functions()` accessors and make `flamegraph::demangle` public so the diagnostics can resolve guest PCs to functions.
The top-100 per-address table carried bare PCs with no file:line, so it was not actionable for optimization and the CI aggregator already discarded it. Keep the per-function fold (the view that matters); terminate the aggregator's function-table parse on the trailing rule instead of the removed PC header.
Extract setup_guest_run (blob build + ELF load + Executor::new) and a log_progress throttled-readout factory, used by the cycle-count, page-count, PC-histogram, sampled-flamegraph and step-breakdown diagnostics. Generalize the PC-histogram runner over guest name + progress stride so the deserialize-only histogram is a one-line caller instead of a near-duplicate.
Collapse the cycle-count, PC-histogram and step-breakdown diagnostics into one parameterized run_profile(guest, stride, opts, detailed): total cycles print unconditionally, the top-25 functions + per-step breakdown gate on detailed (they share one streamed pass over the same PC stream). Every variant now comes in 1query and multiquery flavours for both recursion and the deserialize-only control. Route execute_outer_and_commit through drive_executor too — the rebased streaming finish() makes its hand-rolled drain loop redundant.
Add deserialize-only to RECURSION_GUESTS and migrate the guest to the recursion guest's std shape (lambda_vm_syscalls + build-std std), since the old no_std panic handler collided with std. Add getrandom_backend="custom" to its cargo config (transitive getrandom 0.3 needs it) and track its Cargo.lock. The deser control guest now builds and its profile tests run.
The smoke pipelines already host-verify the inner proof, so building with --features stark/instruments surfaces the per-step timings; the dedicated test was just that verify minus the guest run. Documented the flag in the module doc.
It was never wired into the bench harness or CI (run.sh uses sp1/fibonacci), and its in-VM verifier-cost comparison is superseded by the recursion profile tests in this PR.
Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>
Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>
For some reason, that alone appears to completely inhibit the effects of the next PRs' pre-built commitments and vkey optimization.
Step detection was misbehaving because inlined functions do not emit symbols. One option was marking the functions `#[inline(never)]`, which in the case of `replay_rounds_after_round_1` inhibits optimizations; the other required adding a dependency. Having added the dependency, we took further advantage of it and expanded the detailed profile with a per-file:line breakdown as well.
This reverts commit 3df1e08.
Replace symbol/DWARF-based verifier-step detection with an explicit addi x0,x0,N marker instruction, immune to inlining. Adds a STEP_DECODE_DONE marker in the recursion guest itself, making the deserialize-only control guest (manual A/B subtraction) redundant.
Too much reviewer overhead for their current value. The sampled flamegraph will come back once the executor's flamegraph tooling makes it simple to reimplement; the page-count histogram isn't interesting right now.
SymbolTable::functions() and Memory::cells() existed solely for the flamegraph/page-count smoke tests just deleted.
Add STEP_AIRS_AND_BUS_BALANCE_DONE marker so the verifier's preprocessed FFT+Merkle commitment build (VmAirs::new) is bucketed separately from postcard decode and from multi_verify's transcript replay. The top-25 cycle table now also tags each row with its verifier step, so e.g. how much of step4:openings is keccak is visible at a glance. Update the CI histogram aggregator to parse and render the new step column.
The previous split added a step column to one combined top-25 table, losing per-step rank/cum% fidelity. Print the global top-25 (all steps folded together) plus a separate top-25 table per verifier step, so each step's own hottest functions and their cumulative share are visible directly. Update the CI aggregator to parse and render the new multi-table output.
Per-step tables previously used the global cycle count as the pct denominator, so a function dominating a cheap step (e.g. 90% of step2:claimed) rendered as a near-zero percentage of the whole run — useless for spotting what dominates within that step. Use each step's own cycle total as the denominator for its table instead; the global table still uses the run's total. Update the CI aggregator to parse and surface the per-step denominator.
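The denominator change is easiest to see with numbers. The following is a minimal sketch, not the PR's code: `Row` and `percentages` are hypothetical names, and the cycle counts are made up for illustration.

```rust
/// One row of a profile table: a function name and the cycles attributed to it.
/// (Hypothetical type, for illustration only.)
struct Row<'a> {
    function: &'a str,
    cycles: u64,
}

/// Percentage of `denominator` that each row accounts for.
/// Returns an empty table when the denominator is zero (a step that never ran).
fn percentages<'a>(rows: &[Row<'a>], denominator: u64) -> Vec<(&'a str, f64)> {
    if denominator == 0 {
        return Vec::new();
    }
    rows.iter()
        .map(|r| (r.function, 100.0 * r.cycles as f64 / denominator as f64))
        .collect()
}
```

Passing a step's own cycle total as `denominator` makes a function that dominates a cheap step show up as dominant; passing the run total reproduces the old, near-zero figure.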
…filing
`decode_step_marker` required only dst==0, so it matched the canonical NOP (addi x0, x0, 0) as marker 0; pin src==0 and imm!=0 per the documented addi x0, x0, N convention. `run_profile` latched the step bucket at the highest marker ever seen, so multi_verify's per-AIR-table 3,4,5,6 repetition folded every table after the first into bucket 6. Track the latest marker instead.
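Both fixes can be sketched in a few lines. This is an illustration under assumptions, not the PR's implementation: the function body is reconstructed from the standard RISC-V I-type encoding, the immediate is treated as a raw 12-bit value, `bucket_cycles` is a hypothetical helper, and charging one cycle per instruction to the current bucket is a simplification.

```rust
use std::collections::BTreeMap;

/// Decode an `addi x0, x0, N` step marker from a 32-bit RISC-V instruction word.
/// Requires rd == 0, rs1 == 0 and a non-zero immediate, so the canonical NOP
/// (`addi x0, x0, 0`) is not mistaken for marker 0.
fn decode_step_marker(word: u32) -> Option<u32> {
    const OPCODE_OP_IMM: u32 = 0x13;
    let opcode = word & 0x7f;
    let rd = (word >> 7) & 0x1f;
    let funct3 = (word >> 12) & 0x7;
    let rs1 = (word >> 15) & 0x1f;
    let imm = word >> 20; // raw 12-bit immediate (assumption: markers are small positive values)
    let is_addi = opcode == OPCODE_OP_IMM && funct3 == 0;
    if is_addi && rd == 0 && rs1 == 0 && imm != 0 {
        Some(imm)
    } else {
        None
    }
}

/// Charge one cycle per instruction to the most recently seen marker,
/// rather than to the highest marker seen so far.
fn bucket_cycles(trace: &[u32]) -> BTreeMap<u32, u64> {
    let mut buckets = BTreeMap::new();
    let mut current = 0u32; // bucket 0: before any marker
    for &word in trace {
        if let Some(step) = decode_step_marker(word) {
            current = step;
        }
        *buckets.entry(current).or_insert(0) += 1;
    }
    buckets
}
```

With a trace that revisits marker 3 after marker 4, the cycles after the second marker 3 land in bucket 3 again; a "highest marker" latch would have left them in bucket 4.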
Add `VmVerifyingKey` (prover/src/vkey.rs): host-derived cache of the five preprocessed-table Merkle commitments (BITWISE, DECODE, REGISTER, KECCAK_RC, per-PAGE). `VmAirs::new_with_vkey` / `verify_with_options_with_vkey` take the cached commitments instead of recomputing them — recomputation is ~87% of verifier cycles inside the recursion guest. Soundness is preserved by Fiat-Shamir. The recursion and deserialize-only guests and the smoke test now encode the vkey into the postcard blob `(VmProof, elf, opts, vkey)`.
A prover-supplied vkey defeats the preprocessed-commitment check: Fiat-Shamir only catches post-hoc tampering, not a coordinated prover committing to a forged table with a matching vkey. Bind keccak(vkey) as vk_digest: embed ProofOptions in the vkey (query count and grinding factor pin no commitment), stamp the digest into VmProof and the statement transcript (V3), check it in verify before STARK work, and commit vk_digest || inner output from the recursion guest so the outer verifier can compare against a digest derived from the trusted inner ELF. Also validate vkey version/page-count instead of panicking on short pages, and reject on options mismatch.
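The guest-side commit and the outer comparison can be sketched as follows. This is a minimal illustration under assumptions: the digest is taken to be 32 bytes, the function names are hypothetical, and the keccak computation itself is left out (the digest arrives as an opaque byte array).

```rust
/// Assumed digest width; the PR only says `keccak(vkey)`.
const DIGEST_LEN: usize = 32;

/// Guest side: commit `vk_digest || inner_output`.
fn commit_with_digest(vk_digest: &[u8; DIGEST_LEN], inner_output: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(DIGEST_LEN + inner_output.len());
    out.extend_from_slice(vk_digest);
    out.extend_from_slice(inner_output);
    out
}

/// Outer side: accept only if the committed prefix equals the digest derived
/// from the trusted inner ELF, and return the inner output on success.
fn check_committed<'a>(committed: &'a [u8], trusted: &[u8; DIGEST_LEN]) -> Option<&'a [u8]> {
    if committed.len() < DIGEST_LEN {
        return None;
    }
    let (digest, inner_output) = committed.split_at(DIGEST_LEN);
    if digest == &trusted[..] {
        Some(inner_output)
    } else {
        None
    }
}
```

The point of the comparison is that the outer verifier never trusts a digest the prover supplied: it derives its own and rejects on any mismatch.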
The comment should still say when it needs bumping, not why it was bumped the last time.
The host roundtrip test still decoded (VmProof, elf, opts); postcard discards trailing bytes, so it silently skipped the vkey and the vkey verify path the guest actually exercises.
Replace postcard and serde with rkyv as the sole proof serialization. The STARK verifier gets one implementation over ArchivedStarkProof, reading the archive in place: archived field elements are bit-identical to native ones on little-endian (ArchivedFieldElement transparent newtype + slice_as_native), so the recursion guest verifies straight from its input buffer with no deserialization pass and no per-field allocation. Owned multi_verify becomes a serialize-then-delegate shim, so every host verification also exercises the wire format.
- recursion blob: 12-byte LVMR magic/version prefix 16-aligns the archive at PRIVATE_INPUT_START+4+12 (const-asserted); guest borrows the input region (get_private_input_slice), commits vk_digest ‖ public_output from verify_recursion_blob
- vkey digest: framework-free fixed-width canonical encoding (exhaustive destructure; injective), replacing postcard bytes
- CLI persistence: bincode -> validated rkyv from_bytes/to_bytes; proof files change format
- Table: manual rkyv impl under disk-spill reads via row_major_data, archive layout identical to the derive
- ethrex host-reference tests move to tooling/ethrex-tests (detached workspace): ethrex pins rkyv 'unaligned', a global archived-layout switch that must not feature-unify with the aligned proof format; our crates pin 'aligned' so any reintroduction is a compile error
- harden verifier for in-place reads: OOD dimensions_consistent + height>0 gate, deep-openings count guard, aux-width checked_sub, FRI decommitment length equality (fixes pre-existing skip of the fri_last_value check via over-long layers_evaluations_sym)
- verify_recursion_blob falls back to one aligned copy when the host buffer is misaligned (guest path stays zero-copy)

Recursion verifier guest, inner empty proof:
- 1 query (blowup=2): 115.26M -> 88.98M cycles (-22.8%)
- multi-query (blowup=8): 2.976B -> 2.211B cycles (-25.7%)

setup step (was postcard decode): 21.89M -> ~170 cycles
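The "transparent newtype + slice_as_native" idea can be illustrated without rkyv. The sketch below is an assumption about the shape of the technique, not the PR's code: `FieldElement` is a stand-in `u64` wrapper, and the real archived type and its layout guarantees come from the actual crate.

```rust
/// Stand-in for a native field element (hypothetical; a plain u64 wrapper here).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(transparent)]
struct FieldElement(u64);

/// Stand-in for the archived form: a transparent wrapper around the native one.
#[repr(transparent)]
struct ArchivedFieldElement(FieldElement);

/// Reinterpret a slice of archived elements as native elements without copying.
fn slice_as_native(archived: &[ArchivedFieldElement]) -> &[FieldElement] {
    // SAFETY: `ArchivedFieldElement` is `#[repr(transparent)]` over `FieldElement`,
    // so both slices have identical element size, alignment and bit validity,
    // and the returned slice borrows from `archived` for the same lifetime.
    unsafe {
        std::slice::from_raw_parts(archived.as_ptr().cast::<FieldElement>(), archived.len())
    }
}
```

Because the cast only changes the type of the pointer, the native view points at the same bytes as the archive, which is what lets a verifier read the input buffer in place.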
Widen the private-input header from 4 to 16 bytes ([len: u32 LE] + 12 reserved), moving the payload base to PRIVATE_INPUT_START + 16. That base is 16-aligned, so guests can read structured (rkyv-archived) input in place with naturally-aligned loads instead of working around a 4-aligned base (the recursion blob's pad arithmetic; ethrex's rkyv 'unaligned' pin exists for the same reason and can eventually be dropped upstream).
- executor: PRIVATE_INPUT_PAYLOAD_OFFSET = 16 (const-asserted 16-aligned); store_private_inputs writes payload at +16
- syscalls: get_private_input(_slice) and ef_io read at +16
- prover: private_input_bytes mirrors the header in the PAGE/genesis image; verifier page bound uses the offset
- recursion prefix: 16 bytes (magic+version+reserved(8)), now pure framing, sized to a multiple of the alignment; encode_recursion_input serializes directly after the prefix into an AlignedVec (no archive copy), so the host path is aligned by construction and the misaligned fallback only guards foreign buffers
- fixtures: fibonacci bench guest and test_private_input_xpage read the payload at +16; xpage now commits payload[0..8]

BREAKING: guests built against the +4 payload base read garbage; rebuild all guest ELFs (make compile-programs).
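The widened header layout can be sketched as a pair of host-side helpers. `PRIVATE_INPUT_PAYLOAD_OFFSET` is the constant named in the commit; `encode_private_input` and `read_private_input` are hypothetical names for illustration and operate on a plain byte buffer rather than VM memory.

```rust
/// Payload offset inside the private-input region: [len: u32 LE] + 12 reserved bytes.
const PRIVATE_INPUT_PAYLOAD_OFFSET: usize = 16;
// Compile-time check that the payload base keeps 16-byte alignment.
const _: () = assert!(PRIVATE_INPUT_PAYLOAD_OFFSET % 16 == 0);

/// Build the region image: length header, zeroed reserved bytes, then the payload.
/// Returns `None` if the payload length does not fit in the u32 header.
fn encode_private_input(payload: &[u8]) -> Option<Vec<u8>> {
    if payload.len() > u32::MAX as usize {
        return None;
    }
    let mut region = vec![0u8; PRIVATE_INPUT_PAYLOAD_OFFSET + payload.len()];
    region[..4].copy_from_slice(&(payload.len() as u32).to_le_bytes());
    region[PRIVATE_INPUT_PAYLOAD_OFFSET..].copy_from_slice(payload);
    Some(region)
}

/// Borrow the payload back out of a region image, bounds-checked.
fn read_private_input(region: &[u8]) -> Option<&[u8]> {
    let header = region.get(..PRIVATE_INPUT_PAYLOAD_OFFSET)?;
    let len = u32::from_le_bytes([header[0], header[1], header[2], header[3]]) as usize;
    let end = PRIVATE_INPUT_PAYLOAD_OFFSET.checked_add(len)?;
    region.get(PRIVATE_INPUT_PAYLOAD_OFFSET..end)
}
```

A reader built against the old +4 base would start 12 bytes early, inside the reserved area, which is why the commit calls for rebuilding every guest ELF.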
Oppen (Collaborator, Author) commented on Jul 3, 2026:
Moved out of the main workspace due to rkyv/unaligned feature bubbling up to lambda_vm. Not actually deleted.
Force-pushed from a5515d1 to 65f990f.
Replaces postcard and serde with rkyv as the sole proof serialization format, and rewrites the STARK verifier as a single implementation over `&ArchivedStarkProof`, read in place from the archive: no deserialization pass, no per-field allocation. On little-endian, an archived field element is bit-identical to a native one (`ArchivedFieldElement` transparent newtype + `slice_as_native`), so the recursion guest verifies the proof straight out of its private-input buffer.

Numbers (recursion verifier guest, inner `empty` proof)

Design
- `multi_verify` is now a serialize-then-delegate shim, so every host verification (incl. all tests) exercises the actual archive format. `PI` is materialized per table only where the AIR API needs it (`()` for the VM, which is free).
- Recursion blob: `LVMR` magic/version prefix + rkyv archive, encoded copy-free into an `AlignedVec` seeded with the prefix. Guest borrows the mapped input region (`get_private_input_slice`), validates the prefix, verifies in place, commits `vk_digest ‖ public_output`.
- Private input: payload at `PRIVATE_INPUT_START + 16` (header = `len: u32` + 12 reserved), so the payload base is 16-aligned and structured input needs no pad arithmetic. Rebuild guest ELFs.
- ethrex pins rkyv `unaligned`, a global archived-layout switch that feature-unifies across a workspace build and would silently flip the proof format. Its host-reference tests move to the detached workspace `tooling/ethrex-tests` (`make test-ethrex`), and our crates pin `aligned` so any reintroduction is a hard compile error (verified: re-adding the dep trips rkyv's mutual-exclusion guard).

Verification
Adversarially reviewed for soundness, robustness, and performance: the verifier rewrite is check-for-check identical to the owned path (transcript order, iteration bounds, Montgomery-form transmute analyzed: non-canonical bit patterns can only reject); hardened for in-place reads (OOD dimension/height gates, deep-openings count guard, FRI decommitment length equality; the latter also closes a pre-existing skip of the `fri_last_value` check); malformed guest blobs are self-DoS only, with no false-accept path. Full workspace suite green (492 prover / 137 stark tests; every host verify round-trips the archive), host blob roundtrip/tamper/misalignment tests green, in-VM end-to-end verifies and commits.
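The misalignment case covered by the host tests can be sketched with the standard library alone. This is an illustration under assumptions: `Chunk`, `AlignedCopy`, `is_aligned` and `with_aligned` are hypothetical names, and the PR's own host path uses rkyv's `AlignedVec` rather than this hand-rolled buffer.

```rust
/// 16-byte, 16-aligned storage unit (hypothetical helper type).
#[repr(C, align(16))]
#[derive(Clone, Copy)]
struct Chunk([u8; 16]);

/// Owned byte buffer whose first byte sits on a 16-byte boundary.
struct AlignedCopy {
    chunks: Vec<Chunk>,
    len: usize,
}

impl AlignedCopy {
    fn from_bytes(bytes: &[u8]) -> Self {
        let mut chunks = vec![Chunk([0; 16]); (bytes.len() + 15) / 16];
        for (chunk, src) in chunks.iter_mut().zip(bytes.chunks(16)) {
            chunk.0[..src.len()].copy_from_slice(src);
        }
        Self { chunks, len: bytes.len() }
    }

    fn as_bytes(&self) -> &[u8] {
        // SAFETY: `Chunk` is `repr(C)` over `[u8; 16]` with no padding, the chunks
        // are contiguous in the Vec, and `len <= chunks.len() * 16`.
        unsafe { std::slice::from_raw_parts(self.chunks.as_ptr().cast::<u8>(), self.len) }
    }
}

fn is_aligned(bytes: &[u8]) -> bool {
    bytes.as_ptr() as usize % 16 == 0
}

/// Run `f` on the input in place when it is already aligned,
/// otherwise on a single aligned copy.
fn with_aligned<R>(bytes: &[u8], f: impl FnOnce(&[u8]) -> R) -> R {
    if is_aligned(bytes) {
        f(bytes)
    } else {
        f(AlignedCopy::from_bytes(bytes).as_bytes())
    }
}
```

The aligned branch hands the original slice through untouched, so the zero-copy path pays nothing for the existence of the fallback.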