perf(bb/msm): stream-walker accumulate — cache operands/dx across the three inner passes by AztecBot · Pull Request #23737 · AztecProtocol/aztec-packages

AztecBot · 2026-05-30T02:46:36Z

What

Pure implementation-level optimization to the stream-walker accumulate kernel (barretenberg/ts/src/msm_webgpu/wgsl/cuzk/ba_stream_walker.template.wgsl).

Each inner-loop iteration walks the S slots three times (forward prefix → batched inversion → backward peel). The base code re-derived every slot's operands in all three passes: it loaded point_x/point_y/l0_index and computed dx = fr_sub(p_rx, p_lx) in the forward pass, reloaded and re-subtracted in the inverse pass, and reloaded a third time in the peel. The cursor isn't advanced and acc_x/acc_y aren't written until the peel, so the operands are identical across all three passes. This commit loads them once in the forward pass into per-slot private caches (op_lx/op_ly/op_rx/op_ry, dx_cache) and reuses them, removing ~1/3 of the kernel's global point reads and one field subtraction per slot per iteration.

Caches live in private registers (same class as the existing acc_x/acc_y) → no GPU memory change.
Algorithm unchanged / intended bit-identical: no GLV, no signed-digit change, batched-inversion math untouched. S=8, WALKER_TPB=64, pref_scratch (16 KB workgroup), window/bucket sizes and all buffers unchanged.
Confirmed a real change vs base: dx_cache is absent in stream-walker-impl@eaa6d3d and present after this commit.
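
Illustrative sketch (not the WGSL kernel): the three-pass flow described above can be modelled on the CPU to show why caching is safe. The version below is a simplified TypeScript model under stated assumptions: a toy prime field (P = 65537) instead of the curve's base field, a single thread, and a hypothetical `makeMemory` helper that only counts loads. The names `accumulateCached`, `makeMemory`, and `Slot` are invented for this sketch; only the cache idea (operands and dx loaded once in the forward pass, reused in the peel) comes from the description.

```typescript
// Toy prime field; the real kernel works in the curve's base field.
const P = 65537;
const mod = (a: number): number => ((a % P) + P) % P;
const sub = (a: number, b: number): number => mod(a - b);
const mul = (a: number, b: number): number => mod(a * b);

// Modular inverse via Fermat's little theorem (valid because P is prime).
function inv(a: number): number {
  let result = 1;
  let base = mod(a);
  for (let e = P - 2; e > 0; e = Math.floor(e / 2)) {
    if (e % 2 === 1) result = mul(result, base);
    base = mul(base, base);
  }
  return result;
}

interface Slot { lx: number; ly: number; rx: number; ry: number }

// Hypothetical stand-in for global memory that counts slot loads.
function makeMemory(slots: Slot[]) {
  const mem = {
    loads: 0,
    size: slots.length,
    load(i: number): Slot { mem.loads++; return slots[i]; },
  };
  return mem;
}

// Applies the affine chord-rule formulas to every slot with one shared
// inversion. Assumes dx != 0 for every slot (no doubling, no inverse points).
function accumulateCached(mem: ReturnType<typeof makeMemory>): Array<[number, number]> {
  const n = mem.size;
  const ops: Slot[] = [];       // per-slot operand cache
  const dxCache: number[] = []; // per-slot dx cache
  const prefix: number[] = [];  // running products of dx

  // Pass 1 (forward prefix): the only pass that touches memory.
  let running = 1;
  for (let i = 0; i < n; i++) {
    ops[i] = mem.load(i);
    dxCache[i] = sub(ops[i].rx, ops[i].lx);
    running = mul(running, dxCache[i]);
    prefix[i] = running;
  }

  // Pass 2 (batched inversion): one inversion covers all slots.
  let invRunning = inv(running);

  // Pass 3 (backward peel): recover each 1/dx from the caches, then finish.
  const out: Array<[number, number]> = [];
  for (let i = n - 1; i >= 0; i--) {
    const invDx = i === 0 ? invRunning : mul(invRunning, prefix[i - 1]);
    invRunning = mul(invRunning, dxCache[i]);
    const { lx, ly, rx, ry } = ops[i];
    const lambda = mul(sub(ry, ly), invDx);
    const x3 = sub(sub(mul(lambda, lambda), lx), rx);
    const y3 = sub(mul(lambda, sub(lx, x3)), ly);
    out[i] = [x3, y3];
  }
  return out;
}
```

In this model `mem.loads` equals the slot count after one call, whereas a variant that reloads in each pass would perform three loads per slot. The outputs match a per-slot inversion because nothing the passes read is modified before the peel writes its results.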

Validation status — NOT VALIDATED (blocked). Do not merge on perf claims.

Real-hardware validation could not be completed in this session. There are currently no trustworthy timing or correctness numbers for this PR:

Device correctness (noble cross-check): not obtained.
Apple / Adreno / Mali timing vs baseline: not obtained.

Blockers:

BrowserStack mobile (Android) workers created via the available MCP tooling are served as a non-WebGPU "Android Browser" (no navigator.gpu), so Mali/Adreno could not run the WebGPU harness at all.
The screenshot-based result readout (/5/worker/<id>/screenshot.json) did not yield a usable screenshot for the harness; the shared 2-seat pool was saturated and workers stayed queued.

Any Apple/Adreno/Mali timings or "cross-check GREEN" statements in earlier revisions of this description are retracted as unverified.

Candidate follow-up (NOT in this commit)

A second, independent pure-engineering lever was prototyped locally but is likewise unvalidated: move pref_scratch from var<workgroup> (16 KB) to per-invocation private storage — it is single-thread-owned (no cross-thread sharing, no barrier), so this frees the 16 KB workgroup allocation the swarm identified as the #1 occupancy bottleneck, at the cost of S*2 vec4 of private space per invocation. Bit-identical (memory placement only). Needs the same real-hardware validation.
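
Back-of-envelope check of the sizes quoted above, as a small TypeScript calculation. It assumes each vec4 has four 32-bit lanes (16 bytes) and that pref_scratch is split evenly across the workgroup's invocations; neither assumption is confirmed by the shader source here. Under those assumptions the per-invocation private cost and the 16 KB workgroup figure are consistent with each other.

```typescript
// Values stated in the PR description.
const S = 8;           // slots per invocation
const WALKER_TPB = 64; // invocations per workgroup

// Assumption: a vec4 holds four 32-bit lanes, i.e. 16 bytes.
const VEC4_BYTES = 4 * 4;

// Private space per invocation if pref_scratch moves out of workgroup memory.
const privateBytesPerInvocation = S * 2 * VEC4_BYTES;

// Workgroup memory freed per workgroup by the same move.
const workgroupBytesFreed = privateBytesPerInvocation * WALKER_TPB;

console.log(privateBytesPerInvocation); // 256
console.log(workgroupBytesFreed);       // 16384 (16 KB)
```

Whether 256 bytes of extra private storage per invocation stays in registers or spills is device-dependent, which is why this lever needs the same real-hardware validation.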

TODO before "done"

Noble cross-check GREEN on real hardware (small logN).
Baseline-vs-PR wall-clock timing on ≥1 Apple, ≥1 Adreno, ≥1 Mali device; no memory regression.
Obtain WebGPU-capable mobile BrowserStack access + a working result readout; paste real numbers.

Build: WGSL is inlined at build time; _generated/shaders.ts is regenerated and included in the diff.

Note: git add -A during PR creation also swept in two dev-harness scratch files (dev/msm-webgpu/bench-entry.ts, bench-banner.ts); they are measurement scaffolding, not part of the kernel change.

Commit d1aabfa: perf(bb/msm): stream-walker accumulate — cache operands/dx across the three inner passes

AztecBot added the claudebox label ("Owned by claudebox. it can push to this PR.") on May 30, 2026

AztecBot wants to merge 1 commit into stream-walker-impl from cb/sw-innerperf
