perf(bb/msm): stream-walker accumulate — cache operands/dx across the three inner passes#23737

Draft: AztecBot wants to merge 1 commit into stream-walker-impl from cb/sw-innerperf

AztecBot commented May 30, 2026

What

Pure implementation-level optimization to the stream-walker accumulate kernel (barretenberg/ts/src/msm_webgpu/wgsl/cuzk/ba_stream_walker.template.wgsl).

Each inner-loop iteration walks the S slots three times (forward prefix → batched inversion → backward peel). The base code re-derived every slot's operands in all three passes: the forward pass loaded point_x/point_y/l0_index and computed dx = fr_sub(p_rx, p_lx), the inverse pass reloaded the same values and repeated the subtraction, and the peel reloaded them a third time. Because the cursor is not advanced and acc_x/acc_y are not written until the peel, the operands are identical across all three passes. This commit loads them once in the forward pass into per-slot private caches (op_lx/op_ly/op_rx/op_ry, dx_cache) and reuses them in the later passes, removing ~1/3 of the kernel's global point reads and one field subtraction per slot per iteration.

  • The caches live in private (per-invocation) storage, the same class as the existing acc_x/acc_y, so no GPU buffer or workgroup memory allocation changes.
  • The algorithm is unchanged and the output is intended to be bit-identical: no GLV, no signed-digit change, batched-inversion math untouched. S=8, WALKER_TPB=64, pref_scratch (16 KB workgroup), window/bucket sizes and all buffers are unchanged.
  • Confirmed a real change vs base: dx_cache is absent in stream-walker-impl@eaa6d3d and present after this commit.
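
The load-once idea can be illustrated outside the shader. The following is a minimal TypeScript sketch of textbook batched (Montgomery) inversion, not the WGSL kernel itself. The toy modulus `P`, the `SlotOperands` shape, the `load` callback, and `batchedSlopes` are all hypothetical names chosen for illustration, and the pass structure is simplified (the kernel's inverse pass may walk the slots differently). It shows operands and `dx` being read once in the forward pass and reused in the peel.

```typescript
// Toy prime modulus for illustration only (assumption); the kernel uses its own base field.
const P = 1000000007n;

const mod = (a: bigint): bigint => ((a % P) + P) % P;
const sub = (a: bigint, b: bigint): bigint => mod(a - b);
const mul = (a: bigint, b: bigint): bigint => mod(a * b);

// Modular inverse via Fermat's little theorem: a^(P-2) mod P, valid because P is prime.
function inv(a: bigint): bigint {
  let result = 1n;
  let base = mod(a);
  let e = P - 2n;
  while (e > 0n) {
    if ((e & 1n) === 1n) result = mul(result, base);
    base = mul(base, base);
    e >>= 1n;
  }
  return result;
}

// Hypothetical per-slot operands: left point (lx, ly) and right point (rx, ry).
interface SlotOperands { lx: bigint; ly: bigint; rx: bigint; ry: bigint; }

// Computes the affine-addition slope (ry - ly) / (rx - lx) for every slot
// with a single field inversion. Assumes rx != lx for every slot.
function batchedSlopes(load: (slot: number) => SlotOperands, S: number): bigint[] {
  const ops = new Array<SlotOperands>(S);
  const dxCache = new Array<bigint>(S);
  const prefix = new Array<bigint>(S);

  // Pass 1 (forward prefix): the only place operands are loaded.
  let running = 1n;
  for (let i = 0; i < S; i++) {
    ops[i] = load(i);
    dxCache[i] = sub(ops[i].rx, ops[i].lx);
    prefix[i] = running; // product of dx[0..i-1]
    running = mul(running, dxCache[i]);
  }

  // Pass 2 (batched inversion): one inversion of the product of all dx.
  let invRunning = inv(running);

  // Pass 3 (backward peel): recover each 1/dx from cached values, no reloads.
  const slopes = new Array<bigint>(S);
  for (let i = S - 1; i >= 0; i--) {
    const dxInv = mul(invRunning, prefix[i]);
    invRunning = mul(invRunning, dxCache[i]);
    slopes[i] = mul(sub(ops[i].ry, ops[i].ly), dxInv);
  }
  return slopes;
}
```

In this sketch `load` is called exactly S times per iteration; a variant that re-read operands in the later passes would call it more often while producing the same slopes, which is the property the PR relies on.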

Validation status — NOT VALIDATED (blocked). Do not merge on perf claims.

Real-hardware validation could not be completed in this session. There are currently no trustworthy timing or correctness numbers for this PR:

  • Device correctness (noble cross-check): not obtained.
  • Apple / Adreno / Mali timing vs baseline: not obtained.

Blockers:

  • BrowserStack mobile (Android) workers created via the available MCP tooling are served as a non-WebGPU "Android Browser" (no navigator.gpu), so Mali/Adreno could not run the WebGPU harness at all.
  • The screenshot-based result readout (/5/worker/<id>/screenshot.json) did not yield a usable screenshot for the harness; the shared 2-seat pool was saturated and workers stayed queued.
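
To make the first blocker fail fast instead of surfacing as a missing result, the harness could run a preflight check before attempting any GPU work. The sketch below is an assumption about how such a check might look; `NavigatorLike` and `webgpuBlocker` are hypothetical names and are not part of the existing harness. It only tests for the presence of `navigator.gpu`, which is the condition reported above.

```typescript
// Minimal structural type so the sketch compiles without WebGPU typings (assumption).
interface NavigatorLike {
  gpu?: unknown;
  userAgent?: string;
}

// Returns a human-readable reason when WebGPU is unavailable, or null when
// navigator.gpu is present. It does not request an adapter or device.
function webgpuBlocker(nav: NavigatorLike): string | null {
  if (nav.gpu === undefined || nav.gpu === null) {
    return `navigator.gpu is missing (userAgent: ${nav.userAgent ?? "unknown"})`;
  }
  return null;
}
```

In a browser this would be called as `webgpuBlocker(navigator)`, and a non-null result could be written into the page so that it is visible in whatever readout the worker provides.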

Any Apple/Adreno/Mali timings or "cross-check GREEN" statements in earlier revisions of this description are retracted as unverified.

Candidate follow-up (NOT in this commit)

A second, independent pure-engineering lever was prototyped locally but is likewise unvalidated: move pref_scratch from var<workgroup> (16 KB) to per-invocation private storage. The scratch is single-thread-owned (no cross-thread sharing, no barrier), so the move frees the 16 KB workgroup allocation that the swarm identified as the #1 occupancy bottleneck, at the cost of S*2 vec4 of private space per invocation. The change affects memory placement only, so results should be bit-identical. It needs the same real-hardware validation.
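
The sizes quoted above are consistent with each other, as the following arithmetic sketch shows. It assumes each scratch element is a 16-byte vec4 of 32-bit values (for example vec4<u32>), which the description does not state explicitly; `S` and `WALKER_TPB` are taken from the description.

```typescript
const S = 8;               // slots per invocation (from the PR description)
const WALKER_TPB = 64;     // invocations per workgroup (from the PR description)
const VEC4_BYTES = 16;     // assumption: vec4 of 32-bit components, 4 x 4 bytes

const vec4PerInvocation = S * 2;                                   // S*2 vec4
const privateBytesPerInvocation = vec4PerInvocation * VEC4_BYTES;  // per-thread cost
const workgroupBytes = privateBytesPerInvocation * WALKER_TPB;     // current allocation

console.log({ vec4PerInvocation, privateBytesPerInvocation, workgroupBytes });
```

Under that assumption the 16 KB workgroup array corresponds to 256 bytes per invocation, which is the private-space cost the follow-up would trade for the freed workgroup allocation.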

TODO before "done"

  • Noble cross-check GREEN on real hardware (small logN).
  • Baseline-vs-PR wall-clock timing on ≥1 Apple, ≥1 Adreno, ≥1 Mali device; no memory regression.
  • Obtain WebGPU-capable mobile BrowserStack access + a working result readout; paste real numbers.

Build: WGSL is inlined at build time; _generated/shaders.ts is regenerated and included in the diff.

Note: git add -A during PR creation also swept in two dev-harness scratch files (dev/msm-webgpu/bench-entry.ts, bench-banner.ts); they are measurement scaffolding, not part of the kernel change.

AztecBot added the claudebox label (Owned by claudebox. it can push to this PR.) on May 30, 2026