perf(bb/msm): stream-walker accumulate — cache operands/dx across the three inner passes#23737
Draft
AztecBot wants to merge 1 commit into
What
Pure implementation-level optimization to the stream-walker accumulate kernel (`barretenberg/ts/src/msm_webgpu/wgsl/cuzk/ba_stream_walker.template.wgsl`).

Each inner-loop iteration walks the S slots three times (forward prefix → batched inversion → backward peel). The base code re-derived every slot's operands in all three passes: it loaded `point_x`/`point_y`/`l0_index` and recomputed `dx = fr_sub(p_rx, p_lx)` in the forward pass, reloaded and re-subtracted in the inverse pass, and reloaded a third time in the peel. The cursor isn't advanced and `acc_x`/`acc_y` aren't written until the peel, so the operands are identical across all three passes. This commit loads them once in the forward pass into per-slot private caches (`op_lx`/`op_ly`/`op_rx`/`op_ry`, `dx_cache`) and reuses them, removing ~1/3 of the kernel's global point reads and one field subtraction per slot per iteration.

- acc_x/acc_y) → no GPU memory change.
- `S=8`, `WALKER_TPB=64`, `pref_scratch` (16 KB workgroup), window/bucket sizes and all buffers unchanged.
- `dx_cache` is absent in `stream-walker-impl@eaa6d3d` and present after this commit.

Validation status: NOT VALIDATED (blocked). Do not merge on perf claims.
Real-hardware validation could not be completed in this session. There are currently no trustworthy timing or correctness numbers for this PR:
Blockers:
- `navigator.gpu`), so Mali/Adreno could not run the WebGPU harness at all.
- /5/worker/<id>/screenshot.json) did not yield a usable screenshot for the harness; the shared 2-seat pool was saturated and workers stayed queued.

Any Apple/Adreno/Mali timings or "cross-check GREEN" statements in earlier revisions of this description are retracted as unverified.
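With hardware validation blocked, a CPU-side toy model may still help reviewers reason about why caching the operands should not change results. The sketch below walks the slots in three passes (forward prefix, batched inversion, backward peel) over a small prime field, once re-deriving operands in every pass and once reading them from caches filled in the forward pass. It is an illustration under stated assumptions, not the kernel: the prime `P`, the `Operands` shape, the `walk` function, and the affine-add formula used in the peel are all hypothetical stand-ins, and the real pass structure in `ba_stream_walker.template.wgsl` may differ.

```typescript
const P = 10007; // toy prime (assumption); products of two residues stay far below 2^53

const mod = (a: number): number => ((a % P) + P) % P;
const sub = (a: number, b: number): number => mod(a - b);
const mul = (a: number, b: number): number => mod(a * b);

// Field inverse via Fermat's little theorem: a^(P-2) mod P (square-and-multiply).
function inv(a: number): number {
  let result = 1;
  let base = mod(a);
  let e = P - 2;
  while (e > 0) {
    if (e % 2 === 1) result = mul(result, base);
    base = mul(base, base);
    e = Math.floor(e / 2);
  }
  return result;
}

// Hypothetical per-slot operands: left (lx, ly) and right (rx, ry) field elements.
interface Operands { lx: number; ly: number; rx: number; ry: number }
interface Point { x: number; y: number }

function walk(slots: Operands[], useCache: boolean): Point[] {
  const s = slots.length;
  const fetch = (i: number): Operands => slots[i]; // stands in for a global-memory load
  const opCache: Operands[] = new Array(s);
  const dxCache: number[] = new Array(s);
  const prefix: number[] = new Array(s);
  const dxInv: number[] = new Array(s);

  // Pass 1 (forward prefix): running product of dx; optionally fill the caches.
  let running = 1;
  for (let i = 0; i < s; i++) {
    const op = fetch(i);
    const dx = sub(op.rx, op.lx);
    if (useCache) { opCache[i] = op; dxCache[i] = dx; }
    prefix[i] = running; // product of dx[0..i-1]
    running = mul(running, dx);
  }

  // Pass 2 (batched inversion): a single field inversion, then peel out each 1/dx.
  let invRunning = inv(running);
  for (let i = s - 1; i >= 0; i--) {
    const dx = useCache ? dxCache[i] : sub(fetch(i).rx, fetch(i).lx);
    dxInv[i] = mul(invRunning, prefix[i]);
    invRunning = mul(invRunning, dx);
  }

  // Pass 3 (backward peel): apply the chord formula using the per-slot inverse.
  const out: Point[] = new Array(s);
  for (let i = s - 1; i >= 0; i--) {
    const op = useCache ? opCache[i] : fetch(i);
    const lambda = mul(sub(op.ry, op.ly), dxInv[i]);
    const x = sub(sub(mul(lambda, lambda), op.lx), op.rx);
    const y = sub(mul(lambda, sub(op.lx, x)), op.ly);
    out[i] = { x, y };
  }
  return out;
}

// Eight arbitrary slots (matching S=8); the values are not real curve points.
const slots: Operands[] = Array.from({ length: 8 }, (_, i) => ({
  lx: 3 + 7 * i, ly: 5 + 11 * i, rx: 1000 + 13 * i, ry: 2000 + 17 * i,
}));
const cached = walk(slots, true);
const uncached = walk(slots, false);
console.log(JSON.stringify(cached) === JSON.stringify(uncached)); // true
```

Nothing writes to `slots` between the passes, so both variants see the same operands, which mirrors the argument made for the kernel. The model says nothing about GPU timing or about correctness on real hardware.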
Candidate follow-up (NOT in this commit)
A second, independent pure-engineering lever was prototyped locally but is likewise unvalidated: move `pref_scratch` from `var<workgroup>` (16 KB) to per-invocation private storage. It is single-thread-owned (no cross-thread sharing, no barrier), so this frees the 16 KB workgroup allocation the swarm identified as the #1 occupancy bottleneck, at the cost of S*2 vec4 of private space per invocation. Bit-identical (memory placement only). Needs the same real-hardware validation.

TODO before "done"
Build: WGSL is inlined at build time; `_generated/shaders.ts` is regenerated and included in the diff.

Note: `git add -A` during PR creation also swept in two dev-harness scratch files (`dev/msm-webgpu/bench-entry.ts`, `bench-banner.ts`); they are measurement scaffolding, not part of the kernel change.
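The sizes quoted for the candidate follow-up can be cross-checked with a few lines of arithmetic. The sketch below assumes each vec4 holds 32-bit components (16 bytes) and that `pref_scratch` is sized as one S*2-vec4 region per thread in the workgroup; neither assumption is confirmed by this description, and the constant names other than `S` and `WALKER_TPB` are hypothetical.

```typescript
// Values taken from the PR description.
const S = 8;
const WALKER_TPB = 64;

// Assumptions for illustration: a vec4 of 32-bit components is 16 bytes,
// and each slot uses two vec4 entries ("S*2 vec4" per invocation).
const VEC4_BYTES = 16;
const VEC4_PER_SLOT = 2;

// Private space per invocation if pref_scratch moves out of workgroup memory.
const privateBytesPerInvocation = S * VEC4_PER_SLOT * VEC4_BYTES;

// Workgroup allocation when the same per-thread region is replicated for every thread.
const workgroupBytes = privateBytesPerInvocation * WALKER_TPB;

console.log(privateBytesPerInvocation); // 256
console.log(workgroupBytes); // 16384, i.e. 16 KiB
```

Under these assumptions the per-invocation cost is 256 bytes, and 64 threads of it account for exactly the 16 KB workgroup figure, so the two numbers in the description are at least mutually consistent.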