perf(bb/msm): stream-walker l0 load-reuse — portable Adreno+Mali win; register-budget boundary mapped (fused overshoots Adreno) by AztecBot · Pull Request #23730 · AztecProtocol/aztec-packages

AztecBot · 2026-05-30T00:20:31Z

Stream-walker l0 load-reuse — a portable, register-safe memory-latency win

TL;DR

The stream-walker is memory-bandwidth-bound on real mobile hardware. Caching each slot's packed l0_index handle once in the forward pass and reusing it (instead of re-issuing the dependent gather 3–4× per point) is a real, cross-vendor win — proven on both Adreno and Mali with drift-controlled in-session A/Bs. This PR ships that (reuseLoads, default on) and maps the exact register-budget boundary: pushing the reuse one step further (fusePeel — merge the inverse + backward-peel passes to also cut a point_x gather) overshoots Adreno's per-thread register file and is catastrophic there, while Mali tolerates it. So reuseLoads is the maximum register-safe reuse; fusePeel stays gated off.

reuseLoads caches only the 4-byte handle and re-reads point_x from the resolved index (a cache hit) — it adds zero net live per-thread state, which is why it stays under the occupancy cliff. Bit-identical arithmetic; neutral-to-positive everywhere measured.
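As a rough illustration of the load-reuse idea, the sketch below models the walker on the CPU and simply counts loads. It is not the WGSL kernel: `walkSlots`, `WalkStats`, and the fixed three-pass structure are assumptions made for this example (the description above says the gather is re-issued 3–4× per point; three passes are modelled here).

```typescript
// CPU-side model only: it counts loads, it is not the WGSL kernel.
// All identifiers below are hypothetical stand-ins for illustration.
interface WalkStats {
  l0Gathers: number;   // dependent l0_index gathers issued
  pointXReads: number; // point_x reads (same address in every pass)
  checksum: number;    // shows both variants consume identical data
}

const PASSES = 3; // assumed: forward product, inverse pass, backward peel

function walkSlots(l0Index: Uint32Array, pointX: Uint32Array, reuseLoads: boolean): WalkStats {
  const stats: WalkStats = { l0Gathers: 0, pointXReads: 0, checksum: 0 };
  const cachedHandle = new Uint32Array(l0Index.length); // one 4-byte handle per slot
  for (let pass = 0; pass < PASSES; pass++) {
    for (let slot = 0; slot < l0Index.length; slot++) {
      let handle: number;
      if (reuseLoads && pass > 0) {
        handle = cachedHandle[slot]; // reuse the handle loaded in the forward pass
      } else {
        handle = l0Index[slot]; // dependent gather
        stats.l0Gathers++;
        cachedHandle[slot] = handle;
      }
      stats.checksum += pointX[handle]; // point_x is re-read, not cached
      stats.pointXReads++;
    }
  }
  return stats;
}

const l0Index = new Uint32Array([2, 0, 3, 1]);
const pointX = new Uint32Array([10, 20, 30, 40]);
const baseStats = walkSlots(l0Index, pointX, false);
const reuseStats = walkSlots(l0Index, pointX, true);
```

In this model the reuse variant issues one l0_index gather per slot instead of three, while the point_x reads and the data consumed are unchanged, which mirrors the "bit-identical arithmetic" claim.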

Real-hardware A/B — Adreno (Galaxy S25 Ultra · Adreno 830 · Chrome 145), GPU wall time

Drift-controlled interleaved A/B (both variants in one page load, shared thermal state). Win grows with size — the memory-bound signature (reducing the dependent l0_index gather pays off once the working set exceeds cache):

logn	reuse	base	speedup
14	119.3 ms	132.5 ms	+9.5 %
16	232.9 ms	277.9 ms	+16.2 %

Re-confirmed this session with single-variant solo runs (n=20): reuse logn14 = 117.8 ms, logn16 = 231.2 ms — matches the interleaved numbers.

Real-hardware A/B — Mali (Pixel 9 Pro XL · Tensor G4 · Chrome 145), GPU wall time — the portability proof

Same harness, interleaved reuse,base,fused, n=12/variant, logn16. Wall time is the trustworthy metric on Android, where timestamp-query values are uncalibrated; the stream_walker phase delta is shown as corroboration and agrees in direction:

variant	median wall	Δ vs base (wall)	Δ vs base (stream_walker)
base	377.2 ms	—	—
reuse	368.7 ms	−2.3 %	−3.9 %
fused	364.3 ms	−3.4 %	−4.8 %

The load-reuse win generalizes to Mali (smaller magnitude than Adreno — different memory hierarchy/cache, but real and directionally consistent across both wall and the stream_walker phase). This was the missing portability proof.
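For reference, the wall-time deltas in the Mali table can be reproduced from the medians with a plain relative difference. This is a small sketch; `relDeltaPct` is a hypothetical helper name, and only the three medians come from the table.

```typescript
// Relative change of a variant against a reference, in percent.
// Negative means the variant is faster than the reference.
function relDeltaPct(variantMs: number, referenceMs: number): number {
  return ((variantMs - referenceMs) / referenceMs) * 100;
}

// Median wall times from the Mali table (logn16).
const maliMedianMs = { base: 377.2, reuse: 368.7, fused: 364.3 };

const reuseVsBase = relDeltaPct(maliMedianMs.reuse, maliMedianMs.base).toFixed(1);
const fusedVsBase = relDeltaPct(maliMedianMs.fused, maliMedianMs.base).toFixed(1);
const fusedVsReuse = relDeltaPct(maliMedianMs.fused, maliMedianMs.reuse).toFixed(1);
```

With these inputs the three values round to -2.3, -3.4, and -1.2, matching the wall column above and the fused−reuse figure quoted in the next section.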

The register-budget boundary — how far the reuse can be pushed (gap closed)

fusePeel extends the idea: merge the inverse pass into the backward peel so each slot's point_x is gathered 2× instead of 3×, and the per-slot inv_dx scratch round-trip vanishes. Expected: a small further win on top of reuse. On Mali it delivers exactly that and is the best variant (fused−reuse = −1.2 % wall). On Adreno it is catastrophic:

logn	reuse	fused	result
14	117.8 ms	1137.9 ms	~9.7× slower (n=20, tight 1122–1151 ms)
16	231.2 ms	—	GPU device lost (reproduced in 3 independent runs)

Why: the merged backward loop holds inv_dx, both points' x/y, lambda, and the result point live simultaneously (~8 field elements ≈ 64 registers/thread). That overshoots Adreno's small per-thread register file, causing register spill and occupancy collapse (the deterministic, low-variance ~10× slowdown at logn14); at logn16 the pressure is high enough that the device is lost entirely. This is the same register-file-competition failure mode that makes this PR non-composable with #23726, and the same class as the earlier x-coord value-cache dead-end (~4×). Mali's larger budget hides all of it.
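The register estimate above can be written out as a short tally. This sketch assumes each field element occupies eight 32-bit registers (consistent with the 32-byte x-coordinates mentioned in the commit notes); the names are illustrative only and the tally ignores loop counters and temporaries.

```typescript
// Assumption: one 256-bit field element = 8 x 32-bit limbs = 8 registers.
const REGISTERS_PER_FIELD_ELEMENT = 8;

// Values the merged backward loop keeps live at once, in field elements.
const liveInFusedLoop: { name: string; fieldElements: number }[] = [
  { name: "inv_dx", fieldElements: 1 },
  { name: "point A (x, y)", fieldElements: 2 },
  { name: "point B (x, y)", fieldElements: 2 },
  { name: "lambda", fieldElements: 1 },
  { name: "result point (x, y)", fieldElements: 2 },
];

const liveFieldElements = liveInFusedLoop.reduce((sum, v) => sum + v.fieldElements, 0);
const liveRegisters = liveFieldElements * REGISTERS_PER_FIELD_ELEMENT;
```

Under these assumptions the tally gives 8 field elements and 64 registers per thread, the figure quoted above.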

Boundary, stated precisely: caching only the 4-byte handle and re-reading point_x (reuseLoads) adds +0 net live per-thread registers → at/under Adreno's budget → the maximum register-safe reuse. Any extension that adds live per-thread field-element state — the x-coord value cache (4×) or the fusePeel merge (≈10× / device-loss) — exceeds Adreno's budget. The lever is exhausted at reuseLoads; fusePeel is kept as an A/B knob, default-off, documented Adreno-unsafe in MsmConfig.

Correctness (SwiftShader, headless, WASM-free Noble cross-check) — PASS

All three variants (base, reuse, fused) agree with @noble/curves at logn 10 and 12 (autorun=msm-ref-check; the in-tree bb wasm is a 213-byte stub, so Noble is the oracle). _generated/shaders.ts regenerated and in sync.

Harness notes

Adreno can't run the 3-variant interleaved A/B: the bench destroys+rebuilds the whole MsmV2 on every variant switch, and on Adreno the rebuild churn following a fused pipeline build reproducibly loses the device. The fused-on-Adreno characterization above therefore uses single-variant solo runs (one build, no rebuild). A proper fix (cache pipelines per variant — the ~49 MB SRS pool is shared, per-MsmV2 state is tiny) would enable interleaved A/B on Adreno too.
q-param URL workaround + page-load beacon (prior commits) are what make the runtime A/B drivable on real Android at all.
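The per-variant pipeline cache suggested in the first harness note might look roughly like the sketch below. It is not code from this PR: `PipelineCache`, `Variant`, and the builder callback are hypothetical, and a real version would hold the compiled WebGPU pipelines while sharing the SRS pool.

```typescript
// Hypothetical sketch: build each variant's pipelines once, then reuse them
// on every variant switch instead of destroying and rebuilding.
type Variant = "base" | "reuse" | "fused";

class PipelineCache<T> {
  builds = 0; // number of times the builder actually ran
  readonly cache = new Map<Variant, T>();
  readonly build: (variant: Variant) => T;

  constructor(build: (variant: Variant) => T) {
    this.build = build;
  }

  get(variant: Variant): T {
    let pipelines = this.cache.get(variant);
    if (pipelines === undefined) {
      pipelines = this.build(variant); // only on first use of this variant
      this.builds++;
      this.cache.set(variant, pipelines);
    }
    return pipelines;
  }
}

// Stand-in builder: a real one would compile shaders for the variant.
const pipelineCache = new PipelineCache<string>((v) => "pipelines:" + v);
const switchOrder: Variant[] = ["base", "reuse", "base", "fused", "reuse"];
const served = switchOrder.map((v) => pipelineCache.get(v));
```

Five variant switches trigger only three builds here. Whether that avoids the rebuild churn that loses the device on Adreno would still need to be measured.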

Recommendation

Ship reuseLoads (this PR's default). Proven win on Adreno (~10–16 %) and Mali (~2–4 %), register-safe on both, bit-identical.
Keep fusePeel off. Adreno-unsafe (≈10× / device-loss); on Mali it's only ~1 % better than reuseLoads — not worth a vendor-specific gate.
Do not compose with #23726 (perf(bb/msm): stream-walker pref_scratch → private memory (frees workgroup occupancy limiter), TPB 64→128). Its private pref_scratch on mobile is a separate register strategy, and combining the two gives a ~4× regression.

Base: stream-walker-impl.

…sion passes

Add ?autorun=msm-gpu-bench: drives the GPU MSM directly (no Run/WASM gate), captures per-phase GPU timestamps + allocated bytes. The Run-gated msm-bench stalls on devices without threaded-WASM/cross-origin isolation. Also trim the autorun SRS download to the chosen logn so mobile page-load does not stall on the full 2^20 (64 MB) prefix.
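To give a sense of how much the SRS trim saves, the sketch below sizes the prefix for a given logn. It assumes 64 bytes per point, inferred from the 2^20 prefix being about 64 MB, and uses an HTTP Range header purely as an illustration; how the autorun actually trims the download is not stated here, and both function names are hypothetical.

```typescript
// Assumption: 64 bytes per SRS point (2^20 points is roughly 64 MB).
const ASSUMED_BYTES_PER_POINT = 64;

// Bytes needed for the first 2^logn points.
function srsPrefixBytes(logn: number): number {
  return Math.pow(2, logn) * ASSUMED_BYTES_PER_POINT;
}

// Illustrative only: a Range header value covering exactly that prefix.
function srsRangeHeader(logn: number): string {
  return "bytes=0-" + (srsPrefixBytes(logn) - 1);
}

const fullPrefixBytes = srsPrefixBytes(20);
const logn16PrefixBytes = srsPrefixBytes(16);
const logn16Range = srsRangeHeader(16);
```

Under that assumption a logn16 run needs 4 MiB rather than the full 64 MiB, a 16× smaller download.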

…ate cache

The prior pass cached both the packed l0_index handles and the 32-byte x-coordinates (plx/prx, ~1 KB/thread private) across the three batch-inversion passes. First principles + the #23726 occupancy profiling show the x-coord cache is the wrong cost: the re-reads it eliminates are same-address cache hits (no DRAM bandwidth saved), and ~1 KB/thread of extra private state competes with the very occupancy that limits this kernel (pref_scratch→private, TPB 64->128 in #23726).

Cache only the 4-byte packed handles (l0a/l0b, 8 bytes/slot) so the dependent l0_index gather is issued once per point; re-read point_x from the cached handle in the inverse pass and backward peel (a cache hit, point index already resolved).

Bit-identical arithmetic; ~1 KB/thread less private state, composing cleanly with #23726. Cross-checked GREEN vs Noble at logn 10/11/12 (SwiftShader).

…o WebGPU probe

Adds a reuse_loads knob (MsmConfig.reuseLoads, ?reuse / ?variants) so one page load can build+time BOTH the pre-reuse baseline and the handle-only load-reuse variant on the same device/inputs: a controlled A/B that does not depend on a second BrowserStack session. msm-gpu-bench now interleaves variants (?rounds) and reports per-variant median/avg wall + per-phase GPU time plus the reuse-minus-base delta.

Fixes the BrowserStack Android caps: /5/worker selects the browser via `browser`, so real-Android Chrome needs browser:"chrome" (the prior browser:"android"+browserName:"chrome" launched the device default browser, labelled "Android Browser", and produced no telemetry).

Adds probe.html, a dependency-free (ES5 + XHR, no ESM) WebGPU capability probe that posts telemetry immediately so a mobile run can never come back as 'zero telemetry, no error row'.

Both walker variants cross-check GREEN vs Noble under SwiftShader at logn 10 and 12.
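The interleaving and per-variant median described in this commit can be sketched as follows. This is a minimal illustration, not the msm-gpu-bench code: `interleavedSchedule` and `median` are hypothetical helpers, and the timing samples are made up.

```typescript
// Run order for an interleaved A/B: every round visits each variant once,
// so all variants share the same thermal drift over the session.
function interleavedSchedule(variants: string[], rounds: number): string[] {
  const order: string[] = [];
  for (let r = 0; r < rounds; r++) {
    for (const v of variants) {
      order.push(v);
    }
  }
  return order;
}

// Median of a sample set; averages the two middle values for even counts.
function median(samples: number[]): number {
  const sorted = samples.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

const schedule = interleavedSchedule(["reuse", "base"], 2);
const oddMedian = median([3, 1, 2]);     // made-up samples
const evenMedian = median([4, 1, 3, 2]); // made-up samples
```

Reporting the median rather than the mean keeps a single slow rep (for example a shader JIT stall) from skewing the per-variant figure.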

… real Android

BrowserStack's real-mobile intent launch drops everything after the first unescaped '&', so the interleaved-variant bench URL (?autorun=msm-gpu-bench&logn=…&variants=reuse,base&rounds=…) arrived truncated to ?autorun=msm-gpu-bench, reverting logn to its default and pulling the full 2^20 SRS (64MB), which loses the mobile GPU device.

Pass a single percent-encoded q= and expand it into location.search before any module reads the query. Complements the runtime reuseLoads toggle so a single Android session can A/B both variants.
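The q= workaround can be illustrated with a round trip through `encodeURIComponent` and `decodeURIComponent`. This is a sketch under assumptions: `encodeAsSingleQ` and `expandQ` are hypothetical names, and the logn and rounds values are placeholders, not the ones used in the runs above.

```typescript
// Pack a multi-parameter query into one percent-encoded q= value so the
// URL contains no raw '&' for the intent launch to truncate at.
function encodeAsSingleQ(query: string): string {
  return "?q=" + encodeURIComponent(query);
}

// Inverse step, meant to run before any module reads the query string.
function expandQ(search: string): string {
  const prefix = "?q=";
  if (search.indexOf(prefix) !== 0) {
    return search; // not q-encoded, leave untouched
  }
  return "?" + decodeURIComponent(search.slice(prefix.length));
}

// Placeholder parameter values for illustration only.
const rawQuery = "autorun=msm-gpu-bench&logn=16&variants=reuse,base&rounds=12";
const encodedSearch = encodeAsSingleQ(rawQuery);
const expandedSearch = expandQ(encodedSearch);
```

Because `encodeURIComponent` escapes '&' as %26, the encoded form survives the truncation, and the expander restores the original parameters on the page.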

…int_x 3→2 gathers, register-free

The inverse pass and the backward peel were two separate backward loops over the S slots, and each gathered point_x. Merge them: one backward walk derives inv_dx[k] = inv * prefix[k-1] and consumes it immediately for the affine add, so each slot's point_x is gathered twice (forward product + fused peel) instead of three times. The per-slot inv_dx scratch round-trip is also removed.

No new per-thread private state, so it stays inside the Adreno register budget that the x-coord cache broke. Idle pad slots still peel their dx from the running inverse so the batched inversion stays correct.

Gated behind fusePeel (implies reuseLoads); A/B'd as the 'fused' variant. Cross-check GREEN vs noble at logn 10,12 on SwiftShader for base/reuse/fused.
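The difference between the separate and merged backward loops can be shown with a toy batch inversion. The sketch below is not the shader: it works over a small prime (10007) instead of the curve's base field, computes only the slope dy/dx as a stand-in for the affine add, and all function names are hypothetical.

```typescript
const P = 10007; // small prime stand-in for the real field modulus

function mulmod(a: number, b: number): number {
  return (a * b) % P;
}

// Modular inverse via Fermat's little theorem: a^(P-2) mod P.
function invmod(a: number): number {
  let result = 1;
  let base = a % P;
  let exp = P - 2;
  while (exp > 0) {
    if (exp % 2 === 1) result = mulmod(result, base);
    base = mulmod(base, base);
    exp = Math.floor(exp / 2);
  }
  return result;
}

// Forward pass: prefix[k] = dx[0] * ... * dx[k].
function prefixProducts(dx: number[]): { prefix: number[]; total: number } {
  const prefix: number[] = [];
  let total = 1;
  for (const d of dx) {
    total = mulmod(total, d);
    prefix.push(total);
  }
  return { prefix, total };
}

// Separate loops: write inv_dx to scratch, then consume it in a second walk.
function slopesTwoPass(dx: number[], dy: number[]): number[] {
  const { prefix, total } = prefixProducts(dx);
  let inv = invmod(total);
  const invDx = new Array<number>(dx.length);
  for (let k = dx.length - 1; k >= 0; k--) {
    invDx[k] = mulmod(inv, k > 0 ? prefix[k - 1] : 1);
    inv = mulmod(inv, dx[k]);
  }
  const slopes = new Array<number>(dx.length);
  for (let k = dx.length - 1; k >= 0; k--) {
    slopes[k] = mulmod(dy[k], invDx[k]);
  }
  return slopes;
}

// Fused loop: derive inv_dx[k] and use it immediately, no scratch array.
function slopesFused(dx: number[], dy: number[]): number[] {
  const { prefix, total } = prefixProducts(dx);
  let inv = invmod(total);
  const slopes = new Array<number>(dx.length);
  for (let k = dx.length - 1; k >= 0; k--) {
    const invDxK = mulmod(inv, k > 0 ? prefix[k - 1] : 1);
    slopes[k] = mulmod(dy[k], invDxK);
    inv = mulmod(inv, dx[k]);
  }
  return slopes;
}

const dxSample = [3, 5, 7, 11];
const dySample = [2, 4, 6, 8];
const twoPassSlopes = slopesTwoPass(dxSample, dySample);
const fusedSlopes = slopesFused(dxSample, dySample);
```

Both functions return the same slopes. The fused form drops the scratch array and one walk over the slots, but in the real kernel each iteration also has to hold the point coordinates and the result alongside inv_dx.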

…u-bench progress beacons

- run-browserstack.mjs: add --rounds/--variants and q-encode the query for real-mobile targets (the index.html q= expander already existed but the runner never used it, so multi-param mobile URLs truncated at the first &).
- main.ts msm-gpu-bench: map the 'fused' variant to reuse+fusePeel; post progress beacons during warmup and per rep so the runner's first-progress watchdog sees liveness through the (slow on mobile) shader JITs; report fused−base and fused−reuse wall deltas.
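The variant-to-flags mapping described in the second item might be expressed as below. `reuseLoads` and `fusePeel` are the knob names used in this PR; the `variantToFlags` function and the `WalkerFlags` shape are assumptions made for this sketch, not the actual main.ts code.

```typescript
type BenchVariant = "base" | "reuse" | "fused";

// Only the two knobs discussed in this PR; the real MsmConfig has more fields.
interface WalkerFlags {
  reuseLoads: boolean;
  fusePeel: boolean;
}

// 'fused' turns on both flags because fusePeel implies reuseLoads.
function variantToFlags(variant: BenchVariant): WalkerFlags {
  switch (variant) {
    case "base":
      return { reuseLoads: false, fusePeel: false };
    case "reuse":
      return { reuseLoads: true, fusePeel: false };
    case "fused":
      return { reuseLoads: true, fusePeel: true };
  }
}

const fusedFlags = variantToFlags("fused");
const reuseFlags = variantToFlags("reuse");
const baseFlags = variantToFlags("base");
```

Keeping the mapping in one place means a `variants=reuse,base,fused` list can be turned into configs without the bench having to know that fusePeel depends on reuseLoads.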

….6× logn14, device-loss logn16)

perf(bb/msm): stream-walker — cache point operands across batch-inversion passes

37e2623

AztecBot added the claudebox label (Owned by claudebox. it can push to this PR.) May 30, 2026

AztecBot added 2 commits May 30, 2026 01:46

AztecBot changed the title ~~perf(bb/msm): stream-walker — cache point operands across batch-inversion passes~~ perf(bb/msm): stream-walker — cache l0 handle across batch-inversion passes (occupancy-safe load reuse) May 30, 2026

AztecBot changed the title ~~perf(bb/msm): stream-walker — cache l0 handle across batch-inversion passes (occupancy-safe load reuse)~~ perf(bb/msm): stream-walker occupancy-safe load reuse + runtime A/B harness (mobile-WebGPU on BrowserStack settled) May 30, 2026

AztecBot changed the title ~~perf(bb/msm): stream-walker occupancy-safe load reuse + runtime A/B harness (mobile-WebGPU on BrowserStack settled)~~ perf(bb/msm): stream-walker l0 load-reuse — ~10–16% real-Adreno win (in-session A/B); must NOT be stacked with #23726 May 30, 2026

AztecBot added 2 commits May 30, 2026 08:55

docs(bb/msm): document fusePeel as Adreno register-unsafe (measured 9.6× logn14, device-loss logn16)

28247b7

AztecBot wants to merge 8 commits into stream-walker-impl from cb/msm-opt-coalesce-reuse-7k2p
