perf(bb/msm): stream-walker l0 load-reuse — portable Adreno+Mali win; register-budget boundary mapped (fused overshoots Adreno)#23730

Draft
AztecBot wants to merge 8 commits into stream-walker-impl from cb/msm-opt-coalesce-reuse-7k2p
@AztecBot AztecBot commented May 30, 2026

Stream-walker l0 load-reuse — a portable, register-safe memory-latency win

TL;DR

The stream-walker is memory-bandwidth-bound on real mobile hardware. Caching each slot's packed l0_index handle once in the forward pass and reusing it, instead of re-issuing the dependent gather 3–4× per point, is a measurable cross-vendor win, confirmed on both Adreno and Mali with drift-controlled in-session A/Bs. This PR ships that change (reuseLoads, default on) and maps the register-budget boundary: pushing the reuse one step further with fusePeel (merging the inverse and backward-peel passes to also cut a point_x gather) overshoots Adreno's per-thread register file and is catastrophic there, while Mali tolerates it. reuseLoads is therefore the maximum register-safe reuse, and fusePeel stays gated off.

reuseLoads caches only the 4-byte handle and re-reads point_x from the resolved index (a cache hit), so it adds zero net live per-thread state, which is why it stays under the occupancy cliff. The arithmetic is bit-identical, and the change is neutral-to-positive on every device measured.
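A toy CPU model can illustrate the access pattern. This is a sketch only, not the shader: `CountingTable`, `sumBase`, and `sumReuse` are hypothetical names, the three passes are reduced to a running sum, and the only thing modelled is how many times the dependent l0_index lookup is issued.

```typescript
// Toy CPU model (NOT the WGSL kernel). All names are illustrative assumptions.
// It counts how often the dependent l0_index lookup is issued per variant.

class CountingTable {
  reads = 0;
  private readonly data: Uint32Array;
  constructor(data: Uint32Array) {
    this.data = data;
  }
  get(i: number): number {
    this.reads++;
    return this.data[i];
  }
}

const PASSES = 3; // stand-ins for forward product, inverse pass, backward peel

// Baseline: every pass re-resolves slot -> handle before reading point_x.
function sumBase(l0Index: CountingTable, pointX: Uint32Array, slots: number): number {
  let acc = 0;
  for (let pass = 0; pass < PASSES; pass++) {
    for (let s = 0; s < slots; s++) {
      const handle = l0Index.get(s); // dependent gather, repeated in each pass
      acc = (acc + pointX[handle]) >>> 0;
    }
  }
  return acc;
}

// Handle reuse: resolve once in the first pass and keep only the 4-byte handle.
// point_x itself is still re-read in every pass, as in reuseLoads.
function sumReuse(l0Index: CountingTable, pointX: Uint32Array, slots: number): number {
  const handles = new Uint32Array(slots);
  let acc = 0;
  for (let pass = 0; pass < PASSES; pass++) {
    for (let s = 0; s < slots; s++) {
      if (pass === 0) handles[s] = l0Index.get(s);
      acc = (acc + pointX[handles[s]]) >>> 0;
    }
  }
  return acc;
}
```

In this model both functions return the same value, while the reuse variant issues one index lookup per slot instead of one per slot per pass. It says nothing about register allocation or cache behaviour on a real GPU.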

Real-hardware A/B — Adreno (Galaxy S25 Ultra · Adreno 830 · Chrome 145), GPU wall time

Drift-controlled interleaved A/B (both variants in one page load, shared thermal state). The win grows with size, which is the memory-bound signature: reducing the dependent l0_index gather pays off once the working set exceeds cache.

| logn | reuse | base | speedup |
|---|---|---|---|
| 14 | 119.3 ms | 132.5 ms | +9.5 % |
| 16 | 232.9 ms | 277.9 ms | +16.2 % |

Re-confirmed this session with single-variant solo runs (n=20): reuse takes 117.8 ms at logn 14 and 231.2 ms at logn 16, matching the interleaved numbers.

Real-hardware A/B — Mali (Pixel 9 Pro XL · Tensor G4 · Chrome 145), GPU wall time — the portability proof

Same harness, with the variants reuse, base, and fused interleaved, n=12 per variant, logn 16. Wall time is the trustworthy metric on Android because the timestamp-query values are uncalibrated; the stream_walker phase delta is shown as corroboration and agrees in direction:

| variant | median wall | Δ vs base (wall) | Δ vs base (stream_walker) |
|---|---|---|---|
| base | 377.2 ms | | |
| reuse | 368.7 ms | −2.3 % | −3.9 % |
| fused | 364.3 ms | −3.4 % | −4.8 % |

The load-reuse win generalizes to Mali. The magnitude is smaller than on Adreno (different memory hierarchy and cache), but it is real and directionally consistent across both wall time and the stream_walker phase. This was the missing portability proof.
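For readers who want to re-derive the wall-time deltas, the sketch below assumes they are computed as (variant − base) / base on the per-variant median. The PR does not state the formula; this assumption happens to reproduce the Mali wall column. `median` and `deltaPct` are hypothetical helper names.

```typescript
// Sketch under an assumed formula: delta = (variant - base) / base, on medians.

function median(samples: number[]): number {
  if (samples.length === 0) throw new Error("median of empty sample");
  const sorted = samples.slice().sort((a, b) => a - b);
  const mid = sorted.length >> 1;
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

function deltaPct(variantMs: number, baseMs: number): number {
  return ((variantMs - baseMs) / baseMs) * 100;
}

// Median wall times reported for Mali at logn 16.
const maliWall = { base: 377.2, reuse: 368.7, fused: 364.3 };

const reuseVsBase = deltaPct(maliWall.reuse, maliWall.base).toFixed(1);   // "-2.3"
const fusedVsBase = deltaPct(maliWall.fused, maliWall.base).toFixed(1);   // "-3.4"
const fusedVsReuse = deltaPct(maliWall.fused, maliWall.reuse).toFixed(1); // "-1.2"
```

In a real run, `median` would be applied to the n=12 per-variant wall samples before calling `deltaPct`; the raw samples are not included in the PR, so only the reported medians are used here.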

The register-budget boundary — how far the reuse can be pushed (gap closed)

fusePeel extends the idea: merge the inverse pass into the backward peel so each slot's point_x is gathered 2× instead of 3×, and the per-slot inv_dx scratch round-trip vanishes. Expected: a small further win on top of reuse. On Mali it delivers exactly that and is the best variant (fused−reuse = −1.2 % wall). On Adreno it is catastrophic:

| logn | reuse | fused | result |
|---|---|---|---|
| 14 | 117.8 ms | 1137.9 ms | ~9.7× slower (n=20, tight 1122–1151 ms) |
| 16 | 231.2 ms | GPU device lost | reproduced in 3 independent runs |

Why: the merged backward loop holds inv_dx, both points' x/y, lambda, and the result point live simultaneously (~8 field elements ≈ 64 registers/thread). That overshoots Adreno's small per-thread register file, causing register spill and occupancy collapse (the deterministic, low-variance ~10× slowdown at logn14); at logn16 the same pressure loses the device entirely. This is the same register-file-competition failure mode that makes this PR non-composable with #23726, and the same class as the earlier x-coord value-cache dead-end (~4×). Mali's larger budget hides all of it.
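The "~8 field elements ≈ 64 registers" estimate can be reproduced with simple arithmetic. The sketch assumes 32-byte field elements (the size the commit messages give for x-coordinates) held in 32-bit registers, and ignores compiler temporaries; the labels in `liveFieldElements` are illustrative, not shader identifiers.

```typescript
// Back-of-envelope estimate. Assumptions: 32-byte field elements, 32-bit
// registers, no compiler temporaries counted.
const BYTES_PER_FIELD_ELEMENT = 32;
const BYTES_PER_REGISTER = 4;
const REGISTERS_PER_FIELD_ELEMENT = BYTES_PER_FIELD_ELEMENT / BYTES_PER_REGISTER; // 8

// Field elements live at once in the merged backward loop, as listed above.
const liveFieldElements: Record<string, number> = {
  inv_dx: 1,
  point_a_xy: 2,
  point_b_xy: 2,
  lambda: 1,
  result_xy: 2,
};

function liveRegisters(live: Record<string, number>): number {
  let elements = 0;
  for (const name of Object.keys(live)) elements += live[name];
  return elements * REGISTERS_PER_FIELD_ELEMENT;
}

const fusedRegisters = liveRegisters(liveFieldElements); // 8 elements * 8 = 64
```

Adreno's actual per-thread register budget is not stated in this PR, so the sketch only reproduces the live-state side of the comparison.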

Boundary, stated precisely: caching only the 4-byte handle and re-reading point_x (reuseLoads) adds +0 net live per-thread registers, so it stays at or under Adreno's budget and is the maximum register-safe reuse. Any extension that adds live per-thread field-element state, whether the x-coord value cache (~4× slower) or the fusePeel merge (≈10× slower at logn14, device loss at logn16), exceeds Adreno's budget. The lever is exhausted at reuseLoads; fusePeel is kept as an A/B knob, default-off, and documented as Adreno-unsafe in MsmConfig.
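The difference between the separate and merged backward loops can be sketched on the CPU with the standard batch-inversion (Montgomery) trick. This is an illustration under stated assumptions: a toy prime 65537 instead of the real field, plain numbers instead of multi-limb field elements, and the slope lambda = dy / dx standing in for the full affine add. `slopesSeparate` and `slopesFused` are hypothetical names.

```typescript
// Toy batch inversion over a small prime (NOT the kernel's field arithmetic).
const P = 65537; // products of two residues stay far below 2^53

const mul = (a: number, b: number): number => (a * b) % P;

// Modular inverse via Fermat's little theorem: a^(P-2) mod P.
function modInv(a: number): number {
  let result = 1;
  let base = a % P;
  let e = P - 2;
  while (e > 0) {
    if (e & 1) result = mul(result, base);
    base = mul(base, base);
    e >>>= 1;
  }
  return result;
}

// Forward pass: prefix[k] = dx[0] * ... * dx[k].
function prefixProducts(dx: number[]): number[] {
  const prefix: number[] = [];
  let acc = 1;
  for (const d of dx) {
    acc = mul(acc, d);
    prefix.push(acc);
  }
  return prefix;
}

// Separate loops: the inverse pass writes inv_dx to scratch, the peel reads it.
function slopesSeparate(dx: number[], dy: number[]): number[] {
  if (dx.length === 0) return [];
  const prefix = prefixProducts(dx);
  const invDx = new Array<number>(dx.length);
  let inv = modInv(prefix[dx.length - 1]);
  for (let k = dx.length - 1; k >= 0; k--) {
    invDx[k] = mul(inv, k > 0 ? prefix[k - 1] : 1);
    inv = mul(inv, dx[k]);
  }
  const out = new Array<number>(dx.length);
  for (let k = dx.length - 1; k >= 0; k--) out[k] = mul(dy[k], invDx[k]);
  return out;
}

// Merged loop: inv_dx[k] is derived and consumed in the same iteration,
// so the scratch array disappears but more values are live at once.
function slopesFused(dx: number[], dy: number[]): number[] {
  if (dx.length === 0) return [];
  const prefix = prefixProducts(dx);
  const out = new Array<number>(dx.length);
  let inv = modInv(prefix[dx.length - 1]);
  for (let k = dx.length - 1; k >= 0; k--) {
    const invDxK = mul(inv, k > 0 ? prefix[k - 1] : 1);
    out[k] = mul(dy[k], invDxK);
    inv = mul(inv, dx[k]);
  }
  return out;
}
```

Both functions produce the same slopes; the merged form removes the scratch round-trip at the cost of keeping the running inverse, the derived inverse, and the operands of the consumer live in one loop body, which is the trade-off the boundary above describes.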

Correctness (SwiftShader, headless, WASM-free Noble cross-check) — PASS

All three variants (base, reuse, fused) agree with @noble/curves at logn 10 and 12 (autorun=msm-ref-check; the in-tree bb wasm is a 213-byte stub, so Noble is the oracle). _generated/shaders.ts regenerated and in sync.

Harness notes

  • Adreno can't run the 3-variant interleaved A/B: the bench destroys and rebuilds the whole MsmV2 on every variant switch, and on Adreno the rebuild churn following a fused pipeline build reproducibly loses the device. The fused-on-Adreno characterization above therefore uses single-variant solo runs (one build, no rebuild). A proper fix would cache pipelines per variant (the ~49 MB SRS pool is shared and per-MsmV2 state is tiny), which would enable interleaved A/B on Adreno too.
  • The q-param URL workaround and page-load beacon (prior commits) are what make the runtime A/B drivable on real Android at all.
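The per-variant caching idea from the first note could look roughly like the sketch below. It is not part of this PR: `VariantCache` and the `build` callback are hypothetical, and the cached value is left generic rather than tied to MsmV2 or any WebGPU type. Whether avoiding rebuilds actually prevents the Adreno device loss is untested.

```typescript
// Hypothetical build-once-per-variant cache; not code from this PR.
type Variant = "base" | "reuse" | "fused";

class VariantCache<T> {
  private readonly built = new Map<Variant, T>();
  private readonly build: (variant: Variant) => T;

  constructor(build: (variant: Variant) => T) {
    this.build = build;
  }

  // Build on first request, then hand back the same instance on every switch.
  get(variant: Variant): T {
    if (!this.built.has(variant)) {
      this.built.set(variant, this.build(variant));
    }
    return this.built.get(variant) as T;
  }
}
```

With such a cache, an interleaved schedule like reuse, base, fused repeated over several rounds would trigger one build per variant instead of one per switch.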

Recommendation

Base: stream-walker-impl.

AztecBot added the claudebox label (Owned by claudebox. it can push to this PR.) May 30, 2026
AztecBot added 2 commits May 30, 2026 01:46
Add ?autorun=msm-gpu-bench: drives the GPU MSM directly (no Run/WASM gate),
captures per-phase GPU timestamps + allocated bytes. The Run-gated msm-bench
stalls on devices without threaded-WASM/cross-origin isolation. Also trim the
autorun SRS download to the chosen logn so mobile page-load does not stall on
the full 2^20 (64 MB) prefix.
…ate cache

The prior pass cached both the packed l0_index handles and the 32-byte
x-coordinates (plx/prx, ~1 KB/thread private) across the three batch-inversion
passes. First principles + the #23726 occupancy profiling show the x-coord
cache is the wrong cost: the re-reads it eliminates are same-address cache hits
(no DRAM bandwidth saved), and ~1 KB/thread of extra private state competes
with the very occupancy that limits this kernel (pref_scratch→private, TPB
64->128 in #23726).

Cache only the 4-byte packed handles (l0a/l0b, 8 bytes/slot) so the dependent
l0_index gather is issued once per point; re-read point_x from the cached handle
in the inverse pass and backward peel (a cache hit, point index already
resolved). Bit-identical arithmetic; ~1 KB/thread less private state, composing
cleanly with #23726. Cross-checked GREEN vs Noble at logn 10/11/12 (SwiftShader).
AztecBot changed the title from "perf(bb/msm): stream-walker — cache point operands across batch-inversion passes" to "perf(bb/msm): stream-walker — cache l0 handle across batch-inversion passes (occupancy-safe load reuse)" May 30, 2026
…o WebGPU probe

Adds a reuse_loads knob (MsmConfig.reuseLoads, ?reuse / ?variants) so one page
load can build+time BOTH the pre-reuse baseline and the handle-only load-reuse
variant on the same device/inputs — a controlled A/B that does not depend on a
second BrowserStack session. msm-gpu-bench now interleaves variants (?rounds)
and reports per-variant median/avg wall + per-phase GPU time plus the
reuse-minus-base delta.

Fixes the BrowserStack Android caps: /5/worker selects the browser via
`browser`, so real-Android Chrome needs browser:"chrome" (the prior
browser:"android"+browserName:"chrome" launched the device default browser,
labelled "Android Browser", and produced no telemetry).

Adds probe.html, a dependency-free (ES5 + XHR, no ESM) WebGPU capability probe
that posts telemetry immediately so a mobile run can never come back as 'zero
telemetry, no error row'. Both walker variants cross-check GREEN vs Noble under
SwiftShader at logn 10 and 12.
AztecBot changed the title from "perf(bb/msm): stream-walker — cache l0 handle across batch-inversion passes (occupancy-safe load reuse)" to "perf(bb/msm): stream-walker occupancy-safe load reuse + runtime A/B harness (mobile-WebGPU on BrowserStack settled)" May 30, 2026
… real Android

BrowserStack's real-mobile intent launch drops everything after the first
unescaped '&', so the interleaved-variant bench URL (?autorun=msm-gpu-bench&
logn=…&variants=reuse,base&rounds=…) arrived truncated to ?autorun=msm-gpu-bench
— reverting logn to its default and pulling the full 2^20 SRS (64MB), which
loses the mobile GPU device. Pass a single percent-encoded q= and expand it
into location.search before any module reads the query. Complements the runtime
reuseLoads toggle so a single Android session can A/B both variants.
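A minimal sketch of the q= round trip described in this commit message follows. `encodeAsQ` and `expandQ` are hypothetical names, the logn and rounds values in the usage are placeholders, and how the expanded string is written back into location.search is left out.

```typescript
// Sketch of the q= workaround; function names and example values are assumptions.

// Runner side: fold the whole query into one percent-encoded parameter so the
// launch URL contains no unescaped '&'.
function encodeAsQ(query: string): string {
  return "?q=" + encodeURIComponent(query);
}

// Page side: if a q= parameter is present, expand it back into a normal query
// string; otherwise leave the search string untouched.
function expandQ(search: string): string {
  const body = search.charAt(0) === "?" ? search.slice(1) : search;
  for (const pair of body.split("&")) {
    const eq = pair.indexOf("=");
    if (eq > 0 && pair.slice(0, eq) === "q") {
      return "?" + decodeURIComponent(pair.slice(eq + 1));
    }
  }
  return search;
}

// Placeholder values for logn and rounds, for illustration only.
const exampleQuery = "autorun=msm-gpu-bench&logn=16&variants=reuse,base&rounds=12";
const launchSearch = encodeAsQ(exampleQuery);
const expandedSearch = expandQ(launchSearch);
```

Because `encodeURIComponent` escapes `&`, `=`, and `,`, the launch URL carries a single parameter, and the page recovers the original multi-parameter query before any module reads it.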
AztecBot changed the title from "perf(bb/msm): stream-walker occupancy-safe load reuse + runtime A/B harness (mobile-WebGPU on BrowserStack settled)" to "perf(bb/msm): stream-walker l0 load-reuse — ~10–16% real-Adreno win (in-session A/B); must NOT be stacked with #23726" May 30, 2026
AztecBot added 2 commits May 30, 2026 08:55
…int_x 3→2 gathers, register-free

The inverse pass and the backward peel were two separate backward loops over
the S slots, and each gathered point_x. Merge them: one backward walk derives
inv_dx[k] = inv * prefix[k-1] and consumes it immediately for the affine add,
so each slot's point_x is gathered twice (forward product + fused peel) instead
of three times. The per-slot inv_dx scratch round-trip is also removed. No new
per-thread private state, so it stays inside the Adreno register budget that the
x-coord cache broke. Idle pad slots still peel their dx from the running inverse
so the batched inversion stays correct. Gated behind fusePeel (implies
reuseLoads); A/B'd as the 'fused' variant. Cross-check GREEN vs noble at logn
10,12 on SwiftShader for base/reuse/fused.
…u-bench progress beacons

- run-browserstack.mjs: add --rounds/--variants and q-encode the query for
  real-mobile targets (the index.html q= expander already existed but the
  runner never used it, so multi-param mobile URLs truncated at the first &).
- main.ts msm-gpu-bench: map the 'fused' variant to reuse+fusePeel; post
  progress beacons during warmup and per rep so the runner's first-progress
  watchdog sees liveness through the (slow on mobile) shader JITs; report
  fused−base and fused−reuse wall deltas.
AztecBot changed the title from "perf(bb/msm): stream-walker l0 load-reuse — ~10–16% real-Adreno win (in-session A/B); must NOT be stacked with #23726" to "perf(bb/msm): stream-walker load-reuse (~10–16% Adreno) + fused inverse/peel (gather 3→2); register-budget levers, do NOT stack with #23726" May 30, 2026
AztecBot changed the title from "perf(bb/msm): stream-walker load-reuse (~10–16% Adreno) + fused inverse/peel (gather 3→2); register-budget levers, do NOT stack with #23726" to "perf(bb/msm): stream-walker l0 load-reuse — portable Adreno+Mali win; register-budget boundary mapped (fused overshoots Adreno)" May 30, 2026