Skip to content

perf(bb/msm): stream-walker pref_scratch → private memory (frees workgroup occupancy limiter), TPB 64→128#23726

Draft
AztecBot wants to merge 5 commits into
stream-walker-implfrom
cb/msm-walker-device-prefscratch
Draft

perf(bb/msm): stream-walker pref_scratch → private memory (frees workgroup occupancy limiter), TPB 64→128#23726
AztecBot wants to merge 5 commits into
stream-walker-implfrom
cb/msm-walker-device-prefscratch

Conversation

@AztecBot
Copy link
Copy Markdown
Collaborator

@AztecBot AztecBot commented May 30, 2026

Summary

Profile-driven optimization of the stream-walker MSM accumulator. Apple-GPU profiling found the walker is threadgroup-memory occupancy-limited: its pref_scratch (the batched-inversion prefix-product scratch) was a 16 KB var<workgroup> array that capped resident workgroups per SM.

This PR moves pref_scratch to per-invocation var<private> memory (S*2 vec4 = 256 B/thread), freeing the workgroup allocation entirely with zero new storage bindings (a device-buffer 11th binding is rejected on SwiftShader/Mali/Adreno, which expose only the 10-per-stage floor). The placement and the threads-per-workgroup are now runtime-selectable (MsmConfig.walkerPrefPrivate / walkerTpb, surfaced on the dev bench as ?walkerpriv / ?walkertpb) so all variants A/B in a single deploy.

Real-hardware A/B — Apple M2 (BrowserStack macOS Sequoia · Chrome 148)

logn=17 (131,072 pts), 5 timed reps/variant, timestamp-query per-pass breakdown. stream_walker is ~77% of GPU time, so it dominates the wall.

Variant (M2) placement TPB stream_walker GPU wall vs baseline
wg64 (baseline) workgroup 16 KB 64 67.6 / 68.7 ms 90.8 ms
priv64 private 64 62.9 ms 84.9 ms −7.8% sw, −6.5% wall
priv128 (default) private 128 63.5 / 62.6 ms 84.5 ms −7 to −9% sw, −7% wall

Each headline cell reproduced across two independent page loads. The deltas (≈5 ms) are ~4× the per-variant stdev (~1 ms) — a clean, repeatable win.

Key finding: the lever is placement, not TPB

priv64 ≈ priv128 (62.9 vs ~63.0 ms stream_walker). The TPB 64→128 bump contributes nothing measurable on M2 — the entire ~8% win comes from removing the 16 KB var<workgroup> occupancy limiter. This matches first principles for a memory-latency-bound kernel: freeing the shared-memory budget lets more workgroups stay resident (hiding latency) regardless of per-workgroup thread count; the indirect dispatch launches the same total threads either way. TPB=128 is kept as the shipping default (it costs nothing on Apple and may help mobile).

Real-hardware A/B — Adreno (Galaxy S25) — pending, blocked on shared-seat contention

Not yet captured: BrowserStack's 2 shared seats have been continuously saturated by a concurrent Adreno campaign (the coop-walker work, #23739) for the duration of this session — my S25 workers expired in the queue (30-min session cap) before reaching a device. Re-running this is purely a matter of seat availability; the harness is ready and the exact A/B command is:

# wg64 baseline vs priv128 (default) vs priv64 — Android Chrome has no
# timestamp-query, so compare GPU wall (window.__lastWallMs, now wired up):
node dev/msm-webgpu/scripts/run-browserstack.mjs --target s25-ultra \
  --autorun msm-bench --n 16 --reps 5 --query walkerpriv=0 --query walkertpb=64
# …then walkerpriv=1 walkertpb=128, and walkerpriv=1 walkertpb=64

Expectation: the placement change is orthogonal to the coop-walker (#23739, −6% Adreno) — that PR changes the walker's cooperation pattern, this one changes scratch-memory placement and frees the occupancy limiter — so the two should compose. On Adreno/Mali the TPB knob may matter more than on Apple (smaller register files), which is exactly why walkerTpb is left runtime-tunable.

Correctness

Changes

  • msm_v2.tsWALKER_PREF_PRIVATE / WALKER_TPB promoted to MsmConfig.walkerPrefPrivate / walkerTpb, resolved in create() and threaded into the stream-walker + partition-task pipelines. Workgroup path clamps TPB≤64 (128 would need 32 KB workgroup memory, over the 16 KB floor).
  • wgsl/cuzk/ba_stream_walker.template.wgsl{{#pref_private}} mustache selects var<private> vs the original var<workgroup> scratch (slots stay per-thread-private; no barrier). _generated/shaders.ts regenerated.
  • dev/msm-webgpu/main.ts?walkerpriv/?walkertpb bench knobs; msm-bench autorun now posts /progress rows (init/warmup/per-rep) and exposes __lastWallMs so wall time is captured on adapters without timestamp-query (Android Chrome / Adreno / Mali).
  • dev/msm-webgpu/scripts/run-browserstack.mjs--query k=v passthrough for the A/B knobs.

Base: stream-walker-impl.

@AztecBot AztecBot added the claudebox Owned by claudebox. it can push to this PR. label May 30, 2026
@AztecBot AztecBot changed the title perf(bb/msm): device-memory coalesced pref_scratch for stream-walker (KNOB 1), TPB 64→128 perf(bb/msm): stream-walker pref_scratch → private memory (frees workgroup occupancy limiter), TPB 64→128 May 30, 2026
@AztecBot
Copy link
Copy Markdown
Collaborator Author

Continuation: correctness gate blocks the perf claim — root cause is a pre-existing walker race, not this PR

I picked this branch up to get the real-hardware A/B (Apple/Adreno/Mali) that the perf claim needs. Per the task's own rule (never claim a perf win without a passing cross-check), I first tried to get a green cross-check. It cannot be obtained on this branch today — and the reason is upstream of this PR.

What I ran

Local headless SwiftShader via the in-tree dev/msm-webgpu/msm-correctness-driver.mjs (noble @noble/curves oracle, synthesized points, no SRS/WASM needed). Repro:

cd barretenberg/ts
PW_EXECUTABLE=<chrome> node dev/msm-webgpu/msm-correctness-driver.mjs --logns 8,10

Findings (all reproduced, multiple runs)

  1. The stream-walker returns OFF-CURVE results at every size tested — logn = 8, 10, 12, 14, 16. gpu on-curve = false, not a sign/encoding issue (negation-of-ref = false). The noble reference self-check (msm() == naïve Σ) passes, and the result-decode path is the same one index.html uses, so this is a genuine compute bug, not a harness artifact.

  2. It is independent of this PR. Off-curve reproduces identically with WALKER_PREF_PRIVATE = true (private scratch, TPB 128 — this PR) and = false (the 16 KB var<workgroup> baseline, TPB 64). The pref_scratch placement change is innocent.

  3. It is a data race, and pre-existing. The in-tree A/B probe (walker bucket_sums vs the ba_stream_accum_debug reference) shows the walker miscomputes a small, varying set of buckets — 8 → 9 → 8 → 6 mismatches across repeated runs in a single process, and a given hot bucket's value changes run-to-run (e.g. 2415…7486…84f2…) while the debug reference is stable. 3+ distinct values for one bucket ⇒ torn writes: ≥2 threads writing the same bucket_sums[bucket_id] (4 separate vec4 stores) within the single walker dispatch → mixed x/y limbs → off-curve. The mismatching buckets are the hottest ones (those that span thread boundaries).

  4. The historic "cross-check GREEN at logn=8,10" (735d5aec) did not validate the walker via noble — at that commit the walker did zero work ([ALL Q] real=0 partial=0); the green was the V2 pair-tree / WASM path. With the harness on 735d5aec the walker shows the same non-deterministic mismatches. So the walker has, as far as I can tell, never passed a real oracle at these sizes.

Where the bug is (localized, not yet fixed)

The torn write means two threads store_bucket_sum the same bucket — i.e. two threads each believe they fully own a bucket that actually spans a thread boundary. By construction that requires overlapping thread ranges. I read ba_planner_partition_thread, ba_planner_partition_task, the walker boundary/retirement logic, and ba_walker_combine/ba_walker_partials_index; the partition looks non-overlapping on paper, so the overlap likely comes from an off-by-one in the adds-vs-points cut space or a stale/duplicated cumulative_adds value feeding the two binary searches. Separate compute passes have implicit WebGPU barriers, so it is not a missing inter-pass barrier — the race is inside the single walker dispatch.

Status / why I'm not pushing numbers

  • The pref_scratch → private + TPB 128 change is implemented, type-clean, compiles, and is bit-equivalent to the workgroup baseline (same off-curve output). Its design is sound.
  • I did not run BrowserStack: with 2 shared seats it's irresponsible to spend them benchmarking a kernel that is provably incorrect, and any timing would be meaningless without a passing cross-check.
  • Blocker (must be fixed first, likely outside this PR's scope): the pre-existing inter-thread bucket_sums write race in the stream-walker. Once the walker is correct (green noble cross-check at logn 8/10 on SwiftShader), the A/B time + peak-memory campaign on Apple/Adreno/Mali can proceed on top of it.

Next step I'd recommend: fix the partition/retirement so a boundary bucket is owned by exactly one writer (or route all retirements through the race-free indexed walker_combine via an atomic partial allocator), confirm green on SwiftShader, then benchmark this pref_scratch change against it.


Created by claudebox · group: aztec

AztecBot added a commit that referenced this pull request May 30, 2026
…ate cache

The prior pass cached both the packed l0_index handles and the 32-byte
x-coordinates (plx/prx, ~1 KB/thread private) across the three batch-inversion
passes. First principles + the #23726 occupancy profiling show the x-coord
cache is the wrong cost: the re-reads it eliminates are same-address cache hits
(no DRAM bandwidth saved), and ~1 KB/thread of extra private state competes
with the very occupancy that limits this kernel (pref_scratch→private, TPB
64->128 in #23726).

Cache only the 4-byte packed handles (l0a/l0b, 8 bytes/slot) so the dependent
l0_index gather is issued once per point; re-read point_x from the cached handle
in the inverse pass and backward peel (a cache hit, point index already
resolved). Bit-identical arithmetic; ~1 KB/thread less private state, composing
cleanly with #23726. Cross-checked GREEN vs Noble at logn 10/11/12 (SwiftShader).
AztecBot added 2 commits May 30, 2026 07:40
…-selectable for A/B

Promote WALKER_PREF_PRIVATE / WALKER_TPB from compile-time module consts to
MsmConfig knobs (walkerPrefPrivate, walkerTpb), resolved in MsmV2.create and
threaded into both the stream-walker and partition-task pipelines. The
workgroup-scratch path is clamped to TPB<=64 (128 needs 32 KB workgroup
memory, over the 16 KB floor). Exposed on the dev bench via ?walkerpriv and
?walkertpb, and run-browserstack.mjs gains a --query passthrough, so a single
deploy can A/B all variants (priv128 / priv64 / wg64) on one device.
…/per-rep)

The BrowserStack runner detects a run by its id appearing in the progress or
results JSONL. msm-bench previously posted only the final /results row, so the
runner's first-progress watchdog could fire during SRS download / pipeline
build before any row appeared. Emit an init row immediately and a row per rep
(carrying wall/gpu/stream_walker ms) so detection is prompt and stall
detection works mid-run.
AztecBot added a commit that referenced this pull request May 30, 2026
…arness fixes

- Add accum:'auto' (new default) resolving per-device via adapterInfo; gated
  behind COOP_AUTO_ON_STARVED_MOBILE (off) so it picks the kernel proven
  fastest on measurable hardware (walker) until a WebGPU-capable Android A/B
  proves coop's mobile niche. Documents why coop is, by analysis, dominated by
  the walker + #23726's var<private> occupancy lever.
- msm-accum-ab autorun: emit /progress heartbeats (boot-start, srs, build, rep)
  under one shared runId so the BrowserStack watchdog survives slow mobile SRS
  loads; add ?srs_logn=N to cap the SRS download.
…mp-query-less adapters

Android Chrome (Adreno/Mali) does not expose the timestamp-query feature, so
MsmV2.run() never populates __lastPhaseMs and the per-pass breakdown is empty.
The msm-bench loop also parsed wall time from a log line that isn't reliably
present per rep (it read 0). Expose the submit→readback wall measured in
runWebGpuOnce as window.__lastWallMs and have the bench prefer it, so wall
time is captured on every adapter — the only timing signal on Adreno/Mali.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

claudebox Owned by claudebox. it can push to this PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant