perf(bb/msm): arch-aware stream-walker TPB (128 on ≥32 KB devices, 64 on Mali)#23729

Draft
AztecBot wants to merge 2 commits into stream-walker-impl from cb/186c6fe939be

@AztecBot AztecBot commented May 30, 2026

Summary

The memory-light stream-walker accumulator hardcoded WALKER_TPB = 64 in msm_v2.ts, sizing its var<workgroup> pref_scratch to 16 KB so it always fits Mali Bifrost's 16 KB workgroup-storage cap. On Apple (TBDR) and Adreno — which grant 32 KB — that leaves half the workgroup-memory budget and a slice of occupancy unused, and is part of why the walker trails the V2 pair-tree near logn≈17.

This wires the walker's TPB to the device's granted maxComputeWorkgroupStorageSize (already requested at the adapter max in gpu.ts):

  • ≥ 32 KB (Apple / Adreno) → TPB = 128 (pref_scratch = 128·S·2·16 B = 32 KB at S=8)
  • otherwise (Mali Bifrost, 16 KB) → TPB = 64 (16 KB) — unchanged

The constants STREAM_WALKER_TPB / STREAM_WALKER_TPB_MALI_FALLBACK already lived in ba_stream_plan.ts for exactly this ("Mali targets … must drop this to 64 at adapter init") but were never read; this connects them.
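A minimal sketch of the selection rule described above, not the actual msm_v2.ts code. The constant names mirror those in ba_stream_plan.ts; `prefScratchBytes`, `pickWalkerTpb`, the `override` parameter (standing in for the walkerTpb MsmConfig field), and the bounds check on the override are all assumptions made for illustration.

```typescript
// Hypothetical sketch: names other than the two TPB constants are illustrative.
const STREAM_WALKER_TPB = 128;
const STREAM_WALKER_TPB_MALI_FALLBACK = 64;
const S = 8; // per-thread slot count used in the "at S=8" sizing
const BYTES_PER_ELEMENT = 16; // each pref_scratch element is 16 B

// pref_scratch footprint in bytes: TPB * S * 2 elements of 16 B each.
function prefScratchBytes(tpb: number, s: number = S): number {
  return tpb * s * 2 * BYTES_PER_ELEMENT;
}

// Pick the walker TPB from the granted workgroup-storage limit.
// `grantedBytes` would come from device.limits.maxComputeWorkgroupStorageSize.
function pickWalkerTpb(grantedBytes: number, override?: number): number {
  if (override !== undefined) {
    // Assumption: reject overrides whose scratch would exceed the device limit.
    if (prefScratchBytes(override) > grantedBytes) {
      throw new RangeError(`walkerTpb=${override} needs more than ${grantedBytes} B`);
    }
    return override;
  }
  return prefScratchBytes(STREAM_WALKER_TPB) <= grantedBytes
    ? STREAM_WALKER_TPB
    : STREAM_WALKER_TPB_MALI_FALLBACK;
}
```

With a 32 KB grant the 128-thread scratch fits exactly, so the comparison has to be `<=` rather than `<`.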

Changes

  • msm_v2.ts: arch-aware WALKER_TPB from device.limits.maxComputeWorkgroupStorageSize; new walkerTpb MsmConfig override.
  • dev/msm-webgpu/main.ts: ?walkertpb= URL knob so the bench can A/B 64 vs 128 on one device.
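For illustration, the URL knob could be read roughly as follows. This is a sketch, not the code in main.ts: `parseWalkerTpb` is a hypothetical helper, and restricting the accepted values to 64 and 128 is an assumption based on the A/B described above.

```typescript
// Hypothetical helper: returns the requested TPB, or undefined to keep the
// arch-aware default. Only 64 and 128 are accepted (an assumption).
function parseWalkerTpb(search: string): 64 | 128 | undefined {
  const raw = new URLSearchParams(search).get("walkertpb");
  if (raw === null) return undefined;
  const tpb = Number(raw);
  return tpb === 64 || tpb === 128 ? tpb : undefined;
}
```

In the browser this would be called as `parseWalkerTpb(window.location.search)` and the result passed through as the walkerTpb override.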

Why this is correctness-safe (structural argument)

In ba_stream_walker.template.wgsl the walker has no cross-thread sharing and no workgroup barriers (KNOB 1). pref_scratch is sized TPB * S * 2, and each thread touches only pref_scratch[local_id*S*2 .. +S*2); partial slots are 2*(t*S+k)+{0,1}, which are disjoint per thread; accumulators are private registers. partition_task and the walker are generated from the same WALKER_TPB, so nwg = ceil(num_active/TPB) stays consistent. Changing TPB therefore changes only (a) how the fixed NUM_THREADS=8192 threads pack into workgroups and (b) the symbolic pref_scratch length. It never changes the per-thread computation, so the result is bit-identical across TPB by construction.
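The indexing claim can be checked on the host with a small model of the slot layout. The sketch below is an assumption-laden illustration, not part of the shader or the test suite: `checkWalkerLayout` is a hypothetical name, and num_active defaults to NUM_THREADS purely for the example.

```typescript
const NUM_THREADS = 8192; // fixed thread count stated in the PR

// Model the per-thread slot formula 2*(t*S+k)+{0,1} for one workgroup and
// verify that every slot stays inside the thread's own range
// [t*S*2, t*S*2 + S*2) and is written by exactly one thread.
function checkWalkerLayout(
  tpb: number,
  s: number = 8,
  numActive: number = NUM_THREADS, // assumption for the example
): { nwg: number; scratchLen: number } {
  const scratchLen = tpb * s * 2;
  const owner = new Int32Array(scratchLen).fill(-1);
  for (let t = 0; t < tpb; t++) {
    const lo = t * s * 2;
    const hi = lo + s * 2;
    for (let k = 0; k < s; k++) {
      for (const half of [0, 1]) {
        const slot = 2 * (t * s + k) + half;
        if (slot < lo || slot >= hi) throw new Error(`slot ${slot} escapes thread ${t}`);
        if (owner[slot] !== -1) throw new Error(`slot ${slot} shared by two threads`);
        owner[slot] = t;
      }
    }
  }
  if (owner.includes(-1)) throw new Error("unowned slot in pref_scratch");
  return { nwg: Math.ceil(numActive / tpb), scratchLen };
}
```

Running it for 64 and 128 only changes the workgroup count and the scratch length; the per-thread range width S*2 is the same in both cases, which is the property the structural argument relies on.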

Memory

Workgroup-shared only; device buffers unchanged → still within the ≤100 MB budget up to n=2²⁰. 32 KB is exactly the Apple/Adreno cap and 2× Mali's, so the 64 fallback is required there (a 32 KB request fails device validation on Mali).

Validation status — full transparency

Type-check: no type errors from this change (the dev tsconfig's only complaints are pre-existing TS6059 rootDir structural notes affecting every src/ import).

SwiftShader local cross-check (logn 8/10): attempted, BLOCKED by the environment. The WebGPU harness cannot run in this container: the only available browser (Playwright's pre-installed /opt/ms-playwright/chromium-1148, launched via executablePath) does not expose navigator.gpu under any configuration tried — --enable-unsafe-webgpu with --use-vulkan=swiftshader / --use-angle=swiftshader, in both headless=old and headless=new under xvfb. navigator.gpu is undefined in every case (no working Dawn↔SwiftShader WebGPU backend in this build on a GPU-less host). yarn install, @noble/curves, and playwright-core all work; only the WebGPU adapter is missing. Correctness therefore rests on the structural argument above pending real-GPU confirmation.

BrowserStack device numbers: attempted, BLOCKED by seat contention. The shared BrowserStack pool (2 seats across ~10 concurrent agents) was at 2/2 for essentially the entire session (confirmed across ~90 min of escalating 4→10→30 min backoffs). When a seat momentarily freed, staging the cloudflared tunnel and completing the MCP worker handshake before another agent re-claimed the seat did not succeed within the worker-id window. The driver + tunnel are fully working (run-browserstack.mjs, cloudflared installed). One worker created during an attempt was pointed at a stale tunnel URL by mistake and was immediately deleted (browserstack_workers now returns [] — no leaked seats).

To complete the measurements (exact repro, when a seat is free)

cd barretenberg/ts
PATH="/tmp/bin:$PATH" node dev/msm-webgpu/scripts/run-browserstack.mjs \
  --target macos --page index --autorun msm-cross-check --n 10   # correctness
# then ?autorun=msm-bench and A/B ?walkertpb=64 vs 128 on macos / s25-ultra / pixel-9-pro-xl

No device numbers are reported here because none were obtained — I will not publish unverified figures.

Base: stream-walker-impl.

AztecBot added the claudebox label ("Owned by claudebox. it can push to this PR.") on May 30, 2026.