feat(bb/msm): per-architecture autotuner for stream-walker TPB (Mali/Adreno/Apple) by AztecBot · Pull Request #23725 · AztecProtocol/aztec-packages

AztecBot · 2026-05-30T00:07:03Z

Summary

Speeds up the stream-walker (the slow path) by combining two correctness-neutral levers that add no device or workgroup memory: dx-caching and the per-architecture TPB autotuner.

dx-caching — the walker's batched-inversion inner loop touched each point's x-coordinate ~3× per add (forward prefix pass, inverse pass, backward peel), all uncoalesced SRS reads. The forward pass now caches each slot's dx = p_rx − p_lx in private registers; the inverse pass reuses it instead of reloading point_x and re-subtracting. This removes ~1/3 of the walker's uncoalesced x-coordinate reads at zero device/workgroup memory cost (S×8 = 64 u32 of registers/thread; the swarm established register pressure is not the walker's binding constraint).
Per-architecture TPB autotuner — picks the largest threads-per-block whose pref_scratch footprint fits the device's maxComputeWorkgroupStorageSize (TPB=128 on 32 KB Apple/Adreno, 64 on 16 KB Mali). TPB only repacks a fixed set of NUM_THREADS = nwg×256 logical threads into workgroups; each thread owns its own [local_id×S, +S) scratch slice with no cross-thread sharing and no barrier, so it is correctness-neutral.

Both compose: dx-caching cuts per-add memory traffic; the autotuner raises occupancy/packing per arch.
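The autotuner rule described above can be sketched as follows. This is a minimal illustration, not the PR's code: the function names, the candidate TPB list, and the per-thread footprint of S×8 u32 (S = 8 slots of 8 u32 limbs, i.e. 256 bytes per thread) are assumptions chosen so that the result matches the TPB values quoted in the summary.

```typescript
// Hypothetical sketch of the TPB selection rule; all names are illustrative.
const SLOTS_PER_THREAD = 8; // S (assumed from "S×8 = 64 u32")
const U32_PER_SLOT = 8;     // assumed: one field element stored as 8 u32 limbs
const BYTES_PER_U32 = 4;
const CANDIDATE_TPBS = [256, 128, 64, 32]; // assumed candidate set, largest first

// Workgroup-memory footprint of pref_scratch for a given threads-per-block.
function prefScratchBytes(tpb: number): number {
  return tpb * SLOTS_PER_THREAD * U32_PER_SLOT * BYTES_PER_U32;
}

// Pick the largest candidate whose footprint fits the device limit.
function pickWalkerTpb(maxComputeWorkgroupStorageSize: number): number {
  for (const tpb of CANDIDATE_TPBS) {
    if (prefScratchBytes(tpb) <= maxComputeWorkgroupStorageSize) {
      return tpb;
    }
  }
  // Nothing fits: fall back to the smallest candidate (assumed policy).
  return CANDIDATE_TPBS[CANDIDATE_TPBS.length - 1];
}

// TPB only repacks a fixed number of logical threads into workgroups.
function workgroupCount(numThreads: number, tpb: number): number {
  return Math.ceil(numThreads / tpb);
}
```

Under these assumptions a 32 KB limit yields TPB=128 and a 16 KB limit yields TPB=64, while the total number of logical threads stays the same and only the workgroup count changes.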

dx-caching (the speed lever)

pref_scratch[k] holds the forward prefix product; the inverse pass peels it
back to per-slot 1/dx. It needs the same dx the forward pass computed — caching
dx (bit-identical) lets the inverse pass skip its point_x reload + subtraction.
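To make the idea concrete, here is a small host-side sketch of batched inversion with and without a cached dx. It is an illustration only: it uses a toy prime field (P = 65521) in place of the BN254 base field, plain number arrays in place of SRS buffers, and a hypothetical xReads counter to stand in for uncoalesced x-coordinate loads. None of these names come from the PR.

```typescript
// Toy prime field standing in for the real base field (assumption).
const P = 65521;

function mod(a: number): number {
  return ((a % P) + P) % P;
}

function mulmod(a: number, b: number): number {
  return (a * b) % P; // operands < 2^16, so the product is exact in a double
}

function powmod(base: number, exp: number): number {
  let result = 1;
  let b = mod(base);
  let e = exp;
  while (e > 0) {
    if (e % 2 === 1) result = mulmod(result, b);
    b = mulmod(b, b);
    e = Math.floor(e / 2);
  }
  return result;
}

// Batched inversion of dx[k] = rx[k] - lx[k] using one field inversion.
// When cacheDx is true the inverse pass reuses the forward-pass dx
// instead of reloading both x-coordinates and re-subtracting.
function batchInverseDx(
  lx: number[],
  rx: number[],
  cacheDx: boolean
): { invDx: number[]; xReads: number } {
  const n = lx.length;
  let xReads = 0;
  const loadDx = (k: number): number => {
    xReads += 2; // one read per x-coordinate
    return mod(rx[k] - lx[k]);
  };

  const dxCache: number[] = [];
  const prefix: number[] = new Array<number>(n);
  let acc = 1;
  for (let k = 0; k < n; k++) { // forward prefix pass
    const dx = loadDx(k);
    if (cacheDx) dxCache[k] = dx;
    acc = mulmod(acc, dx);
    prefix[k] = acc;
  }

  let running = powmod(acc, P - 2); // single inversion via Fermat
  const invDx: number[] = new Array<number>(n);
  for (let k = n - 1; k >= 0; k--) { // inverse pass peels one slot at a time
    const before = k > 0 ? prefix[k - 1] : 1;
    invDx[k] = mulmod(running, before);
    const dx = cacheDx ? dxCache[k] : loadDx(k);
    running = mulmod(running, dx);
  }
  return { invDx, xReads };
}
```

Both modes return the same inverses, since the cached dx is the same value the reload would recompute; only the number of coordinate reads differs. The sketch assumes every dx is nonzero.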

_generated/shaders.ts is regenerated (inlined WGSL) and included in the diff.

Device-memory pref_scratch — explored, gated (honest finding)

The swarm flagged moving pref_scratch to device storage as the #1 occupancy lever. Wiring it revealed a real hardware ceiling: the walker already binds all 10 storage buffers, so a device-memory pref_scratch needs an 11th. SwiftShader and many mobile GPUs cap maxStorageBuffersPerShaderStage at 10, which rejects the pipeline layout.

This PR adds the walkerPrefDevice knob + WGSL variant (binding 11, global t*S*2 slice) but gates it on maxStorageBuffersPerShaderStage ≥ 11, falling back to var<workgroup> otherwise so it never breaks. To engage it on the ≤10-buffer mobile targets a storage binding must first be freed (e.g. interleaving sorted_bucket_list+sorted_count_list, which share an index) — tracked as follow-up; not landed because it can't be perf-validated without a real-hardware seat.
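The gating rule can be sketched as a small pure function. The function and type names below are hypothetical; only the walkerPrefDevice knob, the maxStorageBuffersPerShaderStage limit, and the threshold of 11 come from the description above.

```typescript
// Hypothetical sketch of the device-pref gate; names are illustrative.
type PrefScratchPlacement = "workgroup" | "device";

// The walker already binds 10 storage buffers, so a device-memory
// pref_scratch needs an 11th (figure taken from the PR description).
const REQUIRED_STORAGE_BUFFERS = 11;

function choosePrefScratchPlacement(
  walkerPrefDevice: boolean,
  maxStorageBuffersPerShaderStage: number
): PrefScratchPlacement {
  if (walkerPrefDevice && maxStorageBuffersPerShaderStage >= REQUIRED_STORAGE_BUFFERS) {
    return "device";
  }
  // Knob off, or the device caps below 11: keep var<workgroup> placement
  // so the pipeline layout is never rejected.
  return "workgroup";
}
```

With a limit of 10, as SwiftShader reports, the knob has no effect and the walker stays on the workgroup path.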

Validation

Autotuner unit tests (host, no GPU): 10/10 PASS.
SwiftShader correctness (software WebGPU, GPU-less host) — cross-check vs @noble/curves BN254 Pippenger, GREEN at logn=8 and logn=10 for: workgroup pref_scratch at autotuned TPB=128, forced TPB=64, and the device-pref fallback path. SwiftShader reports maxStorageBuffersPerShaderStage=10, confirming the ceiling above.
tsc --noEmit clean for msm_webgpu/.

Real-hardware A/B — still outstanding (seat contention)

The 2 shared BrowserStack seats were continuously saturated by other agents' Mali (Pixel 9 Pro XL) and Adreno (Galaxy S25 Ultra) jobs throughout this session, despite repeated backoff and retry. The A/B was deliberately not run rather than displace other agents' work. The harness is ready (?tpb/?prefdev in the xcheck page; --qp passthrough in run-browserstack; cloudflared installed). When a seat frees, the dx-caching delta is a direct like-for-like comparison against this PR's previously recorded macOS M2 TPB=128 stream_walker accumulate of 33.1 ms (same code minus dx-caching):

# Apple (M2): dx-cache delta + TPB A/B
node dev/msm-webgpu/scripts/run-browserstack.mjs --page xcheck --target macos          --n 16 --reps 5 --qp "tpb=128"
node dev/msm-webgpu/scripts/run-browserstack.mjs --page xcheck --target macos          --n 16 --reps 5 --qp "tpb=64"
# Adreno (wall-clock; timestamp-queries unreliable) + Mali (TPB autotune differentiation)
node dev/msm-webgpu/scripts/run-browserstack.mjs --page xcheck --target s25-ultra      --n 16 --reps 5 --qp "tpb=128"
node dev/msm-webgpu/scripts/run-browserstack.mjs --page xcheck --target pixel-9-pro-xl --n 16 --reps 5 --qp "tpb=64"

Status

Draft. dx-caching, the autotuner, and the gated device-pref knob have landed and are SwiftShader-GREEN. Real-hardware Apple/Adreno/Mali numbers are the only outstanding item, blocked purely on shared-seat contention.

… knob

Cache the forward-pass dx per slot (private registers, 0 device/workgroup memory) and reuse it in the batched-inversion peel, so the inverse pass no longer reloads point_x and re-subtracts, removing ~1/3 of the walker's uncoalesced SRS x-coordinate reads. Correctness-neutral; the cached value is bit-identical to the recomputed dx.

Add an opt-in walkerPrefDevice knob that places pref_scratch in a device storage buffer (occupancy lever). The walker already binds 10 storage buffers, so this needs maxStorageBuffersPerShaderStage >= 11; when the device caps at 10 (SwiftShader, many mobile GPUs) it falls back to var<workgroup> placement so the walker still launches.

Wire ?tpb / ?prefdev into the xcheck harness for real-device A/B.

Cross-check GREEN vs @noble/curves at logn=8,10 under SwiftShader for workgroup (TPB 64/128) and device-fallback paths; autotune unit tests 10/10.

feat(bb/msm): per-architecture autotuner for stream-walker TPB (Mali/Adreno/Apple)

a49a75a

AztecBot added the claudebox label (owned by claudebox; it can push to this PR) on May 30, 2026

AztecBot added 5 commits May 30, 2026 00:12

feat(bb/msm): surface autotuner decision + walkertpb A/B knob in dev harness

fee98d1

update PR #23725

96a4f75

test(bb/msm): WASM-free WebGPU xcheck harness (SwiftShader + BrowserStack mobile)

6b05d3f

test(bb/msm): --qp passthrough for run-browserstack A/B (forced TPB / device pref)

b24ed00

AztecBot wants to merge 6 commits into stream-walker-impl from claudebox/msm-opt-autotune-tpb-7k3x