
feat(bb/msm): per-architecture autotuner for stream-walker TPB (Mali/Adreno/Apple) #23725

Draft
AztecBot wants to merge 6 commits into stream-walker-impl from claudebox/msm-opt-autotune-tpb-7k3x


@AztecBot AztecBot commented May 30, 2026

Summary

Speeds up the stream-walker's slow path by combining two correctness-neutral, memory-free levers: dx-caching, layered on top of the per-architecture TPB autotuner.

  1. dx-caching: the walker's batched-inversion inner loop touched each point's x-coordinate ~3× per add (forward prefix pass, inverse pass, backward peel), all as uncoalesced SRS reads. The forward pass now caches each slot's dx = p_rx − p_lx in private registers, and the inverse pass reuses it instead of reloading point_x and re-subtracting. This removes ~1/3 of the walker's uncoalesced x-coordinate reads at zero device/workgroup memory cost. The register cost is S×8 = 64 u32 per thread, and the swarm established that register pressure is not the walker's binding constraint.
  2. Per-architecture TPB autotuner: picks the largest threads-per-block (TPB) whose pref_scratch footprint fits the device's maxComputeWorkgroupStorageSize (TPB=128 on 32 KB Apple/Adreno, TPB=64 on 16 KB Mali). TPB only repacks a fixed set of NUM_THREADS = nwg×256 logical threads into workgroups. Each thread owns its own [local_id×S, +S) scratch slice, with no cross-thread sharing and no barrier, so changing TPB is correctness-neutral.

Both compose: dx-caching cuts per-add memory traffic; the autotuner raises occupancy/packing per arch.
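The TPB selection rule can be sketched as a small host-side function. This is a minimal illustration, not the PR's autotuner: the names `prefScratchBytes`, `pickTpb`, and `workgroupCount`, the candidate list, and the footprint formula (S = 8 slots of 8 u32 each per thread, inferred from the "S×8 = 64 u32" figure and the 32 KB / 16 KB examples) are all assumptions.

```typescript
// Hypothetical constants inferred from the figures quoted in this description.
const SLOTS_PER_THREAD = 8; // S (assumed from "S×8 = 64 u32")
const U32_PER_SLOT = 8;     // one field element as 8 u32 limbs (assumed)
const BYTES_PER_U32 = 4;

// Assumed pref_scratch footprint of one workgroup, in bytes.
function prefScratchBytes(tpb: number): number {
  return tpb * SLOTS_PER_THREAD * U32_PER_SLOT * BYTES_PER_U32;
}

// Pick the largest candidate TPB whose scratch fits the device limit.
function pickTpb(
  maxComputeWorkgroupStorageSize: number,
  candidates: number[] = [256, 128, 64, 32], // hypothetical candidate set
): number {
  const descending = [...candidates].sort((a, b) => b - a);
  for (const tpb of descending) {
    if (prefScratchBytes(tpb) <= maxComputeWorkgroupStorageSize) {
      return tpb;
    }
  }
  throw new Error("no candidate TPB fits the workgroup storage limit");
}

// TPB only repacks the same NUM_THREADS = nwg * 256 logical threads.
function workgroupCount(nwg: number, tpb: number): number {
  const numThreads = nwg * 256;
  if (numThreads % tpb !== 0) {
    throw new Error("TPB must divide the logical thread count");
  }
  return numThreads / tpb;
}
```

Under these assumptions a 32 KB limit selects TPB=128 and a 16 KB limit selects TPB=64, matching the values quoted above, while the total logical thread count is the same for either choice.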

dx-caching (the speed lever)

pref_scratch[k] holds the forward prefix product; the inverse pass peels it
back to per-slot 1/dx. It needs the same dx the forward pass computed — caching
dx (bit-identical) lets the inverse pass skip its point_x reload + subtraction.
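The idea can be sketched on the host with a batched inversion (Montgomery's trick) that stores dx during the forward pass and reuses it while peeling. This is an illustrative sketch only: it uses a toy prime modulus (65537) with plain numbers rather than BN254 limb arithmetic, and `batchInverseDx`, `inverse`, and `mod` are hypothetical names, not functions from the shader. It assumes every dx is nonzero.

```typescript
const P = 65537; // toy prime modulus for illustration, NOT the BN254 base field

const mod = (a: number): number => ((a % P) + P) % P;

// Modular inverse via Fermat's little theorem (valid for prime P, a != 0).
function inverse(a: number): number {
  let result = 1;
  let base = mod(a);
  let exp = P - 2;
  while (exp > 0) {
    if (exp % 2 === 1) result = mod(result * base);
    base = mod(base * base);
    exp = Math.floor(exp / 2);
  }
  return result;
}

// Returns 1/(rx[k] - lx[k]) for every slot using a single field inversion.
function batchInverseDx(lx: number[], rx: number[]): number[] {
  const n = lx.length;
  const dx: number[] = [];   // cached differences (the "private registers")
  const pref: number[] = []; // forward prefix products (the pref_scratch role)
  let acc = 1;
  for (let k = 0; k < n; k++) {
    const d = mod(rx[k] - lx[k]); // the only place x-coordinates are read
    dx.push(d);
    acc = mod(acc * d);
    pref.push(acc);
  }
  let invAcc = inverse(acc); // one inversion for the whole batch
  const out: number[] = new Array(n).fill(0);
  for (let k = n - 1; k >= 0; k--) {
    const before = k > 0 ? pref[k - 1] : 1;
    out[k] = mod(invAcc * before); // equals 1/dx[k]
    invAcc = mod(invAcc * dx[k]);  // reuse cached dx: no reload, no re-subtract
  }
  return out;
}
```

Without the `dx` array, the backward loop would have to recompute `rx[k] - lx[k]`, which on the GPU corresponds to the extra point_x reads that the change removes.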

_generated/shaders.ts is regenerated (inlined WGSL) and included in the diff.

Device-memory pref_scratch — explored, gated (honest finding)

The swarm flagged moving pref_scratch to device storage as the #1 occupancy lever. Wiring it revealed a real hardware ceiling: the walker already binds all 10 storage buffers, so a device-memory pref_scratch needs an 11th. SwiftShader and many mobile GPUs cap maxStorageBuffersPerShaderStage at 10, which rejects the pipeline layout.

This PR adds the walkerPrefDevice knob and a WGSL variant (binding 11, global t*S*2 slice), gated on maxStorageBuffersPerShaderStage ≥ 11. Below that limit it falls back to var<workgroup>, so the walker never fails to launch. Engaging it on the ≤10-buffer mobile targets first requires freeing a storage binding (e.g. by interleaving sorted_bucket_list and sorted_count_list, which share an index). That is tracked as a follow-up and not landed here, because it can't be perf-validated without a real-hardware seat.
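The gate described above amounts to a one-line decision on the reported limit. The following is a minimal sketch under stated assumptions: `choosePrefPlacement`, `PrefPlacement`, and `WALKER_STORAGE_BUFFERS` are hypothetical names, and only the count of 10 existing bindings and the ≥ 11 threshold come from this description.

```typescript
type PrefPlacement = "device" | "workgroup";

// The walker already binds 10 storage buffers (figure from this PR description).
const WALKER_STORAGE_BUFFERS = 10;

// Decide where pref_scratch lives. Device placement needs one extra storage
// binding, so it is only chosen when the knob is on AND the limit allows it.
function choosePrefPlacement(
  walkerPrefDevice: boolean,
  maxStorageBuffersPerShaderStage: number,
): PrefPlacement {
  const needed = WALKER_STORAGE_BUFFERS + 1;
  return walkerPrefDevice && maxStorageBuffersPerShaderStage >= needed
    ? "device"
    : "workgroup";
}
```

In a WebGPU host the limit would typically be read from `adapter.limits.maxStorageBuffersPerShaderStage`, and a value above the default has to be requested through `requiredLimits` when the device is created.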

Validation

  • Autotuner unit tests (host, no GPU): 10/10 PASS.
  • SwiftShader correctness (software WebGPU, GPU-less host) — cross-check vs @noble/curves BN254 Pippenger, GREEN at logn=8 and logn=10 for: workgroup pref_scratch at autotuned TPB=128, forced TPB=64, and the device-pref fallback path. SwiftShader reports maxStorageBuffersPerShaderStage=10, confirming the ceiling above.
  • tsc --noEmit clean for msm_webgpu/.

Real-hardware A/B — still outstanding (seat contention)

The two shared BrowserStack seats were continuously occupied by other agents' Mali (Pixel 9 Pro XL) and Adreno (Galaxy S25 Ultra) jobs throughout this session, despite repeated backoff and retry. The A/B was therefore not run, rather than displacing other agents' work. The harness is ready (?tpb/?prefdev in the xcheck page; --qp passthrough in run-browserstack; cloudflared installed). When a seat frees, the dx-caching delta is a direct like-for-like comparison against this PR's previously recorded macOS M2 TPB=128 stream_walker accumulate time of 33.1 ms (same code minus dx-caching):

# Apple (M2): dx-cache delta + TPB A/B
node dev/msm-webgpu/scripts/run-browserstack.mjs --page xcheck --target macos          --n 16 --reps 5 --qp "tpb=128"
node dev/msm-webgpu/scripts/run-browserstack.mjs --page xcheck --target macos          --n 16 --reps 5 --qp "tpb=64"
# Adreno (wall-clock; timestamp-queries unreliable) + Mali (TPB autotune differentiation)
node dev/msm-webgpu/scripts/run-browserstack.mjs --page xcheck --target s25-ultra      --n 16 --reps 5 --qp "tpb=128"
node dev/msm-webgpu/scripts/run-browserstack.mjs --page xcheck --target pixel-9-pro-xl --n 16 --reps 5 --qp "tpb=64"

Status

Draft. dx-caching, the autotuner, and the gated device-pref knob have landed and are SwiftShader-GREEN. Real-hardware Apple/Adreno/Mali numbers are the only outstanding item, blocked purely on shared-seat contention.

AztecBot added the claudebox label ("Owned by claudebox. it can push to this PR.") May 30, 2026
AztecBot added 5 commits May 30, 2026 00:12
… knob

Cache the forward-pass dx per slot (private registers, 0 device/workgroup
memory) and reuse it in the batched-inversion peel, so the inverse pass no
longer reloads point_x and re-subtracts — removing ~1/3 of the walker's
uncoalesced SRS x-coordinate reads. Correctness-neutral; the cached value is
bit-identical to the recomputed dx.

Add an opt-in walkerPrefDevice knob that places pref_scratch in a device
storage buffer (occupancy lever). The walker already binds 10 storage
buffers, so this needs maxStorageBuffersPerShaderStage >= 11; when the device
caps at 10 (SwiftShader, many mobile GPUs) it falls back to var<workgroup>
placement so the walker still launches. Wire ?tpb / ?prefdev into the xcheck
harness for real-device A/B.

Cross-check GREEN vs @noble/curves at logn=8,10 under SwiftShader for
workgroup (TPB 64/128) and device-fallback paths; autotune unit tests 10/10.