fix(bb/msm): correct MsmV2 multi-batch (numBatches>1) + honest peak-memory map#23733
Draft
AztecBot wants to merge 12 commits into
Draft
fix(bb/msm): correct MsmV2 multi-batch (numBatches>1) + honest peak-memory map#23733AztecBot wants to merge 12 commits into
AztecBot wants to merge 12 commits into
Conversation
…trade harness Adds the instrumentation the self-review needs to turn the budget knob from an unmeasured no-op into a measured memory/time trade: - mem-accounting.mjs: replicates ensureScratch sizing exactly (no GPU) to regenerate the per-buffer peak map + numBatches lever curve. Corrects the prior table: it omitted reducePrefScratch (~5.6MB @logn20) and undercounted the bucket lists, so the fully-batched floor is ~104MB scratch / ~168MB total @logn20, not the previously-claimed 92.5/156. - MsmV2.batchCount getter + dev-page ?membudget knob + msm-membudget-sweep autorun: rebuilds MsmV2 across budgets at fixed logN on a real device, times GPU wall ms per numBatches, and cross-checks correctness vs noble.
…er floor Script-verified (mem-accounting.mjs) corrections to the peak-memory map: - adds reducePrefScratch (~5.6MB @logn20) the prior table omitted - fully-batched floor @logn20 is ~104MB scratch / ~168MB total, not 92.5/156 - scalarsRawBuf (32MB) is the largest scratch buffer and is NOT batch-dependent - documents that the ≤100MB-to-2^20 goal is unreachable by batching alone, and that the lever is a memory/time trade (decompose re-reads all n scalars per batch); time curve pending the real-hardware sweep
estimateMem omitted reducePrefScratch (~5.6MB @logn20), planMeta, and streamPlannerMeta, so a caller's memBudgetBytes silently under-bounded the true peak by ~6MB. Hoist the (data-independent) reduction MAXC above the budget fit-check and fold those buffers into fixedScratch. Default MEM_BUDGET (248MB) still dwarfs the ~172MB logn20 estimate, so the default numBatches is unchanged (the fix only affects callers who set a tight budget). Also documents on MsmConfig.memBudgetBytes that the lever is a measured memory/time trade.
…e HW) macOS Sequoia / Chrome 148, logn16, median-of-5 GPU wall, noble cross-check passed at every nb: nb1 47.4MB/50.7ms -> nb2 39.1MB/+18.5% -> nb5 34.1MB/+77.5%. Monotonic memory drop, steep+accelerating time cost, diminishing returns past nb2-3. Conclusion: the host-buffer memory lever is exhausted (no free over-provisioning; batching is a deliberate trade), default no-op is correct, remaining time-neutral cuts need WGSL changes.
The membudget sweep only cross-checked the first (nb=1) result against noble, so multi-batch (nb>1) correctness was never actually verified — the whole point of the lever is that nb>1 stays correct. Record a per-row crossOk for every budget and fail the run if any nb disagrees.
- macOS cross-check is at nb=1 only (correct the earlier 'every nb' wording); nb>1 is the same kernels over disjoint window slices, all-nb sweep queued. - Add Galaxy S25 Ultra (Adreno) trade row: same monotonic time/memory shape. - Flag that the Android nb=1 MSM disagreed with noble (bases self-verified OK, so it's an Adreno MSM-compute issue, not this PR's buffer sizing) — pre-existing, independent of this change.
…orrect
Per-nb cross-check on real macOS hardware (Chrome 148, logn16): nb=1 matches
@noble/curves but nb=2,3,4,5,10 ALL disagree. This refutes the lever's premise
('no correctness change, same path as the logn20 default'). Root cause is host
side: planner/walker bind groups + params are built once with no per-batch
bucket offset, so the batchBuckets-vs-bTotal index spaces only coincide at nb=1.
Blast radius: the same multi-batch code is the wgFits-forced default at logn19
(nb=2) and logn20 (nb=3), so MsmV2 is very likely incorrect by default at
logn>=19 (never cross-checked there — noble too slow). Distinct from the
invisible bucket-0 issue in PR #23741.
Warns on MsmConfig.memBudgetBytes and documents the evidence + root-cause
direction in MSM_V2_MEMORY.md. Does not change runtime behavior (default budget
stays a no-op); the fix to the multi-batch path is required follow-up.
decompose applies batch_window_base only to the scalar-bit read (w_global), but writes local-window-indexed output (idx = w*input_size+p); the bucketResult- writing kernels (ba_size1/ba_stream_walker/ba_walker_combine) never re-apply a batch_window_base*BW offset, so nb>1 batches overwrite the low batchBuckets region of bucketResult instead of filling disjoint slices. Records the fix direction.
The numBatches budget lever (and the wgFits-forced default at logn19 nb=2 / logn20 nb=3) returned the wrong MSM: every batch's LOCAL CSR was written to the same [0, batchBuckets) region of bucketResult instead of its disjoint global slice, and the per-batch walker scratch (bucketHead/walkerNodeCounter/taskCuts/ threadCuts) was reset only once before the loop, so batch>0 either walked batch0's stale linked-list heads or overflowed max_nodes and dropped its partials. Fix: - Thread a per-batch bucket_base = bi*batchBuckets into the three bucket-writing kernels (ba_size1 .y, ba_stream_walker .w, ba_walker_combine .z). The CSR / partials / linked-list index spaces stay LOCAL; only the final bucket_sums write adds the base. ba_stream_walker derives M_partials in-shader (2*NUM_THREADS*S) to free params.w with no extra binding. - Build one size1/walker/combine bind group per batch (bucket_base uniform), matching the existing per-batch decomposeBinds. - Move the per-batch scratch clears inside the batch loop; keep bucketResult and the 8 MB walkerPartials cleared once (disjoint accumulation / only fresh linked slots are read). - Add MsmConfig.forceNumBatches (+ ?nb= / ?nbs= dev-page hooks) to pin nb for deterministic cross-checks. At nb=1 bucket_base=0 and the loop runs once, so the default path is byte-identical to before. Cross-checked vs @noble/curves (headless SwiftShader): nb=1..4 @logn10 and nb=1..3 @logn14 all agree (previously every nb>1 disagreed). Updates the MsmConfig docs; MSM_V2_MEMORY.md updated separately.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Axis B — MsmV2 peak GPU memory: the budget lever, now correct
Headline: the multi-batch path is now correct (was returning the wrong MSM)
Self-reviewing the
numBatchesmemory lever uncovered that everynb>1returned the wrong MSM — and the same multi-batch code is the
wgFits-forceddefault at logn19 (nb=2) / logn20 (nb=3), so MsmV2 was very likely
incorrect by default at logn≥19. This PR roots out and fixes both bugs.
Cross-check vs
@noble/curves(headless SwiftShader — correctness ishardware-independent; nb forced via the new
MsmConfig.forceNumBatches):Previously every nb>1 disagreed (the prior revision's failure table was at
logn16, where nb=2 was ❌). Now all pass — and logn16 nb=2 passes at the
same
statsBytespeak (47.4 → 39.1 MB, −17.5 %), confirming the fix changes nobuffer size. At nb=1,
bucket_base=0and the batch loop runs once, so thedefault path is byte-identical to before.
Root cause & fix
Two bugs, both nb=1-invisible (at nb=1
batchBuckets == bTotal, local == global):window space
[0, batchWindows)→ LOCAL bucket space[0, batchBuckets), butthe three kernels writing the full-
bTotalbucketResult(ba_size1,ba_stream_walker,ba_walker_combine) indexed it by the local bucket, soevery batch overwrote
[0, batchBuckets)instead of its disjoint global slice.Fix: thread
bucket_base = bi*batchBucketsinto those kernels' destinationindex (
ba_size1.y,ba_stream_walker.w—M_partialsis now derivedin-shader to free the slot, no new binding —
ba_walker_combine.z), onebind group per batch (mirrors the existing per-batch
decomposeBinds). TheCSR / partials / linked-list spaces stay LOCAL; only the final write adds the
base.
bucketHead(atomic linked-list heads),walkerNodeCounter(atomic node allocator),taskCuts,threadCutswerecleared once before the loop, so batch
biwalked batchbi-1's staleheads or overflowed
max_nodesand silently dropped its partials. Fix: clearper-batch;
bucketResultand the 8 MBwalkerPartialsstay cleared once(disjoint accumulation / only fresh linked slots are read).
Memory: the lever is now a correct memory/time trade
statsBytes()is a pure sum ofGPUBuffer.size, a closed form of(n, c, numBatches)— the fix changes no buffer size, so the priorreal-hardware A/B (macOS Sequoia · Chrome 148, logn16, median-of-5 GPU wall)
carries over verbatim, now on a correct path:
statsBytes)Peak falls monotonically, wall rises, accelerating — diminishing returns past
nb≈2–3. So raising the default budget would still regress the common path:
the no-op default (
MEM_BUDGET248 MB) is right. Deterministic peak acrosssizes (script-verified by
mem-accounting.mjs): logn17 68.8 MB total / logn20229.7 MB, fully-batched floor 38.7 / 168.1 MB. The fixed floor (
scalarsRaw32 MB @logn20,
bucketResult,redBuf,walkerPartials, SRS) batching cannotreach.
(Per-nb real-hardware re-time queued on BrowserStack — both seats were saturated
at write time. nb=1 wall is unchanged by construction since that path is
byte-identical; the SwiftShader cross-check proves correctness independent of
hardware.)
Conclusion
The host-buffer-management lever is a correct but steep memory/time trade.
Pushing peak below the batching floor needs WGSL-level levers (in-place
bucket reduction −17 MB; on-GPU SRS y-recovery −32 MB via the existing
decompress_g1_bn254; per-batch scalar byte-slicing) — each a separate verifiedchange, flagged in
MSM_V2_MEMORY.md.Files
…/wgsl/cuzk/ba_size1,ba_stream_walker,ba_walker_combine.template.wgsl(+ regenerated_generated/shaders.ts) — per-batchbucket_base.…/msm_webgpu/msm_v2.ts— per-batch writer bind groups; per-batch scratch clears;forceNumBatchesknob; honestestimateMem;batchCountgetter.…/msm_webgpu/MSM_V2_MEMORY.md— corrected map + correctness fix + cross-check.…/dev/msm-webgpu/main.ts—?nb=/?nbs=hooks;msm-membudget-sweepper-nb cross-check.…/dev/msm-webgpu/scripts/mem-accounting.mjs— deterministic accounting.Created by claudebox · group:
aztec