feat(prover): Build the LogUp aux trace on the GPU#762
Conversation
…nto gpu_integration
|
/bench-gpu |
|
/review-ai |
GPU Benchmark (ABBA) —
|
Codex Code ReviewFound one issue. High - Resident LogUp buffers bypass the VRAM budget and can exhaust the GPU. |
| blowup_factor, | ||
| weights, | ||
| "coset_lde_row_major lde_size", | ||
| true, |
There was a problem hiding this comment.
Medium (device memory / hot-path work): retain_trace_col_major is hard-coded true here, so every GPU main-trace LDE now allocates an extra column-major snapshot (n * total_cols u64s, ~half of handle.buf) and runs an extra launch_row_to_col_major transpose — on every proof, regardless of whether the resident aux build will ever consume it.
The snapshot lives inside the GpuLdeBase handle held in main_gpu_handles for the whole aux-build window (the peak-memory phase), but it is only read by the resident aux path, which is disabled when:
disk-spill+StorageMode::Disk(resident_aux_ok = false) — and this is precisely the memory-constrained mode where the extra resident buffer hurts most;debug-checks(the whole resident/main_trace_devthreading iscfg'd out, yet the buffer is still allocated and retained here);- non-Goldilocks / preprocessed tables that fall back.
Consider threading the retain flag through so it's only true when the resident aux build can actually use it (i.e. gate on the same not(debug-checks) / non-disk conditions), rather than paying the transpose + extra residency unconditionally.
Review: Build the LogUp aux trace on the GPUReviewed the full diff (the PR files aren't checked out locally, so this is diff-based plus the base helpers it builds on). Overall this is a careful, well-structured change: the CUDA kernels mirror the CPU evaluator closely, the parity/
Findings1. Medium — unconditional trace snapshot retention (device memory + hot-path transpose). 2. Low — dropped safety Nothing blocking. Nice work on the parity tests and the resident/fallback separation. |
|
/bench-gpu |
AI ReviewPR #762 · 17 changed files Findings
Status column reflects the verdict from the verifier: deepseek-verifier (openrouter/deepseek/deepseek-v4-pro). AI-001: Underflow in row_sum slice for num_rows == 0
Claim When Evidence Line 410: Suggested fix Add an early return before the scan/finalize when num_rows == 0 (similar to AI-002: Instrument labels read "(CPU)" but the GPU-resident path also runs these stages
Claim The new report rows are labelled "LogUp fingerprint (CPU)", "LogUp batch invert (CPU)", "LogUp term combine (CPU)", "LogUp accumulate scan (CPU)", but when the GPU-resident aux build fires these all show 0 (the host counter only accumulates from the CPU Evidence
Suggested fix Either drop the "(CPU)" suffix, or add a parallel timer on the GPU path that records host wall-time per phase and accumulates into the same fields. Lower-priority cosmetic; the comment near the resident build ( Reviewer Lanes
Verification Lanes
Native Codex and Claude reviews run separately and post their own comments. They are not included in this structured provenance report. Raw lane outputs, candidates, final issues, and model metrics are uploaded as workflow artifacts. |
Description:
Moves the LogUp auxiliary trace build (Round 1) from the CPU to the GPU, and keeps the data on the GPU instead of copying it back and forth.
What changed:
Result (ethrex 5-tx, RTX 5090): aux build drops from ~1.74s to ~1.33s. The proof is byte-for-byte identical to the CPU version and all tests pass.
Note: this is a building block. On its own the win is small, because the stages around it (trace generation, constraint evaluation) are still on the CPU and force data across the bus. The bigger speedup comes when those move to the GPU too and the whole chain stays on the device — this PR adds the plumbing (resident aux trace, main-trace reuse, device-input LDE) they build on.
Known limit: preprocessed tables still upload their main trace, since they have no GPU copy to reuse.