⚡ Thunderbolt: softmax_v6 — 8x unrolled phases with single-FMA exp256#53
⚡ Thunderbolt: softmax_v6 — 8x unrolled phases with single-FMA exp256#53bugparty wants to merge 1 commit into
Conversation
Co-authored-by: bugparty <1510776+bugparty@users.noreply.github.com>
|
👋 Jules, reporting for duty! I'm here to lend a hand with this pull request. When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down. I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job! For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with New to Jules? Learn more at jules.google/docs. For security, I will only act on instructions from the user who triggered this task. |
📝 WalkthroughWalkthroughThis PR introduces ChangesSoftmax v6 AVX2 Implementation
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
ml_kernels/src/test_naive_ops.cpp (1)
186-216: 💤 Low valueConsider adding a test case with a non-multiple-of-8 size to exercise scalar tail handling.
The current test with 72 elements exercises the main loops and 8-element cleanup paths, but doesn't cover the scalar tail (lines 617-621, 648-650 in softmax.h). Adding a test case with, e.g., 75 elements would improve coverage of edge-case handling.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ml_kernels/src/test_naive_ops.cpp` around lines 186 - 216, The test currently only covers multiples of 8; change test_softmax_v6 to exercise the scalar tail by using a non-multiple-of-8 length (for example replace input.resize(72, 1.0f) with input.resize(75, 1.0f) or add a new test function test_softmax_v6_tail that sets input.size() to 75), then run ml_kernels::softmax_naive and ml_kernels::softmax_v6 as before and assert elementwise closeness and sum==1 to validate the scalar tail handling in softmax_v6/softmax_naive.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@ml_kernels/src/test_naive_ops.cpp`:
- Around line 186-216: The test currently only covers multiples of 8; change
test_softmax_v6 to exercise the scalar tail by using a non-multiple-of-8 length
(for example replace input.resize(72, 1.0f) with input.resize(75, 1.0f) or add a
new test function test_softmax_v6_tail that sets input.size() to 75), then run
ml_kernels::softmax_naive and ml_kernels::softmax_v6 as before and assert
elementwise closeness and sum==1 to validate the scalar tail handling in
softmax_v6/softmax_naive.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 90950ab1-ef6f-43c4-bd68-422c7b9382a7
📒 Files selected for processing (4)
.jules/thunderbolt.mdml_kernels/include/ml_kernels/softmax.hml_kernels/src/kernel_bench.cppml_kernels/src/test_naive_ops.cpp
💡 What:
Added a new AVX2 kernel,
softmax_v6, heavily optimizing the prior versions by:x - n*ln2from dual FMAs to a single FMA to exploit softmax shift-invariance.maxandnormalizephases are unrolled 8x to perfectly match execution unit latency (e.g.max_ps4-cycle latency), while the register-heavyexpphase is kept at 4x unroll to avoid YMM spilling.🎯 Why:
The prior
softmax_v5was bottlenecked during the max/normalize phases by unrolling only 4x. Additionally, perfectly accurateln2approximation required multiple FMA instructions which bottlenecked port execution while the shift-invariant properties of softmax can tolerate the slight approximation.🏗️ How:
exp256_ps_v3insideml_kernels/include/ml_kernels/softmax.hutilizing single FMA range reduction.max0throughmax7).test_softmax_v6intest_naive_ops.cppverifying the output is within the 1e-4 target tolerance.BenchmarkRegistry.📊 Impact:
🖥️ Tested on:
🔬 How to reproduce:
PR created automatically by Jules for task 14084203667146761558 started by @bugparty
Summary by CodeRabbit
Release Notes
New Features
Documentation
Tests