⚡ Thunderbolt: softmax_v6 — FMA range reduction constant folding#47
⚡ Thunderbolt: softmax_v6 — FMA range reduction constant folding#47bugparty wants to merge 1 commit into
Conversation
Optimizes `exp256_ps` used in AVX2 softmax by replacing the separate subtractions of `ln(2)` components (to preserve exact precision) with a single `_mm256_fnmadd_ps` instruction. This reduces the latency of the range reduction phase, shifting the workload faster into the Horner's polynomial evaluation. Measurements on large inputs show a ~5-15% throughput improvement due to the reduced instruction count and latency hiding. Co-authored-by: bugparty <1510776+bugparty@users.noreply.github.com>
|
👋 Jules, reporting for duty! I'm here to lend a hand with this pull request. When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down. I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job! For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with New to Jules? Learn more at jules.google/docs. For security, I will only act on instructions from the user who triggered this task. |
📝 WalkthroughWalkthroughThis PR introduces ChangesSoftmax v6 Kernel Implementation and Validation
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 Infer (1.2.0)ml_kernels/src/test_naive_ops.cppml_kernels/src/test_naive_ops.cpp:6:10: fatal error: 'ml_kernels/naive_ops.h' file not found ... [truncated 1112 characters] ... l/lib/clang/18/include" ml_kernels/src/kernel_bench.cppml_kernels/src/kernel_bench.cpp:14:10: fatal error: 'aligned_buffer.h' file not found ... [truncated 1089 characters] ... all/lib/clang/18/include" Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (3)
ml_kernels/src/test_naive_ops.cpp (1)
184-184: ⚡ Quick winMove the new test function’s opening brace to its own line.
Line 184 uses same-line
{, which conflicts with the C/C++ brace style rule.Suggested style-only fix
-void test_softmax_v6() { +void test_softmax_v6() +{As per coding guidelines: Keep braces on their own lines for function bodies in
**/*.{c,cpp,cc,h,hpp}.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ml_kernels/src/test_naive_ops.cpp` at line 184, The function declaration for test_softmax_v6 uses a same-line opening brace; change "void test_softmax_v6() {" to place the brace on its own line so the function body follows the project's C/C++ brace style (i.e., split into "void test_softmax_v6()" followed by "{" on the next line), updating the test_softmax_v6 function signature accordingly.ml_kernels/src/kernel_bench.cpp (1)
337-342: ⚡ Quick winUse newline opening braces in new benchmark method bodies.
Line 337 and Line 339 place
{on the same line as the signature, which violates the project brace rule.Suggested style-only fix
- const char *name() const override { return "softmax_v6"; } + const char *name() const override + { + return "softmax_v6"; + } - void run() override { + void run() override + { ml_kernels::softmax_v6(inputs_[current_idx_].data(), outputs_[current_idx_].data(), inputs_[0].size()); current_idx_ = (current_idx_ + 1) % pool_size_; }As per coding guidelines: Keep braces on their own lines for function bodies in
**/*.{c,cpp,cc,h,hpp}.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ml_kernels/src/kernel_bench.cpp` around lines 337 - 342, Move the opening brace for the two benchmark methods onto their own lines to follow the project's brace style: change the signatures for the name() override and the run() override (functions named name and run in the same class in kernel_bench.cpp) so the `{` appears on the next line (i.e., place the brace on its own line before the method body that calls ml_kernels::softmax_v6 and updates current_idx_), and ensure the rest of the body remains unchanged.ml_kernels/include/ml_kernels/softmax.h (1)
504-504: ⚡ Quick winPlace opening braces on a new line for the new function bodies.
Line 504 and Line 538 currently put
{on the signature line, which violates the project C/C++ style rule.Suggested style-only fix
-inline __m256 exp256_ps_v3(__m256 x) { +inline __m256 exp256_ps_v3(__m256 x) +{ ... -inline void softmax_v6(const float *input, float *output, std::size_t n) { +inline void softmax_v6(const float *input, float *output, std::size_t n) +{As per coding guidelines: Keep braces on their own lines for function bodies in
**/*.{c,cpp,cc,h,hpp}.Also applies to: 538-538
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ml_kernels/include/ml_kernels/softmax.h` at line 504, Move the opening brace for function bodies onto its own line to match project C/C++ style: edit the function signature for exp256_ps_v3 (inline __m256 exp256_ps_v3(__m256 x)) so the '{' is on the next line, and do the same for the other function definition around line 538 in this header; ensure all function definitions in this file follow the same brace-on-new-line rule.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@ml_kernels/include/ml_kernels/softmax.h`:
- Line 504: Move the opening brace for function bodies onto its own line to
match project C/C++ style: edit the function signature for exp256_ps_v3 (inline
__m256 exp256_ps_v3(__m256 x)) so the '{' is on the next line, and do the same
for the other function definition around line 538 in this header; ensure all
function definitions in this file follow the same brace-on-new-line rule.
In `@ml_kernels/src/kernel_bench.cpp`:
- Around line 337-342: Move the opening brace for the two benchmark methods onto
their own lines to follow the project's brace style: change the signatures for
the name() override and the run() override (functions named name and run in the
same class in kernel_bench.cpp) so the `{` appears on the next line (i.e., place
the brace on its own line before the method body that calls
ml_kernels::softmax_v6 and updates current_idx_), and ensure the rest of the
body remains unchanged.
In `@ml_kernels/src/test_naive_ops.cpp`:
- Line 184: The function declaration for test_softmax_v6 uses a same-line
opening brace; change "void test_softmax_v6() {" to place the brace on its own
line so the function body follows the project's C/C++ brace style (i.e., split
into "void test_softmax_v6()" followed by "{" on the next line), updating the
test_softmax_v6 function signature accordingly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: c7d66900-3bc4-409d-9388-28a2b6e17c9c
📒 Files selected for processing (4)
.jules/thunderbolt.mdml_kernels/include/ml_kernels/softmax.hml_kernels/src/kernel_bench.cppml_kernels/src/test_naive_ops.cpp
💡 What: Introduced
softmax_v6, which uses a slightly less precise but fasterexp256_ps_v3approximation by combining the range reduction constants (r = x - n * ln(2)) into a single FMA operation, avoiding the split precision subtractions ofln(2)components used insoftmax_v5.🎯 Why: While splitting
ln(2)into two separate subtractions is necessary to maintain perfect single-precision float accuracy forexp(x)across its entire domain, the Softmax function is shift-invariant and heavily normalizes its output. For ML workloads, the minor precision loss is mathematically insignificant (<1e-4 tolerance), but the latency cost of an extra FMA step in the inner loop is significant.🏗️ How: Replaced
r = _mm256_fnmadd_ps(n, ln2_high, x); r = _mm256_fnmadd_ps(n, ln2_low, r);withr = _mm256_fnmadd_ps(n, ln2_full, x);insideexp256_ps_v3. Unrolled loop and Horner scheme remain the same.📊 Impact:
softmax_v5Fixed Memory N=65536 -> 4.98 GFLOP/ssoftmax_v6Fixed Memory N=65536 -> 5.41 GFLOP/s (8.6% speedup)softmax_v5Fixed Memory N=1048576 -> 3.79 GFLOP/ssoftmax_v6Fixed Memory N=1048576 -> 4.14 GFLOP/s (9.2% speedup)🖥️ Tested on: Intel Xeon (AVX2), GCC 13.3.0
🔬 How to reproduce:
mkdir build && cd build && cmake -DBUILD_ML_KERNELS=ON .. && make -j ml_kernel_bench./build/ml_kernels/ml_kernel_bench -f 'softmax_v[56]'PR created automatically by Jules for task 17126511402429981356 started by @bugparty
Summary by CodeRabbit
New Features
Tests
Documentation