⚡ Thunderbolt: softmax_v6 — FMA range reduction constant folding by bugparty · Pull Request #47 · bugparty/cpu_math_kernels_pri

bugparty · 2026-06-04T19:53:48Z

💡 What: Introduced softmax_v6, which uses a slightly less precise but faster exp256_ps_v3 approximation by combining the range reduction constants (r = x - n * ln(2)) into a single FMA operation, avoiding the split precision subtractions of ln(2) components used in softmax_v5.
🎯 Why: While splitting ln(2) into two separate subtractions is necessary to maintain perfect single-precision float accuracy for exp(x) across its entire domain, the Softmax function is shift-invariant and heavily normalizes its output. For ML workloads, the minor precision loss is mathematically insignificant (<1e-4 tolerance), but the latency cost of an extra FMA step in the inner loop is significant.
🏗️ How: Replaced r = _mm256_fnmadd_ps(n, ln2_high, x); r = _mm256_fnmadd_ps(n, ln2_low, r); with r = _mm256_fnmadd_ps(n, ln2_full, x); inside exp256_ps_v3. Unrolled loop and Horner scheme remain the same.
📊 Impact:
softmax_v5 Fixed Memory N=65536 -> 4.98 GFLOP/s
softmax_v6 Fixed Memory N=65536 -> 5.41 GFLOP/s (8.6% speedup)
softmax_v5 Fixed Memory N=1048576 -> 3.79 GFLOP/s
softmax_v6 Fixed Memory N=1048576 -> 4.14 GFLOP/s (9.2% speedup)
🖥️ Tested on: Intel Xeon (AVX2), GCC 13.3.0
🔬 How to reproduce:
mkdir build && cd build && cmake -DBUILD_ML_KERNELS=ON .. && make -j ml_kernel_bench
./build/ml_kernels/ml_kernel_bench -f 'softmax_v[56]'

PR created automatically by Jules for task 17126511402429981356 started by @bugparty

Summary by CodeRabbit

New Features
- Introduced a new AVX2-optimized Softmax implementation delivering approximately 5–15% throughput improvement for ML kernel operations.
Tests
- Added unit tests and performance benchmarks for the new Softmax optimization.
Documentation
- Added optimization guidelines for similar ML exponential kernel implementations.

Optimizes `exp256_ps` used in AVX2 softmax by replacing the separate subtractions of `ln(2)` components (to preserve exact precision) with a single `_mm256_fnmadd_ps` instruction. This reduces the latency of the range reduction phase, shifting the workload faster into the Horner's polynomial evaluation. Measurements on large inputs show a ~5-15% throughput improvement due to the reduced instruction count and latency hiding. Co-authored-by: bugparty <1510776+bugparty@users.noreply.github.com>

google-labs-jules · 2026-06-04T19:53:49Z

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!

New to Jules? Learn more at jules.google/docs.

For security, I will only act on instructions from the user who triggered this task.

coderabbitai · 2026-06-04T19:53:59Z

📝 Walkthrough

Walkthrough

This PR introduces softmax_v6, a new AVX2 softmax kernel featuring an optimized exponential computation (exp256_ps_v3) that folds range-reduction constants into a single FMA operation. The implementation includes vectorized max reduction, exp+sum accumulation, early-return optimization, and normalization. Tests validate correctness against naive softmax, and a benchmark harness measures performance.

Changes

Softmax v6 Kernel Implementation and Validation

Layer / File(s)	Summary
AVX2 exponential and softmax kernels with range-reduction optimization `.jules/thunderbolt.md`, `ml_kernels/include/ml_kernels/softmax.h`	exp256_ps_v3 computes `exp(x)` with range reduction via combined FMA/FNMADD for `r = x - n * ln(2)`, then Horner polynomial evaluation and exponent construction. softmax_v6 orchestrates vectorized max reduction over 32 elements, exp+sum accumulation using exp256_ps_v3 with 4-way SIMD lanes, scalar tail handling, early return on zero sum, and vectorized output normalization. Thunderbolt.md documents the FMA constant-folding optimization and ~5–15% throughput gain.
Test correctness against naive implementation `ml_kernels/src/test_naive_ops.cpp`	test_softmax_v6() validates softmax_v6 output against softmax_naive on fixed 40-element input with per-element tolerance of 1e-4, and asserts output probabilities sum to ~1.0. Integrated into main() test sequence.
Benchmark harness for softmax_v6 `ml_kernels/src/kernel_bench.cpp`	SoftmaxV6Benchmark subclass calls ml_kernels::softmax_v6 on pooled buffers with standard wraparound index update; registered as a measurable benchmark variant.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

bugparty/cpu_math_kernels_pri#31: Introduced the previous softmax generation (softmax_v5) and exp256_ps_v2 with similar range-reduction and Horner-evaluation patterns; softmax_v6 continues this iterative optimization lineage with FMA constant folding.

Poem

🐰 With whiskers twitching, we fold constants tight,
FMA fusing range-reduction's flight,
Softmax v6 hops through vectors with care,
Exponential blooms in AVX2 air,
Five to fifteen percent faster fare! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 30.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately describes the main change: introducing softmax_v6 with FMA range reduction constant folding optimization, which is the core focus of all code changes.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch thunderbolt/softmax_v6_optimization-17126511402429981356

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 Infer (1.2.0)

ml_kernels/src/test_naive_ops.cpp

ml_kernels/src/test_naive_ops.cpp:6:10: fatal error: 'ml_kernels/naive_ops.h' file not found
6 | #include "ml_kernels/naive_ops.h"
| ^~~~~~~~~~~~~~~~~~~~~~~~
1 error generated.
Error: the following clang command did not run successfully:
/opt/infer-linux-x86_64-v1.2.0/lib/infer/facebook-clang-plugins/clang/install/bin/clang-18
@/tmp/coderabbit-infer/bf31ec7306d2db750a64d21e58724064f9921095-d463b5e2e9e167cc/tmp/clang_command_.tmp.651947.txt
++Contents of '/tmp/coderabbit-infer/bf31ec7306d2db750a64d21e58724064f9921095-d463b5e2e9e167cc/tmp/clang_command_.tmp.651947.txt':
"-cc1" "-load"
"/opt/infer-linux-x86_64-v1.2.0/lib/infer/infer/bin/../../facebook-clang-plugins/libtooling/build/FacebookClangPlugin.dylib"
"-add-plugin" "BiniouASTExporter" "-plugin-arg-BiniouASTExporter" "-"
"-plugin-arg-BiniouASTExporter" "PREPEND_CURRENT_DIR=1"
"-plugin-arg-BiniouASTExporter" "MAX_STRING_SIZE=65535" "-cc1" "-triple"
"x86_64-unknown-linux-gnu" "-emit

... [truncated 1112 characters] ...

l/lib/clang/18/include"
"-internal-isystem" "/usr/local/include" "-internal-isystem"
"/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/include"
"-internal-externc-isystem" "/usr/include/x86_64-linux-gnu"
"-internal-externc-isystem" "/include" "-internal-externc-isystem"
"/usr/include" "-Wno-ignored-optimization-argument" "-Wno-everything"
"-fdeprecated-macro" "-ferror-limit" "19" "-fgnuc-version=4.2.1"
"-fskip-odr-check-in-gmf" "-fcxx-exceptions" "-fexceptions"
"-D__GCC_HAVE_DWARF2_CFI_ASM=1" "-o"
"/tmp/coderabbit-infer/d463b5e2e9e167cc/file.o" "-x" "c++"
"ml_kernels/src/test_naive_ops.cpp" "-O0" "-fno-builtin" "-include"
"/opt/infer-linux-x86_64-v1.2.0/lib/infer/infer/bin/../lib/clang_wrappers/global_defines.h"
"-Wno-everything"

ml_kernels/src/kernel_bench.cpp

ml_kernels/src/kernel_bench.cpp:14:10: fatal error: 'aligned_buffer.h' file not found
14 | #include "aligned_buffer.h"
| ^~~~~~~~~~~~~~~~~~
1 error generated.
Error: the following clang command did not run successfully:
/opt/infer-linux-x86_64-v1.2.0/lib/infer/facebook-clang-plugins/clang/install/bin/clang-18
@/tmp/coderabbit-infer/bf31ec7306d2db750a64d21e58724064f9921095-7c4391b9a04596fa/tmp/clang_command_.tmp.c8fc79.txt
++Contents of '/tmp/coderabbit-infer/bf31ec7306d2db750a64d21e58724064f9921095-7c4391b9a04596fa/tmp/clang_command_.tmp.c8fc79.txt':
"-cc1" "-load"
"/opt/infer-linux-x86_64-v1.2.0/lib/infer/infer/bin/../../facebook-clang-plugins/libtooling/build/FacebookClangPlugin.dylib"
"-add-plugin" "BiniouASTExporter" "-plugin-arg-BiniouASTExporter" "-"
"-plugin-arg-BiniouASTExporter" "PREPEND_CURRENT_DIR=1"
"-plugin-arg-BiniouASTExporter" "MAX_STRING_SIZE=65535" "-cc1" "-triple"
"x86_64-unknown-linux-gnu" "-emit-obj" "-mrelax-all"

... [truncated 1089 characters] ...

all/lib/clang/18/include"
"-internal-isystem" "/usr/local/include" "-internal-isystem"
"/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/include"
"-internal-externc-isystem" "/usr/include/x86_64-linux-gnu"
"-internal-externc-isystem" "/include" "-internal-externc-isystem"
"/usr/include" "-Wno-ignored-optimization-argument" "-Wno-everything"
"-fdeprecated-macro" "-ferror-limit" "19" "-fgnuc-version=4.2.1"
"-fskip-odr-check-in-gmf" "-fcxx-exceptions" "-fexceptions"
"-D__GCC_HAVE_DWARF2_CFI_ASM=1" "-o"
"/tmp/coderabbit-infer/7c4391b9a04596fa/file.o" "-x" "c++"
"ml_kernels/src/kernel_bench.cpp" "-O0" "-fno-builtin" "-include"
"/opt/infer-linux-x86_64-v1.2.0/lib/infer/infer/bin/../lib/clang_wrappers/global_defines.h"
"-Wno-everything"

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

🧹 Nitpick comments (3)

ml_kernels/src/test_naive_ops.cpp (1)

184-184: ⚡ Quick win

Move the new test function’s opening brace to its own line.

Line 184 uses same-line {, which conflicts with the C/C++ brace style rule.
Suggested style-only fix
-void test_softmax_v6() {
+void test_softmax_v6()
+{
As per coding guidelines: Keep braces on their own lines for function bodies in **/*.{c,cpp,cc,h,hpp}.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ml_kernels/src/test_naive_ops.cpp` at line 184, The function declaration for
test_softmax_v6 uses a same-line opening brace; change "void test_softmax_v6()
{" to place the brace on its own line so the function body follows the project's
C/C++ brace style (i.e., split into "void test_softmax_v6()" followed by "{" on
the next line), updating the test_softmax_v6 function signature accordingly.

ml_kernels/src/kernel_bench.cpp (1)

337-342: ⚡ Quick win

Use newline opening braces in new benchmark method bodies.

Line 337 and Line 339 place { on the same line as the signature, which violates the project brace rule.

Suggested style-only fix

-    const char *name() const override { return "softmax_v6"; }
+    const char *name() const override
+    {
+        return "softmax_v6";
+    }

-    void run() override {
+    void run() override
+    {
         ml_kernels::softmax_v6(inputs_[current_idx_].data(), outputs_[current_idx_].data(), inputs_[0].size());
         current_idx_ = (current_idx_ + 1) % pool_size_;
     }

As per coding guidelines: Keep braces on their own lines for function bodies in **/*.{c,cpp,cc,h,hpp}.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ml_kernels/src/kernel_bench.cpp` around lines 337 - 342, Move the opening
brace for the two benchmark methods onto their own lines to follow the project's
brace style: change the signatures for the name() override and the run()
override (functions named name and run in the same class in kernel_bench.cpp) so
the `{` appears on the next line (i.e., place the brace on its own line before
the method body that calls ml_kernels::softmax_v6 and updates current_idx_), and
ensure the rest of the body remains unchanged.

ml_kernels/include/ml_kernels/softmax.h (1)

504-504: ⚡ Quick win

Place opening braces on a new line for the new function bodies.

Line 504 and Line 538 currently put { on the signature line, which violates the project C/C++ style rule.
Suggested style-only fix
-inline __m256 exp256_ps_v3(__m256 x) {
+inline __m256 exp256_ps_v3(__m256 x)
+{
...
-inline void softmax_v6(const float *input, float *output, std::size_t n) {
+inline void softmax_v6(const float *input, float *output, std::size_t n)
+{
As per coding guidelines: Keep braces on their own lines for function bodies in **/*.{c,cpp,cc,h,hpp}.

Also applies to: 538-538
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ml_kernels/include/ml_kernels/softmax.h` at line 504, Move the opening brace
for function bodies onto its own line to match project C/C++ style: edit the
function signature for exp256_ps_v3 (inline __m256 exp256_ps_v3(__m256 x)) so
the '{' is on the next line, and do the same for the other function definition
around line 538 in this header; ensure all function definitions in this file
follow the same brace-on-new-line rule.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@ml_kernels/include/ml_kernels/softmax.h`:
- Line 504: Move the opening brace for function bodies onto its own line to
match project C/C++ style: edit the function signature for exp256_ps_v3 (inline
__m256 exp256_ps_v3(__m256 x)) so the '{' is on the next line, and do the same
for the other function definition around line 538 in this header; ensure all
function definitions in this file follow the same brace-on-new-line rule.

In `@ml_kernels/src/kernel_bench.cpp`:
- Around line 337-342: Move the opening brace for the two benchmark methods onto
their own lines to follow the project's brace style: change the signatures for
the name() override and the run() override (functions named name and run in the
same class in kernel_bench.cpp) so the `{` appears on the next line (i.e., place
the brace on its own line before the method body that calls
ml_kernels::softmax_v6 and updates current_idx_), and ensure the rest of the
body remains unchanged.

In `@ml_kernels/src/test_naive_ops.cpp`:
- Line 184: The function declaration for test_softmax_v6 uses a same-line
opening brace; change "void test_softmax_v6() {" to place the brace on its own
line so the function body follows the project's C/C++ brace style (i.e., split
into "void test_softmax_v6()" followed by "{" on the next line), updating the
test_softmax_v6 function signature accordingly.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c7d66900-3bc4-409d-9388-28a2b6e17c9c

📥 Commits

Reviewing files that changed from the base of the PR and between acca01e and bf31ec7.

📒 Files selected for processing (4)

.jules/thunderbolt.md
ml_kernels/include/ml_kernels/softmax.h
ml_kernels/src/kernel_bench.cpp
ml_kernels/src/test_naive_ops.cpp

coderabbitai Bot reviewed Jun 4, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

⚡ Thunderbolt: softmax_v6 — FMA range reduction constant folding#47

⚡ Thunderbolt: softmax_v6 — FMA range reduction constant folding#47
bugparty wants to merge 1 commit into
mainfrom
thunderbolt/softmax_v6_optimization-17126511402429981356

bugparty commented Jun 4, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

google-labs-jules Bot commented Jun 4, 2026

Uh oh!

coderabbitai Bot commented Jun 4, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

bugparty commented Jun 4, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

google-labs-jules Bot commented Jun 4, 2026

Uh oh!

coderabbitai Bot commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

bugparty commented Jun 4, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 4, 2026 •

edited

Loading