⚡ Thunderbolt: Softmax — Single FMA range reduction and 8x Unrolling by bugparty · Pull Request #49 · bugparty/cpu_math_kernels_pri

bugparty · 2026-06-06T20:14:41Z

💡 What: Added softmax_v6 and exp256_ps_v3 inside the ml_kernels library, decorated with AVX2 and FMA target attributes. The max reduction and normalisation loops are unrolled 8x (64 elements per iteration). Range reduction is optimised to use a single _mm256_fnmadd_ps.
🎯 Why: The extended precision subtraction inside exp computation (i.e., Estrin's/Horner's scheme range reduction) was needlessly adding latency for ML applications where numerical tolerance (1e-4) enables shift-invariance scaling. Furthermore, earlier AVX2 reductions (_mm256_max_ps) had 4-cycle latencies, meaning unrolling out to 8x (with 8 accumulators) allows perfect port saturation.
🏗️ How:

__m256 r = _mm256_fnmadd_ps(n, _mm256_set1_ps(0.6931471805599453f), x); is used in exp256_ps_v3.
Increased max-finding unroll depth to 8x.
Increased normalisation scalar mult array depth to 8x.
📊 Impact: ~7% speedup. Achieved ~6.09 GFLOP/s compared to softmax_v5 at ~5.67 GFLOP/s for N=65536 in Fixed Memory mode, with similar ~5-10% gains across large array bounds in Pool Mode.
🖥️ Tested on: GCC 13.3.0, Linux x86_64
🔬 How to reproduce: cd build && DISABLE_CPU_BINDING=1 ./ml_kernels/ml_kernel_bench --filter softmax_v6 --sizes 65536

PR created automatically by Jules for task 911386279487980053 started by @bugparty

Summary by CodeRabbit

Release Notes

New Features
- Introduced an optimized softmax implementation with enhanced performance using specialized vector operations
Tests
- Added comprehensive test coverage validating numerical accuracy of the new implementation
Documentation
- Documented the optimization approach and performance benchmark results

Implemented a new AVX2 `softmax_v6` kernel targeting modern x86 architectures (Haswell+). - Replaces extended precision range reduction (split subtraction) in exp256 with a single `_mm256_fnmadd_ps` to boost ILP. - Expands unrolling for max reduction and normalization phases to 8x to shift from instruction latency bound to throughput/memory bound. - Achieves ~6.0 GFLOP/s vs ~5.6 GFLOP/s in prior implementation. Co-authored-by: bugparty <1510776+bugparty@users.noreply.github.com>

google-labs-jules · 2026-06-06T20:14:42Z

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!

New to Jules? Learn more at jules.google/docs.

For security, I will only act on instructions from the user who triggered this task.

coderabbitai · 2026-06-06T20:14:56Z

📝 Walkthrough

Walkthrough

This PR introduces softmax_v6, a new AVX2+FMA-optimized softmax kernel that uses a single-FMA exponential range-reduction formula in place of the prior extended approach, applies refined 8-way and 4-way unrolling patterns, includes correctness validation tests, and registers the implementation for performance benchmarking.

Changes

AVX2 Softmax Optimization with Single-FMA Range Reduction

Layer / File(s)	Summary
Single-FMA exponential helper `ml_kernels/include/ml_kernels/softmax.h`	`exp256_ps_v3` replaces extended range reduction with a single FMA-based formulation while preserving vector clamping, polynomial approximation, and exponent construction.
Softmax kernel with 8x/4x unrolling pattern `ml_kernels/include/ml_kernels/softmax.h`, `ml_kernels/src/test_naive_ops.cpp`	`softmax_v6` computes max via 8-way unrolling, exponentials and partial sums via 4-way unrolling using `exp256_ps_v3`, reduces to scalar sum, and normalizes via 8-way unrolling. Test function `test_softmax_v6()` validates against `softmax_naive` with per-element and normalization checks.
Benchmark registration and documentation `ml_kernels/src/kernel_bench.cpp`, `.jules/thunderbolt.md`	`SoftmaxV6Benchmark` class registers the kernel for performance measurement; dated note documents the single-FMA approach, reports benchmark evidence, and specifies unrolling guidelines.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 A rabbit hops through AVX lanes so wide,
One FMA hops where doubles used to glide,
Unrolling fast—eight, four—in perfect stride,
Softmax blooms swifter with this joyful guide!
✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 27.27% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and specifically describes the main changes: introducing single FMA range reduction and 8x unrolling for softmax optimization. It directly matches the core improvements in softmax_v6 and exp256_ps_v3.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch thunderbolt-softmax-fma-unroll-911386279487980053

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 Infer (1.2.0)

ml_kernels/src/test_naive_ops.cpp

ml_kernels/src/test_naive_ops.cpp:6:10: fatal error: 'ml_kernels/naive_ops.h' file not found
6 | #include "ml_kernels/naive_ops.h"
| ^~~~~~~~~~~~~~~~~~~~~~~~
1 error generated.
Error: the following clang command did not run successfully:
/opt/infer-linux-x86_64-v1.2.0/lib/infer/facebook-clang-plugins/clang/install/bin/clang-18
@/tmp/coderabbit-infer/19cd30e1f2199a9d079585e788833507d14dfcc7-d463b5e2e9e167cc/tmp/clang_command_.tmp.64cd6c.txt
++Contents of '/tmp/coderabbit-infer/19cd30e1f2199a9d079585e788833507d14dfcc7-d463b5e2e9e167cc/tmp/clang_command_.tmp.64cd6c.txt':
"-cc1" "-load"
"/opt/infer-linux-x86_64-v1.2.0/lib/infer/infer/bin/../../facebook-clang-plugins/libtooling/build/FacebookClangPlugin.dylib"
"-add-plugin" "BiniouASTExporter" "-plugin-arg-BiniouASTExporter" "-"
"-plugin-arg-BiniouASTExporter" "PREPEND_CURRENT_DIR=1"
"-plugin-arg-BiniouASTExporter" "MAX_STRING_SIZE=65535" "-cc1" "-triple"
"x86_64-unknown-linux-gnu" "-emit

... [truncated 1112 characters] ...

l/lib/clang/18/include"
"-internal-isystem" "/usr/local/include" "-internal-isystem"
"/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/include"
"-internal-externc-isystem" "/usr/include/x86_64-linux-gnu"
"-internal-externc-isystem" "/include" "-internal-externc-isystem"
"/usr/include" "-Wno-ignored-optimization-argument" "-Wno-everything"
"-fdeprecated-macro" "-ferror-limit" "19" "-fgnuc-version=4.2.1"
"-fskip-odr-check-in-gmf" "-fcxx-exceptions" "-fexceptions"
"-D__GCC_HAVE_DWARF2_CFI_ASM=1" "-o"
"/tmp/coderabbit-infer/d463b5e2e9e167cc/file.o" "-x" "c++"
"ml_kernels/src/test_naive_ops.cpp" "-O0" "-fno-builtin" "-include"
"/opt/infer-linux-x86_64-v1.2.0/lib/infer/infer/bin/../lib/clang_wrappers/global_defines.h"
"-Wno-everything"

ml_kernels/src/kernel_bench.cpp

ml_kernels/src/kernel_bench.cpp:14:10: fatal error: 'aligned_buffer.h' file not found
14 | #include "aligned_buffer.h"
| ^~~~~~~~~~~~~~~~~~
1 error generated.
Error: the following clang command did not run successfully:
/opt/infer-linux-x86_64-v1.2.0/lib/infer/facebook-clang-plugins/clang/install/bin/clang-18
@/tmp/coderabbit-infer/19cd30e1f2199a9d079585e788833507d14dfcc7-7c4391b9a04596fa/tmp/clang_command_.tmp.97e752.txt
++Contents of '/tmp/coderabbit-infer/19cd30e1f2199a9d079585e788833507d14dfcc7-7c4391b9a04596fa/tmp/clang_command_.tmp.97e752.txt':
"-cc1" "-load"
"/opt/infer-linux-x86_64-v1.2.0/lib/infer/infer/bin/../../facebook-clang-plugins/libtooling/build/FacebookClangPlugin.dylib"
"-add-plugin" "BiniouASTExporter" "-plugin-arg-BiniouASTExporter" "-"
"-plugin-arg-BiniouASTExporter" "PREPEND_CURRENT_DIR=1"
"-plugin-arg-BiniouASTExporter" "MAX_STRING_SIZE=65535" "-cc1" "-triple"
"x86_64-unknown-linux-gnu" "-emit-obj" "-mrelax-all"

... [truncated 1089 characters] ...

all/lib/clang/18/include"
"-internal-isystem" "/usr/local/include" "-internal-isystem"
"/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/include"
"-internal-externc-isystem" "/usr/include/x86_64-linux-gnu"
"-internal-externc-isystem" "/include" "-internal-externc-isystem"
"/usr/include" "-Wno-ignored-optimization-argument" "-Wno-everything"
"-fdeprecated-macro" "-ferror-limit" "19" "-fgnuc-version=4.2.1"
"-fskip-odr-check-in-gmf" "-fcxx-exceptions" "-fexceptions"
"-D__GCC_HAVE_DWARF2_CFI_ASM=1" "-o"
"/tmp/coderabbit-infer/7c4391b9a04596fa/file.o" "-x" "c++"
"ml_kernels/src/kernel_bench.cpp" "-O0" "-fno-builtin" "-include"
"/opt/infer-linux-x86_64-v1.2.0/lib/infer/infer/bin/../lib/clang_wrappers/global_defines.h"
"-Wno-everything"

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (5)

ml_kernels/src/kernel_bench.cpp (1)
335-343: ⚡ Quick win

Apply repository brace style in the new benchmark class.

The added class/function bodies use same-line opening braces; project C/C++ style requires braces on their own lines. As per coding guidelines, "Keep braces on their own lines for function bodies".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ml_kernels/src/kernel_bench.cpp` around lines 335 - 343, The new benchmark
class SoftmaxV6Benchmark and its method bodies use same-line opening braces;
update the brace placement to match repository style by moving each opening
brace for the class and its methods (class SoftmaxV6Benchmark, name(), run())
onto their own lines so function/class bodies start with a brace on the next
line while keeping the existing method signatures and logic (including the call
to ml_kernels::softmax_v6 and the current_idx_ update) unchanged.
Source: Coding guidelines
ml_kernels/src/test_naive_ops.cpp (2)
155-155: ⚡ Quick win

Match required function-brace style in new test.

Opening brace is on the same line for the added function body; this repo style requires it on its own line. As per coding guidelines, "Keep braces on their own lines for function bodies".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ml_kernels/src/test_naive_ops.cpp` at line 155, The function declaration for
test_softmax_v6 uses a same-line opening brace; change it to the project's
function-brace style by placing the opening brace on its own line (i.e., rewrite
the declaration for test_softmax_v6 so the "{" is moved to the following line
before the function body).
Source: Coding guidelines

155-194: ⚡ Quick win

Add boundary-size coverage for v6 tail logic.

This test uses one size (80), so scalar and boundary tails are still unverified for the new 8x/4x loops. Please add small/edge sizes like {0,1,7,8,31,32,33,63,64,65} and validate against softmax_naive with the same tolerance.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ml_kernels/src/test_naive_ops.cpp` around lines 155 - 194, Update
test_softmax_v6 to iterate over boundary sizes {0,1,7,8,31,32,33,63,64,65} (in
addition to the existing large case) and for each size allocate input,
output_naive, output_v6 vectors of that size, fill input with representative
values (can reuse a prefix of the existing input or generate small deterministic
values), call ml_kernels::softmax_naive and ml_kernels::softmax_v6, then assert
element-wise closeness within 1e-4 between outputs and that the softmax v6
output sums to 1 within 1e-4; keep the existing prints/assert style and
reference the same functions test_softmax_v6, softmax_naive and softmax_v6 when
locating code to change.
ml_kernels/include/ml_kernels/softmax.h (2)
511-511: ⚡ Quick win

Use brace placement required by repo C/C++ style.

The newly added function bodies keep the opening brace on the declaration line; this file pattern requires braces on their own lines. As per coding guidelines, "Keep braces on their own lines for function bodies".

Also applies to: 541-541
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ml_kernels/include/ml_kernels/softmax.h` at line 511, The function
definitions place the opening brace on the same line as the declaration,
violating the repo C/C++ style; update exp256_ps_v3 (and the other newly added
nearby function) so the opening brace is on its own line before the function
body, preserving existing indentation and formatting for the rest of the body to
match the file's brace placement convention.
Source: Coding guidelines

511-520: ⚡ Quick win

Harden exp256_ps_v3 for positive-input safety.

exp256_ps_v3 only clamps the lower bound before exponent-bit construction. If this helper is reused with sufficiently positive inputs, n_int + 127 can exceed float exponent range and yield invalid bit patterns after the shift/cast path. Add an upper clamp (or explicitly document/assert x <= 0 precondition).
Suggested change
 inline __m256 exp256_ps_v3(__m256 x) {
     x = _mm256_max_ps(x, _mm256_set1_ps(-87.3f));
+    x = _mm256_min_ps(x, _mm256_set1_ps(88.7f));
     __m256 x_log2e = _mm256_mul_ps(x, _mm256_set1_ps(1.4426950408889634f));
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ml_kernels/include/ml_kernels/softmax.h` around lines 511 - 520, The function
exp256_ps_v3 currently clamps only the lower bound of x; to avoid overflow in
the exponent-bit path (n_int + 127 exceeding float exponent range) clamp the
upper bound as well (e.g., x = _mm256_min_ps(x, _mm256_set1_ps(87.3f))) before
computing x_log2e and n_int, or alternatively add a clear assert/documentation
that exp256_ps_v3 is only called with x <= 0; update the code around
exp256_ps_v3 where x is modified and before x_log2e/_mm256_cvtps_epi32 to ensure
safety for positive inputs (refer to symbols x, x_log2e, n_int, n).

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.jules/thunderbolt.md:
- Line 31: Update the heading date in .jules/thunderbolt.md from "## 2024-06-06
- Single FMA Range Reduction for AVX2 Softmax" to "## 2026-06-06 - Single FMA
Range Reduction for AVX2 Softmax" so the note's year matches the PR timeline;
locate the heading text exactly as written and replace the year portion only.

---

Nitpick comments:
In `@ml_kernels/include/ml_kernels/softmax.h`:
- Line 511: The function definitions place the opening brace on the same line as
the declaration, violating the repo C/C++ style; update exp256_ps_v3 (and the
other newly added nearby function) so the opening brace is on its own line
before the function body, preserving existing indentation and formatting for the
rest of the body to match the file's brace placement convention.
- Around line 511-520: The function exp256_ps_v3 currently clamps only the lower
bound of x; to avoid overflow in the exponent-bit path (n_int + 127 exceeding
float exponent range) clamp the upper bound as well (e.g., x = _mm256_min_ps(x,
_mm256_set1_ps(87.3f))) before computing x_log2e and n_int, or alternatively add
a clear assert/documentation that exp256_ps_v3 is only called with x <= 0;
update the code around exp256_ps_v3 where x is modified and before
x_log2e/_mm256_cvtps_epi32 to ensure safety for positive inputs (refer to
symbols x, x_log2e, n_int, n).

In `@ml_kernels/src/kernel_bench.cpp`:
- Around line 335-343: The new benchmark class SoftmaxV6Benchmark and its method
bodies use same-line opening braces; update the brace placement to match
repository style by moving each opening brace for the class and its methods
(class SoftmaxV6Benchmark, name(), run()) onto their own lines so function/class
bodies start with a brace on the next line while keeping the existing method
signatures and logic (including the call to ml_kernels::softmax_v6 and the
current_idx_ update) unchanged.

In `@ml_kernels/src/test_naive_ops.cpp`:
- Line 155: The function declaration for test_softmax_v6 uses a same-line
opening brace; change it to the project's function-brace style by placing the
opening brace on its own line (i.e., rewrite the declaration for test_softmax_v6
so the "{" is moved to the following line before the function body).
- Around line 155-194: Update test_softmax_v6 to iterate over boundary sizes
{0,1,7,8,31,32,33,63,64,65} (in addition to the existing large case) and for
each size allocate input, output_naive, output_v6 vectors of that size, fill
input with representative values (can reuse a prefix of the existing input or
generate small deterministic values), call ml_kernels::softmax_naive and
ml_kernels::softmax_v6, then assert element-wise closeness within 1e-4 between
outputs and that the softmax v6 output sums to 1 within 1e-4; keep the existing
prints/assert style and reference the same functions test_softmax_v6,
softmax_naive and softmax_v6 when locating code to change.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6757f15d-a260-43fc-b5e2-79f99e182d64

📥 Commits

Reviewing files that changed from the base of the PR and between acca01e and 19cd30e.

📒 Files selected for processing (4)

.jules/thunderbolt.md
ml_kernels/include/ml_kernels/softmax.h
ml_kernels/src/kernel_bench.cpp
ml_kernels/src/test_naive_ops.cpp

coderabbitai · 2026-06-06T20:20:48Z


 **Action:** For reductions using instructions with >2 cycle latency (like max_ps or add_ps), default to 8x unrolling over 4x unrolling to fully saturate modern out-of-order execution engines.
+
+## 2024-06-06 - Single FMA Range Reduction for AVX2 Softmax


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

The new note’s year looks inconsistent with this PR timeline.

The heading says 2024-06-06, but this PR was opened on June 6, 2026. If this note is intended to document this change, the year should likely be 2026 to keep chronology clear.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.jules/thunderbolt.md at line 31, Update the heading date in .jules/thunderbolt.md from "## 2024-06-06 - Single FMA Range Reduction for AVX2 Softmax" to "## 2026-06-06 - Single FMA Range Reduction for AVX2 Softmax" so the note's year matches the PR timeline; locate the heading text exactly as written and replace the year portion only.

coderabbitai Bot reviewed Jun 6, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

⚡ Thunderbolt: Softmax — Single FMA range reduction and 8x Unrolling#49

⚡ Thunderbolt: Softmax — Single FMA range reduction and 8x Unrolling#49
bugparty wants to merge 1 commit into
mainfrom
thunderbolt-softmax-fma-unroll-911386279487980053

bugparty commented Jun 6, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

google-labs-jules Bot commented Jun 6, 2026

Uh oh!

coderabbitai Bot commented Jun 6, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jun 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant


		Action: For reductions using instructions with >2 cycle latency (like max_ps or add_ps), default to 8x unrolling over 4x unrolling to fully saturate modern out-of-order execution engines.

		## 2024-06-06 - Single FMA Range Reduction for AVX2 Softmax

Conversation

bugparty commented Jun 6, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Release Notes

Uh oh!

google-labs-jules Bot commented Jun 6, 2026

Uh oh!

coderabbitai Bot commented Jun 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

bugparty commented Jun 6, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 6, 2026 •

edited

Loading