⚡ Thunderbolt: Softmax — single FMA range reduction by bugparty · Pull Request #55 · bugparty/cpu_math_kernels_pri

bugparty · 2026-06-16T20:21:55Z

💡 What: Added softmax_v6 together with a new exponential helper, exp256_ps_v3, which replaces the two-step split-constant ln(2) range reduction with a single _mm256_fnmadd_ps instruction.

🎯 Why: The previous exp256_ps_v2 implementation split ln(2) into two constants (0.693145751953125f and 1.428606765330187e-06f) to preserve precision during range reduction. Because softmax is shift-invariant, the kernel subtracts the maximum (computed with _mm256_max_ps) before exponentiating, so the inputs to the exponential are non-positive and bounded. In that regime the split adds instructions and port pressure without a meaningful accuracy gain at the required tolerance threshold (1e-4).

🏗️ How: Merged the components r = x - n * 0.693145... and r = r - n * 1.428... into a single r = _mm256_fnmadd_ps(n, _mm256_set1_ps(0.6931471805599453f), x). The reduction was placed inside softmax_v6 using the existing 4x unrolled loop and in-register Horner polynomial configuration from softmax_v5.
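
A scalar sketch of the two reductions may help show why the merge is safe at this tolerance. The helper names (reduce_split, reduce_single, exponent_of, max_reduction_gap) are hypothetical and not taken from the PR; only the constants come from the description above. The sketch sweeps an assumed non-positive input range and reports the largest difference between the two remainders.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Two-step reduction as described for exp256_ps_v2: r = (x - n*hi) - n*lo.
inline float reduce_split(float x, float n)
{
    const float ln2_hi = 0.693145751953125f;
    const float ln2_lo = 1.428606765330187e-06f;
    const float r = std::fma(-n, ln2_hi, x);
    return std::fma(-n, ln2_lo, r);
}

// Single-step reduction as described for exp256_ps_v3: r = x - n*ln2.
inline float reduce_single(float x, float n)
{
    return std::fma(-n, 0.6931471805599453f, x);
}

// n = round(x / ln 2), computed as round(x * log2(e)).
inline float exponent_of(float x)
{
    return std::nearbyint(x * 1.4426950408889634f);
}

// Largest |r_split - r_single| over `steps` evenly spaced samples in [lo, hi].
inline float max_reduction_gap(float lo, float hi, int steps)
{
    assert(steps > 0);
    float worst = 0.0f;
    for (int i = 0; i <= steps; ++i) {
        const float t = static_cast<float>(i) / static_cast<float>(steps);
        const float x = lo + (hi - lo) * t;
        const float n = exponent_of(x);
        const float gap = std::fabs(reduce_split(x, n) - reduce_single(x, n));
        worst = std::max(worst, gap);
    }
    return worst;
}
```

Calling max_reduction_gap(-87.0f, 0.0f, 10000) should give a gap on the order of 1e-7, well under the 1e-4 tolerance, though the exact figure depends on the sampled inputs.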

📊 Impact: ~5-10% throughput improvement across multiple matrix sizes, specifically increasing from 4.42 GFLOP/s to 4.78 GFLOP/s at N=262144 on Fixed Memory workloads, maintaining strict 1e-4 accuracy tolerances.

🖥️ Tested on: AVX2-capable host (x86_64, GCC 13.3)

🔬 How to reproduce:

mkdir -p build && cd build && cmake -DBUILD_ML_KERNELS=ON .. && make -j$(nproc)
DISABLE_CPU_BINDING=1 ./ml_kernels/ml_kernel_bench -f 'softmax_v[56]' --iters 5000

PR created automatically by Jules for task 9919791229991724126 started by @bugparty

Summary by CodeRabbit

New Features
- Optimized softmax implementation added with improved vectorized computation
Tests
- Added comprehensive test coverage for new softmax variant
Chores
- Added performance benchmark for new implementation

Replaced split ln(2) precision constants with a combined single-FMA step in `exp256_ps_v3`. Co-authored-by: bugparty <1510776+bugparty@users.noreply.github.com>

google-labs-jules · 2026-06-16T20:21:56Z

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!

New to Jules? Learn more at jules.google/docs.

For security, I will only act on instructions from the user who triggered this task.

coderabbitai · 2026-06-16T20:22:11Z

📝 Walkthrough


Adds exp256_ps_v3, an AVX2 vectorized exponential that uses a single combined ln(2) constant in an FNMADD range-reduction step, and softmax_v6 built on it with the standard three-phase (max-reduce, exp+sum, normalize) AVX2 flow. A benchmark class and a correctness test are registered alongside a developer note documenting the constant-combination rationale.
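
The three-phase flow can be sketched in scalar C++ as follows. This is an illustration under stated assumptions, not the PR's code: exp_single_constant and softmax_sketch are hypothetical names, the clamp bounds are assumed, and the degree-6 Taylor polynomial stands in for the actual coefficients, which are not shown on this page. Only the function signature and the single ln(2) constant follow the PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>

// Scalar stand-in for the vector helper: exp(x) with one-constant range reduction.
inline float exp_single_constant(float x)
{
    // Assumed clamp bounds, chosen so that 2^n stays a normal float.
    x = std::min(std::max(x, -87.0f), 88.0f);
    const float n = std::nearbyint(x * 1.4426950408889634f); // n = round(x / ln 2)
    const float r = std::fma(-n, 0.6931471805599453f, x);    // r = x - n * ln 2
    // Horner evaluation of 1 + r + r^2/2! + ... + r^6/6! (assumed polynomial).
    float p = 1.0f / 720.0f;
    p = std::fma(p, r, 1.0f / 120.0f);
    p = std::fma(p, r, 1.0f / 24.0f);
    p = std::fma(p, r, 1.0f / 6.0f);
    p = std::fma(p, r, 0.5f);
    p = std::fma(p, r, 1.0f);
    p = std::fma(p, r, 1.0f);
    return std::ldexp(p, static_cast<int>(n)); // scale by 2^n
}

inline void softmax_sketch(const float *input, float *output, std::size_t n)
{
    if (n == 0) {
        return;
    }
    assert(input != nullptr && output != nullptr);

    // Phase 1: max-reduce.
    float max_val = input[0];
    for (std::size_t i = 1; i < n; ++i) {
        max_val = std::max(max_val, input[i]);
    }

    // Phase 2: exponentiate the shifted inputs and accumulate their sum.
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        output[i] = exp_single_constant(input[i] - max_val);
        sum += output[i];
    }

    // Phase 3: normalize.
    const float inv_sum = 1.0f / sum;
    for (std::size_t i = 0; i < n; ++i) {
        output[i] *= inv_sum;
    }
}
```

Subtracting the maximum first means the largest shifted input is exactly zero, so the sum is at least 1 and the division in the last phase is safe.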

Changes

softmax_v6 kernel, benchmark, and tests

| Layer / File(s) | Summary |
| --- | --- |
| `exp256_ps_v3` helper and `softmax_v6` kernel: `ml_kernels/include/ml_kernels/softmax.h` | `exp256_ps_v3` clamps inputs, scales via integer exponent reconstruction, applies FNMADD with a single combined `ln(2)` constant, and evaluates the polynomial with Horner-style FMA. `softmax_v6` wraps it in 32/8-wide AVX2 max-reduction, exp+sum, and normalization phases with scalar remainders. |
| Benchmark registration and correctness test: `ml_kernels/src/kernel_bench.cpp`, `ml_kernels/src/test_naive_ops.cpp` | `SoftmaxV6Benchmark` calls `softmax_v6` and is added to the benchmark registry. `test_softmax_v6` compares output against `softmax_naive` element-wise at `1e-4` and checks the sum is ~`1.0`; `main()` is updated to call it. |
| Design note: `.jules/thunderbolt.md` | Adds a dated entry recording the single-FMA constant-combination approach, benchmark evidence, and an action guideline. |
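
The two acceptance criteria named for `test_softmax_v6` (element-wise agreement at `1e-4` and a sum of about `1.0`) could be expressed with helpers along these lines. Both function names are hypothetical and the real test file may be structured differently; this is only a sketch of the checks as the summary describes them.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// True when every element of `actual` is within `tol` of `expected`.
// The negated comparison also rejects NaN values.
inline bool all_close(const float *actual, const float *expected, std::size_t n, float tol)
{
    assert(actual != nullptr && expected != nullptr);
    for (std::size_t i = 0; i < n; ++i) {
        if (!(std::fabs(actual[i] - expected[i]) <= tol)) {
            return false;
        }
    }
    return true;
}

// True when the probabilities add up to 1 within `tol`.
// The sum is accumulated in double to limit rounding drift on long rows.
inline bool sums_to_one(const float *probs, std::size_t n, float tol)
{
    assert(probs != nullptr);
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        sum += probs[i];
    }
    return std::fabs(sum - 1.0) <= tol;
}
```

A test would then run the optimized kernel and the naive reference on the same input and require both all_close(out, ref, n, 1e-4f) and sums_to_one(out, n, 1e-4f).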

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

bugparty/cpu_math_kernels_pri#31: Directly precedes this PR — adds exp256_ps_v2/softmax_v5 using the same file and pattern of AVX2 range-reduction helpers wired into a new softmax variant.
bugparty/cpu_math_kernels_pri#28: Adds exp256_ps_estrin/softmax_v4 to the same header using the same AVX2 helper-plus-softmax extension pattern.

Poem

🐇 Hop! One constant where two used to be,
FNMADD folds the ln(2) with glee.
Eight floats wide, then thirty-two in a row,
Softmax v6 puts on quite a show.
The benchmark cheers, the tests all pass —
This rabbit's math is unsurpassed! 🎉

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 27.27% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title accurately summarizes the main change: introducing softmax_v6 with optimized single FMA-based range reduction for the exponential computation, which is the core innovation across all modified files. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


coderabbitai

🧹 Nitpick comments (3)

ml_kernels/src/test_naive_ops.cpp (1)
156-156: ⚡ Quick win

Use the repository brace style for the new test function definition.

test_softmax_v6 currently places { on the same line as the signature.

As per coding guidelines, **/*.{c,cpp,cc,h,hpp} requires: “Keep braces on their own lines for function bodies.”
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ml_kernels/src/test_naive_ops.cpp` at line 156, The function
`test_softmax_v6` currently has its opening brace on the same line as the
function signature, which violates the repository's C++ brace style guidelines.
Move the opening brace to its own line by placing it on a new line immediately
after the function signature `void test_softmax_v6()` to conform to the
requirement that braces for function bodies must be on their own lines.
Source: Coding guidelines

ml_kernels/src/kernel_bench.cpp (1)
335-344: ⚡ Quick win

Apply the project brace style in the new benchmark class methods.

SoftmaxV6Benchmark methods use inline opening braces on signature lines; this is inconsistent with the C/C++ formatting rule used by the repo.

As per coding guidelines, **/*.{c,cpp,cc,h,hpp} requires: “Keep braces on their own lines for function bodies.”
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ml_kernels/src/kernel_bench.cpp` around lines 335 - 344, The
SoftmaxV6Benchmark class methods violate the project's C/C++ brace style
guideline which requires opening braces to be on their own lines for function
bodies. In the name() method override, the opening brace is inline with the
function signature on the same line as the return statement. In the run() method
override, the opening brace is also on the same line as the function signature.
Reformat both methods by moving the opening braces to their own lines while
keeping the function body content properly indented, which will also require
expanding the single-line name() method implementation across multiple lines.
Source: Coding guidelines

ml_kernels/include/ml_kernels/softmax.h (1)
505-505: ⚡ Quick win

Move function-body braces to their own lines to match repository style.

Both newly added function definitions keep { on the signature line, which conflicts with the project’s C/C++ brace rule.
Style-only patch sketch
-inline __m256 exp256_ps_v3(__m256 x) {
+inline __m256 exp256_ps_v3(__m256 x)
+{
...
-inline void softmax_v6(const float *input, float *output, std::size_t n) {
+inline void softmax_v6(const float *input, float *output, std::size_t n)
+{
As per coding guidelines, **/*.{c,cpp,cc,h,hpp} requires: “Keep braces on their own lines for function bodies.”

Also applies to: 542-542
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ml_kernels/include/ml_kernels/softmax.h` at line 505, The opening brace for
the function definition inline __m256 exp256_ps_v3(__m256 x) is placed on the
same line as the function signature, which violates the repository's C/C++ style
guidelines requiring braces to be on their own lines for function bodies. Move
the opening brace to its own line immediately after the function signature. This
same style correction applies to another function definition at line 542 in the
same file—ensure its opening brace is also moved to its own line following the
same pattern.
Source: Coding guidelines


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8f204c77-dd2c-4d6a-89ef-6d5615420b88

📥 Commits

Reviewing files that changed from the base of the PR and between acca01e and 67cad47.

📒 Files selected for processing (4)

.jules/thunderbolt.md
ml_kernels/include/ml_kernels/softmax.h
ml_kernels/src/kernel_bench.cpp
ml_kernels/src/test_naive_ops.cpp

⚡ Thunderbolt: Softmax — single FMA range reduction

67cad47


bugparty wants to merge 1 commit into main from thunderbolt-softmax-v6-fma-9919791229991724126
