⚡ Thunderbolt: Softmax — single FMA range reduction#55

Open
bugparty wants to merge 1 commit into main from thunderbolt-softmax-v6-fma-9919791229991724126
Conversation

@bugparty bugparty commented Jun 16, 2026

Owner

💡 What: Added softmax_v6 together with a new vectorized exponential helper, exp256_ps_v3, which replaces the two-step split-constant ln(2) range reduction with a single _mm256_fnmadd_ps instruction.

🎯 Why: The previous exp256_ps_v2 implementation split ln(2) into two separate constants (0.693145751953125f and 1.428606765330187e-06f) to preserve precision during range reduction. Softmax, however, is shift-invariant and subtracts the input maximum (found with _mm256_max_ps) before exponentiating, so every argument passed to the exponential is non-positive. In that range the split adds instructions and port pressure without a meaningful accuracy gain at the required tolerance threshold (1e-4).
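A rough error estimate may help make this argument concrete. It is a sketch under stated assumptions, not a result from the PR: c is the float32 value nearest ln(2), the reduction is a single fused operation (one rounding), and |n| ≤ 128 is an assumed bound implied by clamping to the float32 exponent range.

```latex
\begin{aligned}
c &= \operatorname{fl}(\ln 2) \approx 0.693147182, \qquad |c - \ln 2| \approx 1.9 \times 10^{-9},\\
r &= x - n \ln 2, \qquad \hat r = \operatorname{fl}(x - n c),\\
|\hat r - r| &\le |n|\,|c - \ln 2| + 2^{-24}\,|x - n c|
  \;\lesssim\; 128 \cdot 1.9 \times 10^{-9} + 2^{-24} \cdot 0.35 \approx 2.6 \times 10^{-7},\\
\left|\frac{e^{\hat r} - e^{r}}{e^{r}}\right| &\approx |\hat r - r| \;\ll\; 10^{-4}.
\end{aligned}
```

Under these assumptions the single-constant reduction perturbs each exponential by a relative error on the order of 1e-7, well inside the 1e-4 tolerance used by the tests.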

🏗️ How: Merged the two steps r = x - n * 0.693145... and r = r - n * 1.428... into a single r = _mm256_fnmadd_ps(n, _mm256_set1_ps(0.6931471805599453f), x), which computes x - n * ln(2) with one rounding. softmax_v6 reuses the existing 4x unrolled loop and in-register Horner polynomial evaluation from softmax_v5.
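For readers without the diff at hand, the two reduction styles can be compared with a scalar sketch. This is not the code from softmax.h: std::fma stands in for _mm256_fnmadd_ps, and the names reduce_split, reduce_single, and max_reduction_gap are hypothetical helpers introduced only for illustration. The constants are the ones quoted in this description.

```cpp
#include <cassert>
#include <cmath>

// Two-step (split-constant) reduction, as described for exp256_ps_v2.
// Scalar stand-in: each std::fma mirrors one vector FNMADD.
inline float reduce_split(float x, float n)
{
    float r = std::fma(-n, 0.693145751953125f, x);   // r = x - n * ln2_hi
    return std::fma(-n, 1.428606765330187e-06f, r);  // r = r - n * ln2_lo
}

// Single-step reduction, as described for exp256_ps_v3.
inline float reduce_single(float x, float n)
{
    return std::fma(-n, 0.6931471805599453f, x);     // r = x - n * ln2
}

// Largest absolute difference between the two reductions over [lo, hi].
// The sampling grid is an assumption made for this sketch.
inline float max_reduction_gap(float lo, float hi, float step)
{
    assert(step > 0.0f);
    float worst = 0.0f;
    for (float x = lo; x <= hi; x += step) {
        // n = round(x * log2(e)), the integer part of the exponent.
        float n = std::nearbyint(x * 1.4426950408889634f);
        float gap = std::fabs(reduce_split(x, n) - reduce_single(x, n));
        if (gap > worst) {
            worst = gap;
        }
    }
    return worst;
}
```

Calling max_reduction_gap(-20.0f, 0.0f, 0.01f) gives a quick feel for how far the two reductions drift apart on non-positive inputs, which is the range softmax feeds to the exponential after subtracting the maximum.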

📊 Impact: ~5-10% higher throughput across multiple matrix sizes; for example, from 4.42 GFLOP/s to 4.78 GFLOP/s (about 8%) at N=262144 on Fixed Memory workloads, while still meeting the 1e-4 accuracy tolerance.

🖥️ Tested on: AVX2-capable host (x86_64, GCC 13.3)

🔬 How to reproduce:

mkdir -p build && cd build && cmake -DBUILD_ML_KERNELS=ON .. && make -j$(nproc)
DISABLE_CPU_BINDING=1 ./ml_kernels/ml_kernel_bench -f 'softmax_v[56]' --iters 5000

PR created automatically by Jules for task 9919791229991724126 started by @bugparty

Summary by CodeRabbit

  • New Features

    • Optimized softmax implementation added with improved vectorized computation
  • Tests

    • Added comprehensive test coverage for new softmax variant
  • Chores

    • Added performance benchmark for new implementation

Replaced split ln(2) precision constants with a combined single-FMA step in `exp256_ps_v3`.

Co-authored-by: bugparty <1510776+bugparty@users.noreply.github.com>
@google-labs-jules

Contributor

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!

New to Jules? Learn more at jules.google/docs.


For security, I will only act on instructions from the user who triggered this task.

coderabbitai Bot commented Jun 16, 2026


📝 Walkthrough

Walkthrough

Adds exp256_ps_v3, an AVX2 vectorized exponential that uses a single combined ln(2) constant in an FNMADD range-reduction step, and softmax_v6 built on it with the standard three-phase (max-reduce, exp+sum, normalize) AVX2 flow. A benchmark class and a correctness test are registered alongside a developer note documenting the constant-combination rationale.
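The three-phase flow named in the walkthrough can be sketched in scalar form. This is an illustrative reference only: softmax_three_phase is a hypothetical name, std::exp replaces the vectorized helper, and none of the 32/8-wide AVX2 unrolling from the PR is reproduced.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>

// Scalar sketch of the max-reduce, exp+sum, normalize phases.
inline void softmax_three_phase(const float *input, float *output, std::size_t n)
{
    assert(n > 0);

    // Phase 1: max-reduce. Subtracting the maximum keeps every exp argument <= 0.
    float max_val = input[0];
    for (std::size_t i = 1; i < n; ++i) {
        max_val = std::max(max_val, input[i]);
    }

    // Phase 2: exponentiate the shifted inputs and accumulate their sum.
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        output[i] = std::exp(input[i] - max_val);
        sum += output[i];
    }

    // Phase 3: normalize so the outputs sum to 1.
    const float inv_sum = 1.0f / sum;
    for (std::size_t i = 0; i < n; ++i) {
        output[i] *= inv_sum;
    }
}
```

Because the maximum is subtracted first, adding a constant to every input leaves the output unchanged, which is the shift invariance the PR description relies on.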

Changes

softmax_v6 kernel, benchmark, and tests

| Layer | File(s) | Summary |
|---|---|---|
| exp256_ps_v3 helper and softmax_v6 kernel | ml_kernels/include/ml_kernels/softmax.h | exp256_ps_v3 clamps inputs, scales via integer exponent reconstruction, applies FNMADD with a single combined ln(2) constant, and evaluates the polynomial with Horner-style FMA. softmax_v6 wraps it in 32/8-wide AVX2 max-reduction, exp+sum, and normalization phases with scalar remainders. |
| Benchmark registration and correctness test | ml_kernels/src/kernel_bench.cpp, ml_kernels/src/test_naive_ops.cpp | SoftmaxV6Benchmark calls softmax_v6 and is added to the benchmark registry. test_softmax_v6 compares output against softmax_naive element-wise at 1e-4 and checks the sum is ~1.0; main() is updated to call it. |
| Design note | .jules/thunderbolt.md | Adds a dated entry recording the single-FMA constant-combination approach, benchmark evidence, and an action guideline. |
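The steps listed for exp256_ps_v3 (clamp, single-constant reduction, Horner evaluation, exponent reconstruction) can be pictured with a one-lane scalar sketch. Several details here are assumptions rather than the PR's code: the name exp_sketch, the clamp bounds of -87 and 88, and the degree-5 Taylor coefficients, since the actual polynomial in softmax.h is not shown on this page.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstring>

// Scalar sketch of a range-reduced exp; one lane of what a vector helper would do.
inline float exp_sketch(float x)
{
    // Clamp so that 2^n stays a normal float32 (bounds are assumed, not from the PR).
    x = std::min(std::max(x, -87.0f), 88.0f);

    // n = round(x * log2(e)); r = x - n * ln(2) via a single fused multiply-add.
    const float n = std::nearbyint(x * 1.4426950408889634f);
    const float r = std::fma(-n, 0.6931471805599453f, x);

    // Horner-style evaluation of a degree-5 Taylor polynomial for e^r
    // (illustrative coefficients; the PR may use a tuned minimax set).
    float p = 1.0f / 120.0f;
    p = std::fma(p, r, 1.0f / 24.0f);
    p = std::fma(p, r, 1.0f / 6.0f);
    p = std::fma(p, r, 0.5f);
    p = std::fma(p, r, 1.0f);
    p = std::fma(p, r, 1.0f);

    // Reconstruct 2^n by writing the biased exponent into the float32 bit pattern.
    const std::int32_t bits = (static_cast<std::int32_t>(n) + 127) << 23;
    float scale;
    std::memcpy(&scale, &bits, sizeof scale);

    return p * scale;
}
```

For non-positive inputs, which is what softmax supplies after the max subtraction, this sketch stays within the 1e-4 absolute tolerance of std::exp even with plain Taylor coefficients; a tuned polynomial would only tighten that.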

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • bugparty/cpu_math_kernels_pri#31: Directly precedes this PR — adds exp256_ps_v2/softmax_v5 using the same file and pattern of AVX2 range-reduction helpers wired into a new softmax variant.
  • bugparty/cpu_math_kernels_pri#28: Adds exp256_ps_estrin/softmax_v4 to the same header using the same AVX2 helper-plus-softmax extension pattern.

Poem

🐇 Hop! One constant where two used to be,
FNMADD folds the ln(2) with glee.
Eight floats wide, then thirty-two in a row,
Softmax v6 puts on quite a show.
The benchmark cheers, the tests all pass —
This rabbit's math is unsurpassed! 🎉

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 27.27% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title accurately summarizes the main change: introducing softmax_v6 with optimized single FMA-based range reduction for the exponential computation, which is the core innovation across all modified files. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (3)
ml_kernels/src/test_naive_ops.cpp (1)

156-156: ⚡ Quick win

Use the repository brace style for the new test function definition.

test_softmax_v6 currently places { on the same line as the signature.

As per coding guidelines, **/*.{c,cpp,cc,h,hpp} requires: “Keep braces on their own lines for function bodies.”

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ml_kernels/src/test_naive_ops.cpp` at line 156, The function
`test_softmax_v6` currently has its opening brace on the same line as the
function signature, which violates the repository's C++ brace style guidelines.
Move the opening brace to its own line by placing it on a new line immediately
after the function signature `void test_softmax_v6()` to conform to the
requirement that braces for function bodies must be on their own lines.

Source: Coding guidelines

ml_kernels/src/kernel_bench.cpp (1)

335-344: ⚡ Quick win

Apply the project brace style in the new benchmark class methods.

SoftmaxV6Benchmark methods use inline opening braces on signature lines; this is inconsistent with the C/C++ formatting rule used by the repo.

As per coding guidelines, **/*.{c,cpp,cc,h,hpp} requires: “Keep braces on their own lines for function bodies.”

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ml_kernels/src/kernel_bench.cpp` around lines 335 - 344, The
SoftmaxV6Benchmark class methods violate the project's C/C++ brace style
guideline which requires opening braces to be on their own lines for function
bodies. In the name() method override, the opening brace is inline with the
function signature on the same line as the return statement. In the run() method
override, the opening brace is also on the same line as the function signature.
Reformat both methods by moving the opening braces to their own lines while
keeping the function body content properly indented, which will also require
expanding the single-line name() method implementation across multiple lines.

Source: Coding guidelines

ml_kernels/include/ml_kernels/softmax.h (1)

505-505: ⚡ Quick win

Move function-body braces to their own lines to match repository style.

Both newly added function definitions keep { on the signature line, which conflicts with the project’s C/C++ brace rule.

Style-only patch sketch
-inline __m256 exp256_ps_v3(__m256 x) {
+inline __m256 exp256_ps_v3(__m256 x)
+{
...
-inline void softmax_v6(const float *input, float *output, std::size_t n) {
+inline void softmax_v6(const float *input, float *output, std::size_t n)
+{

As per coding guidelines, **/*.{c,cpp,cc,h,hpp} requires: “Keep braces on their own lines for function bodies.”

Also applies to: 542-542

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ml_kernels/include/ml_kernels/softmax.h` at line 505, The opening brace for
the function definition inline __m256 exp256_ps_v3(__m256 x) is placed on the
same line as the function signature, which violates the repository's C/C++ style
guidelines requiring braces to be on their own lines for function bodies. Move
the opening brace to its own line immediately after the function signature. This
same style correction applies to another function definition at line 542 in the
same file—ensure its opening brace is also moved to its own line following the
same pattern.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@ml_kernels/include/ml_kernels/softmax.h`:
- Line 505: The opening brace for the function definition inline __m256
exp256_ps_v3(__m256 x) is placed on the same line as the function signature,
which violates the repository's C/C++ style guidelines requiring braces to be on
their own lines for function bodies. Move the opening brace to its own line
immediately after the function signature. This same style correction applies to
another function definition at line 542 in the same file—ensure its opening
brace is also moved to its own line following the same pattern.

In `@ml_kernels/src/kernel_bench.cpp`:
- Around line 335-344: The SoftmaxV6Benchmark class methods violate the
project's C/C++ brace style guideline which requires opening braces to be on
their own lines for function bodies. In the name() method override, the opening
brace is inline with the function signature on the same line as the return
statement. In the run() method override, the opening brace is also on the same
line as the function signature. Reformat both methods by moving the opening
braces to their own lines while keeping the function body content properly
indented, which will also require expanding the single-line name() method
implementation across multiple lines.

In `@ml_kernels/src/test_naive_ops.cpp`:
- Line 156: The function `test_softmax_v6` currently has its opening brace on
the same line as the function signature, which violates the repository's C++
brace style guidelines. Move the opening brace to its own line by placing it on
a new line immediately after the function signature `void test_softmax_v6()` to
conform to the requirement that braces for function bodies must be on their own
lines.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8f204c77-dd2c-4d6a-89ef-6d5615420b88

📥 Commits

Reviewing files that changed from the base of the PR and between acca01e and 67cad47.

📒 Files selected for processing (4)
  • .jules/thunderbolt.md
  • ml_kernels/include/ml_kernels/softmax.h
  • ml_kernels/src/kernel_bench.cpp
  • ml_kernels/src/test_naive_ops.cpp
