⚡ Thunderbolt: softmax_v6 — Single-FMA range reduction #52

Open
bugparty wants to merge 1 commit into main from thunderbolt/softmax-single-fma-1740750288549863835
Conversation

@bugparty (Owner) commented Jun 10, 2026

💡 What
Adds a new softmax_v6 kernel and exp256_ps_v3 helper that employs a single FMA (_mm256_fnmadd_ps(n, ln2_constant, x)) for the exponentiation range reduction step.

🎯 Why
Using split-precision constants for ln(2) (0.693145751953125f and 1.428606765330187e-06f) creates a dependent subtraction/FMA chain. A single FMA removes the second, dependent step, which lowers the instruction count and improves ILP.

🏗️ How
Replaced the split subtraction with _mm256_fnmadd_ps(n, _mm256_set1_ps(0.6931471805599453f), x). The literal rounds to the nearest float, which differs from ln(2) by about 1.9e-9, so the reduced argument picks up an error of roughly |n| × 1.9e-9. Softmax subtracts the maximum before exponentiating, so the elements with large |n| are the ones whose exponentials contribute least to the sum, and the normalized outputs stay well within typical machine learning numerical tolerances (e.g., 1e-4).
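
To make the trade-off concrete, here is a scalar sketch of the two reduction styles, with std::fma standing in for _mm256_fnmadd_ps. The function names (round_to_n, reduce_split, reduce_single, max_reduction_gap) are hypothetical and not taken from the PR; only the constants come from the description above.

```cpp
#include <cassert>
#include <cmath>

// Scalar stand-ins for the AVX2 code; all names here are illustrative.

// n = round(x / ln 2), computed as round(x * log2(e)).
inline float round_to_n(float x) {
    const float log2e = 1.4426950408889634f;
    return std::nearbyint(x * log2e);
}

// Split-constant reduction: two dependent FMAs, r = (x - n*hi) - n*lo.
inline float reduce_split(float x, float n) {
    const float ln2_hi = 0.693145751953125f;
    const float ln2_lo = 1.428606765330187e-06f;
    const float r = std::fma(-n, ln2_hi, x);  // must finish before the next FMA
    return std::fma(-n, ln2_lo, r);
}

// Single-constant reduction: one FMA, r = x - n*ln2.
inline float reduce_single(float x, float n) {
    const float ln2 = 0.6931471805599453f;  // rounds to the nearest float
    return std::fma(-n, ln2, x);
}

// Largest gap between the two reductions over [lo, 0], sampled every `step`.
inline float max_reduction_gap(float lo, float step) {
    assert(lo <= 0.0f && step > 0.0f);
    float worst = 0.0f;
    for (float x = lo; x <= 0.0f; x += step) {
        const float n = round_to_n(x);
        const float gap = std::fabs(reduce_split(x, n) - reduce_single(x, n));
        worst = std::fmax(worst, gap);
    }
    return worst;
}
```

Calling max_reduction_gap(-80.0f, 0.37f) sweeps inputs of the kind softmax sees after subtracting the maximum. Given the rounding error of the single constant, the gap should stay below 1e-6 over that range.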

📊 Impact
Microbenchmark results (ml_kernel_bench) show throughput rising from ~4.9 GFLOP/s (softmax_v5) to ~5.4 GFLOP/s (softmax_v6), roughly a 10% gain, in Fixed Memory mode (N=16384). Accuracy stays within the 1e-4 tolerance relative to the scalar reference.

🖥️ Tested on
Tested on x86_64 Linux with GCC 13.3.0 using AVX2 and FMA intrinsics (Haswell+).

🔬 How to reproduce
DISABLE_CPU_BINDING=1 ./build/ml_kernels/ml_kernel_bench --filter 'softmax_v[56]'


PR created automatically by Jules for task 1740750288549863835 started by @bugparty

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced a new softmax kernel variant implementation
  • Tests

    • Added validation tests for the new softmax implementation, verifying numerical accuracy and output normalization
  • Chores

    • Extended benchmarking suite with performance profiling for the new kernel
    • Updated optimization documentation with implementation details

Co-authored-by: bugparty <1510776+bugparty@users.noreply.github.com>
@google-labs-jules (Contributor)

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!

New to Jules? Learn more at jules.google/docs.


For security, I will only act on instructions from the user who triggered this task.

@coderabbitai

coderabbitai Bot commented Jun 10, 2026

📝 Walkthrough

This PR introduces a new AVX2 softmax kernel (softmax_v6) featuring an optimized exponential function that combines range-reduction terms into a single FMA operation. The change includes the core optimization, integration into the softmax kernel, correctness validation against the naive reference, performance benchmarking, and documentation of the approach.
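
The three stages of such an exponential (range reduction, polynomial, scaling by 2^n) can be sketched in scalar form as follows. This is an illustration under stated assumptions, not the PR's exp256_ps_v3: the name exp_sketch is hypothetical, the polynomial uses plain Taylor coefficients because the PR's coefficients are not shown here, and the input is restricted to [-87, 0] so that 2^n stays a normal float.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Scalar sketch of a range-reduced exp; coefficients are an assumption.
inline float exp_sketch(float x) {
    assert(x <= 0.0f && x >= -87.0f);  // keeps 2^n a normal float

    // 1. Range reduction with a single FMA: r = x - n*ln2, |r| <= ~0.35.
    const float n = std::nearbyint(x * 1.4426950408889634f);
    const float r = std::fma(-n, 0.6931471805599453f, x);

    // 2. Horner evaluation of 1 + r + r^2/2! + ... + r^6/6!.
    float p = 1.0f / 720.0f;
    p = std::fma(p, r, 1.0f / 120.0f);
    p = std::fma(p, r, 1.0f / 24.0f);
    p = std::fma(p, r, 1.0f / 6.0f);
    p = std::fma(p, r, 0.5f);
    p = std::fma(p, r, 1.0f);
    p = std::fma(p, r, 1.0f);

    // 3. Build 2^n by writing the biased exponent n + 127 into bits 23..30.
    const std::uint32_t bits =
        static_cast<std::uint32_t>(static_cast<std::int32_t>(n) + 127) << 23;
    float two_n;
    std::memcpy(&two_n, &bits, sizeof two_n);
    return p * two_n;
}

// Worst relative error against std::exp (in double) over [lo, 0].
inline double max_rel_error(float lo, float step) {
    double worst = 0.0;
    for (float x = lo; x <= 0.0f; x += step) {
        const double ref = std::exp(static_cast<double>(x));
        worst = std::fmax(worst, std::fabs(exp_sketch(x) - ref) / ref);
    }
    return worst;
}
```

With a degree-6 Taylor polynomial the truncation error on the reduced interval is on the order of 1e-7, so max_rel_error(-80.0f, 0.25f) should come out far below the 1e-4 tolerance used in the PR's tests.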

Changes

Softmax V6 Optimization

| Layer | File(s) | Summary |
| --- | --- | --- |
| Optimized exp256_ps_v3 with single-FMA range reduction | ml_kernels/include/ml_kernels/softmax.h | exp256_ps_v3 performs AVX2 exponential computation by combining the range-reduction scaling r = x - n*ln2 into a single FMA operation, then evaluates a polynomial using Horner's scheme and reconstructs the result via exp2n bit manipulation. |
| Softmax_v6 kernel integration | ml_kernels/include/ml_kernels/softmax.h | softmax_v6 applies exp256_ps_v3 in the softmax pipeline: computes the maximum via reduce_max, exponentiates with range-reduced values while accumulating partial sums across 4 streams, reduces via reduce_sum, and normalizes by the inverse sum for both vectorized and scalar tail portions. |
| Tests, benchmarks, and optimization documentation | ml_kernels/src/test_naive_ops.cpp, ml_kernels/src/kernel_bench.cpp, .jules/thunderbolt.md | test_softmax_v6() validates element-wise closeness within 1e-4 against softmax_naive and verifies probability-sum invariant. SoftmaxV6Benchmark harness measures throughput. Optimization log entry documents the single-FMA approach, benchmark evidence of improvement, error-bound verification, and usage guidance. |
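
The pipeline described for softmax_v6 (max, exponentiate, accumulate into four partial sums, normalize by the inverse sum) can be sketched in scalar C++ as below. The name softmax_sketch is hypothetical, std::exp stands in for exp256_ps_v3, and four scalar accumulators stand in for the four vector streams, so this shows the structure rather than the actual kernel.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Scalar sketch of the max / exp / sum / normalize pipeline.
inline std::vector<float> softmax_sketch(const std::vector<float>& in) {
    assert(!in.empty());
    const std::size_t n = in.size();

    // Pass 1: find the maximum so every exponent argument is <= 0.
    float max_val = in[0];
    for (float v : in) max_val = std::fmax(max_val, v);

    // Pass 2: exponentiate and accumulate into four independent sums,
    // which keeps the additions from forming one long dependency chain.
    std::vector<float> out(n);
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    std::size_t i = 0;
    for (; i + 3 < n; i += 4) {
        out[i]     = std::exp(in[i]     - max_val); s0 += out[i];
        out[i + 1] = std::exp(in[i + 1] - max_val); s1 += out[i + 1];
        out[i + 2] = std::exp(in[i + 2] - max_val); s2 += out[i + 2];
        out[i + 3] = std::exp(in[i + 3] - max_val); s3 += out[i + 3];
    }
    for (; i < n; ++i) {  // scalar tail
        out[i] = std::exp(in[i] - max_val);
        s0 += out[i];
    }

    // Pass 3: reduce the partial sums and scale by the inverse.
    const float inv_sum = 1.0f / ((s0 + s1) + (s2 + s3));
    for (float& v : out) v *= inv_sum;
    return out;
}
```

A check in the spirit of test_softmax_v6() would compare each output against a naive reference within 1e-4 and confirm that the outputs sum to approximately 1.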

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • bugparty/cpu_math_kernels_pri#31: Prior softmax variant (softmax_v5) using a similar range-reduction + Horner approach with an exp256_ps_* optimization pattern, providing a precedent and baseline for comparison with this newer single-FMA variant.

Poem

🐰 A whisker-twitch of optimized precision,
Single-FMA softmax makes our decision!
Range-reduced, Horner-spun, vectors aligned—
Exponentials dance where throughput's designed. ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 30.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: introducing softmax_v6 with single-FMA range reduction, which is the primary innovation across all modified files. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@ml_kernels/include/ml_kernels/softmax.h`:
- Line 560: Inside the ml_kernels namespace remove the redundant ml_kernels::
qualification when calling reduce_max and reduce_sum in the softmax
implementation (e.g., change occurrences like float max_val =
ml_kernels::reduce_max(max0) and similar reduce_sum calls to use reduce_max and
reduce_sum unqualified) so they match the other softmax variants (such as
softmax_v5) for consistency.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 29a0a541-68b3-4d8d-9185-97793e30da80

📥 Commits

Reviewing files that changed from the base of the PR and between acca01e and 68e7fd5.

📒 Files selected for processing (4)
  • .jules/thunderbolt.md
  • ml_kernels/include/ml_kernels/softmax.h
  • ml_kernels/src/kernel_bench.cpp
  • ml_kernels/src/test_naive_ops.cpp

for (; i + 7 < n; i += 8) {
    max0 = _mm256_max_ps(max0, _mm256_loadu_ps(input + i));
}
float max_val = ml_kernels::reduce_max(max0);

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Remove unnecessary namespace qualification for consistency.

Lines 560 and 604 explicitly qualify ml_kernels::reduce_max and ml_kernels::reduce_sum, but the code is already inside the ml_kernels namespace (declared at line 9). All other softmax variants in this file (v2-v5) call these functions without qualification (e.g., line 423 in softmax_v5).

🔧 Proposed fix
-    float max_val = ml_kernels::reduce_max(max0);
+    float max_val = reduce_max(max0);
-    float sum_val = ml_kernels::reduce_sum(sum0);
+    float sum_val = reduce_sum(sum0);

Also applies to: 604-604
