⚡ Thunderbolt: softmax_v6 — 8x unrolled phases with single-FMA exp256#53

Open
bugparty wants to merge 1 commit into main from thunderbolt/softmax_v6-14084203667146761558

@bugparty bugparty commented Jun 13, 2026

Owner

💡 What:
Added a new AVX2 kernel, softmax_v6, which improves on the prior versions by:

  1. Replacing the two-FMA range reduction x - n*ln2 with a single FMA, relying on softmax's shift invariance to absorb the small reduction error.
  2. Mixing unroll factors: the simple max and normalize phases are unrolled 8x so that independent chains cover execution-unit latency (e.g. the 4-cycle latency of max_ps), while the register-heavy exp phase stays at 4x to avoid YMM register spills.
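The single-FMA reduction in item 1 can be illustrated with a scalar sketch. This is not the PR's exp256_ps_v3: the names `exp_single_fma` and `softmax_sketch` are hypothetical, `std::exp` stands in for the kernel's polynomial, and the constants are the usual float roundings of 1/ln2 and ln2. It only shows the shape of the computation, n = round(x / ln2) followed by r = x - n*ln2 in one fused multiply-add.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical scalar stand-in for a single-FMA range reduction.
// Assumes x is finite (softmax feeds it x - max, which is <= 0).
inline float exp_single_fma(float x) {
    const float kLog2e = 1.44269504088896341f;  // 1/ln2 rounded to float
    const float kLn2   = 0.69314718055994531f;  // ln2 rounded to float
    float n = std::nearbyint(x * kLog2e);       // nearest integer multiple of ln2
    float r = std::fma(-n, kLn2, x);            // r = x - n*ln2 in ONE fused op
    // exp(r) on the reduced range; std::exp replaces the kernel's polynomial.
    return std::ldexp(std::exp(r), static_cast<int>(n));  // exp(r) * 2^n
}

// Softmax built on the sketch above: subtract the max, exponentiate, normalize.
inline std::vector<float> softmax_sketch(const std::vector<float>& x) {
    assert(!x.empty());
    float m = x[0];
    for (float v : x) m = std::fmax(m, v);
    std::vector<float> y(x.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < x.size(); ++i) {
        y[i] = exp_single_fma(x[i] - m);
        sum += y[i];
    }
    for (float& v : y) v /= sum;
    return y;
}
```

Because every input has the maximum subtracted before exponentiation, adding a constant to all inputs leaves the result unchanged, which is the shift invariance the description relies on.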

🎯 Why:
The prior softmax_v5 unrolled the max and normalize phases only 4x, which left them latency-bound. In addition, its more accurate ln2 range reduction needed two FMA instructions, adding pressure on the FMA execution ports, even though the shift invariance of softmax can tolerate the slight approximation of a single-FMA reduction.

🏗️ How:

  • Added exp256_ps_v3 inside ml_kernels/include/ml_kernels/softmax.h utilizing single FMA range reduction.
  • Upgraded the max search to use 8 independent accumulator chains (max0 through max7).
  • Upgraded the normalization phase to compute and store 8 vectors (64 elements) per iteration.
  • Created test_softmax_v6 in test_naive_ops.cpp verifying the output is within the 1e-4 target tolerance.
  • Bound and recorded this test into the BenchmarkRegistry.
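The 8-chain max search from the list above can be sketched in portable scalar C++. This is an illustration under stated assumptions, not the kernel itself: `max_8chains` is a hypothetical name, and each accumulator here is a single float, whereas in the AVX2 kernel each of max0 through max7 would be a YMM vector of 8 floats, so one iteration would cover 64 elements.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper: max over n floats using 8 independent accumulators.
// No accumulator depends on another inside the main loop, so consecutive
// max operations do not have to wait on each other's latency.
inline float max_8chains(const float* x, std::size_t n) {
    assert(n > 0);
    float acc[8];
    for (int k = 0; k < 8; ++k) acc[k] = x[0];

    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        for (int k = 0; k < 8; ++k) acc[k] = std::max(acc[k], x[i + k]);
    }

    // Combine the 8 chains, then fold in the leftover elements.
    float m = acc[0];
    for (int k = 1; k < 8; ++k) m = std::max(m, acc[k]);
    for (; i < n; ++i) m = std::max(m, x[i]);
    return m;
}
```

The normalize phase follows the same pattern with 8 independent multiply-and-store operations per iteration instead of 8 max chains.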

📊 Impact:

  • In the microbenchmark at N=1048576 (fixed-memory runs), throughput improved from 4.19 GFLOP/s to 4.48 GFLOP/s (~7%), moving the max and normalize phases from latency-bound to execution-port-bound.

🖥️ Tested on:

  • Haswell+ (AVX2 + FMA target) via CI GitHub runner VM (Ubuntu).

🔬 How to reproduce:

mkdir build && cd build && cmake .. && make ml_kernel_bench
DISABLE_CPU_BINDING=1 ./ml_kernels/ml_kernel_bench --sizes 1048576 --filter 'softmax_v6'

PR created automatically by Jules for task 14084203667146761558 started by @bugparty

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced an optimized softmax kernel implementation with enhanced performance characteristics and computational efficiency.
  • Documentation

    • Added comprehensive performance benchmarking documentation and optimization insights for the softmax implementation, including comparative benchmark analysis.
  • Tests

    • Added extensive test coverage for the new softmax variant, including precision validation and correctness verification across multiple input scenarios.

Co-authored-by: bugparty <1510776+bugparty@users.noreply.github.com>
@google-labs-jules

Contributor

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!

New to Jules? Learn more at jules.google/docs.


For security, I will only act on instructions from the user who triggered this task.

@coderabbitai

coderabbitai Bot commented Jun 13, 2026


📝 Walkthrough

This PR introduces softmax_v6, a new AVX2-vectorized softmax kernel optimized with a specialized exp range-reduction helper, along with corresponding unit tests, benchmark registration, and performance documentation comparing against prior versions.

Changes

Softmax v6 AVX2 Implementation

| Layer / File(s) | Summary |
|---|---|
| AVX2 exp and softmax_v6 functions (ml_kernels/include/ml_kernels/softmax.h) | exp256_ps_v3 performs log2-scale range reduction and computes the reduced exponent argument using a single FNMADD. softmax_v6 pipelines max reduction (8-way unroll), exponentiation and sum accumulation (4-way unroll), vector-to-scalar sum reduction, and output normalization (8-way unroll). |
| Test implementation and integration (ml_kernels/src/test_naive_ops.cpp) | test_softmax_v6() expands input to 72 elements, verifies per-element agreement with softmax_naive within 1e-4 tolerance, and confirms output probability sum ≈ 1.0. Test is wired into main() for execution. |
| Benchmark registration (ml_kernels/src/kernel_bench.cpp) | SoftmaxV6Benchmark inherits SoftmaxBenchmark, invokes ml_kernels::softmax_v6, and is registered for discovery and performance measurement against other softmax variants. |
| Performance documentation (.jules/thunderbolt.md) | Dated entry (2024-06-13) documents shift-invariant range reduction using single FMA, mixed unrolling strategy (heavier for max/normalize phases, lighter for exp/polynomial to avoid spilling), benchmark evidence vs v5, and action items for future work. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • bugparty/cpu_math_kernels_pri#31: The direct predecessor PR introducing exp256_ps_v2 and softmax_v5, which softmax_v6 and exp256_ps_v3 extend with refined range reduction and unrolling tuning.

Poem

🐰 A kernel rebirth, softmax reborn,
With AVX SIMD worn in a mathematical form—
Exp reduces swift with a single FMA dance,
Eight lanes unroll where probabilities prance! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 30.00%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately reflects the main change: introducing softmax_v6 with 8x unrolled phases and single-FMA exp256 optimization, which is the core deliverable across all modified files. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
ml_kernels/src/test_naive_ops.cpp (1)

186-216: 💤 Low value

Consider adding a test case with a non-multiple-of-8 size to exercise scalar tail handling.

The current test with 72 elements exercises the main loops and 8-element cleanup paths, but doesn't cover the scalar tail (lines 617-621, 648-650 in softmax.h). Adding a test case with, e.g., 75 elements would improve coverage of edge-case handling.
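A tail-coverage test along these lines might look like the sketch below. Since softmax_v6 and softmax_naive are not reproduced in this thread, two hypothetical stand-ins are used: `softmax_ref` as a plain reference and `softmax_blocked` as a kernel that processes blocks of 8 followed by a scalar tail. `softmax_tail_check` and the input pattern are likewise assumptions; in the repository the check would call ml_kernels::softmax_naive and ml_kernels::softmax_v6 instead.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Reference: straightforward three-pass softmax (stand-in for softmax_naive).
inline void softmax_ref(const float* in, float* out, std::size_t n) {
    assert(n > 0);
    float m = in[0];
    for (std::size_t i = 1; i < n; ++i) m = std::fmax(m, in[i]);
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) { out[i] = std::exp(in[i] - m); sum += out[i]; }
    for (std::size_t i = 0; i < n; ++i) out[i] /= sum;
}

// Hypothetical kernel under test: blocks of 8 plus a scalar tail in each phase.
inline void softmax_blocked(const float* in, float* out, std::size_t n) {
    assert(n > 0);
    const std::size_t kBlock = 8;
    std::size_t i = 0;
    float m = in[0];
    for (; i + kBlock <= n; i += kBlock)
        for (std::size_t k = 0; k < kBlock; ++k) m = std::fmax(m, in[i + k]);
    for (; i < n; ++i) m = std::fmax(m, in[i]);             // scalar tail (max)

    float sum = 0.0f;
    for (i = 0; i + kBlock <= n; i += kBlock)
        for (std::size_t k = 0; k < kBlock; ++k) {
            out[i + k] = std::exp(in[i + k] - m);
            sum += out[i + k];
        }
    for (; i < n; ++i) { out[i] = std::exp(in[i] - m); sum += out[i]; }  // tail (exp)

    const float inv = 1.0f / sum;
    for (i = 0; i + kBlock <= n; i += kBlock)
        for (std::size_t k = 0; k < kBlock; ++k) out[i + k] *= inv;
    for (; i < n; ++i) out[i] *= inv;                       // scalar tail (normalize)
}

// True when both implementations agree within tol and the output sums to ~1.
inline bool softmax_tail_check(std::size_t n, float tol = 1e-4f) {
    std::vector<float> in(n), a(n), b(n);
    for (std::size_t i = 0; i < n; ++i) in[i] = 0.01f * static_cast<float>(i % 13) - 0.05f;
    softmax_ref(in.data(), a.data(), n);
    softmax_blocked(in.data(), b.data(), n);
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        if (std::fabs(a[i] - b[i]) > tol) return false;
        sum += b[i];
    }
    return std::fabs(sum - 1.0f) <= tol;
}
```

Calling `softmax_tail_check(75)` exercises nine full blocks plus a three-element tail, while `softmax_tail_check(72)` covers only the blocked path, so running both separates tail bugs from main-loop bugs.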

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ml_kernels/src/test_naive_ops.cpp` around lines 186 - 216, The test currently
only covers multiples of 8; change test_softmax_v6 to exercise the scalar tail
by using a non-multiple-of-8 length (for example replace input.resize(72, 1.0f)
with input.resize(75, 1.0f) or add a new test function test_softmax_v6_tail that
sets input.size() to 75), then run ml_kernels::softmax_naive and
ml_kernels::softmax_v6 as before and assert elementwise closeness and sum==1 to
validate the scalar tail handling in softmax_v6/softmax_naive.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 90950ab1-ef6f-43c4-bd68-422c7b9382a7

📥 Commits

Reviewing files that changed from the base of the PR and between acca01e and fc3831e.

📒 Files selected for processing (4)
  • .jules/thunderbolt.md
  • ml_kernels/include/ml_kernels/softmax.h
  • ml_kernels/src/kernel_bench.cpp
  • ml_kernels/src/test_naive_ops.cpp
