Enable SME2 Streaming SVE in ARM by stevesuzuki-arm · Pull Request #9126 · halide/Halide

stevesuzuki-arm · 2026-05-07T19:57:50Z

Enable SME2 Streaming SVE in ARM

This PR adds initial ARM SME2 streaming-mode support to Halide,
which lets pipelines compute with the longer streaming SVE vector length on targets with SME2.

A new sme_streaming(enable, var) scheduling directive lets users
control which loops are computed in streaming mode.

The change introduces a new Target::SME2 feature with supplemental features Target::SME_SVLDDD, where DDD is the streaming vector length in bits (e.g. 128, 256, 512, ...). If Target::SME2 is enabled, exactly one Target::SME_SVLDDD feature must be enabled as well.
Note that natural_vector_size() always returns the host vector size. When vectorizing a stage scheduled with sme_streaming, users have to call Target::sme_streaming_vector_bits() to work out the native vector size for streaming mode.
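
As a rough illustration of the "exactly one SME_SVLDDD feature" rule, here is a minimal standalone sketch. The `Feature` enum and both functions are hypothetical stand-ins written for this example, not Halide's actual Target code; the five SVL values are the ones named later in this PR.

```cpp
#include <cassert>
#include <optional>
#include <set>

// Hypothetical stand-in for the relevant Target features; not Halide's real enum.
enum class Feature { SME2, SME_SVL128, SME_SVL256, SME_SVL512, SME_SVL1024, SME_SVL2048 };

// Streaming vector length in bits for an SME_SVL feature, or 0 for anything else.
inline int svl_bits(Feature f) {
    switch (f) {
    case Feature::SME_SVL128: return 128;
    case Feature::SME_SVL256: return 256;
    case Feature::SME_SVL512: return 512;
    case Feature::SME_SVL1024: return 1024;
    case Feature::SME_SVL2048: return 2048;
    default: return 0;
    }
}

// Returns the streaming vector length in bits, or std::nullopt when SME2 is
// enabled without exactly one SME_SVL feature. Without SME2 this sketch
// returns 0; how the PR treats a stray SVL feature in that case is not stated.
inline std::optional<int> sme_streaming_vector_bits(const std::set<Feature> &features) {
    if (features.count(Feature::SME2) == 0) return 0;
    int count = 0, bits = 0;
    for (Feature f : features) {
        if (int b = svl_bits(f)) { ++count; bits = b; }
    }
    if (count != 1) return std::nullopt;
    return bits;
}

// Usage: SME2 with a 512-bit streaming vector length is a valid combination.
inline void sme_feature_example() {
    auto bits = sme_streaming_vector_bits({Feature::SME2, Feature::SME_SVL512});
    assert(bits && *bits == 512);
}
```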

In Halide lowering, a new LowerSMEStreamingTasks pass is added.
It extracts each streaming-mode loop into an internal closure function
so that we can attach the LLVM function attributes needed to transition into and out of streaming mode:

aarch64_pstate_sm_body to emit the smstart/smstop transitions
NoInline to prevent the streaming closure from being inlined into a non-streaming function
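
The outlining step can be sketched on a toy loop IR. Everything below (`Stmt`, `LoweredFunc`, `outline_streaming`, the `sme_closure_N` naming) is hypothetical and far simpler than the real LowerSMEStreamingTasks pass; only the two attribute names come from the description above, with NoInline written as the LLVM attribute string `noinline`.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy loop IR for illustration only; Halide's real IR is much richer.
struct Stmt {
    std::string name;        // loop variable, or callee name when is_call is set
    bool streaming = false;  // loop is marked for streaming mode
    bool is_call = false;    // node is a call to an outlined closure
    std::vector<Stmt> body;  // nested statements
};

struct LoweredFunc {
    std::string name;
    std::vector<std::string> attributes;  // LLVM function attributes to attach
    Stmt body;
};

// Replace each outermost streaming loop with a call to a new closure function
// that carries the streaming-mode attributes.
inline void outline_streaming(Stmt &s, std::vector<LoweredFunc> &closures) {
    if (!s.is_call && s.streaming) {
        const std::string fn = "sme_closure_" + std::to_string(closures.size());
        closures.push_back({fn, {"aarch64_pstate_sm_body", "noinline"}, s});
        s = Stmt{fn, false, true, {}};
        return;  // inner loops stay inside the closure
    }
    for (Stmt &child : s.body) outline_streaming(child, closures);
}

// Usage: a non-streaming loop y containing a streaming loop x.
inline void outline_example() {
    Stmt root{"y", false, false, {Stmt{"x", true, false, {}}}};
    std::vector<LoweredFunc> closures;
    outline_streaming(root, closures);
    assert(closures.size() == 1 && root.body[0].is_call);
}
```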

In CodeGen, target_vscale() depends on whether the code being generated runs in streaming mode,
so it can vary within a Module, although it is constant within a Function.
In streaming mode, vector type code generation and intrinsic selection are
based on Target::sme_streaming_vector_bits() (the streaming vscale).
Coverage is almost the same as the existing SVE2 code generation;
SME2-specific instructions have not been enabled yet.
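
A minimal sketch of the per-function vscale selection, assuming vscale is the vector length divided by 128 bits (the granule LLVM uses for AArch64 scalable vectors). `TargetModel` and its field names are made up for this example and do not mirror Halide's Target class.

```cpp
#include <cassert>

// Hypothetical summary of a target; field names are illustrative only.
struct TargetModel {
    int vector_bits;            // non-streaming SVE vector length in bits
    int streaming_vector_bits;  // streaming vector length (SVL) in bits
};

// vscale for the function being compiled: it depends on whether that function
// runs in streaming mode, and is fixed for the whole function.
inline int target_vscale(const TargetModel &t, bool in_streaming_function) {
    const int bits = in_streaming_function ? t.streaming_vector_bits : t.vector_bits;
    assert(bits > 0 && bits % 128 == 0);
    return bits / 128;
}
```

With a 128-bit non-streaming length and a 512-bit streaming length, the same module would see vscale 1 in ordinary functions and vscale 4 inside a streaming closure.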

Additionally, the following changes are implemented:

Auto-detect SME2 and SME_SVLDDD target features on the host CPU
Fall back from streaming SVE when vectorization factors are not feasible
Scalarize gather/scatter in streaming mode, with a warning
Add runtime checks for mismatches between the streaming vscale and the compile-time vscale
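
The last item can be pictured as a simple comparison at pipeline entry. This is only a sketch: the error constant and function below are hypothetical, and the signature of the runtime error added in this PR (halide_error_streaming_vscale_invalid) is not shown on this page.

```cpp
#include <cassert>

// Hypothetical error code used only in this sketch.
constexpr int kStreamingVscaleMismatch = -1;

// Compare the streaming vector length observed at run time with the vscale
// the pipeline was compiled for. Returns 0 when they agree.
inline int check_streaming_vscale(int runtime_svl_bits, int compiled_vscale) {
    assert(runtime_svl_bits > 0 && runtime_svl_bits % 128 == 0);
    const int runtime_vscale = runtime_svl_bits / 128;
    return runtime_vscale == compiled_vscale ? 0 : kStreamingVscaleMismatch;
}
```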

Checklist

Tests added or updated (not required for docs, CI config, or typo fixes)
Documentation updated (if public API changed)
Python bindings updated (if public API changed)
Commits include AI attribution where applicable (see Code of Conduct)

Added:
- Target::SME2 definition
- streaming_vector_bits in Target for SME2
- Auto-detect SME2 and streaming_vector_bits
- sme_streaming() scheduling directive in Func and Pipeline
- DeviceAPI::Host_SMEStreaming in IR "For"
- LowerSMEStreamingTasks pass to extract streaming closure
- Attribute in LoweredFunc for streaming closure
- LLVM Function attribute to control streaming mode
- NoInline to prevent streaming closure from inlined
- "aarch64_pstate_sm_body" to emit smstart/smstop transition
- Disable gather/scatter in SME streaming mode

Tests:
- Add correctness/sme_streaming
- Run simd_op_check_sve2 in SME streaming mode
- Add test to assert runtime streaming vscale

stevesuzuki-arm · 2026-05-07T20:02:29Z

This PR is ready for review. I will touch on this in the dev meeting if I have a chance.

Reason: While vector_bits is used across multiple target architectures, streaming_vector_bits is aarch64 specific. So we choose to use Target::Feature rather than a new member for arbitrary bits.
- Removed Target::streaming_vector_bits member variable
- Added Feature::SME_SVL{128,256,512,1024,2048}

Revert the changes in halide_error_vscale_invalid to avoid potential runtime breaking changes.

Because streaming_vector_bits member variable has been removed.

stevesuzuki-arm · 2026-05-11T14:36:28Z

Based on the feedback in dev meeting, streaming_vector_bits has been replaced with Feature::SME_SVL

codecov · 2026-05-11T15:56:33Z

Codecov Report

❌ Patch coverage is 61.13074% with 110 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@7e2ecf2). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
src/CodeGen_ARM.cpp	54.25%	34 Missing and 9 partials ⚠️
src/Target.cpp	52.56%	22 Missing and 15 partials ⚠️
src/LowerSMEStreamingTasks.cpp	81.69%	9 Missing and 4 partials ⚠️
src/IRPrinter.cpp	0.00%	2 Missing and 1 partial ⚠️
src/InjectHostDevBufferCopies.cpp	40.00%	1 Missing and 2 partials ⚠️
src/Lower.cpp	72.72%	1 Missing and 2 partials ⚠️
src/Deserialization.cpp	0.00%	1 Missing and 1 partial ⚠️
src/Profiling.cpp	0.00%	0 Missing and 2 partials ⚠️
src/Serialization.cpp	0.00%	1 Missing and 1 partial ⚠️
src/DeviceInterface.cpp	0.00%	0 Missing and 1 partial ⚠️
... and 1 more

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #9126   +/-   ##
=======================================
  Coverage        ?   69.31%           
=======================================
  Files           ?      255           
  Lines           ?    78468           
  Branches        ?    18781           
=======================================
  Hits            ?    54389           
  Misses          ?    18554           
  Partials        ?     5525


zvookin · 2026-05-20T18:26:30Z

The high-order question is whether this should just be another top-level target feature for ARM processors. If so, no special loop annotation is required and vector_bits_N can be used to give the size of the SME2 unit.

That means one cannot generate both SME2 and NEON code in the same pipeline, but having looked over the architecture spec I'm not convinced that is useful. It is slightly convenient in terms of bounds inference, but performance-wise, switching between the modes clears the entire vector state, so imposing a function call boundary there is hardly a problem. (The same is true for SVE/SVE2.) My reading of the architecture spec is that fine-grained switching between SME2 and one of the other vector extensions is not a great idea.

Also, because it is a singular resource, its interaction with parallelism requires care at a level outside of Halide-generated code.

I expect doing it this way limits the processing that can be specified, but that would be true inside the loop labelled SME2 anyway. This may have been discussed, with a question as to whether to fail compilation or to fall back to e.g. NEON. Really, the initial use case is specialized kernels that are written specifically for the SME2 hardware anyway, so failing compilation is fine.

stevesuzuki-arm · 2026-05-20T19:23:13Z

It is true that switching into and out of streaming mode has some overhead. So very frequent transitions (e.g. in the innermost loop) should be avoided for performance.
IMO, when scheduling a long, complex pipeline, it is not uncommon to want to run the cleanly vectorized parts in streaming mode and the rest in non-streaming mode (e.g. scalar processing, calling runtime functions, etc.). In such a case, there are a couple (or several) intermediates generated with compute_root() as boundaries, which I think is a common situation. I'd assume streaming SVE is generally beneficial if the workload is simply vectorized and scaled, even without matmul (caveat: depending on the microarchitecture implementation). I have an internal app with this kind of mixed targeting, and the switching cost is negligible compared to the performance gain. So I think asking the user to split the Halide module just for the streaming/non-streaming boundary sounds inconvenient.
That is why I expose it as a scheduling option for the user.
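
To make the "avoid transitions in the innermost loop" point concrete, here is a back-of-the-envelope sketch. It assumes one smstart/smstop pair per entry into the streaming closure, and that the closure is entered once per iteration of every loop enclosing the marked loop. The function name and the example extents are illustrative, not measurements.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// `extents` lists loop extents from outermost to innermost; `marked_loop` is
// the index of the loop marked as streaming. The result is the number of
// times the streaming closure is entered under the assumption stated above.
inline std::int64_t streaming_entries(const std::vector<std::int64_t> &extents,
                                      std::size_t marked_loop) {
    assert(marked_loop < extents.size());
    std::int64_t entries = 1;
    for (std::size_t i = 0; i < marked_loop; ++i) entries *= extents[i];
    return entries;
}
```

For a nest with extents {1080, 8, 256}, marking the outermost loop gives a single entry, while marking the innermost loop gives 1080 * 8 = 8640 entries, so the transition cost has to be amortized over only 256 iterations each time.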

abadams · 2026-05-26T20:20:18Z

    D3D12Compute,
    Vulkan,
    WebGPU,
+    Host_SMEStreaming,


Could this just be "SME"? Or "SMEStreaming"? I'm not sure what Host_ buys you.

The nuance I intended with "Host" is that it is almost like running on the host CPU, because from the SW point of view:

Instructions are streamed from the host, which looks similar to issuing other CPU instructions

No device buffer

No device runtime

I think the word "streaming" helps to emphasize execution in streaming mode, as there are other SME-specific features/instructions that are legal even in non-streaming mode.
So I'm fine with either "SMEStreaming" or "Host_SMEStreaming". Do you prefer to remove "Host"?

I would prefer SMEStreaming

abadams · 2026-05-26T20:23:18Z

+     * When a loop is marked with sme_streaming(true), that loop including its inner loops
+     * are executed in Streaming mode. Marking with sme_streaming(false) prevents the loop
+     * from being executed in Streaming mode. */
+    Func &sme_streaming(bool enable, const VarOrRVar &x = Var::outermost());


I'm not sure I understand the function of the bool parameter here. Most of the other similar methods just mark a loop as something, and if you don't want that you don't call the method (or you call some other method like unroll()). Why is this not just the same as the hexagon method?

test_2_stages_consumer_streaming_at() in test/correctness/sme_streaming.cpp is an example where false is set. It deals with a non-ideal situation where the producer tile needs to be computed in non-streaming mode for some reason.
We could remove the bool parameter, in favor of simplicity, if we don't support this case. So I think it is a design choice.

Is this schedule:

    g.compute_root().sme_streaming(true, x).split(x, xo, xi, 256);
    // explicitly set false, otherwise streaming is enabled
    f.compute_at(g, xo).sme_streaming(false);

equivalent to this?

    g.compute_root().split(x, xo, xi, 256).sme_streaming(true, xi);
    f.compute_at(g, xo);

I guess maybe they differ in how the generated function calls are structured? The sense I'm getting is that, unlike other offload engines, with SME you can have a host loop inside a device loop: you can "come back" from the device temporarily for some code, and this can be useful. Is that correct?

If so, I think we just want a .host(VarOrRVar) scheduling call that sets deviceAPI to Host for that loop. (Open to other opinions for the name).

If we look only at the vectorization of the arithmetic that computes the output, they are equivalent. The function structures, on the other hand, are different.

    produce g:
      for x.xo<SMEStreaming>:
        produce f:
          for __outermost in [0, 0]:   <== marked as non-streaming! (or .host)
            for x:
              f(...) = ...
        consume f:
          for x.xi in [0, 255]<SMEStreaming>:
            g(...) = ...

VS

    produce g:
      for x.xo:
        produce f:
          for x:
            f(...) = ...
        consume f:
          for x.xi in [0, 255]<SMEStreaming>:
            g(...) = ...

And yes, your guess is exactly the idea behind this.
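
The inheritance behaviour discussed here (a marked loop and its inner loops run in streaming mode unless an inner loop is explicitly marked false) can be sketched as a small tree walk. `Loop`, `Resolved`, and `resolve` are hypothetical names for this illustration and are not part of the PR.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Toy loop node: `mark` is the explicit sme_streaming(true/false) setting, if any.
struct Loop {
    std::string name;
    std::optional<bool> mark;
    std::vector<Loop> inner;
};

struct Resolved {
    std::string name;
    bool streaming;
};

// A loop uses its own mark when present; otherwise it inherits from its parent.
inline void resolve(const Loop &l, bool inherited, std::vector<Resolved> &out) {
    const bool streaming = l.mark.value_or(inherited);
    out.push_back({l.name, streaming});
    for (const Loop &child : l.inner) resolve(child, streaming, out);
}

// Usage: the first loop nest above. g's xo loop is marked streaming, f's
// outermost loop is explicitly marked non-streaming, the rest inherit.
inline void resolve_example() {
    Loop nest{"g.xo", true,
              {Loop{"f.__outermost", false, {Loop{"f.x", std::nullopt, {}}}},
               Loop{"g.xi", std::nullopt, {}}}};
    std::vector<Resolved> out;
    resolve(nest, false, out);
    assert(out.size() == 4);
    assert(out[0].streaming && !out[1].streaming && !out[2].streaming && out[3].streaming);
}
```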

abadams · 2026-05-26T20:28:29Z

+    return result;
+}
+
+int Target::natural_vector_size(const Halide::Type &t, bool is_sme_streaming) const {


This is awkwardly different between SME and other offload targets now. An ARM pipeline could conceivably have something scheduled on the host, something scheduled on a Hexagon DSP, and something scheduled for SME. I think unless we figure out a general solution we should just leave natural_vector_size as it is on main (i.e. it always returns the host vector size), and code will have to call sme_streaming_vector_bits if it's vectorizing a stage scheduled as sme_streaming.

I see. I will apply the change accordingly.

Done. And updated the PR description as well.
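
On the user side, the agreed behaviour means the vectorization width has to be chosen per stage. A minimal sketch, with `lanes_for_stage` as a hypothetical helper and the bit widths passed in explicitly rather than queried from a real Target:

```cpp
#include <cassert>

// Pick a lane count for a stage: streaming stages use the streaming vector
// length, all other stages use the host vector width. Widths are in bits.
inline int lanes_for_stage(bool stage_is_sme_streaming, int host_vector_bits,
                           int streaming_vector_bits, int element_bits) {
    assert(element_bits > 0);
    const int bits = stage_is_sme_streaming ? streaming_vector_bits : host_vector_bits;
    return bits / element_bits;
}
```

For 32-bit elements with a 128-bit host width and a 512-bit streaming length, this gives 4 lanes for a host stage and 16 lanes for a stage scheduled with sme_streaming.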

Co-authored-by: Codex <codex@openai.com>

- Removed "Host_" prefix
- Updated a few switch-case which was missing

Co-authored-by: Codex <codex@openai.com>

stevesuzuki-arm · 2026-05-27T19:54:33Z

With halide-llvm 23.0.0.dev94237+gf95ccbae, simd_op_check_sve2 test fails due to llvm/llvm-project#200034

stevesuzuki-arm and others added 2 commits May 7, 2026 20:16

Fix compile error in Windows build

6d2059c

suppress SME spell-check issues

75596df

alexreinking requested a review from halidebuildbots May 7, 2026 23:44

stevesuzuki-arm added 3 commits May 11, 2026 07:23

Add halide_error_streaming_vscale_invalid runtime error

c7d9580

Revert the changes in halide_error_vscale_invalid to avoid potential runtime breaking changes.

Revert the change of parse_vector_bits in Target

0b2810a

Because streaming_vector_bits member variable has been removed.

Merge branch 'main' into pr-sme2

1be5b79

abadams reviewed May 26, 2026

View reviewed changes

stevesuzuki-arm and others added 4 commits May 27, 2026 10:05

Merge branch 'main' into pr-sme2

0dc9be6

Restore Target::natural_vector_size without streaming arg

424ddaa

Co-authored-by: Codex <codex@openai.com>

Rename DeviceAPI for SME Streaming

8e2901e

- Removed "Host_" prefix
- Updated a few switch-case which was missing

Co-authored-by: Codex <codex@openai.com>

Merge branch 'main' into pr-sme2

1a66dd8


4 participants
