Switch to neon for interleave (#20137) by metascroy · Pull Request #20137 · pytorch/executorch

metascroy · 2026-06-09T01:50:00Z

Summary:

The BGRA/RGB → planar-CHW-float deinterleave + normalization step was implemented twice and sub-optimally: the Apple backend used a strided vDSP gather (vDSP_vfltu8 ×3 + vDSP_vsmsa, ~6 passes over the input), and the portable/Android backend used a scalar triple-nested loop. This replaces both with a single hand-vectorized kernel in a new shared translation unit.

image_processor_simd.{h,cpp} provides deinterleave_to_chw():

* One vld4q_u8 (BGRA/RGBA) or vld3q_u8 (RGB) read, widen uint8→float in-register, fused per-channel affine out = in*(scale/std) + (-mean/std) via vfmaq_f32, single write per plane.
* NEON on ARM (all shipping iOS/Apple-silicon targets and Android arm64), scalar fallback elsewhere.
* Handles the fast (contiguous) path plus a row-by-row slow path for stride padding and letterbox offsets.

Both backends now call it.
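The load, widen, FMA, store pipeline listed above can be sketched for the 4-channel case as follows. This is an illustrative sketch, not the PR's code: the function names `deinterleave4_sketch` and `fma_store4`, the parameter layout, and the choice to guard on `__aarch64__` are assumptions. Only the intrinsics named in the summary (`vld4q_u8`, `vfmaq_f32`) and the affine form `out = in*a + b` come from the description.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__aarch64__)
#include <arm_neon.h>

// Widens 4 uint16 lanes to float and stores x * va + vb (hypothetical helper).
static inline void fma_store4(float* out, uint16x4_t v, float32x4_t va,
                              float32x4_t vb) {
  vst1q_f32(out, vfmaq_f32(vb, vcvtq_f32_u32(vmovl_u16(v)), va));
}
#endif

// Converts n interleaved 4-byte pixels into three float planes of n elements.
// offs[c] is the byte offset inside a pixel that feeds plane c, for example
// {2, 1, 0} to read BGRA into R, G, B planes. a[c] = scale/std, b[c] = -mean/std.
inline void deinterleave4_sketch(const uint8_t* src, size_t n, const int offs[3],
                                 const float a[3], const float b[3], float* dst) {
  assert(src != nullptr && dst != nullptr);
  size_t i = 0;
#if defined(__aarch64__)
  for (; i + 16 <= n; i += 16) {
    // One structured load splits 16 pixels (64 bytes) into 4 byte vectors.
    const uint8x16x4_t px = vld4q_u8(src + 4 * i);
    for (int c = 0; c < 3; ++c) {
      const uint8x16_t v = px.val[offs[c]];
      const uint16x8_t lo = vmovl_u8(vget_low_u8(v));
      const uint16x8_t hi = vmovl_u8(vget_high_u8(v));
      const float32x4_t va = vdupq_n_f32(a[c]);
      const float32x4_t vb = vdupq_n_f32(b[c]);
      float* out = dst + static_cast<size_t>(c) * n + i;
      fma_store4(out + 0, vget_low_u16(lo), va, vb);
      fma_store4(out + 4, vget_high_u16(lo), va, vb);
      fma_store4(out + 8, vget_low_u16(hi), va, vb);
      fma_store4(out + 12, vget_high_u16(hi), va, vb);
    }
  }
#endif
  // Scalar tail on ARM, and the whole image on other targets.
  for (; i < n; ++i) {
    for (int c = 0; c < 3; ++c) {
      dst[static_cast<size_t>(c) * n + i] =
          static_cast<float>(src[4 * i + offs[c]]) * a[c] + b[c];
    }
  }
}
```

On non-ARM builds only the scalar loop compiles, which mirrors the "scalar fallback elsewhere" behavior stated in the summary.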

According to the benchmark script, GPU is stable (within noise), and CPU ranges from on par (0.98x at the minimum) to 1.86x faster (at the maximum).

CPU rows       n=120  median=1.19x  min=0.98x  max=1.86x
GPU rows       n=36   median=1.11x  min=0.87x  max=1.41x
Default rows   n=24   median=1.29x  min=0.99x  max=1.71x

See D102373165 for inspiration.

Differential Revision: D107958353

pytorch-bot · 2026-06-09T01:50:05Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20137

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 1 Pending

As of commit c7f4f74 with merge base af92b60:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-codesync · 2026-06-09T01:50:08Z

@metascroy has exported this pull request. If you are a Meta employee, you can view the originating Diff in D107958353.

github-actions · 2026-06-09T01:50:51Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

metascroy · 2026-06-09T01:52:57Z

@claude review this code

claude · 2026-06-09T01:53:37Z

Claude finished @metascroy's task in 1m 21s

Code Review: Switch to NEON for interleave

Read the PR diff and changed files
Analyze new SIMD kernel (image_processor_simd.{h,cpp})
Review integration in image_processor.cpp and image_processor_apple.cpp
Check build system changes (CMakeLists.txt, targets.bzl)
Post review feedback

Overall: This is a clean refactoring that unifies two sub-optimal deinterleave implementations into a single well-vectorized kernel. The NEON intrinsics are correct and the architecture (fast path + row-by-row slow path) is sound. A few observations below:

Potential integer overflow in fast path

image_processor_simd.cpp:132 — src_w * src_h is computed as int32_t * int32_t:

deinterleave_run(
    src,
    src_w * src_h,  // <-- can overflow for images > ~46k x 46k
    ...

This is safe given practical image sizes going through this pipeline (the upstream already constrains dimensions), but casting to int64_t before the multiply would be more defensive. Low priority.
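A minimal sketch of the defensive form suggested here, widening one operand before the multiply. The helper name `pixel_count` is hypothetical and not taken from the PR.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: the product is evaluated in 64 bits, so it cannot wrap
// the way int32_t * int32_t does once the result exceeds INT32_MAX.
inline int64_t pixel_count(int32_t w, int32_t h) {
  assert(w >= 0 && h >= 0);
  return static_cast<int64_t>(w) * h;  // h is promoted to int64_t as well
}
```

The threshold quoted in the review follows from INT32_MAX: 46340 x 46340 still fits in 32 bits, while 46341 x 46341 does not.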

No validation of `in_channels` or channel offsets

The function accepts any in_channels value, but the NEON path only handles 4 (vld4) and "else" (vld3). If in_channels is something unexpected (e.g. 2 or 5), the NEON branch would silently use vld3 with wrong stride semantics. Similarly, r_off/g_off/b_off are documented as < in_channels but not checked.

Since this is an internal API with exactly two call sites (both correct), this is fine as-is. An ET_DCHECK would be a cheap safety net if desired:

ET_DCHECK_MSG(in_channels == 3 || in_channels == 4, "in_channels must be 3 or 4");
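The offset constraint mentioned above could be covered by the same kind of check. The predicate below is a hypothetical sketch in plain C++ (the name `is_supported_layout` is an assumption); its result could be handed to whatever assertion macro the codebase uses.

```cpp
#include <cassert>

// Hypothetical predicate, not part of the PR: true when the layout has 3 or 4
// interleaved channels and every color offset lies inside one pixel.
constexpr bool is_supported_layout(int in_channels, int r_off, int g_off,
                                   int b_off) {
  return (in_channels == 3 || in_channels == 4) &&
         r_off >= 0 && r_off < in_channels &&
         g_off >= 0 && g_off < in_channels &&
         b_off >= 0 && b_off < in_channels;
}

// Compile-time spot checks of the two layouts named in the PR description.
static_assert(is_supported_layout(4, 2, 1, 0), "BGRA should be accepted");
static_assert(is_supported_layout(3, 0, 1, 2), "RGB should be accepted");
static_assert(!is_supported_layout(2, 0, 1, 1), "2 channels should be rejected");
```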

`__attribute__((always_inline))` portability

image_processor_simd.cpp:31 — __attribute__((always_inline)) is GCC/Clang-only. This is gated behind ET_IMAGE_USE_NEON which implies ARM+Clang/GCC, so it's correct in practice. If Windows-on-ARM (MSVC) ever becomes a target, this would need a __forceinline equivalent behind a macro.
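One common way to hide the compiler difference is a small macro. The sketch below is an assumption about how that could look; the macro name `IMG_ALWAYS_INLINE` and the sample function `affine` are hypothetical and do not appear in the PR.

```cpp
#include <cassert>

// Hypothetical portability macro: picks the force-inline spelling each
// compiler family understands, and falls back to plain inline otherwise.
#if defined(_MSC_VER) && !defined(__clang__)
#define IMG_ALWAYS_INLINE __forceinline
#elif defined(__GNUC__) || defined(__clang__)
#define IMG_ALWAYS_INLINE inline __attribute__((always_inline))
#else
#define IMG_ALWAYS_INLINE inline
#endif

// Sample use: a per-element affine transform that should always be inlined.
IMG_ALWAYS_INLINE float affine(float x, float a, float b) {
  return x * a + b;
}
```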

vDSP `no_offset` optimization removed

The old Apple path had a special no_offset branch using vsmul (cheaper than vsmsa) when bias terms were zero. The new NEON FMA approach always does multiply-add. On modern ARM cores with FMA units, vfmaq_f32 with a zero addend is essentially free, so this is not a regression. Just noting the intentional simplification.
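On the numerical side, dropping the zero-bias branch should not change results, since a fused multiply-add with a zero addend rounds the same product that a plain multiply does. The sketch below checks that for every uint8 input using the standard library's `std::fma`. It is a standalone illustration with a hypothetical function name, and it says nothing about performance.

```cpp
#include <cassert>
#include <cmath>

// Returns true if fma(x, a, 0) equals x * a for every possible uint8 input.
// Both expressions round the exact product x * a once, so they should agree.
inline bool zero_bias_fma_matches_multiply(float a) {
  for (int v = 0; v <= 255; ++v) {
    const float x = static_cast<float>(v);
    if (std::fma(x, a, 0.0f) != x * a) {
      return false;
    }
  }
  return true;
}
```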

Minor: `image_processor_simd.h` not in install list

CMakeLists.txt:68-75 — The header image_processor_simd.h is not in the install(FILES ...) lists. This is correct since it's an internal implementation header, but worth confirming the intent is that downstream consumers never call deinterleave_to_chw directly.

Summary: Looks good. The NEON kernel is idiomatic, the widen→FMA→store pipeline is efficient, and the unification eliminates meaningful code duplication. No correctness issues found. The minor points above are optional hardening.

metascroy · 2026-06-09T19:14:35Z

@claude review this code

claude · 2026-06-09T19:15:29Z

Claude finished @metascroy's task in 1m 42s

Code Review: Switch to NEON for interleave

Read SIMD kernel (image_processor_simd.{h,cpp})
Review integration (image_processor.cpp, image_processor_apple.cpp)
Check build files (CMakeLists.txt, targets.bzl)
Post review

Overall: Clean, correct unification of two duplicated deinterleave+normalize implementations into one well-vectorized kernel. I verified the math and the bounds; no correctness issues found. The hardening suggestions from the earlier review round appear to have already landed in this revision (see below).

Correctness — verified

FMA math matches scalar. vfmaq_f32(vb, x, va) = vb + x*va = x*a + b, identical to the scalar p[off]*a + b. Affine coefficients (a = scale/std, b = -mean/std) and the RGB-indexed norm are consistent across both call sites (BGRA → {2,1,0}, RGB/RGBA → {0,1,2}).
No over-read/over-write. The i + 16 <= n guards keep both vld4q_u8 (64 B) and vld3q_u8 (48 B) reads inside the run, and each widen_fma_store writes exactly [i, i+16). In the fast path the run is exactly n*in_channels bytes and spatial == n; in the slow path each row reads ≤ src_w pixels (within src_stride) and writes within (y+offset_y)*final_w + offset_x + src_w ≤ final_w*final_h. The strided/letterbox call sites (Apple BGRA with row padding, P010) correctly avoid the fast path.
Earlier-round items already addressed: src_w*src_h now uses static_cast<size_t> (image_processor_simd.cpp:142), and the ET_DCHECK_MSG guards on in_channels/channel offsets are present (:116-120).
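The slow-path bound quoted above can be made concrete with a scalar sketch of the row-by-row indexing. Everything here is illustrative: the `Letterbox` struct, its field layout, and the function names are assumptions that borrow the variable names from the review text (src_w, src_stride, offset_x, offset_y, final_w, final_h).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical geometry description for one letterboxed copy.
struct Letterbox {
  int src_w, src_h;        // source size in pixels
  size_t src_stride;       // bytes per source row, may include padding
  int final_w, final_h;    // destination plane size in pixels
  int offset_x, offset_y;  // top-left of the source inside the destination
};

// Destination index of the first pixel of source row y.
inline size_t dst_row_start(const Letterbox& g, int y) {
  return static_cast<size_t>(y + g.offset_y) * static_cast<size_t>(g.final_w) +
         static_cast<size_t>(g.offset_x);
}

// True when every row read stays inside its stride and every row written
// stays inside one final_w * final_h plane.
inline bool rows_fit(const Letterbox& g, int in_channels) {
  if (g.src_w < 0 || g.src_h < 0 || g.offset_x < 0 || g.offset_y < 0) return false;
  if (g.offset_x + g.src_w > g.final_w) return false;
  if (g.offset_y + g.src_h > g.final_h) return false;
  return static_cast<size_t>(g.src_w) * static_cast<size_t>(in_channels) <=
         g.src_stride;
}

// Copies one channel (byte offset off within a pixel) row by row into plane.
inline void copy_channel_rows(const uint8_t* src, const Letterbox& g,
                              int in_channels, int off, float a, float b,
                              float* plane) {
  assert(rows_fit(g, in_channels));
  for (int y = 0; y < g.src_h; ++y) {
    const uint8_t* row = src + static_cast<size_t>(y) * g.src_stride;
    float* out = plane + dst_row_start(g, y);
    for (int x = 0; x < g.src_w; ++x) {
      out[x] = static_cast<float>(row[x * in_channels + off]) * a + b;
    }
  }
}
```

With offset_y + src_h ≤ final_h and offset_x + src_w ≤ final_w, the last write of the last row lands at or before index final_w*final_h - 1, which is the inequality stated in the review.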

Minor observations (optional, non-blocking)

FP determinism across the NEON/scalar boundary. The NEON path uses a single-rounding FMA; the scalar fallback (:93-95) is a separate multiply + add, which may round twice (or contract to FMA depending on -ffp-contract). Results can differ in the last bit between ARM and non-ARM, and even within a single image at the <16-pixel scalar tail. Negligible for preprocessing, but if any golden test does an exact float compare it could become flaky — confirm comparisons are tolerance-based.
ET_DCHECK compiles out in release. In NDEBUG builds an unexpected in_channels (e.g. 2 or 5) would silently fall into the vld3 branch with wrong stride semantics. Fine for an internal API with two correct call sites; would only matter if deinterleave_to_chw ever becomes part of the public surface.
The in_channels == 4 / == 3 loop bodies differ only by vld4q/vld3q. Leaving them as two explicit loops reads better than templating over the load — consistent with the repo's "prefer simpler/concise" guidance, so no change suggested.
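For the FP determinism point, a tolerance-based comparison in a golden test might look like the sketch below. The helper name `all_close` and the default tolerances are assumptions chosen for illustration, not values taken from the repository's tests.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical test helper: element-wise comparison with a combined absolute
// and relative tolerance, so a fused and an unfused multiply-add that differ
// in the last bit still compare as equal.
inline bool all_close(const float* x, const float* y, size_t n,
                      float rtol = 1e-5f, float atol = 1e-6f) {
  for (size_t i = 0; i < n; ++i) {
    const float diff = std::fabs(x[i] - y[i]);
    // Written as !(diff <= tol) so that a NaN difference fails the check.
    if (!(diff <= atol + rtol * std::fabs(y[i]))) {
      return false;
    }
  }
  return true;
}
```

The absolute term matters near the point where in*a + b crosses zero, because a purely relative tolerance shrinks to nothing there.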

Nothing here needs to block the PR.

psiddh

LGTM, it is great to have a NEON-accelerated path.

metascroy requested review from kirklandsign and larryliu0820 as code owners June 9, 2026 01:50

meta-cla Bot added the CLA Signed label Jun 9, 2026

meta-codesync Bot added the meta-exported label Jun 9, 2026

meta-codesync Bot changed the title ~~Switch to neon for interleave~~ Switch to neon for interleave (#20137) Jun 9, 2026

metascroy force-pushed the export-D107958353 branch from 1d1f264 to 3342f2e Compare June 9, 2026 18:56

metascroy force-pushed the export-D107958353 branch from 3342f2e to 62eec46 Compare June 9, 2026 18:57

metascroy force-pushed the export-D107958353 branch from 62eec46 to c7f4f74 Compare June 9, 2026 19:13

metascroy requested a review from psiddh June 9, 2026 19:14

mergennachin approved these changes Jun 10, 2026

psiddh approved these changes Jun 10, 2026

meta-codesync Bot merged commit 92e6a4c into pytorch:main Jun 10, 2026
181 checks passed
