Skip to content

Switch to neon for interleave (#20137)#20137

Merged
meta-codesync[bot] merged 1 commit into
pytorch:mainfrom
metascroy:export-D107958353
Jun 10, 2026
Merged

Switch to neon for interleave (#20137)#20137
meta-codesync[bot] merged 1 commit into
pytorch:mainfrom
metascroy:export-D107958353

Conversation

@metascroy

@metascroy metascroy commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Summary:

The BGRA/RGB → planar-CHW-float deinterleave + normalization step was implemented twice and sub-optimally: the Apple backend used a strided vDSP gather (vDSP_vfltu8 ×3 + vDSP_vsmsa, ~6 passes over the input), and the portable/Android backend used a scalar triple-nested loop. This replaces both with a single hand-vectorized kernel in a new shared translation unit.

image_processor_simd.{h,cpp} provides deinterleave_to_chw():

  • One vld4q_u8 (BGRA/RGBA) or vld3q_u8 (RGB) read, widen uint8→float in-register, fused per-channel affine out = in*(scale/std) + (-mean/std) via vfmaq_f32, single write per plane.
  • NEON on ARM (all shipping iOS/Apple-silicon targets and Android arm64), scalar fallback elsewhere.
  • Handles the fast (contiguous) path plus a row-by-row slow path for stride padding and letterbox offsets.

Both backends now call it.

From benchmark script, GPU is stable (within noise), and CPU is on par at min (0.98x) to 1.86x faster at max.

CPU rows       n=120  median=1.19x  min=0.98x  max=1.86x
GPU rows       n=36   median=1.11x  min=0.87x  max=1.41x
Default rows   n=24   median=1.29x  min=0.99x  max=1.71x

See D102373165 for inspiration.

Differential Revision: D107958353

@pytorch-bot

pytorch-bot Bot commented Jun 9, 2026

Copy link
Copy Markdown

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20137

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 1 Pending

As of commit c7f4f74 with merge base af92b60 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 9, 2026
@meta-codesync

meta-codesync Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

@metascroy has exported this pull request. If you are a Meta employee, you can view the originating Diff in D107958353.

@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@metascroy

Copy link
Copy Markdown
Contributor Author

@claude review this code

@claude

claude Bot commented Jun 9, 2026

Copy link
Copy Markdown

Claude finished @metascroy's task in 1m 21s —— View job


Code Review: Switch to NEON for interleave

  • Read the PR diff and changed files
  • Analyze new SIMD kernel (image_processor_simd.{h,cpp})
  • Review integration in image_processor.cpp and image_processor_apple.cpp
  • Check build system changes (CMakeLists.txt, targets.bzl)
  • Post review feedback

Overall: This is a clean refactoring that unifies two sub-optimal deinterleave implementations into a single well-vectorized kernel. The NEON intrinsics are correct and the architecture (fast path + row-by-row slow path) is sound. A few observations below:


Potential integer overflow in fast path

image_processor_simd.cpp:132src_w * src_h is computed as int32_t * int32_t:

deinterleave_run(
    src,
    src_w * src_h,  // <-- can overflow for images > ~46k x 46k
    ...

This is safe given practical image sizes going through this pipeline (the upstream already constrains dimensions), but casting to int64_t before the multiply would be more defensive. Low priority.

No validation of in_channels or channel offsets

The function accepts any in_channels value, but the NEON path only handles 4 (vld4) and "else" (vld3). If in_channels is something unexpected (e.g. 2 or 5), the NEON branch would silently use vld3 with wrong stride semantics. Similarly, r_off/g_off/b_off are documented as < in_channels but not checked.

Since this is an internal API with exactly two call sites (both correct), this is fine as-is. An ET_DCHECK would be a cheap safety net if desired:

ET_DCHECK_MSG(in_channels == 3 || in_channels == 4, "in_channels must be 3 or 4");

__attribute__((always_inline)) portability

image_processor_simd.cpp:31__attribute__((always_inline)) is GCC/Clang-only. This is gated behind ET_IMAGE_USE_NEON which implies ARM+Clang/GCC, so it's correct in practice. If Windows-on-ARM (MSVC) ever becomes a target, this would need a __forceinline equivalent behind a macro.

vDSP no_offset optimization removed

The old Apple path had a special no_offset branch using vsmul (cheaper than vsmsa) when bias terms were zero. The new NEON FMA approach always does multiply-add. On modern ARM cores with fused FMA units, vfmaq_f32 with a zero addend is essentially free, so this is not a regression. Just noting the intentional simplification.

Minor: image_processor_simd.h not in install list

CMakeLists.txt:68-75 — The header image_processor_simd.h is not in the install(FILES ...) lists. This is correct since it's an internal implementation header, but worth confirming the intent is that downstream consumers never call deinterleave_to_chw directly.


Summary: Looks good. The NEON kernel is idiomatic, the widen→FMA→store pipeline is efficient, and the unification eliminates meaningful code duplication. No correctness issues found. The minor points above are optional hardening.


@meta-codesync meta-codesync Bot changed the title Switch to neon for interleave Switch to neon for interleave (#20137) Jun 9, 2026
@metascroy metascroy force-pushed the export-D107958353 branch from 1d1f264 to 3342f2e Compare June 9, 2026 18:56
metascroy added a commit to metascroy/executorch that referenced this pull request Jun 9, 2026
Summary:

The BGRA/RGB → planar-CHW-float deinterleave + normalization step was implemented twice and sub-optimally: the Apple backend used a strided vDSP gather (vDSP_vfltu8 ×3 + vDSP_vsmsa, ~6 passes over the input), and the portable/Android backend used a scalar triple-nested loop. This replaces both with a single hand-vectorized kernel in a new shared translation unit.

image_processor_simd.{h,cpp} provides deinterleave_to_chw():

* One vld4q_u8 (BGRA/RGBA) or vld3q_u8 (RGB) read, widen uint8→float in-register, fused per-channel affine out = in*(scale/std) + (-mean/std) via vfmaq_f32, single write per plane.
* NEON on ARM (all shipping iOS/Apple-silicon targets and Android arm64), scalar fallback elsewhere.
* Handles the fast (contiguous) path plus a row-by-row slow path for stride padding and letterbox offsets.

Both backends now call it.

From benchmark script, GPU is stable (within noise), and CPU is on par at min (0.98x) to 1.86x faster at max.

```
CPU rows       n=120  median=1.19x  min=0.98x  max=1.86x
GPU rows       n=36   median=1.11x  min=0.87x  max=1.41x
Default rows   n=24   median=1.29x  min=0.99x  max=1.71x

```

See D102373165 for inspiration.

Differential Revision: D107958353
metascroy added a commit to metascroy/executorch that referenced this pull request Jun 9, 2026
Summary:

The BGRA/RGB → planar-CHW-float deinterleave + normalization step was implemented twice and sub-optimally: the Apple backend used a strided vDSP gather (vDSP_vfltu8 ×3 + vDSP_vsmsa, ~6 passes over the input), and the portable/Android backend used a scalar triple-nested loop. This replaces both with a single hand-vectorized kernel in a new shared translation unit.

image_processor_simd.{h,cpp} provides deinterleave_to_chw():

* One vld4q_u8 (BGRA/RGBA) or vld3q_u8 (RGB) read, widen uint8→float in-register, fused per-channel affine out = in*(scale/std) + (-mean/std) via vfmaq_f32, single write per plane.
* NEON on ARM (all shipping iOS/Apple-silicon targets and Android arm64), scalar fallback elsewhere.
* Handles the fast (contiguous) path plus a row-by-row slow path for stride padding and letterbox offsets.

Both backends now call it.

From benchmark script, GPU is stable (within noise), and CPU is on par at min (0.98x) to 1.86x faster at max.

```
CPU rows       n=120  median=1.19x  min=0.98x  max=1.86x
GPU rows       n=36   median=1.11x  min=0.87x  max=1.41x
Default rows   n=24   median=1.29x  min=0.99x  max=1.71x

```

See D102373165 for inspiration.

Differential Revision: D107958353
@metascroy metascroy force-pushed the export-D107958353 branch from 3342f2e to 62eec46 Compare June 9, 2026 18:57
Summary:

The BGRA/RGB → planar-CHW-float deinterleave + normalization step was implemented twice and sub-optimally: the Apple backend used a strided vDSP gather (vDSP_vfltu8 ×3 + vDSP_vsmsa, ~6 passes over the input), and the portable/Android backend used a scalar triple-nested loop. This replaces both with a single hand-vectorized kernel in a new shared translation unit.

image_processor_simd.{h,cpp} provides deinterleave_to_chw():

* One vld4q_u8 (BGRA/RGBA) or vld3q_u8 (RGB) read, widen uint8→float in-register, fused per-channel affine out = in*(scale/std) + (-mean/std) via vfmaq_f32, single write per plane.
* NEON on ARM (all shipping iOS/Apple-silicon targets and Android arm64), scalar fallback elsewhere.
* Handles the fast (contiguous) path plus a row-by-row slow path for stride padding and letterbox offsets.

Both backends now call it.

From benchmark script, GPU is stable (within noise), and CPU is on par at min (0.98x) to 1.86x faster at max.

```
CPU rows       n=120  median=1.19x  min=0.98x  max=1.86x
GPU rows       n=36   median=1.11x  min=0.87x  max=1.41x
Default rows   n=24   median=1.29x  min=0.99x  max=1.71x

```

See D102373165 for inspiration.

Differential Revision: D107958353
@metascroy metascroy force-pushed the export-D107958353 branch from 62eec46 to c7f4f74 Compare June 9, 2026 19:13
@metascroy

Copy link
Copy Markdown
Contributor Author

@claude review this code

@metascroy metascroy requested a review from psiddh June 9, 2026 19:14
@claude

claude Bot commented Jun 9, 2026

Copy link
Copy Markdown

Claude finished @metascroy's task in 1m 42s —— View job


Code Review: Switch to NEON for interleave

  • Read SIMD kernel (image_processor_simd.{h,cpp})
  • Review integration (image_processor.cpp, image_processor_apple.cpp)
  • Check build files (CMakeLists.txt, targets.bzl)
  • Post review

Overall: Clean, correct unification of two duplicated deinterleave+normalize implementations into one well-vectorized kernel. I verified the math and the bounds; no correctness issues found. The hardening suggestions from the earlier review round appear to have already landed in this revision (see below).

Correctness — verified

  • FMA math matches scalar. vfmaq_f32(vb, x, va) = vb + x*va = x*a + b, identical to the scalar p[off]*a + b. Affine coefficients (a = scale/std, b = -mean/std) and the RGB-indexed norm are consistent across both call sites (BGRA → {2,1,0}, RGB/RGBA → {0,1,2}).
  • No over-read/over-write. The i + 16 <= n guards keep both vld4q_u8 (64 B) and vld3q_u8 (48 B) reads inside the run, and each widen_fma_store writes exactly [i, i+16). In the fast path the run is exactly n*in_channels bytes and spatial == n; in the slow path each row reads ≤ src_w pixels (within src_stride) and writes within (y+offset_y)*final_w + offset_x + src_w ≤ final_w*final_h. The strided/letterbox call sites (Apple BGRA with row padding, P010) correctly avoid the fast path.
  • Earlier-round items already addressed: src_w*src_h now uses static_cast<size_t> (image_processor_simd.cpp:142), and the ET_DCHECK_MSG guards on in_channels/channel offsets are present (:116-120).

Minor observations (optional, non-blocking)

  1. FP determinism across the NEON/scalar boundary. The NEON path uses a single-rounding FMA; the scalar fallback (:93-95) is a separate multiply + add, which may round twice (or contract to FMA depending on -ffp-contract). Results can differ in the last bit between ARM and non-ARM, and even within a single image at the <16-pixel scalar tail. Negligible for preprocessing, but if any golden test does an exact float compare it could become flaky — confirm comparisons are tolerance-based.

  2. ET_DCHECK compiles out in release. In NDEBUG builds an unexpected in_channels (e.g. 2 or 5) would silently fall into the vld3 branch with wrong stride semantics. Fine for an internal API with two correct call sites; would only matter if deinterleave_to_chw ever becomes part of the public surface.

  3. The in_channels == 4 / == 3 loop bodies differ only by vld4q/vld3q. Leaving them as two explicit loops reads better than templating over the load — consistent with the repo's "prefer simpler/concise" guidance, so no change suggested.

Nothing here needs to block the PR.

@psiddh psiddh left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, this is great to have NEON accelerated path

@meta-codesync meta-codesync Bot merged commit 92e6a4c into pytorch:main Jun 10, 2026
181 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. meta-exported

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants