Switch to neon for interleave (#20137)
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20137
Note: Links to docs will display an error until the docs builds have been completed.
⏳ No Failures, 1 Pending
As of commit c7f4f74 with merge base af92b60.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@metascroy has exported this pull request. If you are a Meta employee, you can view the originating Diff in D107958353.
@claude review this code
Claude finished @metascroy's task in 1m 21s
Code Review: Switch to NEON for interleave
Overall: This is a clean refactoring that unifies two sub-optimal deinterleave implementations into a single well-vectorized kernel. The NEON intrinsics are correct and the architecture (fast path + row-by-row slow path) is sound. A few observations below:
Potential integer overflow in fast path
deinterleave_run(
    src,
    src_w * src_h, // <-- can overflow for images > ~46k x 46k
    ...
This is safe given practical image sizes going through this pipeline (the upstream already constrains dimensions), but casting to
No validation of
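On the integer-overflow observation above: a minimal sketch of why the threshold sits near 46k x 46k and how widening before the multiply avoids it. It assumes the dimensions are stored as signed 32-bit integers, which the quoted snippet does not show; `pixel_count` and `product_fits_int32` are hypothetical helper names, not functions from this PR.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper: computes w * h in 64 bits so the product cannot wrap.
// Precondition (assumed): w >= 0 and h >= 0.
inline std::uint64_t pixel_count(std::int32_t w, std::int32_t h) {
  // Cast each operand first; casting the product would be too late.
  return static_cast<std::uint64_t>(w) * static_cast<std::uint64_t>(h);
}

// Hypothetical helper: true if w * h would still fit in a signed 32-bit int.
inline bool product_fits_int32(std::int32_t w, std::int32_t h) {
  const std::uint64_t limit =
      static_cast<std::uint64_t>(std::numeric_limits<std::int32_t>::max());
  return pixel_count(w, h) <= limit;
}
```

Since 46340 * 46340 = 2147395600 is below INT32_MAX (2147483647) and 46341 * 46341 = 2147488281 is above it, a square image is the boundary case at a side of 46340 pixels.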
1d1f264 to 3342f2e
Summary:
The BGRA/RGB → planar-CHW-float deinterleave + normalization step was implemented twice and sub-optimally: the Apple backend used a strided vDSP gather (vDSP_vfltu8 ×3 + vDSP_vsmsa, ~6 passes over the input), and the portable/Android backend used a scalar triple-nested loop. This replaces both with a single hand-vectorized kernel in a new shared translation unit.
image_processor_simd.{h,cpp} provides deinterleave_to_chw():
* One vld4q_u8 (BGRA/RGBA) or vld3q_u8 (RGB) read, widen uint8→float in-register, fused per-channel affine out = in*(scale/std) + (-mean/std) via vfmaq_f32, single write per plane.
* NEON on ARM (all shipping iOS/Apple-silicon targets and Android arm64), scalar fallback elsewhere.
* Handles the fast (contiguous) path plus a row-by-row slow path for stride padding and letterbox offsets.
Both backends now call it.
In the benchmark script, GPU is stable (within noise), and CPU ranges from on par at worst (0.98x) to 1.86x faster at best.
```
CPU rows n=120 median=1.19x min=0.98x max=1.86x
GPU rows n=36 median=1.11x min=0.87x max=1.41x
Default rows n=24 median=1.29x min=0.99x max=1.71x
```
See D102373165 for inspiration.
Differential Revision: D107958353
3342f2e to 62eec46
62eec46 to c7f4f74
@claude review this code
Claude finished @metascroy's task in 1m 42s
Code Review: Switch to NEON for interleave
Overall: Clean, correct unification of two duplicated deinterleave+normalize implementations into one well-vectorized kernel. I verified the math and the bounds; no correctness issues found. The hardening suggestions from the earlier review round appear to have already landed in this revision (see below).
Correctness: verified
Minor observations (optional, non-blocking)
Nothing here needs to block the PR.