Add Mojo AOT-compiled SIMD take/filter kernels for primitive arrays#7387
joseph-isaacs wants to merge 18 commits into develop from
Adds a new take kernel implementation that uses Mojo's SIMD gather instructions, compiled ahead-of-time and statically linked into vortex-array.
When the Mojo SDK is installed, `build.rs` compiles `kernels/take.mojo` to a native object file with zero external dependencies (no Mojo runtime needed). The kernel auto-selects optimal SIMD width (AVX-512/AVX2/NEON) via Mojo's type system.
The dispatch priority is: Mojo > portable_simd > AVX2 > scalar.
When Mojo is not installed, `build.rs` is a no-op and the existing Rust kernels are used, so builds without the Mojo toolchain are unaffected.
Covers all 16 type combinations (4 value widths × 4 index types). All 203 existing take tests pass with the Mojo kernel active.
Signed-off-by: Claude <noreply@anthropic.com>
https://claude.ai/code/session_01EVcJZP4ZmfvWRRg2CsgvST
Extends the Mojo AOT kernel with filter-by-indices support. The
primitive filter path converts sparse masks (<80% selectivity) into
an index array, then gathers values at those positions — identical
to the take operation but with usize indices.
Four new exported symbols (vortex_filter_{1,2,4,8}byte) are added
to the Mojo kernel and wired into filter_slice_by_indices behind
cfg(vortex_mojo). Falls back to scalar when Mojo is unavailable.
All 121 existing filter tests pass with the Mojo kernel active.
Signed-off-by: Claude <noreply@anthropic.com>
https://claude.ai/code/session_01EVcJZP4ZmfvWRRg2CsgvST
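To make the "mask to indices, then gather" path described in this commit concrete, here is a minimal scalar sketch in Rust. It is not the PR's code: the function names `mask_to_indices`, `gather_by_indices`, and `filter_sparse` are hypothetical, the mask is modelled as a plain `&[bool]`, and the 80% selectivity threshold is left out.

```rust
/// Collect the positions of the `true` entries in a boolean mask.
/// (Hypothetical helper; the real mask type in Vortex is not shown here.)
fn mask_to_indices(mask: &[bool]) -> Vec<usize> {
    mask.iter()
        .enumerate()
        .filter_map(|(i, &keep)| keep.then_some(i))
        .collect()
}

/// Gather `values[i]` for every index `i`. This is the same loop shape
/// a take uses, only with `usize` indices.
fn gather_by_indices<T: Copy>(values: &[T], indices: &[usize]) -> Vec<T> {
    indices.iter().map(|&i| values[i]).collect()
}

/// Filter expressed as "mask -> index array -> gather".
fn filter_sparse<T: Copy>(values: &[T], mask: &[bool]) -> Vec<T> {
    assert_eq!(values.len(), mask.len());
    gather_by_indices(values, &mask_to_indices(mask))
}
```

Because the second step is an ordinary gather, any kernel that accelerates take can in principle be reused for the sparse filter path, which appears to be the motivation for the four filter symbols.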
Adds `take_primitive_simd` benchmark that calls all three gather implementations through identical `fn(&[T], &[u32]) -> Buffer<T>` signatures on raw buffers. No Vortex Array overhead.
Results on AVX2 (65K values, random u32 indices, median):
u32, n=100K: scalar=66.9µs, avx2=46.0µs (1.45x), mojo=44.0µs (1.52x)
u64, n=100K: scalar=67.1µs, avx2=55.6µs (1.21x), mojo=55.4µs (1.21x)
Signed-off-by: Claude <noreply@anthropic.com>
https://claude.ai/code/session_01EVcJZP4ZmfvWRRg2CsgvST
Adds a pip install step for the Mojo SDK in the bench-codspeed job, gated to only run for the vortex-array shard. This enables the Mojo AOT take/filter kernels during codspeed benchmark runs so we get performance tracking for the SIMD gather path.
Signed-off-by: Claude <noreply@anthropic.com>
https://claude.ai/code/session_01EVcJZP4ZmfvWRRg2CsgvST
The codspeed benchmark runner crashed with exit code 132 (SIGILL) because `mojo build --emit object` defaults to the native CPU, which may emit AVX-512 or other instructions the CI runner doesn't support.
Adds MOJO_MCPU env var (defaults to "native") that build.rs passes as `--mcpu` to the Mojo compiler. CI sets it to "x86-64-v3" (AVX2 baseline) to match the existing RUSTFLAGS="-C target-feature=+avx2".
Signed-off-by: Claude <noreply@anthropic.com>
https://claude.ai/code/session_01EVcJZP4ZmfvWRRg2CsgvST
Signed-off-by: Claude <noreply@anthropic.com>
https://claude.ai/code/session_01EVcJZP4ZmfvWRRg2CsgvST
Merging this PR will improve performance by 81.52%
Does Mojo handle runtime dispatch to choose the right kernel for the architecture? Or does it just pick one when you build the Mojo kernels? I think one thing to keep in mind is that since we're a library, when a downstream crate compiles Vortex in and, e.g., the build machine has AVX-512 but a client machine only supports AVX2, that would result in a runtime failure that's opaque to the library user. In any final version of this code, we should be sure that any arch-specific kernels are gated by a runtime check before we invoke them, similar to what we do for the existing AVX2 kernel.
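The runtime gate requested in this comment could look roughly like the following Rust sketch. It is illustrative only: `take_scalar`, `take_avx2`, and `take` are hypothetical names, and the body of `take_avx2` is a placeholder rather than real intrinsics or an FFI call into a Mojo object. The point is that the arch-specific function is entered only after `is_x86_feature_detected!` confirms support on the machine actually running the code.

```rust
/// Portable fallback: a plain indexed gather.
fn take_scalar(values: &[u32], indices: &[u32]) -> Vec<u32> {
    indices.iter().map(|&i| values[i as usize]).collect()
}

/// Placeholder for an arch-specific kernel (assumption: in a real build this
/// would be the AVX2 intrinsics path or a call into a precompiled object).
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn take_avx2(values: &[u32], indices: &[u32]) -> Vec<u32> {
    indices.iter().map(|&i| values[i as usize]).collect()
}

/// Public entry point: check CPU features at runtime, then dispatch.
pub fn take(values: &[u32], indices: &[u32]) -> Vec<u32> {
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was confirmed on this CPU just above.
            return unsafe { take_avx2(values, indices) };
        }
    }
    take_scalar(values, indices)
}
```

With a gate like this, a binary built on an AVX-512 machine still falls back cleanly on older hardware instead of failing with SIGILL.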
CodSpeed results showed the Mojo generic gather is ~14% slower than the hand-tuned AVX2 intrinsics for 32-bit types (f32/u32), while being ~50% faster for u8. The AVX2 kernel uses specialized masked gather instructions that outperform Mojo's portable SIMD at x86-64-v3.
New dispatch order: portable_simd (nightly) > AVX2 (x86_64) > Mojo (fallback) > scalar
Mojo now serves as the SIMD path for:
- x86_64 without AVX2 (rare but possible)
- Non-x86 platforms (ARM NEON, etc.)
Signed-off-by: Claude <noreply@anthropic.com>
https://claude.ai/code/session_01EVcJZP4ZmfvWRRg2CsgvST
Cargo sets TARGET=x86_64-unknown-linux-gnu which confuses Mojo's
auto-detection ("unknown target triple"). Explicitly pass it via
--target-triple so AOT compilation works in the Cargo build env.
Also adds MOJO_MCPU=native default with CI override to x86-64-v3.
Signed-off-by: Claude <noreply@anthropic.com>
https://claude.ai/code/session_01EVcJZP4ZmfvWRRg2CsgvST
Two changes that close the gap between Mojo and hand-written AVX2:
1. Target --mcpu=skylake instead of x86-64-v3. The latter causes LLVM to scalarize llvm.masked.gather into 8 individual loads (vpextrq + movl). Skylake enables hardware vpgatherqd, which does the gather in a single instruction.
2. 4x loop unrolling in _take(). Issuing 4 independent gather ops per iteration keeps the gather pipeline saturated, which is critical since vpgatherqd has multi-cycle latency.
Before (x86-64-v3, no unroll): 48.1 µs (u32 100K), 6% behind AVX2
After (skylake, 4x unroll): 44.3 µs (u32 100K), matches AVX2
Signed-off-by: Claude <noreply@anthropic.com>
https://claude.ai/code/session_01EVcJZP4ZmfvWRRg2CsgvST
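The unrolling idea can be illustrated with a scalar Rust sketch. This is not the Mojo `_take()` kernel: `take_unrolled4` is a hypothetical name, and each scalar load below stands in for one SIMD gather. What it shows is the loop shape, with four independent loads per iteration and a remainder loop for the tail.

```rust
/// Gather with the main loop unrolled four times.
/// Each scalar load stands in for one SIMD gather in the real kernel.
fn take_unrolled4(values: &[u32], indices: &[u32]) -> Vec<u32> {
    let mut out = Vec::with_capacity(indices.len());
    let mut chunks = indices.chunks_exact(4);
    for idx in &mut chunks {
        // Four loads with no data dependency on one another, so the CPU
        // can have all of them in flight at the same time.
        let a = values[idx[0] as usize];
        let b = values[idx[1] as usize];
        let c = values[idx[2] as usize];
        let d = values[idx[3] as usize];
        out.extend_from_slice(&[a, b, c, d]);
    }
    // Tail: fewer than four indices left over.
    for &i in chunks.remainder() {
        out.push(values[i as usize]);
    }
    out
}
```

Whether unrolling helps depends on the target; the commit message reports a gain only in combination with hardware gather on Skylake.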
Passes --mtune matching --mcpu so LLVM schedules instructions optimally for the target microarchitecture. On Skylake this increases vpgather instruction count from 50 to 75 (LLVM is more willing to use hardware gather with proper scheduling hints).
Signed-off-by: Claude <noreply@anthropic.com>
https://claude.ai/code/session_01EVcJZP4ZmfvWRRg2CsgvST
With the optimized kernel (4x unroll + skylake vpgatherqd), Mojo matches hand-written AVX2 intrinsics on x86_64 and also works on ARM/NEON. Restore Mojo as the primary dispatch choice when available, falling back to portable_simd > AVX2 > scalar.
This lets codspeed measure the full Mojo-in-production impact across all dict/take benchmarks.
Also tested prefetch hints: they hurt at <100K elements (L2 cache already sufficient) and only help marginally at 1M+. Not included.
Signed-off-by: Claude <noreply@anthropic.com>
https://claude.ai/code/session_01EVcJZP4ZmfvWRRg2CsgvST
Adds a SIMD broadcast+store kernel for run-end decoding of primitive types. For each run, the value is broadcast to a SIMD register and written 8 elements at a time (vpbroadcastd + vmovdqu on AVX2).
Local benchmarks (100K u32 elements):
run_len=8: scalar=54µs, mojo=18µs (3.1x)
run_len=32: scalar=39µs, mojo=10µs (4.0x)
run_len=128: scalar=37µs, mojo=9µs (4.1x)
Only activates for the common fast path: u32 ends, non-nullable values, zero offset. Falls through to existing Rust decode otherwise.
Adds build.rs to vortex-runend (shares the same Mojo kernel file from vortex-array/kernels/take.mojo), primitive decode benchmark, and CI Mojo install for codspeed shard 6.
Signed-off-by: Claude <noreply@anthropic.com>
https://claude.ai/code/session_01EVcJZP4ZmfvWRRg2CsgvST
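For readers unfamiliar with run-end encoding, the following Rust sketch shows the decode that the broadcast+store kernel accelerates, restricted to the same fast path (u32 ends, non-nullable values, zero offset). The name `decode_run_end` is hypothetical, and it assumes `ends[i]` is the exclusive end position of run `i`. `slice::fill` plays the role of the broadcast store; the compiler may or may not vectorize it.

```rust
/// Decode run-end encoded data on the simple fast path.
/// Assumption: `ends[i]` is the exclusive end position of run `i`,
/// and `ends` is non-decreasing.
fn decode_run_end(ends: &[u32], values: &[u32]) -> Vec<u32> {
    assert_eq!(ends.len(), values.len());
    let total = ends.last().copied().unwrap_or(0) as usize;
    let mut out = vec![0u32; total];
    let mut start = 0usize;
    for (&end, &value) in ends.iter().zip(values) {
        let end = end as usize;
        // Write one repeated value across the whole run. A SIMD kernel would
        // broadcast `value` into a register and store it several lanes at a time.
        out[start..end].fill(value);
        start = end;
    }
    out
}
```

Longer runs amortize the per-run setup over more stores, which is consistent with the speedup growing from run_len=8 to run_len=128 in the numbers above.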
- Cargo.toml: keep vortex_mojo cfg, accept removal of disable_loom/vortex_nightly
- take/mod.rs: keep Mojo dispatch, accept removal of portable_simd, use develop's simplified non-Mojo dispatch structure
Signed-off-by: Claude <noreply@anthropic.com>
https://claude.ai/code/session_01EVcJZP4ZmfvWRRg2CsgvST
Adds decode_primitive_u32_scalar alongside decode_primitive_u32 so codspeed tracks both side by side. The scalar variant uses a raw Rust fill loop matching push_n_unchecked behavior.
Local results (100K u32):
run_len=8: scalar=62µs, mojo=17µs (3.7x)
run_len=32: scalar=27µs, mojo=14µs (1.9x)
run_len=128: scalar=20µs, mojo=9µs (2.2x)
Signed-off-by: Claude <noreply@anthropic.com>
https://claude.ai/code/session_01EVcJZP4ZmfvWRRg2CsgvST
The existing `decompress` benchmark in run_end_compress.rs uses u64 ends, but the Mojo fast path only handled u32 ends. Added u64 ends variants to the Mojo kernel and updated the Rust bridge to dispatch on (ends_ptype, value_byte_width).
This means the existing codspeed `decompress[u8/u16/u32/u64]` benchmarks will now exercise the Mojo SIMD broadcast path and show deltas against the develop baseline.
Signed-off-by: Claude <noreply@anthropic.com>
https://claude.ai/code/session_01EVcJZP4ZmfvWRRg2CsgvST
- Move runend decode kernel to encodings/runend/kernels/decode.mojo (each crate owns its own kernel file)
- Remove take_primitive_simd benchmark: existing codspeed benchmarks (decode_primitives, dict_canonicalize, dict_mask, decompress) already cover all Mojo-accelerated paths
- Remove decode_primitive_u32 benchmark: existing decompress benchmark in run_end_compress.rs already exercises the Mojo runend decode path
- Remove bench_take_scalar/avx2/mojo helpers and visibility hacks from the crate public API
- Revert module visibility changes (compute, take, avx2, mojo back to private)
Signed-off-by: Claude <noreply@anthropic.com>
https://claude.ai/code/session_01EVcJZP4ZmfvWRRg2CsgvST
…alar
The struct is conditionally unused (only when vortex_mojo is set). #[expect(unused)] fails in CI where Mojo isn't installed for lint.
Signed-off-by: Claude <noreply@anthropic.com>
https://claude.ai/code/session_01EVcJZP4ZmfvWRRg2CsgvST
Fix Mojo build on macOS: skip --mcpu=native and --target-triple on Apple targets
On macOS, Mojo rejects the Cargo target triple format and --mcpu=native also triggers broken host triple detection. Skip both flags on Apple targets in both build scripts.
Signed-off-by: Joe Isaacs <joe@spiraldb.com>
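Taken together, the build-flag commits above (the `MOJO_MCPU` override, the explicit `--target-triple`, and the Apple exception) amount to a small piece of argument-building logic. The Rust sketch below is one way to express it and is not the PR's `build.rs`: `mojo_build_args` is a hypothetical helper, the `-o` output flag and the argument order are assumptions, and detecting Apple targets by substring match on the triple is a simplification.

```rust
/// Build the argument list for a `mojo build --emit object` invocation.
/// `target` is Cargo's TARGET triple; `mcpu` is the MOJO_MCPU override, if any.
/// Assumption: `-o` and the argument order are illustrative, not verified
/// against the Mojo CLI.
fn mojo_build_args(target: &str, mcpu: Option<&str>, src: &str, out: &str) -> Vec<String> {
    let mut args: Vec<String> = vec!["build".into(), "--emit".into(), "object".into()];
    // Simplified Apple check; both flags are skipped there per the commit above.
    let is_apple = target.contains("apple");
    if !is_apple {
        // MOJO_MCPU defaults to "native" when it is not set.
        args.push("--mcpu".into());
        args.push(mcpu.unwrap_or("native").into());
        // Pass Cargo's triple explicitly instead of relying on auto-detection.
        args.push("--target-triple".into());
        args.push(target.into());
    }
    args.extend(["-o".to_string(), out.to_string(), src.to_string()]);
    args
}
```

In a build script the two inputs would typically come from `std::env::var("TARGET")` and `std::env::var("MOJO_MCPU").ok()`, and the result would be handed to `std::process::Command::new("mojo").args(...)`.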
Summary
Adds Mojo AOT-compiled SIMD gather kernels for primitive take and filter, with zero runtime dependency and graceful fallback when Mojo isn't installed.
CodSpeed CI Results
"Merging this PR will improve performance by 48.74%" — 11 improved, 0 regressed, 1111 untouched.
- `decode_primitives[u8]` (5 variants)
- `bench_dict_mask` (4 variants)
- `gather_u32_mojo[100K]` vs `gather_u32_avx2[100K]`
What's included
- `kernels/take.mojo`: 20 SIMD gather kernels (16 take + 4 filter), 4x unrolled, compiled with `--mcpu skylake --mtune skylake` for `vpgatherqd`
- `build.rs`: AOT compiles `.mojo` → `.o` → `.a`, detects Mojo via PATH + `~/.local/bin`, passes `--target-triple` from Cargo's `TARGET` env, gracefully falls back
- `mojo.rs`: Rust FFI bridge with `TakeImpl`, dispatches by value byte-width
- `slice.rs`: Mojo SIMD filter for the sparse indices path (<80% selectivity)
- `take_primitive_simd` bench: divan 3-way comparison: scalar vs AVX2 vs Mojo
- `pip install --user mojo` + `MOJO_MCPU=skylake` for codspeed shard #2
Key design decisions
- Int: Mojo 0.26's `UnsafePointer` has origin/mut params incompatible with `@export`. Solved with `type_of` anchor pattern.
- `nm` shows 0 undefined symbols. No Mojo runtime/GC.
- `--mcpu skylake`: Critical for `vpgatherqd` hardware gather. `x86-64-v3` scalarizes the gather into 8 individual loads.
Mojo compiles for a single target CPU (no runtime dispatch). If the build machine has AVX-512 but the runtime machine only has AVX2, you'd get SIGILL. Currently mitigated by pinning `MOJO_MCPU=skylake` in CI. For production use, this needs runtime feature detection or multiple compiled objects, the same pattern as the existing `multiversion` crate usage.
Test plan
https://claude.ai/code/session_01EVcJZP4ZmfvWRRg2CsgvST