# FP16 Support in xsimd

xsimd is a C++ header-only library that abstracts SIMD (vectorization) intrinsics behind a single, generic API.
The same code — `xsimd::batch<float>` — compiles to optimal machine code on x86 SSE/AVX, ARM NEON and SVE, RISC-V, and WebAssembly, with no runtime overhead.
When an intrinsic is missing on a given target, xsimd falls back gracefully rather than failing or leaving the developer to write platform-specific branches.
This is why projects like Mozilla Firefox, Apache Arrow, Meta Velox, KDE Krita, and Pythran have adopted it as their vectorization layer.

FP16 — the 16-bit half-precision floating-point format — has become a first-class data type in modern computing.
It is the default storage format for large language model weights, the standard precision for neural network inference,
and increasingly the format of choice wherever memory bandwidth is the binding constraint.
Yet consuming or producing FP16 data from C++ SIMD code today requires writing painful, platform-specific intrinsics by hand.
xsimd currently has no FP16 support, forcing its users to drop out of the generic API the moment they touch half-precision data.

We propose to add vectorized FP16 support to xsimd — native FP16 operations where hardware supports them, and correct fallbacks everywhere else.

## Why FP16 Matters

**Memory bandwidth is a bottleneck.** Modern CPUs and GPUs are typically not compute-bound — they are memory-bandwidth-bound.
FP16 cuts data size in half versus FP32.
This means twice as many values fit in cache, twice as many elements move per memory transaction,
and larger working sets can stay resident in L2 or L3 caches without spilling to RAM.
The bandwidth saving alone, before any compute consideration, is the primary reason the format matters.

**SIMD registers double the throughput.** With native arithmetic support, FP16 operations double the number
of floating-point values processed per CPU cycle when reduced precision is acceptable.

**FP16 is widely used in AI.** Transformer weights, KV caches, activations, and embeddings are all routinely stored in FP16.
Any library that touches this data at any point of the training and inference pipeline must be able to consume and produce FP16 buffers efficiently.
Without FP16 support in xsimd, it becomes a limiting factor in an otherwise highly optimized data-transformation pipeline.

## Hardware Landscape

FP16 conversion and arithmetic are now widely available across all major SIMD families:
- **x86**: Early on, the `f16c` feature introduced SIMD conversion between FP16 and FP32 for efficient storage, while arithmetic was still performed in FP32.
  With the AVX-512 generation, support for performing operations directly in FP16 was introduced, with significant speedups.
- **ARM**: FP16 support (arithmetic, conversion, *etc.*) arrived with the Armv8.2-A extensions.
  This affects NEON operations on modern smartphones and all Apple silicon M-series chips.
  Coverage extends to the server side, with both SVE and SVE2 supporting FP16.

## Proposed Work

This proposal covers foundational FP16 support: native FP16 operations on platforms that provide hardware acceleration, and correct, efficient fallbacks everywhere else.

Concretely, this means:
- A new `xsimd::batch<xsimd::fp16>` type (or equivalent half-precision batch specialization) that can be loaded from and stored to FP16 buffers.
- Support for converting from and to `batch<float>`, mapping to the optimal hardware instruction where available, and a correct SIMD algorithm elsewhere.
- Native FP16 arithmetic operations — add, multiply, FMA, min, max, and comparison — on backends that provide hardware support, with FP32-based fallbacks on those that do not.

## Impact

Funding this development will directly open xsimd to the rapidly growing landscape of LLM and machine
learning workflows: local inference engines, model weight processing, and embedding pipelines.

Beyond new workloads, this will benefit existing projects using xsimd that already handle FP16 data.
For instance, Apache Arrow and Parquet process half-precision columns today without hardware-optimized SIMD support.
These projects stand to benefit directly, with little integration effort.

##### Are you interested in this project, in whole or in part? Contact us for more information on how to help fund it.