
Commit 5626b9d

Add Float16 support for xsimd
1 parent 9d2e611 commit 5626b9d

4 files changed

Lines changed: 89 additions & 0 deletions


Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
# FP16 Support in xsimd

xsimd is a C++ header-only library that abstracts SIMD (vectorization) intrinsics behind a single, generic API.
The same code — `xsimd::batch<float>` — compiles to optimal machine code on x86 SSE/AVX, ARM NEON and SVE, RISC-V, and WebAssembly, with no runtime overhead.
When an intrinsic is missing on a given target, xsimd falls back gracefully rather than failing or leaving the developer to write platform-specific branches.
This is why projects like Mozilla Firefox, Apache Arrow, Meta Velox, KDE Krita, and Pythran have adopted it as their vectorization layer.

FP16 — the 16-bit half-precision floating point format — has become a first-class data type in modern computing.
It is the default storage format for large language model weights, the standard precision for neural network inference,
and increasingly the format of choice wherever memory bandwidth is the binding constraint.
Yet consuming or producing FP16 data from C++ SIMD code today requires writing painful, platform-specific intrinsics by hand.
xsimd currently has no FP16 support, forcing its users to drop out of the generic API the moment they touch half-precision data.

We propose to add vectorized FP16 support to xsimd — native FP16 operations where hardware supports them, and correct fallbacks elsewhere.
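To make the status quo concrete, here is a sketch (a hypothetical helper, not xsimd code) of what consuming FP16 without library support looks like today: one hand-written branch per platform, plus a scalar fallback, shown here handling normal values only for brevity, where no intrinsic exists.

```cpp
#include <cstdint>
#include <cstring>
#if defined(__F16C__)
#include <immintrin.h>
#endif

// Convert 8 FP16 values to FP32. Without a vectorization layer, this
// means one hand-written branch per platform.
void fp16_to_fp32_8(const std::uint16_t* in, float* out) {
#if defined(__F16C__)
    // x86 with F16C: a single hardware conversion instruction.
    _mm256_storeu_ps(out, _mm256_cvtph_ps(
        _mm_loadu_si128(reinterpret_cast<const __m128i*>(in))));
#else
    // Scalar fallback (normal values only, for brevity): rebuild the
    // FP32 bit pattern from the FP16 sign, exponent, and mantissa.
    for (int i = 0; i < 8; ++i) {
        std::uint16_t h = in[i];
        std::uint32_t bits =
              (std::uint32_t(h & 0x8000u) << 16)            // sign
            | ((((h >> 10) & 0x1Fu) - 15u + 127u) << 23)    // rebias exponent
            | (std::uint32_t(h & 0x3FFu) << 13);            // widen mantissa
        std::memcpy(&out[i], &bits, sizeof(float));
    }
#endif
}
```

An equivalent NEON branch would be needed for ARM; xsimd's goal is to hide all of this behind one generic batch type.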
## Why FP16 Matters
**Memory bandwidth is a bottleneck.** Modern CPUs and GPUs are not compute-bound — they are memory-bandwidth-bound.
FP16 cuts data size in half versus FP32.
This means twice as many values fit in cache, twice as many elements move per memory transaction,
and large working sets can stay resident in L2 or L3 caches instead of spilling to RAM.
The bandwidth saving alone, before any compute consideration, is the primary reason the format matters.

**SIMD registers double the throughput.** With native arithmetic support, FP16 operations double the number
of floating point values processed per CPU cycle when the reduced precision is acceptable.

**FP16 is widely used in AI.** Transformer weights, KV caches, activations, and embeddings are all routinely stored in FP16.
Any library that processes or moves this data at any point of the training and inference pipeline must be able to consume and produce FP16 buffers efficiently.
Without FP16 support in xsimd, these projects become the limiting factor in an otherwise highly optimized data-transformation pipeline.
## Hardware Landscape
FP16 conversion and arithmetic are now widely available across all major SIMD families:

- **x86**: Early on, the `f16c` feature introduced SIMD conversion between FP16 and FP32 for efficient storage, while arithmetic was still performed in FP32.
With the AVX-512 generation, support for performing operations directly in FP16 was introduced, with significant speedups.
- **ARM**: FP16 support has become standard in recent ARM generations (ARMv8.2-a and later), covering arithmetic, conversion, *etc.*
This affects NEON operations on modern smartphones and all Apple silicon M-chips.
Coverage extends to the server side, with both SVE and SVE2 supporting FP16.
## Proposed Work
This proposal covers foundational FP16 support: native FP16 operations on platforms that provide hardware acceleration, and correct, efficient fallbacks everywhere else.

Concretely, this means:
- A new `xsimd::batch<xsimd::fp16>` type (or equivalent half-precision batch specialization) that can be loaded from and stored to FP16 buffers.
- Support for converting to and from `batch<float>`, mapping to the optimal hardware instruction where available, and a correct SIMD algorithm elsewhere.
- Native FP16 arithmetic operations — add, multiply, FMA, min, max, and comparison — on backends that provide hardware support, with FP32-based fallbacks on those that do not.
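As a sketch of the conversion work in the second bullet, the reverse direction (FP32 to FP16) can likewise be done in integer arithmetic. The version below, an illustration rather than the proposed implementation, rounds to nearest even for normal results and, for brevity, flushes FP16-subnormal results to signed zero (a complete fallback would round those too):

```cpp
#include <cstdint>
#include <cstring>

// Scalar encode of FP32 into IEEE 754 binary16.
std::uint16_t float_to_half(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    std::uint32_t sign = (bits >> 16) & 0x8000u;
    std::uint32_t mant = bits & 0x7FFFFFu;
    std::int32_t  exp  = static_cast<std::int32_t>((bits >> 23) & 0xFFu) - 127 + 15;

    if (((bits >> 23) & 0xFFu) == 0xFFu)                     // infinity or NaN
        return static_cast<std::uint16_t>(sign | 0x7C00u | (mant ? 0x200u : 0u));
    if (exp >= 0x1F)                                         // overflow -> infinity
        return static_cast<std::uint16_t>(sign | 0x7C00u);
    if (exp <= 0)                                            // subnormal/underflow
        return static_cast<std::uint16_t>(sign);             // flushed to +-0 here

    std::uint16_t half = static_cast<std::uint16_t>(
        sign | (static_cast<std::uint32_t>(exp) << 10) | (mant >> 13));
    std::uint32_t rest = mant & 0x1FFFu;                     // 13 dropped bits
    if (rest > 0x1000u || (rest == 0x1000u && (half & 1u)))
        ++half;                                              // round to nearest even
    return half;
}
```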
## Impact
Funding this development will directly open xsimd to the rapidly growing landscape of LLM and machine
learning workflows: local inference engines, model weight processing, and embedding pipelines.

Beyond new workloads, this will benefit existing projects using xsimd that already handle FP16 data.
For instance, Apache Arrow and Parquet process half-precision columns today without hardware-optimized SIMD support.
These projects stand to benefit directly and with minimal integration effort.
##### Are you interested in this project, in whole or in part? Contact us for more information on how to help us fund it.

src/components/fundable/projectsDetails.ts

Lines changed: 13 additions & 0 deletions
```diff
@@ -3,6 +3,7 @@ import JupyterGISRasterProcessingMD from "@site/src/components/fundable/descript
 import JupyterGISToolsForPythonAPIMD from "@site/src/components/fundable/descriptions/JupyterGISToolsForPythonAPI.md"
 import EmscriptenForgePackageRequestsMD from "@site/src/components/fundable/descriptions/EmscriptenForgePackageRequests.md"
 import SVE2SupportInXsimdMD from "@site/src/components/fundable/descriptions/SVE2SupportInXsimd.md"
+import Float16SupportInXsimdMD from "@site/src/components/fundable/descriptions/Float16SupportInXsimd.md"
 import MatrixOperationsInXtensorMD from "@site/src/components/fundable/descriptions/MatrixOperationsInXtensor.md"
 import BinaryViewInArrowCppMD from "@site/src/components/fundable/descriptions/BinaryViewInArrowCpp.md"
 import Decimal32InArrowCppMD from "@site/src/components/fundable/descriptions/Decimal32InArrowCpp.md"
@@ -75,6 +76,18 @@ export const fundableProjectsDetails = {
     currentFundingPercentage: 0,
     repoLink: "https://github.com/xtensor-stack/xsimd"
   },
+  {
+    category: "Scientific Computing",
+    title: "Float16 support in xsimd",
+    pageName: "Float16SupportInXsimd",
+    shortDescription: "xsimd is a C++ scientific library that abstracts low-level high-performance computing primitives across different hardware targets. We will add vectorized support for half-precision 16-bit float operations where hardware supports them, and correct fallbacks elsewhere.",
+    description: Float16SupportInXsimdMD,
+    price: "20 000 €",
+    maxNbOfFunders: 2,
+    currentNbOfFunders: 0,
+    currentFundingPercentage: 0,
+    repoLink: "https://github.com/xtensor-stack/xsimd"
+  },
   {
     category: "Scientific Computing",
     title: "Implementing Kazushige Goto Algorithms for Matrix Operations in xtensor",
```
Lines changed: 9 additions & 0 deletions
```diff
@@ -0,0 +1,9 @@
+import useDocusaurusContext from '@docusaurus/useDocusaurusContext';
+import GetAQuotePage from '@site/src/components/fundable/GetAQuotePage';
+
+export default function FundablePage() {
+  const { siteConfig } = useDocusaurusContext();
+  return (
+    <GetAQuotePage/>
+  );
+}
```
Lines changed: 9 additions & 0 deletions
```diff
@@ -0,0 +1,9 @@
+import useDocusaurusContext from '@docusaurus/useDocusaurusContext';
+import LargeProjectCardPage from '@site/src/components/fundable/LargeProjectCardPage';
+
+export default function FundablePage() {
+  const { siteConfig } = useDocusaurusContext();
+  return (
+    <LargeProjectCardPage/>
+  );
+}
```

0 commit comments
