Fused delta(for(bitpacking)) decode (unstable_encodings)#8224
Draft
joseph-isaacs wants to merge 7 commits into
Wire the new `fastlanes::Delta::unfor_undelta_pack` kernel into delta decompression. When a DeltaArray's `deltas` child is a FoR array (unsigned reference) wrapping a BitPacked array stored as full, zero-offset chunks with no patches, `delta_decompress` now takes a fully fused fast path (`try_fused_for_bitpacking` -> `decompress_fused`) that unpacks, applies the frame-of-reference, and inverts the delta encoding in a single pass per chunk before untransposing. All other shapes fall back to the existing generic path. A round-trip test builds the stack from non-strictly-increasing (monotone non-decreasing) u32/u64 columns and asserts the fused path is actually taken. The `delta_for_bitpack` divan bench compares the fused decode against an unfused baseline (materialize the FoR(bitpacked) deltas, then generic delta decode). On non-decreasing columns the fused path is ~1.3-2.0x faster, with the gap widening at larger sizes and for u64. A local-dev `[patch.crates-io]` points fastlanes at the sibling checkout that carries the kernel; it would be replaced by a published version bump. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
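To illustrate what "a single pass per chunk" buys, here is a minimal scalar sketch. It is not the `fastlanes` kernel: it uses a flat, non-transposed layout and plain `u64` words, and every name in it (`pack`, `unpack_one`, `encode`, `decode_unfused`, `decode_fused`) is hypothetical. The unfused arm materializes the unpacked and FoR-decoded intermediates; the fused arm folds all three steps into one loop.

```rust
/// Pack `values` (each assumed to fit in `width` bits) back to back into u64 words.
fn pack(values: &[u64], width: u32) -> Vec<u64> {
    assert!((1..=64).contains(&width));
    let mut out = vec![0u64; (values.len() * width as usize).div_ceil(64)];
    for (i, &v) in values.iter().enumerate() {
        let bit = i * width as usize;
        let (word, shift) = (bit / 64, (bit % 64) as u32);
        out[word] |= v << shift;
        if shift + width > 64 {
            out[word + 1] |= v >> (64 - shift); // spill into the next word
        }
    }
    out
}

/// Read the `idx`-th `width`-bit value from a buffer produced by `pack`.
fn unpack_one(packed: &[u64], idx: usize, width: u32) -> u64 {
    let bit = idx * width as usize;
    let (word, shift) = (bit / 64, (bit % 64) as u32);
    let mask = if width == 64 { u64::MAX } else { (1u64 << width) - 1 };
    let mut v = packed[word] >> shift;
    if shift + width > 64 {
        v |= packed[word + 1] << (64 - shift);
    }
    v & mask
}

/// Delta-encode a non-decreasing column (all values >= `base`), subtract the
/// minimum delta as the FoR reference, and bit-pack the residuals.
/// Returns (packed words, bit width, FoR reference).
fn encode(values: &[u64], base: u64) -> (Vec<u64>, u32, u64) {
    let mut prev = base;
    let deltas: Vec<u64> = values.iter().map(|&v| { let d = v - prev; prev = v; d }).collect();
    let reference = deltas.iter().copied().min().unwrap_or(0);
    let residuals: Vec<u64> = deltas.iter().map(|d| d - reference).collect();
    let max = residuals.iter().copied().max().unwrap_or(0);
    let width = (64 - max.leading_zeros()).max(1);
    (pack(&residuals, width), width, reference)
}

/// Unfused: three passes, two intermediate vectors.
fn decode_unfused(packed: &[u64], len: usize, width: u32, reference: u64, base: u64) -> Vec<u64> {
    let unpacked: Vec<u64> = (0..len).map(|i| unpack_one(packed, i, width)).collect();
    let deltas: Vec<u64> = unpacked.iter().map(|&r| r.wrapping_add(reference)).collect();
    let mut acc = base;
    deltas.iter().map(|&d| { acc = acc.wrapping_add(d); acc }).collect()
}

/// Fused: unpack, add the FoR reference, and undo the delta in one loop.
fn decode_fused(packed: &[u64], len: usize, width: u32, reference: u64, base: u64) -> Vec<u64> {
    let mut out = Vec::with_capacity(len);
    let mut acc = base;
    for i in 0..len {
        acc = acc.wrapping_add(unpack_one(packed, i, width).wrapping_add(reference));
        out.push(acc);
    }
    out
}

fn main() {
    let values = vec![100u64, 102, 105, 112, 114, 121, 134];
    let (packed, width, reference) = encode(&values, 90);
    let fused = decode_fused(&packed, values.len(), width, reference, 90);
    assert_eq!(fused, decode_unfused(&packed, values.len(), width, reference, 90));
    println!("width={width} reference={reference} decoded={fused:?}");
}
```

Both arms produce the same column; the difference is only in how many intermediate buffers are allocated and walked, which is the cost the fused path is meant to avoid.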
Put the fused decode fast path (`try_fused_for_bitpacking` / `decompress_fused`), its imports, the round-trip test, and the bench behind a new `unstable_encodings` feature on vortex-fastlanes that enables `fastlanes/unstable`. With the feature off (the default) the kernel is compiled out entirely, so there is no `.text` cost; vortex-btrblocks' existing `unstable_encodings` feature now propagates it. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
The `[patch.crates-io]` previously pointed at a sibling `../fastlanes` checkout, which does not exist in CI and broke workspace resolution for every job. Point it at the pushed fastlanes branch (spiraldb/fastlanes#140) so the workspace resolves and both default and all-features builds compile. To be replaced by a published fastlanes version bump once that PR merges. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Split the combined `use` statements into one item per line and regroup, matching the repo's nightly rustfmt config (imports_granularity = "Item", group_imports = "StdExternalCrate"). No functional change. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Merging this PR will not alter performance
| | Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|---|
| 🆕 | Simulation | current_u64[65536] | N/A | 654 µs | N/A |
| 🆕 | Simulation | fused_u32[65536] | N/A | 235.3 µs | N/A |
| 🆕 | Simulation | fused_u64[65536] | N/A | 378 µs | N/A |
| 🆕 | Simulation | fused_u64[1048576] | N/A | 5.7 ms | N/A |
| 🆕 | Simulation | fused_u32[1048576] | N/A | 3.5 ms | N/A |
| 🆕 | Simulation | current_u32[1048576] | N/A | 5.6 ms | N/A |
| 🆕 | Simulation | current_u64[1048576] | N/A | 13.5 ms | N/A |
| 🆕 | Simulation | current_u32[65536] | N/A | 379.2 µs | N/A |
Comparing claude/delta-bitpacking-fastlanes-V6mTZ (1565f71) with develop (81046d7)
Point `unstable_encodings` at `fastlanes/delta_for_bitpacking` and bump the patched fastlanes git revision accordingly. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Expose `delta_decompress` / `delta_decompress_generic` under the `_test-harness` feature and rewrite the bench so both arms call the real decode entry points on the identical delta(for(bitpacking)) array: `fused` (the unfor_undelta_pack fast path) vs `current` (the pre-fusion generic decode). The previous baseline reused a cached intermediate and understated the gap; the cold-vs-cold comparison shows ~4.6x (u32 64Ki) to ~6.9x (u64 1Mi), dominated by avoiding the intermediate FoR-decoded PrimitiveArray materialization rather than kernel speed. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
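For readers unfamiliar with why the earlier baseline understated the gap, the following std-only sketch shows the "cold each iteration, fastest time" shape of the comparison. It is not the divan bench from this PR; `fastest_cold` and the two toy arms are hypothetical stand-ins. The point is that the input is rebuilt outside the timed region before every call, so neither arm can reuse an intermediate left over from a previous iteration.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Time `decode` `iters` times and return the fastest run. `setup` rebuilds
/// the input before every timed call (untimed), so each iteration starts cold.
fn fastest_cold<I, O>(
    iters: usize,
    mut setup: impl FnMut() -> I,
    mut decode: impl FnMut(I) -> O,
) -> Duration {
    let mut best = Duration::MAX;
    for _ in 0..iters {
        let input = setup(); // not timed
        let start = Instant::now();
        let out = decode(input);
        let elapsed = start.elapsed();
        black_box(out); // keep the result alive so the work is not optimized away
        best = best.min(elapsed);
    }
    best
}

fn main() {
    let make_deltas = || vec![3u64; 1 << 16];

    // Toy "current" arm: materialize an intermediate, then prefix-sum it.
    let current = fastest_cold(20, make_deltas, |d| {
        let shifted: Vec<u64> = d.iter().map(|x| x + 1).collect();
        let mut acc = 0u64;
        shifted.iter().map(|x| { acc += x; acc }).collect::<Vec<u64>>()
    });

    // Toy "fused" arm: same result in one pass, no intermediate vector.
    let fused = fastest_cold(20, make_deltas, |d| {
        let mut acc = 0u64;
        d.iter().map(|x| { acc += x + 1; acc }).collect::<Vec<u64>>()
    });

    println!("current: {current:?}, fused: {fused:?}");
}
```

The timings printed by this toy are not comparable to the numbers quoted in this PR; only the measurement structure carries over.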
Reference the exact fastlanes revision (spiraldb/fastlanes#140) instead of the branch for reproducibility. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Summary
Wires the new `fastlanes::Delta::unfor_undelta_pack` fused kernel into delta decompression, behind a new default-off `unstable_encodings` feature.
When a `DeltaArray`'s `deltas` child is a `FoR` array (unsigned reference) wrapping a `BitPacked` array stored as full, zero-offset chunks with no patches, `delta_decompress` takes a fully fused fast path (`try_fused_for_bitpacking` → `decompress_fused`): each chunk is unpacked, FoR-decoded, and un-delta'd in a single pass before untransposing. Every other shape (signed reference, patches, sliced bit-packing) falls back to the existing generic path unchanged.
Note
Depends on spiraldb/fastlanes#140 (the `delta_for_bitpacking` kernel). This branch carries a temporary `[patch.crates-io]` pinning `fastlanes` to rev `267717cd72e8b6f0ed0e5321ae3fc785fa433058`. It must be replaced by a published `fastlanes` version bump before merge. Until then, `Rust publish dry-run` and `Rust build (all-features)` are expected red because crates.io `fastlanes 0.5.0` has no `delta_for_bitpacking` feature (this is a standard stacked cross-repo PR: merge + release fastlanes first).
Feature flag
`vortex-fastlanes`: new `unstable_encodings = ["fastlanes/delta_for_bitpacking"]`. The fused path, its imports, the round-trip test, and the bench are all `#[cfg(feature = "unstable_encodings")]`.
`vortex-btrblocks`'s existing `unstable_encodings` feature propagates `vortex-fastlanes/unstable_encodings`.
With the feature off (default) the kernel and fast path are compiled out: no behavior or code-size change on the default build.
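As a rough picture of how the gate and the shape check interact, here is a self-contained sketch. It assumes one plausible structure (a `cfg`-selected helper that returns `None` when the feature is off); the PR's actual code may be organized differently, and `DeltasShape`, `DecodePath`, `try_fused`, and `choose_path` are hypothetical names, not Vortex APIs.

```rust
/// Hypothetical summary of the `deltas` child's shape; field names are illustrative.
#[allow(dead_code)]
#[derive(Clone, Copy)]
struct DeltasShape {
    for_reference_unsigned: bool,
    full_chunks: bool,
    zero_offset: bool,
    has_patches: bool,
}

#[allow(dead_code)]
#[derive(Debug, PartialEq)]
enum DecodePath {
    Fused,
    Generic,
}

/// Feature on: the fast path is offered only for the exact supported shape.
#[cfg(feature = "unstable_encodings")]
fn try_fused(shape: DeltasShape) -> Option<DecodePath> {
    let eligible = shape.for_reference_unsigned
        && shape.full_chunks
        && shape.zero_offset
        && !shape.has_patches;
    eligible.then_some(DecodePath::Fused)
}

/// Feature off (the default): the fast path does not exist in the binary.
#[cfg(not(feature = "unstable_encodings"))]
fn try_fused(_shape: DeltasShape) -> Option<DecodePath> {
    None
}

/// Every shape the fast path declines falls through to the generic decode.
fn choose_path(shape: DeltasShape) -> DecodePath {
    try_fused(shape).unwrap_or(DecodePath::Generic)
}

fn main() {
    let supported = DeltasShape {
        for_reference_unsigned: true,
        full_chunks: true,
        zero_offset: true,
        has_patches: false,
    };
    let patched = DeltasShape { has_patches: true, ..supported };
    println!("supported shape -> {:?}", choose_path(supported));
    println!("patched shape   -> {:?}", choose_path(patched));
}
```

Built without the feature, both calls report `Generic`; built with `--cfg 'feature="unstable_encodings"'` (or the equivalent Cargo feature), only the supported shape reports `Fused`.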
Tests
`fused_for_bitpacking_roundtrip` builds the stack from non-strictly-increasing u32/u64 columns, asserts the fused path is actually taken (not a silent fallback), and round-trips.
`cargo test -p vortex-fastlanes --lib delta::` (61 tests) passes; `cargo clippy --all-targets --all-features`, the default lib build, and nightly `fmt --check` are clean. The compat suite passes 35/35.
Performance: fused vs the real current Vortex decode
`benches/delta_for_bitpack.rs` A/Bs the real decode entry points on the same array: `fused` = `delta_decompress` (fast path) vs `current` = `delta_decompress_generic` (the pre-fusion path Vortex uses today). Cold each iteration, fastest time:
[results table not preserved]
The win is eliminating the intermediate FoR-decoded `PrimitiveArray` materialization (+ its validity mask + a second allocation/pass), not the kernel itself: the kernel is ~0.16 ns/elem while `current` spends ~3.3 ns/elem, i.e. ~95% of the current path is array machinery.
Is the kernel itself optimal? (asm)
Yes, measured locally. The fused kernel is at parity with the shipped `unfor_pack`/`undelta_pack` (within ~3%), and wider SIMD regresses realistic widths (AVX2/AVX-512 ~10% slower than SSE2; asm is clean `%ymm`, zero shuffles; it's port-throughput/frequency-bound, not codegen). Details in spiraldb/fastlanes#140.
Code-size analysis
The kernel is monomorphized per `(type × bit-width)`. Release `libfastlanes` rlib:
[table with columns `unstable_encodings` and `.text`; values not preserved]
Fully opt-in via the feature.
🤖 Generated with Claude Code