Draft
23 commits
24ec51d
Add FSSTView encoding: a ListView-style FSST array
claude May 30, 2026
0b19233
FSSTView: fast FSST->view filter/take, smarter canonical compaction, …
claude May 30, 2026
df7aa0d
FSSTView: simplify Auto compaction to match benchmark results
claude May 30, 2026
60c4e03
FSSTView: zero-copy from_fsst conversion + chain-ops benchmark
claude May 30, 2026
43e99fa
FSSTView: add RunCoalesce ("export paired slices") canonicalization +…
claude May 30, 2026
8c61b50
FSSTView: skip fill_null in take for non-nullable indices; op-only be…
claude May 30, 2026
033112e
FSSTView: byte accounting (compressed + uncompressed) for the gap-mer…
claude May 30, 2026
cd88533
FSSTView: coalesce adjacent spans in the gather (compaction at export)
claude May 30, 2026
0d1232f
FSSTView: VarBin export + single-filter 2x2 export benchmark
claude May 30, 2026
694b251
FSSTView: bench the VarBin -> VarBinView conversion cost after export
claude May 30, 2026
30a8f93
FSSTView: database-style benches + RunDecode export ("export all in p…
claude May 30, 2026
6144875
FSSTView: 12x faster from_fsst conversion (push_unchecked over windows)
claude May 30, 2026
9eb6fd1
FSSTView: trim canonicalize allocations (defer ulens widen, cheap cum…
claude May 30, 2026
3c38645
FSSTView: benchmark on real FineWeb columns (url + text)
claude May 31, 2026
dea8281
FSSTView: trim to the production path for merge
claude May 31, 2026
6b7a1b8
FSSTView: benchmark the real FineWeb query predicates
claude May 31, 2026
648efca
FSSTView: add benches/README summarizing the three benchmarks + numbers
claude May 31, 2026
a5b0df2
FSSTView: add handover doc + continuation prompt
claude May 31, 2026
4fdd216
FSSTView: store end offsets, eliminating the conversion floor
claude May 31, 2026
6f6c8e6
FSSTView: trim decode-path allocations, redundant validation, and a b…
claude May 31, 2026
ab48ea9
FSSTView: add zero-copy conversion regression guard
claude May 31, 2026
b08f0fd
FSSTView: drop the RunDecode uncompressed-lengths precompute
claude May 31, 2026
d1418cf
FSSTView: inline the metadata PType getters, drop a stale comment
claude May 31, 2026
145 changes: 145 additions & 0 deletions FSSTVIEW_HANDOVER.md
@@ -0,0 +1,145 @@
# FSSTView — Handover

## TL;DR

Added a new **`FSSTView`** array encoding to Vortex: a ListView-style FSST that addresses its
compressed codes with separate per-element `offsets` + `ends` arrays instead of one monotonic
offsets array. This makes `filter` / `take` / `slice` **metadata-only** (rewrite only small index
arrays, reuse the compressed byte heap), where plain `FSST` rewrites the whole compressed code heap
per op. The decode cost moves to a single canonicalization at the end.

Storing the per-element **end offset** (rather than the size) makes the `FSST` → `FSSTView`
conversion allocation-free — both addressing arrays are zero-copy slices of the FSST's existing
offsets — which **eliminated the conversion floor** that previously made the view 9–16× slower than
`fsst` on tiny highly selective `url` predicates (see "Conversion floor — resolved" below).

- **Branch:** `claude/fsstview-conversion-floor-kRAeg` (built on the original
`claude/fsstview-array-listview-TdW45`).
- **Status:** merge-ready. 105 tests pass, `clippy --all-targets --all-features` clean,
`cargo +nightly fmt` clean, `vortex-file` builds, doc tests pass.
- **No PR opened yet** (was waiting on explicit request).
- **Scope:** additive, contained in `encodings/fsst/` plus 2 registration lines in `vortex-file`.

## What landed

New encoding `vortex.fsstview` in `encodings/fsst/src/fsstview/`:

| file | role |
| --- | --- |
| `array.rs` | encoding struct, `#[array_slots]` children (uncompressed_lengths, codes_offsets, codes_ends, codes_validity), VTable, serde, allocation-free `fsstview_from_fsst` conversion |
| `compute.rs` | metadata-only `FilterKernel` + `TakeExecute` |
| `ops.rs` | `scalar_at` |
| `slice.rs` | metadata-only `SliceReduce` |
| `from_fsst.rs` | `fsst_filter_to_view` / `fsst_take_to_view` helpers |
| `canonical.rs` | decode → `VarBinViewArray` / `VarBinArray`, with the `Auto` export strategy |
| `kernel.rs` / `rules.rs` | parent kernel + rule registration |
| `tests.rs` | conformance + agreement + nullable/gapped/RunDecode coverage + zero-copy conversion guard |

Registered in `vortex-file/src/lib.rs` (`register_default_encodings`). Public API:
`FSSTView`, `FSSTViewArray`, `FsstViewCompaction`, `canonicalize_fsstview_with`,
`canonicalize_fsstview_to_varbin`, `fsst_filter_to_view`, `fsst_take_to_view`, `fsstview_from_fsst`.

## Canonicalization strategy (`FsstViewCompaction::Auto`)

After metadata-only ops the survivors are scattered in the original heap; `Auto` picks how to
decode from the survivor layout:

- **Direct** — one contiguous run (untouched / sliced): single bulk decode, no copy.
- **RunDecode** — offsets monotonic, few runs (clustered/range filters, sorted takes): decode each
contiguous run straight into the element-ordered output, no gather copy. Threshold:
`runs <= len / 4`.
- **GatherBulk** — scattered (shuffle take) or fragmented (uniform-random filter): compact live
codes into one buffer, single bulk decode.

`RunDecode` and the gather coalescing came from the optimization work; `PerElement` and
`RunCoalesce` were explored, proven worse, and removed before merge.
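
A minimal sketch of the `Auto` decision described above, assuming the survivor layout has already been summarized as a monotonicity flag and a run count. The `Strategy` enum and `choose_strategy` function are hypothetical names used for illustration, not the crate's API; only the three strategy names and the `runs <= len / 4` threshold come from this document.

```rust
/// Illustrative stand-in for the decode strategies named above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Strategy {
    Direct,
    RunDecode,
    GatherBulk,
}

/// Pick a decode strategy from a summary of the survivor layout.
/// `monotonic`: survivor offsets never move backwards in the heap.
/// `runs`: number of contiguous heap runs the survivors form.
/// `len`: number of surviving elements.
fn choose_strategy(monotonic: bool, runs: usize, len: usize) -> Strategy {
    if !monotonic {
        // Shuffled takes cannot be decoded run by run in element order.
        Strategy::GatherBulk
    } else if runs <= 1 {
        // One contiguous run (untouched or sliced): single bulk decode.
        Strategy::Direct
    } else if runs <= len / 4 {
        // Few runs relative to the element count: decode each run in place.
        Strategy::RunDecode
    } else {
        // Heavily fragmented: compact live codes first, then decode once.
        Strategy::GatherBulk
    }
}
```

Under these assumptions, 100 survivors in 25 runs would pick `RunDecode`, while 26 runs would fall through to `GatherBulk`.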

## Benchmarks & results

Two benches in `encodings/fsst/benches/` (full write-up in `benches/README.md`). All numbers are
divan **medians**, 100 samples, single shared machine — directional, relative ordering stable.

1. **`fsst_view_compute`** — synthetic, no external data, **runs in CI**. ~2 MiB strings, ManyShort
(~12 B) / FewLong (~256 B). Single filter and a 5-op chain → VarBinView. The chain is where the
view's advantage compounds (each `fsst` op re-rewrites the heap; the view stays metadata-only):
- chain FewLong: fsst 371 µs → view **268 µs** (1.4×); chain ManyShort 4.99 ms → **4.12 ms**.

2. **`fsst_view_fineweb_queries`** — the real `vortex-bench` query predicates (`dump = ...`,
`date LIKE '2020-10-%'`, `url/text LIKE '%google%'`, `'% vortex %'`, espn filters), evaluated
in DuckDB to authentic per-row masks, then materialize the column → VarBinView. Numbers below
are a same-machine before/after (old `sizes` representation → new `ends` representation):
- text/date_prefix (12%): fsst 69.3 ms vs view **41.4 ms** (1.67×; was 41.0 ms — held)
- text/dump_eq (7%): fsst 42.6 ms vs view **25.3 ms** (1.68×; was 25.3 ms — held)
- url/vortex (0.04%): fsst 8.6 µs vs view **9.1 µs** (was view 140 µs — floor removed)
- url/espn_and (0.08%): fsst 14.5 µs vs view **14.9 µs** (was view 146 µs)
- text/espn_and (0.08%): fsst 284 µs vs view **271 µs** (was view 407 µs — flips to a view win)

With the `ends` representation the view now **wins or ties every query** in the matrix: the bulk /
clustered / long-`text` cases still win by skipping the per-op heap rewrite (up to 1.68× here), and
the tiny highly selective predicates that used to lose to the conversion floor now match `fsst` to
within noise. Full table in `benches/README.md`.

### Reproducing the FineWeb queries bench

The ~2 GB sample is **not** downloaded by the bench. Extract columns + query masks once:

```bash
pip install duckdb
python3 encodings/fsst/benches/fineweb_queries_extract.py # writes /tmp/fw_*.bin
FINEWEB_DIR=/tmp cargo bench -p vortex-fsst --bench fsst_view_fineweb_queries
```

The bench no-ops (CI-safe) when `FINEWEB_DIR` is unset.
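
One way such an environment gate could look, as a sketch only: `fineweb_dir_from` and `fineweb_dir` are hypothetical helpers, and treating an empty value as unset is an assumption rather than documented behaviour of the bench.

```rust
use std::env;
use std::path::PathBuf;

/// Map the raw variable value to a dump directory, or `None` when the
/// bench should skip. Assumption: an empty string counts as unset.
fn fineweb_dir_from(value: Option<String>) -> Option<PathBuf> {
    value.filter(|v| !v.is_empty()).map(PathBuf::from)
}

/// Read `FINEWEB_DIR` from the process environment.
fn fineweb_dir() -> Option<PathBuf> {
    fineweb_dir_from(env::var("FINEWEB_DIR").ok())
}
```

A bench entry point would return early when `fineweb_dir()` is `None`, which is what keeps it safe to run in CI without the sample.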

## Conversion floor — resolved

The view's one previous weakness was a **fixed conversion cost on highly selective filters**: the
original `fsstview_from_fsst` derived a full `sizes` array (`offsets[i+1] - offsets[i]` over all
rows) even when a predicate kept <1% of rows. Samply + cachegrind had pinned this as the top
wall-clock cost (~130–150 µs floor) on the `url`-selective queries — a memory-bandwidth-bound loop
streaming `len * 8` bytes.

**Fix (this branch): store the end offset, not the size.** `codes_sizes` was replaced by
`codes_ends`, where `codes_ends[i] = codes_offsets[i] + size[i]`. Because a freshly converted heap
is contiguous (element `i` occupies `offsets[i]..offsets[i+1]`), **both** addressing arrays are now
zero-copy slices of the FSST's existing monotonic offsets buffer
(`codes_offsets = offsets[0..len]`, `codes_ends = offsets[1..len+1]`). The conversion allocates and
copies nothing; no per-row `sizes` array is materialized, so a selective `filter`/`take` never pays
to derive sizes for the rows it discards. The per-element size is recovered as
`codes_ends[i] - codes_offsets[i]` only at canonicalize / `scalar_at`, over the survivors only.
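
The slicing trick can be sketched with plain Rust slices. This is an illustration under simplifying assumptions (offsets as `u64`, borrowed slices instead of Vortex buffers); `ViewAddressing`, `view_from_offsets`, and `code_size` are hypothetical names, not the crate's types.

```rust
/// Borrowed addressing for a freshly converted view.
/// Both slices alias the same offsets buffer; nothing is copied.
struct ViewAddressing<'a> {
    codes_offsets: &'a [u64],
    codes_ends: &'a [u64],
}

/// Build the addressing from a monotonic offsets buffer of `len + 1` entries.
fn view_from_offsets(offsets: &[u64]) -> ViewAddressing<'_> {
    assert!(!offsets.is_empty(), "an offsets buffer has len + 1 entries");
    let len = offsets.len() - 1;
    ViewAddressing {
        codes_offsets: &offsets[..len], // offsets[0..len]
        codes_ends: &offsets[1..],      // offsets[1..len + 1]
    }
}

/// Recover the compressed size of element `i` on demand.
fn code_size(view: &ViewAddressing<'_>, i: usize) -> u64 {
    view.codes_ends[i] - view.codes_offsets[i]
}
```

Because the two windows overlap by all but one entry, the conversion does constant work regardless of row count, and the subtraction only runs for rows that survive to decode.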

This keeps `filter`/`take`/`slice` metadata-only and composable across a chain (they carry
`codes_ends` alongside `codes_offsets`); the conversion is **not** fused into the filter. Measured
result (same-machine before/after, `fsst_view_fineweb_queries`): `url/vortex` 140 µs → **9.1 µs**,
`url/espn_and` 146 µs → **14.9 µs**, and the previously winning clustered cases (`text/dump_eq`,
`text/date_prefix`) held flat. The view now wins or ties every query in the matrix.

A regression guard (`conversion_shares_offsets_buffer_zero_copy` in `tests.rs`) asserts the
structural invariant the fix relies on: a freshly converted view's `codes_ends` slice begins exactly
one element past `codes_offsets` in the *same allocation*. This catches a silent revert to a
size-materializing conversion — which the value/agreement tests would not, since the decoded values
would still match — without depending on the FineWeb bench (gated out of CI).
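
The invariant the guard checks can be expressed as a pointer comparison. The helper below is a hypothetical sketch over plain slices, not the body of `conversion_shares_offsets_buffer_zero_copy`, which operates on the real array children.

```rust
/// True when `ends` begins exactly one element past `offsets` in memory,
/// which is what two overlapping windows over one buffer look like.
fn shares_offsets_buffer(offsets: &[u64], ends: &[u64]) -> bool {
    offsets.len() == ends.len()
        && std::ptr::eq(offsets.as_ptr().wrapping_add(1), ends.as_ptr())
}
```

A conversion that materialized a fresh array for either child would produce equal values but fail this check, which is the failure mode the value-based tests cannot see.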

The alternative follow-up (store `sizes` in the narrowest int width) was considered and rejected:
it only halves the *write* traffic and still requires a full read of the offsets, whereas the
`ends` representation removes the whole O(rows) pass. Narrowing widths is orthogonal and can still
be handled by the file layer's compression if desired.

## Verification commands

```bash
cargo nextest run -p vortex-fsst # (or cargo test -p vortex-fsst) — 105 pass
cargo clippy -p vortex-fsst --all-targets --all-features
cargo clippy -p vortex-file
cargo +nightly fmt --all
```

## Methodology notes (for whoever continues)

- `perf` is unavailable in the dev sandbox (kernel mismatch). Use **samply** (set
`/proc/sys/kernel/perf_event_paranoid` to 1) for wall-clock sampling and **cachegrind** for
cache/instruction modeling. Build the profiled example with
`RUSTFLAGS="-C force-frame-pointers=yes -C debuginfo=2"` and resolve addresses with `addr2line`.
- Caution learned the hard way: **instruction count is not time.** A 12× instruction-count
reduction in the conversion barely moved wall-clock; always confirm with a sampling profiler and
a realistic workload (real FineWeb columns, real query masks), not synthetic micro-loops.
42 changes: 42 additions & 0 deletions FSSTVIEW_NEXT_PROMPT.md
@@ -0,0 +1,42 @@
# Copy-paste prompt to continue the FSSTView work

Paste the block below to a fresh agent session in the Vortex repo.

---

Continue work on the `FSSTView` encoding in the Vortex repo. It's already implemented and
merge-ready on branch `claude/fsstview-array-listview-TdW45` (17 commits ahead of `develop`).
Read `FSSTVIEW_HANDOVER.md` and `encodings/fsst/benches/README.md` first for full context and
benchmark numbers.

Background: `FSSTView` (in `encodings/fsst/src/fsstview/`) is a ListView-style FSST that stores
compressed codes addressed by separate `offsets` + `sizes` arrays, making `filter`/`take`/`slice`
metadata-only (no code-heap rewrite). On real FineWeb data the view wins up to 8.6× on chained
ops over long strings. Its one measured weakness: on highly selective predicates over short
columns it pays a fixed ~130 µs floor because `fsstview_from_fsst` derives the full `sizes` array
(over all rows) even when <1% survive.

Task: eliminate that conversion floor without regressing the cases the view already wins. Approach:

1. Confirm the current behaviour first: run
`python3 encodings/fsst/benches/fineweb_queries_extract.py` (needs `pip install duckdb`, network
to HuggingFace), then
`FINEWEB_DIR=/tmp cargo bench -p vortex-fsst --bench fsst_view_fineweb_queries`. Note the
`url/vortex`, `url/google_and`, `url/espn_*` rows where `view` trails `fsst`.
2. Implement a cheaper `sizes` representation so a selective filter doesn't materialize sizes for
discarded rows — e.g. derive `sizes` lazily from `offsets` at canonicalize time, or store it in
the narrowest int width that fits. `filter`/`take` currently filter a concrete `codes_sizes`
child array, so whatever you choose must keep those ops metadata-only and still composable
across a chain (do NOT fuse conversion into filter).
3. Prove it with the same methodology, not instruction counts: samply (set
`perf_event_paranoid=1`) for wall-clock and the real `fsst_view_fineweb_queries` bench. Show the
selective `url` queries improve AND the winning cases (`chain text`, `dump_eq`, `date_prefix`)
do not regress.
4. Keep it merge-clean: `cargo test -p vortex-fsst` (107 tests), `cargo clippy -p vortex-fsst
--all-targets --all-features`, `cargo +nightly fmt --all`. Add/adjust tests for any new
representation. Update `benches/README.md` and `FSSTVIEW_HANDOVER.md` with new numbers. Commit
with sign-off `Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>` and push to the same branch.

Be rigorous about measurement: instruction count is not time, and synthetic micro-loops mislead —
always validate on the real FineWeb columns/query masks. If a change doesn't actually help the real
workload, say so and revert it rather than shipping it.
8 changes: 8 additions & 0 deletions encodings/fsst/Cargo.toml
@@ -56,5 +56,13 @@ name = "chunked_dict_fsst_builder"
harness = false
required-features = ["_test-harness"]

[[bench]]
name = "fsst_view_compute"
harness = false

[[bench]]
name = "fsst_view_fineweb_queries"
harness = false

[package.metadata.cargo-machete]
ignored = ["fsst-rs"]
119 changes: 119 additions & 0 deletions encodings/fsst/benches/README.md
@@ -0,0 +1,119 @@
<!--
SPDX-License-Identifier: Apache-2.0
SPDX-FileCopyrightText: Copyright the Vortex contributors
-->

# FSSTView benchmarks

`FSSTView` is a ListView-style FSST: it addresses its compressed codes with separate
`offsets` + `ends` arrays instead of a single monotonic offsets array. That makes
`filter` / `take` / `slice` **metadata-only** (they rewrite only the small
offsets/ends/lengths/validity arrays and reuse the compressed byte heap), whereas plain
`FSST` delegates those ops to `VarBin` and **rewrites the whole compressed code heap** each
time. The cost moves to a single canonicalization (decode → `VarBinViewArray`) at the end.
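
To make "metadata-only" concrete, here is a small sketch of a filter over a view-style layout, assuming per-element start and end positions into a shared heap. `View`, `keep`, and `filter` are illustrative names and the `Rc<[u8]>` heap is a stand-in for a shared buffer; none of this is the crate's actual representation.

```rust
use std::rc::Rc;

/// Illustrative view: a shared code heap plus per-element addressing.
#[derive(Debug, Clone)]
struct View {
    heap: Rc<[u8]>,    // compressed codes, shared and never rewritten
    offsets: Vec<u64>, // start of each element in `heap`
    ends: Vec<u64>,    // end of each element in `heap`
}

/// Keep the entries of `xs` whose mask bit is set.
fn keep(xs: &[u64], mask: &[bool]) -> Vec<u64> {
    xs.iter()
        .zip(mask)
        .filter_map(|(&x, &m)| m.then_some(x))
        .collect()
}

/// Filter by rewriting only the addressing arrays; the heap is reused.
fn filter(view: &View, mask: &[bool]) -> View {
    assert_eq!(mask.len(), view.offsets.len());
    View {
        heap: Rc::clone(&view.heap),
        offsets: keep(&view.offsets, mask),
        ends: keep(&view.ends, mask),
    }
}
```

The work is proportional to the number of elements, not the number of compressed bytes, which is why the saving grows with string length.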

These benchmarks quantify that trade-off. All numbers are divan **medians**, 100 samples, on
one shared machine — treat them as directional; the relative ordering is stable. `fsst` =
stay in `FSST` (rewrite heap per op); `view` = convert to `FSSTView`, metadata-only ops,
decode once.

## 1. `fsst_view_compute` — synthetic shapes

Self-contained (no external data). ~2 MiB of synthetic strings in two shapes — `ManyShort`
(~12 B) and `FewLong` (~256 B) — with a clustered 10 % filter and a sorted take. Two
workloads, each ending in a `VarBinViewArray`:

- `single_filter_{fsst,view}` — one filter, then canonicalize.
- `chain_{fsst,view}` — convert once, then 5 alternating filter/take ops, canonicalize once
(the case the view is designed for).

| workload | shape | fsst | view | speedup |
| --- | --- | --- | --- | --- |
| single_filter | ManyShort | 0.63 ms | 0.62 ms | ~1× |
| single_filter | FewLong | 65 µs | 53 µs | 1.2× |
| chain (5 ops) | ManyShort | 4.99 ms | 4.12 ms | 1.2× |
| chain (5 ops) | FewLong | 371 µs | 268 µs | 1.4× |

Takeaway: the gap widens with chain length, because each `fsst` op re-rewrites the heap while
the view stays metadata-only and defers the single decode.

## 2. `fsst_view_fineweb_queries` — real query predicates

The actual `vortex-bench` FineWeb queries are `SELECT * FROM fineweb WHERE <predicate>`. Each
predicate is evaluated once in DuckDB against the real sample to produce an authentic per-row
selection mask (recipe: `benches/fineweb_queries_extract.py`); the bench applies that mask to
the FSST-compressed `url`/`text` column and decodes to a `VarBinViewArray`. This is the
materialization half of a real query. No-ops unless `FINEWEB_DIR` points at the dumps.

Mask shapes vary by predicate (over 200 k rows): `dump_eq` 7 %/177 runs and `date_prefix`
12 %/178 runs are clustered; `google_or` 2 %/4046 runs is scattered; `vortex`/`espn` are
~0.04–0.09 % and tiny.
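
The two numbers quoted per mask can be computed as below. This is a sketch that assumes a "run" means a maximal stretch of consecutive selected rows; `mask_shape` is a hypothetical helper, not part of the bench.

```rust
/// Return (selectivity, runs) for a selection mask.
/// Selectivity is the fraction of rows kept; a run is a maximal
/// stretch of consecutive kept rows.
fn mask_shape(mask: &[bool]) -> (f64, usize) {
    let kept = mask.iter().filter(|&&m| m).count();
    let mut runs = 0usize;
    let mut prev = false;
    for &m in mask {
        // A run starts wherever a kept row follows a dropped row.
        if m && !prev {
            runs += 1;
        }
        prev = m;
    }
    let selectivity = if mask.is_empty() {
        0.0
    } else {
        kept as f64 / mask.len() as f64
    };
    (selectivity, runs)
}
```

A clustered predicate yields few runs relative to the rows kept, while a scattered one such as `google_or` yields many runs for a small selectivity.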

The `view (before)` column is the original representation, which derived a full `sizes` array in
`fsstview_from_fsst` (one i64 per row, materialized over **all** 200 k rows regardless of
selectivity). The `view` column stores the per-element **end offset** instead — a zero-copy slice
of the FSST's existing monotonic offsets — so the conversion allocates nothing and a selective
predicate never pays to derive sizes for the rows it discards (see "Conversion is allocation-free"
below). `fsst` is unchanged by this work; its small run-to-run drift is machine noise (the two
measurement runs were back-to-back on a shared machine).

| query (selectivity) | column | fsst | view (before) | view | winner |
| --- | --- | --- | --- | --- | --- |
| date_prefix (12 %) | text | 69.3 ms | 41.0 ms | **41.4 ms** | view 1.67× |
| dump_eq (7 %) | text | 42.6 ms | 25.3 ms | **25.3 ms** | view 1.68× |
| google_or (2 %) | text | 23.9 ms | 23.7 ms | **19.8 ms** | view 1.2× |
| google_and (0.19 %) | text | 708 µs | 782 µs | **642 µs** | view |
| vortex (0.04 %) | text | 529 µs | 606 µs | **456 µs** | view |
| espn_and (0.08 %) | text | 284 µs | 407 µs | **271 µs** | view |
| espn_or (0.09 %) | text | 650 µs* | 418 µs | **281 µs** | view |
| date_prefix (12 %) | url | 1.68 ms | 1.39 ms | **1.25 ms** | view 1.34× |
| dump_eq (7 %) | url | 1.11 ms | 944 µs | **881 µs** | view 1.25× |
| google_or (2 %) | url | 398 µs | 478 µs | **331 µs** | view 1.2× |
| google_and (0.19 %) | url | 30.2 µs | 173 µs | **28.7 µs** | view |
| espn_and (0.08 %) | url | 14.5 µs | 146 µs | **14.9 µs** | ~tie |
| espn_or (0.09 %) | url | 16.4 µs | 152 µs | **16.0 µs** | ~tie |
| vortex (0.04 %) | url | 8.6 µs | 140 µs | **9.1 µs** | ~tie |

(divan medians. `*` `text/espn_or` `fsst` was noisy that run — fastest 283 µs, mean 578 µs.)

Takeaway:

- **The conversion floor is gone.** Every highly selective `url` predicate that previously trailed
`fsst` by roughly 6–16× (it paid a fixed ~140–170 µs to walk all 200 k offsets building `sizes` even
when <0.2 % of rows survived) now matches `fsst` to within noise (`url/vortex` 140 µs → **9.1 µs**,
`url/espn_and` 146 µs → **14.9 µs**). The same floor that quietly taxed the *short selective
`text`* predicates is also gone: `text/vortex`, `text/espn_and`, and `text/google_and` flip from an
`fsst` win to a `view` win, and `text/espn_or` improves from 418 µs to 281 µs.
- **The winning cases do not regress.** The clustered/bulk selections the view was already built
for hold or improve: `text/dump_eq` and `text/date_prefix` stay at ~1.67–1.68× (the decode, not
the conversion, dominates them), while `url/date_prefix`, `url/dump_eq`, and both `google_or`
columns get a touch faster because the conversion no longer allocates.

With the floor removed the view now wins or ties **every** query in this matrix.

## Conversion is allocation-free

`FSSTView` stores the per-element **end offset** (`codes_ends[i] = offset[i] + size[i]`) rather
than the size. A freshly converted heap is contiguous, so element `i` occupies
`offsets[i]..offsets[i + 1]`, which means **both** addressing arrays are zero-copy slices of the
FSST's existing monotonic offsets buffer: `codes_offsets = offsets[0..len]` and
`codes_ends = offsets[1..len + 1]`. `fsstview_from_fsst` therefore allocates and copies nothing —
in particular it never materializes a per-row `sizes` array, so a selective `filter`/`take` that
keeps a handful of rows no longer pays an O(rows) cost to derive sizes for the rows it discards.
The per-element size is recovered as `codes_ends[i] - codes_offsets[i]` only where it is needed
(canonicalize / `scalar_at`), over the survivors only. `filter`/`take`/`slice` stay metadata-only
and compose across a chain exactly as before — they now carry `codes_ends` alongside
`codes_offsets` instead of `codes_sizes`.
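
As a sketch of how a `take` composes on this representation and how sizes are recovered only for survivors, assuming plain `u64` slices for the two addressing arrays. `take` and `live_code_bytes` are hypothetical helpers for illustration.

```rust
/// Gather addressing entries by index; the compressed heap is untouched.
fn take(offsets: &[u64], ends: &[u64], indices: &[usize]) -> (Vec<u64>, Vec<u64>) {
    let new_offsets = indices.iter().map(|&i| offsets[i]).collect();
    let new_ends = indices.iter().map(|&i| ends[i]).collect();
    (new_offsets, new_ends)
}

/// Total compressed bytes referenced by the survivors.
/// Sizes are derived here, at decode time, and only for surviving rows.
fn live_code_bytes(offsets: &[u64], ends: &[u64]) -> u64 {
    offsets.iter().zip(ends).map(|(&o, &e)| e - o).sum()
}
```

Each op in a chain produces another pair of small arrays in the same shape, so the next op applies to it unchanged.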

## How `Auto` chooses the decode

Canonicalization picks a decode strategy from the survivor layout (`FsstViewCompaction::Auto`):

- **Direct** — survivors are one contiguous run (untouched / sliced): one bulk decode, no copy.
- **RunDecode** — offsets still monotonic with few runs (clustered/range filters, sorted
takes): decode each contiguous run straight into the element-ordered output, no gather copy.
- **GatherBulk** — scattered (shuffle take) or heavily fragmented (uniform-random filter):
compact the live codes into one buffer, then a single bulk decode.

The threshold (`runs <= len / 4` → RunDecode, else GatherBulk) was calibrated with the
synthetic `fsst_view_compute` shapes.
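
The run count that feeds this threshold can be derived from the survivors' spans. The sketch below assumes each element occupies `offsets[i]..ends[i]` in the heap and merges spans that touch; `coalesce_runs` is an illustrative name, not the crate's function.

```rust
/// Merge adjacent survivor spans into contiguous heap runs.
/// Returns `None` when a span starts before the previous one ends,
/// i.e. the offsets are not monotonic (for example a shuffled take).
fn coalesce_runs(offsets: &[u64], ends: &[u64]) -> Option<Vec<(u64, u64)>> {
    assert_eq!(offsets.len(), ends.len());
    let mut runs: Vec<(u64, u64)> = Vec::new();
    for (&start, &end) in offsets.iter().zip(ends) {
        if let Some(last) = runs.last_mut() {
            if start < last.1 {
                return None; // moved backwards in the heap
            }
            if start == last.1 {
                last.1 = end; // touches the previous span: extend the run
                continue;
            }
        }
        runs.push((start, end));
    }
    Some(runs)
}
```

The number of runs returned, compared against `len / 4`, is the quantity the threshold above is stated in terms of; a `None` result corresponds to the scattered case.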