vortex-data · joseph-isaacs · May 30, 2026
diff --git a/FSSTVIEW_HANDOVER.md b/FSSTVIEW_HANDOVER.md
@@ -0,0 +1,145 @@
+# FSSTView — Handover
+
+## TL;DR
+
+Added a new **`FSSTView`** array encoding to Vortex: a ListView-style FSST that addresses its
+compressed codes with separate per-element `offsets` + `ends` arrays instead of one monotonic
+offsets array. This makes `filter` / `take` / `slice` **metadata-only** (rewrite only small index
+arrays, reuse the compressed byte heap), where plain `FSST` rewrites the whole compressed code heap
+per op. The decode cost moves to a single canonicalization at the end.
+
+Storing the per-element **end offset** (rather than the size) makes the `FSST` → `FSSTView`
+conversion allocation-free — both addressing arrays are zero-copy slices of the FSST's existing
+offsets — which **eliminated the conversion floor** that previously made the view 9–16× slower than
+`fsst` on tiny highly selective `url` predicates (see "Conversion floor — resolved" below).
+
+- **Branch:** `claude/fsstview-conversion-floor-kRAeg` (built on the original
+  `claude/fsstview-array-listview-TdW45`).
+- **Status:** merge-ready. 105 tests pass, `clippy --all-targets --all-features` clean,
+  `cargo +nightly fmt` clean, `vortex-file` builds, doc tests pass.
+- **No PR opened yet** (was waiting on explicit request).
+- **Scope:** additive, contained in `encodings/fsst/` plus 2 registration lines in `vortex-file`.
+
+## What landed
+
+New encoding `vortex.fsstview` in `encodings/fsst/src/fsstview/`:
+
+| file | role |
+| --- | --- |
+| `array.rs` | encoding struct, `#[array_slots]` children (uncompressed_lengths, codes_offsets, codes_ends, codes_validity), VTable, serde, allocation-free `fsstview_from_fsst` conversion |
+| `compute.rs` | metadata-only `FilterKernel` + `TakeExecute` |
+| `ops.rs` | `scalar_at` |
+| `slice.rs` | metadata-only `SliceReduce` |
+| `from_fsst.rs` | `fsst_filter_to_view` / `fsst_take_to_view` helpers |
+| `canonical.rs` | decode → `VarBinViewArray` / `VarBinArray`, with the `Auto` export strategy |
+| `kernel.rs` / `rules.rs` | parent kernel + rule registration |
+| `tests.rs` | conformance + agreement + nullable/gapped/RunDecode coverage + zero-copy conversion guard |
+
+Registered in `vortex-file/src/lib.rs` (`register_default_encodings`). Public API:
+`FSSTView`, `FSSTViewArray`, `FsstViewCompaction`, `canonicalize_fsstview_with`,
+`canonicalize_fsstview_to_varbin`, `fsst_filter_to_view`, `fsst_take_to_view`, `fsstview_from_fsst`.
+
+## Canonicalization strategy (`FsstViewCompaction::Auto`)
+
+After metadata-only ops the survivors are scattered in the original heap; `Auto` picks how to
+decode from the survivor layout:
+
+- **Direct** — one contiguous run (untouched / sliced): single bulk decode, no copy.
+- **RunDecode** — offsets monotonic, few runs (clustered/range filters, sorted takes): decode each
+  contiguous run straight into the element-ordered output, no gather copy. Threshold:
+  `runs <= len / 4`.
+- **GatherBulk** — scattered (shuffle take) or fragmented (uniform-random filter): compact live
+  codes into one buffer, single bulk decode.
+
+`RunDecode` and the gather coalescing came from the optimization work; `PerElement` and
+`RunCoalesce` were explored, proven worse, and removed before merge.
+
+## Benchmarks & results
+
+Two benches in `encodings/fsst/benches/` (full write-up in `benches/README.md`). All numbers are
+divan **medians**, 100 samples, single shared machine — directional, relative ordering stable.
+
+1. **`fsst_view_compute`** — synthetic, no external data, **runs in CI**. ~2 MiB strings, ManyShort
+   (~12 B) / FewLong (~256 B). Single filter and a 5-op chain → VarBinView. The chain is where the
+   view's advantage compounds (each `fsst` op re-rewrites the heap; the view stays metadata-only):
+   - chain FewLong: fsst 371 µs → view **268 µs** (1.4×); chain ManyShort 4.99 ms → **4.12 ms**.
+
+2. **`fsst_view_fineweb_queries`** — the real `vortex-bench` query predicates (`dump = ...`,
+   `date LIKE '2020-10-%'`, `url/text LIKE '%google%'`, `'% vortex %'`, espn filters), evaluated
+   in DuckDB to authentic per-row masks, then materialize the column → VarBinView. Numbers below
+   are a same-machine before/after (old `sizes` representation → new `ends` representation):
+   - text/date_prefix (12%): fsst 69.3 ms vs view **41.4 ms** (1.67×; was 41.0 ms — held)
+   - text/dump_eq (7%): fsst 42.6 ms vs view **25.3 ms** (1.68×; was 25.3 ms — held)
+   - url/vortex (0.04%): fsst 8.6 µs vs view **9.1 µs** (was view 140 µs — floor removed)
+   - url/espn_and (0.08%): fsst 14.5 µs vs view **14.9 µs** (was view 146 µs)
+   - text/espn_and (0.08%): fsst 284 µs vs view **271 µs** (was view 407 µs — flips to a view win)
+
+With the `ends` representation the view now **wins or ties every query** in the matrix: the bulk /
+clustered / long-`text` cases still win by skipping the per-op heap rewrite (up to 1.68× here), and
+the tiny highly selective predicates that used to lose to the conversion floor now match `fsst` to
+within noise. Full table in `benches/README.md`.
+
+### Reproducing the FineWeb queries bench
+
+The ~2 GB sample is **not** downloaded by the bench. Extract columns + query masks once:
+
+```bash
+pip install duckdb
+python3 encodings/fsst/benches/fineweb_queries_extract.py     # writes /tmp/fw_*.bin
+FINEWEB_DIR=/tmp cargo bench -p vortex-fsst --bench fsst_view_fineweb_queries
+```
+
+The bench no-ops (CI-safe) when `FINEWEB_DIR` is unset.
+
+## Conversion floor — resolved
+
+The view's one previous weakness was a **fixed conversion cost on highly selective filters**: the
+original `fsstview_from_fsst` derived a full `sizes` array (`offsets[i+1] - offsets[i]` over all
+rows) even when a predicate kept <1% of rows. Samply + cachegrind had pinned this as the top
+wall-clock cost (~130–150 µs floor) on the `url`-selective queries — a memory-bandwidth-bound loop
+streaming `len * 8` bytes.
+
+**Fix (this branch): store the end offset, not the size.** `codes_sizes` was replaced by
+`codes_ends`, where `codes_ends[i] = codes_offsets[i] + size[i]`. Because a freshly converted heap
+is contiguous (element `i` occupies `offsets[i]..offsets[i+1]`), **both** addressing arrays are now
+zero-copy slices of the FSST's existing monotonic offsets buffer
+(`codes_offsets = offsets[0..len]`, `codes_ends = offsets[1..len+1]`). The conversion allocates and
+copies nothing; no per-row `sizes` array is materialized, so a selective `filter`/`take` never pays
+to derive sizes for the rows it discards. The per-element size is recovered as
+`codes_ends[i] - codes_offsets[i]` only at canonicalize / `scalar_at`, over the survivors only.
+
+This keeps `filter`/`take`/`slice` metadata-only and composable across a chain (they carry
+`codes_ends` alongside `codes_offsets`); the conversion is **not** fused into the filter. Measured
+result (same-machine before/after, `fsst_view_fineweb_queries`): `url/vortex` 140 µs → **9.1 µs**,
+`url/espn_and` 146 µs → **14.9 µs**, and the previously winning clustered cases (`text/dump_eq`,
+`text/date_prefix`) held flat. The view now wins or ties every query in the matrix.
+
+A regression guard (`conversion_shares_offsets_buffer_zero_copy` in `tests.rs`) asserts the
+structural invariant the fix relies on: a freshly converted view's `codes_ends` slice begins exactly
+one element past `codes_offsets` in the *same allocation*. This catches a silent revert to a
+size-materializing conversion — which the value/agreement tests would not, since the decoded values
+would still match — without depending on the FineWeb bench (gated out of CI).
+
+The alternative follow-up (store `sizes` in the narrowest int width) was considered and rejected:
+it only halves the *write* traffic and still leaves the unavoidable full read of the offsets,
+whereas the `ends` representation removes the whole O(rows) pass. Narrowing widths is orthogonal
+and can still be layered on top via the file layer's compression if desired.
+
+## Verification commands
+
+```bash
+cargo nextest run -p vortex-fsst          # (or cargo test -p vortex-fsst) — 105 pass
+cargo clippy -p vortex-fsst --all-targets --all-features
+cargo clippy -p vortex-file
+cargo +nightly fmt --all
+```
+
+## Methodology notes (for whoever continues)
+
+- `perf` is unavailable in the dev sandbox (kernel mismatch). Use **samply** (set
+  `/proc/sys/kernel/perf_event_paranoid` to 1) for wall-clock sampling and **cachegrind** for
+  cache/instruction modeling. Build the profiled example with
+  `RUSTFLAGS="-C force-frame-pointers=yes -C debuginfo=2"` and resolve addresses with `addr2line`.
+- Caution learned the hard way: **instruction count is not time.** A 12× instruction-count
+  reduction in the conversion barely moved wall-clock; always confirm with a sampling profiler and
+  a realistic workload (real FineWeb columns, real query masks), not synthetic micro-loops.
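Outside the patch itself, the zero-copy conversion described in the handover's conversion-floor section can be illustrated with plain slices. This is a minimal sketch, assuming a `&[u64]` stands in for the FSST offsets buffer; `ViewAddressing`, `view_from_offsets`, `survivor_sizes`, and `shares_buffer` are hypothetical names for illustration and are not the crate's API.

```rust
/// Hypothetical stand-in for a view's two addressing arrays.
/// Both fields borrow from the same monotonic offsets buffer.
#[derive(Debug, Clone, Copy)]
pub struct ViewAddressing<'a> {
    pub codes_offsets: &'a [u64],
    pub codes_ends: &'a [u64],
}

/// Build both addressing arrays from an offsets buffer of `len + 1` entries
/// without allocating: offsets = offsets[0..len], ends = offsets[1..len + 1].
pub fn view_from_offsets(offsets: &[u64]) -> ViewAddressing<'_> {
    assert!(!offsets.is_empty(), "offsets must hold len + 1 entries");
    let len = offsets.len() - 1;
    ViewAddressing {
        codes_offsets: &offsets[..len],
        codes_ends: &offsets[1..],
    }
}

/// Recover sizes lazily, and only for the rows that survived.
pub fn survivor_sizes(view: &ViewAddressing<'_>, survivors: &[usize]) -> Vec<u64> {
    survivors
        .iter()
        .map(|&i| view.codes_ends[i] - view.codes_offsets[i])
        .collect()
}

/// Structural check in the spirit of the regression guard: `codes_ends`
/// starts exactly one element past `codes_offsets` in the same buffer.
pub fn shares_buffer(view: &ViewAddressing<'_>) -> bool {
    std::ptr::eq(
        view.codes_offsets.as_ptr().wrapping_add(1),
        view.codes_ends.as_ptr(),
    )
}
```

Because nothing is computed per row at conversion time, the cost of deriving a size is paid only inside `survivor_sizes`, which mirrors the handover's point that sizes are recovered over the survivors only.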
diff --git a/FSSTVIEW_NEXT_PROMPT.md b/FSSTVIEW_NEXT_PROMPT.md
@@ -0,0 +1,42 @@
+# Copy-paste prompt to continue the FSSTView work
+
+Paste the block below to a fresh agent session in the Vortex repo.
+
+---
+
+Continue work on the `FSSTView` encoding in the Vortex repo. It's already implemented and
+merge-ready on branch `claude/fsstview-array-listview-TdW45` (17 commits ahead of `develop`).
+Read `FSSTVIEW_HANDOVER.md` and `encodings/fsst/benches/README.md` first for full context and
+benchmark numbers.
+
+Background: `FSSTView` (in `encodings/fsst/src/fsstview/`) is a ListView-style FSST that stores
+compressed codes addressed by separate `offsets` + `sizes` arrays, making `filter`/`take`/`slice`
+metadata-only (no code-heap rewrite). On real FineWeb data the view wins up to 8.6× on chained
+ops over long strings. Its one measured weakness: on highly selective predicates over short
+columns it pays a fixed ~130 µs floor because `fsstview_from_fsst` derives the full `sizes` array
+(over all rows) even when <1% survive.
+
+Task: eliminate that conversion floor without regressing the cases the view already wins. Approach:
+
+1. Confirm the current behaviour first: run
+   `python3 encodings/fsst/benches/fineweb_queries_extract.py` (needs `pip install duckdb`, network
+   to HuggingFace), then
+   `FINEWEB_DIR=/tmp cargo bench -p vortex-fsst --bench fsst_view_fineweb_queries`. Note the
+   `url/vortex`, `url/google_and`, `url/espn_*` rows where `view` trails `fsst`.
+2. Implement a cheaper `sizes` representation so a selective filter doesn't materialize sizes for
+   discarded rows — e.g. derive `sizes` lazily from `offsets` at canonicalize time, or store it in
+   the narrowest int width that fits. `filter`/`take` currently filter a concrete `codes_sizes`
+   child array, so whatever you choose must keep those ops metadata-only and still composable
+   across a chain (do NOT fuse conversion into filter).
+3. Prove it with the same methodology, not instruction counts: samply (set
+   `perf_event_paranoid=1`) for wall-clock and the real `fsst_view_fineweb_queries` bench. Show the
+   selective `url` queries improve AND the winning cases (`chain text`, `dump_eq`, `date_prefix`)
+   do not regress.
+4. Keep it merge-clean: `cargo test -p vortex-fsst` (107 tests), `cargo clippy -p vortex-fsst
+   --all-targets --all-features`, `cargo +nightly fmt --all`. Add/adjust tests for any new
+   representation. Update `benches/README.md` and `FSSTVIEW_HANDOVER.md` with new numbers. Commit
+   with sign-off `Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>` and push to the same branch.
+
+Be rigorous about measurement: instruction count is not time, and synthetic micro-loops mislead —
+always validate on the real FineWeb columns/query masks. If a change doesn't actually help the real
+workload, say so and revert it rather than shipping it.
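Outside the patch itself, the requirement in step 2 that `filter`/`take` stay metadata-only and composable can be sketched as follows. This is an illustration under stated assumptions: `Addressing`, `filter_addressing`, `take_addressing`, and `resolve` are hypothetical names, `Vec<u64>` stands in for the child arrays, and the byte heap here is arbitrary bytes rather than real FSST codes.

```rust
/// Hypothetical owned addressing arrays over a shared, untouched byte heap.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Addressing {
    pub offsets: Vec<u64>,
    pub ends: Vec<u64>,
}

/// Keep the (offset, end) pairs whose mask bit is set. The heap is never read.
pub fn filter_addressing(addr: &Addressing, mask: &[bool]) -> Addressing {
    assert_eq!(addr.offsets.len(), mask.len());
    assert_eq!(addr.ends.len(), mask.len());
    let mut out = Addressing { offsets: Vec::new(), ends: Vec::new() };
    for (i, &keep) in mask.iter().enumerate() {
        if keep {
            out.offsets.push(addr.offsets[i]);
            out.ends.push(addr.ends[i]);
        }
    }
    out
}

/// Gather (offset, end) pairs by index. Again, the heap is never read.
pub fn take_addressing(addr: &Addressing, indices: &[usize]) -> Addressing {
    Addressing {
        offsets: indices.iter().map(|&i| addr.offsets[i]).collect(),
        ends: indices.iter().map(|&i| addr.ends[i]).collect(),
    }
}

/// Resolve each element against the heap. In the real encoding this step
/// would happen once, at canonicalize time, after the whole chain of ops.
pub fn resolve<'h>(addr: &Addressing, heap: &'h [u8]) -> Vec<&'h [u8]> {
    addr.offsets
        .iter()
        .zip(&addr.ends)
        .map(|(&o, &e)| &heap[o as usize..e as usize])
        .collect()
}
```

A filter followed by a take composes by feeding one `Addressing` into the next; only the final `resolve` touches the heap, which is the property the task asks to preserve.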
diff --git a/encodings/fsst/Cargo.toml b/encodings/fsst/Cargo.toml
@@ -56,5 +56,13 @@ name = "chunked_dict_fsst_builder"
 harness = false
 required-features = ["_test-harness"]
 
+[[bench]]
+name = "fsst_view_compute"
+harness = false
+
+[[bench]]
+name = "fsst_view_fineweb_queries"
+harness = false
+
 [package.metadata.cargo-machete]
 ignored = ["fsst-rs"]
diff --git a/encodings/fsst/benches/README.md b/encodings/fsst/benches/README.md
@@ -0,0 +1,119 @@
+<!--
+SPDX-License-Identifier: Apache-2.0
+SPDX-FileCopyrightText: Copyright the Vortex contributors
+-->
+
+# FSSTView benchmarks
+
+`FSSTView` is a ListView-style FSST: it addresses its compressed codes with separate
+`offsets` + `ends` arrays instead of a single monotonic offsets array. That makes
+`filter` / `take` / `slice` **metadata-only** (they rewrite only the small
+offsets/ends/lengths/validity arrays and reuse the compressed byte heap), whereas plain
+`FSST` delegates those ops to `VarBin` and **rewrites the whole compressed code heap** each
+time. The cost moves to a single canonicalization (decode → `VarBinViewArray`) at the end.
+
+These benchmarks quantify that trade-off. All numbers are divan **medians**, 100 samples, on
+one shared machine — treat them as directional; the relative ordering is stable. `fsst` =
+stay in `FSST` (rewrite heap per op); `view` = convert to `FSSTView`, metadata-only ops,
+decode once.
+
+## 1. `fsst_view_compute` — synthetic shapes
+
+Self-contained (no external data). ~2 MiB of synthetic strings in two shapes — `ManyShort`
+(~12 B) and `FewLong` (~256 B) — with a clustered 10 % filter and a sorted take. Two
+workloads, each ending in a `VarBinViewArray`:
+
+- `single_filter_{fsst,view}` — one filter, then canonicalize.
+- `chain_{fsst,view}` — convert once, then 5 alternating filter/take ops, canonicalize once
+  (the case the view is designed for).
+
+| workload | shape | fsst | view | speedup |
+| --- | --- | --- | --- | --- |
+| single_filter | ManyShort | 0.63 ms | 0.62 ms | ~1× |
+| single_filter | FewLong | 65 µs | 53 µs | 1.2× |
+| chain (5 ops) | ManyShort | 4.99 ms | 4.12 ms | 1.2× |
+| chain (5 ops) | FewLong | 371 µs | 268 µs | 1.4× |
+
+Takeaway: the gap widens with chain length, because each `fsst` op re-rewrites the heap while
+the view stays metadata-only and defers the single decode.
+
+## 2. `fsst_view_fineweb_queries` — real query predicates
+
+The actual `vortex-bench` FineWeb queries are `SELECT * FROM fineweb WHERE <predicate>`. Each
+predicate is evaluated once in DuckDB against the real sample to produce an authentic per-row
+selection mask (recipe: `benches/fineweb_queries_extract.py`); the bench applies that mask to
+the FSST-compressed `url`/`text` column and decodes to a `VarBinViewArray`. This is the
+materialization half of a real query. No-ops unless `FINEWEB_DIR` points at the dumps.
+
+Mask shapes vary by predicate (over 200 k rows): `dump_eq` 7 %/177 runs and `date_prefix`
+12 %/178 runs are clustered; `google_or` 2 %/4046 runs is scattered; `vortex`/`espn` are
+~0.04–0.09 % and tiny.
+
+The `view (before)` column is the original representation, which derived a full `sizes` array in
+`fsstview_from_fsst` (one i64 per row, materialized over **all** 200 k rows regardless of
+selectivity). The `view` column stores the per-element **end offset** instead — a zero-copy slice
+of the FSST's existing monotonic offsets — so the conversion allocates nothing and a selective
+predicate never pays to derive sizes for the rows it discards (see "Conversion is allocation-free"
+below). `fsst` is unchanged by this work; its small run-to-run drift is machine noise (the two
+measurement runs were back-to-back on a shared machine).
+
+| query (selectivity) | column | fsst | view (before) | view | winner |
+| --- | --- | --- | --- | --- | --- |
+| date_prefix (12 %) | text | 69.3 ms | 41.0 ms | **41.4 ms** | view 1.67× |
+| dump_eq (7 %) | text | 42.6 ms | 25.3 ms | **25.3 ms** | view 1.68× |
+| google_or (2 %) | text | 23.9 ms | 23.7 ms | **19.8 ms** | view 1.2× |
+| google_and (0.19 %) | text | 708 µs | 782 µs | **642 µs** | view |
+| vortex (0.04 %) | text | 529 µs | 606 µs | **456 µs** | view |
+| espn_and (0.08 %) | text | 284 µs | 407 µs | **271 µs** | view |
+| espn_or (0.09 %) | text | 650 µs* | 418 µs | **281 µs** | view |
+| date_prefix (12 %) | url | 1.68 ms | 1.39 ms | **1.25 ms** | view 1.34× |
+| dump_eq (7 %) | url | 1.11 ms | 944 µs | **881 µs** | view 1.25× |
+| google_or (2 %) | url | 398 µs | 478 µs | **331 µs** | view 1.2× |
+| google_and (0.19 %) | url | 30.2 µs | 173 µs | **28.7 µs** | view |
+| espn_and (0.08 %) | url | 14.5 µs | 146 µs | **14.9 µs** | ~tie |
+| espn_or (0.09 %) | url | 16.4 µs | 152 µs | **16.0 µs** | ~tie |
+| vortex (0.04 %) | url | 8.6 µs | 140 µs | **9.1 µs** | ~tie |
+
+(Divan medians. `*` The `text/espn_or` `fsst` measurement was noisy in that run: fastest 283 µs, mean 578 µs.)
+
+Takeaway:
+
+- **The conversion floor is gone.** Every highly selective `url` predicate that previously trailed
+  `fsst` by roughly 6–16× (it paid a fixed ~140 µs to walk all 200 k offsets building `sizes` even
+  when <0.2 % of rows survived) now matches `fsst` to within noise (`url/vortex` 140 µs → **9.1 µs**,
+  `url/espn_and` 146 µs → **14.9 µs**). The same floor that quietly taxed the *short selective
+  `text`* predicates (`text/vortex`, `text/espn_*`, `text/google_and`) is also gone, flipping each
+  of those from an `fsst` win to a `view` win.
+- **The winning cases do not regress.** The clustered/bulk selections the view was already built
+  for hold or improve: `text/dump_eq` and `text/date_prefix` stay at ~1.67–1.68× (the decode, not
+  the conversion, dominates them), while `url/date_prefix`, `url/dump_eq`, and both `google_or`
+  columns get a touch faster because the conversion no longer allocates.
+
+With the floor removed the view now wins or ties **every** query in this matrix.
+
+## Conversion is allocation-free
+
+`FSSTView` stores the per-element **end offset** (`codes_ends[i] = offset[i] + size[i]`) rather
+than the size. A freshly converted heap is contiguous, so element `i` occupies
+`offsets[i]..offsets[i + 1]`, which means **both** addressing arrays are zero-copy slices of the
+FSST's existing monotonic offsets buffer: `codes_offsets = offsets[0..len]` and
+`codes_ends = offsets[1..len + 1]`. `fsstview_from_fsst` therefore allocates and copies nothing —
+in particular it never materializes a per-row `sizes` array, so a selective `filter`/`take` that
+keeps a handful of rows no longer pays an O(rows) cost to derive sizes for the rows it discards.
+The per-element size is recovered as `codes_ends[i] - codes_offsets[i]` only where it is needed
+(canonicalize / `scalar_at`), over the survivors only. `filter`/`take`/`slice` stay metadata-only
+and compose across a chain exactly as before — they now carry `codes_ends` alongside
+`codes_offsets` instead of `codes_sizes`.
+
+## How `Auto` chooses the decode
+
+Canonicalization picks a decode strategy from the survivor layout (`FsstViewCompaction::Auto`):
+
+- **Direct** — survivors are one contiguous run (untouched / sliced): one bulk decode, no copy.
+- **RunDecode** — offsets still monotonic with few runs (clustered/range filters, sorted
+  takes): decode each contiguous run straight into the element-ordered output, no gather copy.
+- **GatherBulk** — scattered (shuffle take) or heavily fragmented (uniform-random filter):
+  compact the live codes into one buffer, then a single bulk decode.
+
+The threshold (`runs <= len / 4` → RunDecode, else GatherBulk) was calibrated with the
+synthetic `fsst_view_compute` shapes.
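Outside the patch itself, the `Auto` selection rule above can be sketched as a small function. This is a minimal illustration, not the crate's implementation: `Strategy` and `choose_strategy` are hypothetical names, and two details are assumptions made here rather than stated in the README. A "run" is taken to continue when an element's offset equals the previous element's end, and "monotonic" is taken to mean no element starts before the previous one ends. An empty input is arbitrarily mapped to `Direct`.

```rust
/// Hypothetical mirror of the three decode strategies named in the README.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Strategy {
    Direct,
    RunDecode,
    GatherBulk,
}

/// Pick a strategy from the survivor layout using the `runs <= len / 4` threshold.
pub fn choose_strategy(offsets: &[u64], ends: &[u64]) -> Strategy {
    assert_eq!(offsets.len(), ends.len());
    let len = offsets.len();
    if len == 0 {
        return Strategy::Direct; // assumption: nothing to decode
    }

    let mut runs = 1usize;
    let mut monotonic = true;
    for i in 1..len {
        // Assumed run definition: a gap or a jump backwards starts a new run.
        if offsets[i] != ends[i - 1] {
            runs += 1;
        }
        // Assumed monotonicity: elements never start before the previous end.
        if offsets[i] < ends[i - 1] {
            monotonic = false;
        }
    }

    if runs == 1 {
        Strategy::Direct
    } else if monotonic && runs <= len / 4 {
        Strategy::RunDecode
    } else {
        Strategy::GatherBulk
    }
}
```

Under these assumptions a sliced or untouched view yields one run and decodes directly, a clustered filter yields a few long runs, and a shuffled take or uniform-random filter falls through to the gather path.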