perf(fse): donor FSE_buildCTable_wksp parity — drop per-symbol Vec<State> by polaz · Pull Request #166 · structured-world/structured-zstd

polaz · 2026-05-18T07:33:21Z

Summary

Part of #110 — block compressor entropy coding rewrite. This PR targets the FSE table builder (build_table_from_probabilities) which a fresh profile flagged as the dominant cost on small inputs.

Replace the O(num_symbols × table_size) builder loop with donor FSE_buildCTable_wksp (cumul + spread + sorted-by-symbol sweep):

- Drop per-symbol Vec<State> entirely from FSETable. Production no longer materializes any per-state vector; nextStateTable (== state_table_flat) + symbolTT are all the hot path needs.
- Precompute start_state and max_num_bits once per symbol via donor 16.16 fixed-point arithmetic: a single array load on read instead of a Vec<State> scan.
- Test fse::check_tables rewritten to enumerate next_state over every (symbol, input_state) pair: same invariant, no live per-symbol storage needed.
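
The three builder phases named above (cumul, spread, sorted-by-symbol sweep) can be sketched roughly as follows. This is a minimal illustration, not the code from `fse_encoder.rs`: the function name `build_state_table_sketch` and its signature are hypothetical, it assumes every count is non-negative with no "less than 1" probabilities, it assumes the counts sum to `1 << table_log`, and it stores the raw decoder slot rather than the donor's `table_size + slot`.

```rust
/// Hypothetical sketch of a cumul + spread + sorted-by-symbol sweep.
/// Returns (state_table_flat, table_symbol).
/// Assumptions: counts sum to 1 << table_log, 5 <= table_log <= 16,
/// at most 256 symbols, no "less than 1" probabilities.
fn build_state_table_sketch(counts: &[u32], table_log: u32) -> (Vec<u16>, Vec<u8>) {
    assert!((5..=16).contains(&table_log) && counts.len() <= 256);
    let table_size = 1usize << table_log;
    assert_eq!(counts.iter().sum::<u32>() as usize, table_size);

    // Phase 1 (cumul): cumul[s] is the first sorted-by-symbol index of symbol s.
    // u32 so the final prefix sum (== table_size) is representable at table_log 16.
    let mut cumul = vec![0u32; counts.len() + 1];
    for (s, &c) in counts.iter().enumerate() {
        cumul[s + 1] = cumul[s] + c;
    }

    // Phase 2 (spread): walk the table with an odd step so each symbol's
    // occurrences are scattered over the decoder slots.
    let step = (table_size >> 1) + (table_size >> 3) + 3;
    let mask = table_size - 1;
    let mut table_symbol = vec![0u8; table_size];
    let mut pos = 0usize;
    for (s, &c) in counts.iter().enumerate() {
        for _ in 0..c {
            table_symbol[pos] = s as u8;
            pos = (pos + step) & mask;
        }
    }
    assert_eq!(pos, 0, "after table_size steps the walk is back at slot 0");

    // Phase 3 (sweep): visit decoder slots in ascending order and append each
    // slot to its owning symbol's run in the flat table.
    let mut cursor = cumul.clone();
    let mut state_table_flat = vec![0u16; table_size];
    for (slot, &s) in table_symbol.iter().enumerate() {
        let idx = cursor[s as usize] as usize;
        state_table_flat[idx] = slot as u16;
        cursor[s as usize] += 1;
    }
    (state_table_flat, table_symbol)
}
```

Each phase is a single linear pass, so the sketch costs O(num_symbols + table_size) rather than O(num_symbols × table_size). Calling it with counts `[13, 9, 7, 3]` and `table_log = 5` yields a flat table whose first 13 entries are the ascending decoder slots owned by symbol 0.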

Results

Scenario	main	this PR	Δ
L2_dfast small-4k-log-lines pure_rust	83.7 µs	47.4 µs	−43% (+77% thrpt)
L2_dfast small-4k-log-lines c_ffi	7.1 µs	7.1 µs	noise
L3_dfast z000033 pure_rust	13.1 ms	13.1 ms	unchanged
Small-4k-log-lines ratio (L3)	R=150	R=150	unchanged
z000033 L3 ratio	R=522171	R=522176	+5 B (still beats donor 527148)

Gap vs donor (c_ffi) on small-input compression: 11.8× → 6.7× (83.7 µs / 7.1 µs → 47.4 µs / 7.1 µs). No regression on large inputs (entropy build wasn't the bottleneck there). Ratio preserved across the level matrix.

Why this and not the huff0 `optimal_table_log` rewrite

I also tried donor's huff0 fast-path `HUF_optimalTableLog` (single-shot `tableLog` instead of the current `min_table_log..=11` search). That gives another −22% on top of this PR but loses ratio vs donor on one cell (L1_fast on small-4k-log: R=159 vs C=157, +2 B). Per project rule "Ratio first — if rust_bytes > ffi_bytes we lose vs donor → real bug", the search must stay until a smarter cheap proxy for the description-size scoring is in place. Separate follow-up task.

Test plan

- 501/501 lib tests pass via nextest
- clippy clean with -D warnings
- fmt clean
- compare_ffi ratio sweep level_1_fast..level_22_btultra2 over all scenarios: no new rust_bytes > ffi_bytes cells introduced

Part of #110

Summary by CodeRabbit

Refactor
- Reduced memory use and simplified encoder table layout by switching to a flat state table and storing only per-symbol start index and bit-width metadata, eliminating per-symbol state vectors.
Tests
- Strengthened validation to exhaustively verify encoder→decoder transitions for each symbol and ensure every decoder slot is reachable.

…ate> #110.

Replace O(num_symbols × table_size) builder loop + BTreeSet dedup with donor `FSE_buildCTable_wksp` (cumul + tableSymbol spread + sorted-by-symbol sweep). Production no longer materializes any per-state Vec; `nextStateTable` (`state_table_flat`) + `symbolTT` are everything the hot path needs. `start_state` and `max_num_bits` are precomputed once per symbol via donor 16.16 arithmetic and held in plain arrays, so there is no per-symbol scan on read.

Test parity check `fse::check_tables` rewritten to enumerate `next_state` over every (symbol, input_state) pair (was iterating the per-symbol Vec). Same invariant, no live storage required.

Speed (small-4k-log-lines L2_dfast pure_rust): 83.7 µs → 47.4 µs (−43%, +77% throughput). z000033 L3 unchanged (entropy build was not the bottleneck there). Ratio preserved across the matrix.

501/501 lib tests, clippy clean.

coderabbitai · 2026-05-18T07:33:33Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 4bbbcba0-97b2-40cf-a9ce-196eaa291c4b

📥 Commits

Reviewing files that changed from the base of the PR and between 594e769 and a8c5e93.

📒 Files selected for processing (1)

zstd/src/fse/fse_encoder.rs

📝 Walkthrough

Walkthrough

Refactors FSE encoder metadata to remove per-symbol state vectors, storing per-symbol start_state and max_num_bits, emitting a donor-parity flat state_table_flat/symbol_tt; updates the table builder and check_tables test to validate the flat layout and O(1) transition model.
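
To make the O(1) transition model concrete, here is a rough sketch of the donor-style 16.16 fixed-point per-symbol parameters. It follows the arithmetic of the donor's `symbolTT` construction and `FSE_initCState2` as generally understood; the type and function names (`SymbolTransformSketch`, `symbol_transform_sketch`, `nb_bits_out`, `start_index_sketch`) are hypothetical, and how the PR actually stores `start_state` and `max_num_bits` is not shown here. The sketch assumes a normalized count of at least 1 and keeps `table_log` small so all `u32` arithmetic stays in range.

```rust
/// Per-symbol transform in 16.16 fixed-point form (illustrative sketch).
#[derive(Debug, Clone, Copy, PartialEq)]
struct SymbolTransformSketch {
    delta_nb_bits: u32,
    delta_find_state: i32,
}

/// `count` is the symbol's normalized count (>= 1); `total_before` is the sum
/// of the counts of all lower-numbered symbols.
fn symbol_transform_sketch(count: u32, total_before: u32, table_log: u32) -> SymbolTransformSketch {
    assert!(table_log <= 12 && count >= 1 && count <= (1 << table_log));
    // floor(log2(count - 1)); count == 1 is treated as 0, matching the
    // special case for single-occurrence symbols.
    let high_bit = if count > 1 { 31 - (count - 1).leading_zeros() } else { 0 };
    let max_bits_out = table_log - high_bit;
    let min_state_plus = count << max_bits_out;
    SymbolTransformSketch {
        delta_nb_bits: (max_bits_out << 16) - min_state_plus,
        delta_find_state: total_before as i32 - count as i32,
    }
}

/// Bits flushed when encoding from `state`, where `state` lies in
/// table_size..2*table_size. One add and one shift, no per-symbol scan.
fn nb_bits_out(tt: SymbolTransformSketch, state: u32) -> u32 {
    (state + tt.delta_nb_bits) >> 16
}

/// Index into the sorted-by-symbol table for the symbol's first state.
fn start_index_sketch(tt: SymbolTransformSketch) -> usize {
    let nb_bits = (tt.delta_nb_bits + (1 << 15)) >> 16;
    let value = (nb_bits << 16) - tt.delta_nb_bits;
    let idx = (value >> nb_bits) as i32 + tt.delta_find_state;
    debug_assert!(idx >= 0);
    idx as usize
}
```

For a symbol with count 3 in a 32-entry table (`table_log = 5`), states 32 to 47 flush 3 bits and states 48 to 63 flush 4 bits, and the start index equals `total_before`, the first entry of that symbol's run in the flat table.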

Changes

FSE Table Donor-Parity Refactor

Layer / File(s)	Summary
SymbolStates struct redesign and metadata accessors `zstd/src/fse/fse_encoder.rs`	`SymbolStates` drops `Vec<State>` and `start_state_slot`, adding `start_state: Option<usize>` and `max_num_bits: Option<u8>`. `FSETable::start_state` and `max_num_bits_for_symbol` read these fields directly. Removes an unused `BTreeSet` import.
Table construction with donor-parity layout `zstd/src/fse/fse_encoder.rs`	`build_table_from_probabilities` rewritten to emit `state_table_flat` and `symbol_tt` using multi-phase donor-parity placement and fixed-point donor parameter computation; per-symbol `start_state`/`max_num_bits` are computed directly and no per-symbol `Vec<State>` is materialized.
Test validation for new table structure `zstd/src/fse/mod.rs`	`check_tables` updated to enumerate all encoder transitions `(symbol, prev_state)`, compare encoder `baseline`/`num_bits` with decoder `new_state`/`num_bits`, verify decoder slot `symbol` routing, and assert every decoder slot is reached via a `hit` bitmap instead of searching per-decoder-slot through encoder states.
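
A check in the spirit of the `hit` bitmap described in the last row might look like the sketch below. It is a toy, not the project's `check_tables`: `toy_tables` and `all_slots_reachable` are hypothetical names, the tables are rebuilt inline from counts that are assumed to be at least 1 and to sum to `1 << table_log`, and only slot ownership and reachability are verified (the `baseline`/`num_bits` comparison against a real decoder table is left out).

```rust
/// Builds toy encoder tables: (owner per decoder slot, flat state table,
/// per-symbol (delta_nb_bits, delta_find_state)). Illustrative only.
fn toy_tables(counts: &[u32], table_log: u32) -> (Vec<u8>, Vec<u16>, Vec<(u32, i32)>) {
    assert!((5..=12).contains(&table_log));
    let size = 1usize << table_log;
    assert_eq!(counts.iter().sum::<u32>() as usize, size);
    assert!(counts.iter().all(|&c| c >= 1));

    // Spread symbols over decoder slots with an odd step.
    let step = (size >> 1) + (size >> 3) + 3;
    let mut owner = vec![0u8; size];
    let mut pos = 0usize;
    for (s, &c) in counts.iter().enumerate() {
        for _ in 0..c {
            owner[pos] = s as u8;
            pos = (pos + step) & (size - 1);
        }
    }

    // Sorted-by-symbol sweep plus per-symbol 16.16 parameters.
    let mut cursor = Vec::with_capacity(counts.len());
    let mut tt = Vec::with_capacity(counts.len());
    let mut total = 0u32;
    for &c in counts {
        cursor.push(total);
        let high_bit = if c > 1 { 31 - (c - 1).leading_zeros() } else { 0 };
        let max_bits = table_log - high_bit;
        tt.push(((max_bits << 16) - (c << max_bits), total as i32 - c as i32));
        total += c;
    }
    let mut flat = vec![0u16; size];
    for (slot, &s) in owner.iter().enumerate() {
        flat[cursor[s as usize] as usize] = slot as u16;
        cursor[s as usize] += 1;
    }
    (owner, flat, tt)
}

/// Enumerates every (symbol, input_state) transition; returns true only if
/// each one lands on a slot owned by that symbol and every slot is hit.
fn all_slots_reachable(counts: &[u32], table_log: u32) -> bool {
    let (owner, flat, tt) = toy_tables(counts, table_log);
    let size = 1u32 << table_log;
    let mut hit = vec![false; size as usize];
    for (s, &(delta_nb_bits, delta_find_state)) in tt.iter().enumerate() {
        for state in size..2 * size {
            let nb = (state + delta_nb_bits) >> 16;
            let idx = (state >> nb) as i32 + delta_find_state;
            let next = flat[idx as usize] as usize;
            if owner[next] as usize != s {
                return false; // transition routed to another symbol's slot
            }
            hit[next] = true;
        }
    }
    hit.iter().all(|&h| h)
}
```

Because the loop walks transitions rather than decoder slots, no per-symbol state list has to exist; the bitmap alone shows whether any decoder slot was left unreachable.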

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

perf(encoding): #111 Phase 7 — replace FSE encode linear search with donor-parity flat tables #164: Implements the donor-parity flat-table approach and per-symbol start_state/max_num_bits metadata referenced by this refactor.

Possibly related PRs

structured-world/structured-zstd#165: Similar refactor toward donor-parity flat tables and per-symbol metadata.
structured-world/structured-zstd#76: Related check_tables helper adjustments validating decoder new_state-based layout.

Poem

🐰 I hopped through tables neat and spry,
Replaced big vectors—now they fly.
Flat states hum and start states gleam,
Donor-parity makes transitions stream.
I nudge a carrot at the optimized scheme.

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately describes the main change: a performance optimization to FSE table building that eliminates per-symbol Vec materialization by adopting a donor approach with precomputed metadata.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


codecov · 2026-05-18T07:37:08Z

Codecov Report

❌ Patch coverage is 98.83721% with 1 line in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
zstd/src/fse/fse_encoder.rs	98.52%	1 Missing ⚠️


Copilot

Pull request overview

This PR optimizes FSE encoder table construction by replacing per-symbol Vec<State> materialization with donor-style flat next-state tables and precomputed per-symbol metadata.

Changes:

Reworked build_table_from_probabilities to build state_table_flat via cumul/spread/sorted sweep.
Replaced SymbolStates per-state storage with start_state, probability, and max_num_bits.
Updated the FSE table parity test helper to validate encoder transitions against decoder slots.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File	Description
`zstd/src/fse/fse_encoder.rs`	Implements donor-parity FSE table construction without per-symbol state vectors.
`zstd/src/fse/mod.rs`	Updates test validation to enumerate encoder transitions from the new flat table representation.

…comment

- `fse::check_tables`: assert every `(symbol, input_state)` transition lands on a decoder slot OWNED by that symbol (was silently skipping cross-symbol transitions, which would mask a routing bug).
- `build_table_from_probabilities` Phase 3 doc: rewrite to describe the raw-slot Rust convention without misclaiming the code stores `table_size + u` (donor representation); cross-reference `FSETable::next_state` arithmetic that depends on the raw-slot form.

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

`fse_decoder::FSETable::to_encoder_table` can round-trip through `build_table_from_probabilities` with `accuracy_log` up to `ENTRY_MAX_ACCURACY_LOG = 16`, where the prefix sum reaches `table_size = 65 536`, one past `u16::MAX`. The Phase-1 `cumul` array (and the Phase-3 `cursor` snapshot) must be `u32` so the cumulative count is representable for every valid `acc_log`. Slot indices written into `state_table_flat` stay in `0..=table_size-1` (≤ `u16::MAX`) and remain `u16`.
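
The width argument can be illustrated with a small sketch. The helper names `final_cumul_u16` and `final_cumul_u32` are hypothetical and exist only to contrast the two widths; they are not part of the crate.

```rust
/// Final prefix sum computed in u16; returns None if it overflows.
fn final_cumul_u16(counts: &[u16]) -> Option<u16> {
    counts.iter().try_fold(0u16, |acc, &c| acc.checked_add(c))
}

/// Final prefix sum computed in u32; always representable for a table of
/// at most 1 << 16 entries.
fn final_cumul_u32(counts: &[u16]) -> u32 {
    counts.iter().map(|&c| u32::from(c)).sum()
}
```

With two symbols of count 32 768 each (a 65 536-entry table), the `u16` version reports overflow while the `u32` version returns 65 536; the largest slot index, 65 535, still fits in `u16`.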

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@zstd/src/fse/fse_encoder.rs`:
- Around line 780-781: Add a defensive debug assertion that documents and
enforces the non-negative index invariant before casting and indexing: right
before using state_table_index (computed from init_value, init_nb_bits_out,
delta_find_state) assert that state_table_index >= 0 (and optionally assert
state_table_index as usize < state_table_flat.len() for extra safety), so the
subsequent conversion to usize and assignment to start_index is guarded and will
fail loudly in debug builds if the invariant is violated.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 72ea5d67-eb9e-4433-b0c6-7a350f4830ee

📥 Commits

Reviewing files that changed from the base of the PR and between 9b1b778 and 594e769.

📒 Files selected for processing (1)

zstd/src/fse/fse_encoder.rs

Donor `FSE_initCState2` arithmetic guarantees `state_table_index ≥ 0` by construction, but the `isize → usize` cast at the indexing site would silently wrap on regression. Add a `debug_assert!` so a future arithmetic bug surfaces in dev builds instead of an out-of-bounds panic or wrong slot read.
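
A guard of the kind described might look like this sketch. The variable names come from the review comment above; the function `start_index_checked` and its signature are hypothetical and only frame the assertion.

```rust
/// Computes the start index and asserts the invariants before the cast.
/// In release builds the debug assertions compile out.
fn start_index_checked(
    init_value: isize,
    init_nb_bits_out: u32,
    delta_find_state: isize,
    table_len: usize,
) -> usize {
    let state_table_index = (init_value >> init_nb_bits_out) + delta_find_state;
    debug_assert!(
        state_table_index >= 0,
        "start-state index must be non-negative, got {state_table_index}"
    );
    debug_assert!(
        (state_table_index as usize) < table_len,
        "start-state index {state_table_index} out of range for table of {table_len}"
    );
    state_table_index as usize
}
```

Without the first assertion a negative index would wrap silently, since `-1isize as usize` equals `usize::MAX`.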

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

…E-encode per candidate (#168)

* perf(huff0): replace FSE-encode-per-candidate with cheap entropy proxy

#167. `HuffmanTable::build_from_counts` previously called `try_table_description_size` per `min_table_log..=11` candidate, which ran a full FSE-encode of the weight stream against a freshly-built FSE table just to count bytes (~31 % inclusive on the 4 KiB compress profile after #166). Replace with `cheap_desc_size_proxy(weights)`: an integer entropy estimate that reproduces the donor `HUF_writeCTable_wksp` decision (FSE vs raw nibble) without touching the FSE encoder.

- FSE estimate: `sum(c_i * ceil_log2(total / c_i))` over the 13-bin weight histogram + 8 B header overhead (empirical upper bound for the `acc_log = 6` weight FSE table seen in `encode_weight_description`).
- Raw nibble: exact `weights.len().div_ceil(2) + 1`, representable when `weights.len() <= 128`.
- Return min of the two when both are representable.

Validated:

- 502/502 lib tests (incl. new `cheap_desc_size_proxy_is_conservative_vs_exact`)
- `compare_ffi` ratio REPORT sweep across all scenarios × all levels: no new `rust_bytes > ffi_bytes` cells; `decodecorpus-z000033` L18/L19 *improved* by 442 B each (R=443 434 → 442 992), because the proxy steers selection to a marginally tighter `table_log` on those cells. Every small-4k-log-lines cell preserved (L1_fast R=154 ≤ C=157, L2_dfast R=150 ≤ C=157, …).

Speed (`compress/level_2_dfast/small-4k-log-lines/matrix/pure_rust`): 47.4 µs → 34.2 µs (−28 % on top of #166; cumulative −59 % vs the pre-#165 baseline). Gap vs donor 6.7× → 4.8×.

* fix(huff0): use ceiling division in cheap_desc_size_proxy entropy bound

`cheap_desc_size_proxy` claims to be a conservative entropy upper bound, but used truncating integer division for `total / c` before `ceil_log2`. For non-integer ratios (e.g. `total=10, c=4 → 2.5`) this truncated to `2`, the subsequent `ceil_log2` emitted `1` bit, and the proxy under-shot: the symbol's true cost is log2(2.5) ≈ 1.32 bits, and the intended ceiling bound is 2 bits.

Switch to `total.div_ceil(c)` so the ceiling is taken BEFORE the `ceil_log2` step. Ratio sweep across `compare_ffi` corpus × every supported level: no new `rust_bytes > ffi_bytes` cells; small-4k-log unchanged; z000033 L18/L19 *improved* (R=442 992 → 442 863). small-4k-log L2_dfast pure_rust 34.2 µs → 32.8 µs (slightly faster: the tighter estimate cuts off more candidates earlier in the `min_table_log..=11` loop's monotone-increase break).

Added `cheap_desc_size_proxy_edge_cases` covering every `(fse_ok, raw_ok)` arm plus the `n == 0` early-out and the `ratio <= 1` clamp branch. The prior test only hit a handful of input shapes, leaving the `(true, false)` / `(false, true)` / `(false, false)` arms and the early-out without coverage.

* fix(huff0): cheap_desc_size_proxy off-by-one + drop per-candidate Vec alloc

Three threads from PR #168 review.

- **fse_ok off-by-one.** `encode_weight_description` rejects only `encoded.len() >= 128`, so `encoded.len() == 127` is the largest accepted FSE payload and the total serialized description (`encoded.len() + 1` length-byte prefix) is exactly 128 B at the boundary. The proxy's `fse_size` includes the length byte, so accept `<= 128`, not `<= 127`. As written the proxy would skip a valid candidate at the exact boundary and force a worse fallback.
- **Per-candidate `Vec<u8>` alloc.** `build_from_counts` called `table.weights()` per `table_log` candidate just to score `desc_size`, a fresh `Vec<u8>` allocation each iteration. Replace with a stack-allocated `[u8; 256]` buffer reused across iterations (counts.len() max is 256). The Vec from `build_donor_limited_weights` already carries the same weight values; just copy them into the buffer.
- **Test slice-length mismatch.** `cheap_desc_size_proxy_is_conservative_vs_exact` compared `proxy(weights)` (N items) against `table.try_table_description_size()` (which trims `[..N-1]` internally). Fix: trim `weights` before calling the proxy so both paths score the same serialized slice.

All 503/503 lib tests pass, clippy / fmt clean. Targeted ratio sweep (small-4k-log L1-L3, z000033 L1-L3, large-log-stream L1-L3): unchanged, with no new R>C cells and no regression.

* test(huff0): rebuild conservative-vs-exact fixtures from build_from_counts; align proxy docs

`cheap_desc_size_proxy_is_conservative_vs_exact` silently skipped every hand-curated fixture because their weight vectors ([1,2,3,4,5], [1;13], [6;50], (0..13).cycle().take(120), ...) all failed `huffman_weight_sum_is_power_of_two`: Kraft equality must hold for a valid Huffman weight set, and the synthetic arrays never satisfied it. The loop body was unreachable; the test passed trivially.

- Rebuild fixtures via `HuffmanTable::build_from_counts(counts)` → `table.weights()`. The encoder's own output is Kraft-valid by construction. Counts inputs cover skewed, uniform, geometric, wide alphabets, and near-raw-limit cases.
- Add an `exercised > 0` assertion to fail loud if a future refactor silently skips all fixtures again.
- Fix `raw_floor`: the writer's raw representation for the trimmed slice is `trimmed.len().div_ceil(2) + 1` (nibbles + length byte), not `weights.len().div_ceil(2)`.

Doc fixes:

- `cheap_desc_size_proxy` docstring: "representable when result is < 128 bytes" → "<= 128 bytes". `encode_weight_description` rejects only `encoded.len() >= 128`, so total `encoded.len() + 1` length-byte prefix tops out at exactly 128. The code at the `fse_ok` site already uses `<= 128`; the doc was out of sync.
- `cheap_desc_size_proxy` docstring: explicitly document that `n == 0` returns `None` (raw could in principle encode an empty slice as just the length byte, but production callers never hand `n == 0` here, because `build_from_counts` short-circuits on `symbol_cardinality <= 1`).
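
The scoring rule described in these commit messages can be sketched as below. This is an illustration assembled from the description only, not the crate's `cheap_desc_size_proxy`: the name `desc_size_proxy_sketch` is hypothetical, the bits-to-bytes rounding before adding the 8 B overhead is an assumption, and weights are assumed to lie in `0..=12` so they fit the 13-bin histogram.

```rust
/// ceil(log2(x)) for x >= 1; values of 0 and 1 both give 0 bits.
fn ceil_log2(x: u32) -> u32 {
    if x <= 1 { 0 } else { 32 - (x - 1).leading_zeros() }
}

/// Estimated description size in bytes, or None if neither form applies.
/// Assumption: every weight is <= 12 (13 histogram bins).
fn desc_size_proxy_sketch(weights: &[u8]) -> Option<usize> {
    let n = weights.len();
    if n == 0 {
        return None; // early-out for an empty weight slice
    }
    let mut hist = [0u32; 13];
    for &w in weights {
        hist[w as usize] += 1;
    }
    let total = n as u32;
    // Ceiling division BEFORE ceil_log2, so the per-symbol cost is never
    // rounded down by integer truncation.
    let bits: u32 = hist
        .iter()
        .filter(|&&c| c > 0)
        .map(|&c| c * ceil_log2(total.div_ceil(c)))
        .sum();
    // Assumed conversion: whole bytes plus the 8 B overhead from the text.
    let fse_size = (bits as usize).div_ceil(8) + 8;
    let raw_size = n.div_ceil(2) + 1; // nibbles plus one length byte
    let fse_ok = fse_size <= 128;
    let raw_ok = n <= 128;
    match (fse_ok, raw_ok) {
        (true, true) => Some(fse_size.min(raw_size)),
        (true, false) => Some(fse_size),
        (false, true) => Some(raw_size),
        (false, false) => None,
    }
}
```

The truncation bug from the second commit shows up directly in `ceil_log2`: with `total = 10` and `c = 4`, `ceil_log2(10 / 4)` is 1 while `ceil_log2(10u32.div_ceil(4))` is 2.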

Copilot AI review requested due to automatic review settings May 18, 2026 07:33

Copilot started reviewing on behalf of polaz May 18, 2026 07:35

Copilot AI reviewed May 18, 2026

Comment thread zstd/src/fse/mod.rs Outdated

Comment thread zstd/src/fse/fse_encoder.rs Outdated

polaz requested a review from Copilot May 18, 2026 07:44

Copilot started reviewing on behalf of polaz May 18, 2026 07:45

Copilot AI reviewed May 18, 2026

Comment thread zstd/src/fse/fse_encoder.rs Outdated

coderabbitai Bot reviewed May 18, 2026

Comment thread zstd/src/fse/fse_encoder.rs

Copilot AI review requested due to automatic review settings May 18, 2026 08:03

Copilot started reviewing on behalf of polaz May 18, 2026 08:03

Copilot AI reviewed May 18, 2026

polaz merged commit 85b7e1f into main May 18, 2026
25 checks passed

polaz deleted the perf/#110-entropy-table-builders branch May 18, 2026 08:06

sw-release-bot Bot mentioned this pull request May 18, 2026

chore: release v0.0.22 #156

Merged

Copilot AI mentioned this pull request May 18, 2026

perf(huff0): cache encoded weight-description bytes on HuffmanTable and reuse in emit path #170

Merged

polaz mentioned this pull request May 18, 2026

perf(decode + encode-greedy): close 3-5× donor gap on negative-level decompress; share SIMD primitives + add dedicated greedy strategy #178

Open

7 tasks

polaz merged 4 commits into main from perf/#110-entropy-table-builders

2 participants
