perf(fse): donor FSE_buildCTable_wksp parity — drop per-symbol Vec<State>#166
Conversation
…ate> #110. Replace O(num_symbols × table_size) builder loop + BTreeSet dedup with donor `FSE_buildCTable_wksp` (cumul + tableSymbol spread + sorted-by-symbol sweep). Production no longer materializes any per-state Vec; `nextStateTable` (`state_table_flat`) + `symbolTT` are everything the hot path needs. `start_state` and `max_num_bits` are precomputed once per symbol via donor 16.16 arithmetic and held in plain arrays — no per-symbol scan on read. Test parity check `fse::check_tables` rewritten to enumerate `next_state` over every (symbol, input_state) pair (was iterating the per-symbol Vec). Same invariant, no live storage required. Speed (small-4k-log-lines L2_dfast pure_rust): 83.7 µs → 47.4 µs (−43%, +77% throughput). z000033 L3 unchanged (entropy build was not the bottleneck there). Ratio preserved across the matrix. 501/501 lib tests, clippy clean.
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughRefactors FSE encoder metadata to remove per-symbol state vectors, storing per-symbol ChangesFSE Table Donor-Parity Refactor
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related issues
Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
Pull request overview
This PR optimizes FSE encoder table construction by replacing per-symbol Vec<State> materialization with donor-style flat next-state tables and precomputed per-symbol metadata.
Changes:
- Reworked
build_table_from_probabilitiesto buildstate_table_flatvia cumul/spread/sorted sweep. - Replaced
SymbolStatesper-state storage withstart_state,probability, andmax_num_bits. - Updated the FSE table parity test helper to validate encoder transitions against decoder slots.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
zstd/src/fse/fse_encoder.rs |
Implements donor-parity FSE table construction without per-symbol state vectors. |
zstd/src/fse/mod.rs |
Updates test validation to enumerate encoder transitions from the new flat table representation. |
…comment - `fse::check_tables`: assert every `(symbol, input_state)` transition lands on a decoder slot OWNED by that symbol (was silently skipping cross-symbol transitions, which would mask a routing bug). - `build_table_from_probabilities` Phase 3 doc: rewrite to describe the raw-slot Rust convention without misclaiming the code stores `table_size + u` (donor representation); cross-reference `FSETable::next_state` arithmetic that depends on the raw-slot form.
`fse_decoder::FSETable::to_encoder_table` can round-trip through `build_table_from_probabilities` with `accuracy_log` up to `ENTRY_MAX_ACCURACY_LOG = 16`, where the prefix sum reaches `table_size = 65 536` — one past `u16::MAX`. The Phase-1 `cumul` array (and the Phase-3 `cursor` snapshot) must be `u32` so the cumulative count is representable for every valid `acc_log`. Slot indices written into `state_table_flat` stay in `0..table_size-1` (≤ `u16::MAX`) and remain `u16`.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@zstd/src/fse/fse_encoder.rs`:
- Around line 780-781: Add a defensive debug assertion that documents and
enforces the non-negative index invariant before casting and indexing: right
before using state_table_index (computed from init_value, init_nb_bits_out,
delta_find_state) assert that state_table_index >= 0 (and optionally assert
state_table_index as usize < state_table_flat.len() for extra safety), so the
subsequent conversion to usize and assignment to start_index is guarded and will
fail loudly in debug builds if the invariant is violated.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 72ea5d67-eb9e-4433-b0c6-7a350f4830ee
📒 Files selected for processing (1)
zstd/src/fse/fse_encoder.rs
Donor `FSE_initCState2` arithmetic guarantees `state_table_index ≥ 0` by construction, but the `isize → usize` cast at the indexing site would silently wrap on regression. Add a `debug_assert!` so a future arithmetic bug surfaces in dev builds instead of an out-of-bounds panic or wrong slot read.
…E-encode per candidate (#168) * perf(huff0): replace FSE-encode-per-candidate with cheap entropy proxy #167. `HuffmanTable::build_from_counts` previously called `try_table_description_size` per `min_table_log..=11` candidate, which ran a full FSE-encode of the weight stream against a freshly-built FSE table just to count bytes (~31 % inclusive on the 4 KiB compress profile after #166). Replace with `cheap_desc_size_proxy(weights)`: an integer entropy estimate that reproduces the donor `HUF_writeCTable_wksp` decision (FSE vs raw nibble) without touching the FSE encoder. - FSE estimate: `sum(c_i * ceil_log2(total / c_i))` over the 13-bin weight histogram + 8 B header overhead (empirical upper bound for the `acc_log = 6` weight FSE table seen in `encode_weight_description`). - Raw nibble: exact `weights.len().div_ceil(2) + 1`, representable when `weights.len() <= 128`. - Return min of the two when both are representable. Validated: - 502/502 lib tests (incl. new `cheap_desc_size_proxy_is_conservative_vs_exact`) - `compare_ffi` ratio REPORT sweep across all scenarios × all levels: no new `rust_bytes > ffi_bytes` cells; `decodecorpus-z000033` L18/L19 *improved* by 442 B each (R=443 434 → 442 992) — the proxy steers selection to a marginally tighter `table_log` on those cells. Every small-4k-log-lines cell preserved (L1_fast R=154 ≤ C=157, L2_dfast R=150 ≤ C=157, …). Speed (`compress/level_2_dfast/small-4k-log-lines/matrix/pure_rust`): 47.4 µs → 34.2 µs (−28 % on top of #166; cumulative −59 % vs the pre-#165 baseline). Gap vs donor 6.7× → 4.8×. * fix(huff0): use ceiling division in cheap_desc_size_proxy entropy bound `cheap_desc_size_proxy` claims to be a conservative entropy upper bound, but used truncating integer division for `total / c` before `ceil_log2`. For non-integer ratios (e.g. `total=10, c=4 → 2.5`) this truncated to `2`, the subsequent `ceil_log2` emitted `1` bit, and the proxy under-shot the real entropy ≥ 2 bits per symbol. Switch to `total.div_ceil(c)` so the ceiling is taken BEFORE the `ceil_log2` step. Ratio sweep across `compare_ffi` corpus × every supported level: no new `rust_bytes > ffi_bytes` cells; small-4k-log unchanged; z000033 L18/L19 *improved* (R=442 992 → 442 863). small-4k-log L2_dfast pure_rust 34.2 µs → 32.8 µs (slightly faster: the tighter estimate cuts off more candidates earlier in the `min_table_log..=11` loop's monotone-increase break). Added `cheap_desc_size_proxy_edge_cases` covering every `(fse_ok, raw_ok)` arm plus the `n == 0` early-out and the `ratio <= 1` clamp branch — the prior test only hit a handful of input shapes, leaving the `(true, false)` / `(false, true)` / `(false, false)` arms and the early-out without coverage. * fix(huff0): cheap_desc_size_proxy off-by-one + drop per-candidate Vec alloc Three threads from PR #168 review. - **fse_ok off-by-one.** `encode_weight_description` rejects only `encoded.len() >= 128`, so `encoded.len() == 127` is the largest accepted FSE payload and the total serialized description (`encoded.len() + 1` length-byte prefix) is exactly 128 B at the boundary. The proxy's `fse_size` includes the length byte — accept `<= 128`, not `<= 127`. As written the proxy would skip a valid candidate at the exact boundary and force a worse fallback. - **Per-candidate `Vec<u8>` alloc.** `build_from_counts` called `table.weights()` per `table_log` candidate just to score `desc_size` — fresh `Vec<u8>` allocation each iteration. Replace with a stack-allocated `[u8; 256]` buffer reused across iterations (counts.len() max is 256). The Vec from `build_donor_limited_weights` already carries the same weight values; just copy them into the buffer. - **Test slice-length mismatch.** `cheap_desc_size_proxy_is_conservative_vs_exact` compared `proxy(weights)` (N items) against `table.try_table_description_size()` (which trims `[..N-1]` internally). Fix: trim `weights` before calling the proxy so both paths score the same serialized slice. All 503/503 lib tests pass, clippy / fmt clean. Targeted ratio sweep (small-4k-log L1-L3, z000033 L1-L3, large-log-stream L1-L3): unchanged — no new R>C cells, no regression. * test(huff0): rebuild conservative-vs-exact fixtures from build_from_counts; align proxy docs `cheap_desc_size_proxy_is_conservative_vs_exact` silently skipped every hand-curated fixture because their weight vectors ([1,2,3,4,5], [1;13], [6;50], (0..13).cycle().take(120), ...) all failed `huffman_weight_sum_is_power_of_two` — Kraft equality must hold for a valid Huffman weight set, and the synthetic arrays never did. The loop body was unreachable; the test passed trivially. - Rebuild fixtures via `HuffmanTable::build_from_counts(counts)` → `table.weights()`. The encoder's own output is Kraft-valid by construction. Counts inputs cover skewed, uniform, geometric, wide alphabets, and near-raw-limit cases. - Add an `exercised > 0` assertion to fail loud if a future refactor silently skips all fixtures again. - Fix `raw_floor`: the writer's raw representation for the trimmed slice is `trimmed.len().div_ceil(2) + 1` (nibbles + length byte), not `weights.len().div_ceil(2)`. Doc fixes: - `cheap_desc_size_proxy` docstring: "representable when result is < 128 bytes" → "<= 128 bytes" — `encode_weight_description` rejects only `encoded.len() >= 128`, so total `encoded.len() + 1` length-byte prefix tops out at exactly 128. The code at the `fse_ok` site already uses `<= 128`; the doc was out of sync. - `cheap_desc_size_proxy` docstring: explicitly document that `n == 0` returns `None` (raw could in principle encode an empty slice as just the length byte, but production callers never hand `n == 0` here — `build_from_counts` short-circuits on `symbol_cardinality <= 1`).
Summary
Part of #110 — block compressor entropy coding rewrite. This PR targets the FSE table builder (
build_table_from_probabilities) which a fresh profile flagged as the dominant cost on small inputs.Replace the O(num_symbols × table_size) builder loop with donor
FSE_buildCTable_wksp(cumul + spread + sorted-by-symbol sweep):Vec<State>entirely fromFSETable. Production no longer materializes any per-state vector;nextStateTable(==state_table_flat) +symbolTTare all the hot path needs.start_stateandmax_num_bitsonce per symbol via donor 16.16 fixed-point arithmetic — single array load on read instead of aVec<State>scan.fse::check_tablesrewritten to enumeratenext_stateover every (symbol, input_state) pair — same invariant, no live per-symbol storage needed.Results
Gap on small-input compression: 11.8× → 6.7×. No regression on large inputs (entropy build wasn't the bottleneck there). Ratio preserved across the level matrix.
Why this and not the huff0 `optimal_table_log` rewrite
I also tried donor's huff0 fast-path `HUF_optimalTableLog` (single-shot `tableLog` instead of the current `min_table_log..=11` search). That gives another −22% on top of this PR but loses ratio vs donor on one cell (L1_fast on small-4k-log: R=159 vs C=157, +2 B). Per project rule "Ratio first — if rust_bytes > ffi_bytes we lose vs donor → real bug", the search must stay until a smarter cheap proxy for the description-size scoring is in place. Separate follow-up task.
Test plan
Part of #110
Summary by CodeRabbit
Refactor
Tests