perf(fse): donor FSE_buildCTable_wksp parity — drop per-symbol Vec<State>#166

Merged: polaz merged 4 commits into main from perf/#110-entropy-table-builders on May 18, 2026

@polaz polaz commented May 18, 2026

Summary

Part of #110 — block compressor entropy coding rewrite. This PR targets the FSE table builder (build_table_from_probabilities) which a fresh profile flagged as the dominant cost on small inputs.

Replace the O(num_symbols × table_size) builder loop with donor FSE_buildCTable_wksp (cumul + spread + sorted-by-symbol sweep):

  • Drop per-symbol Vec<State> entirely from FSETable. Production no longer materializes any per-state vector; nextStateTable (== state_table_flat) + symbolTT are all the hot path needs.
  • Precompute start_state and max_num_bits once per symbol via donor 16.16 fixed-point arithmetic — single array load on read instead of a Vec<State> scan.
  • Test fse::check_tables rewritten to enumerate next_state over every (symbol, input_state) pair — same invariant, no live per-symbol storage needed.
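
The three phases named above (cumul, spread, sorted-by-symbol sweep) can be sketched roughly as follows. This is a simplified illustration, not the code from this PR: `build_flat_state_table` is a hypothetical name, low-probability (`-1`) symbols are left out, and the function stores raw slot indices rather than the donor's `table_size + u` form.

```rust
/// Hypothetical sketch of a reference-style FSE encoder table build.
/// Assumes every count is >= 0 (no "-1" low-probability symbols) and
/// that the counts sum to exactly `1 << table_log`.
fn build_flat_state_table(norm: &[u16], table_log: u32) -> (Vec<u8>, Vec<u16>) {
    // Below table_log 5 the spread step is not guaranteed to be odd.
    assert!((5..=12).contains(&table_log));
    assert!(norm.len() <= 256);
    let table_size = 1usize << table_log;
    let total: usize = norm.iter().map(|&c| usize::from(c)).sum();
    assert_eq!(total, table_size);

    // Phase 1: prefix sums. cumul[s] is the first flat index owned by symbol s.
    let mut cumul = vec![0u32; norm.len() + 1];
    for (s, &c) in norm.iter().enumerate() {
        cumul[s + 1] = cumul[s] + u32::from(c);
    }

    // Phase 2: spread symbols over the decoder slots with the usual FSE step.
    let step = (table_size >> 1) + (table_size >> 3) + 3;
    let mask = table_size - 1;
    let mut table_symbol = vec![0u8; table_size];
    let mut pos = 0usize;
    for (s, &c) in norm.iter().enumerate() {
        for _ in 0..c {
            table_symbol[pos] = s as u8;
            pos = (pos + step) & mask;
        }
    }
    debug_assert_eq!(pos, 0, "an odd step visits every slot exactly once");

    // Phase 3: sweep slots in ascending order and append each slot to its
    // symbol's region, so each region ends up sorted by slot index.
    let mut cursor = cumul.clone();
    let mut state_table = vec![0u16; table_size];
    for (slot, &s) in table_symbol.iter().enumerate() {
        let idx = cursor[usize::from(s)] as usize;
        state_table[idx] = slot as u16;
        cursor[usize::from(s)] += 1;
    }
    (table_symbol, state_table)
}
```

Each phase is a single linear pass, which is where the saving over a per-symbol scan of the whole table comes from. Calling `build_flat_state_table(&[16, 8, 4, 4], 5)` returns a 32-entry flat table whose first 16 entries are the slots owned by symbol 0.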

Results

| Scenario | main | this PR | Δ |
|---|---|---|---|
| L2_dfast small-4k-log-lines pure_rust | 83.7 µs | 47.4 µs | −43% (+77% thrpt) |
| L2_dfast small-4k-log-lines c_ffi | 7.1 µs | 7.1 µs | noise |
| L3_dfast z000033 pure_rust | 13.1 ms | 13.1 ms | unchanged |
| Small-4k-log-lines ratio (L3) | R=150 | R=150 | unchanged |
| z000033 L3 ratio | R=522171 | R=522176 | +5 B (still beats donor 527148) |

Gap on small-input compression: 11.8× → 6.7×. No regression on large inputs (entropy build wasn't the bottleneck there). Ratio preserved across the level matrix.

Why this and not the huff0 `optimal_table_log` rewrite

I also tried donor's huff0 fast-path `HUF_optimalTableLog` (single-shot `tableLog` instead of the current `min_table_log..=11` search). That gives another −22% on top of this PR but loses ratio vs donor on one cell (L1_fast on small-4k-log: R=159 vs C=157, +2 B). Per project rule "Ratio first — if rust_bytes > ffi_bytes we lose vs donor → real bug", the search must stay until a smarter cheap proxy for the description-size scoring is in place. Separate follow-up task.

Test plan

  • 501/501 lib tests pass via nextest
  • clippy clean with -D warnings
  • fmt clean
  • compare_ffi ratio sweep level_1_fast..level_22_btultra2 over all scenarios — no new rust_bytes > ffi_bytes cells introduced

Part of #110

Summary by CodeRabbit

  • Refactor

    • Reduced memory use and simplified encoder table layout by switching to a flat state table and storing only per-symbol start index and bit-width metadata, eliminating per-symbol state vectors.
  • Tests

    • Strengthened validation to exhaustively verify encoder→decoder transitions for each symbol and ensure every decoder slot is reachable.

Review Change Stack

…ate>

#110. Replace O(num_symbols × table_size) builder loop + BTreeSet
dedup with donor `FSE_buildCTable_wksp` (cumul + tableSymbol spread +
sorted-by-symbol sweep). Production no longer materializes any
per-state Vec; `nextStateTable` (`state_table_flat`) + `symbolTT`
are everything the hot path needs. `start_state` and `max_num_bits`
are precomputed once per symbol via donor 16.16 arithmetic and held
in plain arrays — no per-symbol scan on read.
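
As a rough illustration of the 16.16 arithmetic mentioned above, the sketch below follows the per-symbol transform used by the reference FSE encoder (`deltaNbBits`, `deltaFindState`). The names `SymbolTransform`, `symbol_transform`, and `nb_bits_out` are hypothetical and need not match the fields in `fse_encoder.rs`; states are taken in the donor range `table_size..2 * table_size`.

```rust
/// Hypothetical per-symbol transform, modeled on the reference `symbolTT` entry.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct SymbolTransform {
    /// 16.16 fixed-point bias: `(state + delta_nb_bits) >> 16` is the bit count.
    delta_nb_bits: u32,
    /// Offset into the flat state table; may be negative for early symbols.
    delta_find_state: i32,
}

/// `norm` is the symbol's normalized count, `cumul` the prefix sum before it.
/// Assumes `1 <= norm <= table_size` and a table log in `5..=12`.
fn symbol_transform(norm: u32, cumul: u32, table_log: u32) -> SymbolTransform {
    assert!((5..=12).contains(&table_log));
    let table_size = 1u32 << table_log;
    assert!(norm >= 1 && norm <= table_size);
    let delta_nb_bits = if norm == 1 {
        // A count of 1 always costs the full table_log bits.
        (table_log << 16) - table_size
    } else {
        let high_bit = 31 - (norm - 1).leading_zeros();
        let max_bits_out = table_log - high_bit;
        let min_state_plus = norm << max_bits_out;
        (max_bits_out << 16) - min_state_plus
    };
    SymbolTransform {
        delta_nb_bits,
        delta_find_state: cumul as i32 - norm as i32,
    }
}

/// Number of bits flushed when encoding this symbol from `state`.
fn nb_bits_out(t: SymbolTransform, state: u32) -> u32 {
    (state + t.delta_nb_bits) >> 16
}
```

For a symbol with count 3 in a 32-slot table, states 32 to 47 flush 3 bits and states 48 to 63 flush 4, so one add and one shift replace any scan over per-state records.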

Test parity check `fse::check_tables` rewritten to enumerate
`next_state` over every (symbol, input_state) pair (was iterating
the per-symbol Vec). Same invariant, no live storage required.

Speed (small-4k-log-lines L2_dfast pure_rust): 83.7 µs → 47.4 µs
(−43%, +77% throughput). z000033 L3 unchanged (entropy build was
not the bottleneck there). Ratio preserved across the matrix.

501/501 lib tests, clippy clean.
Copilot AI review requested due to automatic review settings May 18, 2026 07:33

coderabbitai Bot commented May 18, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 4bbbcba0-97b2-40cf-a9ce-196eaa291c4b

📥 Commits

Reviewing files that changed from the base of the PR and between 594e769 and a8c5e93.

📒 Files selected for processing (1)
  • zstd/src/fse/fse_encoder.rs

📝 Walkthrough

Walkthrough

Refactors FSE encoder metadata to remove per-symbol state vectors, storing per-symbol start_state and max_num_bits, emitting a donor-parity flat state_table_flat/symbol_tt; updates the table builder and check_tables test to validate the flat layout and O(1) transition model.

Changes

FSE Table Donor-Parity Refactor

| Layer / File(s) | Summary |
|---|---|
| SymbolStates struct redesign and metadata accessors (`zstd/src/fse/fse_encoder.rs`) | `SymbolStates` drops `Vec<State>` and `start_state_slot`, adding `start_state: Option<usize>` and `max_num_bits: Option<u8>`. `FSETable::start_state` and `max_num_bits_for_symbol` read these fields directly. Removes an unused `BTreeSet` import. |
| Table construction with donor-parity layout (`zstd/src/fse/fse_encoder.rs`) | `build_table_from_probabilities` rewritten to emit `state_table_flat` and `symbol_tt` using multi-phase donor-parity placement and fixed-point donor parameter computation; per-symbol `start_state`/`max_num_bits` are computed directly and no per-symbol `Vec<State>` is materialized. |
| Test validation for new table structure (`zstd/src/fse/mod.rs`) | `check_tables` updated to enumerate all encoder transitions `(symbol, prev_state)`, compare encoder baseline/num_bits with decoder new_state/num_bits, verify decoder slot symbol routing, and assert every decoder slot is reached via a hit bitmap instead of searching per-decoder-slot through encoder states. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 I hopped through tables neat and spry,
Replaced big vectors—now they fly.
Flat states hum and start states gleam,
Donor-parity makes transitions stream.
I nudge a carrot at the optimized scheme.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: a performance optimization to FSE table building that eliminates per-symbol Vec materialization by adopting a donor approach with precomputed metadata. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |



codecov Bot commented May 18, 2026

Codecov Report

❌ Patch coverage is 98.83721% with 1 line in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| zstd/src/fse/fse_encoder.rs | 98.52% | 1 Missing ⚠️ |


Copilot AI left a comment

Pull request overview

This PR optimizes FSE encoder table construction by replacing per-symbol Vec<State> materialization with donor-style flat next-state tables and precomputed per-symbol metadata.

Changes:

  • Reworked build_table_from_probabilities to build state_table_flat via cumul/spread/sorted sweep.
  • Replaced SymbolStates per-state storage with start_state, probability, and max_num_bits.
  • Updated the FSE table parity test helper to validate encoder transitions against decoder slots.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
|---|---|
| zstd/src/fse/fse_encoder.rs | Implements donor-parity FSE table construction without per-symbol state vectors. |
| zstd/src/fse/mod.rs | Updates test validation to enumerate encoder transitions from the new flat table representation. |

Comment thread zstd/src/fse/mod.rs Outdated
Comment thread zstd/src/fse/fse_encoder.rs Outdated
…comment

- `fse::check_tables`: assert every `(symbol, input_state)` transition
  lands on a decoder slot OWNED by that symbol (was silently skipping
  cross-symbol transitions, which would mask a routing bug).
- `build_table_from_probabilities` Phase 3 doc: rewrite to describe
  the raw-slot Rust convention without misclaiming the code stores
  `table_size + u` (donor representation); cross-reference
  `FSETable::next_state` arithmetic that depends on the raw-slot form.

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Comment thread zstd/src/fse/fse_encoder.rs Outdated
`fse_decoder::FSETable::to_encoder_table` can round-trip through
`build_table_from_probabilities` with `accuracy_log` up to
`ENTRY_MAX_ACCURACY_LOG = 16`, where the prefix sum reaches
`table_size = 65 536` — one past `u16::MAX`. The Phase-1 `cumul`
array (and the Phase-3 `cursor` snapshot) must be `u32` so the
cumulative count is representable for every valid `acc_log`. Slot
indices written into `state_table_flat` stay in `0..table_size-1`
(≤ `u16::MAX`) and remain `u16`.
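
A small sketch of the width issue described above: the running total needs `u32`, while individual slot indices still fit in `u16`. The helper `cumulative_counts` is a hypothetical stand-in for the Phase-1 prefix sum, not the function in `fse_encoder.rs`.

```rust
/// Hypothetical Phase-1 helper: prefix sums over normalized counts.
/// The accumulator is u32 because the final entry equals table_size,
/// which is 65 536 when the accuracy log is 16.
fn cumulative_counts(norm: &[u32]) -> Vec<u32> {
    let mut cumul = Vec::with_capacity(norm.len() + 1);
    let mut running = 0u32;
    cumul.push(running);
    for &count in norm {
        running += count;
        cumul.push(running);
    }
    cumul
}
```

With counts `[32768, 16384, 16384]` the last prefix sum is 65 536, so `u16::try_from` on it fails, whereas the largest slot index, 65 535, still converts cleanly.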

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@zstd/src/fse/fse_encoder.rs`:
- Around line 780-781: Add a defensive debug assertion that documents and
enforces the non-negative index invariant before casting and indexing: right
before using state_table_index (computed from init_value, init_nb_bits_out,
delta_find_state) assert that state_table_index >= 0 (and optionally assert
state_table_index as usize < state_table_flat.len() for extra safety), so the
subsequent conversion to usize and assignment to start_index is guarded and will
fail loudly in debug builds if the invariant is violated.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 72ea5d67-eb9e-4433-b0c6-7a350f4830ee

📥 Commits

Reviewing files that changed from the base of the PR and between 9b1b778 and 594e769.

📒 Files selected for processing (1)
  • zstd/src/fse/fse_encoder.rs

Comment thread zstd/src/fse/fse_encoder.rs
Donor `FSE_initCState2` arithmetic guarantees `state_table_index ≥ 0`
by construction, but the `isize → usize` cast at the indexing site
would silently wrap on regression. Add a `debug_assert!` so a future
arithmetic bug surfaces in dev builds instead of an out-of-bounds
panic or wrong slot read.
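
The guarded cast could look roughly like the sketch below, which follows the reference `FSE_initCState2` arithmetic. The function name `start_index` and its parameters are hypothetical, and the sketch assumes a table log of at most 12 so the fixed-point subtraction cannot underflow.

```rust
/// Hypothetical sketch of the start-state index computation with the
/// invariant checked before the signed-to-unsigned cast.
/// Assumes table_log <= 12, so `(nb_bits_out << 16) >= delta_nb_bits`.
fn start_index(delta_nb_bits: u32, delta_find_state: i32, table_len: usize) -> usize {
    // Round the 16.16 value to the nearest whole bit count.
    let init_nb_bits_out = (delta_nb_bits + (1 << 15)) >> 16;
    let init_value = (init_nb_bits_out << 16) - delta_nb_bits;
    let state_table_index =
        (init_value >> init_nb_bits_out) as isize + delta_find_state as isize;
    // Fail loudly in dev builds instead of wrapping to a huge usize.
    debug_assert!(state_table_index >= 0, "negative FSE start index");
    debug_assert!((state_table_index as usize) < table_len);
    state_table_index as usize
}
```

For a symbol with count 3 and prefix sum 5 in a 32-slot table (`delta_nb_bits = 262096`, `delta_find_state = 2`), the result is index 5, the first flat entry owned by that symbol.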
Copilot AI review requested due to automatic review settings May 18, 2026 08:03

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

@polaz polaz merged commit 85b7e1f into main May 18, 2026
25 checks passed
@polaz polaz deleted the perf/#110-entropy-table-builders branch May 18, 2026 08:06
@sw-release-bot sw-release-bot Bot mentioned this pull request May 18, 2026
polaz added a commit that referenced this pull request May 18, 2026
…E-encode per candidate (#168)

* perf(huff0): replace FSE-encode-per-candidate with cheap entropy proxy

#167. `HuffmanTable::build_from_counts` previously called
`try_table_description_size` per `min_table_log..=11` candidate, which
ran a full FSE-encode of the weight stream against a freshly-built FSE
table just to count bytes (~31 % inclusive on the 4 KiB compress
profile after #166).

Replace with `cheap_desc_size_proxy(weights)`: an integer entropy
estimate that reproduces the donor `HUF_writeCTable_wksp` decision
(FSE vs raw nibble) without touching the FSE encoder.

- FSE estimate: `sum(c_i * ceil_log2(total / c_i))` over the 13-bin
  weight histogram + 8 B header overhead (empirical upper bound for
  the `acc_log = 6` weight FSE table seen in
  `encode_weight_description`).
- Raw nibble: exact `weights.len().div_ceil(2) + 1`, representable
  when `weights.len() <= 128`.
- Return min of the two when both are representable.

Validated:
- 502/502 lib tests (incl. new `cheap_desc_size_proxy_is_conservative_vs_exact`)
- `compare_ffi` ratio REPORT sweep across all scenarios × all levels:
  no new `rust_bytes > ffi_bytes` cells; `decodecorpus-z000033` L18/L19
  *improved* by 442 B each (R=443 434 → 442 992) — the proxy steers
  selection to a marginally tighter `table_log` on those cells. Every
  small-4k-log-lines cell preserved (L1_fast R=154 ≤ C=157, L2_dfast
  R=150 ≤ C=157, …).

Speed (`compress/level_2_dfast/small-4k-log-lines/matrix/pure_rust`):
47.4 µs → 34.2 µs (−28 % on top of #166; cumulative −59 % vs the
pre-#165 baseline). Gap vs donor 6.7× → 4.8×.

* fix(huff0): use ceiling division in cheap_desc_size_proxy entropy bound

`cheap_desc_size_proxy` claims to be a conservative entropy upper
bound, but used truncating integer division for `total / c` before
`ceil_log2`. For non-integer ratios (e.g. `total=10, c=4 → 2.5`) this
truncated to `2`, the subsequent `ceil_log2` emitted `1` bit, and the
proxy under-shot the real entropy ≥ 2 bits per symbol.

Switch to `total.div_ceil(c)` so the ceiling is taken BEFORE the
`ceil_log2` step. Ratio sweep across `compare_ffi` corpus × every
supported level: no new `rust_bytes > ffi_bytes` cells; small-4k-log
unchanged; z000033 L18/L19 *improved* (R=442 992 → 442 863).
small-4k-log L2_dfast pure_rust 34.2 µs → 32.8 µs (slightly faster:
the tighter estimate cuts off more candidates earlier in the
`min_table_log..=11` loop's monotone-increase break).
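
To make the rounding issue concrete, here is a minimal sketch of the bit estimate with both division modes. `ceil_log2` and `entropy_bits` are hypothetical helpers written for this illustration; the real `cheap_desc_size_proxy` also adds header overhead and compares against the raw nibble size, which is omitted here.

```rust
/// Smallest k such that 2^k >= x (0 for x <= 1).
fn ceil_log2(x: u32) -> u32 {
    if x <= 1 {
        0
    } else {
        32 - (x - 1).leading_zeros()
    }
}

/// Hypothetical integer entropy estimate in bits over a histogram:
/// the sum of c * ceil_log2(total / c), where the division either
/// truncates (old behavior) or rounds up (fixed behavior).
fn entropy_bits(hist: &[u32], ceil_ratio: bool) -> u32 {
    let total: u32 = hist.iter().sum();
    hist.iter()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let ratio = if ceil_ratio { total.div_ceil(c) } else { total / c };
            c * ceil_log2(ratio)
        })
        .sum()
}
```

For the histogram `[4, 6]` the true entropy is about 9.7 bits. The truncating form yields 4 bits, which undershoots, while the ceiling form yields 14 bits, which stays on the conservative side.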

Added `cheap_desc_size_proxy_edge_cases` covering every `(fse_ok,
raw_ok)` arm plus the `n == 0` early-out and the `ratio <= 1` clamp
branch — the prior test only hit a handful of input shapes, leaving
the `(true, false)` / `(false, true)` / `(false, false)` arms and the
early-out without coverage.

* fix(huff0): cheap_desc_size_proxy off-by-one + drop per-candidate Vec alloc

Three threads from PR #168 review.

- **fse_ok off-by-one.** `encode_weight_description` rejects only
  `encoded.len() >= 128`, so `encoded.len() == 127` is the largest
  accepted FSE payload and the total serialized description
  (`encoded.len() + 1` length-byte prefix) is exactly 128 B at the
  boundary. The proxy's `fse_size` includes the length byte — accept
  `<= 128`, not `<= 127`. As written the proxy would skip a valid
  candidate at the exact boundary and force a worse fallback.

- **Per-candidate `Vec<u8>` alloc.** `build_from_counts` called
  `table.weights()` per `table_log` candidate just to score
  `desc_size` — fresh `Vec<u8>` allocation each iteration. Replace
  with a stack-allocated `[u8; 256]` buffer reused across iterations
  (counts.len() max is 256). The Vec from
  `build_donor_limited_weights` already carries the same weight
  values; just copy them into the buffer.

- **Test slice-length mismatch.** `cheap_desc_size_proxy_is_conservative_vs_exact`
  compared `proxy(weights)` (N items) against
  `table.try_table_description_size()` (which trims `[..N-1]`
  internally). Fix: trim `weights` before calling the proxy so both
  paths score the same serialized slice.

All 503/503 lib tests pass, clippy / fmt clean. Targeted ratio sweep
(small-4k-log L1-L3, z000033 L1-L3, large-log-stream L1-L3):
unchanged — no new R>C cells, no regression.

* test(huff0): rebuild conservative-vs-exact fixtures from build_from_counts; align proxy docs

`cheap_desc_size_proxy_is_conservative_vs_exact` silently skipped
every hand-curated fixture because their weight vectors
([1,2,3,4,5], [1;13], [6;50], (0..13).cycle().take(120), ...) all
failed `huffman_weight_sum_is_power_of_two` — Kraft equality must
hold for a valid Huffman weight set, and the synthetic arrays never
did. The loop body was unreachable; the test passed trivially.
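
The validity condition can be illustrated with a short check. This sketch assumes the check runs over the complete weight list (including the last weight, which the serialized form leaves implicit) and that weights stay at or below 16; `kraft_sum_is_power_of_two` is a hypothetical name, not the crate's helper.

```rust
/// Hypothetical check: a complete Huffman weight list is valid only if
/// the sum of 2^(w - 1) over all non-zero weights is a power of two.
/// Assumes every weight is <= 16 so the shifts cannot overflow.
fn kraft_sum_is_power_of_two(weights: &[u8]) -> bool {
    assert!(weights.iter().all(|&w| w <= 16));
    let sum: u64 = weights
        .iter()
        .filter(|&&w| w > 0)
        .map(|&w| 1u64 << (w - 1))
        .sum();
    sum.is_power_of_two()
}
```

The hand-written vectors fail this test: `[1, 2, 3, 4, 5]` sums to 31 and thirteen weights of 1 sum to 13, neither a power of two. A list such as `[1, 1, 2, 3]` sums to 8 and passes.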

- Rebuild fixtures via `HuffmanTable::build_from_counts(counts)` →
  `table.weights()`. The encoder's own output is Kraft-valid by
  construction. Counts inputs cover skewed, uniform, geometric, wide
  alphabets, and near-raw-limit cases.
- Add an `exercised > 0` assertion to fail loud if a future refactor
  silently skips all fixtures again.
- Fix `raw_floor`: the writer's raw representation for the trimmed
  slice is `trimmed.len().div_ceil(2) + 1` (nibbles + length byte),
  not `weights.len().div_ceil(2)`.

Doc fixes:
- `cheap_desc_size_proxy` docstring: "representable when result is
  < 128 bytes" → "<= 128 bytes" — `encode_weight_description` rejects
  only `encoded.len() >= 128`, so total `encoded.len() + 1` length-byte
  prefix tops out at exactly 128. The code at the `fse_ok` site
  already uses `<= 128`; the doc was out of sync.
- `cheap_desc_size_proxy` docstring: explicitly document that
  `n == 0` returns `None` (raw could in principle encode an empty
  slice as just the length byte, but production callers never hand
  `n == 0` here — `build_from_counts` short-circuits on
  `symbol_cardinality <= 1`).