Skip to content

perf(huff0): #167 cheap entropy proxy for table_log selection — no FSE-encode per candidate#168

Merged
polaz merged 4 commits into
mainfrom
perf/#167-huff0-optimal-table-log-cheap-proxy
May 18, 2026
Merged

perf(huff0): #167 cheap entropy proxy for table_log selection — no FSE-encode per candidate#168
polaz merged 4 commits into
mainfrom
perf/#167-huff0-optimal-table-log-cheap-proxy

Conversation

@polaz
Copy link
Copy Markdown
Member

@polaz polaz commented May 18, 2026

Summary

Closes #167. HuffmanTable::build_from_counts previously ran a min_table_log..=11 speculative search, calling try_table_description_size per candidate — which actually FSE-encodes the weight stream against a freshly-built FSE table just to count bytes. That was ~31% inclusive on the post-#166 4 KiB compress profile.

Replace with cheap_desc_size_proxy(weights): an integer entropy estimate that reproduces the donor HUF_writeCTable_wksp decision (FSE vs raw nibble) without touching the FSE encoder.

Algorithm

For each table_log candidate:

  • Raw nibble — exact size = weights.len().div_ceil(2) + 1. Representable when weights.len() <= 128.
  • FSEsum(c_i * ceil_log2(total / c_i)) bits over the 13-bin weight histogram + 8 B header constant. Empirically-derived upper bound for the acc_log = 6 weight FSE table. Representable when result < 128 B.
  • Return min(fse_size, raw_size) when both representable.

Donor HUF_writeCTable_wksp picks the smaller of the two; the proxy mirrors that decision analytically.

Validation

Ratio sweep (STRUCTURED_ZSTD_EMIT_REPORT=1 cargo bench --bench compare_ffi --features dict_builder) — every scenario × every supported level:

  • Zero new rust_bytes > ffi_bytes cells introduced.
  • small-4k-log-lines/L1_fast preserved at R=154 (≤ C=157) — the cell that broke when the donor HUF_optimalTableLog fast-path single-shot was tried in perf(fse): donor FSE_buildCTable_wksp parity — drop per-symbol Vec<State> #166 review and reverted.
  • decodecorpus-z000033/level_18_btultra + /level_19_btultra improved by 442 B (R=443 434 → 442 992) — the proxy steers selection to a marginally tighter table_log on those cells.
  • Every other R/C pair: unchanged or improved.

Speedcompress/level_2_dfast/small-4k-log-lines/matrix/pure_rust:

Baseline Time vs pre-#165
pre-#165 83.7 µs
#165 (FSE encode_sequences O(1)) 59.7 µs −29 %
#166 (FSE build donor parity) 47.4 µs −43 %
this PR (#167 cheap proxy) 34.2 µs −59 % (+145 % thrpt)

Gap vs C donor (7.1 µs): 11.8× → 4.8×.

No regression on larger corpora (entropy table-log search wasn't the bottleneck on z000033 — 13.1 ms unchanged).

Test plan

  • 502/502 lib tests pass via cargo nextest, including new cheap_desc_size_proxy_is_conservative_vs_exact sanity test.
  • cargo clippy --all-targets -- -D warnings clean.
  • cargo fmt --check clean.
  • compare_ffi REPORT sweep documented above.

Closes #167.

Summary by CodeRabbit

  • Refactor

    • Improved Huffman table selection to speed candidate evaluation, reduce transient memory use, and produce more consistent/compressed table descriptions for better overall compression efficiency.
  • Tests

    • Added unit tests covering varied weight distributions and edge cases to validate description-size estimation, representability decisions, and selection behavior.

Review Change Stack

#167. `HuffmanTable::build_from_counts` previously called
`try_table_description_size` per `min_table_log..=11` candidate, which
ran a full FSE-encode of the weight stream against a freshly-built FSE
table just to count bytes (~31 % inclusive on the 4 KiB compress
profile after #166).

Replace with `cheap_desc_size_proxy(weights)`: an integer entropy
estimate that reproduces the donor `HUF_writeCTable_wksp` decision
(FSE vs raw nibble) without touching the FSE encoder.

- FSE estimate: `sum(c_i * ceil_log2(total / c_i))` over the 13-bin
  weight histogram + 8 B header overhead (empirical upper bound for
  the `acc_log = 6` weight FSE table seen in
  `encode_weight_description`).
- Raw nibble: exact `weights.len().div_ceil(2) + 1`, representable
  when `weights.len() <= 128`.
- Return min of the two when both are representable.

Validated:
- 502/502 lib tests (incl. new `cheap_desc_size_proxy_is_conservative_vs_exact`)
- `compare_ffi` ratio REPORT sweep across all scenarios × all levels:
  no new `rust_bytes > ffi_bytes` cells; `decodecorpus-z000033` L18/L19
  *improved* by 442 B each (R=443 434 → 442 992) — the proxy steers
  selection to a marginally tighter `table_log` on those cells. Every
  small-4k-log-lines cell preserved (L1_fast R=154 ≤ C=157, L2_dfast
  R=150 ≤ C=157, …).

Speed (`compress/level_2_dfast/small-4k-log-lines/matrix/pure_rust`):
47.4 µs → 34.2 µs (−28 % on top of #166; cumulative −59 % vs the
pre-#165 baseline). Gap vs donor 6.7× → 4.8×.
Copilot AI review requested due to automatic review settings May 18, 2026 09:09
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 18, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 07c6d5b5-8110-434a-acde-0b6c4ff6846d

📥 Commits

Reviewing files that changed from the base of the PR and between ad375a6 and 59f96c5.

📒 Files selected for processing (1)
  • zstd/src/huff0/huff0_encoder.rs

📝 Walkthrough

Walkthrough

This PR adds cheap_desc_size_proxy—an analytic estimator for Huffman weight-description serialized size—adds unit tests validating its conservativeness versus exact sizing, and replaces per-candidate try_table_description_size calls in HuffmanTable::build_from_counts with the proxy to avoid expensive FSE encodes.

Changes

Huffman Table Description-Size Proxy

Layer / File(s) Summary
Integration into candidate search loop
zstd/src/huff0/huff0_encoder.rs
build_from_counts now calls cheap_desc_size_proxy instead of try_table_description_size during candidate evaluation; the loop trims the sentinel consistently, writes candidate weights into a stack buffer to avoid allocations, and skips candidates rejected by the proxy.
Proxy function implementation and correctness validation
zstd/src/huff0/huff0_encoder.rs
cheap_desc_size_proxy estimates weight-description serialized size using exact raw nibble sizing for small alphabets or an FSE-size upper bound from a 13-bin histogram plus fixed header overhead; returns None when neither representation fits. Unit tests cheap_desc_size_proxy_is_conservative_vs_exact and cheap_desc_size_proxy_edge_cases validate representability decisions and edge branches.

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Possibly related issues:
    • #169: Adds a cheap_desc_size_proxy and tests in huff0_encoder.rs — directly related; both target description-size estimation and proxy-based candidate selection.

"I nibble bits beneath the moonlit tree,
Calculating sizes quick and free.
No heavy encodes to slow my hop,
A tiny proxy helps me stop.
Tests keep my hop precise and light."

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately reflects the main change: replacing expensive FSE-encode with a cheap entropy proxy for table_log selection, directly addressing issue #167.
Linked Issues check ✅ Passed All acceptance criteria from issue #167 are met: cheap proxy implemented with debug asserts, zero new ratio regressions in compare_ffi, measurable speedup on hot path (47.4µs→34.2µs), and 502/502 tests pass.
Out of Scope Changes check ✅ Passed All changes are scoped to implementing the cheap_desc_size_proxy, optimizing table_log selection, and adding corresponding unit tests; no unrelated modifications present.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch perf/#167-huff0-optimal-table-log-cheap-proxy

Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 18, 2026

Codecov Report

❌ Patch coverage is 95.45455% with 4 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
zstd/src/huff0/huff0_encoder.rs 95.45% 4 Missing ⚠️

📢 Thoughts on this report? Let us know!

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR targets the huff0 table-log search hot path by replacing per-candidate try_table_description_size() (which FSE-encodes weights to measure bytes) with a new cheap_desc_size_proxy() that estimates weight-description size analytically, aiming to preserve table_log selection while removing the repeated FSE work.

Changes:

  • Switch HuffmanTable::build_from_counts candidate scoring from exact try_table_description_size() to cheap_desc_size_proxy() using a histogram/entropy proxy.
  • Add cheap_desc_size_proxy() implementation to estimate raw-vs-FSE description sizes without invoking the FSE encoder.
  • Add a new unit test intended to sanity-check proxy conservatism vs exact sizing.
Comments suppressed due to low confidence (2)

zstd/src/huff0/huff0_encoder.rs:478

  • The FSE representability check is off by one: encode_weight_description only rejects when encoded.len() >= 128, meaning the serialized description size encoded.len() + 1 can be up to 128 bytes. Since the proxy’s fse_size is documented as a byte size, fse_ok should allow fse_size <= 128 (or alternatively clarify that fse_size excludes the leading length byte and compare against < 128).
    let fse_payload_bytes = bits.div_ceil(8) as usize;
    // FSE description overhead seen in `encode_weight_description`:
    // 4 bits `acc_log` + the `write_table` probability stream (~5 B for
    // a 13-symbol alphabet) + a 1-byte length prefix. 8 B is an
    // empirically-derived upper bound for our `acc_log = 6` weight tables.
    const FSE_HEADER_OVERHEAD_BYTES: usize = 8;
    let fse_size = fse_payload_bytes + FSE_HEADER_OVERHEAD_BYTES;
    let fse_ok = fse_size <= 127;

zstd/src/huff0/huff0_encoder.rs:1062

  • cheap_desc_size_proxy_is_conservative_vs_exact currently risks being a no-op: all listed cases appear to fail huffman_weight_sum_is_power_of_two (e.g. 13×1 sums to 13), so the loop continues and no assertions run. Also raw_floor is computed as weights.len().div_ceil(2) but the raw encoding size is div_ceil(2) + 1 (length byte). Consider generating valid weight distributions (power-of-two sum) and tightening the invariant to match the documented “conservative” intent.
    let cases: &[(Vec<u8>, &str)] = &[
        (alloc::vec![1, 2, 3, 4, 5], "small uniform"),
        (alloc::vec![1; 13], "all-ones 13"),
        (alloc::vec![6; 50], "skewed dominant"),
        (
            (0..13u8).cycle().take(120).collect(),
            "cycle near raw limit",
        ),
        ((0..13u8).cycle().take(127).collect(), "cycle at raw limit"),
    ];
    for (weights, label) in cases {
        let proxy = cheap_desc_size_proxy(weights);
        // Build the actual table from these weights and ask for the
        // exact size via the existing encoder path.
        let weights_usize: alloc::vec::Vec<usize> = weights.iter().map(|&w| w as usize).collect();
        if !huffman_weight_sum_is_power_of_two(&weights_usize) {
            // Skip cases the encoder would reject upstream.
            continue;
        }
        let table = HuffmanTable::build_from_weights(&weights_usize);
        let exact = table.try_table_description_size();
        match (proxy, exact) {
            (Some(p), Some(e)) => {
                // Proxy is allowed to overestimate; flag a hard
                // underestimate vs raw (the natural floor — if the
                // exact path picks raw nibble, the proxy must not be
                // below `n.div_ceil(2) + 1 - 1`).
                let raw_floor = weights.len().div_ceil(2);
                assert!(
                    p + 2 >= e || p >= raw_floor,
                    "[{label}] proxy {p} under-shot exact {e} (raw_floor {raw_floor})"
                );

Comment thread zstd/src/huff0/huff0_encoder.rs Outdated
`cheap_desc_size_proxy` claims to be a conservative entropy upper
bound, but used truncating integer division for `total / c` before
`ceil_log2`. For non-integer ratios (e.g. `total=10, c=4 → 2.5`) this
truncated to `2`, the subsequent `ceil_log2` emitted `1` bit, and the
proxy under-shot the real entropy ≥ 2 bits per symbol.

Switch to `total.div_ceil(c)` so the ceiling is taken BEFORE the
`ceil_log2` step. Ratio sweep across `compare_ffi` corpus × every
supported level: no new `rust_bytes > ffi_bytes` cells; small-4k-log
unchanged; z000033 L18/L19 *improved* (R=442 992 → 442 863).
small-4k-log L2_dfast pure_rust 34.2 µs → 32.8 µs (slightly faster:
the tighter estimate cuts off more candidates earlier in the
`min_table_log..=11` loop's monotone-increase break).

Added `cheap_desc_size_proxy_edge_cases` covering every `(fse_ok,
raw_ok)` arm plus the `n == 0` early-out and the `ratio <= 1` clamp
branch — the prior test only hit a handful of input shapes, leaving
the `(true, false)` / `(false, true)` / `(false, false)` arms and the
early-out without coverage.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

zstd/src/huff0/huff0_encoder.rs:1066

  • In cheap_desc_size_proxy_is_conservative_vs_exact, raw_floor is computed as weights.len().div_ceil(2), but the raw nibble representation size used elsewhere is weights.len().div_ceil(2) + 1 (length byte + packed nibbles). As-is, the assertion can allow proxy underestimates that are still below the true raw encoding size; adjust the floor to match the actual raw size (or update the assertion/comment accordingly).
                // Proxy is allowed to overestimate; flag a hard
                // underestimate vs raw (the natural floor — if the
                // exact path picks raw nibble, the proxy must not be
                // below `n.div_ceil(2) + 1 - 1`).
                let raw_floor = weights.len().div_ceil(2);
                assert!(
                    p + 2 >= e || p >= raw_floor,
                    "[{label}] proxy {p} under-shot exact {e} (raw_floor {raw_floor})"

Comment thread zstd/src/huff0/huff0_encoder.rs Outdated
Comment thread zstd/src/huff0/huff0_encoder.rs Outdated
Comment thread zstd/src/huff0/huff0_encoder.rs
… alloc

Three threads from PR #168 review.

- **fse_ok off-by-one.** `encode_weight_description` rejects only
  `encoded.len() >= 128`, so `encoded.len() == 127` is the largest
  accepted FSE payload and the total serialized description
  (`encoded.len() + 1` length-byte prefix) is exactly 128 B at the
  boundary. The proxy's `fse_size` includes the length byte — accept
  `<= 128`, not `<= 127`. As written the proxy would skip a valid
  candidate at the exact boundary and force a worse fallback.

- **Per-candidate `Vec<u8>` alloc.** `build_from_counts` called
  `table.weights()` per `table_log` candidate just to score
  `desc_size` — fresh `Vec<u8>` allocation each iteration. Replace
  with a stack-allocated `[u8; 256]` buffer reused across iterations
  (counts.len() max is 256). The Vec from
  `build_donor_limited_weights` already carries the same weight
  values; just copy them into the buffer.

- **Test slice-length mismatch.** `cheap_desc_size_proxy_is_conservative_vs_exact`
  compared `proxy(weights)` (N items) against
  `table.try_table_description_size()` (which trims `[..N-1]`
  internally). Fix: trim `weights` before calling the proxy so both
  paths score the same serialized slice.

All 503/503 lib tests pass, clippy / fmt clean. Targeted ratio sweep
(small-4k-log L1-L3, z000033 L1-L3, large-log-stream L1-L3):
unchanged — no new R>C cells, no regression.
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@zstd/src/huff0/huff0_encoder.rs`:
- Around line 1062-1095: The test cheap_desc_size_proxy_is_conservative_vs_exact
currently never exercises the assertion because the power-of-two check uses the
untrimmed weights and all fixtures fail it; change the guard to compute the
trimmed slice first (trimmed = &weights[..weights.len() - 1]), build
trimmed_usize from trimmed (instead of weights_usize), and call
huffman_weight_sum_is_power_of_two(&trimmed_usize) to decide continue; also
compute raw_floor from trimmed.len() (raw_floor = trimmed.len().div_ceil(2)) so
the assertion comparing proxy p and exact e uses the correct floor value when
invoking cheap_desc_size_proxy and try_table_description_size.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 423417f1-1193-4b7a-9a2a-f3168115c007

📥 Commits

Reviewing files that changed from the base of the PR and between f61dde3 and ad375a6.

📒 Files selected for processing (1)
  • zstd/src/huff0/huff0_encoder.rs

Comment thread zstd/src/huff0/huff0_encoder.rs Outdated
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (2)

zstd/src/huff0/huff0_encoder.rs:1095

  • In cheap_desc_size_proxy_is_conservative_vs_exact, raw_floor is computed from the full weights length and omits the +1 length byte, while the proxy is called on the trimmed [..len-1] slice and returns a total serialized size. This makes the guardrail comparison hard to interpret and likely incorrect even once the test cases are made Kraft-valid. Consider computing the raw floor using the same trimmed length and the same size convention as try_table_description_size (i.e., trimmed.len().div_ceil(2) + 1).
            (Some(p), Some(e)) => {
                // Proxy is allowed to overestimate; flag a hard
                // underestimate vs raw (the natural floor — if the
                // exact path picks raw nibble, the proxy must not be
                // below `n.div_ceil(2) + 1 - 1`).
                let raw_floor = weights.len().div_ceil(2);
                assert!(
                    p + 2 >= e || p >= raw_floor,
                    "[{label}] proxy {p} under-shot exact {e} (raw_floor {raw_floor})"
                );

zstd/src/huff0/huff0_encoder.rs:1142

  • This comment says “+8 header = 136 > 127 → fse_ok=false”, but the proxy’s fse_ok threshold is fse_size <= 128 (total bytes). The reasoning should reference 128 (or explain in terms of encoded.len() < 128 for the payload) to stay consistent with the actual logic.
    // High-entropy + huge length: both representations fail →
    // `(false, false)` arm returns `None`. With 256 weights cycling
    // over 13 bins, `bits/sym ≈ ceil_log2(ceil(256/20)) = 4`. Total
    // payload bits ≈ 1024 B = 128 B, +8 header = 136 > 127 → fse_ok=false.
    // raw is also off the table (256 > 128) → None.

Comment thread zstd/src/huff0/huff0_encoder.rs Outdated
Comment thread zstd/src/huff0/huff0_encoder.rs
Comment thread zstd/src/huff0/huff0_encoder.rs Outdated
…ounts; align proxy docs

`cheap_desc_size_proxy_is_conservative_vs_exact` silently skipped
every hand-curated fixture because their weight vectors
([1,2,3,4,5], [1;13], [6;50], (0..13).cycle().take(120), ...) all
failed `huffman_weight_sum_is_power_of_two` — Kraft equality must
hold for a valid Huffman weight set, and the synthetic arrays never
did. The loop body was unreachable; the test passed trivially.

- Rebuild fixtures via `HuffmanTable::build_from_counts(counts)` →
  `table.weights()`. The encoder's own output is Kraft-valid by
  construction. Counts inputs cover skewed, uniform, geometric, wide
  alphabets, and near-raw-limit cases.
- Add an `exercised > 0` assertion to fail loud if a future refactor
  silently skips all fixtures again.
- Fix `raw_floor`: the writer's raw representation for the trimmed
  slice is `trimmed.len().div_ceil(2) + 1` (nibbles + length byte),
  not `weights.len().div_ceil(2)`.

Doc fixes:
- `cheap_desc_size_proxy` docstring: "representable when result is
  < 128 bytes" → "<= 128 bytes" — `encode_weight_description` rejects
  only `encoded.len() >= 128`, so total `encoded.len() + 1` length-byte
  prefix tops out at exactly 128. The code at the `fse_ok` site
  already uses `<= 128`; the doc was out of sync.
- `cheap_desc_size_proxy` docstring: explicitly document that
  `n == 0` returns `None` (raw could in principle encode an empty
  slice as just the length byte, but production callers never hand
  `n == 0` here — `build_from_counts` short-circuits on
  `symbol_cardinality <= 1`).
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.

@polaz polaz merged commit 414355a into main May 18, 2026
25 checks passed
@polaz polaz deleted the perf/#167-huff0-optimal-table-log-cheap-proxy branch May 18, 2026 12:08
@sw-release-bot sw-release-bot Bot mentioned this pull request May 18, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

perf(huff0): donor HUF_optimalTableLog fast-path single-shot tableLog with cheap description-size proxy

2 participants