perf(huff0): #167 cheap entropy proxy for table_log selection — no FSE-encode per candidate#168
Conversation
#167. `HuffmanTable::build_from_counts` previously called `try_table_description_size` per `min_table_log..=11` candidate, which ran a full FSE-encode of the weight stream against a freshly-built FSE table just to count bytes (~31 % inclusive on the 4 KiB compress profile after #166). Replace with `cheap_desc_size_proxy(weights)`: an integer entropy estimate that reproduces the donor `HUF_writeCTable_wksp` decision (FSE vs raw nibble) without touching the FSE encoder. - FSE estimate: `sum(c_i * ceil_log2(total / c_i))` over the 13-bin weight histogram + 8 B header overhead (empirical upper bound for the `acc_log = 6` weight FSE table seen in `encode_weight_description`). - Raw nibble: exact `weights.len().div_ceil(2) + 1`, representable when `weights.len() <= 128`. - Return min of the two when both are representable. Validated: - 502/502 lib tests (incl. new `cheap_desc_size_proxy_is_conservative_vs_exact`) - `compare_ffi` ratio REPORT sweep across all scenarios × all levels: no new `rust_bytes > ffi_bytes` cells; `decodecorpus-z000033` L18/L19 *improved* by 442 B each (R=443 434 → 442 992) — the proxy steers selection to a marginally tighter `table_log` on those cells. Every small-4k-log-lines cell preserved (L1_fast R=154 ≤ C=157, L2_dfast R=150 ≤ C=157, …). Speed (`compress/level_2_dfast/small-4k-log-lines/matrix/pure_rust`): 47.4 µs → 34.2 µs (−28 % on top of #166; cumulative −59 % vs the pre-#165 baseline). Gap vs donor 6.7× → 4.8×.
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughThis PR adds ChangesHuffman Table Description-Size Proxy
🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
Pull request overview
This PR targets the huff0 table-log search hot path by replacing per-candidate try_table_description_size() (which FSE-encodes weights to measure bytes) with a new cheap_desc_size_proxy() that estimates weight-description size analytically, aiming to preserve table_log selection while removing the repeated FSE work.
Changes:
- Switch
HuffmanTable::build_from_countscandidate scoring from exacttry_table_description_size()tocheap_desc_size_proxy()using a histogram/entropy proxy. - Add
cheap_desc_size_proxy()implementation to estimate raw-vs-FSE description sizes without invoking the FSE encoder. - Add a new unit test intended to sanity-check proxy conservatism vs exact sizing.
Comments suppressed due to low confidence (2)
zstd/src/huff0/huff0_encoder.rs:478
- The FSE representability check is off by one:
encode_weight_descriptiononly rejects whenencoded.len() >= 128, meaning the serialized description sizeencoded.len() + 1can be up to 128 bytes. Since the proxy’sfse_sizeis documented as a byte size,fse_okshould allowfse_size <= 128(or alternatively clarify thatfse_sizeexcludes the leading length byte and compare against< 128).
let fse_payload_bytes = bits.div_ceil(8) as usize;
// FSE description overhead seen in `encode_weight_description`:
// 4 bits `acc_log` + the `write_table` probability stream (~5 B for
// a 13-symbol alphabet) + a 1-byte length prefix. 8 B is an
// empirically-derived upper bound for our `acc_log = 6` weight tables.
const FSE_HEADER_OVERHEAD_BYTES: usize = 8;
let fse_size = fse_payload_bytes + FSE_HEADER_OVERHEAD_BYTES;
let fse_ok = fse_size <= 127;
zstd/src/huff0/huff0_encoder.rs:1062
cheap_desc_size_proxy_is_conservative_vs_exactcurrently risks being a no-op: all listedcasesappear to failhuffman_weight_sum_is_power_of_two(e.g. 13×1sums to 13), so the loopcontinues and no assertions run. Alsoraw_flooris computed asweights.len().div_ceil(2)but the raw encoding size isdiv_ceil(2) + 1(length byte). Consider generating valid weight distributions (power-of-two sum) and tightening the invariant to match the documented “conservative” intent.
let cases: &[(Vec<u8>, &str)] = &[
(alloc::vec![1, 2, 3, 4, 5], "small uniform"),
(alloc::vec![1; 13], "all-ones 13"),
(alloc::vec![6; 50], "skewed dominant"),
(
(0..13u8).cycle().take(120).collect(),
"cycle near raw limit",
),
((0..13u8).cycle().take(127).collect(), "cycle at raw limit"),
];
for (weights, label) in cases {
let proxy = cheap_desc_size_proxy(weights);
// Build the actual table from these weights and ask for the
// exact size via the existing encoder path.
let weights_usize: alloc::vec::Vec<usize> = weights.iter().map(|&w| w as usize).collect();
if !huffman_weight_sum_is_power_of_two(&weights_usize) {
// Skip cases the encoder would reject upstream.
continue;
}
let table = HuffmanTable::build_from_weights(&weights_usize);
let exact = table.try_table_description_size();
match (proxy, exact) {
(Some(p), Some(e)) => {
// Proxy is allowed to overestimate; flag a hard
// underestimate vs raw (the natural floor — if the
// exact path picks raw nibble, the proxy must not be
// below `n.div_ceil(2) + 1 - 1`).
let raw_floor = weights.len().div_ceil(2);
assert!(
p + 2 >= e || p >= raw_floor,
"[{label}] proxy {p} under-shot exact {e} (raw_floor {raw_floor})"
);
`cheap_desc_size_proxy` claims to be a conservative entropy upper bound, but used truncating integer division for `total / c` before `ceil_log2`. For non-integer ratios (e.g. `total=10, c=4 → 2.5`) this truncated to `2`, the subsequent `ceil_log2` emitted `1` bit, and the proxy under-shot the real entropy ≥ 2 bits per symbol. Switch to `total.div_ceil(c)` so the ceiling is taken BEFORE the `ceil_log2` step. Ratio sweep across `compare_ffi` corpus × every supported level: no new `rust_bytes > ffi_bytes` cells; small-4k-log unchanged; z000033 L18/L19 *improved* (R=442 992 → 442 863). small-4k-log L2_dfast pure_rust 34.2 µs → 32.8 µs (slightly faster: the tighter estimate cuts off more candidates earlier in the `min_table_log..=11` loop's monotone-increase break). Added `cheap_desc_size_proxy_edge_cases` covering every `(fse_ok, raw_ok)` arm plus the `n == 0` early-out and the `ratio <= 1` clamp branch — the prior test only hit a handful of input shapes, leaving the `(true, false)` / `(false, true)` / `(false, false)` arms and the early-out without coverage.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (1)
zstd/src/huff0/huff0_encoder.rs:1066
- In
cheap_desc_size_proxy_is_conservative_vs_exact,raw_flooris computed asweights.len().div_ceil(2), but the raw nibble representation size used elsewhere isweights.len().div_ceil(2) + 1(length byte + packed nibbles). As-is, the assertion can allow proxy underestimates that are still below the true raw encoding size; adjust the floor to match the actual raw size (or update the assertion/comment accordingly).
// Proxy is allowed to overestimate; flag a hard
// underestimate vs raw (the natural floor — if the
// exact path picks raw nibble, the proxy must not be
// below `n.div_ceil(2) + 1 - 1`).
let raw_floor = weights.len().div_ceil(2);
assert!(
p + 2 >= e || p >= raw_floor,
"[{label}] proxy {p} under-shot exact {e} (raw_floor {raw_floor})"
… alloc Three threads from PR #168 review. - **fse_ok off-by-one.** `encode_weight_description` rejects only `encoded.len() >= 128`, so `encoded.len() == 127` is the largest accepted FSE payload and the total serialized description (`encoded.len() + 1` length-byte prefix) is exactly 128 B at the boundary. The proxy's `fse_size` includes the length byte — accept `<= 128`, not `<= 127`. As written the proxy would skip a valid candidate at the exact boundary and force a worse fallback. - **Per-candidate `Vec<u8>` alloc.** `build_from_counts` called `table.weights()` per `table_log` candidate just to score `desc_size` — fresh `Vec<u8>` allocation each iteration. Replace with a stack-allocated `[u8; 256]` buffer reused across iterations (counts.len() max is 256). The Vec from `build_donor_limited_weights` already carries the same weight values; just copy them into the buffer. - **Test slice-length mismatch.** `cheap_desc_size_proxy_is_conservative_vs_exact` compared `proxy(weights)` (N items) against `table.try_table_description_size()` (which trims `[..N-1]` internally). Fix: trim `weights` before calling the proxy so both paths score the same serialized slice. All 503/503 lib tests pass, clippy / fmt clean. Targeted ratio sweep (small-4k-log L1-L3, z000033 L1-L3, large-log-stream L1-L3): unchanged — no new R>C cells, no regression.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@zstd/src/huff0/huff0_encoder.rs`:
- Around line 1062-1095: The test cheap_desc_size_proxy_is_conservative_vs_exact
currently never exercises the assertion because the power-of-two check uses the
untrimmed weights and all fixtures fail it; change the guard to compute the
trimmed slice first (trimmed = &weights[..weights.len() - 1]), build
trimmed_usize from trimmed (instead of weights_usize), and call
huffman_weight_sum_is_power_of_two(&trimmed_usize) to decide continue; also
compute raw_floor from trimmed.len() (raw_floor = trimmed.len().div_ceil(2)) so
the assertion comparing proxy p and exact e uses the correct floor value when
invoking cheap_desc_size_proxy and try_table_description_size.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 423417f1-1193-4b7a-9a2a-f3168115c007
📒 Files selected for processing (1)
zstd/src/huff0/huff0_encoder.rs
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (2)
zstd/src/huff0/huff0_encoder.rs:1095
- In
cheap_desc_size_proxy_is_conservative_vs_exact,raw_flooris computed from the fullweightslength and omits the +1 length byte, while the proxy is called on the trimmed[..len-1]slice and returns a total serialized size. This makes the guardrail comparison hard to interpret and likely incorrect even once the test cases are made Kraft-valid. Consider computing the raw floor using the same trimmed length and the same size convention astry_table_description_size(i.e.,trimmed.len().div_ceil(2) + 1).
(Some(p), Some(e)) => {
// Proxy is allowed to overestimate; flag a hard
// underestimate vs raw (the natural floor — if the
// exact path picks raw nibble, the proxy must not be
// below `n.div_ceil(2) + 1 - 1`).
let raw_floor = weights.len().div_ceil(2);
assert!(
p + 2 >= e || p >= raw_floor,
"[{label}] proxy {p} under-shot exact {e} (raw_floor {raw_floor})"
);
zstd/src/huff0/huff0_encoder.rs:1142
- This comment says “+8 header = 136 > 127 → fse_ok=false”, but the proxy’s
fse_okthreshold isfse_size <= 128(total bytes). The reasoning should reference 128 (or explain in terms ofencoded.len() < 128for the payload) to stay consistent with the actual logic.
// High-entropy + huge length: both representations fail →
// `(false, false)` arm returns `None`. With 256 weights cycling
// over 13 bins, `bits/sym ≈ ceil_log2(ceil(256/20)) = 4`. Total
// payload bits ≈ 1024 B = 128 B, +8 header = 136 > 127 → fse_ok=false.
// raw is also off the table (256 > 128) → None.
…ounts; align proxy docs `cheap_desc_size_proxy_is_conservative_vs_exact` silently skipped every hand-curated fixture because their weight vectors ([1,2,3,4,5], [1;13], [6;50], (0..13).cycle().take(120), ...) all failed `huffman_weight_sum_is_power_of_two` — Kraft equality must hold for a valid Huffman weight set, and the synthetic arrays never did. The loop body was unreachable; the test passed trivially. - Rebuild fixtures via `HuffmanTable::build_from_counts(counts)` → `table.weights()`. The encoder's own output is Kraft-valid by construction. Counts inputs cover skewed, uniform, geometric, wide alphabets, and near-raw-limit cases. - Add an `exercised > 0` assertion to fail loud if a future refactor silently skips all fixtures again. - Fix `raw_floor`: the writer's raw representation for the trimmed slice is `trimmed.len().div_ceil(2) + 1` (nibbles + length byte), not `weights.len().div_ceil(2)`. Doc fixes: - `cheap_desc_size_proxy` docstring: "representable when result is < 128 bytes" → "<= 128 bytes" — `encode_weight_description` rejects only `encoded.len() >= 128`, so total `encoded.len() + 1` length-byte prefix tops out at exactly 128. The code at the `fse_ok` site already uses `<= 128`; the doc was out of sync. - `cheap_desc_size_proxy` docstring: explicitly document that `n == 0` returns `None` (raw could in principle encode an empty slice as just the length byte, but production callers never hand `n == 0` here — `build_from_counts` short-circuits on `symbol_cardinality <= 1`).
Summary
Closes #167.
HuffmanTable::build_from_countspreviously ran amin_table_log..=11speculative search, callingtry_table_description_sizeper candidate — which actually FSE-encodes the weight stream against a freshly-built FSE table just to count bytes. That was ~31% inclusive on the post-#166 4 KiB compress profile.Replace with
cheap_desc_size_proxy(weights): an integer entropy estimate that reproduces the donorHUF_writeCTable_wkspdecision (FSE vs raw nibble) without touching the FSE encoder.Algorithm
For each
table_logcandidate:weights.len().div_ceil(2) + 1. Representable whenweights.len() <= 128.sum(c_i * ceil_log2(total / c_i))bits over the 13-bin weight histogram +8 Bheader constant. Empirically-derived upper bound for theacc_log = 6weight FSE table. Representable when result < 128 B.min(fse_size, raw_size)when both representable.Donor
HUF_writeCTable_wksppicks the smaller of the two; the proxy mirrors that decision analytically.Validation
Ratio sweep (
STRUCTURED_ZSTD_EMIT_REPORT=1 cargo bench --bench compare_ffi --features dict_builder) — every scenario × every supported level:rust_bytes > ffi_bytescells introduced.small-4k-log-lines/L1_fastpreserved atR=154(≤C=157) — the cell that broke when the donorHUF_optimalTableLogfast-path single-shot was tried in perf(fse): donor FSE_buildCTable_wksp parity — drop per-symbol Vec<State> #166 review and reverted.decodecorpus-z000033/level_18_btultra+/level_19_btultraimproved by 442 B (R=443 434 → 442 992) — the proxy steers selection to a marginally tightertable_logon those cells.Speed —
compress/level_2_dfast/small-4k-log-lines/matrix/pure_rust:Gap vs C donor (7.1 µs): 11.8× → 4.8×.
No regression on larger corpora (entropy table-log search wasn't the bottleneck on z000033 — 13.1 ms unchanged).
Test plan
cargo nextest, including newcheap_desc_size_proxy_is_conservative_vs_exactsanity test.cargo clippy --all-targets -- -D warningsclean.cargo fmt --checkclean.compare_ffiREPORT sweep documented above.Closes #167.
Summary by CodeRabbit
Refactor
Tests