BitBuffer binary operations support in place updates to lhs by robert3005 · Pull Request #8162 · vortex-data/vortex

robert3005 · 2026-05-29T16:59:34Z

This speeds up And/Or/Xor - 50% to 4x time
Append - 2x (by shifting the work elsewhere in the code)
append_buffer - 10-50%

Key optimizations: - `Not for &BitBuffer`: allocate directly instead of clone+try_into_mut which always failed (clone shares Arc, so 2 refs = failure) → 27-58% faster bitwise NOT on references - `iter_bits()`: process u64 chunks instead of byte-by-byte → 13% faster iteration - `PartialEq`: fast path using memcmp for byte-aligned buffers - `append_buffer()`: memcpy fast path when both sides are byte-aligned → 20-52% faster buffer appends - `append_false()`: remove unnecessary branch (new bytes are zero-init) → 65% faster single-bit appends - `from_indices()`: use set_bit_unchecked directly on the byte slice - `FromIterator` tail: batch remaining items in u64 words → 13% faster from_iter - `sliced()`: use bitwise_unary_op_copy to avoid clone+fail path, fix byte range bug in aligned path - `filter_bitbuffer_by_indices`: detect contiguous runs for bulk copy - `filter_bitbuffer_by_slices`: use slice()+append_buffer() instead of per-bit get+append - Add #[inline] to hot methods: set, set_to, append, true_count, etc. Signed-off-by: Claude <noreply@anthropic.com> https://claude.ai/code/session_0163XH8LLYAkU2qNQGbmYjhB

BitBuffers almost always have shared backing storage (from slicing, array construction, etc.), so try_into_mut nearly always fails. The owned `Not for BitBuffer` now uses the same direct-copy path as the reference version, and the dead `bitwise_unary_op` function is removed. For true in-place mutation, use `BitBufferMut` directly. Signed-off-by: Claude <noreply@anthropic.com> https://claude.ai/code/session_0163XH8LLYAkU2qNQGbmYjhB

Keep the in-place mutation fast path for owned BitBuffer when the backing storage has exclusive access (Arc refcount == 1). When try_into_mut fails (shared storage), delegate to bitwise_unary_op_copy instead of duplicating the copy logic. Signed-off-by: Claude <noreply@anthropic.com> https://claude.ai/code/session_0163XH8LLYAkU2qNQGbmYjhB

The scan loop pattern `mask = mask.bitand(&conjunct_mask)` previously always allocated a new buffer because all binary ops took references. Now owned-left operands try try_into_mut for zero-allocation in-place mutation when the backing storage has exclusive access. Changes: - Add bitwise_binary_op_lhs_owned in ops.rs: tries in-place on owned left, falls back to allocating bitwise_binary_op - Wire BitAnd/BitOr/BitXor owned-left impls to use in-place path - Add BitBuffer::into_bitand_not for owned variant - Add BitAnd<&Mask>/BitOr<&Mask> for Mask: extracts owned BitBuffer for in-place binary ops in the scan loop - Update Mask::bitand_not to use owned BitBuffer path - Fix flat/zoned readers to capture density before consuming mask Signed-off-by: Claude <noreply@anthropic.com> https://claude.ai/code/session_0163XH8LLYAkU2qNQGbmYjhB

…no-op micro-opts Rebased onto develop, then verified the optimizations with benchmarks and hardened the ones that pay off. Correctness fixes (each with a regression test added first): - `bitwise_binary_op_lhs_owned`: the in-place path combined operands byte-for-byte, ignoring bit offsets, so `owned & &rhs` corrupted results when `left.offset() != right.offset()` (reachable via sliced masks in the scan loop). Now falls back to the offset-aware `bitwise_binary_op` unless the offsets match. Added an rstest comparing the owned path against the reference path across every offset/length/op combination. - `append_false` dropped its read-modify-write on the assumption that bits past `len` are zero, but `truncate` left stale bits in the final partial byte — so `truncate(n)` then `append_false()` read back as true. `truncate` now clears the vacated bits, restoring the invariant `append_false`/`append_buffer` rely on (keeps the ~2.3x single-bit append win). - vortex-cuda's flat reader had the same `mask.density()`-after-move pattern the flat/zoned readers were fixed for; capture density before consuming the mask. Simplification / common patterns: - Replaced three hand-rolled unsafe u64-chunk loops (`bitwise_unary_op_copy`, `bitwise_unary_op_mut`, owned binary in-place) with safe shared helpers `read_u64_le` / `map_u64_words_in_place` / `zip_u64_words_in_place` built on `chunks_exact` + `u64::{from,to}_le_bytes`. No `unsafe`, portable to big-endian (the raw `read_unaligned::<u64>` was a latent BE bug), and the owned in-place AND stays 3.7-4.7x faster than allocating at 16K-64K bits. Dropped optimizations that benchmarks showed don't pay off: - `FromIterator` 64-bit tail batching: never exercised by exact-size iterators (the common case), measured ~1.0x, ~60 lines of intricate unsafe. Reverted to a simple `append` loop now that `append` is fast. Other notes (no code change here): - The filter and `from_indices` optimizations from the original branch are superseded by develop's PEXT filter and u64-word `from_indices`; taken from develop during the rebase. - `iter_bits`' u64-chunk rewrite is kept (cleaner/safer than the byte-by-byte original) but it has no non-test callers and does not feed `iter()`. Added a `bitand_owned_lhs_vortex_buffer` bench so the owned in-place path has ongoing coverage. Signed-off-by: Robert Kruszewski <dzobert@gmail.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Signed-off-by: Robert Kruszewski <github@robertk.io>

codspeed-hq · 2026-05-29T17:10:46Z

Merging this PR will improve performance by 26.26%

⚠️

Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚡ 21 improved benchmarks
❌ 4 regressed benchmarks
✅ 1250 untouched benchmarks
🆕 5 new benchmarks

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

	Mode	Benchmark	`BASE`	`HEAD`	Efficiency
❌	Simulation	`bitwise_not_vortex_buffer_mut[65536]`	4.3 µs	6.2 µs	-29.96%
❌	Simulation	`bitwise_not_vortex_buffer_mut[16384]`	1.3 µs	1.7 µs	-25.03%
❌	Simulation	`bitwise_not_vortex_buffer[65536]`	12.5 µs	16.6 µs	-24.88%
❌	Simulation	`varbinview_zip_block_mask`	2.9 ms	3.4 ms	-13.92%
⚡	Simulation	`append_vortex_buffer[65536]`	923.4 µs	468.2 µs	+97.24%
⚡	Simulation	`append_vortex_buffer[16384]`	231.1 µs	117.2 µs	+97.21%
⚡	Simulation	`append_vortex_buffer[2048]`	29.2 µs	14.8 µs	+96.69%
⚡	Simulation	`append_vortex_buffer[1024]`	14.7 µs	7.5 µs	+96.14%
⚡	Simulation	`append_vortex_buffer[128]`	2.1 µs	1.1 µs	+88.01%
⚡	Simulation	`append_buffer_vortex_buffer[2048]`	17.3 µs	10.6 µs	+63.02%
⚡	Simulation	`chunked_opt_bool_canonical_into[(1000, 10)]`	54.2 µs	35 µs	+54.68%
⚡	Simulation	`chunked_bool_canonical_into[(1000, 10)]`	30.2 µs	20.2 µs	+49.75%
⚡	Simulation	`chunked_opt_bool_into_canonical[(1000, 10)]`	68.9 µs	49.6 µs	+38.84%
⚡	Simulation	`true_count_vortex_buffer[128]`	707.8 ns	521.7 ns	+35.68%
⚡	Simulation	`bitwise_not_vortex_buffer_mut[128]`	246.1 ns	185.6 ns	+32.63%
⚡	Simulation	`append_buffer_vortex_buffer[1024]`	15.6 µs	12.4 µs	+25.89%
⚡	Simulation	`bitwise_not_vortex_buffer_mut[1024]`	307.8 ns	245.8 ns	+25.2%
⚡	Simulation	`append_buffer_vortex_buffer[128]`	11.6 µs	10.1 µs	+14.23%
⚡	Simulation	`fsst_compress_string`	13.8 ms	12.3 ms	+12.67%
⚡	Simulation	`compress_fsst[(10000, 4, 4)]`	1.5 ms	1.4 ms	+11.39%
...	...	...	...	...	...

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.

_{Comparing rk/optimise-bit-buffer (94cd970) with develop (e724fe7)}

robert3005 · 2026-06-02T14:00:26Z

+/// Tries to get exclusive access via `try_into_mut`. If the backing storage is shared
+/// (Arc refcount > 1), falls back to [`bitwise_unary_op_copy`].
+#[inline]
+pub(super) fn bitwise_unary_op<F: FnMut(u64) -> u64>(buffer: BitBuffer, op: F) -> BitBuffer {


This went too far, the bitwise_unary_op_copy should be allocating empty buffer and writing to it

Signed-off-by: Robert Kruszewski <github@robertk.io>

claude and others added 6 commits May 29, 2026 16:33

less

c8bc973

Signed-off-by: Robert Kruszewski <github@robertk.io>

robert3005 requested a review from joseph-isaacs June 2, 2026 13:54

robert3005 added the changelog/performance A performance improvement label Jun 2, 2026

robert3005 commented Jun 2, 2026

View reviewed changes

fixes

94cd970

Signed-off-by: Robert Kruszewski <github@robertk.io>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BitBuffer binary operations support in place updates to lhs#8162

BitBuffer binary operations support in place updates to lhs#8162
robert3005 wants to merge 7 commits into
developfrom
rk/optimise-bit-buffer

robert3005 commented May 29, 2026

Uh oh!

codspeed-hq Bot commented May 29, 2026 •

edited

Loading

Uh oh!

robert3005 Jun 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

robert3005 commented May 29, 2026

Uh oh!

codspeed-hq Bot commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Merging this PR will improve performance by 26.26%

Performance Changes

Uh oh!

robert3005 Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

codspeed-hq Bot commented May 29, 2026 •

edited

Loading