Skip to content

BitBuffer binary operations support in place updates to lhs#8162

Open
robert3005 wants to merge 7 commits into
developfrom
rk/optimise-bit-buffer
Open

BitBuffer binary operations support in place updates to lhs#8162
robert3005 wants to merge 7 commits into
developfrom
rk/optimise-bit-buffer

Conversation

@robert3005
Copy link
Copy Markdown
Contributor

Revive #7375

This speeds up And/Or/Xor - 50% to 4x time
Append - 2x (by shifting the work elsewhere in the code)
append_buffer - 10-50%

claude and others added 6 commits May 29, 2026 16:33
Key optimizations:
- `Not for &BitBuffer`: allocate directly instead of clone+try_into_mut
  which always failed (clone shares Arc, so 2 refs = failure)
  → 27-58% faster bitwise NOT on references
- `iter_bits()`: process u64 chunks instead of byte-by-byte
  → 13% faster iteration
- `PartialEq`: fast path using memcmp for byte-aligned buffers
- `append_buffer()`: memcpy fast path when both sides are byte-aligned
  → 20-52% faster buffer appends
- `append_false()`: remove unnecessary branch (new bytes are zero-init)
  → 65% faster single-bit appends
- `from_indices()`: use set_bit_unchecked directly on the byte slice
- `FromIterator` tail: batch remaining items in u64 words
  → 13% faster from_iter
- `sliced()`: use bitwise_unary_op_copy to avoid clone+fail path,
  fix byte range bug in aligned path
- `filter_bitbuffer_by_indices`: detect contiguous runs for bulk copy
- `filter_bitbuffer_by_slices`: use slice()+append_buffer() instead
  of per-bit get+append
- Add #[inline] to hot methods: set, set_to, append, true_count, etc.

Signed-off-by: Claude <noreply@anthropic.com>

https://claude.ai/code/session_0163XH8LLYAkU2qNQGbmYjhB
BitBuffers almost always have shared backing storage (from slicing,
array construction, etc.), so try_into_mut nearly always fails. The
owned `Not for BitBuffer` now uses the same direct-copy path as the
reference version, and the dead `bitwise_unary_op` function is removed.

For true in-place mutation, use `BitBufferMut` directly.

Signed-off-by: Claude <noreply@anthropic.com>

https://claude.ai/code/session_0163XH8LLYAkU2qNQGbmYjhB
Keep the in-place mutation fast path for owned BitBuffer when the
backing storage has exclusive access (Arc refcount == 1). When
try_into_mut fails (shared storage), delegate to bitwise_unary_op_copy
instead of duplicating the copy logic.

Signed-off-by: Claude <noreply@anthropic.com>

https://claude.ai/code/session_0163XH8LLYAkU2qNQGbmYjhB
The scan loop pattern `mask = mask.bitand(&conjunct_mask)` previously
always allocated a new buffer because all binary ops took references.
Now owned-left operands try try_into_mut for zero-allocation in-place
mutation when the backing storage has exclusive access.

Changes:
- Add bitwise_binary_op_lhs_owned in ops.rs: tries in-place on owned
  left, falls back to allocating bitwise_binary_op
- Wire BitAnd/BitOr/BitXor owned-left impls to use in-place path
- Add BitBuffer::into_bitand_not for owned variant
- Add BitAnd<&Mask>/BitOr<&Mask> for Mask: extracts owned BitBuffer
  for in-place binary ops in the scan loop
- Update Mask::bitand_not to use owned BitBuffer path
- Fix flat/zoned readers to capture density before consuming mask

Signed-off-by: Claude <noreply@anthropic.com>

https://claude.ai/code/session_0163XH8LLYAkU2qNQGbmYjhB
…no-op micro-opts

Rebased onto develop, then verified the optimizations with benchmarks and
hardened the ones that pay off.

Correctness fixes (each with a regression test added first):
- `bitwise_binary_op_lhs_owned`: the in-place path combined operands
  byte-for-byte, ignoring bit offsets, so `owned & &rhs` corrupted results when
  `left.offset() != right.offset()` (reachable via sliced masks in the scan
  loop). Now falls back to the offset-aware `bitwise_binary_op` unless the
  offsets match. Added an rstest comparing the owned path against the reference
  path across every offset/length/op combination.
- `append_false` dropped its read-modify-write on the assumption that bits past
  `len` are zero, but `truncate` left stale bits in the final partial byte —
  so `truncate(n)` then `append_false()` read back as true. `truncate` now
  clears the vacated bits, restoring the invariant `append_false`/`append_buffer`
  rely on (keeps the ~2.3x single-bit append win).
- vortex-cuda's flat reader had the same `mask.density()`-after-move pattern the
  flat/zoned readers were fixed for; capture density before consuming the mask.

Simplification / common patterns:
- Replaced three hand-rolled unsafe u64-chunk loops (`bitwise_unary_op_copy`,
  `bitwise_unary_op_mut`, owned binary in-place) with safe shared helpers
  `read_u64_le` / `map_u64_words_in_place` / `zip_u64_words_in_place` built on
  `chunks_exact` + `u64::{from,to}_le_bytes`. No `unsafe`, portable to
  big-endian (the raw `read_unaligned::<u64>` was a latent BE bug), and the
  owned in-place AND stays 3.7-4.7x faster than allocating at 16K-64K bits.

Dropped optimizations that benchmarks showed don't pay off:
- `FromIterator` 64-bit tail batching: never exercised by exact-size iterators
  (the common case), measured ~1.0x, ~60 lines of intricate unsafe. Reverted to
  a simple `append` loop now that `append` is fast.

Other notes (no code change here):
- The filter and `from_indices` optimizations from the original branch are
  superseded by develop's PEXT filter and u64-word `from_indices`; taken from
  develop during the rebase.
- `iter_bits`' u64-chunk rewrite is kept (cleaner/safer than the byte-by-byte
  original) but it has no non-test callers and does not feed `iter()`.

Added a `bitand_owned_lhs_vortex_buffer` bench so the owned in-place path has
ongoing coverage.

Signed-off-by: Robert Kruszewski <dzobert@gmail.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Robert Kruszewski <github@robertk.io>
@codspeed-hq
Copy link
Copy Markdown

codspeed-hq Bot commented May 29, 2026

Merging this PR will improve performance by 26.26%

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚡ 21 improved benchmarks
❌ 4 regressed benchmarks
✅ 1250 untouched benchmarks
🆕 5 new benchmarks

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

Mode Benchmark BASE HEAD Efficiency
Simulation bitwise_not_vortex_buffer_mut[65536] 4.3 µs 6.2 µs -29.96%
Simulation bitwise_not_vortex_buffer_mut[16384] 1.3 µs 1.7 µs -25.03%
Simulation bitwise_not_vortex_buffer[65536] 12.5 µs 16.6 µs -24.88%
Simulation varbinview_zip_block_mask 2.9 ms 3.4 ms -13.92%
Simulation append_vortex_buffer[65536] 923.4 µs 468.2 µs +97.24%
Simulation append_vortex_buffer[16384] 231.1 µs 117.2 µs +97.21%
Simulation append_vortex_buffer[2048] 29.2 µs 14.8 µs +96.69%
Simulation append_vortex_buffer[1024] 14.7 µs 7.5 µs +96.14%
Simulation append_vortex_buffer[128] 2.1 µs 1.1 µs +88.01%
Simulation append_buffer_vortex_buffer[2048] 17.3 µs 10.6 µs +63.02%
Simulation chunked_opt_bool_canonical_into[(1000, 10)] 54.2 µs 35 µs +54.68%
Simulation chunked_bool_canonical_into[(1000, 10)] 30.2 µs 20.2 µs +49.75%
Simulation chunked_opt_bool_into_canonical[(1000, 10)] 68.9 µs 49.6 µs +38.84%
Simulation true_count_vortex_buffer[128] 707.8 ns 521.7 ns +35.68%
Simulation bitwise_not_vortex_buffer_mut[128] 246.1 ns 185.6 ns +32.63%
Simulation append_buffer_vortex_buffer[1024] 15.6 µs 12.4 µs +25.89%
Simulation bitwise_not_vortex_buffer_mut[1024] 307.8 ns 245.8 ns +25.2%
Simulation append_buffer_vortex_buffer[128] 11.6 µs 10.1 µs +14.23%
Simulation fsst_compress_string 13.8 ms 12.3 ms +12.67%
Simulation compress_fsst[(10000, 4, 4)] 1.5 ms 1.4 ms +11.39%
... ... ... ... ... ...

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.


Comparing rk/optimise-bit-buffer (94cd970) with develop (e724fe7)

Open in CodSpeed

@robert3005 robert3005 requested a review from joseph-isaacs June 2, 2026 13:54
@robert3005 robert3005 added the changelog/performance A performance improvement label Jun 2, 2026
/// Tries to get exclusive access via `try_into_mut`. If the backing storage is shared
/// (Arc refcount > 1), falls back to [`bitwise_unary_op_copy`].
#[inline]
pub(super) fn bitwise_unary_op<F: FnMut(u64) -> u64>(buffer: BitBuffer, op: F) -> BitBuffer {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This went too far, the bitwise_unary_op_copy should be allocating empty buffer and writing to it

Signed-off-by: Robert Kruszewski <github@robertk.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

changelog/performance A performance improvement

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants