perf: buffer accumulation in BatchMessage.send_body() (1.6-1.8x speedup, us improvement, depends on PR #790) #791

Draft
mykaul wants to merge 3 commits into scylladb:master from mykaul:perf/buffer-accum-batch-message

@mykaul mykaul commented Apr 4, 2026

Summary

Replace per-call write_value()/write_byte()/write_short() in BatchMessage.send_body() with buffer accumulation (list.append + b"".join + single f.write()), reducing f.write() calls from Q*(4 + 2*P) + footer to 1 for Q queries with P params each.
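
The write-count reduction described above can be illustrated with a minimal sketch. This is not the driver's code: `CountingWriter`, `send_per_call`, and `send_buffered` are hypothetical names, the framing is simplified to prepared queries only, and the batch header and footer are omitted. Under those assumptions, the per-call path issues 4 + 2*P writes per query, while the buffered path issues one write in total and produces the same bytes.

```python
import io
import struct

# Big-endian packers (the CQL native protocol uses network byte order).
int32_pack = struct.Struct(">i").pack
uint16_pack = struct.Struct(">H").pack
uint8_pack = struct.Struct(">B").pack


class CountingWriter(io.BytesIO):
    """BytesIO that counts write() calls (hypothetical helper for this sketch)."""

    def __init__(self):
        super().__init__()
        self.write_calls = 0

    def write(self, data):
        self.write_calls += 1
        return super().write(data)


def send_per_call(f, queries):
    """Simplified stand-in for the old path: one f.write() per field."""
    for query_id, params in queries:
        f.write(uint8_pack(1))               # kind byte (1 = prepared)
        f.write(uint16_pack(len(query_id)))  # query id length
        f.write(query_id)                    # query id bytes
        f.write(uint16_pack(len(params)))    # parameter count
        for p in params:
            f.write(int32_pack(len(p)))      # value length
            f.write(p)                       # value bytes


def send_buffered(f, queries):
    """Same bytes, accumulated in a list and flushed with a single f.write()."""
    parts = []
    append = parts.append  # bind once to avoid an attribute lookup per field
    for query_id, params in queries:
        append(uint8_pack(1))
        append(uint16_pack(len(query_id)))
        append(query_id)
        append(uint16_pack(len(params)))
        for p in params:
            append(int32_pack(len(p)))
            append(p)
    f.write(b"".join(parts))


queries = [(b"\x00" * 16, [b"a" * 8, b"b" * 8])] * 10  # Q = 10, P = 2
old, new = CountingWriter(), CountingWriter()
send_per_call(old, queries)
send_buffered(new, queries)
print(old.write_calls, new.write_calls)  # 80 1, i.e. Q*(4 + 2*P) vs 1
```

Both writers end up holding identical bytes, so the change affects only how many times `f.write()` is invoked, not what is sent.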

Depends on PR #790 (perf/buffer-accum-write-params).

What changed

cassandra/protocol.py

BatchMessage.send_body() -- Full buffer accumulation for the entire message: batch header, per-query framing (prepared/unprepared), all parameters (with NULL/UNSET/str handling), and trailer.

Pre-computed constants -- _INT32_NULL and _INT32_UNSET as module-level constants to avoid repeated int32_pack() calls in the hot loop.

tests/unit/test_protocol.py

Added 7 new batch-specific test methods: prepared queries, unprepared queries, mixed, empty batch, many queries (50), NULL/UNSET params, and vector params.

Benchmark

Measured with min() of timeit.repeat(repeat=7, number=50_000) on a quiet machine (load <3), Cython .so compiled, before/after rebuild on same machine.
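
For readers who want to reproduce the measurement style, here is a rough sketch of the min-of-repeats approach using the standard `timeit` module. The helper names `ns_per_call` and `speedup` are made up for this example, and the timed lambda is a placeholder workload, not the PR's benchmark script.

```python
import timeit


def ns_per_call(func, repeat=7, number=50_000):
    """Best-of-`repeat` time per call, in nanoseconds."""
    # timeit.repeat returns one total time in seconds per repeat;
    # taking min() keeps the run least disturbed by other system activity.
    best_total = min(timeit.repeat(func, repeat=repeat, number=number))
    return best_total / number * 1e9


def speedup(baseline_ns, new_ns):
    """Ratio of baseline time to new time."""
    return baseline_ns / new_ns


parts = [b"x" * 8] * 40  # placeholder workload: join 40 small chunks
print(f"{ns_per_call(lambda: b''.join(parts)):.0f} ns/call")
print(f"{speedup(7699, 4866):.2f}x")  # 1.58x, using the first row of the results
```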

| Scenario | Baseline (ns/call) | Buffer accum (ns/call) | Speedup |
|---|---|---|---|
| 10 queries x 2 params (128D vec) | 7699 | 4866 | 1.58x |
| 10 queries x 10 params (text) | 18976 | 10490 | 1.81x |
| 50 queries x 2 params (128D vec) | 34492 | 20608 | 1.67x |
| 50 queries x 10 params (text) | 83815 | 48338 | 1.73x |

Consistent 1.6-1.8x speedup across all batch scenarios. Larger batches with more params see the greatest absolute savings (over 35 µs saved per call for 50 queries x 10 params).

Tests

@mykaul mykaul marked this pull request as draft April 4, 2026 17:01
@mykaul mykaul force-pushed the perf/buffer-accum-batch-message branch 2 times, most recently from 5eec7ed to b1d1cd0 Compare April 5, 2026 17:29
@mykaul mykaul changed the title from "perf: buffer accumulation in BatchMessage.send_body()" to "perf: buffer accumulation in BatchMessage.send_body() (2x speedup, us improvement, depends on PR #790)" Apr 7, 2026
mykaul added 3 commits April 7, 2026 11:28
Replace the per-parameter write_value(f, param) loop in
_QueryMessage._write_query_params() with a buffer accumulation approach:
list.append + b"".join + single f.write().

This reduces the number of f.write() calls from 2*N+1 to 1, which is
significant for vector workloads with large parameters.

Also removes the redundant ExecuteMessage._write_query_params()
pass-through override to avoid extra MRO lookup per call.

Includes 14 unit tests covering normal, NULL, UNSET, empty, large vector,
and mixed parameter scenarios for both ExecuteMessage and QueryMessage.

Includes a benchmark script (benchmarks/bench_execute_write_params.py).

Replace per-write_value()/write_byte()/write_short() calls in
BatchMessage.send_body() with buffer accumulation (list.append +
b"".join + single f.write()), reducing f.write() calls from
Q*(4 + 2*P) + footer to 1 for Q queries with P params each.

Benchmark results (Python 3.14, Cython .so, 50K iters, best of 3,
quiet machine):

  Scenario                              Before    After    Speedup
  10 queries x 2 params (128D vec)      8364 ns   4475 ns  1.87x
  10 queries x 2 params (768D vec)      8081 ns   5516 ns  1.47x
  50 queries x 2 params (128D vec)     32368 ns  16271 ns  1.99x
  10 queries x 10 text params          19138 ns   9051 ns  2.11x
  50 queries x 10 text params          86845 ns  40020 ns  2.17x
  10 unprepared x 2 params              8666 ns   4252 ns  2.04x

Also updates test_batch_message_with_keyspace to use BytesIO for
byte-level verification (compatible with single-write output).

Adds 7 batch-specific unit tests covering prepared, unprepared, mixed,
empty, many-query, NULL/UNSET, and vector parameter scenarios.

Includes benchmark script benchmarks/bench_batch_send_body.py.

Replace per-call int32_pack(-1) and int32_pack(-2) with module-level
_INT32_NEG1 and _INT32_NEG2 constants. Avoids redundant struct packing
on every null or unset parameter in the inner write_value loop.

Benchmark: ~11% speedup on the parameter serialization loop for a
typical 12-param mix of values, nulls, and unsets.
@mykaul mykaul force-pushed the perf/buffer-accum-batch-message branch from b1d1cd0 to 62f91eb Compare April 7, 2026 08:29
@mykaul mykaul changed the title from "perf: buffer accumulation in BatchMessage.send_body() (2x speedup, us improvement, depends on PR #790)" to "perf: buffer accumulation in BatchMessage.send_body() (1.6-1.8x speedup, us improvement, depends on PR #790)" Apr 12, 2026