A lock-free single-producer / single-consumer (SPSC) ring buffer written in modern C++, with a benchmark that measures throughput and latency.
The idea is simple: one thread pushes items in, one thread pops them out, and neither ever takes a lock. Coordination happens purely through two atomic counters (head and tail) using acquire/release memory ordering.
- Lock-free Push / Pop — single items, no mutexes, no blocking.
- Batch operations —
PushBatch/PopBatchmove whole chunks at once, which massively reduces the number of atomic operations and cache traffic. They return how many items were actually processed, so partial batches are handled safely. - Power-of-2 sizing — the buffer size must be a power of 2, so wrapping around is a cheap bitmask (
index & mask) instead of a modulo. - Cache-line padding —
headandtailare padded onto separate cache lines to avoid false sharing (see below). - Spin-wait, not sleep — when the buffer is full/empty, threads spin with
_mm_pause()(x86) orstd::this_thread::yield()instead of blocking, keeping latency low. - Safe construction —
newRingBuffer(size)validates that the size is a power of 2 and returnsnullptrotherwise.
The header contains two versions of the same buffer:
RingBuffer—headandtailare each followed by 56 bytes of padding, so each counter sits alone on its own 64-byte cache line.UnpaddedRingBuffer— identical logic, no padding, so both counters share one cache line.
Why it matters: the producer constantly writes tail and the consumer constantly writes head. If both live on the same cache line, every write by one thread invalidates that line in the other thread's core, even though they never touch the same variable. That's false sharing, and the CPUs end up ping-ponging the cache line back and forth.
In this benchmark the unpadded version has comparable average throughput on a single run, but its tail latency is clearly worse — p99.9 spiked to ~1245 us versus ~90–400 us for the padded version. Padding costs 112 bytes and buys you predictability.
| File | What it is |
|---|---|
ringbuffer.hpp |
The ring buffer (padded + unpadded versions) |
benchmark.cpp |
Throughput + latency benchmark |
g++ -O2 -std=c++17 -pthread benchmark.cpp -o benchmark
./benchmark10 million uint64_t items, buffer size 4096, one producer and one consumer thread:
| Mode | Throughput | p99.9 latency |
|---|---|---|
| Single-element Push/Pop (padded) | ~70 M ops/sec | — |
| Batched, chunks of 128 (padded) | ~800–1000 M ops/sec | ~90–400 us |
| Batched, chunks of 128 (unpadded) | ~625 M ops/sec | ~1245 us |
Batching is roughly 10x faster because the atomic head/tail updates are amortized over 128 items instead of paid per item.
Latency per cluster of 10,000 items (batched mode):
p50 (Median) : ~10 us
p95 (Tail) : ~14 us
p99 (Jitter) : ~18 us
p99.9 (Spikes) : ~90-400 us
Buffer size matters too: a sweep from 16 up to 65536 showed tiny buffers choke throughput (~9 M ops/sec at size 16, since threads mostly wait on each other), with the sweet spot around 1024–4096.
A checksum of all popped values is verified at the end (49999995000000), which proves no items were lost, duplicated, or corrupted along the way.
Two bugs found while building the benchmark, both good reminders:
- Trust the return value. The batch loops originally assumed a full chunk was always pushed/popped. When the buffer returned a partial chunk, the code read uninitialized memory and produced impossible throughput numbers. Fix: only process the count the buffer actually reports.
- Save your measurements. Latency percentiles all read 0 because the per-cluster durations were computed but never stored in the vector before sorting. One
push_backfixed it.