Parametric Sectorized Bloom filter policy #808
…bloom_filter_impl.
…t interface to the new APIs.
…mpilation of example is choking on #pragma unroll.
…try with early exit next.
/ok to test 8b1e995
/ok to test eae3049
PointKernel left a comment
The actual code changes are not that big, but the use of work stealing definitely caught my attention.
@sleeepyjack, could you please review all files touched by this PR and make sure the copyright years are updated where necessary?
/**
- * @brief A GPU-accelerated Blocked Bloom Filter.
+ * @brief A GPU-accelerated Bloom filter.
- * @brief A GPU-accelerated Bloom filter.
+ * @brief A GPU-accelerated Bloom Filter.
/**
- * @brief A GPU-accelerated Blocked Bloom Filter.
+ * @brief A GPU-accelerated Bloom filter.
It would be helpful to add a brief section describing the underlying algorithm, along with a reference to the original paper. I noticed the paper is referenced in the policy document, but it doesn't appear to be mentioned here.
Regarding the tuning knobs, I (or rather, Codex) did an ablation study:
Bloom filter tuning sweep summary
I ran a tuning sweep on
Overall recommendation
The only clear default change suggested by this run is:
Everything else should stay at the current default unless we want a very small size-specific horizontal-contains optimization.
Per-knob findings
tl;dr: here is a summary and my suggestion for what we should do with each tuning knob/code path:
What do you think?
Too late. I've already gone through the whole lengthy AI-generated report. 😉
All looks valid to me. Several points:
Benchmark default policy
| Filter size | Dev throughput | PR throughput | Speedup | Dev time | PR time | Dev noise | PR noise |
|---|---|---|---|---|---|---|---|
| 1 MiB | 48.78 G elem/s | 63.68 G elem/s | +30.54% | 20.499 ms | 15.703 ms | 3.99% | 2.65% |
| 16 MiB | 47.73 G elem/s | 60.53 G elem/s | +26.81% | 20.951 ms | 16.522 ms | 4.01% | 4.40% |
| 32 MiB | 47.73 G elem/s | 59.99 G elem/s | +25.69% | 20.952 ms | 16.669 ms | 3.81% | 4.30% |
| 64 MiB | 36.43 G elem/s | 40.64 G elem/s | +11.55% | 27.453 ms | 24.609 ms | 1.03% | 0.17% |
| 128 MiB | 8.82 G elem/s | 8.81 G elem/s | -0.05% | 113.404 ms | 113.460 ms | 0.06% | 0.03% |
| 256 MiB | 5.36 G elem/s | 5.38 G elem/s | +0.38% | 186.582 ms | 185.872 ms | 0.02% | 0.03% |
| 512 MiB | 4.77 G elem/s | 4.77 G elem/s | +0.09% | 209.773 ms | 209.591 ms | 0.01% | 0.01% |
Contains throughput and FPR
Geomean throughput speedup: +17.77%. Arithmetic mean: +18.94%.
| Filter size | Dev throughput | PR throughput | Speedup | Dev FPR | PR FPR | FPR reduction | Dev time | PR time | Dev noise | PR noise |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 MiB | 78.39 G elem/s | 105.98 G elem/s | +35.20% | 0.013022 | 0.001308 | 10.0x | 12.757 ms | 9.436 ms | 4.13% | 2.19% |
| 16 MiB | 73.37 G elem/s | 97.73 G elem/s | +33.19% | 0.029271 | 0.001313 | 22.3x | 13.629 ms | 10.232 ms | 4.93% | 3.53% |
| 32 MiB | 73.23 G elem/s | 97.05 G elem/s | +32.54% | 0.029418 | 0.001315 | 22.4x | 13.656 ms | 10.304 ms | 4.98% | 3.51% |
| 64 MiB | 73.01 G elem/s | 96.07 G elem/s | +31.59% | 0.046045 | 0.001315 | 35.0x | 13.697 ms | 10.409 ms | 5.00% | 3.42% |
| 128 MiB | 38.73 G elem/s | 38.74 G elem/s | +0.03% | 0.060881 | 0.001316 | 46.3x | 25.819 ms | 25.812 ms | 0.05% | 0.06% |
| 256 MiB | 18.81 G elem/s | 18.82 G elem/s | +0.04% | 0.067084 | 0.001316 | 51.0x | 53.167 ms | 53.149 ms | 0.01% | 0.01% |
| 512 MiB | 16.41 G elem/s | 16.41 G elem/s | +0.02% | 0.067727 | 0.001315 | 51.5x | 60.940 ms | 60.930 ms | 0.01% | 0.01% |
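The geomean and arithmetic mean quoted for the contains benchmark can be reproduced from the Speedup column of the table above. The following is a minimal sketch, not part of the PR; the function names are made up for illustration and the percentages are copied from the table rows as printed, so the last digit may differ slightly from values computed on unrounded throughputs.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Speedup column of the contains table, in percent (copied from the rows above).
inline std::vector<double> contains_speedups_pct()
{
  return {35.20, 33.19, 32.54, 31.59, 0.03, 0.04, 0.02};
}

// Arithmetic mean of the percentage speedups.
inline double arithmetic_mean_pct(std::vector<double> const& pct)
{
  double sum = 0.0;
  for (double p : pct) { sum += p; }
  return sum / static_cast<double>(pct.size());
}

// Geometric mean of the throughput ratios (1 + p/100), reported back as a percentage.
inline double geomean_pct(std::vector<double> const& pct)
{
  double log_sum = 0.0;
  for (double p : pct) { log_sum += std::log(1.0 + p / 100.0); }
  return (std::exp(log_sum / static_cast<double>(pct.size())) - 1.0) * 100.0;
}
```

The geomean is taken over throughput ratios rather than over the percentages themselves, which is why it comes out lower than the arithmetic mean when the per-size speedups are as uneven as they are here.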
Note: The large FPR gap comes from a flaw in the old algorithm: block indexing and fingerprint generation reuse the same hash bits. As the filter gets larger, the block index consumes more of those bits, leaving fewer effective bits for the fingerprint, which inflates the FPR. The new algorithm splits those bits, enabling orders-of-magnitude smaller filters at comparable FPR, likely small enough to fit in L2 and run faster.
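To make the bit-reuse effect concrete, here is a small host-only sketch. It is not the cuco implementation: the hash (a splitmix64 finalizer), the two bit layouts, and all names are assumptions chosen for illustration. It counts how many distinct positions the first fingerprint bit can take among keys that land in the same block.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>

// splitmix64 finalizer, used here only as a stand-in 64-bit hash.
inline std::uint64_t mix64(std::uint64_t x)
{
  x += 0x9e3779b97f4a7c15ULL;
  x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
  x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
  return x ^ (x >> 31);
}

struct probe {
  std::uint64_t block;      // which filter block the key maps to
  std::uint32_t first_bit;  // position (0..31) of the first fingerprint bit in a 32-bit word
};

// Hypothetical "shared" layout: block index and fingerprint both read the low hash bits.
// num_blocks is assumed to be a power of two.
inline probe shared_bits(std::uint64_t h, std::uint64_t num_blocks)
{
  return {h & (num_blocks - 1), static_cast<std::uint32_t>(h & 31u)};
}

// Hypothetical "split" layout: block index from the high half, fingerprint from the low half.
inline probe split_bits(std::uint64_t h, std::uint64_t num_blocks)
{
  return {(h >> 32) & (num_blocks - 1), static_cast<std::uint32_t>(h & 31u)};
}

// Counts how many distinct first-bit positions occur among keys that land in block 0.
template <typename Layout>
std::size_t distinct_first_bits(Layout layout, std::uint64_t num_blocks, std::uint64_t num_keys)
{
  std::set<std::uint32_t> seen;
  for (std::uint64_t key = 0; key < num_keys; ++key) {
    probe const p = layout(mix64(key), num_blocks);
    if (p.block == 0) { seen.insert(p.first_bit); }
  }
  return seen.size();
}
```

With 1024 blocks, every key in block 0 of the shared layout has its low ten hash bits fixed, so the first fingerprint bit always lands in the same position, whereas the split layout still uses all 32 positions. Adding blocks fixes more low bits in the shared layout, which matches the FPR growth with filter size seen in the Dev column.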
/ok to test cc6675b
PointKernel left a comment
One small, non-blocking comment. Let's ship it.
// Normalize by size so any policy-provided word_type (uint32_t, uint64_t, unsigned long, ...)
// resolves to a matching overload via the reinterpret_cast in atomic_or().
using atomic_word_type =
  cuda::std::conditional_t<sizeof(word_type) == 8, unsigned long long, unsigned int>;
Is unsigned long long required here because it's the type supported by atomicOr?
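For context on this question: CUDA's atomicOr provides overloads for int, unsigned int, and unsigned long long int, and on LP64 platforms uint64_t is typically unsigned long, which is a distinct type from unsigned long long even though both are 8 bytes wide. The sketch below is a host-only stand-in for the alias in the diff, using std:: instead of cuda::std:: so it compiles without the CUDA toolkit; the alias name atomic_word_t is made up for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Host-only stand-in for the alias in the diff: any 8-byte word maps to
// unsigned long long, anything else maps to unsigned int.
template <typename Word>
using atomic_word_t = std::conditional_t<sizeof(Word) == 8, unsigned long long, unsigned int>;

// The normalized type has the same size as a 4- or 8-byte word, so reinterpreting
// a Word* as an atomic_word_t<Word>* does not change how many bytes are accessed.
static_assert(sizeof(atomic_word_t<std::uint32_t>) == sizeof(std::uint32_t), "4-byte words");
static_assert(sizeof(atomic_word_t<std::uint64_t>) == sizeof(std::uint64_t), "8-byte words");
```

Selecting by size rather than by exact type means a policy can use uint64_t, unsigned long, or unsigned long long without the caller needing a separate overload for each spelling.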
Reworks cuco::bloom_filter around a parametric Sectorized Bloom Filter (SBF), following "Optimizing Bloom Filters for Modern GPU Architectures" (arXiv:2512.15595).
Benchmark results