Skip to content
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
76 commits
Select commit Hold shift + click to select a range
d67ae07
Add support for horizontal/verstical vectorization parameter
sleeepyjack Sep 9, 2025
cb7a78d
Restructure policies
sleeepyjack Sep 9, 2025
21ff88c
Fix indexing bug
sleeepyjack Sep 10, 2025
41e217a
Coalesced output write
sleeepyjack Sep 10, 2025
2b8ecde
Add unit test for adaptive contains kernel
sleeepyjack Sep 10, 2025
c322825
Add parametric filter policy (dummy)
sleeepyjack Sep 10, 2025
17f9c19
Merge remote-tracking branch 'upstream' into exp-filter-policy
sleeepyjack Sep 10, 2025
97693a3
Multiplicative hashing implemented in policy. Some changes needed to …
kevkrist Sep 11, 2025
4285b39
Finalized proposed policy interface.
kevkrist Sep 11, 2025
0934cba
Fixed a mistake in thread_dispatch(). Removed some dead static variab…
kevkrist Sep 16, 2025
cf43c8f
Multiplicative hashing calling code infrastructure.
kevkrist Sep 17, 2025
52d7f17
New example script for sanity checking. Still need to connect the hos…
kevkrist Sep 18, 2025
91714b0
host and device APIs are connected for multiplicative hashing, but co…
kevkrist Sep 18, 2025
fa7d9a7
Debugging done. End-to-end filters working properly.
kevkrist Sep 19, 2025
13918c2
Tests updated.
kevkrist Sep 19, 2025
7008773
Good performance agains arrow FP when early exit is turned off. Will …
kevkrist Sep 19, 2025
9252f69
Updated bloom filter nvbench script.
kevkrist Sep 22, 2025
12b4847
Changing exp kernels from if to while for grid-striding.
kevkrist Sep 22, 2025
d3fcce2
Bug fix in filter size in PFP_EVALUATION_EXAMPLE
kevkrist Sep 23, 2025
a30d1b1
Bug fix in while loop in exp kernels.
kevkrist Sep 23, 2025
18e9c34
Small PR review fixes.
kevkrist Sep 23, 2025
a018896
group-cooperative parametric filter policy code paths implemented.
kevkrist Sep 23, 2025
8020b72
Benchmark scripts updated.
kevkrist Sep 23, 2025
b655183
Notebook with theoretical FPR calculators.
kevkrist Sep 25, 2025
e558f1c
Remove static checks on hash result type that are blocking NVBench.
kevkrist Sep 25, 2025
c83912c
Enum type lists for the add benchmark added.
kevkrist Sep 25, 2025
46cf45f
Added salt generation script. Updated the total number of salts to 64.
kevkrist Sep 25, 2025
892e4a9
Updated block index selection in PFP to match Arrow policy.
kevkrist Sep 26, 2025
e9f8ac9
Merge remote-tracking branch 'upstream' into exp-filter-policy
sleeepyjack Sep 28, 2025
1a4b5e0
Enable magic modulo
sleeepyjack Sep 28, 2025
8c0f49a
Enable warp-cooperative kernels
sleeepyjack Sep 28, 2025
726735a
Update benchmarks
sleeepyjack Sep 30, 2025
04c48ee
Add experimental Arrow filter policy
sleeepyjack Sep 30, 2025
a177107
Update benchmarks
sleeepyjack Oct 1, 2025
db020aa
Add RTX600 Blackwell benchmark results
sleeepyjack Oct 1, 2025
a815053
Add H200 benchmark results
sleeepyjack Oct 1, 2025
a793644
Add B200 benchmark results
sleeepyjack Oct 1, 2025
70376cd
Notebook for generating filter size sweep plots.
kevkrist Oct 1, 2025
f6ef851
Plot generating script added and plots for the add/contains sweeps ge…
kevkrist Oct 2, 2025
69425d5
Modified group-cooperative hashing.
kevkrist Oct 6, 2025
a38caba
Fixed some bugs in the plotting script. New data/plots for new group-…
kevkrist Oct 6, 2025
f638544
Added warpcore benchmarks. Updated frontier.
kevkrist Oct 7, 2025
b6301e2
Updated frontier plot
kevkrist Oct 7, 2025
252a32b
Updated pfp implementation to switch to conditional atomic inserts wh…
kevkrist Oct 9, 2025
8bc13b0
Cleaning up dead code.
kevkrist Oct 9, 2025
087dfdd
Outsource thirdparty benchmarks
sleeepyjack Oct 14, 2025
8368f9b
Merge branch 'dev' into consolidate
sleeepyjack Oct 14, 2025
b454236
Fix add benchmarks and more merge conflicts
sleeepyjack Oct 14, 2025
3db5153
Fix lazy_discard_iterator
sleeepyjack Nov 11, 2025
ac204eb
Add CTA work stealing option
sleeepyjack Nov 11, 2025
4cea931
Add arch guard for work stealing kernels
sleeepyjack Nov 12, 2025
4de8306
Update benchmarks
sleeepyjack Dec 1, 2025
d0ad05c
Minor fixes
sleeepyjack Dec 1, 2025
81cf164
Enable ninja build generator
sleeepyjack Dec 1, 2025
c676524
Merge remote-tracking branch 'upstream' into consolidate
sleeepyjack Dec 1, 2025
65c727b
cache_sectorized implemented in parametric_filter_policy.
kevkrist Dec 5, 2025
555113f
Started on bloom_filter_imp
kevkrist Dec 5, 2025
25bf1cc
Add implemented for cache-sectorized.
kevkrist Dec 8, 2025
aa5b117
Fix bug in set_bits routine for cache-sectorized.
kevkrist Dec 8, 2025
2cce8c2
contains implemented for cache-sectorized.
kevkrist Dec 8, 2025
00d5bde
contains has bug when horizontal_layout > 1.
kevkrist Dec 8, 2025
5643887
Cache-sectorized working.
kevkrist Dec 9, 2025
4d39069
Turned off use_cub_kernels and work stealing for clearer evaluation.
kevkrist Dec 11, 2025
355b646
Add CSBF benchmarks
sleeepyjack Dec 18, 2025
94bce41
Merge remote-tracking branch 'upstream/dev' into consolidate
sleeepyjack Feb 26, 2026
617cb3f
Merge remote-tracking branch 'upstream/dev' into bloom-filter-release
sleeepyjack Apr 29, 2026
b805999
Drop baggage
sleeepyjack Apr 29, 2026
519970f
WIP but working
sleeepyjack Apr 29, 2026
6d299e0
Tuning struct
sleeepyjack Apr 29, 2026
076a5cf
Docs
sleeepyjack Apr 29, 2026
96e2a5a
Address review comments
sleeepyjack Apr 30, 2026
a999f46
[pre-commit.ci] auto code formatting
pre-commit-ci[bot] Apr 30, 2026
c2dc7ab
Merge branch 'dev' into bloom-filter-release
sleeepyjack Apr 30, 2026
b4c4aa5
Update copyright year
sleeepyjack Apr 30, 2026
2a8de62
Merge branch 'bloom-filter-release' of github.com:sleeepyjack/cuColle…
sleeepyjack Apr 30, 2026
20be4e3
Address Doxygen
sleeepyjack Apr 30, 2026
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 3 additions & 1 deletion benchmarks/benchmark_utils.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -65,7 +65,9 @@ struct lazy_discard {
__device__ void device_dispatch(index_type index, value_type const& value) const
{
// pick some predicate that is always false, but depends on the runtime value
if (threadIdx.x > 2025 + *reinterpret_cast<char const*>(&value)) { *(it + index) = value; }
if (threadIdx.x > 2025 + *reinterpret_cast<char const*>(&value) + static_cast<int>(index)) {
*(it + index) = value;
}
}
__host__ __device__ void operator()(index_type index, value_type const& value) const
{
Expand Down
328 changes: 217 additions & 111 deletions benchmarks/bloom_filter/add_bench.cu

Large diffs are not rendered by default.

387 changes: 262 additions & 125 deletions benchmarks/bloom_filter/contains_bench.cu

Large diffs are not rendered by default.

4 changes: 3 additions & 1 deletion benchmarks/bloom_filter/defaults.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -33,9 +33,11 @@ using BF_WORD = nvbench::uint32_t;
static constexpr auto BF_N = 1'000'000'000;
static constexpr auto BF_SIZE_MB = 2'000;
static constexpr auto BF_WORDS_PER_BLOCK = 8;
static constexpr auto BF_PATTERN_BITS = 8;

auto const BF_SIZE_MB_RANGE_CACHE =
std::vector<nvbench::int64_t>{1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048};
auto const BF_PATTERN_BITS_RANGE = std::vector<nvbench::int64_t>{1, 2, 4, 6, 8, 16};
auto const BF_SIZE_MB_RANGE_FRONTIER_CACHE = std::vector<nvbench::int64_t>{32, 1024};
auto const BF_PATTERN_BITS_RANGE = std::vector<nvbench::int64_t>{1, 2, 4, 6, 8, 16};

} // namespace cuco::benchmark::defaults
28 changes: 21 additions & 7 deletions include/cuco/bloom_filter.cuh
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2024-2025, NVIDIA CORPORATION.
* Copyright (c) 2024-2026, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
Expand Down Expand Up @@ -36,7 +36,13 @@
namespace cuco {

/**
* @brief A GPU-accelerated Blocked Bloom Filter.
* @brief A GPU-accelerated Bloom filter.
*
* Implements the Sectorized Bloom Filter (SBF) and Cache-Sectorized Bloom Filter (CSBF) variants
* from "Optimizing Bloom Filters for Modern GPU Architectures" (arXiv:2512.15595). Fingerprint
* generation is parameterized by `parametric_filter_policy` (see `cuco/bloom_filter_policies.cuh`),
* which exposes the paper's Theta/Phi vectorization layout and optional `GroupsPerBlock`
* cache-sectorization parameters as compile-time template parameters.
*
* The `bloom_filter` supports two types of operations:
* - Host-side "bulk" operations
Expand All @@ -51,6 +57,12 @@ namespace cuco {
* independent add or lookup operations from device code. These operations are accessed through
* non-owning, trivially copyable reference types (or "ref").
*
* @note The default `Policy` (`cuco::default_filter_policy<Key>`) is an alias for a
* `parametric_filter_policy` instantiation with paper-recommended layouts on a 256-bit block. Users
* who need different layouts (smaller/larger blocks, different fingerprint bit counts, CSBF mode)
* can instantiate `cuco::parametric_filter_policy<...>` directly and pass it as the `Policy`
* template argument.
*
* @tparam Key Key type
* @tparam Extent Size type that is used to determine the number of blocks in the filter
* @tparam Scope The scope in which operations will be performed by individual threads
Expand All @@ -61,8 +73,8 @@ namespace cuco {
template <class Key,
class Extent = cuco::extent<std::size_t>,
cuda::thread_scope Scope = cuda::thread_scope_device,
class Policy = cuco::default_filter_policy<cuco::xxhash_64<Key>, std::uint32_t, 8>,
class Allocator = cuco::cuda_allocator<cuda::std::byte>>
class Policy = cuco::default_filter_policy<Key>,
class Allocator = cuco::cuda_allocator<cuda::std::byte>>
class bloom_filter {
public:
/**
Expand Down Expand Up @@ -136,7 +148,7 @@ class bloom_filter {
* @param stream CUDA stream used for device memory operations and kernel launches
*/
__host__ constexpr void clear_async(cuda::stream_ref stream = cuda::stream_ref{
cudaStream_t{nullptr}});
cudaStream_t{nullptr}}) noexcept;

/**
* @brief Adds all keys in the range `[first, last)` to the filter.
Expand Down Expand Up @@ -164,8 +176,10 @@ class bloom_filter {
* @param stream CUDA stream used for device memory operations and kernel launches
*/
template <class InputIt>
__host__ constexpr void add_async(
InputIt first, InputIt last, cuda::stream_ref stream = cuda::stream_ref{cudaStream_t{nullptr}});
__host__ constexpr void add_async(InputIt first,
InputIt last,
cuda::stream_ref stream = cuda::stream_ref{
cudaStream_t{nullptr}}) noexcept;

/**
* @brief Adds keys in the range `[first, last)` if `pred` of the corresponding `stencil` returns
Expand Down
149 changes: 49 additions & 100 deletions include/cuco/bloom_filter_policies.cuh
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2024, NVIDIA CORPORATION.
* Copyright (c) 2024-2026, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
Expand All @@ -16,119 +16,68 @@

#pragma once

#include <cuco/detail/bloom_filter/arrow_filter_policy.cuh>
#include <cuco/detail/bloom_filter/default_filter_policy_impl.cuh>
#include <cuco/detail/bloom_filter/parametric_filter_policy.cuh>
#include <cuco/hash_functions.cuh>

#include <cstdint>

namespace cuco {

/**
* @brief A policy that defines how Arrow Block-Split Bloom Filter generates and stores a key's
* fingerprint.
* @brief Sectorized Bloom filter policy with multiplicative-hashing fingerprint generation.
*
* @tparam Key The type of the values to generate a fingerprint for.
* @tparam XXHash64 Custom (64 bit) XXHash hasher to generate a key's fingerprint.
* By default, cuco::xxhash_64 hasher will be used.
* Implements the Sectorized Bloom Filter (SBF) and Cache-Sectorized Bloom Filter (CSBF) variants
* from "Optimizing Bloom Filters for Modern GPU Architectures" (arXiv:2512.15595). Distributes
* `PatternBits` set bits across `WordsPerBlock` words via compile-time salt-based multiplicative
* hashing.
*
* Requires a 64-bit hash function: the result is split into upper 32 bits (block selection via
* multiply-shift) and lower 32 bits (pattern generation). This is a permanent design requirement.
*
* @note Constructor signature: `parametric_filter_policy(Hash hash = {})`.
*
* @tparam Hash 64-bit hash functor.
* @tparam Word Underlying word type of a filter block.
* @tparam WordsPerBlock Words per filter block.
* @tparam PatternBits Fingerprint bits per key (paper's k).
* @tparam AddHorizontalLayout CG size for add (paper's Theta).
* @tparam AddVerticalLayout Words per thread per add step (paper's Phi).
* @tparam ContainsHorizontalLayout CG size for contains.
* @tparam ContainsVerticalLayout Words per thread per contains step.
* @tparam GroupsPerBlock Cache-sectorization groups (paper's z). Defaults to `WordsPerBlock`
* (non-CSBF).
*/
template <typename Key, template <typename> class XXHash64 = cuco::xxhash_64>
using arrow_filter_policy = detail::arrow_filter_policy<Key, XXHash64>;
template <class Hash,
class Word,
std::uint32_t WordsPerBlock,
std::uint32_t PatternBits,
std::uint32_t AddHorizontalLayout,
std::uint32_t AddVerticalLayout,
std::uint32_t ContainsHorizontalLayout,
std::uint32_t ContainsVerticalLayout,
std::uint32_t GroupsPerBlock = WordsPerBlock>
using parametric_filter_policy = detail::parametric_filter_policy<Hash,
Word,
WordsPerBlock,
PatternBits,
AddHorizontalLayout,
AddVerticalLayout,
ContainsHorizontalLayout,
ContainsVerticalLayout,
GroupsPerBlock>;

/**
* @brief The default policy that defines how a Blocked Bloom Filter generates and stores a key's
* fingerprint.
* @brief Default Bloom filter policy used by `cuco::bloom_filter` when no policy is specified.
*
* @note `Word` type must be an atomically updatable integral type. `WordsPerBlock` must
* be a power-of-two.
* Alias for a `parametric_filter_policy` instantiation with paper-recommended layouts on a 256-bit
* block: 8 x `uint32_t` words, 8 fingerprint bits per key, fully horizontal add (Theta=8) and fully
* vertical contains (Phi=8).
*
* @tparam Hash Hash function used to generate a key's fingerprint
* @tparam Word Underlying word/segment type of a filter block
* @tparam WordsPerBlock Number of words/segments in each block
* @tparam Key The key type to generate a fingerprint for.
* @tparam XXHash64 64-bit XXHash functor template. Defaults to `cuco::xxhash_64`.
*/
template <class Hash, class Word, std::uint32_t WordsPerBlock>
class default_filter_policy {
using impl_type = cuco::detail::default_filter_policy_impl<Hash, Word, WordsPerBlock>;

public:
using hasher = typename impl_type::hasher; ///< Type of the hash function
using hash_argument_type = typename impl_type::hash_argument_type; ///< Hash function input type
using hash_result_type = typename impl_type::hash_result_type; ///< hash function output type
using word_type =
typename impl_type::word_type; ///< Underlying word/segment type of a filter block

static constexpr std::uint32_t words_per_block =
impl_type::words_per_block; ///< Number of words/segments in each filter block

public:
/**
* @brief Constructs the `default_filter_policy` object.
*
* @throws Compile-time error if the specified number of words in a filter block is not a
* power-of-two or is larger than 32. If called from host: throws exception; If called from
* device: Traps the kernel.
*
* @throws If the `hash_result_type` is too narrow to generate the requested number of
* `pattern_bits`. If called from host: throws exception; If called from device: Traps the kernel.
*
* @throws If `pattern_bits` is smaller than the number of words in a filter block or larger than
* the total number of bits in a filter block. If called from host: throws exception; If called
* from device: Traps the kernel.
*
* @param pattern_bits Number of bits in a key's fingerprint
* @param hash Hash function used to generate a key's fingerprint
*/
__host__ __device__ constexpr default_filter_policy(std::uint32_t pattern_bits = words_per_block,
Hash hash = {});

/**
* @brief Generates the hash value for a given key.
*
* @note This function is meant as a customization point and is only used in the internals of the
* `bloom_filter(_ref)` implementation.
*
* @param key The key to hash
*
* @return The hash value of the key
*/
__device__ constexpr hash_result_type hash(hash_argument_type const& key) const;

/**
* @brief Determines the filter block a key is added into.
*
* @note This function is meant as a customization point and is only used in the internals of the
* `bloom_filter(_ref)` implementation.
*
* @tparam Extent Size type that is used to determine the number of blocks in the filter
*
* @param hash Hash value of the key
* @param num_blocks Number of block in the filter
*
* @return The block index for the given key's hash value
*/
template <class Extent>
__device__ constexpr auto block_index(hash_result_type hash, Extent num_blocks) const;

/**
* @brief Determines the fingerprint pattern for a word/segment within the filter block for a
* given key's hash value.
*
* @note This function is meant as a customization point and is only used in the internals of the
* `bloom_filter(_ref)` implementation.
*
* @param hash Hash value of the key
* @param word_index Target word/segment within the filter block
*
* @return The bit pattern for the word/segment in the filter block
*/
__device__ constexpr word_type word_pattern(hash_result_type hash,
std::uint32_t word_index) const;

private:
impl_type impl_; ///< Policy implementation
};
template <class Key, template <typename> class XXHash64 = cuco::xxhash_64>
using default_filter_policy =
parametric_filter_policy<XXHash64<Key>, std::uint32_t, 8, 8, 8, 1, 1, 8>;

} // namespace cuco

#include <cuco/detail/bloom_filter/default_filter_policy.inl>
52 changes: 45 additions & 7 deletions include/cuco/bloom_filter_ref.cuh
Original file line number Diff line number Diff line change
Expand Up @@ -109,7 +109,7 @@ class bloom_filter_ref {
* @param stream CUDA stream used for device memory operations and kernel launches
*/
__host__ constexpr void clear_async(cuda::stream_ref stream = cuda::stream_ref{
cudaStream_t{nullptr}});
cudaStream_t{nullptr}}) noexcept;

/**
* @brief Device function that adds a key to the filter.
Expand All @@ -121,6 +121,17 @@ class bloom_filter_ref {
template <class ProbeKey>
__device__ void add(ProbeKey const& key);

/**
* @brief Device function that adds all keys in the range `[first, last)` to the filter.
*
* @tparam InputIt Device-accessible random access input key iterator
*
* @param first Beginning of the sequence of keys
* @param last End of the sequence of keys
*/
template <class InputIt>
__device__ void add(InputIt first, InputIt last);

/**
* @brief Device function that cooperatively adds a key to the filter.
*
Expand Down Expand Up @@ -178,8 +189,10 @@ class bloom_filter_ref {
* @param stream CUDA stream used for device memory operations and kernel launches
*/
template <class InputIt>
__host__ constexpr void add_async(
InputIt first, InputIt last, cuda::stream_ref stream = cuda::stream_ref{cudaStream_t{nullptr}});
__host__ constexpr void add_async(InputIt first,
InputIt last,
cuda::stream_ref stream = cuda::stream_ref{
cudaStream_t{nullptr}}) noexcept;

/**
* @brief Adds keys in the range `[first, last)` if `pred` of the corresponding `stencil` returns
Expand Down Expand Up @@ -248,6 +261,20 @@ class bloom_filter_ref {
template <class ProbeKey>
[[nodiscard]] __device__ bool contains(ProbeKey const& key) const;

/**
* @brief Device function that tests if all keys in the range `[first, last)` are present in the
* filter.
*
* @tparam InputIt Device-accessible random access input key iterator
* @tparam OutputIt Device-accessible output iterator assignable from `bool`
*
* @param first Beginning of the sequence of keys
* @param last End of the sequence of keys
* @param output_begin Beginning of the sequence of booleans for the presence of each key
*/
template <class InputIt, class OutputIt>
__device__ void contains(InputIt first, InputIt last, OutputIt output_begin) const;

/**
* @brief Device function that tests if a key's fingerprint is present in the filter.
*
Expand All @@ -265,10 +292,21 @@ class bloom_filter_ref {
template <class CG, class ProbeKey>
[[nodiscard]] __device__ bool contains(CG group, ProbeKey const& key) const;

// TODO
// template <class CG, class InputIt, class OutputIt>
// __device__ void contains(CG group, InputIt first, InputIt last, OutputIt output_begin)
// const;
/**
* @brief Device function that tests if all keys in the range `[first, last)` are present in the
* filter.
*
* @tparam CG Cooperative Group type
* @tparam InputIt Device-accessible random access input key iterator
* @tparam OutputIt Device-accessible output iterator assignable from `bool`
*
* @param group The Cooperative Group this operation is executed with
* @param first Beginning of the sequence of keys
* @param last End of the sequence of keys
* @param output_begin Beginning of the sequence of booleans for the presence of each key
*/
template <class CG, class InputIt, class OutputIt>
__device__ void contains(CG group, InputIt first, InputIt last, OutputIt output_begin) const;

/**
* @brief Tests all keys in the range `[first, last)` if their fingerprints are present in the
Expand Down
Loading
Loading