Scripts for Billion-Scale Synthetic Data Generation by jinsolp · Pull Request #2247 · NVIDIA/cuvs

jinsolp · 2026-06-17T00:38:00Z

Adding billion-scale synthetic data generation scripts to cuvs_bench.

New Files in synthesize_dataset/:

__main__.py: fit/generate/verify CLI.
_fit.py: KMeans + per-cluster PCA fitting logic.
_fingerprint.py: Fingerprint class.
_generate.py: per-cluster data generation logic.
_ground_truth.py: exact (streaming) and nprobe GT computation.
_verify.py: compares nprobe GT vs exact GT to validate nprobes.
_io.py: dataset loading (support for npz and pkl files) + fingerprint NPZ save/load logic.
README.md + figures/: full workflow guide and synth-vs-real DiskANN validation on Falcon/BigANN/Wiki.

For Reviewers: It is probably easiest to read through README.md first and then review the code for each step in that order (starting with __main__.py).

copy-pr-bot · 2026-06-17T00:38:03Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

jinsolp · 2026-06-17T00:38:34Z

/ok to test

copy-pr-bot · 2026-06-17T00:38:37Z

/ok to test

@jinsolp, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

jinsolp · 2026-06-17T00:39:13Z

/ok to test 8fcb3b2

coderabbitai · 2026-06-17T00:48:07Z

📝 Walkthrough

Summary by CodeRabbit

Release Notes

New Features
- Added synthesize_dataset module with CLI tool for generating synthetic vector datasets and ground truth at scale for benchmarking
- Three-command interface: fit (learn dataset fingerprint), generate (create synthetic data/queries/ground truth), and verify (check accuracy)
- Support for random jitter query generation mode
- Ground truth computation in exact and approximate modes

Documentation
- Added comprehensive README with workflow, examples, and API reference for synthetic dataset generation

Walkthrough

Introduces a new cuvs_bench.synthesize_dataset package for generating GPU-accelerated billion-scale synthetic ANN benchmark datasets from fitted cluster fingerprints, including exact and nprobe ground-truth computation, a three-subcommand CLI, and full documentation. Also extends generate_groundtruth with a random-jitter query generation mode backed by new shared utility functions.
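To make the nprobe-versus-exact comparison concrete, here is a minimal sketch of recall between two neighbor-index tables in plain Python. The function name `recall_at_k`, the list-of-lists layout, and the use of -1 as an unfilled-slot marker are assumptions for illustration; this is not the PR's `verify_groundtruth` implementation.

```python
def recall_at_k(approx_ids, exact_ids):
    """Mean fraction of exact neighbors that also appear in the approximate list."""
    if len(approx_ids) != len(exact_ids):
        raise ValueError("both tables must cover the same queries")
    hits = 0
    total = 0
    for approx_row, exact_row in zip(approx_ids, exact_ids):
        # Assumption: negative ids mark unfilled slots and are ignored.
        exact_set = {i for i in exact_row if i >= 0}
        hits += len(exact_set.intersection(approx_row))
        total += len(exact_set)
    return hits / total if total else 0.0
```

A recall close to 1.0 would suggest the chosen nprobes value loses few true neighbors; what threshold counts as acceptable is left to the caller.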

Changes

Billion-Scale Synthetic Dataset Generator and Jitter Query Support

Layer / File(s)	Summary
Shared jitter utilities and generate_groundtruth wiring `python/cuvs_bench/cuvs_bench/generate_groundtruth/utils.py`, `python/cuvs_bench/cuvs_bench/generate_groundtruth/__main__.py`	Adds `is_l2_normalized` and `add_jitter` to `utils.py`, then imports them into `__main__.py` to introduce `choose_random_queries_with_jitter`, extend the `--queries` CLI argument, and branch on the new `random-jitter` mode.
Fingerprint dataclass and package exports `python/cuvs_bench/cuvs_bench/synthesize_dataset/_fingerprint.py`, `python/cuvs_bench/cuvs_bench/synthesize_dataset/__init__.py`	Defines the `Fingerprint` dataclass storing all clustering/PCA fields with `__post_init__` density normalization, and re-exports all public symbols via `__all__` in the package `__init__`.
Fingerprint and dataset I/O `python/cuvs_bench/cuvs_bench/synthesize_dataset/_io.py`	Implements `load_dataset` (multi-format with optional truncation), `save_fingerprint` (NPZ with empty-array diagonal-fallback markers), and `load_fingerprint` (NPZ deserialization reconstructing `None` PCA entries and returning a `Fingerprint`).
Cluster fingerprint fitting pipeline `python/cuvs_bench/cuvs_bench/synthesize_dataset/_fit.py`	Adds `fit_cluster_stats` running GPU KMeans and per-cluster PCA (with diagonal fallback for small clusters), computing per-cluster density, variance, and residual noise variance, backed by `_run_kmeans` and `_fit_cluster_pca` helpers.
GPU synthetic data and query generation `python/cuvs_bench/cuvs_bench/synthesize_dataset/_generate.py`	Adds `gen_cluster_gpu` (PCA-correlated Gaussian sampling with noise, variance rescaling, optional normalization), density-proportional cluster point allocation, in-memory and async double-buffered streaming `.fbin` generation, and jitter-based query sampling.
Ground truth computation and verification `python/cuvs_bench/cuvs_bench/synthesize_dataset/_ground_truth.py`, `python/cuvs_bench/cuvs_bench/synthesize_dataset/_verify.py`	Implements GPU brute-force k-NN helpers, `compute_groundtruth_exact` (streaming per-cluster k-NN merge), `compute_groundtruth_nprobe` (probe-limited batched k-NN), and `verify_groundtruth` (nprobe vs exact recall comparison with timing).
CLI entry point and README `python/cuvs_bench/cuvs_bench/synthesize_dataset/__main__.py`, `python/cuvs_bench/cuvs_bench/synthesize_dataset/README.md`	Defines the `fit`, `generate`, and `verify` subcommand CLI with argument parsing, pipeline dispatch, default nprobes derivation, and output path conventions; adds the full README covering workflow steps, fingerprint schema, YAML registration, pre-fit experiments, caveats, glossary, and Python API list.
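The "density-proportional cluster point allocation" step can be pictured with the sketch below. The PR's actual rounding scheme is not shown in this thread, so the largest-remainder rule and the name `allocate_rows` are assumptions chosen only to illustrate how per-cluster counts can sum exactly to the requested total.

```python
def allocate_rows(densities, total_rows):
    """Split total_rows across clusters in proportion to densities."""
    total = sum(densities)
    if total <= 0:
        raise ValueError("densities must sum to a positive value")
    exact = [d / total * total_rows for d in densities]
    counts = [int(x) for x in exact]  # floor, since every share is >= 0
    leftover = total_rows - sum(counts)
    # Hand the remaining rows to the clusters with the largest fractional parts.
    order = sorted(range(len(exact)), key=lambda i: exact[i] - counts[i], reverse=True)
    for i in order[:max(leftover, 0)]:
        counts[i] += 1
    return counts
```

For example, `allocate_rows([3, 1], 8)` gives `[6, 2]`, and three equal densities with ten rows give counts that differ by at most one.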

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested labels

improvement, non-breaking, benchmarking, doc

Suggested reviewers

jrbourbeau
cjnolet
lowener

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 61.76% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	Title clearly describes the primary change: adding scripts for billion-scale synthetic data generation to cuvs_bench.
Description check	✅ Passed	Description is well-related to the changeset, explaining the new synthesize_dataset modules, files added, and workflow.
Linked Issues check	✅ Passed	The PR successfully implements the objective from issue `#2208` to open-source the BSDG as part of cuvs_bench with complete fit/generate/verify workflow.
Out of Scope Changes check	✅ Passed	All changes are directly aligned with the goal of adding billion-scale synthetic data generation to cuvs_bench; no out-of-scope modifications detected.


coderabbitai

Actionable comments posted: 7

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@python/cuvs_bench/cuvs_bench/generate_groundtruth/__main__.py`:
- Around line 116-133: The function `choose_random_queries_with_jitter` always
returns float32 jittered query data (via the return statement calling
`add_jitter` on the float32-casted `sampled` array), but the code that uses this
function's output (around line 389) determines the filename suffix using the
original `dataset.dtype` instead of the actual output dtype. This causes float32
data to be written to filenames with the wrong type suffix (e.g., `.u8bin` for
uint8 inputs), leading to incorrect decoding later. Update the code that
generates the output filename suffix to use float32 as the dtype instead of
`dataset.dtype`, since the jittered queries are always float32 regardless of the
input dataset type.
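A minimal sketch of the suggested fix, assuming the suffix is picked from a small lookup table. The helper name `query_suffix`, the table `SUFFIX_BY_DTYPE`, and the non-jitter mode name `"random"` are hypothetical; only `random-jitter`, `.u8bin`, and the float32 output come from the comment above.

```python
# Suffixes for the two dtypes discussed here; a real mapping may cover more.
SUFFIX_BY_DTYPE = {"float32": ".fbin", "uint8": ".u8bin"}


def query_suffix(dataset_dtype, queries_mode):
    """Pick the file suffix from the dtype actually written, not the input dtype."""
    # Jittered queries are always float32, whatever the dataset holds.
    out_dtype = "float32" if queries_mode == "random-jitter" else dataset_dtype
    return SUFFIX_BY_DTYPE[out_dtype]
```

With this shape, a uint8 dataset in `random-jitter` mode maps to `.fbin`, so the file name matches the bytes on disk.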

In `@python/cuvs_bench/cuvs_bench/synthesize_dataset/__main__.py`:
- Around line 343-351: Add centralized validation logic in the main function
after the parser.parse_args(argv) call and before dispatching to the command
handlers (_cmd_fit, _cmd_generate, _cmd_verify). Create a validation function or
inline checks to verify that numeric arguments including total_rows, n_queries,
k, nprobes, sample_size, n_clusters, and pca_components have valid bounds.
Additionally, ensure that nprobes is validated against config.nclusters to
enforce the documented contract. If any argument fails validation, print a clear
error message that includes the argument name, expected valid range, and actual
value provided, then return 1 to exit early. This prevents invalid values from
reaching downstream processing.
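One way the centralized check could look is sketched below. The flag spellings, the "must be >= 1" bound, and the stand-in cluster count of 100 are assumptions; the real CLI may name or bound its arguments differently.

```python
import argparse
import sys

# Hypothetical bounds: every listed argument must be a positive integer when set.
POSITIVE_ARGS = ("total_rows", "n_queries", "k", "nprobes",
                 "sample_size", "n_clusters", "pca_components")


def validate_args(args, n_clusters=None):
    """Return a list of error strings; an empty list means the arguments are valid."""
    errors = []
    for name in POSITIVE_ARGS:
        value = getattr(args, name, None)
        if value is not None and value < 1:
            errors.append(f"--{name}: expected >= 1, got {value}")
    nprobes = getattr(args, "nprobes", None)
    if n_clusters is not None and nprobes is not None and nprobes > n_clusters:
        errors.append(f"--nprobes: expected <= {n_clusters} (n_clusters), got {nprobes}")
    return errors


def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--k", type=int, default=10)
    parser.add_argument("--nprobes", type=int, default=None)
    args = parser.parse_args(argv)
    errors = validate_args(args, n_clusters=100)  # 100 is a stand-in value
    for message in errors:
        print(f"error: {message}", file=sys.stderr)
    return 1 if errors else 0
```

Collecting every error before returning lets the user fix all bad flags in one pass rather than one per run.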

In `@python/cuvs_bench/cuvs_bench/synthesize_dataset/_fingerprint.py`:
- Around line 67-71: In the __post_init__ method, before normalizing
cluster_densities, add comprehensive validation to check not only that the sum
is positive but also that all individual density values are non-negative
(greater than or equal to zero), contain no NaN or infinite values, and match
the expected shape or dimensions as documented in the class contract. These
checks should occur after computing the total sum but before performing the
normalization division to prevent silent data corruption in downstream
allocation operations.
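The checks described above might be sketched as follows. `DensityHolder` is a hypothetical stand-in with only two fields; the PR's `Fingerprint` stores more, and its arrays are presumably NumPy arrays rather than Python lists.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class DensityHolder:  # hypothetical stand-in for the PR's Fingerprint
    n_clusters: int
    cluster_densities: List[float]

    def __post_init__(self):
        d = [float(x) for x in self.cluster_densities]
        if len(d) != self.n_clusters:
            raise ValueError(f"expected {self.n_clusters} densities, got {len(d)}")
        if not all(math.isfinite(x) for x in d):
            raise ValueError("cluster_densities contains NaN or infinite values")
        if any(x < 0 for x in d):
            raise ValueError("cluster_densities must be non-negative")
        total = sum(d)
        if total <= 0:
            raise ValueError(f"cluster_densities must sum to > 0, got {total}")
        # Only normalize once every check has passed.
        self.cluster_densities = [x / total for x in d]
```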

In `@python/cuvs_bench/cuvs_bench/synthesize_dataset/_fit.py`:
- Around line 64-75: The function `_fit_cluster_pca` creates GPU memory
allocations with `residuals_gpu` and `out` but doesn't explicitly free them,
causing GPU memory to accumulate when called repeatedly in a loop (once per
cluster). After converting the results to NumPy arrays using `cp.asnumpy()`,
explicitly delete the GPU objects `residuals_gpu` and `out` using the `del`
statement, and optionally call
`cp.cuda.stream.get_current_stream().synchronize()` to ensure GPU memory is
properly released before the function returns.

In `@python/cuvs_bench/cuvs_bench/synthesize_dataset/_generate.py`:
- Around line 219-230: The daemon thread in _flush_async can fail silently if
buf_view.tofile(f) throws an exception (disk full, I/O error), and
_wait_for_write() only calls join() without checking for errors, causing the
code to continue with a potentially corrupt file. Add a nonlocal variable to
capture exceptions from the write thread, wrap the buf_view.tofile(f) call in a
try-except block to catch and store any exception, then modify _wait_for_write()
to check for captured exceptions after joining the thread and re-raise them if
they exist.
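The capture-and-re-raise pattern can be sketched on its own, without the `.fbin` writer around it. This version keeps the stored exception on a small class instead of the `nonlocal` variable the comment proposes; the class name `AsyncWriter` and its methods are illustrative, not the PR's code.

```python
import threading


class AsyncWriter:
    """Run one write at a time on a background thread and surface its errors."""

    def __init__(self):
        self._thread = None
        self._error = None

    def flush_async(self, write_fn):
        self.wait()  # never overlap two writes

        def _run():
            try:
                write_fn()
            except Exception as exc:  # store it; the caller's thread re-raises
                self._error = exc

        self._thread = threading.Thread(target=_run, daemon=True)
        self._thread.start()

    def wait(self):
        if self._thread is not None:
            self._thread.join()
            self._thread = None
        if self._error is not None:
            error, self._error = self._error, None
            raise error
```

Because `flush_async` calls `wait` first, a failed write is reported no later than the next flush, so generation stops before more data is appended to a possibly corrupt file.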

In `@python/cuvs_bench/cuvs_bench/synthesize_dataset/_ground_truth.py`:
- Around line 54-60: The compute_groundtruth_exact function and related
functions (around lines 119-126) lack explicit bounds validation for the
parameters k, total_rows, and nprobes. When k exceeds total_rows or other
invalid values are passed, the functions prefill outputs with sentinel values
(-1) which then propagate to recall computations downstream. Add validation
checks at the beginning of compute_groundtruth_exact and the other affected
function to verify that k is less than or equal to total_rows, total_rows is
positive, and nprobes is within valid bounds. When validation fails, raise clear
exceptions that include both the expected constraints and the actual invalid
values provided, following the pattern of providing actionable error messages
with expected-vs-actual comparisons.
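A function-level guard in the spirit of this comment could look like the sketch below. The name `validate_gt_params` and its signature are hypothetical; the real functions may take these values from other objects.

```python
def validate_gt_params(k, total_rows, nprobes=None, n_clusters=None):
    """Raise ValueError with expected-vs-actual details on invalid inputs."""
    if total_rows < 1:
        raise ValueError(f"total_rows: expected >= 1, got {total_rows}")
    if not 1 <= k <= total_rows:
        raise ValueError(
            f"k: expected 1 <= k <= total_rows ({total_rows}), got {k}"
        )
    if nprobes is not None:
        if nprobes < 1:
            raise ValueError(f"nprobes: expected >= 1, got {nprobes}")
        if n_clusters is not None and nprobes > n_clusters:
            raise ValueError(
                f"nprobes: expected <= n_clusters ({n_clusters}), got {nprobes}"
            )
```

Failing here, before any output buffer is prefilled, keeps -1 sentinels from reaching the recall computation.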

In `@python/cuvs_bench/cuvs_bench/synthesize_dataset/_io.py`:
- Around line 73-93: Add input validation at the start of the load_dataset
function to ensure sample_size is positive when provided (sample_size must be
greater than 0 to prevent silent data corruption from Python's negative
indexing). Additionally, after loading data in each branch (the .npy block, .pkl
block, and memmap_bin_file block), validate that the resulting numpy array is 2D
and numeric before applying sample_size slicing or returning, raising clear and
actionable errors if the data is 1D or of invalid type. This ensures shape and
type errors are caught at the load boundary rather than deferred to downstream
functions like fit_cluster_stats.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 06365399-0de0-4fab-a427-5f11f7d266be

📥 Commits

Reviewing files that changed from the base of the PR and between 9ae6f93 and 8fcb3b2.

⛔ Files ignored due to path filters (4)

python/cuvs_bench/cuvs_bench/synthesize_dataset/figures/diskann_bigann.png is excluded by !**/*.png
python/cuvs_bench/cuvs_bench/synthesize_dataset/figures/diskann_falcon.png is excluded by !**/*.png
python/cuvs_bench/cuvs_bench/synthesize_dataset/figures/diskann_wiki.png is excluded by !**/*.png
python/cuvs_bench/cuvs_bench/synthesize_dataset/figures/pipeline.png is excluded by !**/*.png

📒 Files selected for processing (11)

python/cuvs_bench/cuvs_bench/generate_groundtruth/__main__.py
python/cuvs_bench/cuvs_bench/generate_groundtruth/utils.py
python/cuvs_bench/cuvs_bench/synthesize_dataset/README.md
python/cuvs_bench/cuvs_bench/synthesize_dataset/__init__.py
python/cuvs_bench/cuvs_bench/synthesize_dataset/__main__.py
python/cuvs_bench/cuvs_bench/synthesize_dataset/_fingerprint.py
python/cuvs_bench/cuvs_bench/synthesize_dataset/_fit.py
python/cuvs_bench/cuvs_bench/synthesize_dataset/_generate.py
python/cuvs_bench/cuvs_bench/synthesize_dataset/_ground_truth.py
python/cuvs_bench/cuvs_bench/synthesize_dataset/_io.py
python/cuvs_bench/cuvs_bench/synthesize_dataset/_verify.py

jinsolp added 11 commits May 19, 2026 22:06

synthetic data gen

11a2b46

Merge branch 'rapidsai:main' into billion-scale-data-gen

c2151ee

uint64 header support based on size

def7526

uint64 support in synthesize data

1e31b47

add experiment result sin readme

788e885

resolve merge conflict

ebd00ea

rm norm flag

fd18762

Merge branch 'rapidsai:main' into billion-scale-data-gen

d4bf7a5

cleanup

c44a245

cleanup and rename

4dce8f9

cleanup readme

8fcb3b2

jinsolp self-assigned this Jun 17, 2026

jinsolp requested a review from a team as a code owner June 17, 2026 00:38

jinsolp added the feature request New feature or request label Jun 17, 2026

github-project-automation Bot added this to Unstructured Data Processing Jun 17, 2026

jinsolp added the non-breaking Introduces a non-breaking change label Jun 17, 2026

coderabbitai Bot reviewed Jun 17, 2026

jinsolp added 3 commits June 25, 2026 20:40

norm quantiles

bbbcfe5

Merge branch 'main' into billion-scale-data-gen

854d3d6

coderabbit reviews

d144ecc
