Scripts for Billion-Scale Synthetic Data Generation #2247
/ok to test
@jinsolp, there was an error processing your request: See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/
/ok to test 8fcb3b2
📝 Walkthrough (Summary by CodeRabbit)
Changes: Billion-Scale Synthetic Dataset Generator and Jitter Query Support
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 7
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@python/cuvs_bench/cuvs_bench/generate_groundtruth/__main__.py`:
- Around line 116-133: The function `choose_random_queries_with_jitter` always
returns float32 jittered query data (via the return statement calling
`add_jitter` on the float32-casted `sampled` array), but the code that uses this
function's output (around line 389) determines the filename suffix using the
original `dataset.dtype` instead of the actual output dtype. This causes float32
data to be written to filenames with the wrong type suffix (e.g., `.u8bin` for
uint8 inputs), leading to incorrect decoding later. Update the code that
generates the output filename suffix to use float32 as the dtype instead of
`dataset.dtype`, since the jittered queries are always float32 regardless of the
input dataset type.
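A minimal sketch of the idea behind this comment: derive the suffix from the array that is actually written, not from the input dataset. The helper `suffix_for` and the suffix table are hypothetical; only `.u8bin` appears in the review, and the other suffixes are assumed from the common `*bin` naming convention.

```python
import numpy as np

# Hypothetical table: only ".u8bin" is named in the review; the other
# suffixes are assumed and may differ from what cuvs_bench uses.
_SUFFIX_BY_DTYPE = {
    np.dtype(np.float32): ".fbin",
    np.dtype(np.float16): ".f16bin",
    np.dtype(np.uint8): ".u8bin",
    np.dtype(np.int8): ".i8bin",
}


def suffix_for(array):
    """Return the file suffix that matches the dtype of `array`."""
    try:
        return _SUFFIX_BY_DTYPE[array.dtype]
    except KeyError:
        raise ValueError(f"unsupported dtype: {array.dtype}") from None


dataset = np.zeros((4, 2), dtype=np.uint8)
# Stand-in for the jittered queries, which are always float32.
queries = dataset.astype(np.float32) + np.float32(0.01)

dataset_suffix = suffix_for(dataset)   # suffix for the uint8 input
queries_suffix = suffix_for(queries)   # suffix for the float32 output
```

Keying the lookup on the written array means the filename stays correct even if the jitter path later changes its output dtype.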
In `@python/cuvs_bench/cuvs_bench/synthesize_dataset/__main__.py`:
- Around line 343-351: Add centralized validation logic in the main function
after the parser.parse_args(argv) call and before dispatching to the command
handlers (_cmd_fit, _cmd_generate, _cmd_verify). Create a validation function or
inline checks to verify that numeric arguments including total_rows, n_queries,
k, nprobes, sample_size, n_clusters, and pca_components have valid bounds.
Additionally, ensure that nprobes is validated against config.nclusters to
enforce the documented contract. If any argument fails validation, print a clear
error message that includes the argument name, expected valid range, and actual
value provided, then return 1 to exit early. This prevents invalid values from
reaching downstream processing.
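One way such a centralized check could look, as a sketch only. The names `validate_args` and `report`, the lower bound of 1 for every argument, and the rule `nprobes <= number of clusters` are assumptions; the review does not state the actual limits.

```python
import argparse
import sys

# Assumed lower bounds; the real valid ranges are not given in the review.
_MIN_VALUES = {
    "total_rows": 1, "n_queries": 1, "k": 1, "nprobes": 1,
    "sample_size": 1, "n_clusters": 1, "pca_components": 1,
}


def validate_args(args, config_nclusters=None):
    """Return a list of error messages; an empty list means valid."""
    errors = []
    for name, minimum in _MIN_VALUES.items():
        value = getattr(args, name, None)
        if value is None:  # argument not defined for this subcommand
            continue
        if value < minimum:
            errors.append(f"--{name}: expected >= {minimum}, got {value}")
    nprobes = getattr(args, "nprobes", None)
    if (nprobes is not None and config_nclusters is not None
            and nprobes > config_nclusters):
        errors.append(
            f"--nprobes: expected <= {config_nclusters} "
            f"(number of clusters), got {nprobes}"
        )
    return errors


def report(errors):
    """Print each error to stderr and return the process exit code."""
    for message in errors:
        print(f"error: {message}", file=sys.stderr)
    return 1 if errors else 0


bad_args = argparse.Namespace(k=0, nprobes=8, total_rows=100)
exit_code = report(validate_args(bad_args, config_nclusters=4))
```

Collecting all errors before returning lets the user fix every bad argument in one pass instead of one per run.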
In `@python/cuvs_bench/cuvs_bench/synthesize_dataset/_fingerprint.py`:
- Around line 67-71: In the __post_init__ method, before normalizing
cluster_densities, add comprehensive validation to check not only that the sum
is positive but also that all individual density values are non-negative
(greater than or equal to zero), contain no NaN or infinite values, and match
the expected shape or dimensions as documented in the class contract. These
checks should occur after computing the total sum but before performing the
normalization division to prevent silent data corruption in downstream
allocation operations.
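The checks described above could be sketched as follows. `FingerprintSketch` is a hypothetical stand-in, not the PR's Fingerprint class, and the expected shape `(n_clusters,)` is an assumption about the class contract.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class FingerprintSketch:
    """Hypothetical stand-in for the Fingerprint class in the PR."""

    cluster_densities: np.ndarray
    n_clusters: int

    def __post_init__(self):
        d = np.asarray(self.cluster_densities, dtype=np.float64)
        # Assumed contract: one density per cluster.
        if d.shape != (self.n_clusters,):
            raise ValueError(
                f"cluster_densities: expected shape ({self.n_clusters},), "
                f"got {d.shape}"
            )
        if not np.all(np.isfinite(d)):
            raise ValueError("cluster_densities: contains NaN or inf")
        if np.any(d < 0):
            raise ValueError("cluster_densities: contains negative values")
        total = d.sum()
        if total <= 0:
            raise ValueError(f"cluster_densities: expected sum > 0, got {total}")
        # Only normalize once every check has passed.
        self.cluster_densities = d / total


fp = FingerprintSketch(np.array([1.0, 3.0]), n_clusters=2)
```

The finiteness check runs before the sign check because comparisons against NaN are always false and would otherwise let NaN values through.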
In `@python/cuvs_bench/cuvs_bench/synthesize_dataset/_fit.py`:
- Around line 64-75: The function `_fit_cluster_pca` creates GPU memory
allocations with `residuals_gpu` and `out` but doesn't explicitly free them,
causing GPU memory to accumulate when called repeatedly in a loop (once per
cluster). After converting the results to NumPy arrays using `cp.asnumpy()`,
explicitly delete the GPU objects `residuals_gpu` and `out` using the `del`
statement, and optionally call
`cp.cuda.stream.get_current_stream().synchronize()` to ensure GPU memory is
properly released before the function returns.
In `@python/cuvs_bench/cuvs_bench/synthesize_dataset/_generate.py`:
- Around line 219-230: The daemon thread in _flush_async can fail silently if
buf_view.tofile(f) throws an exception (disk full, I/O error), and
_wait_for_write() only calls join() without checking for errors, causing the
code to continue with a potentially corrupt file. Add a nonlocal variable to
capture exceptions from the write thread, wrap the buf_view.tofile(f) call in a
try-except block to catch and store any exception, then modify _wait_for_write()
to check for captured exceptions after joining the thread and re-raise them if
they exist.
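A sketch of the error-propagation pattern this comment asks for. It stores the exception on a small helper object rather than in a `nonlocal` variable, which is a design substitution; `AsyncWriter` and its methods are hypothetical names, not the PR's `_flush_async` and `_wait_for_write`.

```python
import threading


class AsyncWriter:
    """Runs one write at a time in a background thread and surfaces errors."""

    def __init__(self):
        self._thread = None
        self._error = None

    def flush_async(self, write_fn):
        # Wait for (and check) the previous write before starting a new one.
        self.wait()

        def _run():
            try:
                write_fn()
            except Exception as exc:  # e.g. OSError when the disk is full
                self._error = exc

        self._thread = threading.Thread(target=_run, daemon=True)
        self._thread.start()

    def wait(self):
        if self._thread is not None:
            self._thread.join()
            self._thread = None
        if self._error is not None:
            # Clear the stored error so it is raised exactly once.
            error, self._error = self._error, None
            raise error


def failing_write():
    raise OSError("simulated disk-full error")


writer = AsyncWriter()
writer.flush_async(failing_write)
try:
    writer.wait()
    write_failed = False
except OSError:
    write_failed = True
```

Because `join()` completes before `_error` is read, the main thread sees the value written by the worker without needing an extra lock.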
In `@python/cuvs_bench/cuvs_bench/synthesize_dataset/_ground_truth.py`:
- Around line 54-60: The compute_groundtruth_exact function and related
functions (around lines 119-126) lack explicit bounds validation for the
parameters k, total_rows, and nprobes. When k exceeds total_rows or other
invalid values are passed, the functions prefill outputs with sentinel values
(-1) which then propagate to recall computations downstream. Add validation
checks at the beginning of compute_groundtruth_exact and the other affected
function to verify that k is less than or equal to total_rows, total_rows is
positive, and nprobes is within valid bounds. When validation fails, raise clear
exceptions that include both the expected constraints and the actual invalid
values provided, following the pattern of providing actionable error messages
with expected-vs-actual comparisons.
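To illustrate why the bound on k matters, here is a small NumPy-only sketch of a brute-force top-k routine that prefills its output with -1 and rejects invalid k up front. `check_k` and `brute_force_topk` are hypothetical and far simpler than the streaming implementation in the PR.

```python
import numpy as np


def check_k(k, total_rows):
    """Raise with expected-vs-actual details if the bounds are violated."""
    if total_rows <= 0:
        raise ValueError(f"total_rows: expected > 0, got {total_rows}")
    if not 1 <= k <= total_rows:
        raise ValueError(
            f"k: expected 1 <= k <= total_rows ({total_rows}), got {k}"
        )


def brute_force_topk(dataset, queries, k):
    """Exact top-k neighbor indices by squared L2 distance."""
    check_k(k, dataset.shape[0])
    # Prefill with the -1 sentinel; after check_k every slot gets overwritten.
    out = np.full((queries.shape[0], k), -1, dtype=np.int64)
    diffs = queries[:, None, :] - dataset[None, :, :]
    dists = (diffs ** 2).sum(axis=2)
    out[:, :] = np.argsort(dists, axis=1, kind="stable")[:, :k]
    return out


dataset = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
queries = np.array([[0.9, 0.0]])
neighbors = brute_force_topk(dataset, queries, k=2)
```

Without the check, a call with k larger than the number of rows would leave some slots at -1, and a recall computation would then compare real indices against sentinels.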
In `@python/cuvs_bench/cuvs_bench/synthesize_dataset/_io.py`:
- Around line 73-93: Add input validation at the start of the load_dataset
function to ensure sample_size is positive when provided (sample_size must be
greater than 0 to prevent silent data corruption from Python's negative
indexing). Additionally, after loading data in each branch (the .npy block, .pkl
block, and memmap_bin_file block), validate that the resulting numpy array is 2D
and numeric before applying sample_size slicing or returning, raising clear and
actionable errors if the data is 1D or of invalid type. This ensures shape and
type errors are caught at the load boundary rather than deferred to downstream
functions like fit_cluster_stats.
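A sketch of the load-boundary checks, written as a standalone helper that each branch of a loader could call after reading its data. `validate_loaded` is a hypothetical name, and taking the first `sample_size` rows is an assumption about how sampling is applied.

```python
import numpy as np


def validate_loaded(data, sample_size=None):
    """Check shape and dtype, then return the (optionally sliced) array."""
    if sample_size is not None and sample_size <= 0:
        # A negative value would otherwise slice from the end, e.g. data[:-1].
        raise ValueError(f"sample_size: expected > 0, got {sample_size}")
    arr = np.asarray(data)
    if arr.ndim != 2:
        raise ValueError(
            f"dataset: expected a 2D array (rows, dims), got {arr.ndim}D "
            f"with shape {arr.shape}"
        )
    if not np.issubdtype(arr.dtype, np.number):
        raise TypeError(f"dataset: expected a numeric dtype, got {arr.dtype}")
    return arr if sample_size is None else arr[:sample_size]


sample = validate_loaded(np.ones((5, 3), dtype=np.float32), sample_size=2)
```

Running the same helper in every branch keeps the error messages identical regardless of the file format the data came from.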
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 06365399-0de0-4fab-a427-5f11f7d266be
⛔ Files ignored due to path filters (4)
- python/cuvs_bench/cuvs_bench/synthesize_dataset/figures/diskann_bigann.png is excluded by !**/*.png
- python/cuvs_bench/cuvs_bench/synthesize_dataset/figures/diskann_falcon.png is excluded by !**/*.png
- python/cuvs_bench/cuvs_bench/synthesize_dataset/figures/diskann_wiki.png is excluded by !**/*.png
- python/cuvs_bench/cuvs_bench/synthesize_dataset/figures/pipeline.png is excluded by !**/*.png
📒 Files selected for processing (11)
- python/cuvs_bench/cuvs_bench/generate_groundtruth/__main__.py
- python/cuvs_bench/cuvs_bench/generate_groundtruth/utils.py
- python/cuvs_bench/cuvs_bench/synthesize_dataset/README.md
- python/cuvs_bench/cuvs_bench/synthesize_dataset/__init__.py
- python/cuvs_bench/cuvs_bench/synthesize_dataset/__main__.py
- python/cuvs_bench/cuvs_bench/synthesize_dataset/_fingerprint.py
- python/cuvs_bench/cuvs_bench/synthesize_dataset/_fit.py
- python/cuvs_bench/cuvs_bench/synthesize_dataset/_generate.py
- python/cuvs_bench/cuvs_bench/synthesize_dataset/_ground_truth.py
- python/cuvs_bench/cuvs_bench/synthesize_dataset/_io.py
- python/cuvs_bench/cuvs_bench/synthesize_dataset/_verify.py
Closes #2208
Adding billion-scale synthetic data generation scripts to cuvs_bench.
New Files in synthesize_dataset/:
- __main__.py: fit/generate/verify CLI.
- _fit.py: KMeans + per-cluster PCA fitting logic.
- _fingerprint.py: Fingerprint class.
- _generate.py: per-cluster data generation logic.
- _ground_truth.py: exact (streaming) and nprobe GT computation.
- _verify.py: compares nprobe GT vs exact GT to validate nprobes.
- _io.py: dataset loading (support for npz and pkl files) + fingerprint NPZ save/load logic.
- README.md + figures/: full workflow guide and synth-vs-real DiskANN validation on Falcon/BigANN/Wiki.
For Reviewers: It would be easier to read through the README.md and review the code for each step in that order (starting with __main__.py).