Fix GPU variety undercount for kernels larger than 5x5#2800
Open
brendancol wants to merge 2 commits into
Open
Conversation
The CUDA variety kernel used a fixed 25-element cuda.local.array, so it silently capped unique-value counts at 25. A 7x7 all-unique window returned 25 on GPU versus 49 on CPU. Rewrite the kernel to count distinct values without a scratch buffer: for each valid non-NaN cell, scan only the earlier cells in the same window and increment the count when no earlier cell matches. This drops the cap and the register-pressure concern, and matches the CPU implementation for arbitrary kernel sizes. Add test_variety_gpu_large_kernel_parity asserting cupy matches numpy on 7x7 and 9x9 all-unique windows, plus a numpy-only large-kernel test that runs without a GPU.
brendancol
commented
Jun 1, 2026
Contributor
Author
brendancol
left a comment
There was a problem hiding this comment.
PR Review: Fix GPU variety undercount for kernels larger than 5x5
Blockers (must fix before merge)
None.
Suggestions (should fix, not blocking)
None.
Nits (optional improvements)
- focal.py:934-951: the buffer-free scan is correct, but the inner double-break does a few no-op passes. Once the running flat index
pk*kcols+phreachesk*kcols+h, thephloop breaks out of the current row, then every laterpkrow breaks again on its first iteration. No correctness impact. Comparing against a precomputedtarget = k*kcols+hwould exit cleanly, but the current form is fine. - benchmarks/benchmarks/focal.py:36: the focal_stats benchmark runs with default stats, so it never exercises the variety path. Not required for this fix, but a variety case would catch future regressions in this kernel's cost.
What looks good
- The GPU count now matches the CPU
_calc_varietyexactly: each distinct value is counted once, at its earliest occurrence in the window. Verified on a real GPU (cupy == numpy for 7x7 -> 49 and 9x9 -> 81). - NaN handling matches the CPU path (
v != vskip, all-NaN window returns NaN). - Dropping the cuda.local.array removes both the 25-value cap and the register-pressure reason it existed.
- Tests are split so the numpy reference is checked even without a GPU, and the cupy parity assertion is gated by the cuda_and_cupy_available marker.
- Full test_focal.py suite passes (154 tests).
Checklist
- Algorithm matches CPU reference
- All implemented backends produce consistent results (cupy verified on GPU)
- NaN handling is correct
- Edge cases covered (existing all-NaN, single-cell tests still pass)
- Dask chunk boundaries handled (dask+cupy focal test passes)
- No premature materialization or unnecessary copies
- Benchmark exists or is not needed (no variety-specific benchmark; not required)
- README feature matrix updated (n/a, no new function)
- Docstrings present and accurate (internal kernel, comment added)
Break the outer pk loop once pk*kcols reaches the target flat index so the scan stops at the row boundary instead of re-breaking on the first cell of every later row. No behaviour change; verified by the variety tests including the GPU parity cases.
brendancol
commented
Jun 1, 2026
Contributor
Author
brendancol
left a comment
There was a problem hiding this comment.
Follow-up review (after 8710016)
The prior-cell scan now breaks the outer pk loop once pk*kcols reaches the target flat index (focal.py:935-937), so it stops at the row boundary instead of re-breaking on the first cell of each later row. This resolves the only actionable nit from the first pass. No behaviour change.
- Variety tests pass, including the on-GPU cupy/numpy parity cases for 7x7 (49) and 9x9 (81).
- flake8 clean on focal.py.
Remaining item, dismissed with reason:
- Variety-specific benchmark: out of scope for a correctness fix. A benchmark measures cost, not correctness, so it would not have caught the 25-value cap this PR removes. The focal_stats benchmark already exists for cost tracking.
No blockers or suggestions.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #2775
What changed
The CUDA variety kernel used a fixed 25-element
cuda.local.array, so it silently capped unique-value counts at 25. A 7x7 all-unique window returned 25 on GPU versus 49 on CPU. For a correctness module that is wrong._focal_variety_cudato count distinct values without a scratch buffer. For each valid non-NaN cell it scans only the earlier cells in the same window and increments the count when no earlier cell matches. O(window^2) per pixel, nocuda.local.array, so it works for arbitrary kernel sizes and matches the CPU_calc_varietyexactly.Backend coverage
GPU only (cupy / dask+cupy go through this kernel). numpy and dask+numpy were already correct and are unchanged.
Test plan
test_variety_gpu_large_kernel_parity[7]/[9]: cupy matches numpy on 7x7 (49) and 9x9 (81) all-unique windows. Verified on a real GPU.test_variety_large_kernel_numpy[7]/[9]: numpy reference returns 49 and 81; runs without a GPU.test_focal.pysuite: 154 passed.