Fix GPU variety undercount for kernels larger than 5x5 by brendancol · Pull Request #2800 · xarray-contrib/xarray-spatial

brendancol · 2026-06-01T17:44:55Z

What changed

The CUDA variety kernel used a fixed 25-element cuda.local.array, so it silently capped unique-value counts at 25. A 7x7 all-unique window returned 25 on GPU versus 49 on CPU, which is a silent correctness bug.

Rewrote _focal_variety_cuda to count distinct values without a scratch buffer. For each valid non-NaN cell it scans only the earlier cells in the same window and increments the count when no earlier cell matches. The cost is O(n^2) comparisons per pixel in the worst case, where n is the number of cells in the window. With no cuda.local.array, it works for arbitrary kernel sizes and matches the CPU _calc_variety exactly.
The rewrite removes both the 25-value cap and the register-pressure concern the old buffer was sized for.
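
The scan described above can be sketched in plain Python. This is not the kernel from focal.py: `count_distinct_no_buffer` and its list-of-rows input are hypothetical names chosen for illustration, and the sketch assumes a fully in-bounds window (no raster-edge handling, no GPU thread indexing).

```python
import math


def count_distinct_no_buffer(window):
    """Count distinct non-NaN values in a 2D window without a scratch buffer.

    `window` is a list of equal-length rows (an assumption made for this
    sketch). Returns NaN when the window holds no non-NaN value.
    """
    krows = len(window)
    kcols = len(window[0])
    count = 0
    seen_valid = False
    for k in range(krows):
        for h in range(kcols):
            v = window[k][h]
            if v != v:  # NaN is the only value that is not equal to itself
                continue
            seen_valid = True
            target = k * kcols + h  # flat index of the current cell
            is_new = True
            # Compare only against cells that come earlier in the window.
            for idx in range(target):
                if window[idx // kcols][idx % kcols] == v:
                    is_new = False
                    break
            if is_new:
                count += 1
    return float(count) if seen_valid else math.nan
```

Each distinct value is counted once, at its earliest occurrence, so nothing needs to be stored between cells. The trade-off is the quadratic number of comparisons per window.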

Backend coverage

GPU only (cupy / dask+cupy go through this kernel). numpy and dask+numpy were already correct and are unchanged.

Test plan

test_variety_gpu_large_kernel_parity[7] / [9]: cupy matches numpy on 7x7 (49) and 9x9 (81) all-unique windows. Verified on a real GPU.
test_variety_large_kernel_numpy[7] / [9]: numpy reference returns 49 and 81; runs without a GPU.
Full test_focal.py suite: 154 passed.
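
For readers who want to reproduce the expected values without a GPU, here is a minimal numpy-only sketch of an all-unique input and its centre-window count. It is not the test code from this PR; `all_unique_raster` and `center_variety` are hypothetical helpers, and the count uses `np.unique` rather than the library's own reference function.

```python
import numpy as np


def all_unique_raster(size):
    # Every cell holds a different value, so any window over it is all-unique.
    return np.arange(size * size, dtype=np.float64).reshape(size, size)


def center_variety(raster, ksize):
    # Distinct non-NaN values in the ksize x ksize window centred on the raster.
    half = ksize // 2
    r = raster.shape[0] // 2
    c = raster.shape[1] // 2
    window = raster[r - half:r + half + 1, c - half:c + half + 1]
    values = window[~np.isnan(window)]
    return int(np.unique(values).size)


for ksize in (7, 9):
    # Prints 7 49 and then 9 81; both exceed the old 25-value cap.
    print(ksize, center_variety(all_unique_raster(ksize), ksize))
```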

The CUDA variety kernel used a fixed 25-element cuda.local.array, so it silently capped unique-value counts at 25. A 7x7 all-unique window returned 25 on GPU versus 49 on CPU. Rewrite the kernel to count distinct values without a scratch buffer: for each valid non-NaN cell, scan only the earlier cells in the same window and increment the count when no earlier cell matches. This drops the cap and the register-pressure concern, and matches the CPU implementation for arbitrary kernel sizes. Add test_variety_gpu_large_kernel_parity asserting cupy matches numpy on 7x7 and 9x9 all-unique windows, plus a numpy-only large-kernel test that runs without a GPU.

brendancol

PR Review: Fix GPU variety undercount for kernels larger than 5x5

Blockers (must fix before merge)

None.

Suggestions (should fix, not blocking)

None.

Nits (optional improvements)

focal.py:934-951: the buffer-free scan is correct, but the inner double-break does a few no-op passes. Once the running flat index pk*kcols+ph reaches k*kcols+h, the ph loop breaks out of the current row, then every later pk row breaks again on its first iteration. No correctness impact. Comparing against a precomputed target = k*kcols+h would exit cleanly, but the current form is fine.
benchmarks/benchmarks/focal.py:36: the focal_stats benchmark runs with default stats, so it never exercises the variety path. Not required for this fix, but a variety case would catch future regressions in this kernel's cost.
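
To make the first nit concrete, the sketch below counts loop-body entries for a scan that breaks only the inner ph loop, next to a scan over a precomputed flat target. Both functions are hypothetical stand-ins written for this comment, not the code at focal.py:934-951.

```python
def scan_inner_break_only(k, h, krows, kcols):
    """Return (cells_compared, loop_body_entries) when only the ph loop breaks."""
    target = k * kcols + h
    compared = 0
    entries = 0
    for pk in range(krows):
        for ph in range(kcols):
            entries += 1
            if pk * kcols + ph >= target:
                break  # leaves this row only; later rows are still entered
            compared += 1
    return compared, entries


def scan_flat_target(k, h, kcols):
    """Return cells_compared when iterating flat indices below the target."""
    target = k * kcols + h
    compared = 0
    for _ in range(target):
        compared += 1
    return compared


# Centre cell of a 7x7 window: 24 earlier cells to compare against.
print(scan_inner_break_only(3, 3, 7, 7))  # (24, 28)
print(scan_flat_target(3, 3, 7))          # 24
```

The extra entries are the terminating check in the target row plus one immediate break per later row, which is why the cost is negligible and the result is unaffected.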

What looks good

The GPU count now matches the CPU _calc_variety exactly: each distinct value is counted once, at its earliest occurrence in the window. Verified on a real GPU (cupy == numpy for 7x7 -> 49 and 9x9 -> 81).
NaN handling matches the CPU path (v != v skip, all-NaN window returns NaN).
Dropping the cuda.local.array removes both the 25-value cap and the register-pressure reason it existed.
Tests are split so the numpy reference is checked even without a GPU, and the cupy parity assertion is gated by the cuda_and_cupy_available marker.
Full test_focal.py suite passes (154 tests).

Checklist

Algorithm matches CPU reference
All implemented backends produce consistent results (cupy verified on GPU)
NaN handling is correct
Edge cases covered (existing all-NaN, single-cell tests still pass)
Dask chunk boundaries handled (dask+cupy focal test passes)
No premature materialization or unnecessary copies
Benchmark exists or is not needed (no variety-specific benchmark; not required)
README feature matrix updated (n/a, no new function)
Docstrings present and accurate (internal kernel, comment added)

Break the outer pk loop once pk*kcols reaches the target flat index so the scan stops at the row boundary instead of re-breaking on the first cell of every later row. No behaviour change; verified by the variety tests including the GPU parity cases.

brendancol

Follow-up review (after `8710016`)

The prior-cell scan now breaks the outer pk loop once pk*kcols reaches the target flat index (focal.py:935-937), so it stops at the row boundary instead of re-breaking on the first cell of each later row. This resolves the only actionable nit from the first pass. No behaviour change.
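
A minimal sketch of the loop shape described here, assuming the outer check compares pk*kcols against the target flat index. The function name and the counters are hypothetical, and the exact condition at focal.py:935-937 is not reproduced.

```python
def scan_with_outer_break(k, h, krows, kcols):
    """Return (cells_compared, inner_loop_entries) with an outer-row break."""
    target = k * kcols + h
    compared = 0
    entries = 0
    for pk in range(krows):
        if pk * kcols >= target:
            break  # every cell in this row and all later rows is at or past the target
        for ph in range(kcols):
            entries += 1
            if pk * kcols + ph >= target:
                break  # reached the current cell within its own row
            compared += 1
    return compared, entries


# Centre cell of a 7x7 window: same 24 comparisons, no entries into later rows.
print(scan_with_outer_break(3, 3, 7, 7))  # (24, 25)
```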

Variety tests pass, including the on-GPU cupy/numpy parity cases for 7x7 (49) and 9x9 (81).
flake8 clean on focal.py.

Remaining item, dismissed with reason:

Variety-specific benchmark: out of scope for a correctness fix. A benchmark measures cost, not correctness, so it would not have caught the 25-value cap this PR removes. The focal_stats benchmark already exists for cost tracking.

No blockers or suggestions.

github-actions bot added the "performance" label (PR touches performance-sensitive code) on Jun 1, 2026.

brendancol wants to merge 2 commits into main from issue-2775
