Fix GPU variety undercount for kernels larger than 5x5 #2800

Open
brendancol wants to merge 2 commits into main from issue-2775

@brendancol
Contributor

Closes #2775

What changed

The CUDA variety kernel used a fixed 25-element cuda.local.array, so it silently capped unique-value counts at 25. A 7x7 all-unique window returned 25 on GPU versus 49 on CPU. For a correctness module that is wrong.
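
The cap can be reproduced with a small CPU-side sketch. This is not the old kernel source; it only assumes the old code stored each newly seen value in a 25-slot buffer and stopped recording once the buffer was full. The names capped_variety and MAX_UNIQUE are hypothetical.

```python
MAX_UNIQUE = 25  # assumed capacity of the old fixed scratch buffer (5x5 window)


def capped_variety(values, capacity=MAX_UNIQUE):
    """Count distinct values while storing at most `capacity` of them.

    Hypothetical stand-in for a kernel that keeps seen values in a
    fixed-size local array. Once the array is full, new values are dropped.
    """
    seen = [0.0] * capacity  # plays the role of the fixed-size local array
    n = 0
    for v in values:
        if v != v:  # NaN never equals itself, so skip it
            continue
        found = False
        for i in range(n):
            if seen[i] == v:
                found = True
                break
        if not found and n < capacity:
            seen[n] = v
            n += 1
    return n


window_7x7 = [float(i) for i in range(49)]  # 49 distinct values
print(capped_variety(window_7x7))  # 25, the silent undercount
print(len(set(window_7x7)))        # 49, the expected variety
```

Any window with more than 25 distinct values hits the same ceiling, which is why the problem only shows up for kernels larger than 5x5.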

  • Rewrote _focal_variety_cuda to count distinct values without a scratch buffer. For each valid non-NaN cell it scans only the earlier cells in the same window and increments the count when no earlier cell matches. The cost is O(n^2) comparisons per pixel for a window of n cells, with no cuda.local.array, so it works for arbitrary kernel sizes and matches the CPU _calc_variety exactly.
  • The rewrite removes both the 25-value cap and the register-pressure concern the old buffer was sized for.
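
The counting rule described above can be sketched in plain Python. This is an illustration under assumptions, not the kernel in focal.py: it works on one window passed as a list of rows, ignores kernel masks and raster edges, and variety_no_buffer is a hypothetical name.

```python
import math


def variety_no_buffer(window):
    """Count distinct non-NaN values in a window without storing them.

    `window` is a list of rows (krows x kcols) of floats. A cell is counted
    only if no earlier cell (in row-major order) holds the same value, so
    each distinct value is counted once, at its first occurrence.
    """
    krows = len(window)
    kcols = len(window[0])
    count = 0
    any_valid = False
    for k in range(krows):
        for h in range(kcols):
            v = window[k][h]
            if v != v:  # skip NaN cells
                continue
            any_valid = True
            target = k * kcols + h  # flat index of the current cell
            is_new = True
            for flat in range(target):  # scan earlier cells only
                if window[flat // kcols][flat % kcols] == v:
                    is_new = False
                    break
            if is_new:
                count += 1
    # An all-NaN window has no variety to report
    return float(count) if any_valid else math.nan


all_unique_7x7 = [[float(r * 7 + c) for c in range(7)] for r in range(7)]
print(variety_no_buffer(all_unique_7x7))  # 49.0
```

The trade-off is extra comparisons in exchange for no per-thread storage: nothing in the loop depends on a maximum number of distinct values.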

Backend coverage

GPU only (cupy / dask+cupy go through this kernel). numpy and dask+numpy were already correct and are unchanged.

Test plan

  • test_variety_gpu_large_kernel_parity[7] / [9]: cupy matches numpy on 7x7 (49) and 9x9 (81) all-unique windows. Verified on a real GPU.
  • test_variety_large_kernel_numpy[7] / [9]: numpy reference returns 49 and 81; runs without a GPU.
  • Full test_focal.py suite: 154 passed.

The CUDA variety kernel used a fixed 25-element cuda.local.array, so it
silently capped unique-value counts at 25. A 7x7 all-unique window
returned 25 on GPU versus 49 on CPU.

Rewrite the kernel to count distinct values without a scratch buffer:
for each valid non-NaN cell, scan only the earlier cells in the same
window and increment the count when no earlier cell matches. This drops
the cap and the register-pressure concern, and matches the CPU
implementation for arbitrary kernel sizes.

Add test_variety_gpu_large_kernel_parity asserting cupy matches numpy on
7x7 and 9x9 all-unique windows, plus a numpy-only large-kernel test that
runs without a GPU.
github-actions bot added the performance label (PR touches performance-sensitive code) on Jun 1, 2026
Contributor Author

@brendancol brendancol left a comment


PR Review: Fix GPU variety undercount for kernels larger than 5x5

Blockers (must fix before merge)

None.

Suggestions (should fix, not blocking)

None.

Nits (optional improvements)

  • focal.py:934-951: the buffer-free scan is correct, but the inner double-break does a few no-op passes. Once the running flat index pk*kcols+ph reaches k*kcols+h, the ph loop breaks out of the current row, then every later pk row breaks again on its first iteration. No correctness impact. Comparing against a precomputed target = k*kcols+h would exit cleanly, but the current form is fine.
  • benchmarks/benchmarks/focal.py:36: the focal_stats benchmark runs with default stats, so it never exercises the variety path. Not required for this fix, but a variety case would catch future regressions in this kernel's cost.

What looks good

  • The GPU count now matches the CPU _calc_variety exactly: each distinct value is counted once, at its earliest occurrence in the window. Verified on a real GPU (cupy == numpy for 7x7 -> 49 and 9x9 -> 81).
  • NaN handling matches the CPU path (v != v skip, all-NaN window returns NaN).
  • Dropping the cuda.local.array removes both the 25-value cap and the register-pressure reason it existed.
  • Tests are split so the numpy reference is checked even without a GPU, and the cupy parity assertion is gated by the cuda_and_cupy_available marker.
  • Full test_focal.py suite passes (154 tests).

Checklist

  • Algorithm matches CPU reference
  • All implemented backends produce consistent results (cupy verified on GPU)
  • NaN handling is correct
  • Edge cases covered (existing all-NaN, single-cell tests still pass)
  • Dask chunk boundaries handled (dask+cupy focal test passes)
  • No premature materialization or unnecessary copies
  • Benchmark exists or is not needed (no variety-specific benchmark; not required)
  • README feature matrix updated (n/a, no new function)
  • Docstrings present and accurate (internal kernel, comment added)

Break the outer pk loop once pk*kcols reaches the target flat index so
the scan stops at the row boundary instead of re-breaking on the first
cell of every later row. No behaviour change; verified by the variety
tests including the GPU parity cases.
Contributor Author

@brendancol brendancol left a comment


Follow-up review (after 8710016)

The prior-cell scan now breaks the outer pk loop once pk*kcols reaches the target flat index (focal.py:935-937), so it stops at the row boundary instead of re-breaking on the first cell of each later row. This resolves the only actionable nit from the first pass. No behaviour change.
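
The effect of the outer break can be illustrated by counting inner-loop entries for one target cell. Both functions below are hypothetical reconstructions in plain Python, assuming the loop shapes described in this review; they are not copied from focal.py.

```python
def scan_double_break(window, k, h):
    """Prior-cell scan where only the inner loop checks the flat index."""
    krows, kcols = len(window), len(window[0])
    target = k * kcols + h
    v = window[k][h]
    steps = 0
    found = False
    for pk in range(krows):
        for ph in range(kcols):
            steps += 1
            if pk * kcols + ph >= target:
                break  # leaves this row only; later rows re-enter and break
            if window[pk][ph] == v:
                found = True
                break
        if found:
            break
    return (not found), steps


def scan_outer_break(window, k, h):
    """Same scan, but stop as soon as a row starts at or past the target."""
    krows, kcols = len(window), len(window[0])
    target = k * kcols + h
    v = window[k][h]
    steps = 0
    found = False
    for pk in range(krows):
        if pk * kcols >= target:
            break  # no earlier cells remain in this or any later row
        for ph in range(kcols):
            steps += 1
            if pk * kcols + ph >= target:
                break
            if window[pk][ph] == v:
                found = True
                break
        if found:
            break
    return (not found), steps


window = [[float(r * 7 + c) for c in range(7)] for r in range(7)]
print(scan_double_break(window, 3, 3))  # (True, 28)
print(scan_outer_break(window, 3, 3))   # (True, 25)
```

For the centre cell of a 7x7 window the two forms agree on the result and differ only by the three single-step passes over rows 4 to 6, which is consistent with "no behaviour change".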

  • Variety tests pass, including the on-GPU cupy/numpy parity cases for 7x7 (49) and 9x9 (81).
  • flake8 clean on focal.py.

Remaining item, dismissed with reason:

  • Variety-specific benchmark: out of scope for a correctness fix. A benchmark measures cost, not correctness, so it would not have caught the 25-value cap this PR removes. The focal_stats benchmark already exists for cost tracking.

No blockers or suggestions.


Labels

performance PR touches performance-sensitive code

Development

Successfully merging this pull request may close these issues.

GPU variety focal statistic silently undercounts unique values for kernels larger than 5x5

1 participant