Add sparse-read primitives: shards_initialized and read_regions#4028

Open
espg wants to merge 6 commits into
zarr-developers:mainfrom
espg:feat/chunk-access-primitives

@espg espg commented Jun 3, 2026

Related to / closes #3929 (first of two PRs)

Summary

Adds two composable, public functions for efficiently reading sparse arrays — arrays where most chunks are empty and resolve to the fill value:

  • zarr.shards_initialized(array, *, strategy="auto") — discover which shards (or chunks, when unsharded) actually exist in the store.
  • zarr.read_regions(array, regions=None, *, concurrency=None) — concurrently read and decode array regions — by default only the populated ones — yielding each (region, data) pair spatially resolved to its location in the array.

Both are available synchronously (zarr.*, zarr.api.synchronous) and asynchronously (zarr.api.asynchronous); the async read_regions is a generator that streams each region as soon as its data is available. Nothing about the existing arr[:] path changes — these are additive.

Motivation

On a sparse array, arr[:] pays a store round-trip + codec call for every chunk, including empty ones. In the issue's 49,152-chunk HEALPix example (~3% populated), ~150 s of the 173 s wall time is spent iterating empty chunks with zero useful I/O.

These primitives let callers touch only the populated chunks, so cost scales with the populated count rather than the total count.

Design

This follows the direction from the discussion in #3929: rather than adding mutable state on the array that changes how __getitem__ behaves, expose plain, composable functions. The sparse read decomposes into two pieces:

  1. Discover the chunks that exist (shards_initialized). Results are reported at the granularity of stored objects (shard keys for sharded arrays, chunk keys otherwise) because that is what physically exists in the store and is what a single list_prefix returns. The strategy= argument takes one of three values:

    • "list" — one store.list_prefix, filtered to this array's shard grid (ignores zarr.json and any other objects sharing the prefix).
    • "probe" — concurrent per-key exists() checks; avoids listing a prefix that may hold many unrelated objects, and is faster when there are few possible keys.
    • "auto" (default) — probe for small grids, list otherwise.
  2. Read + decode those chunks, spatially resolved (read_regions). Keyed on array regions (a tuple of slices) rather than key strings, on the assumption that regions are the more reusable handle. Reads concurrently and yields (region, data) in completion order. For sharded arrays it yields whole shard regions; empty inner chunks within a populated shard are still skipped efficiently by the existing ShardingCodec partial-decode path.
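
The three discovery strategies can be sketched against a toy store. This is a minimal illustration under stated assumptions, not the PR's implementation: `ToyStore`, `grid_keys`, `discover`, the `c/i/j` key layout, and the `PROBE_THRESHOLD` cutoff are all hypothetical, and the cutoff that "auto" actually uses is not stated in this description.

```python
import asyncio
from itertools import product


class ToyStore:
    """Hypothetical in-memory stand-in for a store (not zarr's Store API)."""

    def __init__(self, objects: dict[str, bytes]) -> None:
        self._objects = objects

    async def list_prefix(self, prefix: str) -> list[str]:
        return [k for k in self._objects if k.startswith(prefix)]

    async def exists(self, key: str) -> bool:
        return key in self._objects


def grid_keys(prefix: str, grid_shape: tuple[int, ...]) -> list[str]:
    """Every key the chunk/shard grid could contain, e.g. 'arr/c/0/1'."""
    return [
        prefix + "c/" + "/".join(map(str, coords))
        for coords in product(*(range(n) for n in grid_shape))
    ]


PROBE_THRESHOLD = 16  # assumed cutoff for "auto"; the real value is not given here


async def discover(store, prefix, grid_shape, strategy="auto"):
    candidates = grid_keys(prefix, grid_shape)
    if strategy == "auto":
        strategy = "probe" if len(candidates) <= PROBE_THRESHOLD else "list"
    if strategy == "list":
        # One listing call, then keep only keys that belong to the grid.
        listed = set(await store.list_prefix(prefix))
        return [k for k in candidates if k in listed]
    if strategy == "probe":
        # One existence check per candidate key, issued concurrently.
        found = await asyncio.gather(*(store.exists(k) for k in candidates))
        return [k for k, ok in zip(candidates, found) if ok]
    raise ValueError(f"unknown strategy: {strategy!r}")
```

In this sketch both paths return keys in grid order, so the strategies can be compared directly; metadata documents and unrelated objects under the prefix never appear because only grid keys are candidates.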

The "pack N decoded chunks into one contiguous array" step that arr[:] performs is deliberately not forced here: pipelines that operate per chunk skip it for a further performance win. A pack/read_sparse convenience will follow in a second PR under zarr.experimental.

Implementation notes

  • A single private discovery core (_initialized_shards) returns (coords, key) pairs; shards_initialized projects it to keys and read_regions projects it to regions, so neither has to reverse-parse the other's output. This mirrors the existing _nchunks_initialized / nchunks_initialized and _iter_* core/wrapper pattern in array.py.
  • The pre-existing private _shards_initialized (used by nchunks_initialized / nshards_initialized / info) now delegates to that same core, removing duplicated list_prefix-and-intersect logic and incidentally fixing an O(grid×objects) membership check (list → set).
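
The core/wrapper split and the list-to-set change described above might look roughly like the following. All names here (`initialized_pairs`, `initialized_keys`, `initialized_coords`, `key_for`) are hypothetical stand-ins; the private functions in array.py may be shaped differently.

```python
from itertools import product


def initialized_pairs(grid_shape, listed_keys, key_for):
    """Core: (coords, key) for every grid cell whose key appears in the listing."""
    # A set gives O(1) membership tests; testing against a list would cost
    # O(grid x objects) overall.
    listed = set(listed_keys)
    pairs = []
    for coords in product(*(range(n) for n in grid_shape)):
        key = key_for(coords)
        if key in listed:
            pairs.append((coords, key))
    return pairs


def initialized_keys(grid_shape, listed_keys, key_for):
    """Wrapper: project the core's output to keys."""
    return [key for _, key in initialized_pairs(grid_shape, listed_keys, key_for)]


def initialized_coords(grid_shape, listed_keys, key_for):
    """Wrapper: project the core's output to grid coordinates."""
    return [coords for coords, _ in initialized_pairs(grid_shape, listed_keys, key_for)]
```

Because the core keeps coordinates and keys together, a wrapper that needs regions can derive them from the coordinates without parsing key strings.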

API

import numpy as np
import zarr

# 1. Which shards/chunks actually exist in the store?
keys = zarr.shards_initialized(arr)                  # auto strategy
keys = zarr.shards_initialized(arr, strategy="probe")

# 2. Read only the populated regions, each paired with its location
for region, data in zarr.read_regions(arr):
    ...                                              # region: tuple[slice, ...]

# Reproduce arr[:] without touching empty chunks
out = np.full(arr.shape, arr.fill_value, dtype=arr.dtype)
for region, data in zarr.read_regions(arr):
    out[region] = np.asarray(data)

# Async: stream each region as soon as it is decoded
import zarr.api.asynchronous as za

async def stream_regions(arr):  # hypothetical wrapper; `async for` must run inside a coroutine
    async for region, data in za.read_regions(arr):
        ...

Benchmarks

bench/empty_chunks.py sweeps chunk count at ~3% sparsity, comparing stock arr[:] against read_regions + pack and a per-region stream:

store           n_chunks  populated  arr[:] (s)  pack (s)  stream (s)  pack x  stream x
MemoryStore         1024         32     0.0458    0.0111      0.0100    4.1x     4.6x
LocalStore          1024         32     0.2863    0.0236      0.0218   12.1x    13.2x
MemoryStore         4096        128     0.1780    0.0281      0.0458    6.3x     3.9x
LocalStore          4096        128     1.0605    0.0934      0.1008   11.4x    10.5x
MemoryStore        16384        512     0.8202    0.1699      0.1442    4.8x     5.7x
LocalStore         16384        512     5.2325    0.4762      0.4040   11.0x    13.0x
MemoryStore        49152       1536     2.7726    0.6218      0.5360    4.5x     5.2x
LocalStore         49152       1536    13.4691    1.2704      1.2380   10.6x    10.9x

LocalStore plateaus at roughly 10–13×; remote object stores see much larger speedups (~64× in the issue's S3 report) because each skipped empty chunk avoids a network round-trip.
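
As a quick arithmetic check, and assuming the "x" columns are simply the arr[:] time divided by the alternative's time, the largest LocalStore row can be recomputed from the table values:

```python
# Values copied from the 49,152-chunk LocalStore row of the table above.
n_chunks, populated = 49152, 1536
t_full, t_pack, t_stream = 13.4691, 1.2704, 1.2380

populated_fraction = populated / n_chunks  # share of chunks that hold data
pack_speedup = t_full / t_pack             # arr[:] time over read_regions + pack
stream_speedup = t_full / t_stream         # arr[:] time over per-region stream

print(f"populated fraction: {populated_fraction:.4f}")
print(f"pack speedup: {pack_speedup:.1f}x, stream speedup: {stream_speedup:.1f}x")
```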

Testing

tests/test_chunk_access.py (memory + local stores; unsharded, sharded, 2-D; all-empty / all-populated / sparse layouts):

  • all three strategies agree, with hand-known populated counts;
  • the "list" strategy ignores non-chunk objects sharing the prefix;
  • packing read_regions output reproduces arr[:] byte-for-byte;
  • default region count matches shards_initialized;
  • explicit regions and concurrency=1 paths;
  • async streaming yields the same set as the sync wrapper.

Existing test_array / test_api (incl. the sync/async docstring-match test) and test_zarr pass unchanged.

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/user-guide/*.md
  • Changes documented as a new file in changes/
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

Comment thread bench/empty_chunks.py Outdated
@@ -0,0 +1,157 @@
"""Benchmark for sparse-array reads via the chunk-access primitives.
Contributor

not sure we want this checked in -- we have a benchmarks directory already, could you see if these code paths are already exercised there? Those benchmarks get run in CI, which is nice.

Comment thread src/zarr/core/array.py Outdated
return [
    tuple(
        slice(c * s, min((c + 1) * s, dim))
        for c, s, dim in zip(coords, shard_shape, array.shape, strict=True)
Contributor

maybe we can re-use the per-shard and per-chunk iteration routines we already have defined?

Contributor

for example, I would express "give me all the regions that are initialized" as a filter over iter_shard_regions, where the predicate function is whether the corresponding shard was initialized or not.

Author

makes sense, fixed now

codecov Bot commented Jun 3, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.57%. Comparing base (fe22910) to head (b8d1d8c).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4028      +/-   ##
==========================================
+ Coverage   93.55%   93.57%   +0.02%     
==========================================
  Files          88       88              
  Lines       11896    11934      +38     
==========================================
+ Hits        11129    11167      +38     
  Misses        767      767              
Files with missing lines Coverage Δ
src/zarr/__init__.py 100.00% <ø> (ø)
src/zarr/api/asynchronous.py 94.05% <ø> (ø)
src/zarr/api/synchronous.py 93.82% <100.00%> (+0.86%) ⬆️
src/zarr/core/array.py 97.93% <100.00%> (+0.05%) ⬆️

Comment thread src/zarr/core/array.py Outdated
if concurrency is None:
    concurrency = zarr_config.get("async.concurrency")

region_list = await _initialized_regions(array) if regions is None else list(regions)
Contributor

I'm not sure this function should speculatively check which regions are initialized ahead of time. That seems like something the caller should do when coming up with the collection of regions.

Author

ok, implemented
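
With discovery left to the caller, the reader only needs to take an explicit collection of regions, bound its concurrency, and yield results as they complete. The following is a minimal sketch of that shape, assuming a caller-supplied `read_one` coroutine; `read_regions_sketch` and `read_one` are hypothetical names, and the default of 4 is an arbitrary placeholder rather than the library's configured value.

```python
import asyncio


async def read_regions_sketch(read_one, regions, concurrency=4):
    """Read the given regions concurrently; yield (region, data) in completion order."""
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded(region):
        # At most `concurrency` reads are in flight at any time.
        async with semaphore:
            return region, await read_one(region)

    tasks = [asyncio.create_task(bounded(region)) for region in regions]
    try:
        for finished in asyncio.as_completed(tasks):
            yield await finished
    finally:
        # If the consumer stops early, cancel reads that are still pending.
        for task in tasks:
            task.cancel()
```

A caller would first build the region list (for instance from a discovery step) and then iterate with `async for region, data in read_regions_sketch(read_one, regions)`.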

Comment thread tests/test_chunk_access.py Outdated
@@ -0,0 +1,194 @@
"""Tests for the shard-discovery and region-read primitives.

These cover :func:`zarr.shards_initialized` (discover which shards/chunks of an
Contributor

we aren't using rst

Author

fixed!

Comment thread tests/test_chunk_access.py Outdated
@@ -0,0 +1,194 @@
"""Tests for the shard-discovery and region-read primitives.
Contributor

@d-v-b d-v-b Jun 3, 2026

I feel like these tests should just be in the same place as all the other array tests? not sure we need a new test file here.

Author

done

Contributor

d-v-b commented Jun 3, 2026

I'm not sure this approach would be useful, but we could also frame the question "how should we store our knowledge that a chunk is missing" as a caching problem, and express this in the storage layer by caching missing keys. I'm not sure if our experimental storage cache does this already.
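
To make the caching framing concrete, here is a rough sketch of a wrapper that remembers which keys were absent. `MissingKeyCache` and `DictStore` are hypothetical and are not the experimental storage cache mentioned above; whether that cache already behaves this way is left open, as in the comment.

```python
import asyncio


class DictStore:
    """Hypothetical backing store with async get/set (not zarr's Store API)."""

    def __init__(self):
        self.data = {}

    async def get(self, key):
        return self.data.get(key)

    async def set(self, key, value):
        self.data[key] = value


class MissingKeyCache:
    """Hypothetical wrapper that caches the fact that a key is missing."""

    def __init__(self, store):
        self._store = store
        self._missing = set()
        self.backend_calls = 0  # counts reads that reached the backing store

    async def get(self, key):
        if key in self._missing:
            return None  # known to be absent: skip the round-trip
        self.backend_calls += 1
        value = await self._store.get(key)
        if value is None:
            self._missing.add(key)
        return value

    async def set(self, key, value):
        # A write through this wrapper makes the key exist again.
        self._missing.discard(key)
        await self._store.set(key, value)
```

One caveat with this design: a key written by another process would still be reported as missing until the cached entry is invalidated, so a real version would need an expiry or invalidation policy.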

Development

Successfully merging this pull request may close these issues.

Enhancement proposal: empty-chunk-aware read path (array.prefetch_populated_keys)
