
feat: add IndexTransform library for composable, lazy coordinate mappings #3906

Open
d-v-b wants to merge 3 commits into zarr-developers:main from d-v-b:refactor/simplify-indexing

Conversation

d-v-b (Contributor) commented Apr 14, 2026

My summary:

With dask maintenance on the decline, it's more important than ever that we give zarr-python users a dask-free way to do something very intuitive: index large zarr arrays without turning the whole thing into a numpy array first. This was discussed at length in #1603.

This PR, done with Claude, makes regular indexing go through a lazy indexing layer. The lazy indexing layer is based on abstractions defined in tensorstore. The basic idea is to explicitly model indexing an array as a transformation from some input coordinates to output coordinates, and to bind such a representation to our Array classes.

Regular indexing via .__getitem__ is still immediate, but arrays have a new .z attribute that exposes the lazy indexing layer:

>>> from zarr import create_array                                                        
>>> arr = create_array(store={}, shape=(100,), dtype="uint8")
>>> arr[0] # immediate
np.uint8(0)
>>> arr.z[0] # lazy
<Array memory://136981112028224 shape=() dtype=uint8 domain={ 0 }>
>>> arr.z[0] = 10 # __setitem__ is immediate, no lazy writes yet
>>> arr.z.resolve() # call .resolve to make a numpy array
np.uint8(10)
>>> arr.z[-1] # for lazy indexing, -1 is a real index!
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/d-v-b/dev/zarr-python/src/zarr/core/array.py", line 4639, in __getitem__
    new_t = selection_to_transform(selection, self._array._async_array._transform, "basic")
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/d-v-b/dev/zarr-python/src/zarr/core/transforms/transform.py", line 924, in selection_to_transform
    return transform[selection]
           ~~~~~~~~~^^^^^^^^^^^
  File "/home/d-v-b/dev/zarr-python/src/zarr/core/transforms/transform.py", line 176, in __getitem__
    return _apply_basic_indexing(self, selection)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/d-v-b/dev/zarr-python/src/zarr/core/transforms/transform.py", line 482, in _apply_basic_indexing
    raise IndexError(
IndexError: index -1 is out of bounds for dimension 0 with domain [0, 100)
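The -1 behavior follows from modeling domains as half-open integer intervals rather than Python sequences: there is no wrap-around, so every coordinate is literal. A minimal sketch of that kind of bounds check (illustrative only; `check_index` is a made-up helper, not code from this PR):

```python
# Illustrative bounds check in the style of the error above. The lazy
# layer treats a domain as a half-open interval [inclusive_min,
# exclusive_max), so -1 is a literal coordinate, not "last element".
def check_index(i: int, inclusive_min: int, exclusive_max: int) -> int:
    if not (inclusive_min <= i < exclusive_max):
        raise IndexError(
            f"index {i} is out of bounds for domain [{inclusive_min}, {exclusive_max})"
        )
    return i

print(check_index(0, 0, 100))  # 0
# check_index(-1, 0, 100) raises IndexError, like the traceback above
```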

Goals here:

  • harmonize our indexing internals. The data structures in main are disparate ad-hoc copies of stuff from zarr-python 2.x. We can do better.
  • use data structures compatible with tensorstore. tensorstore's indexing machinery serializes to JSON. if we adopt the same patterns, we are closer to using tensorstore as an optional backend.
  • add a useful lazy indexing API for users who need to lazily index zarr arrays without using dask or xarray.
  • prepare our codebase for chunk encoding / decoding improvements, such as pushing array indexing down into the chunk encoding / decoding process
  • no breaking changes. old indexing routines should still work, even if they are not load-bearing any more.

Non-goals:

  • make the codec pipeline lazy-indexing-aware. future work.
  • support lazy writing. that's for future work.

Claude's summary:

Add a new src/zarr/core/transforms/ package implementing TensorStore-inspired
index transforms. The core idea: every indexing operation (slicing, fancy indexing,
etc.) produces a coordinate mapping from user space to storage space. These mappings
compose lazily — no I/O until explicitly resolved.

Key types:

  • IndexDomain — rectangular region in N-dimensional integer space
  • ConstantMap, DimensionMap, ArrayMap — three representations of a set of
    storage coordinates (singleton, arithmetic progression, explicit enumeration)
  • IndexTransform — pairs an input domain with output maps (one per storage dim)
  • compose(outer, inner) — chain two transforms

Key operations on IndexTransform:

  • __getitem__, .oindex[], .vindex[] — indexing produces new transforms
  • .intersect(domain) — restrict to coordinates within a region (chunk resolution)
  • .translate(shift) — shift coordinates (make chunk-local)

The transform library is standalone with no dependency on Array.

Includes comprehensive test suite (143 tests covering all types, operations,
composition, chunk resolution, and edge cases).
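To make the three output-map kinds concrete, here is a rough sketch of what each one computes. These are hypothetical classes mirroring the names above, not the PR's actual implementation: the point is only that each storage dimension's coordinate is a function of the input coordinates.

```python
from dataclasses import dataclass

# Illustrative models of the three output-map kinds described above
# (NOT the PR's real classes). Each maps an input coordinate tuple to
# one storage coordinate.

@dataclass(frozen=True)
class ConstantMap:
    offset: int  # storage coordinate is always this value (e.g. arr[5])
    def apply(self, coords: tuple[int, ...]) -> int:
        return self.offset

@dataclass(frozen=True)
class DimensionMap:
    offset: int
    stride: int
    input_dimension: int  # which input coordinate drives this output
    def apply(self, coords: tuple[int, ...]) -> int:
        # arithmetic progression, e.g. a slice like arr[2:10:2]
        return self.offset + self.stride * coords[self.input_dimension]

@dataclass(frozen=True)
class ArrayMap:
    offset: int
    index_array: tuple[int, ...]  # explicit enumeration (fancy indexing)
    input_dimension: int
    def apply(self, coords: tuple[int, ...]) -> int:
        return self.offset + self.index_array[coords[self.input_dimension]]

print(ConstantMap(5).apply((0,)))             # 5
print(DimensionMap(2, 2, 0).apply((3,)))      # 2 + 2*3 = 8
print(ArrayMap(0, (7, 1, 4), 0).apply((2,)))  # index_array[2] = 4
```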

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

github-actions bot added the "needs release notes" label (automatically applied to PRs which haven't added release notes) on Apr 14, 2026
d-v-b added the "benchmark" label (code will be benchmarked in a CI job) on Apr 14, 2026
d-v-b (Contributor) commented Apr 14, 2026

cc @vincentsarago

d-v-b mentioned this pull request on Apr 14, 2026
codecov bot commented Apr 14, 2026

Codecov Report

❌ Patch coverage is 88.99734% with 124 lines in your changes missing coverage. Please review.
✅ Project coverage is 92.58%. Comparing base (0ea15fd) to head (79cd9c8).

Files with missing lines Patch % Lines
src/zarr/core/transforms/transform.py 83.33% 89 Missing ⚠️
src/zarr/core/array.py 92.22% 22 Missing ⚠️
src/zarr/core/transforms/composition.py 87.75% 6 Missing ⚠️
src/zarr/core/transforms/chunk_resolution.py 96.73% 3 Missing ⚠️
src/zarr/core/transforms/domain.py 96.62% 3 Missing ⚠️
src/zarr/core/transforms/json.py 98.24% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3906      +/-   ##
==========================================
- Coverage   93.07%   92.58%   -0.50%     
==========================================
  Files          85       92       +7     
  Lines       11228    12326    +1098     
==========================================
+ Hits        10451    11412     +961     
- Misses        777      914     +137     
Files with missing lines Coverage Δ
src/zarr/core/transforms/__init__.py 100.00% <100.00%> (ø)
src/zarr/core/transforms/output_map.py 100.00% <100.00%> (ø)
src/zarr/core/transforms/json.py 98.24% <98.24%> (ø)
src/zarr/core/transforms/chunk_resolution.py 96.73% <96.73%> (ø)
src/zarr/core/transforms/domain.py 96.62% <96.62%> (ø)
src/zarr/core/transforms/composition.py 87.75% <87.75%> (ø)
src/zarr/core/array.py 95.77% <92.22%> (-2.04%) ⬇️
src/zarr/core/transforms/transform.py 83.33% <83.33%> (ø)

... and 1 file with indirect coverage changes


@vincentsarago

@d-v-b I'm new to zarr-python indexing, does .z[0] select the first value of the first dimension?

My use case is mainly this: if I have an array of shape (n, m, x, y), I want to be able to select on the first/second dimension to reduce the size of my array. How would I be able to do it with .z[]?

Add TypedDict definitions and conversion functions for serializing
IndexDomain, OutputIndexMap, and IndexTransform to/from JSON.

The JSON format follows TensorStore's conventions for interoperability:
- IndexDomain: input_inclusive_min, input_exclusive_max, input_labels
- OutputIndexMap: offset + optional stride/input_dimension/index_array
- IndexTransform: domain fields + output array

TypedDicts: IndexDomainJSON, OutputIndexMapJSON, IndexTransformJSON
Functions: index_domain_to_json, index_domain_from_json,
           index_transform_to_json, index_transform_from_json

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
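For concreteness, a transform for something like `arr.z[10:20]` on a 1-D array might serialize along these lines. This is an illustrative guess based on the field names listed above and TensorStore's documented conventions, not actual output from this PR:

```python
import json

# Hypothetical JSON form of the transform behind `arr.z[10:20]` on a
# 1-D array: input domain [10, 20), identity mapping onto storage dim 0.
# Field names follow the commit message above; the exact shape produced
# by the PR may differ.
transform_json = {
    "input_inclusive_min": [10],
    "input_exclusive_max": [20],
    "output": [
        {"offset": 0, "stride": 1, "input_dimension": 0},
    ],
}
roundtripped = json.loads(json.dumps(transform_json))
print(roundtripped == transform_json)  # True
```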
d-v-b (Contributor) commented Apr 14, 2026

> @d-v-b I'm new to zarr-python indexing, does .z[0] select the first value of the first dimension?
>
> My use case is mainly this: if I have an array of shape (n, m, x, y), I want to be able to select on the first/second dimension to reduce the size of my array. How would I be able to do it with .z[]?

on this branch:

>>> arr = create_array(store={}, shape=(100,100), dtype="uint8")
>>> arr.z[0]      
<Array memory://136980013828096 shape=(100,) dtype=uint8 domain={ 0, [0, 100) }>
>>> arr.z[0].shape                 
(100,)

the domain={ 0, [0, 100) } means "this array is what you get when you take the first index for the first dimension, and all values along the other dimension", which I think is what you want?
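For anyone mapping this back to eager semantics: the resulting shape matches plain numpy integer indexing, which also drops the indexed dimension (illustrative comparison, not this PR's code path):

```python
import numpy as np

# Eager numpy analogue of the lazy selection above: integer-indexing
# the first dimension of a (100, 100) array drops that dimension.
a = np.zeros((100, 100), dtype="uint8")
print(a[0].shape)  # (100,)
```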

vincentsarago commented Apr 14, 2026

What if I have

# create a 4d array (time, band, y, x)
arr = create_array(store={}, shape=(2, 3, 100,100), dtype="uint8")

and I want to select the arrays for band 1 (second dim) but for all the times, would arr.z[:, 0] give what I want?

d-v-b (Contributor) commented Apr 14, 2026

> What if I have
>
> # create a 4d array (time, band, y, x)
> arr = create_array(store={}, shape=(2, 3, 100,100), dtype="uint8")
>
> and I want to select the arrays for band 1 (second dim) but for all the times, would arr.z[:, 0] give what I want?

yeah it should!
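The eager numpy equivalent confirms the intuition (illustrative; selecting one band index across all times follows standard basic-indexing dimension-drop semantics):

```python
import numpy as np

# numpy analogue of arr.z[:, 0]: select band index 0 across all times
# on a (time, band, y, x) array; the band dimension is dropped.
a = np.zeros((2, 3, 100, 100), dtype="uint8")
print(a[:, 0].shape)  # (2, 100, 100)
```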

codspeed-hq bot commented Apr 14, 2026

Merging this PR will degrade performance by 17.94%

⚡ 4 improved benchmarks
❌ 14 regressed benchmarks
✅ 48 untouched benchmarks
⏩ 6 skipped benchmarks [1]

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

Mode Benchmark BASE HEAD Efficiency
WallTime test_write_array[memory-Layout(shape=(1000000,), chunks=(1000,), shards=(1000,))-None] 1.6 s 1.3 s +25.3%
WallTime test_write_array[memory-Layout(shape=(1000000,), chunks=(1000,), shards=None)-gzip] 1 s 1.2 s -12.69%
WallTime test_write_array[memory-Layout(shape=(1000000,), chunks=(1000,), shards=(1000,))-gzip] 2.1 s 1.9 s +11.24%
WallTime test_write_array[local-Layout(shape=(1000000,), chunks=(1000,), shards=(1000,))-None] 2.8 s 2 s +36.42%
WallTime test_read_array[memory-Layout(shape=(1000000,), chunks=(1000,), shards=None)-None] 275.7 ms 336 ms -17.94%
WallTime test_write_array[local-Layout(shape=(1000000,), chunks=(1000,), shards=(1000,))-gzip] 3.2 s 2.6 s +25.43%
WallTime test_read_array[memory-Layout(shape=(1000000,), chunks=(1000,), shards=None)-gzip] 578.8 ms 652.6 ms -11.31%
WallTime test_sharded_morton_single_chunk[(32, 32, 32)-memory] 1.8 ms 2 ms -10.6%
WallTime test_morton_order_iter[(20, 20, 20)] 35.8 ms 41.6 ms -13.84%
WallTime test_morton_order_iter[(33, 33, 33)] 163.9 ms 188.3 ms -12.94%
WallTime test_sharded_morton_write_single_chunk[(33, 33, 33)-memory] 193.4 ms 224.4 ms -13.84%
WallTime test_morton_order_iter[(10, 10, 10)] 4.7 ms 5.4 ms -12.85%
WallTime test_sharded_morton_write_single_chunk[(30, 30, 30)-memory] 146.5 ms 168.8 ms -13.19%
WallTime test_morton_order_iter[(8, 8, 8)] 2.4 ms 2.8 ms -13.75%
WallTime test_sharded_morton_write_single_chunk[(32, 32, 32)-memory] 173.6 ms 202.8 ms -14.37%
WallTime test_morton_order_iter[(16, 16, 16)] 18.2 ms 21 ms -13.38%
WallTime test_morton_order_iter[(30, 30, 30)] 122.5 ms 143.7 ms -14.71%
WallTime test_morton_order_iter[(32, 32, 32)] 147.1 ms 172.2 ms -14.54%

Comparing d-v-b:refactor/simplify-indexing (732dddd) with main (7c78574) [2]

Open in CodSpeed

Footnotes

  1. 6 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, archive them on CodSpeed to remove them from the performance reports.

  2. No successful run was found on main (0ea15fd) during the generation of this report, so 7c78574 was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.
