
feat: add IndexTransform library for composable, lazy coordinate mappings #3906

Open
d-v-b wants to merge 3 commits into zarr-developers:main from d-v-b:refactor/simplify-indexing

Conversation

d-v-b (Contributor) commented Apr 14, 2026

My summary:

With dask maintenance on the decline, it's more important than ever that we give zarr-python users a dask-free way to do something very intuitive: index large zarr arrays without turning the whole thing into a numpy array first. This was discussed at length in #1603.

This PR, done with Claude, makes regular indexing go through a lazy indexing layer. The lazy indexing layer is based on abstractions defined in tensorstore. The basic idea is to explicitly model indexing an array as a transformation from some input coordinates to output coordinates, and to bind such a representation to our Array classes.

Regular indexing via .__getitem__ is still immediate, but arrays have a new .z attribute that exposes the lazy indexing layer:

>>> from zarr import create_array                                                        
>>> arr = create_array(store={}, shape=(100,), dtype="uint8")
>>> arr[0] # immediate
np.uint8(0)
>>> arr.z[0] # lazy
<Array memory://136981112028224 shape=() dtype=uint8 domain={ 0 }>
>>> arr.z[0] = 10 # __setitem__ is immediate, no lazy writes yet
>>> arr.z.resolve() # call .resolve to make a numpy array
np.uint8(10)
>>> arr.z[-1] # for lazy indexing, -1 is a real index!
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/d-v-b/dev/zarr-python/src/zarr/core/array.py", line 4639, in __getitem__
    new_t = selection_to_transform(selection, self._array._async_array._transform, "basic")
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/d-v-b/dev/zarr-python/src/zarr/core/transforms/transform.py", line 924, in selection_to_transform
    return transform[selection]
           ~~~~~~~~~^^^^^^^^^^^
  File "/home/d-v-b/dev/zarr-python/src/zarr/core/transforms/transform.py", line 176, in __getitem__
    return _apply_basic_indexing(self, selection)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/d-v-b/dev/zarr-python/src/zarr/core/transforms/transform.py", line 482, in _apply_basic_indexing
    raise IndexError(
IndexError: index -1 is out of bounds for dimension 0 with domain [0, 100)
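The -1 behavior follows from modeling domains as half-open integer intervals rather than Python sequences: there is no wrap-around, so every coordinate is literal. A minimal sketch of that kind of bounds check (illustrative only; `check_index` is a made-up helper, not code from this PR):

```python
# Illustrative bounds check in the style of the error above. The lazy
# layer treats a domain as a half-open interval [inclusive_min,
# exclusive_max), so -1 is a literal coordinate, not "last element".
def check_index(i: int, inclusive_min: int, exclusive_max: int) -> int:
    if not (inclusive_min <= i < exclusive_max):
        raise IndexError(
            f"index {i} is out of bounds for domain [{inclusive_min}, {exclusive_max})"
        )
    return i

print(check_index(0, 0, 100))  # 0
# check_index(-1, 0, 100) raises IndexError, like the traceback above
```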

Goals here:

  • harmonize our indexing internals. The data structures in main are disparate ad-hoc copies of stuff from zarr-python 2.x. We can do better.
  • use data structures compatible with tensorstore. tensorstore's indexing machinery serializes to JSON. if we adopt the same patterns, we are closer to using tensorstore as an optional backend.
  • add a useful lazy indexing API for users who need to lazily index zarr arrays without using dask or xarray.
  • prepare our codebase for chunk encoding / decoding improvements, such as pushing array indexing down into the chunk encoding / decoding process
  • no breaking changes. old indexing routines should still work, even if they are not load-bearing any more.

Non-goals:

  • make the codec pipeline lazy-indexing-aware. future work.
  • support lazy writing. that's for future work.

Claude's summary:

Add a new src/zarr/core/transforms/ package implementing TensorStore-inspired
index transforms. The core idea: every indexing operation (slicing, fancy indexing,
etc.) produces a coordinate mapping from user space to storage space. These mappings
compose lazily — no I/O until explicitly resolved.

Key types:

  • IndexDomain — rectangular region in N-dimensional integer space
  • ConstantMap, DimensionMap, ArrayMap — three representations of a set of
    storage coordinates (singleton, arithmetic progression, explicit enumeration)
  • IndexTransform — pairs an input domain with output maps (one per storage dim)
  • compose(outer, inner) — chain two transforms

Key operations on IndexTransform:

  • __getitem__, .oindex[], .vindex[] — indexing produces new transforms
  • .intersect(domain) — restrict to coordinates within a region (chunk resolution)
  • .translate(shift) — shift coordinates (make chunk-local)

The transform library is standalone with no dependency on Array.

Includes comprehensive test suite (143 tests covering all types, operations,
composition, chunk resolution, and edge cases).
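To make the three output-map kinds concrete, here is a rough sketch of what each one computes. These are hypothetical classes mirroring the names above, not the PR's actual implementation: the point is only that each storage dimension's coordinate is a function of the input coordinates.

```python
from dataclasses import dataclass

# Illustrative models of the three output-map kinds described above
# (NOT the PR's real classes). Each maps an input coordinate tuple to
# one storage coordinate.

@dataclass(frozen=True)
class ConstantMap:
    offset: int  # storage coordinate is always this value (e.g. arr[5])
    def apply(self, coords: tuple[int, ...]) -> int:
        return self.offset

@dataclass(frozen=True)
class DimensionMap:
    offset: int
    stride: int
    input_dimension: int  # which input coordinate drives this output
    def apply(self, coords: tuple[int, ...]) -> int:
        # arithmetic progression, e.g. a slice like arr[2:10:2]
        return self.offset + self.stride * coords[self.input_dimension]

@dataclass(frozen=True)
class ArrayMap:
    offset: int
    index_array: tuple[int, ...]  # explicit enumeration (fancy indexing)
    input_dimension: int
    def apply(self, coords: tuple[int, ...]) -> int:
        return self.offset + self.index_array[coords[self.input_dimension]]

print(ConstantMap(5).apply((0,)))             # 5
print(DimensionMap(2, 2, 0).apply((3,)))      # 2 + 2*3 = 8
print(ArrayMap(0, (7, 1, 4), 0).apply((2,)))  # index_array[2] = 4
```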

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

github-actions bot added the "needs release notes" label (automatically applied to PRs which haven't added release notes) on Apr 14, 2026
d-v-b added the "benchmark" label (code will be benchmarked in a CI job) on Apr 14, 2026
d-v-b (Contributor) commented Apr 14, 2026

cc @vincentsarago

d-v-b mentioned this pull request on Apr 14, 2026
codecov bot commented Apr 14, 2026

Codecov Report

❌ Patch coverage is 88.99734% with 124 lines in your changes missing coverage. Please review.
✅ Project coverage is 92.58%. Comparing base (0ea15fd) to head (79cd9c8).

Files with missing lines Patch % Lines
src/zarr/core/transforms/transform.py 83.33% 89 Missing ⚠️
src/zarr/core/array.py 92.22% 22 Missing ⚠️
src/zarr/core/transforms/composition.py 87.75% 6 Missing ⚠️
src/zarr/core/transforms/chunk_resolution.py 96.73% 3 Missing ⚠️
src/zarr/core/transforms/domain.py 96.62% 3 Missing ⚠️
src/zarr/core/transforms/json.py 98.24% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3906      +/-   ##
==========================================
- Coverage   93.07%   92.58%   -0.50%     
==========================================
  Files          85       92       +7     
  Lines       11228    12326    +1098     
==========================================
+ Hits        10451    11412     +961     
- Misses        777      914     +137     
Files with missing lines Coverage Δ
src/zarr/core/transforms/__init__.py 100.00% <100.00%> (ø)
src/zarr/core/transforms/output_map.py 100.00% <100.00%> (ø)
src/zarr/core/transforms/json.py 98.24% <98.24%> (ø)
src/zarr/core/transforms/chunk_resolution.py 96.73% <96.73%> (ø)
src/zarr/core/transforms/domain.py 96.62% <96.62%> (ø)
src/zarr/core/transforms/composition.py 87.75% <87.75%> (ø)
src/zarr/core/array.py 95.77% <92.22%> (-2.04%) ⬇️
src/zarr/core/transforms/transform.py 83.33% <83.33%> (ø)

... and 1 file with indirect coverage changes


@vincentsarago

@d-v-b I'm new to zarr-python indexing, does .z[0] select the first value of the first dimension?

My use case is mainly this: if I have an array of shape (n, m, x, y), I want to be able to select on the first/second dimension to reduce the size of my array. How would I be able to do it with .z[]?

Add TypedDict definitions and conversion functions for serializing
IndexDomain, OutputIndexMap, and IndexTransform to/from JSON.

The JSON format follows TensorStore's conventions for interoperability:
- IndexDomain: input_inclusive_min, input_exclusive_max, input_labels
- OutputIndexMap: offset + optional stride/input_dimension/index_array
- IndexTransform: domain fields + output array

TypedDicts: IndexDomainJSON, OutputIndexMapJSON, IndexTransformJSON
Functions: index_domain_to_json, index_domain_from_json,
           index_transform_to_json, index_transform_from_json

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
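For concreteness, a transform for something like `arr.z[10:20]` on a 1-D array might serialize along these lines. This is an illustrative guess based on the field names listed above and TensorStore's documented conventions, not actual output from this PR:

```python
import json

# Hypothetical JSON form of the transform behind `arr.z[10:20]` on a
# 1-D array: input domain [10, 20), identity mapping onto storage dim 0.
# Field names follow the commit message above; the exact shape produced
# by the PR may differ.
transform_json = {
    "input_inclusive_min": [10],
    "input_exclusive_max": [20],
    "output": [
        {"offset": 0, "stride": 1, "input_dimension": 0},
    ],
}
roundtripped = json.loads(json.dumps(transform_json))
print(roundtripped == transform_json)  # True
```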
d-v-b (Contributor) commented Apr 14, 2026

> @d-v-b I'm new to zarr-python indexing, does .z[0] select the first value of the first dimension?
>
> My use case is mainly this: if I have an array of shape (n, m, x, y), I want to be able to select on the first/second dimension to reduce the size of my array. How would I be able to do it with .z[]?

on this branch:

>>> arr = create_array(store={}, shape=(100,100), dtype="uint8")
>>> arr.z[0]      
<Array memory://136980013828096 shape=(100,) dtype=uint8 domain={ 0, [0, 100) }>
>>> arr.z[0].shape                 
(100,)

the domain={ 0, [0, 100) } means "this array is what you get when you take the first index for the first dimension, and all values along the other dimension", which I think is what you want?
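For anyone mapping this back to eager semantics: the resulting shape matches plain numpy integer indexing, which also drops the indexed dimension (illustrative comparison, not this PR's code path):

```python
import numpy as np

# Eager numpy analogue of the lazy selection above: integer-indexing
# the first dimension of a (100, 100) array drops that dimension.
a = np.zeros((100, 100), dtype="uint8")
print(a[0].shape)  # (100,)
```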

vincentsarago commented Apr 14, 2026

What if I have

# create a 4d array (time, band, y, x)
arr = create_array(store={}, shape=(2, 3, 100,100), dtype="uint8")

and I want to select the arrays for band 1 (second dim) but for all the times, would arr.z[:, 0] give what I want?

d-v-b (Contributor) commented Apr 14, 2026

> What if I have
>
> # create a 4d array (time, band, y, x)
> arr = create_array(store={}, shape=(2, 3, 100,100), dtype="uint8")
>
> and I want to select the arrays for band 1 (second dim) but for all the times, would arr.z[:, 0] give what I want?

yeah it should!
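The eager numpy equivalent confirms the intuition (illustrative; selecting one band index across all times follows standard basic-indexing dimension-drop semantics):

```python
import numpy as np

# numpy analogue of arr.z[:, 0]: select band index 0 across all times
# on a (time, band, y, x) array; the band dimension is dropped.
a = np.zeros((2, 3, 100, 100), dtype="uint8")
print(a[:, 0].shape)  # (2, 100, 100)
```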

codspeed-hq bot commented Apr 14, 2026

Merging this PR will degrade performance by 17.94%

⚡ 4 improved benchmarks
❌ 14 regressed benchmarks
✅ 48 untouched benchmarks
⏩ 6 skipped benchmarks [1]

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

Mode Benchmark BASE HEAD Efficiency
WallTime test_write_array[memory-Layout(shape=(1000000,), chunks=(1000,), shards=(1000,))-None] 1.6 s 1.3 s +25.3%
WallTime test_write_array[memory-Layout(shape=(1000000,), chunks=(1000,), shards=None)-gzip] 1 s 1.2 s -12.69%
WallTime test_write_array[memory-Layout(shape=(1000000,), chunks=(1000,), shards=(1000,))-gzip] 2.1 s 1.9 s +11.24%
WallTime test_write_array[local-Layout(shape=(1000000,), chunks=(1000,), shards=(1000,))-None] 2.8 s 2 s +36.42%
WallTime test_read_array[memory-Layout(shape=(1000000,), chunks=(1000,), shards=None)-None] 275.7 ms 336 ms -17.94%
WallTime test_write_array[local-Layout(shape=(1000000,), chunks=(1000,), shards=(1000,))-gzip] 3.2 s 2.6 s +25.43%
WallTime test_read_array[memory-Layout(shape=(1000000,), chunks=(1000,), shards=None)-gzip] 578.8 ms 652.6 ms -11.31%
WallTime test_sharded_morton_single_chunk[(32, 32, 32)-memory] 1.8 ms 2 ms -10.6%
WallTime test_morton_order_iter[(20, 20, 20)] 35.8 ms 41.6 ms -13.84%
WallTime test_morton_order_iter[(33, 33, 33)] 163.9 ms 188.3 ms -12.94%
WallTime test_sharded_morton_write_single_chunk[(33, 33, 33)-memory] 193.4 ms 224.4 ms -13.84%
WallTime test_morton_order_iter[(10, 10, 10)] 4.7 ms 5.4 ms -12.85%
WallTime test_sharded_morton_write_single_chunk[(30, 30, 30)-memory] 146.5 ms 168.8 ms -13.19%
WallTime test_morton_order_iter[(8, 8, 8)] 2.4 ms 2.8 ms -13.75%
WallTime test_sharded_morton_write_single_chunk[(32, 32, 32)-memory] 173.6 ms 202.8 ms -14.37%
WallTime test_morton_order_iter[(16, 16, 16)] 18.2 ms 21 ms -13.38%
WallTime test_morton_order_iter[(30, 30, 30)] 122.5 ms 143.7 ms -14.71%
WallTime test_morton_order_iter[(32, 32, 32)] 147.1 ms 172.2 ms -14.54%

Comparing d-v-b:refactor/simplify-indexing (732dddd) with main (7c78574) [2]

Open in CodSpeed

Footnotes

  1. 6 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, archive them on CodSpeed to remove them from the performance reports.

  2. No successful run was found on main (0ea15fd) during the generation of this report, so 7c78574 was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.
