perf: cache lexicographic chunk coords in sharding codec by d-v-b · Pull Request #4012 · zarr-developers/zarr-python

d-v-b · 2026-05-26T21:02:15Z

Fixes a performance regression introduced in #3826 that made morton-order-indexing slower.

for details, see claude's summary below.

The subchunk_write_order feature (#3826) regressed sharded write performance: _encode_partial_single rebuilt the full per-shard chunk coordinate grid on every write via
np.array(list(_subchunk_order_iter(..., "lexicographic"))), and to_dict_vectorized rebuilt a tuple key per row with tuple(coords.ravel()). For a single-chunk write into a shard with tens of thousands of chunks this roughly doubled write time (~22ms -> ~40ms on test_sharded_morton_write_single_chunk, matching the -44% CodSpeed regression).

Add cached _lexicographic_order (array) and _lexicographic_order_keys (tuples) helpers in indexing.py, mirroring _morton_order/_morton_order_keys, and pass the cached keys into to_dict_vectorized instead of deriving them row-by-row. This restores write throughput to the pre-#3826 baseline while preserving identical chunk ordering (verified equal to np.ndindex across shapes including 0-d and empty).

[Description of PR]

TODO:

Add unit tests and/or doctests in docstrings
Add docstrings and API docs for any new/modified user-facing classes and functions
New/modified features documented in docs/user-guide/*.md
Changes documented as a new file in changes/
GitHub Actions have all passed
Test coverage is 100% (Codecov passes)

The subchunk_write_order feature (zarr-developers#3826) regressed sharded write performance: _encode_partial_single rebuilt the full per-shard chunk coordinate grid on every write via `np.array(list(_subchunk_order_iter(..., "lexicographic")))`, and `to_dict_vectorized` rebuilt a tuple key per row with `tuple(coords.ravel())`. For a single-chunk write into a shard with tens of thousands of chunks this roughly doubled write time (~22ms -> ~40ms on test_sharded_morton_write_single_chunk, matching the -44% CodSpeed regression). Add cached `_lexicographic_order` (array) and `_lexicographic_order_keys` (tuples) helpers in indexing.py, mirroring `_morton_order`/`_morton_order_keys`, and pass the cached keys into `to_dict_vectorized` instead of deriving them row-by-row. This restores write throughput to the pre-zarr-developers#3826 baseline while preserving identical chunk ordering (verified equal to np.ndindex across shapes including 0-d and empty). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

codecov · 2026-05-26T21:11:30Z

Codecov Report

❌ Patch coverage is 97.43590% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 93.55%. Comparing base (fe22910) to head (aa2bb30).

Files with missing lines	Patch %	Lines
src/zarr/codecs/sharding.py	93.33%	1 Missing ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #4012   +/-   ##
=======================================
  Coverage   93.55%   93.55%           
=======================================
  Files          88       88           
  Lines       11896    11909   +13     
=======================================
+ Hits        11129    11142   +13     
  Misses        767      767

Files with missing lines	Coverage Δ
src/zarr/core/indexing.py	`96.29% <100.00%> (+0.04%)`	⬆️
src/zarr/codecs/sharding.py	`92.20% <93.33%> (+0.06%)`	⬆️

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

codspeed-hq · 2026-05-26T21:28:36Z

Merging this PR will improve performance by 40.32%

⚡ 3 improved benchmarks
❌ 3 regressed benchmarks
✅ 53 untouched benchmarks
🆕 10 new benchmarks
⏩ 13 skipped benchmarks¹

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

	Mode	Benchmark	`BASE`	`HEAD`	Efficiency
❌	WallTime	`test_sharded_morton_single_chunk[(33, 33, 33)-memory]`	2.2 ms	7.7 ms	-71.7%
❌	WallTime	`test_sharded_morton_single_chunk[(32, 32, 32)-memory]`	2.1 ms	7 ms	-69.24%
❌	WallTime	`test_sharded_morton_single_chunk[(30, 30, 30)-memory]`	2.1 ms	6.2 ms	-65.96%
⚡	WallTime	`test_sharded_morton_write_single_chunk[(32, 32, 32)-memory]`	350.3 ms	54 ms	×6.5
⚡	WallTime	`test_sharded_morton_write_single_chunk[(33, 33, 33)-memory]`	391.3 ms	61.7 ms	×6.3
⚡	WallTime	`test_sharded_morton_write_single_chunk[(30, 30, 30)-memory]`	292.7 ms	46.7 ms	×6.3
🆕	WallTime	`test_morton_order[(8, 8, 8)]`	N/A	469.3 µs	N/A
🆕	WallTime	`test_morton_order[(10, 10, 10)]`	N/A	861.5 µs	N/A
🆕	WallTime	`test_morton_order[(20, 20, 20)]`	N/A	5.6 ms	N/A
🆕	WallTime	`test_morton_order[(30, 30, 30)]`	N/A	18.3 ms	N/A
🆕	WallTime	`test_morton_order[(16, 16, 16)]`	N/A	2.5 ms	N/A
🆕	WallTime	`test_morton_order[(32, 32, 32)]`	N/A	20.8 ms	N/A
🆕	WallTime	`test_sharded_morton_write_single_chunk_warm_cache[(33, 33, 33)-memory]`	N/A	37.4 ms	N/A
🆕	WallTime	`test_morton_order[(33, 33, 33)]`	N/A	24.6 ms	N/A
🆕	WallTime	`test_sharded_morton_write_single_chunk_warm_cache[(30, 30, 30)-memory]`	N/A	28.6 ms	N/A
🆕	WallTime	`test_sharded_morton_write_single_chunk_warm_cache[(32, 32, 32)-memory]`	N/A	33.1 ms	N/A

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.

_{Comparing d-v-b:perf-sharding-coord-cache (2742897) with main (6ce787d)²}

13 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports. ↩
No successful run was found on main (fe22910) during the generation of this report, so 6ce787d was used instead as the comparison base. There might be some changes unrelated to this pull request in this report. ↩

maxrjones · 2026-06-01T19:32:23Z

@@ -266,13 +269,19 @@ def __iter__(self) -> Iterator[tuple[int, ...]]:
    def to_dict_vectorized(


I think the following reads cleaner than adding another parameter. The reader already knows its own chunks_per_shard, so it can fetch both cached structures itself instead of having them threaded in. With this, the call site simplifies to just shard_reader.to_dict_vectorized().

I tested it still fixes the regression — same ~1.6–1.8× speedup over main on test_sharded_morton_write_single_chunk, and the difference vs the committed version is within run-to-run noise.

def to_dict_vectorized(self) -dict[tuple[int, ...], Buffer | None]: """Build a dict of chunk coordinates to buffers using vectorized lookup. The full per-shard chunk coordinate grid (both the array used for the vectorized index lookup and the plain tuples used as dict keys) is cached on ``chunks_per_shard``, so neither is rebuilt on every call. For a shard with tens of thousands of chunks this avoids reconstructing that many tuples on every partial write. Returns ------- dict mapping chunk coordinate tuples to Buffer or None """ chunks_per_shard = self.index.chunks_per_shard chunk_coords_array = _lexicographic_order(chunks_per_shard) chunk_coords_keys = _lexicographic_order_keys(chunks_per_shard) starts, ends, valid = self.index.get_chunk_slices_vectorized(chunk_coords_array) result: dict[tuple[int, ...], Buffer | None] = {} for i, coords in enumerate(chunk_coords_keys): if valid[i]: result[coords] = self.buf[int(starts[i]) : int(ends[i])] else: result[coords] = None return result

that's great feedback! have a look at 6e02c8f and see if things look better to you

Address review feedback: `_ShardReader.to_dict_vectorized` took the lexicographic coordinate array and key tuples as parameters, even though the reader already knows its own `chunks_per_shard` and both structures are `lru_cache`d. Thread nothing in — fetch them inside the method via `_lexicographic_order`/`_lexicographic_order_keys`. Same cache, so no perf change; the call site collapses to `to_dict_vectorized()`. Add a unit test covering the method directly across 0-d, 1-d, and 2-d shard grids: present chunks map to their stored bytes, empty chunks to None, and every lexicographic coordinate appears as a key. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

ilan-gold

Apologies for the regression!

Co-authored-by: Ilan Gold <ilanbassgold@gmail.com>

chuckwondo · 2026-06-02T15:36:51Z

+def lexicographic_order_iter(chunk_shape: tuple[int, ...]) -> Iterator[tuple[int, ...]]:
+    return iter(_lexicographic_order_keys(chunk_shape))


Not sure this makes a lot of sense. Why would we provide a "convenience" function for producing an "lazy" iterator over something that has already been eagerly constructed? Shouldn't we expose the lazy option to the user and then the user can eagerly collect the results, if that's what they want. Otherwise, we simply provide no means for the caller to get lazy results.

Although, in this case, does being lazy even matter? I'm not familiar enough with what's going on here to tell if chunk_shape is always something relatively small, in which case laziness is likely unnecessary.

I didn't want to open this can of worms, but I agree things are funny here. I don't think the iter function here is super important but

user can eagerly collect the results, if that's what they want.

The idea here is definitely to cache the eager results and then let the user create an iterator. But I don't even think an iterator is necessary _lexicographic_order_keys should satisfy the APIs that lexicographic_order_iter is meant for e.e.,g

dict.fromkeys((1, 2, 3))

works just fine (which is one place where the iterator is used).

It really should be the other way around. We should define the lazy version and the eager version should just be "collect" the lazy version.

The problem is that pay an enormous penalty for generating the entire list of all chunks when we might only just want the first one.

agreed! this pattern is what you get when you optimize for benchmarks. we might need to relax that concern, or add a separate benchmark that checks "time to fetch 1 chunk"

I think @chuckwondo's point indicates something more basic -- there's no reason to expose an iteration interface, lazy or not, for something that has a defined size, like a dense, finite, coordinate grid. We should be exposing an indexable data structure from the start.

@ilan-gold

Address review feedback from @ilan-gold and @chuckwondo on the `lexicographic_order_iter` helper. `lexicographic_order_iter` returned a *lazy* iterator over an *eagerly-built, cached* tuple (`_lexicographic_order_keys`), which chuckwondo rightly flagged as confusing — and its output is byte-for-byte identical to the pre-existing, genuinely-lazy `c_order_iter` (verified across 0-d, empty, and N-d shapes). So the name promised laziness the implementation didn't provide, over a sequence we could already produce. Remove the wrapper and use the cached `_lexicographic_order_keys` tuple directly at the two `dict.fromkeys` call sites and in `_subchunk_order_iter`. This keeps the eager/cached coordinate tuples — which is the actual optimization: `dict.fromkeys` over the cached tuple is ~1.4x faster than over lazy `c_order_iter` at 32^3 (≈900us vs ≈1300us), because the cache amortizes tuple construction across repeated writes to same-shaped shards. Switching to `c_order_iter` would have reintroduced that cost, so it is deliberately not used here. Also drop the now-dead `tuple()` wrap in `morton_order_iter` (its argument is typed `tuple[int, ...]` and every caller passes one), per ilan-gold. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…-cache

Reconcile the remote branch, which had two commits not present locally: - 2b8b095 Merge branch 'main' (same upstream tooling commits already merged locally in 6ee3440; no new content) - 51d0f05 "Update src/zarr/core/indexing.py" (Co-authored-by: Ilan Gold) dropped the redundant tuple() wrap from lexicographic_order_iter. Conflict in src/zarr/core/indexing.py resolved in favor of the local approach: lexicographic_order_iter was removed entirely as redundant with c_order_iter (identical output across all shapes), with its call sites switched to the cached _lexicographic_order_keys tuple. That supersedes ilan-gold's narrower tuple()-trim — the function they edited no longer exists — while addressing the same review feedback (and chuckwondo's point about the misleading lazy-over-eager wrapper). ilan-gold's commit is preserved in history as a merge parent. Co-authored-by: Ilan Gold <ilanbassgold@gmail.com>

…_order_iter `c_order_iter` names a memory layout ("C order") rather than what the iterator actually yields. Reintroduce `lexicographic_order_iter` as the clearer name for the same row-major coordinate sequence, and make `c_order_iter` a thin alias that delegates to it, with a docstring note steering new code to the preferred name. No runtime warning — these are internal helpers. `lexicographic_order_iter` keeps the eager/cached implementation (iter over the lru_cached `_lexicographic_order_keys` tuple), which is ~1.4x faster than the old lazy `itertools.product` on the `dict.fromkeys` shard-write path and is the optimization this branch exists to deliver. The alias therefore changes `c_order_iter` from lazy to eager/cached; all in-repo callers (_ShardReader.__iter__, _is_total_shard, _subchunk_order_iter, and two tests) are migrated to `lexicographic_order_iter`, so nothing in-tree relies on the old laziness. Output is unchanged: lexicographic_order_iter, the c_order_iter alias, and np.ndindex all agree across 0-d, empty, and N-d shapes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

mkitti · 2026-06-02T18:08:44Z

+def _lexicographic_order_keys(chunk_shape: tuple[int, ...]) -> tuple[tuple[int, ...], ...]:
+    return tuple(tuple(int(x) for x in row) for row in _lexicographic_order(chunk_shape))
+
+
+def lexicographic_order_iter(chunk_shape: tuple[int, ...]) -> Iterator[tuple[int, ...]]:
+    # Iterate the chunk grid in lexicographic (row-major / C) order. The full
+    # coordinate tuple is built once and cached on `chunk_shape` (see
+    # _lexicographic_order_keys), so repeated calls for the same shape — e.g. one
+    # per shard write — reuse it rather than rebuilding tens of thousands of
+    # tuples each time.
+    return iter(_lexicographic_order_keys(chunk_shape))


Suggested change

def _lexicographic_order_keys(chunk_shape: tuple[int, ...]) -> tuple[tuple[int, ...], ...]:

return tuple(tuple(int(x) for x in row) for row in _lexicographic_order(chunk_shape))

def lexicographic_order_iter(chunk_shape: tuple[int, ...]) -> Iterator[tuple[int, ...]]:

# Iterate the chunk grid in lexicographic (row-major / C) order. The full

# coordinate tuple is built once and cached on `chunk_shape` (see

# _lexicographic_order_keys), so repeated calls for the same shape — e.g. one

# per shard write — reuse it rather than rebuilding tens of thousands of

# tuples each time.

return iter(_lexicographic_order_keys(chunk_shape))

def _lexicographic_order_keys(chunk_shape: tuple[int, ...]) -> tuple[tuple[int, ...], ...]:

return tuple(lexicographic_order_iter(chunk_shape))

def lexicographic_order_iter(chunk_shape: tuple[int, ...]) -> Iterator[tuple[int, ...]]:

# Iterate the chunk grid in lexicographic (row-major / C) order. The full

# coordinate tuple is built once and cached on `chunk_shape` (see

# _lexicographic_order_keys), so repeated calls for the same shape — e.g. one

# per shard write — reuse it rather than rebuilding tens of thousands of

# tuples each time.

return (tuple(int(x) for x in row) for row in _lexicographic_order(chunk_shape))

The lazy "lexicographic_order_iter" uses a generator comprehension. The eager "_lexicographic_order_keys " just pulls that into a tuple.

@mkitti

Per review from @mkitti: invert the relationship between the lazy iterator and the eagerly-collected tuple. `lexicographic_order_iter` is now a genuine lazy generator over the chunk-grid coordinates, and `_lexicographic_order_keys` collects it into a cached tuple — the eager version is "collect the lazy one", not the other way around. Previously lexicographic_order_iter returned iter() over the cached tuple, so any consumer that only needed a prefix still paid to materialize the entire grid. _is_total_shard does exactly that — an early-exit `all(coord in set for coord in ...)` — and on a cold cache for a 32^3 shard whose first coordinate is absent this dropped from ~15.8ms to ~24us (the lazy generator builds one coordinate and bails). The hot path is unchanged: the two dict.fromkeys sites consume the full grid and use the cached `_lexicographic_order_keys` tuple directly (~0.9ms at 32^3), so the regression fix this branch delivers is intact. This also resolves @chuckwondo's point — the iterator is now actually lazy rather than a thin wrapper over eager data. Co-authored-by: Mark Kittisopikul <mkitti@users.noreply.github.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

This PR addresses a sharded-write performance regression introduced with subchunk_write_order by caching lexicographic (C-order) chunk coordinates and precomputed tuple keys, avoiding repeated rebuilding of large per-shard coordinate grids during partial writes.

Changes:

Added cached lexicographic coordinate helpers in zarr.core.indexing and a new lexicographic_order_iter.
Updated the sharding codec to reuse cached lexicographic coordinate arrays/keys and simplified _ShardReader.to_dict_vectorized() to be shape-derived.
Added a unit test to validate to_dict_vectorized() behavior and included a changelog entry.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File	Description
`tests/test_codecs/test_sharding.py`	Adds coverage for `_ShardReader.to_dict_vectorized()` and switches tests to `lexicographic_order_iter`.
`src/zarr/core/indexing.py`	Introduces cached lexicographic coordinate/key generators and aliases `c_order_iter` to the new iterator.
`src/zarr/codecs/sharding.py`	Uses cached lexicographic coords/keys to eliminate per-write reconstruction overhead.
`changes/4001.misc.md`	Documents the performance restoration in the changelog.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

d-v-b · 2026-06-02T18:17:18Z

something I added here: I realized that we already had a c_order_iter function, but lexicographic_order_iter is a much better name for this functionality. so c_order_iter is now just a wrapper for lexicographic_order_iter.

d-v-b · 2026-06-02T18:22:28Z

thanks for the careful review all! I think I've addressed everything

mkitti · 2026-06-02T20:10:19Z


 def morton_order_iter(chunk_shape: tuple[int, ...]) -> Iterator[tuple[int, ...]]:
-    return iter(_morton_order_keys(tuple(chunk_shape)))
+    return iter(_morton_order_keys(chunk_shape))


Suggested change

return iter(_morton_order_keys(chunk_shape))

@lru_cache(maxsize=16)

def _morton_order_keys(chunk_shape: tuple[int, ...]) -> tuple[tuple[int, ...], ...]:

return tuple(morton_order_iter)

def morton_order_iter(chunk_shape: tuple[int, ...]) -> Iterator[tuple[int, ...]]:

return (tuple(int(x) for x in row) for row in _morton_order(chunk_shape))

This is also backwards.

The suggestion is not applying to the correct lines. Manual intervention is likely required.

done in 6840249

@mkitti

Per @mkitti: the morton pair was backwards in the same way the lexicographic pair was. Invert it to match — `morton_order_iter` is now the lazy generator primitive and `_morton_order_keys` collects it into a cached tuple, mirroring `lexicographic_order_iter` / `_lexicographic_order_keys`. No behavioral change for the in-tree consumers (all fully consume the sequence) and the Z-order is identical; this keeps the two coordinate- order families symmetric and gives morton the same lazy/early-exit option lexicographic now has. Co-authored-by: Mark Kittisopikul <mkitti@users.noreply.github.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

+def morton_order_iter(chunk_shape: tuple[int, ...]) -> Iterator[tuple[int, ...]]:
+    # Lazily yield the chunk grid coordinates in Morton (Z) order. The lazy
+    # generator is the primitive; `_morton_order_keys` collects it into a cached
+    # tuple for callers that need every coordinate repeatedly. Mirrors
+    # `lexicographic_order_iter` / `_lexicographic_order_keys`.
+    return (tuple(int(x) for x in row) for row in _morton_order(chunk_shape))


+def lexicographic_order_iter(chunk_shape: tuple[int, ...]) -> Iterator[tuple[int, ...]]:
+    # Lazily yield the chunk grid coordinates in lexicographic (row-major / C)
+    # order. This is the primitive: callers that only need a prefix (e.g. an
+    # early-exit `all(...)`) pay only for the coordinates they consume, instead
+    # of materializing the whole grid. Callers that need every coordinate
+    # repeatedly for the same shape should use `_lexicographic_order_keys`, which
+    # caches the fully collected tuple.
+    return (tuple(int(x) for x in row) for row in _lexicographic_order(chunk_shape))


mkitti · 2026-06-02T21:44:47Z

@codspeedbot explain why this is faster

mkitti · 2026-06-02T21:45:43Z

@codspeedbot explain why this is faster

codspeed-hq · 2026-06-02T21:45:48Z

@codspeedbot explain why this is faster

Why this PR is faster

I compared the flamegraphs between main and this PR for test_sharded_morton_write_single_chunk[(33, 33, 33)-memory] (the benchmark with the largest improvement: 391 ms → 226 ms, +72.9%). The speedup comes from eliminating two expensive, redundant computations in the sharded write hot path:

1. Cached coordinate keys eliminate per-write grid rebuilding (~180 ms saved)

On main, every call to _encode_partial_single triggered _encode_shard_dict → _subchunk_order_iter → _morton_order_keys, which rebuilt the full per-shard chunk coordinate grid from scratch. The nested generators in _morton_order_keys consumed 181 ms (23% of total time), dominated by expensive per-element array_item numpy scalar extraction (~115 ms in the inner <genexpr> alone).

With this PR, the morton/lexicographic order arrays and their tuple keys are cached, so _encode_partial_single's coordinate work drops from ~181 ms to <1 ms on subsequent writes to the same shard shape.

2. Pre-computed keys eliminate row-by-row tuple construction in `to_dict_vectorized` (~139 ms saved)

On main, _ShardReader.to_dict_vectorized spent 138.6 ms (17.6% of total time) building a dictionary key per row by calling tuple(coords.ravel()) in a loop. Each iteration hit array_item (63 ms), array_ravel (10.5 ms), and PyArray_Scalar (10 ms) — all numpy C-level operations that are fast individually but extremely costly when repeated tens of thousands of times (once per chunk in the shard).

With this PR, to_dict_vectorized receives pre-computed cached tuple keys directly, completely bypassing the per-row numpy extraction. This function no longer appears in the top hotspots.

Net effect

Path	`main`	This PR
`_subchunk_order_iter` → `_morton_order_keys` (coordinate grid)	181 ms	< 1 ms
`to_dict_vectorized` (tuple key construction)	138.6 ms	eliminated from hotspots
Total benchmark	786 ms	458 ms

The combined ~320 ms savings from these two paths accounts for the full improvement, restoring write throughput to the pre-#3826 baseline.

Summary

Warming up the performance wizard...
Analyzing performance improvement — Flamegraph analysis complete — identified two root causes of the speedup
Handling comment — Comment handling finished successfully

_{Was this response helpful? React with 👍 or 👎 to help us improve CodSpeedBot.}

mkitti · 2026-06-02T22:18:50Z

Claude review incoming...

mkitti

This review was authored by Claude (Claude Code), at the request of @mkitti.

Reviewed the diff critically. The core fix is sound — caching _lexicographic_order/_lexicographic_order_keys and feeding the cached keys straight into to_dict_vectorized/dict.fromkeys removes exactly the two per-write costs #3826 introduced (np.array(list(...)) and per-row tuple(coords.ravel())). The to_dict_vectorized() signature simplification is a genuine improvement (the reader already knows chunks_per_shard), order equivalence with np.ndindex holds across 0-d/empty/N-d, and chunk_coords_array[i]/chunk_coords_keys[i] are guaranteed aligned.

My concern is that the PR has accreted a lazy/eager + naming-symmetry refactor that overshoots the goal, and one piece works against it. Details inline, summarized by severity:

(high) The morton "symmetry" commit de-optimizes a hot write path — see comment on morton_order_iter.
(medium) _is_total_shard is actually made worse than the itertools.product it replaced, and the commit rationale claims the reverse — see comment on _is_total_shard.
(low-med) c_order_iter quietly changes operational characteristics (lazy/no-alloc → eager + cached numpy grid) — see comment on c_order_iter.

Minor: dropping tuple(chunk_shape) coercion in morton_order_iter means a non-tuple arg now fails inside lru_cache rather than being coerced (acceptable per type hints); _lexicographic_order caches a non-contiguous transposed view that pins its larger np.indices base alive; and the rename/soft-deprecation is scope creep relative to a perf-regression fix — worth a line in the PR description so the c_order_iter behavior change is intentional and visible.

Net: I'd approve the lexicographic caching + to_dict_vectorized() change as-is, but push back on item 1 and trim/correct the item 2 narrative before merge.

mkitti · 2026-06-02T22:19:45Z

    return order


+def morton_order_iter(chunk_shape: tuple[int, ...]) -> Iterator[tuple[int, ...]]:


Issue 1 (high): this de-optimizes a hot write path — the opposite of this PR's goal.

Previously morton_order_iter returned iter(_morton_order_keys(...)), i.e. it handed back an lru_cached tuple-of-tuples, so repeated calls were essentially free. Now it rebuilds every coordinate tuple from the (cached) array on every call.

morton_order_iter is consumed in _encode_shard_dict, the per-shard serialization path that runs on every shard write — so this reintroduces exactly the per-write tuple-reconstruction cost this PR removes for the lexicographic case. Two knock-on effects:

_morton_order_keys is now dead in src/ (only the benchmark's cache_clear references it); its cache is never populated in production.

The asymmetry is backwards: the lexicographic hot path uses cached _lexicographic_order_keys, while the morton hot path uses the uncached morton_order_iter.

Suggest reverting the morton inversion (keep morton_order_iter delegating to cached _morton_order_keys), or route the _encode_shard_dict call site through _morton_order_keys. Source-shape symmetry isn't worth a hot-path de-optimization in a perf PR.

— Claude (Claude Code)

mkitti · 2026-06-02T22:19:45Z

        return len(all_chunk_coords) == product(chunks_per_shard) and all(
-            chunk_coords in all_chunk_coords for chunk_coords in c_order_iter(chunks_per_shard)
+            chunk_coords in all_chunk_coords
+            for chunk_coords in lexicographic_order_iter(chunks_per_shard)


Issue 2 (medium): this is strictly worse than the code it replaced for the early-exit case, and the commit message claims the reverse.

This previously iterated c_order_iter = itertools.product — genuinely lazy, O(1) memory, true early-exit. lexicographic_order_iter calls _lexicographic_order, which eagerly materializes the whole np.indices grid before yielding the first tuple. So the "laziness" only skips tuple conversion of unconsumed rows, not the expensive grid build.

Concretely, commit f6d91e6's claim ("on a cold cache for a 32³ shard … ~15.8ms → ~24µs, the lazy generator builds one coordinate and bails") can't hold here — np.indices((32,32,32)) runs in full on the first next(). For the pure early-exit case on a cold cache, the original itertools.product was strictly better and used less memory.

In practice this is mostly masked because the dict.fromkeys hot path warms _lexicographic_order for the same shape. But if the early-exit win is actually wanted, make lexicographic_order_iter a true generator (e.g. itertools.product) and have _lexicographic_order_keys collect that — otherwise drop the early-exit framing from the commit history.

— Claude (Claude Code)

mkitti · 2026-06-02T22:19:45Z

+    return tuple(lexicographic_order_iter(chunk_shape))


 def c_order_iter(chunks_per_shard: tuple[int, ...]) -> Iterator[tuple[int, ...]]:


Issue 3 (low-med): the alias quietly changes operational characteristics.

c_order_iter went from pure-lazy itertools.product (no allocation, no cache) to an alias that materializes and lru_caches a full numpy grid. For any large-chunks_per_shard caller this is a cold-start + memory regression, and it now retains up to 16 large arrays. In practice chunks_per_shard is modest so it's likely fine — but it's a real semantic change presented as a no-op "soft-deprecated alias." Worth calling out explicitly in the PR description.

— Claude (Claude Code)

@mkitti

Replace the morton/lexicographic order iterators (and the c_order_iter alias) with two cached, numpy-backed sequences: `morton_order_coords(shape)` and `lexicographic_order_coords(shape)`, each returning the grid coordinates in that order as a tuple of coordinate tuples. This addresses several points from review: - The earlier "lazy primitive" inversion de-optimized the hot write path: `morton_order_iter` rebuilt every coordinate tuple from the array on each call, and that path runs in `_encode_shard_dict` on every shard write (~16ms/write at 32^3 chunks-per-shard). The coords are a finite set of known length reused in full, so they are an indexable sequence built once and cached, not a lazily-rebuilt generator. (per @mkitti) - `lexicographic_order_iter` was never genuinely lazy — `_lexicographic_order` materializes the whole `np.indices` grid up front — so the early-exit framing was inaccurate. (per @Copilot, @chuckwondo) - Two functions differing only in caching vs laziness was redundant (per @ilan-gold); there is now one sequence per order. `_ShardReader.__iter__` wraps it in `iter()`, the only site that needs an iterator. - `_is_total_shard` no longer iterates the order at all: `all_chunk_coords` is always a subset of the shard grid (guaranteed by `validate`'s shard/chunk divisibility check), so a count check proves totality. A subset assertion documents the invariant. Coordinates are Python int tuples because every consumer uses them as dict keys / set members, which numpy arrays cannot be (unhashable, mutable); the numpy array is kept only for the vectorized index lookup in `to_dict_vectorized`. The per-shape cache holds ~prod(chunks_per_shard) tuples (~0.07% of shard size for multi-GB shards with (64,64,64) chunks), capped at 16 shapes per order. Co-authored-by: Mark Kittisopikul <mkitti@users.noreply.github.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The existing test_sharded_morton_write_single_chunk clears the chunk-order cache before every iteration, so it only measures the cold grid-build cost. That made it blind to a regression where the per-shard coordinate tuples were rebuilt on every write instead of being reused from the cache — the cold benchmark could not distinguish the two (both pay the build each iteration). Add test_sharded_morton_write_single_chunk_warm_cache, which warms the cache once and then times repeated same-shape writes — the amortized regime the cache exists to optimize (many shards of one shape per array). Verified it discriminates: with the cached sequence it is ~4x faster than the cold benchmark, and a rebuild-every-write regression shows up as a ~4x slowdown here while staying invisible to the cold benchmark. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The fix caches the per-shard coordinate grid for every shard write, not only partial writes, and the win is amortized across repeated writes to same-shaped shards. Reword the note accordingly; keep it user-facing (the internal indexing helper refactor is not part of the public API). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ation `morton_order_coords` / `lexicographic_order_coords` built their tuple-of- tuples with a row-by-row `tuple(int(x) for x in row)` comprehension. Using `map(tuple, arr.tolist())` instead does the int conversion in a single C-level call, producing byte-identical native-int tuples ~8-9x faster (~16ms -> ~1.9ms cold build at 32^3). It is a per-shape cached build, so this only speeds the first write to each shard shape, but it is free. Also document in `to_dict_vectorized` why the chunk coordinates are needed in two forms — a numpy array for the vectorized index lookup and hashable tuples for the dict keys — since numpy rows are unhashable and a tuple list can't be used for the vectorized modulo/advanced-indexing. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

d-v-b · 2026-06-03T08:16:55Z

thanks for everyone's patience here. Some recent changes:

I added explanatory comments for why we need two representations (tuples of ints and numpy arrays of ints) of the same data in to_dict_vectorized. Just using Numpy arrays of ints would be more performant, but fully exploiting that requires changes to the calling code, which models chunk indices as tuples of ints. IMO we need to consider the chunk index representation a target for performance optimization.
I removed the premise of laziness from the *_order routines. Now they are not iterators, they return full collections of coordinates, and they have names that make this clear: morton_order_coords and lexicographic_order_coords. I also changed the name of the parameters from chunk_shape to just shape, because these functions are not just for chunks, and chunk_shape is actually rather ambiguous.

@mkitti feel free to throw your agents at the latest changes.

d-v-b · 2026-06-03T08:21:27Z

@claude please review

- Drop the O(n_chunks) assert in _is_total_shard. It built a fresh set(lexicographic_order_coords(...)) on every partial read/write to check an invariant `validate` already guarantees, regressing the very partial-access hot path this PR optimizes (~673us vs ~112ns at 32^3 chunks-per-shard) and vanishing under -O. The invariant is documented in the comment; the count check alone proves totality. - Cache the colexicographic subchunk order. The colex branch of _subchunk_order_iter rebuilt the grid via uncached np.ndindex on every write while its morton/lexicographic siblings hit the cache; add colexicographic_order_coords (cached, derived from lexicographic_order_coords of the reversed shape) and use it. - Fix two benchmark docstrings: the cold benchmark now clears the lexicographic caches too (the write path builds that grid via dict.fromkeys / to_dict_vectorized, so a morton-only clear left it warm and under-reported the cold cost); the warm benchmark docstring now describes what it actually exercises (repeated writes to one shard, which reuse the cache identically to writes across same-shaped shards). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

d-v-b added the benchmark Code will be benchmarked in a CI job. label May 26, 2026

d-v-b requested a review from ilan-gold May 26, 2026 21:04

d-v-b requested a review from maxrjones May 27, 2026 15:53

Merge branch 'main' into perf-sharding-coord-cache

6053018

d-v-b requested a review from mkitti May 29, 2026 14:54

Merge branch 'main' into perf-sharding-coord-cache

d74d857

maxrjones assigned d-v-b and maxrjones Jun 1, 2026

maxrjones reviewed Jun 1, 2026

View reviewed changes

d-v-b and others added 2 commits June 2, 2026 15:26

Merge branch 'main' into perf-sharding-coord-cache

2b8b095

ilan-gold approved these changes Jun 2, 2026

View reviewed changes

Comment thread src/zarr/core/indexing.py Outdated

Update src/zarr/core/indexing.py

51d0f05

Co-authored-by: Ilan Gold <ilanbassgold@gmail.com>

d-v-b removed the benchmark Code will be benchmarked in a CI job. label Jun 2, 2026

chuckwondo reviewed Jun 2, 2026

View reviewed changes

d-v-b and others added 4 commits June 2, 2026 18:01

Merge remote-tracking branch 'upstream/main' into perf-sharding-coord…

6ee3440

…-cache

mkitti reviewed Jun 2, 2026

View reviewed changes

mkitti requested a review from Copilot June 2, 2026 18:12

Copilot started reviewing on behalf of mkitti June 2, 2026 18:12 View session

Copilot AI reviewed Jun 2, 2026

View reviewed changes

Comment thread src/zarr/core/indexing.py Outdated

mkitti reviewed Jun 2, 2026

View reviewed changes

d-v-b added the benchmark Code will be benchmarked in a CI job. label Jun 2, 2026

Merge branch 'main' into perf-sharding-coord-cache

5a12e49

mkitti approved these changes Jun 2, 2026

View reviewed changes

mkitti requested a review from Copilot June 2, 2026 21:34

Copilot started reviewing on behalf of mkitti June 2, 2026 21:35 View session

Copilot AI reviewed Jun 2, 2026

View reviewed changes

mkitti reviewed Jun 2, 2026

View reviewed changes

d-v-b and others added 3 commits June 3, 2026 09:17

d-v-b removed the benchmark Code will be benchmarked in a CI job. label Jun 3, 2026

d-v-b added the benchmark Code will be benchmarked in a CI job. label Jun 3, 2026

		@@ -266,13 +269,19 @@ def __iter__(self) -> Iterator[tuple[int, ...]]:
		def to_dict_vectorized(

		def lexicographic_order_iter(chunk_shape: tuple[int, ...]) -> Iterator[tuple[int, ...]]:
		return iter(_lexicographic_order_keys(chunk_shape))

		return order


		def morton_order_iter(chunk_shape: tuple[int, ...]) -> Iterator[tuple[int, ...]]:

		return tuple(lexicographic_order_iter(chunk_shape))


		def c_order_iter(chunks_per_shard: tuple[int, ...]) -> Iterator[tuple[int, ...]]:

Uh oh!

Conversation

d-v-b commented May 26, 2026

Uh oh!

codecov Bot commented May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

codspeed-hq Bot commented May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Merging this PR will improve performance by 40.32%

Performance Changes

Footnotes

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ilan-gold left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mkitti Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

d-v-b commented Jun 2, 2026

Uh oh!

d-v-b commented Jun 2, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

mkitti commented Jun 2, 2026

Uh oh!

mkitti commented Jun 2, 2026

Uh oh!

codspeed-hq Bot commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Why this PR is faster

1. Cached coordinate keys eliminate per-write grid rebuilding (~180 ms saved)

2. Pre-computed keys eliminate row-by-row tuple construction in to_dict_vectorized (~139 ms saved)

Net effect

Uh oh!

mkitti commented Jun 2, 2026

Uh oh!

mkitti left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mkitti Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

codecov Bot commented May 26, 2026 •

edited

Loading

codspeed-hq Bot commented May 26, 2026 •

edited

Loading

mkitti Jun 2, 2026 •

edited

Loading

codspeed-hq Bot commented Jun 2, 2026 •

edited

Loading

2. Pre-computed keys eliminate row-by-row tuple construction in `to_dict_vectorized` (~139 ms saved)

mkitti left a comment •

edited

Loading

mkitti Jun 2, 2026 •

edited

Loading

mkitti Jun 2, 2026 •

edited

Loading

mkitti Jun 2, 2026 •

edited

Loading