Skip to content

perf: cache lexicographic chunk coords in sharding codec#4012

Open
d-v-b wants to merge 18 commits into
zarr-developers:mainfrom
d-v-b:perf-sharding-coord-cache
Open

perf: cache lexicographic chunk coords in sharding codec#4012
d-v-b wants to merge 18 commits into
zarr-developers:mainfrom
d-v-b:perf-sharding-coord-cache

Conversation

@d-v-b
Copy link
Copy Markdown
Contributor

@d-v-b d-v-b commented May 26, 2026

Fixes a performance regression introduced in #3826 that made morton-order-indexing slower.

for details, see claude's summary below.

The subchunk_write_order feature (#3826) regressed sharded write performance: _encode_partial_single rebuilt the full per-shard chunk coordinate grid on every write via
np.array(list(_subchunk_order_iter(..., "lexicographic"))), and to_dict_vectorized rebuilt a tuple key per row with tuple(coords.ravel()). For a single-chunk write into a shard with tens of thousands of chunks this roughly doubled write time (~22ms -> ~40ms on test_sharded_morton_write_single_chunk, matching the -44% CodSpeed regression).

Add cached _lexicographic_order (array) and _lexicographic_order_keys (tuples) helpers in indexing.py, mirroring _morton_order/_morton_order_keys, and pass the cached keys into to_dict_vectorized instead of deriving them row-by-row. This restores write throughput to the pre-#3826 baseline while preserving identical chunk ordering (verified equal to np.ndindex across shapes including 0-d and empty).

[Description of PR]

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/user-guide/*.md
  • Changes documented as a new file in changes/
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

The subchunk_write_order feature (zarr-developers#3826) regressed sharded write
performance: _encode_partial_single rebuilt the full per-shard chunk
coordinate grid on every write via
`np.array(list(_subchunk_order_iter(..., "lexicographic")))`, and
`to_dict_vectorized` rebuilt a tuple key per row with `tuple(coords.ravel())`.
For a single-chunk write into a shard with tens of thousands of chunks this
roughly doubled write time (~22ms -> ~40ms on test_sharded_morton_write_single_chunk,
matching the -44% CodSpeed regression).

Add cached `_lexicographic_order` (array) and `_lexicographic_order_keys`
(tuples) helpers in indexing.py, mirroring `_morton_order`/`_morton_order_keys`,
and pass the cached keys into `to_dict_vectorized` instead of deriving them
row-by-row. This restores write throughput to the pre-zarr-developers#3826 baseline while
preserving identical chunk ordering (verified equal to np.ndindex across
shapes including 0-d and empty).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@d-v-b d-v-b added the benchmark Code will be benchmarked in a CI job. label May 26, 2026
@d-v-b d-v-b requested a review from ilan-gold May 26, 2026 21:04
@codecov
Copy link
Copy Markdown

codecov Bot commented May 26, 2026

Codecov Report

❌ Patch coverage is 97.43590% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 93.55%. Comparing base (fe22910) to head (aa2bb30).

Files with missing lines Patch % Lines
src/zarr/codecs/sharding.py 93.33% 1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #4012   +/-   ##
=======================================
  Coverage   93.55%   93.55%           
=======================================
  Files          88       88           
  Lines       11896    11909   +13     
=======================================
+ Hits        11129    11142   +13     
  Misses        767      767           
Files with missing lines Coverage Δ
src/zarr/core/indexing.py 96.29% <100.00%> (+0.04%) ⬆️
src/zarr/codecs/sharding.py 92.20% <93.33%> (+0.06%) ⬆️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@codspeed-hq
Copy link
Copy Markdown

codspeed-hq Bot commented May 26, 2026

Merging this PR will improve performance by 40.32%

⚡ 3 improved benchmarks
❌ 3 regressed benchmarks
✅ 53 untouched benchmarks
🆕 10 new benchmarks
⏩ 13 skipped benchmarks1

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

Mode Benchmark BASE HEAD Efficiency
WallTime test_sharded_morton_single_chunk[(33, 33, 33)-memory] 2.2 ms 7.7 ms -71.7%
WallTime test_sharded_morton_single_chunk[(32, 32, 32)-memory] 2.1 ms 7 ms -69.24%
WallTime test_sharded_morton_single_chunk[(30, 30, 30)-memory] 2.1 ms 6.2 ms -65.96%
WallTime test_sharded_morton_write_single_chunk[(32, 32, 32)-memory] 350.3 ms 54 ms ×6.5
WallTime test_sharded_morton_write_single_chunk[(33, 33, 33)-memory] 391.3 ms 61.7 ms ×6.3
WallTime test_sharded_morton_write_single_chunk[(30, 30, 30)-memory] 292.7 ms 46.7 ms ×6.3
🆕 WallTime test_morton_order[(8, 8, 8)] N/A 469.3 µs N/A
🆕 WallTime test_morton_order[(10, 10, 10)] N/A 861.5 µs N/A
🆕 WallTime test_morton_order[(20, 20, 20)] N/A 5.6 ms N/A
🆕 WallTime test_morton_order[(30, 30, 30)] N/A 18.3 ms N/A
🆕 WallTime test_morton_order[(16, 16, 16)] N/A 2.5 ms N/A
🆕 WallTime test_morton_order[(32, 32, 32)] N/A 20.8 ms N/A
🆕 WallTime test_sharded_morton_write_single_chunk_warm_cache[(33, 33, 33)-memory] N/A 37.4 ms N/A
🆕 WallTime test_morton_order[(33, 33, 33)] N/A 24.6 ms N/A
🆕 WallTime test_sharded_morton_write_single_chunk_warm_cache[(30, 30, 30)-memory] N/A 28.6 ms N/A
🆕 WallTime test_sharded_morton_write_single_chunk_warm_cache[(32, 32, 32)-memory] N/A 33.1 ms N/A

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.


Comparing d-v-b:perf-sharding-coord-cache (2742897) with main (6ce787d)2

Open in CodSpeed

Footnotes

  1. 13 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

  2. No successful run was found on main (fe22910) during the generation of this report, so 6ce787d was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.

@d-v-b d-v-b requested a review from maxrjones May 27, 2026 15:53
@d-v-b d-v-b requested a review from mkitti May 29, 2026 14:54
Comment thread src/zarr/codecs/sharding.py Outdated
@@ -266,13 +269,19 @@ def __iter__(self) -> Iterator[tuple[int, ...]]:
def to_dict_vectorized(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the following reads cleaner than adding another parameter. The reader already knows its own chunks_per_shard, so it can fetch both cached structures itself instead of having them threaded in. With this, the call site simplifies to just shard_reader.to_dict_vectorized().

I tested it still fixes the regression — same ~1.6–1.8× speedup over main on test_sharded_morton_write_single_chunk, and the difference vs the committed version is within run-to-run noise.

    def to_dict_vectorized(self) -dict[tuple[int, ...], Buffer | None]:
        """Build a dict of chunk coordinates to buffers using vectorized lookup.

        The full per-shard chunk coordinate grid (both the array used for the
        vectorized index lookup and the plain tuples used as dict keys) is
        cached on ``chunks_per_shard``, so neither is rebuilt on every call.
        For a shard with tens of thousands of chunks this avoids reconstructing
        that many tuples on every partial write.

        Returns
        -------
        dict mapping chunk coordinate tuples to Buffer or None
        """
        chunks_per_shard = self.index.chunks_per_shard
        chunk_coords_array = _lexicographic_order(chunks_per_shard)
        chunk_coords_keys = _lexicographic_order_keys(chunks_per_shard)
        starts, ends, valid = self.index.get_chunk_slices_vectorized(chunk_coords_array)

        result: dict[tuple[int, ...], Buffer | None] = {}
        for i, coords in enumerate(chunk_coords_keys):
            if valid[i]:
                result[coords] = self.buf[int(starts[i]) : int(ends[i])]
            else:
                result[coords] = None

        return result

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's great feedback! have a look at 6e02c8f and see if things look better to you

d-v-b and others added 2 commits June 2, 2026 15:26
Address review feedback: `_ShardReader.to_dict_vectorized` took the
lexicographic coordinate array and key tuples as parameters, even though
the reader already knows its own `chunks_per_shard` and both structures
are `lru_cache`d. Thread nothing in — fetch them inside the method via
`_lexicographic_order`/`_lexicographic_order_keys`. Same cache, so no
perf change; the call site collapses to `to_dict_vectorized()`.

Add a unit test covering the method directly across 0-d, 1-d, and 2-d
shard grids: present chunks map to their stored bytes, empty chunks to
None, and every lexicographic coordinate appears as a key.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@ilan-gold ilan-gold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for the regression!

Comment thread src/zarr/core/indexing.py Outdated
Co-authored-by: Ilan Gold <ilanbassgold@gmail.com>
@d-v-b d-v-b removed the benchmark Code will be benchmarked in a CI job. label Jun 2, 2026
Comment thread src/zarr/core/indexing.py Outdated
Comment on lines +1610 to +1611
def lexicographic_order_iter(chunk_shape: tuple[int, ...]) -> Iterator[tuple[int, ...]]:
return iter(_lexicographic_order_keys(chunk_shape))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure this makes a lot of sense. Why would we provide a "convenience" function for producing an "lazy" iterator over something that has already been eagerly constructed? Shouldn't we expose the lazy option to the user and then the user can eagerly collect the results, if that's what they want. Otherwise, we simply provide no means for the caller to get lazy results.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although, in this case, does being lazy even matter? I'm not familiar enough with what's going on here to tell if chunk_shape is always something relatively small, in which case laziness is likely unnecessary.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't want to open this can of worms, but I agree things are funny here. I don't think the iter function here is super important but

user can eagerly collect the results, if that's what they want.

The idea here is definitely to cache the eager results and then let the user create an iterator. But I don't even think an iterator is necessary _lexicographic_order_keys should satisfy the APIs that lexicographic_order_iter is meant for e.e.,g

dict.fromkeys((1, 2, 3))

works just fine (which is one place where the iterator is used).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It really should be the other way around. We should define the lazy version and the eager version should just be "collect" the lazy version.

The problem is that pay an enormous penalty for generating the entire list of all chunks when we might only just want the first one.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed! this pattern is what you get when you optimize for benchmarks. we might need to relax that concern, or add a separate benchmark that checks "time to fetch 1 chunk"

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @chuckwondo's point indicates something more basic -- there's no reason to expose an iteration interface, lazy or not, for something that has a defined size, like a dense, finite, coordinate grid. We should be exposing an indexable data structure from the start.

d-v-b and others added 4 commits June 2, 2026 18:01
Address review feedback from @ilan-gold and @chuckwondo on the
`lexicographic_order_iter` helper.

`lexicographic_order_iter` returned a *lazy* iterator over an
*eagerly-built, cached* tuple (`_lexicographic_order_keys`), which
chuckwondo rightly flagged as confusing — and its output is byte-for-byte
identical to the pre-existing, genuinely-lazy `c_order_iter`
(verified across 0-d, empty, and N-d shapes). So the name promised
laziness the implementation didn't provide, over a sequence we could
already produce.

Remove the wrapper and use the cached `_lexicographic_order_keys` tuple
directly at the two `dict.fromkeys` call sites and in
`_subchunk_order_iter`. This keeps the eager/cached coordinate tuples —
which is the actual optimization: `dict.fromkeys` over the cached tuple
is ~1.4x faster than over lazy `c_order_iter` at 32^3 (≈900us vs ≈1300us),
because the cache amortizes tuple construction across repeated writes to
same-shaped shards. Switching to `c_order_iter` would have reintroduced
that cost, so it is deliberately not used here.

Also drop the now-dead `tuple()` wrap in `morton_order_iter` (its
argument is typed `tuple[int, ...]` and every caller passes one), per
ilan-gold.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Reconcile the remote branch, which had two commits not present locally:
- 2b8b095 Merge branch 'main' (same upstream tooling commits already
  merged locally in 6ee3440; no new content)
- 51d0f05 "Update src/zarr/core/indexing.py" (Co-authored-by: Ilan Gold)
  dropped the redundant tuple() wrap from lexicographic_order_iter.

Conflict in src/zarr/core/indexing.py resolved in favor of the local
approach: lexicographic_order_iter was removed entirely as redundant with
c_order_iter (identical output across all shapes), with its call sites
switched to the cached _lexicographic_order_keys tuple. That supersedes
ilan-gold's narrower tuple()-trim — the function they edited no longer
exists — while addressing the same review feedback (and chuckwondo's
point about the misleading lazy-over-eager wrapper). ilan-gold's commit
is preserved in history as a merge parent.

Co-authored-by: Ilan Gold <ilanbassgold@gmail.com>
…_order_iter

`c_order_iter` names a memory layout ("C order") rather than what the
iterator actually yields. Reintroduce `lexicographic_order_iter` as the
clearer name for the same row-major coordinate sequence, and make
`c_order_iter` a thin alias that delegates to it, with a docstring note
steering new code to the preferred name. No runtime warning — these are
internal helpers.

`lexicographic_order_iter` keeps the eager/cached implementation
(iter over the lru_cached `_lexicographic_order_keys` tuple), which is
~1.4x faster than the old lazy `itertools.product` on the
`dict.fromkeys` shard-write path and is the optimization this branch
exists to deliver. The alias therefore changes `c_order_iter` from lazy
to eager/cached; all in-repo callers (_ShardReader.__iter__,
_is_total_shard, _subchunk_order_iter, and two tests) are migrated to
`lexicographic_order_iter`, so nothing in-tree relies on the old laziness.

Output is unchanged: lexicographic_order_iter, the c_order_iter alias,
and np.ndindex all agree across 0-d, empty, and N-d shapes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread src/zarr/core/indexing.py Outdated
Comment on lines +1606 to +1616
def _lexicographic_order_keys(chunk_shape: tuple[int, ...]) -> tuple[tuple[int, ...], ...]:
return tuple(tuple(int(x) for x in row) for row in _lexicographic_order(chunk_shape))


def lexicographic_order_iter(chunk_shape: tuple[int, ...]) -> Iterator[tuple[int, ...]]:
# Iterate the chunk grid in lexicographic (row-major / C) order. The full
# coordinate tuple is built once and cached on `chunk_shape` (see
# _lexicographic_order_keys), so repeated calls for the same shape — e.g. one
# per shard write — reuse it rather than rebuilding tens of thousands of
# tuples each time.
return iter(_lexicographic_order_keys(chunk_shape))
Copy link
Copy Markdown
Contributor

@mkitti mkitti Jun 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def _lexicographic_order_keys(chunk_shape: tuple[int, ...]) -> tuple[tuple[int, ...], ...]:
return tuple(tuple(int(x) for x in row) for row in _lexicographic_order(chunk_shape))
def lexicographic_order_iter(chunk_shape: tuple[int, ...]) -> Iterator[tuple[int, ...]]:
# Iterate the chunk grid in lexicographic (row-major / C) order. The full
# coordinate tuple is built once and cached on `chunk_shape` (see
# _lexicographic_order_keys), so repeated calls for the same shape — e.g. one
# per shard write — reuse it rather than rebuilding tens of thousands of
# tuples each time.
return iter(_lexicographic_order_keys(chunk_shape))
def _lexicographic_order_keys(chunk_shape: tuple[int, ...]) -> tuple[tuple[int, ...], ...]:
return tuple(lexicographic_order_iter(chunk_shape))
def lexicographic_order_iter(chunk_shape: tuple[int, ...]) -> Iterator[tuple[int, ...]]:
# Iterate the chunk grid in lexicographic (row-major / C) order. The full
# coordinate tuple is built once and cached on `chunk_shape` (see
# _lexicographic_order_keys), so repeated calls for the same shape — e.g. one
# per shard write — reuse it rather than rebuilding tens of thousands of
# tuples each time.
return (tuple(int(x) for x in row) for row in _lexicographic_order(chunk_shape))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The lazy "lexicographic_order_iter" uses a generator comprehension. The eager "_lexicographic_order_keys " just pulls that into a tuple.

Per review from @mkitti: invert the relationship between the lazy
iterator and the eagerly-collected tuple. `lexicographic_order_iter` is
now a genuine lazy generator over the chunk-grid coordinates, and
`_lexicographic_order_keys` collects it into a cached tuple — the eager
version is "collect the lazy one", not the other way around.

Previously lexicographic_order_iter returned iter() over the cached
tuple, so any consumer that only needed a prefix still paid to
materialize the entire grid. _is_total_shard does exactly that — an
early-exit `all(coord in set for coord in ...)` — and on a cold cache
for a 32^3 shard whose first coordinate is absent this dropped from
~15.8ms to ~24us (the lazy generator builds one coordinate and bails).

The hot path is unchanged: the two dict.fromkeys sites consume the full
grid and use the cached `_lexicographic_order_keys` tuple directly
(~0.9ms at 32^3), so the regression fix this branch delivers is intact.
This also resolves @chuckwondo's point — the iterator is now actually
lazy rather than a thin wrapper over eager data.

Co-authored-by: Mark Kittisopikul <mkitti@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses a sharded-write performance regression introduced with subchunk_write_order by caching lexicographic (C-order) chunk coordinates and precomputed tuple keys, avoiding repeated rebuilding of large per-shard coordinate grids during partial writes.

Changes:

  • Added cached lexicographic coordinate helpers in zarr.core.indexing and a new lexicographic_order_iter.
  • Updated the sharding codec to reuse cached lexicographic coordinate arrays/keys and simplified _ShardReader.to_dict_vectorized() to be shape-derived.
  • Added a unit test to validate to_dict_vectorized() behavior and included a changelog entry.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File Description
tests/test_codecs/test_sharding.py Adds coverage for _ShardReader.to_dict_vectorized() and switches tests to lexicographic_order_iter.
src/zarr/core/indexing.py Introduces cached lexicographic coordinate/key generators and aliases c_order_iter to the new iterator.
src/zarr/codecs/sharding.py Uses cached lexicographic coords/keys to eliminate per-write reconstruction overhead.
changes/4001.misc.md Documents the performance restoration in the changelog.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/zarr/core/indexing.py Outdated
@d-v-b
Copy link
Copy Markdown
Contributor Author

d-v-b commented Jun 2, 2026

something I added here: I realized that we already had a c_order_iter function, but lexicographic_order_iter is a much better name for this functionality. so c_order_iter is now just a wrapper for lexicographic_order_iter.

@d-v-b
Copy link
Copy Markdown
Contributor Author

d-v-b commented Jun 2, 2026

thanks for the careful review all! I think I've addressed everything

Comment thread src/zarr/core/indexing.py Outdated
Comment on lines +1578 to +1584

def morton_order_iter(chunk_shape: tuple[int, ...]) -> Iterator[tuple[int, ...]]:
return iter(_morton_order_keys(tuple(chunk_shape)))
return iter(_morton_order_keys(chunk_shape))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return iter(_morton_order_keys(chunk_shape))
@lru_cache(maxsize=16)
def _morton_order_keys(chunk_shape: tuple[int, ...]) -> tuple[tuple[int, ...], ...]:
return tuple(morton_order_iter)
def morton_order_iter(chunk_shape: tuple[int, ...]) -> Iterator[tuple[int, ...]]:
return (tuple(int(x) for x in row) for row in _morton_order(chunk_shape))

This is also backwards.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The suggestion is not applying to the correct lines. Manual intervention is likely required.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in 6840249

Per @mkitti: the morton pair was backwards in the same way the
lexicographic pair was. Invert it to match — `morton_order_iter` is now
the lazy generator primitive and `_morton_order_keys` collects it into a
cached tuple, mirroring `lexicographic_order_iter` /
`_lexicographic_order_keys`.

No behavioral change for the in-tree consumers (all fully consume the
sequence) and the Z-order is identical; this keeps the two coordinate-
order families symmetric and gives morton the same lazy/early-exit
option lexicographic now has.

Co-authored-by: Mark Kittisopikul <mkitti@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@d-v-b d-v-b added the benchmark Code will be benchmarked in a CI job. label Jun 2, 2026
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

Comment thread src/zarr/core/indexing.py Outdated
Comment on lines +1578 to +1583
def morton_order_iter(chunk_shape: tuple[int, ...]) -> Iterator[tuple[int, ...]]:
# Lazily yield the chunk grid coordinates in Morton (Z) order. The lazy
# generator is the primitive; `_morton_order_keys` collects it into a cached
# tuple for callers that need every coordinate repeatedly. Mirrors
# `lexicographic_order_iter` / `_lexicographic_order_keys`.
return (tuple(int(x) for x in row) for row in _morton_order(chunk_shape))
Comment thread src/zarr/core/indexing.py Outdated
Comment on lines +1609 to +1616
def lexicographic_order_iter(chunk_shape: tuple[int, ...]) -> Iterator[tuple[int, ...]]:
# Lazily yield the chunk grid coordinates in lexicographic (row-major / C)
# order. This is the primitive: callers that only need a prefix (e.g. an
# early-exit `all(...)`) pay only for the coordinates they consume, instead
# of materializing the whole grid. Callers that need every coordinate
# repeatedly for the same shape should use `_lexicographic_order_keys`, which
# caches the fully collected tuple.
return (tuple(int(x) for x in row) for row in _lexicographic_order(chunk_shape))
@mkitti
Copy link
Copy Markdown
Contributor

mkitti commented Jun 2, 2026

@codspeedbot explain why this is faster

1 similar comment
@mkitti
Copy link
Copy Markdown
Contributor

mkitti commented Jun 2, 2026

@codspeedbot explain why this is faster

@codspeed-hq
Copy link
Copy Markdown

codspeed-hq Bot commented Jun 2, 2026

@codspeedbot explain why this is faster

Why this PR is faster

I compared the flamegraphs between main and this PR for test_sharded_morton_write_single_chunk[(33, 33, 33)-memory] (the benchmark with the largest improvement: 391 ms → 226 ms, +72.9%). The speedup comes from eliminating two expensive, redundant computations in the sharded write hot path:

1. Cached coordinate keys eliminate per-write grid rebuilding (~180 ms saved)

On main, every call to _encode_partial_single triggered _encode_shard_dict_subchunk_order_iter_morton_order_keys, which rebuilt the full per-shard chunk coordinate grid from scratch. The nested generators in _morton_order_keys consumed 181 ms (23% of total time), dominated by expensive per-element array_item numpy scalar extraction (~115 ms in the inner <genexpr> alone).

With this PR, the morton/lexicographic order arrays and their tuple keys are cached, so _encode_partial_single's coordinate work drops from ~181 ms to <1 ms on subsequent writes to the same shard shape.

2. Pre-computed keys eliminate row-by-row tuple construction in to_dict_vectorized (~139 ms saved)

On main, _ShardReader.to_dict_vectorized spent 138.6 ms (17.6% of total time) building a dictionary key per row by calling tuple(coords.ravel()) in a loop. Each iteration hit array_item (63 ms), array_ravel (10.5 ms), and PyArray_Scalar (10 ms) — all numpy C-level operations that are fast individually but extremely costly when repeated tens of thousands of times (once per chunk in the shard).

With this PR, to_dict_vectorized receives pre-computed cached tuple keys directly, completely bypassing the per-row numpy extraction. This function no longer appears in the top hotspots.

Net effect

Path main This PR
_subchunk_order_iter_morton_order_keys (coordinate grid) 181 ms < 1 ms
to_dict_vectorized (tuple key construction) 138.6 ms eliminated from hotspots
Total benchmark 786 ms 458 ms

The combined ~320 ms savings from these two paths accounts for the full improvement, restoring write throughput to the pre-#3826 baseline.

Summary
  • Warming up the performance wizard...
  • Analyzing performance improvement — Flamegraph analysis complete — identified two root causes of the speedup
  • Handling comment — Comment handling finished successfully

Was this response helpful? React with 👍 or 👎 to help us improve CodSpeedBot.

@mkitti
Copy link
Copy Markdown
Contributor

mkitti commented Jun 2, 2026

Claude review incoming...

Copy link
Copy Markdown
Contributor

@mkitti mkitti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This review was authored by Claude (Claude Code), at the request of @mkitti.

Reviewed the diff critically. The core fix is sound — caching _lexicographic_order/_lexicographic_order_keys and feeding the cached keys straight into to_dict_vectorized/dict.fromkeys removes exactly the two per-write costs #3826 introduced (np.array(list(...)) and per-row tuple(coords.ravel())). The to_dict_vectorized() signature simplification is a genuine improvement (the reader already knows chunks_per_shard), order equivalence with np.ndindex holds across 0-d/empty/N-d, and chunk_coords_array[i]/chunk_coords_keys[i] are guaranteed aligned.

My concern is that the PR has accreted a lazy/eager + naming-symmetry refactor that overshoots the goal, and one piece works against it. Details inline, summarized by severity:

  1. (high) The morton "symmetry" commit de-optimizes a hot write path — see comment on morton_order_iter.
  2. (medium) _is_total_shard is actually made worse than the itertools.product it replaced, and the commit rationale claims the reverse — see comment on _is_total_shard.
  3. (low-med) c_order_iter quietly changes operational characteristics (lazy/no-alloc → eager + cached numpy grid) — see comment on c_order_iter.

Minor: dropping tuple(chunk_shape) coercion in morton_order_iter means a non-tuple arg now fails inside lru_cache rather than being coerced (acceptable per type hints); _lexicographic_order caches a non-contiguous transposed view that pins its larger np.indices base alive; and the rename/soft-deprecation is scope creep relative to a perf-regression fix — worth a line in the PR description so the c_order_iter behavior change is intentional and visible.

Net: I'd approve the lexicographic caching + to_dict_vectorized() change as-is, but push back on item 1 and trim/correct the item 2 narrative before merge.

Comment thread src/zarr/core/indexing.py Outdated
return order


def morton_order_iter(chunk_shape: tuple[int, ...]) -> Iterator[tuple[int, ...]]:
Copy link
Copy Markdown
Contributor

@mkitti mkitti Jun 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue 1 (high): this de-optimizes a hot write path — the opposite of this PR's goal.

Previously morton_order_iter returned iter(_morton_order_keys(...)), i.e. it handed back an lru_cached tuple-of-tuples, so repeated calls were essentially free. Now it rebuilds every coordinate tuple from the (cached) array on every call.

morton_order_iter is consumed in _encode_shard_dict, the per-shard serialization path that runs on every shard write — so this reintroduces exactly the per-write tuple-reconstruction cost this PR removes for the lexicographic case. Two knock-on effects:

  • _morton_order_keys is now dead in src/ (only the benchmark's cache_clear references it); its cache is never populated in production.
  • The asymmetry is backwards: the lexicographic hot path uses cached _lexicographic_order_keys, while the morton hot path uses the uncached morton_order_iter.

Suggest reverting the morton inversion (keep morton_order_iter delegating to cached _morton_order_keys), or route the _encode_shard_dict call site through _morton_order_keys. Source-shape symmetry isn't worth a hot-path de-optimization in a perf PR.

— Claude (Claude Code)

Comment thread src/zarr/codecs/sharding.py Outdated
return len(all_chunk_coords) == product(chunks_per_shard) and all(
chunk_coords in all_chunk_coords for chunk_coords in c_order_iter(chunks_per_shard)
chunk_coords in all_chunk_coords
for chunk_coords in lexicographic_order_iter(chunks_per_shard)
Copy link
Copy Markdown
Contributor

@mkitti mkitti Jun 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue 2 (medium): this is strictly worse than the code it replaced for the early-exit case, and the commit message claims the reverse.

This previously iterated c_order_iter = itertools.product — genuinely lazy, O(1) memory, true early-exit. lexicographic_order_iter calls _lexicographic_order, which eagerly materializes the whole np.indices grid before yielding the first tuple. So the "laziness" only skips tuple conversion of unconsumed rows, not the expensive grid build.

Concretely, commit f6d91e6's claim ("on a cold cache for a 32³ shard … ~15.8ms → ~24µs, the lazy generator builds one coordinate and bails") can't hold here — np.indices((32,32,32)) runs in full on the first next(). For the pure early-exit case on a cold cache, the original itertools.product was strictly better and used less memory.

In practice this is mostly masked because the dict.fromkeys hot path warms _lexicographic_order for the same shape. But if the early-exit win is actually wanted, make lexicographic_order_iter a true generator (e.g. itertools.product) and have _lexicographic_order_keys collect that — otherwise drop the early-exit framing from the commit history.

— Claude (Claude Code)

Comment thread src/zarr/core/indexing.py Outdated
return tuple(lexicographic_order_iter(chunk_shape))


def c_order_iter(chunks_per_shard: tuple[int, ...]) -> Iterator[tuple[int, ...]]:
Copy link
Copy Markdown
Contributor

@mkitti mkitti Jun 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue 3 (low-med): the alias quietly changes operational characteristics.

c_order_iter went from pure-lazy itertools.product (no allocation, no cache) to an alias that materializes and lru_caches a full numpy grid. For any large-chunks_per_shard caller this is a cold-start + memory regression, and it now retains up to 16 large arrays. In practice chunks_per_shard is modest so it's likely fine — but it's a real semantic change presented as a no-op "soft-deprecated alias." Worth calling out explicitly in the PR description.

— Claude (Claude Code)

d-v-b and others added 3 commits June 3, 2026 09:17
Replace the morton/lexicographic order iterators (and the c_order_iter
alias) with two cached, numpy-backed sequences:
`morton_order_coords(shape)` and `lexicographic_order_coords(shape)`,
each returning the grid coordinates in that order as a tuple of
coordinate tuples.

This addresses several points from review:

- The earlier "lazy primitive" inversion de-optimized the hot write
  path: `morton_order_iter` rebuilt every coordinate tuple from the
  array on each call, and that path runs in `_encode_shard_dict` on
  every shard write (~16ms/write at 32^3 chunks-per-shard). The coords
  are a finite set of known length reused in full, so they are an
  indexable sequence built once and cached, not a lazily-rebuilt
  generator. (per @mkitti)

- `lexicographic_order_iter` was never genuinely lazy — `_lexicographic_order`
  materializes the whole `np.indices` grid up front — so the early-exit
  framing was inaccurate. (per @Copilot, @chuckwondo)

- Two functions differing only in caching vs laziness was redundant
  (per @ilan-gold); there is now one sequence per order. `_ShardReader.__iter__`
  wraps it in `iter()`, the only site that needs an iterator.

- `_is_total_shard` no longer iterates the order at all: `all_chunk_coords`
  is always a subset of the shard grid (guaranteed by `validate`'s
  shard/chunk divisibility check), so a count check proves totality. A
  subset assertion documents the invariant.

Coordinates are Python int tuples because every consumer uses them as
dict keys / set members, which numpy arrays cannot be (unhashable,
mutable); the numpy array is kept only for the vectorized index lookup
in `to_dict_vectorized`. The per-shape cache holds ~prod(chunks_per_shard)
tuples (~0.07% of shard size for multi-GB shards with (64,64,64) chunks),
capped at 16 shapes per order.

Co-authored-by: Mark Kittisopikul <mkitti@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The existing test_sharded_morton_write_single_chunk clears the chunk-order
cache before every iteration, so it only measures the cold grid-build cost.
That made it blind to a regression where the per-shard coordinate tuples
were rebuilt on every write instead of being reused from the cache — the
cold benchmark could not distinguish the two (both pay the build each
iteration).

Add test_sharded_morton_write_single_chunk_warm_cache, which warms the
cache once and then times repeated same-shape writes — the amortized
regime the cache exists to optimize (many shards of one shape per array).
Verified it discriminates: with the cached sequence it is ~4x faster than
the cold benchmark, and a rebuild-every-write regression shows up as a ~4x
slowdown here while staying invisible to the cold benchmark.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The fix caches the per-shard coordinate grid for every shard write, not
only partial writes, and the win is amortized across repeated writes to
same-shaped shards. Reword the note accordingly; keep it user-facing
(the internal indexing helper refactor is not part of the public API).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@d-v-b d-v-b removed the benchmark Code will be benchmarked in a CI job. label Jun 3, 2026
…ation

`morton_order_coords` / `lexicographic_order_coords` built their tuple-of-
tuples with a row-by-row `tuple(int(x) for x in row)` comprehension. Using
`map(tuple, arr.tolist())` instead does the int conversion in a single
C-level call, producing byte-identical native-int tuples ~8-9x faster
(~16ms -> ~1.9ms cold build at 32^3). It is a per-shape cached build, so
this only speeds the first write to each shard shape, but it is free.

Also document in `to_dict_vectorized` why the chunk coordinates are needed
in two forms — a numpy array for the vectorized index lookup and hashable
tuples for the dict keys — since numpy rows are unhashable and a tuple list
can't be used for the vectorized modulo/advanced-indexing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@d-v-b d-v-b added the benchmark Code will be benchmarked in a CI job. label Jun 3, 2026
@d-v-b
Copy link
Copy Markdown
Contributor Author

d-v-b commented Jun 3, 2026

thanks for everyone's patience here. Some recent changes:

  • I added explanatory comments for why we need two representations (tuples of ints and numpy arrays of ints) of the same data in to_dict_vectorized. Just using Numpy arrays of ints would be more performant, but fully exploiting that requires changes to the calling code, which models chunk indices as tuples of ints. IMO we need to consider the chunk index representation a target for performance optimization.
  • I removed the premise of laziness from the *_order routines. Now they are not iterators, they return full collections of coordinates, and they have names that make this clear: morton_order_coords and lexicographic_order_coords. I also changed the name of the parameters from chunk_shape to just shape, because these functions are not just for chunks, and chunk_shape is actually rather ambiguous.

@mkitti feel free to throw your agents at the latest changes.

@d-v-b
Copy link
Copy Markdown
Contributor Author

d-v-b commented Jun 3, 2026

@claude please review

- Drop the O(n_chunks) assert in _is_total_shard. It built a fresh
  set(lexicographic_order_coords(...)) on every partial read/write to
  check an invariant `validate` already guarantees, regressing the very
  partial-access hot path this PR optimizes (~673us vs ~112ns at 32^3
  chunks-per-shard) and vanishing under -O. The invariant is documented
  in the comment; the count check alone proves totality.

- Cache the colexicographic subchunk order. The colex branch of
  _subchunk_order_iter rebuilt the grid via uncached np.ndindex on every
  write while its morton/lexicographic siblings hit the cache; add
  colexicographic_order_coords (cached, derived from lexicographic_order_coords
  of the reversed shape) and use it.

- Fix two benchmark docstrings: the cold benchmark now clears the
  lexicographic caches too (the write path builds that grid via
  dict.fromkeys / to_dict_vectorized, so a morton-only clear left it warm
  and under-reported the cold cost); the warm benchmark docstring now
  describes what it actually exercises (repeated writes to one shard,
  which reuse the cache identically to writes across same-shaped shards).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

benchmark Code will be benchmarked in a CI job.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants