perf: cache lexicographic chunk coords in sharding codec#4012
Conversation
The subchunk_write_order feature (zarr-developers#3826) regressed sharded write performance: _encode_partial_single rebuilt the full per-shard chunk coordinate grid on every write via `np.array(list(_subchunk_order_iter(..., "lexicographic")))`, and `to_dict_vectorized` rebuilt a tuple key per row with `tuple(coords.ravel())`. For a single-chunk write into a shard with tens of thousands of chunks this roughly doubled write time (~22ms -> ~40ms on test_sharded_morton_write_single_chunk, matching the -44% CodSpeed regression). Add cached `_lexicographic_order` (array) and `_lexicographic_order_keys` (tuples) helpers in indexing.py, mirroring `_morton_order`/`_morton_order_keys`, and pass the cached keys into `to_dict_vectorized` instead of deriving them row-by-row. This restores write throughput to the pre-zarr-developers#3826 baseline while preserving identical chunk ordering (verified equal to np.ndindex across shapes including 0-d and empty). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #4012 +/- ##
=======================================
Coverage 93.55% 93.55%
=======================================
Files 88 88
Lines 11896 11909 +13
=======================================
+ Hits 11129 11142 +13
Misses 767 767
🚀 New features to boost your workflow:
|
Merging this PR will improve performance by 40.32%
Warning Please fix the performance issues or acknowledge them on CodSpeed. Performance ChangesTip Investigate this regression by commenting Comparing Footnotes
|
| @@ -266,13 +269,19 @@ def __iter__(self) -> Iterator[tuple[int, ...]]: | |||
| def to_dict_vectorized( | |||
There was a problem hiding this comment.
I think the following reads cleaner than adding another parameter. The reader already knows its own chunks_per_shard, so it can fetch both cached structures itself instead of having them threaded in. With this, the call site simplifies to just shard_reader.to_dict_vectorized().
I tested it still fixes the regression — same ~1.6–1.8× speedup over main on test_sharded_morton_write_single_chunk, and the difference vs the committed version is within run-to-run noise.
def to_dict_vectorized(self) -dict[tuple[int, ...], Buffer | None]:
"""Build a dict of chunk coordinates to buffers using vectorized lookup.
The full per-shard chunk coordinate grid (both the array used for the
vectorized index lookup and the plain tuples used as dict keys) is
cached on ``chunks_per_shard``, so neither is rebuilt on every call.
For a shard with tens of thousands of chunks this avoids reconstructing
that many tuples on every partial write.
Returns
-------
dict mapping chunk coordinate tuples to Buffer or None
"""
chunks_per_shard = self.index.chunks_per_shard
chunk_coords_array = _lexicographic_order(chunks_per_shard)
chunk_coords_keys = _lexicographic_order_keys(chunks_per_shard)
starts, ends, valid = self.index.get_chunk_slices_vectorized(chunk_coords_array)
result: dict[tuple[int, ...], Buffer | None] = {}
for i, coords in enumerate(chunk_coords_keys):
if valid[i]:
result[coords] = self.buf[int(starts[i]) : int(ends[i])]
else:
result[coords] = None
return resultThere was a problem hiding this comment.
that's great feedback! have a look at 6e02c8f and see if things look better to you
Address review feedback: `_ShardReader.to_dict_vectorized` took the lexicographic coordinate array and key tuples as parameters, even though the reader already knows its own `chunks_per_shard` and both structures are `lru_cache`d. Thread nothing in — fetch them inside the method via `_lexicographic_order`/`_lexicographic_order_keys`. Same cache, so no perf change; the call site collapses to `to_dict_vectorized()`. Add a unit test covering the method directly across 0-d, 1-d, and 2-d shard grids: present chunks map to their stored bytes, empty chunks to None, and every lexicographic coordinate appears as a key. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ilan-gold
left a comment
There was a problem hiding this comment.
Apologies for the regression!
Co-authored-by: Ilan Gold <ilanbassgold@gmail.com>
| def lexicographic_order_iter(chunk_shape: tuple[int, ...]) -> Iterator[tuple[int, ...]]: | ||
| return iter(_lexicographic_order_keys(chunk_shape)) |
There was a problem hiding this comment.
Not sure this makes a lot of sense. Why would we provide a "convenience" function for producing an "lazy" iterator over something that has already been eagerly constructed? Shouldn't we expose the lazy option to the user and then the user can eagerly collect the results, if that's what they want. Otherwise, we simply provide no means for the caller to get lazy results.
There was a problem hiding this comment.
Although, in this case, does being lazy even matter? I'm not familiar enough with what's going on here to tell if chunk_shape is always something relatively small, in which case laziness is likely unnecessary.
There was a problem hiding this comment.
I didn't want to open this can of worms, but I agree things are funny here. I don't think the iter function here is super important but
user can eagerly collect the results, if that's what they want.
The idea here is definitely to cache the eager results and then let the user create an iterator. But I don't even think an iterator is necessary _lexicographic_order_keys should satisfy the APIs that lexicographic_order_iter is meant for e.e.,g
dict.fromkeys((1, 2, 3))works just fine (which is one place where the iterator is used).
There was a problem hiding this comment.
It really should be the other way around. We should define the lazy version and the eager version should just be "collect" the lazy version.
The problem is that pay an enormous penalty for generating the entire list of all chunks when we might only just want the first one.
There was a problem hiding this comment.
agreed! this pattern is what you get when you optimize for benchmarks. we might need to relax that concern, or add a separate benchmark that checks "time to fetch 1 chunk"
There was a problem hiding this comment.
I think @chuckwondo's point indicates something more basic -- there's no reason to expose an iteration interface, lazy or not, for something that has a defined size, like a dense, finite, coordinate grid. We should be exposing an indexable data structure from the start.
Address review feedback from @ilan-gold and @chuckwondo on the `lexicographic_order_iter` helper. `lexicographic_order_iter` returned a *lazy* iterator over an *eagerly-built, cached* tuple (`_lexicographic_order_keys`), which chuckwondo rightly flagged as confusing — and its output is byte-for-byte identical to the pre-existing, genuinely-lazy `c_order_iter` (verified across 0-d, empty, and N-d shapes). So the name promised laziness the implementation didn't provide, over a sequence we could already produce. Remove the wrapper and use the cached `_lexicographic_order_keys` tuple directly at the two `dict.fromkeys` call sites and in `_subchunk_order_iter`. This keeps the eager/cached coordinate tuples — which is the actual optimization: `dict.fromkeys` over the cached tuple is ~1.4x faster than over lazy `c_order_iter` at 32^3 (≈900us vs ≈1300us), because the cache amortizes tuple construction across repeated writes to same-shaped shards. Switching to `c_order_iter` would have reintroduced that cost, so it is deliberately not used here. Also drop the now-dead `tuple()` wrap in `morton_order_iter` (its argument is typed `tuple[int, ...]` and every caller passes one), per ilan-gold. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Reconcile the remote branch, which had two commits not present locally: - 2b8b095 Merge branch 'main' (same upstream tooling commits already merged locally in 6ee3440; no new content) - 51d0f05 "Update src/zarr/core/indexing.py" (Co-authored-by: Ilan Gold) dropped the redundant tuple() wrap from lexicographic_order_iter. Conflict in src/zarr/core/indexing.py resolved in favor of the local approach: lexicographic_order_iter was removed entirely as redundant with c_order_iter (identical output across all shapes), with its call sites switched to the cached _lexicographic_order_keys tuple. That supersedes ilan-gold's narrower tuple()-trim — the function they edited no longer exists — while addressing the same review feedback (and chuckwondo's point about the misleading lazy-over-eager wrapper). ilan-gold's commit is preserved in history as a merge parent. Co-authored-by: Ilan Gold <ilanbassgold@gmail.com>
…_order_iter
`c_order_iter` names a memory layout ("C order") rather than what the
iterator actually yields. Reintroduce `lexicographic_order_iter` as the
clearer name for the same row-major coordinate sequence, and make
`c_order_iter` a thin alias that delegates to it, with a docstring note
steering new code to the preferred name. No runtime warning — these are
internal helpers.
`lexicographic_order_iter` keeps the eager/cached implementation
(iter over the lru_cached `_lexicographic_order_keys` tuple), which is
~1.4x faster than the old lazy `itertools.product` on the
`dict.fromkeys` shard-write path and is the optimization this branch
exists to deliver. The alias therefore changes `c_order_iter` from lazy
to eager/cached; all in-repo callers (_ShardReader.__iter__,
_is_total_shard, _subchunk_order_iter, and two tests) are migrated to
`lexicographic_order_iter`, so nothing in-tree relies on the old laziness.
Output is unchanged: lexicographic_order_iter, the c_order_iter alias,
and np.ndindex all agree across 0-d, empty, and N-d shapes.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
| def _lexicographic_order_keys(chunk_shape: tuple[int, ...]) -> tuple[tuple[int, ...], ...]: | ||
| return tuple(tuple(int(x) for x in row) for row in _lexicographic_order(chunk_shape)) | ||
|
|
||
|
|
||
| def lexicographic_order_iter(chunk_shape: tuple[int, ...]) -> Iterator[tuple[int, ...]]: | ||
| # Iterate the chunk grid in lexicographic (row-major / C) order. The full | ||
| # coordinate tuple is built once and cached on `chunk_shape` (see | ||
| # _lexicographic_order_keys), so repeated calls for the same shape — e.g. one | ||
| # per shard write — reuse it rather than rebuilding tens of thousands of | ||
| # tuples each time. | ||
| return iter(_lexicographic_order_keys(chunk_shape)) |
There was a problem hiding this comment.
| def _lexicographic_order_keys(chunk_shape: tuple[int, ...]) -> tuple[tuple[int, ...], ...]: | |
| return tuple(tuple(int(x) for x in row) for row in _lexicographic_order(chunk_shape)) | |
| def lexicographic_order_iter(chunk_shape: tuple[int, ...]) -> Iterator[tuple[int, ...]]: | |
| # Iterate the chunk grid in lexicographic (row-major / C) order. The full | |
| # coordinate tuple is built once and cached on `chunk_shape` (see | |
| # _lexicographic_order_keys), so repeated calls for the same shape — e.g. one | |
| # per shard write — reuse it rather than rebuilding tens of thousands of | |
| # tuples each time. | |
| return iter(_lexicographic_order_keys(chunk_shape)) | |
| def _lexicographic_order_keys(chunk_shape: tuple[int, ...]) -> tuple[tuple[int, ...], ...]: | |
| return tuple(lexicographic_order_iter(chunk_shape)) | |
| def lexicographic_order_iter(chunk_shape: tuple[int, ...]) -> Iterator[tuple[int, ...]]: | |
| # Iterate the chunk grid in lexicographic (row-major / C) order. The full | |
| # coordinate tuple is built once and cached on `chunk_shape` (see | |
| # _lexicographic_order_keys), so repeated calls for the same shape — e.g. one | |
| # per shard write — reuse it rather than rebuilding tens of thousands of | |
| # tuples each time. | |
| return (tuple(int(x) for x in row) for row in _lexicographic_order(chunk_shape)) |
There was a problem hiding this comment.
The lazy "lexicographic_order_iter" uses a generator comprehension. The eager "_lexicographic_order_keys " just pulls that into a tuple.
Per review from @mkitti: invert the relationship between the lazy iterator and the eagerly-collected tuple. `lexicographic_order_iter` is now a genuine lazy generator over the chunk-grid coordinates, and `_lexicographic_order_keys` collects it into a cached tuple — the eager version is "collect the lazy one", not the other way around. Previously lexicographic_order_iter returned iter() over the cached tuple, so any consumer that only needed a prefix still paid to materialize the entire grid. _is_total_shard does exactly that — an early-exit `all(coord in set for coord in ...)` — and on a cold cache for a 32^3 shard whose first coordinate is absent this dropped from ~15.8ms to ~24us (the lazy generator builds one coordinate and bails). The hot path is unchanged: the two dict.fromkeys sites consume the full grid and use the cached `_lexicographic_order_keys` tuple directly (~0.9ms at 32^3), so the regression fix this branch delivers is intact. This also resolves @chuckwondo's point — the iterator is now actually lazy rather than a thin wrapper over eager data. Co-authored-by: Mark Kittisopikul <mkitti@users.noreply.github.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR addresses a sharded-write performance regression introduced with subchunk_write_order by caching lexicographic (C-order) chunk coordinates and precomputed tuple keys, avoiding repeated rebuilding of large per-shard coordinate grids during partial writes.
Changes:
- Added cached lexicographic coordinate helpers in
zarr.core.indexingand a newlexicographic_order_iter. - Updated the sharding codec to reuse cached lexicographic coordinate arrays/keys and simplified
_ShardReader.to_dict_vectorized()to be shape-derived. - Added a unit test to validate
to_dict_vectorized()behavior and included a changelog entry.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
tests/test_codecs/test_sharding.py |
Adds coverage for _ShardReader.to_dict_vectorized() and switches tests to lexicographic_order_iter. |
src/zarr/core/indexing.py |
Introduces cached lexicographic coordinate/key generators and aliases c_order_iter to the new iterator. |
src/zarr/codecs/sharding.py |
Uses cached lexicographic coords/keys to eliminate per-write reconstruction overhead. |
changes/4001.misc.md |
Documents the performance restoration in the changelog. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
something I added here: I realized that we already had a |
|
thanks for the careful review all! I think I've addressed everything |
|
|
||
| def morton_order_iter(chunk_shape: tuple[int, ...]) -> Iterator[tuple[int, ...]]: | ||
| return iter(_morton_order_keys(tuple(chunk_shape))) | ||
| return iter(_morton_order_keys(chunk_shape)) |
There was a problem hiding this comment.
| return iter(_morton_order_keys(chunk_shape)) | |
| @lru_cache(maxsize=16) | |
| def _morton_order_keys(chunk_shape: tuple[int, ...]) -> tuple[tuple[int, ...], ...]: | |
| return tuple(morton_order_iter) | |
| def morton_order_iter(chunk_shape: tuple[int, ...]) -> Iterator[tuple[int, ...]]: | |
| return (tuple(int(x) for x in row) for row in _morton_order(chunk_shape)) |
This is also backwards.
There was a problem hiding this comment.
The suggestion is not applying to the correct lines. Manual intervention is likely required.
Per @mkitti: the morton pair was backwards in the same way the lexicographic pair was. Invert it to match — `morton_order_iter` is now the lazy generator primitive and `_morton_order_keys` collects it into a cached tuple, mirroring `lexicographic_order_iter` / `_lexicographic_order_keys`. No behavioral change for the in-tree consumers (all fully consume the sequence) and the Z-order is identical; this keeps the two coordinate- order families symmetric and gives morton the same lazy/early-exit option lexicographic now has. Co-authored-by: Mark Kittisopikul <mkitti@users.noreply.github.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
| def morton_order_iter(chunk_shape: tuple[int, ...]) -> Iterator[tuple[int, ...]]: | ||
| # Lazily yield the chunk grid coordinates in Morton (Z) order. The lazy | ||
| # generator is the primitive; `_morton_order_keys` collects it into a cached | ||
| # tuple for callers that need every coordinate repeatedly. Mirrors | ||
| # `lexicographic_order_iter` / `_lexicographic_order_keys`. | ||
| return (tuple(int(x) for x in row) for row in _morton_order(chunk_shape)) |
| def lexicographic_order_iter(chunk_shape: tuple[int, ...]) -> Iterator[tuple[int, ...]]: | ||
| # Lazily yield the chunk grid coordinates in lexicographic (row-major / C) | ||
| # order. This is the primitive: callers that only need a prefix (e.g. an | ||
| # early-exit `all(...)`) pay only for the coordinates they consume, instead | ||
| # of materializing the whole grid. Callers that need every coordinate | ||
| # repeatedly for the same shape should use `_lexicographic_order_keys`, which | ||
| # caches the fully collected tuple. | ||
| return (tuple(int(x) for x in row) for row in _lexicographic_order(chunk_shape)) |
|
@codspeedbot explain why this is faster |
1 similar comment
|
@codspeedbot explain why this is faster |
Why this PR is fasterI compared the flamegraphs between 1. Cached coordinate keys eliminate per-write grid rebuilding (~180 ms saved)On With this PR, the morton/lexicographic order arrays and their tuple keys are cached, so 2. Pre-computed keys eliminate row-by-row tuple construction in
|
| Path | main |
This PR |
|---|---|---|
_subchunk_order_iter → _morton_order_keys (coordinate grid) |
181 ms | < 1 ms |
to_dict_vectorized (tuple key construction) |
138.6 ms | eliminated from hotspots |
| Total benchmark | 786 ms | 458 ms |
The combined ~320 ms savings from these two paths accounts for the full improvement, restoring write throughput to the pre-#3826 baseline.
Summary
- Warming up the performance wizard...
- Analyzing performance improvement — Flamegraph analysis complete — identified two root causes of the speedup
- Handling comment — Comment handling finished successfully
Was this response helpful? React with 👍 or 👎 to help us improve CodSpeedBot.
|
Claude review incoming... |
There was a problem hiding this comment.
This review was authored by Claude (Claude Code), at the request of @mkitti.
Reviewed the diff critically. The core fix is sound — caching _lexicographic_order/_lexicographic_order_keys and feeding the cached keys straight into to_dict_vectorized/dict.fromkeys removes exactly the two per-write costs #3826 introduced (np.array(list(...)) and per-row tuple(coords.ravel())). The to_dict_vectorized() signature simplification is a genuine improvement (the reader already knows chunks_per_shard), order equivalence with np.ndindex holds across 0-d/empty/N-d, and chunk_coords_array[i]/chunk_coords_keys[i] are guaranteed aligned.
My concern is that the PR has accreted a lazy/eager + naming-symmetry refactor that overshoots the goal, and one piece works against it. Details inline, summarized by severity:
- (high) The morton "symmetry" commit de-optimizes a hot write path — see comment on
morton_order_iter. - (medium)
_is_total_shardis actually made worse than theitertools.productit replaced, and the commit rationale claims the reverse — see comment on_is_total_shard. - (low-med)
c_order_iterquietly changes operational characteristics (lazy/no-alloc → eager + cached numpy grid) — see comment onc_order_iter.
Minor: dropping tuple(chunk_shape) coercion in morton_order_iter means a non-tuple arg now fails inside lru_cache rather than being coerced (acceptable per type hints); _lexicographic_order caches a non-contiguous transposed view that pins its larger np.indices base alive; and the rename/soft-deprecation is scope creep relative to a perf-regression fix — worth a line in the PR description so the c_order_iter behavior change is intentional and visible.
Net: I'd approve the lexicographic caching + to_dict_vectorized() change as-is, but push back on item 1 and trim/correct the item 2 narrative before merge.
| return order | ||
|
|
||
|
|
||
| def morton_order_iter(chunk_shape: tuple[int, ...]) -> Iterator[tuple[int, ...]]: |
There was a problem hiding this comment.
Issue 1 (high): this de-optimizes a hot write path — the opposite of this PR's goal.
Previously morton_order_iter returned iter(_morton_order_keys(...)), i.e. it handed back an lru_cached tuple-of-tuples, so repeated calls were essentially free. Now it rebuilds every coordinate tuple from the (cached) array on every call.
morton_order_iter is consumed in _encode_shard_dict, the per-shard serialization path that runs on every shard write — so this reintroduces exactly the per-write tuple-reconstruction cost this PR removes for the lexicographic case. Two knock-on effects:
_morton_order_keysis now dead insrc/(only the benchmark'scache_clearreferences it); its cache is never populated in production.- The asymmetry is backwards: the lexicographic hot path uses cached
_lexicographic_order_keys, while the morton hot path uses the uncachedmorton_order_iter.
Suggest reverting the morton inversion (keep morton_order_iter delegating to cached _morton_order_keys), or route the _encode_shard_dict call site through _morton_order_keys. Source-shape symmetry isn't worth a hot-path de-optimization in a perf PR.
— Claude (Claude Code)
| return len(all_chunk_coords) == product(chunks_per_shard) and all( | ||
| chunk_coords in all_chunk_coords for chunk_coords in c_order_iter(chunks_per_shard) | ||
| chunk_coords in all_chunk_coords | ||
| for chunk_coords in lexicographic_order_iter(chunks_per_shard) |
There was a problem hiding this comment.
Issue 2 (medium): this is strictly worse than the code it replaced for the early-exit case, and the commit message claims the reverse.
This previously iterated c_order_iter = itertools.product — genuinely lazy, O(1) memory, true early-exit. lexicographic_order_iter calls _lexicographic_order, which eagerly materializes the whole np.indices grid before yielding the first tuple. So the "laziness" only skips tuple conversion of unconsumed rows, not the expensive grid build.
Concretely, commit f6d91e6's claim ("on a cold cache for a 32³ shard … ~15.8ms → ~24µs, the lazy generator builds one coordinate and bails") can't hold here — np.indices((32,32,32)) runs in full on the first next(). For the pure early-exit case on a cold cache, the original itertools.product was strictly better and used less memory.
In practice this is mostly masked because the dict.fromkeys hot path warms _lexicographic_order for the same shape. But if the early-exit win is actually wanted, make lexicographic_order_iter a true generator (e.g. itertools.product) and have _lexicographic_order_keys collect that — otherwise drop the early-exit framing from the commit history.
— Claude (Claude Code)
| return tuple(lexicographic_order_iter(chunk_shape)) | ||
|
|
||
|
|
||
| def c_order_iter(chunks_per_shard: tuple[int, ...]) -> Iterator[tuple[int, ...]]: |
There was a problem hiding this comment.
Issue 3 (low-med): the alias quietly changes operational characteristics.
c_order_iter went from pure-lazy itertools.product (no allocation, no cache) to an alias that materializes and lru_caches a full numpy grid. For any large-chunks_per_shard caller this is a cold-start + memory regression, and it now retains up to 16 large arrays. In practice chunks_per_shard is modest so it's likely fine — but it's a real semantic change presented as a no-op "soft-deprecated alias." Worth calling out explicitly in the PR description.
— Claude (Claude Code)
Replace the morton/lexicographic order iterators (and the c_order_iter alias) with two cached, numpy-backed sequences: `morton_order_coords(shape)` and `lexicographic_order_coords(shape)`, each returning the grid coordinates in that order as a tuple of coordinate tuples. This addresses several points from review: - The earlier "lazy primitive" inversion de-optimized the hot write path: `morton_order_iter` rebuilt every coordinate tuple from the array on each call, and that path runs in `_encode_shard_dict` on every shard write (~16ms/write at 32^3 chunks-per-shard). The coords are a finite set of known length reused in full, so they are an indexable sequence built once and cached, not a lazily-rebuilt generator. (per @mkitti) - `lexicographic_order_iter` was never genuinely lazy — `_lexicographic_order` materializes the whole `np.indices` grid up front — so the early-exit framing was inaccurate. (per @Copilot, @chuckwondo) - Two functions differing only in caching vs laziness was redundant (per @ilan-gold); there is now one sequence per order. `_ShardReader.__iter__` wraps it in `iter()`, the only site that needs an iterator. - `_is_total_shard` no longer iterates the order at all: `all_chunk_coords` is always a subset of the shard grid (guaranteed by `validate`'s shard/chunk divisibility check), so a count check proves totality. A subset assertion documents the invariant. Coordinates are Python int tuples because every consumer uses them as dict keys / set members, which numpy arrays cannot be (unhashable, mutable); the numpy array is kept only for the vectorized index lookup in `to_dict_vectorized`. The per-shape cache holds ~prod(chunks_per_shard) tuples (~0.07% of shard size for multi-GB shards with (64,64,64) chunks), capped at 16 shapes per order. Co-authored-by: Mark Kittisopikul <mkitti@users.noreply.github.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The existing test_sharded_morton_write_single_chunk clears the chunk-order cache before every iteration, so it only measures the cold grid-build cost. That made it blind to a regression where the per-shard coordinate tuples were rebuilt on every write instead of being reused from the cache — the cold benchmark could not distinguish the two (both pay the build each iteration). Add test_sharded_morton_write_single_chunk_warm_cache, which warms the cache once and then times repeated same-shape writes — the amortized regime the cache exists to optimize (many shards of one shape per array). Verified it discriminates: with the cached sequence it is ~4x faster than the cold benchmark, and a rebuild-every-write regression shows up as a ~4x slowdown here while staying invisible to the cold benchmark. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The fix caches the per-shard coordinate grid for every shard write, not only partial writes, and the win is amortized across repeated writes to same-shaped shards. Reword the note accordingly; keep it user-facing (the internal indexing helper refactor is not part of the public API). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ation `morton_order_coords` / `lexicographic_order_coords` built their tuple-of- tuples with a row-by-row `tuple(int(x) for x in row)` comprehension. Using `map(tuple, arr.tolist())` instead does the int conversion in a single C-level call, producing byte-identical native-int tuples ~8-9x faster (~16ms -> ~1.9ms cold build at 32^3). It is a per-shape cached build, so this only speeds the first write to each shard shape, but it is free. Also document in `to_dict_vectorized` why the chunk coordinates are needed in two forms — a numpy array for the vectorized index lookup and hashable tuples for the dict keys — since numpy rows are unhashable and a tuple list can't be used for the vectorized modulo/advanced-indexing. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
thanks for everyone's patience here. Some recent changes:
@mkitti feel free to throw your agents at the latest changes. |
|
@claude please review |
- Drop the O(n_chunks) assert in _is_total_shard. It built a fresh set(lexicographic_order_coords(...)) on every partial read/write to check an invariant `validate` already guarantees, regressing the very partial-access hot path this PR optimizes (~673us vs ~112ns at 32^3 chunks-per-shard) and vanishing under -O. The invariant is documented in the comment; the count check alone proves totality. - Cache the colexicographic subchunk order. The colex branch of _subchunk_order_iter rebuilt the grid via uncached np.ndindex on every write while its morton/lexicographic siblings hit the cache; add colexicographic_order_coords (cached, derived from lexicographic_order_coords of the reversed shape) and use it. - Fix two benchmark docstrings: the cold benchmark now clears the lexicographic caches too (the write path builds that grid via dict.fromkeys / to_dict_vectorized, so a morton-only clear left it warm and under-reported the cold cost); the warm benchmark docstring now describes what it actually exercises (repeated writes to one shard, which reuse the cache identically to writes across same-shaped shards). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Fixes a performance regression introduced in #3826 that made morton-order-indexing slower.
for details, see claude's summary below.
The subchunk_write_order feature (#3826) regressed sharded write performance: _encode_partial_single rebuilt the full per-shard chunk coordinate grid on every write via
np.array(list(_subchunk_order_iter(..., "lexicographic"))), andto_dict_vectorizedrebuilt a tuple key per row withtuple(coords.ravel()). For a single-chunk write into a shard with tens of thousands of chunks this roughly doubled write time (~22ms -> ~40ms on test_sharded_morton_write_single_chunk, matching the -44% CodSpeed regression).Add cached
_lexicographic_order(array) and_lexicographic_order_keys(tuples) helpers in indexing.py, mirroring_morton_order/_morton_order_keys, and pass the cached keys intoto_dict_vectorizedinstead of deriving them row-by-row. This restores write throughput to the pre-#3826 baseline while preserving identical chunk ordering (verified equal to np.ndindex across shapes including 0-d and empty).[Description of PR]
TODO:
docs/user-guide/*.mdchanges/