perf: phased codecpipeline #3885
`PreparedWrite` models a set of per-chunk changes that would be applied to a stored chunk. `SupportsChunkPacking` is a protocol for array -> bytes codecs that can use `PreparedWrite` objects to update an existing chunk.
…into perf/prepared-write-v2
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #3885 +/- ##
==========================================
- Coverage 93.53% 93.25% -0.29%
==========================================
Files 88 88
Lines 11894 12526 +632
==========================================
+ Hits 11125 11681 +556
- Misses 769 845 +76
@TomAugspurger how would this design work with CUDA codecs?
5d3064e to b67a5a0
a84a15a to 68a7cdc
# Phase 1: fetch all chunks (IO, sequential)
raw_buffers: list[Buffer | None] = [
    bg.get_sync(prototype=cs.prototype)  # type: ignore[attr-defined]
    for bg, cs, *_ in batch
]

# Phase 2: decode (compute, optionally threaded)
def _decode_one(raw: Buffer | None, chunk_spec: ArraySpec) -> NDBuffer | None:
    if raw is None:
        return None
    return transform.decode_chunk(raw, chunk_spec)

specs = [cs for _, cs, *_ in batch]
if n_workers > 0 and len(batch) > 1:
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        decoded_list = list(pool.map(_decode_one, raw_buffers, specs))
else:
    decoded_list = [
        _decode_one(raw, spec) for raw, spec in zip(raw_buffers, specs, strict=True)
    ]
Why isn't this all multi-threaded i.e., the I/O as well?
I should benchmark this, but my expectation was that IO against memory storage and local storage is not compute-limited, so threads wouldn't remove a real bottleneck. For memory storage I'm sure this is true; I'm not sure about local storage, though.
Adds a SupportsSetRange protocol to zarr.abc.store for stores that allow overwriting a byte range within an existing value. Implementations are added for LocalStore (using file-handle seek+write) and MemoryStore (in-memory bytearray slice assignment). This is the prerequisite for the partial-shard write fast path in ShardingCodec, which can patch individual inner-chunk slots without rewriting the entire shard blob when the inner codec chain is fixed-size. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
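A minimal sketch of what such a protocol could look like. The `set_range_sync(key, value, start)` signature and the `ToyMemoryStore` class are assumptions made for illustration; the PR's actual signatures are not shown on this page.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class SupportsSetRange(Protocol):
    # Hypothetical signature, for illustration only.
    def set_range_sync(self, key: str, value: bytes, start: int) -> None: ...


class ToyMemoryStore:
    """Toy in-memory store (not zarr's MemoryStore); values are bytearrays."""

    def __init__(self) -> None:
        self._data: dict[str, bytearray] = {}

    def set_sync(self, key: str, value: bytes) -> None:
        self._data[key] = bytearray(value)

    def get_sync(self, key: str) -> bytes:
        return bytes(self._data[key])

    def set_range_sync(self, key: str, value: bytes, start: int) -> None:
        buf = self._data[key]  # KeyError if the key does not exist yet
        end = start + len(value)
        if start < 0 or end > len(buf):
            raise ValueError("byte range falls outside the existing value")
        buf[start:end] = value  # same-length slice assignment, in place


store = ToyMemoryStore()
store.set_sync("c/0", b"\x00" * 8)
store.set_range_sync("c/0", b"\xff\xff", start=4)
```

Because the protocol is `runtime_checkable`, a caller can test `isinstance(store, SupportsSetRange)` before choosing a byte-range write path.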
V2Codec, BytesCodec, BloscCodec, etc. previously only implemented the async _decode_single / _encode_single methods. Add their sync counterparts (_decode_sync / _encode_sync) so that the upcoming SyncCodecPipeline can dispatch through them without spinning up an event loop. For codecs that wrap external compressors (numcodecs.Zstd, numcodecs.Blosc, the V2 fallback chain), the sync versions just call the underlying compressor's blocking API directly instead of routing through asyncio.to_thread. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
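The difference between the two entry points can be sketched with a toy codec. `ToyZlibCodec` is hypothetical and uses stdlib `zlib` in place of numcodecs; the real zarr methods also take a chunk spec argument, which is omitted here.

```python
import asyncio
import zlib


class ToyZlibCodec:
    """Illustrative codec, not zarr's: exposes async and sync entry points."""

    async def _decode_single(self, data: bytes) -> bytes:
        # Async path: push the blocking call onto a worker thread.
        return await asyncio.to_thread(zlib.decompress, data)

    def _decode_sync(self, data: bytes) -> bytes:
        # Sync path: call the compressor's blocking API directly.
        return zlib.decompress(data)

    def _encode_sync(self, data: bytes) -> bytes:
        return zlib.compress(data)


codec = ToyZlibCodec()
payload = b"zarr" * 1000
encoded = codec._encode_sync(payload)
via_sync = codec._decode_sync(encoded)  # no event loop needed
via_async = asyncio.run(codec._decode_single(encoded))  # needs an event loop
```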
…arallelism
Adds SyncCodecPipeline alongside BatchedCodecPipeline. The new pipeline
runs codecs through their sync entry points (_decode_sync / _encode_sync)
and dispatches per-chunk work to a module-level thread pool sized by
the codec_pipeline.max_workers config (default = os.cpu_count()).
Each chunk's full lifecycle (fetch + decode + scatter for reads;
get-existing + merge + encode + set/delete for writes) runs as one
pool task — overlapping IO of one chunk with compute of another.
Scatter into the shared output buffer is thread-safe because chunks
have non-overlapping output selections.
The async wrappers (read/write) detect SupportsGetSync/SupportsSetSync
stores and dispatch to the sync fast path, passing the configured
max_workers. Other stores fall through to the async path, which still
uses concurrent_map, bounded by the async.concurrency config.
Notes on perf:
- Default (None → cpu_count) is tuned for chunks ≥ ~512 KB.
- Small chunks (≤ 64 KB) regress 1.5-3x because pool dispatch overhead
(~30-50 µs/task) dominates per-chunk work. Workaround:
zarr.config.set({"codec_pipeline.max_workers": 1}).
- For large chunks on local/memory stores, IO+compute parallelism
yields 1.7-2.5x over BatchedCodecPipeline on direct-API reads and
~2.5x on roundtrip.
ChunkTransform encapsulates the sync codec chain. It caches resolved
ArraySpecs across calls with the same chunk_spec — combined with the
constant-ArraySpec optimization in indexing, hot-path overhead is
minimized.
Includes test scaffolding for the new pipeline (test_sync_codec_pipeline)
and config plumbing for the max_workers key.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
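As a rough illustration of the "one pool task per chunk" read path described in this commit message, the sketch below uses a plain dict as the store, `zlib` as the codec, and a `bytearray` as the shared output buffer. All names are hypothetical, and the real pipeline uses a module-level pool rather than one created per call.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

CHUNK = 4  # bytes per chunk in this toy 1-D layout
N_CHUNKS = 6
FILL = 0xFF

store = {f"c/{i}": zlib.compress(bytes([i]) * CHUNK) for i in range(N_CHUNKS)}
store.pop("c/3")  # simulate a chunk that was never written

out = bytearray(CHUNK * N_CHUNKS)


def read_one(i: int) -> None:
    raw = store.get(f"c/{i}")  # fetch
    decoded = bytes([FILL]) * CHUNK if raw is None else zlib.decompress(raw)  # decode
    out[i * CHUNK : (i + 1) * CHUNK] = decoded  # scatter into this chunk's own slice


def read_all(max_workers: int) -> None:
    if max_workers <= 1:
        for i in range(N_CHUNKS):
            read_one(i)
    else:
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            list(pool.map(read_one, range(N_CHUNKS)))  # list() surfaces worker errors


read_all(max_workers=4)
```

Each task writes to a disjoint slice of `out`, which is why no lock is needed. For small chunks, the commit message's workaround is `zarr.config.set({"codec_pipeline.max_workers": 1})`, which corresponds to the sequential branch above.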
Adds _encode_partial_sync and _decode_partial_sync to ShardingCodec.
For fixed-size inner codec chains and stores that implement
SupportsSetRange, partial writes patch individual inner-chunk slots
in-place instead of rewriting the whole shard:
- Reads existing shard index (one byte-range get).
- For each affected inner chunk: decodes the slot, merges the new
region, re-encodes.
- Writes each modified slot at its deterministic byte offset, then
rewrites just the index.
For variable-size inner codecs (e.g. with compression) or stores that
don't support byte-range writes, falls through to a full-shard rewrite
matching BatchedCodecPipeline semantics.
The partial-decode path computes a ReadPlan from the shard index and
issues one byte-range get per overlapping chunk, decoding only what
the read selection touches.
Both paths are dispatched from SyncCodecPipeline via the existing
supports_partial_decode / supports_partial_encode protocol checks.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
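The slot-patching steps listed above can be sketched on a toy shard held in a `bytearray`. The layout here (fixed-size chunks back to back, followed by a little-endian `(offset, nbytes)` index with no checksum) and all function names are simplifying assumptions, not the PR's implementation.

```python
import struct

CHUNK_NBYTES = 4
N_CHUNKS = 4
INDEX_ENTRY = struct.Struct("<QQ")  # (offset, nbytes) per inner chunk


def build_shard(chunks: list[bytes]) -> bytearray:
    """Lay chunks out back to back, followed by an (offset, nbytes) index."""
    data = b"".join(chunks)
    index = b"".join(
        INDEX_ENTRY.pack(i * CHUNK_NBYTES, CHUNK_NBYTES) for i in range(len(chunks))
    )
    return bytearray(data + index)


def read_index(shard: bytearray) -> list[tuple[int, int]]:
    start = len(shard) - N_CHUNKS * INDEX_ENTRY.size
    return [
        INDEX_ENTRY.unpack_from(shard, start + i * INDEX_ENTRY.size)
        for i in range(N_CHUNKS)
    ]


def patch_chunk(shard: bytearray, chunk_id: int, pos: int, new: bytes) -> None:
    offset, nbytes = read_index(shard)[chunk_id]
    slot = bytearray(shard[offset : offset + nbytes])  # "decode" the slot
    slot[pos : pos + len(new)] = new  # merge the new region
    shard[offset : offset + nbytes] = slot  # write back at the same offset


shard = build_shard([bytes([i]) * CHUNK_NBYTES for i in range(N_CHUNKS)])
patch_chunk(shard, chunk_id=2, pos=1, new=b"\xaa\xbb")
```

Because every encoded chunk has the same size, the patched slot never moves, so neighbouring chunks stay untouched. With a compressing inner codec the re-encoded size can change, which is why that case falls back to a full-shard rewrite.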
Two new test files:
test_codec_invariants — asserts contract-level properties that every
codec / shard / buffer combination must satisfy: round-trip exactness,
prototype propagation, fill-value handling, all-empty shard handling.
test_pipeline_parity — exhaustive matrix asserting that
SyncCodecPipeline and BatchedCodecPipeline produce semantically
identical results across codec configs, layouts (including
nested sharding), write sequences, and write_empty_chunks settings.
Three checks per cell:
1. Same array contents on read.
2. Same set of store keys after writes.
3. Each pipeline reads the other's output identically (catches
layout-divergence bugs).
These tests pinned the design throughout the SyncCodecPipeline +
partial-shard development.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds .gitignore entries for .claude/, CLAUDE.md, and docs/superpowers/ so local IDE/agent planning artifacts don't get committed by accident. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
aa111a2 to 1be5563
selected = decoded[chunk_selection]
if drop_axes:
    selected = selected.squeeze(axis=drop_axes)
out[out_selection] = selected
It might be worth experimenting with moving this setting operation out[out_selection] = selected outside the threadpool execution since, IIRC, it holds the GIL and is probably non-trivial time-wise.
The memory usage will probably go up a bit though....
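A rough sketch of the variant suggested here, with decoding in the pool and the scatter done afterwards on the calling thread. It uses `zlib` and a `bytearray` as stand-ins for the codec chain and the output buffer; none of these names come from the PR.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

CHUNK = 4
encoded = [zlib.compress(bytes([i]) * CHUNK) for i in range(5)]
out = bytearray(CHUNK * len(encoded))

with ThreadPoolExecutor(max_workers=2) as pool:
    # Workers only decode. Every decoded chunk stays alive until it is
    # scattered, which is the extra memory cost of this arrangement.
    decoded = list(pool.map(zlib.decompress, encoded))

# Scatter sequentially on the calling thread, outside the pool.
for i, chunk in enumerate(decoded):
    out[i * CHUNK : (i + 1) * CHUNK] = chunk
```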
The first thing that jumps out at me is the potential for a performance regression because there is no whole-shard special casing in the new fused codec pipeline. I guess your benchmarks cover that, but they also might not.
As you point out in #3925, that PR will then bring nested concurrency because the pipeline will have "outer" concurrency (that also controls the decompression) while there will be some inner concurrency from coalesced ranges.
I'd like to understand how these two PRs will fit together!
ilan-gold left a comment
Hi @d-v-b maybe in the interest of pruning this a bit, I don't actually see a dependency on set_range_sync being used in the pipeline - could it be removed from this PR?
In fact, if we want to be able to do what zarrs does for write performance, we will likely need #3826 + special casing for unordered first, as this would then be mixed with set_range_sync to achieve the "first to compress, first to write" paradigm that appears in zarrs for unordered subchunk writing:
Otherwise, you need to hold the shard in-memory AFAICT to be able to create the ordering ahead of time (i.e., morton) to write.
First, apologies for the messy state, and second, yes, consider the set_range stuff extra credit. I do want to ensure that we can support range writes, but as long as we are confident that we aren't blocking that path, then it's totally fine to slim this PR down to "whatever it takes to speed up local + memory storage".
And you should absolutely feel free to push ad libitum to this branch. I'm not actively working on it, and I'm confident that the twin constraints of our test suite and the benchmarks can keep us sane.
In ShardingCodec._encode_partial_sync's full-shard-rewrite loop, a scalar broadcast value produces byte-for-byte identical results for every complete inner chunk (same fill, same empty-check, same encoded bytes). Compute that outcome once and reuse it across all complete chunks instead of re-merging, re-checking write_empty_chunks, and re-encoding tens of thousands of identical chunks. Incomplete edge chunks still merge against their own data individually.
Target case (fused, memory, chunks=100/shards=1M, no compression): write 92.26ms -> 21.59ms (4.3x).
Pipeline parity (byte-identical to batched) and 956 tests pass under the fused pipeline; adversarial partial-overwrite/edge/compression/2D/aliasing checks pass.
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
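The encode-once idea in this commit can be illustrated on a toy 1-D layout: a scalar write produces the same encoded bytes for every complete chunk, so only the incomplete edge chunk needs its own merge. The function names, the `zlib` codec, and the dict of already-decoded existing chunks are assumptions for the sketch.

```python
import zlib

CHUNK_LEN = 8
ARRAY_LEN = 30  # the last chunk is an edge chunk with 6 valid elements
FILL = 0


def encode(chunk: bytes) -> bytes:
    return zlib.compress(chunk)


def write_scalar(value: int, existing: dict[int, bytes]) -> dict[int, bytes]:
    n_chunks = -(-ARRAY_LEN // CHUNK_LEN)  # ceiling division
    full_encoded = encode(bytes([value]) * CHUNK_LEN)  # computed once
    out: dict[int, bytes] = {}
    for i in range(n_chunks):
        valid = min(CHUNK_LEN, ARRAY_LEN - i * CHUNK_LEN)
        if valid == CHUNK_LEN:
            out[i] = full_encoded  # complete chunk: reuse the shared result
        else:
            # Edge chunk: merge against its own existing (decoded) data.
            base = bytearray(existing.get(i, bytes([FILL]) * CHUNK_LEN))
            base[:valid] = bytes([value]) * valid
            out[i] = encode(bytes(base))
    return out


existing = {3: bytes([9]) * CHUNK_LEN}
result = write_scalar(7, existing)
```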
…#3826, partial-read opt zarr-developers#3004, _ShardIndex refactor zarr-developers#3975) Resolves conflicts in sharding.py (kept FusedCodecPipeline sync methods + main's _subchunk_order_iter / _load_partial_shard_maybe; fixed _ShardIndex construction to main's 2-arg signature), array.py (took main's cached regular_chunk_spec), test_codec_pipeline.py (kept the dual-pipeline suite + main's evolve test), .gitignore (union). 423 codec/sharding/parity + 807 codecs/indexing tests pass under both pipelines. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…t-merge
Two things, both scoped to the sync sharding read path:
1. Fix: main's zarr-developers#3975 made _ShardIndex a 2-field NamedTuple (chunks_per_shard, offsets_and_lengths), but the Fused sync methods still constructed it with one arg, erroring on every Fused sharded read. Pass chunks_per_shard through in _decode_shard_index_sync and the byte-range write path.
2. Perf: _decode_full_shard_bulk + _ShardIndex.is_dense. A whole-shard read of a dense, fixed-size, uncompressed shard is reconstructed by reshaping/scattering the data section in bulk, replacing the per-chunk decode/index/projection loop (~78% of a full read). Chunk positions are read from the stored index, so it is correct for any subchunk_write_order. Falls through to the per-chunk path for compression/filters, non-dense shards, and any read whose output shape != the shard shape (strided/partial/fancy).
Full read (memory, 10000 chunks/shard, uint8): ~291ms -> ~21ms (13.9x vs Batched).
Verified: 0 new test failures vs the merge baseline; full reads correct across dtypes and 2D; partial/strided/gzip fall through. (Pre-existing Fused x subchunk_write_order gaps remain, tracked separately.)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…al reads
Three integration gaps surfaced when the Fused pipeline met main's new subchunk_write_order (zarr-developers#3826), partial-read coalescing (zarr-developers#3004), and _ShardIndex refactor. Under Fused these caused 25 sharding/parity failures (data was correct in the partial-read cases; the failures were write-order layout + IO-pattern divergence).
Fixes:
1. Write order: _encode_shard_dict_sync laid out chunks in hardcoded morton order, ignoring subchunk_write_order. Now iterates _subchunk_order_iter(self.subchunk_write_order), matching the async _encode_shard_dict. Fixes lexicographic/colexicographic/unordered storage.
2. Coalesced sync partial reads: add Store.get_ranges_sync (a synchronous, coalescing counterpart of get_ranges, reusing coalesce_ranges) and ShardingCodec._load_partial_shard_maybe_sync; route _decode_partial_sync's partial branch through it. Sync stores now get zarr-developers#3004's byte-range coalescing without an event loop (fewer, merged reads).
3. Non-sync fallback: FusedCodecPipeline.read now routes non-sync stores (e.g. ZipStore) through the async partial-decode path when the AB codec supports it, instead of _async_read_fallback's whole-shard get(). Matches Batched's IO behavior; avoids over-reading whole shards on partial reads.
Tests: the zarr-developers#3004 partial-read tests are made pipeline-aware (assert the active method family: get/get_ranges vs get_sync/get_ranges_sync, gated on store sync support). 573 sharding+parity+pipeline+indexing and 657 codec tests pass under BOTH pipelines (was 25 failing under Fused).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
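Byte-range coalescing, as mentioned in fix 2, can be sketched as follows. This is a generic illustration: `coalesce`, `toy_get_ranges_sync`, and the `max_gap` parameter are hypothetical and do not reproduce the signature of zarr's `coalesce_ranges` or `Store.get_ranges_sync`.

```python
def coalesce(ranges: list[tuple[int, int]], max_gap: int) -> list[tuple[int, int]]:
    """Merge (start, stop) byte ranges whose gap is at most max_gap bytes."""
    merged: list[tuple[int, int]] = []
    for start, stop in sorted(ranges):
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], stop))
        else:
            merged.append((start, stop))
    return merged


def toy_get_ranges_sync(
    blob: bytes, ranges: list[tuple[int, int]], max_gap: int = 16
) -> tuple[list[bytes], int]:
    """Serve each requested range from a smaller number of merged reads."""
    reads = coalesce(ranges, max_gap)
    fetched = {(a, b): blob[a:b] for a, b in reads}  # one "IO" per merged range
    result = []
    for start, stop in ranges:
        a, b = next((a, b) for a, b in reads if a <= start and stop <= b)
        result.append(fetched[(a, b)][start - a : stop - a])
    return result, len(reads)


blob = bytes(range(256))
requested = [(0, 4), (6, 10), (100, 104)]
values, n_reads = toy_get_ranges_sync(blob, requested, max_gap=4)
```

Here the first two ranges are two bytes apart and get merged into one read, while the third is far away and is fetched on its own.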
HIGH (sharding.py, byte-range write fast path): derived each chunk's physical slot from self.subchunk_write_order instead of hardcoded morton order, and excluded 'unordered' (no recoverable rank -> falls through to the index-driven full-rewrite path). A partial write into a dense shard first written with a non-default order no longer corrupts data via wrong byte offsets.
HIGH (sharding.py, _decode_full_shard_bulk): build the read-view dtype from the BytesCodec's endian (as BytesCodec._decode_sync does), not the dtype's native endianness. A big-endian shard read on a little-endian host (or vice versa) now decodes correctly instead of silently reinterpreting bytes.
MEDIUM (sharding.py, _decode_full_shard_bulk): the bulk fast path now requires the inner chain to be exactly one BytesCodec, excluding crc-bearing shards. The bulk path can't verify per-chunk checksums, so crc shards fall through to the per-chunk path and keep their corruption detection.
LOW (codec_pipeline.py, ChunkTransform._resolve_specs): key the resolved-spec cache on the frozen, hashable ArraySpec value instead of (shape, id()), which could collide after id reuse.
LOW (codec_pipeline.py, _get_pool): don't shutdown(wait=False) the old pool on grow: a concurrent in-flight pool.map could hit 'cannot schedule new futures after shutdown'. The orphaned pool drains and is GC'd.
Tests: extended test_pipeline_parity with big-endian + crc32c codec configs and a dedicated subchunk_write_order x index_location parity test (asserts identical contents always, identical bytes for deterministic orders). Verified each new test fails when its corresponding fix is reverted. 1219 tests pass under both pipelines; mypy clean.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…mpute separation The class docstring claimed it 'separates IO from compute', then immediately said the ShardingCodec does IO internally — self-contradictory and misleading. The actual win is replacing per-chunk ASYNC scheduling with synchronous, batched/coalesced execution; the sharding codec still owns its storage IO (the zarrs model, unlike tensorstore's storage-free codecs). Rewrite the docstring to state this plainly and note that a storage-free codec is a possible future direction, not what this pipeline does. No behavior change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…f/prepared-write-v2-mainmerge
After merging zarr-developers#4011 (which made 'unordered' deterministic and warns callers not to rely on its layout), drop the two places my earlier fixes special-cased it by name:
- Byte-range write fast path: remove the 'subchunk_write_order != unordered' gate. The rank map is derived from _subchunk_order_iter(self.subchunk_write_order), which is the single source of truth for physical layout and is correct for every order without a name check. _subchunk_order_iter is the only place that knows a given order's layout.
- Parity test: assert byte-equality across pipelines for ALL orders, not just 'deterministic' ones. The check verifies the two pipelines AGREE (they share _subchunk_order_iter), which holds whatever an order resolves to; it makes no assumption about what 'unordered' means.
540 parity+sharding and 862 codec/indexing tests pass under both pipelines; mypy clean.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Flip codec_pipeline.path default from BatchedCodecPipeline to FusedCodecPipeline. Fused runs codec compute synchronously/in bulk and gives large speedups on sharded workloads (up to ~24x write / ~14x read on many-chunks-per-shard, more with compression) and no regressions on compute-bound cases; it falls back to the async path for non-sync stores. Batched remains selectable via config.
Test fallout from the flip (all behavior, not stale-assertion churn):
- test_config_defaults_set: expected default path updated.
- test_config_codec_implementation: the mock codec now also overrides _encode_sync, so it records a call regardless of which pipeline is default (Fused uses the sync entry point).
- StoreExpectingTestBuffer (zarr.testing.buffer): added set_sync/get_sync that mirror the async buffer-type guards, so the 'all buffers are TestBuffer' invariant is checked on the sync write path too. Verified Fused correctly threads a custom BufferPrototype (sharded writes store TestBuffer instances); the test simply wasn't exercising the sync path before.
Full suite: 6346 passed, 0 failed under the new default.
NOTE: changelog fragment filename is a PLACEHOLDER: rename changes/PLACEHOLDER-fused-default.feature.md to changes/<PR#>.feature.md once the PR number is known (towncrier keys fragments by issue/PR number).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… hard-coded Batched The codec_pipeline property hard-coded BatchedCodecPipeline.from_codecs(). main resolves it against the registry via get_pipeline_class() (zarr-developers#2179); the branch carried an older hard-coded version and the main-merge kept the branch side. With FusedCodecPipeline now the default this left the inner sub-chunk pipeline stuck on Batched while the outer array used Fused — an inconsistency, and stale relative to main. Restore get_pipeline_class().from_codecs(), matching the rest of this module (which already uses get_pipeline_class elsewhere). Verified: sharding + parity + pipeline (596) and codecs+array+indexing+properties (2161) pass; nested sharding roundtrips correctly under both pipelines; no functional BatchedCodecPipeline references remain in sharding.py. mypy clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…red CodecPipelineTests
HIGH-2: FusedCodecPipeline.decode()/encode() (the async fallback for non-sync stores) reused one flat chunk_spec across every codec stage instead of evolving it per codec via resolve_metadata. Spec-changing array->array codecs broke: TransposeCodec crashed on read (could not broadcast (2,2) into (2,4)); cast_value/scale_offset would silently corrupt. Reachable on the DEFAULT pipeline for every non-sync store (S3/GCS/fsspec/zip).
Fix, without re-duplicating spec logic (the duplication caused the bug):
- Extract resolve_aa_specs(): single source of truth for per-stage spec evolution (forward-thread resolve_metadata over the AA codecs). Pure metadata.
- Add AsyncChunkTransform: per-chunk ASYNC mirror of ChunkTransform, driving the codecs' async _decode_single/_encode_single with the correct per-stage spec. No mini-batch concept (that stays a BatchedCodecPipeline concern).
- ChunkTransform._resolve_specs delegates to resolve_aa_specs.
- Fused.decode()/encode() loop per chunk through AsyncChunkTransform.
Also harden the sharding byte-range WRITE fast path: take chunk offsets from the stored shard index, not from the live subchunk_write_order (which is not recoverable on reopen by design).
New tests/test_codec_pipeline_suite.py: xUnit CodecPipelineTests base run as TestBatchedPipeline and TestFusedPipeline over a sync (MemoryStore) AND a non-sync (LatencyStore) store axis. Reproduces HIGH-2 automatically.
140 pass; mypy clean; original ZipStore+transpose crash now roundtrips.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
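Forward-threading a spec through array->array codecs, the idea behind resolve_aa_specs, might look roughly like this. `ToySpec`, `ToyTranspose`, and the return shape of `resolve_aa_specs_sketch` are assumptions; only the `resolve_metadata` method name comes from the commit message.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ToySpec:
    shape: tuple[int, ...]


@dataclass(frozen=True)
class ToyTranspose:
    """Illustrative array->array codec whose output shape differs from its input."""

    order: tuple[int, ...]

    def resolve_metadata(self, spec: ToySpec) -> ToySpec:
        return replace(spec, shape=tuple(spec.shape[i] for i in self.order))


def resolve_aa_specs_sketch(codecs: list, chunk_spec: ToySpec) -> list[ToySpec]:
    """Return the input spec seen by each codec, threading it forward."""
    specs = []
    spec = chunk_spec
    for codec in codecs:
        specs.append(spec)  # spec this codec receives
        spec = codec.resolve_metadata(spec)  # spec the next stage receives
    specs.append(spec)  # spec handed to the array->bytes codec
    return specs


specs = resolve_aa_specs_sketch([ToyTranspose((1, 0))], ToySpec((2, 4)))
```

Reusing the original `(2, 4)` spec for the stage after the transpose is the kind of mismatch that produced the broadcast error described above.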
…ts suite
The shared suite runs every pipeline-agnostic behavior test against BOTH pipelines x both store paths, so per-file copies of the same behavior are redundant. Remove confirmed duplicates; keep tests that exercise something the suite does not.
- Strengthen the suite's write_empty_chunks tests to also assert chunk-key presence/absence (absorbing the old _no_store / _persists coverage).
- test_codec_pipeline.py: drop the 8 behavior duplicates now in the suite. KEEP test_read_returns_get_results (low-level pipeline.read GetResult API), test_write_empty_chunks_false_no_store (store-key shape), and test_codec_pipeline_threads_dtype_through_evolve (zarr-developers#3937 regression).
- test_fused_pipeline.py: drop the array-level streaming read/write tests and test_partial_shard_write_roundtrip_correctness (array behavior, suite-covered). KEEP all pipeline-API / Fused-internal tests (construction, evolve, low-level write/read(_sync) roundtrips, sync-write/async-read interop, ChunkTransform encode/decode, set_range, inner_codecs_fixed_size, byte-range fast path).
740 pass across suite + codec_pipeline + fused + sync + invariants + parity + sharding; ruff + mypy clean. No coverage removed without a verified equivalent.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…etrized test The bulk of CodecPipelineTests followed one shape: create an array, apply some writes, optionally assert which chunk keys exist, then assert reads come back correct. Capture those variables in a frozen Scenario dataclass (array_kwargs, writes, reads, keys_present/absent) and drive them all through a single parametrized test_scenario. Correctness is checked against a numpy reference the scenario derives from its own writes, so cases don't hand-maintain expected values. 18 scenarios cover the same matrix (layouts, gzip, transpose spec-evolution, nested sharding, partial-shard overwrite, write_empty key presence/absence) x both pipelines x sync/async stores. Kept as separate focused tests the two cases that don't fit the shape: test_read_missing_chunks_false_raises (asserts an exception) and test_partial_write_after_reopen_is_correct (has an extra reopen step). Verified the parametrized form keeps its regression-guard value: reverting the HIGH-2 spec-evolution fix still fails test_scenario[async-transpose]. 670 pass across pipeline + sharding suites; ruff + mypy clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
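A stripped-down sketch of the Scenario pattern, using plain lists instead of numpy and a toy dict-backed "pipeline". The field names `writes`, `reads`, `keys_present`, and `keys_absent` follow the commit message; their types, the other fields, and `run` are assumptions. The real suite drives scenarios through a parametrized pytest test rather than a loop.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    name: str
    length: int
    chunk: int
    writes: tuple[tuple[slice, int], ...]
    reads: tuple[slice, ...]
    keys_present: frozenset[int] = frozenset()
    keys_absent: frozenset[int] = frozenset()

    def reference(self) -> list[int]:
        """Expected contents, derived from the scenario's own writes."""
        ref = [0] * self.length  # fill value is 0
        for sel, value in self.writes:
            ref[sel] = [value] * len(range(*sel.indices(self.length)))
        return ref


def run(s: Scenario) -> None:
    """Toy 'pipeline': chunks live in a dict; all-fill chunks are not stored."""
    store: dict[int, list[int]] = {}
    for sel, value in s.writes:
        for i in range(*sel.indices(s.length)):
            store.setdefault(i // s.chunk, [0] * s.chunk)[i % s.chunk] = value
    store = {k: v for k, v in store.items() if any(v)}
    empty = [0] * s.chunk
    data = [store.get(i // s.chunk, empty)[i % s.chunk] for i in range(s.length)]
    ref = s.reference()
    for sel in s.reads:
        assert data[sel] == ref[sel], s.name
    keys = set(store)
    assert s.keys_present <= keys, s.name
    assert not (s.keys_absent & keys), s.name


SCENARIOS = [
    Scenario("partial-overwrite", 8, 4, ((slice(0, 8), 1), (slice(2, 6), 2)),
             (slice(0, 8), slice(3, 5)), keys_present=frozenset({0, 1})),
    Scenario("overwrite-to-fill", 8, 4, ((slice(0, 4), 5), (slice(0, 4), 0)),
             (slice(0, 8),), keys_absent=frozenset({0, 1})),
]
for scenario in SCENARIOS:
    run(scenario)
```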
…core The Fused test file had accumulated tests that either duplicated the pipeline-agnostic CodecPipelineTests suite or were misfiled. Triage: - async roundtrip / missing-chunk-fill / partial-shard-write dups: removed; the shared test_scenario covers these across both pipelines x sync/async stores. Added float32 and zstd Scenarios first so the dtype/codec coverage the dups carried transfers to the shared matrix (no net coverage loss). - store set_range / SupportsSetRange tests: already covered (more thoroughly, parametrized) in tests/test_store/test_memory.py; removed as dups. - ShardingCodec._inner_codecs_fixed_size tests: moved to tests/test_codecs/test_sharding_unit.py where the sharding internals live. What stays is genuinely Fused-only and cannot be pipeline-agnostic: the synchronous API (write_sync / read_sync / _sync_transform) which Batched has no equivalent of, and the byte-range fast-path assertions (set_range_sync fires / falls back) which test a Fused-only optimization. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The "invariants" file grouped tests by their shared motivation (a design doc) rather than by what they test, which is the wrong axis -- it mixed pipeline-agnostic behavior, Fused-only internals, and a per-codec property into one file. Sorted each test into the home its subject implies:
Pipeline-agnostic behavior -> CodecPipelineTests (runs on BOTH pipelines x sync/async stores via the existing fixtures):
- S2 empty-chunk skipping under default config -> a Scenario (keys_absent).
- S2 shard deleted after overwrite-to-fill -> a base-class method (it needs a mid-sequence key assertion the Scenario shape can't express).
- C3 no isinstance(ShardingCodec) branching in read/write -> a base-class method that resolves the subclass's configured pipeline and source-scans it.
Fused-only (byte-range fast path / ChunkTransform internals) -> test_fused_pipeline.py:
- S3 fast path skipped when write_empty_chunks=False (the unique complement of the existing uses-set-range test; the write_empty_chunks=True case was a dup and is dropped).
- B1 byte-range path copies read-only LocalStore buffers before mutating.
- C2 ChunkTransform passes each codec the runtime chunk_spec prototype.
Per-codec contract -> tests/test_codecs/test_codecs.py:
- C1 resolve_metadata only mutates shape (prototype/dtype/fill_value/config stable across the chain) -- a property of individual codecs, no pipeline.
Dropped as a pure duplicate (already in test_store/test_memory.py):
- test_supports_set_range_is_runtime_checkable.
No coverage lost: every kept test moved, and the two genuinely-shared behaviors now run on both pipelines instead of only whichever was default.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…o shared suite test_pipeline_read_parity checked Fused vs Batched partial reads against *each other*. The shared CodecPipelineTests suite already reads partial/strided selections from sharded arrays against a numpy reference on BOTH pipelines -- which is strictly stronger (it would catch both pipelines diverging from the spec in the same way, which a pipeline-vs-pipeline check cannot). The one sliver read-parity covered that the shared suite didn't was scalar single-element reads from a sharded array (the sharding codec's partial-decode path). Added two Scenarios (sharded-scalar-reads-1d / -2d) to capture it. Verified they exercise the partial-decode path on both pipelines: the default Fused pipeline routes a scalar sharded read through _decode_partial_sync, the Batched pipeline through _decode_partial_single -- so both variants are now checked against numpy, not just against each other. Kept in test_pipeline_parity.py the two checks the per-pipeline suite cannot express, because its two subclasses run in isolation and never see each other's output: - test_pipeline_parity: cross-read interop (write under A, read whole under B) + cross-pipeline store-key-set equality. - test_pipeline_parity_subchunk_write_order: byte-identical shard output across pipelines for every subchunk_write_order x index_location. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ross-file dup The file named test_sync_codec_pipeline.py tested no pipeline -- it is the unit test suite for ChunkTransform (the per-chunk synchronous codec chain that FusedCodecPipeline uses internally). "sync codec pipeline" was an earlier name for the Fused pipeline; the filename had outlived it. Renamed to test_chunk_transform.py (git mv preserves history) and added a module docstring naming what it actually covers. Also removed test_sync_transform_encode_decode_roundtrip from test_fused_pipeline.py: it was a weaker cross-file duplicate of this file's test_encode_decode_roundtrip (which covers the same encode->decode->compare over five codec chains rather than just bytes-only). Its one extra assertion -- that evolve_from_array_spec populates _sync_transform -- is already covered by test_evolve_from_array_spec in the Fused file. test_codec_pipeline.py left as-is: all three tests are correctly placed and cover things the Scenario suite can't (the low-level pipeline.read GetResult API, a plain dict store, and the zarr-developers#3937 cast_value dtype-threading regression). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The byte-range-write machinery works, but the right store interface for it is still undecided, so it is removed from this PR and will return once that lands.
Removed:
- SupportsSetRange protocol (abc/store.py) and its __all__ export.
- MemoryStore.set_range / set_range_sync / _set_range_impl and the SupportsSetRange base (storage/_memory.py).
- LocalStore.set_range / set_range_sync, the _put_range helper, and the SupportsSetRange base (storage/_local.py).
- The sharding codec's byte-range-write fast path in _encode_partial_sync; partial shard writes now always take the full-shard-rewrite path (identical to BatchedCodecPipeline, verified by the pipeline-parity suite). Also dropped the now-dead _chunk_byte_offset helper it relied on.
- changes/3907.feature.md (the byte-range-writes changelog note). The byte-range-READ changelog (3004) is unrelated and kept.
Byte-range READS (ByteRequest, get(byte_range=), get_ranges coalescing, the read-side bulk shard decode) are untouched -- this only removes writes.
The known-good tests that exercise byte-range writes are commented out (not deleted) in test_store/test_memory.py, test_store/test_local.py, and test_fused_pipeline.py, to restore once the store design is settled.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
PR-added module-level helper in array.py with zero callers — an ArraySpec-reuse optimization that was never wired up. Plain function, no protocol role, safe to drop. Verified: no references anywhere in src/ or tests/, and the full array/sharding/pipeline suites stay green. Note: ShardingCodec._encode_sync, though never *called*, is NOT dead — it is a required member of the runtime_checkable SupportsSyncCodec protocol. Removing it drops ShardingCodec from SupportsSyncCodec and breaks the sync read-fallback routing (16 test failures), so it stays. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
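The point about `_encode_sync` being required even though it is never called follows from how `typing.runtime_checkable` works: `isinstance` checks that every protocol member is present. A small sketch, with a toy `SupportsSync` protocol standing in for SupportsSyncCodec (whose real member list is assumed here):

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class SupportsSync(Protocol):
    """Toy stand-in for a sync-codec protocol with two required methods."""

    def _decode_sync(self, data: bytes) -> bytes: ...
    def _encode_sync(self, data: bytes) -> bytes: ...


class Full:
    def _decode_sync(self, data: bytes) -> bytes:
        return data

    def _encode_sync(self, data: bytes) -> bytes:  # never called below
        return data


class DecodeOnly:
    def _decode_sync(self, data: bytes) -> bytes:
        return data


def route(codec: object) -> str:
    # isinstance against a runtime_checkable Protocol checks that all
    # protocol members exist on the object, not whether they are ever called.
    return "sync" if isinstance(codec, SupportsSync) else "async-fallback"
```

Deleting the unused method from `Full` would silently change `route(Full())` from `"sync"` to `"async-fallback"`, which mirrors the routing breakage described in the commit message.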
The docstring claimed _encode_sync "iterates inner chunks in Morton order — that's the canonical layout the shard index expects", which is wrong and a latent footgun: it implies the method imposes a morton physical layout. It does not. The morton iteration only populates an intermediate dict whose key order is immaterial; the on-disk layout is decided downstream by the subchunk_write_order loop in _encode_shard_dict_sync (same as the async _encode_single sibling). Also clarified that this method IS reached — via nested sharding, where an inner ShardingCodec is encoded through the outer codec's ChunkTransform. (It is not called for top-level sharded writes, which route through _encode_partial_sync.) Verified empirically: routing through nested _encode_sync, all three subchunk_write_order values roundtrip correctly AND morton vs lexicographic produce physically different bytes — i.e. the order is honored, not ignored. Behavior unchanged; docstring only. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
PR-added thin wrapper (`_load_shard_index_maybe(...) or _ShardIndex.create_empty(...)`) with zero invocations anywhere in src/ or tests/. Unlike _encode_sync, this is genuinely removable: confirmed it is NOT a member of any runtime_checkable protocol or ABC (no reference in src/zarr/abc/, not a base-class override) and is reached by no dynamic dispatch (no getattr / string reference). main has no _load_shard_index* methods at all, so it was introduced and left unused by this PR. The _maybe and _maybe_sync variants it wrapped remain and are used. Verified: full sharding + nested-sharding + parity + pipeline suites stay green, ruff + mypy clean. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ring The FusedCodecPipeline class docstring still described sharded writes as using "byte-range writes via set_range_sync" — but byte-range-write support was removed from this PR (set_range_sync / SupportsSetRange are gone). Sharded writes now take the codec's synchronous full-shard-rewrite path. Docstring only; no behavior change. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This branch's docstrings/comments had introduced RST-style ``double-backtick`` inline literals, which this project does not use (plain single backticks only — no RST roles or double-backticks). Converted the 25 occurrences across the sharding codec, codec_pipeline, and fsspec store docstrings/comments to single backticks. Style only; no behavior change. Also confirmed (via git blame, this-branch lines only) there are no remaining references to removed/outdated designs: the byte-range-write (set_range) mentions and the "separating IO from compute" framing were already corrected earlier in this branch. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… opt-in Pairs with the FusedCodecPipeline default: keep the new pipeline, but do NOT enable threading by default. `max_workers=None` (auto -> cpu_count) spawned a thread pool on every read/write, which is a behavior change with real downstream risk — it runs custom stores/codecs concurrently (thread-safety) and can oversubscribe many-core nodes whose workloads already parallelize at the dask/MPI layer. The default is now 1 (fully sequential: the pool is never created when max_workers <= 1). Parallelism is opt-in via `codec_pipeline.max_workers` (positive int, or None for auto). Updates _resolve_max_workers docstring and the config-defaults test accordingly. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This PR defines a new codec pipeline class called PhasedCodecPipeline that enables much higher performance for chunk encoding and decoding than the current BatchedCodecPipeline.
The approach here is to completely ignore how the v3 spec defines array -> bytes codecs 😆. Instead of treating codecs as functions that mix IO and compute, we treat codec encoding and decoding as a sequence:
fetch exactly what we need to fetch from storage, given the codecs we have. So if there's a sharding codec in the first array->bytes position, the codec pipeline knows it must fetch the shard index, then fetch the involved subchunks, before passing them to compute.
Basically, we use the first array -> bytes codec to figure out what kind of preparatory IO and final IO we need to perform, and the rest of the codecs to figure out what kind of chunk encoding we need to do. Separating IO from compute in different phases makes things simpler and faster.
Happy to chat more about this direction. IMO the spec should be re-written with this framing, because it makes much more sense than trying to shoe-horn sharding in as a codec.
I don't want to make our benchmarking suite any bigger, but on my laptop this codec pipeline is 2-5x faster than the BatchedCodecPipeline for a lot of workloads. I can include some of those benchmarks later.
This was mostly written by Claude, based on previous work in #3719. All these changes should be non-breaking, so I think this is in principle safe for us to play around with in a patch or minor release.
Edit: this PR depends on changes submitted in #3907 and #3908