perf: phased codecpipeline by d-v-b · Pull Request #3885 · zarr-developers/zarr-python

d-v-b · 2026-04-08T13:34:35Z

This PR defines a new codec pipeline class called PhasedCodecPipeline that enables much higher performance for chunk encoding and decoding than the current BatchedCodecPipeline.

The approach here is to completely ignore how the v3 spec defines array -> bytes codecs 😆. Instead of treating codecs as functions that mix IO and compute, we treat codec encoding and decoding as a sequence:

1. Preparatory IO (async). Fetch exactly what we need from storage, given the codecs we have. If there's a sharding codec in the first array -> bytes position, the codec pipeline knows it must fetch the shard index, then fetch the involved subchunks, before passing them to compute.
2. Pure compute (sync). Apply filters and compressors. This is safe to parallelize over chunks.
3. (If writing) Final IO (async). Reconcile the in-memory compressed chunks against our model of the stored chunk, then write out the bytes.

Basically, we use the first array -> bytes codec to figure out what kind of preparatory IO and final IO we need to perform, and the rest of the codecs to figure out what kind of chunk encoding we need to do. Separating IO from compute in different phases makes things simpler and faster.
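The phase split described above can be sketched roughly as follows. This is a toy illustration and not the code in this PR: the `STORE` dict, the `fetch`/`put` helpers, and the use of `zlib` as the only codec are all assumptions made for the example. In a real pipeline the compute phase would be offloaded so it does not block the event loop; here it runs inline for brevity.

```python
from __future__ import annotations

import asyncio
import zlib
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "store": chunk key -> compressed bytes.
STORE: dict[str, bytes] = {}


async def fetch(key: str) -> bytes | None:
    """Phase 1 helper: async IO only, no decoding."""
    await asyncio.sleep(0)  # stand-in for real storage latency
    return STORE.get(key)


async def put(key: str, value: bytes) -> None:
    """Phase 3 helper: async IO only, no encoding."""
    await asyncio.sleep(0)
    STORE[key] = value


def encode(raw: bytes) -> bytes:
    """Phase 2 helper: pure compute, safe to run in a worker thread."""
    return zlib.compress(raw)


def decode(blob: bytes | None) -> bytes | None:
    """Phase 2 helper: missing chunks stay None."""
    return None if blob is None else zlib.decompress(blob)


async def write_chunks(chunks: dict[str, bytes]) -> None:
    # Phase 2: encode every chunk (sync compute, parallel over chunks).
    with ThreadPoolExecutor() as pool:
        encoded = list(pool.map(encode, chunks.values()))
    # Phase 3: write the results (async IO).
    await asyncio.gather(*(put(k, v) for k, v in zip(chunks, encoded)))


async def read_chunks(keys: list[str]) -> list[bytes | None]:
    # Phase 1: fetch all requested bytes (async IO).
    blobs = await asyncio.gather(*(fetch(k) for k in keys))
    # Phase 2: decode (sync compute, parallel over chunks).
    with ThreadPoolExecutor() as pool:
        return list(pool.map(decode, blobs))
```

No function in the sketch mixes IO with compute, which is what makes the compute phase trivially parallel.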

Happy to chat more about this direction. IMO the spec should be re-written with this framing, because it makes much more sense than trying to shoe-horn sharding in as a codec.

I don't want to make our benchmarking suite any bigger, but on my laptop this codec pipeline is 2-5x faster than BatchedCodecPipeline for a lot of workloads. I can include some of those benchmarks later.

This was mostly written by Claude, based on previous work in #3719. All these changes should be non-breaking, so I think this is in principle safe for us to play around with in a patch or minor release.

Edit: this PR depends on changes submitted in #3907 and #3908

`PreparedWrite` models a set of per-chunk changes that would be applied to a stored chunk. `SupportsChunkPacking` is a protocol for array -> bytes codecs that can use `PreparedWrite` objects to update an existing chunk.

codecov · 2026-04-08T16:52:11Z

Codecov Report

❌ Patch coverage is 90.19934% with 59 lines in your changes missing coverage. Please review.
✅ Project coverage is 93.16%. Comparing base (3ede5e8) to head (2f97362).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/zarr/core/codec_pipeline.py | 88.72% | 31 Missing ⚠️ |
| src/zarr/codecs/sharding.py | 90.83% | 22 Missing ⚠️ |
| src/zarr/codecs/numcodecs/_codecs.py | 83.33% | 4 Missing ⚠️ |
| src/zarr/storage/_local.py | 93.75% | 1 Missing ⚠️ |
| src/zarr/storage/_memory.py | 94.73% | 1 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3885      +/-   ##
==========================================
- Coverage   93.27%   93.16%   -0.12%     
==========================================
  Files          87       87              
  Lines       11676    12174     +498     
==========================================
+ Hits        10891    11342     +451     
- Misses        785      832      +47

| Files with missing lines | Coverage Δ | |
|---|---|---|
| src/zarr/abc/store.py | `96.34% <100.00%> (+0.04%)` | ⬆️ |
| src/zarr/codecs/_v2.py | `94.11% <100.00%> (+0.50%)` | ⬆️ |
| src/zarr/core/array.py | `97.74% <100.00%> (+0.02%)` | ⬆️ |
| src/zarr/core/config.py | `100.00% <ø> (ø)` | |
| src/zarr/storage/_local.py | `95.27% <93.75%> (-0.14%)` | ⬇️ |
| src/zarr/storage/_memory.py | `96.57% <94.73%> (-0.18%)` | ⬇️ |
| src/zarr/codecs/numcodecs/_codecs.py | `95.45% <83.33%> (-0.94%)` | ⬇️ |
| src/zarr/codecs/sharding.py | `90.17% <90.83%> (+0.77%)` | ⬆️ |
| src/zarr/core/codec_pipeline.py | `90.46% <88.72%> (-3.72%)` | ⬇️ |

... and 1 file with indirect coverage changes

d-v-b · 2026-04-09T08:36:25Z

@TomAugspurger how would this design work with CUDA codecs?

ilan-gold · 2026-04-17T13:08:47Z

+        # Phase 1: fetch all chunks (IO, sequential)
+        raw_buffers: list[Buffer | None] = [
+            bg.get_sync(prototype=cs.prototype)  # type: ignore[attr-defined]
+            for bg, cs, *_ in batch
+        ]
+
+        # Phase 2: decode (compute, optionally threaded)
+        def _decode_one(raw: Buffer | None, chunk_spec: ArraySpec) -> NDBuffer | None:
+            if raw is None:
+                return None
+            return transform.decode_chunk(raw, chunk_spec)
+
+        specs = [cs for _, cs, *_ in batch]
+        if n_workers > 0 and len(batch) > 1:
+            with ThreadPoolExecutor(max_workers=n_workers) as pool:
+                decoded_list = list(pool.map(_decode_one, raw_buffers, specs))
+        else:
+            decoded_list = [
+                _decode_one(raw, spec) for raw, spec in zip(raw_buffers, specs, strict=True)
+            ]


Why isn't this all multi-threaded, i.e., the I/O as well?

d-v-b · 2026-04-17

I should benchmark this, but my expectation was that IO against memory storage and local storage is not compute-limited, so threads wouldn't remove a real bottleneck. For memory storage I'm sure this is true; I'm not sure about local storage, though.

Adds a SupportsSetRange protocol to zarr.abc.store for stores that allow overwriting a byte range within an existing value. Implementations are added for LocalStore (using file-handle seek+write) and MemoryStore (in-memory bytearray slice assignment). This is the prerequisite for the partial-shard write fast path in ShardingCodec, which can patch individual inner-chunk slots without rewriting the entire shard blob when the inner codec chain is fixed-size. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
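A rough sketch of the two strategies named in this commit message (bytearray slice assignment for memory, seek+write for files). The method name `set_range`, its signature, and both toy store classes are assumptions for illustration; the protocol actually added to zarr.abc.store may look different.

```python
import os
from typing import Protocol, runtime_checkable


@runtime_checkable
class SupportsSetRange(Protocol):
    # Hypothetical signature; the real protocol may differ.
    def set_range(self, key: str, value: bytes, start: int) -> None: ...


class ToyMemoryStore:
    """Values are bytearrays, so a range can be patched by slice assignment."""

    def __init__(self) -> None:
        self._data: dict[str, bytearray] = {}

    def set(self, key: str, value: bytes) -> None:
        self._data[key] = bytearray(value)

    def get(self, key: str) -> bytes:
        return bytes(self._data[key])

    def set_range(self, key: str, value: bytes, start: int) -> None:
        buf = self._data[key]
        if start < 0 or start + len(value) > len(buf):
            raise ValueError("byte range falls outside the existing value")
        buf[start : start + len(value)] = value


class ToyLocalStore:
    """Values are files; a range is patched with seek + write."""

    def __init__(self, root: str) -> None:
        self._root = root

    def _path(self, key: str) -> str:
        return os.path.join(self._root, key)

    def set(self, key: str, value: bytes) -> None:
        with open(self._path(key), "wb") as f:
            f.write(value)

    def get(self, key: str) -> bytes:
        with open(self._path(key), "rb") as f:
            return f.read()

    def set_range(self, key: str, value: bytes, start: int) -> None:
        path = self._path(key)
        if start < 0 or start + len(value) > os.path.getsize(path):
            raise ValueError("byte range falls outside the existing value")
        # "r+b" opens for update without truncating the file.
        with open(path, "r+b") as f:
            f.seek(start)
            f.write(value)
```

Both toy stores reject ranges that would extend the value, so a patch can never change the stored length.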

V2Codec, BytesCodec, BloscCodec, etc. previously only implemented the async _decode_single / _encode_single methods. Add their sync counterparts (_decode_sync / _encode_sync) so that the upcoming SyncCodecPipeline can dispatch through them without spinning up an event loop. For codecs that wrap external compressors (numcodecs.Zstd, numcodecs.Blosc, the V2 fallback chain), the sync versions just call the underlying compressor's blocking API directly instead of routing through asyncio.to_thread. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
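To make the sync/async relationship concrete, here is a minimal sketch under stated assumptions: `ToyCompressionCodec` is hypothetical, `zlib` stands in for the numcodecs compressors, and the method signatures are simplified to plain `bytes` (the real codec methods also take chunk specs and buffer types).

```python
import asyncio
import zlib


class ToyCompressionCodec:
    """Hypothetical codec wrapping a blocking compressor."""

    def _encode_sync(self, data: bytes) -> bytes:
        # Call the compressor's blocking API directly: no event loop involved.
        return zlib.compress(data)

    def _decode_sync(self, data: bytes) -> bytes:
        return zlib.decompress(data)

    async def _encode_single(self, data: bytes) -> bytes:
        # The async entry point offloads the same blocking call to a thread.
        return await asyncio.to_thread(self._encode_sync, data)

    async def _decode_single(self, data: bytes) -> bytes:
        return await asyncio.to_thread(self._decode_sync, data)
```

A sync pipeline can call `_decode_sync` from its own worker threads, while existing async callers keep using `_decode_single` unchanged.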

…arallelism

Adds SyncCodecPipeline alongside BatchedCodecPipeline. The new pipeline runs codecs through their sync entry points (_decode_sync / _encode_sync) and dispatches per-chunk work to a module-level thread pool sized by the codec_pipeline.max_workers config (default = os.cpu_count()).

Each chunk's full lifecycle (fetch + decode + scatter for reads; get-existing + merge + encode + set/delete for writes) runs as one pool task, overlapping IO of one chunk with compute of another. Scatter into the shared output buffer is thread-safe because chunks have non-overlapping output selections.

The async wrappers (read/write) detect SupportsGetSync/SupportsSetSync stores and dispatch to the sync fast path, passing the configured max_workers. Other stores fall through to the async path, which still uses asyncio.concurrent_map at async.concurrency.

Notes on perf:
- Default (None → cpu_count) is tuned for chunks ≥ ~512 KB.
- Small chunks (≤ 64 KB) regress 1.5-3x because pool dispatch overhead (~30-50 µs/task) dominates per-chunk work. Workaround: zarr.config.set({"codec_pipeline.max_workers": 1}).
- For large chunks on local/memory stores, IO+compute parallelism yields 1.7-2.5x over BatchedCodecPipeline on direct-API reads and ~2.5x on roundtrip.

ChunkTransform encapsulates the sync codec chain. It caches resolved ArraySpecs across calls with the same chunk_spec; combined with the constant-ArraySpec optimization in indexing, hot-path overhead is minimized.

Includes test scaffolding for the new pipeline (test_sync_codec_pipeline) and config plumbing for the max_workers key.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
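The "one pool task per chunk" read path can be illustrated with a toy 1-D layout. Everything here is an assumption for the example: the `store` dict, the fixed `CHUNK_NBYTES`, `zlib` as the codec, and a per-call pool (the commit describes a module-level pool instead). The point is only that each task writes to a disjoint slice of the shared output.

```python
from __future__ import annotations

import zlib
from concurrent.futures import ThreadPoolExecutor

CHUNK_NBYTES = 4  # toy 1-D layout: every chunk covers 4 bytes of the output

# Hypothetical store: chunk index -> compressed chunk bytes.
store = {i: zlib.compress(bytes([i]) * CHUNK_NBYTES) for i in range(8)}


def read_one(chunk_index: int, out: memoryview) -> None:
    """One pool task: fetch + decode + scatter for a single chunk."""
    blob = store.get(chunk_index)  # fetch (IO)
    if blob is None:
        return  # missing chunk: leave the fill value (zeros) in place
    decoded = zlib.decompress(blob)  # decode (compute)
    start = chunk_index * CHUNK_NBYTES
    # Scatter: each chunk owns a disjoint region, so no lock is needed.
    out[start : start + CHUNK_NBYTES] = decoded


def read_all(n_chunks: int, max_workers: int | None = None) -> bytes:
    out = bytearray(n_chunks * CHUNK_NBYTES)
    view = memoryview(out)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() drains the iterator so worker exceptions are re-raised here.
        list(pool.map(lambda i: read_one(i, view), range(n_chunks)))
    return bytes(out)
```

Passing `max_workers=1` in this sketch plays the same role as the config workaround mentioned above: the work is serialized onto a single worker.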

Adds _encode_partial_sync and _decode_partial_sync to ShardingCodec.

For fixed-size inner codec chains and stores that implement SupportsSetRange, partial writes patch individual inner-chunk slots in-place instead of rewriting the whole shard:
- Reads existing shard index (one byte-range get).
- For each affected inner chunk: decodes the slot, merges the new region, re-encodes.
- Writes each modified slot at its deterministic byte offset, then rewrites just the index.

For variable-size inner codecs (e.g. with compression) or stores that don't support byte-range writes, falls through to a full-shard rewrite matching BatchedCodecPipeline semantics.

The partial-decode path computes a ReadPlan from the shard index and issues one byte-range get per overlapping chunk, decoding only what the read selection touches.

Both paths are dispatched from SyncCodecPipeline via the existing supports_partial_decode / supports_partial_encode protocol checks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
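A simplified sketch of the in-place slot patch, assuming a toy shard layout: `N_SLOTS` fixed-size slots followed by an index of little-endian `(offset, nbytes)` uint64 pairs, with no checksum. The constants and helper names are hypothetical, the shard is a plain `bytearray` rather than a store value, and the decode/merge/re-encode step is skipped (the caller passes already-encoded bytes).

```python
import struct

MAX_UINT_64 = 2**64 - 1  # sentinel for "inner chunk not present"
N_SLOTS = 4  # inner chunks per shard (toy value)
SLOT_NBYTES = 8  # fixed encoded size of every inner chunk
INDEX_NBYTES = N_SLOTS * 16  # one (offset, nbytes) uint64 pair per slot
INDEX_FMT = f"<{2 * N_SLOTS}Q"


def empty_shard() -> bytearray:
    """Data section of zeroed slots, then an index marking all slots empty."""
    index = struct.pack(INDEX_FMT, *([MAX_UINT_64] * (2 * N_SLOTS)))
    return bytearray(N_SLOTS * SLOT_NBYTES) + bytearray(index)


def read_index(shard: bytearray) -> list[tuple[int, int]]:
    flat = struct.unpack(INDEX_FMT, bytes(shard[-INDEX_NBYTES:]))
    return list(zip(flat[0::2], flat[1::2]))


def patch_slot(shard: bytearray, rank: int, encoded: bytes) -> None:
    """Overwrite one slot in place, then rewrite only the index."""
    if len(encoded) != SLOT_NBYTES:
        raise ValueError("fast path requires fixed-size inner chunks")
    offset = rank * SLOT_NBYTES  # deterministic: depends only on the rank
    shard[offset : offset + SLOT_NBYTES] = encoded
    index = read_index(shard)
    index[rank] = (offset, SLOT_NBYTES)
    flat = [value for pair in index for value in pair]
    shard[-INDEX_NBYTES:] = struct.pack(INDEX_FMT, *flat)
```

With a byte-range-capable store, the two slice assignments in `patch_slot` would become two range writes, and the rest of the shard would never be read or rewritten.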

Two new test files:
- test_codec_invariants: asserts contract-level properties that every codec / shard / buffer combination must satisfy: round-trip exactness, prototype propagation, fill-value handling, all-empty shard handling.
- test_pipeline_parity: exhaustive matrix asserting that SyncCodecPipeline and BatchedCodecPipeline produce semantically identical results across codec configs, layouts (including nested sharding), write sequences, and write_empty_chunks settings. Three checks per cell:
  1. Same array contents on read.
  2. Same set of store keys after writes.
  3. Each pipeline reads the other's output identically (catches layout-divergence bugs).

These tests pinned the design throughout the SyncCodecPipeline + partial-shard development.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
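The three parity checks can be sketched against two stand-in pipelines. `ToyPipelineA`, `ToyPipelineB`, and `check_parity` are hypothetical and operate on plain dicts and `zlib`; the real tests exercise the actual pipeline classes over a matrix of configurations.

```python
import zlib


class ToyPipelineA:
    """Stand-in for one pipeline: one-shot compression API."""

    def write(self, store: dict, key: str, data: bytes) -> None:
        store[key] = zlib.compress(data)

    def read(self, store: dict, key: str) -> bytes:
        return zlib.decompress(store[key])


class ToyPipelineB:
    """Stand-in for a second pipeline: same format, streaming API."""

    def write(self, store: dict, key: str, data: bytes) -> None:
        compressor = zlib.compressobj()
        store[key] = compressor.compress(data) + compressor.flush()

    def read(self, store: dict, key: str) -> bytes:
        decompressor = zlib.decompressobj()
        return decompressor.decompress(store[key]) + decompressor.flush()


def check_parity(a, b, key: str, data: bytes) -> None:
    store_a: dict[str, bytes] = {}
    store_b: dict[str, bytes] = {}
    a.write(store_a, key, data)
    b.write(store_b, key, data)
    # 1. Same contents on read.
    assert a.read(store_a, key) == b.read(store_b, key) == data
    # 2. Same set of store keys after writes.
    assert store_a.keys() == store_b.keys()
    # 3. Each pipeline reads the other's output identically.
    assert a.read(store_b, key) == data
    assert b.read(store_a, key) == data
```

Check 3 is the one that catches two implementations that each round-trip their own output but disagree on the stored layout.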

Adds .gitignore entries for .claude/, CLAUDE.md, and docs/superpowers/ so local IDE/agent planning artifacts don't get committed by accident. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ilan-gold · 2026-04-18T14:35:38Z

+            selected = decoded[chunk_selection]
+            if drop_axes:
+                selected = selected.squeeze(axis=drop_axes)
+            out[out_selection] = selected


It might be worth experimenting with moving the assignment `out[out_selection] = selected` outside the thread pool execution since, IIRC, it holds the GIL and probably takes a non-trivial amount of time.

Memory usage will probably go up a bit, though.
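A rough sketch of that variant, assuming a toy 1-D layout with `zlib` as the codec (the `blobs` list and `CHUNK_NBYTES` are made up for the example): workers only decode, and the scatter runs afterwards on the calling thread.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

CHUNK_NBYTES = 4
blobs = [zlib.compress(bytes([i]) * CHUNK_NBYTES) for i in range(6)]


def read_with_deferred_scatter(blobs: list[bytes]) -> bytes:
    out = bytearray(len(blobs) * CHUNK_NBYTES)
    with ThreadPoolExecutor() as pool:
        # Workers only decode; every decoded chunk is held until the pool finishes.
        decoded = list(pool.map(zlib.decompress, blobs))
    # Scatter runs on the calling thread, after the parallel section.
    for i, chunk in enumerate(decoded):
        out[i * CHUNK_NBYTES : (i + 1) * CHUNK_NBYTES] = chunk
    return bytes(out)
```

The `decoded` list is where the extra memory comes from: all decoded chunks are alive at once instead of being dropped as each task finishes.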

Both were exported from zarr.abc.codec.__all__ but never referenced by either codec pipeline or any test. Artifacts of an earlier design iteration superseded by the current SyncCodecPipeline. Also remove now-unused imports of `dataclass` and `ChunkProjection` that were only needed by the deleted symbols. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Both tests/test_phased_codec_pipeline.py and tests/test_pipeline_benchmark.py import PhasedCodecPipeline, which no longer exists in src/. Each failed at collection. The benchmarking intent of test_pipeline_benchmark.py is replaced by extensions to tests/benchmarks/test_e2e.py later in this branch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The new name describes what the pipeline does (fuses fetch+decode+scatter into one task per chunk) rather than the implementation detail of using sync codec entry points. The name also stays accurate when this pipeline gains a remote-store / async fast path in a future change. Mechanical rename across the class, register_pipeline call, dotted-path strings used by zarr.config, isinstance checks, parametrize values, and docstrings. tests/test_sync_pipeline.py renamed to tests/test_fused_pipeline.py. Nothing on this branch is released, so no deprecation alias is needed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The BatchedCodecPipeline and FusedCodecPipeline classes had identical copies of _merge_chunk_array (one method, one staticmethod). Extract once as a module-level free function and call from both. No new base class or mixin is introduced. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Both BatchedCodecPipeline.read_batch (non-partial-decode branch) and FusedCodecPipeline.read (async fallback) duplicate the same sequence: concurrent_map(get) -> pipeline.decode -> scatter into out. Lift to a module-level free function and call from both. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Both BatchedCodecPipeline.write_batch (non-partial-encode branch) and FusedCodecPipeline.write (async fallback) duplicate the same sequence: read existing bytes -> decode -> merge -> encode -> set/delete. Lift to a module-level free function and call from both. After this change, neither pipeline class carries _merge_chunk_array, nor the duplicated read/write fallback bodies. Each class is reduced to its constructor, fast-path methods, and thin async dispatch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds a `pipeline` fixture with values ["batched", "fused"] that swaps codec_pipeline.path for the duration of each benchmark. Both test_write_array and test_read_array now produce one benchmark cell per (compression x layout x store x pipeline). CodSpeed will report comparable numbers for both pipelines on the same workloads. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds `latency in {0, 0.001, 0.05, 0.2}` and a bench_store fixture that wraps the underlying memory store in zarr.testing.store.LatencyStore when latency > 0. Local-store cells skip nonzero latency — adding synthetic latency on top of a real filesystem double-counts and is not the intended measurement. Combined with the pipeline parameter, the matrix now produces comparable benchmark numbers for {Batched, Fused} x {0, 1ms, 50ms, 200ms} on memory-shaped operation. The numbers are signal under one simple model of remote latency, not absolute predictions of S3 behavior. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
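The wrapping logic can be sketched with toy classes. `ToyMemoryStore`, `ToyLatencyStore`, and `make_bench_store` are hypothetical stand-ins written for this example; they do not reproduce the interface of the real store classes.

```python
from __future__ import annotations

import asyncio


class ToyMemoryStore:
    def __init__(self) -> None:
        self._data: dict[str, bytes] = {}

    async def get(self, key: str) -> bytes | None:
        return self._data.get(key)

    async def set(self, key: str, value: bytes) -> None:
        self._data[key] = value


class ToyLatencyStore:
    """Hypothetical wrapper that sleeps before delegating each operation."""

    def __init__(self, inner: ToyMemoryStore, latency: float) -> None:
        self._inner = inner
        self._latency = latency

    async def get(self, key: str) -> bytes | None:
        await asyncio.sleep(self._latency)
        return await self._inner.get(key)

    async def set(self, key: str, value: bytes) -> None:
        await asyncio.sleep(self._latency)
        await self._inner.set(key, value)


def make_bench_store(latency: float) -> ToyMemoryStore | ToyLatencyStore:
    """Wrap only when latency > 0, as the fixture described above does."""
    inner = ToyMemoryStore()
    return ToyLatencyStore(inner, latency) if latency > 0 else inner
```

Because the delay is an `asyncio.sleep`, concurrent requests overlap their waits, which is the simple model of remote latency the commit message refers to.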

Commit 7f45aba (which converted _decode_single -> _decode_sync) dropped two explanatory comment blocks from the dtype-handling branches in V2Codec.decode. Both comments document non-obvious WHY:
- The TypeError catch is for chunks whose stored dtype doesn't match the array spec dtype (e.g. string dtype vs object array).
- The elif branch fires when filters were tampered with: an object array needs an object codec in the filter chain to be read correctly.

These were removed as drive-by cleanup during the sync-method rename without intent to delete the substance. Restoring verbatim.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add docstring substance and a couple of inline notes to the new sync methods on ShardingCodec that landed on this branch. Concretely:
- _decode_sync / _encode_sync: explain how each relates to the async counterpart and the partial-* variants, and why inner chunks are iterated in Morton order on the encode path.
- _encode_shard_dict_sync: explain the two-pass offset shift in the index-at-start branch (offsets are written relative to the data section, then bumped by len(index_bytes)) and the MAX_UINT_64 empty-chunk sentinel that must not be touched.
- _encode_partial_sync byte-range path: explain WHY morton-rank determines byte offset deterministically (fixed-size inner chunks = every slot at a stable offset regardless of which others are present); this is the load-bearing invariant for the byte-range fast path.
- _decode_partial_sync: docstring now lists the two sub-paths (full-shard fetch vs. index-then-byte-ranges) and the reason the full-shard branch exists (one round trip beats N+1 small ones).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
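The two-pass offset shift for the index-at-start layout can be illustrated with a toy encoder. This is a sketch under assumptions: the function name is hypothetical, the index is a flat run of little-endian `(offset, nbytes)` uint64 pairs with no checksum, and chunks are taken in the order given rather than in Morton order.

```python
from __future__ import annotations

import struct

MAX_UINT_64 = 2**64 - 1  # a (MAX, MAX) pair marks an empty inner chunk


def encode_shard_index_at_start(chunks: list[bytes | None]) -> bytes:
    """Toy shard encoder: index first, then the data section."""
    # Pass 1: offsets relative to the start of the data section.
    entries: list[tuple[int, int]] = []
    cursor = 0
    for chunk in chunks:
        if chunk is None:
            entries.append((MAX_UINT_64, MAX_UINT_64))
        else:
            entries.append((cursor, len(chunk)))
            cursor += len(chunk)

    # The index has a fixed size, so its length is known before it is filled in.
    index_nbytes = 16 * len(chunks)

    # Pass 2: bump real offsets by the index length; never touch the sentinel.
    shifted = [
        (offset, nbytes) if offset == MAX_UINT_64 else (offset + index_nbytes, nbytes)
        for offset, nbytes in entries
    ]
    flat = [value for entry in shifted for value in entry]
    index_bytes = struct.pack(f"<{len(flat)}Q", *flat)
    return index_bytes + b"".join(c for c in chunks if c is not None)
```

In this toy version, bumping the sentinel would push it past the uint64 range and `struct.pack` would raise; with wrapping integer arithmetic it would instead silently turn an empty slot into a bogus offset, which is why the sentinel has to be skipped.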

Docstrings added on this branch used RST-style ``literal`` markup (double-backticks). Convert to Markdown-style `literal` (single backticks) so the docstrings render correctly in Markdown-aware viewers without needing a separate RST-to-Markdown step.

Two cases worth calling out:
- src/zarr/core/codec_pipeline.py and src/zarr/codecs/sharding.py: every ``literal`` in these files came in on this branch, so the conversion is global within those files.
- src/zarr/abc/store.py and src/zarr/core/array.py: only docstrings added on this branch are converted; pre-existing RST-style literals from main are left alone (out of scope).

Also converted one .. note:: directive in src/zarr/core/array.py (the regular_chunk_array_spec helper) to a Markdown blockquote, since that directive was added on this branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
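For reference, a conversion of this kind could be approximated with a single regex. This is a hypothetical helper, not the tool used on this branch, and it only handles single-line literals; it makes no attempt to restrict itself to docstrings or to lines added on a particular branch.

```python
import re

# Matches ``literal`` spans that contain no backticks or newlines.
_RST_LITERAL = re.compile(r"``([^`\n]+)``")


def rst_literals_to_markdown(text: str) -> str:
    """Rewrite RST double-backtick literals as Markdown single-backtick literals."""
    return _RST_LITERAL.sub(r"`\1`", text)
```

Text that already uses single backticks is left unchanged, so running the helper twice is harmless.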

d-v-b added 4 commits April 7, 2026 10:38

Merge branch 'main' of https://github.com/zarr-developers/zarr-python …

a072c31

…into perf/prepared-write-v2

feat: new codec pipeline that uses sync path

47a407f

feat: complete second codecpipeline

3c27e49

github-actions Bot added the `needs release notes` label (automatically applied to PRs which haven't added release notes) Apr 8, 2026

Merge branch 'main' of https://github.com/zarr-developers/zarr-python …

9b834a4

…into perf/prepared-write-v2

d-v-b added 2 commits April 8, 2026 19:51

fix: handle rectilinear chunks

c731cf2

Merge branch 'main' of https://github.com/zarr-developers/zarr-python …

9e25150

…into perf/prepared-write-v2

This was referenced Apr 8, 2026

perf/prepared write #3727

Closed

perf/sharding chunk transform #3729

Closed

perf/chunkrequest #3730

Closed

sketch out sync codecs + threadpool #3715

Closed

fixup

ae0580c

d-v-b mentioned this pull request Apr 9, 2026

[do not merge] benchmarks + tests for phased codecpipeline #3891

Open

d-v-b force-pushed the perf/prepared-write-v2 branch from 5d3064e to b67a5a0 Compare April 15, 2026 09:51

github-actions Bot removed the `needs release notes` label (automatically applied to PRs which haven't added release notes) Apr 15, 2026

d-v-b force-pushed the perf/prepared-write-v2 branch 2 times, most recently from a84a15a to 68a7cdc Compare April 17, 2026 10:41

ilan-gold reviewed Apr 17, 2026

d-v-b and others added 6 commits April 17, 2026 22:51

chore: gitignore local agent/planning notes

1be5563

d-v-b force-pushed the perf/prepared-write-v2 branch from aa111a2 to 1be5563 Compare April 17, 2026 21:04

ilan-gold reviewed Apr 18, 2026

Merge branch 'main' into perf/prepared-write-v2

985716b

d-v-b and others added 14 commits April 30, 2026 17:37

Merge branch 'main' into perf/prepared-write-v2

65d98a1

Merge branch 'perf/prepared-write-v2' of https://github.com/d-v-b/zar…

3f3e7ea

…r-python into perf/prepared-write-v2

Merge branch 'main' of https://github.com/zarr-developers/zarr-python …

d2e08de

…into perf/prepared-write-v2

d-v-b wants to merge 29 commits into zarr-developers:main from d-v-b:perf/prepared-write-v2

2 participants
