Add mode="r+" for to_zarr and use consolidated writes/reads by default#5252
Add mode="r+" for to_zarr and use consolidated writes/reads by default#5252dcherian merged 17 commits intopydata:masterfrom
Conversation
`mode="r+"` only allows for modifying pre-existing array values in a Zarr store. This makes it a safer default `mode` when doing a limited `region` write. It also offers a nice performance bonus when using consolidated metadata, because the store to modify can be opened in "consolidated" mode. This PR includes several related changes to `to_zarr()`: 1. It adds support for the new `mode="r+"`. 2. `consolidated=True` in `to_zarr()` now means "open in consolidated mode" if using using `mode="r+"`, instead of "write in consolidated mode" (which would not make sense for r+). 3. It allows setting `consolidated=True` when using `region`, mostly for the sake of fast store opening with r+. 4. Validation in `to_zarr()` has been reorganized to always use the _existing_ Zarr group, rather than re-opening zar stores from scratch, which could require additional network requests. 5. Incidentally, I've renamed the `ZarrStore.ds` attribute to `ZarrStore.zarr_group`, which is a much more descriptive name. These changes gave me a ~5x boost in write performance in a large parallel job making use of `to_zarr` with `region`.
|
Question: does this mode still require eager loading of dimension coordinates? |
|
Nope, not anymore! We decide the existing variables (to figure out/match
their encoding) but that's done in a lazy way. We never put them in an
xarray.Dataset so there is no eager loading of dimension coordinates.
…On Tue, May 4, 2021 at 7:07 AM Ryan Abernathey ***@***.***> wrote:
Question: does this mode still require eager loading of dimension
coordinates?
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#5252 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAJJFVVIPHXHCNFR2GJXJKDTL75R5ANCNFSM44ALL2AQ>
.
|
|
I pushed a commit making consolidated metadata the default (see #5251). This was easier to do building on top of my existing refactor here (but conceivably could be separated). |
|
Thanks @joshmoore for the copy-editing! |
|
@rabernat gentle ping for review here :) |
|
@nbren12 @spencerkclark are either of you up for a review here? |
|
I vote to merge since it's so useful and @shoyer has probably found all the bugs already :) |
Hah :). We've been using this branch at Google for a while now and it seems to be working pretty well. |
best kind of code review! Thanks @shoyer |
|
Really sorry I didn't get around to review. My excuse is that I moved back to NYC last week and fell behind on everything. Thanks for moving it forward. 💪 |
|
I'm sorry too! I don't have any good excuse though... |
|
Well hey, it's still not too late to catch any issues before they turn up
in a released version of Xarray :)
…On Mon, Jun 21, 2021 at 11:56 PM Noah D. Brenowitz ***@***.***> wrote:
I'm sorry too! I don't have any good excuse though...
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#5252 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAJJFVXBEQCIEIUDKLD6NBLTUAJXLANCNFSM44ALL2AQ>
.
|
mode="r+"only allows for modifying pre-existing array values in aZarr store. This makes it a safer default
modewhen doing a limitedregionwrite. It also offers a nice performance bonus when usingconsolidated metadata, because the store to modify can be opened in
"consolidated" mode -- rather than painfully slow non-consolidated
mode.
This PR includes several related changes to
to_zarr():mode="r+".consolidated=Trueinto_zarr()now means "open in consolidatedmode" if using using
mode="r+", instead of "write in consolidatedmode" (which would not make sense for r+).
consolidated=Truewhen usingregion, mostlyfor the sake of fast store opening with r+.
to_zarr()has been reorganized to always use theexisting Zarr group, rather than re-opening zar stores from
scratch, which could require additional network requests.
ZarrStore.dsattribute toZarrStore.zarr_group, which is a much more descriptive name.These changes gave me a ~5x boost in write performance in a large
parallel job making use of
to_zarrwithregion.pre-commit run --all-fileswhats-new.rst