Allow assigning values to a subset of a dataset #5045
max-sixty merged 20 commits into pydata:master
Conversation
Both positional and label-based (with `.loc`) indexing using
a dictionary as the key are implemented.
All variables in the dataset are updated one by one with the given
value at the given location. Variables that do not have all
of the dimensions given in the location key are skipped.
If the given value is itself a dataset, each variable to be updated
is matched with the corresponding variable in the given value.
Tests for all cases are added.
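As an illustration of the behavior described above, a minimal usage sketch (the dataset and values here are made up, not taken from the PR's tests):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"a": (("x", "y"), np.zeros((2, 3)))},
    coords={"x": [10, 20], "y": ["u", "v", "w"]},
)

# Positional assignment with a dictionary key:
ds[{"x": 0}] = 1

# Label-based assignment via .loc:
ds.loc[{"x": 20, "y": "v"}] = 5

# If the value is itself a Dataset, variables are matched by name:
ds[{"x": 1}] = ds[{"x": 0}]
```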
|
I haven't updated the documentation yet; it currently says that this feature is not supported. Do you think we need example code for this feature in the documentation? |
|
@matzegoebel forgive the very long delay on the review. We're planning to find a better system to ensure these don't drop through. I would be up for adding this, for consistency. I don't think I've ever needed the functionality, but it also doesn't make the interface more complicated, given it's mirroring existing indexing behavior. We probably need to think through whether there are any corner cases here; I can't think of any atm. Any other thoughts? |
|
Strong +1 from me on narrowing scope whenever possible. Adding features
incrementally is much easier than doing things all at once :)
On Sun, Apr 18, 2021, Maximilian Roos commented on this pull request, in xarray/core/dataset.py:
> + # loop over dataset variables
+ for var, da in self.items():
+ val = value
+ if type(value) == xr.core.dataset.Dataset:
+ val = value[var]
+ # only set value if all dimensions are present
+ if all([k in da.dims for k in key.keys()]):
+ da[key] = val
V good point. Thanks.
Either (a) raising on missing dimensions or (b) "entirely set to zero" in
this case would be reasonable imo.
To the extent you're less confident that (b) is correct, I'd suggest we
move forward with (a) and evaluate whether we should do (b) separately.
(Though @shoyer <https://github.com/shoyer>, if you're up for it, let's
discuss the general pace of reviews next week and whether you think this
sort of "narrowing" of PR scope is a reasonable tactic.)
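A rough sketch of what option (a) could look like, raising on missing dimensions instead of silently skipping; the helper name `_assign_to_subset` and the error message are hypothetical, not the PR's actual code:

```python
import xarray as xr

def _assign_to_subset(ds: xr.Dataset, key: dict, value) -> None:
    # Validate every data variable up front so a bad key never leaves
    # the dataset partially updated.
    missing = [name for name, da in ds.items() if not set(key).issubset(da.dims)]
    if missing:
        raise ValueError(
            f"Variables {missing} do not contain all dimensions of the key {list(key)}"
        )
    for name, da in ds.items():
        # If the value is a Dataset, match variables by name.
        val = value[name] if isinstance(value, xr.Dataset) else value
        da[key] = val
```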
|
- check for errors before setting values, to avoid a partial update
- raise an error if not all variables contain all dimensions of the key, instead of skipping those variables
- if the value is a dataset: check that all variables of the value are used
- updated unit tests
|
ok I tried to include your suggestions. Concerning @shoyer's point 3: since I guess a lot of different errors could appear, I created a copy of the data to be changed and check whether the setitem fails on the copy before doing the actual update. That's of course suboptimal for performance. What do you think, should we include checks for all conceivable errors instead, or keep this trial-update approach? |
|
I don't understand what the issue with the failing test is. Do you? |
|
that's a flaky test which randomly fails (see #4539). You can safely ignore it; the CI should pass on the next run. |
|
Great, this is shaping up. I think we can find a way of failing early on bad indexes without attempting the whole operation on a copy. At the very least, we could call `__getitem__` with the same key to validate the indexers first. I also think that because the currently proposed code uses a shallow copy, it may be mutating the original when bad indexes are passed; it's worth adding a test to confirm. |
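A minimal sketch of that fail-early idea, assuming `__getitem__` accepts the same dictionary key; the helper name is hypothetical:

```python
import xarray as xr

def _validate_then_assign(ds: xr.Dataset, key: dict, value) -> None:
    # Run the selection read-only first, so most bad keys (unknown
    # dimensions, out-of-range indexers, ...) raise before anything
    # is mutated.
    ds[key]
    for name, da in ds.items():
        da[key] = value[name] if isinstance(value, xr.Dataset) else value
```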
|
Calling `__getitem__` is not enough to detect all possible errors, I guess. Another possibility would be to do a deep copy before the assignments and, if anything goes wrong, restore the original data from the copy. That way, the assignments do not have to be done twice unless an error appears. |
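A hedged sketch of that rollback idea; `_do_assignments` stands in for the per-variable assignment loop and is hypothetical (this is also not what the PR ended up doing):

```python
import xarray as xr

def _setitem_with_rollback(ds: xr.Dataset, key: dict, value) -> None:
    # Deep-copy every variable up front; expensive for large datasets.
    backup = {name: var.copy(deep=True) for name, var in ds.variables.items()}
    try:
        _do_assignments(ds, key, value)  # hypothetical assignment loop
    except Exception:
        # Restore the original data before re-raising.
        for name, var in backup.items():
            ds[name] = var
        raise
```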
|
Which errors would `__getitem__` miss? The issue with a deep copy of the whole dataset is that it's very expensive. It's probably better to have that rather than nothing, but it could have confusing performance effects, given that people often mutate values precisely to avoid copies. These aren't strongly held views though. Any thoughts from others? |
|
ok, I deleted the copy stuff and included a few checks to catch possible errors before setting the values. Did I miss anything? How do we check for "type errors that don't coerce", as you mentioned? |
shoyer
left a comment
Looking really nice, thank you!
Excellent. Re the checks: I mostly meant that it was going to be very rare for something to get through. I don't think it's necessary to check for something like "type errors that don't coerce". |
|
@shoyer thanks for your suggestions! I included them as best as I could. |
max-sixty
left a comment
This looks great! Thanks @matzegoebel (and @shoyer , as ever).
Any final feedback before we merge?
|
I revised the pre-assignment checks. In my opinion, `xr.align` is not so helpful for checking that the dimension sizes and coordinates are consistent, because it doesn't fail when the dimension sizes of the two Datasets differ but the coordinate of the second Dataset is a subset of the first one's. Therefore, I reimplemented the check I had previously in a similar way. I also added a check for the wrong order of the dimensions that you mentioned, @shoyer. |
Could you kindly elaborate on this issue, maybe with a specific example?
|
I think I somehow forgot the `join="exact"` when testing the functionality of `xr.align`. So never mind, I'll reimplement it again. :P
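For reference, a small self-contained example of the `join` behavior discussed above (the datasets are made up for illustration):

```python
import numpy as np
import xarray as xr

ds1 = xr.Dataset({"a": ("x", np.arange(4))}, coords={"x": [0, 1, 2, 3]})
ds2 = xr.Dataset({"a": ("x", np.arange(2))}, coords={"x": [0, 1]})

# The default join="inner" silently aligns on the intersection of the
# coordinates, so a subset coordinate does not raise:
aligned1, aligned2 = xr.align(ds1, ds2)
print(aligned1.sizes)  # {'x': 2}

# join="exact" raises because the indexes are not identical:
try:
    xr.align(ds1, ds2, join="exact")
except ValueError as err:
    print(err)
```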
Ok good point. I'll give it a try. |
I'm not sure how to use this, because |
Oh, good point, thanks for checking.
Yes, I like this idea! |
|
ok done |
|
Could you also resolve the merge conflicts? It seems we can't "approve and run" the CI while there are conflicts. |
Ok, I resolved them.
* upstream/master: (23 commits)
  - combine keep_attrs and combine_attrs in apply_ufunc (pydata#5041)
  - Explained what a deprecation cycle is (pydata#5289)
  - Code cleanup (pydata#5234)
  - FacetGrid docstrings (pydata#5293)
  - Add whats new for dataset interpolation with non-numerics (pydata#5297)
  - Allow dataset interpolation with different datatypes (pydata#5008)
  - Flexible indexes: add Index base class and xindexes properties (pydata#5102)
  - pre-commit: autoupdate hook versions (pydata#5280)
  - convert the examples for apply_ufunc to doctest (pydata#5279)
  - fix the new whatsnew section
  - Ensure `HighLevelGraph` layers are `Layer` instances (pydata#5271)
  - New whatsnew section
  - Release-workflow: Bug fix (pydata#5273)
  - more maintenance on whats-new.rst (pydata#5272)
  - v0.18.0 release highlights (pydata#5266)
  - Fix exception when display_expand_data=False for file-backed array. (pydata#5235)
  - Warn ignored keep attrs (pydata#5265)
  - Disable workflows on forks (pydata#5267)
  - fix the built wheel test (pydata#5270)
  - pypi upload workflow maintenance (pydata#5269)
  - ...
|
I've fixed the location of the whats-new note. Is this ready to go in? |
Yes. I have one typing question, but we can merge regardless if needed.
|
Thanks a lot @matzegoebel ! |
|
Thanks for your help @max-sixty and @shoyer! |
|
I guess we should also add this feature to the documentation, right? |
|
Docs would be great! Particularly if the current docs are out of date now. Thanks @matzegoebel |
- `pre-commit run --all-files`
- `whats-new.rst`