Allow assigning values to a subset of a dataset#5045

Merged
max-sixty merged 20 commits intopydata:masterfrom
matzegoebel:dev
May 25, 2021

Conversation

@matzegoebel
Contributor

@matzegoebel matzegoebel commented Mar 17, 2021

Both positional and label-based (with .loc) indexing using
a dictionary as the key are implemented.
All variables in the dataset are updated one by one with the given
value at the given location. Variables that do not possess all
of the dimensions given in the location key are skipped.
If the given value is also a dataset, the corresponding variables
in the given value and in the dataset to be changed are selected.

Tests for all cases are added.
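The described behaviour can be sketched as follows (a hypothetical usage example assuming an xarray release that includes this PR; the dataset and values are made up):

```python
import numpy as np
import xarray as xr

# A small dataset with two variables sharing the "x" dimension.
ds = xr.Dataset(
    {"a": (("x", "y"), np.zeros((3, 2))), "b": ("x", np.zeros(3))},
    coords={"x": [10, 20, 30], "y": ["u", "v"]},
)

# Positional, dict-based assignment: every variable is updated at x index 0.
ds[{"x": 0}] = 1.0

# Label-based assignment via .loc with a dictionary key.
ds.loc[{"x": 20}] = 2.0
```

If the value on the right-hand side is itself a Dataset, the corresponding variables are matched up by name, as described above.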
@matzegoebel
Contributor Author

I haven't updated the documentation yet, where it says that this feature is not supported yet. Do you think we need example code for this feature in the documentation?

@max-sixty
Collaborator

@matzegoebel forgive the very long delay on the review. We're planning to find a better system to ensure these don't drop through.

I would be up for adding this, for consistency. I don't think I've ever needed the functionality, but it also doesn't make the interface more complicated, given that it mirrors __getitem__.

We probably need to think through whether there are any corner cases here; I can't think of any atm.

Any other thoughts?

@shoyer
Member

shoyer commented Apr 19, 2021 via email

Matthias Göbel added 2 commits April 19, 2021 05:52
- check for errors before setting values to avoid partial update
- raise error if not all variables contain all dimensions of key instead
of skipping these variables
- If value is dataset: check that all variables of value are used
- updated unit tests
@matzegoebel
Contributor Author

OK, I tried to incorporate your suggestions. Concerning @shoyer's point 3: since many different errors could appear, I create a copy of the data to be changed and check whether the setitem fails on it before doing the actual update. That is of course suboptimal for performance. What do you think: should we include checks for all conceivable errors, or keep this trial update?

@matzegoebel
Contributor Author

I don't understand what the issue with the failing test is. Do you?

@keewis
Collaborator

keewis commented Apr 19, 2021

that's a flaky test which randomly fails (see #4539). You can safely ignore it, the CI should pass on the next run.

@max-sixty
Collaborator

Great, this is shaping up.

I think we can find a way of failing early on bad indexes without attempting the whole operation on a copy.

At the very least, we could call __getitem__ with the indexes and see whether that passes. There may be better ways yet.
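The fail-early idea could look roughly like this (a sketch, not the merged implementation; `checked_setitem` is a hypothetical helper, and the assumption is that performing the selection first catches bad dimension names and out-of-range indexers):

```python
import numpy as np
import xarray as xr

def checked_setitem(ds: xr.Dataset, key: dict, value) -> None:
    # The selection raises for unknown dimensions or out-of-range indexers
    # without mutating anything, so a failed check leaves `ds` untouched.
    ds.isel(key)
    ds[key] = value

ds = xr.Dataset({"a": ("x", np.zeros(3))})
checked_setitem(ds, {"x": 1}, 5.0)
```

As noted, this only catches indexing errors; failures that occur during the assignment itself (e.g. incompatible values) would still slip through.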

I also think that because the currently proposed code uses a shallow copy, it may be mutating the original when bad indexes are passed — it's worth adding a test to confirm.

@matzegoebel
Contributor Author

Calling getitem is not enough to detect all possible errors, I suspect. Another possibility would be to make a deep copy before the assignments and, if anything goes wrong, restore the original data from the copy. That way, the assignments would not have to be done twice unless an error appears.
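A minimal sketch of this copy-and-restore alternative (an assumption of how it could work, not code from this PR; `setitem_with_rollback` is a hypothetical name):

```python
import numpy as np
import xarray as xr

def setitem_with_rollback(ds: xr.Dataset, key, value) -> None:
    backup = ds.copy(deep=True)  # expensive for large datasets
    try:
        ds[key] = value
    except Exception:
        # Put the original variables back so no partial update is visible.
        ds.update(backup)
        raise

ds = xr.Dataset({"a": ("x", np.zeros(3))})
setitem_with_rollback(ds, {"x": 2}, 7.0)
```

The trade-off is that every assignment, including successful ones, pays the full cost of the deep copy.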

@max-sixty
Collaborator

max-sixty commented Apr 30, 2021

Which errors would __getitem__ miss? At least type errors that don't coerce; are there other cases?

The issue with a deep copy of the whole dataset is that it's very expensive. It's probably better to have that rather than nothing, but it could have confusing performance effects given that people are often going to be mutating values to reduce copies.

These aren't strongly held views though. Any thoughts from others?

@matzegoebel
Contributor Author

OK, I deleted the copy approach and included a few checks to catch possible errors before setting the values. Did I miss anything? How do we check for the "type errors that don't coerce" you mentioned?
The __setitem__ method of the LocIndexer now calls the __setitem__ method of the Dataset class, so that we don't have redundant code.

Member

@shoyer shoyer left a comment


Looking really nice, thank you!

@max-sixty
Collaborator

OK, I deleted the copy approach and included a few checks to catch possible errors before setting the values. Did I miss anything? How do we check for the "type errors that don't coerce" you mentioned?

Excellent. Re the checks: I mostly meant that it will be very rare for something to get through; I don't think it's necessary to check for something like "type errors that don't coerce".

@matzegoebel
Contributor Author

@shoyer thanks for your suggestions! I incorporated them as best I could.

Collaborator

@max-sixty max-sixty left a comment


This looks great! Thanks @matzegoebel (and @shoyer , as ever).

Any final feedback before we merge?

@matzegoebel
Contributor Author

I revised the pre-assignment checks. In my opinion, xr.align is not so helpful when checking that the dimension sizes and coordinates are consistent, because it does not fail when the dimension sizes of the two Datasets differ but the coordinate of the second Dataset is a subset of the first. I therefore reimplemented the check I had previously in a similar way. I also added a check for the wrong order of the dimensions that you mentioned, @shoyer.
If, despite the checks, an error occurs during the assignment (e.g. due to a type error) and the dataset has already been partially updated, the user is informed about this.

@shoyer
Member

shoyer commented May 5, 2021

I revised the pre-assignment checks. In my opinion, xr.align is not so helpful when checking that the dimension sizes and coordinates are consistent, because it does not fail when the dimension sizes of the two Datasets differ but the coordinate of the second Dataset is a subset of the first.

Could you kindly elaborate on this issue, maybe with a specific example?

If, despite the checks, an error occurs during the assignment (e.g. due to a type error) and the dataset has already been partially updated, the user is informed about this.

np.can_cast with casting='unsafe' can check this. It sounds like this would probably be something good to add to our checks :)

@matzegoebel
Contributor Author

matzegoebel commented May 5, 2021

Could you kindly elaborate on this issue, maybe with a specific example?

I think I somehow forgot join="exact" when testing the functionality of xr.align. So never mind, I'll reimplement it. :P
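The difference join="exact" makes can be illustrated like this (a sketch; the two small datasets are made up):

```python
import numpy as np
import xarray as xr

a = xr.Dataset({"v": ("x", np.arange(3.0))}, coords={"x": [0, 1, 2]})
b = xr.Dataset({"v": ("x", np.arange(2.0))}, coords={"x": [0, 1]})

# The default join="inner" silently intersects the coordinates ...
aligned_a, aligned_b = xr.align(a, b)

# ... while join="exact" raises a ValueError when the indexes differ.
try:
    xr.align(a, b, join="exact")
except ValueError:
    print('join="exact" caught the mismatch')
```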

np.can_cast with casting='unsafe' can check this. It sounds like this would probably be something good to add to our checks :)

Ok good point. I'll give it a try.

@matzegoebel
Contributor Author

np.can_cast with casting='unsafe' can check this. It sounds like this would probably be something good to add to our checks :)

I'm not sure how to use this, because np.can_cast with unsafe casting mostly returns True, e.g. np.can_cast(str, float, casting="unsafe") == True. The error that is exercised in the tests would not be caught then. But I could explicitly try to cast the new values to the dtype of the original values with .astype. Would that be an option?
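The point can be demonstrated like this (a sketch; the example string is made up):

```python
import numpy as np

# With casting="unsafe", np.can_cast is permissive: even str -> float is
# reported as castable, so it cannot serve as a pre-assignment check here.
assert np.can_cast(str, float, casting="unsafe")

# An explicit .astype attempt, by contrast, actually raises for bad values.
try:
    np.array(["not-a-number"]).astype(float)
except ValueError:
    print(".astype raised, as desired")
```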

@shoyer
Member

shoyer commented May 5, 2021

I'm not sure how to use this, because np.can_cast with unsafe casting mostly returns True, e.g. np.can_cast(str, float, casting="unsafe")==True. The error that is implemented in the tests would not be caught then.

Oh, good point, thanks for checking.

But I could explicitly try to cast the new values to the datatype of the original values with .astype. Would that be an option?

Yes, I like this idea!

@matzegoebel
Contributor Author

OK, done.

@keewis
Collaborator

keewis commented May 5, 2021

could you also resolve the merge conflicts? It seems we can't "approve and run" the CI with conflicts.

@matzegoebel
Contributor Author

matzegoebel commented May 5, 2021

could you also resolve the merge conflicts? It seems we can't "approve and run" the CI with conflicts.

OK, I resolved them.

dcherian added 2 commits May 13, 2021 11:34
* upstream/master: (23 commits)
  combine keep_attrs and combine_attrs in apply_ufunc (pydata#5041)
  Explained what a deprecation cycle is (pydata#5289)
  Code cleanup (pydata#5234)
  FacetGrid docstrings (pydata#5293)
  Add whats new for dataset interpolation with non-numerics (pydata#5297)
  Allow dataset interpolation with different datatypes (pydata#5008)
  Flexible indexes: add Index base class and xindexes properties (pydata#5102)
  pre-commit: autoupdate hook versions (pydata#5280)
  convert the examples for apply_ufunc to doctest (pydata#5279)
  fix the new whatsnew section
  Ensure `HighLevelGraph` layers are `Layer` instances (pydata#5271)
  New whatsnew section
  Release-workflow: Bug fix (pydata#5273)
  more maintenance on whats-new.rst (pydata#5272)
  v0.18.0 release highlights (pydata#5266)
  Fix exception when display_expand_data=False for file-backed array. (pydata#5235)
  Warn ignored keep attrs (pydata#5265)
  Disable workflows on forks (pydata#5267)
  fix the built wheel test (pydata#5270)
  pypi upload workflow maintenance (pydata#5269)
  ...
@dcherian
Contributor

I've fixed the location of the whats-new note. Is this ready to go in?

@max-sixty
Collaborator

I've fixed the location of the whats-new note. Is this ready to go in?

Yes. I have one typing question, but we can merge regardless if needed.

@max-sixty max-sixty merged commit de6144c into pydata:master May 25, 2021
@max-sixty
Collaborator

Thanks a lot @matzegoebel !

@matzegoebel
Contributor Author

Thanks for your help @max-sixty and @shoyer!

@matzegoebel
Contributor Author

I guess we should also add this feature to the documentation, right?

@max-sixty
Collaborator

Docs would be great! Particularly if the current docs are out of date now. Thanks @matzegoebel


Successfully merging this pull request may close these issues.

Assigning values to a subset of a dataset
