Allow assigning values to a subset of a dataset#5045

Merged
max-sixty merged 20 commits intopydata:masterfrom
matzegoebel:dev
May 25, 2021

Conversation

@matzegoebel
Contributor

@matzegoebel matzegoebel commented Mar 17, 2021

Both positional and label-based (with .loc) indexing using
a dictionary as the key are implemented.
All variables in the dataset are updated one by one with the given
value at the given location. Variables that do not possess all
of the dimensions given in the location key are skipped.
If the given value is also a dataset, the corresponding variables
in the given value and in the dataset to be changed are selected.

Tests for all cases are added.
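The described behaviour can be sketched as follows (a hypothetical usage example assuming an xarray release that includes this PR; the dataset and values are made up):

```python
import numpy as np
import xarray as xr

# A small dataset with two variables sharing the "x" dimension.
ds = xr.Dataset(
    {"a": (("x", "y"), np.zeros((3, 2))), "b": ("x", np.zeros(3))},
    coords={"x": [10, 20, 30], "y": ["u", "v"]},
)

# Positional, dict-based assignment: every variable is updated at x index 0.
ds[{"x": 0}] = 1.0

# Label-based assignment via .loc with a dictionary key.
ds.loc[{"x": 20}] = 2.0
```

If the value on the right-hand side is itself a Dataset, the corresponding variables are matched up by name, as described above.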
@matzegoebel
Contributor Author

I haven't updated the documentation yet, where it says that this feature is not supported yet. Do you think we need example code for this feature in the documentation?

@max-sixty
Collaborator

@matzegoebel forgive the very long delay on the review. We're planning to find a better system to ensure these don't drop through.

I would be up for adding this, for consistency. I don't think I've ever needed the functionality, but it also doesn't make the interface more complicated, given that it mirrors __getitem__.

We probably need to think through whether there are any corner cases here; I can't think of any atm.

Any other thoughts?

@shoyer
Member

shoyer commented Apr 19, 2021 via email

Matthias Göbel added 2 commits April 19, 2021 05:52
- check for errors before setting values to avoid partial update
- raise error if not all variables contain all dimensions of key instead
of skipping these variables
- If value is dataset: check that all variables of value are used
- updated unit tests
@matzegoebel
Contributor Author

OK, I tried to incorporate your suggestions. Concerning @shoyer's point 3: since many different errors could appear, I create a copy of the data to be changed and check whether the setitem fails on it before doing the actual update. That is of course suboptimal for performance. What do you think: should we include checks for all conceivable errors, or keep this trial update?

@matzegoebel
Contributor Author

I don't understand what the issue with the failing test is. Do you?

@keewis
Collaborator

keewis commented Apr 19, 2021

that's a flaky test which randomly fails (see #4539). You can safely ignore it, the CI should pass on the next run.

@max-sixty
Collaborator

Great, this is shaping up.

I think we can find a way of failing early on bad indexes without attempting the whole operation on a copy.

At the very least, we could call __getitem__ with the indexes and see whether that passes. There may be better ways yet.
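The fail-early idea could look roughly like this (a sketch, not the merged implementation; `checked_setitem` is a hypothetical helper, and the assumption is that performing the selection first catches bad dimension names and out-of-range indexers):

```python
import numpy as np
import xarray as xr

def checked_setitem(ds: xr.Dataset, key: dict, value) -> None:
    # The selection raises for unknown dimensions or out-of-range indexers
    # without mutating anything, so a failed check leaves `ds` untouched.
    ds.isel(key)
    ds[key] = value

ds = xr.Dataset({"a": ("x", np.zeros(3))})
checked_setitem(ds, {"x": 1}, 5.0)
```

As noted, this only catches indexing errors; failures that occur during the assignment itself (e.g. incompatible values) would still slip through.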

I also think that because the currently proposed code uses a shallow copy, it may be mutating the original when bad indexes are passed — it's worth adding a test to confirm.

@matzegoebel
Contributor Author

Calling getitem is not enough to detect all possible errors, I suspect. Another possibility would be to make a deep copy before the assignments and, if anything goes wrong, restore the original data from the copy. That way, the assignments would not have to be done twice unless an error appears.
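A minimal sketch of this copy-and-restore alternative (an assumption of how it could work, not code from this PR; `setitem_with_rollback` is a hypothetical name):

```python
import numpy as np
import xarray as xr

def setitem_with_rollback(ds: xr.Dataset, key, value) -> None:
    backup = ds.copy(deep=True)  # expensive for large datasets
    try:
        ds[key] = value
    except Exception:
        # Put the original variables back so no partial update is visible.
        ds.update(backup)
        raise

ds = xr.Dataset({"a": ("x", np.zeros(3))})
setitem_with_rollback(ds, {"x": 2}, 7.0)
```

The trade-off is that every assignment, including successful ones, pays the full cost of the deep copy.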

@max-sixty
Collaborator

max-sixty commented Apr 30, 2021

Which errors would __getitem__ miss? At least type errors that don't coerce; are there other cases?

The issue with a deep copy of the whole dataset is that it's very expensive. It's probably better to have that rather than nothing, but it could have confusing performance effects given that people are often going to be mutating values to reduce copies.

These aren't strongly held views though. Any thoughts from others?

@matzegoebel
Contributor Author

OK, I deleted the copy approach and included a few checks to catch possible errors before setting the values. Did I miss anything? How do we check for the "type errors that don't coerce" you mentioned?
The __setitem__ method of the LocIndexer now calls the __setitem__ method of the Dataset class, so that we don't have redundant code.

Member

@shoyer shoyer left a comment


Looking really nice, thank you!

@max-sixty
Collaborator

OK, I deleted the copy approach and included a few checks to catch possible errors before setting the values. Did I miss anything? How do we check for the "type errors that don't coerce" you mentioned?

Excellent. Re the checks: I mostly meant that it will be very rare for something to get through; I don't think it's necessary to check for something like "type errors that don't coerce".

@matzegoebel
Contributor Author

@shoyer thanks for your suggestions! I incorporated them as best I could.

Collaborator

@max-sixty max-sixty left a comment


This looks great! Thanks @matzegoebel (and @shoyer , as ever).

Any final feedback before we merge?

@matzegoebel
Contributor Author

I revised the pre-assignment checks. In my opinion, xr.align is not so helpful when checking that the dimension sizes and coordinates are consistent, because it does not fail when the dimension sizes of the two Datasets differ but the coordinate of the second Dataset is a subset of the first. I therefore reimplemented the check I had previously in a similar way. I also added a check for the wrong order of the dimensions that you mentioned, @shoyer.
If, despite the checks, an error occurs during the assignment (e.g. due to a type error) and the dataset has already been partially updated, the user is informed about this.

@shoyer
Member

shoyer commented May 5, 2021

I revised the pre-assignment checks. In my opinion, xr.align is not so helpful when checking that the dimension sizes and coordinates are consistent, because it does not fail when the dimension sizes of the two Datasets differ but the coordinate of the second Dataset is a subset of the first.

Could you kindly elaborate on this issue, maybe with a specific example?

If, despite the checks, an error occurs during the assignment (e.g. due to a type error) and the dataset has already been partially updated, the user is informed about this.

np.can_cast with casting='unsafe' can check this. It sounds like this would probably be something good to add to our checks :)

@matzegoebel
Contributor Author

matzegoebel commented May 5, 2021

Could you kindly elaborate on this issue, maybe with a specific example?

I think I somehow forgot join="exact" when testing the functionality of xr.align. So never mind, I'll reimplement it. :P
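The difference join="exact" makes can be illustrated like this (a sketch; the two small datasets are made up):

```python
import numpy as np
import xarray as xr

a = xr.Dataset({"v": ("x", np.arange(3.0))}, coords={"x": [0, 1, 2]})
b = xr.Dataset({"v": ("x", np.arange(2.0))}, coords={"x": [0, 1]})

# The default join="inner" silently intersects the coordinates ...
aligned_a, aligned_b = xr.align(a, b)

# ... while join="exact" raises a ValueError when the indexes differ.
try:
    xr.align(a, b, join="exact")
except ValueError:
    print('join="exact" caught the mismatch')
```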

np.can_cast with casting='unsafe' can check this. It sounds like this would probably be something good to add to our checks :)

Ok good point. I'll give it a try.

@matzegoebel
Contributor Author

np.can_cast with casting='unsafe' can check this. It sounds like this would probably be something good to add to our checks :)

I'm not sure how to use this, because np.can_cast with unsafe casting mostly returns True, e.g. np.can_cast(str, float, casting="unsafe") == True. The error that is exercised in the tests would not be caught then. But I could explicitly try to cast the new values to the dtype of the original values with .astype. Would that be an option?
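The point can be demonstrated like this (a sketch; the example string is made up):

```python
import numpy as np

# With casting="unsafe", np.can_cast is permissive: even str -> float is
# reported as castable, so it cannot serve as a pre-assignment check here.
assert np.can_cast(str, float, casting="unsafe")

# An explicit .astype attempt, by contrast, actually raises for bad values.
try:
    np.array(["not-a-number"]).astype(float)
except ValueError:
    print(".astype raised, as desired")
```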

@shoyer
Member

shoyer commented May 5, 2021

I'm not sure how to use this, because np.can_cast with unsafe casting mostly returns True, e.g. np.can_cast(str, float, casting="unsafe")==True. The error that is implemented in the tests would not be caught then.

Oh, good point, thanks for checking.

But I could explicitly try to cast the new values to the datatype of the original values with .astype. Would that be an option?

Yes, I like this idea!

@matzegoebel
Contributor Author

OK, done.

@keewis
Collaborator

keewis commented May 5, 2021

could you also resolve the merge conflicts? It seems we can't "approve and run" the CI with conflicts.

@matzegoebel
Contributor Author

matzegoebel commented May 5, 2021

could you also resolve the merge conflicts? It seems we can't "approve and run" the CI with conflicts.

OK, I resolved them.

dcherian added 2 commits May 13, 2021 11:34
* upstream/master: (23 commits)
  combine keep_attrs and combine_attrs in apply_ufunc (pydata#5041)
  Explained what a deprecation cycle is (pydata#5289)
  Code cleanup (pydata#5234)
  FacetGrid docstrings (pydata#5293)
  Add whats new for dataset interpolation with non-numerics (pydata#5297)
  Allow dataset interpolation with different datatypes (pydata#5008)
  Flexible indexes: add Index base class and xindexes properties (pydata#5102)
  pre-commit: autoupdate hook versions (pydata#5280)
  convert the examples for apply_ufunc to doctest (pydata#5279)
  fix the new whatsnew section
  Ensure `HighLevelGraph` layers are `Layer` instances (pydata#5271)
  New whatsnew section
  Release-workflow: Bug fix (pydata#5273)
  more maintenance on whats-new.rst (pydata#5272)
  v0.18.0 release highlights (pydata#5266)
  Fix exception when display_expand_data=False for file-backed array. (pydata#5235)
  Warn ignored keep attrs (pydata#5265)
  Disable workflows on forks (pydata#5267)
  fix the built wheel test (pydata#5270)
  pypi upload workflow maintenance (pydata#5269)
  ...
@dcherian
Contributor

I've fixed the location of the whats-new note. Is this ready to go in?

@max-sixty
Collaborator

I've fixed the location of the whats-new note. Is this ready to go in?

Yes. I have one typing question, but we can merge regardless if needed.

@max-sixty max-sixty merged commit de6144c into pydata:master May 25, 2021
@max-sixty
Collaborator

Thanks a lot @matzegoebel !

@matzegoebel
Contributor Author

Thanks for your help @max-sixty and @shoyer!

@matzegoebel
Contributor Author

I guess we should also add this feature to the documentation, right?

@max-sixty
Collaborator

Docs would be great! Particularly if the current docs are out of date now. Thanks @matzegoebel


Successfully merging this pull request may close these issues.

Assigning values to a subset of a dataset
