Velocity HOURLY #99
Codecov Report
```
@@           Coverage Diff           @@
##           master      #99   +/-   ##
=======================================
  Coverage   54.78%   54.78%
=======================================
  Files           9        9
  Lines        1150     1150
  Branches      167      167
=======================================
  Hits          630      630
  Misses        493      493
  Partials       27       27
```

Continue to review full report at Codecov.
…to_middle as a variable

The code does not shift TIME according to seconds_to_middle. However, the portions of the code that do this are commented out, in case we want to go back to this approach.
mhidas left a comment

I haven't quite got to the end of the code yet, but I think I've managed to get my head around the guts of it. There are a few things to think about.
```python
    :param is_WCUR: flag indicating if WCUR is present
    :return: end index of the slice
    """
    nc_cell = nc_cell.where(nc_cell.DEPTH_quality_control < 4, drop=True)
```
Why limit to < 4 and not < 3? With the non-velocity product we included only flags 1 & 2, so let's be consistent here.
Also, this limit would ideally be specified as an argument to the function, or at least a constant within this module.
More importantly, we also need to mask out bad values of the velocity components!
E.g. replace UCUR values where UCUR_quality_control > 2 with NaN before doing the resample. I think you can do this with each variable on the same DataFrame (without drop=True), then still just do the resample.
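
A minimal sketch of that masking approach, assuming the per-cell data is already a pandas DataFrame with a DatetimeIndex and columns named after each variable and its `*_quality_control` counterpart (names and values illustrative):

```python
import numpy as np
import pandas as pd

MAX_QC_FLAG = 2  # keep only "good" (1) and "probably good" (2) data

idx = pd.date_range("2020-01-01", periods=4, freq="30T")
df = pd.DataFrame({
    "UCUR": [0.10, 0.20, 9.99, 0.30],
    "UCUR_quality_control": [1, 2, 4, 1],  # flag 4 = bad value
    "VCUR": [0.00, 0.10, 0.20, 0.10],
    "VCUR_quality_control": [1, 1, 2, 2],
}, index=idx)

# mask bad values instead of dropping rows, so every variable keeps
# the full set of timestamps going into the resample
for var in ("UCUR", "VCUR"):
    df[var] = df[var].where(df[f"{var}_quality_control"] <= MAX_QC_FLAG, np.nan)

hourly = df[["UCUR", "VCUR"]].resample("1H").mean()  # NaNs ignored by mean()
```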
```python
## back the index 30min
nc_cell.index = nc_cell.index - pd.Timedelta(30, unit='m')

nc_cell_1H = nc_cell.resample('1H')
```
I think this might be the reason for some of the slowness: we are repeating this resample for every cell, even though the timestamps are the same for each cell! Not sure if xarray allows you to get around that?
Actually, the same is true for excluding the "bad" DEPTH values - this should be done only once.
I would not expect this to be the bottleneck. Anyway, without running it locally, I don't know why there is the conversion to a DataFrame above followed by resampling - xarray can use the pandas resampling method directly.
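
For comparison, a minimal sketch of resampling directly in xarray, with no DataFrame conversion (the dataset contents are illustrative):

```python
import numpy as np
import pandas as pd
import xarray as xr

# toy stand-in for one cell of the ADCP data
times = pd.date_range("2020-01-01", periods=6, freq="30T")
ds = xr.Dataset(
    {"UCUR": ("TIME", np.linspace(0.0, 0.5, 6)),
     "DEPTH": ("TIME", np.linspace(10.0, 11.0, 6))},
    coords={"TIME": times},
)

# xarray delegates to the pandas resampling machinery under the hood
hourly = ds.resample(TIME="1H").mean()
```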
```python
nc_cell['DEPTH'] = nc_cell['DEPTH'] - cell
slice_end = get_resampled_values(nc_cell, ds, slice_start, varlist, binning_fun,
                                 epoch, one_day, is_WCUR)
CELL_INDEX[slice_start:slice_end] = np.full(slice_end - slice_start, cell_idx, dtype=np.uint32)
```
Note that this ordering of values along the OBSERVATION dimension is different from what we did in the aggregated product. In that case we cycled through all the cells at a timestamp before moving on to the next timestamp. I can see why this would be a bit more difficult here, but in any case it might be helpful to keep the two products consistent in this sense.
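
To illustrate the two orderings with a hypothetical 2-cell, 3-timestamp example:

```python
# aggregated product: all cells at one timestamp, then the next timestamp
time_major = [(t, c) for t in (0, 1, 2) for c in ("cell0", "cell1")]
# [(0, 'cell0'), (0, 'cell1'), (1, 'cell0'), (1, 'cell1'), (2, 'cell0'), (2, 'cell1')]

# this PR: all timestamps for one cell, then the next cell
cell_major = [(t, c) for c in ("cell0", "cell1") for t in (0, 1, 2)]
# [(0, 'cell0'), (1, 'cell0'), (2, 'cell0'), (0, 'cell1'), (1, 'cell1'), (2, 'cell1')]
```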
Co-Authored-By: Marty Hidas <marty.hidas@utas.edu.au>
mhidas left a comment
A few more comments, but otherwise I think it's good for now. Let's see how the Tech Team review goes.
Main issues yet to be resolved:
- I think we should be shifting the input timestamps according to the seconds_to_middle attribute, when present;
- Ordering of time and depth/cell along the OBSERVATION dimension is different from the aggregated product.
```python
## NOTE: There is a possibility of having NaNs in DEPTH after the binning
## this is the warning when calculating the min/max DEPTH
## maybe I should clean the dataset before closing it
```
Is this comment still relevant? Doesn't using np.nanmin() get around this problem?
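
For reference, a quick illustration of the NaN-safe variants (values are made up):

```python
import numpy as np

depth = np.array([10.2, np.nan, 11.5, 9.8])

print(np.min(depth))     # nan - a single NaN propagates through min/max
print(np.nanmin(depth))  # 9.8 - NaNs are ignored
print(np.nanmax(depth))  # 11.5
```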
- All files to be aggregated are from the same site, and have the same `site_code` attribute;
- Variables to be aggregated have `TIME` and (optionally) `HEIGHT_ABOVE_SENSOR` as their only dimensions (or if `LATITUDE` and `LONGITUDE` are included as dimensions, they have size 1);
- The in-water data are bounded by the global attributes `time_deployment_start` and `time_deployment_end`;
Might need to add this if it is so decided:
The TIME variable has an attribute `seconds_to_middle_of_measurement` to indicate the offset from each recorded timestamp to the centre of the averaging period.
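
If it is so decided, setting the attribute could look like this sketch (the attribute name comes from the suggestion above; the 1800 s value is purely illustrative):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(coords={
    "TIME": np.array(["2020-01-01T00:00"], dtype="datetime64[ns]"),
})

# offset from each recorded timestamp to the centre of the averaging period
ds["TIME"].attrs["seconds_to_middle_of_measurement"] = 1800.0
```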
mhidas left a comment
Added a couple of suggestions in the global attributes, to warn users that the timestamps have not been centred before binning.
(also update description of CELL_INDEX and SECONDS_TO_MIDDLE)
* make cell_methods CF compliant
* more accurate description in long_name
* update author_email to AODN one
* reorder imports
* global variable for max QC flag to include
* minor doc string edits
* clearer variable renames
(add comments about not centering the timestamps before binning)
Note the Travis tests are failing due to https://github.com/aodn/issues/issues/692, but a fix is almost ready.
```diff
-def get_resampled_values(nc_cell, ds, slice_start, varlist, binning_fun, epoch, one_day, is_WCUR):
+def get_resampled_values(nc_cell, ds, slice_start, varlist, binning_function, epoch, one_day, is_WCUR):
     """
```
Agree. I have removed varlist and the last three.
```diff
@@ -73,7 +71,7 @@ def get_resampled_values(nc_cell, ds, slice_start, varlist, binning_fun, epoch,
     ds['WCUR'][slice_start:slice_end], \
```
Mutation of input arguments in the function.
I would rename the function to highlight that I'm mutating ds.
Renamed to `append_resampled_values`, and I've made this clear in the docstring too.
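
A sketch of how the renamed function's docstring might flag the mutation (signature simplified here; the real function takes more arguments):

```python
def append_resampled_values(nc_cell, ds, slice_start):
    """Resample one cell's data and append it to the output dataset.

    NOTE: this mutates the input dataset ``ds`` in place, writing the
    binned values into the slice beginning at ``slice_start``.

    :return: end index of the slice that was filled
    """
    ...
```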
```diff
 ## back the index 30min
-df_cell.index = df_cell.index - pd.Timedelta(minutes=30)
+df_cell.index = df_cell.index + pd.Timedelta(minutes=30)
 # TODO: shift timestamps to centre of sampling interval
```
Comment not updated with code - you are advancing the index instead.
I imagine the code was wrong before, so the commit message could highlight that.
It wasn't wrong, it just took two steps instead of one: it subtracted 30 min, resampled, then added an hour.
So I simplified it to just adding 30 min, which is exactly equivalent. I did a test comparison and all the variables in files generated with the two versions of the code are exactly the same.
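
A minimal check of that equivalence on toy data (assumes pandas' default left-labelled, left-closed hourly bins):

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2020-01-01 00:10", periods=8, freq="25T")
df = pd.DataFrame({"UCUR": np.arange(8.0)}, index=idx)

# old approach: shift back 30 min, resample, shift the result forward 1 hour
old = df.copy()
old.index = old.index - pd.Timedelta(minutes=30)
old = old.resample("1H").mean()
old.index = old.index + pd.Timedelta(hours=1)

# new approach: shift forward 30 min, then resample
new = df.copy()
new.index = new.index + pd.Timedelta(minutes=30)
new = new.resample("1H").mean()

assert old.equals(new)  # bins are centred on the hour either way
```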
Comment changed to "shift the index forward 30min to centre the bins on the hour".
@mhidas - I'm leaving the merge to you. Just one thing: there's a change to the time resampling (forward shifting instead of backward) that I assumed was further checked?
Thanks @ocehugo. I will update that commit, fix the merge conflicts (due to version numbers changing after the other PR was merged) and merge.
This code aggregates the flattened velocity variables, binned into one-hour time intervals. It checks for the presence of `seconds_to_middle_of_measurement` and shifts the timestamps forward accordingly.
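
A sketch of what that check might look like (a hypothetical helper; the PR's actual implementation may differ):

```python
import pandas as pd
import xarray as xr

def centre_timestamps(ds: xr.Dataset) -> xr.Dataset:
    """Shift TIME forward by seconds_to_middle_of_measurement, if present."""
    offset = ds["TIME"].attrs.get("seconds_to_middle_of_measurement")
    if offset is not None:
        ds = ds.assign_coords(TIME=ds["TIME"] + pd.Timedelta(seconds=float(offset)))
    return ds
```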
Here is a sample file aggregating 20 deployments from NRSDAR.