Update S3StorageBroker._run_query to return a RemotePipelineFileCollection containing RemotePipelineFile instances #208
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master     #208      +/-   ##
==========================================
- Coverage   90.65%   90.61%   -0.05%
==========================================
  Files          35       35
  Lines        3478     3483       +5
  Branches      423      424       +1
==========================================
+ Hits         3153     3156       +3
- Misses        265      266       +1
- Partials      60       61        +1
Continue to review full report at Codecov.
mhidas left a comment
Looks good to me! Just a couple of minor suggestions.
Also, since this is a bit more than a bug fix, should we bump the minor version on the package?
| # "public" methods | ||
| # | ||
|
|
||
| def download_remotepipelinefilecollection(self, remotepipelinefilecollection, local_path=None): |
@mhidas - yep, I agree that it should also allow using the RemotePipelineFile object as an arg.
However, when I mentioned it, I was thinking more that RemotePipelineFile should also have a download method. That way, we could avoid initialising/wrapping the thing in another class just to download a single file.
This could affect the design of the Collection class though, and I understand if it's not quickly implemented.
You can't download a file without knowing where to download it from. Having a remote file be downloadable in a self-contained way would mean each file needs an independent link to a storage broker, which is intentionally not implemented that way.
The relationship is one way, in that a broker knows how to download a RemotePipelineFileCollection from itself, and not the other way around. Otherwise every single class ends up doing absolutely everything, instead of being focused on one job. This is why the hierarchy is intentionally "one way" in terms of where the methods live.
Doing otherwise encourages child handlers to mess around in unnecessarily low-level ways (such as trying to interact with storage directly) that are most likely inconsistent with other handlers, and we get back to the scenario where handlers all go off and do their own similar things in different ways. Child handlers should deal with high-level concepts through the library, and the library should deal with low-level concepts such as storage and harvester implementation etc. Separation of concerns, abstraction, encapsulation, all that good stuff.
- A File represents a file and the state of that file.
- A FileCollection contains Files and understands how to filter and manipulate them en masse.
- A Broker can take a collection and attempt to download it, so that the collection object needs zero knowledge about the storage layer (since that is not its purpose).
- The Handler in this case can simply pivot the parameters from "tell this collection about a broker it can use to download itself" (which requires knowing about brokers, and actually having one to use, which is too low level) to "tell the handler to download this collection" (which requires zero low-level knowledge).
So in a sense, if the child handler needs to upload or download, it must do this through the designated public library methods; otherwise it's breaking the intended encapsulation and thinking in terms that are too "low level". Child handlers should be declarative in that sense: they should be defining what needs to be done, not how it is done.
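To make that one-way dependency concrete, here is a minimal sketch of who is allowed to know about whom (hypothetical class bodies for illustration only, not the actual aodncore implementations):

# Hypothetical sketch of the intended one-way dependency, not the real classes

class RemotePipelineFile:
    """Holds state about a single remote file; knows nothing about storage."""
    def __init__(self, dest_path, size=0):
        self.dest_path = dest_path
        self.size = size

class RemotePipelineFileCollection:
    """Holds RemotePipelineFile instances and filters/manipulates them en masse."""
    def __init__(self, files=()):
        self._files = list(files)

    def __iter__(self):
        return iter(self._files)

class StorageBroker:
    """Knows about the storage layer; downloads a collection *to* a local path."""
    def download(self, collection, local_path):
        for remote_file in collection:
            self._download_file(remote_file, local_path)

    def _download_file(self, remote_file, local_path):
        # storage-specific logic (e.g. an S3 GET) lives only in broker subclasses
        raise NotImplementedError

The dependency arrow only points one way: the broker iterates over the collection, and neither the collection nor the file ever holds a reference to a broker.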
I'm talking about the download method in Collections also being implemented in RemoteFile. After peeking through the code, I realize it should be simple.
My question was raised from the following observation: RemoteFiles can delete themselves and checksum themselves, so why can't they download themselves!?
Then my thinking was: OK, no self.download method - let's pick the guy who can do it/defines the location (the broker).
However, after the comment above, I checked the Collection method, and the "download" method does exactly that - it uses the broker to download.
The relationship is one way, in that a broker knows how to download a RemotePipelineFileCollection from itself, and not the other way around.
We are allowing both atm. RemotePipelineFileCollection.download allows doing exactly the opposite - in a semantic sense. You say "Hey RemotePipelineFileCollection, please download yourself with this broker to this path", but what actually happens is "Hey broker, use me and put the result at this path".
My whole point is: why not allow the same in RemoteFile!? Why do I need to wrap a RemoteFile in another class, within a singleton IndexedSet, just to call a provided/required external object with a path argument? Just allow the same in RemoteFile then.
A Broker can take a collection and attempt to download so that the collection object needs zero knowledge about the storage layer (since that is not it's purpose)
Well, the collection knows who can help it get downloaded, so why is the individual ignorant?
Regards,
Nominated lawyer of RemoteFile rights
Haha, it's been a while since I've been thinking in this headspace. As usual, there is sometimes a discrepancy between my internal mental model and the current reality.
Now that I look at it, I didn't follow my own vision, implemented it wrongly, and invalidated everything I just said ;)
My question was raised from the following observation: RemoteFiles can delete themselves and checksum themselves, so why can't they download themselves!?
The original intention for Files and Collections was that they can't do those things to themselves; they simply hold state information about their current status (e.g. has this been published) and intended state (e.g. is this a candidate for publishing). Some things crept in, like checksumming itself and a remote file managing its own locally cached file etc., but there is a real danger of this scope creeping out to all kinds of things. If anything, this scope should be reduced back to simpler types rather than expanded.
I actually think it would be better to get rid of the download method on the collection now that this is in place, because storage brokers really aren't something a collection type should have anything to do with.
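For illustration, the two call styles being contrasted look like this (a rough sketch of the shapes of the calls, not a proposed patch):

# current: the collection is told about a broker (the direction being questioned)
collection.download(broker, local_path)

# intended: the broker is told about a collection (the one-way dependency)
broker.download(collection, local_path)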
The original intention for Files and Collections was that they can't do those things to themselves; they simply hold state information about their current status (e.g. has this been published) and intended state (e.g. is this a candidate for publishing). Some things crept in, like checksumming itself and a remote file managing its own locally cached file etc., but there is a real danger of this scope creeping out to all kinds of things. If anything, this scope should be reduced back to simpler types rather than expanded.
Maybe you don't need to prune, but rather embrace the living objects (teenagers!?) they became and lead them to adulthood!? IMO, I don't think self.checksumming or self.download are jumping the fence of their objectives; otherwise, they would be dumbed down to just boring storage entities. Maybe it's the right thing to do, but they will complain about losing privileges!
they would be dumbed down to just boring storage entities
Boring sounds good to me... keep the exciting code out in the modules that deal with the data generation/transformation/aggregation, but by the time it gets to the workflow it should be as generic, consistent and stable as possible!
👍 Great! Thanks @lwgordonimos. Let's take it for a spin...
if local_path is None:
    local_path = self.temp_dir

remotepipelinefilecollection.download(self._upload_store_runner.broker, local_path)
Without interrupting the extended philosophical discussion above, I just wanted to point out that (consistent with @lwgordonimos' "internal mental model") this line should really be

self._upload_store_runner.broker.download(remotepipelinefilecollection, local_path)

As a bonus, the broker's download method also accepts either a RemotePipelineFileCollection or a single RemotePipelineFile, so you don't even need to do the ensure_remotepipelinefilecollection above.
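In other words (assuming the accepts-either behaviour just described; single_remote_file here is a hypothetical RemotePipelineFile instance), either of these calls would work without any wrapping:

broker = self._upload_store_runner.broker

# download a whole collection
broker.download(remotepipelinefilecollection, local_path)

# or a single RemotePipelineFile directly, no ensure_remotepipelinefilecollection needed
broker.download(single_remote_file, local_path)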
Yep, and if RemoteFile gets the same method accepted, this function could just be named differently and accept either of the inputs.
My point about turning both classes into boring stuff (e.g. named tuples/dicts) was that, as of today, it would require some effort (which is hard for me to gauge, since this package is used everywhere).
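As a rough illustration of that "boring" direction (purely hypothetical, not proposed code), the file type could shrink to pure data along these lines:

from typing import NamedTuple

# Hypothetical state-only version of RemotePipelineFile: pure data, no behaviour,
# with all operations (checksum, download, delete) living in brokers/helpers instead
class RemoteFileState(NamedTuple):
    dest_path: str
    size: int = 0
    checksum: str = ''
    is_deletion: bool = False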
@mhidas I realised that exact thing as soon as it was merged, unfortunately. In fact, I realised that the handler was the wrong place as well anyway, since we went to the trouble of making a StateQuery class to be the state API for exactly this kind of thing :/
@ocehugo remember that this is not a general-purpose library for interacting with our data; this is a framework for keeping handlers as consistent as possible. This is why it is so opinionated about how things should be done in terms of the workflow during execution.
We can implement that kind of flexible download interface without putting methods on those classes or dismantling other aspects of the design which are useful and/or intentional, based on real-world use cases we've encountered along the way (and in the previous pipeline implementation).
With the above change, this is the kind of code handlers can write to download files if they really need to, and this is the kind of style I would encourage:
myfilter = SomeOgcFilter()
collection = self.state_query.query_wfs_urls_for_layer('mylayer', myfilter)
self.state_query.download(collection, self.temp_dir)
# and of course there's nothing preventing you from
# looking at/manipulating your collection to get whatever info you need, e.g.
bigger_than_a_kb = RemotePipelineFileCollection(f for f in collection if f.size > 1024)
only_netcdfs = collection.filter_by_attribute_id(FileType.NETCDF)
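And since the filtered subsets are themselves collections, they can be fed straight back into the same high-level call (continuing the example above, same assumed StateQuery API):

# download only the filtered subset, still without touching the storage layer directly
self.state_query.download(only_netcdfs, self.temp_dir)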