Update S3StorageBroker._run_query to return a RemotePipelineFileCollection containing RemotePipelineFile instances #208
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master     #208      +/-   ##
==========================================
- Coverage   90.65%   90.61%   -0.05%
==========================================
  Files          35       35
  Lines        3478     3483       +5
  Branches      423      424       +1
==========================================
+ Hits         3153     3156       +3
- Misses        265      266       +1
- Partials      60       61        +1
Continue to review full report at Codecov.
mhidas left a comment
Looks good to me! Just a couple of minor suggestions.
Also, since this is a bit more than a bug fix, should we bump the minor version on the package?
| # "public" methods | ||
| # | ||
|
|
||
| def download_remotepipelinefilecollection(self, remotepipelinefilecollection, local_path=None): |
@mhidas - yep, I agree that it should also allow using the RemotePipelineFile object as an arg.
However, when I mentioned it, I was thinking more that RemotePipelineFile should also have a download method. That way, we could avoid initialising/wrapping the thing in another class just to download a single file.
This could affect the design of the Collection class though, and I understand if it's not quickly implemented.
You can't download a file without knowing where to download it from. Having a remote file be downloadable in a self-contained way would mean each file needs an independent link to a storage broker, which is intentionally not implemented that way.
The relationship is one way, in that a broker knows how to download a RemotePipelineFileCollection from itself, and not the other way around. Otherwise every single class ends up doing absolutely everything, instead of being focused on one job. This is why the hierarchy is intentionally "one way" in terms of where the methods live.
Doing otherwise encourages child handlers to mess around in unnecessarily low-level ways (such as trying to interact with storage directly) that are most likely inconsistent with other handlers, and we get back to the scenario where handlers all go off and do their own similar things in different ways. Child handlers should deal with high-level concepts through the library, and the library should deal with low-level concepts such as storage and harvester implementation etc. Separation of concerns, abstraction, encapsulation, all that good stuff.
- A File represents a file and the state of that file.
- A FileCollection contains Files and understands how to filter and manipulate them en masse.
- A Broker can take a collection and attempt to download it, so that the collection object needs zero knowledge about the storage layer (since that is not its purpose).
- The Handler in this case can simply pivot the parameters from "tell this collection about a broker it can use to download itself" (which requires knowing about brokers, and actually having one to use, which is too low level) to "tell the handler to download this collection" (which requires zero low-level knowledge).
So in a sense, if the child handler needs to upload or download, it must do this through the designated public library methods; otherwise it's breaking the intended encapsulation and thinking in terms that are too "low level". Child handlers should be declarative in that sense: they should be defining what needs to be done, not how it is done.
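To make that one-way dependency concrete, here is a minimal sketch of who is allowed to know about whom (hypothetical class bodies for illustration only, not the actual aodncore implementations):

# Hypothetical sketch of the intended one-way dependency, not the real classes

class RemotePipelineFile:
    """Holds state about a single remote file; knows nothing about storage."""
    def __init__(self, dest_path, size=0):
        self.dest_path = dest_path
        self.size = size

class RemotePipelineFileCollection:
    """Holds RemotePipelineFile instances and filters/manipulates them en masse."""
    def __init__(self, files=()):
        self._files = list(files)

    def __iter__(self):
        return iter(self._files)

class StorageBroker:
    """Knows about the storage layer; downloads a collection *to* a local path."""
    def download(self, collection, local_path):
        for remote_file in collection:
            self._download_file(remote_file, local_path)

    def _download_file(self, remote_file, local_path):
        # storage-specific logic (e.g. an S3 GET) lives only in broker subclasses
        raise NotImplementedError

The dependency arrow only points one way: the broker iterates over the collection, and neither the collection nor the file ever holds a reference to a broker.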
I'm talking about the download method in Collections also being implemented in RemoteFile. After peeking through the code, I realize it should be simple.
My question was raised from the following observation: RemoteFiles can delete themselves and checksum themselves, so why can't they download themselves!?
Then my thinking was: OK, no self.download method - let's pick the guy who can do it/defines the location (the broker).
However, after the comment above, I checked the Collection method, and the "download" method does exactly that - it uses the broker to download.
The relationship is one way, in that a broker knows how to download a RemotePipelineFileCollection from itself, and not the other way around.
We are allowing both atm. RemotePipelineFileCollection.download allows doing exactly the opposite - in a semantic sense. You say "Hey RemotePipelineFileCollection, please download yourself with this broker to this path", but what actually happens is "Hey broker, use me and put the result at this path".
My whole point is: why not allow the same in RemoteFile!? Why do I need to wrap a RemoteFile in another class, within a singleton IndexedSet, just to call a provided/required external object with a path argument? Just allow the same in RemoteFile then.
A Broker can take a collection and attempt to download so that the collection object needs zero knowledge about the storage layer (since that is not it's purpose)
Well, the collection knows who can help it get downloaded, so why is the individual ignorant?
Regards,
Nominated lawyer of RemoteFile rights
Haha, it's been a while since I've been thinking in this headspace. As usual, there is sometimes a discrepancy between my internal mental model and the current reality.
Now that I look at it, I didn't follow my own vision, implemented it wrongly, and invalidated everything I just said ;)
My question was raised from the following observation: RemoteFiles can delete themselves and checksum themselves, so why can't they download themselves!?
The original intention for Files and Collections was that they can't do those things to themselves; they simply hold state information about their current status (e.g. has this been published) and intended state (e.g. is this a candidate for publishing). Some things crept in, like checksumming itself and a remote file managing its own locally cached file etc., but there is a real danger of this scope creeping out to all kinds of things. If anything, this scope should be reduced back to simpler types rather than expanded.
I actually think it would be better to get rid of the download method on the collection now that this is in place, because storage brokers really aren't something a collection type should have anything to do with.
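For illustration, the two call styles being contrasted look like this (a rough sketch of the shapes of the calls, not a proposed patch):

# current: the collection is told about a broker (the direction being questioned)
collection.download(broker, local_path)

# intended: the broker is told about a collection (the one-way dependency)
broker.download(collection, local_path)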
The original intention for Files and Collections was that they can't do those things to themselves; they simply hold state information about their current status (e.g. has this been published) and intended state (e.g. is this a candidate for publishing). Some things crept in, like checksumming itself and a remote file managing its own locally cached file etc., but there is a real danger of this scope creeping out to all kinds of things. If anything, this scope should be reduced back to simpler types rather than expanded.
Maybe you don't need to prune, but rather embrace the living objects (teenagers!?) they became and lead them to adulthood!? IMO, I don't think self.checksumming or self.download are jumping the fence of their objectives; otherwise, they would be dumbed down to just boring storage entities. Maybe it's the right thing to do, but they will complain about losing privileges!
they would be dumbed down to just boring storage entities
Boring sounds good to me... keep the exciting code out in the modules that deal with the data generation/transformation/aggregation, but by the time it gets to the workflow it should be as generic, consistent and stable as possible!
👍 Great! Thanks @lwgordonimos. Let's take it for a spin...
if local_path is None:
    local_path = self.temp_dir

remotepipelinefilecollection.download(self._upload_store_runner.broker, local_path)
Without interrupting the extended philosophical discussion above, I just wanted to point out that (consistent with @lwgordonimos' "internal mental model") this line should really be

self._upload_store_runner.broker.download(remotepipelinefilecollection, local_path)

As a bonus, the broker's download method also accepts either a RemotePipelineFileCollection or a single RemotePipelineFile, so you don't even need to do the ensure_remotepipelinefilecollection above.
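In other words (assuming the accepts-either behaviour just described; single_remote_file here is a hypothetical RemotePipelineFile instance), either of these calls would work without any wrapping:

broker = self._upload_store_runner.broker

# download a whole collection
broker.download(remotepipelinefilecollection, local_path)

# or a single RemotePipelineFile directly, no ensure_remotepipelinefilecollection needed
broker.download(single_remote_file, local_path)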
Yep, and if RemoteFile gets the same method accepted, this function could just be named differently and accept either of the inputs.
My point about turning both classes into boring stuff (e.g. named tuples/dicts) was that, as of today, it would require some effort (which is hard for me to gauge, since this package is used everywhere).
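As a rough illustration of that "boring" direction (purely hypothetical, not proposed code), the file type could shrink to pure data along these lines:

from typing import NamedTuple

# Hypothetical state-only version of RemotePipelineFile: pure data, no behaviour,
# with all operations (checksum, download, delete) living in brokers/helpers instead
class RemoteFileState(NamedTuple):
    dest_path: str
    size: int = 0
    checksum: str = ''
    is_deletion: bool = False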
@mhidas I realised that exact thing as soon as it was merged, unfortunately. In fact, I realised that the handler was the wrong place as well anyway, since we went to the trouble of making a StateQuery class to be the state API for exactly this kind of thing :/
@ocehugo remember that this is not a general-purpose library for interacting with our data; this is a framework for keeping handlers as consistent as possible. This is why it is so opinionated about how things should be done in terms of the workflow during execution.
We can implement that kind of flexible download interface without putting methods on those classes or dismantling other aspects of the design which are useful and/or intentional, based on real-world use cases we've encountered along the way (and in the previous pipeline implementation).
With the above change, this is the kind of code handlers can write to download files if they really need to, and this is the kind of style I would encourage:
myfilter = SomeOgcFilter()
collection = self.state_query.query_wfs_urls_for_layer('mylayer', myfilter)
self.state_query.download(collection, self.temp_dir)
# and of course there's nothing preventing you from
# looking at/manipulating your collection to get whatever info you need, e.g.
bigger_than_a_kb = RemotePipelineFileCollection(f for f in collection if f.size > 1024)
only_netcdfs = collection.filter_by_attribute_id(FileType.NETCDF)
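And since the filtered subsets are themselves collections, they can be fed straight back into the same high-level call (continuing the example above, same assumed StateQuery API):

# download only the filtered subset, still without touching the storage layer directly
self.state_query.download(only_netcdfs, self.temp_dir)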