
Update S3StorageBroker._run_query to return a RemotePipelineFileCollection #208

Merged
mhidas merged 4 commits into master from s3broker-return on May 7, 2020

Conversation


@ghost ghost commented May 5, 2020

Update S3StorageBroker._run_query to return a RemotePipelineFileCollection containing RemotePipelineFile instances


codecov Bot commented May 5, 2020

Codecov Report

Merging #208 into master will decrease coverage by 0.04%.
The diff coverage is 75.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #208      +/-   ##
==========================================
- Coverage   90.65%   90.61%   -0.05%     
==========================================
  Files          35       35              
  Lines        3478     3483       +5     
  Branches      423      424       +1     
==========================================
+ Hits         3153     3156       +3     
- Misses        265      266       +1     
- Partials       60       61       +1     
Impacted Files Coverage Δ
setup.py 0.00% <ø> (ø)
aodncore/pipeline/handlerbase.py 94.08% <66.66%> (-0.45%) ⬇️
aodncore/pipeline/storage.py 96.49% <100.00%> (ø)
aodncore/version.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 841b1ad...cfe4c64. Read the comment docs.

@ghost ghost force-pushed the s3broker-return branch from c5a9512 to 1df7ec4 Compare May 5, 2020 05:35
@ghost ghost marked this pull request as ready for review May 5, 2020 06:22
@ghost ghost mentioned this pull request May 5, 2020
Contributor

@mhidas mhidas left a comment


Looks good to me! Just a couple of minor suggestions.

Also, since this is a bit more than a bug fix, should we bump the minor version on the package?

# "public" methods
#

def download_remotepipelinefilecollection(self, remotepipelinefilecollection, local_path=None):
Contributor


👍 This is great! Removes the need for child handlers to mess around with self._upload_store_runner

For added convenience (e.g. for @ocehugo's use case), this method could also accept a single RemotePipelineFile instead of a collection.

And, would it be worth adding a unittest for this method?
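As a rough sketch of the "accept either type" suggestion, a small helper can normalise the input into a collection. Everything below uses illustrative stand-in classes, not the actual aodncore implementation; the real `ensure_remotepipelinefilecollection` helper may behave differently.

```python
# Illustrative sketch only -- stand-in classes, not the aodncore code.

class RemotePipelineFile:
    """Minimal stand-in for a single remote file."""
    def __init__(self, name):
        self.name = name


class RemotePipelineFileCollection:
    """Minimal stand-in for a collection of remote files."""
    def __init__(self, files=()):
        self.files = list(files)


def ensure_remotepipelinefilecollection(value):
    """Wrap a single file in a collection; pass an existing collection through."""
    if isinstance(value, RemotePipelineFileCollection):
        return value
    return RemotePipelineFileCollection([value])


# Either input type now works for a download-style method:
single = ensure_remotepipelinefilecollection(RemotePipelineFile("data.nc"))
many = ensure_remotepipelinefilecollection(
    RemotePipelineFileCollection([RemotePipelineFile("a.nc")]))
print(len(single.files), len(many.files))  # -> 1 1
```

A `download_remotepipelinefilecollection`-style method could then call this helper on its first argument before doing anything else.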


@mhidas - yeap, I agree that it should also allow a RemotePipelineFile object as the arg.

However, when I mentioned it, I was thinking more that RemotePipelineFile should also have a download method. That way, we could avoid initing/wrapping the thing into another class just to download a single file.

This could affect the design of the Collection class though, and I understand if it's not quickly implemented.

Author

You can't download a file without knowing where to download it from. Having a remote file be downloadable in a self-contained way would mean each file needs an independent link to a storage broker, which is intentionally not implemented this way.

The relationship is one way, in that a broker knows how to download a RemotePipelineFileCollection from itself, and not the other way around. Otherwise every single class ends up doing absolutely everything, instead of being focused on one job. This is why the hierarchy is intentionally "one way" in terms of where the methods live.

Doing otherwise encourages child handlers to mess around in unnecessarily low-level ways (such as trying to interact with storage directly) that are most likely inconsistent with other handlers, and we get back to the scenario where handlers all go off and do their own similar things in different ways. Child handlers should deal with high-level concepts through the library, and the library should deal with low-level concepts such as storage and harvester implementation. Separation of concerns, abstraction, encapsulation, all that good stuff.

  • A File represents a file and the state of that file.
  • A FileCollection contains Files and understands how to filter and manipulate them en masse.
  • A Broker can take a collection and attempt to download it, so that the collection object needs zero knowledge about the storage layer (since that is not its purpose).
  • The Handler in this case can simply pivot the parameters from "tell this collection about a broker it can use to download itself" (which requires knowing about brokers, and actually having one to use, which is too low level) to "tell the handler to download this collection" (which requires zero low-level knowledge).

So in a sense, if the child handler needs to upload or download, it must do this through the designated public library methods; otherwise it's breaking the intended encapsulation and thinking in terms that are too "low level". Child handlers should be declarative in that sense: they should define what needs to be done, not how it is done.
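The one-way hierarchy described above can be sketched roughly as follows. These are hypothetical stand-in classes to illustrate the shape of the design, not the real aodncore implementation.

```python
import os

# Sketch of the one-way relationship: files and collections hold state,
# only the broker knows about storage. Stand-in names, not aodncore code.

class RemotePipelineFile:
    """Holds state about a remote file; knows nothing about storage."""
    def __init__(self, name):
        self.name = name
        self.local_path = None  # state: where it ended up after a download


class RemotePipelineFileCollection:
    """Holds files; deliberately has no download() of its own."""
    def __init__(self, files=()):
        self.files = list(files)


class StorageBroker:
    """Only the broker knows how to fetch things from storage."""
    def download(self, collection, local_path):
        for f in collection.files:
            # a real broker would transfer bytes here; we just record state
            f.local_path = os.path.join(local_path, f.name)


broker = StorageBroker()
collection = RemotePipelineFileCollection([RemotePipelineFile("a.nc")])
broker.download(collection, "/tmp")
print(collection.files[0].local_path)
```

Note the direction of the dependency: the broker takes a collection as an argument, while the collection and file classes never reference a broker.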


@ocehugo ocehugo May 7, 2020


I'm talking about the download method in Collections being implemented in RemoteFile. After peeking through the code, I realize it should be simple.

My question was raised from the following observation: RemoteFiles can delete themselves and checksum themselves, so why can't they download themselves!?

Then my thinking was: OK, there's no self.download method - let's pick the guy who can do it/define the location (the broker).

However, after the comment above, I checked the Collection method, and its "download" method does exactly that - it uses the broker to download.

The relationship is one way, in that a broker knows how to download a RemotePipelineFileCollection from itself, and not the other way around.

We are allowing both atm. RemotePipelineFileCollection.download allows doing exactly the opposite - in a semantic sense. You say "Hey RemotePipelineFileCollection, please download yourself with this broker at this path", but what happens is "Hey broker, use me and put the result on this path".

My whole point is: why not allow the same in RemoteFile!? Why do I need to wrap a RemoteFile in another class, within a singleton IndexedSet, just to call a provided/required external object with a path argument? Just allow the same in RemoteFile then.

A Broker can take a collection and attempt to download so that the collection object needs zero knowledge about the storage layer (since that is not it's purpose)

Well, the collection knows who can help it be downloaded, so why is the individual ignorant?

Regards,
Nominated lawyer of RemoteFile rights
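The delegation being debated here, "ask the collection to download itself, but it just hands the work to the broker", can be sketched like this. These are hypothetical stand-in classes, not the real aodncore API.

```python
# Sketch of the delegation pattern under discussion (stand-in classes).

class StorageBroker:
    def download(self, collection, local_path):
        # the broker does the actual work
        return "broker downloaded {} file(s) to {}".format(
            len(collection.files), local_path)


class RemotePipelineFileCollection:
    def __init__(self, files=()):
        self.files = list(files)

    def download(self, broker, local_path):
        # "Hey broker, use me and put the result on this path"
        return broker.download(self, local_path)


coll = RemotePipelineFileCollection(["a.nc", "b.nc"])
result = coll.download(StorageBroker(), "/tmp")
print(result)  # -> broker downloaded 2 file(s) to /tmp
```

The semantic inversion is visible in the two call forms: `coll.download(broker, path)` and `broker.download(coll, path)` do the same thing, which is exactly the ambiguity the thread is arguing about.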

Author

Haha, it's been a while since I've been thinking in this headspace. As usual, there is sometimes a discrepancy between my internal mental model and the current reality.

Now that I look at it, I didn't follow my own vision: I implemented it wrongly and invalidated everything I just said ;)

My question was raised from the following observation: RemoteFile can delete themselves, checksum themselves, so why they can't download themselves!?

The original intention for Files and Collections was that they can't do those things to themselves; they simply hold state information about their current status (e.g. has this been published) and intended state (e.g. is this a candidate for publishing). Some things crept in, like checksumming itself and a remote file managing its own locally cached file, but there is a real danger of this scope creeping out to all kinds of things. If anything, this scope should be reduced back to simpler types rather than expanded.

I actually think it would be better to get rid of the download method on the collection now that this is in place, because storage brokers really aren't something a collection type should have anything to do with.


Haha, it's been a while since I've been thinking in this headspace. As usual, there is sometimes a discrepancy between my internal mental model and the current reality.
Now that I look at it, I didn't follow my own vision: I implemented it wrongly and invalidated everything I just said ;)
The original intention for Files and Collections was that they can't do those things to themselves; they simply hold state information about their current status (e.g. has this been published) and intended state (e.g. is this a candidate for publishing). Some things crept in, like checksumming itself and a remote file managing its own locally cached file, but there is a real danger of this scope creeping out to all kinds of things. If anything, this scope should be reduced back to simpler types rather than expanded.

Maybe you don't need to prune, but rather embrace the living objects (teenagers!?) they became and lead them to adulthood!? IMO, I don't think self.check_summing or self.download are jumping over the fence of their objectives; otherwise, they would be dumbed down to just boring storage entities. Maybe it's the right thing to do, but they will complain about losing privileges!

Author

they would be dumb down to just boring storage entities

Boring sounds good to me... keep the exciting code out in the modules that deal with the data generation/transformation/aggregation, but by the time it gets to the workflow it should be as generic, consistent and stable as possible!

@ghost ghost force-pushed the s3broker-return branch from 1df7ec4 to 1d9c9d2 Compare May 7, 2020 02:34
@ghost ghost force-pushed the s3broker-return branch from 1d9c9d2 to cfe4c64 Compare May 7, 2020 03:08
Contributor

mhidas commented May 7, 2020

👍 Great! Thanks @lwgordonimos. Let's take it for a spin...

@mhidas mhidas merged commit a8704af into master May 7, 2020
@mhidas mhidas deleted the s3broker-return branch May 7, 2020 05:26
if local_path is None:
local_path = self.temp_dir

remotepipelinefilecollection.download(self._upload_store_runner.broker, local_path)
Contributor

Without interrupting the extended philosophical discussion above, I just wanted to point out that (consistent with @lwgordonimos' "internal mental model"), this line should really be

    self._upload_store_runner.broker.download(remotepipelinefilecollection, local_path)

As a bonus, the broker's download method also accepts either a RemotePipelineFileCollection or a single RemotePipelineFile, so you don't even need to do the ensure_remotepipelinefilecollection above.


@ocehugo ocehugo May 10, 2020


Yeap, and if RemoteFile gets the same method accepted, this function could just be named differently and accept either of the inputs.

My point about turning both classes into boring stuff (e.g. named tuples/dicts) was that, as of today, it would require some effort (the size of which is hard for me to grasp, since this package is used everywhere).

Author

@mhidas I realised that exact thing as soon as it was merged, unfortunately. In fact, I realised that the handler was the wrong place as well, since we went to the trouble of making a StateQuery class to be the state API for exactly this kind of thing :/

#209

@ocehugo remember that this is not a general-purpose library for interacting with our data; it's a framework for keeping handlers as consistent as possible. This is why it is so opinionated about how things should be done in terms of the workflow during execution.

We can implement that kind of flexible download interface without putting methods on those classes or dismantling other aspects of the design, which are useful and/or intentional based on real-world use cases we've encountered along the way (and in the previous pipeline implementation).

With the above change, this is the kind of code handlers can write to download files if they really need to, and this is the kind of style I would encourage.

myfilter = SomeOgcFilter()
collection = self.state_query.query_wfs_urls_for_layer('mylayer', myfilter)
self.state_query.download(collection, self.temp_dir)

# and of course there's nothing preventing you from
# looking at/manipulating your collection to get whatever info you need, e.g.
bigger_than_a_kb = RemotePipelineFileCollection(f for f in collection if f.size > 1024)
only_netcdfs = collection.filter_by_attribute_id(FileType.NETCDF)
