Flyte flyte:// file system and improve remote file handling#1674
Conversation
Signed-off-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>
Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>
Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>
Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>
Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>
Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>
Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>
Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>
Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>
Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>
Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>
Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>
Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>
Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>
Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>
Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>
Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>
Signed-off-by: Ketan Umare <ketan.umare@gmail.com>
Signed-off-by: Ketan Umare <ketan.umare@gmail.com>
|
@pingsutw / @wild-endeavor / @eapolinario this PR is ready and works in cases I have tested. I think a few more unit tests would be good. But, one problem. take the following example df = pd.DataFrame({"Name": ["Tom", "Joseph11"], "Age": [20, 22]})
@workflow
def wf(sd: StructuredDataset = StructuredDataset(dataframe=df)) -> StructuredDataset:
return t1(sd=sd)This currently does not work. With this PR, this will work, but will cause "registration clashes" even for registering the same code again, as the path is not auto de-duped. The problem is because we are using pandas parquet writer, we cannot control the remote path. The remote path is created as a random path on the raw prefix. One solution might be to hash the data value, but this is expensive (though completely ok for the registration case). Identifying the difference between registration vs runtime may be tough. I am still investigating, but want to make the PR ready for review |
Signed-off-by: Ketan Umare <ketan.umare@gmail.com>
Signed-off-by: Ketan Umare <ketan.umare@gmail.com>
Signed-off-by: Ketan Umare <ketan.umare@gmail.com>
Signed-off-by: Ketan Umare <ketan.umare@gmail.com>
pingsutw
left a comment
There was a problem hiding this comment.
LGTM, some questions in the comment
| self._remote = get_remote(self.config_file, self.project, self.domain) | ||
| # TODO @wild-endeavor - why should the local data upload location be /tmp? what should it be? | ||
| # Also why do we even copy the local data? | ||
| data_upload_location = "/tmp" |
| @click.pass_context | ||
| def info(ctx: click.Context): | ||
| """ | ||
| Print out information about the current Flyte Python CLI environment - like the version of Flytekit, backend endpoint |
| with tf.io.TFRecordWriter(local_path) as writer: | ||
| writer.write(python_val.SerializeToString()) | ||
| ctx.file_access.put_data(local_path, remote_path, is_multipart=False) | ||
| remote_path = ctx.file_access.put_raw_data(local_path) |
There was a problem hiding this comment.
do we need to override here? I thought type_engine would modify it
There was a problem hiding this comment.
i think we are going to override everywhere. TypeEngine could modify, but i want to be sure, we should do it wherever we can
| # h = hashlib.md5() | ||
| # h.update(data) | ||
| # md5 = h.digest() | ||
| # l = len(data) | ||
| # | ||
| # headers = {"Content-Length": str(l), "Content-MD5": md5} |
There was a problem hiding this comment.
| # h = hashlib.md5() | |
| # h.update(data) | |
| # md5 = h.digest() | |
| # l = len(data) | |
| # | |
| # headers = {"Content-Length": str(l), "Content-MD5": md5} |
Signed-off-by: Ketan Umare <ketan.umare@gmail.com>
Signed-off-by: Ketan Umare <ketan.umare@gmail.com>
Signed-off-by: Ketan Umare <ketan.umare@gmail.com>
* try no fs to pa read Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com> * file or local Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com> * no parens Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com> * add check for tuple Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com> * wrong order Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com> * pin fsspec Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com> --------- Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>
…rg#1674) * Add RemoteFileAccessProvider Signed-off-by: Kevin Su <pingsutw@apache.org> * more tests Signed-off-by: Kevin Su <pingsutw@apache.org> * lint Signed-off-by: Kevin Su <pingsutw@apache.org> * lint Signed-off-by: Kevin Su <pingsutw@apache.org> * nit Signed-off-by: Kevin Su <pingsutw@apache.org> * move to remote.py Signed-off-by: Kevin Su <pingsutw@apache.org> * remove random remote file path Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com> * fix _path Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com> * posting error now, needs investigation Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com> * add some tests Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com> * using new idl, adding the constant prefix Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com> * fix plugins Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com> * fix last bad replace Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com> * assign output of put_data in case of mutation Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com> * add output location Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com> * add to register Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com> * fix some tests, make fmt Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com> * delete weird file Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com> * update Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com> * add back random remote, remove staging Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com> * fix test Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com> * fix test hash Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com> * more put raw tests Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com> * fix tf Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com> * remove old file Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com> * update put data calls in plugins except spark Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com> * Unnecessary options removed Signed-off-by: Ketan Umare <ketan.umare@gmail.com> * working launchplan Signed-off-by: Ketan Umare <ketan.umare@gmail.com> * Flyte File system (Put only) (flyteorg#1776) * wip Signed-off-by: Kevin Su <pingsutw@apache.org> * get_flyte_fs func Signed-off-by: Kevin Su <pingsutw@apache.org> * Flyte file like object Signed-off-by: Kevin Su <pingsutw@apache.org> * Add support open Signed-off-by: Kevin Su <pingsutw@apache.org> * test Signed-off-by: Kevin Su <pingsutw@apache.org> * lint Signed-off-by: Kevin Su <pingsutw@apache.org> * fix tests Signed-off-by: Kevin Su <pingsutw@apache.org> * Fix union type Signed-off-by: Kevin Su <pingsutw@apache.org> * Fix tests Signed-off-by: Kevin Su <pingsutw@apache.org> * Fix tests Signed-off-by: Kevin Su <pingsutw@apache.org> * resolved conflict Signed-off-by: Kevin Su <pingsutw@apache.org> * merged yee's branch Signed-off-by: Kevin Su <pingsutw@apache.org> * Add open Signed-off-by: Kevin Su <pingsutw@apache.org> * update tests Signed-off-by: Kevin Su <pingsutw@apache.org> * update context Signed-off-by: Kevin Su <pingsutw@apache.org> --------- Signed-off-by: Kevin Su <pingsutw@apache.org> * Updated Signed-off-by: Ketan Umare <ketan.umare@gmail.com> * wip Signed-off-by: Ketan Umare <ketan.umare@gmail.com> * updated Signed-off-by: Ketan Umare <ketan.umare@gmail.com> * updated Signed-off-by: Ketan Umare <ketan.umare@gmail.com> * lint fixed Signed-off-by: Ketan Umare <ketan.umare@gmail.com> * fixed potential issue with parsing list/dict Signed-off-by: Ketan Umare <ketan.umare@gmail.com> * more unit tests Signed-off-by: Ketan Umare <ketan.umare@gmail.com> * Union and other types now improved Signed-off-by: Ketan Umare <ketan.umare@gmail.com> * wip Signed-off-by: Ketan Umare <ketan.umare@gmail.com> * fixing hitl / gate Signed-off-by: Ketan Umare <ketan.umare@gmail.com> * more updates to fetch Signed-off-by: Ketan Umare <ketan.umare@gmail.com> * fix tests Signed-off-by: Kevin Su <pingsutw@apache.org> * test windows fixed Signed-off-by: Ketan Umare <ketan.umare@gmail.com> * updated code Signed-off-by: Ketan Umare <ketan.umare@gmail.com> * clarification in docs Signed-off-by: Ketan Umare <ketan.umare@gmail.com> * pa.arrow, issue with windows (flyteorg#1911) * try no fs to pa read Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com> * file or local Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com> * no parens Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com> * add check for tuple Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com> * wrong order Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com> * pin fsspec Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com> --------- Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com> --------- Signed-off-by: Kevin Su <pingsutw@apache.org> Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com> Signed-off-by: Ketan Umare <ketan.umare@gmail.com> Co-authored-by: Kevin Su <pingsutw@apache.org> Co-authored-by: Ketan Umare <ketan.umare@gmail.com>
Tom-Newton
left a comment
There was a problem hiding this comment.
Hello. I've been trying to upgrade from flytekit 1.9.1 to 1.10.2 and I think the changes from this PR are causing some issues for me when I try to use a FlyteFile as an input to a workflow that I submit using pyflyte.
I've left a couple of comments but in addition to that I'm having issues when re-running the same workflow with the same FllyteFile.
details = "file already exists at location [<path>], specify a matching hash if you wish to rewrite"
debug_error_string = "UNKNOWN:Error received from peer ipv4:172.27.128.127:81 {created_time:"2024-01-22T21:59:07.492837995+00:00", grpc_status:6, grpc_message:"file already exists at location [<path>], specify a matching hash if you wish to rewrite"}"
This one confuses me because it looks like its doing the same as before except previously it didn't fail.
| p = kwargs.pop(_PREFIX_KEY) | ||
| hashes = kwargs.pop(_HASHES_KEY) | ||
| # Parse rpath, strip out everything that doesn't make sense. | ||
| rpath = rpath.replace(f"{REMOTE_PLACEHOLDER}/", "", 1) |
There was a problem hiding this comment.
It seems to me like this should be replacing REMOTE_PLACEHOLDER. We call
res = await super()._put(lpath, REMOTE_PLACEHOLDER, recursive, callback, batch_size, **kwargs)
So I think the rpath is going to be just REMOTE_PLACEHOLDER. For me this results in paths that look like
abfs://flyte/flyteexamples/development/HAZFZJWIE6KZ3TWFZJJRFAXH6A======/flyte://data
This does work but it doesn't seem quite right to me.
There was a problem hiding this comment.
yeah i ran into this issue too. working on a fix. but this is opt-in behavior isn't it?
There was a problem hiding this comment.
Sorry, I don't understand how its opt-in. pyflyte run depends on this if you use FlyteFile.
There was a problem hiding this comment.
#2121 this is the pr i'm working on, but not sure what's going on with the test failures, will have to take a look later this week.
| rpath = rpath.replace(f"{REMOTE_PLACEHOLDER}/", "", 1) | ||
| resp, content_length, md5_bytes = self.get_upload_link(lpath, rpath, p, hashes) | ||
|
|
||
| headers = {"Content-Length": str(content_length), "Content-MD5": b64encode(md5_bytes).decode("utf-8")} |
There was a problem hiding this comment.
I think this is a problem for Azure users. For writes to Azure blob storage the "x-ms-blob-type": "BlockBlob" header is required. Without it we get
│ Invalid value for '--segment_list_file': Failed to convert param: <Option segment_list_file>, value: /home/tomnewton/WayveCode/wayve/services/data/pipelines/flyte/examples/fanout_workflow/run_windows_list.txt to type: <class 'flytekit.types.file.file.FlyteFile'>. Reason 400, │
│ message="An HTTP header that's mandatory for this request is not specified.",
In the old code path there was a hack to add in the extra header required for Azure
flytekit/flytekit/remote/remote.py
Line 825 in cba830e
|
BTW I worked out the |
TL;DR
Examples of improvements
New additions
centralizes fsspec as the way to interact with data
Planned work
and many more data interaction improvements
Details
What it does?
flyte://is the protocol. This is not a real filesystem though, so it's only used inpyflyte runandpyflyte register.The reason the file system is useful:
pyflyte runcurrently duplicates a bunch of work that's supposed to be handled by the type engine because it needs to handle remote upload. This can be cleaned up in the future if the data proxy service can be accessible at a lower level.The drawback of this approach:
cp src flyte://destcommand, the destination is ignored. For this reason,rpathin the implementation is always ignored and a placeholder is used instead. Instead, theremote_fsimplementation returns the s3 native url as part of theputcall.This is basically an fsspec filesystem that writes to a different location than what is specified. However, it's not always possible to capture the output where the file was really written to. More concretely, if you do:
pandas will use fsspec to fetch the file system, call
openon it, and then write to the open buffer. After writing, users expect that s3 location to hold the contents of the dataframe. With the flyte remote fs, if you calleddf.to_parquet("flyte://anything"), the dataframe contents do not end up there. Instead they end up in somes3://location (assuming s3 is the blob store).Because of this, calls to
put_datawere updated to read back theremote_path, now returned byput_data. (this is why dataframes don't work as default input. we can of course add another transformer that writes to a memory buffer or file first of course only for pyflyte run/register.)To promote the behavior, a new function was added to data_persistence
put_raw_datawhich writes always to the raw output data prefix. See the function and corresponding test. This PR does not update all the calls yet to the new interface.The flyte fs also does not support
open. The reason is simply that the data proxy call requires that we know the MD5 hash. This doesn't really make sense in the context of an open call where the data might not even be known yet. If this wasn't the case, we could've written the pandas transformer to fs.open() and then write directly to the buffer.Also,
get_random_remote_path/directorywas removed from data persistence and rewritten to be based on an input fs.Type
Are all requirements met?
Complete description
Smaller changes
hash_file/field for proper Windows use.Other thoughts
We also briefly thought about replacing the fsspec s3 filesystem with something that will do the signed urls but this felt more complicated and felt like it interfered too much with fsspec. Also would be dependent on what blob store admin was configured to use.
Needs: flyteorg/flyteidl#416 and flyteorg/flyteadmin#577
flyteorg/flytestdlib#160 and flyteorg/flyteadmin#587
Tracking Issue
NA