Skip to content

Add GCS Support#6

Closed
matthewphsmith wants to merge 3 commits into
masterfrom
gcp
Closed

Add GCS Support#6
matthewphsmith wants to merge 3 commits into
masterfrom
gcp

Conversation

@matthewphsmith

Copy link
Copy Markdown
Collaborator

This adds support for uploading and downloading data to/from GCS.

@honnix

honnix commented Sep 5, 2019

Copy link
Copy Markdown
Member

@matthewphsmith What is the status of this PR? Would you like us to review as well? Thanks.

@matthewphsmith

Copy link
Copy Markdown
Collaborator Author

It's good to go from my perspective, I just need an approval on a review.

@wild-endeavor

Copy link
Copy Markdown
Contributor

+1 bump version plz


def download_directory(self, remote_path, local_path):
"""
:param Text remote_path: remote s3:// path

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/s3/gs/g

@tnsetting

tnsetting commented Nov 1, 2019

Copy link
Copy Markdown
Contributor

I tried to the PR and the pyflyte-execute could download the input.pb from gcs and upload the output.pb to gcs. But the node still failed because of OutputsNotFoundError: Outputs not found for task type python-task.... It looks like the whole engine_dir_xxx folder was uploaded instead of just upload the .pb files.

raise ValueError("Not an GS Key. Please use FQN (GS ARN) of the format gs://...")

GCSProxy._check_binary()
cmd = [GCSProxy._GS_UTIL_CLI, "cp", "-r", local_path, remote_path]

@tnsetting tnsetting Nov 1, 2019

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can change local_path to local_path + '/*' to just avoid upload the whole engine_dir folder.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current behaviour in this PR seems fine to me, because the expectation of this function is to upload the specified local path. Maybe it is the caller who should use ../* instead?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can confirm that gsutil and aws s3 behave differently when copying a dir recursively.

e.g:

$ tree test
test
└── test1
    └── test.txt

1 directories, 1 file

$ gsutil cp -r test/* gs://flyte-test
Copying file://test/test1/a/b/test.txt [Content-Type=text/plain]...
/ [1 files][    0.0 B/    0.0 B]
Operation completed over 1 objects.

$ gsutil ls gs://flyte-test
gs://flyte-test//
gs://flyte-test/test1/    <--------

vs.

$ gsutil cp -r test gs://flyte-test
Copying file://test/test1/a/b/test.txt [Content-Type=text/plain]...
/ [1 files][    0.0 B/    0.0 B]
Operation completed over 1 objects.

$ gsutil ls gs://flyte-test
gs://flyte-test//
gs://flyte-test/test/    <--------

@honnix

honnix commented Nov 4, 2019

Copy link
Copy Markdown
Member

I think there is a caveat using gsutil: it doesn't seem to respect GOOGLE_APPLICATION_CREDENTIALS environment variable. That means the container should either have a service account activated so it becomes ADC effectively, or carry an identity such as GCE.

@honnix

honnix commented Nov 5, 2019

Copy link
Copy Markdown
Member

Rebased and addressed the comments #41

@kumare3 kumare3 closed this Nov 6, 2019
ByronHsu pushed a commit to ByronHsu/flytekit that referenced this pull request May 29, 2023
wild-endeavor added a commit that referenced this pull request Sep 26, 2023
eapolinario added a commit that referenced this pull request Feb 8, 2024
* temp

Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>

* stuff

Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>

* temp

Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>

* scaffolding areas mostly identified

Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>

* add artifact to upload request

Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>

* remove an unnecessary line in workflow

Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>

* finish adding examples use cases maybe

Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>

* add project/dom to get query

Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>

* add from flyte idl

Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>

* add project domain to as query

Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>

* add condition in parameter to flyte idl

Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>

* test stuff

* Remove artifactID from literal oneof, add to metadata (#2)

* Triggers (#6)

* Minor changes to get time series example working #8

Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>

* switch channel (#10)

Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>

* fix tests ignore - pr into other pr (#1858)

Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>

* Artf/update idl ux (#1920)


Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>

* Artf/trigger (#1948)

* Add triggers
* Remove bind_partition_time and just assume users won't use that. It's just time_partition in the normal call function now.

Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>

* remove the now deleted artifact spec (#1984)

Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>

* Literal metadata model update (#2089)


Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>

* Separate time partition (#2114)



Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>

* Split service code (#2136)


Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>

* remove empty files

Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>

* add noneness check to metadata and add test

Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>

* remove sandbox test for now

Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>

* Artf/cleanup (#2158)

* add a test

Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>

* try updates

Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>

---------

Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>

* Use python 3.9 to run make doc-requirements.txt

Signed-off-by: Eduardo Apolinario <eapolinario@users.noreply.github.com>

* reasons not msg

Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>

---------

Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>
Signed-off-by: Eduardo Apolinario <eapolinario@users.noreply.github.com>
Co-authored-by: Eduardo Apolinario <eapolinario@users.noreply.github.com>
eapolinario added a commit that referenced this pull request Feb 8, 2024
* temp

Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>

* stuff

Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>

* temp

Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>

* scaffolding areas mostly identified

Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>

* add artifact to upload request

Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>

* remove an unnecessary line in workflow

Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>

* finish adding examples use cases maybe

Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>

* add project/dom to get query

Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>

* add from flyte idl

Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>

* add project domain to as query

Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>

* add condition in parameter to flyte idl

Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>

* test stuff

* Remove artifactID from literal oneof, add to metadata (#2)

* Triggers (#6)

* Minor changes to get time series example working #8

Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>

* switch channel (#10)

Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>

* fix tests ignore - pr into other pr (#1858)

Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>

* Artf/update idl ux (#1920)

Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>

* Artf/trigger (#1948)

* Add triggers
* Remove bind_partition_time and just assume users won't use that. It's just time_partition in the normal call function now.

Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>

* remove the now deleted artifact spec (#1984)

Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>

* Literal metadata model update (#2089)

Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>

* Separate time partition (#2114)

Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>

* Split service code (#2136)

Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>

* remove empty files

Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>

* add noneness check to metadata and add test

Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>

* remove sandbox test for now

Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>

* Artf/cleanup (#2158)

* add a test

Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>

* try updates

Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>

---------

Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>

* Use python 3.9 to run make doc-requirements.txt

Signed-off-by: Eduardo Apolinario <eapolinario@users.noreply.github.com>

* reasons not msg

Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>

---------

Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>
Signed-off-by: Eduardo Apolinario <eapolinario@users.noreply.github.com>
Co-authored-by: Eduardo Apolinario <eapolinario@users.noreply.github.com>
Signed-off-by: Eduardo Apolinario <eapolinario@users.noreply.github.com>
noahjax referenced this pull request in dominodatalab/flytekit Mar 18, 2024
* remove cruft and add agent

* clean up updated agent, get it running locally

* switch to poetry to enable installing extras

* cleanup and black formatting

* dumb mypy change to pass ci

* ruff

* maybe fix test?

* fix logging, set log level

* black formatting and ruff fix

* update agent to create pipeline job

* cleanup agent, shuffle deps

* update Pipfile.lock

* black format

* black + mypy

* ruff

* update testt

* update test to stop testing DominoJobTask logic

* get dominoHost from a secret

* add os back

* [DOM-52824] Add pipeline interface args to pipeline config sent to job api (#6)

* add interfaces to pipelineConfig job api arg

* ruff

* update deps

* disabled test for now

* Re-enable test

* Appease the testing gods

* provide output paths

* supply inputMetadataPrefix to jobs api

* use same bucket for raw output

* Build container image with nonroot support

 Build with:

 > docker build -t train-flyte-domino-agent-service -f ./build/docker/Dockerfile .

 - With the previous container image

 ❯ docker run -i --rm --user 1000 --cap-drop all docker.io/library/train-flyte-domino-agent-service
Traceback (most recent call last):
  File "/usr/local/bin/pyflyte", line 5, in <module>
    from flytekit.clis.sdk_in_container.pyflyte import main
  File "/usr/local/lib/python3.9/site-packages/flytekit/__init__.py", line 208, in <module>
    from flytekit.core.base_sql_task import SQLTask
  File "/usr/local/lib/python3.9/site-packages/flytekit/core/base_sql_task.py", line 4, in <module>
    from flytekit.core.base_task import PythonTask, TaskMetadata
  File "/usr/local/lib/python3.9/site-packages/flytekit/core/base_task.py", line 31, in <module>
    from flytekit.core.context_manager import (
  File "/usr/local/lib/python3.9/site-packages/flytekit/core/context_manager.py", line 33, in <module>
    from flytekit.core.data_persistence import FileAccessProvider, default_local_file_access_provider
  File "/usr/local/lib/python3.9/site-packages/flytekit/core/data_persistence.py", line 513, in <module>
    data_config=DataConfig.auto(),
  File "/usr/local/lib/python3.9/site-packages/flytekit/configuration/__init__.py", line 606, in auto
    config_file = get_config_file(config_file)
  File "/usr/local/lib/python3.9/site-packages/flytekit/configuration/file.py", line 259, in get_config_file
    if current_location_config.exists():
  File "/usr/local/lib/python3.9/pathlib.py", line 1424, in exists
    self.stat()
  File "/usr/local/lib/python3.9/pathlib.py", line 1232, in stat
    return self._accessor.stat(self)
PermissionError: [Errno 13] Permission denied: 'flytekit.config'

 - With the updated container build

 ❯ docker run -i --rm --user 1000 --cap-drop all docker.io/library/train-flyte-domino-agent-service
 Starting up the server to expose the prometheus metrics...
 Starting the agent service...

* add datasetSnapshots to agent -> domino comms

* drop Pipfile in favor of poetry

* multi stage build

* add Pipfile back

* defer to nucleus for start pipeline job validation

* update readme

* add some debugging instructions

* black

* fix get domino host

* fix lint test

---------

Co-authored-by: integration-test integration-test <test-notifs+integration-test@dominodatalab.com>
Co-authored-by: ddl-ryan-connor <106334081+ddl-ryan-connor@users.noreply.github.com>
Co-authored-by: ddl-ebrown <ethan.brown@dominodatalab.com>
Co-authored-by: Ryan Connor <ryan.connor@dominodatalab.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants