Fix config of user facing execution parameters in spawning elastic tasks#1677
Codecov Report
@@            Coverage Diff             @@
##           master    #1677      +/-   ##
==========================================
- Coverage   71.03%   71.00%   -0.03%
==========================================
  Files         336      336
  Lines       30798    30781      -17
  Branches     5589     5576      -13
==========================================
- Hits        21876    21855      -21
- Misses       8375     8379       +4
  Partials      547      547
("spawn", "", False),
("spawn", "f12345678", True),
("fork", "local", False),
When spawning, execution_id.name, .project, .domain, ... are set to the default value "" when the FLYTE_INTERNAL_EXECUTION_ID, ... env vars are not set, i.e. during a local execution.
When executing a workflow/task locally, these execution identifiers are normally set to "local". Since the parent process's memory is copied during forking, the workers see "local" when using this start method.
Accepting this difference between forking and spawning in a local execution might be a pragmatic compromise, but it gives me a bit of grief.
If we want to remove this difference, I see two options for doing so.
- Don't set "" as the default value for execution id name, project, domain, ... in flytekit.bin.entrypoint.setup_execution. Would this have any undesired effect?
- Maintain an adapted copy of setup_execution here, which would, however, lead to quite some code duplication, which wouldn't be nice either.
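To make the difference concrete, here is a minimal, stdlib-only sketch (not flytekit code). PARENT_STATE and both helper functions are hypothetical names, and a fresh interpreter started via subprocess stands in for a spawned worker. The only things taken from the discussion above are the FLYTE_INTERNAL_EXECUTION_ID env var, its "" fallback, and the "local" value of a local run. The fork helper is POSIX-only.

```python
import os
import subprocess
import sys

# Hypothetical stand-in for the execution name the parent process has set up
# before launching workers ("local" for a local run, as described above).
PARENT_STATE = {"execution_name": "local"}

ENV_VAR = "FLYTE_INTERNAL_EXECUTION_ID"


def name_seen_by_forked_child() -> str:
    """Fork and return the execution name visible inside the child (POSIX only)."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: its memory is a copy of the parent's, so PARENT_STATE is intact.
        try:
            os.close(read_fd)
            os.write(write_fd, PARENT_STATE["execution_name"].encode())
            os.close(write_fd)
        finally:
            os._exit(0)  # never fall through into the parent's code path
    # Parent: close the write end so reading stops once the child is done.
    os.close(write_fd)
    with os.fdopen(read_fd) as reader:
        name = reader.read()
    os.waitpid(pid, 0)
    return name


def name_seen_by_fresh_process(env_value=None) -> str:
    """Start a new interpreter and return the name it derives from the environment."""
    env = {k: v for k, v in os.environ.items() if k != ENV_VAR}
    if env_value is not None:
        env[ENV_VAR] = env_value
    # The fresh process knows nothing about PARENT_STATE; it falls back to "".
    code = f"import os; print(os.environ.get({ENV_VAR!r}, ''), end='')"
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

The forked child reports "local", the fresh process reports "" unless the env var is set, in which case it reports that value. This lines up with the three parametrized cases quoted above, assuming the boolean in those tuples marks whether the env var is set.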
will defer to @eapolinario @pingsutw on this point
I think that's fine. We don't use project, domain, and name in the local execution, right?
Signed-off-by: Fabio Grätz <fabiogratz@googlemail.com>
TL;DR
When using @task(task_config=flytekitplugins.kfpytorch.Elastic()), the task function is started in a number of worker processes using torch elastic_launch (torchrun). The processes can be created using fork or spawn, which is controlled by the arg Elastic(start_method=...).
When using fork, the child process inherits a copy of the parent process's memory, including the flyte context and the user facing execution parameters ctx = flytekit.current_context().
When spawning, however, fresh processes are started, and the flyte context and the execution parameters are currently not transferred to the child process. This means that within a task with @task(task_config=Elastic(start_method="spawn")) the execution id and the checkpoint cannot be accessed from the execution parameters.
This PR fixes this by setting up the flyte context in the spawned worker processes.
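A rough sketch of the idea behind the fix, assuming nothing about flytekit internals: values the parent already knows are handed to the fresh process explicitly, and the worker rebuilds its own context from them. WORKER_CODE, run_worker and the dict keys are made up for illustration. The complete description names the raw data prefix and the checkpoint paths as the values that are transferred, so the sketch uses those. A fresh interpreter started via subprocess again stands in for a spawned worker, and the real change calls setup_execution in the worker rather than building a dict.

```python
import json
import subprocess
import sys

# Code executed by the fresh interpreter that stands in for a spawned worker.
# It rebuilds a minimal, hypothetical "context" from the JSON payload it gets.
WORKER_CODE = """
import json, sys
payload = json.loads(sys.argv[1])
context = {
    "raw_data_prefix": payload["raw_data_prefix"],
    "checkpoint_dest": payload["checkpoint_dest"],
}
print(json.dumps(context))
"""


def run_worker(raw_data_prefix: str, checkpoint_dest: str) -> dict:
    """Hand the parent's values to a fresh process and return the context it built."""
    payload = json.dumps(
        {"raw_data_prefix": raw_data_prefix, "checkpoint_dest": checkpoint_dest}
    )
    result = subprocess.run(
        [sys.executable, "-c", WORKER_CODE, payload],
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout)
```

Nothing is inherited implicitly here: whatever the worker needs has to be serializable and passed in, which is the constraint the spawn start method imposes.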
Type
Are all requirements met?
Complete description
In the spawned worker processes I call flytekit.bin.entrypoint.setup_execution, which sets up the flyte context the same way as when a normal python task is started. Raw data prefix and checkpoint paths are transferred from the parent process.
Tracking Issue
NA
Follow-up issue
NA