Skip to content

Fix config of user facing execution parameters in spawning elastic tasks#1677

Merged
pingsutw merged 1 commit into
masterfrom
fabio/fix/kfpytorch-elastic-execution-params
Jun 26, 2023
Merged

Fix config of user facing execution parameters in spawning elastic tasks#1677
pingsutw merged 1 commit into
masterfrom
fabio/fix/kfpytorch-elastic-execution-params

Conversation

@fg91

@fg91 fg91 commented Jun 6, 2023

Copy link
Copy Markdown
Member

TL;DR

When using @task(task_config=flytekitplugins.kfpytorch.Elastic()), the task function is started in a number of worker processes using torch elastic_launch (torchrun). The processes can be created using fork or spawn which is controlled by the arg Elastic(start_method=...).

When using fork, the child process inherits a copy of the parent process' stack including the flyte context and the user facing execution parameters ctx = flytekit.current_context().

When spawning, however, fresh processes are started and the flyte context and the execution parameters are not transferred to the child process currently. This means that within a task with @task(task_config=Elastic(start_method="spawn")) the execution id and the checkpoint cannot be accessed from the execution parameters.

This PR fixes this by setting up the flyte context in the spawned worker processes.

Type

  • Bug Fix
  • Feature
  • Plugin

Are all requirements met?

  • Code completed
  • Smoke tested
  • Unit tests added
  • Code documentation added
  • Any pending items have an associated Issue

Complete description

In the spawned worker processes I call flytekit.bin.entrypoint.setup_execution which sets up the flyte context the same way as when a normal python task is started. Raw data prefix and checkpoint pathes are transferred from the parent process.

Tracking Issue

NA

Follow-up issue

NA

@fg91 fg91 force-pushed the fabio/fix/kfpytorch-elastic-execution-params branch from 2cc8283 to a1e0a8e Compare June 6, 2023 09:40
@codecov

codecov Bot commented Jun 6, 2023

Copy link
Copy Markdown

Codecov Report

Merging #1677 (1263dab) into master (3370a96) will decrease coverage by 0.03%.
The diff coverage is n/a.

❗ Current head 1263dab differs from pull request most recent head d71c3bb. Consider uploading reports for the commit d71c3bb to get more accurate results

@@            Coverage Diff             @@
##           master    #1677      +/-   ##
==========================================
- Coverage   71.03%   71.00%   -0.03%     
==========================================
  Files         336      336              
  Lines       30798    30781      -17     
  Branches     5589     5576      -13     
==========================================
- Hits        21876    21855      -21     
- Misses       8375     8379       +4     
  Partials      547      547              

see 15 files with indirect coverage changes

Comment on lines +74 to +76
("spawn", "", False),
("spawn", "f12345678", True),
("fork", "local", False),

@fg91 fg91 Jun 6, 2023

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When spawning, the execution_id.name, .project, .domain, ... are set to the default value "" here when the FLYTE_INTERNAL_EXECUTION_ID, ... env vars are not set, i.e. during a local execution.

When executing a workflow/task locally, these execution identifiers are normally set to "local" which happens here. Since the parent processes stack is copied during forking, "local" is set when using this start method.

Accepting this difference between forking and spawning in a local execution might be a pragmatic compromise but is something that gives me a bit of grief.

If we want to remove this difference, I see two options for doing so.

  • Not set "" as the default value for execution id name, project, domain, ... in flytekit.bin.entrypoint.setup_execution. Would this have any undesired effect?
  • Maintain an adapted copy of setup_execution here, which would, however, lead to quite some code duplication which wouldn't be nice either.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will defer to @eapolinario @pingsutw on this point

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's fine. We don't use project, domain, and name in the local execution, right?

@fg91 fg91 force-pushed the fabio/fix/kfpytorch-elastic-execution-params branch from a1e0a8e to 56d91f5 Compare June 6, 2023 13:51
@fg91 fg91 marked this pull request as ready for review June 6, 2023 14:30
@fg91 fg91 force-pushed the fabio/fix/kfpytorch-elastic-execution-params branch from 56d91f5 to 1263dab Compare June 7, 2023 10:01
Comment on lines +74 to +76
("spawn", "", False),
("spawn", "f12345678", True),
("fork", "local", False),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will defer to @eapolinario @pingsutw on this point

Comment on lines +74 to +76
("spawn", "", False),
("spawn", "f12345678", True),
("fork", "local", False),

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's fine. We don't use project, domain, and name in the local execution, right?

Signed-off-by: Fabio Grätz <fabiogratz@googlemail.com>
@fg91 fg91 force-pushed the fabio/fix/kfpytorch-elastic-execution-params branch from 1263dab to d71c3bb Compare June 21, 2023 17:10
@pingsutw pingsutw merged commit d7bfa6e into master Jun 26, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants