This repository was archived by the owner on Oct 9, 2023. It is now read-only.

Feat: Configure elastic training in pytorch plugin #343

Merged: kumare3 merged 5 commits into master from fabio/feat/torch-elastic-plugin on Apr 24, 2023

fg91 (Member) commented Apr 10, 2023

TL;DR

This PR modifies the PyTorch plugin so that it sets an ElasticPolicy on the Kubeflow PyTorchJob when a user configures torch elastic training (torchrun) in the task decorator:

from flytekit import task
from flytekitplugins.kfpytorch import Elastic

@task(
    task_config=Elastic(
        replicas=4,
        nproc_per_node=4,
        ...
    ),
    ...
)
def train(...):
    ...

See this issue for motivation and more details.
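
To make the mapping described above more concrete, here is a minimal sketch of how settings from the task decorator could be translated into an `elasticPolicy` fragment of a PyTorchJob manifest. It is an illustration only, not the plugin's actual implementation (which lives in the Go backend plugin). The `ElasticSettings` class, the `to_elastic_policy` helper, and the `min_replicas`, `max_replicas`, `rdzv_backend` and `max_restarts` parameters are hypothetical names introduced for this example. The output keys are assumed to follow the Kubeflow training operator's `ElasticPolicy` field names.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class ElasticSettings:
    """Hypothetical stand-in for the values a user sets in the task decorator."""

    replicas: int
    nproc_per_node: int
    min_replicas: int
    max_replicas: int
    rdzv_backend: str = "c10d"  # assumed default, chosen for illustration
    max_restarts: int = 0


def to_elastic_policy(cfg: ElasticSettings) -> Dict[str, Any]:
    """Build an elasticPolicy-like dict from the settings (illustrative only)."""
    # The requested replica count has to lie inside the elastic range.
    if not (1 <= cfg.min_replicas <= cfg.replicas <= cfg.max_replicas):
        raise ValueError("expected 1 <= min_replicas <= replicas <= max_replicas")
    if cfg.nproc_per_node < 1:
        raise ValueError("nproc_per_node must be at least 1")
    return {
        "minReplicas": cfg.min_replicas,
        "maxReplicas": cfg.max_replicas,
        "nProcPerNode": cfg.nproc_per_node,
        "rdzvBackend": cfg.rdzv_backend,
        "maxRestarts": cfg.max_restarts,
    }


if __name__ == "__main__":
    settings = ElasticSettings(
        replicas=4, nproc_per_node=4, min_replicas=2, max_replicas=4
    )
    print(to_elastic_policy(settings))
```

Returning a plain dict keeps the sketch free of Kubernetes client dependencies. In the actual plugin, the equivalent values are presumably carried through flyteidl, which matches the later comment in this thread that flyteidl had to be updated first.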

Type

  • Bug Fix
  • Feature
  • Plugin

Are all requirements met?

  • Code completed
  • Smoke tested
  • Unit tests added
  • Code documentation added
  • Any pending items have an associated Issue

Complete description

Tracking Issue

Fixes flyteorg/flyte#3614

Follow-up issue

Fabio Grätz added 2 commits April 22, 2023 20:03
Signed-off-by: Fabio Grätz <fabiogratz@googlemail.com>
Signed-off-by: Fabio Grätz <fabiogratz@googlemail.com>
fg91 (Member, Author) commented Apr 23, 2023

Tests are failing since flyteidl needs to be updated first.

fg91 marked this pull request as ready for review April 23, 2023 11:01
fg91 requested a review from kumare3 April 23, 2023 11:01
fg91 self-assigned this Apr 23, 2023
fg91 added the enhancement label Apr 23, 2023
kumare3 added 2 commits April 23, 2023 21:11
Signed-off-by: Ketan Umare <ketan.umare@gmail.com>
Signed-off-by: Ketan Umare <ketan.umare@gmail.com>
codecov (bot) commented Apr 24, 2023

Codecov Report

Merging #343 (74ea839) into master (1f39163) will increase coverage by 1.43%.
The diff coverage is 100.00%.

❗ Current head 74ea839 differs from pull request most recent head c8de6e2. Consider uploading reports for commit c8de6e2 to get more accurate results.

@@            Coverage Diff             @@
##           master     #343      +/-   ##
==========================================
+ Coverage   62.64%   64.07%   +1.43%     
==========================================
  Files         148      148              
  Lines       12397    10072    -2325     
==========================================
- Hits         7766     6454    -1312     
+ Misses       4036     3023    -1013     
  Partials      595      595              
| Flag | Coverage Δ |
|---|---|
| unittests | ? |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ |
|---|---|
| ...o/tasks/plugins/k8s/kfoperators/pytorch/pytorch.go | 80.00% <100.00%> (+7.65%) ⬆️ |

... and 130 files with indirect coverage changes

Signed-off-by: Ketan Umare <ketan.umare@gmail.com>

Labels

enhancement New feature or request

Development

Successfully merging this pull request may close these issues.

[Core feature] Support torch elastic training/torchrun

2 participants