Skip to content
This repository was archived by the owner on Oct 9, 2023. It is now read-only.

Don't add master replica log link when doing elastic pytorch training#356

Merged
fg91 merged 2 commits into
masterfrom
fabio/fix/elastic-log-links
Jun 7, 2023
Merged

Don't add master replica log link when doing elastic pytorch training#356
fg91 merged 2 commits into
masterfrom
fabio/fix/elastic-log-links

Conversation

@fg91

@fg91 fg91 commented Jun 7, 2023

Copy link
Copy Markdown
Member

TL;DR

When doing torch elastic training, there is no so-called master replica in the resulting PytorchJob as opposed to when doing non-elastic pytorch distributed training.

Flyteplugins, however, still generates a log link for the non-existing master replica in case of elastic training. This PR fixes this.

Type

  • Bug Fix
  • Feature
  • Plugin

Are all requirements met?

  • Code completed
  • Smoke tested
  • Unit tests added
  • Code documentation added
  • Any pending items have an associated Issue

I built a propeller image and tested that the correct log links are shown both for elastic and the original non-elastic pytorch tasks.

Complete description

When doing "normal" non-elastic training, a flyte task looks like this:

@task(
    task_config=PyTorch(
        num_workers=3,
    )
)
def train():

The pytorch job that is created from this task definition looks like this:

apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  ...
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      ...
    Worker:
      replicas: 3
      ...

Notice that there is a so-called "master" replica and multiple workers.

In the Flyte console, a link to the master replica and to the 3 worker replicas logs is shown.

When using the new elastic training task (torchrun) ...

@task(
    task_config=Elastic(
        nnodes=3,
        nproc_per_node=1,
    ),
)
def train():

... the resulting pytorch job looks like this:

apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  ...
spec:
  elasticPolicy:
    ...
  pytorchReplicaSpecs:
    Worker:
      replicas: 3
      ...

Notice that there is no-more "master" replica.

Even though there is no "master" replica, currently the Flyte console still shows a log link for the master replica that doesn't exist.

This PR fixes this.

Tracking Issue

NA

Follow-up issue

NA

@fg91 fg91 force-pushed the fabio/fix/elastic-log-links branch from 984c99c to a4fbf94 Compare June 7, 2023 14:21
@codecov

codecov Bot commented Jun 7, 2023

Copy link
Copy Markdown

Codecov Report

Merging #356 (173acea) into master (76a80ec) will increase coverage by 1.44%.
The diff coverage is 77.83%.

❗ Current head 173acea differs from pull request most recent head 63b4fcd. Consider uploading reports for the commit 63b4fcd to get more accurate results

@@            Coverage Diff             @@
##           master     #356      +/-   ##
==========================================
+ Coverage   62.76%   64.20%   +1.44%     
==========================================
  Files         148      148              
  Lines       12444    10289    -2155     
==========================================
- Hits         7810     6606    -1204     
+ Misses       4038     3072     -966     
- Partials      596      611      +15     
Flag Coverage Δ
unittests ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
go/tasks/plugins/webapi/agent/config.go 100.00% <ø> (ø)
.../plugins/k8s/kfoperators/common/common_operator.go 65.94% <34.88%> (-5.99%) ⬇️
go/tasks/plugins/k8s/dask/dask.go 85.77% <57.14%> (+1.85%) ⬆️
go/tasks/plugins/webapi/agent/plugin.go 68.22% <66.66%> (ø)
go/tasks/plugins/k8s/kfoperators/mpi/mpi.go 75.00% <81.48%> (+0.80%) ⬆️
go/tasks/plugins/k8s/ray/ray.go 77.48% <84.09%> (-0.09%) ⬇️
...s/plugins/k8s/kfoperators/tensorflow/tensorflow.go 77.77% <84.33%> (+3.02%) ⬆️
...o/tasks/plugins/k8s/kfoperators/pytorch/pytorch.go 78.08% <88.37%> (-1.45%) ⬇️
...tasks/pluginmachinery/flytek8s/container_helper.go 86.36% <88.88%> (+0.36%) ⬆️
go/tasks/pluginmachinery/flytek8s/pod_helper.go 69.81% <100.00%> (-3.21%) ⬇️
... and 1 more

... and 121 files with indirect coverage changes

fg91 added 2 commits June 7, 2023 16:11
Signed-off-by: Fabio Graetz <fabiograetz@googlemail.com>
Signed-off-by: Fabio Graetz <fabiograetz@googlemail.com>
@fg91 fg91 force-pushed the fabio/fix/elastic-log-links branch from 0709b32 to 63b4fcd Compare June 7, 2023 16:11
@fg91 fg91 marked this pull request as ready for review June 7, 2023 16:12
@fg91 fg91 requested a review from pingsutw June 7, 2023 16:12
@fg91 fg91 merged commit de1d1d3 into master Jun 7, 2023
@fg91 fg91 deleted the fabio/fix/elastic-log-links branch June 7, 2023 18:15
eapolinario pushed a commit that referenced this pull request Sep 6, 2023
…#356)

* Don't add master log link when doing elastic pytorch training

Signed-off-by: Fabio Graetz <fabiograetz@googlemail.com>

* Lint

Signed-off-by: Fabio Graetz <fabiograetz@googlemail.com>

---------

Signed-off-by: Fabio Graetz <fabiograetz@googlemail.com>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants