This repository was archived by the owner on Oct 9, 2023. It is now read-only.
Don't add master replica log link when doing elastic pytorch training #356
Merged
Codecov Report
@@ Coverage Diff @@
## master #356 +/- ##
==========================================
+ Coverage 62.76% 64.20% +1.44%
==========================================
Files 148 148
Lines 12444 10289 -2155
==========================================
- Hits 7810 6606 -1204
+ Misses 4038 3072 -966
- Partials 596 611 +15
Signed-off-by: Fabio Graetz <fabiograetz@googlemail.com>
hamersaw approved these changes on Jun 7, 2023
eapolinario pushed a commit that referenced this pull request on Sep 6, 2023:
…#356)
* Don't add master log link when doing elastic pytorch training
Signed-off-by: Fabio Graetz <fabiograetz@googlemail.com>
* Lint
Signed-off-by: Fabio Graetz <fabiograetz@googlemail.com>
---------
Signed-off-by: Fabio Graetz <fabiograetz@googlemail.com>
TL;DR
When doing torch elastic training, the resulting PyTorchJob has no so-called master replica, unlike non-elastic PyTorch distributed training. Flyteplugins, however, still generates a log link for the nonexistent master replica in the elastic case. This PR fixes this.
Type
Are all requirements met?
I built a propeller image and tested that the correct log links are shown both for elastic and the original non-elastic pytorch tasks.
Complete description
When doing "normal" non-elastic training, a Flyte task looks like this:
The PyTorch job that is created from this task definition looks like this:
Notice that there is a so-called "master" replica and multiple workers.
In the Flyte console, log links are shown for the master replica and for each of the 3 worker replicas.
When using the new elastic training task (torchrun) ...
... the resulting pytorch job looks like this:
Notice that there is no longer a "master" replica.
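The job manifests referenced above are not reproduced in this text. As a rough illustration only, the difference between the two job shapes can be sketched with trimmed-down Python dicts. The field names `pytorchReplicaSpecs` and `elasticPolicy` follow the Kubeflow PyTorchJob resource; the replica counts for the elastic job, the `elasticPolicy` values, and the helper `replica_counts` are assumptions made for this sketch, not taken from the PR.

```python
# Hypothetical, heavily trimmed PyTorchJob specs written as Python dicts.
# Only the fields relevant to the master/worker distinction are shown.

non_elastic_job = {
    "kind": "PyTorchJob",
    "spec": {
        "pytorchReplicaSpecs": {
            "Master": {"replicas": 1},   # present for non-elastic training
            "Worker": {"replicas": 3},   # the 3 workers mentioned above
        },
    },
}

elastic_job = {
    "kind": "PyTorchJob",
    "spec": {
        # Illustrative values; the real ones depend on the task config.
        "elasticPolicy": {"minReplicas": 1, "maxReplicas": 3},
        "pytorchReplicaSpecs": {
            "Worker": {"replicas": 3},   # no "Master" entry at all
        },
    },
}


def replica_counts(job):
    """Return a mapping of replica type to replica count for a job dict."""
    specs = job["spec"]["pytorchReplicaSpecs"]
    return {name: spec["replicas"] for name, spec in specs.items()}


print(replica_counts(non_elastic_job))
print(replica_counts(elastic_job))
```

The point of the comparison is that the elastic job carries only worker replicas, so any consumer that assumes a `Master` entry exists will refer to something that was never created.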
Even though there is no "master" replica, currently the Flyte console still shows a log link for the master replica that doesn't exist.
This PR fixes this.
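The intended behavior can be sketched in a few lines of Python. This is not the plugin's actual code (Flyteplugins is written in Go); the function name `log_link_pods` and the pod naming scheme `<job>-master-0` / `<job>-worker-<i>` are assumptions used for illustration.

```python
def log_link_pods(job_name, num_workers, is_elastic):
    """Return the pod names for which a log link should be generated.

    Assumption: pods are named "<job>-master-0" and "<job>-worker-<i>".
    """
    pods = []
    # Only non-elastic jobs have a master replica, so only they get a
    # master log link.
    if not is_elastic:
        pods.append(f"{job_name}-master-0")
    # Worker log links are generated in both cases.
    pods.extend(f"{job_name}-worker-{i}" for i in range(num_workers))
    return pods


# Non-elastic: one master link plus one link per worker.
print(log_link_pods("train", 3, is_elastic=False))
# Elastic: worker links only.
print(log_link_pods("train", 3, is_elastic=True))
```

Before the fix, the behavior described above corresponds to adding the master entry unconditionally, which yields a link to a pod that does not exist for elastic jobs.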
Tracking Issue
NA
Follow-up issue
NA