Feat: Warn when doing local torch elastic training with nnodes > 1 #1697

Merged
fg91 merged 1 commit into master from fg91/feat/warn-local-elastic-training
Jun 19, 2023
@fg91 fg91 commented Jun 19, 2023

TL;DR

With @task(task_config=Elastic(...)), one can run training with torch elastic launch (torchrun).
This works both locally and in a cluster with a Kubeflow PyTorchJob.

When executing a workflow locally (i.e. python workflow.py) while setting e.g. Elastic(nnodes=2), the rendezvous of the workers will time out because the local workers wait for workers from a non-existent second node to join.

One would have to set the log level to debug in order to see that torch is waiting for the rendezvous to complete. With the default log level, the workflow appears to do nothing.

In this PR, I add a warning log message that informs the user about this.

Type

  • Bug Fix
  • Feature
  • Plugin

Are all requirements met?

  • Code completed
  • Smoke tested
  • Unit tests added
  • Code documentation added
  • Any pending items have an associated Issue

Complete description

I check for an environment variable that is set by the Kubeflow training operator. If it is not set but the user has set nnodes > 1, the warning is emitted.

One could discuss whether we should just automatically switch to nnodes=1 if the environment variables for distributed training have not been set by the training operator, but I found this too intrusive. Warning the user, however, is the least we should do.
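
The check described above could look roughly like the following. This is a minimal standalone sketch, not the code merged in this PR. The description does not name the environment variable, so `PET_NNODES` is an assumption here, and the helper name `warn_if_local_multinode` and the logger name are hypothetical. For simplicity, `nnodes` is treated as a plain integer.

```python
import logging
import os
from typing import Mapping, Optional

logger = logging.getLogger("elastic_warning_sketch")  # hypothetical logger name

# Assumption: the PR text does not say which variable is checked.
# PET_NNODES is used here as a stand-in for "a variable set by the
# Kubeflow training operator".
ASSUMED_OPERATOR_ENV_VAR = "PET_NNODES"


def warn_if_local_multinode(nnodes: int, env: Optional[Mapping[str, str]] = None) -> bool:
    """Warn when nnodes > 1 but the operator's env var is absent.

    Returns True if a warning was emitted, False otherwise.
    """
    env = os.environ if env is None else env
    launched_by_operator = env.get(ASSUMED_OPERATOR_ENV_VAR) is not None

    if nnodes > 1 and not launched_by_operator:
        # Only warn; nnodes is deliberately left unchanged, because
        # silently switching to nnodes=1 was considered too intrusive.
        logger.warning(
            "nnodes is set to %d, but the execution does not appear to run "
            "in a PyTorchJob. The rendezvous might time out.",
            nnodes,
        )
        return True
    return False
```

Calling `warn_if_local_multinode(2, {})` emits the warning, while `warn_if_local_multinode(2, {"PET_NNODES": "2"})` and `warn_if_local_multinode(1, {})` stay silent.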

Tracking Issue

NA

Follow-up issue

NA

Signed-off-by: Fabio Graetz <fabiograetz@googlemail.com>
@fg91 fg91 merged commit 68ac1f5 into master Jun 19, 2023
@fg91 fg91 deleted the fg91/feat/warn-local-elastic-training branch June 19, 2023 17:01