Skip to content

Fix: Allow elastic tasks to be recoverable#1846

Merged
fg91 merged 7 commits into
masterfrom
fg91/fix/recoverable-elastic-tasks
Oct 28, 2023
Merged

Fix: Allow elastic tasks to be recoverable#1846
fg91 merged 7 commits into
masterfrom
fg91/fix/recoverable-elastic-tasks

Conversation

@fg91

@fg91 fg91 commented Sep 22, 2023

Copy link
Copy Markdown
Member

TL;DR

For some failures, users might want to retry a flyte task, for other failures not.
Flyte tasks can be marked as retriable by raising a FlyteRecoverableException.

Torch's elastic_launch (which is used by flytekit's Elastic task type) always raises a ChildFailedError even if in the worker process e.g. a FlyteRecoverableException was raised.

This means that Flyte's retry mechanism cannot be used for user errors in elastic tasks.

Elastic tasks offer torchrun's retry mechanism by specifying max_retries. However, this retry mechanism (which doesn't restart the pod but the worker processes within the pod) will retry all exceptions which is not always desireable.

This PR makes elastic tasks work with Flyte's retry mechanism by propagating FlyteRecoverableException up from the worker processes.

Type

  • Bug Fix
  • Feature
  • Plugin

Are all requirements met?

  • Code completed
  • Smoke tested
  • Unit tests added
  • Code documentation added
  • Any pending items have an associated Issue

Complete description

How did you fix the bug, make the feature etc. Link to any design docs etc

Tracking Issue

NA

Follow-up issue

NA

return_val = fn(**kwargs)

try:
return_val = fn(**kwargs)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the invocation of the actual task function in the worker processes.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets add a comment here

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do 👍

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fg91 fg91 marked this pull request as draft September 22, 2023 14:46
@codecov

codecov Bot commented Sep 22, 2023

Copy link
Copy Markdown

Codecov Report

Attention: 7 lines in your changes are missing coverage. Please review.

Comparison is base (af429f6) 20.13% compared to head (e656424) 53.81%.
Report is 46 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff             @@
##           master    #1846       +/-   ##
===========================================
+ Coverage   20.13%   53.81%   +33.67%     
===========================================
  Files         337      301       -36     
  Lines       32427    22245    -10182     
  Branches     5857     3453     -2404     
===========================================
+ Hits         6530    11971     +5441     
+ Misses      25731    10102    -15629     
- Partials      166      172        +6     
Files Coverage Δ
...ins/flytekit-kf-pytorch/tests/test_elastic_task.py 100.00% <100.00%> (ø)
...ytorch/flytekitplugins/kfpytorch/error_handling.py 92.85% <92.85%> (ø)
...tekit-kf-pytorch/flytekitplugins/kfpytorch/task.py 85.20% <66.66%> (ø)

... and 348 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Comment thread plugins/flytekit-kf-pytorch/flytekitplugins/kfpytorch/task.py Outdated
Comment on lines +304 to +305
2. The pods belonging to a Flyte Elastic task write a single `error.pb` into blob storage, causing a
race condition as one pod might overwrite the error file of another pod.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this work currently? If multiple workers fail it seems like it still only displays the error message of the first one which failed.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the part in the pods' entrypoint where the error file is uploaded.
See here and then here.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For me, an open question is whether the error file can be overwritten by other pods.

But even if not, there is definitely a race condition which pod get's to write its file first.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this work currently? If multiple workers fail it seems like it still only displays the error message of the first one which failed.

So yes, a single one is displayed. But I think we don't have a guarantee it's the first one.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an interesting catch, we could write error files from all pods. But will all pods write and error file, if so we can and let the plugin collate it?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be the cleanest solution.
All pods will write an error file unless they just disappear because of e.g. a preemption.

How would this look like in practice?

  • Would we add an optional function to the plugin interface here to give plugins the option to customize the interpretation of error file(s)?
  • On the python side, in the entrypoint, would we check whether the "task class" wants to customize how the respective error file is called so that the different workers don't overwrite the file?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given @kumare3's proposal to properly resolve the race condition which worker pod get's to write the error.pb, I removed the logic that always makes worker group 0 write the file regardless of whether the first exception occured in another worker group.

So this PR now only tackles the problem of propagating the recoverable exception to the agent process.

@fg91 fg91 changed the title Fix: Allow elastic tasks to be recoverable [WIP] Fix: Allow elastic tasks to be recoverable Sep 27, 2023
@fellhorn

Copy link
Copy Markdown
Contributor

@fg91

fg91 commented Oct 4, 2023

Copy link
Copy Markdown
Member Author

@fg91 fg91 requested a review from fellhorn October 4, 2023 16:24
fg91 and others added 7 commits October 27, 2023 11:44
Signed-off-by: Fabio Grätz <fabiogratz@googlemail.com>
Co-authored-by: Dennis Keck <26092524+fellhorn@users.noreply.github.com>
Signed-off-by: Fabio Grätz <fabiogratz@googlemail.com>
Signed-off-by: Fabio Graetz <fabiograetz@googlemail.com>
Signed-off-by: Fabio Grätz <fabiogratz@googlemail.com>
Signed-off-by: Fabio Graetz <fabiograetz@googlemail.com>
Signed-off-by: Fabio Grätz <fabiogratz@googlemail.com>
Signed-off-by: Fabio Graetz <fabiograetz@googlemail.com>
Signed-off-by: Fabio Grätz <fabiogratz@googlemail.com>
Signed-off-by: Fabio Graetz <fabiograetz@googlemail.com>
Signed-off-by: Fabio Grätz <fabiogratz@googlemail.com>
… the agent process

Signed-off-by: Fabio Graetz <fabiograetz@googlemail.com>
Signed-off-by: Fabio Grätz <fabiogratz@googlemail.com>
@fg91 fg91 force-pushed the fg91/fix/recoverable-elastic-tasks branch from fe619e1 to e656424 Compare October 27, 2023 09:44
@fg91 fg91 self-assigned this Oct 27, 2023
@fg91 fg91 added bug Something isn't working and removed bug Something isn't working labels Oct 27, 2023
@fg91 fg91 marked this pull request as ready for review October 27, 2023 09:51
@fg91 fg91 merged commit 8d6f920 into master Oct 28, 2023
troychiu pushed a commit to troychiu/flytekit that referenced this pull request Oct 30, 2023
troychiu pushed a commit to troychiu/flytekit that referenced this pull request Oct 30, 2023
Signed-off-by: troychiu <y.troychiu@gmail.com>
ringohoffman pushed a commit to ringohoffman/flytekit that referenced this pull request Nov 24, 2023
@fg91 fg91 changed the title [WIP] Fix: Allow elastic tasks to be recoverable Fix: Allow elastic tasks to be recoverable Dec 19, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants