Fix: Allow elastic tasks to be recoverable #1846
    return_val = fn(**kwargs)

    try:
        return_val = fn(**kwargs)
This is the invocation of the actual task function in the worker processes.
Codecov Report
Additional details and impacted files
@@ Coverage Diff @@
## master #1846 +/- ##
===========================================
+ Coverage 20.13% 53.81% +33.67%
===========================================
Files 337 301 -36
Lines 32427 22245 -10182
Branches 5857 3453 -2404
===========================================
+ Hits 6530 11971 +5441
+ Misses 25731 10102 -15629
- Partials 166 172 +6
2. The pods belonging to a Flyte Elastic task write a single `error.pb` into blob storage, causing a
   race condition as one pod might overwrite the error file of another pod.
How does this work currently? If multiple workers fail it seems like it still only displays the error message of the first one which failed.
For me, an open question is whether the error file can be overwritten by other pods.
But even if not, there is definitely a race condition over which pod gets to write its file first.
> How does this work currently? If multiple workers fail it seems like it still only displays the error message of the first one which failed.

So yes, a single one is displayed. But I think we don't have a guarantee that it's the first one.
This is an interesting catch. We could write error files from all pods. But will all pods write an error file? If so, we can let the plugin collate them.
This would be the cleanest solution.
All pods will write an error file unless they just disappear because of e.g. a preemption.
What would this look like in practice?
- Would we add an optional function to the plugin interface here to give plugins the option to customize the interpretation of error file(s)?
- On the Python side, in the entrypoint, would we check whether the "task class" wants to customize how the respective error file is named so that the different workers don't overwrite the file?
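One way to picture the "one error file per worker, collated afterwards" idea is the sketch below. It is only an illustration under assumptions: the `WorkerError` record, the `error-<worker>.json` naming scheme, the JSON encoding (instead of the real `error.pb` protobuf), and the "earliest timestamp wins" collation rule are all hypothetical and not part of flytekit or the plugin interface.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Optional


@dataclass
class WorkerError:
    # All fields are hypothetical; the real error.pb schema is not modelled here.
    worker: str        # identifier of the worker/pod that failed
    timestamp: float   # when the failure happened
    message: str
    recoverable: bool


def write_error_file(error_dir: Path, err: WorkerError) -> Path:
    """Write one file per worker so that no worker overwrites another's error."""
    path = error_dir / f"error-{err.worker}.json"
    path.write_text(json.dumps(asdict(err)))
    return path


def collate_errors(error_dir: Path) -> Optional[WorkerError]:
    """Return the earliest recorded failure, or None if no worker wrote a file."""
    errors = [
        WorkerError(**json.loads(p.read_text()))
        for p in error_dir.glob("error-*.json")
    ]
    return min(errors, key=lambda e: e.timestamp, default=None)
```

Because each worker writes to its own file name, the order in which pods finish writing no longer decides which error is reported. A worker that disappears (e.g. due to preemption) simply contributes no file.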
Given @kumare3's proposal to properly resolve the race condition over which worker pod gets to write the error.pb, I removed the logic that always makes worker group 0 write the file regardless of whether the first exception occurred in another worker group.
So this PR now only tackles the problem of propagating the recoverable exception to the agent process.
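The propagation described here could look roughly like the following sketch. It is not the PR's actual code: `RecoverableError` and `ChildFailed` are stand-ins for `FlyteRecoverableException` and torch's `ChildFailedError`, and the marker file used to signal the agent process, as well as the names `MARKER_NAME`, `run_in_worker`, and `run_in_agent`, are assumptions made for illustration.

```python
from pathlib import Path


class RecoverableError(Exception):
    """Stand-in for flytekit's FlyteRecoverableException (hypothetical)."""


class ChildFailed(Exception):
    """Stand-in for the ChildFailedError raised by elastic_launch (hypothetical)."""


MARKER_NAME = "recoverable_error"  # hypothetical marker file name


def run_in_worker(fn, marker_dir: Path, **kwargs):
    """Worker side: run the task function; leave a marker if it failed recoverably."""
    try:
        return fn(**kwargs)
    except Exception as e:
        if isinstance(e, RecoverableError):
            (marker_dir / MARKER_NAME).touch()
        raise  # the launcher still sees the failure


def run_in_agent(launch, marker_dir: Path):
    """Agent side: turn the launcher's generic failure back into a recoverable one."""
    try:
        return launch()
    except ChildFailed as e:
        if (marker_dir / MARKER_NAME).exists():
            raise RecoverableError("a worker raised a recoverable error") from e
        raise
```

The worker re-raises after writing the marker, so the launcher's own failure handling is unchanged. Only the agent process decides whether the failure should surface as recoverable.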
Thanks for the heads-up, removed fe619e1.
Signed-off-by: Fabio Grätz <fabiogratz@googlemail.com>
Co-authored-by: Dennis Keck <26092524+fellhorn@users.noreply.github.com> Signed-off-by: Fabio Grätz <fabiogratz@googlemail.com>
Signed-off-by: Fabio Graetz <fabiograetz@googlemail.com> Signed-off-by: Fabio Grätz <fabiogratz@googlemail.com>
… the agent process Signed-off-by: Fabio Graetz <fabiograetz@googlemail.com> Signed-off-by: Fabio Grätz <fabiogratz@googlemail.com>
fe619e1 to e656424
Signed-off-by: troychiu <y.troychiu@gmail.com>
TL;DR
For some failures, users might want to retry a Flyte task; for other failures, they might not.
Flyte tasks can be marked as retriable by raising a `FlyteRecoverableException`.
Torch's `elastic_launch` (which is used by flytekit's `Elastic` task type) always raises a `ChildFailedError`, even if the worker process raised e.g. a `FlyteRecoverableException`.
This means that Flyte's retry mechanism cannot be used for user errors in elastic tasks.
Elastic tasks offer `torchrun`'s retry mechanism by specifying `max_retries`. However, this retry mechanism (which doesn't restart the pod but the worker processes within the pod) will retry all exceptions, which is not always desirable.
This PR makes elastic tasks work with Flyte's retry mechanism by propagating `FlyteRecoverableException` up from the worker processes.
Type
Are all requirements met?
Complete description
How did you fix the bug, make the feature etc. Link to any design docs etc
Tracking Issue
NA
Follow-up issue
NA