Fix: Allow elastic tasks to be recoverable by fg91 · Pull Request #1846 · flyteorg/flytekit

fg91 · 2023-09-22T14:42:36Z

TL;DR

For some failures, users might want to retry a flyte task, for other failures not.
Flyte tasks can be marked as retriable by raising a FlyteRecoverableException.

Torch's elastic_launch (which is used by flytekit's Elastic task type) always raises a ChildFailedError even if in the worker process e.g. a FlyteRecoverableException was raised.

This means that Flyte's retry mechanism cannot be used for user errors in elastic tasks.

Elastic tasks offer torchrun's retry mechanism by specifying max_retries. However, this retry mechanism (which doesn't restart the pod but the worker processes within the pod) will retry all exceptions which is not always desireable.

This PR makes elastic tasks work with Flyte's retry mechanism by propagating FlyteRecoverableException up from the worker processes.

Type

Bug Fix
Feature
Plugin

Are all requirements met?

Complete description

How did you fix the bug, make the feature etc. Link to any design docs etc

Tracking Issue

NA

Follow-up issue

NA

fg91 · 2023-09-22T14:44:32Z

-        return_val = fn(**kwargs)
-
+        try:
+            return_val = fn(**kwargs)


This is the invocation of the actual task function in the worker processes.

lets add a comment here

Will do 👍

codecov · 2023-09-22T14:50:20Z

Codecov Report

Attention: 7 lines in your changes are missing coverage. Please review.

Comparison is base (af429f6) 20.13% compared to head (e656424) 53.81%.
Report is 46 commits behind head on master.

Additional details and impacted files

@@             Coverage Diff             @@
##           master    #1846       +/-   ##
===========================================
+ Coverage   20.13%   53.81%   +33.67%     
===========================================
  Files         337      301       -36     
  Lines       32427    22245    -10182     
  Branches     5857     3453     -2404     
===========================================
+ Hits         6530    11971     +5441     
+ Misses      25731    10102    -15629     
- Partials      166      172        +6

Files	Coverage Δ
...ins/flytekit-kf-pytorch/tests/test_elastic_task.py	`100.00% <100.00%> (ø)`
...ytorch/flytekitplugins/kfpytorch/error_handling.py	`92.85% <92.85%> (ø)`
...tekit-kf-pytorch/flytekitplugins/kfpytorch/task.py	`85.20% <66.66%> (ø)`

... and 348 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

fellhorn · 2023-09-25T10:17:12Z

+        2. The pods belonging to a Flyte Elastic task write a single `error.pb` into blob storage, causing a
+            race condition as one pod might overwrite the error file of another pod.


How does this work currently? If multiple workers fail it seems like it still only displays the error message of the first one which failed.

This is the part in the pods' entrypoint where the error file is uploaded.
See here and then here.

For me, an open question is whether the error file can be overwritten by other pods.

But even if not, there is definitely a race condition which pod get's to write its file first.

How does this work currently? If multiple workers fail it seems like it still only displays the error message of the first one which failed.

So yes, a single one is displayed. But I think we don't have a guarantee it's the first one.

This is an interesting catch, we could write error files from all pods. But will all pods write and error file, if so we can and let the plugin collate it?

This would be the cleanest solution.
All pods will write an error file unless they just disappear because of e.g. a preemption.

How would this look like in practice?

Would we add an optional function to the plugin interface here to give plugins the option to customize the interpretation of error file(s)?

On the python side, in the entrypoint, would we check whether the "task class" wants to customize how the respective error file is called so that the different workers don't overwrite the file?

Given @kumare3's proposal to properly resolve the race condition which worker pod get's to write the error.pb, I removed the logic that always makes worker group 0 write the file regardless of whether the first exception occured in another worker group.

So this PR now only tackles the problem of propagating the recoverable exception to the agent process.

fellhorn · 2023-09-27T09:03:45Z

TODO @fg91
Remember to remove https://github.com/flyteorg/flytekit/blame/de8a5fe7f7e249785c28edfb0ca206e0f12b85f0/plugins/flytekit-kf-pytorch/flytekitplugins/kfpytorch/task.py#L365

fg91 · 2023-10-04T16:24:02Z

TODO @fg91 Remember to remove https://github.com/flyteorg/flytekit/blame/de8a5fe7f7e249785c28edfb0ca206e0f12b85f0/plugins/flytekit-kf-pytorch/flytekitplugins/kfpytorch/task.py#L365

Thanks for the headsup, removed fe619e1

Signed-off-by: Fabio Grätz <fabiogratz@googlemail.com>

Co-authored-by: Dennis Keck <26092524+fellhorn@users.noreply.github.com> Signed-off-by: Fabio Grätz <fabiogratz@googlemail.com>

Signed-off-by: Fabio Graetz <fabiograetz@googlemail.com> Signed-off-by: Fabio Grätz <fabiogratz@googlemail.com>

… the agent process Signed-off-by: Fabio Graetz <fabiograetz@googlemail.com> Signed-off-by: Fabio Grätz <fabiogratz@googlemail.com>

Signed-off-by: troychiu <y.troychiu@gmail.com>

fg91 requested review from cosmicBboy, eapolinario, kumare3, pingsutw and wild-endeavor as code owners September 22, 2023 14:42

fg91 commented Sep 22, 2023

View reviewed changes

fg91 marked this pull request as draft September 22, 2023 14:46

fellhorn reviewed Sep 25, 2023

View reviewed changes

fg91 changed the title ~~Fix: Allow elastic tasks to be recoverable~~ [WIP] Fix: Allow elastic tasks to be recoverable Sep 27, 2023

fg91 requested a review from fellhorn October 4, 2023 16:24

fg91 and others added 7 commits October 27, 2023 11:44

Make elastic tasks recoverable

83879bb

Signed-off-by: Fabio Grätz <fabiogratz@googlemail.com>

Update plugins/flytekit-kf-pytorch/flytekitplugins/kfpytorch/task.py

8578748

Co-authored-by: Dennis Keck <26092524+fellhorn@users.noreply.github.com> Signed-off-by: Fabio Grätz <fabiogratz@googlemail.com>

Document return value in doc string of 'is_recoverable_worker_error'

9e5cc92

Signed-off-by: Fabio Graetz <fabiograetz@googlemail.com> Signed-off-by: Fabio Grätz <fabiogratz@googlemail.com>

Make error handling dependent on worker group rank again

2520dd0

Signed-off-by: Fabio Graetz <fabiograetz@googlemail.com> Signed-off-by: Fabio Grätz <fabiogratz@googlemail.com>

Test propagation of recoverable error from worker to agent process

8cfc3dd

Signed-off-by: Fabio Graetz <fabiograetz@googlemail.com> Signed-off-by: Fabio Grätz <fabiogratz@googlemail.com>

Leave comment pointing to explanation of recoverable error file

7874295

Signed-off-by: Fabio Graetz <fabiograetz@googlemail.com> Signed-off-by: Fabio Grätz <fabiogratz@googlemail.com>

Remove outdated warning that recoverable errors are not propagated to…

e656424

… the agent process Signed-off-by: Fabio Graetz <fabiograetz@googlemail.com> Signed-off-by: Fabio Grätz <fabiogratz@googlemail.com>

fg91 force-pushed the fg91/fix/recoverable-elastic-tasks branch from fe619e1 to e656424 Compare October 27, 2023 09:44

fg91 self-assigned this Oct 27, 2023

fg91 added bug Something isn't working and removed bug Something isn't working labels Oct 27, 2023

fg91 marked this pull request as ready for review October 27, 2023 09:51

kumare3 approved these changes Oct 28, 2023

View reviewed changes

fg91 merged commit 8d6f920 into master Oct 28, 2023

troychiu pushed a commit to troychiu/flytekit that referenced this pull request Oct 30, 2023

[WIP] Fix: Allow elastic tasks to be recoverable (flyteorg#1846)

285cdcb

troychiu pushed a commit to troychiu/flytekit that referenced this pull request Oct 30, 2023

[WIP] Fix: Allow elastic tasks to be recoverable (flyteorg#1846)

6ef7fb4

Signed-off-by: troychiu <y.troychiu@gmail.com>

ringohoffman pushed a commit to ringohoffman/flytekit that referenced this pull request Nov 24, 2023

[WIP] Fix: Allow elastic tasks to be recoverable (flyteorg#1846)

1b773ec

fg91 changed the title ~~[WIP] Fix: Allow elastic tasks to be recoverable~~ Fix: Allow elastic tasks to be recoverable Dec 19, 2023

fg91 mentioned this pull request Jul 3, 2024

Add myself to code owners of flytekit-kf-pytorch #2556

Merged

3 tasks

		2. The pods belonging to a Flyte Elastic task write a single `error.pb` into blob storage, causing a
		race condition as one pod might overwrite the error file of another pod.

Uh oh!

Conversation

fg91 commented Sep 22, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

TL;DR

Type

Are all requirements met?

Complete description

Tracking Issue

Follow-up issue

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

codecov Bot commented Sep 22, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

fellhorn commented Sep 27, 2023

Uh oh!

fg91 commented Oct 4, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

fg91 commented Sep 22, 2023 •

edited

Loading

codecov Bot commented Sep 22, 2023 •

edited

Loading