Only mark SchedulerJobs as failed, not any jobs by jedcunningham · Pull Request #19375 · apache/airflow

jedcunningham · 2021-11-02T23:00:36Z

In adopt_or_reset_orphaned_tasks, we mark as failed any SchedulerJobs that
have exceeded scheduler_health_check_threshold. However, a missing
condition allowed that timeout to apply to all jobs, not just SchedulerJobs.
This is because the polymorphic identity isn't included for update():
https://docs.sqlalchemy.org/en/13/orm/query.html#sqlalchemy.orm.query.Query.update

So any running LocalTaskJobs that, for whatever reason, weren't
heartbeating faster than scheduler_health_check_threshold had their state
set to failed, and they subsequently exited with a log line similar to:

State of this instance has been externally set to scheduled. Terminating instance.

Note that the state it is set to can differ (e.g. queued or
up_for_retry), depending simply on how quickly the scheduler has
progressed that task_instance again.

Closes: #16881
Closes: #16573
Related: #16023 (comment)
Might also fix #19277

kaxil

I can confirm it was running the following query before:

[2021-11-02 23:37:09,230] {base.py:727} INFO - BEGIN (implicit)
[2021-11-02 23:37:09,231] {base.py:1234} INFO - UPDATE job SET state=%(state)s WHERE job.state = %(state_1)s AND job.latest_heartbeat < %(latest_heartbeat_1)s
[2021-11-02 23:37:09,231] {base.py:1239} INFO - {'state': <TaskInstanceState.FAILED: 'failed'>, 'state_1': <TaskInstanceState.RUNNING: 'running'>, 'latest_heartbeat_1': datetime.datetime(2021, 11, 2, 23, 36, 59, 213724, tzinfo=Timezone('UTC'))}

and now runs:

[2021-11-02 23:39:30,548] {base.py:1234} INFO - UPDATE job SET state=%(state)s WHERE job.job_type = %(job_type_1)s AND job.state = %(state_1)s AND job.latest_heartbeat < %(latest_heartbeat_1)s
[2021-11-02 23:39:30,548] {base.py:1239} INFO - {'state': <TaskInstanceState.FAILED: 'failed'>, 'job_type_1': 'SchedulerJob', 'state_1': <TaskInstanceState.RUNNING: 'running'>, 'latest_heartbeat_1': datetime.datetime(2021, 11, 2, 23, 39, 20, 547621, tzinfo=Timezone('UTC'))}

I have also tested it locally
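
To make the difference between the two logged statements concrete, here is a minimal sketch using Python's standard-library sqlite3 module. It is not Airflow code: the table is a hypothetical, stripped-down stand-in for the `job` table, the rows and the cutoff timestamp are made up for illustration, and timestamps are stored as ISO-formatted strings so that plain string comparison orders them correctly.

```python
import sqlite3

# Hypothetical, simplified stand-in for the `job` table; only the columns
# that appear in the logged UPDATE statements are modelled.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE job ("
    "id INTEGER PRIMARY KEY, job_type TEXT, state TEXT, latest_heartbeat TEXT)"
)

# Made-up cutoff: jobs whose last heartbeat is older than this are "stale".
CUTOFF = "2021-11-02 23:39:20"


def seed():
    """Reset the table to three running jobs (illustrative data only)."""
    conn.execute("DELETE FROM job")
    conn.executemany(
        "INSERT INTO job (id, job_type, state, latest_heartbeat) VALUES (?, ?, ?, ?)",
        [
            (1, "SchedulerJob", "running", "2021-11-02 23:30:00"),  # stale scheduler
            (2, "LocalTaskJob", "running", "2021-11-02 23:30:00"),  # slow task job
            (3, "SchedulerJob", "running", "2021-11-02 23:39:25"),  # healthy scheduler
        ],
    )


def states():
    """Return {job id: state} for every row."""
    return dict(conn.execute("SELECT id, state FROM job ORDER BY id"))


# Shape of the statement before the fix: no job_type condition.
seed()
conn.execute(
    "UPDATE job SET state = ? WHERE state = ? AND latest_heartbeat < ?",
    ("failed", "running", CUTOFF),
)
before = states()

# Shape of the statement after the fix: restricted to SchedulerJob rows.
seed()
conn.execute(
    "UPDATE job SET state = ? "
    "WHERE job_type = ? AND state = ? AND latest_heartbeat < ?",
    ("failed", "SchedulerJob", "running", CUTOFF),
)
after = states()

print("without job_type filter:", before)
print("with job_type filter:   ", after)
```

In this toy data set the unfiltered statement also marks the LocalTaskJob row (id 2) as failed, while the filtered statement only touches the stale SchedulerJob row (id 1).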

github-actions · 2021-11-02T23:42:32Z

The PR most likely needs to run full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest main at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.

kaxil · 2021-11-02T23:44:44Z

Good find - I think @ephraimbuddy and I had stumbled upon it sometime back but weren't sure that Polymorphic identity didn't apply to update statements - TIL.

ephraimbuddy · 2021-11-03T06:43:57Z

> Good find - I think @ephraimbuddy and I had stumbled upon it sometime back but weren't sure that Polymorphic identity didn't apply to update statements - TIL.

Yeah! Yesterday, during our debugging session with Collin, it came up again. Good that it's now being fixed. It's likely related to the many SIGTERMs everyone is complaining about.

potiuk

Nice!

collinmcnulty · 2021-11-03T13:52:14Z

Hallelujah, great fix

jedcunningham · 2021-11-03T14:13:04Z

Yeah, shoutout to @ephraimbuddy for spotting this 🎉

ashb · 2021-11-03T14:16:11Z

🤦

The description above is repeated in four commit messages with these trailers: one with (cherry picked from commit 38d329b); one with (cherry picked from commit 38d329b) (cherry picked from commit fa0b998); and two with (cherry picked from commit 38d329b) (cherry picked from commit fa0b998) (cherry picked from commit 2071544).

jedcunningham requested review from XD-DENG, ashb and kaxil as code owners November 2, 2021 23:00

boring-cyborg Bot added the area:Scheduler including HA (high availability) scheduler label Nov 2, 2021

jedcunningham mentioned this pull request Nov 2, 2021

Airflow Scheduler may set running task instance state into None state in multiple scheduler deployment #19277

Closed

2 tasks

jedcunningham added this to the Airflow 2.2.2 milestone Nov 2, 2021

jedcunningham requested a review from ephraimbuddy November 2, 2021 23:11

kaxil approved these changes Nov 2, 2021

github-actions Bot added the full tests needed We need to run full set of tests for this PR to merge label Nov 2, 2021

kaxil added the type:bug-fix Changelog: Bug Fixes label Nov 2, 2021

kaxil closed this Nov 3, 2021

kaxil reopened this Nov 3, 2021

mik-laj approved these changes Nov 3, 2021

ephraimbuddy approved these changes Nov 3, 2021

ephraimbuddy merged commit 38d329b into apache:main Nov 3, 2021

ephraimbuddy deleted the fix_failed_schedulerjob_query branch November 3, 2021 06:45

potiuk reviewed Nov 3, 2021

This was referenced Nov 10, 2021

Status of testing of Apache Airflow 2.2.2rc1 #19515

Closed

Status of testing of Apache Airflow 2.2.2rc2 #19558

Closed

ephraimbuddy mentioned this pull request Nov 23, 2021

Running tasks marked as 'orphaned' and killed by scheduler #16023

Closed

andrewrjones mentioned this pull request Dec 17, 2021

Tasks can be stuck in running state indefinitely #12103

Closed

ephraimbuddy merged 1 commit into apache:main from astronomer:fix_failed_schedulerjob_query
