We are using deferred operators to execute jobs in databricks. These jobs utlize a common database so we use task pools to limit the concurrency to 1 task. This pool includes deferred operators. In some cases we see task timeouts, even though the deferred task successfully finished. You can see 1.5h passing between trigger event and scheduling:
[2024-12-06, 14:01:10 CET] {{taskinstance.py:2344}} INFO - Pausing task as DEFERRED. dag_id=my-dag, task_id=my-task, execution_date=20241205T130000, start_date=20241206T130108
[2024-12-06, 14:01:10 CET] {{local_task_job_runner.py:231}} INFO - Task exited with return code 100 (task deferral)
[2024-12-06, 14:01:11 CET] {{base.py:83}} INFO - Using connection ID 'databricks' for task execution.
[2024-12-06, 14:01:11 CET] {{databricks.py:94}} INFO - run-id 847717920033451 in run state {'life_cycle_state': 'PENDING', 'result_state': '', 'state_message': 'Waiting for cluster'}. sleeping for 30 seconds
...
[2024-12-06, 14:09:42 CET] {{databricks.py:94}} INFO - run-id 847717920033451 in run state {'life_cycle_state': 'RUNNING', 'result_state': '', 'state_message': 'In run'}. sleeping for 30 seconds
[2024-12-06, 14:10:12 CET] {{databricks.py:94}} INFO - run-id 847717920033451 in run state {'life_cycle_state': 'RUNNING', 'result_state': '', 'state_message': 'In run'}. sleeping for 30 seconds
[2024-12-06, 14:10:42 CET] {{triggerer_job_runner.py:602}} INFO - Trigger my-dag/scheduled__2024-12-05T13:00:00+00:00/my-task/-1/1 (ID 10030) fired: TriggerEvent<{'run_id': 847717920033451, 'run_page_url': '...', 'run_state': '{"life_cycle_state": "TERMINATED", "result_state": "SUCCESS", "state_message": ""}'}>
[2024-12-06, 15:38:27 CET] {{taskinstance.py:1956}} INFO - Dependencies all met for dep_context=non-requeueable deps ti=<TaskInstance: my-dag.my-task scheduled__2024-12-05T13:00:00+00:00 [queued]>
...
[2024-12-06, 15:38:27 CET] {{taskinstance.py:2698}} ERROR - Task failed with exception
Traceback (most recent call last):
File "/usr/local/airflow/.local/lib/python3.11/site-packages/airflow/models/taskinstance.py", line 425, in _execute_task
raise AirflowTaskTimeout()
airflow.exceptions.AirflowTaskTimeout
Apache Airflow version
Other Airflow 2 version (please specify below)
If "Other Airflow 2 version" selected, which one?
2.8.1
What happened?
We are using deferred operators to execute jobs in databricks. These jobs utlize a common database so we use task pools to limit the concurrency to 1 task. This pool includes deferred operators. In some cases we see task timeouts, even though the deferred task successfully finished. You can see 1.5h passing between trigger event and scheduling:
Our assumption of what happens in the following:
What you think should happen instead?
I see multiple things that could improve this behavior:
How to reproduce
Include Deferred.Operating System
Amazon Linux 2
Versions of Apache Airflow Providers
No response
Deployment
Amazon (AWS) MWAA
Deployment details
No response
Anything else?
No response
Are you willing to submit PR?
Code of Conduct