Graceful termination does not work with apache chart #13591

Description

@dstandish

In apache/airflow, the Helm chart sets the worker's default terminationGracePeriodSeconds to 600.

After deploying with 1.10.14, I observed that the worker was terminated immediately. This reproduced consistently.

I also tested with 2.0.0, again with no luck.

Does anyone have any hints about what to look into?

Here are some logs from a worker that shut down ungracefully, running 1.10.14:

worker: Warm shutdown (MainProcess)
[2021-01-09 22:24:30,747: ERROR/MainProcess] Process 'ForkPoolWorker-15' pid:37 exited with 'signal 15 (SIGTERM)'
[2021-01-09 22:24:30,858: ERROR/MainProcess] Task handler raised error: WorkerLostError('Worker exited prematurely: signal 15 (SIGTERM).')
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.7/site-packages/celery/worker/worker.py", line 208, in start
    self.blueprint.start(self)
  File "/home/airflow/.local/lib/python3.7/site-packages/celery/bootsteps.py", line 119, in start
    step.start(parent)
  File "/home/airflow/.local/lib/python3.7/site-packages/celery/bootsteps.py", line 369, in start
    return self.obj.start()
  File "/home/airflow/.local/lib/python3.7/site-packages/celery/worker/consumer/consumer.py", line 318, in start
    blueprint.start(self)
  File "/home/airflow/.local/lib/python3.7/site-packages/celery/bootsteps.py", line 119, in start
    step.start(parent)
  File "/home/airflow/.local/lib/python3.7/site-packages/celery/worker/consumer/consumer.py", line 599, in start
    c.loop(*c.loop_args())
  File "/home/airflow/.local/lib/python3.7/site-packages/celery/worker/loops.py", line 83, in asynloop
    next(loop)
  File "/home/airflow/.local/lib/python3.7/site-packages/kombu/asynchronous/hub.py", line 308, in create_loop
    events = poll(poll_timeout)
  File "/home/airflow/.local/lib/python3.7/site-packages/kombu/utils/eventio.py", line 84, in poll
    return self._epoll.poll(timeout if timeout is not None else -1)
  File "/home/airflow/.local/lib/python3.7/site-packages/celery/apps/worker.py", line 285, in _handle_request
    raise exc(exitcode)
celery.exceptions.WorkerShutdown: 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.7/site-packages/billiard/pool.py", line 1267, in mark_as_worker_lost
    human_status(exitcode)),
billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 15 (SIGTERM).
[2021-01-09 22:24:30,865: ERROR/MainProcess] Process 'ForkPoolWorker-16' pid:38 exited with 'signal 15 (SIGTERM)'

 -------------- celery@airflow-worker-66b7bf687b-8j2x5 v4.4.7 (cliffs)
--- ***** -----
-- ******* ---- Linux-4.14.209-160.335.amzn2.x86_64-x86_64-with-debian-10.6 2021-01-09 22:22:38
- *** --- * ---
- ** ---------- [config]
- ** ---------- .> app:         airflow.executors.celery_executor:0x7fc31160fd90
- ** ---------- .> transport:   redis://:**@airflow-redis:6379/0
- ** ---------- .> results:     postgresql://postgres:**@airflow-pgbouncer:6543/airflow-result-backend
- *** --- * --- .> concurrency: 16 (prefork)
-- ******* ---- .> task events: OFF (enable -E to monitor tasks in this worker)
--- ***** -----
 -------------- [queues]
                .> celery           exchange=celery(direct) key=celery


[tasks]
  . airflow.executors.celery_executor.execute_command

And again with 2.0.0:

[2021-01-10 06:46:21,159: INFO/MainProcess] Connected to redis://:**@airflow-redis:6379/0
[2021-01-10 06:46:21,168: INFO/MainProcess] mingle: searching for neighbors
[2021-01-10 06:46:22,208: INFO/MainProcess] mingle: all alone
[2021-01-10 06:46:22,224: INFO/MainProcess] celery@airflow-worker-66b9b6495b-6m7jd ready.
[2021-01-10 06:46:25,199: INFO/MainProcess] Events of group {task} enabled by remote.
[2021-01-10 06:47:47,441: INFO/MainProcess] Received task: airflow.executors.celery_executor.execute_command[dab010cc-72fb-4f73-8c53-05b46ca71848]
[2021-01-10 06:47:47,503: INFO/ForkPoolWorker-7] Executing command in Celery: ['airflow', 'tasks', 'run', 'standish_test_dag', 'test-secrets-backend', '2021-01-10T06:44:35.573723+00:00', '--local', '--pool', 'default_pool', '--subdir', '/opt/airflow/dags/standish_test.py']
[2021-01-10 06:47:47,549: INFO/ForkPoolWorker-7] Filling up the DagBag from /opt/airflow/dags/standish_test.py
[2021-01-10 06:47:47,830: INFO/ForkPoolWorker-7] Loading 1 plugin(s) took 0.26 seconds
[2021-01-10 06:47:47,845: WARNING/ForkPoolWorker-7] Running <TaskInstance: standish_test_dag.test-secrets-backend 2021-01-10T06:44:35.573723+00:00 [queued]> on host 10.5.21.64
[2021-01-10 06:48:17,735] {_internal.py:113} INFO - 10.5.22.61 - - [10/Jan/2021 06:48:17] "GET /log/standish_test_dag/test-secrets-backend/2021-01-10T06:44:35.573723+00:00/1.log HTTP/1.1" 404 -
[2021-01-10 06:48:17,738] {_internal.py:113} INFO - 10.5.22.61 - - [10/Jan/2021 06:48:17] "GET /log/standish_test_dag/test-secrets-backend/2021-01-10T06:44:35.573723+00:00/2.log HTTP/1.1" 200 -

With 2.0.0 there's no error, but the worker is still terminated immediately, without respecting the grace period.

I tried various combinations of args and saw the same behavior every time:

  • ["bash", "-c", "airflow worker"]
  • ["bash", "-c", "exec airflow worker"]
  • ["worker"]
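
One way to compare command forms like these outside the cluster is to time how long a process takes to exit after SIGTERM. The following is a minimal sketch under stated assumptions: `time_sigterm_exit`, `GRACEFUL_CHILD`, and `ABRUPT_CHILD` are hypothetical names invented for illustration, and the two child scripts are stand-ins for a process that drains work on SIGTERM and one that does not. They are not Airflow or Celery code, and the sketch assumes a POSIX system.

```python
import signal
import subprocess
import sys
import time

# Hypothetical stand-in for a process that finishes in-flight work on SIGTERM.
GRACEFUL_CHILD = """
import signal, sys, time
def on_term(signum, frame):
    time.sleep(0.5)  # pretend to drain running work
    sys.exit(0)
signal.signal(signal.SIGTERM, on_term)
print("ready", flush=True)
time.sleep(60)
"""

# Hypothetical stand-in for a process with no SIGTERM handling at all.
ABRUPT_CHILD = """
import time
print("ready", flush=True)
time.sleep(60)
"""


def time_sigterm_exit(argv, ready_marker="ready", timeout=10.0):
    """Start argv, wait for ready_marker on stdout, send SIGTERM,
    and return (seconds_until_exit, returncode)."""
    proc = subprocess.Popen(argv, stdout=subprocess.PIPE, text=True)
    try:
        # Wait until the child reports that it is up (handlers installed).
        for line in proc.stdout:
            if ready_marker in line:
                break
        start = time.monotonic()
        proc.send_signal(signal.SIGTERM)
        returncode = proc.wait(timeout=timeout)
        return time.monotonic() - start, returncode
    finally:
        # Never leave a stray child behind if something went wrong.
        if proc.poll() is None:
            proc.kill()
            proc.wait()
        proc.stdout.close()


if __name__ == "__main__":
    for name, script in (("graceful", GRACEFUL_CHILD), ("abrupt", ABRUPT_CHILD)):
        elapsed, code = time_sigterm_exit([sys.executable, "-c", script])
        print(f"{name}: exited after {elapsed:.2f}s with return code {code}")
```

A negative return code from `subprocess` means the child was killed by that signal number. A near-zero elapsed time with a return code of -15 would therefore match the "terminated immediately" behavior described above, while a delayed exit with code 0 is what a respected shutdown would look like. Passing a different `argv` (for example one wrapped in `bash -c`) would show whether the wrapper changes how the signal is handled.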
