Fix celery executor bug trying to call len on map#14883
Conversation
There was a problem hiding this comment.
can you please add a test -- similar to the one you mentioned in the PR description
There was a problem hiding this comment.
@kaxil Done. One of these days I'll have a PR that includes a test ;)
|
The Workflow run is cancelling this PR. It has some failed jobs matching ^Pylint$,^Static checks,^Build docs$,^Spell check docs$,^Provider packages,^Checks: Helm tests$,^Test OpenAPI*. |
There was a problem hiding this comment.
Do you need to set logging to debug for this test to cover the new code?
There was a problem hiding this comment.
Good call.
I refactored the tests for the possible iterations of get_many to set the log mode to DEBUG and updated the debug statement to account for being a map object vs being a list object. If it's a map object and the backend is not a base backend, the length of async_results will always be 0 because it comes in as an iterator and _tasks_list_to_task_ids converts it to a set.
There was a problem hiding this comment.
| if isinstance(async_results, map): | |
| self.log.debug("Fetched state for %d task(s)", len(result)) | |
| else: | |
| self.log.debug("Fetched %d state(s) for %d task(s)", len(result), len(async_results)) | |
| self.log.debug("Fetched %d state(s) for %d task(s)", len(result), len(async_results)) |
I think this should be enough, as you have already changed the type on L552.
There was a problem hiding this comment.
Only if it's not a keyvalue or database celery backend. If it's a keyvalue or database backend, _tasks_list_to_task_ids is called in _get_many_from_kv_backend and _get_many_from_db_backend, respectively. In these cases, async_results will be length 0 if converted to a list before calling _get_many_from_kv_backend (which I think would be kinda redundant).
There was a problem hiding this comment.
| async_results = list(async_results) if isinstance(async_results, map) else async_results | |
| if isinstance(async_results, map): | |
| async_results = list(async_results) |
There was a problem hiding this comment.
The other possible fix is to change the one call that passes a map (L478) to be list(map(...))
There was a problem hiding this comment.
That's probably the most intuitive fix, but in the case of the recommended backends (i.e. BaseKeyValueStoreBackend or DatabaseBackend, we'd be converting it to a list and then immediately creating a set based off that list instead of just creating the set based on the map object. Do you think that could have a potential performance hit if a lot of tasks are running?
There was a problem hiding this comment.
@RNHTTR I think most of the cases are not large enough for this to make much of a difference - the only map case is in code that runs once every 30s, and is not performance critical, so for ease of understanding, lets change the one usage instead of complicating this case.
There was a problem hiding this comment.
Sounds good. PR updated. Any resources you recommend on diving into some of airflow's trickier topics? I can't seem to find much on task adoption for example. We run managed airflow, so a lot of airflow's challenging aspects are abstracted away.
There was a problem hiding this comment.
Task Adoption is when running a scheduler, if it dies another scheduler is able to "adopt" tasks started by the first Scheduler/executor -- this currently works for Celery, Kubernetes (and CeleryKubernete) executor
Code:
Scheduler:
airflow/airflow/jobs/scheduler_job.py
Line 1825 in 4d1b2e9
Celery:
airflow/airflow/executors/celery_executor.py
Lines 448 to 503 in 4d1b2e9
There was a problem hiding this comment.
Thanks Kaxil that's helpful. Do you think it'd be a good idea to put together a GLOSSARY.rst or similar to capture some of these more nuanced concepts?
|
@RNHTTR I've just taken another detailed look -- and I think there may be a simpler fix. |
Co-authored-by: RNHTTR <ryan@wiftapp.com> (cherry picked from commit 4ee4429)
Co-authored-by: RNHTTR <ryan@wiftapp.com> (cherry picked from commit 4ee4429)
|
@ashb @hbrls @RNHTTR I am getting similar issue when trying to start airflow webserver with version 2.0.2. File "/usr/local/bin/airflow", line 8, in |
|
@jayitapanda17 2.0.2 is two years old now, and the stack trace is in the openid library by the looks of things, so we can't really help unless your can upgrade. Sorry! |
When the celery executor tries to adopt task instances, and there are indeed task instances to adopt,
bulk_state_fetcher.get_manyis called, passing a map object. If the celeryresult_backendis not an instance of BaseKeyValueStoreBackend or DatabaseBackend, the method_get_many_using_multiprocessingwill be called. This method attempts to get the len of its parameter, but you can't take the length of a map object. So, it needs to be converted to a list (or perhaps another iterable?) first. Since there is a debug statement that first takes the len before_get_many_using_multiprocessingis called, it makes sense to convert the map object to a list immediately prior even if it wasn't handled in theelseblock.I wasn't able to actually reproduce the issue with Airflow, but I was able to reproduce it with the test_try_adopt_task_instances test by setting the celery result backend to
rpc://and the broker to redis.closes: #14163