Reduce DB load incurred by Stale DAG deactivation by SamWheating · Pull Request #21399 · apache/airflow

SamWheating · 2022-02-07T19:20:40Z

By moving this logic into the DagFileProcessorManager and running it periodically across all processed files, we can avoid the use of un-indexed queries.

The basic logic is that we can look at the last processed time of a file (for a given processor) and compare that to the last_parsed_time of an entry in the dag table. If the file has been processed significantly more recently than the DAG has been updated, then it's safe to assume that the DAG is missing and can be marked inactive.

Todo:

Improve test coverage
Expose new tunable parameters in the config

ashb · 2022-02-08T10:00:11Z

This feels like the wrong timeout to use -- processor timeout is how long each file should take to process:

# How long before timing out a DagFileProcessor, which processes a dag file
dag_file_processor_timeout = 50

But that doesn't mean that every dag file should be "reparsed" every 50 seconds

So there's actually a reason for this.

We're comparing the parse time as reported by the processor manager to the last_parsed_time as seen in the DAG table; however, these values are taken independently:

DagModel.last_parsed_time is decided here, when the DAG is written to the DB:

airflow/airflow/models/dag.py

Line 2427 in 960f573

orm_dag.last_parsed_time = timezone.utcnow()

whereas the DagParsingStat.last_finish_time is decided when the file processor finishes:
https://github.com/apache/airflow/blob/dbe723da95143f6d33e5d2594bc2017c4164e687/airflow/dag_processing/manager.py#L915

Because of this, DagParsingStat.last_finish_time is always going to be slightly later than DagModel.last_parsed_time (typically on the order of milliseconds). Thus, in order to be certain that the file was processed more recently than the DAG was last observed, we can't directly compare the two timestamps and instead have to do something like:

DagParsingStat.last_finish_time > (SOME_BUFFER + DagModel.last_parsed_time)

I chose to use the processor_timeout here because it represents the absolute upper bound on the difference between DagParsingStat.last_finish_time and DagModel.last_parsed_time. That way we favour false negatives (not deactivating a DAG which is actually gone) over false positives (incorrectly deactivating a DAG because the file processor was blocking for a few seconds after updating the DB).
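To make the comparison concrete, here is a minimal sketch of the buffered check in plain Python. It is not the code from this PR: `DagRow`, `find_stale_dag_ids` and the sample paths are hypothetical stand-ins, and the 50 second buffer simply reuses the dag_file_processor_timeout value quoted above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class DagRow:
    """Hypothetical stand-in for one row of the dag table."""

    dag_id: str
    fileloc: str
    last_parsed_time: datetime


def find_stale_dag_ids(dag_rows, last_finish_time_by_file, buffer):
    """Return ids of DAGs whose file finished processing more than
    `buffer` after the DAG row was last written."""
    stale = []
    for row in dag_rows:
        finished = last_finish_time_by_file.get(row.fileloc)
        if finished is None:
            # The file has not been processed yet, so nothing can be concluded.
            continue
        # Same shape as: last_finish_time > (SOME_BUFFER + last_parsed_time)
        if row.last_parsed_time + buffer < finished:
            stale.append(row.dag_id)
    return stale


now = datetime(2022, 2, 8, 12, 0, 0, tzinfo=timezone.utc)
rows = [
    # Written to the DB a few milliseconds before the processor finished.
    DagRow("still_present", "/dags/a.py", now - timedelta(milliseconds=5)),
    # Not written for ten minutes although the file was processed just now.
    DagRow("removed_from_file", "/dags/a.py", now - timedelta(minutes=10)),
]
stale_ids = find_stale_dag_ids(rows, {"/dags/a.py": now}, timedelta(seconds=50))
print(stale_ids)
```

With a zero buffer the first row would also be flagged, because its last_parsed_time is a few milliseconds older than the finish time. That is the false positive the buffer is meant to rule out.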

Let me know what you think - from my testing in breeze this approach appears to work reliably, but it also adds a lot of complexity.

Ohhhhh! Right yeah that makes sense. Could you try and distil some of this down to a short comment?

Yup, will do (probably won't have time to clean up this PR until next week though)

SamWheating · 2022-02-15T00:20:06Z

Setting this value to max caused issues due to the following line of code, which led to an overflow:

and (dag.last_parsed_time + self._processor_timeout) < last_parsed[dag.fileloc]
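The thread does not say which "max" value was used, so the following is only an illustration under the assumption that the timeout was a very large timedelta such as `timedelta.max`. Adding that to any realistic datetime goes past the largest representable date and raises OverflowError. The `is_stale` helper is hypothetical and only mirrors the shape of the expression above.

```python
from datetime import datetime, timedelta, timezone


def is_stale(last_parsed_time, processor_timeout, last_finish_time):
    # Mirrors: (dag.last_parsed_time + self._processor_timeout) < last_parsed[dag.fileloc]
    return (last_parsed_time + processor_timeout) < last_finish_time


parsed_at = datetime(2022, 2, 15, tzinfo=timezone.utc)

try:
    # Assumed "max" value; the sum exceeds datetime.max.
    is_stale(parsed_at, timedelta.max, parsed_at)
    overflowed = False
except OverflowError:
    overflowed = True

# A large but representable timeout does not overflow.
safe_result = is_stale(parsed_at, timedelta(days=365), parsed_at)
print(overflowed, safe_result)
```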

SamWheating · 2022-02-15T00:27:22Z

OK, this is now ready for a proper review - I will patch this into our production 2.2.2 container sometime this week and confirm that it fixes the original performance issue while still managing to clean up stale DAGs.

ashb · 2022-02-21T15:26:51Z

@SamWheating Did you manage to get this running in prod?

SamWheating · 2022-02-21T16:39:46Z

Not yet, I've built a patched version of 2.2.2 with this change but haven't had a chance to roll it out in any large-scale environments.

Will do it tomorrow and report back Wednesday.

SamWheating · 2022-02-22T23:08:46Z

Ok, I have created a patched version of Airflow 2.2.2 with this change and deployed it in our prod-scale staging environment (Airflow 2.2.2). I can confirm that:

DB CPU utilization and queries/second are approximately the same before and after the change
DAGs are correctly cleaned up after being removed from a file (this takes longer than it did with the previous change, but it's eventually consistent)

potiuk

It looks really cool, and I think it might handle a lot of stability issues resulting from synchronisation solutions that cause intermittent filesystem instability, as well as some dynamic DAG generation scenarios.

I think it needs a few more eyes though.

github-actions · 2022-02-26T19:34:59Z

The PR most likely needs to run full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest main at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.

ashb · 2022-02-28T13:36:16Z

@jedcunningham I've marked this for possible inclusion in 2.2.5

Deactivating stale DAGs periodically in bulk

By moving this logic into the DagFileProcessorManager and running it periodically across all processed files, we can avoid the use of un-indexed queries.

The basic logic is that we can look at the last processed time of a file (for a given processor) and compare that to the last_parsed_time of an entry in the dag table. If the file has been processed significantly more recently than the DAG has been updated, then it's safe to assume that the DAG is missing and can be marked inactive.

(cherry picked from commit f309ea7)

SamWheating requested review from ephraimbuddy and jedcunningham as code owners February 7, 2022 19:20

boring-cyborg Bot added the area:Scheduler including HA (high availability) scheduler label Feb 7, 2022

ashb reviewed Feb 8, 2022

SamWheating commented Feb 15, 2022

SamWheating changed the title ~~(WIP) Reduce DB load incurred by Stale DAG deactivation~~ Reduce DB load incurred by Stale DAG deactivation Feb 15, 2022

SamWheating force-pushed the reduce-overhead-of-stale-dag-deactivation branch from 3efe1a3 to 2fef453 Compare February 15, 2022 00:25

SamWheating mentioned this pull request Feb 23, 2022

Stale DAG Deactivation in DAG Processor is extremely hard on the database in environments with many DAGs #21397

Closed

2 tasks

potiuk approved these changes Feb 26, 2022

github-actions Bot added the full tests needed We need to run full set of tests for this PR to merge label Feb 26, 2022

ashb added this to the Airflow 2.2.5 milestone Feb 28, 2022

ashb approved these changes Feb 28, 2022

ashb force-pushed the reduce-overhead-of-stale-dag-deactivation branch from 2fef453 to 2aac5c5 Compare February 28, 2022 13:37

SamWheating added 5 commits March 11, 2022 22:10

Deactivating stale DAGs periodically in bulk

25008c8

Adding tests

717a586

Fixing test

920d7c8

Make deactivate_stale_dags_interval configurable

09b1de7

Comments

aef9f8c

ashb force-pushed the reduce-overhead-of-stale-dag-deactivation branch from 2aac5c5 to aef9f8c Compare March 11, 2022 22:10

ephraimbuddy added the type:bug-fix Changelog: Bug Fixes label Mar 20, 2022

ephraimbuddy merged commit f309ea7 into apache:main Mar 20, 2022

ephraimbuddy mentioned this pull request Mar 27, 2022

Status of testing of Apache Airflow 2.2.5rc3 #22549

Closed

36 tasks

mhenc mentioned this pull request Dec 16, 2022

AIP-44 Migrate DagFileProcessorManager._deactivate_stale_dags to Internal API #28270

Closed
