Skip to content

Reduce DB load incurred by Stale DAG deactivation#21399

Merged
ephraimbuddy merged 5 commits into
apache:mainfrom
SamWheating:reduce-overhead-of-stale-dag-deactivation
Mar 20, 2022
Merged

Reduce DB load incurred by Stale DAG deactivation#21399
ephraimbuddy merged 5 commits into
apache:mainfrom
SamWheating:reduce-overhead-of-stale-dag-deactivation

Conversation

@SamWheating

@SamWheating SamWheating commented Feb 7, 2022

Copy link
Copy Markdown
Contributor

Re: #21397

By moving this logic into the DagFileProcessorManager and running it across all processed file periodically, we can prevent the use of un-indexed queries.

The basic logic is that we can look at the last processed time of a file (for a given processor) and compare that to the last_parsed_time of an entry in the dag table. If the file has been processed significantly more recently than the DAG has been updated, then its safe to assume that the DAG is missing and can be marked inactive.

Todo:

  • Improve test coverage
  • Exposed new tuneable parameters in the config

@boring-cyborg boring-cyborg Bot added the area:Scheduler including HA (high availability) scheduler label Feb 7, 2022
Comment thread airflow/dag_processing/manager.py Outdated

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels like the wrong timeout to use -- processor timeout is how long each file should take to process:

# How long before timing out a DagFileProcessor, which processes a dag file
dag_file_processor_timeout = 50

But that doesn't mean that every dag file should be "reparsed" every 50 seconds

@SamWheating SamWheating Feb 8, 2022

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So there's actually a reason for this.

We're comparing the parse time as reported by the processor manager to the last_parsed_time as seen in the DAG table, however these values are taken independently:

DagModel.last_parsed_time is decided here, when the DAG is written to the DB:

orm_dag.last_parsed_time = timezone.utcnow()

whereas the DagParsingStat.last_finish_time is decided when the file processor finishes:
https://github.com/apache/airflow/blob/dbe723da95143f6d33e5d2594bc2017c4164e687/airflow/dag_processing/manager.py#L915

So because of this, DagParsingStat.last_finish_time is always going to be slightly later than DagModel.last_parsed_time (typically on the order of milliseconds). Thus in order to be certain that the file was processed more recently than the DAG was last observed we can't directly compare the two timestamps and instead have to do something like:

DagParsingStat.last_finish_time > (SOME_BUFFER + DagModel.last_parsed_time)

I chose to use the processor_timeout here because it represents the absolute upper bound on the difference between DagParsingStat.last_finish_time and DagModel.last_parsed_time, and thus we favour false negatives (not deactivating a DAG which is actually gone) over false positives (incorrectly deactivating a DAG because the file processor was blocking for a few seconds after updating the DB)

Let me know what you think - from my testing in breeze this approach appears to work reliably, but it also adds a lot of complexity.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ohhhhh! Right yeah that makes sense. Could you try and distil some of this down to a short comment?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, will do (probably won't have time to clean up this PR until next week though)

Comment thread tests/dag_processing/test_manager.py Outdated

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Setting this value to max caused issues due to the following line of code, which led to an overflow:

                    and (dag.last_parsed_time + self._processor_timeout) < last_parsed[dag.fileloc]

@SamWheating SamWheating changed the title (WIP) Reduce DB load incurred by Stale DAG deactivation Reduce DB load incurred by Stale DAG deactivation Feb 15, 2022
@SamWheating SamWheating force-pushed the reduce-overhead-of-stale-dag-deactivation branch from 3efe1a3 to 2fef453 Compare February 15, 2022 00:25
@SamWheating

Copy link
Copy Markdown
Contributor Author

OK, this is now ready for a proper review - I will patch this into our production 2.2.2 container sometime this week and confirm that it fixes the original performance issue while still managing to clean up stale DAGs.

@ashb

ashb commented Feb 21, 2022

Copy link
Copy Markdown
Member

@SamWheating Did you manage to get this running in prod?

@SamWheating

Copy link
Copy Markdown
Contributor Author

Not yet, I've built a patched version of 2.2.2 with this change but haven't had a chance to roll it out in any large-scale environments.

Will do it tomorrow and report back wednesday.

@SamWheating

Copy link
Copy Markdown
Contributor Author

Ok, I have created a patched version of Airflow 2.2.2 with this change and deployed it in our prod-scale staging environment (Airflow 2.2.2). I can confirm that:

  • DB CPU utilization and Queries/second is approximately the same before and after the change
  • DAGs are correctly cleaned up after being removed from a file (this takes longer than it did with the previous change, but its eventually consistent)

@potiuk potiuk left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks really cool and I think it might handle a lot of stability issues resulting from some synchronisation solutions that cause some intermitttent instabilities of the filesystem and some dynamic dag generation scenarios.

I think it needs a few more eyes though.

@github-actions

Copy link
Copy Markdown
Contributor

The PR most likely needs to run full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest main at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.

@github-actions github-actions Bot added the full tests needed We need to run full set of tests for this PR to merge label Feb 26, 2022
@ashb

ashb commented Feb 28, 2022

Copy link
Copy Markdown
Member

@jedcunningham I've marked this for possible inclusion in 2.2.5

@ashb ashb added this to the Airflow 2.2.5 milestone Feb 28, 2022
@ashb ashb force-pushed the reduce-overhead-of-stale-dag-deactivation branch from 2fef453 to 2aac5c5 Compare February 28, 2022 13:37
@ashb ashb force-pushed the reduce-overhead-of-stale-dag-deactivation branch from 2aac5c5 to aef9f8c Compare March 11, 2022 22:10
@ephraimbuddy ephraimbuddy added the type:bug-fix Changelog: Bug Fixes label Mar 20, 2022
@ephraimbuddy ephraimbuddy merged commit f309ea7 into apache:main Mar 20, 2022
ephraimbuddy pushed a commit that referenced this pull request Mar 22, 2022
Deactivating stale DAGs periodically in bulk

By moving this logic into the DagFileProcessorManager and running it across all processed file periodically, we can prevent the use of un-indexed queries.

The basic logic is that we can look at the last processed time of a file (for a given processor) and compare that to the last_parsed_time of an entry in the dag table. If the file has been processed significantly more recently than the DAG has been updated, then its safe to assume that the DAG is missing and can be marked inactive.

(cherry picked from commit f309ea7)
ephraimbuddy pushed a commit that referenced this pull request Mar 22, 2022
Deactivating stale DAGs periodically in bulk

By moving this logic into the DagFileProcessorManager and running it across all processed file periodically, we can prevent the use of un-indexed queries.

The basic logic is that we can look at the last processed time of a file (for a given processor) and compare that to the last_parsed_time of an entry in the dag table. If the file has been processed significantly more recently than the DAG has been updated, then its safe to assume that the DAG is missing and can be marked inactive.

(cherry picked from commit f309ea7)
ephraimbuddy pushed a commit that referenced this pull request Mar 22, 2022
Deactivating stale DAGs periodically in bulk

By moving this logic into the DagFileProcessorManager and running it across all processed file periodically, we can prevent the use of un-indexed queries.

The basic logic is that we can look at the last processed time of a file (for a given processor) and compare that to the last_parsed_time of an entry in the dag table. If the file has been processed significantly more recently than the DAG has been updated, then its safe to assume that the DAG is missing and can be marked inactive.

(cherry picked from commit f309ea7)
ephraimbuddy pushed a commit that referenced this pull request Mar 23, 2022
Deactivating stale DAGs periodically in bulk

By moving this logic into the DagFileProcessorManager and running it across all processed file periodically, we can prevent the use of un-indexed queries.

The basic logic is that we can look at the last processed time of a file (for a given processor) and compare that to the last_parsed_time of an entry in the dag table. If the file has been processed significantly more recently than the DAG has been updated, then its safe to assume that the DAG is missing and can be marked inactive.

(cherry picked from commit f309ea7)
ephraimbuddy pushed a commit that referenced this pull request Mar 24, 2022
Deactivating stale DAGs periodically in bulk

By moving this logic into the DagFileProcessorManager and running it across all processed file periodically, we can prevent the use of un-indexed queries.

The basic logic is that we can look at the last processed time of a file (for a given processor) and compare that to the last_parsed_time of an entry in the dag table. If the file has been processed significantly more recently than the DAG has been updated, then its safe to assume that the DAG is missing and can be marked inactive.

(cherry picked from commit f309ea7)
ephraimbuddy pushed a commit that referenced this pull request Mar 26, 2022
Deactivating stale DAGs periodically in bulk

By moving this logic into the DagFileProcessorManager and running it across all processed file periodically, we can prevent the use of un-indexed queries.

The basic logic is that we can look at the last processed time of a file (for a given processor) and compare that to the last_parsed_time of an entry in the dag table. If the file has been processed significantly more recently than the DAG has been updated, then its safe to assume that the DAG is missing and can be marked inactive.

(cherry picked from commit f309ea7)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:Scheduler including HA (high availability) scheduler full tests needed We need to run full set of tests for this PR to merge type:bug-fix Changelog: Bug Fixes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants