Skip to content

Add db clean CLI command for purging old data#20838

Merged
dstandish merged 44 commits into
apache:mainfrom
astronomer:maintenance-cli-utility
Feb 26, 2022
Merged

Add db clean CLI command for purging old data#20838
dstandish merged 44 commits into
apache:mainfrom
astronomer:maintenance-cli-utility

Conversation

@dstandish

@dstandish dstandish commented Jan 12, 2022

Copy link
Copy Markdown
Contributor

Must supply "purge before date".
Can optionally provide table list.
Dry run will only print the number of rows meeting criteria.
If not dry run, will require the user to confirm before deleting.

example dry run output:

local ❯ airflow db clean --clean-before-timestamp '2021-10-21 16:23:13+0000'  --tables 'xcom, log, dag_run' --verbose --dry-run
Running dry run for metastore table cleanup.
Data prior to 2021-10-21T16:23:13+00:00 would be purged from tables ['dag_run', 'log', 'xcom'] with the following config:

table   | recency_column     | keep_last | keep_last_filters                     | keep_last_group_by
========+====================+===========+=======================================+===================
dag_run | DagRun.start_date  | True      | ['dag_run.external_trigger IS false'] | DagRun.dag_id
log     | Log.dttm           | False     | None                                  | None
xcom    | BaseXCom.timestamp | False     | None                                  | None


Performing dry run for table 'dag_run'
Found 4569 rows meeting deletion criteria.

Performing dry run for table 'log'
Found 26 rows meeting deletion criteria.

Performing dry run for table 'xcom'
Found 0 rows meeting deletion criteria.

Notes:

I didn't add docs since it appears that CLI docs are mostly automated and the command is pretty intuitive.

One thing that's maybe a bit non-obvious (though very sensible) that I'll highlight here is that for DagRuns the last scheduled dag run is always retained. This is to ensure continuity with with scheduled dag runs.

The other thing that's maybe nonobvious is that we have foreign key relationships and they have on delete cascade built in to the model so this means if your cleanup run deletes a dag run it will also delete all of its associated TIs, even if you didn't ask to cleanup the TI table.

I have toyed with the idea of moving this to the airflow db subcommand for now since it's very much db-specific and we don't have any other maintenance commands at the moment. I we add a lot of maintenancey stuff later we can always move the commands. I welcome opinions on this.

The last thing I should point out is, the verbose mode doesn't do anything right now. Initially I thought we might print out the rows that were deleted (similar to how they are printed to logs in the "maintenance dags" that inspired this effort). But I don't think it's actually that helpful to do and it could be a lot of data which could actually be harmful since it would crowd out other output. So I've left the verbose option but it doesn't do anything. We could leave it, remove it, or add some kind of verbose output. Similarly, I thought we might want to print all the to-be-deleted rows in the dry run, but for the same reason I decided not to do that. Though I've left some print rows logic in there in case we want to enable that.

Comment thread airflow/utils/metastore_cleanup.py Outdated
Comment thread airflow/cli/cli_parser.py Outdated
Comment thread airflow/cli/cli_parser.py Outdated
Comment thread airflow/utils/metastore_cleanup.py Outdated
Comment thread airflow/utils/metastore_cleanup.py Outdated
Comment thread airflow/utils/metastore_cleanup.py Outdated
Comment thread airflow/cli/commands/maintenance_command.py Outdated
Comment thread airflow/utils/metastore_cleanup.py Outdated
@kaxil kaxil added this to the Airflow 2.3.0 milestone Jan 13, 2022
Comment thread airflow/utils/metastore_cleanup.py Outdated
@potiuk

potiuk commented Jan 13, 2022

Copy link
Copy Markdown
Member

Love this @dstandish ❤️

@potiuk

potiuk commented Jan 13, 2022

Copy link
Copy Markdown
Member

We just had discussion that we need it yesterday with @mik-laj and @mhenc :) . Great minds think alike :)

Comment thread airflow/cli/cli_parser.py Outdated

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps you mean "Splits comma-separated string and returns the result in a list"?

Comment thread airflow/utils/metastore_cleanup.py Outdated
@dstandish

Copy link
Copy Markdown
Contributor Author

Okie doke I think this is ready for a look. Had to write a lot of tests. Think I have decent coverage.

I have toyed with the idea of moving this to the airflow db subcommand for now since it's very much db-specific and we don't have any other maintenance commands at the moment. I we add a lot of maintenancey stuff later we can always move the commands. I welcome opinions on this.

@dstandish dstandish force-pushed the maintenance-cli-utility branch from 78bd301 to d1291e8 Compare January 28, 2022 20:41
@dstandish dstandish changed the title Add maintenance cleanup CLI command for purging old data Add db clean CLI command for purging old data Jan 28, 2022
@dstandish

Copy link
Copy Markdown
Contributor Author

renamed airflow db clean

welcome opinions on the naming

@dstandish dstandish force-pushed the maintenance-cli-utility branch from 74bc174 to 9130721 Compare January 31, 2022 19:55
Comment thread airflow/cli/commands/db_command.py Outdated
Comment thread airflow/cli/commands/db_command.py Outdated
Comment thread airflow/utils/metastore_cleanup.py Outdated
Comment on lines 79 to 74

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably easier to understand if this logic is put into sorted(..., key=...) instead. This class does not need to be generally sortable (and the sorting logic isn’t obvious either).

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this reference other arg instead of self twice?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this reference other arg instead of self twice?

ha, yup!

Comment thread airflow/utils/metastore_cleanup.py Outdated
Comment thread airflow/utils/metastore_cleanup.py Outdated
Comment thread airflow/utils/metastore_cleanup.py Outdated
Comment thread airflow/utils/metastore_cleanup.py Outdated
Comment thread airflow/utils/metastore_cleanup.py Outdated
@dstandish dstandish force-pushed the maintenance-cli-utility branch from f8ea2a6 to b2b77a4 Compare February 8, 2022 22:52
Comment thread airflow/utils/metastore_cleanup.py Outdated
@potiuk

potiuk commented Feb 15, 2022

Copy link
Copy Markdown
Member

Static checks :) ?

@dstandish

Copy link
Copy Markdown
Contributor Author

Static checks :) ?

sorry should be fixed now

Comment thread airflow/cli/commands/db_command.py Outdated
Comment thread airflow/utils/metastore_cleanup.py Outdated
@dstandish dstandish force-pushed the maintenance-cli-utility branch from 1aeb431 to b66c7a8 Compare February 23, 2022 00:47
@dstandish dstandish merged commit c75774d into apache:main Feb 26, 2022
@dstandish dstandish deleted the maintenance-cli-utility branch February 26, 2022 05:55
@jedcunningham jedcunningham added the type:new-feature Changelog: New Features label Feb 28, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:CLI full tests needed We need to run full set of tests for this PR to merge type:new-feature Changelog: New Features

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants