CLI: add --batch-size option to "airflow db clean" #51510

Merged
potiuk merged 3 commits into apache:main from nailo2c:core-47600-add-batch-size-for-db-clean
Jun 9, 2025

@nailo2c nailo2c commented Jun 8, 2025

Contributor

Closes: #47600

Support new argument --batch-size for airflow db clean.

Example output

root@b634cda2186b:/opt/airflow# airflow db clean --help
Usage: airflow db clean [-h] [--batch-size BATCH_SIZE] --clean-before-timestamp CLEAN_BEFORE_TIMESTAMP [--dry-run] [--skip-archive] [-t TABLES] [-v] [-y]

Purge old records in metastore tables

Optional Arguments:
  -h, --help            show this help message and exit
  --batch-size BATCH_SIZE
                        Maximum number of rows to delete or archive in a single transaction.
                        Lower values reduce long-running locks but increase the number of batches.
  --clean-before-timestamp CLEAN_BEFORE_TIMESTAMP
                        The date or timestamp before which data should be purged.
                        ...
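
To illustrate the trade-off described in the help text, here is a minimal sketch of batched deletion using Python's built-in sqlite3 module. It is not the code from this PR: the `demo_log` table and the `delete_in_batches` helper are hypothetical, and the sketch only shows the general idea of committing after every `batch_size` rows so that no single transaction holds locks for the whole purge.

```python
import sqlite3


def delete_in_batches(conn, table, cutoff, batch_size):
    """Delete rows older than `cutoff`, committing after each batch.

    Hypothetical helper for illustration only. Returns the number of
    rows removed by each batch.
    """
    batches = []
    while True:
        cur = conn.execute(
            f"DELETE FROM {table} WHERE rowid IN "
            f"(SELECT rowid FROM {table} WHERE dttm < ? LIMIT ?)",
            (cutoff, batch_size),
        )
        conn.commit()  # one transaction per batch keeps each lock short
        if cur.rowcount == 0:
            break
        batches.append(cur.rowcount)
    return batches


conn = sqlite3.connect(":memory:")
# Scratch table, NOT Airflow's real `log` schema.
conn.execute("CREATE TABLE demo_log (id INTEGER PRIMARY KEY, dttm TEXT)")
conn.executemany(
    "INSERT INTO demo_log (dttm) VALUES (?)",
    [("2024-01-01 00:00:00",)] * 10 + [("2030-01-01 00:00:00",)] * 2,
)
conn.commit()

batches = delete_in_batches(conn, "demo_log", "2025-06-06 00:00:00", 4)
remaining = conn.execute("SELECT COUNT(*) FROM demo_log").fetchone()[0]
print(batches, remaining)
```

With ten old rows and a batch size of 4, the sketch needs three batches, and the two rows newer than the cutoff are left in place. A smaller batch size would shorten each transaction at the cost of more round trips.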

Unit tests

  • test_run_cleanup_batch_size_propagation
  • test_batch_size

▶️ Manual / end‑to‑end check

I verified the flag against three back-ends (PostgreSQL, MySQL, and SQLite), using 100,000 mock rows in each of the log and import_error tables.

PostgreSQL

-- insert mock data into `log`
INSERT INTO log (dttm, dag_id, task_id, event, owner, extra)
SELECT
    now() - interval '500 days' - (gs.id || ' seconds')::interval,
    'batch_test_dag',
    'dummy_task',
    'unit-test',
    'tester',
    NULL
FROM generate_series(1, 100000) AS gs(id);

-- insert mock data into `import_error`
INSERT INTO import_error (timestamp, filename, stacktrace)
SELECT
    now() - interval '500 days' - (gs.id || ' seconds')::interval,
    concat('dummy_file_', gs.id, '.py'),
    'Traceback (most recent call last): …'
FROM generate_series(1, 100000) AS gs(id);

MySQL

SET SESSION cte_max_recursion_depth = 100000;

-- log
INSERT INTO log (dttm, dag_id, task_id, `event`, owner, extra)
WITH RECURSIVE seq(id) AS (
    SELECT 1
    UNION ALL
    SELECT id + 1 FROM seq WHERE id < 100000
)
SELECT
    TIMESTAMPADD(SECOND, -id, TIMESTAMPADD(DAY, -500, NOW())),
    'batch_test_dag',
    'dummy_task',
    'unit-test',
    'tester',
    NULL
FROM seq;

-- import_error
INSERT INTO import_error (`timestamp`, filename, stacktrace)
WITH RECURSIVE seq(id) AS (
    SELECT 1
    UNION ALL
    SELECT id + 1 FROM seq WHERE id < 100000
)
SELECT
    TIMESTAMPADD(SECOND, -id, TIMESTAMPADD(DAY, -500, NOW())),
    CONCAT('dummy_file_', id, '.py'),
    'Traceback (most recent call last): …'
FROM seq;

SQLite

-- log
WITH RECURSIVE seq(id) AS (
    SELECT 1
    UNION ALL
    SELECT id + 1 FROM seq WHERE id < 100000
)
INSERT INTO log (dttm, dag_id, task_id, "event", owner, extra)
SELECT
    datetime('now', '-500 days', printf('-%d seconds', id)),
    'batch_test_dag',
    'dummy_task',
    'unit-test',
    'tester',
    NULL
FROM seq;

-- import_error
WITH RECURSIVE seq(id) AS (
    SELECT 1
    UNION ALL
    SELECT id + 1 FROM seq WHERE id < 100000
)
INSERT INTO import_error ("timestamp", filename, stacktrace)
SELECT
    datetime('now', '-500 days', printf('-%d seconds', id)),
    'dummy_file_' || id || '.py',
    'Traceback (most recent call last): …'
FROM seq;
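
The SQLite recipe above can be tried without touching a real metadata database. The following sketch runs the same recursive-CTE pattern against an in-memory database through Python's sqlite3 module. The `import_error_demo` table is a hypothetical stand-in with only the three inserted columns (not Airflow's real schema), and the row count is reduced to 1,000 to keep the check quick.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical scratch table, NOT Airflow's real import_error schema.
conn.execute(
    'CREATE TABLE import_error_demo ('
    'id INTEGER PRIMARY KEY, "timestamp" TEXT, filename TEXT, stacktrace TEXT)'
)

# Same shape as the SQLite statement above, with a smaller sequence.
conn.execute(
    """
    WITH RECURSIVE seq(id) AS (
        SELECT 1
        UNION ALL
        SELECT id + 1 FROM seq WHERE id < 1000
    )
    INSERT INTO import_error_demo ("timestamp", filename, stacktrace)
    SELECT
        datetime('now', '-500 days', printf('-%d seconds', id)),
        'dummy_file_' || id || '.py',
        'Traceback (most recent call last): ...'
    FROM seq
    """
)
conn.commit()

total = conn.execute("SELECT COUNT(*) FROM import_error_demo").fetchone()[0]

# Every generated row should be older than 499 days, so any cutoff
# more recent than that would select all of them for cleanup.
older = conn.execute(
    "SELECT COUNT(*) FROM import_error_demo "
    "WHERE \"timestamp\" < datetime('now', '-499 days')"
).fetchone()[0]
print(total, older)
```

Each row lands one second further in the past than the previous one, starting roughly 500 days ago, so all generated rows fall before any recent cutoff.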

Both commands below completed successfully on all three back‑ends, with rows fully purged and archives created/dropped as expected:

# test new flag
airflow db clean --clean-before-timestamp '2025-06-06 00:00:00+01:00' -t log,import_error --batch-size 25000

# regenerate data and test default behaviour (single large batch)
airflow db clean --clean-before-timestamp '2025-06-06 00:00:00+01:00' -t log,import_error
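
As a rough sanity check on the first command, the expected number of batches per table can be estimated with simple arithmetic. This sketch assumes that all 100,000 mock rows in a table fall before the cutoff (they are dated about 500 days in the past) and that omitting `--batch-size` processes everything in one batch, as the comment above describes. `expected_batches` is a hypothetical helper, not part of Airflow.

```python
import math


def expected_batches(total_rows, batch_size=None):
    """Estimate how many batches a cleanup of `total_rows` rows needs.

    Hypothetical helper: batch_size=None models the default behaviour
    described in the PR (a single large batch).
    """
    if total_rows == 0:
        return 0
    if batch_size is None:
        return 1
    return math.ceil(total_rows / batch_size)


with_flag = expected_batches(100_000, 25_000)
without_flag = expected_batches(100_000)
print(with_flag, without_flag)
```

Under those assumptions, a batch size of 25,000 gives four batches per table, compared with one when the flag is omitted.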

Docs

I verified that the documentation (cli-and-env-variables-ref.rst) now includes the new argument.

Screenshot 2025-06-07 5:31:55 PM

Please let me know if anything needs to be improved. Many thanks 😄

potiuk commented Jun 9, 2025

Member

Very nice! Thank you!

@potiuk potiuk closed this Jun 9, 2025
@potiuk potiuk reopened this Jun 9, 2025
@potiuk potiuk merged commit cafe913 into apache:main Jun 9, 2025
101 checks passed
potiuk commented Jun 9, 2025

Member

Closed by mistake - reopened and merged :)

Development

Successfully merging this pull request may close these issues.

Allow to set batch size for db clean CLI

2 participants