
feat: store hash of dag_version.version_data to avoid loading/comparing large manifests#68635

Open
anmolxlight wants to merge 3 commits into apache:main from anmolxlight:fix-68567-version-data-hash

Conversation

@anmolxlight

Contributor

Summary

Store a hash of dag_version.version_data to avoid loading and comparing the full JSON manifest on every DAG parse.

Problem

SerializedDagModel.write_dag's "serialized hash unchanged" fast path refreshes DagVersion.bundle_version / version_data in place, comparing the full stored version_data against the incoming value:

  1. _prefetch_dag_write_metadata loads the full DagVersion row — including the entire version_data JSON — for every DAG in the bulk write.
  2. The steady-state same-bundle case re-compares the full version_data dict each parse.

Solution

Persist a version_data_hash (md5 of canonical JSON, String(32), nullable) on dag_version and compare/prefetch that instead of the full blob:

  • DagVersion model: new version_data_hash column + compute_version_data_hash() static method
  • _prefetch_dag_write_metadata: uses load_only() to skip loading the version_data JSON column entirely
  • Fast path comparison: compares version_data_hash instead of full dicts
  • In-place refresh: updates version_data_hash when bundle metadata changes
  • New DagVersion rows: version_data_hash is computed on creation
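
As a rough illustration of the "md5 of canonical JSON" idea above, the sketch below hashes a dict after serializing it with sorted keys and compact separators. It is a standalone function, not the method from this PR: the exact canonicalization (sort_keys, separators) and the choice to return None for missing data are assumptions made for the example.

```python
from __future__ import annotations

import hashlib
import json
from typing import Any


def compute_version_data_hash(version_data: dict[str, Any] | None) -> str | None:
    """Return a 32-character md5 hex digest of canonical JSON, or None when there is no data.

    Assumption: "canonical" here means sorted keys and compact separators;
    the PR's real method may canonicalize differently.
    """
    if version_data is None:
        return None
    # Sorted keys and fixed separators make the encoding independent of
    # dict insertion order and whitespace.
    canonical = json.dumps(version_data, sort_keys=True, separators=(",", ":"))
    # md5 is used only as a change-detection fingerprint, not for security.
    return hashlib.md5(canonical.encode("utf-8"), usedforsecurity=False).hexdigest()


# Two dicts with the same content but different key order hash identically.
first = compute_version_data_hash({"files": {"a.py": "abc"}, "rev": 1})
second = compute_version_data_hash({"rev": 1, "files": {"a.py": "abc"}})
changed = compute_version_data_hash({"rev": 2, "files": {"a.py": "abc"}})
```

An md5 hex digest is always 32 characters, which matches the String(32) column width mentioned above.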

Verification

  • All 66 test_serialized_dag tests pass
  • All 8 test_dag_version tests pass
  • All migrations chain correctly from latest 9ff64e1c35d3

Closes: #68567

…ng large manifests

Persist a version_data_hash (md5 of canonical JSON) on DagVersion and
compare/prefetch that instead of the full version_data blob.

Changes:
- Add version_data_hash column (String(32), nullable) to DagVersion model
- Add compute_version_data_hash() static method on DagVersion
- In _prefetch_dag_write_metadata, use load_only() to skip loading the
  potentially-large version_data JSON column
- In write_dag fast path, compare version_data_hash instead of full dicts
- Update in-place refresh and no-TI-update paths to set version_data_hash
- Alembic migration 0123 (rev: 9e8d7c6b5a4f) for the new column
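
The fast-path comparison listed above might look roughly like the following. This is a minimal sketch with hypothetical names (StoredVersion, needs_refresh); it is not the write_dag code. It assumes that a row whose hash is still NULL (the column is nullable) simply compares unequal to any non-null incoming hash, so it gets refreshed once and picks up a hash.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class StoredVersion:
    """Hypothetical stand-in for the prefetched DagVersion columns (no version_data blob)."""

    bundle_version: str | None
    version_data_hash: str | None


def needs_refresh(
    stored: StoredVersion,
    incoming_bundle_version: str | None,
    incoming_hash: str | None,
) -> bool:
    """Decide whether the in-place refresh should run, using only small columns."""
    if stored.bundle_version != incoming_bundle_version:
        return True
    # Comparing two 32-character digests replaces the full dict comparison.
    # A stored hash of None differs from any real digest, so rows without a
    # hash are refreshed once (assumption about the desired backfill behaviour).
    return stored.version_data_hash != incoming_hash


same = StoredVersion(bundle_version="v1", version_data_hash="a" * 32)
legacy = StoredVersion(bundle_version="v1", version_data_hash=None)
```

In the steady-state same-bundle case, `needs_refresh(same, "v1", "a" * 32)` returns False and no version_data is loaded or compared.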

Closes: apache#68567
@github-actions

github-actions Bot commented Jun 17, 2026

Contributor

uv.lock on main just moved via #68710 ("Add Ray constraints for Python 3.14 in Google provider"), commit 9c49080, and this PR currently conflicts with it.

Quickest fix:

git fetch upstream main && git rebase upstream/main
rm uv.lock && uv lock
git add uv.lock && git rebase --continue
git push --force-with-lease

Automated nudge — ignore if you're not ready to rebase. This comment is updated in place on future uv.lock bumps.

…dependencies.sha256sum

- Point 3.3.0 head to new migration revision 9e8d7c6b5a4f
- Fix end-of-file-fixer on generated/provider_dependencies.json.sha256sum
@potiuk added the "ready for maintainer review" label (Set after triaging when all criteria pass.) Jun 22, 2026

Labels

area:DAG-processing
area:db-migrations (PRs with DB migration)
ready for maintainer review (Set after triaging when all criteria pass.)


Development

Successfully merging this pull request may close these issues.

Store a hash of dag_version.version_data to avoid loading/comparing large manifests on parse

2 participants