feat: store hash of dag_version.version_data to avoid loading/comparing large manifests #68635
Open
anmolxlight wants to merge 3 commits into
…ng large manifests

Persist a version_data_hash (md5 of canonical JSON) on DagVersion and compare/prefetch that instead of the full version_data blob.

Changes:
- Add version_data_hash column (String(32), nullable) to DagVersion model
- Add compute_version_data_hash() static method on DagVersion
- In _prefetch_dag_write_metadata, use load_only() to skip loading the potentially-large version_data JSON column
- In write_dag fast path, compare version_data_hash instead of full dicts
- Update in-place refresh and no-TI-update paths to set version_data_hash
- Alembic migration 0123 (rev: 9e8d7c6b5a4f) for the new column

Closes: apache#68567
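For readers following along, here is a minimal sketch of what an "md5 of canonical JSON" helper could look like. It is not the code from this PR: the exact canonicalization (sorted keys, compact separators) and the handling of a missing `version_data` are assumptions made for illustration, and the function is written as a free function rather than a static method on `DagVersion`.

```python
from __future__ import annotations

import hashlib
import json
from typing import Any


def compute_version_data_hash(version_data: dict[str, Any] | None) -> str | None:
    """Return a 32-character md5 hex digest of version_data, or None if it is absent.

    Assumption: "canonical JSON" means sorted keys and compact separators, so
    that two dicts with the same content always serialize to the same bytes.
    """
    if version_data is None:
        return None
    canonical = json.dumps(version_data, sort_keys=True, separators=(",", ":"))
    # md5 is used here as a change-detection fingerprint, not for security.
    return hashlib.md5(canonical.encode("utf-8"), usedforsecurity=False).hexdigest()


if __name__ == "__main__":
    a = {"manifest": {"files": ["a.py", "b.py"]}, "ref": "abc123"}
    b = {"ref": "abc123", "manifest": {"files": ["a.py", "b.py"]}}
    # Key order does not affect the digest because keys are sorted.
    print(compute_version_data_hash(a) == compute_version_data_hash(b))
```

A 32-character hex digest fits the `String(32)` column described above. Note that sorting keys does not reorder lists, so two manifests that list the same files in a different order would hash differently under this sketch.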
Contributor
Quickest fix:
git fetch upstream main && git rebase upstream/main
rm uv.lock && uv lock
git add uv.lock && git rebase --continue
git push --force-with-lease

Automated nudge: ignore if you're not ready to rebase. This comment is updated in place on future …
…dependencies.sha256sum

- Point 3.3.0 head to new migration revision 9e8d7c6b5a4f
- Fix end-of-file-fixer on generated/provider_dependencies.json.sha256sum
Summary
Store a hash of `dag_version.version_data` to avoid loading and comparing the full JSON manifest on every DAG parse.

Problem

`SerializedDagModel.write_dag`'s "serialized hash unchanged" fast path refreshes `DagVersion.bundle_version`/`version_data` in place, comparing the full stored `version_data` against the incoming value:

- `_prefetch_dag_write_metadata` loads the full `DagVersion` row, including the entire `version_data` JSON, for every DAG in the bulk write.
- The fast path compares the full `version_data` dict each parse.

Solution

Persist a `version_data_hash` (md5 of canonical JSON, `String(32)`, nullable) on `dag_version` and compare/prefetch that instead of the full blob:

- `DagVersion` model: new `version_data_hash` column + `compute_version_data_hash()` static method
- `_prefetch_dag_write_metadata`: uses `load_only()` to skip loading the `version_data` JSON column entirely
- `write_dag` fast path: compares `version_data_hash` instead of full dicts
- In-place refresh and no-TI-update paths: set `version_data_hash` when bundle metadata changes
- New `DagVersion` rows: computed on creation

Verification

- `test_serialized_dag` tests pass
- `test_dag_version` tests pass
- … `9ff64e1c35d3`

Closes: #68567
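To illustrate the fast-path comparison described in the Solution, here is a rough, self-contained sketch. It uses no Airflow or SQLAlchemy code: `DagVersionStub`, `_hash_version_data`, and `refresh_bundle_metadata` are hypothetical names invented for this example, and the real `write_dag` logic may differ.

```python
from __future__ import annotations

import hashlib
import json
from dataclasses import dataclass
from typing import Any


def _hash_version_data(version_data: dict[str, Any] | None) -> str | None:
    """Hypothetical helper: md5 of sorted-key JSON, or None when there is no data."""
    if version_data is None:
        return None
    canonical = json.dumps(version_data, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(canonical.encode("utf-8"), usedforsecurity=False).hexdigest()


@dataclass
class DagVersionStub:
    """Stand-in for a prefetched row that carries only the small columns."""

    bundle_version: str | None
    version_data_hash: str | None
    version_data: dict[str, Any] | None = None  # not loaded during prefetch


def refresh_bundle_metadata(
    stored: DagVersionStub,
    bundle_version: str | None,
    version_data: dict[str, Any] | None,
) -> bool:
    """Update the row in place only when the metadata changed; return True if it did."""
    incoming_hash = _hash_version_data(version_data)
    if stored.bundle_version == bundle_version and stored.version_data_hash == incoming_hash:
        return False  # nothing changed: no write and no large-dict comparison
    stored.bundle_version = bundle_version
    stored.version_data = version_data
    stored.version_data_hash = incoming_hash
    return True


if __name__ == "__main__":
    row = DagVersionStub(bundle_version="v1", version_data_hash=None)
    print(refresh_bundle_metadata(row, "v1", {"files": ["a.py"]}))  # first parse backfills the hash
    print(refresh_bundle_metadata(row, "v1", {"files": ["a.py"]}))  # unchanged on the next parse
```

Because the column is nullable, rows written before the migration would have no stored hash. In this sketch such a row is treated as changed on the first parse with non-empty `version_data`, which backfills the hash. Whether the PR handles legacy rows the same way is not stated here.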