Make benchmark dataset scrubbing metadata-driven and unify HDF5/MFD load behavior#653
Open
…et load behavior to dataset_metadata.yml, routing HDF5 and MFD loaders through processDataSet, preserving legacy scrubbing behind a deprecated path, and requiring explicit similarity/load configuration for curated datasets.
Summary
This PR restructures benchmark dataset loading so scrubbing behavior is explicit, metadata-controlled, and uniform across HDF5 and MFD loaders. The primary goal is to stop hard-coding legacy load-time scrubbing behavior into loader-specific paths and prepare a safe transition toward prescrubbed datasets whose offline ground truth matches the stored vectors exactly.
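As a rough illustration of the flow described above, the sketch below shows one way a single entry point could dispatch on a metadata-supplied load behavior. It is not the PR's code: `Props`, the `legacyScrub` parameter, and `dropZeroVectors` are hypothetical stand-ins, and the zero-vector filter is only a placeholder for whatever the real legacy scrub does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical sketch: names echo the PR description, bodies are illustrative only.
final class LoadBehaviorSketch {
    enum LoadBehavior { LEGACY_SCRUB, NO_SCRUB }

    /** Assumed minimal stand-in for the metadata carried through the load path. */
    static final class Props {
        final String similarityFunction;
        final LoadBehavior loadBehavior;

        Props(String similarityFunction, LoadBehavior loadBehavior) {
            this.similarityFunction = similarityFunction;
            this.loadBehavior = loadBehavior;
        }
    }

    /** Single place where load behavior is applied, regardless of which loader produced the vectors. */
    static List<float[]> processDataSet(List<float[]> vectors,
                                        Props props,
                                        UnaryOperator<List<float[]>> legacyScrub) {
        switch (props.loadBehavior) {
            case NO_SCRUB:
                return vectors; // hand back the data exactly as stored
            case LEGACY_SCRUB:
                return legacyScrub.apply(vectors); // delegate to the legacy path
            default:
                throw new IllegalStateException("Unhandled load behavior: " + props.loadBehavior);
        }
    }

    /** Placeholder scrub (assumption): drops all-zero vectors. The real legacy logic may differ. */
    static List<float[]> dropZeroVectors(List<float[]> vectors) {
        List<float[]> kept = new ArrayList<>();
        for (float[] v : vectors) {
            boolean allZero = true;
            for (float x : v) {
                if (x != 0f) { allZero = false; break; }
            }
            if (!allZero) kept.add(v);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<float[]> stored = Arrays.asList(new float[]{1f, 0f}, new float[]{0f, 0f});
        Props asStored = new Props("COSINE", LoadBehavior.NO_SCRUB);
        Props legacy = new Props("COSINE", LoadBehavior.LEGACY_SCRUB);
        System.out.println(processDataSet(stored, asStored, LoadBehaviorSketch::dropZeroVectors).size());
        System.out.println(processDataSet(stored, legacy, LoadBehaviorSketch::dropZeroVectors).size());
    }
}
```

Passing the scrub step in as a function keeps the dispatch independent of what the legacy path actually does, which is the part this sketch cannot know.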
Key changes
Scrubbing behavior becomes explicit
- Adds DataSetProperties.LoadBehavior with:
  - LEGACY_SCRUB
  - NO_SCRUB
- Adds load_behavior support to dataset_metadata.yml.
- Introduces DataSetUtils.processDataSet(...) as the new metadata-aware entry point for benchmark dataset processing.
- … legacyScrubDataSet(...).
- Keeps getScrubbedDataSet(...) temporarily as a deprecated compatibility shim.
Unified metadata-driven loader flow
- Updates DataSetLoaderHDF5 and DataSetLoaderMFD to carry full DataSetProperties through the load path instead of collapsing metadata down to only similarity_function.
- Routes both loaders through DataSetUtils.processDataSet(...) so load behavior is applied in one place.
Metadata coverage updates
- Adds load_behavior to existing curated dataset entries.
Visibility / debugging
Behavior changes
What changes now
- … dataset_metadata.yml entry.
- NO_SCRUB now loads vectors and ground truth exactly as stored.
- LEGACY_SCRUB preserves the existing load-time behavior.
What does not change yet
- … LEGACY_SCRUB during the transition.
Why this change
Notes / limitations
- LEGACY_SCRUB remains the default during the transition to avoid breaking currently deployed datasets.
- getScrubbedDataSet(...) is still present as a deprecated compatibility API and should be removed after downstream callers are fully migrated.
- Datasets loaded through DataSetLoaderMFD must also be added to dataset_metadata.yml to participate in the new configuration model.
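
The first two notes can be pictured with a minimal sketch, under stated assumptions: the metadata entry is modeled as an already-parsed `Map<String, String>`, a missing `load_behavior` key falls back to `LEGACY_SCRUB`, and the deprecated method simply delegates to the new entry point. The `resolve` helper, the string-returning method bodies, and the delegation itself are hypothetical and are not taken from the PR.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of default resolution and a deprecated shim; not the PR's implementation.
final class LoadBehaviorDefaults {
    enum LoadBehavior { LEGACY_SCRUB, NO_SCRUB }

    /** Reads load_behavior from a parsed metadata entry, defaulting to LEGACY_SCRUB when absent (assumption). */
    static LoadBehavior resolve(Map<String, String> entry) {
        String raw = entry.get("load_behavior");
        if (raw == null || raw.trim().isEmpty()) {
            return LoadBehavior.LEGACY_SCRUB;
        }
        // valueOf throws IllegalArgumentException for unknown values, so typos fail loudly.
        return LoadBehavior.valueOf(raw.trim().toUpperCase(Locale.ROOT));
    }

    /** Stand-in for the metadata-aware entry point; returns a label instead of a real dataset. */
    static String processDataSet(String datasetName, LoadBehavior behavior) {
        return datasetName + ":" + behavior;
    }

    /** Compatibility shim: old callers keep working and always get the legacy behavior. */
    @Deprecated
    static String getScrubbedDataSet(String datasetName) {
        return processDataSet(datasetName, LoadBehavior.LEGACY_SCRUB);
    }

    public static void main(String[] args) {
        Map<String, String> entry = new HashMap<>();
        System.out.println(resolve(entry));
        entry.put("load_behavior", "NO_SCRUB");
        System.out.println(resolve(entry));
        System.out.println(getScrubbedDataSet("demo"));
    }
}
```

Marking the shim `@Deprecated` lets the compiler flag remaining callers, which gives a concrete signal for when it can be removed.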