feat: replace convert_units and pre_process with a unified transforms pipeline #79

@turban

Description

Problem

The current codebase has two separate, ad-hoc mechanisms for transforming downloaded data before the GeoZarr is written: the `convert_units` and `pre_process` dataset fields.

Neither supports multiple ordered transformations, and adding new conversions requires changing core code rather than configuration.

Proposed solution

Replace both fields with a single `transforms` list in the dataset YAML, using the same dotted-path callable pattern already used by `ingestion.function`:

```yaml
# era5_land temperature
transforms:
  - function: climate_api.transforms.convert_units

# era5_land precipitation
transforms:
  - function: climate_api.transforms.deaccumulate_era5
  - function: climate_api.transforms.convert_units
```

Each callable has the signature `(ds: xr.Dataset, dataset: dict[str, Any]) -> xr.Dataset` and is resolved at runtime via `importlib`, exactly like download functions. `convert_units` reads the existing `units`/`convert_units` fields (kept for STAC metadata). Transforms from external packages (e.g. `dhis2eo`) are supported without any changes to core code.
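To make the resolution step concrete, here is a minimal sketch of how the pipeline could be run. The helper names `resolve_callable` and `apply_transforms` are hypothetical and not part of the existing codebase. The sketch assumes the parsed dataset config exposes `transforms` as a list of dicts with a `function` key, as in the YAML above.

```python
from __future__ import annotations

import importlib
from typing import TYPE_CHECKING, Any, Callable

if TYPE_CHECKING:  # xarray is only needed for type hints in this sketch
    import xarray as xr


def resolve_callable(dotted_path: str) -> Callable[..., Any]:
    """Import 'package.module.name' and return the attribute 'name'.

    Hypothetical helper; the real code may reuse whatever already
    resolves ingestion.function.
    """
    module_path, _, attr = dotted_path.rpartition(".")
    if not module_path:
        raise ValueError(f"Expected a dotted path, got {dotted_path!r}")
    module = importlib.import_module(module_path)
    return getattr(module, attr)


def apply_transforms(ds: xr.Dataset, dataset: dict[str, Any]) -> xr.Dataset:
    """Run every entry of dataset['transforms'] in the order listed."""
    for entry in dataset.get("transforms", []):
        transform = resolve_callable(entry["function"])
        # Each transform receives the full dataset config, so it can read
        # fields such as units/convert_units without extra arguments.
        ds = transform(ds, dataset)
    return ds
```

Because the list is applied in order, the precipitation example would deaccumulate first and convert units second. A dataset without a `transforms` key passes through unchanged.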

Changes required

- Add `src/climate_api/transforms/` module with at least `convert_units` (replacing `_UNIT_CONVERSIONS`) and a placeholder/implementation for `deaccumulate_era5`
- Update `build_dataset_zarr()` in `downloader.py` to run the transforms pipeline instead of calling `_apply_unit_conversion()` directly
- Update `era5_land.yaml` to use `transforms:` entries, removing `pre_process` and keeping `convert_units`/`units` for STAC metadata only
- Remove the hardcoded `_UNIT_CONVERSIONS` dict and `_apply_unit_conversion()` from `downloader.py`
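As a rough illustration of the first item, a `convert_units` transform with the proposed signature might look like the sketch below. The field semantics are an assumption: it treats `units` as the source unit and `convert_units` as the target unit, which may not match the real YAML schema. The `UNIT_CONVERSIONS` table is hypothetical and does not reproduce the existing `_UNIT_CONVERSIONS` dict.

```python
from __future__ import annotations

from typing import TYPE_CHECKING, Any, Callable

if TYPE_CHECKING:  # xarray is only needed for type hints in this sketch
    import xarray as xr

# Hypothetical table keyed by (source unit, target unit).
UNIT_CONVERSIONS: dict[tuple[str, str], Callable[[Any], Any]] = {
    ("K", "degC"): lambda values: values - 273.15,
    ("m", "mm"): lambda values: values * 1000.0,
}


def convert_units(ds: xr.Dataset, dataset: dict[str, Any]) -> xr.Dataset:
    """Convert all data variables from dataset['units'] to dataset['convert_units'].

    Assumption: 'units' is the source unit and 'convert_units' the target.
    """
    source = dataset.get("units")
    target = dataset.get("convert_units")
    if not target or target == source:
        return ds  # nothing to convert
    try:
        convert = UNIT_CONVERSIONS[(source, target)]
    except KeyError:
        raise ValueError(f"No conversion from {source!r} to {target!r}") from None
    # Dataset.map applies the function to every data variable.
    converted = ds.map(convert, keep_attrs=True)
    for name in converted.data_vars:
        converted[name].attrs["units"] = target
    return converted
```

Raising on an unknown unit pair, rather than silently returning the input, keeps a misconfigured YAML entry from writing mislabeled data to the Zarr store.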
