
Climate normals and anomalies: computation, storage, and automatic cascade #56

@turban

Background

Climate normals (long-term averages) and anomalies (deviation from normal) are among the highest-value derived products for DHIS2 health use cases — malaria early warning, climate impact assessment, and exposure indices all depend on them. This issue covers the full pipeline from normals computation through automatic anomaly updates on sync.


Clarifying design decisions

Normals are not time-indexed. The output has a dayofyear (1–366) or month (1–12) dimension, not a time dimension. This makes normals a genuinely different dataset type, not a sync-able artifact. They are computed once from a defined period and recomputed only when the normal period changes.

ERA5 hourly to daily aggregation first. Computing hourly normals (24 × 366 = 8,784 hour-of-year values per pixel) is impractical for DHIS2 use cases. ERA5 hourly data must be aggregated to daily values (mean temperature, total precipitation) before normals are computed. The daily ERA5 aggregate is itself a publishable dataset, independently useful.

Anomalies as a separate dataset, not extra variables. Adding precip_anomaly alongside precip in the same Zarr store couples anomaly updates to the base dataset write path and complicates sync. A separate dataset (chirps3_precipitation_daily_anomaly_sle) is consistent with the project's one-variable-per-dataset and one-period-type-per-dataset design constraints.

30-year historical ingestion is a separate job. Ingesting CHIRPS daily for 1991–2020 (~11,000 daily files) or ERA5 hourly for the same period is a large one-time download and must not block the normal sync workflow.


Phase 1 — Climate normals

1.1 New dataset template kind: climatology

This adds period_type: climatology alongside daily, monthly, and yearly. A new climatology block in the YAML template describes the source dataset, the normal period, and the smoothing:

- id: chirps3_precipitation_daily_normals
  name: CHIRPS3 daily precipitation normals (1991-2020)
  variable: precip_normal
  period_type: climatology
  sync_kind: static
  climatology:
    source_dataset_id: chirps3_precipitation_daily
    period: [1991, 2020]
    aggregation: mean
    smoothing_window: 31

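The smoothing_window: 31 applies a 31-day centered circular rolling mean, which is the WMO-recommended approach. Wrapping December into January at the window boundary handles Jan 1 correctly and also resolves the leap-day sparse-data problem. Note that when grouping by dayofyear, Feb 29 is not the sparse bin: DOY 60 is Feb 29 in leap years and Mar 1 otherwise, so it is fully populated. The sparse bin is DOY 366 (Dec 31 of leap years, 8 samples in 1991–2020), and the circular window around it draws on the fully populated days on both sides of the year boundary.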
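To check which day-of-year bins are thin, the sketch below counts how many samples each bin would receive over 1991–2020. It assumes bins are keyed by the ordinal day of year (which is what xarray's time.dayofyear returns); samples_per_doy is a hypothetical helper for illustration, not part of the project.

```python
from collections import Counter
from datetime import date, timedelta


def samples_per_doy(first_year: int, last_year: int) -> Counter:
    """Count how many calendar days fall into each day-of-year bin (1..366)."""
    counts = Counter()
    day = date(first_year, 1, 1)
    end = date(last_year, 12, 31)
    while day <= end:
        counts[day.timetuple().tm_yday] += 1  # ordinal day of year
        day += timedelta(days=1)
    return counts


counts = samples_per_doy(1991, 2020)
print(counts[60], counts[366], sum(counts.values()))  # 30 8 10958
```

DOY 60 receives a sample every year (Feb 29 in leap years, Mar 1 otherwise), while DOY 366 only exists in the eight leap years of the period.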

1.2 Normals computation

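import xarray as xr

def compute_normals(source_zarr_path, *, period, smoothing_window=31) -> xr.Dataset:
    # Restrict to the normal period; year strings make both slice ends inclusive.
    ds = xr.open_zarr(source_zarr_path).sel(time=slice(str(period[0]), str(period[1])))
    # Average each day-of-year bin across all years.
    normals = ds.groupby("time.dayofyear").mean("time")   # shape (366, y, x)
    if smoothing_window:
        # _circular_rolling_mean is a helper not shown here.
        normals = _circular_rolling_mean(normals, window=smoothing_window)
    return normals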
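The _circular_rolling_mean helper is not defined in this issue. As a minimal sketch of the wrap-around windowing it implies, the pure-Python function below smooths a single pixel's day-of-year series. The name circular_rolling_mean and the odd-window requirement are assumptions; a real helper would apply the same logic along the dayofyear axis of an xarray object.

```python
from statistics import fmean
from typing import List, Sequence


def circular_rolling_mean(values: Sequence[float], window: int = 31) -> List[float]:
    """Centered rolling mean over a cyclic series (index wraps at both ends)."""
    n = len(values)
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    if window > n:
        raise ValueError("window must not exceed the series length")
    half = window // 2
    # For position i, average the values at i-half .. i+half, modulo n,
    # so the window around Jan 1 reaches back into late December.
    return [
        fmean(values[(i + k) % n] for k in range(-half, half + 1))
        for i in range(n)
    ]


# A single spike on day index 0 is spread evenly over the 31 days around it,
# including the last 15 days of the series.
series = [0.0] * 366
series[0] = 31.0
smoothed = circular_rolling_mean(series, window=31)
```

Because every input value contributes to exactly `window` output positions with equal weight, the smoothing preserves the mean of the series.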

1.3 New POST /processes/normals endpoint

Long-running job returning 202 + job_id:

POST /processes/normals
{
  "source_dataset_id": "chirps3_precipitation_daily_sle",
  "period": [1991, 2020],
  "smoothing_window": 31
}

Writes a new GeoZarr to the storage backend and registers it as a published artifact under the normals dataset ID.
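A minimal sketch of how the request body above might be validated before a job is queued. NormalsRequest is a hypothetical class; the odd-window rule and the treatment of null as "no smoothing" are assumptions, and the default of 31 is taken from the compute_normals signature.

```python
import json
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class NormalsRequest:
    """Parsed body of POST /processes/normals (hypothetical model)."""
    source_dataset_id: str
    period: Tuple[int, int]
    smoothing_window: Optional[int] = 31

    @classmethod
    def from_json(cls, body: str) -> "NormalsRequest":
        data = json.loads(body)
        start, end = (int(year) for year in data["period"])
        if start > end:
            raise ValueError("period start must not be after period end")
        window = data.get("smoothing_window", 31)
        # Assumption: a centered window needs an odd length; null disables smoothing.
        if window is not None and (window < 1 or window % 2 == 0):
            raise ValueError("smoothing_window must be a positive odd integer or null")
        return cls(data["source_dataset_id"], (start, end), window)


request = NormalsRequest.from_json(
    '{"source_dataset_id": "chirps3_precipitation_daily_sle",'
    ' "period": [1991, 2020], "smoothing_window": 31}'
)
```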

1.4 ERA5 daily aggregation step

Before computing ERA5 normals, a temporal aggregation step produces daily values from hourly. This daily aggregate is a publishable dataset in its own right:

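daily_mean = ds.resample(time="1D").mean()    # temperature
# sum() assumes per-hour accumulations; ERA5-Land hourly totals run from 00 UTC and need differencing first
daily_sum  = ds.resample(time="1D").sum()     # precipitation (accumulated)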

Dataset ID example: era5land_temperature_daily_sle
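To make the two reductions concrete, here is a small pure-Python sketch of the same grouping on toy data for a single pixel. It assumes timestamps are already in the time zone used for daily bins and that each precipitation value is that hour's own accumulation; aggregate_daily is a hypothetical helper, not the project's implementation.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import fmean


def aggregate_daily(hourly, how):
    """Group (timestamp, value) pairs by calendar date and reduce each group."""
    reducers = {"mean": fmean, "sum": sum}
    reduce = reducers[how]  # KeyError for unsupported reductions
    groups = defaultdict(list)
    for stamp, value in hourly:
        groups[stamp.date()].append(value)
    return {day: reduce(values) for day, values in sorted(groups.items())}


start = datetime(2024, 1, 1)
# Two days of toy data: temperature ramps from 20 to 43 each day, rain is constant.
temps = [(start + timedelta(hours=h), 20.0 + (h % 24)) for h in range(48)]
rain = [(start + timedelta(hours=h), 0.5) for h in range(48)]

daily_temp = aggregate_daily(temps, "mean")  # 31.5 for each day
daily_rain = aggregate_daily(rain, "sum")    # 12.0 for each day
```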


Phase 2 — Anomalies

2.1 New dataset template kind: anomaly

- id: chirps3_precipitation_daily_anomaly
  name: CHIRPS3 daily precipitation anomaly
  variable: precip_anomaly
  period_type: daily
  sync_kind: temporal
  sync_execution: append
  anomaly:
    source_dataset_id: chirps3_precipitation_daily
    normals_dataset_id: chirps3_precipitation_daily_normals
    method: absolute

method can be absolute (observed minus normal), relative (percentage), or standardized (z-score; requires standard-deviation normals and is deferred, see Deferred below).
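The three methods can be sketched for a single value as follows. This is an illustration under assumptions: relative is read here as percent departure from the normal (the issue only says "percentage"), a zero normal yields NaN, and the function name and signature are hypothetical.

```python
import math
from typing import Optional


def anomaly(observed: float, normal: float, method: str = "absolute",
            std: Optional[float] = None) -> float:
    """Return the anomaly of one observation against its normal."""
    if method == "absolute":
        return observed - normal
    if method == "relative":
        # Assumption: percent departure from normal; undefined when the normal is 0
        # (e.g. dry-season precipitation), so return NaN instead of dividing.
        if normal == 0:
            return math.nan
        return 100.0 * (observed - normal) / normal
    if method == "standardized":
        if std is None:
            raise ValueError("standardized anomalies need a standard deviation")
        return math.nan if std == 0 else (observed - normal) / std
    raise ValueError(f"unknown method: {method}")


absolute = anomaly(12.0, 10.0)                           # 2.0
relative = anomaly(12.0, 10.0, "relative")               # 20.0
zscore = anomaly(12.0, 10.0, "standardized", std=4.0)    # 0.5
```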

2.2 Anomaly computation

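import xarray as xr

def compute_anomaly(source_zarr_path, normals_zarr_path, *, start, end) -> xr.Dataset:
    ds      = xr.open_zarr(source_zarr_path).sel(time=slice(start, end))
    normals = xr.open_zarr(normals_zarr_path)   # shape (366, y, x)
    doy     = ds.time.dt.dayofyear              # one DOY label per time step
    # Dataset arithmetic pairs data variables by name: if the normals store uses a
    # different name (e.g. precip_normal vs precip), rename it before subtracting.
    return ds - normals.sel(dayofyear=doy).drop_vars("dayofyear")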

xarray's vectorized label-based indexing handles this in one expression, with no looping over time steps.
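What the day-of-year lookup does per time step can be shown with plain dictionaries. This is a toy sketch for one pixel with made-up normals (the normal for each DOY equals the DOY number); absolute_anomalies is a hypothetical name.

```python
from datetime import date
from typing import Dict


def absolute_anomalies(observations: Dict[date, float],
                       normals_by_doy: Dict[int, float]) -> Dict[date, float]:
    """Subtract the normal of the matching day of year from each observation."""
    return {
        day: value - normals_by_doy[day.timetuple().tm_yday]
        for day, value in observations.items()
    }


normals_by_doy = {doy: float(doy) for doy in range(1, 367)}  # toy normals
observations = {date(2023, 3, 1): 100.0, date(2024, 3, 1): 100.0}
result = absolute_anomalies(observations, normals_by_doy)
# Mar 1 is DOY 60 in 2023 but DOY 61 in leap year 2024, so the two
# dates are compared against different normals.
```

The one-day shift after Feb 28 in leap years is inherent to day-of-year keys; with a 31-day smoothed normal the difference between adjacent bins is small.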

2.3 New POST /processes/anomaly endpoint

POST /processes/anomaly
{
  "source_dataset_id": "chirps3_precipitation_daily_sle",
  "normals_dataset_id": "chirps3_precipitation_daily_normals_sle",
  "start": "2024-01-01",
  "end": "2024-12-31"
}

Phase 3 — Automatic cascade on sync

After run_sync completes with status="completed", the sync engine checks the dataset registry for any anomaly datasets that declare this dataset as their source_dataset_id. If found, it computes and appends the anomaly for the new delta range.

# In sync_engine.run_sync, after artifact is stored:
_trigger_downstream_anomaly_jobs(
    source_dataset_id=managed_dataset_id,
    delta_start=sync_detail.delta_start,
    delta_end=sync_detail.delta_end,
)

This implements the event-driven cascade from roadmap Step 3 without requiring a full workflow engine — for the anomaly case it is a direct background task call.

Guard: the cascade checks that the normals dataset exists before running. If it does not exist yet, it skips with a warning rather than failing the sync.
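A minimal sketch of the lookup and guard described above. The AnomalyTemplate class, the published set of existing dataset IDs, and the submit callback are all hypothetical stand-ins for the registry, the artifact store, and the background task runner.

```python
import logging
from dataclasses import dataclass
from typing import Callable, Iterable, List, Set

logger = logging.getLogger("sync_engine")


@dataclass(frozen=True)
class AnomalyTemplate:
    dataset_id: str
    source_dataset_id: str
    normals_dataset_id: str


def trigger_downstream_anomaly_jobs(
    templates: Iterable[AnomalyTemplate],
    published: Set[str],
    submit: Callable[[AnomalyTemplate, str, str], None],
    *,
    source_dataset_id: str,
    delta_start: str,
    delta_end: str,
) -> List[str]:
    """Queue an anomaly job for every template fed by the synced dataset."""
    submitted = []
    for template in templates:
        if template.source_dataset_id != source_dataset_id:
            continue
        if template.normals_dataset_id not in published:
            # Guard: a missing normals dataset must not fail the sync.
            logger.warning("Skipping %s: normals %s not found",
                           template.dataset_id, template.normals_dataset_id)
            continue
        submit(template, delta_start, delta_end)
        submitted.append(template.dataset_id)
    return submitted
```

Returning the list of submitted dataset IDs keeps the hook easy to test and lets the sync result record which downstream jobs were started.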


Dependency order

1. ERA5 daily aggregation dataset template + compute step
2. Normals computation service + POST /processes/normals
3. Anomaly computation service + POST /processes/anomaly
4. Post-sync cascade hook

Steps 1–3 can be implemented independently of each other. Step 4 depends on all three.
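The ordering constraint can be written down as a small dependency graph. This sketch uses the standard library's graphlib (Python 3.9+); the step names are shorthand invented here for the four items in the list.

```python
from graphlib import TopologicalSorter

# Each key maps to the set of steps that must be finished before it.
dependencies = {
    "era5_daily_aggregation": set(),
    "normals_service": set(),
    "anomaly_service": set(),
    "post_sync_cascade": {
        "era5_daily_aggregation",
        "normals_service",
        "anomaly_service",
    },
}

# Any valid order lists the three independent steps first and the cascade last.
order = list(TopologicalSorter(dependencies).static_order())
```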


Deferred

  • Standardized anomalies (z-score): requires standard deviation as a second normals variable. The method: standardized field is reserved in the template schema but not part of V1.
  • Monthly normals from CHIRPS daily: trivially derived by resampling but a separate template.
  • Normal period update (e.g. shifting to 2001–2030 in 2031): the period field handles this but re-ingesting 30 years is heavyweight — out of scope for now.
  • OGC API Processes compliance: the /processes endpoints are REST jobs for now. Full OGC API Processes (Parts 1–3) is the open question in roadmap Step 2.
