Background
Climate normals (long-term averages) and anomalies (deviation from normal) are among the highest-value derived products for DHIS2 health use cases — malaria early warning, climate impact assessment, and exposure indices all depend on them. This issue covers the full pipeline from normals computation through automatic anomaly updates on sync.
Clarifying design decisions
Normals are not time-indexed. The output has a dayofyear (1–366) or month (1–12) dimension, not a time dimension. This makes normals a genuinely different dataset type, not a sync-able artifact. They are computed once from a defined period and recomputed only when the normal period changes.
ERA5 hourly to daily aggregation first. Computing hourly normals (8,784 values per pixel per year) is impractical for DHIS2 use cases. ERA5 hourly data must be aggregated to daily (mean temperature, total precipitation) before normals are computed. The daily ERA5 aggregate is itself a publishable dataset, independently useful.
Anomalies as a separate dataset, not extra variables. Adding precip_anomaly alongside precip in the same Zarr store couples anomaly updates to the base dataset write path and complicates sync. A separate dataset (chirps3_precipitation_daily_anomaly_sle) is consistent with the project's one-variable-per-dataset and one-period-type-per-dataset design constraints.
30-year historical ingestion is a separate job. Ingesting 1991–2020 of CHIRPS daily (~11,000 daily files) or ERA5 hourly is a large one-time download and must not block the normal sync workflow.
Phase 1 — Climate normals
1.1 New dataset template kind: climatology
Add a new period_type: climatology alongside daily, monthly, and yearly. A new climatology block in the YAML template describes the source, normal period, and smoothing:
- id: chirps3_precipitation_daily_normals
  name: CHIRPS3 daily precipitation normals (1991-2020)
  variable: precip_normal
  period_type: climatology
  sync_kind: static
  climatology:
    source_dataset_id: chirps3_precipitation_daily
    period: [1991, 2020]
    aggregation: mean
    smoothing_window: 31
The smoothing_window: 31 applies a 31-day centered circular rolling mean, which is the WMO-recommended approach. Wrapping December into January at the window boundary handles Jan 1 correctly and resolves the leap-day sparse-data problem: the one bin that only leap years populate (DOY 366 when grouping on time.dayofyear, or Feb 29 at DOY 60 under a leap-aware day index) is averaged with well-sampled neighbouring days.
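The circular window can be illustrated with a minimal NumPy sketch. The function name `circular_rolling_mean` and its signature are assumptions made for illustration; this is not the project's `_circular_rolling_mean` helper. It expects an array whose first axis is the day-of-year dimension.

```python
import numpy as np


def circular_rolling_mean(values: np.ndarray, window: int = 31) -> np.ndarray:
    """Centered rolling mean along axis 0 that wraps around the year boundary.

    Hypothetical helper: `values` has shape (n_days, ...) with day-of-year first.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number so it can be centered")
    if window > values.shape[0]:
        raise ValueError("window must not be longer than the day-of-year axis")
    half = window // 2
    total = np.zeros(values.shape, dtype=float)
    # np.roll shifts the day axis circularly, so the window around Jan 1
    # picks up late December and the window around Dec 31 picks up early January.
    for shift in range(-half, half + 1):
        total += np.roll(values, shift, axis=0)
    return total / window
```

On xarray objects a similar effect can probably be had by padding the dayofyear dimension with `mode="wrap"`, applying a centered `rolling(...).mean()`, and trimming the padding again; the NumPy version above only shows the arithmetic.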
1.2 Normals computation
import xarray as xr

def compute_normals(source_zarr_path, *, period, smoothing_window=31) -> xr.Dataset:
    # Restrict the source to the normal period (both boundary years inclusive).
    ds = xr.open_zarr(source_zarr_path).sel(time=slice(str(period[0]), str(period[1])))
    # Average each day of year across all years.
    normals = ds.groupby("time.dayofyear").mean("time")  # shape (366, y, x)
    if smoothing_window:
        normals = _circular_rolling_mean(normals, window=smoothing_window)
    return normals
1.3 New POST /processes/normals endpoint
Long-running job returning 202 + job_id:
POST /processes/normals
{
  "source_dataset_id": "chirps3_precipitation_daily_sle",
  "period": [1991, 2020],
  "smoothing_window": 31
}
Writes a new GeoZarr to the storage backend and registers it as a published artifact under the normals dataset ID.
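The accept-then-poll contract can be sketched without committing to a web framework. Everything below is hypothetical: `NormalsRequest`, `JobStore`, the in-memory job dictionary, and the 400 response for a reversed period are illustrative assumptions, not part of the existing service.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class NormalsRequest:
    """Hypothetical request model mirroring the JSON body of POST /processes/normals."""
    source_dataset_id: str
    period: Tuple[int, int]
    smoothing_window: int = 31


@dataclass
class JobStore:
    """In-memory stand-in for whatever job tracking the service uses."""
    jobs: Dict[str, dict] = field(default_factory=dict)

    def submit(self, request: NormalsRequest) -> Tuple[int, dict]:
        start_year, end_year = request.period
        if start_year > end_year:
            return 400, {"error": "period start must not be after period end"}
        job_id = uuid.uuid4().hex
        self.jobs[job_id] = {"status": "accepted", "request": request}
        # 202 Accepted: the computation runs in the background and the
        # caller uses job_id to check on it later.
        return 202, {"job_id": job_id}
```

The point of the sketch is that the handler returns before any Zarr data is read; the actual normals computation would be started from the stored job record.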
1.4 ERA5 daily aggregation step
Before computing ERA5 normals, a temporal aggregation step produces daily values from hourly. This daily aggregate is a publishable dataset in its own right:
daily_mean = ds.resample(time="1D").mean()  # temperature: mean of the hourly values
daily_sum = ds.resample(time="1D").sum()  # precipitation: daily total, assuming per-hour (not running) accumulations
Dataset ID example: era5land_temperature_daily_sle
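What the two resample calls compute can be spelled out on plain Python data. This is a sketch only: `aggregate_hourly_to_daily` is a hypothetical name, and skipping days with fewer than 24 samples is an assumption about how partial days should be treated, not something the issue specifies.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def aggregate_hourly_to_daily(samples, how="mean", hours_per_day=24):
    """Collapse (timestamp, value) hourly samples into one value per date.

    Hypothetical helper. Days with fewer than `hours_per_day` samples are
    skipped so that a partial day never yields a biased mean or total.
    """
    if how not in ("mean", "sum"):
        raise ValueError(f"unsupported aggregation: {how}")
    by_day = defaultdict(list)
    for timestamp, value in samples:
        by_day[timestamp.date()].append(value)
    daily = {}
    for day, values in sorted(by_day.items()):
        if len(values) < hours_per_day:
            continue  # incomplete day, e.g. the tail of the latest sync
        total = sum(values)
        daily[day] = total / len(values) if how == "mean" else total
    return daily


# Thirty hourly samples: one complete day plus six hours of the next.
start = datetime(2024, 1, 1)
hourly = [(start + timedelta(hours=h), float(h % 24)) for h in range(30)]
daily_mean = aggregate_hourly_to_daily(hourly, how="mean")
daily_sum = aggregate_hourly_to_daily(hourly, how="sum")
```

Here `mean` corresponds to the temperature case and `sum` to the precipitation case; only 2024-01-01 appears in either result because the second day is incomplete.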
Phase 2 — Anomalies
2.1 New dataset template kind: anomaly
- id: chirps3_precipitation_daily_anomaly
  name: CHIRPS3 daily precipitation anomaly
  variable: precip_anomaly
  period_type: daily
  sync_kind: temporal
  sync_execution: append
  anomaly:
    source_dataset_id: chirps3_precipitation_daily
    normals_dataset_id: chirps3_precipitation_daily_normals
    method: absolute
method can be absolute (observed minus normal), relative (percentage), or standardized (z-score, requires standard deviation normals).
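The three methods differ only in the final arithmetic, as the sketch below shows. `apply_anomaly_method` and `normal_std` are hypothetical names, and reading "relative" as percentage departure from the normal is an assumption, since the formula is not spelled out here.

```python
def apply_anomaly_method(observed, normal, method="absolute", normal_std=None):
    """Return the anomaly of `observed` against `normal` for one method.

    Hypothetical helper. It only uses arithmetic operators, so it works on
    floats and should work on NumPy arrays or xarray objects as well.
    """
    if method == "absolute":
        return observed - normal
    if method == "relative":
        # Assumed definition: percentage departure from the normal.
        # Undefined where the normal is zero, e.g. dry-season precipitation.
        return (observed - normal) / normal * 100.0
    if method == "standardized":
        if normal_std is None:
            raise ValueError("standardized anomalies need standard deviation normals")
        return (observed - normal) / normal_std
    raise ValueError(f"unknown anomaly method: {method}")
```

For an observed value of 15.0 against a normal of 10.0, `absolute` gives 5.0 and `relative` gives 50.0; `standardized` additionally needs the standard deviation for that day of year.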
2.2 Anomaly computation
import xarray as xr

def compute_anomaly(source_zarr_path, normals_zarr_path, *, start, end) -> xr.Dataset:
    ds = xr.open_zarr(source_zarr_path).sel(time=slice(start, end))
    normals = xr.open_zarr(normals_zarr_path)  # shape (366, y, x)
    doy = ds.time.dt.dayofyear
    # Pick the matching normal for every time step, then subtract.
    # Dataset arithmetic only pairs data variables that share a name.
    return ds - normals.sel(dayofyear=doy).drop_vars("dayofyear")
xarray's vectorized label-based indexing handles this in one expression, with no looping over time steps.
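For explanation only, the lookup performed by that single expression can be written out per time step in plain Python. `anomaly_series` and the two dictionary arguments are illustrative names, not project APIs.

```python
from datetime import date


def anomaly_series(observations, normals_by_doy):
    """Subtract the day-of-year normal from each dated observation.

    `observations` maps date -> value and `normals_by_doy` maps 1..366 -> value.
    """
    result = {}
    for day, value in observations.items():
        doy = day.timetuple().tm_yday  # 1-based, like xarray's dt.dayofyear
        result[day] = value - normals_by_doy[doy]
    return result


# Toy normals where the normal for each day equals its day-of-year number.
normals_by_doy = {doy: float(doy) for doy in range(1, 367)}
observations = {date(2023, 3, 1): 100.0, date(2024, 3, 1): 100.0}
anomalies = anomaly_series(observations, normals_by_doy)
```

The two dates are chosen deliberately: 1 March is day 60 in 2023 but day 61 in the leap year 2024, so the same calendar date is compared against different normals when the index is a plain day-of-year.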
2.3 New POST /processes/anomaly endpoint
POST /processes/anomaly
{
  "source_dataset_id": "chirps3_precipitation_daily_sle",
  "normals_dataset_id": "chirps3_precipitation_daily_normals_sle",
  "start": "2024-01-01",
  "end": "2024-12-31"
}
Phase 3 — Automatic cascade on sync
After run_sync completes with status="completed", the sync engine checks the dataset registry for any anomaly datasets that declare this dataset as their source_dataset_id. If found, it computes and appends the anomaly for the new delta range.
# In sync_engine.run_sync, after artifact is stored:
_trigger_downstream_anomaly_jobs(
    source_dataset_id=managed_dataset_id,
    delta_start=sync_detail.delta_start,
    delta_end=sync_detail.delta_end,
)
This implements the event-driven cascade from roadmap Step 3 without requiring a full workflow engine — for the anomaly case it is a direct background task call.
Guard: the cascade checks that the normals dataset exists before running. If it does not exist yet, it skips with a warning rather than failing the sync.
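A minimal sketch of the registry lookup and the guard follows. The parameter shapes are all assumptions: `registry` is taken to map dataset IDs to template dicts, `published` is the set of dataset IDs that already have an artifact, and `run_job` stands in for whatever schedules the background task.

```python
import logging

logger = logging.getLogger("sync_engine")


def trigger_downstream_anomaly_jobs(registry, published, source_dataset_id,
                                    delta_start, delta_end, run_job):
    """Start one anomaly job per dataset that derives from `source_dataset_id`.

    Hypothetical signature; returns the IDs of the anomaly datasets for
    which a job was started.
    """
    started = []
    for dataset_id, template in registry.items():
        anomaly = template.get("anomaly")
        if not anomaly or anomaly["source_dataset_id"] != source_dataset_id:
            continue  # not an anomaly dataset, or derived from another source
        normals_id = anomaly["normals_dataset_id"]
        if normals_id not in published:
            # Guard: a missing normals dataset must not fail the sync itself.
            logger.warning("Skipping %s: normals dataset %s not found",
                           dataset_id, normals_id)
            continue
        run_job(dataset_id, normals_id, delta_start, delta_end)
        started.append(dataset_id)
    return started
```

Returning the list of started datasets makes the hook easy to test and to report in the sync result, though the real hook may just as well return nothing.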
Dependency order
1. ERA5 daily aggregation dataset template + compute step
2. Normals computation service + POST /processes/normals
3. Anomaly computation service + POST /processes/anomaly
4. Post-sync cascade hook
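Steps 1–3 can be implemented independently of one another, although at run time an anomaly job needs a published normals dataset, and ERA5 normals need the daily aggregate. Step 4 depends on all three.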
Deferred
- Standardized anomalies (z-score): requires standard deviation as a second normals variable. The method: standardized field is reserved in the template schema but not part of V1.
- Monthly normals from CHIRPS daily: trivially derived by resampling but a separate template.
- Normal period update (e.g. shifting to 2001–2030 in 2031): the period field handles this but re-ingesting 30 years is heavyweight, so it is out of scope for now.
- OGC API Processes compliance: the /processes endpoints are REST jobs for now. Full OGC API Processes (Parts 1–3) is the open question in roadmap Step 2.