Climate normals and anomalies: computation, storage, and automatic cascade

## Background

Climate normals (long-term averages) and anomalies (deviation from normal) are among the highest-value derived products for DHIS2 health use cases — malaria early warning, climate impact assessment, and exposure indices all depend on them. This issue covers the full pipeline from normals computation through automatic anomaly updates on sync.

---

## Clarifying design decisions

**Normals are not time-indexed.** The output has a `dayofyear` (1–366) or `month` (1–12) dimension, not a `time` dimension. This makes normals a genuinely different dataset type, not a sync-able artifact. They are computed once from a defined period and recomputed only when the normal period changes.

**ERA5 hourly to daily aggregation first.** Computing hourly normals (366 × 24 = 8,784 values per pixel) is impractical for DHIS2 use cases. ERA5 hourly data must be aggregated to daily values (mean temperature, total precipitation) before normals are computed. The daily ERA5 aggregate is itself a publishable dataset, independently useful.

**Anomalies as a separate dataset, not extra variables.** Adding `precip_anomaly` alongside `precip` in the same Zarr store couples anomaly updates to the base dataset write path and complicates sync. A separate dataset (`chirps3_precipitation_daily_anomaly_sle`) is consistent with the project's one-variable-per-dataset and one-period-type-per-dataset design constraints.

**30-year historical ingestion is a separate job.** Ingesting 1991–2020 of CHIRPS daily (~10,000 files) or ERA5 hourly is a large one-time download and must not block the normal sync workflow.

---

## Phase 1 — Climate normals

### 1.1 New dataset template kind: climatology

This adds a new `period_type: climatology` alongside `daily`, `monthly`, and `yearly`. A new `climatology` block in the YAML template describes the source, normal period, and smoothing:

```yaml
- id: chirps3_precipitation_daily_normals
  name: CHIRPS3 daily precipitation normals (1991-2020)
  variable: precip_normal
  period_type: climatology
  sync_kind: static
  climatology:
    source_dataset_id: chirps3_precipitation_daily
    period: [1991, 2020]
    aggregation: mean
    smoothing_window: 31
```

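The `smoothing_window: 31` applies a 31-day centered circular rolling mean, which is the WMO-recommended approach. Wrapping the window across the December/January boundary handles Jan 1 correctly and also resolves the leap-year sparse-data problem. Note that when grouping by `time.dayofyear`, the sparsely sampled index is DOY 366 (only the 8 leap years in 1991–2020 contribute to it), not DOY 60: DOY 60 is Feb 29 in leap years and Mar 1 in all other years, so it has a sample every year. The window around DOY 366 draws on fully sampled neighbouring days.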

### 1.2 Normals computation

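```python
import xarray as xr

def compute_normals(source_zarr_path, *, period, smoothing_window=31) -> xr.Dataset:
    """Mean per day of year over the inclusive `period` (start_year, end_year)."""
    # String year labels select whole calendar years, inclusive at both ends.
    ds = xr.open_zarr(source_zarr_path).sel(time=slice(str(period[0]), str(period[1])))
    normals = ds.groupby("time.dayofyear").mean("time")   # shape (366, y, x)
    if smoothing_window:
        # Project helper: centered rolling mean that wraps around the year boundary.
        normals = _circular_rolling_mean(normals, window=smoothing_window)
    return normals
```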
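The `_circular_rolling_mean` helper is referenced above but not defined in this issue. One possible shape for it, as a minimal sketch rather than the intended implementation, is to average circularly shifted copies of the normals. The `dim` parameter and the odd-window check are assumptions, and NaN values propagate into every window that touches them.

```python
import numpy as np
import xarray as xr


def _circular_rolling_mean(normals: xr.Dataset, window: int, dim: str = "dayofyear") -> xr.Dataset:
    """Centered rolling mean along `dim` that wraps around the year boundary."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    total = normals.copy()  # the unshifted term (offset 0)
    for shift in range(1, half + 1):
        # roll() is circular, so late-December days feed early-January means
        # and vice versa. Coordinates stay in place; only the data moves.
        total = total + normals.roll({dim: shift}, roll_coords=False)
        total = total + normals.roll({dim: -shift}, roll_coords=False)
    return total / window


# Synthetic check: the "normal" for each day of year is simply its index.
demo = xr.Dataset(
    {"precip": ("dayofyear", np.arange(1.0, 367.0))},
    coords={"dayofyear": np.arange(1, 367)},
)
smoothed = _circular_rolling_mean(demo, window=3)
```

With `window=3`, the value at DOY 1 averages DOY 366, 1, and 2, which shows the wrap-around at the year boundary.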

### 1.3 New POST /processes/normals endpoint

Long-running job returning 202 + job_id:

```
POST /processes/normals
{
  "source_dataset_id": "chirps3_precipitation_daily_sle",
  "period": [1991, 2020],
  "smoothing_window": 31
}
```

Writes a new GeoZarr to the storage backend and registers it as a published artifact under the normals dataset ID.
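To make the 202 + job_id contract concrete, here is a framework-free sketch of the submission step. Everything in it is hypothetical: `NormalsRequest`, `JOBS`, and `submit_normals_job` are illustrative names, the in-memory dict stands in for whatever job store the service uses, and the actual computation and GeoZarr write are left out.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class NormalsRequest:
    """Parsed body of POST /processes/normals (field names from the example above)."""
    source_dataset_id: str
    period: Tuple[int, int]
    smoothing_window: int = 31

    @classmethod
    def from_payload(cls, payload: dict) -> "NormalsRequest":
        start, end = payload["period"]
        if start > end:
            raise ValueError("period start must not exceed period end")
        return cls(
            source_dataset_id=payload["source_dataset_id"],
            period=(int(start), int(end)),
            smoothing_window=int(payload.get("smoothing_window", 31)),
        )


JOBS: Dict[str, dict] = {}  # hypothetical in-memory job store


def submit_normals_job(payload: dict) -> Tuple[int, dict]:
    """Validate the request, record an accepted job, and return (status, body)."""
    request = NormalsRequest.from_payload(payload)
    job_id = uuid.uuid4().hex
    JOBS[job_id] = {"status": "accepted", "request": request}
    return 202, {"job_id": job_id}


status, body = submit_normals_job(
    {"source_dataset_id": "chirps3_precipitation_daily_sle", "period": [1991, 2020], "smoothing_window": 31}
)
```

The caller gets the job ID back immediately, and a background worker would later pick up the stored request and run the normals computation.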

### 1.4 ERA5 daily aggregation step

Before computing ERA5 normals, a temporal aggregation step produces daily values from hourly. This daily aggregate is a publishable dataset in its own right:

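```python
# ds is the hourly ERA5 dataset, opened with a datetime `time` coordinate.
daily_mean = ds.resample(time="1D").mean()    # temperature
# Summing assumes each hourly value is that hour's own total, not a running accumulation.
daily_sum  = ds.resample(time="1D").sum()     # precipitation (accumulated)
```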

Dataset ID example: `era5land_temperature_daily_sle`
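Because temperature and precipitation need different reducers, the aggregation step could take a per-variable mapping. The sketch below assumes the variable names `t2m` and `tp` (placeholders for whatever the hourly store uses) and assumes hourly precipitation values are per-hour totals rather than running accumulations. `aggregate_hourly_to_daily` and `DAILY_REDUCERS` are illustrative names, not existing project code.

```python
import numpy as np
import xarray as xr

# Assumed mapping from variable name to daily reducer.
DAILY_REDUCERS = {"t2m": "mean", "tp": "sum"}


def aggregate_hourly_to_daily(ds: xr.Dataset, reducers: dict) -> xr.Dataset:
    """Resample each listed hourly variable to daily values with its own reducer."""
    daily = {}
    for name, how in reducers.items():
        resampled = ds[name].resample(time="1D")
        if how == "mean":
            daily[name] = resampled.mean()
        elif how == "sum":
            daily[name] = resampled.sum()
        else:
            raise ValueError(f"unsupported reducer {how!r} for variable {name!r}")
    return xr.Dataset(daily)


# Two days of synthetic hourly data: rising temperature, constant 0.5 mm/h rain.
hours = np.arange("2024-01-01T00", "2024-01-03T00", dtype="datetime64[h]").astype("datetime64[ns]")
hourly = xr.Dataset(
    {"t2m": ("time", np.arange(48.0)), "tp": ("time", np.full(48, 0.5))},
    coords={"time": hours},
)
daily = aggregate_hourly_to_daily(hourly, DAILY_REDUCERS)
```

Keeping the reducer in a mapping makes it harder to apply a mean to an accumulated variable by accident when more ERA5 variables are added.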

---

## Phase 2 — Anomalies

### 2.1 New dataset template kind: anomaly

```yaml
- id: chirps3_precipitation_daily_anomaly
  name: CHIRPS3 daily precipitation anomaly
  variable: precip_anomaly
  period_type: daily
  sync_kind: temporal
  sync_execution: append
  anomaly:
    source_dataset_id: chirps3_precipitation_daily
    normals_dataset_id: chirps3_precipitation_daily_normals
    method: absolute
```

`method` can be `absolute` (observed minus normal), `relative` (percentage), or `standardized` (z-score, requires standard deviation normals).
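The three methods differ only in how the departure from normal is scaled. A scalar sketch follows, with two assumptions that the issue does not state: `relative` is read as percentage departure from normal (not percent of normal), and a zero normal or zero standard deviation yields NaN. `apply_anomaly_method` is an illustrative name.

```python
import math
from typing import Optional


def apply_anomaly_method(observed: float, normal: float, method: str = "absolute",
                         std: Optional[float] = None) -> float:
    """Return the anomaly of `observed` relative to `normal` for one value."""
    departure = observed - normal
    if method == "absolute":
        return departure
    if method == "relative":
        # Assumed definition: percentage departure from normal.
        return math.nan if normal == 0 else 100.0 * departure / normal
    if method == "standardized":
        if std is None:
            raise ValueError("standardized anomalies need a standard deviation normal")
        return math.nan if std == 0 else departure / std
    raise ValueError(f"unknown anomaly method: {method!r}")


absolute = apply_anomaly_method(12.0, 10.0)                                # 12 - 10
relative = apply_anomaly_method(12.0, 10.0, method="relative")             # 100 * 2 / 10
zscore = apply_anomaly_method(12.0, 10.0, method="standardized", std=4.0)  # 2 / 4
```

The zero-normal case matters for daily precipitation in dry seasons, where a relative anomaly is undefined for many pixels.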

### 2.2 Anomaly computation

```python
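import xarray as xr

def compute_anomaly(source_zarr_path, normals_zarr_path, *, start, end) -> xr.Dataset:
    """Absolute anomaly (observed minus normal) for the inclusive range start..end."""
    ds      = xr.open_zarr(source_zarr_path).sel(time=slice(start, end))
    normals = xr.open_zarr(normals_zarr_path)   # shape (366, y, x)
    doy     = ds.time.dt.dayofyear
    # Vectorized lookup: one normals slice per time step, indexed by day of year.
    # Dataset subtraction matches variables by name, so the normals variable
    # needs the same name as the source variable.
    return ds - normals.sel(dayofyear=doy).drop_vars("dayofyear")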

xarray's vectorized label-based indexing handles this in one expression, with no looping over time steps.
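The day-of-year lookup can be checked on a tiny in-memory example. The data here is synthetic (the normal for each day of year is set to its own index) and is only meant to show how each time step picks up its matching normal.

```python
import numpy as np
import xarray as xr

# Synthetic normals: the "normal" for each day of year is simply its index.
doys = np.arange(1, 367)
normals = xr.DataArray(doys.astype(float), coords={"dayofyear": doys}, dims="dayofyear")

# Three observed days spanning the 2024 leap day, all with value 100.
times = np.array(["2024-02-28", "2024-02-29", "2024-03-01"], dtype="datetime64[ns]")
observed = xr.DataArray([100.0, 100.0, 100.0], coords={"time": times}, dims="time")

# Indexing with a DataArray of day-of-year labels returns one normal per time step.
matched = normals.sel(dayofyear=observed.time.dt.dayofyear)
anomaly = observed - matched.drop_vars("dayofyear")
```

In 2024 these dates map to DOY 59, 60, and 61. In a non-leap year Mar 1 maps to DOY 60 instead, so after February the same calendar date is compared against a different row of the normals depending on the year. The 31-day smoothing makes that one-day offset small in practice.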

### 2.3 New POST /processes/anomaly endpoint

```
POST /processes/anomaly
{
  "source_dataset_id": "chirps3_precipitation_daily_sle",
  "normals_dataset_id": "chirps3_precipitation_daily_normals_sle",
  "start": "2024-01-01",
  "end": "2024-12-31"
}
```

---

## Phase 3 — Automatic cascade on sync

After `run_sync` completes with `status="completed"`, the sync engine checks the dataset registry for any anomaly datasets that declare this dataset as their `source_dataset_id`. If found, it computes and appends the anomaly for the new delta range.

```python
# In sync_engine.run_sync, after artifact is stored:
_trigger_downstream_anomaly_jobs(
    source_dataset_id=managed_dataset_id,
    delta_start=sync_detail.delta_start,
    delta_end=sync_detail.delta_end,
)
```

This implements the event-driven cascade from roadmap Step 3 without requiring a full workflow engine — for the anomaly case it is a direct background task call.

**Guard:** the cascade checks that the normals dataset exists before running. If it does not exist yet, it skips with a warning rather than failing the sync.
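The registry lookup and the guard could look roughly like the sketch below. The names `AnomalyTemplate`, `registry`, `published_ids`, and `enqueue` are hypothetical stand-ins for the real dataset registry, artifact index, and background task runner.

```python
import logging
from dataclasses import dataclass
from typing import Callable, Iterable, List, Set

logger = logging.getLogger("sync_engine")


@dataclass(frozen=True)
class AnomalyTemplate:  # hypothetical registry entry
    dataset_id: str
    source_dataset_id: str
    normals_dataset_id: str


def trigger_downstream_anomaly_jobs(source_dataset_id: str, delta_start: str, delta_end: str, *,
                                    registry: Iterable[AnomalyTemplate],
                                    published_ids: Set[str],
                                    enqueue: Callable[[str, str, str], None]) -> List[str]:
    """Enqueue one anomaly job per dependent dataset and return the IDs enqueued."""
    enqueued = []
    for template in registry:
        if template.source_dataset_id != source_dataset_id:
            continue
        if template.normals_dataset_id not in published_ids:
            # Guard: a missing normals dataset must not fail the sync itself.
            logger.warning("Skipping %s: normals dataset %s not found",
                           template.dataset_id, template.normals_dataset_id)
            continue
        enqueue(template.dataset_id, delta_start, delta_end)
        enqueued.append(template.dataset_id)
    return enqueued


calls = []
result = trigger_downstream_anomaly_jobs(
    "chirps3_precipitation_daily_sle", "2024-06-01", "2024-06-30",
    registry=[
        AnomalyTemplate("anomaly_a", "chirps3_precipitation_daily_sle", "normals_a"),
        AnomalyTemplate("anomaly_b", "chirps3_precipitation_daily_sle", "normals_missing"),
    ],
    published_ids={"normals_a"},
    enqueue=lambda dataset_id, start, end: calls.append((dataset_id, start, end)),
)
```

Here `anomaly_a` is enqueued for the delta range, while `anomaly_b` is skipped with a warning because its normals dataset has not been published yet.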

---

## Dependency order

```
1. ERA5 daily aggregation dataset template + compute step
2. Normals computation service + POST /processes/normals
3. Anomaly computation service + POST /processes/anomaly
4. Post-sync cascade hook
```

Steps 1–3 can be developed independently, although at run time an anomaly job still needs an existing normals dataset. Step 4 depends on all three.

---

## Deferred

- **Standardized anomalies** (z-score): requires standard deviation as a second normals variable. The `method: standardized` field is reserved in the template schema but not part of V1.
- **Monthly normals from CHIRPS daily**: trivially derived by resampling but a separate template.
- **Normal period update** (e.g. shifting to 2001–2030 in 2031): the `period` field handles this but re-ingesting 30 years is heavyweight — out of scope for now.
- **OGC API Processes compliance**: the `/processes` endpoints are REST jobs for now. Full OGC API Processes (Parts 1–3) is the open question in roadmap Step 2.

