fix: scope cache files by extent_id and validate spatial coverage #88

Merged
turban merged 15 commits into main from fix/extent-scoped-cache-and-coverage-validation on May 9, 2026

Conversation

@turban (Contributor) commented May 9, 2026

Breaking changes

  • `CACHE_OVERRIDE` env var removed — deployments using `CACHE_OVERRIDE` to set the data directory must migrate to `data_dir` in `climate-api.yaml`.
  • `data_dir` required in `climate-api.yaml` — the API raises `ValueError` at startup if a config file is present but `data_dir` is not set.
  • `datasets_dir` renamed to `templates_dir` — any `climate-api.yaml` files using `datasets_dir` must be updated.

Summary

  • Directory isolation per instance: Each deployment must now declare `data_dir` in `climate-api.yaml`. This prevents silent data reuse across instances (e.g. WorldPop data for Sierra Leone being served to a Norway instance).
  • Data directory resolution: `data_dir` from config → XDG default (`~/.local/share/climate-api`). `CACHE_OVERRIDE` is gone.
  • `templates_dir` (renamed from `datasets_dir`): Points at user-supplied YAML templates. The new name reflects that it will cover both dataset and processing templates going forward.
  • Spatial coverage validation: `download_dataset` validates the request bbox against `extents.spatial.bbox` before hitting the provider, returning HTTP 400 with a clear message instead of a confusing provider error. Partial overlap is allowed; only requests that fall entirely outside the declared extent are rejected.
  • OGC-aligned extents in dataset templates: Replaced the custom `coverage:` field with an `extents:` block matching OGC API Collections (`extents.spatial.bbox` in `[xmin, ymin, xmax, ymax]` format, `extents.temporal` with `begin`, `trs`, and `resolution`). CHIRPS3 gets a restricted spatial bbox of `[-180, -50, 180, 50]`.
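
For readers who want to see the overlap rule concretely, here is a minimal sketch of a "reject only when fully outside" check on `[xmin, ymin, xmax, ymax]` boxes. It is an illustration, not the code merged in this PR: the names `bbox_overlaps` and `check_coverage` are hypothetical, the Norway bbox is approximate, a `ValueError` stands in for the HTTP 400 response, and boxes crossing the antimeridian are not handled.

```python
from typing import Sequence

# [xmin, ymin, xmax, ymax], the order used by extents.spatial.bbox
BBox = Sequence[float]

# Restricted CHIRPS3 extent quoted in the PR description
CHIRPS3_EXTENT = [-180, -50, 180, 50]


def bbox_overlaps(request: BBox, extent: BBox) -> bool:
    """Return True if the two boxes share any area or touch along an edge."""
    rxmin, rymin, rxmax, rymax = request
    exmin, eymin, exmax, eymax = extent
    return rxmin <= exmax and rxmax >= exmin and rymin <= eymax and rymax >= eymin


def check_coverage(request: BBox, extent: BBox) -> None:
    """Raise ValueError (stand-in for HTTP 400) when the request is fully outside."""
    if not bbox_overlaps(request, extent):
        raise ValueError(
            f"bbox {list(request)} is outside the dataset extent {list(extent)}"
        )


# Partial overlap (straddles 50°N) is accepted.
check_coverage([0, 40, 20, 60], CHIRPS3_EXTENT)

# An approximate Norway bbox lies entirely north of 50°N and is rejected.
try:
    check_coverage([4.0, 57.9, 31.2, 71.2], CHIRPS3_EXTENT)
except ValueError as exc:
    print(exc)
```

Treating touching edges as overlap is a choice made for this sketch; the PR text does not say how the merged validator handles that boundary case.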

Background

Discovered during Norway instance setup: WorldPop data downloaded for Sierra Leone was silently reused for Norway (wrong bounding box), and a CHIRPS3 ingest for Norway returned an HTTP 403 with no clear message. The root fix is directory isolation per instance via `data_dir`.
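
The resolution rule described above (config `data_dir` when a config is present, XDG default otherwise) can be sketched roughly as follows. This is a simplified illustration and not the merged `get_data_dir()`: the function name `resolve_data_dir`, its parameters, and the choice to resolve relative paths against the config file's directory are all assumptions made for the example.

```python
from pathlib import Path
from typing import Optional

# XDG default quoted in the PR description
XDG_DEFAULT = Path.home() / ".local" / "share" / "climate-api"


def resolve_data_dir(config: Optional[dict], config_path: Optional[Path] = None) -> Path:
    """Pick the data directory for one instance.

    config is the parsed climate-api.yaml, or None when no config file is used.
    """
    if config is None:
        # No config file at all: fall back to the XDG default.
        return XDG_DEFAULT

    data_dir = config.get("data_dir")
    if not data_dir:
        # A config file exists but does not isolate the instance: fail loudly.
        raise ValueError("data_dir must be set in climate-api.yaml")

    path = Path(data_dir).expanduser()
    if not path.is_absolute() and config_path is not None:
        # Assumption: relative paths are resolved against the config file's directory.
        path = config_path.parent / path
    return path
```

With this shape, a Norway instance and a Sierra Leone instance that each declare their own `data_dir` can no longer end up reading the same download directory by accident.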

Test plan

  • `test_get_data_dir_*` — required field, relative path resolution, no-config fallback
  • `test_resolve_download_dir_*` / `test_resolve_artifacts_dir_*` — config data_dir, XDG fallback
  • `test_get_cache_prefix_uses_dataset_id`
  • `test_validate_spatial_coverage_*` — OGC bbox validation cases
  • `test_download_dataset_returns_400_when_bbox_outside_dataset_extents`
  • `test_templates_dir_*` — custom template loading, relative path resolution, override by id
  • All existing tests pass (164 passed, 1 skipped)

turban added 14 commits May 9, 2026 16:04
Separate deployment instances (e.g. Norway vs Sierra Leone) sharing
the same DOWNLOAD_DIR would silently reuse each other's NetCDF/Zarr
cache files because the prefix was keyed only on dataset id. Add an
optional extent_id suffix so each extent gets its own cache namespace.
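
As a rough illustration of the namespacing idea in this commit, an optional suffix on the cache prefix might look like the sketch below. The function name `cache_prefix` and the underscore separator are hypothetical; the PR does not show the actual filename format.

```python
from typing import Optional


def cache_prefix(dataset_id: str, extent_id: Optional[str] = None) -> str:
    """Build a cache filename prefix, adding the extent only when one is given."""
    # Hypothetical format: "<dataset_id>_<extent_id>", or just "<dataset_id>".
    return f"{dataset_id}_{extent_id}" if extent_id else dataset_id
```

Two extents of the same dataset then map to different prefixes, so their NetCDF/Zarr files no longer collide in a shared directory.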

Validate bbox against a dataset's declared coverage field before
downloading, returning HTTP 400 early instead of a confusing
provider-level error. Add coverage: {lat: [-50, 50]} to chirps3.yaml
since CHIRPS3 does not cover latitudes above 50°N (e.g. Norway).

Aligns dataset YAML schema with OGC API Collections by replacing the
custom coverage.lat/lon block with extents.spatial.bbox (OGC [xmin,
ymin, xmax, ymax] format) and adding extents.temporal with begin, end,
trs, and resolution fields.

_validate_spatial_coverage now reads extents.spatial.bbox directly,
which covers both axes in one check without separate lat/lon keys.
All three dataset templates receive extents blocks.

Each configured instance must now declare data_dir in climate-api.yaml.
The API raises a clear error at startup if a config file is present but
data_dir is not set, rather than silently falling back to a shared XDG
directory that another instance might also use.

Resolution order for the data directory:
1. CACHE_OVERRIDE env var — preserved for Docker/CI backward compat
2. data_dir from CLIMATE_API_CONFIG — required when config is present
3. XDG default — only used when no config file is configured

Extent_id remains in cache filenames to support future multi-extent
configurations within a single instance.
…lback

Data directory resolution now uses data_dir from climate-api.yaml (required
when a config file is present) with a clean XDG fallback. The legacy
CACHE_OVERRIDE environment variable is gone from all resolver functions,
tests, and .env.example.

Clarifies the distinction from data_dir (runtime storage) — templates_dir
points at user-supplied YAML templates and will cover both dataset and
processing templates going forward.

templates_dir now acts as a root directory. Dataset templates go in
templates_dir/datasets/, leaving room for processing/ and other template
types alongside it without structural changes.

Copilot AI left a comment

Pull request overview

Updates Climate API instance isolation by requiring an explicit per-deployment data_dir, removes the legacy CACHE_OVERRIDE env var, and adds dataset-template extents metadata to validate request bounding boxes before calling upstream providers.

Changes:

  • Add data_dir config resolution (required when a config file exists) and migrate download/artifact/pygeoapi directories to be scoped under it (with XDG fallback when no config is present).
  • Rename datasets_dir → templates_dir and change custom template loading to templates_dir/datasets/.
  • Introduce dataset template extents and validate ingestion request bboxes against extents.spatial.bbox (HTTP 400 when fully outside).

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| src/climate_api/config.py | Adds get_data_dir() to resolve required instance-scoped data directory. |
| src/climate_api/data_manager/services/downloader.py | Uses data_dir for downloads and adds spatial coverage validation based on template extents. |
| src/climate_api/ingestions/services.py | Moves artifacts directory under data_dir (XDG fallback otherwise). |
| src/climate_api/publications/services.py | Moves pygeoapi working directory under data_dir (XDG fallback otherwise). |
| src/climate_api/data_registry/services/datasets.py | Renames config key to templates_dir and loads custom datasets from templates_dir/datasets/. |
| src/climate_api/data/datasets/chirps3.yaml | Adds OGC-aligned extents metadata (with restricted spatial bbox). |
| src/climate_api/data/datasets/era5_land.yaml | Adds OGC-aligned extents metadata for both ERA5-Land templates. |
| src/climate_api/data/datasets/worldpop.yaml | Adds OGC-aligned extents metadata for WorldPop. |
| tests/conftest.py | Updates test instance config to include required data_dir. |
| tests/test_config.py | Adds tests for get_data_dir() behavior and updates to templates_dir semantics. |
| tests/test_downloader.py | Updates directory resolution tests and adds coverage validation tests. |
| tests/test_publications.py | Updates pygeoapi dir resolution tests to use data_dir. |
| tests/test_datasets.py | Adjusts test setup for Zarr path monkeypatching. |
| docs/setup_guide.md | Documents required data_dir and updates references to templates_dir. |
| docs/adding_custom_datasets.md | Updates guidance for templates_dir and documents extents blocks. |
| climate-api.yaml.example | Updates example config: adds required data_dir and renames templates_dir. |
| .env.example | Removes CACHE_OVERRIDE and updates comments for templates_dir. |

Comments suppressed due to low confidence (1)

src/climate_api/data_manager/services/downloader.py:76

  • Spatial coverage validation is performed before the effective bbox is resolved via _resolve_bbox(). If bbox is None (e.g. relying on DOWNLOAD_BBOX), _validate_spatial_coverage() will currently skip validation and the request can still reach the provider. Consider resolving the bbox (when applicable) first and validating the resolved bbox so all request paths get the intended early 400 response.
    _validate_spatial_coverage(dataset, bbox)
    ingestion = dataset["ingestion"]
    eo_download_func_path = ingestion["function"]
    eo_download_func = _get_dynamic_function(eo_download_func_path)
    before_files = {path.resolve(): path.stat().st_mtime_ns for path in get_cache_files(dataset)}

    params = dict(ingestion.get("default_params", {}))
    params.update(
        {
            "start": start,
            "end": end or datetime.date.today().isoformat(),
            "dirname": DOWNLOAD_DIR,
            "prefix": _get_cache_prefix(dataset),
            "overwrite": overwrite,
        }
    )

    sig = inspect.signature(eo_download_func)
    try:
        if "bbox" in sig.parameters:
            params["bbox"] = _resolve_bbox(bbox=bbox)
        if "country_code" in sig.parameters:

Comment thread src/climate_api/config.py Outdated
Comment thread src/climate_api/data_manager/services/downloader.py
- Validate request bbox against extents using the env fallback (DOWNLOAD_BBOX)
  when no explicit bbox is provided, so coverage checks apply to all request paths
- Guard against malformed template extents.spatial.bbox (non-list or wrong
  length) to avoid a 500 on user-supplied templates
- Update get_data_dir() docstring to accurately describe the None-on-missing-file
  behaviour introduced for CI safety
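
The first two points of this follow-up commit can be sketched as below: resolve the effective bbox first (explicit argument, else the `DOWNLOAD_BBOX` fallback), and treat a malformed `extents.spatial.bbox` as "no usable extent" rather than letting it raise. This is an illustration only. The helper names, the comma-separated `DOWNLOAD_BBOX` format, and the decision to skip validation on a malformed template are assumptions, not details confirmed by the PR.

```python
import os
from typing import Mapping, Optional, Sequence


def effective_bbox(
    bbox: Optional[Sequence[float]], env: Optional[Mapping[str, str]] = None
) -> Optional[list]:
    """Return the explicit bbox, else DOWNLOAD_BBOX parsed as 'xmin,ymin,xmax,ymax'."""
    if bbox is not None:
        return [float(v) for v in bbox]
    raw = (os.environ if env is None else env).get("DOWNLOAD_BBOX")
    if not raw:
        return None
    # Assumed format: four comma-separated numbers.
    return [float(v) for v in raw.split(",")]


def extent_bbox(dataset: dict) -> Optional[list]:
    """Return extents.spatial.bbox if it is a list of four numbers, else None."""
    extents = dataset.get("extents")
    spatial = extents.get("spatial") if isinstance(extents, dict) else None
    bbox = spatial.get("bbox") if isinstance(spatial, dict) else None
    if not isinstance(bbox, (list, tuple)) or len(bbox) != 4:
        return None
    if not all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in bbox):
        return None
    return list(bbox)
```

A caller would validate only when both helpers return a value, so a request that relies on the environment fallback gets the same early check as one that passes a bbox explicitly.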
@turban merged commit 4882b31 into main on May 9, 2026
1 check passed