Problem
Ingesting large historical time ranges is slow, fragile, and opaque:
Downloads run synchronously — POST /ingestions blocks until complete, timing out HTTP connections on multi-year backfills
A transient network error aborts the entire ingestion with no automatic retry
There is no visibility into progress or estimated completion time
The same problem exists for POST /processes/{id}/execution — both share the same blocking dispatch pattern
For dataset-specific timing, sizes, and failure modes that motivate this issue, see #64.
Decision
There should be a single /processes endpoint that is OGC API Processes compatible and supports async job execution. The current split — a FastAPI router at /processes for execution and pygeoapi at /ogcapi/processes for listing — should be collapsed into one.
The implementation should be native FastAPI, not pygeoapi. pygeoapi's BaseProcessor plugin interface is a poor fit for operational processes like ingest and sync, its job manager (TinyDB-backed) would become a parallel persistence layer alongside our own artifact records, and implementing the OGC API Processes standard natively for the subset we need is straightforward. pygeoapi's role shrinks to serving /collections (OGC API Coverages) only.
Both ingestion and process execution should use the same job abstraction. Designing them separately risks two diverging async patterns that plugin authors have to understand independently.
Current state
FastAPI has a /processes/{id}/execution route that executes synchronously and returns status: "completed"
pygeoapi is mounted at /ogcapi and provides GET /ogcapi/processes (listing) and GET /ogcapi/processes/{id} (description)
/ingestions and /sync are separate synchronous REST endpoints with no shared job lifecycle
Target state
A single /processes endpoint, implemented natively in FastAPI, that is OGC API Processes compliant:
GET /processes → list all processes (ingest, sync, resample, …)
GET /processes/{id} → process description
POST /processes/{id}/execution → submit job (async by default)
GET /jobs → list all jobs
GET /jobs/{job_id} → job status (accepted | running | successful | failed)
GET /jobs/{job_id}/results → output once done
DELETE /jobs/{job_id} → cancel job
All long-running operations — ingestion, sync, resample, and any custom processes — are modelled as processes:
POST /processes/ingest/execution { "dataset_id": "...", "start": "...", "end": "..." }
POST /processes/sync/execution { "dataset_id": "..." }
POST /processes/resample/execution { "source_dataset_id": "...", ... }
Async execution follows the OGC standard: Prefer: respond-async → 202 Accepted with Location: /jobs/{job_id}. The existing synchronous behaviour can be kept as a fallback (no Prefer header → block and return result directly) for simple clients.
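The header negotiation described above could be sketched roughly as follows. This is framework-free Python for illustration only: `wants_async`, `submit`, the in-memory `jobs` dict, and the `run` callable are hypothetical names, and actually scheduling the background run is left out.

```python
import uuid


def wants_async(prefer_header):
    """Return True when a Prefer header value asks for respond-async."""
    if not prefer_header:
        return False
    # Prefer may carry several comma-separated preferences, e.g. "respond-async, wait=10".
    tokens = [token.strip().lower() for token in prefer_header.split(",")]
    return "respond-async" in tokens


def submit(process_id, inputs, prefer_header, run, jobs):
    """Hypothetical dispatch returning (status_code, headers, body)."""
    if wants_async(prefer_header):
        job_id = uuid.uuid4().hex
        jobs[job_id] = {
            "job_id": job_id,
            "process_id": process_id,
            "status": "accepted",
            "inputs": inputs,
        }
        # A real handler would also hand the job to a background runner here.
        return 202, {"Location": f"/jobs/{job_id}"}, jobs[job_id]
    # Fallback for simple clients: no Prefer header, so block and return the result.
    return 200, {}, run(process_id, inputs)
```

In a FastAPI route the same branching would sit inside the `POST /processes/{id}/execution` handler, with the tuple mapped onto a response object.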
Both ingestion functions and process execution functions accept an on_progress callback with a no-op default. Breaking changes are acceptable at this stage.
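A minimal sketch of what such a contract might look like. The callback signature `(done, total, message)` and the `ingest` body are assumptions made for illustration; the issue does not fix them.

```python
def _noop(done, total, message=""):
    """Default progress callback: ignore every update."""


def ingest(dataset_id, periods, on_progress=_noop):
    """Hypothetical ingestion loop that reports after each period."""
    total = len(periods)
    for done, period in enumerate(periods, start=1):
        # ... download and write `period` here ...
        on_progress(done, total, f"Downloaded {period}")
    return {"dataset_id": dataset_id, "periods": total}
```

A dispatcher would pass a callback that writes into the job record, while tests and scripts can call `ingest("demo", ["2005-01"])` with no callback at all.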
Retry
Retry happens at the whole-job level in the dispatcher. This works because ingestion functions already skip existing files when overwrite=False and validate cached files before treating them as complete (see #64), so a retry re-downloads only what failed, not the entire range.
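Whole-job retry could be sketched like this. `run_with_retry`, `max_attempts`, and `backoff_seconds` are hypothetical, and the job is a plain dict standing in for whatever job record the dispatcher ends up using.

```python
import time


def run_with_retry(fn, kwargs, job, max_attempts=3, backoff_seconds=0.0):
    """Call fn(**kwargs) up to max_attempts times, recording state on the job dict."""
    for attempt in range(1, max_attempts + 1):
        job["attempt"] = attempt
        job["status"] = "running"
        try:
            job["result"] = fn(**kwargs)
        except Exception as exc:
            # Assumption: every exception is retried. A real dispatcher would
            # probably retry only errors it considers transient.
            job["error"] = str(exc)
            if attempt == max_attempts:
                job["status"] = "failed"
                return job
            time.sleep(backoff_seconds * attempt)
        else:
            job["status"] = "successful"
            job.pop("error", None)
            return job
    return job
```

Because the wrapped function is simply called again with the same arguments, this only avoids redundant downloads if the function itself skips files that are already present and valid.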
Pre-download estimates
A GET /processes/ingest/estimate endpoint (or a dedicated field in the job submission response) could surface estimated download size and duration before the job starts. The approach for computing estimates (fsspec HEAD requests and Zarr's getsize_prefix() for remote Zarr sources, file-count heuristics for others) is documented in #64.
Implementation notes
FastAPI BackgroundTasks is sufficient for the MVP: submit, return 202, run in background, persist state to JSON
The background_tasks=None argument already exists in create_artifact — the plumbing is partially there
Job state (accepted / running / successful / failed + error) can be stored in a jobs.json file alongside records.json
For production, a persistent task queue (ARQ, Celery) replaces BackgroundTasks without changing the API surface
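For the MVP, the JSON-backed job state mentioned above might look roughly like this. `JsonJobStore` and its methods are hypothetical, and the sketch assumes a single writer, so it does no locking.

```python
import json
import os
import tempfile
from pathlib import Path


class JsonJobStore:
    """Hypothetical job store that keeps all jobs in one JSON file."""

    def __init__(self, path):
        self.path = Path(path)

    def load(self):
        if not self.path.exists():
            return {}
        return json.loads(self.path.read_text())

    def save(self, job):
        jobs = self.load()
        jobs[job["job_id"]] = job
        # Write to a temp file and rename it, so a crash mid-write
        # cannot leave a truncated jobs.json behind.
        fd, tmp_name = tempfile.mkstemp(dir=self.path.parent, suffix=".tmp")
        with os.fdopen(fd, "w") as handle:
            json.dump(jobs, handle, indent=2)
        os.replace(tmp_name, self.path)

    def get(self, job_id):
        return self.load().get(job_id)
```

Swapping this for a task queue's result backend later would not need to change what `GET /jobs/{job_id}` returns, which is the point of keeping the store behind a small interface.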
Extensibility
Custom processes follow the same plugin pattern as dataset templates. User-supplied process YAML files live in plugins_dir/processes/ and are merged with the built-ins — a custom process with the same id overrides the built-in.
Processes can be flagged as internal so they run server-side (e.g. triggered by a post-sync cascade) without appearing in the public OGC API catalogue:
ogcapi:
  expose: false
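The merge and visibility rules could be sketched as below, operating on process definitions that have already been parsed into dicts (YAML loading is omitted). The function names are hypothetical, and treating a missing `ogcapi.expose` as `True` is an assumption.

```python
def merge_processes(builtins, custom):
    """Merge process definitions by id; a custom process overrides a built-in."""
    merged = {process["id"]: process for process in builtins}
    merged.update({process["id"]: process for process in custom})
    return merged


def public_catalogue(processes):
    """Return only the processes that should appear in the public listing."""
    return [
        process
        for process in processes.values()
        # Assumption: processes are exposed unless they opt out explicitly.
        if process.get("ogcapi", {}).get("expose", True)
    ]
```

Internal processes stay in the merged registry, so server-side triggers can still look them up by id even though the listing hides them.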
Relationship to sync_kind: derived
Dataset templates can reference a process by id to produce a derived artifact on sync:
sync_kind: derived
processing:
  process_id: resample
  params:
    freq: MS
This decouples the dataset definition (what to produce and when) from the process definition (how to produce it). The same process can back multiple derived dataset templates.
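As a rough illustration of that decoupling, a sync step might resolve a derived template to its process like this. The nesting of `process_id` and `params` under `processing` is an assumed reading of the template snippet, and `resolve_derived` is a hypothetical helper.

```python
def resolve_derived(template, processes):
    """Return (process, params) for a derived template, or None otherwise."""
    if template.get("sync_kind") != "derived":
        return None
    processing = template["processing"]
    # The template only names the process; how it runs lives in the registry.
    process = processes[processing["process_id"]]
    return process, processing.get("params", {})
```

Two templates that differ only in `params` (say, different `freq` values) would resolve to the same process entry.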
Out of scope for now
Provider-side optimisation (batch period requests where the API supports it)
Full OGC API Processes async compliance (Parts 1–3)
Open questions
Runtime registration — should there be a POST /processes endpoint for registering a process without a server restart (OGC API Processes Part 2 — Deploy), or is file-based registration sufficient for the target user base?
Process versioning — if a custom process function changes, do existing derived artifacts need to be marked stale?
Job progress
Job progress is returned on GET /jobs/{job_id}:
{ "job_id": "abc123", "status": "running", "attempt": 1, "progress": { "done": 180, "total": 437, "percent": 41, "message": "Downloaded 2005-01" } }
Jobs are persisted to disk so a server restart does not lose their state.
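One way the progress document might be assembled from a `done`/`total` pair is sketched below. `progress_payload` is a hypothetical helper, and rounding the percentage down is an assumption that happens to match the example values.

```python
def progress_payload(job_id, status, attempt, done, total, message=""):
    """Build a job status document in the shape shown for GET /jobs/{job_id}."""
    # Guard against total == 0 (nothing to download) to avoid dividing by zero.
    percent = int(100 * done / total) if total else 0
    return {
        "job_id": job_id,
        "status": status,
        "attempt": attempt,
        "progress": {
            "done": done,
            "total": total,
            "percent": percent,
            "message": message,
        },
    }
```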
Consequences
The /ingestions and /sync endpoints become legacy and should eventually be removed
/ogcapi/processes is superseded by the native implementation; pygeoapi's role shrinks to serving /collections (OGC API Coverages)
Regarding the /processes path conflict described in #110 (feat: expose OGC API paths at top level alongside /stac and /zarr): there is only one processes surface
Function contract
The no-op default for on_progress means functions remain directly callable in tests and scripts without a job store.
Resume support
Resume relies on two prerequisites, both specified in #64:
overwrite=False skips already-downloaded files
File validation (see #64, Streaming ingest and sync: per-period Icechunk writes, plugin contract, no intermediate files) ensures corrupt stubs are re-downloaded rather than silently reused
Related
#110: the /processes conflict described there