A self-contained uptime tracker for Virtual Fly Brain public services. Probes every endpoint listed in config/services.yml hourly and on demand, and renders a single status page.
Built after a data.virtualflybrain.org outage to give us one URL to glance at when a user reports something is broken.
The full list lives in config/services.yml. It covers the main site, v2/v3 viewers, the data and nas0 file servers, PDB / KB Neo4j browsers, Owlery, the owl cache, SOLR, VFBquery, the VFB3 MCP server, IIP3D, R/NBLAST, and every hosted CATMAID instance (fafb, l1em, vnc1, fanc, l3vnc, abd1-5, iav-tnt, iav-robo, igor). Add or remove entries by editing the YAML and restarting the container.
Each check is a plain HTTPS GET. A service is considered up when the status code is in its expect_status list and (where configured) the response body contains the expect substring. Defaults follow the rancher-compose health_check.request_line entries in VirtualFlyBrain/RancherServices where one exists.
docker compose up -dThen open http://localhost:8000/. Hit Refresh now to force an immediate probe of every endpoint; otherwise checks run automatically every hour.
The page auto-refreshes every 60 s. Endpoints:
GET /— status page with per-service status strip and 24 h / 7 d / 30 d uptime %GET /api/status— JSON of the latest resultsGET /api/uptime— per-service uptime % over 24 h / 7 d / 30 dGET /api/history?service=<name>&limit=200— recent raw history rows for one serviceGET /api/cache— latest snapshot of every cache service (hit/miss counts, connections, hit rate)GET /api/cache/history?service=<name>&since_seconds=86400&max_points=200— down-sampled cache time seriesGET /api/app— latest snapshot of every application service (shape: vfbqueryetc.)GET /api/app/history?service=<name>&since_seconds=86400&max_points=200— down-sampled app time seriesGET /api/neo4j— latest snapshot per Neo4j database, includingdb_status,node_count, and any Neo4j start-up errorGET /api/neo4j/history?service=<name>&since_seconds=86400&max_points=200— down-sampled Neo4j time seriesGET /api/cluster— Rancher cluster overview: hosts (with LB-coverage and DNS-ingress flags) and active services in the configured stacksGET /healthz— liveness for Rancher / DockerPOST /refresh— force an immediate re-probe of every service
Every probe writes a row to a SQLite database (HISTORY_DB, default /data/history.db). The schema is one append-only history table indexed by (service, ts). On a mounted volume this gives you long-term history across container restarts and image rebuilds — open it with any SQLite client for ad-hoc queries.
SELECT service, status, http_status, latency_ms, error, checked_at
FROM history
WHERE service = 'data.virtualflybrain.org (file server)'
AND ts >= strftime('%s', 'now', '-7 days')
ORDER BY ts DESC;Rows older than HISTORY_RETENTION_DAYS are pruned on startup and once a day. Set HISTORY_RETENTION_DAYS=0 to keep forever.
The status strip on the page renders the most recent HISTORY_BUCKETS buckets (default 72 hours = 3 days), oldest left. Reduction rule per bucket: any down → bucket is red; otherwise any up → bucket is green; otherwise grey (no data).
Caches running virtualflybrain/owl_cache ≥1.1.22 expose /status with JSON describing the nginx cache hit/miss counters and connection breakdown. List them under cache_services: in config/services.yml:
cache_services:
- name: VFB3 cache (v3-cached.virtualflybrain.org)
status_url: https://v3-cached.virtualflybrain.org/status
fronts: vfbquery.virtualflybrain.org
- name: IIP3D cache (iip3d.virtualflybrain.org)
status_url: https://iip3d.virtualflybrain.org/status
verify_tls: falseEach entry is probed on the same schedule as the regular endpoints; the parsed metrics go into the cache_history table. The page renders a card per cache with the latest snapshot plus two inline sparklines: active connections (load proxy) and Δ cache_total per check (request-rate proxy). Cache endpoints whose /status is unreachable still record an error row, which makes it easy to see when older 1.1.20 images get upgraded.
A bare LB-fronted probe of https://<host>/status only ever shows one random backend's view, which is misleading when a service runs more than one container. Add an optional rancher: block to each cache_services entry to enumerate the service's running instances via the Rancher v1 API and probe each container's /status directly:
cache_services:
- name: VFB3 cache (v3-cached.virtualflybrain.org)
status_url: https://v3-cached.virtualflybrain.org/status # fallback if API fails
rancher:
service_url: https://herd.virtualflybrain.org/v2-beta/projects/1a5/services/1s337
container_port: 80
container_path: /statusRequired env on the container: RANCHER_API_KEY and RANCHER_API_SECRET (environment-scoped read-only key from the Rancher UI). Per-container rows land in cache_history with the container column populated; the LB fallback rows have container = NULL. The page card aggregates across containers for the summary stats and sparklines, and shows a per-container breakdown table below.
Networking requirement: the deployed status container must be on the Rancher overlay network so that 10.42.x.x primaryIpAddress values are reachable. If they're not, the probe logs a warning and falls back to the LB-fronted single probe automatically.
When RANCHER_API_KEY / RANCHER_API_SECRET are set, the page renders a "Rancher cluster" section above the regular groups showing:
- Host coverage: N of M hosts active, how many run the load balancer, how many are pointed at by public DNS
- Service health: N of M
state=activeservices healthy in the configured stacks. Inactive services are filtered out (they're meant to be stopped, so they're not signal). A table lists every service wherecurrentScale < scaleorhealthState != healthy.
Each row in the existing Rancher servers group also gains three badges sourced from this overview:
rancher: active/rancher: inactive— host'sstatefrom the Rancher API (orthogonal to the:5050HTTP check, which only tells you the host's HTTP listener is up).LB— the activevfb-loadbalancer-mainis running on this host.DNS— the host'sagentIpAddressis in the A-records forRANCHER_DNS_HOSTNAME(defaultvirtualflybrain.org). Dynamic — re-evaluated every probe via a freshgetaddrinfolookup, so DNS changes are picked up automatically.
Configure via env vars:
| Variable | Default | Purpose |
|---|---|---|
RANCHER_PROJECT_URL |
https://herd.virtualflybrain.org/v2-beta/projects/1a5 |
Project endpoint on the Rancher v1 API. |
RANCHER_STACKS |
vfb-services-live |
Comma- or whitespace-separated stack names to include in the service overview. Leave tight to avoid dev-stack noise. |
RANCHER_DNS_HOSTNAME |
virtualflybrain.org |
Hostname whose A-records mark a host as DNS-ingress. |
RANCHER_DNS_HOSTS |
(empty) | Static override of the DNS-ingress flag (comma-separated short host names). Leave empty for dynamic DNS detection. |
Application services with their own JSON /status go under app_services:. Each entry declares its shape:, which selects the parser and renderer. Currently supported:
shape: vfbquery— extractsworkers,max_concurrent,max_queue_depth,active,waiting,total_served,cache_size,cache_hits,coalesced_total,coalesced_in_flight,scanner_probes_blocked,solr_cache.enabled.
app_services:
- name: VFBquery (vfbquery.virtualflybrain.org)
status_url: https://vfbquery.virtualflybrain.org/status
shape: vfbqueryMetrics persist in the app_history table. The page card shows the live counters plus inline sparklines for active requests, queued requests, and Δ total_served per check (request rate). Adding a new shape is a small change: define how to parse the fields you care about and how to render them.
The /browser/ endpoint of a Neo4j server returns 200 even when its database is empty or refusing to start. The neo4j_services: block runs a two-stage Cypher probe to catch the silent-failure case:
SHOW DATABASESagainst/db/system/tx/commit— verifiescurrentStatus == "online"for the target database. If the database is offline, Neo4j's own start-up error is captured (e.g."Unable to start DatabaseId{...[neo4j]}.") and shown on the page.MATCH (n) RETURN count(n)against/db/{db}/tx/commit— verifies the node count is at leastmin_nodes.
Credentials come from environment variables — set NEO4J_PASSWORD on the container (or per-service password_env: if PDB and KB diverge). The YAML never contains the password.
neo4j_services:
- name: PDB Neo4j (pdb.virtualflybrain.org)
base_url: https://pdb.virtualflybrain.org
db: neo4j
user: neo4j
password_env: NEO4J_PASSWORD
min_nodes: 1000000 # alert if the load truncated the graphThe page card shows the current node count, the configured min_nodes, Neo4j's reported db_status, and a sparkline of node count over time (so big-bang DB rebuilds are visible). Results persist in the neo4j_history table.
Environment variables (set in docker-compose.yml):
| Variable | Default | Purpose |
|---|---|---|
CHECK_INTERVAL_SECONDS |
3600 |
Seconds between scheduled probe runs. |
CONFIG_PATH |
config/services.yml |
Path to the service list inside the container. |
STATE_FILE |
/data/state.json |
Optional. Persists last-known results across restarts. Unset = in-memory only. |
HISTORY_DB |
/data/history.db |
SQLite database for long-term probe history. On by default — set to "" to disable. Mount /data (or whatever path you choose) as a persistent volume so history survives container rebuilds. If the path isn't writable, history is disabled automatically and a clear error is logged. |
HISTORY_RETENTION_DAYS |
365 |
Rows older than this are pruned on startup and once a day. 0 = keep forever. |
HISTORY_BUCKETS |
72 |
Number of buckets in the status strip on the page. |
HISTORY_BUCKET_SECONDS |
3600 |
One bucket = this many seconds. Default 1 h × 72 = 3 days of history visible inline. |
RANCHER_SERVERS |
(empty) | Comma- or whitespace-separated list of short hostnames. Each $NAME is probed at http://$NAME.$RANCHER_DOMAIN:$RANCHER_PORT. Synthesised into a separate "Rancher servers" group on the page. Mirrors the existing VFB shell check that hits :5050 on each node. |
RANCHER_DOMAIN |
inf.ed.ac.uk |
Domain suffix appended to each RANCHER_SERVERS name. |
RANCHER_PORT |
5050 |
Port to probe on each rancher server. |
RANCHER_TIMEOUT |
5 |
Per-request timeout in seconds for rancher-server probes. |
Rancher checks only work from inside the Edinburgh network. Port 5050 on inf.ed.ac.uk is dropped at the Informatics firewall — externally only 80, 443 and 7687 are reachable. If you run this container off-campus, leave RANCHER_SERVERS empty.
The config/ directory is mounted read-only into the container, so you can edit services.yml on the host and docker compose restart vfb-status to pick up changes.
defaults:
timeout: 10 # seconds, per request
method: GET
expect_status: [200, 301, 302] # acceptable HTTP status codes
groups:
- name: Core user-facing
services:
- name: Friendly label shown in the UI
url: https://example.virtualflybrain.org/
timeout: 20 # overrides default
expect_status: [200] # overrides default
expect: "Some response body substring" # optional content checkpython -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
uvicorn app.main:app --reloadTests are deliberately not included — the value here is the live probes against production. To dry-run the config, run the app locally; the first hourly tick fires immediately on startup.
Same pattern as the rest of the VFB stack: build the image, push to Docker Hub (virtualflybrain/vfb-status:latest), and point a rancher-compose service at it. The container exposes port 8000 and GET /healthz for the LB.
MIT. See LICENCE.