Reapply "feat: initial support for deploying multiple instances on th…#94
Conversation
…e same host. WIP" This reverts commit ed9278f.
radnov
left a comment
There was a problem hiding this comment.
Using Docker Swarm could be an alternative here. Docker Swarm in single-host mode is a somewhat common "built-in" pattern that implements this type of multi-instance deployment. One benefit is that if the users want to go from single to multi-host setup it would be already supported by Docker Swarm without extra effort.
Example:
docker swarm init
docker network create -d overlay --attachable proxy
docker network create -d overlay --attachable monitoring_netdocker stack deploy -c stacks/traefik/docker-compose.yml traefik
docker stack deploy -c stacks/monitoring/docker-compose.yml monitoring
docker stack deploy -c docker-compose.yml prod
docker stack deploy -c docker-compose.yml devWhat you get for free:
- Stack isolation (service names prefixed with stack name: prod_app, dev_app) replaces the ${COMPOSE_PROJECT_NAME}-app aliasing the
branch does manually- Overlay networks as external resources work identically (already how the branch models them)
- Docker secrets and configs are first-class, not compose workarounds
- Built-in rolling updates, deploy.update_config, rollback on health failure
- docker stack ls / docker stack ps gives the make list-instances you would otherwise write
- Scale services: docker service scale prod_app=3
- Placement constraints if you later add nodes (no code change to move from 1 host to 5)
What you lose vs current compose approach:
- depends_on.condition: service_healthy is NOT supported in swarm mode. The update-admin-password + create-monitoring-user chain
would need reworking (init jobs via docker service create one-shot, or application-level retry logic, or a separate docker compose
run step pre-deploy)- docker compose run --rm backup-database pattern does not translate; swarm has no ad-hoc run. Backup/restore would need to be docker
run invocations against the volume directly, or one-shot services- profiles: not supported. The backup / restore / docs profiles would need other compose files or separate invocations
- build: not supported - you must push images (not an issue here, all images are pulled)
- restart: unless-stopped replaced by deploy.restart_policy
- Loki driver + loki-external-labels works but per-container custom labels differ slightly
- No env_file for service env in older swarm versions (works in modern engines)
If we stick to the current approach, here is a long list of suggested improvements created with the help of Claude and curated by me.
16 and 17, as noted below, could simplify things a lot, if they work.
Note that I haven't tried running the changes localy.
Operational ergonomics
- Drop shared monitor credential. Per-instance creds in per-instance target JSON (Prometheus file_sd supports per-target basic_auth
via relabel_configs + labels, or switch to http_sd). Removes the "must match across all .env files" footgun.- Use include: in root docker-compose.yml to pull stacks/postgres/ + optional stacks/backup/. Collapses POSTGRES_COMPOSE_CMD /
BACKUP_COMPOSE_CMD into one COMPOSE_CMD.- Move envsubst into scripts/. Easier to test, cleaner quoting, avoids envsubst substituting unrelated shell vars that happen to
match template placeholders.- Add make list-instances / make status. Discoverability of running instances, which conf.d/targets files map to which project,
health summary.- Pre-flight checks in launch-instance. Fail fast if stacks/traefik/.env or stacks/monitoring/.env absent, if instances/.env
missing, if APP_HOSTNAME already used by another instance's route, if PROJECT_NAME collides with an existing project.- DRY_RUN=1 mode that prints the generated conf.d/.yml and targets/.json without starting containers.
- stop-instance safety. Check no running containers reference -db network before docker network rm. Currently 2>/dev/null ||
true hides legitimate errors.Correctness / race conditions
- Ordering during launch-instance. Write Prometheus targets first so scraping is ready, then write Traefik route after app
healthcheck passes - otherwise Traefik briefly 502s new traffic. Alternative: use loadBalancer.healthCheck in the route file so
Traefik gates traffic itself.- Middleware reload gap. middlewares.yml is static; changes require manual Traefik action depending on watch behavior. Document or
move per-instance-referenced middleware inline.- depends_on across stacks. Impossible in compose. Either document the ordering contract explicitly, or use an init container in
docker-compose.yml that blocks on pg_isready against the external network before the app starts. Handles the "user bypasses Makefile"
case.Security
- Replace inline configs.content with mounted file + Docker secrets for DHIS2_MONITOR_USERNAME too (currently only password is a
secret; username comes via ${VAR} substitution into config content, leaks into docker compose config output).- Drop root in create-monitoring-user. Pre-build an image alpine+jq+curl in a one-line Dockerfile or use alpine/curl + stedolan/jq.
Restores read_only: true + non-root user.- File permissions on generated files. conf.d/.yml and targets/.json currently inherit default umask. Chmod 600 or 640 since they
contain hostnames (not secret, but unnecessary to publish). instances/*.env already handled by generate-env.sh.- Credential rotation path. No script to rotate DHIS2_MONITOR_PASSWORD, POSTGRES_DB_PASSWORD, etc. without recreating the instance.
Worth at least documenting procedure.- ACME storage shared. cert:/cert is a single named volume in the Traefik stack - fine, but note that compromising Traefik exposes
all instance certs. Consistent with most single-proxy setups.Architecture alternatives worth testing before shipping
- Traefik docker provider + labels, instead of file provider. Remove envsubst entirely. Each instance's app service gets
traefik.http.routers..rule=Host(...). Traefik reads labels via docker socket. Requires docker-socket-proxy sidecar to avoid
giving Traefik /var/run/docker.sock (security). Pro: less templating, zero conf.d/ files. Con: introduces socket-proxy dependency.- Prometheus docker_sd_configs parallel to (16). Instances get prometheus.io/scrape=true labels, Prometheus discovers targets via
socket-proxy. Drops the targets/*.json files + envsubst step.Picking (16) + (17) simplifies the branch substantially - the entire conf.d/ + targets/ generation dance disappears, replaced by
labels on each service. Cost: one docker-socket-proxy container shared by Traefik and Prometheus.Testing
- Integration test: two instances, same host. Spin up prod + dev via Makefile, assert:
- Both reachable via distinct hostnames through shared Traefik
- Kill one Postgres → other instance unaffected
- Prometheus sees targets labelled instance=prod and instance=dev separately
- Grafana dashboards filter by instance variable
- stop-instance prod does not disrupt dev
This is the highest-leverage single addition - it validates the entire design claim.
Minor polishing
- Per-instance resource limits via env (APP_CPU_LIMIT, APP_MEM_LIMIT) - multi-instance on one host makes noisy-neighbor risk real.
- Backup retention rotation. ./backups/ grows forever. Add timestamped pruning policy or defer to host-level tooling and document.
- Grafana instance variable. Dashboards shipped in stacks/monitoring/config/grafana/dashboards/ need a $instance template variable
filter, otherwise multi-instance metrics collapse into single panels.- Consolidate application network. It's per-project (internal) but create-monitoring-user + update-admin-password + app are the
only members. Could drop and use project default bridge. Marginal simplification.
feat: per-project backup paths
* chore: doc updates and list-instances * Apply suggestions from code review - remove numbering in reference file Co-authored-by: Philip-Larsen-Donnelly <35666657+Philip-Larsen-Donnelly@users.noreply.github.com> * fix: review comments
Philip-Larsen-Donnelly
left a comment
There was a problem hiding this comment.
Looks Good.
I still think we should use the "detach flag" approach instead of the "compose opts" for the (default) detached state; but we can add that as another feature on the development-2.0 branch. (and maybe that action was on me, anyway 😬 )
So do I and I actually thought you already added it on the other branch. |
…e same host. WIP"
This reverts commit ed9278f.