Reapply "feat: initial support for deploying multiple instances on th… by tonsV2 · Pull Request #94 · dhis2/docker-deployment

tonsV2 · 2026-04-20T09:58:22Z

…e same host. WIP"

This reverts commit ed9278f.

…e same host. WIP" This reverts commit ed9278f.

radnov

Using Docker Swarm could be an alternative here. Docker Swarm in single-host mode is a somewhat common "built-in" pattern that implements this type of multi-instance deployment. One benefit is that if the users want to go from single to multi-host setup it would be already supported by Docker Swarm without extra effort.

Example:

docker swarm init
docker network create -d overlay --attachable proxy
docker network create -d overlay --attachable monitoring_net

docker stack deploy -c stacks/traefik/docker-compose.yml traefik
docker stack deploy -c stacks/monitoring/docker-compose.yml monitoring
docker stack deploy -c docker-compose.yml prod
docker stack deploy -c docker-compose.yml dev

What you get for free:

Stack isolation (service names prefixed with stack name: prod_app, dev_app) replaces the ${COMPOSE_PROJECT_NAME}-app aliasing the
branch does manually

Overlay networks as external resources work identically (already how the branch models them)

Docker secrets and configs are first-class, not compose workarounds

Built-in rolling updates, deploy.update_config, rollback on health failure

docker stack ls / docker stack ps gives the make list-instances you would otherwise write

Scale services: docker service scale prod_app=3

Placement constraints if you later add nodes (no code change to move from 1 host to 5)

What you lose vs current compose approach:

depends_on.condition: service_healthy is NOT supported in swarm mode. The update-admin-password + create-monitoring-user chain
would need reworking (init jobs via docker service create one-shot, or application-level retry logic, or a separate docker compose
run step pre-deploy)

docker compose run --rm backup-database pattern does not translate; swarm has no ad-hoc run. Backup/restore would need to be docker
run invocations against the volume directly, or one-shot services

profiles: not supported. The backup / restore / docs profiles would need other compose files or separate invocations

build: not supported - you must push images (not an issue here, all images are pulled)

restart: unless-stopped replaced by deploy.restart_policy

Loki driver + loki-external-labels works but per-container custom labels differ slightly

No env_file for service env in older swarm versions (works in modern engines)

If we stick to the current approach, here is a long list of suggested improvements created with the help of Claude and curated by me.

16 and 17, as noted below, could simplify things a lot, if they work.

Note that I haven't tried running the changes localy.

Operational ergonomics

Drop shared monitor credential. Per-instance creds in per-instance target JSON (Prometheus file_sd supports per-target basic_auth
via relabel_configs + labels, or switch to http_sd). Removes the "must match across all .env files" footgun.

Use include: in root docker-compose.yml to pull stacks/postgres/ + optional stacks/backup/. Collapses POSTGRES_COMPOSE_CMD /
BACKUP_COMPOSE_CMD into one COMPOSE_CMD.

Move envsubst into scripts/. Easier to test, cleaner quoting, avoids envsubst substituting unrelated shell vars that happen to
match template placeholders.

Add make list-instances / make status. Discoverability of running instances, which conf.d/targets files map to which project,
health summary.

Pre-flight checks in launch-instance. Fail fast if stacks/traefik/.env or stacks/monitoring/.env absent, if instances/.env
missing, if APP_HOSTNAME already used by another instance's route, if PROJECT_NAME collides with an existing project.

DRY_RUN=1 mode that prints the generated conf.d/.yml and targets/.json without starting containers.

stop-instance safety. Check no running containers reference -db network before docker network rm. Currently 2>/dev/null ||
true hides legitimate errors.

Correctness / race conditions

Ordering during launch-instance. Write Prometheus targets first so scraping is ready, then write Traefik route after app
healthcheck passes - otherwise Traefik briefly 502s new traffic. Alternative: use loadBalancer.healthCheck in the route file so
Traefik gates traffic itself.

Middleware reload gap. middlewares.yml is static; changes require manual Traefik action depending on watch behavior. Document or
move per-instance-referenced middleware inline.

depends_on across stacks. Impossible in compose. Either document the ordering contract explicitly, or use an init container in
docker-compose.yml that blocks on pg_isready against the external network before the app starts. Handles the "user bypasses Makefile"
case.

Security

Replace inline configs.content with mounted file + Docker secrets for DHIS2_MONITOR_USERNAME too (currently only password is a
secret; username comes via ${VAR} substitution into config content, leaks into docker compose config output).

Drop root in create-monitoring-user. Pre-build an image alpine+jq+curl in a one-line Dockerfile or use alpine/curl + stedolan/jq.
Restores read_only: true + non-root user.

File permissions on generated files. conf.d/.yml and targets/.json currently inherit default umask. Chmod 600 or 640 since they
contain hostnames (not secret, but unnecessary to publish). instances/*.env already handled by generate-env.sh.

Credential rotation path. No script to rotate DHIS2_MONITOR_PASSWORD, POSTGRES_DB_PASSWORD, etc. without recreating the instance.
Worth at least documenting procedure.

ACME storage shared. cert:/cert is a single named volume in the Traefik stack - fine, but note that compromising Traefik exposes
all instance certs. Consistent with most single-proxy setups.

Architecture alternatives worth testing before shipping

Traefik docker provider + labels, instead of file provider. Remove envsubst entirely. Each instance's app service gets
traefik.http.routers..rule=Host(...). Traefik reads labels via docker socket. Requires docker-socket-proxy sidecar to avoid
giving Traefik /var/run/docker.sock (security). Pro: less templating, zero conf.d/ files. Con: introduces socket-proxy dependency.

Prometheus docker_sd_configs parallel to (16). Instances get prometheus.io/scrape=true labels, Prometheus discovers targets via
socket-proxy. Drops the targets/*.json files + envsubst step.

Picking (16) + (17) simplifies the branch substantially - the entire conf.d/ + targets/ generation dance disappears, replaced by
labels on each service. Cost: one docker-socket-proxy container shared by Traefik and Prometheus.

Testing

Integration test: two instances, same host. Spin up prod + dev via Makefile, assert:

Both reachable via distinct hostnames through shared Traefik

Kill one Postgres → other instance unaffected

Prometheus sees targets labelled instance=prod and instance=dev separately

Grafana dashboards filter by instance variable

stop-instance prod does not disrupt dev

This is the highest-leverage single addition - it validates the entire design claim.

Minor polishing

Per-instance resource limits via env (APP_CPU_LIMIT, APP_MEM_LIMIT) - multi-instance on one host makes noisy-neighbor risk real.

Backup retention rotation. ./backups/ grows forever. Add timestamped pruning policy or defer to host-level tooling and document.

Grafana instance variable. Dashboards shipped in stacks/monitoring/config/grafana/dashboards/ need a $instance template variable
filter, otherwise multi-instance metrics collapse into single panels.

Consolidate application network. It's per-project (internal) but create-monitoring-user + update-admin-password + app are the
only members. Could drop and use project default bridge. Marginal simplification.

feat: per-project backup paths

* chore: doc updates and list-instances * Apply suggestions from code review - remove numbering in reference file Co-authored-by: Philip-Larsen-Donnelly <35666657+Philip-Larsen-Donnelly@users.noreply.github.com> * fix: review comments

…p-instance

Philip-Larsen-Donnelly

Looks Good.

I still think we should use the "detach flag" approach instead of the "compose opts" for the (default) detached state; but we can add that as another feature on the development-2.0 branch. (and maybe that action was on me, anyway 😬 )

tonsV2 · 2026-05-07T21:31:30Z

Looks Good.

I still think we should use the "detach flag" approach instead of the "compose opts" for the (default) detached state; but we can add that as another feature on the development-2.0 branch. (and maybe that action was on me, anyway 😬 )

So do I and I actually thought you already added it on the other branch.

tonsV2 and others added 2 commits April 14, 2026 15:59

Reapply "feat: initial support for deploying multiple instances on th…

b9e1ea3

…e same host. WIP" This reverts commit ed9278f.

Merge branch 'master' into feat/multiple-instances

56a0f3a

radnov reviewed Apr 21, 2026

View reviewed changes

tonsV2 changed the base branch from master to development-2.0 April 29, 2026 11:43

Philip-Larsen-Donnelly and others added 8 commits April 30, 2026 09:34

feat: per-project backup paths

4c41e14

Merge pull request #98 from dhis2/mult-inst-back2

b95b587

feat: per-project backup paths

chore: doc updates and list-instances (#96)

1de7305

* chore: doc updates and list-instances * Apply suggestions from code review - remove numbering in reference file Co-authored-by: Philip-Larsen-Donnelly <35666657+Philip-Larsen-Donnelly@users.noreply.github.com> * fix: review comments

fix: add shebang to env-utils.sh to fix shellcheck SC2148

e056d7c

feat: add delete-instance target to remove instance data and env file

4dae1dc

refactor: rename launch-* targets to start-* for consistency with sto…

4b40c90

…p-instance

refactor: rename monitoring_net network to monitoring

40fa5e3

fix: resolve shellcheck SC1091/SC2119 warnings in generate scripts

9e2583f

tonsV2 requested a review from Philip-Larsen-Donnelly May 7, 2026 19:40

Philip-Larsen-Donnelly approved these changes May 7, 2026

View reviewed changes

tonsV2 merged commit 174dbe9 into development-2.0 May 7, 2026
1 check passed

tonsV2 deleted the feat/multiple-instances branch May 7, 2026 21:31

tonsV2 mentioned this pull request May 7, 2026

chore: Reapply "feat: initial support for deploying multiple instance… #99

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reapply "feat: initial support for deploying multiple instances on th…#94

Reapply "feat: initial support for deploying multiple instances on th…#94
tonsV2 merged 10 commits into
development-2.0from
feat/multiple-instances

tonsV2 commented Apr 20, 2026

Uh oh!

radnov left a comment

Uh oh!

Philip-Larsen-Donnelly left a comment

Uh oh!

tonsV2 commented May 7, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

tonsV2 commented Apr 20, 2026

Uh oh!

radnov left a comment

Choose a reason for hiding this comment

Uh oh!

Philip-Larsen-Donnelly left a comment

Choose a reason for hiding this comment

Uh oh!

tonsV2 commented May 7, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants