Skip to content

Reapply "feat: initial support for deploying multiple instances on th…#94

Merged
tonsV2 merged 10 commits into
development-2.0from
feat/multiple-instances
May 7, 2026
Merged

Reapply "feat: initial support for deploying multiple instances on th…#94
tonsV2 merged 10 commits into
development-2.0from
feat/multiple-instances

Conversation

@tonsV2
Copy link
Copy Markdown
Collaborator

@tonsV2 tonsV2 commented Apr 20, 2026

…e same host. WIP"

This reverts commit ed9278f.

Copy link
Copy Markdown
Contributor

@radnov radnov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using Docker Swarm could be an alternative here. Docker Swarm in single-host mode is a somewhat common "built-in" pattern that implements this type of multi-instance deployment. One benefit is that if the users want to go from single to multi-host setup it would be already supported by Docker Swarm without extra effort.

Example:

docker swarm init
docker network create -d overlay --attachable proxy
docker network create -d overlay --attachable monitoring_net

docker stack deploy -c stacks/traefik/docker-compose.yml traefik
docker stack deploy -c stacks/monitoring/docker-compose.yml monitoring
docker stack deploy -c docker-compose.yml prod
docker stack deploy -c docker-compose.yml dev

What you get for free:

  • Stack isolation (service names prefixed with stack name: prod_app, dev_app) replaces the ${COMPOSE_PROJECT_NAME}-app aliasing the
    branch does manually
  • Overlay networks as external resources work identically (already how the branch models them)
  • Docker secrets and configs are first-class, not compose workarounds
  • Built-in rolling updates, deploy.update_config, rollback on health failure
  • docker stack ls / docker stack ps gives the make list-instances you would otherwise write
  • Scale services: docker service scale prod_app=3
  • Placement constraints if you later add nodes (no code change to move from 1 host to 5)

What you lose vs current compose approach:

  • depends_on.condition: service_healthy is NOT supported in swarm mode. The update-admin-password + create-monitoring-user chain
    would need reworking (init jobs via docker service create one-shot, or application-level retry logic, or a separate docker compose
    run step pre-deploy)
  • docker compose run --rm backup-database pattern does not translate; swarm has no ad-hoc run. Backup/restore would need to be docker
    run invocations against the volume directly, or one-shot services
  • profiles: not supported. The backup / restore / docs profiles would need other compose files or separate invocations
  • build: not supported - you must push images (not an issue here, all images are pulled)
  • restart: unless-stopped replaced by deploy.restart_policy
  • Loki driver + loki-external-labels works but per-container custom labels differ slightly
  • No env_file for service env in older swarm versions (works in modern engines)

If we stick to the current approach, here is a long list of suggested improvements created with the help of Claude and curated by me.

16 and 17, as noted below, could simplify things a lot, if they work.

Note that I haven't tried running the changes localy.

Operational ergonomics

  1. Drop shared monitor credential. Per-instance creds in per-instance target JSON (Prometheus file_sd supports per-target basic_auth
    via relabel_configs + labels, or switch to http_sd). Removes the "must match across all .env files" footgun.
  2. Use include: in root docker-compose.yml to pull stacks/postgres/ + optional stacks/backup/. Collapses POSTGRES_COMPOSE_CMD /
    BACKUP_COMPOSE_CMD into one COMPOSE_CMD.
  3. Move envsubst into scripts/. Easier to test, cleaner quoting, avoids envsubst substituting unrelated shell vars that happen to
    match template placeholders.
  4. Add make list-instances / make status. Discoverability of running instances, which conf.d/targets files map to which project,
    health summary.
  5. Pre-flight checks in launch-instance. Fail fast if stacks/traefik/.env or stacks/monitoring/.env absent, if instances/.env
    missing, if APP_HOSTNAME already used by another instance's route, if PROJECT_NAME collides with an existing project.
  6. DRY_RUN=1 mode that prints the generated conf.d/.yml and targets/.json without starting containers.
  7. stop-instance safety. Check no running containers reference -db network before docker network rm. Currently 2>/dev/null ||
    true hides legitimate errors.

Correctness / race conditions

  1. Ordering during launch-instance. Write Prometheus targets first so scraping is ready, then write Traefik route after app
    healthcheck passes - otherwise Traefik briefly 502s new traffic. Alternative: use loadBalancer.healthCheck in the route file so
    Traefik gates traffic itself.
  2. Middleware reload gap. middlewares.yml is static; changes require manual Traefik action depending on watch behavior. Document or
    move per-instance-referenced middleware inline.
  3. depends_on across stacks. Impossible in compose. Either document the ordering contract explicitly, or use an init container in
    docker-compose.yml that blocks on pg_isready against the external network before the app starts. Handles the "user bypasses Makefile"
    case.

Security

  1. Replace inline configs.content with mounted file + Docker secrets for DHIS2_MONITOR_USERNAME too (currently only password is a
    secret; username comes via ${VAR} substitution into config content, leaks into docker compose config output).
  2. Drop root in create-monitoring-user. Pre-build an image alpine+jq+curl in a one-line Dockerfile or use alpine/curl + stedolan/jq.
    Restores read_only: true + non-root user.
  3. File permissions on generated files. conf.d/.yml and targets/.json currently inherit default umask. Chmod 600 or 640 since they
    contain hostnames (not secret, but unnecessary to publish). instances/*.env already handled by generate-env.sh.
  4. Credential rotation path. No script to rotate DHIS2_MONITOR_PASSWORD, POSTGRES_DB_PASSWORD, etc. without recreating the instance.
    Worth at least documenting procedure.
  5. ACME storage shared. cert:/cert is a single named volume in the Traefik stack - fine, but note that compromising Traefik exposes
    all instance certs. Consistent with most single-proxy setups.

Architecture alternatives worth testing before shipping

  1. Traefik docker provider + labels, instead of file provider. Remove envsubst entirely. Each instance's app service gets
    traefik.http.routers..rule=Host(...). Traefik reads labels via docker socket. Requires docker-socket-proxy sidecar to avoid
    giving Traefik /var/run/docker.sock (security). Pro: less templating, zero conf.d/ files. Con: introduces socket-proxy dependency.
  2. Prometheus docker_sd_configs parallel to (16). Instances get prometheus.io/scrape=true labels, Prometheus discovers targets via
    socket-proxy. Drops the targets/*.json files + envsubst step.

Picking (16) + (17) simplifies the branch substantially - the entire conf.d/ + targets/ generation dance disappears, replaced by
labels on each service. Cost: one docker-socket-proxy container shared by Traefik and Prometheus.

Testing

  1. Integration test: two instances, same host. Spin up prod + dev via Makefile, assert:
  • Both reachable via distinct hostnames through shared Traefik
  • Kill one Postgres → other instance unaffected
  • Prometheus sees targets labelled instance=prod and instance=dev separately
  • Grafana dashboards filter by instance variable
  • stop-instance prod does not disrupt dev

This is the highest-leverage single addition - it validates the entire design claim.

Minor polishing

  1. Per-instance resource limits via env (APP_CPU_LIMIT, APP_MEM_LIMIT) - multi-instance on one host makes noisy-neighbor risk real.
  2. Backup retention rotation. ./backups/ grows forever. Add timestamped pruning policy or defer to host-level tooling and document.
  3. Grafana instance variable. Dashboards shipped in stacks/monitoring/config/grafana/dashboards/ need a $instance template variable
    filter, otherwise multi-instance metrics collapse into single panels.
  4. Consolidate application network. It's per-project (internal) but create-monitoring-user + update-admin-password + app are the
    only members. Could drop and use project default bridge. Marginal simplification.

@tonsV2 tonsV2 changed the base branch from master to development-2.0 April 29, 2026 11:43
Copy link
Copy Markdown
Contributor

@Philip-Larsen-Donnelly Philip-Larsen-Donnelly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks Good.

I still think we should use the "detach flag" approach instead of the "compose opts" for the (default) detached state; but we can add that as another feature on the development-2.0 branch. (and maybe that action was on me, anyway 😬 )

@tonsV2
Copy link
Copy Markdown
Collaborator Author

tonsV2 commented May 7, 2026

Looks Good.

I still think we should use the "detach flag" approach instead of the "compose opts" for the (default) detached state; but we can add that as another feature on the development-2.0 branch. (and maybe that action was on me, anyway 😬 )

So do I and I actually thought you already added it on the other branch.

@tonsV2 tonsV2 merged commit 174dbe9 into development-2.0 May 7, 2026
1 check passed
@tonsV2 tonsV2 deleted the feat/multiple-instances branch May 7, 2026 21:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants