feat(deployment): Implement zero-downtime blue-green deployment #44

@larryro

Summary

Implement zero-downtime deployments for the Tale platform using blue-green deployment strategy with Docker Compose on a single VPS/VM.

Current Pain Points

  1. Service restarts cause brief outages (~90 seconds downtime)
  2. Database migrations block traffic
  3. Long container startup times before traffic resumes

Proposed Solution

Use blue-green deployment with Caddy upstream pools. Two versions of stateless services run simultaneously during deployment, with traffic switched only after the new version is healthy.
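A rough sketch of how the upstream pool could be rendered from the deploy script. With `lb_policy first`, Caddy picks the first available upstream in the order listed, so putting the active color first and reloading is one way to switch traffic. The `render_upstreams` helper, port 3000, and the `/health` path are assumptions for illustration, not part of the current setup.

```shell
# Hypothetical helper: print a reverse_proxy block that prefers the active color.
# Assumed values: port 3000, /health endpoint, platform-<color> network aliases.
render_upstreams() {
  local active="$1" standby
  case "$active" in
    blue)  standby=green ;;
    green) standby=blue ;;
    *) echo "unknown color: $active" >&2; return 1 ;;
  esac
  # Active color is listed first so "lb_policy first" prefers it;
  # the standby color only receives traffic if the active one is down.
  cat <<EOF
reverse_proxy platform-${active}:3000 platform-${standby}:3000 {
    lb_policy first
    lb_try_duration 10s
    health_uri /health
    health_interval 5s
    fail_duration 30s
}
EOF
}
```

Calling `render_upstreams green` prints a block with `platform-green` ahead of `platform-blue`; the output would then be written into the Caddyfile (or an imported snippet) before reloading Caddy.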

How It Works

PHASE 1: Blue serving     → Internet → Caddy → [Blue ✅] 
PHASE 2: Green starting   → Internet → Caddy → [Blue ✅] + [Green ⏳]
PHASE 3: Both healthy     → Internet → Caddy → [Blue draining] + [Green ✅ NEW]
PHASE 4: Cleanup          → Internet → Caddy → [Green ✅]

Colors rotate: blue → green → blue → green...
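The rotation could be tracked with a small state file. This is a minimal sketch; the `.deploy-color` file name and both function names are assumptions, and the script could just as well inspect running containers instead.

```shell
# Assumed location of the file recording which color currently serves traffic.
STATE_FILE="${STATE_FILE:-.deploy-color}"

# Print the active color, or "none" if nothing has been deployed yet.
current_color() {
  if [ -f "$STATE_FILE" ]; then
    cat "$STATE_FILE"
  else
    echo none
  fi
}

# Print the color the next deployment should use.
# The very first deployment ("none") starts with blue.
next_color() {
  case "$1" in
    blue)       echo green ;;
    green|none) echo blue ;;
    *) echo "unknown color: $1" >&2; return 1 ;;
  esac
}
```

A deploy would call `next_color "$(current_color)"`, and write the new color to the state file only after traffic has been switched.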

Implementation Tasks

Files to Create

  • scripts/deploy.sh - Deployment orchestration script

    • Detect current color (blue/green)
    • Build and start new color
    • Wait for health checks
    • Switch traffic via Caddy reload
    • Drain and cleanup old containers
  • compose.blue.yml - Blue deployment overlay

    • Container names with -blue suffix
    • Network aliases: platform-blue, rag-blue, etc.
  • compose.green.yml - Green deployment overlay

    • Container names with -green suffix
    • Network aliases: platform-green, rag-green, etc.
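The "wait for health checks" step in deploy.sh might look roughly like this. It polls a probe command until it succeeds or a timeout expires. `wait_healthy` and `container_is_healthy` are hypothetical names, and the container name in the usage note is only an example.

```shell
# Retry a probe command once per second until it succeeds.
# Returns 1 if the probe is still failing after <timeout> seconds.
wait_healthy() {
  local timeout="$1"
  shift
  local deadline=$(( $(date +%s) + timeout ))
  until "$@" >/dev/null 2>&1; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      return 1
    fi
    sleep 1
  done
}

# Probe: succeeds when Docker reports the container's health check as "healthy".
container_is_healthy() {
  [ "$(docker inspect --format '{{.State.Health.Status}}' "$1" 2>/dev/null)" = "healthy" ]
}
```

Usage could be `wait_healthy 120 container_is_healthy tale-platform-green || abort_deploy`, where `abort_deploy` (also hypothetical) stops the new color and leaves the old one serving.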

Files to Modify

  • compose.yml

    • Remove container_name from stateless services (platform, rag, crawler, search, graph-db)
    • Keep container_name for stateful services (db, proxy)
    • Shorten the health check interval (5s instead of 30s)
  • services/proxy/Caddyfile

    • Add upstream pool with both blue and green backends
    • Configure health-based routing (lb_policy first)
    • Add passive health checking (circuit breaker)
  • services/platform/docker-entrypoint.sh

    • Add graceful shutdown with connection draining
    • Create shutdown marker file for health check
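One possible shape for the graceful shutdown in docker-entrypoint.sh: on SIGTERM, create the marker file so the health check starts failing, wait for in-flight requests to drain, then stop the app. The marker path, the drain time, and the function names are assumptions. A real health check would probably also probe the HTTP endpoint.

```shell
# Assumed marker location and drain window.
SHUTDOWN_MARKER="${SHUTDOWN_MARKER:-/tmp/shutting-down}"
DRAIN_SECONDS="${DRAIN_SECONDS:-10}"

# Health check: fails as soon as the shutdown marker exists,
# so the proxy stops sending new requests to this container.
healthcheck() {
  [ ! -f "$SHUTDOWN_MARKER" ]
}

# Mark the container as draining, wait, then terminate the app process.
graceful_shutdown() {
  local app_pid="$1"
  touch "$SHUTDOWN_MARKER"
  sleep "$DRAIN_SECONDS"
  kill -TERM "$app_pid" 2>/dev/null || true
  wait "$app_pid" 2>/dev/null || true
}

# Wiring inside the entrypoint (left commented out in this sketch):
#   "$@" &
#   app_pid=$!
#   trap 'graceful_shutdown "$app_pid"' TERM INT
#   wait "$app_pid"
```

The drain window should be longer than the proxy's health check interval, otherwise the app may exit before Caddy has noticed the failing check.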

Database Consistency

Both Blue and Green share the same database. For schema changes, use the expand-contract pattern:

  1. EXPAND: Add new columns/tables (backward compatible) - run BEFORE deployment
  2. DEPLOY: Code works with both old and new schema
  3. CONTRACT: Remove old columns (only AFTER old code is gone)

Safe Migrations (EXPAND)

  • ADD COLUMN (nullable, or with a default, so old code can still insert rows)
  • CREATE TABLE
  • CREATE INDEX CONCURRENTLY

Unsafe Migrations (CONTRACT - run manually after deploy)

  • DROP COLUMN
  • RENAME COLUMN
  • ADD NOT NULL constraint (ALTER COLUMN ... SET NOT NULL)
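To keep CONTRACT statements out of the automatic deploy path, deploy.sh could refuse to run any migration that matches the unsafe list above. This is only a heuristic sketch assuming PostgreSQL syntax; `is_contract_migration` is a hypothetical helper and a plain regex will not catch every unsafe change.

```shell
# Read SQL on stdin; succeed if it contains a statement from the
# "unsafe" list (DROP COLUMN, RENAME COLUMN, SET NOT NULL).
# Heuristic only: case-insensitive pattern match, no SQL parsing.
is_contract_migration() {
  grep -Eiq 'DROP[[:space:]]+COLUMN|RENAME[[:space:]]+COLUMN|SET[[:space:]]+NOT[[:space:]]+NULL'
}

# Example wiring (path is illustrative):
#   if is_contract_migration < "$migration_file"; then
#     echo "CONTRACT migration, run manually after deploy: $migration_file" >&2
#     exit 1
#   fi
```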

Resource Requirements

Running two versions simultaneously requires ~2x memory during deployment:

Phase               Memory Required
Normal operation    ~5-6 GB
During deployment   ~10-12 GB

Recommendation: the VPS should have at least 12 GB RAM, ideally 16 GB for headroom.
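deploy.sh could check for enough free memory before starting the second color. A minimal sketch for Linux hosts, assuming the new color needs roughly 6 GB (taken from the normal-operation figure above); both function names are hypothetical.

```shell
# Print available memory in MB, read from a meminfo-style file
# (defaults to /proc/meminfo, which exists on Linux only).
available_mem_mb() {
  awk '/^MemAvailable:/ { print int($2 / 1024) }' "${1:-/proc/meminfo}"
}

# Succeed if <available_mb> is at least <required_mb>.
has_headroom() {
  local available_mb="$1" required_mb="$2"
  [ "$available_mb" -ge "$required_mb" ]
}

# Example wiring (6144 MB is an assumed requirement for the second color):
#   has_headroom "$(available_mem_mb)" 6144 || { echo "not enough memory" >&2; exit 1; }
```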

Deployment Commands

# Normal deployment
./scripts/deploy.sh deploy

# Quick rollback (if the previous containers are still running)
./scripts/deploy.sh rollback

# Check status
./scripts/deploy.sh status

Testing Plan

  1. Deploy blue version initially
  2. Make code change and deploy (should switch to green)
  3. Verify zero dropped requests during switch
  4. Make another change and deploy (should switch back to blue)
  5. Test rollback scenario
  6. Test failed deployment scenario (new version doesn't pass health check)
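For step 3, one simple way to check for dropped requests is to probe the site in a loop while the deploy runs and count failures. A sketch under assumptions: `count_failures` and `probe` are hypothetical helpers, and the URL in the usage note is a placeholder.

```shell
# Run a probe command <attempts> times and print how many runs failed.
count_failures() {
  local attempts="$1"
  shift
  local failures=0 i=0
  while [ "$i" -lt "$attempts" ]; do
    "$@" >/dev/null 2>&1 || failures=$(( failures + 1 ))
    i=$(( i + 1 ))
  done
  echo "$failures"
}

# Probe: succeeds only if the URL answers with a non-error HTTP status
# within 2 seconds.
probe() {
  curl --silent --fail --max-time 2 --output /dev/null "$1"
}
```

Running `count_failures 600 probe https://tale.example.com/health` in one terminal while `./scripts/deploy.sh deploy` runs in another should print 0 if the switch really is zero-downtime.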

Alternative: Simpler Rolling Update

If blue-green is too resource-intensive, use rolling updates:

docker compose up -d --no-deps --build platform

Caddy's lb_try_duration retries requests while the container restarts, which covers brief unavailability, but expect a ~10-30s window in which requests can be delayed or fail.


Detailed Plan

See the comment below for the full implementation plan with diagrams.
