feat(deployment): implement zero-downtime blue-green deployment #60

Merged
larryro merged 7 commits into main from claude/blue-green-deployment-v2
Jan 1, 2026

Conversation

@larryro larryro commented Jan 1, 2026

Collaborator

Summary

  • Implements a zero-downtime blue-green deployment strategy for the Tale platform
  • Adds HTTPS support for local development with self-signed certificates (configurable TLS mode)
  • Creates deployment script (scripts/deploy.sh) with deploy, rollback, and status commands
  • Updates proxy (Caddy) configuration with health-check-based load balancing between blue/green instances

Key Changes

Deployment Infrastructure

  • Blue-Green Architecture: Separate service definitions (platform-blue, platform-green, etc.) to prevent Docker Compose conflicts during deployments
  • Deploy Script: Full-featured deployment automation with health checks, rollback capability, and graceful draining
  • Stateful Services: db, proxy, and graph-db run as single instances (graph-db requires an exclusive file lock)

HTTPS & TLS

  • Self-signed certificates for local development (TLS_MODE=selfsigned)
  • Let's Encrypt support for production (TLS_MODE=letsencrypt)
  • Platform container trusts Caddy's CA for internal HTTPS calls

Health Checks

  • Platform health endpoint now verifies Convex backend readiness before reporting healthy
  • Caddy uses health-check-based routing to fail over automatically between blue and green
  • Graceful shutdown support with connection draining

Test plan

  • Run docker compose up and verify services start correctly
  • Access https://tale.local and accept the self-signed certificate warning
  • Run ./scripts/deploy.sh deploy and verify zero-downtime deployment
  • Run ./scripts/deploy.sh status to check deployment state
  • Run ./scripts/deploy.sh rollback to test rollback functionality
  • Verify health endpoint returns 503 during startup until Convex is ready

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Blue-green deployment infrastructure for zero-downtime releases
    • Local development domain (tale.local) with certificate support
    • Maintenance page displayed during deployment transitions
  • Documentation

    • Updated Quick Start with local domain configuration steps
    • Added hostfile setup instructions for macOS/Linux and Windows
    • Revised production deployment guidance with Let's Encrypt support
  • Improvements

    • Enhanced health check responsiveness and reliability
    • Improved SSL/TLS certificate handling for local development environments

Major changes to make blue-green deployments work correctly:

- Use separate service names (platform-blue, platform-green) instead of
  overlays to prevent Docker Compose from touching the other color's
  containers during deployment
- Make graph-db a stateful service (single instance) since Kuzu requires
  an exclusive file lock and cannot run two instances
- Add HTTPS for local development with self-signed certificates and
  configurable TLS_MODE (selfsigned/letsencrypt)
- Update platform health check to verify Convex backend readiness
- Add Caddy stabilization wait in deploy script before draining old containers
- Trust Caddy's CA certificate in platform container for internal HTTPS calls

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

coderabbitai Bot commented Jan 1, 2026

Contributor
📝 Walkthrough

Walkthrough

This PR implements a blue-green deployment infrastructure with accompanying configuration updates. Key changes include introducing compose.blue.yml and compose.green.yml overlay files with dedicated service variants (platform-blue/green, rag-blue/green, etc.), refactoring the deployment orchestration script (scripts/deploy.sh) to dynamically manage color-specific services, and updating the development domain from localhost to tale.local with corresponding TLS and certificate handling. The PR adds health checks and readiness endpoints for zero-downtime deployments, introduces a maintenance page for deployment transitions, updates proxy configuration (Caddyfile) for improved TLS management, and updates documentation to reflect the new domain and deployment approach.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • feat(deployment): implement zero-downtime blue-green deployment #50: Implements the same blue-green deployment components across overlay files, compose configuration, deployment scripts, proxy setup, and platform service modifications.
  • talecorp/poc2#35: Modifies the health endpoint in services/platform/app/api/health/route.ts; this PR extends that endpoint with a Convex backend readiness check.
  • tale-project/poc2#351: Updates overlapping files including README, compose services, docker-entrypoint configurations, and proxy behavior for local vs. production environments.

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
scripts/deploy.sh (1)

1-27: Add local domain prerequisite to the deploy.sh header comments.

The tale.local domain configuration is documented in the main README, but the deploy.sh script header's REQUIREMENTS section (lines 15-17) doesn't mention it. Users checking the script directly would miss this prerequisite and encounter failures at line 382 where the health check runs against https://tale.local/api/health.

Consider adding to the REQUIREMENTS section:

- Local domain configured: Add "127.0.0.1 tale.local" to /etc/hosts (or C:\Windows\System32\drivers\etc\hosts on Windows)
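A precheck along the following lines could let the script fail early when the hosts entry is missing. This is only a sketch and not code from the PR: the function name has_host_entry and its arguments are hypothetical.

```bash
# Hypothetical precheck: succeed if the hosts file maps some address to the host.
# Usage: has_host_entry <hosts-file> <hostname>
has_host_entry() {
  hosts_file=$1
  host_name=$2
  # Drop full-line comments, then look for the hostname as a whole field
  # after the address column, stopping at any trailing comment.
  grep -v '^[[:space:]]*#' "$hosts_file" 2>/dev/null |
    awk -v h="$host_name" '
      { for (i = 2; i <= NF; i++) { if ($i ~ /^#/) break; if ($i == h) found = 1 } }
      END { exit found ? 0 : 1 }'
}

# Example wiring (not executed here):
# has_host_entry /etc/hosts tale.local || {
#   echo "Add '127.0.0.1 tale.local' to /etc/hosts" >&2
#   exit 1
# }
```

Matching whole fields rather than substrings avoids treating a commented-out line or a longer name such as my-tale.local as a valid entry.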
📜 Review details

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro (Legacy)

📥 Commits

Reviewing files that changed from the base of the PR and between ba8fb2c and 5823395.

📒 Files selected for processing (13)
  • .env.example
  • README.md
  • compose.blue.yml
  • compose.green.yml
  • compose.yml
  • scripts/deploy.sh
  • services/platform/Dockerfile
  • services/platform/app/api/health/route.ts
  • services/platform/docker-entrypoint.sh
  • services/proxy/Caddyfile
  • services/proxy/Dockerfile
  • services/proxy/docker-entrypoint.sh
  • services/proxy/maintenance.html
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{ts,tsx}

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.{ts,tsx}: USE implicit typing whenever possible in TypeScript
DO NOT use type casting. Avoid any and unknown types whenever possible in TypeScript

Files:

  • services/platform/app/api/health/route.ts
**/*.{tsx,ts}

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.{tsx,ts}: Do NOT hardcode text in React components; use translation hooks/functions instead for user-facing UI
CONSIDER ALWAYS TO add optimistic updates with withOptimisticUpdate for useMutation in React. If you decide to NOT add optimistic update you need to provide a good reason why and comment the hook
CONSIDER ALWAYS TO use reusable components in React
USE useMemo, useCallback and memo at the right moment in React
DO NOT overuse useEffect in React
USE cva if a component has multiple variants in React
CONSIDER TO preload queries with preloadQuery and usePreloadedQuery in React when using Convex

Files:

  • services/platform/app/api/health/route.ts
**/*.{js,ts}

📄 CodeRabbit inference engine (CLAUDE.md)

CONSIDER TO use rate limiting and action caching in Convex

Files:

  • services/platform/app/api/health/route.ts
🧠 Learnings (6)
📚 Learning: 2025-12-19T04:29:46.183Z
Learnt from: larryro
Repo: tale-project/tale PR: 26
File: services/rag/Dockerfile:10-20
Timestamp: 2025-12-19T04:29:46.183Z
Learning: Do not pin apt package versions in Dockerfiles within the tale-project/tale repository (e.g., services/rag/Dockerfile). Rely on regularly updated base images (like python:3.11-slim) and unpinned apt packages (curl, build-essential, libpq-dev) so that security updates and compatibility are handled via base image refresh and CI/CD caching. This reduces maintenance burden; verify through CI pipelines and ensure reproducibility comes from image rebuilds rather than manual pinning.

Applied to files:

  • services/platform/Dockerfile
  • services/proxy/Dockerfile
📚 Learning: 2025-11-30T03:53:00.316Z
Learnt from: CR
Repo: tale-project/poc2 PR: 0
File: .cursor/rules/convex_rules.mdc:0-0
Timestamp: 2025-11-30T03:53:00.316Z
Learning: Applies to convex/http.ts : HTTP endpoints must be defined in `convex/http.ts` using `httpRouter` and `httpAction` decorator, with exact path matching as specified in the `path` field

Applied to files:

  • services/platform/app/api/health/route.ts
📚 Learning: 2025-10-03T11:34:20.628Z
Learnt from: CR
Repo: talecorp/poc2 PR: 0
File: .cursor/rules/convex_rules.mdc:0-0
Timestamp: 2025-10-03T11:34:20.628Z
Learning: Applies to convex/http.ts : Define HTTP endpoints in convex/http.ts using httpAction and httpRouter

Applied to files:

  • services/platform/app/api/health/route.ts
📚 Learning: 2025-10-03T11:34:20.628Z
Learnt from: CR
Repo: talecorp/poc2 PR: 0
File: .cursor/rules/convex_rules.mdc:0-0
Timestamp: 2025-10-03T11:34:20.628Z
Learning: Applies to convex/http.ts : HTTP endpoints are registered at the exact path specified in the route path field

Applied to files:

  • services/platform/app/api/health/route.ts
📚 Learning: 2025-10-03T11:34:20.628Z
Learnt from: CR
Repo: talecorp/poc2 PR: 0
File: .cursor/rules/convex_rules.mdc:0-0
Timestamp: 2025-10-03T11:34:20.628Z
Learning: Applies to convex/**/*.{ts,js} : Always use the new Convex function syntax (query/mutation/action with args/returns/handler)

Applied to files:

  • services/platform/app/api/health/route.ts
📚 Learning: 2025-12-30T06:21:13.183Z
Learnt from: larryro
Repo: tale-project/tale PR: 37
File: services/platform/convex/model/documents/validators.ts:89-102
Timestamp: 2025-12-30T06:21:13.183Z
Learning: Do not flag a missing trailing newline for TypeScript files in code reviews. POSIX text files should end with a trailing newline and Prettier (or your formatter) will enforce this. Treat the trailing newline as a non-issue in reviews for all TS files.

Applied to files:

  • services/platform/app/api/health/route.ts
🪛 Hadolint (2.14.0)
services/platform/Dockerfile

[warning] 71-71: Pin versions in apt get install. Instead of apt-get install <package> use apt-get install <package>=<version>

(DL3008)

🪛 markdownlint-cli2 (0.18.1)
README.md

52-52: Bare URL used

(MD034, no-bare-urls)


62-62: Bare URL used

(MD034, no-bare-urls)

🪛 Shellcheck (0.11.0)
services/proxy/docker-entrypoint.sh

[warning] 58-58: i appears unused. Verify use (or export if used externally).

(SC2034)

scripts/deploy.sh

[info] 338-338: Double quote to prevent globbing and word splitting.

(SC2086)


[info] 347-347: Double quote to prevent globbing and word splitting.

(SC2086)


[info] 446-446: Double quote to prevent globbing and word splitting.

(SC2086)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build Docker Image
🔇 Additional comments (30)
.env.example (2)

18-21: LGTM: Domain configuration updated for tale.local.

The domain configuration correctly reflects the PR's switch to tale.local for local development, with clear separation between DOMAIN (full URL) and DOMAIN_HOST (hostname only).


23-40: LGTM: TLS configuration well-documented.

The TLS_MODE configuration clearly distinguishes between development (self-signed) and production (Let's Encrypt) use cases, with helpful inline documentation and examples.

services/platform/app/api/health/route.ts (2)

10-14: LGTM: Convex backend URL correctly configured.

The hardcoded localhost URL is appropriate since the Convex backend runs in the same container as Next.js, and the /version endpoint is the standard readiness indicator.


21-21: LGTM: Comment accurately reflects expanded health check.

The updated comment correctly describes both shutdown and Convex not-ready scenarios that trigger 503 responses.

README.md (3)

16-30: LGTM: Clear local domain configuration instructions.

The new "Configure Local Domain" section provides clear, platform-specific instructions for adding tale.local to the hosts file, making the setup process straightforward for users on macOS, Linux, and Windows.


52-54: LGTM: URL and certificate guidance updated.

The documentation correctly reflects the switch to https://tale.local and includes a helpful note about the expected certificate warning with instructions to trust the certificate.


71-85: LGTM: Production deployment instructions updated.

The production deployment section correctly includes the new TLS_MODE=letsencrypt configuration and clearly explains that Let's Encrypt handles SSL certificate provisioning.

compose.blue.yml (4)

1-10: LGTM: Clear blue overlay documentation.

The header comments clearly explain the purpose of this overlay file and the key difference from previous approaches (separate service names to prevent Docker Compose conflicts).


39-44: LGTM: Health checks properly configured.

All services have appropriate health check configurations with reasonable intervals (5s), timeouts (3s), retries (2-3), and start periods (30-120s depending on service complexity).

Also applies to: 80-85, 114-119, 148-162


177-192: LGTM: External resources properly referenced.

Volumes and networks are correctly defined as external with proper naming convention (tale_ prefix), enabling resource sharing with the base compose.yml.


52-56: Dual network aliases enable health-check-based routing in Caddy.

Each blue service has two network aliases (e.g., platform-blue and platform) that map to the upstream backends in Caddy's reverse proxy configuration. Caddy is configured with active health checking (health_uri /api/health, health_interval 2s) and uses lb_policy first to route traffic only to healthy instances. The platform-blue, platform-green, and platform aliases in the Caddyfile's reverse proxy ensure that during blue-green deployments, traffic automatically switches to healthy backends regardless of color, enabling zero-downtime deployments.

services/platform/Dockerfile (1)

70-78: LGTM: CA certificates properly configured for TLS support.

The addition of ca-certificates and the /usr/local/share/ca-certificates directory enables the platform to trust custom CA certificates (such as Caddy's self-signed CA), which is essential for internal HTTPS communication in the blue-green deployment setup.

Note: The Hadolint warning about version pinning is a false positive. Based on learnings, this repository intentionally does not pin apt package versions, relying instead on regularly updated base images and CI/CD caching for reproducibility.

services/proxy/maintenance.html (1)

1-74: Well-designed maintenance page with appropriate refresh interval.

The implementation is clean with inline CSS (no external dependencies), dark mode support, and semantic HTML. The 2-second auto-refresh aligns well with the health_interval 2s configured in the Caddyfile, ensuring users see the updated state promptly after backends become healthy.

services/proxy/Dockerfile (2)

13-19: LGTM! Clean separation of concerns with the custom entrypoint.

The entrypoint script handles TLS configuration mapping and certificate permissions, while Caddy's CMD remains unchanged. Copying the maintenance page to /var/www aligns with the handle_errors block in the Caddyfile.


28-32: Good change to use explicit IPv4 loopback.

Using 127.0.0.1 instead of localhost avoids potential IPv6 resolution issues (::1) that can occur in Alpine containers, making the health check more reliable.

services/proxy/docker-entrypoint.sh (1)

17-36: Clean TLS mode abstraction.

The mapping from user-friendly TLS_MODE values to Caddy's internal format (internal for self-signed, email or empty for Let's Encrypt ACME) provides a good developer experience.
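
The mapping described here can be sketched as a small shell function. The entrypoint's actual code is not shown in this review, so the function name map_tls_mode, the optional email argument, and the rejection of unknown modes are all assumptions made for illustration.

```bash
# Hypothetical helper: translate TLS_MODE into the value handed to Caddy.
# Usage: map_tls_mode <mode> [acme-email]
map_tls_mode() {
  case "$1" in
    selfsigned)
      # Caddy's internal CA issues the certificate.
      printf '%s\n' "internal"
      ;;
    letsencrypt)
      # ACME: the contact email if one was given, otherwise empty.
      printf '%s\n' "${2:-}"
      ;;
    *)
      # Assumption: unknown modes are treated as a configuration error.
      echo "Unsupported TLS_MODE: $1" >&2
      return 1
      ;;
  esac
}
```

Keeping the translation in one function means the Caddyfile only ever sees values Caddy understands, while users keep the two friendlier names.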

compose.yml (3)

271-279: Verify CA certificate paths in production (Let's Encrypt mode).

The environment variables reference Caddy's internal CA path (/caddy-data/caddy/pki/authorities/local/root.crt), which only exists in self-signed mode. In production with TLS_MODE=letsencrypt, this path won't exist.

This should be fine since:

  1. Let's Encrypt certs are publicly trusted and don't need custom CA
  2. The platform entrypoint (lines 111-137) already handles the case when CADDY_ROOT_CA file doesn't exist

However, consider adding a comment clarifying this is development-only, or conditionally setting these variables.


83-84: Good addition of container_name for the stateful graph-db service.

Since Kuzu requires an exclusive file lock, it cannot run as blue/green instances. The explicit container name prevents Docker Compose from creating duplicate instances during deployments.


423-439: Healthcheck configuration is consistent with the proxy Dockerfile.

Using http://127.0.0.1:80/health matches the Dockerfile healthcheck and avoids IPv6/SSL issues for internal checks.

services/platform/docker-entrypoint.sh (2)

27-28: Important fix for container restarts.

Clearing the stale shutdown marker ensures health checks pass on container restart. Without this, a restarted container could inherit a shutdown marker from its previous incarnation and immediately report unhealthy.
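
The marker lifecycle behind this fix might look roughly like the sketch below. The marker path and all three function names are hypothetical, since the entrypoint's real names are not quoted in this review.

```bash
# Hypothetical marker path; the real script's path is not shown in this review.
SHUTDOWN_MARKER="${SHUTDOWN_MARKER:-/tmp/shutting_down}"

# Called once at container start: a marker left by the previous run must not
# make the fresh process look like it is draining.
clear_stale_shutdown_marker() {
  rm -f "$SHUTDOWN_MARKER"
}

# Called by the shutdown handler so health checks start failing before exit.
mark_shutting_down() {
  : > "$SHUTDOWN_MARKER"
}

# Health probe view: healthy only while no marker exists.
is_accepting_traffic() {
  [ ! -e "$SHUTDOWN_MARKER" ]
}
```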


106-137: Well-implemented CA trust handling with proper fallbacks.

The implementation correctly:

  1. Creates a combined CA bundle preserving system CAs
  2. Sets both SSL_CERT_FILE (for Rust/native TLS) and REQUESTS_CA_BUNDLE (for Python)
  3. Gracefully handles missing files with informative messages
  4. Uses || true to prevent script failure on copy/cat errors

The dual-path check on line 111 (-n "${CADDY_ROOT_CA:-}" AND -f "${CADDY_ROOT_CA}") ensures the file actually exists before attempting to create the bundle.
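
As an illustration of the pattern praised above, a combined bundle can be built as follows. This is a sketch under stated assumptions, not the entrypoint's code: build_ca_bundle and the file paths in the wiring comment are hypothetical, while CADDY_ROOT_CA, SSL_CERT_FILE, and REQUESTS_CA_BUNDLE are the variables named in the review.

```bash
# Hypothetical helper: write <system-bundle> plus <extra-ca> to <out-bundle>.
# Returns 1 (and writes nothing) when the extra CA is unset or missing.
build_ca_bundle() {
  system_bundle=$1
  extra_ca=$2
  out_bundle=$3

  if [ -n "${extra_ca:-}" ] && [ -f "$extra_ca" ]; then
    # Keep the system CAs first, then append the extra root certificate.
    cat "$system_bundle" "$extra_ca" > "$out_bundle" || return 1
    return 0
  fi
  echo "CA file not found, keeping the system trust store only" >&2
  return 1
}

# Example wiring (paths are assumptions, not executed here):
# if build_ca_bundle /etc/ssl/certs/ca-certificates.crt "${CADDY_ROOT_CA:-}" /tmp/ca-bundle.crt; then
#   export SSL_CERT_FILE=/tmp/ca-bundle.crt
#   export REQUESTS_CA_BUNDLE=/tmp/ca-bundle.crt
# fi
```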

services/proxy/Caddyfile (3)

77-89: Aggressive but appropriate health check settings for zero-downtime.

The configuration ensures tight failover:

  • health_interval 2s × health_passes 2 ≈ 4s before a new backend receives traffic
  • max_fails 2 with fail_duration 10s = quick circuit breaker on failures

This is well-tuned for blue-green deployments where responsiveness matters more than reducing health check overhead.


28-38: Good separation of HTTP health endpoint from main HTTPS site.

The :80 block serves two purposes:

  1. Provides a health check endpoint without SSL complexity (used by Docker healthcheck)
  2. Redirects all other HTTP traffic to HTTPS

This avoids the chicken-and-egg problem of SSL certificate validation during health checks.


105-113: Friendly maintenance page during deployment transitions.

Serving maintenance.html for 502/503/504 errors provides a better user experience than raw error pages during the brief window when backends are switching.

compose.green.yml (3)

52-56: Potential DNS resolution ambiguity with shared aliases.

Both platform-blue (in compose.blue.yml) and platform-green register the platform alias. When both are running during deployment, DNS resolution for platform becomes non-deterministic.

This might be intentional for the Caddyfile's platform:3000 fallback (line 69), but it could cause issues if other services resolve platform directly expecting a single instance.

Consider whether the generic platform alias should only exist on the active deployment, or if dependent services should always use the color-specific names.


176-192: Correct external resource references.

The external volumes and network use the tale_ prefix, which Docker Compose derives from the project name (by default the project directory name, presumably tale here). This ensures the green overlay shares state with the base deployment.


15-45: Well-structured service definition matching base configuration.

The platform-green service mirrors the base platform service with:

  • Same healthcheck configuration (120s start_period for Convex)
  • Same volume mounts for data persistence
  • Same CA certificate environment variables

This consistency ensures predictable behavior during blue-green switches.

scripts/deploy.sh (3)

332-350: LGTM! Intentional word splitting for service names.

The unquoted $target_services variables (Lines 338, 347) are correct here - the space-separated service list needs to be split into individual arguments for docker compose. Shellcheck's SC2086 warning is a false positive in this context.

You may optionally add a shellcheck disable comment to document this is intentional:

# shellcheck disable=SC2086  # Intentional word splitting for service names
if ! docker compose -f "${PROJECT_ROOT}/compose.${target_color}.yml" build $target_services; then
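
The difference between the quoted and unquoted forms can be seen in isolation with a small experiment. The helper count_args and the two-service list are illustrative only and do not come from deploy.sh.

```bash
# Print how many arguments the function received.
count_args() {
  echo "$#"
}

target_services="platform-blue rag-blue"   # illustrative list

# shellcheck disable=SC2086  # Intentional word splitting
count_args $target_services     # prints 2: split into separate arguments
count_args "$target_services"   # prints 1: passed as a single argument
```

Quoting the variable would hand docker compose one service name containing a space, which is why the unquoted form is the right choice for a space-separated list.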

441-449: LGTM! Consistent with deploy logic.

The rollback logic correctly uses get_services_for_color() and the compose file pattern is consistent with the deploy command. The unquoted $rollback_services is intentional for word splitting.
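
For readers without the script open, one plausible shape for this helper is sketched below. The names get_services_for_color and ROTATABLE_SERVICE_BASES appear in the review, but the bodies here are guesses: the list of bases is an assumed subset, and get_other_color is a hypothetical companion function.

```bash
# Assumed subset of base names; the review names platform and rag explicitly.
ROTATABLE_SERVICE_BASES="platform rag"

# Print the color-specific service names, space separated.
get_services_for_color() {
  color=$1
  services=""
  for base in $ROTATABLE_SERVICE_BASES; do
    services="${services:+$services }${base}-${color}"
  done
  printf '%s\n' "$services"
}

# Hypothetical companion: print the color opposite to the one given.
get_other_color() {
  case "$1" in
    blue)  echo green ;;
    green) echo blue ;;
    *)     echo "unknown color: $1" >&2; return 1 ;;
  esac
}
```

With these definitions, `get_services_for_color green` prints `platform-green rag-green`, which is the space-separated form that the unquoted expansion in the build and rollback commands relies on.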


491-538: LGTM! Status display correctly reflects new architecture.

The status command now properly shows graph-db as a stateful service and uses ROTATABLE_SERVICE_BASES consistently throughout. The output clearly distinguishes between stateful services and blue/green rotatable services.

Comment thread scripts/deploy.sh
Comment thread scripts/deploy.sh
Comment thread scripts/deploy.sh Outdated
Comment thread scripts/deploy.sh
Comment thread scripts/deploy.sh
Comment thread services/platform/app/api/health/route.ts
Comment thread services/proxy/docker-entrypoint.sh
larryro and others added 6 commits January 1, 2026 16:37
The help text incorrectly stated 3s as the default for HEALTH_CHECK_INTERVAL,
but the actual default is 1s (line 47). Updated help text to match.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Instead of a fixed 5-second sleep after starting stateful services,
now properly polls Docker health status for db, proxy, and graph-db.
Waits up to 60 seconds with 2-second intervals.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
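
The polling approach described in the commit above (up to 60 seconds at 2-second intervals) could be written along these lines. This is a sketch rather than the committed code: wait_until and container_is_healthy are hypothetical names, and the container names in the wiring comment are assumptions. The docker inspect format string reads the health status Docker records for a container with a healthcheck.

```bash
# Hypothetical helper: run a check command until it succeeds or time runs out.
# Usage: wait_until <timeout-seconds> <interval-seconds> <command> [args...]
wait_until() {
  timeout=$1
  interval=$2
  shift 2
  elapsed=0
  while ! "$@"; do
    if [ "$elapsed" -ge "$timeout" ]; then
      return 1
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  return 0
}

# Succeeds when Docker reports the container's health status as "healthy".
container_is_healthy() {
  [ "$(docker inspect --format '{{.State.Health.Status}}' "$1" 2>/dev/null)" = "healthy" ]
}

# Example wiring (container names are assumptions, not executed here):
# for c in db proxy graph-db; do
#   wait_until 60 2 container_is_healthy "$c" || exit 1
# done
```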
Previously, if Caddy couldn't verify proxy routing after 5 attempts,
the deployment would continue with just a warning. This could cause
downtime since draining old containers while new ones aren't serving
traffic would leave no working backends.

Now the deployment fails and rolls back if verification fails.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
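
The fail-and-roll-back behaviour described in the commit above might be structured as in the following sketch. The helper name verify_or_rollback and its parameters are hypothetical; the check and rollback steps are passed in as function names so the control flow can be read without the rest of deploy.sh.

```bash
# Hypothetical helper: try a check up to <attempts> times; if it never passes,
# run the rollback command and report failure instead of continuing.
# Usage: verify_or_rollback <attempts> <delay-seconds> <check-fn> <rollback-fn>
verify_or_rollback() {
  attempts=$1
  delay=$2
  check_fn=$3
  rollback_fn=$4
  attempt=1
  while [ "$attempt" -le "$attempts" ]; do
    if "$check_fn"; then
      return 0   # new backends are serving; safe to drain the old color
    fi
    attempt=$((attempt + 1))
    # Pause between attempts, but not after the last one.
    if [ "$attempt" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  echo "Routing verification failed after $attempts attempts, rolling back" >&2
  "$rollback_fn"
  return 1
}
```

Returning non-zero after the rollback lets the caller stop before the drain step, so the old containers keep serving traffic.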
Added a comment to clarify that the -k flag in curl is intentional
for accepting self-signed certificates in local development
(TLS_MODE=selfsigned).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The 2-second timeout was too aggressive for cold starts or slower
systems. Increased to 5 seconds to be more forgiving without
significantly impacting deployment speed, since health checks
have retry safety through Docker health check configuration.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…ning

Changed loop variable from 'i' to '_' since it's unused, following
shellcheck best practice for intentionally unused variables.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@larryro larryro merged commit 7ae2f20 into main Jan 1, 2026
2 checks passed
@larryro larryro deleted the claude/blue-green-deployment-v2 branch January 1, 2026 08:57
@coderabbitai coderabbitai Bot mentioned this pull request Mar 26, 2026
yannickmonney pushed a commit that referenced this pull request Apr 8, 2026
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>