feat(deployment): implement zero-downtime blue-green deployment#60
Major changes to make blue-green deployments work correctly:
- Use separate service names (platform-blue, platform-green) instead of overlays to prevent Docker Compose from touching the other color's containers during deployment
- Make graph-db a stateful service (single instance) since Kuzu requires exclusive file lock and cannot run two instances
- Add HTTPS for local development with self-signed certificates and configurable TLS_MODE (selfsigned/letsencrypt)
- Update platform health check to verify Convex backend readiness
- Add Caddy stabilization wait in deploy script before draining old containers
- Trust Caddy's CA certificate in platform container for internal HTTPS calls

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
📝 Walkthrough
This PR implements a blue-green deployment infrastructure with accompanying configuration updates. Key changes include introducing compose.blue.yml and compose.green.yml overlay files with dedicated service variants (platform-blue/green, rag-blue/green, etc.), refactoring the deployment orchestration script (scripts/deploy.sh) to dynamically manage color-specific services, and updating the development domain from localhost to tale.local with corresponding TLS and certificate handling. The PR adds health checks and readiness endpoints for zero-downtime deployments, introduces a maintenance page for deployment transitions, updates proxy configuration (Caddyfile) for improved TLS management, and updates documentation to reflect the new domain and deployment approach.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
scripts/deploy.sh (1)
1-27: Add local domain prerequisite to the deploy.sh header comments.
The `tale.local` domain configuration is documented in the main README, but the deploy.sh script header's REQUIREMENTS section (lines 15-17) doesn't mention it. Users checking the script directly would miss this prerequisite and encounter failures at line 382, where the health check runs against `https://tale.local/api/health`.
Consider adding to the REQUIREMENTS section:
- Local domain configured: Add "127.0.0.1 tale.local" to /etc/hosts (or C:\Windows\System32\drivers\etc\hosts on Windows)
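A pre-flight check along the following lines could surface the missing hosts entry before the health check fails. This is a minimal sketch, not code from deploy.sh: the function name `has_hosts_entry` is hypothetical, and it only recognizes IPv4 `127.0.0.1` entries.

```shell
#!/bin/sh
# Hypothetical helper (not part of deploy.sh).
# Returns 0 if hosts_file maps the given host to 127.0.0.1, ignoring comments.
has_hosts_entry() {
  hosts_file=$1
  host=$2
  [ -r "$hosts_file" ] || return 1
  # Strip comments, then look for a 127.0.0.1 line that lists the host
  # as a whole field (so "tale.local.example" does not match "tale.local").
  sed 's/#.*//' "$hosts_file" | awk -v h="$host" '
    $1 == "127.0.0.1" { for (i = 2; i <= NF; i++) if ($i == h) found = 1 }
    END { exit found ? 0 : 1 }'
}
```

A script could call `has_hosts_entry /etc/hosts tale.local` early and print the hosts-file instruction when it returns non-zero.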
📜 Review details
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro (Legacy)
📒 Files selected for processing (13)
.env.example
README.md
compose.blue.yml
compose.green.yml
compose.yml
scripts/deploy.sh
services/platform/Dockerfile
services/platform/app/api/health/route.ts
services/platform/docker-entrypoint.sh
services/proxy/Caddyfile
services/proxy/Dockerfile
services/proxy/docker-entrypoint.sh
services/proxy/maintenance.html
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{ts,tsx}
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.{ts,tsx}: USE implicit typing whenever possible in TypeScript
DO NOT use type casting. Avoid `any` and `unknown` types whenever possible in TypeScript
Files:
services/platform/app/api/health/route.ts
**/*.{tsx,ts}
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.{tsx,ts}: Do NOT hardcode text in React components; use translation hooks/functions instead for user-facing UI
CONSIDER ALWAYS TO add optimistic updates with `withOptimisticUpdate` for `useMutation` in React. If you decide to NOT add optimistic update you need to provide a good reason why and comment the hook
CONSIDER ALWAYS TO use reusable components in React
USE `useMemo`, `useCallback` and `memo` at the right moment in React
DO NOT overuse `useEffect` in React
USE `cva` if a component has multiple variants in React
CONSIDER TO preload queries with `preloadQuery` and `usePreloadedQuery` in React when using Convex
Files:
services/platform/app/api/health/route.ts
**/*.{js,ts}
📄 CodeRabbit inference engine (CLAUDE.md)
CONSIDER TO use rate limiting and action caching in Convex
Files:
services/platform/app/api/health/route.ts
🧠 Learnings (6)
📚 Learning: 2025-12-19T04:29:46.183Z
Learnt from: larryro
Repo: tale-project/tale PR: 26
File: services/rag/Dockerfile:10-20
Timestamp: 2025-12-19T04:29:46.183Z
Learning: Do not pin apt package versions in Dockerfiles within the tale-project/tale repository (e.g., services/rag/Dockerfile). Rely on regularly updated base images (like python:3.11-slim) and unpinned apt packages (curl, build-essential, libpq-dev) so that security updates and compatibility are handled via base image refresh and CI/CD caching. This reduces maintenance burden; verify through CI pipelines and ensure reproducibility comes from image rebuilds rather than manual pinning.
Applied to files:
services/platform/Dockerfile
services/proxy/Dockerfile
📚 Learning: 2025-11-30T03:53:00.316Z
Learnt from: CR
Repo: tale-project/poc2 PR: 0
File: .cursor/rules/convex_rules.mdc:0-0
Timestamp: 2025-11-30T03:53:00.316Z
Learning: Applies to convex/http.ts : HTTP endpoints must be defined in `convex/http.ts` using `httpRouter` and `httpAction` decorator, with exact path matching as specified in the `path` field
Applied to files:
services/platform/app/api/health/route.ts
📚 Learning: 2025-10-03T11:34:20.628Z
Learnt from: CR
Repo: talecorp/poc2 PR: 0
File: .cursor/rules/convex_rules.mdc:0-0
Timestamp: 2025-10-03T11:34:20.628Z
Learning: Applies to convex/http.ts : Define HTTP endpoints in convex/http.ts using httpAction and httpRouter
Applied to files:
services/platform/app/api/health/route.ts
📚 Learning: 2025-10-03T11:34:20.628Z
Learnt from: CR
Repo: talecorp/poc2 PR: 0
File: .cursor/rules/convex_rules.mdc:0-0
Timestamp: 2025-10-03T11:34:20.628Z
Learning: Applies to convex/http.ts : HTTP endpoints are registered at the exact path specified in the route path field
Applied to files:
services/platform/app/api/health/route.ts
📚 Learning: 2025-10-03T11:34:20.628Z
Learnt from: CR
Repo: talecorp/poc2 PR: 0
File: .cursor/rules/convex_rules.mdc:0-0
Timestamp: 2025-10-03T11:34:20.628Z
Learning: Applies to convex/**/*.{ts,js} : Always use the new Convex function syntax (query/mutation/action with args/returns/handler)
Applied to files:
services/platform/app/api/health/route.ts
📚 Learning: 2025-12-30T06:21:13.183Z
Learnt from: larryro
Repo: tale-project/tale PR: 37
File: services/platform/convex/model/documents/validators.ts:89-102
Timestamp: 2025-12-30T06:21:13.183Z
Learning: Do not flag a missing trailing newline for TypeScript files in code reviews. POSIX text files should end with a trailing newline and Prettier (or your formatter) will enforce this. Treat the trailing newline as a non-issue in reviews for all TS files.
Applied to files:
services/platform/app/api/health/route.ts
🪛 Hadolint (2.14.0)
services/platform/Dockerfile
[warning] 71-71: Pin versions in apt get install. Instead of apt-get install <package> use apt-get install <package>=<version>
(DL3008)
🪛 markdownlint-cli2 (0.18.1)
README.md
52-52: Bare URL used
(MD034, no-bare-urls)
62-62: Bare URL used
(MD034, no-bare-urls)
🪛 Shellcheck (0.11.0)
services/proxy/docker-entrypoint.sh
[warning] 58-58: i appears unused. Verify use (or export if used externally).
(SC2034)
scripts/deploy.sh
[info] 338-338: Double quote to prevent globbing and word splitting.
(SC2086)
[info] 347-347: Double quote to prevent globbing and word splitting.
(SC2086)
[info] 446-446: Double quote to prevent globbing and word splitting.
(SC2086)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Build Docker Image
🔇 Additional comments (30)
.env.example (2)
18-21: LGTM: Domain configuration updated for tale.local.
The domain configuration correctly reflects the PR's switch to tale.local for local development, with clear separation between DOMAIN (full URL) and DOMAIN_HOST (hostname only).
23-40: LGTM: TLS configuration well-documented.
The TLS_MODE configuration clearly distinguishes between development (self-signed) and production (Let's Encrypt) use cases, with helpful inline documentation and examples.
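As a rough illustration of how the two TLS_MODE values might be translated for Caddy, the sketch below follows the mapping described in the review of services/proxy/docker-entrypoint.sh (`internal` for self-signed, an email or empty value for Let's Encrypt). The function name `map_tls_mode` and the error handling for unknown values are assumptions; the actual entrypoint is not shown in this thread.

```shell
#!/bin/sh
# Hypothetical helper: print the value to place after Caddy's `tls` directive.
#   selfsigned  -> "internal" (Caddy issues certificates from its local CA)
#   letsencrypt -> the ACME contact email, or an empty string if none is given
map_tls_mode() {
  mode=${1:-}
  email=${2:-}
  case "$mode" in
    selfsigned)  printf 'internal\n' ;;
    letsencrypt) printf '%s\n' "$email" ;;
    *)
      # Assumption: reject anything else instead of guessing a default.
      printf 'unknown TLS_MODE: %s\n' "$mode" >&2
      return 1
      ;;
  esac
}
```

Failing loudly on an unrecognized mode is a design choice of this sketch; it keeps a typo in `.env` from silently selecting the wrong certificate source.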
services/platform/app/api/health/route.ts (2)
10-14: LGTM: Convex backend URL correctly configured.
The hardcoded localhost URL is appropriate since the Convex backend runs in the same container as Next.js, and the /version endpoint is the standard readiness indicator.
21-21: LGTM: Comment accurately reflects expanded health check.
The updated comment correctly describes both shutdown and Convex not-ready scenarios that trigger 503 responses.
README.md (3)
16-30: LGTM: Clear local domain configuration instructions.
The new "Configure Local Domain" section provides clear, platform-specific instructions for adding tale.local to the hosts file, making the setup process straightforward for users on macOS, Linux, and Windows.
52-54: LGTM: URL and certificate guidance updated.
The documentation correctly reflects the switch to https://tale.local and includes a helpful note about the expected certificate warning with instructions to trust the certificate.
71-85: LGTM: Production deployment instructions updated.
The production deployment section correctly includes the new TLS_MODE=letsencrypt configuration and clearly explains that Let's Encrypt handles SSL certificate provisioning.
compose.blue.yml (4)
1-10: LGTM: Clear blue overlay documentation.
The header comments clearly explain the purpose of this overlay file and the key difference from previous approaches (separate service names to prevent Docker Compose conflicts).
39-44: LGTM: Health checks properly configured.
All services have appropriate health check configurations with reasonable intervals (5s), timeouts (3s), retries (2-3), and start periods (30-120s depending on service complexity).
Also applies to: 80-85, 114-119, 148-162
177-192: LGTM: External resources properly referenced.
Volumes and networks are correctly defined as external with proper naming convention (tale_ prefix), enabling resource sharing with the base compose.yml.
52-56: Dual network aliases enable health-check-based routing in Caddy.
Each blue service has two network aliases (e.g., `platform-blue` and `platform`) that map to the upstream backends in Caddy's reverse proxy configuration. Caddy is configured with active health checking (`health_uri /api/health`, `health_interval 2s`) and uses `lb_policy first` to route traffic only to healthy instances. The `platform-blue`, `platform-green`, and `platform` aliases in the Caddyfile's reverse proxy ensure that during blue-green deployments, traffic automatically switches to healthy backends regardless of color, enabling zero-downtime deployments.
services/platform/Dockerfile (1)
70-78: LGTM: CA certificates properly configured for TLS support.
The addition of ca-certificates and the /usr/local/share/ca-certificates directory enables the platform to trust custom CA certificates (such as Caddy's self-signed CA), which is essential for internal HTTPS communication in the blue-green deployment setup.
Note: The Hadolint warning about version pinning is a false positive. Based on learnings, this repository intentionally does not pin apt package versions, relying instead on regularly updated base images and CI/CD caching for reproducibility.
services/proxy/maintenance.html (1)
1-74: Well-designed maintenance page with appropriate refresh interval.
The implementation is clean with inline CSS (no external dependencies), dark mode support, and semantic HTML. The 2-second auto-refresh aligns well with the `health_interval 2s` configured in the Caddyfile, ensuring users see the updated state promptly after backends become healthy.
services/proxy/Dockerfile (2)
13-19: LGTM! Clean separation of concerns with the custom entrypoint.
The entrypoint script handles TLS configuration mapping and certificate permissions, while Caddy's CMD remains unchanged. Copying the maintenance page to `/var/www` aligns with the `handle_errors` block in the Caddyfile.
28-32: Good change to use explicit IPv4 loopback.
Using `127.0.0.1` instead of `localhost` avoids potential IPv6 resolution issues (`::1`) that can occur in Alpine containers, making the health check more reliable.
services/proxy/docker-entrypoint.sh (1)
17-36: Clean TLS mode abstraction.
The mapping from user-friendly `TLS_MODE` values to Caddy's internal format (`internal` for self-signed, email or empty for Let's Encrypt ACME) provides a good developer experience.
compose.yml (3)
271-279: Verify CA certificate paths in production (Let's Encrypt mode).
The environment variables reference Caddy's internal CA path (`/caddy-data/caddy/pki/authorities/local/root.crt`), which only exists in self-signed mode. In production with `TLS_MODE=letsencrypt`, this path won't exist.
This should be fine since:
- Let's Encrypt certs are publicly trusted and don't need custom CA
- The platform entrypoint (lines 111-137) already handles the case when the `CADDY_ROOT_CA` file doesn't exist
However, consider adding a comment clarifying this is development-only, or conditionally setting these variables.
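The missing-file fallback mentioned above can be sketched roughly as follows. This is an illustration under stated assumptions, not the platform entrypoint itself: `build_ca_bundle` and its argument order are hypothetical, and the real script may differ in paths and messages.

```shell
#!/bin/sh
# Hypothetical helper: write the system CAs plus one extra root CA into a
# combined bundle. Prints the bundle path on success. Returns 1 and writes
# nothing when the extra CA is unset or the file is missing (for example in
# Let's Encrypt mode, where Caddy's local root.crt does not exist).
build_ca_bundle() {
  system_bundle=$1
  extra_ca=$2
  bundle_out=$3
  if [ -n "$extra_ca" ] && [ -f "$extra_ca" ]; then
    # `|| true` keeps a read error from aborting a `set -e` script.
    cat "$system_bundle" "$extra_ca" > "$bundle_out" 2>/dev/null || true
    printf '%s\n' "$bundle_out"
  else
    echo "No extra CA found; using the system trust store only" >&2
    return 1
  fi
}
```

A caller could export the printed path as `SSL_CERT_FILE` and `REQUESTS_CA_BUNDLE` only when the function succeeds, leaving both unset in production so the default trust store applies.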
83-84: Good addition of container_name for the stateful graph-db service.
Since Kuzu requires an exclusive file lock, it cannot run as blue/green instances. The explicit container name prevents Docker Compose from creating duplicate instances during deployments.
423-439: Healthcheck configuration is consistent with the proxy Dockerfile.
Using `http://127.0.0.1:80/health` matches the Dockerfile healthcheck and avoids IPv6/SSL issues for internal checks.
services/platform/docker-entrypoint.sh (2)
27-28: Important fix for container restarts.
Clearing the stale shutdown marker ensures health checks pass on container restart. Without this, a restarted container could inherit a shutdown marker from its previous incarnation and immediately report unhealthy.
106-137: Well-implemented CA trust handling with proper fallbacks.
The implementation correctly:
- Creates a combined CA bundle preserving system CAs
- Sets both `SSL_CERT_FILE` (for Rust/native TLS) and `REQUESTS_CA_BUNDLE` (for Python)
- Gracefully handles missing files with informative messages
- Uses `|| true` to prevent script failure on copy/cat errors
The dual-path check on line 111 (`-n "${CADDY_ROOT_CA:-}"` AND `-f "${CADDY_ROOT_CA}"`) ensures the file actually exists before attempting to create the bundle.
services/proxy/Caddyfile (3)
77-89: Aggressive but appropriate health check settings for zero-downtime.
The configuration ensures tight failover:
- `health_interval 2s` + `health_passes 2` = ~4s before a new backend receives traffic
- `max_fails 2` with `fail_duration 10s` = quick circuit breaker on failures
This is well-tuned for blue-green deployments where responsiveness matters more than reducing health check overhead.
28-38: Good separation of HTTP health endpoint from main HTTPS site.
The `:80` block serves two purposes:
- Provides a health check endpoint without SSL complexity (used by Docker healthcheck)
- Redirects all other HTTP traffic to HTTPS
This avoids the chicken-and-egg problem of SSL certificate validation during health checks.
105-113: Friendly maintenance page during deployment transitions.
Serving `maintenance.html` for 502/503/504 errors provides a better user experience than raw error pages during the brief window when backends are switching.
compose.green.yml (3)
52-56: Potential DNS resolution ambiguity with shared aliases.
Both `platform-blue` (in compose.blue.yml) and `platform-green` register the `platform` alias. When both are running during deployment, DNS resolution for `platform` becomes non-deterministic.
This might be intentional for the Caddyfile's `platform:3000` fallback (line 69), but it could cause issues if other services resolve `platform` directly expecting a single instance.
Consider whether the generic `platform` alias should only exist on the active deployment, or if dependent services should always use the color-specific names.
176-192: Correct external resource references.
The external volumes and network use the `tale_` prefix, which is Docker Compose's default project name prefix. This ensures the green overlay shares state with the base deployment.
15-45: Well-structured service definition matching base configuration.
The platform-green service mirrors the base platform service with:
- Same healthcheck configuration (120s start_period for Convex)
- Same volume mounts for data persistence
- Same CA certificate environment variables
This consistency ensures predictable behavior during blue-green switches.
scripts/deploy.sh (3)
332-350: LGTM! Intentional word splitting for service names.
The unquoted `$target_services` variables (lines 338, 347) are correct here: the space-separated service list needs to be split into individual arguments for `docker compose`. Shellcheck's SC2086 warning is a false positive in this context.
You may optionally add a shellcheck disable comment to document this is intentional:
# shellcheck disable=SC2086 # Intentional word splitting for service names
if ! docker compose -f "${PROJECT_ROOT}/compose.${target_color}.yml" build $target_services; then
441-449: LGTM! Consistent with deploy logic.
The rollback logic correctly uses `get_services_for_color()` and the compose file pattern is consistent with the deploy command. The unquoted `$rollback_services` is intentional for word splitting.
491-538: LGTM! Status display correctly reflects new architecture.
The status command now properly shows graph-db as a stateful service and uses `ROTATABLE_SERVICE_BASES` consistently throughout. The output clearly distinguishes between stateful services and blue/green rotatable services.
The help text incorrectly stated 3s as the default for HEALTH_CHECK_INTERVAL, but the actual default is 1s (line 47). Updated help text to match. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Instead of a fixed 5-second sleep after starting stateful services, now properly polls Docker health status for db, proxy, and graph-db. Waits up to 60 seconds with 2-second intervals. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
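The polling described in this commit can be sketched roughly as below. It is not the code from scripts/deploy.sh: the function names, the `WAIT_TIMEOUT` and `WAIT_INTERVAL` variables, and the assumption that container names are passed as arguments are all hypothetical. The defaults mirror the commit message (60 seconds total, 2 seconds between polls), and the status is read with `docker inspect`.

```shell
#!/bin/sh
# Read one container's health status. Kept as a separate function so it can
# be replaced with a stub when Docker is not available.
health_status() {
  docker inspect --format '{{.State.Health.Status}}' "$1" 2>/dev/null
}

# Hypothetical helper: poll until every named container reports "healthy",
# or give up once the timeout has elapsed.
wait_for_healthy() {
  timeout=${WAIT_TIMEOUT:-60}   # seconds to wait in total
  interval=${WAIT_INTERVAL:-2}  # seconds between polls
  elapsed=0
  while :; do
    pending=""
    for c in "$@"; do
      [ "$(health_status "$c")" = "healthy" ] || pending="$pending $c"
    done
    [ -z "$pending" ] && return 0
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "Timed out waiting for:$pending" >&2
      return 1
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
}
```

Compared with a fixed sleep, the loop returns as soon as all containers are healthy and reports which ones were still pending when it times out.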
Previously, if Caddy couldn't verify proxy routing after 5 attempts, the deployment would continue with just a warning. This could cause downtime since draining old containers while new ones aren't serving traffic would leave no working backends. Now the deployment fails and rolls back if verification fails. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
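The fail-and-roll-back behavior described here might look roughly like the following. This is a sketch under assumptions: `retry`, `gate_drain`, and the `VERIFY_ATTEMPTS` and `VERIFY_DELAY` variables are hypothetical names, and the verify, drain, and rollback steps are passed in as commands because their real implementations are not shown in this thread. The default of 5 attempts comes from the commit message.

```shell
#!/bin/sh
# Run a command up to max_attempts times, sleeping between failures.
# Returns 0 on the first success, 1 if every attempt fails.
retry() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    "$@" && return 0
    attempt=$((attempt + 1))
    [ "$attempt" -le "$max_attempts" ] && sleep "$delay"
  done
  return 1
}

# Hypothetical gate: drain the old color only after routing is verified;
# otherwise roll back and report failure instead of continuing with a warning.
gate_drain() {
  verify_cmd=$1
  drain_cmd=$2
  rollback_cmd=$3
  if retry "${VERIFY_ATTEMPTS:-5}" "${VERIFY_DELAY:-1}" "$verify_cmd"; then
    "$drain_cmd"
  else
    echo "Proxy routing could not be verified; rolling back" >&2
    "$rollback_cmd"
    return 1
  fi
}
```

The point of the gate is ordering: the old containers are never drained unless the new ones are confirmed to be serving traffic through the proxy.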
Added a comment to clarify that the -k flag in curl is intentional for accepting self-signed certificates in local development (TLS_MODE=selfsigned). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The 2-second timeout was too aggressive for cold starts or slower systems. Increased to 5 seconds to be more forgiving without significantly impacting deployment speed, since health checks have retry safety through Docker health check configuration. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…ning Changed loop variable from 'i' to '_' since it's unused, following shellcheck best practice for intentionally unused variables. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Summary
scripts/deploy.sh) with deploy, rollback, and status commands
Key Changes
Deployment Infrastructure
`platform-blue`, `platform-green`, etc.) to prevent Docker Compose conflicts during deployments
`db`, `proxy`, and `graph-db` run as single instances (graph-db requires exclusive file lock)
HTTPS & TLS
`TLS_MODE=selfsigned`)
`TLS_MODE=letsencrypt`)
Health Checks
Test plan
- `docker compose up` and verify services start correctly
- `./scripts/deploy.sh deploy` and verify zero-downtime deployment
- `./scripts/deploy.sh status` to check deployment state
- `./scripts/deploy.sh rollback` to test rollback functionality
🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Documentation
Improvements