Merged

Dev #89

56 changes: 44 additions & 12 deletions TODO.md
@@ -3,45 +3,77 @@
## Marketplace Integration: Agent Registration & Local Deploy

### Agent Self-Registration (for curl one-liner and manual install entry points)
- [ ] **`POST /api/v1/register`** (local endpoint on Status Panel) — Triggered after install.sh completes
- [x] **`POST /api/v1/register`** (local endpoint on Status Panel) — Triggered after install.sh completes
- Accept `{ purchase_token, stack_id }` from install script
- Collect server fingerprint (hostname, IP, OS, CPU, RAM, disk)
- Call Stacker Server: `POST /api/v1/agents/register { purchase_token, server_fingerprint, stack_id }`
- Store returned `agent_id`, `deployment_hash`, `dashboard_url` locally
- Begin heartbeat loop to Stacker Server
- [ ] **Local `stacker deploy` trigger** — After registration, Status Panel invokes Stacker CLI locally
- [x] **Local `stacker deploy` trigger** — After registration, Status Panel invokes Stacker CLI locally
- `stacker deploy --from /opt/stacker/stacks/{stack_id}/` (the downloaded archive)
- Monitor deploy progress, report status back to Stacker Server via existing agent report endpoint
- No Install Service involved — fully local execution

### Dashboard Linking (optional, user-initiated)
- [x] Provide web UI page at `http://localhost:{STATUS_PORT}/link` to connect Status Panel to TryDirect dashboard
- [x] Support unlinking from dashboard (agent continues to work standalone)
- [ ] **Login-based linking flow (Entry Point C):**
- [x] **Login-based linking flow (Entry Point C):**
- User logs in with TryDirect email + password from Status Panel UI
- Status Panel calls Stacker: `POST /api/v1/auth/login { email, password }` → returns `session_token` + user's deployments
- User selects a deployment from the list → Status Panel calls Stacker: `POST /api/v1/agents/link { session_token, deployment_id, server_fingerprint }`
- Stacker validates session, checks user owns the deployment, issues `agent_id` + `agent_token`
- No purchase_token needed — user's identity is the trust anchor
- `purchase_token` flow retained only for headless Entry Point B (curl one-liner)
- [ ] Add "Use Standalone" option for users without TryDirect account (skip linking entirely)
- [x] Add "Use Standalone" option for users without TryDirect account (skip linking entirely)

### Standalone Status Panel Entry Point (Phase 2)
- [ ] **"Deploy a Stack" page** in Status Panel web UI
- [x] **"Deploy a Stack" page** in Status Panel web UI
- Browse available stacks from marketplace API: `GET /api/v1/marketplace/stacks`
- User selects stack → Status Panel downloads archive + calls `stacker deploy` locally
- This enables Entry Point C: user installs Status Panel first, then deploys stacks from its UI

### Notifications Relay
- [ ] Forward marketplace notifications (stack published, update available) from Stacker Server to Status Panel UI
- [ ] Show "Update Available" badge when a newer version of the deployed stack exists
- [x] Forward marketplace notifications (stack published, update available) from Stacker Server to Status Panel UI
- [x] Show "Update Available" badge when a newer version of the deployed stack exists

---
- Align build and runtime images so the compiled `status` binary links against the same glibc version (or older) as production.
- Add a musl-based build target and image variant to provide a statically linked binary that avoids glibc drift.
- ~~Align build and runtime images so the compiled `status` binary links against the same glibc version (or older) as production.~~ ✅ Done — Dockerfiles use `clux/muslrust:stable` → `gcr.io/distroless/cc`, musl avoids glibc drift.
- ~~Add a musl-based build target and image variant to provide a statically linked binary that avoids glibc drift.~~ ✅ Done — CI builds `x86_64-unknown-linux-musl` target, releases musl binary.
- Update CI to build/test using the production base image to prevent future GLIBC_x.y.z mismatches.
- Add a simple container start-up check that surfaces linker/runtime errors early in the pipeline.

## Missing Features Implementation Plan (2026-04)

### Phase 1 - Reliability and Production Readiness
- [x] **[status-auth-refresh]** Refresh agent auth immediately on 401/403 and retry polling/report calls with backoff.
- Wire the retry path into the polling loop instead of waiting for the periodic refresh task.
- Define the Vault path/role contract for `status_panel_token` and document failure handling.
- [x] **[status-alerting]** Add outbound alert delivery for unhealthy containers, command failures, and host-level incidents.
- Webhook delivery with env-configured thresholds (`ALERT_WEBHOOK_URL`, CPU/memory/disk thresholds).
- Includes alert deduplication, severity escalation, and recovery notifications.
- [x] **[status-command-provenance]** Surface which control plane executed each action (`status_panel` vs `compose_agent`).
- Expose provenance in command reports, health metrics, and `/capabilities`-driven diagnostics.
- Publish and implement the separate token/cache schema for `compose_agent_token`.
- [ ] **[status-ssl-renewal]** Automate SSL certificate renewal for hosts that enable HTTPS.
- Add renewal scheduling, renewal result logging, and certificate reload without manual intervention.

### Phase 2 - Data Safety and Day-2 Operations
- [ ] **[status-volume-backups]** Add scheduled backup and restore support for Docker volumes.
- Support policy-driven backups for stateful services, retention, restore validation, and signed metadata.
- Reuse existing backup/security primitives where possible instead of introducing a separate backup path.

### Phase 3 - Standalone and Dashboard UX
- [x] **[status-login-linking]** Complete the login-based dashboard linking flow and standalone mode.
- Finish the UI + daemon wiring for email/password linking to an owned deployment.
- Add "Use Standalone" so the panel is usable without a TryDirect account.
- [x] **[status-deploy-stack-ui]** Build the local "Deploy a Stack" flow in Status Panel.
- Browse marketplace stacks, download the selected archive, and trigger local `stacker deploy`.
- Show deployment progress, update availability, and compatibility checks in the local UI.

### Cross-Project Coordination
- [ ] Coordinate `status-deploy-stack-ui` with Stacker marketplace archive/download validation.
- [ ] Coordinate `status-command-provenance` and future pipe execution with the Stacker control-plane roadmap.

## Status Panel Agent Commands (Pull Model)
**Key principle**: Agent polls Stacker; Stacker never pushes to the agent. Agent is responsible for adding HMAC headers on its outbound calls.

@@ -51,14 +83,14 @@
- [x] Restart: restart container by app_code, then emit updated state in report payload; include errors array on failure.
- [x] Reporting: call Stacker `POST /api/v1/agent/commands/report` with HMAC headers (`X-Agent-Id`, `X-Timestamp`, `X-Request-Id`, `X-Agent-Signature`) signed using Vault token.
- [x] Wire agent to poll loop: `GET /api/v1/agent/commands/wait/{deployment_hash}` with HMAC headers.
- [ ] On 401/403, refresh token from Vault and retry with backoff (which Vault path/role should we use for the agent token?).
- [x] On 401/403, refresh token from Vault and retry with backoff (TokenProvider with Vault → env fallback, 10s cooldown).
- [x] Ensure agent generates HMAC signature for every outbound request (wait + report + app status); no secrets expected from Stacker side.

## Compose Agent Sidecar
- [x] Ship a separate `compose-agent` container (Docker Compose + MCP Gateway) deployed alongside the Status Panel container; Service file should ensure it mounts the Docker socket while Status Panel does not.
- [x] Implement watchdog to restart only the compose container on failure/glibc mismatch without touching the Status Panel daemon; prove via integration test.
- [ ] Expose health metrics indicating which control plane executed each command (`status_panel` vs `compose_agent`) so ops can track rollout and fallbacks.
- [ ] Publish Vault secret schema: `secret/agent/{hash}/status_panel_token` and `secret/agent/{hash}/compose_agent_token`; refresh + cache them independently.
- [x] Expose health metrics indicating which control plane executed each command (`status_panel` vs `compose_agent`) so ops can track rollout and fallbacks.
- [x] Publish Vault secret schema: `secret/agent/{hash}/status_panel_token` and `secret/agent/{hash}/compose_agent_token`; refresh + cache them independently.
- [x] Add config flag to disable compose agent (legacy mode) and emit warning log so Blog receives `compose_agent=false` via `/capabilities`.

## Kata Containers Support (Stacker Server)
128 changes: 128 additions & 0 deletions docs/AGENT_ROTATION_GUIDE.md
@@ -143,3 +143,131 @@ spawn(refresh_loop(vault.clone(), deployment_hash.clone(), cache.clone()));
- Action: check request headers, clock skew, and signature; ensure using current token
- Symptoms: Vault errors
- Action: verify `VAULT_ADDRESS`, `VAULT_TOKEN`, network connectivity, and KV path prefix

---

## Auth Refresh on 401/403 — Implementation Details

### Problem

When the agent token expires or is rotated server-side, all outbound requests
(polling, reporting, notifications) receive 401/403 from Stacker. Previously
these were treated as generic errors with fixed backoff, causing prolonged
downtime until manual restart.

### Solution: `TokenProvider` + Retry Helpers

Two new modules handle automatic recovery:

| Module | Path | Purpose |
|--------|------|---------|
| `TokenProvider` | `src/security/token_provider.rs` | Shared mutable token with on-demand refresh |
| `RetryClient` | `src/transport/retry.rs` | HTTP helpers that detect 401/403 and retry |

### Request Flow

```
Daemon / Notification Poller
         │
         ▼
┌───────────────────┐
│  TokenProvider    │  .get() → current token
│  .get()           │
└────────┬──────────┘
         │
         ▼
┌───────────────────┐
│  Build signed     │  build_signed_headers(agent_id, token, body)
│  HMAC headers     │  → Bearer + X-Agent-Signature + X-Timestamp
└────────┬──────────┘
         │
         ▼
┌───────────────────┐
│  Send HTTP        │  signed_get_with_retry / signed_post_with_retry
│  request          │
└────────┬──────────┘
         │
         ▼
  ┌────── Status code? ──────┐
  │            │             │
200/204     401/403    5xx / network error
  │            │             │
✅ Done        ▼             ▼
        ┌───────────────┐  Exponential backoff
        │ TokenProvider │  2s → 4s → 8s → … 60s cap
        │ .refresh()    │  retry up to 3×
        └──────┬────────┘
               │
               ├─ 1. Try Vault:
               │     vault_client.fetch_agent_token(deployment_hash)
               ├─ 2. If Vault fails or returns same token:
               │     re-read AGENT_TOKEN from environment
               ├─ 3. Cooldown: 10s between refresh attempts
               │     (prevents hammering Vault on repeated failures)
               ▼
     Retry request once with new token
          ┌────┴────┐
         200     401 again
          │         │
       ✅ Done   Propagate error
                 (token truly invalid)
```
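
The branching in the diagram can be sketched in a few lines of synchronous, std-only Rust. This is an illustration under stated assumptions, not the code in `src/transport/retry.rs`: `Outcome`, `send_with_auth_retry`, and the three closures are hypothetical names, the real helpers are async, and giving up when the refresh yields no new token is an assumption of this sketch.

```rust
/// Simplified result of one HTTP attempt (hypothetical type for this sketch).
#[derive(Debug, Clone, PartialEq)]
pub enum Outcome {
    Success,      // 200/204
    Unauthorized, // 401/403
    ServerError,  // 5xx or network error
}

/// Drives one logical request through the flow above:
/// - 401/403: refresh the token, then retry exactly once;
/// - 5xx/network error: call `wait(attempt)` and retry, up to `max_server_retries`;
/// - anything else that fails is returned to the caller.
pub fn send_with_auth_retry<S, R, W>(
    initial_token: &str,
    max_server_retries: u32,
    mut send: S,
    mut refresh: R,
    mut wait: W,
) -> Result<(), Outcome>
where
    S: FnMut(&str) -> Outcome,       // performs the signed request
    R: FnMut() -> Option<String>,    // returns a new token, if one was obtained
    W: FnMut(u32),                   // sleeps for the backoff of this attempt
{
    let mut token = initial_token.to_string();
    let mut auth_retried = false;
    let mut server_attempt = 0;

    loop {
        match send(&token) {
            Outcome::Success => return Ok(()),
            Outcome::Unauthorized if !auth_retried => {
                auth_retried = true;
                match refresh() {
                    Some(new_token) => token = new_token,
                    // Assumption: with no new token there is nothing to retry with.
                    None => return Err(Outcome::Unauthorized),
                }
            }
            Outcome::ServerError if server_attempt < max_server_retries => {
                wait(server_attempt);
                server_attempt += 1;
            }
            // Second 401/403, or server retries exhausted: propagate.
            failure => return Err(failure),
        }
    }
}
```

Passing the sleep as a closure keeps the control flow testable without real delays; a caller would map the attempt number to the 2s, 4s, 8s schedule.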

### TokenProvider API

```rust
use crate::security::token_provider::TokenProvider;

// Create (both daemon and serve mode)
let tp = TokenProvider::new(initial_token, Some(vault_client), deployment_hash);
// or
let tp = TokenProvider::from_env(Some(vault_client));

tp.get().await // → current token (stored internally as Arc<RwLock<String>>)
tp.refresh().await // → Ok(true) if token changed, Ok(false) if unchanged
tp.swap(new).await // → direct swap (used by background rotation task)
```
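
To make the "shared mutable token" idea concrete, here is a minimal synchronous stand-in built on `std::sync::RwLock`. `SharedToken` is a hypothetical name; the real `TokenProvider` is async and also carries the Vault client and deployment hash, and the `bool` returned by `swap` is an assumption of this sketch.

```rust
use std::sync::{Arc, RwLock};

/// Hypothetical, synchronous stand-in for `TokenProvider`:
/// a cheap-to-clone handle around one shared token.
#[derive(Clone)]
pub struct SharedToken {
    inner: Arc<RwLock<String>>,
}

impl SharedToken {
    pub fn new(initial: impl Into<String>) -> Self {
        Self {
            inner: Arc::new(RwLock::new(initial.into())),
        }
    }

    /// Returns a copy of the current token, so no lock is held by the caller.
    pub fn get(&self) -> String {
        self.inner.read().expect("token lock poisoned").clone()
    }

    /// Replaces the token; returns true only if the value actually changed.
    pub fn swap(&self, new_token: impl Into<String>) -> bool {
        let new_token = new_token.into();
        let mut guard = self.inner.write().expect("token lock poisoned");
        if *guard == new_token {
            return false;
        }
        *guard = new_token;
        true
    }
}
```

Because every clone points at the same `Arc`, a background rotation task can call `swap` and the polling loop sees the new token on its next `get`.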

### Wired Consumers

| Consumer | File | Mechanism |
|----------|------|-----------|
| Daemon polling (`wait_for_command`) | `src/agent/daemon.rs` | `wait_for_command_with_retry` (auth-only retry) |
| Daemon reporting (`report_result`) | `src/agent/daemon.rs` | `report_result_with_retry` (full retry) |
| Daemon app status | `src/agent/daemon.rs` | `update_app_status_with_retry` (full retry) |
| Notification poller | `src/comms/notifications.rs` | Explicit 401/403 check → `refresh()` → 5s backoff |

### RetryConfig Presets

```rust
use crate::transport::retry::RetryConfig;

RetryConfig::default() // 1 auth retry + 3 server retries (2–60s backoff)
RetryConfig::auth_only() // 1 auth retry + 0 server retries (for long-poll)
```
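
For readers who want to see what the two presets imply, the following sketch mirrors the numbers in the comments above. The field names and the `delays` helper are assumptions made for illustration; only the counts (1 auth retry, 3 or 0 server retries) and the 2s to 60s backoff range come from this guide.

```rust
use std::time::Duration;

/// Hypothetical shape of the retry configuration (field names are assumed).
#[derive(Debug, Clone, PartialEq)]
pub struct RetryConfig {
    pub max_auth_retries: u32,
    pub max_server_retries: u32,
    pub initial_backoff: Duration,
    pub max_backoff: Duration,
}

impl Default for RetryConfig {
    /// 1 auth retry + 3 server retries, backoff from 2s up to a 60s cap.
    fn default() -> Self {
        Self {
            max_auth_retries: 1,
            max_server_retries: 3,
            initial_backoff: Duration::from_secs(2),
            max_backoff: Duration::from_secs(60),
        }
    }
}

impl RetryConfig {
    /// 1 auth retry + 0 server retries, intended for long-poll requests.
    pub fn auth_only() -> Self {
        Self {
            max_server_retries: 0,
            ..Self::default()
        }
    }

    /// Delay before each server retry: the initial backoff doubled per
    /// attempt and clamped to `max_backoff`.
    pub fn delays(&self) -> Vec<Duration> {
        (0..self.max_server_retries)
            .map(|attempt| {
                let factor = 2u32.saturating_pow(attempt);
                self.initial_backoff
                    .saturating_mul(factor)
                    .min(self.max_backoff)
            })
            .collect()
    }
}
```

With the default preset, `delays()` yields 2s, 4s, and 8s; with `auth_only()` it is empty, so a failed long-poll returns to the outer loop immediately.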

### Refresh Strategy

1. **Vault first** — If `VaultClient` is configured, call
`fetch_agent_token(deployment_hash)`. If it returns a different token,
swap it in and retry.
2. **Environment fallback** — If Vault is unavailable or returns the same
token, re-read `AGENT_TOKEN` from the process environment. This covers
cases where an orchestrator (Docker, systemd) injects a new token via
env without restarting the process.
3. **Cooldown** — A 10-second minimum gap between refresh attempts prevents
hammering Vault during cascading failures.
4. **Single retry** — After refreshing, the request is retried exactly once.
If it still gets 401/403, the error propagates (the token is truly invalid
and requires operator intervention).
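
The first three steps can be sketched as a small state machine. Everything here is illustrative: `Refresher`, its fields, and the two closures standing in for the Vault call and the `AGENT_TOKEN` lookup are hypothetical, the clock is passed in so the cooldown is easy to test, and ignoring empty tokens is an assumption of this sketch.

```rust
use std::time::{Duration, Instant};

/// Hypothetical holder for the current token plus cooldown bookkeeping.
pub struct Refresher {
    current: String,
    last_attempt: Option<Instant>,
    cooldown: Duration,
}

impl Refresher {
    pub fn new(initial: impl Into<String>, cooldown: Duration) -> Self {
        Self {
            current: initial.into(),
            last_attempt: None,
            cooldown,
        }
    }

    pub fn current(&self) -> &str {
        &self.current
    }

    /// Returns true if the token changed.
    /// `fetch_vault` and `read_env` stand in for the Vault client call and
    /// the environment lookup; both return None on failure.
    pub fn refresh<V, E>(&mut self, now: Instant, fetch_vault: V, read_env: E) -> bool
    where
        V: FnOnce() -> Option<String>,
        E: FnOnce() -> Option<String>,
    {
        // Step 3: skip the attempt entirely while the cooldown is active.
        if let Some(last) = self.last_attempt {
            if now.saturating_duration_since(last) < self.cooldown {
                return false;
            }
        }
        self.last_attempt = Some(now);

        let current = self.current.clone();
        let is_new = |t: &String| !t.is_empty() && *t != current;

        // Step 1: Vault first. Step 2: fall back to the environment when
        // Vault fails or hands back the token we already have.
        let candidate = fetch_vault()
            .filter(|t| is_new(t))
            .or_else(|| read_env().filter(|t| is_new(t)));

        match candidate {
            Some(token) => {
                self.current = token;
                true
            }
            None => false,
        }
    }
}
```

The single retry of step 4 belongs to the caller: it retries the request only when `refresh` reports a change, and otherwise propagates the 401/403.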

### Environment Variables

| Variable | Default | Purpose |
|----------|---------|---------|
| `AGENT_TOKEN` | _(empty)_ | Bearer token for Stacker API auth |
| `DEPLOYMENT_HASH` | `"default"` | Vault path isolation key |
| `VAULT_ADDRESS` | _(none)_ | Vault server URL (enables Vault refresh) |
| `VAULT_TOKEN` | _(none)_ | Vault auth token |
| `VAULT_AGENT_PATH_PREFIX` | `"status_panel"` | Vault KV path prefix |
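
As a rough illustration of how these defaults could be applied at startup, the sketch below reads the variables through an injectable lookup function. `AgentEnv` and its methods are hypothetical names, not part of the codebase; only the variable names and defaults are taken from the table.

```rust
use std::env;

/// Hypothetical view of the agent's environment configuration.
#[derive(Debug, Clone, PartialEq)]
pub struct AgentEnv {
    pub agent_token: String,
    pub deployment_hash: String,
    pub vault_address: Option<String>,
    pub vault_token: Option<String>,
    pub vault_agent_path_prefix: String,
}

impl AgentEnv {
    /// Builds the config from any key lookup, applying the table's defaults.
    pub fn from_lookup<F>(lookup: F) -> Self
    where
        F: Fn(&str) -> Option<String>,
    {
        Self {
            agent_token: lookup("AGENT_TOKEN").unwrap_or_default(),
            deployment_hash: lookup("DEPLOYMENT_HASH")
                .unwrap_or_else(|| "default".to_string()),
            vault_address: lookup("VAULT_ADDRESS"),
            vault_token: lookup("VAULT_TOKEN"),
            vault_agent_path_prefix: lookup("VAULT_AGENT_PATH_PREFIX")
                .unwrap_or_else(|| "status_panel".to_string()),
        }
    }

    /// Reads the real process environment.
    pub fn from_process_env() -> Self {
        Self::from_lookup(|key| env::var(key).ok())
    }

    /// Vault refresh is only possible when an address is configured.
    pub fn vault_enabled(&self) -> bool {
        self.vault_address.is_some()
    }
}
```

Separating `from_lookup` from `from_process_env` lets tests supply a fixed map instead of mutating the process environment.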