Problem Statement
The openshell-sandbox crate on main is a ~17k-line monolith that bundles three fundamentally different concerns into one binary:
- Network policy enforcement — HTTP CONNECT proxy, OPA evaluation, TLS interception, credential injection, bypass detection (~12k lines across proxy.rs, opa.rs, l7/, identity.rs, etc.)
- Linux process isolation — network namespaces, Landlock, seccomp BPF, privilege dropping, process spawning, SSH server (~4k lines across sandbox/linux/, process.rs, ssh.rs)
- Orchestration — gRPC gateway communication, policy loading, supervisor sessions, container lifecycle (~3k lines in lib.rs, policy.rs, grpc_client.rs)
This monolithic design forces a single deployment topology: all three concerns run in one container. Different environments have fundamentally different isolation requirements:
- A Kubernetes cluster with gVisor or Kata already provides kernel-level sandboxing — the runtime isolation primitives are redundant and add unnecessary privilege requirements.
- A bare-metal deployment needs the full stack.
- A lightweight egress filter (no shell access needed) only needs the proxy.
- An edge deployment might want the proxy on a gateway node and the runtime on worker nodes.
There is no way to use the proxy independently of the runtime, or to deploy them on separate trust boundaries.
Proposed Design
Split openshell-sandbox into three crates with clear, independently deployable boundaries:
openshell-proxy — Network Policy Enforcement (Data Plane)
Standalone HTTP CONNECT proxy deployable as its own container image:
- OPA-based per-request policy evaluation (allow/deny/log)
- TLS interception via in-memory CA for L7 inspection
- Provider credential injection for authorized outbound requests
- Bypass detection and denial aggregation
- L7-aware inference routing (OpenAI, Anthropic, etc.)
- Binary identity binding (TOFU fingerprinting)
- gRPC control API — receives policy, credentials, and routes at runtime
Starts in deny-all mode by default. A controller (gateway, operator, or custom) pushes configuration via the control plane. This makes the proxy usable outside the OpenShell gateway entirely — any system that needs an egress policy filter can use it.
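To make the deny-all default concrete, here is a minimal sketch, in Rust, of how the proxy could hold its pushed configuration. The names (ProxyConfig, ProxyState, push_config) are illustrative placeholders, not the crate's actual API:

```rust
// Minimal sketch of the deny-all default; ProxyConfig, ProxyState, and
// push_config are illustrative names, not the real openshell-proxy API.
use std::sync::{Arc, RwLock};

struct ProxyConfig {
    /// Hosts the controller has explicitly allowed; empty until pushed.
    allowed_hosts: Vec<String>,
}

#[derive(Debug, PartialEq)]
enum PolicyDecision {
    Allow,
    Deny,
}

/// Shared proxy state: starts with no configuration installed and is
/// replaced in place whenever the control plane pushes a new one.
struct ProxyState {
    config: RwLock<Option<ProxyConfig>>,
}

impl ProxyState {
    fn new() -> Arc<Self> {
        Arc::new(Self { config: RwLock::new(None) })
    }

    /// Called by the control plane (in-process call or gRPC handler).
    fn push_config(&self, cfg: ProxyConfig) {
        *self.config.write().unwrap() = Some(cfg);
    }

    /// Per-request evaluation: with no configuration, everything is denied.
    fn evaluate(&self, host: &str) -> PolicyDecision {
        match self.config.read().unwrap().as_ref() {
            None => PolicyDecision::Deny, // deny-all before the first push
            Some(cfg) if cfg.allowed_hosts.iter().any(|h| h.as_str() == host) => {
                PolicyDecision::Allow
            }
            Some(_) => PolicyDecision::Deny,
        }
    }
}

fn main() {
    let state = ProxyState::new();
    assert_eq!(state.evaluate("api.anthropic.com"), PolicyDecision::Deny);
    state.push_config(ProxyConfig {
        allowed_hosts: vec!["api.anthropic.com".into()],
    });
    assert_eq!(state.evaluate("api.anthropic.com"), PolicyDecision::Allow);
}
```

Until the controller's first push, every evaluation falls through to Deny, which is what makes the proxy safe to start before its control plane is reachable.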
openshell-runtime — Process Isolation Primitives
Linux-specific sandboxing enforcement library:
- Network namespace isolation
- Landlock filesystem restrictions (Linux 5.13+)
- Seccomp BPF syscall filtering
- Privilege dropping and capability management
- Process spawning and entrypoint lifecycle
- Embedded SSH server (russh) for shell/exec access
Introduces a CredentialProvider trait to decouple SSH credential injection from the proxy's internal state.
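A minimal sketch of what that trait could look like, assuming a simple lookup-by-username shape; the trait name comes from this proposal, while the method signature and the StaticCredentials example are placeholders:

```rust
// Sketch of the CredentialProvider idea: the trait name is from the proposal,
// but the method shape and the StaticCredentials impl are assumed here.
use std::collections::HashMap;

/// Resolves credentials for an SSH login without the runtime reaching into
/// the proxy's (or orchestrator's) internal state.
pub trait CredentialProvider: Send + Sync {
    /// Returns the secret to verify for `username`, or None if unknown.
    fn credential_for(&self, username: &str) -> Option<String>;
}

/// Example provider backed by a fixed map, e.g. handed over by orchestration.
pub struct StaticCredentials(HashMap<String, String>);

impl CredentialProvider for StaticCredentials {
    fn credential_for(&self, username: &str) -> Option<String> {
        self.0.get(username).cloned()
    }
}

fn main() {
    let provider = StaticCredentials(
        [("agent".to_string(), "s3cret".to_string())].into_iter().collect(),
    );
    // The embedded SSH server would hold a `Box<dyn CredentialProvider>` and
    // call it during auth, regardless of where the credentials come from.
    assert_eq!(provider.credential_for("agent").as_deref(), Some("s3cret"));
}
```

Any host of the runtime can hand the SSH server whatever implementation fits its topology, so the runtime never needs a reference to proxy internals.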
openshell-sandbox — Orchestration (Composition Layer)
Thin glue that composes proxy + runtime for the traditional single-container topology. Depends on both crates:
- gRPC communication with the gateway
- Policy loading and propagation to both proxy and runtime
- Supervisor session management
- Container lifecycle coordination
No behavioral changes — this is the same binary produced today, just with cleaner internal boundaries.
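Roughly, the composition layer reduces to wiring the two crates together in one process. The crate and item names below (openshell_proxy::Proxy, openshell_runtime::Sandbox, and their methods) are stand-ins for whatever the split actually exposes:

```rust
// Hypothetical shape of the composition crate. The inner modules stand in for
// the real openshell-proxy and openshell-runtime crates; every function name
// is a placeholder used only to show how the pieces would be wired.
mod openshell_proxy {
    pub struct Proxy;
    impl Proxy {
        /// Start with no policy installed (deny-all).
        pub fn start_deny_all() -> Self {
            Proxy
        }
        /// Install a policy pushed from the gateway.
        pub fn apply_policy(&self, _rego: &str) {}
    }
}

mod openshell_runtime {
    pub struct Sandbox;
    impl Sandbox {
        /// Apply netns + Landlock + seccomp before anything untrusted runs.
        pub fn isolate() -> Self {
            Sandbox
        }
        /// Spawn the workload entrypoint inside the sandbox.
        pub fn spawn(&self, _entrypoint: &str) {}
    }
}

/// Same single-container behavior as today, expressed as composition:
/// bring the proxy up first, then isolate and start the workload behind it.
fn run_single_container(policy: &str, entrypoint: &str) {
    let proxy = openshell_proxy::Proxy::start_deny_all();
    proxy.apply_policy(policy);
    let sandbox = openshell_runtime::Sandbox::isolate();
    sandbox.spawn(entrypoint);
}

fn main() {
    run_single_container("package openshell.policy", "/bin/bash");
}
```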
Control Plane Interface
The proxy exposes a ProxyControl trait with two implementations:
- In-process — zero-copy, direct method calls. Used when proxy + runtime run in the same process.
- gRPC client/server — for remote control. Used when the proxy runs as a sidecar or standalone service.
This trait abstraction allows any topology to push policy/credentials/routes to the proxy uniformly.
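A sketch of the trait and its two implementations follows; the exact method set, message types, and the gRPC client plumbing are assumptions made for illustration:

```rust
// Sketch of the ProxyControl abstraction. The trait name comes from the
// proposal; the methods, message types, and the gRPC stub are assumed.
use std::sync::{Arc, Mutex};

pub struct PolicyBundle {
    pub rego: String,
}

pub struct Credential {
    pub provider: String,
    pub token: String,
}

/// Uniform control-plane surface: any controller pushes policy and
/// credentials through this trait, wherever the proxy happens to run.
pub trait ProxyControl {
    fn push_policy(&self, policy: PolicyBundle) -> Result<(), String>;
    fn push_credentials(&self, creds: Vec<Credential>) -> Result<(), String>;
}

/// In-process implementation: direct method calls into shared proxy state,
/// used when proxy and runtime live in the same binary.
pub struct InProcessControl {
    pub policy: Arc<Mutex<Option<PolicyBundle>>>,
}

impl ProxyControl for InProcessControl {
    fn push_policy(&self, policy: PolicyBundle) -> Result<(), String> {
        *self.policy.lock().unwrap() = Some(policy);
        Ok(())
    }
    fn push_credentials(&self, _creds: Vec<Credential>) -> Result<(), String> {
        Ok(()) // would hand credentials to the injection layer
    }
}

/// Remote implementation: the same trait, but every call would become a gRPC
/// request to a standalone proxy (the actual client stub is elided here).
pub struct GrpcControl {
    pub endpoint: String,
}

impl ProxyControl for GrpcControl {
    fn push_policy(&self, _policy: PolicyBundle) -> Result<(), String> {
        Err(format!("stub: would call PushPolicy on {}", self.endpoint))
    }
    fn push_credentials(&self, _creds: Vec<Credential>) -> Result<(), String> {
        Err(format!("stub: would call PushCredentials on {}", self.endpoint))
    }
}

fn main() {
    // A controller can drive either topology through the same trait object.
    let controls: Vec<Box<dyn ProxyControl>> = vec![
        Box::new(InProcessControl { policy: Arc::new(Mutex::new(None)) }),
        Box::new(GrpcControl { endpoint: "http://proxy:9090".into() }),
    ];
    for control in &controls {
        let _ = control.push_policy(PolicyBundle { rego: "package openshell".into() });
    }
}
```

Because the controller only ever talks to a Box&lt;dyn ProxyControl&gt; (or a generic parameter), switching a deployment from single-container to sidecar is a wiring change rather than a code change.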
Deployment Topologies Enabled
| Topology | Proxy | Runtime | Use Case |
| --- | --- | --- | --- |
| Single container (status quo) | In-process | In-process | Docker, Podman — full isolation in one unit |
| Two-pod sidecar | Standalone pod | Not needed (gVisor/Kata provides isolation) | Kubernetes with hardware-backed sandboxing |
| Gateway-embedded | Library dependency | None | Lightweight policy-only enforcement |
| Standalone egress filter | Standalone binary | None | Non-OpenShell workloads that need egress control |
| Edge / split-node | Standalone on gateway node | Standalone on worker node | Resource-constrained or multi-node setups |
Pros
- Independent deployability — the proxy runs as its own container image without pulling in Landlock, seccomp, SSH, or any Linux isolation code
- Topology flexibility — operators compose only what their environment needs; no wasted resources
- Smaller attack surface per component — the proxy pod needs no privileged capabilities; the runtime has no network stack surface
- Faster build iteration — changing OPA policy evaluation doesn't recompile SSH; changing seccomp rules doesn't recompile TLS
- Reusability beyond OpenShell — the proxy is a generic egress policy filter usable by any workload (Kubernetes sidecar, systemd service, etc.)
- Clear security boundary — trusted proxy runs at a different trust level than the untrusted workload, which can be enforced by deployment (separate pods, VMs, nodes)
- Focused testing — each crate has a well-defined responsibility with clear unit-test boundaries and fewer integration dependencies
- Container image size — proxy image carries no SSH server, no Landlock, no seccomp; runtime needs no TLS, no OPA, no HTTP parsing
- Version independence — proxy and runtime can be released independently (with control-plane protocol versioning)
Cons
- Coordination complexity — multi-component topologies require service discovery, health-check sequencing, and CA certificate distribution between components
- Operational surface area — more container images to build, version, scan, and ship
- Network hop latency — inter-pod proxy adds a network round-trip vs. in-process loopback (~0.1ms per request, but measurable at high throughput)
- Secret distribution — CA certificates and credentials must be shared via Kubernetes Secrets / volume mounts rather than in-memory handoff
- Distributed debugging — logs span multiple pods; correlating a proxy denial with the process that triggered it requires structured log correlation (trace IDs)
- Version skew risk — independently released proxy and runtime must maintain protocol compatibility; requires a versioned control-plane contract
- Cold-start overhead — two-component topologies need both components scheduled and ready before the sandbox is usable (sequential readiness)
- Increased CI complexity — three crates × multiple images × compatibility matrix increases the test surface
Example: gVisor Two-Pod Topology
As a concrete example, consider a Kubernetes deployment using gVisor for workload isolation:
- Proxy pod — runs the openshell-proxy binary in a trusted (non-gVisor) pod. No elevated privileges needed. Receives policy via gRPC from the gateway.
- Agent pod — runs the user workload under a gVisor RuntimeClass. Environment variables (HTTP_PROXY, HTTPS_PROXY, SSL_CERT_FILE) route all traffic through the proxy service.
- NetworkPolicy — a native Kubernetes resource restricts agent egress to the proxy and kube-dns only. The CNI enforces this at the host layer, beneath the gVisor sandbox, and gVisor prevents raw-socket bypass from inside the workload.
- CA Secret — proxy's generated CA certificate is distributed via a Kubernetes Secret mounted into the agent pod for TLS inspection trust.
This topology is impossible with the monolithic openshell-sandbox because the proxy cannot be deployed without the runtime, and gVisor makes the runtime redundant.
Alternatives Considered
Feature flags in a single crate — Less deployment flexibility: the default build still compiles and ships unused code, and a proxy-only container image cannot be produced without significant cfg complexity.
Microkernel plugin architecture — Over-engineered for two stable, well-defined responsibilities. The proxy/runtime boundary maps directly to a trust boundary — it's not arbitrary.
Separate repositories — Too much overhead. These crates share proto definitions, the OCSF logging framework, and release cadence. Monorepo with separate crates provides the right balance.
Agent Investigation
Explored crates/openshell-sandbox/ on main:
- src/lib.rs: 3,143 lines — orchestration, gRPC client, policy loading, supervisor sessions. Imports all modules directly.
- src/proxy.rs: 4,990 lines — the HTTP CONNECT proxy implementation, connection handling, tunnel management.
- src/opa.rs: 4,556 lines — OPA/Rego policy engine, rule compilation, evaluation.
- src/l7/: ~3,165 lines — TLS interception (522), L7 routing (1,881), inference (762), plus GraphQL/REST/path modules.
- src/ssh.rs: 1,533 lines — embedded russh SSH server, channel handling, credential verification.
- src/process.rs: 841 lines — process spawning, lifecycle, signal management.
- src/sandbox/linux/: ~2,098 lines — Landlock (480), seccomp (672), netns (946).
- Total monolith: ~17k lines, 82 dependencies in Cargo.toml.
The proxy subsystem (proxy.rs + opa.rs + l7/ + identity.rs + provider_credentials.rs + bypass_monitor.rs + denial_aggregator.rs) accounts for ~70% of the crate by line count but has zero dependency on the Linux isolation modules. The coupling is entirely through lib.rs orchestration — the split boundary is clean.