feat(sandbox): docker + exec-free kubernetes dual execution backend#1841
Conversation
|
Update: validated the backend end-to-end locally against a kind cluster with the real
Remaining follow-ups: log-follow mid-stream reconnect (only the connect is retried today), |
📝 WalkthroughWalkthroughThis PR introduces a pluggable execution backend abstraction for the sandbox spawner, refactoring it from Docker-only to support both Docker and Kubernetes runtimes. A new Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@services/sandbox/src/backend/index.ts`:
- Around line 19-29: The exported type barrel (CacheStores, ExecutionBackend,
HealthResult, LaunchSpec, RunningExecution, RunOptions, RunResult, SweepOptions,
Workspace) is flagged as unused by knip; either remove this re-export block from
services/sandbox/src/backend/index.ts and update all call sites to import those
types directly from './types.ts', or add a knip configuration/annotation to mark
these symbols as intentional public API; locate the export block in index.ts
(the listed type names) and perform the chosen change consistently across the
codebase to satisfy the linter.
In `@services/sandbox/src/backend/kubernetes/k8s-cache.ts`:
- Around line 36-44: The empty catch for
client.core.readNamespacedPersistentVolumeClaim hides real errors; update the
catch to inspect the thrown error (e.g., check error.statusCode / error.code /
error.response?.status) and only fall through to create when the error indicates
a 404/NotFound; for any other error log it with console.error (including the
error and context like name/namespace) and rethrow or return a failure so
permission/network/API errors aren’t masked; ensure the change is applied around
the readNamespacedPersistentVolumeClaim call and the subsequent create path so
only true "not found" errors trigger creation.
In `@services/sandbox/src/backend/kubernetes/k8s-client.ts`:
- Around line 201-206: The custom Writable "stdout" ignores backpressure by
calling void untar.stdin.write(chunk); instead of handling the boolean return;
update the write implementation in the Writable returned for stdout to check the
result of untar.stdin.write(chunk) and if it returns false, pause calling cb()
until untar.stdin emits 'drain' (attach a one-time 'drain' listener that invokes
cb), otherwise call cb() immediately; ensure you also handle errors on
untar.stdin to avoid leaking listeners.
In `@services/sandbox/src/backend/kubernetes/k8s-pod-spec.test.ts`:
- Around line 11-35: Replace the explicit type annotation on the test fixture
with the TypeScript satisfies operator so the object literal remains fully
inferred but validated against SpawnerConfig; specifically update the cfg
declaration (symbol: cfg) in the k8s-pod-spec.test.ts test to use "satisfies
SpawnerConfig" instead of ": SpawnerConfig" to get compile-time conformance
without changing the inferred type.
In `@services/sandbox/src/backend/kubernetes/k8s-pod-spec.ts`:
- Around line 37-46: The exported type SandboxPodInput is flagged unused
externally; if it's only used as the parameter type for buildSandboxPod, make it
non-exported by removing the export modifier from SandboxPodInput so it remains
a local type, and update any local references (e.g., the inp: SandboxPodInput
parameter on buildSandboxPod) accordingly; alternatively, if you intend it to be
public, add a file-level Knip suppression or export a runtime value that
references the type to prevent the unused-export warning—pick one approach and
apply it consistently.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 6be0f5f5-dbbe-4fef-a693-08021f284d79
⛔ Files ignored due to path filters (1)
bun.lockis excluded by!**/*.lock
📒 Files selected for processing (17)
services/sandbox/package.jsonservices/sandbox/src/backend/docker/docker-backend.tsservices/sandbox/src/backend/docker/host-dir-workspace.tsservices/sandbox/src/backend/index.tsservices/sandbox/src/backend/kubernetes/k8s-backend.tsservices/sandbox/src/backend/kubernetes/k8s-cache.tsservices/sandbox/src/backend/kubernetes/k8s-client.tsservices/sandbox/src/backend/kubernetes/k8s-pod-spec.test.tsservices/sandbox/src/backend/kubernetes/k8s-pod-spec.tsservices/sandbox/src/backend/kubernetes/k8s-workspace.tsservices/sandbox/src/backend/types.tsservices/sandbox/src/cleanup.tsservices/sandbox/src/config.tsservices/sandbox/src/docker-args.test.tsservices/sandbox/src/server.tsservices/sandbox/src/spawn.tsservices/sandbox/src/types.ts
| export type { | ||
| CacheStores, | ||
| ExecutionBackend, | ||
| HealthResult, | ||
| LaunchSpec, | ||
| RunningExecution, | ||
| RunOptions, | ||
| RunResult, | ||
| SweepOptions, | ||
| Workspace, | ||
| } from './types.ts'; |
There was a problem hiding this comment.
Remove or justify the unused type re-export barrel to unblock lint.
This export block is currently failing CI (knip reports unused exported types on Line 20-Line 28). Either remove these re-exports and import directly from ./types.ts at use-sites, or explicitly configure knip to treat them as intentional public API.
As per coding guidelines, “Avoid barrel files in TypeScript — direct imports keep the dependency graph legible and tree-shaking honest.”
🧰 Tools
🪛 GitHub Actions: Lint / 2_Knip.txt
[error] 20-28: knip: Unused exported types detected (9). Unused type exports: CacheStores, ExecutionBackend, LaunchSpec, RunningExecution, RunOptions, RunResult, SweepOptions, Workspace.
🪛 GitHub Actions: Lint / Knip
[error] 20-20: knip: Unused exported type 'CacheStores'.
[error] 21-21: knip: Unused exported type 'ExecutionBackend'.
[error] 23-23: knip: Unused exported type 'LaunchSpec'.
[error] 24-24: knip: Unused exported type 'RunningExecution'.
[error] 25-25: knip: Unused exported type 'RunOptions'.
[error] 26-26: knip: Unused exported type 'RunResult'.
[error] 27-27: knip: Unused exported type 'SweepOptions'.
[error] 28-28: knip: Unused exported type 'Workspace'.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@services/sandbox/src/backend/index.ts` around lines 19 - 29, The exported
type barrel (CacheStores, ExecutionBackend, HealthResult, LaunchSpec,
RunningExecution, RunOptions, RunResult, SweepOptions, Workspace) is flagged as
unused by knip; either remove this re-export block from
services/sandbox/src/backend/index.ts and update all call sites to import those
types directly from './types.ts', or add a knip configuration/annotation to mark
these symbols as intentional public API; locate the export block in index.ts
(the listed type names) and perform the chosen change consistently across the
codebase to satisfy the linter.
Sources: Coding guidelines, Pipeline failures
| try { | ||
| await client.core.readNamespacedPersistentVolumeClaim({ | ||
| name, | ||
| namespace: client.namespace, | ||
| }); | ||
| return; // already exists | ||
| } catch { | ||
| // Not found (or transient) — fall through to create. | ||
| } |
There was a problem hiding this comment.
Empty catch block violates guidelines and may hide errors.
The catch block at line 42-44 is empty except for a comment. Per coding guidelines, empty catch blocks should log with console.warn or console.error. Additionally, this catch assumes all errors mean "not found," but errors could include permission failures, network issues, or API errors that should be surfaced rather than silently retried with a create operation.
🛡️ Proposed fix to log non-404 errors
try {
await client.core.readNamespacedPersistentVolumeClaim({
name,
namespace: client.namespace,
});
return; // already exists
- } catch {
- // Not found (or transient) — fall through to create.
+ } catch (err) {
+ // 404 expected if PVC doesn't exist; log other errors but still attempt create
+ const code = typeof err === 'object' && err !== null && 'code' in err ? err.code : undefined;
+ if (code !== 404) {
+ console.warn(`[sandbox.k8s] PVC read failed (code=${code}), will attempt create:`, err);
+ }
}As per coding guidelines, no empty catch blocks in JavaScript/TypeScript — log with console.warn/console.error or re-throw to avoid hiding bugs.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@services/sandbox/src/backend/kubernetes/k8s-cache.ts` around lines 36 - 44,
The empty catch for client.core.readNamespacedPersistentVolumeClaim hides real
errors; update the catch to inspect the thrown error (e.g., check
error.statusCode / error.code / error.response?.status) and only fall through to
create when the error indicates a 404/NotFound; for any other error log it with
console.error (including the error and context like name/namespace) and rethrow
or return a failure so permission/network/API errors aren’t masked; ensure the
change is applied around the readNamespacedPersistentVolumeClaim call and the
subsequent create path so only true "not found" errors trigger creation.
Source: Coding guidelines
| const stdout = new Writable({ | ||
| write(chunk, _enc, cb) { | ||
| void untar.stdin.write(chunk); | ||
| cb(); | ||
| }, | ||
| }); |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial | ⚖️ Poor tradeoff
Respect backpressure in stream write.
Line 203 ignores the boolean return value of untar.stdin.write(chunk) with a void cast. If the untar process is slower than the incoming tar stream, this can cause unbounded memory growth in the stdin buffer. The Writable callback should wait for drain events when write returns false.
♻️ Potential fix using backpressure-aware writes
const stdout = new Writable({
- write(chunk, _enc, cb) {
- void untar.stdin.write(chunk);
- cb();
+ write(chunk, _enc, cb) {
+ if (!untar.stdin.write(chunk)) {
+ untar.stdin.once('drain', cb);
+ } else {
+ cb();
+ }
},
});🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@services/sandbox/src/backend/kubernetes/k8s-client.ts` around lines 201 -
206, The custom Writable "stdout" ignores backpressure by calling void
untar.stdin.write(chunk); instead of handling the boolean return; update the
write implementation in the Writable returned for stdout to check the result of
untar.stdin.write(chunk) and if it returns false, pause calling cb() until
untar.stdin emits 'drain' (attach a one-time 'drain' listener that invokes cb),
otherwise call cb() immediately; ensure you also handle errors on untar.stdin to
avoid leaking listeners.
| const cfg: SpawnerConfig = { | ||
| backend: 'kubernetes', | ||
| port: 8003, | ||
| sandboxToken: 'test', | ||
| runtimeImage: 'tale-sandbox-runtime:test', | ||
| runtime: 'runc', | ||
| k8s: { | ||
| namespace: 'tale-sandbox', | ||
| runtimeClassName: 'gvisor', | ||
| holderImage: 'busybox:1.36', | ||
| cacheMode: 'none', | ||
| }, | ||
| defaultTimeoutMs: 30_000, | ||
| maxTimeoutMs: 300_000, | ||
| maxConcurrent: 4, | ||
| hostSessionRoot: '/var/lib/tale-sandbox/sessions', | ||
| cacheVolumePrefix: { pip: 'pip', npm: 'npm' }, | ||
| egressNetwork: 'tale-sandbox-net', | ||
| egressProxy: 'http://sandbox-egress:3128', | ||
| stdoutMaxBytes: 5_242_880, | ||
| stderrMaxBytes: 5_242_880, | ||
| outputFileMaxBytes: 52_428_800, | ||
| outputTotalMaxBytes: 104_857_600, | ||
| maxRequestBodyBytes: 262_144, | ||
| }; |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial | ⚡ Quick win
Prefer satisfies operator for test fixture.
The test configuration uses a type annotation (cfg: SpawnerConfig), but test fixtures benefit from the satisfies operator which provides compile-time validation while preserving the inferred type. This aligns with the TypeScript guidelines for test data.
♻️ Suggested refactor
-const cfg: SpawnerConfig = {
+const cfg = {
backend: 'kubernetes',
port: 8003,
// ... rest of config
-};
+} satisfies SpawnerConfig;Based on learnings, test fixtures (including TypeScript test data like this SpawnerConfig object) should prefer the 'satisfies' operator over type annotations. 'Satisfies' provides compile-time validation that the value conforms to the type without altering the inferred type, improving type safety.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| const cfg: SpawnerConfig = { | |
| backend: 'kubernetes', | |
| port: 8003, | |
| sandboxToken: 'test', | |
| runtimeImage: 'tale-sandbox-runtime:test', | |
| runtime: 'runc', | |
| k8s: { | |
| namespace: 'tale-sandbox', | |
| runtimeClassName: 'gvisor', | |
| holderImage: 'busybox:1.36', | |
| cacheMode: 'none', | |
| }, | |
| defaultTimeoutMs: 30_000, | |
| maxTimeoutMs: 300_000, | |
| maxConcurrent: 4, | |
| hostSessionRoot: '/var/lib/tale-sandbox/sessions', | |
| cacheVolumePrefix: { pip: 'pip', npm: 'npm' }, | |
| egressNetwork: 'tale-sandbox-net', | |
| egressProxy: 'http://sandbox-egress:3128', | |
| stdoutMaxBytes: 5_242_880, | |
| stderrMaxBytes: 5_242_880, | |
| outputFileMaxBytes: 52_428_800, | |
| outputTotalMaxBytes: 104_857_600, | |
| maxRequestBodyBytes: 262_144, | |
| }; | |
| const cfg = { | |
| backend: 'kubernetes', | |
| port: 8003, | |
| sandboxToken: 'test', | |
| runtimeImage: 'tale-sandbox-runtime:test', | |
| runtime: 'runc', | |
| k8s: { | |
| namespace: 'tale-sandbox', | |
| runtimeClassName: 'gvisor', | |
| holderImage: 'busybox:1.36', | |
| cacheMode: 'none', | |
| }, | |
| defaultTimeoutMs: 30_000, | |
| maxTimeoutMs: 300_000, | |
| maxConcurrent: 4, | |
| hostSessionRoot: '/var/lib/tale-sandbox/sessions', | |
| cacheVolumePrefix: { pip: 'pip', npm: 'npm' }, | |
| egressNetwork: 'tale-sandbox-net', | |
| egressProxy: 'http://sandbox-egress:3128', | |
| stdoutMaxBytes: 5_242_880, | |
| stderrMaxBytes: 5_242_880, | |
| outputFileMaxBytes: 52_428_800, | |
| outputTotalMaxBytes: 104_857_600, | |
| maxRequestBodyBytes: 262_144, | |
| } satisfies SpawnerConfig; |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@services/sandbox/src/backend/kubernetes/k8s-pod-spec.test.ts` around lines 11
- 35, Replace the explicit type annotation on the test fixture with the
TypeScript satisfies operator so the object literal remains fully inferred but
validated against SpawnerConfig; specifically update the cfg declaration
(symbol: cfg) in the k8s-pod-spec.test.ts test to use "satisfies SpawnerConfig"
instead of ": SpawnerConfig" to get compile-time conformance without changing
the inferred type.
Source: Learnings
| export interface SandboxPodInput { | ||
| executionId: string; | ||
| organizationId: string; | ||
| language: Extract<Language, 'python' | 'node' | 'polyglot'>; | ||
| /** Absolute path the runtime entrypoint will exec (see docker-args.ts). */ | ||
| entryPath: string; | ||
| startedAtMs: number; | ||
| /** Per-org cache PVC names; only consumed when cfg.k8s.cacheMode === 'pvc'. */ | ||
| cache?: CacheStores; | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Search for SandboxPodInput usage across the codebase
rg -n 'SandboxPodInput' --type=ts -C2Repository: tale-project/tale
Length of output: 931
Address Knip unused-export for SandboxPodInput
SandboxPodInputappears only inservices/sandbox/src/backend/kubernetes/k8s-pod-spec.ts(as theinp: SandboxPodInputparameter type forbuildSandboxPod); no other file references theSandboxPodInputidentifier.- Either make
SandboxPodInputnon-exported if it’s not meant to be an externally consumed public type, or adjust/suppress Knip so exported types used only in type signatures aren’t flagged as unused.
🧰 Tools
🪛 GitHub Actions: Lint / 2_Knip.txt
[error] 37-37: knip: Unused exported type detected. Unused interface export: SandboxPodInput.
🪛 GitHub Actions: Lint / Knip
[error] 37-37: knip: Unused exported interface 'SandboxPodInput'.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@services/sandbox/src/backend/kubernetes/k8s-pod-spec.ts` around lines 37 -
46, The exported type SandboxPodInput is flagged unused externally; if it's only
used as the parameter type for buildSandboxPod, make it non-exported by removing
the export modifier from SandboxPodInput so it remains a local type, and update
any local references (e.g., the inp: SandboxPodInput parameter on
buildSandboxPod) accordingly; alternatively, if you intend it to be public, add
a file-level Knip suppression or export a runtime value that references the type
to prevent the unused-export warning—pick one approach and apply it
consistently.
Source: Pipeline failures
…re k8s Introduce an ExecutionBackend seam so the spawner can target both Docker Compose (now) and Kubernetes (later), selected by SANDBOX_BACKEND (default 'docker'). DockerBackend wraps the UNCHANGED docker-args/spawn-util/volume modules plus the cleanup helpers; spawn.ts (executeRequest/cancelExecution), server.ts, and cleanup.ts now route runtime launch, cancel, health, image warm, orphan sweep, and shutdown through the backend. Staging/harvest still operate on a local Workspace dir, so the Compose path is behavior-preserving (the docker-args snapshot, spawn-util, and staging unit tests are unchanged and green; 123 tests pass). Two intentional, benign deviations vs the prior code (confirmed by an adversarial parity review): the no-op 'docker rm' against a never-created container on pre-launch error paths is elided (a container only exists after launch, so real removals are unaffected); the periodic-sweep stale threshold is captured a few ms earlier. Neither changes which containers or session dirs get reaped. A failed 'docker ps' still short-circuits the host-dir sweep, matching prior behavior. Phase 1 of the docker+k8s dual-backend plan; KubernetesBackend lands in Phase 2.
Implements SANDBOX_BACKEND=kubernetes: each execution runs as a Pod via
@kubernetes/client-node. The spawner stays the trusted broker — it stages
inputs into a local dir, tars them INTO the runtime Pod's /workspace via a
holder sidecar, follows the runner's logs, reaps the exit code from Pod
status, tars /workspace/output back OUT, and spawn.ts then runs its existing
harvest/upload/attestation over the local dir. SANDBOX_TOKEN, output byte
caps, and input attestation never enter the Pod.
Transport (exec-tar broker + holder sidecar) and the @kubernetes/client-node
streaming integration under Bun were validated live against a kind cluster
end to end (stage -> run -> harvest; exit code, live PHASE markers, separated
stderr, harvested output all correct). No runtime-image change: the runner's
sentinel-wait + stderr redirect are done via a Pod command override.
New: backend/kubernetes/{k8s-pod-spec(.test),k8s-client,k8s-cache,k8s-workspace,k8s-backend}.ts.
Pod spec mirrors docker-args (resources, securityContext, seccomp, fsGroup,
runtimeClassName for gVisor); 13 pod-spec unit tests. Config: SANDBOX_K8S_*
(namespace, runtimeClass, holderImage), SANDBOX_CACHE (none|pvc). Workspace
finalizeStaging() lifted out of stageWorkspace so the chown is docker-only
(k8s re-owns via tar-in).
Auth note (Bun): Bun's fetch ignores client-cert kubeconfigs, so the backend
uses SA bearer-token auth (in-cluster loadFromCluster, or SANDBOX_K8S_SERVER/
TOKEN/CAFILE for dev) + NODE_EXTRA_CA_CERTS for CA trust. Egress isolation is
operator-applied (NetworkPolicy + egress proxy); init() logs a reminder.
Phase 2 of the docker+k8s dual-backend plan.
…silience) Found + fixed by validating the backend end-to-end against a kind cluster with the REAL tale-sandbox-runtime image (real Python 3.12 execution, PHASE markers, separated stderr, harvested output — 5/5 consistent): - imagePullPolicy: IfNotPresent on runner + holder. Default 'Always' on a :latest tag re-pulls the image on EVERY execution and ignores a locally loaded image (kind load / air-gapped); IfNotPresent matches the docker path (pull once at boot, reuse). - Bun + @kubernetes/client-node abort resilience: the aborted Log.log stream rejected with a detached AbortError under Bun and intermittently crashed the process — now handled at the source via the log done-callback. Pod-status reads and the log-follow connect are wrapped in the same retry as the exec ops. server.ts gains an unhandledRejection backstop so one execution's stray async error can't take down the long-running spawner.
Lift the pure/shared execution helpers out of spawn.ts into a new exec-common.ts so both DockerBackend and the planned exec-free Kubernetes in-Pod stage/harvest modes can reuse them. spawn.ts keeps only the in-flight registry, cancelExecution, and executeRequest. Behavior is byte-identical: the moved helpers are unchanged and the existing spawn-staging / spawn-prior-outputs tests (now importing from exec-common) still pass. Foundation for the exec-free K8s transport.
Replace the Phase-1 fine-grained createWorkspace/ensureCacheStore/launch methods on ExecutionBackend with a single execute(cfg, req, opts) → ExecuteResponse. The local-workspace orchestration (stage → run → phase parse + live tail → harvest → response assembly) moves verbatim out of spawn.ts into a shared runLocalWorkspaceExecution helper that DockerBackend.execute delegates to. spawn.ts is now a thin dispatcher: request validation + the in-flight registry + cancelExecution. The fine-grained trio is now the DockerBackend-internal LocalWorkspaceRuntime contract (still satisfied by both backends). The exec-tar KubernetesBackend delegates to the same helper for now; the exec-free HTTP-only rewrite that needs the coarse seam lands in the following commits. Behavior is byte-identical — the existing spawn-staging / spawn-prior-outputs tests (136 pass) plus typecheck/lint/format are the parity proof.
Replace the Phase-2 exec-tar transport (which proved unreliable under Bun — intermittent websocket upgrade failures) with an exec-free design that uses only plain HTTP K8s operations. The spawner does control-plane only; workspace I/O moves INTO the Pod via trusted helper containers. Pod shape (one Pod, shared /workspace emptyDir, no PVC required): - stage initContainer (spawner image, k8s-stage.ts): downloads inputs from presigned URLs via the same stageWorkspace the docker path uses; writes the prior-stage attestation. Completes before the runner — no sentinel handshake. - runner (runtime image): command override runs /entrypoint.sh as a child of sh -c (not exec) so the wrapper captures the exit code into a file; stderr redirected to a file (K8s logs are stdout-only for clean phase parsing). No credentials, no callbacks. - harvest sidecar (spawner image, k8s-harvest.ts): owns the token + upload slots, enforces the user timeout, uploads /workspace/output via the same harvestOutputDir, and prints one __TALE_RESULT__ line the spawner reads back from its logs. Security: the per-exec Secret (presigned URLs + SANDBOX_TOKEN + caps) is mounted ONLY into stage/harvest, never the runner (asserted in pod-spec test). Scale: the result rides the harvest container's logs, read by the owning replica — no cross-replica callback, no affinity. Cancel = delete-by-name. - new: exec-spec.ts (Secret payload), k8s-protocol.ts (result-line contract), k8s-stage.ts + k8s-harvest.ts (in-Pod entry modes); createStreamScanner extracted to exec-common and shared by both backends. - k8s-client.ts: drop exec/tar helpers, keep makeK8sClient + log read/follow. - k8s-pod-spec.ts: 3-container shape; SANDBOX_SPAWNER_IMAGE replaces holderImage. - removed: k8s-workspace.ts (spawner-local staging is gone). typecheck/lint/format green; 156 unit tests pass. K8s e2e on a healthy cluster (the 50+ run reliability bar) is still pending — kind image-load is broken for fresh clusters, so on-cluster validation is deferred.
- Set TMPDIR (not just HOME) on the stage/harvest helper containers so Bun has a writable scratch dir under readOnlyRootFilesystem. - Retry the harvest log read up to 3x on a missing result line to absorb kubelet→API log-propagation lag after the harvest container terminates.
Replace the K8s runner log-follow (Log.log streaming) with discrete polling of readNamespacedPodLog. @kubernetes/client-node 1.4.0's Log.log pipes the response body internally and never invokes the done-callback, so aborting the follow on cancel/timeout left a detached AbortError rejection (survived only by server.ts's unhandledRejection backstop). Polling makes the ENTIRE k8s path use only discrete request/response HTTP calls — create/read/read-log/delete Pod — so there is no long-lived fetch to abort under Bun. Live PHASE + stdout deltas come from periodic readPodLog (limitBytes-capped, suffix-diffed); the canonical stdout is the final full read. Drops followLogs + the dead done-callback + the separate waitForHarvestDone (folded into the same poll loop). Verified on kind: cancel mid-flight is now clean (no detached AbortError); reliability sweeps stay 100% (50/50, 60/60) with PHASE/stdout intact.
The k8s backend added @kubernetes/client-node to services/sandbox/package.json but never regenerated services/sandbox/bun.lock — the standalone lock the Dockerfile builder stage installs with --frozen-lockfile (it COPYs only the sandbox package + lock, in isolation from the workspace root lock). The stale lock made the production image build fail with 'lockfile had changes, but lockfile is frozen'. Regenerated; --frozen-lockfile now passes in isolation.
…intra-file symbols k8s-stage.ts / k8s-harvest.ts are launched by absolute path inside the Pod (STAGE_ENTRY / HARVEST_ENTRY), never imported, so knip flagged them and everything only they import as unused. Register them as knip entries; drop the export keyword from symbols only used within their defining file; trim the backend/index.ts re-exports to HealthResult, the only type consumed via the index (the rest import ./types.ts directly).
@kubernetes/client-node pushed the sandbox image to 330 MB, over its 300 MB budget. Strip *.d.ts/*.d.mts/*.d.cts/*.map from production node_modules and remove bun's install cache in the same layer — bun hardlinks node_modules out of that cache, so deleting only the node_modules links frees ~2 MiB while the cache keeps the content; removing both shrinks the flattened image from 320 MiB to 284 MiB (measured). Budget moves to 320 MB: ~10% headroom over the optimized size while staying below the un-stripped 330 MB so a silently broken strip still trips the check.
…findings) Fixes the 1 critical + 14 major findings (plus high-value minors) from the two-round multi-agent review of this branch. Highlights: - sweepOrphans no longer reaps live executions' pods: terminal reaps gate on isLive + a finishedAt grace; stale reaps clamp to the worst-case execution lifetime (startup + maxTimeout + harvest backstop + slack). Per-exec Secrets get a GC path (sweep lists by label; podless orphans reaped by age). - Every K8s API call carries an AbortSignal timeout (per-call middleware); withRetry stops retrying definitive 4xx; transient readPod failures no longer convert a healthy run into SPAWNER_UNAVAILABLE. - New shared exec-response.ts constructors route every terminal response on BOTH backends; normative outcome table documented on ExecutionBackend.execute and pinned by exec-response.test.ts. - Runner-dead short-circuit: a runner container that dies without writing the exit-code file (cgroup group OOM kill, eviction) is detected via the new __TALE_HARVEST_STARTED__ marker and classified promptly (OOM/137 in ~13s on kind, was full-timeout TIMEOUT/124); the marker also recovers the real exit code when the harvest later crashes. - Cancel redesign: local cancel is abort-only (execute() finishes its log reads before its finally deletes the pod — cancelled stdout is now deterministic; docker kills the container via an in-wait abort listener); cross-replica cancel reaches backend.cancel() when the id isn't local. - Duplicate executionId on another replica can no longer tear down the live owner (ownership-gated finally + pod-probing secret-409 handling). - Harvest hardening: per-entry stat/read guards in harvestOutputDir (timeout harvest races live writers), fd-bounded stderr read (unbounded stderr was OOM-killing the harvest container), in-harvest crash catch + fatal fallback result line, exec-spec numeric validation (NaN-deadline guard), deadline- based result-line reads, atomic results.json flushes. - Classification alignment: staging failures return SPAWNER_UNAVAILABLE on both backends with the stage log tail in the message (PRE_STAGE_FAILED stays action-side per wire.ts); docker daemon errors stop classifying as user code; k8s backstop expiry is TIMEOUT/124, not HARVEST_READ_FAILED. - Workspace emptyDir bounded by SANDBOX_K8S_WORKSPACE_SIZE_LIMIT (4Gi); boot failures exit instead of leaving a zombie process; empty catches resolved; RBAC contract gains secrets `list` (docs updated). 206 unit tests green (50 new: contract suite, sweep matrix, duplicate-409 stub flows, classify matrix, walk resilience, protocol markers, retry predicate). Validated live on kind: 13/13 scenarios including OOM short-circuit latency, cancel stdout determinism, duplicate dispatch with the owner unharmed, and a mid-flight sweep sparing the live pod; zero leftover pods/Secrets after all paths. Deferred (issues #1847-#1851): stage-attestation forgeability, executeMs semantics, Bun TLS-knob verification, log-rotation windowing, ulimit parity.
99cfe77 to
4349520
Compare
What
Make the sandbox service deployable on both Docker Compose (today) and Kubernetes (Helm), selected by
SANDBOX_BACKEND=docker|kubernetes. The Compose path is byte-identical to before; the K8s path is a native Pod-per-exec backend that needs nodocker.sock, no host-path workspace, and no--internalegress bridge.Architecture
A coarse
ExecutionBackend.execute(cfg, req, opts) → ExecuteResponseseam:DockerBackend— the Compose path, unchanged. Stages onto a host bind-mount,docker run, harvests back. The orchestration lives inlocal-workspace-run.ts; existing unit tests are the parity proof.KubernetesBackend— exec-free, HTTP-only. Every spawner↔Pod interaction is plain HTTP (createNamespacedPod/readNamespacedPodLog/deleteNamespacedPod) + presigned-URL I/O done inside the Pod.Why exec-free
Deep local e2e on kind found the K8s exec websocket unreliable under Bun (intermittent
Expected 101upgrade failures; ~0% on a stressed cluster).@kubernetes/client-node'sLog.log/readNamespacedPodLoguse a plain HTTPSfetch(reliable); onlyExecusesisomorphic-ws(the failure point). So the transport uses only HTTP K8s ops — no websocket anywhere.Exec-free Pod (one Pod, shared
/workspaceemptyDir, 3 containers)stageinitContainer (spawner image): downloads inputs from presigned URLs, writes the multi-step wrapper + prior-stage attestation. Completes before the runner — no sentinel. Required-input failure →PRE_STAGE_FAILED.runner(runtime image): command override runs/entrypoint.shas a child ofsh -c(notexec) so the wrapper captures the exit code to a file; stderr → a file (K8s logs stay clean stdout for phase parsing). No credentials, no callbacks. No runtime-image change.harvestsidecar (spawner image): owns the token + upload slots, enforces the user timeout, uploads/workspace/outputvia presigned slots + EP1/EP2, and prints one__TALE_RESULT__line the spawner reads back from its logs.Security
The per-exec Secret (presigned URLs +
SANDBOX_TOKEN+ caps) is mounted only intostage/harvest, never therunner— regression-tested. The runner has no token and no URLs.Horizontal scale
The result rides the
harvestcontainer's logs, read by the owning spawner replica — no callback to a Service VIP, no cross-replica affinity. The spawner stays HPA-able; cancel = delete-by-deterministic-name from any replica.Platform side
Unchanged. The spawner still emits the same SSE
phase/stdout/stderr/resultevents; EP1/EP2 are backend-agnostic.Verification
bun run checkgreen; staging/prior-output unit tests unchanged.__TALE_RESULT__result-line protocol. 156 tests pass, typecheck/lint/format clean.KubernetesBackendthrough the actualexecuteRequestdispatcher.RUNTIME_ERROR) / multistep (steps[] wrapper + output harvested).cancelled, Pod + Secret cleaned up.SANDBOX_TOKEN/ no presigned URLs;stage+harvestdo.AbortErroron cancel under Bun (client-node 1.4.0'sLog.logpipes the body but never invokes the done-callback). Replaced the stream with discretereadNamespacedPodLogpolling, so the whole k8s path is request/response HTTP with nothing to abort. Re-validated clean.services/sandbox/bun.lock(the standalone lock the Dockerfile builder uses) was missing@kubernetes/client-node— regenerated so--frozen-lockfilepasses.The exec-free transport is validated on real k8s — the original "don't enable in production until tested" blocker is cleared. What remains before a production cutover is ops/integration, not transport reliability: a smoke against the real platform storage + EP1/EP2 in-cluster (the e2e used an in-cluster mock for those; the code that calls them is backend-agnostic and already production-proven on the docker path), plus the operator-applied gVisor RuntimeClass + egress NetworkPolicy (documented, not yet exercised here).
Deploy contract
services/sandbox/docs/kubernetes.mddocuments the RBAC (namespaced Role;podscreate/get/list/delete,pods/logget,secretscreate/update/delete; nopods/exec), the required env (SANDBOX_SPAWNER_IMAGE,NODE_EXTRA_CA_CERTS, …), and the recommended egress NetworkPolicy. Helm manifests are authored separately by the deployment owner.