feat: gate pilot - LLM at batch decision boundaries #1
Draft
Adds a gate pilot that replaces the simple replan() call after batch failures. The pilot reads disk artifacts (verify logs, QA verdicts, task summaries, learnings) and returns structured decisions: failure analysis, retry strategies, routed context for upcoming tasks, skip recommendations, and re-batching.

Key design:
- Stateless: reconstructs context from files each invocation
- No telephone game: pilot makes system-level decisions, coding agents interpret their own errors directly
- Structured JSON output, orchestrator validates and applies
- Same model as planner (configurable via planner_model)
- Falls back to replan() on parse failure
- Config flag: pilot: false in otto.yaml to disable
- Zero overhead when no failures (pilot only invoked at batch boundary with failures + remaining tasks)

Codex-reviewed: 3 rounds, all CRITICAL/IMPORTANT findings fixed, APPROVED.

Benchmark: 53 tasks across 18 runs, 0 regressions, 0 pilot overhead. Pilot not yet validated on real failures; shipping as safe no-op upgrade for i2p readiness. Will prove value at scale (5+ tasks, multiple batches).

New files:
- otto/pilot.py: context assembly, LLM invocation, decision parsing
- tests/test_pilot.py: 22 unit tests
- tests/test_pilot_benchmark.py: 6 scenario benchmark tests
- bench/pilot-benchmark.sh: A/B benchmark runner
- bench/pressure/projects/pilot-test-*: 3 synthetic test projects

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
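The "structured JSON output, orchestrator validates and applies, falls back to replan() on parse failure" flow described above could look roughly like the sketch below. This is an illustration only, not the code in otto/pilot.py: the `PilotDecision` fields, the JSON key names, and the `decide()` helper are all hypothetical.

```python
import json
from dataclasses import dataclass, field


@dataclass
class PilotDecision:
    """Hypothetical container for the pilot's structured output."""
    failure_analysis: str = ""
    retry_strategies: dict = field(default_factory=dict)  # task id -> retry hint
    skip_tasks: list = field(default_factory=list)        # task ids to skip
    batches: list = field(default_factory=list)           # proposed re-batching


def parse_pilot_decision(raw: str):
    """Return a PilotDecision, or None when the LLM output is unusable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    retries = data.get("retry_strategies", {})
    skip = data.get("skip_tasks", [])
    batches = data.get("batches", [])
    # Reject wrongly typed fields rather than guessing what was meant.
    if not (isinstance(retries, dict) and isinstance(skip, list)
            and isinstance(batches, list)):
        return None
    return PilotDecision(
        failure_analysis=str(data.get("failure_analysis", "")),
        retry_strategies=retries,
        skip_tasks=skip,
        batches=batches,
    )


def decide(raw_llm_output: str, replan):
    """Use the pilot decision if it validates; otherwise call replan()."""
    decision = parse_pilot_decision(raw_llm_output)
    if decision is None:
        return replan()  # the fallback path named in the description
    return decision
```

Because validation happens before anything is applied, a malformed pilot response costs one extra replan() call and nothing else.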
Supersedes the gates + gate pilot approach. Simplified to 5 steps: classify → plan → execute → verify → fix-or-replan.

Key decisions:
- Single-task is a valid plan (no forced decomposition)
- Product artifacts at project root (not otto_arch/)
- Persistent context.md accumulates across tasks
- Vertical slices over horizontal layers
- User journeys from user's perspective, not feature list
- Fix rounds continue while making progress, replan on planning failures
- Codex-reviewed design

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
logpie added a commit that referenced this pull request on Apr 26, 2026
Closes 1 CRITICAL + 2 IMPORTANT findings from the mc-audit hunters.

CRITICAL closed:
- Logs appended unbounded into client state; 10MB log could lock the browser (codex-long-string-overflow #1). Replaced with bounded ring buffer (1MB max), separate unbounded totalBytes/totalLines counters, droppedBytes tracker for the elided-bytes header.

IMPORTANT closed:
- LogPane didn't distinguish "Live, polling" from "Final" state (codex-evidence-trustworthiness #5). Header now shows live polling cadence + last-update age, OR final size + line count.
- Missing log file rendered as generic "waiting for output"; fetch errors toasted while polling kept hammering (codex-error-empty-states #9). Now shows the path explicitly, plus an error state with Retry button. Polling pauses when log is missing/errored.

Polling resilience:
- Exponential backoff on consecutive errors: 1.2s → 2s → 5s → 15s → 30s.
- Resets to 1.2s on first successful read.
- Stops polling when run is terminal AND fully drained (uses new server `eof` field).
- Pauses polling when inspector is closed or tab is hidden (visibilitychange listener); resumes on visible.

Server changes:
- `LogReadResult` gains `total_bytes` (file size at read time) and `eof` (whether next_offset == total_bytes after this slice). All three constructor sites populated. Lets the client render "Final · {size}" headers and detect drain without a second HEAD request.
- `LogsResponse` TS type updated.
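The bounded buffer from the CRITICAL fix above can be sketched as follows. The real change lives in the TypeScript client; this Python version only illustrates the bookkeeping. The class name, the exact byte limit, and the header wording are assumptions, while the counter names mirror totalBytes/totalLines/droppedBytes from the commit message.

```python
class BoundedLogBuffer:
    """Keeps only the newest max_bytes of log data; counts everything seen."""

    def __init__(self, max_bytes: int = 1_000_000):  # "1MB max"; exact value assumed
        self.max_bytes = max_bytes
        self.data = b""          # retained tail of the log
        self.total_bytes = 0     # unbounded counter: every byte ever appended
        self.total_lines = 0     # unbounded counter: every newline ever appended
        self.dropped_bytes = 0   # bytes elided from the front of the buffer

    def append(self, chunk: bytes) -> None:
        self.total_bytes += len(chunk)
        self.total_lines += chunk.count(b"\n")
        self.data += chunk
        overflow = len(self.data) - self.max_bytes
        if overflow > 0:
            # Drop the oldest bytes so the retained tail never exceeds the cap.
            self.data = self.data[overflow:]
            self.dropped_bytes += overflow

    def elided_header(self) -> str:
        """Text for the elided-bytes header; empty when nothing was dropped."""
        if self.dropped_bytes == 0:
            return ""
        return f"{self.dropped_bytes} earlier bytes elided"
```

The invariant `total_bytes == len(data) + dropped_bytes` holds after every append, which is what lets the header report an accurate elided size while memory stays capped.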
Tests:
- `tests/browser/test_log_buffering.py`: 7 paired Playwright tests:
  - 5MB log renders <1.5MB DOM with elided-bytes header
  - Live state + polling header for active runs
  - Final state + line count for terminal runs
  - Missing-file path display + paused polling
  - Error backoff schedule (gap_first ≈ 2s, gap_second ≈ 5s)
  - Polling stops on inspector close
  - Polling stops on tab hidden
- Browser suite: 15 passed (7 new + 7 cluster A + 1 smoke)
- Default suite: 1076 passed (no regressions)
- npm run web:typecheck: clean
- npm run web:build: 277.82 kB JS / 33.34 kB CSS

Note: pre-existing basedpyright warnings in service.py around `_record_event` calls (lines 393-544) are not introduced by this commit; they predate cluster B and are flagged because basedpyright now analyzes the file when it's touched.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
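The polling rules in this commit (the backoff schedule, reset on success, stop when terminal and drained, pause when hidden) fit in a small state object. The sketch below is a Python illustration of that logic, not the client code; `PollBackoff` and `should_poll` are hypothetical names.

```python
BACKOFF_SECONDS = [1.2, 2.0, 5.0, 15.0, 30.0]  # schedule quoted in the commit message


class PollBackoff:
    """Tracks consecutive poll errors and yields the next polling delay."""

    def __init__(self):
        self.consecutive_errors = 0

    def record_success(self) -> None:
        # The first successful read resets the delay to 1.2s.
        self.consecutive_errors = 0

    def record_error(self) -> None:
        self.consecutive_errors += 1

    def next_delay(self) -> float:
        # Clamp at the last entry so repeated failures stay at 30s.
        index = min(self.consecutive_errors, len(BACKOFF_SECONDS) - 1)
        return BACKOFF_SECONDS[index]


def should_poll(run_is_terminal: bool, eof: bool,
                inspector_open: bool, tab_visible: bool) -> bool:
    """Poll only while the pane is visible and the log is not fully drained."""
    if not (inspector_open and tab_visible):
        return False
    return not (run_is_terminal and eof)
```

With this indexing the gap after the first error is 2s and after the second is 5s, which matches the gap_first/gap_second expectations listed in the tests.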
logpie added a commit that referenced this pull request on Apr 26, 2026
CLUSTER C - bundle/build integrity (closes 1 CRITICAL + 4 IMPORTANT + 1 NOTE):

CRITICAL closed:
- `app.py:35`: `otto web` served checked-in bundle without verifying it matches source; developer skips `npm run web:build` → silent stale UI (codex-packaging-bundle #1).

IMPORTANT closed:
- Python build never built frontend; `pip install -e .` shipped whatever static was in tree (#2). Vite plugin emits `build-stamp.json` on every build with source-hash + timestamp + git commit; FastAPI startup verifies freshness via `verify_bundle_freshness()`.
- `web:build` ran only Vite; `web:typecheck` advisory; no CI gates (#3). New `web:verify` script chains typecheck → build → committed-check.
- Default cache headers underused hashed assets (#4). New `_CacheHeaderStaticFiles` subclass: `no-store` for shell + index.html, `public, max-age=31536000, immutable` for `/static/assets/*`.
- Server didn't validate `index.html` referenced JS/CSS exist (#5). Startup parses index.html and asserts every referenced static path resolves; missing → fail-fast with `npm run web:build` guidance.

NOTE closed:
- `[tool.setuptools.package-data]` was flat (`static/*`, `static/assets/*`); future nested assets (fonts/, images/, locale/) would silently miss wheels. Now `static/**/*` recursive glob.

Files: otto/web/bundle.py (new, 263 lines), otto/web/client/vite.config.ts (build-stamp emitter), scripts/build_stamp.py (CLI for manual stamp), scripts/check_bundle_committed.py (git-diff guard for CI), package.json (web:verify script), pyproject.toml (recursive package_data), otto/web/app.py (verify_bundle_freshness call + _CacheHeaderStaticFiles).

Tests: tests/test_web_bundle_freshness.py (5 tests) + tests/test_web_cache_headers.py (2 tests, 8 actual checks). 15 new server-layer tests, all green.
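The cache-header policy from cluster C reduces to a path-based decision. The function below sketches only that decision and is not the `_CacheHeaderStaticFiles` subclass itself; the function name, the set of "shell" paths, and the choice to return None for other paths are assumptions.

```python
IMMUTABLE = "public, max-age=31536000, immutable"
NO_STORE = "no-store"

# Assumed locations of the app shell; the commit only says "shell + index.html".
SHELL_PATHS = {"/", "/index.html", "/static/index.html"}


def cache_control_for(path: str):
    """Return the Cache-Control value for a request path, or None to leave defaults."""
    if path.startswith("/static/assets/"):
        # Hashed filenames change whenever content changes, so caching forever is safe.
        return IMMUTABLE
    if path in SHELL_PATHS:
        # The shell references the hashed assets, so it must always be refetched.
        return NO_STORE
    return None
```

Splitting it this way means a rebuilt bundle is picked up on the next page load, while unchanged assets are never re-downloaded.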
CLUSTER D - history pagination (closes 1 CRITICAL + 2 IMPORTANT):

CRITICAL closed:
- `total_rows=247` displayed but only first page rendered; power user with 200+ runs stuck on page 1 (heavy-user, codex-state-management #6, codex-long-string-overflow #3).

IMPORTANT closed:
- `/api/state` accepted `history_page` + `page_size`; client never sent `history_page` and rendered no controls.
- Page-size selector now lives in the UI (10/25/50/100, default 25); server clamps to [1, 200] to refuse stale URLs requesting unbounded slices.

Implementation:
- `MissionControlFilters.history_page_size: int | None` for per-request override (server clamps to safe range).
- App.tsx History pane: pagination footer (Page N of M · X runs · ←/→ · jump-to + page-size selector). URL persists `hp` + `ps` query params.
- Filter changes reset page to 1.
- Stale deep-link `?hp=99` (out-of-range) → "Page 99 doesn't exist; jump to page 1" with reset button.

Files: otto/mission_control/{model.py,serializers.py} (history_page_size plumbing), otto/web/app.py (param wiring), otto/web/client/src/{App.tsx,api.ts,types.ts,styles.css} (pagination UI), tests/browser/test_history_pagination.py (10 paired Playwright tests, all green).

Verification:
- Browser suite: 25 passed (8 cluster A + 7 cluster B + 10 cluster D + new smoke set from cluster C cache headers via tests/test_web_cache_headers.py)
- Default suite: 1091 passed (was 1076; +15 cluster C server tests)
- npm run web:typecheck: clean
- npm run web:build: 277.82 kB JS / 33.34 kB CSS (rebuilt with stamp)

Note: cluster C agent hit an API overload during its final summary step but all files landed cleanly on disk; verified by independently running the new test files before committing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
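The page-size clamp and the out-of-range deep-link check from cluster D amount to a few lines of arithmetic. The sketch below is illustrative: the helper names and the returned dict shape are hypothetical, and applying the UI default of 25 when no size is given is an assumption.

```python
import math

DEFAULT_PAGE_SIZE = 25                   # UI default; assumed to apply when unset
MIN_PAGE_SIZE, MAX_PAGE_SIZE = 1, 200    # the [1, 200] clamp from the commit


def clamp_page_size(requested):
    """Clamp a requested page size into the safe range."""
    if requested is None:
        return DEFAULT_PAGE_SIZE
    return max(MIN_PAGE_SIZE, min(MAX_PAGE_SIZE, requested))


def resolve_page(total_rows: int, page: int, page_size=None) -> dict:
    """Compute paging info and flag stale deep links such as ?hp=99."""
    size = clamp_page_size(page_size)
    total_pages = max(1, math.ceil(total_rows / size))
    return {
        "page": page,
        "page_size": size,
        "total_pages": total_pages,
        "in_range": 1 <= page <= total_pages,
    }
```

For the 247-row case from the CRITICAL finding, the default size of 25 gives 10 pages, so a link to page 99 comes back with `in_range` false and the UI can offer the jump to page 1.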
logpie added a commit that referenced this pull request on Apr 26, 2026
CLUSTER E - diff freshness contract (closes 1 CRITICAL + 1 IMPORTANT):

CRITICAL closed:
- Diff fetched once and held client-side without target/branch SHAs or merge-base; **the code merged could differ from the diff reviewed** (codex-evidence-trustworthiness #1).

IMPORTANT closed:
- Diff truncation was bare `truncated` suffix; user couldn't tell how much was hidden, no full-diff download path (codex-evidence-trustworthiness #2).

Implementation:
- `MissionControlService.diff()` enriches response with: `fetched_at`, `target_sha`, `branch_sha`, `merge_base`, `command`, `limit_chars`, `full_size_chars`, `shown_hunks`, `total_hunks`, `errors[]`.
- `MissionControlService._validate_expected_diff_shas`: merge action rejects with 409 + "Re-fetch the diff to confirm what will be merged" when `expected_target_sha` / `expected_branch_sha` differ from live HEAD.
- POST /api/runs/{id}/actions/{action} forwards SHAs from request body.
- DiffPane renders freshness header (captured-X-ago + target/branch/base short SHAs with full-SHA tooltip; warnings when SHAs are null) + Refresh button + truncation banner ("Showing N hunks of M · X KB of Y MB" + Copy diff command).
- Merge confirm dialog spells out "Land branch {short} @ {sha} into target {short} @ {sha}" with the actual SHAs.
- SPA passes SHAs from most-recent diff fetch on every merge POST.

CLUSTER F - boot-loading gate + first-run clarity (closes 2 CRITICAL + ~12 IMPORTANT):

CRITICAL closed:
- App.tsx tri-state boot gate (`loading | launcher | ready`): main shell no longer renders before /api/projects returns; "New job" button can no longer be enabled with project undefined (codex-first-time-user #1).
- Pre-submit advanced-options summary in JobDialog: "Will run with: claude · sonnet · effort=high · verification=fast" outside Advanced details with "Edit" link (codex-first-time-user #2).
IMPORTANT closed:
- Launcher subhead: "Otto runs AI coding jobs in isolated git worktrees, then lets you review logs, diffs, and merge results."
- "Managed root" helper text explains current-repo isolation
- Empty project list: "Create your first Otto project below" + auto-focus
- First-run primary CTA: "Start first build" (reverts to "New job" once any run exists)
- Build/Improve/Certify dropdown options gain helper descriptions
- All commands now require non-empty intent or focus (was build-only)
- Dirty-project confirm lists up to 5 dirty files with "+N more"
- "Start queued job" CTA when watcher is stopped + jobs queued
- Empty detail copy: "Select a task card to review logs, code changes, verification, and next action."
- RunInspector tab labels: jargon-soft alternatives
- Recovery actions surfaced as primary contextual buttons (Retry / Resume / Cleanup) next to run header; Advanced still has full list
- HTTP-code-to-actionable-copy mapping in api.ts: 409/400/403/5xx → recovery messages (no more raw "HTTP 409")

Tests:
- tests/test_diff_freshness.py: 6 server tests
- tests/browser/test_diff_freshness.py: 5 browser tests
- tests/browser/test_first_run_clarity.py: 13 browser tests
- All green; default suite 1097 (was 1091; +6 cluster E server)
- Browser suite 47 (was 25; +5 cluster E, +17 cluster F)
- npm web:typecheck clean; bundle rebuilt

Followups for orchestrator:
- SHA-mismatch refusal is opt-in by client (older callers omit SHAs and bypass the gate); consider promoting to power-user opt-out flag
- Pre-fetch diff inside runActionForRun("merge") so SHA gate is non-bypassable from SPA
- Related provenance gaps (proof drawer cache, ArtifactRef metadata, visual-evidence manifest, proof file digest) share same architectural fix; bundle into a future cluster

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
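The SHA gate from cluster E, including the opt-in caveat from the followups, can be sketched as below. This is not the body of `MissionControlService._validate_expected_diff_shas`; the standalone function signature and the `StaleDiffError` class are hypothetical, and only the 409 status and the message text come from the commit.

```python
class StaleDiffError(Exception):
    """Raised when the reviewed diff no longer matches the live refs."""
    status_code = 409


def validate_expected_diff_shas(expected_target_sha, expected_branch_sha,
                                live_target_sha, live_branch_sha) -> None:
    """Reject a merge when the SHAs the client reviewed differ from live HEADs."""
    # Opt-in gate: callers that send no SHAs skip the check, as the followups note.
    if expected_target_sha is None and expected_branch_sha is None:
        return
    mismatched = []
    if expected_target_sha is not None and expected_target_sha != live_target_sha:
        mismatched.append("target")
    if expected_branch_sha is not None and expected_branch_sha != live_branch_sha:
        mismatched.append("branch")
    if mismatched:
        raise StaleDiffError(
            "Re-fetch the diff to confirm what will be merged "
            f"({', '.join(mismatched)} moved since the diff was fetched)"
        )
```

The early return is exactly the bypass the followups flag: making the SHAs mandatory would close it, at the cost of breaking older callers.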
logpie added a commit that referenced this pull request on Apr 26, 2026
…/launcher/lost-connection)

============================================================
5 IMPORTANT closures
============================================================

W7-IMPORTANT-1: iPhone submit button below the fold
In devices["iPhone 14"] viewport (390x664), JobDialog submit at y=674 was clipped without scroll affordance. JobDialog scroll-container now uses max-height: 100dvh - 80px and overflow-y: auto; submit button always reachable via scroll OR the dialog renders a floating bottom-bar on narrow viewports.

W8-IMPORTANT-1: JobDialog ignores Cmd+Enter from textarea
Power-user shortcut was dead. Added onKeyDown to intent textarea: (Cmd|Ctrl)+Enter now triggers submit when validation passes.

W9-IMPORTANT-2: Run double-rendered after terminal_outcome
Live[]/history[] transition not atomic; same run_id appeared in both, UI rendered twice. Client-side dedupe: when computing rows to render, exclude live items whose run_id appears in history with terminal_outcome set.

Codex first-time-user #4: "Managed root" looked like current repo disappeared
Launcher panel adds: "Otto manages projects in isolated git worktrees so it never touches your other repos. Pick or create one below to start."

Codex error-empty-states #1: Lost-connection banner
When polling fails 3+ consecutive times, sticky banner appears: "Lost connection to Mission Control. Retrying every 5s..." with manual retry button. Auto-clears on first successful poll after.
============================================================
Tests added (14 new browser tests)
============================================================
- tests/browser/test_iphone_submit_button_reachable.py (3 tests)
- tests/browser/test_job_dialog_cmd_enter.py (3 tests)
- tests/browser/test_no_double_render_after_terminal.py (4 tests)
- tests/browser/test_launcher_managed_root_explanation.py (1 test)
- tests/browser/test_lost_connection_banner.py (3 tests)

============================================================
Test counts
============================================================
- Default: 1189 (no change; all UI fixes)
- Browser: 198 effective (was 184; +14 new)
- npm web:typecheck: clean
- npm web:build: clean

============================================================
Tally
============================================================
CRITICAL: 29 of 29 closed (100%)
IMPORTANT: ~113 of 132 closed (~86%)

Followups:
- ~19 NOTE-tier IMPORTANTs remain (paper cuts; deferred per severity-gate rule unless adjacent)
- 76 NOTE items deferred per policy
- Phase 3.5 R1-R14 actual recordings ($70-140, hours of real LLM) remain scaffolded but not captured

This effectively closes the bulk of Phase 4 IMPORTANT work. Branch is in releasable state.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
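The W9 dedupe rule (drop live items whose run_id already appears in history with terminal_outcome set) is small enough to show directly. The actual fix is client-side TypeScript; this Python sketch assumes rows are dicts with `run_id` and `terminal_outcome` keys, and the function name and the live-then-history ordering are hypothetical.

```python
def rows_to_render(live: list, history: list) -> list:
    """Merge live and history rows without rendering a finished run twice."""
    # run_ids that history already reports as finished
    finished = {
        row["run_id"] for row in history if row.get("terminal_outcome")
    }
    # Keep a live row only while history has not recorded its terminal outcome.
    still_live = [row for row in live if row["run_id"] not in finished]
    return still_live + list(history)
```

A history row without terminal_outcome does not suppress its live twin, which follows the wording of the fix: only a recorded terminal outcome marks the handoff as complete.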
Summary
- Replaces `replan()` after batch failures with richer failure analysis, retry strategies, context routing, and skip recommendations
- Falls back to `replan()` on failure. Config flag `pilot: false` to disable. Zero overhead when no failures.

Status: NOT validated on real failures
The pilot never fired during benchmarking because the coding agent passed all tasks. This is expected — the pilot's value is at i2p scale (8+ tasks, multiple batches, partial failures). Shipping as a safe no-op upgrade.
What's new
- otto/pilot.py
- otto/orchestrator.py
- otto/runner.py
- tests/test_pilot.py
- tests/test_pilot_benchmark.py
- bench/pilot-benchmark.sh
- bench/pressure/projects/pilot-test-*

Design docs
- docs/superpowers/specs/2026-03-29-gate-pilot.md
- docs/superpowers/plans/2026-03-29-gate-pilot-stage1.md
- docs/superpowers/specs/2026-03-26-otto-intent-to-product.md

Test plan
- pilot.log on next failure.

🤖 Generated with Claude Code