feat: gate pilot — LLM at batch decision boundaries by logpie · Pull Request #1 · logpie/otto

logpie · 2026-03-30T04:05:12Z

Summary

Adds a gate pilot that replaces simple replan() after batch failures with richer failure analysis, retry strategies, context routing, and skip recommendations
Stateless design: reads disk artifacts, returns structured JSON, orchestrator validates and applies
Falls back to replan() on failure. Config flag pilot: false to disable. Zero overhead when no failures.
Codex-reviewed (3 rounds, APPROVED). 484 unit tests pass. 53 tasks across 18 e2e runs, 0 regressions.

Status: NOT validated on real failures

The pilot never fired during benchmarking because the coding agent passed all tasks. This is expected — the pilot's value is at i2p scale (8+ tasks, multiple batches, partial failures). Shipping as a safe no-op upgrade.

What's new

| File | What |
| --- | --- |
| `otto/pilot.py` | Gate pilot module: context assembly, LLM call, decision parsing |
| `otto/orchestrator.py` | Pilot at batch boundaries, fallback to replan, config flag |
| `otto/runner.py` | Pilot guidance separated in retry prompts |
| `tests/test_pilot.py` | 22 unit tests |
| `tests/test_pilot_benchmark.py` | 6 scenario tests |
| `bench/pilot-benchmark.sh` | A/B benchmark runner |
| `bench/pressure/projects/pilot-test-*` | 3 synthetic test projects |

Design docs

Spec: docs/superpowers/specs/2026-03-29-gate-pilot.md
Plan: docs/superpowers/plans/2026-03-29-gate-pilot-stage1.md
i2p spec: docs/superpowers/specs/2026-03-26-otto-intent-to-product.md

Test plan

484 unit tests pass (0 new failures)
Codex adversarial review: 3 rounds, APPROVED
18 e2e runs (6 projects × baseline/pilot): zero overhead, zero regressions
4 real-world combined runs (ufo, humanize, camelcase, pre-commit)
5-task greenfield run with merge conflicts
Pending: real-world pilot invocation — needs a run where batch has mixed pass/fail results with remaining tasks. Monitor pilot.log on next failure.

🤖 Generated with Claude Code

Adds a gate pilot that replaces the simple replan() call after batch failures. The pilot reads disk artifacts (verify logs, QA verdicts, task summaries, learnings) and returns structured decisions: failure analysis, retry strategies, routed context for upcoming tasks, skip recommendations, and re-batching.

Key design:
- Stateless: reconstructs context from files each invocation
- No telephone game: pilot makes system-level decisions, coding agents interpret their own errors directly
- Structured JSON output, orchestrator validates and applies
- Same model as planner (configurable via planner_model)
- Falls back to replan() on parse failure
- Config flag: pilot: false in otto.yaml to disable
- Zero overhead when no failures (pilot only invoked at batch boundary with failures + remaining tasks)

Codex-reviewed: 3 rounds, all CRITICAL/IMPORTANT findings fixed, APPROVED.
Benchmark: 53 tasks across 18 runs, 0 regressions, 0 pilot overhead.

Pilot not yet validated on real failures; shipping as safe no-op upgrade for i2p readiness. Will prove value at scale (5+ tasks, multiple batches).

New files:
- otto/pilot.py: context assembly, LLM invocation, decision parsing
- tests/test_pilot.py: 22 unit tests
- tests/test_pilot_benchmark.py: 6 scenario benchmark tests
- bench/pilot-benchmark.sh: A/B benchmark runner
- bench/pressure/projects/pilot-test-*: 3 synthetic test projects

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
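
To make the "structured JSON, orchestrator validates, fall back to replan()" flow concrete, here is a minimal sketch. It is not the code in `otto/pilot.py`: the `PilotDecision` fields, `should_invoke_pilot`, and `parse_decision` are hypothetical names chosen for illustration, and the real schema may differ.

```python
from __future__ import annotations

import json
from dataclasses import dataclass, field


@dataclass
class PilotDecision:
    # Hypothetical schema; field names are illustrative, not taken from otto/pilot.py.
    failure_analysis: str = ""
    retry_strategies: dict[str, str] = field(default_factory=dict)
    skip_tasks: list[str] = field(default_factory=list)


def should_invoke_pilot(enabled: bool, failed: list[str], remaining: list[str]) -> bool:
    """Invoke only at a batch boundary that has failures and remaining tasks."""
    return enabled and bool(failed) and bool(remaining)


def parse_decision(raw: str, known_tasks: set[str]) -> PilotDecision | None:
    """Return a validated decision, or None so the caller can fall back to replan()."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    retries = data.get("retry_strategies", {})
    skips = data.get("skip_tasks", [])
    if not isinstance(retries, dict) or not isinstance(skips, list):
        return None
    # Drop references to tasks the orchestrator does not know about.
    return PilotDecision(
        failure_analysis=str(data.get("failure_analysis", "")),
        retry_strategies={k: str(v) for k, v in retries.items() if k in known_tasks},
        skip_tasks=[t for t in skips if isinstance(t, str) and t in known_tasks],
    )
```

In this sketch a `None` result is the signal for the orchestrator to call the existing replan() path, so a malformed LLM response never blocks the run.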

coderabbitai · 2026-03-30T04:05:19Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6de94498-333a-48d3-b2f0-f8f8313d2328

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Supersedes the gates + gate pilot approach. Simplified to 5 steps: classify → plan → execute → verify → fix-or-replan.

Key decisions:
- Single-task is a valid plan (no forced decomposition)
- Product artifacts at project root (not otto_arch/)
- Persistent context.md accumulates across tasks
- Vertical slices over horizontal layers
- User journeys from user's perspective, not feature list
- Fix rounds continue while making progress, replan on planning failures
- Codex-reviewed design

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Closes 1 CRITICAL + 2 IMPORTANT findings from the mc-audit hunters.

CRITICAL closed:
- Logs appended unbounded into client state; 10MB log could lock the browser (codex-long-string-overflow #1). Replaced with bounded ring buffer (1MB max), separate unbounded totalBytes/totalLines counters, droppedBytes tracker for the elided-bytes header.

IMPORTANT closed:
- LogPane didn't distinguish "Live, polling" from "Final" state (codex-evidence-trustworthiness #5). Header now shows live polling cadence + last-update age, OR final size + line count.
- Missing log file rendered as generic "waiting for output"; fetch errors toasted while polling kept hammering (codex-error-empty-states #9). Now shows the path explicitly, plus an error state with Retry button. Polling pauses when log is missing/errored.

Polling resilience:
- Exponential backoff on consecutive errors: 1.2s → 2s → 5s → 15s → 30s.
- Resets to 1.2s on first successful read.
- Stops polling when run is terminal AND fully drained (uses new server `eof` field).
- Pauses polling when inspector is closed or tab is hidden (visibilitychange listener); resumes on visible.

Server changes:
- `LogReadResult` gains `total_bytes` (file size at read time) and `eof` (whether next_offset == total_bytes after this slice). All three constructor sites populated. Lets the client render "Final · {size}" headers and detect drain without a second HEAD request.
- `LogsResponse` TS type updated.

Tests:
- `tests/browser/test_log_buffering.py`: 7 paired Playwright tests:
  - 5MB log renders <1.5MB DOM with elided-bytes header
  - Live state + polling header for active runs
  - Final state + line count for terminal runs
  - Missing-file path display + paused polling
  - Error backoff schedule (gap_first ≈ 2s, gap_second ≈ 5s)
  - Polling stops on inspector close
  - Polling stops on tab hidden
- Browser suite: 15 passed (7 new + 7 cluster A + 1 smoke)
- Default suite: 1076 passed (no regressions)
- npm run web:typecheck: clean
- npm run web:build: 277.82 kB JS / 33.34 kB CSS

Note: pre-existing basedpyright warnings in service.py around `_record_event` calls (lines 393-544) are not introduced by this commit; they predate cluster B and are flagged because basedpyright now analyzes the file when it's touched.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
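
The backoff schedule and the bounded buffer described above can be sketched as follows. The real client code is TypeScript and is not shown in this commit message, so this Python version is only an illustration: `next_delay` and `BoundedLog` are hypothetical names, and treating "1MB" as 1,000,000 bytes is an assumption.

```python
BACKOFF_SCHEDULE = [1.2, 2.0, 5.0, 15.0, 30.0]  # seconds, as listed in the commit message


def next_delay(consecutive_errors: int) -> float:
    """Delay before the next poll; zero errors means the base 1.2s cadence."""
    index = min(max(consecutive_errors, 0), len(BACKOFF_SCHEDULE) - 1)
    return BACKOFF_SCHEDULE[index]


class BoundedLog:
    """Keeps at most max_bytes of the newest data and counts what was dropped."""

    def __init__(self, max_bytes: int = 1_000_000) -> None:  # assumed meaning of "1MB"
        self.max_bytes = max_bytes
        self.buffer = b""
        self.total_bytes = 0    # unbounded counter of everything ever received
        self.dropped_bytes = 0  # feeds an "elided bytes" header

    def append(self, chunk: bytes) -> None:
        self.total_bytes += len(chunk)
        self.buffer += chunk
        overflow = len(self.buffer) - self.max_bytes
        if overflow > 0:
            # Discard the oldest bytes so memory stays bounded.
            self.buffer = self.buffer[overflow:]
            self.dropped_bytes += overflow
```

Resetting the error counter to zero after a successful read returns the delay to 1.2s, which matches the "resets on first successful read" bullet.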

CLUSTER C: bundle/build integrity (closes 1 CRITICAL + 4 IMPORTANT + 1 NOTE):

CRITICAL closed:
- `app.py:35`: `otto web` served checked-in bundle without verifying it matches source; developer skips `npm run web:build` → silent stale UI (codex-packaging-bundle #1).

IMPORTANT closed:
- Python build never built frontend; `pip install -e .` shipped whatever static was in tree (#2). Vite plugin emits `build-stamp.json` on every build with source-hash + timestamp + git commit; FastAPI startup verifies freshness via `verify_bundle_freshness()`.
- `web:build` ran only Vite; `web:typecheck` advisory; no CI gates (#3). New `web:verify` script chains typecheck → build → committed-check.
- Default cache headers underused hashed assets (#4). New `_CacheHeaderStaticFiles` subclass: `no-store` for shell + index.html, `public, max-age=31536000, immutable` for `/static/assets/*`.
- Server didn't validate `index.html` referenced JS/CSS exist (#5). Startup parses index.html and asserts every referenced static path resolves; missing → fail-fast with `npm run web:build` guidance.

NOTE closed:
- `[tool.setuptools.package-data]` was flat (`static/*`, `static/assets/*`); future nested assets (fonts/, images/, locale/) would silently miss wheels. Now `static/**/*` recursive glob.

Files: otto/web/bundle.py (new, 263 lines), otto/web/client/vite.config.ts (build-stamp emitter), scripts/build_stamp.py (CLI for manual stamp), scripts/check_bundle_committed.py (git-diff guard for CI), package.json (web:verify script), pyproject.toml (recursive package_data), otto/web/app.py (verify_bundle_freshness call + _CacheHeaderStaticFiles).

Tests: tests/test_web_bundle_freshness.py (5 tests) + tests/test_web_cache_headers.py (2 tests, 8 actual checks). 15 new server-layer tests, all green.
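
The cache-header policy can be illustrated with a small pure function. This is a sketch only: `cache_control_for` is a hypothetical helper, not the `_CacheHeaderStaticFiles` subclass itself, and defaulting every non-asset path to `no-store` is an assumption, since the commit message only names the shell and index.html.

```python
IMMUTABLE = "public, max-age=31536000, immutable"
NO_STORE = "no-store"


def cache_control_for(path: str) -> str:
    """Pick a Cache-Control value for a request path."""
    # Hashed build assets change name whenever their content changes,
    # so they can be cached for a year without risking a stale UI.
    if path.startswith("/static/assets/"):
        return IMMUTABLE
    # Assumption: everything else (shell, index.html) must always be refetched.
    return NO_STORE
```

Because index.html is never cached, a new build is picked up on the next page load even though the assets it references are cached indefinitely.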
CLUSTER D: history pagination (closes 1 CRITICAL + 2 IMPORTANT):

CRITICAL closed:
- `total_rows=247` displayed but only first page rendered; power user with 200+ runs stuck on page 1 (heavy-user, codex-state-management #6, codex-long-string-overflow #3).

IMPORTANT closed:
- `/api/state` accepted `history_page` + `page_size`; client never sent `history_page` and rendered no controls.
- Page-size selector now lives in the UI (10/25/50/100, default 25); server clamps to [1, 200] to refuse stale URLs requesting unbounded slices.

Implementation:
- `MissionControlFilters.history_page_size: int | None` for per-request override (server clamps to safe range).
- App.tsx History pane: pagination footer (Page N of M · X runs · ←/→ · jump-to + page-size selector). URL persists `hp` + `ps` query params.
- Filter changes reset page to 1.
- Stale deep-link `?hp=99` (out-of-range) → "Page 99 doesn't exist; jump to page 1" with reset button.

Files: otto/mission_control/{model.py,serializers.py} (history_page_size plumbing), otto/web/app.py (param wiring), otto/web/client/src/{App.tsx,api.ts,types.ts,styles.css} (pagination UI), tests/browser/test_history_pagination.py (10 paired Playwright tests, all green).

Verification:
- Browser suite: 25 passed (8 cluster A + 7 cluster B + 10 cluster D + new smoke set from cluster C cache headers via tests/test_web_cache_headers.py)
- Default suite: 1091 passed (was 1076; +15 cluster C server tests)
- npm run web:typecheck: clean
- npm run web:build: 277.82 kB JS / 33.34 kB CSS (rebuilt with stamp)

Note: cluster C agent hit an API overload during its final summary step but all files landed cleanly on disk; verified by independently running the new test files before committing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
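
A minimal sketch of the page-size clamp and the out-of-range check, assuming the bounds quoted above ([1, 200], default 25). The function names are hypothetical and not taken from `otto/mission_control`.

```python
from __future__ import annotations

import math


def clamp_page_size(requested: int | None, default: int = 25) -> int:
    """Clamp a per-request page size into the safe range [1, 200]."""
    if requested is None:
        return default
    return max(1, min(200, requested))


def page_count(total_rows: int, page_size: int) -> int:
    """Number of pages; an empty history still has one (empty) page."""
    return max(1, math.ceil(total_rows / page_size))


def page_exists(page: int, total_rows: int, page_size: int) -> bool:
    """False for stale deep links such as ?hp=99 on a 10-page history."""
    return 1 <= page <= page_count(total_rows, page_size)
```

With the 247 rows from the finding and the default page size, `page_count(247, 25)` is 10, so a deep link to page 99 fails `page_exists` and the UI can offer the jump back to page 1.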

CLUSTER E: diff freshness contract (closes 1 CRITICAL + 1 IMPORTANT):

CRITICAL closed:
- Diff fetched once and held client-side without target/branch SHAs or merge-base; **the code merged could differ from the diff reviewed** (codex-evidence-trustworthiness #1).

IMPORTANT closed:
- Diff truncation was bare `truncated` suffix; user couldn't tell how much was hidden, no full-diff download path (codex-evidence-trustworthiness #2).

Implementation:
- `MissionControlService.diff()` enriches response with: `fetched_at`, `target_sha`, `branch_sha`, `merge_base`, `command`, `limit_chars`, `full_size_chars`, `shown_hunks`, `total_hunks`, `errors[]`.
- `MissionControlService._validate_expected_diff_shas`: merge action rejects with 409 + "Re-fetch the diff to confirm what will be merged" when `expected_target_sha` / `expected_branch_sha` differ from live HEAD.
- POST /api/runs/{id}/actions/{action} forwards SHAs from request body.
- DiffPane renders freshness header (captured-X-ago + target/branch/base short SHAs with full-SHA tooltip; warnings when SHAs are null) + Refresh button + truncation banner ("Showing N hunks of M · X KB of Y MB" + Copy diff command).
- Merge confirm dialog spells out "Land branch {short} @ {sha} into target {short} @ {sha}" with the actual SHAs.
- SPA passes SHAs from most-recent diff fetch on every merge POST.

CLUSTER F: boot-loading gate + first-run clarity (closes 2 CRITICAL + ~12 IMPORTANT):

CRITICAL closed:
- App.tsx tri-state boot gate (`loading | launcher | ready`): main shell no longer renders before /api/projects returns; "New job" button can no longer be enabled with project undefined (codex-first-time-user #1).
- Pre-submit advanced-options summary in JobDialog: "Will run with: claude · sonnet · effort=high · verification=fast" outside Advanced details with "Edit" link (codex-first-time-user #2).

IMPORTANT closed:
- Launcher subhead: "Otto runs AI coding jobs in isolated git worktrees, then lets you review logs, diffs, and merge results."
- "Managed root" helper text explains current-repo isolation
- Empty project list: "Create your first Otto project below" + auto-focus
- First-run primary CTA: "Start first build" (reverts to "New job" once any run exists)
- Build/Improve/Certify dropdown options gain helper descriptions
- All commands now require non-empty intent or focus (was build-only)
- Dirty-project confirm lists up to 5 dirty files with "+N more"
- "Start queued job" CTA when watcher is stopped + jobs queued
- Empty detail copy: "Select a task card to review logs, code changes, verification, and next action."
- RunInspector tab labels: jargon-soft alternatives
- Recovery actions surfaced as primary contextual buttons (Retry / Resume / Cleanup) next to run header; Advanced still has full list
- HTTP-code-to-actionable-copy mapping in api.ts: 409/400/403/5xx → recovery messages (no more raw "HTTP 409")

Tests:
- tests/test_diff_freshness.py: 6 server tests
- tests/browser/test_diff_freshness.py: 5 browser tests
- tests/browser/test_first_run_clarity.py: 13 browser tests
- All green; default suite 1097 (was 1091; +6 cluster E server)
- Browser suite 47 (was 25; +5 cluster E + +17 cluster F)
- npm web:typecheck clean; bundle rebuilt

Followups for orchestrator:
- SHA-mismatch refusal is opt-in by client (older callers omit SHAs and bypass the gate); consider promoting to power-user opt-out flag
- Pre-fetch diff inside runActionForRun("merge") so SHA gate is non-bypassable from SPA
- Related provenance gaps (proof drawer cache, ArtifactRef metadata, visual-evidence manifest, proof file digest) share same architectural fix; bundle into a future cluster

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
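
The SHA gate on merge can be sketched as below. This is not the body of `_validate_expected_diff_shas`; `StaleDiffError` and `validate_expected_shas` are hypothetical names, and how the live SHAs are obtained is left out. Skipping the check when a SHA is omitted mirrors the opt-in behaviour flagged in the followups.

```python
from __future__ import annotations


class StaleDiffError(Exception):
    """Raised when the reviewed diff no longer matches the live refs."""

    status_code = 409


def validate_expected_shas(
    expected_target: str | None,
    expected_branch: str | None,
    live_target: str,
    live_branch: str,
) -> None:
    """Reject a merge if either expected SHA differs from the live one."""
    checks = (
        ("target", expected_target, live_target),
        ("branch", expected_branch, live_branch),
    )
    for label, expected, live in checks:
        # A missing expected SHA is not checked (opt-in gate, per the followups).
        if expected is not None and expected != live:
            raise StaleDiffError(
                f"{label} moved from {expected} to {live}. "
                "Re-fetch the diff to confirm what will be merged"
            )
```

A web handler would translate `StaleDiffError` into an HTTP 409 response carrying the message, so the client can prompt for a refresh instead of merging unreviewed code.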

…/launcher/lost-connection)

============================================================
5 IMPORTANT closures
============================================================

W7-IMPORTANT-1: iPhone submit button below the fold
In devices["iPhone 14"] viewport (390x664), JobDialog submit at y=674 was clipped without scroll affordance. JobDialog scroll-container now uses max-height: 100dvh - 80px and overflow-y: auto; submit button always reachable via scroll OR the dialog renders a floating bottom-bar on narrow viewports.

W8-IMPORTANT-1: JobDialog ignores Cmd+Enter from textarea
Power-user shortcut was dead. Added onKeyDown to intent textarea: (Cmd|Ctrl)+Enter now triggers submit when validation passes.

W9-IMPORTANT-2: Run double-rendered after terminal_outcome
Live[]/history[] transition not atomic; same run_id appeared in both, UI rendered twice. Client-side dedupe: when computing rows to render, exclude live items whose run_id appears in history with terminal_outcome set.

Codex first-time-user #4: "Managed root" looked like current repo disappeared.
Launcher panel adds: "Otto manages projects in isolated git worktrees so it never touches your other repos. Pick or create one below to start."

Codex error-empty-states #1: Lost-connection banner
When polling fails 3+ consecutive times, sticky banner appears: "Lost connection to Mission Control. Retrying every 5s..." with manual retry button. Auto-clears on first successful poll after.

============================================================
Tests added (14 new browser tests)
============================================================

- tests/browser/test_iphone_submit_button_reachable.py (3 tests)
- tests/browser/test_job_dialog_cmd_enter.py (3 tests)
- tests/browser/test_no_double_render_after_terminal.py (4 tests)
- tests/browser/test_launcher_managed_root_explanation.py (1 test)
- tests/browser/test_lost_connection_banner.py (3 tests)

============================================================
Test counts
============================================================

- Default: 1189 (no change; all UI fixes)
- Browser: 198 effective (was 184; +14 new)
- npm web:typecheck: clean
- npm web:build: clean

============================================================
Tally
============================================================

CRITICAL: 29 of 29 closed (100%)
IMPORTANT: ~113 of 132 closed (~86%)

Followups:
- ~19 NOTE-tier IMPORTANTs remain (paper cuts; deferred per severity-gate rule unless adjacent)
- 76 NOTE items deferred per policy
- Phase 3.5 R1-R14 actual recordings ($70-140, hours of real LLM) remain scaffolded but not captured

This effectively closes the bulk of Phase 4 IMPORTANT work. Branch is in releasable state.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
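
The client-side dedupe from W9-IMPORTANT-2 can be sketched as follows. The actual fix lives in the TypeScript client; this Python version only illustrates the rule, and `dedupe_live_rows` plus the dict-shaped rows are assumptions.

```python
def dedupe_live_rows(live: list[dict], history: list[dict]) -> list[dict]:
    """Drop live rows whose run already appears in history as terminal."""
    terminal_ids = {
        row["run_id"] for row in history if row.get("terminal_outcome")
    }
    return [row for row in live if row["run_id"] not in terminal_ids]
```

A run that is in history without a terminal_outcome stays in the live list, so a row is hidden only once history has its final state.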

logpie wants to merge 2 commits into main from worktree-gate-pilot
