feat: gate pilot — LLM at batch decision boundaries#1

Draft
logpie wants to merge 2 commits into main from worktree-gate-pilot

Conversation

logpie (Owner) commented Mar 30, 2026

Summary

  • Adds a gate pilot that replaces simple replan() after batch failures with richer failure analysis, retry strategies, context routing, and skip recommendations
  • Stateless design: reads disk artifacts, returns structured JSON, orchestrator validates and applies
  • Falls back to replan() on failure. Set pilot: false in otto.yaml to disable. Zero overhead when no failures.
  • Codex-reviewed (3 rounds, APPROVED). 484 unit tests pass. 53 tasks across 18 e2e runs, 0 regressions.

Status: NOT validated on real failures

The pilot never fired during benchmarking because the coding agent passed all tasks. This is expected — the pilot's value is at i2p scale (8+ tasks, multiple batches, partial failures). Shipping as a safe no-op upgrade.

What's new

| File | What |
| --- | --- |
| otto/pilot.py | Gate pilot module: context assembly, LLM call, decision parsing |
| otto/orchestrator.py | Pilot at batch boundaries, fallback to replan, config flag |
| otto/runner.py | Pilot guidance separated in retry prompts |
| tests/test_pilot.py | 22 unit tests |
| tests/test_pilot_benchmark.py | 6 scenario tests |
| bench/pilot-benchmark.sh | A/B benchmark runner |
| bench/pressure/projects/pilot-test-* | 3 synthetic test projects |

Design docs

  • Spec: docs/superpowers/specs/2026-03-29-gate-pilot.md
  • Plan: docs/superpowers/plans/2026-03-29-gate-pilot-stage1.md
  • i2p spec: docs/superpowers/specs/2026-03-26-otto-intent-to-product.md

Test plan

  • 484 unit tests pass (0 new failures)
  • Codex adversarial review: 3 rounds, APPROVED
  • 18 e2e runs (6 projects × baseline/pilot): zero overhead, zero regressions
  • 4 real-world combined runs (ufo, humanize, camelcase, pre-commit)
  • 5-task greenfield run with merge conflicts
  • Pending: real-world pilot invocation — needs a run where batch has mixed pass/fail results with remaining tasks. Monitor pilot.log on next failure.

🤖 Generated with Claude Code

Adds a gate pilot that replaces the simple replan() call after batch
failures. The pilot reads disk artifacts (verify logs, QA verdicts,
task summaries, learnings) and returns structured decisions: failure
analysis, retry strategies, routed context for upcoming tasks, skip
recommendations, and re-batching.

Key design:
- Stateless: reconstructs context from files each invocation
- No telephone game: pilot makes system-level decisions, coding agents
  interpret their own errors directly
- Structured JSON output, orchestrator validates and applies
- Same model as planner (configurable via planner_model)
- Falls back to replan() on parse failure
- Config flag: pilot: false in otto.yaml to disable
- Zero overhead when no failures (pilot only invoked at batch boundary
  with failures + remaining tasks)
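
The validate-then-fall-back flow in the key design list can be sketched roughly as follows. This is a minimal illustration, not the code in otto/pilot.py: the function names, the required JSON keys, and the `call_pilot` / `replan` callables are all hypothetical stand-ins, and treating `pilot: false` as "use replan() directly" is an assumption.

```python
import json
from typing import Callable, Optional

# Hypothetical decision keys; the real schema is defined in otto/pilot.py.
REQUIRED_KEYS = {"failure_analysis", "retry_strategies", "skip_tasks"}


def parse_pilot_decision(raw: str) -> Optional[dict]:
    """Return the parsed decision, or None when the output is unusable."""
    try:
        decision = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(decision, dict) or not REQUIRED_KEYS.issubset(decision):
        return None
    return decision


def decide_after_batch(
    failed: list,
    remaining: list,
    call_pilot: Callable[[], str],
    replan: Callable[[], dict],
    pilot_enabled: bool = True,
) -> Optional[dict]:
    """Decide what to do at a batch boundary."""
    # Zero overhead: nothing is invoked unless there are failures
    # and tasks still left to run.
    if not failed or not remaining:
        return None
    # Assumption: with the pilot disabled, the orchestrator uses replan().
    if not pilot_enabled:
        return replan()
    decision = parse_pilot_decision(call_pilot())
    # Fall back to replan() when the pilot output cannot be parsed or validated.
    return decision if decision is not None else replan()
```

Because the pilot is stateless, `call_pilot` here would rebuild its context from disk artifacts on every call; that part is omitted from the sketch.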

Codex-reviewed: 3 rounds, all CRITICAL/IMPORTANT findings fixed, APPROVED.
Benchmark: 53 tasks across 18 runs, 0 regressions, 0 pilot overhead.
Pilot not yet validated on real failures — shipping as safe no-op upgrade
for i2p readiness. Will prove value at scale (5+ tasks, multiple batches).

New files:
- otto/pilot.py — context assembly, LLM invocation, decision parsing
- tests/test_pilot.py — 22 unit tests
- tests/test_pilot_benchmark.py — 6 scenario benchmark tests
- bench/pilot-benchmark.sh — A/B benchmark runner
- bench/pressure/projects/pilot-test-* — 3 synthetic test projects

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai (Bot) commented Mar 30, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6de94498-333a-48d3-b2f0-f8f8313d2328

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Supersedes the gates + gate pilot approach. Simplified to 5 steps:
classify → plan → execute → verify → fix-or-replan.

Key decisions:
- Single-task is a valid plan (no forced decomposition)
- Product artifacts at project root (not otto_arch/)
- Persistent context.md accumulates across tasks
- Vertical slices over horizontal layers
- User journeys from user's perspective, not feature list
- Fix rounds continue while making progress, replan on planning failures
- Codex-reviewed design

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

logpie added a commit that referenced this pull request Apr 26, 2026

Closes 1 CRITICAL + 2 IMPORTANT findings from the mc-audit hunters.

CRITICAL closed:
- Logs appended unbounded into client state; 10MB log could lock the browser
  (codex-long-string-overflow #1). Replaced with bounded ring buffer
  (1MB max), separate unbounded totalBytes/totalLines counters, droppedBytes
  tracker for the elided-bytes header.
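
The bounded buffer described above lives in the browser client; the Python sketch below only illustrates the idea of capping retained bytes while keeping unbounded counters. The class name, the exact cap of 1,000,000 bytes, and the byte-level (rather than line-level) trimming are assumptions.

```python
class BoundedLogBuffer:
    """Keep only the newest max_bytes of log data while counting everything seen."""

    def __init__(self, max_bytes: int = 1_000_000) -> None:
        self.max_bytes = max_bytes
        self._data = bytearray()
        self.total_bytes = 0    # every byte ever appended (unbounded counter)
        self.total_lines = 0    # every newline ever appended (unbounded counter)
        self.dropped_bytes = 0  # bytes elided from the front of the buffer

    def append(self, chunk: bytes) -> None:
        self.total_bytes += len(chunk)
        self.total_lines += chunk.count(b"\n")
        self._data += chunk
        overflow = len(self._data) - self.max_bytes
        if overflow > 0:
            # Drop the oldest bytes so memory stays bounded.
            del self._data[:overflow]
            self.dropped_bytes += overflow

    def text(self) -> str:
        # Trimming may cut a multi-byte character, so decode leniently.
        return self._data.decode("utf-8", errors="replace")
```

A header such as "N bytes elided" can then be derived from `dropped_bytes` without keeping the dropped text.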

IMPORTANT closed:
- LogPane didn't distinguish "Live, polling" from "Final" state
  (codex-evidence-trustworthiness #5). Header now shows live polling cadence
  + last-update age, OR final size + line count.
- Missing log file rendered as generic "waiting for output"; fetch errors
  toasted while polling kept hammering (codex-error-empty-states #9). Now
  shows the path explicitly, plus an error state with Retry button. Polling
  pauses when log is missing/errored.

Polling resilience:
- Exponential backoff on consecutive errors: 1.2s → 2s → 5s → 15s → 30s.
- Resets to 1.2s on first successful read.
- Stops polling when run is terminal AND fully drained (uses new server
  `eof` field).
- Pauses polling when inspector is closed or tab is hidden
  (visibilitychange listener); resumes on visible.
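
As a rough illustration of the polling rules above (the real logic is in the web client), the schedule and the stop conditions can be written as two small functions. The function and parameter names are hypothetical; the delays come from the list above.

```python
# Delay before the next poll, indexed by the number of consecutive errors.
BACKOFF_SCHEDULE_S = (1.2, 2.0, 5.0, 15.0, 30.0)


def next_poll_delay(consecutive_errors: int) -> float:
    """Return the poll delay; a successful read resets the caller's count to 0."""
    index = min(max(consecutive_errors, 0), len(BACKOFF_SCHEDULE_S) - 1)
    return BACKOFF_SCHEDULE_S[index]


def should_poll(run_terminal: bool, eof: bool, inspector_open: bool, tab_visible: bool) -> bool:
    """Decide whether polling should be active at all."""
    if not inspector_open or not tab_visible:
        return False  # paused while the inspector is closed or the tab is hidden
    # Stop only when the run is terminal AND the log is fully drained.
    return not (run_terminal and eof)
```

With this shape, the first error yields a 2 s gap and the second a 5 s gap, which matches the backoff test expectations listed under Tests.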

Server changes:
- `LogReadResult` gains `total_bytes` (file size at read time) and `eof`
  (whether next_offset == total_bytes after this slice). All three
  constructor sites populated. Lets the client render "Final · {size}"
  headers and detect drain without a second HEAD request.
- `LogsResponse` TS type updated.
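
A minimal sketch of how `total_bytes` and `eof` could be computed for one slice, assuming a plain file-offset read. Only `total_bytes`, `eof`, and `next_offset` are named in the commit message; the `text` field, the `read_log_slice` helper, and the default limit are hypothetical.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class LogReadResult:
    text: str
    next_offset: int
    total_bytes: int  # file size at read time
    eof: bool         # True when next_offset == total_bytes after this slice


def read_log_slice(path: Path, offset: int, limit: int = 65536) -> LogReadResult:
    """Read up to `limit` bytes starting at `offset` and report drain status."""
    total_bytes = path.stat().st_size
    with path.open("rb") as fh:
        fh.seek(offset)
        data = fh.read(limit)
    next_offset = offset + len(data)
    return LogReadResult(
        text=data.decode("utf-8", errors="replace"),
        next_offset=next_offset,
        total_bytes=total_bytes,
        eof=next_offset == total_bytes,
    )
```

Returning both values in the same response is what lets a client detect a drained log without a second request.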

Tests:
- `tests/browser/test_log_buffering.py` — 7 paired Playwright tests:
  - 5MB log renders <1.5MB DOM with elided-bytes header
  - Live state + polling header for active runs
  - Final state + line count for terminal runs
  - Missing-file path display + paused polling
  - Error backoff schedule (gap_first ≈ 2s, gap_second ≈ 5s)
  - Polling stops on inspector close
  - Polling stops on tab hidden
- Browser suite: 15 passed (7 new + 7 cluster A + 1 smoke)
- Default suite: 1076 passed (no regressions)
- npm run web:typecheck: clean
- npm run web:build: 277.82 kB JS / 33.34 kB CSS

Note: pre-existing basedpyright warnings in service.py around `_record_event`
calls (lines 393-544) are not introduced by this commit; they predate
cluster B and are flagged because basedpyright now analyzes the file when
it's touched.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

logpie added a commit that referenced this pull request Apr 26, 2026

CLUSTER C — bundle/build integrity (closes 1 CRITICAL + 4 IMPORTANT + 1 NOTE):

CRITICAL closed:
- `app.py:35`: `otto web` served checked-in bundle without verifying it
  matches source — developer skips `npm run web:build` → silent stale UI
  (codex-packaging-bundle #1).

IMPORTANT closed:
- Python build never built frontend; `pip install -e .` shipped whatever
  static was in tree (#2). Vite plugin emits `build-stamp.json` on every
  build with source-hash + timestamp + git commit; FastAPI startup verifies
  freshness via `verify_bundle_freshness()`.
- `web:build` ran only Vite; `web:typecheck` advisory; no CI gates (#3).
  New `web:verify` script chains typecheck → build → committed-check.
- Default cache headers underused hashed assets (#4). New
  `_CacheHeaderStaticFiles` subclass: `no-store` for shell + index.html,
  `public, max-age=31536000, immutable` for `/static/assets/*`.
- Server didn't validate `index.html` referenced JS/CSS exist (#5).
  Startup parses index.html and asserts every referenced static path
  resolves; missing → fail-fast with `npm run web:build` guidance.
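
The cache-header policy above reduces to a path-based choice. The sketch below shows only that choice, not the `_CacheHeaderStaticFiles` subclass itself; the function name and the exact set of paths treated as "shell" are assumptions.

```python
from typing import Optional


def cache_control_for(path: str) -> Optional[str]:
    """Pick a Cache-Control value for a request path, or None to leave defaults."""
    if path.startswith("/static/assets/"):
        # Hashed filenames change whenever content changes, so cache them for a year.
        return "public, max-age=31536000, immutable"
    if path == "/" or path.endswith("/index.html"):
        # The shell must always be refetched so it points at current assets.
        return "no-store"
    return None
```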

NOTE closed:
- `[tool.setuptools.package-data]` was flat (`static/*`, `static/assets/*`);
  future nested assets (fonts/, images/, locale/) would silently miss
  wheels. Now `static/**/*` recursive glob.

Files: otto/web/bundle.py (new, 263 lines), otto/web/client/vite.config.ts
(build-stamp emitter), scripts/build_stamp.py (CLI for manual stamp),
scripts/check_bundle_committed.py (git-diff guard for CI), package.json
(web:verify script), pyproject.toml (recursive package_data), otto/web/app.py
(verify_bundle_freshness call + _CacheHeaderStaticFiles).

Tests: tests/test_web_bundle_freshness.py (5 tests) + tests/test_web_cache_headers.py
(2 tests, 8 actual checks). 15 new server-layer tests, all green.
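
The build-stamp freshness check can be pictured as comparing a stored source hash with one recomputed at startup. This is a sketch under stated assumptions, not the contents of otto/web/bundle.py: the `source_hash` key, the helper names, and hashing every file under the source directory with SHA-256 are all illustrative choices.

```python
import hashlib
import json
from pathlib import Path


def hash_sources(src_dir: Path) -> str:
    """Hash relative paths and contents of every file under src_dir, in a stable order."""
    digest = hashlib.sha256()
    for file in sorted(p for p in src_dir.rglob("*") if p.is_file()):
        digest.update(file.relative_to(src_dir).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(file.read_bytes())
    return digest.hexdigest()


def is_bundle_fresh(src_dir: Path, stamp_path: Path) -> bool:
    """Return True only when the stamp exists and matches the current sources."""
    try:
        stamp = json.loads(stamp_path.read_text())
    except (OSError, json.JSONDecodeError):
        return False  # missing or unreadable stamp counts as stale
    return isinstance(stamp, dict) and stamp.get("source_hash") == hash_sources(src_dir)
```

A startup hook could call `is_bundle_fresh` and fail fast with guidance to run `npm run web:build` when it returns False.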

CLUSTER D — history pagination (closes 1 CRITICAL + 2 IMPORTANT):

CRITICAL closed:
- `total_rows=247` displayed but only first page rendered; power user with
  200+ runs stuck on page 1 (heavy-user, codex-state-management #6,
  codex-long-string-overflow #3).

IMPORTANT closed:
- `/api/state` accepted `history_page` + `page_size`; client never sent
  `history_page` and rendered no controls.
- Page-size selector now lives in the UI (10/25/50/100, default 25);
  server clamps to [1, 200] to refuse stale URLs requesting unbounded slices.

Implementation:
- `MissionControlFilters.history_page_size: int | None` for per-request
  override (server clamps to safe range).
- App.tsx History pane: pagination footer (Page N of M · X runs · ←/→ ·
  jump-to + page-size selector). URL persists `hp` + `ps` query params.
- Filter changes reset page to 1.
- Stale deep-link `?hp=99` (out-of-range) → "Page 99 doesn't exist; jump
  to page 1" with reset button.
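
The clamping and out-of-range handling above can be sketched as two helpers. The names are hypothetical, and using 25 as the server-side default when no size is requested is an assumption (the commit message only states 25 as the UI default and [1, 200] as the clamp range).

```python
import math
from typing import Optional, Tuple

DEFAULT_PAGE_SIZE = 25  # assumption: mirrors the UI default
MAX_PAGE_SIZE = 200


def clamp_page_size(requested: Optional[int]) -> int:
    """Clamp a requested page size into [1, 200]."""
    if requested is None:
        return DEFAULT_PAGE_SIZE
    return max(1, min(MAX_PAGE_SIZE, requested))


def page_bounds(total_rows: int, page: int, page_size: int) -> Optional[Tuple[int, int, int]]:
    """Return (start, end, total_pages) for a 1-based page, or None if out of range."""
    total_pages = max(1, math.ceil(total_rows / page_size))
    if page < 1 or page > total_pages:
        return None  # e.g. a stale deep link such as ?hp=99
    start = (page - 1) * page_size
    end = min(start + page_size, total_rows)
    return start, end, total_pages
```

For the 247-row example at 25 rows per page, there are 10 pages and the last one covers rows 225 to 246.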

Files: otto/mission_control/{model.py,serializers.py} (history_page_size
plumbing), otto/web/app.py (param wiring), otto/web/client/src/{App.tsx,
api.ts,types.ts,styles.css} (pagination UI), tests/browser/test_history_pagination.py
(10 paired Playwright tests, all green).

Verification:
- Browser suite: 25 passed (8 cluster A + 7 cluster B + 10 cluster D + new
  smoke set from cluster C cache headers via tests/test_web_cache_headers.py)
- Default suite: 1091 passed (was 1076; +15 cluster C server tests)
- npm run web:typecheck: clean
- npm run web:build: 277.82 kB JS / 33.34 kB CSS (rebuilt with stamp)

Note: cluster C agent hit an API overload during its final summary step but
all files landed cleanly on disk; verified by independently running the new
test files before committing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

logpie added a commit that referenced this pull request Apr 26, 2026

CLUSTER E — diff freshness contract (closes 1 CRITICAL + 1 IMPORTANT):

CRITICAL closed:
- Diff fetched once and held client-side without target/branch SHAs or
  merge-base; **the code merged could differ from the diff reviewed**
  (codex-evidence-trustworthiness #1).

IMPORTANT closed:
- Diff truncation was bare `truncated` suffix; user couldn't tell how much
  was hidden, no full-diff download path (codex-evidence-trustworthiness #2).

Implementation:
- `MissionControlService.diff()` enriches response with: `fetched_at`,
  `target_sha`, `branch_sha`, `merge_base`, `command`, `limit_chars`,
  `full_size_chars`, `shown_hunks`, `total_hunks`, `errors[]`.
- `MissionControlService._validate_expected_diff_shas` — merge action
  rejects with 409 + "Re-fetch the diff to confirm what will be merged"
  when `expected_target_sha` / `expected_branch_sha` differ from live HEAD.
- POST /api/runs/{id}/actions/{action} forwards SHAs from request body.
- DiffPane renders freshness header (captured-X-ago + target/branch/base
  short SHAs with full-SHA tooltip; warnings when SHAs are null) + Refresh
  button + truncation banner ("Showing N hunks of M · X KB of Y MB" + Copy
  diff command).
- Merge confirm dialog spells out "Land branch {short} @ {sha} into target
  {short} @ {sha}" with the actual SHAs.
- SPA passes SHAs from most-recent diff fetch on every merge POST.
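
The SHA gate on merge can be sketched as a comparison between the SHAs the client reviewed and the live ones. This is an illustration, not `_validate_expected_diff_shas` itself: the exception class and the function signature are hypothetical. Skipping the check when a SHA is omitted mirrors the opt-in behavior noted in the followups.

```python
from typing import Optional


class StaleDiffError(Exception):
    """Raised when the reviewed diff no longer matches the live refs."""

    status_code = 409


def validate_expected_diff_shas(
    expected_target_sha: Optional[str],
    expected_branch_sha: Optional[str],
    live_target_sha: str,
    live_branch_sha: str,
) -> None:
    checks = (
        (expected_target_sha, live_target_sha),
        (expected_branch_sha, live_branch_sha),
    )
    for expected, live in checks:
        # Callers that omit a SHA bypass the gate for that ref.
        if expected is not None and expected != live:
            raise StaleDiffError("Re-fetch the diff to confirm what will be merged")
```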

CLUSTER F — boot-loading gate + first-run clarity (closes 2 CRITICAL +
~12 IMPORTANT):

CRITICAL closed:
- App.tsx tri-state boot gate (`loading | launcher | ready`) — main shell
  no longer renders before /api/projects returns; "New job" button can no
  longer be enabled with project undefined (codex-first-time-user #1).
- Pre-submit advanced-options summary in JobDialog: "Will run with: claude
  · sonnet · effort=high · verification=fast" outside Advanced details
  with "Edit" link (codex-first-time-user #2).

IMPORTANT closed:
- Launcher subhead: "Otto runs AI coding jobs in isolated git worktrees,
  then lets you review logs, diffs, and merge results."
- "Managed root" helper text explains current-repo isolation
- Empty project list: "Create your first Otto project below" + auto-focus
- First-run primary CTA: "Start first build" (reverts to "New job" once
  any run exists)
- Build/Improve/Certify dropdown options gain helper descriptions
- All commands now require non-empty intent or focus (was build-only)
- Dirty-project confirm lists up to 5 dirty files with "+N more"
- "Start queued job" CTA when watcher is stopped + jobs queued
- Empty detail copy: "Select a task card to review logs, code changes,
  verification, and next action."
- RunInspector tab labels: jargon-soft alternatives
- Recovery actions surfaced as primary contextual buttons (Retry / Resume
  / Cleanup) next to run header — Advanced still has full list
- HTTP-code-to-actionable-copy mapping in api.ts: 409/400/403/5xx →
  recovery messages (no more raw "HTTP 409")

Tests:
- tests/test_diff_freshness.py — 6 server tests
- tests/browser/test_diff_freshness.py — 5 browser tests
- tests/browser/test_first_run_clarity.py — 13 browser tests
- All green; default suite 1097 (was 1091; +6 cluster E server)
- Browser suite 47 (was 25; +5 cluster E, +17 cluster F)
- npm web:typecheck clean; bundle rebuilt

Followups for orchestrator:
- SHA-mismatch refusal is opt-in by client (older callers omit SHAs and
  bypass the gate); consider promoting to power-user opt-out flag
- Pre-fetch diff inside runActionForRun("merge") so SHA gate is
  non-bypassable from SPA
- Related provenance gaps (proof drawer cache, ArtifactRef metadata,
  visual-evidence manifest, proof file digest) share same architectural
  fix; bundle into a future cluster

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

logpie added a commit that referenced this pull request Apr 26, 2026

…/launcher/lost-connection)

============================================================
5 IMPORTANT closures
============================================================

W7-IMPORTANT-1 — iPhone submit button below the fold
  In the devices["iPhone 14"] viewport (390x664), the JobDialog submit
  button at y=674 was clipped with no scroll affordance. The JobDialog
  scroll container now uses max-height: calc(100dvh - 80px) and
  overflow-y: auto, so the submit button is always reachable, either by
  scrolling or through a floating bottom bar that the dialog renders on
  narrow viewports.

W8-IMPORTANT-1 — JobDialog ignores Cmd+Enter from textarea
  Power-user shortcut was dead. Added onKeyDown to intent textarea:
  (Cmd|Ctrl)+Enter now triggers submit when validation passes.

W9-IMPORTANT-2 — Run double-rendered after terminal_outcome
  Live[]/history[] transition not atomic — same run_id appeared in
  both, UI rendered twice. Client-side dedupe: when computing rows
  to render, exclude live items whose run_id appears in history with
  terminal_outcome set.
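
  The dedupe rule is implemented in the client; the Python sketch below
  only illustrates the filtering step. The function name and the choice
  to list live rows before history rows are assumptions.

```python
def rows_to_render(live: list, history: list) -> list:
    """Drop live rows whose run already appears in history with a terminal outcome."""
    finished = {row["run_id"] for row in history if row.get("terminal_outcome")}
    still_live = [row for row in live if row["run_id"] not in finished]
    return still_live + history
```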

Codex first-time-user #4 — "Managed root" looked like current repo
  disappeared. Launcher panel adds: "Otto manages projects in
  isolated git worktrees so it never touches your other repos. Pick
  or create one below to start."

Codex error-empty-states #1 — Lost-connection banner
  When polling fails 3+ consecutive times, sticky banner appears:
  "Lost connection to Mission Control. Retrying every 5s..." with
  manual retry button. Auto-clears on first successful poll after.
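
  The banner rule amounts to a consecutive-failure counter that resets
  on success. A minimal sketch, with a hypothetical class name (the real
  state lives in the web client):

```python
class ConnectionMonitor:
    """Track consecutive poll failures and decide when to show the banner."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.consecutive_failures = 0

    @property
    def banner_visible(self) -> bool:
        return self.consecutive_failures >= self.threshold

    def record(self, ok: bool) -> bool:
        """Record one poll result and return whether the banner should show."""
        # Any successful poll clears the streak, which also clears the banner.
        self.consecutive_failures = 0 if ok else self.consecutive_failures + 1
        return self.banner_visible
```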

============================================================
Tests added (14 new browser tests)
============================================================

- tests/browser/test_iphone_submit_button_reachable.py (3 tests)
- tests/browser/test_job_dialog_cmd_enter.py (3 tests)
- tests/browser/test_no_double_render_after_terminal.py (4 tests)
- tests/browser/test_launcher_managed_root_explanation.py (1 test)
- tests/browser/test_lost_connection_banner.py (3 tests)

============================================================
Test counts
============================================================
- Default: 1189 (no change; all UI fixes)
- Browser: 198 effective (was 184; +14 new)
- npm web:typecheck: clean
- npm web:build: clean

============================================================
Tally
============================================================
CRITICAL: 29 of 29 closed (100%)
IMPORTANT: ~113 of 132 closed (~86%)

Followups:
- ~19 NOTE-tier IMPORTANTs remain (paper cuts; deferred per
  severity-gate rule unless adjacent)
- 76 NOTE items deferred per policy
- Phase 3.5 R1-R14 actual recordings ($70-140, hours of real LLM)
  remain scaffolded but not captured

This effectively closes the bulk of Phase 4 IMPORTANT work. Branch
is in releasable state.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
