# Promising Directions for PlanExe

Analysis of 110+ proposals. Primary audience: AI agents consuming PlanExe via MCP. Secondary audience: humans reviewing plans.

---

## Strategic Frame

PlanExe's future is as **infrastructure for AI agents** — the planning layer that agents call before executing complex multi-step work. Today, agents use PlanExe to generate plans for humans to review. Tomorrow, agents will plan, validate, and execute autonomously with PlanExe as the orchestration backbone.

The proposals below are grouped by what agents need, in priority order.

---

## 1. Operational Reliability (agents can't babysit)

Agents need PlanExe runs to complete reliably without human intervention. A failed run that loses 73% of work is a dealbreaker for autonomous workflows.

| # | Proposal | Agent Impact |
|---|----------|-------------|
| **87** | Plan Resume MCP Tool | Agents recover from failures without restarting the full pipeline. Luigi already supports resume — needs DB flag + worker logic |
| **109** | LLM Executor Retry Improvements | Structured retry logic for transient failures. Agents shouldn't need to implement their own retry wrappers |
| **102** | Pipeline Intelligence Layer | Error-feedback retries — the LLM gets its own error message and retries with an adjusted approach. Eliminates a class of silent failures |
| **103** | Pipeline Hardening for Local Models | Fix silent truncation and context-window overflows. Critical for agents running local models where failures are subtle |
| **101** | Luigi Resume Enhancements | Webhook hooks on task completion/failure — agents can subscribe to events instead of polling |
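The error-feedback retry described in #102 is small enough to sketch. This is illustrative only: `llm_call` stands in for PlanExe's real executor, and the broad `Exception` catch would be narrowed to the executor's actual error types.

```python
import time

def call_with_error_feedback(llm_call, prompt, max_attempts=3, backoff=0.0):
    """Retry an LLM call, appending the previous error to the prompt so the
    model can adjust its approach (the #102 error-feedback pattern)."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            annotated = prompt if last_error is None else (
                f"{prompt}\n\nYour previous attempt failed with:\n{last_error}\n"
                "Adjust your approach and try again."
            )
            return llm_call(annotated)
        except Exception as exc:  # narrow to executor error types in practice
            last_error = exc
            time.sleep(backoff * (2 ** attempt))  # exponential backoff (#109)
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error
```

With a pattern like this in the executor, agents get structured retries for free instead of wrapping every call themselves.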

---

## 2. Agent Interface & Discoverability (reduce integration friction)

Agents need to discover PlanExe, understand its tools, and consume outputs programmatically. The current interface assumes a human is in the loop.

| # | Proposal | Agent Impact |
|---|----------|-------------|
| **86** | Agent-Optimized Pipeline | Removes the 5 key friction points for autonomous agent use: human approval gate, no agent prompt examples, poll intervals tuned for humans, no machine-readable output, no autonomous agent setup docs |
| **62** | Agent-First Frontend Discoverability | `llms.txt`, `/.well-known/mcp.json`, agent-readable README — standard discovery protocols so agents find PlanExe without human guidance |
| **110** | Usage Metrics for Local Runs | Agents need cost accounting for budget-constrained workflows. Answers "how much did this run cost?" |

Three of the five friction points in #86 most directly block autonomous agent use:
- **F1**: Human approval step before `plan_create` means autonomous agents can't proceed
- **F3**: The fixed 5-minute poll interval is mismatched with cloud runs that complete in 8 minutes
- **F4**: No machine-readable summary, so agents must parse HTML or iterate over 100+ files in the zip
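F3's fix is ordinary adaptive polling rather than a fixed interval. A minimal sketch, with `get_status` as a stand-in for the real MCP status call (assumed here to return `"running"`, `"done"`, or `"failed"`):

```python
import time

def wait_for_plan(get_status, initial_interval=15.0, max_interval=120.0,
                  timeout=3600.0):
    """Poll a run's status with exponential backoff instead of a fixed
    5-minute interval (friction point F3)."""
    deadline = time.monotonic() + timeout
    interval = initial_interval
    while time.monotonic() < deadline:
        status = get_status()
        if status in ("done", "failed"):
            return status
        time.sleep(interval)
        interval = min(interval * 1.5, max_interval)  # back off gradually
    raise TimeoutError("plan run did not finish in time")
```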

---

## 3. Plan Quality Gates (agents must not amplify bad plans)

Before agents can execute plans autonomously, the plans themselves need automated validation. An agent that executes an unchecked plan multiplies errors at machine speed.

| # | Proposal | Agent Impact |
|---|----------|-------------|
| **58** | Boost Initial Prompt | Single LLM call to strengthen weak prompts before pipeline runs. Especially valuable for agent-originated prompts, which may be terse or overly technical |
| **42** | Evidence Traceability Ledger | Links every claim to evidence with freshness scoring. Agents can programmatically check whether assumptions are grounded |
| **43** | Assumption Drift Monitor | Watches key variables (costs, FX rates) against baselines, triggers re-plan on breach. Agents re-planning on a schedule need this to detect when a plan has gone stale |
| **57** | Banned Words + Lever Realism | Auto-detects hype-heavy or impractical outputs without human review |
| **56** | Adversarial Red-Team Reality Check | Multi-model adversarial review with judge scoring. Catches optimism bias that agents are prone to propagating |
| **88** | Fermi Sanity Check Validation Gate | Rule-based guard on every assumption: bounds present, span ratio sane, evidence for low-confidence claims. First line of defense before expensive downstream processing |
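The #88 gate is cheap because it is purely rule-based. A sketch assuming a hypothetical assumption schema (`low`/`high` bounds, `confidence`, `evidence`); the real PlanExe artifact format may differ:

```python
def fermi_check(assumption, max_span_ratio=1000.0):
    """Rule-based gate for a single assumption dict, in the spirit of #88:
    bounds present, span ratio sane, evidence required for low-confidence
    claims. Returns a list of violation strings (empty means it passes)."""
    problems = []
    low, high = assumption.get("low"), assumption.get("high")
    if low is None or high is None:
        problems.append("missing bounds")
    elif low <= 0 or high < low:
        problems.append("bounds not a positive, ordered range")
    elif high / low > max_span_ratio:
        problems.append(
            f"span ratio {high / low:.0f}x exceeds {max_span_ratio:.0f}x")
    if assumption.get("confidence", "low") == "low" and not assumption.get("evidence"):
        problems.append("low-confidence claim without evidence")
    return problems
```

Run over every assumption before expensive downstream tasks, this rejects malformed inputs without spending a single LLM call.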

---

## 4. Autonomous Prompt Optimization (agents improving PlanExe itself)

Luigi's caching makes prompt optimization uniquely practical: changing one prompt template and re-running regenerates only that single task (seconds, not the full 15-minute pipeline). This turns a 60+ LLM-call pipeline into a 2–21 call experiment.

| # | Proposal | Agent Impact |
|---|----------|-------------|
| **94** | Autoresearch-Style Prompt Optimization | Autonomous overnight loops: agent modifies one prompt, re-runs one task, scores, keeps or reverts. Hundreds of experiments per night exploiting Luigi resumability |
| **59** | Prompt Optimizing with A/B Testing | Structured promotion pipeline: multi-model A/B matrix, Elo tracking, regression guards. Validates candidates discovered by #94 before merging into baseline |

Two-stage system: **#94 discovers** promising variants at high volume (greedy, autonomous), **#59 validates** them with rigor (conservative, human-gated). Exploration feeds promotion.
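At its core, the #94 discovery stage is greedy hill climbing over prompt variants. A minimal sketch, where `mutate`, `run_task`, and `score` are stand-ins for the real pipeline hooks (re-running one cached Luigi task and scoring its output):

```python
import random

def optimize_prompt(baseline, mutate, run_task, score, iterations=100, seed=0):
    """Greedy loop in the spirit of #94: mutate one prompt, re-run the single
    affected task, score it, and keep the variant only if it improves."""
    rng = random.Random(seed)
    best_prompt = baseline
    best_score = score(run_task(baseline))
    for _ in range(iterations):
        candidate = mutate(best_prompt, rng)
        candidate_score = score(run_task(candidate))
        if candidate_score > best_score:  # keep improvements
            best_prompt, best_score = candidate, candidate_score
        # otherwise revert: best_prompt stays unchanged
    return best_prompt, best_score
```

The winners this loop surfaces are exactly the candidates #59's A/B matrix would then validate before promotion.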

This is a case where agents improve the tool they use — a self-reinforcing loop.

---

## 5. Post-Plan Execution (the big unlock)

This is PlanExe's largest gap: *"I have a plan. Now what?"* For agents, a plan that can't be executed programmatically is just a document.

| # | Proposal | Agent Impact |
|---|----------|-------------|
| **41** | Autonomous Execution of Plan | Converts static Gantt into live execution engine with AI/human task delegation. The plan becomes a runnable workflow, not a PDF |
| **60–66** | Plan-to-Repo + Agent Swarm | Auto-provision repo, spawn agents (research, issues, compliance), use git as state machine. Plans become collaborative artifacts with continuous enrichment |
| **92** | Task Complexity Scoring & Model Routing | Each task gets a complexity score and recommended model tier. Agents can route cheap tasks to cheap models and expensive tasks to expensive ones — 55% cost savings in benchmarks |

**#41 is the most transformative proposal** — but it only works if the quality gates (#42, #43, #56, #88) prevent automation from multiplying bad plans.
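The routing in #92 reduces to a threshold table over the complexity score. A sketch with illustrative tier names and cutoffs, not PlanExe's actual configuration:

```python
def route_model(complexity, tiers=None):
    """Map a task's complexity score in [0, 1] to a model tier (#92).
    Tiers are (upper_threshold, model_name) pairs in ascending order."""
    tiers = tiers or [(0.3, "small-local"), (0.7, "mid-cloud"),
                      (1.0, "frontier")]
    for threshold, model in tiers:
        if complexity <= threshold:
            return model
    return tiers[-1][1]  # fall back to the top tier for out-of-range scores
```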

---

## Recommended Sequence

```
Phase 1: Reliable foundation (now)
├─ #87 Plan resume
├─ #109 Retry improvements
├─ #102 Error-feedback retries
├─ #110 Usage metrics
└─ #58 Prompt boost

Phase 2: Agent-native interface (next)
├─ #86 Remove agent friction points
├─ #62 Discovery protocols
└─ #88 Fermi validation gate

Phase 3: Automated quality (then)
├─ #42 Evidence traceability
├─ #43 Assumption drift monitor
├─ #56 Adversarial red-team
└─ #57 Banned words / lever realism

Phase 4: Self-improving pipeline (concurrent with 3)
├─ #94 Autoresearch prompt optimization
└─ #59 A/B testing promotion

Phase 5: Autonomous execution (after quality gates)
├─ #41 Plan execution engine
├─ #60–66 Agent swarm
└─ #92 Model routing
```

Each phase enables the next. Skipping to Phase 5 without Phase 3 means agents execute unchecked plans at scale — the worst outcome.

---

## Key Machine-Readable Outputs Agents Need

From proposal #86 (F4), the most agent-useful files in the zip artifact:

| File | Contents |
|------|----------|
| `assumptions/distilled_assumptions.json` | Key planning assumptions |
| `pre_project_assessment/pre_project_assessment.json` | Go/no-go recommendation |
| `negative_feedback/negative_feedback.json` | Risk register |
| `wbs/wbs_level2.json` | Work breakdown structure |

A future `plan_summary.json` collating these into a single file would eliminate the need for agents to parse multiple artifacts.
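Collating those files is straightforward once the artifact paths are fixed. A sketch of what a `plan_summary.json` builder could look like; the top-level keys of the output are an assumption, not a defined schema:

```python
import json
from pathlib import Path

# Relative paths from the table above; the summary keys are illustrative.
ARTIFACTS = {
    "assumptions": "assumptions/distilled_assumptions.json",
    "assessment": "pre_project_assessment/pre_project_assessment.json",
    "risks": "negative_feedback/negative_feedback.json",
    "wbs": "wbs/wbs_level2.json",
}

def build_plan_summary(run_dir):
    """Collate the agent-facing artifacts into one plan_summary.json,
    skipping any file a given run did not produce."""
    run_dir = Path(run_dir)
    summary = {}
    for key, rel_path in ARTIFACTS.items():
        path = run_dir / rel_path
        if path.exists():
            summary[key] = json.loads(path.read_text())
    out = run_dir / "plan_summary.json"
    out.write_text(json.dumps(summary, indent=2))
    return out
```

An agent then reads one file instead of unpacking and traversing the full zip.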