Objective
Add an autoresearch mode to agentv-bench that runs the eval-improve loop unattended: evaluate → analyze → keep/drop → mutate → repeat. This turns agentv-bench from a human-directed optimization tool into one that can also run autonomously overnight.
Context
The autoresearch pattern — proven by karpathy/autoresearch (ML training optimization), kevinrgu/autoagent (agent harness optimization), and pi-autoresearch (generic optimization loops) — automates the improvement step: score → keep/drop → mutate → repeat.
AutoAgent validates the pattern with benchmark results (96.5% SpreadsheetBench, 55.1% TerminalBench — both #1) but is not a replacement for AgentV autoresearch. AutoAgent optimizes Python agent harnesses for Harbor containerized benchmarks. AgentV autoresearch optimizes any text artifact (skills, prompts, configs) against any AgentV eval. Different ecosystem, different audience, different artifact types.
Model empathy finding (from AutoAgent research): same-model pairings (e.g., Claude meta-agent optimizing a Claude task agent) significantly outperform cross-model setups. Recommend same-model pairing in docs.
Prerequisites & Mechanics
Dependencies
- #958 — automate the keep/discard decision in agentv-bench Step 5 (using agentv compare --json output)
- #746 — mutator subagent: autonomous artifact rewriting from failure analysis (agents/mutator.md in agentv-bench)
Experiment naming
The experiment name is derived from the artifact being optimized: autoresearch-<artifact-basename> (e.g., autoresearch-pdf-skill for a skill named pdf). The implementing agent can also accept a user-provided name via the existing --experiment flag.
Artifact mutation flow
The mutator rewrites the actual file on disk (e.g., ./SKILL.md) in place. The autoresearch loop manages backups (see the sketch after this list):
- Before the first cycle, copy the artifact to _autoresearch/original.md
- On each KEEP, copy the current artifact to _autoresearch/best.md
- On each DROP, restore _autoresearch/best.md back to the artifact path
- The eval always runs against the real file path — no temp files or indirection
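A minimal shell sketch of that backup discipline, assuming the artifact path and session folder are held in variables (all names here are illustrative, not part of any contract):

```bash
ARTIFACT=./SKILL.md                                      # file under optimization (example path)
AR_DIR=.agentv/results/runs/$EXPERIMENT/_autoresearch    # session folder, per the layout below

mkdir -p "$AR_DIR"
cp "$ARTIFACT" "$AR_DIR/original.md"   # once, before the first cycle
cp "$ARTIFACT" "$AR_DIR/best.md"       # baseline best = original

# After each cycle, given $DECISION in {keep,drop}:
if [ "$DECISION" = keep ]; then
  cp "$ARTIFACT" "$AR_DIR/best.md"     # ratchet: current artifact becomes the new best
else
  cp "$AR_DIR/best.md" "$ARTIFACT"     # roll the rejected mutation back
fi
```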
How the skill invokes eval
The bench skill shells out to agentv eval <path> --experiment <name> via Bash tool, same as the existing interactive bench workflow. No programmatic API changes.
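In shell terms, one cycle's invocation plus locating the run it just produced might look like the following; the "newest timestamped directory" lookup is an assumption of this sketch, not an agentv feature:

```bash
agentv eval "$EVAL_PATH" --experiment "$EXPERIMENT"
RUN_DIR=$(ls -1d .agentv/results/runs/"$EXPERIMENT"/2*/ | sort | tail -1)   # newest timestamped run dir
```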
trajectory.html implementation
This is a new standalone HTML generator, not an extension of the existing HtmlWriter class. HtmlWriter renders per-run results from live eval output. trajectory.html renders cross-cycle data from iterations.jsonl — different data source, different chart. It should be a simple single-file HTML with embedded Chart.js (or inline SVG) that reads the baked-in JSON data, plus a 2-second auto-refresh script that re-reads the file during the loop.
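Because Phase 1 is skill-only, one way to realize this is a shell step the skill runs after every cycle, regenerating the file from iterations.jsonl. A rough sketch (it loads Chart.js from a CDN for brevity, whereas the note above suggests embedding it or emitting inline SVG; all paths and variables are illustrative):

```bash
DATA=$(jq -s '.' "$AR_DIR/iterations.jsonl")      # slurp JSONL into one JSON array
cat > "$AR_DIR/trajectory.html" <<EOF
<!doctype html>
<meta http-equiv="refresh" content="2">           <!-- auto-refresh while the loop runs -->
<title>autoresearch trajectory</title>
<canvas id="chart"></canvas>
<script src="https://cdn.jsdelivr.net/npm/chart.js"></script>
<script>
  const cycles = $DATA;                            // per-cycle data baked into the page
  new Chart(document.getElementById('chart'), {
    type: 'line',
    data: {
      labels: cycles.map(c => c.cycle),
      datasets: [{ label: 'score', data: cycles.map(c => c.score) }]
    }
  });
</script>
EOF
```

On the final write after the loop ends, the skill would omit the refresh tag so the page becomes static, matching the behavior described above.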
Design Latitude
Phase 1: Minimum loop + live trajectory chart
Skill-only change — no CLI, schema, or core code changes.
What changes in agentv-bench:
- The mutation step dispatches the mutator subagent (#746) instead of waiting for human input
- Keep/drop decisions ratchet against the best version; mutation always reads from best
Activation: Triggered via natural language ("run autoresearch on this skill"). No YAML schema changes.
The loop (a shell sketch follows these steps):
1. RUN EVAL — agentv eval with current artifact (writes standard run to .agentv/results/runs/<experiment>/<timestamp>/)
2. ANALYZE — dispatch analyzer subagent on results
3. DECIDE — if score > best_score: KEEP, else DROP (#958)
4. MUTATE — dispatch mutator subagent with failure analysis (#746)
5. GOTO 1 — until convergence or max_cycles
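Put together, the unattended hill-climb the skill drives through the Bash tool looks roughly like this. It is pseudocode under the assumptions above: the analyzer/mutator steps are subagent dispatches handled by the agent runtime, not shell, and the score-extraction and logging helpers are hypothetical placeholders (#958 covers the real decision mechanics):

```bash
best_score=0
no_improve=0
cycle=1
while [ "$cycle" -le "$MAX_CYCLES" ] && [ "$no_improve" -lt "$CONVERGENCE_N" ]; do
  agentv eval "$EVAL_PATH" --experiment "$EXPERIMENT"         # 1. RUN EVAL
  score=$(extract_overall_score_from_latest_run)              # hypothetical helper
  # 2. ANALYZE: dispatch the analyzer subagent on the new run dir
  if awk "BEGIN { exit !($score > $best_score) }"; then       # 3. DECIDE
    decision=keep; best_score=$score; no_improve=0
    cp "$ARTIFACT" "$AR_DIR/best.md"                          # ratchet
  else
    decision=drop; no_improve=$((no_improve + 1))
    cp "$AR_DIR/best.md" "$ARTIFACT"                          # roll back
  fi
  append_iteration_record                                     # hypothetical helper: one line to iterations.jsonl
  # 4. MUTATE: dispatch the mutator subagent (#746) with the failure analysis
  cycle=$((cycle + 1))                                        # 5. GOTO 1
done
```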
Artifact layout:
Each cycle is a standard eval run. Autoresearch session metadata lives in _autoresearch/ within the experiment directory:
```
.agentv/results/runs/<experiment>/
  _autoresearch/              # experiment-level outputs (user-facing, not hidden)
    original.md               # snapshot of artifact before first mutation
    best.md                   # current best-scoring version (updated on KEEP)
    iterations.jsonl          # one line per cycle — data source for chart + mutator
    trajectory.html           # live-updating score trajectory chart
  2026-04-15T10-30-00/        # cycle 1 — standard run artifacts
    index.jsonl
    grading.json
    timing.json
    benchmark.json
    report.html
  2026-04-15T10-35-00/        # cycle 2 — standard run artifacts
    ...
```
The _ prefix convention distinguishes workflow folders from timestamped run dirs. Future workflows get their own folder (e.g., _campaign/, _ab-test/, _sweep/).
iterations.jsonl — one line per cycle, JSONL wire format:
{"cycle":1,"score":0.65,"decision":"keep","cost_usd":0.12,"assertions":{"IDENTIFIES_BUG":0.8,"SUGGESTS_FIX":0.4},"mutation":"added explicit null-check instruction","run_dir":"2026-04-15T10-30-00","timestamp":"2026-04-15T10:32:15Z"}
trajectory.html — live-updating score chart. Follows the existing HtmlWriter pattern (auto-refreshes every 2s during the loop, becomes static after completion). Shows:
- Score over iterations (line chart)
- Keep/discard decision per iteration (markers)
- Per-assertion pass rates over iterations
- Cumulative cost across iterations
- Best vs original score summary
Standard run integration: Each cycle produces standard artifacts (index.jsonl, grading.json, etc.) in the normal location. Studio picks them up automatically — no special handling. agentv compare works on any two cycle run dirs.
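For example, comparing cycle 1 against cycle 2 from the layout above (the two-directory positional form is assumed here for illustration; the exact interface is whatever agentv compare already accepts):

```bash
agentv compare \
  .agentv/results/runs/autoresearch-pdf-skill/2026-04-15T10-30-00 \
  .agentv/results/runs/autoresearch-pdf-skill/2026-04-15T10-35-00
```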
Interactive/autonomous hybrid: Users can start in interactive mode (existing behavior), build confidence in their eval, then switch to autoresearch mode to run unattended.
Phase 2: State persistence and resumability (follow-up)
Only build if users need to resume interrupted loops:
- _autoresearch/state.json for loop state ({best_score, cycle, best_cycle, convergence_count})
- Resume from state after context reset using iterations.jsonl + best.md
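For illustration, a state file holding only the fields listed above might look like this (values are made up):

```json
{"best_score": 0.78, "cycle": 7, "best_cycle": 5, "convergence_count": 2}
```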
Acceptance Signals
Phase 1 (this issue)
- agentv-bench SKILL.md documents autoresearch mode as an alternative to interactive iteration
- Given a SKILL.md + EVAL.yaml, runs N improvement cycles without human input
- Hill-climbing ratchet: artifact quality only increases or stays the same across cycles
- Stops on convergence (no improvement for N cycles) or max_cycles
- Produces trajectory.html — live-updating during the loop, static after completion
- Writes iterations.jsonl — structured per-cycle log for chart + mutator consumption
- Standard run artifacts written per cycle to .agentv/results/runs/<experiment>/<timestamp>/
- Studio sees each cycle as a normal run — no special integration needed
- Original artifact preserved in _autoresearch/original.md
- Scope remains benchmark/workflow oriented
Phase 2 (follow-up issue)
- Session state survives context resets
- A fresh agent can resume from persisted state + best artifact
Non-Goals
- Not a new skill — this is a mode of the existing agentv-bench skill
- Not a replacement for interactive mode — both coexist
- Not multi-file mutation (start with single artifact, expand later)
- Not a full dashboard or Studio feature — live HTML chart only (Phase 1)
- Does not modify the eval definition — only the artifact under test
- No autoresearch: YAML config section — triggered via natural language, not schema
- Does not add persistent user memory, session search infra, or runtime-managed skill storage to AgentV core
Related
Implementation Prompt
One-line prompt to spawn subagents that implement the full autoresearch stack in dependency order:
Implement the agentv autoresearch optimization loop: first complete #958 (automate keep/discard decision in agentv-bench Step 5 using agentv compare --json output), then #746 (add agents/mutator.md to agentv-bench that rewrites artifacts from failure analysis), then #748 Phase 1 (wire them into an unattended loop in agentv-bench SKILL.md with _autoresearch/ output folder containing iterations.jsonl, trajectory.html, original.md, and best.md — each cycle writes standard run artifacts via agentv eval --experiment autoresearch-<name>). Read each issue body for the full contract before starting.