Objective
Add an autoresearch mode to agentv-bench that runs the eval-improve loop unattended: evaluate → analyze → keep/drop → mutate → repeat. This turns agentv-bench from a human-directed optimization tool into one that can also run autonomously overnight.
Context
The autoresearch pattern — proven by karpathy/autoresearch (ML training optimization), kevinrgu/autoagent (agent harness optimization), and pi-autoresearch (generic optimization loops) — automates the improvement step: score → keep/drop → mutate → repeat.
AutoAgent validates the pattern with benchmark results (96.5% SpreadsheetBench, 55.1% TerminalBench — both #1) but is not a replacement for AgentV autoresearch. AutoAgent optimizes Python agent harnesses for Harbor containerized benchmarks. AgentV autoresearch optimizes any text artifact (skills, prompts, configs) against any AgentV eval. Different ecosystem, different audience, different artifact types.
Model empathy finding (from AutoAgent research): same-model pairings (e.g., Claude meta-agent optimizing a Claude task agent) significantly outperform cross-model setups. Recommend same-model pairing in docs.
Prerequisites & Mechanics
Dependencies
- #958 — automate the keep/discard decision in agentv-bench Step 5 (using agentv compare --json output)
- #746 — mutator subagent: autonomous artifact rewriting from failure analysis (agents/mutator.md in agentv-bench)
Experiment naming
The experiment name is derived from the artifact being optimized: autoresearch-<artifact-basename> (e.g., autoresearch-pdf-skill for a skill named pdf). The implementing agent can also accept a user-provided name via the existing --experiment flag.
Artifact mutation flow
The mutator rewrites the actual file on disk (e.g., ./SKILL.md) in place. The autoresearch loop manages backups (see the sketch after this list):
- Before the first cycle, copy the artifact to _autoresearch/original.md
- On each KEEP, copy the current artifact to _autoresearch/best.md
- On each DROP, restore _autoresearch/best.md back to the artifact path
- The eval always runs against the real file path — no temp files or indirection
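A minimal shell sketch of that backup discipline, assuming the artifact path and session folder are held in variables (all names here are illustrative, not part of any contract):

```bash
ARTIFACT=./SKILL.md                                      # file under optimization (example path)
AR_DIR=.agentv/results/runs/$EXPERIMENT/_autoresearch    # session folder, per the layout below

mkdir -p "$AR_DIR"
cp "$ARTIFACT" "$AR_DIR/original.md"   # once, before the first cycle
cp "$ARTIFACT" "$AR_DIR/best.md"       # baseline best = original

# After each cycle, given $DECISION in {keep,drop}:
if [ "$DECISION" = keep ]; then
  cp "$ARTIFACT" "$AR_DIR/best.md"     # ratchet: current artifact becomes the new best
else
  cp "$AR_DIR/best.md" "$ARTIFACT"     # roll the rejected mutation back
fi
```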
How the skill invokes eval
The bench skill shells out to agentv eval <path> --experiment <name> via Bash tool, same as the existing interactive bench workflow. No programmatic API changes.
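In shell terms, one cycle's invocation plus locating the run it just produced might look like the following; the "newest timestamped directory" lookup is an assumption of this sketch, not an agentv feature:

```bash
agentv eval "$EVAL_PATH" --experiment "$EXPERIMENT"
RUN_DIR=$(ls -1d .agentv/results/runs/"$EXPERIMENT"/2*/ | sort | tail -1)   # newest timestamped run dir
```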
trajectory.html implementation
This is a new standalone HTML generator, not an extension of the existing HtmlWriter class. HtmlWriter renders per-run results from live eval output. trajectory.html renders cross-cycle data from iterations.jsonl — different data source, different chart. It should be a simple single-file HTML with embedded Chart.js (or inline SVG) that reads the baked-in JSON data, plus a 2-second auto-refresh script that re-reads the file during the loop.
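Because Phase 1 is skill-only, one way to realize this is a shell step the skill runs after every cycle, regenerating the file from iterations.jsonl. A rough sketch (it loads Chart.js from a CDN for brevity, whereas the note above suggests embedding it or emitting inline SVG; all paths and variables are illustrative):

```bash
DATA=$(jq -s '.' "$AR_DIR/iterations.jsonl")      # slurp JSONL into one JSON array
cat > "$AR_DIR/trajectory.html" <<EOF
<!doctype html>
<meta http-equiv="refresh" content="2">           <!-- auto-refresh while the loop runs -->
<title>autoresearch trajectory</title>
<canvas id="chart"></canvas>
<script src="https://cdn.jsdelivr.net/npm/chart.js"></script>
<script>
  const cycles = $DATA;                            // per-cycle data baked into the page
  new Chart(document.getElementById('chart'), {
    type: 'line',
    data: {
      labels: cycles.map(c => c.cycle),
      datasets: [{ label: 'score', data: cycles.map(c => c.score) }]
    }
  });
</script>
EOF
```

On the final write after the loop ends, the skill would omit the refresh tag so the page becomes static, matching the behavior described above.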
Design Latitude
Phase 1: Minimum loop + live trajectory chart
Skill-only change — no CLI, schema, or core code changes.
What changes in agentv-bench:
- The mutation step dispatches the mutator subagent (#746) instead of waiting for human input
- Keep/drop decisions ratchet against the best version; mutation always reads from best
Activation: Triggered via natural language ("run autoresearch on this skill"). No YAML schema changes.
The loop (a shell sketch follows these steps):
1. RUN EVAL — agentv eval with current artifact (writes standard run to .agentv/results/runs/<experiment>/<timestamp>/)
2. ANALYZE — dispatch analyzer subagent on results
3. DECIDE — if score > best_score: KEEP, else DROP (#958)
4. MUTATE — dispatch mutator subagent with failure analysis (#746)
5. GOTO 1 — until convergence or max_cycles
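Put together, the unattended hill-climb the skill drives through the Bash tool looks roughly like this. It is pseudocode under the assumptions above: the analyzer/mutator steps are subagent dispatches handled by the agent runtime, not shell, and the score-extraction and logging helpers are hypothetical placeholders (#958 covers the real decision mechanics):

```bash
best_score=0
no_improve=0
cycle=1
while [ "$cycle" -le "$MAX_CYCLES" ] && [ "$no_improve" -lt "$CONVERGENCE_N" ]; do
  agentv eval "$EVAL_PATH" --experiment "$EXPERIMENT"         # 1. RUN EVAL
  score=$(extract_overall_score_from_latest_run)              # hypothetical helper
  # 2. ANALYZE: dispatch the analyzer subagent on the new run dir
  if awk "BEGIN { exit !($score > $best_score) }"; then       # 3. DECIDE
    decision=keep; best_score=$score; no_improve=0
    cp "$ARTIFACT" "$AR_DIR/best.md"                          # ratchet
  else
    decision=drop; no_improve=$((no_improve + 1))
    cp "$AR_DIR/best.md" "$ARTIFACT"                          # roll back
  fi
  append_iteration_record                                     # hypothetical helper: one line to iterations.jsonl
  # 4. MUTATE: dispatch the mutator subagent (#746) with the failure analysis
  cycle=$((cycle + 1))                                        # 5. GOTO 1
done
```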
Artifact layout:
Each cycle is a standard eval run. Autoresearch session metadata lives in _autoresearch/ within the experiment directory:
```
.agentv/results/runs/<experiment>/
  _autoresearch/              # experiment-level outputs (user-facing, not hidden)
    original.md               # snapshot of artifact before first mutation
    best.md                   # current best-scoring version (updated on KEEP)
    iterations.jsonl          # one line per cycle — data source for chart + mutator
    trajectory.html           # live-updating score trajectory chart
  2026-04-15T10-30-00/        # cycle 1 — standard run artifacts
    index.jsonl
    grading.json
    timing.json
    benchmark.json
    report.html
  2026-04-15T10-35-00/        # cycle 2 — standard run artifacts
    ...
```
The _ prefix convention distinguishes workflow folders from timestamped run dirs. Future workflows get their own folder (e.g., _campaign/, _ab-test/, _sweep/).
iterations.jsonl — one line per cycle, JSONL wire format:
{"cycle":1,"score":0.65,"decision":"keep","cost_usd":0.12,"assertions":{"IDENTIFIES_BUG":0.8,"SUGGESTS_FIX":0.4},"mutation":"added explicit null-check instruction","run_dir":"2026-04-15T10-30-00","timestamp":"2026-04-15T10:32:15Z"}
trajectory.html — live-updating score chart. Follows the existing HtmlWriter pattern (auto-refreshes every 2s during the loop, becomes static after completion). Shows:
- Score over iterations (line chart)
- Keep/discard decision per iteration (markers)
- Per-assertion pass rates over iterations
- Cumulative cost across iterations
- Best vs original score summary
Standard run integration: Each cycle produces standard artifacts (index.jsonl, grading.json, etc.) in the normal location. Studio picks them up automatically — no special handling. agentv compare works on any two cycle run dirs.
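For example, comparing cycle 1 against cycle 2 from the layout above (the two-directory positional form is assumed here for illustration; the exact interface is whatever agentv compare already accepts):

```bash
agentv compare \
  .agentv/results/runs/autoresearch-pdf-skill/2026-04-15T10-30-00 \
  .agentv/results/runs/autoresearch-pdf-skill/2026-04-15T10-35-00
```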
Interactive/autonomous hybrid: Users can start in interactive mode (existing behavior), build confidence in their eval, then switch to autoresearch mode to run unattended.
Phase 2: State persistence and resumability (follow-up)
Only build if users need to resume interrupted loops:
- _autoresearch/state.json for loop state ({best_score, cycle, best_cycle, convergence_count})
- Resume from state after context reset using iterations.jsonl + best.md
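For illustration, a state file holding only the fields listed above might look like this (values are made up):

```json
{"best_score": 0.78, "cycle": 7, "best_cycle": 5, "convergence_count": 2}
```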
Acceptance Signals
Phase 1 (this issue)
- agentv-bench SKILL.md documents autoresearch mode as an alternative to interactive iteration
- Given a SKILL.md + EVAL.yaml, runs N improvement cycles without human input
- Hill-climbing ratchet: artifact quality only increases or stays the same across cycles
- Stops on convergence (no improvement for N cycles) or max_cycles
- Produces trajectory.html — live-updating during the loop, static after completion
- Writes iterations.jsonl — structured per-cycle log for chart + mutator consumption
- Standard run artifacts written per cycle to .agentv/results/runs/<experiment>/<timestamp>/
- Studio sees each cycle as a normal run — no special integration needed
- Original artifact preserved in _autoresearch/original.md
- Scope remains benchmark/workflow oriented
Phase 2 (follow-up issue)
- Session state survives context resets
- A fresh agent can resume from persisted state + best artifact
Non-Goals
- Not a new skill — this is a mode of the existing agentv-bench skill
- Not a replacement for interactive mode — both coexist
- Not multi-file mutation (start with single artifact, expand later)
- Not a full dashboard or Studio feature — live HTML chart only (Phase 1)
- Does not modify the eval definition — only the artifact under test
- No autoresearch: YAML config section — triggered via natural language, not schema
- Does not add persistent user memory, session search infra, or runtime-managed skill storage to AgentV core
Related
Implementation Prompt
One-line prompt to spawn subagents that implement the full autoresearch stack in dependency order:
Implement the agentv autoresearch optimization loop: first complete #958 (automate keep/discard decision in agentv-bench Step 5 using agentv compare --json output), then #746 (add agents/mutator.md to agentv-bench that rewrites artifacts from failure analysis), then #748 Phase 1 (wire them into an unattended loop in agentv-bench SKILL.md with _autoresearch/ output folder containing iterations.jsonl, trajectory.html, original.md, and best.md — each cycle writes standard run artifacts via agentv eval --experiment autoresearch-<name>). Read each issue body for the full contract before starting.