
feat(bench): autoresearch mode — unattended eval-improve loop with hill-climbing ratchet #748

@christso

Description


Objective

Add an autoresearch mode to agentv-bench that runs the eval-improve loop unattended: evaluate → analyze → keep/drop → mutate → repeat. This turns agentv-bench from a human-directed optimization tool into one that can also run autonomously overnight.

Context

The autoresearch pattern — proven by karpathy/autoresearch (ML training optimization), kevinrgu/autoagent (agent harness optimization), and pi-autoresearch (generic optimization loops) — automates the improvement step: score → keep/drop → mutate → repeat.

AutoAgent validates the pattern with benchmark results (96.5% SpreadsheetBench, 55.1% TerminalBench — both #1) but is not a replacement for AgentV autoresearch. AutoAgent optimizes Python agent harnesses for Harbor containerized benchmarks. AgentV autoresearch optimizes any text artifact (skills, prompts, configs) against any AgentV eval. Different ecosystem, different audience, different artifact types.

Model empathy finding (from AutoAgent research): same-model pairings (e.g., Claude meta-agent optimizing a Claude task agent) significantly outperform cross-model setups. Recommend same-model pairing in docs.

Prerequisites & Mechanics

Dependencies

Experiment naming

The experiment name is derived from the artifact being optimized: autoresearch-<artifact-basename> (e.g., autoresearch-pdf-skill for a skill named pdf). The implementing agent can also accept a user-provided name via the existing --experiment flag.

Artifact mutation flow

The mutator rewrites the actual file on disk (e.g., ./SKILL.md) in place. The autoresearch loop manages backups:

  1. Before the first cycle, copy the artifact to _autoresearch/original.md
  2. On each KEEP, copy the current artifact to _autoresearch/best.md
  3. On each DROP, restore _autoresearch/best.md back to the artifact path
  4. The eval always runs against the real file path — no temp files or indirection
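The backup steps above can be sketched as three small file operations. This is an illustrative sketch, not the skill's actual implementation — the function names (`init_backups`, `on_keep`, `on_drop`) are hypothetical, but the copy/restore semantics match the four steps listed:

```python
import shutil
from pathlib import Path

def init_backups(artifact: Path, workdir: Path) -> None:
    """Before the first cycle: snapshot the original artifact.

    The baseline also seeds best.md, so a DROP on cycle 1 has
    something to restore.
    """
    workdir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(artifact, workdir / "original.md")
    shutil.copy2(artifact, workdir / "best.md")

def on_keep(artifact: Path, workdir: Path) -> None:
    """KEEP: the mutated artifact scored higher, so it becomes the new best."""
    shutil.copy2(artifact, workdir / "best.md")

def on_drop(artifact: Path, workdir: Path) -> None:
    """DROP: restore the last best version over the failed mutation.

    The eval always reads the real artifact path, so restoring in
    place is all that's needed — no temp files or indirection.
    """
    shutil.copy2(workdir / "best.md", artifact)
```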

How the skill invokes eval

The bench skill shells out to agentv eval <path> --experiment <name> via Bash tool, same as the existing interactive bench workflow. No programmatic API changes.

trajectory.html implementation

This is a new standalone HTML generator, not an extension of the existing HtmlWriter class. HtmlWriter renders per-run results from live eval output; trajectory.html renders cross-cycle data from iterations.jsonl — different data source, different chart. It should be a simple single-file HTML page with embedded Chart.js (or inline SVG) that renders the baked-in JSON data, plus a 2-second auto-refresh (e.g. a meta refresh tag) so the browser picks up the regenerated file while the loop runs.
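A minimal generator along those lines might look like the sketch below. It is a stand-in, not the real implementation: it uses inline SVG (the simpler of the two options mentioned) and a hypothetical `write_trajectory_html` helper, with the meta-refresh tag emitted only while the loop is live:

```python
import json
from pathlib import Path

def write_trajectory_html(iterations_path: Path, out_path: Path, live: bool = True) -> None:
    """Render iterations.jsonl as a single-file score chart (inline SVG sketch).

    While `live` is True, a <meta http-equiv="refresh"> tag makes the
    browser re-fetch the regenerated file every 2 s; dropping it after
    the final cycle makes the page static.
    """
    lines = iterations_path.read_text().splitlines()
    cycles = [json.loads(line) for line in lines if line.strip()]
    # Map cycle index to x, score (0..1) to y in a 600x200 viewport.
    points = " ".join(
        f"{i * 40 + 20},{180 - c['score'] * 160}" for i, c in enumerate(cycles)
    )
    refresh = '<meta http-equiv="refresh" content="2">' if live else ""
    out_path.write_text(
        f"<!doctype html><html><head>{refresh}<title>Score trajectory</title></head>"
        f"<body><h1>Score over {len(cycles)} cycles</h1>"
        f'<svg width="600" height="200"><polyline fill="none" stroke="steelblue" '
        f'points="{points}"/></svg></body></html>'
    )
```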

Design Latitude

Phase 1: Minimum loop + live trajectory chart

Skill-only change — no CLI, schema, or core code changes.

What changes in agentv-bench:

Activation: Triggered via natural language ("run autoresearch on this skill"). No YAML schema changes.

The loop:

1. RUN EVAL      — agentv eval with current artifact (writes standard run to .agentv/results/runs/<experiment>/<timestamp>/)
2. ANALYZE       — dispatch analyzer subagent on results
3. DECIDE        — if score > best_score: KEEP, else DROP (#958)
4. MUTATE        — dispatch mutator subagent with failure analysis (#746)
5. GOTO 1        — until convergence or max_cycles
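The five steps reduce to a hill-climbing loop with a convergence counter. The sketch below is illustrative only — in the real skill, `run_eval` shells out to `agentv eval` and `mutate` dispatches the mutator subagent; here they are injected callables so the control flow is visible on its own:

```python
def autoresearch(run_eval, mutate, on_keep, on_drop, max_cycles=10, patience=3):
    """Hill-climbing ratchet: score only moves up; a DROP restores the best artifact.

    Stops on convergence (no improvement for `patience` cycles) or max_cycles.
    """
    best_score = float("-inf")
    stale = 0
    for cycle in range(1, max_cycles + 1):
        score = run_eval()        # 1. RUN EVAL (analysis folded into the score here)
        if score > best_score:    # 3. DECIDE
            best_score = score
            on_keep()             # current artifact becomes best.md
            stale = 0
        else:
            on_drop()             # restore best.md over the failed mutation
            stale += 1
        if stale >= patience:     # convergence: N cycles without improvement
            break
        mutate()                  # 4. MUTATE, then 5. GOTO 1
    return best_score
```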

Artifact layout:

Each cycle is a standard eval run. Autoresearch session metadata lives in _autoresearch/ within the experiment directory:

.agentv/results/runs/<experiment>/
  _autoresearch/                     # experiment-level outputs (user-facing, not hidden)
    original.md                    # snapshot of artifact before first mutation
    best.md                        # current best-scoring version (updated on KEEP)
    iterations.jsonl               # one line per cycle — data source for chart + mutator
    trajectory.html                # live-updating score trajectory chart
  2026-04-15T10-30-00/             # cycle 1 — standard run artifacts
    index.jsonl
    grading.json
    timing.json
    benchmark.json
    report.html
  2026-04-15T10-35-00/             # cycle 2 — standard run artifacts
    ...

The _ prefix convention distinguishes workflow folders from timestamped run dirs. Future workflows get their own folder (e.g., _campaign/, _ab-test/, _sweep/).

iterations.jsonl — one line per cycle, JSONL wire format:

{"cycle":1,"score":0.65,"decision":"keep","cost_usd":0.12,"assertions":{"IDENTIFIES_BUG":0.8,"SUGGESTS_FIX":0.4},"mutation":"added explicit null-check instruction","run_dir":"2026-04-15T10-30-00","timestamp":"2026-04-15T10:32:15Z"}
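Since both the chart and the mutator consume this file, the append/read helpers are trivial. The following is a sketch assuming the field names shown in the wire-format line above (the helper names are hypothetical):

```python
import json
from pathlib import Path

def append_cycle(path: Path, **record) -> None:
    """Append one JSONL record per cycle, e.g.
    append_cycle(p, cycle=1, score=0.65, decision="keep", ...).
    """
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

def best_cycle(path: Path) -> dict:
    """Return the record with the highest score (ties go to the earliest cycle)."""
    lines = Path(path).read_text().splitlines()
    records = [json.loads(line) for line in lines if line.strip()]
    return max(records, key=lambda r: (r["score"], -r["cycle"]))
```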

trajectory.html — live-updating score chart. Mirrors HtmlWriter's auto-refresh behavior (refreshes every 2s during the loop, becomes static after completion) while remaining a separate generator. Shows:

  • Score over iterations (line chart)
  • Keep/discard decision per iteration (markers)
  • Per-assertion pass rates over iterations
  • Cumulative cost across iterations
  • Best vs original score summary

Standard run integration: Each cycle produces standard artifacts (index.jsonl, grading.json, etc.) in the normal location. Studio picks them up automatically — no special handling. agentv compare works on any two cycle run dirs.

Interactive/autonomous hybrid: Users can start in interactive mode (existing behavior), build confidence in their eval, then switch to autoresearch mode to run unattended.

Phase 2: State persistence and resumability (follow-up)

Only build if users need to resume interrupted loops:

  • _autoresearch/state.json for loop state ({best_score, cycle, best_cycle, convergence_count})
  • Resume from state after context reset using iterations.jsonl + best.md
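A resume step could look like the sketch below — hypothetical helper, not a committed design. It prefers `state.json` when present and falls back to deriving the same fields from iterations.jsonl, which Phase 1 always writes:

```python
import json
from pathlib import Path

def resume_state(workdir: Path) -> dict:
    """Rebuild loop state after a context reset.

    Reads _autoresearch/state.json if it exists; otherwise derives
    {best_score, cycle, best_cycle, convergence_count} from
    iterations.jsonl.
    """
    state_file = workdir / "state.json"
    if state_file.exists():
        return json.loads(state_file.read_text())
    lines = (workdir / "iterations.jsonl").read_text().splitlines()
    records = [json.loads(line) for line in lines if line.strip()]
    best = max(records, key=lambda r: r["score"])
    return {
        "best_score": best["score"],
        "cycle": records[-1]["cycle"],
        "best_cycle": best["cycle"],
        # cycles elapsed since the last improvement
        "convergence_count": records[-1]["cycle"] - best["cycle"],
    }
```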

Acceptance Signals

Phase 1 (this issue)

  • agentv-bench SKILL.md documents autoresearch mode as an alternative to interactive iteration
  • Given a SKILL.md + EVAL.yaml, runs N improvement cycles without human input
  • Hill-climbing ratchet: artifact quality only increases or stays the same across cycles
  • Stops on convergence (no improvement for N cycles) or max_cycles
  • Produces trajectory.html — live-updating during loop, static after completion
  • Writes iterations.jsonl — structured per-cycle log for chart + mutator consumption
  • Standard run artifacts written per cycle to .agentv/results/runs/<experiment>/<timestamp>/
  • Studio sees each cycle as a normal run — no special integration needed
  • Original artifact preserved in _autoresearch/original.md
  • Scope remains benchmark/workflow oriented

Phase 2 (follow-up issue)

  • Session state survives context resets
  • A fresh agent can resume from persisted state + best artifact

Non-Goals

  • Not a new skill — this is a mode of the existing agentv-bench skill
  • Not a replacement for interactive mode — both coexist
  • Not multi-file mutation (start with single artifact, expand later)
  • Not a full dashboard or Studio feature — live HTML chart only (Phase 1)
  • Does not modify the eval definition — only the artifact under test
  • No autoresearch: YAML config section — triggered via natural language, not schema
  • Does not add persistent user memory, session search infra, or runtime-managed skill storage to AgentV core

Related

Implementation Prompt

One-line prompt to spawn subagents that implement the full autoresearch stack in dependency order:

Implement the agentv autoresearch optimization loop: first complete #958 (automate keep/discard decision in agentv-bench Step 5 using agentv compare --json output), then #746 (add agents/mutator.md to agentv-bench that rewrites artifacts from failure analysis), then #748 Phase 1 (wire them into an unattended loop in agentv-bench SKILL.md with _autoresearch/ output folder containing iterations.jsonl, trajectory.html, original.md, and best.md — each cycle writes standard run artifacts via agentv eval --experiment autoresearch-<name>). Read each issue body for the full contract before starting.
