feat: curated public benchmark dataset and leaderboard #966

@christso

Objective

Establish AgentV as the industry standard for agent evaluation by running existing public benchmarks with richer metrics than the originals, and hosting a public leaderboard on agentv.dev as a landing page feature.

Context

From SWE-bench competitive analysis: SWE-bench's dominance comes from being the shared yardstick everyone measures against. Its technical harness is sophisticated but not its moat — the moat is the dataset + community adoption + public leaderboard. Without a shared benchmark, AgentV is a tool; with one, it becomes the standard.

AgentV already has the strongest evaluation framework (15+ evaluator types, multi-provider matrix, CI integration, trend analysis, OTel export). What's missing is a canonical dataset and public surface for the community to rally around.

Strategy: Reuse existing benchmarks, show richer results

Do NOT curate a new dataset. SWE-bench is already the industry standard — no need to convince people it's valid. Instead, run SWE-bench through AgentV and surface dimensions the original leaderboard doesn't have.

The hook: Visitors come to agentv.dev/leaderboard, see familiar SWE-bench results but with columns swebench.com doesn't show. They think "oh, this is what SWE-bench would look like if it measured more than just pass/fail." That's the conversion point — they come for the leaderboard, stay for the framework.

What makes it differentiated vs mirroring swebench.com

| Dimension | swebench.com | agentv.dev/leaderboard |
| --- | --- | --- |
| Primary metric | % Resolved (single number) | Multi-criteria score (quality + cost + latency) |
| Cost | "Avg. $" column | Cost-normalized ranking ("best score per dollar"), Pareto frontier chart |
| Providers | One model at a time | Per-provider breakdown (Claude vs Codex vs Copilot on same tasks) |
| Trend | None | Score trend over time (are models improving across releases?) |
| Tool usage | Trajectory link (external) | Tool call efficiency (calls per resolution, exploration ratio) |
| Comparison | Checkbox + compare | Visual delta view with win/loss/tie per task |
| Filtering | Agent type, model type | Filter by any dimension (cost range, language, repo, difficulty) |
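The "win/loss/tie per task" delta view could be derived from two result files roughly as follows. This is a minimal sketch, not an existing AgentV API: `compareRuns` and the `Delta` shape are hypothetical names, and the per-task fields mirror the `per_instance` entries of the result JSON described later in this issue.

```typescript
// Hypothetical per-task record; only the fields needed for the comparison.
interface TaskOutcome {
  instance_id: string;
  resolved: boolean;
}

// Hypothetical output shape for the delta view.
interface Delta {
  wins: string[]; // tasks only the candidate resolved
  losses: string[]; // tasks only the baseline resolved
  ties: string[]; // tasks with the same outcome in both runs
}

function compareRuns(baseline: TaskOutcome[], candidate: TaskOutcome[]): Delta {
  const baseById = new Map<string, boolean>();
  for (const r of baseline) baseById.set(r.instance_id, r.resolved);

  const delta: Delta = { wins: [], losses: [], ties: [] };
  for (const r of candidate) {
    const base = baseById.get(r.instance_id);
    if (base === undefined) continue; // task missing from baseline: not comparable
    if (r.resolved === base) delta.ties.push(r.instance_id);
    else if (r.resolved) delta.wins.push(r.instance_id);
    else delta.losses.push(r.instance_id);
  }
  return delta;
}
```

Tasks that appear in only one run are skipped rather than counted as losses, so two submissions with partial coverage remain comparable on their shared tasks.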

Design decisions (settled)

  1. MVP uses static site generation. Run benchmarks periodically, generate static JSON, agentv.dev renders from JSON. agentv.dev is already an Astro/Starlight docs site — add a leaderboard page component that reads from a JSON data file.

  2. Benchmark recipe lives in the agentv repo under benchmarks/swe-bench-lite/. This includes a setup.ts script, the reusable swe-bench-grader.ts template, and a README. The repo does NOT commit the SWE-bench dataset — the setup script downloads it from HuggingFace on first run and generates EVAL.yaml files locally.

    Why a setup script, not committed EVAL.yaml files:

    • The 300 problem statements are long GitHub issue texts (1-3MB total) — too large to commit
    • Each instance needs a Docker image reference + FAIL_TO_PASS lists from the HuggingFace dataset
    • Re-hosting the dataset in our repo creates a stale copy problem
    • A setup script keeps the source of truth on HuggingFace where it belongs

    Data source: HuggingFace dataset SWE-bench/SWE-bench_Lite (test split, 300 instances). Docker images from DockerHub swebench/sweb.eval.x86_64.*.

    Future enhancement (P2): Native remote dataset references in AgentV — agentv eval hf://SWE-bench/SWE-bench_Lite — would eliminate the setup step entirely. Track separately.

  3. Community submissions via GitHub PR. Submitters run the benchmark locally, then open a PR to the agentv repo adding their result JSON to benchmarks/swe-bench-lite/results/<model-name>.json. A CI check validates the format. No API, no backend.

  4. agentv bench submit is deferred. For MVP, submissions are manual GitHub PRs. A CLI command can be added later if submission volume warrants it.

  5. Start with SWE-bench Lite (300 instances). Cheaper to run (~$50-200 per full evaluation depending on model), still credible. Expand to Verified (500) once pipeline is proven.
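The per-row conversion inside setup.ts (decision 2) might look something like the sketch below. Several things here are assumptions rather than settled design: `EvalCase` is a hypothetical intermediate shape and not the real EVAL.yaml schema, `FAIL_TO_PASS` is assumed to arrive as a JSON-encoded string list as in the HuggingFace rows, and the mapping from instance ID to Docker image name is injected through a hypothetical `imageFor` callback because this issue only gives the image prefix.

```typescript
// Subset of the SWE-bench Lite columns this sketch relies on.
interface SweBenchRow {
  instance_id: string;
  problem_statement: string;
  FAIL_TO_PASS: string; // assumed: JSON-encoded list of test names
}

// Hypothetical intermediate shape, NOT the actual EVAL.yaml schema.
interface EvalCase {
  id: string;
  prompt: string;
  dockerImage: string;
  failToPass: string[];
}

function toEvalCase(
  row: SweBenchRow,
  imageFor: (instanceId: string) => string, // naming convention supplied by the caller
): EvalCase {
  const parsed: unknown = JSON.parse(row.FAIL_TO_PASS);
  if (!Array.isArray(parsed) || !parsed.every((t) => typeof t === "string")) {
    throw new Error(`FAIL_TO_PASS is not a list of strings for ${row.instance_id}`);
  }
  return {
    id: row.instance_id,
    prompt: row.problem_statement,
    dockerImage: imageFor(row.instance_id),
    failToPass: parsed as string[],
  };
}
```

Keeping this step a pure function means the download and the YAML serialization can change independently, and the conversion can be unit-tested without network access.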

Execution path

  1. Implement Docker workspaces (feat: Docker workspace execution environments for coding benchmarks #965) — prerequisite
  2. Create benchmarks/swe-bench-lite/ in agentv repo:
    • README.md — how to run, how to submit
    • setup.ts — downloads dataset from HuggingFace, generates EVAL.yaml files into evals/ (gitignored)
    • graders/swe-bench-grader.ts — reusable grader using @agentv/eval
    • results/ — JSON result files per model (committed to repo)
    • .gitignore — ignores evals/ and .cache/ (generated from setup)
  3. Run initial evaluations — evaluate 5-10 models (Claude Opus, Claude Sonnet, GPT-4o, Gemini, Codex, open-source models)
  4. Build leaderboard page in apps/web/:
    • Astro component reading from benchmarks/swe-bench-lite/results/*.json
    • Sortable table: Model, % Resolved, Avg Cost, Cost Efficiency ($/Fix), Tool Calls, Date
    • Pareto frontier chart (score vs cost scatter) using a lightweight chart library
    • Filter dropdowns: model type (open/proprietary), provider
  5. Add landing page section linking to leaderboard with CTAs
  6. Document submission workflow in docs
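For the Pareto frontier chart in step 4, the set of "efficient" models can be computed before rendering. A minimal sketch, assuming lower average cost and higher resolution rate are the two objectives; `Point` and `paretoFrontier` are hypothetical names, not part of AgentV or any chart library.

```typescript
// Hypothetical chart input: one point per model.
interface Point {
  model: string;
  avgCostUsd: number;
  resolutionRate: number;
}

// Returns the models that no other model beats on both cost and score,
// ordered from cheapest to most expensive.
function paretoFrontier(points: Point[]): Point[] {
  // Sort by cost ascending; on equal cost, put the higher score first.
  const sorted = [...points].sort(
    (a, b) => a.avgCostUsd - b.avgCostUsd || b.resolutionRate - a.resolutionRate,
  );
  const frontier: Point[] = [];
  let bestScore = -Infinity;
  for (const p of sorted) {
    // A point is kept only if it scores higher than every cheaper point.
    if (p.resolutionRate > bestScore) {
      frontier.push(p);
      bestScore = p.resolutionRate;
    }
  }
  return frontier;
}
```

The frontier line in the chart would connect the returned points in order; every other model is dominated by one that is at least as cheap and scores higher.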

Leaderboard page layout

┌─────────────────────────────────────────────────────┐
│  AgentV Leaderboard — SWE-bench Lite                │
│  "The multi-dimensional agent benchmark"            │
├─────────────────────────────────────────────────────┤
│  [Filter: All models ▾] [Filter: All providers ▾]  │
│                                                     │
│  Model          │ Resolved │ Avg $ │ $/Fix │ Tools │
│  ─────────────  │ ──────── │ ───── │ ───── │ ───── │
│  Claude Opus    │  72.7%   │ $0.55 │ $0.76 │  8.2  │
│  Gemini Flash   │  71.2%   │ $0.36 │ $0.51 │  6.4  │
│  GPT-4o         │  68.5%   │ $0.45 │ $0.66 │  9.1  │
│  ...            │          │       │       │       │
├─────────────────────────────────────────────────────┤
│  [Pareto Frontier Chart]                            │
│  Y: % Resolved  X: Avg Cost                        │
│  • Each dot = one model                            │
│  • Frontier line connecting efficient models        │
├─────────────────────────────────────────────────────┤
│  Run it yourself:                                   │
│  $ git clone https://github.com/EntityProcess/agentv│
│  $ cd agentv/benchmarks/swe-bench-lite              │
│  $ bun run setup.ts                                 │
│  $ agentv eval ./evals/ --target claude             │
│                                                     │
│  [Submit your results →]                            │
└─────────────────────────────────────────────────────┘

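$/Fix = total run cost divided by the number of resolved instances, which equals Avg $ divided by the resolution rate (for example, $0.55 / 0.727 ≈ $0.76). This cost-normalized metric is the differentiating column that swebench.com doesn't have.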
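The summary columns can be derived from per-task records along these lines. This is a sketch under one assumption: $/Fix is taken as total cost across all attempted instances divided by the number resolved, which is the reading consistent with the mock rows in the layout. `summarize` is a hypothetical helper; the field names follow the result JSON in this issue.

```typescript
// Per-task record, using the field names from the result JSON.
interface InstanceResult {
  instance_id: string;
  resolved: boolean;
  cost_usd: number;
  tool_calls: number;
}

interface Summary {
  total_instances: number;
  resolved_instances: number;
  resolution_rate: number;
  avg_cost_usd: number;
  avg_cost_per_fix_usd: number | null; // null when nothing was resolved
  avg_tool_calls: number;
}

function summarize(perInstance: InstanceResult[]): Summary {
  const total = perInstance.length;
  const resolved = perInstance.filter((r) => r.resolved).length;
  const totalCost = perInstance.reduce((sum, r) => sum + r.cost_usd, 0);
  const totalCalls = perInstance.reduce((sum, r) => sum + r.tool_calls, 0);
  return {
    total_instances: total,
    resolved_instances: resolved,
    resolution_rate: total > 0 ? resolved / total : 0,
    avg_cost_usd: total > 0 ? totalCost / total : 0,
    // Failed attempts still cost money, so they count toward the numerator.
    avg_cost_per_fix_usd: resolved > 0 ? totalCost / resolved : null,
    avg_tool_calls: total > 0 ? totalCalls / total : 0,
  };
}
```

Returning null instead of Infinity or 0 when no instance is resolved keeps a failed run from sorting to the top of a "cheapest per fix" ranking.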

Result JSON schema

Each submitted result file (results/<model-name>.json):

{
  "model": "Claude Opus 4.6",
  "provider": "anthropic",
  "model_type": "proprietary",
  "date": "2026-04-08",
  "agent": "mini-swe-agent-agentv",
  "agent_version": "1.0.0",
  "dataset": "swe-bench-lite",
  "total_instances": 300,
  "resolved_instances": 218,
  "resolution_rate": 0.727,
  "avg_cost_usd": 0.55,
  "avg_cost_per_fix_usd": 0.76,
  "avg_duration_ms": 45000,
  "avg_tool_calls": 8.2,
  "per_instance": [
    {
      "instance_id": "django__django-15180",
      "resolved": true,
      "cost_usd": 0.42,
      "duration_ms": 32000,
      "tool_calls": 6
    }
  ]
}
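The CI format check on submission PRs could start from something like the sketch below. It is illustrative only: `validateResult` is a hypothetical function, the allowed `model_type` values are assumed from the open/proprietary filter, and the 0.001 tolerance on `resolution_rate` is an assumed allowance for three-decimal rounding, not a settled rule.

```typescript
// Returns a list of human-readable problems; an empty list means the file passed.
function validateResult(input: unknown): string[] {
  if (typeof input !== "object" || input === null || Array.isArray(input)) {
    return ["result must be a JSON object"];
  }
  const r = input as Record<string, unknown>;
  const errors: string[] = [];

  for (const key of ["model", "provider", "agent", "dataset"]) {
    const value = r[key];
    if (typeof value !== "string" || value === "") errors.push(`${key} must be a non-empty string`);
  }

  const modelType = r["model_type"];
  if (modelType !== "open" && modelType !== "proprietary") {
    errors.push('model_type must be "open" or "proprietary"'); // assumed value set
  }

  const date = r["date"];
  if (typeof date !== "string" || !/^\d{4}-\d{2}-\d{2}$/.test(date)) {
    errors.push("date must be formatted as YYYY-MM-DD");
  }

  const total = r["total_instances"];
  const resolved = r["resolved_instances"];
  const rate = r["resolution_rate"];
  if (typeof total !== "number" || !Number.isInteger(total) || total <= 0) {
    errors.push("total_instances must be a positive integer");
  } else if (typeof resolved !== "number" || !Number.isInteger(resolved) || resolved < 0 || resolved > total) {
    errors.push("resolved_instances must be an integer between 0 and total_instances");
  } else if (typeof rate !== "number" || Math.abs(rate - resolved / total) > 0.001) {
    // Cross-check the headline number against the counts it is derived from.
    errors.push("resolution_rate must equal resolved_instances / total_instances");
  }

  if (!Array.isArray(r["per_instance"])) errors.push("per_instance must be an array");
  return errors;
}
```

Cross-checking `resolution_rate` against the counts catches the most likely submission mistake, a hand-edited headline number, without requiring the CI job to rerun anything.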

Acceptance signals

  • setup.ts downloads SWE-bench Lite from HuggingFace and generates EVAL.yaml files
  • SWE-bench Lite (300 instances) running end-to-end through AgentV
  • At least 5 models evaluated with results in benchmarks/swe-bench-lite/results/
  • Leaderboard page live on agentv.dev with sortable multi-dimensional columns
  • Pareto frontier chart (score vs cost) rendered
  • Cost-normalized ranking ($/Fix) visible
  • Submission workflow documented in benchmarks/swe-bench-lite/README.md
  • CI validates result JSON schema on PR submission

Non-goals

  • Curating a new benchmark dataset from scratch (reuse existing)
  • Building a full platform (auth, teams, billing)
  • Building a dynamic API backend (static JSON for MVP)
  • Competing with observability platforms (Langfuse, Braintrust)
  • Hosting compute for evaluation — users run locally, submit results
  • Replacing swebench.com — we complement it with richer analysis
  • agentv bench submit CLI command (defer until submission volume warrants it)
