feat: curated public benchmark dataset and leaderboard

## Objective

Establish AgentV as the industry standard for agent evaluation by running existing public benchmarks with richer metrics than the originals, and hosting a public leaderboard on agentv.dev as a landing page feature.

## Context

From [SWE-bench competitive analysis](https://github.com/agentevals/agentevals-research/pull/55): SWE-bench's dominance comes from being the shared yardstick everyone measures against. Its technical harness is sophisticated but not its moat — the moat is the **dataset + community adoption + public leaderboard**. Without a shared benchmark, AgentV is a tool; with one, it becomes the standard.

AgentV already has the strongest evaluation framework (15+ evaluator types, multi-provider matrix, CI integration, trend analysis, OTel export). What's missing is a canonical dataset and public surface for the community to rally around.

## Strategy: Reuse existing benchmarks, show richer results

**Do NOT curate a new dataset.** SWE-bench is already the industry standard — no need to convince people it's valid. Instead, run SWE-bench through AgentV and surface dimensions the original leaderboard doesn't have.

**The hook**: Visitors come to agentv.dev/leaderboard, see familiar SWE-bench results but with columns swebench.com doesn't show. They think "oh, this is what SWE-bench would look like if it measured more than just pass/fail." That's the conversion point — they come for the leaderboard, stay for the framework.

### What makes it differentiated vs mirroring swebench.com

| Dimension | swebench.com | agentv.dev/leaderboard |
|-----------|-------------|----------------------|
| Primary metric | % Resolved (single number) | Multi-criteria score (quality + cost + latency) |
| Cost | "Avg. $" column | **Cost-normalized ranking** ("best score per dollar"), Pareto frontier chart |
| Providers | One model at a time | **Per-provider breakdown** (Claude vs Codex vs Copilot on same tasks) |
| Trend | None | **Score trend over time** (are models improving across releases?) |
| Tool usage | Trajectory link (external) | **Tool call efficiency** (calls per resolution, exploration ratio) |
| Comparison | Checkbox + compare | **Visual delta view** with win/loss/tie per task |
| Filtering | Agent type, model type | Filter by any dimension (cost range, language, repo, difficulty) |
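
The win/loss/tie delta view can be derived from two runs' per-instance outcomes. Below is a minimal sketch, assuming each run exposes entries shaped like the `per_instance` rows in the result JSON described in this issue. `compareRuns` and `Delta` are hypothetical names for illustration, not existing AgentV APIs.

```typescript
// Hypothetical helper for the delta view; not part of AgentV today.
interface InstanceResult {
  instance_id: string;
  resolved: boolean;
}

interface Delta {
  wins: string[];   // resolved by run A only
  losses: string[]; // resolved by run B only
  ties: string[];   // same outcome in both runs
}

function compareRuns(a: InstanceResult[], b: InstanceResult[]): Delta {
  const bById = new Map<string, boolean>();
  for (const r of b) bById.set(r.instance_id, r.resolved);

  const delta: Delta = { wins: [], losses: [], ties: [] };
  for (const r of a) {
    const other = bById.get(r.instance_id);
    if (other === undefined) continue; // only compare tasks both runs attempted
    if (r.resolved === other) delta.ties.push(r.instance_id);
    else if (r.resolved) delta.wins.push(r.instance_id);
    else delta.losses.push(r.instance_id);
  }
  return delta;
}
```

Skipping instances that appear in only one run keeps the comparison "on same tasks", as the table puts it.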

## Design decisions (settled)

1. **MVP uses static site generation.** Run benchmarks periodically, generate static JSON, agentv.dev renders from JSON. agentv.dev is already an Astro/Starlight docs site — add a leaderboard page component that reads from a JSON data file.

2. **Benchmark recipe lives in the agentv repo** under `benchmarks/swe-bench-lite/`. This includes a `setup.ts` script, the reusable `swe-bench-grader.ts` template, and a README. The SWE-bench dataset itself is **NOT** committed: the setup script downloads it from HuggingFace on first run and generates EVAL.yaml files locally.

   **Why a setup script, not committed EVAL.yaml files:**
   - The 300 problem statements are long GitHub issue texts (1-3MB total) — too large to commit
   - Each instance needs a Docker image reference + FAIL_TO_PASS lists from the HuggingFace dataset
   - Re-hosting the dataset in our repo creates a stale copy problem
   - A setup script keeps the source of truth on HuggingFace where it belongs

   **Data source:** HuggingFace dataset [`SWE-bench/SWE-bench_Lite`](https://huggingface.co/datasets/SWE-bench/SWE-bench_Lite) (test split, 300 instances). Docker images come from Docker Hub under `swebench/sweb.eval.x86_64.*`.

   **Future enhancement (P2):** Native remote dataset references in AgentV — `agentv eval hf://SWE-bench/SWE-bench_Lite` — would eliminate the setup step entirely. Track separately.

3. **Community submissions via GitHub PR.** Submitters run the benchmark locally, then open a PR to the agentv repo adding their result JSON to `benchmarks/swe-bench-lite/results/<model-name>.json`. A CI check validates the format. No API, no backend.

4. **`agentv bench submit` is deferred.** For MVP, submissions are manual GitHub PRs. A CLI command can be added later if submission volume warrants it.

5. **Start with SWE-bench Lite (300 instances).** Cheaper to run (~$50-200 per full evaluation, depending on model) and still credible. Expand to SWE-bench Verified (500 instances) once the pipeline is proven.
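
The per-instance conversion step in `setup.ts` (decision 2) could look roughly like the sketch below. It assumes the dataset row carries `instance_id`, `problem_statement`, and `FAIL_TO_PASS`, and that `FAIL_TO_PASS` may arrive as a JSON-encoded string. The `EvalCase` shape and the `toEvalCase` and `parseTestList` names are hypothetical; the real EVAL.yaml layout is whatever AgentV defines. The Docker image reference is passed in rather than derived, since the naming rule is not spelled out here.

```typescript
// Subset of a SWE-bench Lite row (assumed field names).
interface SweBenchRow {
  instance_id: string;
  problem_statement: string;
  FAIL_TO_PASS: string | string[]; // assumed: JSON-encoded string, or already parsed
}

// Hypothetical intermediate shape, to be serialized into an EVAL.yaml file.
interface EvalCase {
  id: string;
  prompt: string;
  dockerImage: string;
  failToPass: string[];
}

// Accepts either a JSON-encoded list or an array and returns a string array.
function parseTestList(value: string | string[]): string[] {
  const parsed: unknown = typeof value === "string" ? JSON.parse(value) : value;
  if (!Array.isArray(parsed) || !parsed.every((t) => typeof t === "string")) {
    throw new Error("FAIL_TO_PASS must be a list of test names");
  }
  return parsed as string[];
}

function toEvalCase(row: SweBenchRow, dockerImage: string): EvalCase {
  return {
    id: row.instance_id,
    prompt: row.problem_statement,
    dockerImage,
    failToPass: parseTestList(row.FAIL_TO_PASS),
  };
}
```

Failing loudly on a malformed test list means a dataset format change surfaces at setup time rather than as silently ungraded instances.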

### Execution path

1. **Implement Docker workspaces** (#965) — prerequisite
2. **Create `benchmarks/swe-bench-lite/`** in agentv repo:
   - `README.md` — how to run, how to submit
   - `setup.ts` — downloads dataset from HuggingFace, generates EVAL.yaml files into `evals/` (gitignored)
   - `graders/swe-bench-grader.ts` — reusable grader using `@agentv/eval`
   - `results/` — JSON result files per model (committed to repo)
   - `.gitignore` — ignores `evals/` and `.cache/` (generated from setup)
3. **Run initial evaluations** — evaluate 5-10 models (Claude Opus, Claude Sonnet, GPT-4o, Gemini, Codex, open-source models)
4. **Build leaderboard page** in `apps/web/`:
   - Astro component reading from `benchmarks/swe-bench-lite/results/*.json`
   - Sortable table: Model, % Resolved, Avg Cost, Cost Efficiency ($/Fix), Tool Calls, Date
   - Pareto frontier chart (score vs cost scatter) using a lightweight chart library
   - Filter dropdowns: model type (open/proprietary), provider
5. **Add landing page section** linking to leaderboard with CTAs
6. **Document submission workflow** in docs
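
For the Pareto frontier chart in step 4, the frontier can be computed at build time from the result files. Below is a minimal sketch, assuming each model contributes one point with a resolution rate (higher is better) and an average cost (lower is better). `ModelPoint` and `paretoFrontier` are hypothetical names.

```typescript
interface ModelPoint {
  model: string;
  resolutionRate: number; // higher is better
  avgCostUsd: number;     // lower is better
}

// Returns the models that no other model beats on both cost and resolution rate.
// Sort by cost ascending (ties: higher rate first), then keep each point whose
// rate strictly exceeds the best rate seen among cheaper points.
function paretoFrontier(points: ModelPoint[]): ModelPoint[] {
  const sorted = [...points].sort(
    (a, b) => a.avgCostUsd - b.avgCostUsd || b.resolutionRate - a.resolutionRate,
  );
  const frontier: ModelPoint[] = [];
  let bestRate = -Infinity;
  for (const p of sorted) {
    if (p.resolutionRate > bestRate) {
      frontier.push(p);
      bestRate = p.resolutionRate;
    }
  }
  return frontier;
}
```

The function returns the frontier ordered by cost, which is the order needed to draw the connecting line. Exact duplicates collapse to the first occurrence.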

### Leaderboard page layout

```
┌─────────────────────────────────────────────────────┐
│  AgentV Leaderboard — SWE-bench Lite                │
│  "The multi-dimensional agent benchmark"            │
├─────────────────────────────────────────────────────┤
│  [Filter: All models ▾] [Filter: All providers ▾]  │
│                                                     │
│  Model          │ Resolved │ Avg $ │ $/Fix │ Tools │
│  ─────────────  │ ──────── │ ───── │ ───── │ ───── │
│  Claude Opus    │  72.8%   │ $0.55 │ $0.76 │  8.2  │
│  Gemini Flash   │  71.2%   │ $0.36 │ $0.51 │  6.4  │
│  GPT-4o         │  68.5%   │ $0.45 │ $0.66 │  9.1  │
│  ...            │          │       │       │       │
├─────────────────────────────────────────────────────┤
│  [Pareto Frontier Chart]                            │
│  Y: % Resolved  X: Avg Cost                        │
│  • Each dot = one model                            │
│  • Frontier line connecting efficient models        │
├─────────────────────────────────────────────────────┤
│  Run it yourself:                                   │
│  $ git clone https://github.com/EntityProcess/agentv│
│  $ cd agentv/benchmarks/swe-bench-lite              │
│  $ bun run setup.ts                                 │
│  $ agentv eval ./evals/ --target claude             │
│                                                     │
│  [Submit your results →]                            │
└─────────────────────────────────────────────────────┘
```

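**$/Fix** = total cost across all instances divided by the number of resolved instances, so spend on failed attempts counts against the model. For example, an average cost of $0.55 at a 72.8% resolution rate gives $0.55 / 0.728 ≈ $0.76. This is the differentiating column that swebench.com doesn't have.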
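
A minimal sketch of the $/Fix calculation, assuming it is computed from per-instance records and that failed attempts still count toward total spend (this is what the mock figures above imply). `costPerFix` is a hypothetical helper name.

```typescript
interface InstanceCost {
  resolved: boolean;
  cost_usd: number;
}

// Total spend across every attempt divided by the number of resolved instances.
// Returns null when nothing was resolved, since $/Fix is undefined in that case.
function costPerFix(instances: InstanceCost[]): number | null {
  const totalCost = instances.reduce((sum, i) => sum + i.cost_usd, 0);
  const resolvedCount = instances.filter((i) => i.resolved).length;
  return resolvedCount === 0 ? null : totalCost / resolvedCount;
}
```

Returning `null` rather than `Infinity` lets the table render a placeholder for models that resolved nothing.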

### Result JSON schema

Each submitted result file (`results/<model-name>.json`) has this shape (`per_instance` is abbreviated to one entry here):
```json
{
  "model": "Claude Opus 4.6",
  "provider": "anthropic",
  "model_type": "proprietary",
  "date": "2026-04-08",
  "agent": "mini-swe-agent-agentv",
  "agent_version": "1.0.0",
  "dataset": "swe-bench-lite",
  "total_instances": 300,
  "resolved_instances": 218,
  "resolution_rate": 0.727,
  "avg_cost_usd": 0.55,
  "avg_cost_per_fix_usd": 0.76,
  "avg_duration_ms": 45000,
  "avg_tool_calls": 8.2,
  "per_instance": [
    {
      "instance_id": "django__django-15180",
      "resolved": true,
      "cost_usd": 0.42,
      "duration_ms": 32000,
      "tool_calls": 6
    }
  ]
}
```
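
The CI format check could start as a small hand-written validator like the sketch below. Which fields are required is an assumption drawn from the example above, and `validateResult` is a hypothetical name. The check that `resolution_rate` matches `resolved_instances / total_instances` uses a tolerance of 0.001 to allow for three-decimal rounding.

```typescript
type JsonObject = Record<string, unknown>;

function isObject(v: unknown): v is JsonObject {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

// Returns a list of human-readable problems; an empty list means the file passed.
function validateResult(data: unknown): string[] {
  if (!isObject(data)) return ["result must be a JSON object"];
  const errors: string[] = [];

  // Assumed required fields, taken from the example result file.
  for (const key of ["model", "provider", "date", "agent", "dataset"]) {
    if (typeof data[key] !== "string") errors.push(`${key} must be a string`);
  }
  for (const key of ["total_instances", "resolved_instances", "resolution_rate", "avg_cost_usd"]) {
    const v = data[key];
    if (typeof v !== "number" || !Number.isFinite(v) || v < 0) {
      errors.push(`${key} must be a non-negative number`);
    }
  }
  if (errors.length > 0) return errors;

  const total = data["total_instances"] as number;
  const resolved = data["resolved_instances"] as number;
  const rate = data["resolution_rate"] as number;
  if (resolved > total) errors.push("resolved_instances exceeds total_instances");
  if (total > 0 && Math.abs(rate - resolved / total) > 0.001) {
    errors.push("resolution_rate does not match resolved_instances / total_instances");
  }

  const rows = data["per_instance"];
  if (!Array.isArray(rows)) {
    errors.push("per_instance must be an array");
  } else {
    rows.forEach((row: unknown, i: number) => {
      if (!isObject(row) || typeof row["instance_id"] !== "string" || typeof row["resolved"] !== "boolean") {
        errors.push(`per_instance[${i}] needs a string instance_id and a boolean resolved`);
      }
    });
  }
  return errors;
}
```

Cross-checking the summary fields against each other catches hand-edited result files that a pure type check would accept.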

## Acceptance signals

- [ ] `setup.ts` downloads SWE-bench Lite from HuggingFace and generates EVAL.yaml files
- [ ] SWE-bench Lite (300 instances) running end-to-end through AgentV
- [ ] At least 5 models evaluated with results in `benchmarks/swe-bench-lite/results/`
- [ ] Leaderboard page live on agentv.dev with sortable multi-dimensional columns
- [ ] Pareto frontier chart (score vs cost) rendered
- [ ] Cost-normalized ranking ($/Fix) visible
- [ ] Submission workflow documented in `benchmarks/swe-bench-lite/README.md`
- [ ] CI validates result JSON schema on PR submission

## Non-goals

- Curating a new benchmark dataset from scratch (reuse existing)
- Building a full platform (auth, teams, billing)
- Building a dynamic API backend (static JSON for MVP)
- Competing with observability platforms (Langfuse, Braintrust)
- Hosting compute for evaluation — users run locally, submit results
- Replacing swebench.com — we complement it with richer analysis
- `agentv bench submit` CLI command (defer until submission volume warrants it)

## Dependencies

- #965 (Docker workspace environments — required to run SWE-bench instances)
- #563 (Studio hardening — local studio should mirror leaderboard UX patterns)

## Related

- [SWE-bench competitive analysis](https://github.com/agentevals/agentevals-research/pull/55)
- [Product strategy](https://github.com/agentevals/agentevals-research/blob/main/research/proposals/product-strategy.md)
- [SWE-bench website](https://www.swebench.com/)

