Objective
Establish AgentV as the industry standard for agent evaluation by running existing public benchmarks with richer metrics than the originals, and hosting a public leaderboard on agentv.dev as a landing page feature.
Context
From SWE-bench competitive analysis: SWE-bench's dominance comes from being the shared yardstick everyone measures against. Its technical harness is sophisticated but not its moat — the moat is the dataset + community adoption + public leaderboard. Without a shared benchmark, AgentV is a tool; with one, it becomes the standard.
AgentV already has the strongest evaluation framework (15+ evaluator types, multi-provider matrix, CI integration, trend analysis, OTel export). What's missing is a canonical dataset and public surface for the community to rally around.
Strategy: Reuse existing benchmarks, show richer results
Do NOT curate a new dataset. SWE-bench is already the industry standard — no need to convince people it's valid. Instead, run SWE-bench through AgentV and surface dimensions the original leaderboard doesn't have.
The hook: Visitors come to agentv.dev/leaderboard, see familiar SWE-bench results but with columns swebench.com doesn't show. They think "oh, this is what SWE-bench would look like if it measured more than just pass/fail." That's the conversion point — they come for the leaderboard, stay for the framework.
What makes it differentiated vs mirroring swebench.com
| Dimension | swebench.com | agentv.dev/leaderboard |
| --- | --- | --- |
| Primary metric | % Resolved (single number) | Multi-criteria score (quality + cost + latency) |
| Cost | "Avg. $" column | Cost-normalized ranking ("best score per dollar"), Pareto frontier chart |
| Providers | One model at a time | Per-provider breakdown (Claude vs Codex vs Copilot on same tasks) |
| Trend | None | Score trend over time (are models improving across releases?) |
| Tool usage | Trajectory link (external) | Tool call efficiency (calls per resolution, exploration ratio) |
| Comparison | Checkbox + compare | Visual delta view with win/loss/tie per task |
| Filtering | Agent type, model type | Filter by any dimension (cost range, language, repo, difficulty) |
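One way the cost-normalized ranking and the Pareto frontier chart could be derived from leaderboard rows is sketched below. The `LeaderboardRow` shape and the function names are hypothetical, and "score per dollar" is assumed here to mean resolution rate divided by average cost per instance; the issue does not pin down the exact formula.

```typescript
// Hypothetical row shape; the two metrics mirror resolution_rate and
// avg_cost_usd from the result JSON described in this issue.
interface LeaderboardRow {
  model: string;
  resolutionRate: number; // fraction of instances resolved, 0..1
  avgCostUsd: number; // average cost per instance in USD
}

// a dominates b when a is no worse on both axes and strictly better on one.
function dominates(a: LeaderboardRow, b: LeaderboardRow): boolean {
  const noWorse =
    a.resolutionRate >= b.resolutionRate && a.avgCostUsd <= b.avgCostUsd;
  const strictlyBetter =
    a.resolutionRate > b.resolutionRate || a.avgCostUsd < b.avgCostUsd;
  return noWorse && strictlyBetter;
}

// Rows that no other row dominates, ordered from cheapest to most expensive.
function paretoFrontier(rows: LeaderboardRow[]): LeaderboardRow[] {
  return rows
    .filter((row) => !rows.some((other) => dominates(other, row)))
    .sort((a, b) => a.avgCostUsd - b.avgCostUsd);
}

// Assumed definition of "best score per dollar".
function scorePerDollar(row: LeaderboardRow): number {
  return row.avgCostUsd > 0 ? row.resolutionRate / row.avgCostUsd : 0;
}
```

The quadratic dominance check is fine at leaderboard scale (tens of rows); a sort-and-sweep would only matter for much larger inputs.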
Design decisions (settled)
MVP uses static site generation. Run benchmarks periodically, generate static JSON, agentv.dev renders from JSON. agentv.dev is already an Astro/Starlight docs site — add a leaderboard page component that reads from a JSON data file.
Benchmark recipe lives in the agentv repo under benchmarks/swe-bench-lite/. This includes a setup.ts script, the reusable swe-bench-grader.ts template, and a README. The repo does NOT commit the SWE-bench dataset — the setup script downloads it from HuggingFace on first run and generates EVAL.yaml files locally.
Why a setup script, not committed EVAL.yaml files:
The 300 problem statements are long GitHub issue texts (1-3MB total) — too large to commit
Each instance needs a Docker image reference + FAIL_TO_PASS lists from the HuggingFace dataset
Re-hosting the dataset in our repo creates a stale copy problem
A setup script keeps the source of truth on HuggingFace where it belongs
Data source: HuggingFace dataset SWE-bench/SWE-bench_Lite (test split, 300 instances). Docker images from DockerHub swebench/sweb.eval.x86_64.*.
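A minimal sketch of the per-row transformation that setup.ts might perform once the rows are downloaded. Several things here are assumptions to verify against the dataset and the SWE-bench harness: that FAIL_TO_PASS and PASS_TO_PASS arrive as JSON-encoded strings, and that image names lowercase the instance ID and replace `__` with `_1776_`. The `EvalInstance` type and the function names are hypothetical, and the EVAL.yaml output format is deliberately left out because the issue does not specify it.

```typescript
// Subset of SWE-bench row columns used by this sketch.
interface SweBenchRow {
  instance_id: string;
  problem_statement: string;
  FAIL_TO_PASS: string; // assumed: JSON-encoded array of test names
  PASS_TO_PASS: string; // assumed: JSON-encoded array of test names
}

// Hypothetical intermediate shape that setup.ts could turn into an EVAL.yaml file.
interface EvalInstance {
  id: string;
  image: string;
  problemStatement: string;
  failToPass: string[];
  passToPass: string[];
}

// Assumed naming convention for images under swebench/sweb.eval.x86_64.*.
function dockerImageFor(instanceId: string): string {
  const key = instanceId.toLowerCase().replace(/__/g, "_1776_");
  return `swebench/sweb.eval.x86_64.${key}:latest`;
}

// Parse a JSON-encoded list of test names, rejecting anything else.
function parseTestList(raw: string): string[] {
  const parsed: unknown = JSON.parse(raw);
  if (!Array.isArray(parsed) || !parsed.every((t) => typeof t === "string")) {
    throw new Error("expected a JSON array of test names");
  }
  return parsed as string[];
}

function toEvalInstance(row: SweBenchRow): EvalInstance {
  return {
    id: row.instance_id,
    image: dockerImageFor(row.instance_id),
    problemStatement: row.problem_statement,
    failToPass: parseTestList(row.FAIL_TO_PASS),
    passToPass: parseTestList(row.PASS_TO_PASS),
  };
}
```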
Future enhancement (P2): Native remote dataset references in AgentV — agentv eval hf://SWE-bench/SWE-bench_Lite — would eliminate the setup step entirely. Track separately.
Community submissions via GitHub PR. Submitters run the benchmark locally, then open a PR to the agentv repo adding their result JSON to benchmarks/swe-bench-lite/results/<model-name>.json. A CI check validates the format. No API, no backend.
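The CI format check could be as small as the sketch below. The field names come from the result JSON schema in this issue; the specific rules (non-negative numbers, a YYYY-MM-DD date, a rate that matches resolved/total within rounding) are assumptions about what the check should enforce, and `validateResult` is a hypothetical name.

```typescript
type ValidationResult = { ok: true } | { ok: false; errors: string[] };

const STRING_FIELDS = ["model", "provider", "date", "agent", "dataset"];
const NUMBER_FIELDS = [
  "total_instances",
  "resolved_instances",
  "resolution_rate",
  "avg_cost_usd",
  "avg_cost_per_fix_usd",
];

// Validate one parsed results/<model-name>.json payload.
function validateResult(data: unknown): ValidationResult {
  if (typeof data !== "object" || data === null || Array.isArray(data)) {
    return { ok: false, errors: ["result must be a JSON object"] };
  }
  const r = data as Record<string, unknown>;
  const errors: string[] = [];

  for (const key of STRING_FIELDS) {
    if (typeof r[key] !== "string" || r[key] === "") {
      errors.push(`${key} must be a non-empty string`);
    }
  }
  for (const key of NUMBER_FIELDS) {
    const value = r[key];
    if (typeof value !== "number" || !Number.isFinite(value) || value < 0) {
      errors.push(`${key} must be a non-negative number`);
    }
  }

  const date = r["date"];
  if (typeof date === "string" && !/^\d{4}-\d{2}-\d{2}$/.test(date)) {
    errors.push("date must use YYYY-MM-DD");
  }

  // Cross-field consistency: the headline rate must match the raw counts.
  const total = r["total_instances"];
  const resolved = r["resolved_instances"];
  const rate = r["resolution_rate"];
  if (typeof total === "number" && typeof resolved === "number" && typeof rate === "number") {
    if (resolved > total) {
      errors.push("resolved_instances cannot exceed total_instances");
    } else if (total > 0 && Math.abs(rate - resolved / total) > 0.001) {
      errors.push("resolution_rate does not match resolved_instances / total_instances");
    }
  }

  if (!Array.isArray(r["per_instance"])) {
    errors.push("per_instance must be an array");
  }
  return errors.length === 0 ? { ok: true } : { ok: false, errors };
}
```

Collecting every error instead of failing on the first lets a submitter fix the whole file in one pass.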
agentv bench submit is deferred. For MVP, submissions are manual GitHub PRs. A CLI command can be added later if submission volume warrants it.
Start with SWE-bench Lite (300 instances). It is cheaper to run (~$50-200 per full evaluation, depending on model) and still credible. Expand to SWE-bench Verified (500 instances) once the pipeline is proven.
Execution path
benchmarks/swe-bench-lite/ in agentv repo:
README.md: how to run, how to submit
setup.ts: downloads dataset from HuggingFace, generates EVAL.yaml files into evals/ (gitignored)
graders/swe-bench-grader.ts: reusable grader using @agentv/eval
results/: JSON result files per model (committed to repo)
.gitignore: ignores evals/ and .cache/ (generated from setup)
apps/web/:
benchmarks/swe-bench-lite/results/*.json
Leaderboard page layout
$/Fix = avg cost per resolved instance (cost-normalized metric). This is the differentiated column swebench.com doesn't have.
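A minimal sketch of how the summary fields, including $/Fix, could be computed from per-instance records. It assumes $/Fix means total spend across all attempted instances divided by the number of resolved ones, so failed attempts still count toward cost; that reading matches the example figures in the result JSON (0.55 × 300 / 218 ≈ 0.76) but is not stated outright. The `summarize` name and `Summary` type are hypothetical.

```typescript
// Per-instance record, using the field names from the result JSON.
interface InstanceResult {
  instance_id: string;
  resolved: boolean;
  cost_usd: number;
  duration_ms: number;
  tool_calls: number;
}

interface Summary {
  total_instances: number;
  resolved_instances: number;
  resolution_rate: number;
  avg_cost_usd: number;
  avg_cost_per_fix_usd: number | null; // null when nothing was resolved
}

function summarize(instances: InstanceResult[]): Summary {
  const total = instances.length;
  const resolved = instances.filter((i) => i.resolved).length;
  const totalCost = instances.reduce((sum, i) => sum + i.cost_usd, 0);
  return {
    total_instances: total,
    resolved_instances: resolved,
    resolution_rate: total > 0 ? resolved / total : 0,
    avg_cost_usd: total > 0 ? totalCost / total : 0,
    // Assumed definition: all spend, including failed attempts, per fix.
    avg_cost_per_fix_usd: resolved > 0 ? totalCost / resolved : null,
  };
}
```

Returning null instead of Infinity for a run with zero fixes keeps the value JSON-serializable and lets the page render a placeholder.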
Result JSON schema
Each submitted result file (results/<model-name>.json):
{
  "model": "Claude Opus 4.6",
  "provider": "anthropic",
  "model_type": "proprietary",
  "date": "2026-04-08",
  "agent": "mini-swe-agent-agentv",
  "agent_version": "1.0.0",
  "dataset": "swe-bench-lite",
  "total_instances": 300,
  "resolved_instances": 218,
  "resolution_rate": 0.727,
  "avg_cost_usd": 0.55,
  "avg_cost_per_fix_usd": 0.76,
  "avg_duration_ms": 45000,
  "avg_tool_calls": 8.2,
  "per_instance": [
    {
      "instance_id": "django__django-15180",
      "resolved": true,
      "cost_usd": 0.42,
      "duration_ms": 32000,
      "tool_calls": 6
    }
  ]
}
Acceptance signals
setup.ts downloads SWE-bench Lite from HuggingFace and generates EVAL.yaml files
benchmarks/swe-bench-lite/results/
benchmarks/swe-bench-lite/README.md
Non-goals
agentv bench submit CLI command (defer until submission volume warrants it)
Dependencies
Related