Site tier-1 follow-up: per-model deep-dive page #23

Draft

MaxGhenis wants to merge 1 commit into main from model-deepdive-page


Conversation

@MaxGhenis
Contributor

Summary

Follow-up to #8 and #9. Adds a statically generated per-model deep-dive page at /model/[id] — one page per model present in data.json.

Rendered sections

1. Headline strip (inside SiteHeader expandedContent, alwaysExpanded)

  • Provider mark (ProviderMark) + model name + provider label
  • Score pills: Global, US, UK (from globalStat.countryScores), Parse rate (nParsed / n)

2. Hardest outputs — top 5 lowest-scoring output groups for this model

  • Aggregated at (country, outputGroup) level using buildAllRows + scorePrediction from lib/sensitivity.ts / lib/scoring.ts
  • The aggregation mirrors the 3-level mean in scoresPerCountryModel: per-row scores → output-group mean → displayed score
  • Shows variable label (getVariableLabel), country tag, and a Badge (same color thresholds as ModelLeaderboard)

3. Sample wrong predictions — up to 10 distinct (country, scenario, variable) cells where relErr > 10% and score < 0.75 (selection logic sketched after this list)

  • Sorted by largest relative error first
  • Each card: country tag, variable label, score badge, Prediction / Ground truth / Error columns (currency-formatted for amount outputs, integer for binary)
  • Collapsible <details> block with the model's explanation text
  • Link to /#scenarios for the scenario explorer, plus the scenario ID

4. Back to leaderboard link at page bottom
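
A minimal sketch of the selection described in item 3, assuming a ScoreRow-like record with country, scenario, variable, score, and relErr fields (the field names are illustrative; the real shape comes from buildAllRows in lib/sensitivity.ts):

```ts
// Illustrative sketch only: the field names below are assumptions, not the
// actual ScoreRow type exported by lib/sensitivity.ts.
interface ScoreRow {
  country: string;
  scenarioId: string;
  variable: string;
  prediction: number;
  groundTruth: number;
  score: number;   // scorePrediction result in 0..1
  relErr: number;  // relative error, e.g. |prediction - groundTruth| / |groundTruth|
}

// Up to 10 distinct (country, scenario, variable) cells with
// relErr > 10% and score < 0.75, largest relative error first.
function sampleWrongPredictions(rows: ScoreRow[], limit = 10): ScoreRow[] {
  const seen = new Set<string>();
  return rows
    .filter((r) => r.relErr > 0.1 && r.score < 0.75)
    .sort((a, b) => b.relErr - a.relErr)
    .filter((r) => {
      const key = `${r.country}|${r.scenarioId}|${r.variable}`;
      if (seen.has(key)) return false;
      seen.add(key);
      return true;
    })
    .slice(0, limit);
}
```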

Static routes generation

generateStaticParams collects all model IDs from dashboard.global.modelStats and the union of country-level modelStats, returning one { id } entry per model (a sketch follows the route list below). The current data produces 12 static routes:

/model/gpt-5.5
/model/claude-sonnet-4.6
/model/claude-opus-4.7
/model/grok-4.20
/model/gemini-3.1-pro-preview
/model/gemini-3-flash-preview
/model/grok-4.3
/model/gemini-3.1-flash-lite-preview
/model/gpt-5.4-mini
/model/grok-4.1-fast
/model/claude-haiku-4.5
/model/gpt-5.4-nano
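
A minimal sketch of that collection step, under the assumption that the dashboard JSON exposes global and per-country modelStats lists with a model field (the import path and type names are illustrative, not the repo's actual ones):

```ts
// Illustrative sketch: the data import path and Dashboard shape are
// assumptions based on the description above, not the repo's real types.
import dashboard from "@/data.json";

interface ModelStat { model: string }
interface DashboardData {
  global: { modelStats: ModelStat[] };
  countries: Record<string, { modelStats: ModelStat[] }>;
}

// One { id } param per model ID seen globally or in any country.
export function generateStaticParams(): { id: string }[] {
  const data = dashboard as unknown as DashboardData;
  const ids = new Set(data.global.modelStats.map((s) => s.model));
  for (const country of Object.values(data.countries)) {
    for (const s of country.modelStats) ids.add(s.model);
  }
  return [...ids].map((id) => ({ id }));
}
```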

Library reuse

| Library | Used for |
| --- | --- |
| `lib/scoring.ts` (`scorePrediction`, `metricTypeForVariable`) | Per-row score computation, metric type for display formatting |
| `lib/sensitivity.ts` (`buildAllRows`) | Builds the full `ScoreRow[]` for all countries, filtered to the model |
| `lib/bootstrap.ts` | Not needed for static server render; omitted |

Scoring math

For each (country, outputGroup) pair, the displayed score is the mean of per-row scores (each scorePrediction result × 100) across all scenarios and person-expanded variables that map to that output group. This is equivalent to the inner two levels of the 3-level mean in scoresPerCountryModel.
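
A minimal sketch of that two-level aggregation, reusing the illustrative row shape from the sketch above (country, outputGroup, and a 0-1 score per row):

```ts
// Mean of per-row scores (x 100) for each (country, outputGroup) pair,
// i.e. the inner two levels of the 3-level mean in scoresPerCountryModel.
// The row shape is an assumption, as in the earlier sketch.
type Row = { country: string; outputGroup: string; score: number };

function scoreByCountryOutputGroup(rows: Row[]): Map<string, number> {
  const acc = new Map<string, { total: number; n: number }>();
  for (const r of rows) {
    const key = `${r.country}|${r.outputGroup}`;
    const cur = acc.get(key) ?? { total: 0, n: 0 };
    cur.total += r.score * 100;
    cur.n += 1;
    acc.set(key, cur);
  }
  return new Map([...acc].map(([key, { total, n }]) => [key, total / n]));
}
```

The hardest-outputs section then sorts these (country, outputGroup) means ascending and keeps the five lowest.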

Smoke test

bun run lint   # clean (0 errors, 0 warnings)
bun run build  # clean — /model/[id] SSG route with 12 paths in build output

Build output excerpt:

● /model/[id]
│ ├ /model/gpt-5.5
│ ├ /model/claude-sonnet-4.6
│ ├ /model/claude-opus-4.7
│ └ [+9 more paths]

Test plan

  • CI passes
  • Visit /model/gpt-5.5 — headline shows Global / US / UK scores, parse rate
  • Verify "Top 5 lowest-scoring outputs" renders 5 rows with country tags and score badges
  • Verify "Sample errors" section renders cards with prediction / ground truth / error
  • Expand a model explanation <details> block
  • Click "View in scenario explorer →" — lands on /#scenarios
  • Click "← Back to leaderboard" — returns to /
  • Visit /model/nonexistent-model — returns 404
  • Mobile width — score pills wrap cleanly under provider mark

🤖 Generated with Claude Code

Scheduled follow-up agent — opened after confirming both #8 and #9 are merged.


Commit message (generated by Claude Code):

Statically generates a dedicated page for each of the 12 models in
data.json, using generateStaticParams so the entire site stays a
pure static export.

Each page renders:
- Headline strip: provider mark, model name, global/US/UK scores,
  parse-rate pill — all sourced from globalStat.countryScores.
- Hardest outputs: top-5 lowest-scoring output groups (country × outputGroup)
  computed by reusing buildAllRows/scorePrediction from lib/sensitivity.ts
  and lib/scoring.ts, aggregated the same way as the headline scorer.
- Sample wrong predictions: up to 10 (scenario, variable) cells where
  relErr > 10% and score < 0.75, sorted by largest relative error,
  with prediction / ground-truth / error columns plus a collapsible
  model explanation and a link back to /#scenarios.
- Back to leaderboard link.

Reuses SiteHeader (alwaysExpanded + actionLink back to /),
the Badge color scheme from ModelLeaderboard, and Tailwind v4
design-token classes throughout.
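
A minimal sketch of that header wiring, assuming SiteHeader exposes alwaysExpanded, expandedContent, and actionLink props as referenced above (prop shapes, import paths, and the params handling are illustrative, not confirmed against the actual components):

```tsx
// Illustrative sketch only: component import paths and prop shapes are
// assumptions based on the names referenced in this PR.
import SiteHeader from "@/components/SiteHeader";
import ProviderMark from "@/components/ProviderMark";

export default function ModelPage({ params }: { params: { id: string } }) {
  return (
    <SiteHeader
      alwaysExpanded
      actionLink={{ href: "/", label: "← Back to leaderboard" }}
      expandedContent={
        <div className="flex flex-wrap items-center gap-2">
          <ProviderMark model={params.id} />
          <span className="font-medium">{params.id}</span>
          {/* Global / US / UK score pills and the parse-rate pill render here */}
        </div>
      }
    />
  );
}
```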

Build smoke-test: `bun run build` produces the /model/[id] SSG route
with all 12 model paths; `bun run lint` is clean.

https://claude.ai/code/session_01DS3KJmEye7o7ff18RdthTC

vercel Bot commented May 16, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| policybench-site | Ready | Preview, Comment | May 16, 2026 1:11pm |
