Site tier-1 follow-up: per-model deep-dive page #23

Draft

MaxGhenis wants to merge 1 commit into main from model-deepdive-page


Conversation

@MaxGhenis
Contributor

Summary

Follow-up to #8 and #9. Adds a statically generated per-model deep-dive page at /model/[id] — one page per model present in data.json.

Rendered sections

1. Headline strip (inside SiteHeader expandedContent, alwaysExpanded)

  • Provider mark (ProviderMark) + model name + provider label
  • Score pills: Global, US, UK (from globalStat.countryScores), Parse rate (nParsed / n)

2. Hardest outputs — top 5 lowest-scoring output groups for this model

  • Aggregated at (country, outputGroup) level using buildAllRows + scorePrediction from lib/sensitivity.ts / lib/scoring.ts
  • The aggregation mirrors the 3-level mean in scoresPerCountryModel: per-row scores → output-group mean → displayed score
  • Shows variable label (getVariableLabel), country tag, and a Badge (same color thresholds as ModelLeaderboard)

3. Sample wrong predictions — up to 10 distinct (country, scenario, variable) cells where relErr > 10% and score < 0.75 (selection logic sketched after this list)

  • Sorted by largest relative error first
  • Each card: country tag, variable label, score badge, Prediction / Ground truth / Error columns (currency-formatted for amount outputs, integer for binary)
  • Collapsible <details> block with the model's explanation text
  • Link to /#scenarios for the scenario explorer, plus the scenario ID

4. Back to leaderboard link at page bottom
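
A minimal sketch of the selection described in item 3, assuming a ScoreRow-like record with country, scenario, variable, score, and relErr fields (the field names are illustrative; the real shape comes from buildAllRows in lib/sensitivity.ts):

```ts
// Illustrative sketch only: the field names below are assumptions, not the
// actual ScoreRow type exported by lib/sensitivity.ts.
interface ScoreRow {
  country: string;
  scenarioId: string;
  variable: string;
  prediction: number;
  groundTruth: number;
  score: number;   // scorePrediction result in 0..1
  relErr: number;  // relative error, e.g. |prediction - groundTruth| / |groundTruth|
}

// Up to 10 distinct (country, scenario, variable) cells with
// relErr > 10% and score < 0.75, largest relative error first.
function sampleWrongPredictions(rows: ScoreRow[], limit = 10): ScoreRow[] {
  const seen = new Set<string>();
  return rows
    .filter((r) => r.relErr > 0.1 && r.score < 0.75)
    .sort((a, b) => b.relErr - a.relErr)
    .filter((r) => {
      const key = `${r.country}|${r.scenarioId}|${r.variable}`;
      if (seen.has(key)) return false;
      seen.add(key);
      return true;
    })
    .slice(0, limit);
}
```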

Static routes generation

generateStaticParams collects all model IDs from dashboard.global.modelStats and the union of country-level modelStats, returning one { id } entry per model (a sketch follows the route list below). The current data produces 12 static routes:

/model/gpt-5.5
/model/claude-sonnet-4.6
/model/claude-opus-4.7
/model/grok-4.20
/model/gemini-3.1-pro-preview
/model/gemini-3-flash-preview
/model/grok-4.3
/model/gemini-3.1-flash-lite-preview
/model/gpt-5.4-mini
/model/grok-4.1-fast
/model/claude-haiku-4.5
/model/gpt-5.4-nano
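
A minimal sketch of that collection step, under the assumption that the dashboard JSON exposes global and per-country modelStats lists with a model field (the import path and type names are illustrative, not the repo's actual ones):

```ts
// Illustrative sketch: the data import path and Dashboard shape are
// assumptions based on the description above, not the repo's real types.
import dashboard from "@/data.json";

interface ModelStat { model: string }
interface DashboardData {
  global: { modelStats: ModelStat[] };
  countries: Record<string, { modelStats: ModelStat[] }>;
}

// One { id } param per model ID seen globally or in any country.
export function generateStaticParams(): { id: string }[] {
  const data = dashboard as unknown as DashboardData;
  const ids = new Set(data.global.modelStats.map((s) => s.model));
  for (const country of Object.values(data.countries)) {
    for (const s of country.modelStats) ids.add(s.model);
  }
  return [...ids].map((id) => ({ id }));
}
```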

Library reuse

| Library | Used for |
| --- | --- |
| `lib/scoring.ts` (`scorePrediction`, `metricTypeForVariable`) | Per-row score computation, metric type for display formatting |
| `lib/sensitivity.ts` (`buildAllRows`) | Builds the full `ScoreRow[]` for all countries, filtered to the model |
| `lib/bootstrap.ts` | Not needed for static server render; omitted |

Scoring math

For each (country, outputGroup) pair, the displayed score is the mean of per-row scores (each scorePrediction result × 100) across all scenarios and person-expanded variables that map to that output group. This is equivalent to the inner two levels of the 3-level mean in scoresPerCountryModel.
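
A minimal sketch of that two-level aggregation, reusing the illustrative row shape from the sketch above (country, outputGroup, and a 0-1 score per row):

```ts
// Mean of per-row scores (x 100) for each (country, outputGroup) pair,
// i.e. the inner two levels of the 3-level mean in scoresPerCountryModel.
// The row shape is an assumption, as in the earlier sketch.
type Row = { country: string; outputGroup: string; score: number };

function scoreByCountryOutputGroup(rows: Row[]): Map<string, number> {
  const acc = new Map<string, { total: number; n: number }>();
  for (const r of rows) {
    const key = `${r.country}|${r.outputGroup}`;
    const cur = acc.get(key) ?? { total: 0, n: 0 };
    cur.total += r.score * 100;
    cur.n += 1;
    acc.set(key, cur);
  }
  return new Map([...acc].map(([key, { total, n }]) => [key, total / n]));
}
```

The hardest-outputs section then sorts these (country, outputGroup) means ascending and keeps the five lowest.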

Smoke test

bun run lint   # clean (0 errors, 0 warnings)
bun run build  # clean — /model/[id] SSG route with 12 paths in build output

Build output excerpt:

● /model/[id]
│ ├ /model/gpt-5.5
│ ├ /model/claude-sonnet-4.6
│ ├ /model/claude-opus-4.7
│ └ [+9 more paths]

Test plan

  • CI passes
  • Visit /model/gpt-5.5 — headline shows Global / US / UK scores, parse rate
  • Verify "Top 5 lowest-scoring outputs" renders 5 rows with country tags and score badges
  • Verify "Sample errors" section renders cards with prediction / ground truth / error
  • Expand a model explanation <details> block
  • Click "View in scenario explorer →" — lands on /#scenarios
  • Click "← Back to leaderboard" — returns to /
  • Visit /model/nonexistent-model — returns 404
  • Mobile width — score pills wrap cleanly under provider mark

🤖 Generated with Claude Code

Scheduled follow-up agent — opened after confirming both #8 and #9 are merged.


Commit message (generated by Claude Code):

Statically generates a dedicated page for each of the 12 models in
data.json, using generateStaticParams so the entire site stays a
pure static export.

Each page renders:
- Headline strip: provider mark, model name, global/US/UK scores,
  parse-rate pill — all sourced from globalStat.countryScores.
- Hardest outputs: top-5 lowest-scoring output groups (country × outputGroup)
  computed by reusing buildAllRows/scorePrediction from lib/sensitivity.ts
  and lib/scoring.ts, aggregated the same way as the headline scorer.
- Sample wrong predictions: up to 10 (scenario, variable) cells where
  relErr > 10% and score < 0.75, sorted by largest relative error,
  with prediction / ground-truth / error columns plus a collapsible
  model explanation and a link back to /#scenarios.
- Back to leaderboard link.

Reuses SiteHeader (alwaysExpanded + actionLink back to /),
the Badge color scheme from ModelLeaderboard, and Tailwind v4
design-token classes throughout.
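
A minimal sketch of that header wiring, assuming SiteHeader exposes alwaysExpanded, expandedContent, and actionLink props as referenced above (prop shapes, import paths, and the params handling are illustrative, not confirmed against the actual components):

```tsx
// Illustrative sketch only: component import paths and prop shapes are
// assumptions based on the names referenced in this PR.
import SiteHeader from "@/components/SiteHeader";
import ProviderMark from "@/components/ProviderMark";

export default function ModelPage({ params }: { params: { id: string } }) {
  return (
    <SiteHeader
      alwaysExpanded
      actionLink={{ href: "/", label: "← Back to leaderboard" }}
      expandedContent={
        <div className="flex flex-wrap items-center gap-2">
          <ProviderMark model={params.id} />
          <span className="font-medium">{params.id}</span>
          {/* Global / US / UK score pills and the parse-rate pill render here */}
        </div>
      }
    />
  );
}
```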

Build smoke-test: `bun run build` produces the /model/[id] SSG route
with all 12 model paths; `bun run lint` is clean.

https://claude.ai/code/session_01DS3KJmEye7o7ff18RdthTC

vercel Bot commented May 16, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| policybench-site | Ready | Preview, Comment | May 16, 2026 1:11pm |
