
Determinism and reproducibility hardening #6

@MaxGhenis

Problem

The current benchmark combines seeded random scenario generation with stochastic LLM generation (temperature=1.0, n_runs=10 per scenario) and unpinned dependencies. That gives us per-cell variance estimates, but it makes the benchmark itself non-reproducible: re-running the same suite a month later, with the same code and the same prompts, gives different numbers, and we can't tell whether a difference comes from the model drifting, the SDK changing, the scenario distribution shifting, or noise.

For a benchmark whose value is tracking model performance over time, the runs should be (mostly) reproducible: same inputs + same model snapshot → same outputs, up to provider-side API jitter we can't control.

Concrete sources of nondeterminism today

  1. Model aliases, not snapshots. config.py lists aliases such as "gpt-4o-mini" and "gemini-1.5-flash", which resolve to whatever snapshot each provider currently points the alias at. Dated snapshots like gpt-4o-mini-2024-07-18 and gemini-1.5-flash-002 are stable.
  2. Unpinned Python deps. pyproject.toml declares numpy, pandas, edsl, policyengine-us without version bounds. A policyengine-us change to e.g. SNAP rules will silently change the ground truth.
  3. temperature=1.0. Hardcoded in llm_estimator.py:93. With temperature=0 (plus OpenAI's seed parameter, and Gemini's equivalent where supported), variance from the LLM side approaches zero on supporting providers, and n_runs=10 becomes mostly redundant. OpenAI documents seed as best-effort, so small residual differences remain possible.
  4. No response cache. Every benchmark run re-pays API cost even for unchanged (model, scenario) pairs. A SHA-keyed cache (model_id + prompt + params → response) makes re-runs free and makes "what actually changed" diff'able.
  5. Scenarios live in code, not data. households.generate_scenarios is seeded with RANDOM_SEED=42, so the scenarios are stable as long as the generator function doesn't change. If it does change, the "same seed" produces a different population without anyone noticing.
  6. No provenance recorded. The output CSV stores model, scenario_index, ground_truth, etc., but not edsl version, policyengine-us version, model snapshot strings, scenario hash, or total API calls. Hard to attribute drift between runs.
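One way to make the silent drift in item 5 visible is to fingerprint the generated scenarios and compare against a committed baseline. The following is a minimal sketch under stated assumptions: `generate_scenarios` here is a hypothetical stand-in for `households.generate_scenarios` (the real fields and signature may differ), and `scenario_fingerprint` is an illustrative helper, not something in the repo.

```python
import copy
import hashlib
import json
import random


def generate_scenarios(n, seed):
    # Hypothetical stand-in for households.generate_scenarios; real fields differ.
    rng = random.Random(seed)
    return [
        {"household_size": rng.randint(1, 6), "income": rng.randrange(0, 100_000, 500)}
        for _ in range(n)
    ]


def scenario_fingerprint(scenarios):
    # Canonical JSON (sorted keys, fixed separators) so the hash depends only
    # on the scenario content, not on dict ordering or whitespace.
    canonical = json.dumps(scenarios, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


RANDOM_SEED = 42
scenarios = generate_scenarios(100, RANDOM_SEED)
baseline = scenario_fingerprint(scenarios)

# Same generator, same seed: the fingerprint is unchanged.
rerun = scenario_fingerprint(generate_scenarios(100, RANDOM_SEED))

# Simulate a generator change that alters one household under the same seed.
drifted_scenarios = copy.deepcopy(scenarios)
drifted_scenarios[0]["income"] += 1
drifted = scenario_fingerprint(drifted_scenarios)

print("rerun matches baseline:", rerun == baseline)
print("drift detected:", drifted != baseline)
```

If the baseline hash were committed alongside the code, a check at the start of a run could fail loudly whenever the generator output no longer matches it.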

Suggested changes (smallest → largest)

  • Pin model snapshots in config.MODELS and document the deprecation horizon (each provider deprecates snapshots on a rolling schedule).
  • Pin all deps in pyproject.toml with >= floors and < ceilings, or commit a uv.lock / requirements.txt.
  • Set temperature=0 and pass provider-specific seeds where available (seed=42 on OpenAI, etc.). Drop n_runs to 1 by default; keep n_runs > 1 as an opt-in for variance studies.
  • Materialize scenarios: regenerate once with the seed, write scenarios.json to the repo, load from disk in main.py. Optional: also record the commit hash of the generator so we know how the file was produced.
  • Add a response cache (e.g., a JSON file or SQLite store keyed by the SHA-256 of the canonical request). Bonus: makes the benchmark runnable offline against the cache.
  • Emit a provenance block in benchmark_output.csv (or a sidecar JSON): edsl version, policyengine-us version, snapshot IDs, scenario file hash, run timestamp, total API calls, total cost.
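The response cache suggested above could look roughly like this. It is a sketch, not a proposed final design: `request_key`, `ResponseCache`, and `fake_provider` are hypothetical names, the table layout is an assumption, and the real call would go through whatever client the benchmark uses instead of the stub.

```python
import hashlib
import json
import sqlite3


def request_key(model_id, prompt, params):
    # SHA-256 over a canonical JSON encoding of the request, so logically
    # identical requests map to the same key regardless of dict ordering.
    payload = {"model_id": model_id, "prompt": prompt, "params": params}
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ResponseCache:
    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS responses "
            "(key TEXT PRIMARY KEY, response TEXT NOT NULL)"
        )

    def get_or_call(self, model_id, prompt, params, call_fn):
        # Returns (response, cache_hit). call_fn is only invoked on a miss.
        key = request_key(model_id, prompt, params)
        row = self.conn.execute(
            "SELECT response FROM responses WHERE key = ?", (key,)
        ).fetchone()
        if row is not None:
            return row[0], True
        response = call_fn(model_id, prompt, params)
        self.conn.execute(
            "INSERT INTO responses (key, response) VALUES (?, ?)", (key, response)
        )
        self.conn.commit()
        return response, False


calls = []


def fake_provider(model_id, prompt, params):
    # Stub standing in for a real API call; records that it was invoked.
    calls.append(model_id)
    return f"stub answer for {prompt!r}"


cache = ResponseCache()
params = {"temperature": 0, "seed": 42}
prompt = "What is this household's SNAP benefit?"
first, first_hit = cache.get_or_call("gpt-4o-mini-2024-07-18", prompt, params, fake_provider)
second, second_hit = cache.get_or_call("gpt-4o-mini-2024-07-18", prompt, params, fake_provider)
print(first_hit, second_hit, len(calls))
```

Because the snapshot ID and sampling parameters are part of the key, changing either one naturally invalidates the cached entry. Passing a file path instead of `:memory:` would persist the cache between runs.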

Reference

talkie-evals does this for an LM evaluation suite: it pins all model HF revisions, dataset revisions, the lm-evaluation-harness task YAMLs, the Modal image's pip packages, and the sample seed. Every result JSON contains the full provenance block (talkie_evals_version, talkie_git_revision, model_revisions, dataset revisions, modal_pip_packages). The same pattern would port directly here.
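A provenance block for this benchmark, written as a sidecar JSON, might be assembled along these lines. This is a sketch under assumptions: the field names and the `build_provenance` and `package_version` helpers are hypothetical, and the scenario bytes are a placeholder for the contents of a materialized scenarios.json.

```python
import hashlib
import json
from datetime import datetime, timezone
from importlib import metadata


def package_version(name):
    # Installed version of a distribution, or None if it is not installed.
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def build_provenance(model_snapshots, scenario_bytes, total_api_calls,
                     packages=("edsl", "policyengine-us")):
    # Field names are illustrative assumptions, not an existing schema.
    return {
        "package_versions": {name: package_version(name) for name in packages},
        "model_snapshots": list(model_snapshots),
        "scenario_sha256": hashlib.sha256(scenario_bytes).hexdigest(),
        "run_timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "total_api_calls": total_api_calls,
    }


provenance = build_provenance(
    model_snapshots=["gpt-4o-mini-2024-07-18", "gemini-1.5-flash-002"],
    scenario_bytes=b'[{"household_size": 2}]',  # placeholder for the scenarios file contents
    total_api_calls=0,
)
sidecar_json = json.dumps(provenance, indent=2, sort_keys=True)
print(sidecar_json)
```

Writing `sidecar_json` next to benchmark_output.csv would let two runs be compared field by field before looking at the scores themselves.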

Out of scope

  • Replacing edsl. Worth a separate discussion (inspect_ai / lm-evaluation-harness / direct litellm wrapper) but orthogonal to determinism.
  • Switching from free-text $-amount answers to bucketed multiple-choice. That is a separate metric design discussion.
