
eval: score hard benchmark baselines and CodeLeWM claim gate (#422) #428

Merged
AbdelStark merged 1 commit into main from issue-422-hard-rerank-claim-gate on Jun 8, 2026

@AbdelStark

Owner

Summary

Adds an additive hard_mode to the downstream reranking evaluator so a built anti-saturation pack is scored against every RFC-0016 baseline and gated by the three-baseline (no-action / lexical / LLM-order) claim gate. The plain path is byte-for-byte unchanged: every hard-mode behavior is gated behind the flag, and DOWNSTREAM_REQUIRED_BASELINES is untouched.

Linked Issue

Closes #422.

Spec / RFC Reference

  • Spec section: docs/spec/11-llm-world-model-harness.md
  • RFC: docs/rfcs/RFC-0016-hard-downstream-reranking-benchmark.md

Public Surface Impact

  • run_downstream_rerank_evaluation gains an optional hard_mode: bool = False parameter.
  • New exported constants: HARD_DOWNSTREAM_REQUIRED_BASELINES, HARD_DOWNSTREAM_EXTRA_BASELINES (shuffled_action, static_heuristic, p_pass).
  • New CLI flag: codelewm eval downstream-rerank --hard-mode.
  • _build_report gains optional required_baselines / extra_report_fields (internal; defaults preserve the plain report).
  • No schema-version bump: the plain codelewm.downstream_rerank_report.v1 is unchanged; hard-mode adds keys (hard_mode, profile, anti_saturation_report, lift_confidence_intervals) and extra baseline rows only when --hard-mode is set. The embedded reports reuse the codelewm.downstream_anti_saturation_report.v1 / ...claim_gate.v1 schemas from #419 (data: add anti-saturation benchmark schema and readiness diagnostics).
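
The "additive" contract above can be illustrated with a minimal sketch. It is not the repository's `_build_report`: the function name `build_report`, the placeholder `PLAIN_BASELINES` tuple, and the report layout are assumptions made for illustration. It only shows how optional parameters with `None` defaults leave the plain report untouched while hard mode adds keys and rows.

```python
import json
from typing import Any, Dict, Mapping, Optional, Sequence

# Placeholder baseline names (assumption); the real plain list is not shown in this PR.
PLAIN_BASELINES = ("no_action", "lexical", "llm_order")


def build_report(
    metrics: Mapping[str, Any],
    required_baselines: Optional[Sequence[str]] = None,
    extra_report_fields: Optional[Mapping[str, Any]] = None,
) -> Dict[str, Any]:
    """Hypothetical report builder: defaults reproduce the plain report."""
    baselines = PLAIN_BASELINES if required_baselines is None else tuple(required_baselines)
    report: Dict[str, Any] = {
        "schema_version": "codelewm.downstream_rerank_report.v1",
        # A baseline with no recorded metric still gets a typed row.
        "baselines": {name: metrics.get(name, "not_recorded") for name in baselines},
    }
    if extra_report_fields:
        # Hard-mode keys are only ever added, never substituted for plain keys.
        report.update(extra_report_fields)
    return report


metrics = {"no_action": 0.10, "lexical": 0.20, "llm_order": 0.30}
plain = build_report(metrics)
hard = build_report(
    metrics,
    required_baselines=PLAIN_BASELINES + ("p_pass",),
    extra_report_fields={"hard_mode": True},
)
print(json.dumps(plain, sort_keys=True))
print(sorted(set(hard) - set(plain)))
```

Keeping the new parameters optional is what lets the schema version stay at v1: a caller that never passes them cannot observe any difference.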

Validation

uv run pytest tests/eval/test_hard_downstream_rerank.py -q   # 8 passed
uv run pytest tests/eval/ tests/test_imports.py tests/harness/test_output_schemas.py tests/security/test_sandbox_import_boundary.py -q   # 227 passed
uv run pytest tests/docs/ -q   # 144 passed
uv run python -m compileall -q codelewm/eval/downstream_rerank.py codelewm/eval/__init__.py codelewm/harness/cli.py
git diff --check

Artifact Impact

In hard mode the rerank report (reports/downstream_rerank_report.json) additionally carries hard_mode, profile, the embedded anti_saturation_report, lift_confidence_intervals, and the shuffled_action / static_heuristic / p_pass metric rows; claim_gate uses the three-baseline anti-saturation gate. The eval manifest metadata records hard_mode and anti_saturation_eligible. Plain-mode artifacts are unchanged.

Deprecations

none

Caveats / Follow-ups

  • p_pass is always not_recorded here because ScoreResult serializes no standalone p_pass key; the typed row is RFC-compliant.
  • shuffled_action re-scores against the next task's before-state (rotation); on a single-task fixture the rotation is degenerate, because the only task is paired with its own before-state.
  • Lift CIs are skipped below 20 problems.
  • Follow-up: #423 (results: publish hard benchmark artifacts and claim audit) publishes the artifact set and claim audit.
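
The rotation caveat can be made concrete with a small sketch. The helper names `rotate_before_states` and `rotation_is_degenerate` are hypothetical and the states are plain strings; the evaluator's actual data types are not shown in this PR.

```python
from typing import List, Sequence


def rotate_before_states(before_states: Sequence[str]) -> List[str]:
    """Pair task i with the before-state of task (i + 1) % n."""
    n = len(before_states)
    return [before_states[(i + 1) % n] for i in range(n)]


def rotation_is_degenerate(before_states: Sequence[str]) -> bool:
    """With fewer than two tasks, every task keeps its own before-state."""
    return len(before_states) < 2


print(rotate_before_states(["task_a", "task_b", "task_c"]))
print(rotate_before_states(["only_task"]), rotation_is_degenerate(["only_task"]))
```

On a single-task fixture the rotated list equals the input, so the control cannot separate action-sensitive scoring from action-insensitive scoring.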

Add an additive `hard_mode` to the downstream reranking evaluator so a built
anti-saturation pack is scored against every RFC-0016 baseline and gated by
the three-baseline claim gate. The plain (non-hard) path is byte-for-byte
unchanged: every hard-mode behavior is gated behind the new flag.

`run_downstream_rerank_evaluation(..., hard_mode=True)` now additionally:
- computes the `shuffled_action` control by re-scoring each candidate's
  after-state against a rotated before-state (action-sensitivity probe;
  text-only, never executed), and a deterministic `static_heuristic` order;
- emits a typed `p_pass` row (`not_recorded`, since no standalone p_pass score
  key is serialized per row in this codebase);
- writes the `codelewm.downstream_anti_saturation_report.v1` diagnostic from
  the model-independent baselines, matching the #419 pack-time report;
- bootstraps CodeLeWM lift confidence intervals over no-action, lexical, and
  LLM-order (same LCG + percentile scheme as the existing CIs);
- decides the claim with `build_anti_saturation_claim_gate`: the gate opens
  only on an eligible >=100-problem slice where CodeLeWM is strictly above all
  three baselines on pass@1 and MRR with every lift CI excluding zero.
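
As a rough illustration of an LCG-driven percentile bootstrap for a lift CI, consider the sketch below. It is not the repository's implementation: the function name `bootstrap_lift_ci`, the LCG constants (the common Numerical Recipes parameters), the resample count, and the paired-difference formulation are all assumptions. Only the 20-problem minimum comes from the PR caveats.

```python
from typing import Iterator, Optional, Sequence, Tuple

MIN_PROBLEMS_FOR_CI = 20  # from the PR caveat: lift CIs are skipped below 20 problems


def lcg(seed: int) -> Iterator[int]:
    """Deterministic 32-bit linear congruential generator (assumed constants)."""
    state = seed % 2**32
    while True:
        state = (1664525 * state + 1013904223) % 2**32
        yield state


def bootstrap_lift_ci(
    model_scores: Sequence[float],
    baseline_scores: Sequence[float],
    resamples: int = 1000,
    seed: int = 0,
    alpha: float = 0.05,
) -> Optional[Tuple[float, float]]:
    """Percentile CI for the mean per-problem lift; None when the slice is too small."""
    if len(model_scores) != len(baseline_scores):
        raise ValueError("score lists must be paired per problem")
    n = len(model_scores)
    if n < MIN_PROBLEMS_FOR_CI:
        return None
    diffs = [m - b for m, b in zip(model_scores, baseline_scores)]
    rng = lcg(seed)
    means = []
    for _ in range(resamples):
        total = 0.0
        for _ in range(n):
            # Use the high bits: the low bits of a power-of-two LCG cycle quickly.
            total += diffs[(next(rng) >> 16) % n]
        means.append(total / n)
    means.sort()
    low = means[int((alpha / 2) * (resamples - 1))]
    high = means[int((1 - alpha / 2) * (resamples - 1))]
    return low, high


print(bootstrap_lift_ci([1.0] * 5, [0.0] * 5))    # too few problems
print(bootstrap_lift_ci([1.0] * 20, [0.0] * 20))  # constant lift of 1.0
```

A hand-rolled LCG keeps the resampling reproducible across platforms and Python versions, which matters when the interval feeds a pass/fail gate.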

`HARD_DOWNSTREAM_REQUIRED_BASELINES` (the standard seven plus shuffled-action,
static-heuristic, p_pass) is exported; `DOWNSTREAM_REQUIRED_BASELINES` is
untouched so the v1.0 fixture contract holds. `_build_report` gains optional
`required_baselines` / `extra_report_fields` params (defaults preserve the
plain report). The CLI `eval downstream-rerank` gains a `--hard-mode` flag,
threaded through the handler, command tuple, logs, and manifest metadata.

Tests cover the hard-mode report shape (all ten baselines, anti-saturation
report, typed p_pass, three-baseline gate, lift CIs), the unchanged plain
path, the CLI `--hard-mode` run, and the positive / saturated /
missing-baseline / invalid-candidate-only / mixed-slice claim-gate outcomes.
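
The gate rule stated in this message (eligible slice, at least 100 problems, strictly above all three baselines on pass@1 and MRR, every lift CI excluding zero) can be sketched as follows. This is not `build_anti_saturation_claim_gate`, whose signature is not shown here: the function name `sketch_claim_gate`, the dictionary layout, and the use of one CI per baseline are simplifying assumptions.

```python
from typing import Mapping, Optional, Tuple

GATE_BASELINES = ("no_action", "lexical", "llm_order")
GATE_METRICS = ("pass_at_1", "mrr")
MIN_GATE_PROBLEMS = 100

Metrics = Mapping[str, float]
Interval = Optional[Tuple[float, float]]


def sketch_claim_gate(
    eligible: bool,
    problem_count: int,
    model: Metrics,
    baselines: Mapping[str, Metrics],
    lift_cis: Mapping[str, Interval],
) -> bool:
    """Return True only when every condition of the three-baseline gate holds."""
    if not eligible or problem_count < MIN_GATE_PROBLEMS:
        return False
    for name in GATE_BASELINES:
        row = baselines.get(name)
        ci = lift_cis.get(name)
        if row is None or ci is None:
            return False  # a missing baseline or missing CI keeps the gate closed
        if any(model[metric] <= row[metric] for metric in GATE_METRICS):
            return False  # ties count as saturation, not as a win
        low, high = ci
        if low <= 0.0 <= high:
            return False  # the interval contains zero
    return True


model = {"pass_at_1": 0.60, "mrr": 0.70}
baselines = {name: {"pass_at_1": 0.40, "mrr": 0.50} for name in GATE_BASELINES}
cis = {name: (0.05, 0.30) for name in GATE_BASELINES}
print(sketch_claim_gate(True, 120, model, baselines, cis))
print(sketch_claim_gate(True, 99, model, baselines, cis))
```

Every failure path returns False, so the sketch closes the gate by default and opens it only when all conditions are positively established.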

Closes #422.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@AbdelStark AbdelStark merged commit 2ef8622 into main Jun 8, 2026
9 checks passed
@AbdelStark AbdelStark deleted the issue-422-hard-rerank-claim-gate branch June 8, 2026 14:18