eval: score hard benchmark baselines and CodeLeWM claim gate (#422) by AbdelStark · Pull Request #428 · AbdelStark/CodeLeWM

AbdelStark · 2026-06-08T14:14:33Z

Summary

Adds an additive hard_mode to the downstream reranking evaluator so a built anti-saturation pack is scored against every RFC-0016 baseline and gated by the three-baseline (no-action / lexical / LLM-order) claim gate. The plain path is byte-for-byte unchanged — every hard-mode behavior is gated behind the flag, and DOWNSTREAM_REQUIRED_BASELINES is untouched.

Linked Issue

Closes #422.

Spec / RFC Reference

Spec section: docs/spec/11-llm-world-model-harness.md
RFC: docs/rfcs/RFC-0016-hard-downstream-reranking-benchmark.md

Public Surface Impact

run_downstream_rerank_evaluation gains an optional hard_mode: bool = False parameter.
New exported constants: HARD_DOWNSTREAM_REQUIRED_BASELINES, HARD_DOWNSTREAM_EXTRA_BASELINES (shuffled_action, static_heuristic, p_pass).
New CLI flag: codelewm eval downstream-rerank --hard-mode.
_build_report gains optional required_baselines / extra_report_fields (internal; defaults preserve the plain report).
No schema-version bump: the plain codelewm.downstream_rerank_report.v1 is unchanged; hard mode adds keys (hard_mode, profile, anti_saturation_report, lift_confidence_intervals) and extra baseline rows only when --hard-mode is set. The embedded reports reuse the codelewm.downstream_anti_saturation_report.v1 / ...claim_gate.v1 schemas from #419 (data: add anti-saturation benchmark schema and readiness diagnostics).

Validation

uv run pytest tests/eval/test_hard_downstream_rerank.py -q   # 8 passed
uv run pytest tests/eval/ tests/test_imports.py tests/harness/test_output_schemas.py tests/security/test_sandbox_import_boundary.py -q   # 227 passed
uv run pytest tests/docs/ -q   # 144 passed
uv run python -m compileall -q codelewm/eval/downstream_rerank.py codelewm/eval/__init__.py codelewm/harness/cli.py
git diff --check

Artifact Impact

In hard mode the rerank report (reports/downstream_rerank_report.json) additionally carries hard_mode, profile, the embedded anti_saturation_report, lift_confidence_intervals, and the shuffled_action / static_heuristic / p_pass metric rows; claim_gate uses the three-baseline anti-saturation gate. The eval manifest metadata records hard_mode and anti_saturation_eligible. Plain-mode artifacts are unchanged.

Deprecations

none

Caveats / Follow-ups

p_pass is always not_recorded here because ScoreResult serializes no standalone p_pass key; the typed row is RFC-compliant.
shuffled_action re-scores against the next task's before-state (rotation); on a single-task fixture the rotation is degenerate. Lift CIs are skipped below 20 problems.
Follow-up: #423 (results: publish hard benchmark artifacts and claim audit) publishes the artifact set and the claim audit.
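
The rotation caveat for shuffled_action can be illustrated with a minimal sketch. It assumes the rotation simply pairs task i with the before-state of task i+1, wrapping around at the end. The function name `rotate_before_states` is hypothetical and is not taken from the CodeLeWM codebase; the actual evaluator may rotate differently.

```python
from __future__ import annotations

from typing import Sequence, TypeVar

T = TypeVar("T")


def rotate_before_states(before_states: Sequence[T]) -> list[T]:
    """Pair task i with the before-state of task i+1, wrapping around.

    Hypothetical helper: an assumed reading of "re-scores against the
    next task's before-state (rotation)", not the project's own code.
    """
    n = len(before_states)
    if n == 0:
        return []
    return [before_states[(i + 1) % n] for i in range(n)]


def rotation_is_degenerate(before_states: Sequence[T]) -> bool:
    """With fewer than two tasks, every task keeps its own before-state."""
    return len(before_states) < 2


# Three tasks: each one is scored against a different task's before-state.
multi = rotate_before_states(["before_a", "before_b", "before_c"])

# One task: the "rotated" before-state is the task's own, so the control
# no longer probes action sensitivity.
single = rotate_before_states(["before_a"])
```

Under this assumption the single-task case returns the original pairing unchanged, which is why the control carries no information on a single-task fixture.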

Add an additive `hard_mode` to the downstream reranking evaluator so a built anti-saturation pack is scored against every RFC-0016 baseline and gated by the three-baseline claim gate. The plain (non-hard) path is byte-for-byte unchanged: every hard-mode behavior is gated behind the new flag.

`run_downstream_rerank_evaluation(..., hard_mode=True)` now additionally:

- computes the `shuffled_action` control by re-scoring each candidate's after-state against a rotated before-state (action-sensitivity probe; text-only, never executed), and a deterministic `static_heuristic` order;
- emits a typed `p_pass` row (`not_recorded`, since no standalone p_pass score key is serialized per row in this codebase);
- writes the `codelewm.downstream_anti_saturation_report.v1` diagnostic from the model-independent baselines, matching the #419 pack-time report;
- bootstraps CodeLeWM lift confidence intervals over no-action, lexical, and LLM-order (same LCG + percentile scheme as the existing CIs);
- decides the claim with `build_anti_saturation_claim_gate`: the gate opens only on an eligible >=100-problem slice where CodeLeWM is strictly above all three baselines on pass@1 and MRR with every lift CI excluding zero.

`HARD_DOWNSTREAM_REQUIRED_BASELINES` (the standard seven plus shuffled-action, static-heuristic, p_pass) is exported; `DOWNSTREAM_REQUIRED_BASELINES` is untouched so the v1.0 fixture contract holds. `_build_report` gains optional `required_baselines` / `extra_report_fields` params (defaults preserve the plain report). The CLI `eval downstream-rerank` gains a `--hard-mode` flag, threaded through the handler, command tuple, logs, and manifest metadata.

Tests cover the hard-mode report shape (all ten baselines, anti-saturation report, typed p_pass, three-baseline gate, lift CIs), the unchanged plain path, the CLI `--hard-mode` run, and the positive / saturated / missing-baseline / invalid-candidate-only / mixed-slice claim-gate outcomes.

Closes #422.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
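
The claim-gate condition stated in the commit message (eligible slice, at least 100 problems, strictly above all three baselines on pass@1 and MRR, every lift CI excluding zero) can be written out as a small predicate. This is a sketch under stated assumptions, not the body of `build_anti_saturation_claim_gate`: the `Metrics` type, the baseline keys, the function name `claim_gate_open`, and the choice of one lift CI per baseline are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Mapping, Tuple

# Hypothetical keys for the three gate baselines named in the PR.
GATE_BASELINES = ("no_action", "lexical", "llm_order")
MIN_PROBLEMS = 100  # ">=100-problem slice" from the commit message


@dataclass(frozen=True)
class Metrics:
    """Hypothetical container for the two gated metrics."""
    pass_at_1: float
    mrr: float


def ci_excludes_zero(ci: Tuple[float, float]) -> bool:
    """True when the interval (low, high) lies entirely on one side of zero."""
    low, high = ci
    return low > 0.0 or high < 0.0


def claim_gate_open(
    eligible: bool,
    n_problems: int,
    model: Metrics,
    baselines: Mapping[str, Metrics],
    lift_cis: Mapping[str, Tuple[float, float]],
) -> bool:
    """Assumed reading of the three-baseline gate; one lift CI per baseline."""
    if not eligible or n_problems < MIN_PROBLEMS:
        return False
    for name in GATE_BASELINES:
        base = baselines.get(name)
        ci = lift_cis.get(name)
        if base is None or ci is None:
            return False  # a missing baseline or CI keeps the gate closed
        strictly_above = model.pass_at_1 > base.pass_at_1 and model.mrr > base.mrr
        if not strictly_above or not ci_excludes_zero(ci):
            return False
    return True


# Illustrative values only; these are not results from the PR.
model = Metrics(pass_at_1=0.60, mrr=0.70)
baselines = {name: Metrics(pass_at_1=0.40, mrr=0.50) for name in GATE_BASELINES}
cis = {name: (0.05, 0.30) for name in GATE_BASELINES}

gate_positive = claim_gate_open(True, 120, model, baselines, cis)
gate_small_slice = claim_gate_open(True, 99, model, baselines, cis)
```

The predicate fails closed: a missing baseline, a tie on either metric, or a CI that straddles zero all leave the gate shut, which matches the saturated and missing-baseline outcomes the tests are said to cover.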

AbdelStark merged commit 2ef8622 into main Jun 8, 2026
9 checks passed

AbdelStark deleted the issue-422-hard-rerank-claim-gate branch June 8, 2026 14:18

AbdelStark mentioned this pull request Jun 8, 2026

[TRACKER] v1.5 hard anti-saturation downstream reranking benchmark #417

Closed

7 tasks
