eval: score hard benchmark baselines and CodeLeWM claim gate (#422) #428
Merged
Add an additive `hard_mode` to the downstream reranking evaluator so a built anti-saturation pack is scored against every RFC-0016 baseline and gated by the three-baseline claim gate. The plain (non-hard) path is byte-for-byte unchanged: every hard-mode behavior is gated behind the new flag.

`run_downstream_rerank_evaluation(..., hard_mode=True)` now additionally:

- computes the `shuffled_action` control by re-scoring each candidate's after-state against a rotated before-state (action-sensitivity probe; text-only, never executed), and a deterministic `static_heuristic` order;
- emits a typed `p_pass` row (`not_recorded`, since no standalone p_pass score key is serialized per row in this codebase);
- writes the `codelewm.downstream_anti_saturation_report.v1` diagnostic from the model-independent baselines, matching the #419 pack-time report;
- bootstraps CodeLeWM lift confidence intervals over no-action, lexical, and LLM-order (same LCG + percentile scheme as the existing CIs);
- decides the claim with `build_anti_saturation_claim_gate`: the gate opens only on an eligible >=100-problem slice where CodeLeWM is strictly above all three baselines on pass@1 and MRR with every lift CI excluding zero.

`HARD_DOWNSTREAM_REQUIRED_BASELINES` (the standard seven plus shuffled-action, static-heuristic, p_pass) is exported; `DOWNSTREAM_REQUIRED_BASELINES` is untouched so the v1.0 fixture contract holds. `_build_report` gains optional `required_baselines` / `extra_report_fields` params (defaults preserve the plain report). The CLI `eval downstream-rerank` gains a `--hard-mode` flag, threaded through the handler, command tuple, logs, and manifest metadata.

Tests cover the hard-mode report shape (all ten baselines, anti-saturation report, typed p_pass, three-baseline gate, lift CIs), the unchanged plain path, the CLI `--hard-mode` run, and the positive / saturated / missing-baseline / invalid-candidate-only / mixed-slice claim-gate outcomes.

Closes #422.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
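
For reviewers, the gate rule stated in the commit message (eligible slice, at least 100 problems, strictly above all three baselines on pass@1 and MRR, every lift CI excluding zero) can be sketched roughly as below. This is an illustration only and not the body of `build_anti_saturation_claim_gate`: the function name `sketch_claim_gate`, the `GateDecision` type, the metric and baseline key names, and the reason strings are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Mapping, Optional, Tuple

# Hypothetical key names; the real report keys may differ.
GATE_BASELINES = ("no_action", "lexical", "llm_order")
GATE_METRICS = ("pass_at_1", "mrr")
MIN_PROBLEMS = 100  # threshold taken from the commit message


@dataclass(frozen=True)
class GateDecision:
    is_open: bool
    reason: str


def sketch_claim_gate(
    eligible: bool,
    problem_count: int,
    model_metrics: Mapping[str, float],
    baseline_metrics: Mapping[str, Mapping[str, float]],
    lift_cis: Mapping[str, Mapping[str, Optional[Tuple[float, float]]]],
) -> GateDecision:
    """Open the gate only if every condition holds for every baseline and metric."""
    if not eligible:
        return GateDecision(False, "slice_not_eligible")
    if problem_count < MIN_PROBLEMS:
        return GateDecision(False, "too_few_problems")
    for baseline in GATE_BASELINES:
        if baseline not in baseline_metrics:
            return GateDecision(False, f"missing_baseline:{baseline}")
        for metric in GATE_METRICS:
            # Strictly above: a tie with any baseline keeps the gate closed.
            if not model_metrics[metric] > baseline_metrics[baseline][metric]:
                return GateDecision(False, f"not_above:{baseline}:{metric}")
            ci = lift_cis.get(baseline, {}).get(metric)
            if ci is None:
                return GateDecision(False, f"ci_missing:{baseline}:{metric}")
            low, high = ci
            # The interval must exclude zero.
            if low <= 0.0 <= high:
                return GateDecision(False, f"ci_includes_zero:{baseline}:{metric}")
    return GateDecision(True, "open")
```

Returning the first failing condition as a reason string is a design choice of this sketch; it makes the saturated, missing-baseline, and mixed-slice outcomes easy to tell apart in a test.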
Summary
Adds an additive `hard_mode` to the downstream reranking evaluator so a built anti-saturation pack is scored against every RFC-0016 baseline and gated by the three-baseline (no-action / lexical / LLM-order) claim gate. The plain path is byte-for-byte unchanged: every hard-mode behavior is gated behind the flag, and `DOWNSTREAM_REQUIRED_BASELINES` is untouched.
Linked Issue
Closes #422.
Spec / RFC Reference
- `docs/spec/11-llm-world-model-harness.md`
- `docs/rfcs/RFC-0016-hard-downstream-reranking-benchmark.md`
Public Surface Impact
- `run_downstream_rerank_evaluation` gains an optional `hard_mode: bool = False` parameter.
- `HARD_DOWNSTREAM_REQUIRED_BASELINES`, `HARD_DOWNSTREAM_EXTRA_BASELINES` (`shuffled_action`, `static_heuristic`, `p_pass`).
- `codelewm eval downstream-rerank --hard-mode`.
- `_build_report` gains optional `required_baselines` / `extra_report_fields` (internal; defaults preserve the plain report).
- `codelewm.downstream_rerank_report.v1` is unchanged; hard mode adds keys (`hard_mode`, `profile`, `anti_saturation_report`, `lift_confidence_intervals`) and extra baseline rows only when `--hard-mode` is set. The embedded reports reuse the #419 (data: add anti-saturation benchmark schema and readiness diagnostics) `codelewm.downstream_anti_saturation_report.v1` / `...claim_gate.v1` schemas.
Validation
Artifact Impact
In hard mode the rerank report (`reports/downstream_rerank_report.json`) additionally carries `hard_mode`, `profile`, the embedded `anti_saturation_report`, `lift_confidence_intervals`, and the `shuffled_action` / `static_heuristic` / `p_pass` metric rows; `claim_gate` uses the three-baseline anti-saturation gate. The eval manifest metadata records `hard_mode` and `anti_saturation_eligible`. Plain-mode artifacts are unchanged.
Deprecations
none
Caveats / Follow-ups
- `p_pass` is always `not_recorded` here because `ScoreResult` serializes no standalone p_pass key; the typed row is RFC-compliant.
- `shuffled_action` re-scores against the next task's before-state (rotation); on a single-task fixture the rotation is degenerate.
- Lift CIs are `skipped` below 20 problems.
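
To make the rotation caveat concrete, here is a minimal sketch of pairing each task's candidates with the next task's before-state. It is not the evaluator's code: `rotated_before_states`, `shuffled_action_scores`, and the token-overlap `toy_score` are hypothetical names introduced for illustration, and the real scorer is assumed to be whatever the evaluator already uses.

```python
from typing import Callable, List, Sequence


def rotated_before_states(before_states: Sequence[str]) -> List[str]:
    """Pair task i with the before-state of task i + 1, wrapping at the end."""
    n = len(before_states)
    return [before_states[(i + 1) % n] for i in range(n)]


def shuffled_action_scores(
    before_states: Sequence[str],
    candidate_after_states: Sequence[Sequence[str]],
    score: Callable[[str, str], float],
) -> List[List[float]]:
    """Re-score every candidate after-state against the rotated before-state."""
    rotated = rotated_before_states(before_states)
    return [
        [score(before, after) for after in candidates]
        for before, candidates in zip(rotated, candidate_after_states)
    ]


def toy_score(before: str, after: str) -> float:
    """Stand-in scorer (assumption): number of whitespace tokens shared."""
    return float(len(set(before.split()) & set(after.split())))


# With one task the rotation maps the task onto itself, so the control
# reproduces the unshuffled scores and carries no action-sensitivity signal.
single_task = rotated_before_states(["only task"])
multi_task = rotated_before_states(["t0", "t1", "t2"])
```

With a single task, `(0 + 1) % 1 == 0`, so the "rotated" before-state is the task's own before-state. A fixture needs at least two tasks with distinct before-states for the `shuffled_action` row to differ from the model's own scores.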