eval: score hard benchmark baselines and CodeLeWM claim gate #422

## Parent

#417

## What to build

Run the hard benchmark comparison over a built anti-saturation benchmark pack. The report must compare CodeLeWM to every required baseline and decide the downstream claim gate from predeclared metrics.

## Acceptance criteria

- [ ] Evaluation consumes the hard benchmark manifest and trusted model/checkpoint/index artifacts without using transient job directories.
- [ ] Reports include random, LLM-order, lexical, static heuristic, no-action, shuffled-action, CodeLeWM transition-energy, retrieval-prior-only (when available), and final-score/ensemble rows.
- [ ] `p_pass` rows appear only when a standalone score key is serialized for every downstream row; otherwise they are typed `not_recorded`.
- [ ] Metrics include pass@1, pass@k, MRR, first-passing rank, valid candidate rate, parser/apply failure rate, check-pass rate, candidate-class slices, and bootstrap lift CIs.
- [ ] The claim gate opens only when eligible slices show CodeLeWM beating no-action, lexical, and LLM-order on pass@1 and MRR with lift CIs excluding zero.
- [ ] Tests cover positive, saturated, missing-baseline, invalid-candidate-only, and mixed-slice outcomes.
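
As a rough illustration of the ranking metrics and the gate rule in the criteria above, here is a minimal sketch, not the project's implementation. Several things in it are assumptions: the method keys (`codelewm`, `no_action`, `lexical`, `llm_order`), the `per_task` layout, the reason strings, the reading of pass@k as "any passing candidate in the top k", and the choice of a paired percentile bootstrap with 2000 resamples at a 95% level. "Excluding zero" is interpreted as a lower CI bound strictly above zero.

```python
import random
from statistics import mean

def first_passing_rank(passed_in_rank_order):
    """1-based rank of the first passing candidate, or None if none pass."""
    for rank, passed in enumerate(passed_in_rank_order, start=1):
        if passed:
            return rank
    return None

def pass_at_k(passed_in_rank_order, k):
    """Assumed definition: 1.0 if any of the top-k ranked candidates passes."""
    return 1.0 if any(passed_in_rank_order[:k]) else 0.0

def reciprocal_rank(passed_in_rank_order):
    """Per-task reciprocal rank; averaging it over tasks gives MRR."""
    rank = first_passing_rank(passed_in_rank_order)
    return 0.0 if rank is None else 1.0 / rank

def bootstrap_lift_ci(model_scores, baseline_scores,
                      n_resamples=2000, alpha=0.05, seed=0):
    """Paired percentile bootstrap CI for mean(model - baseline) over tasks."""
    if not model_scores or len(model_scores) != len(baseline_scores):
        raise ValueError("scores must be non-empty and paired per task")
    rng = random.Random(seed)  # fixed seed so reports are reproducible
    diffs = [m - b for m, b in zip(model_scores, baseline_scores)]
    means = sorted(mean(rng.choices(diffs, k=len(diffs)))
                   for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi

# Hypothetical keys for the baselines and metrics the gate compares against.
GATE_BASELINES = ("no_action", "lexical", "llm_order")
GATE_METRICS = ("pass@1", "mrr")

def claim_gate(per_task, model="codelewm"):
    """per_task: {method: {metric: [per-task score, ...]}} for one eligible slice."""
    for method in (model, *GATE_BASELINES):
        if method not in per_task:
            return {"open": False, "reason": f"missing_row:{method}"}
    for baseline in GATE_BASELINES:
        for metric in GATE_METRICS:
            lo, _ = bootstrap_lift_ci(per_task[model][metric],
                                      per_task[baseline][metric])
            if lo <= 0.0:  # CI touches or crosses zero: no demonstrated lift
                return {"open": False, "reason": f"no_lift:{baseline}:{metric}"}
    return {"open": True, "reason": "all_lifts_exclude_zero"}
```

With this shape, a saturated slice (every method already at 1.0) yields zero lift and keeps the gate closed, and a missing baseline row closes it before any statistics are computed. That matches the saturated and missing-baseline test cases listed above.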

## Blocked by

#420 and #421
