eval: score hard benchmark baselines and CodeLeWM claim gate #422

Description

@AbdelStark

Parent

#417

What to build

Run the hard benchmark comparison over a built anti-saturation benchmark pack. The report must compare CodeLeWM against every required baseline and decide whether the downstream claim gate opens, using only predeclared metrics.

Acceptance criteria

  • Evaluation consumes the hard benchmark manifest and trusted model/checkpoint/index artifacts without using transient job directories.
  • Reports include random, LLM-order, lexical, static heuristic, no-action, shuffled-action, CodeLeWM transition-energy, retrieval-prior-only when available, and final-score/ensemble rows.
  • p_pass rows appear only when a standalone score key is serialized for every downstream row; otherwise they are typed not_recorded.
  • Metrics include pass@1, pass@k, MRR, first-passing rank, valid candidate rate, parser/apply failure rate, check-pass rate, candidate-class slices, and bootstrap lift CIs.
  • The claim gate opens only when eligible slices show CodeLeWM beating no-action, lexical, and LLM-order on pass@1 and MRR with lift CIs excluding zero.
  • Tests cover positive, saturated, missing-baseline, invalid-candidate-only, and mixed-slice outcomes.
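
A minimal sketch of how the ranking metrics, the bootstrap lift CI, and the gate decision could fit together. Everything here is illustrative: the row names (`codelewm`, `no_action`, `lexical`, `llm_order`), the function names, and the data layout are assumptions, not the repository's actual API. `pass_at_k` is taken to mean "a passing candidate appears in the top k of the ranking", and the CI is a paired percentile bootstrap over tasks; the RFC may define either differently.

```python
import random

# Hypothetical row names; the real report keys may differ.
MODEL_ROW = "codelewm"
REQUIRED_BASELINES = ("no_action", "lexical", "llm_order")
GATE_METRICS = ("pass@1", "mrr")


def first_passing_rank(ranked_passes):
    """Return the 1-based rank of the first passing candidate, or None."""
    for rank, passed in enumerate(ranked_passes, start=1):
        if passed:
            return rank
    return None


def pass_at_k(ranked_passes, k):
    """1.0 if any of the top-k ranked candidates passes, else 0.0."""
    rank = first_passing_rank(ranked_passes)
    return 1.0 if rank is not None and rank <= k else 0.0


def reciprocal_rank(ranked_passes):
    """Per-task reciprocal rank; averaging over tasks gives MRR."""
    rank = first_passing_rank(ranked_passes)
    return 0.0 if rank is None else 1.0 / rank


def bootstrap_lift_ci(model, baseline, n_resamples=2000, alpha=0.05, seed=0):
    """Paired percentile bootstrap CI for mean(model - baseline) over tasks."""
    if not model or len(model) != len(baseline):
        raise ValueError("need equal-length, non-empty per-task scores")
    rng = random.Random(seed)  # fixed seed keeps the report reproducible
    diffs = [m - b for m, b in zip(model, baseline)]
    n = len(diffs)
    lifts = sorted(
        sum(diffs[rng.randrange(n)] for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lo = lifts[int((alpha / 2) * n_resamples)]
    hi = lifts[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi


def claim_gate_open(per_task):
    """per_task maps row name -> metric -> per-task scores on eligible slices.

    The gate opens only if the model beats every required baseline on every
    gate metric with a lift CI whose lower bound is strictly above zero.
    """
    if MODEL_ROW not in per_task:
        return False
    for baseline in REQUIRED_BASELINES:
        if baseline not in per_task:
            return False  # a missing baseline keeps the gate closed
        for metric in GATE_METRICS:
            model_scores = per_task[MODEL_ROW].get(metric, [])
            base_scores = per_task[baseline].get(metric, [])
            if not model_scores or len(model_scores) != len(base_scores):
                return False  # nothing eligible to compare
            lo, _ = bootstrap_lift_ci(model_scores, base_scores)
            if lo <= 0.0:
                return False  # CI includes zero or favors the baseline
    return True
```

Under these assumptions, a saturated slice (model and baselines all passing at rank 1) yields per-task differences of zero, so the lower CI bound is zero and the gate stays closed, which matches the saturated test outcome listed above.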

Blocked by

#420 and #421

Metadata

Assignees

No one assigned

    Labels

    area:evaluation (Area: evaluation)
    area:harness (Area: harness)
    area:model (Area: model)
    effort:l (Large multi-file implementation change)
    priority:p1 (Required for v1.0 or core follow-through)
    spec:rfc-0016 (Derived from RFC-0016)
    type:feature (Feature implementation work)
