forked from lucas-maes/le-wm
eval: score hard benchmark baselines and CodeLeWM claim gate #422
Closed
Labels
area:evaluation (Area: evaluation)
area:harness (Area: harness)
area:model (Area: model)
effort:l (Large multi-file implementation change)
priority:p1 (Required for v1.0 or core follow-through)
spec:rfc-0016 (Derived from RFC-0016)
type:feature (Feature implementation work)
Parent
#417
What to build
Run the hard benchmark comparison over a built anti-saturation benchmark pack. The report must compare CodeLeWM to every required baseline and decide the downstream claim gate from predeclared metrics.
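A minimal sketch of how a claim gate could be decided from predeclared metrics. The issue does not state the gate rule, so everything here is an assumption: the function name `decide_claim_gate`, the dictionary shapes, the higher-is-better convention, and the rule that CodeLeWM must strictly beat every required baseline on every predeclared metric.

```python
from dataclasses import dataclass, field


@dataclass
class GateDecision:
    passed: bool
    # Each failure is (baseline, metric, reason); hypothetical shape.
    failures: list = field(default_factory=list)


def decide_claim_gate(model_scores, baseline_scores, required_baselines, predeclared_metrics):
    """Assumed rule: pass only if the model strictly beats every required
    baseline on every predeclared metric (higher is better)."""
    failures = []
    for baseline in required_baselines:
        scores = baseline_scores.get(baseline)
        if scores is None:
            # A required baseline with no scores blocks the gate.
            failures.append((baseline, None, "missing_baseline"))
            continue
        for metric in predeclared_metrics:
            if metric not in scores or metric not in model_scores:
                failures.append((baseline, metric, "missing_metric"))
            elif model_scores[metric] <= scores[metric]:
                failures.append((baseline, metric, "not_better"))
    return GateDecision(passed=not failures, failures=failures)
```

Treating a missing baseline or metric as a failure, rather than skipping it, keeps the gate from passing on an incomplete comparison.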
Acceptance criteria
p_pass rows appear only when a standalone score key is serialized for every downstream row; otherwise they are typed not_recorded.
Blocked by
#420 and #421
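
The acceptance criterion on p_pass rows can be sketched as an all-or-nothing check. This is an illustration under assumptions, not the harness's actual code: the function name `build_p_pass_rows`, the row fields `row_id`, `type`, and `value`, and the use of `p_pass` as the score key name are hypothetical.

```python
NOT_RECORDED = "not_recorded"


def build_p_pass_rows(downstream_rows, score_key="p_pass"):
    """Emit p_pass rows only if every downstream row serializes the score key.

    Otherwise every row is typed not_recorded, so a partially recorded
    score is never reported as if it were complete.
    """
    all_serialized = bool(downstream_rows) and all(
        row.get(score_key) is not None for row in downstream_rows
    )
    if all_serialized:
        return [
            {"row_id": row["row_id"], "type": "p_pass", "value": float(row[score_key])}
            for row in downstream_rows
        ]
    return [
        {"row_id": row["row_id"], "type": NOT_RECORDED, "value": None}
        for row in downstream_rows
    ]
```

For example, if one of two downstream rows lacks the score key, both output rows are typed not_recorded rather than one p_pass row and one gap.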