docs: lock hard downstream benchmark spec and tracker (#418) by AbdelStark · Pull Request #424 · AbdelStark/CodeLeWM

AbdelStark · 2026-06-08T13:00:43Z

Summary

Lands the repo-level contract for the hard anti-saturation downstream reranking benchmark (RFC-0016), so future agents cannot weaken the anti-saturation gate. This is documentation + docs-tests only; no runtime/code surface changes. It is the first child of tracker #417 and sets the contract that #419–#423 implement.

Linked Issue

Closes #418.

Spec / RFC Reference

Spec section: docs/spec/11-llm-world-model-harness.md (anti_saturation_semantic_v1 profile + codelewm.downstream_anti_saturation_report.v1)
RFC: docs/rfcs/RFC-0016-hard-downstream-reranking-benchmark.md

Public Surface Impact

none (docs + docs tests only). No CLI flags, JSON schemas, manifest fields, error types, config keys, or Python APIs are added/changed/removed. The RFC names future schema strings (codelewm.downstream_anti_saturation_report.v1, anti_saturation_semantic_v1) but no code implements them yet.

Validation

uv run pytest tests/docs/
# 144 passed, 1024 subtests passed
uv run pytest tests/docs/test_hard_downstream_benchmark.py tests/docs/test_implementation_tracker.py tests/docs/test_final_paper_package.py -q
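For readers unfamiliar with docs tests that lock a contract, the following is a minimal sketch of the idea, not the contents of tests/docs/test_hard_downstream_benchmark.py. The helper name `missing_contract_strings`, the `REQUIRED_STRINGS` tuple, and the inline sample text are all hypothetical. A real test would presumably read the RFC or spec file from disk instead of using an inline string.

```python
# Hypothetical sketch of a docs test that "locks" contract strings.
# The identifiers below are illustrative; only the quoted schema/profile
# names and thresholds come from the PR description.

REQUIRED_STRINGS = (
    "anti_saturation_semantic_v1",
    "codelewm.downstream_anti_saturation_report.v1",
    "0.85",
    "0.90",
)


def missing_contract_strings(doc_text: str, required=REQUIRED_STRINGS) -> list[str]:
    """Return the required strings that do not appear in doc_text."""
    return [s for s in required if s not in doc_text]


def test_contract_strings_present():
    # Inline stand-in for the document under test (assumption: a real
    # test would load the RFC text from the repository instead).
    sample = (
        "Profile: anti_saturation_semantic_v1. "
        "Report: codelewm.downstream_anti_saturation_report.v1. "
        "Filter: no-action/lexical < 0.85, LLM-order < 0.90."
    )
    assert missing_contract_strings(sample) == []
```

A test of this shape fails as soon as a later edit removes or renames one of the locked strings, which is one way a docs-only PR can still enforce a contract.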

Artifact Impact

none on-disk at runtime. Documents the planned published artifact set and the codelewm.downstream_anti_saturation_report.v1 schema for later issues.

Deprecations

none

Caveats / Follow-ups

This tracker does not reopen the broad coding-improvement claim. Public wording stays diagnostic until the locked benchmark gate opens (#422 "eval: score hard benchmark baselines and CodeLeWM claim gate" and #423 "results: publish hard benchmark artifacts and claim audit").
Follow-ups:
- #419 "data: add anti-saturation benchmark schema and readiness diagnostics" (schema + readiness diagnostics)
- #420 "data: build public-safe hard-negative candidate pools" (hard-negative pools)
- #421 "harness: ingest LLM candidate packs into the hard benchmark" (LLM candidate ingestion)
- #422 "eval: score hard benchmark baselines and CodeLeWM claim gate" (scoring + claim gate)
- #423 "results: publish hard benchmark artifacts and claim audit" (publication + claim audit)

Land the repo-level contract for the hard anti-saturation downstream reranking benchmark (RFC-0016) so future work cannot weaken the anti-saturation gate.

- Add RFC-0016 defining the scientific question, candidate classes, anti-saturation filter (no-action/lexical < 0.85, LLM-order < 0.90), required baselines, metrics, split policy, security boundary, and the claim gate.
- Add the dedicated roadmap (HARD_DOWNSTREAM_RERANKING_BENCHMARK.md) mapping the #417 tracker to child issues #418-#423.
- Point the LLM/world-model harness spec at the `anti_saturation_semantic_v1` profile and the `codelewm.downstream_anti_saturation_report.v1` diagnostic report.
- Explain in the downstream benchmark doc why MBPP-Plus WS-D saturation motivated this follow-up; strengthen the paper/claim-audit wording to preserve the v1.0 claim boundary.
- Update roadmap/tracker docs with the v1.5 issue order.
- Add docs tests locking the contract and issue references.

Closes #418.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
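The anti-saturation filter quoted in the commit message (no-action/lexical < 0.85, LLM-order < 0.90) could be sketched as below. This is an illustration under stated assumptions, not the RFC-0016 implementation: the PR says no code implements the contract yet. It assumes the 0.85 ceiling applies to the no-action and lexical baselines separately, and that scores are fractions in [0, 1] on an unspecified metric. `BaselineScores` and `passes_anti_saturation_filter` are hypothetical names.

```python
from dataclasses import dataclass

# Ceilings quoted from the commit message; everything else is illustrative.
NO_ACTION_LEXICAL_MAX = 0.85
LLM_ORDER_MAX = 0.90


@dataclass(frozen=True)
class BaselineScores:
    """Hypothetical container; field names are not taken from the repo."""

    no_action: float
    lexical: float
    llm_order: float


def passes_anti_saturation_filter(scores: BaselineScores) -> bool:
    """Return True when every trivial baseline is strictly below its ceiling.

    Assumption: a benchmark slice counts as non-saturated only if all
    three baselines stay under their thresholds.
    """
    return (
        scores.no_action < NO_ACTION_LEXICAL_MAX
        and scores.lexical < NO_ACTION_LEXICAL_MAX
        and scores.llm_order < LLM_ORDER_MAX
    )
```

Under this reading, `passes_anti_saturation_filter(BaselineScores(0.50, 0.60, 0.70))` holds, while a lexical score of exactly 0.85 fails because the comparison is strict.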

AbdelStark merged commit cf0197e into main Jun 8, 2026
9 checks passed

AbdelStark deleted the issue-418-lock-hard-downstream-benchmark-spec branch June 8, 2026 13:03

AbdelStark mentioned this pull request Jun 8, 2026

[TRACKER] v1.5 hard anti-saturation downstream reranking benchmark #417

Closed

7 tasks

AbdelStark merged 1 commit into main from issue-418-lock-hard-downstream-benchmark-spec
