docs: lock hard downstream benchmark spec and tracker (#418) #424

Merged: AbdelStark merged 1 commit into main from issue-418-lock-hard-downstream-benchmark-spec on Jun 8, 2026

@AbdelStark

Owner

Summary

Lands the repo-level contract for the hard anti-saturation downstream reranking benchmark (RFC-0016), so future agents cannot weaken the anti-saturation gate. This PR changes documentation and docs tests only; there are no runtime or code-surface changes. It is the first child of tracker #417 and sets the contract that #419-#423 implement.

Linked Issue

Closes #418.

Spec / RFC Reference

  • Spec section: docs/spec/11-llm-world-model-harness.md (anti_saturation_semantic_v1 profile + codelewm.downstream_anti_saturation_report.v1)
  • RFC: docs/rfcs/RFC-0016-hard-downstream-reranking-benchmark.md

Public Surface Impact

none (docs and docs tests only). No CLI flags, JSON schemas, manifest fields, error types, config keys, or Python APIs are added, changed, or removed. The RFC names future schema strings (codelewm.downstream_anti_saturation_report.v1, anti_saturation_semantic_v1), but no code implements them yet.

Validation

uv run pytest tests/docs/
# 144 passed, 1024 subtests passed
uv run pytest tests/docs/test_hard_downstream_benchmark.py tests/docs/test_implementation_tracker.py tests/docs/test_final_paper_package.py -q
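
For readers unfamiliar with docs tests, the following is a minimal sketch of how a test might lock the contract strings named above. It is not the content of tests/docs/test_hard_downstream_benchmark.py; the helper name `missing_contract_strings` and the `REQUIRED_STRINGS` tuple are hypothetical, and only the two identifiers are taken from this PR.

```python
# Hypothetical helper: NOT the repository's actual docs test.
# Only the two identifier strings below come from the PR description.
REQUIRED_STRINGS = (
    "anti_saturation_semantic_v1",
    "codelewm.downstream_anti_saturation_report.v1",
)


def missing_contract_strings(text: str, required=REQUIRED_STRINGS) -> list[str]:
    """Return the required identifiers that do not appear in `text`."""
    return [name for name in required if name not in text]


def test_spec_mentions_contract_strings():
    # In a real docs test the text would be read from the spec file;
    # an inline sample keeps this sketch self-contained.
    sample = (
        "The anti_saturation_semantic_v1 profile emits "
        "codelewm.downstream_anti_saturation_report.v1."
    )
    assert missing_contract_strings(sample) == []
```

A test of this shape fails as soon as a later edit drops or renames one of the identifiers, which is the sense in which docs tests lock a contract.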

Artifact Impact

none on-disk at runtime. Documents the planned published artifact set and the codelewm.downstream_anti_saturation_report.v1 schema for later issues.

Deprecations

none

Caveats / Follow-ups

Land the repo-level contract for the hard anti-saturation downstream
reranking benchmark (RFC-0016) so future work cannot weaken the
anti-saturation gate.

- Add RFC-0016 defining the scientific question, candidate classes,
  anti-saturation filter (no-action/lexical < 0.85, LLM-order < 0.90),
  required baselines, metrics, split policy, security boundary, and the
  claim gate.
- Add the dedicated roadmap (HARD_DOWNSTREAM_RERANKING_BENCHMARK.md)
  mapping the #417 tracker to child issues #418-#423.
- Point the LLM/world-model harness spec at the
  `anti_saturation_semantic_v1` profile and the
  `codelewm.downstream_anti_saturation_report.v1` diagnostic report.
- Explain in the downstream benchmark doc why MBPP-Plus WS-D saturation
  motivated this follow-up; strengthen the paper/claim-audit wording to
  preserve the v1.0 claim boundary.
- Update roadmap/tracker docs with the v1.5 issue order.
- Add docs tests locking the contract and issue references.
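
The anti-saturation filter thresholds listed above can be sketched as a simple predicate. This is an illustration under stated assumptions, not the RFC-0016 implementation: the `BaselineScores` type, its field names, and the reading that the no-action and lexical baselines must each stay below 0.85 are hypothetical. Only the two threshold values come from the commit message.

```python
from dataclasses import dataclass

# Thresholds quoted from the commit message (RFC-0016 summary).
NO_ACTION_LEXICAL_MAX = 0.85
LLM_ORDER_MAX = 0.90


@dataclass(frozen=True)
class BaselineScores:
    """Hypothetical container for trivial-baseline scores in [0, 1]."""

    no_action: float
    lexical: float
    llm_order: float


def passes_anti_saturation(scores: BaselineScores) -> bool:
    """Return True when no trivial baseline reaches its threshold.

    Assumption: "no-action/lexical < 0.85" means both baselines must be
    strictly below 0.85; the RFC may define this differently.
    """
    return (
        scores.no_action < NO_ACTION_LEXICAL_MAX
        and scores.lexical < NO_ACTION_LEXICAL_MAX
        and scores.llm_order < LLM_ORDER_MAX
    )
```

Under this reading, a benchmark slice whose lexical baseline already scores 0.85 or higher would be rejected as saturated, regardless of how the other baselines perform.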

Closes #418.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@AbdelStark AbdelStark merged commit cf0197e into main Jun 8, 2026
9 checks passed
@AbdelStark AbdelStark deleted the issue-418-lock-hard-downstream-benchmark-spec branch June 8, 2026 13:03