data: add anti-saturation benchmark schema and readiness diagnostics (#419) by AbdelStark · Pull Request #425 · AbdelStark/CodeLeWM

AbdelStark · 2026-06-08T13:25:48Z

Summary

Adds the RFC-0016 anti_saturation_semantic_v1 profile and the codelewm.downstream_anti_saturation_report.v1 diagnostic so a hard downstream reranking pack exposes anti-saturation eligibility before any CodeLeWM scoring. The change is additive: a new schema-only downstream_anti_saturation module plus optional profile / hard_negative_class config fields on the existing pack builder. Plain (non-profile) fixture configs are byte-for-byte unaffected.

Linked Issue

Closes #419.

Spec / RFC Reference

Spec section: docs/spec/11-llm-world-model-harness.md
RFC: docs/rfcs/RFC-0016-hard-downstream-reranking-benchmark.md

Public Surface Impact

New Python API (exported from codelewm.eval): build_anti_saturation_report, build_anti_saturation_claim_gate, compute_model_independent_baselines, validate_hard_negative_class, validate_anti_saturation_report, validate_anti_saturation_claim_gate, anti_saturation_report_json_schema, lexical_similarity, stable_random_key, DownstreamAntiSaturationError, and the ANTI_SATURATION_* / HARD_NEGATIVE_CLASSES / ceiling constants.

New schema versions (additive): codelewm.downstream_anti_saturation_report.v1, codelewm.downstream_anti_saturation_claim_gate.v1.

New config keys (both optional): profile (pack config, only anti_saturation_semantic_v1), hard_negative_class (candidate). DownstreamBenchmarkPackResult gains optional anti_saturation_report_path / anti_saturation_eligible. No existing field, baseline tuple, or schema version changed.
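A minimal sketch of how the two optional keys might sit in a pack config and how the single accepted profile value could be checked. Only `profile`, `hard_negative_class`, and the value `anti_saturation_semantic_v1` come from this PR; the surrounding config shape (`tasks`, `candidates`, `id`), the helper `read_profile`, and the class label placeholder are hypothetical and do not reflect the actual pack builder.

```python
from typing import Any, Mapping, Optional

# The only profile value this PR describes as accepted.
ANTI_SATURATION_PROFILE = "anti_saturation_semantic_v1"


def read_profile(pack_config: Mapping[str, Any]) -> Optional[str]:
    """Return the optional profile, rejecting any unsupported value (hypothetical helper)."""
    profile = pack_config.get("profile")
    if profile is not None and profile != ANTI_SATURATION_PROFILE:
        raise ValueError(f"unsupported profile: {profile!r}")
    return profile


# Hypothetical config shape. Only `profile` and `hard_negative_class` are
# named in the PR; every other key and the class label are placeholders.
pack_config = {
    "profile": ANTI_SATURATION_PROFILE,
    "tasks": [
        {
            "id": "task-0",
            "candidates": [
                {"id": "cand-0", "hard_negative_class": "<class from RFC-0016>"},
            ],
        }
    ],
}
```

A config without `profile` returns `None` here, which matches the statement that plain configs are unaffected.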

No new CLI command — the existing eval downstream-pack emits the report when the config sets the profile.

Validation

uv run pytest tests/eval/test_hard_downstream_schema.py tests/eval/test_hard_downstream_pack.py -q   # 21 passed
uv run pytest tests/eval/ tests/test_imports.py tests/harness/test_output_schemas.py -q              # 199 passed
uv run python -m compileall -q codelewm/eval/downstream_anti_saturation.py codelewm/eval/downstream_pack.py codelewm/eval/__init__.py
git diff --check

Artifact Impact

A pack built with profile: anti_saturation_semantic_v1 now writes reports/anti_saturation_report.json (schema codelewm.downstream_anti_saturation_report.v1), records it in manifest.json metadata (profile, anti_saturation_report, anti_saturation_eligible), and adds anti_saturation_eligible / anti_saturation_blocked_reasons to the readiness report. Packs without a profile are unchanged.
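As an illustration of consuming these artifacts, the sketch below reads the manifest and the report from a built pack directory. The file names come from this PR; the assumption that the manifest nests these keys under a top-level `metadata` object, the report field names `eligible` and `blocked_reasons` as JSON keys, and the helper name `load_anti_saturation_status` are all hypothetical.

```python
import json
from pathlib import Path
from typing import Any, Dict


def load_anti_saturation_status(pack_dir: Path) -> Dict[str, Any]:
    """Summarize anti-saturation status for a built pack (hypothetical helper)."""
    manifest = json.loads((pack_dir / "manifest.json").read_text(encoding="utf-8"))
    metadata = manifest.get("metadata", {})  # assumed nesting, not confirmed by the PR
    if metadata.get("profile") is None:
        # Packs built without a profile carry no anti-saturation artifacts.
        return {"profile": None, "eligible": None, "blocked_reasons": []}
    report_path = pack_dir / "reports" / "anti_saturation_report.json"
    report = json.loads(report_path.read_text(encoding="utf-8"))
    return {
        "profile": metadata["profile"],
        "eligible": bool(report.get("eligible", False)),
        "blocked_reasons": list(report.get("blocked_reasons", [])),
    }
```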

Deprecations

none

Caveats / Follow-ups

The pack-time report computes the model-independent baselines (no-action/lexical/LLM-order/random). The CodeLeWM-scored baselines and the lift CIs are added at eval time in #422 (eval: score hard benchmark baselines and CodeLeWM claim gate).
The bundled fixture has one task, so it is intentionally eligible=false (blocked only by problem_count_below_minimum:1<100), preserving the v1.0 claim boundary.
Follow-ups: #420 (data: build public-safe hard-negative candidate pools; the hard-negative pool builder), #421 (harness: ingest LLM candidate packs into the hard benchmark; LLM candidate ingestion), #422 (eval: score hard benchmark baselines and CodeLeWM claim gate; scoring + claim gate), #423 (results: publish hard benchmark artifacts and claim audit; publication).
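To make the model-independent baselines mentioned above concrete, here is a rough sketch of three building blocks: a lexical similarity score, a deterministic "random" ordering key, and pass@1. The exported `lexical_similarity` and `stable_random_key` are not shown in this PR, so the `sketch_*` functions below are stand-ins with assumed signatures: `difflib` and SHA-256 are illustrative choices, not the module's confirmed implementation.

```python
import hashlib
from difflib import SequenceMatcher
from typing import Sequence


def sketch_lexical_similarity(a: str, b: str) -> float:
    """Character-level similarity ratio in [0, 1] (difflib stand-in)."""
    return SequenceMatcher(None, a, b).ratio()


def sketch_stable_random_key(task_id: str, candidate_id: str) -> str:
    """Deterministic pseudo-random sort key: identical inputs always hash the same."""
    return hashlib.sha256(f"{task_id}:{candidate_id}".encode("utf-8")).hexdigest()


def pass_at_1(ranked_passes: Sequence[Sequence[bool]]) -> float:
    """Fraction of problems whose top-ranked candidate passes."""
    if not ranked_passes:
        return 0.0
    hits = sum(1 for ranking in ranked_passes if ranking and ranking[0])
    return hits / len(ranked_passes)
```

Sorting a pool by `sketch_stable_random_key` gives a random-looking order that is reproducible across runs, which is the property a random baseline needs if its pass@1 is to be byte-stable in a report.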

…419)

Add the RFC-0016 `anti_saturation_semantic_v1` profile and the `codelewm.downstream_anti_saturation_report.v1` diagnostic so a hard downstream reranking pack exposes anti-saturation eligibility before any CodeLeWM scoring.

New `codelewm/eval/downstream_anti_saturation.py`:

- `build_anti_saturation_report` computes eligibility from the RFC-0016 gates: problem_count >= 100, 6-12 candidates per pool, no-action and lexical pass@1 < 0.85, LLM-order pass@1 < 0.90, >= 70% of problems with two or more distinct failing hard-negative classes, plus the source/license, split-leakage, manifest, and secret-scan gates. Saturated, under-covered, or missing-baseline slices are preserved with `eligible=False` and a typed `blocked_reasons` entry; nothing is dropped.
- `build_anti_saturation_claim_gate` opens only when CodeLeWM beats no-action, lexical, AND LLM-order on pass@1 and MRR (with lift CIs excluding zero when supplied) on an eligible >=100-problem slice.
- `compute_model_independent_baselines` derives the no-action / lexical / LLM-order / random pass@1, pool sizes, class coverage, dual-coverage fraction, and parser/apply failure rate from a materialized pack. It is text-only: patch apply is lazy-imported and parseability uses ast.parse.
- `validate_hard_negative_class` enforces the RFC-0016 class enumeration.

`downstream_pack.py` gains optional `profile` (pack config) and `hard_negative_class` (candidate) fields. When the profile is set, the pack build writes `reports/anti_saturation_report.json`, records it in the manifest, feeds eligibility into the readiness report, and reports the path + eligibility on the result. Plain fixture configs are unaffected.

Adds a public-safe fixture pack (one task, six classed candidates) that is non-saturated on every baseline and blocked only by the 1<100 problem count, plus focused tests for eligible/saturated/too-small/missing-baseline cases, the claim gate, and the pack-build + CLI paths.

Closes #419.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
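
The numeric gates and the claim gate described in the commit message can be sketched as follows. This is an illustration under stated assumptions, not the code in `downstream_anti_saturation.py`: the `SliceStats` type, both function names and signatures, and every reason label other than the `problem_count_below_minimum:1<100` format quoted in this PR are hypothetical. The source/license, split-leakage, manifest, and secret-scan gates are omitted, and "lift CI excluding zero" is read here as a lower bound strictly above zero.

```python
from dataclasses import dataclass
from typing import Dict, List, Mapping, Optional, Sequence, Tuple

MIN_PROBLEMS = 100
POOL_RANGE = (6, 12)
MIN_DUAL_COVERAGE = 0.70
# Baseline name -> pass@1 ceiling; a baseline at or above its ceiling is saturated.
CEILINGS = {"no_action": 0.85, "lexical": 0.85, "llm_order": 0.90}


@dataclass
class SliceStats:
    """Hypothetical container for the pack-time statistics the gates consume."""
    problem_count: int
    pool_sizes: Sequence[int]
    baseline_pass_at_1: Mapping[str, Optional[float]]
    dual_coverage_fraction: float


def blocked_reasons(stats: SliceStats) -> List[str]:
    """Collect every failing gate; an empty list means the slice is eligible."""
    reasons: List[str] = []
    if stats.problem_count < MIN_PROBLEMS:
        reasons.append(f"problem_count_below_minimum:{stats.problem_count}<{MIN_PROBLEMS}")
    low, high = POOL_RANGE
    if any(not low <= size <= high for size in stats.pool_sizes):
        reasons.append("pool_size_out_of_range")  # hypothetical label
    for name, ceiling in CEILINGS.items():
        value = stats.baseline_pass_at_1.get(name)
        if value is None:
            reasons.append(f"missing_baseline:{name}")  # hypothetical label
        elif value >= ceiling:
            reasons.append(f"baseline_saturated:{name}")  # hypothetical label
    if stats.dual_coverage_fraction < MIN_DUAL_COVERAGE:
        reasons.append("dual_class_coverage_below_minimum")  # hypothetical label
    return reasons


def claim_gate_open(
    stats: SliceStats,
    codelewm: Mapping[str, float],
    baselines: Mapping[str, Mapping[str, float]],
    lift_ci_lower: Optional[Mapping[Tuple[str, str], float]] = None,
) -> bool:
    """Open only if the slice is eligible and CodeLeWM beats all three baselines."""
    if blocked_reasons(stats):
        return False
    for name in CEILINGS:
        for metric in ("pass_at_1", "mrr"):
            if codelewm[metric] <= baselines[name][metric]:
                return False
            if lift_ci_lower is not None:
                lower = lift_ci_lower.get((name, metric))
                # A supplied CI must exclude zero on the positive side.
                if lower is not None and lower <= 0.0:
                    return False
    return True
```

Reasons are collected rather than returned on the first failure, so a saturated or too-small slice stays in the report with its full list of blockers, which is consistent with the "nothing is dropped" behavior described above.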

AbdelStark merged commit 174533d into main Jun 8, 2026
9 checks passed

AbdelStark deleted the issue-419-anti-saturation-schema branch June 8, 2026 13:28

AbdelStark mentioned this pull request Jun 8, 2026

[TRACKER] v1.5 hard anti-saturation downstream reranking benchmark #417

Closed

7 tasks

AbdelStark merged 1 commit into main from issue-419-anti-saturation-schema
