results: publish hard benchmark artifacts and claim audit (#423) by AbdelStark · Pull Request #429 · AbdelStark/CodeLeWM

AbdelStark · 2026-06-08T14:26:01Z

Summary

Final RFC-0016 stage: assemble a self-contained, manifest-verified, secret-scanned publication artifact set from a built anti-saturation pack and a hard-mode rerank evaluation, with an artifact index and a claim audit whose public wording follows the claim gate exactly. Closes the #417 tracker.

Linked Issue

Closes #423.

Spec / RFC Reference

Spec section: docs/spec/11-llm-world-model-harness.md
RFC: docs/rfcs/RFC-0016-hard-downstream-reranking-benchmark.md

Public Surface Impact

New Python API (codelewm.eval): assemble_hard_downstream_artifact_set, build_hard_downstream_claim_audit, read_hard_downstream_claim_audit, HardDownstreamPublicationResult, HardDownstreamPublishError, DIAGNOSTIC_FALLBACK_WORDING, POSITIVE_CLAIM_WORDING, and the HARD_DOWNSTREAM_PUBLICATION/ARTIFACT_INDEX/CLAIM_AUDIT schema-version constants.

New schema versions (additive): codelewm.hard_downstream_publication.v1, codelewm.hard_downstream_artifact_index.v1, codelewm.hard_downstream_claim_audit.v1.

New CLI command: codelewm eval hard-downstream-publish.

Validation

uv run pytest tests/eval/test_hard_downstream_publish.py tests/docs/test_hard_downstream_artifact_audit.py -q   # passed
uv run pytest tests/docs/ tests/eval/test_hard_downstream_publish.py tests/eval/test_hard_downstream_rerank.py tests/eval/test_hard_downstream_pack.py tests/eval/test_hard_downstream_ingest.py tests/test_imports.py tests/harness/test_output_schemas.py tests/security/test_sandbox_import_boundary.py -q   # 181 passed
uv run pytest tests/harness/ -q   # 106 passed
uv run python -m compileall -q codelewm/eval/hard_downstream_publish.py codelewm/eval/__init__.py codelewm/harness/cli.py
git diff --check

Artifact Impact

eval hard-downstream-publish writes a publication directory: manifest.json (codelewm.hard_downstream_publication.v1, parent_artifacts = pack + rerank ids), claim_audit.json (codelewm.hard_downstream_claim_audit.v1), artifact_index.json (codelewm.hard_downstream_artifact_index.v1, per-file checksums), copies of the benchmark + rerank reports under reports/, and reports/publication_secret_scan_report.json. The build aborts on any secret finding.
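As a rough illustration of what "per-file checksums" in an artifact index can look like, here is a minimal sketch of building and re-verifying such an index. The index layout (`schema_version` plus a `files` mapping of relative path to SHA-256) and the helper names `sha256_of`, `build_index`, and `verify_index` are assumptions for illustration only; the actual `codelewm.hard_downstream_artifact_index.v1` layout is not shown in this PR description.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_index(root: Path) -> dict:
    """Map every file under root (except the index itself) to its checksum.

    The "files" key and the overall shape are hypothetical, not the real schema.
    """
    files = {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file() and p.name != "artifact_index.json"
    }
    return {
        "schema_version": "codelewm.hard_downstream_artifact_index.v1",
        "files": files,
    }


def verify_index(root: Path, index: dict) -> list[str]:
    """Return relative paths that are missing or whose checksum no longer matches."""
    mismatches = []
    for rel, expected in index["files"].items():
        target = root / rel
        if not target.is_file() or sha256_of(target) != expected:
            mismatches.append(rel)
    return mismatches
```

A consumer of the publication directory could run something like `verify_index` before trusting any report; an empty list means every indexed file is present and unchanged.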

Deprecations

none

Caveats / Follow-ups

No locked >=100-problem eligible slice has been published, so the claim gate is closed and the claim audit emits the diagnostic fallback wording; broad_coding_improvement_claim_allowed is false and no positive claim is added to any doc/paper (enforced by tests/docs/test_hard_downstream_artifact_audit.py).
Running the benchmark on a real locked slice (and, if the gate opens, writing the paper addendum) is future research work, not a code gap: the harness and gate are complete.
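The gate-mirroring behaviour described in these caveats can be sketched as follows. This is not the project's implementation: `ClaimAudit`, `audit_claim`, the `other_gate_checks_pass` flag, and the two placeholder wording strings are hypothetical stand-ins. The real `DIAGNOSTIC_FALLBACK_WORDING` and `POSITIVE_CLAIM_WORDING` constants live in `codelewm.eval` and are not reproduced here, and the real gate presumably checks more than slice size.

```python
from dataclasses import dataclass

# Placeholders only: the real wording constants are defined in codelewm.eval.
DIAGNOSTIC_WORDING = "<diagnostic fallback wording>"
POSITIVE_WORDING = "<positive claim wording>"

# Taken from the ">=100-problem eligible slice" requirement stated above.
MIN_ELIGIBLE_PROBLEMS = 100


@dataclass(frozen=True)
class ClaimAudit:
    gate_open: bool
    broad_coding_improvement_claim_allowed: bool
    public_wording: str


def audit_claim(eligible_slice_sizes: list[int], other_gate_checks_pass: bool) -> ClaimAudit:
    """Pick the public wording strictly from the gate state.

    The gate is modelled (as an assumption) as: at least one locked eligible
    slice with >= 100 problems AND every other gate check passing.
    """
    has_large_slice = any(n >= MIN_ELIGIBLE_PROBLEMS for n in eligible_slice_sizes)
    gate_open = has_large_slice and other_gate_checks_pass
    return ClaimAudit(
        gate_open=gate_open,
        # The claim flag mirrors the gate, so it can never be True while the gate is closed.
        broad_coding_improvement_claim_allowed=gate_open,
        public_wording=POSITIVE_WORDING if gate_open else DIAGNOSTIC_WORDING,
    )
```

With no published eligible slice, `audit_claim([], True)` yields a closed gate and the diagnostic wording, which matches the state this PR reports.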

Add the final RFC-0016 stage: assemble a self-contained, manifest-verified, secret-scanned publication artifact set from a built anti-saturation benchmark pack and a hard-mode rerank evaluation, with an artifact index and a claim audit whose public wording follows the claim gate exactly.

New `codelewm/eval/hard_downstream_publish.py`:
- `assemble_hard_downstream_artifact_set` verifies the pack and rerank manifest chains, copies the source/license, split-leakage, anti-saturation, label-construction, LLM-ingest, benchmark, rerank, and claim-gate reports into the publication directory, builds a `codelewm.hard_downstream_artifact_index.v1` with per-file checksums, runs a secret scan over the whole set (incl. `.html`), and writes a publication manifest.
- `build_hard_downstream_claim_audit` records eligible vs saturated slices, missing baselines, the lift confidence intervals, and the exact public wording: the RFC diagnostic fallback when the gate is closed, the positive wording only when it opens. `broad_coding_improvement_claim_allowed` mirrors the gate, so no broad claim is ever asserted while the gate is closed.

CLI: `codelewm eval hard-downstream-publish --pack-manifest ... --rerank-manifest ... --out ...`.

Docs: the downstream benchmark doc gains an evidence-gated implementation-status section (machinery complete; claim gate remains closed; diagnostic wording), and the roadmap marks #418-#423 merged. No broad coding-improvement claim is added because no locked eligible slice has been published. A new docs test asserts the doc/code wording agree and that no positive claim leaks while the gate is closed.

Tests cover the assembled artifact set, index checksum accuracy, the diagnostic claim audit, and the CLI publish path.

Closes #423.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
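To make the "secret scan over the whole set, abort on any finding" step concrete, here is a minimal sketch under stated assumptions. The two regex patterns, the `SecretFound` exception, and the `scan_tree` / `scan_or_abort` helpers are illustrative inventions; the scanner and patterns the project actually uses are not described in this PR.

```python
import re
from pathlib import Path

# Illustrative patterns only; a real scanner would carry a much larger rule set.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"AKIA[0-9A-Z]{16}"),
    "private_key_block": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
}


class SecretFound(RuntimeError):
    """Hypothetical error raised when the scan reports at least one finding."""


def scan_tree(root: Path) -> list[dict]:
    """Scan every file under root (any extension, so .html too) for secret-like strings."""
    findings = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        # errors="replace" keeps the scan going on files that are not valid UTF-8.
        text = path.read_text(encoding="utf-8", errors="replace")
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(text):
                findings.append({"file": path.relative_to(root).as_posix(), "pattern": name})
    return findings


def scan_or_abort(root: Path) -> list[dict]:
    """Return the (empty) findings list, or raise so nothing gets published."""
    findings = scan_tree(root)
    if findings:
        raise SecretFound(f"{len(findings)} secret finding(s); publication aborted")
    return findings
```

Failing closed here means a single match anywhere in the directory blocks the whole publication, which is the behaviour the Artifact Impact section describes.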

AbdelStark merged commit 7a247e6 into main Jun 8, 2026
9 checks passed

AbdelStark deleted the issue-423-publish-claim-audit branch June 8, 2026 14:29

AbdelStark mentioned this pull request Jun 8, 2026

[TRACKER] v1.5 hard anti-saturation downstream reranking benchmark #417

Closed

7 tasks

AbdelStark merged 1 commit into main from issue-423-publish-claim-audit
