results: publish hard benchmark artifacts and claim audit (#423) #429
Merged
Add the final RFC-0016 stage: assemble a self-contained, manifest-verified, secret-scanned publication artifact set from a built anti-saturation benchmark pack and a hard-mode rerank evaluation, with an artifact index and a claim audit whose public wording follows the claim gate exactly.

New `codelewm/eval/hard_downstream_publish.py`:

- `assemble_hard_downstream_artifact_set` verifies the pack and rerank manifest chains, copies the source/license, split-leakage, anti-saturation, label-construction, LLM-ingest, benchmark, rerank, and claim-gate reports into the publication directory, builds a `codelewm.hard_downstream_artifact_index.v1` with per-file checksums, runs a secret scan over the whole set (including `.html`), and writes a publication manifest.
- `build_hard_downstream_claim_audit` records eligible vs saturated slices, missing baselines, the lift confidence intervals, and the exact public wording: the RFC diagnostic fallback when the gate is closed, the positive wording only when it opens. `broad_coding_improvement_claim_allowed` mirrors the gate, so no broad claim is ever asserted while the gate is closed.

CLI: `codelewm eval hard-downstream-publish --pack-manifest ... --rerank-manifest ... --out ...`.

Docs: the downstream benchmark doc gains an evidence-gated implementation-status section (machinery complete; claim gate remains closed; diagnostic wording), and the roadmap marks #418-#423 merged. No broad coding-improvement claim is added because no locked eligible slice has been published. A new docs test asserts that the doc and code wording agree and that no positive claim leaks while the gate is closed.

Tests cover the assembled artifact set, index checksum accuracy, the diagnostic claim audit, and the CLI publish path.

Closes #423.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
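To make the "artifact index with per-file checksums" step concrete, here is a minimal sketch of how such an index could be built and re-verified. It is not the code in `hard_downstream_publish.py`: the function names, the index layout (`files`, `path`, `sha256`, `bytes`), and the choice of SHA-256 are all assumptions for illustration. Only the schema id comes from this PR.

```python
import hashlib
import json
import tempfile
from pathlib import Path

# Schema id taken from the PR description; the index layout below is assumed.
ARTIFACT_INDEX_SCHEMA = "codelewm.hard_downstream_artifact_index.v1"


def sha256_file(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_artifact_index(publication_dir: Path) -> dict:
    """Index every file under publication_dir with its size and checksum."""
    entries = []
    for path in sorted(p for p in publication_dir.rglob("*") if p.is_file()):
        entries.append(
            {
                "path": path.relative_to(publication_dir).as_posix(),
                "sha256": sha256_file(path),
                "bytes": path.stat().st_size,
            }
        )
    return {"schema_version": ARTIFACT_INDEX_SCHEMA, "files": entries}


def verify_artifact_index(publication_dir: Path, index: dict) -> list[str]:
    """Return the relative paths whose on-disk checksum no longer matches."""
    return [
        entry["path"]
        for entry in index["files"]
        if sha256_file(publication_dir / entry["path"]) != entry["sha256"]
    ]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "reports").mkdir()
        (root / "reports" / "benchmark_report.json").write_text("{}", encoding="utf-8")
        index = build_artifact_index(root)
        print(json.dumps(index, indent=2))
        print("mismatches:", verify_artifact_index(root, index))
```

In this sketch the index is built before it is written to disk, so it does not list itself; whether the real index includes its own entry is not stated in this PR.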
Summary
Final RFC-0016 stage: assemble a self-contained, manifest-verified, secret-scanned publication artifact set from a built anti-saturation pack and a hard-mode rerank evaluation, with an artifact index and a claim audit whose public wording follows the claim gate exactly. Closes the #417 tracker.
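As an illustration of the "secret-scanned" property, the sketch below scans every text-like file in a publication directory and aborts on the first non-empty result. The rule set, the scanned suffixes, and the `SecretScanError` type are hypothetical; the scanner actually used by `codelewm` is not shown in this PR.

```python
import re
import tempfile
from pathlib import Path

# Illustrative patterns only; the real scanner's rules are not part of this PR text.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"AKIA[0-9A-Z]{16}"),
    "private_key_block": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    "github_token": re.compile(r"ghp_[A-Za-z0-9]{36}"),
}
# Assumed suffix list; the PR only states that .html files are included.
SCANNED_SUFFIXES = {".json", ".jsonl", ".md", ".txt", ".html"}


class SecretScanError(RuntimeError):
    """Hypothetical error raised when the scan reports any finding."""


def scan_for_secrets(publication_dir: Path) -> list[dict]:
    """Return one finding per (file, line, rule) match, without the matched text."""
    findings = []
    for path in sorted(p for p in publication_dir.rglob("*") if p.is_file()):
        if path.suffix.lower() not in SCANNED_SUFFIXES:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for line_number, line in enumerate(text.splitlines(), start=1):
            for rule, pattern in SECRET_PATTERNS.items():
                if pattern.search(line):
                    findings.append(
                        {
                            "path": path.relative_to(publication_dir).as_posix(),
                            "line": line_number,
                            "rule": rule,
                        }
                    )
    return findings


def assert_no_secrets(publication_dir: Path) -> None:
    """Abort (raise) if the publication directory contains any finding."""
    findings = scan_for_secrets(publication_dir)
    if findings:
        raise SecretScanError(f"{len(findings)} secret finding(s); aborting publication")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "report.html").write_text("<p>no secrets here</p>", encoding="utf-8")
        assert_no_secrets(root)
        print("clean")
```

Findings record only the path, line number, and rule name, so a scan report written next to the artifacts does not itself repeat the secret.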
Linked Issue
Closes #423.
Spec / RFC Reference
- docs/spec/11-llm-world-model-harness.md
- docs/rfcs/RFC-0016-hard-downstream-reranking-benchmark.md
Public Surface Impact
- New Python API (`codelewm.eval`): `assemble_hard_downstream_artifact_set`, `build_hard_downstream_claim_audit`, `read_hard_downstream_claim_audit`, `HardDownstreamPublicationResult`, `HardDownstreamPublishError`, `DIAGNOSTIC_FALLBACK_WORDING`, `POSITIVE_CLAIM_WORDING`, and the `HARD_DOWNSTREAM_PUBLICATION/ARTIFACT_INDEX/CLAIM_AUDIT` schema-version constants.
- New schema versions (additive): `codelewm.hard_downstream_publication.v1`, `codelewm.hard_downstream_artifact_index.v1`, `codelewm.hard_downstream_claim_audit.v1`.
- New CLI command: `codelewm eval hard-downstream-publish`.
Validation
Artifact Impact
`eval hard-downstream-publish` writes a publication directory: `manifest.json` (`codelewm.hard_downstream_publication.v1`, parent_artifacts = pack + rerank ids), `claim_audit.json` (`codelewm.hard_downstream_claim_audit.v1`), `artifact_index.json` (`codelewm.hard_downstream_artifact_index.v1`, per-file checksums), copies of the benchmark + rerank reports under `reports/`, and `reports/publication_secret_scan_report.json`. The build aborts on any secret finding.
Deprecations
none
Caveats / Follow-ups
No >=100-problem eligible slice has been published, so the claim gate is closed and the claim audit emits the diagnostic fallback wording; `broad_coding_improvement_claim_allowed` is false and no positive claim is added to any doc/paper (enforced by `tests/docs/test_hard_downstream_artifact_audit.py`).
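The gating rule described here can be sketched as follows. This is an illustration under stated assumptions, not the body of `build_hard_downstream_claim_audit`: the `SliceSummary` type, the `build_claim_audit` helper, the audit field names other than `broad_coding_improvement_claim_allowed`, and the eligibility rule (not saturated and at least 100 problems) are hypothetical, and the two wording constants are placeholders rather than the real strings.

```python
from dataclasses import dataclass

CLAIM_AUDIT_SCHEMA = "codelewm.hard_downstream_claim_audit.v1"
# Placeholder strings: the real wording is defined in codelewm.eval and RFC-0016.
DIAGNOSTIC_FALLBACK_WORDING = "<diagnostic fallback wording from RFC-0016>"
POSITIVE_CLAIM_WORDING = "<positive claim wording from RFC-0016>"
MIN_ELIGIBLE_PROBLEMS = 100  # threshold mentioned in the caveat; rule itself is assumed


@dataclass(frozen=True)
class SliceSummary:
    """Hypothetical per-slice summary used only for this sketch."""

    name: str
    problem_count: int
    saturated: bool


def build_claim_audit(slices: list[SliceSummary], gate_open: bool) -> dict:
    """Record slice status and pick the public wording strictly from the gate."""
    eligible = [
        s.name
        for s in slices
        if not s.saturated and s.problem_count >= MIN_ELIGIBLE_PROBLEMS
    ]
    saturated = [s.name for s in slices if s.saturated]
    # The allowed flag mirrors the gate; the wording is derived from the flag,
    # so positive wording cannot appear while the gate is closed.
    allowed = bool(gate_open)
    wording = POSITIVE_CLAIM_WORDING if allowed else DIAGNOSTIC_FALLBACK_WORDING
    return {
        "schema_version": CLAIM_AUDIT_SCHEMA,
        "eligible_slices": eligible,
        "saturated_slices": saturated,
        "broad_coding_improvement_claim_allowed": allowed,
        "public_wording": wording,
    }


if __name__ == "__main__":
    audit = build_claim_audit(
        [SliceSummary("hard_small", 40, False), SliceSummary("easy", 500, True)],
        gate_open=False,
    )
    print(audit["broad_coding_improvement_claim_allowed"], audit["public_wording"])
```

Deriving the wording from the single `allowed` flag, rather than computing both independently, keeps the two from drifting apart, which is the property the docs test is described as enforcing.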