results: publish hard benchmark artifacts and claim audit (#423) #429

Merged
AbdelStark merged 1 commit into main from issue-423-publish-claim-audit
Jun 8, 2026

@AbdelStark

Owner

Summary

Final RFC-0016 stage: assemble a self-contained, manifest-verified, secret-scanned publication artifact set from a built anti-saturation pack and a hard-mode rerank evaluation, with an artifact index and a claim audit whose public wording follows the claim gate exactly. Closes the #417 tracker.

Linked Issue

Closes #423.

Spec / RFC Reference

  • Spec section: docs/spec/11-llm-world-model-harness.md
  • RFC: docs/rfcs/RFC-0016-hard-downstream-reranking-benchmark.md

Public Surface Impact

New Python API (codelewm.eval): assemble_hard_downstream_artifact_set, build_hard_downstream_claim_audit, read_hard_downstream_claim_audit, HardDownstreamPublicationResult, HardDownstreamPublishError, DIAGNOSTIC_FALLBACK_WORDING, POSITIVE_CLAIM_WORDING, and the HARD_DOWNSTREAM_PUBLICATION/ARTIFACT_INDEX/CLAIM_AUDIT schema-version constants.

New schema versions (additive): codelewm.hard_downstream_publication.v1, codelewm.hard_downstream_artifact_index.v1, codelewm.hard_downstream_claim_audit.v1.

New CLI command: codelewm eval hard-downstream-publish.

Validation

uv run pytest tests/eval/test_hard_downstream_publish.py tests/docs/test_hard_downstream_artifact_audit.py -q   # passed
uv run pytest tests/docs/ tests/eval/test_hard_downstream_publish.py tests/eval/test_hard_downstream_rerank.py tests/eval/test_hard_downstream_pack.py tests/eval/test_hard_downstream_ingest.py tests/test_imports.py tests/harness/test_output_schemas.py tests/security/test_sandbox_import_boundary.py -q   # 181 passed
uv run pytest tests/harness/ -q   # 106 passed
uv run python -m compileall -q codelewm/eval/hard_downstream_publish.py codelewm/eval/__init__.py codelewm/harness/cli.py
git diff --check

Artifact Impact

eval hard-downstream-publish writes a publication directory containing:

  • manifest.json (codelewm.hard_downstream_publication.v1, parent_artifacts = pack + rerank ids)
  • claim_audit.json (codelewm.hard_downstream_claim_audit.v1)
  • artifact_index.json (codelewm.hard_downstream_artifact_index.v1, per-file checksums)
  • copies of the benchmark + rerank reports under reports/
  • reports/publication_secret_scan_report.json

The build aborts on any secret finding.
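For reviewers who want to picture the per-file checksums in artifact_index.json, here is a minimal sketch of building and re-verifying such an index. The hash algorithm (SHA-256), the `files` key, and the helper names `sha256_of`, `build_index`, and `verify_index` are assumptions made for illustration. The actual index layout is defined in codelewm/eval/hard_downstream_publish.py and may differ.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 (assumed algorithm) and return the hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_index(root: Path) -> dict:
    """Map every file under root (except the index itself) to its checksum."""
    files = {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file() and p.name != "artifact_index.json"
    }
    # The "files" key is a hypothetical layout, not the project's confirmed schema.
    return {
        "schema_version": "codelewm.hard_downstream_artifact_index.v1",
        "files": files,
    }


def verify_index(root: Path, index: dict) -> list[str]:
    """Return the relative paths that are missing or whose checksum changed."""
    mismatched = []
    for rel, expected in index["files"].items():
        target = root / rel
        if not target.is_file() or sha256_of(target) != expected:
            mismatched.append(rel)
    return mismatched
```

An empty list from `verify_index` means every indexed file is still byte-identical to what was published. Any entry in the list points at a file that was altered or removed after the index was written.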

Deprecations

none

Caveats / Follow-ups

  • No locked >=100-problem eligible slice has been published, so the claim gate is closed and the claim audit emits the diagnostic fallback wording; broad_coding_improvement_claim_allowed is false and no positive claim is added to any doc/paper (enforced by tests/docs/test_hard_downstream_artifact_audit.py).
  • Running the benchmark on a real locked slice (and, if the gate opens, writing the paper addendum) is future research work, not a code gap: the harness and gate are complete.
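The closed-gate behaviour described in these caveats can be illustrated with a small sketch. It models only the slice-size condition (a locked slice with at least 100 eligible problems). The real gate very likely checks more, such as baselines and lift confidence intervals. The two wording strings are placeholders, because the text of the real `DIAGNOSTIC_FALLBACK_WORDING` and `POSITIVE_CLAIM_WORDING` constants is not reproduced here. `ClaimAudit`, `audit_claim`, and `MIN_ELIGIBLE_PROBLEMS` are hypothetical names.

```python
from dataclasses import dataclass

# Placeholders: the real constants live in codelewm.eval and carry the RFC wording.
DIAGNOSTIC_FALLBACK_WORDING = "<diagnostic fallback wording>"
POSITIVE_CLAIM_WORDING = "<positive claim wording>"

# Taken from the ">=100-problem eligible slice" caveat; the name is hypothetical.
MIN_ELIGIBLE_PROBLEMS = 100


@dataclass(frozen=True)
class ClaimAudit:
    gate_open: bool
    broad_coding_improvement_claim_allowed: bool
    public_wording: str


def audit_claim(locked: bool, eligible_problems: int) -> ClaimAudit:
    """Derive the public wording strictly from the gate state."""
    gate_open = locked and eligible_problems >= MIN_ELIGIBLE_PROBLEMS
    return ClaimAudit(
        gate_open=gate_open,
        # The claim flag mirrors the gate, so it can never be true while closed.
        broad_coding_improvement_claim_allowed=gate_open,
        public_wording=POSITIVE_CLAIM_WORDING if gate_open else DIAGNOSTIC_FALLBACK_WORDING,
    )
```

With no locked slice published, `audit_claim(locked=False, eligible_problems=0)` yields the diagnostic wording and a false claim flag, which matches the state this PR ships in.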

Add the final RFC-0016 stage: assemble a self-contained, manifest-verified,
secret-scanned publication artifact set from a built anti-saturation benchmark
pack and a hard-mode rerank evaluation, with an artifact index and a claim
audit whose public wording follows the claim gate exactly.

New `codelewm/eval/hard_downstream_publish.py`:
- `assemble_hard_downstream_artifact_set` verifies the pack and rerank
  manifest chains, copies the source/license, split-leakage, anti-saturation,
  label-construction, LLM-ingest, benchmark, rerank, and claim-gate reports
  into the publication directory, builds a
  `codelewm.hard_downstream_artifact_index.v1` with per-file checksums, runs a
  secret scan over the whole set (incl. `.html`), and writes a publication
  manifest.
- `build_hard_downstream_claim_audit` records eligible vs saturated slices,
  missing baselines, the lift confidence intervals, and the exact public
  wording: the RFC diagnostic fallback when the gate is closed, the positive
  wording only when it opens. `broad_coding_improvement_claim_allowed` mirrors
  the gate, so no broad claim is ever asserted while the gate is closed.
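The secret-scan step in the first bullet (scan the whole set, abort on any finding) might look roughly like the sketch below. The two regex patterns are illustrative only, and `SecretFound`, `scan_for_secrets`, and `require_clean` are hypothetical names. The scanner actually used by `assemble_hard_downstream_artifact_set` is not shown in this PR description and may work differently.

```python
import re
from pathlib import Path

# Illustrative patterns only; a real scanner would use a much broader rule set.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"AKIA[0-9A-Z]{16}"),
    "private_key_block": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
}


class SecretFound(RuntimeError):
    """Raised when the publication set contains a suspected secret."""


def scan_for_secrets(root: Path) -> list[dict]:
    """Scan every file under root (including .html) and collect findings."""
    findings = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        # Decode leniently so binary or oddly encoded files do not stop the scan.
        text = path.read_text(encoding="utf-8", errors="replace")
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(text):
                findings.append(
                    {"file": path.relative_to(root).as_posix(), "pattern": name}
                )
    return findings


def require_clean(root: Path) -> None:
    """Abort the build if any finding exists, mirroring the abort-on-finding rule."""
    findings = scan_for_secrets(root)
    if findings:
        raise SecretFound(f"{len(findings)} secret finding(s); aborting publication")
```

Raising instead of returning a status keeps the failure mode loud: a caller cannot forget to check the result and publish a set that contains a finding.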

CLI: `codelewm eval hard-downstream-publish --pack-manifest ... --rerank-manifest ... --out ...`.

Docs: the downstream benchmark doc gains an evidence-gated implementation-status
section (machinery complete; claim gate remains closed; diagnostic wording),
and the roadmap marks #418-#423 merged. No broad coding-improvement claim is
added because no locked eligible slice has been published. A new docs test
asserts the doc/code wording agree and that no positive claim leaks while the
gate is closed.

Tests cover the assembled artifact set, index checksum accuracy, the
diagnostic claim audit, and the CLI publish path.

Closes #423.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@AbdelStark AbdelStark merged commit 7a247e6 into main Jun 8, 2026
9 checks passed
@AbdelStark AbdelStark deleted the issue-423-publish-claim-audit branch June 8, 2026 14:29