Claim-gated code world models for scoring and reranking candidate patches.
CodeLeWM learns latent transition dynamics over Python code edits, then turns those dynamics into reproducible scoring, retrieval, diagnostics, and LLM candidate-reranking artifacts.
CodeLeWM models edit transitions:
CodeState_before + EditAction -> latent(CodeState_after)
It is not a patch generator. It accepts candidate after-states or unified diffs from a caller, codemod, search process, or LLM, then scores and reranks them with a learned transition model and explicit baselines.
The package is built around four practical surfaces:
- data builders for public-safe Python edit transition datasets;
- package-native training and checkpoint manifests;
- retrieval, surprise, latent, scorer-quality, and downstream reranking evals;
- an LLM + world-model harness that captures untrusted candidate patches before scoring them.
uv sync --group dev
uv run codelewm --help
Optional runtimes stay explicit:
uv sync --group dev --group data # HDF5 and Arrow dataset packing
uv sync --group dev --group train # PyTorch training and scoring
uv sync --group dev --group eval # optional evaluation helpers
uv sync --group dev --group llm # OpenRouter candidate generation
uv sync --group dev --group observability # TensorBoard-compatible exports
uv sync --group dev --group tui # optional Textual viewers
Full command reference: docs/usage/USAGE.md.
Build the local smoke artifact set:
uv sync --group dev --group data --group train
uv run scripts/first-results --overwrite
uv run codelewm secret-scan \
.artifacts/first-results \
docs/benchmark/FIRST_RESULTS.md \
--json
Run the fixture LLM + world-model demo:
uv sync --group dev --group data --group train --group llm
uv run scripts/llm-world-model-demo
Build the deterministic paper-demo artifact:
uv run scripts/paper-demo --out .artifacts/paper-demo --overwrite
These commands write schema-versioned artifacts under .artifacts/, verify
manifests where applicable, and keep publishable outputs compatible with the
secret scanner.
CodeLeWM package
raw edits / fixtures / public shards
|
v
+-------------------+
| codelewm.data | source policy, parsing, normalization,
| | dedup, splits, transition manifests
+---------+---------+
|
v
+-------------------+ +--------------------+
| TransitionRecord +------>| HDF5 / artifact |
| before/action/after | packs + manifests |
+---------+---------+ +--------------------+
|
v
+-------------------+
| codelewm.model | CodeStateEncoder, action encoders,
| | JEPA-style predictor, transition energy
+---------+---------+
|
v
+-------------------+ +--------------------+
| codelewm.eval +------>| reports: retrieval,|
| | | surprise, latent, |
| | | downstream gates |
+---------+---------+ +--------------------+
|
v
+-------------------+
| codelewm.harness | score, rerank, index, LLM demo,
| | HTML/terminal/TUI view models
+-------------------+
Load-bearing rules:
- candidate code is untrusted input;
- parsing and scoring do not execute candidate patches by default;
- every publishable artifact is JSON-native where applicable, manifest-backed, checksum-verifiable, and secret-scanned;
- public model-quality claims are opened by reports, not by demos.
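The first two rules can be illustrated with a minimal sketch that inspects candidate code statically. It is not the package's own parser: `inspect_candidate` and the record fields are hypothetical names chosen for this example, and only the standard library is used.

```python
import ast
import hashlib


def inspect_candidate(source: str) -> dict:
    """Parse candidate code statically and return a JSON-native status record.

    Hypothetical helper for illustration; field names are assumptions.
    """
    record = {
        "sha256": hashlib.sha256(source.encode("utf-8")).hexdigest(),
        "parse_ok": False,
        "error": None,
        "top_level_defs": [],
    }
    try:
        # ast.parse only builds a syntax tree; it never runs the candidate.
        tree = ast.parse(source)
    except SyntaxError as exc:
        record["error"] = f"SyntaxError: {exc.msg} (line {exc.lineno})"
        return record
    record["parse_ok"] = True
    record["top_level_defs"] = [
        node.name
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    ]
    return record


good = inspect_candidate("def total(xs):\n    return sum(xs)\n")
bad = inspect_candidate("def total(xs:\n    return sum(xs)\n")
# This source would exit the interpreter if executed; parsing it is harmless.
hostile = inspect_candidate("raise SystemExit(1)\n")
print(good["parse_ok"], good["top_level_defs"])
print(bad["parse_ok"], bad["error"])
print(hostile["parse_ok"])
```

Recording a checksum next to the parse status keeps the record tied to the exact bytes that were inspected, which is one plausible way to make such a record checksum-verifiable.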
CodeLeWM uses a JEPA-style latent transition objective for code edits. The model predicts the after-state latent from the before-state latent and action latent; transition energy is the distance between the predicted and observed after latents.
training row
+-----------------------------------------+
| before code | edit action | after code |
+------+------+-+-----------+-+-----------+
| | |
v v v
+-------------+ +-------------+ +-------------+
| CodeState | | Text/ | | CodeState |
| Encoder | | Abstract | | Encoder |
+------+------+ | Action Enc. | +------+------+
| +------+------+ |
| | |
v v v
z_before action_latent z_after
\ | /
\ v /
+----> latent predictor ---+
|
v
z_after_pred
|
+-----------+-----------+
| transition loss |
| SIGReg/collapse gates |
| retrieval diagnostics |
+-----------+-----------+
|
v
checkpoint + manifest + eval reports
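The transition-energy idea in the diagram above can be sketched in a few lines. This is a toy illustration, not the trained model: the choice of squared Euclidean distance is an assumption (the text only says "distance"), and `toy_predictor` is a hypothetical stand-in for the learned latent predictor.

```python
from typing import Sequence

Vector = Sequence[float]


def transition_energy(z_after_pred: Vector, z_after: Vector) -> float:
    """Squared Euclidean distance between predicted and observed latents.

    The specific distance is an assumption made for this sketch.
    """
    if len(z_after_pred) != len(z_after):
        raise ValueError("latent dimensions must match")
    return sum((p - o) ** 2 for p, o in zip(z_after_pred, z_after))


def toy_predictor(z_before: Vector, action_latent: Vector) -> list:
    """Hypothetical predictor: shifts the before-state by the action latent."""
    return [b + a for b, a in zip(z_before, action_latent)]


z_before = [0.0, 1.0, 2.0]
action_latent = [1.0, 0.0, -1.0]
z_after_pred = toy_predictor(z_before, action_latent)

# An observed after-state that matches the prediction has zero energy.
consistent = transition_energy(z_after_pred, [1.0, 1.0, 1.0])
# An after-state far from the prediction has higher energy.
surprising = transition_energy(z_after_pred, [3.0, 1.0, 0.0])
print(consistent, surprising)
```

Under this reading, a lower energy means the observed edit is closer to what the model expected from the before-state and action.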
Reproducible training and publication use the same outer loop:
source acquisition
-> license and split gates
-> transition shards
-> packed training batches
-> package-native torch training
-> trusted checkpoint manifest
-> downloaded-artifact verification
-> retrieval / surprise / latent / rerank reports
-> claim-gated public wording
The harness is designed for the practical workflow where an LLM proposes candidate patches and CodeLeWM evaluates them as code-edit transitions.
task + bounded repository context
|
v
+-------------------+
| OpenRouter / LLM | dry-run fixtures by default,
| candidate writer | live mode via OPENROUTER_API_KEY
+---------+---------+
|
v
+-------------------+
| candidate pack | untrusted diffs, checksums,
| codelewm.llm_* | parse/apply status, redaction
+---------+---------+
|
+--------+---------+
| |
v v
+-------------+ +-------------------+
| static | | CodeLeWM scorer |
| patch view | | trusted checkpoint|
+------+------+ +---------+---------+
\ /
\ /
v v
+-----------------------+
| rerank report |
| LLM order, no-action, |
| lexical, CodeLeWM |
+-----------+-----------+
|
v
terminal / HTML / JSON / optional TUI
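The rerank report compares several orderings side by side. A minimal sketch of that comparison follows, assuming that lower energy ranks first and using `difflib` as a stand-in lexical baseline. The `rerank` function, the row fields, and the fixed `fake_energies` table are hypothetical and do not reflect the package's report schema.

```python
import difflib
from typing import Callable


def lexical_similarity(before: str, candidate: str) -> float:
    """Lexical baseline: character-level similarity ratio in [0, 1]."""
    return difflib.SequenceMatcher(None, before, candidate).ratio()


def rerank(before: str, candidates: list, energy_fn: Callable[[str, str], float]) -> list:
    """Attach baseline columns to each candidate, then sort by model energy."""
    rows = []
    for llm_rank, candidate in enumerate(candidates):
        rows.append({
            "llm_rank": llm_rank,                    # original LLM order
            "is_no_action": candidate == before,     # no-action baseline
            "lexical": lexical_similarity(before, candidate),
            "energy": energy_fn(before, candidate),  # model-backed score
        })
    # sorted() is stable, so ties keep the original LLM order.
    return sorted(rows, key=lambda row: row["energy"])


before = "total = total + x\n"
candidates = [
    "total = total + x\n",  # unchanged code, i.e. the no-action candidate
    "total += x\n",
    "total = x\n",
]
# Placeholder energies standing in for a trained scorer.
fake_energies = {candidates[0]: 0.9, candidates[1]: 0.2, candidates[2]: 0.7}
report = rerank(before, candidates, lambda b, c: fake_energies[c])
for row in report:
    print(row)
```

Keeping the LLM rank, no-action flag, and lexical score in every row makes it possible to check whether the model-backed ordering differs from the cheap baselines at all.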
Run the default fixture path:
uv run scripts/llm-world-model-demo
Run live candidate generation through OpenRouter:
cp .env.example .env
# Fill OPENROUTER_API_KEY locally. Keep .env untracked.
CODELEWM_LLM_DRY_RUN=0 uv run scripts/llm-world-model-demo
Anthropic BYOK is explicit and routed through OpenRouter:
uv run codelewm openrouter byok-register \
--provider anthropic \
--key-env ANTHROPIC_API_KEY \
--management-key-env OPENROUTER_MANAGEMENT_KEY \
--name "CodeLeWM Anthropic BYOK" \
--allowed-model anthropic/claude-4.5-sonnet \
--dry-run \
--json
Raw provider keys are never written to reports; reports serialize only redacted BYOK metadata.
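One way to picture redacted metadata is a record that carries a short fingerprint instead of the key itself. The sketch below is an assumption about how such redaction could work, not the package's implementation; `redacted_key_metadata` and its field names are hypothetical, and the key is a dummy value.

```python
import hashlib
import json


def redacted_key_metadata(provider: str, name: str, raw_key: str) -> dict:
    """Build report-safe metadata that never contains the raw key.

    Hypothetical helper; the field names are assumptions for this sketch.
    """
    fingerprint = hashlib.sha256(raw_key.encode("utf-8")).hexdigest()[:12]
    return {
        "provider": provider,
        "name": name,
        "key_present": bool(raw_key),
        "key_fingerprint": fingerprint,  # truncated hash, not the key
    }


raw_key = "sk-example-not-a-real-key"  # dummy value for illustration
metadata = redacted_key_metadata("anthropic", "CodeLeWM Anthropic BYOK", raw_key)
report = json.dumps(metadata, sort_keys=True)
print(report)
```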
Score one candidate:
from pathlib import Path
from codelewm.harness import load_scorer
scorer = load_scorer(
    Path(".artifacts/first-results/train/checkpoints/checkpoint.pt"),
    device="cpu",
)
result = scorer.score_files(
    before=Path("tests/fixtures/codestate/class_method_before.py"),
    instruction="rewrite the accumulator update explicitly",
    candidate=Path("config/first_results/scorer_quality_candidates/true_after.py"),
)
print(result.to_dict())
Rerank a candidate directory:
from pathlib import Path
from codelewm.harness import load_scorer
result = load_scorer(
    Path(".artifacts/first-results/train/checkpoints/checkpoint.pt")
).rerank_files(
    before=Path("tests/fixtures/codestate/class_method_before.py"),
    instruction="rewrite the accumulator update explicitly",
    candidates=Path("config/first_results/scorer_quality_candidates"),
)
for item in result.results:
    if hasattr(item, "final_score"):
        print(item.candidate, item.final_score)
    else:
        print(item.artifact, item.error_type, item.message)
| Need | Command |
|---|---|
| Build transition data | codelewm dataset build |
| Pack training batches | codelewm dataset pack |
| Train a transition model | codelewm train |
| Score one candidate | codelewm score |
| Rerank many candidates | codelewm rerank |
| Build a transition index | codelewm index |
| Run retrieval eval | codelewm eval retrieval |
| Run surprise eval | codelewm eval surprise |
| Inspect latent matrices | codelewm eval latent-matrix |
| Evaluate scorer quality | codelewm eval scorer-quality |
| Build downstream rerank packs | codelewm eval downstream-pack |
| Evaluate downstream reranking | codelewm eval downstream-rerank |
| Run the LLM harness | codelewm llm-demo |
| Register OpenRouter BYOK | codelewm openrouter byok-register |
| Verify lineage | codelewm manifest verify |
| Scan publishable artifacts | codelewm secret-scan |
CodeLeWM treats artifacts as part of the API:
- manifests record command, config, source git SHA, checksums, and parent artifacts;
- checkpoint loading goes through trust gates before model-backed scoring;
- reports use stable schema names such as codelewm.eval.retrieval_report.v1, codelewm.llm_candidate_pack.v1, and codelewm.rerank.v1;
- documentation cards and benchmark reports link public artifacts to the exact commands that produced them.
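Checksum-backed manifests of the kind listed above can be sketched with the standard library alone. This is an illustration of the general technique, not the format written by `codelewm manifest verify`: the `example.manifest.v0` schema name, the field layout, and both helper functions are hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_file(path: Path) -> str:
    """Return the hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def build_manifest(root: Path, command: str) -> dict:
    """Record the producing command and a checksum for every file under root."""
    files = sorted(p for p in root.rglob("*") if p.is_file())
    return {
        "schema": "example.manifest.v0",  # hypothetical schema name
        "command": command,
        "checksums": {p.relative_to(root).as_posix(): sha256_file(p) for p in files},
    }


def verify_manifest(root: Path, manifest: dict) -> list:
    """Return relative paths that are missing or whose contents changed."""
    mismatched = []
    for rel, expected in manifest["checksums"].items():
        path = root / rel
        if not path.is_file() or sha256_file(path) != expected:
            mismatched.append(rel)
    return mismatched


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "report.json").write_text(json.dumps({"metric": 1}), encoding="utf-8")
    manifest = build_manifest(root, "example-command --out report.json")
    clean = verify_manifest(root, manifest)
    # Simulate a modified artifact after the manifest was written.
    (root / "report.json").write_text("tampered", encoding="utf-8")
    tampered = verify_manifest(root, manifest)

print(clean, tampered)
```

A real manifest would also carry the config, source git SHA, and parent artifacts named in the list above; they are omitted here to keep the sketch short.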
Useful entry points:
- final paper source: docs/papers/codelewm_final_paper.tex;
- final paper PDF: docs/papers/codelewm_final_paper.pdf;
- usage guide: docs/usage/USAGE.md;
- paper-demo artifact set: docs/benchmark/v1_0/paper_demo;
- paper-demo artifact note: docs/benchmark/PAPER_DEMO_V1_0_ARTIFACTS_2026-06-08.md;
- final claim audit: docs/benchmark/V1_0_FINAL_CLAIM_AUDIT_2026-06-08.md;
- public artifact index: docs/benchmark/PUBLIC_ARTIFACT_INDEX_2026-06-08.md;
- Hugging Face execution dataset: https://huggingface.co/datasets/abdelstark/codelewm-execution-pack;
- Hugging Face transition-model artifacts: https://huggingface.co/abdelstark/codelewm-transition-model;
- Hugging Face run artifacts: https://huggingface.co/datasets/abdelstark/codelewm-runs;
- reproducibility checklist: docs/release/V1_0_REPRODUCIBILITY_CHECKLIST_2026-06-08.md;
- release announcement: docs/announcements/FINAL_V1_0_RELEASE_2026-06-08.md.
CodeLeWM is deliberately claim-gated. Runnable demos, green manifests, and published artifacts are necessary evidence, but they do not automatically imply model-quality or coding-usefulness claims.
Supported wording:
- reproducible code-edit world-model harness;
- manifest-backed public artifacts;
- negative action-use evidence for the tested action-conditioned checkpoints;
- narrow HumanEval WS-D reranking slice in the checked-in replay evidence.
Blocked wording:
- broad coding improvement;
- live patch utility;
- validated semantic latent axes;
- general downstream improvement across benchmarks.
The detailed audit lives in
docs/benchmark/V1_0_FINAL_CLAIM_AUDIT_2026-06-08.md.
uv sync --group dev
uv run python -m pytest tests/
uv run python -m compileall -q -x 'tests/fixtures/codestate/invalid_(before|after)\.py$' codelewm tests
uv run codelewm --help
Package and artifact gates are documented in CONTRIBUTING.md and docs/release/RELEASE_CHECKLIST.md.
CodeLeWM starts from the LeWorldModel codebase and keeps its JEPA-style model shape as the implementation seed:
@article{maes_lelidec2026lewm,
title={LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels},
author={Maes, Lucas and Le Lidec, Quentin and Scieur, Damien and LeCun, Yann and Balestriero, Randall},
journal={arXiv preprint},
year={2026}
}


