CodeLeWM

Claim-gated code world models for scoring and reranking candidate patches.

CodeLeWM learns latent transition dynamics over Python code edits, then turns those dynamics into reproducible scoring, retrieval, diagnostics, and LLM candidate-reranking artifacts.

Figure (candidate pack): generated patch options remain inspectable with parse/apply status.
Figure (scorer trace): transition energies compare candidates against the no-action baseline.
Figure (execution tour): task cards expose pass/fail labels, CodeLeWM order, and candidate code.
Figure (ranking trace): CodeLeWM, LLM-order, and lexical ranks are compared as diagnostic evidence.

What It Does

CodeLeWM models edit transitions:

CodeState_before + EditAction -> latent(CodeState_after)

It is not a patch generator. It accepts candidate after-states or unified diffs from a caller, codemod, search process, or LLM, then scores and reranks them with a learned transition model and explicit baselines.

The package is built around four practical surfaces:

data builders for public-safe Python edit transition datasets;
package-native training and checkpoint manifests;
retrieval, surprise, latent, scorer-quality, and downstream reranking evals;
an LLM + world-model harness that captures untrusted candidate patches before scoring them.

Install

uv sync --group dev
uv run codelewm --help

Optional runtimes stay explicit:

uv sync --group dev --group data          # HDF5 and Arrow dataset packing
uv sync --group dev --group train         # PyTorch training and scoring
uv sync --group dev --group eval          # optional evaluation helpers
uv sync --group dev --group llm           # OpenRouter candidate generation
uv sync --group dev --group observability # TensorBoard-compatible exports
uv sync --group dev --group tui           # optional Textual viewers

Full command reference: docs/usage/USAGE.md.

Quickstart

Build the local smoke artifact set:

uv sync --group dev --group data --group train
uv run scripts/first-results --overwrite
uv run codelewm secret-scan \
  .artifacts/first-results \
  docs/benchmark/FIRST_RESULTS.md \
  --json

Run the fixture LLM + world-model demo:

uv sync --group dev --group data --group train --group llm
uv run scripts/llm-world-model-demo

Build the deterministic paper-demo artifact:

uv run scripts/paper-demo --out .artifacts/paper-demo --overwrite

These commands write schema-versioned artifacts under .artifacts/, verify manifests where applicable, and keep publishable outputs compatible with the secret scanner.

Architecture

                         CodeLeWM package

  raw edits / fixtures / public shards
                 |
                 v
        +-------------------+
        | codelewm.data     |  source policy, parsing, normalization,
        |                   |  dedup, splits, transition manifests
        +---------+---------+
                  |
                  v
        +-------------------+       +--------------------+
        | TransitionRecord  +------>| HDF5 / artifact    |
        | before/action/after       | packs + manifests  |
        +---------+---------+       +--------------------+
                  |
                  v
        +-------------------+
        | codelewm.model    |  CodeStateEncoder, action encoders,
        |                   |  JEPA-style predictor, transition energy
        +---------+---------+
                  |
                  v
        +-------------------+       +--------------------+
        | codelewm.eval     +------>| reports: retrieval,|
        |                   |       | surprise, latent,  |
        |                   |       | downstream gates   |
        +---------+---------+       +--------------------+
                  |
                  v
        +-------------------+
        | codelewm.harness  |  score, rerank, index, LLM demo,
        |                   |  HTML/terminal/TUI view models
        +-------------------+

Load-bearing rules:

candidate code is untrusted input;
parsing and scoring do not execute candidate patches by default;
every publishable artifact is JSON-native where applicable, manifest-backed, checksum-verifiable, and secret-scanned;
public model-quality claims are opened by reports, not by demos.
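
The rule that parsing does not execute candidate patches can be illustrated with the standard-library `ast` module, which builds a syntax tree without running any of the code it reads. This is a minimal sketch, not CodeLeWM's parser: the `parse_status` helper and its return fields are hypothetical names chosen for the example.

```python
import ast


def parse_status(candidate_source: str) -> dict:
    """Report whether a candidate parses, without ever executing it.

    Hypothetical helper for illustration; not part of the codelewm package.
    """
    try:
        tree = ast.parse(candidate_source)
    except (SyntaxError, ValueError) as exc:
        # ValueError covers sources containing null bytes on older Pythons.
        return {"parse_ok": False, "error_type": type(exc).__name__}
    top_level_defs = [
        node.name
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    ]
    return {"parse_ok": True, "top_level_defs": top_level_defs}


# This candidate would delete a file if imported; parsing never triggers that.
hostile = "import os\nos.remove('important.txt')\ndef f():\n    return 1\n"
broken = "def f(:\n    return 1\n"

hostile_status = parse_status(hostile)
broken_status = parse_status(broken)
print(hostile_status)
print(broken_status)
```

Treating the candidate as data keeps the side-effecting `os.remove` call inert: it appears in the tree as a node but is never evaluated.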

Training Paradigm

CodeLeWM uses a JEPA-style latent transition objective for code edits. The model predicts the after-state latent from the before-state latent and action latent; transition energy is the distance between the predicted and observed after latents.

                 training row
  +-----------------------------------------+
  | before code | edit action | after code  |
  +------+------+-+-----------+-+-----------+
         |        |             |
         v        v             v
  +-------------+ +-------------+ +-------------+
  | CodeState   | | Text/       | | CodeState   |
  | Encoder     | | Abstract    | | Encoder     |
  +------+------+ | Action Enc. | +------+------+
         |        +------+------+        |
         |               |               |
         v               v               v
     z_before       action_latent     z_after
         \               |              /
          \              v             /
           +----> latent predictor ---+
                         |
                         v
                   z_after_pred
                         |
             +-----------+-----------+
             | transition loss       |
             | SIGReg/collapse gates |
             | retrieval diagnostics |
             +-----------+-----------+
                         |
                         v
          checkpoint + manifest + eval reports
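
As a toy illustration of the objective described above, the sketch below computes a transition energy from hand-written latent vectors. Two things here are assumptions rather than the package's behavior: the predictor is a stand-in that simply adds the action latent to the before latent, and the distance is squared Euclidean (the text only says "distance"). The function names are hypothetical.

```python
def predict_after(z_before, action_latent):
    # Hypothetical stand-in for the learned latent predictor:
    # element-wise addition of the state and action latents.
    return [b + a for b, a in zip(z_before, action_latent)]


def transition_energy(z_after_pred, z_after):
    # Assumed metric: squared Euclidean distance between the
    # predicted and the observed after-state latents.
    return sum((p - o) ** 2 for p, o in zip(z_after_pred, z_after))


z_before = [0.0, 1.0, 2.0]
action_latent = [1.0, 0.0, -1.0]

z_after_matching = [1.0, 1.0, 1.0]    # an after-state the prediction lands on
z_after_unrelated = [3.0, -1.0, 1.0]  # an after-state far from the prediction

z_after_pred = predict_after(z_before, action_latent)
energy_match = transition_energy(z_after_pred, z_after_matching)
energy_unrelated = transition_energy(z_after_pred, z_after_unrelated)
print(energy_match, energy_unrelated)
```

Under these assumptions, an after-state that agrees with the prediction has lower energy than one that does not, which is the property the scorer relies on.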

Reproducible training and publication use the same outer loop:

source acquisition
  -> license and split gates
  -> transition shards
  -> packed training batches
  -> package-native torch training
  -> trusted checkpoint manifest
  -> downloaded-artifact verification
  -> retrieval / surprise / latent / rerank reports
  -> claim-gated public wording

LLM + World-Model Harness

The harness is designed for the practical workflow where an LLM proposes candidate patches and CodeLeWM evaluates them as code-edit transitions.

  task + bounded repository context
              |
              v
     +-------------------+
     | OpenRouter / LLM  |  dry-run fixtures by default,
     | candidate writer  |  live mode via OPENROUTER_API_KEY
     +---------+---------+
               |
               v
     +-------------------+
     | candidate pack    |  untrusted diffs, checksums,
     | codelewm.llm_*    |  parse/apply status, redaction
     +---------+---------+
               |
      +--------+---------+
      |                  |
      v                  v
+-------------+   +-------------------+
| static      |   | CodeLeWM scorer   |
| patch view  |   | trusted checkpoint|
+------+------+   +---------+---------+
       \                  /
        \                /
         v              v
     +-----------------------+
     | rerank report         |
     | LLM order, no-action, |
     | lexical, CodeLeWM     |
     +-----------+-----------+
                 |
                 v
     terminal / HTML / JSON / optional TUI
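
The comparison at the heart of the rerank report can be sketched in a few lines. This is an illustrative toy, not the `codelewm.rerank.v1` schema: the `Candidate` fields, the energy values, and the convention that lower energy ranks first are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    llm_rank: int   # position in the order the LLM emitted its candidates
    energy: float   # hypothetical transition energy; lower is assumed better


def rerank(candidates, no_action_energy):
    """Order candidates by energy and compare each to the no-action baseline."""
    # Ties on energy fall back to the LLM order so the result is deterministic.
    ordered = sorted(candidates, key=lambda c: (c.energy, c.llm_rank))
    return [
        {
            "name": c.name,
            "world_model_rank": rank,
            "llm_rank": c.llm_rank,
            "beats_no_action": c.energy < no_action_energy,
        }
        for rank, c in enumerate(ordered, start=1)
    ]


candidates = [
    Candidate("patch_a", llm_rank=1, energy=0.90),
    Candidate("patch_b", llm_rank=2, energy=0.20),
    Candidate("patch_c", llm_rank=3, energy=0.55),
]
report = rerank(candidates, no_action_energy=0.60)
for row in report:
    print(row)
```

Keeping the LLM rank next to the world-model rank makes disagreements between the two orders visible, and the baseline flag shows which candidates score no better than leaving the code unchanged.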

Run the default fixture path:

uv run scripts/llm-world-model-demo

Run live candidate generation through OpenRouter:

cp .env.example .env
# Fill OPENROUTER_API_KEY locally. Keep .env untracked.
CODELEWM_LLM_DRY_RUN=0 uv run scripts/llm-world-model-demo

Anthropic BYOK is explicit and routed through OpenRouter:

uv run codelewm openrouter byok-register \
  --provider anthropic \
  --key-env ANTHROPIC_API_KEY \
  --management-key-env OPENROUTER_MANAGEMENT_KEY \
  --name "CodeLeWM Anthropic BYOK" \
  --allowed-model anthropic/claude-4.5-sonnet \
  --dry-run \
  --json

Raw provider keys are never written to reports; reports serialize only redacted BYOK metadata.
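
One way to serialize key metadata without the key itself is to store only a short digest. The sketch below is a guess at the general idea, not CodeLeWM's redaction code: `redact_key`, the field names, and the 12-character fingerprint length are hypothetical, and the key shown is a made-up placeholder.

```python
import hashlib
import json


def redact_key(provider: str, raw_key: str) -> dict:
    """Return report-safe metadata that never contains the raw key.

    Hypothetical helper for illustration only.
    """
    digest = hashlib.sha256(raw_key.encode("utf-8")).hexdigest()
    return {
        "provider": provider,
        "redacted": True,
        # A truncated digest lets two reports be matched to the same key
        # without revealing any of the key's characters.
        "fingerprint": digest[:12],
    }


fake_key = "sk-example-not-a-real-key"  # placeholder, not a real credential
metadata = redact_key("anthropic", fake_key)
serialized = json.dumps(metadata, sort_keys=True)
print(serialized)
```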

Python API

Score one candidate:

from pathlib import Path
from codelewm.harness import load_scorer

scorer = load_scorer(
    Path(".artifacts/first-results/train/checkpoints/checkpoint.pt"),
    device="cpu",
)
result = scorer.score_files(
    before=Path("tests/fixtures/codestate/class_method_before.py"),
    instruction="rewrite the accumulator update explicitly",
    candidate=Path("config/first_results/scorer_quality_candidates/true_after.py"),
)
print(result.to_dict())

Rerank a candidate directory:

from pathlib import Path
from codelewm.harness import load_scorer

result = load_scorer(
    Path(".artifacts/first-results/train/checkpoints/checkpoint.pt")
).rerank_files(
    before=Path("tests/fixtures/codestate/class_method_before.py"),
    instruction="rewrite the accumulator update explicitly",
    candidates=Path("config/first_results/scorer_quality_candidates"),
)

for item in result.results:
    if hasattr(item, "final_score"):
        print(item.candidate, item.final_score)
    else:
        print(item.artifact, item.error_type, item.message)

CLI Surface

| Need | Command |
| --- | --- |
| Build transition data | `codelewm dataset build` |
| Pack training batches | `codelewm dataset pack` |
| Train a transition model | `codelewm train` |
| Score one candidate | `codelewm score` |
| Rerank many candidates | `codelewm rerank` |
| Build a transition index | `codelewm index` |
| Run retrieval eval | `codelewm eval retrieval` |
| Run surprise eval | `codelewm eval surprise` |
| Inspect latent matrices | `codelewm eval latent-matrix` |
| Evaluate scorer quality | `codelewm eval scorer-quality` |
| Build downstream rerank packs | `codelewm eval downstream-pack` |
| Evaluate downstream reranking | `codelewm eval downstream-rerank` |
| Run the LLM harness | `codelewm llm-demo` |
| Register OpenRouter BYOK | `codelewm openrouter byok-register` |
| Verify lineage | `codelewm manifest verify` |
| Scan publishable artifacts | `codelewm secret-scan` |

Artifact Contract

CodeLeWM treats artifacts as part of the API:

manifests record command, config, source git SHA, checksums, and parent artifacts;
checkpoint loading goes through trust gates before model-backed scoring;
reports use stable schema names such as codelewm.eval.retrieval_report.v1, codelewm.llm_candidate_pack.v1, and codelewm.rerank.v1;
documentation cards and benchmark reports link public artifacts to the exact commands that produced them.
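
Checksum verification of a manifest can be sketched with the standard library alone. The manifest layout below (a `checksums` mapping from relative path to SHA-256 hex digest) is an assumption for the example and may not match the real manifest schema; `codelewm manifest verify` remains the supported check.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()


def verify_manifest(manifest_path: Path) -> list[str]:
    """Return the relative paths whose contents no longer match the manifest.

    Assumes a hypothetical layout: {"checksums": {"<relative path>": "<sha256>"}}.
    """
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    root = manifest_path.parent
    mismatches = []
    for rel_path, expected in manifest["checksums"].items():
        target = root / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            mismatches.append(rel_path)
    return mismatches


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    report = root / "report.json"
    report.write_text('{"schema": "example.report.v1"}', encoding="utf-8")

    manifest_path = root / "manifest.json"
    manifest_path.write_text(
        json.dumps({"checksums": {"report.json": sha256_of(report)}}),
        encoding="utf-8",
    )

    clean = verify_manifest(manifest_path)       # artifact untouched
    report.write_text('{"schema": "tampered"}', encoding="utf-8")
    tampered = verify_manifest(manifest_path)    # artifact modified afterwards

print(clean, tampered)
```

Returning the list of mismatched paths, rather than a single boolean, lets a caller report exactly which artifact drifted from its recorded checksum.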

Useful entry points:

final paper source: docs/papers/codelewm_final_paper.tex;
final paper PDF: docs/papers/codelewm_final_paper.pdf;
usage guide: docs/usage/USAGE.md;
paper-demo artifact set: docs/benchmark/v1_0/paper_demo;
paper-demo artifact note: docs/benchmark/PAPER_DEMO_V1_0_ARTIFACTS_2026-06-08.md;
final claim audit: docs/benchmark/V1_0_FINAL_CLAIM_AUDIT_2026-06-08.md;
public artifact index: docs/benchmark/PUBLIC_ARTIFACT_INDEX_2026-06-08.md;
Hugging Face execution dataset: https://huggingface.co/datasets/abdelstark/codelewm-execution-pack;
Hugging Face transition-model artifacts: https://huggingface.co/abdelstark/codelewm-transition-model;
Hugging Face run artifacts: https://huggingface.co/datasets/abdelstark/codelewm-runs;
reproducibility checklist: docs/release/V1_0_REPRODUCIBILITY_CHECKLIST_2026-06-08.md;
release announcement: docs/announcements/FINAL_V1_0_RELEASE_2026-06-08.md.

Claim Contract

CodeLeWM is deliberately claim-gated. Runnable demos, green manifests, and published artifacts are necessary evidence, but they do not automatically imply model-quality or coding-usefulness claims.

Supported wording:

reproducible code-edit world-model harness;
manifest-backed public artifacts;
negative action-use evidence for the tested action-conditioned checkpoints;
narrow HumanEval WS-D reranking slice in the checked-in replay evidence.

Blocked wording:

broad coding improvement;
live patch utility;
validated semantic latent axes;
general downstream improvement across benchmarks.

The detailed audit lives in docs/benchmark/V1_0_FINAL_CLAIM_AUDIT_2026-06-08.md.

Development

uv sync --group dev
uv run python -m pytest tests/
uv run python -m compileall -q -x 'tests/fixtures/codestate/invalid_(before|after)\.py$' codelewm tests
uv run codelewm --help

Package and artifact gates are documented in CONTRIBUTING.md and docs/release/RELEASE_CHECKLIST.md.

Attribution

CodeLeWM starts from the LeWorldModel codebase and keeps its JEPA-style model shape as the implementation seed:

@article{maes_lelidec2026lewm,
  title={LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels},
  author={Maes, Lucas and Le Lidec, Quentin and Scieur, Damien and LeCun, Yann and Balestriero, Randall},
  journal={arXiv preprint},
  year={2026}
}
