QEC AI-Enhanced Decoder — AutoQEC

AutoQEC is an LLM-agent-driven auto-research harness for discovering neural predecoders for quantum error-correcting codes. Given an environment triple (code_spec, noise_model, constraints), the system runs 10–20 rounds of hypothesis → DSL config → training → evaluation → analysis and emits verified predecoder checkpoints on the accuracy–latency–parameters Pareto front.

⚡ Review in one prompt

Clone the repo, open a Claude Code or Codex CLI session in the project root, and paste the block below verbatim. The agent will install dependencies, run the six headline demos, and hand back a pass/fail table. Total wall-clock on CPU-only hardware: ~15–20 min. If you're in a hurry, skip step 0's unit suite, which saves roughly 3–4 min.

Run the 6 demos below in order and report results. All commands run from the
repo root. First create and activate .venv (Windows: `.venv\Scripts\activate`,
Unix: `source .venv/bin/activate`), then `pip install -e '.[dev]'`. Do NOT
enter the live-LLM path (nested CLI sessions are unstable). For each step
report: cmd, wall-clock, key artifact paths, pass/fail against the stated
criteria. At the end give one markdown summary table.

0. Preflight (optional but strongly recommended):
   - `ruff check autoqec cli tests scripts` must print "All checks passed"
   - `pytest tests/ -m "not integration" -q` must be all green (~4 min, 290 tests)
   - `pytest tests/test_reward_hacking_traps.py -m integration --run-integration -v`
     must be 3/3 green (trap_A/B/C verifier guards)

1. Demo 1 — surface-code end-to-end loop (~2.5 min):
   `bash demos/demo-1-surface-d5/run_quick.sh`
   Pass criteria: in the newest `runs/<ts>/`, all of `history.jsonl`,
   `candidate_pareto.json`, `fork_graph.json`,
   `round_1/{config.yaml,train.log,checkpoint.pt,metrics.json}` exist;
   `metrics.json.status == "ok"`.

2. Demo 2 — qLDPC cross-code-family (~3 min, routes through OSD not MWPM):
   `MODE=fast bash demos/demo-2-bb72/run.sh`
   Pass criteria: in the new `runs/<ts>/round_1/metrics.json`, `status == "ok"`;
   the env config uses `classical_backend: osd` (prove the backend actually
   switched away from the surface-code MWPM path).

3. Demo 3 — worktree branches-as-Pareto provenance (~30 s, no LLM):
   `bash demos/demo-3-worktree-provenance/run.sh`
   Pass criteria: three `run-round` invocations all exit 0; every
   `metrics.json` has a non-empty `branch` and a real `commit_sha`; the
   compose worktree contains both `round_1/round_1_pointer.json` and
   `round_2/round_2_pointer.json` (proving both parents merged); the
   deliberate-conflict step yields `status="compose_conflict"` with
   `conflicting_files` populated.

4. Demo 4 — reward-hacking rejection (~1 min, memorizer must be rejected):
   `bash demos/demo-4-reward-hacking/run.sh`
   Pass criteria: `runs/demo-4/round_0/verification_report.json.verdict` is in
   {SUSPICIOUS, FAILED} — never VERIFIED; script exits 0.

5. Demo 5 — failure root-cause diagnosis (~5 s):
   `bash demos/demo-5-failure-recovery/run.sh`
   Pass criteria: stdout JSON contains `"status": "compile_error"` and a
   `status_reason` field.

6. Demo 6 — advisor replay / offline reproducibility (~2 min):
   `bash demos/demo-6-advisor-replay/run.sh`
   Pass criteria: `cli.autoqec package-run` produces a `runs/<run_id>.tar.gz`
   archive; `autoqec.tools.advisor_replay` extracts it under
   `runs/demo-6-replay/` and re-runs `cli.autoqec verify` with all
   `AUTOQEC_*_BACKEND` env vars unset and sockets blocked; the offline
   replay verify exits 0.

Summary table columns: demo | cmd | wall-clock | key artifact | pass/fail.
Rules:
- If any step fails, keep going to the next step; record the fail in the
  table rather than aborting.
- On failure, paste the last 30 lines of that step's stdout/stderr — do not
  silently swallow errors.
- Do not modify source files; only read, run, and report.
- If every demo passes, end with the line "harness end-to-end healthy"; if any
  failed, end with "needs triage: <which step>".
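
For reviewers who prefer to check a step by hand, the Demo 1 pass criteria can be approximated with a short script. This is a minimal sketch, not part of the repo: the helper names `newest_run` and `check_demo1` are hypothetical, and picking the newest run by modification time is an assumption about how `runs/<ts>/` directories are ordered.

```python
import json
from pathlib import Path

# File names come from the Demo 1 pass criteria; the helpers are hypothetical.
RUN_LEVEL = ("history.jsonl", "candidate_pareto.json", "fork_graph.json")
ROUND_LEVEL = ("config.yaml", "train.log", "checkpoint.pt", "metrics.json")


def newest_run(runs_root: Path) -> Path:
    """Return the most recently modified run directory (assumed ordering)."""
    run_dirs = [p for p in runs_root.iterdir() if p.is_dir()]
    if not run_dirs:
        raise FileNotFoundError(f"no run directories under {runs_root}")
    return max(run_dirs, key=lambda p: p.stat().st_mtime)


def check_demo1(run_dir: Path) -> list[str]:
    """Return a list of problems; an empty list means the criteria are met."""
    problems = []
    for name in RUN_LEVEL:
        if not (run_dir / name).is_file():
            problems.append(f"missing {name}")
    round_dir = run_dir / "round_1"
    for name in ROUND_LEVEL:
        if not (round_dir / name).is_file():
            problems.append(f"missing round_1/{name}")
    metrics_path = round_dir / "metrics.json"
    if metrics_path.is_file():
        status = json.loads(metrics_path.read_text()).get("status")
        if status != "ok":
            problems.append(f"metrics.json status is {status!r}, expected 'ok'")
    return problems
```

Calling `check_demo1(newest_run(Path("runs")))` from the repo root after Demo 1 should return an empty list when every artifact is present and the round status is `"ok"`.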

What the six demos prove, in plain language:

| Demo | Answer it gives the reviewer |
| --- | --- |
| 1: `surface_d5` no-LLM smoke | The full round pipeline (DSL → train → eval → Pareto → `fork_graph.json`) actually writes every artifact, with `Δ_LER` reported on the surface-code path. |
| 2: `bb72` qLDPC | The same harness swaps MWPM → OSD cleanly; `(code, noise, constraints)` is really the only input knob. |
| 3: Worktree provenance | Branches-as-Pareto isn't paperware: every round commits its `round_<N>_pointer.json` into a real `exp/<run_id>/<NN>-<slug>` branch, compose rounds `git merge` two parents end-to-end, and conflict is recorded as `status="compose_conflict"` rather than thrown. |
| 4: Reward-hacking detection | A hand-built memorizer cheater gets rejected (`SUSPICIOUS` or `FAILED`) and is never admitted to Pareto; the independent verifier actually guards the front. |
| 5: Failure recovery | `cli.autoqec diagnose` takes a broken round dir and emits a machine-readable root cause (`compile_error` + reason), feeding the `/diagnose-failure` skill. |
| 6: Advisor replay | A run can be packaged (`cli.autoqec package-run` → `.tar.gz`) and replayed offline in a clean directory with all `AUTOQEC_*_BACKEND` env vars unset and sockets blocked, proving runs are network-free reproducible artifacts, not just local state. |

The live-LLM research loop (/autoqec-run) is intentionally out of the one-prompt flow — nesting Claude-Code-inside-Claude-Code is unstable. See docs/verification/human-verification-report-2026-04-24.md for the last end-to-end retest and .claude/skills/autoqec-run/SKILL.md for the manual path.

Spec: docs/superpowers/specs/2026-04-20-autoqec-design.md (v2.2)
API documentation: docs/api-documentation.md
Master plan: docs/superpowers/plans/2026-04-21-autoqec-master.md
Per-owner plans: docs/superpowers/plans/
Test plan: docs/verification/human-verification-test-plan.md
Developer test targets: docs/test-plan.md (make lint, make test, and make test-integration gate the full unit + GPU/integration suites; the one-prompt flow above runs only the lint check, the non-integration unit suite, and the three reward-hacking trap tests).
Knowledge base: knowledge/ — 81-paper index + 3 synthesis documents (roadmap, strategic assessment, autoresearch patterns)

Architecture at a glance

The classical backend guarantees structural validity; the predecoder contributes Δ_LER = LER(plain_classical) − LER(predecoder + classical) — a single clean number per round.
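
As a worked illustration of the formula above, the sketch below computes Δ_LER from raw failure counts. The function names and the assumption that both decoders are evaluated on the same number of shots are illustrative choices, not taken from the repo's evaluation code.

```python
def logical_error_rate(n_failures: int, n_shots: int) -> float:
    """Fraction of shots on which the decoded logical observable was wrong."""
    if n_shots <= 0:
        raise ValueError("n_shots must be positive")
    if not 0 <= n_failures <= n_shots:
        raise ValueError("n_failures must lie in [0, n_shots]")
    return n_failures / n_shots


def delta_ler(baseline_failures: int, predecoder_failures: int, n_shots: int) -> float:
    """Delta_LER = LER(plain_classical) - LER(predecoder + classical).

    Positive values mean the predecoder reduced the logical error rate.
    """
    return (logical_error_rate(baseline_failures, n_shots)
            - logical_error_rate(predecoder_failures, n_shots))
```

For example, 300 baseline failures and 120 predecoder failures over 1000 shots give a Δ_LER of about +0.18; a negative value would mean the predecoder made the classical backend worse.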

syndrome + code DEM
        │
        ↓
┌─────────────────────────────────────────┐
│ AI Predecoder (agent searches here)     │
│   type:        gnn | neural_bp          │
│   output_mode: hard_flip | soft_priors  │
└─────────────────────────────────────────┘
        │
        ↓
┌─────────────────────────────────────────┐
│ Classical Backend (fixed per env)       │
│   surface codes → MWPM (PyMatching)     │
│   qLDPC         → OSD                   │
└─────────────────────────────────────────┘
        │
        ↓
 logical correction

Each round is a 3-subagent DAG (+ optional Verifier). Roles are backend-pluggable via AUTOQEC_{IDEATOR,CODER,ANALYST}_BACKEND; the Runner is local-only and enforces a tool whitelist (autoqec/runner/safety.py) so reward-hacking paths to training-set syndromes are physically blocked.
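
The role-to-backend mapping can be pictured with a small sketch. Only the variable names `AUTOQEC_{IDEATOR,CODER,ANALYST}_BACKEND` come from this README; the helper names, the use of `None` for an unset role, and the example value `codex-cli` are assumptions for illustration.

```python
from typing import Mapping, Optional

ROLES = ("IDEATOR", "CODER", "ANALYST")


def role_backends(env: Mapping[str, str]) -> dict[str, Optional[str]]:
    """Map each subagent role to its configured backend, or None if unset."""
    return {role.lower(): env.get(f"AUTOQEC_{role}_BACKEND") for role in ROLES}


def offline_env(env: Mapping[str, str]) -> dict[str, str]:
    """Copy env with every AUTOQEC_*_BACKEND variable removed.

    This mirrors the condition Demo 6 describes for its offline replay
    (all backend variables unset); it does not block sockets.
    """
    return {
        key: value
        for key, value in env.items()
        if not (key.startswith("AUTOQEC_") and key.endswith("_BACKEND"))
    }
```

Passing `os.environ` to `role_backends` shows which roles are currently routed to an external backend, and `offline_env(os.environ)` yields an environment suitable for handing to a subprocess that must not reach any of them.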

 round N dispatch · fork_graph + machine_state
        ↓
┌─────────────────────────────────────────┐
│ Ideator                     (subagent)  │
│   hypothesis + fork_from + compose_mode │
└─────────────────────────────────────────┘
        ↓
┌─────────────────────────────────────────┐
│ Coder                       (subagent)  │
│   DSL config + commit_message           │
│   Tier-1 canonical · Tier-2 custom_fn   │
└─────────────────────────────────────────┘
        ↓
┌─────────────────────────────────────────┐
│ Runner                       (script)   │
│   train + eval → RoundMetrics           │
└─────────────────────────────────────────┘
        ↓
┌─────────────────────────────────────────┐
│ Analyst                     (subagent)  │
│   verdict: candidate | ignore           │
└─────────────────────────────────────────┘
        ↓   (if candidate)
┌─────────────────────────────────────────┐
│ Verifier           (optional subagent)  │
│   VERIFIED → Pareto admit               │
│   SUSPICIOUS / FAILED → rejected        │
└─────────────────────────────────────────┘

Each round runs on its own exp/<run_id>/<NN>-<slug> git branch inside a .worktrees/ checkout. Pareto members are the complete non-dominated set of VERIFIED branches, and compose rounds test git merge parent-A parent-B as a first-class scientific probe. Startup reconciliation keeps history.jsonl and the live branch set in sync.

 main  (f51cfcf · run_id = demo38-…103750)
   │
   ├─ 01-idea-a           ★ Δ=+0.18  ──┐
   │                                   │
   ├─ 02-idea-b           ★ Δ=+0.12  ──┤   git merge   (compose_mode = merge)
   │                                   │
   ├─ 03-compose-ab       ★ Δ=+0.22  ──┘   dominates both parents
   │
   ├─ 04-idea-c           ignore          Analyst: Δ within bootstrap CI
   │
   ├─ 90-conflict-L       ◇ compose_conflict · branch=None · sha=None
   ├─ 91-conflict-R       ◇ compose_conflict · merge refused · no worktree
   │
   └─ 07-orphan           ⊘ orphaned_branch  · healed by reconcile

 ★ on non-dominated Pareto    ◇ merge refused    ⊘ orphan recovered on startup
 each  exp/<run_id>/<NN>-<slug>  ≡ its own  .worktrees/exp-…/  checkout
 runs/<run_id>/  stores   history.jsonl · pareto.json · fork_graph.json (this tree)
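
The rule that Pareto members are the non-dominated set of VERIFIED branches can be sketched as follows. The `Candidate` record and function names are hypothetical; the three axes (Δ_LER, FLOPs, n_params) follow the feature list in this README, with higher Δ_LER and lower FLOPs and parameter counts treated as better.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    branch: str
    delta_ler: float  # higher is better
    flops: int        # lower is better
    n_params: int     # lower is better
    verdict: str      # "VERIFIED", "SUSPICIOUS", or "FAILED"


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on every axis and strictly better on one."""
    no_worse = (a.delta_ler >= b.delta_ler
                and a.flops <= b.flops
                and a.n_params <= b.n_params)
    strictly_better = (a.delta_ler > b.delta_ler
                       or a.flops < b.flops
                       or a.n_params < b.n_params)
    return no_worse and strictly_better


def pareto_front(candidates: list[Candidate]) -> list[Candidate]:
    """Keep only VERIFIED candidates that no other VERIFIED candidate dominates."""
    verified = [c for c in candidates if c.verdict == "VERIFIED"]
    return [c for c in verified
            if not any(dominates(other, c) for other in verified)]
```

Filtering on the verdict before the dominance test is what keeps a `SUSPICIOUS` or `FAILED` candidate off the front, however good its reported numbers look.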

Deliverables

6 Features (core capabilities of the harness)

All six are present on main; the one-prompt flow above exercises F1–F6 end-to-end.

| # | Feature | Status | Evidence |
| --- | --- | --- | --- |
| F1 | End-to-end research loop over any `(code, noise, constraints)` triple | implemented | `cli/autoqec.py run` + `autoqec/orchestration/llm_loop.py`; Demo 1 / Demo 2 drive the no-LLM path |
| F2 | Tier-1 canonical DSL + Tier-2 `custom_fn` escape hatch with AST+smoke validation | implemented | `autoqec/decoders/{dsl_schema,dsl_compiler,custom_fn_validator,custom_fn_rules}.py`; 18 hostile cases in `tests/test_custom_fn_validator.py` |
| F3 | Independent verification module with 3 fair-baseline guards (seed isolation, paired bootstrap CI, ablation sanity) | implemented | `autoqec/eval/independent_eval.py` + `eval/bootstrap.py`; trap_A/B/C guards proven on main (Demo 4 + `tests/test_reward_hacking_traps.py`) |
| F4 | Multi-agent orchestration (Ideator / Coder / Analyst) with tool whitelisting + 3-layer memory + `machine_state` tool | implemented | `autoqec/agents/dispatch.py` + `.claude/agents/autoqec-{ideator,coder,analyst}.md`; `autoqec/orchestration/memory.py`; `autoqec/tools/machine_state.py`; `autoqec/runner/safety.py` |
| F5 | Pareto-front maintenance across (Δ_LER, FLOPs, n_params) with verify-admitted candidates | implemented | `autoqec/pareto/front.py` + `orchestration/round_recorder.py`; atomic `pareto.json` swap covered by `tests/test_pareto_atomic*.py` |
| F6 | Worktree-based experiment model (branches-as-Pareto; compose rounds; startup reconciliation; `fork_graph.json`) | implemented | `autoqec/orchestration/{worktree,subprocess_runner,reconcile,fork_graph}.py`; persistence proven by `tests/test_fork_graph_persist.py` |
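
To make the "paired bootstrap CI" guard in F3 concrete, here is a minimal percentile-bootstrap sketch over per-shot failure indicators. It is a generic illustration under stated assumptions (both decoders evaluated on the same shots, 2000 resamples, a 95% interval) and does not reproduce the code in `eval/bootstrap.py`.

```python
import random
from typing import Sequence


def paired_bootstrap_ci(
    baseline_fail: Sequence[bool],
    predecoder_fail: Sequence[bool],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float]:
    """Percentile bootstrap interval for Delta_LER on paired shots."""
    if len(baseline_fail) != len(predecoder_fail) or not baseline_fail:
        raise ValueError("inputs must be non-empty and the same length")
    n = len(baseline_fail)
    # Per-shot paired difference: +1 if the predecoder fixed a failure,
    # -1 if it introduced one, 0 if both decoders agree.
    diffs = [int(b) - int(p) for b, p in zip(baseline_fail, predecoder_fail)]
    rng = random.Random(seed)
    # Resample shots with replacement, keeping each pair together.
    estimates = sorted(
        sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples)
    )
    lo = estimates[int((alpha / 2) * n_resamples)]
    hi = estimates[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi
```

Resampling the paired differences rather than each decoder separately removes shot-to-shot noise that both decoders share. One plausible reading of the "Δ within bootstrap CI" label in the branch diagram is that an interval containing zero is treated as no measurable gain, though the exact rule is the Analyst's.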

6 Demos (each produces a reproducible artifact)

All six ship as runnable demos. The /add-env onboarding flow stays a CLI-only subcommand for now (python -m cli.autoqec add-env …) and is tracked under Skills below rather than as a demo.

| # | Demo | Proves | Priority | Status |
| --- | --- | --- | --- | --- |
| D1 | `surface_d5` full research run: `demos/demo-1-surface-d5/run_quick.sh` | End-to-end harness works | P0 | implemented |
| D2 | `bb72` qLDPC research run: `demos/demo-2-bb72/run.sh` (fast/dev/prod modes) | Genericity across codes / classical backends (MWPM → OSD) | P1 | implemented |
| D3 | Worktree branches-as-Pareto provenance: `demos/demo-3-worktree-provenance/run.sh` | Every round runs in its own `exp/<run_id>/<NN>-<slug>` git worktree, commits a `round_<N>_pointer.json`, and `compose_conflict` is a recorded failure mode (issue #38) | P1 | implemented |
| D4 | Reward-hacking detection: `demos/demo-4-reward-hacking/run.sh` | Memorizer cheater gets `SUSPICIOUS` / `FAILED` verdict | P0 | implemented |
| D5 | Failure recovery: `demos/demo-5-failure-recovery/run.sh` | `cli.autoqec diagnose` identifies `compile_error` root cause | P2 | implemented |
| D6 | Advisor replay / offline reproducibility: `demos/demo-6-advisor-replay/run.sh` | A packaged run replays under `runs/demo-6-replay/` with no LLM backends and no network (`cli.autoqec package-run` + `autoqec.tools.advisor_replay`) | P2 | implemented |
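
The naming conventions that D3 relies on can be written down in a few lines. This is a sketch under assumptions: the two-digit zero padding matches the branch names shown in the architecture diagram, but the slug rule (lowercase, non-alphanumerics collapsed to hyphens) and the helper names are hypothetical.

```python
import re


def slugify(text: str) -> str:
    """Lowercase and collapse non-alphanumeric runs to '-' (assumed rule)."""
    slug = re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")
    return slug or "round"


def branch_name(run_id: str, round_idx: int, idea: str) -> str:
    """Build an exp/<run_id>/<NN>-<slug> branch name."""
    return f"exp/{run_id}/{round_idx:02d}-{slugify(idea)}"


def pointer_path(round_idx: int) -> str:
    """Relative path of the pointer file a round commits."""
    return f"round_{round_idx}/round_{round_idx}_pointer.json"
```

With these helpers, round 1 of a run with a hypothetical id `demo` and idea "Idea A" maps to the branch `exp/demo/01-idea-a` and the pointer file `round_1/round_1_pointer.json`.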

Skills (LLM-reasoning user surfaces, exposed as `/<name>`)

The seven implemented skills listed below live under .claude/skills/ and are discoverable from Claude Code. /add-env remains a CLI-only subcommand; /read-zulip recovers off-repo hackathon context; /review-framework closes the autoresearch loop by turning findings from a completed run into proposed framework improvements.

| Skill | Purpose | Status |
| --- | --- | --- |
| `/autoqec-run` | Run the full research loop on an env YAML | implemented |
| `/verify-decoder` | Audit a Pareto candidate against holdout seeds (wraps `cli.autoqec verify`) | implemented |
| `/review-log` | Read an entire `runs/<id>/log.md`, flag stuck hypotheses / overfitting | implemented |
| `/review-framework` | Read a completed run and propose framework improvements (DSL gaps, weak baselines, prompt drift, env limits, orchestration friction); advisory only, never edits framework code | implemented |
| `/diagnose-failure` | Root-cause a broken or stalled round, recommend a fix (wraps `cli.autoqec diagnose`) | implemented |
| `/demo-presenter` | Generate an evidence-backed walkthrough and live AI narration for the merged demos plus planned PR-only worktree demo | implemented |
| `/read-zulip` | Pull Zulip stream/topic history for off-repo project context | implemented |
| `/add-env` | Interactively create a new env YAML | planned (CLI only for now: `python -m cli.autoqec add-env`) |

Repo layout (planned)

qec-ai-decoder/
├── autoqec/                        # Python package
│   ├── envs/                       # EnvSpec + builtin envs
│   ├── runner/                     # Non-LLM train + eval + safety + FLOPs
│   ├── eval/                       # independent_eval (ISOLATED from runner)
│   ├── decoders/                   # DSL schema, compiler, GNN/Neural-BP modules
│   ├── orchestration/              # Research-loop driver + 3-layer memory
│   ├── agents/                     # Subagent dispatcher
│   ├── llm/                        # claude-cli / codex-cli router
│   ├── tools/                      # machine_state, etc.
│   ├── pareto/                     # Front maintenance
│   └── example_db/                 # Tier-1 seed templates
├── cli/autoqec.py                  # click CLI entry
├── .claude/
│   ├── agents/                     # Subagent prompt files
│   └── skills/                     # 6 user-facing skills
├── demos/                          # 6 demos with run.sh + (some) walkthrough.md
├── envs/                           # User-authored env YAMLs
├── circuits/                       # *.stim files
├── runs/                           # .gitignore'd; per-run outputs
├── knowledge/                      # Literature, roadmap, strategic docs
├── docs/superpowers/specs/         # Design specs
├── docs/superpowers/plans/         # Implementation plans
├── docs/contracts/                 # Phase-0 interface contracts
├── tests/
├── Makefile
└── pyproject.toml
