AutoQEC is an LLM-agent-driven auto-research harness for discovering neural predecoders for quantum error-correcting codes. Given an environment triple (code_spec, noise_model, constraints), the system runs 10–20 rounds of hypothesis → DSL config → training → evaluation → analysis and emits verified predecoder checkpoints on the accuracy–latency–parameters Pareto front.
Clone the repo, open a Claude Code / Codex CLI in the project root, and paste the block below verbatim. The agent will install deps, run the six headline demos, and hand back a pass/fail table. Total wall-clock on CPU-only hardware: ~15–20 min (skip step 0's unit suite if you're in a hurry — ~3 min off).
Run the 6 demos below in order and report results. All commands run from the
repo root. First create and activate .venv (Windows: `.venv\Scripts\activate`,
Unix: `source .venv/bin/activate`), then `pip install -e '.[dev]'`. Do NOT
enter the live-LLM path (nested CLI sessions are unstable). For each step
report: cmd, wall-clock, key artifact paths, pass/fail against the stated
criteria. At the end give one markdown summary table.
0. Preflight (optional but strongly recommended):
- `ruff check autoqec cli tests scripts` must print "All checks passed"
- `pytest tests/ -m "not integration" -q` must be all green (~4 min, 290 tests)
- `pytest tests/test_reward_hacking_traps.py -m integration --run-integration -v`
must be 3/3 green (trap_A/B/C verifier guards)
1. Demo 1 — surface-code end-to-end loop (~2.5 min):
`bash demos/demo-1-surface-d5/run_quick.sh`
Pass criteria: in the newest `runs/<ts>/`, all of `history.jsonl`,
`candidate_pareto.json`, `fork_graph.json`,
`round_1/{config.yaml,train.log,checkpoint.pt,metrics.json}` exist;
`metrics.json.status == "ok"`.
2. Demo 2 — qLDPC cross-code-family (~3 min, routes through OSD not MWPM):
`MODE=fast bash demos/demo-2-bb72/run.sh`
Pass criteria: in the new `runs/<ts>/round_1/metrics.json`, `status == "ok"`;
the env config uses `classical_backend: osd` (prove the backend actually
switched away from the surface-code MWPM path).
3. Demo 3 — worktree branches-as-Pareto provenance (~30 s, no LLM):
`bash demos/demo-3-worktree-provenance/run.sh`
Pass criteria: three `run-round` invocations all exit 0; every
`metrics.json` has a non-empty `branch` and a real `commit_sha`; the
compose worktree contains both `round_1/round_1_pointer.json` and
`round_2/round_2_pointer.json` (proving both parents merged); the
deliberate-conflict step yields `status="compose_conflict"` with
`conflicting_files` populated.
4. Demo 4 — reward-hacking rejection (~1 min, memorizer must be rejected):
`bash demos/demo-4-reward-hacking/run.sh`
Pass criteria: `runs/demo-4/round_0/verification_report.json.verdict` is in
{SUSPICIOUS, FAILED} — never VERIFIED; script exits 0.
5. Demo 5 — failure root-cause diagnosis (~5 s):
`bash demos/demo-5-failure-recovery/run.sh`
Pass criteria: stdout JSON contains `"status": "compile_error"` and a
`status_reason` field.
6. Demo 6 — advisor replay / offline reproducibility (~2 min):
`bash demos/demo-6-advisor-replay/run.sh`
Pass criteria: `cli.autoqec package-run` produces a `runs/<run_id>.tar.gz`
archive; `autoqec.tools.advisor_replay` extracts it under
`runs/demo-6-replay/` and re-runs `cli.autoqec verify` with all
`AUTOQEC_*_BACKEND` env vars unset and sockets blocked; the offline
replay verify exits 0.
Summary table columns: demo | cmd | wall-clock | key artifact | pass/fail.
Rules:
- If any step fails, keep going to the next step; record the fail in the
table rather than aborting.
- On failure, paste the last 30 lines of that step's stdout/stderr — do not
silently swallow errors.
- Do not modify source files; only read, run, and report.
- If every demo passes, end with the line "harness end-to-end healthy"; if any
failed, end with "needs triage: <which step>".
What the six demos prove, in plain language:
| Demo | Answer it gives the reviewer |
|---|---|
| 1 — surface_d5 no-LLM smoke | The full round pipeline (DSL → train → eval → Pareto → fork_graph.json) actually writes every artifact, with Δ_LER reported on the surface-code path. |
| 2 — bb72 qLDPC | The same harness swaps MWPM → OSD cleanly; (code, noise, constraints) is really the only input knob. |
| 3 — Worktree provenance | Branches-as-Pareto isn't paperware — every round commits its round_<N>_pointer.json into a real exp/<run_id>/<NN>-<slug> branch, compose rounds git merge two parents end-to-end, and conflict is recorded as status="compose_conflict" rather than thrown. |
| 4 — Reward-hacking detection | A hand-built memorizer cheater gets rejected (SUSPICIOUS or FAILED), never admitted to Pareto — the independent verifier actually guards the front. |
| 5 — Failure recovery | cli.autoqec diagnose takes a broken round dir and emits a machine-readable root cause (compile_error + reason), feeding the /diagnose-failure skill. |
| 6 — Advisor replay | A run can be packaged (cli.autoqec package-run → .tar.gz) and replayed offline in a clean directory with all AUTOQEC_*_BACKEND env vars unset and sockets blocked — proving runs are network-free reproducible artifacts, not just local state. |
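
The Demo 1 pass criteria can also be checked mechanically. The sketch below is not part of the repo: `newest_run` and `check_demo1` are hypothetical helpers, and picking the newest run by modification time is an assumption. Only the artifact names and the `status == "ok"` condition come from the criteria listed in the prompt.

```python
import json
from pathlib import Path
from typing import Optional

# Artifact names copied from the Demo 1 pass criteria.
RUN_LEVEL = ["history.jsonl", "candidate_pareto.json", "fork_graph.json"]
ROUND_LEVEL = ["config.yaml", "train.log", "checkpoint.pt", "metrics.json"]


def newest_run(runs_root: Path) -> Optional[Path]:
    """Return the most recently modified directory under runs_root, or None.

    Assumption: "newest" means latest mtime; the harness may order runs differently.
    """
    if not runs_root.is_dir():
        return None
    run_dirs = [p for p in runs_root.iterdir() if p.is_dir()]
    return max(run_dirs, key=lambda p: p.stat().st_mtime, default=None)


def check_demo1(run_dir: Path) -> list:
    """Return a list of human-readable problems; an empty list means pass."""
    problems = []
    for name in RUN_LEVEL:
        if not (run_dir / name).is_file():
            problems.append(f"missing {name}")
    round_dir = run_dir / "round_1"
    for name in ROUND_LEVEL:
        if not (round_dir / name).is_file():
            problems.append(f"missing round_1/{name}")
    metrics_path = round_dir / "metrics.json"
    if metrics_path.is_file():
        status = json.loads(metrics_path.read_text()).get("status")
        if status != "ok":
            problems.append(f"metrics.json status is {status!r}, expected 'ok'")
    return problems
```

Calling `check_demo1(newest_run(Path("runs")))` after the demo script finishes would give the pass/fail cell for the first row of the summary table, with the returned list doubling as the failure explanation.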
The live-LLM research loop (`/autoqec-run`) is intentionally out of the one-prompt flow — nesting Claude-Code-inside-Claude-Code is unstable. See `docs/verification/human-verification-report-2026-04-24.md` for the last end-to-end retest and `.claude/skills/autoqec-run/SKILL.md` for the manual path.
- Spec: `docs/superpowers/specs/2026-04-20-autoqec-design.md` (v2.2)
- API documentation: `docs/api-documentation.md`
- Master plan: `docs/superpowers/plans/2026-04-21-autoqec-master.md`
- Per-owner plans: `docs/superpowers/plans/`
- Test plan: `docs/verification/human-verification-test-plan.md`
- Developer test targets: `docs/test-plan.md` — `make lint`, `make test`, and `make test-integration` gate the full unit + GPU/integration suites (the one-prompt flow above only exercises `make test` + traps).
- Knowledge base: `knowledge/` — 81-paper index + 3 synthesis documents (roadmap, strategic assessment, autoresearch patterns)
The classical backend guarantees structural validity; the predecoder contributes Δ_LER = LER(plain_classical) − LER(predecoder + classical) — a single clean number per round.
syndrome + code DEM
│
↓
┌─────────────────────────────────────────┐
│ AI Predecoder (agent searches here) │
│ type: gnn | neural_bp │
│ output_mode: hard_flip | soft_priors │
└─────────────────────────────────────────┘
│
↓
┌─────────────────────────────────────────┐
│ Classical Backend (fixed per env) │
│ surface codes → MWPM (PyMatching) │
│ qLDPC → OSD │
└─────────────────────────────────────────┘
│
↓
logical correction
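
As a small worked illustration of the Δ_LER definition above, the following sketch computes it from raw failure counts. The function names and the count-based interface are assumptions made for this example; the harness's own metrics code may be organised differently.

```python
def logical_error_rate(n_failures: int, n_shots: int) -> float:
    """Fraction of shots in which the final logical correction was wrong."""
    if n_shots <= 0:
        raise ValueError("n_shots must be positive")
    if not 0 <= n_failures <= n_shots:
        raise ValueError("n_failures must lie in [0, n_shots]")
    return n_failures / n_shots


def delta_ler(plain_failures: int, predecoder_failures: int, n_shots: int) -> float:
    """Delta_LER = LER(plain_classical) - LER(predecoder + classical).

    Positive values mean the predecoder reduced the logical error rate.
    Assumes both pipelines were evaluated on the same number of shots.
    """
    return (logical_error_rate(plain_failures, n_shots)
            - logical_error_rate(predecoder_failures, n_shots))
```

For instance, 50 failures in 1000 shots for the plain backend against 30 failures with the predecoder in front gives a Δ_LER of about 0.02; a negative value would mean the predecoder made things worse.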
Each round is a 3-subagent DAG (+ optional Verifier). Roles are backend-pluggable via AUTOQEC_{IDEATOR,CODER,ANALYST}_BACKEND; the Runner is local-only and enforces a tool whitelist (autoqec/runner/safety.py) so reward-hacking paths to training-set syndromes are physically blocked.
round N dispatch · fork_graph + machine_state
↓
┌─────────────────────────────────────────┐
│ Ideator (subagent) │
│ hypothesis + fork_from + compose_mode │
└─────────────────────────────────────────┘
↓
┌─────────────────────────────────────────┐
│ Coder (subagent) │
│ DSL config + commit_message │
│ Tier-1 canonical · Tier-2 custom_fn │
└─────────────────────────────────────────┘
↓
┌─────────────────────────────────────────┐
│ Runner (script) │
│ train + eval → RoundMetrics │
└─────────────────────────────────────────┘
↓
┌─────────────────────────────────────────┐
│ Analyst (subagent) │
│ verdict: candidate | ignore │
└─────────────────────────────────────────┘
↓ (if candidate)
┌─────────────────────────────────────────┐
│ Verifier (optional subagent) │
│ VERIFIED → Pareto admit │
│ SUSPICIOUS / FAILED → rejected │
└─────────────────────────────────────────┘
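
The per-role backend switches can be inspected or stripped with a few lines of standard-library code. This is a sketch under assumptions: `role_backends` and `scrub_backends` are hypothetical helpers, and only the variable-name pattern `AUTOQEC_{IDEATOR,CODER,ANALYST}_BACKEND` comes from this README. Which values the variables accept is not specified here, so the sketch treats them as opaque strings.

```python
import fnmatch
from typing import Dict, Mapping, Optional

ROLES = ("IDEATOR", "CODER", "ANALYST")


def role_backends(environ: Mapping[str, str]) -> Dict[str, Optional[str]]:
    """Map each role to the value of its AUTOQEC_<ROLE>_BACKEND variable, or None if unset."""
    return {role: environ.get(f"AUTOQEC_{role}_BACKEND") for role in ROLES}


def scrub_backends(environ: Mapping[str, str]) -> Dict[str, str]:
    """Return a copy of environ with every AUTOQEC_*_BACKEND variable removed.

    This mirrors the "all AUTOQEC_*_BACKEND env vars unset" condition used by the
    offline replay demo; it does not block sockets.
    """
    return {
        key: value
        for key, value in environ.items()
        if not fnmatch.fnmatchcase(key, "AUTOQEC_*_BACKEND")
    }
```

Passing `scrub_backends(os.environ)` as the `env` argument of `subprocess.run` would be one way to launch a child process with no backend configured, leaving the parent environment untouched.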
Each round runs on its own exp/<run_id>/<NN>-<slug> git branch inside a .worktrees/ checkout, Pareto members are the complete non-dominated set of VERIFIED branches, and compose rounds test git merge parent-A parent-B as a first-class scientific probe. Startup reconciliation keeps history.jsonl and the live branch set in sync.
main (f51cfcf · run_id = demo38-…103750)
│
├─ 01-idea-a ★ Δ=+0.18 ──┐
│ │
├─ 02-idea-b ★ Δ=+0.12 ──┤ git merge (compose_mode = merge)
│ │
├─ 03-compose-ab ★ Δ=+0.22 ──┘ dominates both parents
│
├─ 04-idea-c ignore Analyst: Δ within bootstrap CI
│
├─ 90-conflict-L ◇ compose_conflict · branch=None · sha=None
├─ 91-conflict-R ◇ compose_conflict · merge refused · no worktree
│
└─ 07-orphan ⊘ orphaned_branch · healed by reconcile
★ on non-dominated Pareto ◇ merge refused ⊘ orphan recovered on startup
each exp/<run_id>/<NN>-<slug> ≡ its own .worktrees/exp-…/ checkout
runs/<run_id>/ stores history.jsonl · pareto.json · fork_graph.json (this tree)
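
To make "complete non-dominated set" concrete, here is a minimal sketch of a Pareto filter over the three axes named in this README (Δ_LER, FLOPs, n_params). It is not the code in `autoqec/pareto/front.py`: the `Candidate` class is hypothetical, the input is assumed to be already restricted to VERIFIED branches, and the direction of each axis (maximise Δ_LER, minimise FLOPs and parameter count) is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    branch: str       # e.g. an exp/<run_id>/<NN>-<slug> branch name
    delta_ler: float  # assumed: higher is better
    flops: float      # assumed: lower is better
    n_params: int     # assumed: lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is at least as good as b on every axis and strictly better on one."""
    no_worse = (a.delta_ler >= b.delta_ler
                and a.flops <= b.flops
                and a.n_params <= b.n_params)
    strictly_better = (a.delta_ler > b.delta_ler
                       or a.flops < b.flops
                       or a.n_params < b.n_params)
    return no_worse and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep every candidate that no other candidate dominates (O(n^2) scan)."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]
```

The quadratic scan is deliberate: with 10–20 rounds per run the candidate list stays small, so clarity matters more than asymptotic cost.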
All six are present on main; the one-prompt flow above exercises F1–F6 end-to-end.
| # | Feature | Status | Evidence |
|---|---|---|---|
| F1 | End-to-end research loop over any (code, noise, constraints) triple | implemented | cli/autoqec.py run + autoqec/orchestration/llm_loop.py; Demo 1 / Demo 2 drive the no-LLM path |
| F2 | Tier-1 canonical DSL + Tier-2 custom_fn escape hatch with AST+smoke validation | implemented | autoqec/decoders/{dsl_schema,dsl_compiler,custom_fn_validator,custom_fn_rules}.py; 18 hostile cases in tests/test_custom_fn_validator.py |
| F3 | Independent verification module with 3 fair-baseline guards (seed isolation, paired bootstrap CI, ablation sanity) | implemented | autoqec/eval/independent_eval.py + eval/bootstrap.py; trap_A/B/C guards proven on main (Demo 4 + tests/test_reward_hacking_traps.py) |
| F4 | Multi-agent orchestration (Ideator / Coder / Analyst) with tool whitelisting + 3-layer memory + machine_state tool | implemented | autoqec/agents/dispatch.py + .claude/agents/autoqec-{ideator,coder,analyst}.md; autoqec/orchestration/memory.py; autoqec/tools/machine_state.py; autoqec/runner/safety.py |
| F5 | Pareto-front maintenance across (Δ_LER, FLOPs, n_params) with verify-admitted candidates | implemented | autoqec/pareto/front.py + orchestration/round_recorder.py; atomic pareto.json swap covered by tests/test_pareto_atomic*.py |
| F6 | Worktree-based experiment model (branches-as-Pareto; compose rounds; startup reconciliation; fork_graph.json) | implemented | autoqec/orchestration/{worktree,subprocess_runner,reconcile,fork_graph}.py; persistence proven by tests/test_fork_graph_persist.py |
All six ship as runnable demos. The /add-env onboarding flow stays a CLI-only subcommand for now (python -m cli.autoqec add-env …) and is tracked under Skills below rather than as a demo.
| # | Demo | Proves | Priority | Status |
|---|---|---|---|---|
| D1 | surface_d5 full research run — demos/demo-1-surface-d5/run_quick.sh | End-to-end harness works | P0 | implemented |
| D2 | bb72 qLDPC research run — demos/demo-2-bb72/run.sh (fast/dev/prod modes) | Genericity across codes / classical backends (MWPM → OSD) | P1 | implemented |
| D3 | Worktree branches-as-Pareto provenance — demos/demo-3-worktree-provenance/run.sh | Every round runs in its own exp/<run_id>/<NN>-<slug> git worktree, commits a round_<N>_pointer.json, and compose_conflict is a recorded failure mode (issue #38) | P1 | implemented |
| D4 | Reward-hacking detection — demos/demo-4-reward-hacking/run.sh | Memorizer cheater gets SUSPICIOUS / FAILED verdict | P0 | implemented |
| D5 | Failure recovery — demos/demo-5-failure-recovery/run.sh | cli.autoqec diagnose identifies compile_error root cause | P2 | implemented |
| D6 | Advisor replay / offline reproducibility — demos/demo-6-advisor-replay/run.sh | A packaged run replays under runs/demo-6-replay/ with no LLM backends and no network — cli.autoqec package-run + autoqec.tools.advisor_replay | P2 | implemented |
All six skills under .claude/skills/ are discoverable from Claude Code. /add-env remains a CLI-only subcommand; /read-zulip recovers off-repo hackathon context; /review-framework closes the autoresearch loop by feeding findings from a completed run back into the codebase.
| Skill | Purpose | Status |
|---|---|---|
| /autoqec-run | Run the full research loop on an env YAML | implemented |
| /verify-decoder | Audit a Pareto candidate against holdout seeds (wraps cli.autoqec verify) | implemented |
| /review-log | Read an entire runs/<id>/log.md, flag stuck hypotheses / overfitting | implemented |
| /review-framework | Read a completed run and propose framework improvements (DSL gaps, weak baselines, prompt drift, env limits, orchestration friction); advisory only, never edits framework code | implemented |
| /diagnose-failure | Root-cause a broken or stalled round, recommend a fix (wraps cli.autoqec diagnose) | implemented |
| /demo-presenter | Generate an evidence-backed walkthrough and live AI narration for the merged demos plus planned PR-only worktree demo | implemented |
| /read-zulip | Pull Zulip stream/topic history for off-repo project context | implemented |
| /add-env | Interactively create a new env YAML | planned (CLI only for now: python -m cli.autoqec add-env) |
qec-ai-decoder/
├── autoqec/ # Python package
│ ├── envs/ # EnvSpec + builtin envs
│ ├── runner/ # Non-LLM train + eval + safety + FLOPs
│ ├── eval/ # independent_eval (ISOLATED from runner)
│ ├── decoders/ # DSL schema, compiler, GNN/Neural-BP modules
│ ├── orchestration/ # Research-loop driver + 3-layer memory
│ ├── agents/ # Subagent dispatcher
│ ├── llm/ # claude-cli / codex-cli router
│ ├── tools/ # machine_state, etc.
│ ├── pareto/ # Front maintenance
│ └── example_db/ # Tier-1 seed templates
├── cli/autoqec.py # click CLI entry
├── .claude/
│ ├── agents/ # Subagent prompt files
│ └── skills/ # 6 user-facing skills
├── demos/ # 6 demos with run.sh + (some) walkthrough.md
├── envs/ # User-authored env YAMLs
├── circuits/ # *.stim files
├── runs/ # .gitignore'd; per-run outputs
├── knowledge/ # Literature, roadmap, strategic docs
├── docs/superpowers/specs/ # Design specs
├── docs/superpowers/plans/ # Implementation plans
├── docs/contracts/ # Phase-0 interface contracts
├── tests/
├── Makefile
└── pyproject.toml
MIT — see LICENSE. Copyright © 2026 AutoQEC Contributors.
