qualit527/qec-ai-decoder

QEC AI-Enhanced Decoder — AutoQEC

AutoQEC is an LLM-agent-driven auto-research harness for discovering neural predecoders for quantum error-correcting codes. Given an environment triple (code_spec, noise_model, constraints), the system runs 10–20 rounds of hypothesis → DSL config → training → evaluation → analysis and emits verified predecoder checkpoints on the accuracy–latency–parameters Pareto front.


⚡ Review in one prompt

Clone the repo, open a Claude Code / Codex CLI in the project root, and paste the block below verbatim. The agent will install deps, run the six headline demos, and hand back a pass/fail table. Total wall-clock on CPU-only hardware: ~15–20 min (skip step 0's unit suite if you're in a hurry — ~3 min off).

Run the 6 demos below in order and report results. All commands run from the
repo root. First create and activate .venv (Windows: `.venv\Scripts\activate`,
Unix: `source .venv/bin/activate`), then `pip install -e '.[dev]'`. Do NOT
enter the live-LLM path (nested CLI sessions are unstable). For each step
report: cmd, wall-clock, key artifact paths, pass/fail against the stated
criteria. At the end give one markdown summary table.

0. Preflight (optional but strongly recommended):
   - `ruff check autoqec cli tests scripts` must print "All checks passed"
   - `pytest tests/ -m "not integration" -q` must be all green (~4 min, 290 tests)
   - `pytest tests/test_reward_hacking_traps.py -m integration --run-integration -v`
     must be 3/3 green (trap_A/B/C verifier guards)

1. Demo 1 — surface-code end-to-end loop (~2.5 min):
   `bash demos/demo-1-surface-d5/run_quick.sh`
   Pass criteria: in the newest `runs/<ts>/`, all of `history.jsonl`,
   `candidate_pareto.json`, `fork_graph.json`,
   `round_1/{config.yaml,train.log,checkpoint.pt,metrics.json}` exist;
   `metrics.json.status == "ok"`.

2. Demo 2 — qLDPC cross-code-family (~3 min, routes through OSD not MWPM):
   `MODE=fast bash demos/demo-2-bb72/run.sh`
   Pass criteria: in the new `runs/<ts>/round_1/metrics.json`, `status == "ok"`;
   the env config uses `classical_backend: osd` (prove the backend actually
   switched away from the surface-code MWPM path).

3. Demo 3 — worktree branches-as-Pareto provenance (~30 s, no LLM):
   `bash demos/demo-3-worktree-provenance/run.sh`
   Pass criteria: three `run-round` invocations all exit 0; every
   `metrics.json` has a non-empty `branch` and a real `commit_sha`; the
   compose worktree contains both `round_1/round_1_pointer.json` and
   `round_2/round_2_pointer.json` (proving both parents merged); the
   deliberate-conflict step yields `status="compose_conflict"` with
   `conflicting_files` populated.

4. Demo 4 — reward-hacking rejection (~1 min, memorizer must be rejected):
   `bash demos/demo-4-reward-hacking/run.sh`
   Pass criteria: `runs/demo-4/round_0/verification_report.json.verdict` is in
   {SUSPICIOUS, FAILED} — never VERIFIED; script exits 0.

5. Demo 5 — failure root-cause diagnosis (~5 s):
   `bash demos/demo-5-failure-recovery/run.sh`
   Pass criteria: stdout JSON contains `"status": "compile_error"` and a
   `status_reason` field.

6. Demo 6 — advisor replay / offline reproducibility (~2 min):
   `bash demos/demo-6-advisor-replay/run.sh`
   Pass criteria: `cli.autoqec package-run` produces a `runs/<run_id>.tar.gz`
   archive; `autoqec.tools.advisor_replay` extracts it under
   `runs/demo-6-replay/` and re-runs `cli.autoqec verify` with all
   `AUTOQEC_*_BACKEND` env vars unset and sockets blocked; the offline
   replay verify exits 0.

Summary table columns: demo | cmd | wall-clock | key artifact | pass/fail.
Rules:
- If any step fails, keep going to the next step; record the fail in the
  table rather than aborting.
- On failure, paste the last 30 lines of that step's stdout/stderr — do not
  silently swallow errors.
- Do not modify source files; only read, run, and report.
- If every demo passes, end with the line "harness end-to-end healthy"; if any
  failed, end with "needs triage: <which step>".

What the six demos prove, in plain language:

| Demo | Answer it gives the reviewer |
|---|---|
| 1: surface_d5 no-LLM smoke | The full round pipeline (DSL → train → eval → Pareto → fork_graph.json) actually writes every artifact, with Δ_LER reported on the surface-code path. |
| 2: bb72 qLDPC | The same harness swaps MWPM → OSD cleanly; (code, noise, constraints) is really the only input knob. |
| 3: Worktree provenance | Branches-as-Pareto isn't paperware: every round commits its round_<N>_pointer.json into a real exp/<run_id>/<NN>-<slug> branch, compose rounds git merge two parents end-to-end, and a conflict is recorded as status="compose_conflict" rather than thrown. |
| 4: Reward-hacking detection | A hand-built memorizer cheater gets rejected (SUSPICIOUS or FAILED) and is never admitted to Pareto; the independent verifier actually guards the front. |
| 5: Failure recovery | cli.autoqec diagnose takes a broken round dir and emits a machine-readable root cause (compile_error + reason), feeding the /diagnose-failure skill. |
| 6: Advisor replay | A run can be packaged (cli.autoqec package-run produces a .tar.gz archive) and replayed offline in a clean directory with all AUTOQEC_*_BACKEND env vars unset and sockets blocked, showing that runs are network-free reproducible artifacts, not just local state. |
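
The Demo 1 pass criteria from the prompt can also be checked mechanically. The sketch below is not part of the repo: `newest_run` and `check_demo1` are hypothetical helper names, and picking the newest run by modification time is an assumption (the README only says "the newest `runs/<ts>/`"). Only the file names and the `status == "ok"` condition come from the criteria above.

```python
import json
from pathlib import Path

# Artifact names taken from the Demo 1 pass criteria.
RUN_LEVEL = ("history.jsonl", "candidate_pareto.json", "fork_graph.json")
ROUND_LEVEL = ("config.yaml", "train.log", "checkpoint.pt", "metrics.json")


def newest_run(runs_root: Path) -> Path:
    """Pick the most recently modified directory under runs/ (assumption: mtime order)."""
    candidates = [p for p in runs_root.iterdir() if p.is_dir()]
    if not candidates:
        raise FileNotFoundError(f"no run directories under {runs_root}")
    return max(candidates, key=lambda p: p.stat().st_mtime)


def check_demo1(run_dir: Path) -> list[str]:
    """Return a list of problems; an empty list means the criteria are met."""
    problems = []
    for name in RUN_LEVEL:
        if not (run_dir / name).is_file():
            problems.append(f"missing {name}")
    round_dir = run_dir / "round_1"
    for name in ROUND_LEVEL:
        if not (round_dir / name).is_file():
            problems.append(f"missing round_1/{name}")
    metrics_path = round_dir / "metrics.json"
    if metrics_path.is_file():
        status = json.loads(metrics_path.read_text()).get("status")
        if status != "ok":
            problems.append(f"metrics.json status is {status!r}, expected 'ok'")
    return problems
```

Run it right after Demo 1 finishes, for example `check_demo1(newest_run(Path("runs")))`, since later demos also write under `runs/` and would change which directory is newest.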

The live-LLM research loop (/autoqec-run) is intentionally out of the one-prompt flow — nesting Claude-Code-inside-Claude-Code is unstable. See docs/verification/human-verification-report-2026-04-24.md for the last end-to-end retest and .claude/skills/autoqec-run/SKILL.md for the manual path.


Architecture at a glance

The classical backend guarantees structural validity; the predecoder contributes Δ_LER = LER(plain_classical) − LER(predecoder + classical) — a single clean number per round.

syndrome + code DEM
        │
        ↓
┌─────────────────────────────────────────┐
│ AI Predecoder (agent searches here)     │
│   type:        gnn | neural_bp          │
│   output_mode: hard_flip | soft_priors  │
└─────────────────────────────────────────┘
        │
        ↓
┌─────────────────────────────────────────┐
│ Classical Backend (fixed per env)       │
│   surface codes → MWPM (PyMatching)     │
│   qLDPC         → OSD                   │
└─────────────────────────────────────────┘
        │
        ↓
 logical correction
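
As a minimal illustration of the Δ_LER definition above, the sketch below computes it from per-shot failure flags. The function names and the toy data are illustrative assumptions; the harness's own evaluation code is not shown here.

```python
from typing import Sequence


def logical_error_rate(failures: Sequence[bool]) -> float:
    """Fraction of shots whose final logical correction was wrong."""
    if not failures:
        raise ValueError("need at least one shot")
    return sum(failures) / len(failures)


def delta_ler(plain_failures: Sequence[bool], predecoded_failures: Sequence[bool]) -> float:
    """Delta_LER = LER(plain_classical) - LER(predecoder + classical)."""
    return logical_error_rate(plain_failures) - logical_error_rate(predecoded_failures)


# Toy data: 10 shots; the plain backend fails on 3, the predecoded path on 1.
plain = [True, True, True] + [False] * 7
predecoded = [True] + [False] * 9
print(f"delta_LER = {delta_ler(plain, predecoded):+.3f}")  # prints delta_LER = +0.200
```

A positive value means the predecoder lowered the logical error rate relative to the plain classical backend; a negative value means it made decoding worse.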

Each round is a 3-subagent DAG (+ optional Verifier). Roles are backend-pluggable via AUTOQEC_{IDEATOR,CODER,ANALYST}_BACKEND; the Runner is local-only and enforces a tool whitelist (autoqec/runner/safety.py) so reward-hacking paths to training-set syndromes are physically blocked.

 round N dispatch · fork_graph + machine_state
        ↓
┌─────────────────────────────────────────┐
│ Ideator                     (subagent)  │
│   hypothesis + fork_from + compose_mode │
└─────────────────────────────────────────┘
        ↓
┌─────────────────────────────────────────┐
│ Coder                       (subagent)  │
│   DSL config + commit_message           │
│   Tier-1 canonical · Tier-2 custom_fn   │
└─────────────────────────────────────────┘
        ↓
┌─────────────────────────────────────────┐
│ Runner                       (script)   │
│   train + eval → RoundMetrics           │
└─────────────────────────────────────────┘
        ↓
┌─────────────────────────────────────────┐
│ Analyst                     (subagent)  │
│   verdict: candidate | ignore           │
└─────────────────────────────────────────┘
        ↓   (if candidate)
┌─────────────────────────────────────────┐
│ Verifier           (optional subagent)  │
│   VERIFIED → Pareto admit               │
│   SUSPICIOUS / FAILED → rejected        │
└─────────────────────────────────────────┘
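
The gating at the bottom of the diagram can be written as a small decision function. This is a sketch under stated assumptions: the enum and function names are hypothetical, and treating an unverified candidate as "not admitted" is inferred from the "verify-admitted candidates" wording rather than confirmed by the code.

```python
from enum import Enum
from typing import Optional


class AnalystVerdict(str, Enum):
    CANDIDATE = "candidate"
    IGNORE = "ignore"


class VerifierVerdict(str, Enum):
    VERIFIED = "VERIFIED"
    SUSPICIOUS = "SUSPICIOUS"
    FAILED = "FAILED"


def should_run_verifier(analyst: AnalystVerdict) -> bool:
    """The Verifier is only dispatched for rounds the Analyst marks as candidates."""
    return analyst is AnalystVerdict.CANDIDATE


def admit_to_pareto(analyst: AnalystVerdict, verifier: Optional[VerifierVerdict]) -> bool:
    """Admit a round to the Pareto front only when it is a VERIFIED candidate."""
    if analyst is not AnalystVerdict.CANDIDATE:
        return False
    # Assumption: a candidate whose verification was skipped (None) stays off the front.
    return verifier is VerifierVerdict.VERIFIED
```

Keeping the check as a pure function of the two verdicts makes the rule easy to unit-test independently of any subagent backend.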

Each round runs on its own exp/<run_id>/<NN>-<slug> git branch inside a .worktrees/ checkout, Pareto members are the complete non-dominated set of VERIFIED branches, and compose rounds test git merge parent-A parent-B as a first-class scientific probe. Startup reconciliation keeps history.jsonl and the live branch set in sync.

 main  (f51cfcf · run_id = demo38-…103750)
   │
   ├─ 01-idea-a           ★ Δ=+0.18  ──┐
   │                                   │
   ├─ 02-idea-b           ★ Δ=+0.12  ──┤   git merge   (compose_mode = merge)
   │                                   │
   ├─ 03-compose-ab       ★ Δ=+0.22  ──┘   dominates both parents
   │
   ├─ 04-idea-c           ignore          Analyst: Δ within bootstrap CI
   │
   ├─ 90-conflict-L       ◇ compose_conflict · branch=None · sha=None
   ├─ 91-conflict-R       ◇ compose_conflict · merge refused · no worktree
   │
   └─ 07-orphan           ⊘ orphaned_branch  · healed by reconcile

 ★ on non-dominated Pareto    ◇ merge refused    ⊘ orphan recovered on startup
 each  exp/<run_id>/<NN>-<slug>  ≡ its own  .worktrees/exp-…/  checkout
 runs/<run_id>/  stores   history.jsonl · pareto.json · fork_graph.json (this tree)
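
"Complete non-dominated set" has a precise meaning over the three axes named under Deliverables (Δ_LER, FLOPs, n_params). The sketch below assumes Δ_LER is maximized while FLOPs and n_params are minimized; the `Candidate` class is hypothetical and the cost numbers are invented for illustration, not taken from any run.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    branch: str
    delta_ler: float  # higher is better
    flops: int        # lower is better
    n_params: int     # lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """a dominates b if it is no worse on every axis and strictly better on one."""
    no_worse = (a.delta_ler >= b.delta_ler
                and a.flops <= b.flops
                and a.n_params <= b.n_params)
    strictly_better = (a.delta_ler > b.delta_ler
                       or a.flops < b.flops
                       or a.n_params < b.n_params)
    return no_worse and strictly_better


def pareto_front(candidates: list[Candidate]) -> list[Candidate]:
    """Keep every candidate that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Invented numbers: the composed branch wins on delta_LER but costs more compute.
verified = [
    Candidate("01-idea-a", 0.18, 1_000_000, 50_000),
    Candidate("02-idea-b", 0.12, 400_000, 20_000),
    Candidate("03-compose-ab", 0.22, 1_500_000, 70_000),
    Candidate("05-idea-d", 0.10, 900_000, 60_000),  # worse than 02-idea-b on all axes
]
print([c.branch for c in pareto_front(verified)])
# prints ['01-idea-a', '02-idea-b', '03-compose-ab']
```

With these numbers the two parents stay on the front next to the composed branch because they are cheaper, which is one way several branches can be non-dominated at once.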

Deliverables

6 Features (core capabilities of the harness)

All six are present on main; the one-prompt flow above exercises F1–F6 end-to-end.

| # | Feature | Status | Evidence |
|---|---|---|---|
| F1 | End-to-end research loop over any (code, noise, constraints) triple | implemented | cli/autoqec.py run + autoqec/orchestration/llm_loop.py; Demo 1 / Demo 2 drive the no-LLM path |
| F2 | Tier-1 canonical DSL + Tier-2 custom_fn escape hatch with AST+smoke validation | implemented | autoqec/decoders/{dsl_schema,dsl_compiler,custom_fn_validator,custom_fn_rules}.py; 18 hostile cases in tests/test_custom_fn_validator.py |
| F3 | Independent verification module with 3 fair-baseline guards (seed isolation, paired bootstrap CI, ablation sanity) | implemented | autoqec/eval/independent_eval.py + eval/bootstrap.py; trap_A/B/C guards proven on main (Demo 4 + tests/test_reward_hacking_traps.py) |
| F4 | Multi-agent orchestration (Ideator / Coder / Analyst) with tool whitelisting + 3-layer memory + machine_state tool | implemented | autoqec/agents/dispatch.py + .claude/agents/autoqec-{ideator,coder,analyst}.md; autoqec/orchestration/memory.py; autoqec/tools/machine_state.py; autoqec/runner/safety.py |
| F5 | Pareto-front maintenance across (Δ_LER, FLOPs, n_params) with verify-admitted candidates | implemented | autoqec/pareto/front.py + orchestration/round_recorder.py; atomic pareto.json swap covered by tests/test_pareto_atomic*.py |
| F6 | Worktree-based experiment model (branches-as-Pareto; compose rounds; startup reconciliation; fork_graph.json) | implemented | autoqec/orchestration/{worktree,subprocess_runner,reconcile,fork_graph}.py; persistence proven by tests/test_fork_graph_persist.py |
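
F3 lists a paired bootstrap CI among its guards. For readers unfamiliar with the technique, here is a generic percentile-bootstrap sketch over paired per-shot outcomes. It is not the contents of `eval/bootstrap.py`; the function name, defaults, and resampling scheme are assumptions.

```python
import random
from typing import Sequence


def paired_bootstrap_ci(
    plain_failures: Sequence[bool],
    predecoded_failures: Sequence[bool],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float]:
    """Percentile CI for Delta_LER, resampling shots so both decoders stay paired."""
    if len(plain_failures) != len(predecoded_failures) or not plain_failures:
        raise ValueError("need two equally long, non-empty outcome lists")
    rng = random.Random(seed)
    n = len(plain_failures)
    # Per-shot difference: +1 if only the plain path failed, -1 if only the
    # predecoded path failed, 0 if both agree. Its mean is Delta_LER.
    diffs = [int(p) - int(q) for p, q in zip(plain_failures, predecoded_failures)]
    estimates = sorted(
        sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples)
    )
    lo = estimates[int((alpha / 2) * n_resamples)]
    hi = estimates[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi
```

Resampling shot indices rather than each decoder's outcomes separately keeps the pairing intact: both paths are scored on the same syndromes, so shot-to-shot noise cancels in the difference. An interval that contains zero suggests the measured improvement may not be real.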

6 Demos (each produces a reproducible artifact)

All six ship as runnable demos. The /add-env onboarding flow stays a CLI-only subcommand for now (python -m cli.autoqec add-env …) and is tracked under Skills below rather than as a demo.

| # | Demo | Proves | Priority | Status |
|---|---|---|---|---|
| D1 | surface_d5 full research run (demos/demo-1-surface-d5/run_quick.sh) | End-to-end harness works | P0 | implemented |
| D2 | bb72 qLDPC research run (demos/demo-2-bb72/run.sh; fast/dev/prod modes) | Genericity across codes / classical backends (MWPM → OSD) | P1 | implemented |
| D3 | Worktree branches-as-Pareto provenance (demos/demo-3-worktree-provenance/run.sh) | Every round runs in its own exp/<run_id>/<NN>-<slug> git worktree, commits a round_<N>_pointer.json, and compose_conflict is a recorded failure mode (issue #38) | P1 | implemented |
| D4 | Reward-hacking detection (demos/demo-4-reward-hacking/run.sh) | Memorizer cheater gets SUSPICIOUS / FAILED verdict | P0 | implemented |
| D5 | Failure recovery (demos/demo-5-failure-recovery/run.sh) | cli.autoqec diagnose identifies compile_error root cause | P2 | implemented |
| D6 | Advisor replay / offline reproducibility (demos/demo-6-advisor-replay/run.sh) | A packaged run replays under runs/demo-6-replay/ with no LLM backends and no network (cli.autoqec package-run + autoqec.tools.advisor_replay) | P2 | implemented |
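
The offline conditions D6 relies on (all `AUTOQEC_*_BACKEND` env vars unset, sockets blocked) can be approximated in plain Python. The sketch below is an illustration only and does not claim to be how `autoqec.tools.advisor_replay` works; `offline_env` and `block_sockets` are hypothetical names.

```python
import os
import re
import socket

_BACKEND_VAR = re.compile(r"^AUTOQEC_.*_BACKEND$")


def offline_env(env=None):
    """Return a copy of the environment without any AUTOQEC_*_BACKEND variable."""
    source = os.environ if env is None else env
    return {k: v for k, v in source.items() if not _BACKEND_VAR.match(k)}


def _refuse_socket(*args, **kwargs):
    raise RuntimeError("network access is disabled for offline replay")


def block_sockets():
    """Replace socket.socket in this process; returns a callable that restores it."""
    original = socket.socket
    socket.socket = _refuse_socket

    def restore():
        socket.socket = original

    return restore
```

The patch only affects the current interpreter. A replay that shells out to a subprocess would need the same guard inside the child (or OS-level isolation), while the filtered environment can be passed along via `subprocess.run(..., env=offline_env())`.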

Skills (LLM-reasoning user surfaces, exposed as /<name>)

The implemented skills listed below live under .claude/skills/ and are discoverable from Claude Code. /add-env remains a CLI-only subcommand for now; /read-zulip recovers off-repo hackathon context; /review-framework closes the autoresearch loop by turning findings from a completed run into proposed framework improvements.

| Skill | Purpose | Status |
|---|---|---|
| /autoqec-run | Run the full research loop on an env YAML | implemented |
| /verify-decoder | Audit a Pareto candidate against holdout seeds (wraps cli.autoqec verify) | implemented |
| /review-log | Read an entire runs/<id>/log.md, flag stuck hypotheses / overfitting | implemented |
| /review-framework | Read a completed run and propose framework improvements (DSL gaps, weak baselines, prompt drift, env limits, orchestration friction); advisory only, never edits framework code | implemented |
| /diagnose-failure | Root-cause a broken or stalled round, recommend a fix (wraps cli.autoqec diagnose) | implemented |
| /demo-presenter | Generate an evidence-backed walkthrough and live AI narration for the merged demos plus the planned PR-only worktree demo | implemented |
| /read-zulip | Pull Zulip stream/topic history for off-repo project context | implemented |
| /add-env | Interactively create a new env YAML | planned (CLI only for now: python -m cli.autoqec add-env) |

Repo layout (planned)

qec-ai-decoder/
├── autoqec/                        # Python package
│   ├── envs/                       # EnvSpec + builtin envs
│   ├── runner/                     # Non-LLM train + eval + safety + FLOPs
│   ├── eval/                       # independent_eval (ISOLATED from runner)
│   ├── decoders/                   # DSL schema, compiler, GNN/Neural-BP modules
│   ├── orchestration/              # Research-loop driver + 3-layer memory
│   ├── agents/                     # Subagent dispatcher
│   ├── llm/                        # claude-cli / codex-cli router
│   ├── tools/                      # machine_state, etc.
│   ├── pareto/                     # Front maintenance
│   └── example_db/                 # Tier-1 seed templates
├── cli/autoqec.py                  # click CLI entry
├── .claude/
│   ├── agents/                     # Subagent prompt files
│   └── skills/                     # 6 user-facing skills
├── demos/                          # 6 demos with run.sh + (some) walkthrough.md
├── envs/                           # User-authored env YAMLs
├── circuits/                       # *.stim files
├── runs/                           # .gitignore'd; per-run outputs
├── knowledge/                      # Literature, roadmap, strategic docs
├── docs/superpowers/specs/         # Design specs
├── docs/superpowers/plans/         # Implementation plans
├── docs/contracts/                 # Phase-0 interface contracts
├── tests/
├── Makefile
└── pyproject.toml

License

MIT — see LICENSE. Copyright © 2026 AutoQEC Contributors.
