satyaborg · satyaborg · May 26, 2026 · May 25, 2026 · May 25, 2026 · May 25, 2026
diff --git a/.gitignore b/.gitignore
@@ -1,3 +1,5 @@
 .codex/
 .specs/
 .DS_Store
+coverage/
+node_modules/
diff --git a/README.md b/README.md
@@ -1,112 +1,113 @@
 # devloop
 
-Spec in, accepted code out. Codex implements, Claude reviews, loop until ACCEPT, stall, or max turns. One bash file.
+Codex implements. Claude reviews. devloop runs the loop until the work is accepted, stalls, becomes unclear, hits max turns, or an agent fails.
 
+```sh
+devloop [--plain|--tui] [--no-strict] [--report-format html|markdown] spec.md [max=5]
 ```
-devloop.sh [--report-format html|markdown] path/to/spec.md [max=5]
-```
-
-## Why
 
-Skills-as-orchestrators drift. The LLM driver has discretion to skip steps and often does, especially under load. A shell state machine cannot. devloop is the same workflow as the [`/devloop`](https://github.com/anthropics/claude-code) skill it replaces, minus the discretion.
+Run from the target git worktree. The spec may live anywhere.
 
-What stays in the LLMs (because they are good at it):
-- Codex: implementation, design decisions, fix passes
-- Claude: review judgment, verdict, final synthesis
+Start new specs from [`templates/spec.md`](templates/spec.md), usually copied to `.specs/YYYY-MM-DD-slug.md`.
 
-What moves to bash (because the LLM does not need discretion here):
-- Sequencing the loop
-- Spawning each agent headless
-- Reusing one Codex implementation session and one Claude review session
-- Parsing the verdict
-- Detecting stalls
-- File path conventions
-- Stopping at max turns
+The template is intentionally short: clear problem, observable outcome, tight scope, behavior examples, verifiable acceptance criteria, regression-first test plan, constraints, and only material notes.
 
-## Quick start
+## Install
 
-Prereqs on PATH: `claude`, `codex`, `git`.
+From this checkout:
 
 ```sh
-# 1. write a spec
-cat > .specs/add-foo-flag.md <<'EOF'
-# Add foo flag to bar config
+bun scripts/install.ts
+```
 
-## Acceptance criteria
-1. ...
-EOF
+That installs dependencies and links `devloop` into `~/.local/bin`. Set `DEVLOOP_BIN_DIR` to choose another bin directory.
 
-# 2. loop
-./devloop.sh .specs/add-foo-flag.md
-```
+## Defaults
 
-Defaults to unattended (`--dangerously-bypass-approvals-and-sandbox` for codex, `--dangerously-skip-permissions` for claude). Run inside a git worktree, not your main checkout.
+- strict mode is on
+- HTML reports are on
+- max turns default to 5 and clamp to 1-10
+- TTY runs use the collapsed OpenTUI view
+- non-TTY runs use plain output
+- accepted runs create a local branch and local commit
+- no-arg `devloop` shows the logo and common commands
 
-The implementation worktree is resolved from the directory where you invoke `devloop.sh`, not from the spec file's location. The spec can live elsewhere; Codex and Claude are pointed at the current worktree.
+Use `--plain` for CI. Use `--tui` to force the TUI. Use `--no-strict` only when you explicitly want weaker gates.
 
-## The loop
+## Strict Acceptance
 
-```
-pass 1: codex implements against spec
-        claude reviews → ACCEPT | REJECT | UNCLEAR
-pass N: codex fixes findings from review N-1
-        claude reviews
-exit:   ACCEPT          → 0
-        stall | max | unclear → 1
-        codex/claude error    → 2
+Strict mode requires:
+
+```md
+## Acceptance criteria
+1. ...
 ```
 
-Stall = normalized findings hash matches the prior REJECT.
+Codex is prompted to work regression-first: add or update tests, observe the red phase when behavior changes, implement the smallest fix, then run targeted tests, full tests, lint/typecheck, and coverage.
 
-## Sessions
+Claude must write:
 
-Each spec slug gets two persisted sessions:
+```md
+## Acceptance matrix
 
-```
-.codex/sessions/<slug>-codex.id
-.codex/sessions/<slug>-claude.id
+- AC1: PASS - evidence
 ```
 
-Pass 1 starts the Codex implementation session and records the resumable session ID. Later fix passes call `codex exec resume <session-id>`, so Codex keeps the implementation context. Claude uses one review session for every review pass and the final report.
+In strict mode, `Verdict: ACCEPT` only counts when every parsed criterion has a `PASS` matrix row. Missing evidence exits as `unclear`.
 
-## Artifacts
+## Output
 
-```
-.codex/tracks/<slug>.md        codex's running notes per pass
-.codex/reviews/<slug>-r<N>.md  one per review turn
-.codex/reports/<slug>.html     synthesized post-mortem by default
-.codex/reports/<slug>.md       synthesized post-mortem with --report-format markdown
-.codex/logs/                   raw agent stdout for debugging
+```text
+.codex/tracks/<slug>.md
+.codex/reviews/<slug>-r<N>.md
+.codex/reports/<slug>.html
+.codex/reports/<slug>.md
+.codex/logs/
+.codex/sessions/
 ```
 
-## Tests
+Reports can be HTML or Markdown:
 
 ```sh
-./tests/devloop_test.sh
+devloop --report-format html .specs/change.md
+devloop --report-format markdown .specs/change.md
+devloop --md .specs/change.md
+```
+
+Reports include result, passes, repo, spec, base branch, starting branch, final branch, local commit, commit message, Codex session ID, Claude session ID, track, and review files.
+
+## Local Commit
+
+On `accepted`, devloop creates or reuses:
+
+```text
+devloop/<spec-slug>
 ```
 
-The tests run the shell state machine against temporary git repos with mocked `codex` and `claude` commands, so they do not call either agent.
+It commits only files that were clean when the run started and excludes `.codex/`. Commit messages are Conventional Commit style:
 
-## The report
+```text
+feat: <spec-slug>
+```
 
-Not a mechanical concat. Claude is called one more time in the same review session with the spec + track + all reviews and writes a learning-oriented post-mortem:
+devloop does not push or open a PR.
 
-- **Shape of the problem** — what the spec really asked for, alternatives ruled out
-- **What was built** — design choices and the tradeoffs weighed
-- **What review caught (and why it mattered)** — symptom → root cause → principle violated, grouped into patterns
-- **What to remember next time** — transferable lessons in `When X, prefer Y because Z` form
-- **Residual risk** — concrete, not generic
+## Development
 
-The "why" is enforced in the prompts: codex must explain decisions, claude must articulate the principle behind each finding ("if you cannot articulate the principle, the finding is too shallow — drop it or sharpen it").
+Prereqs: `bun`, `codex`, `claude`, `git`.
 
-## Caveats
+```sh
+bun scripts/install.ts
+bun run typecheck
+bun test
+```
 
-- **Unattended = trusts both agents.** Use worktrees.
-- **Sessions persist per spec slug.** Delete the matching files in `.codex/sessions/` when you want a fresh Codex or Claude context for the same spec filename.
-- **No spec writing.** Deliberate. Write the spec yourself (or via an interview skill) and hand the path in.
-- **Stall detection is hash-based.** Cosmetic rewording of identical findings will defeat it.
-- **Base branch is auto-guessed** (`origin/HEAD` → `main` → `master`). Edit the `BASE=` line for stacked branches.
+`bun test` enforces 100% line/function/statement coverage for the TypeScript core.
 
-## License
+## References
 
-MIT
+- [ISO/IEC/IEEE 29148:2018](https://byui-cse.github.io/cse372-course/Reading/CSE372Week10-IEEE.pdf): requirements should be necessary, unambiguous, singular, feasible, verifiable, and focused on what is needed.
+- [Erdogmus, Morisio, and Torchiano, 2005](https://cs.unm.edu/~joel/cs351/paper/IEEE-Effectiveness_of_Test-First_Approach_to_Programming.pdf): test-first work formalizes functionality as tests, gives fast feedback, and supports small measurable tasks.
+- [Fucci et al., 2016](https://bura.brunel.ac.uk/bitstream/2438/14550/1/FullText.pdf): TDD benefits depend heavily on fine-grained, steady cycles, not ceremony.
+- [Rafique and Misic, 2013](https://openurl.ebsco.com/contentitem/doi%3A10.1109/tse.2012.28?id=ebsco%3Adoi%3A10.1109%2Ftse.2012.28&sid=ebsco%3Aplink%3Acrawler) and [Bissi, Neto, and Emer, 2016](https://www.sciencedirect.com/science/article/abs/pii/S0950584916300222): TDD evidence is strongest for quality, less conclusive for productivity.
+- [Agile Alliance user stories](https://agilealliance.org/glossary/user-stories/), [Given-When-Then](https://agilealliance.org/glossary/given-when-then/), and [Cucumber Gherkin reference](https://cucumber.io/docs/gherkin/reference/): acceptance criteria are strongest when they become concrete, observable examples.
diff --git a/bun.lock b/bun.lock
diff --git a/bunfig.toml b/bunfig.toml
@@ -0,0 +1,5 @@
+[test]
+coverage = true
+coverageThreshold = { lines = 1.0, functions = 1.0, statements = 1.0 }
+coverageReporter = ["text", "lcov"]
+coverageSkipTestFiles = true