Interactive multi-turn test harness for agents #20

## Why

We have 25+ agents, dozens of slash commands, and increasingly rich multi-turn flows (Ctrl+V image paste → edit → resend; conversation cloning; agent switching mid-session). Manual end-to-end testing is painful: launch graff, type commands, eyeball model output, save state, switch agents. As a result:

- Regressions slip through. The recent FORGE→GRAFF rebrand missed four separate user-visible spots (banner, doctor, zsh rprompt, REPL prompt), and each one surfaced only when the user noticed it in their terminal.
- New features ship without exercise. Image paste landed in #19 but nothing automatically verifies that a pasted image produces a multimodal block reaching the model.
- Multi-turn flows are almost never tested in CI.

## What we want

A test harness that scripts a multi-turn conversation against the live agent — runnable both as a Rust unit test and as an interactive CLI for iterative agent development. Ergonomics goal: writing a new scenario should be as easy as writing an `insta` snapshot; running existing scenarios in CI should be deterministic and require no live API.

## Proposed shape

### A. Driver API (Rust, in a new `forge_test_harness` crate)

```rust
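// Proposed API: a session backed by a replayed fixture and a fake clipboard.
let mut session = TestSession::new()
    .with_provider(MockProvider::from_fixture("tests/fixtures/cat-replay.json"))
    .with_clipboard(ClipboardFixture::image("tests/fixtures/cat.png"))
    .build();

// Turn 1: paste the clipboard image, ask about it, and check the result.
session.send_keys("ctrl+v")
       .send_text(" what is this")
       .submit()
       .assert_tool_called("image_describe")
       .assert_response_matches(r"(?i)cat");

// Turn 2: run `:new`, send a greeting, and expect exactly one response.
session.send_command(":new")
       .send_text("hi")
       .submit()
       .assert_response_count(1);

// Compare the whole run against a stored snapshot.
session.snapshot("cat-image-then-greeting");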
```
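
The chained calls above suggest that each driver method returns `&mut Self`. The following is a minimal sketch of that pattern only, assuming a hypothetical, stripped-down `TestSession` that buffers input and records submitted turns. The method `assert_turn_count` and the accessor `turns` are illustrative names, and no agent or provider is involved.

```rust
/// Hypothetical, heavily simplified stand-in for the proposed `TestSession`.
/// It only buffers typed input and records what was submitted.
#[derive(Default)]
pub struct TestSession {
    input: String,          // text typed since the last submit
    submitted: Vec<String>, // one entry per submitted turn
}

impl TestSession {
    pub fn new() -> Self {
        Self::default()
    }

    /// Appends text to the pending input, like typing at the prompt.
    pub fn send_text(&mut self, text: &str) -> &mut Self {
        self.input.push_str(text);
        self
    }

    /// Submits the pending input as one turn and clears the buffer.
    pub fn submit(&mut self) -> &mut Self {
        let turn = std::mem::take(&mut self.input);
        self.submitted.push(turn);
        self
    }

    /// Panics (failing the test) if the number of submitted turns differs.
    pub fn assert_turn_count(&mut self, expected: usize) -> &mut Self {
        assert_eq!(self.submitted.len(), expected, "unexpected turn count");
        self
    }

    /// Read-only view of the turns submitted so far.
    pub fn turns(&self) -> &[String] {
        &self.submitted
    }
}
```

Returning `&mut Self` keeps the session reusable across several chained statements, which matches the `let mut session` binding in the example above. Assertions that panic integrate with `cargo test` without extra plumbing.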

### B. Scenario format (YAML for non-Rust authors / CI fixtures)

```yaml
scenario: clipboard-image-paste
provider: mock
fixtures:
  clipboard_image: tests/fixtures/cat.png
  replay: tests/fixtures/cat-replay.json
turns:
  - keys: [ctrl+v]
  - text: " what is this"
  - submit: true
  - assert:
      tool_called: image_describe
      response_matches: "(?i)cat"
  - command: ":new"
  - text: "hi"
  - submit: true
  - assert:
      response_count: 1
snapshot: cat-image-then-greeting
```
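
Once parsed, the `turns:` list could map onto a small enum that a runner walks in order. The sketch below assumes hypothetical names (`Step`, `run_scenario`), leaves out YAML deserialization, and covers only one assertion kind. It also assumes that `:new` resets the response count, which the issue does not state explicitly.

```rust
/// Hypothetical in-memory form of one entry under `turns:`.
#[derive(Debug, Clone, PartialEq)]
pub enum Step {
    Keys(Vec<String>),
    Text(String),
    Command(String),
    Submit,
    AssertResponseCount(usize),
}

/// Walks the steps in order. `respond` stands in for the provider: it maps
/// the submitted input to a response. Returns the responses of the current
/// conversation, or an error naming the first failed assertion.
pub fn run_scenario(
    steps: &[Step],
    mut respond: impl FnMut(&str) -> String,
) -> Result<Vec<String>, String> {
    let mut input = String::new();
    let mut responses: Vec<String> = Vec::new();
    for (index, step) in steps.iter().enumerate() {
        match step {
            // Key chords are recorded as placeholders such as `<ctrl+v>`.
            Step::Keys(keys) => {
                for key in keys {
                    input.push_str(&format!("<{}>", key));
                }
            }
            Step::Text(text) => input.push_str(text),
            Step::Command(name) => {
                // Assumption: `:new` starts a fresh conversation.
                if name == ":new" {
                    responses.clear();
                }
                input.clear();
            }
            Step::Submit => {
                let reply = respond(&input);
                responses.push(reply);
                input.clear();
            }
            Step::AssertResponseCount(expected) => {
                if responses.len() != *expected {
                    return Err(format!(
                        "step {}: expected {} responses, got {}",
                        index,
                        expected,
                        responses.len()
                    ));
                }
            }
        }
    }
    Ok(responses)
}
```

Returning a `Result` instead of panicking lets a CLI runner report the failing step and continue with the next scenario.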

A small `forge-test` CLI runs these scenarios and supports `--accept` (insta-style snapshot update), `--watch` (re-run on file change), and `--record` (capture live API output to a fixture for replay).

### C. Mock provider

Implement `Provider::Mock`, which consumes a deterministic fixture (JSONL of model events: token streams, tool calls, finish reasons) and replays it in real time. It requires no network and captures every tool call invocation for assertion.
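
A rough sketch of the replay side, assuming the fixture has already been parsed into events. The `ModelEvent` variants and the `from_events` / `next_turn` methods are hypothetical, since the fixture schema and the provider trait are not defined here. The sketch ignores timing and replays events as fast as they are requested.

```rust
use std::collections::VecDeque;

/// Hypothetical event types for one line of the JSONL fixture.
#[derive(Debug, Clone, PartialEq)]
pub enum ModelEvent {
    Token(String),
    ToolCall { name: String },
    Finish { reason: String },
}

/// Replays pre-recorded events and remembers every tool call it has seen.
pub struct MockProvider {
    events: VecDeque<ModelEvent>,
    tool_calls: Vec<String>,
}

impl MockProvider {
    pub fn from_events(events: Vec<ModelEvent>) -> Self {
        Self {
            events: events.into(),
            tool_calls: Vec::new(),
        }
    }

    /// Consumes events up to and including the next `Finish` and returns the
    /// concatenated token text for that turn. Returns `None` once the
    /// fixture is exhausted.
    pub fn next_turn(&mut self) -> Option<String> {
        if self.events.is_empty() {
            return None;
        }
        let mut text = String::new();
        while let Some(event) = self.events.pop_front() {
            match event {
                ModelEvent::Token(piece) => text.push_str(&piece),
                ModelEvent::ToolCall { name } => self.tool_calls.push(name),
                ModelEvent::Finish { .. } => break,
            }
        }
        Some(text)
    }

    /// Tool calls captured so far, in the order they were replayed.
    pub fn tool_calls(&self) -> &[String] {
        &self.tool_calls
    }
}
```

Because the provider owns the captured tool calls, an assertion such as `assert_tool_called` could read them directly without inspecting the agent.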

### D. Live-record / replay

`forge-test record <scenario>` runs the scenario against a real provider and captures all model events to a fixture file. Subsequent CI runs replay that fixture. This is the same golden-master pattern used by `cassette`-style HTTP recorders. Recordings are version-controlled.
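
The record/replay split can be sketched as two wrappers around a provider interface. Everything below is an assumption for illustration: the `Provider` trait here is a one-method stand-in for the real provider trait, `FakeLive` replaces an actual network-backed provider, and the tape is kept in memory rather than written to a fixture file.

```rust
/// Hypothetical, minimal provider interface.
pub trait Provider {
    fn complete(&mut self, prompt: &str) -> Result<String, String>;
}

/// One recorded prompt/response pair.
#[derive(Debug, Clone, PartialEq)]
pub struct Exchange {
    pub prompt: String,
    pub response: String,
}

/// Forwards to a real provider and records every exchange.
pub struct Recorder<P: Provider> {
    inner: P,
    tape: Vec<Exchange>,
}

impl<P: Provider> Recorder<P> {
    pub fn new(inner: P) -> Self {
        Self { inner, tape: Vec::new() }
    }

    /// Hands back the recording so it can be saved as a fixture.
    pub fn into_tape(self) -> Vec<Exchange> {
        self.tape
    }
}

impl<P: Provider> Provider for Recorder<P> {
    fn complete(&mut self, prompt: &str) -> Result<String, String> {
        let response = self.inner.complete(prompt)?;
        self.tape.push(Exchange {
            prompt: prompt.to_string(),
            response: response.clone(),
        });
        Ok(response)
    }
}

/// Serves responses from a recording and fails loudly on any mismatch.
pub struct Replayer {
    tape: std::vec::IntoIter<Exchange>,
}

impl Replayer {
    pub fn new(tape: Vec<Exchange>) -> Self {
        Self { tape: tape.into_iter() }
    }
}

impl Provider for Replayer {
    fn complete(&mut self, prompt: &str) -> Result<String, String> {
        match self.tape.next() {
            Some(ex) if ex.prompt == prompt => Ok(ex.response),
            Some(ex) => Err(format!(
                "fixture expected prompt {:?}, got {:?}",
                ex.prompt, prompt
            )),
            None => Err("fixture exhausted".to_string()),
        }
    }
}

/// Stand-in for a live provider so the sketch runs offline.
pub struct FakeLive;

impl Provider for FakeLive {
    fn complete(&mut self, prompt: &str) -> Result<String, String> {
        Ok(format!("reply to {}", prompt))
    }
}
```

Failing on a prompt mismatch, instead of returning the next recorded response regardless, makes a stale fixture show up as a clear error when a scenario changes.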

### E. Hot-reload for interactive iteration

`forge-test watch <scenario>` keeps a session alive and:
1. Notices when an agent prompt file (`crates/forge_repo/src/agents/*.md`) changes.
2. Replays the scenario from turn 1 against the new agent definition.
3. Shows a colored diff of (response text, tool call sequence, final state) vs the last known-good run.
4. Lets the developer accept (snapshot updated) or reject (revert agent) with a keystroke.

Goal: a tight loop where editing an agent prompt → seeing the multi-turn behavior change → committing happens in seconds, not minutes.
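
Steps 3 and 4 above amount to comparing a fresh run against a stored baseline and then either promoting or discarding it. A minimal sketch, assuming hypothetical `RunResult` and `Baseline` types; file watching, colored output, and reverting the agent file are left out.

```rust
/// Hypothetical summary of one scenario run.
#[derive(Debug, Clone, PartialEq)]
pub struct RunResult {
    pub response: String,
    pub tool_calls: Vec<String>,
}

/// Holds the last known-good run for comparison.
pub struct Baseline {
    last_good: RunResult,
}

impl Baseline {
    pub fn new(initial: RunResult) -> Self {
        Self { last_good: initial }
    }

    /// Returns one line per field that differs from the last known-good
    /// run. An empty list means the behavior is unchanged.
    pub fn diff(&self, candidate: &RunResult) -> Vec<String> {
        let mut lines = Vec::new();
        if self.last_good.response != candidate.response {
            lines.push(format!(
                "response: {:?} -> {:?}",
                self.last_good.response, candidate.response
            ));
        }
        if self.last_good.tool_calls != candidate.tool_calls {
            lines.push(format!(
                "tool calls: {:?} -> {:?}",
                self.last_good.tool_calls, candidate.tool_calls
            ));
        }
        lines
    }

    /// Accept: the candidate becomes the new known-good run.
    /// Rejecting needs no call, because the baseline stays as it is.
    pub fn accept(&mut self, candidate: RunResult) {
        self.last_good = candidate;
    }
}
```

Keeping the diff as plain lines leaves presentation (color, paging) to the CLI layer.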

### F. Multi-turn assertion library

Beyond per-turn matchers:
- **Sequence** — "tool A must be called before tool B"
- **Count** — "exactly N model calls in this scenario"
- **State** — "after turn 3, conversation has K messages, agent is `forge`"
- **Final** — "conversation history JSON matches snapshot" (insta integration)
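
The Sequence and Count matchers could be plain functions over a recorded trace. The sketch below assumes a hypothetical `TraceEvent` type and function names; State and Final matchers are omitted because they depend on the conversation model.

```rust
/// Hypothetical entry in the trace a scenario run produces.
#[derive(Debug, Clone, PartialEq)]
pub enum TraceEvent {
    ModelCall,
    ToolCall(String),
}

/// Sequence matcher: true when the first call to `first` occurs before the
/// first call to `second`. False if either tool was never called.
pub fn tool_called_before(trace: &[TraceEvent], first: &str, second: &str) -> bool {
    let position = |tool: &str| {
        trace
            .iter()
            .position(|e| matches!(e, TraceEvent::ToolCall(name) if name == tool))
    };
    match (position(first), position(second)) {
        (Some(a), Some(b)) => a < b,
        _ => false,
    }
}

/// Count matcher: number of model calls in the trace.
pub fn model_call_count(trace: &[TraceEvent]) -> usize {
    trace
        .iter()
        .filter(|e| matches!(e, TraceEvent::ModelCall))
        .count()
}
```

Treating a missing tool as a failed sequence check is a design choice in this sketch. A stricter library might report "never called" separately from "called in the wrong order".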

### G. Interactive REPL mode

`forge-test repl --provider=mock --fixture=cat-replay.json` opens a live REPL where each user input runs through the harness. Lets a developer poke an agent under controlled conditions without burning real tokens.
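
At its core, such a REPL is a read-respond loop over any line-based input. A minimal sketch, assuming a hypothetical `repl` function in which `respond` stands in for the harness plus mock provider, and `:quit` is an illustrative exit command rather than an existing one.

```rust
use std::io::{BufRead, Write};

/// Reads one user input per line, passes it to `respond`, and writes the
/// reply. Blank lines are skipped. Stops at end of input or at `:quit`.
/// Returns the number of turns that were answered.
pub fn repl<R: BufRead, W: Write>(
    input: R,
    mut output: W,
    mut respond: impl FnMut(&str) -> String,
) -> std::io::Result<usize> {
    let mut turns = 0;
    for line in input.lines() {
        let line = line?;
        let trimmed = line.trim();
        if trimmed == ":quit" {
            break;
        }
        if trimmed.is_empty() {
            continue;
        }
        writeln!(output, "{}", respond(trimmed))?;
        turns += 1;
    }
    Ok(turns)
}
```

Being generic over `BufRead` and `Write` means the same loop can run against stdin and stdout interactively and against in-memory buffers in tests.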

## Acceptance criteria

- New crate `forge_test_harness` with the driver API (or, if simpler, an extension to `forge_test_kit`).
- `MockProvider` implementing the existing provider trait, replaying JSONL fixtures.
- `forge-test` binary (or `cargo forge-test` subcommand) supporting `run`, `record`, `watch`, `repl`, `--accept`.
- At least 3 example scenarios committed to `tests/scenarios/`:
  1. Image paste → describe (covers #19)
  2. Slash command palette execution (covers #17)
  3. Multi-turn agent switch (`:forge` → `:muse` → back)
- CI integration: `cargo test --all` runs scenarios offline against fixtures.
- Documentation in `docs/test-harness.md` showing how to write, run, record, accept.

## Out of scope (for v1, file as follow-ups)

- TUI testing for codegraff — needs ratatui-specific harness (e.g. via `ratatui::backend::TestBackend`)
- LLM-based assertions ("did the response sound friendly?") — too fuzzy for deterministic tests
- Performance/latency benchmarks
- Adversarial / fuzz testing of inputs
- Cross-provider equivalence testing

## Related

- #16/#19 — Ctrl+V image paste needs scenario coverage
- #17 — Slash palette needs scenario coverage
- #18 — codegraff polish; TUI harness is a follow-up to this issue
- #15 — subagent observability needs scenario fixtures to validate trace output

## Estimated scope

Medium-large. The core (mock provider + driver API + 1-2 scenarios) is probably 2-3 days of focused work. The hot-reload and watch modes are another 1-2 days. Documentation and 3 polished scenarios add another day. Total: ~1 week of dedicated effort, which can be split across multiple PRs.
