Interactive multi-turn test harness for agents #20

@justrach

Why

We have 25+ agents, dozens of slash commands, and increasingly rich multi-turn flows (Ctrl+V image paste → edit → resend; conversation cloning; agent switching mid-session). Manual end-to-end testing is painful: launch graff, type commands, eyeball model output, save state, switch agents. As a result:

  • Regressions slip through. The recent FORGE→GRAFF rebrand regressed in four separate user-visible spots (banner, doctor, zsh rprompt, REPL prompt), and each one surfaced only when a user caught it in their terminal.
  • New features ship without exercise. Image paste landed in #19 (feat(repl): Ctrl+V image paste in graff), but nothing automatically verifies that a pasted image produces a multimodal block that reaches the model.
  • Multi-turn flows are almost never tested in CI.

What we want

A test harness that scripts a multi-turn conversation against the live agent — runnable both as a Rust unit test and as an interactive CLI for iterative agent development. Ergonomics goal: writing a new scenario should be as easy as writing an insta snapshot; running existing scenarios in CI should be deterministic and require no live API.

Proposed shape

A. Driver API (Rust, in a new forge_test_harness crate)

let mut session = TestSession::new()
    .with_provider(MockProvider::from_fixture("tests/fixtures/cat-replay.json"))
    .with_clipboard(ClipboardFixture::image("tests/fixtures/cat.png"))
    .build();

session.send_keys("ctrl+v")
       .send_text(" what is this")
       .submit()
       .assert_tool_called("image_describe")
       .assert_response_matches(r"(?i)cat");

session.send_command(":new")
       .send_text("hi")
       .submit()
       .assert_response_count(1);

session.snapshot("cat-image-then-greeting");
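
One way the chained calls above could be made to compile is for every driver method to return `&mut Self`. The sketch below is only an illustration under stated assumptions, not the proposed implementation: the struct fields, the `with_reply` constructor, the `sent` accessor, the canned reply, and the idea that `:new` resets the response count are all hypothetical. Only `send_text`, `send_command`, `submit`, and `assert_response_count` are names taken from the example.

```rust
/// Hypothetical stand-in for the real session; the fields are illustrative only.
#[derive(Default)]
pub struct TestSession {
    input: String,          // text typed since the last submit
    sent: Vec<String>,      // every prompt that was submitted
    responses: Vec<String>, // responses in the current conversation
    reply: String,          // canned reply; a real harness would ask the mock provider
}

impl TestSession {
    /// Hypothetical constructor: every submitted turn gets the same canned reply.
    pub fn with_reply(reply: &str) -> Self {
        TestSession { reply: reply.to_string(), ..Default::default() }
    }

    /// Appends typed text to the pending input.
    pub fn send_text(&mut self, text: &str) -> &mut Self {
        self.input.push_str(text);
        self
    }

    /// Assumption: `:new` starts a fresh conversation, so the response list resets.
    pub fn send_command(&mut self, command: &str) -> &mut Self {
        if command == ":new" {
            self.responses.clear();
        }
        self
    }

    /// Ends the turn: records the pending prompt and the canned reply.
    pub fn submit(&mut self) -> &mut Self {
        let prompt = std::mem::take(&mut self.input);
        self.sent.push(prompt);
        self.responses.push(self.reply.clone());
        self
    }

    /// Panics (failing the test) unless exactly `expected` responses are recorded.
    pub fn assert_response_count(&mut self, expected: usize) -> &mut Self {
        assert_eq!(self.responses.len(), expected, "unexpected response count");
        self
    }

    /// Prompts submitted so far, in order.
    pub fn sent(&self) -> &[String] {
        &self.sent
    }
}
```

Returning `&mut Self` keeps `session` usable after each chain, which is what lets the example issue a second chain and then call `snapshot` on the same value.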

B. Scenario format (YAML for non-Rust authors / CI fixtures)

scenario: clipboard-image-paste
provider: mock
fixtures:
  clipboard_image: tests/fixtures/cat.png
  replay: tests/fixtures/cat-replay.json
turns:
  - keys: [ctrl+v]
  - text: " what is this"
  - submit: true
  - assert:
      tool_called: image_describe
      response_matches: "(?i)cat"
  - command: ":new"
  - text: "hi"
  - submit: true
  - assert:
      response_count: 1
snapshot: cat-image-then-greeting

A small forge-test CLI runs these scenarios and supports --accept (insta-style snapshot update), watch (re-run on file change), and record (capture live API output to a fixture for replay).

C. Mock provider

Implement Provider::Mock (the MockProvider used in the driver API) that consumes a deterministic fixture (JSONL of model events: token streams, tool calls, finish reasons) and replays it in real time. It needs no network and captures every tool call invocation for assertion.
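
As a rough illustration of the replay idea, the sketch below feeds a queue of pre-recorded events back one response at a time and logs each tool call. It is a sketch under assumptions: the `ModelEvent` variants, `from_events`, `next_response`, and `tool_calls` are hypothetical names, the events are built in memory instead of being parsed from a JSONL fixture, and the existing provider trait is not shown in this issue, so it is not implemented here.

```rust
use std::collections::VecDeque;

/// Hypothetical event type; the real fixture schema is not specified in this issue.
#[derive(Debug, Clone, PartialEq)]
pub enum ModelEvent {
    Token(String),
    ToolCall { name: String },
    Finish,
}

/// Replays pre-recorded events in order and logs every tool call it sees.
pub struct MockProvider {
    events: VecDeque<ModelEvent>,
    tool_calls: Vec<String>,
}

impl MockProvider {
    /// Hypothetical constructor; a real one would load events from a fixture file.
    pub fn from_events(events: Vec<ModelEvent>) -> Self {
        MockProvider { events: events.into(), tool_calls: Vec::new() }
    }

    /// Drains events up to and including the next `Finish` and returns the
    /// concatenated token text. Returns an empty string once the fixture is exhausted.
    pub fn next_response(&mut self) -> String {
        let mut text = String::new();
        while let Some(event) = self.events.pop_front() {
            match event {
                ModelEvent::Token(t) => text.push_str(&t),
                ModelEvent::ToolCall { name } => self.tool_calls.push(name),
                ModelEvent::Finish => break,
            }
        }
        text
    }

    /// Names of all tool calls replayed so far, in order.
    pub fn tool_calls(&self) -> &[String] {
        &self.tool_calls
    }
}
```

Because the queue is consumed strictly in order, the same fixture always yields the same responses and the same tool call log, which is the property the CI runs depend on.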

D. Live-record / replay

forge-test record <scenario> runs the scenario against a real provider and captures all model events to a fixture file. Subsequent CI runs replay the fixture. This is the same golden-master pattern as cassette-style HTTP recorders. Recordings are version-controlled.

E. Hot-reload for interactive iteration

forge-test watch <scenario> keeps a session alive and:

  1. Notices when an agent prompt file (crates/forge_repo/src/agents/*.md) changes.
  2. Replays the scenario from turn 1 against the new agent definition.
  3. Shows a colored diff of (response text, tool call sequence, final state) vs the last known-good run.
  4. Lets the developer accept (snapshot updated) or reject (revert agent) with a keystroke.

Goal: a tight loop where editing an agent prompt → seeing the multi-turn behavior change → committing happens in seconds, not minutes.
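
The comparison in step 3 could start from something as small as the sketch below, which reports which of the three compared parts differ from the last known-good run. Everything here is an assumption made for illustration: `RunRecord`, its fields, and `changed_parts` are hypothetical names, and a real implementation would render a colored textual diff instead of a list of labels.

```rust
/// Hypothetical summary of one scenario run; field names are illustrative.
#[derive(Debug, Clone, PartialEq)]
pub struct RunRecord {
    pub response_text: String,
    pub tool_calls: Vec<String>,
    pub final_state: String,
}

/// Lists which of the three compared parts differ from the last known-good run.
/// An empty result means the new agent definition behaved identically.
pub fn changed_parts(baseline: &RunRecord, current: &RunRecord) -> Vec<&'static str> {
    let mut changed = Vec::new();
    if baseline.response_text != current.response_text {
        changed.push("response text");
    }
    if baseline.tool_calls != current.tool_calls {
        changed.push("tool call sequence");
    }
    if baseline.final_state != current.final_state {
        changed.push("final state");
    }
    changed
}
```

An empty list would let the watch loop skip the accept/reject prompt entirely, so the developer is only interrupted when behavior actually changed.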

F. Multi-turn assertion library

Beyond per-turn matchers:

  • Sequence — "tool A must be called before tool B"
  • Count — "exactly N model calls in this scenario"
  • State — "after turn 3, conversation has K messages, agent is forge"
  • Final — "conversation history JSON matches snapshot" (insta integration)
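
The Sequence and Count matchers could be expressed as plain functions over a recorded log, as in this minimal sketch. The function names and the choice of a `&[&str]` log are assumptions, and `read_file` in the usage note is a made-up tool name. The State and Final matchers depend on session internals that this issue does not define, so they are left out.

```rust
/// Sequence: true if `first` appears in the log and `second` appears
/// somewhere after that first occurrence.
pub fn called_before(log: &[&str], first: &str, second: &str) -> bool {
    match log.iter().position(|name| *name == first) {
        Some(i) => log[i + 1..].iter().any(|name| *name == second),
        None => false,
    }
}

/// Count: how many log entries match `name`. The same helper would count
/// model calls if the log recorded those instead of tool calls.
pub fn call_count(log: &[&str], name: &str) -> usize {
    log.iter().filter(|entry| **entry == name).count()
}
```

With a log of `["read_file", "image_describe", "read_file"]`, `called_before(&log, "read_file", "image_describe")` holds and `call_count(&log, "read_file")` is 2.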

G. Interactive REPL mode

forge-test repl --provider=mock --fixture=cat-replay.json opens a live REPL where each user input runs through the harness. Lets a developer poke an agent under controlled conditions without burning real tokens.

Acceptance criteria

  • New crate forge_test_harness with the driver API (or, if simpler, an extension to forge_test_kit).
  • MockProvider implementing the existing provider trait, replaying JSONL fixtures.
  • forge-test binary (or cargo forge-test subcommand) supporting run, record, watch, repl, --accept.
  • At least 3 example scenarios committed to tests/scenarios/:
    1. Image paste → describe (covers #19, feat(repl): Ctrl+V image paste in graff)
    2. Slash command palette execution (covers #17, Add slash command (/) palette + autocomplete to codegraff TUI)
    3. Multi-turn agent switch (:forge:muse → back)
  • CI integration: cargo test --all runs scenarios offline against fixtures.
  • Documentation in docs/test-harness.md showing how to write, run, record, accept.

Out of scope (for v1, file as follow-ups)

  • TUI testing for codegraff — needs ratatui-specific harness (e.g. via ratatui::backend::TestBackend)
  • LLM-based assertions ("did the response sound friendly?") — too fuzzy for deterministic tests
  • Performance/latency benchmarks
  • Adversarial / fuzz testing of inputs
  • Cross-provider equivalence testing

Estimated scope

Medium-large. The core (mock provider + driver API + 1-2 scenarios) is probably 2-3 days of focused work. The hot-reload and watch modes are another 1-2 days. Documentation and 3 polished scenarios add another day. Total: ~1 week of dedicated effort, which can be split across multiple PRs.
