Why
We have 25+ agents, dozens of slash commands, and increasingly rich multi-turn flows (Ctrl+V image paste → edit → resend; conversation cloning; agent switching mid-session). Manual end-to-end testing is painful: launch graff, type commands, eyeball model output, save state, switch agents. As a result:
Regressions slip through. The recent FORGE→GRAFF rebrand had four separate user-visible spots (banner, doctor, zsh rprompt, REPL prompt) that regressed, and each one was exposed only when the user caught it in their terminal.
New features ship without exercise. Image paste landed in feat(repl): Ctrl+V image paste in graff #19 but nothing automatically verifies that a pasted image produces a multimodal block reaching the model.
Multi-turn flows are almost never tested in CI.
What we want
A test harness that scripts a multi-turn conversation against the live agent — runnable both as a Rust unit test and as an interactive CLI for iterative agent development. Ergonomics goal: writing a new scenario should be as easy as writing an insta snapshot; running existing scenarios in CI should be deterministic and require no live API.
Proposed shape
A. Driver API (Rust, in a new forge_test_harness crate)
    let mut session = TestSession::new()
        .with_provider(MockProvider::from_fixture("tests/fixtures/cat-replay.json"))
        .with_clipboard(ClipboardFixture::image("tests/fixtures/cat.png"))
        .build();

    session
        .send_keys("ctrl+v")
        .send_text(" what is this")
        .submit()
        .assert_tool_called("image_describe")
        .assert_response_matches(r"(?i)cat");

    session
        .send_command(":new")
        .send_text("hi")
        .submit()
        .assert_response_count(1);

    session.snapshot("cat-image-then-greeting");
B. Scenario format (YAML for non-Rust authors / CI fixtures)
A small forge-test CLI runs these scenarios and supports --accept (insta-style snapshot update), --watch (re-run on file change), and --record (capture live API output to a fixture for replay).
C. Mock provider
Implement Provider::Mock that consumes a deterministic fixture (JSONL of model events: token streams, tool calls, finish reasons) and replays it in real time. No network required. Captures every tool call invocation for assertion.
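To make the replay idea concrete, here is a minimal sketch of what a mock provider could look like. It is not the existing provider trait: `ModelEvent`, `from_events`, `next_turn`, and `called` are hypothetical names, and the events are built in memory rather than parsed from JSONL (a real implementation would presumably deserialize each fixture line).

```rust
use std::collections::VecDeque;

/// Hypothetical shape of one fixture line (token, tool call, or finish reason).
#[derive(Debug, Clone, PartialEq)]
enum ModelEvent {
    Token(String),
    ToolCall { name: String, args: String },
    Finish(String),
}

/// Sketch of a mock provider: replays pre-loaded events and records tool calls.
struct MockProvider {
    events: VecDeque<ModelEvent>,
    tool_calls: Vec<(String, String)>,
}

impl MockProvider {
    /// Assumption: events are already decoded; JSONL parsing is left out.
    fn from_events(events: Vec<ModelEvent>) -> Self {
        MockProvider { events: events.into(), tool_calls: Vec::new() }
    }

    /// Replays events up to and including the next `Finish` and returns the
    /// concatenated token text. Returns `None` once the fixture is exhausted.
    fn next_turn(&mut self) -> Option<String> {
        if self.events.is_empty() {
            return None;
        }
        let mut text = String::new();
        while let Some(event) = self.events.pop_front() {
            match event {
                ModelEvent::Token(t) => text.push_str(&t),
                ModelEvent::ToolCall { name, args } => self.tool_calls.push((name, args)),
                ModelEvent::Finish(_reason) => break,
            }
        }
        Some(text)
    }

    /// True if a tool with this name was invoked in any replayed turn so far.
    fn called(&self, tool: &str) -> bool {
        self.tool_calls.iter().any(|(name, _)| name == tool)
    }
}

/// Two-turn fixture used for illustration only.
fn cat_fixture() -> Vec<ModelEvent> {
    vec![
        ModelEvent::ToolCall { name: "image_describe".into(), args: "{}".into() },
        ModelEvent::Token("A ".into()),
        ModelEvent::Token("cat.".into()),
        ModelEvent::Finish("stop".into()),
        ModelEvent::Token("Hello!".into()),
        ModelEvent::Finish("stop".into()),
    ]
}

#[test]
fn replays_cat_fixture_offline() {
    let mut provider = MockProvider::from_events(cat_fixture());
    assert_eq!(provider.next_turn(), Some("A cat.".to_string()));
    assert!(provider.called("image_describe"));
}
```

Because the replay touches no network and no clock, the same fixture always yields the same turns, which is what CI determinism would depend on. Pacing the events "in real time" is left out of this sketch.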
D. Live-record / replay
forge-test record <scenario> runs the scenario against a real provider, captures all model events to a fixture file. Subsequent CI runs use the fixture for replay. Same golden-master pattern as cassette-style HTTP recorders. Recordings are version-controlled.
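The record/replay round trip might be sketched as a wrapper around any provider. Everything below is an assumption made for illustration: the `Provider` trait is a stand-in rather than the real one, `FakeLive` replaces a network-backed provider, and the tab-separated line format substitutes for JSONL so the example stays dependency-free (it does not escape tabs or newlines inside token text).

```rust
use std::collections::VecDeque;

#[derive(Debug, Clone, PartialEq)]
enum ModelEvent {
    Token(String),
    ToolCall(String),
    Finish,
}

/// Stand-in trait (hypothetical): one call returns the events for one user turn.
trait Provider {
    fn complete(&mut self, prompt: &str) -> Vec<ModelEvent>;
}

/// Placeholder for a real provider; here it just echoes the prompt.
struct FakeLive;

impl Provider for FakeLive {
    fn complete(&mut self, prompt: &str) -> Vec<ModelEvent> {
        vec![ModelEvent::Token(format!("echo: {}", prompt)), ModelEvent::Finish]
    }
}

/// Encodes one event as one fixture line (simplified format, not JSONL).
fn encode(event: &ModelEvent) -> String {
    match event {
        ModelEvent::Token(t) => format!("token\t{}", t),
        ModelEvent::ToolCall(n) => format!("tool\t{}", n),
        ModelEvent::Finish => "finish".to_string(),
    }
}

/// Decodes one fixture line; unknown lines yield `None`.
fn decode(line: &str) -> Option<ModelEvent> {
    match line.split_once('\t') {
        Some(("token", t)) => Some(ModelEvent::Token(t.to_string())),
        Some(("tool", n)) => Some(ModelEvent::ToolCall(n.to_string())),
        None if line == "finish" => Some(ModelEvent::Finish),
        _ => None,
    }
}

/// Record mode: forwards to the inner provider and keeps a line per event.
struct Recorder<P: Provider> {
    inner: P,
    lines: Vec<String>,
}

impl<P: Provider> Provider for Recorder<P> {
    fn complete(&mut self, prompt: &str) -> Vec<ModelEvent> {
        let events = self.inner.complete(prompt);
        self.lines.extend(events.iter().map(encode));
        events
    }
}

/// Replay mode: serves recorded events turn by turn and ignores the prompt.
struct Replayer {
    events: VecDeque<ModelEvent>,
}

impl Replayer {
    fn from_lines(lines: &[String]) -> Self {
        Replayer { events: lines.iter().filter_map(|l| decode(l)).collect() }
    }
}

impl Provider for Replayer {
    fn complete(&mut self, _prompt: &str) -> Vec<ModelEvent> {
        let mut turn = Vec::new();
        while let Some(event) = self.events.pop_front() {
            let done = event == ModelEvent::Finish;
            turn.push(event);
            if done {
                break;
            }
        }
        turn
    }
}

#[test]
fn replay_matches_recording() {
    let mut recorder = Recorder { inner: FakeLive, lines: Vec::new() };
    let live = recorder.complete("hi");
    let mut replayer = Replayer::from_lines(&recorder.lines);
    assert_eq!(replayer.complete("hi"), live);
}
```

In this shape the recorded `lines` are what would be written to the version-controlled fixture file, and CI would only ever construct the `Replayer`.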
E. Hot-reload for interactive iteration
forge-test watch <scenario> keeps a session alive and:
Notices when an agent prompt file (crates/forge_repo/src/agents/*.md) changes.
Replays the scenario from turn 1 against the new agent definition.
Shows a colored diff of (response text, tool call sequence, final state) vs the last known-good run.
Lets the developer accept (snapshot updated) or reject (revert agent) with a keystroke.
Goal: a tight loop where editing an agent prompt → seeing the multi-turn behavior change → committing happens in seconds, not minutes.
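The compare-then-accept step of the watch loop could be sketched as below. This is only one possible shape: `RunSummary`, `diff_runs`, `Decision`, and `resolve` are hypothetical names, `message_count` stands in for the "final state", and file watching, colored output, and reverting the agent file on reject are all omitted.

```rust
/// Hypothetical summary of one scenario run, covering the three compared parts.
#[derive(Debug, Clone, PartialEq)]
struct RunSummary {
    response: String,
    tool_calls: Vec<String>,
    message_count: usize, // stand-in for "final state"
}

/// Returns one line per field that differs; an empty result means no change.
fn diff_runs(baseline: &RunSummary, current: &RunSummary) -> Vec<String> {
    let mut out = Vec::new();
    if baseline.response != current.response {
        out.push(format!("response: {:?} -> {:?}", baseline.response, current.response));
    }
    if baseline.tool_calls != current.tool_calls {
        out.push(format!("tool calls: {:?} -> {:?}", baseline.tool_calls, current.tool_calls));
    }
    if baseline.message_count != current.message_count {
        out.push(format!("messages: {} -> {}", baseline.message_count, current.message_count));
    }
    out
}

/// The developer's keystroke, reduced to two outcomes.
enum Decision {
    Accept,
    Reject,
}

/// Accept promotes the current run to the new baseline; reject keeps the old one.
/// (Reverting the edited agent prompt on reject is not modeled here.)
fn resolve(baseline: RunSummary, current: RunSummary, decision: Decision) -> RunSummary {
    match decision {
        Decision::Accept => current,
        Decision::Reject => baseline,
    }
}

#[test]
fn changed_response_is_reported() {
    let baseline = RunSummary {
        response: "A cat.".into(),
        tool_calls: vec!["image_describe".into()],
        message_count: 2,
    };
    let mut current = baseline.clone();
    current.response = "A dog.".into();
    assert_eq!(diff_runs(&baseline, &current).len(), 1);
    assert_eq!(resolve(baseline, current.clone(), Decision::Accept), current);
}
```

Keeping the comparison a pure function of two summaries means the same code could back both the interactive prompt and a non-interactive CI check.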
F. Multi-turn assertion library
Beyond per-turn matchers:
Sequence — "tool A must be called before tool B"
Count — "exactly N model calls in this scenario"
State — "after turn 3, conversation has K messages, agent is forge"
Final — "conversation history JSON matches snapshot" (insta integration)
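The sequence, count, and state matchers listed above could be sketched over a recorded transcript as follows. The `Event` and `Transcript` types and all method names are hypothetical, the tool name `summarize` is invented for the example, and the snapshot-based "Final" matcher is left out because it would depend on insta.

```rust
/// Hypothetical log entry captured by the harness during a scenario.
#[derive(Debug, Clone, PartialEq)]
enum Event {
    UserMessage(String),
    ModelCall,
    ToolCall(String),
    AssistantMessage(String),
}

/// Hypothetical transcript: ordered events plus the currently active agent.
struct Transcript {
    events: Vec<Event>,
    agent: String,
}

impl Transcript {
    /// Sequence: the first call to `first` precedes the first call to `second`.
    /// Returns false if either tool was never called.
    fn tool_called_before(&self, first: &str, second: &str) -> bool {
        let position = |name: &str| {
            self.events
                .iter()
                .position(|e| matches!(e, Event::ToolCall(n) if n == name))
        };
        match (position(first), position(second)) {
            (Some(a), Some(b)) => a < b,
            _ => false,
        }
    }

    /// Count: number of model calls made in the scenario.
    fn model_call_count(&self) -> usize {
        self.events.iter().filter(|e| matches!(e, Event::ModelCall)).count()
    }

    /// State: number of conversation messages (user plus assistant).
    fn message_count(&self) -> usize {
        self.events
            .iter()
            .filter(|e| matches!(e, Event::UserMessage(_) | Event::AssistantMessage(_)))
            .count()
    }
}

/// Illustrative transcript; `summarize` is a made-up tool name.
fn sample_transcript() -> Transcript {
    Transcript {
        events: vec![
            Event::UserMessage("what is this".into()),
            Event::ModelCall,
            Event::ToolCall("image_describe".into()),
            Event::ModelCall,
            Event::ToolCall("summarize".into()),
            Event::AssistantMessage("A cat.".into()),
        ],
        agent: "forge".into(),
    }
}

#[test]
fn multi_turn_matchers() {
    let t = sample_transcript();
    assert!(t.tool_called_before("image_describe", "summarize"));
    assert_eq!(t.model_call_count(), 2);
    assert_eq!(t.message_count(), 2);
    assert_eq!(t.agent, "forge");
}
```

Returning plain `bool` and `usize` values keeps the matchers usable both from Rust tests and from whatever evaluates the YAML scenarios.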
G. Interactive REPL mode
forge-test repl --provider=mock --fixture=cat-replay.json opens a live REPL where each user input runs through the harness. Lets a developer poke an agent under controlled conditions without burning real tokens.
Acceptance criteria
New crate forge_test_harness with the driver API (or, if simpler, an extension to forge_test_kit).
MockProvider implementing the existing provider trait, replaying JSONL fixtures.
forge-test binary (or cargo forge-test subcommand) supporting run, record, watch, repl, --accept.
Scenarios in tests/scenarios/:
Slash palette (… (/) palette + autocomplete to codegraff TUI #17)
Agent switching (:forge → :muse → back)
cargo test --all runs scenarios offline against fixtures.
docs/test-harness.md showing how to write, run, record, accept.
Out of scope (for v1, file as follow-ups)
(… ratatui::backend::TestBackend)
Related
… (/) palette + autocomplete to codegraff TUI #17: slash palette needs scenario coverage
Estimated scope
Medium-large. The core (mock provider + driver API + 1-2 scenarios) is probably 2-3 days of focused work. The hot-reload and watch modes are another 1-2 days. Documentation and 3 polished scenarios add another day. Total: ~1 week of dedicated effort, can be split across multiple PRs.