Problem
Prompts behave differently depending on the provider. For example, VS Code uses a `runSubagent` tool, while Copilot CLI uses a task agent.
Also, Copilot CLI has the GitHub MCP server built in by default, while VS Code has the `githubRepo` tool built in.
Proposal
Matrix Evaluation
In the eval YAML, allow users to define multiple targets and automatically run every case against every target, producing a comparison matrix:
```yaml
name: cross-provider-eval
version: "1.0"

targets:
  - copilot-cli
  - vscode
  - claude

cases:
  - id: refund-request
    input: "Help me get a refund"
    evaluators:
      - type: llm_judge
        prompt: evaluators/helpful.md
```
Running this produces a targets × cases matrix:
| case | copilot-cli | vscode | claude |
| --- | --- | --- | --- |
| refund-request | 0.9 ✓ | 0.85 ✓ | 0.7 ✗ |
| greeting | 1.0 ✓ | 1.0 ✓ | 0.95 ✓ |
| edge-case | 0.6 ✗ | 0.8 ✓ | 0.9 ✓ |
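The targets × cases expansion behind such a matrix can be sketched as below. `run_case` and its hard-coded scores are hypothetical stand-ins for the real runner and LLM judge, and the 0.8 pass threshold is an assumption for illustration:

```python
from itertools import product

def run_case(target: str, case: dict) -> float:
    """Hypothetical scorer; a real implementation would invoke the provider."""
    scores = {
        ("copilot-cli", "refund-request"): 0.9,
        ("vscode", "refund-request"): 0.85,
        ("claude", "refund-request"): 0.7,
    }
    return scores.get((target, case["id"]), 1.0)

def evaluate_matrix(targets, cases, threshold=0.8):
    """Run every case against every target and collect a comparison matrix."""
    matrix = {}
    for target, case in product(targets, cases):
        score = run_case(target, case)
        matrix[(case["id"], target)] = {"score": score, "pass": score >= threshold}
    return matrix

targets = ["copilot-cli", "vscode", "claude"]
cases = [{"id": "refund-request", "input": "Help me get a refund"}]
matrix = evaluate_matrix(targets, cases)
# matrix[("refund-request", "claude")] -> {"score": 0.7, "pass": False}
```

Keying the matrix by `(caseId, target)` keeps lookups cheap for both the per-case row view and per-target column summaries.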
CLI Usage
```shell
# Run against all targets defined in the eval
agentv run evals/cross-provider.yaml

# Run against specific targets only
agentv run --target copilot-cli --target claude evals/

# Compare results across targets
agentv compare --by-target results.jsonl
```
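A minimal sketch of what by-target grouping over `results.jsonl` might compute, assuming each line carries `target`, `score`, and `pass` fields (the summary shape and field names are illustrative, not the actual `agentv` output):

```python
import json
from collections import defaultdict

def compare_by_target(jsonl_lines):
    """Group JSONL result records by target; report mean score and pass rate."""
    by_target = defaultdict(list)
    for line in jsonl_lines:
        rec = json.loads(line)
        by_target[rec["target"]].append(rec)
    return {
        target: {
            "mean_score": sum(r["score"] for r in recs) / len(recs),
            "pass_rate": sum(r["pass"] for r in recs) / len(recs),
        }
        for target, recs in by_target.items()
    }

lines = [
    '{"caseId": "refund-request", "target": "copilot-cli", "score": 0.9, "pass": true}',
    '{"caseId": "greeting", "target": "copilot-cli", "score": 1.0, "pass": true}',
    '{"caseId": "refund-request", "target": "claude", "score": 0.7, "pass": false}',
]
summary = compare_by_target(lines)
```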
Multi-Prompt Testing
Also support testing multiple prompt variants against the same cases:
```yaml
prompts:
  - file://prompts/v1-concise.md
  - file://prompts/v2-detailed.md

targets:
  - copilot-cli

cases:
  - id: test-1
    input: "..."
```
This produces a prompts × cases (or prompts × targets × cases) matrix for A/B testing prompt versions.
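The full expansion is just a cartesian product over the three axes; a sketch (the run-record field names are assumptions, not the actual schema):

```python
from itertools import product

prompts = ["prompts/v1-concise.md", "prompts/v2-detailed.md"]
targets = ["copilot-cli"]
cases = [{"id": "test-1"}]

# Every (prompt, target, case) combination becomes one evaluation run.
runs = [
    {"prompt": p, "target": t, "caseId": c["id"]}
    for p, t, c in product(prompts, targets, cases)
]
# len(runs) == len(prompts) * len(targets) * len(cases)
```

With a single target the result degenerates to the prompts × cases matrix; adding targets grows the run count multiplicatively, which is worth keeping in mind for eval cost.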
Results Schema
Each result includes which target and prompt variant produced it:
```json
{
  "caseId": "refund-request",
  "target": "copilot-cli",
  "prompt": "v1-concise",
  "score": 0.9,
  "pass": true
}
```
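A typed reader for this schema could look like the following sketch (the class and function names are illustrative; `pass` is renamed because it is a Python keyword):

```python
import json
from dataclasses import dataclass

@dataclass
class EvalResult:
    caseId: str
    target: str
    prompt: str
    score: float
    passed: bool  # maps the JSON "pass" field; "pass" is a Python keyword

def parse_result(line: str) -> EvalResult:
    """Parse one JSONL result line into a typed record."""
    rec = json.loads(line)
    return EvalResult(rec["caseId"], rec["target"], rec["prompt"],
                      rec["score"], rec["pass"])

result = parse_result(
    '{"caseId": "refund-request", "target": "copilot-cli", '
    '"prompt": "v1-concise", "score": 0.9, "pass": true}'
)
```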
Why This Matters
- Cross-provider testing is essential for portable agent prompts
- Matrix view makes it easy to spot provider-specific regressions
- Prompt A/B testing helps optimize agent instructions
- The `compare` command already exists — this extends it naturally
Acceptance Criteria
- `--target` flag to filter to specific targets