Problem
Prompts behave differently depending on the provider. For example, VS Code uses a `runSubagent` tool, while Copilot CLI uses a task agent.
Also, Copilot CLI has the GitHub MCP server built in by default, while VS Code has the `githubRepo` tool built in.
Proposal
Matrix Evaluation
In the eval YAML, allow users to define multiple targets and automatically run every case against every target, producing a comparison matrix:
```yaml
name: cross-provider-eval
version: "1.0"

targets:
  - copilot-cli
  - vscode
  - claude

cases:
  - id: refund-request
    input: "Help me get a refund"
    evaluators:
      - type: llm_judge
        prompt: evaluators/helpful.md
```
Running this produces a targets × cases matrix:
| case | copilot-cli | vscode | claude |
| --- | --- | --- | --- |
| refund-request | 0.9 ✓ | 0.85 ✓ | 0.7 ✗ |
| greeting | 1.0 ✓ | 1.0 ✓ | 0.95 ✓ |
| edge-case | 0.6 ✗ | 0.8 ✓ | 0.9 ✓ |
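The targets × cases expansion behind such a matrix can be sketched as below. `run_case` and its hard-coded scores are hypothetical stand-ins for the real runner and LLM judge, and the 0.8 pass threshold is an assumption for illustration:

```python
from itertools import product

def run_case(target: str, case: dict) -> float:
    """Hypothetical scorer; a real implementation would invoke the provider."""
    scores = {
        ("copilot-cli", "refund-request"): 0.9,
        ("vscode", "refund-request"): 0.85,
        ("claude", "refund-request"): 0.7,
    }
    return scores.get((target, case["id"]), 1.0)

def evaluate_matrix(targets, cases, threshold=0.8):
    """Run every case against every target and collect a comparison matrix."""
    matrix = {}
    for target, case in product(targets, cases):
        score = run_case(target, case)
        matrix[(case["id"], target)] = {"score": score, "pass": score >= threshold}
    return matrix

targets = ["copilot-cli", "vscode", "claude"]
cases = [{"id": "refund-request", "input": "Help me get a refund"}]
matrix = evaluate_matrix(targets, cases)
# matrix[("refund-request", "claude")] -> {"score": 0.7, "pass": False}
```

Keying the matrix by `(caseId, target)` keeps lookups cheap for both the per-case row view and per-target column summaries.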
CLI Usage
```shell
# Run against all targets defined in the eval
agentv run evals/cross-provider.yaml

# Run against specific targets only
agentv run --target copilot-cli --target claude evals/

# Compare results across targets
agentv compare --by-target results.jsonl
```
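A minimal sketch of what by-target grouping over `results.jsonl` might compute, assuming each line carries `target`, `score`, and `pass` fields (the summary shape and field names are illustrative, not the actual `agentv` output):

```python
import json
from collections import defaultdict

def compare_by_target(jsonl_lines):
    """Group JSONL result records by target; report mean score and pass rate."""
    by_target = defaultdict(list)
    for line in jsonl_lines:
        rec = json.loads(line)
        by_target[rec["target"]].append(rec)
    return {
        target: {
            "mean_score": sum(r["score"] for r in recs) / len(recs),
            "pass_rate": sum(r["pass"] for r in recs) / len(recs),
        }
        for target, recs in by_target.items()
    }

lines = [
    '{"caseId": "refund-request", "target": "copilot-cli", "score": 0.9, "pass": true}',
    '{"caseId": "greeting", "target": "copilot-cli", "score": 1.0, "pass": true}',
    '{"caseId": "refund-request", "target": "claude", "score": 0.7, "pass": false}',
]
summary = compare_by_target(lines)
```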
Multi-Prompt Testing
Also support testing multiple prompt variants against the same cases:
```yaml
prompts:
  - file://prompts/v1-concise.md
  - file://prompts/v2-detailed.md

targets:
  - copilot-cli

cases:
  - id: test-1
    input: "..."
```
This produces a prompts × cases (or prompts × targets × cases) matrix for A/B testing prompt versions.
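The full expansion is just a cartesian product over the three axes; a sketch (the run-record field names are assumptions, not the actual schema):

```python
from itertools import product

prompts = ["prompts/v1-concise.md", "prompts/v2-detailed.md"]
targets = ["copilot-cli"]
cases = [{"id": "test-1"}]

# Every (prompt, target, case) combination becomes one evaluation run.
runs = [
    {"prompt": p, "target": t, "caseId": c["id"]}
    for p, t, c in product(prompts, targets, cases)
]
# len(runs) == len(prompts) * len(targets) * len(cases)
```

With a single target the result degenerates to the prompts × cases matrix; adding targets grows the run count multiplicatively, which is worth keeping in mind for eval cost.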
Results Schema
Each result includes which target and prompt variant produced it:
```json
{
  "caseId": "refund-request",
  "target": "copilot-cli",
  "prompt": "v1-concise",
  "score": 0.9,
  "pass": true
}
```
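A typed reader for this schema could look like the following sketch (the class and function names are illustrative; `pass` is renamed because it is a Python keyword):

```python
import json
from dataclasses import dataclass

@dataclass
class EvalResult:
    caseId: str
    target: str
    prompt: str
    score: float
    passed: bool  # maps the JSON "pass" field; "pass" is a Python keyword

def parse_result(line: str) -> EvalResult:
    """Parse one JSONL result line into a typed record."""
    rec = json.loads(line)
    return EvalResult(rec["caseId"], rec["target"], rec["prompt"],
                      rec["score"], rec["pass"])

result = parse_result(
    '{"caseId": "refund-request", "target": "copilot-cli", '
    '"prompt": "v1-concise", "score": 0.9, "pass": true}'
)
```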
Why This Matters
- Cross-provider testing is essential for portable agent prompts
- Matrix view makes it easy to spot provider-specific regressions
- Prompt A/B testing helps optimize agent instructions
- The `compare` command already exists — this extends it naturally
Acceptance Criteria
- `--target` flag to filter to specific targets