
feat(compare): support combined JSONL input and N-way multi-model comparison #382

Merged
christso merged 1 commit into main from feat/381-compare-matrix on Feb 26, 2026

Conversation

@christso
Collaborator

Summary

  • Add combined JSONL mode: `agentv compare results.jsonl` reads a single file whose records carry a `target` field and produces an N-way score matrix with pairwise summaries
  • Add `--baseline` / `--candidate` flags for pairwise comparison from combined JSONL
  • Add `--targets` flag to restrict the matrix to specific targets
  • Exit with code 1 when `--baseline` is set and any target regresses against the baseline
  • JSON output (`--json`) includes the full matrix and all pairwise comparisons
  • Two-file pairwise mode (`agentv compare a.jsonl b.jsonl`) remains unchanged
  • Add benchmark-tooling example with eval YAML, a combined fixture (9 records: 3 tests × 3 targets), and an updated README
  • Fix pre-existing lint issues in benchmark-tooling scripts
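
For readers unfamiliar with the combined format, here is a minimal sketch of how a single JSONL file with a `target` field could be pivoted into a test-by-target score matrix. Only the `target` field is named in this PR; the `test_id` and `score` field names, the `parseCombinedJsonl` and `buildMatrix` helpers, and the sample data are hypothetical and are not taken from the agentv source.

```typescript
// Hypothetical record shape: only `target` is named in the PR description;
// `test_id` and `score` are assumptions made for illustration.
interface ResultRecord {
  test_id: string;
  target: string;
  score: number;
}

// test_id -> target -> score
type ScoreMatrix = Map<string, Map<string, number>>;

// Parse one JSON object per non-empty line.
function parseCombinedJsonl(text: string): ResultRecord[] {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as ResultRecord);
}

// Pivot records into a matrix, optionally keeping only the listed targets
// (a stand-in for what a --targets filter might do).
function buildMatrix(records: ResultRecord[], targets?: string[]): ScoreMatrix {
  const allowed = targets ? new Set(targets) : undefined;
  const matrix: ScoreMatrix = new Map();
  for (const r of records) {
    if (allowed && !allowed.has(r.target)) continue;
    const row = matrix.get(r.test_id) ?? new Map<string, number>();
    row.set(r.target, r.score);
    matrix.set(r.test_id, row);
  }
  return matrix;
}

// Made-up sample data: 2 tests x 2 targets.
const sample = [
  '{"test_id":"t1","target":"model-a","score":0.9}',
  '{"test_id":"t1","target":"model-b","score":0.7}',
  '{"test_id":"t2","target":"model-a","score":0.5}',
  '{"test_id":"t2","target":"model-b","score":0.6}',
].join("\n");

const matrix = buildMatrix(parseCombinedJsonl(sample));
const onlyModelA = buildMatrix(parseCombinedJsonl(sample), ["model-a"]);
```

Keying the outer map by test keeps each row aligned across targets, which is the shape a matrix printout needs.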

Risk

Low: additive feature with backward-compatible CLI changes (positional args use `restPositionals`; the existing two-file mode is unchanged). Comprehensive tests cover all modes.
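
As a rough illustration of why collecting positionals as a list can stay backward compatible, the sketch below picks a mode from the number of input files. The `CompareMode` type and `selectMode` function are hypothetical names, and the error on other argument counts is an assumption; the actual command may handle these cases differently.

```typescript
// Hypothetical mode selection from a list of positional arguments.
type CompareMode =
  | { kind: "combined"; file: string }
  | { kind: "pairwise"; first: string; second: string };

function selectMode(positionals: string[]): CompareMode {
  const [first, second, ...rest] = positionals;
  // One file: treat it as combined JSONL with a `target` field.
  if (first !== undefined && second === undefined) {
    return { kind: "combined", file: first };
  }
  // Two files: the pre-existing pairwise mode.
  if (first !== undefined && second !== undefined && rest.length === 0) {
    return { kind: "pairwise", first, second };
  }
  // Assumption: any other count is rejected.
  throw new Error(`expected 1 or 2 input files, got ${positionals.length}`);
}

const combined = selectMode(["results.jsonl"]);
const pairwise = selectMode(["a.jsonl", "b.jsonl"]);
```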

Closes #381

…parison

Add combined JSONL mode that reads a single file with `target` field and
produces an N-way score matrix. Supports --baseline/--candidate for
pairwise filtering, --targets for matrix filtering, and exit code 1 on
baseline regressions. Two-file pairwise mode remains unchanged.
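
A minimal sketch of the baseline regression check follows. It assumes that "regresses" means a target's mean score is lower than the baseline's mean score; the PR does not state the criterion, so that definition, the `test_id` and `score` field names, and all function names here are hypothetical.

```typescript
// Hypothetical record shape; only `target` is named in the PR.
interface ResultRecord {
  test_id: string;
  target: string;
  score: number;
}

// Mean score per target across all tests.
function meanScoreByTarget(records: ResultRecord[]): Map<string, number> {
  const sums = new Map<string, { total: number; count: number }>();
  for (const r of records) {
    const acc = sums.get(r.target) ?? { total: 0, count: 0 };
    acc.total += r.score;
    acc.count += 1;
    sums.set(r.target, acc);
  }
  const means = new Map<string, number>();
  sums.forEach((acc, target) => means.set(target, acc.total / acc.count));
  return means;
}

// Targets whose mean is strictly below the baseline's mean
// (an assumed definition of "regression").
function regressedTargets(records: ResultRecord[], baseline: string): string[] {
  const means = meanScoreByTarget(records);
  const baselineMean = means.get(baseline);
  if (baselineMean === undefined) {
    throw new Error(`baseline target "${baseline}" not found in results`);
  }
  const regressed: string[] = [];
  means.forEach((mean, target) => {
    if (target !== baseline && mean < baselineMean) regressed.push(target);
  });
  return regressed;
}

// Exit code 1 only when a baseline is given and at least one target regresses.
function exitCodeFor(records: ResultRecord[], baseline?: string): number {
  if (baseline === undefined) return 0;
  return regressedTargets(records, baseline).length > 0 ? 1 : 0;
}

// Made-up sample data.
const sampleRecords: ResultRecord[] = [
  { test_id: "t1", target: "model-a", score: 0.5 },
  { test_id: "t2", target: "model-a", score: 0.5 },
  { test_id: "t1", target: "model-b", score: 0.25 },
  { test_id: "t2", target: "model-b", score: 0.25 },
  { test_id: "t1", target: "model-c", score: 1 },
  { test_id: "t2", target: "model-c", score: 1 },
];
```

Returning the exit code instead of exiting directly keeps the check easy to unit test; a CLI wrapper would pass the result to the process exit call.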

Closes #381
@christso merged commit 82f3505 into main on Feb 26, 2026
1 check was pending
@christso deleted the feat/381-compare-matrix branch on February 26, 2026 00:58
