feat(compare): support combined JSONL input and N-way multi-model comparison by christso · Pull Request #382 · EntityProcess/agentv

christso · 2026-02-26T00:58:44Z

Summary

Add combined JSONL mode: agentv compare results.jsonl reads a single file whose records carry a target field and produces an N-way score matrix with pairwise summaries
Add --baseline / --candidate flags for pairwise comparison from combined JSONL
Add --targets flag to filter matrix to specific targets
Exit with code 1 when --baseline is set and any target regresses against the baseline
JSON output (--json) includes full matrix and all pairwise comparisons
Two-file pairwise mode (agentv compare a.jsonl b.jsonl) remains unchanged
Add benchmark-tooling example with eval YAML, combined fixture (9 records: 3 tests x 3 targets), and updated README
Fix pre-existing lint issues in benchmark-tooling scripts
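
The combined-mode behavior listed above can be illustrated with a minimal sketch. This is not the code from the PR: only the `target` field is named in the description, so the `test_id` and `score` field names, the model names, the sample scores, and every function name here are hypothetical. It also assumes that "regresses" means the candidate scores lower than the baseline on at least one test, which the PR does not spell out.

```typescript
// Hypothetical record shape: only `target` comes from the PR description.
interface EvalRecord {
  test_id: string; // assumed field name
  target: string;
  score: number; // assumed field name
}

// test_id -> target -> score
type Matrix = Record<string, Record<string, number>>;

interface PairwiseSummary {
  baseline: string;
  candidate: string;
  wins: number;
  losses: number;
  ties: number;
}

// Parse one JSON object per non-empty line.
function parseJsonl(text: string): EvalRecord[] {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as EvalRecord);
}

// Group records into a matrix, optionally keeping only the listed targets
// (the role the --targets flag plays in the description).
function buildMatrix(records: EvalRecord[], targets?: string[]): Matrix {
  const matrix: Matrix = {};
  for (const r of records) {
    if (targets && targets.indexOf(r.target) === -1) continue;
    const row = matrix[r.test_id] ?? (matrix[r.test_id] = {});
    row[r.target] = r.score;
  }
  return matrix;
}

// Count per-test wins, losses, and ties of the candidate against the baseline.
function comparePair(matrix: Matrix, baseline: string, candidate: string): PairwiseSummary {
  const summary: PairwiseSummary = { baseline, candidate, wins: 0, losses: 0, ties: 0 };
  for (const testId of Object.keys(matrix)) {
    const row = matrix[testId];
    if (!row) continue;
    const b = row[baseline];
    const c = row[candidate];
    if (b === undefined || c === undefined) continue; // skip tests missing either side
    if (c > b) summary.wins++;
    else if (c < b) summary.losses++;
    else summary.ties++;
  }
  return summary;
}

// Return 1 if any non-baseline target loses at least one test to the baseline.
function exitCodeFor(matrix: Matrix, baseline: string): number {
  const targets: string[] = [];
  for (const testId of Object.keys(matrix)) {
    for (const t of Object.keys(matrix[testId] ?? {})) {
      if (targets.indexOf(t) === -1) targets.push(t);
    }
  }
  for (const t of targets) {
    if (t !== baseline && comparePair(matrix, baseline, t).losses > 0) return 1;
  }
  return 0;
}

// Illustrative data only: 3 tests x 3 targets, mirroring the fixture's shape.
const SAMPLE_JSONL = [
  { test_id: "t1", target: "model-a", score: 0.8 },
  { test_id: "t1", target: "model-b", score: 0.9 },
  { test_id: "t1", target: "model-c", score: 0.7 },
  { test_id: "t2", target: "model-a", score: 0.5 },
  { test_id: "t2", target: "model-b", score: 0.5 },
  { test_id: "t2", target: "model-c", score: 0.6 },
  { test_id: "t3", target: "model-a", score: 1.0 },
  { test_id: "t3", target: "model-b", score: 0.9 },
  { test_id: "t3", target: "model-c", score: 1.0 },
].map((r) => JSON.stringify(r)).join("\n");

const matrix = buildMatrix(parseJsonl(SAMPLE_JSONL));
console.log(comparePair(matrix, "model-a", "model-b"));
console.log(exitCodeFor(matrix, "model-a")); // 1 for this sample: model-b scores lower on t3
```

Keeping the matrix as a plain nested record makes it straightforward to serialize for a `--json`-style output, though the actual output schema is not described in this PR.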

Risk

Low -- additive feature with backward-compatible CLI changes (positional args use restPositionals, existing two-file mode unchanged). Comprehensive tests cover all modes.
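
To make the backward-compatibility claim concrete, here is a rough sketch of how a list of positional file arguments could select between the two modes. The `CompareMode` type and `resolveMode` function are hypothetical and not taken from the PR. The sketch assumes only what the description states: one file means combined mode, two files mean the existing pairwise mode.

```typescript
// Hypothetical mode descriptor; the PR does not show its internal types.
type CompareMode =
  | { kind: "combined"; file: string }
  | { kind: "pairwise"; files: [string, string] };

// Choose a mode from the positional arguments collected by the CLI parser.
function resolveMode(positionals: string[]): CompareMode {
  const [first, second, ...extra] = positionals;
  if (first === undefined || extra.length > 0) {
    throw new Error(`expected 1 or 2 result files, got ${positionals.length}`);
  }
  return second === undefined
    ? { kind: "combined", file: first }
    : { kind: "pairwise", files: [first, second] };
}

console.log(resolveMode(["results.jsonl"]).kind); // "combined"
console.log(resolveMode(["a.jsonl", "b.jsonl"]).kind); // "pairwise"
```

Because the two-file call still resolves to the pairwise branch, existing invocations would behave as before under this scheme.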

Closes #381

…parison Add combined JSONL mode that reads a single file with `target` field and produces an N-way score matrix. Supports --baseline/--candidate for pairwise filtering, --targets for matrix filtering, and exit code 1 on baseline regressions. Two-file pairwise mode remains unchanged. Closes #381

christso merged commit 82f3505 into main Feb 26, 2026
1 check was pending

christso deleted the feat/381-compare-matrix branch February 26, 2026 00:58

This was referenced Feb 26, 2026

fix(compare): regression detection for non-alphabetically-first baselines #383 (Merged)

docs: update compare command references for N-way matrix mode #384 (Merged)

christso merged 1 commit into main from feat/381-compare-matrix
