fix(compare): regression detection for non-alphabetically-first baselines by christso · Pull Request #383 · EntityProcess/agentv

christso · 2026-02-26T01:28:29Z

Summary

Fixes 5 bugs found during code review of #382, plus one minor performance cleanup:

- Critical: determineMatrixExitCode missed regressions for targets that sort alphabetically before the baseline. Pairwise pairs are generated in sorted order, so --baseline gpt-4.1 checked only gpt-4.1 → gpt-5-mini and missed gemini-3-flash-preview → gpt-4.1, where gpt-4.1 sits in the pair's candidate slot rather than its baseline slot. It now checks both directions.
- Critical: --candidate without --baseline silently entered matrix mode. It now errors with a clear message.
- --targets filtering to zero results printed "No results found" with no hint about which targets exist. It now lists the available target names.
- --baseline combined with a --targets list that excludes the baseline silently skipped the regression check. It now errors early.
- Eval YAML used the wrong schema keys (test_id, expected, scores instead of id, criteria). Fixed to match agentv's actual YAML parser.
- Minor: moved the maxLabelLen computation out of the pairwise summary loop (was O(n²), now O(n)).
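
The both-directions check from the first item can be sketched roughly as follows. This is a minimal TypeScript sketch, not the code from this PR: the PairwiseResult shape, the meanDelta field, and the exitCodeForBaseline name are hypothetical, and it assumes a regression means another target scoring lower than the designated baseline.

```typescript
// Hypothetical shape of one pairwise comparison; all field names are assumptions.
interface PairwiseResult {
  baseline: string; // target that sorts first within the pair
  candidate: string; // target that sorts second within the pair
  meanDelta: number; // candidate mean score minus baseline mean score
}

// Returns 1 if any other target scored lower than the designated baseline, else 0.
function exitCodeForBaseline(pairs: PairwiseResult[], designated: string): number {
  for (const pair of pairs) {
    // Designated baseline is in the pair's baseline slot:
    // a negative delta means the other target scored lower.
    if (pair.baseline === designated && pair.meanDelta < 0) {
      return 1;
    }
    // Designated baseline is in the pair's candidate slot (it sorts later):
    // a positive delta means the other target scored lower.
    if (pair.candidate === designated && pair.meanDelta > 0) {
      return 1;
    }
  }
  return 0;
}
```

A check that handles only the first branch never looks at pairs where the designated baseline sorts second, which is the failure mode described above.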

Test plan

- 39 unit tests pass (3 new regression tests for the alphabetical sort bug)
- E2E: agentv compare combined.jsonl --baseline gpt-4.1 correctly exits 1 (gemini regression detected)
- E2E: --candidate gpt-5-mini (no baseline) → clear error
- E2E: --targets nonexistent → lists available targets
- E2E: --baseline gpt-4.1 --targets gemini gpt-5-mini → baseline excluded error
- E2E: all happy paths unchanged (matrix, pairwise, JSON output)
- Pre-push hooks pass (build, typecheck, lint, test)
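
The three flag-validation cases exercised above might be implemented along these lines. This is a hedged TypeScript sketch rather than the PR's code: CompareOptions, validateCompareOptions, the error strings, and the order of the checks are all assumptions made for illustration.

```typescript
// Hypothetical parsed CLI options; the real option object may differ.
interface CompareOptions {
  baseline?: string;
  candidate?: string;
  targets?: string[];
}

// Returns an error message, or null when the flag combination is acceptable.
// `available` is the list of target names found in the results file.
function validateCompareOptions(opts: CompareOptions, available: string[]): string | null {
  // --candidate only makes sense in pairwise mode, which needs --baseline.
  if (opts.candidate !== undefined && opts.baseline === undefined) {
    return "--candidate requires --baseline";
  }

  const requested = opts.targets;

  // A baseline filtered out by --targets could never be compared against.
  if (opts.baseline !== undefined && requested !== undefined && requested.indexOf(opts.baseline) === -1) {
    return "--baseline " + opts.baseline + " is excluded by --targets";
  }

  // Keep only the requested targets that actually exist in the results.
  const selected =
    requested === undefined
      ? available
      : available.filter((name) => requested.indexOf(name) !== -1);

  // An empty selection is reported together with the names that do exist.
  if (selected.length === 0) {
    return "No results found. Available targets: " + available.join(", ");
  }

  return null;
}
```

Returning a message instead of throwing keeps the sketch easy to test; a real command would presumably print the message and exit non-zero.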

Refs: #381

…ines

determineMatrixExitCode only checked pairs where the designated baseline appeared as .baseline (first in sorted order). Targets sorting before the baseline name were never checked for regression, e.g. --baseline gpt-4.1 missed regressions from gemini-3-flash-preview.

Also fixes:
- --candidate without --baseline now errors instead of silently entering matrix mode
- --targets filtering to zero results shows available targets
- --baseline excluded by --targets shows clear error
- Eval YAML uses correct schema keys (id, criteria)
- Move maxLabelLen computation out of pairwise loop

Refs: #381

cloudflare-workers-and-pages · 2026-02-26T01:29:22Z

Deploying agentv with Cloudflare Pages

Latest commit:	`09325d0`
Status:	✅ Deploy successful!
Preview URL:	https://3e0040ea.agentv.pages.dev
Branch Preview URL:	https://fix-compare-bugs.agentv.pages.dev

docs(benchmark-tooling): add example output to README

09325d0

Show actual matrix and pairwise output inline so users can see what to expect without running the command. Add Quick Start section pointing to the included fixture, exit code reference table, and pairwise mode example.

christso merged commit d57846e into main Feb 26, 2026
1 check passed

christso deleted the fix/compare-bugs branch February 26, 2026 02:42

christso mentioned this pull request Feb 26, 2026

docs: update compare command references for N-way matrix mode #384

Merged

3 tasks

christso merged 2 commits into main from fix/compare-bugs
