
fix(compare): regression detection for non-alphabetically-first baselines #383

Merged: christso merged 2 commits into main from fix/compare-bugs on Feb 26, 2026

@christso (Collaborator)

Summary

Fixes 5 bugs found during code review of #382, plus one minor performance cleanup:

  • Critical: determineMatrixExitCode missed regressions for targets that sort alphabetically before the baseline. Pairs are generated in sorted order, so --baseline gpt-4.1 checked only gpt-4.1 → gpt-5-mini and missed gemini-3-flash-preview → gpt-4.1 (where gpt-4.1 is the pair's candidate, not its baseline). Now checks both directions.
  • Critical: --candidate without --baseline silently entered matrix mode. Now errors with a clear message.
  • --targets filtering to zero results gave "No results found" with no hint about available targets. Now shows available target names.
  • --baseline combined with a --targets list that excludes the baseline silently skipped the regression check. Now errors early.
  • Eval YAML used wrong schema keys (test_id, expected, scores instead of id, criteria). Fixed to match agentv's actual YAML parser.
  • Minor: moved maxLabelLen computation out of the pairwise summary loop (was O(n²), now O(n)).
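
The first fix above can be illustrated with a minimal sketch. Only the name determineMatrixExitCode and the target names come from this PR; the `PairwiseComparison` shape, the `delta` field, and the function body are assumptions made for illustration and may not match agentv's actual code.

```typescript
// Hypothetical shape of one pairwise comparison (field names are assumptions).
interface PairwiseComparison {
  baseline: string;  // first target of the pair in sorted order
  candidate: string; // second target of the pair in sorted order
  delta: number;     // assumed: candidate score minus baseline score
}

// Returns 1 if any target scores worse than the designated baseline, else 0.
function determineMatrixExitCode(
  pairs: PairwiseComparison[],
  designatedBaseline: string,
): number {
  for (const pair of pairs) {
    // Direction 1: the designated baseline is the pair's baseline,
    // so a negative delta means the candidate regressed.
    if (pair.baseline === designatedBaseline && pair.delta < 0) return 1;
    // Direction 2: the designated baseline sorted second and is the pair's
    // candidate, so a positive delta means the other target is worse than it.
    if (pair.candidate === designatedBaseline && pair.delta > 0) return 1;
  }
  return 0;
}

// Example data (invented scores): gemini sorts before gpt-4.1, so gpt-4.1
// appears as the candidate in the first pair.
const examplePairs: PairwiseComparison[] = [
  { baseline: "gemini-3-flash-preview", candidate: "gpt-4.1", delta: 0.2 },
  { baseline: "gpt-4.1", candidate: "gpt-5-mini", delta: 0.1 },
];
const exampleExitCode = determineMatrixExitCode(examplePairs, "gpt-4.1");
```

Checking only direction 1 would return 0 for this data, which is the behavior described as the bug.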

Test plan

  • 39 unit tests pass (3 new regression tests for the alphabetical sort bug)
  • E2E: agentv compare combined.jsonl --baseline gpt-4.1 correctly exits 1 (gemini regression detected)
  • E2E: --candidate gpt-5-mini (no baseline) → clear error
  • E2E: --targets nonexistent → lists available targets
  • E2E: --baseline gpt-4.1 --targets gemini gpt-5-mini → baseline excluded error
  • E2E: all happy paths unchanged (matrix, pairwise, JSON output)
  • Pre-push hooks pass (build, typecheck, lint, test)
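
The three error cases exercised by the E2E tests above could be validated along the following lines. This is a sketch under stated assumptions: `CompareOptions`, `validateCompareOptions`, and the error message wording are all hypothetical, and exact-name matching of targets is assumed; the real CLI may resolve target names differently.

```typescript
// Hypothetical option shape; agentv's real option parsing may differ.
interface CompareOptions {
  baseline?: string;
  candidate?: string;
  targets?: string[];
}

// Returns an error message for an invalid flag combination, or null if valid.
// Message wording is illustrative, not the CLI's actual output.
function validateCompareOptions(
  opts: CompareOptions,
  availableTargets: string[],
): string | null {
  // --candidate only makes sense relative to a baseline.
  if (opts.candidate !== undefined && opts.baseline === undefined) {
    return "--candidate requires --baseline";
  }
  const targets = opts.targets;
  if (targets !== undefined) {
    // Assumed: targets are matched by exact name.
    const selected = availableTargets.filter((t) => targets.includes(t));
    if (selected.length === 0) {
      return `No results found. Available targets: ${availableTargets.join(", ")}`;
    }
    // The regression check cannot run if the baseline was filtered out.
    if (opts.baseline !== undefined && !targets.includes(opts.baseline)) {
      return `--baseline ${opts.baseline} is excluded by --targets`;
    }
  }
  return null;
}

const available = ["gemini-3-flash-preview", "gpt-4.1", "gpt-5-mini"];
const candidateOnlyError = validateCompareOptions({ candidate: "gpt-5-mini" }, available);
```

Returning a message instead of throwing keeps the check easy to unit test; the caller can print it and exit non-zero.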

Refs: #381

…ines

determineMatrixExitCode only checked pairs where the designated baseline
appeared as .baseline (first in sorted order). Targets sorting before the
baseline name were never checked for regression — e.g. --baseline gpt-4.1
missed regressions from gemini-3-flash-preview.

Also fixes:
- --candidate without --baseline now errors instead of silently entering matrix mode
- --targets filtering to zero results shows available targets
- --baseline excluded by --targets shows clear error
- Eval YAML uses correct schema keys (id, criteria)
- Move maxLabelLen computation out of pairwise loop
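
For the last item, hoisting the width computation might look like the sketch below. Only the name maxLabelLen comes from the commit; the `PairSummary` shape, `formatPairwiseSummary`, and the output format are assumptions for illustration.

```typescript
// Hypothetical summary row for one pair.
interface PairSummary {
  label: string; // e.g. "gpt-4.1 → gpt-5-mini"
  wins: number;
  losses: number;
}

function formatPairwiseSummary(rows: PairSummary[]): string[] {
  // Computed once before the loop: one O(n) pass instead of rescanning
  // every label on each iteration, which would be O(n²) overall.
  const maxLabelLen = rows.reduce((max, r) => Math.max(max, r.label.length), 0);
  // Pad each label to the shared width so the columns line up.
  return rows.map(
    (r) => `${r.label.padEnd(maxLabelLen)}  ${r.wins}W / ${r.losses}L`,
  );
}

const summaryLines = formatPairwiseSummary([
  { label: "a → b", wins: 1, losses: 0 },
  { label: "abc → d", wins: 0, losses: 2 },
]);
```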

Refs: #381

cloudflare-workers-and-pages Bot commented Feb 26, 2026

Deploying agentv with Cloudflare Pages

Latest commit: 09325d0
Status: ✅  Deploy successful!
Preview URL: https://3e0040ea.agentv.pages.dev
Branch Preview URL: https://fix-compare-bugs.agentv.pages.dev


Show actual matrix and pairwise output inline so users can see what
to expect without running the command. Add Quick Start section pointing
to the included fixture, exit code reference table, and pairwise mode
example.

@christso merged commit d57846e into main on Feb 26, 2026
1 check passed
@christso deleted the fix/compare-bugs branch on February 26, 2026, 02:42