Goal
Provide a reusable pattern (wrapper + fixtures) to validate evaluator/judge health:
- compatibility (can execute, returns valid structured output)
- consistency (stable on unambiguous fixtures across multiple runs)
This is intentionally NOT a core feature; it should be an example/showcase that teams can copy and adapt.
Motivation (concrete failures this catches)
- Provider/dependency/runtime misconfig causes evaluator to error (e.g., missing deps, unsupported runtime imports).
- LLM judge returns malformed JSON, missing fields, or out-of-range scores.
- Judge prompt/model change causes drift: previously PASS fixtures start failing (or vice versa).
- Stochastic judge flips decisions on clearly unambiguous cases.
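The malformed-output failures above can be caught with a small compatibility check. This is a minimal sketch, not a definitive implementation: the field names `score` and `reasoning` and the [0, 1] score range are assumptions made for illustration and should be adapted to whatever schema the evaluator actually emits.

```python
import json


def check_judge_output(raw: str) -> list[str]:
    """Return a list of compatibility problems; an empty list means valid.

    Assumed (hypothetical) schema: {"score": number in [0, 1], "reasoning": str}.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"malformed JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    problems = []
    if "score" not in data:
        problems.append("missing field: score")
    else:
        score = data["score"]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(score, bool) or not isinstance(score, (int, float)):
            problems.append("score is not a number")
        elif not 0 <= score <= 1:
            problems.append(f"score out of range: {score}")
    if not isinstance(data.get("reasoning"), str):
        problems.append("missing or non-string field: reasoning")
    return problems
```

Returning a list of problems rather than raising on the first one lets the wrapper report every schema violation for a fixture in a single CI run.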
Proposal
Add a new example/showcase (or extend an existing one) that includes:
- a small fixture dataset with labeled expectations:
  - unambiguous-pass cases (must always score 1)
  - unambiguous-fail cases (must always score 0)
  - ambiguous cases (allowed to vary, but bounded)
- a wrapper script that:
  - runs the evaluator N times per fixture (or runs agentv eval N times)
  - validates schema and bounds (compatibility)
  - computes flip-rate / agreement / variance (consistency)
  - exits 0/1 for CI gating
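The fixture layout and the consistency computation could be sketched as follows. Everything here is illustrative: the fixture fields (`id`, `kind`, `expected`), the `judge` callable, and the `ambiguous_bound` parameter are hypothetical names, and the judge is assumed to return a binary score of 0 or 1. How the real evaluator is invoked is left to the caller.

```python
from collections import Counter
from typing import Callable, Optional

# Hypothetical fixture records: "expected" is None for ambiguous cases.
FIXTURES = [
    {"id": "pass-1", "kind": "pass", "expected": 1},
    {"id": "fail-1", "kind": "fail", "expected": 0},
    {"id": "ambig-1", "kind": "ambiguous", "expected": None},
]


def flip_rate(scores: list[int]) -> float:
    """Fraction of runs that disagree with the most common score."""
    majority_count = Counter(scores).most_common(1)[0][1]
    return 1 - majority_count / len(scores)


def check_fixture(
    fixture: dict,
    judge: Callable[[dict], int],
    runs: int = 5,
    max_flip_rate: float = 0.0,
    ambiguous_bound: float = 0.5,
) -> dict:
    """Run the judge `runs` times on one fixture and test it against its limit."""
    scores = [judge(fixture) for _ in range(runs)]
    expected: Optional[int] = fixture["expected"]
    if expected is None:
        # Ambiguous: scores may vary, but only up to the configured bound.
        rate = flip_rate(scores)
        limit = ambiguous_bound
    else:
        # Unambiguous: count every run that disagrees with the label.
        rate = sum(s != expected for s in scores) / runs
        limit = max_flip_rate
    return {"id": fixture["id"], "scores": scores, "rate": rate, "ok": rate <= limit}
```

Measuring unambiguous fixtures against their label rather than against the majority means a judge that is consistently wrong is reported as a failure, not as stable.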
Suggested CLI flags (wrapper)
- --runs N (number of evaluator runs per fixture)
- --max-flip-rate X (for unambiguous fixtures, typically 0)
- --require-schema (default true)
- --output for structured CI JSON
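One way the suggested flags might be parsed is sketched below with Python's standard `argparse`. The flag names come from the list above; the default of 5 runs, the `PATH` argument to `--output`, and the `--no-require-schema` negation form are assumptions, not part of the proposal.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Evaluator/judge health check wrapper (sketch)"
    )
    # Default of 5 is an assumption; the proposal does not specify one.
    parser.add_argument("--runs", type=int, default=5, metavar="N",
                        help="number of evaluator runs per fixture")
    parser.add_argument("--max-flip-rate", type=float, default=0.0, metavar="X",
                        help="allowed flip rate for unambiguous fixtures")
    # BooleanOptionalAction (Python 3.9+) also generates --no-require-schema.
    parser.add_argument("--require-schema", default=True,
                        action=argparse.BooleanOptionalAction,
                        help="fail when judge output violates the schema")
    # Assumed to take a file path for the structured CI JSON report.
    parser.add_argument("--output", default=None, metavar="PATH",
                        help="where to write the structured CI JSON report")
    return parser
```

For example, `build_parser().parse_args(["--runs", "10", "--output", "report.json"])` would yield `runs=10`, `max_flip_rate=0.0`, `require_schema=True`, and `output="report.json"`.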
Acceptance Criteria
- Can be used in CI to prevent shipping evaluator prompt/code changes that:
  - break output schema
  - introduce regressions on fixtures
  - increase instability beyond configured thresholds
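The three gating conditions above could map onto a report and exit code roughly as follows. This is a sketch under stated assumptions: the per-fixture result fields (`schema_errors`, `expected`, `majority`, `flip_rate`, `limit`) and the report shape are hypothetical, chosen only to show how each criterion becomes a distinct failure reason.

```python
import json


def summarize(results: list[dict]) -> tuple[dict, int]:
    """Classify each fixture result and derive a CI exit code (0 pass, 1 fail)."""
    failures = []
    for r in results:
        if r["schema_errors"]:
            reason = "schema"        # output schema is broken
        elif r["expected"] is not None and r["majority"] != r["expected"]:
            reason = "regression"    # labeled fixture now scores the wrong way
        elif r["flip_rate"] > r["limit"]:
            reason = "instability"   # varies more than the configured threshold
        else:
            continue
        failures.append({"id": r["id"], "reason": reason})
    report = {"passed": not failures, "failures": failures}
    return report, 0 if not failures else 1


def render(report: dict) -> str:
    """Serialize the report as stable, diff-friendly JSON for CI artifacts."""
    return json.dumps(report, indent=2, sort_keys=True)
```

A wrapper would typically write `render(report)` to the `--output` path and pass the returned code to `sys.exit`, so the CI job fails exactly when one of the three criteria is violated.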
Related