Objective
Add a --threshold CLI flag to agentv eval that fails (exit 1) if the mean score across all tests falls below the specified threshold. This enables CI/CD quality gating without needing agentv compare --baseline.
Design Latitude
- Add a --threshold <number> flag (0-1 scale) to agentv eval
- Also support execution.threshold in EVAL.yaml for per-suite defaults
- CLI flag overrides YAML value
- After all tests complete, compute mean score; if below threshold, exit code 1 with summary
- Integrate with existing JUnit XML output (test-level pass/fail based on threshold)
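The precedence and gating rules above can be sketched as follows. This is a minimal illustration in Python, not AgentV's actual implementation: the function names `resolve_threshold` and `gate_exit_code` are hypothetical, and the handling of a missing threshold or an empty test list (no gating, exit 0) is an assumption the issue does not specify.

```python
from statistics import fmean
from typing import Optional, Sequence


def resolve_threshold(cli_value: Optional[float],
                      yaml_value: Optional[float]) -> Optional[float]:
    """Pick the effective threshold: the CLI flag wins over the YAML default."""
    threshold = cli_value if cli_value is not None else yaml_value
    # The issue specifies a 0-1 scale, so reject anything outside it.
    if threshold is not None and not 0.0 <= threshold <= 1.0:
        raise ValueError(f"threshold must be between 0 and 1, got {threshold}")
    return threshold


def gate_exit_code(scores: Sequence[float], threshold: Optional[float]) -> int:
    """Return 1 if the mean score is below the threshold, otherwise 0."""
    if threshold is None or not scores:
        # Assumption: with no threshold or no tests there is nothing to gate on.
        return 0
    return 1 if fmean(scores) < threshold else 0
```

Comparing with `is not None` rather than truthiness matters here: a threshold of 0 given on the command line is a valid value and should still override the YAML default.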
Acceptance Signals
- agentv eval evals/ --threshold 0.6 exits 1 if mean score < 0.6
- execution.threshold: 0.6 in YAML has the same effect
- CLI --threshold overrides YAML execution.threshold
- Summary line printed: "Suite score: 0.53 (threshold: 0.6) — FAIL"
- Exit code 0 when score meets threshold
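One way the summary line could be produced, as a hedged Python sketch. The helper name `summary_line` is hypothetical, and the "PASS" label is an assumption: the issue only shows the FAIL form. The dash in the output is written as the escape `\u2014` to match the example string above character for character.

```python
def summary_line(mean_score: float, threshold: float) -> str:
    """Format the suite-level result line shown in the acceptance signals."""
    # Decide on the unrounded score so that rounding for display
    # never changes the verdict.
    verdict = "FAIL" if mean_score < threshold else "PASS"  # "PASS" is assumed
    return (
        f"Suite score: {mean_score:.2f} "
        f"(threshold: {threshold:g}) \u2014 {verdict}"
    )
```

Because the verdict uses the unrounded mean, a score of 0.599 against a threshold of 0.6 would print as 0.60 yet still fail. Printing more decimal places in that case is one option to consider.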
Non-Goals
- [...] agentv compare regression gating (different use case)
- [...] required for that)
Context
Identified via microsoft/skills harness research. The skills harness uses --threshold 60 (0-100 scale) for CI quality gates. AgentV currently has per-test required gates and agentv compare --baseline regression gating, but no suite-level threshold.