feat(cli): --threshold flag for suite-level quality gates #698

@christso

Objective

Add a --threshold CLI flag to agentv eval that fails (exit 1) if the mean score across all tests falls below the specified threshold. This enables CI/CD quality gating without needing agentv compare --baseline.

Design Latitude

  • Add --threshold <number> flag (0-1 scale) to agentv eval
  • Also support execution.threshold in EVAL.yaml for per-suite defaults
  • CLI flag overrides YAML value
  • After all tests complete, compute the mean score; if it is below the threshold, print a summary and exit with code 1
  • Integrate with existing JUnit XML output (test-level pass/fail based on threshold)
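The override rule in the list above (CLI flag first, then the YAML default) could be sketched roughly as follows. This is only an illustration under assumptions: `resolveThreshold` and its parameters are hypothetical names, not AgentV's actual code, and rejecting values outside 0-1 is an assumed behaviour the issue does not specify.

```typescript
// Hypothetical helper, not part of AgentV's real codebase.
// cliValue:  raw string passed to --threshold, if any.
// yamlValue: execution.threshold parsed from EVAL.yaml, if any.
function resolveThreshold(
  cliValue: string | undefined,
  yamlValue: number | undefined,
): number | undefined {
  // The CLI flag wins when present; otherwise fall back to the YAML default.
  const raw = cliValue !== undefined ? Number(cliValue) : yamlValue;

  // Neither source set a threshold, so no suite-level gate applies.
  if (raw === undefined) return undefined;

  // Assumption: values outside the 0-1 scale are rejected rather than rescaled.
  if (!Number.isFinite(raw) || raw < 0 || raw > 1) {
    throw new RangeError(
      `threshold must be a number between 0 and 1, got ${cliValue ?? yamlValue}`,
    );
  }
  return raw;
}
```

Rejecting out-of-range input would catch someone passing `--threshold 60` out of habit from a 0-100 tool, though silently dividing by 100 is another option the implementer might prefer.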

Acceptance Signals

  • agentv eval evals/ --threshold 0.6 exits 1 if mean score < 0.6
  • execution.threshold: 0.6 in YAML has the same effect
  • CLI --threshold overrides YAML execution.threshold
  • Summary line printed: "Suite score: 0.53 (threshold: 0.6) — FAIL"
  • Exit code 0 when score meets threshold
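A minimal sketch of the gate these signals describe, assuming per-test scores are already available as numbers on the 0-1 scale. `evaluateGate` and `GateResult` are hypothetical names; the "PASS" label and the error on an empty suite are assumptions, since the issue only shows the failing summary line.

```typescript
// Hypothetical types and helper for illustration only.
interface GateResult {
  mean: number;
  passed: boolean;
  exitCode: 0 | 1;
  summary: string;
}

function evaluateGate(scores: number[], threshold: number): GateResult {
  // Assumption: an empty suite is an error rather than a silent pass.
  if (scores.length === 0) {
    throw new Error("no test scores to aggregate");
  }

  // Mean score across all tests in the suite.
  const mean = scores.reduce((sum, s) => sum + s, 0) / scores.length;

  // "Meets threshold" is read as mean >= threshold, so only mean < threshold fails.
  const passed = mean >= threshold;

  // \u2014 is the dash used in the summary format quoted in the issue.
  const summary =
    `Suite score: ${mean.toFixed(2)} (threshold: ${threshold}) \u2014 ` +
    (passed ? "PASS" : "FAIL");

  return { mean, passed, exitCode: passed ? 0 : 1, summary };
}
```

The comparison uses the unrounded mean, so a suite averaging 0.599 would print as 0.60 and still fail against a 0.6 threshold; an implementation may want to print more decimals in that case.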

Non-Goals

Context

Identified while researching the microsoft/skills harness, which uses --threshold 60 (0-100 scale) for CI quality gates. AgentV currently has per-test required gates and regression gating via agentv compare --baseline, but no suite-level threshold.

Status

Done
