feat(cli): --threshold flag for suite-level quality gates

## Objective

Add a `--threshold` CLI flag to `agentv eval` that fails (exit 1) if the mean score across all tests falls below the specified threshold. This enables CI/CD quality gating without needing `agentv compare --baseline`.

## Design Latitude

- Add `--threshold <number>` flag (0-1 scale) to `agentv eval`
- Also support `execution.threshold` in EVAL.yaml for per-suite defaults
- CLI flag overrides YAML value
- After all tests complete, compute the mean score; if it is below the threshold, print a summary line and exit with code 1
- Integrate with existing JUnit XML output (test-level pass/fail based on threshold)
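
One way the gating logic above could fit together, as a minimal sketch rather than the actual implementation. Every name here (`TestResult`, `resolveThreshold`, `meanScore`, `evaluateGate`) is hypothetical and not an existing AgentV API. It also assumes that scores are already normalized to 0-1, that an empty suite scores 0, and that out-of-range thresholds should be rejected; none of those are decided in this issue.

```typescript
// Hypothetical types and helpers; none of these are confirmed AgentV APIs.
interface TestResult {
  id: string;
  score: number; // assumed to be normalized to the 0-1 range
}

interface GateOutcome {
  threshold: number | undefined;
  meanScore: number;
  passed: boolean;
  exitCode: 0 | 1;
}

// The CLI value wins over the YAML default; undefined means no gate is configured.
function resolveThreshold(cliValue?: number, yamlValue?: number): number | undefined {
  const value = cliValue ?? yamlValue;
  if (value === undefined) return undefined;
  // Assumption: values outside 0-1 (e.g. 60 on a 0-100 scale) are rejected.
  if (!Number.isFinite(value) || value < 0 || value > 1) {
    throw new RangeError(`threshold must be between 0 and 1, got ${value}`);
  }
  return value;
}

// Arithmetic mean of all test scores.
function meanScore(results: TestResult[]): number {
  if (results.length === 0) return 0; // assumption: an empty suite scores 0
  const total = results.reduce((sum, r) => sum + r.score, 0);
  return total / results.length;
}

// Computes the suite-level outcome: fail (exit 1) only when a threshold
// is set and the mean score is strictly below it.
function evaluateGate(
  results: TestResult[],
  cliValue?: number,
  yamlValue?: number,
): GateOutcome {
  const threshold = resolveThreshold(cliValue, yamlValue);
  const mean = meanScore(results);
  const passed = threshold === undefined || mean >= threshold;
  return { threshold, meanScore: mean, passed, exitCode: passed ? 0 : 1 };
}
```

Using `??` rather than `||` for the override keeps an explicit `--threshold 0` from silently falling back to the YAML value.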

## Acceptance Signals

- `agentv eval evals/ --threshold 0.6` exits 1 if mean score < 0.6
- `execution.threshold: 0.6` in YAML has the same effect
- CLI `--threshold` overrides YAML `execution.threshold`
- Summary line printed: "Suite score: 0.53 (threshold: 0.6) — FAIL"
- Exit code 0 when the mean score is at or above the threshold
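
The summary line from the signals above could be produced by a small formatter like the one below. This is a sketch under assumptions: `formatSuiteSummary` is a hypothetical name, the two-decimal rounding is inferred from the single example in this issue, and the `PASS` label is a guess at the counterpart of `FAIL`.

```typescript
// Hypothetical helper; the name, rounding, and PASS label are assumptions.
function formatSuiteSummary(mean: number, threshold: number): string {
  // The verdict uses the unrounded mean, so display rounding never flips the result.
  const verdict = mean >= threshold ? "PASS" : "FAIL";
  return `Suite score: ${mean.toFixed(2)} (threshold: ${threshold}) — ${verdict}`;
}

console.log(formatSuiteSummary(0.53, 0.6));
// prints "Suite score: 0.53 (threshold: 0.6) — FAIL"
```

One consequence of rounding only for display: a mean of 0.599 against a threshold of 0.6 would print as `0.60` yet still report `FAIL`, so the final format may want more decimal places.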

## Non-Goals

- Not a replacement for `agentv compare` regression gating (different use case)
- Not a per-test threshold override (use `required` for that)
- Not severity levels (#334 covers that separately)

## Context

Identified via [microsoft/skills harness research](https://github.com/agentevals/agentevals-research/blob/main/research/findings/microsoft-skills-harness/agentv-gap-analysis.md). The skills harness uses `--threshold 60` (0-100 scale) for CI quality gates. AgentV currently has per-test `required` gates and `agentv compare --baseline` regression gating, but no suite-level threshold.
