Goal
Improve the existing export-screening showcase to demonstrate more robust CI gating without requiring AgentV core changes.
This keeps the current PR scope contained while providing an actionable pattern users can adopt immediately.
Proposal
Extend examples/showcase/export-screening/evals/ci_check.ts with optional multi-sample evaluation and stability-aware gating.
New wrapper options (suggested)
- --samples N: run the eval N times (a fresh eval invocation each time) and aggregate metrics across runs.
- --gate min|mean|p05|p10 (or similar): choose a conservative gating strategy.
- --min-run-f1 X: require every run (or p05) to meet the threshold.
- Optional: --max-stddev X or --max-variance X for the checked class.
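To make the gate options concrete, here is a minimal sketch of how per-run F1 scores might be collapsed into the single value that is compared against the threshold. The function names (gateValue, percentile, stddev) are hypothetical and not part of the current ci_check.ts, and the nearest-rank percentile definition is an assumption; the proposal does not specify one.

```typescript
// Hypothetical gate names mirroring the suggested --gate option.
type Gate = "min" | "mean" | "p05" | "p10";

// Nearest-rank percentile (assumed definition); pct is an integer in 1..100.
function percentile(values: number[], pct: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((pct * sorted.length) / 100));
  return sorted[rank - 1];
}

function mean(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Population standard deviation, usable for a --max-stddev style check.
function stddev(values: number[]): number {
  const m = mean(values);
  return Math.sqrt(mean(values.map((v) => (v - m) ** 2)));
}

// Collapse per-run F1 scores into the number compared to the threshold.
function gateValue(f1PerRun: number[], gate: Gate): number {
  if (f1PerRun.length === 0) throw new Error("no runs to gate on");
  switch (gate) {
    case "min":
      return Math.min(...f1PerRun);
    case "mean":
      return mean(f1PerRun);
    case "p05":
      return percentile(f1PerRun, 5);
    case "p10":
      return percentile(f1PerRun, 10);
  }
}
```

Under the nearest-rank definition assumed here, p05 equals min whenever there are 20 or fewer runs, so the percentile gates only differ from min at larger sample counts.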
Behavior
- Default behavior remains unchanged (single-run threshold gate), so existing docs continue to work.
- When --samples is provided:
  - wrapper runs bun agentv eval repeatedly (or expects multiple results files)
  - aggregates confusion matrices / per-class metrics
  - emits a stability-aware CI result JSON and exits non-zero on failure.
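The aggregation step could look roughly like the sketch below. It assumes a confusion matrix is stored as counts[actual][predicted]; that shape, and the helper names sumMatrices and classMetrics, are illustrative assumptions and not the format the existing wrapper or AgentV results files are known to use.

```typescript
// Assumed shape: matrix[actual][predicted] = number of cases.
type ConfusionMatrix = Record<string, Record<string, number>>;

interface ClassMetrics {
  precision: number;
  recall: number;
  f1: number;
}

// Element-wise sum of the confusion matrices from all runs.
function sumMatrices(runs: ConfusionMatrix[]): ConfusionMatrix {
  const total: ConfusionMatrix = {};
  for (const run of runs) {
    for (const [actual, row] of Object.entries(run)) {
      const totalRow = total[actual] ?? {};
      total[actual] = totalRow;
      for (const [predicted, count] of Object.entries(row)) {
        totalRow[predicted] = (totalRow[predicted] ?? 0) + count;
      }
    }
  }
  return total;
}

// One-vs-rest precision, recall, and F1 for the checked class.
function classMetrics(matrix: ConfusionMatrix, cls: string): ClassMetrics {
  let tp = 0;
  let fp = 0;
  let fn = 0;
  for (const [actual, row] of Object.entries(matrix)) {
    for (const [predicted, count] of Object.entries(row)) {
      if (actual === cls && predicted === cls) tp += count;
      else if (predicted === cls) fp += count;
      else if (actual === cls) fn += count;
    }
  }
  const precision = tp + fp === 0 ? 0 : tp / (tp + fp);
  const recall = tp + fn === 0 ? 0 : tp / (tp + fn);
  const f1 =
    precision + recall === 0
      ? 0
      : (2 * precision * recall) / (precision + recall);
  return { precision, recall, f1 };
}
```

Calling classMetrics on each run's matrix gives the per-run scores that a min or percentile gate needs, while calling it on sumMatrices(runs) gives a pooled aggregate; the two can differ, so the output should probably state which one the gate used.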
Why this belongs in a wrapper/example
Acceptance Criteria
- bun run ./evals/ci_check.ts --eval ./evals/dataset.yaml --samples 5 --threshold 0.95 --check-class High works.
- Output JSON includes per-run metrics, aggregate metrics, and the selected gating rule.
- Wrapper exits 0/1 deterministically based on the selected gate.
- README is updated with the new options and guidance on choosing --samples and gate types.
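One possible shape for the result JSON and exit code is sketched below, using the min gate for simplicity. Every field name here (checkClass, runs, aggregate, gateValue, passed) is a hypothetical illustration of the criteria above and not an existing ci_check.ts output format.

```typescript
interface RunMetrics {
  run: number; // 1-based run index
  f1: number; // F1 of the checked class in this run
}

interface CiResult {
  checkClass: string;
  threshold: number;
  gate: "min";
  runs: RunMetrics[];
  aggregate: { min: number; mean: number };
  gateValue: number;
  passed: boolean;
}

// Build the result from per-run F1 scores; pure, so the same inputs
// always produce the same pass/fail decision.
function buildResult(
  checkClass: string,
  threshold: number,
  f1PerRun: number[],
): CiResult {
  if (f1PerRun.length === 0) throw new Error("no runs to report");
  const min = Math.min(...f1PerRun);
  const mean = f1PerRun.reduce((sum, v) => sum + v, 0) / f1PerRun.length;
  return {
    checkClass,
    threshold,
    gate: "min",
    runs: f1PerRun.map((f1, i) => ({ run: i + 1, f1 })),
    aggregate: { min, mean },
    gateValue: min,
    passed: min >= threshold,
  };
}

// Map the decision to the process exit code: 0 on pass, 1 on failure.
function exitCodeFor(result: CiResult): number {
  return result.passed ? 0 : 1;
}
```

In the wrapper, the result would presumably be printed with JSON.stringify(result, null, 2) and the process would exit with exitCodeFor(result).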
Related