## Summary

Add an `evaluate()` programmatic API so agentv can be used as a library, not just a CLI. This is the foundation for language SDKs, CI integrations, and AI agent consumption.
## Motivation

AgentV is currently CLI-only. To use it programmatically, users must shell out to `agentv run` and parse JSONL output. This is fragile and prevents:
- Type-safe integration in TypeScript projects
- Language SDKs (Python, C#) which need a programmatic API to wrap
- AI agents using agentv as a composable primitive
- Custom runners with different output formats
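To make the fragility concrete, here is a sketch of the shell-out workflow the proposal replaces. The JSONL field names (`test_id`, `verdict`) are assumptions for illustration, not the CLI's actual schema:

```ts
// Sketch of the status quo: capture `agentv run` stdout and parse JSONL by hand.
interface ParsedResult {
  test_id: string;
  verdict: string;
}

function parseJsonlOutput(stdout: string): ParsedResult[] {
  // Every non-empty line must be valid JSON — a single stray log line or
  // partial write breaks the whole parse, and nothing here is type-checked.
  return stdout
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as ParsedResult);
}

// What the CLI's stdout might look like:
const sampleStdout = `{"test_id":"capital","verdict":"pass"}\n{"test_id":"math","verdict":"fail"}\n`;
const parsed = parseJsonlOutput(sampleStdout);
console.log(parsed.length); // 2
```

The `as ParsedResult` cast is exactly the problem: the compiler cannot verify anything about the CLI's output, which is what a typed `evaluate()` fixes.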
Evidence from research:
- Promptfoo: `promptfoo.evaluate({ tests, assert, providers })` — single function, config mirrors YAML
- Azure SDK: `evaluate(data=..., evaluators=..., target=...)` — same pattern
- DeepEval: `evaluate(test_cases, metrics)` — same pattern, Python
- Mastra: programmatic `agent.evaluate()` method
- nem035/agentevals: `describe/it/expect` plus `run()` — dual CLI + programmatic
## Proposed Design

### Naming

Per #328 naming decisions, the programmatic API is `evaluate()` (not `runEvaluation()`) to match Promptfoo's mental model. The config shape mirrors the YAML — users can translate between them 1:1.

### TypeScript API
```ts
import { evaluate } from "@agentv/core";

// Option 1: Pure code (no YAML needed) — config mirrors YAML structure
const results = await evaluate({
  tests: [
    {
      id: "capital",
      input: "What is the capital of France?",
      expected_output: "Paris",
      assert: [
        { type: "contains", value: "Paris" },
        { type: "llm_judge", prompt: "Is this geographically correct?" },
      ],
    },
  ],
  target: { provider: "claude_agent" },
});

// Option 2: Load from YAML (existing workflow, programmatic)
const yamlResults = await evaluate({
  specFile: "./evals/EVAL.yaml",
  target: "mock_agent",
  filter: "specific-test",
});

// Results are typed
for (const result of results.results) {
  console.log(result.test_id, result.verdict, result.scores);
}
```
### Package location

`evaluate()` lives in `@agentv/core` — it needs the orchestrator, providers, and registry. This is a heavy dependency, and that's honest: users who need programmatic evaluation need the engine.

- `@agentv/eval` → `defineAssertion()`, `defineCodeJudge()` (lightweight, zod only)
- `@agentv/core` → `evaluate()` (heavy, needs engine)
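To illustrate why the lightweight package can stay engine-free, here is a hypothetical sketch of a `defineAssertion()`-style helper. Plain TypeScript stands in for the zod schema, and everything beyond the `defineAssertion` name is an assumption:

```ts
// Hypothetical sketch: a definition-only API that needs no orchestrator.
// The heavy engine would consume these definitions at run time.
interface AssertionContext {
  output: string;
  expected?: string;
}

interface AssertionDef {
  name: string;
  check: (ctx: AssertionContext) => { pass: boolean; score: number };
}

function defineAssertion(def: AssertionDef): AssertionDef {
  // In the real package, zod validation of `def` would live here.
  return def;
}

const containsExpected = defineAssertion({
  name: "contains_expected",
  check: ({ output, expected }) => {
    const pass = expected !== undefined && output.includes(expected);
    return { pass, score: pass ? 1 : 0 };
  },
});

const result = containsExpected.check({
  output: "Paris is the capital.",
  expected: "Paris",
});
console.log(result.pass); // true
```

Because nothing here touches providers or the registry, a consumer can depend on the light package alone and only pull in `@agentv/core` when they actually run evaluations.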
### Return Type

```ts
interface EvalRunResult {
  results: EvalCaseResult[];
  summary: {
    total: number;
    passed: number;
    failed: number;
    borderline: number;
    duration_ms: number;
    cost_usd: number;
  };
}
```
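To make the shape concrete, here is a hedged sketch of how the `summary` could be aggregated from individual case results. The `EvalCaseResult` fields below are assumptions inferred from fields used elsewhere in this proposal:

```ts
// Sketch: fold EvalCaseResult[] into the proposed summary shape.
// Verdict values and field names are illustrative assumptions.
interface EvalCaseResult {
  test_id: string;
  verdict: "pass" | "fail" | "borderline";
  duration_ms: number;
  cost_usd: number;
}

interface EvalSummary {
  total: number;
  passed: number;
  failed: number;
  borderline: number;
  duration_ms: number;
  cost_usd: number;
}

function summarize(results: EvalCaseResult[]): EvalSummary {
  return results.reduce<EvalSummary>(
    (acc, r) => ({
      total: acc.total + 1,
      passed: acc.passed + (r.verdict === "pass" ? 1 : 0),
      failed: acc.failed + (r.verdict === "fail" ? 1 : 0),
      borderline: acc.borderline + (r.verdict === "borderline" ? 1 : 0),
      duration_ms: acc.duration_ms + r.duration_ms,
      cost_usd: acc.cost_usd + r.cost_usd,
    }),
    { total: 0, passed: 0, failed: 0, borderline: 0, duration_ms: 0, cost_usd: 0 },
  );
}

const summary = summarize([
  { test_id: "a", verdict: "pass", duration_ms: 120, cost_usd: 0.01 },
  { test_id: "b", verdict: "fail", duration_ms: 80, cost_usd: 0.02 },
  { test_id: "c", verdict: "pass", duration_ms: 100, cost_usd: 0.01 },
]);
console.log(summary.passed, summary.failed); // 2 1
```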
## Architecture

Extract the evaluation orchestration logic from the CLI command handler into a standalone function:

```
apps/cli/src/commands/run.ts            (CLI layer)
└── calls evaluate() from @agentv/core  (library layer)
    ├── loadEvalSpec()   — parse YAML/JSONL
    ├── resolveTarget()  — provider registry lookup
    ├── runTests()       — parallel execution
    └── scoreResults()   — assertion pipeline
```

The CLI becomes a thin wrapper around the library API.
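The extraction above can be sketched as a straight-line pipeline. Every function body here is a placeholder (the real `runTests()` would call providers in parallel, the real `loadEvalSpec()` would read files); only the composition mirrors the proposal:

```ts
// Stub sketch of the proposed library-layer pipeline.
interface EvalSpec { tests: { id: string; input: string; expected?: string }[]; }
interface Target { provider: string; }
interface CaseResult { test_id: string; verdict: "pass" | "fail"; }

// Real version: parse YAML/JSONL from disk.
function loadEvalSpec(spec: EvalSpec): EvalSpec { return spec; }
// Real version: provider registry lookup.
function resolveTarget(name: string): Target { return { provider: name }; }
// Real version: parallel execution against the resolved provider.
function runTests(spec: EvalSpec, _target: Target): { id: string; output: string }[] {
  return spec.tests.map((t) => ({ id: t.id, output: t.expected ?? "" }));
}
// Real version: full assertion pipeline.
function scoreResults(
  spec: EvalSpec,
  outputs: { id: string; output: string }[],
): CaseResult[] {
  return outputs.map((o) => {
    const expected = spec.tests.find((t) => t.id === o.id)?.expected;
    const pass = expected !== undefined && o.output.includes(expected);
    return { test_id: o.id, verdict: pass ? "pass" : "fail" };
  });
}

// The standalone function the CLI would wrap.
function evaluateSketch(spec: EvalSpec, targetName: string): CaseResult[] {
  const loaded = loadEvalSpec(spec);
  const target = resolveTarget(targetName);
  return scoreResults(loaded, runTests(loaded, target));
}

const demoResults = evaluateSketch(
  { tests: [{ id: "capital", input: "Capital of France?", expected: "Paris" }] },
  "mock_agent",
);
console.log(demoResults[0].verdict); // pass
```

Because the CLI would call the same function, CLI and library behavior cannot drift apart.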
## Acceptance Criteria

- `evaluate()` exported from `@agentv/core`
- Config uses the `assert` key (matching YAML), not `evaluators`
- Returns `EvalRunResult` with summary statistics
- CLI `run` command calls `evaluate()` internally
- Streaming callback: `onResult?: (result: EvalCaseResult) => void`
- `examples/programmatic/` showing library usage

## Research References

- Promptfoo — `evaluate()` API, config mirrors YAML
- DeepEval — `evaluate()` function
- Azure SDK — `evaluate()` function