14 changes: 12 additions & 2 deletions plugins/agentv-dev/skills/agentv-bench/SKILL.md
@@ -187,7 +187,13 @@ This is the only opportunity to capture this data — it comes through the task
agentv pipeline grade <run-dir>
```

This runs all `code-grader` assertions against the `response.md` files. Results are written to `<test-id>/code_grader_results/<name>.json`. Alternatively, pass `--grader-type code` to `pipeline run` to run code graders inline.
This evaluates all deterministic assertions against `response.md` files. Two types are handled:
- **`code-grader` scripts** — external scripts executed against the response (arbitrary logic, any language)
- **Built-in assertion types** — evaluated in-process: `contains`, `contains-any`, `contains-all`, `icontains`, `regex`, `equals`, `starts-with`, `ends-with`, `is-json`, and variants

Both types are configured by `pipeline input` into `code_graders/<name>.json` and graded by `pipeline grade`. Results are written to `<test-id>/code_grader_results/<name>.json`. Alternatively, pass `--grader-type code` to `pipeline run` to run these inline.

**Do not dispatch LLM grader subagents for tests that only have `contains`, `regex`, or other built-in assertions** — `pipeline grade` handles them entirely, at zero cost.
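
For orientation, a built-in assertion entry in `code_graders/<name>.json` might look roughly like the sketch below. The field names (`type`, `value`) are assumptions for illustration, not a confirmed schema; `references/schemas.md` is authoritative.

```json
{
  "type": "icontains",
  "value": "hello world"
}
```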

**Phase 2: LLM grading** (semantic — do NOT skip this phase)

@@ -196,7 +202,9 @@ Example: 5 tests × 2 LLM graders = 10 grader subagents launched simultaneously.

**Do NOT dispatch a single grader for multiple tests.** Each subagent grades exactly one (test, grader) pair.

Each grader subagent (read `agents/grader.md`):
**Before dispatching graders, read `agents/grader.md` and embed its full content as the system instructions in every grader subagent prompt.** The grader is a `general-purpose` task agent — there is no auto-resolved "grader" type. Without `agents/grader.md` embedded verbatim, the subagent has no grading process, no output format, and no file-path knowledge, so it will produce empty or incorrect output.

Each grader subagent (operating under `agents/grader.md` instructions):
1. Reads `<test-id>/llm_graders/<name>.json` for the grading prompt
2. Reads `<test-id>/response.md` for the candidate output
3. Grades the response against the prompt criteria
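
The config read in step 1 might look roughly like the following sketch. The field names shown (`prompt`, `weight`) are assumptions for illustration; the real schema is defined in `references/schemas.md`.

```json
{
  "prompt": "Score the response against the rubric: covers error handling, cites file paths, stays under 500 words.",
  "weight": 1.0
}
```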
@@ -221,6 +229,8 @@ agentv results validate <run-dir>

`pipeline bench` reads LLM grader results from `llm_grader_results/<name>.json` per test automatically, merges with code-grader scores, computes weighted pass_rate, and writes `grading.json` + `index.jsonl` + `benchmark.json`.
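
For example, a test with a code grader scoring 1.0 and an LLM grader scoring 0.5, each at weight 1, would merge to a pass_rate of 0.75 under a simple weighted mean. (The exact weighting rules are defined by the schemas in `references/schemas.md`; equal weights here are an illustrative assumption.)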

> **Diagnosing `pass_rate=0`:** If `pipeline bench` reports `pass_rate=0` across the board, do **not** assume the tests genuinely failed. First verify the grading pipeline ran correctly: check that `<test-id>/llm_grader_results/<name>.json` exists and is non-empty for each test. If these files are absent or empty, the grader subagents failed to produce output (most common cause: `agents/grader.md` was not embedded in the subagent prompts — see Phase 2). Treat `pass_rate=0` as a real signal only after confirming grader results exist.
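>
> A structurally healthy result file contains at least a score. Assuming its shape mirrors the `grading.json` example in `agents/grader.md` (an assumption; field names other than `score` are illustrative), a minimal one might look like:
>
> ```json
> {
>   "score": 1.0,
>   "assertions": [
>     { "text": "Response contains 'hello'", "passed": true }
>   ]
> }
> ```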

### Artifacts

All artifacts use established schemas — see `references/schemas.md` for the full definitions. Do not modify the structure. Key artifacts per run:
17 changes: 5 additions & 12 deletions plugins/agentv-dev/skills/agentv-bench/agents/grader.md
@@ -152,12 +152,15 @@ Suggestions worth raising:

If the evals are solid, set eval_feedback to `{"suggestions": [], "overall": "No suggestions, evals look solid."}`.

### Step 9: Write grading.json
### Step 9: Write results to disk

Write results to `{bench-dir}/test-{test-id}/grading.json`:
Write results to `{bench-dir}/{test-id}/llm_grader_results/<grader-name>.json`, where `<grader-name>` matches the filename from `llm_graders/<name>.json` (e.g. if the grader config is `llm_graders/rubrics.json`, write to `llm_grader_results/rubrics.json`).

Do **NOT** write directly to `grading.json` — that file is produced by `agentv pipeline bench` after merging all `llm_grader_results`. Writing directly to it bypasses the merge step and will cause `pipeline bench` to report `pass_rate=0`.

```json
{
  "score": 0.85,
  "assertions": [
    {
      "text": "Response contains 'hello'",
@@ -171,16 +174,6 @@ Write results to `{bench-dir}/test-{test-id}/grading.json`:
    "total": 1,
    "pass_rate": 1.0
  },
  "execution_metrics": {
    "tool_calls": { "Read": 3, "Bash": 2 },
    "total_tool_calls": 5,
    "output_chars": 1200,
    "transcript_chars": 800
  },
  "timing": {
    "executor_duration_seconds": 12.5,
    "total_duration_seconds": 15.0
  },
  "claims": [
    {
      "claim": "Used async/await pattern",
@@ -158,8 +158,9 @@ the eval.yaml. The target is recorded in `manifest.json` — one run = one target
├── criteria.md ← grading criteria
├── response.md ← target/agent output
├── timing.json ← execution timing
├── code_graders/<name>.json ← code grader configs
├── code_graders/<name>.json ← grader configs written by `pipeline input`: code-grader scripts AND built-in types (contains, regex, equals, etc.)
├── llm_graders/<name>.json ← LLM grader configs
├── code_grader_results/<name>.json ← code grader results
└── grading.json ← merged grading
├── code_grader_results/<name>.json ← code grader results
├── llm_grader_results/<name>.json ← LLM grader results (written by grader subagents; one file per grader)
└── grading.json ← merged grading (written by `pipeline bench` — do NOT write here directly)
```