diff --git a/plugins/agentv-dev/skills/agentv-bench/SKILL.md b/plugins/agentv-dev/skills/agentv-bench/SKILL.md
index 46aa2c46..89d108e7 100644
--- a/plugins/agentv-dev/skills/agentv-bench/SKILL.md
+++ b/plugins/agentv-dev/skills/agentv-bench/SKILL.md
@@ -187,7 +187,13 @@ This is the only opportunity to capture this data — it comes through the task
 agentv pipeline grade
 ```
 
-This runs all `code-grader` assertions against the `response.md` files. Results are written to `/code_grader_results/.json`. Alternatively, pass `--grader-type code` to `pipeline run` to run code graders inline.
+This evaluates all deterministic assertions against `response.md` files. Two types are handled:
+- **`code-grader` scripts** — external scripts executed against the response (arbitrary logic, any language)
+- **Built-in assertion types** — evaluated in-process: `contains`, `contains-any`, `contains-all`, `icontains`, `regex`, `equals`, `starts-with`, `ends-with`, `is-json`, and variants
+
+Both types are configured by `pipeline input` into `code_graders/.json` and graded by `pipeline grade`. Results are written to `/code_grader_results/.json`. Alternatively, pass `--grader-type code` to `pipeline run` to run these inline.
+
+**Do not dispatch LLM grader subagents for tests that only have `contains`, `regex`, or other built-in assertions** — `pipeline grade` handles them entirely, at zero cost.
 
 **Phase 2: LLM grading** (semantic — do NOT skip this phase)
 
@@ -196,7 +202,9 @@ Example: 5 tests × 2 LLM graders = 10 grader subagents launched simultaneously.
 
 **Do NOT dispatch a single grader for multiple tests.** Each subagent grades exactly one (test, grader) pair.
 
-Each grader subagent (read `agents/grader.md`):
+**Before dispatching graders, read `agents/grader.md` and embed its full content as the system instructions in every grader subagent prompt.** The grader is a `general-purpose` task agent — there is no auto-resolved "grader" type. Without `agents/grader.md` embedded verbatim, the subagent has no grading process, no output format, and no file-path knowledge, and will produce empty or incorrect output.
+
+Each grader subagent (operating under `agents/grader.md` instructions):
 1. Reads `/llm_graders/.json` for the grading prompt
 2. Reads `/response.md` for the candidate output
 3. Grades the response against the prompt criteria
@@ -221,6 +229,8 @@ agentv results validate
 
 `pipeline bench` reads LLM grader results from `llm_grader_results/.json` per test automatically, merges with code-grader scores, computes weighted pass_rate, and writes `grading.json` + `index.jsonl` + `benchmark.json`.
 
+> **Diagnosing `pass_rate=0`:** If `pipeline bench` reports `pass_rate=0` across the board, do **not** assume the tests genuinely failed. First verify the grading pipeline ran correctly: check that `/llm_grader_results/.json` exists and is non-empty for each test. If these files are absent or empty, the grader subagents failed to produce output (most common cause: `agents/grader.md` was not embedded in the subagent prompts — see Phase 2). Treat `pass_rate=0` as a real signal only after confirming grader results exist.
+
 ### Artifacts
 All artifacts use established schemas — see `references/schemas.md` for the full definitions. Do not modify the structure.
 Key artifacts per run:
diff --git a/plugins/agentv-dev/skills/agentv-bench/agents/grader.md b/plugins/agentv-dev/skills/agentv-bench/agents/grader.md
index e39d547f..1f176b57 100644
--- a/plugins/agentv-dev/skills/agentv-bench/agents/grader.md
+++ b/plugins/agentv-dev/skills/agentv-bench/agents/grader.md
@@ -152,12 +152,15 @@ Suggestions worth raising:
 
 If the evals are solid, set eval_feedback to `{"suggestions": [], "overall": "No suggestions, evals look solid."}`.
 
-### Step 9: Write grading.json
+### Step 9: Write results to disk
 
-Write results to `{bench-dir}/test-{test-id}/grading.json`:
+Write results to `{bench-dir}/{test-id}/llm_grader_results/.json`, where `` matches the filename from `llm_graders/.json` (e.g. if the grader config is `llm_graders/rubrics.json`, write to `llm_grader_results/rubrics.json`).
+
+Do **NOT** write directly to `grading.json` — that file is produced by `agentv pipeline bench` after merging all `llm_grader_results`. Writing directly to it bypasses the merge step and will cause `pipeline bench` to report `pass_rate=0`.
 
 ```json
 {
+  "score": 0.85,
   "assertions": [
     {
       "text": "Response contains 'hello'",
@@ -171,16 +174,6 @@ Write results to `{bench-dir}/test-{test-id}/grading.json`:
     "total": 1,
     "pass_rate": 1.0
   },
-  "execution_metrics": {
-    "tool_calls": { "Read": 3, "Bash": 2 },
-    "total_tool_calls": 5,
-    "output_chars": 1200,
-    "transcript_chars": 800
-  },
-  "timing": {
-    "executor_duration_seconds": 12.5,
-    "total_duration_seconds": 15.0
-  },
   "claims": [
     {
       "claim": "Used async/await pattern",
diff --git a/plugins/agentv-dev/skills/agentv-bench/references/subagent-pipeline.md b/plugins/agentv-dev/skills/agentv-bench/references/subagent-pipeline.md
index 34703d43..23994de9 100644
--- a/plugins/agentv-dev/skills/agentv-bench/references/subagent-pipeline.md
+++ b/plugins/agentv-dev/skills/agentv-bench/references/subagent-pipeline.md
@@ -158,8 +158,9 @@ the eval.yaml. The target is recorded in `manifest.json` — one run = one targe
   ├── criteria.md               ← grading criteria
   ├── response.md               ← target/agent output
   ├── timing.json               ← execution timing
-  ├── code_graders/.json        ← code grader configs
+  ├── code_graders/.json        ← grader configs written by `pipeline input`: code-grader scripts AND built-in types (contains, regex, equals, etc.)
   ├── llm_graders/.json         ← LLM grader configs
-  ├── code_grader_results/.json ← code grader results
-  └── grading.json              ← merged grading
+  ├── code_grader_results/.json ← code grader results
+  ├── llm_grader_results/.json  ← LLM grader results (written by grader subagents; one file per grader)
+  └── grading.json              ← merged grading (written by `pipeline bench` — do NOT write here directly)
 ```
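
The command sequence below is a minimal sketch of the phase ordering these docs describe. Only the bare subcommands (`agentv pipeline grade`, `agentv pipeline bench`, `agentv results validate`, and `pipeline run --grader-type code`) appear in the changed files; everything else in the sketch is an assumption to be checked against SKILL.md and `references/schemas.md`.

```bash
# Sketch only: subcommand names come from the docs above; anything beyond
# them is an assumption, not a documented invocation.
agentv pipeline grade      # Phase 1: code-grader scripts + built-in assertions (contains, regex, ...)
                           # writes code_grader_results/; `pipeline run --grader-type code` runs this inline
# Phase 2: dispatch one grader subagent per (test, grader) pair, with agents/grader.md embedded
# verbatim in each prompt; each writes one file under llm_grader_results/ (never grading.json).
agentv pipeline bench      # merges code + LLM results into grading.json, index.jsonl, benchmark.json
agentv results validate    # validation step referenced in SKILL.md
```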
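
For reference, here is the Step 9 output file assembled as one document. This skeleton uses only the keys and example values visible in the grader.md hunks (`score`, `assertions[].text`, `summary.total`, `summary.pass_rate`, `claims[].claim`); the fields those hunks elide are omitted rather than guessed, so the complete shape should be taken from `references/schemas.md`.

```json
{
  "score": 0.85,
  "assertions": [
    { "text": "Response contains 'hello'" }
  ],
  "summary": {
    "total": 1,
    "pass_rate": 1.0
  },
  "claims": [
    { "claim": "Used async/await pattern" }
  ]
}
```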