14 changes: 12 additions & 2 deletions plugins/agentv-dev/skills/agentv-bench/SKILL.md
@@ -187,7 +187,13 @@ This is the only opportunity to capture this data — it comes through the task
agentv pipeline grade <run-dir>
```

This runs all `code-grader` assertions against the `response.md` files. Results are written to `<test-id>/code_grader_results/<name>.json`. Alternatively, pass `--grader-type code` to `pipeline run` to run code graders inline.
This evaluates all deterministic assertions against `response.md` files. Two types are handled:
- **`code-grader` scripts** — external scripts executed against the response (arbitrary logic, any language)
- **Built-in assertion types** — evaluated in-process: `contains`, `contains-any`, `contains-all`, `icontains`, `regex`, `equals`, `starts-with`, `ends-with`, `is-json`, and variants

Both types are configured by `pipeline input` into `code_graders/<name>.json` and graded by `pipeline grade`. Results are written to `<test-id>/code_grader_results/<name>.json`. Alternatively, pass `--grader-type code` to `pipeline run` to run these inline.

**Do not dispatch LLM grader subagents for tests that only have `contains`, `regex`, or other built-in assertions** — `pipeline grade` handles them entirely, at zero cost.
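
For orientation, a built-in assertion entry in `code_graders/<name>.json` might look roughly like the sketch below. The field names (`type`, `value`) are assumptions for illustration, not a confirmed schema; `references/schemas.md` is authoritative.

```json
{
  "type": "icontains",
  "value": "hello world"
}
```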

**Phase 2: LLM grading** (semantic — do NOT skip this phase)

@@ -196,7 +202,9 @@ Example: 5 tests × 2 LLM graders = 10 grader subagents launched simultaneously.

**Do NOT dispatch a single grader for multiple tests.** Each subagent grades exactly one (test, grader) pair.

Each grader subagent (read `agents/grader.md`):
**Before dispatching graders, read `agents/grader.md` and embed its full content as the system instructions in every grader subagent prompt.** The grader is a `general-purpose` task agent — there is no auto-resolved "grader" type. Without `agents/grader.md` embedded verbatim, the subagent has no grading process, no output format, and no file-path knowledge, so it will produce empty or incorrect output.

Each grader subagent (operating under `agents/grader.md` instructions):
1. Reads `<test-id>/llm_graders/<name>.json` for the grading prompt
2. Reads `<test-id>/response.md` for the candidate output
3. Grades the response against the prompt criteria
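
The config read in step 1 might look roughly like the following sketch. The field names shown (`prompt`, `weight`) are assumptions for illustration; the real schema is defined in `references/schemas.md`.

```json
{
  "prompt": "Score the response against the rubric: covers error handling, cites file paths, stays under 500 words.",
  "weight": 1.0
}
```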
@@ -221,6 +229,8 @@ agentv results validate <run-dir>

`pipeline bench` reads LLM grader results from `llm_grader_results/<name>.json` per test automatically, merges with code-grader scores, computes weighted pass_rate, and writes `grading.json` + `index.jsonl` + `benchmark.json`.
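
For example, a test with a code grader scoring 1.0 and an LLM grader scoring 0.5, each at weight 1, would merge to a pass_rate of 0.75 under a simple weighted mean. (The exact weighting rules are defined by the schemas in `references/schemas.md`; equal weights here are an illustrative assumption.)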

> **Diagnosing `pass_rate=0`:** If `pipeline bench` reports `pass_rate=0` across the board, do **not** assume the tests genuinely failed. First verify the grading pipeline ran correctly: check that `<test-id>/llm_grader_results/<name>.json` exists and is non-empty for each test. If these files are absent or empty, the grader subagents failed to produce output (most common cause: `agents/grader.md` was not embedded in the subagent prompts — see Phase 2). Treat `pass_rate=0` as a real signal only after confirming grader results exist.
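>
> A structurally healthy result file contains at least a score. Assuming its shape mirrors the `grading.json` example in `agents/grader.md` (an assumption; field names other than `score` are illustrative), a minimal one might look like:
>
> ```json
> {
>   "score": 1.0,
>   "assertions": [
>     { "text": "Response contains 'hello'", "passed": true }
>   ]
> }
> ```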

### Artifacts

All artifacts use established schemas — see `references/schemas.md` for the full definitions. Do not modify the structure. Key artifacts per run:
17 changes: 5 additions & 12 deletions plugins/agentv-dev/skills/agentv-bench/agents/grader.md
@@ -152,12 +152,15 @@ Suggestions worth raising:

If the evals are solid, set eval_feedback to `{"suggestions": [], "overall": "No suggestions, evals look solid."}`.

### Step 9: Write grading.json
### Step 9: Write results to disk

Write results to `{bench-dir}/test-{test-id}/grading.json`:
Write results to `{bench-dir}/{test-id}/llm_grader_results/<grader-name>.json`, where `<grader-name>` matches the filename from `llm_graders/<name>.json` (e.g. if the grader config is `llm_graders/rubrics.json`, write to `llm_grader_results/rubrics.json`).

Do **NOT** write directly to `grading.json` — that file is produced by `agentv pipeline bench` after merging all `llm_grader_results`. Writing directly to it bypasses the merge step and will cause `pipeline bench` to report `pass_rate=0`.

```json
{
  "score": 0.85,
  "assertions": [
    {
      "text": "Response contains 'hello'",
@@ -171,16 +174,6 @@ Write results to `{bench-dir}/test-{test-id}/grading.json`:
    "total": 1,
    "pass_rate": 1.0
  },
  "execution_metrics": {
    "tool_calls": { "Read": 3, "Bash": 2 },
    "total_tool_calls": 5,
    "output_chars": 1200,
    "transcript_chars": 800
  },
  "timing": {
    "executor_duration_seconds": 12.5,
    "total_duration_seconds": 15.0
  },
  "claims": [
    {
      "claim": "Used async/await pattern",
@@ -158,8 +158,9 @@ the eval.yaml. The target is recorded in `manifest.json` — one run = one target
├── criteria.md ← grading criteria
├── response.md ← target/agent output
├── timing.json ← execution timing
├── code_graders/<name>.json ← code grader configs
├── code_graders/<name>.json ← grader configs written by `pipeline input`: code-grader scripts AND built-in types (contains, regex, equals, etc.)
├── llm_graders/<name>.json ← LLM grader configs
├── code_grader_results/<name>.json ← code grader results
└── grading.json ← merged grading
├── code_grader_results/<name>.json ← code grader results
├── llm_grader_results/<name>.json ← LLM grader results (written by grader subagents; one file per grader)
└── grading.json ← merged grading (written by `pipeline bench` — do NOT write here directly)
```