fix(agentv-bench): fix grader subagent pipeline bugs #1148
Merged
christso merged 1 commit on Apr 23, 2026
Conversation
- SKILL.md Phase 2: clarify that agents/grader.md must be embedded
verbatim in every grader subagent prompt. Without it, general-purpose
subagents have no grading process and produce empty output.
- SKILL.md Phase 3: add pass_rate=0 diagnostic note — zero pass rate
from pipeline bench indicates missing grader results, not all tests
failing. Direct devs to check llm_grader_results/<name>.json first.
- grader.md Step 9: fix output path from wrong
{bench-dir}/test-{test-id}/grading.json to correct
{bench-dir}/{test-id}/llm_grader_results/<grader-name>.json.
Add explicit warning not to write directly to grading.json, which
is produced by pipeline bench and would be overwritten/ignored.
- subagent-pipeline.md: add llm_grader_results/<name>.json to the
output structure tree, alongside code_grader_results. Also annotate
grading.json as pipeline-owned to prevent direct writes.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
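The path fix above can be sketched as a small helper. This is an illustrative sketch, not agentv code: the function name is hypothetical, and the result shape (`score`, `assertions[]`) is taken from the follow-up commit's description of the fields `pipeline bench` consumes.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Sketch: write one grader subagent's result to the location pipeline bench
// reads during its merge step. Hypothetical helper; schema is an assumption.
function writeGraderResult(
  benchDir: string,
  testId: string,
  graderName: string,
  result: { score: number; assertions: unknown[] },
): string {
  // Correct layout: {bench-dir}/{test-id}/llm_grader_results/<grader-name>.json
  // Note: no "test-" prefix on the test-id directory, and never grading.json,
  // which pipeline bench owns and would overwrite during the merge step.
  const dir = path.join(benchDir, testId, "llm_grader_results");
  fs.mkdirSync(dir, { recursive: true });
  const outPath = path.join(dir, `${graderName}.json`);
  fs.writeFileSync(outPath, JSON.stringify(result, null, 2));
  return outPath;
}
```

Writing per-grader files under `llm_grader_results/` (rather than to `grading.json` directly) is what lets `pipeline bench` merge multiple graders' outputs without clobbering any of them.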
christso added a commit that referenced this pull request on Apr 23, 2026
* docs(agentv-bench): clean up stale grader references after #1148

  Follow-up to #1148. Three small clarifications:

  - grader.md: fix stale `bench-dir` example (`results/export/` → `results/runs/`) and clarify that `bench-dir` already includes the `<evalset>` segment. The output path spec in Step 9 assumes this.
  - grader.md: annotate Field Descriptions to distinguish fields consumed by `pipeline bench` (`score`, `assertions[]`) from fields kept on disk for traceability. Also remove `execution_metrics` and `timing` from the list — #1148 dropped them from the JSON example but the descriptions still referenced them.
  - SKILL.md Phase 1: add a one-liner on how orchestrators detect which tests need Phase 2 — check whether `<test-id>/llm_graders/` has any `.json` files. `pipeline input` only populates it for `llm-grader` assertions.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(pipeline): align subagent-mode suite fallback with CLI mode

  Before this change, `pipeline input` and `pipeline run` resolved the suite (evalset) directory name from `suite.metadata?.name` only. If the eval.yaml had no `name` field, `suite.metadata` was undefined and the artifact layout skipped the suite segment entirely, writing to `<run-dir>/<test-id>/...` directly.

  CLI mode (`agentv eval`, via `artifact-writer.ts:buildArtifactSubdir` consuming `test.suite`) already falls back through `metadata.name` → eval-file basename (stripping `.eval.yaml`) → `'eval'`. This fallback is applied by the loaders (`yaml-parser.ts:317-324`, `jsonl-parser.ts:165`) and attached to every test as `test.suite`.

  Switch `pipeline input` and `pipeline run` to read `tests[0]?.suite` so subagent-mode runs produce the same `<evalset>/<test-id>/` layout CLI mode produces. `pipeline bench` and `pipeline grade` consume `manifest.suite`, which is written by these two commands, so they pick up the change automatically — no consumer edits needed.

  Docs in the agentv-bench skill updated to match: `<evalset>` is now always present in the artifact tree, not conditional. A regression test covers the no-`name` case — previously resolved to `<run-dir>/<test-id>/`, now resolves to `<run-dir>/no-name/<test-id>/`, matching `no-name.eval.yaml`'s basename.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
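The fallback chain described in that commit can be sketched as follows. This is a simplified illustration, not the loaders' actual code (which lives in `yaml-parser.ts` and `jsonl-parser.ts`); the function name and parameter shapes are hypothetical.

```typescript
import * as path from "node:path";

// Sketch of the suite-name fallback CLI mode already applied and that
// subagent mode now shares: metadata.name → eval-file basename
// (stripping ".eval.yaml") → "eval".
function resolveSuiteName(
  metadataName: string | undefined,
  evalFilePath: string,
): string {
  if (metadataName) return metadataName;
  // e.g. "/runs/no-name.eval.yaml" → "no-name"
  const stripped = path.basename(evalFilePath).replace(/\.eval\.ya?ml$/, "");
  if (stripped) return stripped;
  return "eval";
}
```

Under this scheme a suite with no `name` field and file `no-name.eval.yaml` resolves to `no-name`, so artifacts land in `<run-dir>/no-name/<test-id>/` rather than skipping the suite segment.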
Summary
Three bugs caused Phase 2→3 of the subagent grading pipeline to silently fail, producing `pass_rate=0` even when all tests genuinely passed. Also updates Phase 1 docs to reflect v4.21.0 CLI changes.

**Bug 1 — `grader.md` Step 9: wrong output path (critical)**

Step 9 told grader subagents to write results to `{bench-dir}/test-{test-id}/grading.json`. Two problems:

- The `test-` prefix doesn't match the actual directory name on disk (`{test-id}`, not `test-{test-id}`).
- `grading.json` is `pipeline bench`'s output — it overwrites this file during the merge step, discarding whatever the grader wrote.

The correct path is `{bench-dir}/{test-id}/llm_grader_results/<grader-name>.json`, which `pipeline bench` reads during the merge step.

Fix: Updated Step 9 with the correct path and an explicit warning not to write directly to `grading.json`.

**Bug 2 — `SKILL.md` Phase 2: missing instruction to embed `grader.md` verbatim**

Phase 2 mentioned `agents/grader.md` but didn't say to embed its content in the subagent prompt. Grader subagents are general-purpose task agents — no instructions are auto-loaded. Without `grader.md` injected verbatim, subagents have no grading process, no output format, and no file-path knowledge, producing empty output.

Fix: Added a bold instruction: read `agents/grader.md` and embed its full content as system instructions in every grader subagent prompt.

**Bug 3 — `SKILL.md` Phase 3: `pass_rate=0` treated as a real signal**

When the grading pipeline fails silently (Bugs 1 or 2), `pipeline bench` reports `pass_rate=0`. Nothing warned that zero often means a pipeline failure rather than genuine test failures, leading agents to optimise the wrong thing.

Fix: Added a diagnostic callout in Phase 3 directing users to verify `llm_grader_results/<name>.json` before treating zero as a real signal.

**Phase 1 doc update — stale description of `pipeline grade`**

Since v4.21.0 (64fdff95), `pipeline grade` evaluates built-in assertion types natively (`contains`, `regex`, `equals`, `starts-with`, `ends-with`, `is-json`, and variants) in addition to `code-grader` scripts. `pipeline input` also now creates `code_graders/<name>.json` configs for these types. The skill docs still described Phase 1 as only handling `code-grader` assertions.

Fix: Updated the Phase 1 description to cover both `code-grader` scripts and built-in types. Added a note that built-in-only tests do not need LLM grader subagents dispatched. Updated `subagent-pipeline.md`'s directory tree label for `code_graders/<name>.json` accordingly.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
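The `pass_rate=0` diagnostic can be sketched as a disk check. This helper is illustrative only, not part of agentv; the directory layout is the one this PR documents.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Sketch of the Phase 3 diagnostic: before treating pass_rate=0 as genuine
// test failures, confirm grader results actually exist on disk. Tests with
// no llm_grader_results/*.json point to a silent pipeline failure.
function missingGraderResults(benchDir: string, testIds: string[]): string[] {
  return testIds.filter((id) => {
    const dir = path.join(benchDir, id, "llm_grader_results");
    if (!fs.existsSync(dir)) return true;
    return !fs.readdirSync(dir).some((f) => f.endsWith(".json"));
  });
}
```

If this returns a non-empty list, the zero pass rate is a grading-pipeline problem (Bugs 1 or 2 above), not a signal about the tests themselves.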