fix(agentv-bench): fix grader subagent pipeline bugs #1148
Merged
christso merged 1 commit on Apr 23, 2026
Conversation
- SKILL.md Phase 2: clarify that agents/grader.md must be embedded
verbatim in every grader subagent prompt. Without it, general-purpose
subagents have no grading process and produce empty output.
- SKILL.md Phase 3: add pass_rate=0 diagnostic note — zero pass rate
from pipeline bench indicates missing grader results, not all tests
failing. Direct devs to check llm_grader_results/<name>.json first.
- grader.md Step 9: fix output path from wrong
{bench-dir}/test-{test-id}/grading.json to correct
{bench-dir}/{test-id}/llm_grader_results/<grader-name>.json.
Add explicit warning not to write directly to grading.json, which
is produced by pipeline bench and would be overwritten/ignored.
- subagent-pipeline.md: add llm_grader_results/<name>.json to the
output structure tree, alongside code_grader_results. Also annotate
grading.json as pipeline-owned to prevent direct writes.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
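The path fix above can be sketched as a small helper. This is an illustrative sketch, not agentv code: the function name is hypothetical, and the result shape (`score`, `assertions[]`) is taken from the follow-up commit's description of the fields `pipeline bench` consumes.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Sketch: write one grader subagent's result to the location pipeline bench
// reads during its merge step. Hypothetical helper; schema is an assumption.
function writeGraderResult(
  benchDir: string,
  testId: string,
  graderName: string,
  result: { score: number; assertions: unknown[] },
): string {
  // Correct layout: {bench-dir}/{test-id}/llm_grader_results/<grader-name>.json
  // Note: no "test-" prefix on the test-id directory, and never grading.json,
  // which pipeline bench owns and would overwrite during the merge step.
  const dir = path.join(benchDir, testId, "llm_grader_results");
  fs.mkdirSync(dir, { recursive: true });
  const outPath = path.join(dir, `${graderName}.json`);
  fs.writeFileSync(outPath, JSON.stringify(result, null, 2));
  return outPath;
}
```

Writing per-grader files under `llm_grader_results/` (rather than to `grading.json` directly) is what lets `pipeline bench` merge multiple graders' outputs without clobbering any of them.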
christso added a commit that referenced this pull request on Apr 23, 2026
* docs(agentv-bench): clean up stale grader references after #1148

  Follow-up to #1148. Three small clarifications:

  - grader.md: fix stale `bench-dir` example (`results/export/` → `results/runs/`) and clarify that `bench-dir` already includes the `<evalset>` segment. The output path spec in Step 9 assumes this.
  - grader.md: annotate Field Descriptions to distinguish fields consumed by `pipeline bench` (`score`, `assertions[]`) from fields kept on disk for traceability. Also remove `execution_metrics` and `timing` from the list — #1148 dropped them from the JSON example but the descriptions still referenced them.
  - SKILL.md Phase 1: add a one-liner on how orchestrators detect which tests need Phase 2 — check whether `<test-id>/llm_graders/` has any `.json` files. `pipeline input` only populates it for `llm-grader` assertions.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(pipeline): align subagent-mode suite fallback with CLI mode

  Before this change, `pipeline input` and `pipeline run` resolved the suite (evalset) directory name from `suite.metadata?.name` only. If the eval.yaml had no `name` field, `suite.metadata` was undefined and the artifact layout skipped the suite segment entirely, writing to `<run-dir>/<test-id>/...` directly.

  CLI mode (`agentv eval`, via `artifact-writer.ts:buildArtifactSubdir` consuming `test.suite`) already falls back through `metadata.name` → eval-file basename (stripping `.eval.yaml`) → `'eval'`. This fallback is applied by the loaders (`yaml-parser.ts:317-324`, `jsonl-parser.ts:165`) and attached to every test as `test.suite`.

  Switch `pipeline input` and `pipeline run` to read `tests[0]?.suite` so subagent-mode runs produce the same `<evalset>/<test-id>/` layout CLI mode produces. `pipeline bench` and `pipeline grade` consume `manifest.suite`, which is written by these two commands, so they pick up the change automatically — no consumer edits needed.

  Docs in the agentv-bench skill updated to match: `<evalset>` is now always present in the artifact tree, not conditional. A regression test covers the no-`name` case — previously resolved to `<run-dir>/<test-id>/`, now resolves to `<run-dir>/no-name/<test-id>/`, matching `no-name.eval.yaml`'s basename.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
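The fallback chain described in that commit can be sketched as follows. This is a simplified illustration, not the loaders' actual code (which lives in `yaml-parser.ts` and `jsonl-parser.ts`); the function name and parameter shapes are hypothetical.

```typescript
import * as path from "node:path";

// Sketch of the suite-name fallback CLI mode already applied and that
// subagent mode now shares: metadata.name → eval-file basename
// (stripping ".eval.yaml") → "eval".
function resolveSuiteName(
  metadataName: string | undefined,
  evalFilePath: string,
): string {
  if (metadataName) return metadataName;
  // e.g. "/runs/no-name.eval.yaml" → "no-name"
  const stripped = path.basename(evalFilePath).replace(/\.eval\.ya?ml$/, "");
  if (stripped) return stripped;
  return "eval";
}
```

Under this scheme a suite with no `name` field and file `no-name.eval.yaml` resolves to `no-name`, so artifacts land in `<run-dir>/no-name/<test-id>/` rather than skipping the suite segment.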
Summary
Three bugs caused Phase 2→3 of the subagent grading pipeline to silently fail, producing `pass_rate=0` even when all tests genuinely passed. Also updates Phase 1 docs to reflect v4.21.0 CLI changes.

**Bug 1 — `grader.md` Step 9: wrong output path (critical)**

Step 9 told grader subagents to write results to `{bench-dir}/test-{test-id}/grading.json`. Two problems:

- The `test-` prefix doesn't match the actual directory name on disk (`{test-id}`, not `test-{test-id}`).
- `grading.json` is `pipeline bench`'s output — it overwrites this file during the merge step, discarding whatever the grader wrote.

The correct path is `{bench-dir}/{test-id}/llm_grader_results/<grader-name>.json`, which `pipeline bench` reads during the merge step.

Fix: Updated Step 9 with the correct path and an explicit warning not to write directly to `grading.json`.

**Bug 2 — `SKILL.md` Phase 2: missing instruction to embed `grader.md` verbatim**

Phase 2 mentioned `agents/grader.md` but didn't say to embed its content in the subagent prompt. Grader subagents are general-purpose task agents — no instructions are auto-loaded. Without `grader.md` injected verbatim, subagents have no grading process, no output format, and no file-path knowledge, producing empty output.

Fix: Added a bold instruction: read `agents/grader.md` and embed its full content as system instructions in every grader subagent prompt.

**Bug 3 — `SKILL.md` Phase 3: `pass_rate=0` treated as a real signal**

When the grading pipeline fails silently (Bugs 1 or 2), `pipeline bench` reports `pass_rate=0`. Nothing warned that zero often means a pipeline failure rather than genuine test failures, leading agents to optimise the wrong thing.

Fix: Added a diagnostic callout in Phase 3 directing users to verify `llm_grader_results/<name>.json` before treating zero as a real signal.

**Phase 1 doc update — stale description of `pipeline grade`**

Since v4.21.0 (64fdff95), `pipeline grade` evaluates built-in assertion types natively (`contains`, `regex`, `equals`, `starts-with`, `ends-with`, `is-json`, and variants) in addition to `code-grader` scripts. `pipeline input` also now creates `code_graders/<name>.json` configs for these types. The skill docs still described Phase 1 as only handling `code-grader` assertions.

Fix: Updated the Phase 1 description to cover both `code-grader` scripts and built-in types. Added a note that built-in-only tests do not need LLM grader subagents dispatched. Updated `subagent-pipeline.md`'s directory tree label for `code_graders/<name>.json` accordingly.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
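The `pass_rate=0` diagnostic can be sketched as a disk check. This helper is illustrative only, not part of agentv; the directory layout is the one this PR documents.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Sketch of the Phase 3 diagnostic: before treating pass_rate=0 as genuine
// test failures, confirm grader results actually exist on disk. Tests with
// no llm_grader_results/*.json point to a silent pipeline failure.
function missingGraderResults(benchDir: string, testIds: string[]): string[] {
  return testIds.filter((id) => {
    const dir = path.join(benchDir, id, "llm_grader_results");
    if (!fs.existsSync(dir)) return true;
    return !fs.readdirSync(dir).some((f) => f.endsWith(".json"));
  });
}
```

If this returns a non-empty list, the zero pass rate is a grading-pipeline problem (Bugs 1 or 2 above), not a signal about the tests themselves.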