
fix(agentv-bench): fix grader subagent pipeline bugs #1148

Merged
christso merged 1 commit into EntityProcess:main from jozsurf:JP0/fix-agentv-bench-grader-pipeline on Apr 23, 2026

Conversation

@jozsurf (Contributor) commented Apr 23, 2026

Summary

Three bugs caused the Phase 2→3 handoff in the subagent grading pipeline to fail silently, producing pass_rate=0 even when all tests genuinely passed. This PR also updates the Phase 1 docs to reflect v4.21.0 CLI changes.


Bug 1 — grader.md Step 9: wrong output path (critical)

Step 9 told grader subagents to write results to {bench-dir}/test-{test-id}/grading.json.

Two problems:

  1. The test- prefix doesn't match the actual directory name on disk ({test-id}, not test-{test-id})
  2. grading.json is pipeline bench's output — it overwrites this file during the merge step, discarding whatever the grader wrote

The correct path is {bench-dir}/{test-id}/llm_grader_results/<grader-name>.json, which pipeline bench reads during the merge step.

Fix: Updated Step 9 with correct path and explicit warning not to write directly to grading.json.
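The corrected path can be sketched as a small helper. This is an illustrative sketch only, assuming POSIX-style paths; `graderOutputPath` is a hypothetical name, since the real instructions live in `agents/grader.md` prose rather than code:

```typescript
import * as path from "node:path";

// Hypothetical helper illustrating the corrected Step 9 output path.
function graderOutputPath(benchDir: string, testId: string, graderName: string): string {
  // Correct: one file per grader, read by `pipeline bench` during the merge step.
  return path.join(benchDir, testId, "llm_grader_results", `${graderName}.json`);
  // Wrong (pre-fix): path.join(benchDir, `test-${testId}`, "grading.json");
  // the `test-` prefix doesn't exist on disk, and grading.json is pipeline
  // bench's own output, so anything written there is overwritten at merge time.
}
```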


Bug 2 — SKILL.md Phase 2: missing instruction to embed grader.md verbatim

Phase 2 mentioned agents/grader.md but didn't say to embed its content in the subagent prompt. Grader subagents are general-purpose task agents — no instructions are auto-loaded. Without grader.md injected verbatim, subagents have no grading process, no output format, and no file-path knowledge, producing empty output.

Fix: Added a bold instruction: read agents/grader.md and embed its full content as system instructions in every grader subagent prompt.
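Prompt assembly for Phase 2 can be sketched as follows. `buildGraderPrompt` and the section headings are hypothetical illustrations, not the skill's actual wording; the point is only that the grader.md content is concatenated verbatim, not referenced by path:

```typescript
// Hypothetical sketch of Phase 2 prompt assembly: general-purpose subagents
// auto-load nothing, so the full grader instructions must be inlined.
function buildGraderPrompt(graderInstructions: string, taskDescription: string): string {
  return [
    "## System instructions (embedded verbatim from agents/grader.md)",
    graderInstructions,
    "## Task",
    taskDescription,
  ].join("\n\n");
}

// Usage: const prompt = buildGraderPrompt(readFileSync("agents/grader.md", "utf8"), task);
```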


Bug 3 — SKILL.md Phase 3: pass_rate=0 treated as a real signal

When the grading pipeline fails silently (Bugs 1 or 2), pipeline bench reports pass_rate=0. Nothing warned that zero often means a pipeline failure rather than genuine test failures, leading agents to optimise the wrong thing.

Fix: Added a diagnostic callout in Phase 3 directing users to verify llm_grader_results/<name>.json before treating zero as a real signal.
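The diagnostic can be expressed as a predicate: zero is suspect whenever no per-grader result files exist on disk. A minimal sketch, assuming the caller has already listed the `llm_grader_results/` directory; the function name and shapes are hypothetical:

```typescript
type BenchReport = { passRate: number };

// Hypothetical check: pass_rate=0 with no llm_grader_results/<name>.json
// files means the grading pipeline never ran, not that every test failed.
function zeroLooksLikePipelineFailure(report: BenchReport, graderResultFiles: string[]): boolean {
  return report.passRate === 0 && graderResultFiles.length === 0;
}
```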


Phase 1 doc update — stale description of pipeline grade

Since v4.21.0 (64fdff95), pipeline grade evaluates built-in assertion types natively (contains, regex, equals, starts-with, ends-with, is-json, and variants) in addition to code-grader scripts. pipeline input also now creates code_graders/<name>.json configs for these types. The skill docs still described Phase 1 as only handling code-grader assertions.

Fix: Updated Phase 1 description to cover both code-grader scripts and built-in types. Added a note that built-in-only tests do not need LLM grader subagents dispatched. Updated subagent-pipeline.md directory tree label for code_graders/<name>.json accordingly.
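The built-in types listed above can be illustrated with a small evaluator. The type names mirror the PR description; the function itself is a hypothetical sketch, not `pipeline grade`'s real implementation:

```typescript
// Illustrative evaluator for the built-in assertion types that
// `pipeline grade` handles natively since v4.21.0 (sketch only).
type BuiltinAssertion =
  | { type: "contains"; value: string }
  | { type: "regex"; value: string }
  | { type: "equals"; value: string }
  | { type: "starts-with"; value: string }
  | { type: "ends-with"; value: string }
  | { type: "is-json" };

function evalBuiltin(output: string, a: BuiltinAssertion): boolean {
  switch (a.type) {
    case "contains": return output.includes(a.value);
    case "regex": return new RegExp(a.value).test(output);
    case "equals": return output === a.value;
    case "starts-with": return output.startsWith(a.value);
    case "ends-with": return output.endsWith(a.value);
    case "is-json":
      try { JSON.parse(output); return true; } catch { return false; }
  }
}
```

Tests whose assertions are all of these types need no LLM grader subagent dispatched at all.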


Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- SKILL.md Phase 2: clarify that agents/grader.md must be embedded
  verbatim in every grader subagent prompt. Without it, general-purpose
  subagents have no grading process and produce empty output.

- SKILL.md Phase 3: add pass_rate=0 diagnostic note — zero pass rate
  from pipeline bench indicates missing grader results, not all tests
  failing. Direct devs to check llm_grader_results/<name>.json first.

- grader.md Step 9: fix output path from wrong
  {bench-dir}/test-{test-id}/grading.json to correct
  {bench-dir}/{test-id}/llm_grader_results/<grader-name>.json.
  Add explicit warning not to write directly to grading.json, which
  is produced by pipeline bench and would be overwritten/ignored.

- subagent-pipeline.md: add llm_grader_results/<name>.json to the
  output structure tree, alongside code_grader_results. Also annotate
  grading.json as pipeline-owned to prevent direct writes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@christso christso merged commit a84a151 into EntityProcess:main Apr 23, 2026
christso added a commit that referenced this pull request Apr 23, 2026
* docs(agentv-bench): clean up stale grader references after #1148

Follow-up to #1148. Three small clarifications:

- grader.md: fix stale `bench-dir` example (`results/export/` →
  `results/runs/`) and clarify that `bench-dir` already includes the
  `<evalset>` segment. The output path spec in Step 9 assumes this.

- grader.md: annotate Field Descriptions to distinguish fields
  consumed by `pipeline bench` (`score`, `assertions[]`) from fields
  kept on disk for traceability. Also remove `execution_metrics` and
  `timing` from the list — #1148 dropped them from the JSON example
  but the descriptions still referenced them.

- SKILL.md Phase 1: add a one-liner on how orchestrators detect
  which tests need Phase 2 — check whether `<test-id>/llm_graders/`
  has any `.json` files. `pipeline input` only populates it for
  `llm-grader` assertions.
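That detection one-liner can be sketched as a predicate; `needsLlmGrading` is a hypothetical name, assuming POSIX paths and filesystem access:

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Hypothetical sketch: a test needs Phase 2 iff <test-id>/llm_graders/
// contains any .json files; `pipeline input` only populates that directory
// for llm-grader assertions.
function needsLlmGrading(runDir: string, testId: string): boolean {
  const dir = path.join(runDir, testId, "llm_graders");
  if (!fs.existsSync(dir)) return false;
  return fs.readdirSync(dir).some((f) => f.endsWith(".json"));
}
```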

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(pipeline): align subagent-mode suite fallback with CLI mode

Before this change, `pipeline input` and `pipeline run` resolved the
suite (evalset) directory name from `suite.metadata?.name` only. If the
eval.yaml had no `name` field, `suite.metadata` was undefined and the
artifact layout skipped the suite segment entirely — writing to
`<run-dir>/<test-id>/...` directly.

CLI mode (`agentv eval`, via `artifact-writer.ts:buildArtifactSubdir`
consuming `test.suite`) already falls back through `metadata.name` →
eval-file basename (stripping `.eval.yaml`) → `'eval'`. This fallback
is applied by the loaders (`yaml-parser.ts:317-324`,
`jsonl-parser.ts:165`) and attached to every test as `test.suite`.

Switch `pipeline input` and `pipeline run` to read `tests[0]?.suite`
so subagent-mode runs produce the same `<evalset>/<test-id>/` layout
CLI mode produces. `pipeline bench` and `pipeline grade` consume
`manifest.suite` which is written by these two, so they pick up the
change automatically — no consumer edits needed.

Docs in the agentv-bench skill updated to match: `<evalset>` is now
always present in the artifact tree, not conditional.

Regression test covers the no-`name` case — previously resolved to
`<run-dir>/<test-id>/`, now resolves to `<run-dir>/no-name/<test-id>/`
matching `no-name.eval.yaml`'s basename.
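The fallback chain described above can be sketched as one standalone function. The real logic lives in the loaders (`yaml-parser.ts`, `jsonl-parser.ts`); this version only illustrates the order of precedence and is not the actual code:

```typescript
// Sketch of the suite-name fallback: metadata.name, else the eval file's
// basename with `.eval.yaml` stripped, else the literal 'eval'.
function resolveSuiteName(metadataName: string | undefined, evalFilePath: string): string {
  if (metadataName) return metadataName;
  const base = evalFilePath.split("/").pop() ?? "";
  const stripped = base.replace(/\.eval\.yaml$/, "");
  return stripped || "eval";
}
```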

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
