Problem
Code judges receive the full output Message[] array via stdin for every invocation. For long-running agent sessions (Claude Code, Copilot), this array can be 1-10 MB (50+ turns with full file contents in tool outputs). With multiple code judges per test case and parallel workers, this creates redundant serialization:
- 3 workers × 10 MB output × 3 judges = 90 MB of redundant JSON serialized, piped, and parsed
Most code judges don't even need the full output — they inspect trace (TraceSummary stats) or answer (string).
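The arithmetic above can be reproduced with a small helper. This is only an illustrative sketch: `redundantBytes` and its parameter names are hypothetical and not part of AgentV.

```typescript
// Hypothetical helper: total bytes serialized when every judge receives the full output.
function redundantBytes(workers: number, judgesPerCase: number, outputBytes: number): number {
  // Each judge invocation gets its own serialized copy of the output via stdin.
  return workers * judgesPerCase * outputBytes;
}

const MB = 1_000_000;
const total = redundantBytes(3, 3, 10 * MB);
console.log(`${total / MB} MB serialized`); // prints "90 MB serialized"
```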
Current Flow
orchestrator → JSON.stringify(full payload with output[]) → stdin pipe → judge process → JSON.parse
code-evaluator.ts:41-57 always includes output: context.output ?? null in the stdin payload regardless of whether the judge uses it.
Proposed: File-Backed Lazy Loading
Write large fields to a temp file once per test case. Pass the file path in the stdin payload. The @agentv/eval SDK reads the file transparently when the judge accesses the field.
Orchestrator changes (code-evaluator.ts)
// Write output to temp file once (shared across all judges for this test case)
const outputPath = await writeTempOutput(context.output);
const payload = {
  question: context.evalCase.question,
  answer: context.candidate,
  trace: context.trace ?? null, // small, always in stdin
  output: null, // no longer in stdin
  _outputPath: outputPath, // file path for lazy loading
  input: context.evalCase.input,
  // ... rest unchanged
};
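`writeTempOutput` is called above but not defined in this proposal. One possible shape, assuming Node's built-in `fs/promises`, `os`, and `path` modules, is sketched below. The directory prefix and file name are made up for illustration, and the real helper may differ.

```typescript
import { mkdtemp, readFile, writeFile } from "node:fs/promises";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical implementation of writeTempOutput (an assumption, not the project's code).
async function writeTempOutput(output: unknown): Promise<string> {
  // mkdtemp creates a uniquely named directory, so parallel workers cannot collide.
  const dir = await mkdtemp(join(tmpdir(), "agentv-output-"));
  const outputPath = join(dir, "output.json");
  // Serialize once; undefined is not valid JSON, so fall back to null.
  await writeFile(outputPath, JSON.stringify(output ?? null), "utf8");
  return outputPath;
}

// Round trip: the file holds exactly what was passed in.
writeTempOutput([{ role: "assistant", content: "done" }]).then(async (outputPath) => {
  const messages = JSON.parse(await readFile(outputPath, "utf8"));
  console.log(messages.length); // one message survives the round trip
});
```

Using a fresh directory per test case also makes cleanup a single recursive delete.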
SDK changes (@agentv/eval runtime.ts)
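import { readFileSync } from 'node:fs';

// Transparent lazy loading: judge code unchanged
const camelInput = toCamelCaseDeep(rawInput);
// Validate first: schema parsing may read `output` and copy fields into a new
// object, which would trigger (and then discard) a getter installed beforehand
const input = CodeJudgeInputSchema.parse(camelInput);
// If _outputPath present and output is null, create lazy getter
if (input._outputPath && input.output === null) {
  const outputPath = input._outputPath;
  Object.defineProperty(input, 'output', {
    get: () => {
      const data = JSON.parse(readFileSync(outputPath, 'utf8'));
      // Cache after first read by replacing the getter with a plain value
      Object.defineProperty(input, 'output', { value: data });
      return data;
    },
    configurable: true,
    enumerable: true, // keep `output` visible to spreads and JSON.stringify
  });
}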
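The lazy-getter pattern can be exercised in isolation. The sketch below is self-contained and does not use the SDK: `JudgeInput`, `attachLazyOutput`, and the `fileReads` counter are hypothetical names introduced only to make the caching behaviour observable.

```typescript
import { mkdtempSync, readFileSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

type Message = { role: string; content: string };
type JudgeInput = { answer: string; output: Message[] | null; _outputPath?: string };

let fileReads = 0; // counts disk reads so caching is observable

// Installs a lazy, self-caching getter for `output`.
function attachLazyOutput(input: JudgeInput): JudgeInput {
  const outputPath = input._outputPath;
  if (!outputPath || input.output !== null) return input; // inline output wins
  Object.defineProperty(input, "output", {
    get(): Message[] {
      fileReads += 1;
      const data = JSON.parse(readFileSync(outputPath, "utf8")) as Message[];
      // Replace the accessor with a plain value so later reads skip the file.
      Object.defineProperty(input, "output", { value: data });
      return data;
    },
    configurable: true,
    enumerable: true,
  });
  return input;
}

const dir = mkdtempSync(join(tmpdir(), "lazy-output-"));
const file = join(dir, "output.json");
writeFileSync(file, JSON.stringify([{ role: "assistant", content: "done" }]));

const input = attachLazyOutput({ answer: "done", output: null, _outputPath: file });
console.log(fileReads);            // 0: nothing read yet
console.log(input.output?.length); // 1: first access reads the file
console.log(input.output?.length); // 1: second access hits the cache
console.log(fileReads);            // 1: the file was read exactly once
```

A judge that only looks at `answer` or `trace` never touches the getter, so it pays no file I/O at all.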
Judge code — no changes needed
import { defineCodeJudge } from '@agentv/eval';
export default defineCodeJudge((input) => {
  // These are always in stdin (fast)
  const { trace, answer } = input;
  // This triggers lazy file read only if accessed
  const output = input.output; // transparent: reads from file
});
What Changes
| Field | Current | Proposed |
|---|---|---|
| answer, trace, criteria, config | In stdin | In stdin (unchanged) |
| output (Message[]) | In stdin (always) | Temp file, lazy loaded via SDK |
| input (Message[]) | In stdin | In stdin (usually small) |
| Serialization cost per test case | N judges × full payload | 1 file write + N judges × small payload |
| SDK input.output API | Direct property | Same API (lazy getter, cached after first read) |
| Raw stdin judges (no SDK) | Get output in stdin | Get _outputPath + null output; must read file manually |
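For judges that parse stdin themselves, the manual read could look roughly like the sketch below. `RawPayload`, `resolveOutput`, and `readPayloadFromStdin` are hypothetical names, and the `_outputPath` key is taken from the orchestrator snippet; the exact wire format (for example, key casing) is an assumption.

```typescript
import { readFileSync } from "node:fs";

type RawPayload = { output: unknown[] | null; _outputPath?: string | null };

// Hypothetical helper for judges that do not use the SDK.
function resolveOutput(payload: RawPayload): unknown[] | null {
  // Inline output takes precedence, so older orchestrators keep working.
  if (payload.output !== null) return payload.output;
  // No inline output and no path: there is nothing to load.
  if (!payload._outputPath) return null;
  return JSON.parse(readFileSync(payload._outputPath, "utf8")) as unknown[] | null;
}

// Typical use inside a raw judge: read all of stdin (file descriptor 0), then resolve.
function readPayloadFromStdin(): RawPayload {
  return JSON.parse(readFileSync(0, "utf8")) as RawPayload;
}
```

Checking the inline field first means the same judge works against both the current and the proposed payload shapes.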
Files to Modify
packages/core/src/evaluation/evaluators/code-evaluator.ts — write temp file, pass path
packages/eval/src/runtime.ts — lazy getter in runCodeJudge
packages/eval/src/schemas.ts — add _outputPath to schema
packages/core/src/evaluation/orchestrator.ts — manage temp file lifecycle (write before judges, cleanup after)
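The lifecycle in the last item (write before judges, cleanup after) fits a `try`/`finally` wrapper. The sketch below assumes Node's `fs/promises`; `withTempOutput` and its callback signature are hypothetical, not existing orchestrator APIs.

```typescript
import { existsSync } from "node:fs";
import { mkdtemp, rm, writeFile } from "node:fs/promises";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical wrapper: write the output once, run all judges, always clean up.
async function withTempOutput<T>(
  output: unknown,
  runJudges: (outputPath: string) => Promise<T>,
): Promise<T> {
  const dir = await mkdtemp(join(tmpdir(), "agentv-output-"));
  const outputPath = join(dir, "output.json");
  await writeFile(outputPath, JSON.stringify(output ?? null), "utf8");
  try {
    // All judges for this test case share the same file.
    return await runJudges(outputPath);
  } finally {
    // Runs on success and on failure, so a crashing judge does not leak files.
    await rm(dir, { recursive: true, force: true });
  }
}

withTempOutput([{ role: "assistant", content: "done" }], async (outputPath) => {
  console.log(existsSync(outputPath)); // true: file is available while judges run
  return outputPath;
}).then((outputPath) => {
  console.log(existsSync(outputPath)); // false: removed once judges finish
});
```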
Priority
Low — Current approach works fine for typical eval payloads. This optimization matters when evaluating long-running agent sessions with large tool outputs (50+ turns, MB-scale Message arrays).
LLM Judge Note
Not affected — default LLM judge template doesn't include {{output}}. Custom templates that use {{output}} face a token cost problem (not serialization), which is a separate concern.
Acceptance Criteria
- Write output Message[] to temp file once per test case (before running judges)
- Pass _outputPath instead of full output array
- @agentv/eval SDK transparently reads from file when input.output is accessed
- If output is present in stdin (non-null), use it directly (no file read)
- Raw stdin judges receive _outputPath in payload: document migration path
- Consider extending input to file-backed loading if payload size warrants it (future)