feat(eval): Ralph Loop — iterative improvement with feedback injection #699

@christso

Objective

Add an iterative generate→evaluate→feedback→regenerate loop (Ralph Loop) to AgentV. When enabled, a failing test case is re-prompted with structured feedback about what went wrong. The loop repeats until the quality threshold is met, the budget of N iterations is exhausted, or another stop condition fires.

Design Latitude

YAML configuration (suggested shape, flexible):

execution:
  ralph:
    max_iterations: 3
    threshold: 0.8
    improvement_threshold: 0.05
    feedback_template: ./feedback.md  # optional custom template

CLI flag: agentv eval evals/ --ralph --max-iterations 3
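
To make the suggested shape concrete, here is a minimal sketch of how the `execution.ralph` block and the `--max-iterations` flag could be merged into one validated config object. It is written in Python only for illustration. `RalphConfig` and `from_mapping` are hypothetical names, the defaults simply mirror the sample values above, and both the CLI-overrides-YAML precedence and the 0..1 score range are assumptions, not decided behavior.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class RalphConfig:
    # Defaults mirror the sample YAML above (an assumption, not a spec).
    max_iterations: int = 3
    threshold: float = 0.8
    improvement_threshold: float = 0.05
    feedback_template: Optional[str] = None  # path to a custom template

    def __post_init__(self) -> None:
        if self.max_iterations < 1:
            raise ValueError("max_iterations must be >= 1")
        # Assumes scores are normalized to the 0..1 range.
        if not 0.0 <= self.threshold <= 1.0:
            raise ValueError("threshold must be between 0 and 1")


def from_mapping(execution: dict,
                 cli_max_iterations: Optional[int] = None) -> RalphConfig:
    """Build a config from the parsed `execution` mapping.

    A CLI value, when given, overrides the YAML value (assumed precedence).
    """
    cfg = RalphConfig(**execution.get("ralph", {}))
    if cli_max_iterations is not None:
        cfg = replace(cfg, max_iterations=cli_max_iterations)
    return cfg
```

With this shape, `from_mapping({"ralph": {"max_iterations": 5}}, cli_max_iterations=2)` yields a config with `max_iterations == 2`, and keys missing from the YAML fall back to the defaults.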

Key components

  1. Orchestrator loop — wraps existing evaluate-single-case flow with retry logic
  2. Feedback builder — converts assertion failures into structured LLM-actionable feedback injected into the next prompt
  3. Stop conditions (from microsoft/skills harness):
    • quality_threshold_met — score >= threshold
    • perfect_score — score >= 1.0
    • max_iterations_reached — exhausted budget
    • no_improvement — improvement < improvement_threshold
    • score_regression — score went down
  4. Per-iteration result tracking — score trajectory, which iteration passed, stop reason
  5. Feedback template system — default template that formats failures by severity with suggestions; user can override with custom markdown template
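
The orchestrator loop and stop conditions listed above can be sketched as follows. This is an illustration in Python under stated assumptions, not the intended implementation: `RalphConfig`, `stop_reason`, and `run_ralph` are hypothetical names, `generate` and `evaluate` stand in for the existing evaluate-single-case flow, "improvement" is taken to mean the score gain over the previous iteration, and the order in which the conditions are checked is a guess (`perfect_score` is checked first because it would otherwise be shadowed by `quality_threshold_met`).

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class RalphConfig:
    max_iterations: int = 3
    threshold: float = 0.8
    improvement_threshold: float = 0.05


def stop_reason(scores: List[float], cfg: RalphConfig) -> Optional[str]:
    """Return the stop reason for the trajectory so far, or None to continue."""
    current = scores[-1]
    if current >= 1.0:
        return "perfect_score"
    if current >= cfg.threshold:
        return "quality_threshold_met"
    if len(scores) >= 2:
        delta = current - scores[-2]  # gain over the previous iteration
        if delta < 0:
            return "score_regression"
        if delta < cfg.improvement_threshold:
            return "no_improvement"
    if len(scores) >= cfg.max_iterations:
        return "max_iterations_reached"
    return None


def run_ralph(
    prompt: str,
    generate: Callable[[str, Optional[str]], str],
    evaluate: Callable[[str], Tuple[float, str]],
    cfg: RalphConfig,
) -> Tuple[List[float], str]:
    """Generate, evaluate, and re-prompt with feedback until a stop condition fires."""
    scores: List[float] = []
    feedback: Optional[str] = None  # no feedback on the first attempt
    while True:
        output = generate(prompt, feedback)
        score, feedback = evaluate(output)
        scores.append(score)
        reason = stop_reason(scores, cfg)
        if reason is not None:
            return scores, reason
```

Keeping `stop_reason` a pure function of the score trajectory makes each condition easy to unit-test without running a target. The loop always terminates as long as `max_iterations` is at least 1.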

Result schema extension

{
  "ralph": {
    "iterations": 3,
    "scores": [0.4, 0.7, 0.9],
    "stop_reason": "quality_threshold_met",
    "improvement": 0.5
  }
}
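
As a rough sketch, the `ralph` block above could be derived from the score trajectory like this. `ralph_result` is a hypothetical helper, and defining `improvement` as the last score minus the first is an assumption inferred from the example (0.9 - 0.4 = 0.5); the final schema may define it differently.

```python
from typing import Dict, List


def ralph_result(scores: List[float], stop_reason: str) -> Dict:
    """Build the `ralph` result block from per-iteration scores."""
    return {
        "ralph": {
            "iterations": len(scores),
            "scores": scores,
            "stop_reason": stop_reason,
            # Assumed definition: last score minus first score,
            # rounded to hide floating-point noise.
            "improvement": round(scores[-1] - scores[0], 4),
        }
    }
```

The iteration that passed (if any) is the last entry whenever `stop_reason` is `quality_threshold_met` or `perfect_score`, so it does not need a separate field in this sketch.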

Acceptance Signals

  • agentv eval evals/ --ralph re-prompts failing tests with feedback
  • Feedback includes assertion failures grouped by severity
  • Stops when threshold met, max iterations reached, or no improvement
  • Results include per-iteration scores and stop reason
  • Works with all target types (CLI agents, API providers)
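
For the feedback-related signals, a default feedback builder might look roughly like the sketch below. Everything here is illustrative: the `Failure` shape, the severity names (`critical`, `major`, `minor`), and the Markdown layout are assumptions, since the issue does not define them. The one property taken from the issue is that feedback is appended and the original prompt is preserved.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical severity levels, most severe first.
SEVERITY_ORDER = ["critical", "major", "minor"]


@dataclass
class Failure:
    assertion: str        # what the failed assertion checked
    severity: str
    suggestion: str = ""  # optional hint for the next attempt


def build_feedback(failures: List[Failure]) -> str:
    """Format assertion failures as Markdown, grouped by severity."""
    # Unknown severities are kept and listed after the known ones.
    extra = sorted({f.severity for f in failures} - set(SEVERITY_ORDER))
    lines = ["## Feedback from previous attempt"]
    for severity in SEVERITY_ORDER + extra:
        group = [f for f in failures if f.severity == severity]
        if not group:
            continue
        lines.append(f"### {severity.capitalize()}")
        for f in group:
            line = f"- {f.assertion}"
            if f.suggestion:
                line += f" (suggestion: {f.suggestion})"
            lines.append(line)
    return "\n".join(lines)


def augment_prompt(original: str, feedback: str) -> str:
    """Append feedback to the prompt; the original text is left untouched."""
    return f"{original}\n\n{feedback}"
```

A custom `feedback_template` would replace `build_feedback` while `augment_prompt` stays the same, which keeps the "original prompt preserved" guarantee independent of the template.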

Non-Goals

  • Not multi-agent orchestration (single agent, iterative refinement)
  • Not automatic prompt rewriting (feedback is appended, original prompt preserved)
  • Not a replacement for trials (trials = same prompt N times; Ralph = feedback-augmented retries)

Context

Core pattern from the microsoft/skills eval harness. Named after the "Sensei" technique by Shayne Boyer. The skills harness uses this across 1114+ scenarios and reports significant quality improvements (often 40-60 point score jumps in 2-3 iterations).
