feat(eval): Ralph Loop — iterative improvement with feedback injection

## Objective

Add an iterative generate→evaluate→feedback→regenerate loop (Ralph Loop) to AgentV. When enabled, a failing test case is re-prompted with structured feedback about what went wrong. The loop repeats for up to N iterations or until the quality threshold is met, whichever comes first.

## Design Latitude

YAML configuration (suggested shape, flexible):

```yaml
execution:
  ralph:
    max_iterations: 3
    threshold: 0.8
    improvement_threshold: 0.05
    feedback_template: ./feedback.md  # optional custom template
```

CLI flag: `agentv eval evals/ --ralph --max-iterations 3`
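
To make the suggested shape concrete, here is a minimal sketch of how the `execution.ralph` block could be validated once the YAML has been parsed into a mapping. `RalphConfig` and `ralph_config_from_mapping` are hypothetical names. Treating the example values above as defaults and restricting `threshold` to the range 0 to 1 are both assumptions, not decisions made in this issue.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class RalphConfig:
    """Hypothetical in-memory form of the `execution.ralph` YAML block."""

    # Assumption: the example values from the YAML above double as defaults.
    max_iterations: int = 3
    threshold: float = 0.8
    improvement_threshold: float = 0.05
    feedback_template: Optional[str] = None  # path to a custom template, if any

    def __post_init__(self) -> None:
        if self.max_iterations < 1:
            raise ValueError("max_iterations must be at least 1")
        # Assumption: scores are normalized to the range [0, 1].
        if not 0.0 <= self.threshold <= 1.0:
            raise ValueError("threshold must be between 0 and 1")
        if self.improvement_threshold < 0:
            raise ValueError("improvement_threshold must not be negative")


def ralph_config_from_mapping(execution: Dict[str, Any]) -> Optional[RalphConfig]:
    """Return a RalphConfig, or None when the `ralph` key is absent (loop disabled)."""
    raw = execution.get("ralph")
    if raw is None:
        return None
    return RalphConfig(**raw)
```

Returning `None` when the key is missing keeps the loop opt-in, which matches the "when enabled" wording in the objective.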

### Key components

1. **Orchestrator loop** — wraps existing evaluate-single-case flow with retry logic
2. **Feedback builder** — converts assertion failures into structured LLM-actionable feedback injected into the next prompt
3. **Stop conditions** (from microsoft/skills harness):
   - `quality_threshold_met` — score >= threshold
   - `perfect_score` — score >= 1.0
   - `max_iterations_reached` — exhausted budget
   - `no_improvement` — improvement < improvement_threshold
   - `score_regression` — score went down
4. **Per-iteration result tracking** — score trajectory, which iteration passed, stop reason
5. **Feedback template system** — default template that formats failures by severity with suggestions; user can override with custom markdown template
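
The orchestrator loop and stop conditions above can be sketched as follows. This is an illustration under stated assumptions, not the intended implementation: `stop_reason`, `ralph_loop`, `generate`, and `evaluate` are hypothetical names, and the order in which the stop conditions are checked is an assumption (the issue does not define a precedence). Only the most recent feedback is appended to the original prompt; whether feedback should accumulate across iterations is left open.

```python
from typing import Callable, List, Optional, Tuple


def stop_reason(
    scores: List[float],
    threshold: float,
    improvement_threshold: float,
    max_iterations: int,
) -> Optional[str]:
    """Return the stop reason for the latest score, or None to keep iterating."""
    current = scores[-1]
    # Assumed precedence: success first, then regression, stall, budget.
    if current >= 1.0:
        return "perfect_score"
    if current >= threshold:
        return "quality_threshold_met"
    if len(scores) >= 2:
        delta = current - scores[-2]
        if delta < 0:
            return "score_regression"
        if delta < improvement_threshold:
            return "no_improvement"
    if len(scores) >= max_iterations:
        return "max_iterations_reached"
    return None


def ralph_loop(
    prompt: str,
    generate: Callable[[str], str],
    evaluate: Callable[[str], Tuple[float, str]],
    threshold: float = 0.8,
    improvement_threshold: float = 0.05,
    max_iterations: int = 3,
) -> Tuple[List[float], str]:
    """Run generate -> evaluate -> feedback -> regenerate until a stop condition fires.

    `generate` stands in for the target call; `evaluate` stands in for the
    existing single-case evaluation and returns (score, feedback_text).
    """
    scores: List[float] = []
    feedback = ""
    while True:
        # The original prompt is preserved; feedback is appended, not merged in.
        full_prompt = prompt if not feedback else f"{prompt}\n\n{feedback}"
        output = generate(full_prompt)
        score, feedback = evaluate(output)
        scores.append(score)
        reason = stop_reason(scores, threshold, improvement_threshold, max_iterations)
        if reason is not None:
            return scores, reason
```

Checking `score_regression` before `no_improvement` matters because a negative delta would otherwise also satisfy `improvement < improvement_threshold` and the more specific reason would never be reported.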

### Result schema extension

```jsonc
{
  "ralph": {
    "iterations": 3,
    "scores": [0.4, 0.7, 0.9],
    "stop_reason": "quality_threshold_met",
    "improvement": 0.5
  }
}
```
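
As a rough sketch, the `ralph` block could be derived from the per-iteration score list like this. `build_ralph_result` is a hypothetical helper, and reading `improvement` as "last score minus first score" is an assumption inferred from the example above (0.9 - 0.4 = 0.5); the issue does not define the field.

```python
from typing import Dict, List


def build_ralph_result(scores: List[float], stop_reason: str) -> Dict[str, object]:
    """Assemble the `ralph` result block from the score trajectory."""
    if not scores:
        raise ValueError("at least one iteration is required")
    return {
        "ralph": {
            "iterations": len(scores),
            "scores": list(scores),
            "stop_reason": stop_reason,
            # Assumption: improvement is the final score minus the first score.
            # Rounded to keep floating-point noise out of the report.
            "improvement": round(scores[-1] - scores[0], 4),
        }
    }
```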

## Acceptance Signals

- `agentv eval evals/ --ralph` re-prompts failing tests with feedback
- Feedback includes assertion failures grouped by severity
- Stops when the threshold is met, max iterations are reached, the score stops improving, or the score regresses
- Results include per-iteration scores and stop reason
- Works with all target types (CLI agents, API providers)
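
For the "grouped by severity" signal, a default feedback builder might look roughly like the sketch below. Everything here is illustrative: `AssertionFailure`, `build_feedback`, the severity names (`critical`, `major`, `minor`), and the markdown layout are assumptions, since the issue does not specify the failure shape or the default template.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class AssertionFailure:
    """Hypothetical shape of a single failed assertion."""

    name: str
    severity: str
    message: str
    suggestion: str = ""


# Assumed severity levels, most important first.
SEVERITY_ORDER = ["critical", "major", "minor"]


def build_feedback(failures: List[AssertionFailure]) -> str:
    """Format failures as markdown, grouped by severity, for the next prompt."""
    if not failures:
        return ""
    grouped: Dict[str, List[AssertionFailure]] = defaultdict(list)
    for failure in failures:
        grouped[failure.severity].append(failure)

    # Known severities keep their priority order; unknown ones follow, sorted.
    known = [s for s in SEVERITY_ORDER if s in grouped]
    unknown = sorted(s for s in grouped if s not in SEVERITY_ORDER)

    lines = ["## Feedback from the previous attempt", ""]
    for severity in known + unknown:
        lines.append(f"### {severity.capitalize()}")
        for failure in grouped[severity]:
            line = f"- {failure.name}: {failure.message}"
            if failure.suggestion:
                line += f" (suggestion: {failure.suggestion})"
            lines.append(line)
        lines.append("")
    return "\n".join(lines).rstrip()
```

Returning an empty string when nothing failed lets the caller skip feedback injection without a special case.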

## Non-Goals

- Not multi-agent orchestration (single agent, iterative refinement)
- Not automatic prompt rewriting (feedback is appended, original prompt preserved)
- Not a replacement for trials (trials = same prompt N times; Ralph = feedback-augmented retries)

## Context

Core pattern from the [microsoft/skills eval harness](https://github.com/agentevals/agentevals-research/blob/main/research/findings/microsoft-skills-harness/README.md#16-the-ralph-loop-pattern). Named after the "Sensei" technique by Shayne Boyer. The skills harness uses this across 1114+ scenarios and reports significant quality improvements (often 40-60 point score jumps in 2-3 iterations).

## Related

- #334: composable quality gates (complementary; Ralph Loop uses gates to decide when to stop)
