[EVAL.yaml] Feature: Environment preflight checks for E2E evals #1208

@tsoyangbot

Feature Request: Environment preflight checks for E2E evals (fail fast)

Current Behavior

Long-running E2E evals fail midway when a dependency (ffmpeg, pandoc, wkhtmltopdf, a Python module, etc.) is missing, because nothing verifies the environment before the run starts.

Real example from property-inspection-video-analysis E2E-1:

  • Downloads 881 MB of videos (takes 25+ minutes)
  • Extracts 74 frames with ffmpeg
  • Fails because PIL (provided by the Pillow package) is not installed
  • 30+ minutes were wasted before the missing dependency was discovered

Desired Behavior

Add an optional env: or preflight: section to EVAL.yaml:

env:
  required_commands:
    - ffmpeg
    - pandoc
    - wkhtmltopdf
  required_python_modules:
    - PIL
    - openai
  preflight_check: true   # Run before test cases start
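
To make the proposed schema concrete, here is a rough Python sketch of how the env: section could be validated once the YAML has been loaded into a plain dict. EnvSpec and parse_env are hypothetical names for illustration, not part of AgentV, and the sketch assumes preflight_check defaults to false when omitted.

```python
from __future__ import annotations

from dataclasses import dataclass

# Keys taken from the proposed env: section above.
_ALLOWED_KEYS = {"required_commands", "required_python_modules", "preflight_check"}


@dataclass(frozen=True)
class EnvSpec:
    """Hypothetical in-memory form of the proposed env: section."""
    required_commands: tuple[str, ...] = ()
    required_python_modules: tuple[str, ...] = ()
    preflight_check: bool = False  # assumption: the check is opt-in


def parse_env(raw: dict | None) -> EnvSpec:
    """Validate an already-loaded env: mapping and return an EnvSpec."""
    raw = raw or {}
    unknown = set(raw) - _ALLOWED_KEYS
    if unknown:
        raise ValueError(f"unknown env keys: {sorted(unknown)}")

    def names(key: str) -> tuple[str, ...]:
        value = raw.get(key) or []
        if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
            raise ValueError(f"env.{key} must be a list of strings")
        return tuple(value)

    flag = raw.get("preflight_check", False)
    if not isinstance(flag, bool):
        raise ValueError("env.preflight_check must be true or false")
    return EnvSpec(names("required_commands"), names("required_python_modules"), flag)
```

Rejecting unknown keys is a design choice in this sketch: a typo such as required_command would otherwise silently disable a check, which defeats the purpose of failing fast.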

Expected Behavior

When preflight_check is true, AgentV should:

  1. Check that every entry in required_commands exists on PATH
  2. Check that every entry in required_python_modules can be imported
  3. Fail immediately with a clear message if anything is missing
  4. Proceed to the test cases only if all checks pass
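
The four steps above could be sketched in Python roughly as follows. This is an illustration under stated assumptions, not AgentV's implementation: find_missing and preflight are hypothetical names, and the sketch uses importlib.util.find_spec to locate modules without importing them, which is cheaper than a real import but will not catch a module that is installed yet broken at import time.

```python
import importlib.util
import shutil
import sys


def find_missing(commands, modules):
    """Return (missing_commands, missing_modules) for the given names."""
    # Step 1: a command counts as present if shutil.which finds it on PATH.
    missing_cmds = [c for c in commands if shutil.which(c) is None]

    # Step 2: a module counts as present if the import system can locate it.
    missing_mods = []
    for m in modules:
        try:
            found = importlib.util.find_spec(m) is not None
        except (ImportError, ValueError):
            found = False  # e.g. the parent package of a dotted name is missing
        if not found:
            missing_mods.append(m)
    return missing_cmds, missing_mods


def preflight(commands, modules):
    """Steps 3 and 4: exit with a clear message, or return so the run can proceed."""
    missing_cmds, missing_mods = find_missing(commands, modules)
    if missing_cmds or missing_mods:
        lines = ["Preflight check failed:"]
        lines += [f"  missing command: {c}" for c in missing_cmds]
        lines += [f"  missing Python module: {m}" for m in missing_mods]
        # sys.exit with a string prints it to stderr and exits with status 1.
        sys.exit("\n".join(lines))
```

Calling preflight(["ffmpeg", "pandoc", "wkhtmltopdf"], ["PIL", "openai"]) before the first test case would report every missing item at once instead of stopping at the first. One caveat: the module check applies to the interpreter running the check, which may differ from the python3 the eval itself invokes.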

Current Workaround

Run a manual setup script before the eval:

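#!/bin/bash
# Exit on any unhandled error; each check below also exits explicitly.
set -e
# Print failures to stderr so they are not mixed into piped stdout.
command -v ffmpeg >/dev/null 2>&1 || { echo "ffmpeg required" >&2; exit 1; }
python3 -c "import PIL" 2>/dev/null || { echo "PIL required" >&2; exit 1; }
python3 -c "import openai" 2>/dev/null || { echo "openai required" >&2; exit 1; }
# ... then run agentv pipeline ...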

Reproduction

Repo: https://github.com/tsoyang-org/property-inspection-bench

Test cases:

  • E2E-1 (video→report): ~30 min runtime, requires ffmpeg, PIL, openai
  • E2E-2 (markdown→PDF): ~5 min runtime, requires pandoc, wkhtmltopdf/weasyprint

Both would benefit from fail-fast validation before starting long downloads.
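
E2E-2 lists wkhtmltopdf/weasyprint, which I read as "either one is enough" (that reading is an assumption). A flat required_commands list cannot express this, so here is a small Python sketch of one possible approach: treat each requirement as a group of alternatives. The unmet_groups helper and the E2E_2_COMMANDS constant are hypothetical and not part of the proposal above.

```python
import shutil


def unmet_groups(groups, which=shutil.which):
    """Return the groups in which none of the alternative commands is on PATH.

    `which` is injectable so the lookup can be replaced in tests.
    """
    return [g for g in groups if not any(which(cmd) for cmd in g)]


# Assumed reading of E2E-2: pandoc, plus either wkhtmltopdf or weasyprint.
E2E_2_COMMANDS = [("pandoc",), ("wkhtmltopdf", "weasyprint")]
```

An empty result from unmet_groups(E2E_2_COMMANDS) would mean the case can start; otherwise each returned tuple names a set of alternatives of which at least one must be installed.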

Related

Issue #1207 - numeric shell operators (same project)
