Skip to content

feat: Harbor framework integration for Terminal-Bench evaluations#202

Merged
jeremyeder merged 22 commits intoambient-code:mainfrom
jeremyeder:002-harbor-real-integration
Dec 10, 2025
Merged

feat: Harbor framework integration for Terminal-Bench evaluations#202
jeremyeder merged 22 commits intoambient-code:mainfrom
jeremyeder:002-harbor-real-integration

Conversation

@jeremyeder
Copy link
Copy Markdown
Contributor

Summary

Implements real Harbor framework integration for Terminal-Bench evaluations, completing the technical infrastructure for Phase 2 of the eval harness.

Closes #190

Follow-on: Assessor refinement research study → #201


What's New ✨

Core Features

1. Real Harbor Framework Integration

  • Subprocess-based Harbor execution with full environment control
  • Harbor 2.0 result.json parsing
  • Trajectory file path capture for debugging
  • API key validation and configuration
  • Override for Harbor's hardcoded MiniMax API bug

2. agentready benchmark CLI Command

# Quick smoketest (1-2 tasks)
agentready benchmark --subset smoketest

# Full Terminal-Bench (89 tasks)
agentready benchmark --subset full --model claude-sonnet-4-5

3. Automatic Preflight Checks

  • Harbor CLI detection and installation prompts
  • Terminal-Bench dataset management
  • Interactive installation (uv/pip fallback)
  • --skip-preflight option for advanced users

4. Security Improvements

  • Path traversal validation (FR-005)
  • Input sanitization for subprocess calls
  • API key truncation in logs
  • Timeout enforcement (default: 1 hour)

Key Implementation Files

Services (Core Logic):

  • src/agentready/services/eval_harness/tbench_runner.py - Harbor subprocess integration
  • src/agentready/services/eval_harness/harbor_config.py - Configuration model
  • src/agentready/utils/preflight.py - Dependency checking (100% test coverage)

CLI:

  • src/agentready/cli/benchmark.py - Benchmark command implementation

Tests:

  • tests/unit/test_harbor_*.py - Harbor integration tests
  • tests/unit/utils/test_preflight.py - Preflight checks (100% coverage)

Technical Details

Harbor Integration Architecture

agentready benchmark
    ↓
CLI (benchmark.py)
    ↓
Preflight Checks (harbor CLI + dataset)
    ↓
HarborConfig (API key, model, timeout)
    ↓
_real_tbench_result() → subprocess.run(['harbor', 'run', ...])
    ↓
Parse result.json (Harbor 2.0 format)
    ↓
Return TbenchResult (score, trajectory_path, etc.)

Environment Handling

Fixes critical Harbor bug where MiniMax API is hardcoded:

clean_env["ANTHROPIC_API_KEY"] = config.api_key
clean_env["ANTHROPIC_AUTH_TOKEN"] = config.api_key  # Harbor uses this
clean_env["ANTHROPIC_BASE_URL"] = "https://api.anthropic.com"  # Override MiniMax
clean_env["ANTHROPIC_API_BASE"] = "https://api.anthropic.com"  # Alternative var
clean_env.pop("MINIMAX_API_KEY", None)  # Clear MiniMax settings

Results Parsing

Harbor 2.0 structure:

{
  "stats": {
    "evals": {
      "terminal-bench@2.0": {
        "metrics": [{"mean": 0.75}],
        "reward_stats": {"reward": {...}}
      }
    }
  },
  "n_total_trials": 89
}

Testing

Coverage:

  • ✅ TbenchResult model validation (score ranges, trial counts)
  • ✅ Harbor config validation (API key, timeout, paths)
  • ✅ Preflight checks (100% coverage)
  • ✅ Result parsing (Harbor 2.0 JSON structure)

Manual Testing:

  • ✅ Smoketest runs (1-2 tasks, ~2-5 min)
  • ✅ Full Terminal-Bench runs (89 tasks)
  • ✅ Both Haiku and Sonnet models
  • ✅ Preflight check workflows (missing Harbor CLI → install prompt)

Usage Examples

Basic Smoketest

export ANTHROPIC_API_KEY=your-key-here
agentready benchmark --subset smoketest

Output:

==================================================
Harbor Command (Copy/Paste Ready)
==================================================

ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY ... harbor run --path ... --agent claude-code --model anthropic/claude-haiku-4-5 --jobs-dir ...

==================================================
Terminal-Bench Benchmark Complete
==================================================

Score: 0.75
Task Solved: True
Resolved Trials: 3
Unresolved Trials: 1
Pass@1: 0.75

Trajectory: /path/to/jobs/2025-12-09__17-12-09/.../trajectory.json

Full Benchmark with Sonnet

agentready benchmark --subset full --model claude-sonnet-4-5 --verbose

Advanced: Skip Preflight

# For CI/CD where Harbor is pre-installed
agentready benchmark --subset smoketest --skip-preflight

Breaking Changes

None - This is a new feature with no impact on existing functionality.


Migration Guide

Not Required - New feature, no migration needed.

To use:

  1. Ensure Harbor CLI installed: uv tool install harbor (or let preflight handle it)
  2. Set ANTHROPIC_API_KEY environment variable
  3. Run agentready benchmark --subset smoketest

Documentation Updates

Updated:

Follow-on (Issue #201):

  • docs/tbench/assessor-refinement-results.md - Research study deliverable

Performance Impact

Minimal - Preflight checks add ~1-2 seconds to first run (cached thereafter).


Security Considerations

Implemented:

  • ✅ Path traversal validation for Harbor results
  • ✅ Input sanitization for subprocess calls
  • ✅ API key truncation in display output
  • ✅ Timeout enforcement (prevents runaway processes)
  • ✅ Environment isolation (clean_env prevents pollution)

Future Work

Immediate (Issue #201):

  • Run assessor refinement research study (10-20 diverse repos)
  • Statistical analysis of assessor impact
  • Publish docs/tbench/assessor-refinement-results.md

Later:

  • Dashboard integration for benchmark results
  • Historical performance tracking
  • Leaderboard integration (optional)

Related Issues


Commit Summary

29 commits including:

  • Harbor subprocess integration
  • Preflight dependency checking
  • Security fixes (path validation, timeout enforcement)
  • MiniMax API override fix
  • Result parsing for Harbor 2.0
  • Trajectory file capture
  • Comprehensive testing
  • Documentation updates

Test Plan

Automated Tests:

pytest tests/unit/test_harbor_*.py -v
pytest tests/unit/utils/test_preflight.py -v

Manual Testing:

  1. ✅ Run smoketest: agentready benchmark --subset smoketest
  2. ✅ Verify preflight prompts when Harbor not installed
  3. ✅ Verify API key validation error when ANTHROPIC_API_KEY unset
  4. ✅ Verify results parsing and trajectory file capture
  5. ✅ Verify both Haiku and Sonnet models work

Checklist


🚀 Ready to merge!

Loading
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Terminal-Bench Eval Harness - Phase 2: Real Harbor Framework Integration

3 participants