feat: Harbor framework integration for Terminal-Bench evaluations by jeremyeder · Pull Request #202 · ambient-code/agentready

jeremyeder · 2025-12-10T21:32:56Z

Summary

Implements real Harbor framework integration for Terminal-Bench evaluations, completing the technical infrastructure for Phase 2 of the eval harness.

Closes #190

Follow-on: Assessor refinement research study → #201

What's New ✨

Core Features

1. Real Harbor Framework Integration

Subprocess-based Harbor execution with full environment control
Harbor 2.0 result.json parsing
Trajectory file path capture for debugging
API key validation and configuration
Override for Harbor's hardcoded MiniMax API bug

2. agentready benchmark CLI Command

# Quick smoketest (1-2 tasks)
agentready benchmark --subset smoketest

# Full Terminal-Bench (89 tasks)
agentready benchmark --subset full --model claude-sonnet-4-5

3. Automatic Preflight Checks

Harbor CLI detection and installation prompts
Terminal-Bench dataset management
Interactive installation (uv/pip fallback)
--skip-preflight option for advanced users

4. Security Improvements

Path traversal validation (FR-005)
Input sanitization for subprocess calls
API key truncation in logs
Timeout enforcement (default: 1 hour)

Key Implementation Files

Services (Core Logic):

src/agentready/services/eval_harness/tbench_runner.py - Harbor subprocess integration
src/agentready/services/eval_harness/harbor_config.py - Configuration model
src/agentready/utils/preflight.py - Dependency checking (100% test coverage)

CLI:

src/agentready/cli/benchmark.py - Benchmark command implementation

Tests:

tests/unit/test_harbor_*.py - Harbor integration tests
tests/unit/utils/test_preflight.py - Preflight checks (100% coverage)

Technical Details

Harbor Integration Architecture

agentready benchmark
    ↓
CLI (benchmark.py)
    ↓
Preflight Checks (harbor CLI + dataset)
    ↓
HarborConfig (API key, model, timeout)
    ↓
_real_tbench_result() → subprocess.run(['harbor', 'run', ...])
    ↓
Parse result.json (Harbor 2.0 format)
    ↓
Return TbenchResult (score, trajectory_path, etc.)

Environment Handling

Fixes critical Harbor bug where MiniMax API is hardcoded:

clean_env["ANTHROPIC_API_KEY"] = config.api_key
clean_env["ANTHROPIC_AUTH_TOKEN"] = config.api_key  # Harbor uses this
clean_env["ANTHROPIC_BASE_URL"] = "https://api.anthropic.com"  # Override MiniMax
clean_env["ANTHROPIC_API_BASE"] = "https://api.anthropic.com"  # Alternative var
clean_env.pop("MINIMAX_API_KEY", None)  # Clear MiniMax settings

Results Parsing

Harbor 2.0 structure:

{
  "stats": {
    "evals": {
      "terminal-bench@2.0": {
        "metrics": [{"mean": 0.75}],
        "reward_stats": {"reward": {...}}
      }
    }
  },
  "n_total_trials": 89
}

Testing

Coverage:

✅ TbenchResult model validation (score ranges, trial counts)
✅ Harbor config validation (API key, timeout, paths)
✅ Preflight checks (100% coverage)
✅ Result parsing (Harbor 2.0 JSON structure)

Manual Testing:

✅ Smoketest runs (1-2 tasks, ~2-5 min)
✅ Full Terminal-Bench runs (89 tasks)
✅ Both Haiku and Sonnet models
✅ Preflight check workflows (missing Harbor CLI → install prompt)

Usage Examples

Basic Smoketest

export ANTHROPIC_API_KEY=your-key-here
agentready benchmark --subset smoketest

Output:

==================================================
Harbor Command (Copy/Paste Ready)
==================================================

ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY ... harbor run --path ... --agent claude-code --model anthropic/claude-haiku-4-5 --jobs-dir ...

==================================================
Terminal-Bench Benchmark Complete
==================================================

Score: 0.75
Task Solved: True
Resolved Trials: 3
Unresolved Trials: 1
Pass@1: 0.75

Trajectory: /path/to/jobs/2025-12-09__17-12-09/.../trajectory.json

Full Benchmark with Sonnet

agentready benchmark --subset full --model claude-sonnet-4-5 --verbose

Advanced: Skip Preflight

# For CI/CD where Harbor is pre-installed
agentready benchmark --subset smoketest --skip-preflight

Breaking Changes

None - This is a new feature with no impact on existing functionality.

Migration Guide

Not Required - New feature, no migration needed.

To use:

Ensure Harbor CLI installed: uv tool install harbor (or let preflight handle it)
Set ANTHROPIC_API_KEY environment variable
Run agentready benchmark --subset smoketest

Documentation Updates

Updated:

✅ CLAUDE.md - Added Harbor integration section, preflight checks documentation
✅ src/agentready/cli/benchmark.py - Comprehensive docstring with examples
✅ Issue Terminal-Bench Eval Harness - Phase 2: Real Harbor Framework Integration #190 - Updated to reflect technical completion

Follow-on (Issue #201):

docs/tbench/assessor-refinement-results.md - Research study deliverable

Performance Impact

Minimal - Preflight checks add ~1-2 seconds to first run (cached thereafter).

Security Considerations

Implemented:

✅ Path traversal validation for Harbor results
✅ Input sanitization for subprocess calls
✅ API key truncation in display output
✅ Timeout enforcement (prevents runaway processes)
✅ Environment isolation (clean_env prevents pollution)

Future Work

Immediate (Issue #201):

Run assessor refinement research study (10-20 diverse repos)
Statistical analysis of assessor impact
Publish docs/tbench/assessor-refinement-results.md

Later:

Dashboard integration for benchmark results
Historical performance tracking
Leaderboard integration (optional)

Related Issues

Closes Terminal-Bench Eval Harness - Phase 2: Real Harbor Framework Integration #190 - Terminal-Bench Eval Harness Phase 2: Real Harbor Integration
Related Terminal-Bench Assessor Refinement Research Study (Phase 2b) #201 - Assessor Refinement Research Study (follow-on)
Related feat: Terminal-Bench eval harness (MVP Phase 1) #178 - Terminal-Bench Eval Harness MVP (Phase 1, completed)

Commit Summary

29 commits including:

Harbor subprocess integration
Preflight dependency checking
Security fixes (path validation, timeout enforcement)
MiniMax API override fix
Result parsing for Harbor 2.0
Trajectory file capture
Comprehensive testing
Documentation updates

Test Plan

Automated Tests:

pytest tests/unit/test_harbor_*.py -v
pytest tests/unit/utils/test_preflight.py -v

Manual Testing:

✅ Run smoketest: agentready benchmark --subset smoketest
✅ Verify preflight prompts when Harbor not installed
✅ Verify API key validation error when ANTHROPIC_API_KEY unset
✅ Verify results parsing and trajectory file capture
✅ Verify both Haiku and Sonnet models work

Checklist

✅ All tests passing locally
✅ Linters passing (black, isort, ruff)
✅ Documentation updated (CLAUDE.md)
✅ Security review complete (path validation, input sanitization)
✅ Manual testing complete (smoketest + full runs)
✅ Issue Terminal-Bench Eval Harness - Phase 2: Real Harbor Framework Integration #190 updated with completion status
✅ Follow-on issue Terminal-Bench Assessor Refinement Research Study (Phase 2b) #201 created for research study
✅ No breaking changes
✅ Backward compatible

🚀 Ready to merge!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Harbor framework integration for Terminal-Bench evaluations#202

feat: Harbor framework integration for Terminal-Bench evaluations#202
jeremyeder merged 22 commits intoambient-code:mainfrom
jeremyeder:002-harbor-real-integration

jeremyeder commented Dec 10, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

jeremyeder commented Dec 10, 2025

Summary

What's New ✨

Core Features

Key Implementation Files

Technical Details

Harbor Integration Architecture

Environment Handling

Results Parsing

Testing

Usage Examples

Basic Smoketest

Full Benchmark with Sonnet

Advanced: Skip Preflight

Breaking Changes

Migration Guide

Documentation Updates

Performance Impact

Security Considerations

Future Work

Related Issues

Commit Summary

Test Plan

Checklist

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants