feat: Harbor framework integration for Terminal-Bench evaluations #202
Merged
jeremyeder merged 22 commits into ambient-code:main on Dec 10, 2025
Summary
Implements real Harbor framework integration for Terminal-Bench evaluations, completing the technical infrastructure for Phase 2 of the eval harness.
Closes #190
Follow-on: Assessor refinement research study → #201
What's New ✨
Core Features
1. Real Harbor Framework Integration
2. agentready benchmark CLI Command
3. Automatic Preflight Checks
   --skip-preflight option for advanced users
4. Security Improvements
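The preflight idea from the feature list can be sketched in a few lines of Python. This is a minimal illustration, not the code in src/agentready/utils/preflight.py: the names `PreflightResult`, `check_tool`, and `run_preflight` are hypothetical, and it assumes the dependency is detected by looking for an executable on PATH. The `skip` parameter mirrors the intent of the --skip-preflight option.

```python
import shutil
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PreflightResult:
    """Outcome of looking up one required executable (hypothetical model)."""
    tool: str
    found: bool
    path: Optional[str]


def check_tool(tool: str) -> PreflightResult:
    """Look the tool up on PATH; shutil.which returns None when it is absent."""
    path = shutil.which(tool)
    return PreflightResult(tool=tool, found=path is not None, path=path)


def run_preflight(tools: List[str], skip: bool = False) -> List[PreflightResult]:
    """Check every required tool, or do nothing when the caller opts out."""
    if skip:
        # Assumed behaviour of a skip flag: trust the environment as-is.
        return []
    results = [check_tool(t) for t in tools]
    missing = [r.tool for r in results if not r.found]
    if missing:
        raise RuntimeError("missing required tools: " + ", ".join(missing))
    return results
```

In a CI job where the tools are already installed, calling `run_preflight(["harbor"], skip=True)` returns immediately; the executable name `harbor` is itself an assumption here.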
Key Implementation Files
Services (Core Logic):
- src/agentready/services/eval_harness/tbench_runner.py - Harbor subprocess integration
- src/agentready/services/eval_harness/harbor_config.py - Configuration model
- src/agentready/utils/preflight.py - Dependency checking (100% test coverage)
CLI:
- src/agentready/cli/benchmark.py - Benchmark command implementation
Tests:
- tests/unit/test_harbor_*.py - Harbor integration tests
- tests/unit/utils/test_preflight.py - Preflight checks (100% coverage)
Technical Details
Harbor Integration Architecture
Environment Handling
Fixes a critical Harbor bug where the MiniMax API is hardcoded.
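How the PR works around this is not spelled out in this description. As general background only, a wrapper that launches another tool can control which credentials the child process sees by passing an explicit environment mapping. The sketch below shows that generic pattern; `build_child_env` and the allowlist are hypothetical and are not taken from tbench_runner.py.

```python
from typing import Dict, Mapping, Optional, Set


def build_child_env(
    parent: Mapping[str, str],
    allowed: Set[str],
    overrides: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Copy only allowlisted variables, then apply explicit overrides."""
    env = {name: value for name, value in parent.items() if name in allowed}
    if overrides:
        env.update(overrides)
    return env


# Illustrative parent environment; a real caller would pass os.environ.
parent_env = {
    "PATH": "/usr/bin",
    "ANTHROPIC_API_KEY": "your-key-here",
    "UNRELATED_SECRET": "do-not-forward",
}
child_env = build_child_env(parent_env, allowed={"PATH", "ANTHROPIC_API_KEY"})
```

The resulting dictionary can be handed to `subprocess.run(..., env=child_env)`, so the child never sees variables outside the allowlist.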
Results Parsing
Harbor 2.0 structure:
{
  "stats": {
    "evals": {
      "terminal-bench@2.0": {
        "metrics": [{"mean": 0.75}],
        "reward_stats": {"reward": {...}}
      }
    }
  },
  "n_total_trials": 89
}
Testing
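As a concrete check of the results format described under Results Parsing, the following sketch pulls the mean score and the trial count out of a Harbor 2.0 style payload. It is an illustration under stated assumptions: `BenchmarkSummary` and `parse_harbor_results` are hypothetical names, the sample omits the elided `reward_stats` contents, and the real parser in the PR may handle more fields.

```python
import json
from dataclasses import dataclass


@dataclass
class BenchmarkSummary:
    """The two values read from the result file in this sketch."""
    mean_score: float
    n_total_trials: int


def parse_harbor_results(payload: dict, dataset: str = "terminal-bench@2.0") -> BenchmarkSummary:
    """Read stats.evals[dataset].metrics[0].mean and n_total_trials."""
    try:
        eval_stats = payload["stats"]["evals"][dataset]
        mean = float(eval_stats["metrics"][0]["mean"])
        total = int(payload["n_total_trials"])
    except (KeyError, IndexError, TypeError, ValueError) as exc:
        # Surface a single, predictable error type for malformed files.
        raise ValueError(f"unexpected Harbor results structure: {exc!r}") from exc
    return BenchmarkSummary(mean_score=mean, n_total_trials=total)


sample = json.loads(
    '{"stats": {"evals": {"terminal-bench@2.0": {"metrics": [{"mean": 0.75}]}}},'
    ' "n_total_trials": 89}'
)
summary = parse_harbor_results(sample)
```

Wrapping the lookups in one `try` block means a missing key, an empty `metrics` list, or a non-numeric value all surface as the same `ValueError`, which keeps a unit test for malformed input simple.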
Coverage:
Manual Testing:
Usage Examples
Basic Smoketest
export ANTHROPIC_API_KEY=your-key-here
agentready benchmark --subset smoketest
Output:
Full Benchmark with Sonnet
Advanced: Skip Preflight
# For CI/CD where Harbor is pre-installed
agentready benchmark --subset smoketest --skip-preflight
Breaking Changes
None - This is a new feature with no impact on existing functionality.
Migration Guide
Not Required - New feature, no migration needed.
To use:
1. uv tool install harbor (or let preflight handle it)
2. agentready benchmark --subset smoketest
Documentation Updates
Updated:
- CLAUDE.md - Added Harbor integration section, preflight checks documentation
- src/agentready/cli/benchmark.py - Comprehensive docstring with examples
Follow-on (Issue #201):
- docs/tbench/assessor-refinement-results.md - Research study deliverable
Performance Impact
Minimal - Preflight checks add ~1-2 seconds to first run (cached thereafter).
Security Considerations
Implemented:
Future Work
Immediate (Issue #201):
- docs/tbench/assessor-refinement-results.md
Later:
Related Issues
Commit Summary
29 commits including:
Test Plan
Automated Tests:
pytest tests/unit/test_harbor_*.py -v
pytest tests/unit/utils/test_preflight.py -v
Manual Testing:
agentready benchmark --subset smoketest
Checklist
🚀 Ready to merge!