feat: Terminal-Bench eval harness (MVP Phase 1) #178
Merged
jeremyeder merged 12 commits into main on Dec 9, 2025
Summary
Implement a Terminal-Bench evaluation harness to empirically measure the impact of AgentReady assessors on agentic development performance.
Overview
This PR implements Phase 1 (MVP) of the Terminal-Bench eval harness, a systematic A/B testing framework that measures how much each AgentReady assessor improves benchmark scores.
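A minimal sketch of the A/B comparison described above, assuming the benchmark is run several times without changes (baseline) and several times with one assessor's remediation applied at a time. The function name `assessor_deltas`, the assessor ids, and the numbers are illustrative assumptions, not taken from the harness code or its results.

```python
from statistics import mean


def assessor_deltas(baseline_scores, scores_by_assessor):
    """Return the mean score change per assessor relative to the baseline.

    baseline_scores: benchmark scores from runs with no remediation applied.
    scores_by_assessor: dict mapping a (hypothetical) assessor id to the
        scores measured with only that assessor's remediation applied.
    """
    baseline_mean = mean(baseline_scores)
    return {
        assessor_id: mean(scores) - baseline_mean
        for assessor_id, scores in scores_by_assessor.items()
    }


# Illustrative numbers only; these are not results from this PR.
baseline = [50.0, 52.0, 48.0]
treated = {
    "assessor_a": [56.0, 58.0, 54.0],  # remediation appears to help
    "assessor_b": [50.0, 52.0, 48.0],  # remediation changes nothing
}
deltas = assessor_deltas(baseline, treated)
```

A positive delta suggests the assessor's remediation helped; whether the change is meaningful depends on the significance criteria under Statistical Methods.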
Components Implemented
Phase 1A-1D: Core Services
Phase 1E: GitHub Pages Dashboard
/agentready/tbench
Phase 1F: Documentation & Tests
docs/eval-harness-guide.md - Step-by-step tutorials
docs/tbench/methodology.md - Statistical methods explained
CLI Commands
Statistical Methods
Significance Criteria (both required):
Effect Size Interpretation:
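As a rough sketch of how a two-part significance check and an effect-size label could be computed, assuming the two criteria are a p-value cutoff plus a minimum absolute Cohen's d, and assuming Cohen's conventional 0.2 / 0.5 / 0.8 bands. The cutoffs, the function names, and the choice of an exact permutation test are illustrative assumptions; docs/tbench/methodology.md is the reference for the methods actually used.

```python
import math
from itertools import combinations
from statistics import mean, variance


def cohens_d(control, treatment):
    """Cohen's d using the pooled sample variance of both groups."""
    n1, n2 = len(control), len(treatment)
    pooled_var = (
        (n1 - 1) * variance(control) + (n2 - 1) * variance(treatment)
    ) / (n1 + n2 - 2)
    diff = mean(treatment) - mean(control)
    if pooled_var == 0:
        # No spread at all: the effect is either absent or unbounded.
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / math.sqrt(pooled_var)


def permutation_p_value(control, treatment):
    """Two-sided exact permutation test on the difference of means.

    Enumerates every way to relabel the pooled scores, so it is only
    practical for small samples (10 scores give 252 relabelings).
    """
    pooled = list(control) + list(treatment)
    n_treat = len(treatment)
    n_ctrl = len(pooled) - n_treat
    total = sum(pooled)
    observed = abs(mean(treatment) - mean(control))
    extreme = 0
    count = 0
    for idx in combinations(range(len(pooled)), n_treat):
        treat_sum = sum(pooled[i] for i in idx)
        diff = treat_sum / n_treat - (total - treat_sum) / n_ctrl
        if abs(diff) >= observed - 1e-12:  # tolerance for float rounding
            extreme += 1
        count += 1
    return extreme / count


def interpret_effect(d):
    """Label |d| with Cohen's conventional bands (assumed here)."""
    size = abs(d)
    if size < 0.2:
        return "negligible"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"


def is_significant(control, treatment, alpha=0.05, min_effect=0.2):
    """Require BOTH a small p-value and a non-negligible effect size."""
    p_value = permutation_p_value(control, treatment)
    effect = abs(cohens_d(control, treatment))
    return p_value < alpha and effect >= min_effect


# Illustrative scores only; these are not results from this PR.
control = [50, 51, 49, 50, 52]
treatment = [58, 57, 59, 60, 58]
p = permutation_p_value(control, treatment)
label = interpret_effect(cohens_d(control, treatment))
significant = is_significant(control, treatment)
```

Requiring both conditions guards against two failure modes: a tiny but statistically detectable change, and a large apparent change that could easily arise by chance in a handful of runs.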
Demo Results
Ran eval harness on AgentReady repository itself:
This validates that the system works correctly: it identifies repos that already follow best practices.
File Structure
Phase 2 (Future)
Testing
✅ 6 CLI unit tests passing
✅ 5 integration tests passing
✅ 32 service tests passing
✅ End-to-end workflow tested
✅ Dashboard generated and verified
✅ All demos working (slides, walkthrough, terminal demo)
Files Changed
New Services:
src/agentready/services/eval_harness/*.py (5 services)
src/agentready/models/eval_harness.py (data models)
New CLI:
src/agentready/cli/eval_harness.py (5 commands)
Tests:
tests/unit/test_eval_harness*.py (6 files)
tests/integration/test_eval_harness_e2e.py
Documentation:
docs/eval-harness-guide.md
docs/tbench/methodology.md
docs/tbench.md (dashboard)
Demos:
docs/demos/slides.html (15 slides, reveal.js)
docs/demos/walkthrough.md (complete guide)
scripts/generate_slides.py
scripts/build_demos.py
🤖 Generated with Claude Code
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>