docs: add quality evaluation mode to manual testing plan spec #39
Open
jlevy wants to merge 9 commits into
Design document for supporting "manual" test scripts that facilitate human/agent review rather than strict pass/fail automation. Addresses use cases for LLM responses, web scraping, visual UX, and other variable outputs.

Key features proposed:
- Documentation playbook with patterns and anti-patterns
- `--review` mode for update + diff display
- `validation: binary|manual` frontmatter option
- Review annotations in test files
- CI integration patterns

https://claude.ai/code/session_013zTMZFAZESM7uy9oAAxCKN
Local tbd state from running `tbd prime` for issue tracking context.

https://claude.ai/code/session_013zTMZFAZESM7uy9oAAxCKN
Extends the manual testing workflows spec with:
- Quality Evaluation use case (search engines, recommendations, ML outputs)
- Phase VI: Comparison and Evaluation Modes
  - Side-by-side comparison display (not just diff)
  - Script-based evaluation with thresholds
  - LLM-based evaluation (future)
  - Human judgment with structured criteria
- Updated validation enum: `binary | manual | evaluation`
- New comparison modes: `diff | side-by-side | baseline`
- Outstanding questions for evaluation semantics

This addresses workflows where outputs may legitimately differ but quality should remain comparable, e.g., search results where ordering changes but relevance should be maintained.

https://claude.ai/code/session_01UaMxx1PpoaJFrA6x94cwF5
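As a rough illustration of script-based evaluation with a threshold, the sketch below scores how much of a baseline's top results survive in a new run, so that a pure reordering passes while a loss of relevant results fails. The function names, the top-k overlap metric, and the `0.8` threshold are all hypothetical choices for illustration; the spec does not define a particular metric or API.

```python
def top_k_overlap(baseline: list[str], current: list[str], k: int = 5) -> float:
    """Fraction of the baseline's top-k items that still appear in the current top-k.

    Hypothetical metric: ordering inside the top-k is deliberately ignored.
    """
    base_top = baseline[:k]
    if not base_top:
        return 1.0  # nothing to lose, so treat as fully preserved
    current_top = set(current[:k])
    kept = sum(1 for item in base_top if item in current_top)
    return kept / len(base_top)


def passes_quality_gate(
    baseline: list[str], current: list[str], threshold: float = 0.8, k: int = 5
) -> bool:
    """Return True when the overlap score meets the (assumed) threshold."""
    return top_k_overlap(baseline, current, k) >= threshold


baseline = ["a", "b", "c", "d", "e"]
reordered = ["b", "a", "c", "e", "d"]  # same results, different order
degraded = ["a", "x", "y", "z", "w"]   # most relevant results lost

print(passes_quality_gate(baseline, reordered))
print(passes_quality_gate(baseline, degraded))
```

An exact diff would flag both runs as changed; a score with a threshold separates the harmless reordering from the real regression.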
Coverage Report
No changed files found.
The `.tbd/.gitignore` from tbd v0.1.3 was missing entries for `docs/` and `state.yml` that are present in current versions. A previous session ran `tbd prime`, which created these files, and then committed them.

- Updated `.tbd/.gitignore` to match tbd v0.1.12 template
- Removed `.tbd/docs/` from tracking (regenerated on setup)
- Removed `.tbd/state.yml` from tracking (local state)
- Updated `tbd_version` in `config.yml`

https://claude.ai/code/session_01UaMxx1PpoaJFrA6x94cwF5
- Workflows section now shows pnpm/package.json test scripts
- CI just calls the same scripts developers use locally
- Phase V simplified to be CI-agnostic
- Removed verbose GitHub Actions examples
- Cleaner, more minimal examples throughout

https://claude.ai/code/session_01UaMxx1PpoaJFrA6x94cwF5
- Consolidate redundant use case sections (web scraping, visual, interactive all merged into "manual review testing")
- Fix nested fence syntax (use 4+ backticks for outer fences)
- Fix elision patterns: use `...` not `[.. text ..]`
- Remove verbose REVIEW/EVALUATE comment syntax
- Simplify Phase IV to just recommend markdown for review guidance
- Reduce overall spec size by ~270 lines

https://claude.ai/code/session_01UaMxx1PpoaJFrA6x94cwF5
Simplified the manual testing workflows spec significantly:
- Phase I: Manual testing workflow (`--review`, validation frontmatter, playbook)
- Phase II: Quality evaluation (comparison modes, evaluators)

Reduced from ~950 lines to ~220 lines by:
- Removing redundant examples
- Consolidating related features into single phases
- Keeping only essential implementation details
- Streamlining outstanding questions

https://claude.ai/code/session_01UaMxx1PpoaJFrA6x94cwF5
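To make the validation frontmatter idea concrete, here is a minimal sketch of how a runner might read the `validation` key and decide what a mismatch means. Everything beyond the `validation: binary|manual` option itself is an assumption: the `---` delimiters, the `binary` default, the function names, and the `fail` / `needs-review` outcome labels are hypothetical and not taken from the spec.

```python
def read_validation_mode(test_file_text: str, default: str = "binary") -> str:
    """Extract the `validation` value from a leading `---` frontmatter block.

    Assumes simple `key: value` lines; falls back to `default` when absent.
    """
    lines = test_file_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return default
    for line in lines[1:]:
        if line.strip() == "---":  # end of frontmatter
            break
        key, sep, value = line.partition(":")
        if sep and key.strip() == "validation":
            return value.strip()
    return default


def outcome_on_mismatch(mode: str) -> str:
    """Map a validation mode to a (hypothetical) outcome when output differs."""
    if mode == "binary":
        return "fail"          # strict pass/fail automation
    if mode == "manual":
        return "needs-review"  # a human or agent inspects the change
    raise ValueError(f"unknown validation mode: {mode!r}")


doc = "---\nvalidation: manual\n---\n# Example test\n"
print(outcome_on_mismatch(read_validation_mode(doc)))
```

Under this reading, the same changed output is a hard failure in a `binary` test but only a prompt to review in a `manual` one.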
Elision should only hide irrelevant noise (timestamps, paths), not the content being evaluated. For manual review, you need to see the actual output to assess quality.

https://claude.ai/code/session_01UaMxx1PpoaJFrA6x94cwF5
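A small sketch of that principle: replace only timestamps and temporary paths with `...`, and leave the content under evaluation untouched. The regular expressions, the `/tmp/` path convention, and the `elide_noise` name are illustrative assumptions, not the tool's actual elision mechanism.

```python
import re

# Hypothetical noise patterns: ISO-style timestamps and paths under /tmp/.
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:\.\d+)?Z?")
TMP_PATH = re.compile(r"/tmp/[^\s\"']+")


def elide_noise(output: str) -> str:
    """Replace run-specific noise with `...`, keeping evaluated content intact."""
    output = TIMESTAMP.sub("...", output)
    return TMP_PATH.sub("...", output)


raw = "2024-05-01T12:30:00Z wrote /tmp/run-123/out.json\nTop result: Paris"
print(elide_noise(raw))
```

The line carrying the actual result passes through unchanged, so a reviewer still sees what they need to judge quality.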
- `--review` updates files, lists what changed; use `git diff` to review
- Remove custom comparison modes (diff, side-by-side, baseline)
- Git already provides all this functionality with better tooling
- Phase II simplified to just evaluator scripts for automated quality gates
- Users can configure Git with delta/diff-so-fancy or use IDE/GitHub

https://claude.ai/code/session_01UaMxx1PpoaJFrA6x94cwF5
Summary
- Updated validation enum: `binary | manual | evaluation`
- New comparison modes: `diff | side-by-side | baseline`

This extends the manual testing spec to address workflows where outputs may legitimately differ but quality should remain comparable.
https://claude.ai/code/session_01UaMxx1PpoaJFrA6x94cwF5