docs: add quality evaluation mode to manual testing plan spec#39

Open
jlevy wants to merge 9 commits into main from claude/review-manual-testing-specs-YCe6J

jlevy (Owner) commented Jan 31, 2026

Summary

  • Adds Quality Evaluation use case for search engines, recommendations, ML outputs
  • Adds Phase VI: Comparison and Evaluation Modes with side-by-side display, script/LLM evaluators
  • Updates validation enum: binary | manual | evaluation
  • Adds new comparison modes: diff | side-by-side | baseline
  • Documents evaluation strategies and outstanding questions

This extends the manual testing spec to address workflows where outputs may legitimately differ but quality should remain comparable.

https://claude.ai/code/session_01UaMxx1PpoaJFrA6x94cwF5
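
As a rough illustration of the updated `validation` enum listed in the summary, a minimal TypeScript sketch might look like the following. The names `VALIDATION_MODES`, `ValidationMode`, and `parseValidation` are hypothetical, and falling back to `binary` when the field is absent is an assumption, not something the spec states.

```typescript
// Hypothetical sketch: the three validation modes named in the PR summary.
const VALIDATION_MODES = ["binary", "manual", "evaluation"] as const;
type ValidationMode = (typeof VALIDATION_MODES)[number];

// Validates a raw frontmatter value and narrows it to ValidationMode.
// Assumption: a missing value falls back to strict pass/fail ("binary").
function parseValidation(value: unknown): ValidationMode {
  if (value === undefined) {
    return "binary";
  }
  for (const mode of VALIDATION_MODES) {
    if (mode === value) {
      return mode;
    }
  }
  throw new Error(
    `Invalid validation mode: ${String(value)} (expected ${VALIDATION_MODES.join(" | ")})`
  );
}
```

Rejecting unknown values early means a typo in the frontmatter surfaces as a configuration error rather than a silently skipped check.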

Design document for supporting "manual" test scripts that facilitate
human/agent review rather than strict pass/fail automation. Addresses
use cases for LLM responses, web scraping, visual UX, and other
variable outputs.

Key features proposed:
- Documentation playbook with patterns and anti-patterns
- --review mode for update + diff display
- validation: binary|manual frontmatter option
- Review annotations in test files
- CI integration patterns

https://claude.ai/code/session_013zTMZFAZESM7uy9oAAxCKN

Local tbd state from running `tbd prime` for issue tracking context.

https://claude.ai/code/session_013zTMZFAZESM7uy9oAAxCKN

Extends the manual testing workflows spec with:
- Quality Evaluation use case (search engines, recommendations, ML outputs)
- Phase VI: Comparison and Evaluation Modes
  - Side-by-side comparison display (not just diff)
  - Script-based evaluation with thresholds
  - LLM-based evaluation (future)
  - Human judgment with structured criteria
- Updated validation enum: binary | manual | evaluation
- New comparison modes: diff | side-by-side | baseline
- Outstanding questions for evaluation semantics

This addresses workflows where outputs may legitimately differ but
quality should remain comparable - e.g., search results where ordering
changes but relevance should be maintained.

https://claude.ai/code/session_01UaMxx1PpoaJFrA6x94cwF5
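
To make "script-based evaluation with thresholds" concrete for the search-results case, here is a minimal sketch, not the spec's evaluator interface. It scores how many of the baseline's top-k results survive in the new top-k, ignoring order, and passes when the score meets a threshold. The function names, the overlap metric, and the sample IDs are all assumptions chosen for illustration.

```typescript
// Hypothetical evaluator result shape: a numeric score plus a pass/fail gate.
interface EvaluationResult {
  score: number;
  pass: boolean;
}

// Fraction of the baseline's top-k result IDs that still appear in the
// candidate's top-k, regardless of position.
function topKOverlap(baseline: string[], candidate: string[], k: number): number {
  const expected = baseline.slice(0, k);
  if (expected.length === 0) {
    return 1; // nothing to preserve, so nothing was lost
  }
  const actual = candidate.slice(0, k);
  const kept = expected.filter((id) => actual.indexOf(id) !== -1).length;
  return kept / expected.length;
}

// Applies the threshold: ordering may change, but overlap must stay high enough.
function evaluateSearchResults(
  baseline: string[],
  candidate: string[],
  k: number,
  threshold: number
): EvaluationResult {
  const score = topKOverlap(baseline, candidate, k);
  return { score, pass: score >= threshold };
}

// Example: three of the four baseline results survive a reordering.
const result = evaluateSearchResults(["a", "b", "c", "d"], ["b", "a", "d", "x"], 4, 0.7);
console.log(result); // { score: 0.75, pass: true }
```

An order-insensitive metric fits the stated goal, since a pure reordering scores 1.0 while dropped results lower the score. A rank-aware metric would be a stricter alternative.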
github-actions bot commented Jan 31, 2026

Coverage Report

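| Status | Category | Percentage | Covered / Total |
|--------|------------|------------|-----------------|
| 🔵 | Lines | 93.29% | 2557 / 2741 |
| 🔵 | Statements | 93.29% | 2557 / 2741 |
| 🔵 | Functions | 35.76% | 54 / 151 |
| 🔵 | Branches | 36.87% | 243 / 659 |

File Coverage: No changed files found.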
Generated in workflow #134 for commit a12891e by the Vitest Coverage Report Action

jlevy changed the base branch from claude/tryscript-manual-testing-ZPMvS to main on January 31, 2026, 19:26

The .tbd/.gitignore from tbd v0.1.3 was missing entries for docs/
and state.yml that are present in current versions. A previous session
ran `tbd prime` which created these files, then committed them.

- Updated .tbd/.gitignore to match tbd v0.1.12 template
- Removed .tbd/docs/ from tracking (regenerated on setup)
- Removed .tbd/state.yml from tracking (local state)
- Updated tbd_version in config.yml

https://claude.ai/code/session_01UaMxx1PpoaJFrA6x94cwF5

- Workflows section now shows pnpm/package.json test scripts
- CI just calls the same scripts developers use locally
- Phase V simplified to be CI-agnostic
- Removed verbose GitHub Actions examples
- Cleaner, more minimal examples throughout

https://claude.ai/code/session_01UaMxx1PpoaJFrA6x94cwF5

- Consolidate redundant use case sections (web scraping, visual, interactive
  all merged into "manual review testing")
- Fix nested fence syntax (use 4+ backticks for outer fences)
- Fix elision patterns: use `...` not `[.. text ..]`
- Remove verbose REVIEW/EVALUATE comment syntax
- Simplify Phase IV to just recommend markdown for review guidance
- Reduce overall spec size by ~270 lines

https://claude.ai/code/session_01UaMxx1PpoaJFrA6x94cwF5

Simplified the manual testing workflows spec significantly:

- Phase I: Manual testing workflow (--review, validation frontmatter, playbook)
- Phase II: Quality evaluation (comparison modes, evaluators)

Reduced from ~950 lines to ~220 lines by:
- Removing redundant examples
- Consolidating related features into single phases
- Keeping only essential implementation details
- Streamlining outstanding questions

https://claude.ai/code/session_01UaMxx1PpoaJFrA6x94cwF5

Elision should only hide irrelevant noise (timestamps, paths), not the
content being evaluated. For manual review, you need to see the actual
output to assess quality.

https://claude.ai/code/session_01UaMxx1PpoaJFrA6x94cwF5
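
The elision principle above can be sketched as a small normalizer that masks only timestamps and absolute paths and leaves the rest of the output visible for review. This is an illustrative helper, not tryscript's elision syntax. The name `maskNoise`, the placeholders, and the two regular expressions (ISO-8601 timestamps, POSIX-style paths with at least two segments) are assumptions.

```typescript
// Hypothetical helper: hide run-to-run noise, keep the content under review.
function maskNoise(output: string): string {
  return output
    // ISO-8601 timestamps such as 2026-01-31T19:26:00Z (assumed format).
    .replace(
      /\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:\d{2})?/g,
      "<TIMESTAMP>"
    )
    // Absolute POSIX-style paths with two or more segments (assumed format).
    .replace(/(?:\/[\w.-]+){2,}/g, "<PATH>");
}

const line = "Fetched 3 results at 2026-01-31T19:26:00Z from /tmp/run-42/cache.json";
console.log(maskNoise(line));
// Fetched 3 results at <TIMESTAMP> from <PATH>
```

The result count and wording stay intact, so a reviewer still sees what the command actually produced.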
- `--review` updates files, lists what changed; use `git diff` to review
- Remove custom comparison modes (diff, side-by-side, baseline)
- Git already provides all this functionality with better tooling
- Phase II simplified to just evaluator scripts for automated quality gates
- Users can configure Git with delta/diff-so-fancy or use IDE/GitHub

https://claude.ai/code/session_01UaMxx1PpoaJFrA6x94cwF5