Tighten Daily Formal Spec Verifier safe-output contract#40367
Conversation
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Tightens the Daily Formal Spec Verifier workflow prompt so the agent must emit safe outputs directly (instead of drifting into shell/CLI safeoutputs usage), and adds a Go prompt test to prevent regressions.
Changes:
- Adds an explicit “Output Contract (Required)” section to the workflow prompt, including a direct
report_incompletesafe-output fallback. - Updates the quality-bar fallback language to require emitting
report_incompletedirectly as a safe output. - Adds a regression test asserting the workflow contains the direct safe-output contract language.
Show a summary per file
| File | Description |
|---|---|
pkg/workflow/prompts_test.go |
Adds a workflow-prompt regression test for the new direct safe-output contract. |
.github/workflows/daily-formal-spec-verifier.md |
Introduces an explicit Output Contract section and tightens fallback wording. |
.github/workflows/daily-formal-spec-verifier.lock.yml |
Regenerated lock to reflect the updated markdown workflow body hash. |
Copilot's findings
Tip
Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Files reviewed: 3/3 changed files
- Comments generated: 2
| @@ -174,6 +174,13 @@ Use existing types, functions, and interfaces from the codebase where possible ( | |||
|
|
|||
| Create exactly one issue using the `create_issue` safe output. | |||
| workflow := string(content) | ||
| requiredContract := "Draft the title and body locally first if needed, but emit exactly one final `create_issue` safe output only after the full payload is complete." | ||
| if !strings.Contains(workflow, requiredContract) { | ||
| t.Fatal("Expected daily-formal-spec-verifier workflow to require a single final create_issue safe output") | ||
| } | ||
|
|
||
| noShellGuidance := "Do **not** use `bash`, `cli-proxy`, or the `safeoutputs` CLI to create the issue or inspect the tool schema. Emit the safe output directly with `title` and `body` arguments." | ||
| if !strings.Contains(workflow, noShellGuidance) { | ||
| t.Fatal("Expected daily-formal-spec-verifier workflow to forbid bash/CLI safe-output invocation") | ||
| } | ||
|
|
||
| reportIncompleteGuidance := "If the quality checks below cannot be met, emit `report_incomplete` directly as a safe output instead of `create_issue`." | ||
| if !strings.Contains(workflow, reportIncompleteGuidance) { | ||
| t.Fatal("Expected daily-formal-spec-verifier workflow to require direct report_incomplete fallback") | ||
| } |
|
✅ Test Quality Sentinel completed test quality analysis. |
|
✅ PR Code Quality Reviewer completed the code quality review. |
|
✅ Design Decision Gate 🏗️ completed the design decision gate check. No ADR enforcement needed: PR #40367 does not have the 'implementation' label and has only 29 new lines of code in business logic directories (≤100 threshold). Neither enforcement condition is met. |
|
🧠 Matt Pocock Skills Reviewer has completed the skills-based review. ✅ |
🧪 Test Quality Sentinel Report✅ Test Quality Score: 90/100 — Excellent
📊 Metrics & Test Classification (1 test analyzed)
Go: 1 ( Scoring Breakdown
🔍 Test Analysis Detail
Verifies that
Design invariant: The workflow must contain exact contract language that governs agent behavior. This is a behavioral contract test — if the strings are removed, an agent could misuse the safe-output mechanism (e.g., calling safeoutputs CLI instead of emitting the tool call directly). Value if deleted: High — removes the guard preventing accidental deletion of the safe-output contract section from the workflow. Error/edge coverage: ✅ Each of the 3 required strings has a separate Assertion messages: ✅ All assertions use descriptive message strings. Build tag: ✅ Inflation note: 29 test lines verify 8 production lines in the workflow Verdict
|
There was a problem hiding this comment.
🧠 Reviewed using Matt Pocock's skills by Matt Pocock Skills Reviewer · 88.6 AIC · ⌖ 7.34 AIC · ⊞ 6.9K
| if !strings.Contains(workflow, reportIncompleteGuidance) { | ||
| t.Fatal("Expected daily-formal-spec-verifier workflow to require direct report_incomplete fallback") | ||
| } | ||
| } |
There was a problem hiding this comment.
[/tdd] Rule 3 (Never retry \create_issue` with empty, placeholder, or partial arguments`) has no assertion here — if it is accidentally removed or rephrased, no test will catch it.
💡 Suggested addition (before the closing `}`)
noRetryGuidance := "Never retry `create_issue` with empty, placeholder, or partial arguments."
if !strings.Contains(workflow, noRetryGuidance) {
t.Fatal("Expected daily-formal-spec-verifier workflow to forbid retrying create_issue with partial args")
}Rules 1, 2, and 4 are already covered; this closes the gap for rule 3.
| } | ||
|
|
||
| workflow := string(content) | ||
| requiredContract := "Draft the title and body locally first if needed, but emit exactly one final `create_issue` safe output only after the full payload is complete." |
There was a problem hiding this comment.
[/tdd] The full-sentence verbatim string makes this assertion brittle — a typo fix or minor reword in the workflow breaks the test even when the contract semantics are unchanged.
💡 Consider a shorter key-phrase
A shorter key substring captures the intent with less fragility:
// Instead of copying the full sentence:
requiredContract := "emit exactly one final `create_issue` safe output only after the full payload is complete"This is a pattern-level observation; the existing convention in this file does use full sentences. Worth discussing as a team norm if the tests start drifting out of sync with minor prompt rewords.
There was a problem hiding this comment.
Two issues need follow-up — both were flagged before merge but not resolved.
Findings
1. Contradictory primary directive (correctness, high)
The unchanged instruction at ~line 175 of daily-formal-spec-verifier.md — Create
🔎 Code quality review by PR Code Quality Reviewer · 83.1 AIC · ⌖ 8.03 AIC · ⊞ 5.1K
| 1. Draft the title and body locally first if needed, but emit exactly one final `create_issue` safe output only after the full payload is complete. | ||
| 2. Do **not** use `bash`, `cli-proxy`, or the `safeoutputs` CLI to create the issue or inspect the tool schema. Emit the safe output directly with `title` and `body` arguments. | ||
| 3. Never retry `create_issue` with empty, placeholder, or partial arguments. | ||
| 4. If the quality checks below cannot be met, emit `report_incomplete` directly as a safe output instead of `create_issue`. |
There was a problem hiding this comment.
Contradictory primary directive: the unchanged instruction above this section ("Create exactly one issue using the create_issue safe output.") is unconditional; adding Output Contract #4 as a subordinate clause does not override it, and an agent reasoning top-down may always choose create_issue.
💡 Suggested fix
Change the standalone instruction (~line 175) from:
Create exactly one issue using the `create_issue` safe output.
to:
Emit exactly one final safe output: use `create_issue` if the quality bar is met, or `report_incomplete` if it cannot be met.
This makes both allowed paths explicit at the top level rather than introducing an exception seven lines later inside a sub-section, which an agent can reasonably interpret as advisory rather than authoritative.
| t.Fatal("Expected daily-formal-spec-verifier workflow to forbid bash/CLI safe-output invocation") | ||
| } | ||
|
|
||
| reportIncompleteGuidance := "If the quality checks below cannot be met, emit `report_incomplete` directly as a safe output instead of `create_issue`." |
There was a problem hiding this comment.
Coverage gap: this assertion covers Output Contract item #4 ("quality checks below") but the quality-bar paragraph updated at line ~249 ("these checks") uses different wording and is never verified — a silent revert of that line would leave all three assertions green.
💡 Suggested fix
Add a fourth assertion after line 353 targeting the quality-bar wording specifically:
qualityBarFallback := "If these checks cannot be met, emit `report_incomplete` directly as a safe output instead of `create_issue`."
if !strings.Contains(workflow, qualityBarFallback) {
t.Fatal("Expected daily-formal-spec-verifier quality bar to require direct report_incomplete fallback")
}This closes the gap: the existing reportIncompleteGuidance check locks in the Output Contract section; this new check locks in the quality-bar paragraph update. Together they ensure both locations stay consistent.
The Daily Formal Spec Verifier run completed its analysis but produced no safe outputs. In practice, the agent prepared issue content, then drifted into CLI-based
safeoutputsinvocation instead of emitting a direct safe-output item.Prompt contract
.github/workflows/daily-formal-spec-verifier.mdrequiring the workflow to:create_issuesafe outputbash,cli-proxy, andsafeoutputsCLI for issue creation or schema probingreport_incompleteemission when quality checks failFailure-mode hardening
report_incompletemust be emitted as a safe output, not reached through a shell-mediated tool path.Regression coverage
pkg/workflow/prompts_test.gothat locks in the direct-output contract and the no-shell/no-CLI requirement.Example of the new contract shape: