Tighten Daily Formal Spec Verifier safe-output contract by Copilot · Pull Request #40367 · github/gh-aw

Copilot · 2026-06-19T17:42:35Z

The Daily Formal Spec Verifier run completed its analysis but produced no safe outputs. In practice, the agent prepared issue content, then drifted into CLI-based safeoutputs invocation instead of emitting a direct safe-output item.

Prompt contract
- Add an explicit output contract to .github/workflows/daily-formal-spec-verifier.md requiring the workflow to:
  - draft content locally first
  - emit exactly one final create_issue safe output
  - avoid bash, cli-proxy, and safeoutputs CLI for issue creation or schema probing
  - fall back to direct report_incomplete emission when quality checks fail
Failure-mode hardening
- Clarify that report_incomplete must be emitted as a safe output, not reached through a shell-mediated tool path.
- This narrows the agent’s allowed completion behavior to direct safe-output emission, which is the missing condition in the failing run.
Regression coverage
- Add a workflow prompt test in pkg/workflow/prompts_test.go that locks in the direct-output contract and the no-shell/no-CLI requirement.

Example of the new contract shape:

### Output Contract (Required)

1. Draft the title and body locally first if needed, but emit exactly one final `create_issue` safe output only after the full payload is complete.
2. Do **not** use `bash`, `cli-proxy`, or the `safeoutputs` CLI to create the issue or inspect the tool schema. Emit the safe output directly with `title` and `body` arguments.
3. Never retry `create_issue` with empty, placeholder, or partial arguments.
4. If the quality checks below cannot be met, emit `report_incomplete` directly as a safe output instead of `create_issue`.

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>

Copilot

Pull request overview

Tightens the Daily Formal Spec Verifier workflow prompt so the agent must emit safe outputs directly (instead of drifting into shell/CLI safeoutputs usage), and adds a Go prompt test to prevent regressions.

Changes:

Adds an explicit “Output Contract (Required)” section to the workflow prompt, including a direct report_incomplete safe-output fallback.
Updates the quality-bar fallback language to require emitting report_incomplete directly as a safe output.
Adds a regression test asserting the workflow contains the direct safe-output contract language.

Show a summary per file

File	Description
`pkg/workflow/prompts_test.go`	Adds a workflow-prompt regression test for the new direct safe-output contract.
`.github/workflows/daily-formal-spec-verifier.md`	Introduces an explicit Output Contract section and tightens fallback wording.
`.github/workflows/daily-formal-spec-verifier.lock.yml`	Regenerated lock to reflect the updated markdown workflow body hash.

Copilot's findings

Tip

Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Files reviewed: 3/3 changed files
Comments generated: 2

@@ -174,6 +174,13 @@ Use existing types, functions, and interfaces from the codebase where possible (

 Create exactly one issue using the `create_issue` safe output.


+	workflow := string(content)
+	requiredContract := "Draft the title and body locally first if needed, but emit exactly one final `create_issue` safe output only after the full payload is complete."
+	if !strings.Contains(workflow, requiredContract) {
+		t.Fatal("Expected daily-formal-spec-verifier workflow to require a single final create_issue safe output")
+	}
+
+	noShellGuidance := "Do **not** use `bash`, `cli-proxy`, or the `safeoutputs` CLI to create the issue or inspect the tool schema. Emit the safe output directly with `title` and `body` arguments."
+	if !strings.Contains(workflow, noShellGuidance) {
+		t.Fatal("Expected daily-formal-spec-verifier workflow to forbid bash/CLI safe-output invocation")
+	}
+
+	reportIncompleteGuidance := "If the quality checks below cannot be met, emit `report_incomplete` directly as a safe output instead of `create_issue`."
+	if !strings.Contains(workflow, reportIncompleteGuidance) {
+		t.Fatal("Expected daily-formal-spec-verifier workflow to require direct report_incomplete fallback")
+	}


github-actions · 2026-06-19T18:14:49Z

✅ Test Quality Sentinel completed test quality analysis.

github-actions · 2026-06-19T18:14:53Z

✅ PR Code Quality Reviewer completed the code quality review.

github-actions · 2026-06-19T18:14:59Z

✅ Design Decision Gate 🏗️ completed the design decision gate check.

No ADR enforcement needed: PR #40367 does not have the 'implementation' label and has only 29 new lines of code in business logic directories (≤100 threshold). Neither enforcement condition is met.

github-actions · 2026-06-19T18:15:05Z

🧠 Matt Pocock Skills Reviewer has completed the skills-based review. ✅

github-actions · 2026-06-19T18:22:35Z

🧪 Test Quality Sentinel Report

✅ Test Quality Score: 90/100 — Excellent

Analyzed 1 test: 1 design, 0 implementation, 0 guideline violations.

📊 Metrics & Test Classification (1 test analyzed)

Metric	Value
New/modified tests analyzed	1
✅ Design tests (behavioral contracts)	1 (100%)
⚠️ Implementation tests (low value)	0 (0%)
Tests with error/edge cases	1 (100%)
Duplicate test clusters	0
Test inflation detected	⚠️ Yes — 29 test lines added vs 0 new lines in `prompts.go` (note: test targets a workflow `.md` file, not `prompts.go`, so this is contextually expected)
🚨 Coding-guideline violations	0

Test	File	Classification	Issues Detected
`TestDailyFormalSpecVerifierDefinesDirectSafeOutputContract`	`pkg/workflow/prompts_test.go:327`	✅ Design	—

Go: 1 (*_test.go); JavaScript: 0 (*.test.cjs, *.test.js).

Scoring Breakdown

Component	Score	Max
Behavioral Coverage (100% design tests)	40	40
Error/Edge Case Coverage (100% have error paths)	30	30
Low Duplication (0 clusters)	20	20
Proportional Growth (inflation flag — binary deduction)	0	10
Total	90	100

🔍 Test Analysis Detail

TestDailyFormalSpecVerifierDefinesDirectSafeOutputContract (pkg/workflow/prompts_test.go:327)

Verifies that daily-formal-spec-verifier.md contains three required safe-output contract strings:

Single final create_issue safe output contract
Prohibition of bash/cli-proxy/safeoutputs CLI usage
Direct report_incomplete fallback requirement

Design invariant: The workflow must contain exact contract language that governs agent behavior. This is a behavioral contract test — if the strings are removed, an agent could misuse the safe-output mechanism (e.g., calling safeoutputs CLI instead of emitting the tool call directly).

Value if deleted: High — removes the guard preventing accidental deletion of the safe-output contract section from the workflow.

Error/edge coverage: ✅ Each of the 3 required strings has a separate t.Fatal path, covering the "contract broken" failure scenario.

Assertion messages: ✅ All assertions use descriptive message strings.

Build tag: ✅ //go:build !integration on line 1.

Inflation note: 29 test lines verify 8 production lines in the workflow .md file. The prompts_test.go → prompts.go mapping shows 0 production .go changes, triggering the inflation flag by strict rule. However, this test pattern (asserting workflow file content) inherently requires more verification lines than the workflow additions themselves — the flag is technically correct but contextually expected.

Verdict

✅ Check passed. 0% implementation tests (threshold: 30%). The new test enforces a behavioral contract — it ensures the workflow contains exact safe-output governance language that directly controls agent output behavior.

🧪 Test quality analysis by Test Quality Sentinel · 64.6 AIC · ⌖ 10.5 AIC · ⊞ 8.3K · ◷

github-actions

✅ Test Quality Sentinel: 90/100. Test quality is acceptable — 0% of new tests are implementation tests (threshold: 30%).

github-actions

🧠 Reviewed using Matt Pocock's skills by Matt Pocock Skills Reviewer · 88.6 AIC · ⌖ 7.34 AIC · ⊞ 6.9K

github-actions · 2026-06-19T18:25:03Z

+	if !strings.Contains(workflow, reportIncompleteGuidance) {
+		t.Fatal("Expected daily-formal-spec-verifier workflow to require direct report_incomplete fallback")
+	}
+}


[/tdd] Rule 3 (Never retry \create_issue` with empty, placeholder, or partial arguments`) has no assertion here — if it is accidentally removed or rephrased, no test will catch it.

💡 Suggested addition (before the closing `}`)

noRetryGuidance := "Never retry `create_issue` with empty, placeholder, or partial arguments." if !strings.Contains(workflow, noRetryGuidance) { t.Fatal("Expected daily-formal-spec-verifier workflow to forbid retrying create_issue with partial args") }

Rules 1, 2, and 4 are already covered; this closes the gap for rule 3.

github-actions · 2026-06-19T18:25:03Z

+	}
+
+	workflow := string(content)
+	requiredContract := "Draft the title and body locally first if needed, but emit exactly one final `create_issue` safe output only after the full payload is complete."


[/tdd] The full-sentence verbatim string makes this assertion brittle — a typo fix or minor reword in the workflow breaks the test even when the contract semantics are unchanged.

💡 Consider a shorter key-phrase

A shorter key substring captures the intent with less fragility:

// Instead of copying the full sentence: requiredContract := "emit exactly one final `create_issue` safe output only after the full payload is complete"

This is a pattern-level observation; the existing convention in this file does use full sentences. Worth discussing as a team norm if the tests start drifting out of sync with minor prompt rewords.

github-actions

Two issues need follow-up — both were flagged before merge but not resolved.

Findings

1. Contradictory primary directive (correctness, high)

The unchanged instruction at ~line 175 of daily-formal-spec-verifier.md — Create

🔎 Code quality review by PR Code Quality Reviewer · 83.1 AIC · ⌖ 8.03 AIC · ⊞ 5.1K

github-actions · 2026-06-19T18:26:05Z

+1. Draft the title and body locally first if needed, but emit exactly one final `create_issue` safe output only after the full payload is complete.
+2. Do **not** use `bash`, `cli-proxy`, or the `safeoutputs` CLI to create the issue or inspect the tool schema. Emit the safe output directly with `title` and `body` arguments.
+3. Never retry `create_issue` with empty, placeholder, or partial arguments.
+4. If the quality checks below cannot be met, emit `report_incomplete` directly as a safe output instead of `create_issue`.


Contradictory primary directive: the unchanged instruction above this section ("Create exactly one issue using the create_issue safe output.") is unconditional; adding Output Contract #4 as a subordinate clause does not override it, and an agent reasoning top-down may always choose create_issue.

💡 Suggested fix

Change the standalone instruction (~line 175) from:

Create exactly one issue using the `create_issue` safe output.

to:

Emit exactly one final safe output: use `create_issue` if the quality bar is met, or `report_incomplete` if it cannot be met.

This makes both allowed paths explicit at the top level rather than introducing an exception seven lines later inside a sub-section, which an agent can reasonably interpret as advisory rather than authoritative.

github-actions · 2026-06-19T18:26:05Z

+		t.Fatal("Expected daily-formal-spec-verifier workflow to forbid bash/CLI safe-output invocation")
+	}
+
+	reportIncompleteGuidance := "If the quality checks below cannot be met, emit `report_incomplete` directly as a safe output instead of `create_issue`."


Coverage gap: this assertion covers Output Contract item #4 ("quality checks below") but the quality-bar paragraph updated at line ~249 ("these checks") uses different wording and is never verified — a silent revert of that line would leave all three assertions green.

💡 Suggested fix

Add a fourth assertion after line 353 targeting the quality-bar wording specifically:

qualityBarFallback := "If these checks cannot be met, emit `report_incomplete` directly as a safe output instead of `create_issue`." if !strings.Contains(workflow, qualityBarFallback) { t.Fatal("Expected daily-formal-spec-verifier quality bar to require direct report_incomplete fallback") }

This closes the gap: the existing reportIncompleteGuidance check locks in the Output Contract section; this new check locks in the quality-bar paragraph update. Together they ensure both locations stay consistent.

Initial plan

94079de

Copilot AI assigned Copilot and pelikhan Jun 19, 2026

Copilot started work on behalf of pelikhan June 19, 2026 17:42 View session

Fix Daily Formal Spec Verifier safe output contract

6405b3e

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>

Copilot AI changed the title ~~[WIP] Fix daily formal spec verifier to produce safe outputs~~ Tighten Daily Formal Spec Verifier safe-output contract Jun 19, 2026

Copilot AI requested a review from pelikhan June 19, 2026 17:58

Copilot finished work on behalf of pelikhan June 19, 2026 17:58

pelikhan marked this pull request as ready for review June 19, 2026 18:05

Copilot AI review requested due to automatic review settings June 19, 2026 18:05

Copilot started reviewing on behalf of pelikhan June 19, 2026 18:06 View session

Copilot AI reviewed Jun 19, 2026

View reviewed changes

pelikhan approved these changes Jun 19, 2026

View reviewed changes

pelikhan merged commit 24f7d4c into main Jun 19, 2026
57 of 68 checks passed

pelikhan deleted the copilot/aw-fix-daily-formal-spec-verifier branch June 19, 2026 18:14

github-actions Bot mentioned this pull request Jun 19, 2026

[aw] No-Op Runs #39849

Open

github-actions Bot approved these changes Jun 19, 2026

View reviewed changes

github-actions Bot reviewed Jun 19, 2026

View reviewed changes

github-actions Bot mentioned this pull request Jun 19, 2026

Smoke Test: Copilot - 27845045338 #40382

Closed

		@@ -174,6 +174,13 @@ Use existing types, functions, and interfaces from the codebase where possible (

		Create exactly one issue using the `create_issue` safe output.

Conversation

Copilot AI commented Jun 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Copilot's findings

Uh oh!

github-actions Bot commented Jun 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented Jun 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented Jun 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented Jun 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented Jun 19, 2026

🧪 Test Quality Sentinel Report

Verdict

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions Bot Jun 19, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot Jun 19, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

1. Contradictory primary directive (correctness, high)

Uh oh!

github-actions Bot Jun 19, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot Jun 19, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Copilot AI commented Jun 19, 2026 •

edited

Loading

github-actions Bot commented Jun 19, 2026 •

edited

Loading

github-actions Bot commented Jun 19, 2026 •

edited

Loading

github-actions Bot commented Jun 19, 2026 •

edited

Loading

github-actions Bot commented Jun 19, 2026 •

edited

Loading