Skip to content

Tighten Daily Formal Spec Verifier safe-output contract#40367

Merged
pelikhan merged 2 commits into
mainfrom
copilot/aw-fix-daily-formal-spec-verifier
Jun 19, 2026
Merged

Tighten Daily Formal Spec Verifier safe-output contract#40367
pelikhan merged 2 commits into
mainfrom
copilot/aw-fix-daily-formal-spec-verifier

Conversation

Copilot AI commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

The Daily Formal Spec Verifier run completed its analysis but produced no safe outputs. In practice, the agent prepared issue content, then drifted into CLI-based safeoutputs invocation instead of emitting a direct safe-output item.

  • Prompt contract

    • Add an explicit output contract to .github/workflows/daily-formal-spec-verifier.md requiring the workflow to:
      • draft content locally first
      • emit exactly one final create_issue safe output
      • avoid bash, cli-proxy, and safeoutputs CLI for issue creation or schema probing
      • fall back to direct report_incomplete emission when quality checks fail
  • Failure-mode hardening

    • Clarify that report_incomplete must be emitted as a safe output, not reached through a shell-mediated tool path.
    • This narrows the agent’s allowed completion behavior to direct safe-output emission, which is the missing condition in the failing run.
  • Regression coverage

    • Add a workflow prompt test in pkg/workflow/prompts_test.go that locks in the direct-output contract and the no-shell/no-CLI requirement.

Example of the new contract shape:

### Output Contract (Required)

1. Draft the title and body locally first if needed, but emit exactly one final `create_issue` safe output only after the full payload is complete.
2. Do **not** use `bash`, `cli-proxy`, or the `safeoutputs` CLI to create the issue or inspect the tool schema. Emit the safe output directly with `title` and `body` arguments.
3. Never retry `create_issue` with empty, placeholder, or partial arguments.
4. If the quality checks below cannot be met, emit `report_incomplete` directly as a safe output instead of `create_issue`.

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Copilot AI changed the title [WIP] Fix daily formal spec verifier to produce safe outputs Tighten Daily Formal Spec Verifier safe-output contract Jun 19, 2026
Copilot AI requested a review from pelikhan June 19, 2026 17:58
@pelikhan pelikhan marked this pull request as ready for review June 19, 2026 18:05
Copilot AI review requested due to automatic review settings June 19, 2026 18:05

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Tightens the Daily Formal Spec Verifier workflow prompt so the agent must emit safe outputs directly (instead of drifting into shell/CLI safeoutputs usage), and adds a Go prompt test to prevent regressions.

Changes:

  • Adds an explicit “Output Contract (Required)” section to the workflow prompt, including a direct report_incomplete safe-output fallback.
  • Updates the quality-bar fallback language to require emitting report_incomplete directly as a safe output.
  • Adds a regression test asserting the workflow contains the direct safe-output contract language.
Show a summary per file
File Description
pkg/workflow/prompts_test.go Adds a workflow-prompt regression test for the new direct safe-output contract.
.github/workflows/daily-formal-spec-verifier.md Introduces an explicit Output Contract section and tightens fallback wording.
.github/workflows/daily-formal-spec-verifier.lock.yml Regenerated lock to reflect the updated markdown workflow body hash.

Copilot's findings

Tip

Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

  • Files reviewed: 3/3 changed files
  • Comments generated: 2

@@ -174,6 +174,13 @@ Use existing types, functions, and interfaces from the codebase where possible (

Create exactly one issue using the `create_issue` safe output.
Comment on lines +339 to +353
workflow := string(content)
requiredContract := "Draft the title and body locally first if needed, but emit exactly one final `create_issue` safe output only after the full payload is complete."
if !strings.Contains(workflow, requiredContract) {
t.Fatal("Expected daily-formal-spec-verifier workflow to require a single final create_issue safe output")
}

noShellGuidance := "Do **not** use `bash`, `cli-proxy`, or the `safeoutputs` CLI to create the issue or inspect the tool schema. Emit the safe output directly with `title` and `body` arguments."
if !strings.Contains(workflow, noShellGuidance) {
t.Fatal("Expected daily-formal-spec-verifier workflow to forbid bash/CLI safe-output invocation")
}

reportIncompleteGuidance := "If the quality checks below cannot be met, emit `report_incomplete` directly as a safe output instead of `create_issue`."
if !strings.Contains(workflow, reportIncompleteGuidance) {
t.Fatal("Expected daily-formal-spec-verifier workflow to require direct report_incomplete fallback")
}
@github-actions

github-actions Bot commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

Test Quality Sentinel completed test quality analysis.

@github-actions

github-actions Bot commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

PR Code Quality Reviewer completed the code quality review.

@pelikhan pelikhan merged commit 24f7d4c into main Jun 19, 2026
57 of 68 checks passed
@pelikhan pelikhan deleted the copilot/aw-fix-daily-formal-spec-verifier branch June 19, 2026 18:14
@github-actions

github-actions Bot commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

Design Decision Gate 🏗️ completed the design decision gate check.

No ADR enforcement needed: PR #40367 does not have the 'implementation' label and has only 29 new lines of code in business logic directories (≤100 threshold). Neither enforcement condition is met.

@github-actions

github-actions Bot commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

🧠 Matt Pocock Skills Reviewer has completed the skills-based review. ✅

@github-actions github-actions Bot mentioned this pull request Jun 19, 2026
@github-actions

Copy link
Copy Markdown
Contributor

🧪 Test Quality Sentinel Report

Test Quality Score: 90/100 — Excellent

Analyzed 1 test: 1 design, 0 implementation, 0 guideline violations.

📊 Metrics & Test Classification (1 test analyzed)
Metric Value
New/modified tests analyzed 1
✅ Design tests (behavioral contracts) 1 (100%)
⚠️ Implementation tests (low value) 0 (0%)
Tests with error/edge cases 1 (100%)
Duplicate test clusters 0
Test inflation detected ⚠️ Yes — 29 test lines added vs 0 new lines in prompts.go (note: test targets a workflow .md file, not prompts.go, so this is contextually expected)
🚨 Coding-guideline violations 0
Test File Classification Issues Detected
TestDailyFormalSpecVerifierDefinesDirectSafeOutputContract pkg/workflow/prompts_test.go:327 ✅ Design

Go: 1 (*_test.go); JavaScript: 0 (*.test.cjs, *.test.js).

Scoring Breakdown

Component Score Max
Behavioral Coverage (100% design tests) 40 40
Error/Edge Case Coverage (100% have error paths) 30 30
Low Duplication (0 clusters) 20 20
Proportional Growth (inflation flag — binary deduction) 0 10
Total 90 100
🔍 Test Analysis Detail

TestDailyFormalSpecVerifierDefinesDirectSafeOutputContract (pkg/workflow/prompts_test.go:327)

Verifies that daily-formal-spec-verifier.md contains three required safe-output contract strings:

  1. Single final create_issue safe output contract
  2. Prohibition of bash/cli-proxy/safeoutputs CLI usage
  3. Direct report_incomplete fallback requirement

Design invariant: The workflow must contain exact contract language that governs agent behavior. This is a behavioral contract test — if the strings are removed, an agent could misuse the safe-output mechanism (e.g., calling safeoutputs CLI instead of emitting the tool call directly).

Value if deleted: High — removes the guard preventing accidental deletion of the safe-output contract section from the workflow.

Error/edge coverage: ✅ Each of the 3 required strings has a separate t.Fatal path, covering the "contract broken" failure scenario.

Assertion messages: ✅ All assertions use descriptive message strings.

Build tag: ✅ //go:build !integration on line 1.

Inflation note: 29 test lines verify 8 production lines in the workflow .md file. The prompts_test.goprompts.go mapping shows 0 production .go changes, triggering the inflation flag by strict rule. However, this test pattern (asserting workflow file content) inherently requires more verification lines than the workflow additions themselves — the flag is technically correct but contextually expected.

Verdict

Check passed. 0% implementation tests (threshold: 30%). The new test enforces a behavioral contract — it ensures the workflow contains exact safe-output governance language that directly controls agent output behavior.

🧪 Test quality analysis by Test Quality Sentinel · 64.6 AIC · ⌖ 10.5 AIC · ⊞ 8.3K ·

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Test Quality Sentinel: 90/100. Test quality is acceptable — 0% of new tests are implementation tests (threshold: 30%).

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧠 Reviewed using Matt Pocock's skills by Matt Pocock Skills Reviewer · 88.6 AIC · ⌖ 7.34 AIC · ⊞ 6.9K

if !strings.Contains(workflow, reportIncompleteGuidance) {
t.Fatal("Expected daily-formal-spec-verifier workflow to require direct report_incomplete fallback")
}
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[/tdd] Rule 3 (Never retry \create_issue` with empty, placeholder, or partial arguments`) has no assertion here — if it is accidentally removed or rephrased, no test will catch it.

💡 Suggested addition (before the closing `}`)
noRetryGuidance := "Never retry `create_issue` with empty, placeholder, or partial arguments."
if !strings.Contains(workflow, noRetryGuidance) {
	t.Fatal("Expected daily-formal-spec-verifier workflow to forbid retrying create_issue with partial args")
}

Rules 1, 2, and 4 are already covered; this closes the gap for rule 3.

}

workflow := string(content)
requiredContract := "Draft the title and body locally first if needed, but emit exactly one final `create_issue` safe output only after the full payload is complete."

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[/tdd] The full-sentence verbatim string makes this assertion brittle — a typo fix or minor reword in the workflow breaks the test even when the contract semantics are unchanged.

💡 Consider a shorter key-phrase

A shorter key substring captures the intent with less fragility:

// Instead of copying the full sentence:
requiredContract := "emit exactly one final `create_issue` safe output only after the full payload is complete"

This is a pattern-level observation; the existing convention in this file does use full sentences. Worth discussing as a team norm if the tests start drifting out of sync with minor prompt rewords.

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two issues need follow-up — both were flagged before merge but not resolved.

Findings

1. Contradictory primary directive (correctness, high)

The unchanged instruction at ~line 175 of daily-formal-spec-verifier.md — Create

🔎 Code quality review by PR Code Quality Reviewer · 83.1 AIC · ⌖ 8.03 AIC · ⊞ 5.1K

1. Draft the title and body locally first if needed, but emit exactly one final `create_issue` safe output only after the full payload is complete.
2. Do **not** use `bash`, `cli-proxy`, or the `safeoutputs` CLI to create the issue or inspect the tool schema. Emit the safe output directly with `title` and `body` arguments.
3. Never retry `create_issue` with empty, placeholder, or partial arguments.
4. If the quality checks below cannot be met, emit `report_incomplete` directly as a safe output instead of `create_issue`.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Contradictory primary directive: the unchanged instruction above this section ("Create exactly one issue using the create_issue safe output.") is unconditional; adding Output Contract #4 as a subordinate clause does not override it, and an agent reasoning top-down may always choose create_issue.

💡 Suggested fix

Change the standalone instruction (~line 175) from:

Create exactly one issue using the `create_issue` safe output.

to:

Emit exactly one final safe output: use `create_issue` if the quality bar is met, or `report_incomplete` if it cannot be met.

This makes both allowed paths explicit at the top level rather than introducing an exception seven lines later inside a sub-section, which an agent can reasonably interpret as advisory rather than authoritative.

t.Fatal("Expected daily-formal-spec-verifier workflow to forbid bash/CLI safe-output invocation")
}

reportIncompleteGuidance := "If the quality checks below cannot be met, emit `report_incomplete` directly as a safe output instead of `create_issue`."

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Coverage gap: this assertion covers Output Contract item #4 ("quality checks below") but the quality-bar paragraph updated at line ~249 ("these checks") uses different wording and is never verified — a silent revert of that line would leave all three assertions green.

💡 Suggested fix

Add a fourth assertion after line 353 targeting the quality-bar wording specifically:

qualityBarFallback := "If these checks cannot be met, emit `report_incomplete` directly as a safe output instead of `create_issue`."
if !strings.Contains(workflow, qualityBarFallback) {
    t.Fatal("Expected daily-formal-spec-verifier quality bar to require direct report_incomplete fallback")
}

This closes the gap: the existing reportIncompleteGuidance check locks in the Output Contract section; this new check locks in the quality-bar paragraph update. Together they ensure both locations stay consistent.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants