feat: add Monte Carlo forecast compliance test suite (P1–P13) and fix fixture AIC gap#40126
Conversation
- Add total_aic: 0.0054 to specs/forecast-compliance-fixtures/run_summary_minimal.json to close the fixture gap where the engine reads TotalAIC but the fixture only set total_effective_tokens (runAIC ≤ 0 → continue at forecast.go:593) - Add pkg/cli/forecast_compliance_fixtures_formal_test.go with 13 predicates covering T-FC-031–T-FC-040: fixture field mapping, Bernoulli success model, observed rate / yield formulas, Monte Carlo trial count, zero-λ nil guard, zero-AIC exclusion, reliability threshold, Poisson branch crossover, duration derivation, days flag validation, fixture AIC gap, and JSON conformance Closes #40114 Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
This PR fixes a forecast fixture/schema mismatch that caused fixture-backed Monte Carlo forecasts to be skipped, and adds a formalized compliance test suite (P1–P13) to prevent regressions and validate key Monte Carlo/forecast invariants.
Changes:
- Add
token_usage_summary.total_aicto the canonicalrun_summary_minimal.jsonfixture so AIC observations aren’t silently excluded. - Introduce a new formal compliance test file covering predicates P1–P13 for Monte Carlo forecast behavior and fixture conformance.
Show a summary per file
| File | Description |
|---|---|
specs/forecast-compliance-fixtures/run_summary_minimal.json |
Adds the missing total_aic field so fixture runs produce usable AIC observations for forecasting. |
pkg/cli/forecast_compliance_fixtures_formal_test.go |
Adds formal predicate-based unit tests validating Monte Carlo behavior, forecast derivations, flag validation, and fixture field presence. |
Copilot's findings
Tip
Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Files reviewed: 2/2 changed files
- Comments generated: 1
| // Allowed values must NOT return a flag-validation error. | ||
| for _, days := range []int{7, 30} { | ||
| cfg := ForecastConfig{Days: days, Period: "month", JSONOutput: true, SampleSize: 10} | ||
| err := RunForecast(cfg) | ||
| // The call may fail for other reasons (no GitHub auth, no workflows), but must | ||
| // not fail with the days-validation error. | ||
| if err != nil { | ||
| assert.NotContains(t, err.Error(), "invalid days value", | ||
| "P11: days=%d is valid and must not trigger the days-validation error", days) | ||
| } | ||
| } |
|
🧠 Matt Pocock Skills Reviewer has completed the skills-based review. ✅ |
|
✅ PR Code Quality Reviewer completed the code quality review. |
|
✅ Test Quality Sentinel completed test quality analysis. |
|
✅ Design Decision Gate 🏗️ completed the design decision gate check. |
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
🏗️ Design Decision Gate — ADR RequiredThis PR adds significant changes to 📄 Draft ADR committed:
📋 What to do next
Once an ADR is linked in the PR body, this gate will re-run and verify the implementation matches the decision. ❓ Why ADRs MatterADRs create a searchable, permanent record of why the codebase looks the way it does — here, why the forecast engine is pinned by a formal predicate suite rather than ad-hoc unit tests, and why the fixture 📋 Michael Nygard ADR Format ReferenceAn ADR must contain these four sections to be considered complete:
All ADRs are stored in
|
There was a problem hiding this comment.
Non-blocking observations — fixture fix is correct, test suite has quality gaps.
The one-line fixture change ("total_aic": 0.0054) correctly closes the gap that caused forecast.go:593 to silently skip every fixture-backed run. That change is good and should merge.
Review findings summary
Reliability
runtime.Caller(0)fragility (line 35): breaks P1/P12/P13 if any build step passes-trimpath; use a package-relative path instead sincego testguarantees the working directory.
Test correctness gaps
- P3/P4 are arithmetic tautologies (lines 104–148): neither test calls a production function;
wantLambda/wantYieldentries are defined as the same formula being tested, so they cannot detect engine regressions. - P10 reimplements rather than tests (lines 309–316): the test copies the exact condition from
forecast.go:573-574inline, so a bug in those lines would not be caught.
Clarity
- P9 uses
assert.InDelta(delta=0)(line 253) for a typed constant —assert.Equalis the right tool here.
P2, P5, P6, P7, P8, P9, P11 (invalid-days path), P12, P13 all exercise real engine code and are solid.
🔎 Code quality review by PR Code Quality Reviewer
| func fixtureDir(t *testing.T) string { | ||
| t.Helper() | ||
| _, thisFile, _, ok := runtime.Caller(0) | ||
| require.True(t, ok, "runtime.Caller must return a valid file path") |
There was a problem hiding this comment.
runtime.Caller(0) breaks fixture discovery when built with -trimpath: if the Go toolchain strips source paths (e.g. go test -trimpath or a reproducible-build CI step), thisFile becomes a module-relative virtual path and filepath.Dir resolves to a non-existent directory, causing every fixture-backed test (P1, P12, P13) to fail with a misleading "fixture file must be readable" error.
💡 Suggested fix
Drop the runtime import and use the package working directory that go test guarantees:
func fixtureDir(t *testing.T) string {
t.Helper()
return filepath.Join("..", "..", "specs", "forecast-compliance-fixtures")
}go test always sets the working directory to the package directory (pkg/cli/), so a relative path of ../../specs/forecast-compliance-fixtures is stable and works with any build flags. The runtime.Caller approach is fragile once -trimpath enters the picture.
| func TestFormal_P3_ObservedRateFormula(t *testing.T) { | ||
| cases := []struct { | ||
| sampledRuns int | ||
| historyDays int | ||
| periodDays int | ||
| wantLambda float64 | ||
| }{ | ||
| {sampledRuns: 10, historyDays: 30, periodDays: 30, wantLambda: 10.0}, | ||
| {sampledRuns: 10, historyDays: 30, periodDays: 7, wantLambda: 10.0 / 30.0 * 7}, | ||
| {sampledRuns: 21, historyDays: 7, periodDays: 7, wantLambda: 21.0}, | ||
| {sampledRuns: 0, historyDays: 30, periodDays: 30, wantLambda: 0.0}, | ||
| } | ||
| for _, tc := range cases { | ||
| lambda := float64(tc.sampledRuns) / float64(tc.historyDays) * float64(tc.periodDays) | ||
| assert.InDelta(t, tc.wantLambda, lambda, 1e-9, |
There was a problem hiding this comment.
P3 and P4 are arithmetic tautologies — they test the formula in isolation, not the engine: both tests compute the formula inline in test code and compare against wantXxx values that are themselves defined as the same formula (e.g. wantLambda: 10.0 / 30.0 * 7 is identical to what the loop computes). These tests can never fail regardless of what forecast.go does. If forecastWorkflow were changed to compute ObservedRunsPerPeriod incorrectly (e.g. off-by-one on historyDays), P3 and P4 would remain green.
💡 Why this matters
The test title implies it verifies the engine's observed-rate formula, but neither P3 nor P4 calls any production function. They verify that (a/b)*c == expected in pure test arithmetic, which is trivially true.
To actually guard against engine regressions, the tests need to invoke forecastWorkflow (or a helper that computes ObservedRunsPerPeriod) with a stubbed GitHub API, then assert on the result's ObservedRunsPerPeriod field. That is admittedly harder, but the current tests provide false confidence — a label of 'P3: formal compliance' that doesn't back it up.
At minimum, the tautological wantLambda / wantYield entries should be replaced with hardcoded expected values derived independently of the formula, and the test should acknowledge it is validating formula semantics only (not engine integration).
| // Replicate the derivation logic from forecast.go:573-574. | ||
| if r.Duration == 0 && !r.StartedAt.IsZero() && !r.UpdatedAt.IsZero() { | ||
| r.Duration = r.UpdatedAt.Sub(r.StartedAt) | ||
| } | ||
| assert.Equal(t, tc.wantDuration, r.Duration, | ||
| "P10: %s: Duration must equal UpdatedAt − StartedAt", tc.description) | ||
| } | ||
| } |
There was a problem hiding this comment.
P10 reimplements the engine derivation rather than testing it: the test copies the exact condition from forecast.go:573-574 and applies it directly to a WorkflowRun struct. It asserts that time.Sub works correctly — which it always does — but never calls forecastWorkflow or any function that actually contains the production derivation. If the guard condition (r.Duration == 0 && !r.StartedAt.IsZero() && !r.UpdatedAt.IsZero()) were wrong in forecast.go, this test would still pass because the test re-runs the same copied condition.
💡 Impact and fix direction
The test comment even says "Replicate the derivation logic from forecast.go:573-574" — this is a red flag. A compliance test should exercise the production path, not copy it.
The duration derivation is called inside forecastWorkflow (line 573-574), which requires a stubbed GitHub API to call. If that's too expensive here, the test should at least be retitled to clarify it validates the formula semantics in isolation (TestFormal_P10_DurationArithmetic) and add a TODO noting that the engine integration path is not covered.
| // Specification reference: R-FC-060; forecast_montecarlo.go poissonNormalApproximationThreshold | ||
| func TestFormal_P9_PoissonBranchCrossover(t *testing.T) { | ||
| assert.InDelta(t, 15.0, poissonNormalApproximationThreshold, 0, | ||
| "P9: Poisson crossover threshold must equal 15") |
There was a problem hiding this comment.
assert.InDelta(delta=0) for an exact constant equality is misleading: using InDelta with a delta of 0 is semantically identical to assert.Equal but hides the intent. Readers will scan for the delta value to understand the tolerance being allowed, see 0, and wonder if it is intentional or a copy-paste error. For a constant value like poissonNormalApproximationThreshold (a typed float64 constant, not a computed value), assert.Equal is the correct assertion.
💡 Suggested fix
// Replace:
assert.InDelta(t, 15.0, poissonNormalApproximationThreshold, 0,
"P9: Poisson crossover threshold must equal 15")
// With:
assert.Equal(t, 15.0, poissonNormalApproximationThreshold,
"P9: Poisson crossover threshold must equal 15")There was a problem hiding this comment.
Skills-Based Review 🧠
Applied /tdd and /grill-with-docs — commenting with improvement suggestions. No blocking issues; the fixture fix is correct and the test suite is a valuable addition.
📋 Key Themes & Highlights
Key Themes
- Tautological tests (P3, P4): Two predicates compute their formulas inline in the test body rather than calling production code, making them always-green spec documentation rather than regression guards.
- Logic duplication (P10): The test copies
forecast.go:573-574verbatim instead of calling an extractable helper — future changes to the production condition won't break P10. - Error-string inconsistency (P11): The positive and negative assertions use different substrings, creating a latent fragility.
- Documentation drift: The file header references a Formal Model section that doesn't exist in the README, and the README's "How to Run" section is now stale.
Positive Highlights
- ✅
P12 + fixture fixis the clearest value in this PR: closes a silent data-gap that caused every fixture-backed run to be skipped atforecast.go:593. The regression guard is well-placed. - ✅
P6andP9cover the hardest-to-spot boundary conditions (λ=NaN/±Inf, Poisson crossover at exactly 15) with crisp table-driven cases. - ✅
P8tests both sides of the reliability threshold with exact boundary values (n=9 vs n=10) — a textbook Arrange/Act/Assert structure. - ✅ All 13 predicate docstrings include formal predicate notation and spec-reference cross-links, which makes auditing significantly easier.
- ✅ The fixture change is a one-liner with a clear rationale. P12 + P13 together ensure it can't regress silently.
🧠 Reviewed using Matt Pocock's skills by Matt Pocock Skills Reviewer
| {sampledRuns: 0, historyDays: 30, periodDays: 30, wantLambda: 0.0}, | ||
| } | ||
| for _, tc := range cases { | ||
| lambda := float64(tc.sampledRuns) / float64(tc.historyDays) * float64(tc.periodDays) |
There was a problem hiding this comment.
[/tdd] This test re-implements the production formula inline rather than calling any production code — so it will always pass regardless of what forecastWorkflow actually computes.
The assertion lambda == wantLambda is mathematically tautological: wantLambda is the result of the same expression. If forecast.go:647–648 is ever changed (e.g., a rounding step or an integer-division bug), P3 would still pass.
💡 Suggestion
Extract the lambda computation into a testable helper:
// Extract from forecast.go lines 647-648
func observedRunsPerPeriod(sampledRuns, historyDays, periodDays int) float64 {
return (float64(sampledRuns) / float64(historyDays)) * float64(periodDays)
}
// P3 test then calls the production helper
lambda := observedRunsPerPeriod(tc.sampledRuns, tc.historyDays, tc.periodDays)If extraction is not feasible, at minimum add a historyDays=0 case to document and verify that the engine guards against divide-by-zero / NaN upstream.
| } | ||
| for _, tc := range cases { | ||
| successRate := float64(tc.successCount) / float64(tc.totalRuns) | ||
| yield := successRate * tc.lambda |
There was a problem hiding this comment.
[/tdd] Same concern as P3: both successRate and yield are computed inline in the test, so this assertion verifies a mathematical identity (a/b)*c == (a/b)*c rather than exercising production code.
The production successRate is computed inside runMonteCarlo at line 107 of forecast_montecarlo.go. P2 already provides stronger coverage by calling runMonteCarlo end-to-end. Consider whether P4 adds value over P2, or whether it should instead be a cross-check that a given (successCount, observations, lambda) triple produces a MeanProjectedAIC that lies within an expected range derived from successRate × lambda × meanAIC.
💡 Suggestion
Either remove P4 (since the formula is trivially true) or replace the inline computation with an integration-style check:
// Verify yield influences MC output proportionally
rng := rand.New(rand.NewSource(42))
mc100 := runMonteCarlo(aicObs, 10 /* 100% success */, 10.0, rng)
rng2 := rand.New(rand.NewSource(42))
mc50 := runMonteCarlo(aicObs, 5 /* 50% success */, 10.0, rng2)
assert.InDelta(t, mc100.MeanProjectedAIC*0.5, mc50.MeanProjectedAIC, mc100.MeanProjectedAIC*0.1,
"P4: 50%% success should yield ~half the mean AIC of 100%% success")| } | ||
| // Replicate the derivation logic from forecast.go:573-574. | ||
| if r.Duration == 0 && !r.StartedAt.IsZero() && !r.UpdatedAt.IsZero() { | ||
| r.Duration = r.UpdatedAt.Sub(r.StartedAt) |
There was a problem hiding this comment.
[/tdd] The comment on line 309 says "Replicate the derivation logic from forecast.go:573-574" — and that is exactly the problem. This test copies production code verbatim rather than calling it, so a change to the production condition (e.g., adding && !r.UpdatedAt.Before(r.StartedAt) to guard negative durations) would leave P10 silently passing with the old logic.
This also misses an important edge case: updatedAt < startedAt (e.g., a clock skew or data corruption scenario). The current code would compute a negative duration without any guard.
💡 Suggestion
Extract computeDuration(r *WorkflowRun) from forecast.go:571–574 so P10 can call the real helper:
func computeDuration(r *WorkflowRun) {
if r.Duration == 0 && !r.StartedAt.IsZero() && !r.UpdatedAt.IsZero() {
r.Duration = r.UpdatedAt.Sub(r.StartedAt)
}
}
// P10 test:
computeDuration(&r)Also add a negative-duration case:
{
startedAt: time.Date(2026, 5, 1, 12, 0, 0, 0, time.UTC),
updatedAt: time.Date(2026, 5, 1, 11, 0, 0, 0, time.UTC), // before start
description: "inverted timestamps (clock skew / bad data)",
}| // The call may fail for other reasons (no GitHub auth, no workflows), but must | ||
| // not fail with the days-validation error. | ||
| if err != nil { | ||
| assert.NotContains(t, err.Error(), "invalid days value", |
There was a problem hiding this comment.
[/tdd] The negative check for valid days values looks for "invalid days value", but the positive check for invalid days values (line 329) asserts the error contains "must be 7 or 30". These are different strings.
If the production error message changes to something like "invalid days value: must be 7 or 30", this negative assertion could fail for the wrong reason or silently stop being useful if the message format changes in the opposite direction.
💡 Fix
Mirror the same substring used in the positive check:
// Valid days must NOT return the days-validation error
if err != nil {
assert.NotContains(t, err.Error(), "must be 7 or 30",
"P11: days=%d is valid and must not trigger the days-validation error", days)
}| "P7: nil AIC observations must return nil (R-MC-011)") | ||
|
|
||
| // Empty observations → runMonteCarlo must return nil. | ||
| assert.Nil(t, runMonteCarlo([]int{}, 0, 5.0, rng), |
There was a problem hiding this comment.
[/tdd] P7 tests nil and []int{} but misses the case of a non-empty slice where all AIC values are zero: []int{0, 0, 0}. According to forecast.go:593, the engine filters runAIC ≤ 0 → continue at the call site before populating the aicObservations slice. If that filter is accidentally removed or bypassed, runMonteCarlo([]int{0,0,0}, ...) would receive a non-nil, non-empty slice and proceed — covering this in P7 would catch the regression.
💡 Suggestion
// If the engine ever passes zero-AIC observations through, MC should still return
// a near-zero (not nil) result — or document explicitly if nil is expected here.
rng3 := rand.New(rand.NewSource(42))
mc := runMonteCarlo([]int{0, 0, 0}, 3, 5.0, rng3)
// Assert the expected behaviour — either nil (if the engine guards inside runMonteCarlo)
// or a near-zero mean (if filtering is the caller's responsibility).
assert.InDelta(t, 0.0, mc.MeanProjectedAIC, 1e-9,
"P7: all-zero AIC observations → MC mean must be zero")This also serves as a regression guard for the fix in P12.
| // - TLA+ invariants: P1, P5, P7, P8, P9 | ||
| // - F* pre/post conditions: P2, P3, P4, P6, P10, P11 | ||
| // - Z3-SMT schema gap: P12, P13 | ||
|
|
There was a problem hiding this comment.
[/grill-with-docs] The comment references a "Formal Model" section of issue #40114 and classifies predicates by TLA+, F*, and Z3-SMT, but specs/forecast-compliance-fixtures/README.md does not contain a Formal Model section or any TLA+/F*/Z3 content. This creates a documentation trail that leads nowhere.
Additionally, the README's "How to Run Compliance Tests" section still points to forecast_montecarlo_test.go and forecast_test.go and uses go test -run TestMonteCarlo, but the new tests use the TestFormal_* prefix in a new file — the README needs to be updated to stay accurate.
💡 Two small fixes
-
Either update
specs/forecast-compliance-fixtures/README.mdto add the Formal Model section (or a link to the issue) so the reference resolves, or simplify this header comment to remove the TLA+/F*/Z3 classification since those artifacts don't exist in the repo. -
Add to the README's "How to Run Compliance Tests":
# Run the formal compliance predicates (P1–P13):
go test -v -run "TestFormal_" ./pkg/cli/And update the file list to include forecast_compliance_fixtures_formal_test.go.
|
…ix stale frontmatter hash The 9441310 commit on main updated the steps in daily-security-observability.md but manually patched the lock file without recompiling. Since gh-aw workflow steps are part of the frontmatter, the frontmatter_hash in the lock file became stale (a869be8d... vs correct 0951baf3...). The CI checks out the merge commit (PR branch merged with main), sees the updated md but old lock hash, and TestHashConsistencyAcrossLockFiles fails. Fix: pull the updated md from main and recompile to get the correct hash. Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Fixed in commit The root cause: commit Fix: pulled the updated |
|
@copilot merge main and recompile |
…orecast-compliance-fixtures-readme # Conflicts: # .github/workflows/daily-security-observability.lock.yml Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
…versions (0.27.4→0.27.6) Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Done. Merged |
The canonical
run_summary_minimal.jsonfixture settotal_effective_tokensbut omittedtotal_aic. Sinceforecast.go:593gates onrunAIC ≤ 0 → continue, every fixture-backed run was silently skipped, producing no Monte Carlo input for T-FC-031–T-FC-040.Fixture fix
Added
"total_aic": 0.0054totoken_usage_summaryinspecs/forecast-compliance-fixtures/run_summary_minimal.json, closing the field-name gap between what the fixture published and what the engine reads (TokenUsage.TotalAIC).Formal test suite
New
pkg/cli/forecast_compliance_fixtures_formal_test.go— 13 predicates mapped to the formal model inspecs/forecast-compliance-fixtures/README.md:successCount=0collapses MC mean to zero regardless of λλ = (n/h) × periodDaystable-drivenyield = successRate × observedRunsPerPeriodresult.Iterations == monteCarloIterations(10 000)nil(R-MC-001/004)nil(R-MC-011/032)IsReliable=false; n≥10 →true(R-MC-030)useNormalApproximationForPoisson: exact boundary at λ=15 (R-FC-060)Duration = UpdatedAt − StartedAtreplicatesforecast.go:573-574days ∉ {7,30}→RunForecastreturns error with correct messagetotal_aic > 0post-fix; guards against regression of the gaprun, andtoken_usage_summaryfields present