Improve Copilot harness classification for opaque exitCode=1 failures#39959
Conversation
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
|
✅ Design Decision Gate 🏗️ completed the design decision gate check. No ADR enforcement needed: PR #39959 does not have the 'implementation' label and has 0 new lines of code in business logic directories (threshold 100). |
|
✅ Test Quality Sentinel completed test quality analysis. |
|
✅ PR Code Quality Reviewer completed the code quality review. |
|
🧠 Matt Pocock Skills Reviewer has completed the skills-based review. ✅ |
There was a problem hiding this comment.
Pull request overview
This PR improves diagnosability of previously opaque Copilot harness failures (notably exitCode=1 cases) by introducing an explicit failureClass and emitting a bounded outputTail, while also expanding Copilot/CAPI quota detection to recognize additional 429 “Too Many Requests” variants.
Changes:
- Extend Copilot/CAPI quota detection to match
CAPIError: Too Many Requests(and update tests + retry guard expectations accordingly). - Add new Copilot-specific failure classifiers for SDK
session.idletimeouts and MCP gateway shutdown messages. - Emit a compact
outputTailfor failed attempts and log a namedfailureClassalongside existing boolean flags.
Show a summary per file
| File | Description |
|---|---|
| actions/setup/js/detect_agent_errors.cjs | Expands the quota/rate-limit regex used for post-run log scanning (isCAPIQuotaExceededError). |
| actions/setup/js/detect_agent_errors.test.cjs | Updates unit tests to cover the new “Too Many Requests” matching behavior. |
| actions/setup/js/copilot_harness.cjs | Adds failure classification + output tail extraction and logs failureClass/outputTail on failed attempts. |
| actions/setup/js/copilot_harness.test.cjs | Updates quota/rate-limit retry expectations and adds tests for the new helper classifiers/tail extraction. |
Copilot's findings
Tip
Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Files reviewed: 4/4 changed files
- Comments generated: 2
| // Pattern: Copilot/CAPI quota exhaustion. | ||
| // Matches the observed error: "CAPIError: 429 429 quota exceeded". | ||
| // Quota exhaustion is a persistent, non-retryable condition. | ||
| const CAPI_QUOTA_EXCEEDED_PATTERN = /CAPIError:\s*429\s+429\s+quota exceeded/i; | ||
| const CAPI_QUOTA_EXCEEDED_PATTERN = /CAPIError:\s*(?:429\s+)?(?:429\s+quota exceeded|Too Many Requests)/i; |
| * isAuthErr?: boolean, | ||
| * isAuthenticationFailed?: boolean, | ||
| * isCAPIError?: boolean, | ||
| * isMCPGatewayShutdown?: boolean, | ||
| * isMCPPolicy?: boolean, |
🧪 Test Quality Sentinel Report✅ Test Quality Score: 90/100 — Excellent
📊 Metrics & Test Classification (7 tests analyzed)
JavaScript: 7 ( Verdict
References: §27740021565
|
There was a problem hiding this comment.
Skills-Based Review 🧠
Applied /diagnose and /tdd — no blocking issues; a few targeted improvements suggested.
📋 Key Themes & Highlights
Key Themes
MCP_GATEWAY_SHUTDOWN_PATTERNbreadth — the bare substring/Gateway shutdown initiated/could match informational teardown logs on clean exits; worth tightening to require corroborating JSON fields from the observed payload.extractOutputTailslice(-0)footgun — silent misbehaviour whenmaxChars < 2; unreachable with the default of 600 but easy to guard.- Stale inline comment in
detect_agent_errors.cjs— doesn't document the newCAPIError: Too Many Requestsform that the pattern now matches. - Retry behaviour untested for new classes —
sdk_session_idle_timeoutandmcp_gateway_shutdownintentionally fall through topartial_executionretries, but there are noshouldRetrytests capturing that intent.
Positive Highlights
- ✅ The quota pattern fix (
detect_agent_errors.cjs) is minimal, correct, and backed by assertion-flipped tests in both test files — exactly the right footprint for a bug fix. - ✅
classifyCopilotFailurecleanly replaces an implicit matrix of booleans with a readable priority chain; JSDoc is thorough. - ✅
extractOutputTailis well-bounded (600 chars / 12 lines), handles null bytes and CRLF, and is covered by the key truncation cases. - ✅ The PR description precisely documents the observable before/after — making the intent reviewable without needing to run the code.
🧠 Reviewed using Matt Pocock's skills by Matt Pocock Skills Reviewer
| @@ -60,7 +60,7 @@ const MODEL_NOT_SUPPORTED_PATTERN = | |||
| // Pattern: Copilot/CAPI quota exhaustion. | |||
| // Matches the observed error: "CAPIError: 429 429 quota exceeded". | |||
There was a problem hiding this comment.
[/diagnose] The comment still only documents the original CAPIError: 429 429 quota exceeded form, but the updated pattern also matches CAPIError: 429 Too Many Requests and CAPIError: Too Many Requests. A future reader will have an incomplete picture of what the pattern detects.
💡 Suggested comment update
// Matches Copilot/CAPI quota exhaustion in observed forms:
// "CAPIError: 429 429 quota exceeded" – original observed form
// "CAPIError: 429 Too Many Requests" – HTTP status form
// "CAPIError: Too Many Requests" – statusless form seen in retry wrapper output
// Quota exhaustion is a persistent, non-retryable condition.| // Pattern: Copilot SDK driver timed out waiting for the session to become idle. | ||
| const SDK_SESSION_IDLE_TIMEOUT_PATTERN = /Timeout after \d+ms waiting for session\.idle/; | ||
| // Pattern: MCP gateway shutdown surfaced in agent output. | ||
| const MCP_GATEWAY_SHUTDOWN_PATTERN = /Gateway shutdown initiated/; |
There was a problem hiding this comment.
[/diagnose] /Gateway shutdown initiated/ is a bare substring match that fires on any log line containing those three words — including informational teardown messages that MCP servers may emit on a clean exit. If a successful teardown log surfaces in a failed run's combined output (e.g., because the gateway correctly wound down after the copilot process crashed), the failure gets mis-classified as mcp_gateway_shutdown even when the root cause is something else.
💡 Tighten the pattern
The test fixture uses a JSON envelope: '{"message":"Gateway shutdown initiated","serversTerminated":2,"status":"closed"}'. Requiring at least one other field from that envelope reduces false positives from bare log lines:
// Require accompanying serversTerminated or status:closed to avoid matching
// informational teardown messages on clean exits.
const MCP_GATEWAY_SHUTDOWN_PATTERN = /Gateway shutdown initiated[\s\S]{0,150}(?:serversTerminated|"status"\s*:\s*"closed")/;If the JSON format isn't guaranteed, even anchoring to the error path by requiring exitCode=1 context is already done by only calling this inside the failure branch — but the pattern itself is the first line of defence.
| if (tailLines.length === 0) return ""; | ||
| let tail = tailLines.join("\n"); | ||
| if (tail.length > maxChars) { | ||
| tail = `…${tail.slice(-(maxChars - 1))}`; |
There was a problem hiding this comment.
[/tdd] tail.slice(-(maxChars - 1)) silently misbehaves when maxChars === 1: the expression becomes tail.slice(-0), and in JavaScript slice(-0) is equivalent to slice(0) — it returns the entire string. The result is "..." + tail, which is longer than the requested maxChars. Unreachable with the default of 600, but a unit test would catch this and a guard removes the footgun for any caller that passes a small explicit value.
💡 Guard suggestion
if (tail.length > maxChars) {
const keep = maxChars - 1; // characters available after the ellipsis
tail = keep > 0 ? `...${tail.slice(-keep)}` : "...";
}And a regression test:
it("never exceeds maxChars even when maxChars is 1", () => {
const tail = extractOutputTail("abc", { maxLines: 5, maxChars: 1 });
expect(tail.length).toBeLessThanOrEqual(1);
});| @@ -239,6 +254,30 @@ describe("copilot_harness.cjs", () => { | |||
| expect(shouldRetry(result, 1, true, 1)).toBe(false); | |||
| }); | |||
There was a problem hiding this comment.
[/tdd] The new tests confirm that sdk_session_idle_timeout and mcp_gateway_shutdown are correctly named, but there are no shouldRetry tests verifying what the harness does with them at retry time. Per the PR description, both classes intentionally fall through to the partial_execution retry path when hasOutput: true. Without an explicit test, a future guard clause that accidentally makes them non-retryable would go undetected.
💡 Suggested regression tests
it("retries SDK session.idle timeout as partial execution", () => {
const result = {
exitCode: 1,
hasOutput: true,
output: "[copilot-sdk-driver] Timeout after 60000ms waiting for session.idle",
};
expect(shouldRetry(result, 0)).toBe(true);
});
it("retries MCP gateway shutdown as partial execution", () => {
const result = {
exitCode: 1,
hasOutput: true,
output: '{"message":"Gateway shutdown initiated","serversTerminated":1,"status":"closed"}',
};
expect(shouldRetry(result, 0)).toBe(true);
});These tests serve as executable documentation of the intentional retry-through behaviour described in the PR.
There was a problem hiding this comment.
REQUEST_CHANGES — two correctness issues and three follow-ons need addressing before merge
Blocking issues:
CAPI_QUOTA_EXCEEDED_PATTERNnow silently makes transient HTTP 429 throttles non-retryable, potentially aborting all retry attempts on a recoverable condition.classifyCopilotFailurepriority order buriessdk_session_idle_timeoutandmcp_gateway_shutdownunderpermission_denied, defeating the PR's stated goal of surfacing previously opaque failures.MCP_GATEWAY_SHUTDOWN_PATTERN(/Gateway shutdown initiated/) is insufficiently anchored and will misclassify any non-MCP shutdown message.
Non-blocking but worth fixing:
extractOutputTaildoesn't normalize bare\r, garbling log output from progress-bar-style processes.- The inline comment and module JSDoc for
CAPI_QUOTA_EXCEEDED_PATTERNstill only document the original double-429 form; the new forms it matches are undocumented. filter(Boolean)drops blank lines beforeslice(-maxLines), somaxLinessilently counts non-empty lines rather than total lines.
🔎 Code quality review by PR Code Quality Reviewer
| // Pattern: Copilot SDK driver timed out waiting for the session to become idle. | ||
| const SDK_SESSION_IDLE_TIMEOUT_PATTERN = /Timeout after \d+ms waiting for session\.idle/; | ||
| // Pattern: MCP gateway shutdown surfaced in agent output. | ||
| const MCP_GATEWAY_SHUTDOWN_PATTERN = /Gateway shutdown initiated/; |
There was a problem hiding this comment.
MCP_GATEWAY_SHUTDOWN_PATTERN is too generic — will misclassify any process that emits this string as an MCP failure.
💡 Details and suggested fix
The current pattern:
const MCP_GATEWAY_SHUTDOWN_PATTERN = /Gateway shutdown initiated/;This matches any substring Gateway shutdown initiated regardless of source. Any API server, cloud SDK, or test fixture that emits this message will trigger mcp_gateway_shutdown, masking the real failure class.
The test shows the actual observed message is a JSON blob from the MCP driver:
{"message":"Gateway shutdown initiated","serversTerminated":2,"status":"closed"}
Anchor to the JSON key to reduce false positives:
// Matches the JSON payload the MCP gateway driver emits on shutdown.
const MCP_GATEWAY_SHUTDOWN_PATTERN = /"message"\s*:\s*"Gateway shutdown initiated"/;Without this specificity, any verbose CI log from any gateway component flips this flag and swallows the true failure class.
| // Matches the observed error: "CAPIError: 429 429 quota exceeded". | ||
| // Quota exhaustion is a persistent, non-retryable condition. | ||
| const CAPI_QUOTA_EXCEEDED_PATTERN = /CAPIError:\s*429\s+429\s+quota exceeded/i; | ||
| const CAPI_QUOTA_EXCEEDED_PATTERN = /CAPIError:\s*(?:429\s+)?(?:429\s+quota exceeded|Too Many Requests)/i; |
There was a problem hiding this comment.
CAPI_QUOTA_EXCEEDED_PATTERN now treats any CAPIError: Too Many Requests as non-retryable quota exhaustion — a transient throttle will burn the entire retry budget.
💡 Details and suggested fix
Old (narrow):```js
/CAPIError:\s*429\s+429\s+quota exceeded/i
New (broad):
```js
/CAPIError:\s*(?:429\s+)?(?:429\s+quota exceeded|Too Many Requests)/i
A standard HTTP 429 from a transiently rate-limited backend is now indistinguishable from true per-account quota exhaustion. Since isQuotaExceeded triggers an immediate break with no retries, a single transient 429 will abort all remaining attempts — the opposite of the intended retry behavior for recoverable conditions.
The test string "retried 5 times. Last error: CAPIError: Too Many Requests" contains the SDK's internal retry count as evidence of persistence, but the regex matches CAPIError: Too Many Requests anywhere in output with no such context requirement.
Options:
- Separate the two semantics: add a distinct
isTransientRateLimitErrorthat matches bareToo Many Requestsand retries with back-off, while reservingcapi_quota_exceededforquota exceededforms. - Accept the broadening but update the inline comment (line 61–62) and the module-level JSDoc (line 20) to explicitly document that both throttling and quota exhaustion are treated as non-retryable — so the trade-off is visible to future reviewers.
| if (detection.isNullTypeToolCall) return "null_type_tool_call"; | ||
| if (detection.isAuthErr) return "no_auth_info"; | ||
| if (detection.isAuthenticationFailed) return "authentication_failed"; | ||
| if (detection.hasNumerousPermissionDenied) return "permission_denied"; |
There was a problem hiding this comment.
hasNumerousPermissionDenied outranks isSDKSessionIdleTimeout and isMCPGatewayShutdown in the priority chain — the new failure classes the PR is meant to surface get swallowed by the existing one.
💡 Details and suggested fix
Current order:
if (detection.hasNumerousPermissionDenied) return "permission_denied"; // line 336
if (detection.isSDKSessionIdleTimeout) return "sdk_session_idle_timeout"; // line 337
if (detection.isMCPGatewayShutdown) return "mcp_gateway_shutdown"; // line 338A Copilot SDK run that hits a session.idle timeout often also generates numerous permission-denied log lines as a side-effect of the shutdown sequence. Under the current ordering, classifyCopilotFailure returns "permission_denied" in that case — precisely the opaque result this PR is trying to eliminate.
The same problem applies to mcp_gateway_shutdown.
Suggested reorder — put the specific new signal classes above the generic volume-based one:
if (detection.isSDKSessionIdleTimeout) return "sdk_session_idle_timeout";
if (detection.isMCPGatewayShutdown) return "mcp_gateway_shutdown";
if (detection.hasNumerousPermissionDenied) return "permission_denied";This aligns with the PR's stated goal: make previously opaque failures directly attributable from the harness log alone.
| if (typeof output !== "string" || !output) return ""; | ||
| const maxChars = options?.maxChars ?? OUTPUT_TAIL_MAX_CHARS; | ||
| const maxLines = options?.maxLines ?? OUTPUT_TAIL_MAX_LINES; | ||
| const normalized = output.replace(/\0/g, "").replace(/\r\n/g, "\n").trim(); |
There was a problem hiding this comment.
extractOutputTail normalizes \r\n but leaves bare \r intact — progress-bar-style output will garble the logged tail.
💡 Details and suggested fix
const normalized = output.replace(/\0/g, "").replace(/\r\n/g, "\n").trim();Some child processes (progress bars, build tools, terminal-mode programs) use bare \r to overwrite the current line in place. Only \r\n is normalized here; lone \r characters survive into tailLines. When the tail is joined on \n and emitted as JSON.stringify(outputTail), the \r causes the beginning of the line to be visually overwritten in the CI log, hiding the context the tail was meant to surface.
Fix — add a lone-CR pass after the \r\n normalization:
const normalized = output
.replace(/\0/g, "")
.replace(/\r\n/g, "\n")
.replace(/\r/g, "\n") // bare CR → LF
.trim();| @@ -60,7 +60,7 @@ const MODEL_NOT_SUPPORTED_PATTERN = | |||
| // Pattern: Copilot/CAPI quota exhaustion. | |||
| // Matches the observed error: "CAPIError: 429 429 quota exceeded". | |||
There was a problem hiding this comment.
Stale comment — after the regex change it now matches three distinct forms, but the comment still only documents the original double-429 error string.
💡 Suggested update
Current (stale):
// Matches the observed error: "CAPIError: 429 429 quota exceeded".
// Quota exhaustion is a persistent, non-retryable condition.The new regex also matches:
CAPIError: 429 Too Many RequestsCAPIError: Too Many RequestsCAPIError: 429 quota exceeded(single 429 + quota exceeded, as a side-effect of the optional group)
Suggested update:
// Matches Copilot/CAPI quota exhaustion and rate-limit responses:
// "CAPIError: 429 429 quota exceeded" (original observed form)
// "CAPIError: 429 Too Many Requests" (HTTP 429 form)
// "CAPIError: Too Many Requests" (no status code in message)
// All forms are treated as non-retryable.The module-level JSDoc at line 20 (capi_quota_exceeded_error description) also needs the same update.
| const tailLines = normalized | ||
| .split("\n") | ||
| .map(line => line.trimEnd()) | ||
| .filter(Boolean) |
There was a problem hiding this comment.
filter(Boolean) removes blank lines before slice(-maxLines) — maxLines counts non-empty lines, not total lines, so callers silently get fewer context lines than requested.
💡 Details
.filter(Boolean) // blank lines dropped here
.slice(-maxLines); // maxLines applied to surviving lines onlyIf the last section of output is:
Error: timeout\n\n\nStack trace line 1\nStack trace line 2
...after filter(Boolean) you have 3 non-empty lines. A caller requesting maxLines: 2 sees only the two stack trace lines with no indication that the separating blank lines — and thus the visual structure of the error — were dropped.
If blank lines should be preserved, remove filter(Boolean). If they should be collapsed, document that maxLines applies after filtering so callers aren't surprised by unexpectedly thin output.
|
@copilot run pr-finisher skill |
…Error to isTransientCAPIError Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Done. Addressed both reviewer threads in commit
All 347 JS test files and Go unit tests pass locally. |
|
@copilot review comments from GitHub actions bot |
…re priority, fix bare-CR/slice-0, improve comments, add regression tests Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Addressed all GitHub Actions bot threads. Changes in the latest commits:
|
Unclassified Copilot harness failures were exhausting retries with
exitCode=1while every built-in classifier stayed false, leaving no actionable cause in the run output. This was especially visible in the chronic Daily Model Inventory failures and AOAI smoke failures, where the underlying signal was present in stderr/output but never surfaced as a distinct failure class.Failure annotations
failureClassto Copilot harness failure logs instead of relying only on a matrix of booleans.outputTailfor failed attempts so the last useful stderr/stdout context is preserved in the failure summary path.New Copilot-specific classifications
session.idletimeouts assdk_session_idle_timeout.mcp_gateway_shutdownwhen they appear in agent output.429 / rate-limit handling
CAPIError: Too Many Requestsin addition to the existing429 429 quota exceededform.Representative output
This keeps the retry behavior intact for partial executions while making previously opaque failures directly attributable from the harness log alone.