Problem statement
The Copilot CLI engine (copilot-harness) exhausts all retries and exits with exitCode=1 while every built-in error classifier reports false (isCAPIError400=false isCAPIQuotaExceededError=false isMCPPolicyError=false isModelNotSupportedError=false isNullTypeToolCallError=false isAuthError=false isAuthenticationFailedError=false permissionDeniedCount=0 hasNumerousPermissionDenied=false) and hasOutput=true. The agent produces output and tool calls, but the harness reports failure with no actionable classification, so the run fails opaquely. This is distinct from the previously tracked Codex model-404 and Copilot BYOK model-not-supported (400) clusters: those set a specific classifier, whereas here all classifiers are false.
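The signature described above can be detected mechanically. The following is a minimal sketch, assuming the harness emits the `key=value` attempt line quoted in this issue; `parse_attempt_line` and `is_unclassified_failure` are hypothetical helper names, not part of copilot-harness.

```python
import re

# Boolean classifier flags as they appear in the harness attempt line.
CLASSIFIER_FLAGS = (
    "isCAPIError400",
    "isCAPIQuotaExceededError",
    "isMCPPolicyError",
    "isModelNotSupportedError",
    "isNullTypeToolCallError",
    "isAuthError",
    "isAuthenticationFailedError",
    "hasNumerousPermissionDenied",
)


def parse_attempt_line(line: str) -> dict:
    """Decode 'key=value' tokens into a dict (bools and ints converted)."""
    fields = {}
    for key, raw in re.findall(r"(\w+)=(\S+)", line):
        if raw in ("true", "false"):
            fields[key] = raw == "true"
        elif raw.isdigit():
            fields[key] = int(raw)
        else:
            fields[key] = raw
    return fields


def is_unclassified_failure(fields: dict) -> bool:
    """Non-zero exit, every classifier false, and no permission denials."""
    return (
        fields.get("exitCode", 0) != 0
        and not any(fields.get(flag, False) for flag in CLASSIFIER_FLAGS)
        and fields.get("permissionDeniedCount", 0) == 0
    )


# Attempt line as quoted in this issue.
LINE = (
    "attempt 3/4 failed: exitCode=1 isCAPIError400=false "
    "isCAPIQuotaExceededError=false isMCPPolicyError=false "
    "isModelNotSupportedError=false isNullTypeToolCallError=false "
    "isAuthError=false isAuthenticationFailedError=false "
    "permissionDeniedCount=0 hasNumerousPermissionDenied=false hasOutput=true"
)

print(is_unclassified_failure(parse_attempt_line(LINE)))
```

A check like this only identifies the opaque case; it says nothing about the cause, which still has to come from the captured output.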
Affected workflows and run IDs
Daily Model Inventory Checker: §27728272202 (2026-06-18 00:19 UTC, main, schedule; 880k tokens, 35 turns). Chronic: this workflow has failed on main every day for at least 8 consecutive days (earlier runs §27657051775, §27585387447, §27516779860, §27483283602, §27450621036, §27386031918, §27315084790), with no successful baseline run available for comparison.
[workflow name and run ID missing] (branch copilot/instructions-sync). Identical harness signature: attempt 4 failed: exitCode=1 ... <all classifiers false> ... retriesRemaining=0, 2 turns / 58.5k tokens.
Probable root cause
The CLI exits non-zero for a reason outside the harness's known classifier set. In the Model Inventory run the MCP gateway logged {"message":"Gateway shutdown initiated","serversTerminated":2,"status":"closed"} immediately before the harness reported exit 1, which suggests an unclean session teardown (MCP gateway/sidecar shutdown race) rather than an LLM/model error.
Because no classifier matches, retry/repair logic cannot distinguish a transient teardown from a real failure; all 3 retries reproduce the same bare exit 1, and the failure surfaces with no cause.
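One way to tell a teardown race from a real failure is to look for the gateway shutdown event just before the failing attempt. The sketch below assumes the gateway and harness logs are available as one ordered list of lines, which this issue does not confirm; the function name, the `exit_marker` default, and the `window` size are all hypothetical. A match indicates correlation only, not proof of a race.

```python
import json

SHUTDOWN_MESSAGE = "Gateway shutdown initiated"


def gateway_shutdown_before_exit(log_lines, exit_marker="exitCode=1", window=5):
    """Return True if a gateway shutdown event appears within `window`
    lines before the first line that contains `exit_marker`."""
    for index, line in enumerate(log_lines):
        if exit_marker not in line:
            continue
        for candidate in log_lines[max(0, index - window):index]:
            try:
                event = json.loads(candidate)
            except ValueError:
                continue  # not a JSON line
            if isinstance(event, dict) and event.get("message") == SHUTDOWN_MESSAGE:
                return True
        return False  # only the first failing attempt is inspected
    return False


lines = [
    '{"message":"Gateway shutdown initiated","serversTerminated":2,"status":"closed"}',
    "attempt 4 failed: exitCode=1 retriesRemaining=0",
]
print(gateway_shutdown_before_exit(lines))
```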
Proposed remediation
On unclassified exitCode=1, persist and surface the Copilot CLI stderr / last-output tail into the run's failure summary (the harness already records hasOutput=true — capture that content).
Add explicit handling for clean-exit-with-output and for MCP-gateway-shutdown races, so these are either retried correctly or reported as a distinct, named failure class instead of a bare exit 1.
Triage the chronic Daily Model Inventory failure directly: an 8+ day unbroken failure on main with no baseline indicates a workflow-definition or persistent engine issue specific to that workflow — reproduce with verbose Copilot logging.
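The first remediation item (persisting the stderr tail) could look roughly like the sketch below. It is not the harness implementation: `run_and_summarize`, the `failureClass` label, and the summary layout are hypothetical, and a small Python child process stands in for the Copilot CLI.

```python
import subprocess
import sys


def run_and_summarize(cmd, tail_lines=20):
    """Run `cmd`; on a non-zero exit return a summary holding the last
    `tail_lines` lines of stderr, otherwise return None."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode == 0:
        return None
    lines = result.stderr.splitlines()
    tail = lines[-tail_lines:] if tail_lines > 0 else []
    return {
        "exitCode": result.returncode,
        "failureClass": "unclassified",  # hypothetical label
        "stderrTail": tail,
    }


# Stand-in for the CLI: writes one line to stderr and exits with code 1.
child = [
    sys.executable,
    "-c",
    "import sys; print('CAPIError: 429', file=sys.stderr); sys.exit(1)",
]
summary = run_and_summarize(child)
print(summary)
```

Keeping only a bounded tail avoids copying a full multi-hundred-thousand-token transcript into the annotation while still exposing the last error line.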
Success criteria / verification
An unclassified exitCode=1 failure includes the Copilot CLI stderr/output tail in its failure annotation.
Daily Model Inventory Checker produces at least one successful (or correctly-classified) run on main within a subsequent 48h window.
No copilot-harness exitCode=1 with all-classifiers-false recurs across Copilot-engine workflows over a 24h window without an attached root cause.
Other lower-priority clusters observed this 6h window (no separate issue filed)
Claude max-invocations cap (P2) — Design Decision Gate 🏗️ §27724121419 (PR branch copilot/fix-orphaned-entries-actions-lock): isRateLimitError=true + API Error: Request rejected (429) · Maximum LLM invocations exceeded (20 / 20). Hit the per-run 20-invocation budget; single occurrence on a dev branch.
BYOK Ollama proxy 503 (P2): Daily BYOK Ollama Test §27724841546 (main, schedule): awf-reflect: models fetch returned 503 for (apiproxy/redacted), exit 1 after 52s; BYOK model backend/proxy unavailable.
safe_outputs on PR/test branches (P2): Smoke Copilot/Claude §27726924811, §27726899549, §27718500741, §27718402567. The agent job succeeds but Process Safe Outputs fails because emitted items target unresolvable contexts (add_comment to discussion #335, "Welcome to Agentic Workflows!", returns "not found"; resolve_pull_request_review_thread returns "Resource not accessible by integration"; push_to_pull_request_branch returns "No patch file"). These are smoke tests on PR/dev branches (e.g. PR #39900, "Fix phantom asset failures: align safe-outputs staging path with RUNNER_TEMP", branch copilot/aw-failures-fix-upload-assets), not production defects.
[cluster name and run ID missing]: git lfs fetch exit code 2 after 3 retries; transient infra (prior cycle's #39856, "[aw] Documentation Unbloat failed").
Parent report: #39883. Analyzed runs: 27728272202, 27715405170. Comparator diff for the separate upload-assets cluster (#39885): 27579121502 → 27721789885.
Related to #39883
6h-window recurrence update — 2026-06-18 08:26 UTC
Still active on main. Three Copilot-engine workflows failed at agent / Execute GitHub Copilot CLI this window with the unclassified exitCode=1 signature: two on main (schedule) and one on branch copilot/fix-mcp-container-tmp-mount.
New evidence: the unclassified exit 1 hides a quota error the classifier misses. Code Simplifier §27737401463: the harness records attempt 3/4 failed: exitCode=1 isCAPIError400=false isCAPIQuotaExceededError=false isMCPPolicyError=false isModelNotSupportedError=false isNullTypeToolCallError=false isAuthError=false isAuthenticationFailedError=false permissionDeniedCount=0 hasNumerousPermissionDenied=false hasOutput=true (all classifiers false), yet the captured output one line above is Failed to get response from the AI model; retried 5 times ... Last error: CAPIError: 429 Maximum LLM invocations exceeded (52 / 50). So at least one instance of this cluster is a per-run LLM-invocation cap (52/50) that isCAPIQuotaExceededError fails to flag. This is a concrete classifier gap and directly validates proposed remediation items 1 and 2: classify 429 Maximum LLM invocations exceeded as a quota/cap failure and surface the captured output tail rather than reporting a bare exit 1.
Second mode confirmed — permission-denial saturation. Daily Compiler Quality Check §27735045174: exitCode=1 with permissionDeniedCount=11 hasNumerousPermissionDenied=true (34 bash + 3 view tool calls before failure; firewall clean, 37/37 allowed to api.githubcopilot.com). This is the hasNumerousPermissionDenied mode already noted in this cluster, now recurring on main.
No successful baseline for Code Simplifier or Daily Compiler Quality Check on main in this window; both are schedule-driven and reproduce the same signature. Keep open.
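The classifier gap reported in this update could be closed with a pattern match on the captured output. The sketch below uses only the two message forms quoted in this issue; the function name and the `llm_invocation_cap` label are hypothetical, and how this would hook into the existing isCAPIQuotaExceededError logic is not known from this report.

```python
import re

# Matches "Maximum LLM invocations exceeded (52 / 50)" and captures both counts.
INVOCATION_CAP = re.compile(
    r"Maximum LLM invocations exceeded \((\d+)\s*/\s*(\d+)\)"
)


def classify_invocation_cap(output: str):
    """Return a named failure class if the output reports the per-run
    LLM-invocation cap, otherwise None."""
    match = INVOCATION_CAP.search(output)
    if match is None:
        return None
    return {
        "failureClass": "llm_invocation_cap",  # hypothetical label
        "used": int(match.group(1)),
        "limit": int(match.group(2)),
    }


copilot_output = (
    "Failed to get response from the AI model; retried 5 times ... "
    "Last error: CAPIError: 429 Maximum LLM invocations exceeded (52 / 50)"
)
claude_output = (
    "API Error: Request rejected (429) · "
    "Maximum LLM invocations exceeded (20 / 20)"
)
print(classify_invocation_cap(copilot_output))
print(classify_invocation_cap(claude_output))
```

Matching on the message text rather than the bare 429 status keeps ordinary rate limits separate from the per-run invocation budget.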
References: §27737401463 · §27735045174 · §27737630154