[aw-failures] [aw] Copilot CLI exits 1 with no classified error — chronic Daily Model Inventory failures #39946

@github-actions

Problem statement

The Copilot CLI engine (copilot-harness) exhausts all retries and exits with exitCode=1 while every built-in error classifier reports a negative result (isCAPIError400=false isCAPIQuotaExceededError=false isMCPPolicyError=false isModelNotSupportedError=false isNullTypeToolCallError=false isAuthError=false isAuthenticationFailedError=false permissionDeniedCount=0 hasNumerousPermissionDenied=false) and hasOutput=true. The agent produces output and tool calls, but the harness reports failure with no actionable classification, so the run fails opaquely. This is distinct from the previously tracked Codex model-404 and Copilot BYOK model-not-supported (400) clusters: those set a specific classifier, whereas here no classifier fires.

Affected workflows and run IDs

  1. Daily Model Inventory Checker §27728272202 (2026-06-18 00:19 UTC, main, schedule; 880k tokens, 35 turns). Chronic: this workflow has failed on main every day for ≥8 consecutive days (§27657051775, §27585387447, 27516779860, 27483283602, 27450621036, 27386031918, 27315084790), with no successful baseline run available for comparison.
  2. Smoke Copilot - AOAI (apikey) §27715405170 (branch copilot/instructions-sync). Identical harness signature: attempt 4 failed: exitCode=1 ... <all classifiers false> ... retriesRemaining=0, 2 turns / 58.5k tokens.

Probable root cause

  1. The CLI exits non-zero for a reason outside the harness's known classifier set. In the Model Inventory run the MCP gateway logged {"message":"Gateway shutdown initiated","serversTerminated":2,"status":"closed"} immediately before the harness reported exit 1 — suggesting an unclean session teardown (MCP gateway/sidecar shutdown race) rather than an LLM/model error.
  2. Because no classifier matches, retry/repair logic cannot distinguish a transient teardown from a real failure; all 3 retries reproduce the same bare exit 1, and the failure surfaces with no cause.

Proposed remediation

  1. On unclassified exitCode=1, persist and surface the Copilot CLI stderr / last-output tail into the run's failure summary (the harness already records hasOutput=true — capture that content).
  2. Add explicit handling for clean-exit-with-output and for MCP-gateway-shutdown races, so these are either retried correctly or reported as a distinct, named failure class instead of a bare exit 1.
  3. Triage the chronic Daily Model Inventory failure directly: an 8+ day unbroken failure on main with no baseline indicates a workflow-definition or persistent engine issue specific to that workflow — reproduce with verbose Copilot logging.
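
Remediation items 1 and 2 could be sketched roughly as below. This is a minimal Python sketch under stated assumptions, not the harness's actual code: the function name `classify_unclassified_exit`, the `FailureSummary` type, the class labels, and the default tail length are all hypothetical. Only the two matched strings come from log lines quoted in this issue.

```python
import re
from dataclasses import dataclass

# Substrings taken from logs quoted in this issue; the labels are hypothetical.
_FALLBACK_PATTERNS = [
    ("llm_invocation_cap", re.compile(r"Maximum LLM invocations exceeded")),
    ("mcp_gateway_shutdown", re.compile(r"Gateway shutdown initiated")),
]


@dataclass
class FailureSummary:
    failure_class: str  # named class instead of a bare "exit 1"
    output_tail: str    # last lines of CLI output, for the failure annotation


def classify_unclassified_exit(output: str, tail_lines: int = 20) -> FailureSummary:
    """Fallback for exitCode=1 runs where no built-in classifier fired."""
    tail = "\n".join(output.splitlines()[-tail_lines:])
    for label, pattern in _FALLBACK_PATTERNS:
        if pattern.search(output):
            return FailureSummary(label, tail)
    # Still unclassified, but the tail is preserved so the failure is not opaque.
    return FailureSummary("unclassified_exit_1", tail)
```

The patterns are checked in order, so a run whose output contains both messages is reported as an invocation cap. A gateway shutdown message may also appear during normal teardown, so in this sketch it is best read as a hint rather than a confirmed cause.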

Success criteria / verification

  1. An unclassified exitCode=1 failure includes the Copilot CLI stderr/output tail in its failure annotation.
  2. Daily Model Inventory Checker produces at least one successful (or correctly-classified) run on main within a subsequent 48h window.
  3. No copilot-harness exitCode=1 with all-classifiers-false recurs across Copilot-engine workflows over a 24h window without an attached root cause.
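
Criteria 1 and 3 could be checked mechanically per run. The sketch below is illustrative only: the function names and the idea that each run exposes an exit code, a classifier mapping, an output tail, and a root-cause note are assumptions, not an existing API.

```python
from typing import List, Mapping, Optional


def is_unclassified_failure(exit_code: int, classifiers: Mapping[str, object]) -> bool:
    """True when the run failed and no classifier fired (all False or 0)."""
    return exit_code != 0 and not any(classifiers.values())


def criteria_violations(
    exit_code: int,
    classifiers: Mapping[str, object],
    output_tail: Optional[str] = None,
    root_cause: Optional[str] = None,
) -> List[str]:
    """List which success criteria a single run record would violate."""
    problems: List[str] = []
    if not is_unclassified_failure(exit_code, classifiers):
        return problems
    if not output_tail:
        problems.append("criterion 1: no stderr/output tail in the annotation")
    if not root_cause:
        problems.append("criterion 3: unclassified exit without a root cause")
    return problems
```

Because counts such as permissionDeniedCount are treated as truthy when non-zero, a run with permissionDeniedCount=11 counts as classified in this sketch.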

Other lower-priority clusters observed this 6h window (no separate issue filed)

  1. Claude max-invocations cap (P2) — Design Decision Gate 🏗️ §27724121419 (PR branch copilot/fix-orphaned-entries-actions-lock): isRateLimitError=true + API Error: Request rejected (429) · Maximum LLM invocations exceeded (20 / 20). Hit the per-run 20-invocation budget; single occurrence on a dev branch.
  2. BYOK Ollama proxy 503 (P2) — Daily BYOK Ollama Test §27724841546 (main, schedule): awf-reflect: models fetch returned 503 for (apiproxy/redacted); exit 1 after 52s. The BYOK model backend/proxy was unavailable.
  3. safe_outputs on PR/test branches (P2) — Smoke Copilot/Claude §27726924811, §27726899549, 27718500741, 27718402567: the agent job succeeds but Process Safe Outputs fails because emitted items target unresolvable contexts (add_comment to discussion Welcome to Agentic Workflows! #335 "not found"; resolve_pull_request_review_thread "Resource not accessible by integration"; push_to_pull_request_branch "No patch file"). These are smoke tests on PR/dev branches (e.g. PR Fix phantom asset failures: align safe-outputs staging path with RUNNER_TEMP #39900, copilot/aw-failures-fix-upload-assets), not production defects.
  4. Checkout Git-LFS flake (P2) — Documentation Unbloat §27729616462: git lfs fetch exit code 2 after 3 retries; transient infra (prior cycle's [aw] Documentation Unbloat failed #39856).
  5. Cancelled (not defects) — Smoke CI §27725219843, Changeset Generator 27715306377: concurrency/guard cancellations.

Parent report: #39883. Analyzed runs: 27728272202, 27715405170. Comparator diff for the separate upload-assets cluster (#39885): 27579121502 → 27721789885.
Related to #39883

Generated by 🔍 [aw] Failure Investigator (6h) · expires on Jun 24, 2026, 5:52 PM UTC-08:00


6h-window recurrence update — 2026-06-18 08:26 UTC

  1. Still active on main. Three Copilot-engine workflows failed at agent / Execute GitHub Copilot CLI this window with the unclassified exitCode=1 signature:
    • Code Simplifier — §27737401463 (main, schedule).
    • Daily Compiler Quality Check — §27735045174 (main, schedule).
    • PR Code Quality Reviewer — §27737630154 (PR branch copilot/fix-mcp-container-tmp-mount).
  2. New evidence — the unclassified exit 1 hides a quota error the classifier misses. Code Simplifier §27737401463: the harness records attempt 3/4 failed: exitCode=1 isCAPIError400=false isCAPIQuotaExceededError=false isMCPPolicyError=false isModelNotSupportedError=false isNullTypeToolCallError=false isAuthError=false isAuthenticationFailedError=false permissionDeniedCount=0 hasNumerousPermissionDenied=false hasOutput=true (i.e. all classifiers false), yet the captured output one line above is Failed to get response from the AI model; retried 5 times ... Last error: CAPIError: 429 Maximum LLM invocations exceeded (52 / 50). So at least one instance of this cluster is a per-run LLM-invocation cap (52/50) that isCAPIQuotaExceededError fails to flag. This is a concrete classifier gap and directly validates remediation items 1 and 2: classify 429 Maximum LLM invocations exceeded as a quota/cap failure and surface the captured output tail rather than reporting a bare exit 1.
  3. Second mode confirmed — permission-denial saturation. Daily Compiler Quality Check §27735045174: exitCode=1 with permissionDeniedCount=11 hasNumerousPermissionDenied=true (34 bash + 3 view tool calls before failure; firewall clean, 37/37 allowed to api.githubcopilot.com). This is the hasNumerousPermissionDenied mode already noted in this cluster, now recurring on main.
  4. No successful baseline for Code Simplifier or Daily Compiler Quality Check on main in this window; both are schedule-driven and reproduce the same signature. Keep open.
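
The two signatures quoted in this update can be told apart by parsing the harness line and the captured output. The following is a rough Python sketch: the helper names `parse_signature` and `invocation_cap` are hypothetical, and the line format is inferred only from the excerpts quoted above.

```python
import re
from typing import Dict, Optional, Tuple, Union

_PAIR = re.compile(r"(\w+)=(\w+)")
_CAP = re.compile(r"Maximum LLM invocations exceeded \((\d+) / (\d+)\)")


def parse_signature(line: str) -> Dict[str, Union[bool, int, str]]:
    """Parse key=value pairs from a harness 'attempt N failed' line."""
    fields: Dict[str, Union[bool, int, str]] = {}
    for key, raw in _PAIR.findall(line):
        if raw in ("true", "false"):
            fields[key] = raw == "true"
        elif raw.isdigit():
            fields[key] = int(raw)
        else:
            fields[key] = raw
    return fields


def invocation_cap(output: str) -> Optional[Tuple[int, int]]:
    """Return (used, limit) if the output reports an LLM invocation cap."""
    match = _CAP.search(output)
    return (int(match.group(1)), int(match.group(2))) if match else None


sig = parse_signature(
    "attempt 3/4 failed: exitCode=1 isCAPIQuotaExceededError=false "
    "permissionDeniedCount=0 hasNumerousPermissionDenied=false hasOutput=true"
)
cap = invocation_cap("Last error: CAPIError: 429 Maximum LLM invocations exceeded (52 / 50)")
```

With the Code Simplifier excerpt, `sig["isCAPIQuotaExceededError"]` is `False` while `cap` is `(52, 50)`, which is the classifier gap described in item 2.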

References: §27737401463 · §27735045174 · §27737630154

Generated by 🔍 [aw] Failure Investigator (6h)