[aw-failures] [aw] Copilot CLI exits 1 with no classified error — chronic Daily Model Inventory failures (#39946)

### Problem statement

The Copilot CLI engine (`copilot-harness`) exhausts all retries and exits with `exitCode=1` while **every** built-in error classifier reports false (`isCAPIError400=false isCAPIQuotaExceededError=false isMCPPolicyError=false isModelNotSupportedError=false isNullTypeToolCallError=false isAuthError=false isAuthenticationFailedError=false permissionDeniedCount=0 hasNumerousPermissionDenied=false`) and `hasOutput=true`. The agent produces output and tool calls, but the harness reports failure with no actionable classification, so the run fails opaquely. This cluster is distinct from the previously tracked Codex model-404 and Copilot BYOK `model-not-supported (400)` clusters: those set a specific classifier, whereas here all classifiers are false.
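
To make the signature above concrete, here is a minimal sketch of how a log scanner might recognise it. The helper names `parse_signature` and `is_unclassified_exit` are hypothetical and not part of `copilot-harness`; the field names come from the harness log line quoted in this issue, and treating every `is*`/`has*` field other than `hasOutput` as a classifier is an assumption.

```python
import re

# Hypothetical helpers for illustration; not the harness's actual code.
_FIELD = re.compile(r"(\w+)=(\S+)")

# Harness line as recorded for Code Simplifier in this issue.
SAMPLE_LINE = (
    "attempt 3/4 failed: exitCode=1 isCAPIError400=false "
    "isCAPIQuotaExceededError=false isMCPPolicyError=false "
    "isModelNotSupportedError=false isNullTypeToolCallError=false "
    "isAuthError=false isAuthenticationFailedError=false "
    "permissionDeniedCount=0 hasNumerousPermissionDenied=false hasOutput=true"
)


def parse_signature(line: str) -> dict:
    """Split a harness log line into key=value fields (values stay strings)."""
    return dict(_FIELD.findall(line))


def is_unclassified_exit(fields: dict) -> bool:
    """True when exitCode=1 and no classifier-style field fired.

    Assumption: every is*/has* field except hasOutput is a classifier.
    """
    if fields.get("exitCode") != "1":
        return False
    classifiers = {
        key: value
        for key, value in fields.items()
        if key.startswith(("is", "has")) and key != "hasOutput"
    }
    denied = int(fields.get("permissionDeniedCount", "0"))
    return (
        bool(classifiers)
        and denied == 0
        and all(value == "false" for value in classifiers.values())
    )


print(is_unclassified_exit(parse_signature(SAMPLE_LINE)))
```

A line where any classifier is `true`, or where `permissionDeniedCount` is non-zero, is not treated as part of this cluster by the sketch.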

### Affected workflows and run IDs

1. **Daily Model Inventory Checker** — [§27728272202](https://github.com/github/gh-aw/actions/runs/27728272202) (2026-06-18 00:19 UTC, `main`, `schedule`; 880k tokens, 35 turns). **Chronic:** this workflow has failed on `main` every day for ≥8 consecutive days — [§27657051775](https://github.com/github/gh-aw/actions/runs/27657051775), [§27585387447](https://github.com/github/gh-aw/actions/runs/27585387447), 27516779860, 27483283602, 27450621036, 27386031918, 27315084790 — with **no successful baseline run** available for comparison.
2. **Smoke Copilot - AOAI (apikey)** — [§27715405170](https://github.com/github/gh-aw/actions/runs/27715405170) (branch `copilot/instructions-sync`). Identical harness signature: `attempt 4 failed: exitCode=1 ... <all classifiers false> ... retriesRemaining=0`, 2 turns / 58.5k tokens.

### Probable root cause

1. The CLI exits non-zero for a reason outside the harness's known classifier set. In the Model Inventory run the MCP gateway logged `{"message":"Gateway shutdown initiated","serversTerminated":2,"status":"closed"}` immediately before the harness reported exit 1 — suggesting an unclean session teardown (MCP gateway/sidecar shutdown race) rather than an LLM/model error.
2. Because no classifier matches, retry/repair logic cannot distinguish a transient teardown from a real failure; all 3 retries reproduce the same bare exit 1, and the failure surfaces with no cause.
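
The teardown hypothesis above could be turned into a named failure class along these lines. This is only a sketch under stated assumptions: the gateway is assumed to log one JSON object per line (as in the excerpt quoted above), and the function names and failure-class labels are hypothetical. A shutdown event in the log does not prove a race, so the label is deliberately worded as a suspicion.

```python
import json

# Message text taken from the gateway log excerpt in this issue.
GATEWAY_SHUTDOWN_MESSAGE = "Gateway shutdown initiated"


def find_gateway_shutdown(log_lines):
    """Return the last gateway shutdown event found in the lines, or None."""
    for raw in reversed(log_lines):
        text = raw.strip()
        if not text.startswith("{"):
            continue  # not a JSON log line
        try:
            event = json.loads(text)
        except json.JSONDecodeError:
            continue  # malformed or truncated line; ignore it
        if isinstance(event, dict) and event.get("message") == GATEWAY_SHUTDOWN_MESSAGE:
            return event
    return None


def name_failure(exit_code, any_classifier_fired, log_lines):
    """Map a run outcome to a hypothetical failure-class label."""
    if exit_code == 0:
        return "success"
    if any_classifier_fired:
        return "classified"
    if find_gateway_shutdown(log_lines) is not None:
        return "suspected-mcp-gateway-shutdown"
    return "unclassified"


log = [
    "agent finished turn 35",
    '{"message":"Gateway shutdown initiated","serversTerminated":2,"status":"closed"}',
]
print(name_failure(1, False, log))
```

With a label like this, retry logic could treat a suspected teardown differently from a failure with no evidence at all, which is the distinction item 2 says is currently missing.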

### Proposed remediation

1. On unclassified `exitCode=1`, persist and surface the Copilot CLI stderr / last-output tail into the run's failure summary (the harness already records `hasOutput=true` — capture that content).
2. Add explicit handling for runs that exit 1 despite producing output (`hasOutput=true`) and for MCP-gateway-shutdown races, so these are either retried correctly or reported as a distinct, named failure class instead of a bare exit 1.
3. Triage the chronic Daily Model Inventory failure directly: an 8+ day unbroken failure on `main` with no baseline indicates a workflow-definition or persistent engine issue specific to that workflow — reproduce with verbose Copilot logging.
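
Remediation 1 could look roughly like the sketch below. It assumes the harness already holds the CLI's captured output as a string; the function names and the summary layout are hypothetical. `GITHUB_STEP_SUMMARY` is the file path GitHub Actions provides for job summaries, and the sketch simply does nothing when it is unset.

```python
import os


def output_tail(text: str, max_lines: int = 20) -> str:
    """Return at most the last max_lines lines of the captured output."""
    if max_lines <= 0:
        return ""
    return "\n".join(text.splitlines()[-max_lines:])


def failure_summary(exit_code: int, captured_output: str, max_lines: int = 20) -> str:
    """Build a Markdown snippet that surfaces the output tail."""
    fence = "`" * 3  # built at runtime to avoid nesting literal fences
    tail = output_tail(captured_output, max_lines)
    return (
        f"### Copilot CLI failed (exitCode={exit_code}, unclassified)\n\n"
        f"Captured output tail (up to {max_lines} lines):\n\n"
        f"{fence}text\n{tail}\n{fence}\n"
    )


def append_to_step_summary(markdown: str) -> bool:
    """Append to the job summary file if running under GitHub Actions."""
    path = os.environ.get("GITHUB_STEP_SUMMARY")
    if not path:
        return False
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(markdown)
    return True


captured = "\n".join(f"line {i}" for i in range(50))
print(failure_summary(1, captured, max_lines=3))
```

Capping the tail keeps the annotation readable while still exposing the final error line, which is where the `429` message appeared in the Code Simplifier run.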

### Success criteria / verification

1. An unclassified `exitCode=1` failure includes the Copilot CLI stderr/output tail in its failure annotation.
2. Daily Model Inventory Checker produces at least one successful (or correctly-classified) run on `main` within a subsequent 48h window.
3. No `copilot-harness exitCode=1` with all-classifiers-false recurs across Copilot-engine workflows over a 24h window without an attached root cause.

<details><summary>Other lower-priority clusters observed this 6h window (no separate issue filed)</summary>

1. **Claude max-invocations cap (P2)** — Design Decision Gate 🏗️ [§27724121419](https://github.com/github/gh-aw/actions/runs/27724121419) (PR branch `copilot/fix-orphaned-entries-actions-lock`): `isRateLimitError=true` + `API Error: Request rejected (429) · Maximum LLM invocations exceeded (20 / 20)`. Hit the per-run 20-invocation budget; single occurrence on a dev branch.
2. **BYOK Ollama proxy 503 (P2)** — Daily BYOK Ollama Test [§27724841546](https://github.com/github/gh-aw/actions/runs/27724841546) (`main`, schedule): `awf-reflect: models fetch returned 503 for (apiproxy/redacted) exit 1 after 52s`; BYOK model backend/proxy unavailable.
3. **safe_outputs on PR/test branches (P2)** — Smoke Copilot/Claude [§27726924811](https://github.com/github/gh-aw/actions/runs/27726924811), [§27726899549](https://github.com/github/gh-aw/actions/runs/27726899549), 27718500741, 27718402567: the `agent` job **succeeds** but `Process Safe Outputs` fails because emitted items target unresolvable contexts (`add_comment` to discussion #335 "not found"; `resolve_pull_request_review_thread` "Resource not accessible by integration"; `push_to_pull_request_branch` "No patch file"). These are smoke tests on PR/dev branches (e.g. PR #39900, `copilot/aw-failures-fix-upload-assets`), not production defects.
4. **Checkout Git-LFS flake (P2)** — Documentation Unbloat [§27729616462](https://github.com/github/gh-aw/actions/runs/27729616462): `git lfs fetch` exit code 2 after 3 retries; transient infra (prior cycle's #39856).
5. **Cancelled (not defects)** — Smoke CI [§27725219843](https://github.com/github/gh-aw/actions/runs/27725219843), Changeset Generator 27715306377: concurrency/guard cancellations.

</details>

Parent report: #39883. Analyzed runs: 27728272202, 27715405170. Comparator diff for the separate upload-assets cluster (#39885): 27579121502 → 27721789885.
Related to #39883

> Generated by [🔍 [aw] Failure Investigator (6h)](https://github.com/github/gh-aw/actions/runs/27730951555) · [◷](https://github.com/search?q=repo%3Agithub%2Fgh-aw+is%3Aissue+%22gh-aw-workflow-call-id%3A+github%2Fgh-aw%2Faw-failure-investigator%22&type=issues)
> - [x] expires on Jun 24, 2026, 5:52 PM UTC-08:00

---

### 6h-window recurrence update — 2026-06-18 08:26 UTC

1. Still active on `main`. Three Copilot-engine workflows failed at `agent` / `Execute GitHub Copilot CLI` this window with the unclassified `exitCode=1` signature:
   - Code Simplifier — [§27737401463](https://github.com/github/gh-aw/actions/runs/27737401463) (`main`, `schedule`).
   - Daily Compiler Quality Check — [§27735045174](https://github.com/github/gh-aw/actions/runs/27735045174) (`main`, `schedule`).
   - PR Code Quality Reviewer — [§27737630154](https://github.com/github/gh-aw/actions/runs/27737630154) (PR branch `copilot/fix-mcp-container-tmp-mount`).
2. **New evidence — the unclassified exit 1 hides a quota error the classifier misses.** Code Simplifier §27737401463: the harness records `attempt 3/4 failed: exitCode=1 isCAPIError400=false isCAPIQuotaExceededError=false isMCPPolicyError=false isModelNotSupportedError=false isNullTypeToolCallError=false isAuthError=false isAuthenticationFailedError=false permissionDeniedCount=0 hasNumerousPermissionDenied=false hasOutput=true` — i.e. all classifiers false — yet the captured output one line above is `Failed to get response from the AI model; retried 5 times ... Last error: CAPIError: 429 Maximum LLM invocations exceeded (52 / 50)`. So at least one instance of this cluster is a **per-run LLM-invocation cap (52/50)** that `isCAPIQuotaExceededError` fails to flag. This is a concrete classifier gap and directly validates remediation #1/#2: classify `429 Maximum LLM invocations exceeded` as a quota/cap failure and surface the captured output tail rather than reporting a bare exit 1.
3. **Second mode confirmed — permission-denial saturation.** Daily Compiler Quality Check §27735045174: `exitCode=1` with `permissionDeniedCount=11 hasNumerousPermissionDenied=true` (34 `bash` + 3 `view` tool calls before failure; firewall clean, 37/37 allowed to `api.githubcopilot.com`). This is the `hasNumerousPermissionDenied` mode already noted in this cluster, now recurring on `main`.
4. No successful baseline for Code Simplifier or Daily Compiler Quality Check on `main` in this window; both are `schedule`-driven and reproduce the same signature. Keep open.
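
The classifier gap in item 2 can be illustrated with a small pattern match over the captured output. This is a sketch, not the implementation of `isCAPIQuotaExceededError`: the function name is hypothetical, and matching on the message text alone (rather than also requiring the `429` status) is an assumption made so that both message variants quoted in this issue are caught.

```python
import re

# Matches the cap message seen in this issue, e.g. "(52 / 50)" or "(20 / 20)".
_INVOCATION_CAP = re.compile(
    r"Maximum LLM invocations exceeded \((\d+)\s*/\s*(\d+)\)"
)


def invocation_cap_hit(output: str):
    """Return (used, limit) if the output reports the invocation cap, else None."""
    match = _INVOCATION_CAP.search(output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


copilot_output = (
    "Failed to get response from the AI model; retried 5 times ... "
    "Last error: CAPIError: 429 Maximum LLM invocations exceeded (52 / 50)"
)
claude_output = (
    "API Error: Request rejected (429) · "
    "Maximum LLM invocations exceeded (20 / 20)"
)
print(invocation_cap_hit(copilot_output))
print(invocation_cap_hit(claude_output))
```

Returning the counts rather than a bare boolean lets the failure summary state the budget that was exceeded, which matters because retrying a run that has hit its invocation cap cannot succeed.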

**References:** [§27737401463](https://github.com/github/gh-aw/actions/runs/27737401463) · [§27735045174](https://github.com/github/gh-aw/actions/runs/27735045174) · [§27737630154](https://github.com/github/gh-aw/actions/runs/27737630154)

> Generated by [🔍 [aw] Failure Investigator (6h)](https://github.com/github/gh-aw/actions/runs/27746576910) · [◷](https://github.com/search?q=repo%3Agithub%2Fgh-aw+is%3Aissue+%22gh-aw-workflow-call-id%3A+github%2Fgh-aw%2Faw-failure-investigator%22&type=issues)
