[aw-failures] [aw] Code Simplifier fails daily on main — Copilot BYOK provider returns HTTP 403 (authentication_failed, non-retryable) #40270

@github-actions

Parent report: #39883.

Problem statement

  1. The Code Simplifier workflow has failed 10/10 consecutive scheduled runs on main (daily ~04:48 UTC, 2026-06-10 → 2026-06-19). No successful baseline run exists in that window.
  2. In the latest run the agent executes substantial work (turns=49, 1.85M tokens, 160 firewall requests, 0 blocked) and then the Copilot CLI exits non-zero: [copilot-harness] attempt 1 failed: exitCode=1 failureClass=authentication_failed.
  3. The provider returns HTTP 403: Authentication failed with provider at (172.30.0.30/redacted) (HTTP 403). Check your COPILOT_PROVIDER_API_KEY or COPILOT_PROVIDER_BEARER_TOKEN. The harness treats first-attempt auth failure as non-retryable (authentication failed — not retrying), so the run dies after one attempt with write_actions=0.
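The harness line quoted above carries its classification as `key=value` pairs, so it can be triaged mechanically. A minimal sketch, assuming the log format is exactly as quoted in this report; the function names `parse_harness_failure` and `is_auth_failure` are hypothetical and not part of the harness:

```python
import re

# Matches key=value pairs such as exitCode=1 or failureClass=authentication_failed.
_PAIR = re.compile(r"(\w+)=(\S+)")


def parse_harness_failure(line: str) -> dict[str, str]:
    """Extract key=value fields from a harness failure line.

    The line format is assumed from the log excerpt quoted in this issue.
    """
    return dict(_PAIR.findall(line))


def is_auth_failure(line: str) -> bool:
    """True when the line is classified as authentication_failed."""
    return parse_harness_failure(line).get("failureClass") == "authentication_failed"


line = "[copilot-harness] attempt 1 failed: exitCode=1 failureClass=authentication_failed"
print(parse_harness_failure(line))
# {'exitCode': '1', 'failureClass': 'authentication_failed'}
print(is_auth_failure(line))
# True
```

The same parser also reads the boolean flags reported under the probable root cause (for example `isCAPIError400=false`), since they use the same `key=value` shape.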

Affected workflow and run IDs

  • Workflow: Code Simplifier (.github/workflows/code-simplifier.lock.yml), engine copilot (GitHub Copilot CLI, BYOK/offline mode), trigger schedule, branch main.
  • Representative (analyzed): §27806068277 — 2026-06-19 04:48 UTC. Failed step Execute GitHub Copilot CLI; failureClass=authentication_failed, HTTP 403 from 172.30.0.30:10002; turns=49, tokens=1,851,074.
  • Chronic comparators (all failure, main, schedule): 27737401463 (06-18), 27666488465 (06-17), 27594887412 (06-16), 27524707602 (06-15), 27488668377 (06-14), 27456907583 (06-13), 27395179213 (06-12) — workflow is red every day with no successful baseline.

Probable root cause

  1. The Copilot BYOK api-proxy (172.30.0.30:10002) rejects the request with HTTP 403 part-way through the session (the agent had already produced output, hasOutput=true, after 6m25s). This indicates the bearer token / API key presented to the proxy is invalid, expired, or revoked for this workflow — not a model-resolution or quota error (isCAPIError400=false isCAPIQuotaExceededError=false isModelNotSupportedError=false).
  2. The non-retryable first-attempt auth policy means a mid-session 403 (e.g. token TTL expiring during a long 49-turn run) is fatal with no recovery.

Proposed remediation

  1. Verify COPILOT_PROVIDER_API_KEY / COPILOT_PROVIDER_BEARER_TOKEN provisioning for Code Simplifier's scheduled main runs: confirm the credential is present, valid, and does not expire within the run's wall-clock duration (this run lasted 6m25s before the 403).
  2. If the token has a short TTL, refresh/re-mint it before long-running agent sessions, or allow a single token-refresh-and-retry on a mid-session 403 instead of the blanket non-retryable path.
  3. Capture the BYOK proxy's 403 response body into the failure annotation so credential-vs-policy 403s are distinguishable (the harness currently surfaces only the generic Check your COPILOT_PROVIDER_* hint).
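Items 2 and 3 above could be combined roughly as follows. This is a sketch under stated assumptions, not the harness's actual code: `send`, `mint_token`, `ProviderResponse`, and `ProviderAuthError` are hypothetical names introduced for illustration, and the real harness may structure requests differently. It allows exactly one refresh-and-retry on a 403 and keeps the proxy's response body on the error so it can be attached to the failure annotation.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProviderResponse:
    """Hypothetical stand-in for a provider/proxy response."""
    status: int
    body: str


class ProviderAuthError(Exception):
    """Raised when the provider still returns 403 after one token refresh."""

    def __init__(self, body: str):
        super().__init__(f"provider returned HTTP 403 after token refresh: {body}")
        self.body = body  # kept so the annotation can include the proxy's response


def call_with_single_refresh(
    send: Callable[[str], ProviderResponse],
    mint_token: Callable[[], str],
    token: str,
) -> ProviderResponse:
    """Send once; on a 403, mint a fresh token and retry exactly once."""
    response = send(token)
    if response.status != 403:
        return response
    # Single retry only: a second 403 is treated as a genuine auth failure.
    response = send(mint_token())
    if response.status == 403:
        raise ProviderAuthError(response.body)
    return response


# Illustrative fake provider: only the token "fresh" is accepted.
def fake_send(token: str) -> ProviderResponse:
    if token == "fresh":
        return ProviderResponse(200, "ok")
    return ProviderResponse(403, '{"error": "token expired"}')


result = call_with_single_refresh(fake_send, lambda: "fresh", token="stale")
print(result.status)  # 200
```

Capping the retry at one attempt keeps the existing fail-fast behavior for credentials that are actually invalid or revoked, while letting a token that expired mid-session recover.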

Success criteria / verification

  1. A subsequent scheduled Code Simplifier run on main completes the agent step without an authentication_failed HTTP 403.
  2. Code Simplifier produces at least one successful (or correctly-classified, non-auth) run on main within 48h.
  3. A mid-session provider 403 is either recovered via token refresh or surfaced as a distinct classified failure with the proxy response body attached.
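Criterion 2 can be checked mechanically once run results are available. The sketch below is illustrative only: the `Run` record shape, the `failure_class` field, and the choice of a reference time to start the 48h window are assumptions, and the sample runs are invented data rather than real results.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, TypedDict


class Run(TypedDict):
    created_at: datetime
    conclusion: str               # e.g. "success" or "failure"
    failure_class: Optional[str]  # hypothetical: harness classification, if any


def meets_criterion_2(
    runs: list[Run],
    window_start: datetime,
    window: timedelta = timedelta(hours=48),
) -> bool:
    """True if any run in the window succeeded or failed with a classified, non-auth error."""
    deadline = window_start + window
    for run in runs:
        if not (window_start <= run["created_at"] <= deadline):
            continue
        if run["conclusion"] == "success":
            return True
        if run["failure_class"] not in (None, "authentication_failed"):
            return True
    return False


# Invented sample data: one more auth failure, then a success inside the window.
start = datetime(2026, 6, 19, 8, 0, tzinfo=timezone.utc)  # assumed reference time
sample: list[Run] = [
    {"created_at": datetime(2026, 6, 20, 4, 48, tzinfo=timezone.utc),
     "conclusion": "failure", "failure_class": "authentication_failed"},
    {"created_at": datetime(2026, 6, 21, 4, 48, tzinfo=timezone.utc),
     "conclusion": "success", "failure_class": None},
]
print(meets_criterion_2(sample, start))  # True
```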

Correlation note

  1. Prior tracking is closed and the signature has shifted. Earlier Code Simplifier failures were tracked by #39199 ([aw-failures] P1: Code Simplifier exhausts api-proxy invocation cap (maxRuns 50/50) → CAPIError 429, 6/6 fail + unclassified) and by the per-run "[aw] Code Simplifier failed" issues #39489, #39729, and #39968. All of these are now CLOSED. Today's signature is HTTP 403 authentication_failed, distinct from the 429 cap, so there is currently no open coverage for this chronic failure. This is a tracking gap, consistent with the dedup concern in #40071 ([dedup-detect] Failure detection workflows re-file issues for already-tracked problems).
  2. Avenger parse-log failure recurred this window (§27808815806, ERR_CONFIG / turns=2) — already covered by [aw-failures] [aw] Avenger agent job fails at "Parse agent logs" — ERR_CONFIG "no structured log entries" despite successful age [Content truncated due to length] #40145; not re-filed.
  3. AI Moderator §27806212006 failed at Checkout PR branch with Refusing PR checkout: actor 'Jobayer-Q1' has 'read' permission (requires write or higher) — this is the security guard working as intended (external issue_comment actor without write access), not a defect.
  4. Daily News exit-127 (§27806997132, node not available inside AWF chroot) occurred on a workflow_dispatch of dev branch copilot/add-share-agentic-workflow (firewall v0.27.2, a stale lock predating the fix for #40074, "[aw-failures] [aw] Copilot CLI exits 127 — node missing inside AWF chroot (chronic Daily News failure on main)"); scheduled main Daily News runs the same day succeeded. This is not a production regression: recompile the branch. #40074 stays closed.
  5. Not defects: Smoke CI ×4 cancelled (concurrency/guard); Smoke Copilot ×3 safe_outputs failures on PR/test branches (unresolvable targets, per [aw-failures] [aw] Copilot CLI exits 1 with no classified error — chronic Daily Model Inventory failures #39946).

Generated by 🔍 [aw] Failure Investigator (6h) · 224.7 AIC · ⌖ 12.9 AIC · ⊞ 4.9K ·

  • expires on Jun 26, 2026, 12:50 AM UTC-08:00
