[aw-failures] [aw] Failure Investigation Report — 6h window (2026-06-17 19:34 UTC) #39883

@github-actions

[aw] Failure Investigation — 6h window ending 2026-06-17 19:34 UTC

Executive summary

  1. Scope: 29 failed/cancelled runs in github/gh-aw over the last 6h. 13 were cancelled (concurrency/guard, not defects); 16 were genuine failure outcomes across 6 signature clusters.
  2. P0 (provider/config): Codex engine cannot reach its model — every Codex run 404s on gpt-5-codex-alpha-2025-11-07. Tracked per-workflow but root cause not escalated.
  3. P1 (product bug, new/untracked): four asset-producing workflows succeed at the agent step but the upload_assets job fails because the declared PNG asset files are never staged. Filed as sub-issue below.
  4. P1 (provider): Copilot-CLI BYOK proxy rejects claude-sonnet-4.6 with 400 model not supported; partially covered by the existing cascade rollup.
  5. No open agentic-workflows issue qualified for closure — all 20 are <6h old and reflect still-active failure modes; none have fresh evidence of being fixed.

Failure cluster table

| Cluster | Class | Runs | Representative | Comparator | Existing coverage |
| --- | --- | --- | --- | --- | --- |
| C1 codex-model-404 | model_resolution_404 (P0) | 2 | §27713303874 | §27703903782 | #39878, #39844 (per-workflow) |
| C2 phantom-asset | missing-staged-file (P1) | 4 | §27713375907 | §27705239494 | none → sub-issue |
| C3 copilot-byok | model-not-supported 400 / denials (P1) | 8 | §27703869930 | §27702793798 | #39850 #39851 #39848 #39853 #39852 |
| C4 claude-cli | engine failure (P2) | 1 | §27703868296 | | (in cascade #39852) |
| C5 checkout-infra | checkout step (P2) | 1 | §27707338407 | | #39856 |
| C6 super-linter | linter step (P2) | 1 | §27698271426 | | none (low impact) |
| — cancelled | not a defect | 13 | Smoke CI ×8, AI Moderator ×4, Auto-Triage ×1 | | concurrency/guard cancellations |

Evidence

C1 — Codex model 404 (P0)
  • Dominant error: 404 Not Found: Model not found gpt-5-codex-alpha-2025-11-07 at (172.30.0.30/redacted)
  • The Codex engine resolves alias gpt-5-codex to backend id gpt-5-codex-alpha-2025-11-07, which no longer exists on the proxy. The "Unknown model gpt-5-codex ... fallback model metadata" path swaps metadata only: the request still targets the missing backend, so all 5 reconnect retries return 404 and the turn fails with 0 tool calls.
  • Audit posture: read-only, turns 11→0 vs baseline (agent never started real work).
  • Impact: every Codex-engine workflow (Daily Cache Strategy Analyzer, Smoke Codex). Auto-filed at [aw] Daily Cache Strategy Analyzer failed #39878 / [aw] Smoke Codex produced no safe outputs #39844 as symptoms, but the shared root cause (removed alpha model id) is not escalated.
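
The C1 failure mode (metadata swapped, request target unchanged) can be contrasted with a resolver that re-targets the request itself. The following is a minimal sketch, not the Codex engine's actual code: `ALIASES`, `ResolvedModel`, `resolve_model`, and the `fallback` parameter are hypothetical names, and the set of live models is assumed to be available from the proxy by some means not shown in this report.

```python
from dataclasses import dataclass

# Hypothetical alias table; the real engine configuration is not shown here.
ALIASES = {"gpt-5-codex": "gpt-5-codex-alpha-2025-11-07"}


@dataclass(frozen=True)
class ResolvedModel:
    requested: str   # name the workflow asked for
    backend_id: str  # id that will actually be sent to the proxy
    fell_back: bool  # True when the fallback replaced the request target


def resolve_model(requested: str, live_models: set, fallback: str) -> ResolvedModel:
    """Resolve an alias and re-target the request when the backend id is gone."""
    backend_id = ALIASES.get(requested, requested)
    if backend_id in live_models:
        return ResolvedModel(requested, backend_id, fell_back=False)
    if fallback not in live_models:
        # Fail fast instead of spending every retry on a guaranteed 404.
        raise LookupError(
            f"no live backend for {requested!r} (alias target {backend_id!r})"
        )
    # The fallback changes the request target, not only the metadata.
    return ResolvedModel(requested, fallback, fell_back=True)
```

With a resolver shaped like this, a removed backend id either routes to a live model or fails before the first sampling attempt, rather than exhausting the reconnect retries.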
C2 — Phantom asset (P1, new)
  • upload_assets job fails: ERR_SYSTEM: Asset file not found: /tmp/gh-aw/safeoutputs/assets/quality_score_breakdown.png while the agent job succeeded.
  • The agent emitted upload_asset safe-output items referencing PNGs with sha + byte size + markdown image links (e.g. quality_score_breakdown.png size=178654, historical_trends.png size=422386) that were never written to the staging dir.
  • Path mismatch: agent referenced .../.gh-aw-assets/(file) while the upload pipeline reads /tmp/gh-aw/safeoutputs/assets/; the safe-outputs-assets artifact was never produced (Artifact not found for name: safe-outputs-assets).
  • Affected: §27713375907 Daily Code Metrics, §27705239494 Daily Security Observability Report, §27704205632 Daily Repository Chronicle, §27699677975 Daily Agent of the Day Blog Writer.
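
One way to rule out phantom assets is to check each declared file against the staging directory at emission time. The sketch below is illustrative only: `validate_asset` is a hypothetical helper, the staging path is the one quoted in the error above, and the declared sha is assumed to be a SHA-256 hex digest (the report does not say which algorithm is used).

```python
import hashlib
from pathlib import Path

# Staging directory named in the upload_assets error message.
STAGING_DIR = Path("/tmp/gh-aw/safeoutputs/assets")


def validate_asset(name, expected_size, expected_sha256, staging_dir=STAGING_DIR):
    """Return a list of problems; an empty list means the asset is staged as declared."""
    # Keep only the file name, so a reference such as .gh-aw-assets/x.png
    # is checked in the directory the upload job actually reads.
    path = Path(staging_dir) / Path(name).name
    if not path.is_file():
        return [f"asset file not found: {path}"]
    data = path.read_bytes()
    problems = []
    if len(data) != expected_size:
        problems.append(
            f"size mismatch for {path.name}: declared {expected_size}, found {len(data)}"
        )
    # Assumption: the declared sha is a SHA-256 hex digest.
    if hashlib.sha256(data).hexdigest() != expected_sha256:
        problems.append(f"sha256 mismatch for {path.name}")
    return problems
```

Rejecting the upload_asset item when this list is non-empty would surface the problem in the agent job instead of in the later upload_assets job.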
C3 — Copilot BYOK (P1, heterogeneous)
  • Representative §27703869930: 400 The requested model is not supported (model=claude-sonnet-4.6, isModelNotSupportedError=true), not retried; not auth, not denial-limit (permissionDeniedCount=0).
  • Comparator §27702793798: hasNumerousPermissionDenied (permissionDeniedCount=11).
  • Third mode §27701060020: harness exit code 127 (runtime).
  • Cluster mixes three distinct Copilot-CLI failure modes; mostly covered by cascade rollup [aw] Failure cascade detected #39852 and per-workflow issues.
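
Because C3 mixes three modes, splitting it by signal may make the rollup easier to act on. A rough sketch follows, assuming the harness signals are available as a dictionary; the field names are the ones quoted in the evidence above, while the dictionary shape, the function name, and the returned labels are hypothetical.

```python
def classify_copilot_failure(signals: dict) -> str:
    """Map harness signals to one of the three failure modes seen in C3."""
    if signals.get("isModelNotSupportedError"):
        # e.g. the 400 "requested model is not supported" response
        return "model-not-supported"
    if signals.get("hasNumerousPermissionDenied"):
        return "permission-denials"
    if signals.get("exitCode") == 127:
        # In POSIX shells, exit status 127 conventionally means "command not found".
        return "harness-runtime"
    return "unclassified"
```

Checking the model error first matches the representative run, where permissionDeniedCount=0 and the 400 was the only signal.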

Existing issue correlation

  1. Clusters → tracking: C1 ↔ [aw] Daily Cache Strategy Analyzer failed #39878/[aw] Smoke Codex produced no safe outputs #39844; C3 ↔ [aw] Smoke Copilot - AOAI (apikey) failed #39850/[aw] Smoke Copilot - AOAI (Entra) failed #39851/[aw] Daily Formal Spec Verifier exceeded tool denial limit #39848/[aw] Daily SPDD Spec Planner exceeded tool denial limit #39853/[aw] Failure cascade detected #39852; C5 ↔ [aw] Documentation Unbloat failed #39856.
  2. Cascade rollup [aw] Failure cascade detected #39852 already groups 10 of today's [aw] * failed issues — consistent with C1/C3 sharing provider-side root causes.
  3. Gaps: C2 (phantom asset) has no tracking coverage → addressed by the sub-issue below. C1 root cause is filed only as per-workflow symptoms; recommend treating [aw] Daily Cache Strategy Analyzer failed #39878 as the canonical P0 and updating the Codex model alias.
  4. Potential duplicates: [aw] Smoke Copilot - AOAI (apikey) failed #39850 / [aw] Smoke Copilot - AOAI (apikey) produced no safe outputs #39861 and [aw] Smoke Copilot - AOAI (Entra) failed #39851 / [aw] Smoke Copilot - AOAI (Entra) produced no safe outputs #39862 (the apikey/Entra "failed" and "produced no safe outputs" issues describe the same two Smoke Copilot runs). These are candidates for consolidation by the owning workflow and were not closed here without confirmation.
  5. Closures: none performed — every open agentic-workflows issue is <6h old and reflects an active failure mode; no fresh evidence of resolution.

Fix roadmap

  1. P0 — Restore the Codex model route. The configured Codex model resolves to gpt-5-codex-alpha-2025-11-07, which 404s on the proxy. Point the alias at a live model (or restore the backend) and make the "unknown model" fallback re-target the request, not just metadata. Canonical tracker: [aw] Daily Cache Strategy Analyzer failed #39878.
  2. P1 — Fix asset staging (sub-issue below). Ensure declared upload_asset files are staged to /tmp/gh-aw/safeoutputs/assets/ before the upload_assets job, and validate file existence at safe-output emission time so the agent cannot declare phantom assets.
  3. P1 — Copilot BYOK model support. Resolve claude-sonnet-4.6 rejection (400 not supported) on the BYOK proxy; covered by cascade [aw] Failure cascade detected #39852.
  4. P2 — Monitor C4 (Claude CLI), C5 (checkout, [aw] Documentation Unbloat failed #39856), C6 (super-linter); single-occurrence, no action this cycle.

Sub-issues created

  • Phantom asset staging failure (C2) — see linked sub-issue.

References: §27713303874 · §27713375907 · §27703869930

Generated by 🔍 [aw] Failure Investigator (6h) ·

  • expires on Jun 24, 2026, 11:47 AM UTC-08:00


6h-window follow-up — 2026-06-18 08:26 UTC

Scope: 18 failed/cancelled runs. 5 cancelled Smoke CI (concurrency/guard, not defects); 13 genuine failures across 5 signature clusters. No P0 (no provider-down). No issue closures: all open trackers reflect still-active modes.

| Cluster | Class | Runs | Branch | Coverage |
| --- | --- | --- | --- | --- |
| phantom-asset | upload_assets/Push assets, agent success (P1) | 2 | main | #39885 updated (recurrence) |
| copilot-exit-1 | Execute Copilot CLI exit 1, classifiers false (P1) | 3 | main×2, PR×1 | #39946 updated (recurrence + 429 evidence) |
| safe-outputs-skip | Process Safe Outputs fail on unsatisfiable item (P1) | 1 | main | new sub-issue (LintMonster) |
| smoke-pr-noise | Process Safe Outputs on PR/dev branches (P2) | 5 | PR branches | already noted in #39946 |
| post-step-singleton | agent success, post-agent infra step fail (P2) | 2 | main | monitored (below) |

Closures: none. #39885 and #39946 both recur this window; parent (this issue) and #39790 (token audit) are unrelated/fresh.

New sub-issue: LintMonster — scheduled run fails Process Safe Outputs because the agent emitted update_issue target:triggering (unsatisfiable outside issue context) and the skip is counted as a hard failure. Linked to this report.
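
The LintMonster case turns on whether an unsatisfiable item is a skip or a hard failure. The sketch below shows one possible policy, not the current Process Safe Outputs behavior: `Outcome`, `process_update_issue`, and `job_failed` are hypothetical names, and treating the skip as non-fatal is a design assumption that the owning team would need to confirm.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Outcome:
    status: str        # "ok", "skipped", or "failed"
    reason: str = ""


def process_update_issue(item: dict, triggering_issue: Optional[int]) -> Outcome:
    """Handle one update_issue item; skip when the target cannot exist in this context."""
    if item.get("target") == "triggering" and triggering_issue is None:
        # A scheduled run has no triggering issue, so the item is unsatisfiable.
        return Outcome("skipped", "no triggering issue in this event context")
    return Outcome("ok")


def job_failed(outcomes: Iterable[Outcome]) -> bool:
    """Only explicit failures fail the job; skips are reported but not fatal."""
    return any(o.status == "failed" for o in outcomes)
```

Under this policy a scheduled run that emits update_issue target:triggering would finish green with a recorded skip reason.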

Monitored singletons (no issue filed — single occurrence, post-agent infra step):

  1. Avenger §27740190790 (main): agent job marked failure at Parse agent logs for step summary, but the agent itself completed and emitted a noop ("No PR created"). Post-processing log-parser step failure; watch for recurrence.
  2. Daily Safe Outputs Git Simulator §27739899787 (main): agent success (emitted noop, state persisted to repo memory), but push_repo_memory/Push repo-memory changes (default) failed — likely a repo-memory branch push race/conflict. Single occurrence.

References: §27735258410 · §27737401463 · §27738455642




Failure Investigation — 6h window ending 2026-06-19 19:20 UTC

Overview

Prefetch flagged 40 runs, but 31 were cancelled (concurrency: a dependabot PR batch at 15:00 UTC + main pushes superseded in-flight runs) — not real failures. 9 runs genuinely failed, across 7 signatures. No systemic P0; no new parent report needed. Two new sub-issues filed; one existing tracker confirmed still active.
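
The triage rule used here (cancellations are set aside, remaining runs are grouped by signature) can be sketched as below. This is an illustration under assumptions: each run is a dictionary with a `conclusion` field, as workflow runs report, plus a `signature` label that is hypothetical and would come from whatever log matching the investigator applies.

```python
from collections import Counter

# Conclusions treated as non-defects, following this report's triage rule.
NON_DEFECT = {"cancelled"}


def partition_runs(runs):
    """Split runs into (cancelled, genuinely_failed) lists."""
    cancelled = [r for r in runs if r["conclusion"] in NON_DEFECT]
    failed = [r for r in runs if r["conclusion"] not in NON_DEFECT]
    return cancelled, failed


def runs_per_signature(failed):
    """Count genuine failures per signature label (hypothetical field)."""
    return Counter(r["signature"] for r in failed)
```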

Failure clusters (real failures only)

| Cluster | Workflow(s) | Runs | Failure class | Status |
| --- | --- | --- | --- | --- |
| C1 Claude ERR_CONFIG: no structured log entries | Avenger | 3 | parser/engine (intermittent) | #40145 — confirmed recurring, updated |
| C2 Codex Model not found gpt-5-codex-alpha-2025-11-07 (proxy 404) | Daily Cache Strategy Analyzer | 1 (4/6 days) | model-alias | NEW → #aw_c2 |
| C3 Copilot Python driver ModuleNotFoundError: 'copilot' (python3 absent) | Daily Issues Report Generator | 1 (6/6 days) | driver-config | NEW → #aw_c3 |
| C4 Copilot max tool denials (5/5) → SDK aborts | Daily SPDD Spec Planner | 1 (5/6 days) | tool-allowlist | documented (not filed) |
| C5 Copilot sdk_session_idle_timeout → 15-min action timeout | PR Code Quality Reviewer | 1 | session-idle | documented |
| C6 Git LFS fetch fail Could not scan for Git LFS files (exit 2) | Documentation Unbloat | 1 | checkout/LFS | documented |
| C7 super-linter exit 1 + artifact zip error | Super Linter Report | 1 | lint/infra | documented |

Evidence (key log refs)

  1. C2 §27843593919: unexpected status 404 Not Found: Model not found gpt-5-codex-alpha-2025-11-07 on every sampling retry; all 3 retries exhausted.
  2. C3 §27834964940: pre-flight: command not found: python3 + ModuleNotFoundError: No module named 'copilot'; Turns=0, agent_output.json={"items":[]}.
  3. C1 §27841318778: ##[error]ERR_CONFIG: Claude execution failed: no structured log entries were produced.
  4. C4 §27837850380: [sdk-driver] max tool denials threshold reached (5/5); 13 denials of shell(sed ...) / shell(cd && ...) not in the allow-list (permissionDeniedCount=13).
  5. C5 §27841406519: [sdk-driver] error: Timeout after 870000ms waiting for session.idle (sub-agent grumpy-coder ran long); isSDKSessionIdleTimeoutError=true; action timed out at 15m.
  6. C6 §27839021623: git lfs fetch → Could not scan for Git LFS files, git failed with exit code 2; agent never ran.
  7. C7 §27832960010: super-linter step Process completed with exit code 1, then An error has occurred during zip creation for the artifact on log upload.
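
The C3 evidence (python3 missing, then the copilot module missing) suggests a pre-flight check that reports every missing prerequisite at once. A minimal sketch, assuming the check itself runs under some available Python interpreter; `preflight` is a hypothetical helper, not part of the existing driver.

```python
import importlib.util
import shutil


def preflight(commands, modules):
    """Return a list of missing prerequisites; an empty list means ready to run."""
    problems = []
    for cmd in commands:
        # shutil.which returns None when the command is not on PATH.
        if shutil.which(cmd) is None:
            problems.append(f"command not found: {cmd}")
    for mod in modules:
        # find_spec returns None for a missing top-level module.
        if importlib.util.find_spec(mod) is None:
            problems.append(f"module not importable: {mod}")
    return problems
```

A call such as `preflight(["python3"], ["copilot"])` before the agent step would turn the Turns=0 run into an explicit configuration error.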

audit-diff (firewall/metrics)

Pairwise diffs across C1–C5 runs show no firewall anomalies (has_anomalies:false; only expected provider-domain swaps: codex→api.openai.com/chatgpt.com, copilot→api.githubcopilot.com). Failures are engine/config/driver — not network policy. C5 shows the cost signature of a hang: core_consumed +11767% (1068 GitHub API calls), 153 copilot calls before the idle timeout.

Existing-issue correlation & close decisions

Fix roadmap

Sub-issues created this run

  • #aw_c3 — Daily Issues Report Generator: Python sample driver crash.
  • #aw_c2 — Daily Cache Strategy Analyzer: Codex gpt-5-codex dead alias 404.

C4–C7 documented here but not separately filed (minimum-necessary; lower priority / transient). File on recurrence.

References:

Generated by 🔍 [aw] Failure Investigator (6h) · 444.2 AIC · ⌖ 12.3 AIC · ⊞ 4.9K ·



[aw] Failure Investigation — 6h window ending 2026-06-21 08:19 UTC

Executive summary

No issues qualified for closure and no new parent is warranted: every P0/P1 failure in this window is already tracked and still active. 9 failed/cancelled runs across 7 signatures; 6 of the 9 runs map cleanly to open agentic-workflows issues, 1 is a transient infra blip, 1 is a concurrency cancellation, and 1 exposes a new untracked product gap (filed as a sub-issue below).

  1. The three recurring red workflows (Skillet, Avenger, Code Simplifier) all reproduced their tracked signatures this window, so [aw-failures] [aw] Skillet floods Actions with startup-failures on copilot/* branch pushes (recurring — 73 failed runs / 6h as o… #40447, [aw-failures] [aw] Avenger agent job fails at "Parse agent logs" — ERR_CONFIG "no structured log entries" despite successful age… #40145, and [aw-failures] [aw] Code Simplifier fails daily on main — Copilot BYOK provider returns HTTP 403 (authentication_failed, non-retryable) #40270 stay open as active.
  2. PR Sous Chef failed once with an unclassified Copilot CLI exit-1 after a full agent run; audit-diff shows it is behaviorally identical to a successful baseline (stable, transient). Folded into existing class [aw-failures] [aw] Copilot CLI exits 1 with no classified error — chronic Daily Model Inventory failures #39946 as a newly-affected workflow.
  3. One new untracked gap: Smoke Codex safe_outputs fails on set_issue_field "No issue number available". Sub-issue #aw_setfield.
  4. No open issue has fresh evidence of being fixed; absence of a daily workflow from a 6h window is not a fix. Nothing closed.

Failure cluster table

| Cluster | Class | Runs | Representative | Comparator | Coverage |
| --- | --- | --- | --- | --- | --- |
| Skillet startup-failures | startup_failure (P1) | 3 | §27893974986 | n/a (no agent job) | #40447 (open, active) |
| Avenger parse-logs | ERR_CONFIG post-run (P1) | 1 | §27895513143 | prior Avenger runs | #40145 (open, active) |
| Code Simplifier BYOK auth | 403 non-retryable (P1) | 1 | §27893897834 | none (10/10 red) | #40270 (open, active) |
| PR Sous Chef agent exit-1 | unclassified copilot exit-1 (P2) | 1 | §27895333209 | §27893147412 success | #39946 (evidence added) |
| Smoke Codex safe_outputs | set_issue_field unbound (P2) | 1 | §27892632179 | n/a | none → #aw_setfield |
| Smoke AOAI push | transient DNS (infra) | 1 | §27892686646 | n/a | not tracked (transient) |
| Smoke CI cancelled | concurrency cancel (n/a) | 1 | §27892967610 | n/a | not a defect |

Evidence per cluster

  1. Skillet (27893974986, 27892396761, 27891316950): instantaneous startup-failures on copilot/* branch pushes, no agent job dispatched. This is the exact signature of [aw-failures] [aw] Skillet floods Actions with startup-failures on copilot/* branch pushes (recurring — 73 failed runs / 6h as o… #40447 (Skillet triggers only on workflow_dispatch on main; stale lock on copilot branches fires on push).
  2. Avenger (27895513143): agent job fails at Parse agent logs for step summary after a successful upload. This is the ERR_CONFIG "no structured log entries" post-run parser defect of [aw-failures] [aw] Avenger agent job fails at "Parse agent logs" — ERR_CONFIG "no structured log entries" despite successful age… #40145.
  3. Code Simplifier (27893897834): Execute GitHub Copilot CLI fails — BYOK provider 403 authentication_failed, non-retryable, per [aw-failures] [aw] Code Simplifier fails daily on main — Copilot BYOK provider returns HTTP 403 (authentication_failed, non-retryable) #40270 (10/10 daily red, no baseline).
  4. PR Sous Chef (27895333209): agent ran 28 turns / 599k tokens; Execute GitHub Copilot CLI closed exitCode=1 hasOutput=true with no failureClass, then ##[error]Process completed with exit code 1. audit-diff vs success baseline 27893147412: 0 new domains, 0 firewall status changes, 0 anomalies, turns unchanged → stable/transient. Same class as [aw-failures] [aw] Copilot CLI exits 1 with no classified error — chronic Daily Model Inventory failures #39946.
  5. Smoke Codex (27892632179): agent success; safe_outputs → Process Safe Outputs fails — ✗ Message 2 (set_issue_field) failed: No issue number available (messages: create_issue, set_issue_field, add_comment, comment_memory). New gap, see #aw_setfield.
  6. Smoke AOAI (Entra) (27892686646): agent success; push_experiments_state job fails at checkout — fatal: unable to access 'https://github.com/github/gh-aw/': Could not resolve host: github.com → git exit 128. Transient runner DNS failure, no product defect, no recurrence.
  7. Smoke CI (27892967610): cancelled (concurrency/guard), not a failure.
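
For the Smoke Codex gap, one possible fix is to bind later messages to the issue created earlier in the same batch. The sketch below is an assumption about a reasonable design, not the current safe_outputs handler: `bind_issue_numbers` and its `create_issue` callback are hypothetical, and it is assumed (not confirmed by the logs) that add_comment also needs an issue number while comment_memory does not.

```python
from typing import Callable, Optional

# Message types assumed to act on an existing issue.
NEEDS_ISSUE = {"set_issue_field", "add_comment"}


def bind_issue_numbers(messages, context_issue: Optional[int],
                       create_issue: Callable[[dict], int]):
    """Process messages in order, binding later items to the most recent issue."""
    current = context_issue  # None for runs not triggered by an issue
    results = []
    for msg in messages:
        kind = msg["type"]
        if kind == "create_issue":
            # Remember the new issue so that later messages can target it.
            current = create_issue(msg)
            results.append((kind, current))
        elif kind in NEEDS_ISSUE:
            number = msg.get("issue_number", current)
            if number is None:
                raise ValueError(f"{kind}: no issue number available")
            results.append((kind, number))
        else:
            results.append((kind, None))
    return results
```

With the message order seen in this run (create_issue, set_issue_field, add_comment, comment_memory), set_issue_field would resolve to the issue created by the first message instead of failing.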

Existing-issue correlation

Fix roadmap

Sub-issues created

  • #aw_setfield — Smoke Codex safe_outputs fails on unbound set_issue_field.

References: §27895333209 · §27892632179 · §27893974986

Generated by 🔍 [aw] Failure Investigator (6h) · 321.2 AIC · ⌖ 14.2 AIC · ⊞ 4.9K ·
