[aw] Failure Investigation — 6h window ending 2026-06-17 19:34 UTC
Executive summary
Scope: 29 failed/cancelled runs in github/gh-aw over the last 6h. 13 were cancelled (concurrency/guard, not defects); 16 were genuine failure outcomes across 6 signature clusters.
P0 (provider/config): Codex engine cannot reach its model — every Codex run 404s on gpt-5-codex-alpha-2025-11-07. Tracked per-workflow but root cause not escalated.
P1 (product bug, new/untracked): four asset-producing workflows succeed at the agent step but the upload_assets job fails because the declared PNG asset files are never staged. Filed as sub-issue below.
P1 (provider): Copilot-CLI BYOK proxy rejects claude-sonnet-4.6 with 400 model not supported; partially covered by the existing cascade rollup.
No open agentic-workflows issue qualified for closure — all 20 are <6h old and reflect still-active failure modes; none have fresh evidence of being fixed.
Evidence
C1: Codex model 404 (P0)
Dominant error: 404 Not Found: Model not found gpt-5-codex-alpha-2025-11-07 at (172.30.0.30/redacted)
The Codex engine resolves alias gpt-5-codex to backend id gpt-5-codex-alpha-2025-11-07, which no longer exists on the proxy. The Unknown model gpt-5-codex ... fallback model metadata path swaps metadata only — the request still targets the missing backend, so all 5 reconnect retries 404 and the turn fails with 0 tool calls.
Audit posture: read-only, turns 11→0 vs baseline (agent never started real work).
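The fallback behaviour described above (metadata swapped, request still aimed at the dead backend) can be contrasted with a resolver that re-targets the request itself. The following is a minimal sketch under stated assumptions, not the Codex engine's actual code: `LIVE_MODELS`, `FALLBACK_MODEL`, the id `gpt-5-codex-stable`, and `resolve_model` are hypothetical names introduced only for illustration; the alias and the missing backend id come from the log above.

```python
from dataclasses import dataclass

# Alias and dead backend id are taken from the report; everything else is hypothetical.
ALIASES = {"gpt-5-codex": "gpt-5-codex-alpha-2025-11-07"}
LIVE_MODELS = {"gpt-5-codex-stable"}   # made-up id standing in for "a live model"
FALLBACK_MODEL = "gpt-5-codex-stable"  # made-up fallback target


@dataclass
class ResolvedModel:
    requested: str       # what the workflow asked for
    target: str          # backend id the request will actually be sent to
    used_fallback: bool  # True when the alias pointed at a missing backend


def resolve_model(requested: str) -> ResolvedModel:
    """Resolve an alias and re-target the request when its backend is gone."""
    backend = ALIASES.get(requested, requested)
    if backend in LIVE_MODELS:
        return ResolvedModel(requested, backend, used_fallback=False)
    if FALLBACK_MODEL in LIVE_MODELS:
        # The request target changes here, not only the model metadata.
        return ResolvedModel(requested, FALLBACK_MODEL, used_fallback=True)
    raise LookupError(f"no live backend for {requested!r} (resolved to {backend!r})")
```

With a resolver of this shape, a stale alias would surface as a single fallback decision (or one clear error) rather than as repeated 404s on every reconnect retry.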
C2: Phantom asset (P1, new)
upload_assets job fails: ERR_SYSTEM: Asset file not found: /tmp/gh-aw/safeoutputs/assets/quality_score_breakdown.png while the agent job succeeded.
The agent emitted upload_asset safe-output items referencing PNGs with sha + byte size + markdown image links (e.g. quality_score_breakdown.png size=178654, historical_trends.png size=422386) that were never written to the staging dir.
Path mismatch: agent referenced .../.gh-aw-assets/(file) while the upload pipeline reads /tmp/gh-aw/safeoutputs/assets/; the safe-outputs-assets artifact was never produced (Artifact not found for name: safe-outputs-assets).
C3: Copilot BYOK (P1, heterogeneous)
Representative §27703869930: 400 The requested model is not supported (model=claude-sonnet-4.6, isModelNotSupportedError=true), not retried; not auth, not denial-limit (permissionDeniedCount=0).
Existing issue correlation
Cascade rollup [aw] Failure cascade detected #39852 already groups 10 of today's [aw] * failed issues — consistent with C1/C3 sharing provider-side root causes.
Gaps: C2 (phantom asset) has no tracking coverage → addressed by the sub-issue below. C1 root cause is filed only as per-workflow symptoms; recommend treating [aw] Daily Cache Strategy Analyzer failed #39878 as the canonical P0 and updating the Codex model alias.
Closures: none performed — every open agentic-workflows issue is <6h old and reflects an active failure mode; no fresh evidence of resolution.
Fix roadmap
P0 — Restore the Codex model route. The configured Codex model resolves to gpt-5-codex-alpha-2025-11-07, which 404s on the proxy. Point the alias at a live model (or restore the backend) and make the "unknown model" fallback re-target the request, not just metadata. Canonical tracker: [aw] Daily Cache Strategy Analyzer failed #39878.
P1 — Fix asset staging (sub-issue below). Ensure declared upload_asset files are staged to /tmp/gh-aw/safeoutputs/assets/ before the upload_assets job, and validate file existence at safe-output emission time so the agent cannot declare phantom assets.
P1 — Copilot BYOK model support. Resolve claude-sonnet-4.6 rejection (400 not supported) on the BYOK proxy; covered by cascade [aw] Failure cascade detected #39852.
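The asset-staging item in the roadmap asks for file existence to be validated at safe-output emission time. A minimal sketch of such a check follows, under stated assumptions: the item keys `file_name`, `size`, and `sha`, the use of SHA-256, and the function name `validate_upload_asset` are hypothetical and not confirmed by the report; only the staging directory path is taken from the error message.

```python
import hashlib
from pathlib import Path

# Staging directory quoted in the ERR_SYSTEM message above.
STAGING_DIR = Path("/tmp/gh-aw/safeoutputs/assets")


def validate_upload_asset(item: dict, staging_dir: Path = STAGING_DIR) -> list[str]:
    """Return the problems found for one declared asset; an empty list means it can be emitted."""
    path = staging_dir / item["file_name"]  # hypothetical key name
    if not path.is_file():
        return [f"asset not staged: {path}"]
    data = path.read_bytes()
    problems = []
    if "size" in item and len(data) != item["size"]:
        problems.append(
            f"size mismatch for {path.name}: declared {item['size']}, found {len(data)}"
        )
    # SHA-256 is an assumption; the report only says a sha was declared.
    if "sha" in item and hashlib.sha256(data).hexdigest() != item["sha"]:
        problems.append(f"sha mismatch for {path.name}")
    return problems
```

Rejecting the item when the returned list is non-empty would move the failure from the later upload_assets job to the point where the agent declares the asset.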
6h-window follow-up: 2026-06-18 08:26 UTC
Scope: 18 failed/cancelled runs. 5 were cancelled Smoke CI runs (concurrency/guard, not defects); 13 were genuine failures across 5 signature clusters. No P0 (no provider-down). No issue closures: all open trackers reflect still-active modes.
Closures: none. #39885 and #39946 both recur this window; parent (this issue) and #39790 (token audit) are unrelated/fresh.
New sub-issue: LintMonster — scheduled run fails Process Safe Outputs because the agent emitted update_issue target:triggering (unsatisfiable outside issue context) and the skip is counted as a hard failure. Linked to this report.
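The LintMonster case turns on whether an unsatisfiable item is treated as a skip or as a failure. The sketch below models only that decision, assuming a scheduled run has no triggering issue; `Outcome`, `process_update_issue`, and `job_failed` are hypothetical names, and the actual safe-outputs handler is not shown in this report.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    APPLIED = "applied"
    SKIPPED = "skipped"  # unsatisfiable in this context, not an error
    FAILED = "failed"


def process_update_issue(item: dict, triggering_issue: Optional[int]) -> Outcome:
    """Treat target:triggering without an issue context as a skip."""
    if item.get("target") == "triggering" and triggering_issue is None:
        return Outcome.SKIPPED
    # A real handler would call the GitHub API here; this sketch only models the decision.
    return Outcome.APPLIED


def job_failed(outcomes: list) -> bool:
    """Only genuine failures turn the job red; skips do not."""
    return any(o is Outcome.FAILED for o in outcomes)
```

Under this model a scheduled run that emits `update_issue` with `target: triggering` would report a skipped item while the Process Safe Outputs step stays green.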
Monitored singletons (no issue filed — single occurrence, post-agent infra step):
Avenger §27740190790 (main): agent job marked failure at Parse agent logs for step summary, but the agent itself completed and emitted a noop ("No PR created"). Post-processing log-parser step failure; watch for recurrence.
Daily Safe Outputs Git Simulator §27739899787 (main): agent success (emitted noop, state persisted to repo memory), but push_repo_memory/Push repo-memory changes (default) failed — likely a repo-memory branch push race/conflict. Single occurrence.
Failure Investigation — 6h window ending 2026-06-19 19:20 UTC
Overview
Prefetch flagged 40 runs, but 31 were cancelled (concurrency: a dependabot PR batch at 15:00 UTC + main pushes superseded in-flight runs) — not real failures. 9 runs genuinely failed, across 7 signatures. No systemic P0; no new parent report needed. Two new sub-issues filed; one existing tracker confirmed still active.
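Separating cancelled runs from genuine failures before grouping, as done in this overview, can be sketched as follows. This is an illustrative outline only: the `conclusion` field mirrors the value GitHub Actions reports for a workflow run, while the `signature` key and the function name `cluster_failures` are hypothetical.

```python
from collections import defaultdict


def cluster_failures(runs: list) -> dict:
    """Drop cancelled runs, then group the remaining run ids by error signature."""
    clusters = defaultdict(list)
    for run in runs:
        if run["conclusion"] == "cancelled":
            continue  # superseded by concurrency, not a defect
        clusters[run["signature"]].append(run["id"])  # "signature" is a hypothetical key
    return dict(clusters)
```

Filtering first keeps concurrency cancellations from inflating the run count or creating a cluster of their own.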
C6 | Git LFS fetch fail: Could not scan for Git LFS files (exit 2) | Documentation Unbloat | 1 | checkout/LFS | documented
C7 | super-linter exit 1 + artifact zip error | Super Linter Report | 1 | lint/infra | documented
Evidence (key log refs)
C2 §27843593919: unexpected status 404 Not Found: Model not found gpt-5-codex-alpha-2025-11-07 on every sampling retry; all 3 retries exhausted.
C3 §27834964940: pre-flight: command not found: python3 + ModuleNotFoundError: No module named 'copilot'; Turns=0, agent_output.json={"items":[]}.
C1 §27841318778: ##[error]ERR_CONFIG: Claude execution failed: no structured log entries were produced.
C4 §27837850380: [sdk-driver] max tool denials threshold reached (5/5); 13 denials of shell(sed ...) / shell(cd && ...) not in the allow-list (permissionDeniedCount=13).
C5 §27841406519: [sdk-driver] error: Timeout after 870000ms waiting for session.idle (sub-agent grumpy-coder ran long); isSDKSessionIdleTimeoutError=true; action timed out at 15m.
C6 §27839021623: git lfs fetch → Could not scan for Git LFS files, git failed with exit code 2; agent never ran.
C7 §27832960010: super-linter step Process completed with exit code 1, then An error has occurred during zip creation for the artifact on log upload.
audit-diff (firewall/metrics)
Pairwise diffs across C1–C5 runs show no firewall anomalies (has_anomalies:false; only expected provider-domain swaps: codex→api.openai.com/chatgpt.com, copilot→api.githubcopilot.com). Failures are engine/config/driver — not network policy. C5 shows the cost signature of a hang: core_consumed +11767% (1068 GitHub API calls), 153 copilot calls before the idle timeout.
[aw] Failure Investigation — 6h window ending 2026-06-21 08:19 UTC
Executive summary
No issues qualified for closure and no new parent is warranted — every P0/P1 failure in this window is already tracked and still active. 9 failed/cancelled runs across 7 signatures; 6 of 7 map cleanly to open agentic-workflows issues, 1 is a transient infra blip, and only one new untracked product gap was found (filed as a sub-issue below).
PR Sous Chef (27895333209): agent ran 28 turns / 599k tokens; Execute GitHub Copilot CLI closed exitCode=1 hasOutput=true with no failureClass → ##[error]Process completed with exit code 1. audit-diff vs success baseline 27893147412: 0 new domains, 0 firewall status changes, 0 anomalies, turns unchanged → stable/transient. Same class as [aw-failures] [aw] Copilot CLI exits 1 with no classified error — chronic Daily Model Inventory failures #39946.
Smoke Codex (27892632179): agent success; safe_outputs → Process Safe Outputs fails — ✗ Message 2 (set_issue_field) failed: No issue number available (messages: create_issue, set_issue_field, add_comment, comment_memory). New gap, see #aw_setfield.
Smoke AOAI (Entra) (27892686646): agent success; push_experiments_state job fails at checkout — fatal: unable to access 'https://github.com/github/gh-aw/': Could not resolve host: github.com → git exit 128. Transient runner DNS failure, no product defect, no recurrence.
Smoke CI (27892967610): cancelled (concurrency/guard), not a failure.
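The Smoke Codex failure above involves a set_issue_field message that follows a create_issue in the same batch but has no issue number to bind to. One way such same-batch resolution could work is sketched below. Everything here is an assumption for illustration: the message keys `type` and `issue_number`, the idea that create_issue carries its number once the issue exists, and the function name `resolve_issue_targets` are not taken from the gh-aw code.

```python
from typing import Optional


def resolve_issue_targets(messages: list, context_issue: Optional[int] = None) -> list:
    """Bind each unbound set_issue_field to the context issue or to the issue created earlier in the batch."""
    created: Optional[int] = None
    resolved = []
    for msg in messages:
        msg = dict(msg)  # copy so the caller's batch is not mutated
        if msg["type"] == "create_issue":
            # Assumed to be filled in once the issue has actually been created.
            created = msg.get("issue_number")
        elif msg["type"] == "set_issue_field" and msg.get("issue_number") is None:
            target = context_issue if context_issue is not None else created
            if target is None:
                raise ValueError("set_issue_field has no issue number available")
            msg["issue_number"] = target
        resolved.append(msg)
    return resolved
```

For a batch ordered like the one in the log (create_issue, set_issue_field, add_comment, comment_memory), the second message would inherit the number of the issue created by the first.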
Additional C3 evidence (6h window ending 2026-06-17 19:34 UTC): hasNumerousPermissionDenied (permissionDeniedCount=11).
References (6h window ending 2026-06-17 19:34 UTC): §27713303874 · §27713375907 · §27703869930
Cluster signatures (6h-window follow-up, 2026-06-18 08:26 UTC):
upload_assets/Push assets, agent success (P1)
Execute Copilot CLI exit 1, classifiers false (P1)
Process Safe Outputs fail on unsatisfiable item (P1)
Process Safe Outputs on PR/dev branches (P2)
References (6h-window follow-up, 2026-06-18 08:26 UTC): §27735258410 · §27737401463 · §27738455642
6h window ending 2026-06-19 19:20 UTC: failure clusters (real failures only)
C1 ERR_CONFIG: no structured log entries
C2 Model not found gpt-5-codex-alpha-2025-11-07 (proxy 404)
C3 ModuleNotFoundError: 'copilot' (python3 absent)
C4 max tool denials (5/5) → SDK aborts
C5 sdk_session_idle_timeout → 15-min action timeout
C6 Could not scan for Git LFS files (exit 2)
Existing-issue correlation & close decisions
[…] conclusion == cancelled before clustering.
Fix roadmap
[…] sed, or make tool-denial non-fatal); C5 (signal session.idle after sub-agent completion / raise step timeout for PR review).
[…] lfs:true if unused); C7 (super-linter findings + upload-artifact zip race).
Sub-issues created this run
[…] gpt-5-codex dead alias 404.
C4–C7 documented here but not separately filed (minimum-necessary; lower priority / transient). File on recurrence.
6h window ending 2026-06-21 08:19 UTC: tracked signatures, correlation, and roadmap
Skillet, Avenger, Code Simplifier: all reproduced their tracked signatures this window, so #40447 (Skillet floods Actions with startup-failures on copilot/* branch pushes), #40145 (Avenger agent job fails at "Parse agent logs"), and #40270 (Code Simplifier fails daily on main, Copilot BYOK provider returns HTTP 403) stay open as active.
PR Sous Chef failed once with an unclassified Copilot CLI exit-1 after a full agent run; audit-diff shows it is behaviorally identical to a successful baseline (stable, transient). Folded into existing class #39946 (Copilot CLI exits 1 with no classified error) as a newly-affected workflow.
Smoke Codex safe_outputs fails on set_issue_field "No issue number available". Sub-issue #aw_setfield.
Evidence per cluster (tracked signatures)
Skillet: […] copilot/* branch pushes, no agent job dispatched; exact signature of #40447 (Skillet triggers only on workflow_dispatch on main; stale lock on copilot branches fires on push).
Avenger: […] agent job fails at Parse agent logs for step summary after a successful upload; the ERR_CONFIG "no structured log entries" post-run parser defect of #40145.
Code Simplifier: […] Execute GitHub Copilot CLI fails: BYOK provider 403 authentication_failed, non-retryable, per #40270 (10/10 daily red, no baseline).
Existing-issue correlation
[…] Smoke Codex set_issue_field (no coverage) → filed. Smoke AOAI DNS blip and Smoke CI cancel → no tracking warranted.
Fix roadmap
[…] exitCode=1 hasOutput=true (no failureClass) instead of opaque job failure (#39946); fix set_issue_field same-batch target resolution (#aw_setfield).
Sub-issues created
[…] safe_outputs fails on unbound set_issue_field.
References: §27895333209 · §27892632179 · §27893974986