Skip to content

[aw-failures] [aw] Harden DIFC awf-cli-proxy startup — one transient incident failed Auto-Triage + Sub-Issue Closer (runs 27261698585, 2726137 [Content truncated due to length] #38309

@github-actions

Description

@github-actions

Fix the DIFC awf-cli-proxy startup path — a single ~7-minute infra incident (07:46–07:53 UTC, 2026-06-10) failed two scheduled agentic workflows before the agent ever ran. Both failures share one signature: the firewall fails fast after only 2 liveness probes when the external DIFC proxy is slow to accept connections.

Summary

  • Scope: 2 workflows, 1 root cause, 0 agent turns executed (agent never invoked).
  • Class: Infrastructure / firewall startup race — not an agent-logic, prompt, or firewall-policy regression.
  • Status: Transient — surrounding and subsequent runs of both workflows are green.
  • Priority: P1 (blocks scheduled runs at the infra layer; recurs on DIFC proxy startup slowness).

Failure Cluster Table

Workflow Run Trigger Signature Turns Existing issue
Auto-Triage Issues 27261698585 schedule awf-cli-proxy exited (1) — DIFC proxy connect refused 0 (baseline 10) #38308
Sub-Issue Closer 27261373050 schedule awf-cli-proxy exited (1) — DIFC proxy connect refused 0 (baseline 118) #38305

Root Cause

The awf-cli-proxy sidecar tunnels localhost:18443 → host.docker.internal:18443 (external DIFC proxy), then probes liveness with gh api .../rate_limit. The probe failed connection-refused, the firewall aborted, and the agent was never invoked.

[cli-proxy] DIFC proxy probe failed (attempt 1/2), retrying in 1s...
[cli-proxy] ERROR: DIFC proxy liveness probe failed for localhost:18443 (gh api exit=0)
[cli-proxy] gh api error: Get "(localhost/redacted) dial tcp [::1]:18443: connect: connection refused
[cli-proxy] Failing fast to avoid repeated in-agent retries
[ERROR] Fatal error: AWF firewall failed to start: awf-cli-proxy could not connect to the external DIFC proxy ... The agent was never invoked.

Both runs failed in the same minute window, so the trigger is host-level DIFC proxy startup contention under concurrent scheduled jobs — the fail-fast probe (2 attempts, 1s apart) is too tight to absorb it.

Evidence — audit & audit-diff
  • audit (both runs): turns: 0 vs baselines 10 / 118 → agent never executed.
  • audit-diff (27261698585 fail vs 27261149457 success, same workflow): has_anomalies: false, anomaly_count: 0, no new blocked domains, no posture change, token_usage: 0. All deltas are downstream of the agent not starting — confirms no firewall-policy or config regression.
  • Probe resolves localhost to IPv6 [::1]; worth confirming the tunnel listener binds the same family.

Existing Issue Correlation

  • [aw] Auto-Triage Issues failed #38308 and [aw] Sub-Issue Closer failed #38305 are the auto-generated per-workflow failure issues for this cluster; both say only "assign to an agent to debug." This issue supplies the shared root-cause analysis and remediation they lack — treat them as duplicates of this item.
  • [aw] Daily Hippo Learn failed #38306 (Daily Hippo Learn failed, run 27261326308) failed in the same window but on a different cause — Docker Hub pull timeout for node:lts-alpine (registry-1.docker.io ... i/o timeout, exit 123). Transient network; no code fix required. P2, left to its existing issue.
  • Cancelled run 27261134689 (Auto-Triage) was a benign concurrency cancellation (superseded by run 27261149457, which succeeded). No action.

Proposed Remediation

  1. Add bounded retry/backoff to the DIFC cli-proxy liveness probe — replace the 2-attempt/1s fail-fast with exponential backoff (e.g., ~5 attempts over ~15–30s) so brief proxy-startup contention no longer fails the whole run.
  2. Confirm IPv4/IPv6 binding parity for the localhost:18443 tunnel listener vs the probe's [::1] resolution; pin to 127.0.0.1 if the listener is IPv4-only.
  3. Optional (P2, Hippo Learn): mirror/pre-pull the Docker Hub node:lts-alpine base image to ghcr.io to remove the Docker Hub registry from the critical pull path.

Success Criteria

  • Scheduled agentic runs survive transient DIFC proxy startup slowness without a fatal firewall abort (agent reaches turn 1).
  • No awf-cli-proxy could not connect to the external DIFC proxy fatal in scheduled runs over a 7-day observation window.
  • Verify with audit that turns > 0 on the next scheduled Auto-Triage / Sub-Issue Closer runs.

References:

Generated by 🔍 [aw] Failure Investigator (6h) · 194.5 AIC · ⌖ 12.3 AIC · ⊞ 5.1K ·

  • expires on Jun 17, 2026, 12:20 AM UTC-08:00

Fix the DIFC awf-cli-proxy startup path — a single ~7-minute infra incident (07:46–07:53 UTC, 2026-06-10) failed two scheduled agentic workflows before the agent ever ran. Both failures share one signature: the firewall fails fast after only 2 liveness probes when the external DIFC proxy is slow to accept connections.

Summary

  • Scope: 2 workflows, 1 root cause, 0 agent turns executed (agent never invoked).
  • Class: Infrastructure / firewall startup race — not an agent-logic, prompt, or firewall-policy regression.
  • Status: Transient — surrounding and subsequent runs of both workflows are green.
  • Priority: P1 (blocks scheduled runs at the infra layer; recurs on DIFC proxy startup slowness).

Failure Cluster Table

Workflow Run Trigger Signature Turns Existing issue
Auto-Triage Issues 27261698585 schedule awf-cli-proxy exited (1) — DIFC proxy connect refused 0 (baseline 10) #38308
Sub-Issue Closer 27261373050 schedule awf-cli-proxy exited (1) — DIFC proxy connect refused 0 (baseline 118) #38305

Root Cause

The awf-cli-proxy sidecar tunnels localhost:18443 → host.docker.internal:18443 (external DIFC proxy), then probes liveness with gh api .../rate_limit. The probe failed connection-refused, the firewall aborted, and the agent was never invoked.

[cli-proxy] DIFC proxy probe failed (attempt 1/2), retrying in 1s...
[cli-proxy] ERROR: DIFC proxy liveness probe failed for localhost:18443 (gh api exit=0)
[cli-proxy] gh api error: Get "(localhost/redacted) dial tcp [::1]:18443: connect: connection refused
[cli-proxy] Failing fast to avoid repeated in-agent retries
[ERROR] Fatal error: AWF firewall failed to start: awf-cli-proxy could not connect to the external DIFC proxy ... The agent was never invoked.

Both runs failed in the same minute window, so the trigger is host-level DIFC proxy startup contention under concurrent scheduled jobs — the fail-fast probe (2 attempts, 1s apart) is too tight to absorb it.

Evidence — audit & audit-diff
  • audit (both runs): turns: 0 vs baselines 10 / 118 → agent never executed.
  • audit-diff (27261698585 fail vs 27261149457 success, same workflow): has_anomalies: false, anomaly_count: 0, no new blocked domains, no posture change, token_usage: 0. All deltas are downstream of the agent not starting — confirms no firewall-policy or config regression.
  • Probe resolves localhost to IPv6 [::1]; worth confirming the tunnel listener binds the same family.

Existing Issue Correlation

  • [aw] Auto-Triage Issues failed #38308 and [aw] Sub-Issue Closer failed #38305 are the auto-generated per-workflow failure issues for this cluster; both say only "assign to an agent to debug." This issue supplies the shared root-cause analysis and remediation they lack — treat them as duplicates of this item.
  • [aw] Daily Hippo Learn failed #38306 (Daily Hippo Learn failed, run 27261326308) failed in the same window but on a different cause — Docker Hub pull timeout for node:lts-alpine (registry-1.docker.io ... i/o timeout, exit 123). Transient network; no code fix required. P2, left to its existing issue.
  • Cancelled run 27261134689 (Auto-Triage) was a benign concurrency cancellation (superseded by run 27261149457, which succeeded). No action.

Proposed Remediation

  1. Add bounded retry/backoff to the DIFC cli-proxy liveness probe — replace the 2-attempt/1s fail-fast with exponential backoff (e.g., ~5 attempts over ~15–30s) so brief proxy-startup contention no longer fails the whole run.
  2. Confirm IPv4/IPv6 binding parity for the localhost:18443 tunnel listener vs the probe's [::1] resolution; pin to 127.0.0.1 if the listener is IPv4-only.
  3. Optional (P2, Hippo Learn): mirror/pre-pull the Docker Hub node:lts-alpine base image to ghcr.io to remove the Docker Hub registry from the critical pull path.

Success Criteria

  • Scheduled agentic runs survive transient DIFC proxy startup slowness without a fatal firewall abort (agent reaches turn 1).
  • No awf-cli-proxy could not connect to the external DIFC proxy fatal in scheduled runs over a 7-day observation window.
  • Verify with audit that turns > 0 on the next scheduled Auto-Triage / Sub-Issue Closer runs.

🔁 Recurrence Update — 2026-06-10 13:40 UTC (now recurring, not a one-off transient)

Prioritize remediation #1 (bounded retry/backoff on the DIFC liveness probe) — the awf-cli-proxy startup race recurred ~6h after the original 07:46–07:53 UTC incident, so this is no longer a single transient. Same fatal signature; the agent was never invoked.

Workflow Run Trigger Time (UTC) Signature Turns
Auto-Triage Issues 27280458294 schedule 13:40:51 awf-cli-proxy exited (1) — DIFC proxy connect refused (dial tcp [::1]:18443) 0 (baseline 27273355290 = success)
Fresh evidence — audit & audit-diff (run 27280458294)
  • audit (27280458294): turns=0, agentic_fraction=0, safe_outputs job skipped, fatal The agent was never invoked. Identical fail-fast path — probe attempt 1/2, retry 1s, then Failing fast to avoid repeated in-agent retries.
  • audit-diff (base success 27273355290 → fail 27280458294): has_anomalies=false, anomaly_count=0, run2_token_usage=0, no new blocked domains, no status/posture change. Re-confirms no firewall-policy or config regression — pure host-level DIFC proxy startup race.
  • Probe again resolves localhost to IPv6 [::1]:18443 (connection refused) — supports remediation Add workflow: githubnext/agentics/weekly-research #2 (IPv4/IPv6 binding parity; pin to 127.0.0.1).

Scope change vs original report: the original framed this as a single ~7-minute incident with surrounding/subsequent runs green. This 13:40 UTC recurrence — in an unrelated ~6h-later scheduling window — shows the race recurs under normal cadence, so the 7-day no-recurrence success criterion is already failing. No new sub-issue created: remediation is unchanged and already specified above; this update adds evidence rather than splitting scope.

References:

Generated by 🔍 [aw] Failure Investigator (6h) · 194.5 AIC · ⌖ 12.3 AIC · ⊞ 5.1K ·

  • expires on Jun 17, 2026, 12:20 AM UTC-08:00

Generated by 🔍 [aw] Failure Investigator (6h) · 316.2 AIC · ⌖ 16.2 AIC · ⊞ 5.1K ·


Fix the DIFC awf-cli-proxy startup path — a single ~7-minute infra incident (07:46–07:53 UTC, 2026-06-10) failed two scheduled agentic workflows before the agent ever ran. Both failures share one signature: the firewall fails fast after only 2 liveness probes when the external DIFC proxy is slow to accept connections.

Summary

  • Scope: 2 workflows, 1 root cause, 0 agent turns executed (agent never invoked).
  • Class: Infrastructure / firewall startup race — not an agent-logic, prompt, or firewall-policy regression.
  • Status: Transient — surrounding and subsequent runs of both workflows are green.
  • Priority: P1 (blocks scheduled runs at the infra layer; recurs on DIFC proxy startup slowness).

Failure Cluster Table

Workflow Run Trigger Signature Turns Existing issue
Auto-Triage Issues 27261698585 schedule awf-cli-proxy exited (1) — DIFC proxy connect refused 0 (baseline 10) #38308
Sub-Issue Closer 27261373050 schedule awf-cli-proxy exited (1) — DIFC proxy connect refused 0 (baseline 118) #38305

Root Cause

The awf-cli-proxy sidecar tunnels localhost:18443 → host.docker.internal:18443 (external DIFC proxy), then probes liveness with gh api .../rate_limit. The probe failed connection-refused, the firewall aborted, and the agent was never invoked.

[cli-proxy] DIFC proxy probe failed (attempt 1/2), retrying in 1s...
[cli-proxy] ERROR: DIFC proxy liveness probe failed for localhost:18443 (gh api exit=0)
[cli-proxy] gh api error: Get "(localhost/redacted) dial tcp [::1]:18443: connect: connection refused
[cli-proxy] Failing fast to avoid repeated in-agent retries
[ERROR] Fatal error: AWF firewall failed to start: awf-cli-proxy could not connect to the external DIFC proxy ... The agent was never invoked.

Both runs failed in the same minute window, so the trigger is host-level DIFC proxy startup contention under concurrent scheduled jobs — the fail-fast probe (2 attempts, 1s apart) is too tight to absorb it.

Evidence — audit & audit-diff
  • audit (both runs): turns: 0 vs baselines 10 / 118 → agent never executed.
  • audit-diff (27261698585 fail vs 27261149457 success, same workflow): has_anomalies: false, anomaly_count: 0, no new blocked domains, no posture change, token_usage: 0. All deltas are downstream of the agent not starting — confirms no firewall-policy or config regression.
  • Probe resolves localhost to IPv6 [::1]; worth confirming the tunnel listener binds the same family.

Existing Issue Correlation

  • [aw] Auto-Triage Issues failed #38308 and [aw] Sub-Issue Closer failed #38305 are the auto-generated per-workflow failure issues for this cluster; both say only "assign to an agent to debug." This issue supplies the shared root-cause analysis and remediation they lack — treat them as duplicates of this item.
  • [aw] Daily Hippo Learn failed #38306 (Daily Hippo Learn failed, run 27261326308) failed in the same window but on a different cause — Docker Hub pull timeout for node:lts-alpine (registry-1.docker.io ... i/o timeout, exit 123). Transient network; no code fix required. P2, left to its existing issue.
  • Cancelled run 27261134689 (Auto-Triage) was a benign concurrency cancellation (superseded by run 27261149457, which succeeded). No action.

Proposed Remediation

  1. Add bounded retry/backoff to the DIFC cli-proxy liveness probe — replace the 2-attempt/1s fail-fast with exponential backoff (e.g., ~5 attempts over ~15–30s) so brief proxy-startup contention no longer fails the whole run.
  2. Confirm IPv4/IPv6 binding parity for the localhost:18443 tunnel listener vs the probe's [::1] resolution; pin to 127.0.0.1 if the listener is IPv4-only.
  3. Optional (P2, Hippo Learn): mirror/pre-pull the Docker Hub node:lts-alpine base image to ghcr.io to remove the Docker Hub registry from the critical pull path.

Success Criteria

  • Scheduled agentic runs survive transient DIFC proxy startup slowness without a fatal firewall abort (agent reaches turn 1).
  • No awf-cli-proxy could not connect to the external DIFC proxy fatal in scheduled runs over a 7-day observation window.
  • Verify with audit that turns > 0 on the next scheduled Auto-Triage / Sub-Issue Closer runs.

🔁 Recurrence Update — 2026-06-10 (DIFC race is systemic & recurring, not a one-off)

Prioritize remediation #1 (bounded retry/backoff on the DIFC liveness probe) now — the awf-cli-proxy startup race did NOT stay confined to the 07:46–07:53 UTC window; it recurred 4 more times across the next ~6 hours, hitting 4 additional scheduled/PR agentic workflows. Every instance has the identical fatal signature (awf-cli-proxy exited (1), probe attempt 1/2Failing fast, agent never invoked, turns=0).

Workflow Run Trigger Time (UTC) Turns
Dev 27268650444 schedule 10:02:15 0
Test Quality Sentinel 27270831399 pull_request 10:44:55 0
Issue Monster 27271225223 schedule 10:52:41 0
Auto-Triage Issues 27280458294 schedule 13:40:51 0

Tally: ≥6 DIFC-race failures across 07:46→13:40 UTC on 2026-06-10, spanning 6 distinct workflows (original 2 in the table above + these 4). This is a recurring systemic infra failure, not a single transient incident — the 7-day no-recurrence success criterion is already failing on day 0.

Fresh evidence — audit & audit-diff
  • audit (all 4 runs): turns=0, agentic_fraction=0, safe_outputs skipped, identical fatal The agent was never invoked. Probe resolves localhost → IPv6 [::1]:18443 (connection refused) in every case → reinforces remediation Add workflow: githubnext/agentics/weekly-research #2 (IPv4/IPv6 binding parity; pin to 127.0.0.1).
  • audit-diff (success 27273355290 → fail 27280458294): has_anomalies=false, anomaly_count=0, token_usage=0, no new blocked domains, no posture/policy change → re-confirms no firewall-policy or config regression; pure host-level startup race.

Related distinct failures in the same 6h window (separate root causes — NOT this DIFC cluster):

  • Node.js missing in AWF chrootDaily News (27268191506, 09:53): Copilot CLI requires Node.js, but 'node' is not available inside AWF chroot → exit 127. Distinct chroot-PATH bug; filed as a separate sub-issue.
  • AI credits budget exceededGlossary Maintainer (27272831466, 11:24): 429 Maximum AI credits exceeded (1016.5 / 1000), non-retryable. Daily AIC cap (relates to the token-audit tracker); no code fix.
  • Transient npm ECONNRESETMatt Pocock Skills Reviewer (27270743427, 10:43): activation npm install git-dep hit ECONNRESET (errno -104). Pure transient network; no fix required.

References:

Generated by 🔍 [aw] Failure Investigator (6h) · 194.5 AIC · ⌖ 12.3 AIC · ⊞ 5.1K ·

  • expires on Jun 17, 2026, 12:20 AM UTC-08:00

Generated by 🔍 [aw] Failure Investigator (6h) · 316.2 AIC · ⌖ 16.2 AIC · ⊞ 5.1K ·


🔁 Afternoon recurrence — 2026-06-10 16:41–18:25 UTC (DIFC race is now P0; ship the fix)

Land remediation #1 (bounded retry/backoff on the DIFC liveness probe) now — the awf-cli-proxy startup race recurred 4 more times this afternoon, after the morning cluster already documented above. Identical fatal signature every time: dependency failed to start: container awf-cli-proxy exited (1), probe attempt 1/2Failing fast, agent never invoked, turns=0.

Workflow Run Trigger Time (UTC) Auto-issue
Daily Formal Spec Verifier 27290883003 schedule 16:41 #38402 (closing as dup)
Daily SPDD Spec Planner 27291716088 schedule 16:56 #38409 (closing as dup)
Design Decision Gate 🏗️ 27293003970 pull_request 17:20 #38413 (closing as dup)
Daily Secrets Analysis Agent 27297209601 schedule 18:25 #38421 (closing as dup)

Cumulative tally: ≥10 DIFC-race failures across 07:46→18:25 UTC on 2026-06-10, spanning ≥10 distinct workflows. The race recurs every few hours under normal cadence — the 7-day no-recurrence success criterion is failing continuously. Treat as P0 (sustained infra-layer loss of scheduled agentic runs).

Fresh evidence — run 27297209601 (deep-verified) + 3 afternoon siblings

The 4 afternoon per-workflow auto-issues are being closed as duplicates of this item to consolidate tracking. Remediation and success criteria are unchanged — this update adds recurrence evidence and escalation, not new scope.

References:

Generated by 🔍 [aw] Failure Investigator (6h) · 317.3 AIC · ⌖ 13.8 AIC · ⊞ 5.1K ·

Metadata

Metadata

Assignees

No one assigned

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions