fix(mcp): detect transport close and update status to failed by herjarsa · Pull Request #24955 · anomalyco/opencode

herjarsa · 2026-04-29T13:07:37Z

Issue for this PR

Type of change

Bug fix
New feature
Refactor / code improvement
Documentation

What does this PR do?

This PR fixes the issue where MCP servers become unavailable mid-session (Issue #23997). After investigation, we found two complementary problems:

Problem 1: Transport errors disconnect servers with no recovery (PR #25670 scope)
When a transport error occurs (ECONNRESET, stale session 404, server crash), the MCP client throws on callTool and the server remains dead for the rest of the session.

Problem 2: MCP manager silently stops routing requests (Issue #23997 root cause)
After initialization, servers in "failed" status are never retried. The health-check loop was missing entirely.

Changes made:

Reactive reconnect (inline isTransportError + makeTool)
- Added isTransportError() that detects transport errors by duck-typing (avoids SDK import issues since StreamableHTTPError is not exported at runtime).
- Replaced convertMcpTool() with makeTool() that catches transport errors in execute, triggers a single-flight reconnect via reconnectClient(), and retries the tool call once with the fresh client.
- Used mcpStateRef pattern to allow async callbacks to access the current InstanceState without forward-reference issues.
Proactive health-check (startHealthCheck loop)
- Added Effect.repeat(Schedule.spaced(Duration.seconds(30))) running in a scoped fiber.
- Every 30 seconds, iterates servers with status === "failed" and attempts reconnectClient().
- Single-flight dedup: concurrent calls share one in-flight Promise via a Map<string, Promise<boolean>>.
Improved initialization logging
- During init, the handler that used to be Effect.catch(() => Effect.void) now logs the error and marks the server as failed instead of silently swallowing the failure.
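
The reactive path described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function bodies, the `connect` callback, and the exact error signals checked (`ECONNRESET`, `EPIPE`, a 404 code, a few message patterns) are assumptions made for the example. Only the names isTransportError and reconnectClient and the Map<string, Promise<boolean>> shape come from the description.

```typescript
// Duck-typed check: inspect fields on the thrown value instead of using
// `instanceof` against an SDK error class.
// The specific codes and message patterns below are assumptions for this sketch.
function isTransportError(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false
  const e = err as { code?: unknown; message?: unknown }
  if (e.code === "ECONNRESET" || e.code === "EPIPE" || e.code === 404) return true
  return typeof e.message === "string" && /connection closed|not connected/i.test(e.message)
}

// Single-flight dedup: concurrent callers for the same server share one promise.
const inFlight = new Map<string, Promise<boolean>>()

// `connect` is a hypothetical callback that re-establishes the client.
function reconnectClient(name: string, connect: (name: string) => Promise<void>): Promise<boolean> {
  const existing = inFlight.get(name)
  if (existing) return existing
  const attempt = connect(name)
    .then(
      () => true,
      () => false,
    )
    .finally(() => {
      // Allow a later reconnect once this attempt has settled.
      inFlight.delete(name)
    })
  inFlight.set(name, attempt)
  return attempt
}

// Run a tool call; on a transport error, reconnect and retry exactly once.
async function callWithReconnect<T>(
  name: string,
  call: () => Promise<T>,
  connect: (name: string) => Promise<void>,
): Promise<T> {
  try {
    return await call()
  } catch (err) {
    if (!isTransportError(err)) throw err
    const ok = await reconnectClient(name, connect)
    if (!ok) throw err
    return await call()
  }
}
```

Returning the stored promise directly (rather than making reconnectClient an async function) is what lets two concurrent callers observe the very same in-flight attempt.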

How did you verify your code works?

bun typecheck passes with 0 errors.
bun test test/mcp/ passes: 32/33 tests green (1 pre-existing timeout in headers.test.ts unrelated to this change).
The fix uses the existing EffectBridge and InstanceState patterns already established in the codebase.
The health-check loop runs in a forkScoped fiber so it is automatically cleaned up on instance disposal.
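
For readers unfamiliar with Effect, the behaviour of the health-check loop can be approximated with plain timers. The sketch below deliberately swaps Effect.repeat, Schedule.spaced, and forkScoped for setTimeout and a returned stop function, so it is an analogy rather than the implementation in this PR. The ServerEntry shape, healthCheckPass, and the `reconnect` callback are hypothetical names, and waiting one interval before the first pass is a simplification of this sketch.

```typescript
// Hypothetical shape; the real status type in the codebase may differ.
interface ServerEntry {
  status: "connected" | "failed"
}

// One pass: attempt a reconnect for every server currently marked "failed".
async function healthCheckPass(
  servers: Map<string, ServerEntry>,
  reconnect: (name: string) => Promise<boolean>,
): Promise<string[]> {
  const recovered: string[] = []
  for (const [name, entry] of servers) {
    if (entry.status !== "failed") continue
    if (await reconnect(name)) {
      entry.status = "connected"
      recovered.push(name)
    }
  }
  return recovered
}

// Schedules passes spaced `intervalMs` apart, measured from the end of the
// previous pass. The returned function stops the loop, standing in for the
// automatic cleanup a scoped fiber gets when its scope closes.
function startHealthCheck(
  servers: Map<string, ServerEntry>,
  reconnect: (name: string) => Promise<boolean>,
  intervalMs = 30_000,
): () => void {
  let stopped = false
  let timer: ReturnType<typeof setTimeout> | undefined
  const loop = async (): Promise<void> => {
    if (stopped) return
    // A failing pass must not kill the loop.
    await healthCheckPass(servers, reconnect).catch(() => [])
    if (!stopped) timer = setTimeout(loop, intervalMs)
  }
  timer = setTimeout(loop, intervalMs)
  return () => {
    stopped = true
    if (timer !== undefined) clearTimeout(timer)
  }
}
```

A caller would keep the returned function and invoke it on instance disposal, which is the step the scoped fiber makes unnecessary in the Effect version.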

Screenshots / recordings

N/A — backend-only change.

Checklist

I have tested my changes locally
I have not included unrelated changes in this PR

github-actions · 2026-04-29T13:07:48Z

Thanks for your contribution!

This PR doesn't have a linked issue. All PRs must reference an existing issue.

Please:

Open an issue describing the bug/feature (if one doesn't exist)
Add Fixes #<number> or Closes #<number> to this PR description

See CONTRIBUTING.md for details.

github-actions · 2026-04-29T13:08:47Z

The following comment was made by an LLM, it may be inaccurate:

The search results show PR #24955 (the current PR) appearing in the results, along with PR #19116. Let me verify if #19116 is related:

Potentially Related PR:

#19116: "fix(opencode): reconnect on network disruptions (VPN switch, SSE timeout, connection reset)"
- This PR addresses network disruptions and connection resets, which is conceptually related to detecting transport closure in MCP connections. However, it appears to be focused on broader network reconnection logic rather than specifically handling MCP transport closure.

The current PR (#24955) is focused specifically on detecting when an MCP server's transport closes and updating its status to 'failed', which is a targeted fix for issue #23997.

No duplicate PRs found

github-actions · 2026-04-29T13:28:53Z

Thanks for updating your PR! It now meets our contributing guidelines. 👍

herjarsa · 2026-04-29T14:15:09Z

Why this is NOT a duplicate of #19116

#19116 focuses on network-level disruptions (VPN switch, SSE timeout, connection reset) between the opencode client and remote servers. Its scope is handling connection drops at the transport layer and implementing automatic reconnection logic.

This PR (#24955 / Issue #23997) addresses a completely different problem: server-side crashes in MCP servers. When a local MCP server process crashes or a remote MCP server process terminates, the client.onclose callback fires. Before this fix, that callback was ignored, leaving the status stuck on "connected" even though the server was dead.
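
A rough sketch of the close-detection idea, for illustration only. ClosableClient, McpStatus, and watchTransportClose are hypothetical names introduced here; the one thing taken from the description is that the client exposes an onclose callback that fires when the transport goes away.

```typescript
// Hypothetical minimal shapes; the real client and status types are richer.
interface ClosableClient {
  onclose?: () => void
}

type McpStatus = { status: "connected" } | { status: "failed"; error: string }

// Hook the client's close callback so that a dead transport is reflected in
// the status map instead of staying "connected".
function watchTransportClose(name: string, client: ClosableClient, statuses: Map<string, McpStatus>): void {
  const previous = client.onclose
  client.onclose = () => {
    // Preserve any handler that was already installed.
    previous?.()
    // Only downgrade a server that is still believed to be connected, so an
    // intentional shutdown that already changed the status is left alone.
    if (statuses.get(name)?.status === "connected") {
      statuses.set(name, { status: "failed", error: "Transport closed unexpectedly" })
    }
  }
}
```

With this in place, anything that reads the status map (such as a UI indicator) sees the failure as soon as the callback fires rather than on the next failed tool call.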

Key differences:

fix(opencode): reconnect on network disruptions (VPN switch, SSE timeout, connection reset) #19116: Network drops → reconnect (client-side transport resilience)
fix(mcp): detect transport close and update status to failed #24955: Server crashes → detect and mark failed (server-side lifecycle)

These are orthogonal issues affecting different layers of the stack.

github-actions · 2026-05-07T13:17:38Z

Thanks for updating your PR! It now meets our contributing guidelines. 👍

Detect when MCP client transport closes unexpectedly (server crash, network drop, process exit) and immediately mark the connection status as 'failed' instead of leaving it as 'connected'. This fixes the issue where the TUI/API would continue to show MCP servers as online (green) even though they had stopped responding.

Fixes anomalyco#23997

…k for failed servers

Implements two complementary fixes for MCP server connection issues:

1. Reactive reconnect (PR anomalyco#25670 port):
- Inline isTransportError() to detect transport errors without SDK import issues
- Replace convertMcpTool() with makeTool() that catches transport errors on callTool
- Single-flight reconnectClient() with retry once using fresh client
- mcpStateRef pattern to access InstanceState from async callbacks

2. Proactive health-check (Issue anomalyco#23997):
- Effect.repeat(Schedule.spaced(Duration.seconds(30))) loop forked scoped
- Attempts reconnect for servers in "failed" status every 30s
- Improved init logging: Effect.catch now logs error and marks failed instead of silent void

Fixes anomalyco#23997
Relates to anomalyco#25670

github-actions Bot added needs:compliance and needs:issue labels Apr 29, 2026

github-actions Bot removed needs:compliance and needs:issue labels Apr 29, 2026

cioffiAI mentioned this pull request May 7, 2026

Python FastMCP server (STDIO) exits immediately on Windows — Not connected #26128

Open

github-actions Bot added needs:compliance and removed needs:compliance labels May 7, 2026

Hernán Arce and others added 2 commits May 7, 2026 19:41

herjarsa force-pushed the fix/mcp-connection-status-drop branch from 53ddd4b to c7258a7 Compare May 7, 2026 17:53

herjarsa wants to merge 2 commits into anomalyco:dev from herjarsa:fix/mcp-connection-status-drop
