fix(mcp): detect transport close and update status to failed#24955
fix(mcp): detect transport close and update status to failed#24955herjarsa wants to merge 2 commits intoanomalyco:devfrom
Conversation
|
Thanks for your contribution! This PR doesn't have a linked issue. All PRs must reference an existing issue. Please:
See CONTRIBUTING.md for details. |
|
The following comment was made by an LLM, it may be inaccurate: The search results show PR #24955 (the current PR) appearing in the results, along with PR #19116. Let me verify if #19116 is related: Potentially Related PR:
The current PR (#24955) is focused specifically on detecting when an MCP server's transport closes and updating its status to 'failed', which is a targeted fix for issue #23997. No duplicate PRs found |
|
Thanks for updating your PR! It now meets our contributing guidelines. 👍 |
Why this is NOT a duplicate of #19116#19116 focuses on network-level disruptions (VPN switch, SSE timeout, connection reset) between the opencode client and remote servers. Its scope is handling connection drops at the transport layer and implementing automatic reconnection logic. This PR (#24955 / Issue #23997) addresses a completely different problem: server-side crashes in MCP servers. When a local MCP server process crashes or a remote MCP server process terminates, the Key differences:
These are orthogonal issues affecting different layers of the stack. |
|
Thanks for updating your PR! It now meets our contributing guidelines. 👍 |
Detect when MCP client transport closes unexpectedly (server crash, network drop, process exit) and immediately mark the connection status as 'failed' instead of leaving it as 'connected'. This fixes the issue where the TUI/API would continue to show MCP servers as online (green) even though they had stopped responding. Fixes anomalyco#23997
…k for failed servers Implements two complementary fixes for MCP server connection issues: 1. Reactive reconnect (PR anomalyco#25670 port): - Inline isTransportError() to detect transport errors without SDK import issues - Replace convertMcpTool() with makeTool() that catches transport errors on callTool - Single-flight reconnectClient() with retry once using fresh client - mcpStateRef pattern to access InstanceState from async callbacks 2. Proactive health-check (Issue anomalyco#23997): - Effect.repeat(Schedule.spaced(Duration.seconds(30))) loop forked scoped - Attempts reconnect for servers in "failed" status every 30s - Improved init logging: Effect.catch now logs error and marks failed instead of silent void Fixes anomalyco#23997 Relates to anomalyco#25670
53ddd4b to
c7258a7
Compare
Issue for this PR
Closes #23997
Type of change
What does this PR do?
This PR fixes the issue where MCP servers become unavailable mid-session (Issue #23997). After investigation, we found two complementary problems:
Problem 1: Transport errors disconnect servers with no recovery (PR #25670 scope)
When a transport error occurs (ECONNRESET, stale session 404, server crash), the MCP client throws on
callTooland the server remains dead for the rest of the session.Problem 2: MCP manager silently stops routing requests (Issue #23997 root cause)
After initialization, servers in "failed" status are never retried. The health-check loop was missing entirely.
Changes made:
Reactive reconnect (inline isTransportError + makeTool)
isTransportError()that detects transport errors by duck-typing (avoids SDK import issues sinceStreamableHTTPErroris not exported at runtime).convertMcpTool()withmakeTool()that catches transport errors inexecute, triggers a single-flight reconnect viareconnectClient(), and retries the tool call once with the fresh client.mcpStateRefpattern to allow async callbacks to access the currentInstanceStatewithout forward-reference issues.Proactive health-check (startHealthCheck loop)
Effect.repeat(Schedule.spaced(Duration.seconds(30)))running in a scoped fiber.status === "failed"and attemptsreconnectClient().Promisevia aMap<string, Promise<boolean>>.Improved initialization logging
Effect.catch(() => Effect.void)during init now logs the error and marks the server asfailedinstead of silently swallowing it.How did you verify your code works?
bun typecheckpasses with 0 errors.bun test test/mcp/passes: 32/33 tests green (1 pre-existing timeout inheaders.test.tsunrelated to this change).EffectBridgeandInstanceStatepatterns already established in the codebase.forkScopedfiber so it is automatically cleaned up on instance disposal.Screenshots / recordings
N/A — backend-only change.
Checklist