ADK retry mechanism doesn't handle common network errors (httpx.RemoteProtocolError) in production environments #2561

@SatoAoaka

Description

Issue Type: Framework Bug / Enhancement Request
Repository: adk-python (framework bug with retry mechanism)
License: This contribution will be licensed under Apache 2.0

Describe the bug

Summary: The ADK's retry mechanism is inadequate for production environments, causing system hangs and reliability issues.

Version Information:

  • Error originally observed in production with ADK v1.6.1
  • Code review of v1.11.0 (latest) confirms the retry mechanism in mcp_session_manager.py lines 111-135 remains unchanged
  • This confirms the issue persists in the latest version

The ADK's retry mechanism is not production-ready. It only handles a single error type (anyio.ClosedResourceError) while ignoring common network errors that occur in real cloud environments. This turns recoverable network glitches into indefinite blocking waits, causing the entire request chain to hang until upstream timeouts occur.

When using MCPToolset with StreamableHTTPConnectionParams in production environments (Google Cloud Run), httpx.RemoteProtocolError occurs during network communication failures. However, this error is not caught by the existing retry_on_closed_resource decorator, causing:

  1. No retry attempts - The error is not handled by existing retry logic
  2. Blocking behavior - The Agent continues waiting for a response that will never come
  3. System hang - The entire request chain blocks until upstream timeout (30+ minutes)
  4. Resource waste - MCP Server completes processing successfully, but response delivery fails

Network failures are expected in distributed systems and should be handled gracefully with appropriate retry logic. When retries fail, the system should fail fast rather than blocking indefinitely.
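
As an interim illustration of failing fast, a caller can bound the wait with an explicit deadline (a minimal sketch, not part of ADK; call_with_deadline is a hypothetical helper and requires Python 3.11+):

# Hypothetical caller-side guard: bound the wait so a dropped response
# fails fast instead of hanging until an upstream timeout fires
import asyncio

async def call_with_deadline(coro, seconds=120.0):
    async with asyncio.timeout(seconds):  # raises TimeoutError on expiry
        return await coro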

To Reproduce

Note: This error occurs intermittently and cannot be reliably reproduced on demand. Both the Agent and MCP Server are deployed on Google Cloud Run, which may contribute to the connection behavior.

Steps to reproduce the behavior:

  1. Set up the MCP server with FastMCP in stateless HTTP mode on Cloud Run:
# mcp-server/main.py
from fastmcp import FastMCP  # or: from mcp.server.fastmcp import FastMCP

main_server = FastMCP(name=settings.server_name, stateless_http=True)
main_server.run(transport="streamable-http", host=settings.host, port=settings.port)
  2. Configure MCPToolset with StreamableHTTPConnectionParams (agent wiring is shown in the sketch after this list):
from google.adk.tools.mcp_tool.mcp_toolset import MCPToolset, StreamableHTTPConnectionParams

mcp_toolset = MCPToolset(
    connection_params=StreamableHTTPConnectionParams(
        url="http://mcp-server:8080/mcp/",
        timeout=5.0,
        sse_read_timeout=600.0,  # 10 minutes
        terminate_on_close=True,
    ),
    auth_scheme=auth_scheme,
    auth_credential=auth_credential,
)
  3. Deploy both services to Google Cloud Run:

    • Agent service with ADK and MCPToolset
    • MCP Server service with FastMCP
    • Configuration: Timeout 3600s, CPU always allocated, execution environment gen2
  4. Execute a long-running MCP tool that involves continuous polling

  5. The error may occur after approximately 50-60 seconds (observed around 53 seconds in our logs)
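
For reference, the toolset from step 2 is then attached to the agent in the usual ADK way (a sketch; the agent name, model, and instruction below are placeholders):

from google.adk.agents import Agent

root_agent = Agent(
    name='example_agent',        # placeholder
    model='gemini-2.0-flash',    # placeholder model id
    instruction='Use the MCP tools to complete the task.',
    tools=[mcp_toolset],
)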

Expected behavior

  • Network errors like httpx.RemoteProtocolError should trigger automatic retry
  • If retry fails, the error should be propagated immediately (not cause blocking wait)
  • The retry mechanism should handle common production network failures:
    • Connection resets
    • Incomplete responses
    • Proxy/load balancer connection drops
  • Distributed systems should be resilient to transient network issues
  • Failures should fail fast, not cause indefinite blocking

Screenshots

Error logs showing the issue:

Agent side (error occurs but Agent continues waiting):

2025-08-15T09:11:16.988955Z httpcore.RemoteProtocolError: peer closed connection without sending complete message body (incomplete chunked read)
2025-08-15T09:11:18.064242Z httpx.RemoteProtocolError: peer closed connection without sending complete message body (incomplete chunked read)
# Agent continues waiting for response - no immediate failure or retry
# Blocks until upstream timeout occurs (30+ minutes later)

MCP Server side (continues normally):

2025-08-15T09:11:28 INFO: Task execution completed successfully
2025-08-15T09:11:28 INFO: Results saved to database
# No errors reported, processing completes as expected

Agent caller side (eventually times out):

2025-08-15T09:40:23 httpx.ReadTimeout: Request timed out after 300 seconds
# The caller finally fails roughly 30 minutes after the original error, once its upstream timeout fires

Desktop (please complete the following information):

  • OS: Linux (Google Cloud Run gen2)
  • Python version: 3.13
  • ADK version: 1.6.1 (error observed), code confirmed unchanged in 1.11.0 (latest)

Model Information:

  • Agent using Google ADK with MCPToolset
  • MCP Server using FastMCP in stateless HTTP mode
  • Both deployed on Google Cloud Run with instance-based billing

Additional context

Core Issue: The ADK's retry mechanism is too narrow in scope for production use:

  • Only catches anyio.ClosedResourceError
  • Ignores common network errors that occur in cloud environments
  • Assumes perfect network conditions, which is unrealistic

Version Note: While the error was encountered in production using v1.6.1, we have verified that the problematic code remains unchanged in the latest version (v1.11.0), indicating this is an ongoing issue.

Current Implementation Problem:

# Current implementation in mcp_session_manager.py
def retry_on_closed_resource(func):
    @functools.wraps(func)
    async def wrapper(self, *args, **kwargs):
        try:
            return await func(self, *args, **kwargs)
        except anyio.ClosedResourceError:  # TOO NARROW - Only catches this specific error
            logger.info('Retrying %s due to closed resource', func.__name__)
            return await func(self, *args, **kwargs)
    return wrapper
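
To make the gap concrete, here is a minimal, hypothetical demonstration (FakeSession and call_tool are illustrative names; the decorator and its imports from the snippet above are assumed to be in scope): a method decorated with retry_on_closed_resource that raises httpx.RemoteProtocolError propagates the error on the first attempt, with no retry.

# Hypothetical demonstration of the gap (not ADK code)
import asyncio

import httpx


class FakeSession:
    @retry_on_closed_resource  # decorator from the snippet above
    async def call_tool(self):
        # Simulate the mid-stream connection drop seen in the logs
        raise httpx.RemoteProtocolError(
            'peer closed connection without sending complete message body'
        )


async def main():
    try:
        await FakeSession().call_tool()
    except httpx.RemoteProtocolError:
        print('Propagated immediately - no retry was attempted')


asyncio.run(main())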

Proposed Solution:

import asyncio
import functools
import logging

import anyio
import httpcore
import httpx

logger = logging.getLogger(__name__)

RETRIABLE_EXCEPTIONS = (
    anyio.ClosedResourceError,
    httpx.RemoteProtocolError,
    httpx.ReadTimeout,
    httpx.ConnectTimeout,
    httpcore.RemoteProtocolError,
)

MAX_RETRIES = 3
BACKOFF_DELAY = 1.0  # seconds to wait between attempts


def retry_on_network_errors(func):
    @functools.wraps(func)
    async def wrapper(self, *args, **kwargs):
        for attempt in range(MAX_RETRIES):
            try:
                return await func(self, *args, **kwargs)
            except RETRIABLE_EXCEPTIONS as e:
                if attempt < MAX_RETRIES - 1:
                    logger.info('Retrying %s due to network error: %s',
                                func.__name__, type(e).__name__)
                    await asyncio.sleep(BACKOFF_DELAY)
                else:
                    raise  # fail fast after the final attempt
    return wrapper
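
Usage would mirror the existing decorator, applied to whichever session-manager methods touch the network (ExampleSessionManager and create_session below are illustrative names, not necessarily the real ADK ones):

# Illustrative usage of the proposed decorator
class ExampleSessionManager:
    @retry_on_network_errors
    async def create_session(self):
        ...  # open the MCP session over streamable HTTP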

Impact on Production Systems:

  1. Production Reliability: Without proper retry logic, transient network issues become permanent failures
  2. User Experience: Long-running operations fail unnecessarily, requiring manual intervention
  3. Resource Waste: Successfully completed server-side work is discarded due to response delivery failure
  4. System Responsiveness: Network errors cause blocking waits instead of immediate failure, leading to:
    • Unresponsive user interfaces
    • Resource exhaustion (threads/connections tied up)
    • Cascading timeouts throughout the system
  5. Violates Cloud-Native Principles: Modern distributed systems must handle network unreliability and fail fast

Cloud Run Environment Details:

  • Both services run on Google Cloud Run gen2
  • Instance-based billing with always-allocated CPU (not request-based)
  • Inter-service communication within Cloud Run infrastructure
  • Despite instance-based billing, connection drops still occur
  • Potential proxy/load balancer timeouts in Cloud Run's networking layer

Business Impact:

  • Affects all ADK users deploying to cloud environments (Google Cloud Run, AWS, Azure)
  • Makes the ADK framework unreliable in production distributed systems
  • Forces every user to implement their own retry logic
  • Simple fix with significant reliability improvement
  • Aligns with Google Cloud best practices for resilient applications

Suggested Enhancement Priority: High

  • Framework-level reliability issue
  • Affects production deployments
  • Clear technical solution available

Potential Workaround (Untested Example):
ADK users could potentially implement their own network error handling at the application level. Here's a conceptual example that might address the issue:

Note: This is an untested example to illustrate a possible approach. We have not verified this workaround in production but are willing to test it if needed.

# Example of potential workaround (not tested)
import asyncio

import httpx
from google.adk.tools.mcp_tool.mcp_tool import MCPTool  # import path may vary across ADK versions


class ExtendedMCPTool(MCPTool):
    async def _run_async_impl(self, *, args, tool_context, credential):
        # Retry up to three times with exponential backoff (1s, then 2s)
        for attempt in range(3):
            try:
                return await super()._run_async_impl(
                    args=args, tool_context=tool_context, credential=credential
                )
            except (httpx.RemoteProtocolError, httpx.ReadTimeout):
                if attempt < 2:
                    await asyncio.sleep(2 ** attempt)
                else:
                    raise

However, this approach requires every ADK user to implement similar logic, which should ideally be handled at the framework level.

Willingness to Contribute:
We are willing to:

  1. Test potential workarounds and report results
  2. Contribute a pull request with the proposed enhancement if the maintainers agree with the approach
  3. Collaborate on testing the solution in production environments

The proposed solution involves extending the existing retry_on_closed_resource decorator to handle additional network error types commonly encountered in production environments.
