Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
40 changes: 40 additions & 0 deletions docs/adr/38544-cap-mcp-server-child-process-concurrency.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,40 @@
# ADR-38544: Cap MCP Server Child Process Concurrency with a Shared Semaphore Guardrail

**Date**: 2026-06-11
**Status**: Accepted — the Go MCP server now enforces a shared 4-process subprocess cap.
**Deciders**: pelikhan, Copilot

## Context

The Go MCP server shells out to `gh` and `gh aw` child processes to service tool calls (workflow add/update/fix, logs, audit, diff, compile, inspect), as well as for supporting operations such as repository lookup, actor permission checks, config validation, and the `gh version` startup probe. Under concurrent tool usage these subprocess invocations were unbounded: every in-flight request that needed a child process spawned one immediately, so a burst of concurrent requests could fan out an arbitrary number of `gh`/`gh aw` processes and exhaust host resources (file descriptors, memory, process table). There was no central place enforcing a ceiling, and each call site managed its own `cmd.Output()` / `cmd.CombinedOutput()` independently. This change is preventive hardening: the issue was to add an explicit server-side guardrail before resource pressure showed up as flaky or host-specific failures.

## Decision

We will cap the number of simultaneously-active server-managed child processes at **4** using a single shared, context-aware guardrail. The guardrail (`mcpSubprocessGuardrail`) is a buffered-channel semaphore: each subprocess acquires one slot before executing and releases it when done. Acquisition is context-aware, so a cancelled request stops waiting rather than blocking behind queued subprocesses. All subprocess call sites route through the shared `defaultMCPSubprocessGuardrail` via the helpers `runMCPSubprocessOutput`, `runMCPSubprocessCombinedOutput`, `runMCPExecOutput`, and `runMCPExecCombinedOutput`, replacing direct `cmd.Output()`/`cmd.CombinedOutput()` calls. Output contracts and tool behavior are unchanged; only concurrency is bounded.

## Alternatives Considered

### Alternative 1: Leave subprocess spawning unbounded
Keep relying on the operating system and `gh`'s own behavior to absorb concurrent load. Rejected because the failure mode (resource exhaustion under a burst of concurrent tool calls) is silent and hard to diagnose, and the server has no backpressure mechanism of its own once limits are hit.

### Alternative 2: Per-call-site or per-tool limits
Give each tool or call site its own concurrency limit rather than one shared cap. Rejected because the resource pressure is global — total live child processes is what matters — so independent per-tool counters could still sum well past a safe ceiling, and would duplicate limiting logic across many files.

### Alternative 3: Explicit worker pool / job queue
Introduce a fixed pool of worker goroutines that own subprocess execution, with requests submitting jobs to a queue. Rejected as heavier than needed: it changes the execution model and call-site ergonomics, whereas a buffered-channel semaphore achieves the same bound with a minimal, drop-in wrapper around existing `exec.Cmd` calls.

## Consequences

### Positive
- Total concurrent server-managed child processes is bounded at 4, preventing unbounded fan-out and the associated resource exhaustion.
- A single chokepoint (`defaultMCPSubprocessGuardrail`) centralizes the limit; future call sites that use the helpers are covered automatically.
- Slot acquisition respects `context` cancellation, so cancelled or timed-out requests do not hang waiting for a slot.

### Negative
- The limit `4` is a hardcoded constant (`maxActiveMCPChildProcesses`) and is not configurable per host or deployment; tuning requires a code change. That is intentional for this first guardrail because the goal is to add one deterministic ceiling without introducing new user-facing configuration surface.
- Under high concurrency, requests block while waiting for a free slot, which can increase tail latency for tool calls that previously ran immediately.
- Correctness depends on every subprocess call site using the guardrail helpers; a direct `cmd.Output()`/`cmd.CombinedOutput()` added later would silently bypass the cap.

### Neutral
- The guardrail is process-global state (a package-level `defaultMCPSubprocessGuardrail`), shared across all MCP requests in the server process.
- Existing stdout/stderr separation (using `Output()` vs `CombinedOutput()` for JSON-producing commands) is preserved through distinct helper variants.
26 changes: 26 additions & 0 deletions pkg/cli/mcp_helpers.go
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,9 @@ package cli
import (
"fmt"
"os"
"os/exec"
"path/filepath"
"strings"

"github.com/github/gh-aw/pkg/logger"
)
Expand Down Expand Up @@ -67,3 +69,27 @@ func logAndValidateBinaryPath() (string, error) {
mcpHelpersLog.Printf("gh-aw binary path: %s", binaryPath)
return binaryPath, nil
}

func withNonInteractiveCIEnv(env []string) []string {
if env == nil {
env = os.Environ()
}

env = append([]string(nil), env...)
for i, entry := range env {
if strings.HasPrefix(entry, "CI=") {
env[i] = "CI=1"
return env
}
}

return append(env, "CI=1")
}

func setNonInteractiveCIEnv(cmd *exec.Cmd) {
if cmd == nil {
return
}

cmd.Env = withNonInteractiveCIEnv(cmd.Env)
}
4 changes: 2 additions & 2 deletions pkg/cli/mcp_permissions.go
Original file line number Diff line number Diff line change
Expand Up @@ -34,7 +34,7 @@ func queryActorRole(ctx context.Context, actor string, repo string) (string, err
mcpLog.Printf("Querying GitHub API for %s's permission in %s", actor, repo)

cmd := workflow.ExecGHContext(ctx, "api", apiPath, "--jq", ".permission")
output, err := cmd.Output()
output, err := runMCPSubprocessOutput(ctx, cmd)
if err != nil {
mcpLog.Printf("Failed to query actor permission: %v", err)
return "", fmt.Errorf("failed to query actor permission: %w", err)
Expand Down Expand Up @@ -83,7 +83,7 @@ func checkActorPermission(ctx context.Context, actor string, validateActor bool,
}

// Get repository using cached lookup
repo, err := getRepository()
repo, err := getRepository(ctx)
if err != nil {
mcpLog.Printf("Tool %s: failed to get repository context, denying access: %v", toolName, err)
return newMCPError(jsonrpc.CodeInternalError, "permission check failed", map[string]any{
Expand Down
7 changes: 4 additions & 3 deletions pkg/cli/mcp_repository.go
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
package cli

import (
"context"
"errors"
"fmt"
"os"
Expand All @@ -12,7 +13,7 @@ import (
// getRepository retrieves the current repository name (owner/repo format).
// Results are cached for 1 hour to avoid repeated queries.
// Checks GITHUB_REPOSITORY environment variable first, then falls back to gh repo view.
func getRepository() (string, error) {
func getRepository(ctx context.Context) (string, error) {
// Check cache first
if repo, ok := mcpCache.GetRepo(); ok {
mcpLog.Printf("Using cached repository: %s", repo)
Expand All @@ -29,8 +30,8 @@ func getRepository() (string, error) {

// Fall back to gh repo view
mcpLog.Print("Querying repository using gh repo view")
cmd := workflow.ExecGH("repo", "view", "--json", "nameWithOwner", "--jq", ".nameWithOwner")
output, err := cmd.Output()
cmd := workflow.ExecGHContext(ctx, "repo", "view", "--json", "nameWithOwner", "--jq", ".nameWithOwner")
output, err := runMCPSubprocessOutput(ctx, cmd)
if err != nil {
mcpLog.Printf("Failed to get repository: %v", err)
return "", fmt.Errorf("failed to get repository: %w", err)
Expand Down
14 changes: 10 additions & 4 deletions pkg/cli/mcp_server.go
Original file line number Diff line number Diff line change
Expand Up @@ -16,15 +16,21 @@ type execCmdFunc func(ctx context.Context, args ...string) *exec.Cmd
// The second return value is a reserved SDK extension slot and must be nil for now.

// createMCPServer creates and configures the MCP server with all tools
func createMCPServer(cmdPath string, actor string, validateActor bool, manifestCacheFile string) *mcp.Server {
func createMCPServer(cmdPath string, actor string, validateActor bool, manifestCacheFile string, env []string) *mcp.Server {
// Helper function to execute command with proper path
execCmd := func(ctx context.Context, args ...string) *exec.Cmd {
var cmd *exec.Cmd
if cmdPath != "" {
// Use custom command path
return exec.CommandContext(ctx, cmdPath, args...)
cmd = exec.CommandContext(ctx, cmdPath, args...)
} else {
// Use default gh aw command with proper token handling
cmd = workflow.ExecGHContext(ctx, append([]string{"aw"}, args...)...)
}
if env != nil {
cmd.Env = append([]string(nil), env...)
}
// Use default gh aw command with proper token handling
return workflow.ExecGHContext(ctx, append([]string{"aw"}, args...)...)
return cmd
}

// Log actor and validation settings
Expand Down
33 changes: 7 additions & 26 deletions pkg/cli/mcp_server_command.go
Original file line number Diff line number Diff line change
Expand Up @@ -3,10 +3,8 @@ package cli
import (
"context"
"os"
"strings"

"github.com/github/gh-aw/pkg/logger"
"github.com/github/gh-aw/pkg/workflow"
"github.com/modelcontextprotocol/go-sdk/mcp"
"github.com/spf13/cobra"
)
Expand Down Expand Up @@ -60,7 +58,7 @@ Examples:
GITHUB_ACTOR=octocat gh aw mcp-server # Set actor via environment variable for access control
DEBUG=mcp:* GITHUB_ACTOR=octocat gh aw mcp-server # Run with verbose debug logging and actor set via environment variable`,
RunE: func(cmd *cobra.Command, args []string) error {
return runMCPServer(port, cmdPath, validateActor)
return runMCPServer(cmd.Context(), port, cmdPath, validateActor)
},
}

Expand All @@ -71,24 +69,10 @@ Examples:
return cmd
}

// checkAndLogGHVersion checks if gh CLI is available and logs its version.
// Diagnostics are emitted through the debug logger only.
func checkAndLogGHVersion() {
cmd := workflow.ExecGH("version")
output, err := cmd.CombinedOutput()

if err != nil {
mcpLog.Print("WARNING: gh CLI not found in PATH")
return
}

// Parse and log the version
versionOutput := strings.TrimSpace(string(output))
mcpLog.Printf("gh CLI version: %s", versionOutput)
}

// runMCPServer starts the MCP server on stdio or HTTP transport
func runMCPServer(port int, cmdPath string, validateActor bool) error {
func runMCPServer(ctx context.Context, port int, cmdPath string, validateActor bool) error {
mcpServerEnv := withNonInteractiveCIEnv(nil)

// Get actor from environment variable
actor := os.Getenv("GITHUB_ACTOR")

Expand Down Expand Up @@ -127,13 +111,10 @@ func runMCPServer(port int, cmdPath string, validateActor bool) error {
mcpLog.Printf("WARNING: Failed to get current working directory: %v", err)
}

// Check and log gh CLI version
checkAndLogGHVersion()

// Validate that the CLI and secrets are properly configured
// Note: Validation failures are logged as warnings but don't prevent server startup
// This allows the server to start in test environments or non-repository directories
if err := validateMCPServerConfiguration(cmdPath); err != nil {
if err := validateMCPServerConfiguration(ctx, cmdPath, mcpServerEnv); err != nil {
mcpLog.Printf("Configuration validation warning: %v", err)
}

Expand All @@ -159,7 +140,7 @@ func runMCPServer(port int, cmdPath string, validateActor bool) error {
}

// Create the server configuration
server := createMCPServer(cmdPath, actor, validateActor, manifestCacheFile)
server := createMCPServer(cmdPath, actor, validateActor, manifestCacheFile, mcpServerEnv)

if port > 0 {
// Run HTTP server with SSE transport
Expand All @@ -168,5 +149,5 @@ func runMCPServer(port int, cmdPath string, validateActor bool) error {

// Run stdio transport
mcpLog.Print("MCP server ready on stdio")
return server.Run(context.Background(), &mcp.StdioTransport{})
return server.Run(ctx, &mcp.StdioTransport{})
}
6 changes: 3 additions & 3 deletions pkg/cli/mcp_server_unit_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -19,7 +19,7 @@ import (
// TestMCPServerUnit_ListTools verifies that the MCP server exposes exactly the
// expected set of tools without spawning a subprocess.
func TestMCPServerUnit_ListTools(t *testing.T) {
server := createMCPServer("", "", false, "")
server := createMCPServer("", "", false, "", nil)
session := connectInMemory(t, server)

ctx := context.Background()
Expand All @@ -43,7 +43,7 @@ func TestMCPServerUnit_ListTools(t *testing.T) {
// TestMCPServerUnit_ServerCapabilities verifies that the server advertises the
// Tools capability with ListChanged=false (tools are static, no notifications needed).
func TestMCPServerUnit_ServerCapabilities(t *testing.T) {
server := createMCPServer("", "", false, "")
server := createMCPServer("", "", false, "", nil)
session := connectInMemory(t, server)

initResult := session.InitializeResult()
Expand All @@ -66,7 +66,7 @@ func TestMCPServerUnit_StatusTool(t *testing.T) {
require.NoError(t, os.Chdir(tmpDir), "should change to temp dir")
t.Cleanup(func() { _ = os.Chdir(origDir) })

server := createMCPServer("", "", false, "")
server := createMCPServer("", "", false, "", nil)
session := connectInMemory(t, server)

ctx := context.Background()
Expand Down
81 changes: 81 additions & 0 deletions pkg/cli/mcp_subprocess_guardrail.go
Original file line number Diff line number Diff line change
@@ -0,0 +1,81 @@
package cli

import (
"context"
"os/exec"
)

const maxActiveMCPChildProcesses = 4

type mcpSubprocessGuardrail struct {
slots chan struct{}
}

var defaultMCPSubprocessGuardrail = newMCPSubprocessGuardrail(maxActiveMCPChildProcesses)

func newMCPSubprocessGuardrail(limit int) *mcpSubprocessGuardrail {
return &mcpSubprocessGuardrail{
slots: make(chan struct{}, limit),
}
}

func (g *mcpSubprocessGuardrail) acquire(ctx context.Context) error {
if err := ctx.Err(); err != nil {
return err
}

select {
case g.slots <- struct{}{}:
if err := ctx.Err(); err != nil {
g.release()
return err
}
return nil
case <-ctx.Done():
return ctx.Err()
}
}
Comment thread
Copilot marked this conversation as resolved.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

acquire may succeed on an already-cancelled context — when a slot is available at the same instant ctx is done, Go's select picks randomly, silently bypassing the cancellation.

💡 Detail and suggested fix

The companion test TestMCPSubprocessGuardrailAcquireHonorsContextCancellation never exercises this race: it pre-fills the only slot before cancelling, so the slots <- case is never simultaneously ready. Two important cases are untested:

  1. cancelled context + available slot → non-deterministic outcome
  2. goroutine already blocked on a full semaphore whose context is cancelled while waiting → the primary production scenario

Fix with a pre-check that makes the already-cancelled case deterministic:

func (g *mcpSubprocessGuardrail) acquire(ctx context.Context) error {
    if err := ctx.Err(); err != nil {
        return err
    }
    select {
    case g.slots <- struct{}{}:
        return nil
    case <-ctx.Done():
        return ctx.Err()
    }
}

The TOCTOU window after the check is harmless — a context cancelled right after still proceeds, but the cmd (created with ExecGHContext) fails fast.


func (g *mcpSubprocessGuardrail) release() {
<-g.slots
}

func (g *mcpSubprocessGuardrail) output(ctx context.Context, cmd *exec.Cmd) ([]byte, error) {
if err := g.acquire(ctx); err != nil {
return nil, err
}
defer g.release()

return cmd.Output()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Context is silently discarded after slot acquisitionctx gates semaphore entry only; once acquired, cmd.Output() runs unconditionally. If cmd was created without a context (e.g. workflow.ExecGH, as in checkAndLogGHVersion), a timed-out or cancelled ctx cannot terminate a hung subprocess — the slot stays occupied until the OS reaps the process.

💡 Detail

This is a real exposure in checkAndLogGHVersion: workflow.ExecGH("version") creates a context-free exec.Cmd. Neither the 5-second timeout pattern used elsewhere nor any future caller-provided deadline can cancel the running subprocess.

At minimum, document the contract on runMCPSubprocessOutput / runMCPSubprocessCombinedOutput:

// runMCPSubprocessOutput executes cmd under the guardrail semaphore.
// ctx is used only for semaphore acquisition. To bound subprocess execution,
// cmd must be created with exec.CommandContext or workflow.ExecGHContext
// using the same ctx.
func runMCPSubprocessOutput(ctx context.Context, cmd *exec.Cmd) ([]byte, error) {

The stronger fix is to enforce context-aware cmd creation at the call sites that currently use ExecGH (not ExecGHContext).

}

func (g *mcpSubprocessGuardrail) combinedOutput(ctx context.Context, cmd *exec.Cmd) ([]byte, error) {
if err := g.acquire(ctx); err != nil {
return nil, err
}
defer g.release()

return cmd.CombinedOutput()
}

// runMCPSubprocessOutput executes cmd under the shared MCP subprocess guardrail.
// ctx governs slot acquisition and any subprocess cancellation only when cmd was
// created with the same context (for example via exec.CommandContext or ExecGHContext).
func runMCPSubprocessOutput(ctx context.Context, cmd *exec.Cmd) ([]byte, error) {
return defaultMCPSubprocessGuardrail.output(ctx, cmd)
}

// runMCPSubprocessCombinedOutput executes cmd under the shared MCP subprocess
// guardrail. ctx governs slot acquisition and any subprocess cancellation only
// when cmd was created with the same context.
func runMCPSubprocessCombinedOutput(ctx context.Context, cmd *exec.Cmd) ([]byte, error) {
return defaultMCPSubprocessGuardrail.combinedOutput(ctx, cmd)
}

func runMCPExecOutput(ctx context.Context, execCmd execCmdFunc, args ...string) ([]byte, error) {
return runMCPSubprocessOutput(ctx, execCmd(ctx, args...))
}

func runMCPExecCombinedOutput(ctx context.Context, execCmd execCmdFunc, args ...string) ([]byte, error) {
return runMCPSubprocessCombinedOutput(ctx, execCmd(ctx, args...))
}
Loading
Loading