Support gRPC communication with SMG (Shepherd Model Gateway) workers#3946

Merged
Bihan merged 2 commits into
dstackai:masterfrom
Bihan:support-vllm-backend-with-smg-router
Jun 11, 2026

Conversation

@Bihan Bihan commented Jun 9, 2026

Collaborator

Extends router worker sync to discover and register gRPC SMG workers (vLLM and SGLang PD), in addition to the existing HTTP SGLang path.

gRPC client: Adds job_replica_grpc_client.py — same SSH tunnel pattern as the HTTP replica client, but opens a gRPC channel over a Unix domain socket (unix://) forwarded to the worker’s service port.
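As a rough illustration of the Unix-socket channel described above, the sketch below builds a gRPC target string for a forwarded socket and opens an async channel to it. The helper names `grpc_target_for_uds` and `fetch_server_info`, and the `call` parameter, are hypothetical and not taken from job_replica_grpc_client.py; the only library call used is `grpc.aio.insecure_channel` from the grpcio package.

```python
import os


def grpc_target_for_uds(socket_path: str) -> str:
    """Build a gRPC target string for a Unix domain socket.

    gRPC accepts targets of the form unix:///absolute/path, so the path
    is made absolute before the scheme is prepended.
    """
    return "unix://" + os.path.abspath(socket_path)


async def fetch_server_info(socket_path: str, call):
    """Open a channel over the forwarded socket and run call(channel).

    `call` is a hypothetical coroutine function that wraps a generated stub,
    for example one that invokes GetServerInfo on the worker.
    """
    import grpc  # requires the grpcio package

    target = grpc_target_for_uds(socket_path)
    async with grpc.aio.insecure_channel(target) as channel:
        return await call(channel)
```

The channel is insecure on purpose in this sketch, on the assumption that the SSH tunnel already protects the traffic.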

Worker registration: To register a worker with SMG, dstack needs runtime_type (vLLM / SGLang) and connection_mode (HTTP / gRPC). Rather than adding new service configuration fields, dstack discovers these by probing workers. Discovery runs in two stages:

  • First sync (the router's worker list is empty): dstack does not yet know connection_mode or runtime_type. It probes each worker replica: HTTP /server_info and/or gRPC GetServerInfo, trying the SGLang then vLLM gRPC stub until one responds. Registered workers include connection_mode, runtime_type, and PD fields (kv_role / disaggregation_mode, bootstrap port for SGLang prefill).

  • Later syncs: dstack reads connection_mode and runtime_type from the router’s GET /workers list and reuses them — no repeated protocol or runtime guessing. When connection_mode is grpc, HTTP probes are skipped.
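The two stages above could be sketched roughly as follows. This is a simplified model rather than the code in router_worker_sync.py: the function name `plan_probes` and the string values "http", "sglang", and "vllm" are assumptions made for illustration, and the real implementation also handles PD fields and errors.

```python
from typing import Optional


def plan_probes(known: Optional[dict]) -> list[tuple[str, str]]:
    """Return the ordered (connection_mode, runtime_type) probes to attempt.

    `known` is the worker's entry from the router's worker list,
    or None on the first sync when the router has no workers yet.
    """
    if known is None:
        # First sync: nothing is known, so try HTTP first,
        # then the SGLang gRPC stub, then the vLLM gRPC stub.
        return [("http", "sglang"), ("grpc", "sglang"), ("grpc", "vllm")]
    # Later syncs: reuse what the router already reports, with no guessing.
    return [(known["connection_mode"], known["runtime_type"])]
```

With a known gRPC worker the plan contains a single gRPC probe, which matches the statement that HTTP probes are skipped in that case.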

Minor Change

Support grpc communication with smg router
@Bihan Bihan requested a review from jvstme June 9, 2026 08:29
jvstme commented Jun 9, 2026

Collaborator

@Bihan, could you please share a bit more context behind the PR:

  • Do I understand correctly that the user-facing benefit of the PR is that vLLM workers are now supported with the SGLang router? Are there other benefits?
  • Why do we need to choose between HTTP and gRPC — is it that in some cases only one of them is available? What determines which one is available?
  • Could you share some service configurations to test different combinations of runtime_type (vLLM / SGLang) and connection_mode (HTTP / gRPC)?

Bihan commented Jun 9, 2026

Collaborator Author

@jvstme

Do I understand correctly that the user-facing benefit of the PR is that vLLM workers are now supported with the SGLang router? Are there other benefits?

Yes, the main benefit is that vLLM gRPC workers are now supported with the SMG router in dstack PD services.

Other benefits come from gRPC mode, which applies to both SGLang and vLLM workers.

  • In gRPC mode the SMG router tokenizes requests once and sends tokenized input to workers. In HTTP mode each worker tokenizes separately. That reduces duplicate work and can improve throughput.

  • In gRPC mode the router tokenizes the prompt first, so routing works on tokens instead of raw text. That makes policies like cache_aware more effective, because KV cache is stored by tokens, not characters. With HTTP, the router only had an approximate match on character strings.
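To make the token-prefix argument concrete, here is a toy sketch of prefix-based worker selection over token IDs. It is not SMG's cache_aware implementation; `common_prefix_len` and `pick_worker` are hypothetical names, and the per-worker "cache" is just one remembered token sequence.

```python
def common_prefix_len(a, b):
    """Length of the shared prefix of two sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pick_worker(request_tokens, cached_tokens_by_worker):
    """Pick the worker whose cached sequence shares the longest token prefix."""
    return max(
        cached_tokens_by_worker,
        key=lambda w: common_prefix_len(request_tokens, cached_tokens_by_worker[w]),
    )


# Token IDs here are made up; a real router would get them from the model's tokenizer.
cached = {"worker-a": [1, 2, 3, 4], "worker-b": [1, 2, 9]}
best = pick_worker([1, 2, 3, 7], cached)
```

Because the comparison runs on the same units the KV cache is keyed by, a longer match here corresponds directly to more reusable cache entries, which is the point made above about character-level matching being only approximate.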

Why do we need to choose between HTTP and gRPC — is it that in some cases only one of them is available?

As far as I understand, gRPC should be chosen over HTTP-based workers. For vLLM, there is no option but to use gRPC.

What determines which one is available?

With an SGLang worker: the worker uses gRPC when it is launched with the --grpc-mode option.
With a vLLM worker: the worker uses gRPC when it is launched using vllm.entrypoints.grpc_server.
See the configs below.

Could you share some service configurations to test different combinations of runtime_type (vLLM / SGLang) and connection_mode (HTTP / gRPC)?

With vLLM gRPC Worker:

type: service
name: prefill-decode-smg-vllm

env:
  - HF_TOKEN
  - MODEL_ID=meta-llama/Llama-3.2-3B-Instruct

replicas:
  - count: 1
    image: python:3.12-slim
    commands:
      - pip install smg
      - |
          smg launch \
            --pd-disaggregation \
            --model-path "$MODEL_ID" \
            --enable-igw \
            --host 0.0.0.0 \
            --port 8000 \
            --prefill-policy cache_aware
    router:
      type: sglang
    resources:
      cpu: 4

  - count: 1
    image: vllm/vllm-openai:latest
    commands:
      - pip install -U "vllm[grpc]"
      - |
          python3 -m vllm.entrypoints.grpc_server \
            --model "$MODEL_ID" \
            --host 0.0.0.0 \
            --port 8000 \
            --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_producer"}' \
            > worker-server.log 2>&1
    resources:
      gpu: L40S

  - count: 1
    image: vllm/vllm-openai:latest
    commands:
      - pip install -U "vllm[grpc]"
      - |
          python3 -m vllm.entrypoints.grpc_server \
            --model "$MODEL_ID" \
            --host 0.0.0.0 \
            --port 8000 \
            --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_consumer"}' \
            > worker-server.log 2>&1
    resources:
      gpu: L40S

port: 8000
#model: meta-llama/Llama-3.2-3B-Instruct
fleets: [pd-disagg]

#probes:
#  - type: http
#    url: /health
#    interval: 15s

With SGLang gRPC worker:

type: service
name: prefill-decode-smg-sglang

env:
  - HF_TOKEN
  - MODEL_ID=meta-llama/Llama-3.2-3B-Instruct

replicas:
  - count: 1
    image: python:3.12-slim
    commands:
      - pip install smg
      - |
          smg launch \
            --enable-igw \
            --pd-disaggregation \
            --model-path "$MODEL_ID" \
            --host 0.0.0.0 \
            --port 8000 \
            --prefill-policy cache_aware
    router:
      type: sglang
    resources:
      cpu: 4

  - count: 1
    image: ghcr.io/lightseekorg/smg:1.4.1-sglang-v0.5.10
    commands:
      - |
          python3 -m sglang.launch_server \
            --model-path "$MODEL_ID" \
            --host 0.0.0.0 \
            --port 8000 \
            --grpc-mode \
            --disaggregation-mode prefill \
            --disaggregation-transfer-backend mooncake \
            --disaggregation-bootstrap-port 8998 \
            > worker-server.log 2>&1
    resources:
      gpu: L40S

  - count: 1
    image: ghcr.io/lightseekorg/smg:1.4.1-sglang-v0.5.10
    commands:
      - |
          python3 -m sglang.launch_server \
            --model-path "$MODEL_ID" \
            --host 0.0.0.0 \
            --port 8000 \
            --grpc-mode \
            --disaggregation-mode decode \
            --disaggregation-transfer-backend mooncake \
            > worker-server.log 2>&1
    resources:
      gpu: L40S

port: 8000
fleets: [pd-disagg]

#probes:
#  - type: http
#    url: /health
#    interval: 15s

Note:

  • To use model: with gRPC-based workers, we need replica-wise probes. This will be a separate PR.
  • To use the NIXL transfer backend, use --disaggregation-transfer-backend nixl for SGLang and
    --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":...}' for vLLM.

Comment thread src/dstack/_internal/server/services/jobs/job_replica_grpc_client.py Outdated
Comment thread src/dstack/_internal/server/services/runs/router_worker_sync.py Outdated
Comment on lines +562 to +566
    if result["status"] == "ready":
        return result
    return await _get_grpc_worker_payload(
        job_model, worker_url=grpc_worker_url, runtime_type=runtime_type
    )

Collaborator

(nit) What if the worker responded successfully over HTTP, but returned a status other than ready? Is this a valid case? If so, I assume we shouldn't try gRPC, because we already know that the worker responds over HTTP?

Collaborator Author

Not a valid SGLang case. SGLang fails at startup and never serves with a non-ready status. See

Comment on lines +553 to +566
    try:
        result = await _get_http_worker_payload(job_model, worker_url=http_worker_url)
    except RemoteProtocolError as e:
        logger.debug(
            "HTTP server_info probe failed for %s (trying gRPC): %r",
            http_worker_url,
            e,
        )
        result: _WorkerPayloadResult = {"status": "not_ready", "payload": None}
    if result["status"] == "ready":
        return result
    return await _get_grpc_worker_payload(
        job_model, worker_url=grpc_worker_url, runtime_type=runtime_type
    )

Collaborator

(nit) This can open and close the same SSH tunnel twice (once in _get_http_worker_payload and once in _get_grpc_worker_payload), which is an expensive operation.

Collaborator Author

Good catch!

Can I handle it in a separate PR?

Plan:

  • Extract shared tunnel setup into something like get_service_replica_tunnel(job) (yield UDS path).

  • Have the HTTP/gRPC clients use that helper.

  • Open one tunnel and run HTTP then gRPC over the same UDS.
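The plan above might look roughly like the sketch below. Only the name `get_service_replica_tunnel` comes from the plan; the tunnel is simulated with a counter, and `probe_http`, `probe_grpc`, and `get_worker_payload` are hypothetical placeholders standing in for the real HTTP and gRPC clients.

```python
import asyncio
from contextlib import asynccontextmanager

tunnel_opens = 0  # counts how many times the simulated tunnel is opened


@asynccontextmanager
async def get_service_replica_tunnel(job):
    """Yield a UDS path for the replica; stands in for real SSH tunnel setup."""
    global tunnel_opens
    tunnel_opens += 1
    try:
        yield f"/tmp/{job}.sock"
    finally:
        pass  # a real helper would close the SSH tunnel here


async def probe_http(uds_path):
    # Placeholder: simulate a worker that does not answer over HTTP.
    return {"status": "not_ready", "payload": None}


async def probe_grpc(uds_path):
    # Placeholder: simulate a worker that answers over gRPC.
    return {"status": "ready", "payload": {"connection_mode": "grpc"}}


async def get_worker_payload(job):
    """Open one tunnel, try HTTP, then fall back to gRPC over the same socket."""
    async with get_service_replica_tunnel(job) as uds_path:
        result = await probe_http(uds_path)
        if result["status"] == "ready":
            return result
        return await probe_grpc(uds_path)


result = asyncio.run(get_worker_payload("job-1"))
```

Keeping both probes inside one `async with` block is what guarantees the tunnel is set up and torn down only once per worker.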

Collaborator

Yes, sounds good.

@Bihan Bihan requested a review from jvstme June 11, 2026 09:22
@Bihan Bihan merged commit fc16f72 into dstackai:master Jun 11, 2026
24 checks passed