Support gRPC communication with SMG (Shepherd Model Gateway) workers by Bihan · Pull Request #3946 · dstackai/dstack

Bihan · 2026-06-09T08:17:56Z

Extends router worker sync to discover and register gRPC SMG workers (vLLM and SGLang PD), in addition to the existing HTTP SGLang path.

gRPC client: Adds job_replica_grpc_client.py, which follows the same SSH tunnel pattern as the HTTP replica client but opens a gRPC channel over a Unix domain socket (unix://) forwarded to the worker’s service port.
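
For illustration, a minimal sketch of how a gRPC target for a forwarded Unix domain socket could be built. The helper name uds_grpc_target and the socket path are hypothetical and not taken from job_replica_grpc_client.py; the only assumption about gRPC itself is that it accepts targets of the form unix:///absolute/path.

```python
from pathlib import Path


def uds_grpc_target(socket_path: str) -> str:
    """Build a gRPC target string for a Unix domain socket.

    gRPC accepts ``unix:///absolute/path`` for absolute socket paths, so the
    path is made absolute before the ``unix://`` scheme is prepended.
    (Hypothetical helper, for illustration only.)
    """
    absolute = Path(socket_path).absolute()
    return f"unix://{absolute}"


# Hypothetical usage with the grpcio package (not imported here):
#   channel = grpc.aio.insecure_channel(uds_grpc_target("/tmp/replica.sock"))
target = uds_grpc_target("/tmp/replica.sock")
print(target)
```

The channel itself would then be opened inside the SSH tunnel's lifetime, so that the socket exists for as long as the channel is in use.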

Worker registration: To register a worker with SMG, dstack needs runtime_type (vLLM / SGLang) and connection_mode (HTTP / gRPC). Rather than adding new service configuration fields, dstack discovers these by probing workers. Discovery runs in two stages:

First sync (the router's worker list is empty): dstack does not yet know connection_mode or runtime_type. It probes each worker replica via HTTP /server_info and/or gRPC GetServerInfo, trying the SGLang gRPC stub and then the vLLM gRPC stub until one responds. Registered workers include connection_mode, runtime_type, and PD fields (kv_role / disaggregation_mode, bootstrap port for SGLang prefill).
Later syncs: dstack reads connection_mode and runtime_type from the router’s GET /workers list and reuses them, so the protocol and runtime are not guessed again. When connection_mode is grpc, HTTP probes are skipped.
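
The two stages above can be sketched as a small probe planner. This is a simplified illustration, not the PR's code: the function name plan_probes, the "url" key, and the tuple representation are assumptions; only connection_mode, runtime_type, and the SGLang-then-vLLM stub order come from the description.

```python
from typing import Optional

# gRPC stub order on first sync, per the description: SGLang first, then vLLM.
GRPC_RUNTIME_ORDER = ("sglang", "vllm")

Probe = tuple[str, Optional[str]]  # (connection_mode, runtime_type)


def plan_probes(router_workers: list[dict], worker_url: str) -> list[Probe]:
    """Return the probes to attempt for one replica, in order.

    ``router_workers`` stands for the router's GET /workers list; the "url"
    key used to match a replica is an assumption made for this sketch.
    """
    for worker in router_workers:
        if worker.get("url") == worker_url:
            # Later syncs: reuse what the router already reports.
            return [(worker["connection_mode"], worker["runtime_type"])]
    # First sync: nothing is known yet, so try HTTP, then each gRPC stub.
    return [("http", None)] + [("grpc", rt) for rt in GRPC_RUNTIME_ORDER]


first_sync = plan_probes([], "grpc://10.0.0.1:8000")
later_sync = plan_probes(
    [{"url": "grpc://10.0.0.1:8000", "connection_mode": "grpc", "runtime_type": "vllm"}],
    "grpc://10.0.0.1:8000",
)
print(first_sync)
print(later_sync)
```

In the later-sync case the plan contains a single gRPC probe, which matches the statement that HTTP probes are skipped once connection_mode is known to be grpc.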

Minor Change Support grpc communication with smg router

jvstme · 2026-06-09T09:25:51Z

@Bihan, could you please share a bit more context behind the PR:

Do I understand correctly that the user-facing benefit of the PR is that vLLM workers are now supported with the SGLang router? Are there other benefits?
Why do we need to choose between HTTP and gRPC — is it that in some cases only one of them is available? What determines which one is available?
Could you share some service configurations to test different combinations of runtime_type (vLLM / SGLang) and connection_mode (HTTP / gRPC)?

Bihan · 2026-06-09T13:37:58Z

@jvstme

Do I understand correctly that the user-facing benefit of the PR is that vLLM workers are now supported with the SGLang router? Are there other benefits?

Yes, the main benefit is that vLLM gRPC workers are now supported with the SMG router in dstack PD services.

Other benefits come from gRPC mode, which applies to both SGLang and vLLM workers.

In gRPC mode the SMG router tokenizes requests once and sends tokenized input to workers. In HTTP mode each worker tokenizes separately. That reduces duplicate work and can improve throughput.
In gRPC mode the router tokenizes the prompt first, so routing works on tokens instead of raw text. That makes policies like cache_aware more effective, because KV cache is stored by tokens, not characters. With HTTP, the router only had an approximate match on character strings.
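
A toy illustration of why prefix matching on tokens and on characters can disagree. The whitespace tokenizer below is only a stand-in chosen for this sketch; it is not the tokenizer SMG or any model uses, and the prompts are made up.

```python
from typing import Sequence


def common_prefix_len(a: Sequence, b: Sequence) -> int:
    """Length of the longest common prefix of two sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def toy_tokenize(text: str) -> list[str]:
    # Stand-in tokenizer: split on whitespace (real tokenizers differ).
    return text.split()


cached = "hello world"             # prompt whose KV cache a worker holds
incoming = "hello worldwide news"  # new request to route

char_match = common_prefix_len(cached, incoming)
token_match = common_prefix_len(toy_tokenize(cached), toy_tokenize(incoming))

print(f"characters matched: {char_match} of {len(cached)}")
print(f"tokens matched: {token_match} of {len(toy_tokenize(cached))}")
```

Here the character comparison suggests the whole cached prompt is reusable, while only the first token actually matches, which is the kind of mismatch a token-level cache_aware policy avoids.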

Why do we need to choose between HTTP and gRPC — is it that in some cases only one of them is available?

As far as I understand, gRPC workers should be preferred over HTTP-based workers. For vLLM, there is no option but to use gRPC.

What determines which one is available?

With an SGLang worker: the worker uses gRPC when it is launched with the --grpc-mode option.
With a vLLM worker: the worker uses gRPC when it is launched using vllm.entrypoints.grpc_server.
See the configs below.

Could you share some service configurations to test different combinations of runtime_type (vLLM / SGLang) and connection_mode (HTTP / gRPC)?

With vLLM gRPC worker:

type: service
name: prefill-decode-smg-vllm

env:
  - HF_TOKEN
  - MODEL_ID=meta-llama/Llama-3.2-3B-Instruct

replicas:
  - count: 1
    image: python:3.12-slim
    commands:
      - pip install smg
      - |
          smg launch \
            --pd-disaggregation \
            --model-path "$MODEL_ID" \
            --enable-igw \
            --host 0.0.0.0 \
            --port 8000 \
            --prefill-policy cache_aware
    router:
      type: sglang
    resources:
      cpu: 4

  - count: 1
    image: vllm/vllm-openai:latest
    commands:
      - pip install -U "vllm[grpc]"
      - |
          python3 -m vllm.entrypoints.grpc_server \
            --model "$MODEL_ID" \
            --host 0.0.0.0 \
            --port 8000 \
            --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_producer"}' \
            > worker-server.log 2>&1
    resources:
      gpu: L40S

  - count: 1
    image: vllm/vllm-openai:latest
    commands:
      - pip install -U "vllm[grpc]"
      - |
          python3 -m vllm.entrypoints.grpc_server \
            --model "$MODEL_ID" \
            --host 0.0.0.0 \
            --port 8000 \
            --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_consumer"}' \
            > worker-server.log 2>&1
    resources:
      gpu: L40S

port: 8000
#model: meta-llama/Llama-3.2-3B-Instruct
fleets: [pd-disagg]

#probes:
#  - type: http
#    url: /health
#    interval: 15s

With SGLang gRPC worker:

type: service
name: prefill-decode-smg-sglang

env:
  - HF_TOKEN
  - MODEL_ID=meta-llama/Llama-3.2-3B-Instruct

replicas:
  - count: 1
    image: python:3.12-slim
    commands:
      - pip install smg
      - |
          smg launch \
            --enable-igw \
            --pd-disaggregation \
            --model-path "$MODEL_ID" \
            --host 0.0.0.0 \
            --port 8000 \
            --prefill-policy cache_aware
    router:
      type: sglang
    resources:
      cpu: 4

  - count: 1
    image: ghcr.io/lightseekorg/smg:1.4.1-sglang-v0.5.10
    commands:
      - |
          python3 -m sglang.launch_server \
            --model-path "$MODEL_ID" \
            --host 0.0.0.0 \
            --port 8000 \
            --grpc-mode \
            --disaggregation-mode prefill \
            --disaggregation-transfer-backend mooncake \
            --disaggregation-bootstrap-port 8998 \
            > worker-server.log 2>&1
    resources:
      gpu: L40S

  - count: 1
    image: ghcr.io/lightseekorg/smg:1.4.1-sglang-v0.5.10
    commands:
      - |
          python3 -m sglang.launch_server \
            --model-path "$MODEL_ID" \
            --host 0.0.0.0 \
            --port 8000 \
            --grpc-mode \
            --disaggregation-mode decode \
            --disaggregation-transfer-backend mooncake \
            > worker-server.log 2>&1
    resources:
      gpu: L40S

port: 8000
fleets: [pd-disagg]

#probes:
#  - type: http
#    url: /health
#    interval: 15s

Note:

To use model: with gRPC-based workers, we need replica-wise probes. This will be a separate PR.
To use the NIXL transfer backend, use --disaggregation-transfer-backend nixl for SGLang and
--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":...}' for vLLM.

jvstme · 2026-06-10T22:38:16Z

+    if result["status"] == "ready":
+        return result
+    return await _get_grpc_worker_payload(
+        job_model, worker_url=grpc_worker_url, runtime_type=runtime_type
+    )


(nit) What if the worker responded successfully over HTTP, but returned a status other than ready? Is this a valid case? If so, I assume we shouldn't try gRPC, because we already know that the worker responds over HTTP?

Not a valid SGLang case. SGLang fails at startup and never serves with a non-ready status. See

jvstme · 2026-06-10T22:48:09Z

+    try:
+        result = await _get_http_worker_payload(job_model, worker_url=http_worker_url)
+    except RemoteProtocolError as e:
+        logger.debug(
+            "HTTP server_info probe failed for %s (trying gRPC): %r",
+            http_worker_url,
+            e,
+        )
+        result: _WorkerPayloadResult = {"status": "not_ready", "payload": None}
+    if result["status"] == "ready":
+        return result
+    return await _get_grpc_worker_payload(
+        job_model, worker_url=grpc_worker_url, runtime_type=runtime_type
+    )


(nit) This can open and close the same SSH tunnel twice (once in _get_http_worker_payload and once in _get_grpc_worker_payload), which is an expensive operation.

Good catch!

Shall I handle it in a separate PR?

Plan:

Extract shared tunnel setup into something like get_service_replica_tunnel(job) (yield UDS path).

Have the HTTP/gRPC clients use that helper.

Open one tunnel and run HTTP then gRPC over the same UDS.
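
A rough sketch of the plan above, with stand-ins for everything that touches SSH or the network. Only the helper name get_service_replica_tunnel and the result shape ({"status": ..., "payload": ...}) come from this thread; the probe functions, the job_id argument, and the socket path are hypothetical placeholders.

```python
import asyncio
from contextlib import asynccontextmanager
from typing import AsyncIterator

tunnel_opens = 0  # counts how many times the (fake) tunnel is opened


@asynccontextmanager
async def get_service_replica_tunnel(job_id: str) -> AsyncIterator[str]:
    """Stand-in for the proposed helper: open one tunnel, yield the UDS path."""
    global tunnel_opens
    tunnel_opens += 1  # a real implementation would start SSH forwarding here
    try:
        yield f"/tmp/{job_id}.sock"
    finally:
        pass  # a real implementation would close the tunnel here


async def probe_http(uds_path: str) -> dict:
    # Placeholder: pretend the HTTP /server_info probe did not succeed.
    return {"status": "not_ready", "payload": None}


async def probe_grpc(uds_path: str) -> dict:
    # Placeholder: pretend gRPC GetServerInfo answered.
    return {"status": "ready", "payload": {"connection_mode": "grpc"}}


async def get_worker_payload(job_id: str) -> dict:
    # One tunnel is shared by both probes: HTTP first, then gRPC.
    async with get_service_replica_tunnel(job_id) as uds_path:
        result = await probe_http(uds_path)
        if result["status"] == "ready":
            return result
        return await probe_grpc(uds_path)


result = asyncio.run(get_worker_payload("job-1"))
print(result, "tunnel opened", tunnel_opens, "time(s)")
```

With this shape the fallback from HTTP to gRPC costs one tunnel setup instead of two.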

Yes, sounds good.

Support grpc communication with smg workers

baf92a4


Bihan requested a review from jvstme June 9, 2026 08:29

jvstme approved these changes Jun 10, 2026

Resolve Review Comments

fce861b

Bihan requested a review from jvstme June 11, 2026 09:22

jvstme approved these changes Jun 11, 2026

Bihan merged commit fc16f72 into dstackai:master Jun 11, 2026
24 checks passed

Bihan merged 2 commits into dstackai:master from Bihan:support-vllm-backend-with-smg-router
