GPU draft-MTP speculative decoding for Gemma 4 12B (decode 54 → ~70 t/s) #178

@pekkah

Status (2026-06-14): unblocked

The hard blocker, #179 (fp16/q8_0 KV for the dense Gemma CUDA path), shipped, so a draft model now fits alongside a quantized KV cache at usable context. CUDA single-user batched-verify infrastructure has also landed since (#207/#208: BatchVerify on BatchForwardMulti, n-gram --draft-lookup, ~1.33× at k=4). Remaining for this issue: drive an external GPU draft-MTP model for Gemma 4 12B and wire N>1 sampled (temp>0) accept. This is still a good-case-only ~1.3× win, so it is deprioritized, but it is no longer blocked.


Motivation

llama.cpp users report ~70 decode t/s on Gemma 4 12B-it QAT on 12 GB cards using MTP speculative decoding with an external "assistant-MTP" draft model:

llama-server -m gemma-4-12B-it-qat-UD-Q4_K_XL.gguf \
  -md gemma-4-12B-it-qat-assistant-MTP-Q8_0.gguf \
  -ngl 99 --spec-draft-ngl 99 --spec-type draft-mtp --spec-draft-n-max 4 \
  --cache-type-k q8_0 --cache-type-v q8_0 ... --temp 0.2

Our recorded raw decode is 54.1 t/s (README text-gen table, CUDA -g -1), already within ~6% of llama.cpp's non-speculative 57 t/s. The 54 → 70 gap is entirely the MTP draft speculation (draft proposes up to 4 tokens, the 12B verifies them in one batched pass; their temp 0.2 gives high acceptance).
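For reference, the draft-and-verify step described above can be sketched for the greedy case as follows. This is a minimal illustration, not the code in RunCommand.cs or MtpDecoder.cs; the function name `greedy_accept` and its argument layout are hypothetical.

```python
from typing import List, Sequence


def greedy_accept(draft_tokens: Sequence[int], target_argmax: Sequence[int]) -> List[int]:
    """Tokens emitted by one draft-and-verify step under greedy decoding.

    draft_tokens:  the k tokens proposed by the draft model.
    target_argmax: k + 1 argmax tokens from the target's single batched pass;
                   target_argmax[i] is the target's choice for the position
                   that follows the context extended by draft_tokens[:i].
    """
    if len(target_argmax) != len(draft_tokens) + 1:
        raise ValueError("target must score k + 1 positions")

    emitted: List[int] = []
    for i, tok in enumerate(draft_tokens):
        if tok != target_argmax[i]:
            # First mismatch: keep the target's own token, drop the remaining drafts.
            emitted.append(target_argmax[i])
            return emitted
        emitted.append(tok)

    # Every draft matched, so the extra scored position yields one bonus token.
    emitted.append(target_argmax[-1])
    return emitted
```

Each step emits between 1 and k + 1 tokens for one pass of the 12B, which is where the multiplier comes from. The output is identical to plain greedy decoding because every emitted token is the target's argmax.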

Why this is a follow-on, not a first move

  • The 54 → 70 win is ~1.3× and good-case only (low temp / high acceptance). Both rates are far above reading speed, so the benefit is real only for long generations / agentic loops, not interactive chat.
  • It does not fit in VRAM without KV quantization first. Our dense Gemma CUDA KV cache is fp32 (GpuForwardPass.cs:199, "K + V, fp32"), so the 12B already nearly fills a 12 GB card at long context with no draft model (that's why the README row is measured at -c 2048). A draft model only fits once KV is quantized. Blocked on #179 (Quantized (fp16/q8_0) KV cache for the dense CUDA path → unlock long context (Gemma 4 12B first)).
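A rough sizing sketch for the VRAM point above. The layer count, KV-head count and head dimension in the usage line are placeholders, not the real Gemma 4 12B values, and the formula assumes every layer caches the full context (it ignores any sliding-window layers, which would shrink the total). The q8_0 figure uses the GGML block layout of 32 int8 values plus one fp16 scale, i.e. 34 bytes per 32 elements.

```python
# Bytes per cached element for each KV storage type.
BYTES_PER_ELEM = {
    "fp32": 4.0,
    "fp16": 2.0,
    "q8_0": 34 / 32,  # 32 int8 values + one fp16 scale per block
}


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, kv_type: str) -> float:
    """Size of K + V for a dense model that caches n_ctx positions in every layer."""
    elems = 2 * n_layers * n_kv_heads * head_dim * n_ctx  # factor 2 = K and V
    return elems * BYTES_PER_ELEM[kv_type]


if __name__ == "__main__":
    # Placeholder dimensions for illustration only.
    dims = dict(n_layers=48, n_kv_heads=8, head_dim=256)
    for kv_type in ("fp32", "fp16", "q8_0"):
        gib = kv_cache_bytes(n_ctx=32768, kv_type=kv_type, **dims) / 2**30
        print(f"{kv_type}: {gib:.2f} GiB at 32K context")
```

Whatever the real dimensions are, the ratio holds: q8_0 stores the same cache in roughly 27% of the fp32 footprint, which is the headroom the draft model needs.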

So: land KV quantization first, then this becomes a purely additive throughput feature.

Current limitations to lift

Our draft-model speculative path (src/SharpInference.Cli/RunCommand.cs) today:

| Their config | Our support |
| --- | --- |
| Draft fully on GPU (--spec-draft-ngl 99) | CPU-only; nGpuLayers != 0 falls back (RunCommand.cs:619-621) |
| --spec-draft-n-max 4 | N=1 only on the MTP path (issue #30) |
| --temp 0.2 | Greedy only (--temp 0); otherwise falls back (RunCommand.cs:615-617) |
| External assistant-MTP draft for a model with no native MTP head | Our MTP is self-speculation for native-MTP models (Qwen3.6-MTP etc.); no external GPU draft wired for Gemma |

Scope

  1. GPU draft path — drive an external draft model on CUDA and batch-verify on the 12B (today draft is CPU-only).
  2. N > 1 batched accept (issue #30: batched main verify + per-token GDN snapshot ring, to realize the MTP >=1.3x speedup): N=4 draft + batched verify on the dense CUDA kernel; this is where the multiplier comes from. Reuse the native-MTP batched-verify machinery (MtpDecoder.cs, #30, and the MoE BatchForward2 follow-up #45) where possible.
  3. Sampled (non-greedy) accept: they run temp 0.2; we require temp 0. A hook exists: --spec-draft-p-min probabilistic accept (issue #38, MTP: implement --spec-draft-p-min (probabilistic accept threshold)).
  4. (Optional) load the Gemma assistant-MTP draft GGUF format.
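For scope item 3, one well-known way to accept drafts at temp > 0 is the speculative-sampling rule from the literature: accept a drafted token x with probability min(1, p(x)/q(x)), and on rejection resample from the normalized residual max(0, p - q). The sketch below shows that rule for a single position. It is an assumption that this is a suitable basis here; it is not a description of how --spec-draft-p-min behaves, and `accept_or_resample` is a hypothetical name.

```python
import random
from typing import Sequence, Tuple


def accept_or_resample(p: Sequence[float], q: Sequence[float],
                       drafted: int, rng: random.Random) -> Tuple[int, bool]:
    """Apply the speculative-sampling rule at one position.

    p: target distribution over the vocabulary (after temperature).
    q: draft distribution the token was sampled from.
    Returns (token, accepted).
    """
    if q[drafted] <= 0.0:
        raise ValueError("drafted token must have non-zero draft probability")

    # Accept with probability min(1, p/q).
    if rng.random() < min(1.0, p[drafted] / q[drafted]):
        return drafted, True

    # Rejected: sample from the residual max(0, p - q), renormalized.
    # A rejection implies p[drafted] < q[drafted], so the residual mass is > 0.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    r = rng.random() * sum(residual)
    acc = 0.0
    for tok, w in enumerate(residual):
        acc += w
        if r < acc:
            return tok, False
    # Floating-point edge case: fall back to the largest residual entry.
    return max(range(len(residual)), key=residual.__getitem__), False
```

With this rule the emitted tokens follow the target distribution p exactly, so output quality does not depend on the draft. At low temperatures such as 0.2, p is sharply peaked, which is consistent with the high acceptance reported for the llama.cpp setup.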

Acceptance

  • Gemma 4 12B QAT, CUDA -g -1 + external Q8_0 draft, reaches ~65-70 decode t/s at temp 0.2 with the draft fitting alongside a quantized KV cache at a usable context (≥32K).
  • Argmax-stability / accept-correctness invariants preserved; SHARPI_* bisect switches as elsewhere.
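A back-of-envelope model for sanity-checking the throughput target. It assumes each drafted token is accepted independently with the same probability, that verifying k + 1 positions in one batched pass costs about the same as a single decode step (memory-bound regime), and that each draft token costs a fixed fraction of a target step. None of these inputs are measured values; the function names and parameters are illustrative.

```python
def expected_tokens_per_verify(accept_rate: float, k: int) -> float:
    """Expected tokens emitted per verify pass with k drafts and i.i.d. acceptance."""
    if accept_rate >= 1.0:
        return float(k + 1)
    # Geometric series 1 + a + a^2 + ... + a^k.
    return (1.0 - accept_rate ** (k + 1)) / (1.0 - accept_rate)


def estimated_tps(base_tps: float, accept_rate: float, k: int,
                  draft_cost_ratio: float, verify_cost_ratio: float = 1.0) -> float:
    """Estimated decode rate given the non-speculative rate base_tps.

    draft_cost_ratio:  cost of one draft token relative to one target decode step.
    verify_cost_ratio: cost of the batched verify relative to one target decode step.
    """
    step_cost = verify_cost_ratio + k * draft_cost_ratio
    return base_tps * expected_tokens_per_verify(accept_rate, k) / step_cost


if __name__ == "__main__":
    # Sweep illustrative acceptance rates from the 54.1 t/s baseline at k = 4.
    for a in (0.3, 0.5, 0.7):
        print(a, round(estimated_tps(54.1, a, k=4, draft_cost_ratio=0.1), 1))
```

Plugging in measured acceptance rates and draft cost once the GPU draft path exists would show whether ~65-70 t/s is reachable before tuning kernels.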

Building blocks already present

Blocked on #179.

https://claude.ai/code/session_01Ti38STBkCwbN1c41mspjZv
