GPU draft-MTP speculative decoding for Gemma 4 12B (decode 54 → ~70 t/s)

> ## Status (2026-06-14): unblocked
> The hard blocker — **#179 (fp16/q8_0 KV for the dense Gemma CUDA path)** — **shipped**, so a draft model now fits alongside a quantized KV cache at usable context. CUDA single-user batched-verify infrastructure also landed since (#207/#208: `BatchVerify` on `BatchForwardMulti`, n-gram `--draft-lookup`, ~1.33× at k=4). Remaining for this issue: drive an **external** GPU draft-MTP model for Gemma 4 12B and wire **N>1** sampled (temp>0) accept. Still a good-case-only ~1.3× win, so deprioritized — but no longer blocked.
>
> ---

## Motivation

llama.cpp users report **~70 decode t/s** on Gemma 4 12B-it QAT on 12 GB cards using MTP speculative decoding with an external "assistant-MTP" draft model:

```
llama-server -m gemma-4-12B-it-qat-UD-Q4_K_XL.gguf \
  -md gemma-4-12B-it-qat-assistant-MTP-Q8_0.gguf \
  -ngl 99 --spec-draft-ngl 99 --spec-type draft-mtp --spec-draft-n-max 4 \
  --cache-type-k q8_0 --cache-type-v q8_0 ... --temp 0.2
```

Our recorded raw decode is **54.1 t/s** (README text-gen table, `CUDA -g -1`), already within ~5% of llama.cpp's *non-speculative* 57 t/s. The 54 → 70 gap therefore comes almost **entirely** from MTP draft speculation: the draft proposes up to 4 tokens, the 12B verifies them in one batched pass, and their temp 0.2 keeps acceptance high.
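To make the "high acceptance" dependence concrete, here is a rough back-of-envelope model. It is a sketch only, and it assumes things the issue has not measured: each drafted token is accepted independently with the same probability, a batched verify pass costs about the same as one normal decode step, and `draft_cost` (one draft step relative to one 12B step) is a made-up parameter.

```python
def expected_tokens_per_step(accept_rate: float, n_draft: int) -> float:
    """Expected tokens emitted per verify pass.

    Counts the accepted draft prefix plus the one token the target model
    always contributes, i.e. sum(accept_rate**i for i in range(n_draft + 1)).
    """
    if accept_rate >= 1.0:
        return float(n_draft + 1)
    return (1.0 - accept_rate ** (n_draft + 1)) / (1.0 - accept_rate)


def estimated_speedup(accept_rate: float, n_draft: int, draft_cost: float) -> float:
    """Estimated speedup over plain decode.

    draft_cost is the (assumed) cost of one draft step relative to one
    target step; one verify pass is assumed to cost one target step.
    """
    return expected_tokens_per_step(accept_rate, n_draft) / (1.0 + n_draft * draft_cost)


if __name__ == "__main__":
    # Illustrative values only; none of these are measurements.
    for rate in (0.5, 0.7, 0.9):
        print(rate, round(estimated_speedup(rate, n_draft=4, draft_cost=0.1), 2))
```

Under this model the speedup falls below 1× when acceptance is low, which is why the win is described as good-case only.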

## Why this is a follow-on, not a first move

- The 54 → 70 win is ~1.3× and **good-case only** (low temp / high acceptance). Both rates are far above reading speed, so the benefit is real only for long generations / agentic loops, not interactive chat.
- **It does not fit in VRAM without KV quantization first.** Our dense Gemma CUDA KV cache is fp32 (`GpuForwardPass.cs:199`, "K + V, fp32"), so the 12B already nearly fills a 12 GB card at long context with *no* draft model (that's why the README row is measured at `-c 2048`). A draft model only fits once KV is quantized. **This was the blocker: #179 (fp16/q8_0 KV for the dense Gemma CUDA path), which has since shipped (see Status).**

So: land KV quantization first, then this becomes a purely additive throughput feature.
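The VRAM argument above can be checked with simple arithmetic. The sketch below assumes a dense model that caches K and V for every layer at full context, and it ignores any sliding-window or shared-KV layers the real model may have. The layer/head shape in the example is hypothetical, not the actual Gemma 4 12B configuration. The `q8_0` size follows the ggml block layout (32 int8 values plus one fp16 scale, 34 bytes per block).

```python
# Bytes per cached element for each cache type.
# q8_0 stores 32 int8 values plus one fp16 scale: 34 bytes per 32 elements.
BYTES_PER_ELEM = {"f32": 4.0, "f16": 2.0, "q8_0": 34.0 / 32.0}


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, cache_type: str) -> float:
    """K + V cache size in bytes, assuming every layer caches the full context."""
    elems = 2 * n_layers * n_kv_heads * head_dim * n_ctx  # factor 2 = K and V
    return elems * BYTES_PER_ELEM[cache_type]


if __name__ == "__main__":
    # Hypothetical shape for illustration, NOT the real Gemma 4 12B config.
    shape = dict(n_layers=48, n_kv_heads=8, head_dim=128)
    for n_ctx in (2048, 32768):
        for cache_type in ("f32", "f16", "q8_0"):
            gib = kv_cache_bytes(n_ctx=n_ctx, cache_type=cache_type, **shape) / 2**30
            print(f"ctx={n_ctx:>6} {cache_type:>5}: {gib:6.2f} GiB")
```

Whatever the real shape is, the ratio between cache types is fixed: q8_0 needs roughly 3.8× less memory than fp32 at the same context.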

## Current limitations to lift

Our draft-model speculative path (`src/SharpInference.Cli/RunCommand.cs`) today:

| Their config | Our support |
|---|---|
| Draft fully on GPU (`--spec-draft-ngl 99`) | **CPU-only** — `nGpuLayers != 0` falls back (RunCommand.cs:619-621) |
| `--spec-draft-n-max 4` | **N=1 only** on the MTP path (issue #30) |
| `--temp 0.2` | **Greedy only** (`--temp 0`); otherwise falls back (RunCommand.cs:615-617) |
| External `assistant-MTP` draft for a model with no native MTP head | Our MTP is *self*-speculation for native-MTP models (Qwen3.6-MTP etc.); no external GPU draft wired for Gemma |

## Scope

1. **GPU draft path** — drive an external draft model on CUDA and batch-verify on the 12B (today draft is CPU-only).
2. **N > 1 batched accept** (issue #30) — N=4 draft + batched verify on the dense CUDA kernel; this is where the multiplier comes from. Reuse the native-MTP batched-verify machinery (`MtpDecoder.cs`, #30/#45) where possible.
3. **Sampled (non-greedy) accept** — they run temp 0.2; we require temp 0. Hook exists: `--spec-draft-p-min` probabilistic accept (issue #38).
4. **(Optional)** load the Gemma `assistant-MTP` draft GGUF format.
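For scope items 2 and 3, the accept logic can be sketched as below. This is a language-neutral illustration in Python, not the code in `MtpDecoder.cs` or `SpeculativeDecoder`; the function names are hypothetical. The greedy variant keeps the longest drafted prefix that matches the target's argmax. The sampled variant uses the standard speculative-sampling rule (accept with probability min(1, p/q)), which may or may not match what `--spec-draft-p-min` does here.

```python
import random
from typing import Callable, List, Sequence


def accept_greedy(draft_tokens: Sequence[int], target_argmax: Sequence[int]) -> List[int]:
    """Tokens to emit after one batched verify pass at temp 0.

    target_argmax must have len(draft_tokens) + 1 entries: the target's
    greedy pick at each drafted position, plus one bonus position.
    """
    emitted: List[int] = []
    for i, tok in enumerate(draft_tokens):
        if tok != target_argmax[i]:
            emitted.append(target_argmax[i])  # first mismatch: take the target's token and stop
            return emitted
        emitted.append(tok)
    emitted.append(target_argmax[len(draft_tokens)])  # all accepted: add the bonus token
    return emitted


def accept_sampled(q_probs: Sequence[float], p_probs: Sequence[float],
                   rng: Callable[[], float] = random.random) -> int:
    """Number of drafted tokens accepted at temp > 0.

    q_probs[i] / p_probs[i] are the draft / target probabilities of the
    i-th drafted token. q_probs[i] is assumed > 0 because the draft sampled it.
    """
    accepted = 0
    for q, p in zip(q_probs, p_probs):
        if rng() < min(1.0, p / q):
            accepted += 1
        else:
            break
    return accepted
```

The sampled sketch returns only the accept count. A full implementation would also resample the rejected position from the normalized residual distribution max(0, p − q), which is omitted here.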

## Acceptance

- Gemma 4 12B QAT, `CUDA -g -1` + external Q8_0 draft, reaches ~65-70 decode t/s at temp 0.2 with the draft fitting alongside a quantized KV cache at a usable context (≥32K).
- Argmax-stability / accept-correctness invariants preserved; `SHARPI_*` bisect switches as elsewhere.

## Building blocks already present

- CLI flags mirror llama.cpp: `--draft-model`, `--spec-type`, `--spec-draft-n-max/-n-min/-p-min` (RunCommand.cs).
- Native-MTP batched N=2 verify (#30/#45) in `MtpDecoder.cs`.
- `SpeculativeDecoder` (CPU) in the Engine.

Originally blocked on #179, which has since shipped (see Status above).

https://claude.ai/code/session_01Ti38STBkCwbN1c41mspjZv

