The hard blocker — #179 (fp16/q8_0 KV for the dense Gemma CUDA path) — shipped, so a draft model now fits alongside a quantized KV cache at usable context. CUDA single-user batched-verify infrastructure also landed since (#207/#208: BatchVerify on BatchForwardMulti, n-gram --draft-lookup, ~1.33× at k=4). Remaining for this issue: drive an external GPU draft-MTP model for Gemma 4 12B and wire N>1 sampled (temp>0) accept. Still a good-case-only ~1.3× win, so deprioritized — but no longer blocked.
Motivation
llama.cpp users report ~70 decode t/s on Gemma 4 12B-it QAT on 12 GB cards using MTP speculative decoding with an external "assistant-MTP" draft model:
Our recorded raw decode is 54.1 t/s (README text-gen table, CUDA -g -1), already within ~6% of llama.cpp's non-speculative 57 t/s. The 54 → 70 gap is entirely the MTP draft speculation (draft proposes up to 4 tokens, the 12B verifies them in one batched pass; their temp 0.2 gives high acceptance).
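The propose-then-verify step described above can be sketched in a few lines. This is a minimal illustration of greedy (temp 0) acceptance only. The function name and the list-based interface are hypothetical and are not the project's BatchVerify or MtpDecoder API.

```python
from typing import List, Sequence


def greedy_verify(draft_tokens: Sequence[int], target_argmax: Sequence[int]) -> List[int]:
    """Return the tokens emitted by one speculative verify step (greedy case).

    draft_tokens:  the k tokens proposed by the draft model.
    target_argmax: the target model's argmax at each of the k draft positions,
                   plus one extra prediction after the last draft token
                   (all produced by a single batched forward pass).
    """
    if len(target_argmax) != len(draft_tokens) + 1:
        raise ValueError("target_argmax must have exactly one more entry than draft_tokens")

    emitted: List[int] = []
    for proposed, verified in zip(draft_tokens, target_argmax):
        if proposed != verified:
            # First mismatch: keep the target's own token and discard the rest.
            emitted.append(verified)
            return emitted
        emitted.append(proposed)

    # Every draft token matched, so the extra target prediction comes for free.
    emitted.append(target_argmax[-1])
    return emitted
```

Each verify pass therefore emits between 1 and k+1 tokens for roughly the cost of one target forward pass, which is where the throughput gain comes from when acceptance is high.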
Why this is a follow-on, not a first move
The 54 → 70 win is ~1.3× and good-case only (low temp / high acceptance). Both rates are far above reading speed, so the benefit is real only for long generations / agentic loops, not interactive chat.
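To put the ratio in wall-clock terms, here is a small arithmetic sketch using the two rates quoted above. The generation lengths (200 and 4000 tokens) are illustrative assumptions, not measurements.

```python
RAW_TPS = 54.1   # recorded raw decode rate (README text-gen table)
SPEC_TPS = 70.0  # reported llama.cpp rate with MTP draft speculation


def seconds_saved(n_tokens: int, base_tps: float = RAW_TPS, fast_tps: float = SPEC_TPS) -> float:
    """Wall-clock seconds saved when decoding n_tokens at fast_tps instead of base_tps."""
    return n_tokens / base_tps - n_tokens / fast_tps


speedup = SPEC_TPS / RAW_TPS  # about 1.29x

if __name__ == "__main__":
    print(f"speedup: {speedup:.2f}x")
    for n in (200, 4000):  # hypothetical chat reply vs. long agentic generation
        print(f"{n} tokens: saves {seconds_saved(n):.1f} s")
```

Under these assumptions a short reply saves under a second, while a multi-thousand-token generation saves on the order of tens of seconds.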
The draft model does not fit in VRAM without KV quantization first. Our dense Gemma CUDA KV cache is fp32 (GpuForwardPass.cs:199, "K + V, fp32"), so the 12B already nearly fills a 12 GB card at long context with no draft model (that's why the README row is measured at -c 2048). A draft model only fits once KV is quantized. Blocked on #179 (fp16/q8_0 KV for the dense Gemma CUDA path).
So: land KV quantization first, then this becomes a purely additive throughput feature.
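The VRAM argument can be checked with a rough size estimate. The sketch below assumes a plain dense cache (K and V stored for every layer at every position) and uses placeholder model dimensions that are NOT Gemma 4 12B's real shape. The q8_0 cost follows the GGML block layout of 32 int8 values plus one fp16 scale, i.e. 34 bytes per 32 elements.

```python
# Bytes per stored element for each KV cache format.
BYTES_PER_ELEM = {
    "fp32": 4.0,
    "fp16": 2.0,
    "q8_0": 34 / 32,  # 32 int8 values + one fp16 scale per block
}


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int, n_ctx: int, dtype: str) -> float:
    """Estimate the dense KV cache size in bytes (factor 2 covers K and V)."""
    elems = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return elems * BYTES_PER_ELEM[dtype]


if __name__ == "__main__":
    # Placeholder dimensions, chosen only to give round numbers.
    shape = dict(n_layers=32, n_kv_heads=4, head_dim=128, n_ctx=32768)
    for dtype in BYTES_PER_ELEM:
        gib = kv_cache_bytes(dtype=dtype, **shape) / 2**30
        print(f"{dtype}: {gib:.2f} GiB")
```

Whatever the real dimensions are, the ratio between formats holds: fp16 halves the fp32 footprint and q8_0 cuts it by about 3.8x, which is the headroom a draft model needs.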
Acceptance
Gemma 4 12B QAT, CUDA -g -1 + external Q8_0 draft, reaches ~65-70 decode t/s at temp 0.2 with the draft fitting alongside a quantized KV cache at a usable context (≥32K).
Argmax-stability / accept-correctness invariants preserved; SHARPI_* bisect switches as elsewhere.
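For the temp 0.2 target, accept-correctness means the accepted tokens must still follow the target model's sampling distribution. One well-known way to get that is the standard speculative-sampling rule: accept a draft token with probability min(1, p/q), otherwise resample from the residual max(0, p - q). The sketch below shows that textbook rule as an assumption about what "sampled accept" could look like. It is not the project's implementation, and the function name is hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def accept_or_resample(
    token: int,
    p_target: Sequence[float],
    q_draft: Sequence[float],
    rng: random.Random,
) -> Tuple[bool, int]:
    """Apply the standard speculative-sampling accept rule to one draft token.

    p_target and q_draft are full next-token distributions (after temperature)
    from the target and draft models. Returns (accepted, emitted_token).
    """
    p, q = p_target[token], q_draft[token]
    if q > 0.0 and rng.random() < min(1.0, p / q):
        return True, token

    # Rejected: sample from the normalized residual max(0, p - q).
    residual: List[float] = [max(0.0, pt - qd) for pt, qd in zip(p_target, q_draft)]
    if sum(residual) == 0.0:
        # Distributions are identical; fall back to sampling from the target.
        residual = list(p_target)
    new_token = rng.choices(range(len(residual)), weights=residual, k=1)[0]
    return False, new_token
```

At temp 0 both distributions collapse to one-hot vectors and this rule reduces to the greedy "tokens must match" check, so greedy runs should be unaffected.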
Current limitations to lift
Our draft-model speculative path (src/SharpInference.Cli/RunCommand.cs) today:
[…] (--spec-draft-ngl 99): […] nGpuLayers != 0 falls back (RunCommand.cs:619-621).
[…] --spec-draft-n-max 4 […]
[…] --temp 0.2: […] (--temp 0); otherwise falls back (RunCommand.cs:615-617).
[…] assistant-MTP draft for a model with no native MTP head.
Scope
[…] MtpDecoder.cs, #30 (Batched main verify + per-token GDN snapshot ring: realize the MTP >=1.3x speedup) / #45 (MoE batched verify: support BatchForward2 for IsMoE models, follow-up to #30, #44) where possible.
[…] --spec-draft-p-min probabilistic accept (issue #38, MTP: implement --spec-draft-p-min (probabilistic accept threshold)).
(Optional) load the Gemma assistant-MTP draft GGUF format.
Building blocks already present
--draft-model, --spec-type, --spec-draft-n-max/-n-min/-p-min (RunCommand.cs).
[…] MtpDecoder.cs.
[…] SpeculativeDecoder (CPU) in the Engine.
Blocked on #179.
https://claude.ai/code/session_01Ti38STBkCwbN1c41mspjZv