Context
#208 (issues #207/#30) landed MTP k-token batched verify on the GDN-hybrid path at 6.2 → 10.4 t/s (1.68×) on 27B Q4_K_M CUDA-hybrid, but the measured optimum is the k=2 verify batch, and the sweep shows why deeper chains don't pay yet:
| draftN (k = draftN+1) | decode t/s | acceptance |
|---|---|---|
| 1 (k=2, 1-slot ring) | 10.4 | 90% |
| 2 (k=3) | 8.5 | 95% |
| 3 (k=4) | 9.2 | 84% |
Per-step cost scales ~1.94× from k=2 to k=4, for two reasons. First, the GPU trunk + lm_head run the temp-free matvec re-stream: k weight reads with zero amortization, because the MMQ/dequant-GEMM compute kernels' per-call fp16 temps only amortize at prefill N and WDDM-page at decode k (see MatMulComputeBatchMinN). Second, the CPU mmap FFN (46/64 layers, ~6 GB/token, the dominant cost) only pair-amortizes via MatVec2In. Even at 100% acceptance, k=4 roughly ties k=2.
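The "roughly ties" claim can be checked with a back-of-the-envelope calculation. The sketch below assumes that at 100% acceptance every verify step emits k tokens (the draftN accepted drafts plus the token produced by the verify itself), and it takes the ~1.94× step-cost ratio from the paragraph above; the function name is hypothetical.

```python
def tokens_per_unit_cost(k: int, relative_step_cost: float) -> float:
    """Tokens emitted per unit of step cost at 100% acceptance.

    Assumption (not from the issue): a fully accepted k-token verify
    step emits exactly k tokens.
    """
    return k / relative_step_cost


# Step cost is normalized to the k=2 step; 1.94 is the measured k=2 -> k=4 ratio.
k2 = tokens_per_unit_cost(2, 1.00)
k4 = tokens_per_unit_cost(4, 1.94)

print(f"k=2: {k2:.2f}  k=4: {k4:.2f}  gain: {k4 / k2:.3f}x")
# prints "k=2: 2.00  k=4: 2.06  gain: 1.031x"
```

Under these assumptions a perfect draft chain buys only about 3% at k=4, so the leverage is in lowering the per-step cost rather than in raising acceptance.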
Work items (ordered by measured leverage)
4-input dense CPU MatVec4In (SimdKernels, Q4_K/Q6_K/Q5_K/Q8_0 + F32): one weight read per 4 tokens on the CPU FFN → k=4 step drops an estimated ~100 ms → ~12+ t/s (≈2×). Keep per-token accumulation order identical to MatVec2In so per-position bits stay k-parity-independent. The duplicated-input-tail idiom in both passes' BatchVerify then collapses to one helper; it is currently hand-copied at 3 sites, as flagged in the #208 review.
MTP-KV-write-only refresh: the post-verify refresh re-runs the FULL MtpForward (incl. the ~152K×emb lm_head matvec + a blocking 600 KB logits download) once per accepted draft just to rewrite K/V — a head-block-only variant saves ~1-1.6 ms per accepted draft.
gdnSnapRing currently forces the per-position recurrence loop (k×8 launches/layer) + ~96×(k−1) CopyDeviceRegion launches per step (~3 ms at k=4). Extending GdnRecurrenceScan to optionally dump per-token states into the ring keeps the #114-B fused path for the verify trunk.
Logits buffer churn: every BatchVerify returns k fresh vocab-sized float[] (~600 KB each, so LOH) per step, and EnsureDecodeLogits/EnsureBatchVerifyScratch are exact-size, so --draft-lookup's varying proposal lengths realloc device+host buffers on most steps. Reuse per-k cached buffers / return views.
PromptLookupDraft incremental n-gram index: Propose linearly scans the whole history per step (≥0.1-0.5 ms at 16-32K ctx on the no-match path, which is exactly the "floor = baseline" case); replace the scan with a llama.cpp-style last-occurrence map updated O(1) in Append.
Dead code from the #30 generalization: LastHiddenT1 (interface + both passes + the CUDA pinned buffer/D2H) and the CPU pass's BatchForward2 (~200 lines) have no production consumers left. MtpDecoder drives BatchVerify/HiddenAt everywhere; only the CUDA SHARPI_CPU_GDN=1 debug branch still calls its own BatchForward2. Amputate, and dedupe the per-token FFN fallback shared with PrefillBatchedTrunkGpuFfn.
Acceptance
27B Q4_K_M CUDA-hybrid decode ≥ 12 t/s at the new optimum k (vs 10.4 today, 6.2 MTP-off)
MtpDecoder_GreedyParity_LlamaCpp untouched and green; per-position bit-identity across k parities preserved (duplicated-tail contract or MatVec4In equivalence)
bench-27b-mtp.ps1 + README row updated
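The bit-identity criterion above depends on the accumulation-order contract from the MatVec4In work item. A minimal sketch of that contract follows, in Python rather than the project's SIMD kernels and on plain floats instead of quantized blocks. All function names here are hypothetical; the point is only that each weight is read once for four inputs while every per-token accumulator performs the same operations in the same order as the single-input path, with one helper owning the duplicated-input-tail padding.

```python
from typing import List, Sequence

Vector = List[float]


def matvec1(weights: Sequence[Vector], x: Vector) -> Vector:
    """Reference single-input matvec: one full pass over the weights per token."""
    out = []
    for row in weights:
        acc = 0.0
        for j, w in enumerate(row):
            acc += w * x[j]
        out.append(acc)
    return out


def matvec4(weights: Sequence[Vector], x0: Vector, x1: Vector, x2: Vector, x3: Vector):
    """Four-input matvec: each weight is read once and applied to four inputs.

    Every accumulator sees the same multiply-adds in the same order as matvec1,
    so each output matches the single-input result exactly.
    """
    o0, o1, o2, o3 = [], [], [], []
    for row in weights:
        a0 = a1 = a2 = a3 = 0.0
        for j, w in enumerate(row):  # single weight read, four multiply-adds
            a0 += w * x0[j]
            a1 += w * x1[j]
            a2 += w * x2[j]
            a3 += w * x3[j]
        o0.append(a0)
        o1.append(a1)
        o2.append(a2)
        o3.append(a3)
    return [o0, o1, o2, o3]


def batch_matvec(weights: Sequence[Vector], inputs: Sequence[Vector]) -> List[Vector]:
    """Duplicated-input-tail helper: pad each group to 4 by repeating its last
    input, run matvec4, and drop the outputs that belong to the padding."""
    results: List[Vector] = []
    for start in range(0, len(inputs), 4):
        group = list(inputs[start:start + 4])
        real = len(group)
        group += [group[-1]] * (4 - real)
        results.extend(matvec4(weights, *group)[:real])
    return results


weights = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
inputs = [[1.0, 2.0, 3.0], [0.5, -1.0, 2.5], [3.0, 0.0, -2.0]]  # k=3, padded to 4
print(batch_matvec(weights, inputs) == [matvec1(weights, x) for x in inputs])  # prints True
```

Because the padding lives in one helper, a parity test of this shape can compare any k against the single-input path position by position.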
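For the PromptLookupDraft work item, a last-occurrence map could look roughly like the sketch below. It is an illustration in Python, not the project's code or llama.cpp's: the class name, the fixed n-gram length `n`, and the method signatures are all assumptions. Each n-gram is registered only once a token follows it, so the trailing n-gram never matches itself and a lookup is a single dictionary probe instead of a history scan.

```python
from typing import Dict, List, Tuple


class NgramIndex:
    """Hypothetical incremental n-gram index for prompt-lookup drafting."""

    def __init__(self, n: int = 3) -> None:
        self.n = n
        self.history: List[int] = []
        # n-gram -> index of the token that followed its most recent occurrence
        self.next_pos: Dict[Tuple[int, ...], int] = {}

    def append(self, token: int) -> None:
        """Constant work for fixed n: record the n-gram that this token follows."""
        pos = len(self.history)
        if pos >= self.n:
            key = tuple(self.history[pos - self.n:pos])
            self.next_pos[key] = pos  # later occurrences overwrite earlier ones
        self.history.append(token)

    def propose(self, max_tokens: int) -> List[int]:
        """Return the tokens that followed the last earlier occurrence of the
        trailing n-gram, or an empty list on the no-match path."""
        if len(self.history) < self.n:
            return []
        start = self.next_pos.get(tuple(self.history[-self.n:]))
        if start is None:
            return []
        return self.history[start:start + max_tokens]


index = NgramIndex(n=3)
for tok in [1, 2, 3, 4, 5, 1, 2, 3]:
    index.append(tok)
print(index.propose(2))  # prints [4, 5]
```

The no-match path, which the issue identifies as the "floor = baseline" case, then costs one failed dictionary lookup regardless of context length.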