perf(mtp): k>2 batched-verify headroom - 4-input dense MatVec, WS hybrid-trunk decode kernels, ring/refresh cleanups #209

@pekkah

Context

#208 (issues #207/#30) landed MTP k-token batched verify on the GDN-hybrid path at 6.2 → 10.4 t/s (1.68×) on 27B Q4_K_M CUDA-hybrid — but the measured optimum is the k=2 verify batch, and the sweep shows why deeper chains don't pay yet:

| draftN (k = draftN+1) | decode t/s | acceptance |
|---|---|---|
| 1 (k=2, 1-slot ring) | 10.4 | 90% |
| 2 (k=3) | 8.5 | 95% |
| 3 (k=4) | 9.2 | 84% |

Per-step cost scales ~1.94× from k=2 to k=4, for two reasons. The GPU trunk + lm_head run the temp-free matvec re-stream: k weight reads with zero amortization (the MMQ/dequant-GEMM compute kernels' per-call fp16 temps only amortize at prefill N and WDDM-page at decode k; see MatMulComputeBatchMinN). The CPU mmap FFN (46/64 layers, ~6 GB/token, the dominant cost) only pair-amortizes via MatVec2In. Even at 100% acceptance, k=4 roughly ties k=2.
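The "roughly ties" claim can be checked with a back-of-the-envelope model. The sketch below is illustrative only: it assumes each draft is accepted independently with probability p and that a step emits the accepted prefix plus one verified token, which may not match how the acceptance column in the sweep is defined. The function names are hypothetical.

```python
def expected_tokens_per_step(k: int, p: float) -> float:
    """Expected tokens emitted by one verify step of batch size k.

    Assumption: drafts are accepted independently with probability p and the
    chain stops at the first rejection, so the expectation is
    1 + p + p^2 + ... + p^(k-1).
    """
    return sum(p ** i for i in range(k))


def relative_throughput(k: int, p: float, step_cost: float) -> float:
    """Tokens per unit of step cost (cost normalized to the k=2 step)."""
    return expected_tokens_per_step(k, p) / step_cost


# 1.94 is the measured k=2 -> k=4 per-step cost scaling quoted above.
k2 = relative_throughput(2, 1.0, 1.0)    # 2 tokens per unit cost
k4 = relative_throughput(4, 1.0, 1.94)   # 4 / 1.94 tokens per unit cost
print(f"k=4 vs k=2 at 100% acceptance: {k4 / k2:.3f}x")
```

Under these assumptions k=4 is only about 3% ahead of k=2 at perfect acceptance, so any acceptance loss erases the gain until the per-step cost scaling comes down.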

Work items (ordered by measured leverage)

  1. 4-input dense CPU MatVec4In (SimdKernels, Q4_K/Q6_K/Q5_K/Q8_0 + F32): one weight read per 4 tokens on the CPU FFN. The k=4 step is estimated to drop by ~100 ms, for ~12+ t/s (≈2×). Keep per-token accumulation order identical to MatVec2In so per-position bits stay k-parity-independent. The duplicated-input-tail idiom in both passes' BatchVerify then collapses to one helper; it is currently hand-copied at 3 sites, as flagged in the #208 review.
  2. Weight-stationary batched-decode kernels for the hybrid GPU trunk (the #194 pattern, a #190 follow-up that amortizes weight HBM reads across the batch; dense-path proven 2.42× at N=8): removes the linear-in-k re-stream cost on projections + lm_head.
  3. MTP-KV-write-only refresh: the post-verify refresh re-runs the FULL MtpForward (incl. the ~152K×emb lm_head matvec + a blocking 600 KB logits download) once per accepted draft just to rewrite K/V — a head-block-only variant saves ~1-1.6 ms per accepted draft.
  4. Fused-scan ring capture: gdnSnapRing currently forces the per-position recurrence loop (k×8 launches/layer) plus ~96×(k−1) CopyDeviceRegion launches per step (~3 ms at k=4). Extending GdnRecurrenceScan to optionally dump per-token states into the ring keeps the #114-B fused path (from the GDN-hybrid prefill headroom work after #111/#112) for the verify trunk.
  5. Logits buffer churn: every BatchVerify returns k fresh vocab-sized float[] (~600 KB each — LOH) per step, and EnsureDecodeLogits/EnsureBatchVerifyScratch are exact-size, so --draft-lookup's varying proposal lengths realloc device+host buffers on most steps. Reuse per-k cached buffers / return views.
  6. PromptLookupDraft incremental n-gram index: Propose linearly scans the whole history per step (≥0.1-0.5 ms at 16-32K ctx on the no-match path, which is exactly the "floor = baseline" case). Replace the scan with a llama.cpp-style last-occurrence map updated in O(1) in Append.
  7. Dead code from the #30 generalization (batched main verify + per-token GDN snapshot ring): LastHiddenT1 (interface + both passes + the CUDA pinned buffer/D2H) and the CPU pass's BatchForward2 (~200 lines) have no production consumers left. MtpDecoder drives BatchVerify/HiddenAt everywhere; only the CUDA SHARPI_CPU_GDN=1 debug branch still calls its own BatchForward2. Remove both, and dedupe the per-token FFN fallback shared with PrefillBatchedTrunkGpuFfn.
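For item 6, the last-occurrence map could look roughly like the following. This is a language-neutral sketch in Python, not the PromptLookupDraft implementation: the class name, the fixed n-gram length, and the method signatures are all assumptions made for illustration.

```python
class NgramIndex:
    """Maps each n-gram to the position right after its most recent occurrence.

    Hypothetical stand-in for an incremental index behind Append/Propose.
    """

    def __init__(self, n: int = 3):
        self.n = n
        self.history: list[int] = []
        self.last_end: dict[tuple[int, ...], int] = {}

    def append(self, token: int) -> None:
        # Register the n-gram that ended just before this token, so the
        # current suffix can never match itself in propose().
        if len(self.history) >= self.n:
            key = tuple(self.history[-self.n:])
            self.last_end[key] = len(self.history)  # index of the continuation
        self.history.append(token)

    def propose(self, max_draft: int) -> list[int]:
        if len(self.history) < self.n:
            return []
        key = tuple(self.history[-self.n:])
        start = self.last_end.get(key)
        if start is None:
            return []  # no-match path: one dict lookup, no history scan
        return self.history[start:start + max_draft]


index = NgramIndex(n=3)
for tok in [1, 2, 3, 4, 5, 1, 2, 3]:
    index.append(tok)
print(index.propose(max_draft=2))  # continuation after the earlier 1, 2, 3
```

Both operations cost O(n) in the n-gram length and are independent of history length, so the no-match path no longer grows with context size.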

Acceptance

  • 27B Q4_K_M CUDA-hybrid decode ≥ 12 t/s at the new optimum k (vs 10.4 today, 6.2 MTP-off)
  • MtpDecoder_GreedyParity_LlamaCpp untouched and green; per-position bit-identity across k parities preserved (duplicated-tail contract or MatVec4In equivalence)
  • bench-27b-mtp.ps1 + README row updated
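The bit-identity criterion above could be exercised with a harness along these lines. This is a scalar Python sketch, not the SimdKernels code: `matvec_n_in` and `batch_verify` are hypothetical names, and the reading of the duplicated-input-tail idiom (pad a short group by repeating its last input, then discard the padded outputs) is an assumption. A real SIMD kernel would also have to keep the same lane and reduction order per token.

```python
import random
import struct


def matvec_n_in(weights, inputs):
    """One pass over the weight rows, shared by all inputs.

    Each input has its own accumulator and sums columns left to right, so a
    position's result does not depend on how many inputs share the pass.
    """
    outs = [[0.0] * len(weights) for _ in inputs]
    for r, row in enumerate(weights):  # each weight row is read once
        for t, x in enumerate(inputs):
            acc = 0.0
            for w, xi in zip(row, x):
                acc += w * xi
            outs[t][r] = acc
    return outs


def batch_verify(weights, tokens, width):
    """Process tokens in groups of `width`, duplicating the last input to pad."""
    results = []
    for i in range(0, len(tokens), width):
        group = tokens[i:i + width]
        real = len(group)
        group = group + [group[-1]] * (width - real)  # duplicated-input tail
        results.extend(matvec_n_in(weights, group)[:real])
    return results


def bits(v: float) -> bytes:
    """Raw IEEE-754 bytes, so the comparison is bit-exact."""
    return struct.pack("<d", v)


rng = random.Random(0)
weights = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(5)]
tokens = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(3)]  # k = 3

ref = [[bits(v) for v in row] for row in batch_verify(weights, tokens, 1)]
for width in (2, 4):
    got = [[bits(v) for v in row] for row in batch_verify(weights, tokens, width)]
    assert got == ref, f"width {width} changed per-position bits"
```

The same comparison, run against the actual 2-input and 4-input kernels for odd and even k, would catch any accumulation-order drift between them.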
