perf(mtp): k>2 batched-verify headroom - 4-input dense MatVec, WS hybrid-trunk decode kernels, ring/refresh cleanups

## Context

#208 (issues #207/#30) landed MTP k-token batched verify on the GDN-hybrid path at **6.2 → 10.4 t/s (1.68×)** on 27B Q4_K_M CUDA-hybrid — but the measured optimum is the **k=2** verify batch, and the sweep shows why deeper chains don't pay yet:

| draftN (k = draftN+1) | decode t/s | acceptance |
|---|---|---|
| 1 (k=2, 1-slot ring) | **10.4** | 90% |
| 2 (k=3) | 8.5 | 95% |
| 3 (k=4) | 9.2 | 84% |

Per-step cost scaling from k=2 to k=4 is ~1.94×, for two reasons. The GPU trunk + lm_head run the temp-free **matvec re-stream**: k weight reads with zero amortization, because the MMQ/dequant-GEMM compute kernels' per-call fp16 temps only amortize at prefill N and WDDM-page at decode k (see `MatMulComputeBatchMinN`). The CPU mmap FFN (46/64 layers, ~6 GB/token, the dominant cost) only **pair**-amortizes via `MatVec2In`. Even at 100% acceptance, k=4 roughly ties k=2.
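
The "roughly ties" claim can be sanity-checked from the table. The sketch below is a back-of-envelope model, not a measurement: it assumes a k=2 step emits one verified token plus the draft when it is accepted, and that a fully accepted k=4 step emits 4 tokens. Only the three constants come from the numbers above.

```python
# Back-of-envelope check of the sweep (assumptions are marked as such).
K2_TPS = 10.4            # measured decode t/s at k=2 (table above)
K2_ACCEPT = 0.90         # measured acceptance at k=2 (table above)
STEP_COST_RATIO = 1.94   # measured per-step cost scaling k=2 -> k=4

# Assumption: a k=2 step yields 1 verified token + the draft if accepted.
k2_tokens_per_step = 1 + K2_ACCEPT
k2_step_s = k2_tokens_per_step / K2_TPS

# Scale the step cost to k=4, then take the best case where all 4
# positions are kept on every step (100% acceptance; an assumption).
k4_step_s = k2_step_s * STEP_COST_RATIO
k4_best_tps = 4 / k4_step_s

print(f"k=2 step ~ {k2_step_s * 1e3:.0f} ms")
print(f"k=4 step ~ {k4_step_s * 1e3:.0f} ms, ceiling ~ {k4_best_tps:.1f} t/s")
```

Under these assumptions the k=4 ceiling comes out at about 11.3 t/s against the measured 10.4 t/s at k=2, so deeper chains cannot win until the per-step cost itself comes down.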

## Work items (ordered by measured leverage)

1. **4-input dense CPU `MatVec4In`** (SimdKernels, Q4_K/Q6_K/Q5_K/Q8_0 + F32): one weight read per 4 tokens on the CPU FFN → k=4 step drops an estimated ~100 ms → **~12+ t/s (≈2×)**. Keep per-token accumulation order identical to `MatVec2In` so per-position bits stay k-parity-independent. The duplicated-input-tail idiom in both passes' `BatchVerify` then collapses to one helper; it is currently hand-copied at 3 sites (flagged in the #208 review).
2. **Weight-stationary batched-decode kernels for the hybrid GPU trunk** (the #194 pattern, dense-path proven 2.42× at N=8): removes the linear-in-k re-stream cost on projections + lm_head.
3. **MTP-KV-write-only refresh**: the post-verify refresh re-runs the FULL `MtpForward` (incl. the ~152K×emb lm_head matvec + a blocking 600 KB logits download) once per accepted draft just to rewrite K/V — a head-block-only variant saves ~1-1.6 ms per accepted draft.
4. **Fused-scan ring capture**: `gdnSnapRing` currently forces the per-position recurrence loop (k×8 launches/layer) + ~96×(k−1) `CopyDeviceRegion` launches per step (~3 ms at k=4). Extending `GdnRecurrenceScan` to optionally dump per-token states into the ring keeps the #114-B fused path for the verify trunk.
5. **Logits buffer churn**: every `BatchVerify` returns k fresh vocab-sized `float[]` (~600 KB each — LOH) per step, and `EnsureDecodeLogits`/`EnsureBatchVerifyScratch` are exact-size, so `--draft-lookup`'s varying proposal lengths realloc device+host buffers on most steps. Reuse per-k cached buffers / return views.
6. **`PromptLookupDraft` incremental n-gram index**: `Propose` linearly scans the whole history per step (≥0.1-0.5 ms at 16-32K ctx on the no-match path, which is exactly the "floor = baseline" case). Replace the scan with a llama.cpp-style last-occurrence map updated in O(1) in `Append`.
7. **Dead code from the #30 generalization**: `LastHiddenT1` (interface + both passes + the CUDA pinned buffer/D2H) and the CPU pass's `BatchForward2` (~200 lines) have no production consumers left — `MtpDecoder` drives `BatchVerify`/`HiddenAt` everywhere; only the CUDA `SHARPI_CPU_GDN=1` debug branch still calls its own `BatchForward2`. Amputate, and dedupe the per-token FFN fallback shared with `PrefillBatchedTrunkGpuFfn`.
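
To make the bit-identity requirement in item 1 concrete, here is a minimal scalar sketch. It is plain Python, not the quantized SIMD kernels in SimdKernels, and the names `matvec1`, `matvec4` and `batch_verify_matvec` are hypothetical. It shows the two properties the item asks for: one pass over the weights serves four tokens, and each token's accumulator sees the same sequence of adds as the single-input path, so padding with a duplicated tail does not change any kept output.

```python
from typing import List, Sequence, Tuple

Vec = List[float]
Mat = Sequence[Sequence[float]]

def matvec1(weights: Mat, x: Sequence[float]) -> Vec:
    """Reference path: one full weight pass per token."""
    out = []
    for row in weights:
        acc = 0.0
        for j, w in enumerate(row):
            acc += w * x[j]
        out.append(acc)
    return out

def matvec4(weights: Mat, x0, x1, x2, x3) -> Tuple[Vec, Vec, Vec, Vec]:
    """One weight pass for four tokens. Each accumulator performs the
    same adds in the same order as matvec1, so results are bit-identical."""
    y0, y1, y2, y3 = [], [], [], []
    for row in weights:
        a0 = a1 = a2 = a3 = 0.0
        for j, w in enumerate(row):   # each weight is read once per 4 tokens
            a0 += w * x0[j]
            a1 += w * x1[j]
            a2 += w * x2[j]
            a3 += w * x3[j]
        y0.append(a0); y1.append(a1); y2.append(a2); y3.append(a3)
    return y0, y1, y2, y3

def batch_verify_matvec(weights: Mat, inputs: Sequence[Sequence[float]]) -> List[Vec]:
    """Duplicated-input-tail helper: pad k inputs up to a multiple of 4 by
    repeating the last one, run matvec4 per group, drop the padded outputs."""
    k = len(inputs)
    padded = list(inputs) + [inputs[-1]] * (-k % 4)
    outs: List[Vec] = []
    for i in range(0, len(padded), 4):
        outs.extend(matvec4(weights, *padded[i:i + 4]))
    return outs[:k]
```

In a real SIMD kernel the same guarantee additionally depends on lane layout, FMA usage and block-wise dequantization order matching `MatVec2In`, which this scalar sketch does not model.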

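The incremental index in item 6 could look roughly like the sketch below. This is an illustration of a last-occurrence map, not a port of llama.cpp's code or of `PromptLookupDraft`; the class `NgramIndex`, its fixed n-gram length and the method names are assumptions. Registration is delayed by one token so that every indexed n-gram already has a continuation and the trailing n-gram never matches itself.

```python
from typing import Dict, List, Tuple

class NgramIndex:
    """Last-occurrence map: n-gram -> index of the token that followed it."""

    def __init__(self, n: int = 3) -> None:
        self.n = n
        self.history: List[int] = []
        self.last: Dict[Tuple[int, ...], int] = {}

    def append(self, token: int) -> None:
        # O(1) for fixed n: index the n-gram that ended just before `token`,
        # which now has `token` as its first continuation.
        self.history.append(token)
        end = len(self.history) - 1
        if end >= self.n:
            key = tuple(self.history[end - self.n:end])
            self.last[key] = end   # later occurrences overwrite earlier ones

    def propose(self, max_draft: int) -> List[int]:
        # Look up the trailing n-gram; no history scan on the no-match path.
        if len(self.history) < self.n:
            return []
        start = self.last.get(tuple(self.history[-self.n:]))
        if start is None:
            return []
        return self.history[start:start + max_draft]

idx = NgramIndex(n=3)
for t in [1, 2, 3, 4, 1, 2, 3]:
    idx.append(t)
draft = idx.propose(max_draft=2)   # continuation after the earlier (1, 2, 3)
```

Here `draft` is `[4, 1]`, the tokens that followed the earlier occurrence of `(1, 2, 3)`. On a miss, `propose` costs one dictionary lookup regardless of context length.
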
## Acceptance

- [ ] 27B Q4_K_M CUDA-hybrid decode ≥ 12 t/s at the new optimum k (vs 10.4 today, 6.2 MTP-off)
- [ ] `MtpDecoder_GreedyParity_LlamaCpp` untouched and green; per-position bit-identity across k parities preserved (duplicated-tail contract or `MatVec4In` equivalence)
- [ ] `bench-27b-mtp.ps1` + README row updated

