perf(sampler): top-k-first fast path — Gemma 4 E4B decode 48.7→70.3 t/s (#142)#154
Conversation
…/s (#142) Recommended-args decode (--temp 1.0 --top-k 64 --top-p 0.95) on Gemma 4 E4B Q8_0 was bottlenecked by the CPU sampler, not the CUDA kernels. Diagnosis (measure-before-optimize): decode is HBM-bandwidth-bound and the big FFN matvecs already run at ~90% of the 4070 Ti's ~504 GB/s peak (see the new CudaDecodeMatvecRooflineProbe — note a single-buffer loop misleadingly reads >700 GB/s because the 35 MB weight stays resident in the 48 MB L2; rotate over >L2 of buffers to get the true cold-read rate). No low-ctx clock cliff either (GPU holds 2865 MHz / 92% boost, 98% util during sustained decode). So the core matvec kernels have little headroom — the gap to llama.cpp was elsewhere. Two CPU-side issues: 1. Sampler.Sample's recommended-args path sorted a 262144-element (int,float)[] with a delegate comparator + 2 MB alloc every token in ApplyTopP, even though ApplyTopK already left only ~64 non-zero entries. The CLI also defaults --rep-penalty 1.1, which routed through this full-vocab path. 2. bench-textgen.ps1 hard-coded --verbose-prompt, whose per-token DecodeLoop debug log does a full-vocab LINQ OrderByDescending — ~1.5-2.5 ms/token that understated measured decode by 10-25%. Removed from the bench. Fix (Sampler.cs): top-k-first fast path — select the top-(k+W) logits in one vocab pass, then run temperature/softmax/min-p/top-p over just those k (the llama.cpp ordering). Repetition penalty only demotes recent tokens, so it is folded in by over-selecting W = #distinct-recent, penalising that small set, re-sorting, and keeping the top-k (exact). Gated on TopK>0 && !logitBias; bias keeps the full-vocab path. ApplyTopP now sorts only the non-zero survivors (bit-exact). 15/15 SamplerTests green; greedy path unchanged. Adds SHARPI_DECODE_REGIONS=1 region profiler (embed+PLE / device / download split) used in the diagnosis. Result (no-verbose bench, recommended args): 63.6 -> 70.3 t/s. Greedy 71.6 t/s. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request introduces significant performance optimizations for decode-time sampling and profiling. Key changes include adding a fast-path "SampleTopK" method to avoid full-vocabulary sorting, optimizing "ApplyTopP" to sort only non-zero survivors, implementing a region profiler for Gemma 4 decode phases, and adding a roofline probe test for Q8_0 matvec kernels. The review feedback identifies two important issues: a bug in the roofline probe's "HalfToUshort" conversion where "BitConverter.GetBytes" implicitly converts "Half" to "float", and a potential performance bottleneck in "SampleTopK" if "TopK" is extremely large, suggesting a gate of "p.TopK <= 256" to keep the fast path allocation-free and highly efficient.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| catch { return null; } | ||
| } | ||
|
|
||
| private static ushort HalfToUshort(Half h) => BitConverter.ToUInt16(BitConverter.GetBytes(h), 0); |
There was a problem hiding this comment.
In HalfToUshort, using BitConverter.GetBytes(h) is problematic because BitConverter does not have an overload for Half. In C#, Half implicitly converts to float, which means BitConverter.GetBytes(h) will resolve to BitConverter.GetBytes((float)h). This returns a 4-byte array representing the single-precision float, and BitConverter.ToUInt16 will then read the incorrect bits, resulting in garbage scale values in the generated matrix.
To correctly convert a Half to its 16-bit representation without allocations, use BitConverter.HalfToUInt16Bits(h).
private static ushort HalfToUshort(Half h) => BitConverter.HalfToUInt16Bits(h);| // the llama.cpp ordering (top-k → softmax(survivors) → top-p). Repetition penalty | ||
| // only *demotes* the recent tokens, so it is folded in by over-selecting (see | ||
| // SampleTopK). Logit bias can promote *arbitrary* tokens, so it keeps the full path. | ||
| if (p.TopK > 0 && p.TopK < vocabSize && !hasBias) |
There was a problem hiding this comment.
The fast path SampleTopK uses a linear insertion sort of complexity TopK. While this is extremely fast for small values of TopK (e.g., TopK is set to a very large value (e.g., several thousands), as the insertion sort will perform millions of shifting operations.
Additionally, SampleTopK only uses stackalloc when sel <= 256, falling back to heap allocations otherwise. Gating the fast path to p.TopK <= 256 ensures that the fast path remains highly efficient and allocation-free.
if (p.TopK > 0 && p.TopK <= 256 && !hasBias)…tests Addresses pr-review-toolkit findings on the #142 sampler change: - Correctness: the slow path now renormalizes after top-k (before min-p/top-p), so its nucleus cutoff is taken over the post-top-k distribution — matching the SampleTopK fast path (top-k -> renormalize -> top-p, the llama.cpp ordering). Previously the two paths computed top-p over different normalizations, so the fast path was not equivalent to the slow path for top-k + top-p. Corrected the "bit-identical" wording. - Safety: SampleTopK could return the Fill(-1) sentinel (or sample a collapsed all-zero distribution) when logits contain -inf (masked tokens). Added a degenerate guard (falls back to Greedy over the raw logits) and made the sampling loop/fallback never return a negative index. - Allocation: removed the per-token Dictionary<int,int> — over-select by PreviousTokens.Count and scan the small (<=64) window for occurrence counts, keeping the hot path allocation-free. - Tests (+8): fast-vs-slow differential equivalence (forced via out-of-range bias) across seeds, penalty demotion out of top-k, duplicate-occurrence penalty compounding, -inf safety, all--inf fallback, top-k+top-p nucleus. 15/15 original + 8 new SamplerTests green. Greedy path unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Summary
Recommended-args decode on Gemma 4 E4B Q8_0 (RTX 4070 Ti) was bottlenecked by the CPU sampler, not the CUDA kernels. With
--temp 1.0 --top-k 64 --top-p 0.95decode goes 48.7 → 70.3 t/s (+44%) (goal ≥70 met). Greedy decode is 71.5 t/s (the sampler is now near-free).Diagnosis (measure-before-optimize)
CudaDecodeMatvecRooflineProbeshows the big FFN matvecs (49% of decode) already run at 88–91% of the 4070 Ti's ~504 GB/s peak — near-optimal, no kernel headroom. (Caveat baked into the probe: a single looped 35 MB weight reads >700 GB/s because it stays resident in the 48 MB L2; the probe rotates over >L2 of buffers to force the true cold-read rate.)SHARPI_DECODE_REGIONS=1region profiler: the 42-layer device region is 95% of forward time; PLE prepass 0.5 ms and logits download 0.1 ms are negligible.So the gap to llama.cpp (76.6 t/s greedy) was not in the core matvec kernels.
Root causes (CPU)
Sampler.Samplerecommended-args path ranApplyTopPas anArray.Sortover a 262144-element(int,float)[]with a delegate comparator + 2 MB allocation every token, even thoughApplyTopKhad already left only ~64 non-zero entries, plus a full-vocab softmax/normalize/sample. The CLI defaults--rep-penalty 1.1, which routed every token through this path.bench-textgen.ps1hard-coded--verbose-prompt, whose per-tokenDecodeLoopdebug log does a full-vocab LINQOrderByDescending— ~1.5–2.5 ms/token that understated measured decode by 10–25%. Removed from the bench.Changes
Sampler.cs— top-k-first fast path: select the top-(k+W) logits in one vocab pass, then run temperature / softmax / min-p / top-p over just those k (the llama.cpp ordering: top-k → renormalize → top-p). Repetition penalty only demotes recent tokens, so it is folded in by over-selectingW = PreviousTokens.Count, penalising that small set, re-sorting, and keeping the top-k — exact, no full-vocab work and no per-token allocation. The slow path (reached via logit bias /TopK=0) now also renormalizes after top-k so both paths agree. Degenerate/-infinputs fall back to a valid greedy token (never the sentinel). Greedy path unchanged.bench-textgen.ps1— drop--verbose-prompt; take the sanity-check sample from stdout.CudaForwardPass.cs—SHARPI_DECODE_REGIONS=1region profiler (env-gated, off by default), used in the diagnosis.CudaDecodeMatvecRooflineProbe.cs— decode-shape roofline probe (explicit--filter, silent no-op without CUDA).Review cycle (pr-review-toolkit)
Three reviewers (code-reviewer, pr-test-analyzer, silent-failure-hunter) flagged: (1) fast/slow top-p divergence → fixed by renormalizing the slow path after top-k; (2) the fast path could return the
Fill(-1)sentinel on-inflogits → fixed with a greedy fallback; (3) a per-tokenDictionaryallocation → removed. Plus 8 new tests including a fast-vs-slow differential oracle.Testing
SamplerTests: 23/23 green (15 original + 8 new).--filter ~Cuda): 132/132 green (25m, incl. the new probe).🤖 Generated with Claude Code