perf(sampler): top-k-first fast path — Gemma 4 E4B decode 48.7→70.3 t/s (#142) by pekkah · Pull Request #154 · pekkah/SharpInference

pekkah · 2026-06-06T15:09:58Z

Summary

Recommended-args decode on Gemma 4 E4B Q8_0 (RTX 4070 Ti) was bottlenecked by the CPU sampler, not the CUDA kernels. With --temp 1.0 --top-k 64 --top-p 0.95 decode goes 48.7 → 70.3 t/s (+44%) (goal ≥70 met). Greedy decode is 71.5 t/s (the sampler is now near-free).

Diagnosis (measure-before-optimize)

Decode is HBM-bandwidth-bound. New CudaDecodeMatvecRooflineProbe shows the big FFN matvecs (49% of decode) already run at 88–91% of the 4070 Ti's ~504 GB/s peak — near-optimal, no kernel headroom. (Caveat baked into the probe: a single looped 35 MB weight reads >700 GB/s because it stays resident in the 48 MB L2; the probe rotates over >L2 of buffers to force the true cold-read rate.)
No low-context clock cliff — the GPU holds 2865 MHz (92% of the 3120 boost), 97–99% util, 180/285 W, 54 °C during sustained decode.
SHARPI_DECODE_REGIONS=1 region profiler: the 42-layer device region is 95% of forward time; PLE prepass 0.5 ms and logits download 0.1 ms are negligible.

So the gap to llama.cpp (76.6 t/s greedy) was not in the core matvec kernels.

Root causes (CPU)

Sampler.Sample recommended-args path ran ApplyTopP as an Array.Sort over a 262144-element (int,float)[] with a delegate comparator + 2 MB allocation every token, even though ApplyTopK had already left only ~64 non-zero entries, plus a full-vocab softmax/normalize/sample. The CLI defaults --rep-penalty 1.1, which routed every token through this path.
bench-textgen.ps1 hard-coded --verbose-prompt, whose per-token DecodeLoop debug log does a full-vocab LINQ OrderByDescending — ~1.5–2.5 ms/token that understated measured decode by 10–25%. Removed from the bench.

Changes

Sampler.cs — top-k-first fast path: select the top-(k+W) logits in one vocab pass, then run temperature / softmax / min-p / top-p over just those k (the llama.cpp ordering: top-k → renormalize → top-p). Repetition penalty only demotes recent tokens, so it is folded in by over-selecting W = PreviousTokens.Count, penalising that small set, re-sorting, and keeping the top-k — exact, no full-vocab work and no per-token allocation. The slow path (reached via logit bias / TopK=0) now also renormalizes after top-k so both paths agree. Degenerate/-inf inputs fall back to a valid greedy token (never the sentinel). Greedy path unchanged.
bench-textgen.ps1 — drop --verbose-prompt; take the sanity-check sample from stdout.
CudaForwardPass.cs — SHARPI_DECODE_REGIONS=1 region profiler (env-gated, off by default), used in the diagnosis.
CudaDecodeMatvecRooflineProbe.cs — decode-shape roofline probe (explicit --filter, silent no-op without CUDA).

Review cycle (pr-review-toolkit)

Three reviewers (code-reviewer, pr-test-analyzer, silent-failure-hunter) flagged: (1) fast/slow top-p divergence → fixed by renormalizing the slow path after top-k; (2) the fast path could return the Fill(-1) sentinel on -inf logits → fixed with a greedy fallback; (3) a per-token Dictionary allocation → removed. Plus 8 new tests including a fast-vs-slow differential oracle.

Testing

SamplerTests: 23/23 green (15 original + 8 new).
Full CUDA forward-pass suite (--filter ~Cuda): 132/132 green (25m, incl. the new probe).
Generation verified coherent with recommended args.

🤖 Generated with Claude Code

…/s (#142) Recommended-args decode (--temp 1.0 --top-k 64 --top-p 0.95) on Gemma 4 E4B Q8_0 was bottlenecked by the CPU sampler, not the CUDA kernels. Diagnosis (measure-before-optimize): decode is HBM-bandwidth-bound and the big FFN matvecs already run at ~90% of the 4070 Ti's ~504 GB/s peak (see the new CudaDecodeMatvecRooflineProbe — note a single-buffer loop misleadingly reads >700 GB/s because the 35 MB weight stays resident in the 48 MB L2; rotate over >L2 of buffers to get the true cold-read rate). No low-ctx clock cliff either (GPU holds 2865 MHz / 92% boost, 98% util during sustained decode). So the core matvec kernels have little headroom — the gap to llama.cpp was elsewhere. Two CPU-side issues: 1. Sampler.Sample's recommended-args path sorted a 262144-element (int,float)[] with a delegate comparator + 2 MB alloc every token in ApplyTopP, even though ApplyTopK already left only ~64 non-zero entries. The CLI also defaults --rep-penalty 1.1, which routed through this full-vocab path. 2. bench-textgen.ps1 hard-coded --verbose-prompt, whose per-token DecodeLoop debug log does a full-vocab LINQ OrderByDescending — ~1.5-2.5 ms/token that understated measured decode by 10-25%. Removed from the bench. Fix (Sampler.cs): top-k-first fast path — select the top-(k+W) logits in one vocab pass, then run temperature/softmax/min-p/top-p over just those k (the llama.cpp ordering). Repetition penalty only demotes recent tokens, so it is folded in by over-selecting W = #distinct-recent, penalising that small set, re-sorting, and keeping the top-k (exact). Gated on TopK>0 && !logitBias; bias keeps the full-vocab path. ApplyTopP now sorts only the non-zero survivors (bit-exact). 15/15 SamplerTests green; greedy path unchanged. Adds SHARPI_DECODE_REGIONS=1 region profiler (embed+PLE / device / download split) used in the diagnosis. Result (no-verbose bench, recommended args): 63.6 -> 70.3 t/s. Greedy 71.6 t/s. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

gemini-code-assist

Code Review

This pull request introduces significant performance optimizations for decode-time sampling and profiling. Key changes include adding a fast-path "SampleTopK" method to avoid full-vocabulary sorting, optimizing "ApplyTopP" to sort only non-zero survivors, implementing a region profiler for Gemma 4 decode phases, and adding a roofline probe test for Q8_0 matvec kernels. The review feedback identifies two important issues: a bug in the roofline probe's "HalfToUshort" conversion where "BitConverter.GetBytes" implicitly converts "Half" to "float", and a potential performance bottleneck in "SampleTopK" if "TopK" is extremely large, suggesting a gate of "p.TopK <= 256" to keep the fast path allocation-free and highly efficient.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-06T15:11:41Z

+        catch { return null; }
+    }
+
+    private static ushort HalfToUshort(Half h) => BitConverter.ToUInt16(BitConverter.GetBytes(h), 0);


In HalfToUshort, using BitConverter.GetBytes(h) is problematic because BitConverter does not have an overload for Half. In C#, Half implicitly converts to float, which means BitConverter.GetBytes(h) will resolve to BitConverter.GetBytes((float)h). This returns a 4-byte array representing the single-precision float, and BitConverter.ToUInt16 will then read the incorrect bits, resulting in garbage scale values in the generated matrix.

To correctly convert a Half to its 16-bit representation without allocations, use BitConverter.HalfToUInt16Bits(h).

private static ushort HalfToUshort(Half h) => BitConverter.HalfToUInt16Bits(h);

gemini-code-assist · 2026-06-06T15:11:41Z

+        // the llama.cpp ordering (top-k → softmax(survivors) → top-p). Repetition penalty
+        // only *demotes* the recent tokens, so it is folded in by over-selecting (see
+        // SampleTopK). Logit bias can promote *arbitrary* tokens, so it keeps the full path.
+        if (p.TopK > 0 && p.TopK < vocabSize && !hasBias)


The fast path SampleTopK uses a linear insertion sort of complexity $O(V \times K)$ where $V$ is the vocabulary size (262,144 for Gemma 4) and $K$ is TopK. While this is extremely fast for small values of TopK (e.g., $\le 256$), it can become a major performance bottleneck or even cause thread hangs if TopK is set to a very large value (e.g., several thousands), as the insertion sort will perform millions of shifting operations.

Additionally, SampleTopK only uses stackalloc when sel <= 256, falling back to heap allocations otherwise. Gating the fast path to p.TopK <= 256 ensures that the fast path remains highly efficient and allocation-free.

if (p.TopK > 0 && p.TopK <= 256 && !hasBias)

…tests Addresses pr-review-toolkit findings on the #142 sampler change: - Correctness: the slow path now renormalizes after top-k (before min-p/top-p), so its nucleus cutoff is taken over the post-top-k distribution — matching the SampleTopK fast path (top-k -> renormalize -> top-p, the llama.cpp ordering). Previously the two paths computed top-p over different normalizations, so the fast path was not equivalent to the slow path for top-k + top-p. Corrected the "bit-identical" wording. - Safety: SampleTopK could return the Fill(-1) sentinel (or sample a collapsed all-zero distribution) when logits contain -inf (masked tokens). Added a degenerate guard (falls back to Greedy over the raw logits) and made the sampling loop/fallback never return a negative index. - Allocation: removed the per-token Dictionary<int,int> — over-select by PreviousTokens.Count and scan the small (<=64) window for occurrence counts, keeping the hot path allocation-free. - Tests (+8): fast-vs-slow differential equivalence (forced via out-of-range bias) across seeds, penalty demotion out of top-k, duplicate-occurrence penalty compounding, -inf safety, all--inf fallback, top-k+top-p nucleus. 15/15 original + 8 new SamplerTests green. Greedy path unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

gemini-code-assist Bot reviewed Jun 6, 2026

View reviewed changes

pekkah changed the title ~~perf(sampler): top-k-first fast path — Gemma 4 E4B decode 63.6→70.3 t/s (#142)~~ perf(sampler): top-k-first fast path — Gemma 4 E4B decode 48.7→70.3 t/s (#142) Jun 6, 2026

pekkah merged commit 45f0f86 into master Jun 6, 2026
1 check passed

pekkah deleted the perf/gemma4-cuda-decode-142 branch June 6, 2026 16:19

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(sampler): top-k-first fast path — Gemma 4 E4B decode 48.7→70.3 t/s (#142)#154

perf(sampler): top-k-first fast path — Gemma 4 E4B decode 48.7→70.3 t/s (#142)#154
pekkah merged 2 commits into
masterfrom
perf/gemma4-cuda-decode-142

pekkah commented Jun 6, 2026 •

edited

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 6, 2026

Uh oh!

gemini-code-assist Bot Jun 6, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

pekkah commented Jun 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Diagnosis (measure-before-optimize)

Root causes (CPU)

Changes

Review cycle (pr-review-toolkit)

Testing

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

pekkah commented Jun 6, 2026 •

edited

Loading