Skip to content

perf(sampler): top-k-first fast path — Gemma 4 E4B decode 48.7→70.3 t/s (#142)#154

Merged
pekkah merged 2 commits into
masterfrom
perf/gemma4-cuda-decode-142
Jun 6, 2026
Merged

perf(sampler): top-k-first fast path — Gemma 4 E4B decode 48.7→70.3 t/s (#142)#154
pekkah merged 2 commits into
masterfrom
perf/gemma4-cuda-decode-142

Conversation

@pekkah

@pekkah pekkah commented Jun 6, 2026

Copy link
Copy Markdown
Owner

Summary

Recommended-args decode on Gemma 4 E4B Q8_0 (RTX 4070 Ti) was bottlenecked by the CPU sampler, not the CUDA kernels. With --temp 1.0 --top-k 64 --top-p 0.95 decode goes 48.7 → 70.3 t/s (+44%) (goal ≥70 met). Greedy decode is 71.5 t/s (the sampler is now near-free).

Diagnosis (measure-before-optimize)

  • Decode is HBM-bandwidth-bound. New CudaDecodeMatvecRooflineProbe shows the big FFN matvecs (49% of decode) already run at 88–91% of the 4070 Ti's ~504 GB/s peak — near-optimal, no kernel headroom. (Caveat baked into the probe: a single looped 35 MB weight reads >700 GB/s because it stays resident in the 48 MB L2; the probe rotates over >L2 of buffers to force the true cold-read rate.)
  • No low-context clock cliff — the GPU holds 2865 MHz (92% of the 3120 boost), 97–99% util, 180/285 W, 54 °C during sustained decode.
  • SHARPI_DECODE_REGIONS=1 region profiler: the 42-layer device region is 95% of forward time; PLE prepass 0.5 ms and logits download 0.1 ms are negligible.

So the gap to llama.cpp (76.6 t/s greedy) was not in the core matvec kernels.

Root causes (CPU)

  1. Sampler.Sample recommended-args path ran ApplyTopP as an Array.Sort over a 262144-element (int,float)[] with a delegate comparator + 2 MB allocation every token, even though ApplyTopK had already left only ~64 non-zero entries, plus a full-vocab softmax/normalize/sample. The CLI defaults --rep-penalty 1.1, which routed every token through this path.
  2. bench-textgen.ps1 hard-coded --verbose-prompt, whose per-token DecodeLoop debug log does a full-vocab LINQ OrderByDescending — ~1.5–2.5 ms/token that understated measured decode by 10–25%. Removed from the bench.

Changes

  • Sampler.cs — top-k-first fast path: select the top-(k+W) logits in one vocab pass, then run temperature / softmax / min-p / top-p over just those k (the llama.cpp ordering: top-k → renormalize → top-p). Repetition penalty only demotes recent tokens, so it is folded in by over-selecting W = PreviousTokens.Count, penalising that small set, re-sorting, and keeping the top-k — exact, no full-vocab work and no per-token allocation. The slow path (reached via logit bias / TopK=0) now also renormalizes after top-k so both paths agree. Degenerate/-inf inputs fall back to a valid greedy token (never the sentinel). Greedy path unchanged.
  • bench-textgen.ps1 — drop --verbose-prompt; take the sanity-check sample from stdout.
  • CudaForwardPass.csSHARPI_DECODE_REGIONS=1 region profiler (env-gated, off by default), used in the diagnosis.
  • CudaDecodeMatvecRooflineProbe.cs — decode-shape roofline probe (explicit --filter, silent no-op without CUDA).

Review cycle (pr-review-toolkit)

Three reviewers (code-reviewer, pr-test-analyzer, silent-failure-hunter) flagged: (1) fast/slow top-p divergence → fixed by renormalizing the slow path after top-k; (2) the fast path could return the Fill(-1) sentinel on -inf logits → fixed with a greedy fallback; (3) a per-token Dictionary allocation → removed. Plus 8 new tests including a fast-vs-slow differential oracle.

Testing

  • SamplerTests: 23/23 green (15 original + 8 new).
  • Full CUDA forward-pass suite (--filter ~Cuda): 132/132 green (25m, incl. the new probe).
  • Generation verified coherent with recommended args.

🤖 Generated with Claude Code

…/s (#142)

Recommended-args decode (--temp 1.0 --top-k 64 --top-p 0.95) on Gemma 4 E4B
Q8_0 was bottlenecked by the CPU sampler, not the CUDA kernels.

Diagnosis (measure-before-optimize): decode is HBM-bandwidth-bound and the big
FFN matvecs already run at ~90% of the 4070 Ti's ~504 GB/s peak (see the new
CudaDecodeMatvecRooflineProbe — note a single-buffer loop misleadingly reads
>700 GB/s because the 35 MB weight stays resident in the 48 MB L2; rotate over
>L2 of buffers to get the true cold-read rate). No low-ctx clock cliff either
(GPU holds 2865 MHz / 92% boost, 98% util during sustained decode). So the core
matvec kernels have little headroom — the gap to llama.cpp was elsewhere.

Two CPU-side issues:
1. Sampler.Sample's recommended-args path sorted a 262144-element (int,float)[]
   with a delegate comparator + 2 MB alloc every token in ApplyTopP, even though
   ApplyTopK already left only ~64 non-zero entries. The CLI also defaults
   --rep-penalty 1.1, which routed through this full-vocab path.
2. bench-textgen.ps1 hard-coded --verbose-prompt, whose per-token DecodeLoop
   debug log does a full-vocab LINQ OrderByDescending — ~1.5-2.5 ms/token that
   understated measured decode by 10-25%. Removed from the bench.

Fix (Sampler.cs): top-k-first fast path — select the top-(k+W) logits in one
vocab pass, then run temperature/softmax/min-p/top-p over just those k (the
llama.cpp ordering). Repetition penalty only demotes recent tokens, so it is
folded in by over-selecting W = #distinct-recent, penalising that small set,
re-sorting, and keeping the top-k (exact). Gated on TopK>0 && !logitBias; bias
keeps the full-vocab path. ApplyTopP now sorts only the non-zero survivors
(bit-exact). 15/15 SamplerTests green; greedy path unchanged.

Adds SHARPI_DECODE_REGIONS=1 region profiler (embed+PLE / device / download
split) used in the diagnosis.

Result (no-verbose bench, recommended args): 63.6 -> 70.3 t/s. Greedy 71.6 t/s.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces significant performance optimizations for decode-time sampling and profiling. Key changes include adding a fast-path "SampleTopK" method to avoid full-vocabulary sorting, optimizing "ApplyTopP" to sort only non-zero survivors, implementing a region profiler for Gemma 4 decode phases, and adding a roofline probe test for Q8_0 matvec kernels. The review feedback identifies two important issues: a bug in the roofline probe's "HalfToUshort" conversion where "BitConverter.GetBytes" implicitly converts "Half" to "float", and a potential performance bottleneck in "SampleTopK" if "TopK" is extremely large, suggesting a gate of "p.TopK <= 256" to keep the fast path allocation-free and highly efficient.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

catch { return null; }
}

private static ushort HalfToUshort(Half h) => BitConverter.ToUInt16(BitConverter.GetBytes(h), 0);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

In HalfToUshort, using BitConverter.GetBytes(h) is problematic because BitConverter does not have an overload for Half. In C#, Half implicitly converts to float, which means BitConverter.GetBytes(h) will resolve to BitConverter.GetBytes((float)h). This returns a 4-byte array representing the single-precision float, and BitConverter.ToUInt16 will then read the incorrect bits, resulting in garbage scale values in the generated matrix.

To correctly convert a Half to its 16-bit representation without allocations, use BitConverter.HalfToUInt16Bits(h).

    private static ushort HalfToUshort(Half h) => BitConverter.HalfToUInt16Bits(h);

// the llama.cpp ordering (top-k → softmax(survivors) → top-p). Repetition penalty
// only *demotes* the recent tokens, so it is folded in by over-selecting (see
// SampleTopK). Logit bias can promote *arbitrary* tokens, so it keeps the full path.
if (p.TopK > 0 && p.TopK < vocabSize && !hasBias)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The fast path SampleTopK uses a linear insertion sort of complexity $O(V \times K)$ where $V$ is the vocabulary size (262,144 for Gemma 4) and $K$ is TopK. While this is extremely fast for small values of TopK (e.g., $\le 256$), it can become a major performance bottleneck or even cause thread hangs if TopK is set to a very large value (e.g., several thousands), as the insertion sort will perform millions of shifting operations.

Additionally, SampleTopK only uses stackalloc when sel <= 256, falling back to heap allocations otherwise. Gating the fast path to p.TopK <= 256 ensures that the fast path remains highly efficient and allocation-free.

        if (p.TopK > 0 && p.TopK <= 256 && !hasBias)

…tests

Addresses pr-review-toolkit findings on the #142 sampler change:

- Correctness: the slow path now renormalizes after top-k (before min-p/top-p),
  so its nucleus cutoff is taken over the post-top-k distribution — matching the
  SampleTopK fast path (top-k -> renormalize -> top-p, the llama.cpp ordering).
  Previously the two paths computed top-p over different normalizations, so the
  fast path was not equivalent to the slow path for top-k + top-p. Corrected the
  "bit-identical" wording.
- Safety: SampleTopK could return the Fill(-1) sentinel (or sample a collapsed
  all-zero distribution) when logits contain -inf (masked tokens). Added a
  degenerate guard (falls back to Greedy over the raw logits) and made the
  sampling loop/fallback never return a negative index.
- Allocation: removed the per-token Dictionary<int,int> — over-select by
  PreviousTokens.Count and scan the small (<=64) window for occurrence counts,
  keeping the hot path allocation-free.
- Tests (+8): fast-vs-slow differential equivalence (forced via out-of-range
  bias) across seeds, penalty demotion out of top-k, duplicate-occurrence
  penalty compounding, -inf safety, all--inf fallback, top-k+top-p nucleus.

15/15 original + 8 new SamplerTests green. Greedy path unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@pekkah pekkah changed the title perf(sampler): top-k-first fast path — Gemma 4 E4B decode 63.6→70.3 t/s (#142) perf(sampler): top-k-first fast path — Gemma 4 E4B decode 48.7→70.3 t/s (#142) Jun 6, 2026
@pekkah pekkah merged commit 45f0f86 into master Jun 6, 2026
1 check passed
@pekkah pekkah deleted the perf/gemma4-cuda-decode-142 branch June 6, 2026 16:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant