Gemma 4 GPU decode is ~1.8-2.9x slower than llama.cpp (kernel efficiency + low-ctx clock cliff) #142

@pekkah

✅ RESOLVED (2026-06-14)

Decode reached ~70.5 t/s (E4B Q8, RTX 4070 Ti). Greedy decode is 71.5 t/s vs llama.cpp's 76.6 t/s, a 1.07× gap, which is within the "≤1.3× and flat" acceptance. The gains came from the top-k-first penalty-aware sampler (#154, +44%), the dp4a/Q8_1 decode matvec (#143), and default-on CUDA graphs (#139). Problem 1 (the "low-ctx clock cliff", 27 t/s at depth 0) was diagnosed as a bench artifact of the --verbose-prompt full-vocab LINQ sort, not a GPU stall: the real matvecs run at ~90% of peak memory bandwidth. Steady-state decode is bandwidth-bound at the ~504 GB/s ÷ 8 GB-per-token ceiling, so there is no structural lever left. Closing.


Finding

Gemma 4 GPU decode is ~1.8–2.9× slower than llama.cpp, and unlike llama.cpp it degrades
at low context.

Benchmark (RTX 4070 Ti, gemma-4-E4B-it-Q8_0, all layers on GPU):

| Decode | llama.cpp b9529 | SharpInference | gap |
|---|---|---|---|
| near-zero ctx (tg128 @ d0) | 78.5 t/s | ~27 t/s | ~2.9× |
| ~1K ctx (tg128 @ d1024) | 77.7 t/s | ~42–45 t/s | ~1.8× |

llama.cpp holds ~78 t/s flat from depth 0 → 1024. Ours is faster at 1K (42) than at
near-zero (27), which is backwards.
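
The gap column is the ratio of the two decode rates. A minimal Python sketch that recomputes it from the benchmark numbers above; the variable and function names are illustrative only, and the SharpInference 1K entry carries both ends of the reported 42–45 range:

```python
# Decode throughput in tokens/s, copied from the benchmark above.
llama_cpp = {"d0": 78.5, "d1024": 77.7}
# (low, high) of each reported range for SharpInference.
sharp_inference = {"d0": (27.0, 27.0), "d1024": (42.0, 45.0)}

def gap_range(reference, ours):
    """Return (best, worst) slowdown factor of `ours` relative to `reference`."""
    low, high = ours
    return reference / high, reference / low

gaps = {depth: gap_range(llama_cpp[depth], sharp_inference[depth])
        for depth in llama_cpp}

for depth, (best, worst) in gaps.items():
    print(f"{depth}: {best:.2f}x to {worst:.2f}x slower")
```

The d1024 range brackets the rounded ~1.8× figure, and the d0 ratio rounds to ~2.9×.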

Two distinct problems

  1. Low-context under-utilization / clock cliff. Our near-zero-ctx decode (27) is slower
    than our 1K decode (42) because the GPU never reaches boost clock: the per-token work is a
    long chain of tiny, serialized kernels with idle gaps, so power draw stays low. llama.cpp
    keeps the GPU busy enough to boost even at depth 0 (78 flat). This is a symptom of an
    inefficient, launch-gappy decode, not an inherent GPU limit. (The merged CUDA-graph path
    targets exactly these launch gaps, but the per-token SetParams overhead makes it a −9.7%
    regression at short ctx; the device-side-position follow-up (#140) is the only graph variant
    that might actually help here, and it is secondary to the kernel work below.)

  2. Steady-state decode is ~1.8× behind even when boosted (1K: 42–45 vs 78). Likely the
    attention and matvec kernels: ours uses a per-head scalar attention kernel + shared-memory
    scores; llama.cpp uses fused flash-attention and tuned dequant-dot matvecs. This is the
    structural gap.
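
To illustrate the difference described in problem 2, here is a small pure-Python sketch of decode-time attention for one head, written two ways: a three-pass version (scores, softmax, then weighted sum of values) and a single-pass version using the online-softmax recurrence that flash-attention-style kernels are built on. This is not code from SharpInference or llama.cpp. The function names, the list-based layout, and the plain 1/sqrt(d) scaling are assumptions for illustration; a real kernel would tile over keys and run on the GPU.

```python
import math

def attention_three_pass(q, keys, values):
    """Reference: materialize all scores, then softmax, then weighted sum."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]

def attention_single_pass(q, keys, values):
    """Online softmax: one sweep over the keys, no stored score buffer."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for k, v in zip(keys, values):
        s = scale * sum(qi * ki for qi, ki in zip(q, k))
        new_max = max(running_max, s)
        # Rescale everything accumulated under the old maximum.
        correction = math.exp(running_max - new_max)
        p = math.exp(s - new_max)
        running_sum = running_sum * correction + p
        acc = [a * correction + p * vd for a, vd in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]
```

Both functions return the same vector up to floating-point rounding. The single-pass form is what lets a kernel fuse the score, softmax, and AV steps into one launch.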

Direction

  • Profile a single decode token vs llama.cpp (kernel-level) to attribute the 1.8× gap: attention
    vs QKV/FFN matvec vs launch overhead.
  • Likely wins: a fused flash-attention-style kernel (one launch per layer instead of score +
    softmax + AV), and tighter Q8_0 dequant-dot matvecs. These also keep the GPU busier, which
    should fix problem (1)'s clock cliff as a side effect.
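
As a rough illustration of what a "dequant-dot" matvec computes, the sketch below quantizes a row of weights and an activation vector into 32-element int8 blocks with one scale per block (the general shape of ggml's Q8_0 format), then forms the dot product from integer block sums multiplied by the two scales. It is a CPU-side Python model, not kernel code. The helper names are hypothetical, and scales are kept as Python floats rather than fp16. A GPU kernel would typically do the inner integer sum with packed int8 instructions such as dp4a.

```python
BLOCK = 32  # elements per quantization block

def quantize_q8(xs):
    """Split xs into blocks of BLOCK values; return [(scale, [int8, ...]), ...]."""
    assert len(xs) % BLOCK == 0
    blocks = []
    for start in range(0, len(xs), BLOCK):
        chunk = xs[start:start + BLOCK]
        amax = max(abs(x) for x in chunk)
        scale = amax / 127.0 if amax > 0 else 1.0
        quants = [max(-127, min(127, round(x / scale))) for x in chunk]
        blocks.append((scale, quants))
    return blocks

def dot_q8(weight_blocks, act_blocks):
    """Dot product from integer block sums, scaled once per block."""
    total = 0.0
    for (w_scale, w_q), (a_scale, a_q) in zip(weight_blocks, act_blocks):
        int_sum = sum(w * a for w, a in zip(w_q, a_q))  # pure integer work
        total += w_scale * a_scale * int_sum
    return total
```

The point of the design is that the floating-point multiply happens once per block rather than once per element, so most of the arithmetic stays in cheap integer operations.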

Acceptance

  • Decode within ~1.3× of llama.cpp and flat across context (no low-ctx cliff).
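
The acceptance criterion above can be written as a small predicate. A sketch in Python, assuming "flat" means our slowest depth is within some tolerance of our fastest; the 5% tolerance and the function name are assumptions, not part of this issue:

```python
def meets_acceptance(ours, reference, max_gap=1.3, flat_tolerance=0.05):
    """ours / reference map a context depth to decode tokens/s.

    Passes when every depth is within max_gap of the reference and our own
    throughput varies by no more than flat_tolerance across depths.
    """
    within_gap = all(reference[d] / ours[d] <= max_gap for d in ours)
    fastest, slowest = max(ours.values()), min(ours.values())
    is_flat = (fastest - slowest) / fastest <= flat_tolerance
    return within_gap and is_flat

# Numbers from the original benchmark in this issue: fails on both counts.
before = meets_acceptance({0: 27.0, 1024: 42.0}, {0: 78.5, 1024: 77.7})
print(before)
```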

Scope note

Per "don't optimize paths that won't reach competitive speed": this issue is the structural
decode work. The CUDA-graph micro-opt (#140) is deprioritized behind it — it can't close a
1.8× gap on its own.
