Gemma 4 GPU decode is ~1.8-2.9x slower than llama.cpp (kernel efficiency + low-ctx clock cliff)

> ## ✅ RESOLVED (2026-06-14)
> Decode reached ~70.5 t/s (E4B Q8, RTX 4070 Ti); greedy decode is **71.5 t/s vs llama.cpp's 76.6 t/s, a 1.07× gap**, within the "≤1.3× and flat" acceptance. The gains came from the top-k-first penalty-aware sampler (#154, +44%), the dp4a/Q8_1 decode matvec (#143), and default-on CUDA graphs (#139). Problem 1 (the "low-ctx clock cliff", 27 t/s at depth 0) was diagnosed as a **bench artifact** caused by a full-vocab LINQ sort under `--verbose-prompt`, not a GPU stall; the real matvecs run at ~90% of peak memory bandwidth. Steady-state decode is bandwidth-bound at the ~504 GB/s ÷ 8 GB-per-token ceiling, so there is no structural lever left. Closing.
>
> ---

## Finding

Gemma 4 GPU **decode is ~1.8–2.9× slower than llama.cpp**, and unlike llama.cpp it *degrades*
at low context.

Benchmark (RTX 4070 Ti, `gemma-4-E4B-it-Q8_0`, all layers on GPU):

| Decode | llama.cpp b9529 | SharpInference | gap |
|---|---|---|---|
| near-zero ctx (tg128 @ d0) | **78.5 t/s** | ~27 t/s | ~2.9× |
| ~1K ctx (tg128 @ d1024) | **77.7 t/s** | ~42–45 t/s | ~1.8× |

llama.cpp holds **~78 t/s flat** from depth 0 → 1024. Ours is *faster at 1K (42) than at
near-zero (27)* — backwards.
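
The gap column follows directly from the throughput figures. A minimal sketch of that arithmetic, using only the numbers in the table above (the helper name `gap` is illustrative, not part of the benchmark tooling):

```python
def gap(reference_tps: float, ours_tps: float) -> float:
    """Slowdown factor: how many times slower we are than the reference."""
    return reference_tps / ours_tps

# Throughput figures (tokens/second) taken from the benchmark table.
print(f"d0:    {gap(78.5, 27.0):.2f}x")
print(f"d1024: {gap(77.7, 45.0):.2f}x to {gap(77.7, 42.0):.2f}x")
```

The 1K row is a range because our measured throughput there varied between 42 and 45 t/s.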

## Two distinct problems

1. **Low-context under-utilization / clock cliff.** Our near-zero-ctx decode (27) is slower
   than our 1K decode (42) because the GPU never reaches boost clock — the per-token work is a
   long chain of tiny, serialized kernels with idle gaps, so power draw stays low. llama.cpp
   keeps the GPU busy enough to boost even at depth 0 (78 flat). This is a *symptom of an
   inefficient/launch-gappy decode*, not an inherent GPU limit. (The merged CUDA-graph path
   targets exactly these launch gaps but the per-token `SetParams` overhead makes it −9.7% at
   short ctx; the device-side-`position` follow-up #140 is the only graph variant that might
   actually help here — but it's secondary to the kernel work below.)

2. **Steady-state decode is ~1.8× behind even when boosted (1K: 42–45 vs 78).** Likely the
   attention and matvec kernels: ours uses a per-head scalar attention kernel + shared-memory
   scores; llama.cpp uses fused flash-attention and tuned dequant-dot matvecs. This is the
   structural gap.
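
To illustrate what "fused" means in problem 2, here is a CPU reference sketch in Python, not the project's kernel and not llama.cpp's. It compares a three-pass attention (scores, softmax, weighted sum) with a single-pass version that keeps a running max and running sum, which is the online-softmax idea behind flash-attention-style kernels. The function names and the `1/sqrt(d)` scaling are assumptions for illustration; the model's actual scaling may differ.

```python
import math

def attend_naive(q, keys, values):
    """Three passes over the cache: scores, softmax, weighted sum of values."""
    scale = 1.0 / math.sqrt(len(q))  # assumed standard scaling
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    out = [0.0] * len(values[0])
    for w, v in zip(weights, values):
        for i, x in enumerate(v):
            out[i] += (w / total) * x
    return out

def attend_fused(q, keys, values):
    """One pass: running max and running sum, no stored score vector."""
    scale = 1.0 / math.sqrt(len(q))  # assumed standard scaling
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for k, v in zip(keys, values):
        s = scale * sum(a * b for a, b in zip(q, k))
        new_max = max(running_max, s)
        # Rescale everything accumulated so far to the new max.
        correction = math.exp(running_max - new_max)
        w = math.exp(s - new_max)
        running_sum = running_sum * correction + w
        acc = [a * correction + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]
```

Both functions return the same result up to floating-point rounding. The single-pass form never materializes the full score vector, which is what lets a GPU kernel handle a head in one launch instead of separate score, softmax, and AV launches.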

## Direction

- Profile a single decode token vs llama.cpp (kernel-level) to attribute the 1.8× — attention
  vs QKV/FFN matvec vs launch overhead.
- Likely wins: a fused flash-attention-style kernel (one launch per layer instead of score + softmax +
  AV), and tighter Q8_0 dequant-dot matvecs. These also keep the GPU busier, which should fix
  problem (1)'s clock cliff as a side effect.
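
As a rough illustration of the "dequant-dot" idea, the Python sketch below quantizes two vectors into Q8_0-style blocks (32 int8 values sharing one scale) and compares two dot products: one that expands every element to float first, and one that accumulates in integers per block and applies the scales once per block. The groups of four in the inner loop mirror what a single `dp4a` instruction computes on the GPU. This is a CPU reference under simplifying assumptions (scales kept as Python floats rather than fp16, hypothetical function names), not the project's kernel.

```python
BLOCK = 32  # Q8_0 groups 32 int8 values under one scale

def quantize_q8(xs):
    """Quantize floats into (scale, int8 list) blocks.

    len(xs) must be a multiple of BLOCK.
    """
    blocks = []
    for i in range(0, len(xs), BLOCK):
        chunk = xs[i:i + BLOCK]
        amax = max(abs(x) for x in chunk)
        scale = amax / 127.0 if amax > 0 else 1.0
        quants = [max(-127, min(127, round(x / scale))) for x in chunk]
        blocks.append((scale, quants))
    return blocks

def dot_dequant_first(wb, xb):
    """Baseline: expand each value to float, then multiply-accumulate."""
    total = 0.0
    for (ws, wq), (xs, xq) in zip(wb, xb):
        for a, b in zip(wq, xq):
            total += (ws * a) * (xs * b)
    return total

def dot_integer_blocks(wb, xb):
    """Integer accumulate inside each block, one scale multiply per block."""
    total = 0.0
    for (ws, wq), (xs, xq) in zip(wb, xb):
        isum = 0
        for j in range(0, BLOCK, 4):  # each group of 4 mirrors one dp4a
            isum += (wq[j] * xq[j] + wq[j + 1] * xq[j + 1]
                     + wq[j + 2] * xq[j + 2] + wq[j + 3] * xq[j + 3])
        total += ws * xs * isum
    return total
```

The two functions agree up to floating-point rounding. The integer form does far fewer float operations per block, which is the property a tuned matvec kernel exploits.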

## Acceptance

- Decode within ~1.3× of llama.cpp and **flat across context** (no low-ctx cliff).
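
One way the criterion could be checked mechanically is sketched below in Python. The issue does not define "flat" numerically, so `flat_tolerance` (maximum relative spread of our throughput across depths) is a hypothetical parameter, as is the function name.

```python
def meets_acceptance(ours, reference, max_gap=1.3, flat_tolerance=0.05):
    """Check the gap and flatness criteria.

    ours / reference: dicts mapping context depth -> tokens/second.
    flat_tolerance is an assumed threshold, not defined by the issue.
    """
    gaps_ok = all(reference[d] / ours[d] <= max_gap for d in ours)
    fastest = max(ours.values())
    spread = (fastest - min(ours.values())) / fastest
    return gaps_ok and spread <= flat_tolerance

# Figures from the benchmark table at the time of filing: fails both checks.
print(meets_acceptance({0: 27.0, 1024: 42.0}, {0: 78.5, 1024: 77.7}))
```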

## Scope note

Per "don't optimize paths that won't reach competitive speed": this issue is the structural
decode work. The CUDA-graph micro-opt (#140) is deprioritized behind it — it can't close a
1.8× gap on its own.

