Decode reached ~70.5 t/s (E4B Q8, RTX 4070 Ti). Greedy decode runs at 71.5 t/s vs llama.cpp's 76.6 t/s, a 1.07× gap, which is within the "≤1.3× and flat" acceptance. The gains came from the top-k-first penalty-aware sampler (#154, +44%), the dp4a/Q8_1 decode matvec (#143), and default-on CUDA graphs (#139). Problem 1 (the "low-ctx clock cliff", 27 t/s at depth 0) was diagnosed as a bench artifact, a full-vocab LINQ sort triggered by --verbose-prompt, not a GPU stall; real matvecs run at ~90% of peak memory bandwidth. Steady-state decode is bandwidth-bound at the ~504 GB/s ÷ 8 GB-per-token ceiling, so there is no structural lever left. Closing.
Finding
Gemma 4 GPU decode is ~1.8–2.9× slower than llama.cpp, and unlike llama.cpp it degrades
at low context.
Benchmark (RTX 4070 Ti, gemma-4-E4B-it-Q8_0, all layers on GPU):
| Decode | llama.cpp b9529 | SharpInference | gap |
|---|---|---|---|
| near-zero ctx (tg128 @ d0) | 78.5 t/s | ~27 t/s | ~2.9× |
| ~1K ctx (tg128 @ d1024) | 77.7 t/s | ~42–45 t/s | ~1.8× |
llama.cpp holds ~78 t/s flat from depth 0 → 1024. Ours is faster at 1K (42) than at
near-zero (27) — backwards.
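The gap column can be rechecked from the throughput figures alone. The snippet below is a small sketch that uses only the numbers in the benchmark above; the helper names `gap` and `ms_per_token` are illustrative and not part of either project.

```python
# Throughput figures copied from the benchmark above (tokens per second).
LLAMA_CPP_TPS = {"d0": 78.5, "d1024": 77.7}
# (low, high) ends of the reported SharpInference range at each depth.
SHARP_TPS = {"d0": (27.0, 27.0), "d1024": (42.0, 45.0)}


def gap(reference_tps, ours_tps):
    """How many times slower ours_tps is than reference_tps."""
    return reference_tps / ours_tps


def ms_per_token(tps):
    """Wall-clock time spent on one decoded token, in milliseconds."""
    return 1000.0 / tps


for depth in ("d0", "d1024"):
    low, high = SHARP_TPS[depth]
    ref = LLAMA_CPP_TPS[depth]
    print(
        f"{depth}: gap {gap(ref, high):.2f}x to {gap(ref, low):.2f}x, "
        f"{ms_per_token(low):.1f} ms/token vs {ms_per_token(ref):.1f} ms/token"
    )
```

Reading the result as time per token rather than tokens per second makes the depth-0 anomaly easier to see: the slower configuration spends more milliseconds on a token that needs less attention work.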
Two distinct problems
1. Low-context under-utilization / clock cliff. Our near-zero-ctx decode (27) is slower
than our 1K decode (42) because the GPU never reaches boost clock: the per-token work is a
long chain of tiny, serialized kernels with idle gaps, so power draw stays low. llama.cpp
keeps the GPU busy enough to boost even at depth 0 (78 flat). This is a symptom of an
inefficient/launch-gappy decode, not an inherent GPU limit. (The merged CUDA-graph path
targets exactly these launch gaps, but the per-token SetParams overhead makes it −9.7% at
short ctx. The device-side-position follow-up, #140 "CUDA Graph decode: device-side position
to remove short-context regression" (itself a follow-up to #136), is the only graph variant
that might actually help here, but it is secondary to the kernel work below.)
2. Steady-state decode is ~1.8× behind even when boosted (1K: 42–45 vs 78). The likely
causes are the attention and matvec kernels: ours uses a per-head scalar attention kernel +
shared-memory scores; llama.cpp uses fused flash-attention and tuned dequant-dot matvecs.
This is the structural gap.
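To make "dequant-dot matvec" concrete, here is a minimal pure-Python sketch of one row of such a matvec. It assumes the ggml-style Q8_0 layout (blocks of 32 int8 values sharing one scale) and quantizes the activations the same way, so the inner loop is an integer dot product with a single float multiply per block. The `dp4a` function is a stand-in for CUDA's 4-way int8 dot-product instruction, the scale is a Python float rather than fp16, and none of this is the actual SharpInference or llama.cpp kernel.

```python
import math

BLOCK = 32  # values per quantization block (assumed Q8_0-style layout)


def quantize_q8(values):
    """Split values into blocks of 32 and store each as (scale, int8 list)."""
    assert len(values) % BLOCK == 0, "length must be a multiple of the block size"
    blocks = []
    for start in range(0, len(values), BLOCK):
        chunk = values[start:start + BLOCK]
        amax = max(abs(v) for v in chunk)
        scale = amax / 127.0 if amax > 0 else 1.0
        quants = [max(-127, min(127, round(v / scale))) for v in chunk]
        blocks.append((scale, quants))
    return blocks


def dequantize(blocks):
    """Expand blocks back to floats (the slow path a tuned kernel avoids)."""
    return [scale * q for scale, quants in blocks for q in quants]


def dp4a(a4, b4):
    """Stand-in for a 4-way int8 dot product with integer accumulation."""
    return sum(a * b for a, b in zip(a4, b4))


def dot_q8(weight_blocks, activation_blocks):
    """Dot product on quantized data: integer math inside a block,
    one float multiply by the two scales per block."""
    total = 0.0
    for (w_scale, w_q), (x_scale, x_q) in zip(weight_blocks, activation_blocks):
        acc = 0
        for j in range(0, BLOCK, 4):
            acc += dp4a(w_q[j:j + 4], x_q[j:j + 4])
        total += w_scale * x_scale * acc
    return total


weights = [math.sin(0.37 * i) for i in range(64)]
activations = [math.cos(0.11 * i) for i in range(64)]
w_blocks = quantize_q8(weights)
x_blocks = quantize_q8(activations)
print(dot_q8(w_blocks, x_blocks))
```

The integer path gives the same result as dequantizing both vectors and multiplying in floating point, up to rounding, while touching one byte per weight instead of a widened float.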
Direction
Profile a single decode token vs llama.cpp (kernel-level) to attribute the 1.8× — attention
vs QKV/FFN matvec vs launch overhead.
Likely wins: a fused flash-attention-style kernel (one launch per layer instead of separate
score, softmax, and AV launches), and tighter Q8_0 dequant-dot matvecs. These also keep the
GPU busier, which should fix problem (1)'s clock cliff as a side effect.
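The fusion idea can be illustrated without any GPU code. The sketch below, for a single query vector as in decode, compares the three-pass form (scores, softmax, then AV) with a single pass that keeps a running maximum and rescales its accumulator, which is the online-softmax trick flash-attention-style kernels rely on. Function names and test data are illustrative; this is a numerical sketch under those assumptions, not the planned kernel.

```python
import math


def attention_three_pass(query, keys, values):
    """Reference form: materialize all scores, softmax them, then weight V."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    denom = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / denom for d in range(dim)]


def attention_fused(query, keys, values):
    """Single pass over K/V: no score buffer, running max plus rescaling."""
    scale = 1.0 / math.sqrt(len(query))
    running_max = -math.inf
    denom = 0.0
    acc = [0.0] * len(values[0])
    for key, value in zip(keys, values):
        score = scale * sum(q * k for q, k in zip(query, key))
        new_max = max(running_max, score)
        # Rescale what was accumulated under the old maximum.
        # On the first iteration this is exp(-inf) == 0.0.
        correction = math.exp(running_max - new_max)
        p = math.exp(score - new_max)
        denom = denom * correction + p
        acc = [a * correction + p * x for a, x in zip(acc, value)]
        running_max = new_max
    return [a / denom for a in acc]


query = [0.5, -1.0, 0.25, 2.0]
keys = [[math.sin(i + j) for j in range(4)] for i in range(6)]
values = [[math.cos(0.5 * i + j) for j in range(3)] for i in range(6)]
print(attention_three_pass(query, keys, values))
print(attention_fused(query, keys, values))
```

Because the fused loop never stores the full score vector, a GPU version can handle a head in one launch and keep its working set in registers or shared memory.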
Acceptance
Decode within ~1.3× of llama.cpp and flat across context (no low-ctx cliff).
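One way to turn this criterion into a check is sketched below. The 1.3× bound comes from the acceptance line; the 10% tolerance used to define "flat" is an assumption made for illustration, since the issue does not state one, and `meets_acceptance` is a hypothetical helper.

```python
def meets_acceptance(reference_tps, ours_tps, max_gap=1.3, flat_tolerance=0.10):
    """Check both halves of the acceptance criterion.

    reference_tps / ours_tps map a context depth to tokens per second.
    flat_tolerance is an assumed threshold: the slowest depth may be at
    most this fraction below the fastest depth.
    """
    within_gap = all(reference_tps[d] / ours_tps[d] <= max_gap for d in ours_tps)
    fastest = max(ours_tps.values())
    slowest = min(ours_tps.values())
    flat = (fastest - slowest) / fastest <= flat_tolerance
    return within_gap and flat


# Figures from the original benchmark: fails on both the gap and flatness.
print(meets_acceptance({0: 78.5, 1024: 77.7}, {0: 27.0, 1024: 42.0}))

# Figures from the closing comment (greedy 71.5 vs 76.6 t/s), treated here
# as a single measurement because their depth is not stated.
print(meets_acceptance({0: 76.6}, {0: 71.5}))
```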
Scope note
Per "don't optimize paths that won't reach competitive speed": this issue is the structural
decode work. The CUDA-graph micro-opt (#140) is deprioritized behind it — it can't close a
1.8× gap on its own.