Gemma 4 prefill throughput gap vs llama.cpp is weight-streaming-bound, not kernel-bound
Status: research/tracking. PR #174 closed the decode gap (54 vs llama.cpp 57 t/s) but the all-CUDA prefill stays ~1778 t/s vs llama.cpp's ~4290 (~2.4×). This issue records the conclusion of an exhaustive, ncu-grounded investigation so the dead ends aren't retried.
What was tried and is e2e-NULL (with profiling proof)
All on the bit-identical SoA-activation MMQ substrate (Q8_0/Q4_0):
- Coalesced load remap — SoA activation layout cut uncoalesced global sectors 55%→43% (ncu). e2e-null.
- Occupancy — no-prefetch +
__launch_bounds__(256,4) lifted occupancy 49%→64% (3→4 blocks). e2e-null.
- KS-staged barriers / shared-padding / launch_bounds / single-sided coalescing — all measured null previously.
- cp.async pipelined global→shared (
llm_mmq_{q8_0,q4_0}_soa_acts_cpa, opt-in SHARPI_ACT_SOA_CPA=1) — +10–15% at isolated FFN matmul shapes, but e2e-neutral: 3× alternating 12B A/B dead-flat (1762 vs 1761.9 t/s), prefill profiler matmul 74→73 ms.
Root cause (the key finding)
The isolated matmul probe is L2-resident → L1TEX-bound (DRAM 11.8% per ncu), which is the regime cp.async/coalescing/occupancy optimize. The real 48-layer prefill streams ~7 GB of Q4_0 weights from VRAM and is bound in a different regime, where L1TEX-side kernel optimizations give nothing. So kernel-level matmul tuning cannot close this gap.
Where to actually look (if pursued)
- The gap is in the weight-streaming / DRAM-access regime, not the int8 MMQ inner loop. Compare llama.cpp's
mul_mat_q weight access pattern, tile residency, and whether it overlaps weight streaming across layers differently.
- Don't retry: coalescing, occupancy, KS-barriers, shared-padding, launch_bounds, cp.async on the MMQ kernels — all proven e2e-null.
The cp.async + SoA-activation kernels are kept in-tree as bit-identical, arch-guarded, opt-in groundwork. Follow-up to #124 / #173.
Gemma 4 prefill throughput gap vs llama.cpp is weight-streaming-bound, not kernel-bound
Status: research/tracking. PR #174 closed the decode gap (54 vs llama.cpp 57 t/s) but the all-CUDA prefill stays ~1778 t/s vs llama.cpp's ~4290 (~2.4×). This issue records the conclusion of an exhaustive, ncu-grounded investigation so the dead ends aren't retried.
What was tried and is e2e-NULL (with profiling proof)
All on the bit-identical SoA-activation MMQ substrate (Q8_0/Q4_0):
__launch_bounds__(256,4)lifted occupancy 49%→64% (3→4 blocks). e2e-null.llm_mmq_{q8_0,q4_0}_soa_acts_cpa, opt-inSHARPI_ACT_SOA_CPA=1) — +10–15% at isolated FFN matmul shapes, but e2e-neutral: 3× alternating 12B A/B dead-flat (1762 vs 1761.9 t/s), prefill profiler matmul 74→73 ms.Root cause (the key finding)
The isolated matmul probe is L2-resident → L1TEX-bound (DRAM 11.8% per ncu), which is the regime cp.async/coalescing/occupancy optimize. The real 48-layer prefill streams ~7 GB of Q4_0 weights from VRAM and is bound in a different regime, where L1TEX-side kernel optimizations give nothing. So kernel-level matmul tuning cannot close this gap.
Where to actually look (if pursued)
mul_mat_qweight access pattern, tile residency, and whether it overlaps weight streaming across layers differently.The cp.async + SoA-activation kernels are kept in-tree as bit-identical, arch-guarded, opt-in groundwork. Follow-up to #124 / #173.