Skip to content

Prefill perf gap vs llama.cpp (~2.4x) is weight-streaming-bound, not MMQ-kernel-bound (#124 follow-up) #176

@pekkah

Description

@pekkah

Gemma 4 prefill throughput gap vs llama.cpp is weight-streaming-bound, not kernel-bound

Status: research/tracking. PR #174 closed the decode gap (54 vs llama.cpp 57 t/s) but the all-CUDA prefill stays ~1778 t/s vs llama.cpp's ~4290 (~2.4×). This issue records the conclusion of an exhaustive, ncu-grounded investigation so the dead ends aren't retried.

What was tried and is e2e-NULL (with profiling proof)

All on the bit-identical SoA-activation MMQ substrate (Q8_0/Q4_0):

  1. Coalesced load remap — SoA activation layout cut uncoalesced global sectors 55%→43% (ncu). e2e-null.
  2. Occupancy — no-prefetch + __launch_bounds__(256,4) lifted occupancy 49%→64% (3→4 blocks). e2e-null.
  3. KS-staged barriers / shared-padding / launch_bounds / single-sided coalescing — all measured null previously.
  4. cp.async pipelined global→shared (llm_mmq_{q8_0,q4_0}_soa_acts_cpa, opt-in SHARPI_ACT_SOA_CPA=1) — +10–15% at isolated FFN matmul shapes, but e2e-neutral: 3× alternating 12B A/B dead-flat (1762 vs 1761.9 t/s), prefill profiler matmul 74→73 ms.

Root cause (the key finding)

The isolated matmul probe is L2-resident → L1TEX-bound (DRAM 11.8% per ncu), which is the regime cp.async/coalescing/occupancy optimize. The real 48-layer prefill streams ~7 GB of Q4_0 weights from VRAM and is bound in a different regime, where L1TEX-side kernel optimizations give nothing. So kernel-level matmul tuning cannot close this gap.

Where to actually look (if pursued)

  • The gap is in the weight-streaming / DRAM-access regime, not the int8 MMQ inner loop. Compare llama.cpp's mul_mat_q weight access pattern, tile residency, and whether it overlaps weight streaming across layers differently.
  • Don't retry: coalescing, occupancy, KS-barriers, shared-padding, launch_bounds, cp.async on the MMQ kernels — all proven e2e-null.

The cp.async + SoA-activation kernels are kept in-tree as bit-identical, arch-guarded, opt-in groundwork. Follow-up to #124 / #173.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions