Prefill perf gap vs llama.cpp (~2.4x) is weight-streaming-bound, not MMQ-kernel-bound (#124 follow-up)

## Gemma 4 prefill throughput gap vs llama.cpp is weight-streaming-bound, not kernel-bound

**Status: research/tracking.** PR #174 closed the decode gap (54 vs llama.cpp 57 t/s) but the all-CUDA prefill stays ~1778 t/s vs llama.cpp's ~4290 (~2.4×). This issue records the **conclusion of an exhaustive, ncu-grounded investigation** so the dead ends aren't retried.

### What was tried and is e2e-NULL (with profiling proof)
All on the bit-identical SoA-activation MMQ substrate (Q8_0/Q4_0):
1. **Coalesced load remap** — SoA activation layout cut uncoalesced global sectors 55%→43% (ncu). e2e-null.
2. **Occupancy** — no-prefetch + `__launch_bounds__(256,4)` lifted occupancy 49%→64% (3→4 blocks). e2e-null.
3. **KS-staged barriers / shared-padding / launch_bounds / single-sided coalescing** — all measured null previously.
4. **cp.async pipelined global→shared** (`llm_mmq_{q8_0,q4_0}_soa_acts_cpa`, opt-in `SHARPI_ACT_SOA_CPA=1`) — **+10–15% at isolated FFN matmul shapes**, but **e2e-neutral**: 3× alternating 12B A/B dead-flat (1762 vs 1761.9 t/s), prefill profiler matmul **74→73 ms**.

### Root cause (the key finding)
The isolated matmul probe is **L2-resident → L1TEX-bound** (DRAM 11.8% per ncu), which is the regime cp.async/coalescing/occupancy optimize. The real 48-layer prefill **streams ~7 GB of Q4_0 weights from VRAM** and is bound in a different regime, where L1TEX-side kernel optimizations give nothing. So **kernel-level matmul tuning cannot close this gap.**

### Where to actually look (if pursued)
- The gap is in the weight-streaming / DRAM-access regime, not the int8 MMQ inner loop. Compare llama.cpp's `mul_mat_q` weight access pattern, tile residency, and whether it overlaps weight streaming across layers differently.
- Don't retry: coalescing, occupancy, KS-barriers, shared-padding, launch_bounds, cp.async on the MMQ kernels — all proven e2e-null.

The cp.async + SoA-activation kernels are kept in-tree as bit-identical, arch-guarded, opt-in groundwork. Follow-up to #124 / #173.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Prefill perf gap vs llama.cpp (~2.4x) is weight-streaming-bound, not MMQ-kernel-bound (#124 follow-up) #176

Gemma 4 prefill throughput gap vs llama.cpp is weight-streaming-bound, not kernel-bound

What was tried and is e2e-NULL (with profiling proof)

Root cause (the key finding)

Where to actually look (if pursued)

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Prefill perf gap vs llama.cpp (~2.4x) is weight-streaming-bound, not MMQ-kernel-bound (#124 follow-up) #176

Description

Gemma 4 prefill throughput gap vs llama.cpp is weight-streaming-bound, not kernel-bound

What was tried and is e2e-NULL (with profiling proof)

Root cause (the key finding)

Where to actually look (if pursued)

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions