perf(cuda): MMQ tiling rewrite — the remaining Gemma 4 prefill lever (split-K & cp.async ruled out) (#141/#149 follow-up)

Tracks the remaining prefill lever for the Gemma 4 "match llama.cpp" goal (#141), and consolidates what's been **ruled out** so it isn't re-attempted.

### Current state (branch `perf/gemma4-cuda-flash-tc-146`, PR #148)
Prefill is **~4240 t/s = ~50% of llama.cpp's ~8475** after TC flash (#146/#147) + SoA Q8_0 (#149). Profiling shows attention is now ~29% of prefill and the **matmul/FFN GEMMs (~53%) dominate**. The int8 MMQ runs at **~40-43% of the 4070 Ti's int8 TC peak** (`CudaMmqRooflineProbe`), so there's ~2× headroom that maps to the remaining prefill gap.

### Ruled out (do NOT re-attempt without new evidence — see #149 for data)
- **split-K**: occupancy is *not* the bottleneck at real prefill batch sizes (the roofline probe mismeasured at nTok=1024 — real prefill has more tok-blocks). End-to-end wash (+2.3%/-2%, in noise). Reverted.
- **cp.async-pipelined weight staging** (`llm_mmq_q8_0_soa_cpa`): bit-identical but **slower** than the existing register-prefetch double-buffer on Ada (`.cg` -10 to -13%, `.ca` -5 to -8%). The register-prefetch already hides global-load latency; cp.async's commit_group/wait_group barriers add overhead. Reverted.

### The actual lever (this issue): a different MMQ tiling regime
The current kernel is `MMQ_BM=64 × MMQ_BN=128`, 8 warps, one `m16n8k32` mma per 8-token N-tile. To raise arithmetic intensity toward llama.cpp's MMQ schedule (`ggml/src/ggml-cuda/mmq.cuh`, on disk at `C:\p\llama.cpp`):
- larger per-warp output tiles / more N-tiles per warp (fewer weight loads per mma issued),
- revisit the per-block fp32 scale accumulation (8 N-tiles × fp32 muls per K-block — instruction overhead in the inner loop),
- shared-memory bank-conflict analysis on the `sW`/`sY` layouts,
- possibly a fundamentally different K-loop schedule.

Genuinely uncertain payoff and a deep rewrite — deferred as not-a-quick-win. Validate any variant against `CudaMmqSoaTests` (bit-identical to the SoA MMQ) and A/B with `CudaMmqRooflineProbe` **at the real prefill nTok** (the lesson from split-K). Parallel lever for the goal: decode (#142).


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(cuda): MMQ tiling rewrite — the remaining Gemma 4 prefill lever (split-K & cp.async ruled out) (#141/#149 follow-up) #152

Current state (branch `perf/gemma4-cuda-flash-tc-146`, PR #148)

Ruled out (do NOT re-attempt without new evidence — see #149 for data)

The actual lever (this issue): a different MMQ tiling regime

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

perf(cuda): MMQ tiling rewrite — the remaining Gemma 4 prefill lever (split-K & cp.async ruled out) (#141/#149 follow-up) #152

Description

Current state (branch perf/gemma4-cuda-flash-tc-146, PR #148)

Ruled out (do NOT re-attempt without new evidence — see #149 for data)

The actual lever (this issue): a different MMQ tiling regime

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions

Current state (branch `perf/gemma4-cuda-flash-tc-146`, PR #148)