Tracks the remaining prefill lever for the Gemma 4 "match llama.cpp" goal (#141), and consolidates what's been ruled out so it isn't re-attempted.
Current state (branch perf/gemma4-cuda-flash-tc-146, PR #148)
Prefill is ~4240 t/s = ~50% of llama.cpp's ~8475 after TC flash (#146/#147) + SoA Q8_0 (#149). Profiling shows attention is now ~29% of prefill and the matmul/FFN GEMMs (~53%) dominate. The int8 MMQ runs at ~40-43% of the 4070 Ti's int8 TC peak (CudaMmqRooflineProbe), so there's ~2× headroom that maps to the remaining prefill gap.
Ruled out (do NOT re-attempt without new evidence — see #149 for data)
- split-K: occupancy is not the bottleneck at real prefill batch sizes (the roofline probe mismeasured at nTok=1024 — real prefill has more tok-blocks). End-to-end wash (+2.3%/-2%, in noise). Reverted.
- cp.async-pipelined weight staging (
llm_mmq_q8_0_soa_cpa): bit-identical but slower than the existing register-prefetch double-buffer on Ada (.cg -10 to -13%, .ca -5 to -8%). The register-prefetch already hides global-load latency; cp.async's commit_group/wait_group barriers add overhead. Reverted.
The actual lever (this issue): a different MMQ tiling regime
The current kernel is MMQ_BM=64 × MMQ_BN=128, 8 warps, one m16n8k32 mma per 8-token N-tile. To raise arithmetic intensity toward llama.cpp's MMQ schedule (ggml/src/ggml-cuda/mmq.cuh, on disk at C:\p\llama.cpp):
- larger per-warp output tiles / more N-tiles per warp (fewer weight loads per mma issued),
- revisit the per-block fp32 scale accumulation (8 N-tiles × fp32 muls per K-block — instruction overhead in the inner loop),
- shared-memory bank-conflict analysis on the
sW/sY layouts,
- possibly a fundamentally different K-loop schedule.
Genuinely uncertain payoff and a deep rewrite — deferred as not-a-quick-win. Validate any variant against CudaMmqSoaTests (bit-identical to the SoA MMQ) and A/B with CudaMmqRooflineProbe at the real prefill nTok (the lesson from split-K). Parallel lever for the goal: decode (#142).
Tracks the remaining prefill lever for the Gemma 4 "match llama.cpp" goal (#141), and consolidates what's been ruled out so it isn't re-attempted.
Current state (branch
perf/gemma4-cuda-flash-tc-146, PR #148)Prefill is ~4240 t/s = ~50% of llama.cpp's ~8475 after TC flash (#146/#147) + SoA Q8_0 (#149). Profiling shows attention is now ~29% of prefill and the matmul/FFN GEMMs (~53%) dominate. The int8 MMQ runs at ~40-43% of the 4070 Ti's int8 TC peak (
CudaMmqRooflineProbe), so there's ~2× headroom that maps to the remaining prefill gap.Ruled out (do NOT re-attempt without new evidence — see #149 for data)
llm_mmq_q8_0_soa_cpa): bit-identical but slower than the existing register-prefetch double-buffer on Ada (.cg-10 to -13%,.ca-5 to -8%). The register-prefetch already hides global-load latency; cp.async's commit_group/wait_group barriers add overhead. Reverted.The actual lever (this issue): a different MMQ tiling regime
The current kernel is
MMQ_BM=64 × MMQ_BN=128, 8 warps, onem16n8k32mma per 8-token N-tile. To raise arithmetic intensity toward llama.cpp's MMQ schedule (ggml/src/ggml-cuda/mmq.cuh, on disk atC:\p\llama.cpp):sW/sYlayouts,Genuinely uncertain payoff and a deep rewrite — deferred as not-a-quick-win. Validate any variant against
CudaMmqSoaTests(bit-identical to the SoA MMQ) and A/B withCudaMmqRooflineProbeat the real prefill nTok (the lesson from split-K). Parallel lever for the goal: decode (#142).