✅ RESOLVED / superseded by #152 (2026-06-14)
Of the approaches here: split-K was ruled out (occupancy isn't the bottleneck at real prefill batch sizes — end-to-end wash, reverted), cp.async staging was slower on Ada (reverted), and the SoA Q8_0 repack shipped (+10–12% prefill, bit-identical). The int8 MMQ still sits at ~40–43% of TC peak; that remaining "different tiling regime" lever — the only one not ruled out — is consolidated in #152. Closing in favor of #152 to avoid duplicate tracking.
Follow-up to #141. After the TC flash work (#146/#147) moved the prefill bottleneck to the matmul/FFN GEMMs, a roofline probe (CudaMmqRooflineProbe) shows the int8 MMQ kernel runs at only 23-34% of the 4070 Ti's ~160 TOPS dense int8 TC peak — ~3× headroom, matching the 2.2× end-to-end prefill gap to llama.cpp (~8475 vs our 3818 t/s).
Worst shape: ffn-down [out=2048 × in=8192], 23% peak. Few output rows → few CTA tiles → the SMs are occupancy-starved over the long (8192) contraction, exactly the class of problem the single-warp flash kernel had before #147's d-split.
Scope (pieces, not estimates):
- Split-K the contraction across CTAs for tall/thin-output GEMMs (ffn-down) so all SMs get work; reduce the partials (atomic or a second pass). This alone should lift ffn-down's 23%.
- Better register tiling / larger per-warp output tiles to raise arithmetic intensity (more mma per weight load).
- cp.async double-buffered weight/activation staging — complicated by the Q8_0
qs 2-byte misalignment (the same funnelshift issue noted for decode); may need a re-aligned Q8_0 staging or a 16-byte-aligned shared layout.
- Cross-reference llama.cpp
ggml/src/ggml-cuda/mmq.cuh (on disk at C:\p\llama.cpp) for their tiling/pipelining constants.
- Parity: the existing
CudaMmqQ8_0Tests oracle (argmax-stable vs CPU DotQ8_0); A/B with the roofline probe (target: push the three FFN/qkv shapes toward 60-80% peak).
This is the main remaining lever for the Gemma 4 prefill half of the #141/#142 "match llama.cpp" goal.
Follow-up to #141. After the TC flash work (#146/#147) moved the prefill bottleneck to the matmul/FFN GEMMs, a roofline probe (
CudaMmqRooflineProbe) shows the int8 MMQ kernel runs at only 23-34% of the 4070 Ti's ~160 TOPS dense int8 TC peak — ~3× headroom, matching the 2.2× end-to-end prefill gap to llama.cpp (~8475 vs our 3818 t/s).Worst shape:
ffn-down[out=2048 × in=8192], 23% peak. Few output rows → few CTA tiles → the SMs are occupancy-starved over the long (8192) contraction, exactly the class of problem the single-warp flash kernel had before #147's d-split.Scope (pieces, not estimates):
qs2-byte misalignment (the same funnelshift issue noted for decode); may need a re-aligned Q8_0 staging or a 16-byte-aligned shared layout.ggml/src/ggml-cuda/mmq.cuh(on disk atC:\p\llama.cpp) for their tiling/pipelining constants.CudaMmqQ8_0Testsoracle (argmax-stable vs CPUDotQ8_0); A/B with the roofline probe (target: push the three FFN/qkv shapes toward 60-80% peak).This is the main remaining lever for the Gemma 4 prefill half of the #141/#142 "match llama.cpp" goal.