Skip to content

perf(cuda): split-K + better-tiled int8 MMQ GEMM — 23-34% → 60-80% of TC peak (#141 follow-up) #149

@pekkah

Description

@pekkah

✅ RESOLVED / superseded by #152 (2026-06-14)

Of the approaches here: split-K was ruled out (occupancy isn't the bottleneck at real prefill batch sizes — end-to-end wash, reverted), cp.async staging was slower on Ada (reverted), and the SoA Q8_0 repack shipped (+10–12% prefill, bit-identical). The int8 MMQ still sits at ~40–43% of TC peak; that remaining "different tiling regime" lever — the only one not ruled out — is consolidated in #152. Closing in favor of #152 to avoid duplicate tracking.


Follow-up to #141. After the TC flash work (#146/#147) moved the prefill bottleneck to the matmul/FFN GEMMs, a roofline probe (CudaMmqRooflineProbe) shows the int8 MMQ kernel runs at only 23-34% of the 4070 Ti's ~160 TOPS dense int8 TC peak — ~3× headroom, matching the 2.2× end-to-end prefill gap to llama.cpp (~8475 vs our 3818 t/s).

Worst shape: ffn-down [out=2048 × in=8192], 23% peak. Few output rows → few CTA tiles → the SMs are occupancy-starved over the long (8192) contraction, exactly the class of problem the single-warp flash kernel had before #147's d-split.

Scope (pieces, not estimates):

  • Split-K the contraction across CTAs for tall/thin-output GEMMs (ffn-down) so all SMs get work; reduce the partials (atomic or a second pass). This alone should lift ffn-down's 23%.
  • Better register tiling / larger per-warp output tiles to raise arithmetic intensity (more mma per weight load).
  • cp.async double-buffered weight/activation staging — complicated by the Q8_0 qs 2-byte misalignment (the same funnelshift issue noted for decode); may need a re-aligned Q8_0 staging or a 16-byte-aligned shared layout.
  • Cross-reference llama.cpp ggml/src/ggml-cuda/mmq.cuh (on disk at C:\p\llama.cpp) for their tiling/pipelining constants.
  • Parity: the existing CudaMmqQ8_0Tests oracle (argmax-stable vs CPU DotQ8_0); A/B with the roofline probe (target: push the three FFN/qkv shapes toward 60-80% peak).

This is the main remaining lever for the Gemma 4 prefill half of the #141/#142 "match llama.cpp" goal.

Metadata

Metadata

Assignees

No one assigned

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions