perf(cuda): split-K + better-tiled int8 MMQ GEMM — 23-34% → 60-80% of TC peak (#141 follow-up)

> ## ✅ RESOLVED / superseded by #152 (2026-06-14)
> Of the approaches here: **split-K was ruled out** (occupancy isn't the bottleneck at real prefill batch sizes — end-to-end wash, reverted), **cp.async staging was slower on Ada** (reverted), and the **SoA Q8_0 repack shipped** (+10–12% prefill, bit-identical). The int8 MMQ still sits at ~40–43% of TC peak; that remaining "different tiling regime" lever — the only one not ruled out — is consolidated in **#152**. Closing in favor of #152 to avoid duplicate tracking.
>
> ---

Follow-up to #141. After the TC flash work (#146/#147) moved the prefill bottleneck to the matmul/FFN GEMMs, a roofline probe (`CudaMmqRooflineProbe`) shows the int8 MMQ kernel runs at only **23-34% of the 4070 Ti's ~160 TOPS dense int8 TC peak** — ~3× headroom, matching the 2.2× end-to-end prefill gap to llama.cpp (~8475 vs our 3818 t/s).

**Worst shape: `ffn-down` [out=2048 × in=8192], 23% peak.** Few output rows → few CTA tiles → the SMs are occupancy-starved over the long (8192) contraction, exactly the class of problem the single-warp flash kernel had before #147's d-split.

**Scope (pieces, not estimates):**
- **Split-K** the contraction across CTAs for tall/thin-output GEMMs (ffn-down) so all SMs get work; reduce the partials (atomic or a second pass). This alone should lift ffn-down's 23%.
- Better **register tiling** / larger per-warp output tiles to raise arithmetic intensity (more mma per weight load).
- **cp.async** double-buffered weight/activation staging — complicated by the Q8_0 `qs` 2-byte misalignment (the same funnelshift issue noted for decode); may need a re-aligned Q8_0 staging or a 16-byte-aligned shared layout.
- Cross-reference llama.cpp `ggml/src/ggml-cuda/mmq.cuh` (on disk at `C:\p\llama.cpp`) for their tiling/pipelining constants.
- Parity: the existing `CudaMmqQ8_0Tests` oracle (argmax-stable vs CPU `DotQ8_0`); A/B with the roofline probe (target: push the three FFN/qkv shapes toward 60-80% peak).

This is the main remaining lever for the Gemma 4 prefill half of the #141/#142 "match llama.cpp" goal.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(cuda): split-K + better-tiled int8 MMQ GEMM — 23-34% → 60-80% of TC peak (#141 follow-up) #149

✅ RESOLVED / superseded by #152 (2026-06-14)

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

perf(cuda): split-K + better-tiled int8 MMQ GEMM — 23-34% → 60-80% of TC peak (#141 follow-up) #149

Description

✅ RESOLVED / superseded by #152 (2026-06-14)

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions