Skip to content

perf(cuda): MMQ tiling rewrite — the remaining Gemma 4 prefill lever (split-K & cp.async ruled out) (#141/#149 follow-up) #152

@pekkah

Description

@pekkah

Tracks the remaining prefill lever for the Gemma 4 "match llama.cpp" goal (#141), and consolidates what's been ruled out so it isn't re-attempted.

Current state (branch perf/gemma4-cuda-flash-tc-146, PR #148)

Prefill is ~4240 t/s = ~50% of llama.cpp's ~8475 after TC flash (#146/#147) + SoA Q8_0 (#149). Profiling shows attention is now ~29% of prefill and the matmul/FFN GEMMs (~53%) dominate. The int8 MMQ runs at ~40-43% of the 4070 Ti's int8 TC peak (CudaMmqRooflineProbe), so there's ~2× headroom that maps to the remaining prefill gap.

Ruled out (do NOT re-attempt without new evidence — see #149 for data)

  • split-K: occupancy is not the bottleneck at real prefill batch sizes (the roofline probe mismeasured at nTok=1024 — real prefill has more tok-blocks). End-to-end wash (+2.3%/-2%, in noise). Reverted.
  • cp.async-pipelined weight staging (llm_mmq_q8_0_soa_cpa): bit-identical but slower than the existing register-prefetch double-buffer on Ada (.cg -10 to -13%, .ca -5 to -8%). The register-prefetch already hides global-load latency; cp.async's commit_group/wait_group barriers add overhead. Reverted.

The actual lever (this issue): a different MMQ tiling regime

The current kernel is MMQ_BM=64 × MMQ_BN=128, 8 warps, one m16n8k32 mma per 8-token N-tile. To raise arithmetic intensity toward llama.cpp's MMQ schedule (ggml/src/ggml-cuda/mmq.cuh, on disk at C:\p\llama.cpp):

  • larger per-warp output tiles / more N-tiles per warp (fewer weight loads per mma issued),
  • revisit the per-block fp32 scale accumulation (8 N-tiles × fp32 muls per K-block — instruction overhead in the inner loop),
  • shared-memory bank-conflict analysis on the sW/sY layouts,
  • possibly a fundamentally different K-loop schedule.

Genuinely uncertain payoff and a deep rewrite — deferred as not-a-quick-win. Validate any variant against CudaMmqSoaTests (bit-identical to the SoA MMQ) and A/B with CudaMmqRooflineProbe at the real prefill nTok (the lesson from split-K). Parallel lever for the goal: decode (#142).

Metadata

Metadata

Assignees

No one assigned

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions