Skip to content

perf(cuda): close remaining Qwen3-8B Q4_K DECODE gap to llama.cpp — non-matvec cost (prefill handled by #167; kernel-efficiency in #149/#152) #162

@pekkah

Description

@pekkah

Follow-up to #156/#158/#160 (PR #161). Qwen3-8B Q4_K_M on RTX 4070 Ti is now prefill 432 / decode 74.7 t/s; llama.cpp b8585 is pp1008 ~5764 / tg128 ~78–91. Two distinct gaps remain, each needing its own approach.

Decode (74.7 → ~78–91) — the lever is NOT the matvec

The Q4_K decode matvec is now ~89% of HBM peak after the #160 SoA repack (was 74%), vs Q8_0's ~90%. Per CudaDecodeMatvecQ4KRooflineProbe, that's near the realistic ceiling — squeezing the last ~1pp (the min-correction's 2 extra dp4a + 0x01010101 sum, the d8/scale float mults, or occupancy) is low-yield. Confirm with the probe before investing.

The real signal: the matvec gained +13–15% BW but e2e decode only moved +7%, so ~half the per-token cost is elsewhere. Mirror the Gemma win (#142, which turned out to be the sampler, not kernels):

  • Run SHARPI_CUDA_PROFILE=1 per-phase decode breakdown on Qwen3-8B and identify the dominant non-matvec cost (attention / RoPE / norms / KV append / sampler).
  • Check the sampler path specifically — Qwen3-8B greedy --temp 0 may still hit a full-vocab artifact like Gemma's bench --verbose-prompt LINQ-sort (see project_gemma4_decode_sampler_142).
  • Fix the top item; argmax-stable; same-session A/B; README.

Prefill (432 → ~5764) — cp.async-pipelined MMQ

The ~13× gap is llama.cpp's mul_mat_q cp.async double-buffered weight load, which hides the per-token-tile weight re-read that our llm_mmq_q4k/llm_mmq_q4k_soa (and cuBLAS, via L2) pay. The SoA layout (#160, quants 16B-aligned) is the right substrate for cp.async.

  • Prototype cp.async.cg.shared.global (sm_80+) double-buffered tile load in llm_mmq_q4k_soa; overlap the next tile's global→shared copy with the current tile's mma.
  • Bit-identical / argmax-stable vs current MMQ; A/B at ~100-tok and ~1K prompts; README.

Notes

  • Decode item is the higher-confidence win (Gemma precedent); prefill cp.async is larger but riskier.
  • All knobs in place: SHARPI_CUDA_PROFILE, SHARPI_Q4K_SOA, SHARPI_PREFILL_MMQ, SHARPI_CUDA_GRAPH. A/B harness: scripts/bench-q4k-soa-156.ps1. Model: models/Qwen3-8B-Q4_K_M.gguf.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions