Follow-up to #156/#158/#160 (PR #161). Qwen3-8B Q4_K_M on RTX 4070 Ti is now prefill 432 / decode 74.7 t/s; llama.cpp b8585 is pp1008 ~5764 / tg128 ~78–91. Two distinct gaps remain, each needing its own approach.
Decode (74.7 → ~78–91) — the lever is NOT the matvec
The Q4_K decode matvec is now ~89% of HBM peak after the #160 SoA repack (was 74%), vs Q8_0's ~90%. Per CudaDecodeMatvecQ4KRooflineProbe, that's near the realistic ceiling — squeezing the last ~1pp (the min-correction's 2 extra dp4a + 0x01010101 sum, the d8/scale float mults, or occupancy) is low-yield. Confirm with the probe before investing.
The real signal: the matvec gained +13–15% BW but e2e decode only moved +7%, so ~half the per-token cost is elsewhere. Mirror the Gemma win (#142, which turned out to be the sampler, not kernels):
Prefill (432 → ~5764) — cp.async-pipelined MMQ
The ~13× gap is llama.cpp's mul_mat_q cp.async double-buffered weight load, which hides the per-token-tile weight re-read that our llm_mmq_q4k/llm_mmq_q4k_soa (and cuBLAS, via L2) pay. The SoA layout (#160, quants 16B-aligned) is the right substrate for cp.async.
Notes
- Decode item is the higher-confidence win (Gemma precedent); prefill cp.async is larger but riskier.
- All knobs in place:
SHARPI_CUDA_PROFILE, SHARPI_Q4K_SOA, SHARPI_PREFILL_MMQ, SHARPI_CUDA_GRAPH. A/B harness: scripts/bench-q4k-soa-156.ps1. Model: models/Qwen3-8B-Q4_K_M.gguf.
Follow-up to #156/#158/#160 (PR #161). Qwen3-8B Q4_K_M on RTX 4070 Ti is now prefill 432 / decode 74.7 t/s; llama.cpp b8585 is pp1008 ~5764 / tg128 ~78–91. Two distinct gaps remain, each needing its own approach.
Decode (74.7 → ~78–91) — the lever is NOT the matvec
The Q4_K decode matvec is now ~89% of HBM peak after the #160 SoA repack (was 74%), vs Q8_0's ~90%. Per
CudaDecodeMatvecQ4KRooflineProbe, that's near the realistic ceiling — squeezing the last ~1pp (the min-correction's 2 extra dp4a +0x01010101sum, the d8/scale float mults, or occupancy) is low-yield. Confirm with the probe before investing.The real signal: the matvec gained +13–15% BW but e2e decode only moved +7%, so ~half the per-token cost is elsewhere. Mirror the Gemma win (#142, which turned out to be the sampler, not kernels):
SHARPI_CUDA_PROFILE=1per-phase decode breakdown on Qwen3-8B and identify the dominant non-matvec cost (attention / RoPE / norms / KV append / sampler).--temp 0may still hit a full-vocab artifact like Gemma'sbench --verbose-promptLINQ-sort (seeproject_gemma4_decode_sampler_142).Prefill (432 → ~5764) — cp.async-pipelined MMQ
The ~13× gap is llama.cpp's
mul_mat_qcp.async double-buffered weight load, which hides the per-token-tile weight re-read that ourllm_mmq_q4k/llm_mmq_q4k_soa(and cuBLAS, via L2) pay. The SoA layout (#160, quants 16B-aligned) is the right substrate for cp.async.cp.async.cg.shared.global(sm_80+) double-buffered tile load inllm_mmq_q4k_soa; overlap the next tile's global→shared copy with the current tile's mma.Notes
SHARPI_CUDA_PROFILE,SHARPI_Q4K_SOA,SHARPI_PREFILL_MMQ,SHARPI_CUDA_GRAPH. A/B harness:scripts/bench-q4k-soa-156.ps1. Model:models/Qwen3-8B-Q4_K_M.gguf.