perf(cuda): close remaining Qwen3-8B Q4_K DECODE gap to llama.cpp — non-matvec cost (prefill handled by #167; kernel-efficiency in #149/#152)

Follow-up to #156/#158/#160 (PR #161). Qwen3-8B Q4_K_M on RTX 4070 Ti is now **prefill 432 / decode 74.7 t/s**; llama.cpp b8585 is **pp1008 ~5764 / tg128 ~78–91**. Two distinct gaps remain, each needing its own approach.

## Decode (74.7 → ~78–91) — the lever is NOT the matvec

The Q4_K decode matvec is now ~**89% of HBM peak** after the #160 SoA repack (was 74%), vs Q8_0's ~90%. Per `CudaDecodeMatvecQ4KRooflineProbe`, that's near the realistic ceiling — squeezing the last ~1pp (the min-correction's 2 extra dp4a + `0x01010101` sum, the d8/scale float mults, or occupancy) is low-yield. **Confirm with the probe before investing.**

The real signal: the matvec gained +13–15% BW but e2e decode only moved +7%, so **~half the per-token cost is elsewhere**. Mirror the Gemma win (#142, which turned out to be the sampler, not kernels):

- [ ] Run `SHARPI_CUDA_PROFILE=1` per-phase decode breakdown on Qwen3-8B and identify the dominant non-matvec cost (attention / RoPE / norms / KV append / sampler).
- [ ] Check the sampler path specifically — Qwen3-8B greedy `--temp 0` may still hit a full-vocab artifact like Gemma's `bench --verbose-prompt` LINQ-sort (see `project_gemma4_decode_sampler_142`).
- [ ] Fix the top item; argmax-stable; same-session A/B; README.

## Prefill (432 → ~5764) — cp.async-pipelined MMQ

The ~13× gap is llama.cpp's `mul_mat_q` **cp.async double-buffered** weight load, which hides the per-token-tile weight re-read that our `llm_mmq_q4k`/`llm_mmq_q4k_soa` (and cuBLAS, via L2) pay. The SoA layout (#160, quants 16B-aligned) is the right substrate for cp.async.

- [ ] Prototype `cp.async.cg.shared.global` (sm_80+) double-buffered tile load in `llm_mmq_q4k_soa`; overlap the next tile's global→shared copy with the current tile's mma.
- [ ] Bit-identical / argmax-stable vs current MMQ; A/B at ~100-tok and ~1K prompts; README.

## Notes
- Decode item is the **higher-confidence win** (Gemma precedent); prefill cp.async is larger but riskier.
- All knobs in place: `SHARPI_CUDA_PROFILE`, `SHARPI_Q4K_SOA`, `SHARPI_PREFILL_MMQ`, `SHARPI_CUDA_GRAPH`. A/B harness: `scripts/bench-q4k-soa-156.ps1`. Model: `models/Qwen3-8B-Q4_K_M.gguf`.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(cuda): close remaining Qwen3-8B Q4_K DECODE gap to llama.cpp — non-matvec cost (prefill handled by #167; kernel-efficiency in #149/#152) #162

Decode (74.7 → ~78–91) — the lever is NOT the matvec

Prefill (432 → ~5764) — cp.async-pipelined MMQ

Notes

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

perf(cuda): close remaining Qwen3-8B Q4_K DECODE gap to llama.cpp — non-matvec cost (prefill handled by #167; kernel-efficiency in #149/#152) #162

Description

Decode (74.7 → ~78–91) — the lever is NOT the matvec

Prefill (432 → ~5764) — cp.async-pipelined MMQ

Notes

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions