Skip to content

perf(cuda): dense Q4_K batched prefill + decode SoA — Qwen3-8B 61.2→432 prefill, 65→74.7 decode (#156/#158/#160)#161

Merged
pekkah merged 9 commits into
masterfrom
perf/cuda-dense-q4k-batched-prefill-156
Jun 6, 2026
Merged

perf(cuda): dense Q4_K batched prefill + decode SoA — Qwen3-8B 61.2→432 prefill, 65→74.7 decode (#156/#158/#160)#161
pekkah merged 9 commits into
masterfrom
perf/cuda-dense-q4k-batched-prefill-156

Conversation

@pekkah

@pekkah pekkah commented Jun 6, 2026

Copy link
Copy Markdown
Owner

Generalizes the GDN-hybrid batched-prefill machinery to dense Q4_K models and closes most of the CUDA decode/prefill gap on Qwen3-8B. Builds across #156 (batched prefill → compute-bound GEMM → int8 MMQ → decode SoA), #158 (decode CUDA graphs), and #160 (SoA default-on).

Headline (Qwen3-8B Q4_K_M, RTX 4070 Ti, ~1K ctx, same-session A/B):

  • Prefill 61.2 → 432 t/s (batched trunk + flash attn + compute-bound GEMM + int8 tensor-core MMQ)
  • Decode 65 → 74.7 t/s (CUDA graphs +7%, then scale-pre-unpacked Q4_K SoA +7%)

What's in here

#156 — dense Q4_K prefill

  • Generalize batched-trunk prefill (new MatMulBatched GEMM-N path) to dense Q4_K (1b4bb4c)
  • Item C/C1: compute-bound prefill GEMM — dequant Q4_K→fp16 + cuBLAS GEMM, weight read once per batch instead of re-streamed per token (50bac9d)
  • Item C2: int8 tensor-core MMQ (llm_mmq_q4k, mma.m16n8k32.s8) — weight read once as int8, no fp16 HBM temp; SHARPI_PREFILL_MMQ=0 reverts to C1 (cb72245)
  • Decode SoA: one-time scale-pre-unpacked Q4_K repack at upload, so the decode matvec + prefill MMQ read plain bytes instead of the per-super-block get_scale_min_k4 unpack switch — matvec 74% → 89% of HBM peak (0d5fe20)

#158 — decode CUDA graphs for the non-Gemma dense Forward (capture/replay the per-token device region), 65 → 70 t/s; SHARPI_CUDA_GRAPH=0 to bisect (e1de163)

#160 — SoA default-on (621184f): convert the Q4_K readers that previously threw on a SoA handle to bit-identical SoA twins, then flip SHARPI_Q4K_SOA to default-on (=0 reverts). MoE Q4_K stays interleaved (repack gated !_isMoE).

  • Reachable on the default-on path (only CudaForwardPass, the dense non-GDN pass, repacks): llm_matvec_q4k_gemm_n_soa + llm_dequant_q4k_to_f16_soa — the SHARPI_PREFILL_MMQ=0 fallback prefill readers. Both bit-identical (tested). These are what default-on actually required (decode matvec + MMQ prefill SoA shipped in 0d5fe20).
  • Defensive / forward-looking: llm_matvec_q4k_n2_soa (MTP batched-verify). The dense-MTP CUDA pass is CudaHybridGdnForwardPass, which does not repack to SoA today, so no production path sends a SoA handle to MatMulN2 — the old throw was unreachable and this reader is not yet on a live path. It is bit-identity-tested in isolation (Q4KSoaN2_BitIdenticalToInterleaved, maxAbs==0) so a future wiring of SoA into the GDN pass works out of the box. (Review correction: an earlier description called N2 "the blocker" — it was not.)
  • Review fix (f8a262c): eager-JIT the new + pre-existing Q4_K SoA kernels so first-decode pays no JIT stutter (gemini-code-assist).

Correctness

  • All prefill fast paths are argmax-stable vs the fp32 matvec (SHARPI_PREFILL_GEMM=0 reverts to bit-exact); the SoA repack is bit-identical.
  • CUDA SoA readers are guarded by CudaQ4KSoaTests maxAbs==0 bit-identity oracles (decode/MMQ/N2/GEMM-N/dequant). The SoA conversion preserves FP accumulation order exactly (same NWARPS=8) — that's why bit-identity holds.
  • MtpDecoder_GreedyParity_LlamaCpp stays green but runs on CpuBackend — it guards the CPU MTP path and is unaffected by the CUDA SHARPI_Q4K_SOA flag (it does not exercise the CUDA N2 SoA reader).
  • Other tests: CudaMmqQ4KTests, Qwen3CudaBatchedPrefillTests, Qwen3CudaGraphParityTests, CudaDecodeMatvecQ4KRooflineProbe, 13 HybridGdn MTP, SnapKV/MMQ-E2E — all green.
  • CI note: these CUDA tests return silently without a GPU + the local .gguf models, so the green check does not cover them; the maxAbs==0 bit-identity and same-session A/B (default 74.7 vs =0 69.7 t/s) were verified on a local RTX 4070 Ti.

Remaining gap → #162

llama.cpp b8585 is pp1008 ~5764 / tg128 ~78–91. Prefill gap is its cp.async-pipelined MMQ; decode is near the HBM ceiling (~89%), so the next lever is the non-matvec decode cost (attention/RoPE/norms/sampler), not the matvec. Test-hardening follow-ups (default-on A/B test, MoE-skip negative test) also tracked in #162.

🤖 Generated with Claude Code


Closes #156, closes #158, closes #160.

pekkah and others added 7 commits June 6, 2026 19:59
Open the all-GPU batched-trunk prefill (originally Gemma-4-only, #136) to any
dense model the batched kernels cover — e.g. Qwen3-8B Q4_K. Three Gemma
assumptions in the batched layer body are generalized:

- LayerHeadDim null-coalesce to _headDim (non-Gemma has no per-layer head_dim)
- FFN activation dispatch on hp.FfnActivation (SiLuMul for SwiGLU vs GeluTanhMul)
- cached _attnScale (1f for Gemma, -1f else so the kernel derives 1/sqrt(head_dim))

The gate (IsGemma4BatchedPrefillSupported -> IsBatchedPrefillSupported) drops the
_isGemma4Like requirement but keeps every real guard (MoE, TQ, attn-bias, non-NEOX
RoPE, L2 QK-norm, batchable dtype). Methods renamed to drop the Gemma4 suffix.

Fix a latent ordering bug exposed by the generalization: the batched body applied
QK-norm before RoPE, but the dense per-token Forward applies RoPE before QK-norm.
RoPE does not commute with per-channel-weighted RMSNorm, so this diverged ~9 logits
on Qwen3. RoPE/QK-norm are now ordered to match the matching per-token oracle
(Gemma norm->rope; dense rope->norm) and the batched RoPE honors NoRopeLayerStep.

New Qwen3CudaBatchedPrefillTests: default path argmax-stable, flash-off bit-exact
vs the per-token loop. Gemma4CudaBatchedPrefillTests still 5/5 (no regression).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…batched prefill

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Generalize the Gemma 4 decode CUDA-graph capture/replay (#136) to the
non-Gemma per-token CudaForwardPass.Forward, so dense Q4_K models
(Qwen3-8B etc.) get the same host-launch-overhead reduction. Decode is
HBM-bandwidth-bound, so the win is killing the ~1k launches/token, not
kernel throughput.

- Refactor Forward into token-varying prefix (embed) -> RunDeviceRegion
  (layer loop + final norm + output projection) -> TQ ring-advance +
  logits download. The TQ ring-advance (host state) and SnapKV Q-capture
  (host-varying device offset) stay outside the captured region.
- Wrap RunDeviceRegion with the generic capture helpers via the new
  TryRunDeviceRegionViaGraph, mirroring TryRunGemma4DeviceRegionViaGraph.
  Bail on _tqEnabled / _kvEvictedCount>0 / _snapKvCaptureSlot>=0 / _isMoE
  (each breaks static topology or does an illegal mid-capture sync).
- Pre-grow the Q4_K/Q8_0 dp4a Q8_1 input scratch to max(embDim,intermDim)
  before first capture (capture forbids cudaMalloc).
- New Qwen3CudaGraphParityTests: graph replay is bit-identical to direct
  launch over 8 decode steps, plus the SnapKV-configured-no-evict guard.
  Gemma4 graph tests + Qwen3 batched-prefill tests still green.

Qwen3-8B Q4_K -g -1 decode: 65 -> 70 t/s (+7%, same-session A/B;
SHARPI_CUDA_GRAPH=0 to bisect).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Route Q4_K trunk matmuls in the batched prefill through the compute-bound
dequant->fp16->cuBLAS GEMM (weight read once per batch) instead of the
memory-bound matvec GEMM-N (weight re-streamed once per token). Profiling
Qwen3-8B Q4_K prefill confirmed the Item-C gate: the trunk matmuls are
99.5% of layer time (attn ~0.5% after flash), and the matvec path is
weight-bandwidth bound (~4.5GB x N_tokens of HBM traffic).

- New llm_dequant_q4k_to_f16 kernel: decodes the Q4_K super-block
  (d*sc*nibble - dmin*mn, identical to llm_embed_lookup_q4k / llm_matvec_q4k)
  to fp16, one 256-thread block per weight row.
- Generalize MatMulBatchedGemm from Q8_0-only to Q8_0|Q4_K: dtype-select the
  dequant kernel + dtype-aware cols alignment (Q4_K super-block = 256).
- GpuMatMulBatched dispatches Q4_K to the GEMM path (cols % 256), with a
  defensive matvec fallback otherwise.
- New parity oracle Qwen3_8B_BatchedPrefill_Q4KGemm_ArgmaxStable: GEMM vs
  bit-exact matvec, argmax-equal + top-5 overlap (fp16, not byte-exact, same
  contract as the Q8_0 #141 GEMM). FlashOff_MatchesSequential pinned to
  PrefillGemmEnabled=false to keep its bit-exact-matvec purpose.
- Prefill-profile gains a matmul= breakdown alongside attn=.

Qwen3-8B Q4_K -g -1 prefill: 119.8 -> 432 t/s (3.6x, same-session A/B;
SHARPI_PREFILL_GEMM=0 to bisect); --no-thinking 61.8 -> 427. Decode and the
--tq rows (batched prefill disabled for TQ) unchanged. llama.cpp b8585
pp1008 = 5764 t/s; the remaining ~13x is its int8 Q4_K MMQ kernel (Item C2
follow-up). README Qwen3-8B rows updated.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Adds llm_mmq_q4k, the maximal Item C: each Q4_K weight is read once as int8
(nibble-expanded, get_scale_min_k4 decode) and fed to the m16n8k32 s8 mma with
no fp16 dequant temp to HBM — the cost that capped C1's dequant→fp16→cuBLAS GEMM.
The kernel mirrors llm_mmq_q8_0's tiling/fragment map byte-for-byte (validated by
#141); the only Q4_K-specific work is the 4-bit→int8 weight expansion, the
per-(row,sub-block) (scale,min) unpack, and the asymmetric min-bias term
−super_dmin·mn·(d_a·Σq_a). That activation sum d_a·Σq_a is now packed as the fp16
`s` half of each q8_1 block by llm_quantize_q8_1 (every other reader masks the
d-word with 0xffff, so the high half was inert — mirrors ggml block_q8_1's ds).

Wired as the default Q4_K prefill path under PrefillMmqEnabled (cols%256, weight
read once as int8); SHARPI_PREFILL_MMQ=0 reverts to the C1 GEMM, =…_GEMM=0 to the
bit-exact matvec. Argmax-stable, not bit-exact (both operands int8-quantized + the
min-bias rounds through fp16 s).

Qwen3-8B same-session A/B vs C1: +25% at ~100-tok prompts (284→355 t/s, where C1
still pays its fp16-temp write), converging to a tie by ~1K ctx (430→432 @1008) as
cuBLAS amortizes that temp — so the 1K README column is unchanged but short-context
prefill and prefill VRAM both improve. The remaining gap to llama.cpp's MMQ is its
cp.async pipelining, which hides the weight re-read across token tiles that cuBLAS
amortizes via L2.

Tests: CudaMmqQ4KTests (GPU MMQ vs CPU DotQ4K fp32 ref, 0 mismatches — isolates the
nibble/scale decode + min-bias + fragment map) and Qwen3_8B_BatchedPrefill_Q4KMmq_
ArgmaxStable (MMQ vs bit-exact matvec GEMM-N, argmax + top-5). 220/220 ForwardPass
tests green (the shared quantize_q8_1 change is parity-safe across all q8_1 paths).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Qwen3-8B Q4_K CUDA decode matvec hit only ~74% of HBM peak vs the Q8_0
dp4a path's ~90%. The gap is NOT the #149 funnelshift — the 144-B Q4_K
super-block is already 16-byte aligned. It is per-super-block COMPUTE: the
get_scale_min_k4 6-bit (scale,min) unpack switch forms a dependent chain
that starves memory-level parallelism so loads never saturate HBM.

Fix: a one-time scale-pre-unpacked SoA repack at upload (RepackQ4KSoa /
llm_q4k_repack_soa) splits each block into [quants][unpacked scale/min
bytes][d|dmin] regions. The decode matvec (llm_matvec_q4k_soa) and prefill
MMQ (llm_mmq_q4k_soa) then read plain bytes — no switch. The stored
scale/min integers are identical, so both kernels are bit-identical to the
interleaved versions (8-warp reduction kept to preserve FP order).

Same-session A/B (RTX 4070 Ti, opt-in SHARPI_Q4K_SOA=1):
  matvec BW 74% -> 89% (+13-15%); decode 70.0 -> 74.7 t/s (+7%);
  prefill +5% (MMQ benefits too); bit-identical, coherent output.

Dense-only: repack gated on !_isMoE; the unconverted Q4_K readers
(GEMM-N, dequant->GEMM, MTP N2) throw if a SoA handle reaches them.
Default-on is blocked on converting the fragile MTP N2 byte-parity path.

Tests: CudaQ4KSoaTests (decode + MMQ bit-identical + A/B),
CudaDecodeMatvecQ4KRooflineProbe. 26 targeted CUDA correctness tests green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…aders (#160)

Flip SHARPI_Q4K_SOA to default-on (set =0 to revert) so every dense Q4_K
CUDA model gets the scale-pre-unpacked decode win (Qwen3-8B 70.0 → 74.7 t/s,
+7% decode / +5% prefill) without an opt-in flag.

The blocker was the three Q4_K readers that threw on a SoA weight handle. All
three now have bit-identical SoA twins:

- llm_matvec_q4k_n2_soa — the N=2 MTP batched-verify reader (dense
  Qwen3.6-27B-MTP). Same two-input dp4a + FP accumulation order, NWARPS=8, only
  scale/min come from the pre-unpacked SoA bytes. The fragile MTP llama.cpp
  byte-parity oracle (MtpDecoder_GreedyParity_LlamaCpp) still holds — cumulative
  trunk drift over 64 layers is unchanged because the arithmetic order is byte-identical.
- llm_matvec_q4k_gemm_n_soa + llm_dequant_q4k_to_f16_soa — the
  SHARPI_PREFILL_MMQ=0 fallback prefill readers, so the flag is safe with MMQ off.

Backend auto-routes per repacked handle (DispatchMatVecQ4KN2/Batched gain a
soa param); the MoE repack stays gated !_isMoE at upload.

Validation: 7 CudaQ4KSoaTests bit-identical (decode/MMQ/N2/GEMM-N/dequant) +
MTP byte-parity + 13 HybridGdn MTP + 6 Qwen3 CUDA parity + SnapKV/MMQ-E2E green.
Same-session A/B: default (env unset) 74.7 t/s vs SHARPI_Q4K_SOA=0 69.7 t/s.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a scale-pre-unpacked Structure of Arrays (SoA) layout repack and optimized kernels for Q4_K weights on CUDA, improving decode and prefill performance for dense models like Qwen3-8B. It also generalizes the all-GPU batched-trunk prefill to non-Gemma dense models, adds CUDA graph capture/replay for non-Gemma dense decode regions, and includes comprehensive benchmark scripts and parity tests. The reviewer noted that several of the newly added Q4_K SoA and dequantization kernels are missing from ForceEagerJit(), which should be added to prevent JIT compilation stutters during the first inference run.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread src/SharpInference.Cuda/CudaBackend.cs Outdated
Comment on lines 4582 to 4584
_matvecQ80GemmNKernel, _mmqQ80Kernel, _mmqQ80SoaKernel, _mmqQ4kKernel,
_matvecQ80Dp4aSoaKernel, _q80RepackSoaKernel,
_matvecQ80SoaKernel, _matvecQ80GemmNSoaKernel, _dequantQ80F16SoaKernel,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The newly introduced Q4_K SoA and dequantization kernels (_mmqQ4kSoaKernel, _q4kRepackSoaKernel, _matvecQ4KSoaKernel, _matvecQ4KN2SoaKernel, _matvecQ4KGemmNSoaKernel, _dequantQ4KF16Kernel, and _dequantQ4KF16SoaKernel) are missing from ForceEagerJit(). Adding them here ensures they are eagerly compiled during initialization, preventing JIT compilation stutters during the first inference run.

            _matvecQ80GemmNKernel, _mmqQ80Kernel, _mmqQ80SoaKernel, _mmqQ4kKernel, _mmqQ4kSoaKernel,\n            _matvecQ80Dp4aSoaKernel, _q80RepackSoaKernel, _q4kRepackSoaKernel,\n            _matvecQ80SoaKernel, _matvecQ80GemmNSoaKernel, _dequantQ80F16SoaKernel,\n            _matvecQ4KSoaKernel, _matvecQ4KN2SoaKernel, _matvecQ4KGemmNSoaKernel,\n            _dequantQ4KF16Kernel, _dequantQ4KF16SoaKernel,

@pekkah

pekkah commented Jun 6, 2026

Copy link
Copy Markdown
Owner Author

Remaining Qwen3-8B Q4_K gap to llama.cpp (decode non-matvec cost + prefill cp.async MMQ) tracked as follow-up #162.

pekkah and others added 2 commits June 6, 2026 23:57
ForceEagerJit() was missing the Q4_K SoA repack + decode/N2/GEMM-N/dequant
readers (and the pre-existing AoS dequant), so their SASS was finalized lazily
on first decode instead of at load. Add them so first-token latency pays no
per-kernel JIT stutter. No behavior change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
_diff_*.txt were scratch files a review agent wrote during PR #161 review and
got swept in by `git add -A`. Not project files; remove.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@pekkah

pekkah commented Jun 6, 2026

Copy link
Copy Markdown
Owner Author

Addressed the gemini-code-assist suggestion in f8a262c — added the new Q4_K SoA kernels (llm_q4k_repack_soa, llm_matvec_q4k_soa, _n2_soa, _gemm_n_soa, _dequant…_soa) plus the pre-existing _dequantQ4KF16Kernel to ForceEagerJit(), so their SASS is finalized at load and first decode pays no JIT stutter.

@pekkah pekkah merged commit 5f4519b into master Jun 6, 2026
1 check passed
@pekkah pekkah deleted the perf/cuda-dense-q4k-batched-prefill-156 branch June 6, 2026 20:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

1 participant