perf(cuda): honor --kv-type on CudaHybridForwardPass (layer-split MoE hybrid) (#230) by pekkah · Pull Request #232 · pekkah/SharpInference

pekkah · 2026-06-13T13:18:41Z

Summary

Fixes #230. --kv-type bf16|q8_0 (SHARPI_KV_DTYPE, #179) was silently ignored by the layer-split pure-attention MoE hybrid (CudaHybridForwardPass — Qwen3-Coder-30B, OLMoE): its GPU KV was allocated fp32 with no dtype arg, so the flag was a no-op. Worse, TierPlanner already prices the KV budget at the requested dtype (RunCommand passes ResolveConfiguredKvDType to Plan), so the runtime allocated a 4× larger fp32 cache than the planner reserved — a plan/runtime mismatch that can over-commit near the SLRU threshold.

Fix

Wire the dtype through: resolve CudaForwardPass.ResolveConfiguredKvDType() into _kvDType, allocate the GPU-trunk _gpuKCache/_gpuVCache at it, and route every GPU KV append/attention through *Kv dispatch helpers that pick the narrowed kernels (#179):

per-token decode — KvAppend/Attention (+ gemma4 AttentionSwa),
batched prefill — KvAppendBatched, AttentionBatched, AttentionBatchedWave,

each dispatching bf16 / q8_0 / fp32. Narrowing is scoped to the GPU-resident layers; CPU-offloaded layers keep their fp32 SimdKernels KV. Mirrors the dense path's guards — rejects narrowed + TurboQuant (TQ owns its ring) and q8_0 with a non-%32 kvDim geometry, both with clear errors instead of the silent fp32 no-op. The banner shows the KV tag. fp32 default is byte-identical (dispatch routes to the same kernels).

Result (4070 Ti 12 GB, `Qwen3-Coder-30B-A3B -g -1 --kv-type q8_0`)

	fp32	q8_0
GPU KV	6144 MB	1632 MB
free after weights	3652 MiB	6724 MiB (+3 GiB)
expert-cache budget	3957 MB	8469 MB
decode	24.3 t/s	24.4 t/s
>4096 wave-q8 prefill (6169 tok)	—	26.3 t/s, no OOM

The freed VRAM more than doubles the SLRU expert-cache budget (and the runtime now matches what the planner reserved).

Tests

New CudaHybridKvDtypeTests — bf16 and q8_0 teacher-forced argmax-stable vs fp32 (fp32 top-1 ∈ narrowed top-5 + bounded logit max-abs) on Coder-30B and OLMoE, 4/4 green. Existing fp32 CudaHybridBatchedPrefillTests unchanged (9/9). Release build clean under TreatWarningsAsErrors + AOT.

Out of scope (noted in #230)

Vulkan (GpuForwardPass) and CPU (ForwardPass) still lack --kv-type (CPU's compression is --tq); they should warn rather than silently ignore — separate follow-up.

🤖 Generated with Claude Code

… hybrid) (#230) --kv-type bf16|q8_0 (SHARPI_KV_DTYPE, #179) was silently ignored by the layer-split pure-attention MoE hybrid (Qwen3-Coder-30B, OLMoE): its GPU KV was allocated fp32 with no dtype arg, so the flag was a no-op. Worse, TierPlanner ALREADY prices the KV budget at the requested dtype (RunCommand passes ResolveConfiguredKvDType to Plan), so the runtime allocated a 4× larger fp32 cache than the planner reserved — a plan/runtime mismatch that could over-commit near the SLRU threshold. Wire the dtype through: resolve CudaForwardPass.ResolveConfiguredKvDType() into _kvDType, allocate the GPU-trunk _gpuKCache/_gpuVCache at it, and route every GPU KV append/attention through *Kv dispatch helpers that pick the narrowed kernels (#179): per-token decode (KvAppend/Attention, + gemma4 AttentionSwa), and batched prefill (KvAppendBatched, AttentionBatched, AttentionBatchedWave) — bf16/q8_0/fp32. Narrowing is scoped to the GPU-resident layers; CPU-offloaded layers keep their fp32 SimdKernels KV. Mirrors the dense path's guards: rejects narrowed + TurboQuant (TQ owns its ring) and q8_0 with a non-%32 kvDim geometry, both with clear errors instead of the silent fp32 no-op. Banner shows the KV tag. fp32 default is byte-identical (dispatch routes to the same kernels). Result (4070 Ti 12 GB, Qwen3-Coder-30B -g -1 --kv-type q8_0): KV 6144→1632 MB, free after weights 3652→6724 MiB (+3 GiB), expert-cache budget 3957→8469 MB; decode 24.4 t/s (= fp32), >4096 wave-q8 prefill (6169 tok) 26.3 t/s, no OOM. Tests: new CudaHybridKvDtypeTests — bf16 AND q8_0 teacher-forced argmax-stable vs fp32 (top-5 + bounded maxAbs) on Coder-30B and OLMoE (4/4 green). Existing fp32 CudaHybridBatchedPrefillTests unchanged (9/9). Release clean under TreatWarningsAsErrors + AOT. Out of scope (noted in #230): Vulkan (GpuForwardPass) and CPU (ForwardPass) still lack --kv-type and should warn rather than silently ignore — separate follow-up. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

gemini-code-assist

Code Review

This pull request adds support for narrowed KV-cache dtypes (bf16 and q8_0) on the layer-split hybrid path in CudaHybridForwardPass, resolving a budget mismatch between the tier planner and the runtime. It introduces dispatch helpers to route GPU-trunk KV operations to their dtype-specific kernels and adds comprehensive tests to verify argmax-stability. The reviewer points out that the q8_0 geometry check is overly restrictive because it validates all layers instead of only the GPU-resident layers, which could cause unnecessary exceptions for CPU-offloaded layers.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-13T13:20:22Z

+        _kvDType = _tqEnabled ? DType.Float32 : requestedKv;
+        if (_kvDType == DType.Q8_0 && !CudaForwardPass.Q8KvGeometrySupported(hp))
+            throw new NotSupportedException(
+                "SHARPI_KV_DTYPE=q8_0 requires every layer's kvDim (kvHeads × headDim) to be a multiple " +
+                "of 32 (the q8_0 block size); this model's geometry is incompatible. Use bf16 or fp32.");


The CudaForwardPass.Q8KvGeometrySupported(hp) helper checks the geometry of all layers in the model. However, in CudaHybridForwardPass, only the first _nGpuLayers are resident on the GPU and allocated with the narrowed _kvDType (CPU-offloaded layers always use the standard Float32 KV cache). Checking all layers is overly restrictive and can cause a NotSupportedException to be thrown unnecessarily if a CPU-offloaded layer has an incompatible geometry, even when all GPU-resident layers are fully compatible.

We should restrict this geometry check to only the GPU-resident layers (0 to _nGpuLayers - 1). This also gracefully handles the case where _nGpuLayers == 0 without throwing an exception. Additionally, please ensure that VRAM reserve and estimation calculations are kept exactly in sync with these actual memory allocations, and write unit tests to pin these calculations to prevent regressions.

_kvDType = _tqEnabled ? DType.Float32 : requestedKv; if (_kvDType == DType.Q8_0) { bool perLayerKv = hp.LayerHeadDim is not null; for (int i = 0; i < _nGpuLayers; i++) { if (hp.KvSourceLayer is { } ksl && ksl[i] >= 0) continue; int layerHd = perLayerKv ? hp.LayerHeadDim![i] : _headDim; int layerKv = hp.LayerKvHeads is { } lkv ? lkv[i] : _numKvHeads; if ((((long)layerKv * layerHd) & 31) != 0) { throw new NotSupportedException( "SHARPI_KV_DTYPE=q8_0 requires every GPU-resident layer's kvDim (kvHeads × headDim) to be a multiple " + "of 32 (the q8_0 block size); this model's geometry is incompatible. Use bf16 or fp32."); } } }

References

Keep VRAM reserve and estimation calculations exactly in sync with actual memory allocations, and write unit tests to pin these calculations to prevent regressions.

Fixed in 0be6003-followup: Q8KvGeometrySupported now takes a gpuLayers scope param (default -1 = all, mirroring EstimateKvCacheBytes); the hybrid passes _nGpuLayers so only the GPU-resident (narrowed) layers are checked — a CPU-tail layer's geometry no longer falsely rejects q8, and gpuLayers==0 returns true. Pinned by Q8KvGeometry_GpuLayersScope_IgnoresCpuTailLayers.

Addresses the #232 review (test-coverage gaps; no production defects found by the correctness/silent-failure lenses): - Vacuous-pass guard: new internal CudaHybridForwardPass.KvCacheDType observable; the parity oracle now asserts the requested dtype actually applied (else fp32-vs-fp32 would pass trivially if the env plumbing regressed). - >4096 WAVE path coverage: Coder30B_Q8Kv_Wave_ArgmaxStable prefills >4096 tokens in one call so PrefillBatchedTrunk takes the AttentionBatchedWaveQ8_0 branch (was only manual CLI). - Greedy (non-teacher-forced) coherence: catches an #188-style narrowed-KV self-decode collapse that teacher-forcing masks. Uses each model's OWN GGUF chat template (ApplyChatTemplate via tokenizer.ChatTemplate.Render + add_generation_prompt) — a raw continuation prompt collapses an instruct model regardless of KV dtype (the 'prompt must match chat template' trap); OLMoE now passes with its template. - Tightened the skip: q8_0 is supported for both test models (kvDim%32==0), so a NotSupportedException is now a real failure, not a silent skip. Hybrid KV-dtype suite 7/7 green (4 parity bf16/q8 × OLMoE/Coder + wave-q8 + 2 greedy-coherent). Gemma4-on-hybrid narrowed decode reuses the dense-tested AttentionSwa{Bf16,Q8_0} kernels via the trivial *Kv dispatch; not separately tested (needs a synthetic gemma4 hybrid split). Release clean under TreatWarningsAsErrors + AOT. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…HIGH) gemini flagged that CudaForwardPass.Q8KvGeometrySupported(hp) checks ALL layers, but CudaHybridForwardPass narrows only the first _nGpuLayers (CPU-offloaded layers keep fp32), so checking all layers would falsely reject q8 when only a CPU-tail layer has incompatible geometry (and matters for per-layer-head_dim gemma4-on-hybrid). Add a gpuLayers scope param to Q8KvGeometrySupported (mirrors EstimateKvCacheBytes's gpuLayers) and pass _nGpuLayers; gpuLayers==0 returns true (nothing narrowed). New GGUF-free unit test Q8KvGeometry_GpuLayersScope_IgnoresCpuTailLayers pins it. 3/3 Q8KvGeometry green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

pekkah · 2026-06-13T14:14:07Z

Review cycle complete

Three lenses (correctness, silent-failure, test-coverage) + gemini.

Correctness + silent-failure: no defects. Confirmed every GPU KV append/attention dispatches on _kvDType (per-token decode incl. gemma4 SWA, batched prefill ≤4096 + >4096 wave); TQ path stays fp32; CPU/GPU dtype split is clean (no cross-tier read); guards fail loud; graph capture safe (kernels compiled in prefill before decode captures); plan/runtime now consistent.

Findings addressed:

gemini (HIGH): Q8KvGeometrySupported checked all layers, but only _nGpuLayers are narrowed → could falsely reject q8 on a CPU-tail layer. Added a gpuLayers scope param; the hybrid passes _nGpuLayers (gpuLayers==0 → true). Unit-tested (Q8KvGeometry_GpuLayersScope_IgnoresCpuTailLayers). (0e88db9)
test-coverage: added KvCacheDType observable + assert (vacuous-pass guard); Coder30B_Q8Kv_Wave_ArgmaxStable (>4096 wave path, was manual-only); greedy-coherence with chat-template support (ApplyChatTemplate renders the model's own tokenizer.chat_template) so OLMoE no longer hits the raw-prompt collapse; tightened the q8 skip to fail on an unexpected throw. (0be6003)

Final: hybrid KV-dtype suite 7/7 (4 parity bf16/q8 × OLMoE/Coder + wave-q8 + 2 greedy-coherent), Q8KvGeometry 3/3, fp32 CudaHybridBatchedPrefillTests unchanged (9/9). Release clean under TreatWarningsAsErrors + AOT. Gemma4-on-hybrid narrowed decode reuses the dense-tested AttentionSwa{Bf16,Q8_0} kernels via the trivial dispatch (documented; would need a synthetic gemma4 hybrid split to test separately). Merging.

gemini-code-assist Bot reviewed Jun 13, 2026

View reviewed changes

pekkah and others added 2 commits June 13, 2026 17:11

pekkah merged commit e501695 into master Jun 13, 2026
1 check passed

pekkah deleted the perf/hybrid-kvtype-230 branch June 13, 2026 14:14

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(cuda): honor --kv-type on CudaHybridForwardPass (layer-split MoE hybrid) (#230)#232

perf(cuda): honor --kv-type on CudaHybridForwardPass (layer-split MoE hybrid) (#230)#232
pekkah merged 3 commits into
masterfrom
perf/hybrid-kvtype-230

pekkah commented Jun 13, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 13, 2026

Uh oh!

pekkah Jun 13, 2026

Uh oh!

pekkah commented Jun 13, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

pekkah commented Jun 13, 2026

Summary

Fix

Result (4070 Ti 12 GB, Qwen3-Coder-30B-A3B -g -1 --kv-type q8_0)

Tests

Out of scope (noted in #230)

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 13, 2026

Choose a reason for hiding this comment

Uh oh!

pekkah Jun 13, 2026

Choose a reason for hiding this comment

Uh oh!

pekkah commented Jun 13, 2026

Review cycle complete

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Result (4070 Ti 12 GB, `Qwen3-Coder-30B-A3B -g -1 --kv-type q8_0`)