perf(cuda): honor --kv-type on CudaHybridForwardPass (layer-split MoE hybrid) (#230)#232
Conversation
… hybrid) (#230) --kv-type bf16|q8_0 (SHARPI_KV_DTYPE, #179) was silently ignored by the layer-split pure-attention MoE hybrid (Qwen3-Coder-30B, OLMoE): its GPU KV was allocated fp32 with no dtype arg, so the flag was a no-op. Worse, TierPlanner ALREADY prices the KV budget at the requested dtype (RunCommand passes ResolveConfiguredKvDType to Plan), so the runtime allocated a 4× larger fp32 cache than the planner reserved — a plan/runtime mismatch that could over-commit near the SLRU threshold. Wire the dtype through: resolve CudaForwardPass.ResolveConfiguredKvDType() into _kvDType, allocate the GPU-trunk _gpuKCache/_gpuVCache at it, and route every GPU KV append/attention through *Kv dispatch helpers that pick the narrowed kernels (#179): per-token decode (KvAppend/Attention, + gemma4 AttentionSwa), and batched prefill (KvAppendBatched, AttentionBatched, AttentionBatchedWave) — bf16/q8_0/fp32. Narrowing is scoped to the GPU-resident layers; CPU-offloaded layers keep their fp32 SimdKernels KV. Mirrors the dense path's guards: rejects narrowed + TurboQuant (TQ owns its ring) and q8_0 with a non-%32 kvDim geometry, both with clear errors instead of the silent fp32 no-op. Banner shows the KV tag. fp32 default is byte-identical (dispatch routes to the same kernels). Result (4070 Ti 12 GB, Qwen3-Coder-30B -g -1 --kv-type q8_0): KV 6144→1632 MB, free after weights 3652→6724 MiB (+3 GiB), expert-cache budget 3957→8469 MB; decode 24.4 t/s (= fp32), >4096 wave-q8 prefill (6169 tok) 26.3 t/s, no OOM. Tests: new CudaHybridKvDtypeTests — bf16 AND q8_0 teacher-forced argmax-stable vs fp32 (top-5 + bounded maxAbs) on Coder-30B and OLMoE (4/4 green). Existing fp32 CudaHybridBatchedPrefillTests unchanged (9/9). Release clean under TreatWarningsAsErrors + AOT. Out of scope (noted in #230): Vulkan (GpuForwardPass) and CPU (ForwardPass) still lack --kv-type and should warn rather than silently ignore — separate follow-up. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request adds support for narrowed KV-cache dtypes (bf16 and q8_0) on the layer-split hybrid path in CudaHybridForwardPass, resolving a budget mismatch between the tier planner and the runtime. It introduces dispatch helpers to route GPU-trunk KV operations to their dtype-specific kernels and adds comprehensive tests to verify argmax-stability. The reviewer points out that the q8_0 geometry check is overly restrictive because it validates all layers instead of only the GPU-resident layers, which could cause unnecessary exceptions for CPU-offloaded layers.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| _kvDType = _tqEnabled ? DType.Float32 : requestedKv; | ||
| if (_kvDType == DType.Q8_0 && !CudaForwardPass.Q8KvGeometrySupported(hp)) | ||
| throw new NotSupportedException( | ||
| "SHARPI_KV_DTYPE=q8_0 requires every layer's kvDim (kvHeads × headDim) to be a multiple " + | ||
| "of 32 (the q8_0 block size); this model's geometry is incompatible. Use bf16 or fp32."); |
There was a problem hiding this comment.
The CudaForwardPass.Q8KvGeometrySupported(hp) helper checks the geometry of all layers in the model. However, in CudaHybridForwardPass, only the first _nGpuLayers are resident on the GPU and allocated with the narrowed _kvDType (CPU-offloaded layers always use the standard Float32 KV cache). Checking all layers is overly restrictive and can cause a NotSupportedException to be thrown unnecessarily if a CPU-offloaded layer has an incompatible geometry, even when all GPU-resident layers are fully compatible.
We should restrict this geometry check to only the GPU-resident layers (0 to _nGpuLayers - 1). This also gracefully handles the case where _nGpuLayers == 0 without throwing an exception. Additionally, please ensure that VRAM reserve and estimation calculations are kept exactly in sync with these actual memory allocations, and write unit tests to pin these calculations to prevent regressions.
_kvDType = _tqEnabled ? DType.Float32 : requestedKv;
if (_kvDType == DType.Q8_0)
{
bool perLayerKv = hp.LayerHeadDim is not null;
for (int i = 0; i < _nGpuLayers; i++)
{
if (hp.KvSourceLayer is { } ksl && ksl[i] >= 0) continue;
int layerHd = perLayerKv ? hp.LayerHeadDim![i] : _headDim;
int layerKv = hp.LayerKvHeads is { } lkv ? lkv[i] : _numKvHeads;
if ((((long)layerKv * layerHd) & 31) != 0)
{
throw new NotSupportedException(
"SHARPI_KV_DTYPE=q8_0 requires every GPU-resident layer's kvDim (kvHeads × headDim) to be a multiple " +
"of 32 (the q8_0 block size); this model's geometry is incompatible. Use bf16 or fp32.");
}
}
}References
- Keep VRAM reserve and estimation calculations exactly in sync with actual memory allocations, and write unit tests to pin these calculations to prevent regressions.
There was a problem hiding this comment.
Fixed in 0be6003-followup: Q8KvGeometrySupported now takes a gpuLayers scope param (default -1 = all, mirroring EstimateKvCacheBytes); the hybrid passes _nGpuLayers so only the GPU-resident (narrowed) layers are checked — a CPU-tail layer's geometry no longer falsely rejects q8, and gpuLayers==0 returns true. Pinned by Q8KvGeometry_GpuLayersScope_IgnoresCpuTailLayers.
Addresses the #232 review (test-coverage gaps; no production defects found by the correctness/silent-failure lenses): - Vacuous-pass guard: new internal CudaHybridForwardPass.KvCacheDType observable; the parity oracle now asserts the requested dtype actually applied (else fp32-vs-fp32 would pass trivially if the env plumbing regressed). - >4096 WAVE path coverage: Coder30B_Q8Kv_Wave_ArgmaxStable prefills >4096 tokens in one call so PrefillBatchedTrunk takes the AttentionBatchedWaveQ8_0 branch (was only manual CLI). - Greedy (non-teacher-forced) coherence: catches an #188-style narrowed-KV self-decode collapse that teacher-forcing masks. Uses each model's OWN GGUF chat template (ApplyChatTemplate via tokenizer.ChatTemplate.Render + add_generation_prompt) — a raw continuation prompt collapses an instruct model regardless of KV dtype (the 'prompt must match chat template' trap); OLMoE now passes with its template. - Tightened the skip: q8_0 is supported for both test models (kvDim%32==0), so a NotSupportedException is now a real failure, not a silent skip. Hybrid KV-dtype suite 7/7 green (4 parity bf16/q8 × OLMoE/Coder + wave-q8 + 2 greedy-coherent). Gemma4-on-hybrid narrowed decode reuses the dense-tested AttentionSwa{Bf16,Q8_0} kernels via the trivial *Kv dispatch; not separately tested (needs a synthetic gemma4 hybrid split). Release clean under TreatWarningsAsErrors + AOT. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…HIGH) gemini flagged that CudaForwardPass.Q8KvGeometrySupported(hp) checks ALL layers, but CudaHybridForwardPass narrows only the first _nGpuLayers (CPU-offloaded layers keep fp32), so checking all layers would falsely reject q8 when only a CPU-tail layer has incompatible geometry (and matters for per-layer-head_dim gemma4-on-hybrid). Add a gpuLayers scope param to Q8KvGeometrySupported (mirrors EstimateKvCacheBytes's gpuLayers) and pass _nGpuLayers; gpuLayers==0 returns true (nothing narrowed). New GGUF-free unit test Q8KvGeometry_GpuLayersScope_IgnoresCpuTailLayers pins it. 3/3 Q8KvGeometry green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Review cycle completeThree lenses (correctness, silent-failure, test-coverage) + gemini. Correctness + silent-failure: no defects. Confirmed every GPU KV append/attention dispatches on Findings addressed:
Final: hybrid KV-dtype suite 7/7 (4 parity bf16/q8 × OLMoE/Coder + wave-q8 + 2 greedy-coherent), |
Summary
Fixes #230.
--kv-type bf16|q8_0(SHARPI_KV_DTYPE, #179) was silently ignored by the layer-split pure-attention MoE hybrid (CudaHybridForwardPass— Qwen3-Coder-30B, OLMoE): its GPU KV was allocated fp32 with no dtype arg, so the flag was a no-op. Worse, TierPlanner already prices the KV budget at the requested dtype (RunCommand passesResolveConfiguredKvDTypetoPlan), so the runtime allocated a 4× larger fp32 cache than the planner reserved — a plan/runtime mismatch that can over-commit near the SLRU threshold.Fix
Wire the dtype through: resolve
CudaForwardPass.ResolveConfiguredKvDType()into_kvDType, allocate the GPU-trunk_gpuKCache/_gpuVCacheat it, and route every GPU KV append/attention through*Kvdispatch helpers that pick the narrowed kernels (#179):KvAppend/Attention(+ gemma4AttentionSwa),KvAppendBatched,AttentionBatched,AttentionBatchedWave,each dispatching bf16 / q8_0 / fp32. Narrowing is scoped to the GPU-resident layers; CPU-offloaded layers keep their fp32
SimdKernelsKV. Mirrors the dense path's guards — rejects narrowed + TurboQuant (TQ owns its ring) and q8_0 with a non-%32 kvDim geometry, both with clear errors instead of the silent fp32 no-op. The banner shows the KV tag. fp32 default is byte-identical (dispatch routes to the same kernels).Result (4070 Ti 12 GB,
Qwen3-Coder-30B-A3B -g -1 --kv-type q8_0)The freed VRAM more than doubles the SLRU expert-cache budget (and the runtime now matches what the planner reserved).
Tests
New
CudaHybridKvDtypeTests— bf16 and q8_0 teacher-forced argmax-stable vs fp32 (fp32 top-1 ∈ narrowed top-5 + bounded logit max-abs) on Coder-30B and OLMoE, 4/4 green. Existing fp32CudaHybridBatchedPrefillTestsunchanged (9/9). Release build clean under TreatWarningsAsErrors + AOT.Out of scope (noted in #230)
Vulkan (
GpuForwardPass) and CPU (ForwardPass) still lack--kv-type(CPU's compression is--tq); they should warn rather than silently ignore — separate follow-up.🤖 Generated with Claude Code