Skip to content

perf(cuda): honor --kv-type on CudaHybridForwardPass (layer-split MoE hybrid) (#230)#232

Merged
pekkah merged 3 commits into
masterfrom
perf/hybrid-kvtype-230
Jun 13, 2026
Merged

perf(cuda): honor --kv-type on CudaHybridForwardPass (layer-split MoE hybrid) (#230)#232
pekkah merged 3 commits into
masterfrom
perf/hybrid-kvtype-230

Conversation

@pekkah

@pekkah pekkah commented Jun 13, 2026

Copy link
Copy Markdown
Owner

Summary

Fixes #230. --kv-type bf16|q8_0 (SHARPI_KV_DTYPE, #179) was silently ignored by the layer-split pure-attention MoE hybrid (CudaHybridForwardPass — Qwen3-Coder-30B, OLMoE): its GPU KV was allocated fp32 with no dtype arg, so the flag was a no-op. Worse, TierPlanner already prices the KV budget at the requested dtype (RunCommand passes ResolveConfiguredKvDType to Plan), so the runtime allocated a 4× larger fp32 cache than the planner reserved — a plan/runtime mismatch that can over-commit near the SLRU threshold.

Fix

Wire the dtype through: resolve CudaForwardPass.ResolveConfiguredKvDType() into _kvDType, allocate the GPU-trunk _gpuKCache/_gpuVCache at it, and route every GPU KV append/attention through *Kv dispatch helpers that pick the narrowed kernels (#179):

  • per-token decode — KvAppend/Attention (+ gemma4 AttentionSwa),
  • batched prefill — KvAppendBatched, AttentionBatched, AttentionBatchedWave,

each dispatching bf16 / q8_0 / fp32. Narrowing is scoped to the GPU-resident layers; CPU-offloaded layers keep their fp32 SimdKernels KV. Mirrors the dense path's guards — rejects narrowed + TurboQuant (TQ owns its ring) and q8_0 with a non-%32 kvDim geometry, both with clear errors instead of the silent fp32 no-op. The banner shows the KV tag. fp32 default is byte-identical (dispatch routes to the same kernels).

Result (4070 Ti 12 GB, Qwen3-Coder-30B-A3B -g -1 --kv-type q8_0)

fp32 q8_0
GPU KV 6144 MB 1632 MB
free after weights 3652 MiB 6724 MiB (+3 GiB)
expert-cache budget 3957 MB 8469 MB
decode 24.3 t/s 24.4 t/s
>4096 wave-q8 prefill (6169 tok) 26.3 t/s, no OOM

The freed VRAM more than doubles the SLRU expert-cache budget (and the runtime now matches what the planner reserved).

Tests

New CudaHybridKvDtypeTests — bf16 and q8_0 teacher-forced argmax-stable vs fp32 (fp32 top-1 ∈ narrowed top-5 + bounded logit max-abs) on Coder-30B and OLMoE, 4/4 green. Existing fp32 CudaHybridBatchedPrefillTests unchanged (9/9). Release build clean under TreatWarningsAsErrors + AOT.

Out of scope (noted in #230)

Vulkan (GpuForwardPass) and CPU (ForwardPass) still lack --kv-type (CPU's compression is --tq); they should warn rather than silently ignore — separate follow-up.

🤖 Generated with Claude Code

… hybrid) (#230)

--kv-type bf16|q8_0 (SHARPI_KV_DTYPE, #179) was silently ignored by the layer-split
pure-attention MoE hybrid (Qwen3-Coder-30B, OLMoE): its GPU KV was allocated fp32 with
no dtype arg, so the flag was a no-op. Worse, TierPlanner ALREADY prices the KV budget at
the requested dtype (RunCommand passes ResolveConfiguredKvDType to Plan), so the runtime
allocated a 4× larger fp32 cache than the planner reserved — a plan/runtime mismatch that
could over-commit near the SLRU threshold.

Wire the dtype through: resolve CudaForwardPass.ResolveConfiguredKvDType() into _kvDType,
allocate the GPU-trunk _gpuKCache/_gpuVCache at it, and route every GPU KV append/attention
through *Kv dispatch helpers that pick the narrowed kernels (#179): per-token decode
(KvAppend/Attention, + gemma4 AttentionSwa), and batched prefill (KvAppendBatched,
AttentionBatched, AttentionBatchedWave) — bf16/q8_0/fp32. Narrowing is scoped to the
GPU-resident layers; CPU-offloaded layers keep their fp32 SimdKernels KV. Mirrors the dense
path's guards: rejects narrowed + TurboQuant (TQ owns its ring) and q8_0 with a non-%32
kvDim geometry, both with clear errors instead of the silent fp32 no-op. Banner shows the KV
tag. fp32 default is byte-identical (dispatch routes to the same kernels).

Result (4070 Ti 12 GB, Qwen3-Coder-30B -g -1 --kv-type q8_0): KV 6144→1632 MB, free after
weights 3652→6724 MiB (+3 GiB), expert-cache budget 3957→8469 MB; decode 24.4 t/s (= fp32),
>4096 wave-q8 prefill (6169 tok) 26.3 t/s, no OOM.

Tests: new CudaHybridKvDtypeTests — bf16 AND q8_0 teacher-forced argmax-stable vs fp32
(top-5 + bounded maxAbs) on Coder-30B and OLMoE (4/4 green). Existing fp32
CudaHybridBatchedPrefillTests unchanged (9/9). Release clean under TreatWarningsAsErrors + AOT.

Out of scope (noted in #230): Vulkan (GpuForwardPass) and CPU (ForwardPass) still lack
--kv-type and should warn rather than silently ignore — separate follow-up.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request adds support for narrowed KV-cache dtypes (bf16 and q8_0) on the layer-split hybrid path in CudaHybridForwardPass, resolving a budget mismatch between the tier planner and the runtime. It introduces dispatch helpers to route GPU-trunk KV operations to their dtype-specific kernels and adds comprehensive tests to verify argmax-stability. The reviewer points out that the q8_0 geometry check is overly restrictive because it validates all layers instead of only the GPU-resident layers, which could cause unnecessary exceptions for CPU-offloaded layers.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +317 to +321
_kvDType = _tqEnabled ? DType.Float32 : requestedKv;
if (_kvDType == DType.Q8_0 && !CudaForwardPass.Q8KvGeometrySupported(hp))
throw new NotSupportedException(
"SHARPI_KV_DTYPE=q8_0 requires every layer's kvDim (kvHeads × headDim) to be a multiple " +
"of 32 (the q8_0 block size); this model's geometry is incompatible. Use bf16 or fp32.");

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The CudaForwardPass.Q8KvGeometrySupported(hp) helper checks the geometry of all layers in the model. However, in CudaHybridForwardPass, only the first _nGpuLayers are resident on the GPU and allocated with the narrowed _kvDType (CPU-offloaded layers always use the standard Float32 KV cache). Checking all layers is overly restrictive and can cause a NotSupportedException to be thrown unnecessarily if a CPU-offloaded layer has an incompatible geometry, even when all GPU-resident layers are fully compatible.

We should restrict this geometry check to only the GPU-resident layers (0 to _nGpuLayers - 1). This also gracefully handles the case where _nGpuLayers == 0 without throwing an exception. Additionally, please ensure that VRAM reserve and estimation calculations are kept exactly in sync with these actual memory allocations, and write unit tests to pin these calculations to prevent regressions.

        _kvDType = _tqEnabled ? DType.Float32 : requestedKv;
        if (_kvDType == DType.Q8_0)
        {
            bool perLayerKv = hp.LayerHeadDim is not null;
            for (int i = 0; i < _nGpuLayers; i++)
            {
                if (hp.KvSourceLayer is { } ksl && ksl[i] >= 0) continue;
                int layerHd = perLayerKv ? hp.LayerHeadDim![i] : _headDim;
                int layerKv = hp.LayerKvHeads is { } lkv ? lkv[i] : _numKvHeads;
                if ((((long)layerKv * layerHd) & 31) != 0)
                {
                    throw new NotSupportedException(
                        "SHARPI_KV_DTYPE=q8_0 requires every GPU-resident layer's kvDim (kvHeads × headDim) to be a multiple " +
                        "of 32 (the q8_0 block size); this model's geometry is incompatible. Use bf16 or fp32.");
                }
            }
        }
References
  1. Keep VRAM reserve and estimation calculations exactly in sync with actual memory allocations, and write unit tests to pin these calculations to prevent regressions.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 0be6003-followup: Q8KvGeometrySupported now takes a gpuLayers scope param (default -1 = all, mirroring EstimateKvCacheBytes); the hybrid passes _nGpuLayers so only the GPU-resident (narrowed) layers are checked — a CPU-tail layer's geometry no longer falsely rejects q8, and gpuLayers==0 returns true. Pinned by Q8KvGeometry_GpuLayersScope_IgnoresCpuTailLayers.

pekkah and others added 2 commits June 13, 2026 17:11
Addresses the #232 review (test-coverage gaps; no production defects found by the
correctness/silent-failure lenses):
- Vacuous-pass guard: new internal CudaHybridForwardPass.KvCacheDType observable;
  the parity oracle now asserts the requested dtype actually applied (else fp32-vs-fp32
  would pass trivially if the env plumbing regressed).
- >4096 WAVE path coverage: Coder30B_Q8Kv_Wave_ArgmaxStable prefills >4096 tokens in one
  call so PrefillBatchedTrunk takes the AttentionBatchedWaveQ8_0 branch (was only manual CLI).
- Greedy (non-teacher-forced) coherence: catches an #188-style narrowed-KV self-decode
  collapse that teacher-forcing masks. Uses each model's OWN GGUF chat template
  (ApplyChatTemplate via tokenizer.ChatTemplate.Render + add_generation_prompt) — a raw
  continuation prompt collapses an instruct model regardless of KV dtype (the 'prompt must
  match chat template' trap); OLMoE now passes with its template.
- Tightened the skip: q8_0 is supported for both test models (kvDim%32==0), so a
  NotSupportedException is now a real failure, not a silent skip.

Hybrid KV-dtype suite 7/7 green (4 parity bf16/q8 × OLMoE/Coder + wave-q8 + 2 greedy-coherent).
Gemma4-on-hybrid narrowed decode reuses the dense-tested AttentionSwa{Bf16,Q8_0} kernels via
the trivial *Kv dispatch; not separately tested (needs a synthetic gemma4 hybrid split).
Release clean under TreatWarningsAsErrors + AOT.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…HIGH)

gemini flagged that CudaForwardPass.Q8KvGeometrySupported(hp) checks ALL layers, but
CudaHybridForwardPass narrows only the first _nGpuLayers (CPU-offloaded layers keep fp32),
so checking all layers would falsely reject q8 when only a CPU-tail layer has incompatible
geometry (and matters for per-layer-head_dim gemma4-on-hybrid). Add a gpuLayers scope param
to Q8KvGeometrySupported (mirrors EstimateKvCacheBytes's gpuLayers) and pass _nGpuLayers;
gpuLayers==0 returns true (nothing narrowed). New GGUF-free unit test
Q8KvGeometry_GpuLayersScope_IgnoresCpuTailLayers pins it. 3/3 Q8KvGeometry green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@pekkah

pekkah commented Jun 13, 2026

Copy link
Copy Markdown
Owner Author

Review cycle complete

Three lenses (correctness, silent-failure, test-coverage) + gemini.

Correctness + silent-failure: no defects. Confirmed every GPU KV append/attention dispatches on _kvDType (per-token decode incl. gemma4 SWA, batched prefill ≤4096 + >4096 wave); TQ path stays fp32; CPU/GPU dtype split is clean (no cross-tier read); guards fail loud; graph capture safe (kernels compiled in prefill before decode captures); plan/runtime now consistent.

Findings addressed:

  • gemini (HIGH): Q8KvGeometrySupported checked all layers, but only _nGpuLayers are narrowed → could falsely reject q8 on a CPU-tail layer. Added a gpuLayers scope param; the hybrid passes _nGpuLayers (gpuLayers==0 → true). Unit-tested (Q8KvGeometry_GpuLayersScope_IgnoresCpuTailLayers). (0e88db9)
  • test-coverage: added KvCacheDType observable + assert (vacuous-pass guard); Coder30B_Q8Kv_Wave_ArgmaxStable (>4096 wave path, was manual-only); greedy-coherence with chat-template support (ApplyChatTemplate renders the model's own tokenizer.chat_template) so OLMoE no longer hits the raw-prompt collapse; tightened the q8 skip to fail on an unexpected throw. (0be6003)

Final: hybrid KV-dtype suite 7/7 (4 parity bf16/q8 × OLMoE/Coder + wave-q8 + 2 greedy-coherent), Q8KvGeometry 3/3, fp32 CudaHybridBatchedPrefillTests unchanged (9/9). Release clean under TreatWarningsAsErrors + AOT. Gemma4-on-hybrid narrowed decode reuses the dense-tested AttentionSwa{Bf16,Q8_0} kernels via the trivial dispatch (documented; would need a synthetic gemma4 hybrid split to test separately). Merging.

@pekkah pekkah merged commit e501695 into master Jun 13, 2026
1 check passed
@pekkah pekkah deleted the perf/hybrid-kvtype-230 branch June 13, 2026 14:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

perf/bug(cuda): --kv-type silently ignored by CudaHybridForwardPass (layer-split MoE hybrid) — Coder-30B/OLMoE always get fp32 KV

1 participant