feat(cuda): auto-narrow KV dtype when fp32 won't fit VRAM (#185) by pekkah · Pull Request #186 · pekkah/SharpInference

pekkah · 2026-06-09T20:28:39Z

Closes the auto-narrow half of #185 (item 1). Follow-up to #179/#184 (bf16 + q8_0 KV cache for the dense CUDA path).

Problem

--kv-type defaulted to fp32, so an oversized -c died at construction with cudaMalloc failed: 2. Context was effectively capped at what an fp32 cache could hold, even though a narrowed store would fit. Operators had to know to pass --kv-type bf16/q8_0 to reach long context.

Change

CudaForwardPass now mirrors the auto-SnapKV VRAM heuristic. When the operator set neither an explicit --kv-type/SHARPI_KV_DTYPE nor an explicit SnapKV budget, TQ is off, and the resolved context's fp32 KV cache won't fit the VRAM budget, it auto-selects a narrowed dtype (bf16 preferred; q8_0 if bf16 still won't fit and the geometry supports it) and logs the choice instead of erroring.
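
As a rough illustration of the decision described above, here is a minimal sketch in C#. The enum, class, and parameter names are hypothetical and are not taken from the PR; the real ResolveKvDType may take different inputs. The branch for "bf16 does not fit and q8_0 geometry is unsupported" is an assumption, since the description does not say what happens in that case.

```csharp
using System;

// Hypothetical names throughout; the PR's ResolveKvDType may look different.
public enum KvDType { Fp32, Bf16, Q8_0 }

public static class KvAutoNarrowSketch
{
    public static KvDType Resolve(
        KvDType? explicitDType,     // --kv-type / SHARPI_KV_DTYPE, null when unset
        bool snapKvBudgetExplicit,  // operator chose a SnapKV budget
        bool tqEnabled,             // TQ on: leave the dtype alone
        long fp32Bytes,             // estimated fp32 KV footprint
        long bf16Bytes,             // estimated bf16 KV footprint
        bool q8GeometrySupported,   // result of a Q8KvGeometrySupported-style check
        long budgetBytes)           // estimated VRAM available for the KV cache
    {
        // Explicit choices are never overridden, even if allocation will fail.
        if (explicitDType.HasValue) return explicitDType.Value;

        // Auto-narrow stands down for an explicit SnapKV budget or when TQ is on.
        if (snapKvBudgetExplicit || tqEnabled) return KvDType.Fp32;

        // Inclusive fit: a cache exactly the size of the budget still fits.
        if (fp32Bytes <= budgetBytes) return KvDType.Fp32;
        if (bf16Bytes <= budgetBytes) return KvDType.Bf16;

        // bf16 does not fit either: use q8_0 when the geometry allows it.
        // Assumption: otherwise stay on bf16 as the narrowest supported option.
        return q8GeometrySupported ? KvDType.Q8_0 : KvDType.Bf16;
    }
}
```

Keeping the function free of GPU and model state is what makes this kind of logic unit-testable without hardware, which appears to be the motivation for the pure helpers listed under Implementation.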

Precedence vs auto-SnapKV (documented in code)

Auto-narrow runs first and wins. SnapKV does not shrink the construction-time allocation (the cache is allocated full-maxSeqLen up front; eviction only bounds the logical length at runtime), so it cannot avert the cudaMalloc failure — only narrowing the element width can. Narrowing is also exact-context, argmax-stable, and full-speed-prefill, whereas SnapKV evicts tokens (lossy).

Explicit choices are never overridden: an explicit fp32 still errors loudly at allocation rather than silently narrowing; an explicit SnapKV budget is honoured (auto-narrow stands down) even though it can't avert the alloc failure — the operator's call.

Implementation

Decision factored into pure helpers — ResolveKvDType / EstimateKvCacheBytes / Q8KvGeometrySupported / EstimateAvailableKvVram (extracted from EstimateMaxContext, single-sourcing the VRAM budget so estimator and heuristic stay in agreement). All unit-testable without a GPU or model — 9 new tests.
EstimateKvCacheBytes rounds each K/V buffer up to the next power of two to match the GPU buffer pool (gpu.Allocate, exact:false). A raw byte sum undercounts by up to ~2× per buffer (q8_0's 34-byte blocks rarely land on a power of two) and could wrongly conclude fp32 fits, defeating the narrow. (Caught in code review.)
Server: SharpInferenceServerOptions.KvType → SHARPI_KV_DTYPE is already forwarded; no InferenceEngineLoader change needed.
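
To make the pool-rounding point concrete, the following sketch estimates KV bytes with each buffer rounded up to the next power of two. It is not the PR's EstimateKvCacheBytes: the class and method names are hypothetical, it assumes one K and one V buffer per layer with uniform dimensions, and it ignores the aliased-layer and SWA-ring handling that the real helper is described as covering.

```csharp
using System;

public static class KvBytesSketch
{
    // GGML q8_0 layout: 32 int8 values plus a 2-byte fp16 scale per block.
    const int Q8BlockElements = 32;
    const int Q8BlockBytes = 34;

    // Raw bytes for one buffer. For q8_0 the element count is assumed to be
    // a multiple of 32 (the kind of constraint a geometry check would enforce).
    public static long RawBytes(string dtype, long elements)
    {
        switch (dtype)
        {
            case "fp32": return elements * 4;
            case "bf16": return elements * 2;
            case "q8_0": return elements / Q8BlockElements * Q8BlockBytes;
            default: throw new ArgumentException("unknown dtype", nameof(dtype));
        }
    }

    // Smallest power of two >= n, mimicking a pool that hands out
    // power-of-two-sized buffers for non-exact allocations.
    public static long NextPowerOfTwo(long n)
    {
        long p = 1;
        while (p < n) p <<= 1;
        return p;
    }

    // Assumption: one K and one V buffer per layer, each holding
    // maxSeqLen * kvHeads * headDim elements.
    public static long Estimate(string dtype, int layers, long maxSeqLen, int kvHeads, int headDim)
    {
        long perBuffer = NextPowerOfTwo(RawBytes(dtype, maxSeqLen * kvHeads * headDim));
        return 2L * layers * perBuffer;
    }
}
```

With 5000 positions, 8 KV heads and head_dim 128, a q8_0 buffer is 5,440,000 raw bytes but occupies 8,388,608 bytes once rounded, which is the kind of gap a raw byte sum would miss.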

Validation

Filter Gemma4CudaBatchedPrefill|Qwen3CudaBatchedPrefill|KvDtype: 39/39 green (40m12s) — 9 new pure tests + all bf16/q8_0 parity, batched-prefill, and >4096 chunked-prefill cases.
E2E: gemma-4-12b -g -1 -c 131072 with no --kv-type (previously cudaMalloc-fails at fp32) now logs KV auto-narrowed to Q8_0 for context 131072 and decodes coherently — "The capital of France is Paris." — at full speed (48.5 t/s, no cliff).
pr-review-toolkit: code-reviewer (1 Important fixed: pool-rounding undercount; + Minor/Nit) and code-simplifier (no changes) cycles run.

Not in this PR — #185 item 2 (Tc/half2 q8 flash thunks)

Remains blocked: every dense GGUF on disk has head_dim divisible by 64 (Qwen3=128, OLMoE=128, SmolLM=64, Qwen3.6=256, Gemma4=512). With no non-%64 head_dim model to validate against, templating the single-warp Tc / half2 flash kernels on <KV> would ship untested kernels. The per-token fallback covers correctness until a validation model exists.

🤖 Generated with Claude Code

Previously --kv-type defaulted to fp32 and an oversized -c died at construction with "cudaMalloc failed: 2", capping context at fp32 even though a narrowed store would fit. The operator had to know to pass --kv-type bf16/q8_0 to reach long context.

CudaForwardPass now mirrors the auto-SnapKV VRAM heuristic: when the operator set NEITHER an explicit --kv-type/SHARPI_KV_DTYPE NOR an explicit SnapKV budget, TQ is off, and the resolved context's fp32 KV cache won't fit the VRAM budget, it auto-selects a narrowed dtype (bf16 preferred; q8_0 if bf16 still won't fit and the geometry supports it) and logs the choice instead of erroring.

Precedence vs auto-SnapKV: auto-narrow runs first and wins. SnapKV does NOT shrink the construction-time allocation (the cache is allocated full-maxSeqLen up front; eviction only bounds the logical length at runtime), so it cannot avert the cudaMalloc failure; only narrowing the element width can. Narrowing is also exact-context, argmax-stable, and full-speed-prefill, whereas SnapKV evicts tokens. Explicit choices are never overridden (explicit fp32 still errors loudly; explicit SnapKV is honoured).

Decision factored into pure helpers: ResolveKvDType / EstimateKvCacheBytes / Q8KvGeometrySupported / EstimateAvailableKvVram (extracted from EstimateMaxContext, single-sourcing the budget), unit-tested without a GPU or model (9 new tests). EstimateKvCacheBytes rounds each K/V buffer to the next power of two to match the GPU buffer pool (gpu.Allocate, exact:false); a raw byte sum undercounts by up to ~2x per buffer (q8_0's 34-byte blocks rarely land on a power of two) and could wrongly conclude fp32 fits.

The server already forwards SharpInferenceServerOptions.KvType -> SHARPI_KV_DTYPE, so no InferenceEngineLoader change is needed.

Validated: filter "Gemma4CudaBatchedPrefill|Qwen3CudaBatchedPrefill|KvDtype" 39/39 green (40m). E2E: gemma-4-12b -g -1 -c 131072 with no --kv-type now auto-narrows to q8_0 and decodes coherently at full speed (48.5 t/s) instead of cudaMalloc-failing.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

gemini-code-assist

Code Review

This pull request implements an auto-narrowing heuristic for the KV cache data type (bf16 or q8_0) when the default fp32 cache exceeds the estimated available VRAM budget, preventing cudaMalloc failures on oversized contexts. It also adds comprehensive unit tests for this logic. The reviewer identified a critical issue where EstimateAvailableKvVram relies on EstimateGpuTensorBytes, which is missing a check for DType.Q4_0 in its raw-upload condition. This omission causes a massive overestimation of weight VRAM for models using Q4_0, severely underestimating the available VRAM for the KV cache and breaking the auto-narrowing logic.


gemini-code-assist · 2026-06-09T20:31:10Z

+    /// footprint against. Single-sourced so both stay in agreement.
    /// </summary>
-    public static int EstimateMaxContext(GgufModel model, CudaBackend gpu, ModelHyperparams hp)
+    internal static long EstimateAvailableKvVram(GgufModel model, CudaBackend gpu, ModelHyperparams hp)


In EstimateAvailableKvVram, the weight bytes are estimated using EstimateGpuTensorBytes(t). However, EstimateGpuTensorBytes is missing a check for DType.Q4_0 in its raw-upload condition.

Since Q4_0 weights are uploaded raw to the GPU (as seen in UploadWeight at line 3064), they should be estimated using their native ByteSize rather than falling back to ElementCount * sizeof(float) (which assumes 4 bytes per element). For models like Gemma 4 12B QAT that use Q4_0 for bulk weights, this omission causes a massive overestimation of weight VRAM (by ~7x), which in turn severely underestimates the available VRAM for the KV cache and breaks the auto-narrowing/context estimation logic.

Please update EstimateGpuTensorBytes to include DType.Q4_0 in the raw-upload check:

```csharp
if (tensor.DType == DType.Float32 || tensor.DType == DType.Q4_K
    || tensor.DType == DType.Q6_K || tensor.DType == DType.Q8_0 || tensor.DType == DType.Q4_0)
```

Addresses findings from the pr-review cycle + Gemini Code Assist on #186:

- EstimateGpuTensorBytes was missing DType.Q4_0 from its raw-upload set, so Q4_0 weights (the Gemma 4 12B QAT dtype) were counted as fp32: a ~7× over-count that drove the KV budget to its 64 MiB floor and made the new auto-narrow heuristic narrow on a degenerate budget. The set now matches UploadWeight's raw branch exactly ({Float32, Q4_0, Q4_K, Q6_K, Q8_0}); Q5_K and others stay fp32-counted because UploadWeight CPU-dequantizes them. E2E: 12B budget 64 MiB → ~1549 MiB (realistic).
- Auto-narrow was suppressed by SHARPI_SNAPKV_BUDGET=0, but that value means "disable SnapKV" (IsBudgetExplicit=true, Budget=0), the same disable knob the banners advertise. Gating on (IsBudgetExplicit && Budget > 0) so a SnapKV-disable no longer silently reintroduces the cudaMalloc failure.
- Tests: gemma4-shaped EstimateKvCacheBytes (aliased-skip + SWA-ring cap + per-layer dims, independently hand-computed), mixed-layer Q8 geometry (any single violator fails; aliased violator skipped), and ResolveKvDType inclusive-fit boundary. 12/12 pure tests green.

Validated: pure tests 12/12; Qwen3-8B bf16/q8 parity 3/3; E2E 12B -c 131072 narrows to q8_0 (47.4 t/s) and -c 8192 narrows to q8_0 (fp32 5376 MiB still > 1549 budget; correct, the SWA ring headroom keeps 12B KV large).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

pekkah · 2026-06-09T20:37:03Z

Addressed the review round in e7dbf05:

Gemini Code Assist (critical): confirmed and fixed — EstimateGpuTensorBytes was missing DType.Q4_0 from its raw-upload set, counting Q4_0 weights (the Gemma 4 12B QAT dtype) as fp32. That ~7× over-count drove the KV budget to its 64 MiB floor, so auto-narrow was deciding on a degenerate budget. The set now matches UploadWeight's raw branch exactly ({Float32, Q4_0, Q4_K, Q6_K, Q8_0}); Q5_K and others correctly stay fp32-counted since UploadWeight CPU-dequantizes them. E2E confirms the 12B budget went 64 MiB → ~1549 MiB (realistic).
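
The shape of that fix can be sketched as follows. DType, ByteSize and ElementCount follow the names used in the review comment; TensorInfo and WeightBytesSketch are hypothetical stand-ins, and the real EstimateGpuTensorBytes may differ in signature and detail.

```csharp
using System;

// Only the dtypes mentioned in this thread are listed; the real enum is larger.
public enum DType { Float32, Q4_0, Q4_K, Q5_K, Q6_K, Q8_0 }

// Hypothetical stand-in for the project's tensor metadata type.
public sealed class TensorInfo
{
    public DType DType { get; }
    public long ElementCount { get; }
    public long ByteSize { get; }   // on-disk (native) size in bytes

    public TensorInfo(DType dtype, long elementCount, long byteSize)
    {
        DType = dtype;
        ElementCount = elementCount;
        ByteSize = byteSize;
    }
}

public static class WeightBytesSketch
{
    // Dtypes uploaded to the GPU as-is, per the set quoted in the comment above.
    public static bool IsRawUpload(DType d) =>
        d == DType.Float32 || d == DType.Q4_0 || d == DType.Q4_K
        || d == DType.Q6_K || d == DType.Q8_0;

    // Raw uploads cost their native size; everything else is assumed to be
    // dequantized on the CPU and therefore stored as fp32 on the GPU.
    public static long EstimateGpuTensorBytes(TensorInfo t) =>
        IsRawUpload(t.DType) ? t.ByteSize : t.ElementCount * sizeof(float);
}
```

In GGML's Q4_0 layout a block of 32 elements takes 18 bytes, while counting it as fp32 charges 128 bytes, a ratio of about 7.1 that is consistent with the ~7× over-count reported here.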

silent-failure-hunter (important): SHARPI_SNAPKV_BUDGET=0 (the "disable SnapKV" knob the banners advertise) was suppressing auto-narrow because it sets IsBudgetExplicit=true. Now gated on (IsBudgetExplicit && Budget > 0), matching the disable semantics used by the SnapKV throws/auto-enable — so disabling SnapKV no longer silently reintroduces the cudaMalloc failure.
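
A small sketch of the gating change, assuming a hypothetical SnapKvOptions type that carries the IsBudgetExplicit and Budget values named above (the real type and its location are not shown in this thread):

```csharp
using System;

// Hypothetical container; only IsBudgetExplicit and Budget come from the PR text.
public readonly struct SnapKvOptions
{
    public bool IsBudgetExplicit { get; }
    public int Budget { get; }

    public SnapKvOptions(bool isBudgetExplicit, int budget)
    {
        IsBudgetExplicit = isBudgetExplicit;
        Budget = budget;
    }
}

public static class SnapKvGateSketch
{
    // Before the fix: any explicit value, including 0 ("disable SnapKV"),
    // suppressed auto-narrow.
    public static bool SuppressesAutoNarrowBefore(SnapKvOptions o) =>
        o.IsBudgetExplicit;

    // After the fix: only a positive budget counts as the operator opting
    // into SnapKV, so SHARPI_SNAPKV_BUDGET=0 leaves auto-narrow active.
    public static bool SuppressesAutoNarrow(SnapKvOptions o) =>
        o.IsBudgetExplicit && o.Budget > 0;
}
```

The only input whose result changes is an explicit budget of 0, which previously suppressed auto-narrow and now does not.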

pr-test-analyzer (coverage): added the gemma4-shaped EstimateKvCacheBytes test (KvSourceLayer alias-skip + SWA-ring cap + per-layer head_dim/kv-heads, hand-computed independently of the implementation), a mixed-layer Q8KvGeometrySupported test (any single %32 violator fails; an aliased violator is skipped), and a ResolveKvDType inclusive-fit boundary test.

Validation: 12/12 pure tests; Qwen3-8B bf16/q8 parity 3/3; E2E 12B -c 131072 → q8_0 @ 47.4 t/s, -c 8192 → q8_0 (fp32 5376 MiB still exceeds the 1549 budget — correct: the 4096 SWA-ring headroom keeps 12B KV large even at moderate context). The full Gemma4CudaBatchedPrefill|Qwen3CudaBatchedPrefill|KvDtype filter (39/39) was green before this round; these fixes don't touch the parity path (explicit ctx + explicit dtype bypass both EstimateMaxContext and the auto-narrow block).

pekkah mentioned this pull request Jun 9, 2026

q8_0/bf16 KV follow-ups: auto-narrow default + Tc/half2-flash q8 thunks (#179) #185

Closed

gemini-code-assist Bot reviewed Jun 9, 2026


pekkah merged commit dbb4e78 into master Jun 9, 2026
1 check passed

pekkah deleted the feat/auto-narrow-kv-185 branch June 9, 2026 20:38

pekkah merged 2 commits into master from feat/auto-narrow-kv-185
