feat(cuda): auto-narrow KV dtype when fp32 won't fit VRAM (#185)#186
Conversation
Previously --kv-type defaulted to fp32 and an oversized -c died at construction with "cudaMalloc failed: 2" — capping context at fp32 even though a narrowed store would fit. The operator had to know to pass --kv-type bf16/q8_0 to reach long context. CudaForwardPass now mirrors the auto-SnapKV VRAM heuristic: when the operator set NEITHER an explicit --kv-type/SHARPI_KV_DTYPE NOR an explicit SnapKV budget, TQ is off, and the resolved context's fp32 KV cache won't fit the VRAM budget, it auto-selects a narrowed dtype (bf16 preferred; q8_0 if bf16 still won't fit and the geometry supports it) and logs the choice instead of erroring. Precedence vs auto-SnapKV: auto-narrow runs first and wins. SnapKV does NOT shrink the construction-time allocation (the cache is allocated full-maxSeqLen up front; eviction only bounds the logical length at runtime), so it cannot avert the cudaMalloc failure — only narrowing the element width can. Narrowing is also exact-context, argmax-stable, and full-speed-prefill, whereas SnapKV evicts tokens. Explicit choices are never overridden (explicit fp32 still errors loudly; explicit SnapKV is honoured). Decision factored into pure helpers — ResolveKvDType / EstimateKvCacheBytes / Q8KvGeometrySupported / EstimateAvailableKvVram (extracted from EstimateMaxContext, single-sourcing the budget) — unit-tested without a GPU or model (9 new tests). EstimateKvCacheBytes rounds each K/V buffer to the next power of two to match the GPU buffer pool (gpu.Allocate, exact:false); a raw byte sum undercounts by up to ~2x per buffer (q8_0's 34-byte blocks rarely land on a power of two) and could wrongly conclude fp32 fits. The server already forwards SharpInferenceServerOptions.KvType -> SHARPI_KV_DTYPE, so no InferenceEngineLoader change is needed. Validated: filter "Gemma4CudaBatchedPrefill|Qwen3CudaBatchedPrefill|KvDtype" 39/39 green (40m). E2E: gemma-4-12b -g -1 -c 131072 with no --kv-type now auto-narrows to q8_0 and decodes coherently at full speed (48.5 t/s) instead of cudaMalloc-failing. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request implements an auto-narrowing heuristic for the KV cache data type (bf16 or q8_0) when the default fp32 cache exceeds the estimated available VRAM budget, preventing cudaMalloc failures on oversized contexts. It also adds comprehensive unit tests for this logic. The reviewer identified a critical issue where EstimateAvailableKvVram relies on EstimateGpuTensorBytes, which is missing a check for DType.Q4_0 in its raw-upload condition. This omission causes a massive overestimation of weight VRAM for models using Q4_0, severely underestimating the available VRAM for the KV cache and breaking the auto-narrowing logic.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| /// footprint against. Single-sourced so both stay in agreement. | ||
| /// </summary> | ||
| public static int EstimateMaxContext(GgufModel model, CudaBackend gpu, ModelHyperparams hp) | ||
| internal static long EstimateAvailableKvVram(GgufModel model, CudaBackend gpu, ModelHyperparams hp) |
There was a problem hiding this comment.
In EstimateAvailableKvVram, the weight bytes are estimated using EstimateGpuTensorBytes(t). However, EstimateGpuTensorBytes is missing a check for DType.Q4_0 in its raw-upload condition.\n\nSince Q4_0 weights are uploaded raw on the GPU (as seen in UploadWeight at line 3064), they should be estimated using their native ByteSize rather than falling back to ElementCount * sizeof(float) (which assumes 4 bytes per element). For models like Gemma 4 12B QAT that use Q4_0 for bulk weights, this omission causes a massive overestimation of weight VRAM (by ~7x), which in turn severely underestimates the available VRAM for the KV cache and breaks the auto-narrowing/context estimation logic.\n\nPlease update EstimateGpuTensorBytes to include DType.Q4_0 in the raw-upload check:\ncsharp\nif (tensor.DType == DType.Float32 || tensor.DType == DType.Q4_K\n || tensor.DType == DType.Q6_K || tensor.DType == DType.Q8_0 || tensor.DType == DType.Q4_0)\n
Addresses findings from the pr-review cycle + Gemini Code Assist on #186: - EstimateGpuTensorBytes was missing DType.Q4_0 from its raw-upload set, so Q4_0 weights (the Gemma 4 12B QAT dtype) were counted as fp32 — a ~7× over-count that drove the KV budget to its 64 MiB floor and made the new auto-narrow heuristic narrow on a degenerate budget. The set now matches UploadWeight's raw branch exactly ({Float32, Q4_0, Q4_K, Q6_K, Q8_0}); Q5_K and others stay fp32-counted because UploadWeight CPU-dequantizes them. E2E: 12B budget 64 MiB → ~1549 MiB (realistic). - Auto-narrow was suppressed by SHARPI_SNAPKV_BUDGET=0, but that value means "disable SnapKV" (IsBudgetExplicit=true, Budget=0) — the same disable knob the banners advertise. Gating on (IsBudgetExplicit && Budget > 0) so a SnapKV-disable no longer silently reintroduces the cudaMalloc failure. - Tests: gemma4-shaped EstimateKvCacheBytes (aliased-skip + SWA-ring cap + per-layer dims, independently hand-computed), mixed-layer Q8 geometry (any single violator fails; aliased violator skipped), and ResolveKvDType inclusive-fit boundary. 12/12 pure tests green. Validated: pure tests 12/12; Qwen3-8B bf16/q8 parity 3/3; E2E 12B -c 131072 narrows to q8_0 (47.4 t/s) and -c 8192 narrows to q8_0 (fp32 5376 MiB still > 1549 budget — correct, the SWA ring headroom keeps 12B KV large). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Addressed the review round in e7dbf05: Gemini Code Assist (critical): confirmed and fixed — silent-failure-hunter (important): pr-test-analyzer (coverage): added the gemma4-shaped Validation: 12/12 pure tests; Qwen3-8B bf16/q8 parity 3/3; E2E 12B |
Closes the auto-narrow half of #185 (item 1). Follow-up to #179/#184 (bf16 + q8_0 KV cache for the dense CUDA path).
Problem
--kv-typedefaulted tofp32; an oversized-cdied at construction withcudaMalloc failed: 2— capping context at fp32 even though a narrowed store would fit. Operators had to know to pass--kv-type bf16/q8_0to reach long context.Change
CudaForwardPassnow mirrors the auto-SnapKV VRAM heuristic. When the operator set neither an explicit--kv-type/SHARPI_KV_DTYPEnor an explicit SnapKV budget, TQ is off, and the resolved context's fp32 KV cache won't fit the VRAM budget, it auto-selects a narrowed dtype (bf16 preferred; q8_0 if bf16 still won't fit and the geometry supports it) and logs the choice instead of erroring.Precedence vs auto-SnapKV (documented in code)
Auto-narrow runs first and wins. SnapKV does not shrink the construction-time allocation (the cache is allocated full-
maxSeqLenup front; eviction only bounds the logical length at runtime), so it cannot avert thecudaMallocfailure — only narrowing the element width can. Narrowing is also exact-context, argmax-stable, and full-speed-prefill, whereas SnapKV evicts tokens (lossy).Explicit choices are never overridden: an explicit
fp32still errors loudly at allocation rather than silently narrowing; an explicit SnapKV budget is honoured (auto-narrow stands down) even though it can't avert the alloc failure — the operator's call.Implementation
ResolveKvDType/EstimateKvCacheBytes/Q8KvGeometrySupported/EstimateAvailableKvVram(extracted fromEstimateMaxContext, single-sourcing the VRAM budget so estimator and heuristic stay in agreement). All unit-testable without a GPU or model — 9 new tests.EstimateKvCacheBytesrounds each K/V buffer up to the next power of two to match the GPU buffer pool (gpu.Allocate,exact:false). A raw byte sum undercounts by up to ~2× per buffer (q8_0's 34-byte blocks rarely land on a power of two) and could wrongly conclude fp32 fits, defeating the narrow. (Caught in code review.)SharpInferenceServerOptions.KvType → SHARPI_KV_DTYPEis already forwarded; noInferenceEngineLoaderchange needed.Validation
Gemma4CudaBatchedPrefill|Qwen3CudaBatchedPrefill|KvDtype: 39/39 green (40m12s) — 9 new pure tests + all bf16/q8_0 parity, batched-prefill, and >4096 chunked-prefill cases.gemma-4-12b -g -1 -c 131072with no--kv-type(previouslycudaMalloc-fails at fp32) now logsKV auto-narrowed to Q8_0 for context 131072and decodes coherently — "The capital of France is Paris." — at full speed (48.5 t/s, no cliff).Not in this PR — #185 item 2 (Tc/half2 q8 flash thunks)
Remains blocked: every dense GGUF on disk has head_dim divisible by 64 (Qwen3=128, OLMoE=128, SmolLM=64, Qwen3.6=256, Gemma4=512). With no non-
%64head_dim model to validate against, templating the single-warp Tc / half2 flash kernels on<KV>would ship untested kernels. The per-token fallback covers correctness until a validation model exists.🤖 Generated with Claude Code