feat(cuda): auto-narrow KV dtype when fp32 won't fit VRAM (#185) #186

Merged
pekkah merged 2 commits into master from feat/auto-narrow-kv-185
Jun 9, 2026

pekkah (Owner) commented Jun 9, 2026

Closes the auto-narrow half of #185 (item 1). Follow-up to #179/#184 (bf16 + q8_0 KV cache for the dense CUDA path).

Problem

--kv-type defaulted to fp32, so an oversized -c died at construction with cudaMalloc failed: 2. That capped the reachable context at whatever fit in fp32, even though a narrowed store would have fit. Operators had to know to pass --kv-type bf16/q8_0 to reach long context.

Change

CudaForwardPass now mirrors the auto-SnapKV VRAM heuristic. When the operator set neither an explicit --kv-type/SHARPI_KV_DTYPE nor an explicit SnapKV budget, TQ is off, and the resolved context's fp32 KV cache won't fit the VRAM budget, it auto-selects a narrowed dtype (bf16 preferred; q8_0 if bf16 still won't fit and the geometry supports it) and logs the choice instead of erroring.
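The selection order described above can be sketched as a pure function. This is a minimal illustration, not the project's ResolveKvDType: the names KvDType, KvNarrowSketch and the parameter list are hypothetical, the inclusive `<=` fit test is an assumed reading of "fits", and the final fallback (bf16 when the q8_0 geometry is unsupported) is an assumption the PR text does not state.

```csharp
using System;

// Hypothetical stand-in for the project's KV dtype enum.
public enum KvDType { Fp32, Bf16, Q8_0 }

public static class KvNarrowSketch
{
    // Returns the dtype to allocate and whether it was auto-narrowed.
    public static (KvDType DType, bool Narrowed) Resolve(
        KvDType? explicitDType,     // --kv-type / SHARPI_KV_DTYPE, or null if unset
        bool explicitSnapKvBudget,  // operator set an explicit SnapKV budget
        bool tqEnabled,             // TQ on: auto-narrow stands down
        long fp32Bytes,             // estimated fp32 KV footprint
        long bf16Bytes,             // estimated bf16 KV footprint
        bool q8GeometrySupported,   // geometry allows q8_0 blocks
        long vramBudgetBytes)       // estimated VRAM available for KV
    {
        // Explicit choices are never overridden, even if they will not fit.
        if (explicitDType.HasValue) return (explicitDType.Value, false);

        // Explicit SnapKV budget or TQ: keep the fp32 default untouched.
        if (explicitSnapKvBudget || tqEnabled) return (KvDType.Fp32, false);

        // Assumed inclusive fit: a cache exactly equal to the budget fits.
        if (fp32Bytes <= vramBudgetBytes) return (KvDType.Fp32, false);

        // bf16 is preferred when it fits.
        if (bf16Bytes <= vramBudgetBytes) return (KvDType.Bf16, true);

        // bf16 does not fit: q8_0 if the geometry supports it.
        if (q8GeometrySupported) return (KvDType.Q8_0, true);

        // Assumption (not stated in the PR): fall back to bf16, the
        // narrowest remaining option, and let allocation decide.
        return (KvDType.Bf16, true);
    }
}
```

For example, `KvNarrowSketch.Resolve(null, false, false, 4096, 2048, true, 1024)` selects q8_0 with `Narrowed == true`, while passing `KvDType.Fp32` as the first argument returns fp32 unchanged regardless of the budget.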

Precedence vs auto-SnapKV (documented in code)

Auto-narrow runs first and wins. SnapKV does not shrink the construction-time allocation (the cache is allocated full-maxSeqLen up front; eviction only bounds the logical length at runtime), so it cannot avert the cudaMalloc failure — only narrowing the element width can. Narrowing is also exact-context, argmax-stable, and full-speed-prefill, whereas SnapKV evicts tokens (lossy).
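To make the element-width argument concrete, here is a small sketch of how the size of one full-maxSeqLen buffer scales with the store dtype. The class and method names are hypothetical. The q8_0 layout (32 values per 34-byte block) is the standard GGML block format, and the example geometry in the note below is illustrative rather than taken from any model in this PR.

```csharp
using System;

public static class KvWidthSketch
{
    // fp32 stores 4 bytes per element.
    public static long Fp32Bytes(long elements) => elements * 4;

    // bf16 stores 2 bytes per element.
    public static long Bf16Bytes(long elements) => elements * 2;

    // q8_0 packs 32 elements into one 34-byte block
    // (32 int8 values plus a 2-byte scale).
    public static long Q8Bytes(long elements)
    {
        if (elements % 32 != 0)
            throw new ArgumentException("q8_0 needs a multiple of 32 elements");
        return elements / 32 * 34;
    }
}
```

With a hypothetical buffer of 131072 positions × 8 KV heads × head_dim 128 (134,217,728 elements), the three stores need 512 MiB, 256 MiB and 136 MiB respectively. The sequence length is the same in all three cases, which is why only the element width changes the construction-time allocation.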

Explicit choices are never overridden: an explicit fp32 still errors loudly at allocation rather than silently narrowing; an explicit SnapKV budget is honoured (auto-narrow stands down) even though it can't avert the alloc failure — the operator's call.

Implementation

  • Decision factored into pure helpers — ResolveKvDType / EstimateKvCacheBytes / Q8KvGeometrySupported / EstimateAvailableKvVram (extracted from EstimateMaxContext, single-sourcing the VRAM budget so estimator and heuristic stay in agreement). All unit-testable without a GPU or model — 9 new tests.
  • EstimateKvCacheBytes rounds each K/V buffer up to the next power of two to match the GPU buffer pool (gpu.Allocate, exact:false). A raw byte sum undercounts by up to ~2× per buffer (q8_0's 34-byte blocks rarely land on a power of two) and could wrongly conclude fp32 fits, defeating the narrow. (Caught in code review.)
  • Server: SharpInferenceServerOptions.KvType → SHARPI_KV_DTYPE is already forwarded; no InferenceEngineLoader change needed.
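The pool-rounding point in the second bullet can be illustrated with a short sketch. It assumes, as the bullet says, that each buffer is rounded up to the next power of two. PoolRoundingSketch and its methods are hypothetical names, not the project's EstimateKvCacheBytes.

```csharp
using System;

public static class PoolRoundingSketch
{
    // Smallest power of two that is >= bytes (returns 1 for bytes <= 1).
    public static long RoundUpPow2(long bytes)
    {
        long p = 1;
        while (p < bytes) p <<= 1;
        return p;
    }

    // Naive estimate: sum of the raw buffer sizes.
    public static long RawSum(params long[] bufferBytes)
    {
        long total = 0;
        foreach (long b in bufferBytes) total += b;
        return total;
    }

    // Pool-aware estimate: each buffer is rounded up before summing.
    public static long PooledSum(params long[] bufferBytes)
    {
        long total = 0;
        foreach (long b in bufferBytes) total += RoundUpPow2(b);
        return total;
    }
}
```

For a K and a V buffer of 142,606,336 bytes each (136 MiB of q8_0 blocks), `RawSum` gives 272 MiB but `PooledSum` gives 512 MiB, because each buffer lands in a 256 MiB pool slot. An estimator using the raw sum would report a footprint almost half the size of what the pool actually reserves.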

Validation

  • Filter Gemma4CudaBatchedPrefill|Qwen3CudaBatchedPrefill|KvDtype: 39/39 green (40m12s) — 9 new pure tests + all bf16/q8_0 parity, batched-prefill, and >4096 chunked-prefill cases.
  • E2E: gemma-4-12b -g -1 -c 131072 with no --kv-type (previously cudaMalloc-fails at fp32) now logs KV auto-narrowed to Q8_0 for context 131072 and decodes coherently — "The capital of France is Paris." — at full speed (48.5 t/s, no cliff).
  • pr-review-toolkit: code-reviewer (1 Important fixed: pool-rounding undercount; + Minor/Nit) and code-simplifier (no changes) cycles run.

Not in this PR — #185 item 2 (Tc/half2 q8 flash thunks)

Remains blocked: every dense GGUF on disk has head_dim divisible by 64 (Qwen3=128, OLMoE=128, SmolLM=64, Qwen3.6=256, Gemma4=512). With no non-%64 head_dim model to validate against, templating the single-warp Tc / half2 flash kernels on <KV> would ship untested kernels. The per-token fallback covers correctness until a validation model exists.

🤖 Generated with Claude Code

Previously --kv-type defaulted to fp32 and an oversized -c died at
construction with "cudaMalloc failed: 2" — capping context at fp32 even
though a narrowed store would fit. The operator had to know to pass
--kv-type bf16/q8_0 to reach long context.

CudaForwardPass now mirrors the auto-SnapKV VRAM heuristic: when the
operator set NEITHER an explicit --kv-type/SHARPI_KV_DTYPE NOR an explicit
SnapKV budget, TQ is off, and the resolved context's fp32 KV cache won't
fit the VRAM budget, it auto-selects a narrowed dtype (bf16 preferred;
q8_0 if bf16 still won't fit and the geometry supports it) and logs the
choice instead of erroring.

Precedence vs auto-SnapKV: auto-narrow runs first and wins. SnapKV does
NOT shrink the construction-time allocation (the cache is allocated
full-maxSeqLen up front; eviction only bounds the logical length at
runtime), so it cannot avert the cudaMalloc failure — only narrowing the
element width can. Narrowing is also exact-context, argmax-stable, and
full-speed-prefill, whereas SnapKV evicts tokens. Explicit choices are
never overridden (explicit fp32 still errors loudly; explicit SnapKV is
honoured).

Decision factored into pure helpers — ResolveKvDType / EstimateKvCacheBytes
/ Q8KvGeometrySupported / EstimateAvailableKvVram (extracted from
EstimateMaxContext, single-sourcing the budget) — unit-tested without a
GPU or model (9 new tests). EstimateKvCacheBytes rounds each K/V buffer to
the next power of two to match the GPU buffer pool (gpu.Allocate,
exact:false); a raw byte sum undercounts by up to ~2x per buffer (q8_0's
34-byte blocks rarely land on a power of two) and could wrongly conclude
fp32 fits.

The server already forwards SharpInferenceServerOptions.KvType ->
SHARPI_KV_DTYPE, so no InferenceEngineLoader change is needed.

Validated: filter "Gemma4CudaBatchedPrefill|Qwen3CudaBatchedPrefill|KvDtype"
39/39 green (40m). E2E: gemma-4-12b -g -1 -c 131072 with no --kv-type now
auto-narrows to q8_0 and decodes coherently at full speed (48.5 t/s)
instead of cudaMalloc-failing.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

gemini-code-assist (Bot) left a comment

Code Review

This pull request implements an auto-narrowing heuristic for the KV cache data type (bf16 or q8_0) when the default fp32 cache exceeds the estimated available VRAM budget, preventing cudaMalloc failures on oversized contexts. It also adds comprehensive unit tests for this logic. The reviewer identified a critical issue where EstimateAvailableKvVram relies on EstimateGpuTensorBytes, which is missing a check for DType.Q4_0 in its raw-upload condition. This omission causes a massive overestimation of weight VRAM for models using Q4_0, severely underestimating the available VRAM for the KV cache and breaking the auto-narrowing logic.

/// footprint against. Single-sourced so both stay in agreement.
/// </summary>
public static int EstimateMaxContext(GgufModel model, CudaBackend gpu, ModelHyperparams hp)
internal static long EstimateAvailableKvVram(GgufModel model, CudaBackend gpu, ModelHyperparams hp)

high

In EstimateAvailableKvVram, the weight bytes are estimated using EstimateGpuTensorBytes(t). However, EstimateGpuTensorBytes is missing a check for DType.Q4_0 in its raw-upload condition.

Since Q4_0 weights are uploaded raw on the GPU (as seen in UploadWeight at line 3064), they should be estimated using their native ByteSize rather than falling back to ElementCount * sizeof(float) (which assumes 4 bytes per element). For models like Gemma 4 12B QAT that use Q4_0 for bulk weights, this omission causes a massive overestimation of weight VRAM (by ~7x), which in turn severely underestimates the available VRAM for the KV cache and breaks the auto-narrowing/context estimation logic.

Please update EstimateGpuTensorBytes to include DType.Q4_0 in the raw-upload check:

```csharp
if (tensor.DType == DType.Float32 || tensor.DType == DType.Q4_K
    || tensor.DType == DType.Q6_K || tensor.DType == DType.Q8_0 || tensor.DType == DType.Q4_0)
```

Addresses findings from the pr-review cycle + Gemini Code Assist on #186:

- EstimateGpuTensorBytes was missing DType.Q4_0 from its raw-upload set, so
  Q4_0 weights (the Gemma 4 12B QAT dtype) were counted as fp32 — a ~7×
  over-count that drove the KV budget to its 64 MiB floor and made the new
  auto-narrow heuristic narrow on a degenerate budget. The set now matches
  UploadWeight's raw branch exactly ({Float32, Q4_0, Q4_K, Q6_K, Q8_0});
  Q5_K and others stay fp32-counted because UploadWeight CPU-dequantizes
  them. E2E: 12B budget 64 MiB → ~1549 MiB (realistic).

- Auto-narrow was suppressed by SHARPI_SNAPKV_BUDGET=0, but that value means
  "disable SnapKV" (IsBudgetExplicit=true, Budget=0) — the same disable knob
  the banners advertise. Gating on (IsBudgetExplicit && Budget > 0) so a
  SnapKV-disable no longer silently reintroduces the cudaMalloc failure.

- Tests: gemma4-shaped EstimateKvCacheBytes (aliased-skip + SWA-ring cap +
  per-layer dims, independently hand-computed), mixed-layer Q8 geometry
  (any single violator fails; aliased violator skipped), and ResolveKvDType
  inclusive-fit boundary. 12/12 pure tests green.

Validated: pure tests 12/12; Qwen3-8B bf16/q8 parity 3/3; E2E 12B -c 131072
narrows to q8_0 (47.4 t/s) and -c 8192 narrows to q8_0 (fp32 5376 MiB still
> 1549 budget — correct, the SWA ring headroom keeps 12B KV large).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

pekkah (Owner, Author) commented Jun 9, 2026

Addressed the review round in e7dbf05:

Gemini Code Assist (critical): confirmed and fixed — EstimateGpuTensorBytes was missing DType.Q4_0 from its raw-upload set, counting Q4_0 weights (the Gemma 4 12B QAT dtype) as fp32. That ~7× over-count drove the KV budget to its 64 MiB floor, so auto-narrow was deciding on a degenerate budget. The set now matches UploadWeight's raw branch exactly ({Float32, Q4_0, Q4_K, Q6_K, Q8_0}); Q5_K and others correctly stay fp32-counted since UploadWeight CPU-dequantizes them. E2E confirms the 12B budget went 64 MiB → ~1549 MiB (realistic).
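The ~7× figure follows from the Q4_0 block layout: in the standard GGML format, 32 elements occupy 18 bytes, while counting them as fp32 gives 128 bytes (128 / 18 ≈ 7.1). The sketch below shows the shape of the fix under stated assumptions. TensorDType and WeightEstimateSketch are hypothetical stand-ins for the project's types, and KvBudget is a deliberately simplified model of the budget (free VRAM minus weights, clamped to the 64 MiB floor mentioned above), since the real EstimateAvailableKvVram may subtract more.

```csharp
using System;

// Hypothetical stand-in for the project's DType enum.
public enum TensorDType { Float32, Q4_0, Q4_K, Q5_K, Q6_K, Q8_0 }

public static class WeightEstimateSketch
{
    public const long FloorBytes = 64L * 1024 * 1024;

    // Dtypes uploaded raw, per the PR: {Float32, Q4_0, Q4_K, Q6_K, Q8_0}.
    public static bool IsRawUpload(TensorDType d) =>
        d is TensorDType.Float32 or TensorDType.Q4_0 or TensorDType.Q4_K
          or TensorDType.Q6_K or TensorDType.Q8_0;

    // Raw uploads cost their native size; everything else is
    // CPU-dequantized and therefore counted as fp32.
    public static long EstimateBytes(TensorDType d, long elementCount, long nativeByteSize) =>
        IsRawUpload(d) ? nativeByteSize : elementCount * sizeof(float);

    // Native Q4_0 size: 18 bytes per block of 32 elements.
    public static long Q4_0ByteSize(long elementCount) => elementCount / 32 * 18;

    // Simplified budget model (assumption): free VRAM minus weights, floored.
    public static long KvBudget(long freeVramBytes, long weightBytes) =>
        Math.Max(freeVramBytes - weightBytes, FloorBytes);
}
```

Under this model, an over-counted weight total larger than the free VRAM collapses `KvBudget` to `FloorBytes`, which is the degenerate 64 MiB budget described above.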

silent-failure-hunter (important): SHARPI_SNAPKV_BUDGET=0 (the "disable SnapKV" knob the banners advertise) was suppressing auto-narrow because it sets IsBudgetExplicit=true. Now gated on (IsBudgetExplicit && Budget > 0), matching the disable semantics used by the SnapKV throws/auto-enable — so disabling SnapKV no longer silently reintroduces the cudaMalloc failure.
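A minimal sketch of the corrected gate follows. The property names IsBudgetExplicit and Budget come from the comment above. The SnapKvOptions record and the SnapKvGateSketch class are hypothetical containers for illustration.

```csharp
// Hypothetical container for the two SnapKV settings named in the PR.
public readonly record struct SnapKvOptions(bool IsBudgetExplicit, int Budget);

public static class SnapKvGateSketch
{
    // Auto-narrow stands down only for an explicit, positive budget.
    // An explicit Budget of 0 means "SnapKV disabled" and must not
    // suppress narrowing.
    public static bool SuppressesAutoNarrow(SnapKvOptions o) =>
        o.IsBudgetExplicit && o.Budget > 0;
}
```

With this gate, `new SnapKvOptions(true, 0)` (the SHARPI_SNAPKV_BUDGET=0 case) no longer suppresses auto-narrow, while `new SnapKvOptions(true, 1024)` still does.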

pr-test-analyzer (coverage): added the gemma4-shaped EstimateKvCacheBytes test (KvSourceLayer alias-skip + SWA-ring cap + per-layer head_dim/kv-heads, hand-computed independently of the implementation), a mixed-layer Q8KvGeometrySupported test (any single %32 violator fails; an aliased violator is skipped), and a ResolveKvDType inclusive-fit boundary test.
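The mixed-layer rule exercised by the new geometry test can be sketched as below. This assumes the %32 check applies to each layer's head_dim and that a layer with a KvSourceLayer alias owns no buffer of its own. Neither detail is spelled out in this thread, and LayerKvShape and Q8GeometrySketch are hypothetical names.

```csharp
using System.Collections.Generic;

// Hypothetical per-layer shape; KvSourceLayer != null marks an aliased layer.
public readonly record struct LayerKvShape(int HeadDim, int KvHeads, int? KvSourceLayer);

public static class Q8GeometrySketch
{
    public static bool Supported(IEnumerable<LayerKvShape> layers)
    {
        foreach (var layer in layers)
        {
            // Aliased layers reuse another layer's cache, so they are skipped.
            if (layer.KvSourceLayer.HasValue) continue;

            // Assumed check: q8_0 blocks hold 32 values, so head_dim
            // must be a multiple of 32. One violator fails the model.
            if (layer.HeadDim % 32 != 0) return false;
        }
        return true;
    }
}
```

A model whose only non-%32 layer is aliased is reported as supported, whereas a single non-aliased violator among otherwise valid layers is enough to rule out q8_0.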

Validation: 12/12 pure tests; Qwen3-8B bf16/q8 parity 3/3; E2E 12B -c 131072 → q8_0 @ 47.4 t/s, -c 8192 → q8_0 (fp32 5376 MiB still exceeds the 1549 budget — correct: the 4096 SWA-ring headroom keeps 12B KV large even at moderate context). The full Gemma4CudaBatchedPrefill|Qwen3CudaBatchedPrefill|KvDtype filter (39/39) was green before this round; these fixes don't touch the parity path (explicit ctx + explicit dtype bypass both EstimateMaxContext and the auto-narrow block).
