GPU saturation gaps: chunked+packed prefill admission, KV budget, PCIe probe (#183) #188
…183)

Closes the three scheduling gaps from #183 in ContinuousBatchingEngine:

Gap 1 (chunked/interleaved prefill): admission no longer prefills a prompt in one blocking call that stalls every active decode. Prompts prefill in prefillChunkTokens slices (default 256; SHARPI_PREFILL_CHUNK / PrefillChunkTokens option; 0 restores blocking) with one BatchForwardMulti decode step between chunks. Cancellation is honoured between chunks. SnapKV-enabled engines fall back to blocking (eviction scoring needs the whole prompt at startPos==0). SmolLM2-CPU A/B (3 decoders + 1040-token prompt injection): worst decode stall 25.1s -> 7.9s, ~12% aggregate cost.

Gap 2 (packed multi-sequence prefill): ForwardPass.PrefillPackedMulti runs the in-flight prompts' chunks as ONE packed varlen pass (batched GEMMs over the combined tokens, per-token attention vs each sequence's own cache, no padding). 4x256-token prompts: 33.7 tok/s packed vs 20.4 chunked-serial (+65%), within 6% of blocking whole-prompt prefill. Logits are only computed for chunks that complete a prompt. Same guards as BatchForwardMulti.

Gap 3 (KV token-budget backpressure): each request reserves min(promptTokens + MaxNewTokens, MaxSeqLen) tokens at admission (ForwardPass.KvBytesPerToken converts bytes->tokens); the head request waits in a local FIFO when it would exceed the budget instead of OOM-ing. A lone oversized request is always admitted (no deadlock). SHARPI_KV_BUDGET_MB / KvBudgetMb: 0 = auto (half RAM), negative = unlimited.

Also: dispose-time drain pulls requests still in the channel so a late writer can't hang a caller; _pendingCount is restored if WriteAsync throws.

Bench: `SharpInference.Bench -- --cb` (stall A/B) and `--cb --packed` (packed-vs-serial prefill A/B).

Tests: 5 new (chunked parity, packed parity incl. startPos>0 continuation, engine chunked-vs-unchunked greedy equality, tiny-budget completion).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
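The Gap 3 admission rule in the commit message above can be sketched as follows. This is a minimal illustration under stated assumptions, not the engine's code: `KvBudgetGate`, `ReservationFor`, `Enqueue`, `AdmitReady` and `Release` are hypothetical names, and the constructor takes an already-resolved byte budget (the `0 = auto (half RAM)` resolution is not modelled). Only the reservation formula, the head-of-queue wait, and the "lone oversized request is always admitted" behaviour come from the description.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of KV token-budget admission; names are illustrative only.
public sealed class KvBudgetGate
{
    private readonly long _budgetTokens;                 // negative means unlimited
    private long _reservedTokens;                        // tokens held by admitted requests
    private readonly Queue<long> _pending = new Queue<long>();

    public KvBudgetGate(long budgetBytes, long kvBytesPerToken)
    {
        if (kvBytesPerToken <= 0) throw new ArgumentOutOfRangeException(nameof(kvBytesPerToken));
        // Convert the byte budget into a token budget once, up front.
        _budgetTokens = budgetBytes < 0 ? -1 : budgetBytes / kvBytesPerToken;
    }

    public long ReservedTokens => _reservedTokens;

    // Reservation rule from the description: min(prompt + MaxNewTokens, MaxSeqLen).
    public static long ReservationFor(int promptTokens, int maxNewTokens, int maxSeqLen)
        => Math.Min((long)promptTokens + maxNewTokens, maxSeqLen);

    public void Enqueue(long reservation) => _pending.Enqueue(reservation);

    // Admits from the head of the FIFO and stops at the first head that does not fit.
    // Returns how many requests were admitted by this call.
    public int AdmitReady()
    {
        int admitted = 0;
        while (_pending.Count > 0)
        {
            long head = _pending.Peek();
            bool fits = _budgetTokens < 0 || _reservedTokens + head <= _budgetTokens;
            bool lone = _reservedTokens == 0;            // nothing in flight: admit even if oversized
            if (!fits && !lone) break;                   // head waits; later requests wait behind it
            _pending.Dequeue();
            _reservedTokens += head;
            admitted++;
        }
        return admitted;
    }

    // Called when a request finishes and its KV cache is freed.
    public void Release(long reservation) => _reservedTokens -= reservation;
}
```

In this sketch the head request blocks everything behind it, which keeps admission order FIFO at the cost of leaving some budget idle when a smaller request further back would have fit.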
…183)

HardwareProfile.Detect(CudaBackend) replaces the VRAM-size-bucket PCIe guess with CudaBackend.MeasurePcieBandwidthGBps(): a pinned 64 MiB cudaMemcpy probe (min of H2D/D2H, best-of-3, ~100ms at load), clamped to [5,64] GB/s with the heuristic as fallback. 4070 Ti: measured ~26 GB/s vs the old 20 GB/s bucket, so TierPlanner plans against real bandwidth.

CudaLibraryResolver consolidates the assembly's single DllImportResolver (NVRTC probing moved from NvrtcInterop unchanged) and adds CUDA runtime major selection: cudart64_12/cublas64_12 stay the default, SHARPI_CUDA13=1 prefers the _13 pair, and 13-only toolkit installs fall back to _13 instead of hard-failing model load. The major is decided as a pair, never a mixed 12/13 process. The resolver is registered from both interop static ctors so it precedes the first P/Invoke bind on every entry path.

CUDA 13.3 evaluated on Qwen3-8B -g -1 (all entry points incl. deprecated cublasSetMathMode still exported): prefill 2354->2390 t/s (+1.5%), decode 58.0->57.9 (noise). Decode is HBM-bound in our NVRTC kernels, so CUDA 12 remains the pinned default matching the nvrtc64_120_0 JIT.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
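How the probe's raw trials might be reduced to the single figure the planner consumes can be sketched as below. This is one reading of "min of H2D/D2H, best-of-3, clamped to [5,64] GB/s with the heuristic as fallback": take the best trial per direction, then the slower direction, then clamp. `PcieProbe.Resolve` and its parameters are hypothetical names; the actual timing of the pinned `cudaMemcpy` transfers is out of scope here.

```csharp
using System;
using System.Linq;

// Hypothetical post-processing of PCIe probe trials; names are illustrative only.
public static class PcieProbe
{
    public const double MinGBps = 5.0;
    public const double MaxGBps = 64.0;

    // h2dTrials / d2hTrials: measured GB/s per trial for each copy direction.
    // heuristicGBps: the fallback guess used when the probe yields nothing usable.
    public static double Resolve(double[] h2dTrials, double[] d2hTrials, double heuristicGBps)
    {
        if (h2dTrials.Length == 0 || d2hTrials.Length == 0) return heuristicGBps;

        double bestH2D = h2dTrials.Max();                // best-of-N per direction
        double bestD2H = d2hTrials.Max();
        double measured = Math.Min(bestH2D, bestD2H);    // plan against the slower direction

        if (double.IsNaN(measured) || measured <= 0) return heuristicGBps;
        return Math.Clamp(measured, MinGBps, MaxGBps);   // reject implausible readings
    }
}
```

Taking the minimum across directions is the conservative choice: a tier plan that assumes the faster direction would overestimate throughput for traffic going the other way.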
Since #185 the KV dtype auto-narrows when fp32 won't fit the VRAM budget, so on a 12 GB card the 12B tests silently ran with bf16 KV. Their synthetic OOD prompt has a tiny top-logit margin, where bf16 rounding tips greedy decode into a single-token repetition attractor and fails the >=2-distinct assertion (q8_0's different rounding happens to pass). Forced bf16 failed identically at #184, so nothing regressed; the default changed. The bf16 path is coherent on real prompts (CLI-validated to 128K, #179/#184). These tests guard trunk math (k_eq_v, per-layer KV, embed), so the dtype must be deterministic: pin fp32 in the class ctor, restore on Dispose.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Code Review
This pull request addresses concurrent load scheduling issues (issue #183) by introducing chunked/interleaved prefill, packed multi-sequence prefill, and KV token-budget admission backpressure to the continuous batching engine. It also adds dynamic CUDA 12/13 library resolution and a real PCIe bandwidth measurement probe. Feedback on the changes highlights three critical improvements: disabling continuous batching for Gemma 4 models to prevent a guaranteed crash due to unsupported per-layer head dimensions in packed prefill, wrapping the main batcher loop in a try-finally block to ensure resources are properly drained and native memory leaks are avoided on unexpected exceptions, and optimizing the CUDA library resolver to load DLLs lazily rather than eagerly loading both CUDA 12 and 13 pairs.
        thinkTokenId, endThinkTokenId,
        prefillChunkTokens: opts.PrefillChunkTokens,
        kvBudgetBytes: opts.KvBudgetMb > 0 ? opts.KvBudgetMb * 1024 * 1024 : opts.KvBudgetMb);

Continuous batching is currently enabled for Gemma 4 models on CPU, but ForwardPass.PrefillPackedMulti explicitly throws a NotSupportedException when _layerHeadDim is not null (which is the case for Gemma 4 models due to their per-layer head dimensions). This will cause a guaranteed crash and silent hang of the background batcher thread on the very first request.
To prevent this, we should disable continuous batching for Gemma 4 models and let them fall back to the single-user InferenceEngine (which is fully supported) by checking hp.LayerHeadDim == null in the admission condition on line 94:

    if (opts.MaxBatchSize > 1 && batchingSupported && hp.LayerHeadDim == null && fwd is ForwardPass cpuFwd)

    var pending = new Queue<PendingRequest>();
    var prefilling = new List<PrefillingSeq>();

The main while (!_disposed) loop in BatcherLoop is not wrapped in a try-finally block. If any unexpected exception is thrown inside the loop (e.g., from BatchForwardMulti or RunPrefillStep due to CUDA out-of-memory, transient GPU errors, or bugs in custom kernels), the background thread will die silently.
Because the "Drain" phase is located outside the loop and is not protected by a finally block, it will be bypassed entirely. This leads to:
- Permanent silent hangs: Active and prefilling requests' output channels are never completed, causing all active clients to hang indefinitely.
- Native memory leaks: The `PagedKvCache` of every active and prefilling request is never disposed, leaking native memory.
We should wrap the entire loop in a try-finally block to guarantee that the Drain phase is always executed on failure:

    try
    {
        while (!_disposed)
        {
            // ... loop body ...
        }
    }
    finally
    {
        // Drain: complete everything still in flight ...
    }

    bool prefer13 = Environment.GetEnvironmentVariable("SHARPI_CUDA13") == "1";
    bool has13 = CanLoadPair("cudart64_13", "cublas64_13");
    bool has12 = CanLoadPair("cudart64_12", "cublas64_12");
    major = prefer13 && has13 ? 13
          : has12 ? 12
          : has13 ? 13
          : 12; // neither loads: keep the pinned name so the standard load error surfaces

The current implementation of DecideRuntimeMajor eagerly calls CanLoadPair for both CUDA 12 and CUDA 13. Since NativeLibrary.TryLoad actually loads the libraries into the process memory and increments their reference counts, this causes the unused version's heavy DLLs (like cublas, which is hundreds of megabytes) to be loaded and remain in memory forever.
This wastes significant system memory and can potentially cause runtime conflicts between different CUDA versions. We should optimize the detection flow to only load the preferred version first, and only fall back to the other version if the preferred one is not available.

    bool prefer13 = Environment.GetEnvironmentVariable("SHARPI_CUDA13") == "1";
    if (prefer13 && CanLoadPair("cudart64_13", "cublas64_13"))
    {
        major = 13;
    }
    else if (CanLoadPair("cudart64_12", "cublas64_12"))
    {
        major = 12;
    }
    else if (CanLoadPair("cudart64_13", "cublas64_13"))
    {
        major = 13;
    }
    else
    {
        major = 12;
    }

…esilience, lazy DLL probe)

- InferenceEngineLoader: exclude gemma4 per-layer-head_dim models from continuous batching (PrefillWithCache/BatchForwardMulti/PrefillPackedMulti all throw NotSupportedException); they fall back to the single-user InferenceEngine instead of failing every request.
- ContinuousBatchingEngine.BatcherLoop: wrap the loop in try/catch/finally, so an exception the per-request handlers didn't isolate now fails every in-flight request with that exception via the finally drain instead of killing the batcher silently (hung callers + leaked caches).
- CudaLibraryResolver: probe the preferred runtime pair first and free probe handles, so the unused CUDA major's DLLs are never left mapped in the process.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Addressed all three review findings in 2bf8f98:
Follow-ups filed from this work: #189 (dequant-once MatMulBatched cache → smaller prefill chunks), #190 (continuous batching on the CUDA backend), #191 (narrowed-KV decode-coherence coverage, refs #166).
Addresses the #232 review (test-coverage gaps; no production defects found by the correctness/silent-failure lenses):

- Vacuous-pass guard: new internal CudaHybridForwardPass.KvCacheDType observable; the parity oracle now asserts the requested dtype actually applied (else fp32-vs-fp32 would pass trivially if the env plumbing regressed).
- WAVE path coverage (>4096 tokens): Coder30B_Q8Kv_Wave_ArgmaxStable prefills >4096 tokens in one call so PrefillBatchedTrunk takes the AttentionBatchedWaveQ8_0 branch (was only manual CLI).
- Greedy (non-teacher-forced) coherence: catches an #188-style narrowed-KV self-decode collapse that teacher-forcing masks. Uses each model's OWN GGUF chat template (ApplyChatTemplate via tokenizer.ChatTemplate.Render + add_generation_prompt); a raw continuation prompt collapses an instruct model regardless of KV dtype (the 'prompt must match chat template' trap). OLMoE now passes with its template.
- Tightened the skip: q8_0 is supported for both test models (kvDim%32==0), so a NotSupportedException is now a real failure, not a silent skip.

Hybrid KV-dtype suite 7/7 green (4 parity bf16/q8 × OLMoE/Coder + wave-q8 + 2 greedy-coherent). Gemma4-on-hybrid narrowed decode reuses the dense-tested AttentionSwa{Bf16,Q8_0} kernels via the trivial *Kv dispatch; not separately tested (needs a synthetic gemma4 hybrid split). Release clean under TreatWarningsAsErrors + AOT.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Closes #183.
Gap 1: chunked / interleaved prefill admission
Admitted prompts prefill in `prefillChunkTokens` slices (default 256; `SHARPI_PREFILL_CHUNK` / `PrefillChunkTokens`; `0` = legacy blocking) with one `BatchForwardMulti` decode step between chunks. Cancellation is honoured between chunks; SnapKV-enabled engines fall back to blocking (eviction needs the whole prompt at `startPos==0`).

A/B (`SharpInference.Bench -- --cb`, SmolLM2 CPU, 3 active decoders + 1040-token prompt injected): worst decode stall 25.1s -> 7.9s at ~12% aggregate cost.

Smaller chunks are a trap on CPU: every `MatMulBatched` call re-pays weight dequant, so prefill collapses 42→20→13 t/s at chunk 256/64/32. 256 is the default compromise and is operator-tunable.

Gap 2: packed multi-sequence prefill
`ForwardPass.PrefillPackedMulti(chunks, startPos, caches, wantLogits)`: in-flight prompts' chunks run as ONE packed varlen pass, with batched GEMMs over the combined token count, per-token RoPE/append/attention against each sequence's own `PagedKvCache`, no padding, and no cross-sequence attention. Logits are computed only for prompt-completing chunks. Same guards as `BatchForwardMulti` (no MoE/TQ/gemma4).

A/B (`--cb --packed`, 4×256-token prompts): whole-prompt serial 35.8 tok/s · chunked serial 20.4 · chunked packed 33.7 (+65% vs serial chunks, −6% vs blocking).

Gap 3: KV token-budget admission backpressure
Each request reserves `min(prompt + MaxNewTokens, MaxSeqLen)` KV tokens at admission (`ForwardPass.KvBytesPerToken`); the head request waits when the budget would be exceeded instead of risking OOM. A lone oversized request is always admitted (no deadlock). `SHARPI_KV_BUDGET_MB` / `KvBudgetMb`: 0 = auto (half RAM), negative = unlimited.

HardwareProfile: measured PCIe bandwidth
`CudaBackend.MeasurePcieBandwidthGBps()` (pinned 64 MiB probe, min(H2D, D2H), heuristic fallback) replaces the VRAM-bucket guess in `Detect(CudaBackend)`. 4070 Ti: ~26 GB/s measured vs 20 guessed.

CUDA 13.3 evaluation (no default change)
All pinned entry points exist in 13.3. Qwen3-8B `-g -1`: prefill 2354→2390 t/s (+1.5%), decode 58.0→57.9 (noise); decode is HBM-bound in our NVRTC kernels. `CudaLibraryResolver` (the assembly's single `DllImportResolver`, NVRTC probing folded in) keeps CUDA 12 as the default, `SHARPI_CUDA13=1` opts into the 13 pair, and 13-only installs now load instead of hard-failing.

Test-suite fix found along the way
`Gemma4_12B_CudaForward_ProducesCoherentDecode` has been failing on master at default settings since #186: #185's auto-narrow makes the 12B tests run with bf16 KV on a 12 GB card, and their synthetic OOD prompt's tiny logit margin lets bf16 rounding tip greedy decode into a one-token attractor (forced bf16 failed identically at #184, so nothing regressed; CLI bf16 output is coherent). The 12B tests now pin `SHARPI_KV_DTYPE=fp32` (they guard trunk math, not dtype noise).

Tests
5 new in `ContinuousBatchingTests` (chunked parity, packed parity incl. `startPos>0` continuation, engine chunked-vs-unchunked greedy equality, tiny-budget completion). Full suite green: Core 134, ForwardPass 481 (all passing after the 12B pin), Pipeline 37, TurboQuant 41, Server 103.

🤖 Generated with Claude Code