GPU saturation gaps: chunked+packed prefill admission, KV budget, PCIe probe (#183) by pekkah · Pull Request #188 · pekkah/SharpInference

pekkah · 2026-06-10T09:38:07Z

Closes #183.

Gap 1 — chunked / interleaved prefill admission

Admitted prompts prefill in prefillChunkTokens slices (default 256; SHARPI_PREFILL_CHUNK / PrefillChunkTokens; 0 = legacy blocking) with one BatchForwardMulti decode step between chunks. Cancellation honoured between chunks; SnapKV-enabled engines fall back to blocking (eviction needs the whole prompt at startPos==0).
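The slicing and interleaving described above can be sketched roughly as follows. This is a minimal illustration, not the engine's code: `SlicePrompt` and `InterleaveSchedule` are hypothetical names, and the real scheduler runs `BatchForwardMulti` where this sketch only records a "decode" step.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: split a prompt into (start, length) slices.
// chunkTokens <= 0 mirrors the "0 = legacy blocking" setting: one slice for the whole prompt.
List<(int Start, int Length)> SlicePrompt(int promptTokens, int chunkTokens)
{
    var slices = new List<(int Start, int Length)>();
    if (chunkTokens <= 0)
    {
        slices.Add((0, promptTokens));
        return slices;
    }
    for (int start = 0; start < promptTokens; start += chunkTokens)
        slices.Add((start, Math.Min(chunkTokens, promptTokens - start)));
    return slices;
}

// Hypothetical timeline: one decode step for the active sequences between prefill chunks.
List<string> InterleaveSchedule(int promptTokens, int chunkTokens)
{
    var steps = new List<string>();
    var slices = SlicePrompt(promptTokens, chunkTokens);
    for (int i = 0; i < slices.Count; i++)
    {
        steps.Add($"prefill[{slices[i].Start}..{slices[i].Start + slices[i].Length})");
        if (i < slices.Count - 1) steps.Add("decode");
    }
    return steps;
}

// The 1040-token prompt from the A/B below, at the default chunk size of 256.
foreach (var step in InterleaveSchedule(1040, 256))
    Console.WriteLine(step);
```

With a 1040-token prompt and 256-token chunks, the active decoders get a turn four times during the prefill instead of waiting for the whole prompt.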

A/B (SharpInference.Bench -- --cb, SmolLM2 CPU, 3 active decoders + 1040-token prompt injected):

| Prefill mode | Worst decode stall | Aggregate throughput |
| --- | --- | --- |
| blocking (pre-#183) | 25.1 s | 8.1 tok/s |
| chunked+packed (256) | 7.9 s | 7.1 tok/s |

Smaller chunks are a trap on CPU — every MatMulBatched call re-pays weight dequant, so prefill collapses 42→20→13 t/s at chunk 256/64/32. 256 is the default compromise, operator-tunable.

Gap 2 — packed multi-sequence prefill

ForwardPass.PrefillPackedMulti(chunks, startPos, caches, wantLogits): in-flight prompts' chunks run as ONE packed varlen pass — batched GEMMs over the combined token count, per-token RoPE/append/attention against each sequence's own PagedKvCache, no padding, no cross-sequence attention. Logits only for prompt-completing chunks. Same guards as BatchForwardMulti (no MoE/TQ/gemma4).
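To make the "packed varlen" idea concrete, here is a small sketch of the bookkeeping such a pass needs: for every row of the packed token batch, which sequence it belongs to and at which position it sits, plus the rows that need logits. `BuildPackedLayout` and its tuple shape are assumptions made for illustration; the actual `PrefillPackedMulti` internals are not shown in this PR page.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical layout builder. Each chunk is (StartPos, Length, CompletesPrompt).
// Rows are packed back to back with no padding; every row remembers its own sequence
// and absolute position, so RoPE and attention can use that sequence's own cache.
(int[] SeqIndex, int[] Position, int[] LogitRows) BuildPackedLayout(
    (int StartPos, int Length, bool CompletesPrompt)[] chunks)
{
    var seqIndex = new List<int>();
    var position = new List<int>();
    var logitRows = new List<int>();
    for (int s = 0; s < chunks.Length; s++)
    {
        for (int t = 0; t < chunks[s].Length; t++)
        {
            seqIndex.Add(s);
            position.Add(chunks[s].StartPos + t);
        }
        // Logits are only needed for the last token of a chunk that completes its prompt.
        if (chunks[s].CompletesPrompt && chunks[s].Length > 0)
            logitRows.Add(seqIndex.Count - 1);
    }
    return (seqIndex.ToArray(), position.ToArray(), logitRows.ToArray());
}

var layout = BuildPackedLayout(new[]
{
    (StartPos: 0, Length: 3, CompletesPrompt: false),   // first slice of a new prompt
    (StartPos: 256, Length: 2, CompletesPrompt: true),  // final slice of another prompt
});
Console.WriteLine("seq:    " + string.Join(",", layout.SeqIndex));
Console.WriteLine("pos:    " + string.Join(",", layout.Position));
Console.WriteLine("logits: " + string.Join(",", layout.LogitRows));
```

The batched GEMMs would run over all five rows at once, while attention for each row would consult only the cache selected by its sequence index.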

A/B (--cb --packed, 4×256-token prompts): whole-prompt serial 35.8 tok/s · chunked serial 20.4 · chunked packed 33.7 (+65% vs serial chunks, −6% vs blocking).

Gap 3 — KV token-budget admission backpressure

Each request reserves min(prompt + MaxNewTokens, MaxSeqLen) KV tokens at admission (ForwardPass.KvBytesPerToken); the head request waits when the budget would be exceeded instead of risking OOM. A lone oversized request is always admitted (no deadlock). SHARPI_KV_BUDGET_MB / KvBudgetMb: 0 = auto (half RAM), negative = unlimited.
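The admission rule can be sketched as below. This is an illustration under stated assumptions: `BudgetTokens`, `ReserveTokens` and `AdmitFromHead` are hypothetical names, the 16 KiB-per-token figure is made up, and the "0 = auto" case is assumed to be resolved to a concrete budget before this code runs.

```csharp
using System;
using System.Collections.Generic;

// Convert a budget in MB to tokens. long arithmetic avoids 32-bit overflow for large budgets.
long BudgetTokens(long budgetMb, long kvBytesPerToken)
    => budgetMb * 1024L * 1024L / kvBytesPerToken;

// Reservation per request: min(prompt + MaxNewTokens, MaxSeqLen).
long ReserveTokens(int promptTokens, int maxNewTokens, int maxSeqLen)
    => Math.Min((long)promptTokens + maxNewTokens, maxSeqLen);

// Admit requests from the head of the FIFO while they fit.
// budgetTokens < 0 means unlimited; a request is always admitted when nothing else holds a reservation.
(int Admitted, long Reserved) AdmitFromHead(
    Queue<long> pendingReservations, long reservedTokens, long budgetTokens)
{
    int admitted = 0;
    while (pendingReservations.Count > 0)
    {
        long need = pendingReservations.Peek();
        bool fits = budgetTokens < 0 || reservedTokens + need <= budgetTokens;
        bool loneRequest = reservedTokens == 0; // avoids deadlock on an oversized request
        if (!fits && !loneRequest) break;       // head waits; later requests queue behind it
        reservedTokens += pendingReservations.Dequeue();
        admitted++;
    }
    return (admitted, reservedTokens);
}

long budget = BudgetTokens(64, 16 * 1024); // 64 MB at an assumed 16 KiB per token
var pendingQueue = new Queue<long>(new[]
{
    ReserveTokens(1040, 256, 2048),
    ReserveTokens(3000, 256, 2048), // capped at MaxSeqLen
    ReserveTokens(1040, 256, 2048),
});
var outcome = AdmitFromHead(pendingQueue, 0, budget);
Console.WriteLine($"budget={budget} admitted={outcome.Admitted} reserved={outcome.Reserved}");
```

In this example the third request stays queued until an earlier one finishes and releases its reservation.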

HardwareProfile: measured PCIe bandwidth

CudaBackend.MeasurePcieBandwidthGBps() (pinned 64 MiB probe, min(H2D, D2H), heuristic fallback) replaces the VRAM-bucket guess in Detect(CudaBackend). 4070 Ti: ~26 GB/s measured vs 20 guessed.
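How the probe's timings might be reduced to one number can be sketched as follows. Only the arithmetic is shown; the pinned `cudaMemcpy` transfers are not. The best-of-3 selection and the [5, 64] GB/s clamp come from the commit message further down, while `EstimatePcieGBps`, the sample timings, and the choice of 1 GB = 1e9 bytes are assumptions for illustration.

```csharp
using System;
using System.Linq;

double GBps(long bytes, double seconds) => bytes / seconds / 1e9;

// Hypothetical reduction: best (shortest) run per direction, then the slower direction wins.
double EstimatePcieGBps(double[] h2dSeconds, double[] d2hSeconds, long probeBytes, double heuristicGBps)
{
    if (h2dSeconds.Length == 0 || d2hSeconds.Length == 0) return heuristicGBps; // probe failed
    double bestH2d = h2dSeconds.Min();
    double bestD2h = d2hSeconds.Min();
    if (bestH2d <= 0 || bestD2h <= 0) return heuristicGBps; // unusable timings
    double measured = Math.Min(GBps(probeBytes, bestH2d), GBps(probeBytes, bestD2h));
    return Math.Clamp(measured, 5.0, 64.0);
}

long probeSize = 64L * 1024 * 1024; // 64 MiB probe buffer
double estimate = EstimatePcieGBps(
    new[] { 0.0030, 0.0026, 0.0027 },  // made-up host-to-device timings, seconds
    new[] { 0.0025, 0.0028, 0.0026 },  // made-up device-to-host timings, seconds
    probeSize,
    20.0);                             // fallback value used when the probe fails
Console.WriteLine($"{estimate:F1} GB/s");
```

Taking the minimum of the two directions keeps the planner conservative when one direction is slower than the other.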

CUDA 13.3 evaluation (no default change)

All pinned entry points exist in 13.3. Qwen3-8B -g -1: prefill 2354→2390 t/s (+1.5%), decode 58.0→57.9 (noise) — decode is HBM-bound in our NVRTC kernels. CudaLibraryResolver (assembly's single DllImportResolver, NVRTC probing folded in) keeps CUDA 12 default, SHARPI_CUDA13=1 opts into the 13 pair, and 13-only installs now load instead of hard-failing.

Test-suite fix found along the way

Gemma4_12B_CudaForward_ProducesCoherentDecode has been failing on master at default settings since #186: #185's auto-narrow makes the 12B tests run with bf16 KV on a 12 GB card, and their synthetic OOD prompt's tiny logit margin lets bf16 rounding tip greedy into a one-token attractor (forced bf16 failed identically at #184 — nothing regressed; CLI bf16 output is coherent). The 12B tests now pin SHARPI_KV_DTYPE=fp32 (they guard trunk math, not dtype noise).

Tests

5 new in ContinuousBatchingTests (chunked parity, packed parity incl. startPos>0 continuation, engine chunked-vs-unchunked greedy equality, tiny-budget completion). Full suite green: Core 134, ForwardPass 481 (all passing after the 12B pin), Pipeline 37, TurboQuant 41, Server 103.

🤖 Generated with Claude Code

…183)

Closes the three scheduling gaps from #183 in ContinuousBatchingEngine:

Gap 1 - chunked/interleaved prefill: admission no longer prefills a prompt in one blocking call that stalls every active decode. Prompts prefill in prefillChunkTokens slices (default 256; SHARPI_PREFILL_CHUNK / PrefillChunkTokens option; 0 restores blocking) with one BatchForwardMulti decode step between chunks. Cancellation is honoured between chunks. SnapKV-enabled engines fall back to blocking (eviction scoring needs the whole prompt at startPos==0). SmolLM2-CPU A/B (3 decoders + 1040-token prompt injection): worst decode stall 25.1s -> 7.9s, ~12% aggregate cost.

Gap 2 - packed multi-sequence prefill: ForwardPass.PrefillPackedMulti runs the in-flight prompts' chunks as ONE packed varlen pass (batched GEMMs over the combined tokens, per-token attention vs each sequence's own cache, no padding). 4x256-token prompts: 33.7 tok/s packed vs 20.4 chunked-serial (+65%), within 6% of blocking whole-prompt prefill. Logits only computed for chunks that complete a prompt. Same guards as BatchForwardMulti.

Gap 3 - KV token-budget backpressure: each request reserves min(promptTokens + MaxNewTokens, MaxSeqLen) tokens at admission (ForwardPass.KvBytesPerToken converts bytes->tokens); the head request waits in a local FIFO when it would exceed the budget instead of OOM-ing. A lone oversized request is always admitted (no deadlock). SHARPI_KV_BUDGET_MB / KvBudgetMb: 0 = auto (half RAM), neg = unlimited.

Also: dispose-time drain pulls requests still in the channel so a late writer can't hang a caller; _pendingCount is restored if WriteAsync throws.

Bench: `SharpInference.Bench -- --cb` (stall A/B) and `--cb --packed` (packed-vs-serial prefill A/B).

Tests: 5 new (chunked parity, packed parity incl. startPos>0 continuation, engine chunked-vs-unchunked greedy equality, tiny-budget completion).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…183)

HardwareProfile.Detect(CudaBackend) replaces the VRAM-size-bucket PCIe guess with CudaBackend.MeasurePcieBandwidthGBps(): a pinned 64 MiB cudaMemcpy probe (min of H2D/D2H, best-of-3, ~100ms at load), clamped to [5,64] GB/s with the heuristic as fallback. 4070 Ti: measured ~26 GB/s vs the old 20 GB/s bucket, so TierPlanner plans against real bandwidth.

CudaLibraryResolver consolidates the assembly's single DllImportResolver (NVRTC probing moved from NvrtcInterop unchanged) and adds CUDA runtime major selection: cudart64_12/cublas64_12 stay the default, SHARPI_CUDA13=1 prefers the _13 pair, and 13-only toolkit installs fall back to _13 instead of hard-failing model load. Decided as a pair, never a mixed 12/13 process. Registered from both interop static ctors so it precedes the first P/Invoke bind on every entry path.

CUDA 13.3 evaluated on Qwen3-8B -g -1 (all entry points incl. deprecated cublasSetMathMode still exported): prefill 2354->2390 t/s (+1.5%), decode 58.0->57.9 (noise). Decode is HBM-bound in our NVRTC kernels, so CUDA 12 remains the pinned default matching the nvrtc64_120_0 JIT.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Since #185 the KV dtype auto-narrows when fp32 won't fit the VRAM budget, so on a 12 GB card the 12B tests silently ran with bf16 KV — and their synthetic OOD prompt has a tiny top-logit margin, where bf16 rounding tips greedy decode into a single-token repetition attractor (fails the >=2-distinct assertion; q8_0's different rounding happens to pass; forced bf16 failed identically at #184, so nothing regressed — the default changed). The bf16 path is coherent on real prompts (CLI-validated to 128K, #179/#184). These tests guard trunk math (k_eq_v, per-layer KV, embed), so the dtype must be deterministic: pin fp32 in the class ctor, restore on Dispose. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

gemini-code-assist

Code Review

This pull request addresses concurrent load scheduling issues (issue #183) by introducing chunked/interleaved prefill, packed multi-sequence prefill, and KV token-budget admission backpressure to the continuous batching engine. It also adds dynamic CUDA 12/13 library resolution and a real PCIe bandwidth measurement probe. Feedback on the changes highlights three critical improvements: disabling continuous batching for Gemma 4 models to prevent a guaranteed crash due to unsupported per-layer head dimensions in packed prefill, wrapping the main batcher loop in a try-finally block to ensure resources are properly drained and native memory leaks are avoided on unexpected exceptions, and optimizing the CUDA library resolver to load DLLs lazily rather than eagerly loading both CUDA 12 and 13 pairs.


gemini-code-assist · 2026-06-10T09:41:41Z

+                thinkTokenId, endThinkTokenId,
+                prefillChunkTokens: opts.PrefillChunkTokens,
+                kvBudgetBytes: opts.KvBudgetMb > 0 ? opts.KvBudgetMb * 1024 * 1024 : opts.KvBudgetMb);


Continuous batching is currently enabled for Gemma 4 models on CPU, but ForwardPass.PrefillPackedMulti explicitly throws a NotSupportedException when _layerHeadDim is not null (which is the case for Gemma 4 models due to their per-layer head dimensions). This will cause a guaranteed crash and silent hang of the background batcher thread on the very first request.

To prevent this, we should disable continuous batching for Gemma 4 models and let them fall back to the single-user InferenceEngine (which is fully supported) by checking hp.LayerHeadDim == null in the admission condition on line 94:

if (opts.MaxBatchSize > 1 && batchingSupported && hp.LayerHeadDim == null && fwd is ForwardPass cpuFwd)

gemini-code-assist · 2026-06-10T09:41:41Z

+        var pending = new Queue<PendingRequest>();
+        var prefilling = new List<PrefillingSeq>();


The main while (!_disposed) loop in BatcherLoop is not wrapped in a try-finally block. If any unexpected exception is thrown inside the loop (e.g., from BatchForwardMulti or RunPrefillStep due to CUDA out-of-memory, transient GPU errors, or bugs in custom kernels), the background thread will die silently.

Because the "Drain" phase is located outside the loop and is not protected by a finally block, it will be bypassed entirely. This leads to:

Permanent silent hangs: Active and prefilling requests' output channels are never completed, causing all active clients to hang indefinitely.

Native memory leaks: The PagedKvCache of all active and prefilling requests are never disposed, leaking native memory.

We should wrap the entire loop in a try-finally block to guarantee that the Drain phase is always executed on failure:

try
{
    while (!_disposed)
    {
        // ... loop body ...
    }
}
finally
{
    // Drain: complete everything still in flight ...
}
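A slightly fuller sketch of that pattern, using `System.Threading.Channels`, shows how the drain can complete every in-flight output channel with the fault so that readers stop waiting. `RunBatcher` and its `step` delegate are hypothetical stand-ins for `BatcherLoop` and its body, and cache disposal is omitted.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Channels;

// Hypothetical stand-in for the batcher loop. step returns false to stop and may throw.
Exception? RunBatcher(List<Channel<int>> inFlight, Func<int, bool> step)
{
    Exception? fault = null;
    try
    {
        int iteration = 0;
        while (step(iteration)) iteration++;
    }
    catch (Exception ex)
    {
        fault = ex; // remember why the loop died so callers can see it
    }
    finally
    {
        // Drain: complete every output channel, with the fault if there was one.
        // TryComplete(null) is a normal completion.
        foreach (var channel in inFlight)
            channel.Writer.TryComplete(fault);
        inFlight.Clear();
    }
    return fault;
}

var channels = new List<Channel<int>> { Channel.CreateUnbounded<int>(), Channel.CreateUnbounded<int>() };
var readers = channels.ConvertAll(c => c.Reader);
var result = RunBatcher(channels, i =>
{
    if (i == 2) throw new InvalidOperationException("simulated kernel failure");
    return true;
});
Console.WriteLine($"fault: {result?.Message}, first reader faulted: {readers[0].Completion.IsFaulted}");
```

A reader awaiting such a channel then observes the exception instead of hanging indefinitely.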

gemini-code-assist · 2026-06-10T09:41:41Z

+        bool prefer13 = Environment.GetEnvironmentVariable("SHARPI_CUDA13") == "1";
+        bool has13 = CanLoadPair("cudart64_13", "cublas64_13");
+        bool has12 = CanLoadPair("cudart64_12", "cublas64_12");
+        major = prefer13 && has13 ? 13
+              : has12 ? 12
+              : has13 ? 13
+              : 12; // neither loads — keep the pinned name so the standard load error surfaces


The current implementation of DecideRuntimeMajor eagerly calls CanLoadPair for both CUDA 12 and CUDA 13. Since NativeLibrary.TryLoad actually loads the libraries into the process memory and increments their reference counts, this causes the unused version's heavy DLLs (like cublas, which is hundreds of megabytes) to be loaded and remain in memory forever.

This wastes significant system memory and can potentially cause runtime conflicts between different CUDA versions. We should optimize the detection flow to only load the preferred version first, and only fall back to the other version if the preferred one is not available.

bool prefer13 = Environment.GetEnvironmentVariable("SHARPI_CUDA13") == "1";
if (prefer13 && CanLoadPair("cudart64_13", "cublas64_13")) { major = 13; }
else if (CanLoadPair("cudart64_12", "cublas64_12")) { major = 12; }
else if (CanLoadPair("cudart64_13", "cublas64_13")) { major = 13; }
else { major = 12; }

…esilience, lazy DLL probe)
- InferenceEngineLoader: exclude gemma4 per-layer-head_dim models from continuous batching (PrefillWithCache/BatchForwardMulti/PrefillPackedMulti all throw NotSupportedException); they fall back to the single-user InferenceEngine instead of failing every request.
- ContinuousBatchingEngine.BatcherLoop: wrap the loop in try/catch/finally, so an exception the per-request handlers didn't isolate now fails every in-flight request with that exception via the finally drain instead of killing the batcher silently (hung callers + leaked caches).
- CudaLibraryResolver: probe the preferred runtime pair first and free probe handles, so the unused CUDA major's DLLs are never left mapped in the process.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

pekkah · 2026-06-10T09:48:41Z

Addressed all three review findings in 2bf8f98:

Gemma 4 + continuous batching — gated in BuildForwardPass (batchOk now requires hp.LayerHeadDim is null), falling back to the single-user InferenceEngine. Note the failure mode was a clean per-request NotSupportedException (prefill exceptions are caught and completed onto the request channel), not a batcher hang — but every request failing made the engine useless for gemma4, so the gate is the right call.
Batcher loop resilience — BatcherLoop is now try/catch/finally: an exception the per-request handlers didn't isolate fails every in-flight request with that exception via the finally drain (no hung callers, no leaked caches).
Eager dual-version DLL loading — DecideRuntimeMajor probes the preferred pair first, only falls back when absent, and frees probe handles, so the unused CUDA major is never left mapped.
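The third fix (probe the preferred pair first, free probe handles) could look roughly like the sketch below. The library names and the preference order come from the review thread above; the structure is an assumption, not the code from 2bf8f98. The probe is passed in as a delegate so the decision logic can be exercised without CUDA installed.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;

// Probe that leaves nothing mapped: load both libraries, then free the handles again.
bool CanLoadPair(string runtime, string blas)
{
    if (!NativeLibrary.TryLoad(runtime, out IntPtr rt)) return false;
    bool ok = NativeLibrary.TryLoad(blas, out IntPtr bl);
    if (ok) NativeLibrary.Free(bl);
    NativeLibrary.Free(rt);
    return ok;
}

// Lazy decision: the non-preferred pair is probed only if the preferred one is missing.
int DecideRuntimeMajor(bool prefer13, Func<string, string, bool> canLoadPair)
{
    bool Has13() => canLoadPair("cudart64_13", "cublas64_13");
    bool Has12() => canLoadPair("cudart64_12", "cublas64_12");
    if (prefer13 && Has13()) return 13;
    if (Has12()) return 12;
    if (!prefer13 && Has13()) return 13; // with prefer13 set, the 13 pair already failed above
    return 12; // neither loads: keep the pinned name so the standard load error surfaces
}

// Fake probe for illustration: pretend only the CUDA 12 pair is installed.
var probed = new List<string>();
int major = DecideRuntimeMajor(false, (runtime, blas) =>
{
    probed.Add(runtime);
    return runtime == "cudart64_12";
});
Console.WriteLine($"major={major}, probed={string.Join(",", probed)}");

// Real probe, driven by the environment variable named in the PR.
bool wants13 = Environment.GetEnvironmentVariable("SHARPI_CUDA13") == "1";
Console.WriteLine($"real probe picks CUDA {DecideRuntimeMajor(wants13, CanLoadPair)}");
```

With the fake probe, the CUDA 13 pair is never touched when CUDA 12 is present and not overridden, which is the property the review asked for.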

Follow-ups filed from this work: #189 (dequant-once MatMulBatched cache → smaller prefill chunks), #190 (continuous batching on the CUDA backend), #191 (narrowed-KV decode-coherence coverage, refs #166).

Addresses the #232 review (test-coverage gaps; no production defects found by the correctness/silent-failure lenses):
- Vacuous-pass guard: new internal CudaHybridForwardPass.KvCacheDType observable; the parity oracle now asserts the requested dtype actually applied (else fp32-vs-fp32 would pass trivially if the env plumbing regressed).
- >4096 WAVE path coverage: Coder30B_Q8Kv_Wave_ArgmaxStable prefills >4096 tokens in one call so PrefillBatchedTrunk takes the AttentionBatchedWaveQ8_0 branch (was only manual CLI).
- Greedy (non-teacher-forced) coherence: catches an #188-style narrowed-KV self-decode collapse that teacher-forcing masks. Uses each model's OWN GGUF chat template (ApplyChatTemplate via tokenizer.ChatTemplate.Render + add_generation_prompt), because a raw continuation prompt collapses an instruct model regardless of KV dtype (the 'prompt must match chat template' trap); OLMoE now passes with its template.
- Tightened the skip: q8_0 is supported for both test models (kvDim%32==0), so a NotSupportedException is now a real failure, not a silent skip.

Hybrid KV-dtype suite 7/7 green (4 parity bf16/q8 × OLMoE/Coder + wave-q8 + 2 greedy-coherent). Gemma4-on-hybrid narrowed decode reuses the dense-tested AttentionSwa{Bf16,Q8_0} kernels via the trivial *Kv dispatch; not separately tested (needs a synthetic gemma4 hybrid split). Release clean under TreatWarningsAsErrors + AOT.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

pekkah and others added 3 commits June 10, 2026 12:37

gemini-code-assist Bot reviewed Jun 10, 2026


pekkah merged commit 9b54b19 into master Jun 10, 2026
1 check passed

pekkah deleted the feat/gpu-saturation-183 branch June 10, 2026 09:48

pekkah merged 4 commits into master from feat/gpu-saturation-183
