
GPU saturation gaps: chunked+packed prefill admission, KV budget, PCIe probe (#183) #188

Merged: pekkah merged 4 commits into master from feat/gpu-saturation-183 on Jun 10, 2026

@pekkah commented Jun 10, 2026 (Owner)

Closes #183.

Gap 1 — chunked / interleaved prefill admission

Admitted prompts prefill in prefillChunkTokens slices (default 256; SHARPI_PREFILL_CHUNK / PrefillChunkTokens; 0 = legacy blocking) with one BatchForwardMulti decode step between chunks. Cancellation honoured between chunks; SnapKV-enabled engines fall back to blocking (eviction needs the whole prompt at startPos==0).
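
A minimal sketch of the interleaving order described above, not the ContinuousBatchingEngine code. `ChunkedPrefillSketch` and `Plan` are hypothetical names, and the sketch only models the schedule (slices, decode steps, cancellation checks); the SnapKV fallback is represented solely by passing `0` as the chunk size.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical illustration of the schedule only; no model work happens here.
public static class ChunkedPrefillSketch
{
    // Returns the order of work for one admitted prompt: prefill slices of at most
    // chunkTokens, with one decode step for the active sequences between slices.
    // chunkTokens == 0 models the legacy blocking path (one whole-prompt prefill).
    public static List<string> Plan(int promptTokens, int chunkTokens,
                                    CancellationToken ct = default)
    {
        if (promptTokens <= 0) throw new ArgumentOutOfRangeException(nameof(promptTokens));
        if (chunkTokens < 0) throw new ArgumentOutOfRangeException(nameof(chunkTokens));

        var steps = new List<string>();
        int slice = chunkTokens == 0 ? promptTokens : chunkTokens;
        for (int start = 0; start < promptTokens; start += slice)
        {
            // Cancellation is observed at chunk boundaries, never mid-chunk.
            ct.ThrowIfCancellationRequested();

            int len = Math.Min(slice, promptTokens - start);
            steps.Add($"prefill[{start}..{start + len})");

            // Active decoders get a turn after every slice except the last; after
            // the last slice the new sequence joins the regular decode batch.
            if (start + len < promptTokens) steps.Add("decode");
        }
        return steps;
    }
}
```

With the bench's 1040-token prompt and the default chunk of 256, `Plan(1040, 256)` yields five prefill slices separated by four decode steps, so the longest stretch without a decode step is one 256-token slice rather than the whole prompt.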

A/B (SharpInference.Bench -- --cb, SmolLM2 CPU, 3 active decoders + 1040-token prompt injected):

| Mode | Worst decode stall | Aggregate |
|---|---|---|
| blocking (pre-#183) | 25.1 s | 8.1 tok/s |
| chunked+packed (256) | 7.9 s | 7.1 tok/s |

Smaller chunks are a trap on CPU — every MatMulBatched call re-pays weight dequant, so prefill collapses 42→20→13 t/s at chunk 256/64/32. 256 is the default compromise, operator-tunable.

Gap 2 — packed multi-sequence prefill

ForwardPass.PrefillPackedMulti(chunks, startPos, caches, wantLogits): in-flight prompts' chunks run as ONE packed varlen pass — batched GEMMs over the combined token count, per-token RoPE/append/attention against each sequence's own PagedKvCache, no padding, no cross-sequence attention. Logits only for prompt-completing chunks. Same guards as BatchForwardMulti (no MoE/TQ/gemma4).
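
To make the packed layout concrete, here is a small sketch of how chunks from several sequences could be laid out back to back. `PackedPrefillSketch`, `PackedToken`, `Pack` and `AttendLength` are hypothetical names chosen for illustration; the actual `PrefillPackedMulti` internals are not shown in this PR text, so treat this as one plausible layout rather than the implementation.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration of a packed variable-length (varlen) batch layout.
public static class PackedPrefillSketch
{
    // One row of the packed batch: which sequence the token belongs to and its
    // absolute position in that sequence (used for RoPE and the KV append slot).
    public readonly record struct PackedToken(int Seq, int Pos, int TokenId);

    // Concatenates the chunks with no padding. Rows offsets[i]..offsets[i+1]-1
    // belong to sequence i, so the batched GEMMs see offsets[^1] rows in total.
    public static (PackedToken[] Rows, int[] Offsets) Pack(
        IReadOnlyList<int[]> chunks, IReadOnlyList<int> startPos)
    {
        if (chunks.Count != startPos.Count)
            throw new ArgumentException("one startPos per chunk is required");

        var offsets = new int[chunks.Count + 1];
        for (int i = 0; i < chunks.Count; i++)
            offsets[i + 1] = offsets[i] + chunks[i].Length;

        var rows = new PackedToken[offsets[^1]];
        for (int i = 0; i < chunks.Count; i++)
            for (int t = 0; t < chunks[i].Length; t++)
                rows[offsets[i] + t] = new PackedToken(i, startPos[i] + t, chunks[i][t]);
        return (rows, offsets);
    }

    // Causal attention span for a packed row: the token attends to positions
    // 0..Pos of its own sequence's cache only, never to another sequence's rows.
    public static int AttendLength(PackedToken row) => row.Pos + 1;
}
```

The dense layers only need the row count, which is why the GEMMs can run over the combined tokens; the per-row `Seq` and `Pos` are what keep RoPE, cache appends and attention separate per sequence, including continuation chunks where `startPos > 0`.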

A/B (--cb --packed, 4×256-token prompts): whole-prompt serial 35.8 tok/s · chunked serial 20.4 · chunked packed 33.7 (+65% vs serial chunks, −6% vs blocking).

Gap 3 — KV token-budget admission backpressure

Each request reserves min(prompt + MaxNewTokens, MaxSeqLen) KV tokens at admission (ForwardPass.KvBytesPerToken); the head request waits when the budget would be exceeded instead of risking OOM. A lone oversized request is always admitted (no deadlock). SHARPI_KV_BUDGET_MB / KvBudgetMb: 0 = auto (half RAM), negative = unlimited.
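
The admission rule can be sketched as a small token ledger. `KvBudgetSketch` and its members are hypothetical names, and the sketch assumes the budget has already been resolved to bytes (the `0 = auto (half RAM)` resolution is not modelled; a negative value stands for unlimited).

```csharp
using System;

// Hypothetical illustration of KV token-budget admission; not the engine's code.
public sealed class KvBudgetSketch
{
    private readonly long _budgetTokens; // long.MaxValue models "unlimited"
    private long _reservedTokens;
    private int _active;

    // budgetBytes < 0 means unlimited. Here 0 is a literal zero budget, because
    // the "auto" resolution is assumed to have happened before this point.
    public KvBudgetSketch(long budgetBytes, long kvBytesPerToken)
    {
        if (kvBytesPerToken <= 0) throw new ArgumentOutOfRangeException(nameof(kvBytesPerToken));
        _budgetTokens = budgetBytes < 0 ? long.MaxValue : budgetBytes / kvBytesPerToken;
    }

    // Widen to long before multiplying so budgets of 2048 MB and above do not
    // overflow 32-bit arithmetic.
    public static long MbToBytes(int mb) => (long)mb * 1024 * 1024;

    // Tokens reserved at admission: min(prompt + MaxNewTokens, MaxSeqLen).
    public static long Reservation(int promptTokens, int maxNewTokens, int maxSeqLen)
        => Math.Min((long)promptTokens + maxNewTokens, maxSeqLen);

    // Returns true and records the reservation when the head request is admitted;
    // returns false when it must wait for running requests to release tokens.
    public bool TryAdmit(long reservation)
    {
        bool fits = _budgetTokens - _reservedTokens >= reservation;
        bool loneRequest = _active == 0; // nothing running: admit to avoid deadlock
        if (!fits && !loneRequest) return false;
        _reservedTokens += reservation;
        _active++;
        return true;
    }

    // Called when a request finishes and its cache is released.
    public void Release(long reservation)
    {
        _reservedTokens -= reservation;
        _active--;
    }
}
```

Reserving the worst case up front trades some concurrency for predictability: a request that stops early has held tokens it never used, but an admitted request cannot run out of KV space mid-decode.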

HardwareProfile: measured PCIe bandwidth

CudaBackend.MeasurePcieBandwidthGBps() (pinned 64 MiB probe, min(H2D, D2H), heuristic fallback) replaces the VRAM-bucket guess in Detect(CudaBackend). 4070 Ti: ~26 GB/s measured vs 20 guessed.
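
The arithmetic behind the probe can be sketched without touching CUDA. `PcieProbeSketch.EstimateGBps` is a hypothetical helper that takes already-measured copy times; the best-of-N selection and the [5, 64] GB/s clamp follow the commit message, while treating 1 GB as 1e9 bytes is an assumption of this sketch.

```csharp
using System;
using System.Linq;

// Hypothetical illustration of turning copy timings into a bandwidth estimate.
public static class PcieProbeSketch
{
    public const long ProbeBytes = 64L * 1024 * 1024; // 64 MiB pinned buffer

    // secondsH2D / secondsD2H: wall-clock times of repeated copies per direction.
    // Returns fallbackGBps when no usable measurement is available.
    public static double EstimateGBps(double[] secondsH2D, double[] secondsD2H,
                                      double fallbackGBps)
    {
        if (secondsH2D.Length == 0 || secondsD2H.Length == 0) return fallbackGBps;

        // Best-of-N: the fastest run is the one least disturbed by other activity.
        double bestH2D = secondsH2D.Min();
        double bestD2H = secondsD2H.Min();
        if (!(bestH2D > 0) || !(bestD2H > 0)) return fallbackGBps; // also rejects NaN

        double h2d = ProbeBytes / bestH2D / 1e9; // assumption: 1 GB = 1e9 bytes
        double d2h = ProbeBytes / bestD2H / 1e9;

        // Plan against the slower direction, clamped to a plausible PCIe range.
        return Math.Clamp(Math.Min(h2d, d2h), 5.0, 64.0);
    }
}
```

For example, a best host-to-device time of 2.6 ms for the 64 MiB buffer corresponds to roughly 25.8 GB/s, which is in line with the ~26 GB/s reported for the 4070 Ti.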

CUDA 13.3 evaluation (no default change)

All pinned entry points exist in 13.3. Qwen3-8B -g -1: prefill 2354→2390 t/s (+1.5%), decode 58.0→57.9 (noise) — decode is HBM-bound in our NVRTC kernels. CudaLibraryResolver (assembly's single DllImportResolver, NVRTC probing folded in) keeps CUDA 12 default, SHARPI_CUDA13=1 opts into the 13 pair, and 13-only installs now load instead of hard-failing.
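
The selection rule (12 by default, 13 on opt-in, 13 as a fallback on 13-only installs) can be sketched as below. `CudaMajorSketch` is a hypothetical name and this is not the `CudaLibraryResolver` source; it uses the standard `NativeLibrary.TryLoad` and `NativeLibrary.Free` APIs, probes the preferred pair first, and frees probe handles, which is the behaviour the later review fix in this PR describes. Passing bare library names to `TryLoad` and leaving resolution to the OS loader is an assumption of the sketch.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical illustration of choosing one CUDA runtime major for the process.
public static class CudaMajorSketch
{
    // Loads both libraries to confirm they resolve, then frees the handles so a
    // probe never leaves an unused CUDA major mapped in the process.
    public static bool CanLoadPair(string cudart, string cublas)
    {
        bool okRt = NativeLibrary.TryLoad(cudart, out IntPtr rt);
        bool okBlas = NativeLibrary.TryLoad(cublas, out IntPtr blas);
        if (okRt) NativeLibrary.Free(rt);
        if (okBlas) NativeLibrary.Free(blas);
        return okRt && okBlas;
    }

    // canLoadPair is injected so the decision can be exercised without CUDA.
    public static int DecideMajor(bool prefer13, Func<string, string, bool> canLoadPair)
    {
        bool Has13() => canLoadPair("cudart64_13", "cublas64_13");
        bool Has12() => canLoadPair("cudart64_12", "cublas64_12");

        if (prefer13 && Has13()) return 13; // opt-in wins when the 13 pair loads
        if (Has12()) return 12;             // pinned default
        if (!prefer13 && Has13()) return 13; // 13-only install (not yet probed)
        return 12; // neither loads: keep the pinned name so the load error surfaces
    }
}
```

A caller would pass the real probe, for instance `CudaMajorSketch.DecideMajor(prefer13, CudaMajorSketch.CanLoadPair)`. Deciding on the cudart/cublas pair together keeps the process from mixing a 12 runtime with a 13 cuBLAS.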

Test-suite fix found along the way

Gemma4_12B_CudaForward_ProducesCoherentDecode has been failing on master at default settings since #186: #185's auto-narrow makes the 12B tests run with bf16 KV on a 12 GB card, and their synthetic OOD prompt's tiny logit margin lets bf16 rounding tip greedy into a one-token attractor (forced bf16 failed identically at #184 — nothing regressed; CLI bf16 output is coherent). The 12B tests now pin SHARPI_KV_DTYPE=fp32 (they guard trunk math, not dtype noise).
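
The pin-and-restore pattern used for the 12B tests can be sketched as a small disposable. `EnvVarPin` is a hypothetical helper name; the commit message only states that the test class pins fp32 in its constructor and restores the previous value on Dispose.

```csharp
#nullable enable
using System;

// Hypothetical illustration: pin an environment variable for a scope and
// restore whatever was there before (including "unset") on Dispose.
public sealed class EnvVarPin : IDisposable
{
    private readonly string _name;
    private readonly string? _previous;

    public EnvVarPin(string name, string value)
    {
        _name = name;
        _previous = Environment.GetEnvironmentVariable(name);
        Environment.SetEnvironmentVariable(name, value);
    }

    // Passing null removes the variable, which restores the "unset" state.
    public void Dispose() => Environment.SetEnvironmentVariable(_name, _previous);
}
```

A test class could hold `new EnvVarPin("SHARPI_KV_DTYPE", "fp32")` as a field and dispose it in its own `Dispose`. Because environment variables are process-wide, this only isolates tests that do not run in parallel with others reading the same variable.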

Tests

5 new in ContinuousBatchingTests (chunked parity, packed parity incl. startPos>0 continuation, engine chunked-vs-unchunked greedy equality, tiny-budget completion). Full suite green: Core 134, ForwardPass 481 (all passing after the 12B pin), Pipeline 37, TurboQuant 41, Server 103.

🤖 Generated with Claude Code

pekkah and others added 3 commits June 10, 2026 12:37
…183)

Closes the three scheduling gaps from #183 in ContinuousBatchingEngine:

Gap 1 — chunked/interleaved prefill: admission no longer prefills a prompt
in one blocking call that stalls every active decode. Prompts prefill in
prefillChunkTokens slices (default 256; SHARPI_PREFILL_CHUNK /
PrefillChunkTokens option; 0 restores blocking) with one BatchForwardMulti
decode step between chunks. Cancellation is honoured between chunks.
SnapKV-enabled engines fall back to blocking (eviction scoring needs the
whole prompt at startPos==0). SmolLM2-CPU A/B (3 decoders + 1040-token
prompt injection): worst decode stall 25.1s -> 7.9s, ~12% aggregate cost.

Gap 2 — packed multi-sequence prefill: ForwardPass.PrefillPackedMulti runs
the in-flight prompts' chunks as ONE packed varlen pass (batched GEMMs over
the combined tokens, per-token attention vs each sequence's own cache, no
padding). 4x256-token prompts: 33.7 tok/s packed vs 20.4 chunked-serial
(+65%), within 6% of blocking whole-prompt prefill. Logits only computed
for chunks that complete a prompt. Same guards as BatchForwardMulti.

Gap 3 — KV token-budget backpressure: each request reserves
min(promptTokens + MaxNewTokens, MaxSeqLen) tokens at admission
(ForwardPass.KvBytesPerToken converts bytes->tokens); the head request
waits in a local FIFO when it would exceed the budget instead of OOM-ing.
A lone oversized request is always admitted (no deadlock).
SHARPI_KV_BUDGET_MB / KvBudgetMb: 0 = auto (half RAM), neg = unlimited.

Also: dispose-time drain pulls requests still in the channel so a late
writer can't hang a caller; _pendingCount is restored if WriteAsync throws.

Bench: `SharpInference.Bench -- --cb` (stall A/B) and `--cb --packed`
(packed-vs-serial prefill A/B). Tests: 5 new (chunked parity, packed parity
incl. startPos>0 continuation, engine chunked-vs-unchunked greedy equality,
tiny-budget completion).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…183)

HardwareProfile.Detect(CudaBackend) replaces the VRAM-size-bucket PCIe
guess with CudaBackend.MeasurePcieBandwidthGBps(): a pinned 64 MiB
cudaMemcpy probe (min of H2D/D2H, best-of-3, ~100ms at load), clamped to
[5,64] GB/s with the heuristic as fallback. 4070 Ti: measured ~26 GB/s vs
the old 20 GB/s bucket, so TierPlanner plans against real bandwidth.

CudaLibraryResolver consolidates the assembly's single DllImportResolver
(NVRTC probing moved from NvrtcInterop unchanged) and adds CUDA runtime
major selection: cudart64_12/cublas64_12 stay the default, SHARPI_CUDA13=1
prefers the _13 pair, and 13-only toolkit installs fall back to _13
instead of hard-failing model load. Decided as a pair — never a mixed
12/13 process. Registered from both interop static ctors so it precedes
the first P/Invoke bind on every entry path.

CUDA 13.3 evaluated on Qwen3-8B -g -1 (all entry points incl. deprecated
cublasSetMathMode still exported): prefill 2354->2390 t/s (+1.5%), decode
58.0->57.9 (noise) — decode is HBM-bound in our NVRTC kernels, so CUDA 12
remains the pinned default matching the nvrtc64_120_0 JIT.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Since #185 the KV dtype auto-narrows when fp32 won't fit the VRAM budget,
so on a 12 GB card the 12B tests silently ran with bf16 KV — and their
synthetic OOD prompt has a tiny top-logit margin, where bf16 rounding tips
greedy decode into a single-token repetition attractor (fails the
>=2-distinct assertion; q8_0's different rounding happens to pass; forced
bf16 failed identically at #184, so nothing regressed — the default
changed). The bf16 path is coherent on real prompts (CLI-validated to
128K, #179/#184). These tests guard trunk math (k_eq_v, per-layer KV,
embed), so the dtype must be deterministic: pin fp32 in the class ctor,
restore on Dispose.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

@gemini-code-assist (bot) left a comment

Code Review

This pull request addresses concurrent load scheduling issues (issue #183) by introducing chunked/interleaved prefill, packed multi-sequence prefill, and KV token-budget admission backpressure to the continuous batching engine. It also adds dynamic CUDA 12/13 library resolution and a real PCIe bandwidth measurement probe. Feedback on the changes highlights three critical improvements: disabling continuous batching for Gemma 4 models to prevent a guaranteed crash due to unsupported per-layer head dimensions in packed prefill, wrapping the main batcher loop in a try-finally block to ensure resources are properly drained and native memory leaks are avoided on unexpected exceptions, and optimizing the CUDA library resolver to load DLLs lazily rather than eagerly loading both CUDA 12 and 13 pairs.

Comment on lines +97 to +99
thinkTokenId, endThinkTokenId,
prefillChunkTokens: opts.PrefillChunkTokens,
kvBudgetBytes: opts.KvBudgetMb > 0 ? opts.KvBudgetMb * 1024 * 1024 : opts.KvBudgetMb);

high

Continuous batching is currently enabled for Gemma 4 models on CPU, but ForwardPass.PrefillPackedMulti explicitly throws a NotSupportedException when _layerHeadDim is not null (which is the case for Gemma 4 models due to their per-layer head dimensions). This will cause a guaranteed crash and silent hang of the background batcher thread on the very first request.

To prevent this, we should disable continuous batching for Gemma 4 models and let them fall back to the single-user InferenceEngine (which is fully supported) by checking hp.LayerHeadDim == null in the admission condition on line 94:

if (opts.MaxBatchSize > 1 && batchingSupported && hp.LayerHeadDim == null && fwd is ForwardPass cpuFwd)

Comment on lines +232 to +233
var pending = new Queue<PendingRequest>();
var prefilling = new List<PrefillingSeq>();

high

The main while (!_disposed) loop in BatcherLoop is not wrapped in a try-finally block. If any unexpected exception is thrown inside the loop (e.g., from BatchForwardMulti or RunPrefillStep due to CUDA out-of-memory, transient GPU errors, or bugs in custom kernels), the background thread will die silently.

Because the "Drain" phase is located outside the loop and is not protected by a finally block, it will be bypassed entirely. This leads to:

  1. Permanent silent hangs: Active and prefilling requests' output channels are never completed, causing all active clients to hang indefinitely.
  2. Native memory leaks: The PagedKvCache of all active and prefilling requests are never disposed, leaking native memory.

We should wrap the entire loop in a try-finally block to guarantee that the Drain phase is always executed on failure:

try
{
    while (!_disposed)
    {
        // ... loop body ...
    }
}
finally
{
    // Drain: complete everything still in flight ...
}

Comment on lines +54 to +60
bool prefer13 = Environment.GetEnvironmentVariable("SHARPI_CUDA13") == "1";
bool has13 = CanLoadPair("cudart64_13", "cublas64_13");
bool has12 = CanLoadPair("cudart64_12", "cublas64_12");
major = prefer13 && has13 ? 13
: has12 ? 12
: has13 ? 13
: 12; // neither loads — keep the pinned name so the standard load error surfaces

medium

The current implementation of DecideRuntimeMajor eagerly calls CanLoadPair for both CUDA 12 and CUDA 13. Since NativeLibrary.TryLoad actually loads the libraries into the process memory and increments their reference counts, this causes the unused version's heavy DLLs (like cublas, which is hundreds of megabytes) to be loaded and remain in memory forever.

This wastes significant system memory and can potentially cause runtime conflicts between different CUDA versions. We should optimize the detection flow to only load the preferred version first, and only fall back to the other version if the preferred one is not available.

        bool prefer13 = Environment.GetEnvironmentVariable("SHARPI_CUDA13") == "1";
        if (prefer13 && CanLoadPair("cudart64_13", "cublas64_13"))
        {
            major = 13;
        }
        else if (CanLoadPair("cudart64_12", "cublas64_12"))
        {
            major = 12;
        }
        else if (CanLoadPair("cudart64_13", "cublas64_13"))
        {
            major = 13;
        }
        else
        {
            major = 12;
        }

…esilience, lazy DLL probe)

- InferenceEngineLoader: exclude gemma4 per-layer-head_dim models from
  continuous batching (PrefillWithCache/BatchForwardMulti/PrefillPackedMulti
  all throw NotSupportedException) — they fall back to the single-user
  InferenceEngine instead of failing every request.
- ContinuousBatchingEngine.BatcherLoop: wrap the loop in try/catch/finally —
  an exception the per-request handlers didn't isolate now fails every
  in-flight request with that exception via the finally drain instead of
  killing the batcher silently (hung callers + leaked caches).
- CudaLibraryResolver: probe the preferred runtime pair first and free probe
  handles, so the unused CUDA major's DLLs are never left mapped in the
  process.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@pekkah commented Jun 10, 2026 (Owner, Author)

Addressed all three review findings in 2bf8f98:

  1. Gemma 4 + continuous batching: gated in BuildForwardPass (batchOk now requires hp.LayerHeadDim is null), falling back to the single-user InferenceEngine. Note the failure mode was a clean per-request NotSupportedException (prefill exceptions are caught and completed onto the request channel), not a batcher hang. Even so, every request failing made the engine useless for gemma4, so the gate is the right call.
  2. Batcher loop resilience: BatcherLoop is now try/catch/finally. An exception the per-request handlers didn't isolate fails every in-flight request with that exception via the finally drain (no hung callers, no leaked caches).
  3. Eager dual-version DLL loading: DecideRuntimeMajor probes the preferred pair first, only falls back when absent, and frees probe handles, so the unused CUDA major is never left mapped.
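
For item 2, the shape of a loop whose drain always runs can be sketched with `System.Threading.Channels`. `BatcherDrainSketch.Run` is a hypothetical stand-in for BatcherLoop, and using `Channel<int>` as the per-request output is an assumption; the real loop's request types and cache disposal are not shown here.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Threading.Channels;

// Hypothetical illustration of a batcher loop whose drain runs on every exit path.
public static class BatcherDrainSketch
{
    // Runs step() until it returns false or throws. Either way the finally block
    // completes every in-flight output channel, passing the fault (if any) so
    // readers observe the exception instead of waiting forever.
    public static void Run(Func<bool> step, IReadOnlyList<Channel<int>> inFlight)
    {
        Exception? fault = null;
        try
        {
            while (step()) { }
        }
        catch (Exception ex)
        {
            fault = ex; // remember the failure instead of letting the thread die
        }
        finally
        {
            foreach (var ch in inFlight)
                ch.Writer.TryComplete(fault); // null fault means normal completion
            // A real engine would also dispose each request's KV cache here.
        }
    }
}
```

A reader awaiting `ch.Reader.Completion` (or enumerating with `ReadAllAsync`) then sees the original exception rather than an indefinite wait.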

Follow-ups filed from this work: #189 (dequant-once MatMulBatched cache → smaller prefill chunks), #190 (continuous batching on the CUDA backend), #191 (narrowed-KV decode-coherence coverage, refs #166).

@pekkah pekkah merged commit 9b54b19 into master Jun 10, 2026
1 check passed
@pekkah pekkah deleted the feat/gpu-saturation-183 branch June 10, 2026 09:48
pekkah added a commit that referenced this pull request Jun 13, 2026
Addresses the #232 review (test-coverage gaps; no production defects found by the
correctness/silent-failure lenses):
- Vacuous-pass guard: new internal CudaHybridForwardPass.KvCacheDType observable;
  the parity oracle now asserts the requested dtype actually applied (else fp32-vs-fp32
  would pass trivially if the env plumbing regressed).
- >4096 WAVE path coverage: Coder30B_Q8Kv_Wave_ArgmaxStable prefills >4096 tokens in one
  call so PrefillBatchedTrunk takes the AttentionBatchedWaveQ8_0 branch (was only manual CLI).
- Greedy (non-teacher-forced) coherence: catches an #188-style narrowed-KV self-decode
  collapse that teacher-forcing masks. Uses each model's OWN GGUF chat template
  (ApplyChatTemplate via tokenizer.ChatTemplate.Render + add_generation_prompt) — a raw
  continuation prompt collapses an instruct model regardless of KV dtype (the 'prompt must
  match chat template' trap); OLMoE now passes with its template.
- Tightened the skip: q8_0 is supported for both test models (kvDim%32==0), so a
  NotSupportedException is now a real failure, not a silent skip.

Hybrid KV-dtype suite 7/7 green (4 parity bf16/q8 × OLMoE/Coder + wave-q8 + 2 greedy-coherent).
Gemma4-on-hybrid narrowed decode reuses the dense-tested AttentionSwa{Bf16,Q8_0} kernels via
the trivial *Kv dispatch; not separately tested (needs a synthetic gemma4 hybrid split).
Release clean under TreatWarningsAsErrors + AOT.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Development

Successfully merging this pull request may close these issues.

GPU saturation gaps: non-blocking prefill admission, packed multi-seq prefill, VRAM token-budget autotune
