GPU saturation gaps: non-blocking prefill admission, packed multi-seq prefill, VRAM token-budget autotune #183

@pekkah

Context

Spun out of reviewing the article "I Built a C++ Backend So My GPU Would Stop Eating Air" (write-up of WarpGroup-Backend). Its thesis: standard batching pads short sequences with zeros up to the longest in the batch, wasting GPU FLOPs; fix it with padding-free packed batches, FFD bin-packing under a VRAM token budget, pinned-memory async DMA, and a Phase-0 autotune.
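To make the article's claim concrete, here is a minimal Python sketch of padded versus packed token counts and of first-fit-decreasing (FFD) packing under a token budget. The function names, the sample lengths, and the 1024-token budget are all illustrative assumptions; none of this is taken from WarpGroup-Backend or from our codebase.

```python
def padded_tokens(lengths):
    """Tokens computed when every sequence is padded to the longest one."""
    return len(lengths) * max(lengths) if lengths else 0


def ffd_pack(lengths, token_budget):
    """First-fit-decreasing: take sequences longest first and put each into
    the first bin that still has room, opening a new bin when none fits."""
    bins = []  # each bin is a list of sequence lengths sharing one packed batch
    for n in sorted(lengths, reverse=True):
        if n > token_budget:
            raise ValueError(f"sequence of {n} tokens exceeds budget {token_budget}")
        for b in bins:
            if sum(b) + n <= token_budget:
                b.append(n)
                break
        else:
            bins.append([n])
    return bins


# Hypothetical prompt lengths, in tokens.
lengths = [900, 120, 60, 500, 30, 400]
bins = ffd_pack(lengths, token_budget=1024)
real = sum(lengths)
padded = padded_tokens(lengths)
print(f"padded: {padded} tokens, real: {real} tokens, bins: {bins}")
```

With these sample lengths a single padded batch computes 5400 token positions for 2010 real tokens, while FFD fits the same prompts into two packed batches with no padding.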

Most of the headline techniques are already in our design and need no work:

  • Pinned memory + async DMA: already implemented. CudaBackend allocates, via cudaMallocHost, a staging buffer plus a double-buffered async pinned pool with completion events (src/SharpInference.Cuda/CudaBackend.cs:39-60) and direct-pinned Download/UploadAsync overloads (:708+).
  • No padding waste in decode: ForwardPass.BatchForwardMulti (src/SharpInference.Engine/ForwardPass.cs:2208) is already ragged, with one token per sequence, a per-sequence KV cache, and weights amortized N×. There is no rectangle to pad.
  • No padding waste in prefill: ContinuousBatchingEngine.AdmitRequest prefills each prompt individually (ContinuousBatchingEngine.cs:332) and PrefillCore runs the real token count. We never build the padded matrix the article fights.
  • 16-token Tensor-Core micro-pad / FFD packing: low value for us. cuBLAS pads GEMMs internally and quant block sizes already enforce alignment; FFD only matters once packed prefill exists.

The article's deeper lesson (keep the GPU saturated, never let scheduling idle it) does surface three real gaps worth tracking:


Gap 1 — Blocking, serialized prefill admission (highest value)

In ContinuousBatchingEngine.BatcherLoop, a newly admitted request's prefill runs synchronously and every already-active sequence stalls — no decode advances during that prefill (ContinuousBatchingEngine.cs:171-210, admission at :332). A long prompt freezes the whole decode batch. This is our analog of "the GPU eating air."

Proposed fix: chunked / interleaved prefill (Sarathi / vLLM-style) — split a prompt's prefill into chunks and interleave them with decode steps so active sequences keep progressing.

Scope: moderate; engine-level scheduling change, no new attention kernel.
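As a rough illustration of the scheduling idea, the Python sketch below simulates one possible interleaving policy: each iteration advances every decoding sequence by one token and gives at most one prefilling sequence a bounded chunk of its prompt. The `Seq` type, the `step` function, and the 256-token chunk size are hypothetical and do not mirror BatcherLoop; real chunked-prefill schedulers may choose chunks differently.

```python
from dataclasses import dataclass


@dataclass
class Seq:
    name: str
    prompt_len: int
    prefilled: int = 0  # prompt tokens already processed
    decoded: int = 0    # tokens generated so far

    @property
    def in_prefill(self):
        return self.prefilled < self.prompt_len


def step(seqs, chunk_size):
    """One scheduler iteration: all decoding sequences advance one token and
    at most one prefilling sequence consumes up to chunk_size prompt tokens."""
    work = []
    # Pick the prefill candidate before decoding so a sequence never does both.
    prefilling = next((s for s in seqs if s.in_prefill), None)
    for s in seqs:
        if not s.in_prefill:
            s.decoded += 1
            work.append((s.name, "decode", 1))
    if prefilling is not None:
        n = min(chunk_size, prefilling.prompt_len - prefilling.prefilled)
        prefilling.prefilled += n
        work.append((prefilling.name, "prefill", n))
    return work


# "a" is already decoding; "b" arrives with a long (hypothetical) prompt.
a = Seq("a", prompt_len=4, prefilled=4)
b = Seq("b", prompt_len=1000)
log = [step([a, b], chunk_size=256) for _ in range(4)]
for i, work in enumerate(log):
    print(i, work)
```

In this simulation "a" produces a token on every iteration while "b" finishes its 1000-token prefill in four chunks, instead of "a" stalling for the whole prefill.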

Gap 2 — Packed multi-sequence prefill (direct port of the article)

A varlen PrefillCore that prefills several pending prompts in one packed forward pass, amortizing weight reads across prompts exactly like BatchForwardMulti does for decode.

Scope: largest lift — needs a cu_seqlens-style packed causal-attention path. Builds naturally on Gap 1.
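For readers unfamiliar with the cu_seqlens convention, the sketch below shows the bookkeeping in plain Python: prompts are concatenated into one token stream, cumulative offsets mark the sequence boundaries, and attention is causal within a sequence and blocked across sequences. It builds an explicit boolean mask only for illustration; a real kernel would derive the same rule from the offsets without materializing the mask. The function names are assumptions, not existing code.

```python
from itertools import accumulate


def cu_seqlens(lengths):
    """Cumulative offsets: sequence i occupies packed rows cu[i]..cu[i+1]-1."""
    return [0, *accumulate(lengths)]


def packed_causal_mask(lengths):
    """mask[q][k] is True when packed token q may attend to packed token k:
    both belong to the same sequence and k is not after q."""
    cu = cu_seqlens(lengths)
    total = cu[-1]
    seq_id = [0] * total
    for i in range(len(lengths)):
        for t in range(cu[i], cu[i + 1]):
            seq_id[t] = i
    return [
        [seq_id[q] == seq_id[k] and k <= q for k in range(total)]
        for q in range(total)
    ]


# Two hypothetical prompts of 3 and 2 tokens packed into 5 rows.
offsets = cu_seqlens([3, 2])
mask = packed_causal_mask([3, 2])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

The printed mask is block-diagonal and lower-triangular within each block, which is the property a packed causal-attention path has to preserve.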

Gap 3 — VRAM token-budget autotune + admission backpressure (smallest/safest)

HardwareProfile is crude — PCIe bandwidth bucketed by VRAM size, no token budget (HardwareProfile.cs:39-45) — and ContinuousBatchingEngine admits up to maxBatchSize with no KV-memory backpressure, so a burst of long prompts can OOM. Add an empirical Phase-0 probe to derive a real token budget and gate admission on it.

Scope: small; standalone, no kernel work. Good first step and de-risks Gaps 1–2.
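One way the admission gate could look, sketched in Python under stated assumptions: the budget is derived analytically from free VRAM and the per-token KV-cache size as a stand-in for the empirical Phase-0 probe, and each request reserves its worst case (prompt plus maximum new tokens) up front. `TokenBudgetGate`, `kv_token_budget`, the 0.9 safety factor, and the model dimensions are hypothetical.

```python
def kv_token_budget(free_bytes, n_layers, n_kv_heads, head_dim,
                    dtype_bytes=2, safety=0.9):
    """Tokens of KV cache that fit in free_bytes, keeping (1 - safety) headroom.
    Each token stores one K and one V vector per layer per KV head."""
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * dtype_bytes
    return int(free_bytes * safety) // bytes_per_token


class TokenBudgetGate:
    """Admit a request only if its worst-case KV footprint fits the budget."""

    def __init__(self, token_budget, max_batch_size):
        self.token_budget = token_budget
        self.max_batch_size = max_batch_size
        self.reserved = {}  # request id -> reserved tokens

    def try_admit(self, request_id, prompt_len, max_new_tokens):
        need = prompt_len + max_new_tokens
        used = sum(self.reserved.values())
        if len(self.reserved) >= self.max_batch_size or used + need > self.token_budget:
            return False  # backpressure: caller keeps the request queued
        self.reserved[request_id] = need
        return True

    def release(self, request_id):
        self.reserved.pop(request_id, None)


# Hypothetical model: 32 layers, 8 KV heads, head_dim 128, fp16, 8 GiB free.
budget = kv_token_budget(8 * 2**30, n_layers=32, n_kv_heads=8, head_dim=128)
gate = TokenBudgetGate(token_budget=1000, max_batch_size=4)
first = gate.try_admit("a", prompt_len=600, max_new_tokens=200)
second = gate.try_admit("b", prompt_len=300, max_new_tokens=100)
gate.release("a")
retry = gate.try_admit("b", prompt_len=300, max_new_tokens=100)
print(budget, first, second, retry)
```

Reserving the worst case is deliberately conservative; a measured budget and reservations that shrink as sequences finish early would admit more work, at the cost of more bookkeeping.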


Suggested order

  1. Gap 3 (cheap, prevents OOM, unblocks measurement)
  2. Gap 1 (biggest throughput win under concurrent load)
  3. Gap 2 (largest lift; layers on top of 1)

Each is independently shippable.
