Context
Spun out of reviewing the article "I Built a C++ Backend So My GPU Would Stop Eating Air" (write-up of WarpGroup-Backend). Its thesis: standard batching pads short sequences with zeros up to the longest in the batch, wasting GPU FLOPs; fix it with padding-free packed batches, FFD bin-packing under a VRAM token budget, pinned-memory async DMA, and a Phase-0 autotune.
Most of the headline techniques are already in our design and need no work:
- Pinned memory + async DMA: already implemented. CudaBackend cudaMallocHosts a staging buffer plus a double-buffered async pinned pool with completion events (src/SharpInference.Cuda/CudaBackend.cs:39-60) and direct-pinned Download/UploadAsync overloads (:708+).
- No padding waste in decode: ForwardPass.BatchForwardMulti (src/SharpInference.Engine/ForwardPass.cs:2208) is already ragged, with one token per sequence, a per-sequence KV cache, and weights amortized N×. No rectangle to pad.
- No padding waste in prefill: ContinuousBatchingEngine.AdmitRequest prefills each prompt individually (ContinuousBatchingEngine.cs:332) and PrefillCore runs the real token count. We never build the padded matrix the article fights.
- 16-token Tensor-Core micro-pad / FFD packing: low value for us. cuBLAS pads GEMMs internally and quant block sizes already enforce alignment; FFD only matters once packed prefill exists.
The article's deeper lesson (keep the GPU saturated, never let scheduling idle it) does, however, surface three real gaps worth tracking:
Gap 1 — Blocking, serialized prefill admission (highest value)
In ContinuousBatchingEngine.BatcherLoop, a newly admitted request's prefill runs synchronously and every already-active sequence stalls — no decode advances during that prefill (ContinuousBatchingEngine.cs:171-210, admission at :332). A long prompt freezes the whole decode batch. This is our analog of "the GPU eating air."
Proposed fix: chunked / interleaved prefill (Sarathi / vLLM-style) — split a prompt's prefill into chunks and interleave them with decode steps so active sequences keep progressing.
Scope: moderate; engine-level scheduling change, no new attention kernel.
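To make the scheduling idea concrete, here is a minimal, GPU-free sketch of interleaving bounded prefill chunks with decode steps. It is written in Python for brevity rather than in the engine's C#, and every name in it (Seq, run_schedule, chunk_size) is hypothetical; it only records what a batcher loop would do on each iteration and is not a proposal for the actual BatcherLoop code.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Seq:
    """Hypothetical per-request state; not the engine's real type."""
    name: str
    prompt_len: int        # tokens that must be prefilled
    max_new_tokens: int    # decode steps to run after prefill (assumed >= 1)
    prefilled: int = 0
    generated: int = 0


def run_schedule(pending, chunk_size):
    """Interleave at most one prefill chunk with one decode step per iteration.

    Returns a trace of (kind, detail) tuples instead of launching any kernels.
    """
    pending = deque(pending)
    active = []            # sequences whose prefill has finished
    prefilling = None      # assumption: one prompt is mid-prefill at a time
    trace = []

    while pending or prefilling is not None or active:
        if prefilling is None and pending:
            prefilling = pending.popleft()

        # 1) Prefill a bounded chunk so the decode batch is never starved
        #    for longer than one chunk's worth of work.
        if prefilling is not None:
            n = min(chunk_size, prefilling.prompt_len - prefilling.prefilled)
            prefilling.prefilled += n
            trace.append(("prefill", (prefilling.name, n)))
            if prefilling.prefilled == prefilling.prompt_len:
                active.append(prefilling)
                prefilling = None

        # 2) Advance every active sequence by one token.
        if active:
            for s in active:
                s.generated += 1
            trace.append(("decode", tuple(s.name for s in active)))
            active = [s for s in active if s.generated < s.max_new_tokens]

    return trace


if __name__ == "__main__":
    reqs = [Seq("a", prompt_len=2, max_new_tokens=4),
            Seq("b", prompt_len=10, max_new_tokens=1)]
    for step in run_schedule(reqs, chunk_size=4):
        print(step)
```

In this trace, sequence "a" keeps decoding while "b" is prefilled in chunks of 4, 4, and 2 tokens; with the current blocking admission, "a" would instead wait for all 10 prompt tokens. The chunk size trades prefill efficiency against decode latency and would need to be measured rather than guessed.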
Gap 2 — Packed multi-sequence prefill (direct port of the article)
A varlen PrefillCore that prefills several pending prompts in one packed forward pass, amortizing weight reads across prompts exactly like BatchForwardMulti does for decode.
Scope: largest lift — needs a cu_seqlens-style packed causal-attention path. Builds naturally on Gap 1.
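For readers unfamiliar with the cu_seqlens convention, the sketch below shows the bookkeeping a packed prefill relies on: prompts are concatenated into one token stream, cu_seqlens holds the cumulative start offsets, and attention is causal within each prompt and blocked across prompts. It is plain Python with hypothetical function names (pack_prompts, packed_causal_mask); a real kernel would derive the same rule from cu_seqlens on the fly rather than materialize a dense mask.

```python
from itertools import accumulate


def pack_prompts(prompts):
    """Concatenate token lists and return (tokens, cu_seqlens).

    cu_seqlens[i] is the packed offset where prompt i starts, and
    cu_seqlens[-1] is the total token count.
    """
    tokens = [t for p in prompts for t in p]
    cu_seqlens = [0, *accumulate(len(p) for p in prompts)]
    return tokens, cu_seqlens


def packed_causal_mask(cu_seqlens):
    """mask[q][k] is True when packed position q may attend to position k.

    Dense mask for illustration only; it is O(total^2) in memory.
    """
    total = cu_seqlens[-1]
    # Map every packed position to the index of the prompt it belongs to.
    seq_of = [0] * total
    for i in range(len(cu_seqlens) - 1):
        for pos in range(cu_seqlens[i], cu_seqlens[i + 1]):
            seq_of[pos] = i
    # Allowed only within the same prompt, and only looking backwards.
    return [[seq_of[q] == seq_of[k] and k <= q for k in range(total)]
            for q in range(total)]


if __name__ == "__main__":
    tokens, cu = pack_prompts([[1, 2, 3], [4, 5]])
    print(tokens, cu)
    for row in packed_causal_mask(cu):
        print("".join("x" if allowed else "." for allowed in row))
```

The resulting mask is block-diagonal with a lower-triangular block per prompt, which is why no padding tokens are needed: every row of the packed matrix is a real token.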
Gap 3 — VRAM token-budget autotune + admission backpressure (smallest/safest)
HardwareProfile is crude — PCIe bandwidth bucketed by VRAM size, no token budget (HardwareProfile.cs:39-45) — and ContinuousBatchingEngine admits up to maxBatchSize with no KV-memory backpressure, so a burst of long prompts can OOM. Add an empirical Phase-0 probe to derive a real token budget and gate admission on it.
Scope: small; standalone, no kernel work. Good first step and de-risks Gaps 1–2.
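One possible shape for the budget and the admission gate is sketched below, in Python rather than the engine's C#. All names are hypothetical (kv_bytes_per_token, token_budget, TokenBudgetGate), and two assumptions are baked in: the KV cache stores one K and one V vector per layer per KV head in a fixed element size, and each request reserves its worst case (prompt plus max new tokens) up front. The free-byte figure is an input here; in practice it would come from the empirical Phase-0 probe.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem):
    """KV-cache bytes for one token, assuming an unquantized K and V per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def token_budget(free_bytes, per_token_bytes, safety_fraction=0.9):
    """Tokens that fit in the measured free VRAM, leaving some headroom."""
    return int(free_bytes * safety_fraction) // per_token_bytes


class TokenBudgetGate:
    """Admission gate: refuse a request if its worst case would exceed the budget."""

    def __init__(self, budget_tokens, max_batch_size):
        self.budget_tokens = budget_tokens
        self.max_batch_size = max_batch_size
        self.reserved = {}   # request id -> reserved token count

    def try_admit(self, request_id, prompt_len, max_new_tokens):
        need = prompt_len + max_new_tokens   # conservative worst-case reservation
        in_use = sum(self.reserved.values())
        if len(self.reserved) >= self.max_batch_size:
            return False
        if in_use + need > self.budget_tokens:
            return False                     # backpressure: caller keeps it queued
        self.reserved[request_id] = need
        return True

    def release(self, request_id):
        """Return a finished request's reservation to the pool."""
        self.reserved.pop(request_id, None)


if __name__ == "__main__":
    per_tok = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128,
                                 bytes_per_elem=2)
    budget = token_budget(free_bytes=1 << 30, per_token_bytes=per_tok,
                          safety_fraction=0.5)
    gate = TokenBudgetGate(budget, max_batch_size=4)
    print(per_tok, budget)
    print(gate.try_admit("a", prompt_len=3000, max_new_tokens=500))
    print(gate.try_admit("b", prompt_len=500, max_new_tokens=200))
```

Reserving the worst case is the simplest policy that rules out the OOM described above; it can leave VRAM unused when requests stop early, so a later refinement could reserve less and preempt instead.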
Suggested order
- Gap 3 (cheap, prevents OOM, unblocks measurement)
- Gap 1 (biggest throughput win under concurrent load)
- Gap 2 (largest lift; layers on top of 1)
Each is independently shippable.