GPU saturation gaps: non-blocking prefill admission, packed multi-seq prefill, VRAM token-budget autotune

## Context

Spun out of reviewing the article ["I Built a C++ Backend So My GPU Would Stop Eating Air"](https://towardsdatascience.com/i-built-a-c-backend-so-my-gpu-would-stop-eating-air/) (write-up of [WarpGroup-Backend](https://github.com/AnubhabBanerjee/WarpGroup-backend)). Its thesis: standard batching pads short sequences with zeros up to the longest in the batch, wasting GPU FLOPs; fix it with padding-free packed batches, FFD bin-packing under a VRAM token budget, pinned-memory async DMA, and a Phase-0 autotune.

**Most of the headline techniques are already in our design and need no work:**

- **Pinned memory + async DMA** — already implemented: `CudaBackend` `cudaMallocHost`s a staging buffer plus a double-buffered async pinned pool with completion events (`src/SharpInference.Cuda/CudaBackend.cs:39-60`) and direct-pinned `Download/UploadAsync` overloads (`:708+`).
- **No padding waste in decode** — `ForwardPass.BatchForwardMulti` (`src/SharpInference.Engine/ForwardPass.cs:2208`) is already ragged: one token per sequence, per-sequence KV cache, weights amortized N×. No rectangle to pad.
- **No padding waste in prefill** — `ContinuousBatchingEngine.AdmitRequest` prefills each prompt individually (`ContinuousBatchingEngine.cs:332`) and `PrefillCore` runs the real token count. We never build the padded matrix the article fights.
- **16-token Tensor-Core micro-pad / FFD packing** — low value for us: cuBLAS pads GEMMs internally and quant block sizes already enforce alignment; FFD only matters once packed prefill exists.
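For reference, the FFD (first-fit decreasing) packing the article describes can be sketched in a few lines. This is a generic illustration of the heuristic, not code from WarpGroup-Backend or from this repo; `ffd_pack` and its parameters are hypothetical names.

```python
def ffd_pack(lengths: list[int], budget: int) -> list[list[int]]:
    """First-fit decreasing: sort sequences longest-first, then place each
    one into the first batch that still has room under the token budget."""
    bins: list[list[int]] = []   # sequence lengths assigned to each batch
    loads: list[int] = []        # running token count per batch
    for n in sorted(lengths, reverse=True):
        if n > budget:
            raise ValueError(f"sequence of {n} tokens exceeds budget {budget}")
        for i, load in enumerate(loads):
            if load + n <= budget:
                bins[i].append(n)
                loads[i] += n
                break
        else:
            # No existing batch had room, so open a new one.
            bins.append([n])
            loads.append(n)
    return bins


batches = ffd_pack([5, 7, 3, 2, 8], budget=10)
```

Each resulting batch is a candidate for one packed forward pass, which is why this only becomes relevant once a packed prefill path exists.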

The article's *deeper* lesson (keep the GPU saturated, never let scheduling idle it) does surface three real gaps worth tracking:

---

### Gap 1 — Blocking, serialized prefill admission (highest value)

In `ContinuousBatchingEngine.BatcherLoop`, a newly admitted request's prefill runs synchronously and **every already-active sequence stalls** — no decode advances during that prefill (`ContinuousBatchingEngine.cs:171-210`, admission at `:332`). A long prompt freezes the whole decode batch. This is our analog of "the GPU eating air."

**Proposed fix:** chunked / interleaved prefill (Sarathi / vLLM-style) — split a prompt's prefill into chunks and interleave them with decode steps so active sequences keep progressing.

**Scope:** moderate; engine-level scheduling change, no new attention kernel.
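A minimal sketch of the interleaving idea, as a toy scheduler in Python rather than the engine's C#. All names here (`Seq`, `step`, `chunk_budget`) are hypothetical, and the model is simplified: a decode step is one token per sequence, and a sequence starts decoding on the iteration after its last prefill chunk.

```python
from dataclasses import dataclass


@dataclass
class Seq:
    name: str
    prompt_len: int
    prefilled: int = 0   # prompt tokens already processed
    generated: int = 0   # decode tokens emitted so far
    max_new: int = 4

    @property
    def in_prefill(self) -> bool:
        return self.prefilled < self.prompt_len

    @property
    def done(self) -> bool:
        return not self.in_prefill and self.generated >= self.max_new


def step(active: list[Seq], chunk_budget: int) -> list[tuple[str, str, int]]:
    """One scheduler iteration: every decoding sequence advances one token,
    then at most `chunk_budget` prompt tokens are prefilled."""
    work = []
    # Decode first, so active sequences never wait on a long prompt.
    for s in active:
        if not s.in_prefill and not s.done:
            s.generated += 1
            work.append((s.name, "decode", 1))
    # Spend the remaining per-step budget on pending prefill chunks.
    remaining = chunk_budget
    for s in active:
        if s.in_prefill and remaining > 0:
            n = min(remaining, s.prompt_len - s.prefilled)
            s.prefilled += n
            remaining -= n
            work.append((s.name, "prefill", n))
    return work


a = Seq("a", prompt_len=4, prefilled=4, max_new=3)  # already decoding
b = Seq("b", prompt_len=10)                         # newly admitted
trace = [step([a, b], chunk_budget=4) for _ in range(4)]
```

In this trace, `a` emits a token on each of the first three iterations while `b`'s 10-token prompt is prefilled in chunks of 4, 4, and 2. Under the current blocking behavior, `a` would emit nothing until all 10 tokens were done.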

### Gap 2 — Packed multi-sequence prefill (direct port of the article)

Add a varlen `PrefillCore` that prefills several pending prompts in **one** packed forward pass, amortizing weight reads across prompts exactly as `BatchForwardMulti` does for decode.

**Scope:** largest lift — needs a `cu_seqlens`-style packed causal-attention path. Builds naturally on Gap 1.
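To make the `cu_seqlens` idea concrete, here is a small pure-Python sketch of the bookkeeping a packed causal-attention path needs: cumulative sequence offsets, and a mask that is causal within each prompt and zero across prompts. The function names are hypothetical. A real kernel would use the offsets to bound each query's key range instead of materializing a mask.

```python
from itertools import accumulate


def build_cu_seqlens(lengths: list[int]) -> list[int]:
    """Cumulative offsets: sequence i occupies packed rows
    cu_seqlens[i] .. cu_seqlens[i + 1] - 1."""
    return [0, *accumulate(lengths)]


def packed_causal_mask(cu_seqlens: list[int]) -> list[list[bool]]:
    """mask[q][k] is True when query token q may attend to key token k:
    both belong to the same sequence and k is not in q's future."""
    total = cu_seqlens[-1]
    # seq_id[t] = index of the sequence that packed token t belongs to.
    seq_id: list[int] = []
    for i in range(len(cu_seqlens) - 1):
        seq_id.extend([i] * (cu_seqlens[i + 1] - cu_seqlens[i]))
    return [
        [seq_id[q] == seq_id[k] and k <= q for k in range(total)]
        for q in range(total)
    ]


cu = build_cu_seqlens([3, 2])   # two prompts packed into 5 rows
mask = packed_causal_mask(cu)
```

The mask is block-diagonal with a lower-triangular block per prompt, so no token attends across a prompt boundary and no padding rows exist.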

### Gap 3 — VRAM token-budget autotune + admission backpressure (smallest/safest)

`HardwareProfile` is crude — PCIe bandwidth bucketed by VRAM size, no token budget (`HardwareProfile.cs:39-45`) — and `ContinuousBatchingEngine` admits up to `maxBatchSize` with **no KV-memory backpressure**, so a burst of long prompts can OOM. Add an empirical Phase-0 probe to derive a real token budget and gate admission on it.

**Scope:** small; standalone, no kernel work. Good first step and de-risks Gaps 1–2.
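A rough sketch of how a token budget and an admission gate could fit together. It assumes a standard per-token KV-cache cost (one K and one V vector per layer per KV head) and that the Phase-0 probe supplies the free-VRAM figure. The names `kv_bytes_per_token`, `token_budget`, and `AdmissionGate` are hypothetical, not existing APIs in this repo, and an empirical probe should override the analytic estimate where they disagree.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """KV-cache bytes one token occupies: K and V for every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes


def token_budget(free_vram_bytes: int, bytes_per_token: int,
                 headroom_pct: int = 10) -> int:
    """Tokens that fit in free VRAM after reserving a safety margin."""
    usable = free_vram_bytes * (100 - headroom_pct) // 100
    return usable // bytes_per_token


class AdmissionGate:
    """Reserves worst-case KV tokens per request; refuses when over budget."""

    def __init__(self, budget_tokens: int) -> None:
        self.budget = budget_tokens
        self.reserved = 0

    def try_admit(self, prompt_len: int, max_new_tokens: int) -> bool:
        need = prompt_len + max_new_tokens
        if self.reserved + need > self.budget:
            return False          # backpressure: caller keeps it queued
        self.reserved += need
        return True

    def release(self, prompt_len: int, max_new_tokens: int) -> None:
        self.reserved -= prompt_len + max_new_tokens


per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)
budget = token_budget(8 * 1024**3, per_token)   # assume the probe saw 8 GiB free
gate = AdmissionGate(budget_tokens=1000)
```

Reserving `prompt_len + max_new_tokens` up front is the conservative choice: it can under-fill the batch when requests stop early, but it rules out running out of KV memory mid-generation.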

---

## Suggested order

1. Gap 3 (cheap, prevents OOM, unblocks measurement)
2. Gap 1 (biggest throughput win under concurrent load)
3. Gap 2 (largest lift; layers on top of 1)

Each is independently shippable.

