feat(cuda): continuous batching on the CUDA backend (per-sequence GPU KV caches + batched decode + packed prefill) #190

@pekkah

Context

Issue #183 / PR #188 closed the scheduling gaps in ContinuousBatchingEngine — but the engine remains CPU-only: InferenceEngineLoader enables it only when the forward pass is the CPU ForwardPass (batchingSupported is false on every GPU/hybrid path). CudaForwardPass has no CreateCache / PrefillWithCache / BatchForwardMulti / PrefillPackedMulti equivalents — multi-user serving on the GPU falls back to the serialized single-user InferenceEngine.

This is the real remaining "GPU eating air" gap behind #183's theme: a 4070 Ti decoding one Qwen3-8B stream at ~75 t/s is using a fraction of its matvec bandwidth; batched decode amortizes weight reads N× exactly as the CPU BatchForwardMulti does.
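The amortization argument can be put into a small back-of-the-envelope model. This is a sketch only: it assumes decode is limited purely by streaming the weights once per step, ignores KV-cache reads and compute, and the function names are hypothetical rather than part of the codebase.

```cpp
#include <cassert>

// Weight traffic attributed to each generated token when `batch` sequences
// share one decode step. The weights are streamed once per step and reused
// for every sequence in the batch, so the per-token cost shrinks by 1/batch.
double weightBytesPerToken(double weightBytes, int batch) {
    assert(batch > 0);
    return weightBytes / static_cast<double>(batch);
}

// Upper bound on aggregate tokens per second under the assumption that
// weight reads are the only cost (KV reads and compute are ignored).
double weightBoundTokensPerSec(double weightBytes,
                               double bandwidthBytesPerSec,
                               int batch) {
    return bandwidthBytesPerSec / weightBytesPerToken(weightBytes, batch);
}
```

Under this simplified model the bound scales linearly with N; in practice per-sequence KV reads and attention cost grow with N and flatten the curve, so the measured gain should be expected to fall short of N×.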

Scope (large — on the order of #124/#136)

  1. Per-sequence GPU KV caches: today _gpuKCache[layer] is a single [maxSeqLen, kvDim] allocation owned by the forward pass. This needs a per-sequence cache object (paged or slab) plus VRAM-budgeted allocation, which dovetails with the KV-budget machinery from #185 (q8_0/bf16 KV follow-ups: auto-narrow default + Tc/half2-flash q8 thunks, #179) and the admission backpressure from PR #188 (a KvBytesPerToken analogue for the GPU dtypes fp32/bf16/q8_0).
  2. Batched decode kernels: N tokens and N cache pointers per layer. Batched matvec is the easy half (cuBLAS GEMM at N>1); attention needs per-sequence (cache, position) indirection (a block-table or pointer-array kernel signature).
  3. Packed multi-prompt prefill: the chunked batched-prefill trunk (#136, PrefillBatchChunk=4096) already exists for one sequence; a cu_seqlens-style packed variant would mirror ForwardPass.PrefillPackedMulti from PR #188.
  4. Engine generalization: ContinuousBatchingEngine takes the concrete ForwardPass; it needs an interface (e.g. IBatchedForwardPass) so the CUDA implementation can slot in, plus the loader gate.
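Items 1 and 2 can be illustrated with a host-side C++ sketch of the two pieces of arithmetic involved: a KV-bytes-per-token estimate for the three GPU dtypes, and the block-table lookup a paged attention kernel would perform per sequence. All names here (`KvDtype`, `SequenceCache`, `physicalRow`) are hypothetical, the paged layout is only one of the two options the issue mentions, and the q8_0 size assumes the GGML-style block of 32 int8 values plus a 2-byte scale, which may not match this project's layout.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class KvDtype { Fp32, Bf16, Q8_0 };

// Bytes needed to store `n` KV elements of the given dtype.
// Q8_0 assumes 34 bytes per block of 32 elements (assumption, see lead-in).
std::size_t kvElemBytes(KvDtype dtype, std::size_t n) {
    switch (dtype) {
        case KvDtype::Fp32: return n * 4;
        case KvDtype::Bf16: return n * 2;
        case KvDtype::Q8_0: assert(n % 32 == 0); return (n / 32) * 34;
    }
    return 0;
}

// K plus V, across all layers, for one cached token.
std::size_t kvBytesPerToken(KvDtype dtype, std::size_t numLayers, std::size_t kvDim) {
    return 2 * numLayers * kvElemBytes(dtype, kvDim);
}

// Hypothetical per-sequence cache handle for a paged layout: each logical
// block of `blockSize` tokens maps to a physical block in a shared pool.
struct SequenceCache {
    std::vector<std::int32_t> blockTable; // logical block -> physical block id
    std::int32_t length = 0;              // tokens currently cached
};

// Row index inside the shared physical pool for token `pos` of `seq`.
// A batched attention kernel would do this lookup once per (sequence, position).
std::int64_t physicalRow(const SequenceCache& seq, std::int32_t pos, std::int32_t blockSize) {
    assert(blockSize > 0 && pos >= 0 && pos < seq.length);
    const std::int32_t logicalBlock = pos / blockSize;
    return static_cast<std::int64_t>(seq.blockTable[logicalBlock]) * blockSize
           + pos % blockSize;
}
```

With a table like this, admission control can divide the free VRAM budget by `kvBytesPerToken(...)` to bound how many tokens may be resident, while the kernel only needs the block table and length per sequence rather than a contiguous [maxSeqLen, kvDim] buffer.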

Suggested order: 1 → 2 (decode batching alone is most of the throughput win) → 4 → 3.
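For item 3, the cu_seqlens convention is simply an exclusive prefix sum over the prompt lengths, which lets one packed token buffer carry several prompts. The following host-side sketch shows that bookkeeping; the function and struct names are hypothetical and nothing here reflects the actual signature of ForwardPass.PrefillPackedMulti.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// cu[i] is the offset of sequence i's first token in the packed buffer;
// cu.back() is the total number of packed tokens.
std::vector<std::int32_t> buildCuSeqlens(const std::vector<std::int32_t>& promptLengths) {
    std::vector<std::int32_t> cu(promptLengths.size() + 1, 0);
    for (std::size_t i = 0; i < promptLengths.size(); ++i) {
        assert(promptLengths[i] > 0);
        cu[i + 1] = cu[i] + promptLengths[i];
    }
    return cu;
}

struct TokenOrigin {
    std::int32_t sequence; // which prompt the packed token belongs to
    std::int32_t position; // its position within that prompt
};

// Recover (sequence, position) for a packed token index by binary search
// over the prefix sums. A kernel would use this to pick the right KV cache
// and to mask attention so tokens never attend across prompt boundaries.
TokenOrigin locateToken(const std::vector<std::int32_t>& cu, std::int32_t packedIndex) {
    assert(packedIndex >= 0 && packedIndex < cu.back());
    const auto it = std::upper_bound(cu.begin(), cu.end(), packedIndex);
    const std::int32_t seq = static_cast<std::int32_t>(it - cu.begin()) - 1;
    return TokenOrigin{seq, packedIndex - cu[seq]};
}
```

For prompt lengths 3, 5 and 2 this yields offsets 0, 3, 8, 10, so packed token 3 is position 0 of the second prompt and packed token 9 is position 1 of the third.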

Constraints
