Context
Issue #183 / PR #188 closed the scheduling gaps in ContinuousBatchingEngine, but the engine remains CPU-only: InferenceEngineLoader enables it only when the forward pass is the CPU ForwardPass (batchingSupported is false on every GPU/hybrid path). CudaForwardPass has no CreateCache / PrefillWithCache / BatchForwardMulti / PrefillPackedMulti equivalents, so multi-user serving on the GPU falls back to the serialized single-user InferenceEngine.
This is the real remaining "GPU eating air" gap behind #183's theme: a 4070 Ti decoding one Qwen3-8B stream at ~75 t/s is using a fraction of its matvec bandwidth; batched decode amortizes weight reads N× exactly as the CPU BatchForwardMulti does.
Scope (large, on the order of #124/#136)
1. _gpuKCache[layer] is a single [maxSeqLen, kvDim] allocation owned by the forward pass. Needs a per-sequence cache object (paged or slab) + VRAM-budgeted allocation, which dovetails with the #185 KV-budget machinery and PR #188's admission backpressure (KvBytesPerToken analogue for the GPU dtypes fp32/bf16/q8_0).
2. Batched decode kernels: N tokens, N cache pointers per layer. Batched matvec is the easy half (cuBLAS GEMM at N>1); attention needs per-sequence (cache, position) indirection (block-table or pointer-array kernel signature).
3. [...] ForwardPass.PrefillPackedMulti from PR #188.
4. Engine generalization: ContinuousBatchingEngine takes the concrete ForwardPass; needs an interface (e.g. IBatchedForwardPass) so the CUDA implementation can slot in, plus the loader gate.
Suggested order: 1 → 2 (decode batching alone is most of the throughput win) → 4 → 3.
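To make the per-sequence cache and block-table ideas above concrete, here is a minimal host-side sketch in Python (chosen only for brevity; it is not the project's code). It models a VRAM-budgeted pool of fixed-size KV blocks, a per-sequence block table that a decode kernel could use for (cache, position) indirection, and an admission check in the spirit of the backpressure from PR #188. Every name here (`kv_bytes_per_token`, `PagedKvPool`, `SequenceCache`, `can_admit`) is hypothetical, and the q8_0 size assumes the ggml-style layout of 32 int8 values plus one fp16 scale per block, which may not match the project's actual format.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Bytes per cached element for the GPU KV dtypes named in the issue.
# ASSUMPTION: q8_0 uses the ggml-style block (32 int8 values + one fp16 scale = 34 bytes).
BYTES_PER_ELEMENT = {"fp32": 4.0, "bf16": 2.0, "q8_0": 34 / 32}


def kv_bytes_per_token(n_layers: int, kv_dim: int, dtype: str) -> int:
    """Size of one token's K and V rows across all layers (hypothetical helper)."""
    return int(2 * n_layers * kv_dim * BYTES_PER_ELEMENT[dtype])


@dataclass
class PagedKvPool:
    """Fixed pool of equal-sized KV blocks carved out of a VRAM budget."""
    budget_bytes: int
    block_tokens: int
    bytes_per_token: int
    free_blocks: list[int] = field(init=False)

    def __post_init__(self) -> None:
        n_blocks = self.budget_bytes // (self.block_tokens * self.bytes_per_token)
        self.free_blocks = list(range(n_blocks))

    def blocks_needed(self, n_tokens: int) -> int:
        return -(-n_tokens // self.block_tokens)  # ceiling division

    def can_admit(self, n_tokens: int) -> bool:
        """Admission check: would a sequence of n_tokens fit in the free blocks?"""
        return self.blocks_needed(n_tokens) <= len(self.free_blocks)


@dataclass
class SequenceCache:
    """Per-sequence view: logical token positions mapped to physical blocks."""
    pool: PagedKvPool
    block_table: list[int] = field(default_factory=list)
    length: int = 0

    def append_token(self) -> tuple[int, int]:
        """Reserve a slot for one token; returns (physical block, offset in block)."""
        if self.length == len(self.block_table) * self.pool.block_tokens:
            if not self.pool.free_blocks:
                raise MemoryError("KV budget exhausted")
            self.block_table.append(self.pool.free_blocks.pop())
        logical_block, offset = divmod(self.length, self.pool.block_tokens)
        self.length += 1
        return self.block_table[logical_block], offset

    def release(self) -> None:
        """Return all blocks to the pool when the sequence finishes."""
        self.pool.free_blocks.extend(self.block_table)
        self.block_table.clear()
        self.length = 0
```

With illustrative values (36 layers, kvDim 1024, not read from the project), `kv_bytes_per_token(36, 1024, "bf16")` gives the per-token cost used to size the pool. A batched decode step would then pass each sequence's `block_table` and `length` to the attention kernel. A slab allocator, the other option named in item 1, would swap the block table for one contiguous region per sequence, trading fragmentation behavior for simpler kernel indexing.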
Constraints