feat(cuda): continuous batching on the CUDA backend (per-sequence GPU KV caches + batched decode + packed prefill) #190

## Context

Issue #183 / PR #188 closed the scheduling gaps in `ContinuousBatchingEngine`, but the engine remains CPU-only: `InferenceEngineLoader` enables it only when the forward pass is the CPU `ForwardPass` (`batchingSupported` is false on every GPU/hybrid path). `CudaForwardPass` has no `CreateCache` / `PrefillWithCache` / `BatchForwardMulti` / `PrefillPackedMulti` equivalents, so multi-user serving on the GPU falls back to the serialized single-user `InferenceEngine`.

This is the real remaining "GPU eating air" gap behind #183's theme: a 4070 Ti decoding one Qwen3-8B stream at ~75 t/s uses only a fraction of its matvec bandwidth. Batched decode reads each weight once per step for all N sequences, amortizing weight reads N×, exactly as the CPU `BatchForwardMulti` does.
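
The amortization argument can be put in rough numbers. The sketch below is a back-of-envelope model, not a measurement: it assumes one decode step streams every weight once and emits one token per sequence in the batch. The `weight_bytes` value is a hypothetical placeholder, not a figure from this repo.

```python
def weight_bytes_per_token(weight_bytes: int, batch_size: int) -> float:
    """Weight bytes streamed from VRAM per generated token.

    Assumes one decode step reads every weight once and produces one token
    for each of the `batch_size` sequences, so the read is shared by all of them.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    return weight_bytes / batch_size


if __name__ == "__main__":
    weight_bytes = 5 * 1024**3  # hypothetical quantized model size, not measured
    for n in (1, 2, 4, 8):
        gib = weight_bytes_per_token(weight_bytes, n) / 1024**3
        print(f"batch={n}: {gib:.3f} GiB of weights read per token")
```

The model ignores KV-cache reads, which grow with batch size and context length, so the real gain flattens out sooner than this suggests.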

## Scope (large — on the order of #124/#136)

1. **Per-sequence GPU KV caches** — today `_gpuKCache[layer]` is a single `[maxSeqLen, kvDim]` allocation owned by the forward pass. Needs a per-sequence cache object (paged or slab) + VRAM-budgeted allocation, which dovetails with the #185 KV-budget machinery and PR #188's admission backpressure (`KvBytesPerToken` analogue for the GPU dtypes fp32/bf16/q8_0).
2. **Batched decode kernels** — N tokens, N cache pointers per layer: batched matvec is the easy half (cuBLAS GEMM at N>1); attention needs per-sequence (cache, position) indirection (block-table or pointer-array kernel signature).
3. **Packed multi-prompt prefill** — the chunked batched-prefill trunk (#136, PrefillBatchChunk=4096) already exists for one sequence; a cu_seqlens-style packed variant mirrors `ForwardPass.PrefillPackedMulti` from PR #188.
4. **Engine generalization** — `ContinuousBatchingEngine` takes the concrete `ForwardPass`; needs an interface (e.g. `IBatchedForwardPass`) so the CUDA implementation can slot in, plus the loader gate.
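
As a rough illustration of items 1 and 2, here is a host-side model of a paged per-sequence cache whose block pool is sized from a VRAM budget. It is a sketch under assumptions, not the proposed implementation: `BlockPool`, `SequenceCache`, and `BLOCK_TOKENS` are hypothetical names, and the q8_0 size follows the ggml block layout (34 bytes per 32 elements), which may not match this backend. The `locate` lookup is the (cache, position) indirection an attention kernel would perform on the device.

```python
from dataclasses import dataclass, field

BLOCK_TOKENS = 16  # hypothetical page size, in tokens

# Bytes per cached element. q8_0 is modeled on the ggml layout (32 int8 values
# plus one fp16 scale = 34 bytes per 32 elements); assumed, not confirmed here.
BYTES_PER_ELEMENT = {"fp32": 4.0, "bf16": 2.0, "q8_0": 34 / 32}


def kv_bytes_per_token(n_layers: int, kv_dim: int, dtype: str) -> int:
    """Bytes of K plus V for one token across all layers."""
    return int(2 * n_layers * kv_dim * BYTES_PER_ELEMENT[dtype])


class BlockPool:
    """Fixed pool of KV blocks sized from a VRAM budget."""

    def __init__(self, budget_bytes: int, bytes_per_token: int):
        self.num_blocks = budget_bytes // (bytes_per_token * BLOCK_TOKENS)
        # Reversed so that pop() hands out block 0 first.
        self.free = list(range(self.num_blocks - 1, -1, -1))

    def alloc(self) -> int:
        if not self.free:
            # An engine would treat this as an admission/backpressure signal.
            raise MemoryError("KV block pool exhausted")
        return self.free.pop()

    def release(self, blocks) -> None:
        self.free.extend(blocks)


@dataclass
class SequenceCache:
    """Per-sequence view: logical positions map to (physical block, offset)."""

    block_table: list = field(default_factory=list)
    length: int = 0

    def append_slot(self, pool: BlockPool):
        """Reserve the slot for the next token, growing by one block if full."""
        if self.length == len(self.block_table) * BLOCK_TOKENS:
            self.block_table.append(pool.alloc())
        pos = self.length
        self.length += 1
        return self.locate(pos)

    def locate(self, pos: int):
        """The indirection a block-table attention kernel would do per token."""
        if not 0 <= pos < self.length:
            raise IndexError(pos)
        return self.block_table[pos // BLOCK_TOKENS], pos % BLOCK_TOKENS

    def release(self, pool: BlockPool) -> None:
        """Return all blocks to the pool when the sequence finishes."""
        pool.release(self.block_table)
        self.block_table = []
        self.length = 0
```

A slab design would replace the block table with one contiguous region per sequence; the paged form is shown only because it makes the indirection explicit.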

Suggested order: 1 → 2 (decode batching alone is most of the throughput win) → 4 → 3.
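
For the last step in that order, the cu_seqlens-style layout from item 3 can be pictured as follows. This is a minimal sketch of the general convention used by variable-length attention kernels (a prefix-sum array marking where each prompt starts in the packed buffer); whether `ForwardPass.PrefillPackedMulti` uses exactly this layout is an assumption, and `pack_prompts` / `owner_of` are hypothetical helper names.

```python
from bisect import bisect_right
from itertools import accumulate


def pack_prompts(prompts):
    """Concatenate prompts into one token buffer.

    Returns (packed_tokens, cu_seqlens, positions), where prompt i occupies
    packed_tokens[cu_seqlens[i]:cu_seqlens[i + 1]] and positions holds each
    token's index within its own prompt (what RoPE and the causal mask need).
    """
    cu_seqlens = [0, *accumulate(len(p) for p in prompts)]
    packed = [tok for p in prompts for tok in p]
    positions = [i for p in prompts for i in range(len(p))]
    return packed, cu_seqlens, positions


def owner_of(token_index, cu_seqlens):
    """Index of the prompt that owns a given slot in the packed buffer."""
    if not 0 <= token_index < cu_seqlens[-1]:
        raise IndexError(token_index)
    return bisect_right(cu_seqlens, token_index) - 1


if __name__ == "__main__":
    packed, cu, pos = pack_prompts([[11, 12, 13], [21], [31, 32]])
    print(packed)  # [11, 12, 13, 21, 31, 32]
    print(cu)      # [0, 3, 4, 6]
    print(pos)     # [0, 1, 2, 0, 0, 1]
```

Attention for a packed token may only look at keys inside its own `[cu_seqlens[i], cu_seqlens[i + 1])` range, which is what keeps the prompts independent despite sharing one buffer.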

## Constraints

- CUDA graphs (#136) assume a single cache layout — batched decode either bypasses graphs initially or captures per batch-size.
- Gemma 4 per-layer head_dim / k_eq_v stays out of scope initially (same exclusion as the CPU engine, see PR #188 loader gate).
- SnapKV/TurboQuant composition out of scope.
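
One way to read the "captures per batch-size" option in the first constraint is to capture graphs for a few batch sizes and round each step up to the nearest one. The sketch below shows only that selection logic; bucketing with padded slots is an assumption layered on top of the issue text, and `pick_graph_bucket` is a hypothetical helper.

```python
from bisect import bisect_left


def pick_graph_bucket(batch_size, captured_sizes):
    """Smallest captured batch size that fits `batch_size`.

    Returns None when no captured graph is large enough, in which case the
    caller would bypass graphs and launch kernels directly.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    sizes = sorted(captured_sizes)
    i = bisect_left(sizes, batch_size)
    return sizes[i] if i < len(sizes) else None
```

With graphs captured for sizes 1, 2, 4 and 8, a step with 3 live sequences would run the size-4 graph with one padded slot, and a step with 9 would fall back to the ungraphed path.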
