Context
PR #188 (issue #183 Gap 1) chunks prompt admission so decode keeps flowing. The chunk-size sweep exposed a CPU cost cliff: every SimdKernels.MatMulBatched call re-pays the full weight dequantization regardless of batch size N, so chunked prefill throughput collapses as chunks shrink.
Measured (SmolLM2-1.7B Q4_K_M, Zen 4, OpenBLAS, engine-shaped chunked prefill of a 1040-token prompt):
| chunk tokens | effective prefill t/s |
|---|---|
| whole prompt (1040) | ~42 |
| 256 (default) | ~38 |
| 64 | ~20 |
| 32 | ~13 |
This is why the default chunk stays at 256 (≈8 s worst-case decode stall on this box) instead of something interactive like 32–64.
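One rough way to read the table is a fixed-cost model: each chunk pays a constant overhead (standing in for the per-call dequantization across all layers) plus a per-token cost. The sketch below fits those two unknowns from the 32-token and whole-prompt rows and predicts the other rows. The model and the function names `fit_fixed_cost` and `predicted_tps` are assumptions for illustration, not part of the bench harness, and the inputs are the approximate "~" figures from the table.

```python
import math

PROMPT_TOKENS = 1040
# Approximate effective prefill throughput (tokens/s) from the table, keyed by chunk size.
MEASURED_TPS = {1040: 42.0, 256: 38.0, 64: 20.0, 32: 13.0}


def chunk_count(chunk: int) -> int:
    """Number of chunks needed to admit the whole prompt (last chunk may be partial)."""
    return math.ceil(PROMPT_TOKENS / chunk)


def fit_fixed_cost(chunk_a: int, chunk_b: int) -> tuple[float, float]:
    """Fit total_time = chunks * fixed + PROMPT_TOKENS * per_token from two rows.

    Returns (fixed seconds per chunk, seconds per token).
    """
    time_a = PROMPT_TOKENS / MEASURED_TPS[chunk_a]
    time_b = PROMPT_TOKENS / MEASURED_TPS[chunk_b]
    fixed = (time_a - time_b) / (chunk_count(chunk_a) - chunk_count(chunk_b))
    per_token = (time_b - chunk_count(chunk_b) * fixed) / PROMPT_TOKENS
    return fixed, per_token


def predicted_tps(chunk: int, fixed: float, per_token: float) -> float:
    """Effective prefill throughput the model predicts for a given chunk size."""
    total = chunk_count(chunk) * fixed + PROMPT_TOKENS * per_token
    return PROMPT_TOKENS / total
```

Calling `fit_fixed_cost(32, 1040)` and then `predicted_tps(64, fixed, per_token)` gives a value close to the measured row for 64 tokens, which is consistent with a cost that is paid per call rather than per token. The fit is only approximate at 256 tokens, so treat it as a sanity check on the explanation, not as a measurement.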
Proposal
Amortize dequant across calls within a layer: a dequant-once fp32 (or bf16) weight cache keyed by (layer, tensor) with reuse when successive MatMulBatched calls hit the same weights — exactly the chunked-admission access pattern (PrefillWithCache/PrefillPackedMulti walk the same layer sequence every chunk). Alternatives: a persistent dequant buffer owned by ForwardPass for the BLAS path only, or chunk-aware fused dequant+GEMM.
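A minimal sketch of the dequant-once idea, written in Python only to show the shape of the cache. `DequantCache`, `toy_dequantize`, `matmul_batched`, `prefill_chunk`, the `"ffn_up"` tensor name, and the toy scale-times-integer "quantization" are all hypothetical; none of them are the SharpInference API or the real Q4_K_M layout.

```python
from typing import Callable, Dict, List, Tuple

Key = Tuple[int, str]          # (layer index, tensor name)
Matrix = List[List[float]]


class DequantCache:
    """Dequantize each (layer, tensor) once and reuse the fp32 result afterwards."""

    def __init__(self, dequantize: Callable[[Key], Matrix]) -> None:
        self._dequantize = dequantize
        self._weights: Dict[Key, Matrix] = {}
        self.misses = 0        # how many times dequantization actually ran

    def get(self, key: Key) -> Matrix:
        weights = self._weights.get(key)
        if weights is None:
            weights = self._dequantize(key)
            self._weights[key] = weights
            self.misses += 1
        return weights


# Toy quantized store: one scale per tensor plus integer rows (illustrative only).
QUANTIZED = {
    (0, "ffn_up"): (0.5, [[2, -4], [6, 8]]),
    (1, "ffn_up"): (0.25, [[4, 0], [-8, 12]]),
}


def toy_dequantize(key: Key) -> Matrix:
    scale, rows = QUANTIZED[key]
    return [[scale * q for q in row] for row in rows]


def matmul_batched(x: Matrix, w: Matrix) -> Matrix:
    """Plain N x K times K x M product; stands in for the real batched kernel."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def prefill_chunk(x: Matrix, cache: DequantCache, n_layers: int = 2) -> Matrix:
    """Walk the same layer sequence for every chunk, fetching weights via the cache."""
    for layer in range(n_layers):
        x = matmul_batched(x, cache.get((layer, "ffn_up")))
    return x
```

Running `prefill_chunk` for several chunks against one `DequantCache` leaves `misses` equal to the number of distinct (layer, tensor) keys, no matter how many chunks were processed. An unbounded cache like this keeps every touched weight resident in fp32, so a real version would presumably need a memory budget or eviction policy, which is part of what the alternatives listed above trade off.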
Win
Makes small chunks nearly free → default chunk can drop to 32–64 → worst-case decode stall under concurrent load drops from ~8 s toward decode-step scale, without the current −43% serial-chunk throughput penalty. Packed prefill already recovers most of that penalty when several prompts are in flight (33.7 vs 20.4 t/s; see the PR #188 --cb --packed A/B), so this issue is about the single-prefilling-prompt case.
Bench harness: dotnet run --project benchmarks/SharpInference.Bench -c Release -- --cb [--chunk N] and --cb --packed.