perf(cpu): dequant-once weight reuse for MatMulBatched — small-chunk prefill pays full dequant per call (#183 follow-up) #189

@pekkah

Context

PR #188 (issue #183 Gap 1) chunks prompt admission so decode keeps flowing. The chunk-size sweep exposed a CPU cost cliff: every SimdKernels.MatMulBatched call re-pays the full weight dequantization regardless of batch size N, so chunked prefill throughput collapses as chunks shrink.

Measured (SmolLM2-1.7B Q4_K_M, Zen 4, OpenBLAS, engine-shaped chunked prefill of a 1040-token prompt):

| Chunk (tokens) | Effective prefill (t/s) |
|---|---|
| Whole prompt (1040) | ~42 |
| 256 (default) | ~38 |
| 64 | ~20 |
| 32 | ~13 |

This is why the default chunk stays at 256 (≈8 s worst-case decode stall on this box) instead of something interactive like 32–64.
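
As a rough back-of-envelope check on these numbers, assume (a modelling assumption, not something measured in PR #188) that each chunk pass through the model pays a fixed overhead $D$, dominated by weight dequantization, and that the remaining work $W$ depends only on the prompt length $P$. The symbols $D$, $W$, $K$ and $P$ are introduced here purely for illustration:

$$
T(N) = K\,D + W, \qquad K = \lceil P / N \rceil, \qquad \text{throughput}(N) = \frac{P}{T(N)}
$$

With $P = 1040$, the whole-prompt run ($K = 1$, ~42 t/s) gives $D + W \approx 24.8$ s and the 32-token run ($K = 33$, ~13 t/s) gives $33D + W \approx 80.0$ s, so $D \approx 55.2 / 32 \approx 1.7$ s per pass. The same $D$ predicts about 19.9 t/s at chunk 64 ($K = 17$), close to the measured ~20, and about 33 t/s at chunk 256 ($K = 5$), below the measured ~38. The fit is only approximate, but it is consistent with a fixed per-call cost dominating at small chunk sizes.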

Proposal

Amortize dequant across calls within a layer: a dequant-once fp32 (or bf16) weight cache keyed by (layer, tensor) with reuse when successive MatMulBatched calls hit the same weights — exactly the chunked-admission access pattern (PrefillWithCache/PrefillPackedMulti walk the same layer sequence every chunk). Alternatives: a persistent dequant buffer owned by ForwardPass for the BLAS path only, or chunk-aware fused dequant+GEMM.
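
A minimal sketch of the first option, assuming a plain dictionary keyed by (layer, tensor) with a byte budget. `DequantCache`, `WeightKey`, `Dequantizer` and `GetOrDequantize` are hypothetical names for illustration, not existing SharpInference APIs, and the dequantization itself is passed in as a delegate rather than reproduced here:

```csharp
using System;
using System.Collections.Generic;

// Demo: 4 chunks walking the same 2 layers, as chunked admission would.
var cache = new DequantCache(budgetBytes: 1 << 20);
int dequantCalls = 0;
for (int chunk = 0; chunk < 4; chunk++)
{
    for (int layer = 0; layer < 2; layer++)
    {
        // Stand-in for real Q4_K dequantization: fill with a constant.
        cache.GetOrDequantize(new WeightKey(layer, "attn_q"), 16,
            dst => { dequantCalls++; dst.Fill(1f); });
    }
}
Console.WriteLine($"dequant calls: {dequantCalls}, cache hits: {cache.Hits}");

// Hypothetical key: one cache entry per (layer, tensor) pair.
public readonly record struct WeightKey(int Layer, string Tensor);

// Hypothetical callback that writes dequantized fp32 weights into destination.
public delegate void Dequantizer(Span<float> destination);

// Hypothetical dequant-once cache; not part of the existing code base.
public sealed class DequantCache
{
    private readonly Dictionary<WeightKey, float[]> _entries = new();
    private readonly long _budgetBytes;
    private long _usedBytes;

    public int Hits { get; private set; }
    public int Misses { get; private set; }

    public DequantCache(long budgetBytes) => _budgetBytes = budgetBytes;

    public ReadOnlySpan<float> GetOrDequantize(
        WeightKey key, int elementCount, Dequantizer dequantize)
    {
        if (_entries.TryGetValue(key, out var cached))
        {
            Hits++;
            return cached; // reuse: no dequant work on this call
        }

        Misses++;
        var buffer = new float[elementCount];
        dequantize(buffer);

        // Keep the buffer only if it fits the budget; otherwise the caller
        // still gets valid weights and simply pays dequant again next time.
        long bytes = (long)elementCount * sizeof(float);
        if (_usedBytes + bytes <= _budgetBytes)
        {
            _entries[key] = buffer;
            _usedBytes += bytes;
        }
        return buffer;
    }

    public void Clear()
    {
        _entries.Clear();
        _usedBytes = 0;
    }
}
```

In this demo the dequantizer runs twice (once per layer) and the other six lookups are cache hits. The byte budget matters in practice: an fp32 copy of every weight in a 1.7B-parameter model is roughly 6.8 GB, so a real implementation would need to choose between caching everything, caching selected tensors, or storing bf16.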

Win

Makes small chunks nearly free, so the default chunk can drop to 32–64 and the worst-case decode stall under concurrent load falls from ~8 s toward decode-step scale, without the current −43% serial-chunk throughput penalty. Packed prefill already recovers most of that penalty when several prompts are in flight (33.7 vs 20.4 t/s; see the PR #188 --cb --packed A/B), so this issue is about the single-prefilling-prompt case.

Bench harness: dotnet run --project benchmarks/SharpInference.Bench -c Release -- --cb [--chunk N] and --cb --packed.
