perf(cpu): dequant-once weight reuse for MatMulBatched — small-chunk prefill pays full dequant per call (#183 follow-up)

## Context

PR #188 (issue #183 Gap 1) chunks prompt admission so decode keeps flowing. The chunk-size sweep exposed a CPU cost cliff: every `SimdKernels.MatMulBatched` call re-pays the full weight dequantization regardless of batch size N, so chunked prefill throughput collapses as chunks shrink.

Measured (SmolLM2-1.7B Q4_K_M, Zen 4, OpenBLAS, engine-shaped chunked prefill of a 1040-token prompt):

| chunk tokens | effective prefill t/s |
|---:|---:|
| whole prompt (1040) | ~42 |
| 256 (default) | ~38 |
| 64 | ~20 |
| 32 | ~13 |

This is why the default chunk stays at 256 (≈8 s worst-case decode stall on this box) instead of something interactive like 32–64.
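One way to read the table is a two-term cost model: every chunk pays a fixed cost `D` (the per-call dequantization, summed over the layer walk) plus a cost proportional to its token count. This is an assumption for illustration, not a profile; it ignores attention cost and BLAS efficiency at small N. The sketch below fits the model to the whole-prompt and 32-token rows and checks it against the other two. All function names are hypothetical.

```python
import math

PROMPT_TOKENS = 1040
# Approximate effective prefill t/s, copied from the table above.
MEASURED_TPS = {1040: 42.0, 256: 38.0, 64: 20.0, 32: 13.0}

def num_chunks(chunk_tokens):
    """Number of MatMulBatched passes needed to admit the whole prompt."""
    return math.ceil(PROMPT_TOKENS / chunk_tokens)

def total_seconds(chunk_tokens):
    """Total prefill time implied by the measured throughput."""
    return PROMPT_TOKENS / MEASURED_TPS[chunk_tokens]

def fit_fixed_cost(chunk_a=1040, chunk_b=32):
    """Solve T = K*D + C from two rows.

    D is the assumed fixed cost per chunk (seconds), C the
    token-proportional cost for the whole prompt (seconds).
    """
    ka, kb = num_chunks(chunk_a), num_chunks(chunk_b)
    ta, tb = total_seconds(chunk_a), total_seconds(chunk_b)
    d = (tb - ta) / (kb - ka)
    c = ta - ka * d
    return d, c

def predicted_tps(chunk_tokens, d, c):
    """Throughput the model predicts for a given chunk size."""
    return PROMPT_TOKENS / (num_chunks(chunk_tokens) * d + c)

if __name__ == "__main__":
    d, c = fit_fixed_cost()
    print(f"fixed cost per chunk: {d:.2f} s")
    for chunk in (256, 64):
        print(f"chunk {chunk}: predicted {predicted_tps(chunk, d, c):.1f} t/s, "
              f"measured ~{MEASURED_TPS[chunk]:.0f} t/s")
```

Under this model the fixed cost comes out at roughly 1.7 s per chunk. The 64-token row is predicted within about 1 t/s, while the 256-token row is underpredicted (about 33 vs ~38 t/s), so the fit is indicative only.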

## Proposal

Amortize dequantization across calls within a layer: keep a dequant-once fp32 (or bf16) weight cache keyed by (layer, tensor), and reuse it when successive `MatMulBatched` calls hit the same weights. That is exactly the chunked-admission access pattern, since `PrefillWithCache`/`PrefillPackedMulti` walk the same layer sequence every chunk. Alternatives: a persistent dequant buffer owned by `ForwardPass` for the BLAS path only, or a chunk-aware fused dequant+GEMM.
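The cache idea can be sketched in a few lines. This is a minimal illustration, not the engine's code: `DequantCache`, `toy_dequantize`, and `chunked_prefill` are hypothetical names, the dequantization is a stand-in that scales integer codes, and the tensor name `"ffn_up"` is only an example key.

```python
from typing import Callable, Dict, List, Tuple

Key = Tuple[int, str]  # (layer index, tensor name)

class DequantCache:
    """Dequant-once cache: each (layer, tensor) is dequantized at most once."""

    def __init__(self) -> None:
        self._weights: Dict[Key, List[float]] = {}
        self.dequant_calls = 0  # how many times we actually paid for dequant

    def get(self, key: Key, dequantize: Callable[[], List[float]]) -> List[float]:
        cached = self._weights.get(key)
        if cached is None:
            cached = dequantize()
            self._weights[key] = cached
            self.dequant_calls += 1
        return cached

def toy_dequantize(quantized: List[int], scale: float) -> List[float]:
    # Stand-in for real block dequantization: scale the integer codes.
    return [q * scale for q in quantized]

def chunked_prefill(num_layers: int, num_chunks: int, cache: DequantCache) -> int:
    """Walk the same layer sequence once per chunk, as chunked admission does."""
    quantized = [1, 2, 3, 4]
    for _chunk in range(num_chunks):
        for layer in range(num_layers):
            weights = cache.get(
                (layer, "ffn_up"),
                lambda: toy_dequantize(quantized, 0.5),
            )
            # A real kernel would run the GEMM against `weights` here.
            assert len(weights) == len(quantized)
    return cache.dequant_calls

if __name__ == "__main__":
    calls = chunked_prefill(num_layers=24, num_chunks=33, cache=DequantCache())
    print(f"dequant calls with cache: {calls}, without: {24 * 33}")
```

For a hypothetical 24-layer walk over 33 chunks, the cached version dequantizes 24 times instead of 792. Holding every layer in fp32 takes far more memory than the quantized weights, so a real version would likely need a size bound or eviction policy, which this sketch leaves out.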

## Win

Makes small chunks nearly free, so the default chunk can drop to 32–64. Worst-case decode stall under concurrent load then drops from ~8 s toward decode-step scale, without the current −43% serial-chunk throughput penalty.

Packed prefill already recovers most of that penalty when several prompts are in flight (33.7 vs 20.4 t/s; see the PR #188 `--cb --packed` A/B). This issue is about the single-prefilling-prompt case.

Bench harness: `dotnet run --project benchmarks/SharpInference.Bench -c Release -- --cb [--chunk N]` and `--cb --packed`.

Tracked as issue #189.
