Problem
First-request latency on CPU-MoE configs is dominated by mmap page faults, and the existing "pre-fault" doesn't actually pre-fault:
- CudaHybridForwardPass.cs:791-831 (and the Vulkan twin HybridForwardPass.cs:470-494) say "touch the first byte of each weight tensor to ensure OS pages them into RAM", but the loop reads only DataPtr[0] and DataPtr[size-1] (:826-828). That faults at most 2 pages (~8 KiB at 4 KiB per page) of a tensor, regardless of its size. For MoE, the stacked expert tensors (ffn_*_exps, hundreds of MB per layer) remain almost entirely unfaulted.
- CudaHybridGdnForwardPass (the Carnice / qwen3.6-A3B path, where CPU-MoE is the auto-selected winner on 12 GB) has no pre-fault at all for its CPU-resident expert weights: there is no touch loop in the file.
The cost is already documented: #210's bench protocol exists because "mmap'd experts make cold cells 5× slow" — every cell must be run twice. The same 5× hits the first real request after server start (and any expert whose pages get evicted). On the target 64 GB-RAM machine the entire model fits in page cache, so this is pure avoidable latency.
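To make the coverage gap concrete, here is a small arithmetic sketch in C (chosen because the OS calls named in this issue are C APIs). The helper names are hypothetical, and the 4 KiB page size and page-aligned start are assumptions for illustration, not facts about the repo.

```c
#include <assert.h>
#include <stddef.h>

/* Pages spanned by a mapping of `size` bytes, assuming a page-aligned start. */
static size_t pages_spanned(size_t size, size_t page) {
    assert(page > 0);
    return size == 0 ? 0 : (size + page - 1) / page;
}

/* Pages faulted by a loop that reads only byte 0 and byte size-1. */
static size_t pages_touched_first_last(size_t size, size_t page) {
    assert(page > 0);
    if (size == 0) return 0;
    /* Both reads land on the same page when the tensor fits in one page. */
    return ((size - 1) / page == 0) ? 1 : 2;
}
```

For a hypothetical 512 MiB expert tensor with 4 KiB pages, pages_spanned returns 131072 while pages_touched_first_last returns 2, so the current loop warms roughly 0.0015% of the tensor.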
Proposed work
- Real pre-fault for CPU-MoE expert tensors: stride-touch every 4 KiB page (or better, posix_madvise(POSIX_MADV_WILLNEED) on Linux / PrefetchVirtualMemory on Windows, falling back to a parallel stride-read). A parallel sequential sweep runs at memory/SSD-read bandwidth: a 15–20 GB expert region warms in seconds at NVMe speeds, versus minutes of fault-stalled first-token decode.
- Apply it to all three hybrid classes' CPU-resident weights (CudaHybridGdnForwardPass gets it for the first time; the two existing touch loops get fixed from 2-pages-per-tensor to full coverage).
- Make it tunable: default ON when CPU-MoE / CPU layers are active and total mapped bytes < ~80% of physical RAM; SHARPI_PREFAULT=0|1 override; print the warm time. Skip when RAM clearly can't hold the model (would just thrash).
- Optionally run the sweep on a background thread so prefill of the first request overlaps the tail of the warm-up (faults then only stall on still-cold pages).
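The first and third bullets can be sketched in C for Linux as follows. This is a minimal single-threaded illustration under stated assumptions, not the proposed implementation: the function names should_prefault and prefault_region are hypothetical, the physical-RAM figure is passed in by the caller, and the parallel sweep, the Windows PrefetchVirtualMemory path, and the background thread are left out. A C# version would presumably reach posix_madvise through P/Invoke.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L   /* expose posix_madvise under strict modes */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Gate: an explicit SHARPI_PREFAULT=0|1 wins; otherwise warm only when the
   mapping is below ~80% of physical RAM. `env` is the raw variable value
   (NULL when unset), e.g. getenv("SHARPI_PREFAULT"). */
static int should_prefault(const char *env, uint64_t mapped_bytes,
                           uint64_t ram_bytes) {
    if (env != NULL && strcmp(env, "0") == 0) return 0;
    if (env != NULL && strcmp(env, "1") == 0) return 1;
    return mapped_bytes < ram_bytes / 10 * 8;
}

/* Hint the kernel to read ahead, then read one byte per page so every page
   of [base, base+len) has been faulted in on return. The returned sum only
   exists so the reads cannot be optimized away. */
static uint64_t prefault_region(const void *base, size_t len) {
    assert(base != NULL || len == 0);
    long ps = sysconf(_SC_PAGESIZE);
    size_t page = ps > 0 ? (size_t)ps : 4096;

    /* The hint is advisory; round the start down to a page boundary because
       the underlying madvise call rejects unaligned addresses. */
    uintptr_t start = (uintptr_t)base & ~((uintptr_t)page - 1);
    size_t span = len + (size_t)((uintptr_t)base - start);
    (void)posix_madvise((void *)start, span, POSIX_MADV_WILLNEED);

    const volatile unsigned char *p = (const volatile unsigned char *)base;
    uint64_t sum = 0;
    for (size_t off = 0; off < len; off += page) sum += p[off];
    /* An unaligned base can leave the final page untouched by the stride. */
    if (len > 0) sum += p[len - 1];
    return sum;
}
```

A caller would check should_prefault(getenv("SHARPI_PREFAULT"), total_mapped, physical_ram) once, then call prefault_region per CPU-resident tensor and time the loop to print the warm duration. Passing the environment value and RAM size as arguments keeps the gate testable without touching process state.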
Acceptance
- SHARPI_PREFAULT=0 restores current behavior.
References