perf(engine): mmap "pre-fault" touches only 2 pages per tensor, and the GDN CPU-MoE path has none — first request runs ~5× slow on 64 GB machines that could hold the whole model in page cache

## Problem

First-request latency on CPU-MoE configs is dominated by mmap page faults, and the existing "pre-fault" doesn't actually pre-fault:

- `CudaHybridForwardPass.cs:791-831` (and the Vulkan twin `HybridForwardPass.cs:470-494`) carry the comment *"touch the first byte of each weight tensor to ensure OS pages them into RAM"*, but the loop reads only `DataPtr[0]` and `DataPtr[size-1]` (`:826-828`). That faults **at most 2 pages (~8 KB at 4 KiB per page) per tensor**, regardless of size. For MoE, the stacked expert tensors (`ffn_*_exps`, hundreds of MB per layer) remain almost entirely unfaulted.
- `CudaHybridGdnForwardPass` — the Carnice / qwen3.6-A3B path, where CPU-MoE is the auto-selected winner on 12 GB — has **no pre-fault at all** for its CPU-resident expert weights (no touch loop in the file).
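
To put a number on how little the current two-read loop covers, here is a small arithmetic sketch. The class and method names are hypothetical, the 4 KiB page size is an assumption, and the 512 MiB tensor size in the usage note is only an illustrative stand-in for "hundreds of MB".

```csharp
using System;

static class PrefaultCoverage
{
    // Pages faulted by reading only the first and last byte of a region
    // (what the existing loop does): 1 if both bytes share a page, else 2.
    // Assumes the region starts on a page boundary.
    public static long PagesTouchedByEndpoints(long sizeBytes, int pageSize = 4096)
    {
        if (sizeBytes <= 0) return 0;
        long lastPage = (sizeBytes - 1) / pageSize;
        return lastPage == 0 ? 1 : 2;
    }

    // Pages the region actually spans (ceiling division).
    public static long TotalPages(long sizeBytes, int pageSize = 4096)
        => sizeBytes <= 0 ? 0 : (sizeBytes + pageSize - 1) / pageSize;
}
```

For a 512 MiB tensor, `TotalPages` gives 131072 and `PagesTouchedByEndpoints` gives 2, so the existing loop warms roughly 0.0015% of the tensor.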

The cost is already documented: #210's bench protocol exists because *"mmap'd experts make cold cells 5× slow"* — every cell must be run twice. The same 5× hits the first real request after server start (and any expert whose pages get evicted). On the target 64 GB-RAM machine the entire model fits in page cache, so this is pure avoidable latency.

## Proposed work

1. **Real pre-fault for CPU-MoE expert tensors**: stride-touch one byte in every 4 KiB page, or better, issue `posix_madvise(POSIX_MADV_WILLNEED)` on Linux / `PrefetchVirtualMemory` on Windows and fall back to a parallel stride-read. A parallel sequential sweep is bounded by SSD read bandwidth when the page cache is cold, so a 15–20 GB expert region warms in seconds at NVMe speeds, versus minutes of fault-stalled first-token decode.
2. Apply it to all three hybrid classes' CPU-resident weights (`CudaHybridGdnForwardPass` gets it for the first time; the two existing touch loops get fixed from 2-pages-per-tensor to full coverage).
3. Make it tunable: default ON when CPU-MoE / CPU layers are active and total mapped bytes are below ~80% of physical RAM, with a `SHARPI_PREFAULT=0|1` override, and print the warm time. Skip the sweep when RAM clearly cannot hold the model, since it would just thrash.
4. Optionally run the sweep on a background thread so prefill of the first request overlaps the tail of the warm-up (faults then only stall on still-cold pages).
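
The fallback stride-read from item 1 could look roughly like the sketch below. It is not the engine's code: `MmapPrefault`, `TouchAllPages`, and the 64 MiB chunk size are hypothetical, and the `posix_madvise` / `PrefetchVirtualMemory` fast paths are deliberately left out. It reads one byte per page across the whole region, splitting the pages into contiguous chunks so each worker sweeps sequentially.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Threading;
using System.Threading.Tasks;

static class MmapPrefault
{
    // Reads one byte from every page overlapping [basePtr, basePtr + length)
    // and returns the number of pages covered. basePtr need not be page-aligned.
    public static long TouchAllPages(IntPtr basePtr, long length,
                                     long chunkBytes = 64L << 20)
    {
        if (basePtr == IntPtr.Zero || length <= 0) return 0;

        long page = Environment.SystemPageSize;      // power of two
        long baseAddr = basePtr.ToInt64();
        long endAddr = baseAddr + length;
        long firstPage = baseAddr & ~(page - 1);     // round down to page start
        long pages = (endAddr - firstPage + page - 1) / page;
        long pagesPerChunk = Math.Max(1, chunkBytes / page);
        long chunks = (pages + pagesPerChunk - 1) / pagesPerChunk;
        int sink = 0;                                // keeps the reads observable

        Parallel.For(0L, chunks, c =>
        {
            long p0 = c * pagesPerChunk;
            long p1 = Math.Min(p0 + pagesPerChunk, pages);
            int acc = 0;
            for (long p = p0; p < p1; p++)
            {
                // Clamp so the first read never lands before basePtr.
                long addr = Math.Max(firstPage + p * page, baseAddr);
                acc ^= Marshal.ReadByte(new IntPtr(addr));
            }
            Interlocked.Add(ref sink, acc);
        });

        GC.KeepAlive(sink);
        return pages;
    }
}
```

A caller would invoke this once per CPU-resident tensor with its `DataPtr` and byte size, wrapped in a `Stopwatch` to print the warm time from item 3. For item 4, the same call can be wrapped in `Task.Run` so the first request's prefill overlaps the tail of the sweep.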

## Acceptance

- [ ] Cold-start protocol from #210 (fresh page cache, e.g. after `drop_caches`): first-request decode t/s on Carnice / 35B-A3B CPU-MoE within ~10% of warm steady-state, vs ~5× slower today.
- [ ] Load-time increase reported and bounded (seconds, parallel sweep); `SHARPI_PREFAULT=0` restores current behavior.
- [ ] No behavior change for fully-GPU-resident configs.
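
The on/off decision behind the last two checkboxes can be sketched as a pure function. Everything here is an assumption layered on the proposal: the names `PrefaultPolicy` and `ShouldPrefault` are hypothetical, and so is the precedence chosen (GPU-only configs never warm, then the `SHARPI_PREFAULT` override, then the ~80% RAM heuristic).

```csharp
#nullable enable
using System;

static class PrefaultPolicy
{
    // envOverride is the raw value of SHARPI_PREFAULT (null when unset).
    public static bool ShouldPrefault(bool cpuWeightsActive, long mappedBytes,
                                      long physicalRamBytes, string? envOverride,
                                      double ramFraction = 0.80)
    {
        // Fully GPU-resident configs: nothing to warm, behavior unchanged.
        if (!cpuWeightsActive) return false;

        // Explicit override wins over the heuristic (assumed precedence).
        if (envOverride == "0") return false;
        if (envOverride == "1") return true;

        // Default: warm only if the mapped weights fit comfortably in RAM,
        // otherwise the sweep would evict its own pages and thrash.
        return mappedBytes < (long)(physicalRamBytes * ramFraction);
    }
}
```

At load time the override would come from `Environment.GetEnvironmentVariable("SHARPI_PREFAULT")`. One possible source for the RAM figure is `GC.GetGCMemoryInfo().TotalAvailableMemoryBytes`, though whether that matches what the engine should treat as physical RAM is untested here.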

## References

- #210 (documents the 5× cold penalty + warm-cache bench workaround), #100 (CPU-MoE is the auto-selected config on 12 GB — making its cold path matter), #215 (more models landing on CPU-MoE).
