perf(engine): mmap "pre-fault" touches only 2 pages per tensor, and the GDN CPU-MoE path has none — first request runs ~5× slow on 64 GB machines that could hold the whole model in page cache #221

@pekkah

Problem

First-request latency on CPU-MoE configs is dominated by mmap page faults, and the existing "pre-fault" doesn't actually pre-fault:

  • CudaHybridForwardPass.cs:791-831 (and the Vulkan twin HybridForwardPass.cs:470-494) say "touch the first byte of each weight tensor to ensure OS pages them into RAM", but the loop reads only DataPtr[0] and DataPtr[size-1] (:826-828). That touches only 2 pages (~8 KiB at a 4 KiB page size) per tensor, regardless of tensor size. For MoE, the stacked expert tensors (ffn_*_exps, hundreds of MB per layer) remain almost entirely unfaulted.
  • CudaHybridGdnForwardPass — the Carnice / qwen3.6-A3B path, where CPU-MoE is the auto-selected winner on 12 GB — has no pre-fault at all for its CPU-resident expert weights (no touch loop in the file).
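
To put a number on the first bullet, here is a minimal C sketch of the coverage arithmetic. It is not code from the repository: the helper names `pages_total` and `pages_two_touch` are hypothetical, and it assumes the tensor starts on a page boundary and that pages are 4 KiB.

```c
#include <assert.h>
#include <stddef.h>

/* Number of pages spanned by a tensor of `size` bytes.
 * Assumption: the tensor starts on a page boundary. */
static size_t pages_total(size_t size, size_t page) {
    assert(page > 0);
    return (size + page - 1) / page;
}

/* Number of distinct pages hit by reading only byte 0 and byte size-1,
 * which is what the existing touch loop is described as doing. */
static size_t pages_two_touch(size_t size, size_t page) {
    assert(page > 0);
    if (size == 0) return 0;
    /* First byte is on page 0; last byte is on page (size-1)/page. */
    return ((size - 1) / page == 0) ? 1 : 2;
}
```

For a hypothetical 256 MiB stacked expert tensor and 4 KiB pages, `pages_total` gives 65536 and `pages_two_touch` gives 2, so the loop covers roughly 0.003% of the tensor.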

The cost is already documented: #210's bench protocol exists because "mmap'd experts make cold cells 5× slow" — every cell must be run twice. The same 5× hits the first real request after server start (and any expert whose pages get evicted). On the target 64 GB-RAM machine the entire model fits in page cache, so this is pure avoidable latency.

Proposed work

  1. Real pre-fault for CPU-MoE expert tensors: stride-touch every 4 KiB page (or better, posix_madvise(POSIX_MADV_WILLNEED) on Linux / PrefetchVirtualMemory on Windows, falling back to a parallel stride-read). A parallel sequential sweep runs at memory/SSD-read bandwidth — a 15–20 GB expert region warms in seconds at NVMe speeds, vs. minutes of fault-stalled first-token decode.
  2. Apply it to all three hybrid classes' CPU-resident weights (CudaHybridGdnForwardPass gets it for the first time; the two existing touch loops get fixed from 2-pages-per-tensor to full coverage).
  3. Make it tunable: default ON when CPU-MoE / CPU layers are active and total mapped bytes < ~80% of physical RAM; SHARPI_PREFAULT=0|1 override; print the warm time. Skip when RAM clearly can't hold the model (would just thrash).
  4. Optionally run the sweep on a background thread so prefill of the first request overlaps the tail of the warm-up (faults then only stall on still-cold pages).
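
As a rough illustration of items 1 and 3, the following C sketch shows the Linux side only: a `posix_madvise(POSIX_MADV_WILLNEED)` hint followed by a single-threaded stride read, plus the proposed gate. It is a sketch under stated assumptions, not the project's implementation: the function names are hypothetical, the real code lives in C# and would need P/Invoke or an equivalent, the stride read here is sequential rather than parallel, and `_SC_PHYS_PAGES` is a glibc extension rather than POSIX.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L /* expose posix_madvise under strict -std= modes */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Gate from item 3: an explicit SHARPI_PREFAULT=0|1 value wins; otherwise
 * warm only when the mapped bytes are below ~80% of physical RAM.
 * `override` is the raw environment value, or NULL when unset. */
static int should_prefault(uint64_t mapped_bytes, uint64_t phys_bytes,
                           const char *override) {
    if (override && strcmp(override, "0") == 0) return 0;
    if (override && strcmp(override, "1") == 0) return 1;
    return mapped_bytes < phys_bytes / 10 * 8;
}

/* Physical RAM in bytes, or 0 when it cannot be determined (the gate then
 * skips the warm-up unless overridden). _SC_PHYS_PAGES is an extension. */
static uint64_t physical_ram_bytes(void) {
#ifdef _SC_PHYS_PAGES
    long pages = sysconf(_SC_PHYS_PAGES);
    long psize = sysconf(_SC_PAGESIZE);
    if (pages > 0 && psize > 0) return (uint64_t)pages * (uint64_t)psize;
#endif
    return 0;
}

/* Item 1: hint the kernel, then read one byte per page so every page of
 * [addr, addr+len) is faulted in. Returns the number of stride reads. */
static size_t prefault_region(const void *addr, size_t len) {
    if (len == 0) return 0;
    long ps = sysconf(_SC_PAGESIZE);
    assert(ps > 0);
    size_t page = (size_t)ps;

    /* posix_madvise expects a page-aligned start, so round down. The hint
     * is advisory; its result is ignored and the reads below do the work. */
    uintptr_t start = (uintptr_t)addr & ~((uintptr_t)page - 1);
    size_t span = len + (size_t)((uintptr_t)addr - start);
    (void)posix_madvise((void *)start, span, POSIX_MADV_WILLNEED);

    /* volatile keeps the compiler from dropping the otherwise unused reads. */
    const volatile unsigned char *p = (const volatile unsigned char *)addr;
    size_t touched = 0;
    for (size_t off = 0; off < len; off += page) {
        (void)p[off];
        touched++;
    }
    (void)p[len - 1]; /* covers the last page when addr is not page-aligned */
    return touched;
}
```

A caller would evaluate `should_prefault(total_mapped, physical_ram_bytes(), getenv("SHARPI_PREFAULT"))` once at load time, then call `prefault_region` on each CPU-resident expert tensor and report the elapsed time. Splitting the regions across worker threads, or running the whole sweep on a background thread as in item 4, is left out to keep the sketch short.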

Acceptance

References
