perf(engine): real per-page mmap pre-fault for CPU-resident weights (#221) by pekkah · Pull Request #257 · pekkah/SharpInference

pekkah · 2026-06-15T12:23:23Z

Fixes #221.

Problem

The hybrid "pre-fault" only touched 2 pages per tensor (DataPtr[0] + DataPtr[size-1]), and CudaHybridGdnForwardPass had none. So the first request on CPU-MoE configs faulted every expert page on the critical path — ~5× slower than warm, the penalty #210's "run each cell twice" bench protocol works around. The existing sweeps were also gated on _nCpuLayers > 0, but the CPU-MoE routed experts (_cpuMoe*) and Gemma 4 PLE table (_pleTokenEmbed, GiB-scale) are CPU-resident even at -g -1 — exactly the dominant cold cost, and entirely missed.
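To put the "2 pages per tensor" problem in numbers, here is a small Python sketch (not code from this PR). It assumes 4 KiB pages and a hypothetical 1 GiB tensor; the helper names are made up for illustration.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages; large-page setups differ

def pages_spanned(size_bytes, page_size=PAGE_SIZE):
    """Number of pages a page-aligned mapping of size_bytes covers."""
    return (size_bytes + page_size - 1) // page_size

def old_touch_count(size_bytes, page_size=PAGE_SIZE):
    """Pages made resident by reading only the first and the last byte."""
    return min(2, pages_spanned(size_bytes, page_size))

one_gib = 1 << 30  # hypothetical tensor size, not a figure from the PR
total = pages_spanned(one_gib)
touched = old_touch_count(one_gib)
print(f"{touched} of {total} pages touched")  # 2 of 262144 pages touched
```

Under these assumptions every remaining page would still fault on first use, which is the cold-start cost described above.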

Change

- New MmapPrefault.cs, one shared helper that runs three stages: a pure, unit-testable ShouldRun() gate (SHARPI_PREFAULT=0 off / =1 force / auto with an 80%-of-RAM fit heuristic so it never thrashes) → best-effort OS read-ahead (PrefetchVirtualMemory on Windows, posix_madvise(WILLNEED) on Linux, both LibraryImport/AOT-safe) → parallel per-page stride read (32 MiB chunks) that guarantees residency and reports GiB/s.
- CudaHybridForwardPass, HybridForwardPass (Vulkan), CudaHybridGdnForwardPass: each gets a BuildCpuPrefaultRegions() and a MmapPrefault.Run() call, replacing the broken 2-page loops (or adding the first-ever prefault for the GDN class). Coverage now includes the CPU-MoE experts + PLE; embedding/output are skipped when GPU-resident; the Vulkan SLRU-miss CPU fallback experts are covered too. Only ResolveCpuWeight/mmap tensors are faulted, never LoadCpuBias/LoadF32Tensor copies.
- ForwardPass + HybridGdnForwardPass: their pre-existing full-page prefaults now honour the global SHARPI_PREFAULT=0 kill switch.
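The hint-then-stride-read sequence described above can be sketched in Python. This is not the PR's C# implementation: `prefault` and `touch_chunk` are hypothetical names, `mmap.madvise` stands in for posix_madvise(WILLNEED), and there is no Windows PrefetchVirtualMemory equivalent here. CPython's GIL also means the threads mainly illustrate the chunking rather than deliver real parallel speedup.

```python
import mmap
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor

CHUNK = 32 * 1024 * 1024  # 32 MiB work unit, the chunk size named in the description

def touch_chunk(view, start, end, page):
    """Read one byte from every page in [start, end); return pages touched."""
    count = 0
    for offset in range(start, end, page):
        _ = view[offset]  # the read forces the page to become resident
        count += 1
    return count

def prefault(mm, workers=4):
    """Best-effort read-ahead hint, then a per-page stride read over the mapping."""
    page = mmap.PAGESIZE
    size = len(mm)
    # The hint is optional: skip it where the platform lacks madvise.
    if hasattr(mm, "madvise") and hasattr(mmap, "MADV_WILLNEED"):
        try:
            mm.madvise(mmap.MADV_WILLNEED)
        except OSError:
            pass  # best effort only; the stride read below still runs
    view = memoryview(mm)
    try:
        chunks = [(s, min(s + CHUNK, size)) for s in range(0, size, CHUNK)]
        t0 = time.perf_counter()
        with ThreadPoolExecutor(max_workers=workers) as pool:
            touched = sum(pool.map(lambda c: touch_chunk(view, c[0], c[1], page), chunks))
        return touched, time.perf_counter() - t0
    finally:
        view.release()  # the mapping cannot be closed while a view is exported

# Usage: map a small temporary file spanning three full pages plus one byte.
with tempfile.NamedTemporaryFile() as f:
    f.write(b"\0" * (3 * mmap.PAGESIZE + 1))
    f.flush()
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        pages, seconds = prefault(mm)
        print(pages)  # 4
```

The hint alone does not guarantee residency, which is presumably why the description pairs it with an explicit read of every page.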

Verification (4070 Ti, 12 GB)

| Config | Class | Prefault log | Decode t/s |
| --- | --- | --- | --- |
| Coder-30B-A3B `-g -1` CPU-MoE, ON | `CudaHybridForwardPass` | `Pre-faulted 16.39 GiB … 0.3s (50.5 GiB/s)` | 25.1 |
| Coder-30B-A3B, `SHARPI_PREFAULT=0` | `CudaHybridForwardPass` | `Pre-fault disabled (SHARPI_PREFAULT=0)` | 21.0 (prefill 8.2 vs 16.4) |
| Carnice-35B-A3B (GDN) `-g -1`, ON | `CudaHybridGdnForwardPass` | `Pre-faulted 14.66 GiB … 29.4s (0.5 GiB/s)` | 24.0 (MTP 85%) |

The Carnice experts were cold on a slow drive: the sweep read 14.66 GiB at 0.5 GiB/s = ~29 s of fault I/O front-loaded off the request path (the previously-uncovered class). Decode is unchanged because the prefault is read-only.
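As a sanity check on the logged figures, the sweep time follows from size divided by throughput. The short Python sketch below only redoes that arithmetic with the rounded values from the table; `sweep_seconds` is an illustrative helper, not part of the PR.

```python
def sweep_seconds(gib, gib_per_s):
    """Time to read `gib` GiB at a sustained `gib_per_s` GiB/s."""
    return gib / gib_per_s

carnice = sweep_seconds(14.66, 0.5)   # cold read from a slow drive
coder = sweep_seconds(16.39, 50.5)    # already warm in the page cache

print(f"Carnice: {carnice:.2f} s")  # Carnice: 29.32 s
print(f"Coder:   {coder:.2f} s")    # Coder:   0.32 s
```

Both results match the logged 29.4s and 0.3s to within the rounding of the reported GiB/s values.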

13 new unit tests in MmapPrefaultTests.cs; full Release build clean (TreatWarningsAsErrors + AOT/trim analyzers). A 23-agent adversarial review found 3 minor items; 2 fixed (skip GPU-resident embedding; cover Vulkan SLRU-miss experts).

Not covered

A controlled Linux drop_caches cold-vs-warm first-request decode comparison (acceptance #1) — Windows has no clean cache-evict; the Carnice cold sweep above is the direct evidence. Worth capturing on a Linux box.

🤖 Generated with Claude Code

…221)

The hybrid "pre-fault" only touched 2 pages per tensor (DataPtr[0] + DataPtr[size-1]), and CudaHybridGdnForwardPass had none -- so the first request on CPU-MoE configs faulted every expert page on the critical path, ~5x slower than warm (the #210 bench protocol's "run each cell twice" workaround exists for exactly this).

New shared MmapPrefault helper: a pure, testable gate (SHARPI_PREFAULT 0=off / 1=force, plus an 80%-of-RAM fit heuristic), best-effort OS read-ahead (PrefetchVirtualMemory on Windows, posix_madvise(WILLNEED) on Linux, both LibraryImport), then a parallel per-page stride read (32 MiB chunks) that guarantees residency and reports GiB/s.

Wired into the three hybrid classes via BuildCpuPrefaultRegions(), now covering the full CPU-resident set the old gate missed -- the CPU-MoE routed experts and Gemma 4 PLE table, which are mmap-resident even at -g -1 (_nCpuLayers == 0). Embedding/output are skipped when GPU-resident. The pre-existing full-page prefaults in ForwardPass and HybridGdnForwardPass now honour the same SHARPI_PREFAULT=0 kill switch.

Verified on a 4070 Ti: CudaHybridForwardPass warms 16.39 GiB of Coder-30B CPU-MoE experts; CudaHybridGdnForwardPass (no prefault before) warms 14.66 GiB of Carnice experts (0.5 GiB/s cold read = ~29s of fault I/O moved off the request path). Decode unchanged (prefault is read-only). 13 unit tests in MmapPrefaultTests.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
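The gate described in this commit message can be sketched as a pure function. This is a Python illustration, not the project's ShouldRun(). The name `should_run` and its parameters are hypothetical, and whether the 80% is measured against total or available RAM is not stated in the source, so the caller supplies the RAM figure.

```python
RAM_FIT_FRACTION = 0.80  # the "80%-of-RAM" fit heuristic from the description

def should_run(env_value, region_bytes, ram_bytes):
    """Decide whether to pre-fault.

    env_value    -- value of SHARPI_PREFAULT, or None when unset
    region_bytes -- total size of the regions that would be faulted
    ram_bytes    -- RAM budget (assumption: total vs. available is the caller's choice)
    """
    if env_value == "0":
        return False  # explicit kill switch
    if env_value == "1":
        return True   # force, bypassing the fit check
    # Auto mode: run only when the weights fit comfortably in RAM,
    # so the sweep cannot evict its own pages and thrash.
    return region_bytes <= RAM_FIT_FRACTION * ram_bytes

GIB = 1 << 30
print(should_run(None, 16 * GIB, 32 * GIB))  # True
print(should_run(None, 30 * GIB, 32 * GIB))  # False
```

Because the function takes the environment value and sizes as arguments instead of reading them itself, it can be unit-tested without touching the process environment. That is one plausible reading of "pure, testable gate".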

gemini-code-assist

Code Review

This pull request introduces a centralized and optimized memory-mapped weight pre-faulting mechanism via the new MmapPrefault class, addressing issue #221. It replaces inline pre-faulting logic across several forward pass classes with calls to MmapPrefault.Run, which leverages OS-specific hints (PrefetchVirtualMemory on Windows and posix_madvise on Linux) followed by a parallel stride read to guarantee residency. It also introduces a global environment variable switch (SHARPI_PREFAULT) and a RAM-fit heuristic to prevent thrashing, backed by a comprehensive set of unit and integration tests. No review comments were provided for this pull request.


- Migrate ForwardPass + HybridGdnForwardPass off their hand-rolled per-page sweeps onto MmapPrefault.Run(..., RamGate.Always). Makes RamGate.Always a live (not test-only) code path, unifies the SHARPI_PREFAULT kill switch + logging + OS read-ahead across all five passes, and removes the now-unused MmapPrefault.IsDisabled() (+ its test). Behaviour is unchanged: Always still sweeps the whole model unless SHARPI_PREFAULT=0.
- Resolve GPU-trunk routed experts tolerantly (FindTensor, skip if absent) instead of ResolveCpuWeight, which threw -- a pre-fault must never make an otherwise-loadable model fail to load.
- Cover the CUDA GPU-SLRU MoE path symmetrically with Vulkan: when experts stream through the SLRU (not CPU-MoE), fault blk.{0..nGpu}.ffn_*_exps so the first request's cache fills don't demand-page off disk.

Smoke-tested: pure-CPU ForwardPass logs "Pre-faulted 0.74 GiB" and decodes coherently. 12 unit tests green; full Release build clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
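The "skip if absent" rule for expert tensors can be illustrated with a short Python sketch. Everything here is hypothetical: `Region`, `find_tensor`, and `build_expert_regions` are illustrative names, the three suffixes are an assumed expansion of `ffn_*_exps`, and the layer range is treated as half-open. The real tensor names in a given model file may differ.

```python
from typing import NamedTuple, Optional

class Region(NamedTuple):
    name: str
    offset: int   # byte offset into the mapped file
    length: int   # byte length of the tensor data

# Assumed expansion of the ffn_*_exps pattern mentioned above.
EXPERT_SUFFIXES = ("ffn_gate_exps", "ffn_up_exps", "ffn_down_exps")

def find_tensor(index, name) -> Optional[Region]:
    """Tolerant lookup: returns None instead of raising when absent."""
    return index.get(name)

def build_expert_regions(index, n_layers):
    """Collect regions for the expert tensors that exist, silently skipping the rest."""
    regions = []
    for layer in range(n_layers):
        for suffix in EXPERT_SUFFIXES:
            region = find_tensor(index, f"blk.{layer}.{suffix}")
            if region is None:
                continue  # a missing tensor must not fail the model load
            regions.append(region)
    return regions

# Usage with a toy index in which most expert tensors are missing.
index = {
    "blk.0.ffn_gate_exps": Region("blk.0.ffn_gate_exps", 0, 4096),
    "blk.1.ffn_up_exps": Region("blk.1.ffn_up_exps", 4096, 8192),
}
print(len(build_expert_regions(index, 3)))  # 2
```

Since the pre-fault is purely an optimization, dropping a region it cannot find only costs some warm-up coverage. A throwing lookup would turn the same gap into a load failure.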

gemini-code-assist Bot reviewed Jun 15, 2026

pekkah merged commit b57c8e6 into master Jun 15, 2026
1 check passed

pekkah deleted the perf/221-mmap-prefault branch June 15, 2026 12:45

pekkah merged 2 commits into master from perf/221-mmap-prefault
