perf(engine): real per-page mmap pre-fault for CPU-resident weights (#221) #257

Merged: pekkah merged 2 commits into master from perf/221-mmap-prefault on Jun 15, 2026

@pekkah (Owner) commented Jun 15, 2026

Fixes #221.

Problem

The hybrid "pre-fault" only touched 2 pages per tensor (DataPtr[0] + DataPtr[size-1]), and CudaHybridGdnForwardPass had none. So the first request on CPU-MoE configs faulted every expert page on the critical path — ~5× slower than warm, the penalty #210's "run each cell twice" bench protocol works around. The existing sweeps were also gated on _nCpuLayers > 0, but the CPU-MoE routed experts (_cpuMoe*) and Gemma 4 PLE table (_pleTokenEmbed, GiB-scale) are CPU-resident even at -g -1 — exactly the dominant cold cost, and entirely missed.
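The gap between touching two pages and touching every page can be sketched in C. This is only an illustration, not the project's C# code: both function names are hypothetical, and the buffer is assumed to start on a page boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the old loop: read only the first and last byte,
 * so at most two pages of the tensor become resident.
 * Returns the number of distinct pages touched. */
static size_t touch_two_pages(const volatile uint8_t *base, size_t size, size_t page) {
    assert(page > 0);
    if (size == 0) return 0;
    uint8_t sink = base[0];
    sink ^= base[size - 1];
    (void)sink;
    /* First and last byte share a page when the tensor fits in one page. */
    return ((size - 1) / page == 0) ? 1 : 2;
}

/* Per-page sweep: read one byte from every page so each one is faulted in.
 * Assumes base is page-aligned; the volatile reads keep the compiler from
 * dropping the loop. Returns the number of pages touched. */
static size_t touch_every_page(const volatile uint8_t *base, size_t size, size_t page) {
    assert(page > 0);
    size_t touched = 0;
    uint8_t sink = 0;
    for (size_t off = 0; off < size; off += page) {
        sink ^= base[off];
        touched++;
    }
    (void)sink;
    return touched;
}
```

With 4 KiB pages, a 1 GiB tensor has 262,144 pages, so the two-page version leaves nearly all of them to be faulted on first use.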

Change

  • New MmapPrefault.cs — one shared helper: pure, unit-testable ShouldRun() gate (SHARPI_PREFAULT=0 off / =1 force / auto with an 80%-of-RAM fit heuristic so it never thrashes) → best-effort OS read-ahead (PrefetchVirtualMemory on Windows, posix_madvise(WILLNEED) on Linux, both LibraryImport/AOT-safe) → parallel per-page stride read (32 MiB chunks) that guarantees residency and reports GiB/s.
  • CudaHybridForwardPass, HybridForwardPass (Vulkan), CudaHybridGdnForwardPass — each gets a BuildCpuPrefaultRegions() and a MmapPrefault.Run() call, replacing the broken 2-page loops (or adding the first-ever prefault for the GDN class). Coverage now includes the CPU-MoE experts + PLE; embedding/output are skipped when GPU-resident; the Vulkan SLRU-miss CPU fallback experts are covered too. Only ResolveCpuWeight/mmap tensors are faulted — never LoadCpuBias/LoadF32Tensor copies.
  • ForwardPass + HybridGdnForwardPass — their pre-existing full-page prefaults now honour the global SHARPI_PREFAULT=0 kill switch.
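The gate described in the first bullet could look roughly like the C sketch below. It is an illustration under stated assumptions, not the actual ShouldRun() from MmapPrefault.cs: the function name and signature are hypothetical, the 80% limit is taken against total physical RAM, the boundary is inclusive, and unrecognised values of the variable fall back to the automatic heuristic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical pure gate: it takes the environment value and the sizes as
 * arguments instead of calling getenv or querying the OS, which is what
 * makes it unit-testable.
 *   env_value == "0"  -> never pre-fault (kill switch)
 *   env_value == "1"  -> always pre-fault (force)
 *   anything else     -> pre-fault only if the regions fit in 80% of RAM */
static bool should_run_prefault(const char *env_value,
                                uint64_t region_bytes,
                                uint64_t ram_bytes) {
    assert(ram_bytes > 0);
    if (env_value != NULL) {
        if (strcmp(env_value, "0") == 0) return false;
        if (strcmp(env_value, "1") == 0) return true;
    }
    /* 80% of RAM in integer arithmetic; dividing first avoids overflow. */
    return region_bytes <= ram_bytes / 5 * 4;
}
```

For example, with 32 GiB of RAM the automatic limit is about 25.6 GiB, so a 16 GiB expert set passes the gate and a 30 GiB one is skipped unless SHARPI_PREFAULT=1 is set.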

Verification (4070 Ti, 12 GB)

| Config | Class | Prefault log | Decode t/s |
| --- | --- | --- | --- |
| Coder-30B-A3B -g -1 CPU-MoE, ON | CudaHybridForwardPass | Pre-faulted 16.39 GiB … 0.3s (50.5 GiB/s) | 25.1 |
| Coder-30B-A3B, SHARPI_PREFAULT=0 | CudaHybridForwardPass | Pre-fault disabled (SHARPI_PREFAULT=0) | 21.0 (prefill 8.2 vs 16.4) |
| Carnice-35B-A3B (GDN) -g -1, ON | CudaHybridGdnForwardPass | Pre-faulted 14.66 GiB … 29.4s (0.5 GiB/s) | 24.0 (MTP 85%) |

The Carnice experts were cold on a slow drive: the sweep read 14.66 GiB at 0.5 GiB/s = ~29 s of fault I/O front-loaded off the request path (the previously-uncovered class). Decode is unchanged because the prefault is read-only.
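The sweep times follow from the logged sizes and rates. As a quick check, using only the figures in the log lines:

```math
t_{\text{Carnice}} \approx \frac{14.66\ \text{GiB}}{0.5\ \text{GiB/s}} \approx 29.3\ \text{s},
\qquad
t_{\text{Coder}} \approx \frac{16.39\ \text{GiB}}{50.5\ \text{GiB/s}} \approx 0.32\ \text{s}
```

Both agree with the logged 29.4s and 0.3s to within the rounding of the printed rate and time.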

13 new unit tests in MmapPrefaultTests.cs; full Release build clean (TreatWarningsAsErrors + AOT/trim analyzers). A 23-agent adversarial review found 3 minor items; 2 fixed (skip GPU-resident embedding; cover Vulkan SLRU-miss experts).

Not covered

A controlled Linux drop_caches cold-vs-warm first-request decode comparison (acceptance #1) — Windows has no clean cache-evict; the Carnice cold sweep above is the direct evidence. Worth capturing on a Linux box.

🤖 Generated with Claude Code

…221)

The hybrid "pre-fault" only touched 2 pages per tensor (DataPtr[0] +
DataPtr[size-1]), and CudaHybridGdnForwardPass had none -- so the first
request on CPU-MoE configs faulted every expert page on the critical
path, ~5x slower than warm (the #210 bench protocol's "run each cell
twice" workaround exists for exactly this).

New shared MmapPrefault helper: a pure, testable gate (SHARPI_PREFAULT
0=off / 1=force, plus an 80%-of-RAM fit heuristic), best-effort OS
read-ahead (PrefetchVirtualMemory on Windows, posix_madvise(WILLNEED)
on Linux, both LibraryImport), then a parallel per-page stride read
(32 MiB chunks) that guarantees residency and reports GiB/s.

Wired into the three hybrid classes via BuildCpuPrefaultRegions(), now
covering the full CPU-resident set the old gate missed -- the CPU-MoE
routed experts and Gemma 4 PLE table, which are mmap-resident even at
-g -1 (_nCpuLayers == 0). Embedding/output are skipped when GPU-resident.
The pre-existing full-page prefaults in ForwardPass and
HybridGdnForwardPass now honour the same SHARPI_PREFAULT=0 kill switch.

Verified on a 4070 Ti: CudaHybridForwardPass warms 16.39 GiB of Coder-30B
CPU-MoE experts; CudaHybridGdnForwardPass (no prefault before) warms
14.66 GiB of Carnice experts (0.5 GiB/s cold read = ~29s of fault I/O
moved off the request path). Decode unchanged (prefault is read-only).
13 unit tests in MmapPrefaultTests.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a centralized and optimized memory-mapped weight pre-faulting mechanism via the new MmapPrefault class, addressing issue #221. It replaces inline pre-faulting logic across several forward pass classes with calls to MmapPrefault.Run, which leverages OS-specific hints (PrefetchVirtualMemory on Windows and posix_madvise on Linux) followed by a parallel stride read to guarantee residency. It also introduces a global environment variable switch (SHARPI_PREFAULT) and a RAM-fit heuristic to prevent thrashing, backed by a comprehensive set of unit and integration tests. No review comments were provided for this pull request.

- Migrate ForwardPass + HybridGdnForwardPass off their hand-rolled per-page
  sweeps onto MmapPrefault.Run(..., RamGate.Always). Makes RamGate.Always a
  live (not test-only) code path, unifies the SHARPI_PREFAULT kill switch +
  logging + OS read-ahead across all five passes, and removes the now-unused
  MmapPrefault.IsDisabled() (+ its test). Behaviour is unchanged: Always still
  sweeps the whole model unless SHARPI_PREFAULT=0.
- Resolve GPU-trunk routed experts tolerantly (FindTensor, skip if absent)
  instead of ResolveCpuWeight, which threw -- a pre-fault must never make an
  otherwise-loadable model fail to load.
- Cover the CUDA GPU-SLRU MoE path symmetrically with Vulkan: when experts
  stream through the SLRU (not CPU-MoE), fault blk.{0..nGpu}.ffn_*_exps so the
  first request's cache fills don't demand-page off disk.

Smoke-tested: pure-CPU ForwardPass logs "Pre-faulted 0.74 GiB" and decodes
coherently. 12 unit tests green; full Release build clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@pekkah merged commit b57c8e6 into master on Jun 15, 2026
1 check passed
@pekkah deleted the perf/221-mmap-prefault branch on June 15, 2026 12:45