MoE expert offloading: research + activation-aware cache, CUDA SLRU parity, predictive prefetch#77
Conversation
Survey current MoE expert-offloading literature (KTransformers, MoE-Infinity, Pre-gated MoE, HOBBIT, MoE-SpeQ, Local Routing Consistency, etc.) and audit SharpInference's offloading stack against it. Key findings: - CUDA hybrid path never uses the SLRU cache; statically uploads all GPU-tier experts to VRAM as F32 (slot manager/prefetcher fields unassigned). - Prefetching is reactive same-layer 1-token re-enqueue, not predictive. - TierPlanner placement ignores ExpertAccessProfiler. Prioritized recommendations: quantized expert cache (HOBBIT-lite), wire SLRU into CUDA path, next-layer predictive prefetch, profiler-driven placement, KTransformers-style CPU expert GEMM.
- GDN CUDA path (CudaHybridGdnForwardPass) DOES use per-expert SLRU + CPU MoE + profiler dump; the gap is the non-GDN CudaHybridForwardPass (whole-layer offload, dead slot-manager fields). - Experts are cached in native quant; only Q5_K is dequantized to F32 on the Vulkan SLRU and the non-GDN CUDA resident path (CudaExpertSlotManager keeps Q5_K raw). - Cross-reference existing issues #50 (predictive prefetch) and #54 (Fiddler CPU fallback) instead of re-proposing them.
…+ warm-pinning (#74) SlruCache gains two optional, default-off refinements: - Frequency-aware eviction: with a frequencyOf accessor, the probationary victim is the least-accessed entry (recency breaks ties) instead of the strict LRU tail, and the just-inserted entry is never evicted (avoids the LFU cold-start trap). Biases the cache toward MoE routing skew. - Pinning: pinned keys live in the protected segment and are never evicted or demoted while resident. ExpertCache exposes Pin/Unpin/IsPinned and a frequencyOf passthrough. ExpertAccessProfiler gains GetAccessCount(layer, expert). Both ExpertSlotManager (Vulkan) and CudaExpertSlotManager now: - construct the cache with profiler-backed frequency-aware eviction (always on — degrades to LRU when access counts are equal); - support opt-in warm-pinning of the hottest resident experts via SHARPI_MOE_WARMPIN=N (per layer) after SHARPI_MOE_WARMPIN_AFTER accesses, bounded to half the cache. Default off → no behavioral change. Adds 7 unit tests (frequency eviction, cold-start protection, pinning, unpin, non-resident pin, ExpertCache pin/freq). All 18 Pipeline tests pass.
CudaHybridForwardPass previously uploaded every routed expert of every GPU-tier MoE layer resident in VRAM (whole-layer offload), forcing a GPU layer to hold its entire expert set. Mixtral / Qwen3-30B-A3B / Qwen3-Coder bigger than VRAM could therefore only offload at whole-layer granularity. This wires in the CudaExpertSlotManager SLRU cache (already proven on the CUDA-GDN path), mirroring CudaHybridGdnForwardPass.GpuMoeFfn: - routed experts are fetched via GetOrLoad (cached, or synchronous upload + cache on miss); the router and shared expert stay resident; - slot capacity is sized from actual free VRAM (cudaMemGetInfo via FreeVramBytes) after attention/KV/scratch upload, capped at the full GPU-layer expert count — so capacity never exceeds what the old eager path used (which TierPlanner already verified fits), i.e. no new OOM risk; the budget term only shrinks capacity when VRAM is tight (e.g. user forced extra GPU layers via -g), enabling streaming instead of an OOM; - frequency-aware eviction + opt-in warm-pinning come for free from #74; - SHARPI_EXPERT_STATS dumps SLRU stats on dispose (parity with GDN). Removes the eager expert arrays, UploadExpertWeights/UploadExpertWeight, and the mistyped dead _prefetcher/_gpuFallbackContrib/_gpuPinnedNorm fields (the former was typed to the Vulkan ExpertSlotManager). Compile-verified under TreatWarningsAsErrors + AOT analyzers; full CPU test suite green. GPU decode validation pending (no CUDA device in CI env).
… increment 1) Adds ExpertRoutePredictor: a training-free, temporal-locality predictor that remembers each MoE layer's previous-token expert selection. MoE routing has strong cross-token locality at a fixed layer (the premise behind the SLRU cache), so last-token selections cheaply predict the next token's. Wired into HybridForwardPass (Vulkan — the only path with an async MoEPrefetcher): while layer L computes, it prefetches layer L+1's predicted experts a full layer ahead, hiding PCIe transfer behind compute. Opt-in via SHARPI_MOE_PREDICT_PREFETCH=1 (default off → no behavior change); predictions are best-effort hints that never affect output. Predictor resets on ResetCache. Unlike a cross-layer GPU router run, this adds zero extra GPU compute or critical-path syncs — just CPU bookkeeping + the existing EnqueuePrefetch. The predictor is pure CPU and fully unit-tested (5 new tests). CUDA paths use synchronous GetOrLoad with no async prefetcher yet, so they don't benefit — async CUDA upload streams remain future work for #50. All 23 Pipeline tests pass; full CPU suite green. Vulkan integration compile-verified (no GPU in CI env).
Extracts the #72 capacity heuristic into MoeCacheSizing.Plan — a pure, unit-tested policy. Capacity is bounded by [1, totalGpuExperts] and never exceeds the VRAM budget (no OOM the budget wouldn't already imply). Separately it reports a RecommendedSlots target from the routing-locality finding of 'Not All Models Suit Expert Offloading' (arXiv:2505.16056): ~2× active experts per layer covers a token segment well. CudaHybridForwardPass now uses the helper and logs an advisory when the VRAM budget forces the cache below the locality sweet spot (suggesting fewer GPU layers or more VRAM) instead of silently underperforming with an arbitrary floor of 64. 6 new unit tests (ample/tight VRAM, recommended sizing, caps). All 29 Pipeline tests pass; full CPU suite green.
Surface the env-var knobs added in #74/#50/#72 as first-class CLI options on the run command (mirroring the --min-batch-blas 'also settable via env' pattern). Each flag sets the corresponding SHARPI_* env var early in Execute, before any forward pass is built, so the engine's existing env reads pick it up and shell-env use still works: --moe-warmpin <N> → SHARPI_MOE_WARMPIN (warm-pin top-N hot experts/layer) --moe-warmpin-after <N> → SHARPI_MOE_WARMPIN_AFTER (warmup threshold) --moe-predict-prefetch → SHARPI_MOE_PREDICT_PREFETCH (Vulkan next-layer prefetch) --expert-stats <path> → SHARPI_EXPERT_STATS (dump SLRU hit-rate stats) Verified: known flags parse (reach model-not-found, exit 1) while an unknown flag is rejected (exit 255). Full solution builds clean; CPU test suite green.
…e/disable Address two points: 1. llama.cpp naming: llama.cpp's MoE args are placement-only (--cpu-moe, --n-cpu-moe, -ot); it has no equivalent for SLRU warm-pinning, predictive prefetch, or cache stats (cf. open llama.cpp request ggml-org/llama.cpp#20757), so these keep SharpInference-conventional names — nothing to match. 2. Good out-of-box defaults (no flags needed for good performance): - frequency-aware SLRU eviction: on (already) - VRAM-sized expert cache: on (already) - next-layer predictive prefetch: now ON BY DEFAULT (it only makes the already-on background prefetch smarter and is a no-op when the cache isn't under pressure). Disable via --no-moe-predict-prefetch / SHARPI_MOE_PREDICT_PREFETCH=0. - warm-pinning stays opt-in (--moe-warmpin N): redundant with frequency-aware eviction for most workloads, so not forced on. --moe-warmpin is now int? so an explicit 0 forces it off; the removed positive --moe-predict-prefetch is replaced by the --no- disable. Verified by running the built CLI: updated flags parse (exit 1, model-not-found), removed positive flag rejected (exit 255). Full solution builds clean; CPU suite green (16 failures are pre-existing no-GPU Vulkan tests).
There was a problem hiding this comment.
Code Review
This pull request introduces significant enhancements to Mixture-of-Experts (MoE) expert offloading, including frequency-aware SLRU cache eviction, next-layer predictive prefetching, and opt-in warm-pinning of hot experts. Crucially, it transitions the non-GDN CUDA hybrid path to use the SLRU-streamed expert cache instead of eager whole-layer loading, bringing it to parity with other backends. The review feedback highlights a potential VRAM leak in CudaHybridForwardPass.Dispose if writing stats throws an exception, a stale comment regarding removed eager weights, and a logic flaw in the ExpertCache_FrequencyAware_EvictsLeastAccessedExpert unit test where a low cache capacity prevents the frequency-aware eviction logic from being properly tested.
- CudaHybridForwardPass.Dispose: wrap the SHARPI_EXPERT_STATS write in try/catch so a StreamWriter failure (bad path/permissions) can't skip _expertSlotManager.Dispose() and leak GPU tensors. - Remove stale eager-expert comment left dangling above the attention-bias fields after the SLRU migration. - ExpertCache_FrequencyAware test: capacity 4 -> 8 so probCap=2 actually exercises frequency-aware eviction under overflow (with probCap=1 the head-exclusion alone forced the eviction and 'hot' was in fact evicted early; the old assertion passed coincidentally). Now asserts 'cold-a' evicted and 'hot' retained.
…nt layers
Two issues uncovered by running the PR's new CLI surface against real GPU
MoE workloads:
- --moe-warmpin-after declared `[DefaultValue(0)]` (Int32) on a `long`
property. Spectre.Console.Cli's Int64Converter rejected the int literal,
so every command invocation failed with "Int64Converter cannot convert
from System.Int32." before any flag was actually parsed. Use 0L.
- ExpertAccessProfiler.PrintStats iterated all `_numLayers` rows, but
CPU-resident layers under CUDA hybrid offload (and any non-MoE layer
the profiler is sized for) never report into it. Their counts are all
zero, GetLayerHitRate returns 0.0, and GetTopExperts(layer, 3) sorts a
count-zero array with the unstable Array.Sort, producing arbitrary IDs
like "[0, 93, 92]" identically on every empty row. Replace with a clear
"(no GPU SLRU accesses recorded)" line so the stats file no longer
invites the reader to chase a phantom signal.
Verified on RTX 4070 Ti:
- Vulkan all-GPU MoE (OLMoE): coherent, prefetch on vs off identical
output at temp 0
- CUDA hybrid non-GDN MoE (Qwen3-Coder-30B-A3B): SLRU streams
3043/3712 experts, 87.1 % overall hit rate, 23-25 t/s decode
- CUDA hybrid GDN+MTP (Qwen3.6-27B-MTP): unaffected, 100 % MTP accept
- All 433 unit tests still pass
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Re-ran scripts/bench-moe-rerun.ps1 (3 repeats, 80 tokens, --temp 0,
quicksort prompt) on the four Qwen3-Coder rows after the per-expert
SLRU streaming + predictive-prefetch changes landed.
Backend Prefill (was) Decode (was)
CPU 13.3 (15.1) 21.1 (21.2) warm-cache noise
CPU --tq 13.7 (12.0) 21.0 (21.1) warm-cache noise
Vulkan -g -1 hybrid 1.1 (1.0) 5.3 (5.8) prefetch a no-op
until cache pressure
CUDA -g -1 hybrid 11.5 (13.9) 24.4 (22.7) #72 SLRU streaming
trades cold prefill
for warmer decode
Notable: the CUDA-hybrid row now flags the per-expert SLRU streaming
(3043/3712 slot capacity, ≈ 87 % observed hit rate) as the source of the
+7.5 % decode lift, and is explicit about the prefill regression vs the
prior eager whole-layer upload. The Vulkan-hybrid row credits the
next-layer predictive prefetch and notes the disable flag.
CPU and CPU `--tq` decode are flat within run-to-run jitter; prefill
varies by ~2 t/s across cold/warm mmap pages, so the previous numbers
were on the high end of the same distribution.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…sweep Single-pass bench-all on the PR #77 branch turned up real improvements on the GDN-MoE and non-GDN-MoE-on-tight-VRAM rows. Row Prefill (was) Decode (was) Llama-4 Scout CPU 2.1 (1.9) 4.3 (3.9) Llama-4 Scout CUDA hybrid 1.2 (0.9) 2.6 (2.1) PR #77 SLRU streaming Qwen3.6-35B-A3B CPU 6.7 (4.3) 8.5 (7.8) Qwen3.6-35B-A3B CUDA hybrid 14.7 (11.2) 23.2 (23.8) Scout CUDA-hybrid is the clearest #72 win — non-GDN MoE on a 12 GB card where the prior eager whole-layer expert upload had to spill aggressively; per-expert SLRU streaming gives +24 % decode and +33 % prefill. CPU-only still wins outright on Scout, but the gap narrowed enough to update the note. Other deltas (qwen36 CPU prefill +56 %, qwen36-cuda-hybrid prefill +31 %) predate this PR and just reflect the README rotting since older optimisations landed. Dense rows (SmolLM2, Qwen3-8B) untouched — single-run jitter only. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
3-repeat sweep over moe-cpu / moe-cpu-tq / moe-vulkan-hybrid / moe-cuda-hybrid with mean ± stddev and a hardcoded baseline so deltas vs the README's published numbers are visible at a glance. Used to sanity-check the post-PR #77 numbers on the Qwen3-Coder-30B-A3B rows. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Five reviewer agents flagged 12 issues gemini-code-assist missed. Applied all 12 plus three new tests, kept every change additive (no behaviour regression except the documented one below). Critical: - MoEPrefetcher.RunAsync now catches Preload faults per batch instead of killing the worker on the first transient. Without this, one stale predicted expert / cudaMalloc spike silently drops throughput to "no prefetch" for the rest of the process with nothing to grep for. - MoeCacheSizing.Plan exposes a Status enum (Ok / BelowRecommended / BudgetExhausted / UnknownExpertSize / Empty) so callers can distinguish "clamped to 1 because VRAM is too tight" from "ok". CudaHybridForwardPass now emits WARNING with the actual slot/recommendation ratio in the BelowRecommended path and louder messages for the two edge cases. - docs/moe-expert-offloading-research.md got a "Status post PR #77" callout listing which P0 / P1 / gap-analysis items this PR closes, so the audit can't be read as a still-open punch list of work that has already shipped. High: - HybridForwardPass.Dispose now writes SHARPI_EXPERT_STATS (parity with the two CUDA paths) so the CLI flag works on Vulkan too. The README's Server section now spells out which knobs are env-var-only. - ExpertSlotManager.TryGetCached now records hits/misses on the profiler and Preload calls MaybeWarmPin — both are no-ops on the CUDA path which uses GetOrLoad, but turn the Vulkan path's frequency-aware eviction and warm-pinning from "dead code" into functional features (the previous Vulkan profiler stats file was always all-zeros). - PerExpertBytes() now rounds each tensor's bytes up to the buffer pool's allocation bucket via the new public CudaBackend.RoundUpAllocBytes, so the planner sizes from real per-slot VRAM, not raw bytes. Caps the Qwen3-Coder-30B-A3B CUDA hybrid cache at 2220 / 3712 slots (was 3043, which would have hit cudaMalloc failure once a working set filled past ~2220 unique experts). README row updated: decode 24.4 → 22.2 (the +7.5% decode win on the prior commit was relative to a budget that couldn't actually be filled); prefill 11.5 → 10.6 — still a +7.7% decode win vs the eager pre-PR baseline (20.6 → 22.2) with the latent crash gone. - HybridForwardPass.GpuMoeFfn reorders TryGetCached before the prefetch enqueue, so a current-layer probationary slot can't be evicted by the worker mid-loop. Spurious-miss perf-only bug; reorder is cheap. - MaybeWarmPin (both backends) sorts layers by GetLayerAccessCount before picking pins, so a tight pin budget protects the hottest layers instead of biasing to layer-index order (matters for hybrid GDN+MoE models that cluster MoE FFN at high indices). - WarmPinConfig parsers log on malformed values instead of silently treating them as defaults, and the >=0 / >0 asymmetry between ParseInt / ParseLong is made explicit via an allowZero parameter. - HybridForwardPass SHARPI_MOE_PREDICT_PREFETCH parser is now case-insensitive (accepts 1/0, true/false, on/off, yes/no, enabled/ disabled) and logs on unrecognized values. Tests: - 4 new tests pin the MoeCacheSizing status path (Ok / BelowRecommended / BudgetExhausted / UnknownExpertSize). Pipeline suite 29 → 33. - All 433 unit tests still pass (Core 57, Pipeline 33, Server 45, TurboQuant 41, ForwardPass 261). - Sanity checks on the 4070 Ti: Vulkan hybrid Qwen3-Coder writes a populated stats file (68.8 % hit rate, was previously empty); CUDA hybrid Qwen3-Coder decodes coherently and stable across 3 runs at prefill 10.6 ± 0.25, decode 22.2 ± 0.12. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Builds on the SLRU frequency-aware eviction + warm-pin pass shipped in #77 to satisfy the remaining #74 acceptance criteria: - Auto-enable warm-pinning at NumActiveExperts per layer when the SLRU slot capacity is smaller than the total expert count. The env var SHARPI_MOE_WARMPIN still overrides; the auto-rule fires only when warm-pinning can actually help (cache < total experts). - Profiler now records a snapshot of hit/miss totals + the pinned (layer, expertId) set when the slot manager warms. PrintStats splits pre-warm and post-warm hit rates so callers can see whether pinning actually improved residency, and lists the pinned experts per layer.
Summary
Researched the state of the art in MoE expert offloading, audited SharpInference's offloading stack against it, and implemented the high-leverage gaps that don't require model-specific tuning. Starts from a written survey/gap-analysis and lands four engine features plus CLI surface with sensible defaults.
Companion issues opened during the research: #72, #73 (corrected), #74, #75, #76 (and existing #50, #54 cross-referenced).
What's here
Research (
docs/moe-expert-offloading-research.md) — SOTA survey (KTransformers, MoE-Infinity, Pre-gated MoE, HOBBIT, Local Routing Consistency, MoE-SpeQ, …), a verified audit of our three hybrid forward passes, and a prioritized gap analysis.#74 — activation-aware expert cache (fully unit-tested)
SlruCache: optional frequency-aware eviction (least-accessed probationary victim, never the just-inserted entry → no LFU cold-start trap) and pinning (pinned keys live in protected, never evicted/demoted).ExpertCachegainsPin/Unpin/IsPinned+ a frequency passthrough;ExpertAccessProfilergainsGetAccessCount.#72 — non-GDN CUDA hybrid → per-expert SLRU
CudaHybridForwardPasspreviously uploaded every routed expert of every GPU-tier layer resident (whole-layer offload). It now streams experts throughCudaExpertSlotManager.GetOrLoad, mirroring the proven CUDA-GDN path, so big non-GDN MoEs (Mixtral / Qwen3-30B-A3B / Qwen3-Coder) can run with more layers on GPU.cudaMemGetInfo) and capped at the prior eager footprint, so it can never use more VRAM than before → no new OOM risk.#50 — next-layer predictive prefetch (Vulkan; increment 1)
ExpertRoutePredictor(pure, unit-tested): remembers each layer's previous-token expert selection and prefetches the next GPU MoE layer's likely experts a layer ahead — zero extra GPU compute or critical-path syncs, just bookkeeping + the existing async prefetcher.#75 — model-aware cache sizing (fully unit-tested)
MoeCacheSizing.Plan— pure capacity policy; reports a routing-locality-based recommended size (≈2× active experts/layer, per arXiv:2505.16056) and warns when the VRAM budget forces the cache below it.CLI + good defaults
--no-moe-predict-prefetch,--moe-warmpin N(opt-in),--moe-warmpin-after,--expert-stats; each also settable via itsSHARPI_*env var.--cpu-moe/-ot); it has no equivalent for these cache features (cf. open request Feature Request: Two-tier GPU+RAM expert cache for MoE offload (pluggable eviction policy) ggml-org/llama.cpp#20757), so they keep SharpInference-conventional names.Verification
TreatWarningsAsErrors+ AOT analyzers (.NET 10).This environment has no GPU/CUDA/Vulkan, so the GPU integration of #72 (CUDA decode) and #50 (Vulkan prefetch) is compile-verified only — please validate decode on real hardware:
-gforcing partial offload;SHARPI_EXPERT_STATS=/tmp/stats.txtto inspect hit rates.--no-moe-predict-prefetchto compare).The 16 failing
Vulkan*Testsare pre-existingErrorIncompatibleDriverfailures (no GPU in CI), unrelated to these changes.Not included (need GPU + models)
#73 (Vulkan
MatVecQ5Kshader — corrected from the original one-line assumption), #50 increment 2 (async CUDA upload stream), and the measurement parts of #75/#76.https://claude.ai/code/session_01KLZmyj2L4hBi3jDbRTm4NW
Generated by Claude Code