Skip to content

MoE expert offloading: research + activation-aware cache, CUDA SLRU parity, predictive prefetch#77

Merged
pekkah merged 14 commits into
masterfrom
claude/moe-expert-offloading-research-Cok6G
May 29, 2026
Merged

MoE expert offloading: research + activation-aware cache, CUDA SLRU parity, predictive prefetch#77
pekkah merged 14 commits into
masterfrom
claude/moe-expert-offloading-research-Cok6G

Conversation

@pekkah

@pekkah pekkah commented May 29, 2026

Copy link
Copy Markdown
Owner

Summary

Researched the state of the art in MoE expert offloading, audited SharpInference's offloading stack against it, and implemented the high-leverage gaps that don't require model-specific tuning. Starts from a written survey/gap-analysis and lands four engine features plus CLI surface with sensible defaults.

Companion issues opened during the research: #72, #73 (corrected), #74, #75, #76 (and existing #50, #54 cross-referenced).

What's here

Research (docs/moe-expert-offloading-research.md) — SOTA survey (KTransformers, MoE-Infinity, Pre-gated MoE, HOBBIT, Local Routing Consistency, MoE-SpeQ, …), a verified audit of our three hybrid forward passes, and a prioritized gap analysis.

#74 — activation-aware expert cache (fully unit-tested)

  • SlruCache: optional frequency-aware eviction (least-accessed probationary victim, never the just-inserted entry → no LFU cold-start trap) and pinning (pinned keys live in protected, never evicted/demoted).
  • ExpertCache gains Pin/Unpin/IsPinned + a frequency passthrough; ExpertAccessProfiler gains GetAccessCount.
  • Both slot managers use profiler-backed frequency-aware eviction and support opt-in warm-pinning.

#72 — non-GDN CUDA hybrid → per-expert SLRU

  • CudaHybridForwardPass previously uploaded every routed expert of every GPU-tier layer resident (whole-layer offload). It now streams experts through CudaExpertSlotManager.GetOrLoad, mirroring the proven CUDA-GDN path, so big non-GDN MoEs (Mixtral / Qwen3-30B-A3B / Qwen3-Coder) can run with more layers on GPU.
  • Capacity is sized from actual free VRAM (cudaMemGetInfo) and capped at the prior eager footprint, so it can never use more VRAM than before → no new OOM risk.

#50 — next-layer predictive prefetch (Vulkan; increment 1)

  • ExpertRoutePredictor (pure, unit-tested): remembers each layer's previous-token expert selection and prefetches the next GPU MoE layer's likely experts a layer ahead — zero extra GPU compute or critical-path syncs, just bookkeeping + the existing async prefetcher.

#75 — model-aware cache sizing (fully unit-tested)

  • MoeCacheSizing.Plan — pure capacity policy; reports a routing-locality-based recommended size (≈2× active experts/layer, per arXiv:2505.16056) and warns when the VRAM budget forces the cache below it.

CLI + good defaults

  • Good performance needs no flags: frequency-aware eviction, VRAM-sized cache, and next-layer predictive prefetch are on by default (prefetch is a no-op when the cache isn't under pressure).
  • Knobs to tune/disable: --no-moe-predict-prefetch, --moe-warmpin N (opt-in), --moe-warmpin-after, --expert-stats; each also settable via its SHARPI_* env var.
  • llama.cpp naming: llama.cpp's MoE args are placement-only (--cpu-moe/-ot); it has no equivalent for these cache features (cf. open request Feature Request: Two-tier GPU+RAM expert cache for MoE offload (pluggable eviction policy) ggml-org/llama.cpp#20757), so they keep SharpInference-conventional names.

Verification

  • Whole solution builds clean under TreatWarningsAsErrors + AOT analyzers (.NET 10).
  • New/changed logic is fully unit-tested where it can be: 18 new tests across the cache, predictor, and sizing policy (Pipeline suite 11 → 29). Full CPU suite green (Core/TurboQuant/Pipeline/Server).
  • CLI flags verified against the built binary (known flags parse; unknown rejected).

⚠️ Pending GPU validation

This environment has no GPU/CUDA/Vulkan, so the GPU integration of #72 (CUDA decode) and #50 (Vulkan prefetch) is compile-verified only — please validate decode on real hardware:

The 16 failing Vulkan*Tests are pre-existing ErrorIncompatibleDriver failures (no GPU in CI), unrelated to these changes.

Not included (need GPU + models)

#73 (Vulkan MatVecQ5K shader — corrected from the original one-line assumption), #50 increment 2 (async CUDA upload stream), and the measurement parts of #75/#76.

https://claude.ai/code/session_01KLZmyj2L4hBi3jDbRTm4NW


Generated by Claude Code

claude added 8 commits May 29, 2026 05:41
Survey current MoE expert-offloading literature (KTransformers, MoE-Infinity,
Pre-gated MoE, HOBBIT, MoE-SpeQ, Local Routing Consistency, etc.) and audit
SharpInference's offloading stack against it.

Key findings:
- CUDA hybrid path never uses the SLRU cache; statically uploads all GPU-tier
  experts to VRAM as F32 (slot manager/prefetcher fields unassigned).
- Prefetching is reactive same-layer 1-token re-enqueue, not predictive.
- TierPlanner placement ignores ExpertAccessProfiler.

Prioritized recommendations: quantized expert cache (HOBBIT-lite), wire SLRU
into CUDA path, next-layer predictive prefetch, profiler-driven placement,
KTransformers-style CPU expert GEMM.
- GDN CUDA path (CudaHybridGdnForwardPass) DOES use per-expert SLRU + CPU MoE
  + profiler dump; the gap is the non-GDN CudaHybridForwardPass (whole-layer
  offload, dead slot-manager fields).
- Experts are cached in native quant; only Q5_K is dequantized to F32 on the
  Vulkan SLRU and the non-GDN CUDA resident path (CudaExpertSlotManager keeps
  Q5_K raw).
- Cross-reference existing issues #50 (predictive prefetch) and #54 (Fiddler
  CPU fallback) instead of re-proposing them.
…+ warm-pinning (#74)

SlruCache gains two optional, default-off refinements:
- Frequency-aware eviction: with a frequencyOf accessor, the probationary
  victim is the least-accessed entry (recency breaks ties) instead of the
  strict LRU tail, and the just-inserted entry is never evicted (avoids the
  LFU cold-start trap). Biases the cache toward MoE routing skew.
- Pinning: pinned keys live in the protected segment and are never evicted
  or demoted while resident.

ExpertCache exposes Pin/Unpin/IsPinned and a frequencyOf passthrough.
ExpertAccessProfiler gains GetAccessCount(layer, expert).

Both ExpertSlotManager (Vulkan) and CudaExpertSlotManager now:
- construct the cache with profiler-backed frequency-aware eviction (always on
  — degrades to LRU when access counts are equal);
- support opt-in warm-pinning of the hottest resident experts via
  SHARPI_MOE_WARMPIN=N (per layer) after SHARPI_MOE_WARMPIN_AFTER accesses,
  bounded to half the cache. Default off → no behavioral change.

Adds 7 unit tests (frequency eviction, cold-start protection, pinning,
unpin, non-resident pin, ExpertCache pin/freq). All 18 Pipeline tests pass.
CudaHybridForwardPass previously uploaded every routed expert of every
GPU-tier MoE layer resident in VRAM (whole-layer offload), forcing a GPU
layer to hold its entire expert set. Mixtral / Qwen3-30B-A3B / Qwen3-Coder
bigger than VRAM could therefore only offload at whole-layer granularity.

This wires in the CudaExpertSlotManager SLRU cache (already proven on the
CUDA-GDN path), mirroring CudaHybridGdnForwardPass.GpuMoeFfn:
- routed experts are fetched via GetOrLoad (cached, or synchronous upload +
  cache on miss); the router and shared expert stay resident;
- slot capacity is sized from actual free VRAM (cudaMemGetInfo via
  FreeVramBytes) after attention/KV/scratch upload, capped at the full
  GPU-layer expert count — so capacity never exceeds what the old eager path
  used (which TierPlanner already verified fits), i.e. no new OOM risk; the
  budget term only shrinks capacity when VRAM is tight (e.g. user forced
  extra GPU layers via -g), enabling streaming instead of an OOM;
- frequency-aware eviction + opt-in warm-pinning come for free from #74;
- SHARPI_EXPERT_STATS dumps SLRU stats on dispose (parity with GDN).

Removes the eager expert arrays, UploadExpertWeights/UploadExpertWeight, and
the mistyped dead _prefetcher/_gpuFallbackContrib/_gpuPinnedNorm fields (the
former was typed to the Vulkan ExpertSlotManager).

Compile-verified under TreatWarningsAsErrors + AOT analyzers; full CPU test
suite green. GPU decode validation pending (no CUDA device in CI env).
… increment 1)

Adds ExpertRoutePredictor: a training-free, temporal-locality predictor that
remembers each MoE layer's previous-token expert selection. MoE routing has
strong cross-token locality at a fixed layer (the premise behind the SLRU
cache), so last-token selections cheaply predict the next token's.

Wired into HybridForwardPass (Vulkan — the only path with an async
MoEPrefetcher): while layer L computes, it prefetches layer L+1's predicted
experts a full layer ahead, hiding PCIe transfer behind compute. Opt-in via
SHARPI_MOE_PREDICT_PREFETCH=1 (default off → no behavior change); predictions
are best-effort hints that never affect output. Predictor resets on ResetCache.

Unlike a cross-layer GPU router run, this adds zero extra GPU compute or
critical-path syncs — just CPU bookkeeping + the existing EnqueuePrefetch.

The predictor is pure CPU and fully unit-tested (5 new tests). CUDA paths use
synchronous GetOrLoad with no async prefetcher yet, so they don't benefit —
async CUDA upload streams remain future work for #50.

All 23 Pipeline tests pass; full CPU suite green. Vulkan integration
compile-verified (no GPU in CI env).
Extracts the #72 capacity heuristic into MoeCacheSizing.Plan — a pure,
unit-tested policy. Capacity is bounded by [1, totalGpuExperts] and never
exceeds the VRAM budget (no OOM the budget wouldn't already imply). Separately
it reports a RecommendedSlots target from the routing-locality finding of
'Not All Models Suit Expert Offloading' (arXiv:2505.16056): ~2× active experts
per layer covers a token segment well.

CudaHybridForwardPass now uses the helper and logs an advisory when the VRAM
budget forces the cache below the locality sweet spot (suggesting fewer GPU
layers or more VRAM) instead of silently underperforming with an arbitrary
floor of 64.

6 new unit tests (ample/tight VRAM, recommended sizing, caps). All 29 Pipeline
tests pass; full CPU suite green.
Surface the env-var knobs added in #74/#50/#72 as first-class CLI options on
the run command (mirroring the --min-batch-blas 'also settable via env'
pattern). Each flag sets the corresponding SHARPI_* env var early in Execute,
before any forward pass is built, so the engine's existing env reads pick it
up and shell-env use still works:

  --moe-warmpin <N>        → SHARPI_MOE_WARMPIN        (warm-pin top-N hot experts/layer)
  --moe-warmpin-after <N>  → SHARPI_MOE_WARMPIN_AFTER  (warmup threshold)
  --moe-predict-prefetch   → SHARPI_MOE_PREDICT_PREFETCH (Vulkan next-layer prefetch)
  --expert-stats <path>    → SHARPI_EXPERT_STATS       (dump SLRU hit-rate stats)

Verified: known flags parse (reach model-not-found, exit 1) while an unknown
flag is rejected (exit 255). Full solution builds clean; CPU test suite green.
…e/disable

Address two points:
1. llama.cpp naming: llama.cpp's MoE args are placement-only (--cpu-moe,
   --n-cpu-moe, -ot); it has no equivalent for SLRU warm-pinning, predictive
   prefetch, or cache stats (cf. open llama.cpp request ggml-org/llama.cpp#20757),
   so these keep SharpInference-conventional names — nothing to match.
2. Good out-of-box defaults (no flags needed for good performance):
   - frequency-aware SLRU eviction: on (already)
   - VRAM-sized expert cache: on (already)
   - next-layer predictive prefetch: now ON BY DEFAULT (it only makes the
     already-on background prefetch smarter and is a no-op when the cache isn't
     under pressure). Disable via --no-moe-predict-prefetch / SHARPI_MOE_PREDICT_PREFETCH=0.
   - warm-pinning stays opt-in (--moe-warmpin N): redundant with frequency-aware
     eviction for most workloads, so not forced on.

--moe-warmpin is now int? so an explicit 0 forces it off; the removed positive
--moe-predict-prefetch is replaced by the --no- disable.

Verified by running the built CLI: updated flags parse (exit 1, model-not-found),
removed positive flag rejected (exit 255). Full solution builds clean; CPU suite
green (16 failures are pre-existing no-GPU Vulkan tests).

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces significant enhancements to Mixture-of-Experts (MoE) expert offloading, including frequency-aware SLRU cache eviction, next-layer predictive prefetching, and opt-in warm-pinning of hot experts. Crucially, it transitions the non-GDN CUDA hybrid path to use the SLRU-streamed expert cache instead of eager whole-layer loading, bringing it to parity with other backends. The review feedback highlights a potential VRAM leak in CudaHybridForwardPass.Dispose if writing stats throws an exception, a stale comment regarding removed eager weights, and a logic flaw in the ExpertCache_FrequencyAware_EvictsLeastAccessedExpert unit test where a low cache capacity prevents the frequency-aware eviction logic from being properly tested.

Comment thread src/SharpInference.Engine/CudaHybridForwardPass.cs
Comment thread src/SharpInference.Engine/CudaHybridForwardPass.cs Outdated
Comment thread tests/SharpInference.Tests.Pipeline/PipelineTests.cs Outdated
claude and others added 6 commits May 29, 2026 08:09
- CudaHybridForwardPass.Dispose: wrap the SHARPI_EXPERT_STATS write in try/catch
  so a StreamWriter failure (bad path/permissions) can't skip
  _expertSlotManager.Dispose() and leak GPU tensors.
- Remove stale eager-expert comment left dangling above the attention-bias fields
  after the SLRU migration.
- ExpertCache_FrequencyAware test: capacity 4 -> 8 so probCap=2 actually exercises
  frequency-aware eviction under overflow (with probCap=1 the head-exclusion alone
  forced the eviction and 'hot' was in fact evicted early; the old assertion passed
  coincidentally). Now asserts 'cold-a' evicted and 'hot' retained.
…nt layers

Two issues uncovered by running the PR's new CLI surface against real GPU
MoE workloads:

- --moe-warmpin-after declared `[DefaultValue(0)]` (Int32) on a `long`
  property. Spectre.Console.Cli's Int64Converter rejected the int literal,
  so every command invocation failed with "Int64Converter cannot convert
  from System.Int32." before any flag was actually parsed. Use 0L.

- ExpertAccessProfiler.PrintStats iterated all `_numLayers` rows, but
  CPU-resident layers under CUDA hybrid offload (and any non-MoE layer
  the profiler is sized for) never report into it. Their counts are all
  zero, GetLayerHitRate returns 0.0, and GetTopExperts(layer, 3) sorts a
  count-zero array with the unstable Array.Sort, producing arbitrary IDs
  like "[0, 93, 92]" identically on every empty row. Replace with a clear
  "(no GPU SLRU accesses recorded)" line so the stats file no longer
  invites the reader to chase a phantom signal.

Verified on RTX 4070 Ti:
  - Vulkan all-GPU MoE (OLMoE): coherent, prefetch on vs off identical
    output at temp 0
  - CUDA hybrid non-GDN MoE (Qwen3-Coder-30B-A3B): SLRU streams
    3043/3712 experts, 87.1 % overall hit rate, 23-25 t/s decode
  - CUDA hybrid GDN+MTP (Qwen3.6-27B-MTP): unaffected, 100 % MTP accept
  - All 433 unit tests still pass

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Re-ran scripts/bench-moe-rerun.ps1 (3 repeats, 80 tokens, --temp 0,
quicksort prompt) on the four Qwen3-Coder rows after the per-expert
SLRU streaming + predictive-prefetch changes landed.

  Backend                Prefill (was)   Decode (was)
  CPU                    13.3 (15.1)     21.1 (21.2)   warm-cache noise
  CPU --tq               13.7 (12.0)     21.0 (21.1)   warm-cache noise
  Vulkan -g -1 hybrid     1.1 (1.0)       5.3 (5.8)   prefetch a no-op
                                                       until cache pressure
  CUDA -g -1 hybrid      11.5 (13.9)     24.4 (22.7)   #72 SLRU streaming
                                                       trades cold prefill
                                                       for warmer decode

Notable: the CUDA-hybrid row now flags the per-expert SLRU streaming
(3043/3712 slot capacity, ≈ 87 % observed hit rate) as the source of the
+7.5 % decode lift, and is explicit about the prefill regression vs the
prior eager whole-layer upload. The Vulkan-hybrid row credits the
next-layer predictive prefetch and notes the disable flag.

CPU and CPU `--tq` decode are flat within run-to-run jitter; prefill
varies by ~2 t/s across cold/warm mmap pages, so the previous numbers
were on the high end of the same distribution.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…sweep

Single-pass bench-all on the PR #77 branch turned up real improvements
on the GDN-MoE and non-GDN-MoE-on-tight-VRAM rows.

  Row                              Prefill (was)   Decode (was)
  Llama-4 Scout CPU                 2.1 (1.9)       4.3 (3.9)
  Llama-4 Scout CUDA hybrid         1.2 (0.9)       2.6 (2.1)   PR #77 SLRU
                                                                 streaming
  Qwen3.6-35B-A3B CPU               6.7 (4.3)       8.5 (7.8)
  Qwen3.6-35B-A3B CUDA hybrid      14.7 (11.2)     23.2 (23.8)

Scout CUDA-hybrid is the clearest #72 win — non-GDN MoE on a 12 GB card
where the prior eager whole-layer expert upload had to spill aggressively;
per-expert SLRU streaming gives +24 % decode and +33 % prefill. CPU-only
still wins outright on Scout, but the gap narrowed enough to update the
note.

Other deltas (qwen36 CPU prefill +56 %, qwen36-cuda-hybrid prefill +31 %)
predate this PR and just reflect the README rotting since older
optimisations landed. Dense rows (SmolLM2, Qwen3-8B) untouched —
single-run jitter only.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
3-repeat sweep over moe-cpu / moe-cpu-tq / moe-vulkan-hybrid /
moe-cuda-hybrid with mean ± stddev and a hardcoded baseline so deltas
vs the README's published numbers are visible at a glance. Used to
sanity-check the post-PR #77 numbers on the Qwen3-Coder-30B-A3B rows.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Five reviewer agents flagged 12 issues gemini-code-assist missed. Applied
all 12 plus three new tests, kept every change additive (no behaviour
regression except the documented one below).

Critical:
- MoEPrefetcher.RunAsync now catches Preload faults per batch instead of
  killing the worker on the first transient. Without this, one stale
  predicted expert / cudaMalloc spike silently drops throughput to "no
  prefetch" for the rest of the process with nothing to grep for.
- MoeCacheSizing.Plan exposes a Status enum (Ok / BelowRecommended /
  BudgetExhausted / UnknownExpertSize / Empty) so callers can distinguish
  "clamped to 1 because VRAM is too tight" from "ok". CudaHybridForwardPass
  now emits WARNING with the actual slot/recommendation ratio in the
  BelowRecommended path and louder messages for the two edge cases.
- docs/moe-expert-offloading-research.md got a "Status post PR #77"
  callout listing which P0 / P1 / gap-analysis items this PR closes, so
  the audit can't be read as a still-open punch list of work that has
  already shipped.

High:
- HybridForwardPass.Dispose now writes SHARPI_EXPERT_STATS (parity with
  the two CUDA paths) so the CLI flag works on Vulkan too. The README's
  Server section now spells out which knobs are env-var-only.
- ExpertSlotManager.TryGetCached now records hits/misses on the profiler
  and Preload calls MaybeWarmPin — both are no-ops on the CUDA path which
  uses GetOrLoad, but turn the Vulkan path's frequency-aware eviction
  and warm-pinning from "dead code" into functional features (the
  previous Vulkan profiler stats file was always all-zeros).
- PerExpertBytes() now rounds each tensor's bytes up to the buffer
  pool's allocation bucket via the new public CudaBackend.RoundUpAllocBytes,
  so the planner sizes from real per-slot VRAM, not raw bytes. Caps the
  Qwen3-Coder-30B-A3B CUDA hybrid cache at 2220 / 3712 slots (was 3043,
  which would have hit cudaMalloc failure once a working set filled past
  ~2220 unique experts). README row updated: decode 24.4 → 22.2 (the +7.5%
  decode win on the prior commit was relative to a budget that couldn't
  actually be filled); prefill 11.5 → 10.6 — still a +7.7% decode win
  vs the eager pre-PR baseline (20.6 → 22.2) with the latent crash gone.
- HybridForwardPass.GpuMoeFfn reorders TryGetCached before the prefetch
  enqueue, so a current-layer probationary slot can't be evicted by the
  worker mid-loop. Spurious-miss perf-only bug; reorder is cheap.
- MaybeWarmPin (both backends) sorts layers by GetLayerAccessCount before
  picking pins, so a tight pin budget protects the hottest layers instead
  of biasing to layer-index order (matters for hybrid GDN+MoE models that
  cluster MoE FFN at high indices).
- WarmPinConfig parsers log on malformed values instead of silently
  treating them as defaults, and the >=0 / >0 asymmetry between ParseInt
  / ParseLong is made explicit via an allowZero parameter.
- HybridForwardPass SHARPI_MOE_PREDICT_PREFETCH parser is now
  case-insensitive (accepts 1/0, true/false, on/off, yes/no, enabled/
  disabled) and logs on unrecognized values.

Tests:
- 4 new tests pin the MoeCacheSizing status path (Ok / BelowRecommended /
  BudgetExhausted / UnknownExpertSize). Pipeline suite 29 → 33.
- All 433 unit tests still pass (Core 57, Pipeline 33, Server 45,
  TurboQuant 41, ForwardPass 261).
- Sanity checks on the 4070 Ti: Vulkan hybrid Qwen3-Coder writes a
  populated stats file (68.8 % hit rate, was previously empty); CUDA
  hybrid Qwen3-Coder decodes coherently and stable across 3 runs at
  prefill 10.6 ± 0.25, decode 22.2 ± 0.12.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@pekkah pekkah merged commit 6617e37 into master May 29, 2026
1 check passed
@pekkah pekkah deleted the claude/moe-expert-offloading-research-Cok6G branch May 29, 2026 15:17
pekkah added a commit that referenced this pull request May 30, 2026
Builds on the SLRU frequency-aware eviction + warm-pin pass shipped in #77
to satisfy the remaining #74 acceptance criteria:

- Auto-enable warm-pinning at NumActiveExperts per layer when the SLRU
  slot capacity is smaller than the total expert count. The env var
  SHARPI_MOE_WARMPIN still overrides; the auto-rule fires only when
  warm-pinning can actually help (cache < total experts).
- Profiler now records a snapshot of hit/miss totals + the pinned
  (layer, expertId) set when the slot manager warms. PrintStats splits
  pre-warm and post-warm hit rates so callers can see whether pinning
  actually improved residency, and lists the pinned experts per layer.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants