Non-GDN CUDA MoE hybrid does whole-layer offload — bring to per-expert SLRU parity with the GDN path #72

@pekkah

Background

We have three hybrid MoE forward passes, and they offload experts at two different granularities:

  • CudaHybridGdnForwardPass (qwen35moe / hybrid-SSM): per-expert SLRU via CudaExpertSlotManager.GetOrLoad (line ~2257), CPU-MoE fallback mode (SHARPI_CPU_MOE=1), ExpertAccessProfiler stats dump. This is the reference path.
  • HybridForwardPass (Vulkan): per-expert SLRU via ExpertSlotManager.TryGetCached + CPU-fallback compute on miss (GpuMoeFfnCpuFallback, ~line 1293).
  • CudaHybridForwardPass (non-GDN MoE on CUDA, i.e. Mixtral / Qwen3-MoE / Qwen3-Coder-30B-A3B when they don't fit VRAM): uploads all experts of every GPU-tier layer to VRAM and keeps them resident (UploadExpertWeights loop, line ~297; indexed directly at line ~1348). Offload granularity is the whole layer, decided by TierPlanner's CPU/GPU layer split. The _expertSlotManager / _prefetcher fields (lines 108–109) are declared and disposed but never assigned, so they are dead code.
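
For readers unfamiliar with the term, the per-expert SLRU (segmented LRU) policy referred to above can be sketched as follows. This is a minimal illustration in Python, not the CudaExpertSlotManager implementation: the class name `SlruCache`, the two-segment sizes, and the `load` callback are all hypothetical, and the values simply stand in for VRAM slots.

```python
from collections import OrderedDict


class SlruCache:
    """Segmented LRU keyed by (layer, expert). Hypothetical sketch only."""

    def __init__(self, probation_slots, protected_slots):
        self.probation_slots = probation_slots
        self.protected_slots = protected_slots
        self.probation = OrderedDict()  # entries seen once; evicted first
        self.protected = OrderedDict()  # entries that were hit again

    def try_get(self, key):
        """Return the cached value, or None on a miss (no load is triggered)."""
        if key in self.protected:
            self.protected.move_to_end(key)  # refresh recency
            return self.protected[key]
        if key in self.probation:
            value = self.probation.pop(key)
            self._promote(key, value)  # second access: move to protected
            return value
        return None

    def _promote(self, key, value):
        self.protected[key] = value
        if len(self.protected) > self.protected_slots:
            # Demote the least recently used protected entry instead of dropping it.
            old_key, old_value = self.protected.popitem(last=False)
            self._insert_probation(old_key, old_value)

    def _insert_probation(self, key, value):
        self.probation[key] = value
        if len(self.probation) > self.probation_slots:
            self.probation.popitem(last=False)  # evict the coldest entry

    def get_or_load(self, key, load):
        """Return (value, was_hit); on a miss, call load(key) and cache the result."""
        value = self.try_get(key)
        if value is not None:
            return value, True
        value = load(key)
        self._insert_probation(key, value)
        return value, False
```

The point of the two segments is that an expert touched once by a single token cannot push out experts that are reused across tokens; only entries in the probation segment are evicted directly.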

So for the big non-GDN MoEs, a "GPU layer" must hold its entire routed-expert set in VRAM. On a 12 GB card this is exactly the squeeze called out in the RunCommand.cs:356 comment ("Qwen3-Coder 30B-A3B in 12 GB … silently OOM"). The per-expert streaming infra we already use on the other two paths is simply not wired in here.

Scope

Mirror the GDN path's design in CudaHybridForwardPass:

  1. Assign _expertSlotManager = new CudaExpertSlotManager(...) (instead of eager UploadExpertWeights for all experts) for the GPU-tier MoE layers.
  2. Route GpuMoeFfn expert access through GetOrLoad for each token's top-k experts (the cache-hit path).
  3. Add CPU-fallback compute on miss (the path already has CpuMoeFfn / _cpuMoeDownTemp) — reuse it, à la the Vulkan GpuMoeFfnCpuFallback.
  4. Wire MoEPrefetcher (the field already exists) with the current 1-token same-layer enqueue, leaving room for the predictive prefetch from #50, "Predictive expert prefetch for MoE (pre-gated routing / PreScope-style)".
  5. Free the eager _gpuWGateExps/_gpuWUpExps/_gpuWDownExps resident arrays for GPU-tier MoE layers (they become cache-managed).
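
As a rough sketch of how steps 2 to 4 fit together for one token, assuming a lookup that reports a miss rather than blocking on an upload: experts found in the cache run from their resident slot, misses run on the CPU copy, and each miss is queued for the same layer so a later token can hit. Everything here is hypothetical and in Python for brevity (`moe_ffn_token`, the `resident` and `cpu_experts` dicts, and the toy scaling experts); the real path would go through GetOrLoad, CpuMoeFfn and MoEPrefetcher, whose signatures are not shown in this issue.

```python
from collections import deque


def moe_ffn_token(layer, x, topk, resident, cpu_experts, prefetch_queue):
    """Combine one token's top-k experts, falling back to CPU on a cache miss.

    topk           -- list of (expert_id, gate_weight) chosen by the router
    resident       -- dict (layer, expert_id) -> callable; stands in for cached VRAM slots
    cpu_experts    -- dict (layer, expert_id) -> callable; host copy, always available
    prefetch_queue -- deque of keys to upload later (assumed same-layer enqueue)
    """
    out = [0.0] * len(x)
    hits = misses = 0
    for expert_id, gate in topk:
        key = (layer, expert_id)
        expert = resident.get(key)
        if expert is not None:
            hits += 1
        else:
            misses += 1
            expert = cpu_experts[key]       # compute on CPU instead of stalling
            if key not in prefetch_queue:
                prefetch_queue.append(key)  # make it resident for later tokens
        y = expert(x)
        for i, v in enumerate(y):
            out[i] += gate * v              # gate-weighted sum of expert outputs
    return out, hits, misses


def make_expert(scale):
    """Toy expert that scales its input; a real expert is a gated FFN."""
    return lambda x: [scale * v for v in x]
```

In this shape the output is the same whether an expert is resident or not; residency only changes where the compute happens, which is what makes the "no regression vs the current resident path" check in the acceptance criteria straightforward to test.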

Acceptance

  • A non-GDN MoE larger than VRAM (e.g. Qwen3-Coder-30B-A3B or Mixtral 8x7B Q4_K_M) runs on a 12 GB card with more layers GPU-tiered than today, because GPU layers no longer need the full expert set resident.
  • Coherent decode; no regression vs the current resident path on models that already fit.
  • ExpertAccessProfiler stats available on this path too.
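
To make the last criterion concrete, the kind of per-expert statistics a profiler on this path could expose is sketched below. This is an assumption about what is useful to report (overall hit rate and the most frequently routed experts), not a description of ExpertAccessProfiler's actual output; the class and method names are hypothetical.

```python
from collections import Counter


class AccessProfile:
    """Counts expert accesses and cache hits per (layer, expert). Hypothetical sketch."""

    def __init__(self):
        self.accesses = Counter()
        self.hits = Counter()

    def record(self, layer, expert, hit):
        key = (layer, expert)
        self.accesses[key] += 1
        if hit:
            self.hits[key] += 1

    def hit_rate(self):
        """Fraction of all expert accesses served from the cache."""
        total = sum(self.accesses.values())
        return sum(self.hits.values()) / total if total else 0.0

    def hottest(self, n):
        """The n most frequently accessed (layer, expert) pairs with their counts."""
        return self.accesses.most_common(n)
```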
