Non-GDN CUDA MoE hybrid does whole-layer offload — bring to per-expert SLRU parity with the GDN path

## Background

We have three hybrid MoE forward passes, and they offload experts at **two different granularities**:

- **`CudaHybridGdnForwardPass`** (qwen35moe / hybrid-SSM): per-expert SLRU via `CudaExpertSlotManager.GetOrLoad` (line ~2257), CPU-MoE fallback mode (`SHARPI_CPU_MOE=1`), `ExpertAccessProfiler` stats dump. This is the reference path.
- **`HybridForwardPass`** (Vulkan): per-expert SLRU via `ExpertSlotManager.TryGetCached` + CPU-fallback compute on miss (`GpuMoeFfnCpuFallback`, ~line 1293).
- **`CudaHybridForwardPass`** (non-GDN MoE on CUDA — Mixtral / Qwen3-MoE / Qwen3-Coder-30B-A3B when they don't fit VRAM): **uploads *all* experts of every GPU-tier layer to VRAM resident** (`UploadExpertWeights` loop, line ~297; indexed directly at line ~1348). Offload granularity is the whole layer, decided by `TierPlanner`'s CPU/GPU layer split. The `_expertSlotManager` / `_prefetcher` fields (lines 108–109) are declared and disposed but **never assigned** — dead code.

So for the big non-GDN MoEs, a "GPU layer" must hold its **entire** routed-expert set in VRAM. On a 12 GB card this is exactly the squeeze called out in the `RunCommand.cs:356` comment ("Qwen3-Coder 30B-A3B in 12 GB … silently OOM"). The per-expert streaming infra we already use on the other two paths is simply not wired in here.

## Scope

Mirror the GDN path's design in `CudaHybridForwardPass`:

1. Assign `_expertSlotManager = new CudaExpertSlotManager(...)` (instead of eager `UploadExpertWeights` for all experts) for the GPU-tier MoE layers.
2. Route `GpuMoeFfn` expert access through `GetOrLoad` (cache hit) with the per-token top-k.
3. Add CPU-fallback compute on miss (the path already has `CpuMoeFfn` / `_cpuMoeDownTemp`) — reuse it, à la the Vulkan `GpuMoeFfnCpuFallback`.
4. Wire `MoEPrefetcher` (the field already exists) with the current 1-token same-layer enqueue, leaving room for the predictive prefetch from #50.
5. Free the eager `_gpuWGateExps/_gpuWUpExps/_gpuWDownExps` resident arrays for GPU-tier MoE layers (they become cache-managed).

## Acceptance

- [ ] A non-GDN MoE larger than VRAM (e.g. Qwen3-Coder-30B-A3B or Mixtral 8x7B Q4_K_M) runs on a 12 GB card with **more layers GPU-tiered** than today, because GPU layers no longer need the full expert set resident.
- [ ] Coherent decode; no regression vs the current resident path on models that already fit.
- [ ] `ExpertAccessProfiler` stats available on this path too.

## Related

- Reference implementation: `CudaHybridGdnForwardPass` (SLRU + CPU MoE).
- #54 (Fiddler-style per-dispatch CPU fallback — the policy layer on top of this).
- #50 (predictive prefetch — consumes the prefetcher this issue wires in).
- `docs/moe-expert-offloading-research.md` §2.2, §5 (P0).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Non-GDN CUDA MoE hybrid does whole-layer offload — bring to per-expert SLRU parity with the GDN path #72

Background

Scope

Acceptance

Related

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Non-GDN CUDA MoE hybrid does whole-layer offload — bring to per-expert SLRU parity with the GDN path #72

Description

Background

Scope

Acceptance

Related

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions