perf(cuda,engine): CPU-MoE mode + expert-aware TierPlanner for CudaHybridForwardPass — auto placement is structurally wrong for pure-attention MoE on 12 GB #215

@pekkah

Context

Target: single-user decode/prefill on 12 GB VRAM / 64 GB RAM. The GDN-hybrid class (CudaHybridGdnForwardPass) already has the right low-VRAM MoE configuration: all trunk layers on GPU + routed experts on CPU, with a capacity-driven auto-router (CudaHybridGdnForwardPass.cs:886-909: SLRU capacity ratio < 0.5 → CPU MoE, SHARPI_CPU_MOE override). The #100 verification showed why this matters: on a 4070 Ti 12 GB, CPU MoE is 11.8 t/s vs 6.1 t/s GPU-SLRU on the structurally identical qwen3.6-A3B.
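The capacity-driven router described above could be sketched as follows. This is an illustrative Python sketch, not the C# at CudaHybridGdnForwardPass.cs:886-909: the function name, its parameters, and the exact override semantics ("1" forces CPU MoE, "0" forces GPU-SLRU) are assumptions; only the 0.5 ratio threshold and the SHARPI_CPU_MOE variable name come from the description.

```python
import os


def choose_cpu_moe(predicted_slots, total_experts, env=None, threshold=0.5):
    """Return True when routed experts should run on the CPU.

    predicted_slots: how many experts the SLRU cache is expected to hold.
    total_experts:   routed experts across all MoE layers.
    env:             mapping used for the override lookup (defaults to os.environ).
    """
    env = os.environ if env is None else env
    override = env.get("SHARPI_CPU_MOE")
    if override == "1":  # assumed: explicit opt-in to CPU MoE
        return True
    if override == "0":  # assumed: explicit opt-out, keep GPU-SLRU
        return False
    if total_experts <= 0:  # dense model: nothing to route
        return False
    capacity_ratio = predicted_slots / total_experts
    return capacity_ratio < threshold


# Hypothetical numbers: the cache holds about 29% of the experts, so CPU MoE wins.
print(choose_cpu_moe(1765, 6144, env={}))
```

Passing `env` explicitly keeps the decision testable without touching the process environment.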

The pure-attention MoE hybrid (CudaHybridForwardPass — Qwen3-Coder-30B-A3B, OLMoE, qwen2moe-class models) has neither piece:

Problem 1 — TierPlanner books all routed experts per layer, but the runtime never uploads them eagerly

TierPlanner.MeasureLayerBytes (TierPlanner.cs:165-180) includes ffn_gate_exps / ffn_up_exps / ffn_down_exps — i.e. every expert — in each MoE layer's byte cost. But CudaHybridForwardPass uploads routed experts lazily via the SLRU slot manager, not per layer (CudaHybridForwardPass.cs:2805-2806: "Routed-expert weights are uploaded lazily by CudaExpertSlotManager on cache miss"; the shared expert and router stay resident).

Consequence of -g -1 (InferenceEngineLoader.cs:209-242, same pattern in RunCommand.cs): the greedy packer charges ~N×expert bytes per layer, stops packing far too early, and trunk layers whose actual eager footprint is tiny (attn q/k/v/o + norms + router + shexp) land on CPU. Each CPU layer then pays CPU attention plus the per-token pinned-buffer boundary crossings in Forward (CudaHybridForwardPass.cs:890-936), while the "saved" VRAM just becomes SLRU headroom for fewer GPU layers.
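To make the effect concrete, here is a small Python sketch of a greedy packer fed with full-layer bytes versus trunk-only bytes. It is not the code in TierPlanner.cs or InferenceEngineLoader.cs: the helper names, the tensor sizes (in MiB), and the 8 GiB budget are hypothetical; only the ffn_gate_exps / ffn_up_exps / ffn_down_exps names come from the issue.

```python
# Routed-expert tensors that the runtime uploads lazily rather than per layer.
ROUTED_EXPERT_TAGS = ("ffn_gate_exps", "ffn_up_exps", "ffn_down_exps")


def layer_bytes(tensors, include_routed_experts):
    """Sum tensor sizes for one layer, optionally skipping routed experts."""
    total = 0
    for name, size in tensors.items():
        is_routed = any(tag in name for tag in ROUTED_EXPERT_TAGS)
        if is_routed and not include_routed_experts:
            continue
        total += size
    return total


def greedy_gpu_layers(per_layer_costs, budget):
    """Pack layers in order until the next one no longer fits."""
    used = 0
    count = 0
    for cost in per_layer_costs:
        if used + cost > budget:
            break
        used += cost
        count += 1
    return count


# Hypothetical MoE layer, sizes in MiB.
layer = {
    "attn_q.weight": 16, "attn_k.weight": 2, "attn_v.weight": 2,
    "attn_output.weight": 16, "attn_norm.weight": 1, "ffn_norm.weight": 1,
    "ffn_gate_inp.weight": 1,  # router
    "ffn_gate_exps.weight": 120, "ffn_up_exps.weight": 120,
    "ffn_down_exps.weight": 120,
}
n_layers, budget_mib = 48, 8192

phantom = [layer_bytes(layer, True)] * n_layers      # 399 MiB per layer
trunk_only = [layer_bytes(layer, False)] * n_layers  # 39 MiB per layer

print(greedy_gpu_layers(phantom, budget_mib))     # 20 layers fit
print(greedy_gpu_layers(trunk_only, budget_mib))  # all 48 layers fit
```

With these made-up sizes the packer stops at 20 of 48 layers when experts are counted per layer, even though the eagerly uploaded trunk of all 48 layers needs under 2 GiB.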

Problem 2 — no "GPU trunk + CPU routed experts" mode at all

CudaHybridForwardPass only supports a wholesale layer split: GPU layers get GPU attention + SLRU experts, CPU layers get everything on CPU. The config that wins on 12 GB for the GDN class (full GPU trunk, CpuMoeFfn-style routed experts on CPU with the Q8_KS SIMD path) cannot be expressed. SHARPI_CPU_MOE (#93) is only read by CudaHybridGdnForwardPass.

Problem 3 (minor) — KV budget always priced at fp32

TierPlanner.cs:99,124 sizes the KV cache at sizeof(float) per element even when SHARPI_KV_DTYPE=bf16 or q8_0 cuts the real cost to roughly a half or a quarter, so the auto context/layer trade-off is mispriced for narrowed-KV runs.
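A rough sense of the size of the mispricing, as a Python sketch. The function and table names are hypothetical and the model shape is an example; the q8_0 figure assumes a GGML-style block of 32 int8 values plus one fp16 scale (34 bytes per 32 elements), which may not match the layout this project uses.

```python
# Bytes needed to store 32 KV elements for each dtype.
# q8_0 assumes GGML-style blocks: 32 int8 values + one fp16 scale = 34 bytes.
KV_BYTES_PER_32_ELEMENTS = {"f32": 128, "bf16": 64, "q8_0": 34}


def kv_cache_bytes(n_layers, n_ctx, n_kv_heads, head_dim, dtype="f32"):
    """Total K+V cache size in bytes for the given shape and dtype."""
    elements = 2 * n_layers * n_ctx * n_kv_heads * head_dim  # K and V
    return elements * KV_BYTES_PER_32_ELEMENTS[dtype] // 32


# Hypothetical shape: 48 layers, 32k context, 4 KV heads of dimension 128.
for dtype in ("f32", "bf16", "q8_0"):
    size = kv_cache_bytes(48, 32768, 4, 128, dtype)
    print(f"{dtype}: {size / 2**30:.2f} GiB")
```

For this example shape the fp32 price is 6 GiB while bf16 needs 3 GiB and q8_0 about 1.6 GiB, so a planner that always charges fp32 gives up several GiB that could hold layers or expert slots.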

Proposed work

  1. CPU-MoE mode for CudaHybridForwardPass: mirror the GDN class — per layer, GPU computes attention/norms (and shared expert), downloads the normed hidden, CPU runs the routed experts (reuse the existing CPU MoE core / SimdKernels Q8_KS path), uploads the result. Same SHARPI_CPU_MOE=0|1 override + capacity auto-router (predicted SLRU slots vs total experts, 50% threshold) so behavior matches CudaHybridGdnForwardPass.cs:886-909.
  2. TierPlanner MoE-aware accounting: for MoE models, measure layers as trunk-only bytes (attn + norms + router + shexp; exclude ffn_*_exps) and report an explicit expert-cache budget (leftover VRAM after trunk + KV) in LayerPlacement, so the loader can decide trunk split and MoE placement (SLRU vs CPU-MoE) from real numbers instead of phantom expert bytes.
  3. KV-dtype-aware pricing in Plan (pass the resolved KV dtype; fp32 default unchanged).
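Items 2 and 3 together might produce a placement record along these lines. This Python sketch is only a shape proposal: the `Placement` fields, the `plan` signature, and all sizes (in MiB) are hypothetical, and the real LayerPlacement type in the repository may look different.

```python
from dataclasses import dataclass


@dataclass
class Placement:
    gpu_layers: int          # trunk layers placed on the GPU
    expert_cache_bytes: int  # VRAM left over after trunk + KV
    predicted_slots: int     # experts the SLRU cache could hold
    cpu_moe: bool            # True: run routed experts on the CPU


def plan(vram, trunk_per_layer, kv_per_layer, n_layers,
         expert_bytes, experts_per_layer, threshold=0.5):
    """Place trunk layers first, then report what remains for experts."""
    per_layer = trunk_per_layer + kv_per_layer
    gpu_layers = min(n_layers, vram // per_layer)
    leftover = vram - gpu_layers * per_layer

    total_experts = n_layers * experts_per_layer
    if total_experts == 0 or expert_bytes == 0:
        # Dense model: no expert cache and no MoE placement decision.
        return Placement(gpu_layers, leftover, 0, False)

    slots = leftover // expert_bytes
    cpu_moe = (slots / total_experts) < threshold
    return Placement(gpu_layers, leftover, slots, cpu_moe)


# Hypothetical 12 GB profile with 10 GiB usable; kv_per_layer would come
# from the resolved KV dtype (item 3). All sizes in MiB.
p = plan(vram=10240, trunk_per_layer=39, kv_per_layer=64, n_layers=48,
         expert_bytes=3, experts_per_layer=128)
print(p)
```

In this example every trunk layer fits on the GPU, the leftover 5296 MiB holds about 29% of the experts, and the 50% rule selects CPU-MoE. The loader would make the trunk split and the MoE placement from the same record.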

Acceptance

  • Qwen3-Coder-30B-A3B (or OLMoE) with -g -1 on a 12 GB profile places all trunk layers on GPU; placement summary shows the expert budget.
  • A/B decode + prefill vs today's auto placement on the same model/hardware; CPU-MoE mode beats the current wholesale split (expect a gain on the order of 2x, mirroring the GDN-class 11.8 vs 6.1 t/s data point).
  • No placement change for dense models or explicit -g N; existing batched-prefill parity tests stay green.
