Context
Target: single-user decode/prefill on 12 GB VRAM / 64 GB RAM. The GDN-hybrid class (CudaHybridGdnForwardPass) already has the right low-VRAM MoE configuration: all trunk layers on GPU + routed experts on CPU, with a capacity-driven auto-router (CudaHybridGdnForwardPass.cs:886-909: SLRU capacity ratio < 0.5 → CPU MoE, SHARPI_CPU_MOE override). The #100 verification showed why this matters: on a 4070 Ti 12 GB, CPU MoE is 11.8 t/s vs 6.1 t/s GPU-SLRU on the structurally identical qwen3.6-A3B.
The pure-attention MoE hybrid (CudaHybridForwardPass — Qwen3-Coder-30B-A3B, OLMoE, qwen2moe-class models) has neither piece:
Problem 1 — TierPlanner books all routed experts per layer, but the runtime never uploads them eagerly
TierPlanner.MeasureLayerBytes (TierPlanner.cs:165-180) includes ffn_gate_exps / ffn_up_exps / ffn_down_exps — i.e. every expert — in each MoE layer's byte cost. But CudaHybridForwardPass uploads routed experts lazily via the SLRU slot manager, not per layer (CudaHybridForwardPass.cs:2805-2806: "Routed-expert weights are uploaded lazily by CudaExpertSlotManager on cache miss"; the shared expert and router stay resident).
Consequence of -g -1 (InferenceEngineLoader.cs:209-242, same pattern in RunCommand.cs): the greedy packer charges ~N×expert bytes per layer, stops packing far too early, and trunk layers whose actual eager footprint is tiny (attn q/k/v/o + norms + router + shexp) land on CPU. Each CPU layer then pays CPU attention plus the per-token pinned-buffer boundary crossings in Forward (CudaHybridForwardPass.cs:890-936), while the "saved" VRAM just becomes SLRU headroom for fewer GPU layers.
Problem 2 — no "GPU trunk + CPU routed experts" mode at all
CudaHybridForwardPass only supports a wholesale layer split: GPU layers get GPU attention + SLRU experts, CPU layers get everything on CPU. The config that wins on 12 GB for the GDN class (full GPU trunk, CpuMoeFfn-style routed experts on CPU with the Q8_KS SIMD path) cannot be expressed. SHARPI_CPU_MOE (#93) is only read by CudaHybridGdnForwardPass.
Problem 3 (minor) — KV budget always priced at fp32
TierPlanner.cs:99,124 size KV at sizeof(float) per element even when SHARPI_KV_DTYPE=bf16/q8_0 halves/quarters the real cost, so the auto context/layer trade-off is mispriced for narrowed-KV runs.
Proposed work
- CPU-MoE mode for
CudaHybridForwardPass: mirror the GDN class — per layer, GPU computes attention/norms (and shared expert), downloads the normed hidden, CPU runs the routed experts (reuse the existing CPU MoE core / SimdKernels Q8_KS path), uploads the result. Same SHARPI_CPU_MOE=0|1 override + capacity auto-router (predicted SLRU slots vs total experts, 50% threshold) so behavior matches CudaHybridGdnForwardPass.cs:886-909.
- TierPlanner MoE-aware accounting: for MoE models, measure layers as trunk-only bytes (attn + norms + router + shexp; exclude
ffn_*_exps) and report an explicit expert-cache budget (leftover VRAM after trunk + KV) in LayerPlacement, so the loader can decide trunk split and MoE placement (SLRU vs CPU-MoE) from real numbers instead of phantom expert bytes.
- KV-dtype-aware pricing in
Plan (pass the resolved KV dtype; fp32 default unchanged).
Acceptance
References
Context
Target: single-user decode/prefill on 12 GB VRAM / 64 GB RAM. The GDN-hybrid class (
CudaHybridGdnForwardPass) already has the right low-VRAM MoE configuration: all trunk layers on GPU + routed experts on CPU, with a capacity-driven auto-router (CudaHybridGdnForwardPass.cs:886-909: SLRU capacity ratio < 0.5 → CPU MoE,SHARPI_CPU_MOEoverride). The #100 verification showed why this matters: on a 4070 Ti 12 GB, CPU MoE is 11.8 t/s vs 6.1 t/s GPU-SLRU on the structurally identical qwen3.6-A3B.The pure-attention MoE hybrid (
CudaHybridForwardPass— Qwen3-Coder-30B-A3B, OLMoE, qwen2moe-class models) has neither piece:Problem 1 — TierPlanner books all routed experts per layer, but the runtime never uploads them eagerly
TierPlanner.MeasureLayerBytes(TierPlanner.cs:165-180) includesffn_gate_exps/ffn_up_exps/ffn_down_exps— i.e. every expert — in each MoE layer's byte cost. ButCudaHybridForwardPassuploads routed experts lazily via the SLRU slot manager, not per layer (CudaHybridForwardPass.cs:2805-2806: "Routed-expert weights are uploaded lazily by CudaExpertSlotManager on cache miss"; the shared expert and router stay resident).Consequence of
-g -1(InferenceEngineLoader.cs:209-242, same pattern inRunCommand.cs): the greedy packer charges ~N×expert bytes per layer, stops packing far too early, and trunk layers whose actual eager footprint is tiny (attn q/k/v/o + norms + router + shexp) land on CPU. Each CPU layer then pays CPU attention plus the per-token pinned-buffer boundary crossings inForward(CudaHybridForwardPass.cs:890-936), while the "saved" VRAM just becomes SLRU headroom for fewer GPU layers.Problem 2 — no "GPU trunk + CPU routed experts" mode at all
CudaHybridForwardPassonly supports a wholesale layer split: GPU layers get GPU attention + SLRU experts, CPU layers get everything on CPU. The config that wins on 12 GB for the GDN class (full GPU trunk,CpuMoeFfn-style routed experts on CPU with the Q8_KS SIMD path) cannot be expressed.SHARPI_CPU_MOE(#93) is only read byCudaHybridGdnForwardPass.Problem 3 (minor) — KV budget always priced at fp32
TierPlanner.cs:99,124size KV atsizeof(float)per element even whenSHARPI_KV_DTYPE=bf16/q8_0halves/quarters the real cost, so the auto context/layer trade-off is mispriced for narrowed-KV runs.Proposed work
CudaHybridForwardPass: mirror the GDN class — per layer, GPU computes attention/norms (and shared expert), downloads the normed hidden, CPU runs the routed experts (reuse the existing CPU MoE core /SimdKernelsQ8_KS path), uploads the result. SameSHARPI_CPU_MOE=0|1override + capacity auto-router (predicted SLRU slots vs total experts, 50% threshold) so behavior matchesCudaHybridGdnForwardPass.cs:886-909.ffn_*_exps) and report an explicit expert-cache budget (leftover VRAM after trunk + KV) inLayerPlacement, so the loader can decide trunk split and MoE placement (SLRU vs CPU-MoE) from real numbers instead of phantom expert bytes.Plan(pass the resolved KV dtype; fp32 default unchanged).Acceptance
-g -1on a 12 GB profile places all trunk layers on GPU; placement summary shows the expert budget.-g N; existing batched-prefill parity tests stay green.References
SHARPI_CPU_MOEsurface), perf(cuda): batched-trunk prefill for the pure-attention layer-split hybrid (CudaHybridForwardPass / Qwen3-Coder-30B) #123 (batched prefill on this class).