perf(cuda,engine): CPU-MoE mode + expert-aware TierPlanner for CudaHybridForwardPass — auto placement is structurally wrong for pure-attention MoE on 12 GB

## Context

Target: single-user decode/prefill on 12 GB VRAM / 64 GB RAM. The GDN-hybrid class (`CudaHybridGdnForwardPass`) already has the right low-VRAM MoE configuration: **all trunk layers on GPU + routed experts on CPU**, with a capacity-driven auto-router (`CudaHybridGdnForwardPass.cs:886-909`: SLRU capacity ratio &lt; 0.5 → CPU MoE, `SHARPI_CPU_MOE` override). The #100 verification showed why this matters: on a 4070 Ti 12 GB, CPU MoE is **11.8 t/s vs 6.1 t/s** GPU-SLRU on the structurally identical qwen3.6-A3B.

The **pure-attention MoE hybrid** (`CudaHybridForwardPass` — Qwen3-Coder-30B-A3B, OLMoE, qwen2moe-class models) has neither piece:

## Problem 1 — TierPlanner books all routed experts per layer, but the runtime never uploads them eagerly

`TierPlanner.MeasureLayerBytes` (`TierPlanner.cs:165-180`) includes `ffn_gate_exps` / `ffn_up_exps` / `ffn_down_exps` — i.e. **every expert** — in each MoE layer's byte cost. But `CudaHybridForwardPass` uploads routed experts **lazily via the SLRU slot manager**, not per layer (`CudaHybridForwardPass.cs:2805-2806`: "Routed-expert weights are uploaded lazily by CudaExpertSlotManager on cache miss"; the shared expert and router stay resident).

Consequence of `-g -1` (`InferenceEngineLoader.cs:209-242`, same pattern in `RunCommand.cs`): the greedy packer charges ~N×expert bytes per layer, stops packing far too early, and trunk layers whose *actual* eager footprint is tiny (attn q/k/v/o + norms + router + shexp) land on CPU. Each CPU layer then pays CPU attention plus the per-token pinned-buffer boundary crossings in `Forward` (`CudaHybridForwardPass.cs:890-936`), while the "saved" VRAM just becomes SLRU headroom for fewer GPU layers.

## Problem 2 — no "GPU trunk + CPU routed experts" mode at all

`CudaHybridForwardPass` only supports a wholesale layer split: GPU layers get GPU attention + SLRU experts, CPU layers get everything on CPU. The config that wins on 12 GB for the GDN class (full GPU trunk, `CpuMoeFfn`-style routed experts on CPU with the Q8_KS SIMD path) cannot be expressed. `SHARPI_CPU_MOE` (#93) is only read by `CudaHybridGdnForwardPass`.

## Problem 3 (minor) — KV budget always priced at fp32

`TierPlanner.cs:99,124` size KV at `sizeof(float)` per element even when `SHARPI_KV_DTYPE=bf16`/`q8_0` halves/quarters the real cost, so the auto context/layer trade-off is mispriced for narrowed-KV runs.

## Proposed work

1. **CPU-MoE mode for `CudaHybridForwardPass`**: mirror the GDN class — per layer, GPU computes attention/norms (and shared expert), downloads the normed hidden, CPU runs the routed experts (reuse the existing CPU MoE core / `SimdKernels` Q8_KS path), uploads the result. Same `SHARPI_CPU_MOE=0|1` override + capacity auto-router (predicted SLRU slots vs total experts, 50% threshold) so behavior matches `CudaHybridGdnForwardPass.cs:886-909`.
2. **TierPlanner MoE-aware accounting**: for MoE models, measure layers as *trunk-only* bytes (attn + norms + router + shexp; exclude `ffn_*_exps`) and report an explicit expert-cache budget (leftover VRAM after trunk + KV) in `LayerPlacement`, so the loader can decide trunk split and MoE placement (SLRU vs CPU-MoE) from real numbers instead of phantom expert bytes.
3. **KV-dtype-aware pricing** in `Plan` (pass the resolved KV dtype; fp32 default unchanged).

## Acceptance

- [ ] Qwen3-Coder-30B-A3B (or OLMoE) with `-g -1` on a 12 GB profile places **all** trunk layers on GPU; placement summary shows the expert budget.
- [ ] A/B decode + prefill vs today's auto placement on the same model/hardware; CPU-MoE mode beats the current wholesale split (expect a multiple, mirroring the GDN-class 11.8-vs-6.1 t/s data point).
- [ ] No placement change for dense models or explicit `-g N`; existing batched-prefill parity tests stay green.

## References

- #100 (capacity threshold + CPU-vs-SLRU bench data), #104 (TierPlanner places by size only — this is the MoE half), #93 (`SHARPI_CPU_MOE` surface), #123 (batched prefill on this class).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(cuda,engine): CPU-MoE mode + expert-aware TierPlanner for CudaHybridForwardPass — auto placement is structurally wrong for pure-attention MoE on 12 GB #215

Context

Problem 1 — TierPlanner books all routed experts per layer, but the runtime never uploads them eagerly

Problem 2 — no "GPU trunk + CPU routed experts" mode at all

Problem 3 (minor) — KV budget always priced at fp32

Proposed work

Acceptance

References

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

perf(cuda,engine): CPU-MoE mode + expert-aware TierPlanner for CudaHybridForwardPass — auto placement is structurally wrong for pure-attention MoE on 12 GB #215

Description

Context

Problem 1 — TierPlanner books all routed experts per layer, but the runtime never uploads them eagerly

Problem 2 — no "GPU trunk + CPU routed experts" mode at all

Problem 3 (minor) — KV budget always priced at fp32

Proposed work

Acceptance

References

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions