Full-GPU MoE for MTP head (avoid SHARPI_CPU_MOE=1 requirement) #46

## Background

`CudaHybridGdnForwardPass` rejects MoE MTP heads when `!_cpuMoe` with:

```
"MoE MTP head requires SHARPI_CPU_MOE=1. GPU MoE path (SLRU expert cache)
doesn't reserve slots for the MTP block; enable CPU MoE mode to load this model."
```

As a result, Qwen3.6-35B-A3B-MTP requires `SHARPI_CPU_MOE=1`. On a 12 GB card this is moot today, because the model (~22 GB at Q4_K_M) can't fit anyway. On 24 GB+ cards, where the full routed-expert stack could live in VRAM, two changes are needed:

- `_expertSlotManager` must be sized for `(L+1) * numExperts` instead of `L * numExperts`, where `L` is the number of main-model layers.
- The `GpuMoeFfn` dispatch needs a parameterised variant that takes a layer's tensors instead of an array index.
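The sizing change can be sketched in a few lines. This is a language-neutral illustration in Python, not the project's code: the function names are hypothetical, and the flat `layer * num_experts + expert` key is an assumption about how slots are addressed, which the SLRU cache may do differently.

```python
def expert_slot_count(num_layers: int, num_experts: int,
                      has_mtp: bool, mtp_is_moe: bool) -> int:
    """Total routed-expert slots to reserve.

    Mirrors the proposed rule: L * numExperts normally, and
    (L+1) * numExperts when the model has an MoE MTP head.
    """
    extra_layers = 1 if (has_mtp and mtp_is_moe) else 0
    return (num_layers + extra_layers) * num_experts


def expert_slot_index(layer: int, expert: int, num_experts: int) -> int:
    """Flat slot key (assumed layout). The MTP block uses layer == num_layers."""
    return layer * num_experts + expert


# Illustrative numbers only; not the real model's configuration.
NUM_LAYERS, NUM_EXPERTS = 4, 8
total_slots = expert_slot_count(NUM_LAYERS, NUM_EXPERTS, True, True)
mtp_first_slot = expert_slot_index(NUM_LAYERS, 0, NUM_EXPERTS)
```

With the extra layer reserved, every MTP expert key falls inside the allocated range; without it, the first MTP key would equal the slot count and be out of bounds.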

## Scope

1. Plumb the MTP block as "layer L" in `_expertSlotManager` allocations (`L * _numExperts` → `(L+1) * _numExperts`, conditional on `_hasMtp && _mtpIsMoE`).
2. Upload the MTP block's routed expert weights into the SLRU cache, indexed at layer L.
3. Refactor `GpuMoeFfn(int layer)` into a tensor-parameterised core (same pattern as `MoeFfnCore` / `CpuMoeFfnCore` from #44).
4. Remove the `NotSupportedException`; dispatch through GPU MoE when `!_cpuMoe` and `_mtpIsMoE`.
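The refactor in step 3 follows a common pattern: move the routing and mixing logic into a core that receives the tensors directly, then keep thin wrappers for the per-layer and MTP callers. The sketch below illustrates that pattern in Python under stated assumptions. `MoeTensors`, `moe_ffn_core`, and both wrappers are hypothetical names, experts are modelled as plain callables, and the router is assumed to return positive per-expert probabilities that are renormalised over the top-k selection. The real `GpuMoeFfn` may route and mix differently.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Vector = Sequence[float]


@dataclass
class MoeTensors:
    """Everything one MoE block needs, passed explicitly (hypothetical type)."""
    router: Callable[[Vector], List[float]]         # x -> per-expert probabilities
    experts: List[Callable[[Vector], List[float]]]  # expert FFNs


def moe_ffn_core(t: MoeTensors, x: Vector, top_k: int) -> List[float]:
    """Tensor-parameterised core: no layer index, no global weight arrays."""
    probs = t.router(x)
    # Pick the top_k highest-probability experts.
    chosen = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)[:top_k]
    norm = sum(probs[e] for e in chosen)  # renormalise over the chosen experts
    out = [0.0] * len(x)
    for e in chosen:
        weight = probs[e] / norm
        for i, v in enumerate(t.experts[e](x)):
            out[i] += weight * v
    return out


def moe_ffn_for_layer(layers: List[MoeTensors], layer: int,
                      x: Vector, top_k: int) -> List[float]:
    """Existing call shape: look the tensors up by index, then delegate."""
    return moe_ffn_core(layers[layer], x, top_k)


def moe_ffn_for_mtp(mtp: MoeTensors, x: Vector, top_k: int) -> List[float]:
    """MTP call shape: the block's tensors are handed in directly."""
    return moe_ffn_core(mtp, x, top_k)
```

Because both wrappers delegate to the same core, the MTP block gets identical routing behaviour to a regular layer without needing an entry in the per-layer array.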

## Acceptance criteria

- [ ] Qwen3.6-35B-A3B-MTP loads and decodes coherently without `SHARPI_CPU_MOE=1` on a card with enough VRAM.
- [ ] 100% draft acceptance is preserved.
- [ ] No regression on the CPU MoE path.

## Related

- #44 (MoE MTP loader, shipped — left this as a follow-up)
