Full-GPU MoE for MTP head (avoid SHARPI_CPU_MOE=1 requirement) #46

@pekkah

Background

CudaHybridGdnForwardPass rejects MoE MTP heads when !_cpuMoe with:

```
"MoE MTP head requires SHARPI_CPU_MOE=1. GPU MoE path (SLRU expert cache)
doesn't reserve slots for the MTP block; enable CPU MoE mode to load this model."
```

So Qwen3.6-35B-A3B-MTP currently requires SHARPI_CPU_MOE=1. On a 12 GB card the model can't fit anyway (~22 GB at Q4_K_M), so this is moot today. On 24 GB+ cards, however, the full routed-expert stack could live in VRAM. To support that, the _expertSlotManager needs to be sized for (L+1) * numExperts instead of L * numExperts (L being the number of main-model layers), and the GpuMoeFfn dispatch needs a parameterised variant that takes a layer's tensors instead of an array index.
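The sizing change can be illustrated with a small sketch. It is written in Python for brevity and is not the project's code: the function names, the flat `layer * num_experts + expert` slot layout, and the example sizes are all assumptions made for illustration. Only the (L+1) * numExperts versus L * numExperts distinction comes from this issue.

```python
def expert_slot_count(num_layers, num_experts, has_mtp, mtp_is_moe):
    """Slots needed to key every routed expert (hypothetical helper)."""
    # The MTP block only needs slots when it exists and is itself MoE.
    extra = 1 if (has_mtp and mtp_is_moe) else 0
    return (num_layers + extra) * num_experts


def expert_slot_index(layer, expert, num_experts):
    """Flat slot key; assumes a dense layer-major layout (an assumption)."""
    return layer * num_experts + expert


# Hypothetical sizes, not the real model's dimensions.
NUM_LAYERS, NUM_EXPERTS = 4, 8

total = expert_slot_count(NUM_LAYERS, NUM_EXPERTS, has_mtp=True, mtp_is_moe=True)
# Treating the MTP block as "layer L" puts its experts in the last stripe.
mtp_first = expert_slot_index(NUM_LAYERS, 0, NUM_EXPERTS)
mtp_last = expert_slot_index(NUM_LAYERS, NUM_EXPERTS - 1, NUM_EXPERTS)
```

With L * numExperts sizing, `mtp_first` would already be out of range, which is why the current GPU path has no slots to give the MTP block.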

Scope

  1. Plumb the MTP block as "layer L" in _expertSlotManager allocations (L * _numExperts → (L+1) * _numExperts, conditional on _hasMtp && _mtpIsMoE).
  2. Upload the MTP block's routed expert weights into the SLRU cache, indexed at layer L.
  3. Refactor GpuMoeFfn(int layer) into a tensor-parameterised core (same pattern as MoeFfnCore / CpuMoeFfnCore from #44, "MoE-FFN MTP head: unblock Qwen3.6-35B-A3B-MTP loading").
  4. Remove the NotSupportedException; dispatch through GPU MoE when !_cpuMoe and _mtpIsMoE.
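The shape of the refactor in steps 3 and 4 might look roughly like the sketch below, again in Python and purely illustrative. `MoeTensors`, `gpu_moe_ffn_core`, `gpu_moe_ffn_mtp` and the `Model` class are hypothetical names, and the core here only reports which cache slots it would touch rather than running an FFN. The point is that the per-layer entry and the MTP entry share one core that receives its tensors and cache layer explicitly.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class MoeTensors:
    """Hypothetical bundle of what one MoE block needs."""
    cache_layer: int                # layer index used to key the expert cache
    router_logits: Sequence[float]  # stand-in for one token's router output


def top_k_experts(logits: Sequence[float], k: int) -> List[int]:
    # Indices of the k highest-scoring experts, best first.
    return sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:k]


def gpu_moe_ffn_core(t: MoeTensors, num_experts: int, k: int) -> List[int]:
    # Tensor-parameterised core: never indexes a per-layer array itself.
    return [t.cache_layer * num_experts + e
            for e in top_k_experts(t.router_logits, k)]


class Model:
    def __init__(self, layer_logits, mtp_logits, num_experts, k):
        self.layer_logits = layer_logits
        self.mtp_logits = mtp_logits
        self.num_experts = num_experts
        self.k = k

    def gpu_moe_ffn(self, layer: int) -> List[int]:
        # Thin wrapper keeping the existing index-based entry point.
        t = MoeTensors(layer, self.layer_logits[layer])
        return gpu_moe_ffn_core(t, self.num_experts, self.k)

    def gpu_moe_ffn_mtp(self) -> List[int]:
        # The MTP block is addressed as "layer L" (one past the last layer).
        t = MoeTensors(len(self.layer_logits), self.mtp_logits)
        return gpu_moe_ffn_core(t, self.num_experts, self.k)


model = Model(
    layer_logits=[[0.1, 0.9, 0.3, 0.2], [0.5, 0.1, 0.2, 0.7]],
    mtp_logits=[0.0, 0.2, 0.8, 0.6],
    num_experts=4,
    k=2,
)
```

Keeping the index-based wrapper means existing call sites for the main layers would not need to change; only the MTP path gains a new entry point that replaces the exception.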

Acceptance criteria

  • Qwen3.6-35B-A3B-MTP loads and decodes coherently without SHARPI_CPU_MOE=1 on a card with enough VRAM.
  • 100 % draft acceptance preserved.
  • No regression on the CPU MoE path.
