MoE-FFN MTP head: unblock Qwen3.6-35B-A3B-MTP loading #44

@pekkah

Background

Qwen3.6-35B-A3B-MTP-UD-Q4_K_M.gguf has an MoE FFN structure at the MTP head (one router + N experts), but the current MTP loaders in HybridGdnForwardPass and CudaHybridGdnForwardPass assume the MTP block uses a dense FFN (single ffn_gate / ffn_up / ffn_down triplet). Loading the A3B-MTP variant fails because the dense tensor lookup misses on blk.{NumLayers}.ffn_gate.weight (the file has ffn_gate_exps.weight instead).

Per memory project_mtp_moe_ffn_head_unsupported: known plumbing sites for the fix are

  • HybridGdnForwardPass.cs:450-452 (dense MTP FFN tensor resolution)
  • CudaHybridGdnForwardPass.cs:778-780 (same on GPU)
  • the MTP forward path in both files (the SiLuMul + down sequence currently runs the dense triplet).

This blocks adding 35B-A3B-MTP rows to the #28 bench matrix, and prevents #30's batched verify from running on that model (though #30 also gates MoE off — BatchForward2 requires !_hp.IsMoE, see HybridGdnForwardPass.cs BatchForward2 guard).

Scope

  1. Loader: detect MoE-head MTP via tensor presence (blk.{NumLayers}.ffn_gate_exps.weight exists). Resolve the MoE-head tensor set (router weight, gate_exps, up_exps, down_exps, possibly shared-expert tensors if present).
  2. Forward: refactor MtpForward so the FFN sub-block dispatches on _isMtpMoe (a new flag set at load time). Reuse the existing routed-MoE machinery (MoeFfn on CPU, CpuMoeFfn / GpuMoeFfn on CUDA). The MTP MoE inherits the trunk MoE's top-K, normalize flag, and shared-expert presence, so the hp config stays the single source for these settings.
  3. Bench rows for #28 ("Add bench rows for qwen36-27b-mtp (cpu, cuda-hybrid) and establish perf baseline"): add qwen36-35b-a3b-mtp-q4km-{cpu,cuda-hybrid}-{mtp,nomtp} entries to scripts/bench-27b-mtp.ps1 (or fork to scripts/bench-35b-a3b-mtp.ps1 if the dense vs MoE flag pattern complicates the existing script).
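
A rough sketch of the presence-based detection from item 1, written in Python for brevity rather than in the project's C#. Only blk.{NumLayers}.ffn_gate.weight and ffn_gate_exps.weight are confirmed by this issue; the router name (ffn_gate_inp.weight), the shared-expert suffixes, and the helper names resolve_mtp_ffn / MtpFfnTensors are assumptions for illustration and should be checked against the actual GGUF file.

```python
from dataclasses import dataclass

# Dense triplet, as assumed by the current loaders.
DENSE_SUFFIXES = ("ffn_gate.weight", "ffn_up.weight", "ffn_down.weight")
# MoE set. ASSUMPTION: "ffn_gate_inp.weight" is the router tensor name.
MOE_SUFFIXES = (
    "ffn_gate_inp.weight",
    "ffn_gate_exps.weight",
    "ffn_up_exps.weight",
    "ffn_down_exps.weight",
)
# ASSUMPTION: optional shared-expert tensors use the "_shexp" suffix.
SHARED_SUFFIXES = (
    "ffn_gate_shexp.weight",
    "ffn_up_shexp.weight",
    "ffn_down_shexp.weight",
)


@dataclass(frozen=True)
class MtpFfnTensors:
    is_moe: bool     # would back the proposed _isMtpMoe flag
    required: tuple  # tensor names that must be present
    shared: tuple    # optional shared-expert tensors that were found


def resolve_mtp_ffn(tensor_names, num_layers):
    """Pick the dense or MoE tensor set for the MTP block at blk.{num_layers}."""
    prefix = f"blk.{num_layers}."
    names = set(tensor_names)
    # Detection is purely by tensor presence, not by model name.
    is_moe = (prefix + "ffn_gate_exps.weight") in names
    suffixes = MOE_SUFFIXES if is_moe else DENSE_SUFFIXES
    required = tuple(prefix + s for s in suffixes)
    missing = [n for n in required if n not in names]
    if missing:
        raise KeyError(f"MTP FFN tensors missing: {missing}")
    shared = ()
    if is_moe:
        shared = tuple(prefix + s for s in SHARED_SUFFIXES if prefix + s in names)
    return MtpFfnTensors(is_moe, required, shared)
```

With this shape, a file that carries only the *_exps tensors resolves to the MoE set instead of failing on the dense lookup, and a partially present set still fails loudly with the list of missing names.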

Out of scope

  • Batched verify for MoE-MTP — BatchForward2 still requires dense FFN until a MatVec2In-equivalent for MoE routed experts lands. The N=1 sequential MTP path should work on MoE-MTP once loading is fixed; that already gives correctness (MtpDecoder.DecodeSequential is the fallback when SupportsBatchVerify == false).

Acceptance

  • Qwen3.6-35B-A3B-MTP-UD-Q4_K_M.gguf loads successfully via both HybridGdnForwardPass and CudaHybridGdnForwardPass.
  • MtpForward on the loaded model produces well-formed logits (mirror of HybridGdnForwardPass_Qwen35Mtp_MtpHeadProducesWellFormedLogits test pattern).
  • CLI decode with --spec-type mtp --no-thinking --temp 0 runs to completion on a short prompt without errors. (Sequential N=1 MTP path; speedup is not asserted here — that's gated on MoE-aware BatchForward2 which is a separate follow-up.)
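
For the logits criterion, one possible reading of "well-formed" is sketched below in Python. The exact assertions in HybridGdnForwardPass_Qwen35Mtp_MtpHeadProducesWellFormedLogits are not quoted in this issue, so the three checks here (length, finiteness, non-constant values) and the helper name logits_well_formed are assumptions, not the existing test's definition.

```python
import math


def logits_well_formed(logits, vocab_size):
    """Return True if logits look like a usable next-token distribution input.

    ASSUMED criteria: one value per vocab entry, no NaN/Inf,
    and not all values identical (which would suggest a zeroed buffer).
    """
    if vocab_size <= 0 or len(logits) != vocab_size:
        return False
    if not all(math.isfinite(x) for x in logits):
        return False
    return max(logits) > min(logits)
```

A check of this kind only shows that the MoE FFN path produced sane output; it says nothing about draft acceptance rate or speedup, which matches the scope of the criteria above.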

Related
