Background
Qwen3.6-35B-A3B-MTP-UD-Q4_K_M.gguf has an MoE FFN structure at the MTP head (one router + N experts), but the current MTP loaders in HybridGdnForwardPass and CudaHybridGdnForwardPass assume the MTP block uses a dense FFN (a single ffn_gate / ffn_up / ffn_down triplet). Loading the A3B-MTP variant fails because the dense tensor lookup misses on blk.{NumLayers}.ffn_gate.weight (the file has ffn_gate_exps.weight instead).

Per memory project_mtp_moe_ffn_head_unsupported, the known plumbing sites for the fix are:
- HybridGdnForwardPass.cs:450-452 (dense MTP FFN tensor resolution)
- CudaHybridGdnForwardPass.cs:778-780 (same on GPU)
- the MTP forward path in both files (the SiLuMul + down sequence currently runs the dense triplet)

This blocks adding 35B-A3B-MTP rows to the #28 bench matrix and prevents #30's batched verify from running on that model (though #30 also gates MoE off: BatchForward2 requires !_hp.IsMoE, see the BatchForward2 guard in HybridGdnForwardPass.cs).

Scope
- Loader: detect MoE-head MTP via tensor presence (blk.{NumLayers}.ffn_gate_exps.weight exists). Resolve the MoE-head tensor set (router weight, gate_exps, up_exps, down_exps, and shared-expert tensors if present).
- Forward: refactor MtpForward so the FFN sub-block dispatches on _isMtpMoe (new flag set at load time). Reuse the existing routed-MoE machinery (MoeFfn on CPU, CpuMoeFfn / GpuMoeFfn on CUDA). The MTP MoE inherits the same N=top-K, normalize flag, and shared-expert presence as the trunk MoE (single source of hp config).
- Add qwen36-35b-a3b-mtp-q4km-{cpu,cuda-hybrid}-{mtp,nomtp} entries to scripts/bench-27b-mtp.ps1 (or fork to scripts/bench-35b-a3b-mtp.ps1 if the dense vs MoE flag pattern complicates the existing script).

Out of scope
- Batched verify for MoE-MTP: BatchForward2 still requires a dense FFN until a MatVec2In-equivalent for MoE routed experts lands. The N=1 sequential MTP path should work on MoE-MTP once loading is fixed; that already gives correctness (MtpDecoder.DecodeSequential is the fallback when SupportsBatchVerify == false).

Acceptance
- Qwen3.6-35B-A3B-MTP-UD-Q4_K_M.gguf loads successfully via both HybridGdnForwardPass and CudaHybridGdnForwardPass.
- MtpForward on the loaded model produces well-formed logits (mirror of the HybridGdnForwardPass_Qwen35Mtp_MtpHeadProducesWellFormedLogits test pattern).
- CLI decode with --spec-type mtp --no-thinking --temp 0 runs to completion on a short prompt without errors. (Sequential N=1 MTP path; speedup is not asserted here. That is gated on MoE-aware BatchForward2, which is a separate follow-up.)

Related
project_mtp_moe_ffn_head_unsupported
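
Sketch of the loader-side detection described under Scope. This is a minimal illustration, not the repository's code: MtpFfnKind, MtpHeadProbe, Detect, and RequiredTensors are hypothetical names, and the tensor table is modelled as a plain collection of names because the real loader API is not shown in this issue. Only blk.{NumLayers}.ffn_gate.weight and blk.{NumLayers}.ffn_gate_exps.weight are taken from the issue text; ffn_gate_inp.weight (router), ffn_up_exps.weight, and ffn_down_exps.weight are assumed to follow the usual llama.cpp GGUF naming and should be checked against the actual file. Shared-expert tensors are left out.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical classification of the MTP block's FFN.
public enum MtpFfnKind { Dense, Moe }

public static class MtpHeadProbe
{
    // Decide dense vs MoE purely from which gate tensor the file contains.
    // numLayers is the trunk layer count, so the MTP block sits at blk.{numLayers}.
    public static MtpFfnKind Detect(ICollection<string> tensorNames, int numLayers)
    {
        string prefix = $"blk.{numLayers}.";
        if (tensorNames.Contains(prefix + "ffn_gate_exps.weight")) return MtpFfnKind.Moe;
        if (tensorNames.Contains(prefix + "ffn_gate.weight")) return MtpFfnKind.Dense;
        throw new InvalidOperationException(
            $"No MTP FFN gate tensor found under {prefix}");
    }

    // Tensor names the loader would then resolve for the detected kind.
    // The MoE names other than ffn_gate_exps.weight are assumptions (see lead-in).
    public static string[] RequiredTensors(MtpFfnKind kind, int numLayers)
    {
        string p = $"blk.{numLayers}.";
        return kind == MtpFfnKind.Moe
            ? new[]
              {
                  p + "ffn_gate_inp.weight",   // router (assumed name)
                  p + "ffn_gate_exps.weight",
                  p + "ffn_up_exps.weight",    // assumed name
                  p + "ffn_down_exps.weight",  // assumed name
              }
            : new[]
              {
                  p + "ffn_gate.weight",
                  p + "ffn_up.weight",
                  p + "ffn_down.weight",
              };
    }
}
```

Under these assumptions, the loader would set _isMtpMoe to Detect(names, NumLayers) == MtpFfnKind.Moe once at load time, and MtpForward would branch on that flag instead of re-probing tensors per call. Failing loudly when neither gate tensor is present keeps a malformed file from silently falling into the dense path.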