MoE-FFN MTP head: unblock Qwen3.6-35B-A3B-MTP loading #44

## Background

`Qwen3.6-35B-A3B-MTP-UD-Q4_K_M.gguf` uses an MoE FFN at the MTP head (one router + N experts), but the current MTP loaders in `HybridGdnForwardPass` and `CudaHybridGdnForwardPass` assume the MTP block uses a **dense** FFN (a single `ffn_gate` / `ffn_up` / `ffn_down` triplet). Loading the A3B-MTP variant fails because the dense tensor lookup for `blk.{NumLayers}.ffn_gate.weight` misses: the file contains `blk.{NumLayers}.ffn_gate_exps.weight` instead.

Per memory `project_mtp_moe_ffn_head_unsupported`, the known plumbing sites for the fix are:
- `HybridGdnForwardPass.cs:450-452` (dense MTP FFN tensor resolution)
- `CudaHybridGdnForwardPass.cs:778-780` (same on GPU)
- the MTP forward path in both files (the SiLuMul + down sequence currently runs the dense triplet).

This blocks adding 35B-A3B-MTP rows to the #28 bench matrix and prevents #30's batched verify from running on that model. Note that #30 also gates MoE off independently: `BatchForward2` requires `!_hp.IsMoE` (see the `BatchForward2` guard in `HybridGdnForwardPass.cs`).

## Scope

1. **Loader**: detect MoE-head MTP via tensor presence (`blk.{NumLayers}.ffn_gate_exps.weight` exists). Resolve the MoE-head tensor set (router weight, gate_exps, up_exps, down_exps, possibly shared-expert tensors if present).
2. **Forward**: refactor `MtpForward` so the FFN sub-block dispatches on `_isMtpMoe` (a new flag set at load time). Reuse the existing routed-MoE machinery (`MoeFfn` on CPU, `CpuMoeFfn` / `GpuMoeFfn` on CUDA). The MTP MoE inherits the same N=top-K, normalize flag, and shared-expert presence as the trunk MoE, so the hp config stays the single source of truth.
3. **#28 bench rows**: add `qwen36-35b-a3b-mtp-q4km-{cpu,cuda-hybrid}-{mtp,nomtp}` entries to `scripts/bench-27b-mtp.ps1` (or fork to `scripts/bench-35b-a3b-mtp.ps1` if the dense vs MoE flag pattern complicates the existing script).
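
The loader step in item 1 could look roughly like the sketch below. It is not the existing loader code: `MtpFfnKind`, `MtpFfnTensors`, and `MtpFfnResolver` are hypothetical names, and the router (`ffn_gate_inp`) and shared-expert (`*_shexp`) suffixes are assumed from the usual llama.cpp GGUF naming rather than confirmed against this file. Only `ffn_gate_exps.weight` and the dense triplet are named in this issue.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical types for illustration; none of these exist in the codebase.
public enum MtpFfnKind { Dense, Moe }

public sealed record MtpFfnTensors(MtpFfnKind Kind, IReadOnlyList<string> Names);

public static class MtpFfnResolver
{
    static readonly string[] DenseSuffixes =
        { "ffn_gate.weight", "ffn_up.weight", "ffn_down.weight" };

    // Assumption: router tensor follows the llama.cpp "ffn_gate_inp" convention.
    static readonly string[] MoeSuffixes =
        { "ffn_gate_inp.weight", "ffn_gate_exps.weight", "ffn_up_exps.weight", "ffn_down_exps.weight" };

    // Assumption: shared-expert tensors, if any, use the "*_shexp" convention.
    static readonly string[] SharedExpertSuffixes =
        { "ffn_gate_shexp.weight", "ffn_up_shexp.weight", "ffn_down_shexp.weight" };

    // The MTP block is addressed as blk.{numLayers}, one past the trunk layers.
    public static MtpFfnTensors Resolve(ISet<string> tensorNames, int numLayers)
    {
        string prefix = $"blk.{numLayers}.";

        // Presence of the stacked expert gate tensor marks an MoE-head MTP block.
        bool isMoe = tensorNames.Contains(prefix + "ffn_gate_exps.weight");

        var names = (isMoe ? MoeSuffixes : DenseSuffixes).Select(s => prefix + s).ToList();

        // Fail with the full list of missing tensors instead of the first miss.
        var missing = names.Where(n => !tensorNames.Contains(n)).ToList();
        if (missing.Count > 0)
            throw new InvalidOperationException(
                "MTP FFN tensors missing: " + string.Join(", ", missing));

        // Shared-expert tensors are optional; include only those actually present.
        if (isMoe)
            names.AddRange(SharedExpertSuffixes
                .Select(s => prefix + s)
                .Where(n => tensorNames.Contains(n)));

        return new MtpFfnTensors(isMoe ? MtpFfnKind.Moe : MtpFfnKind.Dense, names);
    }
}
```

The resulting `Kind` is what a load-time flag such as `_isMtpMoe` would be derived from, so the forward path never has to probe tensor names again.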

## Out of scope

- Batched verify for MoE-MTP: `BatchForward2` still requires a dense FFN until a MatVec2In-equivalent for MoE routed experts lands. The N=1 sequential MTP path should work on MoE-MTP once loading is fixed, and that already gives correctness (`MtpDecoder.DecodeSequential` is the fallback when `SupportsBatchVerify == false`).
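
As a rough illustration of that fallback, the sketch below models the capability check in isolation. `ForwardPassCaps`, `MtpDecodePath`, and `MtpDecodePathSelector` are hypothetical names; only the `SupportsBatchVerify` property name and the dense-only restriction come from this issue, and the real property may be computed from more than the MoE flag.

```csharp
// Hypothetical stand-in for the forward pass capabilities.
public sealed class ForwardPassCaps
{
    public ForwardPassCaps(bool isMoe) { IsMoE = isMoe; }

    public bool IsMoE { get; }

    // Assumption: batched verify is available exactly when the FFN is dense,
    // mirroring the !_hp.IsMoE guard described for BatchForward2.
    public bool SupportsBatchVerify => !IsMoE;
}

public enum MtpDecodePath { BatchVerify, Sequential }

public static class MtpDecodePathSelector
{
    // MoE models take the sequential N=1 path until an MoE-aware batch kernel exists.
    public static MtpDecodePath Choose(ForwardPassCaps caps) =>
        caps.SupportsBatchVerify ? MtpDecodePath.BatchVerify : MtpDecodePath.Sequential;
}
```

With this shape, adding MoE batched verify later only changes how `SupportsBatchVerify` is computed, not the decoder's selection logic.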

## Acceptance

- [ ] `Qwen3.6-35B-A3B-MTP-UD-Q4_K_M.gguf` loads successfully via both `HybridGdnForwardPass` and `CudaHybridGdnForwardPass`.
- [ ] `MtpForward` on the loaded model produces well-formed logits (mirror of `HybridGdnForwardPass_Qwen35Mtp_MtpHeadProducesWellFormedLogits` test pattern).
- [ ] CLI decode with `--spec-type mtp --no-thinking --temp 0` runs to completion on a short prompt without errors. This exercises the sequential N=1 MTP path; speedup is not asserted here because it depends on a MoE-aware `BatchForward2`, which is a separate follow-up.
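
For the well-formed-logits criterion, one possible check is sketched below. It does not reproduce the referenced `HybridGdnForwardPass_Qwen35Mtp_MtpHeadProducesWellFormedLogits` test, whose exact assertions are not shown in this issue. `LogitChecks.AreWellFormed` is a hypothetical helper, and "well-formed" is assumed here to mean the expected length, all values finite, and not a constant vector.

```csharp
using System;

public static class LogitChecks
{
    // Hypothetical helper: true when logits have vocabSize entries,
    // every entry is finite, and at least two entries differ.
    public static bool AreWellFormed(ReadOnlySpan<float> logits, int vocabSize)
    {
        if (vocabSize <= 0 || logits.Length != vocabSize) return false;

        float min = float.PositiveInfinity;
        float max = float.NegativeInfinity;
        foreach (float v in logits)
        {
            // Rejects NaN and +/- infinity, the usual signs of a mis-wired tensor.
            if (!float.IsFinite(v)) return false;
            min = Math.Min(min, v);
            max = Math.Max(max, v);
        }

        // A constant vector usually means the head produced no signal.
        return max > min;
    }
}
```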

## Related

- #28 (bench rows — this unblocks adding the 35B-A3B variant)
- #30 (batched verify — MoE-MTP batched is a future extension; sequential first)
- Memory: `project_mtp_moe_ffn_head_unsupported`
