Status (2026-06-14) — verified against current code
The headline Tier-1 #1 is already implemented for the recommended config, so the "per-token serial in EVERY path / essentially all the gain" framing is overstated:
Genuinely open: #3 (chunked-matmul DeltaNet on tensor cores — the GDN scan is still a scalar per-head shared-mem scan) and the small per-token trunk share (#4: the CPU-MoE batched-prefill trunk — attn + GDN — stays sequential per token, :516; minor since MoE dominates and is already batched).
Goal
Improve prefill (TTFT) throughput for Qwen3.6 (and Qwen3.5) hybrid models on a 12 GB Ada GPU (RTX 4070 Ti, sm_89). Companion to #247 (Gemma 4). Based on profiling the existing hybrid prefill path and the external state of the art.
What Qwen3.6 is (and why prefill is different)
Qwen3.6 is a hybrid linear/full attention + sparse MoE model, not a conventional transformer:
- 3:1 Gated DeltaNet : full-attention layer ratio (every fourth layer is full attention): ten 4-layer blocks (3× GDN linear attention + 1× full attention), so ¾ of layers are linear attention.
- Sparse MoE — 35B-A3B variant: 256 experts, top-8 routed + 1 shared, ~3B active.
- 262K context.
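As a rough illustration of the layer layout described above, the sketch below labels 40 layers as GDN or full attention from an interval value. The function name `layer_kinds` is hypothetical, and placing the full-attention layer last in each 4-layer block is an assumption; the codebase is said to read the interval from `full_attention_interval` metadata, but how it does so is not shown here.

```python
def layer_kinds(num_layers: int, full_attention_interval: int) -> list[str]:
    """Label each layer 'gdn' or 'full'.

    Assumption: the full-attention layer is the last one in each block,
    i.e. every `full_attention_interval`-th layer.
    """
    return [
        "full" if (i + 1) % full_attention_interval == 0 else "gdn"
        for i in range(num_layers)
    ]


# Ten 4-layer blocks, as described in the list above.
kinds = layer_kinds(num_layers=40, full_attention_interval=4)
print(kinds[:4])
print(kinds.count("gdn"), kinds.count("full"))
```

With an interval of 4, three quarters of the layers come out as GDN, which matches the ¾ figure quoted above.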
The architecture is already implemented in the codebase (built for Qwen3.5): CudaHybridGdnForwardPass, GdnKernels, GdnStateCache, docs/qwen35moe-*. This issue is about prefill performance, not enablement.
Where prefill time goes — the two halves diverge
| Subsystem | Prefill status | Evidence |
| --- | --- | --- |
| Gated DeltaNet (¾ of layers) | ✅ Batched parallel scan on GPU | GdnRecurrenceScan (one block per V-head, scans N tokens) CudaBackend.cs:5587; PrefillBatchedTrunkGpuFfn CudaHybridGdnForwardPass.cs:1485 |
| MoE FFN (40 layers) | ❌ Per-token, serial in every path | GpuMoeFfn: router → CPU download → top-8 → 8 separate matvecs via SLRU CudaHybridGdnForwardPass.cs:4281 |
The MoE FFN is the prefill bottleneck and it's a hard architectural gap. The "batched" prefill paths batch only the GDN trunk, then fall back to per-token MoE (CudaHybridGdnForwardPass.cs:1466, 1485). The _isMoE guard hard-disables batched FFN everywhere (CudaForwardPass.cs:2409, ForwardPass.cs:649).
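For readers unfamiliar with the per-token flow (router, then top-8 selection, then one matvec per selected expert), here is a minimal sketch of the selection step for a single token. It is not the GpuMoeFfn code: `route_top_k` is a hypothetical name, and normalizing the gate weights with a softmax over only the selected logits is an assumption that the source does not confirm.

```python
import heapq
import math


def route_top_k(router_logits: list[float], k: int = 8) -> list[tuple[int, float]]:
    """Return (expert_index, gate_weight) for the k highest router logits."""
    top = heapq.nlargest(k, range(len(router_logits)), key=lambda e: router_logits[e])
    # Assumption: gate weights are a softmax over the selected logits only.
    peak = max(router_logits[e] for e in top)
    exps = [math.exp(router_logits[e] - peak) for e in top]
    total = sum(exps)
    return [(e, x / total) for e, x in zip(top, exps)]


# Synthetic logits for 256 experts (illustrative values, not model output).
logits = [math.sin(0.37 * e) for e in range(256)]
picked = route_top_k(logits)
print([e for e, _ in picked])
```

In the per-token path described above, each of the selected experts would then trigger its own matvec, repeated for every token in the prompt.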
Two compounding costs, both pathological on 12 GB:
- Memory-bound matvec, weight re-read per token — each expert weight matrix is read from VRAM once per token (GpuMatMul(slot.Gate, ...) inside the per-token loop).
- SLRU thrashing — 35B-A3B experts (256 × 40 layers) can't fit in 12 GB, so they stream from host/disk via CudaExpertSlotManager. Exploration confirmed: "No special prefetch during prefill — experts loaded on-demand per-token."
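To make the weight re-read cost concrete, the sketch below counts expert-matrix reads for one MoE layer under the per-token path and under a grouped-by-expert layout. It is a toy model, not the codebase's logic: the function names are hypothetical, and the uniformly random routing is an assumption (real routers are skewed, which changes the bucket sizes but not the upper bound).

```python
import random
from collections import defaultdict


def random_routing(num_tokens: int, num_experts: int = 256, top_k: int = 8, seed: int = 0):
    """Assumed stand-in for the router: top_k distinct experts per token."""
    rng = random.Random(seed)
    return [rng.sample(range(num_experts), top_k) for _ in range(num_tokens)]


def per_token_weight_reads(routing) -> int:
    """Per-token path: every (token, expert) pair re-reads that expert's weights."""
    return sum(len(experts) for experts in routing)


def group_by_expert(routing) -> dict[int, list[int]]:
    """Grouped path: bucket token indices per expert, so each expert is read once."""
    buckets = defaultdict(list)
    for token, experts in enumerate(routing):
        for expert in experts:
            buckets[expert].append(token)
    return dict(buckets)


routing = random_routing(num_tokens=512)
buckets = group_by_expert(routing)
print("per-token weight reads:", per_token_weight_reads(routing))
print("grouped weight reads:  ", len(buckets))
```

For a 512-token prompt the per-token path performs 512 × 8 reads per layer, while the grouped layout is bounded by the number of experts (256), whatever the prompt length. The same bucketing also bounds how many distinct experts have to be streamed in per layer.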
Attack plan (ranked by ROI)
Tier 1 — MoE prefill rewrite (by far the biggest win)
- … CudaBackend.cs:2213/2308). Also collapses SLRU loads from O(tokens) to O(active experts). Touch point: CudaHybridGdnForwardPass.GpuMoeFfn (:4281).
- … UploadBackground (CudaExpertSlotManager.cs:169) while the GDN trunk scan runs, so weights are resident before the FFN needs them. Kills the on-demand-per-token DMA storms.
Tier 2 — linear / full attention kernels
- … GdnRecurrenceScan parallelizes but is a per-head shared-memory scalar scan — no tensor cores. The FLA chunked-DeltaNet algorithm (intra-chunk parallel matmuls + inter-chunk recurrence) maps the bulk onto MMA. GDN is ¾ of layers, so meaningful — but complex kernel work, ranks below the MoE fix.
- … d%64==0; verify the GLU-gated-Q + partial-RoPE batched path routes to FlashAttentionPrefillTc2 and not the scalar fallback.
Tier 3 — fit & tuning on 12 GB
- … SHARPI_MOE_WARMPIN / SLRU frequency eviction (CudaExpertSlotManager.cs) so the busiest experts stay resident across a prefill.
Prerequisite to verify
- … qwen35moe/qwen35/qwen3next (ModelGraph.cs:314) but not obviously qwen3.6/qwen36moe — confirm Qwen3.6's GGUF general.architecture is recognized. The 4:1 vs 3:1 ratio is read from full_attention_interval metadata so should generalize, but the arch-string gate needs checking.
FP8 / FA-3 note
Same as #247: on Ada fp8 ≈ int8 throughput, so once grouped GEMM uses the int8-MMQ path there's no fp8 prefill speed win; FlashAttention-3 is Hopper-only.
Bottom line
Unlike Gemma 4 (CPU-bound PLE + missing SWA flash), Qwen3.6's prefill problem is entirely the MoE FFN — token-serial with per-token weight re-reads and on-demand expert streaming, pathological on 12 GB. Tier 1 (grouped-GEMM MoE + upfront expert prefetch) is where essentially all the gain is; the GDN trunk is already batched.
References
https://dev.to/czmilo/qwen36-35b-a3b-complete-review-alibabas-open-source-coding-model-that-beats-frontier-giants-4382
Working branch: claude/gemma4-prefill-optimization-h7syay. Related: #247.