Prefill optimization attack plan: Qwen3.6 hybrid (GDN + MoE) on RTX 4070 Ti (12 GB, Ada) #248

> ## Status (2026-06-14) — verified against current code
> The headline Tier-1 #1 is already implemented for the **recommended** config, so the "per-token serial in EVERY path / essentially all the gain" framing is overstated:
>
> - **#1 grouped-GEMM MoE prefill — ALREADY DONE on the CPU-MoE path** (`SHARPI_CPU_MOE=1`, the Qwen3.6-35B-A3B config in use). #110 groups routed experts by selected expert — each expert's weight rows read once per chunk and dotted against every token routing to it, not re-read per token (`CudaHybridGdnForwardPass.cs:514-523`) — and #162 routes trunk matmuls through compute-bound int8-MMQ / fp16-GEMM (`:528-540`, `SHARPI_GDN_PREFILL_COMPUTE`). The #112/#121 audit found this path already at the .NET-10 no-VNNI int8-dot ceiling, and routed-MoE is 78-83% of prefill wall → little prefill headroom left here. The per-token `GpuMoeFfn` (`:4281`) cited is the **GPU-SLRU path** (`SHARPI_CPU_MOE=0`), which loses to CPU-MoE on 12 GB for this model.
> - **#2 batch-prefetch experts** — applies to the GPU-SLRU path; CPU-MoE routed experts are mmap-resident (OS page cache), not SLRU-DMA-streamed, so there's no per-token DMA storm to prefetch away.
>
> **Genuinely open:** **#3** (chunked-matmul DeltaNet on tensor cores — the GDN scan is still a scalar per-head shared-mem scan) and the small per-token trunk share (#4: the CPU-MoE batched-prefill trunk — attn + GDN — stays sequential per token, `:516`; minor since MoE dominates and is already batched).
>
> ---

## Goal

Improve prefill throughput, and with it time to first token (TTFT), for Qwen3.6 (and Qwen3.5) hybrid models on a 12 GB Ada GPU (RTX 4070 Ti, sm_89). This issue is the companion to #247 (Gemma 4). The plan is based on profiling the existing hybrid prefill path and on a survey of the external state of the art.

## What Qwen3.6 is (and why prefill is different)

Qwen3.6 is a **hybrid linear/full attention + sparse MoE** model, not a conventional transformer:
- **3:1 Gated DeltaNet : full-attention ratio**: ten 4-layer blocks (3× GDN linear attention + 1× full attention), so ¾ of layers are linear attention.
- **Sparse MoE** — 35B-A3B variant: **256 experts, top-8 routed + 1 shared, ~3B active**.
- 262K context.

The architecture is **already implemented** in the codebase (built for Qwen3.5): `CudaHybridGdnForwardPass`, `GdnKernels`, `GdnStateCache`, `docs/qwen35moe-*`. This issue is about prefill *performance*, not enablement.

## Where prefill time goes — the two halves diverge

| Subsystem | Prefill status | Evidence |
|---|---|---|
| **Gated DeltaNet (¾ of layers)** | ✅ Batched parallel scan on GPU | `GdnRecurrenceScan` (one block per V-head, scans N tokens) `CudaBackend.cs:5587`; `PrefillBatchedTrunkGpuFfn` `CudaHybridGdnForwardPass.cs:1485` |
| **MoE FFN (40 layers)** | ❌ Per-token, serial in **every** path | `GpuMoeFfn`: router → CPU download → top-8 → 8 separate matvecs via SLRU `CudaHybridGdnForwardPass.cs:4281` |

The MoE FFN is the prefill bottleneck and it's a hard architectural gap. The "batched" prefill paths batch only the GDN trunk, then fall back to per-token MoE (`CudaHybridGdnForwardPass.cs:1466, 1485`). The `_isMoE` guard hard-disables batched FFN everywhere (`CudaForwardPass.cs:2409`, `ForwardPass.cs:649`).

Two compounding costs, both pathological on 12 GB:
1. **Memory-bound matvec, weight re-read per token** — each expert weight matrix is read from VRAM once *per token* (`GpuMatMul(slot.Gate, ...)` inside the per-token loop).
2. **SLRU thrashing** — 35B-A3B experts (256 × 40 layers) can't fit in 12 GB, so they stream from host/disk via `CudaExpertSlotManager`. Exploration confirmed: *"No special prefetch during prefill — experts loaded on-demand per-token."*
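
To make cost 1 concrete, here is a minimal Python sketch that counts how many times expert weight matrices are read for one MoE layer over a prefill chunk. It is illustrative only: `expert_weight_reads` is a hypothetical helper, not a function in the codebase, and the routing pattern is a toy assumption rather than measured data.

```python
def expert_weight_reads(routing, mode):
    """Count expert-weight-matrix reads for one MoE layer over one chunk.

    routing: one list of selected expert ids per token.
    mode:    "per_token" re-reads an expert's weights for every token that
             selects it; "grouped" reads each active expert once per chunk.
    """
    if mode == "per_token":
        return sum(len(experts) for experts in routing)
    if mode == "grouped":
        return len({e for experts in routing for e in experts})
    raise ValueError(f"unknown mode: {mode}")


# Toy routing (an assumption): 512 tokens, top-8 of 256 experts,
# spread evenly so that every expert is selected by some token.
routing = [[(t * 8 + j) % 256 for j in range(8)] for t in range(512)]

per_token = expert_weight_reads(routing, "per_token")
grouped = expert_weight_reads(routing, "grouped")
print(per_token, grouped, per_token / grouped)  # 4096 256 16.0
```

Even in this worst case for grouping, where all 256 experts are active, the read count drops by the factor `tokens × top_k / experts`. Real routing is skewed, so the set of active experts per chunk may be smaller.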

---

## Attack plan (ranked by ROI)

### Tier 1 — MoE prefill rewrite (by far the biggest win)

- [ ] **1. Grouped-GEMM MoE prefill (sorted-token routing).** Replace per-token × per-expert matvec with: (a) router GEMM over all N tokens, (b) sort/group token indices by selected expert (MegaBlocks pattern), (c) one batched GEMM per active expert over its assigned tokens. Converts memory-bound matvec (weight read N×) into compute-bound GEMM (weight read once per expert per chunk) — reuse the existing int8-MMQ / fp16-GEMM machinery (`CudaBackend.cs:2213/2308`). Also collapses SLRU loads from O(tokens) to O(active experts). Touch point: `CudaHybridGdnForwardPass.GpuMoeFfn` (`:4281`).
- [ ] **2. Batch-prefetch experts during prefill.** After the router runs for the whole chunk (#1), the chunk's full expert set is known upfront — prefetch those weights via existing async `UploadBackground` (`CudaExpertSlotManager.cs:169`) while the GDN trunk scan runs, so weights are resident before the FFN needs them. Kills the on-demand-per-token DMA storms.
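
Steps (b) and (c) of item 1, together with the prefetch set from item 2, can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions: each expert is a single square matrix rather than a gate/up/down FFN, the raw router score is used as the mixing weight, and every function name is hypothetical rather than taken from the codebase. The inner loop over `toks` stands in for what would be one batched GEMM per expert.

```python
from collections import defaultdict


def top_k(scores, k):
    """Indices of the k largest router scores (ties broken by expert id)."""
    return sorted(range(len(scores)), key=lambda e: (-scores[e], e))[:k]


def group_tokens_by_expert(router_scores, k):
    """Invert token -> experts into expert -> tokens for the whole chunk."""
    groups = defaultdict(list)
    for t, scores in enumerate(router_scores):
        for e in top_k(scores, k):
            groups[e].append(t)
    return dict(groups)


def prefetch_order(groups):
    """Experts to upload before the FFN runs, busiest first (assumed policy)."""
    return sorted(groups, key=lambda e: (-len(groups[e]), e))


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def moe_per_token(x, router_scores, experts, k):
    """Reference: every token re-reads the weights of each expert it selects."""
    out = []
    for xt, scores in zip(x, router_scores):
        acc = [0.0] * len(xt)
        for e in top_k(scores, k):
            y = matvec(experts[e], xt)
            acc = [a + scores[e] * yi for a, yi in zip(acc, y)]
        out.append(acc)
    return out


def moe_grouped(x, router_scores, experts, k):
    """Grouped form: each active expert's weights are fetched once per chunk."""
    out = [[0.0] * len(xt) for xt in x]
    for e, toks in group_tokens_by_expert(router_scores, k).items():
        W = experts[e]  # single fetch for all tokens routed to expert e
        for t in toks:  # stands in for one GEMM over this expert's tokens
            y = matvec(W, x[t])
            out[t] = [a + router_scores[t][e] * yi for a, yi in zip(out[t], y)]
    return out
```

Both functions compute the same result up to floating-point summation order; only the order of weight access changes. Because `group_tokens_by_expert` needs nothing beyond the router output, `prefetch_order` can be evaluated as soon as the router GEMM for the chunk finishes.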

### Tier 2 — linear / full attention kernels

- [ ] **3. Chunked-matmul DeltaNet formulation for GDN prefill (tensor cores).** `GdnRecurrenceScan` runs one block per V-head in parallel, but each block is a scalar shared-memory scan that uses no tensor cores. The FLA chunked-DeltaNet algorithm (intra-chunk parallel matmuls + inter-chunk recurrence) maps the bulk of the work onto MMA. GDN is ¾ of layers, so the gain is meaningful, but this is complex kernel work and ranks below the MoE fix.
- [ ] **4. Confirm Tc2 flash on full-attention layers.** Full attention uses head_dim 256 (Q=24/KV=4) with GLU-gated Q + partial RoPE (64 of 256 dims). d=256 satisfies Tc2's `d%64==0`; verify the GLU-gated-Q + partial-RoPE batched path routes to `FlashAttentionPrefillTc2` and not the scalar fallback.
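
The chunked formulation in item 3 can be checked against the token-by-token recurrence with a small pure-Python sketch. It covers only the ungated delta rule, `S_t = S_{t-1} + β_t (v_t − S_{t-1} k_t) k_tᵀ` with output `o_t = S_t q_t`; the decay gate of Gated DeltaNet is omitted for brevity, and none of the function names come from the codebase or from FLA. The triangular solve is written as a forward substitution here, whereas a GPU kernel would typically materialize the inverse per chunk.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(r) for r in zip(*A)]


def delta_sequential(Q, K, V, beta, S):
    """Reference scan: one state update per token. S has shape d_v x d_k."""
    S = [row[:] for row in S]
    out = []
    for q, k, v, b in zip(Q, K, V, beta):
        pred = [sum(s * kj for s, kj in zip(row, k)) for row in S]  # S k
        u = [b * (vi - pi) for vi, pi in zip(v, pred)]
        for i, row in enumerate(S):
            for j, kj in enumerate(k):
                row[j] += u[i] * kj
        out.append([sum(s * qj for s, qj in zip(row, q)) for row in S])
    return out, S


def delta_chunk(Q, K, V, beta, S):
    """One chunk expressed as matmuls plus a small triangular solve."""
    C, St = len(K), transpose(S)
    KS = matmul(K, St)            # C x d_v, contribution of the incoming state
    KK = matmul(K, transpose(K))  # C x C key Gram matrix
    U = []                        # pseudo-values u_t, solved row by row
    for t in range(C):
        U.append([beta[t] * (V[t][c] - KS[t][c]
                             - sum(KK[t][i] * U[i][c] for i in range(t)))
                  for c in range(len(V[0]))])
    QK = matmul(Q, transpose(K))
    for t in range(C):            # causal mask that keeps the diagonal
        for i in range(t + 1, C):
            QK[t][i] = 0.0
    inter, intra = matmul(Q, St), matmul(QK, U)
    O = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(inter, intra)]
    UtK = matmul(transpose(U), K)  # d_v x d_k state increment for the chunk
    S_new = [[s + d for s, d in zip(r1, r2)] for r1, r2 in zip(S, UtK)]
    return O, S_new


def delta_chunked(Q, K, V, beta, S, chunk):
    """Inter-chunk recurrence: carry the state S from chunk to chunk."""
    out = []
    for s in range(0, len(K), chunk):
        e = s + chunk
        O, S = delta_chunk(Q[s:e], K[s:e], V[s:e], beta[s:e], S)
        out.extend(O)
    return out, S
```

In this form the per-chunk cost is dominated by the `Q Kᵀ`, `K Kᵀ`, and `Uᵀ K` products, which are the parts a tensor-core kernel would target, while only the state hand-off between chunks stays sequential.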

### Tier 3 — fit & tuning on 12 GB

- [ ] **5. KV narrowing** — only 10/40 layers keep a KV cache (rest carry fixed ~63 MB GDN state); bf16/q8_0 KV still helps stretch full-attention context but is a smaller lever than for Gemma 4.
- [ ] **6. Warm-pin hottest experts for prefill** — tune `SHARPI_MOE_WARMPIN` / SLRU frequency eviction (`CudaExpertSlotManager.cs`) so the busiest experts stay resident across a prefill.
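
For item 5, the KV footprint can be estimated from figures already quoted in this issue (10 full-attention layers, 4 KV heads, head_dim 256). The sketch below is a back-of-envelope calculation, not a measurement. It assumes "262K" means 2^18 = 262,144 tokens, and it uses the GGML q8_0 block layout of 34 bytes per 32 elements. `kv_bytes_per_token` is a hypothetical helper.

```python
def kv_bytes_per_token(kv_layers, kv_heads, head_dim, bytes_per_elem):
    """Bytes of KV cache added per token: K and V for every caching layer."""
    return kv_layers * 2 * kv_heads * head_dim * bytes_per_elem


GIB = 2 ** 30
CONTEXT = 262_144  # assumption: "262K" read as 2**18 tokens

formats = [("fp32", 4), ("fp16/bf16", 2), ("q8_0", 34 / 32)]
for name, bytes_per_elem in formats:
    per_token = kv_bytes_per_token(10, 4, 256, bytes_per_elem)
    total_gib = per_token * CONTEXT / GIB
    print(f"{name:10s} {per_token:8.0f} B/token  {total_gib:6.2f} GiB at full context")
```

Under these assumptions a 16-bit KV cache costs 40,960 bytes per token, which works out to 10 GiB at the full context window, so the usable context on a 12 GB card is bounded well below 262K regardless of KV format.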

### Prerequisite to verify
- [ ] Arch detector recognizes `qwen35moe`/`qwen35`/`qwen3next` (`ModelGraph.cs:314`) but **not obviously** `qwen3.6`/`qwen36moe`; confirm that Qwen3.6's GGUF `general.architecture` string is recognized. The GDN : full-attention ratio is read from `full_attention_interval` metadata, so the layer pattern should generalize, but the arch-string gate needs checking.
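
The two parts of this check can be sketched as follows. This is a hedged illustration, not the logic in `ModelGraph.cs`: the recognized set simply lists the three strings named above, `qwen36moe` is only a guess at what a Qwen3.6 GGUF might report, and placing the full-attention layer last in each block is an assumption that matches the "3× GDN + 1× full" description.

```python
# Architecture strings the issue says the detector accepts today.
RECOGNIZED_ARCHS = {"qwen35moe", "qwen35", "qwen3next"}


def is_recognized(arch):
    """Exact-match gate on the GGUF general.architecture string."""
    return arch in RECOGNIZED_ARCHS


def layer_kinds(n_layers, full_attention_interval):
    """Derive the per-layer type from the interval metadata alone.

    Assumption: the last layer of every block is the full-attention one.
    """
    return [
        "full" if (i + 1) % full_attention_interval == 0 else "gdn"
        for i in range(n_layers)
    ]


kinds = layer_kinds(40, 4)
print(kinds.count("gdn"), kinds.count("full"))  # 30 10
print(is_recognized("qwen36moe"))               # False
```

With 40 layers and an interval of 4 this yields 30 GDN layers and 10 full-attention layers, consistent with the 10/40 KV-cache layers in Tier 3. The layer pattern therefore follows from metadata, and only the exact-match string gate would reject an unlisted architecture name.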

## FP8 / FA-3 note
Same as #247: on Ada fp8 ≈ int8 throughput, so once grouped GEMM uses the int8-MMQ path there's no fp8 prefill speed win; FlashAttention-3 is Hopper-only.

## Bottom line
Unlike Gemma 4 (CPU-bound PLE + missing SWA flash), **Qwen3.6's prefill problem is entirely the MoE FFN** — token-serial with per-token weight re-reads and on-demand expert streaming, pathological on 12 GB. Tier 1 (grouped-GEMM MoE + upfront expert prefetch) is where essentially all the gain is; the GDN trunk is already batched.

## References
- Qwen3.6 GitHub — https://github.com/QwenLM/Qwen3.6
- Qwen3.5 architecture analysis (mlabonne) — https://huggingface.co/blog/mlabonne/qwen35
- Qwen3.6-35B-A3B + GDN/MoE hybrid — https://lilting.ch/en/articles/qwen36-35b-a3b-agentic-coding-moe-hybrid
- Qwen3.6-35B-A3B review (experts/routing) — https://dev.to/czmilo/qwen36-35b-a3b-complete-review-alibabas-open-source-coding-model-that-beats-frontier-giants-4382
- MoE expert co-activation reordering — https://blog.doubleword.ai/moe-expert-coactivations
- Qwen3 Technical Report — https://arxiv.org/html/2505.09388v1

Working branch: `claude/gemma4-prefill-optimization-h7syay`. Related: #247.


