Prefill optimization attack plan: Qwen3.6 hybrid (GDN + MoE) on RTX 4070 Ti (12 GB, Ada) #248

@pekkah

Status (2026-06-14) — verified against current code

The headline Tier-1 #1 is already implemented for the recommended config, so the "per-token serial in EVERY path / essentially all the gain" framing is overstated.

Genuinely open: #3 (chunked-matmul DeltaNet on tensor cores — the GDN scan is still a scalar per-head shared-mem scan) and the small per-token trunk share (#4: the CPU-MoE batched-prefill trunk — attn + GDN — stays sequential per token, :516; minor since MoE dominates and is already batched).


Goal

Improve prefill (TTFT) throughput for Qwen3.6 (and Qwen3.5) hybrid models on a 12 GB Ada GPU (RTX 4070 Ti, sm_89). Companion to #247 (Gemma 4). Based on profiling the existing hybrid prefill path and the external state of the art.

What Qwen3.6 is (and why prefill is different)

Qwen3.6 is a hybrid linear/full attention + sparse MoE model, not a conventional transformer:

  • 3:1 Gated DeltaNet : full-attention ratio (one full-attention layer in every four): ten 4-layer blocks (3× GDN linear attention + 1× full attention), so ¾ of layers are linear attention.
  • Sparse MoE — 35B-A3B variant: 256 experts, top-8 routed + 1 shared, ~3B active.
  • 262K context.
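
To make the layer layout concrete, here is a minimal sketch of how a `full_attention_interval` value could map to per-layer kinds. The function name and the convention that the last layer of each block is the full-attention one are assumptions for illustration, not taken from ModelGraph.cs.

```python
def layer_kinds(num_layers: int, full_attention_interval: int) -> list[str]:
    """Label each layer; assumes every `full_attention_interval`-th layer
    (counting from 1) is full attention and the rest are GDN."""
    return [
        "full" if (i + 1) % full_attention_interval == 0 else "gdn"
        for i in range(num_layers)
    ]


kinds = layer_kinds(num_layers=40, full_attention_interval=4)
full_layers = [i for i, kind in enumerate(kinds) if kind == "full"]

print(kinds[:4])                             # first 4-layer block
print(len(full_layers), kinds.count("gdn"))  # full-attention vs GDN layer counts
```

With 40 layers and an interval of 4 this gives 10 full-attention layers and 30 GDN layers, which is the "10/40 layers keep a KV cache" figure used later in this issue.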

The architecture is already implemented in the codebase (built for Qwen3.5): CudaHybridGdnForwardPass, GdnKernels, GdnStateCache, docs/qwen35moe-*. This issue is about prefill performance, not enablement.

Where prefill time goes — the two halves diverge

| Subsystem | Prefill status | Evidence |
| --- | --- | --- |
| Gated DeltaNet (¾ of layers) | ✅ Batched parallel scan on GPU | GdnRecurrenceScan (one block per V-head, scans N tokens), CudaBackend.cs:5587; PrefillBatchedTrunkGpuFfn, CudaHybridGdnForwardPass.cs:1485 |
| MoE FFN (40 layers) | ❌ Per-token, serial in every path | GpuMoeFfn: router → CPU download → top-8 → 8 separate matvecs via SLRU, CudaHybridGdnForwardPass.cs:4281 |

The MoE FFN is the prefill bottleneck and it's a hard architectural gap. The "batched" prefill paths batch only the GDN trunk, then fall back to per-token MoE (CudaHybridGdnForwardPass.cs:1466, 1485). The _isMoE guard hard-disables batched FFN everywhere (CudaForwardPass.cs:2409, ForwardPass.cs:649).

Two compounding costs, both pathological on 12 GB:

  1. Memory-bound matvec, weight re-read per token — each expert weight matrix is read from VRAM once per token (GpuMatMul(slot.Gate, ...) inside the per-token loop).
  2. SLRU thrashing — 35B-A3B experts (256 × 40 layers) can't fit in 12 GB, so they stream from host/disk via CudaExpertSlotManager. Exploration confirmed: "No special prefetch during prefill — experts loaded on-demand per-token."
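
A rough cost model shows how the two costs scale with chunk length. It counts expert weight reads per MoE layer for the per-token path and compares that with reading each active expert once per chunk. The uniform, independent routing assumption and the function names are illustrative only; real routing is skewed, so treat the numbers as an order-of-magnitude guide.

```python
def expected_active_experts(num_experts: int, top_k: int, num_tokens: int) -> float:
    """Expected number of distinct experts touched by a chunk, assuming each
    token picks top_k experts uniformly and independently of other tokens."""
    miss = 1.0 - top_k / num_experts  # P(one token does not select a given expert)
    return num_experts * (1.0 - miss ** num_tokens)


def expert_reads_per_layer(num_experts: int, top_k: int, num_tokens: int):
    """Return (per-token path reads, once-per-active-expert reads) for one layer."""
    per_token = num_tokens * top_k  # every token re-reads its top_k experts
    once_per_expert = expected_active_experts(num_experts, top_k, num_tokens)
    return per_token, once_per_expert


for n in (32, 128, 512):
    per_token, once = expert_reads_per_layer(num_experts=256, top_k=8, num_tokens=n)
    print(f"N={n}: per-token reads={per_token}, distinct experts~{once:.1f}, "
          f"ratio~{per_token / once:.1f}x")
```

Under these assumptions a 512-token chunk touches essentially all 256 experts, so the per-token path performs about 16 times as many expert weight reads (and potential SLRU loads) as a path that touches each active expert once.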

Attack plan (ranked by ROI)

Tier 1 — MoE prefill rewrite (by far the biggest win)

  • 1. Grouped-GEMM MoE prefill (sorted-token routing). Replace per-token × per-expert matvec with: (a) router GEMM over all N tokens, (b) sort/group token indices by selected expert (MegaBlocks pattern), (c) one batched GEMM per active expert over its assigned tokens. Converts memory-bound matvec (weight read N×) into compute-bound GEMM (weight read once per expert per chunk) — reuse the existing int8-MMQ / fp16-GEMM machinery (CudaBackend.cs:2213/2308). Also collapses SLRU loads from O(tokens) to O(active experts). Touch point: CudaHybridGdnForwardPass.GpuMoeFfn (:4281).
  • 2. Batch-prefetch experts during prefill. After the router runs for the whole chunk (item 1, step a), the chunk's full expert set is known upfront. Prefetch those weights via the existing async UploadBackground (CudaExpertSlotManager.cs:169) while the GDN trunk scan runs, so weights are resident before the FFN needs them. This eliminates the on-demand per-token DMA storms.
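
The sorted-token routing in item 1 can be sketched on the CPU as follows. This is a simplified illustration, not the GpuMoeFfn code: each expert is reduced to a single weight matrix (the real FFN has gate/up/down projections), and all names are hypothetical. It checks that grouping tokens by expert gives the same output as the per-token loop while fetching each expert once per chunk.

```python
import random
from collections import defaultdict


def matvec(weight, x):
    """Dense matrix-vector product; `weight` is a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def moe_per_token(tokens, routes, experts):
    """Baseline: each token visits its routed experts one by one."""
    out = []
    for x, route in zip(tokens, routes):
        acc = [0.0] * len(experts[0])
        for expert_id, gate in route:
            y = matvec(experts[expert_id], x)  # expert weights touched once per token
            acc = [a + gate * yi for a, yi in zip(acc, y)]
        out.append(acc)
    return out


def group_by_expert(routes):
    """Invert token -> experts routing into expert -> [(token index, gate)]."""
    groups = defaultdict(list)
    for t, route in enumerate(routes):
        for expert_id, gate in route:
            groups[expert_id].append((t, gate))
    return groups


def moe_grouped(tokens, routes, experts):
    """Grouped path: fetch each active expert once, apply it to all its tokens."""
    out_dim = len(experts[0])
    out = [[0.0] * out_dim for _ in tokens]
    groups = group_by_expert(routes)
    for expert_id in sorted(groups):
        weight = experts[expert_id]        # fetched once for the whole chunk
        for t, gate in groups[expert_id]:  # on a GPU this loop is one GEMM
            y = matvec(weight, tokens[t])
            for j in range(out_dim):
                out[t][j] += gate * y[j]   # scatter back to the token's row
    return out


rng = random.Random(0)
experts = [[[rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)] for _ in range(5)]
tokens = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(6)]
routes = [[(e, rng.random()) for e in rng.sample(range(5), 2)] for _ in tokens]

baseline = moe_per_token(tokens, routes, experts)
grouped = moe_grouped(tokens, routes, experts)
print(sorted(group_by_expert(routes)))  # expert set known before any FFN work
```

The keys of `group_by_expert(routes)` are exactly the experts the chunk needs, which is the set item 2 would prefetch while the trunk scan runs.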

Tier 2 — linear / full attention kernels

  • 3. Chunked-matmul DeltaNet formulation for GDN prefill (tensor cores). GdnRecurrenceScan parallelizes but is a per-head shared-memory scalar scan — no tensor cores. The FLA chunked-DeltaNet algorithm (intra-chunk parallel matmuls + inter-chunk recurrence) maps the bulk onto MMA. GDN is ¾ of layers, so meaningful — but complex kernel work, ranks below the MoE fix.
  • 4. Confirm Tc2 flash on full-attention layers. Full attention uses head_dim 256 (Q=24/KV=4) with GLU-gated Q + partial RoPE (64 of 256 dims). d=256 satisfies Tc2's d%64==0; verify the GLU-gated-Q + partial-RoPE batched path routes to FlashAttentionPrefillTc2 and not the scalar fallback.
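
For item 3, the sketch below illustrates the chunked idea on the plain (ungated) delta rule, S_t = S_{t-1} + β_t k_t (v_t − S_{t-1}ᵀ k_t)ᵀ with output o_t = S_tᵀ q_t. The per-token decay gate of Gated DeltaNet is deliberately omitted, and this is not the GdnRecurrenceScan kernel or the FLA implementation; function names are hypothetical. Within a chunk, the per-token updates are obtained from one lower-triangular solve, and everything else is a matrix product, which is the part that would map onto MMA.

```python
import random


def matmul(a, b):
    """Plain dense matrix product on lists of rows."""
    return [[sum(a[r][i] * b[i][c] for i in range(len(b))) for c in range(len(b[0]))]
            for r in range(len(a))]


def delta_rule_sequential(q, k, v, beta):
    """Reference: token-by-token recurrence with state S of shape d_k x d_v."""
    d_k, d_v = len(k[0]), len(v[0])
    S = [[0.0] * d_v for _ in range(d_k)]
    outs = []
    for qt, kt, vt, bt in zip(q, k, v, beta):
        pred = [sum(S[a][j] * kt[a] for a in range(d_k)) for j in range(d_v)]
        u = [bt * (vt[j] - pred[j]) for j in range(d_v)]
        for a in range(d_k):
            for j in range(d_v):
                S[a][j] += kt[a] * u[j]
        outs.append([sum(S[a][j] * qt[a] for a in range(d_k)) for j in range(d_v)])
    return outs


def delta_rule_chunked(q, k, v, beta, chunk):
    """Chunked form: matmuls inside a chunk, recurrence only across chunks."""
    d_k, d_v = len(k[0]), len(v[0])
    S = [[0.0] * d_v for _ in range(d_k)]
    outs = []
    for start in range(0, len(q), chunk):
        Q, K = q[start:start + chunk], k[start:start + chunk]
        V, B = v[start:start + chunk], beta[start:start + chunk]
        C = len(Q)
        Kt = [list(col) for col in zip(*K)]          # K transposed, d_k x C
        KS, QS = matmul(K, S), matmul(Q, S)          # contributions of the carried state
        KK, QK = matmul(K, Kt), matmul(Q, Kt)        # intra-chunk score matrices
        # Solve (I + diag(B) * strict_lower(K K^T)) U = diag(B) (V - K S)
        # by forward substitution; row t of U is the update written by token t.
        U = []
        for t in range(C):
            U.append([B[t] * (V[t][j] - KS[t][j]
                              - sum(KK[t][i] * U[i][j] for i in range(t)))
                      for j in range(d_v)])
        # O = Q S + lower_triangular(Q K^T) U, diagonal included.
        for t in range(C):
            outs.append([QS[t][j] + sum(QK[t][i] * U[i][j] for i in range(t + 1))
                         for j in range(d_v)])
        # Inter-chunk recurrence: S <- S + K^T U.
        KtU = matmul(Kt, U)
        S = [[S[a][j] + KtU[a][j] for j in range(d_v)] for a in range(d_k)]
    return outs


rng = random.Random(0)
T, d_k, d_v = 10, 4, 3
q = [[rng.uniform(-0.5, 0.5) for _ in range(d_k)] for _ in range(T)]
k = [[rng.uniform(-0.5, 0.5) for _ in range(d_k)] for _ in range(T)]
v = [[rng.uniform(-1.0, 1.0) for _ in range(d_v)] for _ in range(T)]
beta = [rng.uniform(0.1, 0.9) for _ in range(T)]

reference = delta_rule_sequential(q, k, v, beta)
chunked = delta_rule_chunked(q, k, v, beta, chunk=4)
print(max(abs(a - b) for ra, rb in zip(reference, chunked) for a, b in zip(ra, rb)))
```

The only sequential dependency left inside a chunk is the triangular solve over C tokens; production kernels typically replace it with a blocked inverse so that it also runs as matmuls.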

Tier 3 — fit & tuning on 12 GB

  • 5. KV narrowing — only 10/40 layers keep a KV cache (rest carry fixed ~63 MB GDN state); bf16/q8_0 KV still helps stretch full-attention context but is a smaller lever than for Gemma 4.
  • 6. Warm-pin hottest experts for prefill — tune SHARPI_MOE_WARMPIN / SLRU frequency eviction (CudaExpertSlotManager.cs) so the busiest experts stay resident across a prefill.
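
The size of the KV lever in item 5 can be estimated from the figures already given in this issue (10 KV-carrying layers, 4 KV heads, head_dim 256). The sketch assumes "262K" means 2^18 tokens and uses the GGML q8_0 block layout (32 int8 values plus one fp16 scale, 34 bytes per 32 elements). The fp32 row is only a reference width, not a statement about what the codebase currently stores.

```python
def kv_cache_bytes(context_tokens, kv_layers, kv_heads, head_dim, bytes_per_element):
    """K and V tensors for every KV-carrying layer across the whole context."""
    return 2 * kv_layers * kv_heads * head_dim * bytes_per_element * context_tokens


CONTEXT = 262_144  # assumption: "262K" context = 2**18 tokens
WIDTHS = {
    "fp32": 4.0,
    "fp16/bf16": 2.0,
    "q8_0": 34 / 32,  # 32 int8 values + one fp16 scale per block
}

for name, width in WIDTHS.items():
    hybrid = kv_cache_bytes(CONTEXT, kv_layers=10, kv_heads=4, head_dim=256,
                            bytes_per_element=width)
    all_layers = kv_cache_bytes(CONTEXT, kv_layers=40, kv_heads=4, head_dim=256,
                                bytes_per_element=width)
    print(f"{name}: {hybrid / 2**30:.2f} GiB with 10 KV layers, "
          f"{all_layers / 2**30:.2f} GiB if all 40 layers kept KV")
```

At full context the 10 KV layers alone come to 10 GiB at 16-bit width, so a narrower KV format matters mainly for long-context runs; at short prefill lengths the cache is small relative to expert weights.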

Prerequisite to verify

  • Arch detector recognizes qwen35moe/qwen35/qwen3next (ModelGraph.cs:314) but not obviously qwen3.6/qwen36moe — confirm Qwen3.6's GGUF general.architecture is recognized. The 4:1 vs 3:1 ratio is read from full_attention_interval metadata so should generalize, but the arch-string gate needs checking.

FP8 / FA-3 note

Same as #247: on Ada fp8 ≈ int8 throughput, so once grouped GEMM uses the int8-MMQ path there's no fp8 prefill speed win; FlashAttention-3 is Hopper-only.

Bottom line

Unlike Gemma 4 (CPU-bound PLE + missing SWA flash), Qwen3.6's prefill problem is entirely the MoE FFN — token-serial with per-token weight re-reads and on-demand expert streaming, pathological on 12 GB. Tier 1 (grouped-GEMM MoE + upfront expert prefetch) is where essentially all the gain is; the GDN trunk is already batched.

References

Working branch: claude/gemma4-prefill-optimization-h7syay. Related: #247.
