perf(cuda): batch the FFN/MoE stage of the dense + GPU-SLRU GDN-hybrid prefill (#119 follow-up) #121

@pekkah

Follow-up to #119 (PR #120). PrefillBatchedTrunkGpuFfn batches the trunk (GDN/attn launches collapsed via #111/#114-B) for the dense GDN-hybrid and GPU-SLRU MoE configs, but the FFN/MoE stage still runs per token:

// PrefillBatchedTrunkGpuFfn (current shape, abbreviated)
TrunkBlockBatched(layer, N, ...);          // batched over all N tokens
for (int i = 0; i < N; i++) {              // FFN/MoE still runs once per token
    CopyDeviceRegion(_gpuNormBuf, ..., moeNorm, i*embDim, ...);
    if (!isMoe)  GpuDenseFfn(layer);       // or CpuDenseFfn(layer)
    else         GpuMoeFfn(layer);         // GPU-SLRU routed experts, per token
    // AddInPlace + write stream slice
}

So the trunk win lands, but the FFN/MoE stage still issues its launches once per token, i.e. N times per layer. Two batchings remain:

Scope

  1. Dense FFN batched over N — gate/up/down as single GEMM-N launches over the N post-attn-norm rows (à la the existing MatMulN2 / CpuDenseFfn2 two-token path, generalized to N), instead of the per-token GpuDenseFfn/CpuDenseFfn loop. The GEMM-N kernels already exist for Q4_K/Q5_K/Q6_K/F32 (MatMulBatched); the dense FFN gate/up/down just need to feed them. Must stay bit-identical to the per-token path.
  2. GPU-SLRU routed experts grouped-by-expert — analogous to the CPU-MoE BatchedRoutedExperts but on-GPU: load each cached expert's gate/up/down once per layer and matmul against all N tokens routing to it, instead of GpuMoeFfn's per-token SLRU GetOrLoad + matmul. Keeps the routed-expert amortization the CPU-MoE path already gets.
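
A minimal host-side sketch of the grouping step in item 2, written in C++ for illustration only. RoutedToken, GroupByExpert and the two fetch counters are hypothetical names, not code from the repo; the real path would issue one SLRU GetOrLoad plus one batched matmul per group. The slot index is kept because float addition is not associative: if the output has to match the per-token path bit for bit, one option is to sum each token's expert outputs in its original top-k order rather than in expert order.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// One (token, slot) pair routed to an expert. `slot` is the position in the
// token's top-k list, kept so outputs can later be summed in per-token order.
struct RoutedToken {
    int token;
    int slot;
    float weight;
};

using ExpertGroups = std::map<int, std::vector<RoutedToken>>;

// Inverts the routing table: topk[t] lists the expert ids chosen for token t,
// weights[t] the matching router weights. Result maps expert id -> its tokens.
ExpertGroups GroupByExpert(const std::vector<std::vector<int>>& topk,
                           const std::vector<std::vector<float>>& weights) {
    assert(topk.size() == weights.size());
    ExpertGroups groups;
    for (std::size_t t = 0; t < topk.size(); ++t) {
        assert(topk[t].size() == weights[t].size());
        for (std::size_t k = 0; k < topk[t].size(); ++k) {
            groups[topk[t][k]].push_back(
                {static_cast<int>(t), static_cast<int>(k), weights[t][k]});
        }
    }
    return groups;
}

// Expert fetches issued by the per-token loop: one per (token, slot) pair.
std::size_t PerTokenFetches(const std::vector<std::vector<int>>& topk) {
    std::size_t n = 0;
    for (const auto& row : topk) n += row.size();
    return n;
}

// Expert fetches issued by the grouped form: one per distinct expert.
std::size_t GroupedFetches(const ExpertGroups& groups) { return groups.size(); }
```

For three tokens routed to experts {3,1}, {1,5} and {3,5}, the per-token loop issues 6 expert fetches while the grouped form issues 3.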

Related

  • MatMulBatched currently throws NotSupportedException for Q8_0 (and other non-{Q4_K,Q5_K,Q6_K,F32}) weights. A GDN-hybrid model with Q8_0 trunk/FFN projection weights would hit this; add llm_matvec_q8_0_gemm_n if such a model needs the batched path.
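
For reference, a CPU-side sketch of what a Q8_0 GEMM-N computes, in C++. This is not the proposed llm_matvec_q8_0_gemm_n kernel and not code from the repo: BlockQ8, DotQ8 and GemmNQ8 are hypothetical names, and the scale is held as a float here for brevity, whereas GGUF Q8_0 stores it as fp16 next to 32 int8 quants. The point it illustrates is that the batched form can reuse each weight row for all N activation rows while keeping the same accumulation order per output as the single-row matvec.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int kBlock = 32;  // Q8_0 groups 32 int8 quants under one scale

// Simplified Q8_0 block (assumption: float scale instead of fp16).
struct BlockQ8 {
    float scale;
    std::int8_t q[kBlock];
};

// Dot product of one quantized weight row with one activation row.
float DotQ8(const BlockQ8* w, const float* x, int cols) {
    float acc = 0.0f;
    for (int b = 0; b < cols / kBlock; ++b) {
        float blockSum = 0.0f;
        for (int j = 0; j < kBlock; ++j) {
            blockSum += static_cast<float>(w[b].q[j]) * x[b * kBlock + j];
        }
        acc += w[b].scale * blockSum;
    }
    return acc;
}

// GEMM-N reference: y[t * rows + r] = dot(W[r], x[t]). Each weight row is
// visited once and reused for all n activation rows.
std::vector<float> GemmNQ8(const std::vector<BlockQ8>& w, int rows, int cols,
                           const std::vector<float>& x, int n) {
    assert(cols % kBlock == 0);
    const std::size_t blocksPerRow = static_cast<std::size_t>(cols / kBlock);
    assert(w.size() == static_cast<std::size_t>(rows) * blocksPerRow);
    assert(x.size() == static_cast<std::size_t>(n) * cols);
    std::vector<float> y(static_cast<std::size_t>(n) * rows);
    for (int r = 0; r < rows; ++r) {
        const BlockQ8* wRow = &w[static_cast<std::size_t>(r) * blocksPerRow];
        for (int t = 0; t < n; ++t) {
            y[static_cast<std::size_t>(t) * rows + r] =
                DotQ8(wRow, &x[static_cast<std::size_t>(t) * cols], cols);
        }
    }
    return y;
}
```

Because every output goes through the same DotQ8 accumulation, the batched result here matches the per-token result exactly; a GPU kernel would need the same property to satisfy the bit-identical requirement.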

Tests

  • Bit-identical prefill logits vs the sequential Forward loop for both dense (27B-MTP) and GPU-SLRU (35B-A3B with SHARPI_CPU_MOE=0 forced), single- and multi-chunk. Quantify the gain with bench-allrows-1k.ps1 (the 27B-MTP CUDA + GPU-SLRU rows).
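
A sketch of what "bit-identical" could mean in a test helper, in C++ for illustration (FirstBitMismatch is a hypothetical name, not an existing helper in the repo). It compares raw 32-bit patterns rather than using ==, so 0.0f vs -0.0f counts as a mismatch and two NaNs with the same payload count as equal.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Returns the index of the first element whose 32-bit pattern differs,
// or -1 if both vectors are bit-identical. A length mismatch is reported
// at the shorter length.
long FirstBitMismatch(const std::vector<float>& a, const std::vector<float>& b) {
    const std::size_t n = a.size() < b.size() ? a.size() : b.size();
    for (std::size_t i = 0; i < n; ++i) {
        std::uint32_t ua = 0;
        std::uint32_t ub = 0;
        std::memcpy(&ua, &a[i], sizeof ua);  // read the raw bits without aliasing issues
        std::memcpy(&ub, &b[i], sizeof ub);
        if (ua != ub) return static_cast<long>(i);
    }
    return a.size() == b.size() ? -1 : static_cast<long>(n);
}
```

Reporting the first mismatching index (rather than a bool) makes it easier to map a failure back to a token and vocabulary position in the logits buffer.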
