Follow-up to #119 (PR #120). PrefillBatchedTrunkGpuFfn batches the trunk (GDN/attn launches collapsed via #111/#114-B) for the dense GDN-hybrid and GPU-SLRU MoE configs, but the FFN/MoE stage still runs per token:
// PrefillBatchedTrunkGpuFfn (per layer)
TrunkBlockBatched(layer, N, ...);                 // batched
for (int i = 0; i < N; i++) {                     // per token
    CopyDeviceRegion(_gpuNormBuf, ..., moeNorm, i*embDim, ...);
    if (!isMoe) GpuDenseFfn(layer);               // or CpuDenseFfn(layer)
    else        GpuMoeFfn(layer);                 // GPU-SLRU routed experts, per token
    // AddInPlace + write stream slice
}
So the trunk win lands, but the FFN stage still runs as N separate per-token passes in every layer. Two batchings remain:
Scope
- Dense FFN batched over N: gate/up/down as single GEMM-N launches over the N post-attn-norm rows (à la the existing MatMulN2 / CpuDenseFfn2 two-token path, generalized to N), instead of the per-token GpuDenseFfn/CpuDenseFfn loop. The GEMM-N kernels already exist for Q4_K/Q5_K/Q6_K/F32 (MatMulBatched); the dense FFN gate/up/down just need to feed them. Must stay bit-identical to the per-token path.
- GPU-SLRU routed experts grouped by expert: analogous to the CPU-MoE BatchedRoutedExperts but on-GPU. Load each cached expert's gate/up/down once per layer and matmul against all N tokens routing to it, instead of GpuMoeFfn's per-token SLRU GetOrLoad + matmul. This gives the GPU path the routed-expert amortization the CPU-MoE path already gets.
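The grouping step in the second item can be sketched on the host side as below. This is a minimal C++ illustration rather than the project's code: `Route`, `GroupByExpert`, `PerTokenLookups`, and `GroupedLookups` are hypothetical names, and the routing table is assumed to already be available as a flat list of (token, expert, slot, weight) records.

```cpp
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical routing record: token `token` picked expert `expert` as its
// `slot`-th choice (position in that token's top-k list) with gate `weight`.
struct Route {
    int token;
    int expert;
    int slot;
    float weight;
};

// Bucket routing records by expert id so each expert's gate/up/down weights
// are fetched once per layer and applied to every token that routed to it.
// std::map iterates experts in ascending id order; inside a bucket the
// records keep their original (token) order.
std::map<int, std::vector<Route>> GroupByExpert(const std::vector<Route>& routes) {
    std::map<int, std::vector<Route>> buckets;
    for (const Route& r : routes) buckets[r.expert].push_back(r);
    return buckets;
}

// Expert lookups issued by the per-token loop: one per (token, expert) pair.
std::size_t PerTokenLookups(const std::vector<Route>& routes) {
    return routes.size();
}

// Expert lookups issued by the grouped form: one per distinct expert.
std::size_t GroupedLookups(const std::vector<Route>& routes) {
    return GroupByExpert(routes).size();
}
```

One thing to watch for the bit-identical requirement: grouping changes the order in which a token's expert outputs become available. Writing each result into a per-(token, slot) buffer and summing in slot order afterwards would keep the per-token accumulation order unchanged; whether the existing per-token path sums in slot order is an assumption to check against GpuMoeFfn.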
Related
MatMulBatched currently throws NotSupportedException for Q8_0 (and other non-{Q4_K,Q5_K,Q6_K,F32}) weights. A GDN-hybrid model with Q8_0 trunk/FFN projection weights would hit this; add llm_matvec_q8_0_gemm_n if such a model needs the batched path.
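For reference, here is a CPU sketch of what a Q8_0 GEMM-N would compute, useful as an oracle when writing the kernel. Assumptions: the GGUF-style Q8_0 layout of 32 signed 8-bit values sharing one scale, with the scale held as a float here for simplicity (GGUF stores it in half precision); `BlockQ8` and `Q8GemmNRef` are hypothetical names, not existing functions.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed Q8_0-style block: 32 int8 values sharing one scale.
constexpr int kBlock = 32;
struct BlockQ8 {
    float scale;            // float for simplicity; half precision in GGUF
    std::int8_t q[kBlock];
};

// Reference GEMM-N: y[n][r] = dot(W[r], x[n]) for n in [0, N).
// W is rows x cols, stored row-major as rows * (cols / kBlock) blocks.
// x is N x cols and y is N x rows, both row-major.
// cols must be a multiple of kBlock.
// Each weight block is read once and reused for all N activation rows,
// which is the amortization a GEMM-N kernel gets over N matvec launches.
void Q8GemmNRef(const std::vector<BlockQ8>& W, int rows, int cols,
                const std::vector<float>& x, int N, std::vector<float>& y) {
    const int blocksPerRow = cols / kBlock;
    y.assign(static_cast<std::size_t>(N) * rows, 0.0f);
    for (int r = 0; r < rows; ++r) {
        for (int b = 0; b < blocksPerRow; ++b) {
            const BlockQ8& blk = W[static_cast<std::size_t>(r) * blocksPerRow + b];
            for (int n = 0; n < N; ++n) {
                const float* xn = &x[static_cast<std::size_t>(n) * cols + b * kBlock];
                float acc = 0.0f;
                for (int k = 0; k < kBlock; ++k) {
                    acc += static_cast<float>(blk.q[k]) * xn[k];
                }
                // Apply the block scale once per (row, block, token).
                y[static_cast<std::size_t>(n) * rows + r] += blk.scale * acc;
            }
        }
    }
}
```

The per-(token, row) accumulation visits blocks in ascending order regardless of N, so the result for a given token does not depend on how many other tokens are in the batch. A GPU kernel with a different reduction tree would not necessarily match this oracle bit for bit.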
Tests
- Bit-identical prefill logits vs the sequential Forward loop for both dense (27B-MTP) and GPU-SLRU (35B-A3B with SHARPI_CPU_MOE=0 forced), single- and multi-chunk.
- Quantify with bench-allrows-1k.ps1 (the 27B-MTP CUDA + GPU-SLRU rows).
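The bit-identity check itself can be sketched as follows: a per-token matvec and a batched GEMM-N that share the same per-row reduction, compared bytewise rather than with a tolerance. This is an F32 CPU illustration with hypothetical names (`Dot`, `MatVec`, `MatMulN`, `BitIdentical`), not the project's test harness.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Shared left-to-right dot product, so both paths reduce in the same order.
float Dot(const float* w, const float* x, int cols) {
    float acc = 0.0f;
    for (int c = 0; c < cols; ++c) acc += w[c] * x[c];
    return acc;
}

// Per-token path: y = W * x for one activation row (W is rows x cols, row-major).
void MatVec(const std::vector<float>& W, int rows, int cols,
            const float* x, float* y) {
    for (int r = 0; r < rows; ++r) {
        y[r] = Dot(&W[static_cast<std::size_t>(r) * cols], x, cols);
    }
}

// Batched path: weight row in the outer loop, token in the inner loop, so
// each weight row is visited once for all N tokens. X is N x cols, Y is N x rows.
void MatMulN(const std::vector<float>& W, int rows, int cols,
             const std::vector<float>& X, int N, std::vector<float>& Y) {
    Y.assign(static_cast<std::size_t>(N) * rows, 0.0f);
    for (int r = 0; r < rows; ++r) {
        const float* w = &W[static_cast<std::size_t>(r) * cols];
        for (int n = 0; n < N; ++n) {
            Y[static_cast<std::size_t>(n) * rows + r] =
                Dot(w, &X[static_cast<std::size_t>(n) * cols], cols);
        }
    }
}

// Bytewise comparison: stricter than ==, since it also distinguishes
// +0.0 from -0.0 and compares NaN payloads.
bool BitIdentical(const std::vector<float>& a, const std::vector<float>& b) {
    if (a.size() != b.size()) return false;
    if (a.empty()) return true;
    return std::memcmp(a.data(), b.data(), a.size() * sizeof(float)) == 0;
}
```

Changing the loop nest does not change the bits as long as each (token, row) reduction keeps its order; a batched kernel that splits or reorders the reduction over `cols` would fail this check even when the results agree to within rounding.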