perf(cuda): batch the FFN/MoE stage of the dense + GPU-SLRU GDN-hybrid prefill (#119 follow-up) #121

Follow-up to **#119** (PR #120). `PrefillBatchedTrunkGpuFfn` batches the **trunk** (GDN/attn launches collapsed via #111/#114-B) for the dense GDN-hybrid and GPU-SLRU MoE configs, but the **FFN/MoE stage still runs per token**:

```csharp
// PrefillBatchedTrunkGpuFfn
TrunkBlockBatched(layer, N, ...);          // batched
for (int i = 0; i < N; i++) {              // per token
    CopyDeviceRegion(_gpuNormBuf, ..., moeNorm, i*embDim, ...);
    if (!isMoe)  GpuDenseFfn(layer);        // or CpuDenseFfn(layer)
    else         GpuMoeFfn(layer);          // GPU-SLRU routed experts, per token
    // then: AddInPlace + write stream slice
}
```

So the trunk win lands, but the FFN/MoE stage still runs once per token, i.e. N separate times per layer. Two batchings remain:

## Scope
1. **Dense FFN batched over N** — gate/up/down as single GEMM-N launches over the N post-attn-norm rows (à la the existing `MatMulN2` / `CpuDenseFfn2` two-token path, generalized to N), instead of the per-token `GpuDenseFfn`/`CpuDenseFfn` loop. The GEMM-N kernels already exist for Q4_K/Q5_K/Q6_K/F32 (`MatMulBatched`); the dense FFN gate/up/down just need to feed them. Must stay bit-identical to the per-token path.
2. **GPU-SLRU routed experts grouped-by-expert** — analogous to the CPU-MoE `BatchedRoutedExperts` but on-GPU: load each cached expert's gate/up/down once per layer and matmul against all N tokens routing to it, instead of `GpuMoeFfn`'s per-token SLRU `GetOrLoad` + matmul. Keeps the routed-expert amortization the CPU-MoE path already gets.
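The grouping step in item 2 can be sketched on the host side as follows. This is a minimal illustration, not the actual `BatchedRoutedExperts` code: `ExpertGrouping`, `GroupByExpert`, and the `routes` layout (per-token top-k `(Expert, Weight)` pairs) are hypothetical names and shapes chosen for the example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the codebase: buckets routed tokens by expert id
// so each expert's gate/up/down weights need to be fetched only once per layer.
static class ExpertGrouping
{
    // routes[t] holds the top-k (expert, weight) choices for token t (assumed layout).
    public static SortedDictionary<int, List<(int Token, float Weight)>> GroupByExpert(
        (int Expert, float Weight)[][] routes)
    {
        var groups = new SortedDictionary<int, List<(int Token, float Weight)>>();
        for (int t = 0; t < routes.Length; t++)
        {
            foreach (var (expert, weight) in routes[t])
            {
                if (!groups.TryGetValue(expert, out var bucket))
                {
                    bucket = new List<(int Token, float Weight)>();
                    groups[expert] = bucket;
                }
                // Tokens are appended in ascending token order within each bucket.
                bucket.Add((t, weight));
            }
        }
        return groups;
    }
}
```

Each bucket then maps to one expert load plus one matmul over that bucket's rows. One caveat for the bit-identity requirement: grouping changes the order in which a token's expert outputs become available, so the weighted accumulation per token would presumably have to follow the same expert order as the per-token path.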

## Related
- `MatMulBatched` currently throws `NotSupportedException` for Q8_0 (and other non-{Q4_K,Q5_K,Q6_K,F32}) weights. A GDN-hybrid model with Q8_0 trunk/FFN projection weights would hit this; add `llm_matvec_q8_0_gemm_n` if such a model needs the batched path.
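Until a Q8_0 GEMM-N kernel exists, one option is to check the weight types up front instead of relying on the exception. The sketch below is only an illustration of that check: `QuantType`, `BatchedSupport`, and `HasGemmN` are hypothetical names, and the supported set simply mirrors the list given in this issue.

```csharp
using System;

// Hypothetical enum for illustration; the real weight-type representation is not shown in this issue.
enum QuantType { F32, Q4_K, Q5_K, Q6_K, Q8_0 }

static class BatchedSupport
{
    // True for the types this issue lists as supported by MatMulBatched.
    public static bool HasGemmN(QuantType t) =>
        t == QuantType.F32 || t == QuantType.Q4_K ||
        t == QuantType.Q5_K || t == QuantType.Q6_K;
}
```

A caller could evaluate this for the gate/up/down weights of a layer and keep the per-token loop when any of them returns false.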

## Tests
- Bit-identical prefill logits vs the sequential `Forward` loop for both dense (27B-MTP) and GPU-SLRU (35B-A3B with `SHARPI_CPU_MOE=0` forced), in single-chunk and multi-chunk runs. Quantify the gain with `bench-allrows-1k.ps1` (the 27B-MTP CUDA + GPU-SLRU rows).
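The shape of such a check can be sketched on the CPU with plain F32 arrays. Everything here is a stand-in: `BitExactCheck`, `MatVec`, `MatMulRows`, and `FirstMismatch` are hypothetical names, not the project's kernels. The point it illustrates is that a batched matmul stays bit-identical to the per-row matvec only if each output element accumulates its products in the same order, and that the comparison should be on bit patterns rather than with a tolerance.

```csharp
using System;

static class BitExactCheck
{
    // Reference path: one input row at a time, left-to-right accumulation.
    public static float[] MatVec(float[] w, int outDim, int inDim, ReadOnlySpan<float> x)
    {
        var y = new float[outDim];
        for (int o = 0; o < outDim; o++)
        {
            float acc = 0f;
            for (int k = 0; k < inDim; k++) acc += w[o * inDim + k] * x[k];
            y[o] = acc;
        }
        return y;
    }

    // Batched path: each weight row is visited once and applied to all n input rows,
    // keeping the same per-element accumulation order as MatVec.
    public static float[] MatMulRows(float[] w, int outDim, int inDim, float[] xs, int n)
    {
        var ys = new float[n * outDim];
        for (int o = 0; o < outDim; o++)
            for (int r = 0; r < n; r++)
            {
                float acc = 0f;
                for (int k = 0; k < inDim; k++) acc += w[o * inDim + k] * xs[r * inDim + k];
                ys[r * outDim + o] = acc;
            }
        return ys;
    }

    // Index of the first element whose bit pattern differs, or -1 if all match.
    public static int FirstMismatch(ReadOnlySpan<float> expected, ReadOnlySpan<float> actual)
    {
        if (expected.Length != actual.Length)
            throw new ArgumentException("length mismatch");
        for (int i = 0; i < expected.Length; i++)
            if (BitConverter.SingleToInt32Bits(expected[i]) != BitConverter.SingleToInt32Bits(actual[i]))
                return i;
        return -1;
    }
}
```

Comparing via `BitConverter.SingleToInt32Bits` rather than `==` also distinguishes `+0` from `-0` and treats identical NaN payloads as equal, which is what "bit-identical" requires.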
