Skip to content

perf(engine,cpu,cuda): remaining GDN-hybrid prefill headroom after #111/#112 (N-input MoE dots + per-position recurrence/SDPA batching) #114

@pekkah

Description

@pekkah

@
Follow-up to #111 (GEMM-batched trunk) and #112 (dequant-once 2-input routed-MoE dots), both implemented on branch perf/gdn-batched-prefill-110 (not yet merged).

Where we are after #111 + #112

512-token prefill chunk, Carnice-Qwen3.6-35B-A3B-APEX-MTP, RTX 4070 Ti, SHARPI_CPU_MOE=1 (warm):

phase #110 baseline after #111+#112
trunk (attn+GDN) 6.8 s ~1.9 s
routed MoE 4.7 s ~3.1 s
total / chunk 11.8 s ~5.4 s
prefill ~40 tok/s ~93 tok/s (~2.3×)

Bit-exact (BatchedPrefill_BitwiseMatchesSequential_Carnice single + multi-chunk + MTP draft logits), 305/305 non-Vulkan ForwardPass tests green. Two cost centres remain.

A. Routed MoE is now the dominant cost (~57% of wall, ~3.1 s)

#112 dots each expert-row against its tokens in pairs (DotQ4K_2In / DotQ5K_2In / new DotQ3K_Q8KS_2In), so the expensive Q3_K/Q4_K weight unpack is amortized only decode/2. Generalize to N-input (register-tiled tiles of 4–8 tokens, or true N-input) so the unpack amortizes decode/T where T = tokens routing to that expert (~16 avg at N=512).

  • Extend the existing 2-input precedents to DotQ4K_NIn / DotQ3K_Q8KS_NIn (and optionally Q5_K). Keep N FP accumulators register-tiled per tile; each input accumulated in the identical sub-block order so it stays bit-identical to N single dots (the BatchedRoutedExperts byte-parity oracle must hold — see feedback_q4k_q8k_no_parity_win).
  • Wire into BatchedRoutedExperts Phase A (gate+up) and Phase C (down), iterating the per-expert token list (expTokI[expStart[e]..]) in tiles instead of pairs.
  • Expected: ~1.2–1.5× on routed MoE.

B. Residual per-position trunk launches (~1.9 s)

#111 deliberately kept the conv1d / delta-net recurrence and KV-append / SDPA per-position (one launch/token each — the positional ops). These are now the residual trunk cost.

  • Chunked-scan delta-net: process the prompt chunk in blocks with a parallel intra-block scan + sequential inter-block state carry, collapsing the ~512 per-token GdnRecurrenceDecode launches/layer. Higher complexity and FP-order/parity risk.
  • Batched-prefill attention: CudaBackend.FullSeqAttention currently throws. A batched-query SDPA kernel (grid over N queries × heads, reading the populated KV ring) would replace the per-token KvAppend+Attention loop after a batched KV-append.

C. Docs (at merge time)

Add a "GDN-hybrid batched prefill" subsection to the README with a measured bench-prefill.ps1 long-prompt scaling table (#110/#111/#112). NB: the benchmark-table Carnice Prefill 20.6 cell is the short-prompt bench-carnice.ps1 number — do not overwrite it with the long-prompt rate.

Notes

  • All new bit-exact kernels live in CudaTextKernels.cs / CudaBackend.cs (GEMM-N + batched norms, View) and SimdKernels.cs (DotQ3K_Q8KS_2In). Unit tests: CudaMatMulBatchedTests, SimdKernelsQ8KSTests.DotQ3K_Q8KS_2In_BitwiseMatchesSingle.
  • Shared-expert/routed GPU overlap remains perf-neutral (cost is host launch-issue, not GPU compute).
    @

Metadata

Metadata

Assignees

No one assigned

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions