@
Follow-up to #111 (GEMM-batched trunk) and #112 (dequant-once 2-input routed-MoE dots), both implemented on branch perf/gdn-batched-prefill-110 (not yet merged).
Where we are after #111 + #112
512-token prefill chunk, Carnice-Qwen3.6-35B-A3B-APEX-MTP, RTX 4070 Ti, SHARPI_CPU_MOE=1 (warm):
| phase | #110 baseline | after #111+#112 |
|---|---|---|
| trunk (attn+GDN) | 6.8 s | ~1.9 s |
| routed MoE | 4.7 s | ~3.1 s |
| total / chunk | 11.8 s | ~5.4 s |
| prefill | ~40 tok/s | ~93 tok/s (~2.3×) |
Bit-exact (BatchedPrefill_BitwiseMatchesSequential_Carnice single + multi-chunk + MTP draft logits), 305/305 non-Vulkan ForwardPass tests green. Two cost centres remain.
A. Routed MoE is now the dominant cost (~57% of wall, ~3.1 s)
#112 dots each expert row against its tokens in pairs (DotQ4K_2In / DotQ5K_2In / new DotQ3K_Q8KS_2In), so the expensive Q3_K/Q4_K weight unpack is shared by only two tokens (decode/2 per token). Generalize to N-input (register tiles of 4–8 tokens, or true N-input) so the unpack amortizes to decode/T, where T is the number of tokens routed to that expert (~16 on average at N=512); with fixed tiles the per-token cost is decode/tile-size.
- Extend the existing 2-input precedents to DotQ4K_NIn / DotQ3K_Q8KS_NIn (and optionally Q5_K). Keep N FP accumulators register-tiled per tile; each input accumulated in the identical sub-block order so it stays bit-identical to N single dots (the BatchedRoutedExperts byte-parity oracle must hold; see feedback_q4k_q8k_no_parity_win).
- Wire into BatchedRoutedExperts Phase A (gate+up) and Phase C (down), iterating the per-expert token list (expTokI[expStart[e]..]) in tiles instead of pairs.
- Expected: ~1.2–1.5× on routed MoE.
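The tiling idea above can be sketched in a few lines of Python. This is a toy model, not the SimdKernels.cs code: the block format (32 integer weights plus one float scale) is a stand-in for Q3_K/Q4_K, `dequant_block`, `dot_n_in` and `expert_row_dots` are hypothetical names, and the CSR-style layout of `exp_start` / `exp_tok` (mirroring expStart / expTokI) is an assumption. It only illustrates that each block is unpacked once per tile while every token keeps its own accumulator, updated in the same sub-block order as a single dot.

```python
import random

BLOCK = 32  # toy sub-block size; real Q3_K/Q4_K super-blocks are more involved


def dequant_block(q, scale):
    """Stand-in for the expensive weight unpack: ints -> floats."""
    return [scale * v for v in q]


def block_partial(w, xs):
    """Partial dot for one sub-block, always accumulated left to right."""
    part = 0.0
    for wi, xi in zip(w, xs):
        part += wi * xi
    return part


def dot_single(row, x):
    """Reference: one expert row against one token, unpacking every block."""
    acc = 0.0
    for b, (q, scale) in enumerate(row):
        w = dequant_block(q, scale)
        acc += block_partial(w, x[b * BLOCK:(b + 1) * BLOCK])
    return acc


def dot_n_in(row, xs_tile):
    """N-input dot: unpack each block once, then feed every token in the tile."""
    accs = [0.0] * len(xs_tile)
    unpacks = 0
    for b, (q, scale) in enumerate(row):
        w = dequant_block(q, scale)  # shared by all tokens in the tile
        unpacks += 1
        for t, x in enumerate(xs_tile):
            accs[t] += block_partial(w, x[b * BLOCK:(b + 1) * BLOCK])
    return accs, unpacks


def expert_row_dots(row, tokens, exp_start, exp_tok, e, tile=4):
    """Walk expert e's token list in tiles instead of pairs."""
    ids = exp_tok[exp_start[e]:exp_start[e + 1]]
    out, unpacks = {}, 0
    for i in range(0, len(ids), tile):
        chunk = ids[i:i + tile]
        accs, n = dot_n_in(row, [tokens[t] for t in chunk])
        unpacks += n
        out.update(zip(chunk, accs))
    return out, unpacks


random.seed(0)
n_blocks = 4
row = [([random.randint(-8, 7) for _ in range(BLOCK)], random.uniform(0.01, 0.1))
       for _ in range(n_blocks)]
tokens = [[random.uniform(-1.0, 1.0) for _ in range(BLOCK * n_blocks)]
          for _ in range(10)]
exp_start = [0, 6, 10]                   # expert 0 owns exp_tok[0:6]
exp_tok = [0, 2, 3, 5, 7, 9, 1, 4, 6, 8]

out, unpacks = expert_row_dots(row, tokens, exp_start, exp_tok, e=0, tile=4)
ids = exp_tok[exp_start[0]:exp_start[1]]
bit_identical = all(out[t] == dot_single(row, tokens[t]) for t in ids)
print(bit_identical, unpacks, len(ids) * n_blocks)
```

In this toy, the six tokens of expert 0 cost two tiles of unpacking instead of six single-token passes, and the per-token results compare equal with `==` because each token sees exactly the same sequence of floating-point operations. Whether the real SIMD kernels keep that property depends on preserving the sub-block order inside the vectorized accumulators, which is what the byte-parity oracle checks.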
B. Residual per-position trunk launches (~1.9 s)
#111 deliberately kept the conv1d / delta-net recurrence and KV-append / SDPA per-position (one launch/token each — the positional ops). These are now the residual trunk cost.
- Chunked-scan delta-net: process the prompt chunk in blocks with a parallel intra-block scan + sequential inter-block state carry, collapsing the ~512 per-token GdnRecurrenceDecode launches/layer. Higher complexity and FP-order/parity risk.
- Batched-prefill attention: CudaBackend.FullSeqAttention currently throws. A batched-query SDPA kernel (grid over N queries × heads, reading the populated KV ring) would replace the per-token KvAppend+Attention loop after a batched KV-append.
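As a rough illustration of the chunked-scan shape and of why it carries parity risk, here is a minimal Python sketch on a scalar linear recurrence h_t = a_t·h_{t-1} + b_t. This is a deliberate simplification: the real delta-net state is matrix-valued and its update is more involved, and `scan_sequential`, `block_prefix` and `scan_chunked` are hypothetical names. Each block's prefix composites do not depend on the incoming state, so they could be computed in parallel; only the carry between blocks is sequential.

```python
import math
import random


def scan_sequential(a, b, h0=0.0):
    """Reference: one step per position, like a per-token launch."""
    h, out = h0, []
    for at, bt in zip(a, b):
        h = at * h + bt
        out.append(h)
    return out


def block_prefix(a, b):
    """Composites (A_t, B_t) with h_t = A_t * h_in + B_t; independent of h_in."""
    A, B, out = 1.0, 0.0, []
    for at, bt in zip(a, b):
        A, B = at * A, at * B + bt
        out.append((A, B))
    return out


def scan_chunked(a, b, block=64, h0=0.0):
    """Intra-block prefixes (parallelizable) + sequential inter-block carry."""
    prefixes = [block_prefix(a[i:i + block], b[i:i + block])
                for i in range(0, len(a), block)]
    h_in, out = h0, []
    for pref in prefixes:
        out.extend(A * h_in + B for A, B in pref)
        h_in = out[-1]  # state carried into the next block
    return out


random.seed(0)
n = 512
a = [random.uniform(0.9, 1.0) for _ in range(n)]
b = [random.uniform(-1.0, 1.0) for _ in range(n)]

ref = scan_sequential(a, b)
chk = scan_chunked(a, b, block=64)
close = all(math.isclose(x, y, rel_tol=1e-9, abs_tol=1e-9)
            for x, y in zip(ref, chk))
mismatches = sum(x != y for x, y in zip(ref, chk))
print(close, mismatches)
```

The chunked result agrees with the sequential one to within rounding, but the reassociated arithmetic means bitwise equality is not guaranteed, so `mismatches` may be non-zero. A chunked-scan kernel would therefore need either a tolerance-based test or an accumulation order designed to reproduce the sequential one.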
C. Docs (at merge time)
Add a "GDN-hybrid batched prefill" subsection to the README with a measured bench-prefill.ps1 long-prompt scaling table (#110/#111/#112). NB: the benchmark-table Carnice Prefill 20.6 cell is the short-prompt bench-carnice.ps1 number — do not overwrite it with the long-prompt rate.
Notes
- All new bit-exact kernels live in CudaTextKernels.cs / CudaBackend.cs (GEMM-N + batched norms, View) and SimdKernels.cs (DotQ3K_Q8KS_2In). Unit tests: CudaMatMulBatchedTests, SimdKernelsQ8KSTests.DotQ3K_Q8KS_2In_BitwiseMatchesSingle.
- Shared-expert/routed GPU overlap remains perf-neutral (cost is host launch-issue, not GPU compute).
@