perf(engine,cpu,cuda): remaining GDN-hybrid prefill headroom after #111/#112 (N-input MoE dots + per-position recurrence/SDPA batching) #114

Follow-up to **#111** (GEMM-batched trunk) and **#112** (dequant-once 2-input routed-MoE dots), both implemented on branch `perf/gdn-batched-prefill-110` (not yet merged).

## Where we are after #111 + #112

512-token prefill chunk, Carnice-Qwen3.6-35B-A3B-APEX-MTP, RTX 4070 Ti, `SHARPI_CPU_MOE=1` (warm):

| phase | #110 baseline | after #111+#112 |
|---|---:|---:|
| trunk (attn+GDN) | 6.8 s | **~1.9 s** |
| routed MoE | 4.7 s | **~3.1 s** |
| total / chunk | 11.8 s | **~5.4 s** |
| **prefill** | ~40 tok/s | **~93 tok/s (~2.3×)** |

Bit-exact (`BatchedPrefill_BitwiseMatchesSequential_Carnice` single + multi-chunk + MTP draft logits), 305/305 non-Vulkan ForwardPass tests green. Two cost centres remain.

## A. Routed MoE is now the dominant cost (~57% of wall, ~3.1 s)

#112 dots each expert row against its tokens in **pairs** (`DotQ4K_2In` / `DotQ5K_2In` / new `DotQ3K_Q8KS_2In`), so the expensive Q3_K/Q4_K weight unpack is amortized over only two tokens (`decode/2` per token). Generalize to **N-input** (register-tiled blocks of 4–8 tokens, or true N-input) so the unpack cost falls to `decode/T` per token, where T is the number of tokens routed to that expert (~16 on average for a 512-token chunk).
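A minimal sketch of the decode-once idea, in Python for readability rather than the project's own kernels. The sub-block format here (one scale plus 32 small integers) is a hypothetical stand-in, not the real Q3_K/Q4_K layout, and `unpack`, `dot_single`, and `dot_n_in` are illustrative names. It shows that sharing one unpack across N inputs leaves each input's accumulation order untouched, so the results compare equal to N single dots.

```python
import random

SUB = 32             # values per toy sub-block (assumption, not the Q4_K layout)
UNPACK_CALLS = [0]   # counts how often the "expensive" decode runs

def unpack(sub_block):
    """Decode one toy sub-block (scale, quants) into floats."""
    UNPACK_CALLS[0] += 1
    scale, quants = sub_block
    return [scale * (q - 8) for q in quants]

def dot_single(row, x):
    """Reference: one weight row against one input, unpacking every sub-block."""
    acc = 0.0
    for b, sub_block in enumerate(row):
        w = unpack(sub_block)
        for j in range(SUB):
            acc += w[j] * x[b * SUB + j]
    return acc

def dot_n_in(row, xs):
    """One weight row against N inputs; each sub-block is unpacked once."""
    accs = [0.0] * len(xs)
    for b, sub_block in enumerate(row):
        w = unpack(sub_block)          # decoded once, shared by all inputs
        for i, x in enumerate(xs):
            acc = accs[i]
            for j in range(SUB):       # same sub-block order as dot_single
                acc += w[j] * x[b * SUB + j]
            accs[i] = acc
    return accs

rng = random.Random(0)
row = [(rng.random(), [rng.randrange(16) for _ in range(SUB)]) for _ in range(4)]
xs = [[rng.uniform(-1.0, 1.0) for _ in range(4 * SUB)] for _ in range(8)]

# Exact equality, not a tolerance: each input sees the identical op sequence.
assert dot_n_in(row, xs) == [dot_single(row, x) for x in xs]
```

For a row of B sub-blocks and T tokens, the single-dot path runs B·T unpacks and the N-input path runs B. A real kernel would bound the tile size by register pressure, which is why 4–8 tokens per tile is the suggested range.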

- Extend the existing 2-input precedents to `DotQ4K_NIn` / `DotQ3K_Q8KS_NIn` (and optionally Q5_K). Keep one FP accumulator per input in registers within each tile, and accumulate every input in the identical sub-block order so the result stays **bit-identical** to N single dots (the `BatchedRoutedExperts` byte-parity oracle must hold; see `feedback_q4k_q8k_no_parity_win`).
- Wire into `BatchedRoutedExperts` Phase A (gate+up) and Phase C (down), iterating the per-expert token list (`expTokI[expStart[e]..]`) in tiles instead of pairs.
- Expected: ~1.2–1.5× on routed MoE.
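The tiled iteration over the per-expert token list could look roughly like the sketch below. It assumes a CSR-style layout (an offsets array with one extra trailing entry, plus a flat token-index array), which is what `expTokI[expStart[e]..]` suggests but is not confirmed here. `build_expert_lists` and `expert_tiles` are hypothetical helper names.

```python
def build_expert_lists(routing, num_experts):
    """routing[t] lists the experts chosen for token t.

    Returns (exp_start, exp_tok_i): tokens routed to expert e are
    exp_tok_i[exp_start[e]:exp_start[e + 1]], in ascending token order.
    """
    counts = [0] * num_experts
    for experts in routing:
        for e in experts:
            counts[e] += 1
    exp_start = [0] * (num_experts + 1)
    for e in range(num_experts):
        exp_start[e + 1] = exp_start[e] + counts[e]
    fill = exp_start[:-1].copy()       # next free slot per expert
    exp_tok_i = [0] * exp_start[-1]
    for t, experts in enumerate(routing):
        for e in experts:
            exp_tok_i[fill[e]] = t
            fill[e] += 1
    return exp_start, exp_tok_i

def expert_tiles(exp_start, exp_tok_i, e, tile):
    """Yield expert e's tokens in tiles of at most `tile` tokens."""
    lo, hi = exp_start[e], exp_start[e + 1]
    for base in range(lo, hi, tile):
        yield exp_tok_i[base:min(base + tile, hi)]

routing = [[0, 1], [1, 2], [1, 0], [1, 3], [1, 2]]   # 5 tokens, top-2 of 4 experts
exp_start, exp_tok_i = build_expert_lists(routing, 4)

pairs = list(expert_tiles(exp_start, exp_tok_i, 1, 2))   # pair iteration, as in #112
tiles = list(expert_tiles(exp_start, exp_tok_i, 1, 4))   # tiled iteration
```

The last tile of an expert is generally shorter than the tile size, so an N-input kernel needs either a ragged-tail path or a fallback to the existing 2-input and single-input dots.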

## B. Residual per-position trunk launches (~1.9 s)

#111 deliberately left the positional ops per-position: the **conv1d / delta-net recurrence** and **KV-append / SDPA** each still issue one launch per token. These launches are now the residual trunk cost.

- **Chunked-scan delta-net**: process the prompt chunk in blocks with a parallel intra-block scan + sequential inter-block state carry, collapsing the ~512 per-token `GdnRecurrenceDecode` launches/layer. Higher complexity and FP-order/parity risk.
- **Batched-prefill attention**: `CudaBackend.FullSeqAttention` currently throws. A batched-query SDPA kernel (grid over N queries × heads, reading the populated KV ring) would replace the per-token `KvAppend`+`Attention` loop after a batched KV-append.
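To make the chunked-scan structure and its parity risk concrete, here is a small sketch on a scalar linear recurrence `h_t = a_t * h_{t-1} + b_t`. This is a deliberately simplified stand-in: the real delta-net state is a matrix with a more involved update, and `scan_sequential` / `scan_chunked` are illustrative names only.

```python
import math
import random

def scan_sequential(a, b, h0=0.0):
    """Reference: one step per position, like the per-token launches."""
    out, h = [], h0
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out

def scan_chunked(a, b, block, h0=0.0):
    """Blocked scan: intra-block prefixes first, then a sequential state carry."""
    # Pass 1 (parallelizable across blocks): for every position, the pair
    # (A, B) such that h_t = A * h_in + B, where h_in is the block's input state.
    prefixes = []
    for lo in range(0, len(a), block):
        A, B, block_prefix = 1.0, 0.0, []
        for a_t, b_t in zip(a[lo:lo + block], b[lo:lo + block]):
            A, B = a_t * A, a_t * B + b_t
            block_prefix.append((A, B))
        prefixes.append(block_prefix)
    # Pass 2 (sequential, one step per block): carry the state across blocks.
    out, h_in = [], h0
    for block_prefix in prefixes:
        out.extend(A * h_in + B for A, B in block_prefix)
        h_in = out[-1]
    return out

rng = random.Random(0)
a = [rng.uniform(0.5, 1.0) for _ in range(512)]
b = [rng.uniform(-1.0, 1.0) for _ in range(512)]

ref = scan_sequential(a, b)
blk = scan_chunked(a, b, block=64)
# Close, but not guaranteed bit-identical: the blocked form reassociates the FP ops.
assert all(math.isclose(x, y, rel_tol=1e-9, abs_tol=1e-9) for x, y in zip(ref, blk))
```

The sequential dependency drops from one step per token to one step per block, but the reassociation means a bitwise oracle such as `BatchedPrefill_BitwiseMatchesSequential_Carnice` may not hold for this path without a tolerance or a carefully order-preserving formulation.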

## C. Docs (at merge time)

Add a "GDN-hybrid batched prefill" subsection to the README with a measured `bench-prefill.ps1` long-prompt scaling table (#110/#111/#112). NB: the benchmark-table Carnice `Prefill 20.6` cell is the **short-prompt** `bench-carnice.ps1` number — do **not** overwrite it with the long-prompt rate.

## Notes
- All new bit-exact kernels live in `CudaTextKernels.cs` / `CudaBackend.cs` (GEMM-N + batched norms, `View`) and `SimdKernels.cs` (`DotQ3K_Q8KS_2In`). Unit tests: `CudaMatMulBatchedTests`, `SimdKernelsQ8KSTests.DotQ3K_Q8KS_2In_BitwiseMatchesSingle`.
- Shared-expert/routed GPU overlap remains perf-neutral (cost is host launch-issue, not GPU compute).
