perf(cpu): VNNI (vpdpbusd) + AVX-512 paths for the int8 Q8K/Q8KS dot kernels — the CPU-MoE hot path runs on AVX2 vpmaddubsw only

## Context

The CPU int8 dot kernels are the decode bottleneck on the target 12 GB / 64 GB machine: CPU-MoE is the auto-selected MoE config there (#100, `CudaHybridGdnForwardPass.cs:886-909`), and the Q8_KS routed-expert dots are what the 4.9× Carnice win (`SHARPI_Q3K_Q8K`) runs on. More models land on this path with #215.

## Problem

`SimdKernels.cs` implements the whole int8 dot family (`DotQ3K_Q8K(S)`, `DotQ8_0_Q8K`, `DotQ6K_Q8K`, the `_2In` variants, …) on **AVX2 256-bit** `Avx2.MultiplyAddAdjacent` chains (vpmaddubsw → vpmaddwd → vpaddd, e.g. `SimdKernels.cs:1956-1981`). Two generations of ISA are left unused:

- **VNNI is absent entirely** — zero hits for `AvxVnni` in the repo. `vpdpbusd` does u8×s8→s32 multiply-accumulate in **one** instruction, replacing the 2-3 instruction vpmaddubsw/vpmaddwd chain.
- **AVX-512 coverage is thin**: 7 `Avx512F.IsSupported` gates (Q4_K/Q5_K float-path dots) vs 273 `Vector256` uses; none of the int8 Q8K/Q8KS kernels have a 512-bit path. Zen 5 executes 512-bit natively; Zen 4 double-pumps but still wins on instruction count.

llama.cpp's ggml already ships VNNI int8 dot paths, so any CPU-MoE comparison against it concedes an ISA advantage on modern CPUs.
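To make the instruction-count argument concrete, here is a minimal scalar sketch in C of what each sequence computes per 32-bit lane. It is a model of the documented instruction semantics, not code from `SimdKernels.cs`, and the function names are hypothetical. The point it illustrates is that the two forms agree only while the `vpmaddubsw` pair sum stays inside int16. Whether the existing kernels can ever reach that saturation depends on their operand ranges, which this issue does not state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp to the int16 range, as vpmaddubsw does for each pair sum. */
static int16_t sat16(int32_t v) {
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Model of the AVX2 chain: vpmaddubsw (u8*s8 pair sums saturated to int16),
   vpmaddwd against ones (adjacent int16 pair widened to int32), vpaddd.
   Wraparound of the 32-bit accumulator is not modeled. */
int32_t dot_avx2_chain_model(const uint8_t *u, const int8_t *s, size_t n) {
    assert(n % 4 == 0);
    int32_t acc = 0;
    for (size_t i = 0; i < n; i += 4) {
        int16_t lo = sat16(u[i] * s[i] + u[i + 1] * s[i + 1]);
        int16_t hi = sat16(u[i + 2] * s[i + 2] + u[i + 3] * s[i + 3]);
        acc += (int32_t)lo + (int32_t)hi;
    }
    return acc;
}

/* Model of vpdpbusd: four u8*s8 products summed straight into int32,
   with no intermediate int16 saturation. */
int32_t dot_vpdpbusd_model(const uint8_t *u, const int8_t *s, size_t n) {
    assert(n % 4 == 0);
    int32_t acc = 0;
    for (size_t i = 0; i < n; i += 4) {
        for (size_t k = 0; k < 4; k++) {
            acc += (int32_t)u[i + k] * (int32_t)s[i + k];
        }
    }
    return acc;
}
```

For inputs such as `u = {10, 20, 30, 40}`, `s = {1, -2, 3, -4}` both models return the same sum. With `u = {255, 255, 0, 0}` and `s = {127, 127, 0, 0}` the chain model clamps the first pair to 32767 while the `vpdpbusd` model keeps the full 64770, which is the case a parity test would need to rule in or out.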

## What .NET actually exposes (verified 2026-06, corrected from the original issue text)

- **`AvxVnni` (VEX, 128/256-bit)** — available and stable on net10.0 (preview attribute was lifted in .NET 9). `MultiplyWideningAndAdd` = vpdpbusd, `MultiplyWideningAndAddSaturate` = vpdpbusds. Hardware gate is the **VEX AVX-VNNI CPUID flag**: Intel Alder Lake+ (incl. E-cores) and **Zen 5**. **Zen 4 advertises only EVEX AVX512-VNNI, so expect `AvxVnni.IsSupported == false` there — verify with a runtime probe on a 7000-series.**
- **512-bit VNNI is NOT in .NET 10.** `Avx512Vnni` does not exist. The only 512-bit VNNI surface today is `Avx10v1.V512` (.NET 9+), which lights up only on Intel AVX10 hardware (Granite-Rapids-class), not AMD.
- **.NET 11 is adding `AvxVnni.V512`** (512-bit VPDPBUSD/VPDPWSSD/VPDPBUSDS): dotnet/runtime#86849 is api-approved and the implementation PR dotnet/runtime#128365 is approved-pending-checks as of June 2026 — on track for .NET 11, not guaranteed until merged.
- `AvxVnniInt8` (vpdpbssd, s8×s8 — would drop the unsigned-bias trick entirely) exists in net10.0 but its hardware (Sierra Forest / Arrow Lake / Lunar Lake) is rare in the target audience; note only.
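As a worked example of the unsigned-bias trick mentioned above: `vpdpbusd` needs one unsigned operand, so a signed×signed dot can be computed by adding 128 to one side and subtracting `128 * sum(b)` afterwards. The scalar C sketch below assumes that form of the trick; the repo's kernels may use a different variant (for instance a sign-transfer form), and the function names and the 256-element scratch limit are illustrative assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* u8 x s8 dot product: the operand shape vpdpbusd accepts. */
int32_t dot_u8s8(const uint8_t *u, const int8_t *s, size_t n) {
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++) acc += (int32_t)u[i] * (int32_t)s[i];
    return acc;
}

/* Plain s8 x s8 reference. */
int32_t dot_s8s8_direct(const int8_t *a, const int8_t *b, size_t n) {
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++) acc += (int32_t)a[i] * (int32_t)b[i];
    return acc;
}

/* s8 x s8 via the bias trick:
   sum((a + 128) * b) = sum(a * b) + 128 * sum(b),
   so sum(a * b) = dot_u8s8(a + 128, b) - 128 * sum(b). */
int32_t dot_s8s8_biased(const int8_t *a, const int8_t *b, size_t n) {
    assert(n <= 256); /* assumed block length limit for the scratch buffer */
    uint8_t biased[256];
    int32_t sum_b = 0;
    for (size_t i = 0; i < n; i++) {
        biased[i] = (uint8_t)((int)a[i] + 128); /* maps [-128, 127] to [0, 255] */
        sum_b += b[i];
    }
    return dot_u8s8(biased, b, n) - 128 * sum_b;
}
```

The correction term only needs the sum of the signed operand, which can be computed once per block and reused across every row dotted against it. A native s8×s8 instruction removes both the bias pass and the correction.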

### Zen 4 escape hatch (optional, only if profiling justifies)

If Zen 4 matters and `AvxVnni.IsSupported` is indeed false there, the EVEX vpdpbusd can be reached via a tiny native helper: a C file compiled with `-mavx512vnni` and P/Invoked at MatVec/row-sweep granularity, so the nanosecond-scale call cost amortizes over the whole sweep. The repo already P/Invokes OpenBLAS/cuBLAS/NVRTC, and NativeAOT can even statically link it (DirectPInvoke). Likely NOT worth it: single-stream decode dots are DRAM-bound (see below), and Zen 4 still has the AVX2 path.
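A minimal sketch of what such a helper could look like, assuming a row-major u8 weight matrix dotted against one s8 activation vector. The export name, signature, and data layout are hypothetical; the real Q8K/Q8KS blocks carry scales that are omitted here. The intrinsics path is compiled only when the compiler is given AVX-512 VNNI flags; otherwise the same function falls back to scalar code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__AVX512F__) && defined(__AVX512VNNI__)
#include <immintrin.h>
#endif

/* Hypothetical export: one call dots `rows` rows of u8 weights against a
   single s8 activation vector, so the managed-to-native transition is paid
   once per sweep rather than once per dot. */
void sweep_dot_u8s8(const uint8_t *weights, size_t rows, size_t cols,
                    const int8_t *act, int32_t *out) {
    assert(weights != NULL && act != NULL && out != NULL);
    for (size_t r = 0; r < rows; r++) {
        const uint8_t *w = weights + r * cols;
        int32_t sum = 0;
        size_t i = 0;
#if defined(__AVX512F__) && defined(__AVX512VNNI__)
        /* EVEX vpdpbusd: 64 u8*s8 products accumulated into 16 int32 lanes. */
        __m512i acc = _mm512_setzero_si512();
        for (; i + 64 <= cols; i += 64) {
            __m512i wv = _mm512_loadu_si512((const void *)(w + i));
            __m512i av = _mm512_loadu_si512((const void *)(act + i));
            acc = _mm512_dpbusd_epi32(acc, wv, av);
        }
        sum = _mm512_reduce_add_epi32(acc);
#endif
        /* Scalar tail, and the whole row when VNNI is not compiled in. */
        for (; i < cols; i++) sum += (int32_t)w[i] * (int32_t)act[i];
        out[r] = sum;
    }
}
```

The C# side would declare a matching P/Invoke and call it only after a CPUID check for AVX512-VNNI, since a binary built with `-mavx512vnni` faults on hardware without it.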

## Honest expectations

- **Single-token decode dots are largely DRAM-bandwidth-bound** (expert weights stream from RAM once per token) — VNNI may move little there. Measure before claiming.
- The clear compute-bound wins are: (a) the **grouped-by-expert batched paths** (#110 prefill, #210 verify batching) where one expert read is dotted against N token rows; (b) the **multi-input `_2In`/`_4In` kernels** (MTP/batch-verify); (c) freeing SMT/core pressure so the `Parallel.For` sweeps scale. Expect ~1.5-2× kernel-level on those, prefill-visible per the #110 split (routed MoE was ~36% of prefill wall).
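The bandwidth-versus-compute split above can be sketched as a simple roofline estimate. This is a back-of-envelope model under stated assumptions: roughly one byte per int8 weight (block scales ignored), two operations per multiply-add, and weights read from DRAM once per group of `n_rows` token rows. The function names are hypothetical, and the peak and bandwidth inputs are parameters to be filled in from measurement, not figures from this issue.

```c
#include <assert.h>

/* int8 operations per weight byte when one expert read is reused across
   n_rows token rows (2 ops per multiply-add, ~1 byte per weight). */
double ops_per_weight_byte(int n_rows) {
    assert(n_rows >= 1);
    return 2.0 * (double)n_rows;
}

/* Roofline bound in Gop/s: the lesser of the compute peak and what DRAM
   bandwidth can feed at the given reuse factor. */
double attainable_gops(double peak_gops, double dram_gbps, int n_rows) {
    double bandwidth_bound = dram_gbps * ops_per_weight_byte(n_rows);
    return bandwidth_bound < peak_gops ? bandwidth_bound : peak_gops;
}

/* Nonzero when raising the compute peak (e.g. via VNNI) cannot help. */
int is_bandwidth_bound(double peak_gops, double dram_gbps, int n_rows) {
    return dram_gbps * ops_per_weight_byte(n_rows) < peak_gops;
}
```

With placeholder inputs of a 1000 Gop/s peak and 50 GB/s of DRAM bandwidth, a single token row is capped at 100 Gop/s by memory, so a faster kernel changes nothing; at 16 rows per expert read the memory cap rises to 1600 Gop/s and the compute peak becomes the limit. That is the regime where the grouped and multi-input kernels can show a VNNI gain.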

## Proposed work

1. Add VNNI variants of the Q8K/Q8KS dot inner loops behind `AvxVnni.IsSupported` (256-bit — Alder Lake+/Zen 5), keeping current AVX2 as fallback; same accumulation order or explicitly argmax-stable-gated, per the repo's parity discipline.
2. 512-bit int8 paths where they exist today: `Avx512BW` widening variants of the same kernels gated on `Vector512.IsHardwareAccelerated` (helps Zen 4/5 without VNNI semantics), and optionally `Avx10v1.V512` VNNI for AVX10 hardware. Revisit with `AvxVnni.V512` when the project moves to .NET 11.
3. Bench at three levels, #176-style so dead ends aren't retried: isolated kernel (existing SimdKernels micro-bench/tests), grouped-expert prefill chunk profile (`SHARPI_PREFILL_PROFILE=1`), e2e Carnice + 35B-A3B decode/prefill (`bench-carnice.ps1`, warm-cache protocol). Record which level the win survives to.
4. Parity: clone the existing Q8KS validation tests (cf. `bench-q8k-validation.ps1` / `SimdKernelsQ8KSTests`) for each new path; 0 mismatches vs scalar reference.
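The parity step in item 4 can be sketched as a small harness that runs a candidate kernel against a scalar reference on deterministic pseudo-random inputs and counts mismatches. This is an illustration in C, not the structure of `SimdKernelsQ8KSTests`; the function names, the 256-element buffer limit, and the LCG input generator are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int32_t (*dot_fn)(const uint8_t *u, const int8_t *s, size_t n);

/* Scalar reference: exact int32 accumulation of u8*s8 products. */
int32_t dot_scalar_ref(const uint8_t *u, const int8_t *s, size_t n) {
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++) acc += (int32_t)u[i] * (int32_t)s[i];
    return acc;
}

/* Deterministic LCG so a failing case can be replayed from its seed.
   Returns the top 8 bits of the updated state. */
static uint32_t lcg_next(uint32_t *state) {
    *state = *state * 1664525u + 1013904223u;
    return *state >> 24;
}

/* Returns how many of `trials` random vectors produce a result that
   differs from the scalar reference; the acceptance bar is zero. */
size_t count_mismatches(dot_fn candidate, size_t n, int trials, uint32_t seed) {
    assert(n <= 256); /* assumed maximum block length */
    uint8_t u[256];
    int8_t s[256];
    size_t mismatches = 0;
    for (int t = 0; t < trials; t++) {
        for (size_t i = 0; i < n; i++) {
            u[i] = (uint8_t)lcg_next(&seed);
            s[i] = (int8_t)((int)lcg_next(&seed) - 128);
        }
        if (candidate(u, s, n) != dot_scalar_ref(u, s, n)) mismatches++;
    }
    return mismatches;
}
```

Full-range random bytes exercise the saturation corner that real quantized data may never reach, so a harness like this is stricter than the production inputs. If a path is only exact for bounded operands, the generator should be restricted to that range and the bound documented.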

## Acceptance

- [ ] VNNI path auto-selected on capable CPUs, kill switch env var for bisection.
- [ ] Documented runtime-probe result for `AvxVnni.IsSupported` on Zen 4 (decides whether the native-helper question stays open).
- [ ] Kernel-level speedup quantified; e2e A/B on Carnice CPU-MoE prefill + decode reported (if decode shows no gain because it is bandwidth-bound, document that null result too).
- [ ] Bit-parity/argmax-stability tests green; no regression on AVX2-only hardware.
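The auto-selection and kill-switch items can be sketched as a single pure selection function. This is an illustration in C rather than the C# the repo would use; the enum, struct, and function names are hypothetical, and so is the priority between the 256-bit VNNI path and the AVX-512BW path, which the bench results should decide.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum { PATH_AVX2, PATH_AVX512BW, PATH_AVXVNNI_256 } dot_path;

/* Feature bits as reported by the runtime probe (hypothetical struct). */
typedef struct {
    int avx_vnni; /* VEX AVX-VNNI */
    int avx512bw; /* AVX-512BW widening path usable */
} cpu_caps;

/* Picks the int8 dot path. A kill-switch value of "1" forces the existing
   AVX2 kernels so a regression can be bisected without a rebuild. */
dot_path select_dot_path(const cpu_caps *caps, const char *kill_switch) {
    assert(caps != NULL);
    if (kill_switch != NULL && strcmp(kill_switch, "1") == 0) return PATH_AVX2;
    if (caps->avx_vnni) return PATH_AVXVNNI_256; /* ordering is an assumption */
    if (caps->avx512bw) return PATH_AVX512BW;
    return PATH_AVX2;
}
```

A caller would pass the value of an environment variable read once at startup, for example `getenv("SHARPI_NO_VNNI")`, where the variable name is made up for this sketch. Keeping the decision in one function also makes the selected path easy to log next to bench results.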

## References

- #100/#215 (CPU-MoE is the steady-state config on 12 GB), #110/#210 (the compute-bound grouped paths that benefit most), #176 (bench discipline: isolated wins must prove themselves e2e).
- dotnet/runtime#86849 (AVX-512 VNNI API, approved), dotnet/runtime#128365 (`AvxVnni.V512` implementation PR, open as of 2026-06).
