perf(cpu): VNNI (vpdpbusd) + AVX-512 paths for the int8 Q8K/Q8KS dot kernels — the CPU-MoE hot path runs on AVX2 vpmaddubsw only #222

@pekkah

Context

The CPU int8 dot kernels are the decode bottleneck on the target 12 GB / 64 GB machine: CPU-MoE is the auto-selected MoE config there (#100, CudaHybridGdnForwardPass.cs:886-909), and the Q8_KS routed-expert dots are what the 4.9× Carnice win (SHARPI_Q3K_Q8K) runs on. More models land on this path with #215.

Problem

SimdKernels.cs implements the whole int8 dot family (DotQ3K_Q8K(S), DotQ8_0_Q8K, DotQ6K_Q8K, the _2In variants, …) on AVX2 256-bit Avx2.MultiplyAddAdjacent chains (vpmaddubsw → vpmaddwd → vpaddd, e.g. SimdKernels.cs:1956-1981). Two generations of ISA are left unused:

  • VNNI is absent entirely — zero hits for AvxVnni in the repo. vpdpbusd does u8×s8→s32 multiply-accumulate in one instruction, replacing the 2-3 instruction vpmaddubsw/vpmaddwd chain.
  • AVX-512 coverage is thin: 7 Avx512F.IsSupported gates (Q4_K/Q5_K float-path dots) vs 273 Vector256 uses; none of the int8 Q8K/Q8KS kernels have a 512-bit path. Zen 5 executes 512-bit natively; Zen 4 double-pumps but still wins on instruction count.
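To make the instruction-count claim concrete, here is a minimal scalar model of one 32-bit lane under each scheme. It is a sketch, not code from SimdKernels.cs: the function names are hypothetical, vpmaddwd is assumed to multiply by a vector of ones (the usual widening idiom), and int32 wraparound is ignored.

```python
def sat16(x):
    """Clamp to the signed 16-bit range, as vpmaddubsw does on each pair sum."""
    return max(-32768, min(32767, x))


def lane_avx2_chain(acc, u8, s8):
    """One 32-bit lane of vpmaddubsw -> vpmaddwd (by ones) -> vpaddd."""
    assert len(u8) == len(s8) == 4
    lo = sat16(u8[0] * s8[0] + u8[1] * s8[1])  # vpmaddubsw: saturating i16 pair sum
    hi = sat16(u8[2] * s8[2] + u8[3] * s8[3])
    return acc + (lo * 1 + hi * 1)             # vpmaddwd by ones, then vpaddd


def lane_vpdpbusd(acc, u8, s8):
    """One 32-bit lane of vpdpbusd: four u8*s8 products summed straight into i32."""
    assert len(u8) == len(s8) == 4
    return acc + sum(a * b for a, b in zip(u8, s8))
```

The two agree whenever the int16 pair sum does not saturate, which is guaranteed if the unsigned operand stays at or below 127 (127·127·2 = 32258 and 127·(-128)·2 = -32512 both fit in int16). With full-range u8 inputs they can differ, so whether each kernel stays inside that bound is worth confirming per format before claiming bit parity.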

llama.cpp's ggml carries exactly these paths (VNNI int8 dots), so the CPU-MoE comparison against it gives away ISA for free on modern CPUs.

What .NET actually exposes (verified 2026-06, corrected from the original issue text)

  • AvxVnni (VEX, 128/256-bit) — available and stable on net10.0 (preview attribute was lifted in .NET 9). MultiplyWideningAndAdd = vpdpbusd, MultiplyWideningAndAddSaturate = vpdpbusds. Hardware gate is the VEX AVX-VNNI CPUID flag: Intel Alder Lake+ (incl. E-cores) and Zen 5. Zen 4 advertises only EVEX AVX512-VNNI, so expect AvxVnni.IsSupported == false there — verify with a runtime probe on a 7000-series.
  • 512-bit VNNI is NOT in .NET 10. Avx512Vnni does not exist. The only 512-bit VNNI surface today is Avx10v1.V512 (.NET 9+), which lights up only on Intel AVX10 hardware (Granite-Rapids-class), not AMD.
  • .NET 11 is adding AvxVnni.V512 (512-bit VPDPBUSD/VPDPWSSD/VPDPBUSDS): the API proposal dotnet/runtime#86849 ("[API Proposal]: Add support for AVX-512 VNNI hardware instructions") is api-approved, and the implementation PR dotnet/runtime#128365 ("Add AvxVnni.V512 hardware intrinsics") is approved pending checks as of June 2026. It is on track for .NET 11 but not guaranteed until merged.
  • AvxVnniInt8 (vpdpbssd, s8×s8 — would drop the unsigned-bias trick entirely) exists in net10.0 but its hardware (Sierra Forest / Arrow Lake / Lunar Lake) is rare in the target audience; note only.
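For readers unfamiliar with the unsigned-bias trick mentioned in the last bullet, the following scalar sketch shows the identity it relies on: shifting one signed operand by +128 makes it a valid u8 for a u8×s8 instruction, and the extra 128·Σb term is subtracted afterwards. The function names are hypothetical and this is not taken from the repo's kernels.

```python
def dot_s8_direct(a, b):
    """Plain s8 x s8 dot product, the result an s8*s8 instruction would give."""
    return sum(x * y for x, y in zip(a, b))


def dot_s8_via_unsigned_bias(a, b):
    """Same dot product using only u8 x s8 multiplies (the vpdpbusd operand shape)."""
    biased = [x + 128 for x in a]                 # s8 -128..127 -> u8 0..255
    assert all(0 <= v <= 255 for v in biased)
    raw = sum(u * y for u, y in zip(biased, b))   # what the u8*s8 accumulate produces
    return raw - 128 * sum(b)                     # remove the bias: (a+128).b = a.b + 128*sum(b)
```

The correction term depends only on `b`, so it can be computed once per block and reused across rows. An s8×s8 instruction removes both the bias add and the correction.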

Zen 4 escape hatch (optional, only if profiling justifies)

If Zen 4 matters and AvxVnni.IsSupported is indeed false there, the EVEX vpdpbusd can be reached via a tiny native helper (a C file compiled with -mavx512vnni, P/Invoked at MatVec/row-sweep granularity so the ~ns call cost amortizes). The repo already P/Invokes OpenBLAS/cuBLAS/NVRTC, and NativeAOT can even statically link it (DirectPInvoke). Likely NOT worth it: single-stream decode dots are DRAM-bound (see below), and Zen 4 still has the AVX2 path.

Honest expectations

Proposed work

  1. Add VNNI variants of the Q8K/Q8KS dot inner loops behind AvxVnni.IsSupported (256-bit — Alder Lake+/Zen 5), keeping current AVX2 as fallback; same accumulation order or explicitly argmax-stable-gated, per the repo's parity discipline.
  2. 512-bit int8 paths where they exist today: Avx512BW widening variants of the same kernels gated on Vector512.IsHardwareAccelerated (helps Zen 4/5 without VNNI semantics), and optionally Avx10v1.V512 VNNI for AVX10 hardware. Revisit with AvxVnni.V512 when the project moves to .NET 11.
  3. Bench at three levels, #176-style (#176: "Prefill perf gap vs llama.cpp (~2.4x) is weight-streaming-bound, not MMQ-kernel-bound (#124 follow-up)"), so dead ends aren't retried: isolated kernel (existing SimdKernels micro-bench/tests), grouped-expert prefill chunk profile (SHARPI_PREFILL_PROFILE=1), e2e Carnice + 35B-A3B decode/prefill (bench-carnice.ps1, warm-cache protocol). Record which level the win survives to.
  4. Parity: clone the existing Q8KS validation tests (cf. bench-q8k-validation.ps1 / SimdKernelsQ8KSTests) for each new path; 0 mismatches vs scalar reference.
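As a rough illustration of what "0 mismatches vs scalar reference" means for the integer part of the dot, the sketch below models a SIMD kernel as a set of independent int32 lane accumulators (8 lanes for 256-bit, 16 for 512-bit) and compares it with a straight scalar sum on random inputs. Everything here is hypothetical scaffolding, not the repo's test code, and any per-block float scaling that follows the integer dot is not modeled.

```python
import random


def dot_reference(u8, s8):
    """Scalar reference: left-to-right integer dot product."""
    return sum(a * b for a, b in zip(u8, s8))


def dot_lanes(u8, s8, lanes):
    """Model a SIMD dot: `lanes` i32 accumulators, 4 bytes per lane per step."""
    step = lanes * 4
    assert len(u8) == len(s8) and len(u8) % step == 0
    acc = [0] * lanes
    for base in range(0, len(u8), step):
        for lane in range(lanes):
            o = base + lane * 4
            acc[lane] += sum(u8[o + k] * s8[o + k] for k in range(4))
    return sum(acc)  # horizontal add at the end


def count_mismatches(trials=200, length=256, seed=0):
    """Count random inputs where either lane width disagrees with the reference."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        u8 = [rng.randrange(0, 256) for _ in range(length)]
        s8 = [rng.randrange(-128, 128) for _ in range(length)]
        ref = dot_reference(u8, s8)
        if dot_lanes(u8, s8, 8) != ref or dot_lanes(u8, s8, 16) != ref:
            bad += 1
    return bad
```

Integer addition is exact and order-independent as long as nothing saturates or overflows, so changing lane width alone should not introduce mismatches in this part; differences, if any, would come from saturation or from the floating-point accumulation around it.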

Acceptance

  • VNNI path auto-selected on capable CPUs, with a kill-switch env var for bisection.
  • Documented runtime-probe result for AvxVnni.IsSupported on Zen 4 (decides whether the native-helper question stays open).
  • Kernel-level speedup quantified; e2e A/B on Carnice CPU-MoE prefill + decode reported (document the result even if decode shows no gain because it is bandwidth-bound).
  • Bit-parity/argmax-stability tests green; no regression on AVX2-only hardware.
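The distinction between bit parity and argmax stability can be sketched as follows. The example assumes, without confirmation from the repo, that the kernels accumulate per-block results in fp32; it shows that the same fp32 terms summed in a different order can give a different result, and what a minimal argmax-stability check looks like. All names are hypothetical.

```python
import struct


def f32(x):
    """Round a Python float to the nearest IEEE-754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def sum_f32(values):
    """Left-to-right accumulation with fp32 rounding after every add."""
    acc = 0.0
    for v in values:
        acc = f32(acc + f32(v))
    return acc


def argmax(xs):
    """Index of the largest element (first one on ties)."""
    return max(range(len(xs)), key=lambda i: xs[i])


def argmax_stable(baseline_logits, candidate_logits):
    """True if the candidate path would pick the same token as the baseline."""
    return argmax(baseline_logits) == argmax(candidate_logits)
```

For instance, `sum_f32([1e8, 1.0, -1e8])` loses the `1.0` to rounding while `sum_f32([1e8, -1e8, 1.0])` keeps it. A path that changes reduction order therefore cannot promise bit parity, and the weaker argmax-stability gate is the fallback.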

References
