[API Proposal]: Add managed surface for AVX-512 BF16 hardware intrinsics #129323

@jamesburton

Background and motivation

The classic AVX512_BF16 instruction set adds bfloat16 support to AVX-512: a bf16 dot product accumulating into FP32 (VDPBF16PS) and FP32→bf16 round-to-nearest-even conversions (VCVTNE2PS2BF16, VCVTNEPS2BF16). It is supported on Intel Cooper Lake and Sapphire Rapids+ and on AMD Zen 4 and Zen 5 (including Strix Halo), but there is currently no managed surface for it in System.Runtime.Intrinsics.X86 on the classic-AVX512 path.

This is exactly the situation #86849 / PR #128365 solved for AVX512_VNNI: an R2R name and the underlying JIT plumbing existed under AVX10v*, but shipping Sapphire Rapids / Zen 4 / Zen 5 hardware does not enable AVX10, so the surface was unreachable on the parts where it matters. The fix shape that landed there — separate managed surface tied to the classic gate, AVX10 path stays parallel — is the precedent this proposal follows.

What already exists in dotnet/runtime (verified by grep)

  • System.BFloat16 primitive (src/libraries/System.Private.CoreLib/src/System/BitConverter.cs:312-1016 — full GetBytes(BFloat16), ToBFloat16, Int16BitsToBFloat16, BFloat16ToInt16Bits etc.). So Vector*<BFloat16> is the natural element type — no ushort interim.
  • R2R IDs already allocated: READYTORUN_INSTRUCTION_Avx512Bf16 = 72, READYTORUN_INSTRUCTION_Avx512Bf16_VL = 73 (src/coreclr/inc/readytoruninstructionset.h:82-83).
  • InstructionSetDesc.txt:76-77 has rows for Avx512Bf16 and Avx512Bf16_VL; the managed-name column (column 3) is empty and the parent (column 6) is AVX10v1:
    instructionset     ,X86   ,                     ,Avx512Bf16         ,72 ,AVX10v1               ,avx10v1
    instructionset     ,X86   ,                     ,Avx512Bf16_VL      ,73 ,AVX10v1               ,avx10v1
    
  • JIT instruction encodings already in src/coreclr/jit/instrsxarch.h:
    • vdpbf16ps (line 1035) — EVEX
    • vcvtne2ps2bf16 (line 994) — EVEX
    • vcvtneps2bf16 (line 995) — EVEX

So the scope of the BF16 PR is small relative to a green-field intrinsic: fill in the managed-name column, route to the classic gate, add the public C# class, wire the JIT dispatch, add tests.

Motivating use case — quantized / mixed-precision ML inference

VDPBF16PS accumulates bf16×bf16 products directly into an FP32 accumulator. For block-scaled integer quantization (e.g. GGUF Q8_0: 32 int8 + one fp16 scale per block), an integer-VNNI dot must convert+rescale after every block, which prevents long accumulation. A BF16 path can dequantize int8→bf16 folding the per-block scale into the value, then accumulate the already-scaled products across all blocks in one FP32 accumulator via VDPBF16PS — removing the per-block reduction entirely. Concrete data point: an A/B on Zen 5 / Strix Halo had a 512-bit integer-VNNI Q8_0 GEMM kernel hitting only parity with the existing maddubs path precisely because of those per-block reductions; BF16 long-accumulation is the path past that ceiling. BF16 GEMM and attention more generally also benefit.
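
To make the per-block versus long-accumulation contrast concrete, here is a minimal scalar sketch. It is not the kernel from the A/B above. The class and method names are hypothetical, activations are plain int8 with no scale of their own (a simplification of Q8_0), bf16 values are modelled as raw 16-bit patterns rather than System.BFloat16, and NaN and denormal handling are not modelled.

```csharp
using System;

// Hypothetical scalar sketch; none of these names exist in dotnet/runtime.
public static class Q8FoldSketch
{
    // FP32 -> bf16 bits with round-to-nearest-even (NaN inputs are not handled here).
    static ushort ToBf16(float v)
    {
        uint bits = BitConverter.SingleToUInt32Bits(v);
        return (ushort)((bits + 0x7FFFu + ((bits >> 16) & 1u)) >> 16);
    }

    // bf16 bits -> FP32: the bf16 pattern is the upper half of the FP32 pattern.
    static float FromBf16(ushort b) => BitConverter.UInt32BitsToSingle((uint)b << 16);

    // Integer-style path: one int32 dot per block, then convert and rescale per block.
    public static float DotPerBlock(sbyte[] w, float[] scales, sbyte[] x, int blockSize)
    {
        float total = 0f;
        for (int b = 0; b < scales.Length; b++)
        {
            int blockSum = 0;
            for (int i = b * blockSize; i < (b + 1) * blockSize; i++)
                blockSum += w[i] * x[i];
            total += blockSum * scales[b]; // the per-block reduction the text refers to
        }
        return total;
    }

    // BF16-style path: the block scale is folded into each weight before it is
    // rounded to bf16, so one FP32 accumulator runs across all blocks.
    // Assumes an even element count; each step consumes one pair of products.
    public static float DotFolded(sbyte[] w, float[] scales, sbyte[] x, int blockSize)
    {
        float acc = 0f;
        for (int i = 0; i < w.Length; i += 2)
        {
            float w0 = FromBf16(ToBf16(w[i] * scales[i / blockSize]));
            float w1 = FromBf16(ToBf16(w[i + 1] * scales[(i + 1) / blockSize]));
            float x0 = FromBf16(ToBf16(x[i]));
            float x1 = FromBf16(ToBf16(x[i + 1]));
            acc += w0 * x0 + w1 * x1;
        }
        return acc;
    }
}
```

When the scaled weights are exactly representable in bf16 the two paths agree; in general the folded path trades some weight precision for the removal of the per-block reduction.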

API Proposal

Mirrors the established Avx512* / AvxVnni pattern.

namespace System.Runtime.Intrinsics.X86
{
    /// <summary>Provides access to X86 AVX512-BF16 hardware instructions via intrinsics.</summary>
    [Intrinsic]
    public abstract class Avx512Bf16 : Avx512F
    {
        public static new bool IsSupported { get; }

        /// <summary>__m512 _mm512_dpbf16_ps (__m512 src, __m512bh a, __m512bh b) — VDPBF16PS zmm, zmm, zmm/m512</summary>
        public static Vector512<float> MultiplyWideningAndAdd(Vector512<float> addend, Vector512<BFloat16> left, Vector512<BFloat16> right);

        /// <summary>__m512bh _mm512_cvtne2ps_pbh (__m512 a, __m512 b) — VCVTNE2PS2BF16 zmm, zmm, zmm/m512 (packs two fp32 vectors → one bf16 vector)</summary>
        public static Vector512<BFloat16> ConvertToBFloat16(Vector512<float> lower, Vector512<float> upper);

        /// <summary>__m256bh _mm512_cvtneps_pbh (__m512 a) — VCVTNEPS2BF16 ymm, zmm/m512</summary>
        public static Vector256<BFloat16> ConvertToBFloat16(Vector512<float> value);

        [Intrinsic] public new abstract class X64 : Avx512F.X64 { public static new bool IsSupported { get; } }

        [Intrinsic]
        public new abstract class VL : Avx512F.VL
        {
            public static new bool IsSupported { get; }
            public static Vector128<float> MultiplyWideningAndAdd(Vector128<float> addend, Vector128<BFloat16> left, Vector128<BFloat16> right);
            public static Vector256<float> MultiplyWideningAndAdd(Vector256<float> addend, Vector256<BFloat16> left, Vector256<BFloat16> right);
            public static Vector128<BFloat16> ConvertToBFloat16(Vector128<float> lower, Vector128<float> upper);
            public static Vector256<BFloat16> ConvertToBFloat16(Vector256<float> lower, Vector256<float> upper);
            public static Vector128<BFloat16> ConvertToBFloat16(Vector128<float> value);
            public static Vector128<BFloat16> ConvertToBFloat16(Vector256<float> value);
        }
    }
}
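
As a reading aid for the proposed surface, the following scalar sketch models what one element of ConvertToBFloat16 and one FP32 lane of MultiplyWideningAndAdd are intended to compute. It is an assumption-laden model, not the proposed implementation. The class and method names are hypothetical, bf16 values are raw 16-bit patterns, and the real instructions' denormal handling and intermediate rounding are not reproduced.

```csharp
using System;

// Hypothetical scalar model of the proposed operations; not part of the proposal.
public static class Avx512Bf16ScalarModel
{
    // FP32 -> bf16 bits with round-to-nearest-even, as named by VCVTNEPS2BF16.
    public static ushort ConvertToBFloat16Bits(float value)
    {
        uint bits = BitConverter.SingleToUInt32Bits(value);
        if (float.IsNaN(value))
        {
            // Keep sign and exponent, force the quiet bit so the result stays a NaN.
            return (ushort)((bits >> 16) | 0x0040);
        }
        // Adding 0x7FFF plus the lowest kept bit rounds ties to the even pattern.
        uint roundingBias = 0x7FFFu + ((bits >> 16) & 1u);
        return (ushort)((bits + roundingBias) >> 16);
    }

    // bf16 bits -> FP32: widening is exact, the low 16 mantissa bits are zero.
    public static float ToSingle(ushort bf16Bits)
        => BitConverter.UInt32BitsToSingle((uint)bf16Bits << 16);

    // One FP32 lane of the dot product: each lane consumes two adjacent bf16
    // elements from each source and adds both products to the addend.
    public static float MultiplyWideningAndAddLane(
        float addend, ushort left0, ushort left1, ushort right0, ushort right1)
        => addend
           + ToSingle(left0) * ToSingle(right0)
           + ToSingle(left1) * ToSingle(right1);
}
```

A Vector512<float> result has 16 such lanes, each fed by two of the 32 bf16 elements in each source vector.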

Proposed ISA grouping — alignment with the #128365 precedent

This is the architectural question that ate the most review time on the VNNI PR, so it is being surfaced up-front for confirmation rather than left for review-time discovery.

@tannergooding's stated principle on #128365 (resolved in commit ebddfff3):

"While there is a very small subset of hardware that contains just AVX512v2 + AVX-512-VNNI, we don't want to support that being enabled/disabled independently without significant justification, as it adds unnecessary complexity to the overall support and testing matrix. Instead, we just want to separate this based on what captures the broadest overall support."

Applied to BF16, the symmetric question is: does v3-as-currently-defined imply BF16 on shipping hardware?

Hardware                                      full AVX512v3 group   AVX512_BF16
Intel Cooper Lake / Sapphire Rapids+          yes                   yes
AMD Zen 4 (Genoa, Bergamo)                    yes                   yes
AMD Zen 5 (Granite Ridge, Turin, Strix Halo)  yes                   yes
Intel Ice Lake-U/Y, Tiger Lake, Rocket Lake   yes                   no

This is asymmetric with the VNNI case: there is real shipping hardware (Intel client AVX-512 parts from before AVX-512 was disabled on client SKUs) that has full v3 but lacks BF16. Widening v3 to require BF16 would silently drop v3 enablement on those parts, which is a regression for those users.

Recommended shape (option a): add AVX512_BF16 as its own ISA group with implication AVX512_BF16 → AVX512v3. That preserves Tanner's "what captures the broadest overall support" principle (Tiger Lake keeps its v3 surface; SPR / Zen 4 / Zen 5 / Strix Halo gain BF16) and satisfies the "significant justification" he asked for on VNNI — namely, that v3-without-BF16 is a real shipping configuration, which v2-and-VNNI-without-v3 was not.

Concretely this means editing InstructionSetDesc.txt:76-77 to:

  1. Fill in column 3 with the managed names (Avx512Bf16, Avx512Bf16.VL).
  2. Re-point column 6 from AVX10v1 to the new AVX512_BF16 group.
  3. Add an instructionset row for the AVX512_BF16 group plus implication ,X86 ,AVX512_BF16 ,AVX512v3.

Alternative shapes considered:

  • (b) Same as (a) framed as "edit the existing rows" rather than introduce a new group — likely produces a smaller diff but functionally equivalent.
  • (c) Widen v3 to require BF16. Mentioned for completeness only: it regresses Tiger Lake et al. as noted, so it should not be pursued unless explicitly directed.

Alignment with #128365's specific guidance

The PR that follows this issue will mirror these patterns from the VNNI review verbatim:

  • No separate AVXVNNI_V512-style ISA "just because it's a V512 thing" — the v512 width and the classic-gate group are one ISA, not two.
  • simdSize = -1 on the HARDWARE_INTRINSIC rows so that V128/V256/V512 calls all dispatch through one set of entries, matching the AVXVNNIINT / AVXVNNIINT_V512 precedent. Mixing fixed simdSize = 64 with VL forms in one ISA was the trap on #128365.
  • Dispatch in Compiler::lookupInstructionSet, not a codeman.cpp-side fallback. (Tanner's review on codeman.cpp:1430 of #128365 was explicit: "the right place is the JIT's class-name dispatch.") BF16 has no VEX-encoded sibling to fall back to, so the dispatch is just "return the group", with no compSupportsHWIntrinsic branch.
  • Don't drop tokens from the required-ISAs comments next to existing config flags. The nit pass on #128365 (commit 91cdc6c2) twice flagged accidentally dropped VNNI tokens; the same care is needed here.
  • Regenerate the JIT-EE GUID on every InstructionSetDesc.txt change. Expect 4-6 merge cycles, each requiring a fresh GUID; never resolve the conflict by picking one side's GUID.

Open questions for confirmation by reviewers

I'd like @tannergooding (and the API-review team where relevant) to confirm or vary:

  1. Group shape. Approve option (a), AVX512_BF16 as its own group with implication → AVX512v3, vs option (b) (same semantics, edit the existing rows in place) vs option (c) (widen v3, regressing Tiger Lake). My read is that (a) best satisfies the "broadest overall support" principle while honouring the "significant justification" requirement. Open to direction.
  2. Element type. Vector*<BFloat16> using the existing System.BFloat16 primitive — confirm this is the intended shape rather than a Vector*<ushort> raw-bits stand-in.
  3. Parent class. Avx512Bf16 : Avx512F (sibling of Avx512Vbmi, AvxVnni, etc.) — confirm rather than nesting under a tier.
  4. Naming. MultiplyWideningAndAdd for the dot-product (parallel to AvxVnni), ConvertToBFloat16 for the two/one-source conversions. Open to convention.
  5. Preview gating. [RequiresPreviewFeatures] initially, per AvxVnni precedent.

If any of (1)–(5) need to vary from what's proposed, please flag here so the draft PR (linked below once opened) can be aligned before review cycles start.

Alternatives considered

  • AVX10 BF16 (Avx10v*) — gated on AVX10, unavailable on Sapphire Rapids / Zen 4 / Zen 5 / Strix Halo. Does not serve the parts where the ML use case lives.
  • Software bf16 dot — no hardware acceleration; defeats the motivating use case.

References
