You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The classic AVX512_BF16 instruction set adds bfloat16 support to AVX-512: a bf16 dot-product accumulating into FP32 (VDPBF16PS) and FP32→bf16 round-to-nearest-even conversions (VCVTNE2PS2BF16, VCVTNEPS2BF16). It is supported on Intel Cooper Lake / Ice Lake-SP / Sapphire Rapids+ and on AMD Zen 4 and Zen 5 (including Strix Halo), but there is currently no managed surface for it in System.Runtime.Intrinsics.X86 on the classic-AVX512 path.
This is exactly the situation #86849 / PR #128365 solved for AVX512_VNNI: an R2R name and the underlying JIT plumbing existed under AVX10v*, but shipping Sapphire Rapids / Zen 4 / Zen 5 hardware does not enable AVX10, so the surface was unreachable on the parts where it matters. The fix shape that landed there — separate managed surface tied to the classic gate, AVX10 path stays parallel — is the precedent this proposal follows.
What already exists in dotnet/runtime (verified by grep)
System.BFloat16 primitive (src/libraries/System.Private.CoreLib/src/System/BitConverter.cs:312-1016 — full GetBytes(BFloat16), ToBFloat16, Int16BitsToBFloat16, BFloat16ToInt16Bits etc.). So Vector*<BFloat16> is the natural element type — no ushort interim.
JIT instruction encodings already in src/coreclr/jit/instrsxarch.h:
vdpbf16ps (line 1035) — EVEX
vcvtne2ps2bf16 (line 994) — EVEX
vcvtneps2bf16 (line 995) — EVEX
So the scope of the BF16 PR is small relative to a green-field intrinsic: fill in the managed-name column, route to the classic gate, add the public C# class, wire the JIT dispatch, add tests.
Motivating use case — quantized / mixed-precision ML inference
VDPBF16PS accumulates bf16×bf16 products directly into an FP32 accumulator. For block-scaled integer quantization (e.g. GGUF Q8_0: 32 int8 + one fp16 scale per block), an integer-VNNI dot must convert+rescale after every block, which prevents long accumulation. A BF16 path can dequantize int8→bf16 folding the per-block scale into the value, then accumulate the already-scaled products across all blocks in one FP32 accumulator via VDPBF16PS — removing the per-block reduction entirely. Concrete data point: an A/B on Zen 5 / Strix Halo had a 512-bit integer-VNNI Q8_0 GEMM kernel hitting only parity with the existing maddubs path precisely because of those per-block reductions; BF16 long-accumulation is the path past that ceiling. BF16 GEMM and attention more generally also benefit.
API Proposal
Mirrors the established Avx512* / AvxVnni pattern.
namespaceSystem.Runtime.Intrinsics.X86{/// <summary>Provides access to X86 AVX512-BF16 hardware instructions via intrinsics.</summary>[Intrinsic]publicabstractclassAvx512Bf16:Avx512F{publicstaticnewboolIsSupported{get;}/// <summary>__m512 _mm512_dpbf16_ps (__m512 src, __m512bh a, __m512bh b) — VDPBF16PS zmm, zmm, zmm/m512</summary>publicstaticVector512<float>MultiplyWideningAndAdd(Vector512<float>addend,Vector512<BFloat16>left,Vector512<BFloat16>right);/// <summary>__m512bh _mm512_cvtne2ps_pbh (__m512 a, __m512 b) — VCVTNE2PS2BF16 zmm, zmm, zmm/m512 (packs two fp32 vectors → one bf16 vector)</summary>publicstaticVector512<BFloat16>ConvertToBFloat16(Vector512<float>lower,Vector512<float>upper);/// <summary>__m256bh _mm512_cvtneps_pbh (__m512 a) — VCVTNEPS2BF16 ymm, zmm/m512</summary>publicstaticVector256<BFloat16>ConvertToBFloat16(Vector512<float>value);[Intrinsic]publicnewabstractclassX64:Avx512F.X64{publicstaticnewboolIsSupported{get;}}[Intrinsic]publicnewabstractclassVL:Avx512F.VL{publicstaticnewboolIsSupported{get;}publicstaticVector128<float>MultiplyWideningAndAdd(Vector128<float>addend,Vector128<BFloat16>left,Vector128<BFloat16>right);publicstaticVector256<float>MultiplyWideningAndAdd(Vector256<float>addend,Vector256<BFloat16>left,Vector256<BFloat16>right);publicstaticVector128<BFloat16>ConvertToBFloat16(Vector128<float>lower,Vector128<float>upper);publicstaticVector256<BFloat16>ConvertToBFloat16(Vector256<float>lower,Vector256<float>upper);publicstaticVector128<BFloat16>ConvertToBFloat16(Vector128<float>value);publicstaticVector128<BFloat16>ConvertToBFloat16(Vector256<float>value);}}}
Proposed ISA grouping — alignment with the #128365 precedent
This is the architectural question that ate the most review time on the VNNI PR, so it is being surfaced up-front for confirmation rather than left for review-time discovery.
"While there is a very small subset of hardware that contains just AVX512v2 + AVX-512-VNNI, we don't want to support that being enabled/disabled independently without significant justification, as it adds unnecessary complexity to the overall support and testing matrix. Instead, we just want to separate this based on what captures the broadest overall support."
Applied to BF16, the symmetric question is: does v3-as-currently-defined imply BF16 on shipping hardware?
full AVX512v3 group
AVX512_BF16
Intel Cooper Lake / Ice Lake-SP / Sapphire Rapids+
✓
✓
AMD Zen 4 (Genoa, Bergamo)
✓
✓
AMD Zen 5 (Granite Ridge, Turin, Strix Halo)
✓
✓
Intel Ice Lake-U/Y, Tiger Lake, Rocket Lake
✓
✗
This is asymmetric with the VNNI case: there is real shipping hardware (Intel client AVX-512 parts pre-disable) that ships full v3 but lacks BF16. Widening v3 to require BF16 would silently drop v3 enablement on those parts — a regression for those users.
Recommended shape (option a): add AVX512_BF16 as its own ISA group with implication AVX512_BF16 → AVX512v3. That preserves Tanner's "what captures the broadest overall support" principle (Tiger Lake keeps its v3 surface; SPR / Zen 4 / Zen 5 / Strix Halo gain BF16) and satisfies the "significant justification" he asked for on VNNI — namely, that v3-without-BF16 is a real shipping configuration, which v2-and-VNNI-without-v3 was not.
Concretely this means editing InstructionSetDesc.txt:76-77 to:
Fill in column 3 with the managed names (Avx512Bf16, Avx512Bf16.VL).
Re-point column 6 from AVX10v1 to the new AVX512_BF16 group.
Add an instructionset row for the AVX512_BF16 group plus implication ,X86 ,AVX512_BF16 ,AVX512v3.
Alternative shapes considered:
(b) Same as (a) framed as "edit the existing rows" rather than introduce a new group — likely produces a smaller diff but functionally equivalent.
(c) Widen v3 to require BF16. Mentioned for completeness only — regresses Tiger Lake et al. as noted, do not pursue unless explicitly directed.
The PR that follows this issue will mirror these patterns from the VNNI review verbatim:
No separate AVXVNNI_V512-style ISA "just because it's a V512 thing" — the v512 width and the classic-gate group are one ISA, not two.
simdSize = -1 on the HARDWARE_INTRINSIC rows so that V128/V256/V512 calls all dispatch through one set of entries — matching the AVXVNNIINT / AVXVNNIINT_V512 precedent. Mixing fixed simdSize = 64 with VL forms in one ISA was the trap on Add AvxVnni.V512 hardware intrinsics #128365.
Dispatch in Compiler::lookupInstructionSet, not a codeman.cpp-side fallback. (Tanner's review on codeman.cpp:1430 of Add AvxVnni.V512 hardware intrinsics #128365 was explicit: "the right place is the JIT's class-name dispatch.") BF16 has no VEX-encoded sibling to fall back to, so the dispatch is just "return the group" — no compSupportsHWIntrinsic branch.
Don't drop tokens from the required-ISAs comments next to existing config flags. Nit pass on Add AvxVnni.V512 hardware intrinsics #128365 (commit 91cdc6c2) twice flagged accidentally-dropped VNNI tokens; same care needed here.
JIT-EE GUID regenerated every InstructionSetDesc.txt change. Expect 4-6 merge cycles, each requiring a fresh GUID — never pick one side of the conflict.
Open questions for confirmation by reviewers
I'd like @tannergooding (and the API-review team where relevant) to confirm or vary:
Group shape. Approve option (a) — AVX512_BF16 as own group with implication → AVX512v3 — vs option (b) (same semantics, edit the existing rows in place) vs option (c) (widen v3, regressing Tiger Lake). My read is (a) best satisfies the "broadest overall support" principle while honouring the "significant justification" requirement. Open to direction.
Element type.Vector*<BFloat16> using the existing System.BFloat16 primitive — confirm this is the intended shape rather than a Vector*<ushort> raw-bits stand-in.
Parent class.Avx512Bf16 : Avx512F (sibling of Avx512Vbmi, AvxVnni, etc.) — confirm rather than nesting under a tier.
Naming.MultiplyWideningAndAdd for the dot-product (parallel to AvxVnni), ConvertToBFloat16 for the two/one-source conversions. Open to convention.
Preview gating.[RequiresPreviewFeatures] initially, per AvxVnni precedent.
If any of (1)–(5) need to vary from what's proposed, please flag here so the draft PR (linked below once opened) can be aligned before review cycles start.
Alternatives considered
AVX10 BF16 (Avx10v*) — gated on AVX10, unavailable on Sapphire Rapids / Zen 4 / Zen 5 / Strix Halo. Does not serve the parts where the ML use case lives.
Software bf16 dot — no hardware acceleration; defeats the motivating use case.
Background and motivation
The classic
AVX512_BF16instruction set adds bfloat16 support to AVX-512: a bf16 dot-product accumulating into FP32 (VDPBF16PS) and FP32→bf16 round-to-nearest-even conversions (VCVTNE2PS2BF16,VCVTNEPS2BF16). It is supported on Intel Cooper Lake / Ice Lake-SP / Sapphire Rapids+ and on AMD Zen 4 and Zen 5 (including Strix Halo), but there is currently no managed surface for it inSystem.Runtime.Intrinsics.X86on the classic-AVX512 path.This is exactly the situation #86849 / PR #128365 solved for
AVX512_VNNI: an R2R name and the underlying JIT plumbing existed underAVX10v*, but shipping Sapphire Rapids / Zen 4 / Zen 5 hardware does not enable AVX10, so the surface was unreachable on the parts where it matters. The fix shape that landed there — separate managed surface tied to the classic gate, AVX10 path stays parallel — is the precedent this proposal follows.What already exists in dotnet/runtime (verified by grep)
System.BFloat16primitive (src/libraries/System.Private.CoreLib/src/System/BitConverter.cs:312-1016— fullGetBytes(BFloat16),ToBFloat16,Int16BitsToBFloat16,BFloat16ToInt16Bitsetc.). SoVector*<BFloat16>is the natural element type — noushortinterim.READYTORUN_INSTRUCTION_Avx512Bf16 = 72,READYTORUN_INSTRUCTION_Avx512Bf16_VL = 73(src/coreclr/inc/readytoruninstructionset.h:82-83).InstructionSetDesc.txt:76-77has rows forAvx512Bf16andAvx512Bf16_VL— managed-name column (column 3) is empty, parent (column 6) isAVX10v1:src/coreclr/jit/instrsxarch.h:vdpbf16ps(line 1035) — EVEXvcvtne2ps2bf16(line 994) — EVEXvcvtneps2bf16(line 995) — EVEXSo the scope of the BF16 PR is small relative to a green-field intrinsic: fill in the managed-name column, route to the classic gate, add the public C# class, wire the JIT dispatch, add tests.
Motivating use case — quantized / mixed-precision ML inference
VDPBF16PSaccumulates bf16×bf16 products directly into an FP32 accumulator. For block-scaled integer quantization (e.g. GGUFQ8_0: 32 int8 + one fp16 scale per block), an integer-VNNI dot must convert+rescale after every block, which prevents long accumulation. A BF16 path can dequantize int8→bf16 folding the per-block scale into the value, then accumulate the already-scaled products across all blocks in one FP32 accumulator viaVDPBF16PS— removing the per-block reduction entirely. Concrete data point: an A/B on Zen 5 / Strix Halo had a 512-bit integer-VNNI Q8_0 GEMM kernel hitting only parity with the existing maddubs path precisely because of those per-block reductions; BF16 long-accumulation is the path past that ceiling. BF16 GEMM and attention more generally also benefit.API Proposal
Mirrors the established
Avx512*/AvxVnnipattern.Proposed ISA grouping — alignment with the #128365 precedent
This is the architectural question that ate the most review time on the VNNI PR, so it is being surfaced up-front for confirmation rather than left for review-time discovery.
@tannergooding's stated principle on #128365 (resolved in commit
ebddfff3):Applied to BF16, the symmetric question is: does v3-as-currently-defined imply BF16 on shipping hardware?
This is asymmetric with the VNNI case: there is real shipping hardware (Intel client AVX-512 parts pre-disable) that ships full v3 but lacks BF16. Widening v3 to require BF16 would silently drop v3 enablement on those parts — a regression for those users.
Recommended shape (option a): add
AVX512_BF16as its own ISA group withimplication AVX512_BF16 → AVX512v3. That preserves Tanner's "what captures the broadest overall support" principle (Tiger Lake keeps its v3 surface; SPR / Zen 4 / Zen 5 / Strix Halo gain BF16) and satisfies the "significant justification" he asked for on VNNI — namely, that v3-without-BF16 is a real shipping configuration, which v2-and-VNNI-without-v3 was not.Concretely this means editing
InstructionSetDesc.txt:76-77to:Avx512Bf16,Avx512Bf16.VL).AVX10v1to the newAVX512_BF16group.instructionsetrow for theAVX512_BF16group plusimplication ,X86 ,AVX512_BF16 ,AVX512v3.Alternative shapes considered:
Alignment with #128365's specific guidance
The PR that follows this issue will mirror these patterns from the VNNI review verbatim:
simdSize = -1on theHARDWARE_INTRINSICrows so that V128/V256/V512 calls all dispatch through one set of entries — matching theAVXVNNIINT/AVXVNNIINT_V512precedent. Mixing fixedsimdSize = 64with VL forms in one ISA was the trap on Add AvxVnni.V512 hardware intrinsics #128365.Compiler::lookupInstructionSet, not acodeman.cpp-side fallback. (Tanner's review oncodeman.cpp:1430of Add AvxVnni.V512 hardware intrinsics #128365 was explicit: "the right place is the JIT's class-name dispatch.") BF16 has no VEX-encoded sibling to fall back to, so the dispatch is just "return the group" — nocompSupportsHWIntrinsicbranch.91cdc6c2) twice flagged accidentally-dropped VNNI tokens; same care needed here.InstructionSetDesc.txtchange. Expect 4-6 merge cycles, each requiring a fresh GUID — never pick one side of the conflict.Open questions for confirmation by reviewers
I'd like @tannergooding (and the API-review team where relevant) to confirm or vary:
AVX512_BF16as own group with implication →AVX512v3— vs option (b) (same semantics, edit the existing rows in place) vs option (c) (widen v3, regressing Tiger Lake). My read is (a) best satisfies the "broadest overall support" principle while honouring the "significant justification" requirement. Open to direction.Vector*<BFloat16>using the existingSystem.BFloat16primitive — confirm this is the intended shape rather than aVector*<ushort>raw-bits stand-in.Avx512Bf16 : Avx512F(sibling ofAvx512Vbmi,AvxVnni, etc.) — confirm rather than nesting under a tier.MultiplyWideningAndAddfor the dot-product (parallel toAvxVnni),ConvertToBFloat16for the two/one-source conversions. Open to convention.[RequiresPreviewFeatures]initially, perAvxVnniprecedent.If any of (1)–(5) need to vary from what's proposed, please flag here so the draft PR (linked below once opened) can be aligned before review cycles start.
Alternatives considered
Avx10v*) — gated on AVX10, unavailable on Sapphire Rapids / Zen 4 / Zen 5 / Strix Halo. Does not serve the parts where the ML use case lives.References
AvxVnni.V512— the direct precedent this proposal follows.VDPBF16PS,VCVTNE2PS2BF16,VCVTNEPS2BF16.