[API Proposal]: Add managed surface for AVX-512 BF16 hardware intrinsics

### Background and motivation

The classic `AVX512_BF16` instruction set adds bfloat16 support to AVX-512: a bf16 dot-product accumulating into FP32 (`VDPBF16PS`) and FP32→bf16 round-to-nearest-even conversions (`VCVTNE2PS2BF16`, `VCVTNEPS2BF16`). It is supported on Intel Cooper Lake / Ice Lake-SP / Sapphire Rapids+ and on AMD **Zen 4 and Zen 5** (including **Strix Halo**), but there is currently **no managed surface for it** in `System.Runtime.Intrinsics.X86` on the classic-AVX512 path.

This is exactly the situation #86849 / PR #128365 solved for `AVX512_VNNI`: an R2R name and the underlying JIT plumbing existed under `AVX10v*`, but shipping Sapphire Rapids / Zen 4 / Zen 5 hardware does not enable AVX10, so the surface was unreachable on the parts where it matters. The fix shape that landed there — separate managed surface tied to the classic gate, AVX10 path stays parallel — is the precedent this proposal follows.

#### What already exists in dotnet/runtime (verified by grep)

- `System.BFloat16` primitive (`src/libraries/System.Private.CoreLib/src/System/BitConverter.cs:312-1016` — full `GetBytes(BFloat16)`, `ToBFloat16`, `Int16BitsToBFloat16`, `BFloat16ToInt16Bits` etc.). So `Vector*<BFloat16>` is the natural element type — no `ushort` interim.
- R2R IDs already allocated: `READYTORUN_INSTRUCTION_Avx512Bf16 = 72`, `READYTORUN_INSTRUCTION_Avx512Bf16_VL = 73` (`src/coreclr/inc/readytoruninstructionset.h:82-83`).
- `InstructionSetDesc.txt:76-77` has rows for `Avx512Bf16` and `Avx512Bf16_VL` — **managed-name column (column 3) is empty**, parent (column 6) is `AVX10v1`:
  ```
  instructionset     ,X86   ,                     ,Avx512Bf16         ,72 ,AVX10v1               ,avx10v1
  instructionset     ,X86   ,                     ,Avx512Bf16_VL      ,73 ,AVX10v1               ,avx10v1
  ```
- JIT instruction encodings already in `src/coreclr/jit/instrsxarch.h`:
  - `vdpbf16ps` (line 1035) — EVEX
  - `vcvtne2ps2bf16` (line 994) — EVEX
  - `vcvtneps2bf16` (line 995) — EVEX

So the scope of the BF16 PR is small relative to a green-field intrinsic: fill in the managed-name column, route to the classic gate, add the public C# class, wire the JIT dispatch, add tests.

#### Motivating use case — quantized / mixed-precision ML inference

`VDPBF16PS` accumulates bf16×bf16 products directly into an FP32 accumulator. For block-scaled integer quantization (e.g. GGUF `Q8_0`: 32 int8 + one fp16 scale per block), an integer-VNNI dot must convert+rescale after *every* block, which prevents long accumulation. A BF16 path can dequantize int8→bf16 **folding the per-block scale into the value**, then accumulate the already-scaled products across all blocks in one FP32 accumulator via `VDPBF16PS` — removing the per-block reduction entirely. Concrete data point: an A/B on Zen 5 / Strix Halo had a 512-bit integer-VNNI Q8_0 GEMM kernel hitting only *parity* with the existing maddubs path precisely because of those per-block reductions; BF16 long-accumulation is the path past that ceiling. BF16 GEMM and attention more generally also benefit.

### API Proposal

Mirrors the established `Avx512*` / `AvxVnni` pattern.

```csharp
namespace System.Runtime.Intrinsics.X86
{
    /// <summary>Provides access to X86 AVX512-BF16 hardware instructions via intrinsics.</summary>
    [Intrinsic]
    public abstract class Avx512Bf16 : Avx512F
    {
        public static new bool IsSupported { get; }

        /// <summary>__m512 _mm512_dpbf16_ps (__m512 src, __m512bh a, __m512bh b) — VDPBF16PS zmm, zmm, zmm/m512</summary>
        public static Vector512<float> MultiplyWideningAndAdd(Vector512<float> addend, Vector512<BFloat16> left, Vector512<BFloat16> right);

        /// <summary>__m512bh _mm512_cvtne2ps_pbh (__m512 a, __m512 b) — VCVTNE2PS2BF16 zmm, zmm, zmm/m512 (packs two fp32 vectors → one bf16 vector)</summary>
        public static Vector512<BFloat16> ConvertToBFloat16(Vector512<float> lower, Vector512<float> upper);

        /// <summary>__m256bh _mm512_cvtneps_pbh (__m512 a) — VCVTNEPS2BF16 ymm, zmm/m512</summary>
        public static Vector256<BFloat16> ConvertToBFloat16(Vector512<float> value);

        [Intrinsic] public new abstract class X64 : Avx512F.X64 { public static new bool IsSupported { get; } }

        [Intrinsic]
        public new abstract class VL : Avx512F.VL
        {
            public static new bool IsSupported { get; }
            public static Vector128<float> MultiplyWideningAndAdd(Vector128<float> addend, Vector128<BFloat16> left, Vector128<BFloat16> right);
            public static Vector256<float> MultiplyWideningAndAdd(Vector256<float> addend, Vector256<BFloat16> left, Vector256<BFloat16> right);
            public static Vector128<BFloat16> ConvertToBFloat16(Vector128<float> lower, Vector128<float> upper);
            public static Vector256<BFloat16> ConvertToBFloat16(Vector256<float> lower, Vector256<float> upper);
            public static Vector128<BFloat16> ConvertToBFloat16(Vector128<float> value);
            public static Vector128<BFloat16> ConvertToBFloat16(Vector256<float> value);
        }
    }
}
```

### Proposed ISA grouping — alignment with the #128365 precedent

This is the architectural question that ate the most review time on the VNNI PR, so it is being surfaced up-front for confirmation rather than left for review-time discovery.

@tannergooding's stated principle on #128365 (resolved in commit `ebddfff3`):

> *"While there is a very small subset of hardware that contains just AVX512v2 + AVX-512-VNNI, we don't want to support that being enabled/disabled independently without significant justification, as it adds unnecessary complexity to the overall support and testing matrix. Instead, we just want to separate this based on what captures the broadest overall support."*

Applied to BF16, the symmetric question is: **does v3-as-currently-defined imply BF16 on shipping hardware?**

|                                                  | full AVX512v3 group | AVX512_BF16 |
|--------------------------------------------------|:-------------------:|:-----------:|
| Intel Cooper Lake / Ice Lake-SP / Sapphire Rapids+ | ✓ | ✓ |
| AMD Zen 4 (Genoa, Bergamo)                       | ✓ | ✓ |
| AMD Zen 5 (Granite Ridge, Turin, **Strix Halo**) | ✓ | ✓ |
| Intel Ice Lake-U/Y, Tiger Lake, Rocket Lake      | ✓ | ✗ |

This is asymmetric with the VNNI case: there is real shipping hardware (Intel client AVX-512 parts pre-disable) that ships full v3 but **lacks BF16**. Widening v3 to require BF16 would silently drop v3 enablement on those parts — a regression for those users.

**Recommended shape (option a):** add `AVX512_BF16` as its own ISA group with `implication AVX512_BF16 → AVX512v3`. That preserves Tanner's "what captures the broadest overall support" principle (Tiger Lake keeps its v3 surface; SPR / Zen 4 / Zen 5 / Strix Halo gain BF16) and satisfies the "significant justification" he asked for on VNNI — namely, that v3-without-BF16 is a real shipping configuration, which v2-and-VNNI-without-v3 was not.

Concretely this means editing `InstructionSetDesc.txt:76-77` to:
1. Fill in column 3 with the managed names (`Avx512Bf16`, `Avx512Bf16.VL`).
2. Re-point column 6 from `AVX10v1` to the new `AVX512_BF16` group.
3. Add an `instructionset` row for the `AVX512_BF16` group plus `implication ,X86 ,AVX512_BF16 ,AVX512v3`.

Alternative shapes considered:
- **(b)** Same as (a) framed as "edit the existing rows" rather than introduce a new group — likely produces a smaller diff but functionally equivalent.
- **(c)** Widen v3 to *require* BF16. Mentioned for completeness only — regresses Tiger Lake et al. as noted, do **not** pursue unless explicitly directed.

### Alignment with #128365's specific guidance

The PR that follows this issue will mirror these patterns from the VNNI review verbatim:

- **No separate AVXVNNI_V512-style ISA "just because it's a V512 thing"** — the v512 width and the classic-gate group are one ISA, not two.
- **`simdSize = -1`** on the `HARDWARE_INTRINSIC` rows so that V128/V256/V512 calls all dispatch through one set of entries — matching the `AVXVNNIINT` / `AVXVNNIINT_V512` precedent. Mixing fixed `simdSize = 64` with VL forms in one ISA was the trap on #128365.
- **Dispatch in `Compiler::lookupInstructionSet`**, not a `codeman.cpp`-side fallback. (Tanner's review on `codeman.cpp:1430` of #128365 was explicit: "the right place is the JIT's class-name dispatch.") BF16 has no VEX-encoded sibling to fall back to, so the dispatch is just "return the group" — no `compSupportsHWIntrinsic` branch.
- **Don't drop tokens from the required-ISAs comments** next to existing config flags. Nit pass on #128365 (commit `91cdc6c2`) twice flagged accidentally-dropped VNNI tokens; same care needed here.
- **JIT-EE GUID** regenerated every `InstructionSetDesc.txt` change. Expect 4-6 merge cycles, each requiring a fresh GUID — never pick one side of the conflict.

### Open questions for confirmation by reviewers

I'd like @tannergooding (and the API-review team where relevant) to confirm or vary:

1. **Group shape.** Approve option (a) — `AVX512_BF16` as own group with implication → `AVX512v3` — vs option (b) (same semantics, edit the existing rows in place) vs option (c) (widen v3, regressing Tiger Lake). My read is (a) best satisfies the "broadest overall support" principle while honouring the "significant justification" requirement. Open to direction.
2. **Element type.** `Vector*<BFloat16>` using the existing `System.BFloat16` primitive — confirm this is the intended shape rather than a `Vector*<ushort>` raw-bits stand-in.
3. **Parent class.** `Avx512Bf16 : Avx512F` (sibling of `Avx512Vbmi`, `AvxVnni`, etc.) — confirm rather than nesting under a tier.
4. **Naming.** `MultiplyWideningAndAdd` for the dot-product (parallel to `AvxVnni`), `ConvertToBFloat16` for the two/one-source conversions. Open to convention.
5. **Preview gating.** `[RequiresPreviewFeatures]` initially, per `AvxVnni` precedent.

If any of (1)–(5) need to vary from what's proposed, please flag here so the draft PR (linked below once opened) can be aligned before review cycles start.

### Alternatives considered

- AVX10 BF16 (`Avx10v*`) — gated on AVX10, unavailable on Sapphire Rapids / Zen 4 / Zen 5 / Strix Halo. Does not serve the parts where the ML use case lives.
- Software bf16 dot — no hardware acceleration; defeats the motivating use case.

### References

- #86849 / PR #128365 — `AvxVnni.V512` — the direct precedent this proposal follows.
- The classic Intel Software Developer's Manual entries for `VDPBF16PS`, `VCVTNE2PS2BF16`, `VCVTNEPS2BF16`.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[API Proposal]: Add managed surface for AVX-512 BF16 hardware intrinsics #129323

Background and motivation

What already exists in dotnet/runtime (verified by grep)

Motivating use case — quantized / mixed-precision ML inference

API Proposal

Proposed ISA grouping — alignment with the #128365 precedent

Alignment with #128365's specific guidance

Open questions for confirmation by reviewers

Alternatives considered

References

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

	full AVX512v3 group	AVX512_BF16
Intel Cooper Lake / Ice Lake-SP / Sapphire Rapids+	✓	✓
AMD Zen 4 (Genoa, Bergamo)	✓	✓
AMD Zen 5 (Granite Ridge, Turin, Strix Halo)	✓	✓
Intel Ice Lake-U/Y, Tiger Lake, Rocket Lake	✓	✗

[API Proposal]: Add managed surface for AVX-512 BF16 hardware intrinsics #129323

Description

Background and motivation

What already exists in dotnet/runtime (verified by grep)

Motivating use case — quantized / mixed-precision ML inference

API Proposal

Proposed ISA grouping — alignment with the #128365 precedent

Alignment with #128365's specific guidance

Open questions for confirmation by reviewers

Alternatives considered

References

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions