Accelerate `Half` with FP16 ISA by anthonycanino · Pull Request #122649 · dotnet/runtime

anthonycanino · 2025-12-18T20:45:27Z

Draft PR for in-progress work to accelerate System.Half with FP16 ISA.

Current work done:

- Add a TYP_HALF to the .NET runtime, which is treated like a TYP_SIMDXX, but with some notable differences. Namely, a TYP_HALF is passed around via the xmm registers, and while it will pass a varTypeIsStruct test, it must be treated as a primitive in other places.
- Accelerate System.Half operations with the TYP_HALF and some FP16 intrinsics. Not every possible function has been accelerated yet.

For discussion:

- I have currently worked around some checks to make TYP_HALF behave like a struct and a primitive. It's very ad-hoc at the moment.
- Much of the work to transform the named System.Half intrinsics into a sequence of intrinsic nodes is done in importcall.cpp and might want to be moved up into some of the gtNewSimdXX nodes.

anthonycanino · 2025-12-18T20:50:41Z

@tannergooding @jakobbotsch please take a look when you get a chance.

anthonycanino · 2026-01-06T12:47:42Z

@dotnet/intel @tannergooding may I get some high level feedback on the structure of the PR?

tannergooding · 2026-04-15T17:29:06Z

Resolved merge conflicts and as part of that fixed the spacing in hwintrinsiclistxarch as a lot of it was messed up by the new column.

Copilot

Pull request overview

Copilot reviewed 46 out of 47 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (3)

src/coreclr/jit/utils.cpp:1

shiftRightJam produces incorrect results when dist == 0: it will set the jam bit whenever l != 0, even though shifting by 0 should not jam. Even if current call sites don't pass 0, this helper is a general utility and should be correct/stable. Consider explicitly handling dist == 0 (return l) and use a clear jam test like (l & ((1ULL << dist) - 1)) != 0 for 0 < dist < 64 to avoid subtle precedence/edge-case issues.
src/coreclr/vm/callingconvention.h:1
This Half FP-register classification is currently inside the #if !defined(UNIX_AMD64_ABI) block (Windows x64 path). However the surrounding comments and other VM changes describe Half as FP-reg passed on xarch more generally. If the JIT/ABI intends this for SysV AMD64 too, the UNIX_AMD64_ABI path needs an equivalent update; otherwise reflection/call helpers could disagree with actual managed calling convention on Linux/macOS and lead to invocation/call convention mismatches.
src/libraries/System.Private.CoreLib/src/System/Half.cs:1
Most changes in this file mark Half APIs with [Intrinsic] to enable JIT recognition, but Asin was changed to [MethodImpl(MethodImplOptions.AggressiveInlining)] instead. If the intent is to make Asin(Half) an intrinsic (or to keep attribute usage consistent for the acceleration discussion), consider using [Intrinsic] here as well (or clarify why this one is only inlined while similar methods are intrinsic-tagged).
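The `dist == 0` concern raised for shiftRightJam can be illustrated with a small standalone sketch. This is not the helper from utils.cpp (its source is not shown here); the name `ShiftRightJam`, its signature, and the handling of `dist >= 64` are assumptions made for illustration. It follows the suggestion in the comment: return the input unchanged for a zero shift, and test the shifted-out bits with an explicit mask.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical standalone helper modeled on the shiftRightJam discussed above.
// Shifts `value` right by `dist` bits and ORs a sticky ("jam") bit into bit 0
// when any nonzero bit was shifted out.
inline uint64_t ShiftRightJam(uint64_t value, int dist)
{
    assert(dist >= 0); // negative distances are not meaningful for this sketch

    if (dist == 0)
    {
        // Nothing is shifted out, so nothing may be jammed.
        return value;
    }
    if (dist >= 64)
    {
        // Every bit is shifted out; only the sticky bit can remain.
        return (value != 0) ? 1ULL : 0ULL;
    }

    // Bits that fall off the low end for 0 < dist < 64.
    const uint64_t lostBits = value & ((1ULL << dist) - 1);
    return (value >> dist) | ((lostBits != 0) ? 1ULL : 0ULL);
}
```

With this shape, `ShiftRightJam(0x12, 2)` yields `0x5` (the dropped `0b10` sets the sticky bit), while `ShiftRightJam(0xB, 0)` returns `0xB` unchanged.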

Copilot

Pull request overview

Copilot reviewed 46 out of 47 changed files in this pull request and generated no new comments.

Copilot

Pull request overview

Copilot reviewed 46 out of 47 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (8)

src/coreclr/jit/utils.cpp:1

shiftRightJam has a precedence bug: (l << (-dist & 63) != 0 ? 1UL : 0UL) is parsed as shifting by either 0/1, not as “shift then test non-zero”. This produces incorrect jamming behavior. Add parentheses so the non-zero test applies to the shifted value (and ensure the function behaves correctly when dist == 0, even if current callers don’t pass 0).
src/coreclr/jit/utils.cpp:1
HALF_POSITIVE_INFINITY_BITS/HALF_NEGATIVE_INFINITY_BITS are declared as uint64_t but used as 16-bit half bit patterns and returned as float16_t (typedef’d to uint16_t). This introduces unnecessary implicit narrowing and makes the intent less clear. Prefer declaring these constants as uint16_t (and keep all half-bit masks/constants consistently uint16_t).
src/coreclr/jit/utils.h:1
In utils.cpp a new BitOperations::UInt16BitsToHalf(uint16_t) definition was added, but this header hunk only shows adding HalfToUInt16Bits. If UInt16BitsToHalf wasn’t already declared elsewhere in BitOperations, the current change will fail to compile. Add the corresponding declaration next to HalfToUInt16Bits (or remove the definition if it’s unused).
src/coreclr/jit/vartype.h:1
This branch is for non-xarch targets (“Other targets pass them as regular structs”), but it forces TYP_HALF to be treated as a float-reg argument type. That is inconsistent with the comment and risks ABI mistakes if TYP_HALF ever appears on non-xarch builds (or in shared JIT logic). Restrict the TYP_HALF handling to xarch-only codepaths, or keep the non-xarch behavior as varTypeIsFloating(vt).
src/coreclr/vm/class.cpp:1
MethodTable::IsNativeHalfType() is likely consulted in hot-ish reflection/profiler/call descriptor paths; doing GetFullyQualifiedNameInfo + strcmp per query is avoidable overhead. Prefer a cheaper identity check (e.g., compare the MethodTable* against a pre-resolved System.Half MethodTable via binder/type lookup cached in the loader/EE, or cache a flag on the MethodTable once recognized). Keep the AVX10v1 gating, but make the type identity check O(1) without string comparisons.
src/libraries/System.Private.CoreLib/src/System/Half.cs:1
This change introduces [MethodImpl(MethodImplOptions.AggressiveInlining)] on Half.Asin, while most other Half members in this PR are annotated with [Intrinsic] for JIT recognition. If Asin is meant to be treated as an intrinsic, it should be marked consistently (and wired up in the JIT); if it’s not meant to be intrinsic, consider avoiding a one-off inlining attribute here to keep the intrinsic surface area/annotation patterns consistent.
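The suggestion to keep the half bit-pattern constants at 16 bits can be sketched as follows. The constant names come from the review comment, and the values are the IEEE 754 binary16 encodings of the two infinities. The `float16_t` typedef mirrors what the comment describes, while `HALF_MAGNITUDE_MASK`, `IsHalfInfinity`, and `HalfInfinity` are hypothetical names introduced only for this example.

```cpp
#include <cstdint>

// Assumption: float16_t is a plain 16-bit storage type, as the review comment states.
typedef uint16_t float16_t;

// binary16 layout: 1 sign bit, 5 exponent bits, 10 mantissa bits.
constexpr uint16_t HALF_POSITIVE_INFINITY_BITS = 0x7C00; // exponent all ones, mantissa zero
constexpr uint16_t HALF_NEGATIVE_INFINITY_BITS = 0xFC00; // same pattern with the sign bit set
constexpr uint16_t HALF_MAGNITUDE_MASK         = 0x7FFF; // hypothetical: clears the sign bit

// Hypothetical helper: true for either infinity.
constexpr bool IsHalfInfinity(float16_t bits)
{
    return (bits & HALF_MAGNITUDE_MASK) == HALF_POSITIVE_INFINITY_BITS;
}

// Hypothetical helper: returns an infinity without any implicit narrowing,
// because the constants already have the storage type's width.
constexpr float16_t HalfInfinity(bool negative)
{
    return negative ? HALF_NEGATIVE_INFINITY_BITS : HALF_POSITIVE_INFINITY_BITS;
}

static_assert(sizeof(float16_t) == 2, "half storage is expected to be 16 bits");
```

Declaring the constants as `uint16_t` means a function returning `float16_t` can return them directly, which is the clarity benefit the comment points at.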

jkotas · 2026-04-17T01:52:59Z

This is dynamically changing the calling convention based on CPU capabilities:

- The calling convention change needs to be mentioned in https://github.com/dotnet/runtime/blob/main/docs/design/coreclr/botr/clr-abi.md
- The calling convention change needs to be handled in crossgen. compExactlyDependsOn(InstructionSet_AVX10v1) on the JIT side is not sufficient.
- The calling convention handling in the VM looks very incomplete. For example, there are no provisions for returning floating point values in xmm registers on x86 - x86 still returns values using x86 FP stack, but the comments suggest that the half floating point value is returned in xmm registers.

@tannergooding I would recommend reverting this PR until this gets all addressed.

This reverts commit f7693e1.

Reverts #122649

Since #122649 had to be reverted due to the ABI concerns, this is a simpler initial change that works with the existing ABI and on hardware with AVX2 support (not just AVX512-FP16 capable hardware).

This should provide a nice win across most existing hardware and we can follow up with a PR that does similar for the AVX512-FP16 instructions that allow directly accelerated arithmetic operations, rather than only handling conversions.

### Before

```asm
; Method Program:HalfToSingle(System.Half):float (FullOpts)
G_M16314_IG01:  ;; offset=0x0000
       4883EC28             sub      rsp, 40
                            ;; size=4 bbWeight=1 PerfScore 0.25
G_M16314_IG02:  ;; offset=0x0004
       0FB7C9               movzx    rcx, cx
       FF156BA74500         call     [System.Half:op_Explicit(System.Half):float]
       90                   nop
                            ;; size=10 bbWeight=1 PerfScore 3.50
G_M16314_IG03:  ;; offset=0x000E
       4883C428             add      rsp, 40
       C3                   ret
                            ;; size=5 bbWeight=1 PerfScore 1.25
; Total bytes of code: 19

; Method Program:SingleToHalf(float):System.Half (FullOpts)
G_M32250_IG01:  ;; offset=0x0000
                            ;; size=0 bbWeight=1 PerfScore 0.00
G_M32250_IG02:  ;; offset=0x0000
       FF2572A74500         tail.jmp [System.Half:op_Explicit(float):System.Half]
                            ;; size=6 bbWeight=1 PerfScore 2.00
; Total bytes of code: 6
```

### After

```asm
; Method Program:HalfToSingle(System.Half):float (FullOpts)
G_M15861_IG01:  ;; offset=0x0000
                            ;; size=0 bbWeight=1 PerfScore 0.00
G_M15861_IG02:  ;; offset=0x0000
       0FB7C1               movzx    rax, cx
       C5F96EC0             vmovd    xmm0, eax
       C4E27913C0           vcvtph2ps xmm0, xmm0
                            ;; size=12 bbWeight=1 PerfScore 6.25
G_M15861_IG03:  ;; offset=0x000C
       C3                   ret
                            ;; size=1 bbWeight=1 PerfScore 1.00
; Total bytes of code: 13

; Method Program:SingleToHalf(float):System.Half (FullOpts)
G_M15413_IG01:  ;; offset=0x0000
                            ;; size=0 bbWeight=1 PerfScore 0.00
G_M15413_IG02:  ;; offset=0x0000
       C4E3791DC000         vcvtps2ph xmm0, xmm0, 0
       C5F97EC0             vmovd    eax, xmm0
       0FB7C0               movzx    rax, ax
                            ;; size=13 bbWeight=1 PerfScore 6.25
G_M15413_IG03:  ;; offset=0x000D
       C3                   ret
                            ;; size=1 bbWeight=1 PerfScore 1.00
; Total bytes of code: 14
```

github-actions bot added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) Dec 18, 2025

dotnet-policy-service bot added the community-contribution label (Indicates that the PR has been added by a community member) Dec 18, 2025

jakobbotsch reviewed Jan 5, 2026

Comment thread src/coreclr/jit/codegencommon.cpp

anthonycanino force-pushed the half-xmm-struct-abi branch from 3b8abaa to f633726 Compare January 5, 2026 19:52
