Accelerate Half with FP16 ISA #122649

Merged: tannergooding merged 15 commits into dotnet:main from anthonycanino:half-xmm-struct-abi on Apr 16, 2026

@anthonycanino
Contributor

Draft PR for in-progress work to accelerate System.Half with FP16 ISA.

Current work done:

  1. Add a TYP_HALF to the .NET runtime, which is treated like a TYP_SIMDXX, but with some notable differences. Namely, a TYP_HALF is passed around via the xmm registers, and while it will pass a varTypeIsStruct test, it must be treated as a primitive in other places.

  2. Accelerate System.Half operations with the TYP_HALF and some FP16 intrinsics. Not every possible function has been accelerated yet.

For discussion:

  1. I have currently worked around some checks to make TYP_HALF behave like a struct and a primitive. It's very ad-hoc at the moment.

  2. Much of the work to transform the named System.Half intrinsics into a sequence of intrinsic nodes is done in importcall.cpp and might want to be moved up into some of the gtNewSimdXX nodes.

github-actions bot added the area-CodeGen-coreclr label on Dec 18, 2025
@anthonycanino
Contributor Author

@tannergooding @jakobbotsch please take a look when you get a chance.

Comment thread src/coreclr/jit/codegencommon.cpp
@anthonycanino
Contributor Author

@dotnet/intel @tannergooding may I get some high level feedback on the structure of the PR?

Comment thread src/coreclr/jit/codegenxarch.cpp
Comment thread src/coreclr/jit/compiler.cpp Outdated
Comment thread src/coreclr/jit/compiler.cpp
Comment thread src/coreclr/jit/emitxarch.cpp
Comment thread src/coreclr/jit/emitxarch.cpp Outdated
Comment thread src/coreclr/jit/emitxarch.cpp Outdated
Comment thread src/coreclr/jit/emitxarch.cpp
Comment thread src/coreclr/jit/emitxarch.cpp Outdated
Comment thread src/coreclr/jit/emitxarch.cpp Outdated
Comment thread src/coreclr/jit/emitxarch.cpp Outdated
Comment thread src/coreclr/jit/gentree.cpp Outdated
Comment thread src/coreclr/jit/gentree.h Outdated
Comment thread src/coreclr/jit/hwintrinsiccodegenxarch.cpp Outdated
Comment thread src/coreclr/jit/importer.cpp
Comment thread src/coreclr/jit/importer.cpp
Comment thread src/coreclr/jit/importercalls.cpp
Comment thread src/coreclr/jit/instr.cpp
Comment thread src/coreclr/jit/lower.cpp Outdated
Comment thread src/coreclr/jit/lsrabuild.cpp Outdated
Comment thread src/coreclr/jit/lsrabuild.cpp
Copilot AI review requested due to automatic review settings April 15, 2026 17:28
@tannergooding
Member

Resolved merge conflicts and as part of that fixed the spacing in hwintrinsiclistxarch as a lot of it was messed up by the new column.

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 46 out of 47 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (3)

src/coreclr/jit/utils.cpp:1

  • shiftRightJam produces incorrect results when dist == 0: it will set the jam bit whenever l != 0, even though shifting by 0 should not jam. Even if current call sites don't pass 0, this helper is a general utility and should be correct/stable. Consider explicitly handling dist == 0 (return l) and use a clear jam test like (l & ((1ULL << dist) - 1)) != 0 for 0 < dist < 64 to avoid subtle precedence/edge-case issues.
src/coreclr/vm/callingconvention.h:1
  • This Half FP-register classification is currently inside the #if !defined(UNIX_AMD64_ABI) block (Windows x64 path). However the surrounding comments and other VM changes describe Half as FP-reg passed on xarch more generally. If the JIT/ABI intends this for SysV AMD64 too, the UNIX_AMD64_ABI path needs an equivalent update; otherwise reflection/call helpers could disagree with actual managed calling convention on Linux/macOS and lead to invocation/call convention mismatches.
src/libraries/System.Private.CoreLib/src/System/Half.cs:1
  • Most changes in this file mark Half APIs with [Intrinsic] to enable JIT recognition, but Asin was changed to [MethodImpl(MethodImplOptions.AggressiveInlining)] instead. If the intent is to make Asin(Half) an intrinsic (or to keep attribute usage consistent for the acceleration discussion), consider using [Intrinsic] here as well (or clarify why this one is only inlined while similar methods are intrinsic-tagged).
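
The `dist == 0` handling and explicit jam mask suggested for `shiftRightJam` in the first comment of this list can be sketched roughly as follows. This is only an illustration: `ShiftRightJam`, its signature, and its body are hypothetical stand-ins written for this note, not the helper from the PR.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the helper under review; the name, signature, and
// body are illustrative assumptions, not the code from the PR.
// Shifts `value` right by `dist` bits and ORs a "sticky" 1 into bit 0 when any
// nonzero bit was shifted out, so that later rounding can still see the loss.
uint64_t ShiftRightJam(uint64_t value, unsigned dist)
{
    if (dist == 0)
    {
        return value; // nothing is shifted out, so nothing is jammed
    }
    if (dist >= 64)
    {
        return (value != 0) ? 1 : 0; // every bit is shifted out
    }

    uint64_t lostBits = value & ((1ULL << dist) - 1); // the bits that fall off
    return (value >> dist) | ((lostBits != 0) ? 1ULL : 0ULL);
}

// A few spot checks of the behavior described above.
void ShiftRightJamExamples()
{
    assert(ShiftRightJam(0xA, 0) == 0xA);  // dist == 0: value is unchanged
    assert(ShiftRightJam(0x10, 3) == 0x2); // only zeros lost: no jam bit
    assert(ShiftRightJam(0x11, 3) == 0x3); // a one was lost: bit 0 is set
}
```

Computing the lost bits with a mask keeps the shift and the non-zero test in separate statements, so the result does not depend on how a combined expression is parsed.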

Comment thread src/coreclr/jit/instr.cpp
Comment thread src/coreclr/jit/instr.cpp
Comment thread src/coreclr/jit/compiler.cpp
Comment thread src/coreclr/jit/emitxarch.cpp
Copilot AI review requested due to automatic review settings April 15, 2026 22:02
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 46 out of 47 changed files in this pull request and generated no new comments.

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 46 out of 47 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (8)

src/coreclr/jit/utils.cpp:1

  • shiftRightJam has a precedence bug: (l << (-dist & 63) != 0 ? 1UL : 0UL) is parsed as shifting by either 0/1, not as “shift then test non-zero”. This produces incorrect jamming behavior. Add parentheses so the non-zero test applies to the shifted value (and ensure the function behaves correctly when dist == 0, even if current callers don’t pass 0).
src/coreclr/jit/utils.cpp:1
  • HALF_POSITIVE_INFINITY_BITS/HALF_NEGATIVE_INFINITY_BITS are declared as uint64_t but used as 16-bit half bit patterns and returned as float16_t (typedef’d to uint16_t). This introduces unnecessary implicit narrowing and makes the intent less clear. Prefer declaring these constants as uint16_t (and keep all half-bit masks/constants consistently uint16_t).
src/coreclr/jit/utils.cpp:1
  • HALF_POSITIVE_INFINITY_BITS/HALF_NEGATIVE_INFINITY_BITS are declared as uint64_t but used as 16-bit half bit patterns and returned as float16_t (typedef’d to uint16_t). This introduces unnecessary implicit narrowing and makes the intent less clear. Prefer declaring these constants as uint16_t (and keep all half-bit masks/constants consistently uint16_t).
src/coreclr/jit/utils.cpp:1
  • HALF_POSITIVE_INFINITY_BITS/HALF_NEGATIVE_INFINITY_BITS are declared as uint64_t but used as 16-bit half bit patterns and returned as float16_t (typedef’d to uint16_t). This introduces unnecessary implicit narrowing and makes the intent less clear. Prefer declaring these constants as uint16_t (and keep all half-bit masks/constants consistently uint16_t).
src/coreclr/jit/utils.h:1
  • In utils.cpp a new BitOperations::UInt16BitsToHalf(uint16_t) definition was added, but this header hunk only shows adding HalfToUInt16Bits. If UInt16BitsToHalf wasn’t already declared elsewhere in BitOperations, the current change will fail to compile. Add the corresponding declaration next to HalfToUInt16Bits (or remove the definition if it’s unused).
src/coreclr/jit/vartype.h:1
  • This branch is for non-xarch targets (“Other targets pass them as regular structs”), but it forces TYP_HALF to be treated as a float-reg argument type. That is inconsistent with the comment and risks ABI mistakes if TYP_HALF ever appears on non-xarch builds (or in shared JIT logic). Restrict the TYP_HALF handling to xarch-only codepaths, or keep the non-xarch behavior as varTypeIsFloating(vt).
src/coreclr/vm/class.cpp:1
  • MethodTable::IsNativeHalfType() is likely consulted in hot-ish reflection/profiler/call descriptor paths; doing GetFullyQualifiedNameInfo + strcmp per query is avoidable overhead. Prefer a cheaper identity check (e.g., compare the MethodTable* against a pre-resolved System.Half MethodTable via binder/type lookup cached in the loader/EE, or cache a flag on the MethodTable once recognized). Keep the AVX10v1 gating, but make the type identity check O(1) without string comparisons.
src/libraries/System.Private.CoreLib/src/System/Half.cs:1
  • This change introduces [MethodImpl(MethodImplOptions.AggressiveInlining)] on Half.Asin, while most other Half members in this PR are annotated with [Intrinsic] for JIT recognition. If Asin is meant to be treated as an intrinsic, it should be marked consistently (and wired up in the JIT); if it’s not meant to be intrinsic, consider avoiding a one-off inlining attribute here to keep the intrinsic surface area/annotation patterns consistent.

Comment thread src/coreclr/jit/importercalls.cpp
@tannergooding tannergooding merged commit f7693e1 into dotnet:main Apr 16, 2026
180 of 190 checks passed
@jkotas
Member

jkotas commented Apr 17, 2026

This is dynamically changing the calling convention based on CPU capabilities:

  • The calling convention change needs to be mentioned in https://github.com/dotnet/runtime/blob/main/docs/design/coreclr/botr/clr-abi.md
  • The calling convention change needs to be handled in crossgen. compExactlyDependsOn(InstructionSet_AVX10v1) on the JIT side is not sufficient.
  • The calling convention handling in the VM looks very incomplete. For example, there are no provisions for returning floating point values in xmm registers on x86 - x86 still returns values using x86 FP stack, but the comments suggest that the half floating point value is returned in xmm registers.

@tannergooding I would recommend reverting this PR until this gets all addressed.

tannergooding added a commit that referenced this pull request Apr 17, 2026
tannergooding added a commit that referenced this pull request Apr 17, 2026
tannergooding added a commit that referenced this pull request Apr 18, 2026
Since #122649 had to be reverted
due to the ABI concerns, this is a simpler initial change that works
with the existing ABI and on hardware with AVX2 support (not just
AVX512-FP16 capable hardware).

This should provide a nice win across most existing hardware, and we can
follow up with a PR that does the same for the AVX512-FP16 instructions,
which allow directly accelerated arithmetic operations rather than only
handling conversions.

### Before

```asm
; Method Program:HalfToSingle(System.Half):float (FullOpts)
G_M16314_IG01:  ;; offset=0x0000
       4883EC28             sub      rsp, 40
						;; size=4 bbWeight=1 PerfScore 0.25

G_M16314_IG02:  ;; offset=0x0004
       0FB7C9               movzx    rcx, cx
       FF156BA74500         call     [System.Half:op_Explicit(System.Half):float]
       90                   nop      
						;; size=10 bbWeight=1 PerfScore 3.50

G_M16314_IG03:  ;; offset=0x000E
       4883C428             add      rsp, 40
       C3                   ret      
						;; size=5 bbWeight=1 PerfScore 1.25
; Total bytes of code: 19

; Method Program:SingleToHalf(float):System.Half (FullOpts)
G_M32250_IG01:  ;; offset=0x0000
						;; size=0 bbWeight=1 PerfScore 0.00

G_M32250_IG02:  ;; offset=0x0000
       FF2572A74500         tail.jmp [System.Half:op_Explicit(float):System.Half]
						;; size=6 bbWeight=1 PerfScore 2.00
; Total bytes of code: 6

```

### After

```asm
; Method Program:HalfToSingle(System.Half):float (FullOpts)
G_M15861_IG01:  ;; offset=0x0000
						;; size=0 bbWeight=1 PerfScore 0.00

G_M15861_IG02:  ;; offset=0x0000
       0FB7C1               movzx    rax, cx
       C5F96EC0             vmovd    xmm0, eax
       C4E27913C0           vcvtph2ps xmm0, xmm0
						;; size=12 bbWeight=1 PerfScore 6.25

G_M15861_IG03:  ;; offset=0x000C
       C3                   ret      
						;; size=1 bbWeight=1 PerfScore 1.00
; Total bytes of code: 13

; Method Program:SingleToHalf(float):System.Half (FullOpts)
G_M15413_IG01:  ;; offset=0x0000
						;; size=0 bbWeight=1 PerfScore 0.00

G_M15413_IG02:  ;; offset=0x0000
       C4E3791DC000         vcvtps2ph xmm0, xmm0, 0
       C5F97EC0             vmovd    eax, xmm0
       0FB7C0               movzx    rax, ax
						;; size=13 bbWeight=1 PerfScore 6.25

G_M15413_IG03:  ;; offset=0x000D
       C3                   ret      
						;; size=1 bbWeight=1 PerfScore 1.00
; Total bytes of code: 14
```
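
For readers unfamiliar with `vcvtph2ps`, the half-to-single widening it performs in the "After" listing can be sketched in portable C++. This is an illustrative sketch only, not the managed `System.Half` code and not anything emitted by the JIT. `HalfBitsToSingle` is a hypothetical name, and NaN handling is simplified to copying the payload bits.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not from the PR): widens the raw bits of an IEEE 754
// binary16 value to a binary32 float. Every half value is exactly
// representable as a float, so no rounding is needed.
float HalfBitsToSingle(uint16_t h)
{
    uint32_t sign = static_cast<uint32_t>(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1Fu; // 5-bit biased exponent (bias 15)
    uint32_t mant = h & 0x3FFu;        // 10-bit fraction
    uint32_t bits;

    if (exp == 0x1Fu)
    {
        // Infinity or NaN: all-ones exponent, fraction copied across.
        bits = sign | 0x7F800000u | (mant << 13);
    }
    else if (exp != 0)
    {
        // Normal number: rebias the exponent from 15 to 127 (add 112).
        bits = sign | ((exp + 112u) << 23) | (mant << 13);
    }
    else if (mant != 0)
    {
        // Subnormal half: shift until the implicit leading one appears,
        // then store it as a normal float with a matching exponent.
        int shift = 0;
        while ((mant & 0x400u) == 0)
        {
            mant <<= 1;
            shift++;
        }
        mant &= 0x3FFu; // drop the now-implicit leading one
        bits = sign | (static_cast<uint32_t>(113 - shift) << 23) | (mant << 13);
    }
    else
    {
        bits = sign; // positive or negative zero
    }

    float result;
    std::memcpy(&result, &bits, sizeof(result)); // reinterpret bits as a float
    return result;
}

// A few spot checks using well-known half bit patterns.
void HalfBitsToSingleExamples()
{
    assert(HalfBitsToSingle(0x3C00) == 1.0f);     // 1.0
    assert(HalfBitsToSingle(0xC000) == -2.0f);    // -2.0
    assert(HalfBitsToSingle(0x7BFF) == 65504.0f); // largest finite half
}
```

The branches over the exponent field are the work that the hardware instruction does in a single operation, which is why the "After" listing needs no call.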
Labels

  • area-CodeGen-coreclr: CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
  • community-contribution: Indicates that the PR has been added by a community member