Accelerate Half with FP16 ISA #122649
@tannergooding @jakobbotsch please take a look when you get a chance.
@dotnet/intel @tannergooding may I get some high level feedback on the structure of the PR?
Resolved merge conflicts and as part of that fixed the spacing in
Pull request overview
Copilot reviewed 46 out of 47 changed files in this pull request and generated 4 comments.
Comments suppressed due to low confidence (3)
src/coreclr/jit/utils.cpp:1
`shiftRightJam` produces incorrect results when `dist == 0`: it will set the jam bit whenever `l != 0`, even though shifting by 0 should not jam. Even if current call sites don't pass 0, this helper is a general utility and should be correct/stable. Consider explicitly handling `dist == 0` (return `l`) and use a clear jam test like `(l & ((1ULL << dist) - 1)) != 0` for `0 < dist < 64` to avoid subtle precedence/edge-case issues.
src/coreclr/vm/callingconvention.h:1
This `Half` FP-register classification is currently inside the `#if !defined(UNIX_AMD64_ABI)` block (Windows x64 path). However the surrounding comments and other VM changes describe `Half` as FP-reg passed on xarch more generally. If the JIT/ABI intends this for SysV AMD64 too, the UNIX_AMD64_ABI path needs an equivalent update; otherwise reflection/call helpers could disagree with actual managed calling convention on Linux/macOS and lead to invocation/call convention mismatches.
src/libraries/System.Private.CoreLib/src/System/Half.cs:1
Most changes in this file mark `Half` APIs with `[Intrinsic]` to enable JIT recognition, but `Asin` was changed to `[MethodImpl(MethodImplOptions.AggressiveInlining)]` instead. If the intent is to make `Asin(Half)` an intrinsic (or to keep attribute usage consistent for the acceleration discussion), consider using `[Intrinsic]` here as well (or clarify why this one is only inlined while similar methods are intrinsic-tagged).
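The `dist == 0` concern raised for `shiftRightJam` can be illustrated with a minimal standalone sketch. This is not the code from the PR: the signature, the parameter names, and the handling of `dist >= 64` are assumptions made for illustration, and it only follows the shape the review comment suggests (return the value unchanged for a zero shift, otherwise OR in a sticky bit when any shifted-out bit was set).

```cpp
#include <cstdint>

// Hypothetical standalone helper, not the JIT's actual implementation.
// Shifts `value` right by `dist` bits and "jams" (ORs) a 1 into the least
// significant bit of the result if any of the discarded bits were non-zero.
inline uint64_t shiftRightJam(uint64_t value, unsigned dist)
{
    if (dist == 0)
    {
        // Nothing is shifted out, so there is nothing to jam.
        return value;
    }

    if (dist >= 64)
    {
        // Every bit is shifted out; only the sticky bit can survive.
        return (value != 0) ? 1u : 0u;
    }

    // Mask covering exactly the `dist` low bits that the shift discards.
    const uint64_t lostMask = (uint64_t{1} << dist) - 1;
    const uint64_t sticky   = ((value & lostMask) != 0) ? 1u : 0u;

    return (value >> dist) | sticky;
}
```

With this shape, `shiftRightJam(5, 0)` returns `5` rather than setting the jam bit, while `shiftRightJam(0b1001, 2)` returns `0b11` because a set bit was shifted out.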
Pull request overview
Copilot reviewed 46 out of 47 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (8)
src/coreclr/jit/utils.cpp:1
`shiftRightJam` has a precedence bug: `(l << (-dist & 63) != 0 ? 1UL : 0UL)` is parsed as shifting by either 0/1, not as “shift then test non-zero”. This produces incorrect jamming behavior. Add parentheses so the non-zero test applies to the shifted value (and ensure the function behaves correctly when `dist == 0`, even if current callers don’t pass 0).
src/coreclr/jit/utils.cpp:1
`HALF_POSITIVE_INFINITY_BITS`/`HALF_NEGATIVE_INFINITY_BITS` are declared as `uint64_t` but used as 16-bit half bit patterns and returned as `float16_t` (typedef’d to `uint16_t`). This introduces unnecessary implicit narrowing and makes the intent less clear. Prefer declaring these constants as `uint16_t` (and keep all half-bit masks/constants consistently `uint16_t`).
src/coreclr/jit/utils.h:1
In `utils.cpp` a new `BitOperations::UInt16BitsToHalf(uint16_t)` definition was added, but this header hunk only shows adding `HalfToUInt16Bits`. If `UInt16BitsToHalf` wasn’t already declared elsewhere in `BitOperations`, the current change will fail to compile. Add the corresponding declaration next to `HalfToUInt16Bits` (or remove the definition if it’s unused).
src/coreclr/jit/vartype.h:1
This branch is for non-xarch targets (“Other targets pass them as regular structs”), but it forces `TYP_HALF` to be treated as a float-reg argument type. That is inconsistent with the comment and risks ABI mistakes if `TYP_HALF` ever appears on non-xarch builds (or in shared JIT logic). Restrict the `TYP_HALF` handling to xarch-only codepaths, or keep the non-xarch behavior as `varTypeIsFloating(vt)`.
src/coreclr/vm/class.cpp:1
`MethodTable::IsNativeHalfType()` is likely consulted in hot-ish reflection/profiler/call descriptor paths; doing `GetFullyQualifiedNameInfo` + `strcmp` per query is avoidable overhead. Prefer a cheaper identity check (e.g., compare the `MethodTable*` against a pre-resolved `System.Half` MethodTable via binder/type lookup cached in the loader/EE, or cache a flag on the MethodTable once recognized). Keep the AVX10v1 gating, but make the type identity check O(1) without string comparisons.
src/libraries/System.Private.CoreLib/src/System/Half.cs:1
This change introduces `[MethodImpl(MethodImplOptions.AggressiveInlining)]` on `Half.Asin`, while most other Half members in this PR are annotated with `[Intrinsic]` for JIT recognition. If `Asin` is meant to be treated as an intrinsic, it should be marked consistently (and wired up in the JIT); if it’s not meant to be intrinsic, consider avoiding a one-off inlining attribute here to keep the intrinsic surface area/annotation patterns consistent.
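The suggestion to keep the half-precision bit constants at 16 bits could look roughly like the sketch below. The constant names and the `float16_t` typedef are taken from the review comment; the values are the standard IEEE 754 binary16 encodings of positive and negative infinity. `HALF_SIGN_MASK`, `halfInfinity`, and `halfIsInfinity` are hypothetical names introduced only for illustration and are not claimed to exist in the PR.

```cpp
#include <cstdint>

// Per the review comment, the JIT typedefs float16_t to uint16_t.
using float16_t = uint16_t;

// binary16 layout: 1 sign bit, 5 exponent bits, 10 mantissa bits.
// Infinity has an all-ones exponent and a zero mantissa.
constexpr uint16_t HALF_POSITIVE_INFINITY_BITS = 0x7C00;
constexpr uint16_t HALF_NEGATIVE_INFINITY_BITS = 0xFC00;
constexpr uint16_t HALF_SIGN_MASK              = 0x8000; // hypothetical name

// Hypothetical helper: returns the infinity bit pattern with the requested
// sign. No narrowing occurs because the constants are already 16 bits wide.
constexpr float16_t halfInfinity(bool negative)
{
    return negative ? HALF_NEGATIVE_INFINITY_BITS : HALF_POSITIVE_INFINITY_BITS;
}

// Hypothetical helper: true for either infinity, false for NaN or finite values.
constexpr bool halfIsInfinity(float16_t bits)
{
    return static_cast<uint16_t>(bits & ~HALF_SIGN_MASK) == HALF_POSITIVE_INFINITY_BITS;
}
```

Declaring the constants as `uint16_t` means a function returning `float16_t` can return them directly, without an implicit conversion from a wider type.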
This is dynamically changing the calling convention based on CPU capabilities:
@tannergooding I would recommend reverting this PR until this gets all addressed.
This reverts commit f7693e1.
Since #122649 had to be reverted due to the ABI concerns, this is a simpler initial change that works with the existing ABI and on hardware with AVX2 support (not just AVX512-FP16 capable hardware).

This should provide a nice win across most existing hardware and we can follow up with a PR that does similar for the AVX512-FP16 instructions that allow directly accelerated arithmetic operations, rather than only handling conversions.

### Before

```asm
; Method Program:HalfToSingle(System.Half):float (FullOpts)
G_M16314_IG01:  ;; offset=0x0000
       4883EC28             sub      rsp, 40
                            ;; size=4 bbWeight=1 PerfScore 0.25
G_M16314_IG02:  ;; offset=0x0004
       0FB7C9               movzx    rcx, cx
       FF156BA74500         call     [System.Half:op_Explicit(System.Half):float]
       90                   nop
                            ;; size=10 bbWeight=1 PerfScore 3.50
G_M16314_IG03:  ;; offset=0x000E
       4883C428             add      rsp, 40
       C3                   ret
                            ;; size=5 bbWeight=1 PerfScore 1.25
; Total bytes of code: 19

; Method Program:SingleToHalf(float):System.Half (FullOpts)
G_M32250_IG01:  ;; offset=0x0000
                            ;; size=0 bbWeight=1 PerfScore 0.00
G_M32250_IG02:  ;; offset=0x0000
       FF2572A74500         tail.jmp [System.Half:op_Explicit(float):System.Half]
                            ;; size=6 bbWeight=1 PerfScore 2.00
; Total bytes of code: 6
```

### After

```asm
; Method Program:HalfToSingle(System.Half):float (FullOpts)
G_M15861_IG01:  ;; offset=0x0000
                            ;; size=0 bbWeight=1 PerfScore 0.00
G_M15861_IG02:  ;; offset=0x0000
       0FB7C1               movzx    rax, cx
       C5F96EC0             vmovd    xmm0, eax
       C4E27913C0           vcvtph2ps xmm0, xmm0
                            ;; size=12 bbWeight=1 PerfScore 6.25
G_M15861_IG03:  ;; offset=0x000C
       C3                   ret
                            ;; size=1 bbWeight=1 PerfScore 1.00
; Total bytes of code: 13

; Method Program:SingleToHalf(float):System.Half (FullOpts)
G_M15413_IG01:  ;; offset=0x0000
                            ;; size=0 bbWeight=1 PerfScore 0.00
G_M15413_IG02:  ;; offset=0x0000
       C4E3791DC000         vcvtps2ph xmm0, xmm0, 0
       C5F97EC0             vmovd    eax, xmm0
       0FB7C0               movzx    rax, ax
                            ;; size=13 bbWeight=1 PerfScore 6.25
G_M15413_IG03:  ;; offset=0x000D
       C3                   ret
                            ;; size=1 bbWeight=1 PerfScore 1.00
; Total bytes of code: 14
```
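For readers who want a sense of the work a single `vcvtph2ps` replaces, here is a rough software sketch of widening a binary16 bit pattern to a `float`. It is not the managed `op_Explicit` implementation nor anything from the PR: `halfBitsToFloat` is a hypothetical name, and unlike the hardware instruction this sketch passes NaN payloads through without quieting signaling NaNs.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: widen IEEE 754 binary16 bits to a binary32 float.
inline float halfBitsToFloat(uint16_t h)
{
    const uint32_t sign = static_cast<uint32_t>(h & 0x8000u) << 16;
    const uint32_t exp  = (h >> 10) & 0x1Fu; // 5-bit biased exponent (bias 15)
    uint32_t       mant = h & 0x3FFu;        // 10-bit mantissa
    uint32_t       bits;

    if (exp == 0x1Fu)
    {
        // Infinity or NaN: all-ones float exponent, mantissa moved to the top bits.
        bits = sign | 0x7F800000u | (mant << 13);
    }
    else if (exp != 0)
    {
        // Normal number: rebias the exponent from 15 to 127 (add 112).
        bits = sign | ((exp + 112u) << 23) | (mant << 13);
    }
    else if (mant != 0)
    {
        // Subnormal half: normalize until the implicit leading 1 reaches bit 10.
        uint32_t e = 113; // float biased exponent for 2^-14
        while ((mant & 0x400u) == 0)
        {
            mant <<= 1;
            --e;
        }
        bits = sign | (e << 23) | ((mant & 0x3FFu) << 13);
    }
    else
    {
        // Signed zero.
        bits = sign;
    }

    float result;
    std::memcpy(&result, &bits, sizeof(result)); // bit-cast without aliasing issues
    return result;
}
```

Every binary16 value is exactly representable as a `float`, so this direction needs no rounding; the reverse direction (`vcvtps2ph`) additionally has to round, which is why it takes a rounding-control immediate.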
Draft PR for in-progress work to accelerate `System.Half` with FP16 ISA.

Current work done:

Add a `TYP_HALF` to the .NET runtime, which is treated like a `TYP_SIMDXX`, but with some notable differences. Namely, a `TYP_HALF` is passed around via the xmm registers, and while it will pass a `varTypeIsStruct` test, it must be treated as a primitive in other places.

Accelerate `System.Half` operations with the `TYP_HALF` and some FP16 intrinsics. Not every possible function has been accelerated yet.

For discussion:

I have currently worked around some checks to make `TYP_HALF` behave like a struct and a primitive. It's very ad-hoc at the moment.

Much of the work to transform the named `System.Half` intrinsics into a sequence of intrinsic nodes is done in `importcall.cpp` and might want to be moved up into some of the `gtNewSimdXX` nodes.