
JIT: Accelerate floating->long casts on x86 #125180

Open
saucecontrol wants to merge 9 commits into dotnet:main from saucecontrol:lng2flt6

Conversation

@saucecontrol
Member

@saucecontrol saucecontrol commented Mar 4, 2026

This adds floating->long/ulong cast codegen for AVX-512 and AVX10.2 on x86. With this, all non-overflow casts are now hardware accelerated. This is the last bit pulled from #116805.

Typical Diff (double->long AVX-512):

-       sub      esp, 8
-       vzeroupper 
-       vmovsd   xmm0, qword ptr [esp+0x0C]
-       sub      esp, 8
-       ; npt arg push 0
-       ; npt arg push 1
-       vmovsd   qword ptr [esp], xmm0
-       call     CORINFO_HELP_DBL2LNG
-       ; gcr arg pop 2
+       vmovsd   xmm0, qword ptr [esp+0x04]
+       vcmpordsd k1, xmm0, xmm0
+       vcvttpd2qq xmm1 {k1}{z}, xmm0
+       vcmpge_oqsd k1, xmm0, qword ptr [@RWD00]
+       vpcmpeqd xmm0, xmm0, xmm0
+       vpsrlq   xmm1 {k1}, xmm0, 1
+       vmovd    eax, xmm1
+       vpextrd  edx, xmm1, 1
-       add      esp, 8
        ret      8

+RWD00  	dq	43E0000000000000h
 
-; Total bytes of code 31
+; Total bytes of code 54

Full Diffs
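As a sanity check on the @RWD00 constant in the diff above: 0x43E0000000000000 is the IEEE-754 bit pattern of the double 2^63, the first power of two beyond long.MaxValue. A quick Python check (illustrative only, not part of the PR):

```python
import struct

# Reinterpret the 64-bit @RWD00 constant as an IEEE-754 double.
bits = 0x43E0000000000000
(value,) = struct.unpack("<d", struct.pack("<Q", bits))

# Exponent field 0x43E = 1086; 1086 - 1023 (bias) = 63, mantissa = 0 -> 2^63.
assert value == 2.0**63
print(value)  # 9.223372036854776e+18
```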

Breakdown of the double->long asm:

; load the scalar double
vmovsd   xmm0, qword ptr [esp+0x04]

; set the low bit of k1 if the scalar value is not NaN
vcmpordsd k1, xmm0, xmm0

; convert, using k1 mask bit.  if the mask bit is not set (meaning we have a NaN), set the value to zero
vcvttpd2qq xmm1 {k1}{z}, xmm0

; set the low bit of k1 if the input was greater than or equal to 2^63 (the smallest double greater than long.MaxValue)
vcmpge_oqsd k1, xmm0, qword ptr [@RWD00]

; set all bits of xmm0 to 1
vpcmpeqd xmm0, xmm0, xmm0

; if the low bit of k1 is set (meaning overflow), set the value to xmm0 >>> 1 (0x7FFFFFFFFFFFFFFF), otherwise take the conversion result
vpsrlq   xmm1 {k1}, xmm0, 1

; extract the two 32-bit halves of the long result
vmovd    eax, xmm1
vpextrd  edx, xmm1, 1
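The sequence above implements the saturating double->long cast semantics: NaN converts to 0, values at or above 2^63 clamp to long.MaxValue (AllBits >>> 1), and vcvttpd2qq itself saturates too-negative inputs to long.MinValue. A rough Python model of what the instructions compute (names are illustrative, not from the JIT):

```python
import math

LONG_MAX = 2**63 - 1
LONG_MIN = -(2**63)

def dbl2lng(x: float) -> int:
    """Sketch of the AVX-512 double->long sequence above."""
    if math.isnan(x):            # vcmpordsd + {k1}{z}: NaN zeroes the result
        return 0
    if x >= 2.0**63:             # vcmpge_oqsd vs @RWD00, then vpsrlq {k1}:
        return LONG_MAX          #   AllBits >>> 1 == 0x7FFFFFFFFFFFFFFF
    if x <= float(LONG_MIN):     # vcvttpd2qq saturates too-negative inputs
        return LONG_MIN
    return math.trunc(x)         # truncation toward zero

assert dbl2lng(float("nan")) == 0
assert dbl2lng(1e300) == LONG_MAX
assert dbl2lng(-1e300) == LONG_MIN
assert dbl2lng(-1.9) == -1
```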

Copilot AI review requested due to automatic review settings March 4, 2026 15:43
@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Mar 4, 2026
@github-actions github-actions bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Mar 4, 2026
@dotnet-policy-service
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Contributor

Copilot AI left a comment


Pull request overview

This PR extends x86 JIT codegen to hardware-accelerate non-overflow floating→long/ulong casts using AVX-512 and AVX10.2, completing the remaining cast-acceleration work pulled from #116805.

Changes:

  • Teach cast helper selection to allow floating↔long casts to stay intrinsic-based on x86 when AVX-512 is available.
  • Add/extend x86 long decomposition logic to generate AVX-512/AVX10.2 sequences for floating→long/ulong and long→floating casts.
  • Introduce a new AVX-512 scalar compare-mask intrinsic and wire it up for immediate bounds + containment.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

File Description
src/coreclr/jit/lowerxarch.cpp Refactors vector constant construction and adds containment support for the new AVX-512 scalar compare-mask intrinsic.
src/coreclr/jit/hwintrinsicxarch.cpp Adds immediate upper-bound handling for the new AVX-512 scalar compare-mask intrinsic.
src/coreclr/jit/hwintrinsiclistxarch.h Introduces AVX512.CompareScalarMask as a new intrinsic mapping to vcmpss/vcmpsd with IMM.
src/coreclr/jit/flowgraph.cpp Updates helper-requirement logic so x86 floating↔long casts can avoid helper calls when AVX-512 is available.
src/coreclr/jit/decomposelongs.cpp Implements the AVX-512/AVX10.2-based lowering/decomposition sequences for floating↔long/ulong on x86.

Copilot AI review requested due to automatic review settings March 4, 2026 16:08
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.

@saucecontrol saucecontrol marked this pull request as ready for review March 4, 2026 19:31
Copilot AI review requested due to automatic review settings March 4, 2026 19:31
@saucecontrol
Member Author

saucecontrol commented Mar 4, 2026

@dotnet/jit-contrib this is ready for review

diffs

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

@JulieLeeMSFT
Member

@EgorBo, please review this community PR.

Copilot AI review requested due to automatic review settings April 16, 2026 19:48
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

src/coreclr/jit/flowgraph.cpp:1347

  • fgCastRequiresHelper on x86 currently only exempts long<->floating casts when InstructionSet_AVX512 is enabled. This PR adds long/floating cast acceleration that can use AVX10.2 (InstructionSet_AVX10v2) as well (e.g., DecomposeLongs::DecomposeCast checks compOpportunisticallyDependsOn(InstructionSet_AVX10v2)). If AVX10v2 is enabled while AVX512 is disabled/unavailable, morphing may still force a helper call and bypass the new codegen. Consider updating the x86 condition to treat AVX10v2 as sufficient (e.g., require helper only when neither AVX512 nor AVX10v2 is available).
#if defined(TARGET_X86) || defined(TARGET_ARM)
    if ((varTypeIsLong(fromType) && varTypeIsFloating(toType)) ||
        (varTypeIsFloating(fromType) && varTypeIsLong(toType)))
    {
#if defined(TARGET_X86)
        return !compOpportunisticallyDependsOn(InstructionSet_AVX512);
#else

Copilot AI review requested due to automatic review settings April 16, 2026 19:55
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

src/coreclr/jit/flowgraph.cpp:1347

  • On x86, this helper decision only checks for AVX-512. But the PR adds AVX10.2-based acceleration for floating↔long casts as well, so this will still force helper calls on AVX10v2-capable targets where AVX-512 isn’t enabled/reported. Consider also allowing InstructionSet_AVX10v2 (or whatever ISA predicate you use for the new codegen) to avoid blocking the new lowering/decomposition path.
    if ((varTypeIsLong(fromType) && varTypeIsFloating(toType)) ||
        (varTypeIsFloating(fromType) && varTypeIsLong(toType)))
    {
#if defined(TARGET_X86)
        return !compOpportunisticallyDependsOn(InstructionSet_AVX512);
#else

@saucecontrol
Member Author

Merged up to resolve conflicts with #122649, and again after the revert.

Copilot AI review requested due to automatic review settings April 20, 2026 04:36
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Comment on lines +672 to +674
// var nanMask = Avx.CompareScalar(srcVec, srcVec, FloatComparisonMode.OrderedNonSignaling);
// var convert = Avx512DQ.VL.ConvertToVector128Int64WithTruncation(srcVec);
// convertResult = Vector128.ConditionalSelect(nanMask, convert, Vector128<long>.Zero);
Member

@tannergooding tannergooding Apr 20, 2026


Can't this whole thing rather be:

var srcVec = Vector128.CreateScalarUnsafe(castOp);

var valMask = Avx.CompareScalar(srcVec, srcVec, FloatComparisonMode.OrderedEqualNonSignaling);
var ovfMask = Avx.CompareScalar(srcVec, Vector128.Create(9223372036854775808.0), FloatComparisonMode.OrderedGreaterThanOrEqualNonSignaling);

var result = Avx512DQ.VL.ConvertToVector128Int64WithTruncation(srcVec & valMask);
return Vector128.ConditionalSelect(ovfMask, Vector128.Create<long>(long.MaxValue), result).ToScalar();

This allows pipelining the two comparisons, does a simple & to get nan -> 0 normalization, and then does a fixup for the overflow case (without a delay as the comparison should be done already).


There does notably need to be a fixup either with the above suggestion or the below code if srcType == TYP_FLOAT is possible, as the comparison produces a 32-bit mask and we need a 64-bit one for the TYP_LONG result.

Member Author


This implementation also allows the comparisons to be done in parallel, and because it's using EVEX-masked scalar instructions, we only use the low bit of the mask regardless of operand size. Since the conversion requires AVX-512 anyway, I don't see any downside to using the EVEX masking.

Member


Looks like it'd be roughly the below, so I guess it's fine to use kmask, but we should likely just use a regular constant and not the AllBits >> 1. It's an extra instruction and codegen for something that's already speculatively prefetched (according to both VTune and uProf). If it was something beneficial to do, then we should be optimizing it more broadly as part of constant generation (and likely do it for all such bitmasks).

No kmask

vmovsd     xmm0, qword ptr [esp+0x04]                   ;  7-bytes     4-cycles
vcmpgesd   xmm1, xmm0, qword ptr [@RWD00]               ;  9-bytes     6-cycles -- Pipelined
vcmpeqsd   xmm2, xmm0, xmm0                             ;  5-bytes     0-cycles -/
vandpd     xmm0, xmm2, xmm0                             ;  4-bytes     1-cycle
vcvttpd2qq xmm0, xmm0                                   ;  6-bytes     3-cycles
vpternlogq xmm1, xmm0, qword ptr [@RWD08] {1to2}, -84   ; 11-bytes     1-cycle  -- Prefetched
vmovd      eax, xmm1                                    ;  4-bytes     5-cycles
vpextrd    edx, xmm1, 1                                 ;  6-bytes     6-cycles
                                                        ; ---------------------
                                                        ; 52-bytes    26-cycles

Use a constant

vmovsd     xmm0, qword ptr [esp+0x04]                   ;  7-bytes     4-cycles
vcmpgesd   k1, xmm0, qword ptr [@RWD00]                 ; 11-bytes     7-cycles -- Pipelined
vcmpeqsd   k2, xmm0, xmm0                               ;  7-bytes     0-cycles -/
vcvttpd2qq zmm0 {k2}{z}, zmm0                           ;  6-bytes     3-cycles
vblendmpd  zmm0 {k1}, zmm0, qword ptr [@RWD08] {1to2}   ; 10-bytes     1-cycle  -- Prefetched
vmovd      eax, xmm0                                    ;  4-bytes     5-cycles
vpextrd    edx, xmm0, 1                                 ;  6-bytes     6-cycles
                                                        ; ---------------------
                                                        ; 51-bytes    26-cycles

PR changes (with a change to put the compares next to each other and do OrderedEqualNonSignaling for NaN detection so we can get {k2}{z})

vmovsd     xmm0, qword ptr [esp+0x04]                   ;  7-bytes    4-cycles
vcmpgesd   k1, xmm0, qword ptr [@RWD00]                 ; 11-bytes    7-cycles -- Pipelined
vcmpeqsd   k2, xmm0, xmm0                               ;  7-bytes    0-cycles -/
vcvttpd2qq xmm0 {k2}{z}, xmm0                           ;  6-bytes    3-cycles
vpcmpeqd   xmm1, xmm1, xmm1                             ;  4-bytes    0-cycles
vpsrlq     xmm0 {k1}, xmm1, 1                           ;  7-bytes    1-cycle
vmovd      eax, xmm0                                    ;  4-bytes    5-cycles
vpextrd    edx, xmm0, 1                                 ;  6-bytes    6-cycles
                                                        ; --------------------
                                                        ; 52-bytes   26-cycles

