JIT: Accelerate floating->long casts on x86 #125180

saucecontrol wants to merge 9 commits into dotnet:main
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
Pull request overview
This PR extends x86 JIT codegen to hardware-accelerate non-overflow floating→long/ulong casts using AVX-512 and AVX10.2, completing the remaining cast-acceleration work pulled from #116805.
Changes:
- Teach cast helper selection to allow floating↔long casts to stay intrinsic-based on x86 when AVX-512 is available.
- Add/extend x86 long decomposition logic to generate AVX-512/AVX10.2 sequences for floating→long/ulong and long→floating casts.
- Introduce a new AVX-512 scalar compare-mask intrinsic and wire it up for immediate bounds + containment.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/coreclr/jit/lowerxarch.cpp | Refactors vector constant construction and adds containment support for the new AVX-512 scalar compare-mask intrinsic. |
| src/coreclr/jit/hwintrinsicxarch.cpp | Adds immediate upper-bound handling for the new AVX-512 scalar compare-mask intrinsic. |
| src/coreclr/jit/hwintrinsiclistxarch.h | Introduces AVX512.CompareScalarMask as a new intrinsic mapping to vcmpss/vcmpsd with IMM. |
| src/coreclr/jit/flowgraph.cpp | Updates helper-requirement logic so x86 floating↔long casts can avoid helper calls when AVX-512 is available. |
| src/coreclr/jit/decomposelongs.cpp | Implements the AVX-512/AVX10.2-based lowering/decomposition sequences for floating↔long/ulong on x86. |
@dotnet/jit-contrib this is ready for review

@EgorBo, please review this community PR.
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (1)
src/coreclr/jit/flowgraph.cpp:1347
`fgCastRequiresHelper` on x86 currently only exempts long<->floating casts when `InstructionSet_AVX512` is enabled. This PR adds long/floating cast acceleration that can use AVX10.2 (`InstructionSet_AVX10v2`) as well (e.g., `DecomposeLongs::DecomposeCast` checks `compOpportunisticallyDependsOn(InstructionSet_AVX10v2)`). If AVX10v2 is enabled while AVX512 is disabled/unavailable, morphing may still force a helper call and bypass the new codegen. Consider updating the x86 condition to treat AVX10v2 as sufficient (e.g., require a helper only when neither AVX512 nor AVX10v2 is available).
```cpp
#if defined(TARGET_X86) || defined(TARGET_ARM)
    if ((varTypeIsLong(fromType) && varTypeIsFloating(toType)) ||
        (varTypeIsFloating(fromType) && varTypeIsLong(toType)))
    {
#if defined(TARGET_X86)
        return !compOpportunisticallyDependsOn(InstructionSet_AVX512);
#else
```
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
src/coreclr/jit/flowgraph.cpp:1347
- On x86, this helper decision only checks for AVX-512. But the PR adds AVX10.2-based acceleration for floating↔long casts as well, so this will still force helper calls on AVX10v2-capable targets where AVX-512 isn't enabled/reported. Consider also allowing `InstructionSet_AVX10v2` (or whatever ISA predicate you use for the new codegen) to avoid blocking the new lowering/decomposition path.
```cpp
    if ((varTypeIsLong(fromType) && varTypeIsFloating(toType)) ||
        (varTypeIsFloating(fromType) && varTypeIsLong(toType)))
    {
#if defined(TARGET_X86)
        return !compOpportunisticallyDependsOn(InstructionSet_AVX512);
#else
```
Merged up to resolve conflicts with #122649, and again after the revert.
```cs
// var nanMask = Avx.CompareScalar(srcVec, srcVec, FloatComparisonMode.OrderedNonSignaling);
// var convert = Avx512DQ.VL.ConvertToVector128Int64WithTruncation(srcVec);
// convertResult = Vector128.ConditionalSelect(nanMask, convert, Vector128<long>.Zero);
```
Can't this whole thing rather be:

```cs
var srcVec = Vector128.CreateScalarUnsafe(castOp);
var valMask = Avx.CompareScalar(srcVec, srcVec, FloatComparisonMode.OrderedEqualNonSignaling);
var ovfMask = Avx.CompareScalar(srcVec, Vector128.Create(9223372036854775808.0), FloatComparisonMode.OrderedGreaterThanOrEqualNonSignaling);
var result = Avx512DQ.VL.ConvertToVector128Int64WithTruncation(srcVec & valMask);
return Vector128.ConditionalSelect(ovfMask, Vector128.Create<long>(long.MaxValue), result).ToScalar();
```

This allows pipelining the two comparisons, does a simple `&` to get NaN -> 0 normalization, and then does a fixup for the overflow case (without a delay, as the comparison should be done already).

There does notably need to be a fixup, either with the above suggestion or the below code, if srcType == TYP_FLOAT is possible, as the comparison produces a 32-bit mask and we need a 64-bit one for the TYP_LONG result.
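For reference, the saturating semantics these sequences are implementing can be sketched as a scalar C++ model (this is an illustration, not the JIT's SIMD code): NaN maps to 0 (the mask-and-zero step), positive overflow saturates to long.MaxValue (the overflow fixup), and negative overflow falls out as `vcvttpd2qq`'s "integer indefinite" result `0x8000000000000000`, i.e. long.MinValue, which needs no fixup.

```cpp
#include <cassert>
#include <climits>
#include <cmath>

// Scalar model of the saturating double->long cast discussed above.
long long SaturatingDoubleToInt64(double d)
{
    if (std::isnan(d))
        return 0; // NaN is normalized to zero by the mask/zero step
    if (d >= 9223372036854775808.0) // 2^63: positive overflow fixup
        return LLONG_MAX;
    if (d < -9223372036854775808.0) // below -2^63: matches the indefinite result
        return LLONG_MIN;
    return (long long)d; // in range: ordinary truncation toward zero
}
```

The exact `-2^63` boundary value is representable as a double and converts directly, so only strictly-smaller values take the saturating path.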
This implementation also allows the comparisons to be done in parallel, and because it's using EVEX-masked scalar instructions, we only use the low bit of the mask regardless of operand size. Since the conversion requires AVX-512 anyway, I don't see any downside to using the EVEX masking.
Looks like it'd be roughly the below, so I guess it's fine to use kmask, but we should likely just use a regular constant and not the `AllBits >> 1`. It's an extra instruction and extra codegen for something that's already speculatively prefetched (according to both VTune and uProf). If it were something beneficial to do, then we should be optimizing it more broadly as part of constant generation (and likely do it for all such bitmasks).
**No kmask**

```asm
vmovsd     xmm0, qword ptr [esp+0x04]                  ; 7-bytes 4-cycles
vcmpgesd   xmm1, xmm0, qword ptr [@RWD00]              ; 9-bytes 6-cycles -- Pipelined
vcmpeqsd   xmm2, xmm0, xmm0                            ; 5-bytes 0-cycles -/
vandpd     xmm0, xmm2, xmm0                            ; 4-bytes 1-cycle
vcvttpd2qq xmm0, xmm0                                  ; 6-bytes 3-cycles
vpternlogq xmm1, xmm0, qword ptr [@RWD08] {1to2}, -84  ; 11-bytes 1-cycle -- Prefetched
vmovd      eax, xmm1                                   ; 4-bytes 5-cycles
vpextrd    edx, xmm1, 1                                ; 6-bytes 6-cycles
; ---------------------
; 52-bytes 26-cycles
```

**Use a constant**

```asm
vmovsd     xmm0, qword ptr [esp+0x04]                 ; 7-bytes 4-cycles
vcmpgesd   k1, xmm0, qword ptr [@RWD00]               ; 11-bytes 7-cycles -- Pipelined
vcmpeqsd   k2, xmm0, xmm0                             ; 7-bytes 0-cycles -/
vcvttpd2qq zmm0 {k2}{z}, zmm0                         ; 6-bytes 3-cycles
vblendmpd  zmm0 {k1}, zmm0, qword ptr [@RWD08] {1to2} ; 10-bytes 1-cycle -- Prefetched
vmovd      eax, xmm0                                  ; 4-bytes 5-cycles
vpextrd    edx, xmm0, 1                               ; 6-bytes 6-cycles
; ---------------------
; 51-bytes 26-cycles
```

**PR changes** (with a change to put the compares next to each other and do OrderedEqualNonSignaling for NaN detection so we can get {k2}{z})

```asm
vmovsd     xmm0, qword ptr [esp+0x04]   ; 7-bytes 4-cycles
vcmpgesd   k1, xmm0, qword ptr [@RWD00] ; 11-bytes 7-cycles -- Pipelined
vcmpeqsd   k2, xmm0, xmm0               ; 7-bytes 0-cycles -/
vcvttpd2qq xmm0 {k2}{z}, xmm0           ; 6-bytes 3-cycles
vpcmpeqd   xmm1, xmm1, xmm1             ; 4-bytes 0-cycles
vpsrlq     xmm0 {k1}, xmm1, 1           ; 7-bytes 1-cycle
vmovd      eax, xmm0                    ; 4-bytes 5-cycles
vpextrd    edx, xmm0, 1                 ; 6-bytes 6-cycles
; --------------------
; 52-bytes 26-cycles
```
This adds floating->long/ulong cast codegen for AVX-512 and AVX10.2 on x86. With this, all non-overflow casts are now hardware accelerated. This is the last bit pulled from #116805.
Typical Diff (double->long AVX-512):
Full Diffs
Breakdown of the double->long asm: