JIT: Accelerate floating->long casts on x86 #125180

saucecontrol wants to merge 9 commits into dotnet:main
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
Pull request overview
This PR extends x86 JIT codegen to hardware-accelerate non-overflow floating→long/ulong casts using AVX-512 and AVX10.2, completing the remaining cast-acceleration work pulled from #116805.
Changes:
- Teach cast helper selection to allow floating↔long casts to stay intrinsic-based on x86 when AVX-512 is available.
- Add/extend x86 long decomposition logic to generate AVX-512/AVX10.2 sequences for floating→long/ulong and long→floating casts.
- Introduce a new AVX-512 scalar compare-mask intrinsic and wire it up for immediate bounds + containment.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/coreclr/jit/lowerxarch.cpp | Refactors vector constant construction and adds containment support for the new AVX-512 scalar compare-mask intrinsic. |
| src/coreclr/jit/hwintrinsicxarch.cpp | Adds immediate upper-bound handling for the new AVX-512 scalar compare-mask intrinsic. |
| src/coreclr/jit/hwintrinsiclistxarch.h | Introduces AVX512.CompareScalarMask as a new intrinsic mapping to vcmpss/vcmpsd with IMM. |
| src/coreclr/jit/flowgraph.cpp | Updates helper-requirement logic so x86 floating↔long casts can avoid helper calls when AVX-512 is available. |
| src/coreclr/jit/decomposelongs.cpp | Implements the AVX-512/AVX10.2-based lowering/decomposition sequences for floating↔long/ulong on x86. |
@dotnet/jit-contrib this is ready for review

@EgorBo, please review this community PR.
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (1)
src/coreclr/jit/flowgraph.cpp:1347
`fgCastRequiresHelper` on x86 currently only exempts long<->floating casts when `InstructionSet_AVX512` is enabled. This PR adds long/floating cast acceleration that can use AVX10.2 (`InstructionSet_AVX10v2`) as well (e.g., `DecomposeLongs::DecomposeCast` checks `compOpportunisticallyDependsOn(InstructionSet_AVX10v2)`). If AVX10v2 is enabled while AVX512 is disabled/unavailable, morphing may still force a helper call and bypass the new codegen. Consider updating the x86 condition to treat AVX10v2 as sufficient (e.g., require a helper only when neither AVX512 nor AVX10v2 is available).
```cpp
#if defined(TARGET_X86) || defined(TARGET_ARM)
    if ((varTypeIsLong(fromType) && varTypeIsFloating(toType)) ||
        (varTypeIsFloating(fromType) && varTypeIsLong(toType)))
    {
#if defined(TARGET_X86)
        return !compOpportunisticallyDependsOn(InstructionSet_AVX512);
#else
```
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
src/coreclr/jit/flowgraph.cpp:1347
- On x86, this helper decision only checks for AVX-512. But the PR adds AVX10.2-based acceleration for floating↔long casts as well, so this will still force helper calls on AVX10v2-capable targets where AVX-512 isn't enabled/reported. Consider also allowing `InstructionSet_AVX10v2` (or whatever ISA predicate you use for the new codegen) to avoid blocking the new lowering/decomposition path.
```cpp
    if ((varTypeIsLong(fromType) && varTypeIsFloating(toType)) ||
        (varTypeIsFloating(fromType) && varTypeIsLong(toType)))
    {
#if defined(TARGET_X86)
        return !compOpportunisticallyDependsOn(InstructionSet_AVX512);
#else
```
Merged up to resolve conflicts with #122649, and again after the revert.
```cs
// var nanMask = Avx.CompareScalar(srcVec, srcVec, FloatComparisonMode.OrderedNonSignaling);
// var convert = Avx512DQ.VL.ConvertToVector128Int64WithTruncation(srcVec);
// convertResult = Vector128.ConditionalSelect(nanMask, convert, Vector128<long>.Zero);
```
Can't this whole thing rather be:

```cs
var srcVec = Vector128.CreateScalarUnsafe(castOp);
var valMask = Avx.CompareScalar(srcVec, srcVec, FloatComparisonMode.OrderedEqualNonSignaling);
var ovfMask = Avx.CompareScalar(srcVec, Vector128.Create(9223372036854775808.0), FloatComparisonMode.OrderedGreaterThanOrEqualNonSignaling);
var result = Avx512DQ.VL.ConvertToVector128Int64WithTruncation(srcVec & valMask);
return Vector128.ConditionalSelect(ovfMask, Vector128.Create<long>(long.MaxValue), result).ToScalar();
```

This allows pipelining the two comparisons, does a simple `&` to get NaN -> 0 normalization, and then does a fixup for the overflow case (without a delay, as the comparison should be done already).

There does notably need to be a fixup, either with the above suggestion or the below code, if srcType == TYP_FLOAT is possible, as the comparison produces a 32-bit mask and we need a 64-bit one for the TYP_LONG result.
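For reference, the saturating semantics these sequences are implementing can be sketched as a scalar C++ model (this is an illustration, not the JIT's SIMD code): NaN maps to 0 (the mask-and-zero step), positive overflow saturates to long.MaxValue (the overflow fixup), and negative overflow falls out as `vcvttpd2qq`'s "integer indefinite" result `0x8000000000000000`, i.e. long.MinValue, which needs no fixup.

```cpp
#include <cassert>
#include <climits>
#include <cmath>

// Scalar model of the saturating double->long cast discussed above.
long long SaturatingDoubleToInt64(double d)
{
    if (std::isnan(d))
        return 0; // NaN is normalized to zero by the mask/zero step
    if (d >= 9223372036854775808.0) // 2^63: positive overflow fixup
        return LLONG_MAX;
    if (d < -9223372036854775808.0) // below -2^63: matches the indefinite result
        return LLONG_MIN;
    return (long long)d; // in range: ordinary truncation toward zero
}
```

The exact `-2^63` boundary value is representable as a double and converts directly, so only strictly-smaller values take the saturating path.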
This implementation also allows the comparisons to be done in parallel, and because it's using EVEX-masked scalar instructions, we only use the low bit of the mask regardless of operand size. Since the conversion requires AVX-512 anyway, I don't see any downside to using the EVEX masking.
Looks like it'd be roughly the below, so I guess it's fine to use kmask, but we should likely just use a regular constant and not the `AllBits >> 1`. It's an extra instruction and extra codegen for something that's already speculatively prefetched (according to both VTune and uProf). If it were something beneficial to do, then we should be optimizing it more broadly as part of constant generation (and likely do it for all such bitmasks).
**No kmask**

```asm
vmovsd     xmm0, qword ptr [esp+0x04]                  ; 7-bytes 4-cycles
vcmpgesd   xmm1, xmm0, qword ptr [@RWD00]              ; 9-bytes 6-cycles -- Pipelined
vcmpeqsd   xmm2, xmm0, xmm0                            ; 5-bytes 0-cycles -/
vandpd     xmm0, xmm2, xmm0                            ; 4-bytes 1-cycle
vcvttpd2qq xmm0, xmm0                                  ; 6-bytes 3-cycles
vpternlogq xmm1, xmm0, qword ptr [@RWD08] {1to2}, -84  ; 11-bytes 1-cycle -- Prefetched
vmovd      eax, xmm1                                   ; 4-bytes 5-cycles
vpextrd    edx, xmm1, 1                                ; 6-bytes 6-cycles
; ---------------------
; 52-bytes 26-cycles
```

**Use a constant**

```asm
vmovsd     xmm0, qword ptr [esp+0x04]                 ; 7-bytes 4-cycles
vcmpgesd   k1, xmm0, qword ptr [@RWD00]               ; 11-bytes 7-cycles -- Pipelined
vcmpeqsd   k2, xmm0, xmm0                             ; 7-bytes 0-cycles -/
vcvttpd2qq zmm0 {k2}{z}, zmm0                         ; 6-bytes 3-cycles
vblendmpd  zmm0 {k1}, zmm0, qword ptr [@RWD08] {1to2} ; 10-bytes 1-cycle -- Prefetched
vmovd      eax, xmm0                                  ; 4-bytes 5-cycles
vpextrd    edx, xmm0, 1                               ; 6-bytes 6-cycles
; ---------------------
; 51-bytes 26-cycles
```

**PR changes** (with a change to put the compares next to each other and do OrderedEqualNonSignaling for NaN detection so we can get {k2}{z})

```asm
vmovsd     xmm0, qword ptr [esp+0x04]   ; 7-bytes 4-cycles
vcmpgesd   k1, xmm0, qword ptr [@RWD00] ; 11-bytes 7-cycles -- Pipelined
vcmpeqsd   k2, xmm0, xmm0               ; 7-bytes 0-cycles -/
vcvttpd2qq xmm0 {k2}{z}, xmm0           ; 6-bytes 3-cycles
vpcmpeqd   xmm1, xmm1, xmm1             ; 4-bytes 0-cycles
vpsrlq     xmm0 {k1}, xmm1, 1           ; 7-bytes 1-cycle
vmovd      eax, xmm0                    ; 4-bytes 5-cycles
vpextrd    edx, xmm0, 1                 ; 6-bytes 6-cycles
; --------------------
; 52-bytes 26-cycles
```
This adds floating->long/ulong cast codegen for AVX-512 and AVX10.2 on x86. With this, all non-overflow casts are now hardware accelerated. This is the last bit pulled from #116805.
Typical Diff (double->long AVX-512):
Full Diffs
Breakdown of the double->long asm: