Skip to content

Improve simd and floating-point costing for xarch#127048

Open
tannergooding wants to merge 8 commits intodotnet:mainfrom
tannergooding:improve-simd-costing
Open

Improve simd and floating-point costing for xarch#127048
tannergooding wants to merge 8 commits intodotnet:mainfrom
tannergooding:improve-simd-costing

Conversation

@tannergooding
Copy link
Copy Markdown
Member

The execution and size costing set for SIMD and floating-point on xarch has been updated to more accurately reflect the real numbers.

The floating-point costs/sizes hadn't really been updated since the 32-bit legacy JIT where they were based on the x87 FPU, which we haven't used in a long time.

Similarly, the hardware intrinsic nodes were basically all set at costEx=1, costSz=1 which would actively prevent CSE and other optimizations from kicking in, this is despite most such operations taking significantly more codegen bytes than general-purpose instructions and most floating-point instructions taking a minimum of 4 cycles, sometimes more.

This will likely result in a larger set of diffs, but should allow the JIT to make better decisions on what should be optimized based on what is taking the most bytes or cycles. If this goes through, we can do a similar PR for Arm64.

Copilot AI review requested due to automatic review settings April 17, 2026 06:26
@github-actions github-actions bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Apr 17, 2026
@dotnet-policy-service
Copy link
Copy Markdown
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Updates CoreCLR JIT instruction costing on xarch to better reflect modern SIMD and floating-point codegen characteristics, enabling more accurate optimization decisions (e.g., CSE and size/throughput tradeoffs).

Changes:

  • Extends xarch HW intrinsic metadata to carry separate integer vs floating-point execution costs.
  • Refines GT_HWINTRINSIC and indirection/address costing on xarch, including special-casing various SIMD/FP operations and constants.
  • Adds a shared helper to compute address-mode costing (gtGetAddrNodeCost) and expands floating-point indirection costing.

Reviewed changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
src/coreclr/jit/valuenumfuncs.h Updates xarch HW intrinsic macro signature to align VN function defs with new intrinsic list fields.
src/coreclr/jit/namedintrinsiclist.h Updates xarch HW intrinsic macro signature to include int/float cost fields.
src/coreclr/jit/hwintrinsic.h Extends HWIntrinsicInfo with intCost/fltCost and adds lookup helpers.
src/coreclr/jit/hwintrinsic.cpp Populates HWIntrinsicInfo array with the new cost fields on xarch (defaults on arm64).
src/coreclr/jit/gentree.h Introduces FLT_IND_COST_EX and clarifies cost constant definitions.
src/coreclr/jit/gentree.cpp Major update to costing/eval-order logic for HW intrinsics, indirections, FP/SIMD constants, and FP ops on xarch; adds gtGetAddrNodeCost.
src/coreclr/jit/compiler.h Declares gtGetAddrNodeCost and fixes trailing whitespace.

Comment thread src/coreclr/jit/gentree.cpp
Comment thread src/coreclr/jit/gentree.cpp Outdated
Copilot AI review requested due to automatic review settings April 17, 2026 13:33
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 8 changed files in this pull request and generated 3 comments.

Comment thread src/coreclr/jit/gentree.cpp
Comment thread src/coreclr/jit/gentree.cpp
Comment thread src/coreclr/jit/gentree.cpp
Copilot AI review requested due to automatic review settings April 17, 2026 14:44
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 8 changed files in this pull request and generated 1 comment.

Comment thread src/coreclr/jit/gentree.cpp
Copilot AI review requested due to automatic review settings April 17, 2026 17:18
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 8 changed files in this pull request and generated 1 comment.

Comment thread src/coreclr/jit/gentree.cpp
@tannergooding tannergooding marked this pull request as ready for review April 17, 2026 19:42
@tannergooding
Copy link
Copy Markdown
Member Author

CC. @dotnet/jit-contrib, @EgorBo, @kg

This improves gtSetEvalOrder for SIMD/floating-point

It's limited to xarch only in this first iteration, but basic premise is that it ensures that floating-point matches current codegen (SSE/SSE2) and not the x87 FPU codegen it was originally setup for. Likewise it ensures that hardware intrinsics aren't all set as costEx=1, costSz=1, which means they can be CSE'd, hoisted, etc

We see a size regression because we now have trees that are consistently CSE'd/hoisted where they weren't previously. This uses more registers and causes larger prologue/epilogue to accommodate the non-volatile spills. However, the actual diffs in the core code, particularly loops, is much improved. We particularly see more opportunities to do broadcasts and so the method local constant sections are reduced by up to nearly 8% in some methods.

@tannergooding
Copy link
Copy Markdown
Member Author

We have a number of places where the core loops show diffs like this, even though the overall method size is regressed by 24 bytes (due to the prologue/epilogue growth):

-						;; size=88 bbWeight=32 PerfScore 864.00
+						;; size=76 bbWeight=32 PerfScore 736.00

There are some actual regressions too, namely from when CSE decides to hoist a constant (often aggressive) and then LSRA decides to not keep it enregistered and marks it as spill-single-def. This causes a diff like:

-    vmulss xmm0, xmm0, dword ptr [reloc @RWD16]
-    vdivss xmm0, xmm0, dword ptr [reloc @RWD20]
+    vmovss xmm1, dword ptr [reloc @RWD16]
+    vmovss dword ptr [rbp-0x24], xmm1
+    vmulss xmm0, xmm0, xmm1
+    vmovss xmm2, dword ptr [reloc @RWD20]
+    vmovss dword ptr [rbp-0x28], xmm2
+    vdivss xmm0, xmm0, xmm2
...
-    vmulss xmm0, xmm0, dword ptr [reloc @RWD16]
-    vdivss xmm0, xmm0, dword ptr [reloc @RWD20]
+    vmulss xmm0, xmm0, dword ptr [rbp-0x24]
+    vmulss xmm0, xmm0, dword ptr [rbp-0x28]

When ideally we'd not spill at all and just reload from the [reloc @RWD] slot instead. This is notably an existing issue and something that is worth us improving long term anyways, but is something that should be done in a separate PR.

@tannergooding
Copy link
Copy Markdown
Member Author

Overall this should be a net improvement, with likely a couple issues getting filed after the subsequent perf triage.

PerfScore needs a bit of a similar cleanup, namely in ensuring:

  1. load/store costing is based on int vs float/simd, as the latter are 2-4 cycles more expensive
  2. that we check for simdSize == 16 ? x : y and not simdSize == 32 ? y : x, so SIMD64 has a correct score

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants