We should finish fleshing out the design around the CLR Configuration Knobs that allow users to control the System.Numerics.Vectors and System.Runtime.Intrinsics support.
Prior to HWIntrinsics
This section discusses the state of the world in netcoreapp1.0 through netcoreapp2.0. Desktop had similar, but slightly different behavior that I will attempt to call out where relevant.
COMPlus_EnableSSE3_4 and COMPlus_EnableAVX
The former (COMPlus_EnableSSE3_4) appears to be .NETCore only. It defaults to 1 and is used in combination with a CPUID check that the VM does. It controls whether the S.N.Vectors intrinsics generate SSE3 through SSE4.2 instructions. The flag is completely ignored if COMPlus_EnableAVX and the corresponding VM check are 1.
The latter (COMPlus_EnableAVX) is available for both Desktop and .NETCore. It defaults to 1 and is used in combination with a CPUID check that the VM does. It controls:
- Whether the S.N.Vectors intrinsics generate AVX/AVX2 instructions
- The size of Vector<T> (setting it to 0 forces Vector<T> to be sized to 16)
- Whether the compiler emits VEX encoded instructions
COMPlus_FeatureSIMD
This flag is available for both Desktop and .NETCore. It defaults to 1 and controls various bits of code related to Vector<T>, the S.N.Vectors compiler support, and the TYP_SIMD* support.
Setting this to 0 sizes Vector<T> at 16, stops any of the S.N.Vectors code from being treated as intrinsic, and prevents the various types from being resolved as TYP_SIMD* (which also generally prevents these types from appearing).
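The interaction between these three flags can be sketched roughly as follows. This is only an illustration of the behavior described above, not the actual VM/JIT code: the `Knobs` class, the `resolve_legacy` function, and the level names are hypothetical, each CPUID check is modelled as a single boolean, and the "baseline" fallback is an assumption about what is used when neither flag applies.

```python
from dataclasses import dataclass


@dataclass
class Knobs:
    # Hypothetical container for the COMPlus_* values; 1 mirrors the documented defaults.
    enable_sse3_4: int = 1
    enable_avx: int = 1
    feature_simd: int = 1


def resolve_legacy(knobs, cpu_has_sse3_4, cpu_has_avx):
    """Return (vector_t_size, simd_level) for the pre-3.0 behavior described above."""
    if not knobs.feature_simd:
        # FeatureSIMD=0: Vector<T> is sized at 16 and nothing is treated as intrinsic.
        return 16, "none"
    if knobs.enable_avx and cpu_has_avx:
        # EnableSSE3_4 is ignored entirely when the AVX flag and VM check are both 1.
        return 32, "AVX"
    if knobs.enable_sse3_4 and cpu_has_sse3_4:
        return 16, "SSE3_4"
    # Assumption: whatever the JIT treats as its baseline is used otherwise.
    return 16, "baseline"
```

For example, `resolve_legacy(Knobs(enable_avx=0), True, True)` falls through to the SSE3_4 path with a 16-byte Vector<T>, while `Knobs(enable_sse3_4=0)` on the same machine changes nothing because the AVX path wins.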
Current State
This section discusses the current state of the world for netcoreapp3.0.
COMPlus_Enable<ISA>
We have a number of Enable<ISA> flags, including the pre-existing SSE3_4 and AVX flags as well as the SSE, SSE2, SSE3, SSSE3, SSE41, SSE42, AVX2, AES, PCLMULQDQ, POPCNT, FMA, LZCNT, BMI1, and BMI2 flags. All of these are used in combination with a corresponding CPUID check that the VM does.
These flags impact the compiler support for a given ISA and any ISAs that are "descendants" of that ISA (e.g. SSE=0 would also disable SSE2 which would disable any ISAs dependent on SSE2, etc). The flags are currently used primarily for the HWIntrinsics feature as that is the only thing that will cause the various instructions to be generated. In the future, these flags might be applicable more generally depending on other optimizations the JIT could consider. An exception to this is SSE and SSE2 which are considered "baseline support" by RyuJIT. These ISAs only impact the corresponding HWIntrinsic ISAs and do not actually impact compiler support for generating these instructions.
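As a rough sketch of the "descendants" rule for the HWIntrinsic ISAs, an ISA can be treated as enabled only if it and every ancestor has its flag set and passes the CPUID check. The `PARENT` map and `enabled_isas` function below are hypothetical. The linear SSE-to-AVX2 chain is an assumption pieced together from the relationships mentioned in this issue, and the remaining ISAs (AES, PCLMULQDQ, POPCNT, FMA, LZCNT) are left out because their parents are not stated here.

```python
# Assumed parent relationships (illustrative only, not the JIT's actual table).
PARENT = {
    "SSE": None,
    "SSE2": "SSE",
    "SSE3": "SSE2",
    "SSSE3": "SSE3",
    "SSE41": "SSSE3",
    "SSE42": "SSE41",
    "AVX": "SSE42",
    "AVX2": "AVX",
    "BMI1": None,  # not part of the AVX hierarchy
    "BMI2": None,
}


def enabled_isas(flags, cpu_supported):
    """flags maps an ISA name to 0/1 (missing means the default of 1);
    cpu_supported is the set of ISA names that pass the VM's CPUID check."""
    result = set()
    for isa in PARENT:
        node = isa
        ok = True
        # Walk up the ancestor chain; any disabled or unsupported ancestor disables the ISA.
        while node is not None:
            if flags.get(node, 1) == 0 or node not in cpu_supported:
                ok = False
                break
            node = PARENT[node]
        if ok:
            result.add(isa)
    return result
```

With every ISA supported by the CPU, `enabled_isas({"SSE3": 0}, set(PARENT))` leaves only SSE, SSE2, BMI1, and BMI2 enabled.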
The pre-existing SSE3_4 flag is now treated as equivalent to SSE3. It impacts the SSE3 ISA and any child ISAs (including AVX). It otherwise functions identically and continues impacting the codegen support for S.N.Vectors.
The pre-existing AVX flag continues impacting the size of Vector<T> and whether the compiler emits VEX encoded instructions. However, for the size of Vector<T> it now does so indirectly (in that disabling AVX also disables AVX2), as the check was shifted onto AVX2.
COMPlus_FeatureSIMD and COMPlus_EnableHWIntrinsic
The COMPlus_FeatureSIMD flag continues functioning as it did before.
The COMPlus_EnableHWIntrinsic flag controls whether the System.Runtime.Intrinsics methods are treated as intrinsic, and therefore whether they throw a PNSE (PlatformNotSupportedException) or generate actual code when the given ISA is supported by the CPU/Compiler. There is currently a bug where setting EnableHWIntrinsic=0 also disables compiler support for all the various ISAs listed above. This means that it currently also impacts the size of Vector<T> and whether the compiler will emit VEX encoded instructions.
Proposal for Cleanup
In this section, I will attempt to describe where we want to be with the various flags.
New Flag: EnableVEX
Currently we control the VEX support for the compiler by checking the EnableAVX flag (and corresponding CPUID check done by the VM). However, there are two ISAs, BMI1 and BMI2, that require the VEX encoding but do not require AVX to be enabled. While we should never encounter a CPU that has BMI1/BMI2 but does not also support AVX, AVX requires an additional check that the OS supports saving/restoring the 256-bit YMM registers. This support is not guaranteed and, at least on Windows, can be toggled by the user. Due to this, we need the check to be updated so that the BMI1/BMI2 ISAs (and any future ISAs with similar requirements) can still use the VEX encoding. Additionally, the VEX encoding is generally more efficient (it removes the RMW requirement from most of the instructions and supports unaligned memory addresses), so it may be desirable to still emit the VEX encoded instructions for SSE through SSE42 when the user has set EnableAVX=0.
The proposal is then to expose a new COMPlus_EnableVEX flag that is used to control the VEX encoding. Setting it to 0 would disable any ISA that requires the VEX encoding (AVX, AVX2, FMA, BMI1, and BMI2, as well as any future ISAs). Its default value (1) would allow the compiler to emit the VEX encoding for SSE through SSE42 when the CPU/OS support AVX but the user has set EnableAVX=0. It would also allow other ISAs not in the AVX hierarchy (BMI1 and BMI2) to be emitted even when the OS does not support saving/restoring the 256-bit YMM registers.
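One way to read the proposed flag is sketched below. The `resolve_vex` function and the two sets are hypothetical, the grouping of FMA with the ISAs that need YMM state follows the "AVX hierarchy" wording above, and the individual Enable<ISA> flags other than EnableAVX are ignored to keep the sketch short.

```python
# ISAs that can only be emitted with the VEX encoding (per the list above).
VEX_ONLY = {"AVX", "AVX2", "FMA", "BMI1", "BMI2"}
# Assumption: the subset that also needs the OS to save/restore the YMM registers.
NEEDS_YMM = {"AVX", "AVX2", "FMA"}


def resolve_vex(enable_vex, enable_avx, cpu_isas, os_saves_ymm):
    """Return (usable_vex_only_isas, vex_for_sse) under the proposed EnableVEX flag."""
    if not enable_vex:
        # EnableVEX=0 disables every ISA that requires the VEX encoding.
        return set(), False
    cpu_os_support_avx = "AVX" in cpu_isas and os_saves_ymm
    usable = set()
    for isa in VEX_ONLY & cpu_isas:
        if isa in NEEDS_YMM:
            # The AVX hierarchy still honors EnableAVX and the OS YMM check.
            if enable_avx and cpu_os_support_avx:
                usable.add(isa)
        else:
            # BMI1/BMI2 only need the encoding itself, not the YMM state.
            usable.add(isa)
    # SSE through SSE42 may use the VEX encoding even when EnableAVX=0.
    return usable, cpu_os_support_avx
```

Under this reading, `resolve_vex(1, 0, cpu_isas, True)` on an AVX-capable machine keeps BMI1/BMI2 usable and still allows the VEX encoded forms of the SSE instructions, which is the main behavioral difference from today's EnableAVX=0.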
An alternative would be to not expose a new flag and instead just update the emitter to know that it can use the VEX encoding for the BMI1/BMI2 ISAs. The only difference from the above would be that SSE through SSE42 would not use the VEX encoding when AVX=0 (and when the OS supports saving/restoring the 256-bit YMM registers). This might be a more accurate state since the VEX encoded forms of the SSE through SSE42 instructions are technically part of the AVX instruction set.
New Flag: VectorTSize
Currently we control the sizeof Vector<T> by defaulting it to 16 and changing it to 32 if AVX2 is supported. However, this is not very extensible (what do we do when/if AVX-512 becomes supported and the size can be 64) and it is very much tied to x86 (you wouldn't want this to impact ARM if we add SVE support). It also means that if you need a smaller Vector<T>, you must also disable the general compiler support for the AVX2 ISA (at a minimum). This also impacts the HWIntrinsics feature.
The proposal is then to expose a new COMPlus_VectorTSize flag that is used to control the sizeof Vector<T>. The value would default to 0 which would mean to follow the normal logic we have today (size to 16 by default and change to 32 if AVX2 is supported). We would then come up with an additional scheme such that other values allow the user to explicitly size Vector<T> (to a supported size).
My current thinking is that any unsupported value is treated as 0 (default). Otherwise, the supported values are the exact sizes (16 or 32, in the future 64 if AVX-512 becomes supported, etc). Another option would be to treat the value as the largest supported size that does not exceed the given size. As an example, if the user gives 31, it would be sized 16. If the user gave 64 and we only support 32 and 16, it would be 32. If the user gave 100 and we support 128, 64, 32, and 16, they would get 64.
The flag would continue being used in conjunction with the Enable<ISA> checks for a given platform, as you can't size Vector<T> to 32 if AVX is not supported (for example).
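The two candidate schemes could look something like the sketch below. Both function names are hypothetical, `supported` stands for the set of sizes left after the Enable<ISA> checks on the current platform, and `default_size` stands for what today's logic would pick (16, or 32 if AVX2 is supported). Falling back to the default when the requested value is smaller than every supported size is an assumption, since that case is not covered above.

```python
def resolve_exact(requested, supported, default_size):
    """Option 1: only exact supported sizes are honored; anything else means 0 (default)."""
    if requested in supported:
        return requested
    return default_size


def resolve_round_down(requested, supported, default_size):
    """Option 2: pick the largest supported size that does not exceed the request."""
    if requested == 0:
        return default_size
    candidates = [size for size in supported if size <= requested]
    if not candidates:
        # Assumption: a request below the smallest supported size falls back to the default.
        return default_size
    return max(candidates)
```

The examples from the text map directly onto the second function: `resolve_round_down(31, {16, 32}, 32)` gives 16, `resolve_round_down(64, {16, 32}, 32)` gives 32, and `resolve_round_down(100, {16, 32, 64, 128}, 32)` gives 64. With the first scheme, `resolve_exact(31, {16, 32}, 32)` simply falls back to the default.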
COMPlus_Enable<ISA>
These flags are currently in a fairly good state. Some considerations might be:
- Should we be exposing SSE and SSE2, or should they be folded back into the EnableHWIntrinsic flag (given that they are considered "baseline" for CoreCLR)?
- Can we get rid of SSE3_4, since this is now covered by the individual SSE3, SSSE3, SSE41, and SSE42 flags and since it is treated as equivalent to SSE3 (which will also disable the child ISAs)?
COMPlus_FeatureSIMD and COMPlus_EnableHWIntrinsic
COMPlus_FeatureSIMD should have its scope reduced so that it only impacts the S.N.Vectors codegen. The TYP_SIMD* support should be split out into its own feature that FEATURE_SIMD and FEATURE_HW_INTRINSICS can sit on top of.
COMPlus_EnableHWIntrinsic should be fixed so that it only impacts the S.R.Intrinsics codegen. It should have no impact on the various ISAs the compiler lists as supported.
category:implementation
theme:vector-codegen
skill-level:intermediate
cost:medium
impact:small