Arm64: Improve support for HW_Flag_ReturnsPerElementMask #128326
snickolls-arm wants to merge 8 commits into
When wrapping an intrinsic node that has an embedded mask with a ConditionalSelect, ensure that the constant node in op3 has a mask type when the intrinsic has the HW_Flag_ReturnsPerElementMask flag. Build out further support for ConditionalSelect_Predicates, and use this to wrap nodes with HW_Flag_ReturnsPerElementMask. Add GenTree::IsSelectZero and update various areas in HW intrinsic codegen to ensure this intrinsic assembles correctly. Use a tree visitor for assigning `TYP_MASK` to intrinsics that have `HW_Flag_ReturnsPerElementMask`. The current version of `impHWIntrinsic` does not process child nodes of the tree it returns for mask types, only the root node.
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
    {
        GenTreeHWIntrinsic* intrin = (*use)->AsHWIntrinsic();

        if (HWIntrinsicInfo::ReturnsPerElementMask(intrin->GetHWIntrinsicId()) && !intrin->TypeIs(TYP_MASK))
This is not correct for all platforms and is going to regress xarch as well as be incorrect for hardware without TYP_MASK support. I imagine it may also regress optimization opportunities for AdvSimd on Arm64.
The consideration is that many node kinds return a per element mask and are explicitly not returning TYP_MASK. For example, Vector128.GreaterThan is an API which explicitly returns a vector, but where it is conceptually known to be a "per-element mask" (i.e. each element is either AllBitsSet or Zero).
Having this knowledge, even pre-SVE or pre-AVX512, where no dedicated TYP_MASK support exists, is beneficial as it allows unlocking a number of other optimization opportunities and folding operations that may not be otherwise valid. -- These optimizations are notably missing from Arm64, in part because the SVE predication feature has deviated a lot from the general support.
Rather, we only want to do such a transform if we have TYP_MASK support and it's going to emit an instruction that actually produces a TYP_MASK, not one that is just "conceptually" a mask. In the case of xarch we do so by marking the downlevel instructions as special-import and adjusting those as needed; there is explicitly no need to do traversals of the tree since we know that we are either producing a mask (and therefore need CvtMaskToVector) or we are expecting a mask (and therefore need CvtVectorToMask).
It's unclear then why Arm64 needs to do a tree traversal itself here as it should have the same general scenario. Any given intrinsic is one of three categories (does nothing with masks, produces a mask, or consumes a mask) and so it should be trivially handled without any consideration of tree traversal.
-- The code here is only called from Arm64, but the ifdef doesn't cover that; nor does the summary or visitor make it clear it's only valid for Arm64; so future readers or refactorings may miss the consideration.
But then it's very unclear to me why we need this setup and why it needs to deviate from what's already trivially working for other platforms with masking support.
Sorry, I should've left a comment with more context.
I was having trouble with implementing the MaxMagnitude intrinsic, see this import code:
The intrinsics in this tree don't have mask types assigned and tend to cause assertions in Lowering when inserting implicit mask operands. I decided that rather than require the author of this sort of algorithm to maintain the TYP_MASK consistency for Arm64, I would add a pass based on HW_Flag_ReturnsPerElementMask to enforce that instead. This would allow you to write short algorithms in import with correct types per the CIL and have the visitor apply the types for a small runtime cost.
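As a rough illustration of the kind of pass described above, here is a minimal sketch over a toy tree. `ToyNode`, `ToyType`, and `AssignMaskTypes` are hypothetical names invented for this example and are not the JIT's `GenTree` or visitor types. It only shows the retyping step, not any conversion nodes a real pass might also need.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Toy stand-ins for vector and mask types; not the JIT's var_types.
enum class ToyType { Simd, Mask };

// Hypothetical node: a flag playing the role of HW_Flag_ReturnsPerElementMask,
// the node's current type, and its operands.
struct ToyNode
{
    bool                                  returnsPerElementMask;
    ToyType                               type;
    std::vector<std::unique_ptr<ToyNode>> children;
};

// Post-order walk: visits children first, then retypes the node itself if it
// is flagged but not yet mask-typed. Returns how many nodes were retyped.
int AssignMaskTypes(ToyNode& node)
{
    int changed = 0;
    for (auto& child : node.children)
    {
        changed += AssignMaskTypes(*child);
    }
    if (node.returnsPerElementMask && node.type != ToyType::Mask)
    {
        node.type = ToyType::Mask;
        ++changed;
    }
    return changed;
}
```

Because the walk covers the whole tree, flagged nodes below the root are retyped too, which is the gap the PR description attributes to the current import path.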
> -- The code here is only called from Arm64, but the ifdef doesn't cover that; nor does the summary or visitor make it clear it's only valid for Arm64; so future readers or refactorings may miss the consideration.
This is my mistake, it was only intended for Arm64 and I need to fix the ifdef. I will clarify the documentation too.
> These optimizations are notably missing from Arm64, in part because the SVE predication feature has deviated a lot from the general support.
@a74nh has made a good amount of progress on this recently. We're building an abstraction of a constant in terms of {pattern, value} for this. For example, strength reduction of {any pattern, any value} & {repeated, 0} => {repeated, 0}. The current way things are done I think fits in this hierarchy, just where the pattern is a 'single scalar'.
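To make the {pattern, value} idea concrete, here is a small sketch of the one strength-reduction rule quoted above. The `Pattern` enum, `PatternConst`, and `TryFoldAnd` are hypothetical names for illustration only; the actual abstraction being built may look quite different.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical pattern kinds; only "repeated" and "single scalar" are named in the discussion.
enum class Pattern { SingleScalar, Repeated, Other };

// A constant described as {pattern, value}.
struct PatternConst
{
    Pattern  pattern;
    uint64_t value;
};

// Applies the rule {any pattern, any value} & {repeated, 0} => {repeated, 0}.
// Returns std::nullopt when neither operand is a repeated zero.
std::optional<PatternConst> TryFoldAnd(const PatternConst& lhs, const PatternConst& rhs)
{
    auto isRepeatedZero = [](const PatternConst& c) {
        return c.pattern == Pattern::Repeated && c.value == 0;
    };

    if (isRepeatedZero(lhs) || isRepeatedZero(rhs))
    {
        return PatternConst{Pattern::Repeated, 0};
    }
    return std::nullopt;
}
```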
I can see how SVE has overloaded the meaning of FEATURE_MASKED_HW_INTRINSICS and HW_Flag_ReturnsPerElementMask; really it's a different feature. I think we'll want to reconcile that in the future.
> I decided that rather than require the author of this sort of algorithm to maintain the TYP_MASK consistency for Arm64
We notably handle this scenario on xarch by having the user visible API as HW_Flag_InvalidNodeId and having a different internal only intrinsic ID that is expected to have the mask. For example, we have NI_AVX512_CompareEqual which matches the managed API surface returning a Vector512<T> and then NI_AVX512_CompareEqualMask which is the internal API returning a TYP_MASK.
This helps ensure we never produce the vector returning ID in IR (as it triggers an assert when setting the ID here: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/gentree.cpp#L30507-L30531), which helps force users to think about the correct shape.
However, beyond that we also have the GetHWIntrinsicIdFor* and GetLookupTypeFor* helpers, which Arm64 isn't participating in right now (part of the deviation that probably shouldn't exist).
Note for example how gtNewSimdCmpOpNode calls GetLookupTypeForCmpOp which forces the type to TYP_MASK if AVX512 is supported, helping to canonicalize the support.
Then GetHWIntrinsicIdForCmpOp handles this and knows to return say NI_AVX512_CompareEqualMask instead of NI_X86Base_CompareEqual for a 128-bit comparison, guaranteeing the IR is correct since we then know to insert CvtMaskToVectorNode since the lookup type (mask) and actual type (simd) mismatch.
If Arm64 just participates in these existing helpers, then there's no need to have a custom visitor or different logic for SVE, it "just works" with all the existing support in the JIT. It also avoids issues if something like gtNewSimdMinMaxNode is used outside of import, which is very possible for many of the other helpers (especially due to morph or other phases doing optimizations).
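The lookup-type mechanism can be sketched with a toy model that renders the IR as text. `ToyType`, `LookupTypeForCmp`, and `NewCmpEqual` are hypothetical stand-ins for the `GetLookupTypeFor*` and `gtNewSimd*` helpers; the real helpers take different parameters and build actual IR nodes.

```cpp
#include <cassert>
#include <string>

// Toy stand-ins for the JIT's SIMD and mask types; not the real var_types enum.
enum class ToyType { Simd, Mask };

// Hypothetical analogue of a GetLookupTypeFor* helper: comparisons are looked
// up as mask-producing whenever the target has dedicated mask support.
ToyType LookupTypeForCmp(bool hasMaskSupport)
{
    return hasMaskSupport ? ToyType::Mask : ToyType::Simd;
}

// Hypothetical analogue of a gtNewSimdCmpOpNode-style helper.
std::string NewCmpEqual(const std::string& a, const std::string& b, bool hasMaskSupport)
{
    const ToyType actualType = ToyType::Simd; // the managed API always returns a vector
    const ToyType lookupType = LookupTypeForCmp(hasMaskSupport);

    // Pick the mask-returning intrinsic when the lookup type is a mask.
    const std::string id   = (lookupType == ToyType::Mask) ? "CompareEqualMask" : "CompareEqual";
    std::string       node = id + "(" + a + ", " + b + ")";

    // A mismatch between lookup type and actual type means the result must be
    // converted back to a vector for the consumer.
    if (lookupType != actualType)
    {
        node = "CvtMaskToVector(" + node + ")";
    }
    return node;
}
```

In this model the conversion is inserted at construction time, so callers outside import get the same shape without a separate pass.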
Generally speaking, I'd expect the minimum change here to be that Arm64 has GetLookupTypeFor* return TYP_MASK for any SVE intrinsic that is flagged HW_Flag_ReturnsPerElementMask and that it actually returns NI_Sve_* intrinsics from GetHWIntrinsicIdFor* when the simd size is "unknown".
This will cause almost all the existing helpers to light up and participate in all the general optimizations and to be correct by construction, rather than relying on developers to manually ensure any given instantiation is valid.
> This sounds like you're missing the separation between CndSel(vector, vectorWhenTrue, vectorWhenFalse) and CndSel(mask, vectorWhenTrue, vectorWhenFalse)?
Yes, the first version there doesn't exist for us in an instruction. We only have (mask, vector, vector) -> (vector) for vectors.
By the VM semantics we should only see CndSel(vector, vector, vector), and then need to insert a CvtVectorToMask on the first operand to ensure we get a predicate register allocated to that source node.
Ideally this results in a lot of CndSel(CvtVectorToMask(CvtMaskToVector(mask)), vector, vector) patterns that can be folded away. Other instructions have this property as well which I think is why we need the flag.
We also have an instruction for (mask, mask, mask) -> (mask) which can help us fold CndSel(CvtMaskToVector(mask), CvtMaskToVector(mask), CvtMaskToVector(mask)) in a similar manner. We should end up with CvtMaskToVector(CndSel_Predicates(mask, mask, mask)) after morph.
The instructions are:
SEL <Pd>.B, <Pg>, <Pn>.B, <Pm>.B (masks)
SEL <Zd>.<T>, <Pv>, <Zn>.<T>, <Zm>.<T> (vectors)
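The folding of `CvtVectorToMask(CvtMaskToVector(mask))` described above can be sketched on a toy expression tree. `Node`, `Make`, `Fold`, and `Print` are hypothetical helpers written for this example; they are not JIT APIs, and the sketch covers only the round-trip conversion fold, not the `CndSel_Predicates` rewrite.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Toy expression node: an operator name (or leaf name) plus operands.
struct Node
{
    std::string                        op;
    std::vector<std::shared_ptr<Node>> args;
};
using NodePtr = std::shared_ptr<Node>;

NodePtr Make(std::string op, std::vector<NodePtr> args = {})
{
    return std::make_shared<Node>(Node{std::move(op), std::move(args)});
}

// Bottom-up fold: CvtVectorToMask(CvtMaskToVector(x)) becomes x.
NodePtr Fold(const NodePtr& n)
{
    std::vector<NodePtr> args;
    for (const auto& a : n->args)
    {
        args.push_back(Fold(a));
    }

    if (n->op == "CvtVectorToMask" && args.size() == 1 &&
        args[0]->op == "CvtMaskToVector" && args[0]->args.size() == 1)
    {
        return args[0]->args[0];
    }
    return Make(n->op, std::move(args));
}

// Renders a tree as text, e.g. CndSel(mask, a, b).
std::string Print(const NodePtr& n)
{
    if (n->args.empty())
    {
        return n->op;
    }
    std::string s = n->op + "(";
    for (std::size_t i = 0; i < n->args.size(); ++i)
    {
        if (i != 0)
        {
            s += ", ";
        }
        s += Print(n->args[i]);
    }
    return s + ")";
}
```

With this fold applied, a select whose first operand is a round-tripped mask ends up taking the mask directly, which matches the vector form of SEL.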
> Yes, the first version there doesn't exist for us in an instruction.
It does, it's BSL, which SVE also exposed.
That is, operations like AdvSimd.CompareEqual produce a vector where each element is one of AllBitsSet (true) or Zero (false) and thus you can perform masking by doing (vectorWhenTrue & vector) | (vectorWhenFalse & ~vector), aka BSL which is exposed as AdvSimd.BitwiseSelect or Vector128.ConditionalSelect. -- i.e. this is how users have historically done masked operations
Then, on SVE this is simply extended such that Sve/CompareEqual may produce a "predicate" instead. Overall this is still the same concept, you are getting something that is "true" (select all bits) or "false" (select no bits) on a per-element basis and which can be used for doing masked operations. The nuance is really that it gets specialized hardware support and instructions can effectively "embed" the selection rather than making it an additional instruction.
There is some minor nuance in that predicates are not bitwise, they are elementwise; but this is itself largely irrelevant. If the "mask" is actually a vector, we just emit BSL. If it is something known to be "per element mask" then we can optimize and emit a predicated instruction instead, based on the base type the mask is known to be good for.
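The relationship between the two forms can be checked with a scalar model. This is a sketch on four 32-bit lanes using plain arrays, not real AdvSimd or SVE intrinsics; `Vec4`, `BitwiseSelect`, and `PredicatedSelect` are names made up for the example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Vec4  = std::array<uint32_t, 4>;
using Pred4 = std::array<bool, 4>;

// Bitwise select, the BSL-style formula: (whenTrue & mask) | (whenFalse & ~mask).
Vec4 BitwiseSelect(const Vec4& mask, const Vec4& whenTrue, const Vec4& whenFalse)
{
    Vec4 result{};
    for (std::size_t i = 0; i < result.size(); ++i)
    {
        result[i] = (whenTrue[i] & mask[i]) | (whenFalse[i] & ~mask[i]);
    }
    return result;
}

// Element-wise select, modelling a predicate with one boolean per lane.
Vec4 PredicatedSelect(const Pred4& pred, const Vec4& whenTrue, const Vec4& whenFalse)
{
    Vec4 result{};
    for (std::size_t i = 0; i < result.size(); ++i)
    {
        result[i] = pred[i] ? whenTrue[i] : whenFalse[i];
    }
    return result;
}
```

When every lane of the mask is either AllBitsSet or Zero, the two functions return the same result. A lane such as `0x0000FFFF` mixes bits from both inputs under `BitwiseSelect`, which is the case where the bitwise and element-wise views differ.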
Thanks for pointing that out, as I did miss BSL. I will add that it is SVE2 onwards, so we will have to make use of it opportunistically for scalable vectors.
I didn't realize that the semantics of Sve.ConditionalSelect don't match the semantics of Vector.ConditionalSelect (or AdvSimd.ConditionalSelect for that matter), as currently implemented.
Should we be changing Sve.ConditionalSelect to operate bit-wise, and add optimization cases for element-wise masks as you suggest? Or would we prefer to implement this under Vector.ConditionalSelect by carefully choosing Sve.ConditionalSelect when we have a mask parameter that looks element-wise?
We basically have the platform specific APIs (such as AdvSimd or Sve) and the xplat APIs (such as Vector<T>, Vector128<T>, etc).
The latter, xplat APIs, have a documented behavior and are otherwise left up to the flexibility and judgement of the underlying compiler to optimize how we see fit. Provided we aren't changing the observable behavior, we can emit whatever instruction sequence is deemed most optimal for the IR and hardware available. So, for example, we can decide whether we want Vector128.Sum to emit ADDV or some pairwise sequence instead.
The former, platform-specific APIs, by contrast, are expected to map exactly to a documented instruction and functionality. The JIT shouldn't be significantly rewriting their codegen or changing the premise of what they do. It is expected the developer "knows what they're doing" when they target these APIs and they are the ones generally responsible for ensuring codegen is what they desired.
-- There is a little flexibility, such as allowing constant folding, reversal of arithmetic or comparison ops, and swapping to an identically behaving instruction that is trivially known to have the same throughput and latency characteristics; but we shouldn't do anything drastically different. So, for example, AdvSimd.AddAcross must always emit exactly ADDV, we cannot rewrite it into some pairwise sequence instead or vice-versa. While we would allow something like AND(x, NOT(y)) to become AND_NOT(x, y) or ADD(x, NEG(y)) to become SUB(x, y).
-- The rules here are also not "hard set", we do have a little flexibility if there is sufficient justification and we can show it doesn't significantly transform the user's intentions. We just don't want to get in the same scenario some other compilers are in where users are surprised by the codegen and potentially unable to target specific hardware.
So in the case of Sve.ConditionalSelect it will always map to SEL and it is expected that it forces the use of a predicate even if it requires CVT_TO_PREDICATE(vector) and results in suboptimal codegen. It is expected that the developer ensure that vector is the output of something like Sve.CompareEqual which is known to produce such a predicate if they want it to be optimal. And so for this scenario it is only up to the JIT to ensure we don't lose track of something that is actually a predicate so that we don't pessimize what the user explicitly wrote.
While for Vector.ConditionalSelect, on the other hand, the user is rather just stating the conceptual operation they want done, a bitwise select. The JIT is free to emit OR(AND(true, cond), AND_NOT(false, cond)) or SEL(CVT2PRED(cond), true, false) or any other "optimal" sequence.
Thanks for the clarification, so nothing needs to be done there then. Just need to be careful with the implementation of Vector.ConditionalSelect.
One last point: is TYP_MASK supposed to be semantically bit-wise or element-wise? Currently on Arm64 it is always an element-wise representation, as it is backed by a predicate register, and this could be another source of difference between architectures/ISAs?