Arm64: Improve support for HW_Flag_ReturnsPerElementMask by snickolls-arm · Pull Request #128326 · dotnet/runtime

snickolls-arm · 2026-05-18T13:52:28Z

When wrapping an intrinsic node that has an embedded mask with a ConditionalSelect, ensure that the constant node in op3 has a mask type when the intrinsic has the HW_Flag_ReturnsPerElementMask flag.

Build out further support for ConditionalSelect_Predicates, and use this to wrap nodes with HW_Flag_ReturnsPerElementMask. Add GenTree::IsSelectZero and update various areas in HW intrinsic codegen to ensure this intrinsic assembles correctly.

Use a tree visitor for assigning `TYP_MASK` to intrinsics that have `HW_Flag_ReturnsPerElementMask`. The current version of `impHWIntrinsic` does not process child nodes of the tree it returns for mask types, only the root node.

dotnet-policy-service · 2026-05-18T13:54:17Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

tannergooding · 2026-05-26T12:32:04Z

+        {
+            GenTreeHWIntrinsic* intrin = (*use)->AsHWIntrinsic();
+
+            if (HWIntrinsicInfo::ReturnsPerElementMask(intrin->GetHWIntrinsicId()) && !intrin->TypeIs(TYP_MASK))


This is not correct for all platforms and is going to regress xarch as well as be incorrect for hardware without TYP_MASK support. I imagine it may also regress optimization opportunities for AdvSimd on Arm64.

The consideration is that many node kinds return a per element mask and are explicitly not returning TYP_MASK. For example, Vector128.GreaterThan is an API which explicitly returns a vector, but where it is conceptually known to be a "per-element mask" (i.e. each element is either AllBitsSet or Zero).
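To make the "per-element mask" idea concrete, here is a minimal sketch in plain C++. `Vec4`, `greater_than`, and `is_per_element_mask` are hypothetical names for a toy 4 x 32-bit vector model; they are not JIT or .NET APIs, and the comparison is unsigned purely for simplicity.

```cpp
#include <cstdint>

// Toy 4 x 32-bit vector; an illustrative stand-in, not the JIT's representation.
struct Vec4
{
    std::uint32_t e[4];
};

// Models a vector-returning comparison: each result element is
// AllBitsSet when a > b holds for that element, and Zero otherwise.
constexpr Vec4 greater_than(const Vec4& a, const Vec4& b)
{
    Vec4 r{};
    for (int i = 0; i < 4; ++i)
    {
        r.e[i] = (a.e[i] > b.e[i]) ? 0xFFFFFFFFu : 0u;
    }
    return r;
}

// True when every element is either AllBitsSet or Zero.
constexpr bool is_per_element_mask(const Vec4& v)
{
    for (int i = 0; i < 4; ++i)
    {
        if (v.e[i] != 0u && v.e[i] != 0xFFFFFFFFu)
        {
            return false;
        }
    }
    return true;
}

// The result is still an ordinary vector, but it is known to be a per-element mask.
static_assert(is_per_element_mask(greater_than(Vec4{{5u, 1u, 7u, 0u}}, Vec4{{3u, 2u, 7u, 0u}})),
              "comparison results are per-element masks");
```

The point of the sketch is that the "mask" property is a fact about the values, independent of whether the node is typed as a vector or as a dedicated mask type.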

Having this knowledge, even pre-SVE or pre-AVX512, where no dedicated TYP_MASK support exists, is beneficial as it allows unlocking a number of other optimization opportunities and folding operations that may not be otherwise valid. -- These optimizations are notably missing from Arm64, in part because the SVE predication feature has deviated a lot from the general support.

Rather, we only want to do such a transform if we have TYP_MASK support and it's going to emit an instruction that actually produces a TYP_MASK, not one that is just "conceptually" a mask. In the case of xarch we do so by marking the downlevel instructions as special-import and adjusting those as needed; there is explicitly no need to do traversals of the tree since we know that we are either producing a mask (and therefore need CvtMaskToVector) or we are expecting a mask (and therefore need CvtVectorToMask).

It's unclear then why Arm64 needs to do a tree traversal itself here as it should have the same general scenario. Any given intrinsic is one of three categories (does nothing with masks, produces a mask, or consumes a mask) and so it should be trivially handled without any consideration of tree traversal.
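The three categories can be sketched as a small lookup, under the assumption that the conversion is decided locally per intrinsic. `MaskRole`, `Fixup`, and `fixup_for` are hypothetical names used only for illustration; they do not exist in the JIT.

```cpp
// Hypothetical classification of an intrinsic with respect to masks.
enum class MaskRole
{
    None,         // does nothing with masks
    ProducesMask, // the instruction writes a mask/predicate register
    ConsumesMask  // the instruction reads a mask/predicate operand
};

// Hypothetical description of the conversion needed at the vector boundary.
enum class Fixup
{
    None,
    CvtMaskToVector, // wrap the result so vector-typed users see a vector
    CvtVectorToMask  // wrap the operand so the instruction sees a mask
};

// The decision depends only on the intrinsic itself, so no tree traversal is needed.
constexpr Fixup fixup_for(MaskRole role)
{
    switch (role)
    {
        case MaskRole::ProducesMask:
            return Fixup::CvtMaskToVector;
        case MaskRole::ConsumesMask:
            return Fixup::CvtVectorToMask;
        default:
            return Fixup::None;
    }
}

static_assert(fixup_for(MaskRole::None) == Fixup::None, "mask-agnostic intrinsics need no conversion");
```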

-- The code here is only called from Arm64, but the ifdef doesn't cover that, nor does the summary or visitor make it clear it's only valid for Arm64, so future readers or refactorings may miss the consideration.

But then it's very unclear to me why we need this setup and why it needs to deviate from what's already trivially working for other platforms with masking support.

Sorry, I should've left a comment with more context.

I was having trouble with implementing the MaxMagnitude intrinsic, see this import code:

https://github.com/dotnet/runtime/blob/758abdf6906992c73adcd2c5ad8a1f5ecd0d70c7/src/coreclr/jit/gentree.cpp#L26420-L26514

The intrinsics in this tree don't have mask types assigned and tend to cause assertions in Lowering when inserting implicit mask operands. I decided that rather than require the author of this sort of algorithm to maintain the TYP_MASK consistency for Arm64, I would add a pass based on HW_Flag_ReturnsPerElementMask to enforce that instead. This would allow you to write short algorithms in import with correct types per the CIL and have the visitor apply the types for a small runtime cost.

> -- The code here is only called from Arm64, but the ifdef doesn't cover that, nor does the summary or visitor make it clear it's only valid for Arm64, so future readers or refactorings may miss the consideration.

This is my mistake, it was only intended for Arm64 and I need to fix the ifdef. I will clarify the documentation too.

> These optimizations are notably missing from Arm64, in part because the SVE predication feature has deviated a lot from the general support.

@a74nh has made a good amount of progress on this recently. We're building an abstraction of a constant in terms of {pattern, value} for this. For example, strength reduction of {any pattern, any value} & {repeated, 0} => {repeated, 0}. The current way things are done I think fits in this hierarchy, just where the pattern is a 'single scalar'.
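As a rough illustration of the {pattern, value} idea, the following sketch folds a bitwise AND when either side is a repeated zero. The `Pattern`, `MaskConst`, `FoldResult`, and `fold_and` names are assumptions made up for this example; the actual abstraction being built may look quite different.

```cpp
#include <cstdint>

// Hypothetical pattern kinds for a constant.
enum class Pattern
{
    Repeated,     // the same value in every element
    SingleScalar, // one scalar value
    Unknown       // any other pattern
};

// Hypothetical {pattern, value} description of a constant.
struct MaskConst
{
    Pattern       pattern;
    std::uint64_t value;
};

struct FoldResult
{
    bool      folded; // true when a simplification applied
    MaskConst value;  // only meaningful when folded is true
};

constexpr bool is_repeated_zero(const MaskConst& c)
{
    return c.pattern == Pattern::Repeated && c.value == 0;
}

// {any pattern, any value} & {repeated, 0} => {repeated, 0}
constexpr FoldResult fold_and(const MaskConst& lhs, const MaskConst& rhs)
{
    if (is_repeated_zero(lhs) || is_repeated_zero(rhs))
    {
        return FoldResult{true, MaskConst{Pattern::Repeated, 0}};
    }
    return FoldResult{false, MaskConst{Pattern::Unknown, 0}};
}

static_assert(fold_and(MaskConst{Pattern::Unknown, 0x1234u}, MaskConst{Pattern::Repeated, 0u}).folded,
              "AND with a repeated zero folds");
```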

I can see how SVE has overloaded the meaning of FEATURE_MASKED_HW_INTRINSICS and HW_Flag_ReturnsPerElementMask; really it's a different feature. I think we'll want to reconcile that in the future.

> I decided that rather than require the author of this sort of algorithm to maintain the TYP_MASK consistency for Arm64

We notably handle this scenario on xarch by flagging the user-visible API as HW_Flag_InvalidNodeId and having a different internal-only intrinsic ID that is expected to have the mask. For example, we have NI_AVX512_CompareEqual which matches the managed API surface returning a Vector512<T> and then NI_AVX512_CompareEqualMask which is the internal API returning a TYP_MASK.

This helps ensure we never produce the vector returning ID in IR (as it triggers an assert when setting the ID here: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/gentree.cpp#L30507-L30531), which helps force users to think about the correct shape.

However, beyond that we also have the GetHWIntrinsicIdFor* and GetLookupTypeFor* helpers, which Arm64 isn't participating in right now (part of the deviation that probably shouldn't exist).

Note for example how gtNewSimdCmpOpNode calls GetLookupTypeForCmpOp which forces the type to TYP_MASK if AVX512 is supported, helping to canonicalize the support.

Then GetHWIntrinsicIdForCmpOp handles this and knows to return say NI_AVX512_CompareEqualMask instead of NI_X86Base_CompareEqual for a 128-bit comparison, guaranteeing the IR is correct since we then know to insert CvtMaskToVectorNode since the lookup type (mask) and actual type (simd) mismatch.
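The lookup-type flow described above can be modelled roughly as follows. This is a toy sketch: `VarType`, `CmpIntrinsic`, `CmpNode`, and the three functions are hypothetical simplifications, and the real `GetLookupTypeForCmpOp` and `GetHWIntrinsicIdForCmpOp` helpers take different parameters.

```cpp
// Hypothetical, simplified type and intrinsic enums.
enum class VarType
{
    Simd,
    Mask
};

enum class CmpIntrinsic
{
    CompareEqual,    // vector-returning form
    CompareEqualMask // mask-returning form
};

struct CmpNode
{
    CmpIntrinsic id;
    bool         wrapInCvtMaskToVector; // true when lookup type and actual type mismatch
};

// With mask support the comparison is canonicalized to produce a mask.
constexpr VarType lookup_type_for_cmp(bool hasMaskSupport)
{
    return hasMaskSupport ? VarType::Mask : VarType::Simd;
}

constexpr CmpIntrinsic intrinsic_for_cmp(VarType lookupType)
{
    return (lookupType == VarType::Mask) ? CmpIntrinsic::CompareEqualMask : CmpIntrinsic::CompareEqual;
}

// actualType is the type the caller asked for (what the managed API returns).
constexpr CmpNode new_cmp_node(VarType actualType, bool hasMaskSupport)
{
    VarType lookup = lookup_type_for_cmp(hasMaskSupport);
    return CmpNode{intrinsic_for_cmp(lookup), lookup == VarType::Mask && actualType == VarType::Simd};
}

static_assert(new_cmp_node(VarType::Simd, true).wrapInCvtMaskToVector,
              "a mask-producing compare used as a vector needs a conversion");
```

Because the conversion is decided where the node is created, any phase that builds a comparison through the helper gets a consistent shape without a later fix-up pass.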

If Arm64 just participates in these existing helpers, then there's no need to have a custom visitor or different logic for SVE, it "just works" with all the existing support in the JIT. It also avoids issues if something like gtNewSimdMinMaxNode is used outside of import, which is very possible for many of the other helpers (especially due to morph or other phases doing optimizations).

Generally speaking, I'd expect the minimum changes here to be that Arm64 has GetLookupTypeFor* return TYP_MASK for any SVE intrinsic that is flagged HW_Flag_ReturnsPerElementMask and that it actually returns NI_Sve_* intrinsics from GetHWIntrinsicIdFor* when the simd size is "unknown".

This will cause almost all the existing helpers to light up and participate in all the general optimizations and to be correct by construction, rather than relying on developers to manually ensure any given instantiation is valid.

This sounds like you're missing the separation between CndSel(vector, vectorWhenTrue, vectorWhenFalse) and CndSel(mask, vectorWhenTrue, vectorWhenFalse)?

Yes, the first version there doesn't exist for us in an instruction. We only have (mask, vector, vector) -> (vector) for vectors.

By the VM semantics we should only see CndSel(vector, vector, vector), and then need to insert a CvtVectorToMask on the first operand to ensure we get a predicate register allocated to that source node.

Ideally this results in a lot of CndSel(CvtVectorToMask(CvtMaskToVector(mask)), vector, vector) patterns that can be folded away. Other instructions have this property as well which I think is why we need the flag.

We also have an instruction for (mask, mask, mask) -> (mask) which can help us fold CndSel(CvtMaskToVector(mask), CvtMaskToVector(mask), CvtMaskToVector(mask)) in a similar manner. We should end up with CvtMaskToVector(CndSel_Predicates(mask, mask, mask)) after morph.
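The round-trip fold mentioned above can be sketched on a toy expression tree. `Op`, `Node`, and `fold_round_trip` are hypothetical names, and the sketch assumes both conversions use the same element size so that the round trip is an identity.

```cpp
// Hypothetical node kinds for a toy tree; not the JIT's GenTree.
enum class Op
{
    MaskLeaf,
    VectorLeaf,
    CvtMaskToVector,
    CvtVectorToMask
};

struct Node
{
    Op          op;
    const Node* operand; // nullptr for leaves
};

// Folds CvtVectorToMask(CvtMaskToVector(m)) to m; any other node is returned unchanged.
constexpr const Node* fold_round_trip(const Node* n)
{
    if (n->op == Op::CvtVectorToMask && n->operand != nullptr && n->operand->op == Op::CvtMaskToVector)
    {
        return n->operand->operand;
    }
    return n;
}

// mask -> vector -> mask collapses back to the original mask node.
constexpr Node kMask{Op::MaskLeaf, nullptr};
constexpr Node kMaskAsVector{Op::CvtMaskToVector, &kMask};
constexpr Node kBackToMask{Op::CvtVectorToMask, &kMaskAsVector};

// A genuine vector converted to a mask has nothing to fold.
constexpr Node kVector{Op::VectorLeaf, nullptr};
constexpr Node kVectorToMask{Op::CvtVectorToMask, &kVector};

static_assert(fold_round_trip(&kBackToMask) == &kMask, "round trip folds away");
static_assert(fold_round_trip(&kVectorToMask) == &kVectorToMask, "plain conversion is kept");
```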

The instructions are:

SEL <Pd>.B, <Pg>, <Pn>.B, <Pm>.B (masks)
SEL <Zd>.<T>, <Pv>, <Zn>.<T>, <Zm>.<T> (vectors)

> Yes, the first version there doesn't exist for us in an instruction.

It does, it's BSL, which SVE also exposed.

That is, operations like AdvSimd.CompareEqual produce a vector where each element is one of AllBitsSet (true) or Zero (false) and thus you can perform masking by doing (vectorWhenTrue & vector) | (vectorWhenFalse & ~vector), aka BSL which is exposed as AdvSimd.BitwiseSelect or Vector128.ConditionalSelect. -- i.e. this is how users have historically done masked operations
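The bitwise select formula above, `(vectorWhenTrue & vector) | (vectorWhenFalse & ~vector)`, can be written out on a toy 4 x 32-bit vector. `Vec4` and `bitwise_select` are hypothetical names for this sketch only.

```cpp
#include <cstdint>

// Toy 4 x 32-bit vector; an illustrative stand-in for a SIMD register.
struct Vec4
{
    std::uint32_t e[4];
};

// Bitwise select: each result bit comes from whenTrue where the mask bit
// is set, and from whenFalse where the mask bit is clear.
constexpr Vec4 bitwise_select(const Vec4& mask, const Vec4& whenTrue, const Vec4& whenFalse)
{
    Vec4 r{};
    for (int i = 0; i < 4; ++i)
    {
        r.e[i] = (whenTrue.e[i] & mask.e[i]) | (whenFalse.e[i] & ~mask.e[i]);
    }
    return r;
}

// With a per-element mask, whole elements are picked from one side or the other.
static_assert(bitwise_select(Vec4{{0xFFFFFFFFu, 0u, 0xFFFFFFFFu, 0u}},
                             Vec4{{1u, 2u, 3u, 4u}},
                             Vec4{{10u, 20u, 30u, 40u}}).e[1] == 20u,
              "a Zero mask element selects whenFalse");
```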

Then, on SVE this is simply extended such that Sve/CompareEqual may produce a "predicate" instead. Overall this is still the same concept, you are getting something that is "true" (select all bits) or "false" (select no bits) on a per-element basis and which can be used for doing masked operations. The nuance is really that it gets specialized hardware support and instructions can effectively "embed" the selection rather than making it an additional instruction.

There is some minor nuance in that predicates are not bitwise, they are elementwise; but this is itself largely irrelevant. If the "mask" is actually a vector, we just emit BSL. If it is something known to be "per element mask" then we can optimize and emit a predicated instruction instead, based on the base type the mask is known to be good for.
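The bitwise versus elementwise nuance can be shown with a short sketch. It assumes, for illustration only, that converting a vector to a predicate marks a lane true when the element is non-zero. All names here (`Vec4`, `bitwise_select`, `predicated_select`) are hypothetical.

```cpp
#include <cstdint>

struct Vec4
{
    std::uint32_t e[4];
};

// Bitwise select, as BSL would perform it.
constexpr Vec4 bitwise_select(const Vec4& mask, const Vec4& whenTrue, const Vec4& whenFalse)
{
    Vec4 r{};
    for (int i = 0; i < 4; ++i)
    {
        r.e[i] = (whenTrue.e[i] & mask.e[i]) | (whenFalse.e[i] & ~mask.e[i]);
    }
    return r;
}

// Elementwise select through a predicate.
// Assumption: a lane is "true" when the mask element is non-zero.
constexpr Vec4 predicated_select(const Vec4& mask, const Vec4& whenTrue, const Vec4& whenFalse)
{
    Vec4 r{};
    for (int i = 0; i < 4; ++i)
    {
        r.e[i] = (mask.e[i] != 0u) ? whenTrue.e[i] : whenFalse.e[i];
    }
    return r;
}

constexpr Vec4 kTrue{{0x11111111u, 0x11111111u, 0x11111111u, 0x11111111u}};
constexpr Vec4 kFalse{{0x22222222u, 0x22222222u, 0x22222222u, 0x22222222u}};

// A partial mask element mixes bits under bitwise select but picks a whole element under predication.
constexpr Vec4 kPartial{{0x0000FFFFu, 0u, 0u, 0u}};
static_assert(bitwise_select(kPartial, kTrue, kFalse).e[0] == 0x22221111u, "bitwise result mixes both inputs");
static_assert(predicated_select(kPartial, kTrue, kFalse).e[0] == 0x11111111u, "predicated result takes whenTrue");
```

When the mask is known to be a per-element mask (every element AllBitsSet or Zero), the two functions agree, which is why swapping one form for the other is only safe with that knowledge.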

Thanks for pointing that out, I did miss BSL. I will add that it is SVE2 onwards, so we will have to make use of it opportunistically for scalable vectors.

I didn't realize that the semantics of Sve.ConditionalSelect don't match the semantics of Vector.ConditionalSelect (or AdvSimd.ConditionalSelect for that matter), as currently implemented.

Should we be changing Sve.ConditionalSelect to operate bit-wise, and add optimization cases for element-wise masks as you suggest? Or would we prefer to implement this under Vector.ConditionalSelect by carefully choosing Sve.ConditionalSelect when we have a mask parameter that looks element-wise?

We basically have the platform specific APIs (such as AdvSimd or Sve) and the xplat APIs (such as Vector<T>, Vector128<T>, etc).

The latter, xplat APIs, have a documented behavior and are otherwise left up to the flexibility and judgement of the underlying compiler to optimize how we see fit. Provided we aren't changing the observable behavior, we can emit whatever instruction sequence is deemed most optimal for the IR and hardware available. So, for example, we can decide whether we want Vector128.Sum to emit ADDV or some pairwise sequence instead.

The former, platform-specific APIs, are instead expected to map exactly to a documented instruction and functionality. The JIT shouldn't be significantly rewriting their codegen or changing the premise of what they do. It is expected the developer "knows what they're doing" when they target these APIs and they are the ones generally responsible for ensuring codegen is what they desired.

-- There is a little flexibility, such as allowing constant folding, reversal of arithmetic or comparison ops, and swapping to an identically behaving instruction that is trivially known to have the same throughput and latency characteristics; but we shouldn't do anything drastically different. So, for example, AdvSimd.AddAcross must always emit exactly ADDV; we cannot rewrite it into some pairwise sequence instead or vice-versa. We would, however, allow something like AND(x, NOT(y)) to become AND_NOT(x, y) or ADD(x, NEG(y)) to become SUB(x, y).

-- The rules here are also not "hard set"; we do have a little flexibility if there is sufficient justification and we can show it doesn't significantly transform the user's intentions. We just don't want to get into the same scenario some other compilers are in, where users are surprised by the codegen and potentially unable to target specific hardware.

So in the case of Sve.ConditionalSelect it will always map to SEL and it is expected that it forces the use of a predicate even if it requires CVT_TO_PREDICATE(vector) and results in suboptimal codegen. It is expected that the developer ensure that vector is the output of something like Sve.CompareEqual which is known to produce such a predicate if they want it to be optimal. And so for this scenario it is only up to the JIT to ensure we don't lose track of something that is actually a predicate so that we don't pessimize what the user explicitly wrote.

For Vector.ConditionalSelect, on the other hand, the user is just stating the conceptual operation they want done, a bitwise select. The JIT is free to emit OR(AND(true, cond), AND_NOT(false, cond)) or SEL(CVT2PRED(cond), true, false) or any other "optimal" sequence.

Thanks for the clarification, so nothing needs to be done there then. Just need to be careful with the implementation of Vector.ConditionalSelect.

One last point, is TYP_MASK supposed to be semantically bit-wise or element-wise? Currently on Arm64 it is always an element-wise representation as it is backed by a predicate register, and this could be another source of difference between architectures/ISAs?

github-actions Bot added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) on May 18, 2026

dotnet-policy-service Bot added the community-contribution label (indicates that the PR has been added by a community member) on May 18, 2026

snickolls-arm mentioned this pull request May 19, 2026

Accelerating Vector<T> with SVE on ARM64 #120599

Open

16 tasks

Merge branch 'main' into fix-conditionalselect-predicates

3d7b8b5

tannergooding reviewed May 26, 2026

build-analysis Bot mentioned this pull request May 26, 2026

XHarness package install failure on iOS due to devicectl NSPOSIXErrorDomain error 49 #123796

Open

snickolls-arm and others added 3 commits May 27, 2026 13:08

Revert general changes to HWIntrinsic importer

0510dd6

Merge branch 'main' into fix-conditionalselect-predicates

cf281c5

Fix type inconsistencies

b147ef6

build-analysis Bot mentioned this pull request Jun 15, 2026

CI build failure: error CS8032: An instance of analyzer Microsoft.NetCore.CSharp.Analyzers.... cannot be created #129031

Open

snickolls-arm and others added 3 commits June 16, 2026 10:24

Merge branch 'main' into fix-conditionalselect-predicates

a1bdef8

Evaluate a mask pattern when importing CreateTrueMask/CreateFalseMask

1b3d67b

Formatting

73cf09e

snickolls-arm wants to merge 8 commits into dotnet:main from snickolls-arm:fix-conditionalselect-predicates

snickolls-arm commented May 18, 2026 (edited)
dotnet-policy-service Bot commented May 18, 2026
tannergooding May 26, 2026
tannergooding May 26, 2026
snickolls-arm May 26, 2026
tannergooding May 26, 2026
tannergooding May 26, 2026
snickolls-arm Jun 15, 2026 (edited)
tannergooding Jun 15, 2026
snickolls-arm Jun 16, 2026
tannergooding Jun 16, 2026
snickolls-arm Jun 16, 2026

3 participants
