rounding shift rights should use rounding halving add #6494
On x86 we currently lower cast<uint8_t>((cast<uint16_t>(x) + 8) / 16) to:
cast<uint8_t>(shift_right(widening_add(x, 8), 4))
This compiles to 8 instructions on x86: widen each half of the input vector, add 8 to each half-vector, shift each half-vector, then narrow each half-vector.
First, this should have been a rounding_shift_right. Some patterns were missing in FindIntrinsics.
Second, rounding_shift_right had suboptimal codegen in the case where the second arg is a positive const. On archs without a rounding shift right instruction you can further rewrite this to:
shift_right(rounding_halving_add(x, 7), 3)
which is just two instructions on x86.
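As a sanity check on the rewrite described above, here is a minimal scalar sketch that compares the widened form against the average-then-shift form for every uint8_t input. The helper names (widened_round_shift, rounding_halving_add_u8, average_then_shift, count_mismatches) are hypothetical scalar models written for this illustration; they are not Halide's implementation, and rounding_halving_add is assumed to mean (a + b + 1) >> 1 computed without overflow.

```cpp
#include <cstdint>

// Scalar model of the original expression:
// cast<uint8_t>((cast<uint16_t>(x) + 8) / 16)
inline uint8_t widened_round_shift(uint8_t x) {
    return static_cast<uint8_t>((static_cast<uint16_t>(x) + 8) / 16);
}

// Assumed scalar model of rounding_halving_add on uint8_t:
// (a + b + 1) >> 1, evaluated in a wider type so it cannot overflow.
inline uint8_t rounding_halving_add_u8(uint8_t a, uint8_t b) {
    return static_cast<uint8_t>((static_cast<uint16_t>(a) + b + 1) >> 1);
}

// Scalar model of the rewritten form:
// shift_right(rounding_halving_add(x, 7), 3)
inline uint8_t average_then_shift(uint8_t x) {
    return static_cast<uint8_t>(rounding_halving_add_u8(x, 7) >> 3);
}

// Counts the uint8_t inputs on which the two forms disagree.
inline int count_mismatches() {
    int mismatches = 0;
    for (int i = 0; i < 256; i++) {
        const uint8_t x = static_cast<uint8_t>(i);
        if (widened_round_shift(x) != average_then_shift(x)) {
            mismatches++;
        }
    }
    return mismatches;
}
```

The two forms agree because halving (x + 8) with floor and then shifting right by 3 is the same as shifting (x + 8) right by 4, and the average never needs more than 8 bits of output.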
Dillon, I have a question for you in the IRMatch.h code. I'm a bit confused about the types of the second arg to shifts.
// Assuming the args have the same type as the intrinsic is incorrect in
// general. But for the intrinsics we can fold (just shifts), the LHS
// has the same type as the intrinsic, and we can always treat the RHS
// as a signed int, because we're using 64 bits for it. (TODO: Dillon,
Yes, I think this is OK. Constants will be simplified to unsigned shifts if needed.
check("pavgw", 4 * w, u16((u32(u16_1) + u32(u16_2) + 1) >> 1));

// Rounding right shifts should also use pavg
check("pavgb", 8 * w, u8((u16(u8_1) + 15) >> 4));
I think these added checks should be in test/correctness/intrinsics.cpp instead (much lighter weight test than simd_op_check and it is target independent).
It's a test that rounding_shift_right is both caught by pattern matching and also lowers correctly on x86, so I'll move half of it to intrinsics
Aren't the two tests below this the same test after accounting for the pattern matching part?
Hopefully the latest commit makes it clearer. The first two tests aren't just a rounding shift right. They strength-reduce to an average followed by a shift right, so they test the new patterns.
The second test checks that lower_rounding_shift_right does the right thing.
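To illustrate the strength reduction mentioned in this comment, the sketch below models the test expression u8((u16(x) + 15) >> 4) as an average followed by a shift. The split of the constant 15 into an average with 14 and a shift by 3 is an assumption inferred from the arithmetic, not something stated in the thread, and the helper names (pavgb_model, widened_test_expr, average_then_shift_15, count_mismatches_15) are hypothetical.

```cpp
#include <cstdint>

// Assumed scalar model of one pavgb lane: (a + b + 1) >> 1 without overflow.
inline uint8_t pavgb_model(uint8_t a, uint8_t b) {
    return static_cast<uint8_t>((static_cast<uint16_t>(a) + b + 1) >> 1);
}

// Scalar model of the test expression: u8((u16(x) + 15) >> 4)
inline uint8_t widened_test_expr(uint8_t x) {
    return static_cast<uint8_t>((static_cast<uint16_t>(x) + 15) >> 4);
}

// Assumed strength-reduced form: average with 14, then shift right by 3.
// (x + 15) >> 4 == ((x + 14 + 1) >> 1) >> 3
inline uint8_t average_then_shift_15(uint8_t x) {
    return static_cast<uint8_t>(pavgb_model(x, 14) >> 3);
}

// Counts the uint8_t inputs on which the two forms disagree.
inline int count_mismatches_15() {
    int mismatches = 0;
    for (int i = 0; i < 256; i++) {
        const uint8_t x = static_cast<uint8_t>(i);
        if (widened_test_expr(x) != average_then_shift_15(x)) {
            mismatches++;
        }
    }
    return mismatches;
}
```

Because the added constant is 15 rather than 8, the expression is not a rounding shift right by 4, which is why it exercises the average-plus-shift patterns rather than lower_rounding_shift_right.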
Oh right, I hadn't noticed the constant isn't a rounding_shift_right. Still, I think the change you just made is good (easier tests to debug if they fail).
rewrite(min(shift_right(widening_mul(x, y), z), upper), mul_shift_right(x, y, cast(unsigned_type, z)), is_x_same_uint && is_uint(z)) ||
rewrite(min(rounding_shift_right(widening_mul(x, y), z), upper), rounding_mul_shift_right(x, y, cast(unsigned_type, z)), is_x_same_uint && is_uint(z)) ||
if (
// Saturating patterns
Side comment, the reformatting here makes this change a lot harder to review easily.
+1, may I suggest restoring the original format and putting the "clang-format off" comments back in place? If we want to let these be 'naturally' reformatted then IMHO we should do that in a standalone PR.
I will in future, but here I just manually reformatted a few surrounding lines because they were hard to modify. What I'd done happened to agree with clang-format, so I also removed those comments.
In general rewrite rules are exempt from clang-format, but these particular ones use named intrinsics, so they get very long indeed.