Conversation
```diff
- Expr divided = ((prod_sum / divisor) + prod_sum) / divisor;
+ Expr shift = Cast::make(UInt(2 * bits), bits);
+ Expr prod_sum = widening_mul(zero_val, inverse_typed_weight) + widening_mul(one_val, typed_weight);
+ Expr divided = rounding_shift_right(rounding_shift_right(prod_sum, shift) + prod_sum, shift);
```
This is equivalent to the original code, but I'm not convinced it's a good idea on x86. I bet you can do better with an average-round-up instruction.
It's a trick to do x/255 as (x/256 + x)/256
We should document that trick in a comment.
I might spend some time trying to figure out how to do that division by 2^n - 1 more cheaply, but that would be a follow-up PR.
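For reference, here is a minimal standalone check (mine, not part of the PR) that the rounded form used in the diff above, `rounding_shift_right(rounding_shift_right(x, 8) + x, 8)`, matches round(x / 255) over the full range of a uint8 * uint8 product, assuming Halide's definition `rounding_shift_right(v, n) == (v + (1 << (n - 1))) >> n`:

```cpp
#include <cstdint>
#include <cstdio>

// rounding_shift_right(v, n) per its Halide definition: (v + 2^(n-1)) >> n.
static uint32_t rsr(uint32_t v, int n) {
    return (v + (1u << (n - 1))) >> n;
}

int main() {
    // Check the trick against round(x / 255) == floor((2x + 255) / 510)
    // for every possible uint8 * uint8 product.
    for (uint32_t x = 0; x <= 255 * 255; x++) {
        uint32_t trick = rsr(rsr(x, 8) + x, 8);
        uint32_t exact = (2 * x + 255) / 510;
        if (trick != exact) {
            printf("mismatch at x = %u: %u vs %u\n", x, trick, exact);
            return 1;
        }
    }
    printf("OK: trick == round(x / 255) for all x in [0, 255 * 255]\n");
    return 0;
}
```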
> It's a trick to do x/255 as (x/256 + x)/256
OK yeah, division by 2^n-1...
Note that this is a special case of one of our unsigned division lowering methods. I was about to suggest just dividing and relying on that, but that only covers up to 255.
Actually I could easily extend that table to include 65535 and 2^32 - 1 ...
btw the nasty thing about this trick is that it fails for the largest uint16_t, so you have to have some way to know that's not possible (as we do here).
On the other hand, x/255 == ((x + 1)/256 + x + 1)/256 seems to hold for all uint16 x.
So maybe the resolution here is teaching the ARM backend this trick for division by 255.
Edit: Except that it overflows for the largest uint16. Sigh.
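A quick scalar sketch (mine, not from the thread) confirming both halves of that: the identity holds for all uint16 x when evaluated with 32-bit headroom, but the x + 1 wraps at x = 65535 in strict 16-bit arithmetic:

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    // With 32-bit headroom, x/255 == ((x + 1)/256 + x + 1)/256 for all uint16 x.
    for (uint32_t x = 0; x <= 65535; x++) {
        uint32_t trick = ((x + 1) / 256 + x + 1) / 256;
        if (trick != x / 255) {
            printf("mismatch at x = %u\n", x);
            return 1;
        }
    }
    // In 16-bit arithmetic, x + 1 wraps to 0 at x = 65535, so the trick
    // yields 0 where x / 255 == 257.
    uint16_t x = 65535;
    uint16_t xp1 = static_cast<uint16_t>(x + 1);  // wraps to 0
    uint16_t trick16 = static_cast<uint16_t>((xp1 / 256 + xp1) / 256);
    printf("16-bit at 65535: trick == %u, exact == %u\n",
           static_cast<unsigned>(trick16), 65535u / 255);
    return 0;
}
```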
Do any of us have a copy of the book "Hacker's Delight"? IIRC it has quite a long section on this sort of technique, but I last read it at a different employer ~20 years ago...
Here's a useful identity for uint16s:
```
cast<uint8_t>(x / 255) ===
    let y = saturating_sub(x, 127) in
    cast<uint8_t>(rounding_shift_right(halving_add(rounding_shift_right(y, 8), y), 7))
```
On ARM we currently lower x/255 as:

```asm
umull v3.4s, v2.4h, v0.4h
umull2 v2.4s, v2.8h, v1.8h
ushr v3.4s, v3.4s, #23
ushr v2.4s, v2.4s, #23
xtn v3.4h, v3.4s
xtn2 v3.8h, v2.4s
xtn v2.8b, v3.8h
```
but using this identity we get:

```asm
uqsub v1.8h, v1.8h, v0.8h
urshr v2.8h, v1.8h, #8
uhadd v1.8h, v2.8h, v1.8h
rshrn v1.8b, v1.8h, #7
```
Nice.
Edit: nevermind, this also overflows in the addition. Dang.
Second edit: Fixed, by using an averaging op at the cost of an additional instruction. Now it looks a lot like method 2 of our division methods...
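FWIW, a scalar sketch (my reconstruction; every op is modeled with 32-bit headroom, which sidesteps the 16-bit overflow the edits describe) that checks the identity above for all uint16 x:

```cpp
#include <cstdint>
#include <cstdio>

// Scalar models of the fixed-point ops, evaluated with 32-bit headroom.
static uint32_t saturating_sub(uint32_t a, uint32_t b) { return a > b ? a - b : 0; }
static uint32_t rsr(uint32_t v, int n) { return (v + (1u << (n - 1))) >> n; }
static uint32_t halving_add(uint32_t a, uint32_t b) { return (a + b) >> 1; }

int main() {
    for (uint32_t x = 0; x <= 65535; x++) {
        uint32_t y = saturating_sub(x, 127);
        uint8_t rhs = static_cast<uint8_t>(rsr(halving_add(rsr(y, 8), y), 7));
        uint8_t lhs = static_cast<uint8_t>(x / 255);
        if (lhs != rhs) {
            printf("mismatch at x = %u: %u vs %u\n", x, lhs, rhs);
            return 1;
        }
    }
    printf("OK: identity holds for all uint16 x (with 32-bit headroom)\n");
    return 0;
}
```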
> Do any of us have a copy of the book "Hacker's Delight"? IIRC it has quite a long section on this sort of technique, but I last read it at a different employer ~20 years ago...
@steven-johnson - very late, I know, but I keep a copy on my desk. If it ever comes up again that we want to consult this book, let me know.
My overall conclusion is that we should merge this PR with a comment explaining the trick and a TODO that notes it's actually one instruction cheaper to do the division directly on x86.
I am going to submit this (the test failure is unrelated) and create an issue to track whether there are better ways to do x / 255, so the discussion in the comments is not lost.
This actually hurt lerp on x86, because rounding shift right is more expensive than (x + 128) / 256. I'm changing the lowering on x86 anyway, but are there other targets that don't have a rounding_shift_right instruction?
Saves instructions on x86.

Before #6426:

```asm
vpaddw %ymm0, %ymm1, %ymm1
vpsrlw $8, %ymm1, %ymm2
vpaddw %ymm2, %ymm1, %ymm1
vpsrlw $8, %ymm1, %ymm1
```

After #6426:

```asm
vpsrlw $7, %ymm2, %ymm3
vpand %ymm0, %ymm3, %ymm3
vpsrlw $8, %ymm2, %ymm4
vpaddw %ymm2, %ymm4, %ymm2
vpaddw %ymm3, %ymm2, %ymm2
vpsrlw $7, %ymm2, %ymm3
vpand %ymm0, %ymm3, %ymm3
vpsrlw $8, %ymm2, %ymm2
vpaddw %ymm2, %ymm3, %ymm2
vpand %ymm1, %ymm2, %ymm2
```

This PR:

```asm
vpaddw %ymm0, %ymm3, %ymm3
vpmulhuw %ymm1, %ymm3, %ymm3
vpsrlw $7, %ymm3, %ymm3
```
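For reference, a scalar model (mine, not from the PR) of the division part of that vpmulhuw sequence; the leading vpaddw is the rounding add, and the 0x8081 multiplier is a reconstruction, since the listings don't show the constant registers — it's the standard magic number matching the 23-bit total shift (vpmulhuw gives the high 16 bits, vpsrlw $7 the remaining 7):

```cpp
#include <cstdint>
#include <cstdio>

// Multiply-high division: x / 255 == (x * 0x8081) >> 23 for all uint16 x.
static uint16_t div255(uint16_t x) {
    uint16_t hi = static_cast<uint16_t>(
        (static_cast<uint32_t>(x) * 0x8081u) >> 16);  // vpmulhuw
    return hi >> 7;                                   // vpsrlw $7
}

int main() {
    for (uint32_t x = 0; x <= 65535; x++) {
        if (div255(static_cast<uint16_t>(x)) != static_cast<uint16_t>(x / 255)) {
            printf("mismatch at x = %u\n", x);
            return 1;
        }
    }
    printf("OK: (x * 0x8081) >> 23 == x / 255 for all uint16 x\n");
    return 0;
}
```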
* Do target-specific lowering of lerp

  Saves instructions on x86 (same assembly comparison as in the previous comment).

* Target is a struct
Should be pretty much the same thing as before, but easier to read, and it may produce better code.