Skip to content

[x86 & HVX & WASM] Use bounds inference for saturating_narrow instruction selection#7805

Merged
rootjalex merged 17 commits intomainfrom
rootjalex/x86-sat
Apr 30, 2024
Merged

[x86 & HVX & WASM] Use bounds inference for saturating_narrow instruction selection#7805
rootjalex merged 17 commits intomainfrom
rootjalex/x86-sat

Conversation

@rootjalex
Copy link
Copy Markdown
Member

@rootjalex rootjalex commented Aug 24, 2023

There are cases where bounds inference allows us to safely reinterpret unsigned values as signed values, in order to target x86 saturating narrow instructions. I give an example below. I don't add tests to simd_op_check_x86 because we can't test for removing min instructions via the existing mechanism.

Example:

Var x("x");
ImageParam in0(UInt(8), 1);
Func out("out");
out(x) = u8_sat(u16(in0(x)) * 5);
out.vectorize(x, 32);
Target x86("x86-64-linux-avx-avx2-fma-sse41");
out.compile_to_assembly("x86-test0.asm", out.infer_arguments(), x86);

Before:

vpmovzxbw	(%rax,%rcx), %ymm2      # ymm2 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero,mem[8],zero,mem[9],zero,mem[10],zero,mem[11],zero,mem[12],zero,mem[13],zero,mem[14],zero,mem[15],zero
vpmovzxbw	16(%rax,%rcx), %ymm3    # ymm3 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero,mem[8],zero,mem[9],zero,mem[10],zero,mem[11],zero,mem[12],zero,mem[13],zero,mem[14],zero,mem[15],zero
vpmullw	%ymm0, %ymm3, %ymm3
vpmullw	%ymm0, %ymm2, %ymm2
vpminuw	%ymm1, %ymm2, %ymm2
vpminuw	%ymm1, %ymm3, %ymm3
vpackuswb	%ymm3, %ymm2, %ymm2
vpermq	$216, %ymm2, %ymm2              # ymm2 = ymm2[0,2,1,3]
vmovdqu	%ymm2, (%r13,%rcx)
addq	$32, %rcx
cmpq	%rcx, %rdx
jne	.LBB219_44

After (removed 2 vpminuw instructions):

vpmovzxbw	(%rax,%rcx), %ymm1      # ymm1 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero,mem[8],zero,mem[9],zero,mem[10],zero,mem[11],zero,mem[12],zero,mem[13],zero,mem[14],zero,mem[15],zero
vpmovzxbw	16(%rax,%rcx), %ymm2    # ymm2 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero,mem[8],zero,mem[9],zero,mem[10],zero,mem[11],zero,mem[12],zero,mem[13],zero,mem[14],zero,mem[15],zero
vpmullw	%ymm0, %ymm2, %ymm2
vpmullw	%ymm0, %ymm1, %ymm1
vpackuswb	%ymm2, %ymm1, %ymm1
vpermq	$216, %ymm1, %ymm1              # ymm1 = ymm1[0,2,1,3]
vmovdqu	%ymm1, (%r13,%rcx)
addq	$32, %rcx
cmpq	%rcx, %rdx
jne	.LBB219_44

@steven-johnson
Copy link
Copy Markdown
Contributor

I don't add tests to simd_op_check_x86

Understood, but is there any other reasonable way to add checks for this? This sort of stuff is often pretty fragile.

Comment thread src/Type.cpp
@steven-johnson steven-johnson self-requested a review August 24, 2023 22:12
Copy link
Copy Markdown
Contributor

@steven-johnson steven-johnson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM but I feel like if we don't have some way to test this, we'll eventually find it broken :-/

@rootjalex
Copy link
Copy Markdown
Member Author

I don't add tests to simd_op_check_x86

Understood, but is there any other reasonable way to add checks for this? This sort of stuff is often pretty fragile.

I thought about adding a "check_not" to simd_op_check, because that's the only way I could think to test this sort of optimization, but that seems like heavy functionality to use for just this case (I'm also not sure we can verify that this optimization runs by saying "don't give me a vpmin", because other parts of the generated code could contain a vpmin and we don't want false failures). I'm open to suggestions (especially because I intend to make a similar PR for HVX).

@rootjalex rootjalex changed the title [x86] Use bounds inference for saturating_narrow instruction selection [x86 & HVX] Use bounds inference for saturating_narrow instruction selection Aug 25, 2023
@rootjalex
Copy link
Copy Markdown
Member Author

We can do the same thing for HVX saturating narrow instructions, which I was going to do in a separate PR, but it requires the same changes to Bounds.cpp and Type.h that that the x86 version does, so I am making this one PR.

@rootjalex
Copy link
Copy Markdown
Member Author

@steven-johnson Luckily, the improvement to HVX can be tested (unlike x86). Unluckily, this revealed that constant bounds inference does not work on u32 like I expected (which is part of why @abadams opened #7807)

Comment thread src/CodeGen_X86.cpp Outdated
Comment thread src/HexagonOptimize.cpp Outdated
Comment thread src/HexagonOptimize.cpp Outdated
@steven-johnson
Copy link
Copy Markdown
Contributor

Is this PR still an active one? It seems pretty stale.

@steven-johnson
Copy link
Copy Markdown
Contributor

What's the story, needs more work?

@rootjalex
Copy link
Copy Markdown
Member Author

This has gotten stale, but I am trying to find time to update it. Andrew and I discussed some issues with using bounds inference in this way, but I think they are fixed by using the new infrastructure in #8179. I will aim to do this in the next two weeks, I've finally started to catch up on my insane backlog.

@abadams
Copy link
Copy Markdown
Member

abadams commented Apr 28, 2024

I think it should be fairly straight-forward to adapt it to use ConstantInterval, now that the ConstantInterval refactor is in. You could do it now and there wouldn't be much of a merge conflict, or you could wait for #8155

Here's how I use ConstantInterval to do something similar: https://github.com/halide/Halide/pull/8155/files#diff-6cd3cc01186f9ce4abaa1a98c9201bd440983ecc5620fbccae383df1116bdce6R704

All your uses are about something being safe to reinterpret as signed/unsigned, so I think you want a few instances of const bool safe_to_reinterpret = some_expr.type().with_code(...).can_represent(constant_integer_bounds(some_expr))

@rootjalex
Copy link
Copy Markdown
Member Author

Yeah, realized it wasn't a ton of work, and was bored during an ASPLOS workshop. This is good for review now! Also, the u32 HVX tests now work, because the new ConstantInterval stuff is stronger than find_constant_bounds on u32. Nice side effect.

Comment thread src/CodeGen_X86.cpp Outdated
Comment thread src/HexagonOptimize.cpp Outdated
Comment thread src/HexagonOptimize.cpp Outdated
@rootjalex rootjalex changed the title [x86 & HVX] Use bounds inference for saturating_narrow instruction selection [x86 & HVX & WASM] Use bounds inference for saturating_narrow instruction selection Apr 29, 2024
@rootjalex
Copy link
Copy Markdown
Member Author

@abadams Sorry to add a change after your approval, but I just saw a wasm talk and was reminded that we can do this for wasm too. It's basically a copy-paste from x86, but would be nice to have eyes just to be safe.
@steven-johnson just want to ping you about the wasm change, if you're curious to look

@abadams
Copy link
Copy Markdown
Member

abadams commented Apr 29, 2024

The big TODO for wasm now is exploiting constant_integer_bounds to hit the relaxed simd instructions in case where we know they don't have UB

@rootjalex
Copy link
Copy Markdown
Member Author

The big TODO for wasm now is exploiting constant_integer_bounds to hit the relaxed simd instructions in case where we know they don't have UB

Yep, I think #7312 is for tracking that. This is still a nice side benefit of constant bounds for wasm though

@rootjalex
Copy link
Copy Markdown
Member Author

I don't think relaxed SIMD has actually shipped in wasm yet, so that's still waiting on the wasm folks

@steven-johnson
Copy link
Copy Markdown
Contributor

I don't think relaxed SIMD has actually shipped in wasm yet, so that's still waiting on the wasm folks

https://webassembly.org/features/

TL;DR: shipped in Chrome, behind flags elsewhere

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants