[CUDA] Improve injective schedule to enable half2#8457
Merged
comaniac merged 4 commits into apache:main on Jul 14, 2021
Conversation
vinx13 reviewed on Jul 13, 2021
jcf94 approved these changes on Jul 13, 2021

jcf94 (Contributor) left a comment:
Thanks!
I'm wondering whether loop partitioning could solve this problem? @comaniac @junrushao1994
For example, with a dimension of 301, partition it into a loop of 300 and a single loop of 1, then vectorize within the 300 loop.
It looks like this could be implemented more easily in TensorIR's block scope.
Author (Contributor) replied:
Right. Both loop partitioning and input padding could resolve this issue as well.
vinx13 approved these changes on Jul 13, 2021
ylc pushed a commit to ylc/tvm that referenced this pull request on Sep 29, 2021:
* [CUDA] Improve injective schedule to enable half2
* lint
* fix
* trigger ci
zxy844288792 pushed a commit to zxy844288792/tvm that referenced this pull request on Mar 4, 2022:
* [CUDA] Improve injective schedule to enable half2
* lint
* fix
* trigger ci
Per discussion in https://discuss.tvm.apache.org/t/cuda-enable-half2-in-cuda-injective-schedule/10441, this PR improves the CUDA injective schedule to benefit more from `half2` when working on `float16`.

The background is that although the CUDA injective schedule does vectorize the innermost loop when working on `float16`, the vectorization may fail due to the if-conditions introduced by non-dividable workloads and block/thread sizes. Formally, vectorization requires `prod(output_shape) % block % thread % vector_width == 0`. To make sure vectorization is effective, this PR adjusts the block and thread sizes accordingly (see the code change for details).

On the other hand, when the output shape is awkward (e.g., contains prime factors), the selected block and thread sizes may be too small. For example, if the output shape is `(311, 3814)`, then the factors are `(1, 2, 311, 1907, 3814)`. As a result, we may select `(block, thread) = (2, 311)` even though the maximum is `(block, thread) = (256, 1024)`. In this case, we do not utilize the compute resources well even if `half2` is enabled. Ideally, we would pad the output so that the factors are always powers of two, but that is too complicated and may introduce other issues. Accordingly, another heuristic introduced by this PR is: when `(select_block * select_thread) / (max_block * max_thread) < R`, we do not apply the change and simply let the vectorization fail.

Here are the evaluation results with `R = 0.7`. For each platform, I display the worst, the best, and the average speedup of all workloads over the current upstream.
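The selection logic described above can be sketched as follows. This is a hypothetical Python illustration, not the actual TVM schedule code; the function names, the brute-force factor search, and the exact fallback behavior are assumptions based on the description, with `max_block = 256`, `max_thread = 1024`, `vector_width = 2` (for `half2`), and `R = 0.7`:

```python
def candidate_factors(n):
    """All divisors of n, ascending (simple O(n) sketch)."""
    return [f for f in range(1, n + 1) if n % f == 0]

def select_launch_config(num_elements, vector_width=2,
                         max_block=256, max_thread=1024, ratio=0.7):
    """Pick (block, thread) such that
    num_elements % (block * thread * vector_width) == 0.
    Return None (i.e., skip vectorization) when the best achievable
    utilization falls below `ratio`, per the heuristic above."""
    if num_elements % vector_width != 0:
        return None
    inner = num_elements // vector_width
    best = None
    for thread in candidate_factors(inner):
        if thread > max_thread:
            continue
        for block in candidate_factors(inner // thread):
            if block > max_block:
                continue
            if best is None or block * thread > best[0] * best[1]:
                best = (block, thread)
    # Heuristic: if the selected sizes use too little of the GPU,
    # skip the change and let vectorization fail instead.
    if best and best[0] * best[1] / (max_block * max_thread) >= ratio:
        return best
    return None

# The awkward shape from the description: (311, 3814) only admits
# small usable factors, so the heuristic skips vectorization.
print(select_launch_config(311 * 3814))      # None
# A power-of-two workload saturates the launch bounds and keeps half2.
print(select_launch_config(256 * 1024 * 2))  # (256, 1024)
```

For `311 * 3814` elements, the best dividable pair under the launch limits yields only `block * thread = 311`, far below `0.7 * 256 * 1024`, so the sketch falls back to the unvectorized path, matching the behavior the heuristic is designed to produce.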
cc @vinx13 @wpan11nv @Laurawly @masahi