[TOPI] Custom schedule for standalone transpose in cuda #8030
areusch merged 12 commits into apache:main
Conversation
altanh left a comment
Some minor nits about code organization and the test, but otherwise looks good and is a welcome addition. For other reviewers: offline testing of this transpose schedule (when applicable) showed about a 2x improvement over the baseline injective transpose schedule on BERT Large training workloads. I do wonder how we could select whether or not to opt in to this vs trying to tune the original injective schedule, but I think in most cases this new kernel will probably be faster.
```python
    Dispatches to an optimized schedule if the transpose is standalone (not fused).
    """
    warp_size = int(Target.current(allow_none=False).thread_warp_size)
    if (
```
is there a more principled way to do this? like maybe with an OpStrategy or something
As far as I can tell, there is not a better way to do this. There is a way to add implementations based on input sizes, but these are not on a per-target basis. If you know a better way, let me know.
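For context, here is a minimal sketch of what this dispatch could look like, assuming the predicate simply checks for a 2-D transpose with static, warp-sized dimensions; the exact condition in the PR may differ, and the fallback call is the existing generic injective schedule:

```python
from tvm import te, tir, topi
from tvm.target import Target

def schedule_transpose(outs):
    """Dispatches to an optimized schedule if the transpose is standalone (not fused)."""
    out = outs[0] if isinstance(outs, list) else outs
    warp_size = int(Target.current(allow_none=False).thread_warp_size)
    if (
        isinstance(out.op, te.ComputeOp)
        and len(out.shape) == 2
        # static shapes only (no Any), each dim at least one warp wide;
        # otherwise the tiled kernel is not applicable or not worthwhile
        and all(isinstance(d, tir.IntImm) and int(d) >= warp_size for d in out.shape)
    ):
        s = te.create_schedule([out.op])
        return schedule_cuda_transpose(s, out)  # the tiled schedule (sketched below)
    return topi.cuda.schedule_injective(outs)  # fall back to the generic schedule
```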
```diff
-def schedule_cuda_transpose(s, out):
+def schedule_transpose(outs):
```
feels a bit weird to have this in sparse.py
moved to transform.py
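For reviewers following along, here is a rough sketch of the classic shared-memory tiled transpose that a schedule like this typically implements. The tile size, variable names, and binding order are illustrative; the merged code in transform.py may differ, for instance by padding shared memory to avoid bank conflicts:

```python
from tvm import te

def schedule_cuda_transpose(s, out, tile=32):
    # stage the input tile through shared memory so that both the global
    # read and the transposed global write are coalesced
    inp = s[out].op.input_tensors[0]
    shared = s.cache_read(inp, "shared", [out])

    # tile the 2-D output and bind tiles to blocks, elements to threads
    y, x = s[out].op.axis
    by, ty = s[out].split(y, factor=tile)
    bx, tx = s[out].split(x, factor=tile)
    s[out].reorder(by, bx, ty, tx)
    s[out].bind(by, te.thread_axis("blockIdx.y"))
    s[out].bind(bx, te.thread_axis("blockIdx.x"))
    s[out].bind(ty, te.thread_axis("threadIdx.y"))
    s[out].bind(tx, te.thread_axis("threadIdx.x"))

    # load the block's tile into shared memory with the same thread layout
    s[shared].compute_at(s[out], bx)
    sy, sx = s[shared].op.axis
    _, sty = s[shared].split(sy, factor=tile)
    _, stx = s[shared].split(sx, factor=tile)
    s[shared].bind(sty, te.thread_axis("threadIdx.y"))
    s[shared].bind(stx, te.thread_axis("threadIdx.x"))
    return s
```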
```python
r = np.random.rand(*shape)
tvm.testing.assert_allclose(ex.evaluate()(r).asnumpy(), np.transpose(r))
```
```python
# make sure schedule does not fire here
```
is this a TODO? Also I wonder if it would be good to parametrize the test shape by warp size (rather than hard coding) for future proofing
It is more like a wish. Ideally we would be able to know which schedules were used, but there is no way to introspect on which one was used. I've updated the comment to reflect this.
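For what it's worth, a sketch of what parametrizing by warp size could look like; the executor-style setup mirrors the existing test, and the 4x multiplier is an arbitrary choice to keep the shape above the dispatch threshold:

```python
import numpy as np
import tvm
import tvm.testing
from tvm import relay

def test_transpose():
    target = tvm.target.Target("cuda")
    # derive the shape from the target's warp size instead of hard-coding it
    warp_size = int(target.thread_warp_size)
    shape = (4 * warp_size, 4 * warp_size)
    x = relay.var("x", relay.TensorType(shape, "float64"))
    f = relay.Function([x], relay.transpose(x))
    ex = relay.create_executor(
        kind="graph", mod=tvm.IRModule.from_expr(f), device=tvm.cuda(0), target=target
    )
    r = np.random.rand(*shape)
    tvm.testing.assert_allclose(ex.evaluate()(r).asnumpy(), np.transpose(r))
```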
```diff
 # filter out non-packed conv2d task
-tasks = list(filter(lambda t: len(t.args[0][1]) > 4, tasks))
+tasks = list(filter(lambda t: len(t.args[0][1]) > 4 and "conv" in t.name, tasks))
```
what happened here, did this transpose change introduce a new task or something?
Actually, no, but this check makes sure anyway.
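To spell out what the tightened filter matches, assuming the usual AutoTVM task layout where `t.args[0]` is a `("TENSOR", shape, dtype)` tuple for the first input:

```python
def is_packed_conv2d(task):
    # a packed layout (e.g. NCHW4c) has more than 4 input dimensions, and
    # the name check keeps any future non-conv task (like transpose) out
    input_shape = task.args[0][1]
    return len(input_shape) > 4 and "conv" in task.name

tasks = list(filter(is_packed_conv2d, tasks))
```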
Isn't the newly added schedule not tunable? Or is there any concern about adding knobs?
We may want to tune it in the future.
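If tuning is added later, it could look something like this hypothetical sketch. None of this is in the PR; `schedule_cuda_transpose` refers to the tiled-schedule sketch above, and `"transpose.cuda"` is an illustrative task name:

```python
from tvm import autotvm, te

@autotvm.register_topi_schedule("transpose.cuda")
def schedule_transpose_tunable(cfg, outs):
    # expose the shared-memory tile width as a tunable knob
    cfg.define_knob("tile_size", [16, 32, 64])
    s = te.create_schedule([outs[0].op])
    schedule_cuda_transpose(s, outs[0], tile=cfg["tile_size"].val)
    return s
```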
altanh left a comment
LGTM. The only thing I'm wondering about: if someone (for whatever reason) really wanted to tune the default injective schedule for transpose, is there any way to allow that?
cc @comaniac for additional review (feel free to tag more relevant reviewers)
There's no reason to tune the injective schedule, and you basically cannot do it, because the injective schedule doesn't have AutoTVM knobs for tuning.
```python
# We want to make sure schedule does not fire here, but there is no way of
# inspecting which schedules were used.
```
Like this comment mentions, there is no way of inspecting which schedules were used, so it seems to me that the difference between this test and test_transpose is that the workload in this test includes an add to exercise fusion. Accordingly, could we just extend test_transpose?
We could, but I'd like to keep it separate so the intention is clear.
Fair enough. Then it might be better to name it test_transpose_fuse or something like that (nit).
switched to test_transpose_unfused_schedule
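A sketch of the renamed test, assuming the fused workload adds a second input after the transpose so the two ops fuse and the custom schedule must not fire (again, there is no direct way to assert which schedule ran; shape and dtype are illustrative):

```python
import numpy as np
import tvm
import tvm.testing
from tvm import relay

def test_transpose_unfused_schedule():
    shape = (100, 100)  # square so the fused add type-checks
    x = relay.var("x", relay.TensorType(shape, "float32"))
    y = relay.var("y", relay.TensorType(shape, "float32"))
    # the transpose fuses into the following add, so the custom schedule
    # should not fire here
    f = relay.Function([x, y], relay.transpose(x) + y)
    r_x = np.random.rand(*shape).astype("float32")
    r_y = np.random.rand(*shape).astype("float32")
    ex = relay.create_executor(
        kind="graph", mod=tvm.IRModule.from_expr(f), device=tvm.cuda(0), target="cuda"
    )
    tvm.testing.assert_allclose(
        ex.evaluate()(r_x, r_y).asnumpy(), np.transpose(r_x) + r_y
    )
```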
comaniac left a comment
I don't have other comments. LGTM.
* [TOPI] Custom schedule for standalone transpose in cuda
* check if input is not Any
* fix vta test
* check input shape
* fix injective
* move transpose out of sparse.py
* update comments, use warp size
* missspelled transform
* formatting
* rename test
* comment
* fix tests
This PR adds an optimized CUDA schedule for transpose when the transpose is not fused into anything else.
@altanh @junrushao1994
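For anyone wanting to try it, a minimal end-to-end example of the targeted case: a standalone (unfused) transpose compiled for CUDA should pick up the new schedule automatically. The shape and variable names here are arbitrary:

```python
import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_executor

x = relay.var("x", shape=(1024, 1024), dtype="float32")
mod = tvm.IRModule.from_expr(relay.Function([x], relay.transpose(x)))
lib = relay.build(mod, target="cuda")  # standalone transpose dispatches to the new schedule

dev = tvm.cuda(0)
rt = graph_executor.GraphModule(lib["default"](dev))
data = np.random.rand(1024, 1024).astype("float32")
rt.set_input("x", data)
rt.run()
np.testing.assert_allclose(rt.get_output(0).asnumpy(), data.T)
```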