[VM] Per-input, data dependence specification for shape func #7210
Merged

icemelon merged 13 commits into apache:main on Jan 15, 2021
Conversation
Contributor

I like this and I think it makes a lot of sense, but I'll defer mainly to @zhiics and @icemelon9, since they implemented the infrastructure for heterogeneous shape functions.
Member (Author)

cc @icemelon9 @zhiics Any thoughts on this issue? I think this is important to discuss.
zhiics reviewed Jan 14, 2021

Member

zhiics left a comment:
I am okay with the change but I am not sure if there is a better solution. @icemelon9 can you take a look?
force-pushed from 1ffa75a to f658556
icemelon reviewed Jan 14, 2021
force-pushed from f658556 to 11bd3f5
force-pushed from 11bd3f5 to 6c1b318
Member (Author)

@icemelon9 it should be ready
Member

Thanks @masahi @zhiics @mbrookhart
masahi added a commit to masahi/tvm that referenced this pull request Jan 18, 2021

[VM] Per-input, data dependence specification for shape func (#7210)

* made TShapeDataDependant array
* add stub
* dyn strided slice working
* reshape also working
* remove log
* works on maskrcnn
* lint fix
* fix cpp test
* remove stale pop back
* add more doc
* dependant -> dependent
* remove redundant check
* remove data_dependent_
TusharKanekiDey pushed a commit to TusharKanekiDey/tvm that referenced this pull request Jan 20, 2021
trevor-m pushed a commit to neo-ai/tvm that referenced this pull request Jan 21, 2021
electriclilies pushed a commit to electriclilies/tvm that referenced this pull request Feb 18, 2021
Motivation
Currently, shape functions are executed on the CPU even when the model runs on the GPU. Each shape function is declared with a data_dependent flag, specifying whether the shape function needs the input tensors themselves or only their shapes to compute the output shape: tvm/python/tvm/relay/op/op.py, lines 359 to 368 in 9e766d9.
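For context, the referenced lines define the registration helper, roughly as follows (a paraphrase, not a verbatim copy of op.py; get and register are helpers defined in the same module, and the codebase spelled the stored attribute "TShapeDataDependant" at the time):

```python
def register_shape_func(op_name, data_dependent, shape_func=None, level=10):
    """Register a shape function for an op (paraphrased sketch).

    data_dependent : bool
        True if the shape function needs the values of the input tensors,
        False if it needs only their shapes.
    """
    # The flag is stored as an op attribute; the VM consults it when deciding
    # whether to copy inputs to the host before running the shape function.
    get(op_name).set_attr("TShapeDataDependant", data_dependent, level)
    return register(op_name, "FShapeFunc", shape_func, level)
```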
When an op's data_dependent flag is True, the VM does a device-to-host memcpy of the entire input tensors before running that op's shape func on the CPU. In particular, since dyn.strided_slice has data_dependent set to True, the VM copies the to-be-sliced tensor from device to host on every dyn.strided_slice invocation, which can be very expensive if the tensor is big: tvm/python/tvm/relay/op/dyn/_transform.py, line 195 in 9e766d9.
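That line registers the shape function with a single op-wide flag. A hedged paraphrase of the pre-PR registration (_strided_slice_shape_func is the hybrid-script shape function defined just above it in _transform.py; the exact call site may differ slightly):

```python
from tvm.relay.op import op as _reg

# Pre-PR registration (paraphrased): a single op-wide flag. True means the VM
# copies *all* inputs of dyn.strided_slice to the host before running the
# shape func, including the large `data` tensor, even though only the values
# of begin/end/strides are actually needed.
_reg.register_shape_func("dyn.strided_slice", True, _strided_slice_shape_func)
```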
In fact, one of the bottlenecks in running PyTorch MaskRCNN on the GPU is this repeated device-to-host memcpy, as the profiler output attached to this PR showed; most of it comes from the dyn.strided_slice shape func. CUDA memcpy HtoD is also very slow, but that one is necessary to send large parameters to the GPU once.

But if we think about it, these expensive copies are completely useless: the shape func of dyn.strided_slice needs only the shape of data. It is only because the other arguments begin, end, and strides require their tensor values that dyn.strided_slice has to declare its data_dependent flag as True. This issue can be resolved if we let each op declare its data dependence per input.

Proposed solution
Luckily, the decision to insert a device-to-host memcpy before a shape func is already made separately for each input, as shown below: tvm/python/tvm/relay/transform/memory_alloc.py, lines 207 to 221 in 4c4888b.
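Below is a self-contained sketch (not the actual ManifestAllocPass code) of the per-input decision it makes: for each shape-func input, either pass only its shape (no copy) or copy the whole tensor to the CPU (a device-to-host memcpy). The names InputState and prepare_shape_func_inputs are hypothetical:

```python
from enum import IntEnum

class InputState(IntEnum):
    SHAPE_ONLY = 0   # shape func needs only this input's shape
    FULL_TENSOR = 1  # shape func needs this input's values

def prepare_shape_func_inputs(args, states):
    """Decide, per input, what is handed to the CPU shape function.

    args   : list of (name, shape) pairs standing in for device tensors
    states : one InputState per argument
    """
    shape_func_ins = []
    for (name, shape), state in zip(args, states):
        if state == InputState.SHAPE_ONLY:
            # Only the shape is materialized on the CPU; the tensor itself
            # stays on the device, so no expensive copy happens.
            shape_func_ins.append(("shape_of(%s)" % name, shape))
        else:
            # The tensor's values are required, so the whole tensor would be
            # copied device-to-host here.
            shape_func_ins.append(("device_copy(%s)" % name, shape))
    return shape_func_ins

# Example: dyn.strided_slice after this PR. Only begin/end/strides need their
# values; the potentially huge data tensor contributes only its shape.
print(prepare_shape_func_inputs(
    [("data", (1024, 1024)), ("begin", (2,)), ("end", (2,)), ("strides", (2,))],
    [InputState.SHAPE_ONLY, InputState.FULL_TENSOR,
     InputState.FULL_TENSOR, InputState.FULL_TENSOR]))
```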
So all we need to do is provide a way to specify per-input data dependence for each op and send this information to ManifestAllocPass above. This PR enables such a specification, as sketched below.
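A hedged sketch of the per-input registration after this PR (one of the commits is "made TShapeDataDependant array", so the flag becomes a list with one boolean per input; the exact call site in _transform.py may differ slightly):

```python
from tvm.relay.op import op as _reg

# Post-PR registration (sketch): data_dependent is now a list with one flag
# per input of dyn.strided_slice. `data` is False, so only its shape is passed
# to the shape func and the potentially huge tensor is never copied to the
# host; begin/end/strides stay True because the shape func needs their values.
# _strided_slice_shape_func is the same shape function as before.
_reg.register_shape_func(
    "dyn.strided_slice",
    [False, True, True, True],  # data, begin, end, strides
    _strided_slice_shape_func,
)
```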
I've also updated compile_engine.cc accordingly to send the per-input data dependence information to ManifestAllocPass. The change I made is the minimum necessary to achieve my goal, so it can be improved. With this PR, I was able to remove all of the expensive device-to-host memcpys, which cuts GPU MaskRCNN runtime by 14 milliseconds. More importantly, the purpose of this PR is to make people aware of this problem and to decide on the best solution.

please review @icemelon9 @zhiics @kevinthesun @mbrookhart