[ARM][Performance]Improve ARM CPU depthwise convolution performance#2028
Closed
FrozenGene wants to merge 530 commits into apache:master from
GCC issues warnings with -Wextra if we don't explicitly initialize the base class in copy constructors. This commit fixes the issue.
* [RELAY][PASS] FoldScaleAxis Forward * Introduce helper function type_as * Update per review comment * Fix according to comments
The CI failure in test_topi_depthwise_conv2d.py occurs because I have modified the schedule, which doesn't have
* Add Eddie to committer * Fix order
* Add MXNet test example for relay * Fix a bug in BiasAddSimplifier
The dtype of the output of pad should follow the input, but if the input dtype is not float, the output will still be float because pad_value is float.
…and ssd ops (apache#2322) * add ssd ops to mxnet.py * add ssd ops to mxnet.py * add result check for multibox and nms unit tests * add result check for multibox and nms unit tests * address @kevinthesun's comments * Disable cuda test for nms for now.
The dtype of count is the same as the dtype of inputs[0] when created, but its type may change when it is multiplied by inputs[0]->shape[i], which causes the output dtype to differ from the input dtype.
* Add cast op * Rename dtype_cast to cast * Add additional safety check for String2TVMType * Add missing relay op docs
d95a24c to aa73419
aa73419 to bfc259b
I'm very sorry that I committed the merge code previously. I hope this doesn't disrupt your work. I have opened a new PR, #2345, to continue this work, and I am keeping this PR as a reference in case people are interested in the background. Sorry again for my mistake.
This PR re-implements the ARM CPU depthwise convolution by leveraging the existing spatial pack schedule and adding a tunable compute_at knob.
On my A53@2.0GHz ARM CPU (MTK6763), this gives a 1.6X performance boost over the previous depthwise convolution in the MobileNet V1 model (I have also checked the correctness of this schedule).
The following is the AutoTVM tuning log (in GFLOPS) for the TensorFlow MobileNet V1 model:
Currently:
[Task 2/20] Current/Best: 0.98/ 2.32 GFLOPS | Progress: (1427/2000) | 2679.82 s Done.
[Task 4/20] Current/Best: 0.56/ 1.15 GFLOPS | Progress: (1072/2000) | 2461.27 s Done.
[Task 6/20] Current/Best: 1.08/ 2.78 GFLOPS | Progress: (1084/2000) | 1987.91 s Done.
[Task 8/20] Current/Best: 0.39/ 1.19 GFLOPS | Progress: (1815/2000) | 2744.70 s Done.
[Task 10/20] Current/Best: 1.09/ 2.33 GFLOPS | Progress: (1222/2000) | 1866.02 s Done.
[Task 12/20] Current/Best: 0.42/ 0.90 GFLOPS | Progress: (1716/2000) | 2528.94 s Done.
[Task 14/20] Current/Best: 1.89/ 2.63 GFLOPS | Progress: (1284/2000) | 2288.55 s Done.
[Task 16/20] Current/Best: 0.47/ 0.96 GFLOPS | Progress: (1467/2000) | 2282.65 s Done.
[Task 18/20] Current/Best: 1.43/ 2.61 GFLOPS | Progress: (1007/2000) | 1525.76 s Done.
After this PR's optimization:
[Task 2/20] Current/Best: 0.00/ 4.83 GFLOPS | Progress: (1682/2000) | 1470.40 s Done.
[Task 4/20] Current/Best: 1.35/ 3.17 GFLOPS | Progress: (1257/2000) | 1032.80 s Done.
[Task 6/20] Current/Best: 2.04/ 5.49 GFLOPS | Progress: (1904/2000) | 1623.10 s Done.
[Task 8/20] Current/Best: 0.75/ 3.15 GFLOPS | Progress: (1885/2000) | 1546.22 s Done.
[Task 10/20] Current/Best: 2.09/ 6.07 GFLOPS | Progress: (2000/2000) | 1640.41 s Done.
[Task 12/20] Current/Best: 2.99/ 3.80 GFLOPS | Progress: (1853/2000) | 1547.13 s Done.
[Task 14/20] Current/Best: 4.59/ 6.06 GFLOPS | Progress: (1355/2000) | 1091.93 s Done.
[Task 16/20] Current/Best: 1.96/ 4.01 GFLOPS | Progress: (2000/2000) | 1586.18 s Done.
[Task 18/20] Current/Best: 2.33/ 4.63 GFLOPS | Progress: (2000/2000) | 1599.89 s Done.
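To make the two logs easier to compare, here is a minimal sketch that extracts the best GFLOPS per task from each log and prints the per-task ratio. The helper names (`best_gflops`, `speedups`) and the regular expression are illustrative assumptions, not part of TVM; the log text is copied from the two listings above. These are per-operator tuning figures, so the ratios need not match the 1.6X figure quoted earlier.

```python
import re

# Log lines copied from the two listings above.
BEFORE = """
[Task 2/20] Current/Best: 0.98/ 2.32 GFLOPS | Progress: (1427/2000) | 2679.82 s Done.
[Task 4/20] Current/Best: 0.56/ 1.15 GFLOPS | Progress: (1072/2000) | 2461.27 s Done.
[Task 6/20] Current/Best: 1.08/ 2.78 GFLOPS | Progress: (1084/2000) | 1987.91 s Done.
[Task 8/20] Current/Best: 0.39/ 1.19 GFLOPS | Progress: (1815/2000) | 2744.70 s Done.
[Task 10/20] Current/Best: 1.09/ 2.33 GFLOPS | Progress: (1222/2000) | 1866.02 s Done.
[Task 12/20] Current/Best: 0.42/ 0.90 GFLOPS | Progress: (1716/2000) | 2528.94 s Done.
[Task 14/20] Current/Best: 1.89/ 2.63 GFLOPS | Progress: (1284/2000) | 2288.55 s Done.
[Task 16/20] Current/Best: 0.47/ 0.96 GFLOPS | Progress: (1467/2000) | 2282.65 s Done.
[Task 18/20] Current/Best: 1.43/ 2.61 GFLOPS | Progress: (1007/2000) | 1525.76 s Done.
"""

AFTER = """
[Task 2/20] Current/Best: 0.00/ 4.83 GFLOPS | Progress: (1682/2000) | 1470.40 s Done.
[Task 4/20] Current/Best: 1.35/ 3.17 GFLOPS | Progress: (1257/2000) | 1032.80 s Done.
[Task 6/20] Current/Best: 2.04/ 5.49 GFLOPS | Progress: (1904/2000) | 1623.10 s Done.
[Task 8/20] Current/Best: 0.75/ 3.15 GFLOPS | Progress: (1885/2000) | 1546.22 s Done.
[Task 10/20] Current/Best: 2.09/ 6.07 GFLOPS | Progress: (2000/2000) | 1640.41 s Done.
[Task 12/20] Current/Best: 2.99/ 3.80 GFLOPS | Progress: (1853/2000) | 1547.13 s Done.
[Task 14/20] Current/Best: 4.59/ 6.06 GFLOPS | Progress: (1355/2000) | 1091.93 s Done.
[Task 16/20] Current/Best: 1.96/ 4.01 GFLOPS | Progress: (2000/2000) | 1586.18 s Done.
[Task 18/20] Current/Best: 2.33/ 4.63 GFLOPS | Progress: (2000/2000) | 1599.89 s Done.
"""

# Matches "[Task N/M] Current/Best: x/ y GFLOPS" and captures N and y.
LINE_RE = re.compile(r"\[Task\s+(\d+)/\d+\]\s+Current/Best:\s*[\d.]+/\s*([\d.]+)\s+GFLOPS")


def best_gflops(log):
    """Return {task_id: best GFLOPS} for every task line found in the log."""
    return {int(task): float(best) for task, best in LINE_RE.findall(log)}


def speedups(before_log, after_log):
    """Return {task_id: after_best / before_best} for tasks present in both logs."""
    before, after = best_gflops(before_log), best_gflops(after_log)
    return {task: after[task] / before[task] for task in sorted(before) if task in after}


if __name__ == "__main__":
    for task, ratio in speedups(BEFORE, AFTER).items():
        print(f"Task {task:2d}: {ratio:.2f}x")
```

The ratio is computed from the "Best" column only, since the "Current" value reflects whichever configuration happened to be measured last.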
The total depthwise convolution execution time on a single A53@2.0GHz drops from 45.3839 ms to 28.1945 ms.
One thing you must note when using this schedule: you MUST set the XGBTuner constructor's feature type argument to feature_type='knob', i.e. XGBTuner(tsk, loss_type='rank', feature_type='knob'). Otherwise your program may hang forever.
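The tuner requirement above can be captured in a small helper so that the knob feature type is never forgotten. This is only a sketch: `make_tuner` and `XGB_TUNER_KWARGS` are hypothetical names introduced here, and it assumes `XGBTuner` is importable from `tvm.autotvm.tuner` in your TVM version.

```python
# Keyword arguments the PR description says are required for this schedule.
XGB_TUNER_KWARGS = {"loss_type": "rank", "feature_type": "knob"}


def make_tuner(task):
    """Build an XGBTuner for one AutoTVM task using knob features.

    `make_tuner` is a hypothetical helper name, not part of TVM.
    """
    # Imported lazily so this module can be loaded without TVM installed.
    from tvm.autotvm.tuner import XGBTuner

    return XGBTuner(task, **XGB_TUNER_KWARGS)
```

In a tuning loop this would be used as `tuner = make_tuner(tsk)` in place of constructing `XGBTuner` directly.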
@merrymercy @tqchen Please review.