[BugFix] SSD fully supported on GPUs, updated deploy_ssd tutorial #2510
masahi merged 12 commits into apache:master
Conversation
    #ctx = tvm.gpu(0)
    # Use these commented settings to build for opencl.
    #target = 'opencl'
    #ctx = tvm.gpu(0)
If I remember correctly, for opencl it should be tvm.opencl(0) or tvm.cl(0), shouldn't it?
Yes, sorry, I forgot to change it.
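For reference, the fix discussed above amounts to pairing each target string with the matching TVM context. This is a config sketch, not the tutorial's exact code, and assumes the TVM Python API of this era (where tvm.cl is an alias of tvm.opencl):

```python
# Use these settings to build for opencl:
target = 'opencl'
ctx = tvm.cl(0)      # OpenCL device 0; tvm.gpu(0) would be the CUDA context

# Use these settings to build for cuda:
# target = 'cuda'
# ctx = tvm.gpu(0)   # CUDA device 0
```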
|
I agree with putting sort in a common file. And we can add a unit test for it as well.
    with ib.for_range(0, batch, for_type="unroll") as b:
        start = b * num_anchors
        with ib.if_scope(tid < num_anchors):
            p_out[start + tid] = tid
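The diff above fills p_out with identity indices for every batch row before sorting. A plain-Python sketch of the same indexing (the function name init_indices is mine, not the PR's):

```python
def init_indices(batch, num_anchors):
    # Mirrors p_out[b * num_anchors + tid] = tid from the IR above:
    # each batch row starts as the identity permutation 0..num_anchors-1.
    p_out = [0] * (batch * num_anchors)
    for b in range(batch):
        start = b * num_anchors
        for tid in range(num_anchors):
            p_out[start + tid] = tid
    return p_out
```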
Seems storage_sync is missing here; I will update my PR.
@vinx13 Would you like to separate argsort into its own file so that we can share it? I can add a unit test for it if needed.
@Laurawly What's needed in ssd? Seems that you changed num_bbox in my pr to p_index[0], why only first element in p_index is used?
Maybe we can make argsort a normal topi op? I'll add cpu implementation later.
@vinx13 p_index is the valid_count variable, a 1D array resulting from the multibox operators. So instead of sorting all data.shape[1] numbers, we only need to sort the first p_index[0] numbers.
@Laurawly shouldn't it be p_index[batch_id]? Are you assuming batch = 1?
@vinx13 p_index only has one dimension. So it should be p_index[0].
@kevinthesun @Laurawly The difficulty of sharing argsort (or extracting it as a topi operator) is that we hope sort_num can be either a tvm.Tensor or a constant array, but we can't use a tvm.Expr to subscript a Python array. Do you have ideas?
    with ib.else_scope():
        start = sizes[tid-1]
        p_out[base_idx + k * axis_mul_after] = tvm.if_then_else(
            k < p_index[tid], index_new[k+start], k)
@Laurawly still confused, if batch > 1, it should enter this if branch (since axis_mul_before * axis_mul_after > 1). Does p_index[tid] here mean that each batch has a different valid count?
@vinx13 From here https://github.com/dmlc/tvm/blob/master/topi/python/topi/cuda/nms.py#L368 axis is always 1, so axis_mul_before and axis_mul_after are both 1.
@Laurawly since ndim == 2 and axis == 1, the actual loop is like:

    for i in range(0, 2):
        if i < 1:
            axis_mul_before *= data.shape[i]

I assume that axis_mul_after == 1 and axis_mul_before == data.shape[0], which is the batch size, right?
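The loop can be written out fully in plain Python (the helper name axis_muls is mine, not the PR's) to check that claim:

```python
def axis_muls(shape, axis):
    """Product of the dims before and after `axis`, as in the sort IR."""
    axis_mul_before = 1
    axis_mul_after = 1
    for i, dim in enumerate(shape):
        if i < axis:
            axis_mul_before *= dim
        elif i > axis:
            axis_mul_after *= dim
    return axis_mul_before, axis_mul_after
```

For ndim == 2 and axis == 1 this returns (shape[0], 1), i.e. axis_mul_before is indeed the batch size.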
@vinx13 Yeah, that's right. I see what you mean. So each batch could have a different valid count when batch_size > 1. I shouldn't have assumed batch_size = 1. I just pushed the changes.
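The per-batch fix can be sketched in plain Python (not the actual TVM IR; the names argsort_valid, scores, and valid_count are mine): each batch row argsorts only its first valid_count[b] scores, and the remaining slots keep identity indices. Descending order is my assumption here, since NMS wants the highest scores first.

```python
def argsort_valid(scores, valid_count):
    """scores: per-batch lists of scores; valid_count: per-batch counts."""
    out = []
    for b, row in enumerate(scores):
        n = valid_count[b]                  # valid entries for this batch row
        order = sorted(range(n), key=lambda i: row[i], reverse=True)
        out.append(order + list(range(n, len(row))))   # identity padding
    return out
```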
|
@Laurawly Btw, have you checked for data races in the nms IR? Seems __syncthreads and a global barrier (maybe we should rewrite the IR to avoid the global barrier) are needed on CUDA. I sometimes get incorrect nms results in my PR.
|
@vinx13 Does the conflict happen in argsort_ir?
|
@Laurawly the conflict happens in nms_ir, I replaced |
|
@vinx13 I don't see conflicts in my nms_ir using |
|
@Laurawly If the data written by other threads is needed (probably this line |
|
@vinx13 There's no data conflict for |
|
@Laurawly the writing |
|
@vinx13 No, because there's a condition that |
|
@Laurawly I see, thanks for your clarification |
|
thanks @Laurawly @vinx13 @kevinthesun @zhiics this is merged. |
…ache#2510)
* nms fixed for gpu, tested on cuda and opencl devices, ssd now can run fully on the gpu
* sort updated to use virtual thread
* typo fixed
* fix lint
* fix lint
* add support when batch_size > 1
* intel graphics conv2d bugs fixed for inception_v3
* intel conv2d api updated, nn input size 4 condition added
* review addressed
* move conv_tags to attributes
* opencl ctx fixed
* nms_ir index simplified
Thanks to @vinx13's PR #2420, argsort now works on GPUs.
Tested the full SSD pipeline on NVIDIA K80c and Intel HD Graphics. Performance improved compared with the heterogeneous (CPU + GPU) results.
Please review @masahi @kevinthesun @zhiics