Add Adreno GPU target and topi supporting textures with dynamically allocated textures #11161
csullivan merged 9 commits into apache:main
Conversation
ddfa320 to fb29643
@csullivan Could you please take a look?
| * e.g. image2d[height=O, width=IHW]
| */
| kImage2DWeight,
| kTexture2DNHWC,
Note: we can now support arbitrary layouts with transform_layout, which I suggest we move to. It will require some rework of the TIR lowering. I don't suggest this block these schedules from being upstreamed now, but we should circle back on it soon.
Should we add an AR/TODO to the code?
I like that idea. Something like,
TODO(tvm-team): Uncouple use of storage scope and data layout by using the transform_layout schedule primitive to express the desired texture layout. This will require supporting Nd indices in BufferLoad and BufferStore in CodegenOpenCL, and ensuring Nd allocations for texture are correctly routed to the AllocateTexture packed function in the OpenCL DeviceAPI.
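To make the suggestion concrete, here is a minimal sketch of the transform_layout schedule primitive; the shapes, tensor names, and blocking lambda are illustrative assumptions, not this PR's code:

```python
import tvm
from tvm import te

# Toy compute whose buffer layout we transform; shapes are arbitrary examples.
A = te.placeholder((1, 32, 56, 56), name="data")
B = te.compute(A.shape, lambda n, c, h, w: A[n, c, h, w] * 2.0, name="scaled")
s = te.create_schedule(B.op)

# Express an NCHW4c-style blocking in the schedule instead of coupling the
# layout to the texture storage scope: channels are packed in blocks of 4.
s[B].transform_layout(lambda n, c, h, w: [n, c // 4, h, w, c % 4])
```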
| elif data_layout == "NHWC4c":
|     ic = data.shape[3] * data.shape[4]
| else:
|     # TODO(amalyshe) add proper error raising
| # specific language governing permissions and limitations
| # under the License.
| # pylint: disable=invalid-name,unused-variable,unused-argument,no-member
| """Conv2D alter op and legalize functions for x86"""
| from ..utils import get_const_tuple
|
| def getDiv(value, start):
snake_case to match the rest of the file
| ----------
| out: tuple of the (chunks, block, tail)
| """
| tail = trip_count % 4
| in_channel_tail: int
|     Tail in the latest chunk diffing original number of channels vs blocked one
|     If in_channel_tail != in_channel_block:
|         original_channels = in_channel_chunks * in_channel_block - in_channel_tail
nit: consider referring to this as padding_tail so that it's clear this isn't the remainder of a floordiv; anything to make this a little clearer upfront, as it took me a bit to understand under the current naming convention. Same comment for the filter API below.
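For reference, a hedged sketch of the blocking arithmetic as the docstring above describes it; the helper name is hypothetical, and note this padding convention differs from the plain floordiv remainder (`trip_count % 4`) used elsewhere:

```python
def split_to_chunks(channels, block=4):
    # Hypothetical helper: split `channels` into `chunks` blocks of size
    # `block`. `tail` is the padding needed to reach a full multiple, so that
    # original_channels == chunks * block - tail, matching the docstring above
    # (this is the reviewer's suggested "padding_tail", not channels % block).
    chunks = (channels + block - 1) // block
    tail = chunks * block - channels
    return chunks, block, tail

# Example: 30 channels -> (8, 4, 2), and indeed 8 * 4 - 2 == 30.
assert split_to_chunks(30) == (8, 4, 2)
```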
| def _reorder_data_nchw(*indices):
|     condition = []
|     condition.append(indices[1] == in_channel_chunks - 1)
|     condition.append(indices[4] >= in_channel_tail)
|     condition = tvm.tir.all(*condition)
|     return tvm.tir.if_then_else(
|         condition,
|         pad_value,
|         Input[indices[0], indices[1] * in_channel_block + indices[4], indices[2], indices[3]],
|     )
|
| def _reorder_data_nhwc(*indices):
|     condition = []
|     condition.append(indices[3] == in_channel_chunks - 1)
|     condition.append(indices[4] >= in_channel_tail)
|     condition = tvm.tir.all(*condition)
|     return tvm.tir.if_then_else(
|         condition,
|         pad_value,
|         Input[indices[0], indices[1], indices[2], indices[3] * in_channel_block + indices[4]],
|     )
Note: explicit buffer layout padding as part of transform_layout is on the roadmap and will appear in an RFC soon. Noting here that explicit layout transformations like this should become unnecessary in the future.
Added a comment and a reference to the RFC.
| in_height, in_width, kernel_h, kernel_w, dilation_h, dilation_w, padding, stride_h, stride_w
| ):
| """
| Expands spatial dimensions to be dividable by factor 4. This will allow us to do extrimely
Typos
| - Expands spatial dimensions to be dividable by factor 4. This will allow us to do extrimely
| + Expands spatial dimensions to be dividable by factor 4. This will allow us
Could you please point out where the typos are?
|     Height of the feature map
|
| in_width: int
|     Width of the featrue map
| - Width of the featrue map
| + Width of the feature map
| # certain limitation of the Qualcomm devices. Subject to be determined for certain device
| # individually, but until we have access to remote device during compilation, we have to
| # define it uniformly for all target devices
| limit = 16384
Let us use the Target attributes for this, and specifically use the attribute preprocessor as is done for cuda here. Add the image extent to the attribute list for the device API and use it when calling DetectDeviceFlag to query the size limits of the OpenCL image on the remote device.
I added a new texture_spatial_limit attribute to the opencl target and added it to DeviceAttrKind and runtime_ctypes in Python, but I'm not sure that was required, since I don't know how and when to use DetectDeviceFlag. I do have access to texture_spatial_limit in the Python part through tvm.target.Target.current().attrs["texture_spatial_limit"].
I would consider this "addressed", but I need to understand whether my solution is applicable and whether we need the parts related to DeviceAttrKind.
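A small sketch of the Python access path described above; the `-device=adreno` flag in the target string is an example, and the default comes from the target registration shown later in the diff:

```python
import tvm

# Read the texture_spatial_limit attribute registered on the opencl target.
with tvm.target.Target("opencl -device=adreno"):
    limit = tvm.target.Target.current().attrs["texture_spatial_limit"]
    print(limit)  # 16384 unless overridden on the target string
```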
Done
| pad_data, kernel = s[conv].op.input_tensors
|
| s[pad_data].compute_inline()
Do you mean to inline padding here? Your comment above implies that you intend to do otherwise.
It is inlined into the next stage, the cache read for textures:
AT = s.cache_read(pad_data, "global.texture", [conv])
bind_data_copy(s[AT])
If I do not add s[pad_data].compute_inline(), the schedule is not complete and complains about missing bindings.
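For context, a self-contained sketch of this pattern; the shapes and the conv op are assumed, and bind_data_copy is the PR's helper, stubbed out here:

```python
import tvm
from tvm import te, topi

data = te.placeholder((1, 32, 56, 56), name="data")
kernel = te.placeholder((32, 32, 3, 3), name="kernel")
conv = topi.nn.conv2d_nchw(data, kernel, stride=1, padding=1, dilation=1)
s = te.create_schedule(conv.op)

pad_data, kernel_in = s[conv].op.input_tensors
# Fold the pad stage into the stage that follows, so padding is computed
# on the fly during the copy into texture memory rather than materialized.
s[pad_data].compute_inline()
AT = s.cache_read(pad_data, "global.texture", [conv])
# bind_data_copy(s[AT])  # PR helper: binds thread axes for the copy stage
```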
| from tvm.contrib import graph_runtime
|
| def get_reference(mod, params1, input_shape, inputs):
Common utility shared in other test files, consider adding to the utils subdir.
Moved the shared functions into utils/adreno_utils.py.
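A hedged sketch of what this shared helper might look like, based only on its signature in the diff; the body and the "data" input name are assumptions:

```python
import tvm
from tvm import relay
from tvm.contrib import graph_runtime

def get_reference(mod, params1, input_shape, inputs):
    """Build the model for llvm and run it on the CPU to produce reference
    outputs to compare the OpenCL results against (input_shape unused here)."""
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target="llvm", params=params1)
    m = graph_runtime.GraphModule(lib["default"](tvm.cpu(0)))
    m.set_input("data", inputs[0])  # input name assumed
    m.run()
    return [m.get_output(0).asnumpy()]
```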
| # build module run with opencl and cpu, compare results
| def build_run_compare(
Moved the shared functions into utils/adreno_utils.py.
| @tvm.testing.requires_opencl
| def test_conv2d_yolov3_v2_nchw_3c():
Do these tests pass on a local opencl device (e.g. with an nvgpu)? If not, it would be good to skip the tests that require a remote device when the RPC tracker env vars are not set.
I have not verified with an nvidia gpu, but they pass successfully on intel integrated graphics with opencl enabled in the platform and in tvm. I need to verify whether the tests run in CI, but cannot do so due to issues with the GPU build in CI.
@csullivan Looked into the CI test results and got the impression that all opencl tests are disabled. It seems we need to enable them in CI, but in a separate PR.
That's accurate, and I agree we can consider enabling them in CI in a separate PR. If you see that these tests pass when running locally and without an RPC tracker, that is sufficient.
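A hedged sketch of the gating the reviewer suggests; TVM_TRACKER_HOST follows the usual TVM RPC convention, and the exact skip condition is an assumption:

```python
import os
import pytest
import tvm.testing

@tvm.testing.requires_opencl
@pytest.mark.skipif(
    "TVM_TRACKER_HOST" not in os.environ,
    reason="no RPC tracker configured; remote Adreno device unavailable",
)
def test_conv2d_yolov3_v2_nchw_3c():
    # build_run_compare(...) against the remote device would go here
    pass
```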
| .add_attr_option<Bool>("system-lib")
| .add_attr_option<Integer>("max_num_threads", Integer(256))
| .add_attr_option<Integer>("thread_warp_size", Integer(1))
| .add_attr_option<Integer>("texture_spatial_limit", Integer(16384))
Thanks for adding this. An improvement would be to query the remote device with a call to the device API's GetAttr via the target attr preprocessor.
I still do not fully understand the usage model. For now I left only the definition of texture_spatial_limit in the opencl target and the access from Python, because adding kTextureSpatialLimit to DeviceAttrKind caused a compilation failure for cuda. Since I do not fully understand the usage model, I don't know how to fix that properly: whether I need to extend cuda as well for this constant or just ignore it, and if I ignore it, where kTextureSpatialLimit should be used.
csullivan left a comment
LGTM with a few final nits
…llocated textures (apache#11161)

* Add Adreno GPU target and topi supporting textures
  - There are 5 compute/schedules: conv2d for NCHW/NHWC, depthwise_conv2d for NCHW/NHWC, average pooling
  - Fix of dynamically allocated textures caching
  - Add texture-nhwc scope
  - Fix issue with codegen of vars having non-acceptable symbols
* Address comments
* Add vectorization into some adreno pool flow
* Fix adreno tests for running on the opencl host platform
* Remove unnecessary kDriverVersion in DeviceAttrKind
* Move utils adreno functions to a separate shared file
* Fix black hits

Co-authored-by: Chris Sullivan <csullivan@octoml.ai>
Co-authored-by: Egor Churaev <egor.churaev@gmail.com>
Co-authored-by: Li <quic_lih@quicinc.com>