[Unity][BYOC] Add cuBLAS backend#14291
Conversation
|
Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.
Generated by tvm-bot |
| return arg_idx | ||
| extract_func = tvm.get_global_func("relax.contrib.extract_arg_idx") | ||
| arg_indices = extract_func(pattern_name, f) | ||
| return {k: int(v) for k, v in arg_indices.items()} |
|
Need to wait for rebase against |
|
@masahi please rebase as the other PR is merged |
|
any updates? |
|
waiting for the next rebase |
| auto C_data = static_cast<char*>(C->data) + C->byte_offset; | ||
|
|
||
| CHECK_CUBLAS_ERROR(cublasLtMatmul(hdl, op_desc, alpha, B_data, A_desc, A_data, B_desc, beta, | ||
| C_data, C_desc, C_data, C_desc, nullptr, nullptr, 0, nullptr)); |
There was a problem hiding this comment.
cublas API has a default workspace pool, it seems cublasLT always require workspace being explicit set, does passing nullptr here impact the performance? We may want to have default workspace allocated and stored as thread local in CublasThreadEntry
There was a problem hiding this comment.
yeah there are some performance knobs that might worth exploring, in terms of memory management and algorithm selection (see also https://docs.nvidia.com/cuda/cublas/#heuristics-cache). I haven't tested any of them, those are good items for future work if cuBLAS BYOC gets more traction.
2b5d190 to
75deeed
Compare
|
Just realized that the In particular, the GPU image is using an outdated one https://github.com/apache/tvm/blob/unity/ci/jenkins/unity_jenkinsfile.groovy#L34. That's why I'm still getting an build error after I've updated CUDA version on Shouldn't |
This PR adds support for cuBLAS offloading in Relax via BYOC. In particular, we are targeting cuBLASLt API, which has a limited but useful set of epilogue operations (bias / relu / gelu).
Compared to the CUTLASS BYOC, the introduction of cuBLAS BYOC is motivated by dynamic shape support - For dynamic shape, we cannot tune CUTLASS kernels, so we either end up choosing a kernel that works for any shape (
align1) at build time or developing some runtime heuristics. I realized that cuBLAS doesn't differentiate static / dynamic shape and already has tons of heuristics that are likely better than anything we can come up with. So I believe cuBLAS is a better default solution for dynamic shape.cc @vinx13 @yelite @mbaret