[Unity][BYOC] Faster cutlass codegen#14465
Conversation
Improve cutlass compilation time by compiling a single CSourceModule instead of one for each kernel.
Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.
Generated by tvm-bot
The original intention was to compile all generated files in parallel (via NVCC …)
Could you elaborate?

I did parallelize this loop to compile the generated C source modules in parallel, but that wasn't faster than compiling a single file. The difference between the two was not huge, but compiling a single source module was a bit faster (~50 seconds for the single source module vs ~70 seconds for multiple C source modules compiled in parallel).
tvm/python/tvm/contrib/cutlass/build.py, line 75 at 5562d90

I don't expect NVCC would use multiple threads to compile a huge single source, but the numbers you described indeed sound good.
Actually, since …
auto [f_code, op_headers] = GenCutlassFunc(f, options);
code += "\n" + f_code;
for (const auto& header : op_headers) {
  headers += "#include <" + header + ">\n";
}
Here we might be adding duplicated headers. It probably won't matter for compilation speed but the generated file might get ugly.
Yes; however, since this is a generated file, I felt it is OK to have duplicate entries in the headers. We can improve on this in follow-up PRs though.
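For such a follow-up, header deduplication is a small change. Here is a minimal Python sketch of collecting the includes from every generated kernel and emitting each one once; the function name and data layout are illustrative, not TVM's actual codegen API:

```python
def emit_headers(per_kernel_headers):
    """Emit each #include line once for a single generated source file.

    `per_kernel_headers` is a list of header-name lists, one per kernel.
    Illustrative sketch only, not the actual TVM cutlass codegen code.
    """
    unique = set()
    for op_headers in per_kernel_headers:
        unique.update(op_headers)
    # Sort so the generated file is deterministic across runs.
    return "".join(f"#include <{h}>\n" for h in sorted(unique))


# Two kernels that both need the same GEMM header: it is emitted only once.
print(emit_headers([["cutlass/gemm/device/gemm.h"],
                    ["cutlass/gemm/device/gemm.h", "cutlass/util/host_tensor.h"]]))
```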
This PR improves cutlass compilation time by compiling a single CSourceModule instead of creating and compiling one for each kernel.

Creating and compiling a new CSourceModule for every function is quite slow, and significantly slows down builds of models with many functions offloaded to cutlass. Instead, we can generate a single CSourceModule and compile it once to produce a single runtime::Module. This brings down the cutlass compilation time of large models like SD Unet significantly (~30 min to ~4 min), with similar results on other large models.
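The core idea can be sketched with a toy model of the two build paths, where `compile_fn` stands in for the NVCC invocation; all names here are illustrative, not the actual build.py code:

```python
def compile_per_kernel(kernel_sources, compile_fn):
    """Old path: one CSourceModule, and one compiler invocation, per kernel."""
    return [compile_fn(src) for src in kernel_sources]


def compile_single_module(kernel_sources, compile_fn):
    """New path: concatenate all generated kernels into one source and
    compile it once, producing a single module."""
    combined = "\n".join(kernel_sources)
    return compile_fn(combined)


# Count invocations with a stub "compiler" to show the difference.
calls = []

def fake_nvcc(source):
    calls.append(source)
    return f"module({len(source)} bytes)"

sources = ["void k0() {}", "void k1() {}", "void k2() {}"]
compile_per_kernel(sources, fake_nvcc)
n_old = len(calls)

calls.clear()
compile_single_module(sources, fake_nvcc)
n_new = len(calls)

print(n_old, n_new)  # → 3 1
```

The saving in the real build comes from paying NVCC's per-invocation startup and per-file template-instantiation cost once instead of once per kernel.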
Testing
tests/python/relax/test_codegen_cutlass.py::test_matmul_offload is broken at HEAD. This PR passes all other tests when tested locally.

cc @masahi @vinx13