[Unity][BYOC] Faster cutlass codegen#14465
Conversation
Improve cutlass compilation time by compiling a single CSourceModule instead of one for each kernel.
Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.
Generated by tvm-bot
The original intention was to compile all generated files in parallel (via NVCC …)
Could you elaborate?

I did parallelize this loop to compile the generated C source modules in parallel, but that wasn't faster than compiling a single file. The difference between the two was not huge, but compiling a single source module was a bit faster (~50 seconds for the single source module vs ~70 seconds for multiple C source modules compiled in parallel).
tvm/python/tvm/contrib/cutlass/build.py, line 75 at 5562d90

I don't expect NVCC would use multiple threads to compile a huge single source, but the numbers you described indeed sound good.
Actually, since …
auto [f_code, op_headers] = GenCutlassFunc(f, options);
code += "\n" + f_code;
for (const auto& header : op_headers) {
  headers += "#include <" + header + ">\n";
}
Here we might be adding duplicated headers. It probably won't matter for compilation speed but the generated file might get ugly.
Yes; however, since this is a generated file, I felt it is OK to have duplicate entries in the headers. We can improve on this in follow-up PRs though.
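For such a follow-up, header deduplication is a small change. Here is a minimal Python sketch of collecting the includes from every generated kernel and emitting each one once; the function name and data layout are illustrative, not TVM's actual codegen API:

```python
def emit_headers(per_kernel_headers):
    """Emit each #include line once for a single generated source file.

    `per_kernel_headers` is a list of header-name lists, one per kernel.
    Illustrative sketch only, not the actual TVM cutlass codegen code.
    """
    unique = set()
    for op_headers in per_kernel_headers:
        unique.update(op_headers)
    # Sort so the generated file is deterministic across runs.
    return "".join(f"#include <{h}>\n" for h in sorted(unique))


# Two kernels that both need the same GEMM header: it is emitted only once.
print(emit_headers([["cutlass/gemm/device/gemm.h"],
                    ["cutlass/gemm/device/gemm.h", "cutlass/util/host_tensor.h"]]))
```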
This PR improves cutlass compilation time by compiling a single CSourceModule instead of creating and compiling one for each kernel.

Creating and compiling a new CSourceModule for every function is quite slow, and significantly slows down builds of models with many functions offloaded to cutlass. Instead, we can generate a single CSourceModule and compile it once to produce a single runtime::Module. This brings down the cutlass compilation time of large models like SD Unet significantly (~30 min to ~4 min), with similar results on other large models.
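The core idea can be sketched with a toy model of the two build paths, where `compile_fn` stands in for the NVCC invocation; all names here are illustrative, not the actual build.py code:

```python
def compile_per_kernel(kernel_sources, compile_fn):
    """Old path: one CSourceModule, and one compiler invocation, per kernel."""
    return [compile_fn(src) for src in kernel_sources]


def compile_single_module(kernel_sources, compile_fn):
    """New path: concatenate all generated kernels into one source and
    compile it once, producing a single module."""
    combined = "\n".join(kernel_sources)
    return compile_fn(combined)


# Count invocations with a stub "compiler" to show the difference.
calls = []

def fake_nvcc(source):
    calls.append(source)
    return f"module({len(source)} bytes)"

sources = ["void k0() {}", "void k1() {}", "void k2() {}"]
compile_per_kernel(sources, fake_nvcc)
n_old = len(calls)

calls.clear()
compile_single_module(sources, fake_nvcc)
n_new = len(calls)

print(n_old, n_new)  # → 3 1
```

The saving in the real build comes from paying NVCC's per-invocation startup and per-file template-instantiation cost once instead of once per kernel.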
Testing
tests/python/relax/test_codegen_cutlass.py::test_matmul_offload is broken at HEAD. This PR passes all other tests when tested locally.

cc @masahi @vinx13