[Runtime] Extend Graph Runtime To Support Cuda Graph Launch#7616
comaniac merged 19 commits into apache:main from zhuochenKIDD:main
Conversation
@zhuochenKIDD is this ready for review? Please modify the description if so; otherwise please mark this PR as a draft first. Thanks.
@comaniac I've added a test case; could you please help review? Thanks.
if(USE_CUDA)
  if(USE_GRAPH_RUNTIME_CUGRAPH)
This makes USE_GRAPH_RUNTIME_CUGRAPH silently ignored when CUDA is OFF, which may confuse users. We should have:
if(USE_GRAPH_RUNTIME_CUGRAPH)
  if(NOT USE_CUDA)
    # Error out, asking the user to configure with USE_CUDA=ON.
I moved this check to CUDA.cmake to better verify that the CUDA version is at least 10, so it now works like the cudnn/cublas features. Is that OK?
except ValueError:
    raise ValueError(
        "Please set '(USE_GRAPH_RUNTIME_CUGRAPH ON)' in "
        "config.cmake and rebuild TVM to enable cu_graph test mode"
It is because we are currently evaluating the CUDA Graph API against plain kernel launch, and that evaluation is still ongoing; using TVM is more convenient than the TF runtime for trying this on new workloads. Also, only kernel-kind CUDA nodes currently end up in the captured CUDA graph; there might be more benefit once memcpy-kind nodes are captured, or with a manually created CUDA graph. So I am not sure the current stream-capture approach is optimal; it probably needs more testing.
I see. We usually call this "experimental". I'd suggest the following:
To enable CuGraph (experimental), please set '(USE_GRAPH_RUNTIME_CUGRAPH ON)'
in config.cmake and rebuild TVM
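The suggested wording could be wired up roughly like this (a sketch only; `check_cuda_graph_enabled` is a hypothetical helper name, not a function from this PR — the real check lives in TVM's testing utilities and queries the build flags):

```python
# Hypothetical helper illustrating the suggested error message.
# `enabled` would come from inspecting the TVM build configuration.
def check_cuda_graph_enabled(enabled: bool) -> None:
    if not enabled:
        raise ValueError(
            "To enable CuGraph (experimental), please set "
            "'(USE_GRAPH_RUNTIME_CUGRAPH ON)' in config.cmake and rebuild TVM"
        )
```

Failing early with an actionable message like this is preferable to a silent skip, since it tells the user exactly which flag to flip.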
Co-authored-by: Cody Yu <comaniac0422@gmail.com>
This reverts commit f286711.
comaniac left a comment
Just some minor changes, but overall it is good. Two additional points:
- I found that two terms, cuGraph and CUDA graph, are used in this PR. It would be better to just use cuGraph.
- It would be great if you could send a follow-up PR for a tutorial explaining how to use the two interfaces.
if(CUDAToolkit_VERSION_MAJOR LESS "10")
  message(FATAL_ERROR "CUDA Graph requires at least CUDA 10, got=" ${CUDAToolkit_VERSION})
endif()
message(STATUS "Build with Graph runtime cuGraph support...")
It would be better to stick to one term in this PR: either cuGraph or CUDA graph.
Yes, I removed all the cuGraph / cu_graph names and use CUDA Graph instead.
        return False
    return True
except RuntimeError:
    warnings.warn("Cannot find cuda path")
This warning carries no useful information; consider removing it.
comaniac left a comment
LGTM. I'm going to merge this first and the doc could be the next PR.
For the doc location, it's reasonable to put it under the TVM runtime docs along with the debugger (.rst), but to me this is a feature not limited to developers, so it would be more impactful if we put it under tutorials (.py). @tqchen @hogepodge could you please advise?
Thanks @zhuochenKIDD
A tutorial/how-to guide would be nice.
Agree with Tianqi. A how-to guide would be best. You can write it as a Sphinx-Gallery document under the tvm/tutorials directory. I'm not entirely certain which subdirectory it should go under (you should avoid the get_started directory); maybe a new directory if it doesn't fit the classifications of the others.
* add graph runtime cuGraph poc
* lint format
* add unittest
* fix review comments
* Update CMakeLists.txt (Co-authored-by: Cody Yu <comaniac0422@gmail.com>)
* build cuda graph runtime in gpu test
* Revert "build cuda graph runtime in gpu test" (reverts commit f286711)
* rename cuGraph to CUDA Graph
* rename cuda_graph
* rename cuda_graph
* lint format
* Update src/runtime/graph/graph_runtime_factory.cc (Co-authored-by: Cody Yu <comaniac0422@gmail.com>)
* Update python/tvm/testing.py (Co-authored-by: Cody Yu <comaniac0422@gmail.com>)
* fix lint error
* remove unnecessary warn
* add test, fix lint
* fix lint W0223

Co-authored-by: Cody Yu <comaniac0422@gmail.com>
We are currently using the graph runtime to run some CTR models on NVIDIA GPUs. For our in-house model (around 100 nodes in the TVM JSON graph), cudaGraphLaunch reduces latency by 5% to 10% compared with the original for-loop CUDA kernel launch.
So I wonder whether the extension might benefit other workloads; I haven't tested other types of models.
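A back-of-envelope sketch of why a single graph launch can help at this scale (all figures below are hypothetical assumptions, not measurements from this PR):

```python
# Assumed numbers: ~100 kernels, ~5 us CPU-side launch overhead per kernel,
# ~10 ms end-to-end model latency. A single cudaGraphLaunch amortizes the
# per-kernel launch cost into roughly one launch, so the per-kernel overhead
# is an upper bound on the saving from this mechanism alone.
num_kernels = 100
launch_overhead_us = 5.0      # assumed per-kernel launch cost
model_latency_us = 10_000.0   # assumed end-to-end latency

total_overhead_us = num_kernels * launch_overhead_us
saving_pct = total_overhead_us / model_latency_us * 100
print(f"launch overhead share: {saving_pct:.1f}%")  # → launch overhead share: 5.0%
```

Under these assumptions the launch overhead alone is about 5% of end-to-end latency, which is consistent with the 5-10% reduction reported above once other capture effects are included.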