[VM][DMLC] Lower memory usage when loading and dumping weights #13877
Merged
AndrewZhaoLuo merged 8 commits into apache:main on Feb 2, 2023
Conversation
tvm-bot (Collaborator)
Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.
Generated by tvm-bot
tqchen (Member)
The approach of having an overloaded file support util is fine; one thing is that it would need to be part of the runtime folder, as it is simple enough. Given most of the cases are on GPU, having the ability to load one array onto the CPU, copy it into the GPU, then immediately free that CPU array can also be effective. A sketch of this pattern follows below.
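A minimal sketch of the load-to-CPU, copy-to-device, free pattern suggested above, assuming the TVM runtime NDArray and dmlc stream APIs. `LoadParamsToDevice` and the simplified on-disk layout (a count followed by serialized arrays) are hypothetical and not the actual PR code or TVM's parameter file format:

```cpp
#include <dmlc/io.h>
#include <tvm/runtime/logging.h>
#include <tvm/runtime/ndarray.h>

#include <cstdint>
#include <memory>
#include <string>
#include <vector>

// Hypothetical helper: load each array from disk onto the CPU, copy it to the
// target device, and drop the host copy before reading the next array, so peak
// host memory stays near one tensor rather than the whole parameter file.
std::vector<tvm::runtime::NDArray> LoadParamsToDevice(const std::string& path,
                                                      tvm::Device dev) {
  std::unique_ptr<dmlc::Stream> strm(dmlc::Stream::Create(path.c_str(), "r"));
  ICHECK(strm != nullptr) << "Cannot open " << path;

  // Assumed layout: a uint64 count followed by serialized NDArrays.
  uint64_t n_params = 0;
  ICHECK(strm->Read(&n_params)) << "Invalid parameter file";

  std::vector<tvm::runtime::NDArray> out;
  out.reserve(n_params);
  for (uint64_t i = 0; i < n_params; ++i) {
    tvm::runtime::NDArray cpu_arr;
    ICHECK(cpu_arr.Load(strm.get())) << "Invalid parameter file";
    out.push_back(cpu_arr.CopyTo(dev));  // device copy (e.g. GPU)
    cpu_arr = tvm::runtime::NDArray();   // release the host buffer immediately
  }
  return out;
}
```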
AndrewZhaoLuo (Contributor, Author)
@tqchen thanks for the comments. PTAL, ready for review.
tqchen reviewed Feb 1, 2023
tqchen (Member) left a comment
Thanks @AndrewZhaoLuo, one minor comment.
tqchen approved these changes on Feb 1, 2023
AndrewZhaoLuo added a commit that referenced this pull request on Feb 10, 2023
* initial commit
* update additional use cases
* typo
* asf header, summary
* clean up
* lint
* move code to src/runtime/file_utils.h
* file utils is cool
Right now there is a bad pattern in the VM executable: when loading weights, we read the serialized representation into memory and then deserialize from that in-memory store without progressively freeing memory.
This is bad because if our weights take up ~5 GB, the serialized representation in memory takes up ~5 GB and the deserialized representation takes another ~5 GB. Peak memory use when running the VM is therefore about 2x the size of the model's weights.
This is especially painful with some of the larger models out there today.
This PR fixes that by streaming directly from disk and relying on the standard C file interface to buffer reads for good performance.
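The contrast described above, sketched for a single array; the helper names are hypothetical and this is not the PR's actual diff (which touches the VM executable load path and src/runtime/file_utils.h):

```cpp
#include <dmlc/io.h>
#include <dmlc/memory_io.h>
#include <tvm/runtime/logging.h>
#include <tvm/runtime/ndarray.h>

#include <fstream>
#include <iterator>
#include <memory>
#include <string>

// Old pattern: slurp the whole serialized blob into a std::string, then
// deserialize from an in-memory stream. Both copies are alive at once, so
// peak host memory is roughly 2x the size of the weights.
tvm::runtime::NDArray LoadArrayViaMemoryBlob(const std::string& path) {
  std::ifstream fs(path, std::ios::binary);
  std::string blob((std::istreambuf_iterator<char>(fs)),
                   std::istreambuf_iterator<char>());    // copy #1: raw bytes
  dmlc::MemoryStringStream mstrm(&blob);
  tvm::runtime::NDArray arr;
  ICHECK(arr.Load(&mstrm)) << "Invalid file " << path;   // copy #2: tensor data
  return arr;
}

// New pattern: deserialize straight from a buffered file stream; the standard
// C file interface handles chunked reads, so only the tensor itself is ever
// held in memory.
tvm::runtime::NDArray LoadArrayViaFileStream(const std::string& path) {
  std::unique_ptr<dmlc::Stream> strm(dmlc::Stream::Create(path.c_str(), "r"));
  tvm::runtime::NDArray arr;
  ICHECK(arr.Load(strm.get())) << "Invalid file " << path;
  return arr;
}
```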
Some before and after graphs from loading and benchmarking a model with ~5 GB of weights:
Before: (memory usage graph)
After: (memory usage graph)
This is a draft since: