[Contrib][Sort] Faster Top-K Implementation #13599
Merged
tkonolige merged 5 commits into apache:main on Jan 4, 2023
Conversation

tvm-bot (Collaborator):
Thanks for contributing to TVM! Please refer to the contributing guidelines (https://tvm.apache.org/docs/contribute/) for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.
tkonolige (Contributor) approved these changes on Jan 4, 2023, with the comment:
Numbers look excellent! We could probably simplify the implementation to use std::partial_sort (https://en.cppreference.com/w/cpp/algorithm/partial_sort), but that can wait for a future PR.
fzi-peccia pushed a commit to fzi-peccia/tvm that referenced this pull request on Mar 27, 2023.
Summary:
This is a simple rewrite of the hand-coded top-k function used for CPU targets.
The old implementation sorted each axis and then took the largest k elements.
The new implementation makes a single pass over each axis, maintaining a min-heap of the top-k elements seen so far.
If n is the size of the array and we want the top k, the old implementation runs in O(n log n) time and needs O(n) additional memory to store the sorted array. The new implementation runs in O(n log k) time and requires only O(k) additional memory; in practice, since most elements never enter the heap, the heap-update cost probably amortizes to roughly O((n/k) log k) on top of the linear scan in many scenarios. Note that n >> k most of the time.
In practice the new kernel led to a 20x speedup over the existing one: on a Xeon Platinum 8370C CPU @ 2.80GHz, for input shape [1, 3050] with k = 15, latency went from ~200us to ~10us. There is probably room to shave off a little more time, on the order of a single microsecond, but I have determined it is not worth it.
This change, however, seems well worth committing.
I've run benchmarks on my M1 Mac and on a Xeon Platinum 8370C CPU @ 2.80GHz with 8 cores.
Data:
All data is collected along axis=1.
M1: (benchmark table not reproduced here)
Xeon: (benchmark table not reproduced here)
As can be seen, except in one pathological case (k ~ axis_size), we see significant speedups across almost all conditions. Surprisingly, on the M1 even this case shows speedups.
Other Changes: