[TOPI][OP] cuda for argwhere #6868
Conversation
@zhiics I have a branch with the changes you'd need, but I haven't opened a PR because I've been fighting that memory corruption issue with topk. Would you like me to submit a PR to enable the other dynamic tests and include my refactors to strided slice?
@mbrookhart Thanks. That would be cool.
    # Limit threads to a single block to make sure atomic_add works normally.
    max_threads = int(tvm.target.Target.current(allow_none=False).max_num_threads)
    nthread_tx = max_threads
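For context, the sizing in the diff above caps the kernel at a single block of `max_threads` threads, with each thread covering its share of elements via a strided loop. A rough Python sketch of that sizing decision (the helper `launch_dims` is hypothetical, not TVM API; `max_threads` and the single-block cap mirror the snippet):

```python
import math

def launch_dims(n, max_threads, single_block=True):
    """Compute (num_blocks, threads_per_block) for n elements.

    With single_block=True (what this PR does so atomic_add behaves),
    everything runs in one block and each thread loops with a stride.
    """
    nthread_tx = max_threads
    if single_block:
        return 1, nthread_tx
    # Unconstrained sizing: enough blocks to cover all n elements.
    return math.ceil(n / nthread_tx), nthread_tx

# One block of 1024 threads covers 5000 elements via a strided loop.
print(launch_dims(5000, 1024))         # (1, 1024)
print(launch_dims(5000, 1024, False))  # (5, 1024)
```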
CUDA does have a kernel-level atomic add (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomic-functions). Is it just slower, or do we not have access to it from TIR?
We use atomicAdd. However, if the number of blocks is larger than a threshold (like 18), it returns an incorrect result.
I'm surprised. I'd expect atomicAdd to work with any number of blocks. Could you maybe expand this comment with why and when atomicAdd fails?
The observation is that when the input data size is large (> 300 * 300, for example) and the number of blocks is not limited, as was previously the case, the output of the IR routine is incorrect. I haven't dug deeper into it at this time.
In addition, we need to use thrust; otherwise the TVM implementation of topk can also generate incorrect results.
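The scheme under discussion is the atomic-counter compaction pattern GPU argwhere relies on: each thread tests its element and, on a hit, atomically bumps a shared counter to claim an output slot. A pure-Python emulation of that pattern (a lock stands in for CUDA's atomicAdd; this is a sketch of the idea, not the TIR code):

```python
import threading

def parallel_argwhere_1d(data, num_workers=4):
    """Collect indices of nonzero elements using an atomically
    bumped output counter, mimicking the GPU compaction scheme."""
    out = [None] * len(data)
    counter = [0]
    lock = threading.Lock()  # stands in for CUDA atomicAdd

    def worker(start):
        for i in range(start, len(data), num_workers):  # strided loop
            if data[i] != 0:
                with lock:            # atomic "fetch-and-add"
                    slot = counter[0]
                    counter[0] += 1
                out[slot] = i         # the claimed slot is ours alone

    threads = [threading.Thread(target=worker, args=(w,))
               for w in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Workers finish in arbitrary order, so the result must be sorted
    # afterwards -- the same reason this PR sorts the argwhere output.
    return sorted(out[:counter[0]])

print(parallel_argwhere_1d([0, 3, 0, 0, 7, 1, 0]))  # [1, 4, 5]
```

The sort at the end matches the "sort argwhere result" step in the commit list: the compaction itself writes hits in nondeterministic order.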
Thanks @zhiics @mbrookhart @tkonolige |
* argwhere
* cuda schedule
* sort argwhere result
* Use single block and thrust to fix flaky behavior
* format
* used dynamic strided_slice
* Fix dynamic strided_slice
* try new strided_slice
* Improve dynamic strided slice to bind data dependent shape var.
* all tests pass
* remove print
* use new strided_slice
* clean

Co-authored-by: Yao Wang <kevinthesunwy@gmail.com>
This PR adds a CUDA schedule for argwhere.
Will ping reviewers when we can run the argwhere Relay tests.
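For reference, argwhere returns the coordinates of every nonzero element as an (N, ndim) result. A minimal dependency-free Python version of the 2-D case, illustrating the semantics the schedule targets (not the TOPI kernel itself):

```python
def argwhere_2d(mat):
    """Return [row, col] coordinates of nonzero entries, row-major order."""
    return [[r, c]
            for r, row in enumerate(mat)
            for c, v in enumerate(row)
            if v != 0]

print(argwhere_2d([[0, 1],
                   [2, 0]]))  # [[0, 1], [1, 0]]
```

Because N depends on the data, the output shape is dynamic, which is why this work ties into dynamic strided_slice in the commit list.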