[BugFix][TOPI] Fix the integer overflow problem of the scatter_nd op. (#8415)
Conversation
I've implemented an alternative fix in #8419

@tkonolige Thank you for reviewing this PR. Just curious, is there any specific reason why you created another PR? I understand the alternative fix you proposed may well be better than the existing one in some ways. But why create a separate new PR? We can discuss and improve the code here; I think that's exactly what code review is about, right?

@zhuwenxi Sorry, I definitely shouldn't have opened a new PR. I just got a little hasty and did the fix myself. Feel free to copy the code from that PR into this one. I'll end up closing the other one.

@tkonolige That's OK, it happens.

You hit a flaky test; I found a fix here: #8431
Fix scatter_nd's crash problem:
1. The existing scatter_nd CUDA implementation uses a very large bound, which can overflow the int32 range when the input tensor shape is large enough.
2. The overflow can make the if statement always evaluate to true, leading to invalid memory accesses.
3. This commit fixes the problem by reducing the bound; the original large bound is not only unnecessary but also degrades performance.

With this fix, scatter_nd's performance improves 100x in some cases.
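To illustrate the idea behind "reducing the bound", here is a hedged CUDA sketch (not TVM's actual TIR lowering; the kernel name and the precomputed `flat_offsets` array are illustrative assumptions): launch only as many threads as there are update elements, so the flattened index stays well inside the int32 range and the oversized guard disappears.

```cuda
// Hypothetical sketch of the reduced-bound launch strategy; not TVM's codegen.
// One thread per update element: for the reproducer below that is
// 32 * 600 * 256 = 4,915,200 threads, far below INT32_MAX.
__global__ void scatter_nd_fixed(float* out, const float* updates,
                                 const long long* flat_offsets, int n_elems) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n_elems) {  // small bound: no int32 overflow possible
    // flat_offsets[i] is assumed to be the precomputed destination offset
    // derived from the indices tensor.
    out[flat_offsets[i]] = updates[i];
  }
}
```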
@mbrookhart I've fixed the UT failure.

Thanks @zhuwenxi @tkonolige
[BugFix][TOPI] Fix the integer overflow problem of the scatter_nd op. (apache#8415)

* Fix the integer overflow problem of the scatter_nd op.
* Fix scatter_nd's crash problem:
  1. The existing scatter_nd CUDA implementation uses a very large bound, which can overflow the int32 range when the input tensor shape is large enough.
  2. The overflow can make the if statement always evaluate to true, leading to invalid memory accesses.
  3. This commit fixes the problem by reducing the bound; the original large bound is not only unnecessary but also degrades performance.

With this fix, scatter_nd's performance improves 100x in some cases.

Co-authored-by: wenxizhu <wenxizhu@tencent.com>
Problem Statement
scatter_nd crashes on the CUDA backend when the input data shape is slightly larger than usual.
Code to reproduce
```python
import tvm
import numpy as np
import tvm.relay as relay

dev = tvm.cuda()
target = tvm.target.Target("cuda")

# input data:
data_np = np.zeros((32, 128, 128, 256)).astype(np.float32)
indices_np = np.random.uniform(1, 5, (32, 600, 3)).astype(np.int64)
updates_np = np.random.rand(32, 600, 256).astype(np.float32)

# Construct relay input nodes:
data = relay.var("data", shape=data_np.shape, dtype=str(data_np.dtype))
indices = relay.var("indices", shape=indices_np.shape, dtype=str(indices_np.dtype))
updates = relay.var("updates", shape=updates_np.shape, dtype=str(updates_np.dtype))

# Compute indices:
indices_dim = len(indices_np.shape)
axes = list(range(indices_dim))
indices_t = relay.transpose(indices, axes[-1:] + axes[:-1])

# Construct relay scatter_nd op:
out = relay.op.scatter_nd(data, indices_t, updates, "update")
func = relay.Function([data, indices, updates], out)

# Execute scatter_nd:
intrp = relay.create_executor("debug", device=dev, target=target)
op_res = intrp.evaluate(func)(data_np, indices_np, updates_np)
```
Error Message
Root Cause
We can see the problem more clearly from the generated CUDA code. The TIR implementation of scatter_nd causes an int32 overflow when `i` is large, so the if statement always evaluates to true and the kernel performs invalid memory accesses.
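To make the failure mode concrete: the data tensor in the reproducer has 32 * 128 * 128 * 256 = 134,217,728 elements, so flattened-index products computed in int32 arithmetic can easily exceed INT32_MAX = 2,147,483,647 (for example, 134,217,728 * 16 = 2,147,483,648) and wrap to a negative value. Below is a minimal, hypothetical CUDA sketch of the pattern (not TVM's actual generated kernel; the kernel name, parameters, and guard are illustrative):

```cuda
// Hypothetical sketch of the overflow pattern; not TVM's actual codegen.
__global__ void scatter_nd_like(float* out, const float* updates,
                                long long bound, int inner_size) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;  // 32-bit thread index
  // The flattened offset is computed in int32. For large inputs the
  // product wraps around to a negative number.
  int offset = i * inner_size;                    // may overflow int32
  // After the wrap, offset is negative, so a guard like this stays true
  // and the store below touches memory outside the output buffer.
  if (offset < bound) {
    out[offset] = updates[i];                     // invalid memory access
  }
}
```

Reducing the launch bound to the number of update elements, as the fix does, keeps the index arithmetic small enough that no such wrap can occur.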