Consider the following program, in which we split a loop by a factor of 48 and then split the resulting inner loop (which has extent 48) by a factor of 32:
```python
import tvm

m = 100
A = tvm.placeholder((m,), name='A')
C = tvm.compute((m,), lambda *index: A(*index), name='C')
s = tvm.create_schedule(C.op)
do, di = s[C].split(C.op.axis[0], 48)
di0, di1 = s[C].split(di, 32)
print(tvm.lower(s, [A, C], simple_mode=True))
```
The generated Halide IR is:
```
produce C {
  for (i0.outer, 0, 3) {
    for (i0.inner.outer, 0, 2) {
      for (i0.inner.inner, 0, 32) {
        if (likely(((i0.outer*48) < ((100 - i0.inner.inner) - (i0.inner.outer*32))))) {
          C[(((i0.outer*48) + (i0.inner.outer*32)) + i0.inner.inner)] = A[(((i0.outer*48) + (i0.inner.outer*32)) + i0.inner.inner)]
        }
      }
    }
  }
}
```
The generated code is functionally correct, but it is inefficient: some points of the iteration space are visited more than once. In particular, when `i0.outer` is 0 we execute the assignment for points [0-63], when `i0.outer` is 1 we execute it for points [48-99], and when `i0.outer` is 2 we execute it for points [96-99]. Points [48-63] and [96-99] are therefore each written twice.
Adding a predicate that relates `i0.inner.outer` and `i0.inner.inner` (i.e., `i0.inner.outer*32 + i0.inner.inner < 48`) would solve the problem.
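To make the overlap concrete, here is a small plain-Python sketch that replays the loop nest from the IR above and counts how often each index is written, with and without the proposed extra predicate. It does not use TVM. The helper name `visit_counts` and its `guard_inner` flag are made up for illustration, and the loop extents are taken directly from the lowered code rather than from TVM's bound inference.

```python
from collections import Counter

def visit_counts(extent=100, outer_factor=48, inner_factor=32, guard_inner=False):
    """Replay the lowered loop nest and count writes per index (illustrative helper)."""
    # Ceil-division extents, matching the bounds in the IR: 3 and 2.
    n_outer = -(-extent // outer_factor)
    n_inner_outer = -(-outer_factor // inner_factor)
    counts = Counter()
    for i0_outer in range(n_outer):
        for i0_inner_outer in range(n_inner_outer):
            for i0_inner_inner in range(inner_factor):
                inner = i0_inner_outer * inner_factor + i0_inner_inner
                idx = i0_outer * outer_factor + inner
                # Existing predicate from the IR: stay inside the tensor.
                if idx >= extent:
                    continue
                # Proposed predicate: stay inside the 48-wide inner loop.
                if guard_inner and inner >= outer_factor:
                    continue
                counts[idx] += 1
    return counts

before = visit_counts()
after = visit_counts(guard_inner=True)

duplicated = sorted(i for i, c in before.items() if c > 1)
print("writes without extra predicate:", sum(before.values()))
print("indices written twice:", duplicated)
print("writes with extra predicate:", sum(after.values()))
```

Under these assumptions the current predicate leads to 120 writes for 100 elements, with indices 48-63 and 96-99 written twice, while the additional predicate brings this down to exactly one write per element.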