Conversation
@abadams -- should I pull this into Google and do some torture testing before landing, or are we pretty confident this is good?

Torture testing inside Google would be pretty helpful, thanks.

Testing in Google, I find only one new failure, but... it appears to be a hang (or near-infinite loop) inside

Yeah, we definitely get stuck ~forever in

So far, what I'm finding is that we have a fairly complex Expr as the input to remove_unbounded_terms, which becomes insanely huge afterwards (too large to bother pasting here -- something like 7MB of text when the Expr is printed), and that's after the call to simplify(). EDIT: the corresponding scope at that point:

Is it possible to know if the scope has any values actually set? Sorry, I didn't realize that printing the scope only prints the names; I need the corresponding intervals as well.

Definitely seems like the issue here is

@steven-johnson Do you think you could run Google testing again? I think my tests just never had such enormous expressions, the example you provided should end reasonably fast now. |
Testing now, but hiding an apparently-critical constant (the

(Tests look good so far, stand by)
     * Visitor for removing terms that are completely unbounded from
     * a min or a max.
     */
    class StripUnboundedTerms : public IRMutator {
It looks like this class can only correctly handle the listed IR node types, and everything it would make a mess on is explicitly listed below. It seems like this would be safer as a recursive function - when someone adds a new IR node type this class will be implicitly broken and they won't know to update it.
I agree. I will do this, but probably won't be able to today, hopefully tomorrow.
changed in 7a26154. Would love your thoughts on the Let case, that's the one I'm a little unsure on.
    }
    // Two-finger O(n) algorithm for simplifying sums.
    std::vector<AffineTerm> simplify_linear_summation(const std::vector<AffineTerm> &terms) {
I don't understand how this linear-time algorithm works. An explanatory comment would help.
added in cf591f7, let me know if it is satisfactory
It's sorted by deep equality comparison? I just see a should_commute. should_commute is only a partial order.
No, I think we had a previous discussion about deep_equality being too expensive and should_commute being good enough but we could make it stronger. Should we change that?
Let's change to deep equality and see what the impact is.
    };
    // Used to bound the number of substitutions.
    class SubstituteSomeLets : public IRMutator {
What does this do on something like: let b = a + a in let c = b + b in let d = c + c in ....
Does it produce an Expr of size 2 ^ count?
Yes, I believe it does; the count only bounds the number of substitutions, not the accumulated size. Would you prefer different behavior?
Wait, something looks fishy here. You put values into the scope after mutating them, but then when you pull them out of the scope you mutate them a second time. I can't figure out what this would do with shadowed lets.
IMO we need to bound the actual size of the generated expression. 2^16 is still pretty damn big.
I think the second mutation shouldn't happen, thanks for the catch.
Do you have any recommendations on how to bound overall size? I guess we could track multiplicative substitutions, i.e. in the example you gave, substituting c counts for 3 substitutions, as it has 2 b substitutions plus the actual c substitution.
Are shadowed lets allowed in Halide syntax? count_var_uses definitely assumes that there are no shadowed lets. I thought #6583 was explicitly trying to prevent shadowed lets because we don't allow them.
Conclusion from offline discussion:
- This may be called in sliding window which is before removing shadowed lets.
- Should be feasible to track total # substitutions by including an int in the Scope that counts the number of substitutions that occurred when mutating the RHS of that let
Should we only substitute in the final expression?
i.e. suppose we have: (let x = w + w in (let y = x + x in (let z = y + y in z + z))) and we have substitution count = 1. Should the resultant Expr be (let x = w + w in (let y = (w + w) + x in (let z = y + y in z + z))) or should it be (let x = w + w in (let y = x + x in (let z = y + y in (y + y) + z))) or neither? I assume (and prefer) the latter, but I don't think there's a good scheme to make that happen.
Also, if the count is large enough to do full substitutions, should the resultant Expr be:
(let x = w + w in (let y = x + x in (let z = y + y in (((w + w) + (w + w)) + ((w + w) + (w + w))) + (((w + w) + (w + w)) + ((w + w) + (w + w)))))) or should the substitutions also happen inside of the let values? i.e. the Expr should become (let x = w + w in (let y = (w + w) + (w + w) in ....? I think this also ties into the above question - if we have a count of 1, and the scope for Exprs says that z = ((w + w) + (w + w)) + ((w + w) + (w + w)) then we can't substitute that in because the tracked cost is too expensive, so perhaps the resultant Expr should be unchanged in this case?
    // We want to recurse through Lets.
    auto [v_count, v_new] = strip_unbounded(op->value, direction, scope, var_uses);
    auto [b_count, b_new] = strip_unbounded(op->body, direction, scope, var_uses);
    // We might want to only count b_count.
Yes, this should just be b_count
Is this ready to land (pending green)?

No - I still need to address Andrew's point on using deep_equality, and still need feedback on the substitution. Sorry for dropping the ball on this, I was out at a conference for a week and have been playing catch-up on other duties in the week since. Will try to make more progress on it this week.

No worries, just trying to catch up on things after returning from my own vacation -- no rush on this from my perspective.

Thanks! I hope it was a fun vacation!

Are we hoping that this will allow us to remove the

Hey, just a periodic status check on this one.

Sorry - I'm getting a tad behind, and this PR has been on the back burner for a bit. I will try to get to it in the next few weeks.

Monday Morning Review Ping -- where does this PR stand?

It still has a bit of work to be done, and I have not managed to get to it yet. I haven't forgotten it, and will aim to address it by the end of September (I know that's far away and I apologize, but I am currently in paper-writing mode + am about to move across the country) |
This PR provides a series of methods for removing/simplifying correlated expressions for find_constant_bounds: n=100 (edit: n=16). We don't want to always substitute all lets, but some constant bounds can be calculated with just a small number of substitutions.

This method would note that x is unbounded, and therefore the lhs of the max can be stripped, producing:

This allows us to push divisions inside additions/subtractions, which can improve the ability to cancel like terms in a lot of generated equations.
@abadams ran a series of experiments with randomly-generated schedules (n=256) on a series of apps (bgu, camera_pipe, conv_layer, depthwise_separable_conv, harris, hist, iir_blur, lens_blur, max_filter, stencil_chain, unsharp), and here is a summary of the results (percentages are total across the benchmarks):
Fewer failed unrolls: bgu (5 -> 3), camera_pipe (69 -> 58), harris (197 -> 160), lens_blur (158 -> 12), max_filter (338 -> 242), unsharp (110 -> 32)
Less memory: camera_pipe (0.6%), depthwise_separable_conv (0.3%), hist (0.1%), lens_blur (0.6%), unsharp (0.2%)
Fewer malloc calls: camera_pipe (173592 -> 171612), harris (608943 -> 608655), lens_blur (144324 -> 141888), stencil_chain (701085 -> 698712), unsharp (127428 -> 127284)
Some small runtime improvements (0.05% to 0.6%) : bgu, camera_pipe, harris, hist, iir_blur, lens_blur, max_filter, stencil_chain
More memory: harris (0.06%), iir_blur (0.002%), stencil_chain (0.1%)
The runtime improvements might not be statistically significant, but I think better loop unrolling and improved stack allocations are important contributions.
For apps with no improved unrolling, compilation times increase by a small amount (~3%). With improved unrolling there are larger increases, but these are mostly because generating the unrolled code takes longer in both our codegen and LLVM's codegen.
This work was part of a project with @abadams and @shoaibkamil.