Improved find_constant_bound(s) #6792

Open
rootjalex wants to merge 26 commits into main from rootjalex/improve_cbounds_fixed

@rootjalex
Member

@rootjalex rootjalex commented May 31, 2022

This PR provides a series of methods for removing/simplifying correlated expressions for find_constant_bounds:

  • Bounded let-substitutions (n=100 edit: n=16). We don't want to always substitute all lets, but some constant bounds can be calculated just by a small number of substitutions.
  • Removing unbounded terms from mins/maxs. A (simplistic) example is below:
Find lower bound on:
max(x, y) - (z + y)
With z : [0, 8]

This method would note that x is unbounded, and therefore the lhs of the max can be stripped, producing:

y - (z + y) -> 0 - z -> lower bounded by -8
  • Affine term reordering. Halide’s TRS-based simplification can only cancel terms in sums up to a certain depth; this method uses a linear-time algorithm to cancel like terms.
  • Pushing rationals inwards. This technique pushes multiplications inwards to allow stronger simplification. More importantly, it pushes divisions inwards via a safe approximation, best summarized by the following inequalities:
// Addition:
(a / n) + (b / n) <= (a + b) / n <= (a / n) + (b / n) + 1
// Subtraction:
(a / n) - (b / n) - 1 <= (a - b) / n <= (a / n) - (b / n)

This allows us to push divisions inside additions/subtractions, which can improve the ability to cancel like terms in a lot of generated equations.
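As a sanity check on the division inequalities above, here is a minimal brute-force sketch in standalone C++ (not part of the PR). It assumes `/` denotes integer division that rounds toward negative infinity and that n > 0. `floor_div` and `division_bounds_hold` are hypothetical helper names introduced only for this illustration.

```cpp
#include <cassert>

// Floor division: rounds toward negative infinity.
// C++'s built-in `/` truncates toward zero, so mixed-sign operands need a fix-up.
int floor_div(int a, int n) {
    int q = a / n;
    int r = a % n;
    return (r != 0 && ((r < 0) != (n < 0))) ? q - 1 : q;
}

// Returns true if both sandwich inequalities hold for every a, b in
// [-range, range] and every divisor n in [1, max_n].
bool division_bounds_hold(int range, int max_n) {
    for (int n = 1; n <= max_n; n++) {
        for (int a = -range; a <= range; a++) {
            for (int b = -range; b <= range; b++) {
                int qa = floor_div(a, n);
                int qb = floor_div(b, n);
                int sum = floor_div(a + b, n);
                int diff = floor_div(a - b, n);
                // Addition: (a/n) + (b/n) <= (a+b)/n <= (a/n) + (b/n) + 1
                if (!(qa + qb <= sum && sum <= qa + qb + 1)) return false;
                // Subtraction: (a/n) - (b/n) - 1 <= (a-b)/n <= (a/n) - (b/n)
                if (!(qa - qb - 1 <= diff && diff <= qa - qb)) return false;
            }
        }
    }
    return true;
}
```

An exhaustive check over a small range is not a proof, but it is a cheap way to catch a sign or off-by-one slip in either inequality.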

@abadams ran a series of experiments with randomly-generated schedules (n=256) on a series of apps (bgu, camera_pipe, conv_layer, depthwise_separable_conv, harris, hist, iir_blur, lens_blur, max_filter, stencil_chain, unsharp), and here is a summary of the results (percentages are total across the benchmarks):

Fewer failed unrolls: bgu (5 -> 3), camera_pipe (69 -> 58), harris (197 -> 160), lens_blur (158 -> 12), max_filter (338 -> 242), unsharp (110 -> 32)
Less memory: camera_pipe (0.6%), depthwise_separable_conv (0.3%), hist (0.1%), lens_blur (0.6%), unsharp (0.2%)
Fewer malloc calls: camera_pipe (173592 -> 171612), harris (608943 -> 608655), lens_blur (144324 -> 141888), stencil_chain (701085 -> 698712), unsharp (127428 -> 127284)
Some small runtime improvements (0.05% to 0.6%) : bgu, camera_pipe, harris, hist, iir_blur, lens_blur, max_filter, stencil_chain

More memory: harris (0.06%), iir_blur (0.002%), stencil_chain (0.1%)

The runtime improvements might not be statistically significant, but I think better loop unrolling and improved stack allocations are important contributions.

For apps with no improved unrolling, compilation times increase by a small amount (~3%). With improved unrolling, there are large increases, but these are mostly because generating the unrolled code takes longer in both our codegen and LLVM codegen.

This work was part of a project with @abadams and @shoaibkamil.

@rootjalex rootjalex requested a review from abadams May 31, 2022 17:55
Comment thread src/ConstantBounds.cpp Outdated
@steven-johnson
Contributor

@abadams -- should I pull this into Google and do some torture testing before landing, or are we pretty confident this is good?

@abadams
Member

abadams commented Jun 1, 2022

Torture testing inside Google would be pretty helpful, thanks.

@steven-johnson
Contributor

Testing in Google, I find only one new failure, but... it appears to be a hang (or near-infinite loop) inside Bounding small realizations... when compiling one specific Generator. Adding to the fun, it's in some proprietary stuff that might be hard to share publicly. Let me see if I can narrow things down further...

@steven-johnson
Contributor

Yeah, we definitely get stuck ~forever in bound_small_allocations(), which was only changed to include the new header, so something about the change in definition has injected something here. Let me see if I can come up with a repro case I can share.

@rootjalex
Member Author

bound_small_allocations() is calling the new version(s) of find_constant_bound(s), which means that there is likely an allocation expression that is tripping up the new method - if you could log which expression (and the corresponding scope) is causing the hang, I can investigate (hopefully sharing that much is okay?).

@steven-johnson
Contributor

steven-johnson commented Jun 2, 2022

So far, what I'm finding is that we have a fairly complex Expr that is the input to remove_unbounded_terms:

(((let t418 = min(max(min(max(min(max(min(max(min(max(min(max(min(max(min((foo$13.extent.0 + foo$13.min.0) + 4, input.extent.0), 1) + 4, input.extent.0), 1) + 4, input.extent.0), 1) + 4, input.extent.0), 1) + 4, input.extent.0), 1) + 4, input.extent.0), 1) + 4, input.extent.0), 1) + 4, input.extent.0) in (max(min(max(min(max(min(max(min(max(min(max(min(max(t418, 1) + 4, input.extent.0), 1) + 4, input.extent.0), 1) + 4, input.extent.0), 1) + 4, input.extent.0), 1) + 4, input.extent.0), max(min(max(min(max(min(max(min(max(max(min(max(t418, 1) + 4, input.extent.0), t418), 1) + 4, input.extent.0), 1) + 4, input.extent.0), 1) + 4, input.extent.0), 1) + 4, input.extent.0), 1) + 1) + 3, input.extent.0), 1) + -1)) - (let t407 = min(max(min(min(min(max(foo$13.min.0, -1), min(max(min(input.extent.0 + -1, foo$13.min.0), 0), max(min(foo$13.min.0 + 2, input.extent.0), 1) + -1)) + 2, input.extent.0) + -1, foo$13.min.0), 0), max(min(min(max(min(min(max(foo$13.min.0, -1) + 2, input.extent.0) + -1, foo$13.min.0), 0), max(min(min(max(min(input.extent.0 + -1, foo$13.min.0), 0), max(min(foo$13.min.0 + 2, input.extent.0), 1) + -1) + 2, input.extent.0), 1) + -1) + 2, input.extent.0), 1) + -1) in (let t411 = min(max(min(min(min(min(max(min(input.extent.0 + -1, t407), 0), max(min(t407 + 2, input.extent.0), 1) + -1), t407) + 2, input.extent.0) + -1, t407), 0), max(min(min(max(min(min(t407 + 2, input.extent.0) + -1, t407), 0), max(min(min(max(min(input.extent.0 + -1, t407), 0), max(min(t407 + 2, input.extent.0), 1) + -1) + 2, input.extent.0), 1) + -1) + 2, input.extent.0), 1) + -1) in (let t423 = min(max(min(min(min(min(max(min(input.extent.0 + -1, t411), 0), max(min(t411 + 2, input.extent.0), 1) + -1), t411) + 2, input.extent.0) + -1, t411), 0), max(min(min(max(min(min(t411 + 2, input.extent.0) + -1, t411), 0), max(min(min(max(min(input.extent.0 + -1, t411), 0), max(min(t411 + 2, input.extent.0), 1) + -1) + 2, input.extent.0), 1) + -1) + 2, input.extent.0), 1) + -1) in (let t427 = 
min(max(min(min(min(min(max(min(input.extent.0 + -1, t423), 0), max(min(t423 + 2, input.extent.0), 1) + -1), t423) + 2, input.extent.0) + -1, t423), 0), max(min(min(max(min(min(t423 + 2, input.extent.0) + -1, t423), 0), max(min(min(max(min(input.extent.0 + -1, t423), 0), max(min(t423 + 2, input.extent.0), 1) + -1) + 2, input.extent.0), 1) + -1) + 2, input.extent.0), 1) + -1) in min(max(min(min(t427 + 2, input.extent.0) + -1, t427), 0), max(min(min(min(max(min(input.extent.0 + -1, t427), 0), min(min(max(min(t427 + 2, input.extent.0), 1), min(min(min(max(min(input.extent.0 + -1, t427), 0), min(min(max(min(t427 + 2, input.extent.0), 1), min(min(min(max(min(input.extent.0 + -1, t427), 0), min(min(max(min(t427 + 2, input.extent.0), 1), min(min(min(max(min(input.extent.0 + -1, t427), 0), min(min(max(min(t427 + 2, input.extent.0), 1), min(min(min(max(min(input.extent.0 + -1, t427), 0), min(max(min(t427 + 2, input.extent.0), 1), min(max(min(input.extent.0 + -1, t427), 0), min(max(min(t427 + 2, input.extent.0), 1), min(max(min(input.extent.0 + -1, t427), 0), min(max(min(t427 + 2, input.extent.0), 1), max(min(input.extent.0 + -1, t427), 0) + 1)) + 1)) + 1)), max(min(t427 + 2, input.extent.0), 1) + -1), min(max(min(input.extent.0 + -1, t427), 0), max(min(t427 + 2, input.extent.0), 1) + -1) + 2)), max(min(input.extent.0 + -1, t427), 0) + 1)), max(min(t427 + 2, input.extent.0), 1) + -1), min(max(min(input.extent.0 + -1, t427), 0), max(min(t427 + 2, input.extent.0), 1) + -1) + 2)), max(min(input.extent.0 + -1, t427), 0) + 1)), max(min(t427 + 2, input.extent.0), 1) + -1), min(max(min(input.extent.0 + -1, t427), 0), max(min(t427 + 2, input.extent.0), 1) + -1) + 2)), max(min(input.extent.0 + -1, t427), 0) + 1)), max(min(t427 + 2, input.extent.0), 1) + -1), min(max(min(input.extent.0 + -1, t427), 0), max(min(t427 + 2, input.extent.0), 1) + -1) + 2)), max(min(input.extent.0 + -1, t427), 0) + 1)), max(min(t427 + 2, input.extent.0), 1) + -1) + 2, input.extent.0), 1) + -1)))))) + 1)

which becomes insanely huge afterwards (too large to bother pasting here -- something like 7MB of text when the Expr is printed), and that's after the call to simplify().

EDIT: the corresponding scope at that point:

 scope:
{
  output$1.s0.x.max
  output$1.s0.x.min
  output$1.s0.y.max
  output$1.s0.y.min
  output$1.s1.r8$x.max
  output$1.s1.r8$x.min
  output$1.s1.x.max
  output$1.s1.x.min
  output$1.s1.y.max
  output$1.s1.y.min
  output$10.s0.x.max
  output$10.s0.x.min
  output$10.s0.y.max
  output$10.s0.y.min
  output$10.s1.r125$x.max
  output$10.s1.r125$x.min
  output$10.s1.x.max
  output$10.s1.x.min
  output$10.s1.y.max
  output$10.s1.y.min
  output$11.s0.x.max
  output$11.s0.x.min
  output$11.s0.y.max
  output$11.s0.y.min
  output$11.s1.r138$x.max
  output$11.s1.r138$x.min
  output$11.s1.x.max
  output$11.s1.x.min
  output$11.s1.y.max
  output$11.s1.y.min
  output$12.s0.x.max
  output$12.s0.x.min
  output$12.s0.y.max
  output$12.s0.y.min
  output$12.s1.r151$x.max
  output$12.s1.r151$x.min
  output$12.s1.x.max
  output$12.s1.x.min
  output$12.s1.y.max
  output$12.s1.y.min
  output$13.s0.x.max
  output$13.s0.x.min
  output$13.s0.y.max
  output$13.s0.y.min
  output$13.s1.r164$x.max
  output$13.s1.r164$x.min
  output$13.s1.x.max
  output$13.s1.x.min
  output$13.s1.y.max
  output$13.s1.y.min
  output$14.s0.x.max
  output$14.s0.x.min
  output$14.s0.y.max
  output$14.s0.y.min
  output$14.s1.r177$x.max
  output$14.s1.r177$x.min
  output$14.s1.x.max
  output$14.s1.x.min
  output$14.s1.y.max
  output$14.s1.y.min
  output$2.s0.x.max
  output$2.s0.x.min
  output$2.s0.y.max
  output$2.s0.y.min
  output$2.s1.r21$x.max
  output$2.s1.r21$x.min
  output$2.s1.x.max
  output$2.s1.x.min
  output$2.s1.y.max
  output$2.s1.y.min
  output$3.s0.x.max
  output$3.s0.x.min
  output$3.s0.y.max
  output$3.s0.y.min
  output$3.s1.r34$x.max
  output$3.s1.r34$x.min
  output$3.s1.x.max
  output$3.s1.x.min
  output$3.s1.y.max
  output$3.s1.y.min
  output$4.s0.x.max
  output$4.s0.x.min
  output$4.s0.y.max
  output$4.s0.y.min
  output$4.s1.r47$x.max
  output$4.s1.r47$x.min
  output$4.s1.x.max
  output$4.s1.x.min
  output$4.s1.y.max
  output$4.s1.y.min
  output$5.s0.x.max
  output$5.s0.x.min
  output$5.s0.y.max
  output$5.s0.y.min
  output$5.s1.r60$x.max
  output$5.s1.r60$x.min
  output$5.s1.x.max
  output$5.s1.x.min
  output$5.s1.y.max
  output$5.s1.y.min
  output$6.s0.x.max
  output$6.s0.x.min
  output$6.s0.y.max
  output$6.s0.y.min
  output$6.s1.r73$x.max
  output$6.s1.r73$x.min
  output$6.s1.x.max
  output$6.s1.x.min
  output$6.s1.y.max
  output$6.s1.y.min
  output$7.s0.x.max
  output$7.s0.x.min
  output$7.s0.y.max
  output$7.s0.y.min
  output$7.s1.r86$x.max
  output$7.s1.r86$x.min
  output$7.s1.x.max
  output$7.s1.x.min
  output$7.s1.y.max
  output$7.s1.y.min
  output$8.s0.x.max
  output$8.s0.x.min
  output$8.s0.y.max
  output$8.s0.y.min
  output$8.s1.r99$x.max
  output$8.s1.r99$x.min
  output$8.s1.x.max
  output$8.s1.x.min
  output$8.s1.y.max
  output$8.s1.y.min
  output$9.s0.x.max
  output$9.s0.x.min
  output$9.s0.y.max
  output$9.s0.y.min
  output$9.s1.r112$x.max
  output$9.s1.r112$x.min
  output$9.s1.x.max
  output$9.s1.x.min
  output$9.s1.y.max
  output$9.s1.y.min
  foo$1.s0.x.max
  foo$1.s0.x.max.s
  foo$1.s0.x.min
  foo$1.s0.y.max
  foo$1.s0.y.max.s
  foo$1.s0.y.min
  foo$10.s0.x.max
  foo$10.s0.x.max.s
  foo$10.s0.x.min
  foo$10.s0.y.max
  foo$10.s0.y.max.s
  foo$10.s0.y.min
  foo$11.s0.x.max
  foo$11.s0.x.max.s
  foo$11.s0.x.min
  foo$11.s0.y.max
  foo$11.s0.y.max.s
  foo$11.s0.y.min
  foo$12.s0.x.max
  foo$12.s0.x.max.s
  foo$12.s0.x.min
  foo$12.s0.y.max
  foo$12.s0.y.max.s
  foo$12.s0.y.min
  foo$13.s0.x.max
  foo$13.s0.x.min
  foo$13.s0.y.max
  foo$13.s0.y.min
  foo$2.s0.x.max
  foo$2.s0.x.max.s
  foo$2.s0.x.min
  foo$2.s0.y.max
  foo$2.s0.y.max.s
  foo$2.s0.y.min
  foo$3.s0.x.max
  foo$3.s0.x.max.s
  foo$3.s0.x.min
  foo$3.s0.y.max
  foo$3.s0.y.max.s
  foo$3.s0.y.min
  foo$4.s0.x.max
  foo$4.s0.x.max.s
  foo$4.s0.x.min
  foo$4.s0.y.max
  foo$4.s0.y.max.s
  foo$4.s0.y.min
  foo$5.s0.x.max
  foo$5.s0.x.max.s
  foo$5.s0.x.min
  foo$5.s0.y.max
  foo$5.s0.y.max.s
  foo$5.s0.y.min
  foo$6.s0.x.max
  foo$6.s0.x.max.s
  foo$6.s0.x.min
  foo$6.s0.y.max
  foo$6.s0.y.max.s
  foo$6.s0.y.min
  foo$7.s0.x.max
  foo$7.s0.x.max.s
  foo$7.s0.x.min
  foo$7.s0.y.max
  foo$7.s0.y.max.s
  foo$7.s0.y.min
  foo$8.s0.x.max
  foo$8.s0.x.max.s
  foo$8.s0.x.min
  foo$8.s0.y.max
  foo$8.s0.y.max.s
  foo$8.s0.y.min
  foo$9.s0.x.max
  foo$9.s0.x.max.s
  foo$9.s0.x.min
  foo$9.s0.y.max
  foo$9.s0.y.max.s
  foo$9.s0.y.min
  foo.s0.x.max
  foo.s0.x.max.s
  foo.s0.x.min
  foo.s0.y.max
  foo.s0.y.max.s
  foo.s0.y.min
}

@rootjalex
Member Author

Is it possible to know if the scope has any values actually set? Sorry, I didn't realize that printing scope only prints the names, I need the corresponding intervals as well.

@rootjalex
Member Author

Definitely seems like the issue here is substitute_some_lets. Not sure exactly what the count should be, but 100 is too high.

@steven-johnson
Contributor

> Is it possible to know if the scope has any values actually set? Sorry, I didn't realize that printing scope only prints the names, I need the corresponding intervals as well.

{
  output$1.s0.x.max: 0, (void *)pos_inf
  output$1.s0.x.min: 0, (void *)pos_inf
  output$1.s0.y.max: 0, (void *)pos_inf
  output$1.s0.y.min: 0, (void *)pos_inf
  output$1.s1.r8$x.max: 3, 3
  output$1.s1.r8$x.min: 0, 0
  output$1.s1.x.max: 0, (void *)pos_inf
  output$1.s1.x.min: 0, (void *)pos_inf
  output$1.s1.y.max: 0, (void *)pos_inf
  output$1.s1.y.min: 0, (void *)pos_inf
  output$10.s0.x.max: 0, (void *)pos_inf
  output$10.s0.x.min: 0, (void *)pos_inf
  output$10.s0.y.max: 0, (void *)pos_inf
  output$10.s0.y.min: 0, (void *)pos_inf
  output$10.s1.r125$x.max: 3, 3
  output$10.s1.r125$x.min: 0, 0
  output$10.s1.x.max: 0, (void *)pos_inf
  output$10.s1.x.min: 0, (void *)pos_inf
  output$10.s1.y.max: 0, (void *)pos_inf
  output$10.s1.y.min: 0, (void *)pos_inf
  output$11.s0.x.max: 0, (void *)pos_inf
  output$11.s0.x.min: 0, (void *)pos_inf
  output$11.s0.y.max: 0, (void *)pos_inf
  output$11.s0.y.min: 0, (void *)pos_inf
  output$11.s1.r138$x.max: 3, 3
  output$11.s1.r138$x.min: 0, 0
  output$11.s1.x.max: 0, (void *)pos_inf
  output$11.s1.x.min: 0, (void *)pos_inf
  output$11.s1.y.max: 0, (void *)pos_inf
  output$11.s1.y.min: 0, (void *)pos_inf
  output$12.s0.x.max: 0, (void *)pos_inf
  output$12.s0.x.min: 0, (void *)pos_inf
  output$12.s0.y.max: 0, (void *)pos_inf
  output$12.s0.y.min: 0, (void *)pos_inf
  output$12.s1.r151$x.max: 3, 3
  output$12.s1.r151$x.min: 0, 0
  output$12.s1.x.max: 0, (void *)pos_inf
  output$12.s1.x.min: 0, (void *)pos_inf
  output$12.s1.y.max: 0, (void *)pos_inf
  output$12.s1.y.min: 0, (void *)pos_inf
  output$13.s0.x.max: 0, (void *)pos_inf
  output$13.s0.x.min: 0, (void *)pos_inf
  output$13.s0.y.max: 0, (void *)pos_inf
  output$13.s0.y.min: 0, (void *)pos_inf
  output$13.s1.r164$x.max: 3, 3
  output$13.s1.r164$x.min: 0, 0
  output$13.s1.x.max: 0, (void *)pos_inf
  output$13.s1.x.min: 0, (void *)pos_inf
  output$13.s1.y.max: 0, (void *)pos_inf
  output$13.s1.y.min: 0, (void *)pos_inf
  output$14.s0.x.max: (void *)neg_inf, (void *)pos_inf
  output$14.s0.x.min: (void *)neg_inf, (void *)pos_inf
  output$14.s0.y.max: (void *)neg_inf, (void *)pos_inf
  output$14.s0.y.min: (void *)neg_inf, (void *)pos_inf
  output$14.s1.r177$x.max: 3, 3
  output$14.s1.r177$x.min: 0, 0
  output$14.s1.x.max: (void *)neg_inf, (void *)pos_inf
  output$14.s1.x.min: (void *)neg_inf, (void *)pos_inf
  output$14.s1.y.max: (void *)neg_inf, (void *)pos_inf
  output$14.s1.y.min: (void *)neg_inf, (void *)pos_inf
  output$2.s0.x.max: 0, (void *)pos_inf
  output$2.s0.x.min: 0, (void *)pos_inf
  output$2.s0.y.max: 0, (void *)pos_inf
  output$2.s0.y.min: 0, (void *)pos_inf
  output$2.s1.r21$x.max: 3, 3
  output$2.s1.r21$x.min: 0, 0
  output$2.s1.x.max: 0, (void *)pos_inf
  output$2.s1.x.min: 0, (void *)pos_inf
  output$2.s1.y.max: 0, (void *)pos_inf
  output$2.s1.y.min: 0, (void *)pos_inf
  output$3.s0.x.max: 0, (void *)pos_inf
  output$3.s0.x.min: 0, (void *)pos_inf
  output$3.s0.y.max: 0, (void *)pos_inf
  output$3.s0.y.min: 0, (void *)pos_inf
  output$3.s1.r34$x.max: 3, 3
  output$3.s1.r34$x.min: 0, 0
  output$3.s1.x.max: 0, (void *)pos_inf
  output$3.s1.x.min: 0, (void *)pos_inf
  output$3.s1.y.max: 0, (void *)pos_inf
  output$3.s1.y.min: 0, (void *)pos_inf
  output$4.s0.x.max: 0, (void *)pos_inf
  output$4.s0.x.min: 0, (void *)pos_inf
  output$4.s0.y.max: 0, (void *)pos_inf
  output$4.s0.y.min: 0, (void *)pos_inf
  output$4.s1.r47$x.max: 3, 3
  output$4.s1.r47$x.min: 0, 0
  output$4.s1.x.max: 0, (void *)pos_inf
  output$4.s1.x.min: 0, (void *)pos_inf
  output$4.s1.y.max: 0, (void *)pos_inf
  output$4.s1.y.min: 0, (void *)pos_inf
  output$5.s0.x.max: 0, (void *)pos_inf
  output$5.s0.x.min: 0, (void *)pos_inf
  output$5.s0.y.max: 0, (void *)pos_inf
  output$5.s0.y.min: 0, (void *)pos_inf
  output$5.s1.r60$x.max: 3, 3
  output$5.s1.r60$x.min: 0, 0
  output$5.s1.x.max: 0, (void *)pos_inf
  output$5.s1.x.min: 0, (void *)pos_inf
  output$5.s1.y.max: 0, (void *)pos_inf
  output$5.s1.y.min: 0, (void *)pos_inf
  output$6.s0.x.max: 0, (void *)pos_inf
  output$6.s0.x.min: 0, (void *)pos_inf
  output$6.s0.y.max: 0, (void *)pos_inf
  output$6.s0.y.min: 0, (void *)pos_inf
  output$6.s1.r73$x.max: 3, 3
  output$6.s1.r73$x.min: 0, 0
  output$6.s1.x.max: 0, (void *)pos_inf
  output$6.s1.x.min: 0, (void *)pos_inf
  output$6.s1.y.max: 0, (void *)pos_inf
  output$6.s1.y.min: 0, (void *)pos_inf
  output$7.s0.x.max: 0, (void *)pos_inf
  output$7.s0.x.min: 0, (void *)pos_inf
  output$7.s0.y.max: 0, (void *)pos_inf
  output$7.s0.y.min: 0, (void *)pos_inf
  output$7.s1.r86$x.max: 3, 3
  output$7.s1.r86$x.min: 0, 0
  output$7.s1.x.max: 0, (void *)pos_inf
  output$7.s1.x.min: 0, (void *)pos_inf
  output$7.s1.y.max: 0, (void *)pos_inf
  output$7.s1.y.min: 0, (void *)pos_inf
  output$8.s0.x.max: 0, (void *)pos_inf
  output$8.s0.x.min: 0, (void *)pos_inf
  output$8.s0.y.max: 0, (void *)pos_inf
  output$8.s0.y.min: 0, (void *)pos_inf
  output$8.s1.r99$x.max: 3, 3
  output$8.s1.r99$x.min: 0, 0
  output$8.s1.x.max: 0, (void *)pos_inf
  output$8.s1.x.min: 0, (void *)pos_inf
  output$8.s1.y.max: 0, (void *)pos_inf
  output$8.s1.y.min: 0, (void *)pos_inf
  output$9.s0.x.max: 0, (void *)pos_inf
  output$9.s0.x.min: 0, (void *)pos_inf
  output$9.s0.y.max: 0, (void *)pos_inf
  output$9.s0.y.min: 0, (void *)pos_inf
  output$9.s1.r112$x.max: 3, 3
  output$9.s1.r112$x.min: 0, 0
  output$9.s1.x.max: 0, (void *)pos_inf
  output$9.s1.x.min: 0, (void *)pos_inf
  output$9.s1.y.max: 0, (void *)pos_inf
  output$9.s1.y.min: 0, (void *)pos_inf
  foo$1.s0.x.max: 0, (void *)pos_inf
  foo$1.s0.x.max.s: (void *)neg_inf, (void *)pos_inf
  foo$1.s0.x.min: 0, (void *)pos_inf
  foo$1.s0.y.max: 0, (void *)pos_inf
  foo$1.s0.y.max.s: (void *)neg_inf, (void *)pos_inf
  foo$1.s0.y.min: 0, (void *)pos_inf
  foo$10.s0.x.max: 0, (void *)pos_inf
  foo$10.s0.x.max.s: (void *)neg_inf, (void *)pos_inf
  foo$10.s0.x.min: 0, (void *)pos_inf
  foo$10.s0.y.max: 0, (void *)pos_inf
  foo$10.s0.y.max.s: (void *)neg_inf, (void *)pos_inf
  foo$10.s0.y.min: 0, (void *)pos_inf
  foo$11.s0.x.max: 0, (void *)pos_inf
  foo$11.s0.x.max.s: (void *)neg_inf, (void *)pos_inf
  foo$11.s0.x.min: 0, (void *)pos_inf
  foo$11.s0.y.max: 0, (void *)pos_inf
  foo$11.s0.y.max.s: (void *)neg_inf, (void *)pos_inf
  foo$11.s0.y.min: 0, (void *)pos_inf
  foo$12.s0.x.max: 0, (void *)pos_inf
  foo$12.s0.x.max.s: (void *)neg_inf, (void *)pos_inf
  foo$12.s0.x.min: 0, (void *)pos_inf
  foo$12.s0.y.max: 0, (void *)pos_inf
  foo$12.s0.y.max.s: (void *)neg_inf, (void *)pos_inf
  foo$12.s0.y.min: 0, (void *)pos_inf
  foo$13.s0.x.max: (void *)neg_inf, (void *)pos_inf
  foo$13.s0.x.min: (void *)neg_inf, (void *)pos_inf
  foo$13.s0.y.max: (void *)neg_inf, (void *)pos_inf
  foo$13.s0.y.min: (void *)neg_inf, (void *)pos_inf
  foo$2.s0.x.max: 0, (void *)pos_inf
  foo$2.s0.x.max.s: (void *)neg_inf, (void *)pos_inf
  foo$2.s0.x.min: 0, (void *)pos_inf
  foo$2.s0.y.max: 0, (void *)pos_inf
  foo$2.s0.y.max.s: (void *)neg_inf, (void *)pos_inf
  foo$2.s0.y.min: 0, (void *)pos_inf
  foo$3.s0.x.max: 0, (void *)pos_inf
  foo$3.s0.x.max.s: (void *)neg_inf, (void *)pos_inf
  foo$3.s0.x.min: 0, (void *)pos_inf
  foo$3.s0.y.max: 0, (void *)pos_inf
  foo$3.s0.y.max.s: (void *)neg_inf, (void *)pos_inf
  foo$3.s0.y.min: 0, (void *)pos_inf
  foo$4.s0.x.max: 0, (void *)pos_inf
  foo$4.s0.x.max.s: (void *)neg_inf, (void *)pos_inf
  foo$4.s0.x.min: 0, (void *)pos_inf
  foo$4.s0.y.max: 0, (void *)pos_inf
  foo$4.s0.y.max.s: (void *)neg_inf, (void *)pos_inf
  foo$4.s0.y.min: 0, (void *)pos_inf
  foo$5.s0.x.max: 0, (void *)pos_inf
  foo$5.s0.x.max.s: (void *)neg_inf, (void *)pos_inf
  foo$5.s0.x.min: 0, (void *)pos_inf
  foo$5.s0.y.max: 0, (void *)pos_inf
  foo$5.s0.y.max.s: (void *)neg_inf, (void *)pos_inf
  foo$5.s0.y.min: 0, (void *)pos_inf
  foo$6.s0.x.max: 0, (void *)pos_inf
  foo$6.s0.x.max.s: (void *)neg_inf, (void *)pos_inf
  foo$6.s0.x.min: 0, (void *)pos_inf
  foo$6.s0.y.max: 0, (void *)pos_inf
  foo$6.s0.y.max.s: (void *)neg_inf, (void *)pos_inf
  foo$6.s0.y.min: 0, (void *)pos_inf
  foo$7.s0.x.max: 0, (void *)pos_inf
  foo$7.s0.x.max.s: (void *)neg_inf, (void *)pos_inf
  foo$7.s0.x.min: 0, (void *)pos_inf
  foo$7.s0.y.max: 0, (void *)pos_inf
  foo$7.s0.y.max.s: (void *)neg_inf, (void *)pos_inf
  foo$7.s0.y.min: 0, (void *)pos_inf
  foo$8.s0.x.max: 0, (void *)pos_inf
  foo$8.s0.x.max.s: (void *)neg_inf, (void *)pos_inf
  foo$8.s0.x.min: 0, (void *)pos_inf
  foo$8.s0.y.max: 0, (void *)pos_inf
  foo$8.s0.y.max.s: (void *)neg_inf, (void *)pos_inf
  foo$8.s0.y.min: 0, (void *)pos_inf
  foo$9.s0.x.max: 0, (void *)pos_inf
  foo$9.s0.x.max.s: (void *)neg_inf, (void *)pos_inf
  foo$9.s0.x.min: 0, (void *)pos_inf
  foo$9.s0.y.max: 0, (void *)pos_inf
  foo$9.s0.y.max.s: (void *)neg_inf, (void *)pos_inf
  foo$9.s0.y.min: 0, (void *)pos_inf
  foo.s0.x.max: 0, (void *)pos_inf
  foo.s0.x.max.s: (void *)neg_inf, (void *)pos_inf
  foo.s0.x.min: 0, (void *)pos_inf
  foo.s0.y.max: 0, (void *)pos_inf
  foo.s0.y.max.s: (void *)neg_inf, (void *)pos_inf
  foo.s0.y.min: 0, (void *)pos_inf
}

@rootjalex
Member Author

@steven-johnson Do you think you could run Google testing again? I think my tests just never had such enormous expressions, the example you provided should end reasonably fast now.

@steven-johnson
Contributor

> @steven-johnson Do you think you could run Google testing again? I think my tests just never had such enormous expressions, the example you provided should end reasonably fast now.

Testing now, but hiding an apparently-critical constant (the count arg to substitute_some_lets) as a default-value argument seems suboptimal. If 16 is a good value for everything, make it internal to the function and name and comment on it. If it's not a good value for everything, don't give it a default value.

@steven-johnson
Contributor

(Tests look good so far, stand by)

Comment thread src/ConstantBounds.cpp Outdated
Comment thread src/ConstantBounds.cpp Outdated
Comment thread src/ConstantBounds.cpp Outdated
Comment thread src/ConstantBounds.cpp Outdated
Comment thread src/ConstantBounds.cpp Outdated
* Visitor for removing terms that are completely unbounded from
* a min or a max.
*/
class StripUnboundedTerms : public IRMutator {
Member

It looks like this class can only correctly handle the listed IR node types, and everything it would make a mess on is explicitly listed below. It seems like this would be safer as a recursive function - when someone adds a new IR node type this class will be implicitly broken and they won't know to update it.

Member Author

I agree. I will do this, but probably won't be able to today, hopefully tomorrow.

Member Author

Changed in 7a26154. Would love your thoughts on the Let case; that's the one I'm a little unsure about.
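To make the suggestion in this thread concrete, here is a minimal sketch of stripping unbounded terms as a plain recursive function over a toy IR, applied to the `max(x, y) - (z + y)` example from the PR description. It is not the PR's implementation: `Kind`, `Node`, `strip_unbounded_sketch`, and the caller-supplied `unbounded` set are hypothetical names, and real Halide IR has many more node types (including Let, which is omitted here).

```cpp
#include <cassert>
#include <memory>
#include <set>
#include <string>
#include <utility>

// Hypothetical miniature IR for illustration only (not Halide's Expr).
enum class Kind { Var, Add, Sub, Min, Max };

struct Node;
using ExprPtr = std::shared_ptr<const Node>;

struct Node {
    Kind kind;
    std::string name;  // used only by Kind::Var
    ExprPtr a, b;      // operands of binary nodes
};

ExprPtr var(const std::string &name) {
    return std::make_shared<Node>(Node{Kind::Var, name, nullptr, nullptr});
}

ExprPtr bin(Kind kind, ExprPtr a, ExprPtr b) {
    return std::make_shared<Node>(Node{kind, "", std::move(a), std::move(b)});
}

enum class Direction { Lower, Upper };

Direction flip(Direction d) {
    return d == Direction::Lower ? Direction::Upper : Direction::Lower;
}

bool is_unbounded_var(const ExprPtr &e, const std::set<std::string> &unbounded) {
    return e->kind == Kind::Var && unbounded.count(e->name) > 0;
}

// Every node kind is handled explicitly; with no `default` label, a compiler
// can warn when a new Kind is added but not handled here.
ExprPtr strip_unbounded_sketch(const ExprPtr &e, Direction d,
                               const std::set<std::string> &unbounded) {
    switch (e->kind) {
    case Kind::Var:
        return e;
    case Kind::Add:
        return bin(Kind::Add, strip_unbounded_sketch(e->a, d, unbounded),
                   strip_unbounded_sketch(e->b, d, unbounded));
    case Kind::Sub:
        // A lower bound of a - b needs an upper bound of b, and vice versa.
        return bin(Kind::Sub, strip_unbounded_sketch(e->a, d, unbounded),
                   strip_unbounded_sketch(e->b, flip(d), unbounded));
    case Kind::Min:
    case Kind::Max: {
        ExprPtr a = strip_unbounded_sketch(e->a, d, unbounded);
        ExprPtr b = strip_unbounded_sketch(e->b, d, unbounded);
        // max(a, b) >= each operand and min(a, b) <= each operand, so dropping
        // an operand is sound for a lower bound of max or an upper bound of min.
        bool can_drop = (e->kind == Kind::Max) == (d == Direction::Lower);
        if (can_drop && is_unbounded_var(a, unbounded)) return b;
        if (can_drop && is_unbounded_var(b, unbounded)) return a;
        return bin(e->kind, a, b);
    }
    }
    return e;  // not reached for the kinds above
}

// Small printer so the result can be inspected.
std::string show(const ExprPtr &e) {
    switch (e->kind) {
    case Kind::Var: return e->name;
    case Kind::Add: return "(" + show(e->a) + " + " + show(e->b) + ")";
    case Kind::Sub: return "(" + show(e->a) + " - " + show(e->b) + ")";
    case Kind::Min: return "min(" + show(e->a) + ", " + show(e->b) + ")";
    case Kind::Max: return "max(" + show(e->a) + ", " + show(e->b) + ")";
    }
    return "?";
}
```

Dropping an operand can only loosen the bound, never invalidate it, so the rewrite stays safe even if the `unbounded` set is imprecise. Whether stripping a given operand is worthwhile is the part the PR's own heuristics decide.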

Comment thread src/ConstantBounds.cpp Outdated
Comment thread src/ConstantBounds.cpp
Comment thread src/ConstantBounds.cpp
}

// Two-finger O(n) algorithm for simplifying sums.
std::vector<AffineTerm> simplify_linear_summation(const std::vector<AffineTerm> &terms) {
Member

I don't understand how this linear-time algorithm works. An explanatory comment would help.

Member Author

Added in cf591f7; let me know if it is satisfactory.

Member

It's sorted by deep equality comparison? I just see a should_commute. should_commute is only a partial order.

Member Author

No, I think we had a previous discussion about deep_equality being too expensive and should_commute being good enough but we could make it stronger. Should we change that?

Member

Let's change to deep equality and see what the impact is.
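For readers following the ordering question in this thread, here is a minimal sketch of like-term cancellation by sorting and then merging neighbours. It is not the PR's `simplify_linear_summation`: `Term` is a hypothetical stand-in for `AffineTerm`, and a string `key` stands in for a total order on expressions (roughly what a deep comparison would provide). The merge pass is linear, but it relies on the preceding sort placing equal terms next to each other, which a partial order does not guarantee.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for an affine term: coefficient * (expression named by key).
struct Term {
    int64_t coefficient;
    std::string key;
};

std::vector<Term> cancel_like_terms(std::vector<Term> terms) {
    // A total order on keys guarantees that equal terms become adjacent.
    std::stable_sort(terms.begin(), terms.end(),
                     [](const Term &l, const Term &r) { return l.key < r.key; });

    std::vector<Term> result;
    for (const Term &t : terms) {
        if (!result.empty() && result.back().key == t.key) {
            // Same expression as the previous term: fold the coefficients.
            result.back().coefficient += t.coefficient;
        } else {
            // Starting a new group; discard the previous one if it cancelled to zero.
            if (!result.empty() && result.back().coefficient == 0) {
                result.pop_back();
            }
            result.push_back(t);
        }
    }
    // The final group may also have cancelled.
    if (!result.empty() && result.back().coefficient == 0) {
        result.pop_back();
    }
    return result;
}
```

With only a partial order, two copies of the same expression can end up separated by an incomparable term, and the single merge pass would then miss the cancellation.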

Comment thread src/ConstantBounds.cpp
};

// Used to bound the number of substitutions.
class SubstituteSomeLets : public IRMutator {
Member

What does this do on something like: let b = a + a in let c = b + b in let d = c + c in ....

Does it produce an Expr of size 2 ^ count?

Member Author

Yes, I believe it does; the count only bounds the number of substitutions, not the accumulated size. Would you prefer different behavior?

Member

Wait, something looks fishy here. You put values into the scope after mutating them, but then when you pull them out of the scope you mutate them a second time. I can't figure out what this would do with shadowed lets.

IMO we need to bound the actual size of the generated expression. 2^16 is still pretty damn big.

Member Author

I think the second mutation shouldn't happen, thanks for the catch.

Do you have any recommendations on how to bound overall size? I guess we could track multiplicative substitutions, i.e. in the example you gave, substituting c counts for 3 substitutions, as it has 2 b substitutions plus the actual c substitution.

Member Author

Are shadowed lets allowed in Halide syntax? count_var_uses definitely assumes that there are no shadowed lets. I thought #6583 was explicitly trying to prevent shadowed lets because we don't allow them.

Member

Conclusion from offline discussion:

  • This may be called in sliding window which is before removing shadowed lets.
  • Should be feasible to track total # substitutions by including an int in the Scope that counts the number of substitutions that occurred when mutating the RHS of that let
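The bookkeeping in the second point can be sketched as follows. This is only an illustration of the counting idea, not the PR's code: `LetBinding` and `substitution_costs` are hypothetical, and a let chain is flattened to a list of (name, variables used by the value). The sketch assumes the accounting suggested earlier in this thread, where substituting one use of a let costs one substitution plus however many substitutions were performed inside that let's value.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical flattened let chain, outermost binding first.
// `let c = b + b` is represented as {"c", {"b", "b"}}.
struct LetBinding {
    std::string name;
    std::vector<std::string> uses;
};

// For each binding, computes how many substitutions happen inside its value
// when every use of an earlier let is substituted in full.
std::map<std::string, uint64_t> substitution_costs(const std::vector<LetBinding> &lets) {
    std::map<std::string, uint64_t> cost;
    for (const LetBinding &let : lets) {
        uint64_t total = 0;
        for (const std::string &used : let.uses) {
            auto it = cost.find(used);
            if (it != cost.end()) {
                // One substitution for `used` itself, plus those inside its value.
                total += 1 + it->second;
            }
            // Names with no enclosing let are free variables and cost nothing.
        }
        // Recorded after the loop, so a value that mentions its own name refers
        // to the enclosing binding, and a later let with the same name shadows it.
        cost[let.name] = total;
    }
    return cost;
}
```

For the chain `let b = a + a in let c = b + b in let d = c + c`, this gives costs 0, 2, and 6, so substituting one use of c counts as 3. A mutator could compare `1 + cost` against its remaining budget before substituting, which bounds the size of the result rather than just the number of let names visited.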

Member Author

Should we only substitute in the final expression?
i.e. suppose we have: (let x = w + w in (let y = x + x in (let z = y + y in z + z))) and we have a substitution count of 1. Should the resultant Expr be (let x = w + w in (let y = (w + w) + x in (let z = y + y in z + z))) or should it be (let x = w + w in (let y = x + x in (let z = y + y in (y + y) + z))) or neither? I assume (and prefer) the latter, but I don't think there's a good scheme to make that happen.

Also, if we have the count to do full substitutions, should the resultant Expr be:
(let x = w + w in (let y = x + x in (let z = y + y in (((w + w) + (w + w)) + ((w + w) + (w + w))) + (((w + w) + (w + w)) + ((w + w) + (w + w)))))) or should the substitutions also happen inside of the let values? i.e. the Expr should become (let x = w + w in (let y = (w + w) + (w + w) in ....? I think this also ties into the above question - if we have a count of 1, and the scope for Exprs says that z = ((w + w) + (w + w)) + ((w + w) + (w + w)) then we can't substitute that in because the tracked cost is too expensive, so perhaps the resultant Expr should be unchanged in this case?

Comment thread src/ConstantBounds.cpp Outdated
Comment thread src/ConstantBounds.cpp
// We want to recurse through Lets.
auto [v_count, v_new] = strip_unbounded(op->value, direction, scope, var_uses);
auto [b_count, b_new] = strip_unbounded(op->body, direction, scope, var_uses);
// We might want to only count b_count.
Member

Yes, this should just be b_count

@steven-johnson
Contributor

Is this ready to land (pending green)?

@rootjalex
Member Author

No - I still need to address Andrew's point on using deep_equality, and still need feedback on the substitution.

Sorry for dropping the ball on this, I was out for a conference for a week and have been playing catch-up on other duties in the week since. Will try to make more progress on it this week.

@steven-johnson
Contributor

No worries, just trying to catch up on things after returning from my own vacation -- no rush on this from my perspective.

@rootjalex
Member Author

Thanks! I hope it was a fun vacation!

@steven-johnson
Contributor

Are we hoping that this will allow us to remove the HL_PERMIT_FAILED_UNROLL hack?

@steven-johnson
Contributor

Hey, just a periodic status check on this one.

@rootjalex
Member Author

Sorry - I'm getting a tad behind, and this PR has been on the back burner for a bit. I will try to get to it in the next few weeks.

@steven-johnson
Contributor

Monday Morning Review Ping -- where does this PR stand?

@rootjalex
Member Author

It still has a bit of work to be done, and I have not managed to get to it yet. I haven't forgotten it, and will aim to address it by the end of September (I know that's far away and I apologize, but I am currently in paper-writing mode + am about to move across the country)

3 participants