Conversation
@abadams -- should I pull this into Google and do some torture testing before landing, or are we pretty confident this is good?

Torture testing inside Google would be pretty helpful, thanks.

Testing in Google, I find only one new failure, but... it appears to be a hang (or near-infinite loop) inside

Yeah, we definitely get stuck ~forever in

So far, what I'm finding is that we have a fairly complex Expr as the input to remove_unbounded_terms, which becomes insanely huge afterwards (too large to bother pasting here -- something like 7MB of text when the Expr is printed), and that's after the call to simplify(). EDIT: the corresponding scope at that point:

Is it possible to know if the scope has any values actually set? Sorry, I didn't realize that printing the scope only prints the names; I need the corresponding intervals as well.

Definitely seems like the issue here is

@steven-johnson Do you think you could run Google testing again? I think my tests just never had such enormous expressions, the example you provided should end reasonably fast now. |
Testing now, but hiding an apparently-critical constant (the

(Tests look good so far, stand by)
     * Visitor for removing terms that are completely unbounded from
     * a min or a max.
     */
    class StripUnboundedTerms : public IRMutator {
It looks like this class can only correctly handle the listed IR node types, and everything it would make a mess on is explicitly listed below. It seems like this would be safer as a recursive function - when someone adds a new IR node type this class will be implicitly broken and they won't know to update it.
I agree. I will do this, but probably won't be able to today, hopefully tomorrow.
changed in 7a26154. Would love your thoughts on the Let case, that's the one I'm a little unsure on.
    }
    // Two-finger O(n) algorithm for simplifying sums.
    std::vector<AffineTerm> simplify_linear_summation(const std::vector<AffineTerm> &terms) {
I don't understand how this linear-time algorithm works. An explanatory comment would help.
added in cf591f7, let me know if it is satisfactory
It's sorted by deep equality comparison? I just see a should_commute. should_commute is only a partial order.
No, I think we had a previous discussion about deep_equality being too expensive and should_commute being good enough but we could make it stronger. Should we change that?
Let's change to deep equality and see what the impact is.
    };
    // Used to bound the number of substitutions.
    class SubstituteSomeLets : public IRMutator {
What does this do on something like: let b = a + a in let c = b + b in let d = c + c in ....
Does it produce an Expr of size 2 ^ count?
Yes, I believe it does; the count only bounds the number of substitutions, not the accumulated size. Would you prefer different behavior?
Wait, something looks fishy here. You put values into the scope after mutating them, but then when you pull them out of the scope you mutate them a second time. I can't figure out what this would do with shadowed lets.
IMO we need to bound the actual size of the generated expression. 2^16 is still pretty damn big.
I think the second mutation shouldn't happen, thanks for the catch.
Do you have any recommendations on how to bound overall size? I guess we could track multiplicative substitutions, i.e. in the example you gave, substituting c counts for 3 substitutions, as it has 2 b substitutions plus the actual c substitution.
Are shadowed lets allowed in Halide syntax? count_var_uses definitely assumes that there are no shadowed lets. I thought #6583 was explicitly trying to prevent shadowed lets because we don't allow them.
Conclusion from offline discussion:
- This may be called in sliding window which is before removing shadowed lets.
- Should be feasible to track total # substitutions by including an int in the Scope that counts the number of substitutions that occurred when mutating the RHS of that let
Should we only substitute in the final expression?
i.e. suppose we have: (let x = w + w in (let y = x + x in (let z = y + y in z + z))) and we have substitution count = 1. Should the resultant Expr be (let x = w + w in (let y = (w + w) + x in (let z = y + y in z + z))) or should it be (let x = w + w in (let y = x + x in (let z = y + y in (y + y) + z))) or neither? I assume (and prefer) the latter, but I don't think there's a good scheme to make that happen.
Also, if the count is large enough to do full substitutions, should the resultant Expr be:
(let x = w + w in (let y = x + x in (let z = y + y in (((w + w) + (w + w)) + ((w + w) + (w + w))) + (((w + w) + (w + w)) + ((w + w) + (w + w)))))) or should the substitutions also happen inside of the let values? i.e. the Expr should become (let x = w + w in (let y = (w + w) + (w + w) in ....? I think this also ties into the above question - if we have a count of 1, and the scope for Exprs says that z = ((w + w) + (w + w)) + ((w + w) + (w + w)) then we can't substitute that in because the tracked cost is too expensive, so perhaps the resultant Expr should be unchanged in this case?
    // We want to recurse through Lets.
    auto [v_count, v_new] = strip_unbounded(op->value, direction, scope, var_uses);
    auto [b_count, b_new] = strip_unbounded(op->body, direction, scope, var_uses);
    // We might want to only count b_count.
Yes, this should just be b_count
Is this ready to land (pending green)?

No - I still need to address Andrew's point on using deep_equality, and still need feedback on the substitution. Sorry for dropping the ball on this, I was out at a conference for a week and have been playing catch-up on other duties in the week since. Will try to make more progress on it this week.

No worries, just trying to catch up on things after returning from my own vacation -- no rush on this from my perspective.

Thanks! I hope it was a fun vacation!

Are we hoping that this will allow us to remove the

Hey, just a periodic status check on this one.

Sorry - I'm getting a tad behind, and this PR has been on the back burner for a bit. I will try to get to it in the next few weeks.

Monday Morning Review Ping -- where does this PR stand?

It still has a bit of work to be done, and I have not managed to get to it yet. I haven't forgotten it, and will aim to address it by the end of September (I know that's far away and I apologize, but I am currently in paper-writing mode + am about to move across the country) |
This PR provides a series of methods for removing/simplifying correlated expressions for find_constant_bounds: n=100 (edit: n=16). We don't want to always substitute all lets, but some constant bounds can be calculated with just a small number of substitutions.

This method would note that x is unbounded, and therefore the lhs of the max can be stripped, producing:

This allows us to push divisions inside additions/subtractions, which can improve the ability to cancel like terms in a lot of generated equations.
@abadams ran a series of experiments with randomly-generated schedules (n=256) on a series of apps (bgu, camera_pipe, conv_layer, depthwise_separable_conv, harris, hist, iir_blur, lens_blur, max_filter, stencil_chain, unsharp), and here is a summary of the results (percentages are total across the benchmarks):
Fewer failed unrolls: bgu (5 -> 3), camera_pipe (69 -> 58), harris (197 -> 160), lens_blur (158 -> 12), max_filter (338 -> 242), unsharp (110 -> 32)
Less memory: camera_pipe (0.6%), depthwise_separable_conv (0.3%), hist (0.1%), lens_blur (0.6%), unsharp (0.2%)
Fewer malloc calls: camera_pipe (173592 -> 171612), harris (608943 -> 608655), lens_blur (144324 -> 141888), stencil_chain (701085 -> 698712), unsharp (127428 -> 127284)
Some small runtime improvements (0.05% to 0.6%) : bgu, camera_pipe, harris, hist, iir_blur, lens_blur, max_filter, stencil_chain
More memory: harris (0.06%), iir_blur (0.002%), stencil_chain (0.1%)
The runtime improvements might not be statistically significant, but I think better loop unrolling and improved stack allocations are important contributions.
For apps with no improved unrolling, compilation times increase by a small amount (~3%). With improved unrolling there are larger increases, but these are mostly because generating the unrolled code takes longer in both our codegen and LLVM's codegen.
This work was part of a project with @abadams and @shoaibkamil.