Merged
Contributor
The excellent PR description of what's going on now (and what we used to do) should be captured in a code comment somewhere.
rootjalex
reviewed
Dec 12, 2022
Some backends don't like non-power-of-two vectors. Do two overlapping half-sized loads and shuffle instead of one funny-sized load.
c455859 to df3bf08
steven-johnson
approved these changes
Dec 15, 2022
Contributor
steven-johnson
left a comment
LGTM -- don't see any red flags inside Google.
rootjalex
reviewed
Dec 15, 2022
}
// TODO: We do not yet handle nested vectorization here for
// ramps which have not already collapsed. We could potentially
// handle more interesting types of shuffle than simple flat slices.
Member
Should this TODO be an open issue?
rootjalex
reviewed
Dec 15, 2022
// The loop body definitely runs
// TODO: worry about different iterations of the loop somehow not
// providing the evidence we thought it did.
Member
(same as above) should this TODO be an issue? Or a blocker for this PR? I'm also not quite sure what the TODO means.
ardier pushed a commit to ardier/Halide-mutation that referenced this pull request, Mar 3, 2024:
* Add a pass to do explicit densification of strided loads
* densify more types of strided load
* Reorder downsample in local laplacian for slightly better performance
* Move allocation padding into the IR. Still WIP.
* Simplify concat_bits handling
* Use evidence from parent scopes to densify
* Disallow padding allocations with custom new expressions
* Add test for parent scopes
* Remove debugging prints. Avoid nested ramps.
* Avoid parent scope loops
* Update cmakefiles
* Fix for large_buffers
* Pad stack allocations too
* Restore vld2/3/4 generation on non-Apple ARM chips
* Appease clang-format and clang-tidy
* Silence clang-tidy
* Better comments
* Comment improvements
* Nuke code that reads out of bounds
* Fix stage_strided_loads test
* Change strategy for loads from external buffers: some backends don't like non-power-of-two vectors, so do two overlapping half-sized loads and shuffle instead of one funny-sized load.
* Add explanatory comment to ARM backend
* Fix cpp backend shuffling
* Fix missing msan annotations
* Magnify heap cost effect in stack_vs_heap performance test
* Address review comments
* clang-tidy
* Fix for when same load node occurs in two different allocate nodes
This PR adds a new compiler pass that converts strided loads into dense loads followed by shuffles.
For a stride of two, the trick is to do a dense load of twice the size and then extract either the even or the odd lanes. This was previously done in codegen, where it was challenging: at that point it's hard to tell whether the double-sized load is safe, because it reads one element beyond or before the original load. We used the alignment of the ramp base to try to tell whether it was safe to shift the load backwards, and we padded internal allocations so that, for those at least, it was safe to shift forwards. Unfortunately, the alignment of the ramp base is usually unknown unless you know something about the strides of the input, and padding allocations was a serious wart in our memory allocators.
This PR instead actively looks for evidence elsewhere in the Stmt (at some location which definitely executes whenever the load being transformed executes) that it's safe to read further forwards or backwards in memory. The evidence is in the form of a load at the same base address with a different constant offset. It also clusters groups of these loads so that they do the same dense load each and extract the appropriate slice of lanes. For loads from external buffers, it just does a shorter load, and for loads from internal allocations that we can't shift forwards or backwards, it adds padding to the allocation explicitly, via a new padding field on Allocate nodes.
Edit: Some backends don't like non-power-of-two sized vectors, so for loads from external buffers, we now split the load into two overlapping dense loads of half the size, then shuffle out appropriate lanes from each and concat the results.