Implement UnknownSizeFrame for locals with unknown size#125491
Implement UnknownSizeFrame for locals with unknown size#125491snickolls-arm wants to merge 10 commits intodotnet:mainfrom
Conversation
Implements a simple bump allocator for TYP_SIMD and TYP_MASK. Locals are allocated to this space when lvaIsUnknownSizeLocal is true for the variable. The frame is implemented on ARM64 as two homogenenous blocks containing either TYP_SIMD or TYP_MASK locals. The x19 register is reserved for addressing locals in the block. Updates codegen for SVE memory transfer instructions to accept indices in multiples of the vector length (or VL / 8 for masks) instead of deriving them from the size of the local.
|
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch |
|
Looking at the throughput differences, the performance of |
| void Compiler::lvaInitUnknownSizeFrame() | ||
| { | ||
| #if defined(FEATURE_SIMD) && defined(TARGET_ARM64) | ||
| compUsesUnknownSizeFrame = false; | ||
| #ifdef DEBUG | ||
| unkSizeFrame.isFinalized = false; | ||
| #endif | ||
| unkSizeFrame.nMask = 0; | ||
| unkSizeFrame.nVector = 0; | ||
| #endif | ||
| } |
There was a problem hiding this comment.
Can these just be initialized inline with their declarations, and this function removed?
There was a problem hiding this comment.
As lvaAssignFrameOffsets is called repeatedly over a few states, we need to repeatedly recompute vector allocations for each state as well. So I think this function is required.
There was a problem hiding this comment.
I'm going to reopen this because I think there is some weird behaviour in my branch related. I'm seeing an assertion fire in UnknownSizeFrame::GetAddressingOffset where index < nVector because nVector == 0. This function is used because I'm assuming that all the locals are going to be traversed and allocated on every stage of the state machine, though I'm not sure this is true? Potentially this shouldn't be called on FINAL_FRAME_LAYOUT?
There was a problem hiding this comment.
Alternatively there are other local variables that pass lvaIsUnknownSizeLocal but are not traversed during lvaAssignVirtualFrameOffsetsToLocals that I would need to handle specially.
There was a problem hiding this comment.
Alternatively there are other local variables that pass
lvaIsUnknownSizeLocalbut are not traversed duringlvaAssignVirtualFrameOffsetsToLocalsthat I would need to handle specially.
That sounds more likely.
There was a problem hiding this comment.
I think it is fields of promoted structures, that are also address exposed, that are causing the problem, because they are placed onto the usual stack and need to maintain their layout. So these fields that have TYP_SIMD are going to pass lvaIsUnknownSizeLocal but won't be allocated in the UnknownSizeFrame.
I think I need to tighten the criteria for considering a local on this new frame to consider these sorts of scenarios as well.
* Add function header * Create UnknownSizeFrame::GetAddressingOffset and revert changes to lvaFrameAddress * Use rsSetRegsModified and remove kill ref position
Can you post the detailed throughput analysis with per-function information? |
Just checking which analysis you mean here? As the only detailed output I can find from |
Sorry, I got my contributors confused. We have some tooling that can break throughput regressions/improvements down by JIT function, contributed by @SingleAccretion. However, it is x64-host only (based on Intel PIN). I collected the data on benchmarks.run_pgo and it looks like this: Base: 86438410243, Diff: 86528734455, +0.1045%
33418435 : +20.56% : 36.23% : +0.0387% : public: void __cdecl Compiler::lvaAssignFrameOffsets(enum Compiler::FrameLayoutState)
15698295 : +15.34% : 17.02% : +0.0182% : protected: void __cdecl CodeGen::genFnProlog(void)
13341778 : +8.18% : 14.46% : +0.0154% : public: void __cdecl emitter::emitIns_R_S(enum instruction, enum emitAttr, enum _regNumber_enum, int, int)
11303530 : +4.61% : 12.25% : +0.0131% : public: void __cdecl Compiler::lvaAssignVirtualFrameOffsetsToLocals(void)
9485383 : +11.28% : 10.28% : +0.0110% : public: void __cdecl emitter::emitIns_S_R(enum instruction, enum emitAttr, enum _regNumber_enum, int, int)
5717661 : +7.00% : 6.20% : +0.0066% : protected: void __cdecl CodeGen::genCheckUseBlockInit(void)
471445 : +4.61% : 0.51% : +0.0005% : protected: void __cdecl CodeGen::genFinalizeFrame(void)
441974 : +0.28% : 0.48% : +0.0005% : public: __cdecl Compiler::Compiler(class ArenaAllocatorT<struct JitMemKindTraits> *, struct CORINFO_METHOD_STRUCT_*, class ICorJitInfo *, struct CORINFO_METHOD_INFO *, struct InlineInfo *)
377156 : +2.79% : 0.41% : +0.0004% : protected: void __cdecl CodeGen::genPushCalleeSavedRegisters(enum _regNumber_enum, bool *)
282867 : +4.91% : 0.31% : +0.0003% : private: void __cdecl LinearScan::setFrameType(void)
246151 : +4.35% : 0.27% : +0.0003% : protected: void __cdecl CodeGen::genZeroInitFrame(int, int, enum _regNumber_enum, bool *)
92508 : +0.04% : 0.10% : +0.0001% : public: static void __cdecl BitSetOps<unsigned __int64 *, 1, class Compiler *, class TrackedVarBitSetTraits>::LivenessD(class Compiler *, unsigned __int64 *&, unsigned __int64 *const, unsigned __int64 *const, unsigned __int64 *const)
-96660 : -0.09% : 0.10% : -0.0001% : protected: void __cdecl JitExpandArray<unsigned char>::InitializeRange(unsigned int, unsigned int)
-660023 : -100.00% : 0.72% : -0.0008% : public: void __cdecl Compiler::funSetCurrentFunc(unsigned int) These regressions are correspondingly larger in tier0 code where it matters more, but I think we can live with it and if we really care address it in a follow-up. I pushed a merge to resolve the merge conflict. |
No problem, thanks for this. I suppose we see fewer locals on frame at higher optimization level so the impact isn't as strong? |
UnknownSizeFrame Adds lvaIsAllocatedOnUnknownSizeFrame with a stronger criteria for what locals should or shouldn't be allocated in the unknown size frame. Namely promoted struct fields that are address exposed should not be allocated there, because the layout of the structure in memory needs to be preserved.
| // If a promoted field is address exposed, we need to preserve the | ||
| // layout of the structure. Therefore we can't allocate variable size | ||
| // fields to a different part of the stack frame. | ||
| if (varDsc->lvIsStructField && varDsc->IsAddressExposed()) | ||
| { | ||
| return false; | ||
| } |
There was a problem hiding this comment.
This should be done always when the parent struct is dependently promoted. Use lvaGetPromotionType(varDsc->lvParentField)
There was a problem hiding this comment.
In fact, the loops should already be checking for lvaIsFieldOfDependentlyPromotedStruct. You probably just need to move the existing lvaIsUnknownSizeLocal below those checks.
There was a problem hiding this comment.
I see, I've tried moving the checks around a bit but the results seem inconsistent.
I think what I'll do is revert that latest change, and provide a new getter/setter for lvStkOffs in LclVarDsc specifically used for variables that exist on this new frame. Then I can add assertions to GetStackOffset and SetStackOffset and find areas that might be adjusting the stack offset when they shouldn't be. This has found generatePatchpointInfo for example, which I haven't considered yet.
There was a problem hiding this comment.
If the above works OK, are we able to merge this patch as is and follow up with further work? As I have further patches to push but I've held off as there's dependency with this one, e.g. spill temps.
…on the" This reverts commit 0a448e16be8a929311243e3172a8b0c8f7793969.
These locals need to be treated specially as they allocated to a different part of the frame. Adds some assertions to the original accessors to prevent use with variable sized locals.
| // | ||
| unsigned varNum = lclNum; | ||
|
|
||
| // Variable-sized locals reside in a different part of the stack frame. |
There was a problem hiding this comment.
This brings up OSR support for this kind of stack frame, which I hadn't yet run into. I suppose it's not possible to just skip over these kinds of variables. I would have to either disable OSR for the method, or add some support to allow for copying over the extra frame space as well?
Implements a simple bump allocator for
TYP_SIMDandTYP_MASK. Locals are allocated to this space whenlvaIsUnknownSizeLocalis true for the variable.The frame is implemented on ARM64 as two homogenenous blocks containing either
TYP_SIMDorTYP_MASKlocals. The x19 register is reserved for addressing locals in the block. Updates codegen for SVE memory transfer instructions to accept indices in multiples of the vector length (or VL / 8 for masks) instead of deriving them from the size of the local.