Skip to content

Implement UnknownSizeFrame for locals with unknown size#125491

Open
snickolls-arm wants to merge 10 commits intodotnet:mainfrom
snickolls-arm:stack-frame
Open

Implement UnknownSizeFrame for locals with unknown size#125491
snickolls-arm wants to merge 10 commits intodotnet:mainfrom
snickolls-arm:stack-frame

Conversation

@snickolls-arm
Copy link
Copy Markdown
Contributor

Implements a simple bump allocator for TYP_SIMD and TYP_MASK. Locals are allocated to this space when lvaIsUnknownSizeLocal is true for the variable.

The frame is implemented on ARM64 as two homogenenous blocks containing either TYP_SIMD or TYP_MASK locals. The x19 register is reserved for addressing locals in the block. Updates codegen for SVE memory transfer instructions to accept indices in multiples of the vector length (or VL / 8 for masks) instead of deriving them from the size of the local.

Implements a simple bump allocator for TYP_SIMD and TYP_MASK. Locals are
allocated to this space when lvaIsUnknownSizeLocal is true for the variable.

The frame is implemented on ARM64 as two homogenenous blocks containing either
TYP_SIMD or TYP_MASK locals. The x19 register is reserved for addressing locals
in the block. Updates codegen for SVE memory transfer instructions to accept
indices in multiples of the vector length (or VL / 8 for masks) instead of
deriving them from the size of the local.
@github-actions github-actions bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Mar 12, 2026
@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Mar 12, 2026
@dotnet-policy-service
Copy link
Copy Markdown
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@jakobbotsch jakobbotsch self-requested a review March 16, 2026 09:46
@snickolls-arm
Copy link
Copy Markdown
Contributor Author

Looking at the throughput differences, the performance of lvaIsUnknownSizeLocal is probably not very good, maybe adding another property bit to LclVarDsc could help with this.

Comment thread src/coreclr/jit/codegenarmarch.cpp
Comment thread src/coreclr/jit/compiler.hpp Outdated
Comment thread src/coreclr/jit/lclvars.cpp
Comment on lines +5549 to +5559
void Compiler::lvaInitUnknownSizeFrame()
{
#if defined(FEATURE_SIMD) && defined(TARGET_ARM64)
compUsesUnknownSizeFrame = false;
#ifdef DEBUG
unkSizeFrame.isFinalized = false;
#endif
unkSizeFrame.nMask = 0;
unkSizeFrame.nVector = 0;
#endif
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can these just be initialized inline with their declarations, and this function removed?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As lvaAssignFrameOffsets is called repeatedly over a few states, we need to repeatedly recompute vector allocations for each state as well. So I think this function is required.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to reopen this because I think there is some weird behaviour in my branch related. I'm seeing an assertion fire in UnknownSizeFrame::GetAddressingOffset where index < nVector because nVector == 0. This function is used because I'm assuming that all the locals are going to be traversed and allocated on every stage of the state machine, though I'm not sure this is true? Potentially this shouldn't be called on FINAL_FRAME_LAYOUT?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively there are other local variables that pass lvaIsUnknownSizeLocal but are not traversed during lvaAssignVirtualFrameOffsetsToLocals that I would need to handle specially.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively there are other local variables that pass lvaIsUnknownSizeLocal but are not traversed during lvaAssignVirtualFrameOffsetsToLocals that I would need to handle specially.

That sounds more likely.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is fields of promoted structures, that are also address exposed, that are causing the problem, because they are placed onto the usual stack and need to maintain their layout. So these fields that have TYP_SIMD are going to pass lvaIsUnknownSizeLocal but won't be allocated in the UnknownSizeFrame.

I think I need to tighten the criteria for considering a local on this new frame to consider these sorts of scenarios as well.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

083efd0 addresses this.

Comment thread src/coreclr/jit/lsrabuild.cpp Outdated
Comment thread src/coreclr/jit/regset.cpp
* Add function header
* Create UnknownSizeFrame::GetAddressingOffset and revert changes to lvaFrameAddress
* Use rsSetRegsModified and remove kill ref position
@jakobbotsch
Copy link
Copy Markdown
Member

Looking at the throughput differences, the performance of lvaIsUnknownSizeLocal is probably not very good, maybe adding another property bit to LclVarDsc could help with this.

Can you post the detailed throughput analysis with per-function information?

@snickolls-arm
Copy link
Copy Markdown
Contributor Author

Looking at the throughput differences, the performance of lvaIsUnknownSizeLocal is probably not very good, maybe adding another property bit to LclVarDsc could help with this.

Can you post the detailed throughput analysis with per-function information?

Just checking which analysis you mean here? As the only detailed output I can find from tpdiff is in CSV form. Is there tooling available for processing this?

@jakobbotsch
Copy link
Copy Markdown
Member

Looking at the throughput differences, the performance of lvaIsUnknownSizeLocal is probably not very good, maybe adding another property bit to LclVarDsc could help with this.

Can you post the detailed throughput analysis with per-function information?

Just checking which analysis you mean here? As the only detailed output I can find from tpdiff is in CSV form. Is there tooling available for processing this?

Sorry, I got my contributors confused. We have some tooling that can break throughput regressions/improvements down by JIT function, contributed by @SingleAccretion. However, it is x64-host only (based on Intel PIN).

I collected the data on benchmarks.run_pgo and it looks like this:

Base: 86438410243, Diff: 86528734455, +0.1045%

33418435 : +20.56%  : 36.23% : +0.0387% : public: void __cdecl Compiler::lvaAssignFrameOffsets(enum Compiler::FrameLayoutState)                                                                                                                                                    
15698295 : +15.34%  : 17.02% : +0.0182% : protected: void __cdecl CodeGen::genFnProlog(void)                                                                                                                                                                                       
13341778 : +8.18%   : 14.46% : +0.0154% : public: void __cdecl emitter::emitIns_R_S(enum instruction, enum emitAttr, enum _regNumber_enum, int, int)                                                                                                                               
11303530 : +4.61%   : 12.25% : +0.0131% : public: void __cdecl Compiler::lvaAssignVirtualFrameOffsetsToLocals(void)                                                                                                                                                                
9485383  : +11.28%  : 10.28% : +0.0110% : public: void __cdecl emitter::emitIns_S_R(enum instruction, enum emitAttr, enum _regNumber_enum, int, int)                                                                                                                               
5717661  : +7.00%   : 6.20%  : +0.0066% : protected: void __cdecl CodeGen::genCheckUseBlockInit(void)                                                                                                                                                                              
471445   : +4.61%   : 0.51%  : +0.0005% : protected: void __cdecl CodeGen::genFinalizeFrame(void)                                                                                                                                                                                  
441974   : +0.28%   : 0.48%  : +0.0005% : public: __cdecl Compiler::Compiler(class ArenaAllocatorT<struct JitMemKindTraits> *, struct CORINFO_METHOD_STRUCT_*, class ICorJitInfo *, struct CORINFO_METHOD_INFO *, struct InlineInfo *)                                             
377156   : +2.79%   : 0.41%  : +0.0004% : protected: void __cdecl CodeGen::genPushCalleeSavedRegisters(enum _regNumber_enum, bool *)                                                                                                                                               
282867   : +4.91%   : 0.31%  : +0.0003% : private: void __cdecl LinearScan::setFrameType(void)                                                                                                                                                                                     
246151   : +4.35%   : 0.27%  : +0.0003% : protected: void __cdecl CodeGen::genZeroInitFrame(int, int, enum _regNumber_enum, bool *)                                                                                                                                                
92508    : +0.04%   : 0.10%  : +0.0001% : public: static void __cdecl BitSetOps<unsigned __int64 *, 1, class Compiler *, class TrackedVarBitSetTraits>::LivenessD(class Compiler *, unsigned __int64 *&, unsigned __int64 *const, unsigned __int64 *const, unsigned __int64 *const)
-96660   : -0.09%   : 0.10%  : -0.0001% : protected: void __cdecl JitExpandArray<unsigned char>::InitializeRange(unsigned int, unsigned int)                                                                                                                                       
-660023  : -100.00% : 0.72%  : -0.0008% : public: void __cdecl Compiler::funSetCurrentFunc(unsigned int)                                                                                                                                                                           

These regressions are correspondingly larger in tier0 code where it matters more, but I think we can live with it and if we really care address it in a follow-up.

I pushed a merge to resolve the merge conflict.

@jakobbotsch jakobbotsch self-requested a review April 14, 2026 09:39
@snickolls-arm
Copy link
Copy Markdown
Contributor Author

Looking at the throughput differences, the performance of lvaIsUnknownSizeLocal is probably not very good, maybe adding another property bit to LclVarDsc could help with this.

Can you post the detailed throughput analysis with per-function information?

Just checking which analysis you mean here? As the only detailed output I can find from tpdiff is in CSV form. Is there tooling available for processing this?

Sorry, I got my contributors confused. We have some tooling that can break throughput regressions/improvements down by JIT function, contributed by @SingleAccretion. However, it is x64-host only (based on Intel PIN).

I collected the data on benchmarks.run_pgo and it looks like this:

Base: 86438410243, Diff: 86528734455, +0.1045%

33418435 : +20.56%  : 36.23% : +0.0387% : public: void __cdecl Compiler::lvaAssignFrameOffsets(enum Compiler::FrameLayoutState)                                                                                                                                                    
15698295 : +15.34%  : 17.02% : +0.0182% : protected: void __cdecl CodeGen::genFnProlog(void)                                                                                                                                                                                       
13341778 : +8.18%   : 14.46% : +0.0154% : public: void __cdecl emitter::emitIns_R_S(enum instruction, enum emitAttr, enum _regNumber_enum, int, int)                                                                                                                               
11303530 : +4.61%   : 12.25% : +0.0131% : public: void __cdecl Compiler::lvaAssignVirtualFrameOffsetsToLocals(void)                                                                                                                                                                
9485383  : +11.28%  : 10.28% : +0.0110% : public: void __cdecl emitter::emitIns_S_R(enum instruction, enum emitAttr, enum _regNumber_enum, int, int)                                                                                                                               
5717661  : +7.00%   : 6.20%  : +0.0066% : protected: void __cdecl CodeGen::genCheckUseBlockInit(void)                                                                                                                                                                              
471445   : +4.61%   : 0.51%  : +0.0005% : protected: void __cdecl CodeGen::genFinalizeFrame(void)                                                                                                                                                                                  
441974   : +0.28%   : 0.48%  : +0.0005% : public: __cdecl Compiler::Compiler(class ArenaAllocatorT<struct JitMemKindTraits> *, struct CORINFO_METHOD_STRUCT_*, class ICorJitInfo *, struct CORINFO_METHOD_INFO *, struct InlineInfo *)                                             
377156   : +2.79%   : 0.41%  : +0.0004% : protected: void __cdecl CodeGen::genPushCalleeSavedRegisters(enum _regNumber_enum, bool *)                                                                                                                                               
282867   : +4.91%   : 0.31%  : +0.0003% : private: void __cdecl LinearScan::setFrameType(void)                                                                                                                                                                                     
246151   : +4.35%   : 0.27%  : +0.0003% : protected: void __cdecl CodeGen::genZeroInitFrame(int, int, enum _regNumber_enum, bool *)                                                                                                                                                
92508    : +0.04%   : 0.10%  : +0.0001% : public: static void __cdecl BitSetOps<unsigned __int64 *, 1, class Compiler *, class TrackedVarBitSetTraits>::LivenessD(class Compiler *, unsigned __int64 *&, unsigned __int64 *const, unsigned __int64 *const, unsigned __int64 *const)
-96660   : -0.09%   : 0.10%  : -0.0001% : protected: void __cdecl JitExpandArray<unsigned char>::InitializeRange(unsigned int, unsigned int)                                                                                                                                       
-660023  : -100.00% : 0.72%  : -0.0008% : public: void __cdecl Compiler::funSetCurrentFunc(unsigned int)                                                                                                                                                                           

These regressions are correspondingly larger in tier0 code where it matters more, but I think we can live with it and if we really care address it in a follow-up.

I pushed a merge to resolve the merge conflict.

No problem, thanks for this. I suppose we see fewer locals on frame at higher optimization level so the impact isn't as strong?

UnknownSizeFrame

Adds lvaIsAllocatedOnUnknownSizeFrame with a stronger criteria for
what locals should or shouldn't be allocated in the unknown size frame.
Namely promoted struct fields that are address exposed should not be
allocated there, because the layout of the structure in memory needs to
be preserved.
Comment thread src/coreclr/jit/lclvars.cpp Outdated
Comment on lines +5604 to +5610
// If a promoted field is address exposed, we need to preserve the
// layout of the structure. Therefore we can't allocate variable size
// fields to a different part of the stack frame.
if (varDsc->lvIsStructField && varDsc->IsAddressExposed())
{
return false;
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be done always when the parent struct is dependently promoted. Use lvaGetPromotionType(varDsc->lvParentField)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, the loops should already be checking for lvaIsFieldOfDependentlyPromotedStruct. You probably just need to move the existing lvaIsUnknownSizeLocal below those checks.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, I've tried moving the checks around a bit but the results seem inconsistent.

I think what I'll do is revert that latest change, and provide a new getter/setter for lvStkOffs in LclVarDsc specifically used for variables that exist on this new frame. Then I can add assertions to GetStackOffset and SetStackOffset and find areas that might be adjusting the stack offset when they shouldn't be. This has found generatePatchpointInfo for example, which I haven't considered yet.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the above works OK, are we able to merge this patch as is and follow up with further work? As I have further patches to push but I've held off as there's dependency with this one, e.g. spill temps.

…on the"

This reverts commit 0a448e16be8a929311243e3172a8b0c8f7793969.
These locals need to be treated specially as they allocated to a different part
of the frame. Adds some assertions to the original accessors to prevent use with
variable sized locals.
//
unsigned varNum = lclNum;

// Variable-sized locals reside in a different part of the stack frame.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This brings up OSR support for this kind of stack frame, which I hadn't yet run into. I suppose it's not possible to just skip over these kinds of variables. I would have to either disable OSR for the method, or add some support to allow for copying over the extra frame space as well?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI arm-sve Work related to arm64 SVE/SVE2 support community-contribution Indicates that the PR has been added by a community member

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants