Fix ARM64 interface dispatch cache torn read #126346
MichalStrehovsky merged 3 commits into dotnet:main from
Conversation
On ARM64, the CHECK_CACHE_ENTRY macro read m_pInstanceType and m_pTargetCode from a cache entry using two separate ldr instructions separated by a control dependency (cmp/bne). ARM64's weak memory model does not order loads across control dependencies, so the hardware can speculatively satisfy the second load (target) before the first (type) commits.

When a concurrent thread atomically populates the entry via stlxp/casp (UpdateCacheEntryAtomically), the reader can observe the new m_pInstanceType but the old m_pTargetCode (0), then br to address 0.

Fix by using ldp to load both fields in a single instruction (single-copy atomic on FEAT_LSE2 / ARMv8.4+ hardware), plus a cbz guard to catch torn reads on pre-LSE2 hardware where ldp pair atomicity is not architecturally guaranteed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Tagging subscribers to this area: @agocke, @dotnet/ilc-contrib
Pull request overview
Fixes a potential torn-read race in the ARM64 cached interface dispatch fast-path that could lead to branching to address 0 when reading a concurrently populated cache entry.
Changes:
- Replace two independent loads of cache-entry fields with a single `ldp` pair load to avoid reordering across control dependencies on ARM64.
- Add a `cbz` guard on the loaded target to treat observed torn reads (type updated, target still 0) as a cache miss on pre-LSE2 hardware.
- Mirror the changes in both the GAS (`.S`) and ARMASM (`.asm`) implementations.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/coreclr/runtime/arm64/StubDispatch.S | Updates ARM64 stub macro to use ldp + cbz to avoid torn cache-entry reads. |
| src/coreclr/runtime/arm64/StubDispatch.asm | Same logic as above for the ARMASM variant to keep implementations consistent. |
Instead of emitting an add instruction per cache entry when the ldp offset falls outside the [-512, 504] range, rebase x9 once when the threshold is crossed. This keeps the per-entry probe to a single ldp for all entries in the 32/64-slot stubs. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
/azp run runtime-nativeaot-outerloop

Azure Pipelines successfully started running 1 pipeline(s).
Apple ARM64 platforms all have FEAT_LSE2, which makes ldp single-copy atomic for 16-byte aligned pairs. The cbz torn-read guard is unnecessary there. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
I also looked at whether the new scheme over at #123252 would have this issue in the monomorphic path (which looks similar to this one), and it seems like it doesn't, because of its ldar use. I guess ldar could be another option here.
/azp run runtime-nativeaot-outerloop

Azure Pipelines successfully started running 1 pipeline(s).
/ba-g the GC test issue is unrelated and I fixed it. the leg timeouts are not interesting because they're not ARM64 |
/backport to release/10.0
Started backporting to |
Backport of #126346 to release/10.0

/cc @MichalStrehovsky

## Customer Impact

- [x] Customer reported
- [ ] Found internally

Reported by a first party. A torn read in ARM64 interface dispatch can cause a dispatch to null. An interface dispatch cell is a pair of a MethodTable and a code address. If the MethodTable matches the type of `this`, we call the code address. The pair is written atomically; however, the read is not atomic, and on ARM64 a torn read can happen that matches the MethodTable but still sees a 0 code address.

## Regression

- [ ] Yes
- [x] No

Not a regression; this bug has existed ever since ARM64 support was added to .NET Native for UWP apps in 2017 or so.

## Testing

This is a race condition that requires a lot of luck to hit. Testing is basically "code review".

## Risk

The risk is low: instead of loading two pointers individually, we load them together and add an extra check for the race condition case.

**IMPORTANT**: If this backport is for a servicing release, please verify that:

- For .NET 8 and .NET 9: The PR target branch is `release/X.0-staging`, not `release/X.0`.
- For .NET 10+: The PR target branch is `release/X.0` (no `-staging` suffix).

## Package authoring no longer needed in .NET 9

**IMPORTANT**: Starting with .NET 9, you no longer need to edit a NuGet package's csproj to enable building and bump the version. Keep in mind that we still need package authoring in .NET 8 and older versions.

---------

Co-authored-by: Michal Strehovský <MichalStrehovsky@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Fixes #126345