[Android] In-Proc Crash Reporter by mdh1418 · Pull Request #126916 · dotnet/runtime

mdh1418 · 2026-04-14T22:37:01Z

[Note] The first commit adds the skeleton with stubbed fields, the following commits resolve those fields, and a final commit emits the JSON file.

Android CoreCLR does not currently have the same out-of-proc createdump experience available on other platforms. That makes native crashes, aborts, and mixed managed/native failures significantly harder to diagnose in real applications.

This PR adds an opt-in in-proc crash reporter for Android CoreCLR that emits a JSON-formatted crash report modeled after createdump's CrashReportWriter.

When enabled, the runtime writes the crash report directly from the crash path to:

1. logcat / console output
2. an optional *.crashreport.json file when DbgMiniDumpName is set.

The long-term intent is for this reporting path to be async-signal-safe. This PR starts that work by making the lower-risk / low-hanging pieces fit that model where practical, while leaving the more complex runtime-state publication and hardening work to follow-up PRs.

Enabling the crash reporter

The crash reporter is opt-in.

Enable JSON crash reporting

DOTNET_EnableCrashReport=1

Example: write the report to a file as well

DOTNET_EnableCrashReport=1
DOTNET_DbgMiniDumpName=/data/data/<package>/files/dotnet_crash_%p

When configured this way, the runtime will also attempt to write:

/data/data/<package>/files/dotnet_crash_<pid>.crashreport.json

Current design note

This PR is intentionally the first step, not the complete hardening story.

The payload shape and reporting path are designed with async-signal-safety as the target, but today the implementation is a mix of:

crash-time logic that is already close to signal-safe
best-effort runtime inspection for richer managed diagnostics

Follow-up PRs will continue hardening the remaining pieces so the implementation better matches the intended strict crash-path constraints.

Example crash report shapes

https://gist.github.com/mdh1418/a893c7dd953bf725569b22abc10e59c1

Follow-up work

Cross-platform support (iOS/osx)
Emitting the crash report to logcat/console

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

dotnet-policy-service · 2026-04-14T22:38:29Z

Tagging subscribers to this area: @steveisok, @tommcdon, @dotnet/dotnet-diag
See info in area-owners.md if you want to be subscribed.

Copilot

Pull request overview

Adds an opt-in, in-process crash reporter for Android CoreCLR that emits a createdump-shaped JSON crash report to logcat/stderr and optionally to a *.crashreport.json file, with VM-provided managed thread/stack/exception callbacks.

Changes:

Introduces PAL-side crash report generation (JSON writer, module/process name helpers, crash report generator).
Wires VM callbacks for managed thread enumeration/stack walking/exception extraction into the PAL crash reporter on Android startup.
Integrates crash report triggering into the Android crash/abort signal path and adds Android-only build plumbing.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 7 comments.


| File | Description |
| --- | --- |
| src/coreclr/vm/crashreportstackwalker.h | Declares Android-only VM registration hook for crash report stack walking. |
| src/coreclr/vm/crashreportstackwalker.cpp | Implements VM callbacks for managed stack walking, thread enumeration, and exception extraction. |
| src/coreclr/vm/ceemain.cpp | Registers VM callbacks during Android EE startup. |
| src/coreclr/vm/CMakeLists.txt | Adds the new VM crashreport stack walker source to the build. |
| src/coreclr/pal/src/thread/process.cpp | Adds Android crash report initialization, enablement state, and crash-time generation call. |
| src/coreclr/pal/src/include/pal/process.h | Exposes `PROCIsCrashReportEnabled()` for signal-path gating. |
| src/coreclr/pal/src/exception/signal.cpp | Avoids duplicate managed stack logging when crash reporting is enabled. |
| src/coreclr/pal/src/crashreport/moduleenumerator.h | Declares helpers to resolve process/module names via `/proc`. |
| src/coreclr/pal/src/crashreport/moduleenumerator.cpp | Implements `/proc/self/cmdline` and `/proc/self/maps` parsing for crash-time module/process lookup. |
| src/coreclr/pal/src/crashreport/inproccrashreporter.h | Declares crash report generator API and VM callback hooks. |
| src/coreclr/pal/src/crashreport/inproccrashreporter.cpp | Implements JSON crash report generation, logcat/stderr output, optional file output, and callback integration. |
| src/coreclr/pal/src/crashreport/crashjsonwriter.h | Declares fixed-buffer JSON writer intended for signal-safe usage. |
| src/coreclr/pal/src/crashreport/crashjsonwriter.cpp | Implements the fixed-buffer JSON writer. |
| src/coreclr/pal/src/CMakeLists.txt | Adds Android-only PAL crashreport sources to the build. |

Copilot · 2026-04-14T22:43:12Z

+CrashJsonAppend(
+    CrashJsonWriter* w,
+    const char* str,
+    int len)
+{
+    if (w->pos + len >= CRASH_JSON_BUFFER_SIZE - 16)
+        return 0;


CrashJsonAppend can fail (returns 0) when the fixed buffer is near capacity, but all callers ignore the return value. This can silently produce malformed/truncated JSON (e.g., missing closing braces/quotes) without any indication to the consumer. Consider tracking an overflowed flag in CrashJsonWriter and making all write operations no-ops after overflow (optionally appending a final "truncated" marker) so the output remains syntactically valid and diagnosable.

Was going to also comment we can follow up with a depth-aware truncation mode in case there are multiple threads with deep callstacks.
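The overflow-flag suggestion above could look roughly like the sketch below. `JsonWriter`, `JsonAppend`, and both constants are illustrative stand-ins rather than the PR's actual `CrashJsonWriter` layout; the point is only that the first failed append latches a flag and every later append becomes a no-op.

```cpp
#include <cstddef>
#include <cstring>

// Illustrative stand-in for the PR's writer; names and sizes are assumptions.
constexpr size_t kBufferSize = 4096;
constexpr size_t kReserve = 16; // bytes held back for closing brackets or a marker

struct JsonWriter
{
    char   buffer[kBufferSize];
    size_t pos = 0;
    bool   overflowed = false;
};

// Appends len bytes, or latches the overflow flag and writes nothing.
// Once the flag is set, every later append is rejected, so a short token
// can never land after a longer one that was dropped.
bool JsonAppend(JsonWriter* w, const char* str, size_t len)
{
    if (w->overflowed)
        return false;

    // pos never exceeds kBufferSize - kReserve, so this cannot underflow.
    if (len > kBufferSize - kReserve - w->pos)
    {
        w->overflowed = true;
        return false;
    }

    std::memcpy(w->buffer + w->pos, str, len);
    w->pos += len;
    return true;
}
```

A finishing step could then check `overflowed` and spend the reserved bytes on closing the open scopes or on a "truncated" marker, which is the part that would keep the output parseable.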

mdh1418 · 2026-04-14T22:57:26Z

+    }
+    else if (exceptionType != NULL && exceptionType[0] != '\0')
+    {
+        char hresultBuffer[32];


Do we want to condition this if a native code fault occurs in the middle of managed exception processing? It seems like the managed throwable would still be on the thread. Would we want to keep the managed throwable for context?

Per Jan Kotas's feedback on PR dotnet#126916, the PAL directory is being phased out. Move crashreportwriter.cpp/h from src/coreclr/pal/src/crashreport/ to src/coreclr/debug/crashreport/ alongside existing createdump. The file is still compiled as part of the PAL static library (since process.cpp calls into it from the signal handler path). Only the physical location changes — no logic or behavioral changes. Updated include_directories in both PAL and VM CMakeLists to point to the new location, removing the VM's dependency on ../pal/src. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>


Remove the /proc-based native address enrichment that was only being used for createdump-style parity, while keeping the top-level process_name information. This reduces crash-path work and keeps the minimal Android in-proc report focused on the managed diagnostic data reviewers asked for. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Remove the now-unneeded moduleenumerator helper and resolve process_name directly in the in-proc crash reporter via /proc/self/cmdline with a simple /proc/self/exe fallback. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
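A minimal sketch of the lookup this commit message describes, assuming a Linux-style `/proc`. `GetProcessName` is a hypothetical name, and the exact fallback conditions in the PR may differ.

```cpp
#include <cstddef>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper: copies the process name into out (NUL-terminated).
// Reads argv[0] from /proc/self/cmdline; if that is empty or unreadable,
// falls back to the target of the /proc/self/exe symlink.
// Only open/read/close/readlink are used, which POSIX lists as
// async-signal-safe.
bool GetProcessName(char* out, size_t outSize)
{
    if (outSize < 2)
        return false;

    int fd = open("/proc/self/cmdline", O_RDONLY | O_CLOEXEC);
    if (fd >= 0)
    {
        ssize_t n = read(fd, out, outSize - 1);
        close(fd);
        if (n > 0)
        {
            // Arguments are NUL-separated, so out now reads as argv[0].
            out[n] = '\0';
            if (out[0] != '\0')
                return true;
        }
    }

    ssize_t n = readlink("/proc/self/exe", out, outSize - 1);
    if (n <= 0)
        return false;

    out[n] = '\0'; // readlink does not terminate the string
    return true;
}
```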

jkotas · 2026-04-15T18:45:03Z

+            if (pFrame != NULL && pFrame != FRAME_TOP)
+            {
+                WalkContext walkContext = { frameCallback, ctx };
+                pThread->StackWalkFrames(FrameCallbackAdapter, &walkContext,


This is trying to stackwalk a running thread. That is not going to work well. If you want to do a stackwalk, it either needs to be on the current thread or the thread needs to be suspended.

How we wanted to approach this initially is best effort. Unless I'm mistaken, our hands are tied with respect to stopping threads unless we cook up something creative (brittle) ourselves.

I think adding EX_TRY/EX_CATCH would be helpful. We don't want to crash the crash reporter.

This will result in fatal access violations. EX_TRY/EX_CATCH works for clean C++ exceptions only; it won't prevent crashing the crash reporter.

If it's too dicey, we can always skip stack walking the non-crashing threads and just print something like the thread id.

our hands are tied with respect to stopping threads unless we cook up something creative (brittle) ourselves.

This is only dumping managed stacks for threads that are not running managed code at the moment. If this limitation is there to stay, we can at least prevent these threads from returning to managed code so that we can dump their stacks reliably. (The way to do that is to bump g_TrapReturningThreads and then block the thread in the RareDisablePreemptiveGC slow path.)

Could an alternative be to suspend each thread using pthread_kill/SuspendThread? Since we are crashing, there is no need to resume the threads, so all threads would be suspended when we walk their stacks.

Will g_TrapReturningThreads force the thread to suspend even if it is running managed code, rather than only trapping threads running native code as they return to managed? If I recall correctly, thread suspension in CoreCLR sets this flag and then does a coop/preemptive/coop switch that traps the thread at specific points. If that's the case, then it sounds like the right way to go, but the thread needs to hit the trap, so the question is whether there are scenarios where that won't happen, such as running inside managed code.

Using pthread_kill/SuspendThread against the target thread before walking the stack would be straightforward; that is how Mono does its thread suspension on platforms where signals are supported.

Maybe we can extend our current activation signal handler and use that to suspend thread, store context in thread object, signal and then wait for resume.

If it's too dicey, we can always skip stack walking the non-crashing threads and just print something like the thread id.

That would be too limiting. We already have the stack of the crashing thread in logcat; the JSON report should include all managed threads, so we should look for a solution where we can suspend the threads before we stackwalk them.

Added thread suspension (via SuspendEE) to then walk the non-crashing threads. Also added a couple of heuristics for whether it's safe to walk the threads (a mix of SuspendEE requirements and GC in progress/fatal checks).


Copilot

Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 6 comments.

Move re-entrancy guard before file open to avoid truncating on contention. Tighten crash report file permissions to 0600. Expand %%/%p/%d dump name templates in crash report path. Remove unused PROCIsCrashReportEnabled declaration and definition. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Move crash report configuration reading from PROCAbortInitialize (PAL init) to CrashReportRegisterStackWalker (EE startup) because Android sets DOTNET_* environment variables via JNI after PAL init. Add PROCEnableInProcCrashReport so the VM can arm the crash reporter flag at the correct time. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Install a SIGUSR2 handler at crash reporter init time that, when armed, parks threads on a pipe read. At crash time, send SIGUSR2 via tgkill to every non-crashing managed thread, wait briefly for them to park, walk all stacks, then close the pipe to release. This approach is self-contained in crashreportstackwalker.cpp and does not modify the PAL's activation signal handler. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

jkotas · 2026-04-16T22:41:37Z

+}
+
+// Suspend non-crashing threads so their managed stacks can be walked
+// reliably.  Sends SIGUSR2 to every non-crashing managed thread; the


The runtime managed stackwalker is not robust to walk from a random context. It can only reliably walk from a "good spot".

The difference between what you have here and the actual runtime thread suspension is that the runtime checks whether the thread got stopped at a "good spot"; if it did not, the runtime lets the thread run for a bit more and tries again.

Broadly speaking, we can split the crashes into two categories:

Clean crash in 3rd party code where the runtime is not corrupted. It should be fine and reliable to do a regular runtime suspend in this case.

Internal runtime crash where the runtime state is corrupted (e.g. crash in the middle of the GC). There is a good chance that any stackwalking or thread suspension is going to crash or hang too. I do not think there is much we can do about that from an in-proc crash reporter.

With regular runtime suspend, you mean a full SuspendEE or suspending each individual thread using Thread::Suspend/ResumeThread? Doing thread by thread suspend sounds more precise and less risky.

Is there a way to do bare-minimum stackwalking without risking taking runtime locks in the process, such as just resolving the raw IPs of frames for the suspended thread, without any metadata lookups or type system access?

Another option could be to use libunwind doing the raw stackwalking of a suspended thread, then we could resume the thread and do potential other lookups later. Less risky, but depending on where the thread crashed, it could still cause deadlocks in later metadata/type lookups.

What about reading raw memory to resolve the bare minimum needed to get from IP to symbol while everything is suspended? We could use parts of the DAC's stackwalking implementation for the in-proc crash reporter, or add an option to run the stackwalker and other stackwalking APIs in a way that takes no locks or backs out of scenarios that need locking. In Mono we have support wired into the stackwalker to run in async-signal-safe mode: it takes custom paths, skips specific lookups so that no locks are taken, or uses lock-free implementations. I think what we might land at is an in-proc crash stackwalker implemented as its own component, handling mixed-mode callstacks and reusing logic from the DAC's memory-based stackwalking implementation. This is probably a more long-term goal and not for this PR.

Let's say we initially focus on getting data out for crashes outside of the runtime, assuming we can call high-level runtime APIs to suspend and stackwalk. Do we have any good mechanism to figure out if the crashing thread is at a spot worth doing a full crash report attempt? For example: checking that a GC is not in progress, that we are not running inside runtime native code, that we are not in a critical region of the runtime, or stackwalking the top frames of the crashing thread to figure out if it's running managed or native code that could be considered "safe" at the crash site.

The alternative is to just try and see: if we hang, the watchdog will kill us and we won't get a crash report, but we would probably still see what happened in the tombstone.

With regular runtime suspend, you mean a full SuspendEE or suspending each individual thread using Thread::Suspend/ResumeThread? Doing thread by thread suspend sounds more precise and less risky.

I mean a full SuspendEE. I do not think there is a significant difference between the two (for managed stacktraces). If we care about logging partial information where possible, we can start a full SuspendEE and start logging information about the threads as they become suspended.

minimum stackwalking not risking taking runtime locks in the process

A number of the runtime data structures involved in stackwalking and resolving IPs to names are not lock-free. I do not see an easy, reliable way around that. We can either take the locks like regular stackwalking (and risk deadlocks or crashes if the runtime state is corrupted) or not take the locks (and risk crashes if the runtime state is corrupted or if some other thread is changing the runtime data structures involved). I think the "do not take the locks" option is worse on average.

For unhandled exception logging that you get by default on desktop, we use regular stackwalking with locks.

good mechanism to figure out if the crashing thread is at a spot worth doing a full crash report attempt, like checking if GC is not in progress

Yes, we do some of that today (e.g. look for g_fFatalErrorOccurredOnGCThread). There is no fully reliable way to detect this, it has been done reactively.

Let's start using SuspendEE. If we hit issues with the additional logic happening as part of that API, compared to the internals of the SuspendAllThreads function, we can deal with it later.

We should also test some different crash scenarios and see how things play out and where we might trip.

Added thread suspension (via SuspendEE) to then walk the non-crashing threads. Also added a couple of heuristics for whether it's safe to walk the threads (a mix of SuspendEE requirements and GC in progress/fatal checks).

@jkotas

This reverts commit 26ee461. PR dotnet#126916 feedback from @jkotas and @lateralusX converged on using the runtime's regular suspension APIs instead of a bespoke SIGUSR2-based park. The managed stackwalker is only robust to walk from runtime-known safe points, which a dedicated signal handler parking threads on a pipe read cannot guarantee. A follow-up commit reintroduces non-crashing thread walking using ThreadSuspend::SuspendEE / RestartEE. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Reintroduce multi-thread stack walking for the in-proc crash reporter, this time using the runtime's regular suspension path (ThreadSuspend::SuspendEE / RestartEE) instead of a bespoke SIGUSR2 park. The managed stackwalker is only robust when threads are at runtime-known safe points, which SuspendEE guarantees. Per reviewer guidance (see PR dotnet#126916 discussion), the reporter takes the standard stackwalker locks rather than attempting a lock-free variant. To avoid deadlocking when the runtime itself is in a compromised state, a reactive safety heuristic gates the suspension attempt: if a fatal error already occurred on the GC thread, a GC is in progress, the crashing thread is a GC special thread, or the crashing thread already holds the thread store lock, the reporter falls back to walking only the crashing thread. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
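The gating heuristic in this commit message reduces to a single predicate over four conditions. The sketch below uses hypothetical names and a caller-filled struct purely for illustration; in the runtime these facts would be read from runtime state.

```cpp
// All names are hypothetical; each field mirrors one condition listed in
// the commit message above.
struct CrashSiteState
{
    bool fatalErrorOnGCThread;      // a fatal error already occurred on the GC thread
    bool gcInProgress;              // a GC is in progress
    bool crashingThreadIsGCSpecial; // the crashing thread is a GC special thread
    bool holdsThreadStoreLock;      // the crashing thread holds the thread store lock
};

enum class WalkMode
{
    CrashingThreadOnly, // skip suspension, report only the crashing thread
    AllThreads          // suspend the runtime and walk every managed thread
};

// Any single red flag is enough to fall back to the crashing thread only.
WalkMode ChooseWalkMode(const CrashSiteState& s)
{
    if (s.fatalErrorOnGCThread || s.gcInProgress ||
        s.crashingThreadIsGCSpecial || s.holdsThreadStoreLock)
    {
        return WalkMode::CrashingThreadOnly;
    }
    return WalkMode::AllThreads;
}
```

As noted in the review discussion, such checks are reactive: they cover known bad states and cannot guarantee that suspension will succeed.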

Expose CrashJsonFlush in the crash report JSON writer header and call it from the multi-thread enumeration path after each thread's object is closed, plus once more after the final thread. This writes each walked thread to the on-disk crash report as soon as it is fully serialized, so a later hang or secondary fault during the crash-reporting path does not lose threads that have already been captured. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The in-proc crash reporter previously fell back to HOME/TMPDIR/ /data/local/tmp when DbgMiniDumpName was unset, always emitting a crash report file. Gate the file emission on DbgMiniDumpName being configured, and when it is only a filename, place the report under TMPDIR (or /tmp) so it lands in a writable location. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
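The resulting path rule can be sketched as follows. `ResolveReportBasePath` is a hypothetical name, and treating "only a filename" as "contains no `/`" is an assumption of this sketch. It uses `getenv` and `snprintf`, so it is meant for startup rather than the signal path.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <cstring>

// Hypothetical startup-time helper. Returns false when no report file should
// be written (dumpName unset or empty) or when the path does not fit in out.
bool ResolveReportBasePath(const char* dumpName, char* out, size_t outSize)
{
    if (dumpName == nullptr || dumpName[0] == '\0')
        return false; // file output is opt-in through DbgMiniDumpName

    const char* dir = "";
    const char* sep = "";
    if (std::strchr(dumpName, '/') == nullptr)
    {
        // Bare filename: place it under TMPDIR, or /tmp when TMPDIR is unset.
        dir = std::getenv("TMPDIR");
        if (dir == nullptr || dir[0] == '\0')
            dir = "/tmp";
        sep = "/";
    }

    int n = std::snprintf(out, outSize, "%s%s%s", dir, sep, dumpName);
    return n >= 0 && static_cast<size_t>(n) < outSize;
}
```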

The in-proc crash report initially mirrored createdump's schema byte-for-byte, including an 'ExceptionType' field that encoded the signal as a synthetic Windows exception code. Per reviewer feedback, ExceptionType is a Windows concept that does not carry meaningful information for Unix crash reports; emit the raw signal number instead. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 5 comments.

Copilot · 2026-04-17T21:44:09Z

+#ifndef __CRASHREPORTSTACKWALKER_H__
+#define __CRASHREPORTSTACKWALKER_H__
+


The new include guard macro __CRASHREPORTSTACKWALKER_H__ starts with a double underscore, which is reserved for the implementation in C/C++. Please rename the guard to a non-reserved identifier (e.g., CRASHREPORTSTACKWALKER_H or CORECLR_CRASHREPORTSTACKWALKER_H).

Copilot · 2026-04-17T21:44:10Z

+    CLRConfigNoCache dmpNameCfg = CLRConfigNoCache::Get("DbgMiniDumpName", /*noprefix*/ false, &getenv);
+    const char* dumpName = dmpNameCfg.IsSet() ? dmpNameCfg.AsString() : nullptr;
+    if (dumpName == nullptr || dumpName[0] == '\0')
+    {
+        return;
+    }


CrashReportRegisterStackWalker() currently returns early when DbgMiniDumpName is unset/empty, which means setting DOTNET_EnableCrashReport=1 (as described in the PR) will not emit any crash report at all. If the intent is that crash reporting can be enabled without a file path (e.g., future logcat emission) or that the env var alone is sufficient, please remove/relax this hard dependency on DbgMiniDumpName (or update the configuration semantics so the behavior matches the documentation).

Copilot · 2026-04-17T21:44:10Z

+    if (g_inProcCrashReportEnabled)
+    {
+        InProcCrashReportGenerate(signal, siginfo, context);
+    }


g_inProcCrashReportEnabled is read from the fatal-signal crash path (PROCCreateCrashDumpIfEnabled) but written during startup (PROCEnableInProcCrashReport) as a plain bool. Because the signal handler can interrupt normal execution at any point, this introduces an async data race/UB. Consider using a signal-safe atomic/volatile type (e.g., volatile sig_atomic_t, Volatile<bool>, or an atomic with relaxed ordering) or publishing the flag via the same existing mechanisms used for other crash-path globals.
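One of the options named above, `volatile sig_atomic_t`, might look like the sketch below. The identifiers are hypothetical stand-ins for `g_inProcCrashReportEnabled` and its setter, not the PR's code.

```cpp
#include <csignal>

// Hypothetical stand-in for the crash-path enable flag.
static volatile std::sig_atomic_t g_crashReportEnabled = 0;

// Called once during startup, from normal (non-signal) context.
void EnableCrashReport()
{
    g_crashReportEnabled = 1;
}

// Intended to be callable from the fatal-signal path: it only reads a
// volatile sig_atomic_t, which the language permits inside signal handlers.
bool IsCrashReportEnabled()
{
    return g_crashReportEnabled != 0;
}
```

A lock-free `std::atomic<bool>` with relaxed loads and stores is the other shape the comment mentions, and it additionally gives defined behavior when the writer and the crashing thread differ.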

Copilot · 2026-04-17T21:44:10Z

 include_directories(BEFORE ${CMAKE_CURRENT_SOURCE_DIR})
 include_directories(${ARCH_SOURCES_DIR})
 include_directories(${CMAKE_CURRENT_SOURCE_DIR}/../interop/inc)
+include_directories(${CLR_DIR})


include_directories(${CLR_DIR}) is added unconditionally for the entire VM build, but the new dependency that needs it (debug/crashreport/inproccrashreporter.h) is Android-only. Consider scoping this include directory to Android builds (e.g., under if(CLR_CMAKE_TARGET_ANDROID)/if(CLR_CMAKE_HOST_ANDROID)) to avoid globally widening the include search path on other platforms.

Suggested change

-include_directories(${CLR_DIR})
+if(CLR_CMAKE_TARGET_ANDROID OR CLR_CMAKE_HOST_ANDROID)
+  include_directories(${CLR_DIR})
+endif(CLR_CMAKE_TARGET_ANDROID OR CLR_CMAKE_HOST_ANDROID)

Copilot · 2026-04-17T21:44:11Z

 # Include directories

 include_directories(include)
+include_directories(${CLR_DIR})


include_directories(${CLR_DIR}) is added globally for PAL, but the only new include that appears to require it here is Android-only (debug/crashreport/inproccrashreporter.h). Consider making this include directory conditional on Android to reduce the chance of accidental header-name collisions on other targets.

Suggested change

-include_directories(${CLR_DIR})
+if(CLR_CMAKE_TARGET_ANDROID)
+  include_directories(${CLR_DIR})
+endif()

mdh1418 and others added 9 commits April 14, 2026 10:03

Add PAL foundation for in-proc crash reporting

22d283c

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Resolve crash report version field

b866409

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Resolve crash report process name

ef2c128

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Enumerate crash report threads and stack frames

4c6220f

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Add crash report frame and module enrichment

ee343dc

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Add crash report exception details

709eadc

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Add createdump frame metadata to crash reports

6a89b32

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Write in-proc crash reports to JSON files

ba40568

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Serialize in-proc crash report generation

34a7545

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI review requested due to automatic review settings April 14, 2026 22:37

github-actions bot added the area-PAL-coreclr label Apr 14, 2026

mdh1418 requested review from jkotas, lateralusX and steveisok April 14, 2026 22:37

dotnet-policy-service bot assigned mdh1418 Apr 14, 2026

mdh1418 added area-Diagnostics-coreclr os-android labels Apr 14, 2026

mdh1418 changed the title ~~Inproc crashreport minimal~~ [Android] In-Proc Crash Reporter Apr 14, 2026

Copilot AI reviewed Apr 14, 2026


mdh1418 commented Apr 14, 2026


jkotas reviewed Apr 14, 2026

