Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Tagging subscribers to this area: @steveisok, @tommcdon, @dotnet/dotnet-diag
Pull request overview
Adds an opt-in, in-process crash reporter for Android CoreCLR that emits a createdump-shaped JSON crash report to logcat/stderr and optionally to a *.crashreport.json file, with VM-provided managed thread/stack/exception callbacks.
Changes:
- Introduces PAL-side crash report generation (JSON writer, module/process name helpers, crash report generator).
- Wires VM callbacks for managed thread enumeration/stack walking/exception extraction into the PAL crash reporter on Android startup.
- Integrates crash report triggering into the Android crash/abort signal path and adds Android-only build plumbing.
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| src/coreclr/vm/crashreportstackwalker.h | Declares Android-only VM registration hook for crash report stack walking. |
| src/coreclr/vm/crashreportstackwalker.cpp | Implements VM callbacks for managed stack walking, thread enumeration, and exception extraction. |
| src/coreclr/vm/ceemain.cpp | Registers VM callbacks during Android EE startup. |
| src/coreclr/vm/CMakeLists.txt | Adds the new VM crashreport stack walker source to the build. |
| src/coreclr/pal/src/thread/process.cpp | Adds Android crash report initialization, enablement state, and crash-time generation call. |
| src/coreclr/pal/src/include/pal/process.h | Exposes PROCIsCrashReportEnabled() for signal-path gating. |
| src/coreclr/pal/src/exception/signal.cpp | Avoids duplicate managed stack logging when crash reporting is enabled. |
| src/coreclr/pal/src/crashreport/moduleenumerator.h | Declares helpers to resolve process/module names via /proc. |
| src/coreclr/pal/src/crashreport/moduleenumerator.cpp | Implements /proc/self/cmdline and /proc/self/maps parsing for crash-time module/process lookup. |
| src/coreclr/pal/src/crashreport/inproccrashreporter.h | Declares crash report generator API and VM callback hooks. |
| src/coreclr/pal/src/crashreport/inproccrashreporter.cpp | Implements JSON crash report generation, logcat/stderr output, optional file output, and callback integration. |
| src/coreclr/pal/src/crashreport/crashjsonwriter.h | Declares fixed-buffer JSON writer intended for signal-safe usage. |
| src/coreclr/pal/src/crashreport/crashjsonwriter.cpp | Implements the fixed-buffer JSON writer. |
| src/coreclr/pal/src/CMakeLists.txt | Adds Android-only PAL crashreport sources to the build. |
    CrashJsonAppend(
        CrashJsonWriter* w,
        const char* str,
        int len)
    {
        if (w->pos + len >= CRASH_JSON_BUFFER_SIZE - 16)
            return 0;
CrashJsonAppend can fail (returns 0) when the fixed buffer is near capacity, but all callers ignore the return value. This can silently produce malformed/truncated JSON (e.g., missing closing braces/quotes) without any indication to the consumer. Consider tracking an overflowed flag in CrashJsonWriter and making all write operations no-ops after overflow (optionally appending a final "truncated" marker) so the output remains syntactically valid and diagnosable.
I was also going to comment that we can follow up with a depth-aware truncation mode in case there are multiple threads with deep call stacks.
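For illustration only (not code from this PR), the overflow-flag idea suggested above could look roughly like the sketch below. `SketchJsonWriter`, `SketchAppend`, and the buffer sizes are hypothetical names and values; the real `CrashJsonWriter` layout is not shown in this thread.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical, simplified stand-in for the PR's CrashJsonWriter.
constexpr size_t kSketchBufferSize = 64; // illustrative; a real buffer would be larger
constexpr size_t kSketchReserve = 16;    // headroom kept free for closing text

struct SketchJsonWriter
{
    char   buffer[kSketchBufferSize];
    size_t pos = 0;          // invariant: pos <= kSketchBufferSize - kSketchReserve
    bool   overflowed = false;
};

// Appends len bytes. On the first append that does not fit, latches
// `overflowed`; every later append is then a no-op that returns false.
inline bool SketchAppend(SketchJsonWriter* w, const char* str, size_t len)
{
    if (w->overflowed)
        return false;

    if (len > kSketchBufferSize - kSketchReserve - w->pos)
    {
        w->overflowed = true;
        return false;
    }

    std::memcpy(w->buffer + w->pos, str, len);
    w->pos += len;
    return true;
}
```

With the flag latched, callers can keep ignoring individual return values and check `overflowed` once at the end. Keeping the truncated output syntactically valid would additionally require closing any open JSON scopes, which this sketch does not attempt.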
    }
    else if (exceptionType != NULL && exceptionType[0] != '\0')
    {
        char hresultBuffer[32];
Do we want to condition this if a native code fault occurs in the middle of managed exception processing? It seems like the managed throwable would still be on the thread. Would we want to keep the managed throwable for context?
Per Jan Kotas's feedback on PR dotnet#126916, the PAL directory is being phased out. Move crashreportwriter.cpp/h from src/coreclr/pal/src/crashreport/ to src/coreclr/debug/crashreport/ alongside existing createdump. The file is still compiled as part of the PAL static library (since process.cpp calls into it from the signal handler path). Only the physical location changes — no logic or behavioral changes. Updated include_directories in both PAL and VM CMakeLists to point to the new location, removing the VM's dependency on ../pal/src. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove the /proc-based native address enrichment that was only being used for createdump-style parity, while keeping the top-level process_name information. This reduces crash-path work and keeps the minimal Android in-proc report focused on the managed diagnostic data reviewers asked for. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove the now-unneeded moduleenumerator helper and resolve process_name directly in the in-proc crash reporter via /proc/self/cmdline with a simple /proc/self/exe fallback. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    if (pFrame != NULL && pFrame != FRAME_TOP)
    {
        WalkContext walkContext = { frameCallback, ctx };
        pThread->StackWalkFrames(FrameCallbackAdapter, &walkContext,
This is trying to stackwalk a running thread. That is not going to work well. If you want to do a stackwalk, it either needs to be on the current thread or the thread needs to be suspended.
How we wanted to approach this initially is best effort. Unless I'm mistaken, our hands are tied with respect to stopping threads unless we cook up something creative (brittle) ourselves.
I think adding EX_TRY/EX_CATCH would be helpful. We don't want to crash the crash reporter.
This will result in fatal access violations. EX_TRY/EX_CATCH works for clean C++ exceptions only; it won't prevent crashing the crash reporter.
If it's too dicey, we can always skip stack walking the non-crashing threads and just print something like the thread id.
> our hands are tied with respect to stopping threads unless we cook up something creative (brittle) ourselves.
This is only dumping managed stacks for threads that are not running managed code at the moment. If this limitation is there to stay, we can at least prevent these threads from returning to managed code so that we can dump their stacks reliably. (The way to do that is to bump g_TrapReturningThreads and then block the thread in the RareDisablePreemptiveGC slow path.)
Could an alternative be to suspend each thread using pthread_kill/SuspendThread? Since we are crashing, there is no need to resume the threads, so all threads would stay suspended while we walk their stacks.
Will g_TrapReturningThreads force the thread to suspend even if it is running managed code, rather than only trapping threads running native code on their way back to managed? If I recall correctly, thread suspension in CoreCLR sets this flag and then does a coop/preemptive/coop switch that traps the thread at specific points. If that's the case, then it sounds like the right way to go, but the thread needs to hit the trap, so the question is whether there are scenarios where that won't happen, like running inside managed code.
Using pthread_kill/SuspendThread against the target thread before walking the stack would be straightforward; that is how Mono implements its suspend-thread logic on platforms where signals are supported.
Maybe we can extend our current activation signal handler and use that to suspend thread, store context in thread object, signal and then wait for resume.
> If it's too dicey, we can always skip stack walking the non-crashing threads and just print something like the thread id.
That would be too limiting. We already have the stack of the crashing thread in logcat; the JSON report should include all managed threads, so we should look at a solution where we can suspend the threads before we stackwalk them.
Added thread suspension (via SuspendEE) to then walk the non-crashing threads. Also added a couple of heuristics for whether it's safe to walk the threads (a mix of SuspendEE requirements + GC in progress/fatal).
Move re-entrancy guard before file open to avoid truncating on contention. Tighten crash report file permissions to 0600. Expand %%/%p/%d dump name templates in crash report path. Remove unused PROCIsCrashReportEnabled declaration and definition. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Move crash report configuration reading from PROCAbortInitialize (PAL init) to CrashReportRegisterStackWalker (EE startup) because Android sets DOTNET_* environment variables via JNI after PAL init. Add PROCEnableInProcCrashReport so the VM can arm the crash reporter flag at the correct time. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Install a SIGUSR2 handler at crash reporter init time that, when armed, parks threads on a pipe read. At crash time, send SIGUSR2 via tgkill to every non-crashing managed thread, wait briefly for them to park, walk all stacks, then close the pipe to release. This approach is self-contained in crashreportstackwalker.cpp and does not modify the PAL's activation signal handler. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    }

    // Suspend non-crashing threads so their managed stacks can be walked
    // reliably. Sends SIGUSR2 to every non-crashing managed thread; the
The runtime managed stackwalker is not robust to walk from a random context. It can only reliably walk from a "good spot".
The difference between what you have here and the actual runtime thread suspend is that the actual runtime thread suspension checks whether the thread got stopped at a "good spot", and if it did not, lets the thread run for a bit more and tries again.
Broadly speaking, we can split the crashes into two categories:
- Clean crash in 3rd party code where the runtime is not corrupted. It should be fine and reliable to do a regular runtime suspend in this case.
- Internal runtime crash where the runtime state is corrupted (e.g. crash in the middle of the GC). There is a good chance that any stackwalking or thread suspension is going to crash or hang too. I do not think there is much we can do about that from an in-proc crash reporter.
With regular runtime suspend, you mean a full SuspendEE or suspending each individual thread using Thread::Suspend/ResumeThread? Doing thread by thread suspend sounds more precise and less risky.
Is there a way to do bare-minimum stackwalking without risking taking runtime locks in the process, like just resolving the pure IPs of frames for the suspended thread, not involving any metadata lookups or type system access?
Another option could be to use libunwind doing the raw stackwalking of a suspended thread, then we could resume the thread and do potential other lookups later. Less risky, but depending on where the thread crashed, it could still cause deadlocks in later metadata/type lookups.
What about reading the raw memory to resolve the bare minimum needed to get from IP -> symbol while everything is suspended? We could use parts of the DAC's stackwalking implementation to handle the in-proc crash reporter stackwalking, or have an option to run the stackwalker and other stackwalking APIs in a way that does not take any locks or backs out of scenarios that need locking. In Mono we have support wired into the stackwalker to run in async-signal-safe mode, and it will take custom paths, ignore specific lookups that would take locks, or use lock-free implementations when running in that mode. I think what we might land at is an in-proc crash stackwalker implemented as its own component, handling mixed-mode callstacks and reusing logic from the DAC's memory-based stackwalking implementation. This is probably a more long-term goal and not for this PR.
Let's say we initially try to focus on getting data out for crashes outside of the runtime, assuming we can call high-level runtime APIs to suspend and do stackwalking. Do we have any good mechanism to figure out if the crashing thread is at a spot worth doing a full crash report attempt, like checking that a GC is not in progress, that it is not running inside runtime native code, that it is not in critical regions of the runtime, or stackwalking the top frames of the crashing thread to figure out if it's running managed or native code that could be considered "safe" at the crash site?
The alternative is to just try and see: if we hang, the watchdog will kill us and we won't get a crash report, and if we didn't get one, we would probably see in the tombstone what happened.
> With regular runtime suspend, you mean a full SuspendEE or suspending each individual thread using Thread::Suspend/ResumeThread? Doing thread by thread suspend sounds more precise and less risky.
I mean a full SuspendEE. I do not think there is a significant difference between the two (for managed stacktraces). If we care about logging partial information where possible, we can start a full SuspendEE and start logging information about the threads as they become suspended.
> minimum stackwalking not risking taking runtime locks in the process
A number of the runtime data structures involved in stackwalking and resolving IPs to names are not lock free. I do not see an easy reliable way around that. We can either take the locks like regular stackwalking (and risk deadlocks or crashes if the runtime state is corrupted) or not take the locks (and risk crashes if the runtime state is corrupted or if some other thread is changing the runtime data structures involved). I think the "do not take the locks" option is worse on average.
For unhandled exception logging that you get by default on desktop, we use regular stackwalking with locks.
> good mechanism to figure out if the crashing thread is at a spot worth doing a full crash report attempt, like checking if GC is not in progress
Yes, we do some of that today (e.g. look for g_fFatalErrorOccurredOnGCThread). There is no fully reliable way to detect this; it has been done reactively.
Let's start by using SuspendEE. If we hit issues with the additional logic that runs as part of that API, compared to the internals of the SuspendAllThreads function, we can deal with them later.
We should also test some different crash scenarios and see how things play out and where we might trip.
Added thread suspension (via SuspendEE) to then walk the non-crashing threads. Also added a couple of heuristics for whether it's safe to walk the threads (a mix of SuspendEE requirements + GC in progress/fatal).
This reverts commit 26ee461. PR dotnet#126916 feedback from @jkotas and @lateralusX converged on using the runtime's regular suspension APIs instead of a bespoke SIGUSR2-based park. The managed stackwalker is only robust to walk from runtime-known safe points, which a dedicated signal handler parking threads on a pipe read cannot guarantee. A follow-up commit reintroduces non-crashing thread walking using ThreadSuspend::SuspendEE / RestartEE. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Reintroduce multi-thread stack walking for the in-proc crash reporter, this time using the runtime's regular suspension path (ThreadSuspend::SuspendEE / RestartEE) instead of a bespoke SIGUSR2 park. The managed stackwalker is only robust when threads are at runtime-known safe points, which SuspendEE guarantees. Per reviewer guidance (see PR dotnet#126916 discussion), the reporter takes the standard stackwalker locks rather than attempting a lock-free variant. To avoid deadlocking when the runtime itself is in a compromised state, a reactive safety heuristic gates the suspension attempt: if a fatal error already occurred on the GC thread, a GC is in progress, the crashing thread is a GC special thread, or the crashing thread already holds the thread store lock, the reporter falls back to walking only the crashing thread. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
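The gating heuristic described in this commit message can be pictured as a single predicate over a snapshot of the crash-site state. The sketch below is illustrative only: `SketchCrashSiteState` and its fields are hypothetical stand-ins for the runtime checks named above, not the PR's actual code.

```cpp
// Hypothetical snapshot of the four conditions listed in the commit message.
struct SketchCrashSiteState
{
    bool fatalErrorOnGcThread;               // a fatal error already occurred on the GC thread
    bool gcInProgress;                       // a GC is currently in progress
    bool crashingThreadIsGcSpecial;          // the crashing thread is a GC special thread
    bool crashingThreadHoldsThreadStoreLock; // the crashing thread owns the thread store lock
};

enum class SketchWalkMode
{
    CrashingThreadOnly, // fallback: do not attempt SuspendEE
    AllManagedThreads   // suspend the EE and walk every managed thread
};

// Any single risk factor is enough to skip the suspension attempt.
inline SketchWalkMode SketchChooseWalkMode(const SketchCrashSiteState& s)
{
    const bool unsafe = s.fatalErrorOnGcThread
                     || s.gcInProgress
                     || s.crashingThreadIsGcSpecial
                     || s.crashingThreadHoldsThreadStoreLock;

    return unsafe ? SketchWalkMode::CrashingThreadOnly
                  : SketchWalkMode::AllManagedThreads;
}
```

As noted earlier in the thread, checks like these are reactive; they reduce the chance of hanging in the crash path but cannot detect every corrupted-runtime case.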
Expose CrashJsonFlush in the crash report JSON writer header and call it from the multi-thread enumeration path after each thread's object is closed, plus once more after the final thread. This writes each walked thread to the on-disk crash report as soon as it is fully serialized, so a later hang or secondary fault during the crash-reporting path does not lose threads that have already been captured. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
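To illustrate the per-thread flush described here, a minimal sketch follows. It assumes a writer that tracks how many buffered bytes have already reached the file descriptor; `SketchReportWriter`, `SketchFlush`, and `SketchEmitThread` are hypothetical names and do not reflect the actual `CrashJsonFlush` signature.

```cpp
#include <cerrno>
#include <cstddef>
#include <cstring>
#include <unistd.h>

// Writes all bytes to fd, retrying on EINTR and on partial writes.
inline bool SketchWriteAll(int fd, const char* data, size_t len)
{
    while (len > 0)
    {
        ssize_t n = write(fd, data, len);
        if (n < 0)
        {
            if (errno == EINTR)
                continue;
            return false;
        }
        data += n;
        len -= (size_t)n;
    }
    return true;
}

struct SketchReportWriter
{
    int    fd;          // destination file descriptor
    char   buffer[256]; // illustrative size
    size_t pos;         // bytes serialized into buffer so far
    size_t flushed;     // bytes already written to fd
};

// Pushes everything serialized since the last flush to the file descriptor.
inline bool SketchFlush(SketchReportWriter* w)
{
    if (!SketchWriteAll(w->fd, w->buffer + w->flushed, w->pos - w->flushed))
        return false;
    w->flushed = w->pos;
    return true;
}

// Serializes one already-formatted thread object, then flushes immediately so
// a later hang or secondary fault cannot lose it.
inline bool SketchEmitThread(SketchReportWriter* w, const char* threadJson, size_t len)
{
    if (len > sizeof(w->buffer) - w->pos)
        return false;
    std::memcpy(w->buffer + w->pos, threadJson, len);
    w->pos += len;
    return SketchFlush(w);
}
```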
The in-proc crash reporter previously fell back to HOME, TMPDIR, or /data/local/tmp when DbgMiniDumpName was unset, always emitting a crash report file. Gate the file emission on DbgMiniDumpName being configured, and when it is only a filename, place the report under TMPDIR (or /tmp) so it lands in a writable location. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
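The path rule in this commit message can be sketched as below, assuming "only a filename" means the configured value contains no `/`. `SketchResolveReportPath` is a hypothetical helper; the caller would pass the configured DbgMiniDumpName value and the TMPDIR environment value, and any `.crashreport.json` suffix handling is out of scope here.

```cpp
#include <cstddef>
#include <cstring>

// Returns false when no file should be written (name unset or empty) or when
// the resolved path does not fit in out.
inline bool SketchResolveReportPath(const char* dumpName, const char* tmpDir,
                                    char* out, size_t outSize)
{
    if (dumpName == nullptr || dumpName[0] == '\0')
        return false; // file emission is gated on the name being configured

    // A bare filename is placed under TMPDIR, or /tmp if TMPDIR is not set.
    const char* dir = "";
    if (std::strchr(dumpName, '/') == nullptr)
        dir = (tmpDir != nullptr && tmpDir[0] != '\0') ? tmpDir : "/tmp";

    const size_t dirLen = std::strlen(dir);
    const size_t nameLen = std::strlen(dumpName);
    const size_t sep = dirLen > 0 ? 1 : 0;
    if (dirLen + sep + nameLen + 1 > outSize)
        return false;

    std::memcpy(out, dir, dirLen);
    if (sep != 0)
        out[dirLen] = '/';
    std::memcpy(out + dirLen + sep, dumpName, nameLen + 1); // copies the terminator too
    return true;
}
```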
The in-proc crash report initially mirrored createdump's schema byte-for-byte, including an 'ExceptionType' field that encoded the signal as a synthetic Windows exception code. Per reviewer feedback, ExceptionType is a Windows concept that does not carry meaningful information for Unix crash reports; emit the raw signal number instead. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    #ifndef __CRASHREPORTSTACKWALKER_H__
    #define __CRASHREPORTSTACKWALKER_H__
The new include guard macro __CRASHREPORTSTACKWALKER_H__ starts with a double underscore, which is reserved for the implementation in C/C++. Please rename the guard to a non-reserved identifier (e.g., CRASHREPORTSTACKWALKER_H or CORECLR_CRASHREPORTSTACKWALKER_H).
    CLRConfigNoCache dmpNameCfg = CLRConfigNoCache::Get("DbgMiniDumpName", /*noprefix*/ false, &getenv);
    const char* dumpName = dmpNameCfg.IsSet() ? dmpNameCfg.AsString() : nullptr;
    if (dumpName == nullptr || dumpName[0] == '\0')
    {
        return;
    }
CrashReportRegisterStackWalker() currently returns early when DbgMiniDumpName is unset/empty, which means setting DOTNET_EnableCrashReport=1 (as described in the PR) will not emit any crash report at all. If the intent is that crash reporting can be enabled without a file path (e.g., future logcat emission) or that the env var alone is sufficient, please remove/relax this hard dependency on DbgMiniDumpName (or update the configuration semantics so the behavior matches the documentation).
    if (g_inProcCrashReportEnabled)
    {
        InProcCrashReportGenerate(signal, siginfo, context);
    }
g_inProcCrashReportEnabled is read from the fatal-signal crash path (PROCCreateCrashDumpIfEnabled) but written during startup (PROCEnableInProcCrashReport) as a plain bool. Because the signal handler can interrupt normal execution at any point, this introduces an async data race/UB. Consider using a signal-safe atomic/volatile type (e.g., volatile sig_atomic_t, Volatile<bool>, or an atomic with relaxed ordering) or publishing the flag via the same existing mechanisms used for other crash-path globals.
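One way to read this suggestion, sketched with a lock-free `std::atomic<bool>`. The names below are hypothetical and the runtime's own `Volatile<T>` helper may be the preferred choice in the actual code; lock-free atomic operations are permitted in signal handlers by the C++ standard.

```cpp
#include <atomic>

// The flag is written once at startup and read from the fatal-signal path.
// Requiring lock-free operations keeps the read usable from a signal handler.
static_assert(ATOMIC_BOOL_LOCK_FREE == 2, "atomic<bool> must always be lock-free");

static std::atomic<bool> s_sketchCrashReportEnabled{false};

// Called from startup code once configuration has been read.
inline void SketchEnableCrashReport()
{
    s_sketchCrashReportEnabled.store(true, std::memory_order_release);
}

// Called from the crash path before attempting to generate a report.
inline bool SketchIsCrashReportEnabled()
{
    return s_sketchCrashReportEnabled.load(std::memory_order_acquire);
}
```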
    include_directories(BEFORE ${CMAKE_CURRENT_SOURCE_DIR})
    include_directories(${ARCH_SOURCES_DIR})
    include_directories(${CMAKE_CURRENT_SOURCE_DIR}/../interop/inc)
    include_directories(${CLR_DIR})
include_directories(${CLR_DIR}) is added unconditionally for the entire VM build, but the new dependency that needs it (debug/crashreport/inproccrashreporter.h) is Android-only. Consider scoping this include directory to Android builds (e.g., under if(CLR_CMAKE_TARGET_ANDROID)/if(CLR_CMAKE_HOST_ANDROID)) to avoid globally widening the include search path on other platforms.
Suggested change:
    - include_directories(${CLR_DIR})
    + if(CLR_CMAKE_TARGET_ANDROID OR CLR_CMAKE_HOST_ANDROID)
    +   include_directories(${CLR_DIR})
    + endif(CLR_CMAKE_TARGET_ANDROID OR CLR_CMAKE_HOST_ANDROID)
    # Include directories

    include_directories(include)
    include_directories(${CLR_DIR})
include_directories(${CLR_DIR}) is added globally for PAL, but the only new include that appears to require it here is Android-only (debug/crashreport/inproccrashreporter.h). Consider making this include directory conditional on Android to reduce the chance of accidental header-name collisions on other targets.
Suggested change:
    - include_directories(${CLR_DIR})
    + if(CLR_CMAKE_TARGET_ANDROID)
    +   include_directories(${CLR_DIR})
    + endif()
[Note] The first commit adds the skeleton with stubbed fields, the following commits resolve the fields, and a final commit emits the JSON file.
Android CoreCLR does not currently have the same out-of-proc createdump experience available on other platforms. That makes native crashes, aborts, and mixed managed/native failures significantly harder to diagnose in real applications.
This PR adds an opt-in in-proc crash reporter for Android CoreCLR that emits a JSON-formatted crash report modeled after createdump's CrashReportWriter.
When enabled, the runtime writes the crash report directly from the crash path to:
1. logcat / console output
2. an optional *.crashreport.json file when DbgMiniDumpName is set.
The long-term intent is for this reporting path to be async-signal-safe. This PR starts that work by making the lower-risk / low-hanging pieces fit that model where practical, while leaving the more complex runtime-state publication and hardening work to follow-up PRs.
Enabling the crash reporter
The crash reporter is opt-in.
Enable JSON crash reporting
Example: write the report to a file as well
When configured this way, the runtime will also attempt to write:
Current design note
This PR is intentionally the first step, not the complete hardening story.
The payload shape and reporting path are designed with async-signal-safety as the target, but today the implementation is a mix of:
Follow-up PRs will continue hardening the remaining pieces so the implementation better matches the intended strict crash-path constraints.
Example crash report shapes
https://gist.github.com/mdh1418/a893c7dd953bf725569b22abc10e59c1
Follow-up work
Cross-platform support (iOS/osx)
Emitting the crash report to logcat/console