
Core dump deduplication #193

Open
nchaimov wants to merge 8 commits into llnl:devel from ParaToolsInc:crash-handler-core-dedup


@nchaimov

Collaborator

This PR adds a feature to Spindle to deduplicate core dumps by unique crash site, so that only one representative core dump is produced for a given crash site.

The operation is as follows:

  • In the client, a signal handler is registered for signals that indicate a crash.
  • When the signal handler is invoked and an application signal handler has been registered, the application handler is given an opportunity to fix the problem. After it runs, we check whether the fault was resolved: for reads, we retry the read and check whether it succeeds; for writes, we consult the memory maps to check whether the faulting address now lies in a page that permits writes.
  • If there was no application handler, or the application handler did not fix the problem, then for the first thread that crashes in a process, we determine the library and offset of the crash and send this to the server. For SIGABRT, the abort_msg is used instead, if set.
  • The server checks whether it has already seen this crash site before; if so, it responds that the process was not selected. If not, it forwards the request up the tree, where the process is repeated. The first of a given crash site to reach the root is selected as the representative.
  • Upon receiving a response, the client, if not selected, sets its own core size limit to zero to suppress the core dump.
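The server-side check and the client-side suppression in the steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the names `make_site_key`, `record_crash_site`, and `suppress_core_dump`, the fixed-size table, and the `library+0xoffset` key format are all assumptions made for the example. A real handler would also need to restrict itself to async-signal-safe calls on the client side.

```c
#include <stdio.h>
#include <string.h>
#include <sys/resource.h>

#define MAX_SITES 64   /* hypothetical capacity of the per-server table */
#define SITE_LEN  256  /* hypothetical maximum key length */

/* Hypothetical per-server table of crash sites seen so far. */
static char seen_sites[MAX_SITES][SITE_LEN];
static int num_seen = 0;

/* Build a key such as "libfoo.so+0x1a2b" from a library name and offset.
 * Returns 0 on success, -1 if the key does not fit in the buffer. */
int make_site_key(char *buf, size_t len, const char *library,
                  unsigned long offset)
{
    int n = snprintf(buf, len, "%s+0x%lx", library, offset);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Returns 1 if the site was already seen (the requester is not selected),
 * 0 if it is new (it is recorded and would be forwarded up the tree),
 * or -1 if the key cannot be stored. */
int record_crash_site(const char *key)
{
    for (int i = 0; i < num_seen; i++) {
        if (strcmp(seen_sites[i], key) == 0)
            return 1;
    }
    if (num_seen == MAX_SITES || strlen(key) >= SITE_LEN)
        return -1;
    strcpy(seen_sites[num_seen++], key);
    return 0;
}

/* Client side, when not selected: drop the soft core size limit to zero
 * so that the default signal action produces no core file. */
int suppress_core_dump(void)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_CORE, &rl) != 0)
        return -1;
    rl.rlim_cur = 0;
    return setrlimit(RLIMIT_CORE, &rl);
}
```

Lowering only the soft limit needs no privileges, which is why the unselected client can always do it from within its own process.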

To work with application-registered signal handlers, Spindle intercepts sigaction, signal, bsd_signal and sysv_signal and stores the handlers registered through them. To remain invisible to the application, queries for registered signal handlers return the application's handler rather than Spindle's, or SIG_DFL if no application handler has been registered.

There is a new set of tests which exercise various crash patterns (for example, all same crash sites; all different crash sites; two sets; mixed abort and segfault). There are also tests which verify that we get the correct faulting page when an access spans two pages with different permissions. There is a set of tests of application handlers which exercise the safepoint pattern as used by Java and Julia. A test driver script runs the tests and verifies that they crash or exit cleanly as required; that the correct number of core dumps was produced; and that the top frames of the core dumps match the expected crash site.
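To make the page-spanning case concrete, the arithmetic involved can be sketched as below. The helper names `page_base` and `access_spans_pages` are hypothetical, and the sketch assumes the page size is a power of two; it only shows why the address where an access starts and the page that actually faults can differ.

```c
#include <stddef.h>
#include <stdint.h>

/* Round an address down to the start of its page.
 * Assumes page_size is a power of two. */
uintptr_t page_base(uintptr_t addr, uintptr_t page_size)
{
    return addr & ~(page_size - 1);
}

/* Returns 1 if the access [addr, addr + size) touches more than one
 * page, 0 otherwise. */
int access_spans_pages(uintptr_t addr, size_t size, uintptr_t page_size)
{
    if (size == 0)
        return 0;
    return page_base(addr, page_size) !=
           page_base(addr + size - 1, page_size);
}
```

For example, with 4096-byte pages a 4-byte access starting at 0x1ffe begins in the page at 0x1000 and ends in the page at 0x2000, so either page may be the one whose permissions cause the fault.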

Known Limitations

  • The current implementation registers an altstack only on the main thread. Stack overflows on other threads may leave the signal handler without any stack space of its own. Fixing this would require wrapping all thread creations and exits to allocate and register the altstack upon thread creation, and free the altstack upon thread exit.
  • There is not currently a central log of which processes crashed at which sites (although this information is recorded across the various log files if Spindle is run at debug log level 2 or higher). Fixing this will require that all requests reach the root rather than short-circuiting once a server recognizes a previously-seen crash site.

I have observed intermittent failures in GitHub Actions where the Flux tests or the Slurm plugin tests hang. I believe these are existing issues and not due to the code in this PR, as we have seen the same failures in the existing devel branch, and, in the case of the Slurm plugin hang, the logs indicate a problem in cachepath consensus.

nchaimov added 8 commits June 17, 2026 15:33
Adds server-side components of crash handling.
Client sends LDCS_MSG_CRASH_REPORT to server.
Server propagates request. When a single rank is chosen
for a given crashsite, replies with LDCS_MSG_CRASH_RESPONSE.
Add the client-side crash handler. On a fatal signal, it chains to an
application-registered signal handler, if any, and, if the application
handler did not fix the fault, sends LDCS_MSG_CRASH_REPORT to the server
which selects a single process per crashsite. Non-selected processes
set their coredump limit to zero.
Adds --crash-dedup option to control whether crash deduplication is
enabled.
Adds --enable-crash-handler to control whether the crash handler is
compiled.
Adds --enable-crash-dedup to control whether crash deduplication is
enabled by default at runtime.
Adds a new script run_crash_tests_template.sh which runs various
configurations of the new crash_test.c testsuite. This tests various
kinds of crashes, numbers of distinct crashsites, etc. to validate that
the crash handler deduplicates them correctly. It also tests chaining to
application signal handlers that use the safepoint pattern to fix faults
and retry.
Run the new crash handler testsuite in CI. Configure CI runner to generate
coredumps, add shared filesystem to aggregate coredumps across nodes in
multi-node tests. Grants CAP_SYS_RESOURCE so ulimit can be set in
containers.
autotools-generated files re-generated