Core dump deduplication by nchaimov · Pull Request #193 · llnl/Spindle

nchaimov · 2026-06-20T00:41:18Z

This PR adds a feature to Spindle to deduplicate core dumps by unique crash site, so that only one representative core dump is produced for a given crash site.

The operation is as follows:

In the client, a signal handler is registered for signals that indicate a crash.
When Spindle's signal handler is invoked and the application has registered its own handler, the application handler runs first and gets an opportunity to fix the problem. We then test whether the fault is resolved: for reads, we retry the read and check whether it succeeds; for writes, we consult the memory maps to check whether the faulting address now lies in a page that permits writes.
If there was no application handler, or the application handler did not fix the problem, then for the first thread that crashes in a process, we determine the library and offset of the crash and send this to the server. For SIGABRT, the abort_msg is used instead, if set.
The server checks whether it has already seen this crash site; if so, it responds that the process was not selected. If not, it forwards the request up the tree, where the same check is repeated at each level. The first report for a given crash site to reach the root is selected as the representative.
When the response arrives, a client that was not selected sets its own core size limit to zero to suppress its core dump.
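
The write check in the second step can be illustrated with a minimal sketch. It assumes the maps in question are the Linux /proc/self/maps file, and the function name address_is_writable is hypothetical rather than taken from this PR. It uses stdio for brevity; code running inside a signal handler would need to restrict itself to async-signal-safe calls.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper, not Spindle's code.
 * Returns 1 if addr lies in a mapping with write permission, 0 if it lies in
 * a non-writable mapping or in no mapping at all, and -1 if the maps file
 * cannot be opened. Assumes each maps line fits in the buffer. */
int address_is_writable(const void *addr)
{
    FILE *maps = fopen("/proc/self/maps", "r");
    if (!maps)
        return -1;

    uintptr_t target = (uintptr_t)addr;
    char line[512];
    int writable = 0;

    while (fgets(line, sizeof line, maps)) {
        uintptr_t start, end;
        char perms[5];

        /* Each line begins "start-end perms", for example "7f00-7f10 rw-p". */
        if (sscanf(line, "%" SCNxPTR "-%" SCNxPTR " %4s",
                   &start, &end, perms) != 3)
            continue;

        if (target >= start && target < end) {
            writable = (perms[1] == 'w');
            break;
        }
    }

    fclose(maps);
    return writable;
}
```

If the application handler changed the page protection, the same address that previously returned 0 would return 1 on the second lookup, which is the condition the handler checks for writes.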

To work with application-registered signal handlers, Spindle intercepts sigaction, signal, bsd_signal and sysv_signal and stores the handlers registered through them. To remain invisible to the application, queries for registered signal handlers return the application's handler rather than Spindle's, or SIG_DFL if no application handler has been registered.

There is a new set of tests that exercise various crash patterns (for example, all same crash sites; all different crash sites; two sets; mixed abort and segfault). There are also tests that verify that we get the correct faulting page when an access spans two pages with different permissions. A further set of tests covers application handlers that use the safepoint pattern, as used by Java and Julia. A test driver script runs the tests and verifies that they crash or exit cleanly as required; that the correct number of core dumps was produced; and that the top frames of the core dumps match the expected crash site.
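
For readers unfamiliar with the safepoint pattern, the following is a minimal sketch of it on Linux, not the code in crash_test.c. The application reads from a page mapped PROT_NONE; its SIGSEGV handler makes the page readable and returns, so the faulting load is restarted and succeeds. All names here are hypothetical, and the sketch calls mprotect from the handler, which is common practice for this pattern but is not on the POSIX list of async-signal-safe functions.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static void *poll_page;        /* page the application reads at safepoints */
static size_t poll_page_size;
static volatile sig_atomic_t faults_fixed;

/* Application handler: if the fault is on the polling page, re-enable reads
 * and return so the faulting load is retried. */
static void safepoint_handler(int sig, siginfo_t *info, void *ucontext)
{
    (void)ucontext;
    char *fault = (char *)info->si_addr;
    char *page = (char *)poll_page;

    if (fault >= page && fault < page + poll_page_size &&
        mprotect(poll_page, poll_page_size, PROT_READ) == 0) {
        faults_fixed++;
        return;
    }

    /* Not our page: restore the default action. Returning re-executes the
     * faulting instruction, which then crashes normally. */
    signal(sig, SIG_DFL);
}

/* Returns the number of faults the handler repaired, or -1 on setup failure. */
int run_safepoint_demo(void)
{
    poll_page_size = (size_t)sysconf(_SC_PAGESIZE);
    poll_page = mmap(NULL, poll_page_size, PROT_NONE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (poll_page == MAP_FAILED)
        return -1;

    struct sigaction sa, old;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = safepoint_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, &old) != 0)
        return -1;

    faults_fixed = 0;
    volatile char *poll = poll_page;
    char value = *poll; /* faults once; the handler unprotects; load retried */
    (void)value;

    sigaction(SIGSEGV, &old, NULL);
    munmap(poll_page, poll_page_size);
    return (int)faults_fixed;
}
```

A crash handler that chains to such an application handler must not report a crash in this case, because the retried read succeeds after the handler returns.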

Known Limitations

The current implementation registers an altstack only on the main thread. Stack overflows on other threads may leave the signal handler without any stack space of its own. Fixing this would require wrapping all thread creations and exits to allocate and register the altstack upon thread creation, and free the altstack upon thread exit.
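
A rough sketch of the per-thread setup such a fix would need is shown below. It is not part of this PR; the function names and the 64 KiB size are assumptions, and the wrapping of thread creation and exit is left out. The handler itself would also have to be installed with SA_ONSTACK for the alternate stack to be used.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stdlib.h>

#define SKETCH_ALTSTACK_SIZE (64 * 1024) /* assumption: large enough here */

/* Allocates and registers an alternate signal stack for the calling thread.
 * Returns the allocation, to be released at thread exit, or NULL on failure. */
void *install_thread_altstack(void)
{
    stack_t ss;

    ss.ss_sp = malloc(SKETCH_ALTSTACK_SIZE);
    if (!ss.ss_sp)
        return NULL;
    ss.ss_size = SKETCH_ALTSTACK_SIZE;
    ss.ss_flags = 0;

    if (sigaltstack(&ss, NULL) != 0) {
        free(ss.ss_sp);
        return NULL;
    }
    return ss.ss_sp;
}

/* Disables the calling thread's alternate stack, then frees its memory. */
int remove_thread_altstack(void *stack_memory)
{
    stack_t disable;

    disable.ss_sp = NULL;
    disable.ss_size = 0;
    disable.ss_flags = SS_DISABLE;

    if (sigaltstack(&disable, NULL) != 0)
        return -1;
    free(stack_memory);
    return 0;
}

/* Returns 1 if the calling thread has an alternate stack, 0 if not,
 * and -1 on error. */
int thread_has_altstack(void)
{
    stack_t current;

    if (sigaltstack(NULL, &current) != 0)
        return -1;
    return (current.ss_flags & SS_DISABLE) ? 0 : 1;
}
```

A thread-creation wrapper would call install_thread_altstack at the start of the thread's entry function and remove_thread_altstack just before the thread exits.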
There is not currently a central log of which processes crashed at which sites (although this information is recorded across the various log files if Spindle is run at debug log level 2 or higher). Fixing this will require that all requests reach the root rather than short-circuiting once a server recognizes a previously-seen crash site.

I have observed intermittent failures in GitHub Actions where the Flux tests or the Slurm plugin tests hang. I believe these are existing issues and not due to the code in this PR, as we have seen the same failures in the existing devel branch, and, in the case of the Slurm plugin hang, the logs indicate a problem in cachepath consensus.

Adds server-side components of crash handling. The client sends LDCS_MSG_CRASH_REPORT to the server. The server propagates the request. When a single rank is chosen for a given crashsite, the server replies with LDCS_MSG_CRASH_RESPONSE.
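
The per-server decision can be sketched as a small seen-set lookup. This is an illustrative model only: the struct, enum, and function names are hypothetical, the crash site is assumed to be a "library+offset" string, and the actual message handling around LDCS_MSG_CRASH_REPORT and LDCS_MSG_CRASH_RESPONSE is omitted.

```c
#include <string.h>

#define MAX_SITES 64      /* assumption: fixed-size table for the sketch */
#define SITE_KEY_LEN 128

enum crash_decision {
    CRASH_NOT_SELECTED,      /* site already seen at this server */
    CRASH_FORWARD_TO_PARENT, /* new here; ask the next level up */
    CRASH_SELECTED           /* new at the root; this report wins */
};

struct crash_server {
    int is_root;
    int nseen;
    char seen[MAX_SITES][SITE_KEY_LEN];
};

/* Decides what to do with one crash report arriving at this server. */
enum crash_decision handle_crash_report(struct crash_server *srv,
                                        const char *site_key)
{
    for (int i = 0; i < srv->nseen; i++) {
        if (strncmp(srv->seen[i], site_key, SITE_KEY_LEN - 1) == 0)
            return CRASH_NOT_SELECTED;
    }

    /* Record the site. If the table is full the site is not remembered, so
     * later reports for it are treated as new (extra dumps, never fewer). */
    if (srv->nseen < MAX_SITES) {
        strncpy(srv->seen[srv->nseen], site_key, SITE_KEY_LEN - 1);
        srv->seen[srv->nseen][SITE_KEY_LEN - 1] = '\0';
        srv->nseen++;
    }

    return srv->is_root ? CRASH_SELECTED : CRASH_FORWARD_TO_PARENT;
}
```

Because a non-root server records the site before forwarding, later reports for the same site from its subtree are answered locally without traveling further up the tree.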

Add the client-side crash handler. On a fatal signal, it chains to an application-registered signal handler, if any. If the application handler did not fix the fault, it sends LDCS_MSG_CRASH_REPORT to the server, which selects a single process per crashsite. Non-selected processes set their coredump limit to zero.
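
Setting the coredump limit to zero can be done with the standard resource-limit calls. The sketch below is an assumption about how this might look, not the code in this commit: the function name is hypothetical, and it lowers only the soft limit, since lowering the soft limit is always permitted for an unprivileged process.

```c
#include <sys/resource.h>

/* Hypothetical helper: sets the soft core-file size limit to zero and leaves
 * the hard limit unchanged. Returns 0 on success, -1 on failure. */
int suppress_core_dump(void)
{
    struct rlimit lim;

    if (getrlimit(RLIMIT_CORE, &lim) != 0)
        return -1;

    lim.rlim_cur = 0;
    return setrlimit(RLIMIT_CORE, &lim);
}
```

After this call, the default action for the fatal signal still terminates the process, but the kernel writes no core file for it.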

Adds --crash-dedup option to control whether crash deduplication is enabled.

Adds --enable-crash-handler to control whether the crash handler is compiled. Adds --enable-crash-dedup to control whether crash deduplication is enabled by default at runtime.

Adds a new script run_crash_tests_template.sh which runs various configurations of the new crash_test.c testsuite. This tests various kinds of crashes, numbers of distinct crashsites, etc. to validate that the crash handler deduplicates them correctly. It also tests chaining to application signal handlers that use the safepoint pattern to fix faults and retry.

Run the new crash handler testsuite in CI. Configure the CI runner to generate coredumps, and add a shared filesystem to aggregate coredumps across nodes in multi-node tests. Grant CAP_SYS_RESOURCE so ulimit can be set in containers.

autotools-generated files re-generated

nchaimov added 8 commits June 17, 2026 15:33

Crash handler: server-side coordination

1b91ace

Crash handler: --crash-dedup option

1e9b628

Crash handler: configure options

79f0f4b

Crash handler: CI testing

58457f3

Crash handler: bootstrap

4c14f02

Install glibc debug symbols in CI containers

3e6e39e

nchaimov deployed to Spindle CI June 20, 2026 00:41 — with GitHub Actions Active

nchaimov temporarily deployed to Spindle CI June 20, 2026 00:41 — with GitHub Actions Inactive

nchaimov wants to merge 8 commits into llnl:devel from ParaToolsInc:crash-handler-core-dedup
