Core dump deduplication #193
Open
nchaimov wants to merge 8 commits into
Adds server-side components of crash handling. Client sends LDCS_MSG_CRASH_REPORT to server. Server propagates request. When a single rank is chosen for a given crashsite, replies with LDCS_MSG_CRASH_RESPONSE.
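As a rough illustration of the "one rank per crashsite" selection, here is a minimal sketch of a first-reporter-wins table. It is not the code in this PR: `site_table`, `site_table_report`, the 64-bit crash site key, and the fixed capacity are all hypothetical, and the sketch ignores how the request is propagated between servers.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_SITES 64  /* hypothetical fixed capacity for this sketch */

/* Hypothetical record: one entry per distinct crash site seen so far. */
struct site_entry {
    uint64_t site;    /* assumed key, e.g. a hash of the faulting location */
    int      chosen;  /* rank selected to write the core dump */
};

struct site_table {
    struct site_entry entries[MAX_SITES];
    size_t count;
};

/*
 * Record a crash report from `rank` at `site`.
 * Returns 1 if this rank is the one selected to dump core for the site,
 * 0 if another rank was already selected.
 */
int site_table_report(struct site_table *t, uint64_t site, int rank)
{
    for (size_t i = 0; i < t->count; i++) {
        if (t->entries[i].site == site)
            return t->entries[i].chosen == rank;
    }
    if (t->count == MAX_SITES)
        return 1;  /* table full: fail open and keep the core dump */

    /* First report for this site: the reporting rank is selected. */
    t->entries[t->count].site = site;
    t->entries[t->count].chosen = rank;
    t->count++;
    return 1;
}
```

In this sketch a full table fails open, so an unexpected number of crash sites costs extra core dumps rather than losing one.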
Add the client-side crash handler. On a fatal signal, it chains to an application-registered signal handler, if any, and, if the application handler did not fix the fault, sends LDCS_MSG_CRASH_REPORT to the server, which selects a single process per crashsite. Non-selected processes set their coredump limit to zero.
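For readers unfamiliar with the mechanism, setting the coredump limit to zero can be done with `setrlimit`. The helper below is a minimal sketch, not the PR's code; the name `suppress_core_dump` is hypothetical. It lowers only the soft limit, which an unprivileged process is always allowed to do.

```c
#include <sys/resource.h>

/*
 * Lower the soft core-dump size limit to zero so that the default action
 * of a fatal signal terminates the process without writing a core file.
 * The hard limit is left untouched.
 * Returns 0 on success, -1 on failure (errno is set by the failing call).
 */
int suppress_core_dump(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) != 0)
        return -1;
    rl.rlim_cur = 0;
    return setrlimit(RLIMIT_CORE, &rl);
}
```

One caveat for this sketch: POSIX does not list `setrlimit` among the async-signal-safe functions, so calling it from inside a signal handler relies on platform behavior.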
Adds --crash-dedup option to control whether crash deduplication is enabled.
Adds --enable-crash-handler to control whether the crash handler is compiled. Adds --enable-crash-dedup to control whether crash deduplication is enabled by default at runtime.
Adds a new script run_crash_tests_template.sh which runs various configurations of the new crash_test.c testsuite. This tests various kinds of crashes, numbers of distinct crashsites, etc. to validate that the crash handler deduplicates them correctly. It also tests chaining to application signal handlers that use the safepoint pattern to fix faults and retry.
Run the new crash handler testsuite in CI. Configure CI runner to generate coredumps, add shared filesystem to aggregate coredumps across nodes in multi-node tests. Grants CAP_SYS_RESOURCE so ulimit can be set in containers.
autotools-generated files re-generated
This PR adds a feature to Spindle to deduplicate core dumps by unique crash site, so that only one representative core dump is produced for a given crash site.
The operation is as follows:
In order to work with application-registered signal handlers, Spindle intercepts `sigaction`, `signal`, `bsd_signal` and `sysv_signal` and stores handlers registered through these. To remain invisible to the application, queries for registered signal handlers return the application's handler rather than Spindle's, or `SIG_DFL` if no application handler has been registered.

There is a new set of tests which test various crash patterns (for example, all same crash sites; all different crash sites; two sets; mixed abort and segfault). There are also tests which verify that we get the correct faulting page when an access spans two pages with different permissions. There is also a set of tests of application handlers which exercise the safepoint pattern as used by Java and Julia. A test driver script runs the tests and verifies that they crash or exit cleanly as required; that the correct number of core dumps was produced; and that the top frames of the core dumps match the expected crash site.
Known Limitations
I have observed intermittent failures in GitHub Actions where the Flux tests or the Slurm plugin tests hang. I believe these are existing issues and not due to the code in this PR, as we have seen the same failures in the existing devel branch, and, in the case of the Slurm plugin hang, the logs indicate a problem in cachepath consensus.