This repository was archived by the owner on Nov 17, 2023. It is now read-only.

core dump in macosx using big model #12438

@loadwiki

Description

(Brief description of the problem in no more than 2 sentences.)
My C++ program sometimes dumps core in libmxnet.so when the model is as large as 200 MB;
there is no core dump with a small model.

Environment info (Required)

imac osx 10.13.6
CPU
compiler:
clang -v
Apple LLVM version 9.1.0 (clang-902.0.39.2)
Target: x86_64-apple-darwin17.7.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

Build info (Required if built from source)

git diff make/config.mk
@@ -82,7 +82,7 @@ USE_NCCL_PATH = NONE

 # whether use opencv during compilation
 # you can disable it, however, you will not able to use
 # imbin iterator
-USE_OPENCV = 1
+USE_OPENCV = 0

 # whether use libjpeg-turbo for image decode without OpenCV wrapper
 USE_LIBJPEG_TURBO = 0
@@ -90,7 +90,7 @@ USE_LIBJPEG_TURBO = 0
 USE_LIBJPEG_TURBO_PATH = NONE

 # use openmp for parallelization
-USE_OPENMP = 1
+USE_OPENMP = 0

Error Message:

(Paste the complete error message, including stack trace.)
lldb main -c /cores/core.97762
(lldb) target create "main" --core "/cores/core.97762"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/Cellar/python@2/2.7.15/Frameworks/Python.framework/Versions/2.7/lib/python2.7/copy.py", line 52, in <module>
    import weakref
  File "/usr/local/Cellar/python@2/2.7.15/Frameworks/Python.framework/Versions/2.7/lib/python2.7/weakref.py", line 14, in <module>
    from _weakref import (
ImportError: cannot import name _remove_dead_weakref
Core file '/cores/core.97762' (x86_64) was loaded.
(lldb) bt
warning: could not execute support code to read Objective-C class data in the process. This may reduce the quality of type information available.

* thread #1, stop reason = signal SIGSTOP
  * frame #0: 0x00007fff63e7da16 libsystem_kernel.dylib`__psynch_cvwait + 10
    frame #1: 0x00007fff64046589 libsystem_pthread.dylib`_pthread_cond_wait + 732
    frame #2: 0x00007fff61c81cb0 libc++.1.dylib`std::__1::condition_variable::wait(std::__1::unique_lock<std::__1::mutex>&) + 18
    frame #3: 0x000000010d6bc364 libmxnet.so`mxnet::engine::ThreadedEngine::WaitForVar(mxnet::engine::Var*) + 596
    frame #4: 0x000000010d7cd49a libmxnet.so`mxnet::NDArray::SyncCopyToCPU(void*, unsigned long) const + 954
    frame #5: 0x000000010d6ad0d4 libmxnet.so`MXPredGetOutput + 340
    frame #6: 0x000000010c1cac30 main`Infer(pred_hnd=0x00007fcba2f00000, image_data=size=1, data=size=1) at face_predict.cpp:296
    frame #7: 0x000000010c120e99 main`process_camera(model_path="../models/ncnn", camera=0x00007ffee3af5170, output_folder="./output/192.168.150.244", mainThread=true) at main.cpp:278
    frame #8: 0x000000010c125f42 main`main(argc=4, argv=0x00007ffee3af57b0) at main.cpp:484
    frame #9: 0x00007fff63d2d015 libdyld.dylib`start + 1
(lldb) thread list
Process 0 stopped
* thread #1: tid = 0x0000, 0x00007fff63e7da16 libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
  thread #2: tid = 0x0001, 0x00007fff63e7da16 libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
  thread #3: tid = 0x0002, 0x00007fff63e7da16 libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
  thread #4: tid = 0x0003, 0x00007fff63e7da16 libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
  thread #5: tid = 0x0004, 0x00007fff63e7da16 libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
  thread #6: tid = 0x0005, 0x000000010c589a4a libmxnet.so`void mxnet::op::BatchNormForwardImpl<mshadow::cpu, float, float>(mshadow::Stream<mshadow::cpu>*, mxnet::OpContext const&, mxnet::op::BatchNormParam const&, std::__1::vector<mxnet::TBlob, std::__1::allocator<mxnet::TBlob> > const&, std::__1::vector<mxnet::OpReqType, std::__1::allocator<mxnet::OpReqType> > const&, std::__1::vector<mxnet::TBlob, std::__1::allocator<mxnet::TBlob> > const&, std::__1::vector<mxnet::TBlob, std::__1::allocator<mxnet::TBlob> > const&) + 1002, stop reason = signal SIGSTOP
  thread #7: tid = 0x0006, 0x00007fff63e7da16 libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
  thread #8: tid = 0x0007, 0x00007fff63e7da16 libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
  thread #9: tid = 0x0008, 0x00007fff63e7da16 libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
  thread #10: tid = 0x0009, 0x00007fff63e7da16 libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
  thread #11: tid = 0x000a, 0x00007fff63e7e28a libsystem_kernel.dylib`__workq_kernreturn + 10, stop reason = signal SIGSTOP
  thread #12: tid = 0x000b, 0x00007fff63e7e28a libsystem_kernel.dylib`__workq_kernreturn + 10, stop reason = signal SIGSTOP
  thread #13: tid = 0x000c, 0x00007fff63e7e28a libsystem_kernel.dylib`__workq_kernreturn + 10, stop reason = signal SIGSTOP

Minimum reproducible example

There is no obvious condition that causes the core dump.
When I manually send a SIGSTOP signal to my main program, it simply stops as usual.
I'm curious why the core shows only SIGSTOP, rather than a segmentation fault, abort, or some other signal, when the core dump occurs.
At first I compiled the mxnet master branch. Then I switched to the release tag '1.2.1.rc1', and the same thing happens.
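
Since the core shows only SIGSTOP, one way to find out which signal actually terminates the process is to log fatal signals from inside the program itself. This assumes that the SIGSTOP reported by lldb is not the real cause of the crash, which is unconfirmed in this report. The sketch below is a minimal illustration: `fatal_signal_name`, `report_fatal_signal`, and `install_fatal_signal_reporter` are hypothetical helper names, not part of MXNet or of the original program.

```cpp
#include <signal.h>
#include <unistd.h>
#include <cstring>

// Hypothetical helper (not an MXNet API): name a few common fatal signals.
const char* fatal_signal_name(int sig) {
    switch (sig) {
        case SIGSEGV: return "SIGSEGV";
        case SIGBUS:  return "SIGBUS";
        case SIGABRT: return "SIGABRT";
        case SIGILL:  return "SIGILL";
        case SIGFPE:  return "SIGFPE";
        default:      return "UNKNOWN";
    }
}

// Signal handler: print the signal name to stderr, then restore the default
// action and re-raise, so the process still terminates with its usual
// default action (including a core dump where enabled) after the handler returns.
extern "C" void report_fatal_signal(int sig) {
    const char prefix[] = "fatal signal: ";
    const char* name = fatal_signal_name(sig);
    ssize_t ignored = write(STDERR_FILENO, prefix, sizeof(prefix) - 1);
    ignored = write(STDERR_FILENO, name, std::strlen(name));
    ignored = write(STDERR_FILENO, "\n", 1);
    (void)ignored;
    signal(sig, SIG_DFL);
    raise(sig);
}

// Install the handler for the signals listed above.
// Returns how many of them were hooked successfully.
int install_fatal_signal_reporter() {
    const int signals[] = {SIGSEGV, SIGBUS, SIGABRT, SIGILL, SIGFPE};
    int installed = 0;
    for (int sig : signals) {
        struct sigaction sa;
        std::memset(&sa, 0, sizeof(sa));
        sa.sa_handler = report_fatal_signal;
        sigemptyset(&sa.sa_mask);
        if (sigaction(sig, &sa, nullptr) == 0) {
            ++installed;
        }
    }
    return installed;
}
```

Calling `install_fatal_signal_reporter()` once at the start of `main`, before the predictor is created, would print the terminating signal to stderr just before the core is written. If nothing is printed, the process was probably stopped or killed by something other than these five signals.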
