Skip to content
This repository was archived by the owner on Nov 17, 2023. It is now read-only.
This repository was archived by the owner on Nov 17, 2023. It is now read-only.

gluon's dataloader crashes if num_workers > 0 #14541

@lambdaofgod

Description

@lambdaofgod

Description

gluon's dataloader sometimes crashes if num_workers > 0
This seems related to #11908

Environment info (Required)

----------Python Info----------
Version      : 3.6.5
Compiler     : GCC 7.2.0
Build        : ('default', 'Apr 29 2018 16:14:56')
Arch         : ('64bit', '')
------------Pip Info-----------
Version      : 10.0.1
Directory    : /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/pip
----------MXNet Info-----------
Version      : 1.3.1
Directory    : /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet
Commit Hash   : 19c501680183237d52a862e6ae1dc4ddc296305b
----------System Info----------
Platform     : Linux-4.4.0-1077-aws-x86_64-with-debian-stretch-sid
system       : Linux
node         : ip-172-31-88-88
release      : 4.4.0-1077-aws
version      : #87-Ubuntu SMP Wed Mar 6 00:03:05 UTC 2019
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Stepping:              1
CPU MHz:               2713.371
CPU max MHz:           3000.0000
CPU min MHz:           1200.0000
BogoMIPS:              4600.11
Hypervisor vendor:     Xen
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              46080K
NUMA node0 CPU(s):     0-7
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single kaiser fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx xsaveopt
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0011 sec, LOAD: 0.3721 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0005 sec, LOAD: 0.0210 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.0004 sec, LOAD: 0.0163 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0004 sec, LOAD: 0.1901 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0028 sec, LOAD: 0.0553 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0004 sec, LOAD: 0.0240 sec.

I'm using MXNet 1.3.1

Error Message:

Sometimes I get


Exception in thread Thread-7:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/gluon/data/dataloader.py", line 190, in fetcher_loop
    idx, batch = data_queue.get()
  File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/multiprocessing/queues.py", line 337, in get
    return _ForkingPickler.loads(res)
  File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/gluon/data/dataloader.py", line 57, in rebuild_ndarray
    fd = fd.detach()
  File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
  File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/multiprocessing/resource_sharer.py", line 87, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
  File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/multiprocessing/connection.py", line 493, in Client
    answer_challenge(c, authkey)
  File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/multiprocessing/connection.py", line 737, in answer_challenge
    response = connection.recv_bytes(256)        # reject large message
  File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer

Sometimes also


Exception in thread Thread-4:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/gluon/data/dataloader.py", line 190, in fetcher_loop
    idx, batch = data_queue.get()
  File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/multiprocessing/queues.py", line 337, in get
    return _ForkingPickler.loads(res)
  File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/gluon/data/dataloader.py", line 57, in rebuild_ndarray
    fd = fd.detach()
  File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
  File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/multiprocessing/resource_sharer.py", line 87, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
  File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/multiprocessing/connection.py", line 487, in Client
    c = SocketClient(address)
  File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/multiprocessing/connection.py", line 614, in SocketClient
    s.connect(address)
FileNotFoundError: [Errno 2] No such file or directory

Minimum reproducible example

from mxnet import gluon
import numpy as np
import itertools

import mxnet

mxnet.__version__ ## 1.3.1

dummy_dataset = gluon.data.ArrayDataset(
[(np.array([  0,  27, 365,   3], dtype=int),
  np.array([ 2,  0, 12, 27, 11,  0,  3], dtype=int),
  4,
  7,
  0),
 (np.array([  0,  27, 365,   3], dtype=int),
  np.array([ 2,  0, 12, 27, 11,  0,  3], dtype=int),
  4,
  7,
  1)]
)

dummy_dataloader = gluon.data.DataLoader(dummy_dataset, batch_size=1, num_workers=1)

list(itertools.islice(dummy_dataloader, 0, 1))

What have you tried to solve it?

I tried setting pin_memory argument to True, but the problem persists.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions