This repository was archived by the owner on Nov 17, 2023. It is now read-only.

imagenet hanging in the end? #4391

@amithr1

I cloned the latest mxnet and tried running the imagenet example on a very small subset of the images.
With distributed training using dist_device_sync and two workers, the run hangs at the end.

Running the cifar example was fine and did not hang at the end.
With imagenet, the output looks something like this (on two workers):

[11:19:09] src/io/iter_image_recordio.cc:221: ImageRecordIOParser: /tmp/amithr/imagenet_mini_train.rec, use 4 threads for decoding..
[11:16:32] src/io/iter_image_recordio.cc:221: ImageRecordIOParser: /tmp/amithr/imagenet_mini_train.rec, use 4 threads for decoding..
INFO:root:Start training with [gpu(1), gpu(2)]
INFO:root:Start training with [gpu(1), gpu(2)]
INFO:root:Epoch[0] Batch [20] Speed: 33.72 samples/sec Train-accuracy=0.515625
INFO:root:Epoch[0] Batch [20] Speed: 33.80 samples/sec Train-accuracy=0.467187
INFO:root:Epoch[0] Resetting Data Iterator
INFO:root:Epoch[0] Time cost=35.018
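
One way to read the log above: both workers print the "Start training" and "Batch [20]" lines, but "Resetting Data Iterator" and "Time cost" each appear only once. The sketch below counts those messages to make that visible. It is only a reading aid, not a diagnosis: it assumes every worker logs each message once in this excerpt, and the event names and the helpers `count_events` and `missing_events` are hypothetical, not part of mxnet. A message that shows up fewer times than there are workers may mean one worker never finished the epoch, or simply that the pasted output is incomplete.

```python
import re
from collections import Counter

# INFO lines copied from the report above; both workers write to one stream.
LOG = """\
INFO:root:Start training with [gpu(1), gpu(2)]
INFO:root:Start training with [gpu(1), gpu(2)]
INFO:root:Epoch[0] Batch [20] Speed: 33.72 samples/sec Train-accuracy=0.515625
INFO:root:Epoch[0] Batch [20] Speed: 33.80 samples/sec Train-accuracy=0.467187
INFO:root:Epoch[0] Resetting Data Iterator
INFO:root:Epoch[0] Time cost=35.018
"""

# Hypothetical event names mapped to the message text seen in the log.
EVENTS = {
    "start": re.compile(r"Start training with"),
    "batch": re.compile(r"Epoch\[\d+\] Batch \[\d+\]"),
    "reset": re.compile(r"Resetting Data Iterator"),
    "time_cost": re.compile(r"Time cost="),
}


def count_events(log_text):
    """Count how many log lines match each event pattern."""
    counts = Counter()
    for line in log_text.splitlines():
        for name, pattern in EVENTS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


def missing_events(log_text, num_workers):
    """Return events logged fewer times than there are workers.

    Assumption: each worker logs each event at least once in the excerpt.
    """
    counts = count_events(log_text)
    return {
        name: num_workers - counts[name]
        for name in EVENTS
        if counts[name] < num_workers
    }


if __name__ == "__main__":
    # Prints {'reset': 1, 'time_cost': 1} for the excerpt above.
    print(missing_events(LOG, num_workers=2))
```

For the excerpt in this report, the end-of-epoch messages are each one short of the worker count, which matches the description of a hang at the end of the run.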
