ctcyang / incubator-mxnet

License: Apache License 2.0

CMake 0.47% Makefile 0.39% R 1.80% C++ 32.40% Python 32.33% Java 0.68% C 0.63% Shell 1.62% Groovy 0.10% Jupyter Notebook 7.13% Cuda 5.29% Batchfile 0.07% MATLAB 0.20% Perl 8.38% Perl 6 0.04% Scala 5.93% ANTLR 0.01% Smalltalk 0.24% Clojure 2.10% Dockerfile 0.18%

incubator-mxnet's People

Contributors

anirudh2290, antinucleon, cjolivier01, eric-haibin-lin, hetong007, hjk41, hotpxl, jermainewang, kevinthesun, larroy, ldpe2g, marcoabreu, mavenlin, mbaijal, mli, piiswrong, pluskid, sandeep-krishnamurthy, sneakerkg, sxjscience, szha, terrytangyuan, thirdwing, tornadomeet, tqchen, winstywang, yajiedesign, yanqingmen, yzhliu, zheng-da

incubator-mxnet's Issues

Very high CPU load and slow training speed when training with Horovod

Description

When running allreduce with Horovod, performance is very poor. How can I speed things up?

Environment info (Required)

----------Python Info----------
('Version      :', '2.7.12')
('Compiler     :', 'GCC 5.4.0 20160609')
('Build        :', ('default', 'Nov 12 2018 14:36:49'))
('Arch         :', ('64bit', ''))
------------Pip Info-----------
('Version      :', '18.1')
('Directory    :', '/usr/local/lib/python2.7/dist-packages/pip')
----------MXNet Info-----------
('Version      :', '1.4.0')
('Directory    :', '/mxnet/python/mxnet')
Hashtag not found. Not installed from pre-built package.
----------System Info----------
('Platform     :', 'Linux-3.10.0-862.9.1.el7.x86_64-x86_64-with-Ubuntu-16.04-xenial')
('system       :', 'Linux')
('node         :', 'da7e5697cb17')
('release      :', '3.10.0-862.9.1.el7.x86_64')
('version      :', '#1 SMP Mon Jul 16 16:29:36 UTC 2018')
----------Hardware Info----------
('machine      :', 'x86_64')
('processor    :', 'x86_64')
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                56
On-line CPU(s) list:   0-55
Thread(s) per core:    2
Core(s) per socket:    14
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2658 v4 @ 2.30GHz
Stepping:              1
CPU MHz:               2300.000
CPU max MHz:           2800.0000
CPU min MHz:           1200.0000
BogoMIPS:              4605.32
Virtualization:        VT-x
Hypervisor vendor:     vertical
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              35840K
NUMA node0 CPU(s):     0-13,28-41
NUMA node1 CPU(s):     14-27,42-55
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 intel_ppin intel_pt ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts spec_ctrl intel_stibp

Package used (Python/R/Scala/Julia):
MXNet: https://github.com/apache/incubator-mxnet on branch master
Horovod: https://github.com/ctcyang/horovod on branch mxnet_feature_fp16
Python: 2.7.12

Build info (Required if built from source)

Compiler (gcc/clang/mingw/visual studio): GCC

MXNet commit hash:
MXNet: d2102faa228bdc6723a9da299c6ff5999cbbdcdb
Horovod: 10c35d0b54dd033b6e2d97c623d2afcbff445630

Build config:

USE_DIST_KVSTORE=1
USE_CUDA=1
USE_CUDA_PATH=/usr/local/cuda
USE_CUDNN=1
USE_NCCL=1
USE_S3=1
USE_PROFILER=1

Error Message:

Training with the InsightFace project (https://github.com/deepinsight/insightface), I get very slow training speed, as shown below, and the CPU load is quite high.

2018-12-06 10:08:42,891 Node[5] Epoch[0] Batch [0-20]   Speed: 14.93 samples/sec        acc=0.000000
2018-12-06 10:08:42,893 Node[1] Epoch[0] Batch [0-20]   Speed: 14.87 samples/sec        acc=0.000000
2018-12-06 10:08:42,895 Node[6] Epoch[0] Batch [0-20]   Speed: 14.92 samples/sec        acc=0.000000
2018-12-06 10:08:42,902 Node[3] Epoch[0] Batch [0-20]   Speed: 14.92 samples/sec        acc=0.000000
2018-12-06 10:08:43,541 Node[4] Epoch[0] Batch [0-20]   Speed: 14.91 samples/sec        acc=0.000000
2018-12-06 10:08:54,391 Node[0] Epoch[0] Batch [0-20]   Speed: 13.90 samples/sec        acc=0.000000
2018-12-06 10:08:55,067 Node[7] Epoch[0] Batch [0-20]   Speed: 13.79 samples/sec        acc=0.000000
2018-12-06 10:08:55,227 Node[2] Epoch[0] Batch [0-20]   Speed: 13.80 samples/sec        acc=0.000000
2018-12-06 10:10:57,954 Node[1] Epoch[0] Batch [20-40]  Speed: 14.81 samples/sec        acc=0.000000
2018-12-06 10:10:58,477 Node[6] Epoch[0] Batch [20-40]  Speed: 14.75 samples/sec        acc=0.000000
2018-12-06 10:11:02,376 Node[4] Epoch[0] Batch [20-40]  Speed: 14.41 samples/sec        acc=0.000000
2018-12-06 10:11:02,539 Node[3] Epoch[0] Batch [20-40]  Speed: 14.32 samples/sec        acc=0.000000
2018-12-06 10:11:03,064 Node[5] Epoch[0] Batch [20-40]  Speed: 14.27 samples/sec        acc=0.000000
2018-12-06 10:11:06,079 Node[0] Epoch[0] Batch [20-40]  Speed: 15.19 samples/sec        acc=0.000000
2018-12-06 10:11:08,218 Node[7] Epoch[0] Batch [20-40]  Speed: 15.02 samples/sec        acc=0.000000
2018-12-06 10:11:11,488 Node[2] Epoch[0] Batch [20-40]  Speed: 14.68 samples/sec        acc=0.000000
2018-12-06 10:13:19,883 Node[4] Epoch[0] Batch [40-60]  Speed: 14.54 samples/sec        acc=0.000000
2018-12-06 10:13:19,884 Node[1] Epoch[0] Batch [40-60]  Speed: 14.09 samples/sec        acc=0.000000
2018-12-06 10:13:19,888 Node[5] Epoch[0] Batch [40-60]  Speed: 14.62 samples/sec        acc=0.000000
2018-12-06 10:13:19,889 Node[3] Epoch[0] Batch [40-60]  Speed: 14.56 samples/sec        acc=0.000000
2018-12-06 10:13:20,714 Node[6] Epoch[0] Batch [40-60]  Speed: 14.06 samples/sec        acc=0.000000
2018-12-06 10:13:26,432 Node[2] Epoch[0] Batch [40-60]  Speed: 14.82 samples/sec        acc=0.000000
2018-12-06 10:13:28,200 Node[0] Epoch[0] Batch [40-60]  Speed: 14.07 samples/sec        acc=0.000000
2018-12-06 10:13:32,749 Node[7] Epoch[0] Batch [40-60]  Speed: 13.84 samples/sec        acc=0.000000
2018-12-06 10:15:40,223 Node[5] Epoch[0] Batch [60-80]  Speed: 14.25 samples/sec        acc=0.000000
2018-12-06 10:15:40,223 Node[4] Epoch[0] Batch [60-80]  Speed: 14.25 samples/sec        acc=0.000000
2018-12-06 10:15:40,224 Node[1] Epoch[0] Batch [60-80]  Speed: 14.25 samples/sec        acc=0.000000
2018-12-06 10:15:40,228 Node[3] Epoch[0] Batch [60-80]  Speed: 14.25 samples/sec        acc=0.000000
2018-12-06 10:15:41,993 Node[0] Epoch[0] Batch [60-80]  Speed: 14.95 samples/sec        acc=0.000000
2018-12-06 10:15:43,600 Node[6] Epoch[0] Batch [60-80]  Speed: 14.00 samples/sec        acc=0.000000
2018-12-06 10:15:47,586 Node[2] Epoch[0] Batch [60-80]  Speed: 14.17 samples/sec        acc=0.000000
2018-12-06 10:15:52,918 Node[7] Epoch[0] Batch [60-80]  Speed: 14.27 samples/sec        acc=0.000000
2018-12-06 10:18:00,695 Node[1] Epoch[0] Batch [80-100] Speed: 14.24 samples/sec        acc=0.000000
2018-12-06 10:18:00,782 Node[0] Epoch[0] Batch [80-100] Speed: 14.41 samples/sec        acc=0.000000
2018-12-06 10:18:01,198 Node[3] Epoch[0] Batch [80-100] Speed: 14.19 samples/sec        acc=0.000000
2018-12-06 10:18:01,380 Node[5] Epoch[0] Batch [80-100] Speed: 14.17 samples/sec        acc=0.000000
2018-12-06 10:18:02,143 Node[4] Epoch[0] Batch [80-100] Speed: 14.09 samples/sec        acc=0.000000
2018-12-06 10:18:03,768 Node[6] Epoch[0] Batch [80-100] Speed: 14.27 samples/sec        acc=0.000000
2018-12-06 10:18:07,044 Node[2] Epoch[0] Batch [80-100] Speed: 14.34 samples/sec        acc=0.000000
2018-12-06 10:18:13,723 Node[7] Epoch[0] Batch [80-100] Speed: 14.20 samples/sec        acc=0.000000

docker top shows 99% CPU load per python process:

UID                 PID                 PPID                C                   STIME               TTY                 TIME                CMD
root                188045              188026              0                   10:38               ?                   00:00:00            sleep 365d
root                191718              188026              0                   10:38               ?                   00:00:00            /bin/sh -c PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "4052680704" -mca ess_base_vpid 2 -mca ess_base_num_procs "3" -mca orte_node_regex "insightface-softmax-launcher-vr[1:4]cq,insightface-softmax-worker-[1:0-1]@0(3)" -mca orte_hnp_uri "4052680704.0;tcp://10.244.0.144,192.168.1.126:46328" --mca btl_tcp_if_include "ps" -mca pml "ob1" -mca btl "^openib" -mca plm "rsh" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca hwloc_base_binding_policy "none" -mca rmaps_base_mapping_policy "slot" -mca pmix "^s1,s2,cray,isolated"
root                191734              191718              2                   10:38               ?                   00:00:00            /usr/local/bin/orted -mca ess env -mca ess_base_jobid 4052680704 -mca ess_base_vpid 2 -mca ess_base_num_procs 3 -mca orte_node_regex insightface-softmax-launcher-vr[1:4]cq,insightface-softmax-worker-[1:0-1]@0(3) -mca orte_hnp_uri 4052680704.0;tcp://10.244.0.144,192.168.1.126:46328 --mca btl_tcp_if_include ps -mca pml ob1 -mca btl ^openib -mca plm rsh -mca plm_rsh_agent /etc/mpi/kubexec.sh -mca orte_default_hostfile /etc/mpi/hostfile -mca hwloc_base_binding_policy none -mca rmaps_base_mapping_policy slot -mca pmix ^s1,s2,cray,isolated
root                191747              191734              99                  10:38               ?                   00:00:26            python train_softmax.py --network r50 --loss-type 2 --margin-m 0.35 --data-dir /datasets/faces_emore --target --kv-store horovod --per-batch-size 100
root                191749              191734              99                  10:38               ?                   00:00:26            python train_softmax.py --network r50 --loss-type 2 --margin-m 0.35 --data-dir /datasets/faces_emore --target --kv-store horovod --per-batch-size 100
root                191751              191734              99                  10:38               ?                   00:00:28            python train_softmax.py --network r50 --loss-type 2 --margin-m 0.35 --data-dir /datasets/faces_emore --target --kv-store horovod --per-batch-size 100
root                191753              191734              99                  10:38               ?                   00:00:25            python train_softmax.py --network r50 --loss-type 2 --margin-m 0.35 --data-dir /datasets/faces_emore --target --kv-store horovod --per-batch-size 100

nvidia-smi output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.26                 Driver Version: 396.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:2D:00.0 Off |                    0 |
| N/A   43C    P0    39W / 250W |  10499MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  Off  | 00000000:31:00.0 Off |                    0 |
| N/A   42C    P0    37W / 250W |  10499MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P100-PCIE...  Off  | 00000000:35:00.0 Off |                    0 |
| N/A   41C    P0    36W / 250W |  10499MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P100-PCIE...  Off  | 00000000:39:00.0 Off |                    0 |
| N/A   40C    P0    36W / 250W |  10499MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla P100-PCIE...  Off  | 00000000:A9:00.0 Off |                    0 |
| N/A   40C    P0    35W / 250W |  10499MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla P100-PCIE...  Off  | 00000000:AD:00.0 Off |                    0 |
| N/A   39C    P0    36W / 250W |  10499MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla P100-PCIE...  Off  | 00000000:B1:00.0 Off |                    0 |
| N/A   40C    P0    37W / 250W |  10499MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla P100-PCIE...  Off  | 00000000:B5:00.0 Off |                    0 |
| N/A   39C    P0    37W / 250W |  10499MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0    191747      C   python                                     10489MiB |
|    1    191749      C   python                                     10489MiB |
|    2    191751      C   python                                     10489MiB |
|    3    191748      C   python                                     10489MiB |
|    4    191750      C   python                                     10489MiB |
|    5    191752      C   python                                     10489MiB |
|    6    191754      C   python                                     10489MiB |
|    7    191753      C   python                                     10489MiB |
+-----------------------------------------------------------------------------+

Minimum reproducible example

I use train_softmax.py, with the code ported to work with Horovod:

def train_net(args):
    ctx = []
    if args.kv_store == 'horovod':
      import horovod.mxnet as hvd
      kv = None  # Horovod replaces the kvstore, so none is created
      hvd.init()
      ctx.append(mx.gpu(hvd.local_rank()))  # one GPU per Horovod process
      # logging: prefix each line with the Horovod rank
      head = '%(asctime)-15s Node[' + str(hvd.rank()) + '] %(message)s'
      logging.basicConfig(level=logging.DEBUG, format=head)
    else:
      kv = mx.kvstore.create(args.kv_store)
      cvd = os.environ['CUDA_VISIBLE_DEVICES'].strip()
      if len(cvd) > 0:
        for i in xrange(len(cvd.split(','))):
          ctx.append(mx.gpu(i))
      if len(ctx) == 0:
        ctx = [mx.cpu()]
        print('use cpu')
      else:
        print('gpu num:', len(ctx))

    .....

    opt = optimizer.SGD(learning_rate=base_lr, momentum=base_mom, wd=base_wd, rescale_grad=_rescale)
    if args.kv_store == 'horovod':
      # wrap the optimizer so gradients are averaged across workers via allreduce
      opt = hvd.DistributedOptimizer(opt)

    ....

    # create initializer
    model.bind(data_shapes=train_dataiter.provide_data, label_shapes=train_dataiter.provide_label)
    model.init_params(initializer, arg_params=arg_params, aux_params=aux_params)
    (arg_params, aux_params) = model.get_params()
    if args.kv_store == 'horovod':
      # ensure every worker starts from rank 0's parameters
      hvd.broadcast_parameters(arg_params, root_rank=0)
      hvd.broadcast_parameters(aux_params, root_rank=0)
    model.set_params(arg_params=arg_params, aux_params=aux_params)

    model.fit(train_dataiter,
        begin_epoch        = begin_epoch,
        num_epoch          = end_epoch,
        eval_data          = val_dataiter,
        eval_metric        = eval_metrics,
        kvstore            = kv,
        optimizer          = opt,
        # optimizer_params   = optimizer_params,
        # initializer        = initializer,
        # arg_params         = arg_params,
        # aux_params         = aux_params,
        allow_missing      = True,
        batch_end_callback = _batch_callback,
        epoch_end_callback = epoch_cb )

Startup command:

mpirun --mca btl_tcp_if_include ps -bind-to none -map-by slot -mca pml ob1 -mca btl ^openib python train_softmax.py --network r50 --loss-type 2 --margin-m 0.35 --data-dir /datasets/faces_emore --target  --kv-store horovod --per-batch-size 100
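
One thing worth trying, under the assumption (not confirmed here) that the high CPU load comes from thread oversubscription: with eight workers sharing 56 cores, each MXNet process spawns its own OpenMP and engine threads. A minimal sketch that caps the per-process CPU thread counts before mxnet is imported; the value 4 is illustrative, not tuned:

import os

# Cap CPU threads per Horovod worker so eight processes do not
# oversubscribe the host's 56 cores (illustrative values).
os.environ.setdefault('OMP_NUM_THREADS', '4')            # OpenMP threads
os.environ.setdefault('MXNET_CPU_WORKER_NTHREADS', '4')  # MXNet CPU engine threads

import mxnet as mx  # import only after the environment is set

The same variables can also be exported through mpirun with -x OMP_NUM_THREADS=4 -x MXNET_CPU_WORKER_NTHREADS=4.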

Segmentation fault in horovod.allreduce() and horovod.broadcast_parameters() functions

Description

I installed MXNet and Horovod from source. When I run a simple program to test the Horovod environment, I get a "segmentation fault" error.

Environment info (Required)

----------Python Info----------
('Version      :', '2.7.13')
('Compiler     :', 'GCC 4.4.7 20120313 (Red Hat 4.4.7-1)')
('Build        :', ('default', 'Dec 20 2016 23:09:15'))
('Arch         :', ('64bit', 'ELF'))
------------Pip Info-----------
('Version      :', '9.0.1')
('Directory    :', '/home/anaconda2/lib/python2.7/site-packages/pip')
----------MXNet Info-----------
('Version      :', '1.5.0')
('Directory    :', '/home/horovod/mxnet/python/mxnet')
Hashtag not found. Not installed from pre-built package.
----------System Info----------
('Platform     :', 'Linux-3.10.0-327.el7.x86_64-x86_64-with-redhat-7.2-Maipo')
('system       :', 'Linux')
('release      :', '3.10.0-327.el7.x86_64')
('version      :', '#1 SMP Thu Oct 29 17:29:29 EDT 2015')
----------Hardware Info----------
('machine      :', 'x86_64')
('processor    :', 'x86_64')
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                40
On-line CPU(s) list:   0-39
Thread(s) per core:    2
Core(s) per socket:    10
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6

Package used (Python/R/Scala/Julia):
I'm using Python.

Build info (Required if built from source)

Compiler (gcc/clang/mingw/visual studio): gcc

MXNet commit hash:
Not provided.

Error Message:

No error message is printed; the process simply dies with a segmentation fault.

Minimum reproducible example

I wrote a simple Horovod test program named test.py, shown below.

import numpy as np
import mxnet as mx
import horovod.mxnet as hvd

hvd.init()
r = int(hvd.rank())
print("r:", r)
x = mx.nd.ones((2, 3, 4), dtype=np.float16)
print("x:", x)
y = hvd.allreduce(x)  # segmentation fault occurs here
print("y", y)

Steps to reproduce

Just a single command in the terminal: python test.py
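
Note that python test.py launches only a single process; to exercise allreduce across several workers, the launch would go through MPI, for example (assuming Open MPI, as in the issue above):

mpirun -np 2 python test.py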

What have you tried to solve it?

I located where the segmentation fault comes from: the allreduce function. The broadcast_parameters function also causes a segmentation fault.
Could someone help me fix it? Thanks in advance!
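
One way to narrow down a crash like this is to dump the Python stack on SIGSEGV with faulthandler. A minimal sketch, assuming the faulthandler backport is installed for Python 2.7 (pip install faulthandler; it is in the standard library from Python 3.3 onward):

import faulthandler
faulthandler.enable()  # print a Python traceback if the process receives SIGSEGV

import numpy as np
import mxnet as mx
import horovod.mxnet as hvd

hvd.init()
x = mx.nd.ones((2, 3, 4), dtype=np.float16)
y = hvd.allreduce(x)  # the crash site reported above
print(y)

Running the same script under gdb (gdb -ex run --args python test.py) gives the native stack trace instead.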
