Distributed Training of RNN by Spark

This prototype aims to speed up model training for large recurrent neural networks. Imagine you want to train an RNN over a large vocabulary, say tens of thousands of frequent words: you don't want to wait hours for training to finish, and you also want a better chance of finding globally optimal model parameters. Why not train your models on workers distributed across a Hadoop cluster? That way you can train many models in parallel, pick the best one, and significantly shorten training time. The prototype is built on top of Spark 2.2 and can be run on AWS EMR clusters to test performance. Because this is a prototype, I wanted to dive into the details of the code and figure out where the bottleneck is, which is why I didn't use TensorFlow or Theano packages. Instead, I built the prototype on top of this excellent tutorial.
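As a rough illustration of the pattern (a minimal sketch, not the actual spark_main.py in this repo; train_rnn and the hyperparameter grid are hypothetical placeholders):

# Hypothetical sketch of the parallel-training pattern on Spark
from pyspark import SparkContext

def train_rnn(config):
    # Train one independent RNN for this (seed, learning_rate) pair and
    # return (validation_loss, parameters). The real training loop
    # (forward pass, BPTT, SGD updates) would live here.
    seed, learning_rate = config
    validation_loss = 0.0   # placeholder
    parameters = None       # placeholder
    return validation_loss, parameters

if __name__ == "__main__":
    sc = SparkContext(appName="rnn-distributed-training-sketch")

    # Each Spark task trains its own model with a different configuration.
    configs = [(seed, lr) for seed in range(8) for lr in (0.005, 0.01, 0.05)]
    results = sc.parallelize(configs, len(configs)).map(train_rnn).collect()

    # Keep the model with the lowest validation loss.
    best_loss, best_params = min(results, key=lambda r: r[0])
    print("best validation loss:", best_loss)
    sc.stop()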

Standalone Installation Guide:

# Clone the repo
git clone https://github.com/fionayumfe/rnn-distributed-training.git
cd rnn-distributed-training

# Create a new virtual environment (optional, but recommended)
virtualenv venv
source venv/bin/activate

# Install requirements
pip install -r requirements.txt

Setting up an Elastic MapReduce (EMR) cluster on AWS:

aws emr create-cluster \
    --configurations your-json-file \
    --release-label emr-5.3.1 \
    --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge \
                      InstanceGroupType=CORE,InstanceCount=1,InstanceType=m3.xlarge \
    --auto-terminate
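The --configurations flag points at a JSON file of EMR classification blocks; its contents depend on your cluster. As a hedged example (a hypothetical helper, not part of this repo), the snippet below writes a file that exports PYSPARK_PYTHON and PYTHONHASHSEED for the executors:

# write_emr_config.py -- hypothetical helper; adjust the classifications to your cluster
import json

emr_config = [
    {
        "Classification": "spark-env",
        "Properties": {},
        "Configurations": [
            {
                "Classification": "export",
                "Properties": {
                    "PYSPARK_PYTHON": "python34",
                    "PYTHONHASHSEED": "0",
                },
            }
        ],
    }
]

with open("emr-config.json", "w") as f:
    json.dump(emr_config, f, indent=2)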

You can also add a spark-submit step to your EMR script:

$SPARK_HOME/bin/spark-submit \
    --master   yarn \
    --conf     spark.yarn.submit.waitAppCompletion=false \
    --conf     spark.executorEnv.PYTHONHASHSEED=0 \
    --conf     spark.yarn.executor.memoryOverhead=4096 \
    --conf     spark.executor.memory=7.5g \
    --packages org.apache.hadoop:hadoop-aws:2.7.3 \
    /home/hadoop/spark_main.py
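If you drive the cluster from boto3 (installed by the bootstrap step below), the same command can be attached as an EMR step. This is only a sketch; the cluster ID, region, and step name are placeholders to fill in yourself:

# add_spark_step.py -- hypothetical sketch using boto3's EMR client
import boto3

emr = boto3.client("emr", region_name="us-east-1")   # use your region

step = {
    "Name": "rnn-distributed-training",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",   # EMR's generic command runner
        "Args": [
            "spark-submit",
            "--master", "yarn",
            "--conf", "spark.executorEnv.PYTHONHASHSEED=0",
            "--conf", "spark.yarn.executor.memoryOverhead=4096",
            "--packages", "org.apache.hadoop:hadoop-aws:2.7.3",
            "/home/hadoop/spark_main.py",
        ],
    },
}

response = emr.add_job_flow_steps(JobFlowId="j-XXXXXXXXXXXX", Steps=[step])
print("submitted step:", response["StepIds"])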

You may also create a shell script and run it as a bootstrap step. In that step, your source files are copied to each node and your dependencies are installed (PYTHONHASHSEED is pinned so Python 3 string hashing stays consistent across executors). An example bootstrap script:

#!/usr/bin/env bash

# Copy the driver script from S3 to the node
aws s3 cp   s3://your_folder/spark_main.py             /home/hadoop/

export PATH="$PATH:/home/hadoop"
export CLASS_PATH="$CLASS_PATH:/home/hadoop"
export PYTHONHASHSEED=0
alias  python=python34

# Install any system packages you need
sudo yum -y install your packages

# Install Python dependencies (non-standard modules not on the Amazon Machine Image)
sudo pip-3.4 install py4j boto3 psutil awscli pandas

Setting up a CUDA-enabled GPU instance on AWS EC2:

# Install build tools
sudo apt-get update
sudo apt-get install -y build-essential git python-pip \
    libfreetype6-dev libxft-dev libncurses-dev libopenblas-dev gfortran \
    python-matplotlib libblas-dev liblapack-dev libatlas-base-dev \
    python-dev python-pydot linux-headers-generic linux-image-extra-virtual
sudo pip install -U pip

# Install CUDA 7
wget http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1410/x86_64/cuda-repo-ubuntu1410_7.0-28_amd64.deb
sudo dpkg -i cuda-repo-ubuntu1410_7.0-28_amd64.deb
sudo apt-get update
sudo apt-get install -y cuda
sudo reboot

# Clone the repo and install requirements
git clone https://github.com/fionayumfe/rnn-distributed-training.git
cd rnn-distributed-training
sudo pip install -r requirements.txt

# Set Environment variables
export CUDA_ROOT=/usr/local/cuda-7.0
export PATH=$PATH:$CUDA_ROOT/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CUDA_ROOT/lib64
export THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32
# For profiling only
export CUDA_LAUNCH_BLOCKING=1
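After setting the flags, a quick sanity check (a minimal sketch, assuming Theano was installed from the requirements) confirms that Theano is actually compiling to the GPU:

# check_gpu.py -- minimal check that Theano targets the GPU
import numpy as np
import theano
import theano.tensor as T

x = theano.shared(np.random.rand(1000).astype(theano.config.floatX))
f = theano.function([], T.exp(x))
f()

# With THEANO_FLAGS=device=gpu the compiled graph should contain Gpu* ops.
print("device:", theano.config.device)
print("ops:", [type(node.op).__name__ for node in f.maker.fgraph.toposort()])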
