aws-neuron-parallelcluster-samples's Introduction

Train a model on AWS Trn1 ParallelCluster

Introduction

This document explains how to use AWS ParallelCluster to build an HPC compute cluster that uses trn1 compute nodes to run your distributed ML training job. Once the nodes are launched, we will run a training task to confirm that the nodes are working, and use SLURM commands to check the job status. In this tutorial, we will pass a YAML configuration file to the AWS pcluster command to create the cluster. As an example, we are going to launch multiple trn1.32xlarge nodes in our cluster.

We are going to set up our ParallelCluster infrastructure as below:

[Figure: ParallelCluster infrastructure diagram]

As shown in the figure above, the VPC contains two subnets, one public and one private. The head node resides in the public subnet, while the compute fleet (in this case, trn1 instances) is in the private subnet. A Network Address Translation (NAT) gateway is also needed so that nodes in the private subnet can connect to hosts outside the VPC. In the next section, we describe how to set up all the infrastructure needed for a Trn1 ParallelCluster.

Prerequisite infrastructure

VPC Creation

A ParallelCluster requires a VPC that has two subnets and a Network Address Translation (NAT) gateway, as shown in the diagram above. Here are the instructions to create the VPC and to enable auto-assignment of public IPv4 addresses for the public subnet.
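As a rough illustration of the two-subnet layout, the sketch below carves a public and a private address range out of a VPC CIDR block using Python's standard `ipaddress` module. The `10.0.0.0/16` block, the `/24` subnet size, and the function name `plan_subnets` are assumptions made for this example only; the linked VPC instructions remain the reference for the actual setup.

```python
import ipaddress


def plan_subnets(vpc_cidr: str, new_prefix: int = 24):
    """Return (public, private) subnets carved from the VPC CIDR block.

    Hypothetical helper: it only plans address ranges and creates no AWS resources.
    """
    vpc = ipaddress.ip_network(vpc_cidr)
    # subnets() yields consecutive, non-overlapping blocks of the requested size.
    blocks = vpc.subnets(new_prefix=new_prefix)
    public = next(blocks)   # head node; auto-assign public IPv4 is enabled here
    private = next(blocks)  # trn1 compute fleet; outbound traffic goes via the NAT gateway
    return public, private


# Assumed example CIDR; substitute the range of your own VPC.
public, private = plan_subnets("10.0.0.0/16")
print("public subnet: ", public)   # 10.0.0.0/24
print("private subnet:", private)  # 10.0.1.0/24
```

Planning the ranges up front makes it easy to confirm that the two subnets do not overlap before creating them in the VPC.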

Key pair

A key pair is needed to access the head node of the cluster. You may use an existing key pair or create a new one by following the instructions here.

AWS ParallelCluster Python package

The AWS ParallelCluster Python package is needed in the local environment (for example, your Mac or PC desktop with a CLI terminal, or an AWS Cloud9 environment) from which you issue the command that starts the creation of your HPC environment in AWS. See here for instructions on installing the AWS ParallelCluster Python package in your local environment.
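Before going further, it can help to confirm that the pcluster command is reachable from your local environment. The following is a minimal sketch that uses only the Python standard library; the helper name `find_cli` is hypothetical, and the check only looks at your PATH, not at the installed version.

```python
import shutil


def find_cli(name: str = "pcluster"):
    """Return the full path of the CLI if it is on PATH, otherwise None."""
    return shutil.which(name)


path = find_cli()
if path is None:
    print("pcluster not found; install the AWS ParallelCluster package first")
else:
    print(f"pcluster found at {path}")
```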

Create a cluster

See the table below for the script that creates a trn1 ParallelCluster:

Cluster | Link
--- | ---
16 x Trn1 nodes | trn1-16-nodes-pcluster.md
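To give a feel for what such a cluster definition looks like, here is a minimal sketch that fills a YAML template and builds the matching pcluster command line. It is not the configuration from trn1-16-nodes-pcluster.md: the region, operating system, head node instance type, queue names, and all placeholder IDs are assumptions for illustration, and any Neuron-specific setup is omitted. The helper names `render_config` and `create_cluster_command` are hypothetical.

```python
from string import Template

# Skeleton of a ParallelCluster 3 configuration. All concrete values are
# assumptions for this example; use the linked script for a real cluster.
CONFIG_TEMPLATE = Template("""\
Region: $region
Image:
  Os: ubuntu2004
HeadNode:
  InstanceType: c5.4xlarge
  Networking:
    SubnetId: $public_subnet_id
  Ssh:
    KeyName: $key_name
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: compute1
      ComputeResources:
        - Name: trn1-nodes
          InstanceType: trn1.32xlarge
          MinCount: 0
          MaxCount: $max_nodes
          Efa:
            Enabled: true
      Networking:
        SubnetIds:
          - $private_subnet_id
        PlacementGroup:
          Enabled: true
""")


def render_config(region, key_name, public_subnet_id, private_subnet_id, max_nodes=16):
    """Fill the template: head node in the public subnet, trn1 fleet in the private one."""
    return CONFIG_TEMPLATE.substitute(
        region=region,
        key_name=key_name,
        public_subnet_id=public_subnet_id,
        private_subnet_id=private_subnet_id,
        max_nodes=max_nodes,
    )


def create_cluster_command(cluster_name, config_path):
    """Build (but do not run) the pcluster invocation for the given config file."""
    return [
        "pcluster", "create-cluster",
        "--cluster-name", cluster_name,
        "--cluster-configuration", config_path,
    ]


# Placeholder values; replace them with your own region, key pair, and subnet IDs.
config = render_config("us-west-2", "my-key-pair",
                       "subnet-PUBLIC-PLACEHOLDER", "subnet-PRIVATE-PLACEHOLDER")
command = create_cluster_command("my-trn1-cluster", "cluster.yaml")
print(config)
print(" ".join(command))
```

The sketch only prints the configuration and the command; writing the YAML to a file and running the command is left to you so that nothing is launched by accident.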

Launch training job

See the table below for the script that launches a model training job on the ParallelCluster:

Job | Link
--- | ---
BERT Large | dp-bert-launch-job.md
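For orientation, a multi-node job on a SLURM cluster is typically submitted as a batch script and then monitored with squeue. The sketch below generates such a script and the matching status command. It is not the launch script from dp-bert-launch-job.md: the job name, the output file pattern, and the training command `./run_training.sh` are hypothetical placeholders, as are the helper names.

```python
def build_batch_script(job_name: str, nodes: int, train_command: str) -> str:
    """Return the text of a SLURM batch script that runs train_command on every node."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        "#SBATCH --exclusive",                  # do not share nodes with other jobs
        f"#SBATCH --output={job_name}-%j.out",  # %j expands to the SLURM job ID
        "",
        # srun launches the command as one task per allocated node by default.
        f"srun {train_command}",
        "",
    ]
    return "\n".join(lines)


def status_command(job_name: str) -> list:
    """Build (but do not run) the squeue call that lists jobs with this name."""
    return ["squeue", "--name", job_name]


# Hypothetical values for illustration only.
script = build_batch_script("bert-large", 16, "./run_training.sh")
print(script)
print(" ".join(status_command("bert-large")))
```

On the head node, the generated script would be submitted with sbatch, and squeue (or sinfo for node state) would then show whether the job is pending or running.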

Security

See CONTRIBUTING for more information.

License

This library is licensed under the Amazon Software License.

Release Notes

Please refer to the Change Log.
