flexhpc's Introduction

Flexible HPC Cluster

Modular Microsoft Azure HPC infrastructure deployment ARM template.

Key Features of this ARM template collection

  • Choose between multiple CentOS, Ubuntu, SUSE or RedHat Linux Images, or use your own image.
  • RDMA (FDR, QDR InfiniBand), GPU (NVIDIA K80) and CPU-only compute nodes are all supported.
  • All appropriate hardware drivers are installed and configured for you via the installation scripts.
  • NFS Server with up to 32TB of Standard_LRS storage attached (defaults to 10TB), built with Azure managed disks.
  • Dynamically add or remove nodes from your cluster (built with Azure scale sets).
  • Add Head nodes or fat nodes to your cluster(s), or simply build standalone nodes.
  • Append your own scripts to install applications or customize the nodes further.

If you find a problem, please report it here.

1. Deploy a Complete Cluster with Head Node & NFS Server.

*** Under Maintenance DO NOT USE - 9/9/2017 ***

This template deploys a complete cluster composed of a head node and NFS server (combined on the same VM) and a scale set of compute nodes with a selectable node count (1-100).


  • Deployment takes around 12 minutes. Login is disabled during deployment to prevent conflicts.
  • The head node and compute nodes will be the same VM type (use the modular templates below if you don't want this).

2. Modular Step-by-Step Deployment

This section allows you to deploy the cluster infrastructure step-by-step. You will need to deploy the components of your infrastructure into the same VNET in order for them to connect to each other.

For example, you can set up a "permanent" Head node and NFS and/or BeeGFS server with your application software and data stored safely, and then spin up and tear down compute nodes (Fat Nodes & Scale Sets) as you require.


2a. [Mandatory] Create the Network Infrastructure & Head Node

This template creates the main VNET and subnets for the cluster, so deploy it first. You can treat this system purely as a standalone Head/Master/JumpBox node, or as a combined NFS server and Head/Master/JumpBox node.

2b. [Optional] Deploy a Standalone Linux NFS Server

This template deploys a standalone Linux NFS server.

2c. [Optional] Deploy a Standalone BeeGFS Storage Cluster

This template deploys a BeeGFS Storage Cluster built using a VM ScaleSet with mixed data + metadata capability on each node. Number of storage/metadata disks and their sizes are configurable. Premium_LRS storage is recommended.


2d. Deploy a Scale Set of Linux Compute Nodes

Deploy a scale set with N nodes into the same existing VNET as your NFS Server + Head Node.


  • Ensure your head node and network are deployed first, as per step 2a above.
  • The compute node install script will mount the home directory and other shares from the head node automatically.
  • The NFS server is currently assumed to be 10.0.0.4.
  • The scale set instances will record their hostnames and IP addresses into the /clustermap mount on the NFS server.
  • VM scaleset overprovisioning is disabled in this version for now to keep things predictable.

2e. Deploy Fat Node VM(s) with Optional Storage Attached

TBD


3. Manually Increase or Decrease The Number of Compute Nodes in a Scale Set Cluster

The advantage of scale sets is that you can easily grow or shrink the number of compute nodes as you need them. You can do this automatically, or manually using this template: just enter the number of nodes you want to end up with (higher or lower than the current number). Additional compute instances will be configured exactly the same as the existing compute instances, using the same cn-setup.sh installation script. Do it here:
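As a rough illustration of the "final node count" idea, the sketch below builds a standard ARM deployment parameters document in Python. The parameter names `vmssName` and `instanceCount`, and the scale set name `mycluster`, are hypothetical placeholders; check the resize template for the names it actually declares.

```python
import json

# Standard schema URI for ARM deployment parameter files.
PARAMETERS_SCHEMA = (
    "https://schema.management.azure.com/schemas/2015-01-01/"
    "deploymentParameters.json#"
)


def build_scale_parameters(scale_set_name: str, target_count: int) -> dict:
    """Build an ARM parameters document for resizing a scale set.

    The parameter names "vmssName" and "instanceCount" are hypothetical;
    replace them with the names the resize template actually declares.
    """
    if target_count < 0:
        raise ValueError("target_count must not be negative")
    return {
        "$schema": PARAMETERS_SCHEMA,
        "contentVersion": "1.0.0.0",
        "parameters": {
            "vmssName": {"value": scale_set_name},
            "instanceCount": {"value": target_count},
        },
    }


if __name__ == "__main__":
    # The value is the final node count, not the number of nodes to add.
    params = build_scale_parameters("mycluster", 8)
    print(json.dumps(params, indent=2))
```

The same document can be used to grow or shrink the cluster, because it only states the node count you want to end up with.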



4. Cluster Access Instructions

  • To ssh into the headnode or NFS server after deployment: ssh username@headnode-public-ip-address
  • username is the cluster admin username you entered into the template when you deployed.
  • The home directory is NFS automounted from the head node onto all the compute nodes in the scale set.
  • The ssh keys are stored for your user in /share/home/username/.ssh, so passwordless ssh works across the cluster.
  • You will find the private IP addresses for the scaleset nodes in /share/clustermap/hosts (head nodes) or /clustermap/hosts (compute nodes).
  • Upload your data & applications to /share/data with scp or rsync.
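To work with the cluster map from a script, something like the following Python sketch may help. It assumes the hosts file uses the usual /etc/hosts layout (an IP address followed by one or more hostnames per line); the real file format may differ, and the sample hostnames are made up for illustration.

```python
from pathlib import Path
from typing import Dict


def parse_cluster_hosts(text: str) -> Dict[str, str]:
    """Map hostname -> IP address from hosts-style text.

    Assumes each entry is "<ip> <hostname> [aliases...]", as in /etc/hosts;
    the real clustermap file format may differ.
    """
    hosts: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        fields = line.split()
        if len(fields) < 2:
            continue  # skip malformed entries
        ip, names = fields[0], fields[1:]
        for name in names:
            hosts[name] = ip
    return hosts


def load_cluster_hosts(path: str = "/clustermap/hosts") -> Dict[str, str]:
    """Read the cluster map; use /share/clustermap/hosts on the head node."""
    return parse_cluster_hosts(Path(path).read_text())


if __name__ == "__main__":
    # Hypothetical sample content; real hostnames depend on the deployment.
    sample = "10.0.0.5 cn000000\n10.0.0.6 cn000001\n"
    for name, ip in sorted(parse_cluster_hosts(sample).items()):
        print(f"{name} -> {ip}")
```

On a compute node, `load_cluster_hosts()` reads /clustermap/hosts; on the head node, pass "/share/clustermap/hosts" instead.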

Linux Image Support Matrix

You can mix and match VM SKU types and Linux versions on your head node, NFS server, scale set compute nodes and fat nodes.
The table below documents the hardware support for the various Linux distributions and versions. YES means the relevant RDMA or GPU drivers are included in the image or added dynamically during deployment by this template.

OS Image                          RDMA Support  GPU Support
Canonical:UbuntuServer:16.04-LTS  NO            YES*
Canonical:UbuntuServer:16.10      NO            YES*
OpenLogic:CentOS-HPC:6.5          YES           TBD
OpenLogic:CentOS:6.8              NO            TBD
OpenLogic:CentOS-HPC:7.1          YES           TBD
OpenLogic:CentOS:7.2              NO            TBD
OpenLogic:CentOS:7.3              NO            TBD
RedHat:RHEL:7.3                   NO            YES
SUSE:SLES-HPC:12-SP2              YES*          TBD

(*added by the installation scripts from this template at time of deployment)
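If you want to check an image choice programmatically before deploying, the matrix can be written as a small lookup table. This Python sketch transcribes the rows listed in this section. The function names are illustrative only, and treating TBD as "not supported" is an assumption made here for safety.

```python
# (RDMA, GPU) support per image, transcribed from the support matrix.
# A trailing "*" marks drivers added by the installation scripts at deploy time.
SUPPORT_MATRIX = {
    "Canonical:UbuntuServer:16.04-LTS": ("NO", "YES*"),
    "Canonical:UbuntuServer:16.10": ("NO", "YES*"),
    "OpenLogic:CentOS-HPC:6.5": ("YES", "TBD"),
    "OpenLogic:CentOS:6.8": ("NO", "TBD"),
    "OpenLogic:CentOS-HPC:7.1": ("YES", "TBD"),
    "OpenLogic:CentOS:7.2": ("NO", "TBD"),
    "OpenLogic:CentOS:7.3": ("NO", "TBD"),
    "RedHat:RHEL:7.3": ("NO", "YES"),
    "SUSE:SLES-HPC:12-SP2": ("YES*", "TBD"),
}


def supports(image: str, feature: str) -> bool:
    """Return True if the image is listed as YES or YES* for the feature.

    feature is "rdma" or "gpu". TBD entries count as unsupported here,
    which is a conservative assumption rather than a documented rule.
    """
    rdma, gpu = SUPPORT_MATRIX[image]
    status = {"rdma": rdma, "gpu": gpu}[feature.lower()]
    return status.rstrip("*") == "YES"


def images_supporting(feature: str) -> list:
    """List images with confirmed support, in table order."""
    return [image for image in SUPPORT_MATRIX if supports(image, feature)]


if __name__ == "__main__":
    print("RDMA:", images_supporting("rdma"))
    print("GPU: ", images_supporting("gpu"))
```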


Cluster Topology Overview

[Figure: hpc_vmss_architecture]

Credit: Taylor Newill, Xavier Pillons & Thomas Varlet for original base templates.

Contributor: mkiernan
