This project forked from rocm/rccl

ROCm Communication Collectives Library (RCCL)

License: Other

RCCL

ROCm Communication Collectives Library

Introduction

RCCL (pronounced "Rickle") is a stand-alone library of standard collective communication routines for GPUs, implementing all-reduce, all-gather, reduce, broadcast, reduce-scatter, gather, scatter, and all-to-all. There is also initial support for direct GPU-to-GPU send and receive operations. It has been optimized to achieve high bandwidth on platforms using PCIe and xGMI, as well as networking over InfiniBand Verbs or TCP/IP sockets. RCCL supports an arbitrary number of GPUs installed in a single node or across multiple nodes, and can be used in either single- or multi-process (e.g., MPI) applications.

The collective operations are implemented using ring and tree algorithms and have been optimized for throughput and latency. For best performance, small operations can be either batched into larger operations or aggregated through the API.

Requirements

  1. ROCm-supported GPUs
  2. ROCm stack installed on the system (HIP runtime and HCC or HIP-Clang)
  3. chrpath, needed to build and run the unit tests (sudo apt-get install chrpath)
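
As a quick sanity check before building, the requirements above can be probed from a shell. The following is only a minimal sketch: the helper names `have` and `check_prereqs` are hypothetical, and the hipcc location is an assumption taken from the manual build instructions, so adjust it if ROCm is installed elsewhere.

```shell
#!/bin/sh
# Assumed compiler location (from the manual build section); override via the HIPCC variable.
HIPCC="${HIPCC:-/opt/rocm/bin/hipcc}"

# have NAME: succeeds if NAME is an executable regular file or a command found on PATH.
have() {
  [ -f "$1" ] && [ -x "$1" ] || command -v "$1" >/dev/null 2>&1
}

# check_prereqs: report each tool and return non-zero if any is missing.
check_prereqs() {
  missing=0
  for tool in "$HIPCC" chrpath; do
    if have "$tool"; then
      echo "found:   $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

check_prereqs || echo "Install the missing tools before building RCCL."
```

The check only confirms that the tools can be found; it does not verify that the installed ROCm version or GPU is actually supported.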

Quickstart RCCL Build

RCCL directly depends on the HIP runtime, plus either the HCC C++ compiler or the HIP-Clang compiler, both of which are part of the ROCm software stack. In addition, HC Direct Function call support must be present on your machine; it is provided by hcc and HIP binaries that need to be installed. These binaries are currently packaged with roc-master, and will be included in ROCm 2.4.

The root of this repository has a helper script 'install.sh' to build and install RCCL on Ubuntu with a single command. It takes few options and hard-codes configuration that could otherwise be specified by invoking cmake directly, but it is a quick way to get started and can serve as an example of how to build and install.

  • ./install.sh -- builds library including unit tests
  • ./install.sh -i -- builds and installs the library to /opt/rocm/rccl; installation path can be changed with --prefix argument (see below).
  • ./install.sh -d -- installs all necessary dependencies for RCCL. Should be re-invoked if the build folder is removed.
  • ./install.sh -h -- shows help
  • ./install.sh -t -- builds library including unit tests
  • ./install.sh -r -- runs unit tests (must be already built)
  • ./install.sh -p -- builds RCCL package
  • ./install.sh -s -- builds RCCL as a static library (default: shared)
  • ./install.sh -hcc -- builds RCCL with hcc compiler; note that hcc is now deprecated (default: hip-clang)
  • ./install.sh --prefix -- specify custom path to install RCCL to (default: /opt/rocm)

Manual build

To build the library:

$ git clone https://github.com/ROCmSoftwarePlatform/rccl.git
$ cd rccl
$ mkdir build
$ cd build
$ CXX=/opt/rocm/bin/hipcc cmake ..
$ make -j 8

You may substitute an installation path of your own choosing by passing CMAKE_INSTALL_PREFIX. For example:

$ CXX=/opt/rocm/bin/hipcc cmake -DCMAKE_INSTALL_PREFIX=$PWD/rccl-install ..

Note: ensure rocm-cmake is installed (apt install rocm-cmake).

To build and install the RCCL package:

Assuming you have already cloned this repository and built the library as shown in the previous section:

$ cd rccl/build
$ make package
$ sudo dpkg -i *.deb

Installing the RCCL package requires sudo/root access because it creates a directory called "rccl" under /opt/rocm/. This step is optional: RCCL can be used directly by including the path containing librccl.so.
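
One way to include the path containing librccl.so at run time is to add it to the dynamic loader's search path. The sketch below is an illustration only: `prepend_lib_dir` is a hypothetical helper, and `$HOME/rccl/build` is an assumed location for a library built with the manual steps above; substitute the directory where your librccl.so actually lives.

```shell
# prepend_lib_dir DIR: put DIR at the front of LD_LIBRARY_PATH unless it is already listed.
prepend_lib_dir() {
  case ":${LD_LIBRARY_PATH:-}:" in
    *":$1:"*) ;;  # already present, nothing to do
    *) LD_LIBRARY_PATH="$1${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" ;;
  esac
  export LD_LIBRARY_PATH
}

# Assumed location of librccl.so (hypothetical; adjust to your build or install directory).
prepend_lib_dir "$HOME/rccl/build"
echo "$LD_LIBRARY_PATH"
```

The setting applies only to the current shell and its child processes, so it has to be repeated (or added to a shell profile) for new sessions.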

Tests

RCCL includes unit tests implemented with the Googletest framework; these are currently a work in progress. The unit tests require Googletest 1.10 or higher to build and execute properly. To invoke the unit tests, go to the build folder, then the test subfolder, and execute the appropriate unit test executable(s).

Unit test names are now of the format: [CollectiveCall]CorrectnessSweep/[CollectiveCall]CorrectnessTest.[Type of test]/[ncclRedOp_t][datatype][number of elements][number of devices][in place/out of place]_[environment variables]

This allows filtering of unit tests being run by their parameter values by passing the --gtest_filter command line flag, for example:

--gtest_filter="AllReduceCorrectnessSweep*float32*"

will run only AllReduce correctness tests with float32 datatype. See "Running a Subset of the Tests" at https://chromium.googlesource.com/external/github.com/google/googletest/+/HEAD/googletest/docs/advanced.md for more information on how to form more advanced filters.
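
Because the test names follow a fixed pattern, filter strings can be assembled mechanically. The sketch below uses a hypothetical helper, `make_filter`, that builds a filter in the same shape as the example above; the collective name `Broadcast` is assumed to follow the same [CollectiveCall]CorrectnessSweep naming.

```shell
# make_filter COLLECTIVE [PARAM]: print a --gtest_filter argument for one collective,
# optionally narrowed to tests whose name contains PARAM.
make_filter() {
  if [ -n "${2:-}" ]; then
    printf '%s\n' "--gtest_filter=${1}CorrectnessSweep*${2}*"
  else
    printf '%s\n' "--gtest_filter=${1}CorrectnessSweep*"
  fi
}

make_filter AllReduce float32   # prints --gtest_filter=AllReduceCorrectnessSweep*float32*
make_filter Broadcast           # prints --gtest_filter=BroadcastCorrectnessSweep*
```

The printed argument can be passed to the unit test executable you run from the test subfolder; quote it there so the shell does not expand the asterisks.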

There are also other performance and error-checking tests for RCCL. These are maintained separately at https://github.com/ROCmSoftwarePlatform/rccl-tests. See the rccl-tests README for more information on how to build and run those tests.

Library and API Documentation

Please refer to the Library documentation for current documentation.

Copyright

All source code and accompanying documentation is copyright (c) 2015-2018, NVIDIA CORPORATION. All rights reserved.

All modifications are copyright (c) 2019-2020 Advanced Micro Devices, Inc. All rights reserved.
