This project forked from flexflow/flexflow

A distributed deep learning framework that supports flexible parallelization strategies.

Home Page: https://flexflow.readthedocs.io

License: Apache License 2.0

FlexFlow

FlexFlow is a deep learning framework that accelerates distributed DNN training by automatically searching for efficient parallelization strategies. FlexFlow provides a drop-in replacement for PyTorch and TensorFlow Keras. Running an existing PyTorch or Keras program in FlexFlow requires changing only a few lines of the program.

Install FlexFlow

To install FlexFlow from source code, please read the instructions. If you would like to quickly try FlexFlow, we also provide pre-built Docker packages for several versions of CUDA and for the hip_rocm backend, together with Dockerfiles if you wish to build the containers manually. More info on the Docker images can be found here. You can also use conda to install the FlexFlow Python package (coming soon).

PyTorch Support

Users can also use FlexFlow to optimize the parallelization performance of existing PyTorch models in two steps. First, a PyTorch model can be exported to the FlexFlow model format using flexflow.torch.fx.torch_to_flexflow.

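import torch
import flexflow.torch.fx as fx

# MyPyTorchModule is a placeholder for any user-defined torch.nn.Module subclass.
model = MyPyTorchModule()
# Export the model to a file in the FlexFlow model format.
fx.torch_to_flexflow(model, "mymodel.ff")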

Second, a FlexFlow program can directly import a previously saved PyTorch model and autotune the parallelization performance for a given parallel machine.

from flexflow.pytorch.model import PyTorchModel

def top_level_task():
  # ffmodel, input_tensor, and cifar10 are not defined in this snippet;
  # they are assumed to be set up elsewhere in the program.
  torch_model = PyTorchModel("mymodel.ff")
  output_tensor = torch_model.apply(ffmodel, input_tensor)
  # Model compilation
  ffmodel.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
  # Model training
  (x_train, y_train) = cifar10.load_data()
  ffmodel.fit(x_train, y_train, epochs=30)

More FlexFlow PyTorch examples: see the pytorch examples folder.

TensorFlow Keras and ONNX Support

FlexFlow prioritizes PyTorch compatibility, but also includes frontends for TensorFlow Keras and ONNX models.

C++ Interface

For users who prefer to program in C/C++, FlexFlow supports a C++ program interface that is equivalent to its Python APIs.

More FlexFlow C++ examples: see the C++ examples folder.

Command-Line Flags

In addition to setting runtime configurations in a FlexFlow Python/C++ program, the FlexFlow runtime also accepts command-line arguments for various runtime parameters:

FlexFlow training flags:

  • -e or --epochs: number of total epochs to run (default: 1)
  • -b or --batch-size: global batch size in each iteration (default: 64)
  • -p or --print-freq: print frequency (default: 10)
  • -d or --dataset: path to the training dataset. If not set, synthetic data is used to conduct training.

Legion runtime flags:

  • -ll:gpu: number of GPU processors to use on each node (default: 0)
  • -ll:fsize: size of device memory on each GPU (in MB)
  • -ll:zsize: size of zero-copy memory (pinned DRAM with direct GPU access) on each node (in MB). This is used for prefetching training images from disk.
  • -ll:cpu: number of data loading workers (default: 4)
  • -ll:util: number of utility threads to create per process (default: 1)
  • -ll:bgwork: number of background worker threads to create per process (default: 1)
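
As a rough illustration of how the training and Legion flags above combine on one command line, the sketch below assembles an argument list in Python. The launcher name flexflow_python, the script name train.py, and the memory sizes are assumptions made for the example and are not taken from this page; only the flag names come from the lists above.

```python
import shlex


def build_command(script, epochs=1, batch_size=64, gpus=1,
                  fsize_mb=8000, zsize_mb=8000,
                  launcher="flexflow_python"):
    """Assemble an argument list from the documented flags.

    `launcher`, `script`, and the memory sizes are hypothetical
    placeholders; adjust them to match the actual installation.
    """
    return [
        launcher, script,
        "-e", str(epochs),            # total number of epochs
        "-b", str(batch_size),        # global batch size per iteration
        "-ll:gpu", str(gpus),         # GPUs to use on each node
        "-ll:fsize", str(fsize_mb),   # device memory per GPU, in MB
        "-ll:zsize", str(zsize_mb),   # zero-copy memory per node, in MB
    ]


cmd = build_command("train.py", epochs=30, gpus=4)
# Print the command as a single shell-ready string.
print(shlex.join(cmd))
```

Building the command as a list keeps each flag and its value as separate arguments, so it can be passed to subprocess.run without shell quoting concerns.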

Performance auto-tuning flags:

  • --search-budget or --budget: the number of iterations for the MCMC search (default: 0)
  • --search-alpha or --alpha: a hyper-parameter for the search procedure (default: 0.05)
  • --export-strategy or --export: path to export the best discovered strategy (default: None)
  • --import-strategy or --import: path to import a previously saved strategy (default: None)
  • --enable-parameter-parallel: allow FlexFlow to explore parameter parallelism for performance auto-tuning. (By default FlexFlow only considers data and model parallelism.)
  • --enable-attribute-parallel: allow FlexFlow to explore attribute parallelism for performance auto-tuning. (By default FlexFlow only considers data and model parallelism.)

For performance tuning related flags: see performance autotuning.
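
To give a feel for what an iteration budget and an alpha hyper-parameter can mean in an MCMC-style search, here is a generic Metropolis-style sketch over a toy cost function. It is not FlexFlow's implementation: the cost function, the proposal step, and the acceptance rule exp(-alpha * delta) are all illustrative assumptions.

```python
import math
import random


def mcmc_search(cost, propose, start, budget, alpha, seed=0):
    """Generic Metropolis-style search (illustrative only).

    `budget` is the number of iterations. `alpha` controls how readily
    a worse candidate is accepted: larger alpha means fewer uphill moves.
    """
    rng = random.Random(seed)
    current, current_cost = start, cost(start)
    best, best_cost = current, current_cost
    for _ in range(budget):
        candidate = propose(current, rng)
        candidate_cost = cost(candidate)
        delta = candidate_cost - current_cost
        # Always accept improvements; accept regressions with
        # probability exp(-alpha * delta).
        if delta <= 0 or rng.random() < math.exp(-alpha * delta):
            current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = current, current_cost
    return best, best_cost


def toy_cost(x):
    # Hypothetical cost with its minimum at x = 7.
    return (x - 7) ** 2


def toy_propose(x, rng):
    # Hypothetical proposal: step one unit left or right.
    return x + rng.choice([-1, 1])


best, best_cost = mcmc_search(toy_cost, toy_propose, start=0,
                              budget=1000, alpha=0.05)
print(best, best_cost)
```

With a budget of 0 the loop never runs and the starting point is returned unchanged, which mirrors the documented default of performing no search.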

Contributing

Please let us know if you encounter any bugs or have any suggestions by submitting an issue.

We welcome all contributions to FlexFlow, from bug fixes to new features and extensions.

Citations

The Team

FlexFlow is developed and maintained by teams at CMU, Facebook, Los Alamos National Lab, MIT, and Stanford (alphabetically).

License

FlexFlow uses Apache License 2.0.
