torchmoe / moe-infinity

PyTorch library for cost-effective, fast and easy serving of MoE models.

License: Apache License 2.0

C++ 52.18% Python 47.82%
inference-engine large-language-models mixture-of-experts huggingface pytorch

moe-infinity's Introduction

MoE-Infinity

MoE-Infinity is a cost-effective, fast, and easy-to-use library for Mixture-of-Experts (MoE) inference and serving.

MoE-Infinity is cost-effective yet fast:

  • Offloading MoE's experts to host memory, allowing memory-constrained GPUs to serve MoE models.
  • Minimizing the expert offloading overheads through several novel techniques: expert activation tracing, activation-aware expert prefetching, and activation-aware expert caching.
  • Supporting LLM acceleration techniques (such as FlashAttention).
  • Supporting multi-GPU environments with numerous OS-level performance optimizations.
  • Achieving SOTA latency and throughput when serving MoEs in a resource-constrained GPU environment (in comparison with HuggingFace Accelerate, DeepSpeed, Mixtral-Offloading, and Ollama/llama.cpp).

MoE-Infinity is easy-to-use:

Note: the open-source MoE-Infinity has been redesigned to be friendly to HuggingFace users. It differs from the version reported in the paper, which prioritizes extreme performance above all else. As a result, distributed inference is currently not supported in this open-source version.


Performance

Single A5000 GPU (24GB memory), per-token latency (seconds) for generation with a mixed dataset drawn from FLAN, BIG-Bench, and MMLU. Lower per-token latency is better.

                     switch-large-128   NLLB-MoE-54B   Mixtral-8x7b
MoE-Infinity         0.230              0.239          0.895
Accelerate           1.043              3.071          6.633
DeepSpeed            4.578              8.381          2.486
Mixtral Offloading   X                  X              1.752
Ollama               X                  X              0.903

Single A5000 GPU, throughput (tokens/s) for generation with batch size 32. Higher throughput is better.

                     switch-large-128   NLLB-MoE-54B   Mixtral-8x7b
MoE-Infinity         69.105             30.300         12.579
Accelerate           5.788              4.344          1.245
DeepSpeed            7.416              4.334          7.727
Mixtral Offloading   X                  X              7.684
Ollama               X                  X              1.107

The Mixtral Offloading experiment was run with a batch size of 16, since a batch size of 32 caused out-of-memory errors on the GPU.

Ollama does not support batched generation, so its throughput is measured with a batch size of 1.
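For reference, per-token latency can be estimated with a simple timing wrapper around generate. The sketch below mirrors the usage example later in this README; the checkpoint, prompt, and max_new_tokens values are illustrative placeholders, not the benchmark configuration used for the tables above.

import os
import time

from transformers import AutoTokenizer
from moe_infinity import MoE

# Illustrative setup (mirrors the usage example below); not the benchmark configuration.
checkpoint = 'mistralai/Mixtral-8x7B-Instruct-v0.1'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = MoE(checkpoint, {"offload_path": os.path.join(os.path.expanduser('~'), "moe-infinity")})

input_ids = tokenizer("translate English to German: How old are you?",
                      return_tensors="pt").input_ids.to("cuda:0")

start = time.time()
output_ids = model.generate(input_ids, max_new_tokens=64)
elapsed = time.time() - start

# For a decoder-only model the output contains the prompt, so subtract its length.
new_tokens = output_ids.shape[1] - input_ids.shape[1]
print(f"per-token latency: {elapsed / new_tokens:.3f} s")
# Throughput at batch size 32 would sum the generated tokens across all sequences
# in the batch and divide by the same wall-clock time.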

Installation

We recommend installing MoE-Infinity in a virtual environment. To install MoE-Infinity, you can either install it from PyPI or build it from source.

Install from conda environment

conda env create --file environment.yml
conda activate moe-infinity

Install from PyPI

pip install moe-infinity
conda install -c conda-forge libstdcxx-ng=12 # assumes conda; otherwise install libstdcxx-ng 12 (or gcc 12) with your system package manager

Install from Source

git clone https://github.com/TorchMoE/MoE-Infinity.git
cd MoE-Infinity
pip install -e .

Enable FlashAttention (Optional)

Install FlashAttention (>=2.5.2) for faster inference with the following command.

FLASH_ATTENTION_FORCE_BUILD=TRUE pip install flash-attn

Post-installation, MoE-Infinity will automatically integrate with FlashAttention to enhance performance.
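If you want to confirm that FlashAttention is importable before running inference, a quick check (plain flash-attn usage, not a MoE-Infinity API) is:

import flash_attn  # raises ImportError if FlashAttention is not installed

print(flash_attn.__version__)  # should report >= 2.5.2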

Usage and Examples

We provide a simple API for diverse setups, including single GPU, multiple GPUs, and multiple nodes. The following example shows how to use MoE-Infinity to run generation on a HuggingFace LLM.

Sample Code of Huggingface LLM Inference

import torch
import os
from transformers import AutoTokenizer
from moe_infinity import MoE

user_home = os.path.expanduser('~')

checkpoint = 'TheBloke/Mixtral-8x7B-v0.1-GPTQ'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

config = {
    "offload_path": os.path.join(user_home, "moe-infinity"),
    "device_memory_ratio": 0.75, # 75% of the device memory is used for caching, change the value according to your device memory size on OOM
}

model = MoE(checkpoint, config)

input_text = "translate English to German: How old are you?"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda:0")

output_ids = model.generate(input_ids)
output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(output_text)
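Continuing the snippet above, generate forwards keyword arguments to the underlying HuggingFace generate (see the API proposal below), so standard generation options can be passed directly; the values here are illustrative, not recommended settings.

# generate() accepts the usual HuggingFace generation kwargs; values are illustrative.
output_ids = model.generate(
    input_ids,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))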

Running Inference

This command runs the script on selected GPUs.

CUDA_VISIBLE_DEVICES=0,1 python script.py
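CUDA_VISIBLE_DEVICES is a standard CUDA environment variable, so the process only sees the listed GPUs, re-indexed from 0. A quick sanity check (independent of MoE-Infinity):

import torch

# With CUDA_VISIBLE_DEVICES=0,1 this prints 2, and the visible GPUs are
# addressed as cuda:0 and cuda:1 inside the process.
print(torch.cuda.device_count())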

We also provide an example script that runs inference on a HuggingFace LLM. The script downloads the model checkpoint and runs inference on the specified input text; the output is printed to the console.

CUDA_VISIBLE_DEVICES=0 python example/interface_example.py --model_name_or_path "mistralai/Mixtral-8x7B-Instruct-v0.1" --offload_dir <your local path on SSD> 

Release Plan

We plan to release the following features in the coming months:

  • We currently support PyTorch as the default inference engine and are in the process of supporting vLLM as an additional inference runtime, including support for KV cache offloading.
  • Supporting expert parallelism for distributed MoE inference.
  • More (We welcome contributors to join us!)

Citation

If you use MoE-Infinity for your research, please cite our paper:

@inproceedings{moe-infinity2024,
  title={MoE-Infinity: Activation-Aware Expert Offloading for Efficient MoE Serving},
  author={Leyang Xue and Yao Fu and Zhan Lu and Luo Mai and Mahesh Marina},
  booktitle={https://arxiv.org/abs/2401.14361},
  year={2024}
}

moe-infinity's People

Contributors

drunkcoding, luomai


moe-infinity's Issues

Support Constrained Server Memory

A Colab T4 server has 12GB of host DRAM and 16GB of GPU memory. The quantized Mixtral checkpoint is 26GB as a single file, so it cannot be loaded into memory when creating the custom format for offloading.

TODO for first release

  • API design
  • Document for installation and PyPI
  • performance table
  • Support Mixtral multi-GPU
  • Load trace

MoE-Infinity API Proposal

Description

We propose a class MoE as the entry point. It loads a (potentially sharded) checkpoint into a model, sending weights to the target device as they are loaded, and adds the hooks that make the model run properly (even if it is split across devices).

The class also provides a generate member function that overrides the default generate and adds tracing capability; otherwise it behaves the same as HuggingFace's model.generate.

class MoE:
  def __init__(self, model_name_or_path: Union[str, os.PathLike], config: Union[dict, str, os.PathLike] = None) -> None:
    """
    Args:
        model_name_or_path (`str` or `os.PathLike`): The model to load. It can be:
            - a name of HuggingFace Transformers model
            - a path to a file containing a whole model state dict
            - a path to a folder containing a unique `.index.json` file and the shards of a checkpoint.
        config (`Dict` or `os.PathLike`): The MoE-Infinity configuration. It can be:
            - a Python dictionary containing the configuration
            - a path to a JSON file containing the configuration
    """
    pass

  def generate(self, input_ids: torch.LongTensor, **kwargs) -> Any:
    """  
    Args:
        input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
            The sequence used as a prompt for the generation. If `past` is used, only `bos_token_id` is used as
            prompt.
        **kwargs: Additional arguments for the generation method. Check the HuggingFace documentation of the model's
            `generate` method for the supported arguments.
  
    Returns:
        `torch.LongTensor` of shape `(batch_size, sequence_length)`:
            The generated sequences. Sequences shorter than `min_length` are padded with `pad_token_id`.
    """
    pass

Usage examples

import torch
import os
from transformers import AutoTokenizer
from moe_infinity import MoE

user_home = os.path.expanduser('~')

checkpoint = 'mistralai/Mixtral-8x7B-Instruct-v0.1'

# specifies the path on disk to offload parameters
config = {
    "offload_path": os.path.join(user_home, "moe-infinity"),
}

model = MoE(checkpoint, config) # one line change to support offloading

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
input_text = "translate English to German: How old are you?"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

output_ids = model.generate(input_ids)
output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(output_text)
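Since the __init__ docstring above also accepts a path to a JSON file for config, the same setup can be kept on disk; the file name below is illustrative.

import json
import os

from moe_infinity import MoE

user_home = os.path.expanduser('~')
config_path = os.path.join(user_home, "moe_infinity_config.json")  # illustrative file name

# Write the same configuration shown above to a JSON file ...
with open(config_path, "w") as f:
    json.dump({"offload_path": os.path.join(user_home, "moe-infinity")}, f)

# ... and pass the file path instead of a dictionary.
model = MoE('mistralai/Mixtral-8x7B-Instruct-v0.1', config_path)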

How to Install it?

When I install through git clone https://github.com/TorchMoE/MoE-Infinity.git,
there are only CITATIONS.md, LICENSE, and README.md files in the directory.

Therefore, it is impossible to install through pip install -e . and I do not know how to use it.
Should I use it via the dev branch?

Thank you.

Install from pip failed

Tried to install following README:

pip install moe-infinity
conda install -c conda-forge libstdcxx-ng=12 # assume using conda, otherwise install libstdcxx-ng=12 using your package manager or gcc=12

But got,

$ pip install moe-infinity
ERROR: Could not find a version that satisfies the requirement moe-infinity (from versions: none)
ERROR: No matching distribution found for moe-infinity

Is it because the lib is not released on pip yet?
