torchmoe / moe-infinity

PyTorch library for cost-effective, fast and easy serving of MoE models.

License: Apache License 2.0

Languages: C++ 52.19%, Python 47.81%
Topics: huggingface, inference-engine, large-language-models, mixture-of-experts, pytorch

moe-infinity's Issues

TODO for first release

  • API design
  • Document for installation and PyPI
  • Performance table
  • Support Mixtral multi-GPU
  • Load trace

Install from pip failed

Tried to install following the README:

pip install moe-infinity
conda install -c conda-forge libstdcxx-ng=12 # assume using conda, otherwise install libstdcxx-ng=12 using your package manager or gcc=12

But got,

$ pip install moe-infinity
ERROR: Could not find a version that satisfies the requirement moe-infinity (from versions: none)
ERROR: No matching distribution found for moe-infinity

Is it because the library has not been released on PyPI yet?

Support Constrained Server Memory

A Colab T4 server has 12 GB of DRAM and 16 GB of GPU memory, while the quantized Mixtral checkpoint is 26 GB as a single file, so it cannot be loaded into memory when creating the custom format for offloading.
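
One possible workaround (a minimal sketch, not MoE-Infinity's actual converter): if the checkpoint is stored as safetensors shards, it can be repacked one tensor at a time, so only a single tensor is resident in DRAM while the offload files are written. The paths and the one-file-per-tensor layout below are illustrative assumptions.

# Hypothetical sketch: repack a sharded safetensors checkpoint tensor-by-tensor
# so the full 26 GB model never has to fit in 12 GB of DRAM at once.
import glob
import os

from safetensors import safe_open
from safetensors.torch import save_file

src_dir = "mixtral-checkpoint"   # directory with *.safetensors shards (assumed)
dst_dir = "offload-cache"        # output directory for per-tensor files (assumed)
os.makedirs(dst_dir, exist_ok=True)

for shard in sorted(glob.glob(os.path.join(src_dir, "*.safetensors"))):
    with safe_open(shard, framework="pt", device="cpu") as f:
        for name in f.keys():
            tensor = f.get_tensor(name)  # only one tensor in memory at a time
            save_file({name: tensor}, os.path.join(dst_dir, name.replace(".", "_") + ".safetensors"))
            del tensor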

How to Install it?

When I clone the repository with git clone https://github.com/TorchMoE/MoE-Infinity.git,
there are only CITATIONS.md, LICENSE, and README.md in the directory.

Therefore, it is impossible to install it with pip install -e ., and I do not know how to use it.
Should I use the dev branch instead?

Thank you.

Output of Mixtral-8x7B is strange

Thanks for your great work. I tried running the Mixtral MoE model, but I got some strange output.
When I use the CPU to run the model with the following script, I get normal output.
script:

from transformers import AutoModelForCausalLM, AutoTokenizer
import time
model_id = "mistralai/Mixtral-8x7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
text = "Hello my name is Katie and I am a 20 year old student from the UK. I am currently studying"
inputs = tokenizer(text, return_tensors="pt")
start = time.time()
outputs = model.generate(**inputs, max_new_tokens=100)
cost = time.time() - start
print(model.dtype)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
print(f"Time cost: {cost}s")

output:

torch.float32
Hello my name is Katie and I am a 20 year old student from the UK. I am currently studying a degree in English Literature and Creative Writing at the University of Winchester. I have always had a passion for writing and I am hoping to pursue a career in journalism. I have a love for all things fashion, beauty and lifestyle related and I am hoping to share my thoughts and opinions with you all.

I have always been a huge fan of reading and writing and I am hoping to share my passion with you all. I am hoping to share my thoughts and opinions on all things

58.17350935935974s

But when I use moe-infinity to run the model, I get strange output.
script:

import torch
import os
from transformers import AutoTokenizer
import time
from moe_infinity import MoE

model_id = "mistralai/Mixtral-8x7B-v0.1"
config = {
    "offload_path": "baselines/cache",
    "device_memory_ratio": 0.75, # 75% of the device memory is used for caching, change the value according to your device memory size on OOM
}

model = MoE(model_id, config)
input_text = "Hello my name is Katie and I am a 20 year old student from the UK. I am currently studying "
tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer(input_text, return_tensors="pt")
inputs = {k: v.to('cuda') for k, v in inputs.items()}
start = time.time()
output = model.generate(**inputs, max_new_tokens=100)
cost = time.time() - start
output_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(output_text)
print(f"Time cost: {cost}s")

output:

Hello my name is Katie and I am a 20 year old student from the UK. I am currently studying qu‘‘Âub“‘‘� du…­‘‘‘­‘­Â9‘­‘­‘­an’adqu
‘dededok‘‘‘­’’‘ququ‘‘‘‘ok‘‘‘‘’‘’’’‘‘‘’’’’’ak’’‘‘‘‘‘‘’’’’’’’’’’’’ dess‘‘af’’ of ofged dec

Time cost: 216.83905959129333s

I run the model on an NVIDIA GeForce RTX 4090.
Could you give me some advice? Thanks for your help.
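
A hedged debugging suggestion (not from the maintainers): force greedy decoding on both paths so the runs are deterministic, then locate the first token where the offloaded run diverges from the full-precision CPU reference. The settings below assume MoE.generate forwards kwargs to the HuggingFace generate method, as described in the API proposal further down.

# Hypothetical debugging sketch: greedy decoding on both paths, then find the
# first position where the MoE-Infinity output diverges from the CPU reference.
from transformers import AutoModelForCausalLM, AutoTokenizer
from moe_infinity import MoE

model_id = "mistralai/Mixtral-8x7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
prompt = "Hello my name is Katie and I am a 20 year old student from the UK."
inputs = tokenizer(prompt, return_tensors="pt")

# Reference: full-precision CPU run, deterministic because sampling is disabled.
cpu_model = AutoModelForCausalLM.from_pretrained(model_id)
ref_ids = cpu_model.generate(**inputs, max_new_tokens=20, do_sample=False)[0].tolist()

# Offloaded run with identical greedy settings.
moe_model = MoE(model_id, {"offload_path": "baselines/cache", "device_memory_ratio": 0.75})
cuda_inputs = {k: v.to("cuda") for k, v in inputs.items()}
moe_ids = moe_model.generate(**cuda_inputs, max_new_tokens=20, do_sample=False)[0].cpu().tolist()

# Report where the two sequences first disagree.
for i, (a, b) in enumerate(zip(ref_ids, moe_ids)):
    if a != b:
        print(f"first divergence at position {i}: {tokenizer.decode([a])!r} vs {tokenizer.decode([b])!r}")
        break
else:
    print("the first 20 new tokens match")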

MoE-Infinity API Proposal

Description

We propose a class MoE as the entry point. It loads a (potentially sharded) checkpoint into a model, sending weights to a given device as they are loaded, and adds the hooks needed to make the model run properly (even if split across devices).

The class also provides a generate member function that overrides the default generate and adds tracing capability. It otherwise has the same behaviour as HuggingFace model.generate.

from typing import Any, Union
import os

import torch


class MoE:
  def __init__(self, model_name_or_path: Union[str, os.PathLike], config: Union[dict, str, os.PathLike] = None) -> None:
    """
    Args:
        model_name_or_path (`str` or `os.PathLike`): The model to load. It can be:
            - a name of a HuggingFace Transformers model
            - a path to a file containing a whole model state dict
            - a path to a folder containing a unique `.index.json` file and the shards of a checkpoint.
        config (`dict`, `str`, or `os.PathLike`): The MoE-Infinity configuration. It can be:
            - a Python dictionary containing the configuration
            - a path to a JSON file containing the configuration
    """
    pass

  def generate(self, input_ids: torch.LongTensor, **kwargs) -> Any:
    """
    Args:
        input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
            The sequence used as a prompt for the generation. If `past` is used, only `bos_token_id` is used as
            the prompt.
        **kwargs: Additional arguments for the generation method. Check the HuggingFace documentation of the model's
            `generate` method for the supported arguments.

    Returns:
        `torch.LongTensor` of shape `(batch_size, sequence_length)`:
            The generated sequences. Sequences shorter than `min_length` are padded with `pad_token_id`.
    """
    pass

Usage examples

import os
from transformers import AutoTokenizer
from moe_infinity import MoE

user_home = os.path.expanduser('~')

checkpoint = 'mistralai/Mixtral-8x7B-Instruct-v0.1'

# specifies the path on disk to offload parameters
config = {
    "offload_path": os.path.join(user_home, "moe-infinity"),
}

model = MoE(checkpoint, config) # one line change to support offloading

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
input_text = "translate English to German: How old are you?"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

output_ids = model.generate(input_ids)
output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(output_text)
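
The docstring above also allows the configuration to be passed as a path to a JSON file. Below is a minimal sketch of that variant, reusing the keys that appear elsewhere in this proposal and in the issues above (offload_path, device_memory_ratio); the file location is an illustrative assumption.

import json
import os

from moe_infinity import MoE

user_home = os.path.expanduser('~')
checkpoint = 'mistralai/Mixtral-8x7B-Instruct-v0.1'

# Write the configuration to a JSON file, then pass the file path instead of a dict.
config_path = os.path.join(user_home, "moe-infinity", "config.json")
os.makedirs(os.path.dirname(config_path), exist_ok=True)
with open(config_path, "w") as f:
    json.dump({
        "offload_path": os.path.join(user_home, "moe-infinity"),
        "device_memory_ratio": 0.75,  # fraction of device memory used for caching
    }, f)

model = MoE(checkpoint, config_path)  # config given as a path to a JSON file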
