torchmoe / moe-infinity
PyTorch library for cost-effective, fast and easy serving of MoE models.
License: Apache License 2.0
Is there an unquantized version that can run on multiple GPUs?
I tried to install following the README:
pip install moe-infinity
conda install -c conda-forge libstdcxx-ng=12 # assume using conda, otherwise install libstdcxx-ng=12 using your package manager or gcc=12
But got,
$ pip install moe-infinity
ERROR: Could not find a version that satisfies the requirement moe-infinity (from versions: none)
ERROR: No matching distribution found for moe-infinity
Is it because the library is not released on PyPI yet?
The Colab T4 server has 12GB of DRAM and a 16GB GPU; the quantized Mixtral checkpoint is 26GB as a single file, so it cannot be loaded into memory when creating the custom format for offloading.
Use this link: https://huggingface.co/keyfan/grok-1-hf/tree/main
When I install by cloning https://github.com/TorchMoE/MoE-Infinity.git,
there are only CITATIONS.md, LICENSE, and README.md files in the directory.
Therefore, it is impossible to install with pip install -e ., and I do not know how to use it.
Should I use it via the dev branch?
Thank you.
Thanks for your great work. I tried running the Mixtral MoE, but I got some strange output.
When I use the CPU to run the model with the following script, I get normal output.
script:
from transformers import AutoModelForCausalLM, AutoTokenizer
import time
model_id = "mistralai/Mixtral-8x7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
text = "Hello my name is Katie and I am a 20 year old student from the UK. I am currently studying"
inputs = tokenizer(text, return_tensors="pt")
start = time.time()
outputs = model.generate(**inputs, max_new_tokens=100)
cost = time.time() - start
print(model.dtype)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
print(f"Time cost: {cost}s")
output:
torch.float32
Hello my name is Katie and I am a 20 year old student from the UK. I am currently studying a degree in English Literature and Creative Writing at the University of Winchester. I have always had a passion for writing and I am hoping to pursue a career in journalism. I have a love for all things fashion, beauty and lifestyle related and I am hoping to share my thoughts and opinions with you all.
I have always been a huge fan of reading and writing and I am hoping to share my passion with you all. I am hoping to share my thoughts and opinions on all things
58.17350935935974s
But when I use moe-infinity to run the model, I get strange output.
script:
import torch
import os
from transformers import AutoTokenizer
import time
from moe_infinity import MoE
model_id = "mistralai/Mixtral-8x7B-v0.1"
config = {
    "offload_path": "baselines/cache",
    "device_memory_ratio": 0.75,  # 75% of the device memory is used for caching; lower this according to your device memory size if you hit OOM
}
model = MoE(model_id, config)
input_text = "Hello my name is Katie and I am a 20 year old student from the UK. I am currently studying "
tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer(input_text, return_tensors="pt")
inputs = {k: v.to('cuda') for k, v in inputs.items()}
start = time.time()
output = model.generate(**inputs, max_new_tokens=100)
cost = time.time() - start
output_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(output_text)
print(f"Time cost: {cost}s")
output:
Hello my name is Katie and I am a 20 year old student from the UK. I am currently studying qu‘‘Âub“‘‘� du…‘‘‘‘ÂÂ9‘‘‘an’adqu
‘dededok‘‘‘’’‘ququ‘‘‘‘ok‘‘‘‘’‘’’’‘‘‘’’’’’ak’’‘‘‘‘‘‘’’’’’’’’’’’’ dess‘‘af’’ of ofged dec
Time cost: 216.83905959129333s
I run the model on NVIDIA GeForce RTX 4090.
Could you give me some advice? Thanks for your help.
We propose a class `MoE` as the entry point. It loads a (potentially sharded) checkpoint inside a model, sending weights to a given device as they are loaded, and adds the various hooks that make this model run properly (even if split across devices).
The class has an additional `generate` member function that overrides the default generate and adds tracing capability. It has the same behaviour as the HuggingFace `model.generate`.
class MoE:
    def __init__(self, model_name_or_path: Union[str, os.PathLike], config: Union[dict, str, os.PathLike] = None) -> None:
        """
        Args:
            model_name_or_path (`str` or `os.PathLike`): The model to load. It can be:
                - a name of a HuggingFace Transformers model
                - a path to a file containing a whole model state dict
                - a path to a folder containing a unique `.index.json` file and the shards of a checkpoint.
            config (`dict` or `os.PathLike`): The MoE-Infinity configuration. It can be:
                - a Python dictionary containing the configuration
                - a path to a JSON file containing the configuration
        """
        pass

    def generate(self, input_ids: torch.LongTensor, **kwargs) -> Any:
        """
        Args:
            input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
                The sequence used as a prompt for the generation. If `past` is used, only `bos_token_id` is used as
                prompt.
            **kwargs: Additional arguments for the generation method. Check the HuggingFace documentation of the
                model's `generate` method for the supported arguments.
        Returns:
            `torch.LongTensor` of shape `(batch_size, sequence_length)`:
                The generated sequences. Sequences shorter than `min_length` are padded with `pad_token_id`.
        """
        pass
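Per the docstring, `config` can be supplied either as a Python dictionary or as a path to a JSON file. A minimal sketch of the two equivalent forms, using the `offload_path` and `device_memory_ratio` keys that appear in the examples on this page (treat the exact schema as an assumption):

```python
import json
import os
import tempfile

# The same configuration, first as a Python dict...
config = {
    "offload_path": "/tmp/moe-offload",   # where offloaded expert weights are staged on disk
    "device_memory_ratio": 0.75,          # fraction of GPU memory used for caching
}

# ...and as a path to a JSON file with identical contents.
config_path = os.path.join(tempfile.gettempdir(), "moe_config.json")
with open(config_path, "w") as f:
    json.dump(config, f)

# Either form should then be accepted by the constructor:
#   model = MoE("mistralai/Mixtral-8x7B-v0.1", config)
#   model = MoE("mistralai/Mixtral-8x7B-v0.1", config_path)
with open(config_path) as f:
    print(json.load(f) == config)  # → True
```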
import torch
import os
from transformers import AutoTokenizer
from moe_infinity import MoE
user_home = os.path.expanduser('~')
checkpoint = 'mistralai/Mixtral-8x7B-Instruct-v0.1'
# specifies the path on disk to offload parameters
config = {
    "offload_path": os.path.join(user_home, "moe-infinity"),
}
model = MoE(checkpoint, config) # one line change to support offloading
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
input_text = "translate English to German: How old are you?"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids
output_ids = model.generate(input_ids)
output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(output_text)