torchmoe / moe-infinity
PyTorch library for cost-effective, fast and easy serving of MoE models.
License: Apache License 2.0
Is there an unquantized version that can run on multiple GPUs?
I tried to install following the README:
pip install moe-infinity
conda install -c conda-forge libstdcxx-ng=12 # assume using conda, otherwise install libstdcxx-ng=12 using your package manager or gcc=12
But got,
$ pip install moe-infinity
ERROR: Could not find a version that satisfies the requirement moe-infinity (from versions: none)
ERROR: No matching distribution found for moe-infinity
Is it because the library is not released on PyPI yet?
The Colab T4 server has 12GB of DRAM and a 16GB GPU; the quantized Mixtral checkpoint is 26GB as a single file, so it cannot be loaded into memory when creating the custom format for offloading.
Use this link: https://huggingface.co/keyfan/grok-1-hf/tree/main
When I install by cloning https://github.com/TorchMoE/MoE-Infinity.git,
there are only CITATIONS.md, LICENSE, and README.md files in the directory.
Therefore, it is impossible to install with pip install -e ., and I do not know how to use it.
Should I use it via the dev branch?
Thank you.
Thanks for your great work. I tried running the Mixtral MoE, but I got some strange output.
When I use the CPU to run the model with the following script, I get normal output.
script:
from transformers import AutoModelForCausalLM, AutoTokenizer
import time
model_id = "mistralai/Mixtral-8x7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
text = "Hello my name is Katie and I am a 20 year old student from the UK. I am currently studying"
inputs = tokenizer(text, return_tensors="pt")
start = time.time()
outputs = model.generate(**inputs, max_new_tokens=100)
cost = time.time() - start
print(model.dtype)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
print(f"Time cost: {cost}s")
output:
torch.float32
Hello my name is Katie and I am a 20 year old student from the UK. I am currently studying a degree in English Literature and Creative Writing at the University of Winchester. I have always had a passion for writing and I am hoping to pursue a career in journalism. I have a love for all things fashion, beauty and lifestyle related and I am hoping to share my thoughts and opinions with you all.
I have always been a huge fan of reading and writing and I am hoping to share my passion with you all. I am hoping to share my thoughts and opinions on all things
58.17350935935974s
But when I use moe-infinity to run the model, I get strange output.
script:
import torch
import os
from transformers import AutoTokenizer
import time
from moe_infinity import MoE
model_id = "mistralai/Mixtral-8x7B-v0.1"
config = {
    "offload_path": "baselines/cache",
    "device_memory_ratio": 0.75,  # 75% of the device memory is used for caching; lower this according to your device memory size if you hit OOM
}
model = MoE(model_id, config)
input_text = "Hello my name is Katie and I am a 20 year old student from the UK. I am currently studying "
tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer(input_text, return_tensors="pt")
inputs = {k: v.to('cuda') for k, v in inputs.items()}
start = time.time()
output = model.generate(**inputs, max_new_tokens=100)
cost = time.time() - start
output_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(output_text)
print(f"Time cost: {cost}s")
output:
Hello my name is Katie and I am a 20 year old student from the UK. I am currently studying qu‘‘Âub“‘‘� du…‘‘‘‘ÂÂ9‘‘‘an’adqu
‘dededok‘‘‘’’‘ququ‘‘‘‘ok‘‘‘‘’‘’’’‘‘‘’’’’’ak’’‘‘‘‘‘‘’’’’’’’’’’’’ dess‘‘af’’ of ofged dec
Time cost: 216.83905959129333s
I run the model on NVIDIA GeForce RTX 4090.
Could you give me some advice? Thanks for your help.
We propose a class `MoE` as the entry point. It loads a (potentially sharded) checkpoint inside a model, sending weights to a given device as they are loaded, and adds the various hooks that make this model run properly (even if split across devices).
The class has an additional `generate` member function that overrides the default generate and adds tracing capability. It has the same behaviour as the HuggingFace `model.generate`.
class MoE:
    def __init__(self, model_name_or_path: Union[str, os.PathLike], config: Union[dict, str, os.PathLike] = None) -> None:
        """
        Args:
            model_name_or_path (`str` or `os.PathLike`): The model to load. It can be:
                - a name of a HuggingFace Transformers model
                - a path to a file containing a whole model state dict
                - a path to a folder containing a unique `.index.json` file and the shards of a checkpoint.
            config (`dict` or `os.PathLike`): The MoE-Infinity configuration. It can be:
                - a Python dictionary containing the configuration
                - a path to a JSON file containing the configuration
        """
        pass

    def generate(self, input_ids: torch.LongTensor, **kwargs) -> Any:
        """
        Args:
            input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
                The sequence used as a prompt for the generation. If `past` is used, only `bos_token_id` is used as
                prompt.
            **kwargs: Additional arguments for the generation method. Check the HuggingFace documentation of the
                model's `generate` method for the supported arguments.
        Returns:
            `torch.LongTensor` of shape `(batch_size, sequence_length)`:
                The generated sequences. Sequences shorter than `min_length` are padded with `pad_token_id`.
        """
        pass
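Per the docstring, `config` can be supplied either as a Python dictionary or as a path to a JSON file. A minimal sketch of the two equivalent forms, using the `offload_path` and `device_memory_ratio` keys that appear in the examples on this page (treat the exact schema as an assumption):

```python
import json
import os
import tempfile

# The same configuration, first as a Python dict...
config = {
    "offload_path": "/tmp/moe-offload",   # where offloaded expert weights are staged on disk
    "device_memory_ratio": 0.75,          # fraction of GPU memory used for caching
}

# ...and as a path to a JSON file with identical contents.
config_path = os.path.join(tempfile.gettempdir(), "moe_config.json")
with open(config_path, "w") as f:
    json.dump(config, f)

# Either form should then be accepted by the constructor:
#   model = MoE("mistralai/Mixtral-8x7B-v0.1", config)
#   model = MoE("mistralai/Mixtral-8x7B-v0.1", config_path)
with open(config_path) as f:
    print(json.load(f) == config)  # → True
```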
import torch
import os
from transformers import AutoTokenizer
from moe_infinity import MoE
user_home = os.path.expanduser('~')
checkpoint = 'mistralai/Mixtral-8x7B-Instruct-v0.1'
# specifies the path on disk to offload parameters
config = {
    "offload_path": os.path.join(user_home, "moe-infinity"),
}
model = MoE(checkpoint, config) # one line change to support offloading
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
input_text = "translate English to German: How old are you?"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids
output_ids = model.generate(input_ids)
output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(output_text)