fminference / flexgen


Running large language models on a single GPU for throughput-oriented scenarios.

License: Apache License 2.0

Python 96.61% Shell 3.39%
deep-learning gpt-3 high-throughput large-language-models machine-learning offloading opt

flexgen's Introduction

FlexGen: High-throughput Generative Inference of Large Language Models with a Single GPU [paper]

FlexGen is a high-throughput generation engine for running large language models with limited GPU memory. FlexGen allows high-throughput generation by IO-efficient offloading, compression, and large effective batch sizes.

Motivation

In recent years, large language models (LLMs) have shown great performance across a wide range of tasks. Increasingly, LLMs have been applied not only to interactive applications (such as chat), but also to many "back-of-house" tasks. These tasks include benchmarking, information extraction, data wrangling, and form processing.

One key characteristic of these applications is that they are throughput-oriented: they require running LLM inferences over millions of tokens in batches, e.g., all the private documents in a company's corpus, or all the tasks in the HELM benchmark. These workloads are less sensitive to latency - the user starts up a job and lets it run overnight - but increasing throughput is critical for reducing costs. Throughput is a measure of tokens processed per second over the job's entire runtime (which can be hours). Throughput-oriented workloads provide opportunities to trade off latency for higher throughput, which makes it easier to take advantage of low-cost commodity GPUs.

The goal of FlexGen is to create a high-throughput system to enable new and exciting applications of foundation models to throughput-oriented tasks on low-cost hardware, such as a single commodity GPU instead of expensive systems.

Check out the examples of what you can run on a single commodity GPU with FlexGen, including benchmarking and data wrangling.

Limitation. As an offloading-based system running on weak GPUs, FlexGen also has its limitations. It can be significantly slower than a setup with enough powerful GPUs to hold the whole model, especially for small-batch cases. FlexGen is mostly optimized for throughput-oriented batch processing on a single GPU (e.g., classifying or extracting information from many documents in batches).


This project was made possible thanks to a collaboration with

                   


Content

Installation

Requirements:

Method 1: With pip

pip install flexgen

Method 2: From source

git clone https://github.com/FMInference/FlexGen.git
cd FlexGen
pip install -e .

Usage and Examples

Get Started with a Single GPU

OPT-1.3B

To get started, you can try a small model like OPT-1.3B first. It fits into a single GPU so no offloading is required. FlexGen will automatically download weights from Hugging Face.

python3 -m flexgen.flex_opt --model facebook/opt-1.3b

You should see some text generated by OPT-1.3B and the benchmark results.

OPT-30B

To run large models like OPT-30B, you will need to use CPU offloading. You can try the commands below. The --percent argument specifies the offloading strategy for parameters, attention cache, and hidden states separately. The exact meaning of this argument can be found here.

python3 -m flexgen.flex_opt --model facebook/opt-30b --percent 0 100 100 0 100 0
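The six numbers after --percent are, in this reading, the GPU and CPU percentages for the model weights, the attention (KV) cache, and the hidden activations, with whatever remains in each category going to disk. Treat this ordering as an assumption until you check the linked policy documentation. A minimal sketch decoding the command above:

# A minimal sketch decoding "--percent 0 100 100 0 100 0" under the reading above
# (assumption): pairs of (GPU%, CPU%) for weights, KV cache, and activations,
# with the remainder of each pair placed on disk.
percent = [0, 100, 100, 0, 100, 0]
for name, gpu, cpu in zip(["weights", "KV cache", "activations"],
                          percent[0::2], percent[1::2]):
    print(f"{name}: {gpu}% GPU, {cpu}% CPU, {100 - gpu - cpu}% disk")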

OPT-175B

To run OPT-175B, you need to download the weights from metaseq and convert them into Alpa format. You can then offload all weights to disk by running:

python3 -m flexgen.flex_opt --model facebook/opt-175b --percent 0 0 100 0 100 0 --offload-dir YOUR_SSD_FOLDER

Run HELM Benchmark with FlexGen

FlexGen can be integrated into HELM, a language model benchmark framework, as its execution backend. You can use the commands below to run a Massive Multitask Language Understanding (MMLU) scenario with a single T4 (16GB) GPU and 200GB of DRAM.

pip install crfm-helm
python3 -m flexgen.apps.helm_run --description mmlu:model=text,subject=abstract_algebra,data_augmentation=canonical --pad-to-seq-len 512 --model facebook/opt-30b --percent 20 80 0 100 0 100 --gpu-batch-size 48 --num-gpu-batches 3 --max-eval-instance 100

Note that only a subset of HELM scenarios is tested. See more tested scenarios here.

Run Data Wrangling Tasks with FlexGen

You can run the examples in this paper, 'Can Foundation Models Wrangle Your Data?', by following the instructions here.

Scaling to Distributed GPUs

If you have multiple machines with GPUs, FlexGen can combine offloading with pipeline parallelism to allow scaling. For example, if you have 2 GPUs but their aggregate GPU memory is less than the model size, you still need offloading; FlexGen allows you to use pipeline parallelism across these 2 GPUs to accelerate generation. To see scaling benefits, however, the GPUs should be on separate machines. See examples here.

API Example

We demonstrate the usage of FlexGen API in completion.py. This example shows how to run generation for two sentences. To get the best throughput out of FlexGen, you typically need to batch more sentences.

Generation API

FlexGen has a generation API following the style of Hugging Face's transformers.

output_ids = model.generate(
	input_ids,
	do_sample=True,
	temperature=0.7,
	max_new_tokens=32,
	stop=stop)
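For a more complete picture, below is a minimal, untested sketch in the spirit of completion.py. It assumes model is a FlexGen model that has already been constructed as in that example; the tokenizer calls are standard Hugging Face APIs, and the stop-token choice is an assumption.

from transformers import AutoTokenizer

# Assumes `model` is a FlexGen model built as in flexgen/apps/completion.py.
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b", padding_side="left")

prompts = [
    "Question: Where were the 2004 Olympics held?\nAnswer:",
    "Paris is the capital city of",
]
# Left-pad the batch to a common length; batching more prompts is what gives
# FlexGen its throughput.
input_ids = tokenizer(prompts, padding="max_length", max_length=128).input_ids

# Assumed stop token: end generation at a newline.
stop = tokenizer.encode("\n", add_special_tokens=False)[0]

output_ids = model.generate(
    input_ids,
    do_sample=True,
    temperature=0.7,
    max_new_tokens=32,
    stop=stop)

print(tokenizer.batch_decode(output_ids, skip_special_tokens=True))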

Example Commands

You can use the example commands below. If you do not have enough GPU/CPU memory, see the Handle Out-Of-Memory section.

# Complete with OPT-6.7B. You need at least 15GB of GPU memory.
python3 -m flexgen.apps.completion --model facebook/opt-6.7b
# Complete with OPT-30B. You need about 90GB of CPU memory.
python3 -m flexgen.apps.completion --model facebook/opt-30b --percent 0 100 100 0 100 0
# Complete with instruction-tuned OPT-IML-MAX-30B. You need about 90GB of CPU memory.
python3 -m flexgen.apps.completion --model facebook/opt-iml-max-30b --percent 0 100 100 0 100 0

Frequently Asked Questions

How to set the offloading strategy and --percent?

We will release an automatic policy optimizer later, but for now you have to manually try a few strategies. The idea of high-throughput generation is to offload parameters and the attention cache to the CPU, and to disk if necessary, as much as possible. You can see the reference strategies in our benchmark here. To avoid out-of-memory errors, you can tune --percent to offload more tensors to the CPU and disk.

How to handle out-of-memory?

If you do not have enough GPU/CPU memory, here are a few things you can try. They save more memory but run slower. A combined example follows the list.

  • Do not pin weights by adding --pin-weight 0. This can reduce the weight memory usage on CPU by around 20% or more.
  • Enable weight compression by adding --compress-weight. This can reduce the weight memory usage by around 70%.
  • Offload all weights to disk by using --percent 0 0 100 0 100 0. This requires very little CPU and GPU memory.
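For example, an untested command that combines all three options above for OPT-30B:

python3 -m flexgen.flex_opt --model facebook/opt-30b --percent 0 0 100 0 100 0 --pin-weight 0 --compress-weight --offload-dir YOUR_SSD_FOLDER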

Performance Results

Generation Throughput (token/s)

The corresponding effective batch sizes and lowest offloading devices are in parentheses. Please see here for more details.

System                     OPT-6.7B            OPT-30B             OPT-175B
Hugging Face Accelerate    25.12 (2 on GPU)    0.62 (8 on CPU)     0.01 (2 on disk)
DeepSpeed ZeRO-Inference   9.28 (16 on CPU)    0.60 (4 on CPU)     0.01 (1 on disk)
Petals                     8.25 (2 on GPU)     2.84 (2 on GPU)     0.08 (2 on GPU)
FlexGen                    25.26 (2 on GPU)    7.32 (144 on CPU)   0.69 (256 on disk)
FlexGen with Compression   29.12 (72 on GPU)   8.38 (512 on CPU)   1.12 (144 on CPU)
  • Hardware: an NVIDIA T4 (16GB) instance on GCP with 208GB of DRAM and 1.5TB of SSD.
  • Workload: input sequence length = 512, output sequence length = 32. The batch size is tuned to a large value that maximizes the generation throughput for each system.
  • Metric: generation throughput (token/s) = number of generated tokens / (time for processing prompts + time for generation). A small sketch of this calculation follows the list.
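The arithmetic behind this metric is simple; here is a small illustrative helper (the numbers in the example are made up, not measured):

def generation_throughput(num_sequences, output_len, prefill_time_s, decode_time_s):
    # Generation throughput (token/s) = generated tokens / (prefill time + decoding time).
    return num_sequences * output_len / (prefill_time_s + decode_time_s)

# Illustrative numbers only: 64 sequences, 32 new tokens each,
# 100 s spent on prefill and 300 s on decoding.
print(generation_throughput(64, 32, prefill_time_s=100.0, decode_time_s=300.0))  # 5.12 token/s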

How to reproduce.

Latency-Throughput Trade-Off

The figure below shows the latency and throughput trade-off of three offloading-based systems on OPT-175B (left) and OPT-30B (right). FlexGen achieves a new Pareto-optimal frontier with significantly higher maximum throughput for both models. Other systems cannot further increase throughput due to out-of-memory errors. "FlexGen(c)" is FlexGen with compression.

[figure: latency-throughput trade-off on OPT-175B (left) and OPT-30B (right)]

How It Works

FlexGen can be flexibly configured under various hardware resource constraints by aggregating memory and computation from the GPU, CPU, and disk. Through a linear programming optimizer, it searches for the best pattern to store and access the tensors, including weights, activations, and attention key/value (KV) cache. FlexGen further compresses both weights and KV cache to 4 bits with negligible accuracy loss.
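As a rough illustration of the compression idea, here is a generic group-wise 4-bit min/max quantizer written for this overview; it is not FlexGen's actual implementation, and the group size of 64 is an assumption:

import numpy as np

def quantize_4bit(x, group_size=64):
    # Group-wise asymmetric quantization: each group gets its own minimum and
    # scale, and values are mapped to 16 levels (4 bits).
    g = x.reshape(-1, group_size)
    mn = g.min(axis=1, keepdims=True)
    scale = (g.max(axis=1, keepdims=True) - mn) / 15.0
    q = np.round((g - mn) / np.maximum(scale, 1e-8)).astype(np.uint8)
    return q, mn, scale

def dequantize_4bit(q, mn, scale, shape):
    return (q.astype(np.float32) * scale + mn).reshape(shape)

x = np.random.randn(2, 128).astype(np.float32)
q, mn, scale = quantize_4bit(x)
x_hat = dequantize_4bit(q, mn, scale, x.shape)
print(np.abs(x - x_hat).max())  # small reconstruction error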

One key idea of FlexGen is to play the latency-throughput trade-off. Achieving low latency is inherently challenging for offloading methods, but the I/O efficiency of offloading can be greatly boosted in throughput-oriented scenarios (see the figure above). FlexGen uses a block schedule to reuse weights and overlap I/O with computation, as shown in figure (b) below, while other baseline systems use an inefficient row-by-row schedule, as shown in figure (a) below.

[figure: (a) row-by-row schedule used by baselines vs. (b) block schedule used by FlexGen]
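To make the scheduling difference concrete, here is a toy sketch (not FlexGen's code) that only counts how many times layer weights must be fetched from CPU/disk under each schedule:

def row_by_row(num_layers, num_batches):
    # Baseline: finish one batch at a time, so every layer's weights are
    # re-fetched for every batch.
    loads = 0
    for _batch in range(num_batches):
        for _layer in range(num_layers):
            loads += 1  # load weights, compute, evict
    return loads

def block_schedule(num_layers, num_batches):
    # FlexGen-style: keep one layer's weights resident and push a whole block
    # of batches through it, so the weight I/O is amortized over the block and
    # can be overlapped with computation.
    loads = 0
    for _layer in range(num_layers):
        loads += 1  # load weights once per layer
        for _batch in range(num_batches):
            pass  # compute on each batch, reusing the resident weights
    return loads

print(row_by_row(num_layers=96, num_batches=8))      # 768 weight loads
print(block_schedule(num_layers=96, num_batches=8))  # 96 weight loads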

For more technical details, see our paper.

Roadmap

We plan to work on the following features.

  • Optimize the performance for multiple GPUs on the same machine
  • Support more models (BLOOM, CodeGen, GLM)
  • Release the cost model and policy optimizer
  • Macbook Support (M1 and M2)
  • AMD Support

flexgen's People

Contributors

binhangyuan, borda, danfu09, eltociear, kemingy, keroro824, lukelin-web, meatfucker, merrymercy, mryab, nicholasachow, shotarok, shughes-uk, takanotaiga, tomaarsen, ying1123, zhangce, zhuohan123


flexgen's Issues

Soft Label of Flexgen

I tried to extract output prediction probabilities from pytorch_backend.py in flex_opt, but I cannot get the logits returned. I have little idea how to extract logits from your complex distributed computations. Could you support outputting both ids and logits in the apps?

3090

Hello, I have a 3090.
How fast could I run Erebus 30B if I use FlexGen with compression?

ValueError: cannot reshape array of size 0 into shape (7168,28672)

Hello!

I got an error when running:
python -m flexgen.flex_opt --model facebook/opt-30b --percent 0 100 100 0 100 0

warmup - init weights
Traceback (most recent call last):
  File "C:\Users\user\AppData\Local\Programs\Python\Python38\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\user\AppData\Local\Programs\Python\Python38\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "D:\FlexGen\flexgen\flex_opt.py", line 1314, in <module>
    run_flexgen(args)
  File "D:\FlexGen\flexgen\flex_opt.py", line 1206, in run_flexgen
    model.init_all_weights()
  File "D:\FlexGen\flexgen\flex_opt.py", line 793, in init_all_weights
    self.init_weight(j)
  File "D:\FlexGen\flexgen\flex_opt.py", line 650, in init_weight
    self.layers[j].init_weight(self.weight_home[j], expanded_path)
  File "D:\FlexGen\flexgen\flex_opt.py", line 500, in init_weight
    weights = init_weight_list(weight_specs, self.policy, self.env)
  File "D:\FlexGen\flexgen\flex_opt.py", line 121, in init_weight_list
    weight.load_from_np_file(weight_specs[i][2])
  File "D:\FlexGen\flexgen\pytorch_backend.py", line 124, in load_from_np_file
    self.load_from_np(np.load(filename))
  File "D:\venv\lib\site-packages\numpy\lib\npyio.py", line 432, in load
    return format.read_array(fid, allow_pickle=allow_pickle,
  File "D:\venv\lib\site-packages\numpy\lib\format.py", line 831, in read_array
    array.shape = shape
ValueError: cannot reshape array of size 0 into shape (7168,28672)

Just a suggestion: Think about what Automatic1111 did to Stable Diffusion

Think about what Automatic1111 did to Stable Diffusion: from a rather brute one-shot image generator, significantly worse than its commercial counterparts, it is now a distribution with thousands of features, hundreds of extensions, a visual Gradio interface, and even an API.
Its development pace is sometimes almost impossible to keep up with; the model performance has increased at least 2x from the original release while consuming a fraction of the original GPU memory, even going beyond the original capabilities of the model.

The core reason this worked out was the local and automated installation process; it simply works on almost any system.
All you need is to clone the Git repository, and anything needed is downloaded and installed. No need to fight with dependencies, etc.

The second reason is the highly active team of developers that allowed the integration of thousands of contributions (and of course the few core devs).

FlexGen looks to me like it could have the potential to shake up the industry in a similar way.
But to draw in people and devs, it needs to start with accessibility: trouble-free installation and a Gradio or similar web interface.

[Multi-line Chatbot] Multiple line chat answers cut off?

Are multiple line answers in the chatbot cut off?
It seems like it "has more to say" sometimes, but the output is trimmed to just the first line.
For example

Human: Write a short poem about a carrot
Assistant: Here is a short poem about a carrot.
Human:

Should there be the actual poem after it shows "Here is a short poem about a carrot."?
If so, how do I edit the chatbot.py script to allow multi-line output?

Thanks.

How to load the opt-175b model from disk

I tried to load the opt-175b model by using the following command:

python3 -m flexgen.flex_opt --model facebook/opt-175b --percent 0 0 100 0 100 0 --offload-dir ./tmp_offline

The issue I have is that, after converting the weights to NumPy format using Alpa as described, I do not know how to tell the script to use that folder for loading the model.

I always get the error: OSError: facebook/opt-175b is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'

How do you have to define the command to load the locally stored weights of the opt-175b model?

Mac

Would it be possible to run this on a Mac?

AMD Radeon Pro 5500M 4 GB
Intel UHD Graphics 630 1536 MB

Issue with flexgen when running python script

Description:

I encountered an issue when running the following command on a single 3090:

bash:
python3 -m flexgen.flex_opt --model facebook/opt-30b --percent 0 100 100 0 100 0 --num-gpu-batches 2
The error message I received is:

error:
model size: 55.803 GB, cache size: 5.578 GB, hidden size (prefill): 0.058 GB
warmup - init weights
Load the pre-trained pytorch weights of opt-30b from huggingface. The downloading and cpu loading can take dozens of minutes. If it seems to get stuck, you can monitor the progress by checking the memory usage of this process.
Loading checkpoint shards: 43%|█████████████████████████████████████████▏ | 3/7 [03:34<04:48, 72.03s/it]Killed
I am trying to use flexgen to optimize the model size, but the process seems to be getting killed midway through. I am not sure why this is happening, and I would appreciate any help in resolving this issue.

Thank you!

Out-of-memory during weight download and conversion

I'm on a system hard-limited to 40GB of CPU RAM + swap.

When I try to load opt-30b, the process is killed from memory exhaustion. If I load the model manually using device_map="auto", offload_folder="offload", the load succeeds.

Is there a way to pass these flags manually or otherwise accommodate ram limits during initial loading?

Suggestion: Add support for different decoding strategies (Top P)

Firstly thank you for sharing this awesome and easy to use work!! It’s a great step forward in democratising LLMs.

It would be really helpful in practical applications if we could adjust different decoding strategies.

I believe some of the most useful would be:

  • Top P
  • Top K
  • Contrastive Search

All the best,

Anuj Nayyar

RuntimeError: CUDA error: out of memory | OPT-1.3b | RTX 3090

It seems that I am encountering several issues while attempting to run the smallest model. I would greatly appreciate it if someone could assist me in debugging this problem.

Setup: RTX 3090 24GB, WSL2

After running python -m flexgen.flex_opt --model facebook/opt-1.3b I'm getting the following output:

I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 AVX_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
model size: 2.443 GB, cache size: 0.398 GB, hidden size (prefill): 0.008 GB
warmup - init weights
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/home/user/anaconda3/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/home/user/anaconda3/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/home/user/Developer/FlexGen/flexgen/pytorch_backend.py", line 881, in copy_worker_func
    cpu_buf = torch.empty((1 * GB,), dtype=torch.float16, pin_memory=True)
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/home/user/anaconda3/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/home/user/anaconda3/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/home/user/Developer/FlexGen/flexgen/pytorch_backend.py", line 881, in copy_worker_func
    cpu_buf = torch.empty((1 * GB,), dtype=torch.float16, pin_memory=True)
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception in thread Thread-3:
Traceback (most recent call last):
  File "/home/user/anaconda3/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/home/user/anaconda3/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/home/user/Developer/FlexGen/flexgen/pytorch_backend.py", line 881, in copy_worker_func
    cpu_buf = torch.empty((1 * GB,), dtype=torch.float16, pin_memory=True)
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception in thread Thread-4:
Traceback (most recent call last):
  File "/home/user/anaconda3/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/home/user/anaconda3/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/home/user/Developer/FlexGen/flexgen/pytorch_backend.py", line 881, in copy_worker_func
    cpu_buf = torch.empty((1 * GB,), dtype=torch.float16, pin_memory=True)
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
warmup - generate
benchmark - generate
benchmark - delete weights

It stops there ^^^

[Apple M1 Max] TypeError: object.__new__() takes exactly one argument (the type to instantiate)

I am on Apple M1 Max:

$ git clone https://github.com/FMInference/FlexGen.git
$ cd FlexGen
$ conda create -n flexgen python=3.10
$ conda activate flexgen
$ pip install .
$ python -m flexgen.flex_opt --model facebook/opt-1.3b 
Downloading (…)okenizer_config.json: 100%|██████| 685/685 [00:00<00:00, 273kB/s]
Downloading (…)lve/main/config.json: 100%|██████| 651/651 [00:00<00:00, 172kB/s]
Downloading (…)olve/main/vocab.json: 100%|███| 899k/899k [00:00<00:00, 1.91MB/s]
Downloading (…)olve/main/merges.txt: 100%|███| 456k/456k [00:00<00:00, 1.19MB/s]
Downloading (…)cial_tokens_map.json: 100%|█████| 221/221 [00:00<00:00, 60.4kB/s]
Exception in thread Thread-2 (copy_worker_func):
Traceback (most recent call last):
  File "/Users/ondrej/mambaforge/envs/flexgen/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
Exception in thread Thread-3 (copy_worker_func):
Traceback (most recent call last):
  File "/Users/ondrej/mambaforge/envs/flexgen/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
Exception in thread Thread-4 (copy_worker_func):
Traceback (most recent call last):
  File "/Users/ondrej/mambaforge/envs/flexgen/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
Exception in thread Thread-5 (copy_worker_func):
Traceback (most recent call last):
  File "/Users/ondrej/mambaforge/envs/flexgen/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/Users/ondrej/mambaforge/envs/flexgen/lib/python3.10/threading.py", line 953, in run
    self.run()
    self.run()
Traceback (most recent call last):
  File "/Users/ondrej/mambaforge/envs/flexgen/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    self._target(*self._args, **self._kwargs)
  File "/Users/ondrej/repos/FlexGen/flexgen/pytorch_backend.py", line 879, in copy_worker_func
  File "/Users/ondrej/mambaforge/envs/flexgen/lib/python3.10/threading.py", line 953, in run
  File "/Users/ondrej/mambaforge/envs/flexgen/lib/python3.10/threading.py", line 953, in run
    self.run()
  File "/Users/ondrej/mambaforge/envs/flexgen/lib/python3.10/threading.py", line 953, in run
    return _run_code(code, main_globals, None,
  File "/Users/ondrej/mambaforge/envs/flexgen/lib/python3.10/runpy.py", line 86, in _run_code
    self._target(*self._args, **self._kwargs)
  File "/Users/ondrej/repos/FlexGen/flexgen/pytorch_backend.py", line 879, in copy_worker_func
    torch.cuda.set_device(cuda_id)
  File "/Users/ondrej/mambaforge/envs/flexgen/lib/python3.10/site-packages/torch/cuda/__init__.py", line 326, in set_device
    self._target(*self._args, **self._kwargs)
  File "/Users/ondrej/repos/FlexGen/flexgen/pytorch_backend.py", line 879, in copy_worker_func
    self._target(*self._args, **self._kwargs)
  File "/Users/ondrej/repos/FlexGen/flexgen/pytorch_backend.py", line 879, in copy_worker_func
    torch.cuda.set_device(cuda_id)
  File "/Users/ondrej/mambaforge/envs/flexgen/lib/python3.10/site-packages/torch/cuda/__init__.py", line 326, in set_device
    exec(code, run_globals)
  File "/Users/ondrej/repos/FlexGen/flexgen/flex_opt.py", line 1308, in <module>
    torch.cuda.set_device(cuda_id)
    torch._C._cuda_setDevice(device)
    torch._C._cuda_setDevice(device)
  File "/Users/ondrej/mambaforge/envs/flexgen/lib/python3.10/site-packages/torch/cuda/__init__.py", line 326, in set_device
AttributeError: module 'torch._C' has no attribute '_cuda_setDevice'
    torch.cuda.set_device(cuda_id)
  File "/Users/ondrej/mambaforge/envs/flexgen/lib/python3.10/site-packages/torch/cuda/__init__.py", line 326, in set_device
AttributeError: module 'torch._C' has no attribute '_cuda_setDevice'
    torch._C._cuda_setDevice(device)
    run_flexgen(args)
AttributeError: module 'torch._C' has no attribute '_cuda_setDevice'
  File "/Users/ondrej/repos/FlexGen/flexgen/flex_opt.py", line 1190, in run_flexgen
    torch._C._cuda_setDevice(device)
    model = OptLM(opt_config, env, args.path, policy)
  File "/Users/ondrej/repos/FlexGen/flexgen/flex_opt.py", line 612, in __init__
AttributeError: module 'torch._C' has no attribute '_cuda_setDevice'
    self.load_weight_stream = torch.cuda.Stream()
  File "/Users/ondrej/mambaforge/envs/flexgen/lib/python3.10/site-packages/torch/cuda/streams.py", line 34, in __new__
    return super(Stream, cls).__new__(cls, priority=priority, **kwargs)
TypeError: object.__new__() takes exactly one argument (the type to instantiate)

I assume it uses an NVIDIA GPU by default, which I don't have, so it fails. In that case, it should give a user-friendly error message.

Something wrong in Google Colab

!cd ./FlexGen && python3 -m flexgen.flex_opt --model facebook/opt-1.3b
2023-02-21 15:25:58.653992: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-21 15:25:59.475428: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2023-02-21 15:25:59.475530: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2023-02-21 15:25:59.475548: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
Downloading (…)okenizer_config.json: 100% 685/685 [00:00<00:00, 108kB/s]
Downloading (…)lve/main/config.json: 100% 651/651 [00:00<00:00, 111kB/s]
Downloading (…)olve/main/vocab.json: 100% 899k/899k [00:00<00:00, 6.53MB/s]
Downloading (…)olve/main/merges.txt: 100% 456k/456k [00:00<00:00, 3.32MB/s]
Downloading (…)cial_tokens_map.json: 100% 221/221 [00:00<00:00, 77.4kB/s]
model size: 2.443 GB, cache size: 0.398 GB, hidden size (prefill): 0.008 GB
warmup - init weights
Load the pre-trained pytorch weights of opt-1.3b from huggingface. The downloading and cpu loading can take dozens of minutes. If it seems to get stuck, you can monitor the progress by checking the memory usage of this process.
Downloading (…)lve/main/config.json: 100% 653/653 [00:00<00:00, 81.5kB/s]
Downloading (…)"pytorch_model.bin";: 100% 2.63G/2.63G [00:29<00:00, 88.2MB/s]
^C

RuntimeError: CUDA error: out of memory | WSL2 | RTX 3090 | OPT-6.7B

Problem

Clean git clone. Running this command python -m flexgen.flex_opt --model facebook/opt-6.7b gives the following output:

I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 AVX_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
model size: 12.386 GB, cache size: 1.062 GB, hidden size (prefill): 0.017 GB
warmup - init weights
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/home/user/anaconda3/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/home/user/anaconda3/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/home/user/Developer/FlexGen/flexgen/pytorch_backend.py", line 881, in copy_worker_func
    cpu_buf = torch.empty((1 * GB,), dtype=torch.float16, pin_memory=True)
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/home/user/anaconda3/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/home/user/anaconda3/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/home/user/Developer/FlexGen/flexgen/pytorch_backend.py", line 881, in copy_worker_func
    cpu_buf = torch.empty((1 * GB,), dtype=torch.float16, pin_memory=True)
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception in thread Thread-3:
Traceback (most recent call last):
  File "/home/user/anaconda3/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/home/user/anaconda3/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/home/user/Developer/FlexGen/flexgen/pytorch_backend.py", line 881, in copy_worker_func
    cpu_buf = torch.empty((1 * GB,), dtype=torch.float16, pin_memory=True)
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception in thread Thread-4:
Traceback (most recent call last):
  File "/home/user/anaconda3/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/home/user/anaconda3/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/home/user/Developer/FlexGen/flexgen/pytorch_backend.py", line 881, in copy_worker_func
    cpu_buf = torch.empty((1 * GB,), dtype=torch.float16, pin_memory=True)
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Setup

System:

  • RTX 3090 24gb VRAM
  • 64gb RAM
  • WSL2:
[wsl2]
memory=64GB

nvidia-smi output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.89.02    Driver Version: 528.49       CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0  On |                  N/A |
|  0%   55C    P0   112W / 350W |   2363MiB / 24576MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

top output:

top - 14:37:12 up 29 min,  0 users,  load average: 0.02, 0.08, 0.08
Tasks:  12 total,   1 running,  11 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  64108.9 total,  57727.9 free,    195.4 used,   6185.6 buff/cache
MiB Swap:  16384.0 total,  16384.0 free,      0.0 used.  63209.2 avail Mem

df -h / output:

Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb        251G  102G  137G  43% /

Please help!

Support Galactica

Adding support for the Galactica model would be very helpful. It seems to be the most powerful fully open-source LLM on the MMLU benchmark.

Context Length?

Are there any docs available for determining the context length when running a given model on FlexGen, and what it depends on?

Does CPU peak_mem monitoring work?

Thank you for the amazing project!

I was checking the OPT-30B model with the command provided in the README.

python3 -m flexgen.flex_opt --model facebook/opt-30b --percent 0 100 100 0 100 0

and the result is:

[image]

As I watched the progress, peak memory was 95/126 GB, so I wonder: is this right, or is it a bug?

Any pointers would help, thanks!

Question about the num-gpu-batches and gpu-batch-size

According to batch_size_table.md, from 144 = 48 x 3 (144 from batch_size_table.md and 48 x 3 from bench_suite.py), I gather that in FlexGen the batch size is composed of num-gpu-batches and gpu-batch-size together. But I don't understand the actual meaning of these two parameters. Shouldn't num-gpu-batches be the number of batches, and gpu-batch-size be the batch size?
image

image

rtx3090 cublasGemmStridedBatchedExFix CUBLAS_STATUS_NOT_SUPPORTED

(flexgen) [ai@k3s4-worker FlexGen]$ python3 -m flexgen.flex_opt --model facebook/opt-1.3b
model size: 2.443 GB, cache size: 0.398 GB, hidden size (prefill): 0.008 GB
warmup - init weights
warmup - generate
Traceback (most recent call last):
File "/home/miniconda3/envs/flexgen/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/miniconda3/envs/flexgen/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/ai/FlexGen/flexgen/flex_opt.py", line 1307, in
run_flexgen(args)
File "/home/ai/FlexGen/flexgen/flex_opt.py", line 1201, in run_flexgen
output_ids = model.generate(
File "/home/ai/FlexGen/flexgen/flex_opt.py", line 873, in generate
self.generation_loop_overlap_single_batch()
File "/home/ai/FlexGen/flexgen/flex_opt.py", line 1010, in generation_loop_overlap_single_batch
self.compute_layer(i, j, 0)
File "/home/ai/FlexGen/flexgen/flex_opt.py", line 776, in compute_layer
self.layers[j].forward(self.hidden[i][j][k], self.cache_read_buf[j][k],
File "/home/ai/FlexGen/flexgen/flex_opt.py", line 446, in forward
h, new_k_cache, new_v_cache = self.compute.mha(h, mask, w_q, b_q,
File "/home/ai/FlexGen/flexgen/pytorch_backend.py", line 330, in mha
attn_weights = torch.bmm(q, k)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling cublasGemmStridedBatchedExFix( handle, opa, opb, m, n, k, (void*)(&falpha), a, CUDA_R_16F, lda, stridea, b, CUDA_R_16F, ldb, strideb, (void*)(&fbeta), c, CUDA_R_16F, ldc, stridec, num_batches, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)

Offloading on Windows?

I've got 24GB of RAM and a 3070. I am trying to run OPT-30B and offload everything to SSD, but it seems to run out of memory without writing anything to disk.

I have tried a few things like the commands below, but I don't see any disk space being used before it throws an out-of-memory exception:

python -m flexgen.flex_opt --model facebook/opt-30b --percent 0 0 100 0 100 0 --offload-dir f:/modelcache
python -m flexgen.flex_opt --model facebook/opt-30b --percent 0 0 0 0 0 0 --offload-dir f:/modelcache

Support for RWKV language model

FlexGen looks great :)

Would you like to support the RWKV language model? It's an RNN (actually a linear transformer with both GPT and RNN modes, so quite similar to the usual GPT) with GPT-level performance - no attention, so it is faster and saves VRAM. There is already a 14B-parameter model:

https://github.com/BlinkDL/ChatRWKV

You are welcome to join RWKV discord if you are interested :)

AttributeError

AttributeError: 'OptLM' object has no attribute 'weight_home'

Update documentation to explicitly describe compatibility/performance with early Pascal cards

This was originally a question I wanted to ask, but in the interest of not abusing Github Issues, I'm disguising it as a feature request for documentation :)

There are a couple of very inexpensive cards with large VRAM; the Tesla M40 24GB (Maxwell) and Tesla P40 24GB (Pascal). Neither of these seem to have Tensor cores, which makes them pretty useless for FP16 math - and maybe equally useless for int8/int4, I'm not sure.

What is confusing to a lot of people interested in running LLMs on commodity hardware is that the Tesla M40 is listed as part of the "Pascal" family, and a feature of Pascal is the inclusion of FP16 processing. However, the Tesla P40 specifically lacks FP16 support and thus runs FP16 at 1/64th the performance of other Tesla Pascal-series cards.

Question 1: Do you know if FlexGen will run on a P40 24GB with reasonable performance, given that it uses 8-bit or 4-bit math? Is it comparable to other Pascal cards in terms of performance?

Question 2: Do you know if FlexGen can split a model across multiple Tesla P40 cards? Something I read suggested that splitting the model was not possible using bitsandbytes on older cards, but I'm not clear on the reason.

For context: if it turns out that a Tesla P40, or 2-3 Tesla P40s, can give reasonable performance in the < 1 second/token range for inference on large models, it would open up a new world of possibilities to individuals looking to run LLMs at home.

Add Erebus and GALACTICA support

Hello!
I propose adding support for the Erebus family of models; these are fine-tuned versions of the original OPT. I looked at the code, and the support is not too difficult to add; I was able to run a couple of models without major code modification. I can provide a PR if needed.
Here is a link to one of the models; the rest are there as well.
https://huggingface.co/KoboldAI/OPT-2.7B-Erebus

Questions about the intermediate tensor buffers design

Hi Team! Really nice work!

I am a little bit confused about the design choices related to the intermediate tensor buffers when reading the codes.

  1. Could you explain the purpose of cache_home, cache_read_buf and cache_write_buf? I am wondering why we need multiple buffers (instead of a single one)
  2. I noticed that for the kv cache, there are cache_home, cache_read_buf, and cache_write_buf, but for the hidden states, there is only self.hidden. Could you explain the reason for this difference?
  3. Additionally, I am curious why there is no need for a separate CUDA stream for loading and storing the hidden states.

My basic understanding:
When loading the cache, a tensor is copied from cache_home to cache_read_buf; when storing, the buffer tensor is copied from cache_write_buf to cache_home. But I don't really understand why we cannot do this in a single buffer.

These confusions may be due to some special design or necessity in the implementation, or they may be the result of not understanding the code particularly well. I'm very much looking forward to your answers, thanks in advance!

Soft lockup after running flex_opt

Hello! FlexGen is a brilliant project, but there might be some locking issues. I ran the command python3 bench_suite.py 6b7_1x1, but it threw a soft lockup BUG:
image
How can I fix it? Thanks :)

Can I use FlexGen's offloading and compression without caching?

I am researching a method to generate text with a single call to a decoder-only CLM (like BLOOM, OPT, GPT-3...), so I will not need the cache. Yet I still want to benefit from FlexGen's offloading and compression.
Can I do that? If so, how?

Doesn't seem to obey the --path argument, tries to download to .cache again

On Windows at least, it seems --path is not obeyed, and it keeps downloading into the .cache directory on the C:\ file system (which I don't have enough space on).

I've manually downloaded required files into other drive, where the relative path is ../models--facebook--opt-30b

Executing python -m flexgen.flex_opt --model facebook/opt-30b --path ../models--facebook--opt-30b still causes it to download into the aforementioned .cache directory. I also tried the variant python -m flexgen.flex_opt --model facebook/opt-30b --path ../models--facebook--opt-30b/snapshots/ceea0a90ac0f6fae7c2c34bcb40477438c152546.

Am I misunderstanding the way --path works, or is there something wrong with it?

Also, it would be nice to have an option to inhibit all automatic downloads and just stop with an error, as I do not want to exhaust disk space that is already precious...

Unable to run the benchmark

Hi,

I'm trying to run the benchmark bench_30b_1x4.sh (except that I set N_GPUS=2), but I get the following python exception:

rank #1: TypeError: sequence item 6: expected str instance, NoneType found
Traceback (most recent call last):
  File "/home/fungiboletus/miniconda3/envs/flexgen/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/fungiboletus/miniconda3/envs/flexgen/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/fungiboletus/flexgen/flexgen/dist_flex_opt.py", line 694, in <module>
    raise e
  File "/home/fungiboletus/flexgen/flexgen/dist_flex_opt.py", line 690, in <module>
    run_flexgen_dist(args)
  File "/home/fungiboletus/flexgen/flexgen/dist_flex_opt.py", line 620, in run_flexgen_dist
    outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
  File "/home/fungiboletus/miniconda3/envs/flexgen/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 3432, in batch_decode
    return [
  File "/home/fungiboletus/miniconda3/envs/flexgen/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 3433, in <listcomp>
    self.decode(
  File "/home/fungiboletus/miniconda3/envs/flexgen/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 3471, in decode
    return self._decode(
  File "/home/fungiboletus/miniconda3/envs/flexgen/lib/python3.10/site-packages/transformers/tokenization_utils.py", line 949, in _decode
    sub_texts.append(self.convert_tokens_to_string(current_sub_text))
  File "/home/fungiboletus/miniconda3/envs/flexgen/lib/python3.10/site-packages/transformers/models/gpt2/tokenization_gpt2.py", line 316, in convert_tokens_to_string
    text = "".join(tokens)
TypeError: sequence item 6: expected str instance, NoneType found
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. ...

I use Python 3.10.9 with Pytorch 1.13.1 with Cuda 11.7, and mpirun 2.1.1.

Question about allocations among different memory hierarchies

    dev_percents = [policy.w_disk_percent, policy.w_cpu_percent, policy.w_gpu_percent]
    dev_choices = [env.disk, env.cpu, env.gpu]

    sizes = [np.prod(spec[0]) for spec in weight_specs]
    sizes_cumsum = np.cumsum(sizes)

    for i in range(len(weight_specs)):
        mid_percent = (sizes_cumsum[i] - sizes[i] / 2) / sizes_cumsum[-1]
        home = get_choice(mid_percent * 100, dev_percents, dev_choices)

For a simple example with sizes=[10,20,30,40] and dev_percents=[60,20,20]:
sizes_cumsum=[10,30,60,100] and percents_cumsum=[60,80,100].
The mid_percent calculated for each weight_spec is:
weight_spec 1: (10-10/2)/100=5%
weight_spec 2: (30-20/2)/100=20%
weight_spec 3: (60-30/2)/100=45%
weight_spec 4: (100-40/2)/100=80%

def get_choice(cur_percent, percents, choices):
    percents = np.cumsum(percents)
    assert np.abs(percents[-1] - 100) < 1e-5

    for i in range(len(percents)):
        if cur_percent < percents[i]:
            return choices[i]
    return choices[-1]

Following the logic of get_choice, weight_specs 1, 2, and 3 (60% of the total size) will all be allocated on disk, consuming the entire 60% specified for disk. Nothing at all is allocated on the CPU, and weight_spec 4 (40% of the total) is allocated on the GPU, exceeding the 20% specified for the GPU.
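For anyone following along, here is a self-contained restatement of the example above (string placeholders stand in for env.disk / env.cpu / env.gpu):

import numpy as np

def get_choice(cur_percent, percents, choices):
    percents = np.cumsum(percents)
    assert np.abs(percents[-1] - 100) < 1e-5
    for i in range(len(percents)):
        if cur_percent < percents[i]:
            return choices[i]
    return choices[-1]

dev_percents = [60, 20, 20]            # disk, cpu, gpu budgets
dev_choices = ["disk", "cpu", "gpu"]   # placeholders for env.disk, env.cpu, env.gpu

sizes = [10, 20, 30, 40]
sizes_cumsum = np.cumsum(sizes)

for i, size in enumerate(sizes):
    mid_percent = (sizes_cumsum[i] - size / 2) / sizes_cumsum[-1]
    home = get_choice(mid_percent * 100, dev_percents, dev_choices)
    print(f"weight_spec {i + 1}: mid_percent = {mid_percent:.0%} -> {home}")

# Prints disk, disk, disk, gpu: 60% of the bytes land on disk, 40% on the GPU,
# and nothing is placed on the CPU despite its 20% budget.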

Question: FlexGen seems slower than simple CPU code, am I missing something? [see discussion]

Hi!
I'm trying to reproduce the FlexGen results and compare them with more naive methods, and I'm getting weird results. Can you please help me?

edit: added benchmark details and a minimalistic code to reproduce my claims.

Library versions:

ubuntu-server 18.04
PyTorch and dependencies installed via anaconda 4.9.1, package versions:
pytorch 1.13.1 py3.8_cuda11.7_cudnn8.5.0_0
numpy 1.23.4 py38h14f4228_0
numpy-base 1.23.4 py38h31eccc5_0

Transformers and tqdm installed via pip:
transformers 4.25.1
tqdm 4.64.1
bitsandbytes 0.37.0

I ran OPT-175B on a similar machine:

  • dual Xeon 6426Y (mid-range server CPU) and 256GB RAM, which is slightly more than in the benchmark, but the code never uses more than 200GB (the benchmark setup has 208 GB).

  • using prefix length 512 and output length 32, similar to the README benchmark, and a batch size of 8 (edited; thanks to @merrymercy for pointing out the discrepancy).

I am using standard Hugging Face code, with transformers.models.opt.modeling_opt.OPTForCausalLM.
The model was quantized to 8-bit using PyTorch PTDQ on linear layers with all default parameters.

Based on my measurements, I am getting 2.06 tokens per second in a basic CPU setup for an 8-bit model, or about 3.9 seconds per batch step. This is basic Hugging Face + PyTorch PTDQ, no DeepSpeed / Accelerate. Note: this does not account for prefill, so it is not a fair comparison; see the adjusted figures below.

In turn, FlexGen reports 1.12 tokens per second for a 4-bit OPT-175B model:

[image]

And, weirdly, simple 8-bit CPU inference beats both FlexGen and FlexGen(c), given the large-batch setup in question.

Did I understand the evaluation setup correctly? If not, can you please tell me what am I missing?

Summary and corrections from the discussion below:

Based on the suggestions by @merrymercy, it is inappropriate to compare against CPU with batch size 64 since it does not fit in the original testing environment. I have updated the metrics with batch size 8 (to be safe); the decoding throughput fell from 3.66 to 2.06 tokens/second.

Based on the discussion with @Ying1123: In Section 6.0, the generative throughput is defined as "the number of generated tokens / (prefill time + decoding time)".

Here, prefill time stands for encoding the input sequence in parallel, layer by layer.
If the baseline algorithm prefills naively on CPU, FlexGen(c)-4-bit does indeed outperform the CPU 8-bit baseline. On CPU, most of the time is spent on prefill. On GPU, the situation is the opposite: prefill is quick since it can be done with one offloading cycle, while generation requires multiple offloading cycles and takes longer.

In further discussion, we consider the option of running prefill on the GPU (using simple offloading, streaming KVs to the CPU), then running inference on the CPU.

On a single T4 GPU, you can prefill 8 samples of 512 tokens with the OPT-175B model in 8-bit precision (the CUDA 8-bit code uses Linear8bitLt from bitsandbytes 0.37.0 with threshold=6) in 91.2 seconds using naive overlapped offloading. The CPU decoding time is, in turn, 124.3 seconds on the 2x 6426Y. The aggregate baseline throughput is 8 * 32 / (91.18 + 124.277) ~= 1.19 tokens/second.

While the naive code is still faster, the difference between FlexGen and the baseline is not as significant as I originally thought.
Important: later in this thread, @Ying1123 provides their own evaluation on a somewhat weaker CPU (2 GHz, fewer cores, virtualized). For that setup, FlexGen-4bit on GPU is indeed 1.6x faster than the 8-bit CPU baseline, even if we account for GPU prefill. I thank @Ying1123 and @merrymercy for pointing out the differences and apologize for taking up their time.

Limitations that I left unaddressed

  • the baseline algorithm uses 8-bit compression, while FlexGen(c) uses a 4-bit compression algorithm; it would be better to evaluate at the same compression level. If the baseline were switched to 4-bit compression, it would also make sense to increase the batch size.
  • the throughput comparison depends on the chosen sequence length and CPU type. I have a hunch that shorter sequence lengths would benefit from GPU-side decoding, while longer sequence lengths favour the CPU to avoid transferring the attention cache. @Ying1123 correctly points out that it would be best to compare the two approaches more systematically.
  • the GPU prefill was measured separately on a different machine, because the original 6426Y machine has no GPU attached. In turn, the machine with the T4 has a more powerful CPU (EPYC 7742) that decodes faster (1.67 t/s final throughput) but is significantly more expensive. For a pure academic comparison, it would be best to evaluate both setups on a number of identical machines with different CPU/GPU balances.

CPU/GPU transfer

Really cool work. I am trying to optimize the CPU/GPU transfer of attention cache tensors for a large language model that I run on multiple GPUs. I don't need to use disk and don't need to keep parts of the same tensor on different devices, so I don't think I can use FlexGen out of the box; I am now just trying to understand whether your code is much faster at copying tensors between CPU and GPU than basic .cpu() and .cuda(). If so, is there a place in the codebase with faster CPU/GPU copying utilities? Or do you have a general strategy (like always pin_memory() and then use non_blocking=True, or try to reuse pre-allocated buffers)?
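Not FlexGen-specific, but here is a minimal sketch of the generic PyTorch recipe the question alludes to: pre-allocate a pinned CPU buffer once, reuse it, and issue non-blocking copies on a side stream. The buffer shapes and names are made up for illustration.

import torch

gpu = torch.device("cuda")
copy_stream = torch.cuda.Stream()

# Pre-allocate once and reuse: pinning host memory is expensive, and
# non_blocking copies only overlap with compute when the host buffer is pinned.
kv_gpu = torch.empty(4, 1024, 1024, dtype=torch.float16, device=gpu)
kv_cpu = torch.empty(kv_gpu.shape, dtype=torch.float16, pin_memory=True)

with torch.cuda.stream(copy_stream):
    kv_cpu.copy_(kv_gpu, non_blocking=True)  # device-to-host copy on the side stream

copy_stream.synchronize()  # ensure the copy finished before the host reads kv_cpu
print(kv_cpu.shape)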

Offloading to disk does not work (opt-66b)

I just tried running flexgen.flex_opt with the following command:
python3 -m flexgen.flex_opt --model facebook/opt-66b --percent 0 0 100 0 100 0 --offload-dir ~/tmp/offload/ but this only fills the CPU memory until the process is killed by the OS and the folder ~/tmp/offload/ stays completely empty

System:

  • Arch Linux
  • 128GiB RAM
  • RTX3090

ValueError: Invalid model name: galactica-30b

How do I pass the argument so that facebook/galactica-30b is loaded?

This generates the error in the title: python -m flexgen.flex_opt --model facebook/galactica-30b --gpu-batch-size 32 --percent 100 0 100 0 100 0

line 118, in get_opt_config
    raise ValueError(f"Invalid model name: {name}")
ValueError: Invalid model name: galactica-30b

Also, it would be good to know what other arguments to pass in order to run this model optimally with FlexGen.

Suggestion: Add GPT-NeoX 20B support

JAX is already a library that is optimized for GPU training, and the NeoX repo itself already requires significant GPU resources that could benefit from offloading.
