squeezeailab / squeezellm
[ICML 2024] SqueezeLLM: Dense-and-Sparse Quantization
Home Page: https://arxiv.org/abs/2306.07629
License: MIT License
I previously contributed a pull request that reduced the runtime of the main clustering algorithm from over two hours to just six minutes for the Llama 2 7B model (#60). In the 'Further Suggestions' section of that PR, I mentioned potential optimizations by exploiting the 1D nature of the task.
I'm excited to share that I've developed a Python package, flash1dkmeans, which implements a faster 1D K-means algorithm. This package is now part of the Any-Precision LLM project, a variable bit-rate quantization scheme using SqueezeLLM as the seed model. With this new implementation, we've managed to further reduce the execution time for SqueezeLLM to 38 seconds on an i9-13900K machine, achieving a further tenfold speed increase.
If you are interested in integrating this speed enhancement, you can refer to the code in Any-Precision LLM as an example of how we use the package to create the seed model. For maximum performance gains, consider accelerating the caller function with @numba.njit(parallel=True). However, even using a standard multiprocessing pool should yield significant improvements.
This package can serve as an almost drop-in replacement for sklearn's K-means if you're looking to speed up SqueezeLLM further. Of course, sticking with sklearn for better transparency is perfectly fine too. I wanted to share these findings, as your work helped create ours.
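For anyone curious where the swap happens, below is a minimal sketch (variable names and data are placeholders) of the weighted 1-D K-means step as it is done with sklearn today; a faster 1-D implementation such as flash1dkmeans would replace just this clustering call:

```python
import numpy as np
from sklearn.cluster import KMeans

# One output channel's weights, flattened to 1-D, plus their sensitivities
# (e.g. squared gradients). Placeholder data for illustration.
values = np.random.randn(4096).reshape(-1, 1)
sensitivities = np.random.rand(4096)

# 4-bit non-uniform quantization -> 16 centroids forming the per-channel LUT.
kmeans = KMeans(n_clusters=16, n_init=1).fit(values, sample_weight=sensitivities)
lut = kmeans.cluster_centers_.flatten()   # 16-entry lookup table
codes = kmeans.labels_                    # cluster index per weight
```

Since the data is one-dimensional, a specialized algorithm can exploit sorting and prefix sums over the sorted values, which is where the extra speedup comes from.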
Hello, I have a question that I hope can be answered.
Why do the LLaMA-2-7B and Mistral models only have Dense-only (0%) quantized models, and not 0.05% and 0.45% sparsity quantized models?
Is it because the quantization results are not good for these two models?
Thanks.
Thank you for sharing the great work!!
I attempted to evaluate the dense-and-sparse checkpoint and encountered a bug. It is too small to be a pull request, so I'm just reporting it here.
https://github.com/SqueezeAILab/SqueezeLLM/blob/main/squeezellm/quant.py#L199
- num = getattr(numvals[name1])
+ num = numvals[name1]
(getattr needs both an object and an attribute name, so the single-argument call raises a TypeError; plain indexing is what was intended.)
Hello,
I hope this message finds you well. I wanted to express my appreciation for your paper on "squeezellm" and the seminar you conducted. I found both to be remarkably insightful.
I have been eager to try out a model implementing squeezellm in practice. However, I have encountered some errors when attempting to apply the code provided in the tutorial. Specifically, after downloading the llama-2 7B (3-bit) model from the URL you provided, I attempted to run the following command:
CUDA_VISIBLE_DEVICES=0 python llama.py ./models/sq-llama-2-7b-w3-s0/ c4 --wbits 3 --load sq-llama-2-7b-w3-s0.pt --benchmark 128 --check --torch_profile
Unfortunately, upon execution, I received the following error message:
Traceback (most recent call last):
File "/home/work3/user/etc/SqueezeLLM/llama.py", line 317, in <module>
model = load_quant(
File "/home/work3/user/etc/SqueezeLLM/llama.py", line 157, in load_quant
state_dict = torch.load(checkpoint)
File "/home/home/user/anaconda3/envs/sqllm/lib/python3.9/site-packages/torch/serialization.py", line 998, in load
with _open_file_like(f, 'rb') as opened_file:
File "/home/home/user/anaconda3/envs/sqllm/lib/python3.9/site-packages/torch/serialization.py", line 445, in _open_file_like
return _open_file(name_or_buffer, mode)
File "/home/home/user/anaconda3/envs/sqllm/lib/python3.9/site-packages/torch/serialization.py", line 426, in __init__
super().__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: 'sq-llama-2-7b-w3-s0.pt'
I have verified that all the required dependencies, such as Python, transformers, tokenizers, and CUDA, are installed with the correct versions:
Python==3.9
tokenizers==0.13.3
transformers==4.29.0
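One thing I suspect (purely a guess from the traceback): since --load is given a bare file name, torch.load looks for it in the current working directory rather than inside ./models/sq-llama-2-7b-w3-s0/. Should the command instead pass the full path, e.g. --load ./models/sq-llama-2-7b-w3-s0/sq-llama-2-7b-w3-s0.pt?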
Have you encountered a similar error before? If so, I would greatly appreciate any guidance or suggestions on how to resolve it.
Great work, thank you for sharing it!
I was waiting for the Vicuna v1.3 weights to be released (they were previously marked as coming soon in the readme), as it is my current model of choice (and performs significantly better than v1.1), but I see that this section has been removed from the readme.
Is there any plan to quantize that family of models?
Why does LLaMA-2-7B have s0 quantized models, but no s5 and s45 sparsity quantized models?
Do you intend to release the code for creating the quantisation weights?
I would like to port this to the GPT-J model (which is open source, unlike LLaMA). But to do this I would need the code for computing the quantised weights. It seems only the inference code has been released so far.
Thank you for your amazing work, I'm interested in whether SqueezeLLM could support the JAIS model for quantization. Are there plans to include JAIS model support?
https://huggingface.co/core42/jais-30b-v3
https://huggingface.co/core42/jais-30b-chat-v3
Hey there,
I was just wondering how this compares to SpQR; the perplexity/size trade-off seems on par.
Here's their paper:
https://arxiv.org/abs/2306.03078
Thank you!
Hello, a question came up while I was reading the paper.
Looking at Table 1, the Avg. Bits for the dense-only case is listed as 4.05 rather than 4. Why is it not exactly 4 bits?
As I understand it, dense-only does not use a sparse matrix, so all the weights are integers, and I therefore expected the precision to be exactly 4 bits.
Is the Avg. Bits 4.05 due to some overhead introduced by the non-uniform quantization?
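My own back-of-the-envelope guess: if each output channel stores a 16-entry fp16 lookup table for 4-bit quantization, that is 16 × 16 = 256 extra bits per channel, or roughly 256 / 4096 ≈ 0.06 extra bits per weight for a 4096-wide layer, which seems to be in the right ballpark for the extra 0.05 bits. Is that the right way to think about it?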
P.S. I believe I heard you present at an offline model-compression study session back in August, and it left a strong impression on me! I'm curious whether you will be attending again this year. :)
Thank you.
In both Benchmarking
CUDA_VISIBLE_DEVICES=0 python llama.py <path-to-llama-7b-hf> c4 --wbits 4 --load sq-llama-7b-w3-o0.pt --benchmark 128 --check
and Perplexity Evaluation
CUDA_VISIBLE_DEVICES=0 python llama.py <path-to-llama-7b-hf> c4 --wbits 4 --load sq-llama-7b-w3-o0.pt --eval
there is a typo in the name of the checkpoint: it should be sq-llama-7b-w3-s0.pt, not sq-llama-7b-w3-o0.pt.
Also, the suggested flag --wbits 4 does not match the referenced checkpoint, which was quantized at 3 bits, so it should be --wbits 3.
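With both fixes applied, the benchmarking command would read:
CUDA_VISIBLE_DEVICES=0 python llama.py <path-to-llama-7b-hf> c4 --wbits 3 --load sq-llama-7b-w3-s0.pt --benchmark 128 --check
and the perplexity-evaluation command changes the same way, ending in --eval.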
Great job!
After quantizing the model, I'd like to be able to save the resulting model for later use. However, PyTorch's save commands do not seem to be working properly. Is there anything I need to change in the model config?
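For concreteness, the standard PyTorch pattern I have in mind is roughly the following (a minimal sketch with a toy module; file name is a placeholder):

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 8)  # stand-in for the quantized model object

# Save the model's tensors for later reuse.
torch.save(model.state_dict(), "sq-model-resaved.pt")

# Later: rebuild the same module structure first (llama.py's load_quant does
# this for the quantized model), then restore the saved tensors into it.
model2 = nn.Linear(8, 8)
model2.load_state_dict(torch.load("sq-model-resaved.pt"))
```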
Are there any plans to standardize and support this quantization method in a broader scope?
Currently it seems that quantizing a new type of model is very hard, as we need to customize modeling_<model>.py to extract the weight gradients.
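For context, my rough understanding of the gradient-extraction step is to run forward/backward passes on a few calibration samples and accumulate the squared gradient of each weight as its sensitivity. A minimal sketch (not the repo's actual script, and using a small stand-in model) looks like this:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# "gpt2" is just a small stand-in model for illustration; SqueezeLLM targets
# LLaMA-style models and its actual gradient script may differ.
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model.train()

# Accumulate squared gradients per weight as a (diagonal-Fisher-style)
# sensitivity estimate over a few calibration samples.
sensitivity = {n: torch.zeros_like(p) for n, p in model.named_parameters()}

for text in ["calibration sample one", "calibration sample two"]:
    inputs = tokenizer(text, return_tensors="pt")
    model.zero_grad()
    loss = model(**inputs, labels=inputs["input_ids"]).loss
    loss.backward()
    for n, p in model.named_parameters():
        if p.grad is not None:
            sensitivity[n] += p.grad.detach() ** 2
```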
First, thanks very much for creating this cool technology.
On one A100 GPU with 80 GB VRAM, I tried benchmarking sq-vicuna-7b-v1.3-w3-s0 and its base model. It is a bit strange that the median running time has not been reduced by much. This seems different from the speed-up results reported in your paper. Would you mind helping me trace a possible reason? Could it be because my experiment was on a more powerful GPU?
| | Median | PPL | Max memory (MiB) |
|---|---|---|---|
| w3-s0 | 0.025365471839904785 | 16.07021141052246 | 3602.3271484375 |
| FP16 | 0.02616262435913086 | 14.921088218688965 | 25906.5771484375 |
Script:
#!/bin/bash
# vicuna v1.3 Benchmarking
CUDA_VISIBLE_DEVICES=0 python llama.py models/sq-vicuna-7b-v1.3-w3-s0 c4 --wbits 3 --load models/sq-vicuna-7b-v1.3-w3-s0/sq-vicuna-7b-v1.3-w3-s0.pt --benchmark 128 --check
# vicuna v1.3 base
# HF naming can use cache
CUDA_VISIBLE_DEVICES=0 python llama.py lmsys/vicuna-7b-v1.3 c4 --wbits 16 --benchmark 128 --check
Hello,
Thank you for sharing your excellent paper.
I noticed that you conducted quantization in an output channel-wise manner. Have you ever tried quantization in an input channel-wise manner? I'm curious about the reasons for choosing output channel-wise quantization instead of input channel-wise.
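Just to make sure I am describing the two options precisely, here is a tiny sketch (NumPy, names are mine) of what I mean: output-channel-wise groups each row of an [out_features, in_features] weight matrix into its own codebook, whereas input-channel-wise would group each column.

```python
import numpy as np

W = np.random.randn(8, 16)  # toy weight matrix: [out_features, in_features]

# Output-channel-wise: the values quantized together (one LUT each) are rows.
output_channel_groups = [W[i, :] for i in range(W.shape[0])]

# Input-channel-wise: the values quantized together would be columns instead.
input_channel_groups = [W[:, j] for j in range(W.shape[1])]
```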
Thank you.
In run/.py, changing the line
from transformers import Trainer
to
from src.transformers import Trainer
solved the problem.
Hi, I want to say thank you for your great work on quantisation. However, in the current codebase I don't see the actual implementation of the quantisation algorithm. Can you provide it in the near future?
Hello!
I followed the D+S packing instructions and stored the packed .pt file in "~/models/${model_name}-squeezellm/packed_weight", where model_name="Llama-2-7b-chat-hf". When I load this model in vLLM:
python examples/llm_engine_example.py --dtype float16 --model ~/models/${model_name}-squeezellm/packed_weight --quantization squeezellm
vLLM complained that it cannot find the parameters "sparse_threshold.model.layers.*". Any idea why? I repeated the quantization from scratch several times, but they all ended with this error.
As a quick fix, I manually skip the above error in vLLM's model loading step in llama.py when the missing parameter cannot be found. However, the model then cannot generate meaningful output, so I believe those parameters are indeed not being loaded correctly.
This is a really nice work!
I followed the instructions to quantize Llama-2-7b-chat-hf. At the k-means clustering step, I ran the following command:
python nuq.py --bit 4 --model_type llama --model ~/models/${model_name}-squeezellm/model_chunks --gradient ~/models/${model_name}-squeezellm/gradient_chunks --output ~/models/${model_name}-squeezellm/LUT
And got this error:
Quantizing layers [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]
Quantizing layer 0
0%| | 0/7 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/home/mmilin/projects/ming_benchmark_vllm/SqueezeLLM/quantization/nuq.py", line 166, in <module>
kmeans = KMeans(
File "/home/mmilin/projects/ming_benchmark_vllm/venv/lib/python3.9/site-packages/sklearn/base.py", line 1152, in wrapper
return fit_method(estimator, *args, **kwargs)
File "/home/mmilin/projects/ming_benchmark_vllm/venv/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py", line 1519, in fit
centers_init = self._init_centroids(
File "/home/mmilin/projects/ming_benchmark_vllm/venv/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py", line 1019, in _init_centroids
centers, _ = _kmeans_plusplus(
File "/home/mmilin/projects/ming_benchmark_vllm/venv/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py", line 229, in _kmeans_plusplus
center_id = random_state.choice(n_samples, p=sample_weight / sample_weight.sum())
File "numpy/random/mtrand.pyx", line 973, in numpy.random.mtrand.RandomState.choice
ValueError: probabilities are not non-negative
I manually checked the sample_weight. It has negative elements, which is weird.
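In case it helps narrow this down, here is the kind of quick check I would run over the gradient chunks (a sketch only; the per-chunk file layout and dict-of-tensors format are assumptions on my part):

```python
import glob
import torch

# Assumed layout: one .pt file per chunk under gradient_chunks/, each holding
# a dict mapping parameter names to per-weight sensitivity tensors (these
# should be non-negative if they are squared gradients).
for path in sorted(glob.glob("gradient_chunks/*.pt")):
    chunk = torch.load(path, map_location="cpu")
    for name, tensor in chunk.items():
        if (tensor < 0).any():
            print(f"{path}: {name} has negative entries (min={tensor.min().item()})")
```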
Hi. First of all, thank you for sharing your great work.
I am looking for a way to fine-tune the LLaMA-7B model with SqueezeLLM. It seems that the publicly available code in llama.py can only benchmark or eval.
Do I need a special way to fine-tune SqueezeLLM, or can I train it with normal PyTorch fine-tuning?
Hello, thanks for this paper!
When looking at the curve that suggests 30B-q3 could be used instead of 7B fp16, I got really interested in seeing the same graphic with 65B-q3. You mention using A6000 GPUs, which have 48 GB. Would a LLaMA 65B q3 theoretically fit there?
My reference is llama.cpp q5_1 being 46GB.
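As a rough sanity check of my own: 65B weights at about 3 bits each is roughly 65e9 × 3 / 8 ≈ 24 GB before any lookup-table or sparse overhead, so it seems like it should fit comfortably in 48 GB.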
Thanks in advance!
Hi, thanks for your amazing work
I would love to know if you have any update on the sparse training script, or whether you are planning to quantize Vicuna v1.5, since it has massive improvements over its LLaMA-1 counterpart.
It could lead to many more quantizations of Llama-2-based chat models like WizardLM.
Thanks again.
Hi, thanks for your amazing work. I would like to reproduce the paper's results using dense-and-sparse quantization. Could you provide some suggestions?