squeezeailab / squeezellm
[ICML 2024] SqueezeLLM: Dense-and-Sparse Quantization
Home Page: https://arxiv.org/abs/2306.07629
License: MIT License
I previously contributed a pull request that reduced the runtime of the main clustering algorithm from over two hours to just six minutes for the Llama 2 7B model (#60). In the 'Further Suggestions' section of that PR, I mentioned potential optimizations by exploiting the 1D nature of the task.
I'm excited to share that I've developed a Python package, flash1dkmeans, which implements a faster 1D K-means algorithm. This package is now part of the Any-Precision LLM project, a variable bit-rate quantization scheme using SqueezeLLM as the seed model. With this new implementation, we've managed to further reduce the execution time for SqueezeLLM to 38 seconds on an i9-13900K machine, achieving a further tenfold speed increase.
If you are interested in integrating this speed enhancement, you can refer to the code in Any-Precision LLM as an example of how we use the package to create the seed model. For maximum performance gains, consider accelerating the caller function with @numba.njit(parallel=True). However, even using a standard multiprocessing pool should yield significant improvements.
This package can serve as an almost drop-in replacement for sklearn's K-means if you're looking to speed up SqueezeLLM further. Of course, sticking with sklearn for better transparency is perfectly fine too. I wanted to share these findings, as your work helped create ours.
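For anyone curious where the swap happens, below is a minimal sketch (variable names and data are placeholders) of the weighted 1-D K-means step as it is done with sklearn today; a faster 1-D implementation such as flash1dkmeans would replace just this clustering call:

```python
import numpy as np
from sklearn.cluster import KMeans

# One output channel's weights, flattened to 1-D, plus their sensitivities
# (e.g. squared gradients). Placeholder data for illustration.
values = np.random.randn(4096).reshape(-1, 1)
sensitivities = np.random.rand(4096)

# 4-bit non-uniform quantization -> 16 centroids forming the per-channel LUT.
kmeans = KMeans(n_clusters=16, n_init=1).fit(values, sample_weight=sensitivities)
lut = kmeans.cluster_centers_.flatten()   # 16-entry lookup table
codes = kmeans.labels_                    # cluster index per weight
```

Since the data is one-dimensional, a specialized algorithm can exploit sorting and prefix sums over the sorted values, which is where the extra speedup comes from.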
Hello, I have a question that I hope can be answered.
Why do the LLaMA-2-7B and Mistral models only have Dense-only (0%) quantized models, and not 0.05% and 0.45% sparsity quantized models?
Is it because the quantization results are not good for these two models?
Thanks.
Thank you for sharing the great work!!
I attempted to evaluate the dense-and-sparse checkpoint and encountered a bug. It is too small to be a pull request, so I'm just reporting it here.
https://github.com/SqueezeAILab/SqueezeLLM/blob/main/squeezellm/quant.py#L199
- num = getattr(numvals[name1])
+ num = numvals[name1]
(getattr needs both an object and an attribute name, so the single-argument call raises a TypeError; plain indexing is what was intended.)
Hello,
I hope this message finds you well. I wanted to express my appreciation for your paper on "squeezellm" and the seminar you conducted. I found both to be remarkably insightful.
I have been eager to try out a model implementing squeezellm in practice. However, I have encountered some errors when attempting to apply the code provided in the tutorial. Specifically, after downloading the llama-2 7B (3-bit) model from the URL you provided, I attempted to run the following command:
CUDA_VISIBLE_DEVICES=0 python llama.py ./models/sq-llama-2-7b-w3-s0/ c4 --wbits 3 --load sq-llama-2-7b-w3-s0.pt --benchmark 128 --check --torch_profile
Unfortunately, upon execution, I received the following error message:
Traceback (most recent call last):
File "/home/work3/user/etc/SqueezeLLM/llama.py", line 317, in <module>
model = load_quant(
File "/home/work3/user/etc/SqueezeLLM/llama.py", line 157, in load_quant
state_dict = torch.load(checkpoint)
File "/home/home/user/anaconda3/envs/sqllm/lib/python3.9/site-packages/torch/serialization.py", line 998, in load
with _open_file_like(f, 'rb') as opened_file:
File "/home/home/user/anaconda3/envs/sqllm/lib/python3.9/site-packages/torch/serialization.py", line 445, in _open_file_like
return _open_file(name_or_buffer, mode)
File "/home/home/user/anaconda3/envs/sqllm/lib/python3.9/site-packages/torch/serialization.py", line 426, in __init__
super().__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: 'sq-llama-2-7b-w3-s0.pt'
I have verified that all the required dependencies, such as Python, transformers, tokenizers, and CUDA, are installed with the correct versions:
Python==3.9
tokenizers==0.13.3
transformers==4.29.0
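One thing I suspect (purely a guess from the traceback): since --load is given a bare file name, torch.load looks for it in the current working directory rather than inside ./models/sq-llama-2-7b-w3-s0/. Should the command instead pass the full path, e.g. --load ./models/sq-llama-2-7b-w3-s0/sq-llama-2-7b-w3-s0.pt?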
Have you encountered a similar error before? If so, I would greatly appreciate any guidance or suggestions on how to resolve it.
Great work, thank you for sharing it!
I was waiting for the Vicuna v1.3 weights to be released (they were previously marked as coming soon in the readme), as it is my current model of choice (and performs significantly better than v1.1), but I see that this section has been removed from the readme.
Is there any plan to quantize that family of models?
Why does LLaMA-2-7B have s0 quantized models, but no s5 and s45 sparsity quantized models?
Do you intend to release the code for creating the quantisation weights?
I would like to port this to the GPT-J model (which is open source, unlike LLaMA). But to do this I would need the code for computing the quantised weights. It seems only the inference code has been released so far.
Thank you for your amazing work, I'm interested in whether SqueezeLLM could support the JAIS model for quantization. Are there plans to include JAIS model support?
https://huggingface.co/core42/jais-30b-v3
https://huggingface.co/core42/jais-30b-chat-v3
Hey there,
I was just wondering how this compares to SpQR; the perplexity/size trade-off seems on par.
Here's their paper:
https://arxiv.org/abs/2306.03078
Thank you!
Hello, a question came up while I was reading the paper.
Looking at Table 1, the Avg. Bits for the dense-only case is listed as 4.05 rather than 4. Why is it not exactly 4 bits?
As I understand it, dense-only does not use a sparse matrix, so all the weights are integers, and I therefore expected the precision to be exactly 4 bits.
Is the Avg. Bits 4.05 due to some overhead introduced by the non-uniform quantization?
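My own back-of-the-envelope guess: if each output channel stores a 16-entry fp16 lookup table for 4-bit quantization, that is 16 × 16 = 256 extra bits per channel, or roughly 256 / 4096 ≈ 0.06 extra bits per weight for a 4096-wide layer, which seems to be in the right ballpark for the extra 0.05 bits. Is that the right way to think about it?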
P.S. I believe I heard you present at an offline model-compression study session back in August, and it left a strong impression on me! I'm curious whether you will be attending again this year. :)
Thank you.
In both Benchmarking
CUDA_VISIBLE_DEVICES=0 python llama.py <path-to-llama-7b-hf> c4 --wbits 4 --load sq-llama-7b-w3-o0.pt --benchmark 128 --check
and Perplexity Evaluation
CUDA_VISIBLE_DEVICES=0 python llama.py <path-to-llama-7b-hf> c4 --wbits 4 --load sq-llama-7b-w3-o0.pt --eval
there is a typo in the name of the checkpoint: it should be sq-llama-7b-w3-s0.pt, not sq-llama-7b-w3-o0.pt.
Also, the suggested flag --wbits 4 does not match the referenced checkpoint, which was quantized at 3 bits, so it should be --wbits 3.
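With both fixes applied, the benchmarking command would read:
CUDA_VISIBLE_DEVICES=0 python llama.py <path-to-llama-7b-hf> c4 --wbits 3 --load sq-llama-7b-w3-s0.pt --benchmark 128 --check
and the perplexity-evaluation command changes the same way, ending in --eval.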
Great job!
After quantizing the model, I'd like to be able to save the resulting model for later use. However, PyTorch's save commands do not seem to be working properly. Is there anything I need to change in the model config?
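For concreteness, the standard PyTorch pattern I have in mind is roughly the following (a minimal sketch with a toy module; file name is a placeholder):

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 8)  # stand-in for the quantized model object

# Save the model's tensors for later reuse.
torch.save(model.state_dict(), "sq-model-resaved.pt")

# Later: rebuild the same module structure first (llama.py's load_quant does
# this for the quantized model), then restore the saved tensors into it.
model2 = nn.Linear(8, 8)
model2.load_state_dict(torch.load("sq-model-resaved.pt"))
```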
Are there any plans to standardize and support this quantization method in a broader scope?
Currently it seems that quantizing a new type of model is very hard, as we need to customize modeling_<model>.py to extract the weight gradients.
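For context, my rough understanding of the gradient-extraction step is to run forward/backward passes on a few calibration samples and accumulate the squared gradient of each weight as its sensitivity. A minimal sketch (not the repo's actual script, and using a small stand-in model) looks like this:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# "gpt2" is just a small stand-in model for illustration; SqueezeLLM targets
# LLaMA-style models and its actual gradient script may differ.
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model.train()

# Accumulate squared gradients per weight as a (diagonal-Fisher-style)
# sensitivity estimate over a few calibration samples.
sensitivity = {n: torch.zeros_like(p) for n, p in model.named_parameters()}

for text in ["calibration sample one", "calibration sample two"]:
    inputs = tokenizer(text, return_tensors="pt")
    model.zero_grad()
    loss = model(**inputs, labels=inputs["input_ids"]).loss
    loss.backward()
    for n, p in model.named_parameters():
        if p.grad is not None:
            sensitivity[n] += p.grad.detach() ** 2
```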
First, thanks very much for creating this cool technology.
On one A100 GPU with 80 GB VRAM, I tried benchmarking sq-vicuna-7b-v1.3-w3-s0 and its base model. It is a bit strange that the median running time has not been reduced by much. This seems different from the speed-up results reported in your paper. Would you mind helping me trace a possible reason? Could it be because my experiment was on a more powerful GPU?
| | Median | PPL | Max memory (MiB) |
|---|---|---|---|
| w3-s0 | 0.025365471839904785 | 16.07021141052246 | 3602.3271484375 |
| FP16 | 0.02616262435913086 | 14.921088218688965 | 25906.5771484375 |
Script:
#!/bin/bash
# vicuna v1.3 Benchmarking
CUDA_VISIBLE_DEVICES=0 python llama.py models/sq-vicuna-7b-v1.3-w3-s0 c4 --wbits 3 --load models/sq-vicuna-7b-v1.3-w3-s0/sq-vicuna-7b-v1.3-w3-s0.pt --benchmark 128 --check
# vicuna v1.3 base
# HF naming can use cache
CUDA_VISIBLE_DEVICES=0 python llama.py lmsys/vicuna-7b-v1.3 c4 --wbits 16 --benchmark 128 --check
Hello,
Thank you for sharing your excellent paper.
I noticed that you conducted quantization in an output channel-wise manner. Have you ever tried quantization in an input channel-wise manner? I'm curious about the reasons for choosing output channel-wise quantization instead of input channel-wise.
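Just to make sure I am describing the two options precisely, here is a tiny sketch (NumPy, names are mine) of what I mean: output-channel-wise groups each row of an [out_features, in_features] weight matrix into its own codebook, whereas input-channel-wise would group each column.

```python
import numpy as np

W = np.random.randn(8, 16)  # toy weight matrix: [out_features, in_features]

# Output-channel-wise: the values quantized together (one LUT each) are rows.
output_channel_groups = [W[i, :] for i in range(W.shape[0])]

# Input-channel-wise: the values quantized together would be columns instead.
input_channel_groups = [W[:, j] for j in range(W.shape[1])]
```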
Thank you.
In run/.py, changing the line
from transformers import Trainer
to
from src.transformers import Trainer
solved the problem.
Hi, I want to say thank you for your great work on quantisation. However, in the current codebase I don't see the actual implementation of the quantisation algorithm. Can you provide it in the near future?
Hello!
I followed the D+S packing instructions and stored the packed .pt file in "~/models/${model_name}-squeezellm/packed_weight", where model_name="Llama-2-7b-chat-hf". When I load this model in vLLM:
python examples/llm_engine_example.py --dtype float16 --model ~/models/${model_name}-squeezellm/packed_weight --quantization squeezellm
vLLM complained that it cannot find the parameters "sparse_threshold.model.layers.*". Any idea why? I repeated the quantization from scratch several times, but they all ended with this error.
As a quick fix, I manually skip the above error in vLLM's model loading step in llama.py when the missing parameter cannot be found. However, the model then cannot generate meaningful output, so I believe those parameters are indeed not being loaded correctly.
This is a really nice work!
I followed the instructions to quantize Llama-2-7b-chat-hf. At the k-means clustering step, I ran the following command:
python nuq.py --bit 4 --model_type llama --model ~/models/${model_name}-squeezellm/model_chunks --gradient ~/models/${model_name}-squeezellm/gradient_chunks --output ~/models/${model_name}-squeezellm/LUT
And got this error:
Quantizing layers [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]
Quantizing layer 0
0%| | 0/7 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/home/mmilin/projects/ming_benchmark_vllm/SqueezeLLM/quantization/nuq.py", line 166, in <module>
kmeans = KMeans(
File "/home/mmilin/projects/ming_benchmark_vllm/venv/lib/python3.9/site-packages/sklearn/base.py", line 1152, in wrapper
return fit_method(estimator, *args, **kwargs)
File "/home/mmilin/projects/ming_benchmark_vllm/venv/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py", line 1519, in fit
centers_init = self._init_centroids(
File "/home/mmilin/projects/ming_benchmark_vllm/venv/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py", line 1019, in _init_centroids
centers, _ = _kmeans_plusplus(
File "/home/mmilin/projects/ming_benchmark_vllm/venv/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py", line 229, in _kmeans_plusplus
center_id = random_state.choice(n_samples, p=sample_weight / sample_weight.sum())
File "numpy/random/mtrand.pyx", line 973, in numpy.random.mtrand.RandomState.choice
ValueError: probabilities are not non-negative
I manually checked the sample_weight. It has negative elements, which is weird.
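In case it helps narrow this down, here is the kind of quick check I would run over the gradient chunks (a sketch only; the per-chunk file layout and dict-of-tensors format are assumptions on my part):

```python
import glob
import torch

# Assumed layout: one .pt file per chunk under gradient_chunks/, each holding
# a dict mapping parameter names to per-weight sensitivity tensors (these
# should be non-negative if they are squared gradients).
for path in sorted(glob.glob("gradient_chunks/*.pt")):
    chunk = torch.load(path, map_location="cpu")
    for name, tensor in chunk.items():
        if (tensor < 0).any():
            print(f"{path}: {name} has negative entries (min={tensor.min().item()})")
```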
Hi. First of all, thank you for sharing your great work.
I am looking for a way to fine-tune the LLaMA-7B model with SqueezeLLM. It seems that the publicly available code in llama.py can only benchmark or eval.
Do I need a special way to fine-tune SqueezeLLM, or can I train it with normal PyTorch fine-tuning?
Hello, thanks for this paper!
When looking at the curve that suggests 30B-q3 could be used instead of 7B fp16, I got really interested in seeing the same graphic with 65B-q3. You mention using A6000 GPUs, which have 48 GB. Would a LLaMA 65B q3 theoretically fit there?
My reference is llama.cpp q5_1 being 46GB.
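As a rough sanity check of my own: 65B weights at about 3 bits each is roughly 65e9 × 3 / 8 ≈ 24 GB before any lookup-table or sparse overhead, so it seems like it should fit comfortably in 48 GB.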
Thanks in advance!
Hi, thanks for your amazing work
I would love to know if you have any update on the sparse training script, or whether you are planning to quantize Vicuna v1.5, since it has massive improvements over its LLaMA-1 counterpart.
It could lead to many more quantizations of Llama-2-based chat models like WizardLM.
Thanks again.
Hi, thanks for your amazing work. I would like to reproduce the paper's results using dense-and-sparse quantization. Could you provide some suggestions?