
squeezeailab / squeezellm

592 stars · 17 watchers · 38 forks · 1.54 MB

[ICML 2024] SqueezeLLM: Dense-and-Sparse Quantization

Home Page: https://arxiv.org/abs/2306.07629

License: MIT License

Python 64.72% C++ 8.58% Cuda 26.71%
efficient-inference large-language-models llm model-compression natural-language-processing post-training-quantization quantization text-generation transformer llama

squeezellm's People

Contributors

amirgholami, baas-hans, chooper1, guspuffygit, kssteven418, sidjha1, syphonarch

squeezellm's Issues

Further speeding up the quantization process

I previously contributed a pull request that reduced the runtime of the main clustering algorithm from over two hours to just six minutes for the Llama 2 7B model (#60). In the 'Further Suggestions' section of that PR, I mentioned potential optimizations by exploiting the 1D nature of the task.

I'm excited to share that I've developed a Python package, flash1dkmeans, which implements a faster 1D K-means algorithm. This package is now part of the Any-Precision LLM project, a variable bit-rate quantization scheme using SqueezeLLM as the seed model. With this new implementation, we've managed to further reduce the execution time for SqueezeLLM to 38 seconds on an i9-13900K machine, achieving a further tenfold speed increase.

If you are interested in integrating this speed enhancement, you can refer to the code in Any-Precision LLM as an example of how we use the package to create the seed model. For maximum performance gains, consider accelerating the caller function with @numba.njit(parallel=True); however, even using a standard multiprocessing pool should yield significant improvements.

This package can serve as an almost drop-in replacement for sklearn's K-means if you're looking to speed up SqueezeLLM further. Of course, sticking with sklearn for better transparency is perfectly fine too. I wanted to share these findings, as your work helped create ours 👍.
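
For anyone curious why the 1D structure helps so much, below is a rough, self-contained sketch of weighted 1D k-means; it is an illustration of the general idea, not the flash1dkmeans implementation or the repository's nuq.py. Once the values are sorted, every cluster is a contiguous segment, so assignment reduces to a searchsorted over centroid midpoints and the weighted-mean update reduces to prefix-sum lookups.

import numpy as np

def weighted_kmeans_1d(values, weights, k, iters=50, seed=0):
    # Sort once; in 1D every cluster is a contiguous slice of the sorted data.
    order = np.argsort(values)
    x = np.asarray(values, dtype=np.float64)[order]
    w = np.asarray(weights, dtype=np.float64)[order]
    wx_cum = np.concatenate(([0.0], np.cumsum(w * x)))  # prefix sums of w*x
    w_cum = np.concatenate(([0.0], np.cumsum(w)))       # prefix sums of w

    rng = np.random.default_rng(seed)
    centroids = np.sort(rng.choice(x, size=k, replace=False))
    for _ in range(iters):
        # Cluster boundaries sit at the midpoints between adjacent centroids.
        bounds = np.searchsorted(x, (centroids[:-1] + centroids[1:]) / 2)
        lo = np.concatenate(([0], bounds))
        hi = np.concatenate((bounds, [len(x)]))
        mass = w_cum[hi] - w_cum[lo]
        new = np.where(mass > 0,
                       (wx_cum[hi] - wx_cum[lo]) / np.maximum(mass, 1e-12),
                       centroids)
        new = np.sort(new)
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids

In SqueezeLLM's setting, values would be a flattened weight channel and weights the corresponding sensitivities; the package referenced above adds Numba compilation and parallelism on top of this basic structure.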

Error encountered during execution of the SqueezeLLM tutorial

Hello,

I hope this message finds you well. I wanted to express my appreciation for your paper on SqueezeLLM and the seminar you conducted; I found both to be remarkably insightful.

I have been eager to try out a model implementing SqueezeLLM in practice. However, I have encountered some errors when attempting to run the code provided in the tutorial. Specifically, after downloading the Llama-2 7B (3-bit) model from the URL you provided, I attempted to run the following command:

CUDA_VISIBLE_DEVICES=0 python llama.py ./models/sq-llama-2-7b-w3-s0/ c4 --wbits 3 --load sq-llama-2-7b-w3-s0.pt --benchmark 128 --check --torch_profile

Unfortunately, upon execution, I received the following error message:

Traceback (most recent call last):
  File "/home/work3/user/etc/SqueezeLLM/llama.py", line 317, in <module>
    model = load_quant(
  File "/home/work3/user/etc/SqueezeLLM/llama.py", line 157, in load_quant
    state_dict = torch.load(checkpoint)
  File "/home/home/user/anaconda3/envs/sqllm/lib/python3.9/site-packages/torch/serialization.py", line 998, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/home/home/user/anaconda3/envs/sqllm/lib/python3.9/site-packages/torch/serialization.py", line 445, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/home/home/user/anaconda3/envs/sqllm/lib/python3.9/site-packages/torch/serialization.py", line 426, in __init__
    super().__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: 'sq-llama-2-7b-w3-s0.pt'

I have verified that the required dependencies, such as Python, transformers, tokenizers, and CUDA, are installed with the correct versions:

Python==3.9
tokenizers==0.13.3
transformers==4.29.0

Have you encountered a similar error before? If so, I would greatly appreciate any guidance or suggestions on how to resolve it.
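
Judging from the traceback, torch.load is resolving the --load argument relative to the current working directory, so it cannot find sq-llama-2-7b-w3-s0.pt. Assuming the checkpoint was downloaded into the model directory passed as the first argument, passing the full path should get past this error:

CUDA_VISIBLE_DEVICES=0 python llama.py ./models/sq-llama-2-7b-w3-s0/ c4 --wbits 3 --load ./models/sq-llama-2-7b-w3-s0/sq-llama-2-7b-w3-s0.pt --benchmark 128 --check --torch_profile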

Vicuna v1.3

Great work, thank you for sharing it!

I was waiting for the Vicuna v1.3 weights to come out (they were previously marked as coming soon in the readme), as it is my current model of choice (and performs significantly better than v1.1), but I see that this section has been removed from the readme.

Is there any plan to quantize that family of models?

access to quantisation code

Do you intend to release the code for creating the quantisation weights?

I would like to port this to the GPT-J model (which is open source, unlike LLaMA). But to do this I would need the code for computing the quantised weights; it seems only the inference code has been released so far.

Dense-only quantization bit precision

Hello,
I have a question that came up while reading the paper.

Looking at Table 1, the Avg. Bits for the dense-only case is listed as 4.05 rather than 4. Why is it not exactly 4 bits?

As I understand it, dense-only does not use a sparse matrix, so the weights are all integers and the precision should therefore be exactly 4 bits. Is the Avg. Bits of 4.05 due to some overhead introduced by the non-uniform quantization?

P.S. I attended your talk at the offline model-compression study meetup last August and found it very impressive! I'm curious whether you'll be attending again this year.

Thank you.

Typos in the README.md

In both Benchmarking
CUDA_VISIBLE_DEVICES=0 python llama.py <path-to-llama-7b-hf> c4 --wbits 4 --load sq-llama-7b-w3-o0.pt --benchmark 128 --check
and Perplexity Evaluation
CUDA_VISIBLE_DEVICES=0 python llama.py <path-to-llama-7b-hf> c4 --wbits 4 --load sq-llama-7b-w3-o0.pt --eval
there is a typo in the checkpoint name: it should be sq-llama-7b-w3-s0.pt, not sq-llama-7b-w3-o0.pt.

Also, the suggested flag --wbits 4 doesn't match the suggested pretrained checkpoint, which is quantized to 3 bits, so it should be --wbits 3 (see the corrected command below).
Great job!
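
With both corrections applied, the Benchmarking command would read (the Perplexity Evaluation command changes the same way):

CUDA_VISIBLE_DEVICES=0 python llama.py <path-to-llama-7b-hf> c4 --wbits 3 --load sq-llama-7b-w3-s0.pt --benchmark 128 --check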

How to save the quantized model with full weights?

After quantizing the model, I'd like to be able to save the resulting model for later use. However, PyTorch's save commands do not seem to be working properly. Is there anything I need to change in the model config?
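
For reference, the generic PyTorch pattern is sketched below. Whether it round-trips here depends on the quantized modules exposing their packed weights and lookup tables through state_dict, which is an assumption about this codebase rather than something the repository documents; the function names are placeholders.

import torch
from torch import nn

def save_quantized(model: nn.Module, path: str) -> None:
    # Save only the state dict (packed weights, lookup tables, buffers),
    # not the whole pickled module.
    torch.save(model.state_dict(), path)

def load_quantized(model: nn.Module, path: str) -> nn.Module:
    # The caller must first rebuild the same quantized architecture
    # (e.g. via load_quant in llama.py) and then load the state dict into it.
    model.load_state_dict(torch.load(path, map_location="cpu"))
    return model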

Future plan for this project

Are there any plans to standardize and support this quantization method in a broader scope?

Currently it seems that quantizing a new type of model is quite hard, as we need to customize modeling_<model>.py to extract the weight gradients.
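
As background on what that per-model customization is doing, here is a rough, model-agnostic sketch of collecting squared-gradient (Fisher-style) sensitivities over a calibration batch with plain autograd, without editing any modeling_<model>.py file. It is not the repository's actual gradient-extraction code; the model name and calibration text are placeholders.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-125m"   # placeholder; any HF causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32)
model.train()

# Accumulate squared gradients for every 2-D weight matrix as a sensitivity proxy.
grad_sq = {n: torch.zeros_like(p, device="cpu")
           for n, p in model.named_parameters()
           if p.ndim == 2 and n.endswith("weight")}

calib_texts = ["The quick brown fox jumps over the lazy dog."]   # placeholder calibration data
for text in calib_texts:
    batch = tok(text, return_tensors="pt")
    loss = model(**batch, labels=batch["input_ids"]).loss
    model.zero_grad(set_to_none=True)
    loss.backward()
    for n, p in model.named_parameters():
        if n in grad_sq and p.grad is not None:
            grad_sq[n] += p.grad.detach().cpu() ** 2

torch.save(grad_sq, "gradients.pt")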

On an A100 card, the speed-up effect does not show up

First, thanks very much for creating this cool technology.

On one A100 GPU with 80 GB of VRAM, I tried benchmarking sq-vicuna-7b-v1.3-w3-s0 and its base model. It is a bit strange that the median running time has not been reduced by much, which seems different from the speed-up results reported in your paper. Would you mind helping to trace a possible reason? Could it be related to my experiment running on a more powerful GPU?

         Median time             PPL                  Max memory (MiB)
w3-s0    0.025365471839904785    16.07021141052246    3602.3271484375
FP16     0.02616262435913086     14.921088218688965   25906.5771484375

Script:

#!/bin/bash

# vicuna v1.3 Benchmarking
CUDA_VISIBLE_DEVICES=0 python llama.py models/sq-vicuna-7b-v1.3-w3-s0 c4 --wbits 3 --load models/sq-vicuna-7b-v1.3-w3-s0/sq-vicuna-7b-v1.3-w3-s0.pt --benchmark 128 --check 

# vicuna v1.3 base
# HF naming can use cache
CUDA_VISIBLE_DEVICES=0 python llama.py lmsys/vicuna-7b-v1.3 c4 --wbits 16 --benchmark 128 --check

channel-wise quantization

Hello,
Thank you for sharing your excellent paper.
I noticed that you conducted quantization in an output channel-wise manner. Have you ever tried quantization in an input channel-wise manner? I'm curious about the reasons for choosing output channel-wise quantization instead of input channel-wise.
Thank you.
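
For readers unfamiliar with the terminology, the sketch below illustrates the difference for a weight matrix of shape [out_features, in_features]: output channel-wise fits one codebook per row, input channel-wise one per column. It is a toy illustration, not the repository's quantization code.

import numpy as np
from sklearn.cluster import KMeans

def per_channel_codebooks(W: np.ndarray, n_bits: int = 4, axis: int = 0) -> np.ndarray:
    # axis=0: one codebook per output channel (row); axis=1: per input channel (column).
    k = 2 ** n_bits
    channels = W if axis == 0 else W.T
    codebooks = []
    for channel in channels:
        km = KMeans(n_clusters=k, n_init=1).fit(channel.reshape(-1, 1))
        codebooks.append(np.sort(km.cluster_centers_.ravel()))
    return np.stack(codebooks)

W = np.random.randn(32, 64).astype(np.float32)        # toy weight matrix
print(per_channel_codebooks(W, axis=0).shape)          # (32, 16): one codebook per output channel
print(per_channel_codebooks(W, axis=1).shape)          # (64, 16): one codebook per input channel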

quantisation implementation

Hi, I want to say thank you for your great work on quantisation. However, in the current codebase I don't see a specific implementation of how the quantisation algorithm works. Could you provide it in the near future?

D+S packing in vLLM seems buggy

Hello!

I followed D+S packing instruction and stored the packed .pt file in "~/models/${model_name}-squeezellm/packed_weight", where model_name="Llama-2-7b-chat-hf". When I load this model in vLLM:

python examples/llm_engine_example.py --dtype float16 --model ~/models/${model_name}-squeezellm/packed_weight --quantization squeezellm

vLLM complained that it cannot find the parameters "sparse_threshold.model.layers.*". Any idea why? I repeated the quantization from scratch several times, but every attempt ended with this error.

As a quick workaround, I manually skip the above error in vLLM's model loading step in llama.py when the missing parameter cannot be found. However, the model then cannot generate meaningful output, so I believe the above parameters are indeed not being loaded correctly.
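
One way to narrow this down (a diagnostic suggestion, not a confirmed fix) is to check whether the sparse_threshold entries exist in the packed checkpoint at all; the file name below is a placeholder for whatever the packing step actually produced.

import torch

# Placeholder path: point this at the packed .pt file produced by the D+S packing step.
state = torch.load("packed_weight/packed_model.pt", map_location="cpu")
matches = [k for k in state.keys() if "sparse_threshold" in k]
print(f"{len(matches)} sparse_threshold entries found")
print(matches[:5])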

sample_weight is negative when running kmeans clustering

This is really nice work!

I followed the instruction to quantize Llama-2-7b-chat-hf. At kmeans clustering step, I ran the following command:

python nuq.py --bit 4 --model_type llama --model ~/models/${model_name}-squeezellm/model_chunks --gradient ~/models/${model_name}-squeezellm/gradient_chunks --output ~/models/${model_name}-squeezellm/LUT

And got this error:

Quantizing layers [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]
Quantizing layer 0
  0%|                                                                                                  | 0/7 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/mmilin/projects/ming_benchmark_vllm/SqueezeLLM/quantization/nuq.py", line 166, in <module>
    kmeans = KMeans(
  File "/home/mmilin/projects/ming_benchmark_vllm/venv/lib/python3.9/site-packages/sklearn/base.py", line 1152, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "/home/mmilin/projects/ming_benchmark_vllm/venv/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py", line 1519, in fit
    centers_init = self._init_centroids(
  File "/home/mmilin/projects/ming_benchmark_vllm/venv/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py", line 1019, in _init_centroids
    centers, _ = _kmeans_plusplus(
  File "/home/mmilin/projects/ming_benchmark_vllm/venv/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py", line 229, in _kmeans_plusplus
    center_id = random_state.choice(n_samples, p=sample_weight / sample_weight.sum())
  File "numpy/random/mtrand.pyx", line 973, in numpy.random.mtrand.RandomState.choice
ValueError: probabilities are not non-negative

I manually checked the sample_weight. It has negative elements, which is weird.
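
Since the sample weights are expected to be non-negative sensitivities (squared gradients in the paper's formulation), a quick guard like the one below (the function name is a placeholder, not part of the repository) can at least confirm which layer's gradient chunk is corrupted before KMeans.fit is called:

import numpy as np

def check_sample_weight(sample_weight: np.ndarray) -> np.ndarray:
    # Report negative or non-finite entries; these make k-means++ seeding fail
    # with "probabilities are not non-negative".
    sw = np.asarray(sample_weight, dtype=np.float64)
    bad = ~np.isfinite(sw) | (sw < 0)
    if bad.any():
        print(f"{int(bad.sum())} / {sw.size} entries are negative or non-finite "
              f"(min={sw.min():.3e}); inspect the gradient chunk for this layer")
    return sw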

finetune SqueezeLLM

Hi. First of all, thank you for sharing your great work.
I am looking for a way to fine-tune the LLaMA-7B model with SqueezeLLM.
It seems that the publicly available code in llama.py can only run benchmarking or evaluation.
Do I need a special procedure to fine-tune SqueezeLLM, or can I train it with normal PyTorch fine-tuning?

Add 65B-q3 evaluation

Hello, thanks for this paper!

When looking at the curve suggesting that 30B-q3 could be used instead of 7B fp16, I got really interested in seeing the same graphic with 65B-q3. You mention using A6000 GPUs, which have 48 GB. Would a LLaMA 65B q3 model theoretically fit there?

My reference is llama.cpp q5_1 being 46GB.

Thanks in advance!
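
A rough back-of-the-envelope estimate (illustrative numbers, not a measured result): the 3-bit weights alone for a 65B-parameter model come to roughly 23 GiB, so before activations, the KV cache, and any sparse/LUT overhead, 48 GB looks plausible on paper.

# Back-of-the-envelope: storage for the quantized weights alone,
# assuming ~65e9 parameters at an effective 3.05 bits per weight.
params = 65e9
avg_bits = 3.05
print(f"{params * avg_bits / 8 / 2**30:.1f} GiB")   # ~23.1 GiB, well under 48 GB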

Vicuna-1.5?

Hi, thanks for your amazing work
I would love to know if you have any updates on the sparse training script, or whether you are planning to quantize Vicuna 1.5, since it has massive improvements over its LLaMA 1 counterpart.
It could lead to many more quantizations of Llama 2 based chat models like WizardLM.
Thanks again.
