affinequant's Introduction

AffineQuant: Affine Transformation Quantization for Large Language Models (Link)

AffineQuant is a simple and powerful quantization technique for LLMs.

[Overview figure]

Contents

  • Install
  • Model Zoo
  • Usage
  • Results
  • Related Projects
  • Citation

Install

conda create -n affinequant python=3.10 -y
conda activate affinequant
git clone https://github.com/bytedance/AffineQuant.git
cd AffineQuant
pip install --upgrade pip 
pip install -e .

We also leverage the kernel from AutoGPTQ to achieve real quantization, so you should also install the bug-fixed AutoGPTQ as follows:

pip install auto_gptq
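
After installation, a quick sanity check of the environment (a minimal sketch; it only verifies that the core dependencies import and that a GPU is visible):

# Minimal environment check after installation.
import torch
import transformers
import auto_gptq  # assuming the installed package exposes __version__, as recent releases do

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)
print("auto_gptq:", auto_gptq.__version__)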

Model Zoo

Coming Soon.

Usage

We provide full scripts to run AffineQuant in ./scripts/. We use LLaMA-7B as an example here:

  1. Obtain the channel-wise scales and shifts required for initialization:

Optionally, we also offer a script so that you can generate the channel-wise scales and shifts yourself:

python generate_act_scale_shift.py --model /PATH/TO/LLaMA/llama-7b
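
Conceptually, this step runs a few calibration samples through the FP model and records per-channel statistics of the inputs to every Linear layer. A rough sketch of that idea (not the repo's exact generate_act_scale_shift.py; the model path is the same placeholder as above and the calibration text is a stand-in for real calibration data):

# Sketch: collect per-channel absmax (scale) and mean (shift) of Linear-layer inputs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/PATH/TO/LLaMA/llama-7b"                 # placeholder path, as in the command above
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=dtype).to(device).eval()
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)

act_scales, act_shifts = {}, {}

def make_hook(name):
    def hook(module, inputs, output):
        x = inputs[0].detach().flatten(0, -2).float()   # [tokens, in_features]
        amax = x.abs().max(dim=0).values                # per-channel max magnitude -> scale
        mean = x.mean(dim=0)                            # per-channel mean -> shift
        act_scales[name] = amax if name not in act_scales else torch.maximum(act_scales[name], amax)
        act_shifts[name] = mean if name not in act_shifts else 0.5 * (act_shifts[name] + mean)
    return hook

hooks = [m.register_forward_hook(make_hook(n))
         for n, m in model.named_modules() if isinstance(m, torch.nn.Linear)]

calib_texts = ["Calibration sample goes here."]          # replace with real calibration data (e.g. WikiText-2)
with torch.no_grad():
    for text in calib_texts:
        model(**tokenizer(text, return_tensors="pt").to(device))

for h in hooks:
    h.remove()
torch.save({"scales": act_scales, "shifts": act_shifts}, "./act_scales_shifts.pt")
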
  2. Weight-only quantization
# W3A16
CUDA_VISIBLE_DEVICES=0 python main.py \
--model /PATH/TO/LLaMA/llama-7b  \
--epochs 20 --output_dir ./log/llama-7b-w3a16 \
--eval_ppl --wbits 3 --abits 16 --lwc --let --use_ln_matrix --sf 1e-2

# W3A16g128
CUDA_VISIBLE_DEVICES=0 python main.py \
--model /PATH/TO/LLaMA/llama-7b  \
--epochs 20 --output_dir ./log/llama-7b-w3a16g128 \
--eval_ppl --wbits 3 --abits 16 --group_size 128 --lwc --let --use_ln_matrix --sf 1e-2
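
The g128 suffix means each row of a weight matrix is quantized in contiguous groups of 128 values, each group with its own scale and zero point. A naive sketch of that grouping (random weight and a plain MinMax 3-bit quantizer; AffineQuant additionally learns clipping and affine parameters, which this sketch omits):

# Sketch: group-wise 3-bit weight quantization with group size 128.
import torch

def fake_quant_groupwise(w, n_bits=3, group_size=128):
    # Split each row into groups of `group_size` and quantize each group independently.
    out_features, in_features = w.shape
    g = w.reshape(out_features, in_features // group_size, group_size)
    qmax = 2 ** n_bits - 1
    gmin = g.amin(dim=-1, keepdim=True)
    gmax = g.amax(dim=-1, keepdim=True)
    scale = (gmax - gmin).clamp(min=1e-8) / qmax
    zero = (-gmin / scale).round()
    q = (g / scale + zero).round().clamp(0, qmax)
    return ((q - zero) * scale).reshape(out_features, in_features)

torch.manual_seed(0)
W = torch.randn(512, 512) * 0.02          # weight of a Linear layer, [out_features, in_features]
W3g128 = fake_quant_groupwise(W, n_bits=3, group_size=128)
print("mean abs quantization error:", (W - W3g128).abs().mean().item())
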
  3. Weight-activation quantization
# W4A4
CUDA_VISIBLE_DEVICES=0 python main.py \
--model /PATH/TO/LLaMA/llama-7b  \
--epochs 20 --output_dir ./log/llama-7b-w4a4 \
--eval_ppl --wbits 4 --abits 4 --lwc --let --aug_loss --use_matrix --sf 0.1 \
--tasks hendrycksTest,piqa,arc_easy,arc_challenge,boolq,hellaswag,winogrande
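
For intuition, --wbits 4 --abits 4 corresponds to simulated (fake) 4-bit quantization of both the weight and the input activations of each Linear layer during calibration. A naive MinMax sketch of that fake-quant step on random tensors (per-channel weights, per-token activations; the actual pipeline also uses the learned clipping from --lwc and the learned transforms from --let, which this sketch omits):

# Sketch: simulated W4A4 quantization of one Linear layer.
import torch

def fake_quant(x, n_bits, dim):
    # Asymmetric uniform quantize/dequantize along `dim` (naive MinMax, no learned clipping).
    qmax = 2 ** n_bits - 1
    xmin = x.amin(dim=dim, keepdim=True)
    xmax = x.amax(dim=dim, keepdim=True)
    scale = (xmax - xmin).clamp(min=1e-8) / qmax
    zero = (-xmin / scale).round()
    return ((x / scale + zero).round().clamp(0, qmax) - zero) * scale

torch.manual_seed(0)
X = torch.randn(8, 512)               # 8 tokens of activations
W = torch.randn(512, 512) * 0.02      # weight of a Linear layer, [out_features, in_features]

Xq = fake_quant(X, n_bits=4, dim=-1)  # --abits 4: per-token activation quantization
Wq = fake_quant(W, n_bits=4, dim=-1)  # --wbits 4: per-channel weight quantization

mse = (X @ W.t() - Xq @ Wq.t()).pow(2).mean()
print("W4A4 reconstruction MSE:", mse.item())
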

More detailed and optional arguments:

  • --model: the local model path or a Hugging Face hub model ID.
  • --wbits: weight quantization bits.
  • --abits: activation quantization bits.
  • --group_size: group size for weight quantization. If not set, per-channel weight quantization is used by default.
  • --epochs: training epochs. You can set it to 0 to evaluate pre-trained AffineQuant checkpoints.
  • --nsamples: number of calibration samples, 128 as default.
  • --eval_ppl: evaluating the perplexity of quantized models.
  • --tasks: evaluating zero-shot tasks.
  • --resume: loading pre-trained AffineQuant parameters.
  • --multigpu: run inference of larger networks on multiple GPUs.
  • --real_quant: real quantization, which reduces memory usage.
  • --save_dir: directory for saving the quantized model for further exploration.
  • --use_matrix: whether to use the QK^T affine matrix (see the sketch after this list).
  • --use_ln_matrix: whether to use the LayerNorm affine matrix.
  • --sf: stability factor for the gradual mask.
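
The --use_matrix and --use_ln_matrix options, referenced above, attach a learnable invertible matrix A to the QK^T computation or to the LayerNorm outputs. The key identity is that X W = (X A^{-1})(A W) in full precision, so A changes what the quantizer sees without changing the FP model. A toy illustration of the identity and of the resulting weight-quantization error (random matrices and a naive 4-bit quantizer; not the paper's learned A or its exact objective):

# Sketch: the affine-transformation identity behind --let / --use_matrix / --use_ln_matrix.
import torch

def fake_quant(w, n_bits=4):
    # Naive asymmetric per-row quantize/dequantize, a stand-in for the real quantizer.
    qmax = 2 ** n_bits - 1
    wmin = w.amin(dim=-1, keepdim=True)
    wmax = w.amax(dim=-1, keepdim=True)
    scale = (wmax - wmin).clamp(min=1e-8) / qmax
    zero = (-wmin / scale).round()
    return ((w / scale + zero).round().clamp(0, qmax) - zero) * scale

torch.manual_seed(0)
X = torch.randn(8, 256)                             # activations: 8 tokens, 256 channels
W = torch.randn(256, 256) * 0.02                    # weight, used here as X @ W for simplicity
A = torch.eye(256) + 0.01 * torch.randn(256, 256)   # stand-in for the learned invertible matrix
A_inv = torch.linalg.inv(A)

# The identity holds exactly in full precision (up to floating-point error):
print(torch.allclose(X @ W, (X @ A_inv) @ (A @ W), atol=1e-4))

# Error of quantizing W directly vs. quantizing the transformed weight A @ W:
err_plain  = (X @ W - X @ fake_quant(W)).pow(2).mean()
err_affine = (X @ W - (X @ A_inv) @ fake_quant(A @ W)).pow(2).mean()
print("quantize W:", err_plain.item(), "| quantize A @ W:", err_affine.item())

In AffineQuant, A is optimized to minimize this block-wise error and is then merged into the preceding LayerNorm or the QK^T path, so inference incurs no extra matrices.
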

Results

  • AffineQuant achieves SoTA performance in weight-only quantization.
  • AffineQuant achieves SoTA performance in weight-activation quantization.

Related Projects

SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

GPTQ: Accurate Post-training Compression for Generative Pretrained Transformers

RPTQ: Reorder-Based Post-Training Quantization for Large Language Models

OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models

MLC LLM

AutoGPTQ

Citation

@inproceedings{
ma2024affinequant,
title={AffineQuant: Affine Transformation Quantization for Large Language Models},
author={Yuexiao Ma and Huixia Li and Xiawu Zheng and Feng Ling and Xuefeng Xiao and Rui Wang and Shilei Wen and Fei Chao and Rongrong Ji},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024},
url={https://openreview.net/forum?id=of2rhALq8l}
}

affinequant's People

Contributors

bobma-bytedance


Forkers

loken14

affinequant's Issues

Why doesn't the optimization goal include Quantized X

Hi there, I am a rookie in model quantization. I've read your paper and the results are impressive! However, I couldn't help but wonder: in your optimization problem, you use the formula below:
[screenshot of the paper's optimization objective, in which only the transformed weight AW is quantized]

Why not use Q(XA^(-1))Q(AW) as part of the optimization goal? Wouldn't it help if the quantization error on X were also taken into account?

I hope my silly question won't bother you 😊

Best regards.

Loss is Nan

Thanks for your work. The loss becomes NaN during the training stage when following the Usage steps in the README. The error log:

`[2024-07-25 11:44:08 root](affinequant.py 159): INFO === Start quantize layer 3 ===
[2024-07-25 11:44:48 root](affinequant.py 282): INFO layer 3 iter 0 loss:0.012547658756375313 norm:0.005398863460868597 max memory_allocated 25510.10009765625
[2024-07-25 11:45:24 root](affinequant.py 282): INFO layer 3 iter 1 loss:0.010688988491892815 norm:0.002033534459769726 max memory_allocated 25510.10009765625
[2024-07-25 11:46:00 root](affinequant.py 282): INFO layer 3 iter 2 loss:0.010013626888394356 norm:0.0012934970436617732 max memory_allocated 25510.10009765625
[2024-07-25 11:46:35 root](affinequant.py 282): INFO layer 3 iter 3 loss:0.009677225723862648 norm:0.0008282792987301946 max memory_allocated 25510.10009765625
[2024-07-25 11:47:11 root](affinequant.py 282): INFO layer 3 iter 4 loss:0.009517742320895195 norm:0.00047518432256765664 max memory_allocated 25510.10009765625
[2024-07-25 11:47:47 root](affinequant.py 282): INFO layer 3 iter 5 loss:0.009451627731323242 norm:0.00032417610054835677 max memory_allocated 25510.10009765625
[2024-07-25 11:48:23 root](affinequant.py 282): INFO layer 3 iter 6 loss:0.009420475922524929 norm:0.0002503570867702365 max memory_allocated 25510.10009765625
[2024-07-25 11:48:58 root](affinequant.py 282): INFO layer 3 iter 7 loss:0.009402991272509098 norm:0.0002355567121412605 max memory_allocated 25510.10009765625
[2024-07-25 11:49:34 root](affinequant.py 282): INFO layer 3 iter 8 loss:0.009390492923557758 norm:0.00022407056530937552 max memory_allocated 25510.10009765625
[2024-07-25 11:50:10 root](affinequant.py 282): INFO layer 3 iter 9 loss:0.00937829539179802 norm:0.00022252841154113412 max memory_allocated 25510.10009765625
[2024-07-25 11:50:47 root](affinequant.py 282): INFO layer 3 iter 10 loss:0.00936979427933693 norm:0.00040410575456917286 max memory_allocated 25510.10009765625
[2024-07-25 11:51:23 root](affinequant.py 282): INFO layer 3 iter 11 loss:0.009373501874506474 norm:0.0007587362779304385 max memory_allocated 25510.10009765625
[2024-07-25 11:51:59 root](affinequant.py 282): INFO layer 3 iter 12 loss:0.009358054026961327 norm:0.0008263972704298794 max memory_allocated 25510.10009765625
[2024-07-25 11:52:35 root](affinequant.py 282): INFO layer 3 iter 13 loss:0.009706217795610428 norm:0.004431357607245445 max memory_allocated 25510.10009765625
[2024-07-25 11:53:12 root](affinequant.py 282): INFO layer 3 iter 14 loss:0.009572970680892467 norm:0.001220526173710823 max memory_allocated 25510.10009765625
[2024-07-25 11:53:48 root](affinequant.py 282): INFO layer 3 iter 15 loss:0.009499846026301384 norm:0.0015964835183694959 max memory_allocated 25510.10009765625
[2024-07-25 11:54:25 root](affinequant.py 282): INFO layer 3 iter 16 loss:0.009589510038495064 norm:0.0031670858152210712 max memory_allocated 25510.10009765625
[2024-07-25 11:55:02 root](affinequant.py 282): INFO layer 3 iter 17 loss:0.009425442665815353 norm:0.0016205854481086135 max memory_allocated 25510.10009765625
[2024-07-25 11:55:14 root](affinequant.py 272): INFO Loss is NAN, stopping training

/data/users/wangh/code/AffineQuant/quantize/affinequant.py(275)affinequant()
-> loss_list.append(loss.data)
(Pdb) `

This is my bash command; how can I fix it?
CUDA_VISIBLE_DEVICES=0 python main.py \
--model /data2/models/Llama-2/llama-2-7b-hf \
--epochs 20 --output_dir ./log/llama-7b-w4a4 \
--eval_ppl --wbits 4 --abits 4 --lwc --let --aug_loss --use_matrix --sf 0.1 \
--tasks hendrycksTest,piqa,arc_easy,arc_challenge,boolq,hellaswag,winogrande
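
For context, the "Loss is NAN, stopping training" message in the log above is the usual kind of guard that aborts block-wise calibration once the loss stops being finite. A generic sketch of such a check (not the repo's code; the loss function and parameters are hypothetical):

# Generic NaN guard inside a calibration loop (illustrative only).
import torch

def calibrate_block(block_loss_fn, params, iters, lr=5e-3):
    optimizer = torch.optim.AdamW(params, lr=lr)
    for it in range(iters):
        loss = block_loss_fn()
        if not torch.isfinite(loss):          # catches NaN and Inf
            print(f"iter {it}: loss is NaN/Inf, stopping training")
            break
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)  # clipping helps stability
        optimizer.step()

# Tiny usage with a dummy loss that eventually becomes NaN:
w = torch.nn.Parameter(torch.tensor(1.0))
state = {"it": 0}
def dummy_loss():
    state["it"] += 1
    return w * (float("nan") if state["it"] > 3 else 1.0)
calibrate_block(dummy_loss, [w], iters=10)
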
