affinequant's Introduction

AffineQuant: Affine Transformation Quantization for Large Language Models (Link)

AffineQuant is a simple and powerful quantization technique for LLMs.

[Overview figure]

Contents

  • Install
  • Model Zoo
  • Usage
  • Results
  • Related Projects
  • Citation

Install

conda create -n affinequant python=3.10 -y
conda activate affinequant
git clone https://github.com/bytedance/AffineQuant.git
cd AffineQuant
pip install --upgrade pip 
pip install -e .

We also leverage the kernel from AutoGPTQ to achieve real quantization, so you should also install the bug-fixed AutoGPTQ as follows:

pip install auto_gptq
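
After installation, a quick sanity check of the environment (a minimal sketch; it only verifies that the core dependencies import and that a GPU is visible):

# Minimal environment check after installation.
import torch
import transformers
import auto_gptq  # assuming the installed package exposes __version__, as recent releases do

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)
print("auto_gptq:", auto_gptq.__version__)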

Model Zoo

Coming Soon.

Usage

We provide full scripts to run AffineQuant in ./scripts/. We use LLaMA-7B as an example here:

  1. Obtain the channel-wise scales and shifts required for initialization:

Optionally, we also offer a script so that you can generate the channel-wise scales and shifts yourself:

python generate_act_scale_shift.py --model /PATH/TO/LLaMA/llama-7b
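
Conceptually, this step runs a few calibration samples through the FP model and records per-channel statistics of the inputs to every Linear layer. A rough sketch of that idea (not the repo's exact generate_act_scale_shift.py; the model path is the same placeholder as above and the calibration text is a stand-in for real calibration data):

# Sketch: collect per-channel absmax (scale) and mean (shift) of Linear-layer inputs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/PATH/TO/LLaMA/llama-7b"                 # placeholder path, as in the command above
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=dtype).to(device).eval()
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)

act_scales, act_shifts = {}, {}

def make_hook(name):
    def hook(module, inputs, output):
        x = inputs[0].detach().flatten(0, -2).float()   # [tokens, in_features]
        amax = x.abs().max(dim=0).values                # per-channel max magnitude -> scale
        mean = x.mean(dim=0)                            # per-channel mean -> shift
        act_scales[name] = amax if name not in act_scales else torch.maximum(act_scales[name], amax)
        act_shifts[name] = mean if name not in act_shifts else 0.5 * (act_shifts[name] + mean)
    return hook

hooks = [m.register_forward_hook(make_hook(n))
         for n, m in model.named_modules() if isinstance(m, torch.nn.Linear)]

calib_texts = ["Calibration sample goes here."]          # replace with real calibration data (e.g. WikiText-2)
with torch.no_grad():
    for text in calib_texts:
        model(**tokenizer(text, return_tensors="pt").to(device))

for h in hooks:
    h.remove()
torch.save({"scales": act_scales, "shifts": act_shifts}, "./act_scales_shifts.pt")
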
  2. Weight-only quantization
# W3A16
CUDA_VISIBLE_DEVICES=0 python main.py \
--model /PATH/TO/LLaMA/llama-7b  \
--epochs 20 --output_dir ./log/llama-7b-w3a16 \
--eval_ppl --wbits 3 --abits 16 --lwc --let --use_ln_matrix --sf 1e-2

# W3A16g128
CUDA_VISIBLE_DEVICES=0 python main.py \
--model /PATH/TO/LLaMA/llama-7b  \
--epochs 20 --output_dir ./log/llama-7b-w3a16g128 \
--eval_ppl --wbits 3 --abits 16 --group_size 128 --lwc --let --use_ln_matrix --sf 1e-2
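
The g128 suffix means each row of a weight matrix is quantized in contiguous groups of 128 values, each group with its own scale and zero point. A naive sketch of that grouping (random weight and a plain MinMax 3-bit quantizer; AffineQuant additionally learns clipping and affine parameters, which this sketch omits):

# Sketch: group-wise 3-bit weight quantization with group size 128.
import torch

def fake_quant_groupwise(w, n_bits=3, group_size=128):
    # Split each row into groups of `group_size` and quantize each group independently.
    out_features, in_features = w.shape
    g = w.reshape(out_features, in_features // group_size, group_size)
    qmax = 2 ** n_bits - 1
    gmin = g.amin(dim=-1, keepdim=True)
    gmax = g.amax(dim=-1, keepdim=True)
    scale = (gmax - gmin).clamp(min=1e-8) / qmax
    zero = (-gmin / scale).round()
    q = (g / scale + zero).round().clamp(0, qmax)
    return ((q - zero) * scale).reshape(out_features, in_features)

torch.manual_seed(0)
W = torch.randn(512, 512) * 0.02          # weight of a Linear layer, [out_features, in_features]
W3g128 = fake_quant_groupwise(W, n_bits=3, group_size=128)
print("mean abs quantization error:", (W - W3g128).abs().mean().item())
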
  3. Weight-activation quantization
# W4A4
CUDA_VISIBLE_DEVICES=0 python main.py \
--model /PATH/TO/LLaMA/llama-7b  \
--epochs 20 --output_dir ./log/llama-7b-w4a4 \
--eval_ppl --wbits 4 --abits 4 --lwc --let --aug_loss --use_matrix --sf 0.1 \
--tasks hendrycksTest,piqa,arc_easy,arc_challenge,boolq,hellaswag,winogrande
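
For intuition, --wbits 4 --abits 4 corresponds to simulated (fake) 4-bit quantization of both the weight and the input activations of each Linear layer during calibration. A naive MinMax sketch of that fake-quant step on random tensors (per-channel weights, per-token activations; the actual pipeline also uses the learned clipping from --lwc and the learned transforms from --let, which this sketch omits):

# Sketch: simulated W4A4 quantization of one Linear layer.
import torch

def fake_quant(x, n_bits, dim):
    # Asymmetric uniform quantize/dequantize along `dim` (naive MinMax, no learned clipping).
    qmax = 2 ** n_bits - 1
    xmin = x.amin(dim=dim, keepdim=True)
    xmax = x.amax(dim=dim, keepdim=True)
    scale = (xmax - xmin).clamp(min=1e-8) / qmax
    zero = (-xmin / scale).round()
    return ((x / scale + zero).round().clamp(0, qmax) - zero) * scale

torch.manual_seed(0)
X = torch.randn(8, 512)               # 8 tokens of activations
W = torch.randn(512, 512) * 0.02      # weight of a Linear layer, [out_features, in_features]

Xq = fake_quant(X, n_bits=4, dim=-1)  # --abits 4: per-token activation quantization
Wq = fake_quant(W, n_bits=4, dim=-1)  # --wbits 4: per-channel weight quantization

mse = (X @ W.t() - Xq @ Wq.t()).pow(2).mean()
print("W4A4 reconstruction MSE:", mse.item())
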

More detailed and optional arguments:

  • --model: the local model path or a Hugging Face hub model ID.
  • --wbits: weight quantization bits.
  • --abits: activation quantization bits.
  • --group_size: group size for weight quantization. If not set, per-channel weight quantization is used by default.
  • --epochs: training epochs. You can set it to 0 to evaluate pre-trained AffineQuant checkpoints.
  • --nsamples: number of calibration samples, 128 as default.
  • --eval_ppl: evaluating the perplexity of quantized models.
  • --tasks: evaluating zero-shot tasks.
  • --resume: loading pre-trained AffineQuant parameters.
  • --multigpu: run inference of larger networks on multiple GPUs.
  • --real_quant: real quantization, which reduces memory usage.
  • --save_dir: directory for saving the quantized model for further exploration.
  • --use_matrix: whether to use the QK^T affine matrix (see the sketch after this list).
  • --use_ln_matrix: whether to use the LayerNorm affine matrix.
  • --sf: stability factor for the gradual mask.
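
The --use_matrix and --use_ln_matrix options, referenced above, attach a learnable invertible matrix A to the QK^T computation or to the LayerNorm outputs. The key identity is that X W = (X A^{-1})(A W) in full precision, so A changes what the quantizer sees without changing the FP model. A toy illustration of the identity and of the resulting weight-quantization error (random matrices and a naive 4-bit quantizer; not the paper's learned A or its exact objective):

# Sketch: the affine-transformation identity behind --let / --use_matrix / --use_ln_matrix.
import torch

def fake_quant(w, n_bits=4):
    # Naive asymmetric per-row quantize/dequantize, a stand-in for the real quantizer.
    qmax = 2 ** n_bits - 1
    wmin = w.amin(dim=-1, keepdim=True)
    wmax = w.amax(dim=-1, keepdim=True)
    scale = (wmax - wmin).clamp(min=1e-8) / qmax
    zero = (-wmin / scale).round()
    return ((w / scale + zero).round().clamp(0, qmax) - zero) * scale

torch.manual_seed(0)
X = torch.randn(8, 256)                             # activations: 8 tokens, 256 channels
W = torch.randn(256, 256) * 0.02                    # weight, used here as X @ W for simplicity
A = torch.eye(256) + 0.01 * torch.randn(256, 256)   # stand-in for the learned invertible matrix
A_inv = torch.linalg.inv(A)

# The identity holds exactly in full precision (up to floating-point error):
print(torch.allclose(X @ W, (X @ A_inv) @ (A @ W), atol=1e-4))

# Error of quantizing W directly vs. quantizing the transformed weight A @ W:
err_plain  = (X @ W - X @ fake_quant(W)).pow(2).mean()
err_affine = (X @ W - (X @ A_inv) @ fake_quant(A @ W)).pow(2).mean()
print("quantize W:", err_plain.item(), "| quantize A @ W:", err_affine.item())

In AffineQuant, A is optimized to minimize this block-wise error and is then merged into the preceding LayerNorm or the QK^T path, so inference incurs no extra matrices.
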

Results

  • AffineQuant achieves SoTA performance in weight-only quantization.
  • AffineQuant achieves SoTA performance in weight-activation quantization.

Related Projects

SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

GPTQ: Accurate Post-training Compression for Generative Pretrained Transformers

RPTQ: Reorder-Based Post-Training Quantization for Large Language Models

OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models

MLC LLM

AutoGPTQ

Citation

@inproceedings{
ma2024affinequant,
title={AffineQuant: Affine Transformation Quantization for Large Language Models},
author={Yuexiao Ma and Huixia Li and Xiawu Zheng and Feng Ling and Xuefeng Xiao and Rui Wang and Shilei Wen and Fei Chao and Rongrong Ji},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024},
url={https://openreview.net/forum?id=of2rhALq8l}
}

affinequant's People

Contributors

bobma-bytedance


Forkers

loken14

affinequant's Issues

Why doesn't the optimization goal include Quantized X

Hi there, I am a rookie in model quantization. I've read your paper and the results are impressive! However, I couldn't help but wonder: in your optimization problem, you use the formula below:
[screenshot of the paper's optimization objective, in which only the transformed weight AW is quantized]

Why not use Q(XA^(-1))Q(AW) as part of the optimization goal? Wouldn't it help if the quantization error on X were also taken into account?

I hope my silly question won't bother you 😊

Best regards.

Loss is Nan

Thanks for your work. The loss becomes NaN during the training stage when following the Usage steps in the README. The error log:

`[2024-07-25 11:44:08 root](affinequant.py 159): INFO === Start quantize layer 3 ===
[2024-07-25 11:44:48 root](affinequant.py 282): INFO layer 3 iter 0 loss:0.012547658756375313 norm:0.005398863460868597 max memory_allocated 25510.10009765625
[2024-07-25 11:45:24 root](affinequant.py 282): INFO layer 3 iter 1 loss:0.010688988491892815 norm:0.002033534459769726 max memory_allocated 25510.10009765625
[2024-07-25 11:46:00 root](affinequant.py 282): INFO layer 3 iter 2 loss:0.010013626888394356 norm:0.0012934970436617732 max memory_allocated 25510.10009765625
[2024-07-25 11:46:35 root](affinequant.py 282): INFO layer 3 iter 3 loss:0.009677225723862648 norm:0.0008282792987301946 max memory_allocated 25510.10009765625
[2024-07-25 11:47:11 root](affinequant.py 282): INFO layer 3 iter 4 loss:0.009517742320895195 norm:0.00047518432256765664 max memory_allocated 25510.10009765625
[2024-07-25 11:47:47 root](affinequant.py 282): INFO layer 3 iter 5 loss:0.009451627731323242 norm:0.00032417610054835677 max memory_allocated 25510.10009765625
[2024-07-25 11:48:23 root](affinequant.py 282): INFO layer 3 iter 6 loss:0.009420475922524929 norm:0.0002503570867702365 max memory_allocated 25510.10009765625
[2024-07-25 11:48:58 root](affinequant.py 282): INFO layer 3 iter 7 loss:0.009402991272509098 norm:0.0002355567121412605 max memory_allocated 25510.10009765625
[2024-07-25 11:49:34 root](affinequant.py 282): INFO layer 3 iter 8 loss:0.009390492923557758 norm:0.00022407056530937552 max memory_allocated 25510.10009765625
[2024-07-25 11:50:10 root](affinequant.py 282): INFO layer 3 iter 9 loss:0.00937829539179802 norm:0.00022252841154113412 max memory_allocated 25510.10009765625
[2024-07-25 11:50:47 root](affinequant.py 282): INFO layer 3 iter 10 loss:0.00936979427933693 norm:0.00040410575456917286 max memory_allocated 25510.10009765625
[2024-07-25 11:51:23 root](affinequant.py 282): INFO layer 3 iter 11 loss:0.009373501874506474 norm:0.0007587362779304385 max memory_allocated 25510.10009765625
[2024-07-25 11:51:59 root](affinequant.py 282): INFO layer 3 iter 12 loss:0.009358054026961327 norm:0.0008263972704298794 max memory_allocated 25510.10009765625
[2024-07-25 11:52:35 root](affinequant.py 282): INFO layer 3 iter 13 loss:0.009706217795610428 norm:0.004431357607245445 max memory_allocated 25510.10009765625
[2024-07-25 11:53:12 root](affinequant.py 282): INFO layer 3 iter 14 loss:0.009572970680892467 norm:0.001220526173710823 max memory_allocated 25510.10009765625
[2024-07-25 11:53:48 root](affinequant.py 282): INFO layer 3 iter 15 loss:0.009499846026301384 norm:0.0015964835183694959 max memory_allocated 25510.10009765625
[2024-07-25 11:54:25 root](affinequant.py 282): INFO layer 3 iter 16 loss:0.009589510038495064 norm:0.0031670858152210712 max memory_allocated 25510.10009765625
[2024-07-25 11:55:02 root](affinequant.py 282): INFO layer 3 iter 17 loss:0.009425442665815353 norm:0.0016205854481086135 max memory_allocated 25510.10009765625
[2024-07-25 11:55:14 root](affinequant.py 272): INFO Loss is NAN, stopping training

/data/users/wangh/code/AffineQuant/quantize/affinequant.py(275)affinequant()
-> loss_list.append(loss.data)
(Pdb) `

This is my bash command; how can I fix it?
CUDA_VISIBLE_DEVICES=0 python main.py \
--model /data2/models/Llama-2/llama-2-7b-hf \
--epochs 20 --output_dir ./log/llama-7b-w4a4 \
--eval_ppl --wbits 4 --abits 4 --lwc --let --aug_loss --use_matrix --sf 0.1 \
--tasks hendrycksTest,piqa,arc_easy,arc_challenge,boolq,hellaswag,winogrande
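
For context, the "Loss is NAN, stopping training" message in the log above is the usual kind of guard that aborts block-wise calibration once the loss stops being finite. A generic sketch of such a check (not the repo's code; the loss function and parameters are hypothetical):

# Generic NaN guard inside a calibration loop (illustrative only).
import torch

def calibrate_block(block_loss_fn, params, iters, lr=5e-3):
    optimizer = torch.optim.AdamW(params, lr=lr)
    for it in range(iters):
        loss = block_loss_fn()
        if not torch.isfinite(loss):          # catches NaN and Inf
            print(f"iter {it}: loss is NaN/Inf, stopping training")
            break
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)  # clipping helps stability
        optimizer.step()

# Tiny usage with a dummy loss that eventually becomes NaN:
w = torch.nn.Parameter(torch.tensor(1.0))
state = {"it": 0}
def dummy_loss():
    state["it"] += 1
    return w * (float("nan") if state["it"] > 3 else 1.0)
calibrate_block(dummy_loss, [w], iters=10)
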
