
ist-daslab / gptq

Code for the ICLR 2023 paper "GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers".

Home Page: https://arxiv.org/abs/2210.17323

License: Apache License 2.0

Languages: Python 95.72%, C++ 0.51%, CUDA 3.77%

gptq's People

Contributors

bofeng2477, dalistarh, efrantar, sashkboos, xiuyu-li

gptq's Issues

Reproduction of the results in the paper

@efrantar
Following the instructions in README.md, baseline and RTN perplexities match exactly as listed in Tables 2-3 in the paper.
However, GPTQ perplexity does not.

Is this due to differences in the calibration samples? Or are the results in the tables statistics over multiple runs with different random seeds?
Could you share the command that reproduces the results in the paper?

Much appreciated!

H_inv not updated

After each quantization step, H_inv should be updated, but in the fasterquant code H_inv is never updated. Is this a bug?
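
For reference, the paper's Cholesky reformulation is what removes the per-step update: the rows of the Cholesky factor of the initial H^-1 already contain the information that the sequential OBQ-style updates would recompute. A small numerical check of this identity (a sketch only; a random SPD matrix stands in for the layer Hessian):

import torch

torch.manual_seed(0)
n = 8
X = torch.randn(32, n, dtype=torch.float64)
H = X.T @ X + 1e-3 * torch.eye(n, dtype=torch.float64)   # SPD stand-in for the layer Hessian

Hinv = torch.linalg.inv(H)
U = torch.linalg.cholesky(Hinv, upper=True)              # computed once, never updated

# OBQ-style reference: eliminate one column at a time and read off the
# inverse-Hessian row that the weight-update rule would need at each step.
Hinv_seq = Hinv.clone()
for q in range(n):
    row = Hinv_seq[0, :] / Hinv_seq[0, 0].sqrt()
    assert torch.allclose(row, U[q, q:], atol=1e-8)      # same information as the fixed factor
    # Schur-complement update that the OBQ formulation would apply explicitly
    Hinv_seq = (Hinv_seq - torch.outer(Hinv_seq[:, 0], Hinv_seq[0, :]) / Hinv_seq[0, 0])[1:, 1:]
print("rows of the fixed Cholesky factor match the sequentially updated H_inv")

Since every row the updates would produce can be read directly from U, the code never needs to touch H_inv after the initial decomposition.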

How to run on multiple GPUs?

I'm trying to run opt-30b on 4x 2080 Ti GPUs; however, the following error message appears when loading parameters.

Starting ...
Ready.
Traceback (most recent call last):
  File "opt.py", line 424, in <module>
    quantizers = opt_sequential(model, dataloader, DEV)
  File "/home/cciip/miniconda3/envs/int/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "opt.py", line 83, in opt_sequential
    gptq[name] = GPTQ(subset[name])
  File "/home/cciip/private/tianjie/gptq/gptq.py", line 29, in __init__
    self.H = torch.zeros((self.columns, self.columns), device=self.dev)
RuntimeError: CUDA out of memory. Tried to allocate 196.00 MiB (GPU 0; 10.75 GiB total capacity; 9.30 GiB already allocated; 77.62 MiB free; 9.32 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

How can I make it work?

Regarding the method for computing the Hessian matrix.

I would like to ask about line 61 in your gptq.py file: inp = math.sqrt(2 / self.nsamples) * inp.float(). According to the paper, it seems it should instead be: inp = math.sqrt(tmp / self.nsamples) * inp.float(). After making this modification, I noticed a reduction in quantization error. Could you please verify whether my understanding is correct, or whether there is a misunderstanding on my part?
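
For context, here is a small sketch (not the repo verbatim) of what an add_batch-style running update accumulates; the sqrt(2 / nsamples) factor, combined with the rescaling of the previously accumulated H, makes the stream converge to H = (2/N) * sum_i x_i x_i^T over all N calibration rows:

import math
import torch

torch.manual_seed(0)
cols, N, batch = 16, 64, 8
X = torch.randn(N, cols, dtype=torch.float64)

H = torch.zeros(cols, cols, dtype=torch.float64)
nsamples = 0
for i in range(0, N, batch):
    inp = X[i:i + batch].T                    # (cols, batch), like the input of a Linear layer
    tmp = inp.shape[1]
    H *= nsamples / (nsamples + tmp)          # rescale what was accumulated so far
    nsamples += tmp
    inp = math.sqrt(2 / nsamples) * inp
    H += inp @ inp.T

H_ref = 2.0 / N * X.T @ X                     # (2/N) * sum_i x_i x_i^T
print(torch.allclose(H, H_ref, atol=1e-10))   # True

Whether the constant should be 2 or tmp is exactly the question above; the sketch only shows what the current code computes.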

Can --save work with --groupsize in opt.py?

Hello there, nice work!

If I understand correctly, when groupsize is set above 0, the quantizer in the gptq module of opt.py is only responsible for one group at a time. opt_pack3 relies on the quantizer's pack function, which only holds the zeros and scales of the last group when groupsize is set above 0.

So can --save and --groupsize work together in opt.py right now?

Why are PPL so low on PTB?

Hello

Many thanks for your work, it's great to (finally) see results reported on openly available LLMs 😊
However, I was surprised when I saw perplexities on PTB for OPT and BLOOM models: 10.33 and 13.63 respectively.
Indeed, the GPT-3 paper reports a PPL of 20.50 on this dataset, and I was wondering whether you have any explanation for this (nearly 2x) difference?

Thanks!

How to adopt GPTQ on Conv2d with `groups` attribute?

Hi,

Thanks for your impressive work! It really helps me quantize lots of large models.
Recently, I tried to implement GPTQ on grouped Conv2d layers, but the results do not seem good.
Could you provide some hints to support GPTQ on grouped Conv2d?

Here is my rough implementation for now (a sketch of step 1 follows after the list):

  1. In the add_batch function, divide inp into the different groups and store a Hessian for each group.
  2. In the fasterquant function, divide W into the different groups and apply GPTQ to each chunk of W with its corresponding Hessian.
  3. Concatenate the per-group Q back into the full Q.
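
As mentioned in step 1, a rough sketch of how the per-group Hessian accumulation could look (an assumption about the scheme, not code from this repo; Hs is a hypothetical list with one Hessian per group):

import torch

def add_batch_grouped(Hs, inp, layer):
    # inp: (batch, in_channels, H, W) input of a grouped Conv2d `layer`
    unfold = torch.nn.Unfold(layer.kernel_size, dilation=layer.dilation,
                             padding=layer.padding, stride=layer.stride)
    per_group = layer.in_channels // layer.groups
    for g in range(layer.groups):
        x = inp[:, g * per_group:(g + 1) * per_group]     # channels belonging to group g
        cols = unfold(x)                                  # (batch, per_group*kH*kW, L)
        cols = cols.permute(1, 0, 2).flatten(1)           # (rows of W_g, total samples)
        Hs[g] += cols @ cols.T                            # unnormalized Hessian for group g
    return Hs

fasterquant would then loop over the groups, running the usual per-column procedure on each W_g with its Hs[g].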

Thank you in advance.

opt_eval error

After quantizing opt-125m and saving the quantized model, when I use opt_eval I get an error: Only supports a single token currently.

qweight is empty when I gave --save option

As I want to obtain the quantized model through the GPTQ algorithm, I passed the --save option when running the Python script.

However, the qweight of each layer is empty because of the pack function in the Quant3Linear class (quant.py).
I think the while loop (line 147 ~ line 170) is not executed, so qweight is just an empty ndarray.

If I comment out the while loop, I can get the qweight.
What is the role of the while loop? Can I just comment it out and run the transformers?
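
For reference, a minimal sketch (an illustration, not the repo's exact code) of what 3-bit packing into 32-bit words involves; a loop of this kind is what actually fills qweight, and the awkward case it handles is a value straddling a word boundary:

import numpy as np

def pack3(intweight):
    # intweight: 1-D sequence of 3-bit values in [0, 7]
    nbits = 3
    values = [int(v) for v in intweight]
    packed = np.zeros((nbits * len(values) + 31) // 32, dtype=np.uint32)
    bit = 0
    for v in values:
        word, offset = divmod(bit, 32)
        packed[word] |= np.uint32((v << offset) & 0xFFFFFFFF)
        if offset > 32 - nbits:                       # value straddles a word boundary
            packed[word + 1] |= np.uint32(v >> (32 - offset))
        bit += nbits
    return packed

For example, pack3(np.arange(32) % 8) packs 32 values (96 bits) into exactly 3 words. If a packing loop like this never runs, qweight naturally stays empty.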

Feature Request: Add Saving Quantized Weights Functionality to bloom.py

Description:

Hi there,

I noticed that the opt.py file in the repository provides a method for saving quantized weights, but this functionality is not available in the bloom.py file. I was wondering if it would be possible to add this feature to bloom.py as well.

Being able to save quantized weights is a really useful feature for optimizing the size of models, and it would be great to have this functionality available in all relevant files in the repository.

If this feature could be added to bloom.py, I think it would be a really helpful addition for anyone who is working with this file.

Thank you for your time and consideration.

Best regards,

GPTQ for BERT

I'm looking for the GPTQ implementation for BERT; why isn't it in the repository? I want to try a 4-bit implementation for a speed comparison, and to try other models as well.

running speed slow on NVIDIA vGPU

I tested Qwen-7B GPTQ quantization on a vGPU with half the performance of an A10.

  • Driver Version:470.161.03
  • CUDA Version: 11.4

I have noticed that both the context-processing speed and the decoding speed are particularly slow:

  • context(500 tokens) processing speed: 48 tokens/s
  • decode speed: 1.6 token/s

Then I tested other models, such as https://huggingface.co/ClueAI/ChatYuan-large-v2, and their speed is within expectations. So I guess GPTQ does not work well on a vGPU?

The code is nothing special; it looks like:

from auto_gptq import AutoGPTQForCausalLM
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0")
...

Compatibility of Quant3Linear and 4-bit quantization

Hi! I've noticed that the quantization layer packs the quantized weights using the Quant3Linear class, as shown in the screenshot below:
[screenshot of the packing code in Quant3Linear]

However, it seems to me that it only suits 2-bit and 3-bit weights. If the original weights in intweight are 4-bit, some bits would be lost.

Could you explain the logic behind this? Thanks!
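
By contrast, 4-bit values pack exactly eight per 32-bit word, so no value ever straddles a boundary; a minimal sketch for illustration (an assumption, not the repo's code):

import numpy as np

def pack4(intweight):
    # intweight: 1-D array of 4-bit values in [0, 15], length a multiple of 8
    vals = np.asarray(intweight, dtype=np.uint64).reshape(-1, 8)
    shifts = np.arange(8, dtype=np.uint64) * 4
    return (vals << shifts).sum(axis=1).astype(np.uint32)

So a 3-bit-specific packing routine would indeed need to be adapted before it can store 4-bit weights without losing bits.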

quantized GPTJ - error on inference

Hi there, I'm trying to quantize a finetuned version of GPT-J through the https://github.com/AlpinDale/gptq-gptj repo.

To quantize the model I use this command:

CUDA_VISIBLE_DEVICES=0 python gptj.py ../finetuned6B/checkpoint-3000/ c4 --wbits 4 --save GPTJQ.pt

The process completes successfully and the file GPTJQ.pt is produced. The only warning I get is:

Token indices sequence length is longer than the specified maximum sequence length for this model (3403 > 2048). Running this sequence through the model will result in indexing errors.

When I run inference through this command:

CUDA_VISIBLE_DEVICES=0 python gptj-inference.py EleutherAI/gpt-j-6b --wbits 4 --load GPTJQ.pt --text "Hello"

I get the following error. What am I doing wrong?

Thank you very much for any help!

The error:

CUDA extension not installed.
Loading model ...
Traceback (most recent call last):
File "gptj-inference.py", line 120, in
model = load_quant(args.model, args.load, args.wbits)
File "gptj-inference.py", line 55, in load_quant
model.load_state_dict(torch.load(checkpoint))
File "/home/gianmarco/miniconda3/envs/gpt_env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1671, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for GPTJForCausalLM:
Missing key(s) in state_dict: "transformer.h.0.attn.k_proj.qzeros", "transformer.h.0.attn.k_proj.scales", "transformer.h.0.attn.k_proj.bias", "transformer.h.0.attn.k_proj.qweight", "transformer.h.0.attn.v_proj.qzeros", "transformer.h.0.attn.v_proj.scales", "transformer.h.0.attn.v_proj.bias", "transformer.h.0.attn.v_proj.qweight", "transformer.h.0.attn.q_proj.qzeros", "transformer.h.0.attn.q_proj.scales", "transformer.h.0.attn.q_proj.bias", "transformer.h.0.attn.q_proj.qweight", "transformer.h.0.attn.out_proj.qzeros", "transformer.h.0.attn.out_proj.scales", "transformer.h.0.attn.out_proj.bias", "transformer.h.0.attn.out_proj.qweight", "transformer.h.0.mlp.fc_in.qzeros", "transformer.h.0.mlp.fc_in.scales", "transformer.h.0.mlp.fc_in.qweight", "transformer.h.0.mlp.fc_out.qzeros", "transformer.h.0.mlp.fc_out.scales", "transformer.h.0.mlp.fc_out.qweight", "transformer.h.1.attn.k_proj.qzeros", "transformer.h.1.attn.k_proj.scales", "transformer.h.1.attn.k_proj.bias", "transformer.h.1.attn.k_proj.qweight", "transformer.h.1.attn.v_proj.qzeros", "transformer.h.1.attn.v_proj.scales", "transformer.h.1.attn.v_proj.bias", "transformer.h.1.attn.v_proj.qweight", "transformer.h.1.attn.q_proj.qzeros", "transformer.h.1.attn.q_proj.scales", "transformer.h.1.attn.q_proj.bias", "transformer.h.1.attn.q_proj.qweight", "transformer.h.1.attn.out_proj.qzeros", "transformer.h.1.attn.out_proj.scales", "transformer.h.1.attn.out_proj.bias", "transformer.h.1.attn.out_proj.qweight", "transformer.h.1.mlp.fc_in.qzeros", "transformer.h.1.mlp.fc_in.scales", "transformer.h.1.mlp.fc_in.qweight", "transformer.h.1.mlp.fc_out.qzeros", "transformer.h.1.mlp.fc_out.scales", "transformer.h.1.mlp.fc_out.qweight", "transformer.h.2.attn.k_proj.qzeros", "transformer.h.2.attn.k_proj.scales", "transformer.h.2.attn.k_proj.bias", "transformer.h.2.attn.k_proj.qweight", "transformer.h.2.attn.v_proj.qzeros", "transformer.h.2.attn.v_proj.scales", "transformer.h.2.attn.v_proj.bias", "transformer.h.2.attn.v_proj.qweight", "transformer.h.2.attn.q_proj.qzeros", "transformer.h.2.attn.q_proj.scales", "transformer.h.2.attn.q_proj.bias", "transformer.h.2.attn.q_proj.qweight", "transformer.h.2.attn.out_proj.qzeros", "transformer.h.2.attn.out_proj.scales", "transformer.h.2.attn.out_proj.bias", "transformer.h.2.attn.out_proj.qweight", "transformer.h.2.mlp.fc_in.qzeros", "transformer.h.2.mlp.fc_in.scales", "transformer.h.2.mlp.fc_in.qweight", "transformer.h.2.mlp.fc_out.qzeros", "transformer.h.2.mlp.fc_out.scales", "transformer.h.2.mlp.fc_out.qweight", "transformer.h.3.attn.k_proj.qzeros", "transformer.h.3.attn.k_proj.scales", "transformer.h.3.attn.k_proj.bias", "transformer.h.3.attn.k_proj.qweight", "transformer.h.3.attn.v_proj.qzeros", "transformer.h.3.attn.v_proj.scales", "transformer.h.3.attn.v_proj.bias", "transformer.h.3.attn.v_proj.qweight", "transformer.h.3.attn.q_proj.qzeros", "transformer.h.3.attn.q_proj.scales", "transformer.h.3.attn.q_proj.bias", "transformer.h.3.attn.q_proj.qweight", "transformer.h.3.attn.out_proj.qzeros", "transformer.h.3.attn.out_proj.scales", "transformer.h.3.attn.out_proj.bias", "transformer.h.3.attn.out_proj.qweight", "transformer.h.3.mlp.fc_in.qzeros", "transformer.h.3.mlp.fc_in.scales", "transformer.h.3.mlp.fc_in.qweight", "transformer.h.3.mlp.fc_out.qzeros", "transformer.h.3.mlp.fc_out.scales", "transformer.h.3.mlp.fc_out.qweight", "transformer.h.4.attn.k_proj.qzeros", "transformer.h.4.attn.k_proj.scales", "transformer.h.4.attn.k_proj.bias", "transformer.h.4.attn.k_proj.qweight", "transformer.h.4.attn.v_proj.qzeros", 
"transformer.h.4.attn.v_proj.scales", "transformer.h.4.attn.v_proj.bias", "transformer.h.4.attn.v_proj.qweight", "transformer.h.4.attn.q_proj.qzeros", "transformer.h.4.attn.q_proj.scales", "transformer.h.4.attn.q_proj.bias", "transformer.h.4.attn.q_proj.qweight", "transformer.h.4.attn.out_proj.qzeros", "transformer.h.4.attn.out_proj.scales", "transformer.h.4.attn.out_proj.bias", "transformer.h.4.attn.out_proj.qweight", "transformer.h.4.mlp.fc_in.qzeros", "transformer.h.4.mlp.fc_in.scales", "transformer.h.4.mlp.fc_in.qweight", "transformer.h.4.mlp.fc_out.qzeros", "transformer.h.4.mlp.fc_out.scales", "transformer.h.4.mlp.fc_out.qweight", "transformer.h.5.attn.k_proj.qzeros", "transformer.h.5.attn.k_proj.scales", "transformer.h.5.attn.k_proj.bias", "transformer.h.5.attn.k_proj.qweight", "transformer.h.5.attn.v_proj.qzeros", "transformer.h.5.attn.v_proj.scales", "transformer.h.5.attn.v_proj.bias", "transformer.h.5.attn.v_proj.qweight", "transformer.h.5.attn.q_proj.qzeros", "transformer.h.5.attn.q_proj.scales", "transformer.h.5.attn.q_proj.bias", "transformer.h.5.attn.q_proj.qweight", "transformer.h.5.attn.out_proj.qzeros", "transformer.h.5.attn.out_proj.scales", "transformer.h.5.attn.out_proj.bias", "transformer.h.5.attn.out_proj.qweight", "transformer.h.5.mlp.fc_in.qzeros", "transformer.h.5.mlp.fc_in.scales", "transformer.h.5.mlp.fc_in.qweight", "transformer.h.5.mlp.fc_out.qzeros", "transformer.h.5.mlp.fc_out.scales", "transformer.h.5.mlp.fc_out.qweight", "transformer.h.6.attn.k_proj.qzeros", "transformer.h.6.attn.k_proj.scales", "transformer.h.6.attn.k_proj.bias", "transformer.h.6.attn.k_proj.qweight", "transformer.h.6.attn.v_proj.qzeros", "transformer.h.6.attn.v_proj.scales", "transformer.h.6.attn.v_proj.bias", "transformer.h.6.attn.v_proj.qweight", "transformer.h.6.attn.q_proj.qzeros", "transformer.h.6.attn.q_proj.scales", "transformer.h.6.attn.q_proj.bias", "transformer.h.6.attn.q_proj.qweight", "transformer.h.6.attn.out_proj.qzeros", "transformer.h.6.attn.out_proj.scales", "transformer.h.6.attn.out_proj.bias", "transformer.h.6.attn.out_proj.qweight", "transformer.h.6.mlp.fc_in.qzeros", "transformer.h.6.mlp.fc_in.scales", "transformer.h.6.mlp.fc_in.qweight", "transformer.h.6.mlp.fc_out.qzeros", "transformer.h.6.mlp.fc_out.scales", "transformer.h.6.mlp.fc_out.qweight", "transformer.h.7.attn.k_proj.qzeros", "transformer.h.7.attn.k_proj.scales", "transformer.h.7.attn.k_proj.bias", "transformer.h.7.attn.k_proj.qweight", "transformer.h.7.attn.v_proj.qzeros", "transformer.h.7.attn.v_proj.scales", "transformer.h.7.attn.v_proj.bias", "transformer.h.7.attn.v_proj.qweight", "transformer.h.7.attn.q_proj.qzeros", "transformer.h.7.attn.q_proj.scales", "transformer.h.7.attn.q_proj.bias", "transformer.h.7.attn.q_proj.qweight", "transformer.h.7.attn.out_proj.qzeros", "transformer.h.7.attn.out_proj.scales", "transformer.h.7.attn.out_proj.bias", "transformer.h.7.attn.out_proj.qweight", "transformer.h.7.mlp.fc_in.qzeros", "transformer.h.7.mlp.fc_in.scales", "transformer.h.7.mlp.fc_in.qweight", "transformer.h.7.mlp.fc_out.qzeros", "transformer.h.7.mlp.fc_out.scales", "transformer.h.7.mlp.fc_out.qweight", "transformer.h.8.attn.k_proj.qzeros", "transformer.h.8.attn.k_proj.scales", "transformer.h.8.attn.k_proj.bias", "transformer.h.8.attn.k_proj.qweight", "transformer.h.8.attn.v_proj.qzeros", "transformer.h.8.attn.v_proj.scales", "transformer.h.8.attn.v_proj.bias", "transformer.h.8.attn.v_proj.qweight", "transformer.h.8.attn.q_proj.qzeros", "transformer.h.8.attn.q_proj.scales", 
"transformer.h.8.attn.q_proj.bias", "transformer.h.8.attn.q_proj.qweight", "transformer.h.8.attn.out_proj.qzeros", "transformer.h.8.attn.out_proj.scales", "transformer.h.8.attn.out_proj.bias", "transformer.h.8.attn.out_proj.qweight", "transformer.h.8.mlp.fc_in.qzeros", "transformer.h.8.mlp.fc_in.scales", "transformer.h.8.mlp.fc_in.qweight", "transformer.h.8.mlp.fc_out.qzeros", "transformer.h.8.mlp.fc_out.scales", "transformer.h.8.mlp.fc_out.qweight", "transformer.h.9.attn.k_proj.qzeros", "transformer.h.9.attn.k_proj.scales", "transformer.h.9.attn.k_proj.bias", "transformer.h.9.attn.k_proj.qweight", "transformer.h.9.attn.v_proj.qzeros", "transformer.h.9.attn.v_proj.scales", "transformer.h.9.attn.v_proj.bias", "transformer.h.9.attn.v_proj.qweight", "transformer.h.9.attn.q_proj.qzeros", "transformer.h.9.attn.q_proj.scales", "transformer.h.9.attn.q_proj.bias", "transformer.h.9.attn.q_proj.qweight", "transformer.h.9.attn.out_proj.qzeros", "transformer.h.9.attn.out_proj.scales", "transformer.h.9.attn.out_proj.bias", "transformer.h.9.attn.out_proj.qweight", "transformer.h.9.mlp.fc_in.qzeros", "transformer.h.9.mlp.fc_in.scales", "transformer.h.9.mlp.fc_in.qweight", "transformer.h.9.mlp.fc_out.qzeros", "transformer.h.9.mlp.fc_out.scales", "transformer.h.9.mlp.fc_out.qweight", "transformer.h.10.attn.k_proj.qzeros", "transformer.h.10.attn.k_proj.scales", "transformer.h.10.attn.k_proj.bias", "transformer.h.10.attn.k_proj.qweight", "transformer.h.10.attn.v_proj.qzeros", "transformer.h.10.attn.v_proj.scales", "transformer.h.10.attn.v_proj.bias", "transformer.h.10.attn.v_proj.qweight", "transformer.h.10.attn.q_proj.qzeros", "transformer.h.10.attn.q_proj.scales", "transformer.h.10.attn.q_proj.bias", "transformer.h.10.attn.q_proj.qweight", "transformer.h.10.attn.out_proj.qzeros", "transformer.h.10.attn.out_proj.scales", "transformer.h.10.attn.out_proj.bias", "transformer.h.10.attn.out_proj.qweight", "transformer.h.10.mlp.fc_in.qzeros", "transformer.h.10.mlp.fc_in.scales", "transformer.h.10.mlp.fc_in.qweight", "transformer.h.10.mlp.fc_out.qzeros", "transformer.h.10.mlp.fc_out.scales", "transformer.h.10.mlp.fc_out.qweight", "transformer.h.11.attn.k_proj.qzeros", "transformer.h.11.attn.k_proj.scales", "transformer.h.11.attn.k_proj.bias", "transformer.h.11.attn.k_proj.qweight", "transformer.h.11.attn.v_proj.qzeros", "transformer.h.11.attn.v_proj.scales", "transformer.h.11.attn.v_proj.bias", "transformer.h.11.attn.v_proj.qweight", "transformer.h.11.attn.q_proj.qzeros", "transformer.h.11.attn.q_proj.scales", "transformer.h.11.attn.q_proj.bias", "transformer.h.11.attn.q_proj.qweight", "transformer.h.11.attn.out_proj.qzeros", "transformer.h.11.attn.out_proj.scales", "transformer.h.11.attn.out_proj.bias", "transformer.h.11.attn.out_proj.qweight", "transformer.h.11.mlp.fc_in.qzeros", "transformer.h.11.mlp.fc_in.scales", "transformer.h.11.mlp.fc_in.qweight", "transformer.h.11.mlp.fc_out.qzeros", "transformer.h.11.mlp.fc_out.scales", "transformer.h.11.mlp.fc_out.qweight", "transformer.h.12.attn.k_proj.qzeros", "transformer.h.12.attn.k_proj.scales", "transformer.h.12.attn.k_proj.bias", "transformer.h.12.attn.k_proj.qweight", "transformer.h.12.attn.v_proj.qzeros", "transformer.h.12.attn.v_proj.scales", "transformer.h.12.attn.v_proj.bias", "transformer.h.12.attn.v_proj.qweight", "transformer.h.12.attn.q_proj.qzeros", "transformer.h.12.attn.q_proj.scales", "transformer.h.12.attn.q_proj.bias", "transformer.h.12.attn.q_proj.qweight", "transformer.h.12.attn.out_proj.qzeros", "transformer.h.12.attn.out_proj.scales", 
"transformer.h.12.attn.out_proj.bias", "transformer.h.12.attn.out_proj.qweight", "transformer.h.12.mlp.fc_in.qzeros", "transformer.h.12.mlp.fc_in.scales", "transformer.h.12.mlp.fc_in.qweight", "transformer.h.12.mlp.fc_out.qzeros", "transformer.h.12.mlp.fc_out.scales", "transformer.h.12.mlp.fc_out.qweight", "transformer.h.13.attn.k_proj.qzeros", "transformer.h.13.attn.k_proj.scales", "transformer.h.13.attn.k_proj.bias", "transformer.h.13.attn.k_proj.qweight", "transformer.h.13.attn.v_proj.qzeros", "transformer.h.13.attn.v_proj.scales", "transformer.h.13.attn.v_proj.bias", "transformer.h.13.attn.v_proj.qweight", "transformer.h.13.attn.q_proj.qzeros", "transformer.h.13.attn.q_proj.scales", "transformer.h.13.attn.q_proj.bias", "transformer.h.13.attn.q_proj.qweight", "transformer.h.13.attn.out_proj.qzeros", "transformer.h.13.attn.out_proj.scales", "transformer.h.13.attn.out_proj.bias", "transformer.h.13.attn.out_proj.qweight", "transformer.h.13.mlp.fc_in.qzeros", "transformer.h.13.mlp.fc_in.scales", "transformer.h.13.mlp.fc_in.qweight", "transformer.h.13.mlp.fc_out.qzeros", "transformer.h.13.mlp.fc_out.scales", "transformer.h.13.mlp.fc_out.qweight", "transformer.h.14.attn.k_proj.qzeros", "transformer.h.14.attn.k_proj.scales", "transformer.h.14.attn.k_proj.bias", "transformer.h.14.attn.k_proj.qweight", "transformer.h.14.attn.v_proj.qzeros", "transformer.h.14.attn.v_proj.scales", "transformer.h.14.attn.v_proj.bias", "transformer.h.14.attn.v_proj.qweight", "transformer.h.14.attn.q_proj.qzeros", "transformer.h.14.attn.q_proj.scales", "transformer.h.14.attn.q_proj.bias", "transformer.h.14.attn.q_proj.qweight", "transformer.h.14.attn.out_proj.qzeros", "transformer.h.14.attn.out_proj.scales", "transformer.h.14.attn.out_proj.bias", "transformer.h.14.attn.out_proj.qweight", "transformer.h.14.mlp.fc_in.qzeros", "transformer.h.14.mlp.fc_in.scales", "transformer.h.14.mlp.fc_in.qweight", "transformer.h.14.mlp.fc_out.qzeros", "transformer.h.14.mlp.fc_out.scales", "transformer.h.14.mlp.fc_out.qweight", "transformer.h.15.attn.k_proj.qzeros", "transformer.h.15.attn.k_proj.scales", "transformer.h.15.attn.k_proj.bias", "transformer.h.15.attn.k_proj.qweight", "transformer.h.15.attn.v_proj.qzeros", "transformer.h.15.attn.v_proj.scales", "transformer.h.15.attn.v_proj.bias", "transformer.h.15.attn.v_proj.qweight", "transformer.h.15.attn.q_proj.qzeros", "transformer.h.15.attn.q_proj.scales", "transformer.h.15.attn.q_proj.bias", "transformer.h.15.attn.q_proj.qweight", "transformer.h.15.attn.out_proj.qzeros", "transformer.h.15.attn.out_proj.scales", "transformer.h.15.attn.out_proj.bias", "transformer.h.15.attn.out_proj.qweight", "transformer.h.15.mlp.fc_in.qzeros", "transformer.h.15.mlp.fc_in.scales", "transformer.h.15.mlp.fc_in.qweight", "transformer.h.15.mlp.fc_out.qzeros", "transformer.h.15.mlp.fc_out.scales", "transformer.h.15.mlp.fc_out.qweight", "transformer.h.16.attn.k_proj.qzeros", "transformer.h.16.attn.k_proj.scales", "transformer.h.16.attn.k_proj.bias", "transformer.h.16.attn.k_proj.qweight", "transformer.h.16.attn.v_proj.qzeros", "transformer.h.16.attn.v_proj.scales", "transformer.h.16.attn.v_proj.bias", "transformer.h.16.attn.v_proj.qweight", "transformer.h.16.attn.q_proj.qzeros", "transformer.h.16.attn.q_proj.scales", "transformer.h.16.attn.q_proj.bias", "transformer.h.16.attn.q_proj.qweight", "transformer.h.16.attn.out_proj.qzeros", "transformer.h.16.attn.out_proj.scales", "transformer.h.16.attn.out_proj.bias", "transformer.h.16.attn.out_proj.qweight", "transformer.h.16.mlp.fc_in.qzeros", 
"transformer.h.16.mlp.fc_in.scales", "transformer.h.16.mlp.fc_in.qweight", "transformer.h.16.mlp.fc_out.qzeros", "transformer.h.16.mlp.fc_out.scales", "transformer.h.16.mlp.fc_out.qweight", "transformer.h.17.attn.k_proj.qzeros", "transformer.h.17.attn.k_proj.scales", "transformer.h.17.attn.k_proj.bias", "transformer.h.17.attn.k_proj.qweight", "transformer.h.17.attn.v_proj.qzeros", "transformer.h.17.attn.v_proj.scales", "transformer.h.17.attn.v_proj.bias", "transformer.h.17.attn.v_proj.qweight", "transformer.h.17.attn.q_proj.qzeros", "transformer.h.17.attn.q_proj.scales", "transformer.h.17.attn.q_proj.bias", "transformer.h.17.attn.q_proj.qweight", "transformer.h.17.attn.out_proj.qzeros", "transformer.h.17.attn.out_proj.scales", "transformer.h.17.attn.out_proj.bias", "transformer.h.17.attn.out_proj.qweight", "transformer.h.17.mlp.fc_in.qzeros", "transformer.h.17.mlp.fc_in.scales", "transformer.h.17.mlp.fc_in.qweight", "transformer.h.17.mlp.fc_out.qzeros", "transformer.h.17.mlp.fc_out.scales", "transformer.h.17.mlp.fc_out.qweight", "transformer.h.18.attn.k_proj.qzeros", "transformer.h.18.attn.k_proj.scales", "transformer.h.18.attn.k_proj.bias", "transformer.h.18.attn.k_proj.qweight", "transformer.h.18.attn.v_proj.qzeros", "transformer.h.18.attn.v_proj.scales", "transformer.h.18.attn.v_proj.bias", "transformer.h.18.attn.v_proj.qweight", "transformer.h.18.attn.q_proj.qzeros", "transformer.h.18.attn.q_proj.scales", "transformer.h.18.attn.q_proj.bias", "transformer.h.18.attn.q_proj.qweight", "transformer.h.18.attn.out_proj.qzeros", "transformer.h.18.attn.out_proj.scales", "transformer.h.18.attn.out_proj.bias", "transformer.h.18.attn.out_proj.qweight", "transformer.h.18.mlp.fc_in.qzeros", "transformer.h.18.mlp.fc_in.scales", "transformer.h.18.mlp.fc_in.qweight", "transformer.h.18.mlp.fc_out.qzeros", "transformer.h.18.mlp.fc_out.scales", "transformer.h.18.mlp.fc_out.qweight", "transformer.h.19.attn.k_proj.qzeros", "transformer.h.19.attn.k_proj.scales", "transformer.h.19.attn.k_proj.bias", "transformer.h.19.attn.k_proj.qweight", "transformer.h.19.attn.v_proj.qzeros", "transformer.h.19.attn.v_proj.scales", "transformer.h.19.attn.v_proj.bias", "transformer.h.19.attn.v_proj.qweight", "transformer.h.19.attn.q_proj.qzeros", "transformer.h.19.attn.q_proj.scales", "transformer.h.19.attn.q_proj.bias", "transformer.h.19.attn.q_proj.qweight", "transformer.h.19.attn.out_proj.qzeros", "transformer.h.19.attn.out_proj.scales", "transformer.h.19.attn.out_proj.bias", "transformer.h.19.attn.out_proj.qweight", "transformer.h.19.mlp.fc_in.qzeros", "transformer.h.19.mlp.fc_in.scales", "transformer.h.19.mlp.fc_in.qweight", "transformer.h.19.mlp.fc_out.qzeros", "transformer.h.19.mlp.fc_out.scales", "transformer.h.19.mlp.fc_out.qweight", "transformer.h.20.attn.k_proj.qzeros", "transformer.h.20.attn.k_proj.scales", "transformer.h.20.attn.k_proj.bias", "transformer.h.20.attn.k_proj.qweight", "transformer.h.20.attn.v_proj.qzeros", "transformer.h.20.attn.v_proj.scales", "transformer.h.20.attn.v_proj.bias", "transformer.h.20.attn.v_proj.qweight", "transformer.h.20.attn.q_proj.qzeros", "transformer.h.20.attn.q_proj.scales", "transformer.h.20.attn.q_proj.bias", "transformer.h.20.attn.q_proj.qweight", "transformer.h.20.attn.out_proj.qzeros", "transformer.h.20.attn.out_proj.scales", "transformer.h.20.attn.out_proj.bias", "transformer.h.20.attn.out_proj.qweight", "transformer.h.20.mlp.fc_in.qzeros", "transformer.h.20.mlp.fc_in.scales", "transformer.h.20.mlp.fc_in.qweight", "transformer.h.20.mlp.fc_out.qzeros", 
"transformer.h.20.mlp.fc_out.scales", "transformer.h.20.mlp.fc_out.qweight", "transformer.h.21.attn.k_proj.qzeros", "transformer.h.21.attn.k_proj.scales", "transformer.h.21.attn.k_proj.bias", "transformer.h.21.attn.k_proj.qweight", "transformer.h.21.attn.v_proj.qzeros", "transformer.h.21.attn.v_proj.scales", "transformer.h.21.attn.v_proj.bias", "transformer.h.21.attn.v_proj.qweight", "transformer.h.21.attn.q_proj.qzeros", "transformer.h.21.attn.q_proj.scales", "transformer.h.21.attn.q_proj.bias", "transformer.h.21.attn.q_proj.qweight", "transformer.h.21.attn.out_proj.qzeros", "transformer.h.21.attn.out_proj.scales", "transformer.h.21.attn.out_proj.bias", "transformer.h.21.attn.out_proj.qweight", "transformer.h.21.mlp.fc_in.qzeros", "transformer.h.21.mlp.fc_in.scales", "transformer.h.21.mlp.fc_in.qweight", "transformer.h.21.mlp.fc_out.qzeros", "transformer.h.21.mlp.fc_out.scales", "transformer.h.21.mlp.fc_out.qweight", "transformer.h.22.attn.k_proj.qzeros", "transformer.h.22.attn.k_proj.scales", "transformer.h.22.attn.k_proj.bias", "transformer.h.22.attn.k_proj.qweight", "transformer.h.22.attn.v_proj.qzeros", "transformer.h.22.attn.v_proj.scales", "transformer.h.22.attn.v_proj.bias", "transformer.h.22.attn.v_proj.qweight", "transformer.h.22.attn.q_proj.qzeros", "transformer.h.22.attn.q_proj.scales", "transformer.h.22.attn.q_proj.bias", "transformer.h.22.attn.q_proj.qweight", "transformer.h.22.attn.out_proj.qzeros", "transformer.h.22.attn.out_proj.scales", "transformer.h.22.attn.out_proj.bias", "transformer.h.22.attn.out_proj.qweight", "transformer.h.22.mlp.fc_in.qzeros", "transformer.h.22.mlp.fc_in.scales", "transformer.h.22.mlp.fc_in.qweight", "transformer.h.22.mlp.fc_out.qzeros", "transformer.h.22.mlp.fc_out.scales", "transformer.h.22.mlp.fc_out.qweight", "transformer.h.23.attn.k_proj.qzeros", "transformer.h.23.attn.k_proj.scales", "transformer.h.23.attn.k_proj.bias", "transformer.h.23.attn.k_proj.qweight", "transformer.h.23.attn.v_proj.qzeros", "transformer.h.23.attn.v_proj.scales", "transformer.h.23.attn.v_proj.bias", "transformer.h.23.attn.v_proj.qweight", "transformer.h.23.attn.q_proj.qzeros", "transformer.h.23.attn.q_proj.scales", "transformer.h.23.attn.q_proj.bias", "transformer.h.23.attn.q_proj.qweight", "transformer.h.23.attn.out_proj.qzeros", "transformer.h.23.attn.out_proj.scales", "transformer.h.23.attn.out_proj.bias", "transformer.h.23.attn.out_proj.qweight", "transformer.h.23.mlp.fc_in.qzeros", "transformer.h.23.mlp.fc_in.scales", "transformer.h.23.mlp.fc_in.qweight", "transformer.h.23.mlp.fc_out.qzeros", "transformer.h.23.mlp.fc_out.scales", "transformer.h.23.mlp.fc_out.qweight", "transformer.h.24.attn.k_proj.qzeros", "transformer.h.24.attn.k_proj.scales", "transformer.h.24.attn.k_proj.bias", "transformer.h.24.attn.k_proj.qweight", "transformer.h.24.attn.v_proj.qzeros", "transformer.h.24.attn.v_proj.scales", "transformer.h.24.attn.v_proj.bias", "transformer.h.24.attn.v_proj.qweight", "transformer.h.24.attn.q_proj.qzeros", "transformer.h.24.attn.q_proj.scales", "transformer.h.24.attn.q_proj.bias", "transformer.h.24.attn.q_proj.qweight", "transformer.h.24.attn.out_proj.qzeros", "transformer.h.24.attn.out_proj.scales", "transformer.h.24.attn.out_proj.bias", "transformer.h.24.attn.out_proj.qweight", "transformer.h.24.mlp.fc_in.qzeros", "transformer.h.24.mlp.fc_in.scales", "transformer.h.24.mlp.fc_in.qweight", "transformer.h.24.mlp.fc_out.qzeros", "transformer.h.24.mlp.fc_out.scales", "transformer.h.24.mlp.fc_out.qweight", "transformer.h.25.attn.k_proj.qzeros", 
"transformer.h.25.attn.k_proj.scales", "transformer.h.25.attn.k_proj.bias", "transformer.h.25.attn.k_proj.qweight", "transformer.h.25.attn.v_proj.qzeros", "transformer.h.25.attn.v_proj.scales", "transformer.h.25.attn.v_proj.bias", "transformer.h.25.attn.v_proj.qweight", "transformer.h.25.attn.q_proj.qzeros", "transformer.h.25.attn.q_proj.scales", "transformer.h.25.attn.q_proj.bias", "transformer.h.25.attn.q_proj.qweight", "transformer.h.25.attn.out_proj.qzeros", "transformer.h.25.attn.out_proj.scales", "transformer.h.25.attn.out_proj.bias", "transformer.h.25.attn.out_proj.qweight", "transformer.h.25.mlp.fc_in.qzeros", "transformer.h.25.mlp.fc_in.scales", "transformer.h.25.mlp.fc_in.qweight", "transformer.h.25.mlp.fc_out.qzeros", "transformer.h.25.mlp.fc_out.scales", "transformer.h.25.mlp.fc_out.qweight", "transformer.h.26.attn.k_proj.qzeros", "transformer.h.26.attn.k_proj.scales", "transformer.h.26.attn.k_proj.bias", "transformer.h.26.attn.k_proj.qweight", "transformer.h.26.attn.v_proj.qzeros", "transformer.h.26.attn.v_proj.scales", "transformer.h.26.attn.v_proj.bias", "transformer.h.26.attn.v_proj.qweight", "transformer.h.26.attn.q_proj.qzeros", "transformer.h.26.attn.q_proj.scales", "transformer.h.26.attn.q_proj.bias", "transformer.h.26.attn.q_proj.qweight", "transformer.h.26.attn.out_proj.qzeros", "transformer.h.26.attn.out_proj.scales", "transformer.h.26.attn.out_proj.bias", "transformer.h.26.attn.out_proj.qweight", "transformer.h.26.mlp.fc_in.qzeros", "transformer.h.26.mlp.fc_in.scales", "transformer.h.26.mlp.fc_in.qweight", "transformer.h.26.mlp.fc_out.qzeros", "transformer.h.26.mlp.fc_out.scales", "transformer.h.26.mlp.fc_out.qweight", "transformer.h.27.attn.k_proj.qzeros", "transformer.h.27.attn.k_proj.scales", "transformer.h.27.attn.k_proj.bias", "transformer.h.27.attn.k_proj.qweight", "transformer.h.27.attn.v_proj.qzeros", "transformer.h.27.attn.v_proj.scales", "transformer.h.27.attn.v_proj.bias", "transformer.h.27.attn.v_proj.qweight", "transformer.h.27.attn.q_proj.qzeros", "transformer.h.27.attn.q_proj.scales", "transformer.h.27.attn.q_proj.bias", "transformer.h.27.attn.q_proj.qweight", "transformer.h.27.attn.out_proj.qzeros", "transformer.h.27.attn.out_proj.scales", "transformer.h.27.attn.out_proj.bias", "transformer.h.27.attn.out_proj.qweight", "transformer.h.27.mlp.fc_in.qzeros", "transformer.h.27.mlp.fc_in.scales", "transformer.h.27.mlp.fc_in.qweight", "transformer.h.27.mlp.fc_out.qzeros", "transformer.h.27.mlp.fc_out.scales", "transformer.h.27.mlp.fc_out.qweight".
Unexpected key(s) in state_dict: "transformer.h.0.attn.k_proj.weight", "transformer.h.0.attn.v_proj.weight", "transformer.h.0.attn.q_proj.weight", "transformer.h.0.attn.out_proj.weight", "transformer.h.0.mlp.fc_in.weight", "transformer.h.0.mlp.fc_out.weight", "transformer.h.1.attn.k_proj.weight", "transformer.h.1.attn.v_proj.weight", "transformer.h.1.attn.q_proj.weight", "transformer.h.1.attn.out_proj.weight", "transformer.h.1.mlp.fc_in.weight", "transformer.h.1.mlp.fc_out.weight", "transformer.h.2.attn.k_proj.weight", "transformer.h.2.attn.v_proj.weight", "transformer.h.2.attn.q_proj.weight", "transformer.h.2.attn.out_proj.weight", "transformer.h.2.mlp.fc_in.weight", "transformer.h.2.mlp.fc_out.weight", "transformer.h.3.attn.k_proj.weight", "transformer.h.3.attn.v_proj.weight", "transformer.h.3.attn.q_proj.weight", "transformer.h.3.attn.out_proj.weight", "transformer.h.3.mlp.fc_in.weight", "transformer.h.3.mlp.fc_out.weight", "transformer.h.4.attn.k_proj.weight", "transformer.h.4.attn.v_proj.weight", "transformer.h.4.attn.q_proj.weight", "transformer.h.4.attn.out_proj.weight", "transformer.h.4.mlp.fc_in.weight", "transformer.h.4.mlp.fc_out.weight", "transformer.h.5.attn.k_proj.weight", "transformer.h.5.attn.v_proj.weight", "transformer.h.5.attn.q_proj.weight", "transformer.h.5.attn.out_proj.weight", "transformer.h.5.mlp.fc_in.weight", "transformer.h.5.mlp.fc_out.weight", "transformer.h.6.attn.k_proj.weight", "transformer.h.6.attn.v_proj.weight", "transformer.h.6.attn.q_proj.weight", "transformer.h.6.attn.out_proj.weight", "transformer.h.6.mlp.fc_in.weight", "transformer.h.6.mlp.fc_out.weight", "transformer.h.7.attn.k_proj.weight", "transformer.h.7.attn.v_proj.weight", "transformer.h.7.attn.q_proj.weight", "transformer.h.7.attn.out_proj.weight", "transformer.h.7.mlp.fc_in.weight", "transformer.h.7.mlp.fc_out.weight", "transformer.h.8.attn.k_proj.weight", "transformer.h.8.attn.v_proj.weight", "transformer.h.8.attn.q_proj.weight", "transformer.h.8.attn.out_proj.weight", "transformer.h.8.mlp.fc_in.weight", "transformer.h.8.mlp.fc_out.weight", "transformer.h.9.attn.k_proj.weight", "transformer.h.9.attn.v_proj.weight", "transformer.h.9.attn.q_proj.weight", "transformer.h.9.attn.out_proj.weight", "transformer.h.9.mlp.fc_in.weight", "transformer.h.9.mlp.fc_out.weight", "transformer.h.10.attn.k_proj.weight", "transformer.h.10.attn.v_proj.weight", "transformer.h.10.attn.q_proj.weight", "transformer.h.10.attn.out_proj.weight", "transformer.h.10.mlp.fc_in.weight", "transformer.h.10.mlp.fc_out.weight", "transformer.h.11.attn.k_proj.weight", "transformer.h.11.attn.v_proj.weight", "transformer.h.11.attn.q_proj.weight", "transformer.h.11.attn.out_proj.weight", "transformer.h.11.mlp.fc_in.weight", "transformer.h.11.mlp.fc_out.weight", "transformer.h.12.attn.k_proj.weight", "transformer.h.12.attn.v_proj.weight", "transformer.h.12.attn.q_proj.weight", "transformer.h.12.attn.out_proj.weight", "transformer.h.12.mlp.fc_in.weight", "transformer.h.12.mlp.fc_out.weight", "transformer.h.13.attn.k_proj.weight", "transformer.h.13.attn.v_proj.weight", "transformer.h.13.attn.q_proj.weight", "transformer.h.13.attn.out_proj.weight", "transformer.h.13.mlp.fc_in.weight", "transformer.h.13.mlp.fc_out.weight", "transformer.h.14.attn.k_proj.weight", "transformer.h.14.attn.v_proj.weight", "transformer.h.14.attn.q_proj.weight", "transformer.h.14.attn.out_proj.weight", "transformer.h.14.mlp.fc_in.weight", "transformer.h.14.mlp.fc_out.weight", "transformer.h.15.attn.k_proj.weight", "transformer.h.15.attn.v_proj.weight", 
"transformer.h.15.attn.q_proj.weight", "transformer.h.15.attn.out_proj.weight", "transformer.h.15.mlp.fc_in.weight", "transformer.h.15.mlp.fc_out.weight", "transformer.h.16.attn.k_proj.weight", "transformer.h.16.attn.v_proj.weight", "transformer.h.16.attn.q_proj.weight", "transformer.h.16.attn.out_proj.weight", "transformer.h.16.mlp.fc_in.weight", "transformer.h.16.mlp.fc_out.weight", "transformer.h.17.attn.k_proj.weight", "transformer.h.17.attn.v_proj.weight", "transformer.h.17.attn.q_proj.weight", "transformer.h.17.attn.out_proj.weight", "transformer.h.17.mlp.fc_in.weight", "transformer.h.17.mlp.fc_out.weight", "transformer.h.18.attn.k_proj.weight", "transformer.h.18.attn.v_proj.weight", "transformer.h.18.attn.q_proj.weight", "transformer.h.18.attn.out_proj.weight", "transformer.h.18.mlp.fc_in.weight", "transformer.h.18.mlp.fc_out.weight", "transformer.h.19.attn.k_proj.weight", "transformer.h.19.attn.v_proj.weight", "transformer.h.19.attn.q_proj.weight", "transformer.h.19.attn.out_proj.weight", "transformer.h.19.mlp.fc_in.weight", "transformer.h.19.mlp.fc_out.weight", "transformer.h.20.attn.k_proj.weight", "transformer.h.20.attn.v_proj.weight", "transformer.h.20.attn.q_proj.weight", "transformer.h.20.attn.out_proj.weight", "transformer.h.20.mlp.fc_in.weight", "transformer.h.20.mlp.fc_out.weight", "transformer.h.21.attn.k_proj.weight", "transformer.h.21.attn.v_proj.weight", "transformer.h.21.attn.q_proj.weight", "transformer.h.21.attn.out_proj.weight", "transformer.h.21.mlp.fc_in.weight", "transformer.h.21.mlp.fc_out.weight", "transformer.h.22.attn.k_proj.weight", "transformer.h.22.attn.v_proj.weight", "transformer.h.22.attn.q_proj.weight", "transformer.h.22.attn.out_proj.weight", "transformer.h.22.mlp.fc_in.weight", "transformer.h.22.mlp.fc_out.weight", "transformer.h.23.attn.k_proj.weight", "transformer.h.23.attn.v_proj.weight", "transformer.h.23.attn.q_proj.weight", "transformer.h.23.attn.out_proj.weight", "transformer.h.23.mlp.fc_in.weight", "transformer.h.23.mlp.fc_out.weight", "transformer.h.24.attn.k_proj.weight", "transformer.h.24.attn.v_proj.weight", "transformer.h.24.attn.q_proj.weight", "transformer.h.24.attn.out_proj.weight", "transformer.h.24.mlp.fc_in.weight", "transformer.h.24.mlp.fc_out.weight", "transformer.h.25.attn.k_proj.weight", "transformer.h.25.attn.v_proj.weight", "transformer.h.25.attn.q_proj.weight", "transformer.h.25.attn.out_proj.weight", "transformer.h.25.mlp.fc_in.weight", "transformer.h.25.mlp.fc_out.weight", "transformer.h.26.attn.k_proj.weight", "transformer.h.26.attn.v_proj.weight", "transformer.h.26.attn.q_proj.weight", "transformer.h.26.attn.out_proj.weight", "transformer.h.26.mlp.fc_in.weight", "transformer.h.26.mlp.fc_out.weight", "transformer.h.27.attn.k_proj.weight", "transformer.h.27.attn.v_proj.weight", "transformer.h.27.attn.q_proj.weight", "transformer.h.27.attn.out_proj.weight", "transformer.h.27.mlp.fc_in.weight", "transformer.h.27.mlp.fc_out.weight".

AssertionError

File "/usr/local/lib/python3.9/dist-packages/datasets/load.py", line 1675, in load_dataset
builder_instance = load_dataset_builder(
File "/usr/local/lib/python3.9/dist-packages/datasets/load.py", line 1452, in load_dataset_builder
dataset_module = dataset_module_factory(
File "/usr/local/lib/python3.9/dist-packages/datasets/load.py", line 1177, in dataset_module_factory
raise e1 from None
File "/usr/local/lib/python3.9/dist-packages/datasets/load.py", line 1156, in dataset_module_factory
return HubDatasetModuleFactoryWithoutScript(
File "/usr/local/lib/python3.9/dist-packages/datasets/load.py", line 743, in init
assert self.name.count("/") == 1
AssertionError

When I use this command, python3 opt.py facebook/opt-125m c4, I get the above error.
Could you please help me solve this issue?

How should I verify the speedup effect of the algorithm?

Hi, thank you for your great work! It seems that GPTQ should lead to significant speedups for end-to-end inference, but after quantizing BLOOM-7B to INT8 with GPTQ, I found it twice as slow as the FP16 model. How can I get the speedup shown in the paper?
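One way to check this (a sketch; it assumes a Hugging Face-style causal LM and a (1, seq_len) prompt tensor already on the GPU) is to time batch-size-1 generation, since that memory-bound setting is where quantized matrix-vector kernels can pay off:

import time
import torch

@torch.no_grad()
def tokens_per_second(model, input_ids, new_tokens=64):
    # Time greedy generation with CUDA synchronization around the measured
    # region; without the sync calls the timing is meaningless.
    torch.cuda.synchronize()
    start = time.time()
    out = model.generate(input_ids, max_new_tokens=new_tokens, do_sample=False)
    torch.cuda.synchronize()
    generated = out.shape[1] - input_ids.shape[1]
    return generated / (time.time() - start)

Comparing this number between the FP16 and quantized models at the same prompt length gives a fairer picture than the wall-clock time of a whole script.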

How to run the quantized model for predictions on my prompts?

I am able to quantize the LLaMA 7B model to 4-bit. But how can I run it for my predictions? If I try the transformers library I get an error.

Python 3.10.12 (main, Jun 7 2023, 12:45:35) [GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

from transformers import AutoTokenizer, AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("llama_7b_4bit_2.bin")
Traceback (most recent call last):
File "/home/intel-spc/Documents/tarun/t2/tar/lib/python3.10/site-packages/transformers/configuration_utils.py", line 659, in _get_config_dict
config_dict = cls._dict_from_json_file(resolved_config_file)
File "/home/intel-spc/Documents/tarun/t2/tar/lib/python3.10/site-packages/transformers/configuration_utils.py", line 750, in _dict_from_json_file
text = reader.read()
File "/usr/lib/python3.10/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 64: invalid start byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "", line 1, in
File "/home/intel-spc/Documents/tarun/t2/tar/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 456, in from_pretrained
config, kwargs = AutoConfig.from_pretrained(
File "/home/intel-spc/Documents/tarun/t2/tar/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 944, in from_pretrained
config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
File "/home/intel-spc/Documents/tarun/t2/tar/lib/python3.10/site-packages/transformers/configuration_utils.py", line 574, in get_config_dict
config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
File "/home/intel-spc/Documents/tarun/t2/tar/lib/python3.10/site-packages/transformers/configuration_utils.py", line 662, in _get_config_dict
raise EnvironmentError(
OSError: It looks like the config file at 'llama_7b_4bit_2.bin' is not a valid JSON file.

LAMBADA evaluation accuracy

Hello, I've been experimenting with GPTQ and trying to replicate your LAMBADA zero-shot results. But I have been getting significantly lower accuracy (10-15% lower for OPT specifically) compared to the paper, even for the FP16 baseline. I'm using your pipeline based on LM evaluation harness. I was wondering if you have seen this before?

GPTQ on BERT based

Hi all,

I hope this message finds everyone well. I have read the paper and found a table that compares the performance of OBQ and GPTQ on a BERT-based model. Could anyone help me find the code or implementation of GPTQ on a BERT-based model? Thanks for your help.

ValueError: not enough values to unpack (expected 2, got 1)

Hello,
I followed your instructions and got a ValueError. Am I benchmarking correctly? Thank you.

CUDA_VISIBLE_DEVICES=0 python opt.py facebook/opt-125m c4 --wbits 3 --save opt125m-3bit.pt

CUDA_VISIBLE_DEVICES=0 python opt.py facebook/opt-125m c4 --load opt125m-3bit.pt --benchmark 128
Loading model ...
Done.
Found cached dataset json (/$HOME/.cache/huggingface/datasets/allenai___json/allenai--c4-6fbe877195f42de5/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e)
Found cached dataset json (/$HOME/.cache/huggingface/datasets/allenai___json/allenai--c4-efc3d4f4606f44bd/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e)
Benchmarking ...
Traceback (most recent call last):
File "/$HOME/gptq/opt.py", line 455, in
...
File "/$HOM/mambaforge/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py", line 637, in forward
batch_size, seq_length = input_shape
ValueError: not enough values to unpack (expected 2, got 1)

Application to GPT-J family

Congratulations on your achievement.

Can you give us some hints and recommendations for adapting the procedure in order to quantize the GPT-J model family?

Inference of the Quantised Model (OPT-13B)

Hey!
Huge congratulations on your achievement and thank you for sharing!
I am following the steps to quantise an OPT model (13B) that I have finetuned. I wish to serve this model for inference.
Will I simply be able to save the quantised model, and load it into the transformers library?

If not, what's the best way to do this?

All the very best

Question about the difference between the pseudocode and the implementation

The Hessian inverse information in your pseudocode is computed via a Cholesky decomposition of H's inverse. In the code, you apply cholesky first, then cholesky_inverse, and then cholesky again. I am not sure about the reason for the difference. Is the cholesky_inverse kernel necessary here? Can I just compute H's inverse directly and then apply cholesky?

Thank you so much.
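
For illustration (a sketch, not the repo's code): the two routes produce the same factor up to numerical error, since the Cholesky factor of an SPD matrix is unique; cholesky_inverse is mainly a cheap and numerically stable way to invert H once its factor is available:

import torch

torch.manual_seed(0)
A = torch.randn(64, 64, dtype=torch.float64)
H = A @ A.T + 64 * torch.eye(64, dtype=torch.float64)    # SPD stand-in for the Hessian

# Route described in the issue: cholesky -> cholesky_inverse -> cholesky (upper).
L = torch.linalg.cholesky(H)
Hinv = torch.cholesky_inverse(L)
U_code = torch.linalg.cholesky(Hinv, upper=True)

# Route in the pseudocode: invert H directly, then take the Cholesky factor.
U_direct = torch.linalg.cholesky(torch.linalg.inv(H), upper=True)

print(torch.allclose(U_code, U_direct, atol=1e-8))       # True: the factor is unique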

Testing the GPTQ method on CNN models containing grouped convolutions

Hi,
to support CNN models, I modified the GPTQ code as follows:
1. support for grouped convolutions;
2. symmetric quantization without a zero-point parameter.

However, I found the performance is not good on mobilenetv2/mnasnet1_0 models at 4-bit quantization.
Here are my results (top-1 accuracy, %):

model      | FP32  | GPTQ_W4 sym
mbv2       | 71.88 | 60.84 (84.64%)
mnasnet1_0 | 73.47 | 64.71 (88.08%)

I only saw resnet18/resnet50 quantization results in your paper; have you tested GPTQ on mobilenetv2/mnasnet1_0?

Looking forward to your reply...
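
For reference, a minimal sketch of the kind of symmetric quantizer described above (per-output-channel max-abs scaling is an assumption; this is not the issue author's actual code):

import torch

def quantize_sym(w, bits=4):
    # Symmetric (no zero point) fake-quantization of a 2-D weight matrix, per output channel.
    qmax = 2 ** (bits - 1) - 1                                       # e.g. 7 for 4-bit
    scale = (w.abs().amax(dim=1, keepdim=True) / qmax).clamp(min=1e-8)
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q * scale                                                 # dequantized weights

Depthwise layers in mobilenetv2/mnasnet1_0 have very few weights per output channel and tend to be more sensitive to 4-bit quantization than the resnet layers reported in the paper, which may partly explain the gap.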

pack_model takes too long time

I used auto_gptq to quantize a large language model whose transformer has 80 layers. I found that each layer needs almost 4 minutes to pack, so I have to wait several hours before the whole packing step finishes. Are there any suggestions for solving this problem? Can the packing step be sped up?

Why no update to Hinv

In the fasterquant function of gptq.py, there seems to be no update to Hinv during the quantization process. Can I know the intuition behind this? I got a bit lost in the part of the paper explaining how the introduction of the Cholesky decomposition eliminates the update of Hinv.

License issues

Hi, I've forked this repo but it has no license. Could you please add one? Thanks.

OpenCL Support

Please add OpenCL support so that the code can be used on GPUs that support OpenCL rather than CUDA.
Then we could use something like quant_opencl.cpp instead of quant_cuda.cpp.

Application to T5 / UL2 family

Do you expect this to work for the T5 architecture (and consequently the very similar UL2 family)? If not, what do you suspect would be the issue, and do you expect that some adjustments would need to be made?

Google recently released Flan-UL2 which is 20B parameters in size. GPTQ could be a real life-saver here.
