Interface does not work on CPU · autogptq · 19 comments · OPEN

Oxi84 commented on May 8, 2024
Interface does not work on CPU

Comments (19)

Oxi84 commented on May 8, 2024

I was checking the code, trying to fix it, but nothing works. Here is the content of the quantize config:

{
  "bits": 4,
  "group_size": 128,
  "damp_percent": 0.01,
  "desc_act": true,
  "sym": true,
  "true_sequential": true
}

Oxi84 commented on May 8, 2024

It works better when I remove the last 2 items from the JSON, but it still stops working when calling generate:

{
  "bits": 4,
  "group_size": 128,
  "damp_percent": 0.01,
  "desc_act": true
}

This is the error I get:
##############################
Traceback (most recent call last):
  File "/root/Documents/LiClipse Workspace/Lamma4bit/lalama4b.py", line 24, in <module>
    print(tokenizer.decode(model.generate(**tokenizer("Jane is", return_tensors="pt") )[0]))
  File "/root/anaconda3/envs/ai/lib/python3.8/site-packages/auto_gptq/modeling/_base.py", line 270, in generate
    return self.model.generate(**kwargs)
  File "/root/anaconda3/envs/ai/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/root/anaconda3/envs/ai/lib/python3.8/site-packages/transformers/generation/utils.py", line 1231, in generate
    self._validate_model_kwargs(model_kwargs.copy())
  File "/root/anaconda3/envs/ai/lib/python3.8/site-packages/transformers/generation/utils.py", line 1109, in _validate_model_kwargs
    raise ValueError(
ValueError: The following model_kwargs are not used by the model: ['token_type_ids'] (note: typos in the generate arguments will also show up in this list)

PanQiWei commented on May 8, 2024

TypeError: __init__() got an unexpected keyword argument 'sym'

It depends on which version you used: if <= v0.5.0, this argument is not supported in quantize_config; if you installed from the up-to-date source code, it should work.
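
To check which version is actually installed, a quick check like this should work (a source install via pip install . also registers package metadata):

pip show auto-gptq   # prints Name/Version of the installed package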

PanQiWei commented on May 8, 2024

ValueError: The following model_kwargs are not used by the model: ['token_type_ids'] (note: typos in the generate arguments will also show up in this list)

This means the tokenizer returns token_type_ids but **kwargs in model.generate doesn't accept it; you should remove it before passing the tokenized data into this method. You could also set return_token_type_ids=False in tokenizer.__call__.
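
For example, a minimal sketch of both options, reusing the model and tokenizer objects from the script above:

# Option 1: tell the tokenizer not to return token_type_ids at all
tokens = tokenizer("Jane is", return_tensors="pt", return_token_type_ids=False)
print(tokenizer.decode(model.generate(**tokens)[0]))

# Option 2: drop the key from the encoded inputs before calling generate
tokens = tokenizer("Jane is", return_tensors="pt")
tokens.pop("token_type_ids", None)
print(tokenizer.decode(model.generate(**tokens)[0]))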

Oxi84 commented on May 8, 2024

Thanks for the answer. This also works, without the token_type_ids problem:

input_ids = tokenizer(input_text, return_tensors="pt").input_ids
out = model.generate(input_ids=input_ids, max_length=20)

The speed is very slow on CPU, it takes about 10 minutes for 10 tokens, but I guess this is normal.

Input : The benefits of deadlifting are:
Output : Increased strength and muscle mass

Oxi84 commented on May 8, 2024

So I have tried it on GPU. It runs, but the output is 100% gibberish, and I have no idea why this happens. I did exactly as in the documentation. Also, the speed for 4-bit on a GPU is 2-3x slower than with fp16. Is it normal for it to be slower?

Install:

!pip install transformers
!git clone https://github.com/PanQiWei/AutoGPTQ.git
!pip install AutoGPTQ/

Quantize:

from transformers import AutoTokenizer, TextGenerationPipeline
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

print(1)
pretrained_model_dir = "huggyllama/llama-7b"
quantized_model_dir = "workspace/llama-4bit"


tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
example = tokenizer("auto_gptq is a useful tool that can automatically compress model into 4-bit or even higher rate by using GPTQ algorithm.", return_tensors="pt")

quantize_config = BaseQuantizeConfig(
    bits=4,  # quantize model to 4-bit
    group_size=128,  # it is recommended to set the value to 128
)

print(12)
# load un-quantized model, the model will always be force loaded into cpu
model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)

print(13)
# quantize model, the examples should be list of dict whose keys can only be "input_ids" and "attention_mask" 
# with value under torch.LongTensor type.
model.quantize([example], use_triton=False)
print(14)

Load on a GPU:

from transformers import AutoTokenizer, TextGenerationPipeline
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
pretrained_model_dir = "huggyllama/llama-7b"
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)

quantized_model_dir = "llama4bit"
model11 = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0", use_triton=False)

Run the text generation:

import time
timea = time.time()
input_text = "The benefits of deadlifting are:"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda:0")
out = model11.generate(input_ids=input_ids, max_length=100)
print(tokenizer.decode(out[0]))
print("took",-timea + time.time())

The result:

<s> The benefits of deadlifting are:Webachivendor BegriffsklärlisPrefix Dragonskyrilledominument Agencyferrerзовilen BoyscottingÙ Dez Collegadoionaopus zewnętrzipagegiaandeniernogreenilo PremiumitulslantпраniaelescontribetersZeroardihelmoportulibernste MasculATE counterharedients ==>vat Chiefурruspenas Float zewnętrzipagegiaandeniernogreenilo PremiumitulslantпраniaelescontribetersZeroardihelmoportulibernste MasculATE counterhared
took 7.574721097946167

Result with the non-quantized LLaMA:

<s> The benefits of deadlifting are:
Increased strength and muscle mass
Improved posture and core strength
Improved grip strength
Improved balance and coordination
Improved bone density and joint health
Improved athletic performance and recovery
Improved mood and self-esteem
Improved ability to perform daily tasks
Improved ability to perform work tasks
Improved ability to perform sports
Im
took 2.3102331161499023

PanQiWei commented on May 8, 2024

For the bad performance of the quantized model, I think it's because you used just one text to quantize. I would recommend using more than a hundred or even a thousand samples, and they should come from the dataset that was originally used to train the model.
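
For example, a rough sketch of a larger calibration set; the wikitext dataset and sample count here are placeholders rather than anything used in this thread, and the tokenizer and model objects are assumed from the quantization snippet above:

from datasets import load_dataset

# build a calibration set of a few hundred real text samples instead of a single sentence
data = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
texts = [t for t in data["text"] if len(t.strip()) > 200][:256]
examples = [tokenizer(t, return_tensors="pt", truncation=True, max_length=2048) for t in texts]

model.quantize(examples, use_triton=False)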

For the speed, it's normal that the quantized model is slower than the original one, and inference using multiple GPUs will further slow things down because of the communication (data transfer) between them.

Oxi84 commented on May 8, 2024

Thanks for the answer. Actually, I think there is something wrong with the GPU version. I tested 4-bit on CPU and on GPU: CPU is fine, while the GPU output does not make sense. I wrote the exact code above so it can be replicated; I will try on Google Colab as well.

Here is a Google Colab notebook where I tried OPT-125M, which also gives a bad response: https://colab.research.google.com/drive/1QQ7f-eI_k3YbO-b5Qq8yX7ub1QCY5mxK#scrollTo=3yMOQr9pk3kO

CPU:

 Input : The benefits of deadlifting are:
 Output : Increased strength and muscle mass

GPU:

  <s> The benefits of deadlifting are:Webachivendor BegriffsklärlisPrefix Dragonskyrilledominument Agencyferrerзовilen BoyscottingÙ Dez Collegadoionaopus zewnętrzipagegiaandeniernogreenilo PremiumitulslantпраniaelescontribetersZeroardihelmoportulibernste MasculATE counterharedients ==>vat Chiefурruspenas Float zewnętrzipagegiaandeniernogreenilo PremiumitulslantпраniaelescontribetersZeroardihelmoportulibernste MasculATE counterhared

TheBloke commented on May 8, 2024

What model are you testing with that gives this garbage output? Is it still quantized_model_dir = "workspace/llama-4bit"?

Please do your testing with #43 if you're not already, as that contains a number of important compatibility and performance fixes.

Even with #43 there are still some compatibility issues with older models. I'm going to do a report on that today or tomorrow so they can be investigated by the devs.

Oxi84 commented on May 8, 2024

How do I install #43? It should be something like this:

!git clone https://github.com/PanQiWei/AutoGPTQ/pull/43   --quiet
!pip install AutoGPTQ/pull/43/  --quiet
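
(A pull request can't be cloned by its URL like that; GitHub exposes pull requests under refs/pull/<id>/head, so something along these lines should work instead:)

!git clone https://github.com/PanQiWei/AutoGPTQ.git --quiet
!git -C AutoGPTQ fetch origin pull/43/head:pr-43
!git -C AutoGPTQ checkout pr-43
!pip install AutoGPTQ/ --quiet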

I tried LLaMA 7B and have now also tried OPT on Google Colab.
Here is the Colab notebook link with an example.

https://colab.research.google.com/drive/1QQ7f-eI_k3YbO-b5Qq8yX7ub1QCY5mxK#scrollTo=3yMOQr9pk3kO

Oxi84 commented on May 8, 2024

@TheBloke you can see the Colab now, I changed the sharing settings.

TheBloke commented on May 8, 2024

OK, I'm looking. I'll get back to you in a bit.

TheBloke commented on May 8, 2024

OK, I have looked into it and you are right: I cannot get any useful result from the test code. It always seems to just print the last token over and over again.

@PanQiWei the example code in the README is producing the following output. I've tested it on two separate Linux systems: one with pip install auto-gptq, the other with pip install . from the latest repo (without the PR).

Python code (copied from README):

from transformers import AutoTokenizer, TextGenerationPipeline
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

pretrained_model_dir = "facebook/opt-125m"
quantized_model_dir = "opt-125m-4bit"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
examples = [
    tokenizer(
        "auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm."
    )
]

quantize_config = BaseQuantizeConfig(
    bits=4,  # quantize model to 4-bit
    group_size=128,  # it is recommended to set the value to 128
)

# load un-quantized model, by default, the model will always be loaded into CPU memory
model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)

# quantize model, the examples should be list of dict whose keys can only be "input_ids" and "attention_mask"
model.quantize(examples, use_triton=False)

# save quantized model
model.save_quantized(quantized_model_dir)

# save quantized model using safetensors
model.save_quantized(quantized_model_dir, use_safetensors=True)

# load quantized model to the first GPU
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0", use_triton=False)

# inference with model.generate
print(tokenizer.decode(model.generate(**tokenizer("auto_gptq is", return_tensors="pt").to("cuda:0"))[0]))

# or you can also use pipeline
pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer)
print(pipeline("auto-gptq is")[0]["generated_text"])

Output:

</s>auto_gptq is is is is is is is is is is is is is is

auto-gptq is is is is is is is is is is is is is is is

The result is the same with every prompt I've tried.

By comparison, the base model works OK:

from transformers import AutoModelForCausalLM
model_base = AutoModelForCausalLM.from_pretrained(pretrained_model_dir).to("cuda:0")

print(tokenizer.decode(model_base.generate(**tokenizer("auto_gptq is", return_tensors="pt").to("cuda:0"))[0]))

Output:

</s>auto_gptq is a good one.
I've tried auto_gptq

Oxi84 commented on May 8, 2024

About 2-3 days ago I installed from source on a CPU-only machine, and inference worked. I don't know what version it was, but I know that this version did not support sym: true or true_sequential: true, so I had to delete them from the JSON config.

Input : The benefits of deadlifting are:
Output : Increased strength and muscle mass

So something is broken in the latest version, in the file that handles inference for all the models. The part that does quantization is likely fine.

PanQiWei commented on May 8, 2024

Hi, many things have been added to the main branch since last week, can you help check if the problem still exists? ❤️

Oxi84 commented on May 8, 2024

Hello, I will do so in a few days and update. I will also follow up on the nonsense results that I get after quantization on a GPU.

This problem with running on CPU is not really a problem, because CPU inference is extremely slow anyway; you can always run bf16 on some CPUs, which saves memory and works well on newer CPUs.
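
(For reference, a minimal sketch of plain bf16 loading on CPU with transformers, reusing pretrained_model_dir from the earlier snippet; how usable it is depends on the CPU's bfloat16 support:)

import torch
from transformers import AutoModelForCausalLM

# load the unquantized model on CPU in bfloat16: roughly half the memory of fp32
model_bf16 = AutoModelForCausalLM.from_pretrained(pretrained_model_dir, torch_dtype=torch.bfloat16)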

Oxi84 commented on May 8, 2024

One question: did you update the pip package with those changes?

PanQiWei commented on May 8, 2024

The new version is planned to be released in about two weeks; for now you need to install from source to try the new features and optimizations.

PanQiWei commented on May 8, 2024

Hello, I will do so in a few days and update. I will also follow up on the nonsense results that I get after quantization on a GPU.

This problem with running on CPU is not really a problem, because CPU inference is extremely slow anyway; you can always run bf16 on some CPUs, which saves memory and works well on newer CPUs.

Supporting inference purely on CPU is not in the current feature plan of auto-gptq. If you want to save memory, you can consider using a CPU offload strategy, which is now fully supported.
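
For example, a rough sketch of such an offload setup; the max_memory argument follows the accelerate-style convention, and whether your installed auto-gptq version's from_quantized accepts it is an assumption to verify, not something confirmed in this thread:

# hypothetical split: keep roughly 4 GiB of quantized layers on GPU 0, offload the rest to CPU RAM
model = AutoGPTQForCausalLM.from_quantized(
    quantized_model_dir,
    max_memory={0: "4GIB", "cpu": "24GIB"},  # placeholder budgets; adjust to your hardware
    use_triton=False,
)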
