
An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.

License: MIT License

Languages: Python 64.85%, C++ 4.54%, CUDA 30.39%, C 0.12%, Dockerfile 0.10%, Makefile 0.01%
Topics: transformers, deep-learning, inference, large-language-models, llms, nlp, pytorch, quantization, transformer

autogptq's Introduction

AutoGPTQ

An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm (weight-only quantization).


English | 中文

News or Update

  • 2024-02-15 - (News) - AutoGPTQ 0.7.0 is released, with support for the Marlin int4*fp16 matrix multiplication kernel; pass use_marlin=True when loading models.
  • 2023-08-23 - (News) - 🤗 Transformers, optimum and peft have integrated auto-gptq, so running and training GPTQ models is now more accessible to everyone! See this blog and its resources for more details!

For the full history of news and updates, please turn to here.

Performance Comparison

Inference Speed

The results are generated using this script; the input batch size is 1, the decoding strategy is beam search, the model is forced to generate 512 tokens, and the speed metric is tokens/s (the larger, the better).

The quantized model is loaded using the setup that gives the fastest inference speed.

| model | GPU | num_beams | fp16 | gptq-int4 |
|---|---|---|---|---|
| llama-7b | 1xA100-40G | 1 | 18.87 | 25.53 |
| llama-7b | 1xA100-40G | 4 | 68.79 | 91.30 |
| moss-moon 16b | 1xA100-40G | 1 | 12.48 | 15.25 |
| moss-moon 16b | 1xA100-40G | 4 | OOM | 42.67 |
| moss-moon 16b | 2xA100-40G | 1 | 06.83 | 06.78 |
| moss-moon 16b | 2xA100-40G | 4 | 13.10 | 10.80 |
| gpt-j 6b | 1xRTX3060-12G | 1 | OOM | 29.55 |
| gpt-j 6b | 1xRTX3060-12G | 4 | OOM | 47.36 |

Perplexity

For a perplexity comparison, you can turn to here and here.

Installation

AutoGPTQ is available on Linux and Windows only. You can install the latest stable release of AutoGPTQ from pip with pre-built wheels:

| CUDA/ROCm version | Installation | Built against PyTorch |
|---|---|---|
| CUDA 11.8 | pip install auto-gptq --no-build-isolation --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/ | 2.2.1+cu118 |
| CUDA 12.1 | pip install auto-gptq --no-build-isolation | 2.2.1+cu121 |
| ROCm 5.7 | pip install auto-gptq --no-build-isolation --extra-index-url https://huggingface.github.io/autogptq-index/whl/rocm573/ | 2.2.1+rocm5.7 |

AutoGPTQ can be installed with the Triton dependency via pip install auto-gptq[triton] --no-build-isolation in order to use the Triton backend (currently Linux only; 3-bit quantization is not supported with Triton).

For older AutoGPTQ, please refer to the previous releases installation table.

On NVIDIA systems, AutoGPTQ does not support Maxwell or lower GPUs.

Install from source

Clone the source code:

git clone https://github.com/PanQiWei/AutoGPTQ.git && cd AutoGPTQ

A few packages are required in order to build from source: pip install numpy gekko pandas.

Then, install locally from source:

pip install -vvv --no-build-isolation -e .

You can set BUILD_CUDA_EXT=0 to disable building the PyTorch CUDA extension, but this is strongly discouraged as AutoGPTQ then falls back to a slow Python implementation.
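For example, a CPU-only editable install would then look like this (same pattern as the ROCm example below; a sketch, not a recommended setup):

BUILD_CUDA_EXT=0 pip install -vvv --no-build-isolation -e .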

As a last resort, if the above command fails, you can try python setup.py install.

On ROCm systems

To install from source for AMD GPUs supporting ROCm, please specify the ROCM_VERSION environment variable. Example:

ROCM_VERSION=5.6 pip install -vvv --no-build-isolation -e .

The compilation can be sped up by specifying the PYTORCH_ROCM_ARCH variable (reference) in order to build for a single target device, for example gfx90a for MI200 series devices.
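For example, to build only for MI200 series devices (a sketch; adjust the architecture and ROCm version to your system):

PYTORCH_ROCM_ARCH=gfx90a ROCM_VERSION=5.6 pip install -vvv --no-build-isolation -e .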

For ROCm systems, the packages rocsparse-dev, hipsparse-dev, rocthrust-dev, rocblas-dev and hipblas-dev are required to build.

Quick Tour

Quantization and Inference

Warning: this is just a showcase of the basic APIs in AutoGPTQ. It uses only one sample to quantize a very small model; the quality of a quantized model produced with so few samples may not be good.

Below is an example for the simplest use of auto_gptq to quantize a model and inference after quantization:

from transformers import AutoTokenizer, TextGenerationPipeline
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
import logging

logging.basicConfig(
    format="%(asctime)s %(levelname)s [%(name)s] %(message)s", level=logging.INFO, datefmt="%Y-%m-%d %H:%M:%S"
)

pretrained_model_dir = "facebook/opt-125m"
quantized_model_dir = "opt-125m-4bit"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
examples = [
    tokenizer(
        "auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm."
    )
]

quantize_config = BaseQuantizeConfig(
    bits=4,  # quantize model to 4-bit
    group_size=128,  # it is recommended to set this value to 128
    desc_act=False,  # setting to False can significantly speed up inference, but perplexity may be slightly worse
)

# load un-quantized model, by default, the model will always be loaded into CPU memory
model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)

# quantize model; the examples should be a list of dicts whose keys can only be "input_ids" and "attention_mask"
model.quantize(examples)

# save quantized model
model.save_quantized(quantized_model_dir)

# save quantized model using safetensors
model.save_quantized(quantized_model_dir, use_safetensors=True)

# push quantized model to Hugging Face Hub.
# to use use_auth_token=True, Login first via huggingface-cli login.
# or pass an explicit token with: use_auth_token="hf_xxxxxxx"
# (uncomment the following three lines to enable this feature)
# repo_id = f"YourUserName/{quantized_model_dir}"
# commit_message = f"AutoGPTQ model for {pretrained_model_dir}: {quantize_config.bits}bits, gr{quantize_config.group_size}, desc_act={quantize_config.desc_act}"
# model.push_to_hub(repo_id, commit_message=commit_message, use_auth_token=True)

# alternatively you can save and push at the same time
# (uncomment the following three lines to enable this feature)
# repo_id = f"YourUserName/{quantized_model_dir}"
# commit_message = f"AutoGPTQ model for {pretrained_model_dir}: {quantize_config.bits}bits, gr{quantize_config.group_size}, desc_act={quantize_config.desc_act}"
# model.push_to_hub(repo_id, save_dir=quantized_model_dir, use_safetensors=True, commit_message=commit_message, use_auth_token=True)

# load quantized model to the first GPU
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0")

# download quantized model from Hugging Face Hub and load to the first GPU
# model = AutoGPTQForCausalLM.from_quantized(repo_id, device="cuda:0", use_safetensors=True, use_triton=False)

# inference with model.generate
print(tokenizer.decode(model.generate(**tokenizer("auto_gptq is", return_tensors="pt").to(model.device))[0]))

# or you can also use pipeline
pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer)
print(pipeline("auto-gptq is")[0]["generated_text"])

For more advanced features of model quantization, please refer to this script.

Customize Model

Below is an example of extending `auto_gptq` to support the `OPT` model; as you will see, it's very easy:
from auto_gptq.modeling import BaseGPTQForCausalLM


class OPTGPTQForCausalLM(BaseGPTQForCausalLM):
    # chained attribute name of transformer layer block
    layers_block_name = "model.decoder.layers"
    # chained attribute names of other nn modules that are at the same level as the transformer layer block
    outside_layer_modules = [
        "model.decoder.embed_tokens", "model.decoder.embed_positions", "model.decoder.project_out",
        "model.decoder.project_in", "model.decoder.final_layer_norm"
    ]
    # chained attribute names of linear layers in transformer layer module
    # normally, there are four sub-lists; the modules in each one can be seen as one operation,
    # and the order should be the order in which they are actually executed. In this case (and usually in most cases),
    # they are: attention q_k_v projection, attention output projection, MLP input projection, MLP output projection
    inside_layer_modules = [
        ["self_attn.k_proj", "self_attn.v_proj", "self_attn.q_proj"],
        ["self_attn.out_proj"],
        ["fc1"],
        ["fc2"]
    ]

After this, you can use OPTGPTQForCausalLM.from_pretrained and the other methods shown in the basic usage above.
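A minimal usage sketch for the class defined above, mirroring the Quick Tour (the checkpoint facebook/opt-125m and the one-sample calibration set are only illustrative):

from transformers import AutoTokenizer
from auto_gptq import BaseQuantizeConfig

pretrained_model_dir = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
examples = [tokenizer("auto-gptq makes GPTQ quantization easy.")]  # tiny illustrative calibration set

quantize_config = BaseQuantizeConfig(bits=4, group_size=128)

# use the custom class in place of AutoGPTQForCausalLM
model = OPTGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
model.quantize(examples)
model.save_quantized("opt-125m-4bit-custom")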

Evaluation on Downstream Tasks

You can use the tasks defined in auto_gptq.eval_tasks to evaluate a model's performance on a specific downstream task before and after quantization.

The predefined tasks support all causal language models implemented in 🤗 transformers and in this project.

Below is an example of evaluating `EleutherAI/gpt-j-6b` on a sequence-classification task using the `cardiffnlp/tweet_sentiment_multilingual` dataset:
from functools import partial

import datasets
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from auto_gptq.eval_tasks import SequenceClassificationTask


MODEL = "EleutherAI/gpt-j-6b"
DATASET = "cardiffnlp/tweet_sentiment_multilingual"
TEMPLATE = "Question:What's the sentiment of the given text? Choices are {labels}.\nText: {text}\nAnswer:"
ID2LABEL = {
    0: "negative",
    1: "neutral",
    2: "positive"
}
LABELS = list(ID2LABEL.values())


def ds_refactor_fn(samples):
    text_data = samples["text"]
    label_data = samples["label"]

    new_samples = {"prompt": [], "label": []}
    for text, label in zip(text_data, label_data):
        prompt = TEMPLATE.format(labels=LABELS, text=text)
        new_samples["prompt"].append(prompt)
        new_samples["label"].append(ID2LABEL[label])

    return new_samples


#  model = AutoModelForCausalLM.from_pretrained(MODEL).eval().half().to("cuda:0")
model = AutoGPTQForCausalLM.from_pretrained(MODEL, BaseQuantizeConfig())
tokenizer = AutoTokenizer.from_pretrained(MODEL)

task = SequenceClassificationTask(
        model=model,
        tokenizer=tokenizer,
        classes=LABELS,
        data_name_or_path=DATASET,
        prompt_col_name="prompt",
        label_col_name="label",
        **{
            "num_samples": 1000,  # how many samples will be sampled to evaluation
            "sample_max_len": 1024,  # max tokens for each sample
            "block_max_len": 2048,  # max tokens for each data block
            # function to load the dataset; it must accept only data_name_or_path as input
            # and return datasets.Dataset
            "load_fn": partial(datasets.load_dataset, name="english"),
            # function to preprocess dataset, which is used for datasets.Dataset.map,
            # must return Dict[str, list] with only two keys: [prompt_col_name, label_col_name]
            "preprocess_fn": ds_refactor_fn,
            # truncate the label when a sample's length exceeds sample_max_len
            "truncate_prompt": False
        }
    )

# note that max_new_tokens will be automatically specified internally based on given classes
print(task.run())

# self-consistency
print(
    task.run(
        generation_config=GenerationConfig(
            num_beams=3,
            num_return_sequences=3,
            do_sample=True
        )
    )
)

Learn More

tutorials provide step-by-step guidance to integrate auto_gptq with your own project and some best practice principles.

examples provide plenty of example scripts to use auto_gptq in different ways.

Supported Models

You can compare model.config.model_type with the list below to check whether the model you are using is supported by auto_gptq.

For example, the model_type of WizardLM, Vicuna and GPT4All is llama, hence they are all supported by auto_gptq (see the sketch after the list below).

The supported model types are listed below; the notes refer to the peft-adaption_prompt integration:

  • bloom
  • gpt2
  • gpt_neox (requires this peft branch)
  • gptj (requires this peft branch)
  • llama
  • moss (requires this peft branch)
  • opt
  • gpt_bigcode
  • codegen
  • falcon (RefinedWebModel/RefinedWeb)
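A quick way to perform this check (a sketch; the repository id is taken from an example elsewhere on this page):

from transformers import AutoConfig

config = AutoConfig.from_pretrained("TheBloke/vicuna-13B-1.1-GPTQ-4bit-128g")
print(config.model_type)  # prints "llama", which appears in the list above, so the model is supported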

Supported Evaluation Tasks

Currently, auto_gptq supports: LanguageModelingTask, SequenceClassificationTask and TextSummarizationTask; more Tasks will come soon!

Running tests

Tests can be run with:

pytest tests/ -s

FAQ

Which kernel is used by default?

By default, AutoGPTQ uses the ExLlamaV2 int4*fp16 kernel for matrix multiplication.
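A loading sketch: with no kernel-related arguments, the fastest available kernel (ExLlamaV2 on supported GPUs) is selected automatically. The disable_exllamav2 flag in the commented line is an assumption about recent from_quantized signatures; verify it against your installed version.

from auto_gptq import AutoGPTQForCausalLM

# default load: the ExLlamaV2 int4*fp16 kernel is used where available
model = AutoGPTQForCausalLM.from_quantized("opt-125m-4bit", device="cuda:0")

# assumed flag for falling back to another kernel; check your auto-gptq version before using it
# model = AutoGPTQForCausalLM.from_quantized("opt-125m-4bit", device="cuda:0", disable_exllamav2=True)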

How to use Marlin kernel?

Marlin is an optimized int4*fp16 kernel recently proposed at https://github.com/IST-DASLab/marlin. It is integrated in AutoGPTQ and used when loading a model with use_marlin=True. This kernel is available only on devices with compute capability 8.0 or 8.6 (Ampere GPUs).
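A minimal loading sketch, assuming the opt-125m-4bit directory produced in the Quick Tour (any GPTQ checkpoint works) and an Ampere GPU:

from auto_gptq import AutoGPTQForCausalLM

# requires AutoGPTQ >= 0.7.0 and a GPU with compute capability 8.0/8.6
model = AutoGPTQForCausalLM.from_quantized("opt-125m-4bit", device="cuda:0", use_marlin=True)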

Acknowledgement

  • Special thanks to Elias Frantar, Saleh Ashkboos, Torsten Hoefler and Dan Alistarh for proposing the GPTQ algorithm and open-sourcing the code, and for releasing the Marlin kernel for mixed-precision computation.
  • Special thanks to qwopqwop200; the quantization-related code in this project is mainly referenced from GPTQ-for-LLaMa.
  • Special thanks to turboderp for releasing the ExLlama and ExLlama v2 libraries with efficient mixed-precision kernels.

autogptq's People

Contributors

alex4321, casper-hansen, cczhong11, duchengyao, egortolmachev, fxmarty, geekinglcq, jeromeku, jllllll, justinlin610, kodai2199, laaza, leiwang1999, marisakirisame, mgoin, mikeshi80, oobabooga, panqiwei, ph0rk0z, qubitium, qwopqwop200, robertgshaw2-neuralmagic, sciumo, seungrokj, shades-en, sunmarc, techxgenus, thebloke, vivekkhandelwal1, z80maniac


autogptq's Issues

Can't install it

Hello,

I tried to install via:

sudo BUILD_CUDA_EXT=0 /root/anaconda3/envs/ai/bin/pip install auto-gptq
OSError: CUDA_HOME environment variable is not set. Please set it to your CUDA install root.

This is a CPU machine without any CUDA, but as far as I understand, using BUILD_CUDA_EXT=0 should skip the CUDA part of the install.

[API breaking change notice] refactor of the from_quantized API in the faster-llama branch

This is just a notification to anyone who already uses the faster-llama branch in their projects.

If you are using the faster-llama branch at commit f159aea or any later commit, you may have to change your code, as there are some breaking changes in the from_quantized API.

  • fused_attn and fused_mlp: renamed to inject_fused_attention and inject_fused_mlp to improve readability. These two arguments are available for all models supported in auto-gptq, but for now they will be ignored (with a logged warning) if set to True, except for llama.
  • device: the default value of device changed from "cpu" to None. If device, device_map and max_memory are all None, device_map will be set to "auto", which means multiple GPUs will be used by default if they are available. So if you want to load a quantized model onto a specific device, you must set device manually and keep device_map and max_memory as None (see the sketch after this list).
  • warmup_triton: added so that one can set it to False to skip the warmup procedure when using Triton; hopefully this speeds up model loading, but the effect is still unknown as I haven't tested it yet.
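A minimal sketch of the two loading modes described above (the quantized_model_dir name is an illustrative assumption):

from auto_gptq import AutoGPTQForCausalLM

quantized_model_dir = "opt-125m-4bit"  # assumed example directory

# load onto one specific GPU: set device and leave device_map/max_memory as None
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:1")

# or let the model be spread across all available GPUs automatically
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device_map="auto")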

Anyone who encounters bugs, or behavior that does not match the changes described above, when trying the new API can report them in this issue. 🥂

Submit your project to PyPI

Hi there. I want to use your module in my new GPT4All-ui:
https://github.com/nomic-ai/gpt4all-ui

It would be easier and more elegant if you could provide your tool as an installable package on PyPI.
All you need to do is create a PyPI account, then use twine to upload your package to PyPI.

Then you can install it using: pip install your_package_name

Thanks

This looks awesome!

I don't have any issue, but I just saw this repo and wanted to say that this looks really cool and very useful.

For the last week I have been making GPTQ models for various Llamas (Alpaca, Koala, Vicuna, etc). The GPTQ-for-LLaMa code is excellent, but it is also quite complicated and opaque, and I wouldn't even know where to begin in expanding it to support other models.

So for example now I would like to try GPTQ on the recently released GPT4ALL-J model but GPTQ-for-LLaMa does not support this, and the one repo that does mention GPTQ for GPT-J has hardly been updated in six months.

So now I'm really excited to try your code. And in particular I think it could be really good if oobabooga implemented your code in text-generation-webui. Currently he implements GPTQ-for-LLaMa directly, but because the GPTQ-for-LLaMa code is changing rapidly, this causes various problems. Ooba made his own fork for stability, but this does not support the latest GPTQ methods; specifically, --act-order. If a user wants to use --act-order, they have to link qwopqwop's latest GPTQ-for-LlaMa in instead. Which is fine.. until qwopqwop does a refactor and then it breaks everything :)

So right now we have to choose between stability in the UI without latest features, or latest features without stability. And as someone who is providing GPTQs to users, it's really hard to communicate all that to everyone (especially when half of them don't even read the README..)

So I can see a real benefit to having an easy-to-use intermediate layer like AutoGPTQ which can be easily integrated into inference clients/UIs, fine tuning code, etc.

I will be watching this project with interest! Let me know if I can do any QA for you.

Add StableLM From StabilityAI (Same format as GPT-NeoX)

Note: It's the same format as GPT-NeoX, so it will likely work already.

Just need to add explicit references to it for ease of use as well as update the ReadMe for visibility.


GitHub Repo

(Checkpoint links are Hugging Face repos with model weights)

| Size | StableLM-Base-Alpha | StableLM-Tuned-Alpha | Training Tokens | Web Demo |
|---|---|---|---|---|
| 3B | checkpoint | checkpoint | 800B | |
| 7B | checkpoint | checkpoint | 800B | HuggingFace |
| 15B | (in progress) | (pending) | 1.5T | |
| 30B | (in progress) | (pending) | 1.5T | |
| 65B | (in progress) | (pending) | 1.5T | |
| 175B | (planned) | | | |

AutoGPTQ in textgen

I made a PR in oobabooga/text-generation-webui for adding AutoGPTQ inference.

I would appreciate feedback on the implementation from the perspective of this repo. Also there are a few issues that might be caused by AutoGPTQ itself.

support user customized `device_map`

It seems like device_map does not offload anything to CPU if it's constructed manually.

Here's an example:

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
import torch

model_path = "/opt/models/vicuna-13B-1.1-GPTQ-4bit-128g"

device_map = {'model.embed_tokens': 0, 'model.norm': 0, 'lm_head': 0, 'model.layers.0': 0, 'model.layers.1': 'cpu', 'model.layers.2': 'cpu', 'model.layers.3': 'cpu', 'model.layers.4': 'cpu', 'model.layers.5': 'cpu', 'model.layers.6': 'cpu', 'model.layers.7': 'cpu', 'model.layers.8': 'cpu', 'model.layers.9': 'cpu', 'model.layers.10': 'cpu', 'model.layers.11': 'cpu', 'model.layers.12': 'cpu', 'model.layers.13': 'cpu', 'model.layers.14': 'cpu', 'model.layers.15': 'cpu', 'model.layers.16': 'cpu', 'model.layers.17': 'cpu', 'model.layers.18': 'cpu', 'model.layers.19': 'cpu', 'model.layers.20': 'cpu', 'model.layers.21': 'cpu', 'model.layers.22': 'cpu', 'model.layers.23': 'cpu', 'model.layers.24': 'cpu', 'model.layers.25': 'cpu', 'model.layers.26': 'cpu', 'model.layers.27': 'cpu', 'model.layers.28': 'cpu', 'model.layers.29': 'cpu', 'model.layers.30': 'cpu', 'model.layers.31': 'cpu', 'model.layers.32': 'cpu', 'model.layers.33': 'cpu', 'model.layers.34': 'cpu', 'model.layers.35': 'cpu', 'model.layers.36': 'cpu', 'model.layers.37': 'cpu', 'model.layers.38': 'cpu', 'model.layers.39': 'cpu'}

quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
)

full_gpu = True

if full_gpu:
    model = AutoGPTQForCausalLM.from_quantized(
        model_path,
        device="cuda:0",
        use_safetensors=True,
        quantize_config=quantize_config,
        model_basename="vicuna-13B-1.1-GPTQ-4bit-128g.latest"
    )
else:
    model = AutoGPTQForCausalLM.from_quantized(
        model_path,
        device="cpu",
        use_safetensors=True,
        quantize_config=quantize_config,
        model_basename="vicuna-13B-1.1-GPTQ-4bit-128g.latest",
        device_map=device_map
    )

mem_gb = round(torch.cuda.memory_allocated(0) / 1000 / 1000 / 1000)
print(f"USED VRAM: {mem_gb}GB")

The model is https://huggingface.co/TheBloke/vicuna-13B-1.1-GPTQ-4bit-128g. The device_map is constructed so that only the first model layer is on GPU, and the rest is supposed to be on CPU. The last two lines measure the used VRAM.

When full_gpu = True, everything is on GPU, and I get this:

USED VRAM: 7GB

which is expected.

But now, I set full_gpu = False and the device map is used. However, at the end I get the same result:

USED VRAM: 7GB

I double-checked that the device_map is actually used, but it seems like it doesn't offload anything. Am I missing something?

Recent pull 23 generating jammed sentence output with new quantized neox20b 4bit model

#23

neox-20b 4-bit models quantized with the above generate jammed sentences, as in the example below.

The smell of tobacco smoke in theseemingly ceaseless breeze which swept through during these conversationswas unmistakable evidence of his presence to anyone who heard that faintpunctual pungency overrode any other possible olfactory suggestion; butRobert could sense without really seeing more than once how he etc.

The previous main version, using the same seed, generates the above correctly as:

The smell of tobacco smoke in the seemingly ceaseless breeze which swept through during these conversations was unmistakable evidence of his presence to anyone who heard that faint punctual pungency overrode any other possible olfactory suggestion; but Robert could sense without really seeing more than once how he etc.

Issue with positional params with `BaseGPTQForCausalLM.forward()`

Hi

I am trying my new perplexity calculation on AutoGPTQ for the first time, and hitting this error:

│ /workspace/TB_ppl/auto_gptq/eval_tasks/perplexity.py:88 in run                                   │
│                                                                                                  │
│    85 │   │   │   │   │   tokens[0][batch_start] = self._tokenizer.bos_token_id                  │
│    86 │   │   │   │                                                                              │
│    87 │   │   │   │   with torch.no_grad():                                                      │
│ ❱  88 │   │   │   │   │   outputs = self._model(tokens[:, batch_start:batch_start+batch_size])   │
│    89 │   │   │   │   │   batch_logits = outputs.logits.float()                                  │
│    90 │   │   │   │                                                                              │
│    91 │   │   │   │   tokens[0][batch_start] = token_org                                         │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1501 in _call_impl            │
│                                                                                                  │
│   1498 │   │   if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks   │
│   1499 │   │   │   │   or _global_backward_pre_hooks or _global_backward_hooks                   │
│   1500 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1501 │   │   │   return forward_call(*args, **kwargs)                                          │
│   1502 │   │   # Do not call functions when jit is used                                          │
│   1503 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1504 │   │   backward_pre_hooks = []                                                           │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
TypeError: BaseGPTQForCausalLM.forward() takes 1 positional argument but 2 were given

This is the issue: outputs = self._model(tokens)

Looking at modeling/_base.py all that's needed to get this call working is:

    def forward(self, *args, **kwargs):
        return self.model(*args, **kwargs)

Is that OK to PR? Just want to check there's not going to be any side effects I've not thought of.

Plan for providing pre-compiled binaries from PyPi for pip install?

Hi @PanQiWei

I was wondering what your project plan was for providing pre-compiled binaries via PyPi?

oobabooga mentioned this is a key issue for integration in text-gen-UI.

Do you have a thought as to when you would like to look at this? E.g. whether for 0.2.0, or beyond?

If you're not already looking at it, I'd be happy to look at the process and try a Github Actions type setup. I could do that this weekend.

Let me know.

`use_triton=True` always chooses the first device

https://github.com/PanQiWei/AutoGPTQ/blob/144bd804369264998dce2ca5b97a365d01a725a6/auto_gptq/modeling/_base.py#L489-L493

When you set use_triton=True, the device is automatically set to cuda:0. But is it really necessary to set it to the first device? Let's assume I'm passing device="cuda:1" and use_triton=True to from_quantized. After that, the device will be changed from cuda:1 to cuda:0. But can it remain cuda:1?

Maybe change the device only if it's cpu (see the sketch below)?

I don't have multiple GPUs and I don't know much about Triton, so I'm not sure if this is a bug or not.
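A minimal sketch of the change suggested above (hypothetical, not the library's actual code; the real logic lives in auto_gptq/modeling/_base.py at the linked lines):

# only fall back to the first GPU when no CUDA device was explicitly requested
if use_triton and (device is None or str(device) == "cpu"):
    device = "cuda:0"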

bloom quantize problems

error info:
model.quantize([example])
File "/usr/local/lib/python3.7/dist-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/ansible/online/operator/udf_pod_git/chat-gpt/AutoGPTQ/auto_gptq/modeling/_base.py", line 189, in quantize
layer(layer_input, **additional_layer_inputs)[0][0].cpu()
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
TypeError: forward() missing 1 required positional argument: 'alibi'

Does BLOOM's positional encoding (alibi) need special handling here?

Cuda devices unavailable

Attempting to load models with the latest auto-gptq fails to use the CUDA devices:

The safetensors archive passed at E:\MLModels\llm\llama\alpaca-lora-65B-GPTQ-4bit\alpaca-lora-65B-GPTQ-4bit-128g.safetensors does not contain metadata. Make sure to save your model with the `save_pretrained` method. Defaulting to 'pt' metadata.
Python.Runtime.PythonException: SafetensorError : device cuda:0 is invalid
['  File "C:\\ProgramData\\Miniconda3\\envs\\hfllm\\lib\\site-packages\\auto_gptq\\modeling\\auto.py", line 63, in from_quantized\n    return GPTQ_CAUSAL_LM_MODEL_MAP[model_type].from_quantized(\n', '  File "C:\\ProgramData\\Miniconda3\\envs\\hfllm\\lib\\site-packages\\auto_gptq\\modeling\\_base.py", line 544, in from_quantized\n    model = accelerate.load_checkpoint_and_dispatch(\n', '  File "C:\\ProgramData\\Miniconda3\\envs\\hfllm\\lib\\site-packages\\accelerate\\big_modeling.py", line 479, in load_checkpoint_and_dispatch\n    load_checkpoint_in_model(\n', '  File "C:\\ProgramData\\Miniconda3\\envs\\hfllm\\lib\\site-packages\\accelerate\\utils\\modeling.py", line 971, in load_checkpoint_in_model\n    checkpoint = load_state_dict(checkpoint_file, device_map=device_map)\n', '  File "C:\\ProgramData\\Miniconda3\\envs\\hfllm\\lib\\site-packages\\accelerate\\utils\\modeling.py", line 832, in load_state_dict\n    return safe_load_file(checkpoint_file, device=devices[0])\n', '  File "C:\\ProgramData\\Miniconda3\\envs\\hfllm\\lib\\site-packages\\safetensors\\torch.py", line 99, in load_file\n    with safe_open(filename, framework="pt", device=device) as f:\n']   at Python.Runtime.PyObject.Invoke(PyTuple args, PyDict kw)
   at Python.Runtime.PyObject.InvokeMethod(String name, PyTuple args, PyDict kw)
   at Python.Runtime.PyObject.TryInvokeMember(InvokeMemberBinder binder, Object[] args, Object& result)
   at CallSite.Target(Closure , CallSite , Object , String , Boolean , String , Object , Boolean , Object )
   at System.Dynamic.UpdateDelegates.UpdateAndExecute7[T0,T1,T2,T3,T4,T5,T6,TRet](CallSite site, T0 arg0, T1 arg1, T2 arg2, T3 arg3, T4 arg4, T5 arg5, T6 arg6)
   at airtistServer.Chat.Models.Llama.Alpaca4bit65b.CreateModel() in D:\AIrtist\AIrtist-discord-bot\airtistServer\Chat\Models\Llama\Alpaca4bit65b.cs:line 50
   at airtistServer.Chat.Models.AbstractModel.InitModel(String rootModelPath, PythonContext _pyCtx, Object torchDevice, Int32 pipeDevice) in D:\AIrtist\AIrtist-discord-bot\airtistServer\Chat\Models\AbstractModel.cs:line 53
   at System.Dynamic.UpdateDelegates.UpdateAndExecuteVoid5[T0,T1,T2,T3,T4](CallSite site, T0 arg0, T1 arg1, T2 arg2, T3 arg3, T4 arg4)
   at airtistServer.Jobs.GPTJob.ProcessRequest(QueueData qd) in D:\AIrtist\AIrtist-discord-bot\airtistServer\Jobs\GPTJob.cs:line 210
Unhandled exception. System.IndexOutOfRangeException: Index was outside the bounds of the array.
   at airtistServer.Jobs.GPTJob.ProcessRequest(QueueData qd) in D:\AIrtist\AIrtist-discord-bot\airtistServer\Jobs\GPTJob.cs:line 232
   at airtistServer.Jobs.GPTJob.Test(String prompt, Int32 maxlen) in D:\AIrtist\AIrtist-discord-bot\airtistServer\Jobs\GPTJob.cs:line 83
   at Program.<Main>$(String[] args) in D:\AIrtist\AIrtist-discord-bot\GPTServer\Program.cs:line 34

I am running on Windows using Miniconda, with the latest Hugging Face transformers and tokenizers.
This is the code used to attempt to load the model through auto-gptq;
in this case, torchDevice is cuda:1 or cuda:0, it doesn't matter which I select.

config = AutoGPTQ!.BaseQuantizeConfig(bits: 4, group_size: 128, desc_act: false );

modle = AutoGPTQ!.AutoGPTQForCausalLM.from_quantized(ModelRootPath + ModelPath + "\\", use_safetensors: true, model_basename: "alpaca-lora-65B-GPTQ-4bit-128g", device: TorchDevice, use_triton: false, quantize_config: config);

The CUDA devices are available in the same script, as I can load models via the Hugging Face pipeline onto my CUDA devices with no issues.

Any guidance would be appreciated.

max_memory and offload_folder options not working for big models

Hi,

I have a GeForce RTX 3060 GPU with 12GB VRAM. I am able to load models of up to 3B parameters and quantize them, however I am running into trouble when I try to load 6B-parameter or bigger models.

Here are the GPU details:

$ nvidia-smi 
Sun May 14 15:33:10 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3060         On | 00000000:01:00.0 Off |                  N/A |
|  0%   40C    P8               14W / 170W|      1MiB / 12288MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

I try to load them using

pretrained_model_dir = "EleutherAI/gpt-j-6b"
quantized_model_dir = "EleutherAI/gpt-j-6b-4bit-128g"

quantize_config = BaseQuantizeConfig(
    bits=4,  # quantize model to 4-bit
    group_size=128,  # it is recommended to set the value to 128
)

max_memory={0: "8GiB"}

# load un-quantized model, by default, the model will always be loaded into CPU memory
model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config, max_memory=max_memory, offload_folder="offload")

# quantize model, the examples should be list of dict whose keys can only be "input_ids" and "attention_mask"
model.quantize(examples, use_triton=False)

However, I get the error:

---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
Cell In[6], line 11
      8 model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config, max_memory=max_memory, offload_folder="offload")
     10 # quantize model, the examples should be list of dict whose keys can only be "input_ids" and "attention_mask"
---> 11 model.quantize(examples, use_triton=False)
     13 # save quantized model
     14 model.save_quantized(quantized_model_dir)

File /opt/anaconda3/envs/autogptq/lib/python3.9/site-packages/torch/utils/_contextlib.py:115, in context_decorator..decorate_context(*args, **kwargs)
    112 @functools.wraps(func)
    113 def decorate_context(*args, **kwargs):
    114     with ctx_factory():
--> 115         return func(*args, **kwargs)

File /opt/anaconda3/envs/autogptq/lib/python3.9/site-packages/auto_gptq/modeling/_base.py:220, in BaseGPTQForCausalLM.quantize(self, examples, batch_size, use_triton, use_cuda_fp16, autotune_warmup_after_quantized, cache_examples_on_gpu)
    218     ori_outside_layer_module_devices[module_name] = get_device(module)
    219     if module is not None:
--> 220         move_to_device(module, cur_layer_device)
    222 # get inputs for first layer
    223 layers[0] = LayerHijacker(layers[0], cur_layer_device)

File /opt/anaconda3/envs/autogptq/lib/python3.9/site-packages/auto_gptq/modeling/_utils.py:24, in move_to_device(obj, device)
     22 def move_to_device(obj: Union[torch.Tensor, nn.Module], device: torch.device):
     23     if get_device(obj) != device:
---> 24         obj = obj.to(device)
     25     return obj

File /opt/anaconda3/envs/autogptq/lib/python3.9/site-packages/torch/nn/modules/module.py:1145, in Module.to(self, *args, **kwargs)
   1141         return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None,
   1142                     non_blocking, memory_format=convert_to_format)
   1143     return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
-> 1145 return self._apply(convert)

File /opt/anaconda3/envs/autogptq/lib/python3.9/site-packages/torch/nn/modules/module.py:820, in Module._apply(self, fn)
    816 # Tensors stored in modules are graph leaves, and we don't want to
    817 # track autograd history of `param_applied`, so we have to use
    818 # `with torch.no_grad():`
    819 with torch.no_grad():
--> 820     param_applied = fn(param)
    821 should_use_set_data = compute_should_use_set_data(param, param_applied)
    822 if should_use_set_data:

File /opt/anaconda3/envs/autogptq/lib/python3.9/site-packages/torch/nn/modules/module.py:1143, in Module.to..convert(t)
   1140 if convert_to_format is not None and t.dim() in (4, 5):
   1141     return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None,
   1142                 non_blocking, memory_format=convert_to_format)
-> 1143 return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)

NotImplementedError: Cannot copy out of meta tensor; no data!

No matter what memory I specify in max_memory, I get the same error.

What am I missing?

Why do we need qzeros?

I notice that both backends store the zero points as qzeros, which confuses me because the zero points are float16 values and I do not understand the benefit of that compressed format.
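For context, a minimal sketch of why zero points can be bit-packed at all: in asymmetric GPTQ quantization, each group's zero point is itself an integer in [0, 2^bits), so, like the quantized weights, it can be packed into 32-bit words instead of being kept in float16. This sketch is illustrative only, not the library's actual packing code:

bits = 4
zeros = [3, 7, 12, 0, 15, 8]  # per-group integer zero points

# pack eight 4-bit zero points per 32-bit word, the same idea as packing qweight
packed = [0] * ((len(zeros) + 7) // 8)
for i, z in enumerate(zeros):
    packed[i // 8] |= z << (bits * (i % 8))

# at dequantization time the integer zero is unpacked and w ≈ scale * (q - zero)
print([hex(word) for word in packed])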

Comprehensive benchmarking of AutoGPTQ; Triton vs CUDA; vs old 'ooba' GfL

Over on @LaaZa's text-gen-ui PR there has been discussion of inference performance.

In particular, people are concerned that recent versions of GPTQ-for-LLaMa have performed worse than the older versions. This is one reason why many people still use ooba's fork of GfL.

In order to help people understand the improvements AutoGPTQ has made, and also to help @PanQiWei and @qwopqwop200 look at potential performance differences in AutoGPTQ, I have compiled a list of benchmarks.

I have compared AutoGPTQ CUDA, AutoGPTQ Triton and the old GfL ooba fork with CUDA.

I've compared act-order/desc_act vs not, with and without streaming in text-gen-UI, and with/without the fused_attn and fused_mlp parameters.

I've not done absolutely every possible permutation of all those params. I only ran a few tests with streaming enabled in text-gen-ui as it always performs much worse. But I think I have enough here to get a good overview.

I will be posting these figures in the text-gen-ui PR thread as well.

Other benchmarks to do

To these results we could also add the latest GPTQ-for-LLaMa Triton and CUDA figures. I already benchmarked those yesterday and they performed the same as or worse than AutoGPTQ. But later today I will run those benchmarks again and add them to the spreadsheet in this new format.

I would also like to test with some weaker/smaller GPUs, to see how performance might vary with less GPU processing available. And I'd also like to test on some larger models, to see if there is any difference in performance delta with varying model sizes.

Benchmarks of: ooba CUDA; AutoGPTQ CUDA; AutoGPTQ Triton

Implementations tested

Test system

Test method

  • All benchmarking done in text-gen-ui, using the output time and token/s reported by it.
  • text-gen-ui restarted between each test.
  • Output limit = 512 tokens.
  • 'Default' parameter set used + 'ban eos token' set so as to always get 512 tokens returned
  • Run one test and discard results as warm up, then record 4 results.

Results spreadsheet and overview charts

Spreadsheet

Google sheets with results and charts

Charts

benchmark no-stream, model: no-act-order

benchmark no-stream, model: act-order

A chart showing streaming performance is in the spreadsheet.

Description of results

AutoGPTQ vs 'ooba' CUDA with --no-stream

  • AutoGPTQ CUDA outperforms GfL 'ooba' CUDA by 15% on a no-act-order model
  • AutoGPTQ CUDA outperforms GfL 'ooba' CUDA by 10% on an act-order model (comparing AutoGPTQ on act-order model to GfL on no-act-order model)
  • AutoGPTQ Triton is 5% slower than GfL 'ooba' CUDA
  • AutoGPTQ Triton is 20% slower than AutoGPTQ CUDA

AutoGPTQ vs 'ooba' CUDA with streaming (no-act-order model)

  • AutoGPTQ CUDA outperforms GfL 'ooba' CUDA by 12%
  • AutoGPTQ CUDA outperforms AutoGPTQ Triton by 13%
  • AutoGPTQ Triton outperforms GfL 'ooba' CUDA by 15%
    • Interesting that with streaming on, Triton does better than GfL CUDA.

desc_act models vs non-desc_act models

  • AutoGPTQ CUDA is 4.5% slower on a desc_act model vs not
  • AutoGPTQ Triton has no performance difference between desc_act model vs not
  • AutoGPTQ CUDA records significantly higher GPU usage % on desc_act models
    • 80% usage with desc_act + fused_attn vs 30% with no-desc_act model
    • This might be a problem on weaker cards? That needs testing.
  • AutoGPTQ Triton has only a few % extra GPU usage with desc_act models.

fused_attn and fused_mlp

  • AutoGPTQ CUDA: fused_attn increases performance by 20%
    • This seems to account for nearly all the performance difference between AutoGPTQ CUDA and ooba GfL CUDA
  • AutoGPTQ Triton: fused_mlp on its own increases performance by 15%
  • AutoGPTQ Triton: fused_attn on its own increases performance by 26%
  • AutoGPTQ Triton: fused_mlp and fused_attn together increases performance over no/no by 48%

Slow loading time with AutoGPTQ Triton

  • AutoGPTQ Triton takes significantly longer to load a model vs CUDA
    • I haven't recorded benchmarks for this yet, but from looking through my logs I see:
      • CUDA: 2 -3 seconds
      • Triton: 40-45 seconds (with or without fused_mlp)
      • Triton + fused_attn: 55 - 90 seconds
    • I'll add this to the benchmark table later.

Results table

| Implementation | Method | Streaming | Model type | fused_attn | fused_mlp | GPU usage max % | VRAM max after 512 tok | Avg tokens/s | Run 1 tokens/s | Run 2 tokens/s | Run 3 tokens/s | Run 4 tokens/s |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ooba GfL | CUDA | No | no-act-order | N/A | N/A | 25% | 5837 | 23.84 | 23.86 | 23.91 | 23.70 | 23.90 |
| AutoGPTQ | CUDA | No | no-act-order | No | N/A | 24% | 6711 | 22.66 | 22.63 | 22.63 | 22.78 | 22.61 |
| AutoGPTQ | CUDA | No | no-act-order | Yes | N/A | 28% | 6849 | 27.22 | 27.23 | 27.33 | 27.33 | 27.00 |
| AutoGPTQ | Triton | No | no-act-order | No | No | 27% | 6055 | 15.25 | 15.25 | 15.25 | 15.29 | 15.20 |
| AutoGPTQ | Triton | No | no-act-order | No | Yes | 30% | 6691 | 17.48 | 17.52 | 17.51 | 17.43 | 17.47 |
| AutoGPTQ | Triton | No | no-act-order | Yes | No | 30% | 6013 | 19.33 | 19.37 | 19.42 | 19.29 | 19.24 |
| AutoGPTQ | Triton | No | no-act-order | Yes | Yes | 34% | 6649 | 22.58 | 22.19 | 22.67 | 22.70 | 22.75 |
| AutoGPTQ | CUDA | No | act-order | No | N/A | 64% | 6059 | 20.35 | 20.38 | 20.42 | 20.31 | 20.30 |
| AutoGPTQ | CUDA | No | act-order | Yes | N/A | 80% | 6079 | 26.02 | 26.12 | 26.15 | 26.18 | 25.61 |
| AutoGPTQ | Triton | No | act-order | No | No | 30% | 6057 | 15.39 | 15.47 | 15.30 | 15.35 | 15.42 |
| AutoGPTQ | Triton | No | act-order | No | Yes | 33% | 6691 | 17.48 | 17.54 | 17.53 | 17.38 | 17.48 |
| AutoGPTQ | Triton | No | act-order | Yes | No | 33% | 6013 | 19.55 | 19.56 | 19.59 | 19.51 | 19.55 |
| AutoGPTQ | Triton | No | act-order | Yes | Yes | 38% | 6649 | 22.86 | 22.86 | 23.01 | 22.98 | 22.57 |
| ooba GfL | CUDA | Yes | no-act-order | N/A | N/A | 17% | 5837 | 14.93 | 14.86 | 14.77 | 15.05 | 15.05 |
| AutoGPTQ | CUDA | Yes | no-act-order | No | N/A | 20% | 6711 | 16.85 | 16.94 | 16.84 | 16.87 | 16.76 |
| AutoGPTQ | CUDA | Yes | no-act-order | Yes | No | 22% | 6849 | 19.55 | 19.87 | 19.40 | 19.50 | 19.41 |
| AutoGPTQ | Triton | Yes | no-act-order | Yes | Yes | 27% | 6429 | 17.19 | 17.08 | 17.38 | 17.15 | 17.14 |
| AutoGPTQ | Triton | Yes | act-order | No | No | 25% | 6055 | 12.43 | 12.39 | 12.46 | 12.40 | 12.47 |
| AutoGPTQ | Triton | Yes | act-order | Yes | Yes | 33% | 6649 | 17.19 | 17.03 | 17.03 | 17.26 | 17.42 |

Benchmark logs

Benchmark logs in full

ooba GPTQ-for-LLaMA CUDA no streaming (--no-stream). no-act-order model. no fused_attn

Command

python server.py --model wiz-no-act --wbits 4 --groupsize 128 --model_type llama  --listen --no-stream

Benchmark

GPU usage max: 25%, VRAM idle: 6037, VRAM after 512 tokens: 5837

Output generated in 21.46 seconds (23.86 tokens/s, 512 tokens, context 16, seed 104167586)
Output generated in 21.41 seconds (23.91 tokens/s, 512 tokens, context 16, seed 448558865)
Output generated in 21.60 seconds (23.70 tokens/s, 512 tokens, context 16, seed 816202521)
Output generated in 21.42 seconds (23.90 tokens/s, 512 tokens, context 16, seed 63649370)

ooba GPTQ-for-LLaMA CUDA with streaming. no-act-order model. no fused_attn

Command

python server.py --model wiz-no-act --wbits 4 --groupsize 128 --model_type llama  --listen

Benchmark

GPU usage max: 17%, VRAM idle: 5247, VRAM after 512 tokens: 5837

Output generated in 34.39 seconds (14.86 tokens/s, 511 tokens, context 16, seed 572742302)
Output generated in 34.60 seconds (14.77 tokens/s, 511 tokens, context 16, seed 677465334)
Output generated in 33.95 seconds (15.05 tokens/s, 511 tokens, context 16, seed 1685629937)
Output generated in 33.95 seconds (15.05 tokens/s, 511 tokens, context 16, seed 1445023832)

AutoGPTQ CUDA no streaming (--no-stream). no-act-order model. fused_attn enabled

Command

python server.py --model wiz-no-act  --autogptq  --listen --quant_attn --wbits 4 --groupsize 128 --model_type llama --no-stream

Benchmark

GPU usage max: 28%, VRAM idle: 6849, VRAM after 512 tokens: 6849

Output generated in 18.81 seconds (27.23 tokens/s, 512 tokens, context 16, seed 1130150188)
Output generated in 18.74 seconds (27.33 tokens/s, 512 tokens, context 16, seed 939013757)
Output generated in 18.73 seconds (27.33 tokens/s, 512 tokens, context 16, seed 1724107769)
Output generated in 18.97 seconds (27.00 tokens/s, 512 tokens, context 16, seed 54252597)

AutoGPTQ CUDA with streaming. no-act-order model. fused_attn enabled

Command

python server.py --model wiz-no-act  --autogptq  --listen --quant_attn --wbits 4 --groupsize 128 --model_type llama 

Benchmark

GPU usage max: 22%, VRAM idle: 5437, VRAM after 512 tokens: 6849

Output generated in 25.71 seconds (19.87 tokens/s, 511 tokens, context 16, seed 1472734050)
Output generated in 26.33 seconds (19.40 tokens/s, 511 tokens, context 16, seed 1285036592)
Output generated in 26.20 seconds (19.50 tokens/s, 511 tokens, context 16, seed 938935319)
Output generated in 26.32 seconds (19.41 tokens/s, 511 tokens, context 16, seed 2142008394)

AutoGPTQ CUDA no streaming (--no-stream). no-act-order model. no fused_attn

Command

python server.py --model wiz-no-act  --autogptq  --listen  --wbits 4 --groupsize 128 --model_type llama --no-stream

Benchmark

GPU usage max: 24%, VRAM idle: 6711, VRAM after 512 tokens: 6711

Output generated in 22.63 seconds (22.63 tokens/s, 512 tokens, context 16, seed 1551481428)
Output generated in 22.63 seconds (22.63 tokens/s, 512 tokens, context 16, seed 1993869704)
Output generated in 22.48 seconds (22.78 tokens/s, 512 tokens, context 16, seed 596462747)
Output generated in 22.64 seconds (22.61 tokens/s, 512 tokens, context 16, seed 619504695)

AutoGPTQ CUDA with streaming. no-act-order model. no fused_attn

Command

python server.py --model wiz-no-act  --autogptq  --listen  --wbits 4 --groupsize 128 --model_type llama 

Benchmark

GPU usage max: 20%, VRAM idle: 5277, VRAM after 512 tokens: 6711

Output generated in 30.16 seconds (16.94 tokens/s, 511 tokens, context 16, seed 709588940)
Output generated in 30.34 seconds (16.84 tokens/s, 511 tokens, context 16, seed 574596607)
Output generated in 30.30 seconds (16.87 tokens/s, 511 tokens, context 16, seed 16071815)
Output generated in 30.48 seconds (16.76 tokens/s, 511 tokens, context 16, seed 1202346043)

AutoGPTQ CUDA no streaming (--no-stream). act-order / desc_act model. fused_attn=yes

Command

python server.py --model wiz-act  --autogptq  --listen --quant_attn --wbits 4 --groupsize 128 --model_type llama --no-stream

Benchmark

GPU usage max: 80%, VRAM idle: 6077, VRAM after 512 tokens: 6079

Output generated in 19.60 seconds (26.12 tokens/s, 512 tokens, context 16, seed 1857860293)
Output generated in 19.58 seconds (26.15 tokens/s, 512 tokens, context 16, seed 616647949)
Output generated in 19.56 seconds (26.18 tokens/s, 512 tokens, context 16, seed 1384039801)
Output generated in 19.99 seconds (25.61 tokens/s, 512 tokens, context 16, seed 411623614)

AutoGPTQ CUDA no streaming (--no-stream). act-order / desc_act model. fused_attn=no

Command

python server.py --model wiz-act  --autogptq  --listen --wbits 4 --groupsize 128 --model_type llama --no-stream

Benchmark

GPU usage max: 64%, VRAM idle: 6059, VRAM after 512 tokens: 6059

Output generated in 25.12 seconds (20.38 tokens/s, 512 tokens, context 16, seed 1777836493)
Output generated in 25.07 seconds (20.42 tokens/s, 512 tokens, context 16, seed 349075793)
Output generated in 25.21 seconds (20.31 tokens/s, 512 tokens, context 16, seed 188931785)
Output generated in 25.22 seconds (20.30 tokens/s, 512 tokens, context 16, seed 485419750)

AutoGPTQ Triton no streaming (--no-stream). no-act-order model. fused_attn=yes. fused_mlp=yes

Command

python server.py --model wiz-no-act  --autogptq --autogptq-triton --fused_mlp --quant_attn  --listen  --wbits 4 --groupsize 128 --model_type llama --no-stream

Benchmark

GPU usage max: 34%, VRAM idle: 6649, VRAM after 512 tokens: 6649

Output generated in 23.07 seconds (22.19 tokens/s, 512 tokens, context 16, seed 1396024982)
Output generated in 22.59 seconds (22.67 tokens/s, 512 tokens, context 16, seed 1322798716)
Output generated in 22.56 seconds (22.70 tokens/s, 512 tokens, context 16, seed 935785726)
Output generated in 22.50 seconds (22.75 tokens/s, 512 tokens, context 16, seed 2135223819)

AutoGPTQ Triton with streaming. no-act-order model. fused_attn=yes. fused_mlp=yes

Command

python server.py --model wiz-no-act  --autogptq --autogptq-triton --fused_mlp --quant_attn  --listen  --wbits 4 --groupsize 128 --model_type llama 

Benchmark

GPU usage max: 27%, VRAM idle: 6299, VRAM after 512 tokens: 6429

Output generated in 29.92 seconds (17.08 tokens/s, 511 tokens, context 16, seed 1687853126)
Output generated in 29.40 seconds (17.38 tokens/s, 511 tokens, context 16, seed 1796675019)
Output generated in 29.79 seconds (17.15 tokens/s, 511 tokens, context 16, seed 1342449921)
Output generated in 29.81 seconds (17.14 tokens/s, 511 tokens, context 16, seed 1283884954)

AutoGPTQ Triton no streaming (--no-stream). no-act-order model. fused_attn=no. fused_mlp=no

Command

python server.py --model wiz-no-act  --autogptq --autogptq-triton  --listen  --wbits 4 --groupsize 128 --model_type llama --no-stream

Benchmark

GPU usage max: 27%, VRAM idle: 6055, VRAM after 512 tokens: 6055

Output generated in 33.57 seconds (15.25 tokens/s, 512 tokens, context 16, seed 1071469137)
Output generated in 33.58 seconds (15.25 tokens/s, 512 tokens, context 16, seed 1554707022)
Output generated in 33.48 seconds (15.29 tokens/s, 512 tokens, context 16, seed 588803760)
Output generated in 33.69 seconds (15.20 tokens/s, 512 tokens, context 16, seed 719688473)

AutoGPTQ Triton no streaming (--no-stream). no-act-order model. fused_attn=no. fused_mlp=yes

Command

python server.py --model wiz-no-act  --autogptq --autogptq-triton --fused_mlp  --listen  --wbits 4 --groupsize 128 --model_type llama --no-stream

Benchmark

GPU usage max: 30%, VRAM idle: 6691, VRAM after 512 tokens: 6691

Output generated in 29.23 seconds (17.52 tokens/s, 512 tokens, context 16, seed 1413673599)
Output generated in 29.24 seconds (17.51 tokens/s, 512 tokens, context 16, seed 2120666307)
Output generated in 29.38 seconds (17.43 tokens/s, 512 tokens, context 16, seed 2057265550)
Output generated in 29.32 seconds (17.47 tokens/s, 512 tokens, context 16, seed 1082953773)

AutoGPTQ Triton no streaming (--no-stream). no-act-order model. fused_attn=yes. fused_mlp=no

Command

python server.py --model wiz-no-act  --autogptq --autogptq-triton --quant_attn  --listen  --wbits 4 --groupsize 128 --model_type llama --no-stream

Benchmark

GPU usage max: 30%, VRAM idle: 6013, VRAM after 512 tokens: 6013

Output generated in 26.43 seconds (19.37 tokens/s, 512 tokens, context 16, seed 1512231234)
Output generated in 26.36 seconds (19.42 tokens/s, 512 tokens, context 16, seed 2018026458)
Output generated in 26.54 seconds (19.29 tokens/s, 512 tokens, context 16, seed 1882161798)
Output generated in 26.61 seconds (19.24 tokens/s, 512 tokens, context 16, seed 1512440780)

AutoGPTQ Triton no streaming (--no-stream). act-order/desc_act model. fused_attn=yes. fused_mlp=yes

Command

python server.py --model wiz-act  --autogptq --autogptq-triton --fused_mlp --quant_attn  --listen  --wbits 4 --groupsize 128 --model_type llama --no-stream

Benchmark

GPU usage max: 38%, VRAM idle: 6649, VRAM after 512 tokens: 6649

Output generated in 22.40 seconds (22.86 tokens/s, 512 tokens, context 16, seed 1359206825)
Output generated in 22.25 seconds (23.01 tokens/s, 512 tokens, context 16, seed 609149608)
Output generated in 22.28 seconds (22.98 tokens/s, 512 tokens, context 16, seed 226374340)
Output generated in 22.68 seconds (22.57 tokens/s, 512 tokens, context 16, seed 1070157383)

AutoGPTQ Triton with streaming. act-order/desc_act model. fused_attn=yes. fused_mlp=yes

Command

python server.py --model wiz-act  --autogptq --autogptq-triton --fused_mlp --quant_attn  --listen  --wbits 4 --groupsize 128 --model_type llama 

Benchmark

GPU usage max: 33%, VRAM idle: 6299, VRAM after 512 tokens: 6649

Output generated in 30.00 seconds (17.03 tokens/s, 511 tokens, context 16, seed 456349974)
Output generated in 30.00 seconds (17.03 tokens/s, 511 tokens, context 16, seed 767092960)
Output generated in 29.61 seconds (17.26 tokens/s, 511 tokens, context 16, seed 381684718)
Output generated in 29.33 seconds (17.42 tokens/s, 511 tokens, context 16, seed 283294303)

AutoGPTQ Triton no streaming (--no-stream). act-order/desc_act model. fused_attn=no. fused_mlp=yes

Command

python server.py --model wiz-act  --autogptq --autogptq-triton --fused_mlp  --listen  --wbits 4 --groupsize 128 --model_type llama --no-stream

Benchmark

GPU usage max:33%, VRAM idle: 6691, VRAM after 512 tokens: 6691

Output generated in 29.19 seconds (17.54 tokens/s, 512 tokens, context 16, seed 1575265983)
Output generated in 29.21 seconds (17.53 tokens/s, 512 tokens, context 16, seed 1616043283)
Output generated in 29.47 seconds (17.38 tokens/s, 512 tokens, context 16, seed 1647334679)
Output generated in 29.29 seconds (17.48 tokens/s, 512 tokens, context 16, seed 256676128)

AutoGPTQ Triton no streaming (--no-stream). act-order/desc_act model. fused_attn=yes. fused_mlp=no

Command

python server.py --model wiz-act  --autogptq --autogptq-triton --quant_attn  --listen  --wbits 4 --groupsize 128 --model_type llama --no-stream

Benchmark

GPU usage max:33%, VRAM idle: 6013, VRAM after 512 tokens: 6013

Output generated in 26.18 seconds (19.56 tokens/s, 512 tokens, context 16, seed 289490511)
Output generated in 26.13 seconds (19.59 tokens/s, 512 tokens, context 16, seed 2123553925)
Output generated in 26.24 seconds (19.51 tokens/s, 512 tokens, context 16, seed 563248868)
Output generated in 26.19 seconds (19.55 tokens/s, 512 tokens, context 16, seed 1773520422)

AutoGPTQ Triton no streaming (--no-stream). act-order/desc_act model. fused_attn=no. fused_mlp=no

Command

python server.py --model wiz-act  --autogptq --autogptq-triton  --listen  --wbits 4 --groupsize 128 --model_type llama --no-stream

Benchmark

GPU usage max:30%, VRAM idle: 6057, VRAM after 512 tokens: 6057

Output generated in 33.09 seconds (15.47 tokens/s, 512 tokens, context 16, seed 1881763981)
Output generated in 33.47 seconds (15.30 tokens/s, 512 tokens, context 16, seed 83555537)
Output generated in 33.36 seconds (15.35 tokens/s, 512 tokens, context 16, seed 332008224)
Output generated in 33.20 seconds (15.42 tokens/s, 512 tokens, context 16, seed 657280485)

AutoGPTQ Triton with streaming. act-order/desc_act model. fused_attn=no. fused_mlp=no

Command

python server.py --model wiz-act  --autogptq --autogptq-triton  --listen  --wbits 4 --groupsize 128 --model_type llama

Benchmark

GPU usage max: 25%, VRAM idle: 5503, VRAM after 512 tokens: 6055

Output generated in 41.23 seconds (12.39 tokens/s, 511 tokens, context 16, seed 1164743843)
Output generated in 41.02 seconds (12.46 tokens/s, 511 tokens, context 16, seed 509370735)
Output generated in 41.21 seconds (12.40 tokens/s, 511 tokens, context 16, seed 246113358)
Output generated in 40.99 seconds (12.47 tokens/s, 511 tokens, context 16, seed 667851869)

About GPTQ reproducibility

I have now confirmed that this issue is caused by differences in two places.

  1. use symmetric quantization
  2. use true sequential (If you don't want to use it, you can implement it like this)
    inside_layer_modules = [
        ["self_attn.k_proj", "self_attn.v_proj", "self_attn.q_proj",
         "self_attn.out_proj",
         "fc1",
         "fc2"]
    ]

With the following changes, I succeeded in making AutoGPTQ behave identically to GPTQ-for-LLaMa. However, these changes seem to significantly reduce performance, at least for opt-125m. (Note: GPTQ-for-LLaMa also has bad performance.)
AutoGPTQ: 29.87
not using true sequential: 29.70
not using true sequential + symmetric quantization: 132.69
GPTQ-for-LLaMa: 29.20
save and load GPTQ-for-LLaMa: 132.69
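
For anyone trying to reproduce this comparison, and assuming a version where both settings are exposed on BaseQuantizeConfig (they are also serialized into quantize_config.json), a minimal sketch of toggling them follows; the particular True/False values are only illustrative, pick them according to the comparison above.

from auto_gptq import BaseQuantizeConfig

# The two settings discussed above, toggled explicitly; the values shown are
# an example, not a recommendation given the perplexity numbers above.
quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    sym=True,               # symmetric quantization
    true_sequential=False,  # true sequential
)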

Inference Issue with .half() data type on CPU

Hi! I have an issue related to inference for a model quantized with autoGPTQ.

The docs/README of the AutoGPTQ library make it seem like it is possible to load and run inference directly on CPU (regardless of whether the original quantization, on GPU, was done with Triton), for example because the default device is CPU, etc. However, I get an error with the .half() data type on CPU, and I can't seem to find a way around it.

Here's the error message I get when trying to generate text using the generate() method of a pre-trained AutoGPTQ model:

RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'

I think the issue might be related to this line of code, but I'm not sure. Can you confirm if it's possible to run inference directly on CPU with AutoGPTQ, and if so, how to do it? Or is this a bug that needs to be fixed?

I was trying to quantize this gpt-neox based model, then run inference on CPU (inference on GPU works fine). At first I thought the issue was related to the original model being stored in float16, but after trying some models stored in float32 to start with, that is not the case.

Thanks for the great work on the AutoGPTQ library, it's super cool!
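
For what it's worth, the underlying limitation can be reproduced without AutoGPTQ at all. Whether the snippet below raises depends on the PyTorch version, since half-precision addmm has historically not been implemented on CPU, which appears to be what the x.half() matmul in qlinear.py runs into:

import torch

# Half-precision addmm on CPU: unsupported in older PyTorch builds.
a = torch.randn(2, 4, dtype=torch.float16)
b = torch.randn(4, 3, dtype=torch.float16)
bias = torch.zeros(2, 3, dtype=torch.float16)
try:
    torch.addmm(bias, a, b)
except RuntimeError as e:
    print(e)  # "addmm_impl_cpu_" not implemented for 'Half'
else:
    print("half addmm works on this PyTorch build")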

detailed error log

>>> outputs = model.generate(
...     **inputs,
...     penalty_alpha=0.6,
...     top_k=4,
...     temperature=0.7,
...     do_sample=True,
...     max_new_tokens=128,
...     # length_penalty=0.9,
...     pad_token_id=model.config.eos_token_id
... )
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/pszemraj/miniconda3/envs/quant/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/pszemraj/miniconda3/envs/quant/lib/python3.10/site-packages/transformers/generation/utils.py", line 1485, in generate
    return self.sample(
  File "/home/pszemraj/miniconda3/envs/quant/lib/python3.10/site-packages/transformers/generation/utils.py", line 2524, in sample
    outputs = self(
  File "/home/pszemraj/miniconda3/envs/quant/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/pszemraj/miniconda3/envs/quant/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/pszemraj/miniconda3/envs/quant/lib/python3.10/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py", line 662, in forward
    outputs = self.gpt_neox(
  File "/home/pszemraj/miniconda3/envs/quant/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/pszemraj/miniconda3/envs/quant/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/pszemraj/miniconda3/envs/quant/lib/python3.10/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py", line 553, in forward
    outputs = layer(
  File "/home/pszemraj/miniconda3/envs/quant/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/pszemraj/miniconda3/envs/quant/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/pszemraj/miniconda3/envs/quant/lib/python3.10/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py", line 320, in forward
    attention_layer_outputs = self.attention(
  File "/home/pszemraj/miniconda3/envs/quant/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/pszemraj/miniconda3/envs/quant/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/pszemraj/miniconda3/envs/quant/lib/python3.10/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py", line 116, in forward
    qkv = self.query_key_value(hidden_states)
  File "/home/pszemraj/miniconda3/envs/quant/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/pszemraj/miniconda3/envs/quant/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/pszemraj/miniconda3/envs/quant/lib/python3.10/site-packages/auto_gptq/nn_modules/qlinear.py", line 243, in forward
    out = torch.matmul(x.half(), weights)
RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'
>>>

quantization config:

{
  "bits": 4,
  "group_size": 128,
  "damp_percent": 0.01,
  "desc_act": true,
  "sym": true,
  "true_sequential": true
}

Improve `from_quantized` loading time

Hello,

Thank you for this awesome project! I ran through the Getting started example and it worked. I used Triton, but AutoGPTQForCausalLM.from_quantized took 2 minutes. Is that expected? I used facebook/opt-125m.

Quantisation does not work on a GPU server either

I tried the official example from the page, and it does not work.

from transformers import AutoTokenizer, TextGenerationPipeline
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

print(1)
pretrained_model_dir = "huggyllama/llama-7b"
quantized_model_dir = "workspace/llama-4bit"


tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
example = tokenizer("auto_gptq is a useful tool that can automatically compress model into 4-bit or even higher rate by using GPTQ algorithm.", return_tensors="pt")

quantize_config = BaseQuantizeConfig(
    bits=4,          # quantize model to 4-bit
    group_size=128,  # it is recommended to set the value to 128
)

print(12)
# load un-quantized model, the model will always be force loaded into cpu
model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)

print(13)
# quantize model, the examples should be list of dict whose keys can only be "input_ids" and "attention_mask" 
# with value under torch.LongTensor type.
model.quantize([example], use_triton=False)
print(14)

And I get this error:


TypeError Traceback (most recent call last)
Cell In[1], line 27
24 print(13)
25 # quantize model, the examples should be list of dict whose keys can only be "input_ids" and "attention_mask"
26 # with value under torch.LongTensor type.
---> 27 model.quantize([example], use_triton=False)
28 print(14)

File /usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py:115, in context_decorator..decorate_context(*args, **kwargs)
112 @functools.wraps(func)
113 def decorate_context(*args, **kwargs):
114 with ctx_factory():
--> 115 return func(*args, **kwargs)

File /usr/local/lib/python3.10/dist-packages/auto_gptq/modeling/_base.py:152, in BaseGPTQForCausalLM.quantize(self, examples, use_triton, autotune_warmup_after_quantized)
150 example[k] = v.to(CUDA)
151 try:
--> 152 self.model(**example)
153 except ValueError:
154 pass

File /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
1496 # If we don't have any hooks, we want to skip the rest of the logic in
1497 # this function, and just call forward.
1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1499 or _global_backward_pre_hooks or _global_backward_hooks
1500 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501 return forward_call(*args, **kwargs)
1502 # Do not call functions when jit is used
1503 full_backward_hooks, non_full_backward_hooks = [], []

TypeError: LlamaForCausalLM.forward() got an unexpected keyword argument 'token_type_ids'
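
A workaround that may be worth trying, assuming the problem is simply that the tokenizer returns a token_type_ids field which LlamaForCausalLM.forward() does not accept, is to keep that key out of the calibration example (reusing the tokenizer and model from the snippet above):

# Option 1: ask the tokenizer not to produce token_type_ids at all.
example = tokenizer(
    "auto_gptq is a useful tool that can automatically compress model into 4-bit or even higher rate by using GPTQ algorithm.",
    return_tensors="pt",
    return_token_type_ids=False,
)

# Option 2: drop the key from an already tokenized example.
example.pop("token_type_ids", None)

model.quantize([example], use_triton=False)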

How to run 'speedup_quantization' branch inference on 2x gpus?

Looking forward to multi-gpu use but maybe it's too early to test this branch before a merge?

https://github.com/PanQiWei/AutoGPTQ/tree/speedup_quantization/auto_gptq

Using a simple inference script with additional options (do_sample=True, etc.), renamed "neox_generate_v3.py", based on code provided here:

#8

Running as:

CUDA_VISIBLE_DEVICES="0,1" python neox_generate_v3.py (are more commands needed here for inference and offloading?)

The model begins loading on both GPUs (great), then a series of Accelerate errors follows, ending with:

ValueError: wf is on the meta device, we need a value to put in on cuda:0.

Using Ubuntu 22.04, Python 3.10, with accelerate 0.18.0
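
For reference, this is roughly the kind of multi-GPU loading I am attempting. The max_memory argument mirrors the one used with from_pretrained elsewhere in these issues; whether from_quantized on the speedup_quantization branch accepts it the same way is an assumption on my part, and the paths and budgets below are placeholders:

from auto_gptq import AutoGPTQForCausalLM

quantized_model_dir = "4bit_converted"  # placeholder path

# Illustrative split across two 24 GB cards plus CPU offload; the budgets
# are examples, not tuned values.
max_memory = {0: "20GIB", 1: "20GIB", "cpu": "48GIB"}

model = AutoGPTQForCausalLM.from_quantized(
    quantized_model_dir,
    use_triton=False,
    max_memory=max_memory,
)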

CUDA issue during model inference

Hi, thanks again for this repo. I'm having a CUDA-related issue when I try to run a compressed model. Steps are as follows:

  • I compressed a 65B LLaMa model by slightly adapting the script basic_usage_wikitext2.py. Basically I just changed the file paths to point at my LLaMa model. Compression seemed to work fine and a compressed model was saved to the directory I specified.
  • I confirmed that I was able to run inference on a LLaMa 7B model using a single GPU. It worked fine when I followed the Huggingface docs.
  • I ran the run_text_summarization_task.py script with my compressed model, and got an error. To isolate the error, I wrote a minimal example. See below.

I'm unfortunately not an expert on the inner workings of CUDA, so any guidance on how to debug this would be appreciated.

Example

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

pretrained_model_dir = [path to uncompressed model]
quantized_model_dir = [path to compressed model]

device = "cuda:0"
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=False)
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device=device, use_triton=False)
prompt = "What's the difference between a llama and an alpaca?"
inputs = tokenizer(prompt, return_tensors="pt").to(device)
generate_ids = model(**inputs)

Stack trace as a result of running the example:

Traceback (most recent call last):
  File "/net/nfs.cirrascale/allennlp/davidw/proj/sandbox/compress_alpaca/run_alpaca.py", line 24, in <module>

  File "/opt/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/miniconda3/lib/python3.10/site-packages/auto_gptq/modeling/_base.py", line 264, in forward
    return self.model(**kwargs)
  File "/opt/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/miniconda3/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 687, in forward
    outputs = self.model(
  File "/opt/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/miniconda3/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 577, in forward
    layer_outputs = decoder_layer(
  File "/opt/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/miniconda3/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 292, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/opt/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/miniconda3/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 196, in forward
    query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
  File "/opt/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/miniconda3/lib/python3.10/site-packages/auto_gptq/nn_modules/qlinear.py", line 194, in forward
    out = out.half()
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Result of calling nvidia-smi on my machine

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:47:00.0 Off |                   51 |
| N/A   28C    P0    58W / 400W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
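
If it helps with debugging, a quick check of whether the installed PyTorch build even targets this GPU's architecture can be done from Python. Note this inspects PyTorch's own kernel list, not necessarily the separately compiled quant_cuda extension, so treat it only as a rough indicator:

import torch

print(torch.cuda.get_device_name(0))        # should report the A100
print(torch.cuda.get_device_capability(0))  # A100 reports (8, 0)
print(torch.cuda.get_arch_list())           # e.g. ['sm_50', ..., 'sm_80', 'sm_86']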

Generating Sentence Quality

I used basic_usage.py to convert LLaMA-7B and Vicuna-13B; is it normal that the generation quality with direct inference is poor?

Loading wrong model

Hi, thank you for your great work!
I experience a very odd issue.

when I execute:

model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_name, quantize_config, max_memory=max_memory).to('cuda')

it seems it instantiates the wrong model:

  File "/root/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1614, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'GPTNeoXForCausalLM' object has no attribute 'quantize'

Do you have any idea what could be the cause of this? It is calling the right from_pretrained() method, but the instance is wrong.
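
One thing that may be worth checking (purely an assumption on my side): chaining .to('cuda') on the object returned by AutoGPTQForCausalLM.from_pretrained may hand back the underlying transformers model rather than the AutoGPTQ wrapper, which would explain why .quantize is missing on a GPTNeoXForCausalLM. A sketch without the chained call, with placeholder names:

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

pretrained_model_name = "EleutherAI/gpt-neox-20b"  # placeholder, not the actual model from this report
max_memory = {0: "20GIB", "cpu": "48GIB"}          # placeholder budget
quantize_config = BaseQuantizeConfig(bits=4, group_size=128)

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name)
examples = [tokenizer("example calibration text", return_tensors="pt")]

# Keep the wrapper returned by from_pretrained; do not chain .to('cuda'),
# since that may return the inner transformers model (which has no
# .quantize()). quantize() handles device placement itself.
model = AutoGPTQForCausalLM.from_pretrained(
    pretrained_model_name, quantize_config, max_memory=max_memory
)
model.quantize(examples)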

The quantized model is not performant

I'm not sure if I'm the only one or not.

I used this to quantize two models: one is models--eachadea--vicuna-13b-1.1 and the other is models--decapoda-research--llama-7b-hf.

Both work fine, but when I try to run inference with them, they are very slow: token generation is slow, and sometimes it just gets stuck with GPU usage at 100%, so I have to Ctrl-C.
This is where it gets stuck:

>>> from transformers import pipeline
>>> generate = pipeline("text-generation", model=model, tokenizer=tokenizer, device=0, max_length=512)
The model 'LlamaGPTQForCausalLM' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'CodeGenForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'LlamaForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MvpForCausalLM', 'OpenAIGPTLMHeadModel', 'OPTForCausalLM', 'PegasusForCausalLM', 'PLBartForCausalLM', 'ProphetNetForCausalLM', 'QDQBertLMHeadModel', 'ReformerModelWithLMHead', 'RemBertForCausalLM', 'RobertaForCausalLM', 'RobertaPreLayerNormForCausalLM', 'RoCBertForCausalLM', 'RoFormerForCausalLM', 'Speech2Text2ForCausalLM', 'TransfoXLLMHeadModel', 'TrOCRForCausalLM', 'XGLMForCausalLM', 'XLMWithLMHeadModel', 'XLMProphetNetForCausalLM', 'XLMRobertaForCausalLM', 'XLMRobertaXLForCausalLM', 'XLNetLMHeadModel', 'XmodForCausalLM'].
>>> generate("write openai ceo an email to address about how important is to opensource gpt-4")
^CTraceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/siegfried/miniconda3/envs/embeddings/lib/python3.10/site-packages/transformers/pipelines/text_generation.py", line 209, in __call__
    return super().__call__(text_inputs, **kwargs)
  File "/home/siegfried/miniconda3/envs/embeddings/lib/python3.10/site-packages/transformers/pipelines/base.py", line 1109, in __call__
    return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)
  File "/home/siegfried/miniconda3/envs/embeddings/lib/python3.10/site-packages/transformers/pipelines/base.py", line 1116, in run_single
    model_outputs = self.forward(model_inputs, **forward_params)
  File "/home/siegfried/miniconda3/envs/embeddings/lib/python3.10/site-packages/transformers/pipelines/base.py", line 1015, in forward
    model_outputs = self._forward(model_inputs, **forward_params)
  File "/home/siegfried/miniconda3/envs/embeddings/lib/python3.10/site-packages/transformers/pipelines/text_generation.py", line 251, in _forward
    generated_sequence = self.model.generate(input_ids=input_ids, attention_mask=attention_mask, **generate_kwargs)
  File "/home/siegfried/miniconda3/envs/embeddings/lib/python3.10/site-packages/auto_gptq/modeling/_base.py", line 269, in generate
    return self.model.generate(**kwargs)
  File "/home/siegfried/miniconda3/envs/embeddings/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/siegfried/miniconda3/envs/embeddings/lib/python3.10/site-packages/transformers/generation/utils.py", line 1437, in generate
    return self.greedy_search(
  File "/home/siegfried/miniconda3/envs/embeddings/lib/python3.10/site-packages/transformers/generation/utils.py", line 2248, in greedy_search
    outputs = self(
  File "/home/siegfried/miniconda3/envs/embeddings/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/siegfried/miniconda3/envs/embeddings/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 687, in forward
    outputs = self.model(
  File "/home/siegfried/miniconda3/envs/embeddings/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/siegfried/miniconda3/envs/embeddings/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 577, in forward
    layer_outputs = decoder_layer(
  File "/home/siegfried/miniconda3/envs/embeddings/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/siegfried/miniconda3/envs/embeddings/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 305, in forward
    hidden_states = self.mlp(hidden_states)
  File "/home/siegfried/miniconda3/envs/embeddings/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/siegfried/miniconda3/envs/embeddings/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 157, in forward
    return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
  File "/home/siegfried/miniconda3/envs/embeddings/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/siegfried/miniconda3/envs/embeddings/lib/python3.10/site-packages/auto_gptq/nn_modules/qlinear.py", line 200, in forward
    ).to(torch.int16 if self.bits == 8 else torch.int8)
KeyboardInterrupt
>>>
>>> quit()

Since I came from the GPTQ-for-LLaMa CUDA branch, I noticed that the old CUDA branch fork is pretty performant.

Both vicuna-13b-GPTQ-4bit-128g and gpt4-x-alpaca-13b-native-4bit-128g were quantized by that old CUDA branch of GPTQ-for-LLaMa, and they are fast. I'm wondering what has changed.

Both models cannot be loaded by AutoGPTQ because of some layer issue; they can only be loaded with python setup_cuda.py install from the old CUDA branch of GPTQ-for-LLaMa.

Features to aid widespread adoption: allow any model filename, and allow specifying quantize_config at load

Firstly, thanks for all the great new commits over the last few days. AutoGPTQ is looking better and better every day!

I have been doing some testing on AutoGPTQ and am really pleased to see it is now rivalling GPTQ-for-LLaMa. As I've said before, I feel AutoGPTQ is the future of GPTQ because it is so much easier to understand and integrate.

In order to help it gain widespread usage, I feel it needs some additional features:

  1. In .save_quantized and .from_quantized, the ability to save/load any filename. Currently a name of gptq_model-Xbit.bin is mandatory. I feel that it should be possible to save with any filename, and then to load from any filename.
  2. Allow passing a BaseQuantizeConfig to from_quantized. This will allow loading pre-existing models that do not have a quantize_config.json. A sketch of how both features could look is shown after this list.
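
Purely as an illustration of what I have in mind (the model_basename and quantize_config arguments below are proposed names, not part of the current from_quantized API), loading an existing GPTQ-for-LLaMa checkpoint could then look like this:

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

# Describe a pre-existing checkpoint that has no quantize_config.json.
quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)

model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/vicuna-13B-1.1-GPTQ-4bit-128g",
    model_basename="vicuna-13B-1.1-GPTQ-4bit-128g",  # proposed: load any filename
    quantize_config=quantize_config,                 # proposed: supply config at load time
    device="cuda:0",
    use_triton=False,
)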

With these changes made, it should be easy for any user to write AutoGPTQ code that can load any of the GPTQ models already in existence. For example, I have released many to the community at https://huggingface.co/TheBloke, e.g. https://huggingface.co/TheBloke/vicuna-13B-1.1-GPTQ-4bit-128g

I have tested AutoGPTQ with several of these GPTQ models (which were made with GPTQ-for-LLaMa) and they work great. But only if I manually add a quantize_config.json to each, and also rename them to gptq_model-4bit.bin.

Requiring people to do this for every GPTQ model on HF will IMHO slow adoption of AutoGPTQ. I don't mind doing it for my models, but other creators may not bother. And it'd save everyone time if it wasn't needed at all.

I am happy to look into implementing these features myself as a PR, but it will likely take me a little while to figure out the code and test it properly. So I thought I would raise them as an Issue first. Let me know what you think.

Thanks again for all your great work!

NaN when converting neox and opt models with AutoGPTQ-triton

I will test the default CUDA version next, but I am encountering NaN for all conversions using 'AutoGPTQ-triton'.

Using Ubuntu 22.04, Python 3.10, transformers 4.28 (dev), 64 GB RAM, and 2x RTX cards with 24 GB each.

Installed successfully with all dependencies.

Am I missing a particular package version?

python basic_usage.py

pretrained_model_dir = "models/gpt-neox-20b"
quantized_model_dir = "4bit_converted"

2023-04-22 13:29:42 INFO [auto_gptq.modeling._base] Quantizing attention.query_key_value in layer 1/44...
2023-04-22 13:29:47 INFO [auto_gptq.quantization.gptq] duration: 5.032277584075928
2023-04-22 13:29:47 INFO [auto_gptq.quantization.gptq] avg loss: 17.77143669128418
2023-04-22 13:29:47 INFO [auto_gptq.modeling._base] Quantizing attention.dense in layer 1/44...
2023-04-22 13:29:48 INFO [auto_gptq.quantization.gptq] duration: 1.7948594093322754
2023-04-22 13:29:48 INFO [auto_gptq.quantization.gptq] avg loss: 1.888306736946106
2023-04-22 13:29:49 INFO [auto_gptq.modeling._base] Quantizing mlp.dense_h_to_4h in layer 1/44...
2023-04-22 13:29:50 INFO [auto_gptq.quantization.gptq] duration: 1.8883254528045654
2023-04-22 13:29:50 INFO [auto_gptq.quantization.gptq] avg loss: 28.566619873046875
2023-04-22 13:29:50 INFO [auto_gptq.modeling._base] Quantizing mlp.dense_4h_to_h in layer 1/44...
2023-04-22 13:30:02 INFO [auto_gptq.quantization.gptq] duration: 11.343331575393677
2023-04-22 13:30:02 INFO [auto_gptq.quantization.gptq] avg loss: nan
2023-04-22 13:30:02 INFO [auto_gptq.modeling._base] Start quantizing layer 2/44
2023-04-22 13:30:02 INFO [auto_gptq.modeling._base] Quantizing attention.query_key_value in layer 2/44...
2023-04-22 13:30:04 INFO [auto_gptq.quantization.gptq] duration: 1.9044442176818848
2023-04-22 13:30:04 INFO [auto_gptq.quantization.gptq] avg loss: nan
etc. stopped

Similar results with opt-30b:

pretrained_model_dir = "models/opt-30b"
quantized_model_dir = "4bit_converted"

Loading checkpoint shards: 100%|██████████████| 267/267 [13:52<00:00, 3.12s/it]
2023-04-22 13:55:41 INFO [auto_gptq.modeling._base] Start quantizing layer 1/44
2023-04-22 13:55:46 INFO [auto_gptq.modeling._base] Quantizing attention.query_key_value in layer 1/44...
2023-04-22 13:55:51 INFO [auto_gptq.quantization.gptq] duration: 4.748894453048706
2023-04-22 13:55:51 INFO [auto_gptq.quantization.gptq] avg loss: 17.623794555664062
2023-04-22 13:55:51 INFO [auto_gptq.modeling._base] Quantizing attention.dense in layer 1/44...
2023-04-22 13:55:53 INFO [auto_gptq.quantization.gptq] duration: 1.8472576141357422
2023-04-22 13:55:53 INFO [auto_gptq.quantization.gptq] avg loss: 1.9249645471572876
2023-04-22 13:55:53 INFO [auto_gptq.modeling._base] Quantizing mlp.dense_h_to_4h in layer 1/44...
2023-04-22 13:55:55 INFO [auto_gptq.quantization.gptq] duration: 1.9470229148864746
2023-04-22 13:55:55 INFO [auto_gptq.quantization.gptq] avg loss: 28.64271354675293
2023-04-22 13:55:55 INFO [auto_gptq.modeling._base] Quantizing mlp.dense_4h_to_h in layer 1/44...
2023-04-22 13:56:06 INFO [auto_gptq.quantization.gptq] duration: 11.630852222442627
2023-04-22 13:56:06 INFO [auto_gptq.quantization.gptq] avg loss: nan
2023-04-22 13:56:07 INFO [auto_gptq.modeling._base] Start quantizing layer 2/44
2023-04-22 13:56:07 INFO [auto_gptq.modeling._base] Quantizing attention.query_key_value in layer 2/44...
2023-04-22 13:56:09 INFO [auto_gptq.quantization.gptq] duration: 1.975161075592041
2023-04-22 13:56:09 INFO [auto_gptq.quantization.gptq] avg loss: nan
etc. stopped
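
For what it's worth, one quantization setting that is sometimes adjusted when avg loss turns NaN is the Hessian damping factor damp_percent; whether it helps for gpt-neox-20b or opt-30b here is unverified, and the value below is only an example:

from auto_gptq import BaseQuantizeConfig

# Stronger damping than the default 0.01; purely illustrative.
quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    damp_percent=0.1,
)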

pip install auto-gptq error

python==3.8.5. When running pip install auto-gptq, the following error occurs. Any suggestions?

Collecting auto-gptq
  Using cached auto_gptq-0.1.0.tar.gz (35 kB)
Requirement already satisfied: accelerate>=0.18.0 in /opt/conda/lib/python3.8/site-packages (from auto-gptq) (0.19.0)
Requirement already satisfied: datasets in /opt/conda/lib/python3.8/site-packages (from auto-gptq) (2.12.0)
Requirement already satisfied: numpy in /opt/conda/lib/python3.8/site-packages (from auto-gptq) (1.19.2)
Requirement already satisfied: rouge in /opt/conda/lib/python3.8/site-packages (from auto-gptq) (1.0.1)
Requirement already satisfied: torch>=1.13.0 in /opt/conda/lib/python3.8/site-packages (from auto-gptq) (2.0.1)
Requirement already satisfied: safetensors in /opt/conda/lib/python3.8/site-packages (from auto-gptq) (0.3.1)
Requirement already satisfied: transformers>=4.26.1 in /opt/conda/lib/python3.8/site-packages (from auto-gptq) (4.28.1)
Requirement already satisfied: packaging>=20.0 in /opt/conda/lib/python3.8/site-packages (from accelerate>=0.18.0->auto-gptq) (23.1)
Requirement already satisfied: psutil in /opt/conda/lib/python3.8/site-packages (from accelerate>=0.18.0->auto-gptq) (5.7.2)
Requirement already satisfied: pyyaml in /opt/conda/lib/python3.8/site-packages (from accelerate>=0.18.0->auto-gptq) (5.3.1)
Requirement already satisfied: nvidia-cuda-nvrtc-cu11==11.7.99 in /opt/conda/lib/python3.8/site-packages (from torch>=1.13.0->auto-gptq) (11.7.99)
Requirement already satisfied: nvidia-nvtx-cu11==11.7.91 in /opt/conda/lib/python3.8/site-packages (from torch>=1.13.0->auto-gptq) (11.7.91)
Requirement already satisfied: nvidia-nccl-cu11==2.14.3 in /opt/conda/lib/python3.8/site-packages (from torch>=1.13.0->auto-gptq) (2.14.3)
Requirement already satisfied: nvidia-cudnn-cu11==8.5.0.96 in /opt/conda/lib/python3.8/site-packages (from torch>=1.13.0->auto-gptq) (8.5.0.96)
Requirement already satisfied: jinja2 in /opt/conda/lib/python3.8/site-packages (from torch>=1.13.0->auto-gptq) (2.11.2)
Requirement already satisfied: triton==2.0.0 in /opt/conda/lib/python3.8/site-packages (from torch>=1.13.0->auto-gptq) (2.0.0)
Requirement already satisfied: nvidia-cuda-runtime-cu11==11.7.99 in /opt/conda/lib/python3.8/site-packages (from torch>=1.13.0->auto-gptq) (11.7.99)
Requirement already satisfied: filelock in /opt/conda/lib/python3.8/site-packages (from torch>=1.13.0->auto-gptq) (3.0.12)
Requirement already satisfied: nvidia-cuda-cupti-cu11==11.7.101 in /opt/conda/lib/python3.8/site-packages (from torch>=1.13.0->auto-gptq) (11.7.101)
Requirement already satisfied: sympy in /opt/conda/lib/python3.8/site-packages (from torch>=1.13.0->auto-gptq) (1.11.1)
Requirement already satisfied: typing-extensions in /opt/conda/lib/python3.8/site-packages (from torch>=1.13.0->auto-gptq) (4.5.0)
Requirement already satisfied: nvidia-cublas-cu11==11.10.3.66 in /opt/conda/lib/python3.8/site-packages (from torch>=1.13.0->auto-gptq) (11.10.3.66)
Requirement already satisfied: nvidia-cusolver-cu11==11.4.0.1 in /opt/conda/lib/python3.8/site-packages (from torch>=1.13.0->auto-gptq) (11.4.0.1)
Requirement already satisfied: nvidia-cusparse-cu11==11.7.4.91 in /opt/conda/lib/python3.8/site-packages (from torch>=1.13.0->auto-gptq) (11.7.4.91)
Requirement already satisfied: nvidia-curand-cu11==10.2.10.91 in /opt/conda/lib/python3.8/site-packages (from torch>=1.13.0->auto-gptq) (10.2.10.91)
Requirement already satisfied: networkx in /opt/conda/lib/python3.8/site-packages (from torch>=1.13.0->auto-gptq) (2.0)
Requirement already satisfied: nvidia-cufft-cu11==10.9.0.58 in /opt/conda/lib/python3.8/site-packages (from torch>=1.13.0->auto-gptq) (10.9.0.58)
Requirement already satisfied: wheel in /opt/conda/lib/python3.8/site-packages (from nvidia-cublas-cu11==11.10.3.66->torch>=1.13.0->auto-gptq) (0.35.1)
Requirement already satisfied: setuptools in /opt/conda/lib/python3.8/site-packages (from nvidia-cublas-cu11==11.10.3.66->torch>=1.13.0->auto-gptq) (50.3.1.post20201107)
Requirement already satisfied: lit in /opt/conda/lib/python3.8/site-packages (from triton==2.0.0->torch>=1.13.0->auto-gptq) (16.0.3)
Requirement already satisfied: cmake in /opt/conda/lib/python3.8/site-packages (from triton==2.0.0->torch>=1.13.0->auto-gptq) (3.26.3)
Requirement already satisfied: requests in /opt/conda/lib/python3.8/site-packages (from transformers>=4.26.1->auto-gptq) (2.24.0)
Requirement already satisfied: tokenizers!=0.11.3,<0.14,>=0.11.1 in /opt/conda/lib/python3.8/site-packages (from transformers>=4.26.1->auto-gptq) (0.13.3)
Requirement already satisfied: regex!=2019.12.17 in /opt/conda/lib/python3.8/site-packages (from transformers>=4.26.1->auto-gptq) (2020.11.13)
Requirement already satisfied: tqdm>=4.27 in /opt/conda/lib/python3.8/site-packages (from transformers>=4.26.1->auto-gptq) (4.65.0)
Requirement already satisfied: huggingface-hub<1.0,>=0.11.0 in /opt/conda/lib/python3.8/site-packages (from transformers>=4.26.1->auto-gptq) (0.14.1)
Requirement already satisfied: fsspec in /opt/conda/lib/python3.8/site-packages (from huggingface-hub<1.0,>=0.11.0->transformers>=4.26.1->auto-gptq) (2023.5.0)
Requirement already satisfied: xxhash in /opt/conda/lib/python3.8/site-packages (from datasets->auto-gptq) (3.2.0)
Requirement already satisfied: dill<0.3.7,>=0.3.0 in /opt/conda/lib/python3.8/site-packages (from datasets->auto-gptq) (0.3.6)
Requirement already satisfied: aiohttp in /opt/conda/lib/python3.8/site-packages (from datasets->auto-gptq) (3.8.4)
Requirement already satisfied: pandas in /opt/conda/lib/python3.8/site-packages (from datasets->auto-gptq) (1.1.4)
Requirement already satisfied: pyarrow>=8.0.0 in /opt/conda/lib/python3.8/site-packages (from datasets->auto-gptq) (12.0.0)
Requirement already satisfied: multiprocess in /opt/conda/lib/python3.8/site-packages (from datasets->auto-gptq) (0.70.14)
Requirement already satisfied: responses<0.19 in /opt/conda/lib/python3.8/site-packages (from datasets->auto-gptq) (0.18.0)
Requirement already satisfied: charset-normalizer<4.0,>=2.0 in /opt/conda/lib/python3.8/site-packages (from aiohttp->datasets->auto-gptq) (3.1.0)
Requirement already satisfied: attrs>=17.3.0 in /opt/conda/lib/python3.8/site-packages (from aiohttp->datasets->auto-gptq) (20.3.0)
Requirement already satisfied: frozenlist>=1.1.1 in /opt/conda/lib/python3.8/site-packages (from aiohttp->datasets->auto-gptq) (1.3.3)
Requirement already satisfied: yarl<2.0,>=1.0 in /opt/conda/lib/python3.8/site-packages (from aiohttp->datasets->auto-gptq) (1.9.2)
Requirement already satisfied: async-timeout<5.0,>=4.0.0a3 in /opt/conda/lib/python3.8/site-packages (from aiohttp->datasets->auto-gptq) (4.0.2)
Requirement already satisfied: multidict<7.0,>=4.5 in /opt/conda/lib/python3.8/site-packages (from aiohttp->datasets->auto-gptq) (6.0.4)
Requirement already satisfied: aiosignal>=1.1.2 in /opt/conda/lib/python3.8/site-packages (from aiohttp->datasets->auto-gptq) (1.3.1)
Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.8/site-packages (from requests->transformers>=4.26.1->auto-gptq) (2020.11.8)
Requirement already satisfied: idna<3,>=2.5 in /opt/conda/lib/python3.8/site-packages (from requests->transformers>=4.26.1->auto-gptq) (2.10)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /opt/conda/lib/python3.8/site-packages (from requests->transformers>=4.26.1->auto-gptq) (1.25.11)
Requirement already satisfied: chardet<4,>=3.0.2 in /opt/conda/lib/python3.8/site-packages (from requests->transformers>=4.26.1->auto-gptq) (3.0.4)
Requirement already satisfied: MarkupSafe>=0.23 in /opt/conda/lib/python3.8/site-packages (from jinja2->torch>=1.13.0->auto-gptq) (1.1.1)
Requirement already satisfied: decorator>=4.1.0 in /opt/conda/lib/python3.8/site-packages (from networkx->torch>=1.13.0->auto-gptq) (4.4.2)
Requirement already satisfied: python-dateutil>=2.7.3 in /opt/conda/lib/python3.8/site-packages (from pandas->datasets->auto-gptq) (2.8.1)
Requirement already satisfied: pytz>=2017.2 in /opt/conda/lib/python3.8/site-packages (from pandas->datasets->auto-gptq) (2020.1)
Requirement already satisfied: six>=1.5 in /opt/conda/lib/python3.8/site-packages (from python-dateutil>=2.7.3->pandas->datasets->auto-gptq) (1.15.0)
Requirement already satisfied: mpmath>=0.19 in /opt/conda/lib/python3.8/site-packages (from sympy->torch>=1.13.0->auto-gptq) (1.3.0)
Building wheels for collected packages: auto-gptq
  Building wheel for auto-gptq (setup.py) ... error
  ERROR: Command errored out with exit status 1:
   command: /opt/conda/bin/python -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-4ahd0ixx/auto-gptq_8427ebbff1cb4b05a77734c7bf015427/setup.py'"'"'; __file__='"'"'/tmp/pip-install-4ahd0ixx/auto-gptq_8427ebbff1cb4b05a77734c7bf015427/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-wmu3_p95
       cwd: /tmp/pip-install-4ahd0ixx/auto-gptq_8427ebbff1cb4b05a77734c7bf015427/
  Complete output (158 lines):
  /opt/conda/lib/python3.8/site-packages/setuptools/dist.py:452: UserWarning: Normalizing 'v0.1.0' to '0.1.0'
    warnings.warn(tmpl.format(**locals()))
  running bdist_wheel
  running build
  running build_py
  creating build
  creating build/lib.linux-x86_64-3.8
  creating build/lib.linux-x86_64-3.8/auto_gptq
  copying auto_gptq/__init__.py -> build/lib.linux-x86_64-3.8/auto_gptq
  creating build/lib.linux-x86_64-3.8/auto_gptq/utils
  copying auto_gptq/utils/data_utils.py -> build/lib.linux-x86_64-3.8/auto_gptq/utils
  copying auto_gptq/utils/__init__.py -> build/lib.linux-x86_64-3.8/auto_gptq/utils
  creating build/lib.linux-x86_64-3.8/auto_gptq/nn_modules
  copying auto_gptq/nn_modules/layernorm_triton.py -> build/lib.linux-x86_64-3.8/auto_gptq/nn_modules
  copying auto_gptq/nn_modules/qlinear_triton.py -> build/lib.linux-x86_64-3.8/auto_gptq/nn_modules
  copying auto_gptq/nn_modules/qlinear.py -> build/lib.linux-x86_64-3.8/auto_gptq/nn_modules
  copying auto_gptq/nn_modules/qlinear_old.py -> build/lib.linux-x86_64-3.8/auto_gptq/nn_modules
  copying auto_gptq/nn_modules/__init__.py -> build/lib.linux-x86_64-3.8/auto_gptq/nn_modules
  creating build/lib.linux-x86_64-3.8/auto_gptq/eval_tasks
  copying auto_gptq/eval_tasks/text_summarization_task.py -> build/lib.linux-x86_64-3.8/auto_gptq/eval_tasks
  copying auto_gptq/eval_tasks/_base.py -> build/lib.linux-x86_64-3.8/auto_gptq/eval_tasks
  copying auto_gptq/eval_tasks/sequence_classification_task.py -> build/lib.linux-x86_64-3.8/auto_gptq/eval_tasks
  copying auto_gptq/eval_tasks/language_modeling_task.py -> build/lib.linux-x86_64-3.8/auto_gptq/eval_tasks
  copying auto_gptq/eval_tasks/__init__.py -> build/lib.linux-x86_64-3.8/auto_gptq/eval_tasks
  creating build/lib.linux-x86_64-3.8/auto_gptq/quantization
  copying auto_gptq/quantization/gptq.py -> build/lib.linux-x86_64-3.8/auto_gptq/quantization
  copying auto_gptq/quantization/quantizer.py -> build/lib.linux-x86_64-3.8/auto_gptq/quantization
  copying auto_gptq/quantization/__init__.py -> build/lib.linux-x86_64-3.8/auto_gptq/quantization
  creating build/lib.linux-x86_64-3.8/auto_gptq/modeling
  copying auto_gptq/modeling/llama.py -> build/lib.linux-x86_64-3.8/auto_gptq/modeling
  copying auto_gptq/modeling/_base.py -> build/lib.linux-x86_64-3.8/auto_gptq/modeling
  copying auto_gptq/modeling/gpt2.py -> build/lib.linux-x86_64-3.8/auto_gptq/modeling
  copying auto_gptq/modeling/bloom.py -> build/lib.linux-x86_64-3.8/auto_gptq/modeling
  copying auto_gptq/modeling/auto.py -> build/lib.linux-x86_64-3.8/auto_gptq/modeling
  copying auto_gptq/modeling/gptj.py -> build/lib.linux-x86_64-3.8/auto_gptq/modeling
  copying auto_gptq/modeling/_const.py -> build/lib.linux-x86_64-3.8/auto_gptq/modeling
  copying auto_gptq/modeling/_utils.py -> build/lib.linux-x86_64-3.8/auto_gptq/modeling
  copying auto_gptq/modeling/gpt_neox.py -> build/lib.linux-x86_64-3.8/auto_gptq/modeling
  copying auto_gptq/modeling/moss.py -> build/lib.linux-x86_64-3.8/auto_gptq/modeling
  copying auto_gptq/modeling/opt.py -> build/lib.linux-x86_64-3.8/auto_gptq/modeling
  copying auto_gptq/modeling/__init__.py -> build/lib.linux-x86_64-3.8/auto_gptq/modeling
  creating build/lib.linux-x86_64-3.8/auto_gptq/nn_modules/triton_utils
  copying auto_gptq/nn_modules/triton_utils/custom_autotune.py -> build/lib.linux-x86_64-3.8/auto_gptq/nn_modules/triton_utils
  copying auto_gptq/nn_modules/triton_utils/__init__.py -> build/lib.linux-x86_64-3.8/auto_gptq/nn_modules/triton_utils
  creating build/lib.linux-x86_64-3.8/auto_gptq/eval_tasks/_utils
  copying auto_gptq/eval_tasks/_utils/generation_utils.py -> build/lib.linux-x86_64-3.8/auto_gptq/eval_tasks/_utils
  copying auto_gptq/eval_tasks/_utils/classification_utils.py -> build/lib.linux-x86_64-3.8/auto_gptq/eval_tasks/_utils
  copying auto_gptq/eval_tasks/_utils/__init__.py -> build/lib.linux-x86_64-3.8/auto_gptq/eval_tasks/_utils
  running build_ext
  /opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py:388: UserWarning: The detected CUDA version (11.1) has a minor version mismatch with the version that was used to compile PyTorch (11.7). Most likely this shouldn't be a problem.
    warnings.warn(CUDA_MISMATCH_WARN.format(cuda_str_version, torch.version.cuda))
  building 'quant_cuda' extension
  creating /tmp/pip-install-4ahd0ixx/auto-gptq_8427ebbff1cb4b05a77734c7bf015427/build/temp.linux-x86_64-3.8
  creating /tmp/pip-install-4ahd0ixx/auto-gptq_8427ebbff1cb4b05a77734c7bf015427/build/temp.linux-x86_64-3.8/quant_cuda
  Emitting ninja build file /tmp/pip-install-4ahd0ixx/auto-gptq_8427ebbff1cb4b05a77734c7bf015427/build/temp.linux-x86_64-3.8/build.ninja...
  Compiling objects...
  Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
  [1/2] c++ -MMD -MF /tmp/pip-install-4ahd0ixx/auto-gptq_8427ebbff1cb4b05a77734c7bf015427/build/temp.linux-x86_64-3.8/quant_cuda/quant_cuda.o.d -pthread -B /opt/conda/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/opt/conda/lib/python3.8/site-packages/torch/include -I/opt/conda/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/lib/python3.8/site-packages/torch/include/THC -I/usr/local/cuda/include -I/tmp/pip-install-4ahd0ixx/auto-gptq_8427ebbff1cb4b05a77734c7bf015427/quant_cuda -I/opt/conda/include/python3.8 -c -c /tmp/pip-install-4ahd0ixx/auto-gptq_8427ebbff1cb4b05a77734c7bf015427/quant_cuda/quant_cuda.cpp -o /tmp/pip-install-4ahd0ixx/auto-gptq_8427ebbff1cb4b05a77734c7bf015427/build/temp.linux-x86_64-3.8/quant_cuda/quant_cuda.o -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=quant_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
  cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
  [2/2] /usr/local/cuda/bin/nvcc  -I/opt/conda/lib/python3.8/site-packages/torch/include -I/opt/conda/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/lib/python3.8/site-packages/torch/include/THC -I/usr/local/cuda/include -I/tmp/pip-install-4ahd0ixx/auto-gptq_8427ebbff1cb4b05a77734c7bf015427/quant_cuda -I/opt/conda/include/python3.8 -c -c /tmp/pip-install-4ahd0ixx/auto-gptq_8427ebbff1cb4b05a77734c7bf015427/quant_cuda/quant_cuda_kernel.cu -o /tmp/pip-install-4ahd0ixx/auto-gptq_8427ebbff1cb4b05a77734c7bf015427/build/temp.linux-x86_64-3.8/quant_cuda/quant_cuda_kernel.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=quant_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_52,code=sm_52 -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_61,code=sm_61 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 -std=c++17
  FAILED: /tmp/pip-install-4ahd0ixx/auto-gptq_8427ebbff1cb4b05a77734c7bf015427/build/temp.linux-x86_64-3.8/quant_cuda/quant_cuda_kernel.o
  /usr/local/cuda/bin/nvcc  -I/opt/conda/lib/python3.8/site-packages/torch/include -I/opt/conda/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/lib/python3.8/site-packages/torch/include/THC -I/usr/local/cuda/include -I/tmp/pip-install-4ahd0ixx/auto-gptq_8427ebbff1cb4b05a77734c7bf015427/quant_cuda -I/opt/conda/include/python3.8 -c -c /tmp/pip-install-4ahd0ixx/auto-gptq_8427ebbff1cb4b05a77734c7bf015427/quant_cuda/quant_cuda_kernel.cu -o /tmp/pip-install-4ahd0ixx/auto-gptq_8427ebbff1cb4b05a77734c7bf015427/build/temp.linux-x86_64-3.8/quant_cuda/quant_cuda_kernel.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=quant_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_52,code=sm_52 -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_61,code=sm_61 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 -std=c++17
  /opt/conda/lib/python3.8/site-packages/torch/include/c10/util/irange.h(54): warning: pointless comparison of unsigned integer with zero
            detected during:
              instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, <unnamed>>::operator==(const c10::detail::integer_iterator<I, one_sided, <unnamed>> &) const [with I=size_t, one_sided=false, <unnamed>=0]"
  (61): here
              instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, <unnamed>>::operator!=(const c10::detail::integer_iterator<I, one_sided, <unnamed>> &) const [with I=size_t, one_sided=false, <unnamed>=0]"
  /opt/conda/lib/python3.8/site-packages/torch/include/c10/core/TensorImpl.h(77): here
  
  /opt/conda/lib/python3.8/site-packages/torch/include/c10/util/irange.h(54): warning: pointless comparison of unsigned integer with zero
            detected during:
              instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, <unnamed>>::operator==(const c10::detail::integer_iterator<I, one_sided, <unnamed>> &) const [with I=std::size_t, one_sided=true, <unnamed>=0]"
  (61): here
              instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, <unnamed>>::operator!=(const c10::detail::integer_iterator<I, one_sided, <unnamed>> &) const [with I=std::size_t, one_sided=true, <unnamed>=0]"
  /opt/conda/lib/python3.8/site-packages/torch/include/ATen/core/qualified_name.h(73): here
  
  /tmp/pip-install-4ahd0ixx/auto-gptq_8427ebbff1cb4b05a77734c7bf015427/quant_cuda/quant_cuda_kernel.cu(1128): error: identifier "__hfma2" is undefined
  
  /tmp/pip-install-4ahd0ixx/auto-gptq_8427ebbff1cb4b05a77734c7bf015427/quant_cuda/quant_cuda_kernel.cu(1128): error: identifier "__hfma2" is undefined
  
  /tmp/pip-install-4ahd0ixx/auto-gptq_8427ebbff1cb4b05a77734c7bf015427/quant_cuda/quant_cuda_kernel.cu(1262): error: identifier "__hfma2" is undefined
  
  /tmp/pip-install-4ahd0ixx/auto-gptq_8427ebbff1cb4b05a77734c7bf015427/quant_cuda/quant_cuda_kernel.cu(1262): error: identifier "__hfma2" is undefined
  
  /tmp/pip-install-4ahd0ixx/auto-gptq_8427ebbff1cb4b05a77734c7bf015427/quant_cuda/quant_cuda_kernel.cu(1380): error: identifier "__hfma2" is undefined
  
  /tmp/pip-install-4ahd0ixx/auto-gptq_8427ebbff1cb4b05a77734c7bf015427/quant_cuda/quant_cuda_kernel.cu(1380): error: identifier "__hfma2" is undefined
  
  /opt/conda/lib/python3.8/site-packages/torch/include/c10/util/irange.h(54): warning: pointless comparison of unsigned integer with zero
            detected during:
              instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, <unnamed>>::operator==(const c10::detail::integer_iterator<I, one_sided, <unnamed>> &) const [with I=size_t, one_sided=false, <unnamed>=0]"
  (61): here
              instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, <unnamed>>::operator!=(const c10::detail::integer_iterator<I, one_sided, <unnamed>> &) const [with I=size_t, one_sided=false, <unnamed>=0]"
  /opt/conda/lib/python3.8/site-packages/torch/include/c10/core/TensorImpl.h(77): here
  
  /opt/conda/lib/python3.8/site-packages/torch/include/c10/util/irange.h(54): warning: pointless comparison of unsigned integer with zero
            detected during:
              instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, <unnamed>>::operator==(const c10::detail::integer_iterator<I, one_sided, <unnamed>> &) const [with I=std::size_t, one_sided=true, <unnamed>=0]"
  (61): here
              instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, <unnamed>>::operator!=(const c10::detail::integer_iterator<I, one_sided, <unnamed>> &) const [with I=std::size_t, one_sided=true, <unnamed>=0]"
  /opt/conda/lib/python3.8/site-packages/torch/include/ATen/core/qualified_name.h(73): here
  
  6 errors detected in the compilation of "/tmp/pip-install-4ahd0ixx/auto-gptq_8427ebbff1cb4b05a77734c7bf015427/quant_cuda/quant_cuda_kernel.cu".
  ninja: build stopped: subcommand failed.
  Traceback (most recent call last):
    File "/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1893, in _run_ninja_build
      subprocess.run(
    File "/opt/conda/lib/python3.8/subprocess.py", line 512, in run
      raise CalledProcessError(retcode, process.args,
  subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
  
  The above exception was the direct cause of the following exception:
  
  Traceback (most recent call last):
    File "<string>", line 1, in <module>
    File "/tmp/pip-install-4ahd0ixx/auto-gptq_8427ebbff1cb4b05a77734c7bf015427/setup.py", line 49, in <module>
      setup(
    File "/opt/conda/lib/python3.8/site-packages/setuptools/__init__.py", line 153, in setup
      return distutils.core.setup(**attrs)
    File "/opt/conda/lib/python3.8/distutils/core.py", line 148, in setup
      dist.run_commands()
    File "/opt/conda/lib/python3.8/distutils/dist.py", line 966, in run_commands
      self.run_command(cmd)
    File "/opt/conda/lib/python3.8/distutils/dist.py", line 985, in run_command
      cmd_obj.run()
    File "/opt/conda/lib/python3.8/site-packages/wheel/bdist_wheel.py", line 290, in run
      self.run_command('build')
    File "/opt/conda/lib/python3.8/distutils/cmd.py", line 313, in run_command
      self.distribution.run_command(command)
    File "/opt/conda/lib/python3.8/distutils/dist.py", line 985, in run_command
      cmd_obj.run()
    File "/opt/conda/lib/python3.8/distutils/command/build.py", line 135, in run
      self.run_command(cmd_name)
    File "/opt/conda/lib/python3.8/distutils/cmd.py", line 313, in run_command
      self.distribution.run_command(command)
    File "/opt/conda/lib/python3.8/distutils/dist.py", line 985, in run_command
      cmd_obj.run()
    File "/opt/conda/lib/python3.8/site-packages/setuptools/command/build_ext.py", line 79, in run
      _build_ext.run(self)
    File "/opt/conda/lib/python3.8/site-packages/Cython/Distutils/old_build_ext.py", line 186, in run
      _build_ext.build_ext.run(self)
    File "/opt/conda/lib/python3.8/distutils/command/build_ext.py", line 340, in run
      self.build_extensions()
    File "/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 843, in build_extensions
      build_ext.build_extensions(self)
    File "/opt/conda/lib/python3.8/site-packages/Cython/Distutils/old_build_ext.py", line 194, in build_extensions
      self.build_extension(ext)
    File "/opt/conda/lib/python3.8/site-packages/setuptools/command/build_ext.py", line 196, in build_extension
      _build_ext.build_extension(self, ext)
    File "/opt/conda/lib/python3.8/distutils/command/build_ext.py", line 528, in build_extension
      objects = self.compiler.compile(sources,
    File "/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 658, in unix_wrap_ninja_compile
      _write_ninja_file_and_compile_objects(
    File "/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1574, in _write_ninja_file_and_compile_objects
      _run_ninja_build(
    File "/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1909, in _run_ninja_build
      raise RuntimeError(message) from e
  RuntimeError: Error compiling objects for extension
  ----------------------------------------
  ERROR: Failed building wheel for auto-gptq
  Running setup.py clean for auto-gptq
Failed to build auto-gptq
Installing collected packages: auto-gptq
    Running setup.py install for auto-gptq ... error
    ERROR: Command errored out with exit status 1:
     command: /opt/conda/bin/python -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-4ahd0ixx/auto-gptq_8427ebbff1cb4b05a77734c7bf015427/setup.py'"'"'; __file__='"'"'/tmp/pip-install-4ahd0ixx/auto-gptq_8427ebbff1cb4b05a77734c7bf015427/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-mwsj76kg/install-record.txt --single-version-externally-managed --compile --install-headers /opt/conda/include/python3.8/auto-gptq
         cwd: /tmp/pip-install-4ahd0ixx/auto-gptq_8427ebbff1cb4b05a77734c7bf015427/
    Complete output (160 lines):
    /opt/conda/lib/python3.8/site-packages/setuptools/dist.py:452: UserWarning: Normalizing 'v0.1.0' to '0.1.0'
      warnings.warn(tmpl.format(**locals()))
    running install
    running build
    running build_py
    creating build
    creating build/lib.linux-x86_64-3.8
    creating build/lib.linux-x86_64-3.8/auto_gptq
    copying auto_gptq/__init__.py -> build/lib.linux-x86_64-3.8/auto_gptq
    creating build/lib.linux-x86_64-3.8/auto_gptq/utils
    copying auto_gptq/utils/data_utils.py -> build/lib.linux-x86_64-3.8/auto_gptq/utils
    copying auto_gptq/utils/__init__.py -> build/lib.linux-x86_64-3.8/auto_gptq/utils
    creating build/lib.linux-x86_64-3.8/auto_gptq/nn_modules
    copying auto_gptq/nn_modules/layernorm_triton.py -> build/lib.linux-x86_64-3.8/auto_gptq/nn_modules
    copying auto_gptq/nn_modules/qlinear_triton.py -> build/lib.linux-x86_64-3.8/auto_gptq/nn_modules
    copying auto_gptq/nn_modules/qlinear.py -> build/lib.linux-x86_64-3.8/auto_gptq/nn_modules
    copying auto_gptq/nn_modules/qlinear_old.py -> build/lib.linux-x86_64-3.8/auto_gptq/nn_modules
    copying auto_gptq/nn_modules/__init__.py -> build/lib.linux-x86_64-3.8/auto_gptq/nn_modules
    creating build/lib.linux-x86_64-3.8/auto_gptq/eval_tasks
    copying auto_gptq/eval_tasks/text_summarization_task.py -> build/lib.linux-x86_64-3.8/auto_gptq/eval_tasks
    copying auto_gptq/eval_tasks/_base.py -> build/lib.linux-x86_64-3.8/auto_gptq/eval_tasks
    copying auto_gptq/eval_tasks/sequence_classification_task.py -> build/lib.linux-x86_64-3.8/auto_gptq/eval_tasks
    copying auto_gptq/eval_tasks/language_modeling_task.py -> build/lib.linux-x86_64-3.8/auto_gptq/eval_tasks
    copying auto_gptq/eval_tasks/__init__.py -> build/lib.linux-x86_64-3.8/auto_gptq/eval_tasks
    creating build/lib.linux-x86_64-3.8/auto_gptq/quantization
    copying auto_gptq/quantization/gptq.py -> build/lib.linux-x86_64-3.8/auto_gptq/quantization
    copying auto_gptq/quantization/quantizer.py -> build/lib.linux-x86_64-3.8/auto_gptq/quantization
    copying auto_gptq/quantization/__init__.py -> build/lib.linux-x86_64-3.8/auto_gptq/quantization
    creating build/lib.linux-x86_64-3.8/auto_gptq/modeling
    copying auto_gptq/modeling/llama.py -> build/lib.linux-x86_64-3.8/auto_gptq/modeling
    copying auto_gptq/modeling/_base.py -> build/lib.linux-x86_64-3.8/auto_gptq/modeling
    copying auto_gptq/modeling/gpt2.py -> build/lib.linux-x86_64-3.8/auto_gptq/modeling
    copying auto_gptq/modeling/bloom.py -> build/lib.linux-x86_64-3.8/auto_gptq/modeling
    copying auto_gptq/modeling/auto.py -> build/lib.linux-x86_64-3.8/auto_gptq/modeling
    copying auto_gptq/modeling/gptj.py -> build/lib.linux-x86_64-3.8/auto_gptq/modeling
    copying auto_gptq/modeling/_const.py -> build/lib.linux-x86_64-3.8/auto_gptq/modeling
    copying auto_gptq/modeling/_utils.py -> build/lib.linux-x86_64-3.8/auto_gptq/modeling
    copying auto_gptq/modeling/gpt_neox.py -> build/lib.linux-x86_64-3.8/auto_gptq/modeling
    copying auto_gptq/modeling/moss.py -> build/lib.linux-x86_64-3.8/auto_gptq/modeling
    copying auto_gptq/modeling/opt.py -> build/lib.linux-x86_64-3.8/auto_gptq/modeling
    copying auto_gptq/modeling/__init__.py -> build/lib.linux-x86_64-3.8/auto_gptq/modeling
    creating build/lib.linux-x86_64-3.8/auto_gptq/nn_modules/triton_utils
    copying auto_gptq/nn_modules/triton_utils/custom_autotune.py -> build/lib.linux-x86_64-3.8/auto_gptq/nn_modules/triton_utils
    copying auto_gptq/nn_modules/triton_utils/__init__.py -> build/lib.linux-x86_64-3.8/auto_gptq/nn_modules/triton_utils
    creating build/lib.linux-x86_64-3.8/auto_gptq/eval_tasks/_utils
    copying auto_gptq/eval_tasks/_utils/generation_utils.py -> build/lib.linux-x86_64-3.8/auto_gptq/eval_tasks/_utils
    copying auto_gptq/eval_tasks/_utils/classification_utils.py -> build/lib.linux-x86_64-3.8/auto_gptq/eval_tasks/_utils
    copying auto_gptq/eval_tasks/_utils/__init__.py -> build/lib.linux-x86_64-3.8/auto_gptq/eval_tasks/_utils
    running build_ext
    /opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py:388: UserWarning: The detected CUDA version (11.1) has a minor version mismatch with the version that was used to compile PyTorch (11.7). Most likely this shouldn't be a problem.
      warnings.warn(CUDA_MISMATCH_WARN.format(cuda_str_version, torch.version.cuda))
    building 'quant_cuda' extension
    creating /tmp/pip-install-4ahd0ixx/auto-gptq_8427ebbff1cb4b05a77734c7bf015427/build/temp.linux-x86_64-3.8
    creating /tmp/pip-install-4ahd0ixx/auto-gptq_8427ebbff1cb4b05a77734c7bf015427/build/temp.linux-x86_64-3.8/quant_cuda
    Emitting ninja build file /tmp/pip-install-4ahd0ixx/auto-gptq_8427ebbff1cb4b05a77734c7bf015427/build/temp.linux-x86_64-3.8/build.ninja...
    Compiling objects...
    Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
    [1/2] c++ -MMD -MF /tmp/pip-install-4ahd0ixx/auto-gptq_8427ebbff1cb4b05a77734c7bf015427/build/temp.linux-x86_64-3.8/quant_cuda/quant_cuda.o.d -pthread -B /opt/conda/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/opt/conda/lib/python3.8/site-packages/torch/include -I/opt/conda/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/lib/python3.8/site-packages/torch/include/THC -I/usr/local/cuda/include -I/tmp/pip-install-4ahd0ixx/auto-gptq_8427ebbff1cb4b05a77734c7bf015427/quant_cuda -I/opt/conda/include/python3.8 -c -c /tmp/pip-install-4ahd0ixx/auto-gptq_8427ebbff1cb4b05a77734c7bf015427/quant_cuda/quant_cuda.cpp -o /tmp/pip-install-4ahd0ixx/auto-gptq_8427ebbff1cb4b05a77734c7bf015427/build/temp.linux-x86_64-3.8/quant_cuda/quant_cuda.o -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=quant_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
    cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
    [2/2] /usr/local/cuda/bin/nvcc  -I/opt/conda/lib/python3.8/site-packages/torch/include -I/opt/conda/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/lib/python3.8/site-packages/torch/include/THC -I/usr/local/cuda/include -I/tmp/pip-install-4ahd0ixx/auto-gptq_8427ebbff1cb4b05a77734c7bf015427/quant_cuda -I/opt/conda/include/python3.8 -c -c /tmp/pip-install-4ahd0ixx/auto-gptq_8427ebbff1cb4b05a77734c7bf015427/quant_cuda/quant_cuda_kernel.cu -o /tmp/pip-install-4ahd0ixx/auto-gptq_8427ebbff1cb4b05a77734c7bf015427/build/temp.linux-x86_64-3.8/quant_cuda/quant_cuda_kernel.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=quant_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_52,code=sm_52 -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_61,code=sm_61 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 -std=c++17
    FAILED: /tmp/pip-install-4ahd0ixx/auto-gptq_8427ebbff1cb4b05a77734c7bf015427/build/temp.linux-x86_64-3.8/quant_cuda/quant_cuda_kernel.o
    /usr/local/cuda/bin/nvcc  -I/opt/conda/lib/python3.8/site-packages/torch/include -I/opt/conda/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/lib/python3.8/site-packages/torch/include/THC -I/usr/local/cuda/include -I/tmp/pip-install-4ahd0ixx/auto-gptq_8427ebbff1cb4b05a77734c7bf015427/quant_cuda -I/opt/conda/include/python3.8 -c -c /tmp/pip-install-4ahd0ixx/auto-gptq_8427ebbff1cb4b05a77734c7bf015427/quant_cuda/quant_cuda_kernel.cu -o /tmp/pip-install-4ahd0ixx/auto-gptq_8427ebbff1cb4b05a77734c7bf015427/build/temp.linux-x86_64-3.8/quant_cuda/quant_cuda_kernel.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=quant_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_52,code=sm_52 -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_61,code=sm_61 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 -std=c++17
    /opt/conda/lib/python3.8/site-packages/torch/include/c10/util/irange.h(54): warning: pointless comparison of unsigned integer with zero
              detected during:
                instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, <unnamed>>::operator==(const c10::detail::integer_iterator<I, one_sided, <unnamed>> &) const [with I=size_t, one_sided=false, <unnamed>=0]"
    (61): here
                instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, <unnamed>>::operator!=(const c10::detail::integer_iterator<I, one_sided, <unnamed>> &) const [with I=size_t, one_sided=false, <unnamed>=0]"
    /opt/conda/lib/python3.8/site-packages/torch/include/c10/core/TensorImpl.h(77): here
    
    /opt/conda/lib/python3.8/site-packages/torch/include/c10/util/irange.h(54): warning: pointless comparison of unsigned integer with zero
              detected during:
                instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, <unnamed>>::operator==(const c10::detail::integer_iterator<I, one_sided, <unnamed>> &) const [with I=std::size_t, one_sided=true, <unnamed>=0]"
    (61): here
                instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, <unnamed>>::operator!=(const c10::detail::integer_iterator<I, one_sided, <unnamed>> &) const [with I=std::size_t, one_sided=true, <unnamed>=0]"
    /opt/conda/lib/python3.8/site-packages/torch/include/ATen/core/qualified_name.h(73): here
    
    /tmp/pip-install-4ahd0ixx/auto-gptq_8427ebbff1cb4b05a77734c7bf015427/quant_cuda/quant_cuda_kernel.cu(1128): error: identifier "__hfma2" is undefined
    
    /tmp/pip-install-4ahd0ixx/auto-gptq_8427ebbff1cb4b05a77734c7bf015427/quant_cuda/quant_cuda_kernel.cu(1128): error: identifier "__hfma2" is undefined
    
    /tmp/pip-install-4ahd0ixx/auto-gptq_8427ebbff1cb4b05a77734c7bf015427/quant_cuda/quant_cuda_kernel.cu(1262): error: identifier "__hfma2" is undefined
    
    /tmp/pip-install-4ahd0ixx/auto-gptq_8427ebbff1cb4b05a77734c7bf015427/quant_cuda/quant_cuda_kernel.cu(1262): error: identifier "__hfma2" is undefined
    
    /tmp/pip-install-4ahd0ixx/auto-gptq_8427ebbff1cb4b05a77734c7bf015427/quant_cuda/quant_cuda_kernel.cu(1380): error: identifier "__hfma2" is undefined
    
    /tmp/pip-install-4ahd0ixx/auto-gptq_8427ebbff1cb4b05a77734c7bf015427/quant_cuda/quant_cuda_kernel.cu(1380): error: identifier "__hfma2" is undefined
    
    /opt/conda/lib/python3.8/site-packages/torch/include/c10/util/irange.h(54): warning: pointless comparison of unsigned integer with zero
              detected during:
                instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, <unnamed>>::operator==(const c10::detail::integer_iterator<I, one_sided, <unnamed>> &) const [with I=size_t, one_sided=false, <unnamed>=0]"
    (61): here
                instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, <unnamed>>::operator!=(const c10::detail::integer_iterator<I, one_sided, <unnamed>> &) const [with I=size_t, one_sided=false, <unnamed>=0]"
    /opt/conda/lib/python3.8/site-packages/torch/include/c10/core/TensorImpl.h(77): here
    
    /opt/conda/lib/python3.8/site-packages/torch/include/c10/util/irange.h(54): warning: pointless comparison of unsigned integer with zero
              detected during:
                instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, <unnamed>>::operator==(const c10::detail::integer_iterator<I, one_sided, <unnamed>> &) const [with I=std::size_t, one_sided=true, <unnamed>=0]"
    (61): here
                instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, <unnamed>>::operator!=(const c10::detail::integer_iterator<I, one_sided, <unnamed>> &) const [with I=std::size_t, one_sided=true, <unnamed>=0]"
    /opt/conda/lib/python3.8/site-packages/torch/include/ATen/core/qualified_name.h(73): here
    
    6 errors detected in the compilation of "/tmp/pip-install-4ahd0ixx/auto-gptq_8427ebbff1cb4b05a77734c7bf015427/quant_cuda/quant_cuda_kernel.cu".
    ninja: build stopped: subcommand failed.
    Traceback (most recent call last):
      File "/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1893, in _run_ninja_build
        subprocess.run(
      File "/opt/conda/lib/python3.8/subprocess.py", line 512, in run
        raise CalledProcessError(retcode, process.args,
    subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
    
    The above exception was the direct cause of the following exception:
    
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-install-4ahd0ixx/auto-gptq_8427ebbff1cb4b05a77734c7bf015427/setup.py", line 49, in <module>
        setup(
      File "/opt/conda/lib/python3.8/site-packages/setuptools/__init__.py", line 153, in setup
        return distutils.core.setup(**attrs)
      File "/opt/conda/lib/python3.8/distutils/core.py", line 148, in setup
        dist.run_commands()
      File "/opt/conda/lib/python3.8/distutils/dist.py", line 966, in run_commands
        self.run_command(cmd)
      File "/opt/conda/lib/python3.8/distutils/dist.py", line 985, in run_command
        cmd_obj.run()
      File "/opt/conda/lib/python3.8/site-packages/setuptools/command/install.py", line 61, in run
        return orig.install.run(self)
      File "/opt/conda/lib/python3.8/distutils/command/install.py", line 545, in run
        self.run_command('build')
      File "/opt/conda/lib/python3.8/distutils/cmd.py", line 313, in run_command
        self.distribution.run_command(command)
      File "/opt/conda/lib/python3.8/distutils/dist.py", line 985, in run_command
        cmd_obj.run()
      File "/opt/conda/lib/python3.8/distutils/command/build.py", line 135, in run
        self.run_command(cmd_name)
      File "/opt/conda/lib/python3.8/distutils/cmd.py", line 313, in run_command
        self.distribution.run_command(command)
      File "/opt/conda/lib/python3.8/distutils/dist.py", line 985, in run_command
        cmd_obj.run()
      File "/opt/conda/lib/python3.8/site-packages/setuptools/command/build_ext.py", line 79, in run
        _build_ext.run(self)
      File "/opt/conda/lib/python3.8/site-packages/Cython/Distutils/old_build_ext.py", line 186, in run
        _build_ext.build_ext.run(self)
      File "/opt/conda/lib/python3.8/distutils/command/build_ext.py", line 340, in run
        self.build_extensions()
      File "/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 843, in build_extensions
        build_ext.build_extensions(self)
      File "/opt/conda/lib/python3.8/site-packages/Cython/Distutils/old_build_ext.py", line 194, in build_extensions
        self.build_extension(ext)
      File "/opt/conda/lib/python3.8/site-packages/setuptools/command/build_ext.py", line 196, in build_extension
        _build_ext.build_extension(self, ext)
      File "/opt/conda/lib/python3.8/distutils/command/build_ext.py", line 528, in build_extension
        objects = self.compiler.compile(sources,
      File "/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 658, in unix_wrap_ninja_compile
        _write_ninja_file_and_compile_objects(
      File "/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1574, in _write_ninja_file_and_compile_objects
        _run_ninja_build(
      File "/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1909, in _run_ninja_build
        raise RuntimeError(message) from e
    RuntimeError: Error compiling objects for extension
    ----------------------------------------
ERROR: Command errored out with exit status 1: /opt/conda/bin/python -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-4ahd0ixx/auto-gptq_8427ebbff1cb4b05a77734c7bf015427/setup.py'"'"'; __file__='"'"'/tmp/pip-install-4ahd0ixx/auto-gptq_8427ebbff1cb4b05a77734c7bf015427/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-mwsj76kg/install-record.txt --single-version-externally-managed --compile --install-headers /opt/conda/include/python3.8/auto-gptq Check the logs for full command output.
```
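
For reference, `__hfma2` is a half-precision intrinsic that only exists on GPUs of compute capability 5.3 or higher, and the nvcc command above compiles for `compute_52` among other architectures, which is one plausible cause of the failure. A minimal sketch, assuming PyTorch's standard `TORCH_CUDA_ARCH_LIST` mechanism and using 8.0 (A100) purely as an example value, of restricting the extension build to the architecture of the installed GPU:

```
# Hedged sketch: build the CUDA extension for a single modern architecture so that nvcc
# does not compile half-precision intrinsics such as __hfma2 for compute_52.
# "8.0" is an assumption (A100); replace it with your GPU's compute capability.
import os
import subprocess
import sys

os.environ["TORCH_CUDA_ARCH_LIST"] = "8.0"
subprocess.check_call([sys.executable, "-m", "pip", "install", "auto-gptq"])
```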

Issue loading models quantised with older GPTQ code: `ValueError: QuantLinear() does not have a parameter or a buffer named bias.`

As discussed in: oobabooga/text-generation-webui#1668

Trying to load, for example, https://huggingface.co/anon8231489123/vicuna-13b-GPTQ-4bit-128g, we get:

The safetensors archive passed at /workspace/models/anon8231489123_vicuna-13b-GPTQ-4bit-128g/vicuna-13b-4bit-128g.safetensors does not contain metadata. Make sure to save your model with the `save_pretrained` method. Defaulting to 'pt' metadata.
Traceback (most recent call last):
  File "/workspace/test_gptq_cuda.py", line 29, in <module>
    model = get_model("vicuna-13b-4bit-128g", triton=False, is_llama=True, model_has_desc_act=False)
  File "/workspace/test_gptq_cuda.py", line 21, in get_model
    return AutoGPTQForCausalLM.from_quantized(quantized_model_dir, use_safetensors=True, model_basename=model_base, device="cuda:0", use_triton=triton, quantize_config=get_config(model_has_desc_act))
  File "/usr/local/lib/python3.10/dist-packages/auto_gptq/modeling/auto.py", line 66, in from_quantized
    return quant_func(
  File "/usr/local/lib/python3.10/dist-packages/auto_gptq/modeling/llama.py", line 107, in from_quantized
    model = accelerate.load_checkpoint_and_dispatch(
  File "/usr/local/lib/python3.10/dist-packages/accelerate/big_modeling.py", line 479, in load_checkpoint_and_dispatch
    load_checkpoint_in_model(
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/modeling.py", line 946, in load_checkpoint_in_model
    set_module_tensor_to_device(model, param_name, param_device, value=param, dtype=dtype)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/modeling.py", line 131, in set_module_tensor_to_device
    raise ValueError(f"{module} does not have a parameter or a buffer named {tensor_name}.")
ValueError: QuantLinear() does not have a parameter or a buffer named bias.

This model loads fine with the latest GPTQ-for-LLaMa code, on both the CUDA and Triton branches.

Could such models be made to work in AutoGPTQ as well, if necessary by including some compatibility code for whatever has changed to cause this error?

I know this is a pain, but some of the most popular models on HF are encoded in this way, e.g. the model above, which has 32,227 downloads.

I think being able to support the widest range of models will help the community as a whole move to AutoGPTQ as the one standard for GPTQ.
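
One way to narrow the incompatibility down is to inspect exactly which tensor names the old-format checkpoint contains, to confirm that the per-layer `bias` entries are present in the file but absent from the freshly constructed `QuantLinear` modules. A minimal diagnostic sketch, assuming the standard `safetensors` API and a hypothetical local file name:

```
# Hedged diagnostic sketch: list the tensor names stored in the old-format checkpoint.
# The file name below is a placeholder for wherever the .safetensors file was downloaded.
from safetensors import safe_open

checkpoint_path = "vicuna-13b-4bit-128g.safetensors"  # hypothetical local file
with safe_open(checkpoint_path, framework="pt", device="cpu") as f:
    bias_keys = [name for name in f.keys() if name.endswith(".bias")]

print(f"{len(bias_keys)} bias tensors found, e.g. {bias_keys[:5]}")
```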

How to load models like TheBloke/vicuna-13B-1.1-GPTQ-4bit-128g using AutoGPTQ?

I don't think the usual path of downloading the non-quantized model and then quantizing it myself is required for my use case. I just want to load such 4-bit GPTQ models and run inference on them.

However, I am facing issues: there are no .bin files like AutoGPTQ expects, and the tokenizer is not working either.

What to do now?

I have simply tried to modify the given example code to use only what I think is required for my case:

```
from transformers import AutoTokenizer, TextGenerationPipeline
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig


pretrained_model_dir = "vicuna-13B-1.1-GPTQ-4bit-128g"
quantized_model_dir = "vicuna-13B-1.1-GPTQ-4bit-128g"


tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
examples = [
    tokenizer(
        "auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm."
    )
]
print(examples)
# quantize_config = BaseQuantizeConfig(
#     bits=4,  # quantize model to 4-bit
#     group_size=128,  # it is recommended to set the value to 128
# )

# # load un-quantized model, by default, the model will always be loaded into CPU memory
# model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)

# # quantize model, the examples should be list of dict whose keys can only be "input_ids" and "attention_mask"
# model.quantize(examples, use_triton=False)

# # save quantized model
# model.save_quantized(quantized_model_dir)

# # save quantized model using safetensors
# model.save_quantized(quantized_model_dir, use_safetensors=True)

# load quantized model to the first GPU
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0", use_triton=False, model_basename="vicuna-13B-1.1-GPTQ-4bit-128g.compat.no-act-order.pt")
```
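
For reference, a minimal sketch of how such a pre-quantized checkpoint is typically loaded, based on the `from_quantized` arguments that appear elsewhere in this document; the `model_basename` value (the checkpoint file name without its extension) and the `BaseQuantizeConfig` settings are assumptions that have to match the actual file and the parameters the model was quantized with:

```
# Hedged sketch: load a pre-quantized checkpoint directly, skipping the quantization steps.
# model_basename is assumed to be the checkpoint file name minus its extension, and the
# BaseQuantizeConfig values must match how the model was originally quantized.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

quantized_model_dir = "vicuna-13B-1.1-GPTQ-4bit-128g"

tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir, use_fast=True)
quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)

model = AutoGPTQForCausalLM.from_quantized(
    quantized_model_dir,
    model_basename="vicuna-13B-1.1-GPTQ-4bit-128g.compat.no-act-order",  # file name minus .pt (assumption)
    device="cuda:0",
    use_triton=False,
    use_safetensors=False,  # the checkpoint referenced above is a .pt file
    quantize_config=quantize_config,
)

inputs = tokenizer("auto-gptq is", return_tensors="pt").to("cuda:0")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```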

GPTNeoX-based model issue: Can't load GPTQ model, but can do inference from in-memory GPTQ

Today I am trying to run AutoGPTQ on a GPTNeoX-based model. Specifically, the model is: https://huggingface.co/h2oai/h2ogpt-oasst1-512-20b

Here is the error I receive when I execute `model = AutoGPTQForCausalLM.from_quantized(args.quantized_model_dir, device="cuda:0")` on my GPTQ quantised model:

Traceback (most recent call last):
  File "/root/quant_h2o.py", line 157, in <module>
    main()
  File "/root/quant_h2o.py", line 130, in main
    model = AutoGPTQForCausalLM.from_quantized(args.quantized_model_dir, device="cuda:0")
  File "/usr/local/lib/python3.10/dist-packages/auto_gptq/modeling/auto.py", line 51, in from_quantized
    return GPTQ_CAUSAL_LM_MODEL_MAP[model_type].from_quantized(
  File "/usr/local/lib/python3.10/dist-packages/auto_gptq/modeling/_base.py", line 342, in from_quantized
    model.load_state_dict(torch.load(model_save_name))
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for GPTNeoXForCausalLM:
        Missing key(s) in state_dict: "embed_out.qweight", "embed_out.qzeros", "embed_out.scales", "embed_out.g_idx".
        Unexpected key(s) in state_dict: "embed_out.weight".

In order to test this better, I slightly modified quant_with_alpaca.py. Here is the end of the file:

    model.quantize(examples_for_quant)

    if not args.quantized_model_dir:
        args.quantized_model_dir = args.pretrained_model_dir

    if args.save_and_reload:
        print(f"Saving model to {args.quantized_model_dir}")
        model.save_quantized(args.quantized_model_dir)

        print(f"Saving safetensors to {args.quantized_safetensor_dir}")
        model.save_quantized(args.quantized_safetensor_dir, use_safetensors=True)

    print("Testing inference without reloading")
    pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer, device="cuda:0")
    for example in random.sample(examples, k=min(4, len(examples))):
        print(f"prompt: {example['prompt']}")
        print(f"origin: {example['output']}")
        start = time.time()
        generated_text = pipeline(
            example['prompt'],
            return_full_text=False,
            num_beams=1,
            max_length=len(example["input_ids"]) + 128  # use this instead of max_new_tokens to avoid a UserWarning when integrating with logging
        )[0]['generated_text']
        end = time.time()
        print(f"quant: {generated_text}")
        num_new_tokens = len(tokenizer(generated_text)["input_ids"])
        print(f"generate {num_new_tokens} tokens using {end-start: .4f}s")
        print("=" * 42)

    if args.save_and_reload:
        print(f"Testing inference AFTER reloading from {args.quantized_model_dir}")
        model = AutoGPTQForCausalLM.from_quantized(args.quantized_model_dir, device="cuda:0")

        pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer, device="cuda:0")
        for example in random.sample(examples, k=min(4, len(examples))):
            print(f"prompt: {example['prompt']}")
            print(f"origin: {example['output']}")
            start = time.time()
            generated_text = pipeline(
                example['prompt'],
                return_full_text=False,
                num_beams=1,
                max_length=len(example["input_ids"]) + 128  # use this instead of max_new_tokens to avoid a UserWarning when integrating with logging
            )[0]['generated_text']
            end = time.time()
            print(f"quant: {generated_text}")
            num_new_tokens = len(tokenizer(generated_text)["input_ids"])
            print(f"generate {num_new_tokens} tokens using {end-start: .4f}s")
            print("=" * 42)

So I changed it to first save the model, then do inference on the in-memory copy, then try to load the file from disk and do inference again.

The result is this:

... model is packing ...
2023-04-24 10:32:42 INFO [auto_gptq.modeling._utils] gpt_neox.layers.43.mlp.dense_h_to_4h
2023-04-24 10:32:47 INFO [auto_gptq.modeling._utils] Model packed.
Saving model to /workspace/h2oquant2
Saving safetensors to /workspace/h2oquant2-safetensor
Testing inference without reloading
The model 'GPTNeoXGPTQForCausalLM' is not supported for . Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'CodeGenForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'LlamaForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MvpForCausalLM', 'OpenAIGPTLMHeadModel', 'OPTForCausalLM', 'PegasusForCausalLM', 'PLBartForCausalLM', 'ProphetNetForCausalLM', 'QDQBertLMHeadModel', 'ReformerModelWithLMHead', 'RemBertForCausalLM', 'RobertaForCausalLM', 'RobertaPreLayerNormForCausalLM', 'RoCBertForCausalLM', 'RoFormerForCausalLM', 'Speech2Text2ForCausalLM', 'TransfoXLLMHeadModel', 'TrOCRForCausalLM', 'XGLMForCausalLM', 'XLMWithLMHeadModel', 'XLMProphetNetForCausalLM', 'XLMRobertaForCausalLM', 'XLMRobertaXLForCausalLM', 'XLNetLMHeadModel', 'XmodForCausalLM'].
prompt: Instruction:
Generate a paragraph explaining the meaning of 'machine learning'.
Output:

origin: Machine Learning is a subset of Artificial Intelligence that uses data patterns and algorithms to enable computers to learn and apply knowledge without explicit programming. It uses data to make predictions, build models, and make decisions. Machine Learning is used in many different fields, from recommendation engines and medical diagnosis to object detection and language processing. It is becoming increasingly important in our society, as it allows us to gain valuable insights from data and use them to make better decisions. With Machine Learning, computers can learn to recognize patterns and improve their accuracy with the more data they are exposed to. This enables us to discover hidden insights, develop smarter automation and build more intelligent systems.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.

.... other examples removed here ....

generate 169 tokens using  18.2065s
==========================================
Testing inference AFTER reloading from /workspace/h2oquant2
Traceback (most recent call last):
  File "/root/quant_h2o.py", line 157, in <module>
    main()
  File "/root/quant_h2o.py", line 130, in main
    model = AutoGPTQForCausalLM.from_quantized(args.quantized_model_dir, device="cuda:0")
  File "/usr/local/lib/python3.10/dist-packages/auto_gptq/modeling/auto.py", line 51, in from_quantized
    return GPTQ_CAUSAL_LM_MODEL_MAP[model_type].from_quantized(
  File "/usr/local/lib/python3.10/dist-packages/auto_gptq/modeling/_base.py", line 342, in from_quantized
    model.load_state_dict(torch.load(model_save_name))
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for GPTNeoXForCausalLM:
        Missing key(s) in state_dict: "embed_out.qweight", "embed_out.qzeros", "embed_out.scales", "embed_out.g_idx".
        Unexpected key(s) in state_dict: "embed_out.weight".

Conclusion: Inference works OK from the in-memory copy, but then the saved model cannot be loaded from disk. I am not sure what is wrong and would be grateful for any suggestions.
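
One quick check that might help pinpoint the problem is to compare the `embed_out` keys of the in-memory packed model against those in the checkpoint written to disk. A minimal sketch, run right after the quantization step above, with the checkpoint file name given as a placeholder (use whatever `save_quantized` actually wrote to the output directory):

```
# Hedged diagnostic sketch: compare embed_out-related keys between the in-memory packed
# model and the saved checkpoint. The file name below is a placeholder.
import torch

saved_state = torch.load("/workspace/h2oquant2/gptq_model-4bit-128g.bin", map_location="cpu")  # hypothetical file name

print("on disk:  ", sorted(k for k in saved_state if k.startswith("embed_out")))
print("in memory:", sorted(k for k in model.model.state_dict() if k.startswith("embed_out")))
```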

Module 'quant_cuda' has no attribute 'vecquant4matmul'

Unable to load the package after running:

BUILD_CUDA_EXT=0 
!pip install auto-gptq

Ran into this error:

 Building wheel for auto-gptq (setup.py) ... error
  error: subprocess-exited-with-error
  
  × python setup.py bdist_wheel did not run successfully.
  │ exit code: 1
  ╰─> [83 lines of output]
      /opt/conda/lib/python3.10/site-packages/setuptools/dist.py:493: UserWarning: Normalizing 'v0.0.5' to '0.0.5'
        warnings.warn(tmpl.format(**locals()))
      running bdist_wheel
      running build
      running build_py
      creating build
      creating build/lib.linux-x86_64-3.10
      creating build/lib.linux-x86_64-3.10/auto_gptq
      copying auto_gptq/__init__.py -> build/lib.linux-x86_64-3.10/auto_gptq
      creating build/lib.linux-x86_64-3.10/auto_gptq/modeling
      copying auto_gptq/modeling/gptj.py -> build/lib.linux-x86_64-3.10/auto_gptq/modeling
      copying auto_gptq/modeling/bloom.py -> build/lib.linux-x86_64-3.10/auto_gptq/modeling
      copying auto_gptq/modeling/__init__.py -> build/lib.linux-x86_64-3.10/auto_gptq/modeling
      copying auto_gptq/modeling/llama.py -> build/lib.linux-x86_64-3.10/auto_gptq/modeling
      copying auto_gptq/modeling/_const.py -> build/lib.linux-x86_64-3.10/auto_gptq/modeling
      copying auto_gptq/modeling/gpt_neox.py -> build/lib.linux-x86_64-3.10/auto_gptq/modeling
      copying auto_gptq/modeling/moss.py -> build/lib.linux-x86_64-3.10/auto_gptq/modeling
      copying auto_gptq/modeling/_base.py -> build/lib.linux-x86_64-3.10/auto_gptq/modeling
      copying auto_gptq/modeling/opt.py -> build/lib.linux-x86_64-3.10/auto_gptq/modeling
      copying auto_gptq/modeling/auto.py -> build/lib.linux-x86_64-3.10/auto_gptq/modeling
      copying auto_gptq/modeling/_utils.py -> build/lib.linux-x86_64-3.10/auto_gptq/modeling
      creating build/lib.linux-x86_64-3.10/auto_gptq/quantization
      copying auto_gptq/quantization/__init__.py -> build/lib.linux-x86_64-3.10/auto_gptq/quantization
      copying auto_gptq/quantization/gptq.py -> build/lib.linux-x86_64-3.10/auto_gptq/quantization
      copying auto_gptq/quantization/quantizer.py -> build/lib.linux-x86_64-3.10/auto_gptq/quantization
      creating build/lib.linux-x86_64-3.10/auto_gptq/eval_tasks
      copying auto_gptq/eval_tasks/__init__.py -> build/lib.linux-x86_64-3.10/auto_gptq/eval_tasks
      copying auto_gptq/eval_tasks/sequence_classification_task.py -> build/lib.linux-x86_64-3.10/auto_gptq/eval_tasks
      copying auto_gptq/eval_tasks/_base.py -> build/lib.linux-x86_64-3.10/auto_gptq/eval_tasks
      copying auto_gptq/eval_tasks/text_summarization_task.py -> build/lib.linux-x86_64-3.10/auto_gptq/eval_tasks
      copying auto_gptq/eval_tasks/language_modeling_task.py -> build/lib.linux-x86_64-3.10/auto_gptq/eval_tasks
      creating build/lib.linux-x86_64-3.10/auto_gptq/nn_modules
      copying auto_gptq/nn_modules/__init__.py -> build/lib.linux-x86_64-3.10/auto_gptq/nn_modules
      copying auto_gptq/nn_modules/qlinear.py -> build/lib.linux-x86_64-3.10/auto_gptq/nn_modules
      copying auto_gptq/nn_modules/qlinear_triton.py -> build/lib.linux-x86_64-3.10/auto_gptq/nn_modules
      creating build/lib.linux-x86_64-3.10/auto_gptq/eval_tasks/_utils
      copying auto_gptq/eval_tasks/_utils/__init__.py -> build/lib.linux-x86_64-3.10/auto_gptq/eval_tasks/_utils
      copying auto_gptq/eval_tasks/_utils/classification_utils.py -> build/lib.linux-x86_64-3.10/auto_gptq/eval_tasks/_utils
      copying auto_gptq/eval_tasks/_utils/generation_utils.py -> build/lib.linux-x86_64-3.10/auto_gptq/eval_tasks/_utils
      copying auto_gptq/eval_tasks/_utils/data_utils.py -> build/lib.linux-x86_64-3.10/auto_gptq/eval_tasks/_utils
      creating build/lib.linux-x86_64-3.10/auto_gptq/nn_modules/triton_utils
      copying auto_gptq/nn_modules/triton_utils/__init__.py -> build/lib.linux-x86_64-3.10/auto_gptq/nn_modules/triton_utils
      copying auto_gptq/nn_modules/triton_utils/custom_autotune.py -> build/lib.linux-x86_64-3.10/auto_gptq/nn_modules/triton_utils
      running build_ext
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/tmp/pip-install-3byddf0o/auto-gptq_c352ae6c79014b188e05fc83034e86ff/setup.py", line 47, in <module>
          setup(
        File "/opt/conda/lib/python3.10/site-packages/setuptools/__init__.py", line 153, in setup
          return distutils.core.setup(**attrs)
        File "/opt/conda/lib/python3.10/distutils/core.py", line 148, in setup
          dist.run_commands()
        File "/opt/conda/lib/python3.10/distutils/dist.py", line 966, in run_commands
          self.run_command(cmd)
        File "/opt/conda/lib/python3.10/distutils/dist.py", line 985, in run_command
          cmd_obj.run()
        File "/opt/conda/lib/python3.10/site-packages/wheel/bdist_wheel.py", line 343, in run
          self.run_command("build")
        File "/opt/conda/lib/python3.10/distutils/cmd.py", line 313, in run_command
          self.distribution.run_command(command)
        File "/opt/conda/lib/python3.10/distutils/dist.py", line 985, in run_command
          cmd_obj.run()
        File "/opt/conda/lib/python3.10/distutils/command/build.py", line 135, in run
          self.run_command(cmd_name)
        File "/opt/conda/lib/python3.10/distutils/cmd.py", line 313, in run_command
          self.distribution.run_command(command)
        File "/opt/conda/lib/python3.10/distutils/dist.py", line 985, in run_command
          cmd_obj.run()
        File "/opt/conda/lib/python3.10/site-packages/setuptools/command/build_ext.py", line 79, in run
          _build_ext.run(self)
        File "/opt/conda/lib/python3.10/site-packages/Cython/Distutils/old_build_ext.py", line 186, in run
          _build_ext.build_ext.run(self)
        File "/opt/conda/lib/python3.10/distutils/command/build_ext.py", line 340, in run
          self.build_extensions()
        File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 499, in build_extensions
          _check_cuda_version(compiler_name, compiler_version)
        File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 387, in _check_cuda_version
          raise RuntimeError(CUDA_MISMATCH_MESSAGE.format(cuda_str_version, torch.version.cuda))
      RuntimeError:
      The detected CUDA version (12.1) mismatches the version that was used to compile
      PyTorch (11.3). Please make sure to use the same CUDA versions.
      
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for auto-gptq
  Running setup.py clean for auto-gptq
Failed to build auto-gptq
Installing collected packages: safetensors, rouge, auto-gptq
  Running setup.py install for auto-gptq ... error
  error: subprocess-exited-with-error
  
  × Running setup.py install for auto-gptq did not run successfully.
  │ exit code: 1
  ╰─> [87 lines of output]
      /opt/conda/lib/python3.10/site-packages/setuptools/dist.py:493: UserWarning: Normalizing 'v0.0.5' to '0.0.5'
        warnings.warn(tmpl.format(**locals()))
      running install
      /opt/conda/lib/python3.10/site-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
        warnings.warn(
      running build
      running build_py
      creating build
      creating build/lib.linux-x86_64-3.10
      creating build/lib.linux-x86_64-3.10/auto_gptq
      copying auto_gptq/__init__.py -> build/lib.linux-x86_64-3.10/auto_gptq
      creating build/lib.linux-x86_64-3.10/auto_gptq/modeling
      copying auto_gptq/modeling/gptj.py -> build/lib.linux-x86_64-3.10/auto_gptq/modeling
      copying auto_gptq/modeling/bloom.py -> build/lib.linux-x86_64-3.10/auto_gptq/modeling
      copying auto_gptq/modeling/__init__.py -> build/lib.linux-x86_64-3.10/auto_gptq/modeling
      copying auto_gptq/modeling/llama.py -> build/lib.linux-x86_64-3.10/auto_gptq/modeling
      copying auto_gptq/modeling/_const.py -> build/lib.linux-x86_64-3.10/auto_gptq/modeling
      copying auto_gptq/modeling/gpt_neox.py -> build/lib.linux-x86_64-3.10/auto_gptq/modeling
      copying auto_gptq/modeling/moss.py -> build/lib.linux-x86_64-3.10/auto_gptq/modeling
      copying auto_gptq/modeling/_base.py -> build/lib.linux-x86_64-3.10/auto_gptq/modeling
      copying auto_gptq/modeling/opt.py -> build/lib.linux-x86_64-3.10/auto_gptq/modeling
      copying auto_gptq/modeling/auto.py -> build/lib.linux-x86_64-3.10/auto_gptq/modeling
      copying auto_gptq/modeling/_utils.py -> build/lib.linux-x86_64-3.10/auto_gptq/modeling
      creating build/lib.linux-x86_64-3.10/auto_gptq/quantization
      copying auto_gptq/quantization/__init__.py -> build/lib.linux-x86_64-3.10/auto_gptq/quantization
      copying auto_gptq/quantization/gptq.py -> build/lib.linux-x86_64-3.10/auto_gptq/quantization
      copying auto_gptq/quantization/quantizer.py -> build/lib.linux-x86_64-3.10/auto_gptq/quantization
      creating build/lib.linux-x86_64-3.10/auto_gptq/eval_tasks
      copying auto_gptq/eval_tasks/__init__.py -> build/lib.linux-x86_64-3.10/auto_gptq/eval_tasks
      copying auto_gptq/eval_tasks/sequence_classification_task.py -> build/lib.linux-x86_64-3.10/auto_gptq/eval_tasks
      copying auto_gptq/eval_tasks/_base.py -> build/lib.linux-x86_64-3.10/auto_gptq/eval_tasks
      copying auto_gptq/eval_tasks/text_summarization_task.py -> build/lib.linux-x86_64-3.10/auto_gptq/eval_tasks
      copying auto_gptq/eval_tasks/language_modeling_task.py -> build/lib.linux-x86_64-3.10/auto_gptq/eval_tasks
      creating build/lib.linux-x86_64-3.10/auto_gptq/nn_modules
      copying auto_gptq/nn_modules/__init__.py -> build/lib.linux-x86_64-3.10/auto_gptq/nn_modules
      copying auto_gptq/nn_modules/qlinear.py -> build/lib.linux-x86_64-3.10/auto_gptq/nn_modules
      copying auto_gptq/nn_modules/qlinear_triton.py -> build/lib.linux-x86_64-3.10/auto_gptq/nn_modules
      creating build/lib.linux-x86_64-3.10/auto_gptq/eval_tasks/_utils
      copying auto_gptq/eval_tasks/_utils/__init__.py -> build/lib.linux-x86_64-3.10/auto_gptq/eval_tasks/_utils
      copying auto_gptq/eval_tasks/_utils/classification_utils.py -> build/lib.linux-x86_64-3.10/auto_gptq/eval_tasks/_utils
      copying auto_gptq/eval_tasks/_utils/generation_utils.py -> build/lib.linux-x86_64-3.10/auto_gptq/eval_tasks/_utils
      copying auto_gptq/eval_tasks/_utils/data_utils.py -> build/lib.linux-x86_64-3.10/auto_gptq/eval_tasks/_utils
      creating build/lib.linux-x86_64-3.10/auto_gptq/nn_modules/triton_utils
      copying auto_gptq/nn_modules/triton_utils/__init__.py -> build/lib.linux-x86_64-3.10/auto_gptq/nn_modules/triton_utils
      copying auto_gptq/nn_modules/triton_utils/custom_autotune.py -> build/lib.linux-x86_64-3.10/auto_gptq/nn_modules/triton_utils
      running build_ext
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/tmp/pip-install-3byddf0o/auto-gptq_c352ae6c79014b188e05fc83034e86ff/setup.py", line 47, in <module>
          setup(
        File "/opt/conda/lib/python3.10/site-packages/setuptools/__init__.py", line 153, in setup
          return distutils.core.setup(**attrs)
        File "/opt/conda/lib/python3.10/distutils/core.py", line 148, in setup
          dist.run_commands()
        File "/opt/conda/lib/python3.10/distutils/dist.py", line 966, in run_commands
          self.run_command(cmd)
        File "/opt/conda/lib/python3.10/distutils/dist.py", line 985, in run_command
          cmd_obj.run()
        File "/opt/conda/lib/python3.10/site-packages/setuptools/command/install.py", line 68, in run
          return orig.install.run(self)
        File "/opt/conda/lib/python3.10/distutils/command/install.py", line 568, in run
          self.run_command('build')
        File "/opt/conda/lib/python3.10/distutils/cmd.py", line 313, in run_command
          self.distribution.run_command(command)
        File "/opt/conda/lib/python3.10/distutils/dist.py", line 985, in run_command
          cmd_obj.run()
        File "/opt/conda/lib/python3.10/distutils/command/build.py", line 135, in run
          self.run_command(cmd_name)
        File "/opt/conda/lib/python3.10/distutils/cmd.py", line 313, in run_command
          self.distribution.run_command(command)
        File "/opt/conda/lib/python3.10/distutils/dist.py", line 985, in run_command
          cmd_obj.run()
        File "/opt/conda/lib/python3.10/site-packages/setuptools/command/build_ext.py", line 79, in run
          _build_ext.run(self)
        File "/opt/conda/lib/python3.10/site-packages/Cython/Distutils/old_build_ext.py", line 186, in run
          _build_ext.build_ext.run(self)
        File "/opt/conda/lib/python3.10/distutils/command/build_ext.py", line 340, in run
          self.build_extensions()
        File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 499, in build_extensions
          _check_cuda_version(compiler_name, compiler_version)
        File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 387, in _check_cuda_version
          raise RuntimeError(CUDA_MISMATCH_MESSAGE.format(cuda_str_version, torch.version.cuda))
      RuntimeError:
      The detected CUDA version (12.1) mismatches the version that was used to compile
      PyTorch (11.3). Please make sure to use the same CUDA versions.
      
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: legacy-install-failure

× Encountered error while trying to install package.
╰─> auto-gptq

note: This is an issue with the package mentioned above, not pip.
hint: See above for output from the failure.
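
As an aside on the command at the top of this report: in a notebook cell, `BUILD_CUDA_EXT=0` on its own line is just a Python variable assignment, so the subprocess started by `!pip install` never sees it, and the CUDA extension build (and its version check) still runs. A minimal sketch, assuming the `BUILD_CUDA_EXT` variable documented in the README, of exporting it so the pip subprocess inherits it:

```
# Hedged sketch: export BUILD_CUDA_EXT as a real environment variable before invoking pip,
# so auto-gptq's setup.py (which the README documents as honouring this flag) can skip
# building the CUDA extension instead of hitting the CUDA version mismatch above.
import os
import subprocess
import sys

os.environ["BUILD_CUDA_EXT"] = "0"
subprocess.check_call([sys.executable, "-m", "pip", "install", "auto-gptq"])
```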

A question about the example size

Hi,
I ran into the following problem while running the example code:

ValueError: Attention mask should be of size (1, 1, 30, 30), but is torch.Size([1, 30, 30])

I only replaced the pretrained_model_dir path in the example code; the model I am using is a fine-tuned llama-7B, about 26 GB in size.

I did not change the examples at all, and I also checked their shapes:

>>> example.get("input_ids").shape
torch.Size([1, 30])
>>> example.get("attention_mask").shape
torch.Size([1, 30])

Is this error related to the transformers version?

My transformers version is the latest one, from a commit on Apr 20, 2023 (474bf508dfe0d46fc38585a1bb793e5ba74fddfd).
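
Since the commit mentioned above is a development snapshot rather than a release, it may help to record the exact transformers version and the shapes the tokenizer actually produces when reporting this. A minimal diagnostic sketch, with the model path given as a placeholder:

```
# Hedged diagnostic sketch: print the installed transformers version and the shapes of a
# tokenized example, to compare against an environment where quantization works.
import transformers
from transformers import AutoTokenizer

print("transformers:", transformers.__version__)

tokenizer = AutoTokenizer.from_pretrained("path/to/llama-7b", use_fast=True)  # hypothetical local path
example = tokenizer("auto-gptq is an easy-to-use model quantization library.", return_tensors="pt")
print("input_ids:", tuple(example["input_ids"].shape), "attention_mask:", tuple(example["attention_mask"].shape))
```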

CUDA inference: issue with group_size = 1024 + desc_act = False. (Triton unaffected)

Hi @PanQiWei and @qwopqwop200

I have encountered a strange bug that is specific to group_size = 1024 + desc_act=False + CUDA inference.

Last night I did a bunch of quantisations, covering all permutations of quantisation parameters.

Today I am testing perplexity, and I have found that models quantised with group_size = 1024 + desc_act = False do not work with the `model(tokens)` syntax on the CUDA backend, although `model.generate(...)` works fine.

Here is test code to demonstrate the issue:

import os

from transformers import AutoTokenizer, TextGenerationPipeline
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
import numpy as np
import torch
import torch.nn as nn
import argparse

def get_wikitext2(nsamples, seed, seqlen, tokenizer):
    from datasets import load_dataset

    wikidata = load_dataset('wikitext', 'wikitext-2-raw-v1', split='test')
    wikilist = [' \n' if s == '' else s for s in wikidata['text'] ]

    text = ''.join(wikilist)
    trainenc = tokenizer(text, return_tensors='pt')

    import random
    random.seed(seed)
    np.random.seed(0)
    torch.random.manual_seed(0)

    traindataset = []
    for _ in range(nsamples):
        i = random.randint(0, trainenc.input_ids.shape[1] - seqlen - 1)
        j = i + seqlen
        inp = trainenc.input_ids[:, i:j]
        attention_mask = torch.ones_like(inp)
        traindataset.append({'input_ids':inp,'attention_mask': attention_mask})
    return traindataset

pretrained_model_dir = "/workspace/models/huggyllama_llama-7b"
quantized_model_dir = "/workspace/test-1024g"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)

if not os.path.isdir(quantized_model_dir):
    quantize_config = BaseQuantizeConfig(
        bits=4,
        group_size=1024,
        desc_act=False
    )

    traindataset = get_wikitext2(128, 0, 2048, tokenizer)
    # load the un-quantized model; it is always loaded into CPU memory first
    print("Loading model")
    model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)

    print("Quantising")
    model.quantize(traindataset, use_triton=False)

    os.makedirs(quantized_model_dir, exist_ok=True)
    model.save_quantized(quantized_model_dir, use_safetensors=True)

print("Reloading model just quantised")
for triton in [ True, False ]:
    print(f"Testing with use_triton = {triton}")
    model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0", use_triton=triton, use_safetensors=True)

    # Make a long text
    sentence = "auto gptq is " * 500
    input_ids = tokenizer(sentence, return_tensors="pt", truncation=False).input_ids.to("cuda:0")
    # Run model on first 512 tokens
    try:
        output = model(input_ids = input_ids[:, 0:512])
        print(f"Succeeded for triton = {triton}")
    except:
        print(f"FAILED for triton = {triton}")
        raise

Output:

root@1f66221a311b:/workspace/gptq-ppl-test# python test_1024.py
Downloading builder script: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8.48k/8.48k [00:00<00:00, 4.79MB/s]
Downloading metadata: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6.84k/6.84k [00:00<00:00, 5.21MB/s]
Downloading readme: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.25k/9.25k [00:00<00:00, 6.48MB/s]
Downloading and preparing dataset wikitext/wikitext-2-raw-v1 to /root/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126...
Downloading data: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.72M/4.72M [00:01<00:00, 4.61MB/s]
Dataset wikitext downloaded and prepared to /root/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126. Subsequent calls will reuse this data.
Token indices sequence length is longer than the specified maximum sequence length for this model (335688 > 2048). Running this sequence through the model will result in indexing errors
Loading model
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [01:21<00:00, 40.74s/it]
Quantising
Reloading model just quantised
Testing with use_triton = True
The safetensors archive passed at /workspace/test-1024g/gptq_model-4bit-1024g.safetensors does not contain metadata. Make sure to save your model with the `save_pretrained` method. Defaulting to 'pt' metadata.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:33<00:00,  2.80s/it]
Succeeded for triton = True
Testing with use_triton = False
The safetensors archive passed at /workspace/test-1024g/gptq_model-4bit-1024g.safetensors does not contain metadata. Make sure to save your model with the `save_pretrained` method. Defaulting to 'pt' metadata.
FAILED for triton = False
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /workspace/gptq-ppl-test/test_1024.py:66 in <module>                                             │
│                                                                                                  │
│   63 │   input_ids = tokenizer(sentence, return_tensors="pt", truncation=False).input_ids.to(    │
│   64 │   # Run model on first 512 tokens                                                         │
│   65 │   try:                                                                                    │
│ ❱ 66 │   │   output = model(input_ids[:, 0:512])                                                 │
│   67 │   │   print(f"Succeeded for triton = {triton}")                                           │
│   68 │   except:                                                                                 │
│   69 │   │   print(f"FAILED for triton = {triton}")                                              │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1501 in _call_impl            │
│                                                                                                  │
│   1498 │   │   if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks   │
│   1499 │   │   │   │   or _global_backward_pre_hooks or _global_backward_hooks                   │
│   1500 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1501 │   │   │   return forward_call(*args, **kwargs)                                          │
│   1502 │   │   # Do not call functions when jit is used                                          │
│   1503 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1504 │   │   backward_pre_hooks = []                                                           │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/auto_gptq/modeling/_base.py:374 in forward               │
│                                                                                                  │
│   371 │   │   return self.model.to(device)                                                       │
│   372 │                                                                                          │
│   373 │   def forward(self, *args, **kwargs):                                                    │
│ ❱ 374 │   │   return self.model(*args, **kwargs)                                                 │
│   375 │                                                                                          │
│   376 │   def generate(self, **kwargs):                                                          │
│   377 │   │   """shortcut for model.generate"""                                                  │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1501 in _call_impl            │
│                                                                                                  │
│   1498 │   │   if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks   │
│   1499 │   │   │   │   or _global_backward_pre_hooks or _global_backward_hooks                   │
│   1500 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1501 │   │   │   return forward_call(*args, **kwargs)                                          │
│   1502 │   │   # Do not call functions when jit is used                                          │
│   1503 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1504 │   │   backward_pre_hooks = []                                                           │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/accelerate/hooks.py:165 in new_forward                   │
│                                                                                                  │
│   162 │   │   │   with torch.no_grad():                                                          │
│   163 │   │   │   │   output = old_forward(*args, **kwargs)                                      │
│   164 │   │   else:                                                                              │
│ ❱ 165 │   │   │   output = old_forward(*args, **kwargs)                                          │
│   166 │   │   return module._hf_hook.post_forward(module, output)                                │
│   167 │                                                                                          │
│   168 │   module.forward = new_forward                                                           │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py:688 in       │
│ forward                                                                                          │
│                                                                                                  │
│   685 │   │   return_dict = return_dict if return_dict is not None else self.config.use_return   │
│   686 │   │                                                                                      │
│   687 │   │   # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)    │
│ ❱ 688 │   │   outputs = self.model(                                                              │
│   689 │   │   │   input_ids=input_ids,                                                           │
│   690 │   │   │   attention_mask=attention_mask,                                                 │
│   691 │   │   │   position_ids=position_ids,                                                     │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1501 in _call_impl            │
│                                                                                                  │
│   1498 │   │   if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks   │
│   1499 │   │   │   │   or _global_backward_pre_hooks or _global_backward_hooks                   │
│   1500 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1501 │   │   │   return forward_call(*args, **kwargs)                                          │
│   1502 │   │   # Do not call functions when jit is used                                          │
│   1503 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1504 │   │   backward_pre_hooks = []                                                           │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/accelerate/hooks.py:165 in new_forward                   │
│                                                                                                  │
│   162 │   │   │   with torch.no_grad():                                                          │
│   163 │   │   │   │   output = old_forward(*args, **kwargs)                                      │
│   164 │   │   else:                                                                              │
│ ❱ 165 │   │   │   output = old_forward(*args, **kwargs)                                          │
│   166 │   │   return module._hf_hook.post_forward(module, output)                                │
│   167 │                                                                                          │
│   168 │   module.forward = new_forward                                                           │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py:578 in       │
│ forward                                                                                          │
│                                                                                                  │
│   575 │   │   │   │   │   None,                                                                  │
│   576 │   │   │   │   )                                                                          │
│   577 │   │   │   else:                                                                          │
│ ❱ 578 │   │   │   │   layer_outputs = decoder_layer(                                             │
│   579 │   │   │   │   │   hidden_states,                                                         │
│   580 │   │   │   │   │   attention_mask=attention_mask,                                         │
│   581 │   │   │   │   │   position_ids=position_ids,                                             │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1501 in _call_impl            │
│                                                                                                  │
│   1498 │   │   if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks   │
│   1499 │   │   │   │   or _global_backward_pre_hooks or _global_backward_hooks                   │
│   1500 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1501 │   │   │   return forward_call(*args, **kwargs)                                          │
│   1502 │   │   # Do not call functions when jit is used                                          │
│   1503 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1504 │   │   backward_pre_hooks = []                                                           │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/accelerate/hooks.py:165 in new_forward                   │
│                                                                                                  │
│   162 │   │   │   with torch.no_grad():                                                          │
│   163 │   │   │   │   output = old_forward(*args, **kwargs)                                      │
│   164 │   │   else:                                                                              │
│ ❱ 165 │   │   │   output = old_forward(*args, **kwargs)                                          │
│   166 │   │   return module._hf_hook.post_forward(module, output)                                │
│   167 │                                                                                          │
│   168 │   module.forward = new_forward                                                           │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py:306 in       │
│ forward                                                                                          │
│                                                                                                  │
│   303 │   │   # Fully Connected                                                                  │
│   304 │   │   residual = hidden_states                                                           │
│   305 │   │   hidden_states = self.post_attention_layernorm(hidden_states)                       │
│ ❱ 306 │   │   hidden_states = self.mlp(hidden_states)                                            │
│   307 │   │   hidden_states = residual + hidden_states                                           │
│   308 │   │                                                                                      │
│   309 │   │   outputs = (hidden_states,)                                                         │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1501 in _call_impl            │
│                                                                                                  │
│   1498 │   │   if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks   │
│   1499 │   │   │   │   or _global_backward_pre_hooks or _global_backward_hooks                   │
│   1500 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1501 │   │   │   return forward_call(*args, **kwargs)                                          │
│   1502 │   │   # Do not call functions when jit is used                                          │
│   1503 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1504 │   │   backward_pre_hooks = []                                                           │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/accelerate/hooks.py:165 in new_forward                   │
│                                                                                                  │
│   162 │   │   │   with torch.no_grad():                                                          │
│   163 │   │   │   │   output = old_forward(*args, **kwargs)                                      │
│   164 │   │   else:                                                                              │
│ ❱ 165 │   │   │   output = old_forward(*args, **kwargs)                                          │
│   166 │   │   return module._hf_hook.post_forward(module, output)                                │
│   167 │                                                                                          │
│   168 │   module.forward = new_forward                                                           │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py:158 in       │
│ forward                                                                                          │
│                                                                                                  │
│   155 │   │   self.act_fn = ACT2FN[hidden_act]                                                   │
│   156 │                                                                                          │
│   157 │   def forward(self, x):                                                                  │
│ ❱ 158 │   │   return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))            │
│   159                                                                                            │
│   160                                                                                            │
│   161 class LlamaAttention(nn.Module):                                                           │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1501 in _call_impl            │
│                                                                                                  │
│   1498 │   │   if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks   │
│   1499 │   │   │   │   or _global_backward_pre_hooks or _global_backward_hooks                   │
│   1500 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1501 │   │   │   return forward_call(*args, **kwargs)                                          │
│   1502 │   │   # Do not call functions when jit is used                                          │
│   1503 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1504 │   │   backward_pre_hooks = []                                                           │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/accelerate/hooks.py:165 in new_forward                   │
│                                                                                                  │
│   162 │   │   │   with torch.no_grad():                                                          │
│   163 │   │   │   │   output = old_forward(*args, **kwargs)                                      │
│   164 │   │   else:                                                                              │
│ ❱ 165 │   │   │   output = old_forward(*args, **kwargs)                                          │
│   166 │   │   return module._hf_hook.post_forward(module, output)                                │
│   167 │                                                                                          │
│   168 │   module.forward = new_forward                                                           │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/auto_gptq/nn_modules/qlinear_old.py:221 in forward       │
│                                                                                                  │
│   218 │   │   │                                                                                  │
│   219 │   │   │      weight = torch.bitwise_right_shift(torch.unsqueeze(self.qweight, 1).expan   │
│   220 │   │   │      torch.bitwise_and(weight,(2 ** self.bits) - 1, out=weight)                  │
│ ❱ 221 │   │   │      weight = weight.reshape(-1, self.group_size, weight.shape[2])               │
│   222 │   │   │   elif self.bits == 3:                                                           │
│   223 │   │   │      zeros = self.qzeros.reshape(self.qzeros.shape[0], self.qzeros.shape[1]//3   │
│   224 │   │   │      zeros = (zeros >> self.wf.unsqueeze(0))                                     │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: shape '[-1, 1024, 4096]' is invalid for input of size 45088768

As you can see, Triton inference runs without any problem.

But CUDA inference on a group_size = 1024 + desc_act = False model raises this error.

The error does not happen with CUDA + group_size = 1024 + desc_act = True.
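A plausible reading of the failure, sketched below under the assumption that the failing layer is llama-7b's down_proj (in_features = 11008, out_features = 4096): the old CUDA dequantization path views the unpacked weight as (-1, group_size, out_features), which only works when in_features is divisible by group_size, and 11008 is not a multiple of 1024. The arithmetic matches the 45088768-element tensor in the error message.

# Back-of-envelope check (illustration only, not the kernel code); the layer shapes
# below are assumed from llama-7b's MLP, not taken from the report.
in_features, out_features, group_size = 11008, 4096, 1024

numel = in_features * out_features
print(numel)                     # 45088768 -- the size reported in the RuntimeError
print(in_features % group_size)  # 768, so a view of shape (-1, 1024, 4096) is impossible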

Install fails due to missing nvcc

Hi,

Thanks for this package, it seems very promising!

I've followed the instructions to install from source, but I get an error. Here are what I believe are the relevant lines:

copying auto_gptq/nn_modules/triton_utils/__init__.py -> build/lib.linux-x86_64-cpython-310/auto_gptq/nn_modules/triton_utils
copying auto_gptq/nn_modules/triton_utils/custom_autotune.py -> build/lib.linux-x86_64-cpython-310/auto_gptq/nn_modules/triton_utils
running build_ext
/opt/miniconda3/lib/python3.10/site-packages/torch/utils/cpp_extension.py:476: UserWarning: Attempted to use ninja as the BuildExtension backend but we could not find ninja.. Falling back to using the slow distutils backend.
warnings.warn(msg.format('we could not find ninja.'))
error: [Errno 2] No such file or directory: '/usr/local/cuda/bin/nvcc'

It looks like I don't have nvcc installed, but nvcc apparently isn't required just to run PyTorch; my nvidia-smi command works fine.

Is there some way to remove the dependency on nvcc?

Thanks,

Dave
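For context, nvidia-smi only needs the NVIDIA driver, whereas building AutoGPTQ's CUDA extension needs the full CUDA toolkit that provides nvcc. A minimal diagnostic sketch (assumes only that PyTorch is installed) to see what the build environment can actually find:

import shutil
from torch.utils.cpp_extension import CUDA_HOME

# CUDA_HOME is where PyTorch's extension builder looks for the toolkit; None means
# no toolkit was detected even though the driver (nvidia-smi) may be working fine.
print("CUDA_HOME:", CUDA_HOME)
print("nvcc on PATH:", shutil.which("nvcc"))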

pos_ids := reported as a syntax error

File "/opt/conda/lib/python3.7/site-packages/auto_gptq/modeling/init.py", line 1, in
from ._base import BaseGPTQForCausalLM, BaseQuantizeConfig

File "/opt/conda/lib/python3.7/site-packages/auto_gptq/modeling/_base.py", line 182
if (pos_ids := kwargs.get("position_ids", None)) is not None:
^
SyntaxError: invalid syntax

Can anybody help, please?
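For anyone hitting this: the := (walrus) assignment expression used on that line requires Python 3.8 or newer, and the traceback shows a Python 3.7 environment, which cannot parse it, so running auto_gptq under Python >= 3.8 should resolve it. A small illustration of the construct and a pre-3.8 equivalent (not the library's actual code):

kwargs = {"position_ids": None}

# Python 3.8+: assign and test in one expression.
if (pos_ids := kwargs.get("position_ids", None)) is not None:
    print("got position_ids:", pos_ids)

# Python 3.7-compatible equivalent of the same check.
pos_ids = kwargs.get("position_ids", None)
if pos_ids is not None:
    print("got position_ids:", pos_ids)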

Evaluation / benchmark mode

This is great stuff! I was wondering if you could also add an evaluation mode, for example to calculate perplexity for language models. Sometimes GPTQ doesn't work well and performance is hurt significantly, so there is a risk that someone uses a badly quantized model without knowing it. Some method to calculate metrics and compare the original model with the quantized one would be a great help.
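As a rough illustration of the kind of metric being asked for, here is a minimal perplexity sketch that works with any Hugging Face causal LM and tokenizer; it is not an API AutoGPTQ currently ships, and the window size, device and text source are placeholders:

import torch

@torch.no_grad()
def perplexity(model, tokenizer, text, window=2048, device="cuda:0"):
    """Rough perplexity over non-overlapping windows of `text` (illustrative sketch)."""
    ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
    nll_sum, n_tokens = 0.0, 0
    for start in range(0, ids.size(1), window):
        chunk = ids[:, start:start + window]
        if chunk.size(1) < 2:              # need at least one predicted token
            break
        out = model(chunk, labels=chunk)   # HF models shift the labels internally
        n = chunk.size(1) - 1
        nll_sum += out.loss.item() * n
        n_tokens += n
    return torch.exp(torch.tensor(nll_sum / n_tokens))

Comparing the value for the original fp16 model against the quantized one on the same held-out text gives a quick sense of how much quality the quantization cost.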

Importance of dataset used during quantisation?

Something I don't yet understand about GPTQ is the significance of the dataset used for quantisation.

In Qwopqwop's GPTQ-for-LLaMa, the examples use c4. I've also seen him use wikitext2 and ptb.

But now AutoGPTQ has an example that uses the Alpaca instruction/response data.

Are there benchmarks to indicate which dataset is best to use for quantisation? Or does it depend on the type of model being quantised, or the expected use case of the model?

Thanks in advance!
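For reference, in AutoGPTQ the calibration data is simply the list of tokenized examples passed to model.quantize(...), so trying different datasets is a small change. A hedged sketch, where the dataset choice and sample count are purely illustrative and model is assumed to be an AutoGPTQForCausalLM.from_pretrained(...) instance:

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m", use_fast=True)

# Calibration text should resemble what the model will see at inference time.
# wikitext-2 is used here only as an example; c4, ptb or instruction data can be
# substituted in exactly the same way.
raw = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
texts = [t for t in raw["text"] if len(t) > 200][:128]
examples = [tokenizer(t, truncation=True, max_length=2048) for t in texts]

# model.quantize(examples)  # same call as in the basic-usage example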

quant_cuda debug

Hi, I admire your work a lot! I wanted to ask: is there a way to debug errors from quant_cuda? I receive one with GPTQ models:

  File "C:\Users\lazar\anaconda3\envs\intellibridge\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
           │             │       └ {}
           │             └ (tensor([[[ 0.0093, -0.0398,  0.2615,  ..., -0.0125,  0.0017,  0.0076],
           │                        [ 0.0070, -0.0074,  0.0269,  ..., -0.0029,  ...
           └ <bound method QuantLinear.forward of QuantLinear()>

  File "D:\AI\IntelliBridge\utils\third_party\AutoGPTQ\auto_gptq\quantization\quant.py", line 299, in forward
    quant_cuda.vecquant4matmul(x.float(), self.qweight, out, self.scales.float(), self.qzeros, self.g_idx)
    │          │               │ │        │             │    │                    │            └ QuantLinear()
    │          │               │ │        │             │    │                    └ QuantLinear()
    │          │               │ │        │             │    └ QuantLinear()
    │          │               │ │        │             └ tensor([[0., 0., 0.,  ..., 0., 0., 0.],
    │          │               │ │        │                       [0., 0., 0.,  ..., 0., 0., 0.],
    │          │               │ │        │                       [0., 0., 0.,  ..., 0., 0., 0.],
    │          │               │ │        │                    ...
    │          │               │ │        └ QuantLinear()
    │          │               │ └ <method 'float' of 'torch._C._TensorBase' objects>
    │          │               └ tensor([[ 0.0093, -0.0398,  0.2615,  ..., -0.0125,  0.0017,  0.0076],
    │          │                         [ 0.0070, -0.0074,  0.0269,  ..., -0.0029,  0.0...
    │          └ <built-in method vecquant4matmul of PyCapsule object at 0x00000144B1452E20>
    └ <module 'quant_cuda' from 'C:\\Users\\lazar\\anaconda3\\envs\\intellibridge\\lib\\site-packages\\quant_cuda.cp310-win_amd64.p...

RuntimeError: t == DeviceType::CUDA INTERNAL ASSERT FAILED at "C:\\Users\\lazar\\anaconda3\\envs\\intellibridge\\lib\\site-packages\\torch\\include\\c10/cuda/impl/CUDAGuardImpl.h":25, please report a bug to PyTorch. 

Shapes:

x - 26, 5120
out - 26, 5120
out_shape - 1, 26, 5120
scales - 40, 5120
qzeros - 40, 640
g_idx - 5120
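That DeviceType::CUDA internal assert usually means a CPU tensor reached a kernel that expects everything on a CUDA device, e.g. a QuantLinear buffer or the input left on the CPU. A small debugging sketch (illustrative; model is your loaded model and inputs is the dict you feed to forward) that lists anything not on CUDA before the failing call:

import torch

def report_non_cuda(model, inputs):
    """Print every parameter, buffer and input tensor that is not on a CUDA device."""
    for name, p in model.named_parameters():
        if p.device.type != "cuda":
            print("parameter on", p.device, "->", name)
    for name, b in model.named_buffers():
        if b.device.type != "cuda":
            print("buffer on", b.device, "->", name)
    for name, t in inputs.items():
        if torch.is_tensor(t) and t.device.type != "cuda":
            print("input on", t.device, "->", name)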



3x slower inference on GeForce RTX 3060 after 4bit-128g quantization

Hi,

Thanks for the great work!

You can find the code and the output in the attached Jupyter notebook: quantize.ipynb

I am attaching a report of my experiment quantizing facebook/opt-2.7b and evaluating performance on a SequenceClassificationTask on a GeForce RTX 3060.

I cannot fit the default fp32 model from HuggingFace in 12GB of VRAM, so I am reporting numbers for the fp16 version of the model and the quantized one below.

I used the following quantization config:

quantize_config = BaseQuantizeConfig(
    bits=4,  # quantize model to 4-bit
    group_size=128,  # it is recommended to set the value to 128
)

I have the following findings:

  1. Memory usage of the quantized model is approximately half of the fp16 one - 3.08GB vs 6.55GB. I was hoping for a 3-3.5x reduction compared to fp16, since most of the weights are stored as 4-bit nibbles.

  2. Inference is very slow: it takes almost 3x as long for the quantized model - 76.29s vs 23.28s - which is puzzling (a timing sketch is included after this list).

  3. Also, accuracy is 1/3 for a classification task with 3 choices - for both the original fp16 and the quantized model - so both models perform no better than random. Is something going wrong?
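The timing sketch mentioned in point 2, for measuring both models the same way (illustrative only; inputs is assumed to be a dict of tensors already on the GPU, and CUDA must be synchronized around the measured region so the numbers are meaningful):

import time
import torch

@torch.no_grad()
def seconds_per_forward(model, inputs, warmup=2, runs=10):
    """Average wall-clock seconds per forward pass (benchmark sketch)."""
    for _ in range(warmup):
        model(**inputs)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(runs):
        model(**inputs)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / runs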

I am using CUDA 11.7:

$ ls -l /usr/local/cuda
lrwxrwxrwx 1 root root 21 May 14 09:23 /usr/local/cuda -> /usr/local/cuda-11.7/

$ which nvcc
/usr/local/cuda/bin/nvcc

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_May__3_18:49:52_PDT_2022
Cuda compilation tools, release 11.7, V11.7.64
Build cuda_11.7.r11.7/compiler.31294372_0

I do see a 3.5x reduction in required disk space; the original model uses 5.3GB.

$ du -h facebook/opt-2.7b-4bit-128g/*
5.0K    facebook/opt-2.7b-4bit-128g/config.json
1.5G    facebook/opt-2.7b-4bit-128g/gptq_model-4bit-128g.bin
1.8G    facebook/opt-2.7b-4bit-128g/gptq_model-4bit-128g.safetensors
4.0K    facebook/opt-2.7b-4bit-128g/quantize_config.json

Let me know if I am missing any important detail.

Thanks!

Inference does not work on CPU

I finally was able to quantize a LLaMA model on a GPU, but inference still does not work on CPU; there seems to be a problem with loading the quantization config file:

from transformers import AutoTokenizer, TextGenerationPipeline, AutoModelForCausalLM
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

quantized_model_dir = "/volume/models/llama4b/"
tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b", use_fast=True)

model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cpu", use_triton=False)

# inference with model.generate
print(tokenizer.decode(model.generate(**tokenizer("auto_gptq is", return_tensors="pt").to("cpu"))[0]))

and I get this error:

Traceback (most recent call last):
  File "/root/Documents/LiClipse Workspace/Lamma4bit/lalama4b.py", line 11, in <module>
    model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cpu", use_triton=False)
  File "/root/anaconda3/envs/ai/lib/python3.8/site-packages/auto_gptq/modeling/auto.py", line 54, in from_quantized
    return GPTQ_CAUSAL_LM_MODEL_MAP[model_type].from_quantized(
  File "/root/anaconda3/envs/ai/lib/python3.8/site-packages/auto_gptq/modeling/_base.py", line 365, in from_quantized
    quantize_config = BaseQuantizeConfig.from_pretrained(save_dir)
  File "/root/anaconda3/envs/ai/lib/python3.8/site-packages/auto_gptq/modeling/_base.py", line 47, in from_pretrained
    return cls(**json.load(f))
TypeError: __init__() got an unexpected keyword argument 'sym'
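The TypeError points to a version mismatch: the quantize_config.json in that directory was written by a newer auto-gptq that stores extra fields such as sym, which the installed BaseQuantizeConfig does not accept. Upgrading auto-gptq is the clean fix; as a stop-gap, one hedged workaround is to rewrite the config with only the keys the installed version understands (sketch below, path taken from the report; back up the file first, and note that dropping fields can change behaviour if they were not at their defaults):

import inspect
import json

from auto_gptq import BaseQuantizeConfig

config_path = "/volume/models/llama4b/quantize_config.json"
with open(config_path) as f:
    cfg = json.load(f)

# Keep only the keyword arguments accepted by this installed BaseQuantizeConfig.
accepted = set(inspect.signature(BaseQuantizeConfig.__init__).parameters) - {"self"}
filtered = {k: v for k, v in cfg.items() if k in accepted}

with open(config_path, "w") as f:
    json.dump(filtered, f, indent=2)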

I was trying to follow the basic usage (coming from GPTQ-for-LLaMa)

I'm new to machine learning and transformers.
In the last steps of the basic usage:

# load quantized model, currently only support cpu or single gpu
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0", use_triton=False)

# or you can also use pipeline
pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer)
print(pipeline("auto_gptq is")[0]["generated_text"])

It gives me:

>>> model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0", use_triton=False)
>>> pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer)
The model 'OPTGPTQForCausalLM' is not supported for . Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'CodeGenForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'LlamaForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MvpForCausalLM', 'OpenAIGPTLMHeadModel', 'OPTForCausalLM', 'PegasusForCausalLM', 'PLBartForCausalLM', 'ProphetNetForCausalLM', 'QDQBertLMHeadModel', 'ReformerModelWithLMHead', 'RemBertForCausalLM', 'RobertaForCausalLM', 'RobertaPreLayerNormForCausalLM', 'RoCBertForCausalLM', 'RoFormerForCausalLM', 'Speech2Text2ForCausalLM', 'TransfoXLLMHeadModel', 'TrOCRForCausalLM', 'XGLMForCausalLM', 'XLMWithLMHeadModel', 'XLMProphetNetForCausalLM', 'XLMRobertaForCausalLM', 'XLMRobertaXLForCausalLM', 'XLNetLMHeadModel', 'XmodForCausalLM'].

I suppose it's not supported by that pipeline?

Then I tried the Customize Model part and used this instead:

pretrained_model_dir = "facebook/opt-125m"
quantized_model_dir = "opt-125m-4bit"

quantize_config = BaseQuantizeConfig(
    bits=4,  # quantize model to 4-bit
    group_size=128,  # it is recommended to set the value to 128
)


tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
model = OPTGPTQForCausalLM.from_pretrained(quantized_model_dir, quantize_config, device=0, use_triton=False)
pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer)
print(pipeline("auto_gptq is")[0]["generated_text"])

I got:

Traceback (most recent call last):
  File "/home/siegfried/model-gptq/load.py", line 35, in <module>
    model = OPTGPTQForCausalLM.from_pretrained(quantized_model_dir, quantize_config, device=0, use_triton=False)
  File "/home/siegfried/miniconda3/envs/autogptq/lib/python3.10/site-packages/auto_gptq/modeling/_base.py", line 329, in from_pretrained
    model = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path, **model_init_kwargs)
  File "/home/siegfried/miniconda3/envs/autogptq/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 471, in from_pretrained
    return model_class.from_pretrained(
  File "/home/siegfried/miniconda3/envs/autogptq/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2405, in from_pretrained
    raise EnvironmentError(
OSError: Error no file named pytorch_model.bin, tf_model.h5, model.ckpt.index or flax_model.msgpack found in directory opt-125m-4bit.

I'm pretty new to this, so would you be able to explain why?
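The OSError comes from pointing from_pretrained at a quantized directory: from_pretrained expects an ordinary (unquantized) Hugging Face checkpoint that you then quantize, while an already-quantized directory is loaded with from_quantized. A hedged sketch of the intended flow, following the repo's basic usage (paths and the calibration sentence are placeholders):

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer, TextGenerationPipeline

pretrained_model_dir = "facebook/opt-125m"
quantized_model_dir = "opt-125m-4bit"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
quantize_config = BaseQuantizeConfig(bits=4, group_size=128)

# 1) Load the *unquantized* model, quantize it with a few calibration examples, save it.
model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
examples = [tokenizer("auto_gptq is an easy-to-use model quantization library.")]
model.quantize(examples)
model.save_quantized(quantized_model_dir)

# 2) Later, load the *quantized* directory with from_quantized, not from_pretrained.
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0", use_triton=False)
pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer)
print(pipeline("auto_gptq is")[0]["generated_text"])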

GPU memory requirement to quantize Bloom-176B?

Awesome work! Just curious: how much GPU RAM is needed to quantize a 176B model? I tried on an A100 80GB but got an OOM exception. Would more GPUs help in this case? Thank you so much!
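Back-of-envelope arithmetic (sketch below, fp16 weights only, ignoring the activation and Hessian buffers GPTQ needs on top) shows why a single 80GB card cannot hold the full model. As far as I understand, GPTQ implementations like AutoGPTQ quantize layer by layer, keeping most of the model in CPU RAM and moving one decoder layer at a time to the GPU, so large CPU RAM typically matters more than adding GPUs.

# Rough memory arithmetic for BLOOM-176B (illustrative only).
params = 176e9

print(params * 2 / 1e9, "GB of fp16 weights")            # ~352 GB -- far beyond one 80GB GPU
print(params * 0.5 / 1e9, "GB of 4-bit packed weights")  # ~88 GB even after quantization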
