
gptq-for-llama's People

Contributors

aljungberg, cauyxy, dalasnoin, dariosucic, dependabot[bot], diegomontoya, gsaivinay, itslogic, jeremy-costello, johnrobinsn, johnsmith0031, lq1234, lunderberg, mastertaffer, musabgultekin, nikshepsvn, oobabooga, qwopqwop200, sgsdxzy, thireus, tobbez, tonynazzal, tpoisonooo, usbhost, wlsdml1114, yellowrosecx


gptq-for-llama's Issues

Saving checkpoints?

I haven't tried this yet, but I wondered if it is possible to save the quantized checkpoints, or does it quantize the model every time you run it?
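(For reference: llama.py already supports this through its --save and --load flags, so a quantized checkpoint only has to be produced once. The commands below mirror the ones quoted in other issues on this page.)

CUDA_VISIBLE_DEVICES=0 python llama.py decapoda-research/llama-7b-hf c4 --wbits 4 --save llama7b-4bit.pt
CUDA_VISIBLE_DEVICES=0 python llama.py decapoda-research/llama-7b-hf c4 --load llama7b-4bit.pt --benchmark 2048 --check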

Issue compiling in docker - No CUDA runtime is found

This isn't a bug per se, more a cry for help at this point

I'm trying to get 4bit working on oobabooga/text-generation-webui but can't get the cuda extensions in this repository to build.

I am using Docker with continuumio/miniconda3 as the base image, picked because the text-generation-webui setup instructions use conda. I already have that repository working in the container, with CUDA set up and the GPU usable there.

The setup for apt and conda is:

RUN apt-get update && apt-get install -y git software-properties-common build-essential gnupg ninja-build && apt-get clean
RUN conda install torchvision torchaudio pytorch-cuda=11.7 git -c pytorch -c nvidia 
RUN conda install -c "nvidia/label/cuda-11.7.0" cuda

I have also tried the last line with cudatoolkit, cuda-libraries, and several other packages. I've verified that nvcc and the CUDA header files are in place, and am frankly unsure what's missing.

The compile error I get is long, but (I think) the relevant sections are:

#0 3.448 No CUDA runtime is found, using CUDA_HOME='/opt/conda'
#0 3.470 running install
#0 3.470 /opt/conda/lib/python3.10/site-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
#0 3.470   warnings.warn(
#0 3.528 /opt/conda/lib/python3.10/site-packages/setuptools/command/easy_install.py:144: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip 
and other standards-based tools.
..........
#0 3.642   File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1780, in _get_cuda_arch_flags
#0 3.642     arch_list[-1] += '+PTX'
#0 3.642 IndexError: list index out of range

I assume the error is connected to "No CUDA runtime is found" - but after a day's search I still haven't figured out what's missing in the installation. The docker file and repo can be found at https://github.com/TheTerrasque/text-generation-webui/tree/feature/docker
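A hedged note on the IndexError: torch.utils.cpp_extension tries to read the compute capability of a visible GPU at build time, and during docker build no GPU is visible, so the architecture list comes back empty and arch_list[-1] fails. Telling PyTorch explicitly which architectures to target avoids the GPU query; the exact values below are an assumption and should match your card, e.g. in the Dockerfile:

ENV TORCH_CUDA_ARCH_LIST="7.5;8.0;8.6+PTX"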

Questions about group size

From the research paper and the tables in the README, it looks like group size 64 is very effective at improving the quality of the models, most noticeably for the smaller models or the 3-bit versions.

The tables suggest that group size is usable somehow, but the README also states that group size cannot be used with CUDA, while this whole project needs CUDA. I built a group-size-64 model but I cannot run the benchmark or inference.

Is group size usable? If so, how?
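A hedged note, assuming your checkout's llama.py exposes a --groupsize flag (later revisions of this repository do): quantizing with group size 64 would look like the command below. Whether the resulting checkpoint can then be run depends on the kernel support described in the README, which is exactly the question above.

CUDA_VISIBLE_DEVICES=0 python llama.py decapoda-research/llama-7b-hf c4 --wbits 4 --groupsize 64 --save llama7b-4bit-64g.pt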

Change ints to double in quant_cuda_kernel.cu?

I was getting this error when running python setup_cuda.py

quant_cuda_kernel.cu(149): error: no instance of overloaded function "atomicAdd" matches the argument list
argument types are: (double *, double)
detected during instantiation of "void VecQuant2MatMulKernel(const scalar_t *, const int *, scalar_t *, const scalar_t *, const scalar_t *, int, int, int, int) [with scalar_t=double]"
(87): here

quant_cuda_kernel.cu(261): error: no instance of overloaded function "atomicAdd" matches the argument list
argument types are: (double *, double)
detected during instantiation of "void VecQuant3MatMulKernel(const scalar_t *, const int *, scalar_t *, const scalar_t *, const scalar_t *, int, int, int, int) [with scalar_t=double]"
(171): here

quant_cuda_kernel.cu(337): error: no instance of overloaded function "atomicAdd" matches the argument list
argument types are: (double *, double)
detected during instantiation of "void VecQuant4MatMulKernel(const scalar_t *, const int *, scalar_t *, const scalar_t *, const scalar_t *, int, int, int, int) [with scalar_t=double]"
(283): here

quant_cuda_kernel.cu(409): error: no instance of overloaded function "atomicAdd" matches the argument list
argument types are: (double *, double)
detected during instantiation of "void VecQuant8MatMulKernel(const scalar_t *, const int *, scalar_t *, const scalar_t *, const scalar_t *, int, int, int, int) [with scalar_t=double]"
(359): here

So I changed all the integers to doubles, and it compiled with a bunch of warnings. Is this defeating the point, or is it a valid solution?
Nothing runs at the moment, so I guess not.
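A hedged note: atomicAdd(double*, double) only exists on compute capability 6.0 and newer, so this error usually means nvcc is compiling for an sm_5x (or older) target rather than anything being wrong with the kernel's types. Rather than editing the kernel, a sketch along the lines of the setup_cuda.py variant quoted later on this page, pinning the architecture to your actual GPU (Pascal or newer assumed here), may be the better fix:

from setuptools import setup
from torch.utils import cpp_extension

# assumption: the GPU is at least Pascal (compute capability 6.1); adjust to your card
nvcc_args = ['-gencode', 'arch=compute_61,code=sm_61']

setup(
    name='quant_cuda',
    ext_modules=[cpp_extension.CUDAExtension(
        'quant_cuda', ['quant_cuda.cpp', 'quant_cuda_kernel.cu'],
        extra_compile_args={'nvcc': nvcc_args}
    )],
    cmdclass={'build_ext': cpp_extension.BuildExtension}
)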

Windows build fails with unresolved symbols

When building with:

(textgen_3.10.venv) PS C:\g\GPTQ-for-LLaMa> python setup_cuda.py install

   Creating library C:\g\GPTQ-for-LLaMa\build\temp.win-amd64-cpython-310\Release\quant_cuda.cp310-win_amd64.lib and object C:\g\GPTQ-for-LLaMa\build\temp.win-amd64-cpython-310\Release\quant_cuda.cp310-win_amd64.exp
quant_cuda.obj : error LNK2001: unresolved external symbol __imp___tls_offset_?init@?1??lazy_init_num_threads@internal@at@@YAXXZ@4_NA
quant_cuda_kernel.obj : error LNK2001: unresolved external symbol __imp___tls_offset_?init@?1??lazy_init_num_threads@internal@at@@YAXXZ@4_NA
quant_cuda.obj : error LNK2001: unresolved external symbol __imp___tls_index_?init@?1??lazy_init_num_threads@internal@at@@YAXXZ@4_NA
quant_cuda_kernel.obj : error LNK2001: unresolved external symbol __imp___tls_index_?init@?1??lazy_init_num_threads@internal@at@@YAXXZ@4_NA
build\lib.win-amd64-cpython-310\quant_cuda.cp310-win_amd64.pyd : fatal error LNK1120: 2 unresolved externals
error: command 'C:\\Program Files\\Microsoft Visual Studio\\2022\\Community\\VC\\Tools\\MSVC\\14.34.31933\\bin\\HostX64\\x64\\link.exe' failed with exit code 1120
(textgen_3.10.venv) PS C:\g\GPTQ-for-LLaMa> python setup_cuda.py install

I am on CUDA 12.0

(textgen_3.10.venv) PS C:\g\GPTQ-for-LLaMa> nvidia-smi.exe
Sat Mar 11 17:49:49 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 528.02       Driver Version: 528.02       CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+

Benchmark fails when using 4bit file

Trying to test these models in 4bit but having an issue running the benchmark on the compressed file.

The command
CUDA_VISIBLE_DEVICES=0 python llama.py decapoda-research/llama-7b-hf c4 --load llama7b-4bit.pt --benchmark 2048 --check
fails on my machine with the log

Benchmarking ...
Traceback (most recent call last):
  File "/mnt/DataButButter/AI/Text/GPTQ-for-LLaMa/llama.py", line 410, in <module>
    benchmark(model, input_ids, check=args.check)
  File "/mnt/DataButButter/AI/Text/GPTQ-for-LLaMa/llama.py", line 309, in benchmark
    out = model(
  File "/mnt/DataButButter/AI/Text/GPTQ-for-LLaMa/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/DataButButter/AI/Text/GPTQ-for-LLaMa/venv/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 857, in forward
    outputs = self.model.decoder(
  File "/mnt/DataButButter/AI/Text/GPTQ-for-LLaMa/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/DataButButter/AI/Text/GPTQ-for-LLaMa/venv/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 629, in forward
    layer_outputs = decoder_layer(
  File "/mnt/DataButButter/AI/Text/GPTQ-for-LLaMa/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1208, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/mnt/DataButButter/AI/Text/GPTQ-for-LLaMa/venv/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 310, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/mnt/DataButButter/AI/Text/GPTQ-for-LLaMa/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/DataButButter/AI/Text/GPTQ-for-LLaMa/venv/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 199, in forward
    attn_weights = torch.bmm(query_states, key_states.transpose(1, 2)) / math.sqrt(self.head_dim)
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmStridedBatchedExFix( handle, opa, opb, m, n, k, (void*)(&falpha), a, CUDA_R_16F, lda, stridea, b, CUDA_R_16F, ldb, strideb, (void*)(&fbeta), c, CUDA_R_16F, ldc, stridec, num_batches, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`

For extra information, the kernel test completes successfully:

Benchmarking LLaMa-33B FC2 matvec ...
FP16: 0.0007241859436035157
4bit: 0.00010311341285705566
Verifiying kernel correctness ...
Simu: tensor([-0.5063, -0.5161,  0.7185,  ..., -0.0747, -0.4014,  0.3023],
       device='cuda:0')
Kern: tensor([-0.5063, -0.5161,  0.7185,  ..., -0.0747, -0.4014,  0.3023],
       device='cuda:0')

The benchmark also completes successfully when just using the normal HF model without 4bit conversion.

Error when installing cuda kernel

If I follow the instructions in the readme, I'm getting an error now even though it worked a few days ago.

conda create --name gptq python=3.9 -y
conda activate gptq
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa
cd GPTQ-for-LLaMa
pip install -r requirements.txt
python setup_cuda.py install

Output:

Traceback (most recent call last):
  File "~/text-generation-webui/repositories/GPTQ-for-LLaMa/setup_cuda.py", line 6, in <module>
    ext_modules=[cpp_extension.CUDAExtension(
  File "~/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1048, in CUDAExtension
    library_dirs += library_paths(cuda=True)
  File "~/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1179, in library_paths
    if (not os.path.exists(_join_cuda_home(lib_dir)) and
  File "~/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 2223, in _join_cuda_home
    raise EnvironmentError('CUDA_HOME environment variable is not set. '
OSError: CUDA_HOME environment variable is not set. Please set it to your CUDA install root.

If I try to manually set CUDA_HOME=$CONDA_PREFIX/ (which wasn't necessary previously) it still doesn't work. I get this error:

running install
~/miniconda3/envs/textgen/lib/python3.10/site-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
~/miniconda3/envs/textgen/lib/python3.10/site-packages/setuptools/command/easy_install.py:144: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
running bdist_egg
running egg_info
writing quant_cuda.egg-info/PKG-INFO
writing dependency_links to quant_cuda.egg-info/dependency_links.txt
writing top-level names to quant_cuda.egg-info/top_level.txt
reading manifest file 'quant_cuda.egg-info/SOURCES.txt'
writing manifest file 'quant_cuda.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_ext
error: [Errno 2] No such file or directory: 'CUDA_HOME=~/miniconda3/envs/textgen/bin/nvcc'
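A hedged reading of the last error: the literal string "CUDA_HOME=..." ended up being treated as a path, which suggests the variable was attached to the wrong command. Exporting it separately (no trailing slash) and checking that $CUDA_HOME/bin/nvcc exists before rebuilding is the usual pattern:

export CUDA_HOME=$CONDA_PREFIX
ls $CUDA_HOME/bin/nvcc
python setup_cuda.py install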

Does not compile on CUDA 12.0

On running setup_cuda.py install, I was initially getting:
RuntimeError: The detected CUDA version (12.0) mismatches the version that was used to compile PyTorch (11.8). Please make sure to use the same CUDA versions.

I tried being clever and compiled manually, which returned the following errors:
/usr/local/lib64/python3.11/site-packages/torch/include/ATen/core/qualified_name.h(73): here

/usr/local/lib64/python3.11/site-packages/torch/include/pybind11/detail/../cast.h: In function ‘typename pybind11::detail::type_caster<typename pybind11::detail::intrinsic_type::type>::cast_op_type pybind11::detail::cast_op(make_caster&)’:
/usr/local/lib64/python3.11/site-packages/torch/include/pybind11/detail/../cast.h:42:120: error: expected template-name before ‘<’ token
42 | return caster.operator typename make_caster::template cast_op_type();
| ^
/usr/local/lib64/python3.11/site-packages/torch/include/pybind11/detail/../cast.h:42:120: error: expected identifier before ‘<’ token
/usr/local/lib64/python3.11/site-packages/torch/include/pybind11/detail/../cast.h:42:123: error: expected primary-expression before ‘>’ token
42 | return caster.operator typename make_caster::template cast_op_type();
| ^
/usr/local/lib64/python3.11/site-packages/torch/include/pybind11/detail/../cast.h:42:126: error: expected primary-expression before ‘)’ token
42 | return caster.operator typename make_caster::template cast_op_type();
|

Quantising on multiple GPU?

Hi,

I'm trying to quantise 65B on a server with 8x V100. Obviously, that's not going to fit in VRAM on any single GPU 😅
Is it possible to use more than one GPU for quantisation, or load and quantise layer-by-layer?

I've tried on CPU, but I get the error:
RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'

Thanks! (and great work!)

Potential mistakes in the test data selection for perplexity evaluation

ptb_text_only uses the validation file instead of the test file. While it is still from the same dataset and should give similar results, this makes 1-to-1 comparisons difficult.
c4 only has a validation split, so that is fine.

wikitext-2 uses test

testdata = load_dataset('wikitext', 'wikitext-2-raw-v1', split='test')

ptb_text_only uses validation

valdata = load_dataset('ptb_text_only', 'penn_treebank', split='validation')

c4 uses validation

valdata = load_dataset(
'allenai/c4', 'allenai--c4', data_files={'validation': 'en/c4-validation.00000-of-00008.json.gz'}, split='validation',use_auth_token=True
)

please correct me if this is intended. :)
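If the goal is a 1-to-1 comparison with published numbers, a one-line change in datautils.py along these lines (hedged, not tested here) would switch PTB to its test split as well:

testdata = load_dataset('ptb_text_only', 'penn_treebank', split='test')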

More VRAM Efficient Attention

There was talk of this in oobabooga/text-generation-webui#177. Creating an issue to track here.

Problem: GPTQ-for-LLaMA appears to use a relatively large amount of VRAM in addition to the model sizes. This negates some of the size reduction benefit of using low-bit quantization. A large portion of this VRAM may be due to the attention mechanism used.

Solution: A more efficient Attention mechanism would further reduce the VRAM requirements of GPTQ-for-LLaMA.
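As an illustration only (not this repository's code): PyTorch 2.x exposes torch.nn.functional.scaled_dot_product_attention, which dispatches to flash / memory-efficient kernels when available and avoids materializing the full seq_len x seq_len attention-weight matrix, which is where much of the extra VRAM goes.

import torch
import torch.nn.functional as F

# toy shapes: (batch, n_heads, seq_len, head_dim), half precision on GPU
q = torch.randn(1, 32, 2048, 128, device='cuda', dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# causal attention without an explicit (2048 x 2048) score tensor per head
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)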

Tokenizer class LLaMATokenizer does not exist or is not currently imported.

I am not sure how to deal with this.

Python 3.10.9 on Arch Linux.

[0] # CUDA_VISIBLE_DEVICES=0 python llama.py decapoda-research/llama-7b-hf c4 --wbits 4 --save llama7b-4bit.pt

Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 33/33 [00:09<00:00,  3.60it/s]
Downloading readme: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.38k/2.38k [00:00<00:00, 3.79MB/s]
Downloading and preparing dataset json/allenai--c4 to /root/.cache/huggingface/datasets/allenai___json/allenai--c4-6fbe877195f42de5/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51...
Downloading data: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 319M/319M [01:13<00:00, 4.36MB/s]
Downloading data files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [01:14<00:00, 74.30s/it]
Extracting data files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00,  2.50s/it]
Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/allenai___json/allenai--c4-6fbe877195f42de5/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51. Subsequent calls will reuse this data.
Downloading and preparing dataset json/allenai--c4 to /root/.cache/huggingface/datasets/allenai___json/allenai--c4-efc3d4f4606f44bd/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51...
Downloading data: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 40.5M/40.5M [00:06<00:00, 6.15MB/s]
Downloading data files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:07<00:00,  7.63s/it]
Extracting data files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  3.17it/s]
Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/allenai___json/allenai--c4-efc3d4f4606f44bd/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51. Subsequent calls will reuse this data.
Downloading (…)okenizer_config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 141/141 [00:00<00:00, 175kB/s]
Traceback (most recent call last):
  File "/root/llama/GPTQ-for-LLaMa/llama.py", line 393, in <module>
    dataloader, testloader = get_loaders(
  File "/root/llama/GPTQ-for-LLaMa/datautils.py", line 111, in get_loaders
    return get_c4(nsamples, seed, seqlen, model)
  File "/root/llama/GPTQ-for-LLaMa/datautils.py", line 64, in get_c4
    tokenizer = AutoTokenizer.from_pretrained(model, use_fast=False)
  File "/root/llama/venv/lib/python3.10/site-packages/transformers-4.27.0.dev0-py3.10.egg/transformers/models/auto/tokenization_auto.py", line 676, in from_pretrained
    raise ValueError(
ValueError: Tokenizer class LLaMATokenizer does not exist or is not currently imported.
(venv) 
[0] # pip list
Package                  Version            Editable project location
------------------------ ------------------ ---------------------------------------------
aiohttp                  3.8.4
aiosignal                1.3.1
astunparse               1.6.3
async-timeout            4.0.2
attrs                    22.2.0
certifi                  2022.12.7
charset-normalizer       3.1.0
datasets                 2.10.1
dill                     0.3.6
exceptiongroup           1.1.0
expecttest               0.1.4
filelock                 3.9.0
frozenlist               1.3.3
fsspec                   2023.3.0
huggingface-hub          0.13.1
hypothesis               6.68.2
idna                     3.4
Jinja2                   3.1.2
MarkupSafe               2.1.2
mpmath                   1.3.0
multidict                6.0.4
multiprocess             0.70.14
networkx                 3.0
numpy                    1.24.2
nvidia-cublas-cu11       11.10.3.66
nvidia-cuda-nvrtc-cu11   11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11        8.5.0.96
packaging                23.0
pandas                   1.5.3
pip                      22.3.1
psutil                   5.9.4
pyarrow                  11.0.0
python-dateutil          2.8.2
pytz                     2022.7.1
PyYAML                   6.0
quant-cuda               0.0.0
regex                    2022.10.31
requests                 2.28.2
responses                0.18.0
setuptools               65.5.0
six                      1.16.0
sortedcontainers         2.4.0
sympy                    1.11.1
tokenizers               0.13.2
torch                    2.1.0a0+gitc7bd9b9 /root/llama/venv/lib/python3.10/site-packages
tqdm                     4.65.0
transformers             4.27.0.dev0
types-dataclasses        0.6.6
typing_extensions        4.5.0
urllib3                  1.26.14
wheel                    0.38.4
xxhash                   3.2.0
yarl                     1.8.2
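A hedged workaround for this particular error: newer transformers releases renamed the class to LlamaTokenizer, while the decapoda-research checkpoints still say LLaMATokenizer in tokenizer_config.json. Patching the local config, e.g. with the sketch below (the path is an assumption; adjust it to wherever the checkpoint lives), lets AutoTokenizer resolve the class again:

import json

cfg_path = "llama-7b-hf/tokenizer_config.json"  # hypothetical local copy of the model

with open(cfg_path) as f:
    cfg = json.load(f)

cfg["tokenizer_class"] = "LlamaTokenizer"  # old files say "LLaMATokenizer"

with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)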

Bad results for WinoGrande - more testers wanted

Facebook published expected results for the WinoGrande test, with a score of 70 for the 7B model.

I wrote a small script (see #40) that fetches the dataset from datasets and runs the tests.

Because the prompt and parameters were not published (see meta-llama/llama#188), I wrote a prompt myself. It is probably not very good, but it was the only version that worked at all.

The problem: with the 4-bit 7B model I only get about 48%, which means the model is no better than random.

So something is off. One or more of:

  • Wrong parameters
  • Bad prompt
  • Something else is wrong in my script
  • Quantization hurts model performance
  • Bug in implementation of quant or inference

As I am new to this topic, it can very well be a problem on my end.

So I would like to get help fixing the prompt/script and would also like to see results for other models:

  • other model versions
  • other quantizations

How to fine-tune the 4-bit model?

First a big thanks for this amazing effort!

I was just trying to fine-tune this 4-bit model under the transformers framework. The model loaded successfully and the training process ran, however the loss became nan after a single loss.backward().

Here is my code:

import sys
from pathlib import Path

import torch
from tqdm import tqdm

sys.path.insert(0, str(Path("/efs-storage/text-generation-webui/repositories/GPTQ-for-LLaMa")))
import llama

load_quant = llama.load_quant

model = load_quant(
    "/file/llama-7b-hf",
    '/file/llama-7b-hf-int4/llama-7b-4bit.pt',
    4
    )

model.to(device)
....
for batch in tqdm(dataloader):
        model.train()
        input_ids = batch[0]['input_ids'].squeeze(1).to(device)
        attention_mask = batch[0]['attention_mask'].squeeze(1).to(device)
        tgt_labels = batch[1].squeeze(1).to(device)
        loss = model(
                    input_ids = input_ids,
                    attention_mask = attention_mask,
                    labels=tgt_labels
                    ).loss 
        model.zero_grad()
        loss.backward()
        optimizer.step()
        torch.cuda.empty_cache()

And here is what the loss looks like after training one batch:

tensor(nan, device='cuda:1', dtype=torch.float16, grad_fn=<NllLossBackward0>)

I wonder, is there any way to fine-tune the 4-bit model? Thanks!

Support other models?

How hard is it to implement quantizing other models to 4-bit? I see there is already a Python file for BLOOM, but that model only comes in really small and really big flavors.

But if we could convert other 13-30B models it would be a big help. Or is the plan to wait on bitsandbytes?

[Request] Mixed Precision Quantization

I believe we can achieve further optimisation beyond even 4-bit quantization by selectively quantizing specifically chosen layers down to 2 bits.

See: https://arxiv.org/abs/2203.08368

By selectively quantizing 50% of the layers down to 2 bits, it may even be possible to run 65B LLaMA on a 24 GB VRAM GPU.

I don't know precisely which layers would work best (it may be an arduous process of trial and error). Perhaps the best thing to do would be to let the user specify the level of quantization they desire for each layer.

4bit is not the end of the road.

Multiple errors while compiling the kernel

Hello, while trying to run python setup_cuda.py install, I get this error:

(venv) C:\Users\Username\Documents\GitHub\GPTQ-for-LLaMa>python setup_cuda.py install
running install
C:\Users\Username\Documents\GitHub\GPTQ-for-LLaMa\venv\lib\site-packages\setuptools\command\install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
C:\Users\Username\Documents\GitHub\GPTQ-for-LLaMa\venv\lib\site-packages\setuptools\command\easy_install.py:144: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
running bdist_egg
running egg_info
writing quant_cuda.egg-info\PKG-INFO
writing dependency_links to quant_cuda.egg-info\dependency_links.txt
writing top-level names to quant_cuda.egg-info\top_level.txt
C:\Users\Username\Documents\GitHub\GPTQ-for-LLaMa\venv\lib\site-packages\torch\utils\cpp_extension.py:476: UserWarning: Attempted to use ninja as the BuildExtension backend but we could not find ninja.. Falling back to using the slow distutils backend.
  warnings.warn(msg.format('we could not find ninja.'))
reading manifest file 'quant_cuda.egg-info\SOURCES.txt'
writing manifest file 'quant_cuda.egg-info\SOURCES.txt'
installing library code to build\bdist.win-amd64\egg
running install_lib
running build_ext
C:\Users\Username\Documents\GitHub\GPTQ-for-LLaMa\venv\lib\site-packages\torch\utils\cpp_extension.py:358: UserWarning: Error checking compiler version for cl: [WinError 2] The system cannot find the file specified
  warnings.warn(f'Error checking compiler version for {compiler}: {error}')
C:\Users\Username\Documents\GitHub\GPTQ-for-LLaMa\venv\lib\site-packages\torch\utils\cpp_extension.py:387: UserWarning: The detected CUDA version (11.4) has a minor version mismatch with the version that was used to compile PyTorch (11.7). Most likely this shouldn't be a problem.
  warnings.warn(CUDA_MISMATCH_WARN.format(cuda_str_version, torch.version.cuda))
building 'quant_cuda' extension
"C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.29.30133\bin\HostX86\x64\cl.exe" /c /nologo /O2 /W3 /GL /DNDEBUG /MD -IC:\Users\Username\Documents\GitHub\GPTQ-for-LLaMa\venv\lib\site-packages\torch\include -IC:\Users\Username\Documents\GitHub\GPTQ-for-LLaMa\venv\lib\site-packages\torch\include\torch\csrc\api\include -IC:\Users\Username\Documents\GitHub\GPTQ-for-LLaMa\venv\lib\site-packages\torch\include\TH -IC:\Users\Username\Documents\GitHub\GPTQ-for-LLaMa\venv\lib\site-packages\torch\include\THC "-IC:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.4\include" -IC:\Users\Username\Documents\GitHub\GPTQ-for-LLaMa\venv\include "-IC:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.2800.0_x64__qbz5n2kfra8p0\include" "-IC:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.2800.0_x64__qbz5n2kfra8p0\Include" "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.29.30133\include" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\winrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\cppwinrt" /EHsc /Tpquant_cuda.cpp /Fobuild\temp.win-amd64-cpython-310\Release\quant_cuda.obj /MD /wd4819 /wd4251 /wd4244 /wd4267 /wd4275 /wd4018 /wd4190 /EHsc -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=quant_cuda -D_GLIBCXX_USE_CXX11_ABI=0
quant_cuda.cpp
C:\Users\Username\Documents\GitHub\GPTQ-for-LLaMa\venv\lib\site-packages\torch\include\c10/macros/Macros.h(138): warning C4067: unexpected tokens following preprocessor directive - expected a newline

Then after a long list of errors, I get this at the end:

"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.4\bin\nvcc" -c quant_cuda_kernel.cu -o build\temp.win-amd64-cpython-310\Release\quant_cuda_kernel.obj -IC:\Users\Username\Documents\GitHub\GPTQ-for-LLaMa\venv\lib\site-packages\torch\include -IC:\Users\Username\Documents\GitHub\GPTQ-for-LLaMa\venv\lib\site-packages\torch\include\torch\csrc\api\include -IC:\Users\Username\Documents\GitHub\GPTQ-for-LLaMa\venv\lib\site-packages\torch\include\TH -IC:\Users\Username\Documents\GitHub\GPTQ-for-LLaMa\venv\lib\site-packages\torch\include\THC "-IC:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.4\include" -IC:\Users\Username\Documents\GitHub\GPTQ-for-LLaMa\venv\include "-IC:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.2800.0_x64__qbz5n2kfra8p0\include" "-IC:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.2800.0_x64__qbz5n2kfra8p0\Include" "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.29.30133\include" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\winrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\cppwinrt" -Xcudafe --diag_suppress=dll_interface_conflict_dllexport_assumed -Xcudafe --diag_suppress=dll_interface_conflict_none_assumed -Xcudafe --diag_suppress=field_without_dll_interface -Xcudafe --diag_suppress=base_class_has_different_dll_interface -Xcompiler /EHsc -Xcompiler /wd4190 -Xcompiler /wd4018 -Xcompiler /wd4275 -Xcompiler /wd4267 -Xcompiler /wd4244 -Xcompiler /wd4251 -Xcompiler /wd4819 -Xcompiler /MD -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=quant_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_75,code=compute_75 -gencode=arch=compute_75,code=sm_75 --use-local-env
quant_cuda_kernel.cu
C:/Users/Username/Documents/GitHub/GPTQ-for-LLaMa/venv/lib/site-packages/torch/include\c10/macros/Macros.h(138): warning C4067: unexpected tokens following preprocessor directive - expected a newline
C:\Users\Username\Documents\GitHub\GPTQ-for-LLaMa\venv\lib\site-packages\torch\include\pybind11\cast.h(624): error: too few arguments for template template parameter "Tuple"
          detected during instantiation of class "pybind11::detail::tuple_caster<Tuple, Ts...> [with Tuple=std::pair, Ts=<T1, T2>]"
(721): here

C:\Users\Username\Documents\GitHub\GPTQ-for-LLaMa\venv\lib\site-packages\torch\include\pybind11\cast.h(717): error: too few arguments for template template parameter "Tuple"
          detected during instantiation of class "pybind11::detail::tuple_caster<Tuple, Ts...> [with Tuple=std::pair, Ts=<T1, T2>]"
(721): here

2 errors detected in the compilation of "quant_cuda_kernel.cu".
error: command 'C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v11.4\\bin\\nvcc.exe' failed with exit code 1

Any idea what could be causing this? I've tried installing CUDA Toolkit 11.3 and Torch 1.12.1, but they too give the same error.

3-bit quantization fails during the packing stage

Packing ...
model.decoder.layers.0.self_attn.q_proj
Traceback (most recent call last):
  File "/root/convert/GPTQ-for-LLaMa/llama.py", line 420, in <module>
    llama_pack3(model, quantizers)
  File "/root/convert/GPTQ-for-LLaMa/llama.py", line 216, in llama_pack3
    qlayers[name].pack(layers[name], quantizers[name].scale, quantizers[name].zero)
  File "/root/convert/GPTQ-for-LLaMa/quant.py", line 142, in pack
    self.bias = linear.bias.clone()
AttributeError: 'NoneType' object has no attribute 'clone'
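LLaMA's linear layers have no bias, so linear.bias is None here. A minimal sketch of a guard in quant.py's pack(), assuming the rest of the method stays unchanged:

# only copy the bias when the source layer actually has one
if linear.bias is not None:
    self.bias = linear.bias.clone()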

What would be required to quantize 65B model to 2-bit?

Presumably more than 130 GB of RAM? How much would it slow things down if using a swap file? Anything else? Since GPTQ gives its best results on larger models, it seems like this should be looked into. It would be incredible to get almost the whole performance of the 65B model using only 16 GB of VRAM.

PosixPath object has no attribute endswith Win11 WSL2

$ python server.py --load-in-4bit --model llama-13b-hf
Loading llama-13b-hf...
Loading model ...
Traceback (most recent call last):
  File "/mnt/e/Projects/text-generation-webui/server.py", line 191, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "/mnt/e/Projects/text-generation-webui/modules/models.py", line 119, in load_model
    model = load_quant(path_to_model, Path(f"models/{pt_model}"), 4)
  File "/mnt/e/Projects/text-generation-webui/repositories/GPTQ-for-LLaMa/llama.py", line 241, in load_quant
    if checkpoint.endswith('.safetensors'):
AttributeError: 'PosixPath' object has no attribute 'endswith'

Getting this issue when trying to run on Windows 11 under WSL2 with text-generation-webui. Behaviour is different prior to 68cfaf9, though still broken.
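A minimal sketch of a workaround in load_quant() (llama.py, around the line shown in the traceback): coerce the incoming Path to str so the extension check works for both str and pathlib.Path arguments.

checkpoint = str(checkpoint)  # Path objects have no .endswith()
if checkpoint.endswith('.safetensors'):
    ...  # existing safetensors branch continues unchanged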

Output of conda list:

# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main
_openmp_mutex             5.1                       1_gnu
blas                      1.0                         mkl
brotlipy                  0.7.0           py310h7f8727e_1002
bzip2                     1.0.8                h7b6447c_0
c-ares                    1.18.1               h7f8727e_0
ca-certificates           2023.01.10           h06a4308_0
certifi                   2022.12.7       py310h06a4308_0
cffi                      1.15.1          py310h5eee18b_3
charset-normalizer        2.0.4              pyhd3eb1b0_0
cryptography              39.0.1          py310h9ce1e76_0
curl                      7.88.1               h5eee18b_0
datasets                  2.10.1                   pypi_0    pypi
dill                      0.3.6                    pypi_0    pypi
expat                     2.4.9                h6a678d5_0
ffmpeg                    4.3                  hf484d3e_0    pytorch
flit-core                 3.6.0              pyhd3eb1b0_0
freetype                  2.12.1               h4a9f257_0
gdbm                      1.18                 hd4cb3f1_4
gettext                   0.21.0               hf68c758_0
giflib                    5.2.1                h5eee18b_3
git                       2.34.1          pl5262hc120c5b_0
gmp                       6.2.1                h295c915_3
gnutls                    3.6.15               he1e5248_0
icu                       58.2                 he6710b0_3
idna                      3.4             py310h06a4308_0
intel-openmp              2021.4.0          h06a4308_3561
jpeg                      9e                   h5eee18b_1
krb5                      1.19.4               h568e23c_0
lame                      3.100                h7b6447c_0
lcms2                     2.12                 h3be6417_0
ld_impl_linux-64          2.38                 h1181459_1
lerc                      3.0                  h295c915_0
libcurl                   7.88.1               h91b91d3_0
libdeflate                1.17                 h5eee18b_0
libedit                   3.1.20221030         h5eee18b_0
libev                     4.33                 h7f8727e_1
libffi                    3.4.2                h6a678d5_6
libgcc-ng                 11.2.0               h1234567_1
libgomp                   11.2.0               h1234567_1
libiconv                  1.16                 h7f8727e_2
libidn2                   2.3.2                h7f8727e_0
libnghttp2                1.46.0               hce63b2e_0
libpng                    1.6.39               h5eee18b_0
libssh2                   1.10.0               h8f2d780_0
libstdcxx-ng              11.2.0               h1234567_1
libtasn1                  4.16.0               h27cfd23_0
libtiff                   4.5.0                h6a678d5_2
libunistring              0.9.10               h27cfd23_0
libuuid                   1.41.5               h5eee18b_0
libwebp                   1.2.4                h11a3e52_1
libwebp-base              1.2.4                h5eee18b_1
libxml2                   2.9.14               h74e7548_0
lz4-c                     1.9.4                h6a678d5_0
mkl                       2021.4.0           h06a4308_640
mkl-service               2.4.0           py310h7f8727e_0
mkl_fft                   1.3.1           py310hd6ae3a3_0
mkl_random                1.2.2           py310h00e6091_0
multiprocess              0.70.14                  pypi_0    pypi
ncurses                   6.4                  h6a678d5_0
nettle                    3.7.3                hbbd107a_1
numpy                     1.23.5          py310hd5efca6_0
numpy-base                1.23.5          py310h8e6c178_0
openh264                  2.1.1                h4ff587b_0
openssl                   1.1.1t               h7f8727e_0
pcre2                     10.37                he7ceb23_1
perl                      5.34.0               h5eee18b_2
pillow                    9.4.0           py310h6a678d5_0
pip                       23.0.1          py310h06a4308_0
pyarrow                   11.0.0                   pypi_0    pypi
pycparser                 2.21               pyhd3eb1b0_0
pydub                     0.25.1                   pypi_0    pypi
pyopenssl                 23.0.0          py310h06a4308_0
pyparsing                 3.0.9                    pypi_0    pypi
pysocks                   1.7.1           py310h06a4308_0
python                    3.10.9               h7a1cb2a_2
pytorch                   1.13.1             py3.10_cpu_0    pytorch
pytorch-mutex             1.0                         cpu    pytorch
pytz                      2022.7.1                 pypi_0    pypi
pyyaml                    6.0                      pypi_0    pypi
quant-cuda                0.0.0                    pypi_0    pypi
readline                  8.2                  h5eee18b_0
requests                  2.28.1          py310h06a4308_0
responses                 0.18.0                   pypi_0    pypi
rfc3986                   2.0.0                    pypi_0    pypi
safetensors               0.3.0                    pypi_0    pypi
sentencepiece             0.1.97                   pypi_0    pypi
setuptools                65.6.3          py310h06a4308_0
six                       1.16.0             pyhd3eb1b0_1
sqlite                    3.40.1               h5082296_0
tk                        8.6.12               h1ccaba5_0
tokenizers                0.13.2                   pypi_0    pypi
torchaudio                0.13.1                py310_cpu    pytorch
torchvision               0.14.1                py310_cpu    pytorch
typing_extensions         4.4.0           py310h06a4308_0
tzdata                    2022g                h04d1e81_0
urllib3                   1.26.14         py310h06a4308_0
wheel                     0.38.4          py310h06a4308_0
xxhash                    3.2.0                    pypi_0    pypi
xz                        5.2.10               h5eee18b_1
zlib                      1.2.13               h5eee18b_0
zstd                      1.5.2                ha4553b6_0

running build_ext error

Running into an error:

running bdist_egg
running egg_info
writing quant_cuda.egg-info\PKG-INFO
writing dependency_links to quant_cuda.egg-info\dependency_links.txt
writing top-level names to quant_cuda.egg-info\top_level.txt
reading manifest file 'quant_cuda.egg-info\SOURCES.txt'
writing manifest file 'quant_cuda.egg-info\SOURCES.txt'
installing library code to build\bdist.win-amd64\egg
running install_lib
running build_ext
error: [WinError 2] The system cannot find the file specified

converting local hf model with llama.py

Hi @qwopqwop200,
I'm trying to convert LLaMA HF models to 4-bit, with all files local (input and output).
I get this error about the HF token:

python llama.py /media/alex/Daemon/AI/LLaMA-HF/llama-7b c4 --wbits 4 --save /media/alex/Daemon/AI/LLaMA-HF/llama-7b/llama-7b-4bit.pt
Loading checkpoint shards: 100%|████████████████████| 33/33 [00:03<00:00, 8.45it/s]
Traceback (most recent call last):
  File "/home/alex/oobabooga/text-generation-webui/repositories/GPTQ-for-LLaMa/llama.py", line 401, in <module>
    dataloader, testloader = get_loaders(
  File "/home/alex/oobabooga/text-generation-webui/repositories/GPTQ-for-LLaMa/datautils.py", line 111, in get_loaders
    return get_c4(nsamples, seed, seqlen, model)
  File "/home/alex/oobabooga/text-generation-webui/repositories/GPTQ-for-LLaMa/datautils.py", line 56, in get_c4
    traindata = load_dataset(
  File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/datasets/load.py", line 1759, in load_dataset
    builder_instance = load_dataset_builder(
  File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/datasets/load.py", line 1496, in load_dataset_builder
    dataset_module = dataset_module_factory(
  File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/datasets/load.py", line 1218, in dataset_module_factory
    raise e1 from None
  File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/datasets/load.py", line 1185, in dataset_module_factory
    raise e
  File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/datasets/load.py", line 1158, in dataset_module_factory
    dataset_info = hf_api_dataset_info(
  File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/datasets/utils/_hf_hub_fixes.py", line 152, in dataset_info
    return hf_api.dataset_info(repo_id, revision=revision, timeout=timeout, use_auth_token=use_auth_token)
  File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 120, in _inner_fn
    return fn(*args, **kwargs)
  File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 1676, in dataset_info
    headers = self._build_hf_headers(token=token)
  File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 4205, in _build_hf_headers
    return build_hf_headers(
  File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 120, in _inner_fn
    return fn(*args, **kwargs)
  File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/huggingface_hub/utils/_headers.py", line 117, in build_hf_headers
    token_to_send = get_token_to_send(token)
  File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/huggingface_hub/utils/_headers.py", line 149, in get_token_to_send
    raise EnvironmentError(
OSError: Token is required (token=True), but no token found. You need to provide a token or be logged in to Hugging Face with `huggingface-cli login` or `huggingface_hub.login`. See https://huggingface.co/settings/tokens.

What am I missing? I'm stuck

Thank you!
Ale
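A hedged note on the traceback above: it dies inside huggingface_hub's token handling because get_c4 in datautils.py passes use_auth_token=True to load_dataset (the call is quoted in the perplexity issue further up this page). Two possible workarounds: log in once with huggingface-cli login, or drop the flag, since allenai/c4 is a public dataset, e.g. for the validation call:

valdata = load_dataset(
    'allenai/c4', 'allenai--c4',
    data_files={'validation': 'en/c4-validation.00000-of-00008.json.gz'},
    split='validation',
)

The train-split call in get_c4 would need the same change.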

GPTQ+flexgen, is it possible?

Trying to get LLaMA 30B 4-bit quantized to run with 12 GB of VRAM, and I'm hitting OOM since the model is a bit more than 16 GB.
Is it possible to use offloading to load a percentage of the model to the CPU using GPTQ?

Request: Optional non-CUDA version

Amazing work! Thank you so much for sharing this.

Despite my attempts, I wasn't able to replicate the quantization functions without CUDA. It would be hugely helpful if users could use AMD or Apple Silicon GPUs too (which already have PyTorch support, e.g. as 'mps').

Apple Silicon may seem like an odd option, but shared memory means these machines are some of the only options for high-memory inference on consumer hardware. For example, it is possible to get up to 64 GB of GPU-accessible memory on a Mac Studio.

Any code changes or advice to achieve this would be sincerely appreciated!

Are these errors expected?

[screenshot of the errors omitted]
So are these errors expected?

python llama.py ../../models/llama-30b c4 --wbits 4 --save llama-30b-4bit.pt

How to use for inference?

I could get the model compression running, and the benchmark also works.

But how would I use the model for inference? Is there any example? The standard things from transformers are not working, failing with "Only supports a single token currently." That seems related to #6.
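For single-prompt generation the repository ships llama_inference.py, used like this in other issues on this page:

CUDA_VISIBLE_DEVICES=0 python llama_inference.py decapoda-research/llama-7b-hf --wbits 4 --load llama7b-4bit.pt --text "this is llama"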

AttributeError: 'LLaMAModel' object has no attribute 'decoder'

CUDA_VISIBLE_DEVICES=0 python llama.py decapoda-research/llama-7b-hf c4

Traceback (most recent call last):
  File "llama.py", line 419, in <module>
    llama_eval(model, testloader, DEV)
  File "/home/zhangjp/anaconda3/envs/pytorch2/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "llama.py", line 121, in llama_eval
    layers = model.model.decoder.layers
  File "/home/zhangjp/anaconda3/envs/pytorch2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1207, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'LLaMAModel' object has no attribute 'decoder'
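A hedged note: newer transformers LLaMA ports expose the decoder layers directly on model.model instead of under a .decoder attribute, so llama_eval needs the newer path. A sketch that handles both layouts:

# older ports: model.model.decoder.layers; newer transformers: model.model.layers
if hasattr(model.model, 'decoder'):
    layers = model.model.decoder.layers
else:
    layers = model.model.layers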

GPTQ C++ Implementation Question

         I had a quick glance at the GPTQ paper yesterday, but haven't dug into details yet.

Do you think it is possible to demonstrate a simple routine for performing quantization using this method?
For example, what is the most trivial way (not necessary to be optimal) to implement a function like this:

// src - input 32-bit floats
// dst - output quantized data
// n - number of input floats
void quantize_gptq(float * src, void * dst, int n);

If I can get a prototype of this and it does not look too complex, I can try to plug it in ggml.
The main challenge will be to implement it efficiently with SIMD, but I need to see some initial implementation to work on.
Originally posted by @ggerganov in ggerganov/llama.cpp#9 (comment)

@qwopqwop200 This is for a related project. I thought you might be qualified to answer the question above.

Link to original question.

Lonnnnnnnnng context load time before generation

I'm running llama 65b on dual 3090s and at longer contexts I'm noticing seriously long context load times (the time between sending a prompt and tokens actually being received/streamed). It seems my CPU is only using a single core and maxing it out to 100%... Is there something it's doing that's heavily serialized? ... Any way to parallelize the workflow?

cuda extension problem

I tested the install in NVIDIA Docker; the generated build.ninja includes an incorrect sm id such as -gencode arch=compute_52,code=sm_52.

# Install kernels
python setup_cuda.py install
cuda_post_cflags = -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"''  -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=quant_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -std=c++14

This should be ok.

setup_cuda.py should be changed: sm_89 (the 4090) is not in the parameters.

from setuptools import setup, Extension
from torch.utils import cpp_extension

nvcc_args = [
    '-gencode', 'arch=compute_80,code=sm_80',
    '-gencode', 'arch=compute_86,code=sm_86',
    '-gencode', 'arch=compute_90,code=sm_90'
]

setup(
    name='quant_cuda',
    ext_modules=[cpp_extension.CUDAExtension(
        'quant_cuda', ['quant_cuda.cpp', 'quant_cuda_kernel.cu'], extra_compile_args={'nvcc': nvcc_args}
    )],
    cmdclass={'build_ext': cpp_extension.BuildExtension}
)
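If the point is covering the 4090, a hedged addition to the proposed list (sm_89 needs CUDA 11.8 or newer to compile):

nvcc_args = [
    '-gencode', 'arch=compute_80,code=sm_80',
    '-gencode', 'arch=compute_86,code=sm_86',
    '-gencode', 'arch=compute_89,code=sm_89',  # Ada (RTX 4090); requires CUDA >= 11.8
    '-gencode', 'arch=compute_90,code=sm_90'
]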

FP8 Quantization?

FP8 would enable greater dynamic range than Int8 and less information loss during compression. It would require roughly 2x more GPU memory than Int4, for those who can afford it.

Bad performance of OPT models

Hello. I tested the OPT-2.7B-Erebus model with and without 4-bit quantization, using the same prompt.

Got following results on RTX 4090:
Original model:

Loading OPT-2.7B-Erebus...
Loaded the model in 3.33 seconds.
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
Output generated in 9.10 seconds (21.97 tokens/s, 200 tokens)

4-bit quantized model:

Loading OPT-2.7B-Erebus...
Loading model ...
Done.
Loaded the model in 1.00 seconds.
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
Output generated in 57.77 seconds (3.46 tokens/s, 200 tokens)

You can notice a huge performance drop (from about 22 to 3.5 tokens/s).
Do you have any idea what might be causing it?

Quantizing GALACTICA?

I have tried quantizing galactica-30b with this command:

CUDA_VISIBLE_DEVICES=0 python opt.py /models/galactica-30b --wbits 4 --save galactica-30b-4bit.pt c4

And then using it in the web UI with this one:

python server.py --listen  --gptq-bits 4  --model galactica-30b --gptq-model-type opt

The results look very bad. For the prompt

The top 10 equations of all time are:

I get the completion

The top 10 equations of all time are:

  1. The equation that has been used the most is a simple one, namely y=x; it was also found in our earlier study on elementary functions and integrals[ A Study On Solving Equations With New Methods For Symbolic Integration And Differentiation Of Computer Algebra Systems In General Purpose Calculators By Using Microsoft Excel VBA Programming Language”] as well) but now we see its usage even more frequently than before! It should be noted here though what can not happen to this very useful function since by using MS-Excel’s own builtin “y=m*n+b/(cde...ghijklmnopqrstuvwxyz|{}~–“the user will get exactly zero result for any number he may enter into x variable!! So there must exist some kindred method which gives us results like those obtained from just mentioned formula above with no problems at least regarding division operation involved.. One such possibility might come out if you look carefully enough around your office

Am I doing something wrong?

Problem with setup_cuda.py install

Hi,

When running python setup_cuda.py install I get the following error:

running install
running bdist_egg
running egg_info
writing quant_cuda.egg-info\PKG-INFO
writing dependency_links to quant_cuda.egg-info\dependency_links.txt
writing top-level names to quant_cuda.egg-info\top_level.txt
reading manifest file 'quant_cuda.egg-info\SOURCES.txt'
writing manifest file 'quant_cuda.egg-info\SOURCES.txt'
installing library code to build\bdist.win-amd64\egg
running install_lib
running build_ext
error: [WinError 2] The system cannot find the file specified

I have no idea why this is happening. Any help would be appreciated.
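A hedged note: "[WinError 2] The system cannot find the file specified" during build_ext usually means the build cannot find the compiler driver, i.e. cl.exe (or ninja) is not on PATH. Two things that commonly help: install ninja into the environment, and run the install from the "x64 Native Tools Command Prompt for VS" so the MSVC toolchain is on PATH.

pip install ninja
python setup_cuda.py install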

Quantization produces non-deterministic weights

Below is a segment of the 7B 4-bit weights generated with the same command, in the same environment, on two different video cards: an A4000 (on the left) and an A6000 (on the right).

Notice how every 20-40 bytes there is a half-byte difference? These differences are always off by one: a B becomes an A, a 5 becomes a 6, etc. This issue seems to persist across all model sizes when producing weights on different cards.

[screenshot of the hex diff omitted]

No idea what is causing it.

Without reproducible builds it is hard to say if we're actually producing the same weights.
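A hedged note: PyTorch's determinism switches (below) make runs repeatable on the same machine, but bit-exact agreement across different GPU models is still not guaranteed, since reduction order and kernel selection differ per architecture, which matches the off-by-one rounding differences seen here.

import os
import torch

os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # must be set before any cuBLAS call
torch.manual_seed(0)
torch.use_deterministic_algorithms(True)
torch.backends.cudnn.benchmark = False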

probability tensor contains either `inf`, `nan` or element < 0

CUDA_VISIBLE_DEVICES=0 python llama_inference.py decapoda-research/llama-7b-hf --wbits 4 --load llama7b-4bit.pt --text "this is llama"
Loading model ...
Done.
Traceback (most recent call last):
  File "llama_inference.py", line 115, in <module>
    generated_ids = model.generate(
  File "/root/miniconda3/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/transformers-4.27.0.dev0-py3.8.egg/transformers/generation/utils.py", line 1452, in generate
    return self.sample(
  File "/root/miniconda3/lib/python3.8/site-packages/transformers-4.27.0.dev0-py3.8.egg/transformers/generation/utils.py", line 2504, in sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either inf, nan or element < 0

state_dict error on model load

Currently using https://huggingface.co/decapoda-research/llama-13b-hf for the config file and https://huggingface.co/decapoda-research/llama-13b-hf-int4 for the actual model file.

On load, following the readme, I get this error.

Missing key(s) in state_dict: "model.decoder.embed_tokens.weight", "model.decoder.layers.0.self_attn.q_proj.zeros", "model.decoder.layers.0.self_attn.q_proj.scales", etc...

Using transformers Version: 4.27.0.dev0
Using torch Version: 1.13.1
Using datasets Version: 2.10.1
and on CUDA release 11.7

Everything, as far as I can tell, has been set up properly, and the kernel compiles successfully.

EDIT:
After downgrading to 1.12.1, I'm being told my 3060 doesn't work with that version of PyTorch. It's still giving me the error, though.

EDIT 2: Going to downgrade CUDA to 11.3 to see if it does anything.

4-bit llama gets progressively slower with each text generation

The generation takes more time with each message, as if there's an accumulating overhead.
For example: the second response is 11x faster than the last response. They have the same number of tokens.
The issue persists both on llama-7b and llama-13b
Running llama with: python3.10 server.py --load-in-4bit --model llama-7b-hf --cai-chat --no-stream

[screenshot of generation times omitted]

specs:
Gpu: RTX 3060 12GB
Cpu: Intel i5 12400f
Ram: 64GB DDR4 3200MHz
OS: Linux

Nvcc fatal : Unsupported gpu architecture 'compute_86'

I get the following error when trying to run setup.py from the GPTQ install. I have an RTX 3090 and followed the instructions from this GitHub gist:
FAILED: D:/AI/text-generation-webui/repositories/GPTQ-for-LLaMa/build/temp.win-amd64-cpython-310/Release/quant_cuda_kernel.obj
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\bin\nvcc --generate-dependencies-with-compile --dependency-output D:\AI\text-generation-webui\repositories\GPTQ-for-LLaMa\build\temp.win-amd64-cpython-310\Release\quant_cuda_kernel.obj.d --use-local-env -Xcompiler /MD -Xcompiler /wd4819 -Xcompiler /wd4251 -Xcompiler /wd4244 -Xcompiler /wd4267 -Xcompiler /wd4275 -Xcompiler /wd4018 -Xcompiler /wd4190 -Xcompiler /EHsc -Xcudafe --diag_suppress=base_class_has_different_dll_interface -Xcudafe --diag_suppress=field_without_dll_interface -Xcudafe --diag_suppress=dll_interface_conflict_none_assumed -Xcudafe --diag_suppress=dll_interface_conflict_dllexport_assumed -IC:\Users\cruge\miniconda3\envs\textgen\lib\site-packages\torch\include -IC:\Users\cruge\miniconda3\envs\textgen\lib\site-packages\torch\include\torch\csrc\api\include -IC:\Users\cruge\miniconda3\envs\textgen\lib\site-packages\torch\include\TH -IC:\Users\cruge\miniconda3\envs\textgen\lib\site-packages\torch\include\THC "-IC:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\include" -IC:\Users\cruge\miniconda3\envs\textgen\include -IC:\Users\cruge\miniconda3\envs\textgen\Include "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.29.30133\include" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\winrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\cppwinrt" -c D:\AI\text-generation-webui\repositories\GPTQ-for-LLaMa\quant_cuda_kernel.cu -o D:\AI\text-generation-webui\repositories\GPTQ-for-LLaMa\build\temp.win-amd64-cpython-310\Release\quant_cuda_kernel.obj -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=quant_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86
nvcc fatal : Unsupported gpu architecture 'compute_86'
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "C:\Users\cruge\miniconda3\envs\textgen\lib\site-packages\torch\utils\cpp_extension.py", line 1808, in _run_ninja_build
    subprocess.run(
  File "C:\Users\cruge\miniconda3\envs\textgen\lib\subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
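A hedged note: nvcc from CUDA 11.0 predates Ampere consumer cards, so it cannot generate code for compute_86 (the RTX 3090). The usual fix is a newer 11.x toolkit on PATH, e.g. the conda package used elsewhere on this page:

conda install -c "nvidia/label/cuda-11.7.0" cuda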

llama_inference RuntimeError: Internal: src/sentencepiece_processor.cc

python llama_inference.py ./llama-7b-hf --wbits 4 --load ./llama-7b-4bit.pt --text "this is llama"

Loading model ...
Done.

Traceback (most recent call last):
File "/root/GPTQ-for-LLaMa/llama_inference.py", line 114, in
tokenizer = AutoTokenizer.from_pretrained(args.model)
File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 679, in from_pretrained
return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1804, in from_pretrained
return cls._from_pretrained(
File "/opt/conda/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1958, in _from_pretrained
tokenizer = cls(*init_inputs, **init_kwargs)
File "/opt/conda/lib/python3.10/site-packages/transformers/models/llama/tokenization_llama.py", line 72, in init
self.sp_model.Load(vocab_file)
File "/opt/conda/lib/python3.10/site-packages/sentencepiece/init.py", line 905, in Load
return self.LoadFromFile(model_file)
File "/opt/conda/lib/python3.10/site-packages/sentencepiece/init.py", line 310, in LoadFromFile
return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
RuntimeError: Internal: src/sentencepiece_processor.cc(1101) [model_proto->ParseFromArray(serialized.data(), serialized.size())]
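
This error almost always means tokenizer.model in the model directory is missing or corrupt, commonly because a Git LFS pointer file was downloaded in place of the real file. A small check, assuming the same ./llama-7b-hf directory passed to llama_inference.py:

import os
import sentencepiece as spm

tokenizer_path = os.path.join("./llama-7b-hf", "tokenizer.model")
# A genuine LLaMA tokenizer.model is roughly 500 KB; a file of only a few hundred
# bytes is most likely a Git LFS pointer rather than the real model.
print(os.path.getsize(tokenizer_path))

sp = spm.SentencePieceProcessor()
sp.Load(tokenizer_path)  # reproduces the ParseFromArray error if the file is corrupt

Re-downloading the tokenizer files with git lfs pull (or fetching tokenizer.model directly from the model page) normally resolves it.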

Will LoRAs work with this?

See: https://github.com/tloen/alpaca-lora/blob/main/generate.py

I tried modifying the code to look like the following, but have had no luck so far.

from peft import PeftModel
from transformers import LlamaTokenizer, LlamaForCausalLM, GenerationConfig
# Note: released transformers versions spell these LlamaTokenizer/LlamaForCausalLM;
# older alpaca-lora snippets use LLaMATokenizer/LLaMAForCausalLM from the pre-release fork.

tokenizer = LlamaTokenizer.from_pretrained("decapoda-research/llama-7b-hf")

# Loads the base model in 8-bit via bitsandbytes; this is not the GPTQ 4-bit checkpoint.
model = LlamaForCausalLM.from_pretrained(
    "decapoda-research/llama-7b-hf",
    load_in_8bit=True,
    device_map="auto",
)
# Applies the LoRA adapter weights on top of the base model.
model = PeftModel.from_pretrained(model, "tloen/alpaca-lora-7b")
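
For reference, the linked generate.py then builds a GenerationConfig and calls generate on the wrapped model; a rough sketch of that step follows (the sampling values here are illustrative, not the exact ones from that script):

import torch
from transformers import GenerationConfig

generation_config = GenerationConfig(temperature=0.1, top_p=0.75, num_beams=4)
inputs = tokenizer("Tell me about alpacas.", return_tensors="pt")
with torch.no_grad():
    output = model.generate(
        input_ids=inputs.input_ids.to("cuda"),
        generation_config=generation_config,
        max_new_tokens=128,
    )
print(tokenizer.decode(output[0]))

Note that this path loads the base model in 8-bit through bitsandbytes; applying a LoRA on top of the GPTQ 4-bit quantized weights is a separate question that this snippet does not answer.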

NameError: name 'quant_cuda' is not defined

I successfully loaded the 7B LLaMA model in 4-bit, but when I try to generate some text this happens:

Starting the web UI...
Loading the extension "gallery"... Ok.
Loading llama-7b...
CUDA extension not installed.
Loading model ...
Done.
Loaded the model in 4.07 seconds.
Running on local URL: http://127.0.0.1:7860

To create a public link, set share=True in launch().
Traceback (most recent call last):
File "C:\Users\zblac\Downloads\oobabooga\installer_files\env\lib\site-packages\gradio\routes.py", line 374, in run_predict
output = await app.get_blocks().process_api(
File "C:\Users\zblac\Downloads\oobabooga\installer_files\env\lib\site-packages\gradio\blocks.py", line 1017, in process_api
result = await self.call_function(
File "C:\Users\zblac\Downloads\oobabooga\installer_files\env\lib\site-packages\gradio\blocks.py", line 849, in call_function
prediction = await anyio.to_thread.run_sync(
File "C:\Users\zblac\Downloads\oobabooga\installer_files\env\lib\site-packages\anyio\to_thread.py", line 31, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "C:\Users\zblac\Downloads\oobabooga\installer_files\env\lib\site-packages\anyio_backends_asyncio.py", line 937, in run_sync_in_worker_thread
return await future
File "C:\Users\zblac\Downloads\oobabooga\installer_files\env\lib\site-packages\anyio_backends_asyncio.py", line 867, in run
result = context.run(func, *args)
File "C:\Users\zblac\Downloads\oobabooga\installer_files\env\lib\site-packages\gradio\utils.py", line 453, in async_iteration
return next(iterator)
File "C:\Users\zblac\Downloads\oobabooga\text-generation-webui\modules\chat.py", line 126, in chatbot_wrapper
for reply in generate_reply(f"{prompt}{' ' if len(reply) > 0 else ''}{reply}", max_new_tokens, do_sample, temperature, top_p, typical_p, repetition_penalty, top_k, min_length, no_repeat_ngram_size, num_beams, penalty_alpha, length_penalty, early_stopping, eos_token=eos_token, stopping_string=f"\n{name1}:"):
File "C:\Users\zblac\Downloads\oobabooga\text-generation-webui\modules\text_generation.py", line 170, in generate_reply
output = eval(f"shared.model.generate({', '.join(generate_params)}){cuda}")[0]
File "", line 1, in
File "C:\Users\zblac\Downloads\oobabooga\installer_files\env\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "C:\Users\zblac\Downloads\oobabooga\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 1452, in generate
return self.sample(
File "C:\Users\zblac\Downloads\oobabooga\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 2468, in sample
outputs = self(
File "C:\Users\zblac\Downloads\oobabooga\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\zblac\Downloads\oobabooga\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 772, in forward
outputs = self.model(
File "C:\Users\zblac\Downloads\oobabooga\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\zblac\Downloads\oobabooga\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 621, in forward
layer_outputs = decoder_layer(
File "C:\Users\zblac\Downloads\oobabooga\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\zblac\Downloads\oobabooga\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 318, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "C:\Users\zblac\Downloads\oobabooga\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\zblac\Downloads\oobabooga\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 218, in forward
query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
File "C:\Users\zblac\Downloads\oobabooga\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\zblac\Downloads\oobabooga\text-generation-webui\repositories\GPTQ-for-LLaMa\quant.py", line 198, in forward
quant_cuda.vecquant4matmul(x, self.qweight, y, self.scales, self.zeros)
NameError: name 'quant_cuda' is not defined
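
The "CUDA extension not installed." line printed while loading the model is the real problem: the quant_cuda extension from GPTQ-for-LLaMa was never built into the environment the web UI runs in, so quant.py later references a name that does not exist. A minimal check, assuming the extension has been built and installed (the repo's setup_cuda.py install step) with the same interpreter the web UI uses:

# Run with the Python from installer_files\env (the environment seen in the traceback).
import quant_cuda

print(hasattr(quant_cuda, "vecquant4matmul"))  # True once the 4-bit kernels are installed

If the import fails, the extension has to be rebuilt with that environment's interpreter; installing it into a different system Python will not help the web UI.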

RuntimeError: Tensors must have same number of dimensions: got 3 and 4

The following command:

python repositories/GPTQ-for-LLaMa/llama.py /path/to/my/text-generation-webui/models/llama-7b c4 --wbits 4 --save llama-7b-4bit.pt

Fails with:

Traceback (most recent call last):
File "/path/to/my/text-generation-webui/repositories/GPTQ-for-LLaMa/llama.py", line 410, in <module>
quantizers = llama_sequential(model, dataloader, DEV)
File "/usr/local/lib64/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/path/to/my/text-generation-webui/repositories/GPTQ-for-LLaMa/llama.py", line 88, in llama_sequential
outs[j] = layer(inps[j].unsqueeze(0), attention_mask=attention_mask)[0]
File "/usr/local/lib64/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in call_impl
return forward_call(*input, **kwargs)
File "/my/homedir/.local/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 318, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/usr/local/lib64/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in call_impl
return forward_call(*input, **kwargs)
File "/my/homedir/.local/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 228, in forward
query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, offset=offset)
File "/my/homedir/.local/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 142, in apply_rotary_pos_emb
q_embed = (q * cos) + (rotate_half(q) * sin)
File "/my/homedir/.local/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 136, in rotate_half
return torch.cat((-x2, x1), dim=-1)
RuntimeError: Tensors must have same number of dimensions: got 3 and 4

Any thoughts on what might be wrong? That source model works just fine on its own.

I've tried downloading models converted by others and using them in text-generation-webui. The server starts fine, but whenever I click generate, I again get the same "Tensors must have same number of dimensions" error.

In case it matters, I do have two GPUs. But setting CUDA_VISIBLE_DEVICES=0 doesn't help, neither with llama.py nor text-generation-webui's server.py. Nor does --gpu_memory 5 5 or similar help in text-generation-webui.
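
This is typically a version mismatch rather than a GPU problem: the LLaMA modeling code in the installed transformers does not match the version GPTQ-for-LLaMa's llama.py was written against, and the rotary-embedding helpers changed their shape conventions between ports, which is exactly where the 3-D/4-D mismatch surfaces. A quick diagnostic to record which transformers copy and which apply_rotary_pos_emb signature are actually being imported (standard library calls only):

import inspect
import transformers
from transformers.models.llama import modeling_llama

print(transformers.__version__)
print(modeling_llama.__file__)  # shows whether the ~/.local copy or the system copy wins
print(inspect.signature(modeling_llama.apply_rotary_pos_emb))

The traceback above already shows a split install (torch under /usr/local/lib64, transformers under ~/.local), so making sure a single transformers version, matching the one this repo expects, is on the path is the first thing worth checking.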
