Comments (15)

drisspg commented on August 27, 2024

Ohh @yifuwang, thank you, that is a great catch. I will put up a PR right now to fix this in PyTorch.

VendaCino commented on August 27, 2024

It seems your GPU does not support bf16; changing all torch.bfloat16 to torch.float32 may work.

yifuwang commented on August 27, 2024

@drisspg I tested on a V100. Both eager and compiled run into the same error.

I think the issue is that mem_eff_attention doesn't support bf16 on sm < 80: https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/transformers/cuda/mem_eff_attention/kernels/cutlassF.h#L286

I tested with float16 and it works. Shall we default gpt-fast to float16 for V100 and older?
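A minimal sketch of how that default could be chosen (the choose_dtype helper is hypothetical, not existing gpt-fast API):

import torch

def choose_dtype(device: str = "cuda") -> torch.dtype:
    # mem_eff_attention only ships bf16 kernels for sm80+ (Ampere and newer),
    # so fall back to float16 on older GPUs such as the V100 (sm70).
    # torch.cuda.is_bf16_supported() is a coarser alternative check.
    major, _ = torch.cuda.get_device_capability(device)
    return torch.bfloat16 if major >= 8 else torch.float16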

VendaCino commented on August 27, 2024

The '⁇ ⁇ ⁇' output is because the tensor values are nan.
Debugging, I found that the kv_cache in that attention layer is nan.
The issue does not happen when every dtype is torch.float32 instead of torch.float16,
and it does not happen when I use TinyLlama, only with Vicuna-7B.

[screenshot of the debug session]

Hope this information helps trace the problem.

Update: deeper debugging found that it is because x.max() = inf.
I think some layer's output is too large for float16 to represent.

Time to load model: 1.97 seconds
tensor(7.0664, device='cuda:0', dtype=torch.float16)
tensor(18.3906, device='cuda:0', dtype=torch.float16)
tensor(inf, device='cuda:0', dtype=torch.float16)
tensor(nan, device='cuda:0', dtype=torch.float16)

It depends on the model's weights, which is why it works fine when I test with TinyLlama.
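A tiny standalone check shows the mechanism: float16 tops out at 65504, anything larger becomes inf, and the inf turns into nan downstream:

import torch

print(torch.finfo(torch.float16).max)          # 65504.0
x = torch.tensor([70000.0]).to(torch.float16)  # too large for float16
print(x)                                       # tensor([inf], dtype=torch.float16)
print(x - x)                                   # tensor([nan], dtype=torch.float16)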

When I use model.pth:

Time to load model: 10.10 seconds
tensor(7.0625, device='cuda:0', dtype=torch.float16)
tensor(18.3438, device='cuda:0', dtype=torch.float16)
tensor(1532., device='cuda:0', dtype=torch.float16)

So I guess something is wrong in WeightOnlyInt8Linear:

class WeightOnlyInt8Linear(torch.nn.Module):
    ...

    def forward(self, input: torch.Tensor) -> torch.Tensor:
        # precision is lost here: the float16 matmul can overflow to inf
        # before the scales are applied
        return F.linear(input, self.weight.to(dtype=input.dtype)) * self.scales

Changing it to:

    def forward(self, input: torch.Tensor) -> torch.Tensor:
        # accumulate in float32, apply the scales, then cast back down
        return (F.linear(input.to(dtype=torch.float32), self.weight.to(dtype=torch.float32)) * self.scales).to(dtype=input.dtype)

makes everything look good:

Time to load model: 1.66 seconds
tensor(7.0664, device='cuda:0', dtype=torch.float16)
tensor(18.3906, device='cuda:0', dtype=torch.float16)
tensor(1535., device='cuda:0', dtype=torch.float16)
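
As a sanity check, here is a standalone repro of why the float32 accumulation matters (made-up weights and scales, not the real checkpoint):

import torch
import torch.nn.functional as F

# Made-up stand-ins for one int8-quantized layer.
x = torch.full((1, 4096), 4.0, dtype=torch.float16, device="cuda")
w_int8 = torch.full((4096, 4096), 127, dtype=torch.int8, device="cuda")
scales = torch.full((4096,), 1e-3, device="cuda")

# Original path: the raw matmul result (4096 * 4 * 127 ~= 2.08e6) overflows
# float16's max (65504) to inf before the scales are applied.
y_bad = F.linear(x, w_int8.to(dtype=x.dtype)) * scales.to(dtype=x.dtype)

# Fixed path: accumulate in float32, scale, then cast back down.
y_good = (F.linear(x.to(torch.float32), w_int8.to(torch.float32)) * scales).to(x.dtype)

print(y_bad[0, 0], y_good[0, 0])  # tensor(inf, ...) vs tensor(2080., ...)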

merveermann commented on August 27, 2024

I have the same error

Armod-I commented on August 27, 2024

same error

merveermann commented on August 27, 2024

My conda environment is as follows:

GPU: RTX 5000
CUDA: 12.3

Name Version Build Channel
_libgcc_mutex 0.1 main
_openmp_mutex 5.1 1_gnu
blas 1.0 mkl
brotlipy 0.7.0 py311h9bf148f_1002 pytorch-nightly
bzip2 1.0.8 h7b6447c_0
ca-certificates 2023.08.22 h06a4308_0
certifi 2023.11.17 py311h06a4308_0
cffi 1.15.1 py311h9bf148f_3 pytorch-nightly
charset-normalizer 2.0.4 pyhd3eb1b0_0
cryptography 38.0.4 py311h46ebde7_0 pytorch-nightly
cuda-cudart 12.1.105 0 nvidia
cuda-cupti 12.1.105 0 nvidia
cuda-libraries 12.1.0 0 nvidia
cuda-nvrtc 12.1.105 0 nvidia
cuda-nvtx 12.1.105 0 nvidia
cuda-opencl 12.3.101 0 nvidia
cuda-runtime 12.1.0 0 nvidia
ffmpeg 4.2.2 h20bf706_0
filelock 3.9.0 py311_0 pytorch-nightly
freetype 2.12.1 h4a9f257_0
fsspec 2023.12.2 pypi_0 pypi
giflib 5.2.1 h5eee18b_3
gmp 6.2.1 h295c915_3
gmpy2 2.1.2 py311hc9b5ff0_0
gnutls 3.6.15 he1e5248_0
huggingface-hub 0.19.4 pypi_0 pypi
idna 3.4 py311h06a4308_0
intel-openmp 2021.4.0 h06a4308_3561
jinja2 3.1.2 py311h06a4308_0
jpeg 9e h5eee18b_1
lame 3.100 h7b6447c_0
lcms2 2.12 h3be6417_0
ld_impl_linux-64 2.38 h1181459_1
lerc 3.0 h295c915_0
libcublas 12.1.0.26 0 nvidia
libcufft 11.0.2.4 0 nvidia
libcufile 1.8.1.2 0 nvidia
libcurand 10.3.4.101 0 nvidia
libcusolver 11.4.4.55 0 nvidia
libcusparse 12.0.2.55 0 nvidia
libdeflate 1.17 h5eee18b_1
libffi 3.4.4 h6a678d5_0
libgcc-ng 11.2.0 h1234567_1
libgomp 11.2.0 h1234567_1
libidn2 2.3.4 h5eee18b_0
libjpeg-turbo 2.0.0 h9bf148f_0 pytorch-nightly
libnpp 12.0.2.50 0 nvidia
libnvjitlink 12.1.105 0 nvidia
libnvjpeg 12.1.1.14 0 nvidia
libopus 1.3.1 h7b6447c_0
libpng 1.6.39 h5eee18b_0
libstdcxx-ng 11.2.0 h1234567_1
libtasn1 4.19.0 h5eee18b_0
libtiff 4.5.1 h6a678d5_0
libunistring 0.9.10 h27cfd23_0
libuuid 1.41.5 h5eee18b_0
libvpx 1.7.0 h439df22_0
libwebp 1.2.4 h11a3e52_1
libwebp-base 1.2.4 h5eee18b_1
llvm-openmp 14.0.6 h9e868ea_0
lz4-c 1.9.4 h6a678d5_0
markupsafe 2.1.1 py311h5eee18b_0
mkl 2021.4.0 h06a4308_640
mkl-service 2.4.0 py311h9bf148f_0 pytorch-nightly
mkl_fft 1.3.1 py311hc796f24_0 pytorch-nightly
mkl_random 1.2.2 py311hbba84a0_0 pytorch-nightly
mpc 1.1.0 h10f8cd9_1
mpfr 4.0.2 hb69a4c5_1
mpmath 1.2.1 py311_0 pytorch-nightly
ncurses 6.4 h6a678d5_0
nettle 3.7.3 hbbd107a_1
networkx 3.1 py311h06a4308_0
numpy 1.24.3 py311hc206e33_0
numpy-base 1.24.3 py311hfd5febd_0
openh264 2.1.1 h4ff587b_0
openssl 3.0.12 h7f8727e_0
packaging 23.2 pypi_0 pypi
pillow 9.3.0 py311h3fd9d12_2 pytorch-nightly
pip 23.3.1 py311h06a4308_0
pycparser 2.21 pyhd3eb1b0_0
pyopenssl 23.2.0 py311h06a4308_0
pysocks 1.7.1 py311_0 pytorch-nightly
python 3.11.5 h955ad1f_0
pytorch 2.3.0.dev20231214 py3.11_cuda12.1_cudnn8.9.2_0 pytorch-nightly
pytorch-cuda 12.1 ha16c6d3_5 pytorch-nightly
pytorch-mutex 1.0 cuda pytorch-nightly
pyyaml 6.0.1 py311h5eee18b_0
readline 8.2 h5eee18b_0
requests 2.28.1 py311_0 pytorch-nightly
sentencepiece 0.1.99 pypi_0 pypi
setuptools 68.2.2 py311h06a4308_0
six 1.16.0 pyhd3eb1b0_1
sqlite 3.41.2 h5eee18b_0
sympy 1.12 py311h06a4308_0
tk 8.6.12 h1ccaba5_0
torchaudio 2.2.0.dev20231214 py311_cu121 pytorch-nightly
torchtriton 2.1.0+bcad9dabe1 py311 pytorch-nightly
torchvision 0.18.0.dev20231214 py311_cu121 pytorch-nightly
tqdm 4.66.1 pypi_0 pypi
typing_extensions 4.7.1 py311h06a4308_0
tzdata 2023c h04d1e81_0
urllib3 1.26.14 py311_0 pytorch-nightly
wheel 0.41.2 py311h06a4308_0
x264 1!157.20191217 h7b6447c_0
xz 5.4.5 h5eee18b_0
yaml 0.2.5 h7b6447c_0
zlib 1.2.13 h5eee18b_0
zstd 1.5.5 hc292b87_0

Chillee commented on August 27, 2024

cc: @drisspg

drisspg commented on August 27, 2024

Can you try using the patch release, or nightly?

Chillee commented on August 27, 2024

@drisspg #46 (comment): @merveermann says they're using the nightly, I believe.

drisspg commented on August 27, 2024

So this error is being thrown on nightly for these devices: V100, RTX 5000.
Are there any others?

Also, is it possible to give example inputs to SDPA that are causing this error to be thrown?
Is this only happening when the model is being compiled?

My hunch is that compile is doing some memory planning optimizations that cause the alignment check here to fail for all possible kernels: https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/transformers/cuda/attention.cu#L1023-L1027
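
For what it's worth, a minimal eager-mode sketch of the kernel-support half of this (made-up shapes; run on a pre-sm80 card such as a V100):

import torch
import torch.nn.functional as F

# Made-up shapes; any bf16 SDPA call forced onto the mem-efficient backend
# should hit the same sm < 80 support check mentioned above.
q = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.bfloat16)
k = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.bfloat16)
v = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.bfloat16)

with torch.backends.cuda.sdp_kernel(enable_flash=False,
                                    enable_math=False,
                                    enable_mem_efficient=True):
    out = F.scaled_dot_product_attention(q, k, v)  # expected to raise on a V100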

goodboyyes2009 commented on August 27, 2024

Thank you all. After changing all torch.bfloat16 to torch.float32, running the unquantized model works well,
but running with int8 seems wrong:

root@md:/home/projects/gpt-fast# CUDA_VISIBLE_DEVICES=0 python3 generate.py --compile --checkpoint_path /models/huggingface_models/meta-Llama-2-7b-hf/model_int8.pth --max_new_tokens 100
Loading model ...
Using int8 weight-only quantization!
/opt/conda/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
Time to load model: 2.52 seconds
[2023-12-19 00:54:26,247] [0/0] torch._dynamo.output_graph: [WARNING] nn.Module state_dict and backward hooks are not yet supported by torch.compile, but were detected in your model and will be silently ignored. See https://pytorch.org/docs/master/compile/nn-module.html for more information and limitations.
Compilation time: 101.21 seconds
Hello, my name is ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇ 
Time for inference 1: 4.87 sec total, 20.53 tokens/sec
Bandwidth achieved: 141.08 GB/s
Hello, my name is ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇ 
Time for inference 2: 4.87 sec total, 20.55 tokens/sec
Bandwidth achieved: 141.25 GB/s
Hello, my name is ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇ 
Time for inference 3: 4.87 sec total, 20.55 tokens/sec
Bandwidth achieved: 141.24 GB/s
Hello, my name is ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇ 
Time for inference 4: 4.87 sec total, 20.55 tokens/sec
Bandwidth achieved: 141.22 GB/s
Hello, my name is ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇ 
Time for inference 5: 4.87 sec total, 20.55 tokens/sec
Bandwidth achieved: 141.24 GB/s
==========
Average tokens/sec: 20.55
Memory used: 8.01 GB

VendaCino commented on August 27, 2024

@goodboyyes2009 Did you re-run quantize.py after changing torch.bfloat16 to torch.float32?

goodboyyes2009 commented on August 27, 2024

@VendaCino Oh, sorry, I did re-run quantize.py, but I changed all torch.bfloat16 to torch.float16.
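
For anyone hitting the same thing: the quantized model_int8.pth bakes in whichever dtype quantize.py ran with, so it has to be regenerated after changing dtypes, e.g. along these lines (check quantize.py for the exact flags):

python quantize.py --checkpoint_path /models/huggingface_models/meta-Llama-2-7b-hf/model.pth --mode int8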

goodboyyes2009 commented on August 27, 2024

OK. Thank you very much! @VendaCino
