Comments (15)

BlinkDL commented on July 20, 2024

Let's go for option 1: "Reduce to fp32 and convert back to fp16 only on older architectures."

BlinkDL commented on July 20, 2024

Fixed :) and it's 10% faster on A100 too.

BlinkDL commented on July 20, 2024

> That was cuda fp16i8 *15+ -> cuda fp16 *1 and RWKV-4-Pile-7B-20230109-ctx4096.pth (it actually seems like it's using less memory now as well, so I could probably add a few more layers on the GPU.)

Yeah, I am processing in chunks, so VRAM usage is much smaller for longer inputs.
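
That matches how ChatRWKV's v2/chat.py feeds the prompt through the model: the token list is consumed in fixed-size chunks, and the recurrent state carries context between chunks, so peak VRAM depends on the chunk size rather than the prompt length. A minimal sketch, assuming the rwkv pip package and a hypothetical model path:

```python
from rwkv.model import RWKV
from rwkv.utils import PIPELINE

CHUNK_LEN = 256  # chunk size used in ChatRWKV's v2/chat.py

model = RWKV(model='/path/to/RWKV-4-Pile-7B-20230109-ctx4096',  # hypothetical path
             strategy='cuda fp16')
pipeline = PIPELINE(model, '20B_tokenizer.json')  # tokenizer file shipped with ChatRWKV
tokens = pipeline.encode('Your prompt here')

# Feed the prompt CHUNK_LEN tokens at a time; only one chunk is in flight,
# so peak VRAM stays flat regardless of total prompt length.
state = None
while len(tokens) > 0:
    out, state = model.forward(tokens[:CHUNK_LEN], state)
    tokens = tokens[CHUNK_LEN:]
```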

Blealtan commented on July 20, 2024

It's required to be atomic since the reduction is done in parallel across blocks. There are some possible options, though:

  1. Reduce to fp32 and convert back to fp16 only on older architectures.
  2. Always reduce to fp32 and convert back; this provides higher precision but should slow things down a bit (see the sketch after this comment).
  3. Do a stable fp16 reduction by summing afterward; this avoids the current numeric indeterminism but might slow things down more.
  4. Do a stable fp32 reduction. Best precision, works everywhere, but hurts performance even more.

@BlinkDL your opinion?
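
For intuition on options 1 and 2, here is a minimal PyTorch sketch of why fp32 accumulation helps (hypothetical shapes; the real reduction happens inside the CUDA kernel via atomic adds across blocks):

```python
import torch

# Stand-ins for the per-block partial results that the kernel combines.
partials = [torch.randn(4096, dtype=torch.float16, device='cuda') for _ in range(64)]

# Current behavior: fp16 accumulation. Every += rounds to half precision, so
# error builds up; in the kernel, the atomics also make the summation order
# (and therefore the rounding) nondeterministic across runs.
acc16 = torch.zeros(4096, dtype=torch.float16, device='cuda')
for p in partials:
    acc16 += p

# Options 1/2: accumulate in fp32 and convert back to fp16 once at the end,
# paying a single rounding step instead of 64.
acc32 = torch.zeros(4096, dtype=torch.float32, device='cuda')
for p in partials:
    acc32 += p.float()
result = acc32.half()

print((acc16.float() - acc32).abs().max())  # accumulated fp16 rounding error
```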

KerfuffleV2 commented on July 20, 2024

This fixes my problem and is also much faster, even on compute capability 6:

Output generated in 46.39 seconds (1.70 tokens/s, 79 tokens)

The fastest I saw before this was 1.17 tokens/s.

BlinkDL commented on July 20, 2024

> This fixes my problem and is also much faster, even on compute capability 6:
>
> Output generated in 46.39 seconds (1.70 tokens/s, 79 tokens)
>
> The fastest I saw before this was 1.17 tokens/s.

Cool. What model and strategy?

KerfuffleV2 commented on July 20, 2024

That was cuda fp16i8 *15+ -> cuda fp16 *1 and RWKV-4-Pile-7B-20230109-ctx4096.pth (it actually seems like it's using less memory now as well, so I could probably add a few more layers on the GPU.)
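
For anyone copying that setup, loading it with the rwkv pip package would look roughly like this (a sketch; the path is hypothetical, and the reading of the strategy string follows the ChatRWKV strategy guide):

```python
from rwkv.model import RWKV

# First 15 layers resident on the GPU quantized to fp16i8; the '+' streams
# the remaining layers to the GPU on demand; the final layer runs in fp16.
model = RWKV(model='/path/to/RWKV-4-Pile-7B-20230109-ctx4096',  # no .pth suffix
             strategy='cuda fp16i8 *15+ -> cuda fp16 *1')
```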

burgerlawful commented on July 20, 2024

I have a 3070 and a 1060. Using torch 1.13.1+cu117, rwkv 0.6.0, and the latest ChatRWKV version.
RWKV_CUDA_ON doesn't work for me when I try to assign the strategy to my 1060 with, for example, "cuda:1 fp16", so I can't split the strategy between my GPUs with RWKV_CUDA_ON. However, if I set CUDA_VISIBLE_DEVICES=1 and use "cuda fp16", it runs successfully on just my 1060.

Here's the error I get:

Run prompt...
C:\Users\user\AppData\Local\Programs\Python\Python310\lib\site-packages\rwkv\model.py:568: UserWarning: FALLBACK path has been taken inside: torch::jit::fuser::cuda::runCudaFusionGroup. This is an indication that codegen Failed for some reason.
To debug try disable codegen fallback path via setting the env variable `export PYTORCH_NVFUSER_DISABLE=fallback`
 (Triggered internally at ..\torch\csrc\jit\codegen\cuda\manager.cpp:336.)
  x, state[i*5+0], state[i*5+1], state[i*5+2], state[i*5+3] = ATT(
Traceback (most recent call last):
  File "G:\test\ChatRWKV\v2\chat.py", line 164, in <module>
    out = run_rnn(pipeline.encode(init_prompt))
  File "G:\test\ChatRWKV\v2\chat.py", line 133, in run_rnn
    out, model_state = model.forward(tokens[:CHUNK_LEN], model_state)
  File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\site-packages\rwkv\model.py", line 568, in forward
    x, state[i*5+0], state[i*5+1], state[i*5+2], state[i*5+3] = ATT(
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
  File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\site-packages\rwkv\model.py", line 472, in fallback_cuda_fuser
            sx = torch.cat((sx.unsqueeze(0), xx[:-1,:]))
            kx = xx * k_mix + sx * (1 - k_mix)
            vx = xx * v_mix + sx * (1 - v_mix)
                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
            rx = xx * r_mix + sx * (1 - r_mix)

RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

And if I set CUDA_LAUNCH_BLOCKING=1 and run it again:

Run prompt...
Traceback (most recent call last):
  File "G:\test\ChatRWKV\v2\chat.py", line 164, in <module>
    out = run_rnn(pipeline.encode(init_prompt))
  File "G:\test\ChatRWKV\v2\chat.py", line 133, in run_rnn
    out, model_state = model.forward(tokens[:CHUNK_LEN], model_state)
  File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\site-packages\rwkv\model.py", line 568, in forward
    x, state[i*5+0], state[i*5+1], state[i*5+2], state[i*5+3] = ATT(
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
  File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\site-packages\rwkv\model.py", line 480, in cuda_att_seq
            y, aa, bb, pp = cuda_wkv(T, C, t_decay, t_first, k, v, aa, bb, pp)

            out = (r * y) @ ow
                   ~~~~~~~~~~~ <--- HERE
            return x + out, xx[-1,:], aa, bb, pp
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)`
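
For reference, the single-GPU fallback that does work would look roughly like this (a sketch, assuming the rwkv pip package; the model path is hypothetical):

```python
import os

# Hide the 3070 so only the 1060 is visible to CUDA; this must be set
# before torch initializes the driver, hence before importing rwkv.
os.environ['CUDA_VISIBLE_DEVICES'] = '1'
os.environ['RWKV_CUDA_ON'] = '1'

from rwkv.model import RWKV

model = RWKV(model='/path/to/model',  # hypothetical path
             strategy='cuda fp16')
```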

BlinkDL commented on July 20, 2024

@burgerlawful try using the 1060 as cuda:0 so that it will compile the CUDA kernel for the lower architecture.

burgerlawful commented on July 20, 2024

If I do that, 'cuda:0 fp16' works on the 1060, but if I add the 3070, like 'cuda:0 fp16 -> cuda:1 fp16', I get the error again.

BlinkDL commented on July 20, 2024

> If I do that, 'cuda:0 fp16' works on the 1060, but if I add the 3070, like 'cuda:0 fp16 -> cuda:1 fp16', I get the error again.

How about 'cuda:1 fp16 -> cuda:0 fp16'?

burgerlawful commented on July 20, 2024

That doesn't work either. The only combination I found that works with RWKV_CUDA_ON and both cards is to put only the last layer on the 1060.
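
A sketch of what that working split might look like (the layer count and device order are assumptions, not from the comment: RWKV-4-Pile-7B has 32 blocks, and cuda:1 is taken to be the 1060):

```python
from rwkv.model import RWKV

# All blocks except the last on the bigger card; only the final layer
# goes to the 1060. The '*31' layer count is an assumption.
model = RWKV(model='/path/to/RWKV-4-Pile-7B-20230109-ctx4096',  # hypothetical path
             strategy='cuda:0 fp16 *31 -> cuda:1 fp16')
```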

Ph0rk0z commented on July 20, 2024

> That was cuda fp16i8 *15+ -> cuda fp16 *1 and RWKV-4-Pile-7B-20230109-ctx4096.pth (it actually seems like it's using less memory now as well, so I could probably add a few more layers on the GPU.)

I am at a quarter speed on compute capability 6.1, but I am running:

python server.py --cai-chat --model rwkv-4-pile-14b --rwkv-cuda-on --rwkv-strategy "cuda fp16i8 *22 -> cuda fp16"

BlinkDL commented on July 20, 2024

@burgerlawful please try the latest ChatRWKV and rwkv 0.7.3; it should be fixed now.

burgerlawful commented on July 20, 2024

> @burgerlawful please try the latest ChatRWKV and rwkv 0.7.3; it should be fixed now.

I just tried it and can confirm that it works now, thank you.
