Comments (15)

BlinkDL commented on July 20, 2024

Let's go for option 1: "Reduce to fp32 and convert back to fp16 only on older architectures."

BlinkDL commented on July 20, 2024

Fixed :) and it's 10% faster on A100 too.

BlinkDL commented on July 20, 2024

> That was cuda fp16i8 *15+ -> cuda fp16 *1 and RWKV-4-Pile-7B-20230109-ctx4096.pth (it actually seems like it's using less memory now as well, so I could probably add a few more layers on the GPU.)

Yeah, I am processing in chunks, so VRAM usage is much smaller for longer inputs.
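
That matches how ChatRWKV's v2/chat.py feeds the prompt through the model: the token list is consumed in fixed-size chunks, and the recurrent state carries context between chunks, so peak VRAM depends on the chunk size rather than the prompt length. A minimal sketch, assuming the rwkv pip package and a hypothetical model path:

```python
from rwkv.model import RWKV
from rwkv.utils import PIPELINE

CHUNK_LEN = 256  # chunk size used in ChatRWKV's v2/chat.py

model = RWKV(model='/path/to/RWKV-4-Pile-7B-20230109-ctx4096',  # hypothetical path
             strategy='cuda fp16')
pipeline = PIPELINE(model, '20B_tokenizer.json')  # tokenizer file shipped with ChatRWKV
tokens = pipeline.encode('Your prompt here')

# Feed the prompt CHUNK_LEN tokens at a time; only one chunk is in flight,
# so peak VRAM stays flat regardless of total prompt length.
state = None
while len(tokens) > 0:
    out, state = model.forward(tokens[:CHUNK_LEN], state)
    tokens = tokens[CHUNK_LEN:]
```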

Blealtan commented on July 20, 2024

It's required to be atomic since the reduction is done in parallel across blocks. There are some possible options, though:

  1. Reduce to fp32 and convert back to fp16 only on older architectures.
  2. Always reduce to fp32 and convert back; this provides higher precision but should slow things down a bit (see the sketch after this comment).
  3. Do a stable fp16 reduction by summing afterward; this avoids the current numeric indeterminism but might slow things down more.
  4. Do a stable fp32 reduction. Best precision, works everywhere, but hurts performance even more.

@BlinkDL your opinion?
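
For intuition on options 1 and 2, here is a minimal PyTorch sketch of why fp32 accumulation helps (hypothetical shapes; the real reduction happens inside the CUDA kernel via atomic adds across blocks):

```python
import torch

# Stand-ins for the per-block partial results that the kernel combines.
partials = [torch.randn(4096, dtype=torch.float16, device='cuda') for _ in range(64)]

# Current behavior: fp16 accumulation. Every += rounds to half precision, so
# error builds up; in the kernel, the atomics also make the summation order
# (and therefore the rounding) nondeterministic across runs.
acc16 = torch.zeros(4096, dtype=torch.float16, device='cuda')
for p in partials:
    acc16 += p

# Options 1/2: accumulate in fp32 and convert back to fp16 once at the end,
# paying a single rounding step instead of 64.
acc32 = torch.zeros(4096, dtype=torch.float32, device='cuda')
for p in partials:
    acc32 += p.float()
result = acc32.half()

print((acc16.float() - acc32).abs().max())  # accumulated fp16 rounding error
```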

KerfuffleV2 commented on July 20, 2024

This fixes my problem and is also much faster, even on compute capability 6:

Output generated in 46.39 seconds (1.70 tokens/s, 79 tokens)

The fastest I saw before this was 1.17 tokens/s.

BlinkDL commented on July 20, 2024

> This fixes my problem and is also much faster, even on compute capability 6:
>
> Output generated in 46.39 seconds (1.70 tokens/s, 79 tokens)
>
> The fastest I saw before this was 1.17 tokens/s.

Cool. What model and strategy?

KerfuffleV2 commented on July 20, 2024

That was cuda fp16i8 *15+ -> cuda fp16 *1 and RWKV-4-Pile-7B-20230109-ctx4096.pth (it actually seems like it's using less memory now as well, so I could probably add a few more layers on the GPU.)
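
For anyone copying that setup, loading it with the rwkv pip package would look roughly like this (a sketch; the path is hypothetical, and the reading of the strategy string follows the ChatRWKV strategy guide):

```python
from rwkv.model import RWKV

# First 15 layers resident on the GPU quantized to fp16i8; the '+' streams
# the remaining layers to the GPU on demand; the final layer runs in fp16.
model = RWKV(model='/path/to/RWKV-4-Pile-7B-20230109-ctx4096',  # no .pth suffix
             strategy='cuda fp16i8 *15+ -> cuda fp16 *1')
```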

burgerlawful commented on July 20, 2024

I have a 3070 and a 1060. Using torch 1.13.1+cu117, rwkv 0.6.0, and the latest ChatRWKV version.
RWKV_CUDA_ON doesn't work for me when I try to assign the strategy to my 1060 with, for example, "cuda:1 fp16", so I can't split the strategy between my GPUs with RWKV_CUDA_ON. However, if I set CUDA_VISIBLE_DEVICES=1 and use "cuda fp16", it runs successfully on just my 1060.

Here's the error I get:

Run prompt...
C:\Users\user\AppData\Local\Programs\Python\Python310\lib\site-packages\rwkv\model.py:568: UserWarning: FALLBACK path has been taken inside: torch::jit::fuser::cuda::runCudaFusionGroup. This is an indication that codegen Failed for some reason.
To debug try disable codegen fallback path via setting the env variable `export PYTORCH_NVFUSER_DISABLE=fallback`
 (Triggered internally at ..\torch\csrc\jit\codegen\cuda\manager.cpp:336.)
  x, state[i*5+0], state[i*5+1], state[i*5+2], state[i*5+3] = ATT(
Traceback (most recent call last):
  File "G:\test\ChatRWKV\v2\chat.py", line 164, in <module>
    out = run_rnn(pipeline.encode(init_prompt))
  File "G:\test\ChatRWKV\v2\chat.py", line 133, in run_rnn
    out, model_state = model.forward(tokens[:CHUNK_LEN], model_state)
  File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\site-packages\rwkv\model.py", line 568, in forward
    x, state[i*5+0], state[i*5+1], state[i*5+2], state[i*5+3] = ATT(
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
  File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\site-packages\rwkv\model.py", line 472, in fallback_cuda_fuser
            sx = torch.cat((sx.unsqueeze(0), xx[:-1,:]))
            kx = xx * k_mix + sx * (1 - k_mix)
            vx = xx * v_mix + sx * (1 - v_mix)
                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
            rx = xx * r_mix + sx * (1 - r_mix)

RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

And if I set CUDA_LAUNCH_BLOCKING=1 and run it again:

Run prompt...
Traceback (most recent call last):
  File "G:\test\ChatRWKV\v2\chat.py", line 164, in <module>
    out = run_rnn(pipeline.encode(init_prompt))
  File "G:\test\ChatRWKV\v2\chat.py", line 133, in run_rnn
    out, model_state = model.forward(tokens[:CHUNK_LEN], model_state)
  File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\site-packages\rwkv\model.py", line 568, in forward
    x, state[i*5+0], state[i*5+1], state[i*5+2], state[i*5+3] = ATT(
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
  File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\site-packages\rwkv\model.py", line 480, in cuda_att_seq
            y, aa, bb, pp = cuda_wkv(T, C, t_decay, t_first, k, v, aa, bb, pp)

            out = (r * y) @ ow
                   ~~~~~~~~~~~ <--- HERE
            return x + out, xx[-1,:], aa, bb, pp
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)`
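
For reference, the single-GPU fallback that does work would look roughly like this (a sketch, assuming the rwkv pip package; the model path is hypothetical):

```python
import os

# Hide the 3070 so only the 1060 is visible to CUDA; this must be set
# before torch initializes the driver, hence before importing rwkv.
os.environ['CUDA_VISIBLE_DEVICES'] = '1'
os.environ['RWKV_CUDA_ON'] = '1'

from rwkv.model import RWKV

model = RWKV(model='/path/to/model',  # hypothetical path
             strategy='cuda fp16')
```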

BlinkDL commented on July 20, 2024

@burgerlawful try using the 1060 as cuda:0 so that it will compile the CUDA kernel for the lower architecture.

burgerlawful commented on July 20, 2024

If I do that, 'cuda:0 fp16' works on the 1060, but if I add the 3070, like 'cuda:0 fp16 -> cuda:1 fp16', I get the error again.

BlinkDL commented on July 20, 2024

> If I do that, 'cuda:0 fp16' works on the 1060, but if I add the 3070, like 'cuda:0 fp16 -> cuda:1 fp16', I get the error again.

How about 'cuda:1 fp16 -> cuda:0 fp16'?

burgerlawful commented on July 20, 2024

That doesn't work either. The only combination I found that works with RWKV_CUDA_ON and both cards is to put only the last layer on the 1060.
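
A sketch of what that working split might look like (the layer count and device order are assumptions, not from the comment: RWKV-4-Pile-7B has 32 blocks, and cuda:1 is taken to be the 1060):

```python
from rwkv.model import RWKV

# All blocks except the last on the bigger card; only the final layer
# goes to the 1060. The '*31' layer count is an assumption.
model = RWKV(model='/path/to/RWKV-4-Pile-7B-20230109-ctx4096',  # hypothetical path
             strategy='cuda:0 fp16 *31 -> cuda:1 fp16')
```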

Ph0rk0z commented on July 20, 2024

> That was cuda fp16i8 *15+ -> cuda fp16 *1 and RWKV-4-Pile-7B-20230109-ctx4096.pth (it actually seems like it's using less memory now as well, so I could probably add a few more layers on the GPU.)

I am at a quarter speed on compute capability 6.1, but I am running:

python server.py --cai-chat --model rwkv-4-pile-14b --rwkv-cuda-on --rwkv-strategy "cuda fp16i8 *22 -> cuda fp16"

BlinkDL commented on July 20, 2024

@burgerlawful please try the latest ChatRWKV and rwkv 0.7.3; it should be fixed now.

burgerlawful commented on July 20, 2024

> @burgerlawful please try the latest ChatRWKV and rwkv 0.7.3; it should be fixed now.

I just tried it and can confirm that it works now, thank you.
