Comments (15)
Let's go for 1: "Reduce to fp32 and convert back to 16 only on older architectures"
from chatrwkv.
fixed :) and it's 10% faster on A100 too
> That was cuda fp16i8 *15+ -> cuda fp16 *1
> and RWKV-4-Pile-7B-20230109-ctx4096.pth
> (it actually seems like it's using less memory now as well, so I could probably add a few more layers on the GPU.)
yeah i am processing in chunks so vram usage is much smaller for longer inputs
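The chunking mentioned here is the same pattern visible later in the traceback (`model.forward(tokens[:CHUNK_LEN], model_state)`). A minimal, illustrative sketch of the idea (not ChatRWKV's exact code) looks like this:

```python
# Sketch of chunked prompt processing: feed the prompt to the RNN CHUNK_LEN
# tokens at a time, threading the recurrent state through, so peak VRAM scales
# with CHUNK_LEN instead of the full prompt length.
CHUNK_LEN = 256  # illustrative chunk size

def run_rnn_chunked(forward, tokens, state=None):
    out = None
    while tokens:
        # forward() is assumed to return (logits, new_state), like
        # model.forward in the rwkv package.
        out, state = forward(tokens[:CHUNK_LEN], state)
        tokens = tokens[CHUNK_LEN:]
    return out, state
```

Only one chunk of tokens is ever in flight, which is why long inputs no longer spike memory usage.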
It's required to be atomic since the reduction is done in parallel across blocks. There are some possible options, though:
- Reduce to fp32 and convert back to fp16 only on older architectures.
- Always reduce to fp32 and convert back. This gives higher precision but should slow things down a bit.
- Do a stable fp16 reduction by summing afterward. This avoids the current numerical nondeterminism but might slow things down more.
- Stable fp32 reduction. Best precision, works everywhere, but hurts performance even more.
@BlinkDL your opinion?
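The precision trade-off behind options 1 and 2 can be illustrated without CUDA at all. This is an illustrative NumPy sketch (not the actual kernel): a long running sum accumulated directly in fp16 stalls once the partial sum gets large enough that each addend rounds away, while accumulating in fp32 and casting back once at the end stays close to the true value:

```python
import numpy as np

# 4096 addends of ~0.1, as in a long reduction over a hidden dimension.
vals = np.full(4096, 0.1, dtype=np.float16)

# Naive fp16 running sum: every partial sum is rounded back to fp16, so once
# the accumulator is large, small addends are lost entirely.
acc16 = np.float16(0.0)
for v in vals:
    acc16 = np.float16(acc16 + v)

# Option 1/2 style: accumulate in fp32, cast back to fp16 only at the end.
acc32 = np.float32(0.0)
for v in vals:
    acc32 += np.float32(v)
result = np.float16(acc32)

exact = 4096 * 0.1  # ~409.6
print(float(acc16), float(result))
```

The fp16 accumulator gets stuck far below the true sum, while the fp32-accumulated result lands within one fp16 ulp of it, which is why the "reduce in fp32, convert back" options preserve precision.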
Fixes my problem and is also much faster, even with compute capability 6:
Output generated in 46.39 seconds (1.70 tokens/s, 79 tokens)
Fastest I saw before this was 1.17.
> Fixes my problem and is also much faster, even with compute capability 6:
> Output generated in 46.39 seconds (1.70 tokens/s, 79 tokens)
> Fastest I saw before this was 1.17.

cool. what model and strategy?
That was cuda fp16i8 *15+ -> cuda fp16 *1
and RWKV-4-Pile-7B-20230109-ctx4096.pth
(it actually seems like it's using less memory now as well, so I could probably add a few more layers on the GPU.)
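The strategy string here reads as a chain of stages separated by `->`: each stage names a device, a dtype, and an optional layer count (`*15+` streams the rest of the layers after pinning 15; `*1` pins one layer). A minimal, illustrative parser for that syntax (not the rwkv package's actual implementation) makes the structure explicit:

```python
# Hypothetical sketch: split an RWKV-style strategy string into
# (device, dtype, layer_spec) stages. Layer_spec is None when a stage
# takes all remaining layers.
def parse_strategy(strategy: str):
    stages = []
    for part in strategy.split('->'):
        tokens = part.split()
        device, dtype = tokens[0], tokens[1]
        layer_spec = tokens[2] if len(tokens) > 2 else None
        stages.append((device, dtype, layer_spec))
    return stages

print(parse_strategy('cuda fp16i8 *15+ -> cuda fp16 *1'))
```

So the strategy above keeps 15 layers on the GPU in int8, streams the rest, and runs the final layer in fp16.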
I have a 3070 and a 1060. Using torch 1.13.1+cu117, rwkv 0.6.0, and the latest ChatRWKV version.
With RWKV_CUDA_ON, assigning a strategy to my 1060 (e.g. "cuda:1 fp16") doesn't work, so I can't split the strategy between my GPUs. But if I set CUDA_VISIBLE_DEVICES=1 and use "cuda fp16", it runs successfully on just the 1060.
Here's the error I get:
Run prompt...
C:\Users\user\AppData\Local\Programs\Python\Python310\lib\site-packages\rwkv\model.py:568: UserWarning: FALLBACK path has been taken inside: torch::jit::fuser::cuda::runCudaFusionGroup. This is an indication that codegen Failed for some reason.
To debug try disable codegen fallback path via setting the env variable `export PYTORCH_NVFUSER_DISABLE=fallback`
(Triggered internally at ..\torch\csrc\jit\codegen\cuda\manager.cpp:336.)
x, state[i*5+0], state[i*5+1], state[i*5+2], state[i*5+3] = ATT(
Traceback (most recent call last):
File "G:\test\ChatRWKV\v2\chat.py", line 164, in <module>
out = run_rnn(pipeline.encode(init_prompt))
File "G:\test\ChatRWKV\v2\chat.py", line 133, in run_rnn
out, model_state = model.forward(tokens[:CHUNK_LEN], model_state)
File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\site-packages\rwkv\model.py", line 568, in forward
x, state[i*5+0], state[i*5+1], state[i*5+2], state[i*5+3] = ATT(
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\site-packages\rwkv\model.py", line 472, in fallback_cuda_fuser
sx = torch.cat((sx.unsqueeze(0), xx[:-1,:]))
kx = xx * k_mix + sx * (1 - k_mix)
vx = xx * v_mix + sx * (1 - v_mix)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
rx = xx * r_mix + sx * (1 - r_mix)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
and if I set CUDA_LAUNCH_BLOCKING=1 and run it again:
Run prompt...
Traceback (most recent call last):
File "G:\test\ChatRWKV\v2\chat.py", line 164, in <module>
out = run_rnn(pipeline.encode(init_prompt))
File "G:\test\ChatRWKV\v2\chat.py", line 133, in run_rnn
out, model_state = model.forward(tokens[:CHUNK_LEN], model_state)
File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\site-packages\rwkv\model.py", line 568, in forward
x, state[i*5+0], state[i*5+1], state[i*5+2], state[i*5+3] = ATT(
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\site-packages\rwkv\model.py", line 480, in cuda_att_seq
y, aa, bb, pp = cuda_wkv(T, C, t_decay, t_first, k, v, aa, bb, pp)
out = (r * y) @ ow
~~~~~~~~~~~ <--- HERE
return x + out, xx[-1,:], aa, bb, pp
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)`
@burgerlawful try using the 1060 as cuda:0, so that the CUDA kernel is compiled for the lower compute capability
If I do that, 'cuda:0 fp16' works on the 1060, but if I add the 3070, like 'cuda:0 fp16 -> cuda:1 fp16', I get the error again.
> If I do that, 'cuda:0 fp16' works on the 1060, but if I add the 3070, like 'cuda:0 fp16 -> cuda:1 fp16', I get the error again.

how about 'cuda:1 fp16 -> cuda:0 fp16'?
Doesn't work either. The only combination I found that works with RWKV_CUDA_ON and both cards is to put only the last layer on the 1060.
> That was cuda fp16i8 *15+ -> cuda fp16 *1
> and RWKV-4-Pile-7B-20230109-ctx4096.pth
> (it actually seems like it's using less memory now as well, so I could probably add a few more layers on the GPU.)

I am at a quarter speed on compute capability 6.1, but I am running: python server.py --cai-chat --model rwkv-4-pile-14b --rwkv-cuda-on --rwkv-strategy "cuda fp16i8 *22 -> cuda fp16"
@burgerlawful pls try latest ChatRWKV & rwkv 0.7.3
should be fixed now
> @burgerlawful pls try latest ChatRWKV & rwkv 0.7.3 should be fixed now
I just tried it and can confirm that it works now, thank you.