Encountering an issue with setting RWKV_CUDA_ON to '1

RuntimeError: CUDA error: an illegal memory access was encountered about chatrwkv HOT 4 CLOSED

blinkdl commented on August 21, 2024

RuntimeError: CUDA error: an illegal memory access was encountered

from chatrwkv.

Comments (4)

nenkoru commented on August 21, 2024

The same issue was reported in #38 (comment) by @burgerlawful

UPD: same behaviour happens on 0.7.2

BlinkDL commented on August 21, 2024

@nenkoru Please try the latest ChatRWKV and rwkv 0.7.3.
It should be fixed now.
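
To confirm which `rwkv` version is actually installed before retrying, a check along these lines may help. It is only a sketch using the standard library: the helper names `parse_version` and `installed_rwkv_version` are made up for illustration, and the version parser is deliberately crude (it just collects the numbers in the version string).

```python
import re
from importlib import metadata


def parse_version(text):
    """Crudely turn a version string such as '0.7.3' into a tuple of ints."""
    return tuple(int(number) for number in re.findall(r"\d+", text))


def installed_rwkv_version():
    """Return the installed rwkv version string, or None if it is not installed."""
    try:
        return metadata.version("rwkv")
    except metadata.PackageNotFoundError:
        return None


FIXED_IN = (0, 7, 3)  # version named in this thread as containing the fix

version = installed_rwkv_version()
if version is None:
    print("rwkv is not installed")
elif parse_version(version) >= FIXED_IN:
    print(f"rwkv {version} should include the fix")
else:
    print(f"rwkv {version} is older than 0.7.3; try `pip install -U rwkv`")
```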

nenkoru commented on August 21, 2024

Yeah, it works now, but I don't see any big difference between turning this feature on or off. Timings per .forward call are below. I did notice slightly lower memory usage, around 1.5 - 2 GB for the 14B model, so I managed to fit a few more layers in fp16 without quantizing them to i8. It's also worth mentioning that the CPU on this machine is fairly slow (AMD Athlon 3000G, 2 cores / 4 threads), which could be the bottleneck in my case.

Cuda optimization OFF. 7B
cuda:0 fp16 -> cuda:1 fp16 -> cuda:2 fp16

forward time: 0.023224592208862305
forward time: 0.02307891845703125
forward time: 0.023171663284301758
forward time: 0.023991107940673828

Cuda optimization ON. 7B
cuda:0 fp16 -> cuda:1 fp16 -> cuda:2 fp16

forward time: 0.023212432861328125
forward time: 0.021632671356201172
forward time: 0.021616697311401367
forward time: 0.022366046905517578

Cuda optimization ON. 14B.
cuda:0 fp16 *10 -> cuda:1 fp16 *10 -> cuda:2 fp16 *10 -> cuda:3 fp16 *10 -> cuda:3 fp16i8

forward time: 0.02741098403930664
forward time: 0.030609130859375
forward time: 0.029874563217163086
forward time: 0.02969217300415039
forward time: 0.028966665267944336
forward time: 0.030252695083618164

Cuda optimization OFF. 14B.
cuda:0 fp16 *10 -> cuda:1 fp16 *10 -> cuda:2 fp16 *10 -> cuda:3 fp16 *7 -> cuda:3 fp16i8

forward time: 0.029529571533203125
forward time: 0.030321836471557617
forward time: 0.02967238426208496
forward time: 0.030389785766601562
forward time: 0.03206944465637207
forward time: 0.030094385147094727
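
For a rough comparison, the per-call timings above can be averaged. The sketch below only summarises the numbers posted in this comment; it assumes they are in seconds, which is not stated. The two 14B runs use different layer splits (10 vs 7 fp16 layers on cuda:3), so that pair is not a like-for-like comparison.

```python
from statistics import mean

# Forward times copied from the runs above (assumed to be seconds).
timings = {
    ("7B", "off"): [0.023224592208862305, 0.02307891845703125,
                    0.023171663284301758, 0.023991107940673828],
    ("7B", "on"): [0.023212432861328125, 0.021632671356201172,
                   0.021616697311401367, 0.022366046905517578],
    ("14B", "on"): [0.02741098403930664, 0.030609130859375,
                    0.029874563217163086, 0.02969217300415039,
                    0.028966665267944336, 0.030252695083618164],
    ("14B", "off"): [0.029529571533203125, 0.030321836471557617,
                     0.02967238426208496, 0.030389785766601562,
                     0.03206944465637207, 0.030094385147094727],
}


def percent_faster(off, on):
    """Relative reduction in mean forward time with the optimization on."""
    return 100.0 * (mean(off) - mean(on)) / mean(off)


for size in ("7B", "14B"):
    off = timings[(size, "off")]
    on = timings[(size, "on")]
    print(f"{size}: off {mean(off) * 1000:.2f} ms, on {mean(on) * 1000:.2f} ms, "
          f"{percent_faster(off, on):.1f}% faster with the optimization on")
```

With so few samples these averages are only indicative. CUDA kernels also run asynchronously, so calling `torch.cuda.synchronize()` before reading the clock generally gives more reliable per-call timings.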

BlinkDL commented on August 21, 2024

@nenkoru It's 10x faster for long inputs

Use more fp16i8 layers, and 2 GPUs will be enough.

Use fp16 in later (instead of early) layers for better quality.
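
As a sketch of what this advice could look like in practice, the snippet below builds a strategy string in the same `device dtype *layers -> ...` format used in the timings above, with fp16i8 on the early layers and fp16 on the later ones. The `build_strategy` helper and the layer count are illustrative assumptions, not recommended values, and the model path in the commented-out lines is a placeholder.

```python
import os

# Set before importing rwkv.model, as the ChatRWKV examples do,
# so that the CUDA kernel is enabled.
os.environ["RWKV_CUDA_ON"] = "1"


def build_strategy(stages):
    """Join (device, dtype, n_layers) stages into a strategy string.

    n_layers=None omits the '*N' suffix, as in the final stage of the
    strategy strings shown earlier in this thread.
    """
    parts = []
    for device, dtype, n_layers in stages:
        part = f"{device} {dtype}"
        if n_layers is not None:
            part += f" *{n_layers}"
        parts.append(part)
    return " -> ".join(parts)


# Hypothetical split: quantized early layers, fp16 for the later ones.
strategy = build_strategy([
    ("cuda:0", "fp16i8", 20),
    ("cuda:1", "fp16", None),
])
print(strategy)  # cuda:0 fp16i8 *20 -> cuda:1 fp16

# from rwkv.model import RWKV
# model = RWKV(model="/path/to/model", strategy=strategy)
```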
