blinkdl / rwkv-lm

RWKV is an RNN with transformer-level LLM performance. It can be trained directly like a GPT (parallelizable), so it combines the best of RNNs and transformers: great performance, fast inference, low VRAM use, fast training, "infinite" ctx_len, and free sentence embeddings.

License: Apache License 2.0

Python 88.33% Cuda 8.72% C++ 1.99% Shell 0.96%
attention-mechanism deep-learning gpt gpt-2 gpt-3 language-model linear-attention lstm pytorch rnn

rwkv-lm's Introduction

BlinkDL

A minimalist deep learning library in JavaScript using WebGL + asm.js. Runs in your browser.

Currently it is a proof-of-concept (inference only). Note: Convolution is buggy when memories overlap.

The WebGL backend is powered by weblas: https://github.com/waylonflinn/weblas.

Example

https://withablink.coding.me/goPolicyNet/ : a weiqi (baduk, Go) policy network in the AlphaGo style:

(board position image)

const N = 19;
const NN = N * N;
const nFeaturePlane = 8;
const nFilter = 128;

const x = new BlinkArray();
x.Init('weblas');
x.nChannel = nFeaturePlane;
x.data = new Float32Array(nFeaturePlane * NN);
for (let i = 0; i < NN; i++)
    x.data[5 * NN + i] = 1; // feature plane 5: every point is empty on the initial board

// pre-act residual network with 6 residual blocks
const bak = new Float32Array(nFilter * NN);
x.Convolution(nFilter, 3);
x.CopyTo(bak);
x.BatchNorm().ReLU().Convolution(nFilter, 3);
x.BatchNorm().ReLU().Convolution(nFilter, 3);
x.Add(bak).CopyTo(bak);
x.BatchNorm().ReLU().Convolution(nFilter, 3);
x.BatchNorm().ReLU().Convolution(nFilter, 3);
x.Add(bak).CopyTo(bak);
x.BatchNorm().ReLU().Convolution(nFilter, 3);
x.BatchNorm().ReLU().Convolution(nFilter, 3);
x.Add(bak).CopyTo(bak);
x.BatchNorm().ReLU().Convolution(nFilter, 3);
x.BatchNorm().ReLU().Convolution(nFilter, 3);
x.Add(bak).CopyTo(bak);
x.BatchNorm().ReLU().Convolution(nFilter, 3);
x.BatchNorm().ReLU().Convolution(nFilter, 3);
x.Add(bak).CopyTo(bak);
x.BatchNorm().ReLU().Convolution(nFilter, 3);
x.BatchNorm().ReLU().Convolution(nFilter, 3);
x.Add(bak);
x.BatchNorm().ReLU().Convolution(1, 1).Softmax();

(performance chart)

Usage

<script src='weblas.js' type='text/javascript'></script>
<script src='BlinkDL.js' type='text/javascript'></script>

Todo

  • Convolution (3x3_pad_1 and 1x1), BatchNorm, ReLU, Softmax
  • Pooling layer
  • FC layer
  • Strided convolution
  • Transposed convolution
  • Webworker and async
  • Faster inference with weblas pipeline, WebGPU, WebAssembly
  • Memory manager
  • Training

rwkv-lm's People

Contributors

blinkdl · fluxlinkage · picocreator · saharnooby · www


rwkv-lm's Issues

Please help troubleshoot an error when loading a locally trained model in chat.py

Hi,

I was training the model locally from scratch.

 python train.py --load_model  --wandb  --proj_dir out --data_file ../data/enwik8 --data_type utf-8 --vocab_size 0 --ctx_len 512 --epoch_steps 5000 --epoch_count 500 --epoch_begin 0 --epoch_save 5 --micro_bsz 12 --n_layer 6 --n_embd 512 --pre_ffn 0 --head_qk 0 --lr_init 8e-4 --lr_final 1e-5 --warmup_steps 0 --beta1 0.9 --beta2 0.99 --adam_eps 1e-8 --accelerator gpu --devices 1 --precision tf32 --strategy ddp_find_unused_parameters_false --grad_cp 0

and changed these settings in chat.py:

args.FLOAT_MODE = "fp32" # fp32 (good for CPU) // fp16 (recommended for GPU) // bf16 (less accurate)
args.vocab_size = 50277
args.head_qk = 0
args.pre_ffn = 0
args.grad_cp = 0
args.my_pos_emb = 0

# args.MODEL_NAME = '/fsx/BlinkDL/HF-MODEL/rwkv-4-pile-14b/RWKV-4-Pile-14B-20230108-5170'
args.MODEL_NAME = './out/rwkv-40'

args.n_layer = 6 # 40
args.n_embd = 512 # 5120
args.ctx_len = 512 # 1024

Getting an error:

(py38) ➜  RWKV-v4neo git:(main) ✗ python chat.py
Loading...

RWKV_HEAD_QK_DIM 0 RWKV_JIT_ON 1

loading... ./out/rwkv-40
emb.weight                               float32    cpu
blocks.0.ln1.weight                      float32    cuda:0
blocks.0.ln1.bias                        float32    cuda:0
blocks.0.ln2.weight                      float32    cuda:0
blocks.0.ln2.bias                        float32    cuda:0
blocks.0.ln0.weight                      float32    cuda:0
blocks.0.ln0.bias                        float32    cuda:0
blocks.0.att.time_decay                  float32    cuda:0
blocks.0.att.time_first                  float32    cuda:0
blocks.0.att.time_mix_k                  float32    cuda:0
blocks.0.att.time_mix_v                  float32    cuda:0
blocks.0.att.time_mix_r                  float32    cuda:0
blocks.0.att.key.weight                  float32    cuda:0
blocks.0.att.value.weight                float32    cuda:0
blocks.0.att.receptance.weight           float32    cuda:0
blocks.0.att.output.weight               float32    cuda:0
blocks.0.ffn.time_mix_k                  float32    cuda:0
blocks.0.ffn.time_mix_r                  float32    cuda:0
blocks.0.ffn.key.weight                  float32    cuda:0
blocks.0.ffn.receptance.weight           float32    cuda:0
blocks.0.ffn.value.weight                float32    cuda:0
..........................................................................................
ln_out.weight                            float32    cuda:0
ln_out.bias                              float32    cuda:0
head.weight                              float32    cuda:0

Run prompt...
Traceback (most recent call last):
  File "chat.py", line 193, in <module>
    out = run_rnn(tokenizer.tokenizer.encode(init_prompt))
  File "chat.py", line 163, in run_rnn
    current_state = model.forward(model_tokens, current_state, preprocess_only = True)
  File "/mnt/d/workspace/RWKV-LM/RWKV-v4neo/src/model_run.py", line 200, in forward
    x = w.emb.weight[ctx[-1]]
IndexError: index 48656 is out of bounds for dimension 0 with size 6064

Could you please take a look in case I've made a mistake somewhere? Thanks.
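
A hedged reading of the traceback: training with --vocab_size 0 on enwik8 builds a character-level vocabulary (6064 entries here), while chat.py sets args.vocab_size = 50277 and encodes the prompt with the GPT-NeoX 20B tokenizer, whose token ids (such as 48656) fall outside the 6064-row emb.weight. A minimal Python sketch of the inconsistency, with the numbers taken from the logs above:

trained_vocab_size = 6064   # char-level vocab learned from enwik8 (--vocab_size 0)
chat_vocab_size = 50277     # GPT-NeoX 20B tokenizer assumed by chat.py
offending_token = 48656     # id produced while encoding the init prompt

assert offending_token < chat_vocab_size       # valid for the 20B tokenizer...
assert offending_token >= trained_vocab_size   # ...but out of range for this model

The fix (hedged, since exact file names vary by version) is to encode prompts with the same character-level token list the training run built, and to set args.vocab_size to the trained value rather than 50277.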

RWKV on TPU?

Is it possible to train RWKV on TPU? If not, what would be needed?

Contact link

Thank you for your contributions and for opensourcing your work.

I'd like to get in contact with you, but I can't find an email link on your profiles. If you're interested but don't want to share your email, you can email me at [email protected].

Add a requirements.txt?

Ran into this issue while installing:

python ./train.py --help                                                                                                                     
Traceback (most recent call last):                                                                                                                                                  
  File "/fast/RWKV-LM/RWKV-v4/./train.py", line 10, in <module>                                                                                                                     
    from src.binidx import MMapIndexedDataset                                                                                                                                       
  File "/fast/RWKV-LM/RWKV-v4/src/binidx.py", line 30, in <module>                                                                                                                  
    6: np.float,                                                                                                                                                                    
  File "/fast/RWKV-LM/venv/lib/python3.10/site-packages/numpy/__init__.py", line 305, in __getattr__                                                                                
    raise AttributeError(__former_attrs__[attr])                                                                                                                                    
AttributeError: module 'numpy' has no attribute 'float'.                                                                                                                            
`np.float` was a deprecated alias for the builtin `float`. To avoid this error in existing code, use `float` by itself. Doing this will not modify any behavior and is safe. If you 
specifically wanted the numpy scalar type, use `np.float64` here.                                                                                                                   
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:                                                                
    https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations. Did you mean: 'cfloat'?  

The root issue is that np.float was removed in NumPy 1.24: https://numpy.org/doc/stable/release/1.24.0-notes.html#expired-deprecations

Anyway, a requirements.txt with versions would be helpful. Unless I missed something?
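
For anyone hitting the same error, a hedged sketch of the local fix: the dtype table in src/binidx.py (reconstructed here from the traceback; it appears to follow the Megatron-LM binidx layout, so treat the other entries as assumptions) just needs entry 6 changed, since np.float was only ever an alias for the builtin float:

import numpy as np

# Hypothetical reconstruction of the dtype table in src/binidx.py;
# entry 6 is the one the traceback points at.
dtypes = {
    1: np.uint8,
    2: np.int8,
    3: np.int16,
    4: np.int32,
    5: np.int64,
    6: np.float64,  # was: np.float, removed in NumPy 1.24
    7: np.double,
    8: np.uint16,
}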

Bare metal cluster

I have 5 old (< 3 years old) gaming rigs that are fairly serviceable. Can this be run with the load distributed across multiple machines? I can set up an internal network with a hub and secondary network cards, with one machine acting as the interface.

Paper covering additional tokens idea

Hi there. You mention in the readme that you're interested in potentially adding some special tokens/markers to represent stuff like capitalisation. Just wanted to let you know we tried that in the ULMFiT paper, and it worked pretty well. You can read the details here: https://arxiv.org/abs/1801.06146 . We went beyond capitalisation and added some other tokens too.

Anyhoo, this is just an FYI in case it's helpful to you.

On using the model for classification tasks

Hello! I'm very interested in this work. I'm currently using transformer-based models for classification tasks; with transformers or RNNs, the usual approach is to take the last element of each channel from the final block as the output and map it to a few classes through a fully connected layer.
Do you think RWKV works on a similar principle? Is it still safe to take the last element as the output? I'd be grateful for any advice!
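
For what it's worth, a minimal sketch of the pattern the question describes, assuming a backbone that returns hidden states of shape (B, T, n_embd); the RWKVClassifier wrapper here is hypothetical, not part of this repo:

import torch
import torch.nn as nn

class RWKVClassifier(nn.Module):
    """Hypothetical wrapper: classify from the hidden state at the last position.
    Because RWKV is recurrent, the last position has seen the whole sequence,
    so the usual 'take the last element' approach applies in principle."""
    def __init__(self, backbone: nn.Module, n_embd: int, n_classes: int):
        super().__init__()
        self.backbone = backbone                 # assumed to return (B, T, n_embd)
        self.head = nn.Linear(n_embd, n_classes)

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        h = self.backbone(idx)                   # (B, T, n_embd) hidden states
        return self.head(h[:, -1, :])            # logits from the final position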

RWKV-v4neo train error

Following the README example ("train a simple L6-D512 RWKV from scratch on enwik8"):

RWKV-LM/RWKV-v4neo$ python train.py --proj_dir "out" --data_file "../../data/enwik8" --data_type "utf-8" --vocab_size 0 --ctx_len 512 --epoch_steps 5000 --epoch_count 500 --epoch_begin 0 --epoch_save 5 --micro_bsz 12 --n_layer 6 --n_embd 512 --pre_ffn 0 --head_qk 0 --lr_init 8e-4 --lr_final 1e-5 --warmup_steps 0 --beta1 0.9 --beta2 0.99 --adam_eps 1e-8 --accelerator gpu --devices 1 --precision bf16 --strategy ddp_find_unused_parameters_false --grad_cp 0

then I got:

  | Name   | Type       | Params
--------------------------------------
0 | emb    | Embedding  | 3.1 M
1 | blocks | ModuleList | 20.5 M
2 | ln_out | LayerNorm  | 1.0 K
3 | head   | Linear     | 3.1 M
--------------------------------------
26.7 M    Trainable params
0         Non-trainable params
26.7 M    Total params
106.770   Total estimated model params size (MB)
Epoch 0: 0%| | 0/5000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/exat500g/RWKV-LM/RWKV-v4neo/train.py", line 340, in <module>
    trainer.fit(model, data_loader)
  File "/home/exat500g/miniconda3/envs/pytorch113/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 696, in fit
    self._call_and_handle_interrupt(
  File "/home/exat500g/miniconda3/envs/pytorch113/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 648, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/home/exat500g/miniconda3/envs/pytorch113/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
    return function(*args, **kwargs)
  File "/home/exat500g/miniconda3/envs/pytorch113/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 735, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/home/exat500g/miniconda3/envs/pytorch113/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1166, in _run
    results = self._run_stage()
  File "/home/exat500g/miniconda3/envs/pytorch113/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1252, in _run_stage
    return self._run_train()
  File "/home/exat500g/miniconda3/envs/pytorch113/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1283, in _run_train
    self.fit_loop.run()
  File "/home/exat500g/miniconda3/envs/pytorch113/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/home/exat500g/miniconda3/envs/pytorch113/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 271, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/home/exat500g/miniconda3/envs/pytorch113/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/home/exat500g/miniconda3/envs/pytorch113/lib/python3.10/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 174, in advance
    batch = next(data_fetcher)
  File "/home/exat500g/miniconda3/envs/pytorch113/lib/python3.10/site-packages/pytorch_lightning/utilities/fetching.py", line 184, in __next__
    return self.fetching_function()
  File "/home/exat500g/miniconda3/envs/pytorch113/lib/python3.10/site-packages/pytorch_lightning/utilities/fetching.py", line 263, in fetching_function
    self._fetch_next_batch(self.dataloader_iter)
  File "/home/exat500g/miniconda3/envs/pytorch113/lib/python3.10/site-packages/pytorch_lightning/utilities/fetching.py", line 277, in _fetch_next_batch
    batch = next(iterator)
  File "/home/exat500g/miniconda3/envs/pytorch113/lib/python3.10/site-packages/pytorch_lightning/trainer/supporters.py", line 557, in __next__
    return self.request_next_batch(self.loader_iters)
  File "/home/exat500g/miniconda3/envs/pytorch113/lib/python3.10/site-packages/pytorch_lightning/trainer/supporters.py", line 569, in request_next_batch
    return apply_to_collection(loader_iters, Iterator, next)
  File "/home/exat500g/miniconda3/envs/pytorch113/lib/python3.10/site-packages/pytorch_lightning/utilities/apply_func.py", line 99, in apply_to_collection
    return function(data, *args, **kwargs)
  File "/home/exat500g/miniconda3/envs/pytorch113/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/home/exat500g/miniconda3/envs/pytorch113/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1333, in _next_data
    return self._process_data(data)
  File "/home/exat500g/miniconda3/envs/pytorch113/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1359, in _process_data
    data.reraise()
  File "/home/exat500g/miniconda3/envs/pytorch113/lib/python3.10/site-packages/torch/_utils.py", line 543, in reraise
    raise exception
UnboundLocalError: Caught UnboundLocalError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/exat500g/miniconda3/envs/pytorch113/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/exat500g/miniconda3/envs/pytorch113/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 58, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/exat500g/miniconda3/envs/pytorch113/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 58, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/exat500g/RWKV-LM/RWKV-v4neo/src/dataset.py", line 208, in __getitem__
    dix = [self.stoi[s] for s in data[i : i + req_len]]
UnboundLocalError: local variable 'i' referenced before assignment

Question about RWKV formula

In the first formula in the README, RWKV is rewritten into recurrent form by letting $W_n=(n-1)w$. Is there a particular reason for using $n-1$ instead of $n$? The latter seems more natural, and in "From GPT to RWKV (the formulas)" the recurrent formula of RWKV also implies the latter. So I suspect you have tried it, but it turned out suboptimal for some reason.
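
For reference, a hedged transcription of the recurrent form under discussion (following the later RWKV-4 write-up; the README's notation may differ slightly). With $W_n=(n-1)w$ the most recent past token ($i=t-1$) receives zero decay, whereas $W_n=nw$ would decay even that token by $e^{-w}$:

$$wkv_t=\frac{\sum_{i=1}^{t-1} e^{-(t-1-i)w+k_i}\,v_i \;+\; e^{u+k_t}\,v_t}{\sum_{i=1}^{t-1} e^{-(t-1-i)w+k_i} \;+\; e^{u+k_t}}$$

(here $u$ is the separate "time_first" bonus applied only to the current token).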

HumanEval benchmarks?

Hi, I was wondering whether this model can achieve GPT-4 level performance on the HumanEval benchmark, a proxy for effectiveness at code generation. I'm fine if I have to train or transfer-learn, but I only have a single GPU. Massive transformers require way too much compute. I'd also like your opinion on how well the model might work in tandem with InCoder and LLaMa/Alpaca in a model ensemble. Sort of like Stable Diffusion, which has multiple different models that each specialize in a different task. Thanks!

Radeon Open Compute support

There are many users with AMD graphics cards that want to train this model in a GPU accelerated manner. Radeon Open Compute is AMD's equivalent to CUDA (the relevant component in Radeon Open Compute is called HIP).

  • Attempt to HIPify the codebase
  • Optimise the HIP port for full performance

RWKV-4 169m/430m in browser with ORT Web / TF.js / tfjs-tflite?

Hi, really exciting project! I'm wondering if you've published the model conversion script that you used to create the js_models files from the .pth model file? It would be awesome to see how the larger and newer models like RWKV-4 169m/430m perform in the browser! I think the inference speed of RWKV opens up many new possibilities for language models on the web.

question about the environment setup

Could you please share the environment setup (especially for training)? E.g. the versions of python / torch / cuda / pytorch_lightning, or other options you consider important.

Thanks!

Question about the training compute

Great work! I am working on a survey and would be interested to know the total training compute (FLOPs) of the RWKV-14B model. What was the training time (GPU-hours), and on how many A100s? Also, any idea about the GPU utilization rate?

Why isn't everyone using RWKV if it's so much better than transformers?

Hi!

The machine learning (ML) community is progressing at a remarkable pace and embraces new techniques very quickly. Based on my understanding of this model, it appears to offer a distinct set of advantages relative to transformers, while lacking any real drawbacks. Despite these benefits, it remains unclear why this approach is not more widely adopted among individuals and organizations in the field.

Why is this the case? I really can't wrap my head around it.

Access/train to use the embeddings

Hi @BlinkDL! Really interested in your work here. I am looking to test some of the models on embedding-based tasks. What is the best way to access the embeddings? I would also be looking to use them for training (i.e. contrastive loss in a siamese training setup). Any information on this would be greatly appreciated.

ChatRWKV triggers segmentation fault when using streaming or split loading on 4-14B

Hi,
This is probably something related to my setup, but I can't work out what's causing it. When attempting to load the v4 14B model with ChatRWKV v2 using the split or stream methods, I get a segmentation fault midway through loading. With cuda fp16i8 *1+ or fp16 *1+, it segfaults sooner than with cuda fp16 *10+ or cuda fp16i8 *10+. fp32 with split and stream behaves the same way.

I've tried loading the model with fp16i8 all to the GPU to see if it would do the same thing, but with cuda fp16i8 it will load up until it runs out of memory.

Any advice would be appreciated.

Ubuntu 20.04 on WSL2
python 3.10
torch 1.13.1 cu117
pip rwkv 0.5
and always latest git pull

The 7B model works great because I can load it all on a 12 GB GPU.

Really impressive project, and that's a total understatement. I'm blown away by the 7b model.

AttributeError: type object 'Trainer' has no attribute 'add_argparse_args'

Using https://github.com/resloved/RWKV-notebooks/blob/master/RWKV_v4neo_Fine_Tuning.ipynb
which uses https://github.com/BlinkDL/RWKV-LM/tree/main/RWKV-v4neo gives me an error when it comes to the training part

########## work in progress ##########
Traceback (most recent call last):
  File "/content/RWKV-LM/RWKV-v4neo/train.py", line 109, in <module>
    parser = Trainer.add_argparse_args(parser)
AttributeError: type object 'Trainer' has no attribute 'add_argparse_args'

Edit: I had to downgrade to pytorch-lightning==1.9.0.
On another note, I was under the impression that n_epochs would cap the number of epochs, but training just keeps going past that number?

RLHF for finetuning

Hi, Thank you for your efforts!

Have you considered fine-tuning using methods like RLHF?

Implementation details about wkv

Amazing work!
But I'm really confused about the implementation details of the wkv CUDA kernel (in RWKV-LM/RWKV-v4neo/cuda/wkv_cuda.cu).
How does the implementation match the equations shown in the README? Could you please add more detailed comments about it?
For example, what is the meaning of the local variables p, pp, ...?
Thanks
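
For readers with the same question, a hedged Python transcription of what the kernel appears to compute: a numerically stable streaming weighted average, where p and pp mirror the CUDA locals and pp carries the running maximum exponent so every exp() argument stays non-positive. This is a sketch, not the kernel itself:

import math

def wkv(w, u, k, v):
    """Hedged Python sketch of the RWKV-4 wkv recurrence for one channel.
    w: (negative) per-step decay, u: 'time_first' bonus for the current token,
    k/v: key and value sequences. aa/bb are the running numerator and
    denominator; pp is the running max exponent used for stability."""
    y = []
    aa, bb, pp = 0.0, 0.0, -1e38
    for kt, vt in zip(k, v):
        ww = u + kt                       # current token enters with bonus u
        p = max(pp, ww)                   # rescale both terms by exp(-p)
        e1, e2 = math.exp(pp - p), math.exp(ww - p)
        y.append((e1 * aa + e2 * vt) / (e1 * bb + e2))
        ww = pp + w                       # decay the accumulated past by w
        p = max(ww, kt)
        e1, e2 = math.exp(ww - p), math.exp(kt - p)
        aa, bb, pp = e1 * aa + e2 * vt, e1 * bb + e2, p
    return y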

Possible benchmark leakage in Pile dataset

As I understand, all models in RWKV family were trained on Pile dataset. I'm concerned with possible lack of preprocessing of the dataset.

The Pile paper states: "To avoid leakage of data from downstream evaluations, recent work ... has removed any data in the training set that may overlap with the evaluation metrics. We decided not to perform any such removal."

This means that the unfiltered Pile may contain correct solutions for benchmark tasks, which can make comparisons with other models unfair, because the RWKV models may have been trained directly on the validation data.

That said, the Pile's authors do provide the Pile's own validation and test sets; if RWKV was evaluated on those, maybe there is no problem at all.

Please forgive any possible lack of understanding on my side; I'm only a beginner in ML :)

4-bit quantization to reduce VRam requirement

Hi,
Is it possible to use something like GPTQ to get a 4-bit quantized version of the latest 7B Instruct model? It has to be one of the fastest and "smartest" models I have tested in the 7B range, but the VRAM required is more than I have.
Llama 7B in 4-bit works quite well on smaller-VRAM cards. Something like that for RWKV would be great.
Also, are there any resources on how I could get an embedding from any of the models?
Thanks

VRAM performance

Hi @BlinkDL! First off this is amazing and seems very promising for scaling down large Transformers to be more production friendly.

I'm wondering if you have any benchmarks regarding VRAM usage? Specifically, I've got 3 questions:

1 - How much VRAM does this model (or rather, the CUDA version) need for training? Are we talking 1060-size (6 GB), 3090-size (20 GB), or A6000+ size (40+ GB)?
2 - Same question as 1, but for inference?
3 - Can this run on CPU reasonably?

CUDA compilation error with Ctx Length>2000

Hello,
I am trying out RWKV with the audio modality, and when I set T_MAX >> 1000 it throws this error:

Emitting ninja build file /root/.cache/torch_extensions/py39_cu116/timex/build.ninja...
Building extension module timex...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=timex -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1013\" -isystem /root/anaconda3/envs/surya-env/lib/python3.9/site-packages/torch/include -isystem /root/anaconda3/envs/surya-env/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/surya-env/lib/python3.9/site-packages/torch/include/TH -isystem /root/anaconda3/envs/surya-env/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /root/anaconda3/envs/surya-env/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' --use_fast_math --extra-device-vectorization -DTmax=10000 -DBF=8 -DBB=2 -std=c++14 -c cuda/timex_cuda.cu -o timex_cuda.cuda.o 
FAILED: timex_cuda.cuda.o 
/usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=timex -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1013\" -isystem /root/anaconda3/envs/surya-env/lib/python3.9/site-packages/torch/include -isystem /root/anaconda3/envs/surya-env/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/surya-env/lib/python3.9/site-packages/torch/include/TH -isystem /root/anaconda3/envs/surya-env/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /root/anaconda3/envs/surya-env/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' --use_fast_math --extra-device-vectorization -DTmax=10000 -DBF=8 -DBB=2 -std=c++14 -c cuda/timex_cuda.cu -o timex_cuda.cuda.o 
ptxas error   : Entry function '_Z15kernel_backwardIfEvPKT_S2_S2_PS0_S3_iii' uses too much shared data (0x30d40 bytes, 0xc000 max)
ptxas error   : Entry function '_Z14kernel_forwardIfEvPKT_S2_PS0_S0_iii' uses too much shared data (0x57e40 bytes, 0xc000 max)
ninja: build stopped: subcommand failed.

GPU: A100, VRAM: 42GB, CUDA 11.6

I am okay if training takes a bit longer, but I need this to work.
I don't know any CUDA. Can you suggest some workarounds?

Thanks for the incredible work btw!
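
As a point of reference for this issue: the ptxas numbers scale linearly with Tmax, which suggests why the default T_MAX (around 1024) compiles while 10000 does not. A back-of-envelope check, assuming the kernel's shared-memory buffers are proportional to Tmax:

limit = 0xC000    # 49152 B = 48 KiB static shared memory per block (from the error)
fwd   = 0x57E40   # 360000 B requested by kernel_forward at Tmax=10000
bwd   = 0x30D40   # 200000 B requested by kernel_backward at Tmax=10000

# Shared usage grows linearly with Tmax, so the largest Tmax that fits:
print(10000 * limit // fwd, 10000 * limit // bwd)   # ~1365 and ~2457

So a practical workaround (hedged) is to keep T_MAX near its default and train at ctx_len <= T_MAX, feeding long audio sequences in chunks, rather than raising T_MAX itself; going higher would require restructuring the kernel's shared-memory usage.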

v4 model.py vs model_run.py

Hi,
Thanks for this awesome repo!
I'm trying to understand the code and found that the v4 folder contains both model.py and model_run.py, defining GPT and RWKV_GPT respectively, each with different initialization methods. Could you elaborate on when each should be used? Thanks in advance!

Repeated download errors - wrong readme?

Re:
Download RWKV-4 0.1/0.4/1.5/3/7/14B weights: https://huggingface.co/BlinkDL
But how?
I keep running into errors:

OSError: BlinkDL/rwkv-4-pile-430m does not appear to have a file named config.json. Checkout 'https://huggingface.co/BlinkDL/rwkv-4-pile-430m/main' for available files.
@6Y3GRwDjtGVo4nAe2 ➜ ~/Downloads/ChatRWKV (main) $ 

requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/BlinkDL/rwkv-4-pile-430m/resolve/main/config.json

whatever model I try via a simple:

from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("BlinkDL/rwkv-4-pile-430m" , use_auth=True)
model = AutoModelForCausalLM.from_pretrained("BlinkDL/rwkv-4-pile-430m", use_auth=True)

Even the AI gives up:

Here are the steps for creating your own config.json file based on the model specifications and loading it locally:

First, you need to know the model architecture and hyperparameters of the model you want to download. You can try to find this information on the model page or contact the author for details.
Second, you need to create a PretrainedConfig object that matches the model specifications. You can use one of the subclasses of PretrainedConfig from HuggingFace's library, depending on the model type.
Third, you need to save your PretrainedConfig object as a config.json file using the to_json_file method. You can specify a local path where you want to save this file.
Fourth, you need to load the model using from_pretrained with local_files_only=True and provide the path to your config.json file. You also need to provide the paths to other model files such as pytorch_model.bin.
etc.

This

import os
from copy import deepcopy
from rwkv.model import RWKV

os.environ["RWKV_JIT_ON"] = '1'
os.environ["RWKV_CUDA_ON"] = '0' # if '1' then use CUDA kernel for seq mode (much faster)
from rwkv.model import RWKV                         # everything in /v2/rwkv folder
model = RWKV(model='/fsx/BlinkDL/HF-MODEL/rwkv-4-pile-1b5/RWKV-4-Pile-1B5-20220903-8040', strategy='cuda fp16')

out, state = model.forward([187, 510, 1563, 310, 247], None)   # use 20B_tokenizer.json
print(out.detach().cpu().numpy())                   # get logits
out, state = model.forward([187, 510], None)
out, state = model.forward([1563], state)           # RNN has state (use deepcopy if you want to clone it)
out, state = model.forward([310, 247], state)
print(out.detach().cpu().numpy())                   # same result as above

does not work either:
@6Y3GRwDjtGVo4nAe2 ➜ ~/Downloads/ChatRWKV (main) $ pip install rwkv
ERROR: Could not find a version that satisfies the requirement rwkv (from versions: none)
ERROR: No matching distribution found for rwkv
@6Y3GRwDjtGVo4nAe2 ➜ ~/Downloads/ChatRWKV (main) $ pip install rwkv-rs
Collecting rwkv-rs
Downloading rwkv_rs-0.2.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.3/1.3 MB 6.6 MB/s eta 0:00:00
Installing collected packages: rwkv-rs
Successfully installed rwkv-rs-0.2.3
@6Y3GRwDjtGVo4nAe2 ➜ ~/Downloads/ChatRWKV (main) $ python transf.py
Traceback (most recent call last):
  File "/home/codespace/Downloads/ChatRWKV/transf.py", line 3, in <module>
    from rwkv.model import RWKV
ModuleNotFoundError: No module named 'rwkv'

etc.

Fine-tuning a pretrained model with a txt file

I tried to fine-tune a 3B model on a text file, encoded in UTF-8, using the provided train.py.

It seems that the vocabulary is built from the unique characters in the file. So I got

Building token list...                                                                                        
Data has 330926472 tokens, 7129 vocab size.

and error message

RuntimeError: Error(s) in loading state_dict for RWKV:                                                        
        size mismatch for emb.weight: copying a param with shape torch.Size([50277, 2560]) from checkpoint,   
the shape in current model is torch.Size([7129, 2560]).                                                       
        size mismatch for head.weight: copying a param with shape torch.Size([50277, 2560]) from checkpoint,  
the shape in current model is torch.Size([7129, 2560]).

I think I need to tokenize the text file with the same tokenizer used to prepare the Pile data before feeding it to the training script. Is there an example?
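
A minimal sketch of that preprocessing (hedged: this is not the repo's official pipeline; it assumes the 20B_tokenizer.json shipped with the repo and the tokenizers package, so that vocab_size stays 50277 and the checkpoint's emb.weight/head.weight shapes match):

import numpy as np
from tokenizers import Tokenizer

tok = Tokenizer.from_file("20B_tokenizer.json")   # same tokenizer as the Pile models
with open("my_corpus.txt", encoding="utf-8") as f:
    ids = tok.encode(f.read()).ids

# Save as a flat uint16 array; some versions of train.py accept an .npy file
# via --data_type numpy (an assumption -- verify against your checkout),
# otherwise convert the ids to the binidx format handled by src/binidx.py.
np.save("my_corpus.npy", np.array(ids, dtype=np.uint16))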

Training speed check

Just curious about this number:

Training speed: (new training code) RWKV-4 14B BF16 ctxlen4096 = 114K tokens/s on 8x8 A100 80G

So to clarify: you're ingesting 114K tokens/s across 64 A100s during training, i.e. ~1.7K tokens/s/GPU?

For comparison, Meta trained Llama 13B on 1T tokens in 135,168 GPU-hours (also A100-80GB), which is ~0.0074B tokens/GPU/hour. You're achieving ~0.0064B tokens/GPU/hour if my quick math is right, so training performance is quite comparable with Meta's, even though they make a point of how heavily they optimised their training. (They say they hand-coded their backward functions and worked to parallelise activation computation with GPU communication, although AFAIK they didn't release their training code.) That's impressive!

And that's at 4k context size? Traditional transformer dot-product attention is O(n^2) in sequence length, so going from 2k to 4k should roughly 4x the attention compute; I guess Llama would have only achieved ~0.00185B tokens/GPU/hour with the same window size used here.
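
The quick math checks out; a hedged recomputation of the figures quoted above (numbers as stated in the thread, not official benchmarks):

rwkv_tokens_per_s = 114_000               # quoted: 8x8 A100 80G, ctxlen 4096
gpus = 64
rwkv = rwkv_tokens_per_s * 3600 / gpus    # tokens/GPU/hour
llama = 1e12 / 135_168                    # 1T tokens over 135,168 GPU-hours

print(f"RWKV : {rwkv / 1e9:.4f}B tokens/GPU/hour")    # ~0.0064
print(f"Llama: {llama / 1e9:.4f}B tokens/GPU/hour")   # ~0.0074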

Sequence to Sequence?

Hey @BlinkDL! Awesome project!

I was wondering if you have performed any seq-to-seq experiments with it? Any reason for going with a GPT-style model in the first place, as opposed to something like T5 (a standard encoder-decoder Transformer)?
Any direction on what changes would be required to build a standard encoder-decoder architecture with RWKV?

Also, is there any report on in-context-learning/FSL capability of the latest trained model?

Any results on few-shot settings?

Thanks for your wonderful work!
Do you have any results in few-shot settings? Do RWKV LLMs show emergent abilities similar to GPT-3, e.g. chain-of-thought?
