blinkdl / rwkv-lm Goto Github PK

RWKV is an RNN with transformer-level LLM performance. It can be directly trained like a GPT (parallelizable). So it's combining the best of RNN and transformer - great performance, fast inference, saves VRAM, fast training, "infinite" ctx_len, and free sentence embedding.

License: Apache License 2.0

Python 87.61% Cuda 9.46% C++ 2.05% Shell 0.87%

attention-mechanism deep-learning gpt gpt-2 gpt-3 language-model linear-attention lstm pytorch rnn

rwkv-lm's Introduction

RWKV: Parallelizable RNN with Transformer-level LM Performance (pronounced as "RwaKuv" (rʌkuv in IPA), from 4 major params: R W K V)

RWKV homepage: https://www.rwkv.com

RWKV-5/6 Eagle/Finch paper: https://arxiv.org/abs/2404.05892

Awesome RWKV in Vision: https://github.com/Yaziwel/Awesome-RWKV-in-Vision

RWKV-6 3B Demo: https://huggingface.co/spaces/BlinkDL/RWKV-Gradio-1

RWKV-6 7B Demo: https://huggingface.co/spaces/BlinkDL/RWKV-Gradio-2

RWKV-6 GPT-mode demo code (with comments and explanations): https://github.com/BlinkDL/RWKV-LM/blob/main/RWKV-v5/rwkv_v6_demo.py

RWKV-6 RNN-mode demo: https://github.com/BlinkDL/ChatRWKV/blob/main/RWKV_v6_demo.py

HOW TO TEST TRAINING RWKV-5 on MiniPile (1.5G tokens)

For reference, use python 3.10, torch 2.3.1+cu121 (or latest), cuda 12.5+, latest deepspeed, but keep pytorch-lightning==1.9.5

pip install torch --upgrade --extra-index-url https://download.pytorch.org/whl/cu121
pip install pytorch-lightning==1.9.5 deepspeed wandb ninja --upgrade

cd RWKV-v5/
./demo-training-prepare.sh
./demo-training-run.sh
(you may want to log in to wandb first)

Your loss curve should look almost exactly the same as this, with the same ups and downs (if you use the same bsz & config):

You can run your model using https://pypi.org/project/rwkv/ (use "rwkv_vocab_v20230424" instead of "20B_tokenizer.json")

Use https://github.com/BlinkDL/RWKV-LM/blob/main/RWKV-v5/make_data.py to prepare binidx data from jsonl, and compute "--my_exit_tokens" and "--magic_prime".

The "epoch" in train.py is "mini-epoch" (not real epoch. only for convenience), and 1 mini-epoch = 40320 * ctx_len tokens.

For example, if your binidx has 1498226207 tokens and ctxlen=4096, set "--my_exit_tokens 1498226207" (this will override epoch_count), and it will be 1498226207/(40320 * 4096) = 9.07 miniepochs. The trainer will auto-exit after "--my_exit_tokens" tokens. Set "--magic_prime" to the largest 3n+2 prime smaller than datalen/ctxlen-1 (= 1498226207/4096-1 = 365776), which is "--magic_prime 365759" in this case.

simple: prepare SFT jsonl => repeat your SFT data 3 or 4 times in make_data.py. more repetition leads to overfitting.

advanced: repeat your SFT data 3 or 4 times in your jsonl (note make_data.py will shuffle all jsonl items) => add some base data (such as slimpajama) to your jsonl => and only repeat 1 times in make_data.py.

Train RWKV-6: use /RWKV-v5/ and add --my_testing "x060" to demo-training-prepare.sh and demo-training-run.sh

Simple inference for RWKV-5: https://github.com/BlinkDL/ChatRWKV/blob/main/RWKV_v5_demo.py

Simple inference for RWKV-6: https://github.com/BlinkDL/ChatRWKV/blob/main/RWKV_v6_demo.py

Note: In [state = kv + w * state] everything must be in fp32 because w can be very close to 1. So we can keep state and w in fp32, and convert kv to fp32.

lm_eval: https://github.com/BlinkDL/ChatRWKV/blob/main/run_lm_eval.py

chat demo for developers: https://github.com/BlinkDL/ChatRWKV/blob/main/API_DEMO_CHAT.py

Tips for small model / small data: When I train RWKV music models, I use deep & narrow (such as L29-D512) dimensions, and apply wd and dropout (such as wd=2 dropout=0.02). Note RWKV-LM dropout is very effective - use 1/4 of your usual value.

HOW TO FINETUNE RWKV-5 MODELS

Use .jsonl format for your data (see https://huggingface.co/BlinkDL/rwkv-5-world for formats).

Use https://github.com/BlinkDL/RWKV-LM/blob/main/RWKV-v5/make_data.py to tokenizer it using World tokenizer into binidx, suitable for finetuning World models.

Rename the base checkpoint in your model folder to rwkv-init.pth, and change the training commands to use --n_layer 32 --n_embd 4096 --vocab_size 65536 --lr_init 1e-5 --lr_final 1e-5 for 7B.

0.1B = --n_layer 12 --n_embd 768 // 0.4B = --n_layer 24 --n_embd 1024 // 1.5B = --n_layer 24 --n_embd 2048 // 3B = --n_layer 32 --n_embd 2560 // 7B = --n_layer 32 --n_embd 4096

State-tuning (tuning the initial state. zero inference overhead)

Currently unoptimized implementation, takes same vram as full SFT

--train_type "states" --load_partial 1 --lr_init 1 --lr_final 0.01 --warmup_steps 10 (yes, use very high LR)

use rwkv 0.8.26+ to auto-load the trained "time_state"

Initializing RWKV 5/6 Models

When you train RWKV from scratch, try my initialization for best performance. Check generate_init_weight() of src/model.py:

emb.weight => nn.init.uniform_(a=-1e-4, b=1e-4)
(Note ln0 of block0 is the layernorm for emb.weight)
head.weight => nn.init.orthogonal_(gain=0.5*sqrt(n_vocab / n_embd))

att.receptance.weight => nn.init.orthogonal_(gain=1)
att.key.weight => nn.init.orthogonal_(gain=0.1)
att.value.weight => nn.init.orthogonal_(gain=1)
att.gate.weight => nn.init.orthogonal_(gain=0.1)
att.output.weight => zero

att.ln_x.weight (groupnorm) => ((1 + layer_id) / total_layers) ** 0.7

ffn.key.weight => nn.init.orthogonal_(gain=1)
ffn.value.weight => zero
ffn.receptance.weight => zero

!!! If you are using positional embedding, maybe it's better to remove block.0.ln0 and use default initialization for emb.weight instead of my uniform_(a=-1e-4, b=1e-4) !!!

Introducing RWKV

RWKV is an RNN with Transformer-level LLM performance, which can also be directly trained like a GPT transformer (parallelizable). And it's 100% attention-free. You only need the hidden state at position t to compute the state at position t+1. You can use the "GPT" mode to quickly compute the hidden state for the "RNN" mode.

So it's combining the best of RNN and transformer - great performance, fast inference, saves VRAM, fast training, "infinite" ctx_len, and free sentence embedding (using the final hidden state).

RWKV Runner GUI https://github.com/josStorer/RWKV-Runner with one-click install and API

All latest RWKV weights: https://huggingface.co/BlinkDL

HF-compatible RWKV weights: https://huggingface.co/RWKV

RWKV pip package: https://pypi.org/project/rwkv/

os.environ["RWKV_JIT_ON"] = '1'
os.environ["RWKV_CUDA_ON"] = '0' # if '1' then use CUDA kernel for seq mode (much faster)
from rwkv.model import RWKV                         # pip install rwkv
model = RWKV(model='/fsx/BlinkDL/HF-MODEL/rwkv-4-pile-1b5/RWKV-4-Pile-1B5-20220903-8040', strategy='cuda fp16')

out, state = model.forward([187, 510, 1563, 310, 247], None)   # use 20B_tokenizer.json
print(out.detach().cpu().numpy())                   # get logits
out, state = model.forward([187, 510], None)
out, state = model.forward([1563], state)           # RNN has state (use deepcopy if you want to clone it)
out, state = model.forward([310, 247], state)
print(out.detach().cpu().numpy())                   # same result as above

nanoRWKV: https://github.com/BlinkDL/nanoRWKV (does not require custom CUDA kernel to train, works for any GPU/CPU)

RWKV Discord: https://discord.gg/bDSBUMeFpc (7k+ members)

Twitter: https://twitter.com/BlinkDL_AI

Homepage: https://www.rwkv.com

Cool Community RWKV Projects:

All (300+) RWKV projects: https://github.com/search?o=desc&q=rwkv&s=updated&type=Repositories

https://github.com/OpenGVLab/Vision-RWKV Vision RWKV

https://github.com/feizc/Diffusion-RWKV Diffusion RWKV

https://github.com/cgisky1980/ai00_rwkv_server Fastest WebGPU inference (nVidia/AMD/Intel)

https://github.com/cryscan/web-rwkv backend for ai00_rwkv_server

https://github.com/saharNooby/rwkv.cpp Fast CPU/cuBLAS/CLBlast inference: int4/int8/fp16/fp32

https://github.com/JL-er/RWKV-PEFT lora/pissa/Qlora/Qpissa/state tuning

https://github.com/RWKV/RWKV-infctx-trainer Infctx trainer

https://github.com/daquexian/faster-rwkv

mlc-ai/mlc-llm#1275

https://github.com/TheRamU/Fay/blob/main/README_EN.md Digital Assistant with RWKV

https://github.com/harrisonvanderbyl/rwkv-cpp-cuda Fast GPU inference with cuda/amd/vulkan

RWKV v6 in 250 lines (with tokenizer too): https://github.com/BlinkDL/ChatRWKV/blob/main/RWKV_v6_demo.py

RWKV v5 in 250 lines (with tokenizer too): https://github.com/BlinkDL/ChatRWKV/blob/main/RWKV_v5_demo.py

RWKV v4 in 150 lines (model, inference, text generation): https://github.com/BlinkDL/ChatRWKV/blob/main/RWKV_in_150_lines.py

RWKV v4 preprint https://arxiv.org/abs/2305.13048

RWKV v4 introduction, and in 100 lines of numpy: https://johanwind.github.io/2023/03/23/rwkv_overview.html https://johanwind.github.io/2023/03/23/rwkv_details.html

RWKV v6 illustrated:

A cool paper (Spiking Neural Network) using RWKV: https://github.com/ridgerchu/SpikeGPT

You are welcome to join the RWKV discord https://discord.gg/bDSBUMeFpc to build upon it. We have plenty of potential compute (A100 40Gs) now (thanks to Stability and EleutherAI), so if you have interesting ideas I can run them.

RWKV [loss vs token position] for 10000 ctx4k+ documents in Pile. RWKV 1B5-4k is mostly flat after ctx1500, but 3B-4k and 7B-4k and 14B-4k have some slopes, and they are getting better. This debunks the old view that RNNs cannot model long ctxlens. We can predict that RWKV 100B will be great, and RWKV 1T is probably all you need :)

ChatRWKV with RWKV 14B ctx8192:

I believe RNN is a better candidate for fundamental models, because: (1) It's more friendly for ASICs (no kv cache). (2) It's more friendly for RL. (3) When we write, our brain is more similar to RNN. (4) The universe is like an RNN too (because of locality). Transformers are non-local models.

RWKV-3 1.5B on A40 (tf32) = always 0.015 sec/token, tested using simple pytorch code (no CUDA), GPU utilization 45%, VRAM 7823M

GPT2-XL 1.3B on A40 (tf32) = 0.032 sec/token (for ctxlen 1000), tested using HF, GPU utilization 45% too (interesting), VRAM 9655M

Training speed: (new training code) RWKV-4 14B BF16 ctxlen4096 = 114K tokens/s on 8x8 A100 80G (ZERO2+CP). (old training code) RWKV-4 1.5B BF16 ctxlen1024 = 106K tokens/s on 8xA100 40G.

I am doing image experiments too (For example: https://huggingface.co/BlinkDL/clip-guided-binary-autoencoder) and RWKV will be able to do txt2img diffusion :) My idea: 256x256 rgb image -> 32x32x13bit latents -> apply RWKV to compute transition probability for each of the 32x32 grid -> pretend the grids are independent and "diffuse" using these probabilities.

Smooth training - no loss spikes! (lr & bsz change around 15G tokens)

All of the trained models will be open-source. Inference is very fast (only matrix-vector multiplications, no matrix-matrix multiplications) even on CPUs, so you can even run a LLM on your phone.

How it works: RWKV gathers information to a number of channels, which are also decaying with different speeds as you move to the next token. It's very simple once you understand it.

RWKV is parallelizable because the time-decay of each channel is data-independent (and trainable). For example, in usual RNN you can adjust the time-decay of a channel from say 0.8 to 0.5 (these are called "gates"), while in RWKV you simply move the information from a W-0.8-channel to a W-0.5-channel to achieve the same effect. Moreover, you can fine-tune RWKV into a non-parallelizable RNN (then you can use outputs of later layers of the previous token) if you want extra performance.

Here are some of my TODOs. Let's work together :)

HuggingFace integration (check huggingface/transformers#17230 ), and optimized CPU & iOS & Android & WASM & WebGL inference. RWKV is a RNN and very friendly for edge devices. Let's make it possible to run a LLM on your phone.
Test it on bidirectional & MLM tasks, and image & audio & video tokens. I think RWKV can support Encoder-Decoder via this: for each decoder token, use a learned mixture of [decoder previous hidden state] & [encoder final hidden state]. Hence all decoder tokens will have access to the encoder output.
Now training RWKV-4a with one single tiny extra attention (just a few extra lines comparing with RWKV-4) to further improve some difficult zeroshot tasks (such as LAMBADA) for smaller models. See https://github.com/BlinkDL/RWKV-LM/commit/a268cd2e40351ee31c30c5f8a5d1266d35b41829

User feedback:

I've so far toyed around the character-based model on our relatively small pre-training dataset (around 10GB of text), and the results are extremely good - similar ppl to models taking much, much longer to train.

dear god rwkv is fast. i switched to another tab after starting training it from scratch & when i returned it was emitting plausible english & maori words, i left to go microwave some coffee & when i came back it was producing fully grammatically correct sentences.

Tweet from Sepp Hochreiter (thank you!): https://twitter.com/HochreiterSepp/status/1524270961314484227

You can find me (BlinkDL) in the EleutherAI Discord too: https://www.eleuther.ai/get-involved/

Quick start

IMPORTANT: Use deepspeed==0.7.0 pytorch-lightning==1.9.5 torch==1.13.1+cu117 and cuda 11.7.1 or 11.7 (note torch2 + deepspeed has weird bugs and hurts model performance)

Use https://github.com/BlinkDL/RWKV-LM/tree/main/RWKV-v4neo (latest code, compatible with v4).

Here is a great prompt for testing Q&A of LLMs. Works for any model: (found by minimizing ChatGPT ppls for RWKV 1.5B)

prompt = f'\nQ & A\n\nQuestion:\n{qq}\n\nDetailed Expert Answer:\n' # let the model generate after this

Inference

Run RWKV-4 Pile models: Download models from https://huggingface.co/BlinkDL. Set TOKEN_MODE = 'pile' in run.py and run it. It's fast even on CPU (the default mode).

Colab for RWKV-4 Pile 1.5B: https://colab.research.google.com/drive/1F7tZoPZaWJf1fsCmZ5tjw6sYHiFOYVWM

Run RWKV-4 Pile models in your browser (and onnx version): see this issue #7

RWKV-4 Web Demo: https://josephrocca.github.io/rwkv-v4-web/demo/ (note: only greedy sampling for now)

For the old RWKV-2: see the release here for a 27M params model on enwik8 with 0.72 BPC(dev). Run run.py in https://github.com/BlinkDL/RWKV-LM/tree/main/RWKV-v2-RNN. You can even run it in your browser: https://github.com/BlinkDL/AI-Writer/tree/main/docs/eng https://blinkdl.github.io/AI-Writer/eng/ (this is using tf.js WASM single-thread mode).

Training / Fine-tuning

pip install deepspeed==0.7.0 // pip install pytorch-lightning==1.9.5 // torch 1.13.1+cu117

NOTE: add weight decay (0.1 or 0.01) and dropout (0.1 or 0.01) when training on small amt of data. try x=x+dropout(att(x)) x=x+dropout(ffn(x)) x=dropout(x+att(x)) x=dropout(x+ffn(x)) etc.

Training RWKV-4 from scratch: run train.py, which by default is using the enwik8 dataset (unzip https://data.deepai.org/enwik8.zip).

You will be training the "GPT" version because it's paralleziable and faster to train. RWKV-4 can extrapolate, so training with ctxLen 1024 can work for ctxLen of 2500+. You can fine-tune the model with longer ctxLen and it can quickly adapt to longer ctxLens.

Fine-tuning RWKV-4 Pile models: use 'prepare-data.py' in https://github.com/BlinkDL/RWKV-v2-RNN-Pile/tree/main/RWKV-v3 to tokenize .txt into train.npy data. Then use https://github.com/BlinkDL/RWKV-LM/blob/main/RWKV-v4neo/train.py to train it.

Read the inference code in src/model.py and try using the final hidden state（.xx .aa .bb) as a faithful sentence embedding for other tasks. Probably you should begin with .xx and .aa/.bb (.aa divided by .bb).

Colab for fine-tuning RWKV-4 Pile models: https://colab.research.google.com/github/resloved/RWKV-notebooks/blob/master/RWKV_v4_RNN_Pile_Fine_Tuning.ipynb

Large corpus: Use https://github.com/Abel2076/json2binidx_tool to convert .jsonl into .bin and .idx

The jsonl format sample (one line for each document):

{"text": "This is the first document."}
{"text": "Hello\nWorld"}
{"text": "1+1=2\n1+2=3\n2+2=4"}

generated by code like this:

ss = json.dumps({"text": text}, ensure_ascii=False)
out.write(ss + "\n")

Infinite ctxlen training (WIP): https://github.com/Blealtan/RWKV-LM-LoRA/tree/dev-infctx

How to use RWKV hidden state as text embedding

Consider RWKV 14B. The state has 200 vectors, that is, 5 vectors for each block: fp16 (xx), fp32 (aa), fp32 (bb), fp32 (pp), fp16 (xx).

Do not avg pool because different vectors (xx aa bb pp xx) in the state have very different meanings and ranges. You can probably remove pp.

I suggest firstly collect the mean+stdev statistics of each channel of each vector, and normalize all of them (note: the normalization should be data-indepedent and collected from various texts). Then train a linear classifer.

Towards RWKV-5 (just to record some new ideas)

Lastest Design

RWKV-5 is multi-head and here shows one head. There is also a LayerNorm for each head (hence actually GroupNorm).

$ \begin{array}{|l|l|l|} \hline & \text { RWKV-4 with real-valued } k \,\&\, v \,\&\, u \,\&\, w & \text { RWKV-5 with matrix-valued } \mathrm{k}^{\dagger} \mathrm{v} \,\&\, \mathrm{u} \,\&\, \mathrm{w} \\ \hline \mathrm{y}_0 & \mathrm{r}_0 \frac{\mathrm{uk}_0 \mathrm{v}_0}{\mathrm{uk}_0} & \mathrm{r}_0\left(\mathrm{uk}_0^{\dagger} \mathrm{v}_0\right) \\ \hline \mathrm{y}_1 & \mathrm{r}_1 \frac{\mathrm{uk}_1 \mathrm{v}_1+\mathrm{k}_0 \mathrm{v}_0}{\mathrm{uk}_1+\mathrm{k}_0} & \mathrm{r}_1\left(\mathrm{uk}_1^{\dagger} \mathrm{v}_1+\mathrm{k}_0^{\dagger} \mathrm{v}_0\right) \\ \hline \mathrm{y}_2 & \mathrm{r}_2 \frac{\mathrm{uk}_2 \mathrm{v}_2+\mathrm{k}_1 \mathrm{v}_1+\mathrm{wk}_0 \mathrm{v}_0}{\mathrm{uk}_2+\mathrm{k}_1+\mathrm{wk}_0} & \mathrm{r}_2\left(\mathrm{uk}_2^{\dagger} \mathrm{v}_2+\mathrm{k}_1^{\dagger} \mathrm{v}_1+\mathrm{wk}_0^{\dagger} \mathrm{v}_0\right) \\ \hline \mathrm{y}_3 & \mathrm{r}_3 \frac{\mathrm{uk}_3 \mathrm{v}_3+\mathrm{k}_2 \mathrm{v}_2+\mathrm{wk}_1 \mathrm{v}_1+\mathrm{w}^2 \mathrm{k}_0 \mathrm{v}_0}{\mathrm{uk}_3+\mathrm{k}_2+\mathrm{wk}_1+\mathrm{w}^2 \mathrm{k}_0} & \mathrm{r}_3\left(\mathrm{uk}_3^{\dagger} \mathrm{v}_3+\mathrm{k}_2^{\dagger} \mathrm{v}_2+\mathrm{wk}_1^{\dagger} \mathrm{v}_1+\mathrm{w}^2 \mathrm{k}_0^{\dagger} \mathrm{v}_0\right) \\ \hline \end{array}$

$\left[\begin{array}{ll} \mathrm{y}_{20} & \cdots \mathrm{y}_{2 \mathrm{c}} \end{array}\right]=\left[\begin{array}{lll} \mathrm{r}_{20} & \cdots & \mathrm{r}_{2 \mathrm{c}} \end{array}\right]$ $\left(\left[\begin{array}{ccc} \mathrm{u}_{00} & \cdots & \mathrm{u}_{0 \mathrm{c}} \\ \vdots & \ddots & \vdots \\ \mathrm{u}_{\mathrm{c} 0} & \cdots & \mathrm{u}_{\mathrm{cc}} \end{array}\right]\left[\begin{array}{ccc} \mathrm{k}_{20} \mathrm{v}_{20} & \cdots & \mathrm{k}_{20} \mathrm{v}_{2 \mathrm{c}} \\ \vdots & \ddots & \vdots \\ \mathrm{k}_{2 \mathrm{c}} \mathrm{v}_{20} & \cdots & \mathrm{k}_{2 \mathrm{c}} \mathrm{v}_{2 \mathrm{c}} \end{array}\right]+\left[\begin{array}{ccc} \mathrm{k}_{10} \mathrm{v}_{10} & \cdots & \mathrm{k}_{10} \mathrm{v}_{1 \mathrm{c}} \\ \vdots & \ddots & \vdots \\ \mathrm{k}_{1 \mathrm{c}} \mathrm{v}_{10} & \cdots & \mathrm{k}_{1 \mathrm{c}} \mathrm{v}_{1 \mathrm{c}} \end{array}\right]+\left[\begin{array}{ccc} \mathrm{w}_{00} & \cdots & \mathrm{w}_{0 \mathrm{c}} \\ \vdots & \ddots & \vdots \\ \mathrm{w}_{\mathrm{c} 0} & \cdots & \mathrm{w}_{\mathrm{cc}} \end{array}\right]\left[\begin{array}{ccc} \mathrm{k}_{00} \mathrm{v}_{00} & \cdots & \mathrm{k}_{00} \mathrm{v}_{0 c} \\ \vdots & \ddots & \vdots \\ \mathrm{k}_{0 \mathrm{c}} \mathrm{v}_{00} & \cdots & \mathrm{k}_{0 \mathrm{c}} \mathrm{v}_{0 c} \end{array}\right] \right)$

RWKV-6

Dynamic Mix & Dynamic Decay. Example (do this for both TimeMix & ChannelMix):

TIME_MIX_EXTRA_DIM = 32
self.time_mix_k_w1 = nn.Parameter(torch.empty(args.n_embd, TIME_MIX_EXTRA_DIM).uniform_(-0.01, 0.01))
self.time_mix_k_w2 = nn.Parameter(torch.zeros(TIME_MIX_EXTRA_DIM, args.n_embd))
self.time_mix_v_w1 = nn.Parameter(torch.empty(args.n_embd, TIME_MIX_EXTRA_DIM).uniform_(-0.01, 0.01))
self.time_mix_v_w2 = nn.Parameter(torch.zeros(TIME_MIX_EXTRA_DIM, args.n_embd))
self.time_mix_r_w1 = nn.Parameter(torch.empty(args.n_embd, TIME_MIX_EXTRA_DIM).uniform_(-0.01, 0.01))
self.time_mix_r_w2 = nn.Parameter(torch.zeros(TIME_MIX_EXTRA_DIM, args.n_embd))
self.time_mix_g_w1 = nn.Parameter(torch.empty(args.n_embd, TIME_MIX_EXTRA_DIM).uniform_(-0.01, 0.01))
self.time_mix_g_w2 = nn.Parameter(torch.zeros(TIME_MIX_EXTRA_DIM, args.n_embd))
...
time_mix_k = self.time_mix_k.view(1,1,-1) + (x @ self.time_mix_k_w1) @ self.time_mix_k_w2
time_mix_v = self.time_mix_v.view(1,1,-1) + (x @ self.time_mix_v_w1) @ self.time_mix_v_w2
time_mix_r = self.time_mix_r.view(1,1,-1) + (x @ self.time_mix_r_w1) @ self.time_mix_r_w2
time_mix_g = self.time_mix_g.view(1,1,-1) + (x @ self.time_mix_g_w1) @ self.time_mix_g_w2

xx = self.time_shift(x)
xk = x * time_mix_k + xx * (1 - time_mix_k)
xv = x * time_mix_v + xx * (1 - time_mix_v)
xr = x * time_mix_r + xx * (1 - time_mix_r)
xg = x * time_mix_g + xx * (1 - time_mix_g)

RWKV-7

Use parallelized mode to quickly generate the state, then use a finetuned full RNN (the layers of token n can use outputs of all layer of token n-1) for sequential generation.

Some old ideas

Now time decay is like 0.999^T (0.999 is learnable). Change it to something like (0.999^T + 0.1) where 0.1 is learnable too. The 0.1 part will be kept forever. Or, A^T + B^T + C = fast-decay + slow-decay + constant. Can even use different formulas (for example, K^2 instead of e^K for a decay component, or, without normalization).
Use complex-valued decay (so, rotation instead of decay) in some channels.
Inject some trainable and extrapolatable positional encoding?
Aside from 2d rotation, we can try other Lie groups such as 3d rotation ( SO(3) ). Non-abelian RWKV lol.
RWKV might be great on analog devices (search for Analog Matrix-vector multiplication & Photonic Matrix-vector multiplication). The RNN mode is very hardware-friendly (processing-in-memory). Can be a SNN too (https://github.com/ridgerchu/SpikeGPT). I wonder if it can be optimized for quantum computation.
Trainable initial hidden state (xx aa bb pp xx).
Layerwise (or even row/column-wise, elementwise) LR, and test Lion optimizer.

Vision Tasks

I find it's good to add a 2d pos encoding:

self.pos_emb_x = nn.Parameter(torch.zeros((1,args.my_pos_emb,args.n_embd)))
self.pos_emb_y = nn.Parameter(torch.zeros((args.my_pos_emb,1,args.n_embd)))
...
x = x + pos_emb_x + pos_emb_y

In a BPE langauge model, it's the best to use [tokenShift of 1 token] (you can mix more tokens in a char-level English model). However you can try [tokenShift of N (or N-1) (or N+1) tokens] if the image size is N x N, because that will be like mixing [the token above the current positon (or the token above the to-be-predicted positon)] with [current token]. You can use try different tokenShift styles for "ATT" & "FFN", or mixing different tokenShift styles - such as mixing [token A] with [token A-1] and [token A-(N-1)] etc.

Misc

Maybe we can improve memorization by simply repeating the context (I guess 2 times is enough). Example: Reference -> Reference(again) -> Question -> Answer

Idea: Bytes-aware Embedding

The idea is to make sure each token in vocab understand its length and raw UTF-8 bytes.

Let a = max(len(token)) for all token in vocab. Define AA : float[a][d_emb]

Let b = max(len_in_utf8_bytes(token)) for all token in vocab. Define BB : float[b][256][d_emb]

For each token X in vocab, let [x0, x1, ..., xn] be its raw UTF-8 bytes. We will add some extra values to its embedding EMB(X):

EMB(X) += AA[len(X)] + BB[0][x0] + BB[1][x1] + ... + BB[n][xn] (note: AA BB are learnable weights)

We can do this for the final Linear(d_emb, n_vocab) projection too.
We can use some small networks to generate AA and BB, for some extra regularization (for example, BB[m][xi] and BB[n][xi] should be related).

Old Idea

I have an idea to improve tokenization. We can hardcode some channels to have meanings. Example:

Channel 0 = "space"

Channel 1 = "capitalize first letter"

Channel 2 = "capitalize all letters"

Therefore:

Embedding of "abc": [0, 0, 0, x0, x1, x2 , ..]

Embedding of " abc": [1, 0, 0, x0, x1, x2, ..]

Embedding of " Abc": [1, 1, 0, x0, x1, x2, ..]

Embedding of "ABC": [0, 0, 1, x0, x1, x2, ...]

......

so they will share most of the embedding. And we can rapidly compute the output probability of all variations of "abc".

Note: the above method is assuming that p(" xyz") / p("xyz") is the same for any "xyz", which can be wrong.

Better: define emb_space emb_capitalize_first emb_capitalize_all to be a function of emb.

Maybe the Best: let 'abc' ' abc' etc. to share the last 90% of their embeddings.

At this moment, all our tokenizers spend too many items to represent all variations of 'abc' ' abc' ' Abc' etc. Moreover the model cannot discover that these are actually similar if some of these variations are rare in the dataset. The method here can improve this. I plan to test this in a new version of RWKV.

Idea: Better Initial States

Example (single-round Q & A):

Generate the final state of all wiki documents.
For any user Q, find the best wiki document, and use its final state as the initial state.
Train a model to directly generate the optimal initial state for any user Q.

However this can be a bit more tricky for multi-round Q & A :)

How it works

RWKV is inspired by Apple's AFT (https://arxiv.org/abs/2105.14103).

Moreover it's using a number of my tricks, such as:

SmallInitEmb: https://github.com/BlinkDL/SmallInitEmb (applicable to all transformers) which helps the embedding quality, and stabilizes Post-LN (which is what I am using).
Token-shift: https://github.com/BlinkDL/RWKV-LM#token-shift-time-shift-mixing (applicable to all transformers), especially helpful for char-level models.
Head-QK: https://github.com/BlinkDL/RWKV-LM#the-head-qk-trick-learning-to-copy-and-avoid-tokens (applicable to all transformers). Note: it's helpful, but I disabled it in the Pile model to keep it 100% RNN.
Extra R-gate in the FFN (applicable to all transformers). I am also using reluSquared from Primer.
Better initilization: I init most of the matrices to ZERO (see RWKV_Init in https://github.com/BlinkDL/RWKV-LM/blob/main/RWKV-v2-RNN/src/model.py).
You can transfer some parameters from a small model to a large model (note: I sort & smooth them too), for faster and better convergence (see https://www.reddit.com/r/MachineLearning/comments/umq908/r_rwkvv2rnn_a_parallelizable_rnn_with/).
My CUDA kernel: https://github.com/BlinkDL/RWKV-CUDA to speedup training.

The pseudocode (execution from top to bottom):

The a b c d factors work together to build a time-decay curve: [X, 1, W, W^2, W^3, ...].

Write out the formulas for "token at pos 2" and "token at pos 3" and you will get the idea:

a and b: EMAs of kv and k.
c and d: these are a and b combined with "self-attention".

kv / k is the memory mechanism. The token with high k can be remembered for a long duration, if W is close to 1 in the channel.

The R-gate is important for performance. k = info strength of this token (to be passed to future tokens). r = whether to apply the info to this token.

RWKV-3 improvements

Use different trainable TimeMix factors for R / K / V in SA and FF layers. Example:

xx = self.time_shift(x)
xk = x * self.time_mix_k + xx * (1 - self.time_mix_k)
xv = x * self.time_mix_v + xx * (1 - self.time_mix_v)
xr = x * self.time_mix_r + xx * (1 - self.time_mix_r)

Use preLN instead of postLN (more stable & faster convergence):

if self.layer_id == 0:
	x = self.ln0(x)
x = x + self.att(self.ln1(x))
x = x + self.ffn(self.ln2(x))

Explaining the code for RWKV-3 GPT mode

The GPT mode - overview

The building blocks of RWKV-3 GPT mode are similar to that of a usual preLN GPT.

The only difference is an extra LN after embedding. Note you can absorb this LN into the embedding after finishing the training.

x = self.emb(idx)  # input: idx = token indices
x = self.ln_emb(x) # extra LN after embedding
x = x + self.att_0(self.ln_att_0(x)) # preLN
x = x + self.ffn_0(self.ln_ffn_0(x))
...
x = x + self.att_n(self.ln_att_n(x))
x = x + self.ffn_n(self.ln_ffn_n(x))
x = self.ln_head(x) # final LN before projection
x = self.head(x)    # output: x = logits

It is important to initialize emb to tiny values, such as nn.init.uniform_(a=-1e-4, b=1e-4), to utilize my trick https://github.com/BlinkDL/SmallInitEmb.

For the 1.5B RWKV-3, I use Adam (no wd, no dropout) optimizer on 8 * A100 40G.

batchSz = 32 * 896, ctxLen = 896. I am using tf32 so the batchSz is a bit small.

For the first 15B tokens, LR is fixed at 3e-4, and beta=(0.9, 0.99).

Then I set beta=(0.9, 0.999), and do an exponential decay of LR, reaching 1e-5 at 332B tokens.

The GPT mode - ATT block

The RWKV-3 does not have any attention in the usual sense, but we will call this block ATT anyway.

B, T, C = x.size() # x = (Batch,Time,Channel)

# Mix x with the previous timestep to produce xk, xv, xr
xx = self.time_shift(x) # self.time_shift = nn.ZeroPad2d((0,0,1,-1))
xk = x * self.time_mix_k + xx * (1 - self.time_mix_k)
xv = x * self.time_mix_v + xx * (1 - self.time_mix_v)
xr = x * self.time_mix_r + xx * (1 - self.time_mix_r)

# Use xk, xv, xr to produce k, v, r
k = self.key(xk).transpose(-1, -2)
v = self.value(xv).transpose(-1, -2)
r = self.receptance(xr)
k = torch.clamp(k, max=60) # clamp k to avoid overflow
k = torch.exp(k)
kv = k * v

# Compute the W-curve = [e^(-n * e^time_decay), e^(-(n-1) * e^time_decay), ..., 1, e^(time_first)]
self.time_w = torch.cat([torch.exp(self.time_decay) * self.time_curve.to(x.device), self.time_first], dim=-1)
w = torch.exp(self.time_w)

# Use W to mix kv and k respectively. Add K_EPS to wk to avoid divide-by-zero
if RUN_DEVICE == 'cuda':
    wkv = TimeX.apply(w, kv, B,C,T, 0)
    wk = TimeX.apply(w, k, B,C,T, K_EPS)
else:
    w = w[:,-T:].unsqueeze(1)
    wkv = F.conv1d(nn.ZeroPad2d((T-1, 0, 0, 0))(kv), w, groups=C)
    wk = F.conv1d(nn.ZeroPad2d((T-1, 0, 0, 0))(k), w, groups=C) + K_EPS

# The RWKV formula
rwkv = torch.sigmoid(r) * (wkv / wk).transpose(-1, -2)
rwkv = self.output(rwkv) # final output projection

The self.key, self.receptance, self.output matrices are all initialized to zero.

The time_mix, time_decay, time_first vectors are transferred from a smaller trained model (note: I sort & smooth them too).

The GPT mode - FFN block

The FFN block has three tricks comparing with the usual GPT:

My time_mix trick.
The sqReLU from the Primer paper.
An extra receptance-gate (similar to the receptance-gate in ATT block).

# Mix x with the previous timestep to produce xk, xr
xx = self.time_shift(x)
xk = x * self.time_mix_k + xx * (1 - self.time_mix_k)
xr = x * self.time_mix_r + xx * (1 - self.time_mix_r)

# The usual FFN operation
k = self.key(xk)
k = torch.square(torch.relu(k)) # from the Primer paper
kv = self.value(k)

# Apply an extra receptance-gate to kv
rkv = torch.sigmoid(self.receptance(xr)) * kv
return rkv

The self.value, self.receptance matrices are all initialized to zero.

RWKV-4 improvements

From GPT to RWKV (the formulas)

Let F[t] be the system state at t.

Let x[t] be the new external input at t.

In GPT, predicting F[t+1] requires considering F[0], F[1], .. F[t]. So it takes O(T^2) to generate a length T sequence.

The simplified formula for GPT:

$F[\mathrm{t}+1]=\frac{\sum_{\mathrm{i}=0}^{\mathrm{t}} \exp (\mathbf{Q}x[\mathrm{t}] * \mathbf{K}F[\mathrm{i}]) \cdot(\mathbf{V}F[\mathrm{i}])}{\sum_{\mathrm{i}=0}^{\mathrm{t}} \exp (\mathbf{Q}x[\mathrm{t}] * \mathbf{K}F[\mathrm{i}])}$

It's very capable in theory, however that does not mean we can fully utilize its capability with usual optimizers. I suspect the loss landscape is too difficult for our current methods.

Compare with the simplified formula for RWKV (the parallel mode, looks similar to Apple's AFT):

$F[\mathrm{t}+1]=\sigma(\mathbf{R}x[\mathrm{t}]) \cdot \frac{\sum_{\mathrm{i}=0}^{\mathrm{t}} \exp (\mathbf{W} \cdot(\mathrm{t}-\mathrm{i})) \cdot \exp (\mathbf{K}F[\mathrm{i}]) \cdot(\mathbf{V}F[\mathrm{i}])}{\sum_{\mathrm{i}=0}^{\mathrm{t}} \exp (\mathbf{W} \cdot(\mathrm{t}-\mathrm{i})) \cdot \exp (\mathbf{K }F[\mathrm{i}])}$

The R, K, V are trainable matrices, and W is a trainable vector (time-decay factor for each channel).

In GPT, the contribution of F[i] to F[t+1] is weighted by $\exp (\mathbf{Q}x[\mathrm{t}] * \mathbf{K}F[\mathrm{i}])$ .

In RWKV-2, the contribution of F[i] to F[t+1] is weighted by $\sigma(\mathbf{R}x[\mathrm{t}]) \cdot \exp (\mathbf{W} \cdot(\mathrm{t}-\mathrm{i})) \cdot \exp (\mathbf{K}F[\mathrm{i}])$ .

The $\sigma$ is a non-linearity and we can use sigmoid.
Note $\sigma(\mathbf{R}x[\mathrm{t}])$ is not in the denominator, and I call R the "receptance".
The $\exp (\mathbf{W} \cdot(\mathrm{t}-\mathrm{i}))$ is the time-decay factor. I proposed the same idea (scaling the attention by distance) in Aug 2020 and called it the "time-weighting" (check the commit history of https://github.com/BlinkDL/minGPT-tuned).

Here comes the punchline: we can rewrite it into a RNN (recursive formula). Note:

$F[1]=\sigma(\mathbf{R }x[0]) \cdot \frac{ \exp (\mathbf{K }F[0]) \cdot(\mathbf{V }F[0])}{\exp (\mathbf{K }F[0])}$

$F[2]=\sigma(\mathbf{R }x[1]) \cdot \frac{ \exp (\mathbf{K }F[1]) \cdot(\mathbf{V }F[1])+\exp (\mathbf{W} ) \cdot \exp (\mathbf{K }F[0]) \cdot(\mathbf{V }F[0])}{ \exp (\mathbf{K }F[1])+\exp (\mathbf{W} ) \cdot \exp (\mathbf{K }F[0])}$

Therefore it's straightforward to verify:

$F[t+1]=\sigma(\mathbf{R }x[t]) \cdot \frac{\exp (\mathbf{K}F[\mathrm{t}]) \cdot(\mathbf{V}F[\mathrm{t}])+\exp (\mathbf{W}) \cdot A[\mathrm{t}]}{ \exp (\mathbf{K}F[\mathrm{t}])+\exp (\mathbf{W}) \cdot B[\mathrm{t}]}$

where A[t] and B[t] are the numerator and denominator of the previous step, respectively.

I believe RWKV is performant because W is like repeatedly applying a diagonal matrix. Note (P^{-1} D P)^n = P^{-1} D^n P, so it is similar to repeatedly applying a general diagonalizable matrix.

Moreover it's possible to turn it into a continuous ODE (a bit similar to State Space Models). I will write about it later.

Star History

Multimodal ideas

I have an idea for [text --> 32x32 RGB image] using a LM (transformer, RWKV, etc.). Will test it soon.

Firstly, LM loss (instead of L2 loss), so the image will not be blurry.

Secondly, color quantization. For example, only allowing 8 levels for R/G/B. Then the image vocab size is 8x8x8 = 512 (for each pixel), instead of 2^24. Therefore, a 32x32 RGB image = a len1024 sequence of vocab512 (image tokens), which is a typical input for usual LMs. (Later we can use diffusion models to upsample and generate RGB888 images. We might be able to use a LM for this too.)

Thirdly, 2D positional embeddings that are easy for the model to understand. For example, add one-hot X & Y coords to the first 64(=32+32) channels. Say if the pixel is at x=8, y=20, then we will add 1 to channel 8 and channel 52 (=32+20). Moreover probably we can add the float X & Y coords (normalized to 0~1 range) to another 2 channels. And other periodic pos. encoding might help too (will test).

Finally, RandRound when doing the color quantization in the DataLoader. For example, if the float level is 4.578, then there is a 57.8% chance to use 5, and (1-57.8%) chance to use 4. And we can allow both 4 and 5 in the prediction, but the loss will be higher if the prediction is 4.

Multi-task training might help too. I will try this dataset format: [TxtFirst] [Desc of Img (txt tokens)] [Img] [img tokens] and sometimes [ImgFirst] [img tokens] [Txt] [Desc of Img (txt tokens)] ... the order of the imgs should be randomized in the DataLoader, and [TxtFirst] [ImgFirst] [Img] [Txt] are special tokens and do random sampling of the full dataset. So sometimes the model will see the img tokens first and then the corresponding txt tokens, which is a [img -> txt] task. And the model will see some partial imgs and partial txts. I think a char-level LM might help the model to write correct text on images.

How to sample a large dataset (for training)

I am using a trick to sample the Pile deterministically yet randomly enough.

Let's say the pile has x chunks (a chunk = ctx_len tokens).

pick a prime number p just less than x, and make sure p = 2 (mod 3).

Use (step * step * step) mod p to sample it. Add some bias to step for extra randomness.

The top-p-x sampling method (for inference)

We propose a new sampling method called top-p-x:

it's like top-p, and the only difference is you also keep all tokens whose prob > x.

Try x = 0.01 first.

Better Learning Rate Schedule via Variantional Method of Loss Curve

I propose a simple new method to find better LR schedules. The method is cost-efficient and practical for large LMs. The takeaway is we can model the loss curve dynamics (phenomenology) w.r.t. the LR, and a nice closed-form LR curve can be directly computed from it using variantional method. Moreover we can predict the final loss with reasonable accuracy.

UPDATE: In "Conclusion 1.", use the best-fitting regime (ignore the initial steps where our approximations break down) to fit the parameters.

Try this: fixed lr for 1 hr, then exponential decay to 0.2 * lr in 12 hrs, and choose the t=[1hr, 13hr] segment.

In the last three plots, black = predicted loss curve of the new LR schedule, blue = original (unoptimized) real loss curve, orange = new LR schedule.

RWKV v1

We propose the RWKV language model, with alternating time-mix and channel-mix layers:

$\begin{align*} \text{Time-mix :} && \text{TM}_{t,c} &&=&&\text{sigmoid}(\text{R}_{t,c}) &&\cdot&& &&\textstyle\sum_{u} &&\textbf{W}_{t,u,c} &&\cdot&& \text{softmax}_t(\text{K}_{u,c}) &&\cdot&& \text{V}_{u,c}\\ \text{Channel-mix :} && \text{CM}_{t,c} &&=&&\text{sigmoid}(\text{R}_{t,c}) &&\cdot&& &&\textstyle\sum_d &&\textbf{W}_{c,d} &&\cdot&& \text{gelu}(\text{K}_{t,d}) &&\cdot&& \text{V}_{t,d} \end{align*}$

The R, K, V are generated by linear transforms of input, and W is parameter. The idea of RWKV is to decompose attention into R(target) * W(src, target) * K(src). So we can call R "receptance", and sigmoid means it's in 0~1 range.
The Time-mix is similar to AFT (https://arxiv.org/abs/2105.14103). There are two differences.

(1) We changed the normalization (denominator). For masked language models, we define:

$\text{softmax}_t(\text{K}_{u,c}) = \frac{\exp(\text{K}_{u,c})}{\sum_{v \leq t}\exp(\text{K}_{v,c})}$

(UPDATE: We are using the original AFT normalization in v2)

Initialize K and R matrices (and the output projection matrix) to ZERO for fast & stable convergence.

(2) We decompose W_{t,u,c} and introduce multi-head W (here h is the corresponding head of c):

$W_{t,u,c}=f_h(t-u)\cdot \alpha_h(u) \cdot \beta_h(t)$

Moreover we multiply the final output of Time-mix layer by γ(t). The reason for the α β γ factors, is because the context size is smaller when t is small, and this can be compensated using the α β γ factors.

(UPDATE: We remove α β γ factors in v2-RNN and restrict W to be of a simple form and hence able to rewrite it as RNN)

The Channel-mix is similar to GeGLU (https://arxiv.org/abs/2002.05202) with an extra R factor. Initialize R and W matrices to ZERO for fast & stable convergence.
Finally, we add extra token-shift (time-shift mixing) as in (https://github.com/BlinkDL/minGPT-tuned).

Token-shift (time-shift mixing)

The token-shift explicitly uses (half the channels of this token) & (half the channels of prev token) to generate all vectors (QKV, RWKV, ...).

self.time_shift = nn.ZeroPad2d((0,0,1,-1))

x = torch.cat([self.time_shift(x[:, :, :C//2]), x[:, :, C//2:]], dim = -1)

Dividing channels by 2 and shift-1 works great for char-level English and char-level Chinese LM.

However for BPE-level English LM, it's only effective if your embedding is large enough (at least 1024 - so the usual small L12-D768 model is not enough).

My theory on the effectiveness of token-shift:

When we train a GPT, the hidden representation of a token has to accomplish two different objects:

Predict the next token. Sometimes this is easy (obvious next token).
Collect all previous context info, so later tokens can use it. This is always hard.

The shifted channels can focus on (2), so we have good propagation of info. It's like some kind of residual connection, or a small RNN inside the transformer.

You can use token-shift in usual QKV self-attention too. I looked at the weights, and found V really likes the shifted channels, less so for Q. Makes sense if you think about it. I also found you may want to use less mixing in higher layers.

p.s. There is a MHA_pro model in this repo with strong performance. Give it a try :)

The Head-QK Trick: learning to copy and avoid tokens

In usual transformer, a small model has difficulty copying tokens (such as person names) in the context. We add extra Q & K to the final output such that the model can directly copy (or avoid) tokens in the context. Afterwards the model will teach itself NER (named entity recognition) if you look at the learned weights.

q = self.head_q(x)[:,:T,:] # projecting to 256-d
k = self.head_k(x)[:,:T,:] # projecting to 256-d
c = (q @ k.transpose(-2, -1)) * (1.0 / 256)
c = c.masked_fill(self.copy_mask[:T,:T] == 0, 0)
c = c @ F.one_hot(idx, num_classes = self.config.vocab_size).float()       
x = self.head(x) + c

Note: when a token occurs multiple times in the context, it might be better to use max(prob) instead of sum(prob).

The top-a sampling method

We also propose a new sampling method called top-a (as in src/utils.py):

(1) Find the max probability p_max after softmax.

(2) Remove all entries whose probability is lower than 0.2 * pow(p_max, 2). So it's adaptive, hence "top-a".

(3) Feel free to tune the 0.2 and 2 factor. Tune 0.2 first.

The idea of top-a:

If max_prob=0.9, then remove all tokens with prob < 0.162 (so, removing all alternatives)
If max_prob=0.5, then remove all tokens with prob < 0.05 (so, allowing more choices)
If max_prob=0.1, then remove all tokens with prob < 0.002 (so, allowing lots of possibilities)

probs = F.softmax(logits, dim=-1)

limit = torch.pow(torch.max(probs), 2) * 0.02
logits[probs < limit] = -float('Inf')

Performance

Character-level loss on simplebooks-92 dataset https://dldata-public.s3.us-east-2.amazonaws.com/simplebooks.zip

Gray: usual MHA+Rotary+GeGLU - performance not as good. 17.2M params.

Red: RWKV ("linear" attention) - VRAM friendly - quite faster when ctx window is long - good performance. 16.6M params.

Green: MHA+Rotary+GeGLU+Token_shift. 17.2M params.

Blue: MHA_pro (MHA with various tweaks & RWKV-type-FFN) - slow - needs more VRAM - good performance. 16.6M params.

@software{peng_bo_2021_5196578,
  author       = {PENG Bo},
  title        = {BlinkDL/RWKV-LM: 0.01},
  month        = aug,
  year         = 2021,
  publisher    = {Zenodo},
  version      = {0.01},
  doi          = {10.5281/zenodo.5196577},
  url          = {https://doi.org/10.5281/zenodo.5196577}
}

Initialization

We use careful initialization for RWKV to get fast convergence - orthogonal matrices with proper scaling, and special time_w curves. Check model.py for details.

Some learned time_w examples:

rwkv-lm's People

Contributors

Stargazers

Watchers

Forkers

trendingtechnology sailfish009 planet-kim learningpro baixiaoxiao07 zylx0532 mahaoyang jinriwufeng makeitt ledudu glacierappstudio zixiliuusc pccai 0invest1 xiefei122 jsyzc2019 umnlp caizongchao wangchichi1999 yutan557 dangf15 ishine wonbingzi boningliang can-cc blank-hudson citydream163 fangego connect-gitee wozhanqiangtou ixlm gungungun kastnerkyle andrei997 credwood dumpmemory klins101 yashkumaratri leondz adambear linlingde yayawawo phimachine shouxia satyajitghana shatz01 multipath xloem ojus1 copperdong techthiyanes trump888 zouyu0008 yuulin ap-wb sanchit-ahuja shaun95 tunsheng qiaolingchen00 1363837242 mcx michelbartels crashmoon axkuhta cat-state zhangzichou bojanfaletic laplacekorea codingchild2424 mannykayy zeta1999 cluster327 jack-kks mtanghu b1sounours marcus-arcadius harrisonvanderbyl 01tibai metavai austin23cook linchenxi15 colinsongf zhufanpo x6888 maybelaterornot tiny111can shunzyt iopav pantyhose-x gnu-linux-libre iarrationality cyberflamego boboq ljclws wangv587 miigogo lianjunw gg-big-org ajunlonglive lemonfan-maker

rwkv-lm's Issues

Any paper about RWKV?

It is awesome and intereseting. I wonder if there is any paper about RWKV?

Thanks.

ChatRWKV triggers segmentation fault when using streaming or split loading on 4-14B

Hi,
This is probably something related to my setup, but I can't work out what's causing this. When attempting to load the v4 14B model with ChatRWKV v2 with the split or stream methods, I get a segmentation fault midway through loading. With cuda fp16i8 *1+ or fp16 *1+, it simply segmentation faults sooner than cuda fp16 *10+ or cuda fp16i8 *10+. Fp32 with split and stream behaves the same way.

I've tried loading the model with fp16i8 all to the GPU to see if it would do the same thing, but with cuda fp16i8 it will load up until it runs out of memory.

Any advice would be appreciated.

Ubuntu 20.04 on WL2
python 3.10
torch 1.13.1 cu117
pip rwkv 0.5
and always latest git pull

7B model works great because I can load it all on a 12gb GPU.

Really impressive project, and that's a total understatement. I'm blown away by the 7b model.

Which model could be trained with one single RTX-4090(24G vram)?

I apologize for asking such a simple question. I am planning to buy a GPU to evaluate RWKV, and I would like to get an RTX-4090. Does anyone know which model would allow me to make full use of a single 4090 while training?

The universe is like an RNN too (because of locality)

I believe RNN is a better candidate for fundamental models, because: (1) It's more friendly for ASICs (no kv cache). (2) It's more friendly for RL. (3) When we write, our brain is more similar to RNN. (4) The universe is like an RNN too (because of locality). Transformers are non-local models.

I don't want to be that guy, but The Universe Is Not Locally Real, and the Physics Nobel Prize Winners Proved It

How to port this for Apple "MPS" for either training or inference ?

Would it be possible to train it or inference this model on a Mac M1 Ultra 128 GB RAM system ?

Question about RWKV formula

In the first formula in README, RWKV is rewritten into recurrent form by letting $W_n=(n-1)w$. Is there a particular reason for using $n-1$ instead of $n$? The latter is more natural, and in From GPT to RWKV (the formulas) the recurrent formula of RWKV also implies the latter. So I believe you probably have tried it but for some reason it is suboptimal.

Any plans on writing a paper?

Amazing project, it would be awesome if all the information is in a research paper, easier to read. Thanks

What does "stream and split" strategy even mean?

The readme.md mentioned a strategy called "stream and split", how does it work? I haven't seen it anywhere outside of this repo and even in this repo.

建议：用大脑分区的概念

建议：用大脑分区的概念
你好，开发者，若能将神经网络像人脑使用区块链设计能行吗?
注：我不懂编程，但如果能为你带来新的思路，那就再好不过了。
将参数分区，例如：1：0语言、1图像、2短期记忆、3音频

v4 model.py vs model_run.py

Hi,
Thanks for this awesome repo!
I'm trying to understand the code and found that in the v4 folder, there's this model.py and model_run.py, which contains GPT and RWKV_GPT respectively which all uses different initialization methods. Could you elaborate on when should which one be used? Thanks in advnace!

Repeated download errors - wrong readme?

Re:
Download RWKV-4 0.1/0.4/1.5/3/7/14B weights: https://huggingface.co/BlinkDL
But how?
I have repeated errors

OSError: BlinkDL/rwkv-4-pile-430m does not appear to have a file named config.json. Checkout 'https://huggingface.co/BlinkDL/rwkv-4-pile-430m/main' for available files.
@6Y3GRwDjtGVo4nAe2 ➜ ~/Downloads/ChatRWKV (main) $ 

requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/BlinkDL/rwkv-4-pile-430m/resolve/main/config.json

whatever model I try via simple:

from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("BlinkDL/rwkv-4-pile-430m" , use_auth=True)
model = AutoModelForCausalLM.from_pretrained("BlinkDL/rwkv-4-pile-430m", use_auth=True)

Even AI gives up:

Here are the steps for creating your own config.json file based on the model specifications and loading it locally:

First, you need to know the model architecture and hyperparameters of the model you want to download. You can try to find this information from the model page or contact the author for details.
Second, you need to create a PretrainedConfig object that matches the model specifications. You can use one of the subclasses of PretrainedConfig from HuggingFace’s library depending on the model type1.
Third, you need to save your PretrainedConfig object as a config.json file using to_json_file method2. You can specify a local path where you want to save this file.
Fourth, you need to load the model using from_pretrained with local_files_only=True and provide the path to your config.json file3. You also need to provide the path to other model files such as pytorch_model.bin.
etc.

This

import os
from copy import deepcopy
from rwkv.model import RWKV

os.environ["RWKV_JIT_ON"] = '1'
os.environ["RWKV_CUDA_ON"] = '0' # if '1' then use CUDA kernel for seq mode (much faster)
from rwkv.model import RWKV                         # everything in /v2/rwkv folder
model = RWKV(model='/fsx/BlinkDL/HF-MODEL/rwkv-4-pile-1b5/RWKV-4-Pile-1B5-20220903-8040', strategy='cuda fp16')

out, state = model.forward([187, 510, 1563, 310, 247], None)   # use 20B_tokenizer.json
print(out.detach().cpu().numpy())                   # get logits
out, state = model.forward([187, 510], None)
out, state = model.forward([1563], state)           # RNN has state (use deepcopy if you want to clone it)
out, state = model.forward([310, 247], state)
print(out.detach().cpu().numpy())                   # same result as above

does not work either:
@6Y3GRwDjtGVo4nAe2 ➜ ~/Downloads/ChatRWKV (main) $ pip install rwkv
ERROR: Could not find a version that satisfies the requirement rwkv (from versions: none)
ERROR: No matching distribution found for rwkv
@6Y3GRwDjtGVo4nAe2 ➜ ~/Downloads/ChatRWKV (main) $ pip install rwkv-rs
Collecting rwkv-rs
Downloading rwkv_rs-0.2.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.3/1.3 MB 6.6 MB/s eta 0:00:00
Installing collected packages: rwkv-rs
Successfully installed rwkv-rs-0.2.3
@6Y3GRwDjtGVo4nAe2 ➜ ~/Downloads/ChatRWKV (main) $ python transf.py
Traceback (most recent call last):
File "/home/codespace/Downloads/ChatRWKV/transf.py", line 3, in
from rwkv.model import RWKV
ModuleNotFoundError: No module named 'rwkv'

etc.

Access/train to use the embeddings

Hi @BlinkDL ! Really interested in your work here. I am looking to test out some of the models for embedding based tasks. What is the best way to access the embeddings? I would be looking to use these for training as well (i.e. contrastive loss using siamese training setup). Any information on this would be greatly appreciated.

4-bit quantization to reduce VRam requirement

Hi,
Is it possible to use something like GPTQ to get a 4-bit quantized version of the latest 7B Instruct model. That's got to be one of the fastest and "smartest" models I have tested in the 7B range, but the VRam required is too much for what I have.
Llama 7B in 4 bit works quite well on smaller VRam cards. Something like that for RWKV would be great.
Also are there any resource on how I could get an embedding from any of the models.
Thanks

Why isn't everyone using RWKV if it's so much better than transformers?

Hi!

The machine learning (ML) community is progressing at a remarkable pace and is embracing new techniques very quickly. Based on my comprehension of this model, it appears to offer a distinct set of advantages relative to transformers, while lacking any real drawbacks. Despite these benefits, it remains unclear why adopting this approach is not more widespread among individuals and organizations in the field.

Why is this the case? I really can't wrap my head around it.

Is there any plan to port this on MacOS with GPU?

This is quite an amazing project, really love this repo.
Since I am using MacOS, is there any plan to support MacOS GPU?

fine tuning a pretrained model with txt file

I tried to fine tuning a 3B model with a text file, encoded in UTF8, using the provided train.py.

It seems that the tokens are calculated with the unique characters. So I got

Building token list...                                                                                        
Data has 330926472 tokens, 7129 vocab size.

and error message

RuntimeError: Error(s) in loading state_dict for RWKV:                                                        
        size mismatch for emb.weight: copying a param with shape torch.Size([50277, 2560]) from checkpoint,   
the shape in current model is torch.Size([7129, 2560]).                                                       
        size mismatch for head.weight: copying a param with shape torch.Size([50277, 2560]) from checkpoint,  
the shape in current model is torch.Size([7129, 2560]).

I think I need to tokenize the text file with the same tokenizer to prepare the pile data before I fed it to the training script. Is there an example?

Sequence to Sequence?

Hey @BlinkDL! Awesome project!

I was wondering if you have performed any Seq-2-Seq experiments with it? Any reason for going with GPT model in the first place as opposed to something like T5 (standard Transformer)?
Any direction on what changes will be required to make a standard encoder-decoder architecture with RWKV?

Also, is there any report on in-context-learning/FSL capability of the latest trained model?

Contact link

Thank you for your contributions and for opensourcing your work.

I'd like to get in contact with you but I can't find an email link on your social. If you're interested, but don't want to share email you can email me at [email protected].

RWKV on TPU?

Is it possible to train RWKV on TPU? If not, what would be needed?

Discord Invite Broken

Its currently not possible (at least for me) to join the discord

How to make RWKV-v4-RNN training data set?

Paper covering additional tokens idea

Hi there. You mention in the readme that you're interested in potentially adding some special tokens/markers to represent stuff like capitalisation. Just wanted to let you know we tried that in the ULMFiT paper, and it worked pretty well. You can read the details here: https://arxiv.org/abs/1801.06146 . We went beyond capitalisation and added some other tokens too.

Anyhoo this is just an FYI in case it's helpful to you.

RuntimeError: exp_vml_cpu not implemented for 'Long'

decay_speed = 0
修改为
decay_speed = 0.0

Possible benchmark leakage in Pile dataset

As I understand, all models in RWKV family were trained on Pile dataset. I'm concerned with possible lack of preprocessing of the dataset.

Pile paper states To avoid leakage of data from downstream evaluations, recent work ... has removed any data in the training set that may overlap with the evaluation metrics. We decided not to perform any such removal.

This means that unfiltered Pile may contain correct solutions for benchmark tasks, which can make comparison with other models unfair -- because RWKV models may be directly trained on the validation data.

Although authors of Pile mention Pile’s own validation and test sets. If RWKV was evaluated on these, maybe there is no problem at all.

was there any filtering of Pile dataset to remove any benchmark data before training RWKV?
what data is used to perform comparison benchmarks with other models?

Please forgive any possible lack of understanding from my side, I'm only a beginner in the ML :)

Meet in the middle type of training

It would be interesting to see if the new paper from Microsoft (https://arxiv.org/pdf/2303.07295.pdf) would have the same positive impact for RWKV. I don't see why not.

Is this something in the pipeline?

Radeon Open Compute support

There are many users with AMD graphics cards that want to train this model in a GPU accelerated manner. Radeon Open Compute is AMD's equivalent to CUDA (the relevant component in Radeon Open Compute is called HIP).

Attempt to HIPify the codebase
Make the HIP port optimised for full performance

Any plans to train a instruction-following version?

Alpaca released their dataset on instruction tuning which is used on many other LLMs, any plan to finetune the RWKV on similar data?
Here are a source of Chinse alpaca:https://github.com/LC1332/Chinese-alpaca-lora

Other languages training plan

Is there a training plan for this project in other languages (e.g. Japanese)?

RLHF for finetuning

Hi, Thank you for your efforts!

have you considered finetuning using methods like rlhf?

Can I use this model commercialy?

llama , alpaca, vicuna are not allowed in commerical

but is this also?

Bare metal cluster

I have 5 old (< 3yo) gaming rigs that are fairly serviceable. Can this be run with a distributed load across multiple machines? I can set up an internal network with a hub and secondary network cards with one interfacing machine.

能出个训练使用教程吗

question about the environment setup

Could you please share the environment setup (especially for training) ? E.g. the version of python / torch / cuda / pytorch_lightning, or other options that you consider important.

Thanks!

Question about the training compute

Great work! I am working on a survey and would be interested to know the total training compute (FLOPs) of RWKV-14B model? What was the training time (GPU-hours) on how many A100s? Also any idea about the GPU utilization rate?

RWKV-4 169m/430m in browser with ORT Web / TF.js / tfjs-tflite?

Hi, really exciting project! I'm wondering if you've published the model conversion script that you used to create the js_models files from the .pth model file? It would be awesome to see how the larger and newer models like RWKV-4 169m/430m perform in the browser! I think the inference speed of RWKV opens up many new possibilities for language models on the web.

The experimental results on the homepage are based on zero-shot learning？

Dear author

I was wondering if I could ask you a question about your experiment results. Specifically, I'm curious to know if your results were based on zero-shot, one-shot, or few-shot conditions.。 experiment results lnk

Implementation details about wkv

Amazing work!
But I'm really confused about the implementation details for wkv cuda kernel (in RWKV-LM\RWKV-v4neo\cuda\wkv_cuda.cu).
How does the implementation match the equations shown in README? Could you please give a more detailed comment about it?
For example, what is the meaning of local variable p, pp, ...
Thanks

Training speed check

Just curious about this number:

Training speed: (new training code) RWKV-4 14B BF16 ctxlen4096 = 114K tokens/s on 8x8 A100 80G

So to clarify, you're ingesting 114 ktokens/s on 64x A100s during training, or ~1.7k/s/GPU?

For comparison, Meta trained Llama 13B on 1T tokens in 135,168 GPU hours (also A100-80GB), which is ~0.0074B tokens/GPU/hour. You're achieving 0.0064B tokens/GPU/hour if my quick math is right, so training performance is quite comparable with Meta's even that they make a point of saying how heavily they optimised their training. (They say they hand-coded their backward functions and did work to parallelise activation computatiopn with GPU communication, although AFAIK they didn't release their training code). That's impressive!

And that's on 4k context size? Traditional transformer dot product attention is O(n^2) on sequence length so going from 2k to 4k should 4x the compute, so I guess Llama would have only achieved 0.00185B tokens/GPU/hour with the same window size used here.

关于调用模型做分类任务

你好作者！我对此工作很感兴趣，因为我现在在用基于transformer的模型做分类任务，transformer或者RNN在分类任务里通常采用最后一个模块的每个通道的最后一个元素作为输出，并通过全连接层映射到几个类别。
请问你觉得RWKV原理类似吗？依旧提取最后一个元素作为输出是否稳妥呢？希望您能给出一些建议，我将很感激！

Please help to shoot the error while loading locally trained model to chat.py

Hi,

I was training the model locally from scratch.

 python train.py --load_model  --wandb  --proj_dir out --data_file ../data/enwik8 --data_type utf-8 --vocab_size 0 --ctx_len 512 --epoch_steps 5000 --epoch_count 500 --epoch_begin 0 --epoch_save 5 --micro_bsz 12 --n_layer 6 --n_embd 512 --pre_ffn 0 --head_qk 0 --lr_init 8e-4 --lr_final 1e-5 --warmup_steps 0 --beta1 0.9 --beta2 0.99 --adam_eps 1e-8 --accelerator gpu --devices 1 --precision tf32 --strategy ddp_find_unused_parameters_false --grad_cp 0

and changed settings in chat.py

args.FLOAT_MODE = "fp32" # fp32 (good for CPU) // fp16 (recommended for GPU) // bf16 (less accurate)
args.vocab_size = 50277
args.head_qk = 0
args.pre_ffn = 0
args.grad_cp = 0
args.my_pos_emb = 0

# args.MODEL_NAME = '/fsx/BlinkDL/HF-MODEL/rwkv-4-pile-14b/RWKV-4-Pile-14B-20230108-5170'
args.MODEL_NAME = './out/rwkv-40'

args.n_layer = 6 # 40
args.n_embd = 512 # 5120
args.ctx_len = 512 # 1024

Getting an error:

(py38) ➜  RWKV-v4neo git:(main) ✗ python chat.py
Loading...

RWKV_HEAD_QK_DIM 0 RWKV_JIT_ON 1

loading... ./out/rwkv-40
emb.weight                               float32    cpu
blocks.0.ln1.weight                      float32    cuda:0
blocks.0.ln1.bias                        float32    cuda:0
blocks.0.ln2.weight                      float32    cuda:0
blocks.0.ln2.bias                        float32    cuda:0
blocks.0.ln0.weight                      float32    cuda:0
blocks.0.ln0.bias                        float32    cuda:0
blocks.0.att.time_decay                  float32    cuda:0
blocks.0.att.time_first                  float32    cuda:0
blocks.0.att.time_mix_k                  float32    cuda:0
blocks.0.att.time_mix_v                  float32    cuda:0
blocks.0.att.time_mix_r                  float32    cuda:0
blocks.0.att.key.weight                  float32    cuda:0
blocks.0.att.value.weight                float32    cuda:0
blocks.0.att.receptance.weight           float32    cuda:0
blocks.0.att.output.weight               float32    cuda:0
blocks.0.ffn.time_mix_k                  float32    cuda:0
blocks.0.ffn.time_mix_r                  float32    cuda:0
blocks.0.ffn.key.weight                  float32    cuda:0
blocks.0.ffn.receptance.weight           float32    cuda:0
blocks.0.ffn.value.weight                float32    cuda:0
..........................................................................................
ln_out.weight                            float32    cuda:0
ln_out.bias                              float32    cuda:0
head.weight                              float32    cuda:0

Run prompt...
Traceback (most recent call last):
  File "chat.py", line 193, in <module>
    out = run_rnn(tokenizer.tokenizer.encode(init_prompt))
  File "chat.py", line 163, in run_rnn
    current_state = model.forward(model_tokens, current_state, preprocess_only = True)
  File "/mnt/d/workspace/RWKV-LM/RWKV-v4neo/src/model_run.py", line 200, in forward
    x = w.emb.weight[ctx[-1]]
IndexError: index 48656 is out of bounds for dimension 0 with size 6064

would please help to take look if I mistake something. Thanks.

Can I train this model with smaller dataset with CPU?

Add a requirements.txt?

Ran into this issue while installing:

python ./train.py --help                                                                                                                     
Traceback (most recent call last):                                                                                                                                                  
  File "/fast/RWKV-LM/RWKV-v4/./train.py", line 10, in <module>                                                                                                                     
    from src.binidx import MMapIndexedDataset                                                                                                                                       
  File "/fast/RWKV-LM/RWKV-v4/src/binidx.py", line 30, in <module>                                                                                                                  
    6: np.float,                                                                                                                                                                    
  File "/fast/RWKV-LM/venv/lib/python3.10/site-packages/numpy/__init__.py", line 305, in __getattr__                                                                                
    raise AttributeError(__former_attrs__[attr])                                                                                                                                    
AttributeError: module 'numpy' has no attribute 'float'.                                                                                                                            
`np.float` was a deprecated alias for the builtin `float`. To avoid this error in existing code, use `float` by itself. Doing this will not modify any behavior and is safe. If you 
specifically wanted the numpy scalar type, use `np.float64` here.                                                                                                                   
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:                                                                
    https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations. Did you mean: 'cfloat'?

Root issue was because np.float was removed https://numpy.org/doc/stable/release/1.24.0-notes.html#expired-deprecations

Anyway, a requirements.txt with versions would be helpful. Unless I missed something?

能出一个简易的ui吗?

你好，开发者，请问能出简易的训练介面吗?

Any results on few-shot settings?

Thanks for your wonderful work!
Do you have any results on the few-shot settings? Do RWKV-LLMs perform the similar emergent ability like GPT-3, e.g. the chain-of-thought?

AttributeError: type object 'Trainer' has no attribute 'add_argparse_args'

Using https://github.com/resloved/RWKV-notebooks/blob/master/RWKV_v4neo_Fine_Tuning.ipynb
which uses https://github.com/BlinkDL/RWKV-LM/tree/main/RWKV-v4neo gives me an error when it comes to the training part

########## work in progress ##########
Traceback (most recent call last):
  File "/content/RWKV-LM/RWKV-v4neo/train.py", line 109, in <module>
    parser = Trainer.add_argparse_args(parser)
AttributeError: type object 'Trainer' has no attribute 'add_argparse_args'

edit: had to downgrade to pytorch-lightning==1.9.0.
On another note, I was under the impression that n_epochs would limit the number of epochs to that number but it just keeps going past it?

萌新能问一下怎么从已经训练了一半的模型接着训练，因为从头开始训练太久了（训练集还是一样的）

HumanEval benchmarks?

Hi, I was wondering whether this model can achieve GPT-4 level performance on the HumanEval benchmark, a proxy for effectiveness at code generation. I'm fine if I have to train or transfer-learn, but I only have a single GPU. Massive transformers require way too much compute. I'd also like your opinion on how well the model might work in tandem with InCoder and LLaMa/Alpaca in a model ensemble. Sort of like Stable Diffusion, which has multiple different models that each specialize in a different task. Thanks!

CUDA compilation error with Ctx Length>2000

Hello,
I am trying out RWKV with audio modality and when I set T_MAX>>1000, it throws this error:

Emitting ninja build file /root/.cache/torch_extensions/py39_cu116/timex/build.ninja...
Building extension module timex...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=timex -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1013\" -isystem /root/anaconda3/envs/surya-env/lib/python3.9/site-packages/torch/include -isystem /root/anaconda3/envs/surya-env/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/surya-env/lib/python3.9/site-packages/torch/include/TH -isystem /root/anaconda3/envs/surya-env/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /root/anaconda3/envs/surya-env/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' --use_fast_math --extra-device-vectorization -DTmax=10000 -DBF=8 -DBB=2 -std=c++14 -c cuda/timex_cuda.cu -o timex_cuda.cuda.o 
FAILED: timex_cuda.cuda.o 
/usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=timex -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1013\" -isystem /root/anaconda3/envs/surya-env/lib/python3.9/site-packages/torch/include -isystem /root/anaconda3/envs/surya-env/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/surya-env/lib/python3.9/site-packages/torch/include/TH -isystem /root/anaconda3/envs/surya-env/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /root/anaconda3/envs/surya-env/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' --use_fast_math --extra-device-vectorization -DTmax=10000 -DBF=8 -DBB=2 -std=c++14 -c cuda/timex_cuda.cu -o timex_cuda.cuda.o 
ptxas error   : Entry function '_Z15kernel_backwardIfEvPKT_S2_S2_PS0_S3_iii' uses too much shared data (0x30d40 bytes, 0xc000 max)
ptxas error   : Entry function '_Z14kernel_forwardIfEvPKT_S2_PS0_S0_iii' uses too much shared data (0x57e40 bytes, 0xc000 max)
ninja: build stopped: subcommand failed.

GPU: A100, VRAM: 42GB, CUDA 11.6

I am okay if the training takes a bit long. But I need this to work.
Don't know any CUDA. Can you suggest some workarounds?

Thanks for the incredible work btw!

RWKV-v4neo train error

try [train](example: train a simple L6-D512 RWKV from scratch on enwik8)

RWKV-LM/RWKV-v4neo$ python train.py --proj_dir "out" --data_file "../../data/enwik8" --data_type "utf-8" --vocab_size 0 --ctx_len 512 --epoch_steps 5000 --epoch_count 500 --epoch_begin 0 --epoch_save 5 --micro_bsz 12 --n_layer 6 --n_embd 512 --pre_ffn 0 --head_qk 0 --lr_init 8e-4 --lr_final 1e-5 --warmup_steps 0 --beta1 0.9 --beta2 0.99 --adam_eps 1e-8 --accelerator gpu --devices 1 --precision bf16 --strategy ddp_find_unused_parameters_false --grad_cp 0

then i got:

| Name | Type | Params

0 | emb | Embedding | 3.1 M
1 | blocks | ModuleList | 20.5 M
2 | ln_out | LayerNorm | 1.0 K
3 | head | Linear | 3.1 M

26.7 M Trainable params
0 Non-trainable params
26.7 M Total params
106.770 Total estimated model params size (MB)
Epoch 0: 0%| | 0/5000 [00:00<?, ?it/s]Traceback (most recent call last):
File "/home/exat500g/RWKV-LM/RWKV-v4neo/train.py", line 340, in
trainer.fit(model, data_loader)
File "/home/exat500g/miniconda3/envs/pytorch113/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 696, in fit
self._call_and_handle_interrupt(
File "/home/exat500g/miniconda3/envs/pytorch113/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 648, in _call_and_handle_interrupt
return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
File "/home/exat500g/miniconda3/envs/pytorch113/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
return function(*args, **kwargs)
File "/home/exat500g/miniconda3/envs/pytorch113/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 735, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/home/exat500g/miniconda3/envs/pytorch113/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1166, in _run
results = self._run_stage()
File "/home/exat500g/miniconda3/envs/pytorch113/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1252, in _run_stage
return self._run_train()
File "/home/exat500g/miniconda3/envs/pytorch113/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1283, in _run_train
self.fit_loop.run()
File "/home/exat500g/miniconda3/envs/pytorch113/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
self.advance(*args, **kwargs)
File "/home/exat500g/miniconda3/envs/pytorch113/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 271, in advance
self._outputs = self.epoch_loop.run(self._data_fetcher)
File "/home/exat500g/miniconda3/envs/pytorch113/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
self.advance(*args, **kwargs)
File "/home/exat500g/miniconda3/envs/pytorch113/lib/python3.10/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 174, in advance
batch = next(data_fetcher)
File "/home/exat500g/miniconda3/envs/pytorch113/lib/python3.10/site-packages/pytorch_lightning/utilities/fetching.py", line 184, in next
return self.fetching_function()
File "/home/exat500g/miniconda3/envs/pytorch113/lib/python3.10/site-packages/pytorch_lightning/utilities/fetching.py", line 263, in fetching_function
self._fetch_next_batch(self.dataloader_iter)
File "/home/exat500g/miniconda3/envs/pytorch113/lib/python3.10/site-packages/pytorch_lightning/utilities/fetching.py", line 277, in _fetch_next_batch
batch = next(iterator)
File "/home/exat500g/miniconda3/envs/pytorch113/lib/python3.10/site-packages/pytorch_lightning/trainer/supporters.py", line 557, in next
return self.request_next_batch(self.loader_iters)
File "/home/exat500g/miniconda3/envs/pytorch113/lib/python3.10/site-packages/pytorch_lightning/trainer/supporters.py", line 569, in request_next_batch
return apply_to_collection(loader_iters, Iterator, next)
File "/home/exat500g/miniconda3/envs/pytorch113/lib/python3.10/site-packages/pytorch_lightning/utilities/apply_func.py", line 99, in apply_to_collection
return function(data, *args, **kwargs)
File "/home/exat500g/miniconda3/envs/pytorch113/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 628, in next
data = self._next_data()
File "/home/exat500g/miniconda3/envs/pytorch113/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1333, in _next_data
return self._process_data(data)
File "/home/exat500g/miniconda3/envs/pytorch113/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1359, in _process_data
data.reraise()
File "/home/exat500g/miniconda3/envs/pytorch113/lib/python3.10/site-packages/torch/_utils.py", line 543, in reraise
raise exception
UnboundLocalError: Caught UnboundLocalError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/home/exat500g/miniconda3/envs/pytorch113/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
data = fetcher.fetch(index)
File "/home/exat500g/miniconda3/envs/pytorch113/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 58, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/exat500g/miniconda3/envs/pytorch113/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 58, in
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/exat500g/RWKV-LM/RWKV-v4neo/src/dataset.py", line 208, in getitem
dix = [self.stoi[s] for s in data[i : i + req_len]]
UnboundLocalError: local variable 'i' referenced before assignment

VRAM performance

Hi @BlinkDL! First off this is amazing and seems very promising for scaling down large Transformers to be more production friendly.

I'm wondering if you have any benchmarks regarding VRAM performance? Specifically I've got 3 questions:

1 - How much VRAM does this model (or rather, the CUDA version) need for training? Are we talking 1060 size (6gb), 3090 size (20gb), or a6000+ size (40+gb)
2 - Same question as 1, but for inference?
3 - Can this run on CPU reasonably?

Recommend Projects

React

A declarative, efficient, and flexible JavaScript library for building user interfaces.
Vue.js

🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
Typescript

TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
TensorFlow

An Open Source Machine Learning Framework for Everyone
Django

The Web framework for perfectionists with deadlines.
Laravel

A PHP framework for web artisans
D3

Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

javascript

JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
web

Some thing interesting about web. New door for the world.
server

A server is a program made to process requests and deliver data to clients.
Machine learning

Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Visualization

Some thing interesting about visualization, use data art
Game

Some thing interesting about game, make everyone happy.

Recommend Org

Facebook

We are working to build community through open source technology. NB: members must have two-factor auth.
Microsoft

Open source projects and samples from Microsoft.
Google

Google ❤️ Open Source for everyone.
Alibaba

Alibaba Open Source for everyone
D3

Data-Driven Documents codes.
Tencent

China tencent open source team.