
picocreator / rwkv-lm-lora

This project forked from blealtan/rwkv-lm-lora


RWKV is an RNN with transformer-level LLM performance. It can be trained directly like a GPT (parallelizable), combining the best of RNNs and transformers: great performance, fast inference, low VRAM usage, fast training, "infinite" ctx_len, and free sentence embeddings.

License: Apache License 2.0

rwkv-lm-lora's People

Contributors

blealtan, blinkdl, picocreator, saharnooby, www


Forkers

rfsfreitas

rwkv-lm-lora's Issues

NOTE: Optimizing OOM on a single GPU

The following investigates and follows up on a single GPU facing high memory fragmentation, causing OOM errors despite there clearly being sufficient free memory.

This manifests as either of the following:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 186.00 MiB (GPU 0; 22.13 GiB total capacity; 16.56 GiB already allocated; 164.62 MiB free; 16.78 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

or, with PYTORCH_CUDA_ALLOC_CONF configured to use cudaMallocAsync:

RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
  File "/home/picocreator/rwkv-proj/picocreator-memory-experiment/RWKV-v4wavenet/src/model.py", line 323, in forward
        xr = x * self.time_mix_r + xx * (1 - self.time_mix_r)
        k = self.key(xk)
        k = torch.square(torch.relu(k))
                         ~~~~~~~~~~ <--- HERE
        kv = self.value(k)
        return (torch.sigmoid(self.receptance(xr)) * kv,
RuntimeError: Allocation on device 0 would exceed allowed memory. (out of memory)
Currently allocated     : 16.66 GiB
Requested               : 31.27 MiB
Device limit            : 22.13 GiB
Free (according to CUDA): 4.62 MiB
PyTorch limit (set by user-supplied memory fraction)
                        : 17179869184.00 GiB
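The first error message itself suggests tuning the caching allocator via PYTORCH_CUDA_ALLOC_CONF. A minimal sketch of both options discussed above; the 128 MiB value is an assumed starting point to tune per workload, not a recommendation from this repo:

```python
import os

# Must be set before the first CUDA allocation (i.e. before importing torch
# in most training scripts). max_split_size_mb caps the block size the
# caching allocator is willing to split, which reduces fragmentation.
# 128 is an assumed starting value; tune it against your OOM threshold.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

# Alternative: switch to the async CUDA allocator entirely, which is the
# configuration that produced the second error trace above.
# os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "backend:cudaMallocAsync"
```

Either setting can also be exported in the shell before launching the trainer; the point is that it must take effect before PyTorch initializes CUDA.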

TODO: Validate deepspeed 3 on models > 14B

In theory, with deepspeed 3 we can now train models as large as we want, provided enough GPUs and RAM are thrown at the job.

In practice, I don't have the hardware to validate that, so this issue tracks it.

Set up and install the required dependencies as per:

https://github.com/PicoCreator/RWKV-LM-LoRA/tree/picocreator-dev-infctx#environment-setup

Clone the branch:

git clone --branch picocreator-dev-infctx https://github.com/PicoCreator/RWKV-LM-LoRA.git picocreator-dev-infctx

Run the following notebook:

https://github.com/PicoCreator/RWKV-LM-LoRA/blob/picocreator-dev-infctx/notebook/trainer-validation/large-model-size-validation.ipynb

This notebook only runs 10 steps: enough to prove the setup works, but not a full training run.

If it fails, let me know the error; if it passes, upload the notebook here so I can keep it on record.

Roadmap for getting infi-ctx to feature parity with RWKV main branch

The following is the list of features still missing from the RWKV infctx branch before it can be a full replacement for the main RWKV trainer:

  • support for following data formats (with chunking?)
    • text files
    • numpy files
    • binidx
  • support for model init weight
  • support for model resize weights (init from smaller to bigger model)
  • support for world tokenizer
  • Learning Rate init -> Learning Rate Final support
  • warmup steps?
  • helper script to add new tokens to existing model
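The "Learning Rate init -> Learning Rate Final" item with warmup steps could look roughly like the sketch below. The linear-warmup / exponential-decay shape and all parameter names are assumptions for illustration, not the main trainer's actual schedule:

```python
def lr_at_step(step, warmup_steps, total_steps, lr_init, lr_final):
    """Linear warmup to lr_init, then exponential decay to lr_final.

    Assumed schedule shape for illustration; the actual RWKV trainer
    may use a different curve.
    """
    if warmup_steps > 0 and step < warmup_steps:
        # Ramp linearly from ~0 up to lr_init over the warmup window.
        return lr_init * (step + 1) / warmup_steps
    # Fraction of the post-warmup schedule completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    # Interpolate geometrically between lr_init and lr_final.
    return lr_init * (lr_final / lr_init) ** progress
```

Exponential (geometric) interpolation keeps the decay smooth in log-space, which is the common convention for LR schedules spanning an order of magnitude.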

TODO: World tokenizer speed ups

Right now the HF tokenizer somehow scales seamlessly across all processor threads; the world tokenizer does not.

Maybe it needs to be rewritten in Rust, maybe something else - I don't know.

Either way, the following is the rough scope.

Additional notes

  • All we really need is the encode function
  • If we go with Rust or similar, it must be easy to install via pip/conda, as I do not want to overcomplicate our setup/dependency process.
  • Otherwise, if it stays pure Python, it needs to somehow make use of the many CPU cores most training machines have.
  • My current attempts at multi-process code did not work (see the commented-out code); it might have been a mistake on my part.
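For the pure-Python route, a minimal sharding sketch using the stdlib multiprocessing module. The `encode` function here is a hypothetical stand-in for the world tokenizer's encoder, which is not reproduced from the repo:

```python
from multiprocessing import Pool

def encode(text):
    # Hypothetical stand-in for the world tokenizer's encode(); replace
    # with the real tokenizer call.
    return [ord(c) for c in text]

def encode_parallel(texts, processes=4):
    """Shard a list of documents across worker processes.

    Each worker imports this module independently, so the encode function
    (and any tokenizer state it holds) must be picklable or rebuilt per
    process -- a likely culprit when naive multiprocessing attempts fail.
    """
    with Pool(processes=processes) as pool:
        # chunksize > 1 amortizes inter-process overhead when the corpus
        # is many small documents.
        return pool.map(encode, texts, chunksize=64)
```

Note that on platforms using the "spawn" start method, both functions must live in an importable module (not a notebook cell), which may explain the earlier failed attempts.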
