
picocreator / rwkv-lm-lora

This project forked from blealtan/rwkv-lm-lora


RWKV is an RNN with transformer-level LLM performance. It can be trained directly like a GPT (parallelizable), combining the best of RNNs and transformers: great performance, fast inference, low VRAM usage, fast training, "infinite" ctx_len, and free sentence embeddings.

License: Apache License 2.0

rwkv-lm-lora's People

Contributors

blealtan, blinkdl, picocreator, saharnooby, www


Forkers

rfsfreitas

rwkv-lm-lora's Issues

NOTE: Optimizing OOM on a single GPU

The following investigates and follows up on a single GPU facing high memory fragmentation, causing OOM errors despite there clearly being sufficient free memory.

This manifests as either of the following:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 186.00 MiB (GPU 0; 22.13 GiB total capacity; 16.56 GiB already allocated; 164.62 MiB free; 16.78 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

or, with PYTORCH_CUDA_ALLOC_CONF configured to use cudaMallocAsync:

RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
  File "/home/picocreator/rwkv-proj/picocreator-memory-experiment/RWKV-v4wavenet/src/model.py", line 323, in forward
        xr = x * self.time_mix_r + xx * (1 - self.time_mix_r)
        k = self.key(xk)
        k = torch.square(torch.relu(k))
                         ~~~~~~~~~~ <--- HERE
        kv = self.value(k)
        return (torch.sigmoid(self.receptance(xr)) * kv,
RuntimeError: Allocation on device 0 would exceed allowed memory. (out of memory)
Currently allocated     : 16.66 GiB
Requested               : 31.27 MiB
Device limit            : 22.13 GiB
Free (according to CUDA): 4.62 MiB
PyTorch limit (set by user-supplied memory fraction)
                        : 17179869184.00 GiB
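The first error message itself suggests tuning the caching allocator via PYTORCH_CUDA_ALLOC_CONF. A minimal sketch of both options discussed above; the 128 MiB value is an assumed starting point to tune per workload, not a recommendation from this repo:

```python
import os

# Must be set before the first CUDA allocation (i.e. before importing torch
# in most training scripts). max_split_size_mb caps the block size the
# caching allocator is willing to split, which reduces fragmentation.
# 128 is an assumed starting value; tune it against your OOM threshold.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

# Alternative: switch to the async CUDA allocator entirely, which is the
# configuration that produced the second error trace above.
# os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "backend:cudaMallocAsync"
```

Either setting can also be exported in the shell before launching the trainer; the point is that it must take effect before PyTorch initializes CUDA.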

TODO: Validate deepspeed 3 on models > 14B

In theory, with deepspeed 3 we can now train models as large as we want, provided enough GPUs and RAM are thrown at the job.

In practice, I don't have the hardware to validate that, so this issue tracks it.

Set up and install the required dependencies as per:

https://github.com/PicoCreator/RWKV-LM-LoRA/tree/picocreator-dev-infctx#environment-setup

Clone the branch:

git clone --branch picocreator-dev-infctx https://github.com/PicoCreator/RWKV-LM-LoRA.git picocreator-dev-infctx

Run the following notebook:

https://github.com/PicoCreator/RWKV-LM-LoRA/blob/picocreator-dev-infctx/notebook/trainer-validation/large-model-size-validation.ipynb

This notebook only runs 10 steps: enough to prove the setup works, but not a full training run.

If it fails, let me know the error; if it passes, upload the notebook here so I can keep it on record.

Roadmap for getting infi-ctx to feature parity with RWKV main branch

The following is the list of features still missing from the RWKV infctx branch before it can be a full replacement for the main RWKV trainer:

  • support for following data formats (with chunking?)
    • text files
    • numpy files
    • binidx
  • support for model init weight
  • support for model resize weights (init from smaller to bigger model)
  • support for world tokenizer
  • Learning Rate init -> Learning Rate Final support
  • warmup steps?
  • helper script to add new tokens to existing model
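The "Learning Rate init -> Learning Rate Final" item with warmup steps could look roughly like the sketch below. The linear-warmup / exponential-decay shape and all parameter names are assumptions for illustration, not the main trainer's actual schedule:

```python
def lr_at_step(step, warmup_steps, total_steps, lr_init, lr_final):
    """Linear warmup to lr_init, then exponential decay to lr_final.

    Assumed schedule shape for illustration; the actual RWKV trainer
    may use a different curve.
    """
    if warmup_steps > 0 and step < warmup_steps:
        # Ramp linearly from ~0 up to lr_init over the warmup window.
        return lr_init * (step + 1) / warmup_steps
    # Fraction of the post-warmup schedule completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    # Interpolate geometrically between lr_init and lr_final.
    return lr_init * (lr_final / lr_init) ** progress
```

Exponential (geometric) interpolation keeps the decay smooth in log-space, which is the common convention for LR schedules spanning an order of magnitude.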

TODO: World tokenizer speed ups

Right now the HF tokenizer somehow scales seamlessly across all processor threads; the world tokenizer does not.

Maybe it needs to be rewritten in Rust, maybe something else - I don't know.

Either way, the following is the rough scope.

Additional notes

  • All we really need is the encode function
  • If we go with Rust or similar, it must be easy to install via pip/conda, as I do not want to overcomplicate our setup/dependency process.
  • Otherwise, if it stays pure Python, it needs to somehow make use of the many CPU cores most training machines have.
  • My current attempts at multi-process code did not work (see the commented-out code); it might have been a mistake on my part.
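For the pure-Python route, a minimal sharding sketch using the stdlib multiprocessing module. The `encode` function here is a hypothetical stand-in for the world tokenizer's encoder, which is not reproduced from the repo:

```python
from multiprocessing import Pool

def encode(text):
    # Hypothetical stand-in for the world tokenizer's encode(); replace
    # with the real tokenizer call.
    return [ord(c) for c in text]

def encode_parallel(texts, processes=4):
    """Shard a list of documents across worker processes.

    Each worker imports this module independently, so the encode function
    (and any tokenizer state it holds) must be picklable or rebuilt per
    process -- a likely culprit when naive multiprocessing attempts fail.
    """
    with Pool(processes=processes) as pool:
        # chunksize > 1 amortizes inter-process overhead when the corpus
        # is many small documents.
        return pool.map(encode, texts, chunksize=64)
```

Note that on platforms using the "spawn" start method, both functions must live in an importable module (not a notebook cell), which may explain the earlier failed attempts.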
