
This is my submission for my 6th semester project. An SLM is essentially an LLM with fewer than 1 billion parameters (the exact definition varies from person to person). This project was recommended by my teacher to develop my interest in Transformers and language model training!

License: MIT License

Python 100.00%

training-an-slm's Introduction

Training my first Language Model!

This experiment is an adaptation of Andrej Karpathy's nanoGPT to train a small language model. Despite its small size, this model can generate relatively coherent text but struggles with instruction-following and accurate information retrieval. Typically, models with over a billion parameters are needed for consistently usable responses.

Datasets were sourced from Hugging Face. The model has 123.59 million parameters: 12 layers with 12 attention heads per layer, an embedding dimension of 768, and a vocabulary of 50,304 tokens. Bias terms were not used, and initial training had a dropout rate of 0.0, which was raised to 0.1 during fine-tuning. The dataset mix included Open-Orca, Databricks Dolly 15K, and others¹.

Training results

[Figure: Initial vs. favorite model output]

Red is base training and blue is fine-tuning.

Installation

Prerequisites

  • Download the datasets mentioned in the create_first_training_data.py script and place them in Data/Datasets.
  • Install all required libraries via pip, and make sure you are authenticated with the Hugging Face Hub (huggingface-cli login) and Weights & Biases (wandb login).

Steps

  1. Clone the repository.

    git clone https://github.com/DanielSarf/Training-an-SLM.git
  2. Install the necessary dependencies.

    pip install -r requirements.txt

Usage

To train and use the model, execute the following steps:

Prepare your training data

python create_first_training_data.py

This script tokenizes and combines several source datasets (OpenOrca, WizardLM-Orca, and others) into training and validation splits.
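
As a rough illustration of what this step produces (a sketch only, assuming the nanoGPT-style pipeline of GPT-2 BPE token IDs written to a flat binary file; the actual prompt template, dataset columns, and output paths in create_first_training_data.py may differ):

    # Sketch only: nanoGPT-style data prep. The real script's prompt template,
    # dataset columns, and output paths may differ.
    import numpy as np
    import tiktoken
    from datasets import load_dataset

    enc = tiktoken.get_encoding("gpt2")  # 50257 BPE tokens; the model pads the vocab to 50304

    def encode_example(example):
        # Hypothetical chat template; the real script defines its own formatting.
        text = f"<|user|>:\n{example['question']}\n<|assistant|>:\n{example['response']}"
        ids = enc.encode_ordinary(text)
        ids.append(enc.eot_token)  # append <|endoftext|>
        return ids

    # OpenOrca is one of several sources; a small slice keeps the example quick.
    dataset = load_dataset("Open-Orca/OpenOrca", split="train[:1000]")
    tokens = []
    for example in dataset:
        tokens.extend(encode_example(example))

    # Store token IDs as uint16 so the training script can memory-map the file.
    np.array(tokens, dtype=np.uint16).tofile("Data/train.bin")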

Train the model

python train.py

Fine-tune the model

python finetune.py

Run and evaluate the model

python run.py

Example

<|user|>:
Hello, how are you?
<|assistant|>:
I am an AI model here to assist you!<|endoftext|>
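
Under the hood, run.py has to turn a user message into this template before sampling. A minimal illustration of that step (the function name and stop-token handling here are hypothetical, not the repo's actual API):

    # Illustrative only: building a prompt in the chat format shown above.
    import tiktoken

    enc = tiktoken.get_encoding("gpt2")

    def build_prompt(user_message: str) -> str:
        # Generation would continue from here until <|endoftext|> is produced.
        return f"<|user|>:\n{user_message}\n<|assistant|>:\n"

    prompt_ids = enc.encode_ordinary(build_prompt("Hello, how are you?"))
    print(len(prompt_ids), "prompt tokens")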

Note

Ensure you input the correct model import paths in finetune.py and run.py.

Scripts

  • create_first_training_data.py: Builds the initial training and validation data by combining multiple datasets and tokenizing them.
  • train.py: Runs the base training loop with evaluation, learning-rate decay, and logging; also saves checkpoints and manages model state during training (a minimal sketch of the learning-rate schedule follows this list).
  • finetune.py: Continues training a pre-trained checkpoint on additional datasets to improve its performance on specific tasks.
  • run.py: Loads the trained model for interactive querying to evaluate its performance.
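
As referenced above, train.py decays the learning rate over the course of training. A minimal sketch of the usual nanoGPT-style schedule (linear warmup followed by cosine decay; the warmup length and learning-rate bounds here are illustrative and may not match train.py):

    import math

    # Illustrative hyperparameters; the values used in train.py may differ.
    warmup_iters = 2000
    lr_decay_iters = 136000   # on the order of the total iteration count
    max_lr, min_lr = 6e-4, 6e-5

    def get_lr(it: int) -> float:
        if it < warmup_iters:                 # 1) linear warmup
            return max_lr * it / warmup_iters
        if it > lr_decay_iters:               # 2) floor after the decay horizon
            return min_lr
        # 3) cosine decay from max_lr down to min_lr in between
        decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
        coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
        return min_lr + coeff * (max_lr - min_lr)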

Contribute

We welcome contributions!

Feel free to fork the repository and send in your pull requests!

Model Architecture and Training

The model comprises 12 layers with 12 attention heads per layer, for approximately 123.59 million parameters. The embedding dimension is 768 and the vocabulary contains 50,304 tokens. Bias terms were excluded, and initial training used a dropout rate of 0.0; during fine-tuning, dropout was raised to 0.1 to improve generalization and model robustness.
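
These hyperparameters map onto a nanoGPT-style configuration roughly as follows (a sketch assuming the upstream GPTConfig/GPT classes from nanoGPT's model.py; the exact defaults in this repo's training scripts may differ):

    # Sketch of the configuration described above, assuming nanoGPT's model.py.
    from model import GPTConfig, GPT

    config = GPTConfig(
        block_size=512,      # context size used for base training
        vocab_size=50304,    # GPT-2's 50257 tokens padded up for efficiency
        n_layer=12,
        n_head=12,
        n_embd=768,
        dropout=0.0,         # raised to 0.1 for fine-tuning
        bias=False,          # no bias in Linear/LayerNorm layers
    )
    model = GPT(config)
    print(f"{model.get_num_params() / 1e6:.2f}M parameters")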

Dataset and Training Procedure

Initial Training

Initially, the model was trained with a context size of 512 tokens and a batch size of 12 for 136,000 iterations. The datasets were Open-Orca, Databricks Dolly 15K, WizardLM-Orca, and Open-Platypus, totaling roughly 1.4 billion tokens.

Fine-Tuning

After preliminary training, fine-tuning was performed, adjusting the dropout rate and other hyperparameters to refine model performance on specific tasks or datasets. The finetuning data came from oasst1 and oasst2.
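
Conceptually, fine-tuning resumes from the base-training checkpoint with dropout raised to 0.1 and a smaller learning rate. A rough sketch of that pattern (the checkpoint path, dictionary keys, and optimizer settings here follow nanoGPT conventions and may not match finetune.py exactly):

    # Conceptual sketch of resuming from a base checkpoint for fine-tuning.
    import torch
    from model import GPTConfig, GPT  # assuming nanoGPT-style model.py

    checkpoint = torch.load("out/ckpt.pt", map_location="cpu")  # hypothetical path
    model_args = checkpoint["model_args"]
    model_args["dropout"] = 0.1                 # raised from 0.0 for fine-tuning

    model = GPT(GPTConfig(**model_args))
    model.load_state_dict(checkpoint["model"])  # dropout has no weights, so this is safe

    # Fine-tuning typically uses a smaller learning rate than base training.
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5, weight_decay=0.1)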

Results and Performance

While the trained model can generate coherent text to some extent, its performance on following instructions and retrieving accurate information is limited. This aligns with known challenges in training smaller language models, which typically require over a billion parameters to consistently yield usable responses.

The overall aim was to test the limits of a relatively small language model and observe the trade-offs in capability versus the model's size.

Acknowledgements

Special thanks to Andrej Karpathy for nanoGPT, which served as the basis for this project, and to the various dataset providers on Hugging Face who made this research possible.

For any further inquiries or detailed descriptions of script functionalities, please refer to the provided documentation or reach out to me.

Footnotes

  1. Refer to the dataset sources for detailed information. Commented-out links to them can be found in create_first_training_data.py.
