
Supervised Dynamic Large Language Model Library

Abstract

A framework for training supervised dynamic language models that implements state-of-the-art methods for training large language models.

Framework Structure

To unify training and evaluation and to ease the implementation of new language methods, the framework defines two basic concepts: Model and Model Trainer.

Model. This concept is implemented as the abstract class AModel in supervisedllm.models. The UML diagram of the class can be found below.

(UML diagram of the AModel class)

Every dynamic language method implemented in this framework must be a child class of supervisedllm.models.AModel. To use all features (training, logging, ...) of the supervisedllm framework, new models must be implemented as subclasses of the AModel class. This abstract class defines four abstract methods, used for training and evaluation, that need to be implemented:

  1. forward(): implements the inference of the model on a data point,
  2. train_step(): the procedure executed during one training step,
  3. validate_step(): the procedure executed during one validation step,
  4. loss(): the definition of the loss for each specific model that will be used for training.

An example implementation can be found in supervisedllm.models.baseline_models.SequentialClassifier; a sketch of such a subclass follows.
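
Below is a minimal sketch of such a subclass. The method signatures and the batch layout are illustrative assumptions, not the framework's actual interface; consult the SequentialClassifier source for the real one.

import torch
from torch import nn

from supervisedllm.models import AModel


class MyClassifier(AModel):
    """Illustrative AModel subclass; constructor arguments are hypothetical."""

    def __init__(self, input_dim: int, num_classes: int, **kwargs):
        super().__init__(**kwargs)
        self.head = nn.Linear(input_dim, num_classes)

    def forward(self, batch):
        # Inference of the model on a batch of data points.
        return self.head(batch["features"])

    def train_step(self, batch, optimizer):
        # Procedure executed during one training step.
        optimizer.zero_grad()
        logits = self.forward(batch)
        loss = self.loss(logits, batch["labels"])
        loss.backward()
        optimizer.step()
        return {"loss": loss.item()}

    def validate_step(self, batch):
        # Procedure executed during one validation step (no gradient update).
        with torch.no_grad():
            logits = self.forward(batch)
            loss = self.loss(logits, batch["labels"])
        return {"loss": loss.item()}

    def loss(self, logits, labels):
        # Definition of the model-specific loss used for training.
        return nn.functional.cross_entropy(logits, labels)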

Model Trainer. This class is used for training models implemented in this framework. It handles model training, logging, and bookkeeping. To use this class, a model must be a child class of supervisedllm.models.AModel, which means it must implement the four abstract methods of the parent class (see above), as sketched below.
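
Schematically, the trainer drives the abstract methods like this (an illustration of the concept, not the actual BaseTrainingProcedure code):

def run_training(model, optimizer, train_loader, val_loader, num_epochs):
    # Illustration of the Model Trainer concept, not the framework's code.
    for epoch in range(num_epochs):
        for batch in train_loader:
            stats = model.train_step(batch, optimizer)  # abstract method: one training step
            print("train", epoch, stats)
        for batch in val_loader:
            stats = model.validate_step(batch)  # abstract method: one validation step
            print("val", epoch, stats)
        # The real trainer additionally handles logging backends and checkpointing.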

Installation

In order to set up the necessary environment:

Virtualenv

  1. Install virtualenv and virtualenvwrapper.

  2. Create a virtual environment for the project:

    mkvirtualenv supervised-llm
  3. Install the project in edit mode:

    python setup.py develop

Optional and needed only once after git clone:

  1. install several pre-commit git hooks with:

    pre-commit install
    # You might also want to run `pre-commit autoupdate`

    and check out the configuration under .pre-commit-config.yaml. The -n, --no-verify flag of git commit can be used to deactivate the pre-commit hooks temporarily.

  2. install nbstripout git hooks to remove the output cells of committed notebooks with:

    nbstripout --install --attributes notebooks/.gitattributes

    This is useful to avoid large diffs due to plots in your notebooks. A simple nbstripout --uninstall will revert these changes.

Then take a look into the scripts and notebooks folders.

Project Organization

├── AUTHORS.md              <- List of developers and maintainers.
├── CHANGELOG.md            <- Changelog to keep track of new features and fixes.
├── CONTRIBUTING.md         <- Guidelines for contributing to this project.
├── Dockerfile              <- Build a docker container with `docker build .`.
├── LICENSE.txt             <- License as chosen on the command-line.
├── README.md               <- The top-level README for developers.
├── configs                 <- Directory for configurations of model & application.
├── data
│   ├── external            <- Data from third party sources.
│   ├── interim             <- Intermediate data that has been transformed.
│   ├── processed           <- The final, canonical data sets for modeling.
│   └── raw                 <- The original, immutable data dump.
├── docs                    <- Directory for Sphinx documentation in rst or md.
├── requirements.txt        <- The Python environment file for reproducibility.
├── models                  <- Trained and serialized models, model predictions,
│                              or model summaries.
├── notebooks               <- Jupyter notebooks. Naming convention is a number (for
│                              ordering), the creator's initials and a description,
│                              e.g. `1.0-fw-initial-data-exploration`.
├── pyproject.toml          <- Build configuration. Don't change! Use `pip install -e .`
│                              to install for development or `tox -e build` to build the package.
├── references              <- Data dictionaries, manuals, and all other materials.
├── reports                 <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures             <- Generated plots and figures for reports.
├── scripts                 <- Analysis and production scripts which import the
│                              actual PYTHON_PKG, e.g. train_model.
├── setup.py                <- Use `python setup.py develop` to install for
│                              development or `python setup.py bdist_wheel` to build.
├── src
│   └── supervisedllm       <- Actual Python package where the main functionality goes.
├── tests                   <- Unit tests which can be run with `pytest`.
├── .coveragerc             <- Configuration for coverage reports of unit tests.
├── .isort.cfg              <- Configuration for git hook that sorts imports.
└── .pre-commit-config.yaml <- Configuration of pre-commit git hooks.

Minimal Example

Model Training

To train a model, use the script scripts/train_model.py. Which model to train, its specific parameters, and the data/output directories are provided via config.yaml files. See the section Config Files for details.

Train a model from a config with python scripts/train_model.py --config path/to/config.yaml. We provide some default config files:

  • configs/arxiv/bert.yaml: Train a BERT classifier on the ArXiv dataset
  • ...

Additional arguments to scripts/train_model.py (an example invocation follows this list):

  • --quiet / --verbose / --very-verbose: Set log-level, i.e. number of logging messages shown during training.
  • -d / --debug: Use to disable multiprocessing entirely. Useful for debugging.
  • -nc / --no-cuda: Use to disable GPU-usage entirely. Useful for debugging.
  • --resume-training: Use to resume a previous training.
  • --resume-from: Provide save directory of previous training for resuming.
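
For example, a debug-friendly run on the provided ArXiv config might look like this (the flags are those listed above; the config path is one of the defaults):

python scripts/train_model.py --config configs/arxiv/bert.yaml --verbose --no-cuda --debug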

Terminal Output

(Screenshot: model training terminal output)

Starting Tensorboard

The framework provides TensorBoard logging out of the box. To follow the training progress in TensorBoard, first start it:

tensorboard --logdir results/logging/tensorboard

(Screenshot: TensorBoard logging example)

Config Files

For reproducibility and traceability of experiments, as well as for convenience, we store all model hyperparameters in a YAML config file. All configuration files used to train the models are stored in the configs folder. Each configuration file contains five main parts. The first part of the YAML configuration file is:

name: bert_arxiv
num_runs: 1
num_workers: 0
world_size: 1 #4 num_GPU
distributed: false #true
gpus: !!python/tuple ["0"] # ["0", "1", "2", "3"]
seed: 1
  • name: Holds the name of the experiment. The user can choose any name they find suitable.
  • num_runs: Number of times the experiment will be repeated.
  • num_workers: Number of processes to use for training.
  • world_size: Number of processes taking part in distributed training.
  • distributed: Whether to use distributed training.
  • gpus: Which GPUs to use.
  • seed: Value of the initial random seed.

The second part of the configuration file is the model.

model:
  module: supervisedllm.models.baseline_models
  name: SequentialClassifier
  args:
    backbone_name: bert
    output_layers_dim: !!python/tuple [32, 32]

In this part the user defines which model is used for training, together with its hyperparameters. In the example above, we use the SequentialClassifier model. To do so, we supply module: supervisedllm.models.baseline_models, the Python package where the model lives, and name: SequentialClassifier, the name of the model class. Finally, under the args key we set all hyperparameters needed by the specific model. A sketch of how such a section can be resolved into an object follows.
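
The module/name/args triple can be resolved with a dynamic import. The helper below is a hypothetical sketch of this mechanism, not the framework's actual factory code:

import importlib


def build_from_config(section: dict):
    # Hypothetical helper: instantiate the class named in a config section.
    module = importlib.import_module(section["module"])
    cls = getattr(module, section["name"])
    return cls(**section.get("args", {}))


# Example: build the model from the `model` section shown above.
model_section = {
    "module": "supervisedllm.models.baseline_models",
    "name": "SequentialClassifier",
    "args": {"backbone_name": "bert", "output_layers_dim": (32, 32)},
}
model = build_from_config(model_section)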

The third part of the YAML file is the data loader.

data_loader:
  module: supervisedllm.data.dataloaders
  name: TopicDataLoader
  args:
    root_dir: ./data/preprocessed/arxiv
    is_dynamic: true
    use_covariates: false
    use_tmp_covariates: false
    reward_field: reward # reward_normalized
    transformer_name: bert # bert, albert, roberta
    batch_size: 6 #32 #8
    validation_batch_size: 6
    n_workers: 4
    pin_memory: true

The fourth part is the optimizer that we are going to use during training.

optimizer:
  min_lr_rate: 1e-14 # used for early stopping
  gradient_norm_clipping: 1.0
  module: torch.optim
  name: SGD #Adam
  args:
    lr: 0.001
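
As a rough illustration (the wiring below is an assumption about the framework's internals, not its actual code), the module/name/args triple maps onto constructing a PyTorch optimizer, and gradient_norm_clipping onto clipping the gradient norm before each step:

import torch
from torch import nn

model = nn.Linear(8, 2)
# module: torch.optim, name: SGD, args: {lr: 0.001}
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)

loss = model(torch.randn(4, 8)).sum()
loss.backward()
# gradient_norm_clipping: 1.0
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()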

The last part we have to define is the trainer. In this part we set all the parameters used for training and logging.

trainer:
  module: supervisedllm.trainer
  name: BaseTrainingProcedure
  args:
    bm_metric: accuracy
    save_after_epoch: 1
    eval_test: false
    lr_schedulers: !!python/tuple
      - optimizer: # name of the optimizer
          counter: 1 # anneal lr rate if there is no improvement after n steps
          module: torch.optim.lr_scheduler
          name: StepLR # StepLR or MultiStepLR
          args:
            step_size: 3 # for StepLR
            gamma: 0.2
    schedulers: !!python/tuple
      - module: supervisedllm.utils.param_scheduler
        name: ExponentialScheduler
        label: beta_scheduler
        args:
          max_value: 1.0
          max_steps: 5000
          decay_rate: 0.0025
  epochs: 18 #4 #30 #20
  save_dir: ./results/saved/
  logging:
    logged_train_stats:
      !!python/tuple [
        "loss",
        "accuracy"
      ]
    logged_val_stats:
      !!python/tuple [
        "loss",
        "accuracy"
      ]
    logged_test_stats:
      !!python/tuple [
        "loss",
        "accuracy"
      ]
    tensorboard_dir: ./results/logging/tensorboard/
    logging_dir: ./results/logging/raw/
    formatters:
      verbose: "%(levelname)s %(asctime)s %(module)s %(process)d %(thread)d %(message)s"
      simple: "%(levelname)s %(asctime)s %(message)s"
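
For reference, the StepLR settings above behave as follows in plain PyTorch (the counter-based annealing on improvement is specific to this framework's wrapper; the snippet shows only the underlying decay):

import torch
from torch import nn

model = nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.001)
# step_size: 3, gamma: 0.2 -> multiply the lr by 0.2 every 3 scheduler steps.
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=3, gamma=0.2)
for epoch in range(7):
    # ... one epoch of training ...
    sched.step()
    print(epoch, sched.get_last_lr())  # 1e-3, then 2e-4 after three steps, 4e-5 after six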

The library provides a Trainer object that is responsible for training the models, logging, and creating checkpoints during training.

Evaluate Trained Model

To evaluate a trained model, use the script scripts/evaluate_model.py. At a minimum, we need to provide the path to the trained model, the path to the dataset root directory, and the output directory where the results will be stored. The script creates a new directory whose name is the concatenation of the experiment name and the dataset split on which we want to evaluate the model.

Example code for running evaluation:

python scripts/evaluate_model.py --model_dir path/to/model.pth --split [train|validate|test] --data-root-dir path/to/data_dir --output-dir path/to/output_dir

Additional arguments to scripts/evaluate_model.py (a full example follows this list):

  • --evaluation-custom-name: Custom name for the folder where the evaluation will be stored. If not provided, the folder name will be experiment_name + dataset split.
  • --gpus: GPUs used for evaluation.
  • --num-workers: Number of threads used for the evaluation.
  • --quiet / --verbose / --very-verbose: Set log-level, i.e. number of logging messages shown during evaluation.
  • -d / --debug: Use to disable multiprocessing entirely. Useful for debugging.
  • -nc / --no-cuda: Use to disable GPU-usage entirely. Useful for debugging.
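
For example, evaluating on the test split might look like this (all paths are illustrative):

python scripts/evaluate_model.py --model_dir results/saved/bert_arxiv/model.pth --split test --data-root-dir data/preprocessed/arxiv --output-dir results/evaluation --evaluation-custom-name bert_arxiv_test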


Issues

Check if increasing the number of workers matches different jobs

Check that by changing the line:

https://github.com/cvejoski/supervised-dynamic-llm/blob/1f8c9460c68ff7f4dbe1030095fd04300486ab4c/configs/arxiv/bert.yaml#L3

we effectively get the different tasks distributed. Remember that a different task is created by writing a list as the value of a parameter; e.g., if we change

https://github.com/cvejoski/supervised-dynamic-llm/blob/1f8c9460c68ff7f4dbe1030095fd04300486ab4c/configs/arxiv/bert.yaml#L37

to

lr: [0.001,0.0001,1e-6]

we get 3 experiments for 3 different learning rates. By increasing the number of workers we expect each processor to obtain a different job, but this depends on how the cluster handles parallelism (local or distributed), which in turn depends on a number of things we don't know, so this is just something to check.

BERT Accuracy

Create a plot of the accuracy per epoch/iteration for each of the time steps.

BERT timing

Create either a plot or a table with the amount of time it takes to train the sequential models, depending on batch size, per epoch and per iteration, and put this in the paper.

Include the Frozen BERT representations in the DataLoader

To avoid losing time computing the BERT representations when we use FROZEN pre-trained models, we should recreate the pickle files where we store the BERT representations. This is particularly useful for task:

#5

You just need to include a getter that returns the new vectors, similar to:

https://github.com/cvejoski/supervised-dynamic-llm/blob/f4ee79c63f2351873fd76443451cbf8e80e276cc/src/supervisedllm/data/datasets.py#L154

And then include it in the logic of:

https://github.com/cvejoski/supervised-dynamic-llm/blob/f4ee79c63f2351873fd76443451cbf8e80e276cc/src/supervisedllm/data/datasets.py#L60

Back to the Future

Implement the Back to the Future paper for the Reddit and ArXiv data sets. If results are obtained, we can include this in the upcoming Kostadin paper.
