
cclm's Introduction

CCLM

Composable, Character-Level Models

Why cclm?

The goal of cclm is to make the deep learning model development process modular by providing abstractions for structuring a computational graph.

If we think of the ML lifecycle as producing a usable class Model that consumers can call on input to get output, then comparing the model training process to human-led software development highlights some big differences. For instance, when we retrain models we usually change the whole model at once - imagine a developer telling you that every commit they made touched every line of code in the package. Similarly, using a pretrained model is like using a 'batteries included' framework: you likely end up inheriting a good deal of functionality you don't require, and it may be hard to customize. These differences suggest there are changes that could make deep learning model development easier to manage, particularly as models continue to explode in size.

How does it work?

The way cclm aims to achieve the above is by making the model building process composable. There are many ways to pretrain a model on text, and infinite corpora on which to train, and each application has different needs.

cclm makes it possible to define a base input on which to build many different computational graphs, then combine them. For instance, if there is a standard, published cclm model trained with masked language modeling (MLM) on (wikitext + bookcorpus), you might start with that, but add a second component to that model that uses the same base, but is pretrained to extract entities from wiki-ner. By combining the two pretrained components with a ComposedModel, you get a model with information from both tasks that you can then use as a starting point for your downstream task.

Common model components will be published onto the cclm-shelf to make it simple to mix and match capabilities.

The choice to emphasize character-level rather than arbitrary tokenization schemes is to make the input as generically useful across tasks as possible. Character-level input also makes it simpler to add realistic typos/noise to make models more robust to imperfect inputs.

Basic concepts

The main output of a training job with cclm is a ComposedModel, which consists of a Preprocessor that turns text into a vector[int], a base model that embeds that vector input, and one or more model components that accept the output of the embedder. The ComposedModel concatenates the output from those models together to produce its final output.

The package uses datasets and tokenizers from huggingface for a standard interface and to benefit from their great framework. To fit models and preprocessors, you can also pass a List[str] directly.

To start, you need a Preprocessor.

from cclm.preprocessing import Preprocessor

prep = Preprocessor()  # set max_example_len to specify a maximum input length
prep.fit(dataset) # defines the model's vocabulary (character-level)
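
Here dataset can be a huggingface dataset column or a plain List[str]. A minimal sketch of both; the wikitext corpus is only an illustration, not a required choice:

from datasets import load_dataset

# option 1: a text column from a huggingface dataset
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")["text"]

# option 2: a plain list of strings
dataset = ["cclm makes model building composable", "character-level inputs are flexible"]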

Once you have that, you can create an Embedder, which is the common base on which all the separate models will sit. This is a flexible class primarily responsible for holding a model that embeds a sequence of integers (representing characters) into the space the components expect. For more complicated setups, the Embedder could have a ComposedModel as its model.

from cclm.models import Embedder

embedder = Embedder(prep.max_example_len, prep.n_chars)

The embedder doesn't necessarily need to be fit by itself, as you can fit it while you do your first pretraining task.

Now you're ready to build your first model using a pretraining task (here, masked language modeling):

from cclm.pretraining import MaskedLanguagePretrainer

pretrainer = MaskedLanguagePretrainer(embedder=embedder)
pretrainer.fit(dataset, epochs=10)

The MaskedLanguagePretrainer defines a transformer-based model to do masked language modeling. Calling .fit() will use the Preprocessor to produce masked inputs and try to identify the missing input token(s) using sampled_softmax loss or negative sampling. This is just one example of a pretraining task; others can be found in cclm.pretraining.

Once you've trained one or more models using Pretrainer objects, you can compose them together into one model.

from cclm.models import ComposedModel

composed = ComposedModel(embedder, [pretrainer_a.model, pretrainer_b.model])

You can then use composed.model(x) to embed input

x = prep.string_to_array("cclm is neat", prep.max_example_len)
emb = composed.model(x)   # has shape (1, prep.max_example_len, pretrainer_a_model_shape[-1]+pretrainer_b_model_shape[-1])

... or create a new model with something like

import tensorflow as tf

# pool the output across the character dimension
gmp = tf.keras.layers.GlobalMaxPool1D()
# add a classification head on top
d = tf.keras.layers.Dense(1, activation="sigmoid")
keras_model = tf.keras.Model(composed.model.input, d(gmp(composed.model.output)))
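
The result is an ordinary Keras model, so downstream training is just compile and fit. A minimal sketch, where x_train and y_train are placeholders for your own character-encoded inputs and binary labels, not part of the cclm API:

keras_model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# x_train: array of character-encoded examples (e.g. built with prep.string_to_array)
# y_train: matching array of 0/1 labels
keras_model.fit(x_train, y_train, epochs=3, batch_size=32)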

Shelf

The Shelf class is used to load off-the-shelf components. These are published to a separate repo using git lfs, and are loaded with a specific tag.

import os

from cclm.shelf import Shelf

shelf = Shelf()
identifier = "en_wiki_clm_1"
item_type = "preprocessor"
cache_dir = ".cclm"
shelf.fetch(identifier, item_type, tag="v0.2.1", cache_dir=cache_dir)
prep = Preprocessor(
    load_from=os.path.join(cache_dir, identifier, item_type, "cclm_config.json")
)


cclm's Issues

initial shelf implementation

The Shelf should have a basic method for saving the contents of a dir (tar it up) and pushing it to the cclm-shelf, as well as for fetching and unpacking it.
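
A minimal sketch of the save/tar half using only the standard library; the helper name and archive layout are assumptions, not the final Shelf API:

import tarfile

def save_item(src_dir: str, out_path: str) -> None:
    # tar up the contents of src_dir so the archive can be pushed to the cclm-shelf repo
    with tarfile.open(out_path, "w:gz") as tar:
        tar.add(src_dir, arcname=".")

save_item(".cclm/en_wiki_clm_1/preprocessor", "preprocessor.tar.gz")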

readme stale

fix references to MLMPreprocessor and calling .fit() directly on the pretrainer

add dgl pretraining task

Add an optional dependency on dgl and add a graph pretraining task.

The task can take any form (node/edge classification, edge prediction, deep infomax, ...). The key piece is that the node features should be generated by a Preprocessor on some text, then an Embedder + model component combo should be the first layer before the graph convolutions/attention.

To demonstrate the value of this, also implement an example with a graph from wikidata or similar.

standardize pretrainer api

To avoid confusion between the model and the pretrainer, the Pretrainer should have a method train(dataset) that is standard across pretrainers as much as possible.
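
A rough sketch of what that shared interface could look like; the class and method signatures are illustrative, not the current code:

from abc import ABC, abstractmethod

class Pretrainer(ABC):
    # hypothetical common interface: each pretrainer owns its model,
    # and callers only interact with it through train()
    @abstractmethod
    def train(self, dataset, epochs: int = 1):
        """Fit the underlying model on a huggingface dataset or List[str]."""
        ...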

Support lower versions of python

The current requirement of python^3.8 is because importlib.metadata is used to read the package version, but the same behavior can be replicated in lower python versions with the importlib_metadata backport.
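
The usual compatibility pattern, sketched below, would keep the version lookup working on older pythons:

try:
    from importlib.metadata import version  # python >= 3.8
except ImportError:
    from importlib_metadata import version  # backport package for older pythons

__version__ = version("cclm")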

Add categorical pretrainer

A simple option should exist to pretrain on a classification task (like ag_news). This could share implementation details with a future feature to easily slap a classification head on a ComposedModel

dependencies in setup.py

putting the dependencies in setup.py for now.

also leaving keras in to get the old code working, but we can remove that as a dependency and use tf.keras where we want, then eventually make it more flexible/backend agnostic

put position embedding back on transformer implementation

Since relative position could be learned from the conv layers, the original implementation of transformers (on top of convolutions) ignored the position embedding. But since the embedding is small and potentially useful, it should be added back.
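
One way to add it back, sketched with standard Keras layers; the function and dimensions are illustrative rather than the repo's implementation:

import tensorflow as tf

def add_position_embedding(x):
    # x: (batch, seq_len, d_model) with a fixed seq_len (e.g. max_example_len)
    seq_len, d_model = x.shape[1], x.shape[2]
    positions = tf.range(start=0, limit=seq_len, delta=1)
    pos_emb = tf.keras.layers.Embedding(input_dim=seq_len, output_dim=d_model)(positions)
    # broadcast the learned position embedding across the batch dimension
    return x + pos_emb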

TypeError: Object of type int64 is not JSON serializable

persisting the config of the Embedder object in the base_pretraining example fails.

Likely one of the attributes (n_chars or max_len) is set by referencing an input/output shape (or similar), so it should be found and cast to int.
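
For example, a cast at serialization time is enough; the config dict below is illustrative:

import json
import numpy as np

config = {"n_chars": np.int64(87), "max_len": np.int64(128)}  # illustrative values

# cast numpy integer types to plain int before persisting
json.dumps({k: int(v) if isinstance(v, np.integer) else v for k, v in config.items()})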

Implement freezing for pretrainer and test it for base

The base currently has a freeze_embedder implemented that needs testing. The same kind of functionality should live on the pretrainer classes so that it's easy to freeze a 'tower' before composing many models together.
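
In Keras terms this can be as simple as toggling trainable on the relevant layers; a sketch, reusing the pretrainer names from the README above:

import tensorflow as tf

def freeze(model: tf.keras.Model) -> None:
    # freeze every layer of a pretrained 'tower' before composing
    for layer in model.layers:
        layer.trainable = False

freeze(pretrainer_a.model)
composed = ComposedModel(embedder, [pretrainer_a.model, pretrainer_b.model])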

add distillation pretrainer

add a pretrainer that learns from a teacher on a basic task like masked character modeling; a rough sketch follows the steps below.

  • train bert or similar on a task like autoencoding a str from the [CLS]
  • keep the 'head' of that model around and put it on top of a new embedder + model component
  • train the new student model on the same task
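
A compressed sketch of the student step described above; teacher_head, student_body, x_task, and y_task are placeholders, not cclm names:

import tensorflow as tf

# teacher_head: the frozen 'head' kept from the teacher model
# student_body: a new embedder + model component producing compatible features
teacher_head.trainable = False
student = tf.keras.Model(student_body.input, teacher_head(student_body.output))

# train the student on the same task/targets the teacher was trained on
student.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
student.fit(x_task, y_task, epochs=5)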

Make preprocessor aware of downsampling

Transformer layers are costly with respect to input length, and that is particularly a problem with character-level models. One option is to reduce the sequence length with strided convolutions or strided pooling before the transformer layer(s), then upsample afterward.

To make this pattern more straightforward across multiple components, the Preprocessor can be made aware of the downsample_factor and make sure that inputs are padded so the upsampled output matches the original shape.
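
Concretely, that means rounding each example's length up to a multiple of the downsample factor; a sketch of the arithmetic:

import math

def padded_length(example_len: int, downsample_factor: int) -> int:
    # round up to the nearest multiple of downsample_factor so strided
    # downsampling followed by upsampling recovers the original shape
    return math.ceil(example_len / downsample_factor) * downsample_factor

padded_length(100, 8)  # -> 104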
