# RoBERTa-tiny-cased

A small case-preserving RoBERTa model pre-trained for production use. You can download the model from Baidu Disk (extraction code: yhmq), Google Drive, or the HuggingFace model hub.

## Model Parameters

| Model        | Layers | Hidden Size | #Heads | #Parameters |
|--------------|--------|-------------|--------|-------------|
| RoBERTa-tiny | 4      | 512         | 8      | 28M         |
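
For reference, the table above roughly corresponds to the following Transformers configuration. This is a minimal sketch, not the shipped config: it assumes BERT-style defaults (a ~30k WordPiece vocabulary, FFN size 4x hidden) for every value the table does not specify.

```python
from transformers import BertConfig, BertForMaskedLM

# Sketch of the architecture in the table above; values not listed there
# (vocab size, FFN size, max positions) are assumed BERT-style defaults.
config = BertConfig(
    num_hidden_layers=4,
    hidden_size=512,
    num_attention_heads=8,
    intermediate_size=2048,  # assumed 4 * hidden_size, the usual ratio
)
model = BertForMaskedLM(config)
print(f"{model.num_parameters():,}")  # roughly 28M with these defaults
```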

## Pre-training Data

We used a 43G corpus consisting of Wikipedia, BookCorpus, and the UMBC WebBase Corpus. Except for BookCorpus, all pre-training data is case-preserving.

| Corpus Name         | Corpus Size | Domain                | #Sentences | #Words |
|---------------------|-------------|-----------------------|------------|--------|
| Wikipedia           | 21G         | wiki                  | ~212M      | ~4B    |
| BookCorpus          | 4.4G        | fiction, story, etc.  | ~74M       | ~1B    |
| UMBC WebBase Corpus | 18G         | web pages             | ~180M      | ~3B    |

## Pre-training Procedure

We pre-trained RoBERTa-tiny with code from Transformers, using the Datasets library for fast and efficient access to on-disk data. During pre-training we followed the RoBERTa setup and used only the MLM loss as the pre-training objective. However, we used WordPiece as the tokenizer, whereas the original RoBERTa used BPE. Input data was organized in FULL-SENTENCES format. The whole pre-training took about 5 days on 8 V100 GPUs for 20 epochs. A minimal sketch of this setup follows the hyperparameter table below.

Here we list some important hyperparameters:

| Initial Learning Rate | Epochs/Steps | Batch Size | Maximum Length |
|-----------------------|--------------|------------|----------------|
| 1e-4                  | 20/~1.8M     | 512        | 256            |
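
The pre-training scripts themselves are not part of this README, so the following is only a sketch of the setup described above (Transformers Trainer, Datasets for data access, an MLM-only objective, case-preserving WordPiece). The corpus file name and the use of bert-base-cased's vocabulary are assumptions for illustration, and FULL-SENTENCES packing is simplified to per-line truncation.

```python
from datasets import load_dataset
from transformers import (
    BertConfig,
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Case-preserving WordPiece tokenizer; the actual vocabulary used for this
# model is an assumption here (bert-base-cased stands in for it).
tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")

# "corpus.txt" is a hypothetical stand-in for the Wikipedia/BookCorpus/UMBC data.
dataset = load_dataset("text", data_files={"train": "corpus.txt"})["train"]

def tokenize(batch):
    # Simplification: FULL-SENTENCES packs sentences up to 256 tokens across
    # document boundaries; here each line is just truncated to 256 tokens.
    return tokenizer(batch["text"], truncation=True, max_length=256)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

model = BertForMaskedLM(
    BertConfig(num_hidden_layers=4, hidden_size=512,
               num_attention_heads=8, intermediate_size=2048)
)

# MLM-only objective: the collator dynamically masks 15% of input tokens.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True,
                                           mlm_probability=0.15)

args = TrainingArguments(
    output_dir="roberta-tiny-cased",
    learning_rate=1e-4,
    per_device_train_batch_size=64,  # 8 GPUs x 64 = the batch size of 512 above
    num_train_epochs=20,
)

Trainer(model=model, args=args, data_collator=collator,
        train_dataset=tokenized).train()
```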

## Results

We fine-tuned our RoBERTa-tiny (cased) model on all tasks from GLUE (task descriptions are listed below) and compared the test-set results with BERT-small, an uncased BERT model with the same architecture released by Google.

|                          | CoLA | SST-2 | MRPC | STS-B | QQP | MNLI | QNLI | RTE | WNLI |
|--------------------------|------|-------|------|-------|-----|------|------|-----|------|
| Task Description | Classification (grammatical acceptability) | Classification (sentiment) | Classification (semantic equivalence) | Classification (semantic similarity) | Classification (semantic consistency) | Classification (NLI) | Classification (NLI) | Classification (NLI) | Classification (NLI) |
| #Sentences for Training  | 8551 | 67349 | 3668 | 5749  | 363870 | 392702 | 104743 | 2490 | 635 |
| Average Input Length     | 11.5 | 14.0  | 54.7 | 29.1  | 31.4 | 40.8 | 50.9 | 68.0 | 37.5 |

| Model | Overall | CoLA | SST-2 | MRPC | STS-B | QQP | MNLI-m | MNLI-mm | QNLI | RTE | WNLI |
|-------|---------|------|-------|------|-------|-----|--------|---------|------|-----|------|
| BERT-small (uncased)        | 71.2 | 27.8 | 89.7 | 83.4/76.2 | 78.8/77.0 | 68.1/87.0 | 77.6 | 77.0 | 86.4 | 61.8 | 62.3 |
| RoBERTa-tiny (cased, ours)  | 74.0 | 35.9 | 89.8 | 86.2/81.8 | 83.8/82.7 | 68.9/88.2 | 77.7 | 77.2 | 85.9 | 66.5 | 65.1 |

For RTE, STS-B, MRPC, and QNLI, we found it helpful to fine-tune starting from the MNLI single-task model rather than from the pre-trained RoBERTa baseline. For each task, we selected the best fine-tuning hyperparameters from the lists below and trained for 4 epochs (a sweep sketch follows this list):

- batch sizes: 8, 16, 32, 64, 128
- learning rates: 3e-4, 1e-4, 5e-5, 3e-5
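
As a concrete illustration of the sweep, here is a hypothetical grid search for RTE with the Transformers Trainer. The MNLI warm-start checkpoint path ("./mnli-finetuned") and the accuracy-based model selection are assumptions for illustration, not our exact scripts.

```python
import itertools
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("haisongzhang/roberta-tiny-cased")
dataset = load_dataset("glue", "rte")

def tokenize(batch):
    return tokenizer(batch["sentence1"], batch["sentence2"], truncation=True)

encoded = dataset.map(tokenize, batched=True)

def accuracy(pred):
    return {"accuracy": (pred.predictions.argmax(-1) == pred.label_ids).mean()}

best = (0.0, None)
for bs, lr in itertools.product([8, 16, 32, 64, 128], [3e-4, 1e-4, 5e-5, 3e-5]):
    # Warm start from the MNLI single-task model ("./mnli-finetuned" is a
    # hypothetical path); its 3-way head is re-initialized for 2 classes.
    model = AutoModelForSequenceClassification.from_pretrained(
        "./mnli-finetuned", num_labels=2, ignore_mismatched_sizes=True
    )
    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir=f"rte-bs{bs}-lr{lr}",
                               per_device_train_batch_size=bs,
                               learning_rate=lr,
                               num_train_epochs=4),
        train_dataset=encoded["train"],
        eval_dataset=encoded["validation"],
        compute_metrics=accuracy,
    )
    trainer.train()
    acc = trainer.evaluate()["eval_accuracy"]
    if acc > best[0]:
        best = (acc, (bs, lr))

print("best validation accuracy, (batch size, learning rate):", best)
```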

## Use Our Model

Our pre-trained model is especially suitable for low-latency applications. Combined with knowledge distillation and task-specific fine-tuning, it can achieve high inference speed while keeping performance close to that of larger models. We recommend HuggingFace Transformers for loading our model for further fine-tuning:

```python
from transformers import AutoTokenizer, AutoModelWithLMHead

tokenizer = AutoTokenizer.from_pretrained("haisongzhang/roberta-tiny-cased")
model = AutoModelWithLMHead.from_pretrained("haisongzhang/roberta-tiny-cased")
```

Note: when loading the tokenizer in Transformers, use BertTokenizer instead of RobertaTokenizer, since this model uses WordPiece rather than BPE.
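
For a quick end-to-end check, here is a small masked-token prediction example following the note above. The sample sentence is arbitrary, and on recent Transformers versions AutoModelWithLMHead can be swapped for AutoModelForMaskedLM.

```python
import torch
from transformers import AutoModelWithLMHead, BertTokenizer

# Per the note above: BertTokenizer, because the model uses WordPiece.
tokenizer = BertTokenizer.from_pretrained("haisongzhang/roberta-tiny-cased")
model = AutoModelWithLMHead.from_pretrained("haisongzhang/roberta-tiny-cased")
model.eval()

text = f"Paris is the capital of {tokenizer.mask_token}."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Decode the model's top prediction at the masked position.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
print(tokenizer.decode(logits[0, mask_pos].argmax(-1)))
```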

## Acknowledgement

This work was done by my intern @raleighhan (撖朝润) during his internship at the NLP Group of Tencent AI Lab.

## References

HuggingFace Transformers: https://github.com/huggingface/transformers

BERT: Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

RoBERTa: Liu Y, Ott M, Goyal N, et al. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.

BERT-small: Turc I, Chang M W, Lee K, et al. Well-read students learn better: On the importance of pre-training compact models. arXiv preprint arXiv:1908.08962, 2019.
